Cluster Analysis

[1]:
import data_describe as dd

Load dataset from scikit-learn

[2]:
from sklearn.datasets import load_wine
import pandas as pd
df = load_wine(as_frame=True).data
target = load_wine().target  # For supervised clustering

Cluster Defaults

[3]:
c = dd.cluster(df)
[4]:
c.show()
[4]:
<AxesSubplot:title={'center':'kmeans Cluster'}, xlabel='Component 1 (36.0% variance explained)', ylabel='Component 2 (19.0% variance explained)'>
../_images/examples_cluster_analysis_6_1.png

Using Plotly

[5]:
dd.cluster(df, target=target, viz_backend="plotly")
C:\workspace\data-describe\data_describe\compat\_notebook.py:32: JupyterPlotlyWarning:

Are you running in Jupyter Lab? The extension "jupyterlab-plotly" was not found and is required for Plotly visualizations in Jupyter Lab.

None
[5]:
Cluster Widget using kmeans

Show Cluster Search for K-Means

[6]:
cl = dd.cluster(df, target=target)
cl.cluster_search_plot()
C:\Users\David\.conda\envs\test-env\lib\site-packages\seaborn\_decorators.py:43: FutureWarning:

Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

[6]:
<AxesSubplot:title={'center':'Optimal Number of Clusters'}, xlabel='Number of Clusters', ylabel='silhouette score'>
../_images/examples_cluster_analysis_10_2.png

Visualize using t-SNE for Dimensionality Reduction

[7]:
dd.cluster(df, target=target, dim_method="tsne")
<AxesSubplot:title={'center':'kmeans Cluster'}, xlabel='Dimension 1', ylabel='Dimension 2'>
[7]:
Cluster Widget using kmeans
../_images/examples_cluster_analysis_12_2.png

Return Reduced Data with Cluster Labels

[8]:
cl.viz_data.head()
[8]:
x y clusters
0 3.316751 -1.443463 0
1 2.209465 0.333393 0
2 2.516740 -1.031151 0
3 3.757066 -2.756372 0
4 1.008908 -0.869831 0

Return Cluster Labels Only

[9]:
cl.clusters
[9]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

KMeans - Specifying number of clusters

[10]:
dd.cluster(df, n_clusters=4)
<AxesSubplot:title={'center':'kmeans Cluster'}, xlabel='Component 1 (36.0% variance explained)', ylabel='Component 2 (19.0% variance explained)'>
[10]:
Cluster Widget using kmeans
../_images/examples_cluster_analysis_18_2.png

KMeans - Using Davies-Bouldin for finding optimal n_clusters

[11]:
cl = dd.cluster(df, target=target, metric='davies_bouldin_score')
cl
<AxesSubplot:title={'center':'kmeans Cluster'}, xlabel='Component 1 (36.0% variance explained)', ylabel='Component 2 (19.0% variance explained)'>
[11]:
Cluster Widget using kmeans
../_images/examples_cluster_analysis_20_2.png
[12]:
cl.cluster_search_plot()
C:\Users\David\.conda\envs\test-env\lib\site-packages\seaborn\_decorators.py:43: FutureWarning:

Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

[12]:
<AxesSubplot:title={'center':'Optimal Number of Clusters'}, xlabel='Number of Clusters', ylabel='davies bouldin score'>
../_images/examples_cluster_analysis_21_2.png

HDBSCAN

[13]:
dd.cluster(df, method="hdbscan", viz_backend="plotly")
None
[13]:
Cluster Widget using hdbscan