Cluster Analysis

[1]:

import data_describe as dd

Load dataset from scikit-learn

[2]:

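from sklearn.datasets import load_wine
import pandas as pd
wine = load_wine(as_frame=True)  # Load the dataset once and reuse it
df = wine.data  # Feature columns as a pandas DataFrame
target = wine.target.to_numpy()  # True class labels, for supervised clustering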

Cluster Defaults

[3]:

c = dd.cluster(df)

[4]:

c.show()

[4]:

<AxesSubplot:title={'center':'kmeans Cluster'}, xlabel='Component 1 (36.0% variance explained)', ylabel='Component 2 (19.0% variance explained)'>

../_images/examples_cluster_analysis_6_1.png
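
The axis labels in the plot above report how much variance each of the two components explains. As a rough illustration of where such numbers come from, here is a minimal sketch using scikit-learn directly. It assumes the features are standardized before the projection; that preprocessing step and the variable names (`features`, `scaled`, `components`) are assumptions for this example, not a description of what data-describe does internally.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the same wine features used in this notebook.
features = load_wine(as_frame=True).data

# Assumption: features are standardized before the projection.
scaled = StandardScaler().fit_transform(features)

# Project the data onto the first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Share of the total variance captured by each component.
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Component {i}: {ratio:.1%} variance explained")
```

The two columns of `components` play the role of the x and y coordinates in a scatter plot like the one above.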

Using Plotly

[5]:

dd.cluster(df, target=target, viz_backend="plotly")

C:\workspace\data-describe\data_describe\compat\_notebook.py:32: JupyterPlotlyWarning:

Are you running in Jupyter Lab? The extension "jupyterlab-plotly" was not found and is required for Plotly visualizations in Jupyter Lab.

None

[5]:

Cluster Widget using kmeans

Show Cluster Search for K-Means

[6]:

cl = dd.cluster(df, target=target)
cl.cluster_search_plot()

C:\Users\David\.conda\envs\test-env\lib\site-packages\seaborn\_decorators.py:43: FutureWarning:

Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

[6]:

<AxesSubplot:title={'center':'Optimal Number of Clusters'}, xlabel='Number of Clusters', ylabel='silhouette score'>

../_images/examples_cluster_analysis_10_2.png
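
The search plot above scores several candidate cluster counts with the silhouette score, where higher is better. A minimal sketch of that kind of search, written against scikit-learn rather than data-describe, might look like the following. The candidate range `range(2, 11)`, the standardization step, and the `KMeans` settings are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: features are standardized before clustering.
scaled = StandardScaler().fit_transform(load_wine(as_frame=True).data)

# Hypothetical search range; the range used by data-describe is not shown here.
candidate_ks = range(2, 11)

scores = {}
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    # Silhouette score lies in [-1, 1]; higher means better-separated clusters.
    scores[k] = silhouette_score(scaled, labels)

# Pick the cluster count with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, silhouette = {scores[best_k]:.3f}")
```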

Visualize using t-SNE for Dimensionality Reduction

[7]:

dd.cluster(df, target=target, dim_method="tsne")

<AxesSubplot:title={'center':'kmeans Cluster'}, xlabel='Dimension 1', ylabel='Dimension 2'>

[7]:

Cluster Widget using kmeans

../_images/examples_cluster_analysis_12_2.png
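
Unlike PCA components, t-SNE dimensions carry no explained-variance figure, which is why the axes above are labeled only "Dimension 1" and "Dimension 2". A minimal sketch of a two-dimensional t-SNE embedding with scikit-learn follows. The standardization step and `random_state=0` are assumptions for this example, and the resulting coordinates will generally differ from those in the plot above.

```python
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumption: features are standardized before the embedding.
scaled = StandardScaler().fit_transform(load_wine(as_frame=True).data)

# Embed the 13-dimensional features into two dimensions for plotting.
tsne = TSNE(n_components=2, random_state=0)
embedding = tsne.fit_transform(scaled)

print(embedding.shape)
```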

Return Reduced Data with Cluster Labels

[8]:

cl.viz_data.head()

[8]:

	x	y	clusters
0	3.316751	-1.443463	0
1	2.209465	0.333393	0
2	2.516740	-1.031151	0
3	3.757066	-2.756372	0
4	1.008908	-0.869831	0

Return Cluster Labels Only

[9]:

cl.clusters

[9]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])
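
When true labels are available, as `target` is here, a label array like the one above can be compared against them. The sketch below is one way to do that with pandas and scikit-learn. It uses labels from scikit-learn's own `KMeans` as a stand-in for `cl.clusters`, so the counts it prints are not guaranteed to match the array above; the standardization step and `KMeans` settings are likewise assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

wine = load_wine(as_frame=True)

# Assumption: features are standardized before clustering.
scaled = StandardScaler().fit_transform(wine.data)

# Stand-in for cl.clusters: labels from a plain scikit-learn KMeans fit.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

# Contingency table of true classes (rows) against cluster labels (columns).
table = pd.crosstab(wine.target, labels, rownames=["target"], colnames=["cluster"])
print(table)

# Adjusted Rand index: 1.0 means the two labelings agree perfectly.
ari = adjusted_rand_score(wine.target, labels)
print(f"adjusted Rand index = {ari:.3f}")
```

Cluster numbers are arbitrary, so agreement shows up as one dominant cell per row rather than as matching label values.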

KMeans - Specifying number of clusters

[10]:

dd.cluster(df, n_clusters=4)

<AxesSubplot:title={'center':'kmeans Cluster'}, xlabel='Component 1 (36.0% variance explained)', ylabel='Component 2 (19.0% variance explained)'>

[10]:

Cluster Widget using kmeans

../_images/examples_cluster_analysis_18_2.png

KMeans - Using Davies-Bouldin for finding optimal `n_clusters`

[11]:

cl = dd.cluster(df, target=target, metric='davies_bouldin_score')
cl

<AxesSubplot:title={'center':'kmeans Cluster'}, xlabel='Component 1 (36.0% variance explained)', ylabel='Component 2 (19.0% variance explained)'>

[11]:

Cluster Widget using kmeans

../_images/examples_cluster_analysis_20_2.png

[12]:

cl.cluster_search_plot()

C:\Users\David\.conda\envs\test-env\lib\site-packages\seaborn\_decorators.py:43: FutureWarning:

Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

[12]:

<AxesSubplot:title={'center':'Optimal Number of Clusters'}, xlabel='Number of Clusters', ylabel='davies bouldin score'>

../_images/examples_cluster_analysis_21_2.png
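
The Davies-Bouldin score reads in the opposite direction from the silhouette score: lower values indicate better-separated clusters, so the preferred cluster count is the minimum of the curve. A minimal sketch of such a search with scikit-learn is shown below. The candidate range, the standardization step, and the `KMeans` settings are assumptions for illustration, not data-describe's own search settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import StandardScaler

# Assumption: features are standardized before clustering.
scaled = StandardScaler().fit_transform(load_wine(as_frame=True).data)

# Hypothetical search range; the range used by data-describe is not shown here.
candidate_ks = range(2, 11)

db_scores = {}
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    # Davies-Bouldin score is non-negative; lower means better separation.
    db_scores[k] = davies_bouldin_score(scaled, labels)

# Pick the cluster count with the lowest Davies-Bouldin score.
best_k_db = min(db_scores, key=db_scores.get)
print(f"best k = {best_k_db}, Davies-Bouldin = {db_scores[best_k_db]:.3f}")
```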

HDBSCAN

[13]:

dd.cluster(df, method="hdbscan", viz_backend="plotly")

None

[13]:

Cluster Widget using hdbscan
