Cluster Analysis

[1]:

import data_describe as dd

Load Dataset from scikit-learn

[2]:

from sklearn.datasets import load_wine
import pandas as pd
df = load_wine(as_frame=True).data
target = load_wine().target  # Class labels, passed as `target` in later cells

Cluster Defaults

[3]:

c = dd.cluster(df)

[5]:

c.show()

[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x1f82f8ff988>

../_images/examples_cluster_analysis_6_1.png

Using Plotly

[6]:

dd.cluster(df, target=target, viz_backend="plotly")

None

[6]:

<data_describe.core.clusters.KmeansClusterWidget at 0x1f96000eb88>

Show Cluster Search for K-Means

[7]:

cl = dd.cluster(df, target=target)
cl.cluster_search_plot()

[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x1f960c7fcc8>

../_images/examples_cluster_analysis_10_1.png
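
For readers who want to see what a cluster search involves, the following is a minimal sketch built directly on scikit-learn. It is not data_describe's implementation: the standardization step, the candidate range of k, and the choice of silhouette score as the metric are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: standardize features so no single column dominates distances.
X = StandardScaler().fit_transform(load_wine(as_frame=True).data)

# Fit K-Means for each candidate k (range chosen for illustration only)
# and record the silhouette score of the resulting labels.
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score indicates better-separated clusters.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Plotting `scores` against k gives a curve comparable in spirit to the search plot above, although the exact values may differ from what data_describe reports.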

Visualize Using t-SNE for Dimensionality Reduction

[8]:

dd.cluster(df, target=target, dim_method="tsne")

<matplotlib.axes._subplots.AxesSubplot at 0x1f960d24248>

[8]:

<data_describe.core.clusters.KmeansClusterWidget at 0x1f960cf5988>

../_images/examples_cluster_analysis_12_2.png
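
As a rough illustration of the reduction step, the sketch below projects the wine features to two dimensions with scikit-learn's `TSNE`. The standardization and the fixed `random_state` are assumptions; how data_describe configures t-SNE internally is not shown in this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumption: standardize features before computing the embedding.
X = StandardScaler().fit_transform(load_wine(as_frame=True).data)

# Reduce the 13 wine features to a 2-D embedding suitable for a scatter plot.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)
```

t-SNE coordinates are only meaningful relative to one another, so the axes of the resulting plot carry no direct interpretation.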

Return Reduced Data with Cluster Labels

[9]:

cl.viz_data.head()

[9]:

	x	y	clusters
0	3.316751	-1.443463	0
1	2.209465	0.333393	0
2	2.516740	-1.031151	0
3	3.757066	-2.756372	0
4	1.008908	-0.869831	0
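
A table with the same layout can be assembled by hand, which may help clarify what the columns hold. The sketch below assumes standardized features, a two-component PCA projection for `x` and `y`, and K-Means with three clusters. These choices are illustrative assumptions and are not confirmed as data_describe's defaults.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: standardize features before clustering and projection.
X = StandardScaler().fit_transform(load_wine(as_frame=True).data)

# Cluster in the full feature space, then project to 2-D for plotting.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
xy = PCA(n_components=2).fit_transform(X)

# Combine the 2-D coordinates and cluster labels into one frame.
viz = pd.DataFrame(xy, columns=["x", "y"])
viz["clusters"] = labels
print(viz.head())
```

Cluster label numbering is arbitrary, so the integers in `clusters` need not match those shown above even when the groupings agree.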

Return Cluster Labels Only

[10]:

cl.clusters

[10]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

K-Means: Specifying the Number of Clusters

[11]:

dd.cluster(df, n_clusters=4)

<matplotlib.axes._subplots.AxesSubplot at 0x1f960dc7588>

[11]:

<data_describe.core.clusters.KmeansClusterWidget at 0x1f960dc2c48>

../_images/examples_cluster_analysis_18_2.png

K-Means: Using Davies-Bouldin to Find the Optimal `n_clusters`

[12]:

cl = dd.cluster(df, target=target, metric='davies_bouldin_score')
cl

<matplotlib.axes._subplots.AxesSubplot at 0x1f9610e1a88>

[12]:

<data_describe.core.clusters.KmeansClusterWidget at 0x1f960d9aa88>

../_images/examples_cluster_analysis_20_2.png

[13]:

cl.cluster_search_plot()

[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x1f96117cd48>

../_images/examples_cluster_analysis_21_1.png
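
The Davies-Bouldin index differs from many clustering metrics in that lower values indicate better clusterings. The sketch below shows one way such a search could be done directly with scikit-learn's `davies_bouldin_score`. The standardization and the candidate range of k are assumptions, and this is not necessarily how data_describe performs its search.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import StandardScaler

# Assumption: standardize features so distances are comparable across columns.
X = StandardScaler().fit_transform(load_wine(as_frame=True).data)

# Score each candidate k (range chosen for illustration only).
db_scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db_scores[k] = davies_bouldin_score(X, labels)

# Lower Davies-Bouldin values mean more compact, better-separated clusters,
# so the preferred k is the one with the minimum score.
best_k_db = min(db_scores, key=db_scores.get)
print(best_k_db, round(db_scores[best_k_db], 3))
```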

HDBSCAN

[14]:

dd.cluster(df, method="hdbscan", viz_backend="plotly")

None

[14]:

<data_describe.core.clusters.HDBSCANClusterWidget at 0x1f9611fe448>
