data_describe.core.clustering

cluster(data, method='kmeans', dim_method='pca', compute_backend=None, viz_backend=None, **kwargs)

Unsupervised determination of clusters.

class data_describe.core.clustering.ClusterWidget(method: str, clusters: List[int] = None, estimator=None, n_clusters=None, search=False, cluster_range=None, **kwargs)

Bases: data_describe._widget.BaseWidget

Container for clustering calculations and visualization.

An instance of this class is returned by the cluster function. The attributes documented below can be read from that instance.

method

{'kmeans', 'hdbscan'} The type of clustering algorithm.

Type: str

clusters

The predicted cluster labels.

Type: List[int]

estimator

The clustering estimator/model.

input_data

The input data.

scaled_data

The data after applying standardization.

viz_data

The data used for the default visualization, i.e. reduced to 2 dimensions.

dim_method

The algorithm used for dimensionality reduction.

Type: str

reductor

The dimensionality reduction estimator.

xlabel

The x-axis label for the cluster plot.

Type: str

ylabel

The y-axis label for the cluster plot.

Type: str

n_clusters

(KMeans) The number of clusters (k) used in the final clustering fit.

Type: int, optional

search

(KMeans) If True, a search was performed for the optimal n_clusters.

Type: bool, optional

cluster_range

(KMeans) The range of clusters searched, as (min_cluster, max_cluster).

Type: Tuple[int, int], optional

metric

(KMeans) The metric used to evaluate the cluster search.

Type: str, optional

scores

(KMeans) The metric scores from the cluster search.

Type: List

show(self, viz_backend=None, **kwargs)

The default display for this output.

Displays the clustered, projected data as a scatter plot, with points colored by their cluster labels.

Parameters

viz_backend – The visualization backend.
**kwargs – Keyword arguments.

Raises

ValueError – The data to visualize is missing or has not been calculated.

Returns

The cluster plot.
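As a rough illustration of what "projected data" means here, the following pure-Python sketch standardizes a small table and projects it onto its first two principal components. It is not data_describe's implementation: the helper names (`standardize`, `top_component`, `pca_2d`) and the sample rows are hypothetical, and power iteration stands in for whatever PCA routine the library's reductor actually uses.

```python
import math


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its population std."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def top_component(cov, n_iter=200):
    """Dominant eigenvector of a symmetric matrix, found by power iteration."""
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec


def pca_2d(rows):
    """Project already-centered rows onto their first two principal components."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    components = []
    for _ in range(2):
        vec = top_component(cov)
        eigval = sum(vec[i] * cov[i][j] * vec[j] for i in range(d) for j in range(d))
        components.append(vec)
        # Deflate: remove the variance explained by this component.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(d)] for i in range(d)]
    return [[sum(r[i] * c[i] for i in range(d)) for c in components] for r in rows]


# Hypothetical sample: three features, the third duplicating the first.
rows = [(1, 2, 1), (2, 1, 2), (3, 4, 3), (4, 3, 4), (5, 5, 5)]
scaled = standardize(rows)   # plays the role of scaled_data
viz = pca_2d(scaled)         # plays the role of viz_data: one (x, y) pair per row
```

Each entry of `viz` is a two-element point that a scatter plot could draw directly. The sign of each component is arbitrary, so the plot may appear mirrored between implementations.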

cluster_search_plot(self, viz_backend=None, **kwargs)

Shows the results of cluster search.

Cluster search attempts to find an optimal n_clusters by maximizing the chosen metric. The result is a line plot showing the score obtained for each value of n_clusters that was tried.

Parameters

viz_backend – The visualization backend.
**kwargs – Additional keyword arguments to pass to the visualization backend.

Raises

ValueError – No cluster search was performed (search is False).

Returns

The cluster search plot.
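The idea behind cluster search can be sketched in plain Python: fit a clustering for each candidate k, score each fit, and keep the k with the highest score. The sketch below is an assumption-laden illustration, not the library's code. The function names are hypothetical, the k-means initialization is a simple deterministic farthest-point rule, the silhouette coefficient is used as one plausible metric, and the search range is treated as inclusive at both ends.

```python
import math


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def cluster_search(points, cluster_range=(2, 6)):
    """Score k-means for every k in the inclusive range and return the best k."""
    lo, hi = cluster_range
    scores = [silhouette(points, kmeans(points, k)) for k in range(lo, hi + 1)]
    best_k = lo + scores.index(max(scores))
    return best_k, scores


# Hypothetical data: three well-separated groups of four points each.
group_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
offsets = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
points = [(cx + dx, cy + dy) for cx, cy in group_centers for dx, dy in offsets]

best_k, scores = cluster_search(points, cluster_range=(2, 6))
print(best_k)  # 3 for this data
```

Plotting `scores` against k = 2 through 6 would give a line plot of the kind this method describes, with its peak at the selected number of clusters.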

data_describe.core.clustering.cluster(data, method='kmeans', dim_method='pca', compute_backend=None, viz_backend=None, **kwargs) → ClusterWidget

Unsupervised determination of clusters.

This feature computes clusters using various algorithms (KMeans, HDBSCAN) and then projects the data onto a two-dimensional plot for visualization.

Parameters

data (DataFrame) – The data.
method (str, optional) – {'kmeans', 'hdbscan'} The clustering method.
dim_method (str, optional) – The method to use for dimensionality reduction.
compute_backend (str, optional) – The compute backend.
viz_backend (str, optional) – The visualization backend.
n_clusters (Optional[int], optional) – (KMeans) The number of clusters.
cluster_range (Tuple[int, int], optional) – (KMeans) A tuple of the minimum and maximum cluster search range. Defaults to (2, 20).
metric (str) – (KMeans) The metric to optimize (from sklearn.metrics).
target – (KMeans) The labels for supervised clustering, as a 1-D array.
**kwargs – Keyword arguments.

Raises

ValueError – The data is not a DataFrame.
ValueError – The clustering method is not implemented.

Returns

ClusterWidget
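A minimal usage sketch, based only on the signature and attributes documented on this page. It assumes that `data` is a pandas DataFrame of numeric columns. The synthetic data, the column names and the helper names `make_rows` and `run_example` are hypothetical. The third-party imports sit inside `run_example` so that the data helper can be used without those packages installed.

```python
def make_rows():
    """Build a small synthetic dataset: three well-separated groups of 2-D points."""
    group_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    offsets = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    return [(cx + dx, cy + dy) for cx, cy in group_centers for dx, dy in offsets]


def run_example():
    """Cluster the synthetic data and return the resulting ClusterWidget."""
    # Assumption: pandas and data_describe are installed in the environment.
    import pandas as pd
    from data_describe.core.clustering import cluster

    df = pd.DataFrame(make_rows(), columns=["x", "y"])  # arbitrary column names

    # Request KMeans with a fixed number of clusters.
    widget = cluster(df, method="kmeans", n_clusters=3)

    print(widget.method)      # the clustering algorithm that was used
    print(widget.n_clusters)  # k used in the final fit
    print(widget.clusters)    # predicted cluster labels
    return widget
```

Calling `run_example().show()` should then render the default scatter plot of the projected data, colored by cluster label.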