data_describe.core.clustering

cluster(data, method='kmeans', dim_method='pca', compute_backend=None, viz_backend=None, **kwargs)

Unsupervised determination of clusters.

class data_describe.core.clustering.ClusterWidget(method: str, clusters: List[int] = None, estimator=None, n_clusters=None, search=False, cluster_range=None, **kwargs)

Bases: data_describe._widget.BaseWidget

Container for clustering calculations and visualization.

An instance of this class is returned by the cluster function. The attributes documented below can be read from that instance.

method

{'kmeans', 'hdbscan'} The type of clustering algorithm.

Type

str

clusters

The predicted cluster labels

Type

List[int]

estimator

The clustering estimator/model

input_data

The input data

scaled_data

The data after applying standardization

viz_data

The data used for the default visualization, i.e. the input reduced to two dimensions.

dim_method

The algorithm used for dimensionality reduction

Type

str

reductor

The dimensionality reduction estimator

xlabel

The x-axis label for the cluster plot

Type

str

ylabel

The y-axis label for the cluster plot

Type

str

n_clusters

(KMeans) The number of clusters (k) used in the final clustering fit.

Type

int, optional

search

(KMeans) If True, a search was performed for optimal n_clusters.

Type

bool, optional

cluster_range

(KMeans) The range of clusters searched as (min_cluster, max_cluster).

Type

Tuple[int, int], optional

metric

(KMeans) The metric used to evaluate the cluster search.

Type

str, optional

scores

(KMeans) The metric scores in cluster search.

Type

List

show(self, viz_backend=None, **kwargs)

The default display for this output.

Displays the clustered, projected data as a scatter plot, with points colored by the cluster labels.

Parameters
  • viz_backend – The visualization backend.

  • **kwargs – Keyword arguments.

Raises

ValueError – Data to visualize is missing / not calculated.

Returns

The cluster plot.

cluster_search_plot(self, viz_backend=None, **kwargs)

Shows the results of cluster search.

Cluster search attempts to find an optimal n_clusters by maximizing a scoring metric. This method draws a line plot of the score obtained for each value of n_clusters that was tried.

Parameters
  • viz_backend – The visualization backend.

  • **kwargs – Additional keyword arguments to pass to the visualization backend.

Raises

ValueError – No cluster search was performed (search is False).

Returns

The plot
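The search idea described above (try each candidate n_clusters, score the result, keep the best) can be sketched with the standard library alone. This is an illustration, not data_describe's implementation: the helper names (kmeans, silhouette, cluster_search), the farthest-first initialization, the use of the silhouette coefficient as the metric, and the inclusive treatment of cluster_range are all assumptions made for the example.

```python
import random
from math import dist


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assign each point to its nearest center.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members (keep it if empty).
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; higher means tighter clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a score of 0
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def cluster_search(points, cluster_range=(2, 6)):
    """Score every k in the (assumed inclusive) range and return the best."""
    low, high = cluster_range
    ks = list(range(low, high + 1))
    scores = [silhouette(points, kmeans(points, k)) for k in ks]
    best_k = ks[scores.index(max(scores))]
    return best_k, ks, scores


# Three well-separated synthetic blobs, 15 points each.
rng = random.Random(42)
blobs = [(0, 0), (10, 0), (0, 10)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blobs for _ in range(15)]

best_k, ks, scores = cluster_search(points)
print(best_k)
for k, s in zip(ks, scores):
    print(k, round(s, 3))
```

Plotting ks against scores as a line chart gives the same kind of picture that cluster_search_plot is described as producing.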

data_describe.core.clustering.cluster(data, method='kmeans', dim_method='pca', compute_backend=None, viz_backend=None, **kwargs) → ClusterWidget

Unsupervised determination of clusters.

This function computes clusters using one of the supported algorithms (KMeans or HDBSCAN) and then projects the data onto two dimensions for plotting.
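To make the standardize-then-project step concrete, here is a minimal standard-library sketch of scaling columns and reducing them to two dimensions with PCA. It is an assumption-laden illustration rather than the library's code: the function names (standardize, top_component, pca_2d) are hypothetical, and power iteration with deflation stands in for whatever PCA estimator the library actually uses.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns: avoid division by zero
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]


def top_component(cov, iters=200):
    """Dominant eigenvector of a symmetric matrix via power iteration."""
    n = len(cov)
    v = [1.0 / (i + 1) for i in range(n)]  # fixed, non-degenerate start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break  # nothing left to extract
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project centered rows onto their first two principal components."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    components = []
    for _ in range(2):
        v = top_component(cov)
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        components.append(v)
        # Deflate: remove the direction just found so the next pass finds the second one.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(x * c for x, c in zip(r, comp)) for comp in components] for r in rows]


# Small deterministic example: columns 0 and 1 are strongly correlated.
data = [[x, 2 * x + (x % 3) * 0.1, (x * 7) % 5] for x in range(20)]
scaled = standardize(data)  # plays the role of scaled_data
viz = pca_2d(scaled)        # plays the role of viz_data: one (x, y) pair per row
print(viz[:3])
```

Each row of viz is a pair of coordinates that could be drawn on a scatter plot and colored by cluster label.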

Parameters
  • data (DataFrame) – The data.

  • method (str, optional) – {'kmeans', 'hdbscan'} The clustering method.

  • dim_method (str, optional) – The method to use for dimensionality reduction.

  • compute_backend (str, optional) – The compute backend.

  • viz_backend (str, optional) – The visualization backend.

  • n_clusters (Optional[int], optional) – (KMeans) The number of clusters.

  • cluster_range (Tuple[int, int], optional) – (KMeans) A tuple of the minimum and maximum cluster search range. Defaults to (2, 20).

  • metric (str) – (KMeans) The metric to optimize (from sklearn.metrics).

  • target – (KMeans) The labels for supervised clustering, as a 1-D array.

  • **kwargs – Keyword arguments.

Raises
  • ValueError – Data frame required

  • ValueError – Clustering method not implemented

Returns

ClusterWidget
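A minimal usage sketch, based only on the signature and attributes documented on this page. It assumes that data_describe and pandas are installed, that DataFrame here means a pandas DataFrame, and that the function can be imported from the module path shown above. The cluster_demo wrapper and the sample data are hypothetical.

```python
def cluster_demo():
    """Cluster a tiny two-column frame with KMeans and return the default plot."""
    # Imports live inside the function so that defining it does not require
    # the packages to be installed.
    import pandas as pd
    from data_describe.core.clustering import cluster

    df = pd.DataFrame(
        {
            "x": [0.1, 0.2, 0.0, 9.8, 10.1, 10.0],
            "y": [0.0, 0.1, 0.2, 10.2, 9.9, 10.0],
        }
    )

    # n_clusters is documented as a KMeans-only keyword argument.
    widget = cluster(df, method="kmeans", n_clusters=2)

    print(widget.method)      # the clustering algorithm that was used
    print(widget.n_clusters)  # k used in the final fit
    print(widget.clusters)    # predicted cluster label for each row

    # show() is documented to return the cluster scatter plot.
    return widget.show()
```

Calling cluster_demo() in an environment with both packages installed should return the scatter plot described under show.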