data_describe.core.clustering
Unsupervised determination of clusters.
class data_describe.core.clustering.ClusterWidget(method: str, clusters: List[int] = None, estimator=None, n_clusters=None, search=False, cluster_range=None, **kwargs)
Bases: data_describe._widget.BaseWidget
Container for clustering calculations and visualization.
An instance of this class is returned from the cluster function. The attributes documented below can be read directly from the returned object.
method
    {'kmeans', 'hdbscan'} The type of clustering algorithm.
    Type: str

clusters
    The predicted cluster labels.
    Type: List[int]

estimator
    The clustering estimator/model.

input_data
    The input data.

scaled_data
    The data after applying standardization.

viz_data
    The data used for the default visualization, i.e. reduced to two dimensions.

dim_method
    The algorithm used for dimensionality reduction.
    Type: str

reductor
    The dimensionality reduction estimator.

xlabel
    The x-axis label for the cluster plot.
    Type: str

ylabel
    The y-axis label for the cluster plot.
    Type: str

n_clusters
    (KMeans) The number of clusters (k) used in the final clustering fit.
    Type: int, optional

search
    (KMeans) If True, a search was performed for the optimal n_clusters.
    Type: bool, optional

cluster_range
    (KMeans) The range of clusters searched, as (min_cluster, max_cluster).
    Type: Tuple[int, int], optional

metric
    (KMeans) The metric used to evaluate the cluster search.
    Type: str, optional

scores
    (KMeans) The metric scores from the cluster search.
    Type: List
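
As a small illustration of reading these attributes, the sketch below counts how many rows fall into each cluster. It relies only on the documented attribute names (method, clusters, n_clusters). The helper cluster_sizes and the stand-in object are hypothetical: a SimpleNamespace is used in place of a real ClusterWidget so the snippet runs without data_describe installed.

```python
from collections import Counter
from types import SimpleNamespace


def cluster_sizes(widget):
    """Count how many rows were assigned to each cluster label.

    `widget` is any object exposing a `clusters` attribute (a list of
    integer labels), as documented for ClusterWidget.
    """
    return dict(Counter(widget.clusters))


# Stand-in for a real ClusterWidget with made-up values; only the
# documented attribute names are used here.
widget = SimpleNamespace(
    method="kmeans",
    clusters=[0, 0, 1, 2, 2, 2],
    n_clusters=3,
)

sizes = cluster_sizes(widget)
print(widget.method, widget.n_clusters, sizes)
```

The same helper should work on the object returned by cluster, provided its clusters attribute holds one integer label per input row as described above.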

show(self, viz_backend=None, **kwargs)
    The default display for this output.
    Displays the clustered, projected data as a scatter plot, with points colored by the cluster labels.
    Parameters:
        viz_backend – The visualization backend.
        **kwargs – Keyword arguments.
    Raises:
        ValueError – If the data to visualize is missing or has not been calculated.
    Returns:
        The cluster plot.

cluster_search_plot(self, viz_backend=None, **kwargs)
    Shows the results of cluster search.
    Cluster search attempts to find an optimal n_clusters by maximizing a scoring criterion. The result is a line plot of the score for each n_clusters value that was tried.
    Parameters:
        viz_backend – The visualization backend.
        **kwargs – Additional keyword arguments to pass to the visualization backend.
    Raises:
        ValueError – If no cluster search was performed (search is False).
    Returns:
        The plot.
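
The idea behind cluster search can be sketched with scikit-learn directly. This is not data_describe's implementation: the function name search_n_clusters, the choice of silhouette score as the metric, and treating the upper bound of cluster_range as inclusive are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def search_n_clusters(X, cluster_range=(2, 20), random_state=0):
    """Fit KMeans for each k in the range and score each fit.

    Returns (best_k, ks, scores), where best_k maximizes the score.
    """
    min_k, max_k = cluster_range
    ks = list(range(min_k, max_k + 1))  # assumption: upper bound is inclusive
    scores = []
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # Assumption: silhouette score as the criterion (higher is better).
        scores.append(silhouette_score(X, labels))
    best_k = ks[int(np.argmax(scores))]
    return best_k, ks, scores


# Synthetic data: three tight, well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])

best_k, ks, scores = search_n_clusters(X, cluster_range=(2, 6))
print(best_k, [round(s, 3) for s in scores])
```

Plotting ks against scores as a line chart gives the kind of view that cluster_search_plot describes, with the peak marking the selected number of clusters.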

data_describe.core.clustering.cluster(data, method='kmeans', dim_method='pca', compute_backend=None, viz_backend=None, **kwargs) → ClusterWidget
    Unsupervised determination of clusters.
    This feature computes clusters using one of the supported algorithms (KMeans or HDBSCAN) and then projects the data onto two dimensions for visualization.
    Parameters:
        data (DataFrame) – The data.
        method (str, optional) – {'kmeans', 'hdbscan'} The clustering method.
        dim_method (str, optional) – The method to use for dimensionality reduction.
        compute_backend (str, optional) – The compute backend.
        viz_backend (str, optional) – The visualization backend.
        n_clusters (Optional[int], optional) – (KMeans) The number of clusters.
        cluster_range (Tuple[int, int], optional) – (KMeans) A tuple of the minimum and maximum cluster search range. Defaults to (2, 20).
        metric (str) – (KMeans) The metric to optimize (from sklearn.metrics).
        target – (KMeans) The labels for supervised clustering, as a 1-D array.
        **kwargs – Keyword arguments.
    Raises:
        ValueError – If data is not a data frame.
        ValueError – If the clustering method is not implemented.
    Returns:
        ClusterWidget
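
The overall flow described for cluster (standardize, cluster, then reduce to two dimensions for plotting) can be approximated with scikit-learn as follows. This is a minimal sketch, not the library's code: the variable names mirror the documented ClusterWidget attributes for readability, while the synthetic data and the decision to run PCA on the scaled data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic input: three groups of 30 points in four dimensions.
rng = np.random.default_rng(0)
centers = np.array(
    [[0, 0, 0, 0], [8, 8, 0, 0], [0, 8, 8, 8]], dtype=float
)
input_data = np.vstack([c + rng.standard_normal((30, 4)) for c in centers])

# Standardize each feature to zero mean and unit variance.
scaled_data = StandardScaler().fit_transform(input_data)

# Cluster the standardized data (method='kmeans' with a fixed n_clusters).
estimator = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = estimator.fit_predict(scaled_data)

# Project to two dimensions for a scatter plot (dim_method='pca').
# Assumption: the projection is computed from the scaled data.
reductor = PCA(n_components=2)
viz_data = reductor.fit_transform(scaled_data)

print(viz_data.shape, sorted(set(clusters.tolist())))
```

A scatter plot of the two columns of viz_data, colored by clusters, corresponds to the default display described for show.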