User Guide

Using “Core Features”

The main “core” features of data-describe are exported at the top level of the package, so they can be used without any special import paths:

import data_describe as dd
dd.data_summary(df)

In a Jupyter Notebook, if this call is the last line of a cell, the plot or other default output is displayed automatically.

The “Widget” class

For most functions in data-describe, the output is an instance of a subclass of the data-describe Widget class. This class serves as a container for all inputs, intermediate calculations, the default visualization, and any secondary outputs or plots related to the feature.

For example, consider the clustering feature:

import data_describe as dd
clustering_output = dd.cluster(df) # clustering_output is a "ClusterWidget".

Placing the output variable (a ClusterWidget) on the last line of a cell displays the default output, i.e. the 2D cluster plot:

clustering_output # Shows the cluster plot

Because the default clustering logic searches over multiple values of k for K-means, the cluster search plot is also available:

clustering_output.cluster_search_plot() # Displays the cluster search plot, showing k versus some goodness-of-fit metric for clusters

You can also check the API Reference section of the documentation for a detailed list of what this Widget supports, or inspect the object with the Python built-in dir():

dir(clustering_output) # List the attributes and methods on this "Widget"
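
Because dir() also lists private and dunder names, it can help to filter the result down to public members. The sketch below illustrates this with a hypothetical stand-in class (ExampleWidget and public_members are made up for this example and are not part of data-describe); the same helper should work on any Python object, including a real Widget.

```python
class ExampleWidget:
    """Hypothetical stand-in for a data-describe widget (not the real class)."""

    def __init__(self, data):
        self.data = data      # input kept on the object
        self._cache = None    # private intermediate state

    def show(self):
        """Pretend to render the default visualization."""
        return f"default plot for {len(self.data)} rows"


def public_members(obj):
    """Return the names on obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


widget = ExampleWidget([1, 2, 3])
members = public_members(widget)
print(members)
```

Calling public_members(clustering_output) in the same way would list only the attributes and methods intended for direct use.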

Backends

data-describe makes use of multiple “backends” for computation and visualization. This allows users to change the framework in which a feature operates. The backend used for a function is chosen in the following order of precedence:

1. The parameter compute_backend or viz_backend specified directly in the function call, e.g. dd.cluster(df, compute_backend="pandas")
2. (Compute only) The backend inferred from the input data type
3. The default set in the data-describe configuration options, e.g. dd.options.backends.compute = "pandas"
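
The precedence described above can be sketched in plain Python. This is only an illustration of the selection order, not data-describe's actual implementation: resolve_compute_backend, infer_backend, DEFAULT_COMPUTE_BACKEND, and FakeModinFrame are hypothetical names, and inferring the backend from the type's module name is an assumption made for the example.

```python
from typing import Optional

# Stands in for the configured default, e.g. dd.options.backends.compute.
DEFAULT_COMPUTE_BACKEND = "pandas"


def infer_backend(data) -> Optional[str]:
    """Guess a backend from the top-level package of the data's type (assumed rule)."""
    root = type(data).__module__.split(".")[0]
    return root if root in {"pandas", "modin"} else None


def resolve_compute_backend(data, compute_backend: Optional[str] = None) -> str:
    """Pick a compute backend using the documented order of precedence."""
    # 1. An explicitly passed parameter always wins.
    if compute_backend is not None:
        return compute_backend
    # 2. Otherwise, try to infer the backend from the input data type.
    inferred = infer_backend(data)
    if inferred is not None:
        return inferred
    # 3. Fall back to the configured default.
    return DEFAULT_COMPUTE_BACKEND


class FakeModinFrame:
    """Hypothetical object that reports a modin module, used only for the demo."""
    __module__ = "modin.pandas.dataframe"


explicit = resolve_compute_backend(FakeModinFrame(), compute_backend="pandas")
inferred = resolve_compute_backend(FakeModinFrame())
fallback = resolve_compute_backend([1, 2, 3])
print(explicit, inferred, fallback)
```

For visualization backends, step 2 would be skipped, since inference from the input data applies to compute only.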

compute

pandas: “pandas” is used as the catch-all for computation that runs in memory with little or no parallel processing. Note: The actual implementation may sometimes use NumPy or other operations outside of a pandas DataFrame for performance reasons.

modin: “Scale your pandas workflows by changing one line of code”. data-describe provides wrappers for operating on your Modin dataframes.

visualization

seaborn: “seaborn” is used as the catch-all for matplotlib-based plots. Plots generated by this backend typically use the seaborn API with additional tweaks added using the matplotlib API.

plotly: “The interactive graphing library for Python”. plotly is generally used for interactive visualizations.

Extending data-describe

data-describe uses a plugin architecture for these backends, so implementations for other frameworks can be created (for example, Bokeh for visualization). See the data-describe Developer Guide for more information.
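
As a rough picture of what a plugin architecture for backends can look like, the sketch below uses a generic registry pattern. It does not reflect how data-describe actually registers backends (the Developer Guide is the authoritative source); register_backend, get_backend, and bokeh_scatter are hypothetical names, and the Bokeh function is a placeholder that does not import Bokeh.

```python
from typing import Callable, Dict, Tuple

# Maps (backend kind, backend name) to an implementation. Hypothetical structure.
_REGISTRY: Dict[Tuple[str, str], Callable] = {}


def register_backend(kind: str, name: str):
    """Decorator that records a function as the implementation for a backend."""
    def decorator(func: Callable) -> Callable:
        _REGISTRY[(kind, name)] = func
        return func
    return decorator


def get_backend(kind: str, name: str) -> Callable:
    """Look up a registered implementation, or fail with a clear message."""
    try:
        return _REGISTRY[(kind, name)]
    except KeyError:
        raise ValueError(f"No {kind} backend named {name!r} is registered") from None


@register_backend("viz", "bokeh")
def bokeh_scatter(x, y):
    """Placeholder: a real plugin would build and return a Bokeh figure here."""
    return {"library": "bokeh", "points": list(zip(x, y))}


plot = get_backend("viz", "bokeh")([1, 2], [3, 4])
print(plot)
```

The point of the pattern is that the calling code only needs a backend name; new frameworks can be added by registering another implementation without changing the caller.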

Using other features

data-describe has several other features that are not considered “core” functionality, such as sensitive data detection or text preprocessing. To use these features, import them explicitly from their module, e.g.:

from data_describe.text.text_preprocessing import preprocess_texts

Optional Dependencies

Some features of data-describe may use other Python dependencies considered optional. These optional dependencies (e.g. hdbscan) are not automatically installed with data-describe. To install these optional dependencies, see the Installation page.
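
Before calling a feature that relies on an optional dependency, you can check whether the package is importable. The following is a minimal sketch using only the Python standard library; is_installed and require are hypothetical helpers written for this example, not part of data-describe.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require(module_name: str, feature: str) -> None:
    """Raise a descriptive ImportError if an optional dependency is missing."""
    if not is_installed(module_name):
        raise ImportError(
            f"{feature} needs the optional dependency {module_name!r}; "
            "see the Installation page for how to install it."
        )


# Report whether the optional hdbscan package is available in this environment.
if is_installed("hdbscan"):
    print("hdbscan is available")
else:
    print("hdbscan is not installed")
```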