data-describe is a Python toolkit for inspecting, illuminating, and investigating enormous amounts of unknown data with mixed relationships.

With unknown "dark" data, "unclean" data, structured and unstructured data, and data embedded in images and documents, it can be difficult to get a clear understanding of your data environment. data-describe profiles the data and reveals the landscape of all of your data. The toolset gives data scientists a rich set of tools that can be chained together to automate common data analysis tasks. The resulting insights help facilitate conversations among data scientists, engineers, and business analysts, which in turn can support future innovation. data-describe was built by contributors who have led projects like TensorFlow, XGBoost, Kubeflow, and MXNet, and who together have over 40 years of data science experience.

Core Features
Data Summaries

Streamlined data summaries for important statistics
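
As a rough illustration of what such a summary contains, the sketch below computes a few per-column statistics using only the Python standard library. It does not use data-describe's own API; the `summarize` helper and the sample rows are hypothetical.

```python
import statistics


def summarize(rows):
    """Return per-column summary statistics for a list of dict records."""
    # Pivot the row-oriented records into one list of values per column.
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)

    summary = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        info = {
            "count": len(present),
            "missing": len(values) - len(present),
            "unique": len(set(present)),
        }
        # Numeric statistics only apply when every present value is a number.
        is_numeric = present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        )
        if is_numeric:
            info["mean"] = statistics.mean(present)
            info["median"] = statistics.median(present)
            info["min"] = min(present)
            info["max"] = max(present)
        summary[name] = info
    return summary


# Hypothetical sample data with one missing value per column.
rows = [
    {"age": 34, "city": "Austin"},
    {"age": 29, "city": "Boston"},
    {"age": None, "city": "Austin"},
    {"age": 41, "city": None},
]
report = summarize(rows)
for column, stats in report.items():
    print(column, stats)
```

Numeric columns get mean, median, minimum, and maximum; non-numeric columns get only counts, so one call covers mixed data.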

Clustering

Cluster similar items with unsupervised techniques

Correlations

Out-of-the-box correlation matrices with categorical support
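
Pearson correlation does not apply to categorical columns, so a different measure of association is needed for them. One common choice is Cramér's V, which the sketch below implements with the standard library. Whether data-describe uses this particular statistic is not stated here, and the `cramers_v` helper is hypothetical.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V association between two categorical sequences (0 to 1)."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("sequences must not be empty")

    x_counts = Counter(xs)
    y_counts = Counter(ys)
    pair_counts = Counter(zip(xs, ys))

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, x_total in x_counts.items():
        for y, y_total in y_counts.items():
            expected = x_total * y_total / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * k))


# Hypothetical columns: color fully determines size here.
colors = ["red", "red", "blue", "blue"]
sizes = ["S", "S", "L", "L"]
print(cramers_v(colors, sizes))
```

A value near 1 indicates that one column largely determines the other; a value near 0 indicates little association.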

Heatmaps

Quickly visualize data outliers and missing values with heatmaps

Distribution

Univariate analysis with quick distribution plots

Feature Ranking

Baseline feature importance prior to model trials

Scatter Plots

Smart scatter plots using diagnostics

Additional Features
Dimensional Reduction: Visualize high-dimensional data using PCA and t-SNE
Sensitive Data: Identifies and handles sensitive information such as PII
Text and NLP: Tools for common tasks like text pre-processing and topic modeling
Big Data Support: Uses Modin on top of Apache Arrow via Ray, or Dask
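
To make the sensitive-data item above more concrete, here is a minimal sketch of pattern-based PII redaction using the standard `re` module. It is an illustration only and not data-describe's implementation; the `PII_PATTERNS` table and the `redact` helper are hypothetical, and the two patterns cover only simple email addresses and US Social Security number formats.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace each matched pattern with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Hypothetical input containing an email address and an SSN-shaped number.
sample = "Contact jane.doe@example.com, SSN 123-45-6789."
print(redact(sample))
```

Replacing matches with a label rather than deleting them keeps the text readable while showing what kind of value was removed.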