data-describe

data-describe is a Python toolkit for inspecting, illuminating, and investigating enormous amounts of unknown data with mixed relationships.

With unknown "dark" data, "unclean" data, structured and unstructured data, and data embedded in images and documents, it can be difficult to get a clear understanding of your data environment. data-describe profiles the data and reveals the true landscape of all of your data. The toolset gives data scientists a rich set of tools chained together to automate common data analysis tasks. The resulting insights help facilitate conversations among data scientists, engineers, and business analysts, and ultimately support future innovation. data-describe was built by contributors who have led projects like TensorFlow, XGBoost, Kubeflow, and MXNet, and who have more than 40 years of combined data science experience.

Data Summaries

Streamlined data summaries for important statistics
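
As a rough illustration of the kind of statistics a data summary reports, here is a minimal sketch using only the Python standard library. It does not use the data-describe API; the `summarize_column` helper and the sample `ages` column are hypothetical names introduced for this example, and `None` is assumed to mark a missing value.

```python
import statistics


def summarize_column(values):
    """Summarize one numeric column; None is treated as a missing value.

    Hypothetical helper for illustration only, not part of data-describe.
    """
    present = [v for v in values if v is not None]
    summary = {"count": len(present), "missing": len(values) - len(present)}
    if present:
        summary["mean"] = statistics.mean(present)
        summary["median"] = statistics.median(present)
        # Sample standard deviation needs at least two observations.
        summary["stdev"] = statistics.stdev(present) if len(present) > 1 else 0.0
        summary["min"] = min(present)
        summary["max"] = max(present)
    return summary


# Hypothetical sample column with one missing entry.
ages = [34, 29, None, 41, 29, 52]
summary = summarize_column(ages)
print(summary)
```

A real summary would typically run this per column over a whole table and add type-specific statistics, such as cardinality for categorical columns.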

Clustering

Cluster similar items with unsupervised techniques

Correlations

Out-of-the-box correlation matrices with categorical support
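
To make "categorical support" concrete, the sketch below computes Cramér's V, one common measure of association between two categorical columns. This is an assumption chosen for illustration: the text above does not say which measure data-describe uses, and the `cramers_v` function and the sample data are hypothetical.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two equal-length categorical sequences (0 = none, 1 = perfect).

    Hypothetical helper for illustration only, not part of data-describe.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed count for each category pair
    x_counts = Counter(xs)         # marginal counts for the first column
    y_counts = Counter(ys)         # marginal counts for the second column

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        # A column with a single category carries no association information.
        return 0.0
    return math.sqrt(chi2 / (n * k))


# Hypothetical sample data: color fully determines size, but not shape.
color = ["red", "red", "blue", "blue"]
size = ["small", "small", "large", "large"]
shape = ["round", "square", "round", "square"]

v_dependent = cramers_v(color, size)
v_independent = cramers_v(color, shape)
print(v_dependent, v_independent)
```

Because the measure is symmetric and bounded between 0 and 1, it can sit alongside numeric correlation coefficients in a single matrix.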

Heatmaps

Quickly visualize data outliers and missing values with heatmaps

Distribution

Univariate analysis with quick distribution plots

Feature Ranking

Baseline feature importance prior to model trials

Scatter Plots

Smart scatter plots using diagnostics

Dimensionality Reduction

Visualize high-dimensional data using PCA and t-SNE

Sensitive Data

Identify and handle sensitive data such as PII

Text and NLP

Tools for common tasks like text pre-processing and topic modeling

Big Data Support

Uses Modin on top of Apache Arrow via Ray, or Dask

Pythonic EDA Accelerator for Data Science

Exploratory Data Analysis on Steroids