Distributions¶

[1]:

import pandas as pd
import data_describe as dd

[2]:

from sklearn.datasets import load_boston
data = load_boston()
df = pd.DataFrame(data.data, columns=list(data.feature_names))
df['target'] = data.target

# Create categorical (bin) features to demonstrate count plots
df['AGE'] = df['AGE'].map(lambda x: "young" if x < 29 else "old")
df['CRIM'] = df['CRIM'].map(lambda x: "low" if x < df.CRIM.median() else "high")

[3]:

df.head(2)

[3]:

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target
0	low	18.0	2.31	0.0	0.538	6.575	old	4.0900	1.0	296.0	15.3	396.9	4.98	24.0
1	low	0.0	7.07	0.0	0.469	6.421	old	4.9671	2.0	242.0	17.8	396.9	9.14	21.6

Diagnostic Summary¶

The default output summarizes diagnostics on the univariate data.

[4]:

dist = dd.distribution(df)
dist

Distribution Summary:
        Skew detected in 1 columns.
        Spikey histograms detected in 0 columns.

        Use the method plot_distribution("column_name") to view plots for each feature.

        Example:
            dist = DistributionWidget(data)
            dist.plot_distribution("column1")

None

[4]:

<data_describe.core.distributions.DistributionWidget at 0x1fa324be4c8>

Plot one feature¶

[5]:

dist.plot_distribution("CRIM")

[5]:

../_images/examples_distributions_7_0.png

[6]:

dist.plot_distribution("ZN")

[6]:

../_images/examples_distributions_8_0.png

[7]:

import seaborn as sns
sns.__version__
# sns.displot(df, x="ZN", hue="CRIM")

[7]:

'0.11.0'

[8]:

dist.plot_distribution("ZN", contrast="CRIM")

[8]:

../_images/examples_distributions_10_0.png

Display diagnostic values¶

[9]:

dist.skew_value

[9]:

ZN         2.219063
INDUS      0.294146
CHAS       3.395799
NOX        0.727144
RM         0.402415
DIS        1.008779
RAD        1.001833
TAX        0.667968
PTRATIO   -0.799945
B         -2.881798
LSTAT      0.903771
target     1.104811
dtype: float64