Data Summary

The Data Summary feature provides an overview of the data using summary statistics. The output is similar to that of pandas.DataFrame.describe, but a different set of statistics is selected to address common questions about the data.

Data Type: The data type (dtype) of the column.
Nulls: The number (count) or percentage of null values. Primarily for identifying missing data.
Zeros: The number (count) or percentage of zero values. Zero is commonly used as a special number and may indicate abnormalities.
Min, Max: The minimum and maximum values. Used to identify extreme values (outliers).
Median, Mean, Standard Deviation: Measures of center and spread. Used to identify skew, for example when the mean differs substantially from the median.
Unique: Number of unique values (levels). Used to identify high cardinality.
Top Frequency: The number (count) or percentage of values equaling the mode. Used to identify imbalanced data.
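
The statistics listed above can be approximated with plain pandas. The following is a minimal sketch, not data_describe's implementation: the function name summarize and the sample DataFrame are hypothetical, and details such as how zeros are counted for non-numeric columns are assumptions.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row of summary statistics per column."""
    rows = {}
    for name, col in df.items():
        # Treat bool columns as non-numeric so that min/mean are left blank for them.
        numeric = is_numeric_dtype(col) and not is_bool_dtype(col)
        counts = col.value_counts()  # sorted by frequency, nulls excluded
        rows[name] = {
            "Data Type": str(col.dtype),
            "Nulls": int(col.isna().sum()),
            # Assumption: zeros are only counted for numeric columns.
            "Zeros": int((col == 0).sum()) if numeric else 0,
            "Min": col.min() if numeric else None,
            "Median": col.median() if numeric else None,
            "Max": col.max() if numeric else None,
            "Mean": col.mean() if numeric else None,
            "Standard Deviation": col.std() if numeric else None,
            "Unique": int(col.nunique()),
            # Number of values equal to the mode.
            "Top Frequency": int(counts.iloc[0]) if len(counts) else 0,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Small hypothetical dataset: one numeric column with a null, one text column.
sample = pd.DataFrame({"x": [0, 0, 3, None], "label": ["a", "a", "b", "a"]})
summary = summarize(sample)
print(summary)
```

Each row of the result corresponds to one column of the input, mirroring the layout of the summary tables shown later on this page.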

Example data

[1]:

from datetime import datetime
import pandas as pd
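# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this example requires an older scikit-learn release.
from sklearn.datasets import load_boston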

import data_describe as dd

[2]:

data = load_boston()
df = pd.DataFrame(data.data, columns=list(data.feature_names))
df['target'] = data.target
df.head(1)

[2]:

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	target
0	0.00632	18.0	2.31	0.0	0.538	6.575	65.2	4.09	1.0	296.0	15.3	396.9	4.98	24.0

[3]:

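# Change data types to demonstrate data summary
df['CRIM'] = df['CRIM'] / 1000000  # very small floats, to exercise float formatting
df['AGE'] = df['AGE'].map(lambda x: "young" if x < 29 else "old")  # text (object) column
df['AgeFlag'] = df['AGE'].astype(bool)  # non-empty strings are truthy, so every value is True
df['ZN'] = df['ZN'].astype(int)  # integer column
df['Date'] = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')  # constant datetime column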

Default

By default, data_summary attempts to make floats easier to read by disabling scientific notation and limiting the number of decimal places shown.

[4]:

dd.data_summary(df)

	Info
Rows	506
Columns	16
Size in Memory	59.9 KB

	Data Type	Zeros	Min	Median	Max	Mean	Standard Deviation	Unique	Top Frequency
CRIM	float64	0	0.0000000063	0.00000026	0.000089	0.0000036	0.0000086	504	2
ZN	int64	372	0	0	100	11.35	23.29	26	372
INDUS	float64	0	0.46	9.69	27.74	11.14	6.85	76	132
CHAS	float64	471	0	0	1	0.069	0.25	2	471
NOX	float64	0	0.39	0.54	0.87	0.55	0.12	81	23
RM	float64	0	3.56	6.21	8.78	6.28	0.70	446	3
AGE	object	0						2	446
DIS	float64	0	1.13	3.21	12.13	3.80	2.10	412	5
RAD	float64	0	1	5	24	9.55	8.70	9	132
TAX	float64	0	187	330	711	408.24	168.37	66	132
PTRATIO	float64	0	12.60	19.050	22	18.46	2.16	46	140
B	float64	0	0.32	391.44	396.90	356.67	91.20	357	121
LSTAT	float64	0	1.73	11.36	37.97	12.65	7.13	455	3
target	float64	0	5	21.20	50	22.53	9.19	229	16
AgeFlag	bool	0						1	506
Date	datetime64[ns]	0	2008-01-01 13:30:00		2008-01-01 13:30:00			1	506

None

[4]:

data-describe Summary Widget
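
To illustrate the idea of showing small floats without scientific notation, here is a minimal sketch using only the standard library. It is not the formatting logic data_describe uses: the helper name to_plain and the choice of two significant digits are assumptions, picked because they reproduce the CRIM values in the table above.

```python
from decimal import Decimal


def to_plain(x: float, sig: int = 2) -> str:
    """Hypothetical helper: round to `sig` significant digits, no exponent."""
    # Round via the 'g' format, then convert back to a float.
    rounded = float(f"{x:.{sig}g}")
    # Decimal's 'f' format writes the value positionally instead of as 6.3e-09.
    return format(Decimal(repr(rounded)), "f")


crim_min = 6.32e-09
print(str(crim_min))       # Python's default uses scientific notation
print(to_plain(crim_min))  # positional notation with two significant digits
```

Larger values would need a different rule (for example, a fixed number of decimal places), which this sketch does not attempt to cover.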

Display counts as percentage

To display the count statistics (Nulls, Zeros, Top Frequency) as a percentage of the total record count, use as_percentage=True.

[5]:

dd.data_summary(df, as_percentage=True)

	Info
Rows	506
Columns	16
Size in Memory	59.9 KB

	Data Type	Nulls	Zeros	Min	Median	Max	Mean	Standard Deviation	Unique	Top Frequency
CRIM	float64	0.0%	0.0%	0.0000000063	0.00000026	0.000089	0.0000036	0.0000086	504	0.4%
ZN	int64	0.0%	73.5%	0	0	100	11.35	23.29	26	73.5%
INDUS	float64	0.0%	0.0%	0.46	9.69	27.74	11.14	6.85	76	26.1%
CHAS	float64	0.0%	93.1%	0	0	1	0.069	0.25	2	93.1%
NOX	float64	0.0%	0.0%	0.39	0.54	0.87	0.55	0.12	81	4.5%
RM	float64	0.0%	0.0%	3.56	6.21	8.78	6.28	0.70	446	0.6%
AGE	object	0.0%	0.0%						2	88.1%
DIS	float64	0.0%	0.0%	1.13	3.21	12.13	3.80	2.10	412	1.0%
RAD	float64	0.0%	0.0%	1	5	24	9.55	8.70	9	26.1%
TAX	float64	0.0%	0.0%	187	330	711	408.24	168.37	66	26.1%
PTRATIO	float64	0.0%	0.0%	12.60	19.050	22	18.46	2.16	46	27.7%
B	float64	0.0%	0.0%	0.32	391.44	396.90	356.67	91.20	357	23.9%
LSTAT	float64	0.0%	0.0%	1.73	11.36	37.97	12.65	7.13	455	0.6%
target	float64	0.0%	0.0%	5	21.20	50	22.53	9.19	229	3.2%
AgeFlag	bool	0.0%	0.0%						1	100.0%
Date	datetime64[ns]	0.0%	0.0%	2008-01-01 13:30:00		2008-01-01 13:30:00			1	100.0%

None

[5]:

data-describe Summary Widget
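
The percentages above are each count divided by the total number of rows. As a worked check, this small sketch recomputes a few of them from the counts in the default output; the helper name as_percent is hypothetical, and the one-decimal rounding is an assumption based on the displayed values.

```python
def as_percent(count: int, total: int) -> str:
    """Hypothetical helper: format count / total as a one-decimal percentage."""
    return f"{count / total:.1%}"


total_rows = 506  # "Rows" in the Info table

zn_zeros = as_percent(372, total_rows)    # ZN: 372 zero values
chas_zeros = as_percent(471, total_rows)  # CHAS: 471 zero values
age_top = as_percent(446, total_rows)     # AGE: 446 rows equal the mode
crim_top = as_percent(2, total_rows)      # CRIM: 2 rows equal the mode

print(zn_zeros, chas_zeros, age_top, crim_top)
```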

Disable auto float formatting

To turn off the automatic float formatting, use auto_float=False. Depending on your data, the output may not differ significantly.

[6]:

dd.data_summary(df, auto_float=False)

	Info
Rows	506
Columns	16
Size in Memory	59.9 KB

	Data Type	Zeros	Min	Median	Max	Mean	Standard Deviation	Unique	Top Frequency
CRIM	float64	0	6.32e-09	2.5651e-07	8.89762e-05	3.61352e-06	8.59304e-06	504	2
ZN	int64	372	0	0	100	11.3478	23.2875	26	372
INDUS	float64	0	0.46	9.69	27.74	11.1368	6.85357	76	132
CHAS	float64	471	0	0	1	0.06917	0.253743	2	471
NOX	float64	0	0.385	0.538	0.871	0.554695	0.115763	81	23
RM	float64	0	3.561	6.2085	8.78	6.28463	0.701923	446	3
AGE	object	0						2	446
DIS	float64	0	1.1296	3.20745	12.1265	3.79504	2.10363	412	5
RAD	float64	0	1	5	24	9.54941	8.69865	9	132
TAX	float64	0	187	330	711	408.237	168.37	66	132
PTRATIO	float64	0	12.6	19.05	22	18.4555	2.16281	46	140
B	float64	0	0.32	391.44	396.9	356.674	91.2046	357	121
LSTAT	float64	0	1.73	11.36	37.97	12.6531	7.134	455	3
target	float64	0	5	21.2	50	22.5328	9.18801	229	16
AgeFlag	bool	0						1	506
Date	datetime64[ns]	0	2008-01-01 13:30:00		2008-01-01 13:30:00			1	506

None

[6]:

data-describe Summary Widget