Data Summary

The Data Summary feature provides an overview of the data using summary statistics. The output is similar to that of pandas.DataFrame.describe, but a different set of statistics is selected to address common questions about the data.

  • Data Type: The data type (dtype) of each column.

  • Nulls: The number (count) or percentage of null values. Primarily for identifying missing data.

  • Zeros: The number (count) or percentage of zero values. Zero is commonly used as a special number and may indicate abnormalities.

  • Min, Max: The minimum and maximum values. Used to identify extreme values (outliers).

  • Median, Mean, Standard Deviation: Measures of center and spread. Used to identify skew, for example when the mean lies far from the median.

  • Unique: Number of unique values (levels). Used to identify high cardinality.

  • Top Frequency: The number (count) or percentage of values equaling the mode. Used to identify imbalanced data.
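As a rough illustration of what the statistics above measure, the following sketch computes comparable values with plain pandas. It is not the implementation used by data_describe; the helper name summarize, the column list, and the choice to leave Min through Standard Deviation empty for non-numeric columns are assumptions made for this example.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

COLUMNS = ["Data Type", "Nulls", "Zeros", "Min", "Median", "Max",
           "Mean", "Standard Deviation", "Unique", "Top Frequency"]


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: approximate the statistics above with plain pandas."""
    rows = {}
    for name in df.columns:
        col = df[name]
        counts = col.value_counts()  # sorted by frequency, nulls excluded
        numeric = is_numeric_dtype(col) and not is_bool_dtype(col)
        row = {
            "Data Type": str(col.dtype),
            "Nulls": int(col.isna().sum()),
            "Zeros": int((col == 0).sum()) if numeric else 0,
            "Unique": int(col.nunique()),  # distinct non-null values
            "Top Frequency": int(counts.iloc[0]) if len(counts) else 0,
        }
        if numeric:
            # Assumption: center and spread are only reported for numeric columns.
            row.update({"Min": col.min(), "Median": col.median(), "Max": col.max(),
                        "Mean": col.mean(), "Standard Deviation": col.std()})
        rows[name] = row
    table = pd.DataFrame.from_dict(rows, orient="index")
    return table.reindex(index=df.columns, columns=COLUMNS)


example = pd.DataFrame({"a": [0, 0, 1, 3], "b": ["x", "x", "y", None]})
print(summarize(example))
```

In this toy frame, column a has two zeros and a median (0.5) below its mean (1.0), while column b has one null and a mode that covers two of its three non-null values.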

Example data

[1]:
from datetime import datetime
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2; this cell requires an earlier version.
from sklearn.datasets import load_boston

import data_describe as dd
[2]:
data = load_boston()
df = pd.DataFrame(data.data, columns=list(data.feature_names))
df['target'] = data.target
df.head(1)
[2]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.09 1.0 296.0 15.3 396.9 4.98 24.0
[3]:
# Change data types to demonstrate data summary
df['CRIM'] = df['CRIM'] / 1000000  # very small floats
df['AGE'] = df['AGE'].map(lambda x: "young" if x < 29 else "old")  # string (object) column
df["AgeFlag"] = df['AGE'].astype(bool)  # boolean column; every non-empty string becomes True
df['ZN'] = df['ZN'].astype(int)  # integer column
df['Date'] = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')  # constant datetime column

Default

By default, data_summary attempts to make floats easier to read by disabling scientific notation and limiting the number of decimal places shown.

[4]:
dd.data_summary(df)
Info
Rows 506
Columns 16
Size in Memory 59.9 KB
Column   Data Type       Nulls  Zeros  Min                  Median      Max                  Mean       Standard Deviation  Unique  Top Frequency
CRIM     float64         0      0      0.0000000063         0.00000026  0.000089             0.0000036  0.0000086           504     2
ZN       int64           0      372    0                    0           100                  11.35      23.29               26      372
INDUS    float64         0      0      0.46                 9.69        27.74                11.14      6.85                76      132
CHAS     float64         0      471    0                    0           1                    0.069      0.25                2       471
NOX      float64         0      0      0.39                 0.54        0.87                 0.55       0.12                81      23
RM       float64         0      0      3.56                 6.21        8.78                 6.28       0.70                446     3
AGE      object          0      0                                                                                           2       446
DIS      float64         0      0      1.13                 3.21        12.13                3.80       2.10                412     5
RAD      float64         0      0      1                    5           24                   9.55       8.70                9       132
TAX      float64         0      0      187                  330         711                  408.24     168.37              66      132
PTRATIO  float64         0      0      12.60                19.050      22                   18.46      2.16                46      140
B        float64         0      0      0.32                 391.44      396.90               356.67     91.20               357     121
LSTAT    float64         0      0      1.73                 11.36       37.97                12.65      7.13                455     3
target   float64         0      0      5                    21.20       50                   22.53      9.19                229     16
AgeFlag  bool            0      0                                                                                           1       506
Date     datetime64[ns]  0      0      2008-01-01 13:30:00              2008-01-01 13:30:00                                 1       506
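The formatted values in the table above appear to keep roughly two significant digits for magnitudes below 1 and two decimal places otherwise, with whole numbers shown without a decimal point. The sketch below reproduces that style as an illustration only; readable_float and its parameters are hypothetical and are not taken from data_describe.

```python
import math


def readable_float(x: float, sig: int = 2, decimals: int = 2) -> str:
    """Hypothetical helper: fixed-point text without scientific notation."""
    if not math.isfinite(x):
        return str(x)  # leave nan/inf untouched
    if float(x).is_integer():
        return str(int(x))  # whole numbers: no decimal point
    if abs(x) >= 1:
        return f"{x:.{decimals}f}"  # fixed number of decimal places
    # Decimal places needed to show `sig` significant digits of a value below 1
    places = sig - 1 - math.floor(math.log10(abs(x)))
    return f"{x:.{places}f}"


for value in (6.32e-09, 0.06917, 408.237, 187.0):
    print(readable_float(value))
```

The four sample values correspond to the CRIM minimum, the CHAS mean, the TAX mean, and the TAX minimum, using the unrounded figures shown later with auto_float=False.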

Display counts as percentage

To display the count statistics as a percentage of the total record count, use as_percentage=True.

[5]:
dd.data_summary(df, as_percentage=True)
Info
Rows 506
Columns 16
Size in Memory 59.9 KB
Column   Data Type       Nulls  Zeros  Min                  Median      Max                  Mean       Standard Deviation  Unique  Top Frequency
CRIM     float64         0.0%   0.0%   0.0000000063         0.00000026  0.000089             0.0000036  0.0000086           504     0.4%
ZN       int64           0.0%   73.5%  0                    0           100                  11.35      23.29               26      73.5%
INDUS    float64         0.0%   0.0%   0.46                 9.69        27.74                11.14      6.85                76      26.1%
CHAS     float64         0.0%   93.1%  0                    0           1                    0.069      0.25                2       93.1%
NOX      float64         0.0%   0.0%   0.39                 0.54        0.87                 0.55       0.12                81      4.5%
RM       float64         0.0%   0.0%   3.56                 6.21        8.78                 6.28       0.70                446     0.6%
AGE      object          0.0%   0.0%                                                                                        2       88.1%
DIS      float64         0.0%   0.0%   1.13                 3.21        12.13                3.80       2.10                412     1.0%
RAD      float64         0.0%   0.0%   1                    5           24                   9.55       8.70                9       26.1%
TAX      float64         0.0%   0.0%   187                  330         711                  408.24     168.37              66      26.1%
PTRATIO  float64         0.0%   0.0%   12.60                19.050      22                   18.46      2.16                46      27.7%
B        float64         0.0%   0.0%   0.32                 391.44      396.90               356.67     91.20               357     23.9%
LSTAT    float64         0.0%   0.0%   1.73                 11.36       37.97                12.65      7.13                455     0.6%
target   float64         0.0%   0.0%   5                    21.20       50                   22.53      9.19                229     3.2%
AgeFlag  bool            0.0%   0.0%                                                                                        1       100.0%
Date     datetime64[ns]  0.0%   0.0%   2008-01-01 13:30:00              2008-01-01 13:30:00                                 1       100.0%
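The percentages above are consistent with dividing each count from the default output by the 506 rows and rounding to one decimal place. A minimal sketch of that arithmetic follows; as_percentage here is a hypothetical helper named after the parameter, not a function exposed by data_describe.

```python
def as_percentage(count: int, total: int) -> str:
    """Hypothetical helper: express a count as a percentage of the row count."""
    return f"{count / total:.1%}"


total_rows = 506  # "Rows" in the Info block
print(as_percentage(372, total_rows))  # zeros in ZN
print(as_percentage(471, total_rows))  # zeros in CHAS
print(as_percentage(446, total_rows))  # top frequency of AGE
```

For example, the 372 zeros in ZN become 372 / 506, which is about 73.5% of the records.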

Disable auto float formatting

If the automatic float formatting is not desired, use auto_float=False. Depending on your data, the output may not differ significantly.

[6]:
dd.data_summary(df, auto_float=False)
Info
Rows 506
Columns 16
Size in Memory 59.9 KB
Column   Data Type       Nulls  Zeros  Min                  Median      Max                  Mean         Standard Deviation  Unique  Top Frequency
CRIM     float64         0      0      6.32e-09             2.5651e-07  8.89762e-05          3.61352e-06  8.59304e-06         504     2
ZN       int64           0      372    0                    0           100                  11.3478      23.2875             26      372
INDUS    float64         0      0      0.46                 9.69        27.74                11.1368      6.85357             76      132
CHAS     float64         0      471    0                    0           1                    0.06917      0.253743            2       471
NOX      float64         0      0      0.385                0.538       0.871                0.554695     0.115763            81      23
RM       float64         0      0      3.561                6.2085      8.78                 6.28463      0.701923            446     3
AGE      object          0      0                                                                                             2       446
DIS      float64         0      0      1.1296               3.20745     12.1265              3.79504      2.10363             412     5
RAD      float64         0      0      1                    5           24                   9.54941      8.69865             9       132
TAX      float64         0      0      187                  330         711                  408.237      168.37              66      132
PTRATIO  float64         0      0      12.6                 19.05       22                   18.4555      2.16281             46      140
B        float64         0      0      0.32                 391.44      396.9                356.674      91.2046             357     121
LSTAT    float64         0      0      1.73                 11.36       37.97                12.6531      7.134               455     3
target   float64         0      0      5                    21.2        50                   22.5328      9.18801             229     16
AgeFlag  bool            0      0                                                                                             1       506
Date     datetime64[ns]  0      0      2008-01-01 13:30:00              2008-01-01 13:30:00                                   1       506
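With auto_float=False, the values in the table above resemble Python's general ("g") number format, which keeps six significant digits and switches to scientific notation for very small magnitudes. The sketch below only illustrates that format; whether data_describe or pandas produces the output this way is an assumption, and general_format is a hypothetical name. The CHAS mean is taken as 35 / 506, inferred from its 471 zeros in a 0/1 column of 506 rows.

```python
def general_format(x: float) -> str:
    """Hypothetical helper: six significant digits, scientific notation when tiny."""
    return f"{x:g}"


crim_min = 6.32e-09
chas_mean = 35 / 506  # 35 ones among 506 rows

print(general_format(crim_min))   # scientific notation
print(general_format(chas_mean))  # plain decimal, six significant digits
print(f"{crim_min:.10f}")         # the same minimum in fixed-point notation
```

The last line shows why scientific notation is harder to scan in a table: the fixed-point form makes the magnitude visible at a glance, at the cost of a wider column.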