Data Summary
The Data Summary feature provides an overview of the data using summary statistics. The output is similar to that of pandas.DataFrame.describe; however, a different set of statistics is selected to address common questions about the data.
Data Type: The data type (dtype) of each column.
Nulls: The number (count) or percentage of null values. Primarily for identifying missing data.
Zeros: The number (count) or percentage of zero values. Zero is commonly used as a special number and may indicate abnormalities.
Min, Max: The minimum and maximum values. Used to identify extreme values (outliers).
Median, Mean, Standard Deviation: Measures of center and spread. A mean that differs markedly from the median suggests a skewed distribution.
Unique: Number of unique values (levels). Used to identify high cardinality.
Top Frequency: The number (count) or percentage of values equaling the mode. Used to identify imbalanced data.
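To make the definitions above concrete, here is a minimal plain-Python sketch that computes the same kinds of statistics for a single column. It is not data-describe's implementation; the function name `summarize_column` and the sample values are hypothetical and chosen only for illustration.

```python
import math
import statistics
from collections import Counter


def summarize_column(values):
    """Compute the statistics listed above for one column (illustrative sketch)."""
    # Treat None and float NaN as null values.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    # Numeric statistics only apply to ints and floats (bools are excluded).
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    counts = Counter(present)
    summary = {
        "nulls": len(values) - len(present),
        "zeros": sum(1 for v in numeric if v == 0),
        "unique": len(counts),
        # Number of values equal to the mode (the most common value).
        "top_frequency": counts.most_common(1)[0][1] if counts else 0,
    }
    if numeric:
        summary.update(
            min=min(numeric),
            max=max(numeric),
            median=statistics.median(numeric),
            mean=statistics.mean(numeric),
            # The sample standard deviation needs at least two values.
            std=statistics.stdev(numeric) if len(numeric) > 1 else None,
        )
    return summary


column = [0, 0, 0, 12.5, 100, None]  # hypothetical sample column
result = summarize_column(column)
print(result)
```

For this sample column the sketch reports 1 null, 3 zeros, and a top frequency of 3 (the value 0). The mean (22.5) sits well above the median (0), which is the kind of gap that points to a skewed distribution.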
Example data
[1]:
from datetime import datetime
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2; this example requires an older release.
from sklearn.datasets import load_boston
import data_describe as dd
[2]:
data = load_boston()
df = pd.DataFrame(data.data, columns=list(data.feature_names))
df['target'] = data.target
df.head(1)
[2]:
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
[3]:
# Change data types to demonstrate data summary
df['CRIM'] = df['CRIM'] / 1000000
df['AGE'] = df['AGE'].map(lambda x: "young" if x < 29 else "old")
df["AgeFlag"] = df['AGE'].astype(bool)
df['ZN'] = df['ZN'].astype(int)
df['Date'] = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')
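One detail of the cell above is worth spelling out: AGE holds the strings "young" and "old" by the time it is converted to bool, and converting a non-empty string to bool tests truthiness rather than content. The plain-Python sketch below mirrors that logic on a few hypothetical sample values, without pandas.

```python
# Mirror the AGE mapping from the cell above on a few sample values.
ages = [65.2, 28.0, 100.0]  # hypothetical sample values
labels = ["young" if x < 29 else "old" for x in ages]

# bool() on a string checks whether it is non-empty, not what it says.
flags = [bool(label) for label in labels]

print(labels)  # ['old', 'young', 'old']
print(flags)   # [True, True, True]
```

This is consistent with the summary tables on this page, where AgeFlag has a single unique value that accounts for all 506 rows.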
Default
By default, data_summary attempts to format floats so that they are easier to read, by disabling scientific notation and limiting the number of decimal places shown.
[4]:
dd.data_summary(df)
| | Info |
---|---|
Rows | 506 |
Columns | 16 |
Size in Memory | 59.9 KB |
| | Data Type | Nulls | Zeros | Min | Median | Max | Mean | Standard Deviation | Unique | Top Frequency |
---|---|---|---|---|---|---|---|---|---|---|
CRIM | float64 | 0 | 0 | 0.0000000063 | 0.00000026 | 0.000089 | 0.0000036 | 0.0000086 | 504 | 2 |
ZN | int64 | 0 | 372 | 0 | 0 | 100 | 11.35 | 23.29 | 26 | 372 |
INDUS | float64 | 0 | 0 | 0.46 | 9.69 | 27.74 | 11.14 | 6.85 | 76 | 132 |
CHAS | float64 | 0 | 471 | 0 | 0 | 1 | 0.069 | 0.25 | 2 | 471 |
NOX | float64 | 0 | 0 | 0.39 | 0.54 | 0.87 | 0.55 | 0.12 | 81 | 23 |
RM | float64 | 0 | 0 | 3.56 | 6.21 | 8.78 | 6.28 | 0.70 | 446 | 3 |
AGE | object | 0 | 0 | | | | | | 2 | 446 |
DIS | float64 | 0 | 0 | 1.13 | 3.21 | 12.13 | 3.80 | 2.10 | 412 | 5 |
RAD | float64 | 0 | 0 | 1 | 5 | 24 | 9.55 | 8.70 | 9 | 132 |
TAX | float64 | 0 | 0 | 187 | 330 | 711 | 408.24 | 168.37 | 66 | 132 |
PTRATIO | float64 | 0 | 0 | 12.60 | 19.050 | 22 | 18.46 | 2.16 | 46 | 140 |
B | float64 | 0 | 0 | 0.32 | 391.44 | 396.90 | 356.67 | 91.20 | 357 | 121 |
LSTAT | float64 | 0 | 0 | 1.73 | 11.36 | 37.97 | 12.65 | 7.13 | 455 | 3 |
target | float64 | 0 | 0 | 5 | 21.20 | 50 | 22.53 | 9.19 | 229 | 16 |
AgeFlag | bool | 0 | 0 | | | | | | 1 | 506 |
Date | datetime64[ns] | 0 | 0 | 2008-01-01 13:30:00 | | 2008-01-01 13:30:00 | | | 1 | 506 |
None
[4]:
data-describe Summary Widget
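The float formatting in the table above can be approximated with a short sketch. This is not data-describe's actual logic, which is not shown on this page; the helper `format_float` and its `digits` parameter are hypothetical. The rule assumed here is: whole numbers print without decimals, values of magnitude 1 or more keep two decimal places, and smaller values keep two significant digits in fixed-point notation.

```python
import math


def format_float(x, digits=2):
    """Format a float without scientific notation (illustrative rule only)."""
    if not math.isfinite(x):
        return str(x)
    if x == int(x):
        # Whole numbers are shown without a decimal part.
        return str(int(x))
    if abs(x) >= 1:
        return f"{x:.{digits}f}"
    # For small values, keep `digits` significant digits after the leading zeros.
    exponent = math.floor(math.log10(abs(x)))  # e.g. -9 for 6.32e-09
    decimals = digits - 1 - exponent
    return f"{x:.{decimals}f}"


print(format_float(6.32e-09))   # 0.0000000063
print(format_float(0.06917))    # 0.069
print(format_float(11.347826))  # 11.35
print(format_float(100.0))      # 100
```

Under this assumed rule the sketch reproduces several entries of the table, such as the CRIM minimum and the CHAS mean, but other entries may follow different rules.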
Display counts as percentage
To display the count statistics (Nulls, Zeros, and Top Frequency) as percentages of the total record count, pass as_percentage=True.
[5]:
dd.data_summary(df, as_percentage=True)
| | Info |
---|---|
Rows | 506 |
Columns | 16 |
Size in Memory | 59.9 KB |
| | Data Type | Nulls | Zeros | Min | Median | Max | Mean | Standard Deviation | Unique | Top Frequency |
---|---|---|---|---|---|---|---|---|---|---|
CRIM | float64 | 0.0% | 0.0% | 0.0000000063 | 0.00000026 | 0.000089 | 0.0000036 | 0.0000086 | 504 | 0.4% |
ZN | int64 | 0.0% | 73.5% | 0 | 0 | 100 | 11.35 | 23.29 | 26 | 73.5% |
INDUS | float64 | 0.0% | 0.0% | 0.46 | 9.69 | 27.74 | 11.14 | 6.85 | 76 | 26.1% |
CHAS | float64 | 0.0% | 93.1% | 0 | 0 | 1 | 0.069 | 0.25 | 2 | 93.1% |
NOX | float64 | 0.0% | 0.0% | 0.39 | 0.54 | 0.87 | 0.55 | 0.12 | 81 | 4.5% |
RM | float64 | 0.0% | 0.0% | 3.56 | 6.21 | 8.78 | 6.28 | 0.70 | 446 | 0.6% |
AGE | object | 0.0% | 0.0% | | | | | | 2 | 88.1% |
DIS | float64 | 0.0% | 0.0% | 1.13 | 3.21 | 12.13 | 3.80 | 2.10 | 412 | 1.0% |
RAD | float64 | 0.0% | 0.0% | 1 | 5 | 24 | 9.55 | 8.70 | 9 | 26.1% |
TAX | float64 | 0.0% | 0.0% | 187 | 330 | 711 | 408.24 | 168.37 | 66 | 26.1% |
PTRATIO | float64 | 0.0% | 0.0% | 12.60 | 19.050 | 22 | 18.46 | 2.16 | 46 | 27.7% |
B | float64 | 0.0% | 0.0% | 0.32 | 391.44 | 396.90 | 356.67 | 91.20 | 357 | 23.9% |
LSTAT | float64 | 0.0% | 0.0% | 1.73 | 11.36 | 37.97 | 12.65 | 7.13 | 455 | 0.6% |
target | float64 | 0.0% | 0.0% | 5 | 21.20 | 50 | 22.53 | 9.19 | 229 | 3.2% |
AgeFlag | bool | 0.0% | 0.0% | | | | | | 1 | 100.0% |
Date | datetime64[ns] | 0.0% | 0.0% | 2008-01-01 13:30:00 | | 2008-01-01 13:30:00 | | | 1 | 100.0% |
None
[5]:
data-describe Summary Widget
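The percentages in the table above are each count divided by the total number of rows (506). A minimal sketch of that arithmetic follows, using counts taken from the default summary table; the helper name `as_percent` is hypothetical and is not part of data-describe.

```python
def as_percent(count, total, decimals=1):
    """Express a count as a percentage of the total record count."""
    return f"{count / total:.{decimals}%}"


total_rows = 506
print(as_percent(372, total_rows))  # ZN zeros: 73.5%
print(as_percent(471, total_rows))  # CHAS zeros: 93.1%
print(as_percent(2, total_rows))    # CRIM top frequency: 0.4%
print(as_percent(506, total_rows))  # Date top frequency: 100.0%
```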
Disable auto float formatting
To disable the automatic float formatting, pass auto_float=False. Depending on your data, the output may not differ significantly.
[6]:
dd.data_summary(df, auto_float=False)
| | Info |
---|---|
Rows | 506 |
Columns | 16 |
Size in Memory | 59.9 KB |
| | Data Type | Nulls | Zeros | Min | Median | Max | Mean | Standard Deviation | Unique | Top Frequency |
---|---|---|---|---|---|---|---|---|---|---|
CRIM | float64 | 0 | 0 | 6.32e-09 | 2.5651e-07 | 8.89762e-05 | 3.61352e-06 | 8.59304e-06 | 504 | 2 |
ZN | int64 | 0 | 372 | 0 | 0 | 100 | 11.3478 | 23.2875 | 26 | 372 |
INDUS | float64 | 0 | 0 | 0.46 | 9.69 | 27.74 | 11.1368 | 6.85357 | 76 | 132 |
CHAS | float64 | 0 | 471 | 0 | 0 | 1 | 0.06917 | 0.253743 | 2 | 471 |
NOX | float64 | 0 | 0 | 0.385 | 0.538 | 0.871 | 0.554695 | 0.115763 | 81 | 23 |
RM | float64 | 0 | 0 | 3.561 | 6.2085 | 8.78 | 6.28463 | 0.701923 | 446 | 3 |
AGE | object | 0 | 0 | | | | | | 2 | 446 |
DIS | float64 | 0 | 0 | 1.1296 | 3.20745 | 12.1265 | 3.79504 | 2.10363 | 412 | 5 |
RAD | float64 | 0 | 0 | 1 | 5 | 24 | 9.54941 | 8.69865 | 9 | 132 |
TAX | float64 | 0 | 0 | 187 | 330 | 711 | 408.237 | 168.37 | 66 | 132 |
PTRATIO | float64 | 0 | 0 | 12.6 | 19.05 | 22 | 18.4555 | 2.16281 | 46 | 140 |
B | float64 | 0 | 0 | 0.32 | 391.44 | 396.9 | 356.674 | 91.2046 | 357 | 121 |
LSTAT | float64 | 0 | 0 | 1.73 | 11.36 | 37.97 | 12.6531 | 7.134 | 455 | 3 |
target | float64 | 0 | 0 | 5 | 21.2 | 50 | 22.5328 | 9.18801 | 229 | 16 |
AgeFlag | bool | 0 | 0 | | | | | | 1 | 506 |
Date | datetime64[ns] | 0 | 0 | 2008-01-01 13:30:00 | | 2008-01-01 13:30:00 | | | 1 | 506 |
None
[6]:
data-describe Summary Widget
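Several unformatted values in the table above look like Python's general ("g") number format, which keeps six significant digits and switches to scientific notation for very small magnitudes. That this is the formatter actually in use is an assumption; the sketch below only contrasts "g" with fixed-point formatting for a few values copied from the table.

```python
# Values copied from the unformatted table above.
values = [6.32e-09, 8.89762e-05, 11.3478, 187.0]

rendered = {}
for v in values:
    general = format(v, "g")   # may switch to scientific notation
    fixed = format(v, ".10f")  # always fixed-point, 10 decimal places
    rendered[v] = (general, fixed)
    print(f"{general:>12}  {fixed}")
```

The first value prints as 6.32e-09 in general format and as 0.0000000063 in fixed-point format, which is the readability gap that the default formatting is meant to close.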