{ "cells": [ { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.011413, "end_time": "2020-12-08T22:00:30.201566", "exception": false, "start_time": "2020-12-08T22:00:30.190153", "status": "completed" }, "tags": [] }, "source": [ "# Data Summary\n", "\n", "The Data Summary feature provides an overview of the data using summary statistics. The output is similar to using `pandas.DataFrame.describe`, however, a different set of statistics are selected to address common questions about the data.\n", "\n", "- Data Type: The data type\n", "- Nulls: The number (count) or percentage of null values. Primarily for identifying missing data.\n", "- Zeros: The number (count) or percentage of zero values. Zero is commonly used as a special number and may indicate abnormalities.\n", "- Min, Max: The minimum and maximum values. Used to identify extreme values (outliers).\n", "- Median, Mean, Standard Deviation: Used to identify skew.\n", "- Unique: Number of unique values (levels). Used to identify high cardinality.\n", "- Top Frequency: The number (count) or percentage of values equaling the mode. Used to identify imbalanced data." ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.005885, "end_time": "2020-12-08T22:00:30.215499", "exception": false, "start_time": "2020-12-08T22:00:30.209614", "status": "completed" }, "tags": [] }, "source": [ "## Example data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2020-12-08T22:00:30.230862Z", "iopub.status.busy": "2020-12-08T22:00:30.230153Z", "iopub.status.idle": "2020-12-08T22:00:31.711283Z", "shell.execute_reply": "2020-12-08T22:00:31.711731Z" }, "papermill": { "duration": 1.490596, "end_time": "2020-12-08T22:00:31.711964", "exception": false, "start_time": "2020-12-08T22:00:30.221368", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "from datetime import datetime\n", "import pandas as pd\n", "from sklearn.datasets import load_boston\n", "\n", "import data_describe as dd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2020-12-08T22:00:31.727540Z", "iopub.status.busy": "2020-12-08T22:00:31.726630Z", "iopub.status.idle": "2020-12-08T22:00:31.748134Z", "shell.execute_reply": "2020-12-08T22:00:31.748515Z" }, "papermill": { "duration": 0.031568, "end_time": "2020-12-08T22:00:31.748688", "exception": false, "start_time": "2020-12-08T22:00:31.717120", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATtarget
00.0063218.02.310.00.5386.57565.24.091.0296.015.3396.94.9824.0
\n", "
" ], "text/plain": [ " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n", "0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.09 1.0 296.0 15.3 \n", "\n", " B LSTAT target \n", "0 396.9 4.98 24.0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = load_boston()\n", "df = pd.DataFrame(data.data, columns=list(data.feature_names))\n", "df['target'] = data.target\n", "df.head(1)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2020-12-08T22:00:31.766432Z", "iopub.status.busy": "2020-12-08T22:00:31.765707Z", "iopub.status.idle": "2020-12-08T22:00:31.779538Z", "shell.execute_reply": "2020-12-08T22:00:31.779930Z" }, "papermill": { "duration": 0.025, "end_time": "2020-12-08T22:00:31.780163", "exception": false, "start_time": "2020-12-08T22:00:31.755163", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# Change data types to demonstrate data summary\n", "df['CRIM'] = df['CRIM'] / 1000000\n", "df['AGE'] = df['AGE'].map(lambda x: \"young\" if x < 29 else \"old\")\n", "df[\"AgeFlag\"] = df['AGE'].astype(bool)\n", "df['ZN'] = df['ZN'].astype(int)\n", "df['Date'] = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.0057, "end_time": "2020-12-08T22:00:31.791982", "exception": false, "start_time": "2020-12-08T22:00:31.786282", "status": "completed" }, "tags": [] }, "source": [ "## Default\n", "The defaults for `data_summary` attempts to format floats to be easier to read by disabling scientific notation and limiting the number of decimal places shown." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2020-12-08T22:00:31.806589Z", "iopub.status.busy": "2020-12-08T22:00:31.805868Z", "iopub.status.idle": "2020-12-08T22:00:32.124701Z", "shell.execute_reply": "2020-12-08T22:00:32.125088Z" }, "papermill": { "duration": 0.327942, "end_time": "2020-12-08T22:00:32.125266", "exception": false, "start_time": "2020-12-08T22:00:31.797324", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Info
Rows506
Columns16
Size in Memory59.9 KB
\n", "
" ], "text/plain": [ " Info\n", "Rows 506\n", "Columns 16\n", "Size in Memory 59.9 KB" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Data Type Nulls Zeros Min Median Max Mean Standard Deviation Unique Top Frequency
CRIMfloat64000.00000000630.000000260.0000890.00000360.00000865042
ZNint6403720010011.3523.2926372
INDUSfloat64000.469.6927.7411.146.8576132
CHASfloat6404710010.0690.252471
NOXfloat64000.390.540.870.550.128123
RMfloat64003.566.218.786.280.704463
AGEobject002446
DISfloat64001.133.2112.133.802.104125
RADfloat640015249.558.709132
TAXfloat6400187330711408.24168.3766132
PTRATIOfloat640012.6019.0502218.462.1646140
Bfloat64000.32391.44396.90356.6791.20357121
LSTATfloat64001.7311.3637.9712.657.134553
targetfloat6400521.205022.539.1922916
AgeFlagbool001506
Datedatetime64[ns]002008-01-01 13:30:002008-01-01 13:30:001506
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "None" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "data-describe Summary Widget" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dd.data_summary(df)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.009868, "end_time": "2020-12-08T22:00:32.142324", "exception": false, "start_time": "2020-12-08T22:00:32.132456", "status": "completed" }, "tags": [] }, "source": [ "## Display counts as percentage\n", "To display the count statistics as a percentage (over the total record count), use `as_percentage=True`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2020-12-08T22:00:32.172926Z", "iopub.status.busy": "2020-12-08T22:00:32.170150Z", "iopub.status.idle": "2020-12-08T22:00:32.209777Z", "shell.execute_reply": "2020-12-08T22:00:32.210256Z" }, "papermill": { "duration": 0.060745, "end_time": "2020-12-08T22:00:32.210432", "exception": false, "start_time": "2020-12-08T22:00:32.149687", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Info
Rows506
Columns16
Size in Memory59.9 KB
\n", "
" ], "text/plain": [ " Info\n", "Rows 506\n", "Columns 16\n", "Size in Memory 59.9 KB" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Data Type Nulls Zeros Min Median Max Mean Standard Deviation Unique Top Frequency
CRIMfloat640.0%0.0%0.00000000630.000000260.0000890.00000360.00000865040.4%
ZNint640.0%73.5%0010011.3523.292673.5%
INDUSfloat640.0%0.0%0.469.6927.7411.146.857626.1%
CHASfloat640.0%93.1%0010.0690.25293.1%
NOXfloat640.0%0.0%0.390.540.870.550.12814.5%
RMfloat640.0%0.0%3.566.218.786.280.704460.6%
AGEobject0.0%0.0%288.1%
DISfloat640.0%0.0%1.133.2112.133.802.104121.0%
RADfloat640.0%0.0%15249.558.70926.1%
TAXfloat640.0%0.0%187330711408.24168.376626.1%
PTRATIOfloat640.0%0.0%12.6019.0502218.462.164627.7%
Bfloat640.0%0.0%0.32391.44396.90356.6791.2035723.9%
LSTATfloat640.0%0.0%1.7311.3637.9712.657.134550.6%
targetfloat640.0%0.0%521.205022.539.192293.2%
AgeFlagbool0.0%0.0%1100.0%
Datedatetime64[ns]0.0%0.0%2008-01-01 13:30:002008-01-01 13:30:001100.0%
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "None" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "data-describe Summary Widget" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dd.data_summary(df, as_percentage=True)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.008632, "end_time": "2020-12-08T22:00:32.227499", "exception": false, "start_time": "2020-12-08T22:00:32.218867", "status": "completed" }, "tags": [] }, "source": [ "## Disable auto float formatting\n", "If the formatting logic is not desired, use `auto_float=False`. Depending on your data, there may not be a significant difference in the output." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2020-12-08T22:00:32.260924Z", "iopub.status.busy": "2020-12-08T22:00:32.258066Z", "iopub.status.idle": "2020-12-08T22:00:32.298634Z", "shell.execute_reply": "2020-12-08T22:00:32.299144Z" }, "papermill": { "duration": 0.062812, "end_time": "2020-12-08T22:00:32.299319", "exception": false, "start_time": "2020-12-08T22:00:32.236507", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Info
Rows506
Columns16
Size in Memory59.9 KB
\n", "
" ], "text/plain": [ " Info\n", "Rows 506\n", "Columns 16\n", "Size in Memory 59.9 KB" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Data TypeNullsZerosMinMedianMaxMeanStandard DeviationUniqueTop Frequency
CRIMfloat64006.32e-092.5651e-078.89762e-053.61352e-068.59304e-065042
ZNint6403720010011.347823.287526372
INDUSfloat64000.469.6927.7411.13686.8535776132
CHASfloat6404710010.069170.2537432471
NOXfloat64000.3850.5380.8710.5546950.1157638123
RMfloat64003.5616.20858.786.284630.7019234463
AGEobject002446
DISfloat64001.12963.2074512.12653.795042.103634125
RADfloat640015249.549418.698659132
TAXfloat6400187330711408.237168.3766132
PTRATIOfloat640012.619.052218.45552.1628146140
Bfloat64000.32391.44396.9356.67491.2046357121
LSTATfloat64001.7311.3637.9712.65317.1344553
targetfloat6400521.25022.53289.1880122916
AgeFlagbool001506
Datedatetime64[ns]002008-01-01 13:30:002008-01-01 13:30:001506
\n", "
" ], "text/plain": [ " Data Type Nulls Zeros Min Median \\\n", "CRIM float64 0 0 6.32e-09 2.5651e-07 \n", "ZN int64 0 372 0 0 \n", "INDUS float64 0 0 0.46 9.69 \n", "CHAS float64 0 471 0 0 \n", "NOX float64 0 0 0.385 0.538 \n", "RM float64 0 0 3.561 6.2085 \n", "AGE object 0 0 \n", "DIS float64 0 0 1.1296 3.20745 \n", "RAD float64 0 0 1 5 \n", "TAX float64 0 0 187 330 \n", "PTRATIO float64 0 0 12.6 19.05 \n", "B float64 0 0 0.32 391.44 \n", "LSTAT float64 0 0 1.73 11.36 \n", "target float64 0 0 5 21.2 \n", "AgeFlag bool 0 0 \n", "Date datetime64[ns] 0 0 2008-01-01 13:30:00 \n", "\n", " Max Mean Standard Deviation Unique \\\n", "CRIM 8.89762e-05 3.61352e-06 8.59304e-06 504 \n", "ZN 100 11.3478 23.2875 26 \n", "INDUS 27.74 11.1368 6.85357 76 \n", "CHAS 1 0.06917 0.253743 2 \n", "NOX 0.871 0.554695 0.115763 81 \n", "RM 8.78 6.28463 0.701923 446 \n", "AGE 2 \n", "DIS 12.1265 3.79504 2.10363 412 \n", "RAD 24 9.54941 8.69865 9 \n", "TAX 711 408.237 168.37 66 \n", "PTRATIO 22 18.4555 2.16281 46 \n", "B 396.9 356.674 91.2046 357 \n", "LSTAT 37.97 12.6531 7.134 455 \n", "target 50 22.5328 9.18801 229 \n", "AgeFlag 1 \n", "Date 2008-01-01 13:30:00 1 \n", "\n", " Top Frequency \n", "CRIM 2 \n", "ZN 372 \n", "INDUS 132 \n", "CHAS 471 \n", "NOX 23 \n", "RM 3 \n", "AGE 446 \n", "DIS 5 \n", "RAD 132 \n", "TAX 132 \n", "PTRATIO 140 \n", "B 121 \n", "LSTAT 3 \n", "target 16 \n", "AgeFlag 506 \n", "Date 506 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "None" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "data-describe Summary Widget" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dd.data_summary(df, auto_float=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "papermill": { "duration": 4.379267, "end_time": "2020-12-08T22:00:33.482384", "environment_variables": {}, "exception": null, "input_path": "/Users/richardtruong-chau/Projects/data-describe/examples/Data_Summary.ipynb", "output_path": "/Users/richardtruong-chau/Projects/data-describe/examples/Data_Summary.ipynb", "parameters": {}, "start_time": "2020-12-08T22:00:29.103117", "version": "2.1.2" } }, "nbformat": 4, "nbformat_minor": 4 }