{"cells": [{"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:34.119602", "end_time": "2020-11-12T23:04:34.130635", "duration": 0.011033, "status": "completed"}, "tags": []}, "source": "# Data Summary\n\nThe Data Summary feature provides an overview of the data using summary statistics. The output is similar to that of `pandas.DataFrame.describe`, but a different set of statistics is selected to address common questions about the data.\n\n- Data Type: The pandas data type (dtype) of the column.\n- Nulls: The number (count) or percentage of null values. Primarily for identifying missing data.\n- Zeros: The number (count) or percentage of zero values. Zero is commonly used as a special value and may indicate abnormalities.\n- Min, Max: The minimum and maximum values. Used to identify extreme values (outliers).\n- Median, Mean, Standard Deviation: Used to identify skew.\n- Unique: The number of unique values (levels). Used to identify high cardinality.\n- Top Frequency: The number (count) or percentage of values equal to the mode. 
Used to identify imbalanced data."}, {"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:34.141601", "end_time": "2020-11-12T23:04:34.153597", "duration": 0.011996, "status": "completed"}, "tags": []}, "source": "## Example data"}, {"cell_type": "code", "execution_count": 1, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:34.183598Z", "iopub.status.busy": "2020-11-12T23:04:34.183598Z", "iopub.status.idle": "2020-11-12T23:04:36.807237Z", "shell.execute_reply": "2020-11-12T23:04:36.807237Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:34.165603", "end_time": "2020-11-12T23:04:36.808209", "duration": 2.642606, "status": "completed"}, "tags": []}, "outputs": [], "source": "from datetime import datetime\nimport pandas as pd\nfrom sklearn.datasets import load_boston\n\nimport data_describe as dd"}, {"cell_type": "code", "execution_count": 2, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:36.832236Z", "iopub.status.busy": "2020-11-12T23:04:36.831237Z", "iopub.status.idle": "2020-11-12T23:04:36.866212Z", "shell.execute_reply": "2020-11-12T23:04:36.867236Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:36.816236", "end_time": "2020-11-12T23:04:36.867236", "duration": 0.051, "status": "completed"}, "tags": []}, "outputs": [{"output_type": "execute_result", "metadata": {}, "data": {"text/plain": " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.09 1.0 296.0 15.3 \n\n B LSTAT target \n0 396.9 4.98 24.0 ", "text/html": "
<table>
<thead>
<tr><th></th><th>CRIM</th><th>ZN</th><th>INDUS</th><th>CHAS</th><th>NOX</th><th>RM</th><th>AGE</th><th>DIS</th><th>RAD</th><th>TAX</th><th>PTRATIO</th><th>B</th><th>LSTAT</th><th>target</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>0.00632</td><td>18.0</td><td>2.31</td><td>0.0</td><td>0.538</td><td>6.575</td><td>65.2</td><td>4.09</td><td>1.0</td><td>296.0</td><td>15.3</td><td>396.9</td><td>4.98</td><td>24.0</td></tr>
</tbody>
</table>
"}, "execution_count": 2}], "source": "data = load_boston()\ndf = pd.DataFrame(data.data, columns=list(data.feature_names))\ndf['target'] = data.target\ndf.head(1)"}, {"cell_type": "code", "execution_count": 3, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:36.892209Z", "iopub.status.busy": "2020-11-12T23:04:36.892209Z", "iopub.status.idle": "2020-11-12T23:04:36.961237Z", "shell.execute_reply": "2020-11-12T23:04:36.961237Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:36.877211", "end_time": "2020-11-12T23:04:36.962237", "duration": 0.085026, "status": "completed"}, "tags": []}, "outputs": [], "source": "# Change data types to demonstrate data summary\n# Scale CRIM down to very small floats\ndf['CRIM'] = df['CRIM'] / 1000000\n# Convert AGE to a string (object) column\ndf['AGE'] = df['AGE'].map(lambda x: \"young\" if x < 29 else \"old\")\n# Non-empty strings are truthy, so AgeFlag is True for every row\ndf[\"AgeFlag\"] = df['AGE'].astype(bool)\n# Cast ZN to an integer column\ndf['ZN'] = df['ZN'].astype(int)\n# Add a datetime column with the same timestamp in every row\ndf['Date'] = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')"}, {"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:36.972251", "end_time": "2020-11-12T23:04:36.983271", "duration": 0.01102, "status": "completed"}, "tags": []}, "source": "## Default\nBy default, `data_summary` formats floats to be easier to read by disabling scientific notation and limiting the number of decimal places shown."}, {"cell_type": "code", "execution_count": 4, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:37.007209Z", "iopub.status.busy": "2020-11-12T23:04:37.006212Z", "iopub.status.idle": "2020-11-12T23:04:37.255243Z", "shell.execute_reply": "2020-11-12T23:04:37.256242Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:36.992237", "end_time": "2020-11-12T23:04:37.256242", "duration": 0.264005, "status": "completed"}, "tags": []}, "outputs": [{"output_type": "display_data", "metadata": {}, "data": {"text/plain": " Info\nRows 506\nColumns 16\nSize in Memory 57.9 KB", "text/html": "
<table>
<thead>
<tr><th></th><th>Info</th></tr>
</thead>
<tbody>
<tr><th>Rows</th><td>506</td></tr>
<tr><th>Columns</th><td>16</td></tr>
<tr><th>Size in Memory</th><td>57.9 KB</td></tr>
</tbody>
</table>
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "", "text/html": "
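<table>
<thead>
<tr><th></th><th>Data Type</th><th>Nulls</th><th>Zeros</th><th>Min</th><th>Median</th><th>Max</th><th>Mean</th><th>Standard Deviation</th><th>Unique</th><th>Top Frequency</th></tr>
</thead>
<tbody>
<tr><th>CRIM</th><td>float64</td><td>0</td><td>0</td><td>0.0000000063</td><td>0.00000026</td><td>0.000089</td><td>0.0000036</td><td>0.0000086</td><td>504</td><td>2</td></tr>
<tr><th>ZN</th><td>int32</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>11.35</td><td>23.29</td><td>26</td><td>372</td></tr>
<tr><th>INDUS</th><td>float64</td><td>0</td><td>0</td><td>0.46</td><td>9.69</td><td>27.74</td><td>11.14</td><td>6.85</td><td>76</td><td>132</td></tr>
<tr><th>CHAS</th><td>float64</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0.069</td><td>0.25</td><td>2</td><td>471</td></tr>
<tr><th>NOX</th><td>float64</td><td>0</td><td>0</td><td>0.39</td><td>0.54</td><td>0.87</td><td>0.55</td><td>0.12</td><td>81</td><td>23</td></tr>
<tr><th>RM</th><td>float64</td><td>0</td><td>0</td><td>3.56</td><td>6.21</td><td>8.78</td><td>6.28</td><td>0.70</td><td>446</td><td>3</td></tr>
<tr><th>AGE</th><td>object</td><td>0</td><td>0</td><td></td><td></td><td></td><td></td><td></td><td>2</td><td>446</td></tr>
<tr><th>DIS</th><td>float64</td><td>0</td><td>0</td><td>1.13</td><td>3.21</td><td>12.13</td><td>3.80</td><td>2.10</td><td>412</td><td>5</td></tr>
<tr><th>RAD</th><td>float64</td><td>0</td><td>0</td><td>1</td><td>5</td><td>24</td><td>9.55</td><td>8.70</td><td>9</td><td>132</td></tr>
<tr><th>TAX</th><td>float64</td><td>0</td><td>0</td><td>187</td><td>330</td><td>711</td><td>408.24</td><td>168.37</td><td>66</td><td>132</td></tr>
<tr><th>PTRATIO</th><td>float64</td><td>0</td><td>0</td><td>12.60</td><td>19.050</td><td>22</td><td>18.46</td><td>2.16</td><td>46</td><td>140</td></tr>
<tr><th>B</th><td>float64</td><td>0</td><td>0</td><td>0.32</td><td>391.44</td><td>396.90</td><td>356.67</td><td>91.20</td><td>357</td><td>121</td></tr>
<tr><th>LSTAT</th><td>float64</td><td>0</td><td>0</td><td>1.73</td><td>11.36</td><td>37.97</td><td>12.65</td><td>7.13</td><td>455</td><td>3</td></tr>
<tr><th>target</th><td>float64</td><td>0</td><td>0</td><td>5</td><td>21.20</td><td>50</td><td>22.53</td><td>9.19</td><td>229</td><td>16</td></tr>
<tr><th>AgeFlag</th><td>bool</td><td>0</td><td>0</td><td></td><td></td><td></td><td></td><td></td><td>1</td><td>506</td></tr>
<tr><th>Date</th><td>datetime64[ns]</td><td>0</td><td>0</td><td>2008-01-01 13:30:00</td><td></td><td>2008-01-01 13:30:00</td><td></td><td></td><td>1</td><td>506</td></tr>
</tbody>
</table>
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "None"}}, {"output_type": "execute_result", "metadata": {}, "data": {"text/plain": "data-describe Summary Widget"}, "execution_count": 4}], "source": "dd.data_summary(df)"}, {"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:37.269210", "end_time": "2020-11-12T23:04:37.283243", "duration": 0.014033, "status": "completed"}, "tags": []}, "source": "## Display counts as percentage\nTo display the count statistics as a percentage of the total record count, use `as_percentage=True`."}, {"cell_type": "code", "execution_count": 5, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:37.334210Z", "iopub.status.busy": "2020-11-12T23:04:37.331212Z", "iopub.status.idle": "2020-11-12T23:04:37.393603Z", "shell.execute_reply": "2020-11-12T23:04:37.394633Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:37.295243", "end_time": "2020-11-12T23:04:37.394633", "duration": 0.09939, "status": "completed"}, "tags": []}, "outputs": [{"output_type": "display_data", "metadata": {}, "data": {"text/plain": " Info\nRows 506\nColumns 16\nSize in Memory 57.9 KB", "text/html": "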
<table>
<thead>
<tr><th></th><th>Info</th></tr>
</thead>
<tbody>
<tr><th>Rows</th><td>506</td></tr>
<tr><th>Columns</th><td>16</td></tr>
<tr><th>Size in Memory</th><td>57.9 KB</td></tr>
</tbody>
</table>
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "", "text/html": "
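<table>
<thead>
<tr><th></th><th>Data Type</th><th>Nulls</th><th>Zeros</th><th>Min</th><th>Median</th><th>Max</th><th>Mean</th><th>Standard Deviation</th><th>Unique</th><th>Top Frequency</th></tr>
</thead>
<tbody>
<tr><th>CRIM</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>0.0000000063</td><td>0.00000026</td><td>0.000089</td><td>0.0000036</td><td>0.0000086</td><td>504</td><td>0.4%</td></tr>
<tr><th>ZN</th><td>int32</td><td>0.0%</td><td>0.0%</td><td>0</td><td>0</td><td>100</td><td>11.35</td><td>23.29</td><td>26</td><td>73.5%</td></tr>
<tr><th>INDUS</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>0.46</td><td>9.69</td><td>27.74</td><td>11.14</td><td>6.85</td><td>76</td><td>26.1%</td></tr>
<tr><th>CHAS</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>0</td><td>0</td><td>1</td><td>0.069</td><td>0.25</td><td>2</td><td>93.1%</td></tr>
<tr><th>NOX</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>0.39</td><td>0.54</td><td>0.87</td><td>0.55</td><td>0.12</td><td>81</td><td>4.5%</td></tr>
<tr><th>RM</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>3.56</td><td>6.21</td><td>8.78</td><td>6.28</td><td>0.70</td><td>446</td><td>0.6%</td></tr>
<tr><th>AGE</th><td>object</td><td>0.0%</td><td>0.0%</td><td></td><td></td><td></td><td></td><td></td><td>2</td><td>88.1%</td></tr>
<tr><th>DIS</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>1.13</td><td>3.21</td><td>12.13</td><td>3.80</td><td>2.10</td><td>412</td><td>1.0%</td></tr>
<tr><th>RAD</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>1</td><td>5</td><td>24</td><td>9.55</td><td>8.70</td><td>9</td><td>26.1%</td></tr>
<tr><th>TAX</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>187</td><td>330</td><td>711</td><td>408.24</td><td>168.37</td><td>66</td><td>26.1%</td></tr>
<tr><th>PTRATIO</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>12.60</td><td>19.050</td><td>22</td><td>18.46</td><td>2.16</td><td>46</td><td>27.7%</td></tr>
<tr><th>B</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>0.32</td><td>391.44</td><td>396.90</td><td>356.67</td><td>91.20</td><td>357</td><td>23.9%</td></tr>
<tr><th>LSTAT</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>1.73</td><td>11.36</td><td>37.97</td><td>12.65</td><td>7.13</td><td>455</td><td>0.6%</td></tr>
<tr><th>target</th><td>float64</td><td>0.0%</td><td>0.0%</td><td>5</td><td>21.20</td><td>50</td><td>22.53</td><td>9.19</td><td>229</td><td>3.2%</td></tr>
<tr><th>AgeFlag</th><td>bool</td><td>0.0%</td><td>0.0%</td><td></td><td></td><td></td><td></td><td></td><td>1</td><td>100.0%</td></tr>
<tr><th>Date</th><td>datetime64[ns]</td><td>0.0%</td><td>0.0%</td><td>2008-01-01 13:30:00</td><td></td><td>2008-01-01 13:30:00</td><td></td><td></td><td>1</td><td>100.0%</td></tr>
</tbody>
</table>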
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "None"}}, {"output_type": "execute_result", "metadata": {}, "data": {"text/plain": "data-describe Summary Widget"}, "execution_count": 5}], "source": "dd.data_summary(df, as_percentage=True)"}, {"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:37.409598", "end_time": "2020-11-12T23:04:37.426649", "duration": 0.017051, "status": "completed"}, "tags": []}, "source": "## Disable auto float formatting\nIf the formatting logic is not desired, use `auto_float=False`. Depending on your data, there may not be a significant difference in the output."}, {"cell_type": "code", "execution_count": 6, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:37.485601Z", "iopub.status.busy": "2020-11-12T23:04:37.480602Z", "iopub.status.idle": "2020-11-12T23:04:37.539638Z", "shell.execute_reply": "2020-11-12T23:04:37.540627Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:37.442649", "end_time": "2020-11-12T23:04:37.540627", "duration": 0.097978, "status": "completed"}, "tags": []}, "outputs": [{"output_type": "display_data", "metadata": {}, "data": {"text/plain": " Info\nRows 506\nColumns 16\nSize in Memory 57.9 KB", "text/html": "
<table>
<thead>
<tr><th></th><th>Info</th></tr>
</thead>
<tbody>
<tr><th>Rows</th><td>506</td></tr>
<tr><th>Columns</th><td>16</td></tr>
<tr><th>Size in Memory</th><td>57.9 KB</td></tr>
</tbody>
</table>
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": " Data Type Nulls Zeros Min Median \\\nCRIM float64 0 0 6.32e-09 2.5651e-07 \nZN int32 0 0 0 0 \nINDUS float64 0 0 0.46 9.69 \nCHAS float64 0 0 0 0 \nNOX float64 0 0 0.385 0.538 \nRM float64 0 0 3.561 6.2085 \nAGE object 0 0 \nDIS float64 0 0 1.1296 3.20745 \nRAD float64 0 0 1 5 \nTAX float64 0 0 187 330 \nPTRATIO float64 0 0 12.6 19.05 \nB float64 0 0 0.32 391.44 \nLSTAT float64 0 0 1.73 11.36 \ntarget float64 0 0 5 21.2 \nAgeFlag bool 0 0 \nDate datetime64[ns] 0 0 2008-01-01 13:30:00 \n\n Max Mean Standard Deviation Unique \\\nCRIM 8.89762e-05 3.61352e-06 8.59304e-06 504 \nZN 100 11.3478 23.2875 26 \nINDUS 27.74 11.1368 6.85357 76 \nCHAS 1 0.06917 0.253743 2 \nNOX 0.871 0.554695 0.115763 81 \nRM 8.78 6.28463 0.701923 446 \nAGE 2 \nDIS 12.1265 3.79504 2.10363 412 \nRAD 24 9.54941 8.69865 9 \nTAX 711 408.237 168.37 66 \nPTRATIO 22 18.4555 2.16281 46 \nB 396.9 356.674 91.2046 357 \nLSTAT 37.97 12.6531 7.134 455 \ntarget 50 22.5328 9.18801 229 \nAgeFlag 1 \nDate 2008-01-01 13:30:00 1 \n\n Top Frequency \nCRIM 2 \nZN 372 \nINDUS 132 \nCHAS 471 \nNOX 23 \nRM 3 \nAGE 446 \nDIS 5 \nRAD 132 \nTAX 132 \nPTRATIO 140 \nB 121 \nLSTAT 3 \ntarget 16 \nAgeFlag 506 \nDate 506 ", "text/html": "
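<table>
<thead>
<tr><th></th><th>Data Type</th><th>Nulls</th><th>Zeros</th><th>Min</th><th>Median</th><th>Max</th><th>Mean</th><th>Standard Deviation</th><th>Unique</th><th>Top Frequency</th></tr>
</thead>
<tbody>
<tr><th>CRIM</th><td>float64</td><td>0</td><td>0</td><td>6.32e-09</td><td>2.5651e-07</td><td>8.89762e-05</td><td>3.61352e-06</td><td>8.59304e-06</td><td>504</td><td>2</td></tr>
<tr><th>ZN</th><td>int32</td><td>0</td><td>0</td><td>0</td><td>0</td><td>100</td><td>11.3478</td><td>23.2875</td><td>26</td><td>372</td></tr>
<tr><th>INDUS</th><td>float64</td><td>0</td><td>0</td><td>0.46</td><td>9.69</td><td>27.74</td><td>11.1368</td><td>6.85357</td><td>76</td><td>132</td></tr>
<tr><th>CHAS</th><td>float64</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0.06917</td><td>0.253743</td><td>2</td><td>471</td></tr>
<tr><th>NOX</th><td>float64</td><td>0</td><td>0</td><td>0.385</td><td>0.538</td><td>0.871</td><td>0.554695</td><td>0.115763</td><td>81</td><td>23</td></tr>
<tr><th>RM</th><td>float64</td><td>0</td><td>0</td><td>3.561</td><td>6.2085</td><td>8.78</td><td>6.28463</td><td>0.701923</td><td>446</td><td>3</td></tr>
<tr><th>AGE</th><td>object</td><td>0</td><td>0</td><td></td><td></td><td></td><td></td><td></td><td>2</td><td>446</td></tr>
<tr><th>DIS</th><td>float64</td><td>0</td><td>0</td><td>1.1296</td><td>3.20745</td><td>12.1265</td><td>3.79504</td><td>2.10363</td><td>412</td><td>5</td></tr>
<tr><th>RAD</th><td>float64</td><td>0</td><td>0</td><td>1</td><td>5</td><td>24</td><td>9.54941</td><td>8.69865</td><td>9</td><td>132</td></tr>
<tr><th>TAX</th><td>float64</td><td>0</td><td>0</td><td>187</td><td>330</td><td>711</td><td>408.237</td><td>168.37</td><td>66</td><td>132</td></tr>
<tr><th>PTRATIO</th><td>float64</td><td>0</td><td>0</td><td>12.6</td><td>19.05</td><td>22</td><td>18.4555</td><td>2.16281</td><td>46</td><td>140</td></tr>
<tr><th>B</th><td>float64</td><td>0</td><td>0</td><td>0.32</td><td>391.44</td><td>396.9</td><td>356.674</td><td>91.2046</td><td>357</td><td>121</td></tr>
<tr><th>LSTAT</th><td>float64</td><td>0</td><td>0</td><td>1.73</td><td>11.36</td><td>37.97</td><td>12.6531</td><td>7.134</td><td>455</td><td>3</td></tr>
<tr><th>target</th><td>float64</td><td>0</td><td>0</td><td>5</td><td>21.2</td><td>50</td><td>22.5328</td><td>9.18801</td><td>229</td><td>16</td></tr>
<tr><th>AgeFlag</th><td>bool</td><td>0</td><td>0</td><td></td><td></td><td></td><td></td><td></td><td>1</td><td>506</td></tr>
<tr><th>Date</th><td>datetime64[ns]</td><td>0</td><td>0</td><td>2008-01-01 13:30:00</td><td></td><td>2008-01-01 13:30:00</td><td></td><td></td><td>1</td><td>506</td></tr>
</tbody>
</table>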
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "None"}}, {"output_type": "execute_result", "metadata": {}, "data": {"text/plain": "data-describe Summary Widget"}, "execution_count": 6}], "source": "dd.data_summary(df, auto_float=False)"}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"name": "python", "version": "3.7.9", "mimetype": "text/x-python", "codemirror_mode": {"name": "ipython", "version": 3}, "pygments_lexer": "ipython3", "nbconvert_exporter": "python", "file_extension": ".py"}, "papermill": {"duration": 6.325704, "end_time": "2020-11-12T23:04:38.067859", "environment_variables": {}, "exception": null, "input_path": "C:\\workspace\\data-describe\\examples\\Data_Summary.ipynb", "output_path": "C:\\workspace\\data-describe\\examples\\Data_Summary.ipynb", "parameters": {}, "start_time": "2020-11-12T23:04:31.742155", "version": "2.1.2"}}, "nbformat": 4, "nbformat_minor": 4}