{"cells": [{"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:34.119602", "end_time": "2020-11-12T23:04:34.130635", "duration": 0.011033, "status": "completed"}, "tags": []}, "source": "# Data Summary\n\nThe Data Summary feature provides an overview of the data using summary statistics. The output is similar to using `pandas.DataFrame.describe`, however, a different set of statistics are selected to address common questions about the data.\n\n- Data Type: The data type\n- Nulls: The number (count) or percentage of null values. Primarily for identifying missing data.\n- Zeros: The number (count) or percentage of zero values. Zero is commonly used as a special number and may indicate abnormalities.\n- Min, Max: The minimum and maximum values. Used to identify extreme values (outliers).\n- Median, Mean, Standard Deviation: Used to identify skew.\n- Unique: Number of unique values (levels). Used to identify high cardinality.\n- Top Frequency: The number (count) or percentage of values equaling the mode. Used to identify imbalanced data."}, {"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:34.141601", "end_time": "2020-11-12T23:04:34.153597", "duration": 0.011996, "status": "completed"}, "tags": []}, "source": "## Example data"}, {"cell_type": "code", "execution_count": 1, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:34.183598Z", "iopub.status.busy": "2020-11-12T23:04:34.183598Z", "iopub.status.idle": "2020-11-12T23:04:36.807237Z", "shell.execute_reply": "2020-11-12T23:04:36.807237Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:34.165603", "end_time": "2020-11-12T23:04:36.808209", "duration": 2.642606, "status": "completed"}, "tags": []}, "outputs": [], "source": "from datetime import datetime\nimport pandas as pd\nfrom sklearn.datasets import load_boston\n\nimport data_describe as dd"}, {"cell_type": "code", "execution_count": 2, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:36.832236Z", "iopub.status.busy": "2020-11-12T23:04:36.831237Z", "iopub.status.idle": "2020-11-12T23:04:36.866212Z", "shell.execute_reply": "2020-11-12T23:04:36.867236Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:36.816236", "end_time": "2020-11-12T23:04:36.867236", "duration": 0.051, "status": "completed"}, "tags": []}, "outputs": [{"output_type": "execute_result", "metadata": {}, "data": {"text/plain": " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.09 1.0 296.0 15.3 \n\n B LSTAT target \n0 396.9 4.98 24.0 ", "text/html": "
\n\n
\n \n \n | \n CRIM | \n ZN | \n INDUS | \n CHAS | \n NOX | \n RM | \n AGE | \n DIS | \n RAD | \n TAX | \n PTRATIO | \n B | \n LSTAT | \n target | \n
\n \n \n \n 0 | \n 0.00632 | \n 18.0 | \n 2.31 | \n 0.0 | \n 0.538 | \n 6.575 | \n 65.2 | \n 4.09 | \n 1.0 | \n 296.0 | \n 15.3 | \n 396.9 | \n 4.98 | \n 24.0 | \n
\n \n
\n
"}, "execution_count": 2}], "source": "data = load_boston()\ndf = pd.DataFrame(data.data, columns=list(data.feature_names))\ndf['target'] = data.target\ndf.head(1)"}, {"cell_type": "code", "execution_count": 3, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:36.892209Z", "iopub.status.busy": "2020-11-12T23:04:36.892209Z", "iopub.status.idle": "2020-11-12T23:04:36.961237Z", "shell.execute_reply": "2020-11-12T23:04:36.961237Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:36.877211", "end_time": "2020-11-12T23:04:36.962237", "duration": 0.085026, "status": "completed"}, "tags": []}, "outputs": [], "source": "# Change data types to demonstrate data summary\ndf['CRIM'] = df['CRIM'] / 1000000\ndf['AGE'] = df['AGE'].map(lambda x: \"young\" if x < 29 else \"old\")\ndf[\"AgeFlag\"] = df['AGE'].astype(bool)\ndf['ZN'] = df['ZN'].astype(int)\ndf['Date'] = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')"}, {"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:36.972251", "end_time": "2020-11-12T23:04:36.983271", "duration": 0.01102, "status": "completed"}, "tags": []}, "source": "## Default\nThe defaults for `data_summary` attempts to format floats to be easier to read by disabling scientific notation and limiting the number of decimal places shown."}, {"cell_type": "code", "execution_count": 4, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:37.007209Z", "iopub.status.busy": "2020-11-12T23:04:37.006212Z", "iopub.status.idle": "2020-11-12T23:04:37.255243Z", "shell.execute_reply": "2020-11-12T23:04:37.256242Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:36.992237", "end_time": "2020-11-12T23:04:37.256242", "duration": 0.264005, "status": "completed"}, "tags": []}, "outputs": [{"output_type": "display_data", "metadata": {}, "data": {"text/plain": " Info\nRows 506\nColumns 16\nSize in Memory 57.9 KB", "text/html": "\n\n
\n \n \n | \n Info | \n
\n \n \n \n Rows | \n 506 | \n
\n \n Columns | \n 16 | \n
\n \n Size in Memory | \n 57.9 KB | \n
\n \n
\n
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "", "text/html": " | Data Type | Nulls | Zeros | Min | Median | Max | Mean | Standard Deviation | Unique | Top Frequency |
\n \n CRIM | \n float64 | \n 0 | \n 0 | \n 0.0000000063 | \n 0.00000026 | \n 0.000089 | \n 0.0000036 | \n 0.0000086 | \n 504 | \n 2 | \n
\n \n ZN | \n int32 | \n 0 | \n 0 | \n 0 | \n 0 | \n 100 | \n 11.35 | \n 23.29 | \n 26 | \n 372 | \n
\n \n INDUS | \n float64 | \n 0 | \n 0 | \n 0.46 | \n 9.69 | \n 27.74 | \n 11.14 | \n 6.85 | \n 76 | \n 132 | \n
\n \n CHAS | \n float64 | \n 0 | \n 0 | \n 0 | \n 0 | \n 1 | \n 0.069 | \n 0.25 | \n 2 | \n 471 | \n
\n \n NOX | \n float64 | \n 0 | \n 0 | \n 0.39 | \n 0.54 | \n 0.87 | \n 0.55 | \n 0.12 | \n 81 | \n 23 | \n
\n \n RM | \n float64 | \n 0 | \n 0 | \n 3.56 | \n 6.21 | \n 8.78 | \n 6.28 | \n 0.70 | \n 446 | \n 3 | \n
\n \n AGE | \n object | \n 0 | \n 0 | \n | \n | \n | \n | \n | \n 2 | \n 446 | \n
\n \n DIS | \n float64 | \n 0 | \n 0 | \n 1.13 | \n 3.21 | \n 12.13 | \n 3.80 | \n 2.10 | \n 412 | \n 5 | \n
\n \n RAD | \n float64 | \n 0 | \n 0 | \n 1 | \n 5 | \n 24 | \n 9.55 | \n 8.70 | \n 9 | \n 132 | \n
\n \n TAX | \n float64 | \n 0 | \n 0 | \n 187 | \n 330 | \n 711 | \n 408.24 | \n 168.37 | \n 66 | \n 132 | \n
\n \n PTRATIO | \n float64 | \n 0 | \n 0 | \n 12.60 | \n 19.050 | \n 22 | \n 18.46 | \n 2.16 | \n 46 | \n 140 | \n
\n \n B | \n float64 | \n 0 | \n 0 | \n 0.32 | \n 391.44 | \n 396.90 | \n 356.67 | \n 91.20 | \n 357 | \n 121 | \n
\n \n LSTAT | \n float64 | \n 0 | \n 0 | \n 1.73 | \n 11.36 | \n 37.97 | \n 12.65 | \n 7.13 | \n 455 | \n 3 | \n
\n \n target | \n float64 | \n 0 | \n 0 | \n 5 | \n 21.20 | \n 50 | \n 22.53 | \n 9.19 | \n 229 | \n 16 | \n
\n \n AgeFlag | \n bool | \n 0 | \n 0 | \n | \n | \n | \n | \n | \n 1 | \n 506 | \n
\n \n Date | \n datetime64[ns] | \n 0 | \n 0 | \n 2008-01-01 13:30:00 | \n | \n 2008-01-01 13:30:00 | \n | \n | \n 1 | \n 506 | \n
\n
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "None"}}, {"output_type": "execute_result", "metadata": {}, "data": {"text/plain": "data-describe Summary Widget"}, "execution_count": 4}], "source": "dd.data_summary(df)"}, {"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:37.269210", "end_time": "2020-11-12T23:04:37.283243", "duration": 0.014033, "status": "completed"}, "tags": []}, "source": "## Display counts as percentage\nTo display the count statistics as a percentage (over the total record count), use `as_percentage=True`"}, {"cell_type": "code", "execution_count": 5, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:37.334210Z", "iopub.status.busy": "2020-11-12T23:04:37.331212Z", "iopub.status.idle": "2020-11-12T23:04:37.393603Z", "shell.execute_reply": "2020-11-12T23:04:37.394633Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:37.295243", "end_time": "2020-11-12T23:04:37.394633", "duration": 0.09939, "status": "completed"}, "tags": []}, "outputs": [{"output_type": "display_data", "metadata": {}, "data": {"text/plain": " Info\nRows 506\nColumns 16\nSize in Memory 57.9 KB", "text/html": "\n\n
\n \n \n | \n Info | \n
\n \n \n \n Rows | \n 506 | \n
\n \n Columns | \n 16 | \n
\n \n Size in Memory | \n 57.9 KB | \n
\n \n
\n
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "", "text/html": " | Data Type | Nulls | Zeros | Min | Median | Max | Mean | Standard Deviation | Unique | Top Frequency |
\n \n CRIM | \n float64 | \n 0.0% | \n 0.0% | \n 0.0000000063 | \n 0.00000026 | \n 0.000089 | \n 0.0000036 | \n 0.0000086 | \n 504 | \n 0.4% | \n
\n \n ZN | \n int32 | \n 0.0% | \n 0.0% | \n 0 | \n 0 | \n 100 | \n 11.35 | \n 23.29 | \n 26 | \n 73.5% | \n
\n \n INDUS | \n float64 | \n 0.0% | \n 0.0% | \n 0.46 | \n 9.69 | \n 27.74 | \n 11.14 | \n 6.85 | \n 76 | \n 26.1% | \n
\n \n CHAS | \n float64 | \n 0.0% | \n 0.0% | \n 0 | \n 0 | \n 1 | \n 0.069 | \n 0.25 | \n 2 | \n 93.1% | \n
\n \n NOX | \n float64 | \n 0.0% | \n 0.0% | \n 0.39 | \n 0.54 | \n 0.87 | \n 0.55 | \n 0.12 | \n 81 | \n 4.5% | \n
\n \n RM | \n float64 | \n 0.0% | \n 0.0% | \n 3.56 | \n 6.21 | \n 8.78 | \n 6.28 | \n 0.70 | \n 446 | \n 0.6% | \n
\n \n AGE | \n object | \n 0.0% | \n 0.0% | \n | \n | \n | \n | \n | \n 2 | \n 88.1% | \n
\n \n DIS | \n float64 | \n 0.0% | \n 0.0% | \n 1.13 | \n 3.21 | \n 12.13 | \n 3.80 | \n 2.10 | \n 412 | \n 1.0% | \n
\n \n RAD | \n float64 | \n 0.0% | \n 0.0% | \n 1 | \n 5 | \n 24 | \n 9.55 | \n 8.70 | \n 9 | \n 26.1% | \n
\n \n TAX | \n float64 | \n 0.0% | \n 0.0% | \n 187 | \n 330 | \n 711 | \n 408.24 | \n 168.37 | \n 66 | \n 26.1% | \n
\n \n PTRATIO | \n float64 | \n 0.0% | \n 0.0% | \n 12.60 | \n 19.050 | \n 22 | \n 18.46 | \n 2.16 | \n 46 | \n 27.7% | \n
\n \n B | \n float64 | \n 0.0% | \n 0.0% | \n 0.32 | \n 391.44 | \n 396.90 | \n 356.67 | \n 91.20 | \n 357 | \n 23.9% | \n
\n \n LSTAT | \n float64 | \n 0.0% | \n 0.0% | \n 1.73 | \n 11.36 | \n 37.97 | \n 12.65 | \n 7.13 | \n 455 | \n 0.6% | \n
\n \n target | \n float64 | \n 0.0% | \n 0.0% | \n 5 | \n 21.20 | \n 50 | \n 22.53 | \n 9.19 | \n 229 | \n 3.2% | \n
\n \n AgeFlag | \n bool | \n 0.0% | \n 0.0% | \n | \n | \n | \n | \n | \n 1 | \n 100.0% | \n
\n \n Date | \n datetime64[ns] | \n 0.0% | \n 0.0% | \n 2008-01-01 13:30:00 | \n | \n 2008-01-01 13:30:00 | \n | \n | \n 1 | \n 100.0% | \n
\n
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "None"}}, {"output_type": "execute_result", "metadata": {}, "data": {"text/plain": "data-describe Summary Widget"}, "execution_count": 5}], "source": "dd.data_summary(df, as_percentage=True)"}, {"cell_type": "markdown", "metadata": {"papermill": {"exception": false, "start_time": "2020-11-12T23:04:37.409598", "end_time": "2020-11-12T23:04:37.426649", "duration": 0.017051, "status": "completed"}, "tags": []}, "source": "## Disable auto float formatting\nIf the formatting logic is not desired, use `auto_float=False`. Depending on your data, there may not be a significant difference in the output."}, {"cell_type": "code", "execution_count": 6, "metadata": {"execution": {"iopub.execute_input": "2020-11-12T23:04:37.485601Z", "iopub.status.busy": "2020-11-12T23:04:37.480602Z", "iopub.status.idle": "2020-11-12T23:04:37.539638Z", "shell.execute_reply": "2020-11-12T23:04:37.540627Z"}, "papermill": {"exception": false, "start_time": "2020-11-12T23:04:37.442649", "end_time": "2020-11-12T23:04:37.540627", "duration": 0.097978, "status": "completed"}, "tags": []}, "outputs": [{"output_type": "display_data", "metadata": {}, "data": {"text/plain": " Info\nRows 506\nColumns 16\nSize in Memory 57.9 KB", "text/html": "\n\n
\n \n \n | \n Info | \n
\n \n \n \n Rows | \n 506 | \n
\n \n Columns | \n 16 | \n
\n \n Size in Memory | \n 57.9 KB | \n
\n \n
\n
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": " Data Type Nulls Zeros Min Median \\\nCRIM float64 0 0 6.32e-09 2.5651e-07 \nZN int32 0 0 0 0 \nINDUS float64 0 0 0.46 9.69 \nCHAS float64 0 0 0 0 \nNOX float64 0 0 0.385 0.538 \nRM float64 0 0 3.561 6.2085 \nAGE object 0 0 \nDIS float64 0 0 1.1296 3.20745 \nRAD float64 0 0 1 5 \nTAX float64 0 0 187 330 \nPTRATIO float64 0 0 12.6 19.05 \nB float64 0 0 0.32 391.44 \nLSTAT float64 0 0 1.73 11.36 \ntarget float64 0 0 5 21.2 \nAgeFlag bool 0 0 \nDate datetime64[ns] 0 0 2008-01-01 13:30:00 \n\n Max Mean Standard Deviation Unique \\\nCRIM 8.89762e-05 3.61352e-06 8.59304e-06 504 \nZN 100 11.3478 23.2875 26 \nINDUS 27.74 11.1368 6.85357 76 \nCHAS 1 0.06917 0.253743 2 \nNOX 0.871 0.554695 0.115763 81 \nRM 8.78 6.28463 0.701923 446 \nAGE 2 \nDIS 12.1265 3.79504 2.10363 412 \nRAD 24 9.54941 8.69865 9 \nTAX 711 408.237 168.37 66 \nPTRATIO 22 18.4555 2.16281 46 \nB 396.9 356.674 91.2046 357 \nLSTAT 37.97 12.6531 7.134 455 \ntarget 50 22.5328 9.18801 229 \nAgeFlag 1 \nDate 2008-01-01 13:30:00 1 \n\n Top Frequency \nCRIM 2 \nZN 372 \nINDUS 132 \nCHAS 471 \nNOX 23 \nRM 3 \nAGE 446 \nDIS 5 \nRAD 132 \nTAX 132 \nPTRATIO 140 \nB 121 \nLSTAT 3 \ntarget 16 \nAgeFlag 506 \nDate 506 ", "text/html": "\n\n
\n \n \n | \n Data Type | \n Nulls | \n Zeros | \n Min | \n Median | \n Max | \n Mean | \n Standard Deviation | \n Unique | \n Top Frequency | \n
\n \n \n \n CRIM | \n float64 | \n 0 | \n 0 | \n 6.32e-09 | \n 2.5651e-07 | \n 8.89762e-05 | \n 3.61352e-06 | \n 8.59304e-06 | \n 504 | \n 2 | \n
\n \n ZN | \n int32 | \n 0 | \n 0 | \n 0 | \n 0 | \n 100 | \n 11.3478 | \n 23.2875 | \n 26 | \n 372 | \n
\n \n INDUS | \n float64 | \n 0 | \n 0 | \n 0.46 | \n 9.69 | \n 27.74 | \n 11.1368 | \n 6.85357 | \n 76 | \n 132 | \n
\n \n CHAS | \n float64 | \n 0 | \n 0 | \n 0 | \n 0 | \n 1 | \n 0.06917 | \n 0.253743 | \n 2 | \n 471 | \n
\n \n NOX | \n float64 | \n 0 | \n 0 | \n 0.385 | \n 0.538 | \n 0.871 | \n 0.554695 | \n 0.115763 | \n 81 | \n 23 | \n
\n \n RM | \n float64 | \n 0 | \n 0 | \n 3.561 | \n 6.2085 | \n 8.78 | \n 6.28463 | \n 0.701923 | \n 446 | \n 3 | \n
\n \n AGE | \n object | \n 0 | \n 0 | \n | \n | \n | \n | \n | \n 2 | \n 446 | \n
\n \n DIS | \n float64 | \n 0 | \n 0 | \n 1.1296 | \n 3.20745 | \n 12.1265 | \n 3.79504 | \n 2.10363 | \n 412 | \n 5 | \n
\n \n RAD | \n float64 | \n 0 | \n 0 | \n 1 | \n 5 | \n 24 | \n 9.54941 | \n 8.69865 | \n 9 | \n 132 | \n
\n \n TAX | \n float64 | \n 0 | \n 0 | \n 187 | \n 330 | \n 711 | \n 408.237 | \n 168.37 | \n 66 | \n 132 | \n
\n \n PTRATIO | \n float64 | \n 0 | \n 0 | \n 12.6 | \n 19.05 | \n 22 | \n 18.4555 | \n 2.16281 | \n 46 | \n 140 | \n
\n \n B | \n float64 | \n 0 | \n 0 | \n 0.32 | \n 391.44 | \n 396.9 | \n 356.674 | \n 91.2046 | \n 357 | \n 121 | \n
\n \n LSTAT | \n float64 | \n 0 | \n 0 | \n 1.73 | \n 11.36 | \n 37.97 | \n 12.6531 | \n 7.134 | \n 455 | \n 3 | \n
\n \n target | \n float64 | \n 0 | \n 0 | \n 5 | \n 21.2 | \n 50 | \n 22.5328 | \n 9.18801 | \n 229 | \n 16 | \n
\n \n AgeFlag | \n bool | \n 0 | \n 0 | \n | \n | \n | \n | \n | \n 1 | \n 506 | \n
\n \n Date | \n datetime64[ns] | \n 0 | \n 0 | \n 2008-01-01 13:30:00 | \n | \n 2008-01-01 13:30:00 | \n | \n | \n 1 | \n 506 | \n
\n \n
\n
"}}, {"output_type": "display_data", "metadata": {}, "data": {"text/plain": "None"}}, {"output_type": "execute_result", "metadata": {}, "data": {"text/plain": "data-describe Summary Widget"}, "execution_count": 6}], "source": "dd.data_summary(df, auto_float=False)"}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"name": "python", "version": "3.7.9", "mimetype": "text/x-python", "codemirror_mode": {"name": "ipython", "version": 3}, "pygments_lexer": "ipython3", "nbconvert_exporter": "python", "file_extension": ".py"}, "papermill": {"duration": 6.325704, "end_time": "2020-11-12T23:04:38.067859", "environment_variables": {}, "exception": null, "input_path": "C:\\workspace\\data-describe\\examples\\Data_Summary.ipynb", "output_path": "C:\\workspace\\data-describe\\examples\\Data_Summary.ipynb", "parameters": {}, "start_time": "2020-11-12T23:04:31.742155", "version": "2.1.2"}}, "nbformat": 4, "nbformat_minor": 4}