{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quick Start Tutorial"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook walks through an example exploratory data analysis (EDA) using data-describe.\n",
"\n",
"Note: Part of this notebook uses optional dependencies for text analysis. To install them, run `pip install data-describe[nlp]`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import data_describe as dd"
]
},
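{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"This tutorial uses toy datasets from scikit-learn, starting with the Boston housing data. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below require an older scikit-learn release."
]
},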
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_boston\n",
"import pandas as pd"
]
},
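{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Load the Boston housing data and wrap the features in a DataFrame\n",
"dat = load_boston()\n",
"df = pd.DataFrame(dat['data'], columns=dat['feature_names'])\n",
"# Append the target (median home value) as the `price` column\n",
"df['price'] = dat['target']"
]
},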
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \\\n0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 \n1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 \n2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 \n3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 \n4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 \n\n PTRATIO B LSTAT price \n0 15.3 396.90 4.98 24.0 \n1 17.8 396.90 9.14 21.6 \n2 17.8 392.83 4.03 34.7 \n3 18.7 394.63 2.94 33.4 \n4 18.7 396.90 5.33 36.2 "
},
"metadata": {},
"execution_count": 5
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Overview\n",
"\n",
"[Column Descriptions](https://scikit-learn.org/stable/datasets/index.html#boston-dataset):\n",
"\n",
"* `CRIM`: per capita crime rate by town\n",
"* `ZN`: proportion of residential land zoned for lots over 25,000 sq. ft.\n",
"* `INDUS`: proportion of non-retail business acres per town\n",
"* `CHAS`: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n",
"* `NOX`: nitric oxides concentration (parts per 10 million)\n",
"* `RM`: average number of rooms per dwelling\n",
"* `AGE`: proportion of owner-occupied units built prior to 1940\n",
"* `DIS`: weighted distances to five Boston employment centres\n",
"* `RAD`: index of accessibility to radial highways\n",
"* `TAX`: full-value property-tax rate per \\$10,000\n",
"* `PTRATIO`: pupil-teacher ratio by town\n",
"* `B`: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town\n",
"* `LSTAT`: percentage of the population considered lower status\n",
"* `MEDV`: median value of owner-occupied homes in \\$1000s (stored as the `price` column in this notebook)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "(506, 14)"
},
"metadata": {},
"execution_count": 6
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we inspect overall statistics for each column. A few things stand out:\n",
"- `CHAS` is zero in 471 of 506 records (93%)\n",
"- `ZN` is also mostly zeros (372 of 506 records, about 74%)\n",
"- The mean of `TAX` (408.2) is well above its median (330), which suggests a right-skewed distribution"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " CRIM ZN INDUS CHAS NOX \\\nData Type float64 float64 float64 float64 float64 \nMean 3.61352 11.3636 11.1368 0.06917 0.554695 \nStandard Deviation 8.60155 23.3225 6.86035 0.253994 0.115878 \nMedian 0.25651 0 9.69 0 0.538 \nMin 0.00632 0 0.46 0 0.385 \nMax 88.9762 100 27.74 1 0.871 \n# Zeros 0 372 0 471 0 \n# Nulls 0 0 0 0 0 \n% Most Frequent Value 0.4 73.52 26.09 93.08 4.55 \n\n RM AGE DIS RAD TAX PTRATIO \\\nData Type float64 float64 float64 float64 float64 float64 \nMean 6.28463 68.5749 3.79504 9.54941 408.237 18.4555 \nStandard Deviation 0.702617 28.1489 2.10571 8.70726 168.537 2.16495 \nMedian 6.2085 77.5 3.20745 5 330 19.05 \nMin 3.561 2.9 1.1296 1 187 12.6 \nMax 8.78 100 12.1265 24 711 22 \n# Zeros 0 0 0 0 0 0 \n# Nulls 0 0 0 0 0 0 \n% Most Frequent Value 0.59 8.5 0.99 26.09 26.09 27.67 \n\n B LSTAT price \nData Type float64 float64 float64 \nMean 356.674 12.6531 22.5328 \nStandard Deviation 91.2949 7.14106 9.1971 \nMedian 391.44 11.36 21.2 \nMin 0.32 1.73 5 \nMax 396.9 37.97 50 \n# Zeros 0 0 0 \n# Nulls 0 0 0 \n% Most Frequent Value 23.91 0.59 3.16 "
},
"metadata": {},
"execution_count": 7
}
],
"source": [
"dd.data_summary(df)"
]
},
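{
"cell_type": "markdown",
"metadata": {},
"source": [
"The observations listed before the summary table can be cross-checked with plain pandas. The snippet below is a minimal sketch, not part of data-describe: `quick_checks` and the `toy` frame are hypothetical names introduced for illustration. It reports the share of zeros per column and compares the mean with the median, where a mean well above the median hints at right skew.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def quick_checks(frame):\n",
"    # Hypothetical helper: per-column share of zeros, mean, median and skewness\n",
"    numeric = frame.select_dtypes('number')\n",
"    return pd.DataFrame({\n",
"        'pct_zeros': (numeric == 0).mean() * 100,\n",
"        'mean': numeric.mean(),\n",
"        'median': numeric.median(),\n",
"        'skew': numeric.skew(),\n",
"    })\n",
"\n",
"\n",
"# Made-up toy values so the sketch runs without the Boston data\n",
"toy = pd.DataFrame({'flag': [0, 0, 0, 1], 'rate': [1.0, 2.0, 3.0, 10.0]})\n",
"print(quick_checks(toy))\n",
"```\n",
"\n",
"With the notebook's data, `quick_checks(df).loc[['CHAS', 'ZN', 'TAX']]` would cover the columns discussed above."
]
},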
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also look at a visual representation of the data as a heatmap:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "<AxesSubplot:title={'center':'Data Heatmap'}, xlabel='Record #', ylabel='Variable'>"
},
"metadata": {},
"execution_count": 8
},
{
"output_type": "display_data",
"data": {
"text/plain": "<Figure size 720x720 with 2 Axes>",
"image/svg+xml": "\r\n\r\n\r\n