Quick Start Tutorial¶
This notebook walks through an example exploratory data analysis (EDA) using data-describe.
Note: Parts of this notebook use optional dependencies for text analysis. To install them, run pip install "data-describe[nlp]" (the quotes keep some shells from interpreting the brackets).
[2]:
import data_describe as dd
Data¶
This tutorial uses toy datasets from scikit-learn. (Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions, substitute another regression dataset such as fetch_california_housing.)
[3]:
from sklearn.datasets import load_boston
import pandas as pd
[4]:
dat = load_boston()
df = pd.DataFrame(dat['data'], columns=dat['feature_names'])
df['price'] = dat['target']
[5]:
df.head()
[5]:
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
Data Overview¶
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000's (loaded above as price)
[6]:
df.shape
[6]:
(506, 14)
First we inspect some of the overall statistics of the data. A few interesting things to note:

- 93% of CHAS values are the same value, zero.
- ZN also has a high proportion of zeros.
- The mean of TAX is substantially higher than its median, suggesting the distribution is right-skewed.
[7]:
dd.data_summary(df)
[7]:
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Data Type | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
| Mean | 3.61352 | 11.3636 | 11.1368 | 0.06917 | 0.554695 | 6.28463 | 68.5749 | 3.79504 | 9.54941 | 408.237 | 18.4555 | 356.674 | 12.6531 | 22.5328 |
| Standard Deviation | 8.60155 | 23.3225 | 6.86035 | 0.253994 | 0.115878 | 0.702617 | 28.1489 | 2.10571 | 8.70726 | 168.537 | 2.16495 | 91.2949 | 7.14106 | 9.1971 |
| Median | 0.25651 | 0 | 9.69 | 0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5 | 330 | 19.05 | 391.44 | 11.36 | 21.2 |
| Min | 0.00632 | 0 | 0.46 | 0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1 | 187 | 12.6 | 0.32 | 1.73 | 5 |
| Max | 88.9762 | 100 | 27.74 | 1 | 0.871 | 8.78 | 100 | 12.1265 | 24 | 711 | 22 | 396.9 | 37.97 | 50 |
| # Zeros | 0 | 372 | 0 | 471 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| # Nulls | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| % Most Frequent Value | 0.4 | 73.52 | 26.09 | 93.08 | 4.55 | 0.59 | 8.5 | 0.99 | 26.09 | 26.09 | 27.67 | 23.91 | 0.59 | 3.16 |
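These figures are easy to cross-check with plain pandas, outside of data-describe:

[ ]:
# Verify a few of the summary's claims directly
print((df['CHAS'] == 0).mean())               # ~0.93: 93% of CHAS values are zero
print((df['ZN'] == 0).mean())                 # ~0.74: ZN is also mostly zero
print(df['TAX'].mean() - df['TAX'].median())  # positive gap suggests right skew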
We can also look at a visual representation of the data as a heatmap:
[8]:
dd.data_heatmap(df)
[8]:
<AxesSubplot:title={'center':'Data Heatmap'}, xlabel='Record #', ylabel='Variable'>
There are sections of the data in which some columns hold exactly the same value across many consecutive records. For example, RAD = 1.661245 (the heatmap displays standardized values) from record 356 through record 487. Similar patterns appear for INDUS and TAX. Is this a sorting issue, or is something else going on? Additional investigation into how the data was collected may answer these questions.
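If this reading is right, each of those columns should hold a single raw value over that range. A quick check in plain pandas:

[ ]:
# Inspect the suspicious block of records directly
block = df.loc[356:487, ['RAD', 'INDUS', 'TAX']]
print(block.nunique())  # expect 1 unique value per column
print(block.iloc[0])    # the constant raw values themselves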
We can also look at the correlations:
[9]:
dd.correlation_matrix(df)
<AxesSubplot:title={'center':'Correlation Matrix'}>
[9]:
<data_describe.core.correlation.CorrelationMatrixWidget at 0x23d777a9d88>
Features like AGE and DIS appear to be inversely correlated. CHAS doesn't appear to have a strong correlation with any other feature.
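Both observations are easy to confirm numerically with plain pandas:

[ ]:
# Spot-check the correlations called out above
print(df['AGE'].corr(df['DIS']))                   # strongly negative
print(df.corr()['CHAS'].drop('CHAS').abs().max())  # small: CHAS is weakly correlated at best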
It might also help to re-order the features for comparisons using the cluster argument.
[10]:
dd.correlation_matrix(df, cluster=True)
<AxesSubplot:title={'center':'Correlation Matrix'}>
[10]:
<data_describe.core.correlation.CorrelationMatrixWidget at 0x23d35e2fe48>
From this plot we can observe two groups of inversely related features: PTRATIO to NOX, and B to DIS.
Data Inspection¶
We can also do some more detailed inspection of individual features.
We can show histograms and violin plots of each numeric feature using the dd.distribution function.
[11]:
from IPython.display import display
# display is used to show plots from inside a loop
# Build the widget once, then plot each column from it
dist = dd.distribution(df, plot_all=True)
for col in df.columns:
    display(dist.plot_distribution(col))
We can also look at bivariate distributions using scatter plots. In addition to plotting all pairs in a scatter plot matrix, we can filter the pairs shown by setting a threshold on scatter plot diagnostics, as in the Outlier example below.
[12]:
dd.scatter_plots(df, plot_mode='matrix')
[12]:
<seaborn.axisgrid.PairGrid at 0x23d37980648>
[13]:
dd.scatter_plots(df, threshold={'Outlier': 0.9})
[13]:
<seaborn.axisgrid.PairGrid at 0x23d7fd716c8>
Advanced Analysis¶
In addition to the general-purpose plots above, we can also apply the more advanced analyses shown below.
Cluster Analysis¶
What segments or groups are present in the data?
[14]:
dd.cluster(df)
<AxesSubplot:title={'center':'kmeans Cluster'}, xlabel='Component 1 (47.0% variance explained)', ylabel='Component 2 (12.0% variance explained)'>
[14]:
<data_describe.core.clustering.KmeansClusterWidget at 0x23d8530c4c8>
From this plot, there do not appear to be strongly distinct clusters in the data.
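The plot title and axis labels suggest what dd.cluster is doing: k-means clustering, visualized on the first two principal components. Here is a rough scikit-learn sketch of the same idea, for readers who want to tune the pieces themselves. This is an illustrative approximation, not data-describe's exact implementation; the scaling step and n_clusters=2 are assumptions made here.

[ ]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Standardize, cluster, then project to 2-D for plotting
X = StandardScaler().fit_transform(df)
labels = KMeans(n_clusters=2, random_state=42).fit_predict(X)  # k chosen arbitrarily
coords = PCA(n_components=2).fit_transform(X)
# A scatter of coords[:, 0] vs coords[:, 1], colored by labels,
# approximates the widget's plot above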
Feature Importance¶
Which features are most predictive of price? We use a random forest as the baseline model for estimating feature importance.
[15]:
from sklearn.ensemble import RandomForestRegressor
[16]:
dd.importance(df, 'price', estimator=RandomForestRegressor(random_state=42))
[16]:
Text(0.5, 1.0, 'Feature Importance')
It appears that LSTAT and RM are the most important features for predicting price.
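As a cross-check (independent of how dd.importance computes its scores), the fitted estimator's own impurity-based importances should point the same way:

[ ]:
# Rank features by the random forest's built-in importance scores
X, y = df.drop(columns='price'), df['price']
rf = RandomForestRegressor(random_state=42).fit(X, y)
print(pd.Series(rf.feature_importances_, index=X.columns)
        .sort_values(ascending=False).head())  # expect LSTAT and RM on top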
Topic Modeling¶
Since the Boston housing data set does not contain textual features, the 20 newsgroups text dataset is used to demonstrate the Topic Modeling widget.
[17]:
from sklearn.datasets import fetch_20newsgroups
[18]:
dat = fetch_20newsgroups(subset='test')
df2 = pd.DataFrame({'text': dat['data']})
df2 = df2.sample(150)
[19]:
df2.head()
[19]:
| | text |
|---|---|
| 3655 | From: glk9533@tm0006.lerc.nasa.gov (Greg L. Ki... |
| 4048 | From: mlee@post.RoyalRoads.ca (Malcolm Lee)\nS... |
| 2079 | From: whoughto@diana.cair.du.edu (Wes Houghton... |
| 1629 | From: keng@den.mmc.com (Ken Garrido)\nSubject:... |
| 6135 | From: as010b@uhura.cc.rochester.edu (Tree of S... |
Text preprocessing can be applied before topic modeling to improve the quality of the topics.
[20]:
from data_describe.text.text_preprocessing import preprocess_texts, bag_of_words_to_docs
processed = preprocess_texts(df2['text'])
text = bag_of_words_to_docs(processed)
[21]:
from data_describe.text.topic_modeling import topic_model
[22]:
lda_model = topic_model(text, num_topics=3)
lda_model
| | Topic 1 | Topic 1 Coefficient Value | Topic 2 | Topic 2 Coefficient Value | Topic 3 | Topic 3 Coefficient Value |
|---|---|---|---|---|---|---|
| Term 1 | people | 0.033 | bit | 0.023 | paul | 0.028 |
| Term 2 | may | 0.022 | paul | 0.021 | use | 0.026 |
| Term 3 | right | 0.022 | two | 0.017 | much | 0.024 |
| Term 4 | even | 0.021 | really | 0.016 | may | 0.020 |
| Term 5 | many | 0.017 | well | 0.015 | good | 0.015 |
| Term 6 | said | 0.016 | use | 0.014 | even | 0.015 |
| Term 7 | say | 0.016 | need | 0.014 | world | 0.014 |
| Term 8 | good | 0.014 | problem | 0.014 | thanks | 0.014 |
| Term 9 | world | 0.014 | please | 0.014 | since | 0.014 |
| Term 10 | use | 0.014 | case | 0.013 | using | 0.014 |
[22]:
<data_describe.text.topic_modeling.TopicModelWidget at 0x23d8a068708>