Topic Modeling

[1]:
import pandas as pd
[2]:
from data_describe.text.topic_modeling import topic_model
[3]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
[4]:
df = pd.DataFrame({"text": newsgroups_train['data']})
[5]:
df.head()
[5]:
text
0 From: darice@yoyo.cc.monash.edu.au (Fred Rice)...
1 From: chrisb@tafe.sa.edu.au (Chris BELL)\nSubj...
2 Subject: Re: The Inimitable Rushdie\nFrom: kma...
3 From: timmbake@mcl.ucsb.edu (Bake Timmons)\nSu...
4 From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro...

Explicitly providing the number of topics

[6]:
lda_model = topic_model(df.text, num_topics=2)
lda_model
         Topic 1   Topic 1 Coefficient Value  Topic 2   Topic 2 Coefficient Value
Term 1   |>        0.025                      >>        0.014
Term 2   :         0.009                      :         0.010
Term 3   -         0.005                      God       0.006
Term 4   God       0.005                      |         0.004
Term 5   much      0.004                      them      0.004
Term 6   those     0.004                      had       0.004
Term 7   way       0.003                      |>        0.004
Term 8   time      0.003                      -         0.004
Term 9   atheists  0.003                      these     0.004
Term 10  may       0.003                      atheists  0.004
[6]:
<data_describe.text.topic_modeling.TopicModelWidget at 0x1948779a948>
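The top terms above are dominated by quoting markers and stray punctuation (`|>`, `>>`, `:`, `-`) rather than content words, which suggests the raw posts could use some cleaning before modeling. The sketch below is one possible way to do that in plain Python. `clean_post` is a hypothetical helper written for this example and is not part of data_describe; the marker patterns it strips are an assumption based only on the tokens visible in the table.

```python
import re

# Hypothetical helper for illustration; not part of data_describe.
# Matches one or more leading quote markers ("|>", ">", "|") at the start
# of a line, together with the whitespace around them.
_QUOTE_PREFIX = re.compile(r"^(?:\s*(?:\|>|>|\|))+\s*")


def clean_post(text: str) -> str:
    """Strip leading quote markers and drop punctuation-only tokens."""
    cleaned_lines = []
    for line in text.splitlines():
        # Remove markers such as ">> |> " from the start of the line.
        line = _QUOTE_PREFIX.sub("", line)
        # Keep only tokens that contain at least one letter or digit.
        tokens = [tok for tok in line.split() if any(ch.isalnum() for ch in tok)]
        if tokens:
            cleaned_lines.append(" ".join(tokens))
    return "\n".join(cleaned_lines)


sample = ">> |> God exists : maybe - yes\n|> \nplain line"
cleaned = clean_post(sample)
```

With a helper like this, the column could be cleaned with `df["text"] = df["text"].map(clean_post)` before calling `topic_model`. Alternatively, scikit-learn's `fetch_20newsgroups` accepts `remove=('headers', 'footers', 'quotes')`, which strips much of this material at load time.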

Guessing the optimal number of topics and showing the elbow plot

[7]:
lda_model = topic_model(df.text, num_topics=None)
lda_model
         Topic 1   Topic 1 Coefficient Value  Topic 2   Topic 2 Coefficient Value
Term 1   >>        0.008                      God       0.010
Term 2   had       0.006                      |         0.008
Term 3   |         0.005                      -         0.007
Term 4   -         0.005                      atheists  0.005
Term 5   God       0.005                      those     0.005
Term 6   evidence  0.004                      For       0.005
Term 7   time      0.004                      Islam     0.004
Term 8   it.       0.004                      religion  0.004
Term 9   We        0.004                      >>        0.004
Term 10  |>        0.004                      way       0.004

         Topic 3   Topic 3 Coefficient Value  Topic 4    Topic 4 Coefficient Value
Term 1   :         0.031                      >>         0.013
Term 2   -         0.007                      God        0.008
Term 3   God       0.006                      atheists   0.005
Term 4   also      0.005                      those      0.005
Term 5   Islamic   0.005                      moral      0.005
Term 6   our       0.004                      religious  0.005
Term 7   A         0.004                      -          0.004
Term 8   may       0.004                      our        0.004
Term 9   Islam     0.004                      way        0.004
Term 10  much      0.003                      may        0.004

         Topic 5                Topic 5 Coefficient Value  Topic 6    Topic 6 Coefficient Value
Term 1   |>                     0.021                      >>         0.010
Term 2   system                 0.006                      evidence   0.005
Term 3   Schneider)             0.006                      am         0.005
Term 4   keith@cco.caltech.edu  0.006                      objective  0.004
Term 5   am                     0.005                      world      0.004
Term 6   Allan                  0.005                      read       0.004
Term 7   Institute              0.004                      take       0.004
Term 8   objective              0.004                      |>         0.004
Term 9   these                  0.004                      almost     0.004
Term 10  keith                  0.004                      atheists   0.003

         Topic 7                      Topic 7 Coefficient Value  Topic 8   Topic 8 Coefficient Value
Term 1   |>                           0.071                      >>        0.026
Term 2   :                            0.016                      Jesus     0.007
Term 3   >>                           0.007                      God       0.006
Term 4   Livesey)                     0.007                      them      0.006
Term 5   (Jon                         0.006                      things    0.006
Term 6   livesey@solntze.wpd.sgi.com  0.006                      atheists  0.005
Term 7   God                          0.005                      evidence  0.004
Term 8   them                         0.004                      it.       0.004
Term 9   moral                        0.004                      -         0.004
Term 10  A                            0.004                      They      0.004
[7]:
<data_describe.text.topic_modeling.TopicModelWidget at 0x194888dedc8>
[8]:
lda_model.elbow_plot()
[8]:
<AxesSubplot:title={'center':'Coherence Values Across Topic Numbers'}, xlabel='Number of Topics', ylabel='Coherence Values'>
[Figure: elbow plot, "Coherence Values Across Topic Numbers" (x-axis: Number of Topics, y-axis: Coherence Values)]
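The elbow plot charts a coherence value for each candidate number of topics. How data_describe turns those values into its final guess is not shown in this example, so the following is only a sketch of one common heuristic: pick the topic count with the highest coherence, preferring the smaller count on ties. The function name `pick_num_topics` and the scores below are made up for illustration and do not come from the run above.

```python
from typing import Mapping


def pick_num_topics(coherence_by_k: Mapping[int, float]) -> int:
    """Return the topic count with the highest coherence.

    Ties are broken in favour of the smaller number of topics.
    This is a generic heuristic, not data_describe's documented behaviour.
    """
    if not coherence_by_k:
        raise ValueError("no coherence values supplied")
    # Sort key: highest coherence first, then smallest k.
    return min(coherence_by_k, key=lambda k: (-coherence_by_k[k], k))


# Made-up coherence values, keyed by number of topics (illustration only).
example_scores = {2: 0.31, 4: 0.36, 6: 0.41, 8: 0.44, 10: 0.42}
best_k = pick_num_topics(example_scores)
```

A chosen value can then be passed back explicitly, for example `topic_model(df.text, num_topics=best_k)`, which uses the same `num_topics` argument as the earlier cell.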