Topic Modeling
[1]:
import pandas as pd
[2]:
from data_describe.text.topic_modeling import topic_model
[3]:
from sklearn.datasets import fetch_20newsgroups
# Load only the alt.atheism posts from the training split of 20 Newsgroups
categories = ['alt.atheism']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
[4]:
df = pd.DataFrame({"text": newsgroups_train['data']})
[5]:
df.head()
[5]:
 | text |
---|---|
0 | From: darice@yoyo.cc.monash.edu.au (Fred Rice)... |
1 | From: chrisb@tafe.sa.edu.au (Chris BELL)\nSubj... |
2 | Subject: Re: The Inimitable Rushdie\nFrom: kma... |
3 | From: timmbake@mcl.ucsb.edu (Bake Timmons)\nSu... |
4 | From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro... |
Explicitly providing the number of topics
[6]:
# Request exactly two topics
lda_model = topic_model(df.text, num_topics=2)
lda_model
 | Topic 1 | Topic 1 Coefficient Value | Topic 2 | Topic 2 Coefficient Value |
---|---|---|---|---|
Term 1 | \|> | 0.025 | >> | 0.014 |
Term 2 | : | 0.009 | : | 0.010 |
Term 3 | - | 0.005 | God | 0.006 |
Term 4 | God | 0.005 | \| | 0.004 |
Term 5 | much | 0.004 | them | 0.004 |
Term 6 | those | 0.004 | had | 0.004 |
Term 7 | way | 0.003 | \|> | 0.004 |
Term 8 | time | 0.003 | - | 0.004 |
Term 9 | atheists | 0.003 | these | 0.004 |
Term 10 | may | 0.003 | atheists | 0.004 |
[6]:
<data_describe.text.topic_modeling.TopicModelWidget at 0x1948779a948>
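Several of the top terms in the output are quoting markers and bare punctuation (`>>`, `|>`, `:`, `-`), which suggests the posts were modeled as raw text. The following is a minimal sketch of one possible cleanup step, not something data_describe is known to provide: `strip_quote_noise` is a hypothetical helper, and the rule it applies (drop lines that start with `>` or `|>`, then drop tokens with no letters or digits) is an assumption about what counts as noise here.

```python
import re

# Hypothetical helper, not part of data_describe.
# Matches lines that begin with a quoting marker such as ">", ">>" or "|>".
_QUOTE_LINE = re.compile(r"^\s*(?:\|>|>)")


def strip_quote_noise(post: str) -> str:
    """Drop quoted lines, then drop tokens that contain no letters or digits."""
    kept_lines = [
        line for line in post.splitlines() if not _QUOTE_LINE.match(line)
    ]
    tokens = " ".join(kept_lines).split()
    # Keep only tokens with at least one alphanumeric character.
    return " ".join(tok for tok in tokens if any(ch.isalnum() for ch in tok))
```

The cleaned text could then be passed in place of the raw column, for example `topic_model(df.text.map(strip_quote_noise), num_topics=2)`. Dropping quoted lines also discards the quoted content itself, so this trades some text for less marker noise. Alternatively, `fetch_20newsgroups` accepts `remove=('headers', 'footers', 'quotes')` to strip those parts when the data is loaded.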
Estimating the optimal number of topics and showing the elbow plot
[7]:
# num_topics=None lets topic_model estimate the number of topics
lda_model = topic_model(df.text, num_topics=None)
lda_model
 | Topic 1 | Topic 1 Coefficient Value | Topic 2 | Topic 2 Coefficient Value | Topic 3 | Topic 3 Coefficient Value | Topic 4 | Topic 4 Coefficient Value | Topic 5 | Topic 5 Coefficient Value | Topic 6 | Topic 6 Coefficient Value | Topic 7 | Topic 7 Coefficient Value | Topic 8 | Topic 8 Coefficient Value |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Term 1 | >> | 0.008 | God | 0.010 | : | 0.031 | >> | 0.013 | \|> | 0.021 | >> | 0.010 | \|> | 0.071 | >> | 0.026 |
Term 2 | had | 0.006 | \| | 0.008 | - | 0.007 | God | 0.008 | system | 0.006 | evidence | 0.005 | : | 0.016 | Jesus | 0.007 |
Term 3 | \| | 0.005 | - | 0.007 | God | 0.006 | atheists | 0.005 | Schneider) | 0.006 | am | 0.005 | >> | 0.007 | God | 0.006 |
Term 4 | - | 0.005 | atheists | 0.005 | also | 0.005 | those | 0.005 | keith@cco.caltech.edu | 0.006 | objective | 0.004 | Livesey) | 0.007 | them | 0.006 |
Term 5 | God | 0.005 | those | 0.005 | Islamic | 0.005 | moral | 0.005 | am | 0.005 | world | 0.004 | (Jon | 0.006 | things | 0.006 |
Term 6 | evidence | 0.004 | For | 0.005 | our | 0.004 | religious | 0.005 | Allan | 0.005 | read | 0.004 | livesey@solntze.wpd.sgi.com | 0.006 | atheists | 0.005 |
Term 7 | time | 0.004 | Islam | 0.004 | A | 0.004 | - | 0.004 | Institute | 0.004 | take | 0.004 | God | 0.005 | evidence | 0.004 |
Term 8 | it. | 0.004 | religion | 0.004 | may | 0.004 | our | 0.004 | objective | 0.004 | \|> | 0.004 | them | 0.004 | it. | 0.004 |
Term 9 | We | 0.004 | >> | 0.004 | Islam | 0.004 | way | 0.004 | these | 0.004 | almost | 0.004 | moral | 0.004 | - | 0.004 |
Term 10 | \|> | 0.004 | way | 0.004 | much | 0.003 | may | 0.004 | keith | 0.004 | atheists | 0.003 | A | 0.004 | They | 0.004 |
[7]:
<data_describe.text.topic_modeling.TopicModelWidget at 0x194888dedc8>
[8]:
lda_model.elbow_plot()
[8]:
<AxesSubplot:title={'center':'Coherence Values Across Topic Numbers'}, xlabel='Number of Topics', ylabel='Coherence Values'>
<Figure size 720x720 with 0 Axes>
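The elbow plot shows a coherence value for each candidate number of topics. The notebook does not say how `topic_model` turns those values into a choice, so the sketch below only illustrates one common rule, assumed here: take the topic count with the highest coherence and break ties toward the smaller count. `pick_num_topics` is a hypothetical helper, and the scores in the example are made up for illustration; they are not taken from the run above.

```python
from typing import Dict


def pick_num_topics(coherence_by_k: Dict[int, float]) -> int:
    """Return the topic count with the highest coherence.

    Ties are broken in favor of the smaller topic count.
    This selection rule is an assumption, not data_describe's documented behavior.
    """
    if not coherence_by_k:
        raise ValueError("need at least one coherence score")
    # Sort key: highest coherence first, then smallest topic count.
    return min(coherence_by_k, key=lambda k: (-coherence_by_k[k], k))


# Illustrative, made-up scores keyed by number of topics.
example_scores = {2: 0.31, 4: 0.36, 6: 0.40, 8: 0.38}
print(pick_num_topics(example_scores))  # prints 6
```

A strict maximum can favor large topic counts when the curve keeps creeping upward, which is why the plot is usually read for the point where gains level off rather than for the peak alone.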