data_describe.text.topic_modeling
Topic modeling.
-
data_describe.text.topic_modeling.gensim
-
data_describe.text.topic_modeling.topic_model(text_docs: List[str], model_type: str = 'LDA', num_topics: Optional[int] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)
Topic modeling.
Unsupervised methods of identifying topics in documents.
- Parameters
text_docs – A list of text documents in string format. These documents should generally be pre-processed.
model_type – {'LDA', 'LSA', 'LSI', 'SVD', 'NMF'} The type of model/algorithm to use.
num_topics – The number of topics for the model. If None, the number is chosen by optimizing coherence values.
min_topics – Smallest number of topics to try when num_topics is not provided. Default is 2.
max_topics – Largest number of topics to try when num_topics is not provided. Default is 10.
no_below – Minimum number of documents a word must appear in to be used in training. Default is 10.
no_above – Maximum proportion of documents a word may appear in to be used in training. Default is 0.2.
tfidf – If True, the model is created from a TF-IDF matrix. Otherwise, a document-term matrix of word counts is used. Default is True.
model_kwargs – Keyword arguments for the model; these should be consistent with model_type.
- Returns
Topic model widget.
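The effect of no_below and no_above can be sketched in plain Python. This is only an illustration of the filtering rule as the parameter descriptions state it, not the library's implementation: the helper name `filter_vocabulary`, the pre-tokenized input, and the inclusive treatment of both bounds are assumptions.

```python
from collections import Counter
from typing import List, Set


def filter_vocabulary(tokenized_docs: List[List[str]], no_below: int = 10,
                      no_above: float = 0.2) -> Set[str]:
    """Hypothetical helper: keep words whose document frequency is within both bounds."""
    n_docs = len(tokenized_docs)
    # Count each word once per document (document frequency, not total count).
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {
        word
        for word, df in doc_freq.items()
        # Assumption: both bounds are inclusive.
        if df >= no_below and df / n_docs <= no_above
    }


docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"], ["cat", "dog"], ["the", "cat"]]
kept = filter_vocabulary(docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Under these assumptions, the defaults (no_below=10, no_above=0.2) can only be satisfied together by a corpus of at least 50 documents, so smaller corpora may need both values adjusted.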
-
class data_describe.text.topic_modeling.TopicModelWidget(model_type: str = 'LDA', num_topics: Optional[int] = None, model_kwargs: Optional[Dict] = None)
Bases: data_describe._widget.BaseWidget
Create topic model widget.
-
property model
Trained topic model.
-
property model_type
Type of model that either has already been trained or will be trained.
-
property num_topics
The number of topics in the model.
-
property coherence_values
A list of coherence values, one for each number of topics from min_topics to max_topics.
-
property dictionary
A Gensim dictionary mapping the words from the documents to their token_ids.
-
property corpus
Bag-of-words (BoW) representation of the documents, as (token_id, token_count) pairs.
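The relationship between a word-to-token_id dictionary and a bag-of-words corpus can be illustrated with a small standard-library sketch. The helper names `build_dictionary` and `to_bow` are hypothetical, and assigning ids in order of first appearance is an assumption; Gensim's own dictionary may number tokens differently.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_dictionary(tokenized_docs: List[List[str]]) -> Dict[str, int]:
    """Hypothetical helper: give each distinct word an integer token_id."""
    token_ids: Dict[str, int] = {}
    for doc in tokenized_docs:
        for word in doc:
            # Assumption: ids are assigned in order of first appearance.
            token_ids.setdefault(word, len(token_ids))
    return token_ids


def to_bow(doc: List[str], token_ids: Dict[str, int]) -> List[Tuple[int, int]]:
    """Convert one document to sorted (token_id, token_count) pairs."""
    counts = Counter(token_ids[word] for word in doc if word in token_ids)
    return sorted(counts.items())


docs = [["topic", "model", "topic"], ["model", "corpus"]]
token_ids = build_dictionary(docs)
corpus = [to_bow(doc, token_ids) for doc in docs]
print(token_ids)
print(corpus)
```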
-
property matrix
Either a TF-IDF matrix or a document-term matrix, with documents as rows and words as columns.
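The difference between the two matrix types can be sketched as follows. The weighting shown, count times log(N / document frequency), is one common TF-IDF variant chosen for illustration; the exact formula and normalization the library applies are not stated here, and both function names are hypothetical.

```python
import math
from typing import List


def document_term_matrix(tokenized_docs: List[List[str]], vocab: List[str]) -> List[List[int]]:
    """Rows are documents, columns are words, cells are raw word counts."""
    return [[doc.count(word) for word in vocab] for doc in tokenized_docs]


def tfidf_matrix(counts: List[List[int]]) -> List[List[float]]:
    """Reweight counts by log(N / document frequency); an assumed TF-IDF variant."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Number of documents in which each word appears at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) if doc_freq[j] else 0.0
         for j in range(n_terms)]
        for row in counts
    ]


docs = [["cat", "cat", "dog"], ["dog", "bird"]]
vocab = ["bird", "cat", "dog"]
counts = document_term_matrix(docs, vocab)
weights = tfidf_matrix(counts)
print(counts)
print(weights)
```

In this toy corpus "dog" appears in every document, so its weight drops to zero, while "cat" keeps a positive weight in the first document.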
-
property min_topics
If num_topics is None, the smallest number of topics for which a model is trained.
-
property max_topics
If num_topics is None, the largest number of topics for which a model is trained.
-
show(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)
Displays most relevant terms for each topic.
- Parameters
num_topic_words – The number of words to be displayed for each topic. Default is 10
topic_names – A list of pre-defined names set for each of the topics. Default is None
- Returns
Pandas DataFrame displaying topics as columns and their relevant terms as rows. LDA/LSI models will display an extra column to the right of each topic column, showing each term’s corresponding coefficient value.
- Return type
display_topics_df
-
fit(self, text_docs: List[str], model_type: Optional[str] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)
Trains topic model and assigns model to object as attribute.
- Parameters
text_docs – A list of text documents in string format. These documents should generally be pre-processed.
model_type – {'LDA', 'LSA', 'LSI', 'SVD', 'NMF'} The type of model/algorithm to use.
min_topics – Smallest number of topics to try when num_topics is not provided. Default is 2.
max_topics – Largest number of topics to try when num_topics is not provided. Default is 10.
no_below – Minimum number of documents a word must appear in to be used in training. Default is 10.
no_above – Maximum proportion of documents a word may appear in to be used in training. Default is 0.2.
tfidf – If True, the model is created from a TF-IDF matrix. Otherwise, a document-term matrix of word counts is used. Default is True.
model_kwargs – Keyword arguments for the model; these should be consistent with model_type.
- Raises
ValueError – Invalid model_type.
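The search over topic counts described by min_topics and max_topics might look roughly like the sketch below. It is not the library's code: `choose_num_topics` and the `coherence_for` callback are hypothetical, and it assumes that model_type is matched case-sensitively, that max_topics is inclusive, and that the highest coherence value wins.

```python
from typing import Callable, List, Tuple

VALID_MODEL_TYPES = ("LDA", "LSA", "LSI", "SVD", "NMF")


def choose_num_topics(
    model_type: str,
    coherence_for: Callable[[int], float],  # hypothetical: train with k topics, return coherence
    min_topics: int = 2,
    max_topics: int = 10,
) -> Tuple[int, List[float]]:
    """Hypothetical helper: try each topic count and keep the most coherent one."""
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"Invalid model_type: {model_type!r}")
    # Assumption: max_topics itself is included in the search.
    candidates = list(range(min_topics, max_topics + 1))
    coherence_values = [coherence_for(k) for k in candidates]
    best = candidates[coherence_values.index(max(coherence_values))]
    return best, coherence_values


# Stand-in scoring function that peaks at 4 topics; a real one would train a model.
best, values = choose_num_topics("LDA", lambda k: -((k - 4) ** 2), min_topics=2, max_topics=6)
print(best, values)
```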
-
elbow_plot(self, viz_backend: str = None)
Creates an elbow plot displaying coherence values vs number of topics.
- Parameters
viz_backend – The visualization backend.
- Raises
ValueError – No coherence values to plot.
- Returns
Elbow plot showing coherence values vs number of topics
- Return type
fig
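The data behind such a plot can be sketched without any plotting backend: each coherence value is paired with the number of topics it was computed for. `elbow_points` is a hypothetical helper, and it assumes the values are stored in order, one per topic count starting at min_topics.

```python
from typing import List, Tuple


def elbow_points(coherence_values: List[float], min_topics: int = 2) -> List[Tuple[int, float]]:
    """Hypothetical helper: pair each coherence value with its number of topics."""
    if not coherence_values:
        # Mirrors the documented error when there is nothing to plot.
        raise ValueError("No coherence values to plot.")
    # Assumption: position i in the list corresponds to min_topics + i topics.
    return [(min_topics + i, value) for i, value in enumerate(coherence_values)]


# Made-up coherence values for 2, 3, 4 and 5 topics.
points = elbow_points([0.31, 0.42, 0.47, 0.44], min_topics=2)
print(points)
```

The first element of each pair would go on the x-axis and the second on the y-axis.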
-
get_topic_nums(self)
Obtains topic distributions (LDA model) or scores (LSA/NMF model).
- Returns
Array of topic distributions (LDA model) or scores (LSA/NMF model)
- Return type
doc_topics
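One typical use of such an array is to label each document with its strongest topic. The sketch below assumes that documents are rows and topics are columns, which this page does not state explicitly, and `dominant_topics` is a hypothetical helper operating on plain lists.

```python
from typing import List


def dominant_topics(doc_topics: List[List[float]]) -> List[int]:
    """Hypothetical helper: index of the highest-scoring topic for each document."""
    # Assumption: each row is one document and each column is one topic.
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topics]


# Made-up distributions for three documents over three topics.
doc_topics = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
]
labels = dominant_topics(doc_topics)
print(labels)
```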
-
display_topic_keywords(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)
Creates Pandas DataFrame to display most relevant terms for each topic.
- Parameters
num_topic_words – The number of words to be displayed for each topic. Default is 10
topic_names – A list of pre-defined names set for each of the topics. Default is None
- Returns
Pandas DataFrame displaying topics as columns and their relevant terms as rows. LDA/LSI models will display an extra column to the right of each topic column, showing each term’s corresponding coefficient value.
- Return type
display_topics_df
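Ranking terms within each topic can be sketched with plain lists in place of a DataFrame. The topic-by-term weight layout, the helper name `top_terms`, and the fallback names "Topic 1", "Topic 2", and so on are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def top_terms(
    topic_term_weights: List[List[float]],
    vocab: List[str],
    num_topic_words: int = 10,
    topic_names: Optional[List[str]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper: the highest-weighted words for each topic."""
    # Assumption: unnamed topics get generic names.
    names = topic_names or [f"Topic {i + 1}" for i in range(len(topic_term_weights))]
    table: Dict[str, List[str]] = {}
    for name, weights in zip(names, topic_term_weights):
        # Sort word indices by this topic's weight, largest first.
        ranked = sorted(range(len(vocab)), key=lambda j: weights[j], reverse=True)
        table[name] = [vocab[j] for j in ranked[:num_topic_words]]
    return table


vocab = ["ball", "goal", "vote", "law"]
weights = [[0.50, 0.40, 0.05, 0.05], [0.02, 0.08, 0.50, 0.40]]
table = top_terms(weights, vocab, num_topic_words=2, topic_names=["sports", "politics"])
print(table)
```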
-
top_documents_per_topic(self, text_docs: List[str], topic_names: Optional[List[str]] = None, num_docs: int = 10, summarize_docs: bool = False, summary_words: Optional[int] = None)
Creates Pandas DataFrame to display most relevant documents for each topic.
- Parameters
text_docs – A list of text documents in string format. This list must be in the same order as the matrix or corpus on which the model was trained.
topic_names – A list of pre-defined names set for each of the topics. Default is None
num_docs – The number of documents to display for each topic. Default is 10
summarize_docs – If True, the documents will be summarized (if this is the case, ‘text_docs’ should be formatted into sentences). Default is False
summary_words – The number of words the summary should be limited to. Should only be specified if summarize_docs set to True
- Returns
Pandas DataFrame displaying topics as columns and their most relevant documents as rows.
- Return type
all_top_docs_df
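The ordering requirement on text_docs matters because documents are matched to their topic scores by position. A minimal sketch of that lookup, assuming a document-by-topic score layout and using the hypothetical helper `top_documents`:

```python
from typing import Dict, List


def top_documents(text_docs: List[str], doc_topics: List[List[float]],
                  num_docs: int = 10) -> Dict[int, List[str]]:
    """Hypothetical helper: the highest-scoring documents for each topic."""
    if len(text_docs) != len(doc_topics):
        raise ValueError("text_docs must have one entry per row of doc_topics.")
    n_topics = len(doc_topics[0])
    result: Dict[int, List[str]] = {}
    for topic in range(n_topics):
        # Row i of doc_topics is assumed to describe text_docs[i].
        ranked = sorted(range(len(text_docs)),
                        key=lambda i: doc_topics[i][topic], reverse=True)
        result[topic] = [text_docs[i] for i in ranked[:num_docs]]
    return result


texts = ["match report", "election news", "transfer rumours"]
scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # made-up document-topic scores
top = top_documents(texts, scores, num_docs=2)
print(top)
```

If the texts were shuffled relative to the scores, the same code would silently return the wrong documents, which is why the order has to match the training data.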
-
visualize_topic_summary(self, viz_backend: str = 'pyLDAvis')
Displays interactive pyLDAvis visual to understand topic model and documents.
- Parameters
viz_backend (str) – The visualization backend.
- Raises
TypeError – Only valid for LDA models.
- Returns
A visual to understand topic model and/or documents relating to model
-
property