data_describe.text.topic_modeling
| 
 | Topic modeling. | 
- 
data_describe.text.topic_modeling.gensim
- 
data_describe.text.topic_modeling.topic_model(text_docs: List[str], model_type: str = 'LDA', num_topics: Optional[int] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)
- Topic modeling. Unsupervised methods of identifying topics in documents.
- Parameters
- text_docs – A list of text documents in string format. These documents should generally be pre-processed 
- model_type – {‘LDA’, ‘LSA’, ‘LSI’, ‘SVD’, ‘NMF’} Defines the type of model/algorithm which will be used. 
- num_topics – Sets the number of topics for the model. If None, the number of topics is chosen by comparing coherence values for models trained with min_topics through max_topics topics
- min_topics – Starting number of topics to optimize for if number of topics not provided. Default is 2 
- max_topics – Maximum number of topics to optimize for if number of topics not provided. Default is 10 
- no_below – Minimum number of documents a word must appear in to be used in training. Default is 10 
- no_above – Maximum proportion of documents a word may appear in to be used in training. Default is 0.2 
- tfidf – If True, model created using TF-IDF matrix. Otherwise, document-term matrix with wordcounts is used. Default is True 
- model_kwargs – Keyword arguments for the model, should be in agreement with model_type 
 
- Returns
- Topic model widget. 
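
The no_below and no_above parameters describe a document-frequency filter on the vocabulary. The following is a minimal, hand-rolled sketch of that kind of filter, written only to illustrate the two thresholds. The function name `filter_vocabulary` and the sample documents are hypothetical, and the library's own filtering may treat boundary cases differently.

```python
from typing import Dict, List, Set


def filter_vocabulary(
    tokenized_docs: List[List[str]], no_below: int = 10, no_above: float = 0.2
) -> Set[str]:
    """Keep words found in at least `no_below` documents and in at most
    a `no_above` proportion of all documents (illustrative helper)."""
    doc_freq: Dict[str, int] = {}
    for tokens in tokenized_docs:
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    max_docs = no_above * len(tokenized_docs)
    return {word for word, df in doc_freq.items() if no_below <= df <= max_docs}


# Made-up, already tokenized documents.
docs = [
    ["cat", "dog", "the"],
    ["cat", "fish", "the"],
    ["bird", "the"],
    ["dog", "the"],
]
kept = filter_vocabulary(docs, no_below=2, no_above=0.5)
```

With these sample values, "the" is dropped because it appears in every document, while "fish" and "bird" are dropped because each appears in only one.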
 
- 
class data_describe.text.topic_modeling.TopicModelWidget(model_type: str = 'LDA', num_topics: Optional[int] = None, model_kwargs: Optional[Dict] = None)
- Bases: data_describe._widget.BaseWidget
- Create topic model widget.
- 
property model(self)
- Trained topic model. 
 - 
property model_type(self)
- Type of model that has been, or will be, trained. 
 - 
property num_topics(self)
- The number of topics in the model. 
 - 
property coherence_values(self)
- A list of coherence values, one for each number of topics from min_topics to max_topics. 
 - 
property dictionary(self)
- A Gensim dictionary mapping the words from the documents to their token_ids. 
 - 
property corpus(self)
- Bag of Words (BoW) representation of documents (token_id, token_count). 
 - 
property matrix(self)
- Either TF-IDF or document-term matrix with documents as rows and words as columns. 
 - 
property min_topics(self)
- If num_topics is None, this is the smallest number of topics for which a model is trained. 
 - 
property max_topics(self)
- If num_topics is None, this is the largest number of topics for which a model is trained. 
 - 
show(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)
- Displays most relevant terms for each topic.
- Parameters
- num_topic_words – The number of words to be displayed for each topic. Default is 10 
- topic_names – A list of pre-defined names set for each of the topics. Default is None 
 
- Returns
- Pandas DataFrame displaying topics as columns and their relevant terms as rows. LDA/LSI models will display an extra column to the right of each topic column, showing each term’s corresponding coefficient value 
 
- Return type
- display_topics_df 
 
 - 
fit(self, text_docs: List[str], model_type: Optional[str] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)
- Trains topic model and assigns model to object as attribute.
- Parameters
- text_docs – A list of text documents in string format. These documents should generally be pre-processed 
- model_type – {‘LDA’, ‘LSA’, ‘LSI’, ‘SVD’, ‘NMF’} Defines the type of model/algorithm which will be used. 
- min_topics – Starting number of topics to optimize for if number of topics not provided. Default is 2 
- max_topics – Maximum number of topics to optimize for if number of topics not provided. Default is 10 
- no_below – Minimum number of documents a word must appear in to be used in training. Default is 10 
- no_above – Maximum proportion of documents a word may appear in to be used in training. Default is 0.2 
- tfidf – If True, model created using TF-IDF matrix. Otherwise, document-term matrix with wordcounts is used. Default is True. 
- model_kwargs – Keyword arguments for the model, should be in agreement with model_type. 
 
- Raises
- ValueError – Invalid model_type. 
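
The tfidf flag switches between two input representations: raw word counts and TF-IDF weights. The sketch below contrasts the two on a toy corpus using plain Python. It uses one common TF-IDF variant (count times log(N / document frequency)) as an assumption; the weighting the library actually applies may differ. The function names `count_matrix` and `tfidf_matrix` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def count_matrix(tokenized_docs: List[List[str]]) -> List[Dict[str, int]]:
    """Document-term counts: one {word: count} row per document."""
    return [dict(Counter(tokens)) for tokens in tokenized_docs]


def tfidf_matrix(tokenized_docs: List[List[str]]) -> List[Dict[str, float]]:
    """TF-IDF rows using count * log(N / document frequency) (assumed variant)."""
    n_docs = len(tokenized_docs)
    # Number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized_docs for word in set(tokens))
    return [
        {
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in Counter(tokens).items()
        }
        for tokens in tokenized_docs
    ]


docs = [["cat", "cat", "dog"], ["dog", "fish"]]
counts = count_matrix(docs)
weighted = tfidf_matrix(docs)
```

In this toy corpus "dog" occurs in every document, so its TF-IDF weight is zero even though its raw count is not, which is the practical difference the flag controls.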
 
 - 
elbow_plot(self, viz_backend: str = None)
- Creates an elbow plot displaying coherence values vs number of topics.
- Parameters
- viz_backend – The visualization backend. 
- Raises
- ValueError – No coherence values to plot. 
- Returns
- Elbow plot showing coherence values vs number of topics 
- Return type
- fig 
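
The values shown in the elbow plot can also be read programmatically. The sketch below assumes that coherence_values holds one score per topic count, starting at min_topics, and that the highest score is preferred. Both points are assumptions for illustration, and `pick_num_topics` is a hypothetical helper, not part of the widget.

```python
from typing import List


def pick_num_topics(coherence_values: List[float], min_topics: int = 2) -> int:
    """Return the topic count whose model scored the highest coherence.

    Assumes coherence_values[i] belongs to the model with min_topics + i topics.
    """
    if not coherence_values:
        raise ValueError("No coherence values to compare.")
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return min_topics + best_index


# Made-up scores for models with 2, 3, 4 and 5 topics.
scores = [0.31, 0.42, 0.47, 0.44]
best = pick_num_topics(scores, min_topics=2)
```

An elbow plot is often read by eye for the point where the curve flattens, so the maximum is only one possible selection rule.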
 
 - 
get_topic_nums(self)
- Obtains topic distributions (LDA model) or scores (LSA/NMF model).
- Returns
- Array of topic distributions (LDA model) or scores (LSA/NMF model) 
- Return type
- doc_topics 
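
A common use of such an array is to assign each document to its strongest topic. The sketch below assumes the array has one row per document and one column per topic; that layout is an assumption, and nested lists stand in for the returned array. `dominant_topics` is a hypothetical helper.

```python
from typing import List


def dominant_topics(doc_topics: List[List[float]]) -> List[int]:
    """Return, for each document row, the index of its highest-scoring topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topics]


# Made-up distributions for three documents over three topics.
doc_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
]
labels = dominant_topics(doc_topics)
```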
 
 - 
display_topic_keywords(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)
- Creates Pandas DataFrame to display most relevant terms for each topic.
- Parameters
- num_topic_words – The number of words to be displayed for each topic. Default is 10 
- topic_names – A list of pre-defined names set for each of the topics. Default is None 
 
- Returns
- Pandas DataFrame displaying topics as columns and their relevant terms as rows. LDA/LSI models will display an extra column to the right of each topic column, showing each term’s corresponding coefficient value 
 
- Return type
- display_topics_df 
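
To make "most relevant terms" concrete, the sketch below ranks terms by weight within each topic and keeps the first num_topic_words. It works on a plain dictionary of made-up weights rather than a trained model, and `top_terms` is a hypothetical helper, so it illustrates the idea rather than how the method itself is implemented.

```python
from typing import Dict, List


def top_terms(
    topic_term_weights: Dict[str, Dict[str, float]], num_topic_words: int = 10
) -> Dict[str, List[str]]:
    """For each topic, list its terms from highest to lowest weight,
    truncated to `num_topic_words` entries."""
    return {
        topic: sorted(weights, key=weights.get, reverse=True)[:num_topic_words]
        for topic, weights in topic_term_weights.items()
    }


# Made-up term weights for two topics.
weights = {
    "Topic 1": {"goal": 0.5, "team": 0.3, "vote": 0.1},
    "Topic 2": {"vote": 0.6, "law": 0.3, "team": 0.05},
}
keywords = top_terms(weights, num_topic_words=2)
```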
 
 - 
top_documents_per_topic(self, text_docs: List[str], topic_names: Optional[List[str]] = None, num_docs: int = 10, summarize_docs: bool = False, summary_words: Optional[int] = None)
- Creates Pandas DataFrame to display most relevant documents for each topic.
- Parameters
- text_docs – A list of text documents in string format. This list must be in the same order as the matrix or corpus on which the model was trained 
- topic_names – A list of pre-defined names set for each of the topics. Default is None 
- num_docs – The number of documents to display for each topic. Default is 10 
- summarize_docs – If True, the documents will be summarized (if this is the case, ‘text_docs’ should be formatted into sentences). Default is False 
- summary_words – The number of words the summary should be limited to. Should only be specified if summarize_docs is set to True 
 
- Returns
- Pandas DataFrame displaying topics as columns and their most relevant documents as rows 
 
- Return type
- all_top_docs_df 
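
The ordering requirement on text_docs exists because documents are matched to topic scores by position. The sketch below shows that positional matching with plain lists. The per-document score rows and the helper `top_documents` are hypothetical and only illustrate why a reordered list would pair documents with the wrong scores.

```python
from typing import List


def top_documents(
    text_docs: List[str], doc_topics: List[List[float]], topic: int, num_docs: int = 10
) -> List[str]:
    """Return the `num_docs` documents with the highest score for `topic`.

    Row i of `doc_topics` must describe text_docs[i]; the pairing is by position.
    """
    if len(text_docs) != len(doc_topics):
        raise ValueError("text_docs and doc_topics must have the same length.")
    ranked = sorted(
        range(len(text_docs)), key=lambda i: doc_topics[i][topic], reverse=True
    )
    return [text_docs[i] for i in ranked[:num_docs]]


# Made-up documents and per-document scores for two topics.
docs = ["match report", "election recap", "stadium funding vote"]
scores = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
sports_docs = top_documents(docs, scores, topic=0, num_docs=2)
politics_docs = top_documents(docs, scores, topic=1, num_docs=2)
```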
 
 - 
visualize_topic_summary(self, viz_backend: str = 'pyLDAvis')
- Displays interactive pyLDAvis visual to understand topic model and documents.
- Parameters
- viz_backend (str) – The visualization backend. 
- Raises
- TypeError – Only valid for LDA models. 
- Returns
- A visual to understand topic model and/or documents relating to model 
 
 
- 
property