data_describe.text.topic_modeling

data_describe.text.topic_modeling.gensim

data_describe.text.topic_modeling.topic_model(text_docs: List[str], model_type: str = 'LDA', num_topics: Optional[int] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)

Topic modeling.

Unsupervised methods of identifying topics in documents.

Parameters

text_docs – A list of text documents in string format. These documents should generally be pre-processed.
model_type – {'LDA', 'LSA', 'LSI', 'SVD', 'NMF'} Defines the type of model/algorithm which will be used.
num_topics – Sets the number of topics for the model. If None, the number of topics is optimized using coherence values.
min_topics – Smallest number of topics to try when the number of topics is not provided. Default is 2.
max_topics – Largest number of topics to try when the number of topics is not provided. Default is 10.
no_below – Minimum number of documents a word must appear in to be used in training. Default is 10.
no_above – Maximum proportion of documents a word may appear in to be used in training. Default is 0.2.
tfidf – If True, the model is trained on a TF-IDF matrix; otherwise, a document-term matrix of word counts is used. Default is True.
model_kwargs – Keyword arguments for the model; these should be consistent with model_type.
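
The meaning of no_below and no_above can be illustrated with a small pure-Python sketch. This is not the library's implementation (Gensim offers Dictionary.filter_extremes for this kind of pruning, and the library may well delegate to it); the function name filter_vocabulary and the toy documents are hypothetical, and the input is assumed to be already tokenized.

```python
from collections import Counter
from typing import List, Set


def filter_vocabulary(
    tokenized_docs: List[List[str]], no_below: int = 10, no_above: float = 0.2
) -> Set[str]:
    """Keep words found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (hypothetical helper)."""
    num_docs = len(tokenized_docs)
    # Document frequency: count each word once per document.
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {
        word
        for word, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
# "the" is too common (4 of 4 documents); "dog", "ran", "bird", "flew" are too rare.
kept = filter_vocabulary(docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

With the defaults (no_below=10, no_above=0.2), a very small corpus can end up with an empty vocabulary, so both values may need adjusting for small inputs.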

Returns

Topic model widget.

class data_describe.text.topic_modeling.TopicModelWidget(model_type: str = 'LDA', num_topics: Optional[int] = None, model_kwargs: Optional[Dict] = None)

Bases: data_describe._widget.BaseWidget

Create topic model widget.

property model(self): Trained topic model.

property model_type(self): Type of model which either already has been or will be trained.

property num_topics(self): The number of topics in the model.

property coherence_values(self): A list of coherence values mapped from min_topics to max_topics.

property dictionary(self): A Gensim dictionary mapping the words from the documents to their token_ids.

property corpus(self): Bag of Words (BoW) representation of documents (token_id, token_count).

property matrix(self): Either TF-IDF or document-term matrix with documents as rows and words as columns.

property min_topics(self): If num_topics is None, this number is the first number of topics a model will be trained on.

property max_topics(self): If num_topics is None, this number is the last number of topics a model will be trained on.
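
To make the dictionary and corpus properties concrete, the following sketch builds a word-to-id mapping and a bag-of-words corpus of (token_id, token_count) pairs in plain Python. It only mirrors the shape of the data described above; the helper names build_dictionary and to_bow are hypothetical, and Gensim's own id assignment may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_dictionary(tokenized_docs: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to each distinct word, in order of first appearance."""
    token2id: Dict[str, int] = {}
    for doc in tokenized_docs:
        for word in doc:
            if word not in token2id:
                token2id[word] = len(token2id)
    return token2id


def to_bow(doc: List[str], token2id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Convert one document into sorted (token_id, token_count) pairs."""
    counts = Counter(token2id[word] for word in doc if word in token2id)
    return sorted(counts.items())


tokenized = [["topic", "model", "topic"], ["model", "text"]]
token2id = build_dictionary(tokenized)
bow_corpus = [to_bow(doc, token2id) for doc in tokenized]
print(token2id)
print(bow_corpus)
```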

show(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)

Displays most relevant terms for each topic.

Parameters

num_topic_words – The number of words to be displayed for each topic. Default is 10
topic_names – A list of pre-defined names set for each of the topics. Default is None

Returns

Pandas DataFrame displaying topics as columns and their relevant terms as rows. LDA/LSI models will display an extra column to the right of each topic column, showing each term's corresponding coefficient value.

Return type

display_topics_df

fit(self, text_docs: List[str], model_type: Optional[str] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)

Trains topic model and assigns model to object as attribute.

Parameters

text_docs – A list of text documents in string format. These documents should generally be pre-processed.
model_type – {'LDA', 'LSA', 'LSI', 'SVD', 'NMF'} Defines the type of model/algorithm which will be used.
min_topics – Smallest number of topics to try when the number of topics is not provided. Default is 2.
max_topics – Largest number of topics to try when the number of topics is not provided. Default is 10.
no_below – Minimum number of documents a word must appear in to be used in training. Default is 10.
no_above – Maximum proportion of documents a word may appear in to be used in training. Default is 0.2.
tfidf – If True, the model is trained on a TF-IDF matrix; otherwise, a document-term matrix of word counts is used. Default is True.
model_kwargs – Keyword arguments for the model; these should be consistent with model_type.

Raises

ValueError – Invalid model_type.

elbow_plot(self, viz_backend: str = None)

Creates an elbow plot displaying coherence values vs number of topics.

Parameters: viz_backend – The visualization backend.
Raises: ValueError – No coherence values to plot.
Returns: Elbow plot showing coherence values vs number of topics
Return type: fig
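
The data behind such an elbow plot can be sketched as pairs of (number of topics, coherence value). The sketch below assumes coherence_values holds one value per candidate topic count, in order, starting at min_topics. Picking the count with the highest coherence is also an assumption made for illustration, not a documented selection rule. The function names and the coherence numbers are hypothetical.

```python
from typing import List, Tuple


def elbow_points(coherence_values: List[float], min_topics: int = 2) -> List[Tuple[int, float]]:
    """Pair each coherence value with the number of topics it was computed for."""
    return [(min_topics + i, value) for i, value in enumerate(coherence_values)]


def best_num_topics(coherence_values: List[float], min_topics: int = 2) -> int:
    """Return the topic count with the highest coherence (assumed selection rule)."""
    points = elbow_points(coherence_values, min_topics)
    return max(points, key=lambda point: point[1])[0]


# Made-up coherence values for 2, 3, 4, 5 and 6 topics.
coherence_values = [0.31, 0.42, 0.47, 0.45, 0.40]
points = elbow_points(coherence_values, min_topics=2)
best = best_num_topics(coherence_values, min_topics=2)
print(points)
print(best)
```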

get_topic_nums(self)

Obtains topic distributions (LDA model) or scores (LSA/NMF model).

Returns: Array of topic distributions (LDA model) or scores (LSA/NMF model)
Return type: doc_topics

display_topic_keywords(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)

Creates Pandas DataFrame to display most relevant terms for each topic.

Parameters

num_topic_words – The number of words to be displayed for each topic. Default is 10
topic_names – A list of pre-defined names set for each of the topics. Default is None

Returns

Pandas DataFrame displaying topics as columns and their relevant terms as rows. LDA/LSI models will display an extra column to the right of each topic column, showing each term's corresponding coefficient value.

Return type

display_topics_df
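
As a rough illustration of what "most relevant terms for each topic" means, the sketch below ranks vocabulary words by their weight within each topic and keeps the top num_topic_words. It assumes a topic-by-term weight matrix is available and ranks by raw weight; the library's actual ranking may differ (for example, for models whose coefficients can be negative). The function name top_terms and the toy weights are hypothetical.

```python
from typing import List, Tuple


def top_terms(
    topic_term_weights: List[List[float]],
    vocabulary: List[str],
    num_topic_words: int = 10,
) -> List[List[Tuple[str, float]]]:
    """For each topic, return the highest-weighted (term, weight) pairs."""
    result = []
    for weights in topic_term_weights:
        ranked = sorted(zip(vocabulary, weights), key=lambda pair: pair[1], reverse=True)
        result.append(ranked[:num_topic_words])
    return result


vocabulary = ["price", "market", "goal", "team"]
topic_term_weights = [
    [0.5, 0.3, 0.1, 0.1],    # topic 0 leans toward finance terms
    [0.05, 0.05, 0.4, 0.5],  # topic 1 leans toward sports terms
]
keywords = top_terms(topic_term_weights, vocabulary, num_topic_words=2)
print(keywords)
```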

top_documents_per_topic(self, text_docs: List[str], topic_names: Optional[List[str]] = None, num_docs: int = 10, summarize_docs: bool = False, summary_words: Optional[int] = None)

Creates Pandas DataFrame to display most relevant documents for each topic.

Parameters

text_docs – A list of text documents in string format. This list must be in the same order as the matrix or corpus on which the model was trained.
topic_names – A list of pre-defined names set for each of the topics. Default is None
num_docs – The number of documents to display for each topic. Default is 10
summarize_docs – If True, the documents will be summarized (if this is the case, ‘text_docs’ should be formatted into sentences). Default is False
summary_words – The number of words the summary should be limited to. Should only be specified if summarize_docs set to True

Returns

Pandas DataFrame displaying topics as columns and their most relevant documents as rows.

Return type

all_top_docs_df
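
The ordering requirement on text_docs follows from how documents are matched to scores: row i of the document-topic array is assumed to describe text_docs[i]. The sketch below shows that lookup in plain Python; the function name top_documents and the scores are hypothetical, and the library's own ranking logic may differ.

```python
from typing import Dict, List


def top_documents(
    doc_topic_scores: List[List[float]], text_docs: List[str], num_docs: int = 10
) -> Dict[int, List[str]]:
    """For each topic, return the documents with the highest score for that topic."""
    if len(doc_topic_scores) != len(text_docs):
        raise ValueError("Scores and documents must have the same length and order.")
    num_topics = len(doc_topic_scores[0])
    result: Dict[int, List[str]] = {}
    for topic in range(num_topics):
        # Rank document indices by their score for this topic, highest first.
        ranked = sorted(
            range(len(text_docs)),
            key=lambda i: doc_topic_scores[i][topic],
            reverse=True,
        )
        result[topic] = [text_docs[i] for i in ranked[:num_docs]]
    return result


text_docs = ["stocks rallied", "team won", "markets dipped"]
doc_topic_scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
top_docs = top_documents(doc_topic_scores, text_docs, num_docs=2)
print(top_docs)
```

If text_docs were shuffled relative to the training order, the indices would still resolve, but each score would be attached to the wrong document.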

visualize_topic_summary(self, viz_backend: str = 'pyLDAvis')

Displays interactive pyLDAvis visual to understand topic model and documents.

Parameters: viz_backend (str) – The visualization backend.
Raises: TypeError – Only valid for LDA models.
Returns: A visual to understand topic model and/or documents relating to model
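
A possible end-to-end workflow, using only the signatures documented above, might look like the following sketch. It assumes data_describe and its text-modeling dependencies are installed; the wrapper name explore_topics and the chosen argument values are hypothetical. The import is placed inside the function so the snippet can be loaded even where the package is unavailable.

```python
from typing import List


def explore_topics(text_docs: List[str]):
    """Fit an LDA topic model and collect its keyword and document summaries."""
    # Imported lazily; requires the data_describe package.
    from data_describe.text.topic_modeling import topic_model

    # num_topics is left as None so the number of topics is chosen
    # from the range min_topics..max_topics using coherence values.
    widget = topic_model(text_docs, model_type="LDA", min_topics=2, max_topics=6)

    keywords = widget.display_topic_keywords(num_topic_words=5)
    top_docs = widget.top_documents_per_topic(text_docs, num_docs=3)
    return widget, keywords, top_docs
```

The same text_docs list is passed to both calls so that its order matches the corpus the model was trained on.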