data_describe.text.topic_modeling

topic_model(text_docs: List[str], model_type: str = 'LDA', num_topics: Optional[int] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)

Topic modeling.

data_describe.text.topic_modeling.gensim
data_describe.text.topic_modeling.topic_model(text_docs: List[str], model_type: str = 'LDA', num_topics: Optional[int] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)

Topic modeling.

Unsupervised methods of identifying topics in documents.

Parameters
  • text_docs – A list of text documents in string format. These documents should generally be pre-processed.

  • model_type – {'LDA', 'LSA', 'LSI', 'SVD', 'NMF'} Defines the type of model/algorithm which will be used.

  • num_topics – Sets the number of topics for the model. If None, the number of topics is optimized using coherence values.

  • min_topics – Starting number of topics to optimize for if number of topics not provided. Default is 2

  • max_topics – Maximum number of topics to optimize for if number of topics not provided. Default is 10

  • no_below – Minimum number of documents a word must appear in to be used in training. Default is 10

  • no_above – Maximum proportion of documents a word may appear in to be used in training. Default is 0.2

  • tfidf – If True, model created using TF-IDF matrix. Otherwise, document-term matrix with wordcounts is used. Default is True

  • model_kwargs – Keyword arguments for the model, should be in agreement with model_type

Returns

Topic model widget.
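
The no_below and no_above thresholds can be easier to reason about with a small worked example. The following is a minimal standard-library sketch of a document-frequency filter of the kind those parameters describe. It is not the library's implementation: the helper name keep_tokens and the sample documents are hypothetical, and the sketch assumes a token is kept when it appears in at least no_below documents and in at most no_above × (number of documents) documents.

```python
from collections import Counter
from typing import List, Set


def keep_tokens(
    tokenized_docs: List[List[str]], no_below: int = 10, no_above: float = 0.2
) -> Set[str]:
    """Return tokens whose document frequency passes both thresholds.

    Hypothetical helper illustrating the filtering rule; not part of data_describe.
    """
    num_docs = len(tokenized_docs)
    # Count each token once per document (document frequency, not raw count).
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }


docs = [
    ["cat", "sat", "mat"],
    ["cat", "dog"],
    ["cat", "dog", "bird"],
    ["cat", "fish"],
]
# "cat" is in all 4 documents (too common), "dog" is in 2 (kept),
# and every other token is in only 1 document (too rare).
kept = keep_tokens(docs, no_below=2, no_above=0.5)
```

Under this assumed rule, the defaults (no_below=10, no_above=0.2) can only keep a token when the corpus has at least 50 documents, so on a very small corpus it may be worth lowering no_below or raising no_above.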

class data_describe.text.topic_modeling.TopicModelWidget(model_type: str = 'LDA', num_topics: Optional[int] = None, model_kwargs: Optional[Dict] = None)

Bases: data_describe._widget.BaseWidget

Create topic model widget.

property model(self)

Trained topic model.

property model_type(self)

Type of model which either already has been or will be trained.

property num_topics(self)

The number of topics in the model.

property coherence_values(self)

A list of coherence values mapped from min_topics to max_topics.

property dictionary(self)

A Gensim dictionary mapping the words from the documents to their token_ids.

property corpus(self)

Bag of Words (BoW) representation of documents (token_id, token_count).
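
To make the dictionary and corpus shapes concrete, here is a minimal standard-library sketch of a token-to-id mapping and the resulting (token_id, token_count) pairs. The helper names build_dictionary and doc_to_bow are hypothetical, and the ids shown are illustrative only; a Gensim dictionary may assign ids in a different order.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_dictionary(tokenized_docs: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    token2id: Dict[str, int] = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(doc: List[str], token2id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Convert one tokenized document to sorted (token_id, token_count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


tokenized = [["topic", "model", "topic"], ["model", "data"]]
token2id = build_dictionary(tokenized)
corpus = [doc_to_bow(doc, token2id) for doc in tokenized]
```

Each inner list in corpus corresponds to one document, in the same order as the input documents.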

property matrix(self)

Either TF-IDF or document-term matrix with documents as rows and words as columns.
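
The difference between the two matrix types can be sketched as follows. This example uses one common textbook TF-IDF variant (raw count multiplied by the natural log of the inverse document frequency). This is an assumption made for illustration: the weighting and normalization actually applied by the library may differ, and the function name tfidf_matrix is hypothetical.

```python
import math
from typing import List


def tfidf_matrix(counts: List[List[int]]) -> List[List[float]]:
    """Weight a document-term count matrix by log(N / document frequency)."""
    num_docs = len(counts)
    num_terms = len(counts[0]) if counts else 0
    # Document frequency: number of documents in which each term occurs.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(num_terms)]
    return [
        [
            row[j] * math.log(num_docs / doc_freq[j]) if doc_freq[j] else 0.0
            for j in range(num_terms)
        ]
        for row in counts
    ]


# Rows are documents, columns are words.
counts = [[2, 1, 0], [0, 1, 3]]
weights = tfidf_matrix(counts)
```

In this variant, a word that occurs in every document (the middle column) receives a weight of zero, while words concentrated in a single document keep a positive weight.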

property min_topics(self)

If num_topics is None, this number is the first number of topics a model will be trained on.

property max_topics(self)

If num_topics is None, this number is the last number of topics a model will be trained on.
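
The relationship between coherence_values, min_topics and the selected num_topics can be sketched as below. This assumes that one model is trained for each consecutive topic count starting at min_topics and that the count with the highest coherence is selected; the source does not spell out the selection rule, so treat this as an illustration. The function name best_num_topics and the scores are hypothetical.

```python
from typing import List


def best_num_topics(coherence_values: List[float], min_topics: int = 2) -> int:
    """Map the index of the highest coherence score back to a topic count.

    Assumes coherence_values[i] belongs to a model with min_topics + i topics.
    """
    if not coherence_values:
        raise ValueError("No coherence values to choose from.")
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return min_topics + best_index


# Illustrative scores for 2, 3, 4 and 5 topics.
scores = [0.31, 0.42, 0.47, 0.44]
chosen = best_num_topics(scores, min_topics=2)
```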

show(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)

Displays most relevant terms for each topic.

Parameters
  • num_topic_words – The number of words to be displayed for each topic. Default is 10

  • topic_names – A list of pre-defined names set for each of the topics. Default is None

Returns

Pandas DataFrame displaying topics as columns and their relevant terms as rows. LDA/LSI models will display an extra column to the right of each topic column, showing each term’s corresponding coefficient value.

Return type

display_topics_df

fit(self, text_docs: List[str], model_type: Optional[str] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)

Trains topic model and assigns model to object as attribute.

Parameters
  • text_docs – A list of text documents in string format. These documents should generally be pre-processed

  • model_type – {'LDA', 'LSA', 'LSI', 'SVD', 'NMF'} Defines the type of model/algorithm which will be used.

  • min_topics – Starting number of topics to optimize for if number of topics not provided. Default is 2

  • max_topics – Maximum number of topics to optimize for if number of topics not provided. Default is 10

  • no_below – Minimum number of documents a word must appear in to be used in training. Default is 10

  • no_above – Maximum proportion of documents a word may appear in to be used in training. Default is 0.2

  • tfidf – If True, model created using TF-IDF matrix. Otherwise, document-term matrix with wordcounts is used. Default is True.

  • model_kwargs – Keyword arguments for the model, should be in agreement with model_type.

Raises

ValueError – Invalid model_type.

elbow_plot(self, viz_backend: str = None)

Creates an elbow plot displaying coherence values vs number of topics.

Parameters

viz_backend – The visualization backend.

Raises

ValueError – No coherence values to plot.

Returns

Elbow plot showing coherence values vs number of topics

Return type

fig

get_topic_nums(self)

Obtains topic distributions (LDA model) or scores (LSA/NMF model).

Returns

Array of topic distributions (LDA model) or scores (LSA/NMF model)

Return type

doc_topics

display_topic_keywords(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)

Creates Pandas DataFrame to display most relevant terms for each topic.

Parameters
  • num_topic_words – The number of words to be displayed for each topic. Default is 10

  • topic_names – A list of pre-defined names set for each of the topics. Default is None

Returns

Pandas DataFrame displaying topics as columns and their relevant terms as rows. LDA/LSI models will display an extra column to the right of each topic column, showing each term’s corresponding coefficient value.

Return type

display_topics_df
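
As an illustration of what "most relevant terms for each topic" means, the following standard-library sketch ranks terms by weight within each topic and keeps the top num_topic_words. The input layout (one row of term weights per topic), the function name top_terms and the sample values are assumptions for this example, not the library's internals.

```python
from typing import List, Tuple


def top_terms(
    topic_term_weights: List[List[float]], vocab: List[str], num_topic_words: int = 10
) -> List[List[Tuple[str, float]]]:
    """For each topic, return its highest-weighted (term, weight) pairs."""
    result = []
    for weights in topic_term_weights:
        # Sort term indices by descending weight within this topic.
        ranked = sorted(range(len(vocab)), key=lambda j: weights[j], reverse=True)
        result.append([(vocab[j], weights[j]) for j in ranked[:num_topic_words]])
    return result


vocab = ["price", "market", "goal", "team"]
topic_term_weights = [
    [0.5, 0.4, 0.05, 0.05],  # topic 0
    [0.1, 0.0, 0.6, 0.3],    # topic 1
]
keywords = top_terms(topic_term_weights, vocab, num_topic_words=2)
```

Each inner list corresponds to one topic column in the returned DataFrame, and the weights play the role of the coefficient column described for LDA/LSI models.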

top_documents_per_topic(self, text_docs: List[str], topic_names: Optional[List[str]] = None, num_docs: int = 10, summarize_docs: bool = False, summary_words: Optional[int] = None)

Creates Pandas DataFrame to display most relevant documents for each topic.

Parameters
  • text_docs – A list of text documents in string format. Note that this list must be in the same order as the matrix or corpus on which the model was trained

  • topic_names – A list of pre-defined names set for each of the topics. Default is None

  • num_docs – The number of documents to display for each topic. Default is 10

  • summarize_docs – If True, the documents will be summarized; in that case, text_docs should be formatted into sentences. Default is False

  • summary_words – The number of words the summary should be limited to. Should only be specified if summarize_docs set to True

Returns

Pandas DataFrame displaying topics as columns and their most relevant documents as rows.

Return type

all_top_docs_df
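
The ordering requirement on text_docs matters because document scores are matched to document text by position. The sketch below shows one plausible ranking rule: for each topic, sort documents by that topic's score and keep the top num_docs. The library's actual rule may differ (for example, it could first assign each document to a single dominant topic), and the function name top_documents and the sample scores are hypothetical.

```python
from typing import List


def top_documents(
    doc_topics: List[List[float]], text_docs: List[str], num_docs: int = 10
) -> List[List[str]]:
    """For each topic, return the documents with the highest score for it.

    Row i of doc_topics must describe text_docs[i]; a different ordering
    would silently pair scores with the wrong text.
    """
    if len(doc_topics) != len(text_docs):
        raise ValueError("doc_topics and text_docs must have the same length.")
    num_topics = len(doc_topics[0]) if doc_topics else 0
    result = []
    for k in range(num_topics):
        ranked = sorted(
            range(len(text_docs)), key=lambda i: doc_topics[i][k], reverse=True
        )
        result.append([text_docs[i] for i in ranked[:num_docs]])
    return result


doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
text_docs = ["doc a", "doc b", "doc c"]
top_docs = top_documents(doc_topics, text_docs, num_docs=2)
```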

visualize_topic_summary(self, viz_backend: str = 'pyLDAvis')

Displays interactive pyLDAvis visual to understand topic model and documents.

Parameters

viz_backend (str) – The visualization backend.

Raises

TypeError – Only valid for LDA models.

Returns

A visual to understand topic model and/or documents relating to model