data_describe.text.topic_modeling
========================================

.. py:module:: data_describe.text.topic_modeling


.. autoapisummary::

   data_describe.text.topic_modeling.topic_model


.. data:: gensim
   

.. function:: topic_model(text_docs: List[str], model_type: str = 'LDA', num_topics: Optional[int] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)

   Topic modeling.

   Unsupervised methods of identifying topics in documents.

   :param text_docs: A list of text documents in string format. These documents should
                     generally be pre-processed
   :param model_type: {'LDA', 'LSA', 'LSI', 'SVD', 'NMF'}
                      Defines the type of model/algorithm which will be used.
   :param num_topics: Sets the number of topics for the model. If None, the number is
                      chosen by comparing coherence values for `min_topics` through
                      `max_topics`
   :param min_topics: Starting number of topics to optimize for if number of topics not
                      provided. Default is 2
   :param max_topics: Maximum number of topics to optimize for if number of topics not
                      provided. Default is 10
   :param no_below: Minimum number of documents a word must appear in to be used in
                    training. Default is 10
   :param no_above: Maximum proportion of documents a word may appear in to be used in
                    training. Default is 0.2
   :param tfidf: If True, model created using TF-IDF matrix. Otherwise, document-term
                 matrix with wordcounts is used. Default is True
   :param model_kwargs: Keyword arguments for the model; they should be valid for the
                        chosen `model_type`

   :returns: Topic model widget.
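
   As a rough illustration of the `no_below` and `no_above` thresholds described
   above, the following sketch filters a vocabulary by document frequency in plain
   Python. It does not call data_describe or Gensim; `filter_vocabulary` is a
   hypothetical helper, and the documents are assumed to be tokenized already.

   .. code-block:: python

      from collections import Counter
      from typing import List, Set


      def filter_vocabulary(
          tokenized_docs: List[List[str]], no_below: int = 10, no_above: float = 0.2
      ) -> Set[str]:
          """Keep words found in at least no_below docs and at most no_above of all docs."""
          # Document frequency: count each word once per document.
          doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
          max_docs = no_above * len(tokenized_docs)
          return {w for w, df in doc_freq.items() if no_below <= df <= max_docs}


      # Hypothetical toy corpus of four tokenized documents.
      docs = [
          ["cat", "sat", "mat"],
          ["cat", "ate", "fish"],
          ["dog", "ate", "bone"],
          ["cat", "dog", "play"],
      ]
      # "cat" is dropped (3 of 4 documents exceeds 50%); words seen once are dropped too.
      kept = filter_vocabulary(docs, no_below=2, no_above=0.5)

   Under this reading of the thresholds, the defaults (`no_below=10`,
   `no_above=0.2`) can only keep a word when the collection has at least 50
   documents, because 10 documents must be no more than 20% of the total.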


.. py:class:: TopicModelWidget(model_type: str = 'LDA', num_topics: Optional[int] = None, model_kwargs: Optional[Dict] = None)

   Bases: :class:`data_describe._widget.BaseWidget`

   Create topic model widget.

   .. method:: model(self)
      :property:


      Trained topic model.


   .. method:: model_type(self)
      :property:


      Type of model that has been, or will be, trained.


   .. method:: num_topics(self)
      :property:


      The number of topics in the model.


   .. method:: coherence_values(self)
      :property:


      A list of coherence values mapped from min_topics to max_topics.
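
      A minimal sketch of how such a list could be read, assuming one coherence
      value per candidate topic count in steps of one, and assuming the highest
      coherence is preferred. Neither assumption is confirmed by this page, and
      `best_num_topics` is a hypothetical helper rather than part of the widget.

      .. code-block:: python

         from typing import List


         def best_num_topics(coherence_values: List[float], min_topics: int = 2) -> int:
             """Return the topic count whose coherence value is highest."""
             # Index 0 corresponds to min_topics, index 1 to min_topics + 1, and so on.
             best_index = max(
                 range(len(coherence_values)), key=coherence_values.__getitem__
             )
             return min_topics + best_index


         # Illustrative (made-up) coherence values for 2, 3, 4 and 5 topics.
         scores = [0.31, 0.42, 0.47, 0.44]
         chosen = best_num_topics(scores, min_topics=2)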


   .. method:: dictionary(self)
      :property:


      A Gensim dictionary mapping the words from the documents to their token_ids.


   .. method:: corpus(self)
      :property:


      Bag of Words (BoW) representation of documents (token_id, token_count).
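
      To make the `(token_id, token_count)` format concrete, here is a small
      plain-Python sketch of a token-to-id mapping and the resulting BoW corpus.
      The helpers `build_dictionary` and `doc_to_bow` are hypothetical, and the
      ids Gensim assigns to the same tokens may differ.

      .. code-block:: python

         from collections import Counter
         from typing import Dict, List, Tuple


         def build_dictionary(tokenized_docs: List[List[str]]) -> Dict[str, int]:
             """Assign each distinct token an integer id in order of first appearance."""
             token_ids: Dict[str, int] = {}
             for doc in tokenized_docs:
                 for token in doc:
                     token_ids.setdefault(token, len(token_ids))
             return token_ids


         def doc_to_bow(doc: List[str], token_ids: Dict[str, int]) -> List[Tuple[int, int]]:
             """Convert one tokenized document to sorted (token_id, token_count) pairs."""
             counts = Counter(token_ids[t] for t in doc if t in token_ids)
             return sorted(counts.items())


         docs = [["apple", "pie", "apple"], ["pie", "crust"]]
         token_ids = build_dictionary(docs)
         corpus = [doc_to_bow(doc, token_ids) for doc in docs]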


   .. method:: matrix(self)
      :property:


      Either TF-IDF or document-term matrix with documents as rows and words as columns.
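
      The difference between the two matrix types can be sketched as follows.
      The weighting used here, `tf * log(N / df)`, is one common TF-IDF variant
      chosen for illustration; the weighting data_describe applies may differ,
      and `tfidf_matrix` is a hypothetical helper.

      .. code-block:: python

         import math
         from typing import List


         def tfidf_matrix(counts: List[List[int]]) -> List[List[float]]:
             """Weight a document-term count matrix by tf * log(N / df)."""
             n_docs = len(counts)
             n_terms = len(counts[0])
             # Document frequency: number of documents in which each term occurs.
             doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
             return [
                 [
                     row[j] * math.log(n_docs / doc_freq[j]) if doc_freq[j] else 0.0
                     for j in range(n_terms)
                 ]
                 for row in counts
             ]


         # Rows are documents, columns are words (word counts).
         counts = [[2, 1, 0], [0, 1, 1]]
         weighted = tfidf_matrix(counts)

      With this variant, a word that occurs in every document (the middle
      column) receives a weight of zero, while words confined to one document
      keep a positive weight.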


   .. method:: min_topics(self)
      :property:


      If num_topics is None, this number is the first number of topics a model will be trained on.


   .. method:: max_topics(self)
      :property:


      If num_topics is None, this number is the last number of topics a model will be trained on.


   .. method:: show(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)


      Displays most relevant terms for each topic.

      :param num_topic_words: The number of words to be displayed for each topic. Default is 10
      :param topic_names: A list of pre-defined names set for each of the topics. Default is None

      :returns: Pandas DataFrame displaying topics as columns and their
                relevant terms as rows. LDA/LSI models will display an extra column to
                the right of each topic column, showing each term's corresponding
                coefficient value.
      :rtype: pandas.DataFrame


   .. method:: fit(self, text_docs: List[str], model_type: Optional[str] = None, min_topics: int = 2, max_topics: int = 10, no_below: int = 10, no_above: float = 0.2, tfidf: bool = True, model_kwargs: Optional[Dict] = None)


      Trains topic model and assigns model to object as attribute.

      :param text_docs: A list of text documents in string format. These documents should
                        generally be pre-processed
      :param model_type: {'LDA', 'LSA', 'LSI', 'SVD', 'NMF'}
                         Defines the type of model/algorithm which will be used.
      :param min_topics: Starting number of topics to optimize for if number of topics
                         not provided. Default is 2
      :param max_topics: Maximum number of topics to optimize for if number of topics not
                         provided. Default is 10
      :param no_below: Minimum number of documents a word must appear in to be used in
                       training. Default is 10
      :param no_above: Maximum proportion of documents a word may appear in to be used in
                       training. Default is 0.2
      :param tfidf: If True, model created using TF-IDF matrix. Otherwise, document-term
                    matrix with wordcounts is used. Default is True.
      :param model_kwargs: Keyword arguments for the model; they should be valid for the
                           chosen `model_type`.

      :raises ValueError: Invalid `model_type`.


   .. method:: elbow_plot(self, viz_backend: str = None)


      Creates an elbow plot displaying coherence values vs number of topics.

      :param viz_backend: The visualization backend.

      :raises ValueError: No coherence values to plot.

      :returns: Elbow plot showing coherence values vs number of topics
      :rtype: fig


   .. method:: get_topic_nums(self)


      Obtains topic distributions (LDA model) or scores (LSA/NMF model).

      :returns: Array of topic distributions (LDA model) or scores (LSA/NMF model)
      :rtype: doc_topics


   .. method:: display_topic_keywords(self, num_topic_words: int = 10, topic_names: Optional[List[str]] = None)


      Creates Pandas DataFrame to display most relevant terms for each topic.

      :param num_topic_words: The number of words to be displayed for each topic.
                              Default is 10
      :param topic_names: A list of pre-defined names set for each of the topics.
                          Default is None

      :returns: Pandas DataFrame displaying topics as columns and their
                relevant terms as rows. LDA/LSI models will display an extra column to
                the right of each topic column, showing each term's corresponding
                coefficient value.
      :rtype: pandas.DataFrame
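
      The idea of "most relevant terms" can be sketched in plain Python as
      ranking each topic's terms by weight and keeping the first
      `num_topic_words`. Ranking by raw weight is an assumption made for this
      sketch; `top_terms`, the topic names, and the weights are all hypothetical.

      .. code-block:: python

         from typing import Dict, List


         def top_terms(topic_term_weights: Dict[str, float], num_topic_words: int = 10) -> List[str]:
             """Return the highest-weighted terms of one topic, best first."""
             ranked = sorted(topic_term_weights.items(), key=lambda kv: kv[1], reverse=True)
             return [term for term, _ in ranked[:num_topic_words]]


         # Made-up term weights for two topics.
         topics = {
             "Topic 1": {"goal": 0.30, "match": 0.25, "vote": 0.01, "team": 0.20},
             "Topic 2": {"vote": 0.35, "election": 0.28, "goal": 0.02, "party": 0.15},
         }
         keywords = {
             name: top_terms(weights, num_topic_words=2) for name, weights in topics.items()
         }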


   .. method:: top_documents_per_topic(self, text_docs: List[str], topic_names: Optional[List[str]] = None, num_docs: int = 10, summarize_docs: bool = False, summary_words: Optional[int] = None)


      Creates Pandas DataFrame to display most relevant documents for each topic.

      :param text_docs: A list of text documents in string format. This list must be in
                        the same order as the matrix or corpus on which the model was
                        trained
      :param topic_names: A list of pre-defined names set for each of the topics.
                          Default is None
      :param num_docs: The number of documents to display for each topic. Default is 10
      :param summarize_docs: If True, the documents will be summarized (if this is the
                             case, 'text_docs' should be formatted into sentences). Default is False
      :param summary_words: The number of words the summary should be limited to. Should
                            only be specified if summarize_docs set to True

      :returns: Pandas DataFrame displaying topics as columns and their
                most relevant documents as rows.
      :rtype: pandas.DataFrame
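
      The ordering requirement on `text_docs` follows from how documents are
      matched to their topic scores. A minimal sketch, assuming a
      document-by-topic score table whose row `i` describes `text_docs[i]`;
      `top_docs_for_topic` and the scores are hypothetical and do not reflect
      the widget's internals.

      .. code-block:: python

         from typing import List


         def top_docs_for_topic(
             doc_topics: List[List[float]], text_docs: List[str], topic: int, num_docs: int = 10
         ) -> List[str]:
             """Return the documents with the highest score for one topic, best first."""
             # Row i of doc_topics must describe text_docs[i], hence the ordering rule.
             order = sorted(
                 range(len(text_docs)), key=lambda i: doc_topics[i][topic], reverse=True
             )
             return [text_docs[i] for i in order[:num_docs]]


         text_docs = ["sports recap", "election news", "match preview"]
         doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]  # made-up scores
         top_sports = top_docs_for_topic(doc_topics, text_docs, topic=0, num_docs=2)

      If `text_docs` were shuffled relative to the score rows, the same lookup
      would silently return the wrong documents.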


   .. method:: visualize_topic_summary(self, viz_backend: str = 'pyLDAvis')


      Displays interactive pyLDAvis visual to understand topic model and documents.

      :param viz_backend: The visualization backend.
      :type viz_backend: str

      :raises TypeError: If the model is not an LDA model.

      :returns: A visual to understand topic model and/or documents relating to model