data_describe.text.text_preprocessing
=====================================

.. py:module:: data_describe.text.text_preprocessing

.. autoapi-nested-parse::

   Text preprocessing module.

   This module contains a number of functions for preprocessing text documents. The individual
   preprocessing functions can be classified as "Bag of Words Functions" (to_lower, remove_punct,
   remove_digits, remove_single_char_and_spaces, remove_stopwords, lemmatize, stem) or "Document
   Functions" (tokenize, bag_of_words_to_docs). Each of the functions in these groups returns a
   generator object; when using one of them on its own, the helper function to_list can be used
   to materialize the result, as depicted below.

   .. rubric:: Example

   Individual Document Functions should be processed as such::

       tokenized_docs = to_list(tokenize(original_docs), bow=False)

   Individual Bag of Words Functions should be processed as such::

       lower_case_docs_bow = to_list(to_lower(original_docs_bow))

.. autoapisummary::

   data_describe.text.text_preprocessing.tokenize
   data_describe.text.text_preprocessing.to_lower
   data_describe.text.text_preprocessing.remove_punct
   data_describe.text.text_preprocessing.remove_digits
   data_describe.text.text_preprocessing.remove_single_char_and_spaces
   data_describe.text.text_preprocessing.remove_stopwords
   data_describe.text.text_preprocessing.lemmatize
   data_describe.text.text_preprocessing.stem
   data_describe.text.text_preprocessing.bag_of_words_to_docs
   data_describe.text.text_preprocessing.create_tfidf_matrix
   data_describe.text.text_preprocessing.create_doc_term_matrix
   data_describe.text.text_preprocessing.preprocess_texts
   data_describe.text.text_preprocessing.to_list
   data_describe.text.text_preprocessing.ngram_freq
   data_describe.text.text_preprocessing.filter_dictionary

.. data:: nltk

.. function:: tokenize(text_docs: Iterable[str]) -> Iterable[Iterable[str]]

   Turns a list of documents into "bag of words" format.

   :param text_docs: A list of text documents in string format
   :returns: A generator expression for all of the processed documents

.. function:: to_lower(text_docs_bow: Iterable[Iterable[str]]) -> Iterable[Iterable[str]]

   Converts all letters in documents ("bag of words" format) to lowercase.

   :param text_docs_bow: A list of lists of words from a document
   :returns: A generator expression for all of the processed documents

.. function:: remove_punct(text_docs_bow: Iterable[Iterable[str]], replace_char: str = ' ', remove_all: bool = False) -> Iterable[Iterable[str]]

   Removes all punctuation from documents (e.g. periods, question marks, etc.).

   :param text_docs_bow: A list of lists of words from a document
   :param replace_char: Character to replace punctuation instances with. Default is a space
   :param remove_all: If True, removes all instances of punctuation from the document. Default is False, which only removes leading and/or trailing instances.
   :returns: A generator expression for all of the processed documents

.. function:: remove_digits(text_docs_bow: Iterable[Iterable[str]]) -> Iterable[Iterable[str]]

   Removes all numbers and words containing numerical digits from documents.

   :param text_docs_bow: A list of lists of words from a document
   :returns: A generator expression for all of the processed documents

.. function:: remove_single_char_and_spaces(text_docs_bow: Iterable[Iterable[str]]) -> Iterable[Iterable[str]]

   Removes all single-character words and blank spaces from documents.

   :param text_docs_bow: A list of lists of words from a document
   :returns: A generator expression for all of the processed documents
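Because each of the functions above returns a generator, several Bag of Words Functions can be
chained lazily and materialized once at the end with to_list. A minimal sketch, assuming the
behavior documented above (the ``original_docs`` sample strings are hypothetical)::

    from data_describe.text.text_preprocessing import (
        tokenize, to_lower, remove_punct, remove_digits,
        remove_single_char_and_spaces, to_list,
    )

    original_docs = ["Dr. Smith's 2nd visit was on 2020-01-01!", "A shorter note."]

    docs_bow = tokenize(original_docs)                  # strings -> token generators
    docs_bow = to_lower(docs_bow)                       # lowercase every token
    docs_bow = remove_punct(docs_bow, remove_all=True)  # strip all punctuation
    docs_bow = remove_digits(docs_bow)                  # drop tokens containing digits
    docs_bow = remove_single_char_and_spaces(docs_bow)  # drop 1-char tokens and blanks
    cleaned_docs_bow = to_list(docs_bow)                # materialize the nested generators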
.. function:: remove_stopwords(text_docs_bow: Iterable[Iterable[str]], custom_stopwords: Optional[List[str]] = None) -> Iterable[Iterable[str]]

   Removes all "stop words" from documents. "Stop words" are commonly used words which are typically of little value for NLP.

   :param text_docs_bow: A list of lists of words from a document
   :param custom_stopwords: An optional list of words to remove along with the stop words. Defaults to the nltk English stopwords.
   :returns: A generator expression for all of the processed documents

.. function:: lemmatize(text_docs_bow: Iterable[Iterable[str]]) -> Iterable[Iterable[str]]

   Lemmatizes all words in documents. Lemmatization groups together the inflected forms of a word by reducing them to a common base form so they can be analyzed as a single item.

   :param text_docs_bow: A list of lists of words from a document
   :returns: A generator expression for all of the processed documents

.. function:: stem(text_docs_bow: Iterable[Iterable[str]]) -> Iterable[Iterable[str]]

   Stems all words in documents. Stemming groups words together by taking the stems of their inflected forms so they can be analyzed as a single item.

   :param text_docs_bow: A list of lists of words from a document
   :returns: A generator expression for all of the processed documents

.. function:: bag_of_words_to_docs(text_docs_bow: Iterable[Iterable[str]]) -> Iterable[str]

   Converts documents from "bag of words" format back into documents, each stored as a single string.

   :param text_docs_bow: A list of lists of words from a document
   :returns: A generator expression for all of the processed documents

.. function:: create_tfidf_matrix(text_docs: Iterable[str], **kwargs) -> pd.DataFrame

   Creates a Term Frequency-Inverse Document Frequency matrix.

   :param text_docs: A list of strings of text documents
   :param \*\*kwargs: Other arguments to be passed to sklearn.feature_extraction.text.TfidfVectorizer
   :returns: Pandas DataFrame of the TF-IDF matrix with documents as rows and words as columns
   :rtype: matrix_df

.. function:: create_doc_term_matrix(text_docs: Iterable[str], **kwargs) -> pd.DataFrame

   Creates a document-term matrix which gives word counts per document.

   :param text_docs: A list of strings of text documents
   :param \*\*kwargs: Other arguments to be passed to sklearn.feature_extraction.text.CountVectorizer
   :returns: Pandas DataFrame of the document-term matrix with documents as rows and words as columns
   :rtype: matrix_df
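Both matrix builders forward keyword arguments to the underlying scikit-learn vectorizer, so
standard vectorizer options such as ``min_df`` or ``stop_words`` can be passed straight through.
A minimal sketch (the sample documents are hypothetical)::

    from data_describe.text.text_preprocessing import (
        create_doc_term_matrix, create_tfidf_matrix,
    )

    docs = ["the cat sat on the mat", "the dog sat", "the cat ran away"]

    # Raw word counts; min_df is forwarded to CountVectorizer
    dtm = create_doc_term_matrix(docs, min_df=1)

    # TF-IDF weights; stop_words is forwarded to TfidfVectorizer
    tfidf = create_tfidf_matrix(docs, stop_words="english")

    print(dtm.shape, tfidf.shape)  # rows: documents, columns: vocabulary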
.. function:: preprocess_texts(text_docs: Iterable[str], lem: bool = False, stem: bool = False, custom_pipeline: List = None) -> Iterable[Any]

   Pre-process a text corpus. Cleans a list of documents by running it through a customizable text-preprocessing pipeline.

   :param text_docs: A list of strings of text documents (also accepts arrays and Pandas series)
   :param lem: If True, lemmatization becomes part of the pre-processing. Recommended to set as False and run a user-created lemmatization function if the pipeline is customized. Default is False.
   :param stem: If True, stemming becomes part of the pre-processing. Recommended to set as False and run a user-created stemming function if the pipeline is customized. Default is False.
   :param custom_pipeline: A custom list of strings and/or function objects naming the functions that text_docs_bow will run through. Default is None, which uses the pipeline: ['tokenize', 'to_lower', 'remove_punct', 'remove_digits', 'remove_single_char_and_spaces', 'remove_stopwords']
   :returns: List of lists of words for each document which have undergone the pre-processing pipeline
   :rtype: text_docs

.. function:: to_list(text_docs_gen) -> List[Any]

   Converts a generator expression from an individual preprocessing function into a list.

   :param text_docs_gen: A generator expression for the processed text documents
   :returns: A list of processed text documents, or a list of tokens (list of strings) for each document

.. function:: ngram_freq(text_docs_bow: Iterable[Iterable[str]], n: int = 3, only_n: bool = False) -> 'nltk.FreqDist'

   Generates a frequency distribution of "n-grams" from all of the text documents.

   :param text_docs_bow: A list of lists of words from a document
   :param n: Highest `n` for the n-gram sequence to include. Default is 3
   :param only_n: If True, only includes n-grams for the specified value of `n`. Default is False, which also includes n-grams for all values leading up to `n`
   :raises ValueError: `n` must be >= 2.
   :returns: Dictionary which contains all identified n-grams as keys and their respective counts as values
   :rtype: freq

.. function:: filter_dictionary(text_docs: List[str], no_below: int = 10, no_above: float = 0.2, **kwargs)

   Filters out words outside the specified document-frequency thresholds.

   :param text_docs: A list of lists of words from a document; can include n-grams up to 3.
   :param no_below: Keep tokens which are contained in at least `no_below` documents. Default is 10.
   :param no_above: Keep tokens which are contained in no more than `no_above` portion of documents (fraction of total corpus size). Default is 0.2.
   :param \*\*kwargs: Other arguments to be passed to gensim.corpora.Dictionary.filter_extremes
   :returns: dictionary: Gensim Dictionary encapsulating the mapping between normalized words and their integer ids.
             corpus: Bag of Words (BoW) representation of documents (token_id, token_count).
   :rtype: dictionary, corpus
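As a sketch of how these pieces fit together end to end, under the assumption that
``filter_dictionary`` returns the dictionary and corpus described above and accepts the
tokenized documents produced by ``preprocess_texts`` (the sample corpus is hypothetical)::

    from data_describe.text.text_preprocessing import (
        preprocess_texts, ngram_freq, filter_dictionary,
    )

    docs = [
        "The quick brown fox jumps over the lazy dog.",
        "The lazy dog sleeps while the quick fox runs.",
    ] * 10

    # Default pipeline: tokenize -> to_lower -> remove_punct -> remove_digits
    #                   -> remove_single_char_and_spaces -> remove_stopwords
    docs_bow = preprocess_texts(docs)

    # Frequency distribution of bigrams and trigrams across the corpus
    freq = ngram_freq(docs_bow, n=3)

    # Gensim dictionary and BoW corpus, dropping rare and overly common tokens
    dictionary, corpus = filter_dictionary(docs_bow, no_below=2, no_above=0.5)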