data_describe.text.text_preprocessing

Text preprocessing module.

This module contains a number of methods by which text documents can be preprocessed. The individual preprocessing functions can be classified as “Bag of Words Functions” (to_lower, remove_punct, remove_digits, remove_single_char_and_spaces, remove_stopwords, lemmatize, stem) or “Document Functions” (tokenize, bag_of_words_to_docs). Each of the functions in these groups returns a generator object; when using them on their own, the helper function to_list can be used to materialize the results, as shown below.

Example

Individual Document Functions should be processed as such:

tokenized_docs = to_list(tokenize(original_docs), bow=False)

Individual Bag of Words Functions should be processed as such:

lower_case_docs_bow = to_list(to_lower(original_docs_bow))
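
A minimal end-to-end sketch combining the two patterns (the sample documents are illustrative, and all functions are assumed to be imported from data_describe.text.text_preprocessing):

original_docs = ["The 1st document!", "And a 2nd one."]

# Document Function first: tokenize, materialized with bow=False
tokenized_docs = to_list(tokenize(original_docs), bow=False)

# Bag of Words Functions accept and return generators, so they chain lazily
clean_docs_bow = to_list(remove_punct(to_lower(tokenized_docs)))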

tokenize(text_docs: Iterable[str])

Turns a list of documents into “bag of words” format.

to_lower(text_docs_bow: Iterable[Iterable[str]])

Converts all letters in documents (“bag of words” format) to lowercase.

remove_punct(text_docs_bow: Iterable[Iterable[str]], replace_char: str = '', remove_all: bool = False)

Removes all punctuation from documents (e.g. periods, question marks, etc.).

remove_digits(text_docs_bow: Iterable[Iterable[str]])

Removes all numbers and words containing numerical digits from documents.

remove_single_char_and_spaces(text_docs_bow: Iterable[Iterable[str]])

Removes all single-character words and blank spaces from documents.

remove_stopwords(text_docs_bow: Iterable[Iterable[str]], custom_stopwords: Optional[List[str]] = None)

Removes all “stop words” from documents.

lemmatize(text_docs_bow: Iterable[Iterable[str]])

Lemmatizes all words in documents.

stem(text_docs_bow: Iterable[Iterable[str]])

Stems all words in documents.

bag_of_words_to_docs(text_docs_bow: Iterable[Iterable[str]])

Converts a list of documents from “bag of words” format back into strings.

create_tfidf_matrix(text_docs: Iterable[str], **kwargs)

Creates a Term Frequency-Inverse Document Frequency matrix.

create_doc_term_matrix(text_docs: Iterable[str], **kwargs)

Creates a document-term matrix which gives word counts per document.

preprocess_texts(text_docs: Iterable[str], lem: bool = False, stem: bool = False, custom_pipeline: List = None)

Pre-process a text corpus.

to_list(text_docs_gen)

Converts a generator expression from an individual preprocessing function into a list.

ngram_freq(text_docs_bow: Iterable[Iterable[str]], n: int = 3, only_n: bool = False)

Generates frequency distribution of “n-grams” from all of the text documents.

filter_dictionary(text_docs: List[str], no_below: int = 10, no_above: float = 0.2, **kwargs)

Filters words outside specified frequency thresholds.

data_describe.text.text_preprocessing.tokenize(text_docs: Iterable[str]) → Iterable[Iterable[str]]

Turns a list of documents into “bag of words” format.

Parameters

text_docs – A list of text documents in string format

Returns

A generator expression for all of the processed documents
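
A short usage sketch (the sample documents are illustrative; the exact tokens depend on the underlying tokenizer):

from data_describe.text.text_preprocessing import to_list, tokenize

docs = ["Hello world.", "Another document."]
tokens = to_list(tokenize(docs), bow=False)
# e.g. [["Hello", "world", "."], ["Another", "document", "."]]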

data_describe.text.text_preprocessing.to_lower(text_docs_bow: Iterable[Iterable[str]]) → Iterable[Iterable[str]]

Converts all letters in documents (“bag of words” format) to lowercase.

Parameters

text_docs_bow – A list of lists of words from a document

Returns

A generator expression for all of the processed documents

data_describe.text.text_preprocessing.remove_punct(text_docs_bow: Iterable[Iterable[str]], replace_char: str = '', remove_all: bool = False) → Iterable[Iterable[str]]

Removes all punctuation from documents (e.g. periods, question marks, etc.).

Parameters
  • text_docs_bow – A list of lists of words from a document

  • replace_char – Character to replace punctuation instances with. Default is an empty string.

  • remove_all – If True, removes all instances of punctuation from document. Default is False, which only removes leading and/or trailing instances.

Returns

A generator expression for all of the processed documents
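
A sketch of the two modes (the input is illustrative):

from data_describe.text.text_preprocessing import remove_punct, to_list

docs_bow = [["well-known", "facts!"]]
edges = to_list(remove_punct(docs_bow))                        # leading/trailing punctuation only
everything = to_list(remove_punct(docs_bow, remove_all=True))  # every punctuation instance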

data_describe.text.text_preprocessing.remove_digits(text_docs_bow: Iterable[Iterable[str]]) → Iterable[Iterable[str]]

Removes all numbers and words containing numerical digits from documents.

Parameters

text_docs_bow – A list of lists of words from a document

Returns

A generator expression for all of the processed documents

data_describe.text.text_preprocessing.remove_single_char_and_spaces(text_docs_bow: Iterable[Iterable[str]]) → Iterable[Iterable[str]]

Removes all single-character words and blank spaces from documents.

Parameters

text_docs_bow – A list of lists of words from a document

Returns

A generator expression for all of the processed documents

data_describe.text.text_preprocessing.remove_stopwords(text_docs_bow: Iterable[Iterable[str]], custom_stopwords: Optional[List[str]] = None) → Iterable[Iterable[str]]

Removes all “stop words” from documents.

“Stop words” are commonly used words that typically carry little meaning for NLP tasks.

Parameters
  • text_docs_bow – A list of lists of words from a document

  • custom_stopwords – An optional list of additional words to remove along with the standard stop words. Default is None, in which case only NLTK’s English stop words are removed.

Returns

A generator expression for all of the processed documents
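
For example (illustrative input), a domain-specific word can be dropped alongside the standard stop words:

from data_describe.text.text_preprocessing import remove_stopwords, to_list

docs_bow = [["the", "data", "is", "clean"]]
filtered = to_list(remove_stopwords(docs_bow, custom_stopwords=["data"]))
# "the" and "is" fall out as English stop words; "data" via the custom list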

data_describe.text.text_preprocessing.lemmatize(text_docs_bow: Iterable[Iterable[str]]) → Iterable[Iterable[str]]

Lemmatizes all words in documents.

Lemmatization groups together the inflected forms of a word, reducing them to a common base form so they can be analyzed as a single item.

Parameters

text_docs_bow – A list of lists of words from a document

Returns

A generator expression for all of the processed documents

data_describe.text.text_preprocessing.stem(text_docs_bow: Iterable[Iterable[str]]) → Iterable[Iterable[str]]

Stems all words in documents.

Stemming is grouping words together by taking the stems of their inflected forms so they can be analyzed as a single item.

Parameters

text_docs_bow – A list of lists of words from a document

Returns

A generator expression for all of the processed documents
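
A side-by-side sketch of lemmatization versus stemming (illustrative input; exact output depends on the underlying lemmatizer and stemmer):

from data_describe.text.text_preprocessing import lemmatize, stem, to_list

docs_bow = [["studies", "running"]]
lemmas = to_list(lemmatize(docs_bow))  # dictionary forms, e.g. "study"
stems = to_list(stem(docs_bow))        # truncated stems, e.g. "studi"

Stemming is faster but can produce non-words; lemmatization returns valid dictionary forms.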

data_describe.text.text_preprocessing.bag_of_words_to_docs(text_docs_bow: Iterable[Iterable[str]]) → Iterable[str]

Converts a list of documents from “bag of words” format back into strings.

Each document is converted back into a single string.

Parameters

text_docs_bow – A list of lists of words from a document

Returns

A generator expression for all of the processed documents
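
A round-trip sketch (illustrative input; as a Document Function, bag_of_words_to_docs is materialized with bow=False):

from data_describe.text.text_preprocessing import bag_of_words_to_docs, to_list

docs_bow = [["hello", "world"], ["second", "document"]]
docs = to_list(bag_of_words_to_docs(docs_bow), bow=False)
# e.g. ["hello world", "second document"]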

data_describe.text.text_preprocessing.create_tfidf_matrix(text_docs: Iterable[str], **kwargs) → pd.DataFrame

Creates a Term Frequency-Inverse Document Frequency matrix.

Parameters
  • text_docs – A list of strings of text documents

  • **kwargs – Other arguments to be passed to sklearn.feature_extraction.text.TfidfVectorizer

Returns

Pandas DataFrame of the TF-IDF matrix with documents as rows and words as columns

Return type

matrix_df
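
For example, keyword arguments pass straight through to scikit-learn’s TfidfVectorizer (stop_words is a standard TfidfVectorizer option):

from data_describe.text.text_preprocessing import create_tfidf_matrix

docs = ["the cat sat on the mat", "the dog sat on the log"]
tfidf = create_tfidf_matrix(docs, stop_words="english")
# one row per document, one column per term, TF-IDF weights as values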

data_describe.text.text_preprocessing.create_doc_term_matrix(text_docs: Iterable[str], **kwargs) → pd.DataFrame

Creates a document-term matrix which gives word counts per document.

Parameters
  • text_docs – A list of strings of text documents

  • **kwargs – Other arguments to be passed to sklearn.feature_extraction.text.CountVectorizer

Returns

Pandas DataFrame of the document-term matrix with documents as rows and words as columns

Return type

matrix_df
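
A minimal sketch (illustrative documents):

from data_describe.text.text_preprocessing import create_doc_term_matrix

docs = ["the cat sat on the mat", "the dog sat on the log"]
dtm = create_doc_term_matrix(docs)
dtm["sat"]  # per-document counts for the term "sat"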

data_describe.text.text_preprocessing.preprocess_texts(text_docs: Iterable[str], lem: bool = False, stem: bool = False, custom_pipeline: List = None) → Iterable[Any]

Pre-process a text corpus.

Cleans list of documents by running through a customizable text-preprocessing pipeline.

Parameters
  • text_docs – A list of strings of text documents (also accepts arrays and Pandas series)

  • lem – If True, lemmatization becomes part of the pre-processing. When customizing the pipeline, it is recommended to leave this False and add a user-defined lemmatization function instead. Default is False.

  • stem – If True, stemming becomes part of the pre-processing. When customizing the pipeline, it is recommended to leave this False and add a user-defined stemming function instead. Default is False.

  • custom_pipeline – A custom list of function names (strings) and/or function objects that the documents will be run through. Default is None, which uses the pipeline: ['tokenize', 'to_lower', 'remove_punct', 'remove_digits', 'remove_single_char_and_spaces', 'remove_stopwords']

Returns

List of lists of words for each document after the pre-processing pipeline has been applied

Return type

text_docs
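
A sketch of the default pipeline and a customized one (the documents are illustrative; a custom pipeline may mix function-name strings with user-defined function objects):

from data_describe.text.text_preprocessing import preprocess_texts

docs = ["The 1st document!", "And a 2nd one."]
clean = preprocess_texts(docs)  # default six-step pipeline
partial = preprocess_texts(docs, custom_pipeline=["tokenize", "to_lower"])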

data_describe.text.text_preprocessing.to_list(text_docs_gen) → List[Any]

Converts a generator expression from an individual preprocessing function into a list.

Parameters

text_docs_gen – A generator expression for the processed text documents

Returns

A list of processed text documents, or a list of tokens (a list of strings) for each document

data_describe.text.text_preprocessing.ngram_freq(text_docs_bow: Iterable[Iterable[str]], n: int = 3, only_n: bool = False) → 'nltk.FreqDist'

Generates frequency distribution of “n-grams” from all of the text documents.

Parameters
  • text_docs_bow – A list of lists of words from a document

  • n – Highest n for n-gram sequence to include. Default is 3

  • only_n – If True, will only include n-grams for the specified value of n. Default is False, which also includes n-grams for all orders leading up to n

Raises

ValueError – n must be >= 2.

Returns

Dictionary which contains all identified n-grams as keys and their respective counts as values

Return type

freq
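
For example (illustrative input), counting bigrams only and inspecting the most frequent ones with nltk.FreqDist.most_common:

from data_describe.text.text_preprocessing import ngram_freq

docs_bow = [["the", "cat", "sat", "on", "the", "mat"]]
freq = ngram_freq(docs_bow, n=2, only_n=True)  # bigrams only; n must be >= 2
print(freq.most_common(3))                     # top bigrams with their counts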

data_describe.text.text_preprocessing.filter_dictionary(text_docs: List[str], no_below: int = 10, no_above: float = 0.2, **kwargs)

Filters words outside specified frequency thresholds.

Parameters
  • text_docs – A list of lists of words from a document; may include n-grams up to 3.

  • no_below – Keep tokens which are contained in at least no_below documents. Default is 10.

  • no_above – Keep tokens which are contained in no more than no_above portion of documents (fraction of total corpus size). Default is 0.2.

  • **kwargs – Other arguments to be passed to gensim.corpora.Dictionary.filter_extremes

Returns

dictionary – Gensim Dictionary encapsulating the mapping between normalized words and their integer ids.

corpus – Bag of Words (BoW) representation of documents (token_id, token_count).

Return type

dictionary, corpus
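
A minimal sketch, assuming tokenized documents as input (per the parameter description above) and the (dictionary, corpus) pair described under Returns; the thresholds are loosened so the toy corpus survives filtering:

from data_describe.text.text_preprocessing import filter_dictionary

docs_bow = [["data", "science"], ["data", "pipeline"]]
dictionary, corpus = filter_dictionary(docs_bow, no_below=1, no_above=1.0)
# dictionary maps tokens to integer ids; corpus holds (token_id, token_count) pairs per document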