data_describe.privacy.detection

`sensitive_data`(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: int = _SAMPLE_SIZE, engine_backend=None, compute_backend: Optional[str] = None)	Identifies, redacts, and/or encrypts PII data.
`compute_sensitive_data`(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: Union[int, float] = _SAMPLE_SIZE, engine_backend=None)	Identifies, redacts, and/or encrypts PII data.
`identify_pii`(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)	Identifies infotypes contained in a string.
`create_mapping`(text, response)	Identifies sensitive data and creates a mapping with the hashed data.
`redact_info`(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)	Redact sensitive data with mapping between hashed values and infotype.
`identify_column_infotypes`(data_series, engine_backend, sample_size: Union[int, float] = _SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)	Identifies the infotype of a pandas series object using a sample of rows.
`identify_infotypes`(df, engine_backend, sample_size=_SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)	Identify infotypes for each column in the dataframe.
`encrypt_text`(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)	Encrypts text by hashing the sensitive data it contains.
`hash_string`(text)	Applies SHA256 text hashing.
`presidio_engine`()	Initialize presidio engine.

data_describe.privacy.detection.logger

data_describe.privacy.detection.sensitive_data(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: int = _SAMPLE_SIZE, engine_backend=None, compute_backend: Optional[str] = None)

Identifies, redacts, and/or encrypts PII data.

Note

sensitive_data uses Microsoft’s Presidio in the backend. Presidio can be used to help identify sensitive data. However, because Presidio uses trained ML models, there is no guarantee that Presidio will find all sensitive information.

Parameters

df (DataFrame) – The dataframe
mode (str) – One of {‘redact’, ‘encrypt’}. ‘redact’: redact the sensitive data. ‘encrypt’: anonymize the sensitive data.
detect_infotypes (bool) – If True, identifies infotypes for each column
columns ([str]) – List of column names (strings). Defaults to None.
score_threshold (float) – Minimum confidence value for detected entities to be returned. Default is 0.2.
sample_size (int) – Number of sampled rows used for identifying column infotypes. Default is 100.
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
compute_backend (str) – The compute backend to use. Defaults to None.

Raises

ValueError – Invalid input data type.
TypeError – columns is not a list of strings.

Returns

SensitiveDataWidget

class data_describe.privacy.detection.SensitiveDataWidget(engine=None, redact=None, encrypt=None, infotypes=None, sample_size=None, **kwargs)

Bases: data_describe._widget.BaseWidget

Interface for collecting additional information about the sensitive data widget.

show(self, **kwargs): Show the transformed data or infotypes.

data_describe.privacy.detection.compute_sensitive_data(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: Union[int, float] = _SAMPLE_SIZE, engine_backend=None)

Identifies, redacts, and/or encrypts PII data.

Note: sensitive_data uses Microsoft’s Presidio in the backend. Presidio can be used to help identify sensitive data. However, because Presidio uses trained ML models, there is no guarantee that Presidio will find all sensitive information.

Parameters

df (DataFrame) – The dataframe
mode (str) – One of {‘redact’, ‘encrypt’}. ‘redact’: redact the sensitive data. ‘encrypt’: anonymize the sensitive data.
detect_infotypes (bool) – If True, identifies infotypes for each column
columns ([str]) – List of column names (strings). Defaults to None.
score_threshold (float) – Minimum confidence value for detected entities to be returned. Default is 0.2.
sample_size (int or float) – Number of sampled rows used for identifying column infotypes. Default is 100.
engine_backend – The backend analyzer engine. Default is presidio_analyzer.

Raises

ValueError – sample_size is greater than the size of the data.

Returns

SensitiveDataWidget

data_describe.privacy.detection.identify_pii(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Identifies infotypes contained in a string.

Parameters

text (str) – A string value
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

List of presidio_analyzer.recognizer_result.RecognizerResult
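The role of score_threshold can be illustrated without Presidio. The following is a minimal sketch, not the library's implementation: `Detection` is a hypothetical stand-in for a RecognizerResult, `toy_identify` is a hypothetical regex-based detector, and the confidence scores are made up for illustration. It assumes that results scoring at or above the threshold are kept.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Hypothetical stand-in for a Presidio RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


# Toy patterns with made-up confidence scores (assumption for illustration);
# a real engine derives scores from its recognizers and models.
_PATTERNS = [
    ("EMAIL_ADDRESS", re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), 0.9),
    ("PHONE_NUMBER", re.compile(r"\b\d{3}-\d{4}\b"), 0.4),
]


def toy_identify(text: str, score_threshold: float = 0.2) -> List[Detection]:
    """Return detections whose score is at least `score_threshold`."""
    found = [
        Detection(name, match.start(), match.end(), score)
        for name, pattern, score in _PATTERNS
        for match in pattern.finditer(text)
    ]
    return [d for d in found if d.score >= score_threshold]


sample = "Mail a@b.com or call 555-0100"
print([d.entity_type for d in toy_identify(sample)])
# ['EMAIL_ADDRESS', 'PHONE_NUMBER']
print([d.entity_type for d in toy_identify(sample, score_threshold=0.5)])
# ['EMAIL_ADDRESS']
```

Raising the threshold trades recall for precision: the low-confidence phone number match is dropped at 0.5 but kept at the documented default of 0.2.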

data_describe.privacy.detection.create_mapping(text, response)

Identifies sensitive data and creates a mapping with the hashed data.

Parameters

text (str) – String value
response – List of presidio_analyzer.recognizer_result.RecognizerResult

Returns

word_mapping (dict) – Mapping of the hashed data with the redacted string
ref_text (str) – String with hashed values
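One way to picture the two return values is the sketch below. It is not the library's code: `Detection` is a hypothetical stand-in for a RecognizerResult, `build_mapping` is a hypothetical helper, and the choice of mapping each hash to its infotype is an assumption based on the description of redact_info. Spans are assumed not to overlap.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    """Hypothetical stand-in for a Presidio RecognizerResult."""
    entity_type: str
    start: int
    end: int


def build_mapping(text: str, detections: List[Detection]) -> Tuple[Dict[str, str], str]:
    """Hash each detected span and return (hash -> infotype mapping, rewritten text)."""
    word_mapping: Dict[str, str] = {}
    ref_text = text
    # Replace the rightmost span first so the offsets of earlier spans stay valid.
    for det in sorted(detections, key=lambda d: d.start, reverse=True):
        original = text[det.start:det.end]
        hashed = hashlib.sha256(original.encode("utf-8")).hexdigest()
        word_mapping[hashed] = det.entity_type
        ref_text = ref_text[:det.start] + hashed + ref_text[det.end:]
    return word_mapping, ref_text


text = "Call Jane at 555-0100"
detections = [Detection("PERSON", 5, 9), Detection("PHONE_NUMBER", 13, 21)]
word_mapping, ref_text = build_mapping(text, detections)
print(sorted(word_mapping.values()))  # ['PERSON', 'PHONE_NUMBER']
print("Jane" in ref_text)             # False
```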

data_describe.privacy.detection.redact_info(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Redact sensitive data with mapping between hashed values and infotype.

Parameters

text (str) – String value
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

String with redacted information
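As a rough illustration of redaction, the sketch below replaces each detected span with a placeholder naming its infotype. The `<INFOTYPE>` placeholder format, the `redact_spans` helper, and the `(entity_type, start, end)` tuple shape are all assumptions made for this example; the actual output format of redact_info is not specified here.

```python
from typing import List, Tuple

# Each span is (entity_type, start, end); a hypothetical shape for this sketch.
Span = Tuple[str, int, int]


def redact_spans(text: str, spans: List[Span]) -> str:
    """Replace each non-overlapping span with an '<ENTITY_TYPE>' placeholder."""
    # Work from right to left so earlier offsets are not shifted by replacements.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


print(redact_spans("Call Jane at 555-0100", [("PERSON", 5, 9), ("PHONE_NUMBER", 13, 21)]))
# Call <PERSON> at <PHONE_NUMBER>
```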

data_describe.privacy.detection.identify_column_infotypes(data_series, engine_backend, sample_size: Union[int, float] = _SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Identifies the infotype of a pandas series object using a sample of rows.

Parameters

data_series (Series) – A Series
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
sample_size (int or float) – Number of rows to sample from
score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

List of infotypes
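The sample-then-aggregate idea can be sketched with plain Python lists. This is an illustration under stated assumptions, not the library's implementation: `column_infotypes` and `toy_detect` are hypothetical names, the detector is a trivial stub, and the ValueError mirrors the one documented for compute_sensitive_data when sample_size exceeds the data size.

```python
import random
from typing import Callable, List, Sequence, Set


def column_infotypes(
    values: Sequence[object],
    detect: Callable[[str], List[str]],
    sample_size: int = 100,
    seed: int = 0,
) -> List[str]:
    """Run `detect` on a random sample of rows and return the distinct infotypes found."""
    if sample_size > len(values):
        raise ValueError("sample_size is greater than the number of rows")
    sample = random.Random(seed).sample(list(values), sample_size)
    infotypes: Set[str] = set()
    for value in sample:
        infotypes.update(detect(str(value)))
    return sorted(infotypes)


def toy_detect(text: str) -> List[str]:
    """Stub detector (assumption): flags anything containing '@' as an email address."""
    return ["EMAIL_ADDRESS"] if "@" in text else []


emails = ["a@example.com", "b@example.com", "n/a", "c@example.com"]
print(column_infotypes(emails, toy_detect, sample_size=4))  # ['EMAIL_ADDRESS']
```

Sampling keeps detection cost bounded on large columns, at the price of possibly missing infotypes that appear only in rows outside the sample.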

data_describe.privacy.detection.identify_infotypes(df, engine_backend, sample_size=_SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Identify infotypes for each column in the dataframe.

Parameters

df (DataFrame) – The dataframe
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
sample_size (int) – Number of rows to sample from
score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

Dictionary with columns as keys and values as infotypes detected

data_describe.privacy.detection.encrypt_text(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Encrypts text by hashing the sensitive data it contains.

Parameters

text (str) – A string value
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

Text with hashed sensitive data

data_describe.privacy.detection.hash_string(text)

Applies SHA256 text hashing.

Parameters: text (str) – The string value
Returns: sha_signature – Hashed text
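For reference, SHA-256 hashing of a string with the standard library looks like the sketch below. `sha256_hex` is a hypothetical name, and the UTF-8 encoding and hexadecimal output are assumptions; the exact encoding and return format of hash_string are not stated here.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of `text` as a 64-character hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


digest = sha256_hex("jane.doe@example.com")
print(len(digest))                                   # 64
print(digest == sha256_hex("jane.doe@example.com"))  # True
```

Because the hash is deterministic, equal inputs always map to the same value, so hashed columns can still be grouped or joined. The same property means short or guessable inputs can be recovered by hashing candidate values, so unsalted hashing should not be treated as strong anonymization.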

data_describe.privacy.detection.presidio_engine()

Initialize presidio engine.

Returns: Presidio engine