data_describe.privacy.detection

sensitive_data(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: int = _SAMPLE_SIZE, engine_backend=None, compute_backend: Optional[str] = None)

Identifies, redacts, and/or encrypts PII data.

compute_sensitive_data(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: Union[int, float] = _SAMPLE_SIZE, engine_backend=None)

Identifies, redacts, and/or encrypts PII data.

identify_pii(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Identifies infotypes contained in a string.

create_mapping(text, response)

Identifies sensitive data and creates a mapping with the hashed data.

redact_info(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Redacts sensitive data using a mapping between hashed values and infotypes.

identify_column_infotypes(data_series, engine_backend, sample_size: Union[int, float] = _SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Identifies the infotype of a pandas series object using a sample of rows.

identify_infotypes(df, engine_backend, sample_size=_SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Identify infotypes for each column in the dataframe.

encrypt_text(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Encrypts text using Python’s hash function.

hash_string(text)

Applies SHA256 text hashing.

presidio_engine()

Initialize presidio engine.

data_describe.privacy.detection.logger

data_describe.privacy.detection.sensitive_data(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: int = _SAMPLE_SIZE, engine_backend=None, compute_backend: Optional[str] = None)

Identifies, redacts, and/or encrypts PII data.

Note

sensitive_data uses Microsoft’s Presidio in the backend. Presidio can be used to help identify sensitive data. However, because Presidio uses trained ML models, there is no guarantee that Presidio will find all sensitive information.

Parameters
  • df (DataFrame) – The dataframe

  • mode (str) – One of {'redact', 'encrypt'}. 'redact' redacts the sensitive data; 'encrypt' anonymizes the sensitive data. Default is 'redact'.

  • detect_infotypes (bool) – If True, identifies infotypes for each column

  • columns ([str]) – A list of column names, given as strings. Defaults to None.

  • score_threshold (float) – Minimum confidence value for detected entities to be returned. Default is 0.2.

  • sample_size (int) – Number of sampled rows used for identifying column infotypes. Default is 100.

  • engine_backend – The backend analyzer engine. Default is presidio_analyzer.

  • compute_backend (str) – Select compute backend

Raises
  • ValueError – Invalid input data type.

  • TypeError – columns is not a list of strings.

Returns

SensitiveDataWidget
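
The difference between the two mode values can be illustrated without Presidio. The following is a minimal sketch, not the library's implementation: apply_mode is a hypothetical helper, the span offsets are supplied by hand rather than detected, the `<ENTITY_TYPE>` placeholder format is an assumption, and raising ValueError for an unknown mode is this sketch's own choice.

```python
import hashlib


def apply_mode(text: str, start: int, end: int, entity_type: str, mode: str = "redact") -> str:
    """Replace text[start:end] according to mode (hypothetical helper)."""
    value = text[start:end]
    if mode == "redact":
        # Assumed placeholder format: the infotype name in angle brackets.
        replacement = f"<{entity_type}>"
    elif mode == "encrypt":
        # Anonymize by substituting the SHA256 hex digest of the value.
        replacement = hashlib.sha256(value.encode()).hexdigest()
    else:
        # Raising here is this sketch's choice, not documented behavior.
        raise ValueError(f"unknown mode: {mode!r}")
    return text[:start] + replacement + text[end:]


text = "Email jane@example.com today"
# The email address occupies characters 6 to 22 of the text.
redacted = apply_mode(text, 6, 22, "EMAIL_ADDRESS", mode="redact")
encrypted = apply_mode(text, 6, 22, "EMAIL_ADDRESS", mode="encrypt")
print(redacted)
print(encrypted)
```

In redact mode the original value is not recoverable from the output; in encrypt mode the same input value always maps to the same digest, so equal values remain comparable after anonymization.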

class data_describe.privacy.detection.SensitiveDataWidget(engine=None, redact=None, encrypt=None, infotypes=None, sample_size=None, **kwargs)

Bases: data_describe._widget.BaseWidget

Interface for collecting additional information about the sensitive data widget.

show(self, **kwargs)

Show the transformed data or infotypes.

data_describe.privacy.detection.compute_sensitive_data(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: Union[int, float] = _SAMPLE_SIZE, engine_backend=None)

Identifies, redacts, and/or encrypts PII data.

Note: sensitive_data uses Microsoft’s Presidio in the backend. Presidio can be used to help identify sensitive data. However, because Presidio uses trained ML models, there is no guarantee that Presidio will find all sensitive information.

Parameters
  • df (DataFrame) – The dataframe

  • mode (str) – One of {'redact', 'encrypt'}. 'redact' redacts the sensitive data; 'encrypt' anonymizes the sensitive data. Default is 'redact'.

  • detect_infotypes (bool) – If True, identifies infotypes for each column

  • columns ([str]) – A list of column names, given as strings. Defaults to None.

  • score_threshold (float) – Minimum confidence value for detected entities to be returned. Default is 0.2.

  • sample_size (int or float) – Number of sampled rows used for identifying column infotypes. Default is 100.

  • engine_backend – The backend analyzer engine. Default is presidio_analyzer.

Raises

ValueError – sample_size is greater than the data size.

Returns

SensitiveDataWidget

data_describe.privacy.detection.identify_pii(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Identifies infotypes contained in a string.

Parameters
  • text (str) – A string value

  • engine_backend – The backend analyzer engine. Default is presidio_analyzer.

  • score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

List of presidio_analyzer.recognizer_result.RecognizerResult
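
The role of score_threshold can be sketched with plain Python objects. This is an illustration only: FakeResult is a hypothetical stand-in for RecognizerResult carrying just the fields used here, filter_by_score is not part of data_describe, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeResult:
    """Hypothetical stand-in for presidio_analyzer's RecognizerResult."""

    entity_type: str
    start: int
    end: int
    score: float


def filter_by_score(results: List[FakeResult], score_threshold: float = 0.2) -> List[FakeResult]:
    """Keep detections whose confidence is at least score_threshold (assumed inclusive)."""
    return [r for r in results if r.score >= score_threshold]


results = [
    FakeResult("EMAIL_ADDRESS", 11, 27, 0.95),
    FakeResult("PERSON", 0, 4, 0.10),
]
kept = filter_by_score(results, score_threshold=0.2)
print([r.entity_type for r in kept])
```

Raising the threshold trades recall for precision: fewer low-confidence detections are returned, at the cost of possibly missing real sensitive values.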

data_describe.privacy.detection.create_mapping(text, response)

Identifies sensitive data and creates a mapping with the hashed data.

Parameters
  • text (str) – String value

  • response – List of presidio_analyzer.recognizer_result.RecognizerResult

Returns

  • word_mapping (dict) – Mapping of the hashed data with the redacted string

  • ref_text (str) – String with hashed values
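
One way to picture the two return values is the sketch below. It is written under assumptions rather than taken from the library source: Span is a hypothetical stand-in for RecognizerResult, sketch_create_mapping is an illustrative name, and both the use of SHA256 and the `<ENTITY_TYPE>` value format are assumed.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-in for RecognizerResult (only the fields used here).
Span = namedtuple("Span", ["entity_type", "start", "end"])


def sketch_create_mapping(text, response):
    """Return (word_mapping, ref_text) for the detected spans."""
    word_mapping = {}
    ref_text = text
    for r in response:
        value = text[r.start:r.end]
        hashed = hashlib.sha256(value.encode()).hexdigest()
        # Map each hash to an assumed placeholder naming the infotype.
        word_mapping[hashed] = f"<{r.entity_type}>"
        # Substitute the hash for the sensitive value in the text.
        ref_text = ref_text.replace(value, hashed)
    return word_mapping, ref_text


text = "Contact: jane@example.com"
response = [Span("EMAIL_ADDRESS", 9, 25)]
word_mapping, ref_text = sketch_create_mapping(text, response)
print(word_mapping)
print(ref_text)
```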

data_describe.privacy.detection.redact_info(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Redacts sensitive data using a mapping between hashed values and infotypes.

Parameters
  • text (str) – String value

  • engine_backend – The backend analyzer engine. Default is presidio_analyzer.

  • score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

String with redacted information
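
Given a mapping from hashed values to infotypes, redaction can be sketched as a plain substitution. The helper name sketch_redact, the sample phone number, and the `<PHONE_NUMBER>` placeholder format are assumptions made for illustration; the real function obtains the mapping from the analyzer engine.

```python
import hashlib


def sketch_redact(ref_text, word_mapping):
    """Replace each hashed value in ref_text with its infotype placeholder."""
    for hashed, placeholder in word_mapping.items():
        ref_text = ref_text.replace(hashed, placeholder)
    return ref_text


secret = "555-0100"
hashed = hashlib.sha256(secret.encode()).hexdigest()
# Text in which the sensitive value has already been replaced by its hash.
ref_text = f"Call me at {hashed}"
word_mapping = {hashed: "<PHONE_NUMBER>"}
redacted = sketch_redact(ref_text, word_mapping)
print(redacted)
```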

data_describe.privacy.detection.identify_column_infotypes(data_series, engine_backend, sample_size: Union[int, float] = _SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Identifies the infotype of a pandas series object using a sample of rows.

Parameters
  • data_series (Series) – A Series

  • engine_backend – The backend analyzer engine. Default is presidio_analyzer.

  • sample_size (int or float) – Number of rows to sample from

  • score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

List of infotypes
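
Sampling rows before detection might look like the sketch below. Everything here is illustrative: a list stands in for the pandas Series, toy_detect is a regex-based stand-in for the analyzer engine, the seed parameter exists only to make the sketch reproducible, and the ValueError mirrors the one documented for compute_sensitive_data rather than confirmed behavior of this function.

```python
import random
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def toy_detect(value):
    """Hypothetical detector standing in for the analyzer engine."""
    return ["EMAIL_ADDRESS"] if EMAIL_RE.search(str(value)) else []


def sketch_column_infotypes(values, sample_size=100, seed=0):
    """Detect infotypes on a random sample of the column's values."""
    if sample_size > len(values):
        raise ValueError("sample_size greater than data size")
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    found = set()
    for value in sample:
        found.update(toy_detect(value))
    return sorted(found)


column = ["a@example.com", "b@example.org", "c@example.net"]
infotypes = sketch_column_infotypes(column, sample_size=2)
print(infotypes)
```

Because only a sample is inspected, an infotype that appears in few rows can be missed; a larger sample_size lowers that risk at the cost of more analyzer calls.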

data_describe.privacy.detection.identify_infotypes(df, engine_backend, sample_size=_SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Identify infotypes for each column in the dataframe.

Parameters
  • df (DataFrame) – The dataframe

  • engine_backend – The backend analyzer engine. Default is presidio_analyzer.

  • sample_size (int) – Number of rows to sample from

  • score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

Dictionary with columns as keys and values as infotypes detected
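
The shape of the returned dictionary can be sketched as follows. A dict of lists stands in for the DataFrame, and the regex patterns and toy_infotypes helper are hypothetical substitutes for the analyzer engine; sampling is omitted to keep the example short.

```python
import re

# Toy patterns standing in for the analyzer engine's recognizers.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{4}\b"),
}


def toy_infotypes(values):
    """Return the sorted infotypes matched by any value in the column."""
    found = set()
    for value in values:
        for name, pattern in PATTERNS.items():
            if pattern.search(str(value)):
                found.add(name)
    return sorted(found)


table = {
    "email": ["a@example.com", "b@example.org"],
    "note": ["call 555-0100", "n/a"],
}
# Column names become keys; detected infotypes become values.
result = {column: toy_infotypes(values) for column, values in table.items()}
print(result)
```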

data_describe.privacy.detection.encrypt_text(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)

Encrypts text using Python’s hash function.

Parameters
  • text (str) – A string value

  • engine_backend – The backend analyzer engine. Default is presidio_analyzer.

  • score_threshold (float) – Minimum confidence value for detected entities to be returned

Returns

Text with hashed sensitive data

data_describe.privacy.detection.hash_string(text)

Applies SHA256 text hashing.

Parameters

text (str) – The string value

Returns

sha_signature – Hashed text
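
SHA256 hashing of a string is available in the standard library. The sketch below shows one plausible form; the name sketch_hash_string, the UTF-8 encoding, and the hexadecimal output format are assumptions, since the documentation above does not specify them.

```python
import hashlib


def sketch_hash_string(text: str) -> str:
    """Return the SHA256 hex digest of text, encoded as UTF-8."""
    return hashlib.sha256(text.encode()).hexdigest()


digest = sketch_hash_string("abc")
print(digest)
print(len(digest))  # a SHA256 hex digest is always 64 characters
```

The digest is deterministic, so the same input always yields the same output. That property is what allows a hashed value to serve as a stable key in a mapping.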

data_describe.privacy.detection.presidio_engine()

Initialize presidio engine.

Returns

Presidio engine