data_describe.privacy.detection
| sensitive_data | Identifies, redacts, and/or encrypts PII data. |
| compute_sensitive_data | Identifies, redacts, and encrypts PII data. |
| identify_pii | Identifies infotypes contained in a string. |
| create_mapping | Identifies sensitive data and creates a mapping with the hashed data. |
| redact_info | Redact sensitive data with mapping between hashed values and infotype. |
| identify_column_infotypes | Identifies the infotype of a pandas series object using a sample of rows. |
| identify_infotypes | Identify infotypes for each column in the dataframe. |
| encrypt_text | Encrypt text using Python’s hash function. |
| hash_string | Applies SHA256 text hashing. |
| presidio_engine | Initialize presidio engine. |
- 
data_describe.privacy.detection.logger¶
- 
data_describe.privacy.detection.sensitive_data(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: int = _SAMPLE_SIZE, engine_backend=None, compute_backend: Optional[str] = None)
- Identifies, redacts, and/or encrypts PII data.
- Note: sensitive_data uses Microsoft’s Presidio in the backend. Presidio can be used to help identify sensitive data. However, because Presidio uses trained ML models, there is no guarantee that Presidio will find all sensitive information.
- Parameters
- df (DataFrame) – The dataframe 
- mode (str) – One of {‘redact’, ‘encrypt’}. ‘redact’: redact the sensitive data. ‘encrypt’: anonymize the sensitive data. 
- detect_infotypes (bool) – If True, identifies infotypes for each column 
- columns ([str]) – Defaults to None 
- score_threshold (float) – Minimum confidence value for detected entities to be returned. Default is 0.2. 
- sample_size (int) – Number of sampled rows used for identifying column infotypes. Default is 100. 
- engine_backend – The backend analyzer engine. Default is presidio_analyzer. 
- compute_backend (str) – Select compute backend 
 
- Raises
- ValueError – Invalid input data type. 
- TypeError – columns not a list of strings. 
 
- Returns
- SensitiveDataWidget 
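
The documented argument checks can be illustrated with a small standalone sketch. The helper name `check_arguments` is hypothetical, and two behaviours here are assumptions rather than documented facts: that an unknown `mode` is rejected with `ValueError`, and that `columns=None` means "no restriction". Only the `TypeError` for a `columns` value that is not a list of strings comes from the Raises section above.

```python
from typing import Optional

_VALID_MODES = {"redact", "encrypt"}


def check_arguments(mode: str = "redact", columns: Optional[list] = None) -> None:
    """Raise if ``mode`` or ``columns`` does not match the documented types."""
    # Assumption: an unknown mode is rejected with ValueError.
    if mode not in _VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(_VALID_MODES)}, got {mode!r}")
    # Documented: columns must be a list of strings.
    # Assumption: None is accepted and means "no column restriction".
    if columns is not None:
        is_list = isinstance(columns, list)
        if not is_list or not all(isinstance(c, str) for c in columns):
            raise TypeError("columns must be a list of strings")


check_arguments(mode="encrypt", columns=["name", "email"])  # passes silently
```

Passing a bare string such as `columns="name"` is the mistake this kind of check is meant to catch, since a string is iterable but is not a list of column names.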
 
- 
class data_describe.privacy.detection.SensitiveDataWidget(engine=None, redact=None, encrypt=None, infotypes=None, sample_size=None, **kwargs)
- Bases: data_describe._widget.BaseWidget
- Interface for collecting additional information about the sensitive data widget.
- 
show(self, **kwargs)
- Show the transformed data or infotypes. 
 
- 
- 
data_describe.privacy.detection.compute_sensitive_data(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: Union[int, float] = _SAMPLE_SIZE, engine_backend=None)
- Identifies, redacts, and encrypts PII data.
- Note: sensitive_data uses Microsoft’s Presidio in the backend. Presidio can be used to help identify sensitive data. However, because Presidio uses trained ML models, there is no guarantee that Presidio will find all sensitive information.
- Parameters
- df (DataFrame) – The dataframe 
- mode (str) – One of {‘redact’, ‘encrypt’}. ‘redact’: redact the sensitive data. ‘encrypt’: anonymize the sensitive data. 
- detect_infotypes (bool) – If True, identifies infotypes for each column 
- columns ([str]) – Defaults to None 
- score_threshold (float) – Minimum confidence value for detected entities to be returned. Default is 0.2. 
- sample_size (int) – Number of sampled rows used for identifying column infotypes. Default is 100. 
- engine_backend – The backend analyzer engine. Default is presidio_analyzer. 
 
- Raises
- ValueError – sample_size greater than data size. 
- Returns
- SensitiveDataWidget 
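
The `ValueError` for a `sample_size` larger than the data can be sketched with the standard library alone. The helper name `draw_sample`, the `seed` parameter, and the use of a plain list in place of a dataframe are assumptions for illustration; this is not the library's implementation.

```python
import random


def draw_sample(values, sample_size, seed=None):
    """Return ``sample_size`` randomly chosen items from ``values``."""
    # Fail early with a clear message instead of sampling past the data.
    if sample_size > len(values):
        raise ValueError(
            f"sample_size ({sample_size}) is greater than data size ({len(values)})"
        )
    # A dedicated Random instance keeps the draw reproducible for a given seed.
    return random.Random(seed).sample(values, sample_size)


rows = ["alice@example.com", "bob@example.com", "carol@example.com"]
print(draw_sample(rows, 2, seed=0))
```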
 
- 
data_describe.privacy.detection.identify_pii(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)
- Identifies infotypes contained in a string.
- Parameters
- text (str) – A string value 
- engine_backend – The backend analyzer engine. Default is presidio_analyzer. 
- score_threshold (float) – Minimum confidence value for detected entities to be returned 
 
- Returns
- List of presidio_analyzer.recognizer_result.RecognizerResult 
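
To make the role of `score_threshold` concrete, here is a toy detector that does not use Presidio at all. The `Detection` class, the regular expressions, and the fixed confidence values are hypothetical stand-ins; only the idea of dropping results below a minimum confidence comes from the parameter description above.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Stand-in for a recognizer result: infotype, character span, confidence."""
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical patterns and fixed confidences, for illustration only.
_PATTERNS = {
    "EMAIL_ADDRESS": (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), 0.9),
    "PHONE_NUMBER": (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), 0.4),
}


def toy_identify(text: str, score_threshold: float = 0.2) -> List[Detection]:
    """Return detections whose confidence is at least ``score_threshold``."""
    found = []
    for entity_type, (pattern, score) in _PATTERNS.items():
        if score < score_threshold:
            continue  # Below the confidence cutoff, so never reported.
        for match in pattern.finditer(text):
            found.append(Detection(entity_type, match.start(), match.end(), score))
    # Order results by where they occur in the text.
    return sorted(found, key=lambda d: d.start)


text = "Mail jane@example.com or call 555-123-4567"
print([d.entity_type for d in toy_identify(text)])
print([d.entity_type for d in toy_identify(text, score_threshold=0.5)])
```

Raising the threshold trades recall for precision: in this sketch the lower-confidence phone pattern disappears once the cutoff exceeds 0.4.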
 
- 
data_describe.privacy.detection.create_mapping(text, response)
- Identifies sensitive data and creates a mapping with the hashed data.
- Parameters
- text (str) – String value 
- response – List of presidio_analyzer.recognizer_result.RecognizerResult 
 
- Returns
- word_mapping (dict): Mapping of the hashed data with the redacted string 
- ref_text (str): String with hashed values 
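
One way to picture the two return values is the sketch below. It assumes each detected span is replaced by its SHA-256 hex digest and that the mapping goes from digest to an infotype tag such as `<EMAIL_ADDRESS>`; both details, the `Span` stand-in, and the name `build_mapping` are assumptions, not the library's confirmed behaviour. Spans are assumed not to overlap.

```python
import hashlib
from collections import namedtuple

# Stand-in for a recognizer result; only these three fields are used here.
Span = namedtuple("Span", ["entity_type", "start", "end"])


def build_mapping(text, spans):
    """Return (word_mapping, ref_text) for the detected, non-overlapping spans."""
    word_mapping = {}
    ref_text = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        original = text[span.start:span.end]
        digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
        word_mapping[digest] = "<" + span.entity_type + ">"
        ref_text = ref_text[:span.start] + digest + ref_text[span.end:]
    return word_mapping, ref_text


text = "Contact jane@example.com today"
mapping, hashed_text = build_mapping(text, [Span("EMAIL_ADDRESS", 8, 24)])
print(mapping)
print(hashed_text)
```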
 
- 
data_describe.privacy.detection.redact_info(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)
- Redact sensitive data with mapping between hashed values and infotype.
- Parameters
- text (str) – String value 
- engine_backend – The backend analyzer engine. Default is presidio_analyzer. 
- score_threshold (float) – Minimum confidence value for detected entities to be returned 
 
- Returns
- String with redacted information 
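
Given a string whose sensitive values have already been hashed and a mapping from each hash to an infotype tag, redaction reduces to a substitution pass. The sketch below shows that step in isolation; the helper name `apply_mapping` and the `<EMAIL_ADDRESS>` tag format are assumptions for illustration.

```python
import hashlib


def apply_mapping(ref_text: str, word_mapping: dict) -> str:
    """Swap every hashed value in ``ref_text`` for its infotype tag."""
    for hashed_value, infotype_tag in word_mapping.items():
        ref_text = ref_text.replace(hashed_value, infotype_tag)
    return ref_text


# Build a tiny hashed string and its mapping by hand for the demonstration.
email_hash = hashlib.sha256("jane@example.com".encode("utf-8")).hexdigest()
hashed_text = "Contact " + email_hash + " today"
redacted = apply_mapping(hashed_text, {email_hash: "<EMAIL_ADDRESS>"})
print(redacted)  # Contact <EMAIL_ADDRESS> today
```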
 
- 
data_describe.privacy.detection.identify_column_infotypes(data_series, engine_backend, sample_size: Union[int, float] = _SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)
- Identifies the infotype of a pandas series object using a sample of rows.
- Parameters
- data_series (Series) – A Series 
- engine_backend – The backend analyzer engine. Default is presidio_analyzer. 
- sample_size (int) – Number of rows to sample from 
- score_threshold (float) – Minimum confidence value for detected entities to be returned 
 
- Returns
- List of infotypes 
 
- 
data_describe.privacy.detection.identify_infotypes(df, engine_backend, sample_size=_SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)
- Identify infotypes for each column in the dataframe.
- Parameters
- df (DataFrame) – The dataframe 
- engine_backend – The backend analyzer engine. Default is presidio_analyzer. 
- sample_size (int) – Number of rows to sample from 
- score_threshold (float) – Minimum confidence value for detected entities to be returned 
 
- Returns
- Dictionary with columns as keys and values as infotypes detected 
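
The shape of the returned dictionary can be sketched without pandas or Presidio. Here a dict of lists stands in for the dataframe and a single regular expression stands in for the analyzer engine; the names `detect` and `column_infotypes`, the `seed` parameter, and the choice to return a sorted list per column are all assumptions.

```python
import random
import re

# Hypothetical single-pattern detector standing in for the analyzer engine.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def detect(value):
    """Return the set of infotypes found in one cell value."""
    return {"EMAIL_ADDRESS"} if _EMAIL.search(str(value)) else set()


def column_infotypes(table, sample_size=2, seed=0):
    """Map each column name to the sorted infotypes seen in a row sample."""
    rng = random.Random(seed)
    result = {}
    for name, values in table.items():
        # random.sample raises ValueError if sample_size exceeds the column length.
        sample = rng.sample(values, sample_size)
        found = set()
        for value in sample:
            found |= detect(value)
        result[name] = sorted(found)
    return result


table = {
    "email": ["a@b.com", "c@d.org"],
    "city": ["Oslo", "Lima"],
}
print(column_infotypes(table, sample_size=2))
```

Because only a sample of rows is inspected, a column whose sensitive values are rare can be reported with no infotypes; a larger `sample_size` lowers that risk at the cost of more analysis time.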
 
- 
data_describe.privacy.detection.encrypt_text(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)
- Encrypt text using Python’s hash function.
- Parameters
- text (str) – A string value 
- engine_backend – The backend analyzer engine. Default is presidio_analyzer. 
- score_threshold (float) – Minimum confidence value for detected entities to be returned 
 
- Returns
- Text with hashed sensitive data 
 
- 
data_describe.privacy.detection.hash_string(text)
- Applies SHA256 text hashing.
- Parameters
- text (str) – The string value 
- Returns
- sha_signature: Hashed text 
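
A minimal sketch of SHA-256 text hashing with the standard library's `hashlib` follows. The helper name `sha256_hex`, the UTF-8 encoding step, and the hex-string output are assumptions for illustration; the library's own `hash_string` may differ in detail.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of ``text`` as a 64-character hex string."""
    # hashlib works on bytes, so the string is encoded first (UTF-8 assumed).
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


digest = sha256_hex("abc")
print(digest)
```

The same input always yields the same digest, which is what lets hashed values act as stable keys in a mapping. Note that hashing is one-way: the original text cannot be recovered from the digest.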
 
- 
data_describe.privacy.detection.presidio_engine()
- Initialize presidio engine.
- Returns
- Presidio engine