data_describe.privacy.detection
sensitive_data – Identifies, redacts, and/or encrypts PII data.
compute_sensitive_data – Identifies, redacts, and encrypts PII data.
identify_pii – Identifies infotypes contained in a string.
create_mapping – Identifies sensitive data and creates a mapping with the hashed data.
redact_info – Redact sensitive data with mapping between hashed values and infotype.
identify_column_infotypes – Identifies the infotype of a pandas series object using a sample of rows.
identify_infotypes – Identify infotypes for each column in the dataframe.
encrypt_text – Encrypt text using Python’s hash function.
hash_string – Applies SHA256 text hashing.
presidio_engine – Initialize presidio engine.
-
data_describe.privacy.detection.
logger
-
data_describe.privacy.detection.
sensitive_data
(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: int = _SAMPLE_SIZE, engine_backend=None, compute_backend: Optional[str] = None)
Identifies, redacts, and/or encrypts PII data.
Note
sensitive_data uses Microsoft’s Presidio in the backend. Presidio can be used to help identify sensitive data. However, because Presidio uses trained ML models, there is no guarantee that Presidio will find all sensitive information.
- Parameters
df (DataFrame) – The dataframe
mode (str) – One of {‘redact’, ‘encrypt’}. ‘redact’ redacts the sensitive data; ‘encrypt’ anonymizes the sensitive data.
detect_infotypes (bool) – If True, identifies infotypes for each column
columns ([str]) – List of column names, given as strings. Defaults to None.
score_threshold (float) – Minimum confidence value for detected entities to be returned. Default is 0.2.
sample_size (int) – Number of sampled rows used for identifying column infotypes. Default is 100.
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
compute_backend (str) – Select compute backend
- Raises
ValueError – Invalid input data type.
TypeError – columns not a list of strings.
- Returns
SensitiveDataWidget
-
class
data_describe.privacy.detection.
SensitiveDataWidget
(engine=None, redact=None, encrypt=None, infotypes=None, sample_size=None, **kwargs)
Bases: data_describe._widget.BaseWidget
Interface for collecting additional information about the sensitive data widget.
-
show
(self, **kwargs)
Show the transformed data or infotypes.
-
-
data_describe.privacy.detection.
compute_sensitive_data
(df, mode: str = 'redact', detect_infotypes: bool = True, columns: Optional[list] = None, score_threshold: float = _DEFAULT_SCORE_THRESHOLD, sample_size: Union[int, float] = _SAMPLE_SIZE, engine_backend=None)
Identifies, redacts, and encrypts PII data.
Note
sensitive_data uses Microsoft’s Presidio in the backend. Presidio can be used to help identify sensitive data. However, because Presidio uses trained ML models, there is no guarantee that Presidio will find all sensitive information.
- Parameters
df (DataFrame) – The dataframe
mode (str) – One of {‘redact’, ‘encrypt’}. ‘redact’ redacts the sensitive data; ‘encrypt’ anonymizes the sensitive data.
detect_infotypes (bool) – If True, identifies infotypes for each column
columns ([str]) – List of column names, given as strings. Defaults to None.
score_threshold (float) – Minimum confidence value for detected entities to be returned. Default is 0.2.
sample_size (int) – Number of sampled rows used for identifying column infotypes. Default is 100.
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
- Raises
ValueError – sample_size greater than data size.
- Returns
SensitiveDataWidget
-
data_describe.privacy.detection.
identify_pii
(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)
Identifies infotypes contained in a string.
- Parameters
text (str) – A string value
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
score_threshold (float) – Minimum confidence value for detected entities to be returned
- Returns
List of presidio_analyzer.recognizer_result.RecognizerResult
-
data_describe.privacy.detection.
create_mapping
(text, response)
Identifies sensitive data and creates a mapping with the hashed data.
- Parameters
text (str) – String value
response – List of presidio_analyzer.recognizer_result.RecognizerResult
- Returns
word_mapping (dict) – Mapping of the hashed data with the redacted string
ref_text (str) – String with hashed values
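To make the two return values concrete, here is a minimal sketch of how such a mapping might be built. It is not the library's implementation. `Span` is a hypothetical stand-in for a Presidio `RecognizerResult` (only `entity_type`, `start`, and `end` are used), `sketch_create_mapping` is an illustrative name, and the `<ENTITY_TYPE>` tag format, the UTF-8 encoding, and the assumption that spans do not overlap are all choices made for this example.

```python
import hashlib
from typing import Dict, List, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical stand-in for a Presidio RecognizerResult."""

    entity_type: str
    start: int
    end: int


def sketch_create_mapping(text: str, response: List[Span]) -> Tuple[Dict[str, str], str]:
    """Return (word_mapping, ref_text) for non-overlapping spans.

    word_mapping maps each SHA-256 digest to an assumed "<ENTITY_TYPE>" tag;
    ref_text is the input with every detected span replaced by its digest.
    """
    word_mapping: Dict[str, str] = {}
    ref_text = text
    # Replace from right to left so the offsets of earlier spans stay valid.
    for span in sorted(response, key=lambda s: s.start, reverse=True):
        original = text[span.start:span.end]
        digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
        word_mapping[digest] = f"<{span.entity_type}>"
        ref_text = ref_text[:span.start] + digest + ref_text[span.end:]
    return word_mapping, ref_text
```

Calling `sketch_create_mapping("Call Ann at 555-1234", [Span("PERSON", 5, 8), Span("PHONE_NUMBER", 12, 20)])` yields a two-entry mapping and a string in which both detected values are replaced by their digests.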
-
data_describe.privacy.detection.
redact_info
(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)
Redact sensitive data with mapping between hashed values and infotype.
- Parameters
text (str) – String value
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
score_threshold (float) – Minimum confidence value for detected entities to be returned
- Returns
String with redacted information
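The effect of score_threshold on redaction can be illustrated with a small sketch. This is not the library's code: it skips the intermediate hashing step and replaces spans directly. `Detection` is a hypothetical stand-in for a Presidio `RecognizerResult`, `sketch_redact` is an illustrative name, and the `<ENTITY_TYPE>` tag format is an assumption. The 0.2 default mirrors the value documented for sensitive_data.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical stand-in for a Presidio RecognizerResult."""

    entity_type: str
    start: int
    end: int
    score: float


def sketch_redact(text: str, detections: List[Detection], score_threshold: float = 0.2) -> str:
    """Replace each sufficiently confident, non-overlapping span with a tag."""
    # Keep only detections at or above the minimum confidence.
    kept = [d for d in detections if d.score >= score_threshold]
    redacted = text
    # Work right to left so earlier offsets are not shifted by replacements.
    for det in sorted(kept, key=lambda d: d.start, reverse=True):
        redacted = redacted[:det.start] + f"<{det.entity_type}>" + redacted[det.end:]
    return redacted
```

A detection scored below the threshold is left in the text untouched, which is why lowering score_threshold generally redacts more and raising it redacts less.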
-
data_describe.privacy.detection.
identify_column_infotypes
(data_series, engine_backend, sample_size: Union[int, float] = _SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)
Identifies the infotype of a pandas series object using a sample of rows.
- Parameters
data_series (Series) – A Series
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
sample_size (int) – Number of rows to sample from
score_threshold (float) – Minimum confidence value for detected entities to be returned
- Returns
List of infotypes
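The sampling idea can be sketched without pandas or Presidio. In the example below everything is illustrative rather than taken from the library: `toy_detector` is a hypothetical regex-based detector standing in for the analyzer engine, `sketch_column_infotypes` is an invented name, the column is a plain sequence, and the fixed seed exists only to make the sample reproducible. The ValueError mirrors the "sample_size greater than data size" condition documented for compute_sensitive_data.

```python
import random
import re
from typing import Callable, List, Sequence

# Deliberately simple pattern; a real engine uses far richer recognizers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def toy_detector(text: str) -> List[str]:
    """Hypothetical detector: return the infotype labels found in text."""
    return ["EMAIL_ADDRESS"] if EMAIL_RE.search(text) else []


def sketch_column_infotypes(
    values: Sequence[object],
    detector: Callable[[str], List[str]],
    sample_size: int = 100,
    seed: int = 0,
) -> List[str]:
    """Detect infotypes on a random sample of rows and return the unique labels."""
    if sample_size > len(values):
        raise ValueError("sample_size greater than data size")
    rng = random.Random(seed)
    sample = rng.sample(list(values), sample_size)
    found = set()
    for value in sample:
        found.update(detector(str(value)))
    return sorted(found)
```

Because only a sample is inspected, a column with rare sensitive values can be reported as clean; a larger sample_size trades speed for coverage.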
-
data_describe.privacy.detection.
identify_infotypes
(df, engine_backend, sample_size=_SAMPLE_SIZE, score_threshold=_DEFAULT_SCORE_THRESHOLD)
Identify infotypes for each column in the dataframe.
- Parameters
df (DataFrame) – The dataframe
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
sample_size (int) – Number of rows to sample from
score_threshold (float) – Minimum confidence value for detected entities to be returned
- Returns
Dictionary with columns as keys and values as infotypes detected
-
data_describe.privacy.detection.
encrypt_text
(text, engine_backend, score_threshold=_DEFAULT_SCORE_THRESHOLD)
Encrypt text using Python’s hash function.
- Parameters
text (str) – A string value
engine_backend – The backend analyzer engine. Default is presidio_analyzer.
score_threshold (float) – Minimum confidence value for detected entities to be returned
- Returns
Text with hashed sensitive data
-
data_describe.privacy.detection.
hash_string
(text)
Applies SHA256 text hashing.
- Parameters
text (str) – The string value
- Returns
sha_signature – Hashed text
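For reference, SHA-256 hashing of a string with the standard library looks roughly like the sketch below. The name `sketch_hash_string`, the UTF-8 encoding, and the hexadecimal output format are assumptions for illustration; the source does not state which encoding or digest format hash_string uses.

```python
import hashlib


def sketch_hash_string(text: str) -> str:
    """Return the SHA-256 digest of text as a 64-character hex string."""
    # hashlib operates on bytes, so the string is encoded first (UTF-8 assumed).
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The digest is deterministic, so equal inputs always produce equal hashes. That keeps repeated values linkable after hashing, and it also means short or guessable inputs can be recovered by hashing candidate values.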
-
data_describe.privacy.detection.
presidio_engine
()
Initialize presidio engine.
- Returns
Presidio engine