Redacting Data
Learn how to redact data
Redacting Text
Each SDK provides a redact method that can be used to redact text. Here's an example of redacting a phone number with the Redact service.
import os

import pangea.exceptions as pe
from pangea.config import PangeaConfig
from pangea.services import Redact

# Read the service token and domain from the environment
token = os.getenv("PANGEA_REDACT_TOKEN")
domain = os.getenv("PANGEA_DOMAIN")
config = PangeaConfig(domain=domain)
redact = Redact(token, config=config)


def main():
    text = "Hello, my phone number is 123-456-7890"
    print(f"Redacting PII from: {text}")
    try:
        redact_response = redact.redact(text=text)
        print(f"Redacted text: {redact_response.result.redacted_text}")
    except pe.PangeaAPIException as e:
        # Print the error summary followed by each individual error detail
        print(f"Redact Request Error: {e.response.summary}")
        for err in e.errors:
            print(f"\t{err.detail} \n")


if __name__ == "__main__":
    main()
The debug option will provide a detailed list of the redactions that occurred for the provided text. This can be useful in testing or in cases where a report of what was redacted from the provided text is required.
For complete details on the redact method, see the API documentation. For information on other language SDKs, see the SDK documentation.
Redacting Structured Data
In some cases, structured JSON data may require redaction. By default, the Redact service will iterate and apply redaction rules to all values in the supplied JSON. For a more targeted approach, JSONPaths can be provided to identify specific fields to be redacted.
Here's an example of redacting an email address and Driver's License from JSON data using the Python SDK:
from pangea.services import Redact

# include your API token here
redact = Redact(token="API_Token")

structured_data = {
    "First_Name": "Dennis",
    "Last_Name": "Nedry",
    "email": "dennis.nedry@ingen.com",
    "DL": "Y2500760",
}

check_res = redact.redact_structured(
    structured_data,
    jsonp=["$.email", "$.DL"]
)
In the above example, the jsonp keyword argument is supplied to the redact_structured method as a list of JSONPaths targeting the email and DL fields.
Using the jsonp keyword argument can reduce the time it takes to perform a redaction operation while also reducing the risk of accidental redaction.
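The difference between redacting every value and redacting only targeted fields can be sketched locally. The following is an illustrative stand-in only, not the Redact service or the SDK: the mask_digits rule, the redact_all and redact_fields helpers, and the Badge field are all hypothetical, and the field selection here handles only top-level keys rather than full JSONPaths.

```python
import re
from copy import deepcopy


def mask_digits(value):
    """Toy rule (hypothetical): mask any string containing a run of 5+ digits."""
    return "<REDACTED>" if re.search(r"\d{5,}", value) else value


def redact_all(data, rule):
    """Apply the rule to every string value, recursing into dicts and lists."""
    if isinstance(data, dict):
        return {k: redact_all(v, rule) for k, v in data.items()}
    if isinstance(data, list):
        return [redact_all(v, rule) for v in data]
    return rule(data) if isinstance(data, str) else data


def redact_fields(data, keys, rule):
    """Apply the rule only to the named top-level keys, leaving the rest untouched."""
    result = deepcopy(data)
    for key in keys:
        if isinstance(result.get(key), str):
            result[key] = rule(result[key])
    return result


record = {"First_Name": "Dennis", "DL": "Y2500760", "Badge": "B0012345"}

# Default-style behavior: every value is checked, so Badge is masked too.
everything = redact_all(record, mask_digits)

# Targeted behavior: only DL is checked, so Badge is left alone.
targeted = redact_fields(record, ["DL"], mask_digits)
```

In this sketch the untargeted pass also masks the Badge value because it happens to match the toy rule, which is the kind of accidental redaction that targeting specific fields avoids.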
For complete details on the redact_structured method, see the API documentation. For information on other language SDKs, see the SDK documentation.
About JSONPath
JSONPath was born out of the need to easily extract data from JSON documents in much the same way that XPath does for XML documents.
Consider the following JSON:
{
  "First_Name": "Dennis",
  "Last_Name": "Nedry",
  "SSN": "078-05-1120",
  "DL": "Y2500760",
  "Phone_Numbers": [
    {
      "type": "mobile",
      "number": "111-111-1111"
    },
    {
      "type": "home",
      "number": "222-222-2222"
    }
  ]
}
In this case, as in the earlier example, the SSN can be extracted by using the following JSONPath: $.SSN. The $ represents the root of the document, and .SSN indicates a child with a key of SSN.
If the first phone number is needed, a JSONPath of $.Phone_Numbers[0].number could be provided. In this case:
- The $ represents the root of the document
- The .Phone_Numbers represents the child with a key of Phone_Numbers
- The [0] indicates the first entry in the Phone_Numbers array
- The .number indicates the child with a key of number
Finally, if the mobile number specifically is needed, the following JSONPath could be provided: $.Phone_Numbers[?(@.type=="mobile")].number. In this case:
- The $ represents the root of the document
- The .Phone_Numbers represents the child with a key of Phone_Numbers
- The [...] is used to iterate through the Phone_Numbers entries
- The ?() indicates that a script should be applied, in this case a comparison
- The @.type indicates the current record's type key
- The @.type=="mobile" selects records whose type key is equal to "mobile"
- The .number indicates the number key of the matching record(s)
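To make the three paths concrete, here is what each one selects from the JSON above, written as plain Python lookups. This is only a hand-written equivalent for illustration; it does not use a JSONPath library, and the variable names are arbitrary.

```python
doc = {
    "First_Name": "Dennis",
    "Last_Name": "Nedry",
    "SSN": "078-05-1120",
    "DL": "Y2500760",
    "Phone_Numbers": [
        {"type": "mobile", "number": "111-111-1111"},
        {"type": "home", "number": "222-222-2222"},
    ],
}

# $.SSN : the child of the root with a key of SSN
ssn = doc["SSN"]

# $.Phone_Numbers[0].number : the number key of the first array entry
first_number = doc["Phone_Numbers"][0]["number"]

# $.Phone_Numbers[?(@.type=="mobile")].number : the number key of every
# entry whose type equals "mobile" (a filter can match more than one record)
mobile_numbers = [
    entry["number"]
    for entry in doc["Phone_Numbers"]
    if entry["type"] == "mobile"
]
```

Note that the filter expression yields a list, since any number of records may satisfy the comparison.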
JSONPath is an extremely powerful tool. To learn more, read the JSONPath specification and test out some JSONPaths with this interactive tool.
Rules & Ruleset Parameters
Sometimes, for specific calls, you need extra rules on top of the rules enabled in your configuration. For those cases, we provide two parameters that allow you to add rules to your base set of rules: rules and rulesets.
The rules parameter allows you to specify rules by their short names to provide additional redaction options on top of your current selection. Likewise, the rulesets parameter allows you to provide one or more entire rulesets (also referenced by their short names) to apply to your text.
from pangea.services import Redact

# include your API token here
redact = Redact(token="API_Token")

# Note: no trailing comma here, otherwise text becomes a tuple instead of a string
text = "Dennis Nedry whose email is dennis.nedry@ingen.com Y2500760 408-444-4444"

check_res = redact.redact(
    text,
    rules=["PHONE_NUMBER"],
)
The earlier values in the text (the name, email address, and license number) are redacted if the corresponding rules are enabled in your configuration. The phone number is redacted as well, because the PHONE_NUMBER rule was added for this call.
Overlapping Rules
When using multiple rules, there are times when rules overlap or match exactly the same text. The question then becomes: "How do we decide which rule redacts the matching text?" For rules enabled within your configuration (not supplied through the rules or rulesets parameters described in the previous section), the longest match always wins. This ensures that a broader match, such as the email address dennis.nedry@gmail.com, is redacted in full rather than only the gmail.com portion. In this case, the address matters more than the domain because it contains this example person's first and last name. If the matches are exactly the same length, the one with the higher confidence score wins. If both the confidence and the match length are the same, the choice is arbitrary.
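The ordering described above (longest match, then confidence, then an arbitrary choice) can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual implementation: the Match class, the resolve_overlaps helper, the rule labels, and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    rule: str     # hypothetical rule label
    start: int    # start offset in the text
    end: int      # end offset (exclusive)
    score: float  # confidence score

    @property
    def length(self):
        return self.end - self.start


def overlaps(a, b):
    """True when the two matches share at least one character."""
    return a.start < b.end and b.start < a.end


def resolve_overlaps(matches):
    """Keep the preferred match from each group of overlapping matches."""
    # Longest first, then highest score; exact ties keep input order,
    # which stands in for the arbitrary choice.
    ranked = sorted(matches, key=lambda m: (-m.length, -m.score))
    kept = []
    for match in ranked:
        if not any(overlaps(match, other) for other in kept):
            kept.append(match)
    return sorted(kept, key=lambda m: m.start)


text = "Contact dennis.nedry@gmail.com today"
email = "dennis.nedry@gmail.com"
domain = "gmail.com"
email_start = text.index(email)
domain_start = text.index(domain)

candidates = [
    Match("URL", domain_start, domain_start + len(domain), 0.9),
    Match("EMAIL_ADDRESS", email_start, email_start + len(email), 0.8),
]
winners = resolve_overlaps(candidates)
```

Even though the domain match has the higher score in this sketch, the email match is kept because it is longer; the score is consulted only when lengths are equal.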
Regarding rules and rulesets: when there is an overlap between rules specified by these parameters and those defined in the standard Redact configuration, preference is always given to the rules in the standard configuration. This preference reflects a common situation, especially in companies running a large number of microservices: each service may require its own redaction approach, but the internal security team needs to enforce redaction of specific PII uniformly across the entire company.
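As a rough sketch of this precedence, the choice between two overlapping matches could look like the function below. It is illustrative only: the dictionary fields, the "config" and "request" source labels, and the rule names are hypothetical, and the fallback used when both matches come from the same source is an assumption rather than documented behavior.

```python
def pick_winner(a, b):
    """Choose between two overlapping matches.

    Each match is a dict with 'rule', 'source', 'length', and 'score' keys.
    """
    # A rule from the standard configuration beats one supplied per call.
    if a["source"] != b["source"]:
        return a if a["source"] == "config" else b
    # Same source (assumed fallback): longest match, then higher score.
    return max(a, b, key=lambda m: (m["length"], m["score"]))


config_match = {
    "rule": "EMAIL_ADDRESS",
    "source": "config",
    "length": 22,
    "score": 0.8,
}
request_match = {
    "rule": "CUSTOM_RULE",
    "source": "request",
    "length": 30,
    "score": 0.95,
}

winner = pick_winner(request_match, config_match)
```

Here the configuration's rule is chosen even though the per-call rule produced a longer, higher-scoring match.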