Sanitize API

The Sanitize API integrates with other Pangea services, which gives it the following capabilities that you can use in your app:

  • Sanitize uses File Scan to scan the file both before and after sanitization, except when the source or destination is Secure Share. Secure Share scans every file it receives, so skipping the Sanitize scan in that case prevents the file from being scanned twice.

  • Sanitize can be used to remove possibly malicious embedded content from the file.

  • Sanitize can scan URLs and domains for malicious links.

  • Sanitize can be used with Redact to remove sensitive information in the files.

  • Sanitize can use multiple types of transfer methods, including Secure Share. This allows you to tailor Sanitize to meet the needs of your application.

Sanitize API requests

When Sanitize is called, it uses the Content and File Operations configured under Sanitize Settings in the Pangea User Console, unless you override them at runtime with the optional file or content parameters of the Sanitize API.

The Sanitize file parameter options allow you to override the following:

  • Configured File Scan provider

The Sanitize content parameter options allow you to override the following:

  • Configured URL Intel provider
  • Configured Domain Intel provider
  • Defang threshold
  • Removal of attachments
  • Removal of interactive content
  • Redact enablement
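As a rough illustration of a runtime override, the sketch below builds a request body that adds a content section only when an override is supplied. The top-level keys "file" and "content" and the field "defang_threshold" appear in the sample output on this page. The helper name build_sanitize_body, the example URL, and the threshold value 20 are assumptions made for illustration, so check the API reference for the exact option names and accepted values.

```python
import json
from typing import Any, Optional


def build_sanitize_body(
    source_url: str,
    file_overrides: Optional[dict[str, Any]] = None,
    content_overrides: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build a Sanitize request body, adding overrides only when given."""
    body: dict[str, Any] = {
        "transfer_method": "source-url",
        "source_url": source_url,
    }
    # Sections that are left out keep the settings configured in the Console.
    if file_overrides:
        body["file"] = dict(file_overrides)
    if content_overrides:
        body["content"] = dict(content_overrides)
    return body


# Hypothetical input URL and threshold value, for illustration only.
body = build_sanitize_body(
    "https://example.com/input.pdf",
    content_overrides={"defang_threshold": 20},
)
print(json.dumps(body, indent=2))
```

Leaving both override arguments empty produces a body with only the transfer fields, which corresponds to using the Console configuration unchanged.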

Sanitize API requests accept additional configuration and input parameters that direct the output either to a presigned URL or to Secure Share. This is useful for automating file sharing and storage workflows, especially in combination with Secure Share, because it reduces the number of steps and interactions required.

For a Sanitize API service call, an input file can be provided using one of the following transfer_method options discussed in Transfer Methods:

  • "source-url"
  • "put-url"
  • "post-url"
  • "share-id"
  • "multipart"
Note: All examples on this page use "source-url" as the input transfer method.
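A client can catch a mistyped transfer method before sending a request. The sketch below is one way to do that, using only the five values listed above. The names TRANSFER_METHODS and check_transfer_method are hypothetical and are not part of any Pangea SDK.

```python
# The five transfer_method values listed on this page.
TRANSFER_METHODS = frozenset(
    {"source-url", "put-url", "post-url", "share-id", "multipart"}
)


def check_transfer_method(method: str) -> str:
    """Return the method unchanged, or raise if it is not a listed value."""
    if method not in TRANSFER_METHODS:
        allowed = ", ".join(sorted(TRANSFER_METHODS))
        raise ValueError(f"unknown transfer_method {method!r}; expected one of: {allowed}")
    return method


print(check_transfer_method("source-url"))
```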

Setting API destinations

You can choose how the results of a Sanitize API call are delivered by specifying an additional optional parameter, share_output.

There are two options for receiving the results of a Sanitize API call:

Setting output to a destination URL

  • If you omit the optional share_output parameter in your initial request, the successful response from the Sanitize service will contain a presigned GET URL in result.dest_url, which you can use to download the sanitized output.

    For example:

    1. Request sanitization of a file.

      POST sanitize/file/at/source-url
      curl --location 'https://sanitize.aws.us.pangea.cloud/v1beta/sanitize' \
      --header 'Content-Type: application/json' \
      --header "Authorization: Bearer $PANGEA_SANITIZE_TOKEN" \
      --data '{
          "transfer_method": "source-url",
          "source_url": "https://my-sanitize-input.s3.us-west-2.amazonaws.com/samples/my_tiny.pdf?..."
      }'
      

      A call to the Sanitize service receives an asynchronous response. This response contains a GET URL in result.location, which you can use to track the status of your request.

      response/with/results/location
      {
        "request_id": "prq_64tjspdh4yxxownpm2ap4qb4rbxuedeo",
        "status": "Accepted",
        "summary": "Your request is in progress. Use 'result, location' below to poll for results. See https://pangea.cloud/docs/api/async?service=sanitize&request_id=prq_64tjspdh4yxxownpm2ap4qb4rbxuedeo for more information.",
        "result": {
            "location": "https://sanitize.aws.us.pangea.cloud/request/prq_64tjspdh4yxxownpm2ap4qb4rbxuedeo",
            . . .
        },
        . . .
      }
      
    2. Check the results of the requested sanitization.

      GET results/of/sanitize
      curl --location 'https://sanitize.aws.us.pangea.cloud/request/prq_64tjspdh4yxxownpm2ap4qb4rbxuedeo' \
      --header "Authorization: Bearer $PANGEA_SANITIZE_TOKEN"
      

      Use the presigned GET URL in result.dest_url to download the sanitized output.

      results/of/sanitize
      {
        "request_id": "prq_64tjspdh4yxxownpm2ap4qb4rbxuedeo",
        "result": {
            "dest_url": "https://pangea-sanitize-input.s3.us-west-2.amazonaws.com/2024030423/prq_64tjspdh4yxxownpm2ap4qb4rbxuedeo/sanitized.my_tiny.pdf?...",
            . . .
        },
        "status": "Success",
        "summary": "Successfully completed the request.  The file download link is valid for 24h0m0s."
      }
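The two steps above can be sketched as a small polling loop in Python. This is a minimal sketch under stated assumptions, not an official client: the function names, the retry interval, and the attempt limit are hypothetical, and treating every status other than "Accepted" or "Success" as a failure is an assumption. The fetch function is passed in so that the loop can be exercised with canned responses, as done here, without calling the service.

```python
import json
import time
import urllib.request
from typing import Any, Callable


def http_get_json(url: str, token: str) -> dict[str, Any]:
    """GET a URL with a bearer token and decode the JSON body."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_dest_url(
    accepted: dict[str, Any],
    fetch: Callable[[str], dict[str, Any]],
    interval_s: float = 2.0,
    max_attempts: int = 30,
) -> str:
    """Poll result.location until the request succeeds, then return result.dest_url."""
    location = accepted["result"]["location"]
    for _ in range(max_attempts):
        response = fetch(location)
        status = response.get("status")
        if status == "Success":
            return response["result"]["dest_url"]
        if status != "Accepted":
            # Assumption: any other status means the request did not succeed.
            raise RuntimeError(f"Sanitize request ended with status {status!r}")
        time.sleep(interval_s)
    raise TimeoutError("Sanitize result was not ready in time")


# Simulated run with canned responses (placeholder URLs, no network access).
canned = iter([
    {"status": "Accepted", "result": {"location": "https://example.com/request/1"}},
    {"status": "Success", "result": {"dest_url": "https://example.com/sanitized.pdf"}},
])
accepted = {"status": "Accepted", "result": {"location": "https://example.com/request/1"}}
dest_url = wait_for_dest_url(accepted, lambda url: next(canned), interval_s=0)
print(dest_url)
```

Against the real service, the fetch argument could be something like lambda url: http_get_json(url, token), where token holds the value of PANGEA_SANITIZE_TOKEN.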
      

Setting output to Secure Share

Enabling share_output in your initial request saves the sanitized output in Secure Share.

For example:

  1. Request sanitization of a file.

    POST sanitize/file/at/source-url
    curl --location 'https://sanitize.aws.us.pangea.cloud/v1beta/sanitize' \
    --header 'Content-Type: application/json' \
    --header "Authorization: Bearer $PANGEA_SANITIZE_TOKEN" \
    --data '{
      "transfer_method": "source-url",
      "source_url": "https://pangea-sanitize-input.s3.us-west-2.amazonaws.com/samples/redact_tiny.pdf?...",
      "share_output": {
          "enabled": true,
          "output_folder": "/"
      }
    }'
    

    If you specify a non-existent "output_folder" location, Secure Share will automatically create it for you.

    The response contains a GET URL in result.location. You can use this URL to check the status of the call and get the eventual results.

    response/with/results/location
    {
      "request_id": "prq_zrdj2aggcspg6nslzlk7im63s577o34z",
      "result": {
        "location": "https://sanitize.aws.us.pangea.cloud/request/prq_zrdj2aggcspg6nslzlk7im63s577o34z",
        . . .
      },
      "status": "Accepted",
      "summary": "Your request is in progress. Use 'result, location' below to poll for results. See https://pangea.cloud/docs/api/async?service=sanitize&request_id=prq_zrdj2aggcspg6nslzlk7im63s577o34z for more information.",
      . . .
    }
    
  2. Check the results of the sanitization request.

    GET results/of/sanitize
    curl --location 'https://sanitize.aws.us.pangea.cloud/request/prq_zrdj2aggcspg6nslzlk7im63s577o34z' \
    --header "Authorization: Bearer $PANGEA_SANITIZE_TOKEN"
    

    If the call is successful, result.dest_share_id will contain the ID of the file saved in Secure Share.

    results/of/sanitize
    {
      "request_id": "prq_zrdj2aggcspg6nslzlk7im63s577o34z",
      "status": "Success",
      "summary": "Successfully completed the request.  The Sanitized file sanitized.Asynchronous API Responses Pangea.pdf can be found in the Secure Share under folder: /.",
      "result": {
        "dest_share_id": "pos_pp2l24fj7kcdafmyqtztd6oeoofpmeid",
        . . .
      },
      . . .
    }
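Because the result carries either result.dest_share_id or result.dest_url depending on whether share_output was enabled, client code may want to handle both cases in one place. The sketch below shows one possible approach. The helper names with_share_output and output_location are hypothetical, and the example input URL is a placeholder.

```python
from typing import Any


def with_share_output(body: dict[str, Any], output_folder: str = "/") -> dict[str, Any]:
    """Return a copy of the request body with Secure Share output enabled."""
    updated = dict(body)
    updated["share_output"] = {"enabled": True, "output_folder": output_folder}
    return updated


def output_location(response: dict[str, Any]) -> tuple[str, str]:
    """Return ("share", id) or ("url", link) for a successful Sanitize result."""
    result = response.get("result", {})
    if "dest_share_id" in result:
        return ("share", result["dest_share_id"])
    if "dest_url" in result:
        return ("url", result["dest_url"])
    raise KeyError("result has neither dest_share_id nor dest_url")


request_body = with_share_output(
    {"transfer_method": "source-url", "source_url": "https://example.com/input.pdf"}
)
# Trimmed copy of the sample result shown above.
sample = {
    "status": "Success",
    "result": {"dest_share_id": "pos_pp2l24fj7kcdafmyqtztd6oeoofpmeid"},
}
print(request_body["share_output"])
print(output_location(sample))
```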
    

Sanitize output data fields

The following sample response shows the data fields returned in the details of a Sanitize output, with example values that indicate their types.

{
  "request_id": "prq_zhe46bpihtqqm4wussuaa3rwmgoouc3w",
  "request_time": "2024-03-19T23:08:08.699280Z",
  "response_time": "2024-03-19T23:08:21.757699Z",
  "status": "Success",
  "summary": "Successfully completed the request.  The file download link is valid for 24h0m0s.",
  "result": {
    "dest_url": "https://pangea-sanitize-input-dev.s3.us-west-2.amazonaws.com/2024031923/prq_zhe46bpihtqqm4wussuaa3rwmgoouc3w/sanitized.Pangea.pdf",
    "data": {
      "redact": {
        "redaction_count": 13,
        "summary_counts": {
          "PERSON": 9
        }
      },
      "defang": {
        "external_urls_count": 48,
        "external_domains_count": 6,
        "defanged_count": 0,
        "url_intel_summary": "Processed 31 URLs: 0 are malicious, 0 are suspicious, 31 are unknown.",
        "domain_intel_summary": "Processed 6 Domains: 0 are malicious, 0 are suspicious, 6 are unknown."
      },
      "cdr": {
        "file_attachments_removed": 0,
        "interactive_contents_removed": 0
      },
      "malicious_file": false
    },
    "parameters": {
      "transfer_method": "multipart",
      "source_url": "",
      "share_id": "",
      "config_id": null,
      "file": {
        "cdr_provider": "apryse"
      },
      "content": {
        "defang_threshold": null
      },
      "share_output": null
    }
  }
}

Note: In the defang summary, external_urls_count and the number of URLs reported in url_intel_summary may differ. external_urls_count is the total number of URLs found, whereas url_intel_summary counts only unique URLs. A difference generally means that the original document contained duplicate URLs, which are removed before the URLs are sent to URL Intel for lookup.
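To make the difference concrete, the sketch below reads the processed count out of the summary text from the sample output and compares it with external_urls_count. Parsing the summary with a regular expression assumes that the wording "Processed N URLs" stays stable, which this page does not guarantee. The helper name processed_count is hypothetical.

```python
import re

# Assumption: the summary text starts with "Processed <N> URLs" or "Processed <N> Domains".
_PROCESSED = re.compile(r"Processed (\d+) (?:URLs|Domains)")


def processed_count(summary: str) -> int:
    """Extract the number of looked-up items from an intel summary string."""
    match = _PROCESSED.search(summary)
    if match is None:
        raise ValueError(f"unrecognized summary format: {summary!r}")
    return int(match.group(1))


# Values copied from the sample output on this page.
defang = {
    "external_urls_count": 48,
    "url_intel_summary": "Processed 31 URLs: 0 are malicious, 0 are suspicious, 31 are unknown.",
}
unique_urls = processed_count(defang["url_intel_summary"])
likely_duplicates = defang["external_urls_count"] - unique_urls
print(unique_urls, likely_duplicates)  # 31 17
```

In the sample, 48 URLs were found and 31 were looked up, so about 17 were most likely duplicates.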
