
Implementing Pangea Multipass

Pangea Multipass is a general-purpose library for checking a user's access to resources in an upstream system. It ingests the resources using an admin account and then, at inference time, filters the data based on what the user can access.

Although this is intended to be used with AI/LLM apps, you can use this library independently.

Features

Pangea Multipass contains the following features:

  • Document Reading: Supports document content extraction for use in processing and enrichment.
  • Metadata Enrichment: Includes enrichers for hashing, constant value setting, and custom metadata.
  • Metadata Filtering: Provides flexible operators to filter document metadata for customized queries.
  • Authorization Processing: Manages authorized and unauthorized nodes with customizable node processors.
  • Extensible: Built on abstract base classes, allowing easy extension and customization of functionality.
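Conceptually, the authorization-filtering flow behind these features can be sketched in a few lines of plain Python. All names below (`Document`, `filter_authorized`, the ACL mapping) are illustrative stand-ins, not the library's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A retrieved document with arbitrary metadata."""
    id: str
    metadata: dict = field(default_factory=dict)

def filter_authorized(docs: list[Document], user: str, acl: dict[str, set[str]]) -> list[Document]:
    """Keep only the documents the user is allowed to see.

    acl maps a document id to the set of usernames with access;
    in Pangea Multipass this check is delegated to the upstream system.
    """
    return [d for d in docs if user in acl.get(d.id, set())]

docs = [Document("a"), Document("b"), Document("c")]
acl = {"a": {"alice"}, "b": {"alice", "bob"}, "c": {"bob"}}
print([d.id for d in filter_authorized(docs, "alice", acl)])  # ['a', 'b']
```

The real library performs this filtering against live upstream permissions rather than a static mapping, but the shape of the operation is the same: ingest broadly as admin, then narrow per user.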

Prerequisites

  • Python v3.10 or greater, but less than v3.13
  • Poetry v2 or greater

All other dependencies will be installed using Poetry later in the tutorial.

Set up the environment

Many of the following upstream data sources require first setting environment variables that are used during ingestion and inference.

The following information is provided as a quick reference and will be used throughout the tutorial for the supplied data sources.

Mac and Linux systems

export VARIABLE_NAME=value

Windows

setx VARIABLE_NAME value
note

You might need to reopen the terminal or reload your shell configuration file before using the environment variables in your terminal.

note

You can also add the environment variables to Docker, Kubernetes, or in your code using your preferred programming language. Consult the documentation for the method you prefer for how to complete these tasks.
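For example, in Python you can read and validate these variables with the standard library. The `require_env` helper below is hypothetical, not part of Pangea Multipass:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail loudly if unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Demo value only; in practice this is exported in your shell or deployment.
os.environ["JIRA_BASE_URL"] = "example.atlassian.net/"
print(require_env("JIRA_BASE_URL"))
```

Failing fast on missing variables makes misconfiguration obvious before any ingestion work starts.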

Installation

To use the examples from the GitHub repository, clone Pangea Multipass into your app folder using the following command:

git clone https://github.com/pangeacyber/pangea-multipass.git

Install dependencies

Multipass uses Poetry for dependency management and virtual environment setup. This simplifies installation and ensures that the correct dependency versions are installed for Pangea Multipass.

Navigate to the Pangea Multipass package folder:

cd pangea-multipass/packages/pangea-multipass

To run the examples, install the dependencies, including Multipass, in each examples folder:

poetry install --no-root

Data sources

This is the current list of upstream data sources the core library supports. Configure the ones you need and store the credentials for the examples. Most of these data sources will require administrator access to get the credentials.

Google Drive

In order to use Google Drive as a source in the examples, you need to:

After setting up the Google environment variables and installing dependencies, you can run a check to verify that everything is working properly.

Navigate to the examples/llama_index_examples folder and run the following script:

poetry run python 03-rag-LlamaIndex-gdrive-filter.py

The Google login page will display and you should be able to log in. The terminal should then display something similar to the following:

poetry run python 03-rag-LlamaIndex-gdrive-filter.py
Loading Google Drive docs...
Login to GDrive as admin...
Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=103...mi.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A64963%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.readonly&state=YVx...RVGi&access_type=offline
Processing 2 docs...
Create and save index...
Login to GDrive as user...
Enter your question:

Atlassian

For both Jira and Confluence, which are Atlassian products, you will need to create a token. The process is the same for both products and is described in the Atlassian support documentation.

Jira

In order to use Jira as a source, you need to set the following environment variables:

  • JIRA_BASE_URL: Jira project base URL in the following format: <your-project-id>.atlassian.net/. Make sure you remove the https:// part.
  • JIRA_ADMIN_EMAIL: Admin email used at ingestion time. The system will process all of the tickets that this user can access.
  • JIRA_ADMIN_TOKEN: Access token of the JIRA_ADMIN_EMAIL above.
  • JIRA_USER_EMAIL: User email used at inference time. This email will be used to validate which of the tickets returned by the LLM the user can access. When using JIRA_USER_ACCOUNT_ID, this variable is not necessary.
  • JIRA_USER_TOKEN: Access token of the JIRA_USER_EMAIL above. When using JIRA_USER_ACCOUNT_ID, this variable is not necessary.
  • JIRA_USER_ACCOUNT_ID: Set this variable to use JIRA_ADMIN_TOKEN and JIRA_ADMIN_EMAIL at inference time to check user permissions. When JIRA_USER_ACCOUNT_ID is set, there is no need to set the JIRA_USER_EMAIL and JIRA_USER_TOKEN variables. The user account ID is found on the user's Jira profile page and is everything after jira/people/.
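The choice between the two credential modes described above can be sketched as a small selection helper. This is an illustrative function, not part of the library's API:

```python
import os

def jira_inference_credentials() -> tuple[str, str]:
    """Pick the email/token pair to use at inference time.

    If JIRA_USER_ACCOUNT_ID is set, the admin credentials are reused and
    permissions are checked against that account id; otherwise the
    dedicated user email/token pair is required.
    """
    if os.getenv("JIRA_USER_ACCOUNT_ID"):
        return os.environ["JIRA_ADMIN_EMAIL"], os.environ["JIRA_ADMIN_TOKEN"]
    return os.environ["JIRA_USER_EMAIL"], os.environ["JIRA_USER_TOKEN"]

# Demo values only; real values come from your Jira account.
os.environ.update({
    "JIRA_ADMIN_EMAIL": "admin@example.com",
    "JIRA_ADMIN_TOKEN": "admin-token",
    "JIRA_USER_ACCOUNT_ID": "5b10ac8d82e05b22cc7d4ef5",
})
print(jira_inference_credentials())  # ('admin@example.com', 'admin-token')
```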

After setting up the Jira environment variables and installing dependencies, you can run a check to verify that everything is working properly.

To verify your Jira configuration, navigate to the examples/llama_index_examples folder and run the following script:

poetry run python 07-jira-check-access.py

This should return the list of Jira docs that the user can access:

Loading Jira docs...
Processing 2 Jira docs...

Authorized issues: 5
10006
10005
10002
10001
10000

Confluence

In order to use Confluence as a source, you need to set the following environment variables.

  • CONFLUENCE_BASE_URL - Confluence project base URL in the following format: https://<your-project-id>.atlassian.net/.
  • CONFLUENCE_ADMIN_EMAIL - Admin email used at ingestion time. The system will process all of the files this user can access.
  • CONFLUENCE_ADMIN_TOKEN - Access token of the admin email set above.
  • CONFLUENCE_USER_EMAIL - User email used at inference time. This email will be used to validate which of the pages returned by the LLM the user can access.
  • CONFLUENCE_USER_TOKEN - Access token of the CONFLUENCE_USER_EMAIL set above.
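Before running the check script, it can help to confirm that every Confluence variable is actually set. The `missing_vars` helper below is a generic sketch, not part of the library:

```python
import os

CONFLUENCE_VARS = [
    "CONFLUENCE_BASE_URL",
    "CONFLUENCE_ADMIN_EMAIL",
    "CONFLUENCE_ADMIN_TOKEN",
    "CONFLUENCE_USER_EMAIL",
    "CONFLUENCE_USER_TOKEN",
]

def missing_vars(names: list[str]) -> list[str]:
    """Return the names of variables that are unset or empty."""
    return [name for name in names if not os.getenv(name)]

missing = missing_vars(CONFLUENCE_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

The same pattern works for any of the data sources in this tutorial; only the variable list changes.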

For Confluence, the check script is similar and should also be run from the examples/llama_index_examples folder:

poetry run python 08-confluence-check-access.py

This gives output similar to the following:

Loading Confluence docs...
Processing 9 Confluence docs...
Loaded 9 pages.

Authorized pages: 6
98444
393323
622596
753666
1081345
1245185

GitHub

You can follow GitHub's support documentation to create an access token.

In order to use GitHub as a source, you need to set the following environment variables:

  • GITHUB_ADMIN_TOKEN: Access token used at ingestion time. The system will process all of the repositories this token can access.
    • This should be a fine-grained personal access token with access to all of the repositories owned by the admin account, with both the content and metadata repository permissions set to read access.
  • GITHUB_USERNAME: Username used at inference time. The username can be obtained by logging in to GitHub and navigating to the user profile; it displays below the user's avatar. This account will be used to validate the user's access to the files returned by the LLM.

After setting up the GitHub environment variables and installing dependencies, you can run a check to verify that everything is working properly.

Run the following command inside the multipass_examples folder to test the configuration. Note that the number of repositories directly affects how long the test takes. Against a production-scale GitHub account, this can take a very long time and can even exhaust hourly API request quotas. We recommend performing this test with a small test account.

poetry run python 01-github-check-access.py

The output should be similar to the following.

Loaded 8 docs:
offices.txt
strategy.txt
capacitor.txt
folder_1/internal_architecture.txt
folder_2/react.txt
folder_1/salaries.txt
folder_2/venture-capital.txt
interest-rate.txt

Authorized docs: 5
offices.txt
strategy.txt
capacitor.txt
folder_1/internal_architecture.txt
folder_2/react.txt

Slack

In order to use Slack as a source, you need to set the following environment variables: SLACK_ADMIN_TOKEN and SLACK_USER_TOKEN.

To get these tokens, you must first use a Slack workspace admin account to create a Slack app following this tutorial. The default app settings are sufficient.

After you create an app, you need to navigate to the OAuth & Permissions page of your app to add scopes. Scroll down to a section on that page labeled Scopes. Click Add an OAuth Scope and add any required scopes.

The token's scope should at least contain channels:history and groups:history in order to process all public and private channels. These scopes are only visible to admin accounts in Slack.
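A quick way to sanity-check a token's scopes is to compare the comma-separated scope list (as Slack reports it, for example in the x-oauth-scopes response header) against the required set. This helper is an illustrative sketch, not part of the library:

```python
REQUIRED_SCOPES = {"channels:history", "groups:history"}

def missing_scopes(scope_header: str) -> set[str]:
    """Given a comma-separated scope string, return required scopes not granted."""
    granted = {s.strip() for s in scope_header.split(",") if s.strip()}
    return REQUIRED_SCOPES - granted

print(missing_scopes("channels:history,channels:read"))  # {'groups:history'}
```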

  • SLACK_ADMIN_TOKEN: Access token used at ingestion time. The system will process all the channels this token has access to.
  • SLACK_USER_TOKEN: Access token of the user, used at inference time. It will be used to validate the user's access to the messages returned by the LLM.

After setting up the Slack environment variables and installing dependencies, you can run a check to verify that everything is working properly.

In the multipass_examples folder, run the following command to test the admin token configuration and user access.

poetry run python 03-slack-check-access.py

This should give output similar to the following:

Loaded 45 messages.
User has access to channel ids:
random
User has access to 10 messages
Loaded 10 messages.

GitLab setup

In order to use GitLab as a source, you need to set two environment variables:

  • GITLAB_ADMIN_TOKEN: Admin access token of the GitLab account, used at ingestion time. The system processes all repositories that the admin user can access.
  • GITLAB_USERNAME: Username used at inference time. This username will be used to validate which of the files returned by the LLM the user can access.

After configuring the GitLab variables, you can verify that it is set up correctly by running the following command inside the multipass_examples folder.

poetry run python 06-gitlab-check-access.py

If it is set up correctly, then the output should look similar to the output below. The files the user can access should be a subset of the files loaded. If the user shows access to 0 files, then the GITLAB_USERNAME might have a typo. Verify using the method described in the GitLab setup instructions.

Loaded 4 files.
User 'cbass' has access to 4 files.

Dropbox setup

In order to use Dropbox as a source, you need to set the DROPBOX_APP_KEY environment variable. To do so, the admin user needs to:

  1. Sign in to Dropbox .
  2. Navigate to create a Dropbox app and create an app. Define the scope and access you want to provide to Pangea Multipass.
  3. Copy the App key that displays in the newly created app.
  4. Set the DROPBOX_APP_KEY to the App key that you copied from Dropbox.
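The authorization URL that appears on first run (shown in the test steps below) follows the standard OAuth 2.0 PKCE flow: a random code verifier is generated locally and its S256 challenge is sent in the URL. The sketch below shows how such a URL is constructed; YOUR_APP_KEY is a placeholder for the App key you copied, and this is an illustration of the flow, not the library's own code:

```python
import base64
import hashlib
import secrets
from urllib.parse import urlencode

def pkce_pair() -> tuple[str, str]:
    """Generate a PKCE code_verifier and its S256 code_challenge."""
    verifier = secrets.token_urlsafe(48)
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    # Base64url-encode without padding, per RFC 7636.
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge

verifier, challenge = pkce_pair()
params = {
    "client_id": "YOUR_APP_KEY",  # placeholder: the App key from step 3
    "response_type": "code",
    "token_access_type": "offline",
    "redirect_uri": "http://localhost:8080",
    "code_challenge": challenge,
    "code_challenge_method": "S256",
}
url = "https://www.dropbox.com/oauth2/authorize?" + urlencode(params)
print(url)
```

Keep the verifier secret on the client; it is exchanged along with the authorization code after the admin approves the app.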

To test the configuration:

  1. In the multipass_examples folder, run the following command.
poetry run python 04-dropbox-check-access.py
  2. The first time you run the command, a login screen opens in a browser with a link similar to this:

    https://www.dropbox.com/oauth2/authorize?client_id=hm...6&response_type=code&token_access_type=offline&redirect_uri=http://localhost:8080&code_challenge=1T9...R4&code_challenge_method=S256

    Log in with the Dropbox admin account and approve connecting the Pangea Multipass Dropbox app. This is a one-time requirement.

  3. After you connect to the Pangea Multipass app, the terminal loads documents and displays the output. It should look similar to the following text.

Listening for authentication response on http://localhost:8080 ...
127.0.0.1 - - [18/Feb/2025 11:18:32] "GET /?code=crD1...3gA HTTP/1.1" 200 -
Loading documents from Dropbox...
Loaded 1 docs
Filtering authorized documents...
Authorized docs: 1

Next steps

You can add our AI Guard and Prompt Guard services on top of Pangea Multipass to continue adding security and guard rails for your AI app. Adding AI security to your app will help prevent data leaks and jailbreaking of your LLM, reducing your risk.

Another option is to add tamperproof audit logging with our Secure Audit Log service. This service will provide you with an audit trail to trace any issues that might occur and create fixes that prevent further occurrences.
