Implementing Pangea Multipass
Pangea Multipass is a general-purpose library for checking a user's access to resources in an upstream system. It ingests resources using an admin account and then, at inference time, filters out any data the requesting user is not permitted to access.
Although this library is intended for use with AI/LLM apps, you can use it independently.
Features
Pangea Multipass contains the following features:
- Document Reading: Supports document content extraction for use in processing and enrichment.
- Metadata Enrichment: Includes enrichers for hashing, constant value setting, and custom metadata.
- Metadata Filtering: Provides flexible operators to filter document metadata for customized queries.
- Authorization Processing: Manages authorized and unauthorized nodes with customizable node processors.
- Extensible: Built on abstract base classes, allowing easy extension and customization of functionality.
Prerequisites
- Python v3.10 or greater, but less than v3.13
- Poetry v2 or greater
All other dependencies will be installed using Poetry later in the tutorial.
Set up the environment
Using many of the following upstream data sources requires first setting up environment variables, which are used both when ingesting data and at inference time.
The following information is provided as a quick reference and will be used throughout the tutorial for the supplied data sources.
Mac and Linux systems
export VARIABLE_NAME=value
Windows
setx VARIABLE_NAME=value
You might need to reopen the terminal or reload your shell profile before the environment variables are available in your terminal.
You can also add the environment variables to Docker, Kubernetes, or your code using your preferred programming language. Consult the documentation for your preferred method to complete these tasks.
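If you prefer to check the variables from code, a minimal Python sketch like the following reads one with the standard library; the variable name used here is only an illustration:
import os

# Read a variable set earlier with export (macOS/Linux) or setx (Windows).
value = os.environ.get("JIRA_BASE_URL")
if value is None:
    raise SystemExit("JIRA_BASE_URL is not set in this shell session")
print(f"JIRA_BASE_URL = {value}")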
Installation
To use the examples from the GitHub repository, clone the Pangea Multipass repository into your app folder using the following command:
git clone https://github.com/pangeacyber/pangea-multipass.git
Install dependencies
Multipass uses Poetry for dependency management and virtual environment setup. This simplifies installation and ensures that the correct dependency versions are installed for Pangea Multipass.
Navigate to the Pangea Multipass package folder:
cd pangea-multipass/packages/pangea-multipass
To run the examples, install the dependencies, including Multipass, in each examples folder:
poetry install --no-root
Data sources
The following upstream data sources are currently supported by the core library. Configure the ones you need and store their credentials for the examples. Most of these data sources require administrator access to obtain credentials.
Google Drive
In order to use Google Drive as a source in the examples, you need to:
- Create a Desktop OAuth 2.0 client in the Google console on the Credentials page under APIs & Services, and save the credentials as credentials.json in the <repo-root-directory>/examples/ folder.
- In the example script, update the gdrive_fid variable value with the ID of the Google Drive folder to process. This folder ID is found by navigating to the folder in Google Drive and copying everything after folders/ in the browser link.
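If you want to sanity-check a copied link before editing the script, a small helper such as the one below can pull the folder ID out of the URL. This is only a sketch; extract_gdrive_folder_id and the sample link are illustrative and not part of the library:
# Extract the Google Drive folder ID from a copied browser link.
def extract_gdrive_folder_id(url: str) -> str:
    marker = "folders/"
    start = url.index(marker) + len(marker)
    end = len(url)
    # The ID ends at the next URL delimiter, if there is one.
    for delim in ("?", "/", "#"):
        pos = url.find(delim, start)
        if pos != -1:
            end = min(end, pos)
    return url[start:end]

# Example (made-up folder ID):
print(extract_gdrive_folder_id("https://drive.google.com/drive/folders/1AbCdEfGhIjKlMnOp?usp=sharing"))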
After setting up the Google credentials and installing the dependencies, you can run a check to verify that everything is working properly.
Navigate to the examples/llama_index_examples folder and run the following script:
poetry run python 03-rag-LlamaIndex-gdrive-filter.py
The Google login page will display and you should be able to log in. The terminal should then display something similar to the following:
poetry run python 03-rag-LlamaIndex-gdrive-filter.py
Loading Google Drive docs...
Login to GDrive as admin...
Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=103...mi.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A64963%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.readonly&state=YVx...RVGi&access_type=offline
Processing 2 docs...
Create and save index...
Login to GDrive as user...
Enter your question:
Atlassian
For both Jira and Confluence, which are Atlassian products, you will need to create a token. The process is the same for both products and is described in Atlassian's support documentation.
Jira
In order to use Jira as a source, you need to set the following environment variables:
- JIRA_BASE_URL: Jira project base URL. It uses the following format: <your-project-id>.atlassian.net/. Make sure you remove the https:// part.
- JIRA_ADMIN_EMAIL: Admin email used at ingestion time. The system will process all of the tickets that this user can access.
- JIRA_ADMIN_TOKEN: Access token of the JIRA_ADMIN_EMAIL above.
- JIRA_USER_EMAIL: User email used at inference time. This email is used to validate which of the tickets returned by the LLM the user can access. When JIRA_USER_ACCOUNT_ID is set, this variable is not necessary.
- JIRA_USER_TOKEN: Access token of the user email set above. When JIRA_USER_ACCOUNT_ID is set, this variable is not necessary.
- JIRA_USER_ACCOUNT_ID: Set JIRA_USER_ACCOUNT_ID to use JIRA_ADMIN_TOKEN and JIRA_ADMIN_EMAIL at inference time to check user permissions. When JIRA_USER_ACCOUNT_ID is set, there is no need to set the JIRA_USER_EMAIL and JIRA_USER_TOKEN variables. The user account ID is found on the user's Jira profile page and is everything after jira/people/ in the URL.
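If you want to confirm that a credential pair works before running the example, you can call the Jira Cloud REST API directly. This is only a sketch; it uses the requests package, which is not a Multipass dependency:
import os
import requests

# Ask Jira Cloud who the admin token authenticates as.
base_url = os.environ["JIRA_BASE_URL"].rstrip("/")  # <your-project-id>.atlassian.net
resp = requests.get(
    f"https://{base_url}/rest/api/2/myself",
    auth=(os.environ["JIRA_ADMIN_EMAIL"], os.environ["JIRA_ADMIN_TOKEN"]),
    timeout=30,
)
resp.raise_for_status()
print("Authenticated to Jira as:", resp.json().get("displayName"))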
After setting up the Jira environment variables and installing dependencies, you can run a check to verify that everything is working properly.
To verify your Jira configuration, navigate to the examples/llama_index_examples folder and run the following script:
poetry run python 07-jira-check-access.py
This should return the list of Jira docs that the user can access:
Loading Jira docs...
Processing 2 Jira docs...
Authorized issues: 5
10006
10005
10002
10001
10000
Confluence
In order to use Confluence as a source, you need to set the following environment variables:
- CONFLUENCE_BASE_URL: Confluence project base URL in the following format: https://<your-project-id>.atlassian.net/.
- CONFLUENCE_ADMIN_EMAIL: Admin email used at ingestion time. The system will process all of the files this user can access.
- CONFLUENCE_ADMIN_TOKEN: Access token of the admin email set above.
- CONFLUENCE_USER_EMAIL: User email used at inference time. This email is used to validate which of the pages returned by the LLM the user can access.
- CONFLUENCE_USER_TOKEN: Access token of the CONFLUENCE_USER_EMAIL set above.
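As with Jira, you can optionally verify the Confluence credentials on their own with a direct call to the Confluence Cloud REST API. Again, this is just a sketch that assumes the requests package is available:
import os
import requests

# Ask Confluence Cloud who the admin token authenticates as.
base_url = os.environ["CONFLUENCE_BASE_URL"].rstrip("/")  # https://<your-project-id>.atlassian.net
resp = requests.get(
    f"{base_url}/wiki/rest/api/user/current",
    auth=(os.environ["CONFLUENCE_ADMIN_EMAIL"], os.environ["CONFLUENCE_ADMIN_TOKEN"]),
    timeout=30,
)
resp.raise_for_status()
print("Authenticated to Confluence as:", resp.json().get("displayName"))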
For Confluence, the check script is similar and should also be run from the examples/llama_index_examples folder:
poetry run python 08-confluence-check-access.py
This will give output similar to the following:
Loading Confluence docs...
Processing 9 Confluence docs...
Loaded 9 pages.
Authorized pages: 6
98444
393323
622596
753666
1081345
1245185
GitHub
You can follow GitHub's support documentation to create an access token.
In order to use GitHub as a source, you need to set the following environment variables:
- GITHUB_ADMIN_TOKEN: Access token used at ingestion time. The system will process all of the repositories this token can access. This should be a fine-grained personal access token with access to all of the repositories owned by the admin account and with both the content and metadata repository permissions set to read access.
- GITHUB_USERNAME: Username used at inference time. The username can be obtained by logging in to GitHub and navigating to the user profile; it displays below the user's avatar. This account will be used to validate the user's access to the files returned by the LLM.
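Before running the example, you can optionally confirm the token and username against the public GitHub REST API. This sketch assumes the requests package is installed; it is not part of Multipass:
import os
import requests

# Verify the fine-grained admin token by asking GitHub which account owns it.
headers = {"Authorization": f"Bearer {os.environ['GITHUB_ADMIN_TOKEN']}"}
resp = requests.get("https://api.github.com/user", headers=headers, timeout=30)
resp.raise_for_status()
print("Admin token belongs to:", resp.json()["login"])

# Verify that the inference-time username exists and is spelled correctly.
username = os.environ["GITHUB_USERNAME"]
resp = requests.get(f"https://api.github.com/users/{username}", headers=headers, timeout=30)
resp.raise_for_status()
print("User account found:", resp.json()["login"])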
After setting up the GitHub environment variables and installing dependencies, you can run a check to verify that everything is working properly.
Run the following command inside the multipass_examples folder to test the configuration. Keep in mind that the number of repositories directly affects how long the test takes. When run against a production-scale GitHub account, it can take a very long time and can even exhaust hourly API request quotas, so we recommend performing this test with a small test account.
poetry run python 01-github-check-access.py
The output should be similar to the following.
Loaded 8 docs:
offices.txt
strategy.txt
capacitor.txt
folder_1/internal_architecture.txt
folder_2/react.txt
folder_1/salaries.txt
folder_2/venture-capital.txt
interest-rate.txt
Authorized docs: 5
offices.txt
strategy.txt
capacitor.txt
folder_1/internal_architecture.txt
folder_2/react.txt
Slack
In order to use Slack as a source, you need to set the following environment variables: SLACK_ADMIN_TOKEN and SLACK_USER_TOKEN.
To get these tokens, you must first use a Slack workspace admin account to create a Slack app following this tutorial. The default app settings are sufficient.
After you create an app, you need to navigate to the OAuth & Permissions page of your app to add scopes. Scroll down to a section on that page labeled Scopes. Click Add an OAuth Scope and add any required scopes.
The token's scope should at least contain channels:history and groups:history in order to process all public and private channels. These scopes are only visible to admin accounts in Slack.
- SLACK_ADMIN_TOKEN: Access token used at ingestion time. The system will process all of the channels this token has access to.
- SLACK_USER_TOKEN: User token used at inference time. It will be used to validate the user's access to the messages returned by the LLM.
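You can also check each Slack token on its own with Slack's auth.test method before running the example. This is a sketch that assumes the requests package is available:
import os
import requests

# auth.test reports the workspace and identity a token authenticates as.
def check_slack_token(name: str) -> None:
    resp = requests.post(
        "https://slack.com/api/auth.test",
        headers={"Authorization": f"Bearer {os.environ[name]}"},
        timeout=30,
    )
    data = resp.json()
    if not data.get("ok"):
        raise SystemExit(f"{name} failed auth.test: {data.get('error')}")
    print(f"{name} authenticates as {data.get('user')} in team {data.get('team')}")

check_slack_token("SLACK_ADMIN_TOKEN")
check_slack_token("SLACK_USER_TOKEN")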
After setting up the Slack environment variables and installing dependencies, you can run a check to verify that everything is working properly.
In the multipass_examples folder, run the following command to test the admin token configuration and user access.
poetry run python 03-slack-check-access.py
This should give output similar to the following:
Loaded 45 messages.
User has access to channel ids:
random
User has access to 10 messages
Loaded 10 messages.
GitLab setup
In order to use GitLab as a source, you need to set two environment variables:
- GITLAB_ADMIN_TOKEN: Access token of the GitLab admin user, used at ingestion time. The system processes all repositories that the admin user can access.
- GITLAB_USERNAME: Username used at inference time. This username will be used to validate which of the files returned by the LLM the user can access.
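To confirm the GitLab credentials independently of the example, you can query the GitLab REST API directly. The sketch below uses the requests package and assumes a gitlab.com instance:
import os
import requests

# The /user endpoint identifies the owner of the admin token.
headers = {"PRIVATE-TOKEN": os.environ["GITLAB_ADMIN_TOKEN"]}
resp = requests.get("https://gitlab.com/api/v4/user", headers=headers, timeout=30)
resp.raise_for_status()
print("Admin token belongs to:", resp.json()["username"])

# Look up the inference-time username to catch typos early.
username = os.environ["GITLAB_USERNAME"]
resp = requests.get("https://gitlab.com/api/v4/users", params={"username": username}, headers=headers, timeout=30)
resp.raise_for_status()
matches = resp.json()
if not matches:
    raise SystemExit(f"No GitLab user found with username '{username}'")
print("User account found:", matches[0]["username"])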
After configuring the GitLab variables, you can verify that everything is set up correctly by running the following command inside the multipass_examples folder.
poetry run python 06-gitlab-check-access.py
If it is set up correctly, the output should look similar to the output below. The files the user can access should be a subset of the files loaded. If the user shows access to 0 files, the GITLAB_USERNAME might have a typo. Verify it using the method described in the GitLab setup instructions.
Loaded 4 files.
User 'cbass' has access to 4 files.
Dropbox setup
In order to use Dropbox as a source, you need two environment variables:
- DROPBOX_APP_KEY: The identifier for the Dropbox app that Multipass will use to access your files. For testing, you can use our Pangea app with key: hmhe1wh0koy8cv6. A Dropbox app key is created when you create a Dropbox app. You can then obtain the app key by navigating to your app in your Dropbox developer account.
- DROPBOX_USER_EMAIL: User email used at inference time. This email will be used to validate which of the files returned by the LLM the user has access to.
When using Dropbox as a source, the admin user needs to:
- Sign in to Dropbox.
- Navigate to create a Dropbox app and create an app. Define the scope and access you want to provide to Pangea Multipass.
- Copy the App key that displays in the newly created app.
- Set the DROPBOX_APP_KEY to the App key that you copied from Dropbox.
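The example script handles the OAuth flow for you with a localhost redirect. If you only want to confirm that the app key works, a simplified no-redirect flow using the official dropbox Python SDK (an extra dependency installed separately) looks roughly like this:
import os
import dropbox
from dropbox import DropboxOAuth2FlowNoRedirect

# PKCE flow for an app key without an app secret; prints a URL to visit manually.
flow = DropboxOAuth2FlowNoRedirect(
    os.environ["DROPBOX_APP_KEY"], use_pkce=True, token_access_type="offline"
)
print("1. Go to:", flow.start())
print("2. Approve access and copy the authorization code.")
auth_code = input("3. Paste the code here: ").strip()

result = flow.finish(auth_code)
dbx = dropbox.Dropbox(oauth2_access_token=result.access_token)
print("Connected to Dropbox as:", dbx.users_get_current_account().email)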
To test the configuration:
- In the multipass_examples folder, run the following command.
poetry run python 04-dropbox-check-access.py
- The first time you run the command, a login screen opens in a browser with a link similar to this:
https://www.dropbox.com/oauth2/authorize?client_id=hm...6&response_type=code&token_access_type=offline&redirect_uri=http://localhost:8080&code_challenge=1T9...R4&code_challenge_method=S256
Log in with the Dropbox admin account and approve connecting the Pangea Multipass Dropbox app. This is a one-time requirement.
- After you connect to the Pangea Multipass app, the terminal loads documents and displays the output. It should look similar to the following text.
Listening for authentication response on http://localhost:8080 ...
127.0.0.1 - - [18/Feb/2025 11:18:32] "GET /?code=crD1...3gA HTTP/1.1" 200 -
Loading documents from Dropbox...
Loaded 1 docs
Filtering authorized documents...
Authorized docs: 1
Next steps
You can add our AI Guard and Prompt Guard services on top of Pangea Multipass to continue adding security and guard rails for your AI app. Adding AI security to your app will help prevent data leaks and jailbreaking of your LLM, reducing your risk.
Another option is to add tamperproof audit logging with our Secure Audit Log service. This service will provide you with an audit trail to trace any issues that might occur and create fixes that prevent further occurrences.