Centralized access management in secure RAG applications with Python, LangChain, and Pangea AuthZ
The full code for this tutorial is available on GitHub.
This tutorial demonstrates how permissions from a data source can be translated into authorization policies in Pangea's AuthZ and enable unified access control at inference time in a LangChain-based retrieval-augmented generation (RAG) application in Python.
During ingestion, proprietary documents are embedded in a vector store to enable efficient semantic search. At inference time, the user’s prompt is enriched only with the context they are authorized to access. Centralizing authorization policies in this setup enables consistent access control to vectorized data across different sources and applications for all application users. Capturing these policies at ingestion time reduces latency during inference, enhancing the user experience.
Follow this tutorial to understand the application structure, prepare data, configure services, and run the example. You can skip any steps you’ve already completed. Steps marked Informational provide context on the application code and don’t require any action.
LangChain app architecture
The LangChain application uses Google APIs and an OAuth client to access data from Google Drive, serving as an example of an external data source. Documents from a specified Google Drive folder are ingested into a vector store, with each document identifier stored in the vector metadata. During this process, permissions associated with each Google document are extracted and converted into AuthZ tuples, using the document identifier and user ID as references.
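As a rough illustration of the conversion described above, the sketch below maps permission entries shaped like the Google Drive API v3 permission resource (`type`, `role`, `emailAddress`) to relationship tuples. The `RelationTuple` layout, the `file` and `user` type names, the role-to-relation mapping, and the use of the email address as the user ID are assumptions made for this example, not the example application's actual code.

```python
from typing import NamedTuple


class RelationTuple(NamedTuple):
    """Hypothetical tuple shape used only in this sketch."""
    resource_type: str
    resource_id: str
    relation: str
    subject_type: str
    subject_id: str


# Assumed mapping from Google Drive roles to relations in a File Drive-style schema.
ROLE_TO_RELATION = {"owner": "owner", "writer": "editor", "reader": "reader"}


def permissions_to_tuples(file_id: str, permissions: list[dict]) -> list[RelationTuple]:
    """Convert Drive-style permission entries for one file into relation tuples."""
    tuples = []
    for perm in permissions:
        relation = ROLE_TO_RELATION.get(perm.get("role", ""))
        email = perm.get("emailAddress")
        # Skip non-user grants (for example "anyone" links) and unmapped roles.
        if perm.get("type") != "user" or relation is None or not email:
            continue
        tuples.append(RelationTuple("file", file_id, relation, "user", email))
    return tuples


example = permissions_to_tuples(
    "doc-123",
    [
        {"type": "user", "role": "writer", "emailAddress": "owner@example.com"},
        {"type": "user", "role": "reader", "emailAddress": "alice@example.com"},
        {"type": "anyone", "role": "reader"},
    ],
)
for t in example:
    print(t)
```

Skipping grants that are not tied to a specific user keeps the sketch conservative: only explicit per-user shares become tuples.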
To use the application, users sign in through Pangea’s AuthN-hosted login page, either using the Google social login option or their Google email as a username. At inference time, user questions are processed to retrieve relevant context from the vector store via semantic search on the embedded data. This search is filtered based on the permission assignments stored in AuthZ for the authenticated user.
The application administrator can optionally expand access to the embedded data by creating additional AuthZ policies based on the same document IDs, granting permissions to users without direct access to the Google Drive data. Additionally, the application can enforce rules beyond those in the original data source by leveraging inherited permissions and the RBAC, ReBAC, and ABAC policies defined in the authorization schema.
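To make the idea of inherited permissions more concrete, here is a minimal in-memory sketch of a relationship-based check in which read access on a folder carries over to the files it contains. The tuple format, the `parent` and `reader` relation names, and the `can_read` helper are all hypothetical; the AuthZ service evaluates its own schema and is not implemented this way.

```python
# (resource, relation, subject) triples; all names are illustrative only.
TUPLES = {
    ("folder:hr", "reader", "user:carol"),
    ("file:pto-alice", "parent", "folder:hr"),
    ("file:pto-alice", "reader", "user:alice"),
}


def can_read(user, resource, tuples, _seen=None):
    """Return True if the user reads the resource directly or via a parent."""
    seen = _seen if _seen is not None else set()
    if resource in seen:
        return False  # Guard against cycles in parent relations.
    seen.add(resource)
    if (resource, "reader", user) in tuples:
        return True
    parents = [s for (r, rel, s) in tuples if r == resource and rel == "parent"]
    return any(can_read(user, parent, tuples, seen) for parent in parents)


print(can_read("user:alice", "file:pto-alice", TUPLES))  # True (direct grant)
print(can_read("user:carol", "file:pto-alice", TUPLES))  # True (via folder:hr)
print(can_read("user:bob", "file:pto-alice", TUPLES))    # False
```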
Prerequisites
Python
- Python v3.12 or greater
- pip v24.0 or greater
Project code
- Clone the project repository from GitHub.
  git clone https://github.com/pangeacyber/authz-rag-app.git
  cd authz-rag-app
- Create a virtual environment and install the required packages.
  python -m venv .venv
  source .venv/bin/activate
  pip install .
OpenAI API key
We will use OpenAI models. Get your OpenAI API key to run the examples.
Save your key in a .env file, for example:
# OpenAI
OPENAI_API_KEY="sk-proj-54bgCI...vG0g1M-GWlU99...3Prt1j-V1-4r0MOL...X6GMA"
Google Cloud project
To get started, you'll need a Google account to create a Google Cloud project with both the Google Drive and Google Sheets APIs enabled. This project also requires a service account to grant your LangChain application access to these APIs. Using this Google account, you can create a folder in Google Drive that your application will read from and use to ingest example data.
If your organizational Google account does not allow setting up a Google Cloud project this way, you can use a personal Google account instead.
Additionally, you’ll need a separate Google account to test restricted access to the files stored in this folder.
To set up a Google Cloud project, you can follow the official Google documentation:
- Follow the Creating and managing projects guide in the Google Resource Manager documentation to create your Google Cloud project.
  In your Google Cloud project, enable both the Google Drive API and the Google Sheets API.
- Follow the Create service accounts and Create and delete service account keys guides in the Google IAM documentation to create a service account and save its credentials in your LangChain application folder.
Below, find an example walk-through.
Create a Google Cloud Project
- Go to the Google Cloud Console Manage resources page and sign in to your Google account.
- On the Manage resources page, click + CREATE PROJECT or select an existing project.
- If creating a new project, enter your Project name and click CREATE. Wait for the confirmation in the Notifications panel on the right.
- Once confirmed, click SELECT PROJECT in the Notifications panel.
Enable Google Drive and Google Sheets API
- In the search bar, type Drive and click the Google Drive API link in the search results.
- On the Google Drive API details page, click ENABLE.
- Then, search for Sheets and click the Google Sheets API link in the search results.
- On the Google Sheets API details page, click ENABLE.
Create a Service Account
- Using the navigation menu in the top left, go to IAM & Admin >> Service Accounts.
- On the Service accounts page, click + CREATE SERVICE ACCOUNT.
- On the Create service account page, under Service account details, fill out the form and click CREATE AND CONTINUE.
- In the Grant this service account access to project (optional) step, click CONTINUE.
- In the Grant users access to this service account (optional) step, click DONE.
- On the Service accounts for project "<your-Google-Cloud-project-name>" page, click on your service account link.
- On the Service account details page, go to the KEYS tab.
- On the Keys page, click ADD KEY and select Create new key.
- In the Create private key for <your-service-account-email> dialog, select JSON and click CREATE.
- You will be prompted to save your key; save it in your LangChain project folder as a credentials.json file.
Your credentials.json file should look similar to this:
{
"type": "service_account",
"project_id": "my-project",
"private_key_id": "l3JYno7aIrRSZkAGFHSNPcjYS6lrpL1UnqbkWW1b",
"private_key": "...",
"client_email": "my-service-account@my-project.iam.gserviceaccount.com",
"client_id": "1234567890",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/my-service-account%40my-project.iam.gserviceaccount.com",
"universe_domain": "googleapis.com"
}
Note your service account email in the <service-account-ID>@<Google-Cloud-project-ID>.iam.gserviceaccount.com format; you will need it in the next step.
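If you prefer to read the service account email from the key file rather than copy it from the console, a small helper along these lines could do it. This is a sketch that relies only on the `type` and `client_email` fields shown in the example key above; the `service_account_email` function name is made up for this illustration.

```python
import json


def service_account_email(key_json: str) -> str:
    """Return the client_email field from the text of a service account key file."""
    data = json.loads(key_json)
    if data.get("type") != "service_account":
        raise ValueError("not a service account key")
    return data["client_email"]


# Trimmed sample with the same field names as the example key above.
sample_key = json.dumps(
    {
        "type": "service_account",
        "project_id": "my-project",
        "client_email": "my-service-account@my-project.iam.gserviceaccount.com",
    }
)
print(service_account_email(sample_key))
```

With a real key file, you would pass the file contents instead, for example `service_account_email(open("credentials.json").read())`.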
Add example data
- Create a folder in your Google Drive and note its ID from the URL. For example, if the folder URL is https://drive.google.com/drive/folders/1MPqBul...m3yWO4, the folder ID is 1MPqBul...m3yWO4. You’ll provide this ID to the LangChain application to access data in the folder.
- Share this folder with the service account’s email, granting it Editor access so that it can query permissions for the files stored in the folder.
- In the folder, create a spreadsheet for each fictitious employee account you’ll use to test the application. For each spreadsheet:
  - Add some employee-specific information, such as PTO balance. For example:
    Employee | PTO balance, hours
    Alice    | 128
    For simplicity, the employee's name is added directly to the document content. However, it could also be derived from the file permissions read by the application and included in the user prompt context at inference time.
    Currently, the GoogleDriveRetriever class used in this tutorial assumes the first row of a spreadsheet is a header and processes data starting from the second row. If you only populate the first row, it won’t return a document. To ensure data is loaded, include a header row and place your data starting from the second row.
  - Share the spreadsheet with the employee's Google account (the account you’ll use to test restricted access to the files in this folder), granting it at least Viewer access.
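The folder ID mentioned in the steps above can also be pulled out of a folder URL programmatically. The following sketch assumes the URL path contains a `folders/<id>` segment, as in the example URL; `drive_folder_id` is a hypothetical helper, and the ID in the sample URL is a placeholder.

```python
from urllib.parse import urlparse


def drive_folder_id(url: str) -> str:
    """Extract the folder ID from a Google Drive folder URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expect a path such as /drive/folders/<id>.
    if "folders" not in parts or parts.index("folders") + 1 >= len(parts):
        raise ValueError(f"no folder ID found in {url!r}")
    return parts[parts.index("folders") + 1]


print(drive_folder_id("https://drive.google.com/drive/folders/FOLDER_ID_PLACEHOLDER?usp=sharing"))
# FOLDER_ID_PLACEHOLDER
```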
Authorization service
To implement access control during retrieval, the application uses a customized VectorStoreRetriever. This retriever calls the AuthZ service to filter vector search results based on the policies defined for the current user. Authorization policies are created or updated during ingestion, using permissions from the original data source.
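The filtering step can be pictured with the following minimal sketch, which is independent of LangChain and the Pangea SDK. The `Doc` class, the `file_id` metadata key, and the `is_allowed` callback (standing in for a call to the authorization service) are assumptions made for illustration; the example application's retriever may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Doc:
    """Stand-in for a retrieved document chunk and its metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_authorized(
    docs: Iterable[Doc],
    user: str,
    is_allowed: Callable[[str, str], bool],
) -> list[Doc]:
    """Keep only chunks whose source file the user is allowed to read."""
    decisions: dict[str, bool] = {}
    result = []
    for doc in docs:
        file_id = doc.metadata.get("file_id")
        if file_id is None:
            continue  # Deny by default when the source document is unknown.
        if file_id not in decisions:
            # Check each file once, even if several chunks come from it.
            decisions[file_id] = is_allowed(user, file_id)
        if decisions[file_id]:
            result.append(doc)
    return result


# In-memory stand-in for the authorization service.
GRANTS = {("alice@example.com", "sheet-alice")}


def check(user: str, file_id: str) -> bool:
    return (user, file_id) in GRANTS


hits = [
    Doc("Alice 128", {"file_id": "sheet-alice"}),
    Doc("Bob 64", {"file_id": "sheet-bob"}),
]
print([d.text for d in filter_authorized(hits, "alice@example.com", check)])
# ['Alice 128']
```

Caching the decision per file is a design choice for this sketch: it avoids repeating the same authorization check when multiple chunks share a source document.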
Enable AuthZ
To host AuthZ and other Pangea services to secure your LangChain app, start by creating a free Pangea account. After creating your account, click Skip on the Get started with a common service screen. This will take you to the Pangea User Console, where you can enable the service.
To enable the service, click its name in the left-hand sidebar and follow the prompts in the service enablement wizard. Accept the defaults in the first two dialogs.
On the third and final screen with the Done button, select the File Drive authorization schema—this is the schema expected by the example LangChain application. Click Done.
The built-in File Drive authorization schema provides relationship-based access control (ReBAC), which the LangChain application will use to assign user permissions to specific Google documents.
If your application requires a different schema, you can later reset the authorization schema to another built-in example or start fresh with a blank schema.
Once the service is enabled, you will be taken to its Overview page. Capture the Configuration Details:
- Domain (shared across all services in the project)
- Default Token (a token provided by default for each service)
You can copy these values by clicking on the respective property tiles.
Save the configuration values in your .env file, for example:
# OpenAI
OPENAI_API_KEY="sk-proj-54bgCI...vG0g1M-GWlU99...3Prt1j-V1-4r0MOL...X6GMA"
# Pangea
PANGEA_DOMAIN="aws.us.pangea.cloud"
PANGEA_AUTHZ_TOKEN="pts_kwaun3...jhpqzf"
Instead of storing secrets locally and potentially exposing them to the environment, you can securely store your credentials in Vault, optionally enable rotation, and retrieve them dynamically at runtime. Enable Vault the same way you enabled other services by selecting it in the left-hand sidebar of the Pangea User Console. The Manage Secrets documentation provides guidance on storing and using secrets in Vault.
For example, you can store your OpenAI key in Vault and retrieve it using the Vault APIs. When you enable a new Pangea service, its default token is stored in Vault automatically.
The application will set and check permissions in the authorization schema using the Pangea SDK. You can also use the AuthZ APIs to interact with the service.
For more information on setting up the advanced capabilities of the AuthZ service and how to use it, visit the AuthZ documentation.
Authentication service
To reduce risks associated with public access to your application, you can require users to sign in. After authentication, the user’s ID can be used to retrieve the authorization policies stored in AuthZ.
You can add login functionality to your application using the AuthN service. This service allows users to sign in through the Pangea-hosted authorization server using their browser. Your application will then perform an authorization code flow using the Pangea SDK, the same SDK it already uses to communicate with the AuthZ service. For demonstration purposes, the example application uses Flask as the client server.
Enable AuthN
If you're on the AuthZ page, navigate to the list of services by clicking Back to Main Menu in the top left corner. This will return you to the project page in your Pangea User Console, where enabled services are marked with a green dot.
Click AuthN in the left-hand sidebar and follow the prompts, accepting all defaults. When finished, click Done and Finish.
Once the service is enabled, you’ll be taken to its Overview page. Capture the AuthN Configuration Details:
- Client Token (saved as PANGEA_AUTHN_CLIENT_TOKEN in the example below)
- Hosted Login (the login URL that your application will redirect users to for sign-in)
You can copy these values by clicking on the respective property tiles.
Save the configuration values in your .env file, for example:
# OpenAI
OPENAI_API_KEY="sk-proj-54bgCI...vG0g1M-GWlU99...3Prt1j-V1-4r0MOL...X6GMA"
# Pangea
PANGEA_DOMAIN="aws.us.pangea.cloud"
PANGEA_AUTHZ_TOKEN="pts_kwaun3...jhpqzf"
PANGEA_AUTHN_HOSTED_LOGIN="https://pdn-lqcuqlhizxsjrpbewgdrpi53cc72gdit.login.aws.us.pangea.cloud"
PANGEA_AUTHN_CLIENT_TOKEN="pcl_pgd43k...yoy6kn"
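Before running the application, it can help to confirm that every value from the example .env above is present. The sketch below checks a mapping of settings for the variable names used in this tutorial; the `missing_settings` helper is hypothetical, and it assumes the .env file has already been loaded into the process environment by whatever mechanism the application uses.

```python
import os
from collections.abc import Mapping

# Variable names taken from the example .env file in this tutorial.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "PANGEA_DOMAIN",
    "PANGEA_AUTHZ_TOKEN",
    "PANGEA_AUTHN_HOSTED_LOGIN",
    "PANGEA_AUTHN_CLIENT_TOKEN",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an incomplete configuration.
partial = {"OPENAI_API_KEY": "sk-example", "PANGEA_DOMAIN": "aws.us.pangea.cloud"}
print(missing_settings(partial))
# ['PANGEA_AUTHZ_TOKEN', 'PANGEA_AUTHN_HOSTED_LOGIN', 'PANGEA_AUTHN_CLIENT_TOKEN']
```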
Enable Hosted Login flow
An easy and secure way to authenticate your users with AuthN is to use its Hosted Login, which implements the OAuth 2 Authorization Code grant. The Pangea SDK will manage the flow, providing user profile information and allowing you to use the user's login to verify their permissions defined in AuthZ.
- Click General in the left-hand sidebar.
- On the Authentication Settings screen, click Redirect (Callback) Settings.
- In the right pane, click + Redirect.
- Enter http://localhost:3000 in the URL input field.
- Click Save in the Add redirect dialog.
- Click Save again in the Redirect (Callback) Settings pane on the right.
For more information on setting up advanced capabilities of the AuthN service (such as sign-in and sign-up options, security controls, session management, and more), visit the AuthN documentation.
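In the example application the Pangea SDK manages the login flow, so no extra code is needed. To show what the redirect to http://localhost:3000 involves in a generic OAuth 2 Authorization Code grant, the sketch below generates a `state` value and validates the callback URL. It assumes the callback carries the standard `code` and `state` query parameters; the function names are made up for this illustration.

```python
import hmac
import secrets
from urllib.parse import parse_qs, urlparse


def new_state() -> str:
    """Create an unguessable value to send with the login request."""
    return secrets.token_urlsafe(32)


def code_from_callback(callback_url: str, expected_state: str) -> str:
    """Validate the state parameter and return the authorization code."""
    query = parse_qs(urlparse(callback_url).query)
    state = query.get("state", [""])[0]
    code = query.get("code", [""])[0]
    # Constant-time comparison; a mismatch suggests a forged or stale callback.
    if not hmac.compare_digest(state.encode(), expected_state.encode()):
        raise ValueError("state mismatch")
    if not code:
        raise ValueError("missing authorization code")
    return code


state = new_state()
callback = f"http://localhost:3000/?code=abc123&state={state}"
print(code_from_callback(callback, state))
# abc123
```

The returned code would then be exchanged for tokens with the authorization server, which is the part the SDK handles in the example application.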
Run the LangChain application
- Run the application in the project folder terminal, specifying the Google Drive folder ID you set up earlier. For example:
  python -m authz_rag_app --google-drive-folder-id 1MPqBu...m3yWO4
- On the AuthN login screen, select the Continue with Google option and sign in using an account with at least Viewer access to one of the spreadsheets you created in your Google Drive.
  If you don't have an account, the AuthN service allows users to sign up by default. During sign-up, you’ll need to pass a captcha challenge and select a second authentication factor.
  Tip: To simplify future sign-ins, you can enable the Remember My Device setting under Security Controls on the AuthN service page in the Pangea User Console.
  After signing in, close the browser tab that shows the "Done, you can close this tab." message and return to the LangChain application.
- In the terminal, you will see a prompt for a question:
  Ask a question about PTO availability:
- If the currently authenticated user has access to any of the spreadsheets in the folder, the application response can include information from the document content. For example:
  Ask a question about PTO availability: My PTO?
  Your PTO balance is 200 hours.
- (Optional) View the authorization policies created in AuthZ.
  Go to the Assigned Roles & Relations page in your Pangea User Console to review the permission assignments your application created from the Google data.
- (Optional) Add authorization policies.
  AuthZ can abstract access control from the original data sources. To demonstrate this:
  - Click + Assign and add a permission, for example one that gives the user carol access to one of the ingested files.
  - Click Save.
  - Restart the app and sign in using the username carol. During sign-up, provide an email of your preference.
  - Ask a question about the information in the referenced Google Drive file, and it should return the appropriate answer. For example:
    Authenticated as carol (pui_24ppbs3hlyiwlv5cmpslazju4gohq6gp).
    Ask a question about PTO availability: What's the PTO balance?
    PTO balance is 200 hours.
  The application admin can use this approach to share context data with users who don't have direct access to it through other systems.
Conclusion
In this example, data from a Google Drive folder was embedded and saved in a vector store, making it easily accessible for use in an AI application. The share permissions for documents in the folder were converted into authorization policies stored in Pangea's AuthZ service, allowing user permissions to be checked at inference time when adding Google Drive data to the context of user questions.
Similarly, data from other sources can be vectorized, tagged with original document IDs in the vector metadata, and referenced in authorization policies within AuthZ. This approach enables flexible user access, supports sharing the same policies across different applications, and supports RBAC, ReBAC, and ABAC authorization schemas for a wide range of use cases.
For more examples and detailed implementations, explore the following GitHub repositories:
- Authenticating Users for Access Control with RAG for LangChain in Python
- Authenticating Users for Access Control with RAG for LangChain in JavaScript
- User-based Access Control with RAG for LangChain in Python
- Identity and Access Management in LLM apps with Python, LangChain, and Pangea
For an overview of considerations regarding access control implementation in AI apps, check out the Building Authorization in AI Apps blog.