Centralized Access Management in RAG Apps

This tutorial demonstrates how permissions from a data source can be translated into authorization policies in Pangea's AuthZ to enable unified access control at inference time in a LangChain-based retrieval-augmented generation (RAG) application written in Python.

During ingestion, proprietary documents are embedded into the application's vector store to enable efficient semantic search and tagged with their unique IDs in the vector metadata. Document permissions are converted into tuples and stored in AuthZ policies.

At inference time, the user’s prompt is dynamically augmented only with the context they are authorized to access. Centralizing authorization policies in this setup enables consistent access control to vectorized data across different sources and applications for all users. Capturing these policies during ingestion helps reduce latency during inference, improving the overall user experience.

Follow this tutorial to understand the application structure, prepare data, configure services, and run the example.

LangChain app architecture

The LangChain application uses Google APIs and an OAuth client to access data from Google Drive, serving as an example of an external data source.

Ingestion

Documents from a specified Google Drive folder are ingested into a vector store, with each document identifier stored in the vector metadata. During this process, permissions associated with each Google document are extracted and converted into AuthZ tuples, using the document identifier and user ID as references.
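The conversion described above can be sketched in plain Python. This is a minimal illustration, not the example application's code: the `AuthzTuple` container, the `permissions_to_tuples` helper, and the role mapping are hypothetical, and using the email address as the subject ID is an assumption. The input records follow the shape of Google Drive permission entries (`type`, `role`, `emailAddress`).

```python
from dataclasses import dataclass

# Assumed mapping from Google Drive permission roles to roles in the
# authorization schema; adjust it to match the schema you selected.
DRIVE_ROLE_TO_AUTHZ_ROLE = {
    "owner": "owner",
    "writer": "editor",
    "commenter": "reader",
    "reader": "reader",
}


@dataclass(frozen=True)
class AuthzTuple:
    """Hypothetical container for a resource/relation/subject tuple."""

    resource_type: str
    resource_id: str
    relation: str
    subject_type: str
    subject_id: str


def permissions_to_tuples(file_id: str, permissions: list[dict]) -> list[AuthzTuple]:
    """Convert Drive-style permission records for one file into tuples."""
    tuples = []
    for permission in permissions:
        # Only user permissions with an email can be tied to a single user.
        if permission.get("type") != "user" or not permission.get("emailAddress"):
            continue
        relation = DRIVE_ROLE_TO_AUTHZ_ROLE.get(permission.get("role"))
        if relation is None:
            continue  # Skip roles that have no counterpart in the mapping.
        tuples.append(
            AuthzTuple("file", file_id, relation, "user", permission["emailAddress"])
        )
    return tuples


example_permissions = [
    {"type": "user", "role": "owner", "emailAddress": "admin@example.com"},
    {"type": "user", "role": "reader", "emailAddress": "alice@example.com"},
    {"type": "anyone", "role": "reader"},  # Ignored: not tied to a single user.
]
for authz_tuple in permissions_to_tuples("file-id-123", example_permissions):
    print(authz_tuple)
```

In the real application, the resulting tuples would be sent to AuthZ through the Pangea SDK rather than printed.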

Inference

To use the application, users sign in through Pangea’s AuthN-hosted login page, either using the Google social login option or their Google email as a username. At inference time, user questions are processed to retrieve relevant context from the vector store via semantic search on the embedded data. This search is filtered based on the permission assignments stored in AuthZ for the authenticated user.

note

The application administrator can optionally expand access to the embedded data by creating additional AuthZ policies based on the same document IDs, as reflected in the first step. This allows permissions to be granted to users who do not have direct access to the Google Drive data or even a Google account. Furthermore, the application can enforce rules beyond those in the original data source by leveraging inherited permissions and implementing RBAC, ReBAC, and ABAC policies defined in the authorization schema.

Prerequisites

Python

  • Python v3.12 or greater
  • pip v24.0 or greater

Project code

  1. Clone the project repository from GitHub.

    git clone https://github.com/pangeacyber/authz-rag-app.git
    cd authz-rag-app
  2. Create a virtual environment and install the required packages.

    python -m venv .venv
    source .venv/bin/activate
    pip install .

OpenAI API key

We will use OpenAI models. Get your OpenAI API key to run the examples.

Save your key in a .env file, for example:

.env file
# OpenAI
OPENAI_API_KEY="sk-proj-54bgCI...vG0g1M-GWlU99...3Prt1j-V1-4r0MOL...X6GMA"

Google Cloud project

To get started, you'll need a Google account to create a Google Cloud project with both the Google Drive and Google Sheets APIs enabled. This project also requires a service account to grant your LangChain application access to these APIs. Using this Google account, you can create a folder in Google Drive that your application will read from and use to ingest example data.

If your organizational Google account does not allow setting up a Google Cloud project this way, you can use a personal Google account instead.

Additionally, you’ll need a separate Google account to test restricted access to the files stored in this folder.

To set up a Google Cloud project, you can follow the official Google documentation or use the example walk-through below.

Create a Google Cloud Project

  1. Go to the Google Cloud Console Manage resources page and sign in to your Google account.
  2. On the Manage resources page, click + CREATE PROJECT or select an existing project.
  3. If creating a new project, enter your Project name and click CREATE. Wait for the confirmation in the Notifications panel on the right.
  4. Once confirmed, click SELECT PROJECT in the Notifications panel.

Enable Google Drive and Google Sheets API

  1. In the search bar, type Drive and click the Google Drive API link in the search results.
  2. On the Google Drive API details page, click ENABLE.
  3. Then, search for Sheets and click the Google Sheets API link in the search results.
  4. On the Google Sheets API details page, click ENABLE.

Create a Service Account

  1. Using the navigation menu in the top left, go to IAM & Admin >> Service Accounts.
  2. On the Service accounts page, click + CREATE SERVICE ACCOUNT.
  3. On the Create service account page, under Service account details, fill out the form and click CREATE AND CONTINUE.
  4. In the Grant this service account access to project (optional) step, click CONTINUE.
  5. In the Grant users access to this service account (optional) step, click DONE.
  6. On the Service accounts for project "<your-Google-Cloud-project-name>" page, click on your service account link.
  7. On the Service account details page, go to the KEYS tab.
  8. On the Keys page, click ADD KEY and select Create new key.
  9. In the Create private key for <your-service-account-email> dialog, select JSON and click CREATE.
  10. You will be prompted to save your key; save it in your LangChain project folder as a credentials.json file.

Your credentials.json file should look similar to this:

{
"type": "service_account",
"project_id": "my-project",
"private_key_id": "l3JYno7aIrRSZkAGFHSNPcjYS6lrpL1UnqbkWW1b",
"private_key": "...",
"client_email": "my-service-account@my-project.iam.gserviceaccount.com",
"client_id": "1234567890",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/my-service-account%40my-project.iam.gserviceaccount.com",
"universe_domain": "googleapis.com"
}

Note your service account email, which has the <service-account-ID>@<Google-Cloud-project-ID>.iam.gserviceaccount.com format. You will need it in the next step.
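If you prefer to read the email from the key file rather than copy it by hand, a short check like the following may help. It is a sketch under the assumption that the key is saved as credentials.json in the current folder; the `read_service_account_email` helper and the list of required fields are illustrative, not part of the example application.

```python
import json
from pathlib import Path

# Fields this sketch expects in a service account key (an illustrative subset).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email", "token_uri")


def read_service_account_email(credentials: dict) -> str:
    """Return client_email after checking the key looks like a service account key."""
    missing = [field for field in REQUIRED_FIELDS if not credentials.get(field)]
    if missing:
        raise ValueError(f"credentials are missing fields: {', '.join(missing)}")
    if credentials["type"] != "service_account":
        raise ValueError("expected a key of type 'service_account'")
    return credentials["client_email"]


if __name__ == "__main__":
    path = Path("credentials.json")
    if path.exists():
        print(read_service_account_email(json.loads(path.read_text())))
```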

Add example data

  1. Create a folder in your Google Drive and note its ID from the URL. For example, if the folder URL is https://drive.google.com/drive/folders/1MPqBul...m3yWO4, the folder ID is 1MPqBul...m3yWO4. You’ll provide this ID to the LangChain application to access data in the folder.

  2. Share this folder with the service account’s email, granting it Editor access to query permissions for the files stored in it.

    Figure: Google Drive folder sharing dialog.
  3. In the folder, create a spreadsheet for each fictitious employee account you’ll use to test the application. For each spreadsheet:

    1. Add some employee-specific information, such as PTO balance.

      For example:

      Employee | PTO balance, hours
      Alice    | 128

      For simplicity, the employee's name is added directly to the document content. However, it could also be derived from the file permissions read by the application and included in the user prompt context at inference time.

      Currently, the GoogleDriveRetriever class used in this tutorial assumes the first row of a spreadsheet is a header and processes data starting from the second row. If you only populate the first row, it won’t return a document. To ensure data is loaded, include a header row and place your data starting from the second row.

    2. Share the spreadsheet with the employee's Google account (the account you’ll use to test restricted access to the files in this folder), granting it at least Viewer access.

      Figure: Google Drive file sharing dialog.
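The folder ID mentioned in the first step is the path segment that follows `folders` in the folder URL. The sketch below extracts it with the standard library; the `folder_id_from_url` helper and the sample URL are hypothetical and only illustrate the rule.

```python
from urllib.parse import urlparse


def folder_id_from_url(url: str) -> str:
    """Return the path segment that follows 'folders' in a Drive folder URL."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    if "folders" not in segments:
        raise ValueError("not a Google Drive folder URL")
    index = segments.index("folders")
    if index + 1 >= len(segments):
        raise ValueError("the URL does not contain a folder ID")
    return segments[index + 1]


# Query parameters such as ?usp=sharing are not part of the ID.
print(folder_id_from_url("https://drive.google.com/drive/folders/example-folder-id?usp=sharing"))
```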

Pangea services

To use Pangea services to secure your LangChain app, start by creating a free Pangea account. After creating your account, click Skip on the Get started with a common service screen. This will take you to the Pangea User Console home page, where you can enable the required services.

note

If you end up on a different service page in the Pangea User Console, navigate to the list of services by clicking Back to Main Menu in the top left corner.

Authorization service

To implement access control during retrieval, the application uses a customized VectorStoreRetriever. This retriever calls the AuthZ service to filter vector search results based on the policies defined for the current user. Authorization policies are created or updated during ingestion, using permissions from the original data source.
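The filtering idea can be illustrated without LangChain or the Pangea SDK. In this sketch, documents are plain dictionaries with an `id` in their metadata, and `is_allowed` is a hypothetical callable standing in for a call to the AuthZ service; the actual retriever in the example application may be structured differently.

```python
from typing import Callable, Iterable

# Hypothetical signature: (user_id, document_id) -> True if the user may read it.
CheckFn = Callable[[str, str], bool]


def filter_authorized(
    documents: Iterable[dict], user_id: str, is_allowed: CheckFn
) -> list[dict]:
    """Keep only the documents whose metadata ID passes the authorization check."""
    authorized = []
    for document in documents:
        document_id = document.get("metadata", {}).get("id")
        # Documents without an ID cannot be checked, so they are dropped.
        if document_id and is_allowed(user_id, document_id):
            authorized.append(document)
    return authorized


# In-memory stand-in for the authorization service, for illustration only.
granted = {("alice@example.com", "doc-1")}
search_results = [
    {"page_content": "Alice PTO balance", "metadata": {"id": "doc-1"}},
    {"page_content": "Bob PTO balance", "metadata": {"id": "doc-2"}},
]
visible = filter_authorized(
    search_results, "alice@example.com", lambda user, doc: (user, doc) in granted
)
print([doc["metadata"]["id"] for doc in visible])
```

Dropping documents that lack an ID is a deliberate fail-closed choice in this sketch: content that cannot be matched to a policy is never added to the prompt context.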

Enable AuthZ

To enable the service, click its name in the left-hand sidebar on the Pangea User Console home page under the ACCESS category, and follow the prompts in the service enablement wizard. Accept the defaults in the first two dialogs.

In the third and final dialog with the Done button, select the File Drive authorization schema—this is the schema expected by the example LangChain application. Click Done.

Figure: Pangea Services in the Pangea User Console with the AuthZ service highlighted.
Figure: The authorization schema selection dialog in the AuthZ service enablement wizard with the File Drive schema highlighted.

The application will set and check permissions in the authorization schema using the Pangea SDK. You can also use the AuthZ APIs to interact with the service.

Save the API token

Once the service is enabled, you will be taken to its Overview page. Capture the Configuration Details:

  • Domain - The Pangea project domain, shared across all services in the project.
  • Default Token - A token provided by default for each service.

You can copy these values by clicking on the respective property tiles.

Figure: Pangea AuthZ Service Overview page with the service Configuration Details in the Pangea User Console.

Save the configuration values in your .env, for example:

.env file
# OpenAI
OPENAI_API_KEY="sk-proj-54bgCI...vG0g1M-GWlU99...3Prt1j-V1-4r0MOL...X6GMA"

# Pangea
PANGEA_DOMAIN="aws.us.pangea.cloud"
PANGEA_AUTHZ_TOKEN="pts_kwaun3...jhpqzf"

note

Instead of storing secrets locally and potentially exposing them to the environment, you can securely store your credentials in Vault, optionally enable rotation, and retrieve them dynamically at runtime. Enable Vault the same way you enabled other services by selecting it in the left-hand sidebar of the Pangea User Console. The Manage Secrets documentation provides guidance on storing and using secrets in Vault.

For example, you can store your OpenAI key in Vault and retrieve it using the Vault APIs. When you enable a new Pangea service, its default token is stored in Vault automatically.

Authentication service

To reduce risks associated with public access, the application authenticates users using the AuthN service. This service allows users to sign in through the Pangea-hosted page using their browser. Your application will then perform an OAuth Authorization Code flow, using the Pangea SDK to communicate with the AuthN service.

After authentication, the user’s ID can be used to perform authorization checks in AuthZ.

For demonstration purposes, the example application uses Flask as the client server.
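The general shape of an Authorization Code flow can be sketched with the standard library. This is a generic illustration, not the Pangea SDK's API: the `build_login_url` and `extract_code` helpers are hypothetical, and the `redirect_uri`, `state`, and `code` parameter names follow common OAuth 2 conventions, so they are assumptions as far as the hosted login page is concerned. The sketch covers building the login URL and validating the callback; exchanging the code for user information is left to the SDK.

```python
import hmac
import secrets
from urllib.parse import parse_qs, urlencode, urlparse


def build_login_url(hosted_login: str, redirect_uri: str) -> tuple[str, str]:
    """Return the login URL and the random state value embedded in it."""
    state = secrets.token_urlsafe(16)
    query = urlencode({"redirect_uri": redirect_uri, "state": state})
    return f"{hosted_login}?{query}", state


def extract_code(callback_url: str, expected_state: str) -> str:
    """Validate the state in the callback URL and return the authorization code."""
    params = parse_qs(urlparse(callback_url).query)
    state = params.get("state", [""])[0]
    # Constant-time comparison; a mismatch means the callback is not ours.
    if not hmac.compare_digest(state.encode(), expected_state.encode()):
        raise ValueError("state mismatch; the callback may not belong to this login attempt")
    code = params.get("code", [""])[0]
    if not code:
        raise ValueError("callback is missing the authorization code")
    return code


# Placeholder URLs and code value, for illustration only.
login_url, state = build_login_url("https://login.example.com", "http://localhost:3000")
callback = f"http://localhost:3000?code=example-code&state={state}"
print(extract_code(callback, state))
```

The random `state` value ties the callback to the login attempt that started it, which is why the sketch rejects any callback whose state does not match.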

Enable AuthN

On the Pangea User Console home page, under the ACCESS category, click AuthN in the left-hand sidebar and follow the prompts, accepting all defaults. When finished, click Done and Finish.

An easy and secure way to authenticate your users with AuthN is to use its Hosted Login, which implements the OAuth 2 Authorization Code grant. The Pangea SDK will manage the flow, providing user profile information and allowing you to use the user's login to verify their permissions defined in AuthZ.

  1. Click General in the left-hand sidebar.
  2. On the Authentication Settings screen, click Redirect (Callback) Settings.
  3. In the right pane, click + Redirect.
  4. Enter http://localhost:3000 in the URL input field.
  5. Click Save in the Add redirect dialog.
  6. Click Save again in the Redirect (Callback) Settings pane on the right.

Save the API token

Once the service is enabled, you’ll be taken to its Overview page. Capture the AuthN Configuration Details:

  • Default Token - A token provided by default for each service.
  • Hosted Login - The login URL that your application will redirect users to for sign-in.

You can copy these values by clicking on the respective property tiles.

Figure: Pangea AuthN Service Overview page with the service Configuration Details in the Pangea User Console.

Save the configuration values in your .env, for example:

.env file
# OpenAI
OPENAI_API_KEY="sk-proj-54bgCI...vG0g1M-GWlU99...3Prt1j-V1-4r0MOL...X6GMA"

# Pangea
PANGEA_DOMAIN="aws.us.pangea.cloud"
PANGEA_AUTHZ_TOKEN="pts_kwaun3...jhpqzf"
PANGEA_AUTHN_HOSTED_LOGIN="https://pdn-lqcuqlhizxsjrpbewgdrpi53cc72gdit.login.aws.us.pangea.cloud"
PANGEA_AUTHN_CLIENT_TOKEN="pcl_pgd43k...yoy6kn"
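Before running the application, it can be useful to confirm that every value listed above is available. The sketch below checks the process environment for the variable names used in this tutorial; it assumes the .env values have already been loaded into the environment (for example, by the application or your shell), and the `missing_variables` helper is illustrative.

```python
import os

# Variable names taken from the .env example in this tutorial.
REQUIRED_VARIABLES = (
    "OPENAI_API_KEY",
    "PANGEA_DOMAIN",
    "PANGEA_AUTHZ_TOKEN",
    "PANGEA_AUTHN_HOSTED_LOGIN",
    "PANGEA_AUTHN_CLIENT_TOKEN",
)


def missing_variables(environ=os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name, "").strip()]


missing = missing_variables()
if missing:
    print("Missing configuration values: " + ", ".join(missing))
```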

Run the LangChain application

  1. Run the application in the project folder terminal, specifying the Google Drive folder ID you set up earlier. For example:

    python -m authz_rag_app --google-drive-folder-id 1MPqBu...m3yWO4
  2. On the AuthN login screen, select the Continue with Google option and sign in using an account with (at least) Viewer access to one of the spreadsheets you created in your Google Drive.

    If you don’t have an account, the AuthN service allows users to sign up by default. During sign-up, you’ll need to complete a captcha challenge and select a second authentication factor.

    Figure: Pangea AuthN Sign Up page with the Continue with Google option highlighted.

    After signing in, close the browser tab that displays the "Done, you can close this tab." message and return to the LangChain application.

  3. In the terminal, you will see a prompt for a question.

    Application prompt
    Ask a question about PTO availability:
  4. If the currently authenticated user has access to any of the spreadsheets in the folder, they can receive information about the document content in the application response. For example:

    Ask a question about PTO availability: My PTO?
    Your PTO balance is 200 hours.
  5. (Optional) View the authorization policies created in AuthZ.

    Go to the Assigned Roles & Relations page in your Pangea User Console to review the permission assignments your application created from the Google data. For example:

    Figure: Pangea AuthZ Assigned Roles & Relations page in the Pangea User Console.
  6. (Optional) Add authorization policies.

    AuthZ can abstract access control from the original data sources. To demonstrate this:

    1. Click + Assign and add a permission, such as the one shown in the figure:

      Figure: Pangea AuthZ Assign Role or Relation dialog on the Assigned Roles & Relations page in the Pangea User Console.
    2. Click Save.

    3. Restart the app and sign in using the username carol. During sign-up, provide an email address of your choice.

    4. Ask a question about the information in the referenced Google Drive file, and it should return the appropriate answer. For example:

      Authenticated as carol (pui_24ppbs3hlyiwlv5cmpslazju4gohq6gp).

      Ask a question about PTO availability: What's the PTO balance?
      PTO balance is 200 hours.

    This approach allows the application admin to share context data with users who don't have direct access to it through other systems.
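The effect of an administrator-added policy can be modeled with a small in-memory example. Everything here is hypothetical: the tuple layout, the `can_read` helper, the set of relations treated as read access, and the identifiers are assumptions chosen to mirror the steps above, and the real checks are performed by the AuthZ service.

```python
# Each policy is (resource_type, resource_id, relation, subject_type, subject_id).
Policy = tuple[str, str, str, str, str]

# Relations assumed to grant read access in this sketch.
READ_RELATIONS = {"owner", "editor", "reader"}


def can_read(policies: set[Policy], user_id: str, file_id: str) -> bool:
    """Return True if any policy gives the user a read-capable relation on the file."""
    return any(
        (resource_type, resource_id, subject_type, subject_id)
        == ("file", file_id, "user", user_id)
        and relation in READ_RELATIONS
        for resource_type, resource_id, relation, subject_type, subject_id in policies
    )


# Policy derived from the data source during ingestion.
ingested: set[Policy] = {("file", "file-id-123", "reader", "user", "alice@example.com")}
# Policy added by the administrator for a user with no Google account.
admin_added: set[Policy] = {("file", "file-id-123", "reader", "user", "carol")}

print(can_read(ingested, "carol", "file-id-123"))
print(can_read(ingested | admin_added, "carol", "file-id-123"))
```

The first check fails because only the ingested policies are considered; the second succeeds once the administrator's policy is included, without any change to the source file's sharing settings.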

Conclusion

In this example, data from a Google Drive folder was embedded and saved in a vector store, making it easily accessible for use in an AI application. The share permissions for documents in the folder were converted into authorization policies stored in Pangea's AuthZ service, allowing user permissions to be checked at inference time when adding Google Drive data to the context of user questions.

Similarly, data from other sources can be vectorized, tagged with original document IDs in the vector metadata, and referenced in authorization policies within AuthZ. This approach enables flexible user access, supports sharing the same policies across different applications, and supports RBAC, ReBAC, and ABAC authorization schemas for a wide range of use cases.

For more examples and detailed implementations, explore the Pangea repositories on GitHub.

For an overview of considerations regarding access control implementation in AI apps, check out the Building Authorization in AI Apps blog.
