I’ve spent many weeks on the road this past quarter speaking with teams across the country about their AI product initiatives, and I’ve been struck by one challenge I’ve heard expressed repeatedly: when building retrieval-augmented generation (RAG) applications that merge LLMs with internal data sources like Jira, the permissions and access controls from those original sources do not persist into the vector databases.
Consequently, most teams choose to reduce the scope of the data they merge to avoid potential security breaches, and in doing so they constrain their innovation potential by limiting the contextual capabilities of this incredible technology. They could be doing more and want to do more, but they’re blocked by this authorization problem!
This blog post will review how permissions operate in RAG architectures, explore the LLM security risks created by improper access controls, and show how companies can overcome these limitations with RAG-ready authorization and innovate securely with AI software products.
Permissions: Lost in Mathematical Translation
An LLM can tell you all about the Social Security Administration, but it cannot tell you your Social Security number, since these models are trained largely, if not wholly, on public data sources. This knowledge limitation is precisely why companies want to merge their own proprietary datasets, like Confluence and Google Drive, with LLMs: to provide the missing context that allows an AI support chatbot, for instance, to know a customer’s service and purchase history and deliver relevant responses when asked about an issue with a specific transaction.
To make unstructured internal datasets (e.g. text and images) machine-readable for fast comparison and recall, they must be transformed into a numerical representation called a vector. While vectorization enables highly efficient retrieval and response generation, the identity and access controls attached to the original data source do not persist, because vectorized data lacks metadata such as user roles, relationships, and permissions (the sketch after the examples below shows what gets lost). For example:
Google Drive Data: User-, group-, or domain-level permissions can restrict individuals from accessing folders, sub-folders, and even specific files, and can further constrain access by role (e.g. comment-only or read-only). Once vectorized, this nuanced access control structure is lost.
Salesforce Data: Sensitive customer data in Salesforce may be segmented by role hierarchies, ensuring only authorized personnel can access specific account records. Vectorization strips these hierarchies away.
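To make the gap concrete, here is a minimal sketch of what a naive ingestion pipeline typically writes to a vector store. The embed() function, field names, and record layout are illustrative assumptions, not any particular connector’s or database’s schema:

```python
# A naive ingestion step: the document's text and vector survive, but the
# source system's permissions are never carried over. embed() and the field
# names below are hypothetical placeholders.

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model call.
    return [0.12, -0.08, 0.33]  # truncated for illustration

drive_file = {
    "name": "Q3-board-deck.pdf",
    "content": "Confidential revenue forecast ...",
    # Access controls as the source system (e.g. Google Drive) expresses them.
    "permissions": {
        "finance-team@example.com": "reader",
        "cfo@example.com": "writer",
    },
}

# What lands in the vector store: no roles, no relationships, no permissions.
vector_record = {
    "text": drive_file["content"],
    "embedding": embed(drive_file["content"]),
}
```

Any retrieval query that matches this record will return the confidential text, regardless of who is asking.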
This creates a fundamental disconnect between the authorization security model of the original data sources and the business goals of building great AI software that serves the needs of customers and employees.
The Security Perils of AI Products Without Strong Authorization
Failing to address this disconnect while shipping products that merge enterprise datasets with LLMs invites significant security risk. Imagine an AI sales assistant that answers complex queries by synthesizing information across Slack, Salesforce, and internal sales documents. Without persistent access controls, the assistant can and will surface details of a confidential Salesforce deal to an unauthorized user asking from Slack, undermining both security and compliance and potentially putting the deal at risk if the information is then publicized.
The OWASP Top Ten for Large Language Models (LLMs), an open source project that Pangea sponsors, documents security risks and mitigation strategies for AI applications. The AI sales assistant scenario would fall into their “Sensitive Information Disclosure” category:
- LLM02:2025 Sensitive Information Disclosure: LLMs, especially when embedded in applications, risk exposing sensitive data, proprietary algorithms, or confidential details through their output. This can result in unauthorized data access, privacy violations, and intellectual property breaches.
Even OpenAI, a pioneer in the AI industry, is not immune to such risks and has had several publicly documented instances of sensitive information disclosure in its own chatbot, ChatGPT:
- March 2023: A breach exposed the personal information of approximately 1% of their premium chatbot users to other users of the platform, including their conversation histories, email addresses, payment addresses, and the last four digits of their credit card numbers.
- February 2024: A breach exposed user names, passwords, and conversation histories to unrelated users of the product.
The importance of strong access controls in AI applications goes well beyond the RAG vectorization problem and the risk of sensitive information disclosure. Several other OWASP LLM security risks can arise from weak access controls, such as:
- Excessive Agency: In agentic systems where the AI has extended capabilities such as read, write, and delete, failing to enforce access controls could allow any user to prompt the assistant to "Delete all customer records older than 2 years" and have the agent execute that command, wiping data the user should never have been able to touch (a sketch of a simple authorization gate follows this list).
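Here is a minimal sketch of that kind of gate in Python. The tool names, the admin allowlist, and the is_authorized() check are hypothetical stand-ins for a real policy engine or authorization service, not any specific agent framework’s API:

```python
# Authorization gate in front of agent tool execution.

DESTRUCTIVE_TOOLS = {"delete_records", "bulk_update_records"}

def is_authorized(user: str, tool_name: str) -> bool:
    # In production this would consult your authorization service,
    # not a hard-coded set of admin users.
    admins = {"dba@example.com"}
    return tool_name not in DESTRUCTIVE_TOOLS or user in admins

def run_tool(user: str, tool_name: str, tool_fn, **kwargs):
    # The LLM may *plan* a destructive action, but it only executes if the
    # human behind the prompt is allowed to perform that action directly.
    if not is_authorized(user, tool_name):
        raise PermissionError(f"{user} is not permitted to call {tool_name}")
    return tool_fn(**kwargs)
```

The key design choice is that the check keys off the end user’s identity rather than the agent’s service account, which typically has far broader access than any individual user.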
Buying or Building Persistence in RAG Apps
To enforce permissions on vectorized data in RAG architectures, engineering teams must build custom authorization middleware into their AI software products that (see the sketch after this list):
- Extracts and maps permissions from the original data source
- Stores permission metadata alongside the embeddings in the vector database
- Ensures that retrieval queries are filtered based on the user's identity and access rights
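Put together, a minimal sketch of what that middleware does at query time might look like the following. The Record layout, the allowed_principals field, and the in-memory index are illustrative assumptions rather than any specific vector database’s API:

```python
# Permission-aware retrieval: filter by the caller's identity first, then rank
# by similarity, so unauthorized chunks never reach the LLM prompt.
from dataclasses import dataclass, field

@dataclass
class Record:
    text: str
    vector: list[float]
    # Permission metadata extracted from the source system and stored
    # alongside the embedding (e.g. groups allowed to read the source file).
    allowed_principals: set[str] = field(default_factory=set)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def retrieve(index: list[Record], query_vector: list[float],
             user_principals: set[str], k: int = 3) -> list[Record]:
    # 1. Enforce access rights: keep only chunks the caller may see.
    visible = [r for r in index if r.allowed_principals & user_principals]
    # 2. Rank the visible chunks by similarity to the query.
    visible.sort(key=lambda r: cosine(r.vector, query_vector), reverse=True)
    return visible[:k]
```

Filtering before ranking keeps unauthorized content out of the candidate set entirely, so a downstream bug can’t accidentally pass it into the prompt.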
Building this in-house is technically complex and resource-intensive, often requiring deep integration across disparate systems. In my conversations with customers, I’ve found that only the most well-resourced teams at the world’s largest companies feel prepared to tackle this challenge and commit the engineering resources required to maintain it. Most teams instead reluctantly restrict the scope of the data they vectorize, and with it the capabilities of their AI products.
But they don’t have to make this tradeoff!
If you’re facing this problem with your own RAG applications, come talk to Pangea: we’ve built a RAG-ready authentication and authorization capability that lets access controls persist at scale in your applications and protects your AI products from risks like sensitive information disclosure.
The lack of persistent permissions in AI applications built on RAG architecture is not just a technical problem or security problem—it’s a business problem. Companies are hampering their innovation potential to avoid security risks, a trade-off that could cost them business and market leadership. By adopting robust access control strategies and investing in solutions that preserve permissions, organizations can unlock the full potential of AI while safeguarding sensitive data. The future of AI-driven innovation depends on getting this right.