Companies that use private instances of large language models (LLMs) to make their business data searchable through a conversational interface risk data poisoning and potential data leakage if they fail to harden the platforms with proper security controls, experts say.

Case in point: This week, Synopsys disclosed a cross-site request forgery (CSRF) flaw that affects applications based on the EmbedAI component created by AI provider SamurAI; it could allow attackers to fool users into uploading poisoned data into their language model, the application-security firm warned. The attack exploits the open source component’s lack of a safe cross-origin policy and failure to implement session management, and could allow an attacker to affect even a private LLM instance or chatbot, says Mohammed Alshehri, the Synopsys security researcher who found the vulnerability.

The risks are similar to those facing developers of software applications, but with an AI twist, he says.

“There’re products where they take an existing AI implementation [and open source components] and merge them together to create something new,” he says. “What we want to highlight here is that even after the integration, companies should test to ensure that the same controls we have for Web applications are also implemented on the APIs for their AI applications.”
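As a rough illustration of the kind of Web-application controls Alshehri describes, the sketch below shows a hypothetical document-upload endpoint for a private LLM app, written with Flask, that pins a cross-origin allowlist and requires a session-bound CSRF token. The route names, header name, and origin are assumptions for illustration only and do not reflect EmbedAI's actual API.

```python
# Minimal sketch (not EmbedAI's real API): a document-upload endpoint for a
# private LLM app, hardened against the two gaps named above: no safe
# cross-origin policy and no session management.
import secrets
from flask import Flask, request, session, abort, jsonify

app = Flask(__name__)
app.secret_key = secrets.token_hex(32)  # in production, load from configuration
app.config.update(SESSION_COOKIE_SAMESITE="Strict", SESSION_COOKIE_SECURE=True)

ALLOWED_ORIGIN = "https://llm.internal.example.com"  # assumed internal front end

@app.after_request
def restrict_cross_origin(resp):
    # Only answer to the known front end; never echo back arbitrary Origins,
    # so scripts on other sites cannot read API responses.
    resp.headers["Access-Control-Allow-Origin"] = ALLOWED_ORIGIN
    resp.headers["Vary"] = "Origin"
    return resp

@app.get("/csrf-token")
def issue_csrf_token():
    # Tie an unguessable token to the user's session.
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return jsonify({"csrf_token": token})

@app.post("/upload-document")
def upload_document():
    # Reject requests that lack the session-bound token; a forged cross-site
    # form submission cannot read or supply it.
    sent = request.headers.get("X-CSRF-Token", "")
    if not sent or not secrets.compare_digest(sent, session.get("csrf_token", "")):
        abort(403)
    doc = request.files.get("file")
    if doc is None:
        abort(400)
    # ... hand the document to the ingestion pipeline here ...
    return jsonify({"status": "accepted"})
```

In this setup, a poisoned upload triggered from an attacker-controlled page fails the token check, because the browser will not expose the session-bound token to another origin.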

The research underscores that the rush to integrate AI into business processes does pose risks, especially for companies that are giving LLMs and other generative-AI applications access to large repositories of data. Overall, only 4% of US companies have adopted AI as part of their business operations, but some industries have higher adoption rates, with the information sector at 14% and the professional services sector at 9%, according to a survey by the US Census Bureau conducted in October 2023.

Chart: AI Adoption by Industry in 2023 (US Census Bureau)

The risks posed by the adoption of next-gen artificial intelligence and machine learning (AI/ML) are not necessarily due to the models, which tend to have smaller attack surfaces, but the software components and tools for developing AI applications and interfaces, says Dan McInerney, lead AI threat researcher with Protect AI, an AI application security firm.

“There’s not a lot of magical incantations that you can send to an LLM and have it spit out passwords and sensitive info,” he says. “But there’s a lot of vulnerabilities in the servers that are used to host LLMs. The [LLM] is really not where you’re going to get hacked — you’re going to get hacked from all the tools you use around the LLM.”

Practical Attacks Against AI Components

Such vulnerabilities are already being actively exploited. In March, Oligo Security reported active attacks against Ray, a popular AI framework, exploiting a previously disclosed security issue, one of five vulnerabilities found by research groups at Protect AI and Bishop Fox, along with independent researcher Sierra Haex. Anyscale, the company behind Ray, fixed four of the vulnerabilities but considered the fifth to be a misconfiguration issue.

Yet attackers managed to find hundreds of deployments that had inadvisedly exposed a Ray server to the Internet and compromised those systems, according to the Oligo Security analysis.

“This flaw has been under active exploitation for the last seven months, affecting sectors like education, cryptocurrency, biopharma and more,” the company stated. “All organizations using Ray are advised to review their environments to ensure they are not exposed and to analyze any suspicious activity.”

In its own March advisory, Anyscale acknowledged the attacks and released a tool to detect insecurely configured systems.

Private Does Not Mean Safe

While the vulnerability in the Ray framework exposed public-facing servers to attack, even private LLMs and AI-powered chatbots could be exploited. In May, AI-security firm Protect AI released the latest tranche of vulnerabilities discovered by its bug-bounty community, Huntr, encompassing 32 issues ranging from critical remote exploits to low-severity race conditions. Some attacks may require access to the API, but others could be carried out through malicious documents and other vectors.

In his own research, Synopsys's Alshehri discovered the CSRF issue, which gives an attacker the ability to poison an LLM through a watering-hole attack.

“Exploitation of this vulnerability could affect the immediate functioning of the model and can have long-lasting effects on its credibility and the security of the systems that rely on it,” Synopsys stated in its advisory. “This can manifest in various ways, including the spread of misinformation, introduction of biases, degradation of performance, and potential for denial-of-service attacks.”

By using a private instance of a chatbot service or internally hosting an LLM, many companies believe they have minimized the risk of exploitation, says Tyler Young, CISO at BigID, a data management firm.

“Most enterprises are leaning toward leveraging private LLM chatbots on top of those LLM algorithms, simply because it offers that comfort, just like hosting something in your own cloud, where you have control over who can access the data,” he says. “But there are risks … because the second you have an inherent trust, you start pumping more and more data in there, and you have overexposure. All it takes is one of those accounts to get compromised.”

New Software, Same Old Vulnerabilities

Companies need to assume that the current crop of AI systems and services has had only limited security design and review, because the platforms are often based on open source components maintained by small teams with limited oversight, says Synopsys's Alshehri. In fact, in February, the open source AI model repository Hugging Face was found to be riddled with malicious code-execution models.

“The same way we do regular testing and those code reviews with black-box and white-box testing, we need to do that … when it comes to adopting these new technologies,” he says.

Companies that are implementing AI systems based on internal data should segment the data, and the resulting LLM instances, so that employees can access only those LLM services built on data they are already authorized to see. Each collection of users with a specific privilege level will require a separate LLM trained on their accessible data, as sketched below.

“You cannot just give the LLM access to a giant dump of data and say, ‘OK, everyone has access to this,’ because that’s the equivalent of giving everyone access to a database with all the data inside of it, right?” says Protect AI’s McInerney. “So you got to clean the data.”
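Here is a minimal sketch of that segmentation approach in Python, with hypothetical group names and a placeholder retriever standing in for a real vector store or fine-tuned model; none of the identifiers below come from a specific product.

```python
# Illustrative sketch of per-privilege-level segmentation: each user group
# gets its own retrieval corpus (and, if needed, its own model instance),
# and a query is only ever routed to the corpus matching the caller's group.
from dataclasses import dataclass, field

@dataclass
class GroupIndex:
    """Stands in for a vector store / LLM instance built only on one group's data."""
    name: str
    documents: list[str] = field(default_factory=list)

    def query(self, question: str) -> str:
        # Placeholder retrieval: a real deployment would embed and search here.
        hits = [d for d in self.documents if question.lower() in d.lower()]
        return hits[0] if hits else "No answer in this group's corpus."

# Separate corpora per privilege level, never one shared data dump.
INDEXES = {
    "finance": GroupIndex("finance", ["Q3 revenue summary ..."]),
    "engineering": GroupIndex("engineering", ["Internal API design notes ..."]),
}

def ask(user_group: str, question: str) -> str:
    index = INDEXES.get(user_group)
    if index is None:
        raise PermissionError(f"No LLM service provisioned for group {user_group!r}")
    return index.query(question)

print(ask("finance", "revenue"))  # routed only to the finance corpus
```

The point of the design is that access control happens before retrieval: a user in one group never queries an index that contains another group's documents, so a compromised account exposes only the data that account could already reach.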

Finally, companies need to minimize the components they are using to develop their AI tools and then regularly update those software assets and implement controls to make exploitation more difficult, he says.
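As one small example of keeping that component inventory in view, the Python sketch below compares installed packages against a pinned allowlist. The package names and versions are assumptions for illustration, and a real pipeline would likely pair a check like this with a dedicated dependency scanner.

```python
# Sketch: flag installed packages that are not on a pinned, reviewed allowlist,
# so stray AI-tooling dependencies surface before they reach production.
# The allowlist entries below are assumptions for illustration.
from importlib.metadata import distributions

APPROVED = {
    "flask": "3.0.3",
    "langchain": "0.2.5",
    "chromadb": "0.5.0",
}

for dist in distributions():
    name = (dist.metadata["Name"] or "").lower()
    pinned = APPROVED.get(name)
    if pinned is None:
        print(f"UNREVIEWED: {name} {dist.version}")
    elif dist.version != pinned:
        print(f"VERSION DRIFT: {name} {dist.version} (expected {pinned})")
```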

Source: www.darkreading.com