Since the earliest days of computer science, the concept of garbage in, garbage out (GIGO) has underscored the need for data quality. The idea that output can only be as accurate as the input remains a fundamental tenet of software development, and it has become even more important in the world of generative AI (GenAI), which is playing an increasingly significant role in business operations around the world.
Enterprises are scrambling to harness the power of GenAI in hopes of streamlining operations, enhancing customer engagement, and reducing personnel costs. In the rush to adopt a game-changing technology like GenAI, enterprises may be unaware of security risks such as data poisoning and hallucination, as well as more traditional threats like malware and ransomware aimed at GenAI, all of which can wreak havoc on a business. These and many other threats require serious attention at the CSO and CISO level before a technology like GenAI is adopted. The challenge is ensuring that security moves at the pace of the business. For most businesses today, the focus is on how GenAI can accelerate the business, but at a pace that doesn't circumvent the security and privacy practices already in place for compliance.
GenAI Basics
GenAI uses huge amounts of data to create and train foundation models that power off-the-shelf applications. Common enterprise uses of GenAI services include interactive and personalized customer service systems, content generation for marketing, software development, and individual digital assistants for employees.
These powerful platforms rely on large language models (LLMs) to produce accurate outputs in response to user prompts. The greatest value from LLMs comes from crafting custom prompts for specific outcomes, such as enterprise-specific scenarios, customized software platforms or code, or highly specialized writing.
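For illustration, here is a minimal sketch of what crafting an enterprise-specific prompt might look like. The template, field names, and company details are hypothetical; the assembled prompt would be passed to whichever LLM service the organization uses.

```python
# A minimal sketch of a custom prompt for an enterprise-specific scenario.
# The template and values below are illustrative assumptions, not a prescribed
# format; the finished prompt is what gets sent to the LLM.
SUPPORT_PROMPT = """You are a customer service assistant for {company}.
Answer only from the approved knowledge base excerpt below.
If the answer is not in the excerpt, say you will escalate to a human agent.

Knowledge base excerpt:
{kb_excerpt}

Customer question:
{question}
"""

prompt = SUPPORT_PROMPT.format(
    company="Example Corp",  # hypothetical company
    kb_excerpt="Refunds are processed within 5 business days.",
    question="How long do refunds take?",
)
print(prompt)
```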
GenAI can also create unique models to perform specific and often complex functions for business or development purposes. These models are regularly trained on proprietary datasets, product information, trade secrets, and private or personal data, as well as generally available data. The higher the quality of the data used to train the model, the better the outputs from the GenAI application; this is where quality in, quality out (QIQO) resonates. Because the outcomes can be highly beneficial, enterprises should consider two important security elements of the process: ensuring the integrity and privacy of output data, and not inheriting any risk from public datasets.
Is Stored Data Clean and Safe to Use?
Threat actors have found ways to embed malware into datasets. This malicious code is often designed to remain dormant until it has access to compute resources, opening the door to propagation into secure environments or access to valuable information. Reported discoveries of embedded malware have included code that exfiltrates data and searches for personally identifiable information (PII) and other confidential information that could be used in future ransomware or extortion attempts. Embedded malware has also been used to alter GenAI outputs, threatening the validity of AI-powered insights and analysis. These threats are real and are happening today across platforms that host the massive datasets available to GenAI systems and developers.
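As a concrete, hedged example, a simple "scan before you use it" gate over downloaded dataset files might look like the sketch below. It assumes the open-source ClamAV scanner (the clamscan command-line tool) is installed locally; any comparable scanner could be substituted, and the directory path is a placeholder.

```python
# A minimal sketch of a malware gate for downloaded dataset files.
# Assumes ClamAV's `clamscan` CLI is installed; the dataset path is hypothetical.
import subprocess
from pathlib import Path

def scan_dataset(dataset_dir: str) -> list[str]:
    """Return paths of files flagged as malicious; an empty list means clean."""
    flagged = []
    for path in Path(dataset_dir).rglob("*"):
        if not path.is_file():
            continue
        # clamscan exits 0 when a file is clean and 1 when it is infected.
        result = subprocess.run(
            ["clamscan", "--no-summary", str(path)],
            capture_output=True, text=True,
        )
        if result.returncode == 1:
            flagged.append(str(path))
    return flagged

if __name__ == "__main__":
    infected = scan_dataset("./downloaded_training_data")  # hypothetical path
    if infected:
        raise SystemExit(f"Quarantine before training: {infected}")
    print("Dataset passed malware scan")
```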
To complicate this challenge, almost all cloud service providers (CSPs) are now introducing GenAI services alongside their infrastructure services. That makes perfect sense, since the cloud is exactly where many developers are building new applications, so providers like Amazon Web Services (AWS) and Microsoft Azure embed GenAI services into their platforms. It is also exactly what makes these platforms the perfect target: where else would you be tempted to rapidly adopt a new technology without first setting up the proper security guardrails? This is why the combination of cloud and GenAI is increasingly becoming a target; it is where the opportunities lie.
Any enterprise that does not take precautions to ensure inputs into LLMs and datasets are clean, and that outputs produce the desired outcome, is putting itself at risk. These risks are real and well documented. For further reading, the eBook Securing Gen AI Models: Mitigating Risks and Protecting Your Business discusses GenAI and its data security risks in detail. We believe the proliferation of datasets from GenAI and other business applications is creating another requirement for Zero Trust, this time for data.
Zero Trust, an established practice for network security, is based on the premise that you cannot trust any network connection, even one from inside your perimeter. Security professionals follow zero trust networking principles by using time-bound credentials and hardware tokens and by enforcing private access even when devices are located in the office, providing an additional layer of protection when devices cannot be trusted. GenAI is now forcing an evolution of that methodology to data.
It all begins with the assumption that any stored data is compromised at some level. Therefore, all data must be scanned for malicious code at every stage or interaction. Every enterprise should take the stance of scanning data, images, and objects from all cloud repositories, third-party platforms, and even off-the-shelf LLMs. The reverse is true after extracting value from that data: there must be high confidence that any chatbot response, application output, or data feeding into another application does not contain sensitive data that should not be exposed. Taking a zero-trust position on all data, regardless of whether it is headed to an archive or a live application, is a crucial step toward reducing or even eliminating these threats.
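A minimal sketch of such a zero-trust data gate, using an AWS S3 bucket as the example repository, could look like the following. The bucket name is hypothetical, and the scan_for_malware and scan_for_sensitive_data helpers are placeholders for whatever scanning tooling is actually deployed; the point is simply that nothing is ingested or released until it has been scanned.

```python
# A sketch of a zero-trust gate on both sides of a GenAI pipeline:
# objects are treated as untrusted until scanned on the way in, and
# outputs are scanned again on the way out. Bucket name and scanner
# helpers are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-training-data"  # hypothetical bucket

def scan_for_malware(path: str) -> bool:
    """Placeholder: wire this to your malware scanner (e.g. the clamscan gate above)."""
    return False

def scan_for_sensitive_data(text: str) -> bool:
    """Placeholder: wire this to your DLP scanner (see the PII sketch below)."""
    return False

def ingest_untrusted_objects(prefix: str) -> list[str]:
    """Download objects only after they pass a malware scan."""
    clean_paths = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            local_path = f"/tmp/{obj['Key'].replace('/', '_')}"
            s3.download_file(BUCKET, obj["Key"], local_path)
            if scan_for_malware(local_path):   # assume compromised until proven clean
                continue                        # quarantine; do not ingest
            clean_paths.append(local_path)
    return clean_paths

def release_output(text: str) -> str:
    """Apply the same distrust to outputs before they reach a user or application."""
    if scan_for_sensitive_data(text):
        return "[response withheld: sensitive content detected]"
    return text
```

The same pattern applies whether the data is bound for an archive or a live application: both directions pass through a scan, and the gate fails closed.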
Is Sensitive Data Being Exposed?
In addition to identifying malicious code, you should have high confidence in data content. Data Loss Prevention (DLP) has long been considered just an endpoint solution, but similar functions and tools that scale to the network core and storage systems are now available to help maintain the integrity of confidential information. Loss of control or disclosure of sensitive data can cause regulatory compliance issues and place companies at a competitive disadvantage when customer secrets are revealed. This is the headline every CISO dreads: "our chatbot leaked sensitive data that we didn't verify."
While the hunt for PII and secrets has long been a favorite activity of threat actors, GenAI increases the risk of exposing sensitive information. If proprietary or sensitive information is included in training data, it is highly likely to find its way into derived outputs. Predicting how and where that information could be used or exposed is nearly impossible, and once it has been incorporated into an LLM, it is effectively impossible to root out and eliminate.
DLP scanning of training data is a critical step in maintaining control of sensitive information. Organizations should consider filtering sensitive data out of datasets before training models, and, as a final precaution, outputs from a GenAI system should always be scanned for sensitive data before they are delivered to end users. Details on how this works can be found in this technical article from Cloud Storage Security.
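To make that flow concrete, the sketch below applies an illustrative, regex-based PII pass to both training records and model output. Real DLP engines use far richer detectors; the patterns and sample data shown here are assumptions rather than a production rule set.

```python
# An illustrative DLP pass: redact PII from training records before they are
# used, and scan model output as a final gate before it reaches a user.
# The regex patterns are simplified examples only.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Replace anything matching a PII pattern before it reaches training or a user."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

# Filter training records before they are used to train or fine-tune a model.
training_records = ["Contact jane.doe@example.com about invoice 4417"]
clean_records = [redact(r) for r in training_records]

# Scan model output as a final precaution before delivery to the end user.
model_output = "Sure, the customer's SSN is 123-45-6789."
safe_output = redact(model_output)
print(clean_records, safe_output)
```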
A Secure Safety Net
Enterprises should look carefully at GenAI applications alongside their public cloud services and implement a comprehensive safety net for data inputs and outputs. Ensuring that data is clean before it crosses your cloud infrastructure or enters your GenAI pipeline is essential, and protecting sensitive information through seamless categorization scans of training data and outputs is crucial to preventing inadvertent disclosures. GenAI should be an awesome business accelerator, not another attack surface to worry about. Before using any business data to justify GenAI for enhanced returns, make sure the data on both the input and output sides of the GenAI application is safe by trusting none of it.
About the Author
Cloud Storage Security (CSS) protects data in the cloud and on premises so that businesses can move forward freely and fearlessly. Its robust malware detection and data loss prevention solutions are born from a singular focus on, and dedication to, securing the world's data, everywhere. Serving a diverse clientele spanning commercial, regulated, and public sector organizations worldwide, the company solves security and compliance challenges by identifying and eliminating threats, while reducing risk and human error. CSS's modern, cloud-native solutions are streamlined and flexibly designed to integrate seamlessly into a wide range of use cases and workflows, while complementing and bolstering existing infrastructure and security frameworks. CSS's certifications and designations include SOC 2, AWS Public Sector Partner with an AWS Qualified Software offering, the AWS Security Competency, and AWS Authority to Operate.