It may sound like hyperbole to say that machine learning operations (MLOps) have become the backbone of our digital future, but it’s true. Much as we treat energy grids and transportation systems as critical infrastructure that powers society, AI/ML software and capabilities are quickly becoming essential technology for a wide range of companies, industries, and citizen services.
With artificial intelligence (AI) and machine learning (ML) rapidly transforming industries, we have also seen the rise of a new age of “Shadow IT,” now referred to as “Shadow ML”: employees adopting AI agents and tools without the knowledge or approval of the IT department and outside the company’s sanctioned systems. Because no one has oversight of the data these tools touch or who can access them, Shadow ML can create significant security risk. Understanding the evolving role MLOps plays in managing and securing the rapidly expanding AI/ML landscape is therefore essential to safeguarding the interconnected systems that define our era.
Software As Critical Infrastructure
Software is an omnipresent component of our day-to-day lives, operating quietly but indispensably behind the scenes. Because it works out of sight, failures in these systems are often hard to detect, can happen at any moment, and can spread quickly across the globe, disrupting businesses, destabilizing economies, undermining governments, or even endangering lives.
The stakes grow even higher as AI and ML technologies take center stage. Traditional software operations are giving way to AI-driven systems capable of decision-making, prediction, and automation at unprecedented scale. However, like any technology that unlocks immense new potential, AI and ML also introduce new complexities and risks. As reliance on AI/ML grows, the robustness of MLOps security becomes foundational to fending off evolving cyber threats.
Understanding the Risks of the MLOps Lifecycle
The lifecycle of building and deploying ML models is filled with both complexity and opportunity. At its core, the process includes the following steps (a minimal code sketch follows the list):
- Selecting an appropriate ML algorithm, such as a support vector machine (SVM) or decision tree.
- Feeding a dataset into the algorithm to train the model.
- Producing a pre-trained model that can be queried for predictions.
- Registering the pre-trained model in a model registry.
- Deploying the pre-trained model into production by either embedding it in an app or hosting it on an inference server.
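To make these steps concrete, here is a minimal sketch of the core train–serialize–reload–predict loop using scikit-learn and joblib. The dataset, file name, and model choice are illustrative only; a production pipeline would add a model registry, packaging, and an inference server around this loop.

```python
# Minimal sketch of the core MLOps loop: train, serialize, reload, predict.
# Dataset, file name, and SVM hyperparameters are illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import joblib

# 1. Select an algorithm (here, a support vector machine) and train it on a dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)

# 2. Produce a pre-trained artifact; in a real pipeline this file would be
#    pushed to a model registry rather than kept on local disk.
joblib.dump(model, "svm_model.joblib")

# 3. "Deploy": an application or inference server reloads the artifact
#    and queries it for predictions.
loaded = joblib.load("svm_model.joblib")
print("Test accuracy:", loaded.score(X_test, y_test))
```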
It’s a structured approach, but one with significant vulnerabilities that threaten stability and security. These vulnerabilities fall into two broad categories: inherent and implementation-related.
Inherent Vulnerabilities
The complexity of ML environments, which span cloud services and open-source tools, creates security gaps that attackers may exploit. Inherent vulnerabilities include:
- Malicious ML models: Pre-trained models can be weaponized or intentionally crafted to produce biased or harmful outputs, causing trickle-down damage across dependent systems.
- Malicious datasets: Training data can be poisoned to inject subtle yet dangerous behaviors that undermine a model’s integrity and reliability.
- Jupyter “sandbox escapes”: Many data scientists rely on Jupyter Notebook, often outside sanctioned environments (another example of “Shadow ML”); when not adequately secured, notebooks can serve as a path for malicious code execution and unauthorized access.
Implementation Vulnerabilities
- Authentication shortcomings: Poor access controls expose MLOps platforms to unauthorized users, enabling data theft or model tampering.
- Container escape: Containerized environments with improper configuration allow attackers to break isolation and access the host system and other containers.
- MLOps platform immaturity: The rapid pace of innovation in AI/ML often outpaces the development of secure tooling, creating gaps in resilience and reliability.
While AI and ML can offer enormous benefits for organizations, it’s crucial not to prioritize rapid development over security. Doing so could compromise ML models and put organizations at risk.
The Vulnerabilities Beneath the Surface
Recognizing and addressing these vulnerabilities is crucial to ensuring MLOps platforms remain trustworthy components of our digital infrastructure. In a recent example, a flagged PyTorch model, previously uploaded by a now-deleted account, could allow attackers to inject arbitrary Python code into critical processes upon loading. The method used to load PyTorch models, specifically the torch.load() function, can be a vector for code execution vulnerabilities, especially when models are trained with Hugging Face’s Transformers library.
The “pickle” format, often used for serializing Python objects, poses a particular risk because it can execute arbitrary code when loaded, making it ripe for exploitation. This scenario underscores a broader risk in the ML ecosystem: many widely used ML model formats support code execution on load, a feature intended to make them flexible and efficient but one that also introduces significant security vulnerabilities. An attacker controlling a model registry could insert backdoors into models, enabling unauthorized code execution the moment the models are deployed or loaded.
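To illustrate why pickle-based model formats are risky, here is a minimal, self-contained sketch of the underlying mechanism: pickle lets an object dictate how it is reconstructed, so merely deserializing attacker-controlled bytes can run attacker-chosen code. The payload below is deliberately harmless (it only prints a message), and the mitigation notes in the comments reflect commonly recommended practice rather than a complete defense.

```python
import pickle

class MaliciousModel:
    # pickle calls __reduce__ to learn how to rebuild an object. An attacker
    # can make it return any callable plus arguments (e.g. os.system with a
    # shell command); here we return print() so the demo stays harmless.
    def __reduce__(self):
        return (print, ("arbitrary code executed during model load",))

# The "model file" an attacker would publish to a registry or model hub.
payload = pickle.dumps(MaliciousModel())

# Simply loading the bytes is enough to trigger execution; no predict() call is needed.
pickle.loads(payload)

# Mitigations (not shown in full): prefer data-only formats such as safetensors,
# and in recent PyTorch releases pass weights_only=True to torch.load() so that
# arbitrary objects are rejected during unpickling.
```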
For this reason, developers must exercise caution when loading models from public repositories, validating the source of model files and assessing the risks associated with them. Robust input validation, restricted access, and continuous vulnerability assessments are critical to mitigating risk and ensuring the secure deployment of machine learning solutions.
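One simple form of source validation is pinning a downloaded model artifact to a digest published by its maintainer and refusing to load anything that does not match. The sketch below is a minimal illustration of that idea; the file name and digest are hypothetical placeholders, and a real pipeline would combine this check with signature verification and scanning.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model_artifact(path: Path, expected_sha256: str) -> None:
    """Raise if the artifact on disk does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"Digest mismatch for {path}: got {actual}, expected {expected_sha256}")

# Usage (hypothetical file name; digest would be published by the model's owner):
# verify_model_artifact(Path("classifier-v3.safetensors"), "<published sha256 digest>")
```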
MLOps Hygiene Best Practices
There are many other vulnerabilities across the MLOps pipeline, underscoring the importance of vigilance among teams. Many separate elements within a model pipeline can serve as potential attack vectors, and each must be managed and secured. Implementing standard APIs for artifact access and integrating security tools seamlessly across the ML platforms used by data scientists, machine learning engineers, and core development teams is therefore essential. Key security considerations for MLOps development should include:
- Dependencies and packages: Teams often rely on open-source frameworks and libraries such as TensorFlow and PyTorch. Providing access to these dependencies from trusted, curated sources rather than directly from the internet, and scanning them for vulnerabilities to block malicious packages, helps secure each component the model depends on.
- Source code: Models are typically developed in languages such as Python, C++, or R. Employing static application security testing (SAST) to scan source code can identify and remediate flaws that could compromise model security.
- Container images: Containers are used to deploy models for training and facilitate their use by other developers or applications. Performing comprehensive scans of container images before deployment helps prevent introducing risks into the operational environment.
- Artifact signing: Signing all new service components early in the MLOps lifecycle and treating them as immutable units throughout different stages ensures that the application remains unchanged as it advances toward release; a minimal signing sketch follows this list.
- Promotion/release blocking: Automatically rescanning the application or service at each stage of the MLOps pipeline allows for early detection of issues, which in turn helps with swift resolution and maintaining the integrity of the deployment process.
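As a small illustration of the artifact-signing step above, the sketch below signs an artifact’s bytes with an Ed25519 key and re-verifies the signature before promotion. It assumes the `cryptography` package is installed and generates a key pair in-process purely for brevity; a real pipeline would use keys held in an HSM, a KMS, or a dedicated signing service, and would sign the actual built artifact.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Key generated in-process only for illustration; production signing keys should
# live in an HSM/KMS or a dedicated signing service.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# Stand-in for the bytes of a built artifact (e.g. a model bundle or container layer).
artifact_bytes = b"contents of the built model bundle"

# Sign right after the build stage produces the artifact.
signature = private_key.sign(artifact_bytes)

# Later pipeline stages re-verify before promotion; any modification breaks the signature.
try:
    public_key.verify(signature, artifact_bytes)
    print("Signature valid: artifact unchanged since signing.")
except InvalidSignature:
    raise SystemExit("Artifact was modified after signing; blocking promotion.")
```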
By adhering to these best practices, organizations can effectively safeguard MLOps pipelines and ensure that security measures enhance rather than impede the development and deployment of ML models. As we move further into an AI-driven future, the resilience of MLOps infrastructure will be increasingly critical to maintaining the trust, reliability, and security of the digital systems that power the world.
About the Author
Eyal Dyment is Vice President of Security Products at JFrog. Eyal can be reached at our company website: https://jfrog.com/