15 Security & Privacy
Purpose
What principles guide the protection of machine learning systems, and how do security and privacy requirements shape system architecture?
Protection mechanisms are a fundamental dimension of modern AI system design. Security considerations expose critical patterns for safeguarding data, models, and infrastructure while sustaining operational effectiveness. Implementing defensive strategies reveals inherent trade-offs between protection, performance, and usability, trade-offs that influence architectural decisions throughout the AI lifecycle. Understanding these dynamics is essential for creating trustworthy systems, grounding the principles needed to preserve privacy and defend against adversarial threats while maintaining functionality in production environments.
Identify key security and privacy risks in machine learning systems.
Understand how to design models with security and privacy in mind.
Describe methods for securing model deployment and access.
Explain strategies for monitoring and defending systems at runtime.
Recognize the role of hardware in building trusted ML infrastructure.
Apply a layered approach to defending machine learning systems.
15.1 Overview
Machine learning systems, like all computational systems, must be designed not only for performance and accuracy but also for security and privacy. These concerns shape the architecture and operation of ML systems across their lifecycle, from data collection and model training to deployment and user interaction. While traditional system security focuses on software vulnerabilities, network protocols, and hardware defenses, machine learning systems introduce additional and unique attack surfaces. These include threats to the data that fuels learning, the models that encode behavior, and the infrastructure that serves predictions.
Security and privacy mechanisms in ML systems serve roles analogous to trust and access control layers in classical computing. Just as operating systems enforce user permissions and protect resource boundaries, ML systems must implement controls that safeguard sensitive data, defend proprietary models, and mitigate adversarial manipulation. These mechanisms span software, hardware, and organizational layers, forming a critical foundation for system reliability and trustworthiness.
Although closely related, security and privacy address distinct aspects of protection. Security focuses on ensuring system integrity and availability in the presence of adversaries. Privacy, by contrast, emphasizes the control and protection of sensitive information, even in the absence of active attacks. These concepts often interact, but they are not interchangeable. To effectively design and evaluate defenses for ML systems, it is essential to understand how these goals differ, how they reinforce one another, and what distinct mechanisms they entail.
Security and privacy often function as complementary forces. Security prevents unauthorized access and protects system behavior, while privacy measures limit the exposure of sensitive information. Their synergy is essential: strong security supports privacy by preventing data breaches, while privacy-preserving techniques reduce the attack surface available to adversaries. However, achieving robust protection on both fronts often introduces trade-offs. Defensive mechanisms may incur computational overhead, increase system complexity, or impact usability. Designers must carefully balance these costs against protection goals, guided by an understanding of threats, system constraints, and risk tolerance.
The landscape of security and privacy challenges in ML systems continues to evolve. High-profile incidents such as model extraction attacks, data leakage from generative models, and hardware-level vulnerabilities have underscored the need for comprehensive and adaptive defenses. These solutions must address not only technical threats, but also regulatory, ethical, and operational requirements across cloud, edge, and embedded deployments.
As this chapter progresses, we will examine the threats facing machine learning systems, the defensive strategies available, and the trade-offs involved in deploying them in practice. A clear understanding of these principles is essential for building trustworthy systems that operate reliably in adversarial and privacy-sensitive environments.
15.2 Definitions and Distinctions
Security and privacy are core concerns in machine learning system design, but they are often misunderstood or conflated. While both aim to protect systems and data, they do so in different ways, address different threat models, and require distinct technical responses. For ML systems, clearly distinguishing between the two helps guide the design of robust and responsible infrastructure.
15.2.1 Security Defined
Security in machine learning focuses on defending systems from adversarial behavior. This includes protecting model parameters, training pipelines, deployment infrastructure, and data access pathways from manipulation or misuse.
Security in machine learning systems is the protection of data, models, and infrastructure from unauthorized access, manipulation, or disruption. It spans the design and implementation of defensive mechanisms that protect against data poisoning, model theft, adversarial manipulation, and system-level vulnerabilities. Security mechanisms ensure the integrity, confidentiality, and availability of machine learning services across development, deployment, and operational environments.
Example: A facial recognition system deployed in public transit infrastructure may be targeted with adversarial inputs that cause it to misidentify individuals or fail entirely. This is a runtime security vulnerability that threatens both accuracy and system availability.
15.2.2 Privacy Defined
Privacy focuses on limiting the exposure and misuse of sensitive information within ML systems. This includes protecting training data, inference inputs, and model outputs from leaking personal or proprietary information, even when systems operate correctly and no explicit attack is taking place.
Privacy in machine learning systems is the protection of sensitive information from unauthorized disclosure, inference, or misuse. It spans the design and implementation of methods that reduce the risk of exposing personal, proprietary, or regulated data while enabling machine learning systems to operate effectively. Privacy mechanisms help preserve confidentiality and control over data usage across development, deployment, and operational environments.
Example: A language model trained on medical transcripts may inadvertently memorize snippets of patient conversations. If a user later triggers this content through a public-facing chatbot, it represents a privacy failure, even in the absence of an attacker.
15.2.3 Security versus Privacy
Although they intersect in some areas (e.g., encrypted storage supports both), security and privacy differ in their objectives, threat models, and typical mitigation strategies. Table 15.1 below summarizes these distinctions in the context of machine learning systems.
Aspect | Security | Privacy |
---|---|---|
Primary Goal | Prevent unauthorized access or disruption | Limit exposure of sensitive information |
Threat Model | Adversarial actors (external or internal) | Honest-but-curious observers or passive leaks |
Typical Concerns | Model theft, poisoning, evasion attacks | Data leakage, re-identification, memorization |
Example Attack | Adversarial inputs cause misclassification | Model inversion reveals training data |
Representative Defenses | Access control, adversarial training | Differential privacy, federated learning |
Relevance to Regulation | Emphasized in cybersecurity standards | Central to data protection laws (e.g., GDPR) |
15.2.4 Interactions and Trade-offs
Security and privacy are deeply interrelated but not interchangeable. A secure system helps maintain privacy by restricting unauthorized access to models and data. At the same time, privacy-preserving designs can improve security by reducing the attack surface; for example, minimizing the retention of sensitive data reduces the risk of exposure if a system is compromised.
However, they can also be in tension. Techniques like differential privacy reduce memorization risks but may lower model utility. Encryption enhances security but may obscure transparency and auditability, complicating privacy compliance.
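The privacy-utility tension can be made concrete with a small numerical sketch. The example below applies the Laplace mechanism from differential privacy to a simple count query (a statistic with sensitivity 1): as the privacy budget \(\epsilon\) shrinks, the noise added to the released value grows, which is exactly why stronger privacy guarantees can reduce utility. The specific numbers are illustrative only.

```python
# A small illustration of the privacy-utility trade-off: the Laplace mechanism
# adds noise with scale sensitivity/epsilon, so stronger privacy (smaller
# epsilon) means a noisier, less useful released statistic.
import numpy as np

rng = np.random.default_rng(0)
true_count = 1000      # e.g., number of records matching some condition (illustrative)
sensitivity = 1.0      # adding or removing one record changes the count by at most 1

for epsilon in (10.0, 1.0, 0.1):
    noisy_count = true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    print(f"epsilon={epsilon:>4}: released count = {noisy_count:.1f}")
```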
In machine learning systems, designers must reason about these trade-offs holistically. Systems that serve sensitive domains, including healthcare, finance, and public safety, must simultaneously protect against both misuse (security) and overexposure (privacy). Understanding the boundaries between these concerns is key to building systems that are not only performant, but trustworthy and legally compliant.
15.3 Historical Incidents
While the security of machine learning systems introduces new technical challenges, valuable lessons can be drawn from well-known security breaches across a range of computing systems. These incidents demonstrate how weaknesses in system design, whether in industrial control systems, connected vehicles, or consumer devices, can lead to widespread, and sometimes physical, consequences. Although the examples discussed in this section do not all involve machine learning directly, they provide critical insights into the importance of designing secure systems. These lessons apply broadly to machine learning applications deployed across cloud, edge, and embedded environments.
15.3.1 Stuxnet
In 2010, security researchers discovered a highly sophisticated computer worm later named Stuxnet, which targeted industrial control systems used in Iran's Natanz nuclear facility (Farwell and Rohozinski 2011). Stuxnet exploited four previously unknown "zero-day" vulnerabilities in Microsoft Windows, allowing it to spread undetected through both networked and isolated systems.
Unlike typical malware designed to steal information or perform espionage, Stuxnet was engineered to cause physical damage. Its objective was to disrupt uranium enrichment by sabotaging the centrifuges used in the process. Despite the facility being air-gapped from external networks, the malware is believed to have entered the system via an infected USB device, demonstrating how physical access can compromise even isolated environments.
Stuxnet represents a landmark in cybersecurity, revealing how malicious software can bridge the digital and physical worlds to manipulate industrial infrastructure. It specifically targeted programmable logic controllers (PLCs) responsible for automating electromechanical processes, such as controlling the speed of centrifuges. By exploiting vulnerabilities in the Windows operating system and the Siemens Step7 software used to program the PLCs, Stuxnet achieved highly targeted, real-world disruption.
While Stuxnet did not target machine learning systems directly, its relevance extends to any system where software interacts with physical processes. Machine learning is increasingly integrated into industrial control, robotics, and cyber-physical systems, making these lessons applicable to the security of modern ML deployments. Figure 15.1 illustrates the operation of Stuxnet in greater detail.
15.3.2 Jeep Cherokee Hack
In 2015, security researchers publicly demonstrated a remote cyberattack on a Jeep Cherokee that exposed critical vulnerabilities in automotive system design (Miller and Valasek 2015; Miller 2019). Conducted as a controlled experiment, the researchers exploited a vulnerability in the vehicle's Uconnect entertainment system, which was connected to the internet via a cellular network. By gaining remote access to this system, they were able to send commands that affected the vehicle's engine, transmission, and braking systems, all without physical access to the car.
This demonstration served as a wake-up call for the automotive industry. It highlighted the risks posed by the growing connectivity of modern vehicles. Traditionally isolated automotive control systems, such as those managing steering and braking, were shown to be vulnerable when exposed through externally accessible software interfaces. The ability to remotely manipulate safety-critical functions raised serious concerns about passenger safety, regulatory oversight, and industry best practices.
The incident also led to a recall of over 1.4 million vehicles to patch the vulnerability, highlighting the need for manufacturers to prioritize cybersecurity in their designs. The National Highway Traffic Safety Administration (NHTSA) issued guidelines for automakers to improve vehicle cybersecurity, including recommendations for secure software development practices and incident response protocols.
The automotive industry has since made significant strides in addressing these vulnerabilities, but the incident serves as a cautionary tale for all sectors that rely on connected systems. As machine learning becomes more prevalent in safety-critical applications, the lessons learned from the Jeep Cherokee hack will be essential for ensuring the security and reliability of future ML deployments.
Although this incident did not involve machine learning, the architectural patterns it exposed are highly relevant to ML system security. Modern vehicles increasingly rely on machine learning for driver-assistance, navigation, and in-cabin intelligence, which include features that operate in conjunction with connected software services. This integration expands the potential attack surface if systems are not properly isolated or secured. The Jeep Cherokee hack highlights the need for defense-in-depth strategies, secure software updates, authenticated communications, and rigorous security testing, principles that apply broadly to machine learning systems deployed across automotive, industrial, and consumer environments.
As machine learning continues to be integrated into connected and safety-critical applications, the lessons from the Jeep Cherokee hack remain highly relevant. They emphasize that securing externally connected software is not just a best practice but a necessity for protecting the integrity and safety of machine learning-enabled systems.
15.3.3 Mirai Botnet
In 2016, the Mirai botnet emerged as one of the most disruptive distributed denial-of-service (DDoS) attacks in internet history (Antonakakis et al. 2017). The botnet infected thousands of networked devices, including digital cameras, DVRs, and other consumer electronics. These devices, often deployed with factory-default usernames and passwords, were easily compromised by the Mirai malware and enlisted into a large-scale attack network.
The Mirai botnet was used to overwhelm major internet infrastructure providers, disrupting access to popular online services across the United States and beyond. The scale of the attack demonstrated how vulnerable consumer and industrial devices can become a platform for widespread disruption when security is not prioritized in their design and deployment.
While the devices exploited by Mirai did not include machine learning components, the architectural patterns exposed by this incident are increasingly relevant as machine learning expands into edge computing and Internet of Things (IoT) devices. Many ML-enabled products, such as smart cameras, voice assistants, and edge analytics platforms, share similar deployment characteristics, operating on networked devices with limited hardware resources, often managed at scale.
The Mirai botnet highlights the critical importance of basic security hygiene, including secure credential management, authenticated software updates, and network access control. Without these protections, even powerful machine learning models can become part of larger attack infrastructures if deployed on insecure hardware.
As machine learning continues to move beyond centralized data centers into distributed and networked environments, the lessons from the Mirai botnet remain highly relevant. They emphasize the need for secure device provisioning, ongoing vulnerability management, and industry-wide coordination to prevent large-scale exploitation of ML-enabled systems.
15.4 Secure Design Priorities
The historical breaches described earlier reveal how weaknesses in system design, whether in hardware, software, or network infrastructure, can lead to widespread and often physical consequences. While these incidents did not directly target machine learning systems, they offer valuable insights into architectural and operational patterns that increasingly characterize modern ML deployments. These lessons point to three overarching areas of concern: device-level security, system-level isolation and control, and protection against large-scale network exploitation.
15.4.1 Device-Level Security
The Mirai botnet exemplifies how large-scale exploitation of poorly secured devices can lead to significant disruption. This attack succeeded by exploiting common weaknesses such as default usernames and passwords, unsecured firmware update mechanisms, and unencrypted communications. While often associated with consumer-grade IoT products, these vulnerabilities are increasingly relevant to machine learning systems, particularly those deployed at the edge (Antonakakis et al. 2017).
Edge ML devices, including smart cameras, industrial controllers, and wearable health monitors, typically rely on lightweight embedded hardware like ARM-based processors running minimal operating systems. These systems are designed for low-power, distributed operation but often lack the comprehensive security features found in larger computing platforms. As these devices take on more responsibility for local data processing and real-time decision-making, they become attractive targets for remote compromise.
A compromised population of such devices can be aggregated into a botnet, similar to Mirai, and leveraged for large-scale attacks. Beyond denial-of-service threats, attackers could use these ML-enabled devices to exfiltrate sensitive data, interfere with model execution, or manipulate system outputs. Without strong device-level protections, which include secure boot processes, authenticated firmware updates, and encrypted communications, edge ML deployments remain vulnerable to being turned into platforms for broader system disruption.
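The sketch below illustrates one of these protections, authenticated updates, in its simplest form: an edge device accepts a new firmware or model image only if its digital signature verifies against a vendor public key provisioned on the device. It assumes the third-party `cryptography` package and generates the key pair inline purely for demonstration; on a real device only the public key would be present, and the check would typically be anchored in a secure boot chain.

```python
# Minimal sketch of an authenticated firmware/model update check on an edge
# device. The vendor signs the update blob; the device verifies the Ed25519
# signature before installing. Keys are generated inline here for illustration
# only; real devices ship with just the vendor's public key.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Vendor side (illustrative): sign the update with the vendor's private key.
vendor_key = Ed25519PrivateKey.generate()
update_blob = b"...new model weights and firmware image..."
signature = vendor_key.sign(update_blob)

# Device side: verify before installing.
device_trusted_pubkey = vendor_key.public_key()   # provisioned at manufacture

def install_update(blob: bytes, sig: bytes) -> bool:
    try:
        device_trusted_pubkey.verify(sig, blob)
    except InvalidSignature:
        return False   # reject tampered or unsigned updates
    # ...write the blob to storage and reboot into the new image...
    return True

print(install_update(update_blob, signature))         # True: signature verifies
print(install_update(update_blob + b"x", signature))  # False: tampered image rejected
```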
15.4.2 System-Level Isolation
The Jeep Cherokee hack highlighted the risks that arise when externally connected software services are insufficiently isolated from safety-critical system functions. By exploiting a vulnerability in the vehicle's Uconnect entertainment system, researchers were able to remotely manipulate core control functions such as steering and braking. This incident demonstrated that network connectivity, if not carefully managed, can expose critical system pathways to external threats.
Machine learning systems increasingly operate in similar contexts, particularly in domains such as automotive safety, healthcare, and industrial automation. Modern vehicles, for example, integrate machine learning models for driver-assistance, autonomous navigation, and sensor fusion. These models run alongside connected software services that provide infotainment, navigation updates, and remote diagnostics. Without strong system-level isolation, attackers can exploit these externally facing services to gain access to safety-critical ML components, expanding the overall attack surface.
The automotive industry's response to the Jeep Cherokee incident, which includes large-scale recalls, over-the-air software patches, and the development of industry-wide cybersecurity standards through organizations such as Auto-ISAC and the National Highway Traffic Safety Administration (NHTSA), provides a valuable example of how industries can address emerging ML security risks.
Similar isolation principles apply to other machine learning deployments, including medical devices that analyze patient data in real time, industrial controllers that optimize manufacturing processes, and infrastructure systems that manage power grids or water supplies. Securing these systems requires architectural compartmentalization of subsystems, authenticated communication channels, and validated update mechanisms. These measures help prevent external actors from escalating access or manipulating ML-driven decision-making in safety-critical environments.
15.4.3 Large-Scale Network Exploitation
The Stuxnet attack demonstrated the ability of targeted cyber operations to cross from digital systems into the physical world, resulting in real-world disruption and damage. By exploiting software vulnerabilities in industrial control systems, the attack caused mechanical failures in uranium enrichment equipment (Farwell and Rohozinski 2011). While Stuxnet did not target machine learning systems directly, it revealed critical risks that apply broadly to cyber-physical systems, particularly those involving supply chain vulnerabilities, undisclosed (zero-day) exploits, and techniques for bypassing network isolation, such as air gaps.
As machine learning increasingly powers decision-making in manufacturing, energy management, robotics, and other operational technologies, similar risks emerge. ML-based controllers that influence physical processes, including adjusting production lines, managing industrial robots, and optimizing power distribution, represent new attack surfaces. Compromising these models or the systems that deploy them can result in physical harm, operational disruption, or strategic manipulation of critical infrastructure.
Stuxnet's sophistication highlights the potential for state-sponsored or well-resourced adversaries to target ML-driven systems as part of larger geopolitical or economic campaigns. As machine learning takes on more influential roles in controlling real-world systems, securing these deployments against both cyber and physical threats becomes essential for ensuring operational resilience and public safety.
15.4.4 Toward Secure Design
Collectively, these incidents illustrate that security must be designed into machine learning systems from the outset. Protecting such systems requires attention to multiple layers of the stack, including model-level protections to defend against attacks such as model theft, adversarial manipulation, and data leakage; data pipeline security to ensure the confidentiality, integrity, and governance of training and inference data across cloud, edge, and embedded environments; system-level isolation and access control to prevent external interfaces from compromising model execution or manipulating safety-critical outputs; secure deployment and update mechanisms to safeguard runtime environments from tampering or exploitation; and continuous monitoring and incident response capabilities to detect and recover from breaches in dynamic, distributed deployments.
These priorities reflect the lessons drawn from past incidents, emphasizing the need to protect device-level resources, isolate critical system functions, and defend against large-scale exploitation. The remainder of this chapter builds on these principles, beginning with a closer examination of threats specific to machine learning models and data. It then expands the discussion to hardware-level vulnerabilities and the unique considerations of embedded ML systems. Finally, it explores defensive strategies, including privacy-preserving techniques, secure hardware mechanisms, and system-level design practices, forming a foundation for building trustworthy machine learning systems capable of withstanding both known and emerging threats.
15.5 Threats to ML Models
Building on the lessons from historical security incidents, we now turn to threats that are specific to machine learning models. These threats span the entire ML lifecycle, ranging from training-time manipulations to inference-time evasion, and fall into three broad categories: threats to model confidentiality (e.g., model theft), threats to training integrity (e.g., data poisoning), and threats to inference robustness (e.g., adversarial examples). Each category targets different vulnerabilities and requires distinct defensive strategies.
Three primary threats stand out in this context: model theft, where adversaries steal proprietary models and the sensitive knowledge they encode; data poisoning, where attackers manipulate training data to corrupt model behavior; and adversarial attacks, where carefully crafted inputs deceive models into making incorrect predictions. Each of these threats exploits different stages of the machine learning lifecycle, from data ingestion and model training to deployment and inference.
We begin with model theft, examining how attackers extract or replicate models to undermine economic value and privacy. As shown in Figure 15.2, model theft typically targets the deployment stage of the machine learning lifecycle, where trained models are exposed through APIs, on-device engines, or serialized files. This threat sits alongside others, including data poisoning during training and adversarial attacks during inference, that together span the full pipeline from data collection to real-time prediction. Understanding the lifecycle positioning of each threat helps clarify their distinct attack surfaces and appropriate defenses.
Machine learning models are not solely passive targets of attack; in some cases, they can themselves be employed as components of an attack strategy. Pretrained models, particularly large generative or discriminative networks, may be adapted to automate tasks such as adversarial example generation, phishing content synthesis, or protocol subversion. Furthermore, open-source or publicly accessible models can be fine-tuned for malicious purposes, including impersonation, surveillance, or reverse-engineering of secure systems. This dual-use potential necessitates a broader security perspective, one that considers models not only as assets to defend but also as possible instruments of attack.
15.5.1 Model Theft
Threats to model confidentiality arise when adversaries gain access to a trained model's parameters, architecture, or output behavior. These attacks can undermine the economic value of machine learning systems, enable competitors to replicate proprietary functionality, or expose private information encoded in model weights.
Such threats arise across a range of deployment settings, including public APIs, cloud-hosted services, on-device inference engines, and shared model repositories. Machine learning models may be vulnerable due to exposed interfaces, insecure serialization formats, or insufficient access controls, factors that create opportunities for unauthorized extraction or replication (Ateniese et al. 2015).
High-profile legal cases have highlighted the strategic and economic value of machine learning models. For example, former Google engineer Anthony Levandowski was accused of stealing proprietary designs from Waymo, including critical components of its autonomous vehicle technology, before founding a competing startup. Such cases illustrate the potential for insider threats to bypass technical protections and gain access to sensitive intellectual property.
The consequences of model theft extend beyond economic loss. Stolen models can be used to extract sensitive information, replicate proprietary algorithms, or enable further attacks. For instance, a competitor who obtains a stolen recommendation model from an e-commerce platform might gain insights into customer behavior, business analytics, and embedded trade secrets. This knowledge can also be used to conduct model inversion attacks, where an attacker attempts to infer private details about the model's training data (Fredrikson, Jha, and Ristenpart 2015).
In a model inversion attack, the adversary queries the model through a legitimate interface, such as a public API, and observes its outputs. By analyzing confidence scores or output probabilities, the attacker can optimize inputs to reconstruct data resembling the model's training set. For example, a facial recognition model used for secure access could be manipulated to reveal statistical properties of the employee photos on which it was trained. Similar vulnerabilities have been demonstrated in studies on the Netflix Prize dataset, where researchers were able to infer individual movie preferences from anonymized data (Narayanan and Shmatikov 2006).
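The following sketch illustrates the basic mechanics of such an inversion attack under white-box assumptions: starting from a blank input, gradient ascent is used to find an image the model scores highly for a chosen class, yielding a class-representative reconstruction. The tiny untrained classifier is a hypothetical stand-in for a deployed model; a real attack would target the victim model itself and typically adds stronger image priors.

```python
# Minimal sketch of a model-inversion attack: optimize an input to maximize the
# model's confidence for one target class, recovering a class-representative image.
import torch
import torch.nn as nn

victim = nn.Sequential(          # hypothetical victim: 32x32 grayscale -> 10 classes
    nn.Flatten(),
    nn.Linear(32 * 32, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
victim.eval()

def invert_class(model, target_class, steps=500, lr=0.1):
    """Reconstruct an input that the model scores highly for `target_class`."""
    x = torch.zeros(1, 1, 32, 32, requires_grad=True)   # start from a blank image
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        # Maximize the target-class log-probability; a small L2 penalty keeps
        # the reconstruction in a plausible range.
        loss = -torch.log_softmax(logits, dim=1)[0, target_class] + 1e-3 * x.pow(2).sum()
        loss.backward()
        opt.step()
        x.data.clamp_(0.0, 1.0)   # keep pixel values in [0, 1]
    return x.detach()

reconstruction = invert_class(victim, target_class=3)
print(reconstruction.shape)   # torch.Size([1, 1, 32, 32])
```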
Model theft can target two distinct objectives: extracting exact model properties, such as architecture and parameters, or replicating approximate model behavior to produce similar outputs without direct access to internal representations. Both forms of theft undermine the security and value of machine learning systems, as explored in the following subsections.
These two attack paths are illustrated in Figure 15.3. In exact model theft, the attacker gains access to the model's internal components, including serialized files, weights, and architecture definitions, and reproduces the model directly. In contrast, approximate model theft relies on observing the model's input-output behavior, typically through a public API. By repeatedly querying the model and collecting responses, the attacker trains a surrogate that mimics the original model's functionality. While the first approach compromises the model's internal design and training investment, the second threatens its predictive value and can facilitate further attacks such as adversarial example transfer or model inversion.
Exact Model Theft
Exact model property theft refers to attacks aimed at extracting the internal structure and learned parameters of a machine learning model. These attacks often target deployed models that are exposed through APIs, embedded in on-device inference engines, or shared as downloadable model files on collaboration platforms. Exploiting weak access control, insecure model packaging, or unprotected deployment interfaces, attackers can recover proprietary model assets without requiring full control of the underlying infrastructure.
These attacks typically seek three types of information. The first is the model's learned parameters, such as weights and biases. By extracting these parameters, attackers can replicate the model's functionality without incurring the cost of training. This replication allows them to benefit from the model's performance while bypassing the original development effort.
The second target is the model's fine-tuned hyperparameters, including training configurations such as learning rate, batch size, and regularization settings. These hyperparameters significantly influence model performance, and stealing them enables attackers to reproduce high-quality results with minimal additional experimentation.
Finally, attackers may seek to reconstruct the model's architecture. This includes the sequence and types of layers, activation functions, and connectivity patterns that define the model's behavior. Architecture theft may be accomplished through side-channel attacks, reverse engineering, or analysis of observable model behavior. Revealing the architecture not only compromises intellectual property but also gives competitors strategic insights into the design choices that provide competitive advantage.
System designers must account for these risks by securing model serialization formats, restricting access to runtime APIs, and hardening deployment pipelines. Protecting models requires a combination of software engineering practices, including access control, encryption, and obfuscation techniques, to reduce the risk of unauthorized extraction (Tramèr et al. 2016).
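As a small example of hardening model serialization, the sketch below verifies a SHA-256 digest of a model artifact before it is deserialized, so that a tampered or swapped file is rejected. The file path and expected digest are placeholders; in practice the reference digest would come from a signed manifest or a trusted registry.

```python
# Minimal sketch: verify a serialized model's integrity before deserializing it.
# The expected digest is a placeholder; it would be distributed through a
# trusted channel (e.g., a signed release manifest) rather than hard-coded.
import hashlib

EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def load_model_safely(path: str):
    digest = file_sha256(path)
    if digest != EXPECTED_SHA256:
        raise RuntimeError(f"Model artifact failed integrity check: {digest}")
    # Only deserialize after the check passes, e.g. with torch.load(path) or a
    # safer weights-only format such as safetensors.
    ...
```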
Approximate Model Theft
While some attackers seek to extract a model's exact internal properties, others focus on replicating its external behavior. Approximate model behavior theft refers to attacks that attempt to recreate a model's decision-making capabilities without directly accessing its parameters or architecture. Instead, attackers observe the model's inputs and outputs to build a substitute model that performs similarly on the same tasks.
This type of theft often targets models deployed as services, where the model is exposed through an API or embedded in a user-facing application. By repeatedly querying the model and recording its responses, an attacker can train their own model to mimic the behavior of the original. This process, often called model distillation or knockoff modeling, enables attackers to achieve comparable functionality without access to the original model's proprietary internals (Orekondy, Schiele, and Fritz 2019).
Attackers may evaluate the success of behavior replication in two ways. The first is by measuring the level of effectiveness of the substitute model. This involves assessing whether the cloned model achieves similar accuracy, precision, recall, or other performance metrics on benchmark tasks. By aligning the substituteâs performance with that of the original, attackers can build a model that is practically indistinguishable in effectiveness, even if its internal structure differs.
The second is by testing prediction consistency. This involves checking whether the substitute model produces the same outputs as the original model when presented with the same inputs. Matching not only correct predictions but also the original model's mistakes can provide attackers with a high-fidelity reproduction of the target model's behavior. This is particularly concerning in applications such as natural language processing, where attackers might replicate sentiment analysis models to gain competitive insights or bypass proprietary systems.
Approximate behavior theft is particularly challenging to defend against in open-access deployment settings, such as public APIs or consumer-facing applications. Limiting the rate of queries, detecting automated extraction patterns, and watermarking model outputs are among the techniques that can help mitigate this risk. However, these defenses must be balanced with usability and performance considerations, especially in production environments.
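The sketch below shows the simplest of these defenses, a per-client sliding-window rate limiter placed in front of an inference endpoint. The limits and client identifiers are illustrative; a production service would combine this with logging and analysis of query distributions to spot extraction-style access patterns.

```python
# Minimal sketch of query rate limiting for an inference API: each client may
# issue at most `max_queries` requests within a sliding time window.
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window = window_seconds
        self.history = defaultdict(deque)   # client_id -> timestamps of recent queries

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        q = self.history[client_id]
        while q and now - q[0] > self.window:   # drop timestamps outside the window
            q.popleft()
        if len(q) >= self.max_queries:
            return False
        q.append(now)
        return True

limiter = RateLimiter(max_queries=100, window_seconds=60.0)

def serve_prediction(client_id: str, features):
    if not limiter.allow(client_id):
        raise PermissionError("query budget exceeded; possible extraction attempt")
    # model.predict(features) would run here
    return None
```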
One notable demonstration of approximate model theft focuses on extracting internal components of black-box language models via public APIs. Carlini et al. (2024) show how to reconstruct the final embedding projection matrix of several OpenAI models, including `ada`, `babbage`, and `gpt-3.5-turbo`, using only public API access. By exploiting the low-rank structure of the output projection layer and making carefully crafted queries, they recover the model's hidden dimensionality and replicate the weight matrix up to affine transformations.
While the attack does not reconstruct the full model, it reveals critical internal architecture parameters and sets a precedent for future, deeper extractions. This work demonstrated that even partial model theft poses risks to confidentiality and competitive advantage, especially when model behavior can be probed through rich API responses such as logit bias and log-probabilities.
Model | Size (Dimension Extraction) | Number of Queries | RMS (Weight Matrix Extraction) | Cost (USD) |
---|---|---|---|---|
OpenAI ada | 1024 ✓ | \(< 2 \times 10^6\) | \(5 \times 10^{-4}\) | $1 / $4 |
OpenAI babbage | 2048 ✓ | \(< 4 \times 10^6\) | \(7 \times 10^{-4}\) | $2 / $12 |
OpenAI babbage-002 | 1536 ✓ | \(< 4 \times 10^6\) | Not implemented | $2 / $12 |
OpenAI gpt-3.5-turbo-instruct | Not disclosed | \(< 4 \times 10^7\) | Not implemented | $200 / ~$2,000 (estimated) |
OpenAI gpt-3.5-turbo-1106 | Not disclosed | \(< 4 \times 10^7\) | Not implemented | $800 / ~$8,000 (estimated) |
As shown in their empirical evaluation, reproduced in Table 15.2, model parameters could be extracted with root mean square errors on the order of \(10^{-4}\), confirming that high-fidelity approximation is achievable at scale. These findings have important implications for system design: seemingly innocuous API features, such as returning top-k logits or log-probabilities, can become significant leakage vectors if not tightly controlled.
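The core observation behind this attack can be reproduced in a few lines: because every logit vector is the product of a vocabulary-sized projection matrix and a hidden state of width \(h\), a matrix of collected logit vectors has rank approximately \(h\). The sketch below simulates this with random weights rather than a real API, which is enough to show how the hidden dimensionality falls out of a singular value decomposition.

```python
# Minimal sketch of the rank-based observation behind logit-space extraction:
# logit vectors from a model with hidden width h span an h-dimensional subspace,
# so the numerical rank of stacked logits reveals h. The "victim" here is
# simulated with random weights; the real attack recovers the same structure
# from API responses (logit bias / log-probabilities).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 5000, 256, 600   # illustrative sizes

W_out = rng.normal(size=(vocab_size, hidden_dim))         # secret output projection
hidden_states = rng.normal(size=(hidden_dim, n_queries))  # final hidden state per query
logits = W_out @ hidden_states                            # what the attacker observes

singular_values = np.linalg.svd(logits, compute_uv=False)
# Count singular values that are non-negligible relative to the largest one.
estimated_dim = int(np.sum(singular_values > 1e-6 * singular_values[0]))
print(estimated_dim)   # ~256: the hidden dimensionality has been recovered
```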
Case Study: Tesla IP Theft
In 2018, Tesla filed a lawsuit against the self-driving car startup Zoox, alleging that former Tesla employees had stolen proprietary data and trade secrets related to Tesla's autonomous driving technology. According to the lawsuit, several employees transferred over 10 gigabytes of confidential files, including machine learning models and source code, before leaving Tesla to join Zoox.
Among the stolen materials was a key image recognition model used for object detection in Tesla's self-driving system. By obtaining this model, Zoox could have bypassed years of research and development, giving the company a competitive advantage. Beyond the economic implications, there were concerns that the stolen model could expose Tesla to further security risks, such as model inversion attacks aimed at extracting sensitive data from the model's training set.
The Zoox employees denied any wrongdoing, and the case was ultimately settled out of court. Nevertheless, the incident highlights the real-world risks of model theft, particularly in industries where machine learning models represent significant intellectual property. The theft of models not only undermines competitive advantage but also raises broader concerns about privacy, safety, and the potential for downstream exploitation.
This case demonstrates that model theft is not limited to theoretical attacks conducted over APIs or public interfaces. Insider threats, supply chain vulnerabilities, and unauthorized access to development infrastructure pose equally serious risks to machine learning systems deployed in commercial environments.
15.5.2 Data Poisoning
Training integrity threats stem from the manipulation of data used to train machine learning models. These attacks aim to corrupt the learning process by introducing examples that appear benign but induce harmful or biased behavior in the final model.
Data poisoning attacks are a prominent example, in which adversaries inject carefully crafted data points into the training set to influence model behavior in targeted or systemic ways (Biggio, Nelson, and Laskov 2012). Poisoned data may cause a model to make incorrect predictions, degrade its generalization ability, or embed failure modes that remain dormant until triggered post-deployment.
Data poisoning is a security threat because it involves intentional manipulation of the training data by an adversary, with the goal of embedding vulnerabilities or subverting model behavior. These attacks are especially concerning in applications where models retrain on data collected from external sources, including user interactions, crowdsourced annotations, and online scraping, since attackers can inject poisoned data without direct access to the training pipeline. Even in more controlled settings, poisoning may occur through compromised data storage, insider manipulation, or insecure data transfer processes.
From a security perspective, poisoning attacks vary depending on the attacker's level of access and knowledge. In white-box scenarios, the adversary may have detailed insight into the model architecture or training process, enabling more precise manipulation. In contrast, black-box or limited-access attacks exploit open data submission channels or indirect injection vectors. Poisoning can target different stages of the ML pipeline, ranging from data collection and preprocessing to labeling and storage, making the attack surface both broad and system-dependent.
Poisoning attacks typically follow a three-stage process. First, the attacker injects malicious data into the training set. These examples are often designed to appear legitimate but introduce subtle distortions that alter the model's learning process. Second, the model trains on this compromised data, embedding the attacker's intended behavior. Finally, once the model is deployed, the attacker may exploit the altered behavior to cause mispredictions, bypass safety checks, or degrade overall reliability.
Formally, data poisoning can be viewed as a bilevel optimization problem, where the attacker seeks to select poisoning data \(D_p\) that maximizes the model's loss on a validation or target dataset \(D_{\text{test}}\). Let \(D\) represent the original training data. The attacker's objective can be written as: \[ \max_{D_p} \ \mathcal{L}(f_{D \cup D_p}, D_{\text{test}}) \] where \(f_{D \cup D_p}\) is the model trained on the combined dataset. For targeted attacks, this objective may focus on specific inputs \(x_t\) and target labels \(y_t\): \[ \max_{D_p} \ \mathcal{L}(f_{D \cup D_p}, x_t, y_t) \]
This formulation captures the adversaryâs goal of introducing carefully crafted data points to manipulate the modelâs decision boundaries.
For example, consider a traffic sign classification model trained to distinguish between stop signs and speed limit signs. An attacker might inject a small number of stop sign images labeled as speed limit signs into the training data. The attacker's goal is to subtly shift the model's decision boundary so that future stop signs are misclassified as speed limit signs. In this case, the poisoning data \(D_p\) consists of mislabeled stop sign images, and the attacker's objective is to maximize the misclassification of legitimate stop signs \(x_t\) as speed limit signs \(y_t\), following the targeted attack formulation above. Even if the model performs well on other types of signs, the poisoned training process creates a predictable and exploitable vulnerability.
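A toy version of this targeted label-flipping attack can be demonstrated on synthetic data. In the sketch below, a small number of class-0 points are injected with class-1 labels before training a logistic regression model; the poisoned model's error rate on legitimate class-0 inputs typically rises while its overall behavior remains plausible. The dataset, model, and poisoning fraction are illustrative stand-ins for the traffic sign scenario.

```python
# Toy sketch of targeted label-flip poisoning on synthetic 2D data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_blobs(n_samples=2000, centers=2, cluster_std=2.5, random_state=0)
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

# Craft the poison: copy some class-0 training points (with small noise) and
# assign them the class-1 label, nudging the boundary into class-0 territory.
target_idx = np.where(y_train == 0)[0][:75]   # roughly 5% of the training set
X_poison = X_train[target_idx] + rng.normal(scale=0.1, size=(len(target_idx), 2))
y_poison = np.ones(len(target_idx), dtype=int)

clean = LogisticRegression().fit(X_train, y_train)
poisoned = LogisticRegression().fit(
    np.vstack([X_train, X_poison]), np.concatenate([y_train, y_poison])
)

mask = y_test == 0   # the attacker's target class
print("class-0 accuracy, clean model:   ", clean.score(X_test[mask], y_test[mask]))
print("class-0 accuracy, poisoned model:", poisoned.score(X_test[mask], y_test[mask]))
```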
Data poisoning attacks can be classified based on their objectives and scope of impact. Availability attacks degrade overall model performance by introducing noise or label flips that reduce accuracy across tasks. Targeted attacks manipulate a specific input or class, leaving general performance intact but causing consistent misclassification in select cases. Backdoor attacks embed hidden triggers, which are often imperceptible patterns, that elicit malicious behavior only when the trigger is present. Subpopulation attacks degrade performance on a specific group defined by shared features, making them particularly dangerous in fairness-sensitive applications.
A notable real-world example of a targeted poisoning attack was demonstrated against Perspective, an online toxicity detection model (Hosseini et al. 2017). By injecting synthetically generated toxic comments with subtle misspellings and grammatical errors into the model's training set, researchers degraded its ability to detect harmful content. After retraining, the poisoned model exhibited a significantly higher false negative rate, allowing offensive language to bypass filters. This case illustrates how poisoned data can exploit feedback loops in systems that rely on user-generated input, leading to reduced effectiveness over time and creating long-term vulnerabilities in content moderation pipelines.
Mitigating data poisoning threats requires end-to-end security of the data pipeline, encompassing collection, storage, labeling, and training. Preventative measures include input validation checks, integrity verification of training datasets, and anomaly detection to flag suspicious patterns. In parallel, robust training algorithms can limit the influence of mislabeled or manipulated data by down-weighting or filtering out anomalous instances. While no single technique guarantees immunity, combining proactive data governance, automated monitoring, and robust learning practices is essential for maintaining model integrity in real-world deployments.
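One simple instance of such anomaly detection is a per-class outlier check that flags training points lying unusually far from their class centroid in feature space, a pattern common to mislabeled or injected examples. The sketch below uses raw features and a z-score threshold for illustration; real pipelines more often apply the same idea to learned embeddings.

```python
# Minimal sketch of per-class anomaly filtering for a training set: flag points
# that sit unusually far from the centroid of their labeled class.
import numpy as np

def flag_suspicious(X: np.ndarray, y: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask marking points that are outliers within their labeled class."""
    suspicious = np.zeros(len(y), dtype=bool)
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        centroid = X[idx].mean(axis=0)
        dists = np.linalg.norm(X[idx] - centroid, axis=1)
        z_scores = (dists - dists.mean()) / (dists.std() + 1e-12)
        suspicious[idx] = z_scores > z_threshold   # unusually far from the class centroid
    return suspicious

# Usage: review or drop flagged rows before training, e.g.
# mask = flag_suspicious(X_train, y_train)
# X_clean, y_clean = X_train[~mask], y_train[~mask]
```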
15.5.3 Adversarial Attacks
Inference robustness threats occur when attackers manipulate inputs at test time to induce incorrect predictions. Unlike data poisoning, which compromises the training process, these attacks exploit vulnerabilities in the model's decision surface during inference.
A central class of such threats is adversarial attacks, where carefully constructed inputs are designed to cause incorrect predictions while remaining nearly indistinguishable from legitimate data (Szegedy et al. 2013; Parrish et al. 2023). These attacks highlight a critical weakness in many ML models: their sensitivity to small, targeted perturbations that can drastically alter output confidence or classification results.
The central vulnerability arises from the model's sensitivity to small, targeted perturbations. A single image, for instance, can be subtly modified, with changes to only a few pixel values, such that a classifier misidentifies a stop sign as a speed limit sign. In natural language processing, specially crafted input sequences may trigger toxic or misleading outputs in a generative model, even when the prompt appears benign to a human reader (Ramesh et al. 2021; Rombach et al. 2022).
Adversarial attacks pose critical safety and security risks in domains such as autonomous driving, biometric authentication, and content moderation. Unlike data poisoning, which corrupts the model during training, adversarial attacks manipulate the model's behavior at test time, often without requiring any access to the training data or model internals. The attack surface thus shifts from upstream data pipelines to real-time interaction, demanding robust defense mechanisms capable of detecting or mitigating malicious inputs at the point of inference.
Adversarial example generation can be formally described as a constrained optimization problem, where the attacker seeks to find a minimally perturbed version of a legitimate input that maximizes the model's prediction error. Given an input \(x\) with true label \(y\), the attacker's objective is to find a perturbed input \(x' = x + \delta\) that maximizes the model's loss: \[ \max_{\delta} \ \mathcal{L}(f(x + \delta), y) \] subject to the constraint: \[ \|\delta\| \leq \epsilon \] where \(f(\cdot)\) is the model, \(\mathcal{L}\) is the loss function, and \(\epsilon\) defines the allowed perturbation magnitude. This ensures that the perturbation remains small, often imperceptible to humans, while still leading the model to produce an incorrect output.
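The fast gradient sign method (FGSM) is the simplest instantiation of this optimization: a single step in the direction of the loss gradient, with the perturbation clipped to an \(L_\infty\) ball of radius \(\epsilon\). The sketch below assumes an arbitrary differentiable PyTorch classifier and inputs normalized to \([0, 1]\).

```python
# Minimal FGSM sketch: one signed-gradient step bounded by epsilon in L-infinity norm.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Return an adversarial copy of x with ||x_adv - x||_inf <= epsilon."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # the loss the attacker wants to increase
    loss.backward()
    # One step in the sign of the gradient, then clip back to a valid pixel range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```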
This optimization view underlies common adversarial strategies used in both white-box and black-box settings. A full taxonomy of attack algorithms, including gradient-based, optimization-based, and transfer-based techniques, is provided in a later chapter.
Adversarial attacks vary based on the attacker's level of access to the model. In white-box attacks, the adversary has full knowledge of the model's architecture, parameters, and training data, allowing them to craft highly effective adversarial examples. In black-box attacks, the adversary has no internal knowledge and must rely on querying the model and observing its outputs. Grey-box attacks fall between these extremes, with the adversary possessing partial information, such as access to the model architecture but not its parameters.
These attacker models can be summarized along a spectrum of knowledge levels. Table 15.3 highlights the differences in model access, data access, typical attack strategies, and common deployment scenarios. Such distinctions help characterize the practical challenges of securing ML systems across different deployment environments.
Adversary Knowledge Level | Model Access | Training Data Access | Attack Example | Common Scenario |
---|---|---|---|---|
White-box | Full access to architecture and parameters | Full access | Crafting adversarial examples using gradients | Insider threats, open-source model reuse |
Grey-box | Partial access (e.g., architecture only) | Limited or no access | Attacks based on surrogate model approximation | Known model family, unknown fine-tuning |
Black-box | No internal access; only query-response view | No access | Query-based surrogate model training and transfer attacks | Public APIs, model-as-a-service deployments |
A common attack strategy involves constructing a surrogate model that approximates the target model's behavior. This surrogate model is trained by querying the target model with a set of inputs \(\{x_i\}\) and recording the corresponding outputs \(\{f(x_i)\}\). The attacker's goal is to train a surrogate model \(\hat{f}\) that minimizes the discrepancy between its predictions and those of the target model. This objective can be formulated as: \[ \min_{\hat{f}} \ \sum_{i=1}^{n} \ \ell(\hat{f}(x_i), f(x_i)) \] where \(\ell\) is a loss function measuring the difference between the surrogate's output and the target model's output. By minimizing this loss, the attacker builds a model that behaves similarly to the target. Once trained, the surrogate model can be used to generate adversarial examples using white-box techniques. These examples often transfer to the original target model, even without internal access, making such attacks effective in black-box settings. This phenomenon, known as adversarial transferability, presents a significant challenge for defense.
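A minimal sketch of this surrogate-training loop is shown below: the attacker draws probe inputs, records the victim's soft outputs, and fits a local substitute by minimizing the divergence between the two, mirroring the objective above. The `query_victim` function is a hypothetical placeholder for whatever query interface the attacker can actually reach.

```python
# Minimal sketch of surrogate-model construction from black-box queries.
import torch
import torch.nn as nn
import torch.nn.functional as F

def query_victim(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical placeholder for the victim API: returns class probabilities."""
    return torch.full((x.shape[0], 10), 0.1)   # stand-in 10-class response

surrogate = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for _ in range(100):                       # the attacker's query budget (100 batches)
    x = torch.rand(32, 64)                 # probe inputs chosen by the attacker
    with torch.no_grad():
        victim_probs = query_victim(x)     # observe the victim's input-output behavior
    surrogate_log_probs = F.log_softmax(surrogate(x), dim=1)
    loss = F.kl_div(surrogate_log_probs, victim_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# White-box attacks crafted against `surrogate` often transfer back to the victim.
```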
Several methods have been proposed to generate adversarial examples. One notable approach leverages generative adversarial networks (GANs) (I. Goodfellow et al. 2020). In this setting, a generator network learns to produce inputs that deceive the target model, while a discriminator evaluates their effectiveness. This iterative process allows the attacker to generate sophisticated and diverse adversarial examples.
Another vector for adversarial attacks involves transfer learning pipelines. Many production systems reuse pre-trained feature extractors, fine-tuning only the final layers for specific tasks. Adversaries can exploit this structure by targeting the shared feature extractor, crafting perturbations that affect multiple downstream tasks. Headless attacks, for example, manipulate the feature extractor without requiring access to the classification head or training data (Abdelkader et al. 2020). This exposes a critical vulnerability in systems that rely on pre-trained models.
One illustrative example involves the manipulation of traffic sign recognition systems (Eykholt et al. 2017). Researchers demonstrated that placing small stickers on stop signs could cause machine learning models to misclassify them as speed limit signs. While the altered signs remained easily recognizable to humans, the model consistently misinterpreted them. Such attacks pose serious risks in applications like autonomous driving, where reliable perception is critical for safety.
Adversarial attacks highlight the need for robust defenses that go beyond improving model accuracy. Securing ML systems against adversarial threats requires runtime defenses such as input validation, anomaly detection, and monitoring for abnormal patterns during inference. Training-time robustness methods (e.g., adversarial training) complement these strategies and are discussed in more detail in a later chapter. These defenses aim to enhance model resilience against adversarial examples, ensuring that machine learning systems can operate reliably even in the presence of malicious inputs.
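As one example of such a runtime check, the sketch below flags inference requests whose softmax confidence is unusually low or whose predictive entropy is unusually high, two inexpensive signals often associated with out-of-distribution or adversarially perturbed inputs. The thresholds are illustrative and would be calibrated on clean validation data; flagged inputs might be rejected, logged, or routed for additional verification.

```python
# Minimal sketch of a runtime input check based on prediction confidence and entropy.
import torch
import torch.nn.functional as F

def suspicious_inputs(model, x, min_confidence=0.6, max_entropy=1.0):
    """Return per-sample flags for low-confidence or high-entropy predictions."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
    confidence = probs.max(dim=1).values
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return (confidence < min_confidence) | (entropy > max_entropy)
```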
15.5.4 Case Study: Traffic Sign Detection Model Trickery
In 2017, researchers conducted experiments by placing small black and white stickers on stop signs (Eykholt et al. 2017). As shown in Figure 15.4, the stickers were small and unobtrusive, leaving the sign fully legible to a human observer, yet they significantly altered how machine learning models perceived it. When images of the stickered stop signs were fed into standard traffic sign classification models, they were misclassified as speed limit signs over 85% of the time.
This demonstration showed how simple adversarial stickers could trick ML systems into misreading critical road signs. If deployed realistically, these attacks could endanger public safety, causing autonomous vehicles to misinterpret stop signs as speed limits. Researchers warned this could potentially cause dangerous rolling stops or acceleration into intersections.
This case study provides a concrete illustration of how adversarial examples exploit the pattern recognition mechanisms of ML models. By subtly altering the input data, attackers can induce incorrect predictions and pose significant risks to safety-critical applications like self-driving cars. The attack's simplicity demonstrates how even minor physical modifications can lead models astray. Consequently, developers must implement robust defenses against such threats.
These threat types span different stages of the ML lifecycle and demand distinct defensive strategies. Table 15.4 below summarizes their key characteristics.
Threat Type | Lifecycle Stage | Attack Vector | Example Impact |
---|---|---|---|
Model Theft | Deployment | API access, insider leaks | Stolen IP, model inversion, behavioral clone |
Data Poisoning | Training | Label flipping, backdoors | Targeted misclassification, degraded accuracy |
Adversarial Attacks | Inference | Input perturbation | Real-time misclassification, safety failure |
The appropriate defense for a given threat depends on its type, attack vector, and where it occurs in the ML lifecycle. Figure 15.5 provides a simplified decision flow that connects common threat categories, such as model theft, data poisoning, and adversarial examples, to corresponding defensive strategies. While real-world deployments may require more nuanced or layered defenses, this flowchart serves as a conceptual guide for aligning threat models with practical mitigation techniques.
While ML models themselves present critical attack surfaces, they ultimately run on hardware that can introduce vulnerabilities beyond the model's control. In the next section, we examine how adversaries can target the physical infrastructure that executes machine learning workloads: through hardware bugs, physical tampering, side channels, and supply chain risks.
15.6 Threats to ML Hardware
As machine learning systems move from research prototypes to large-scale, real-world deployments, their security increasingly depends on the hardware platforms they run on. Whether deployed in data centers, on edge devices, or in embedded systems, machine learning applications rely on a layered stack of processors, accelerators, memory, and communication interfaces. These hardware components, while essential for enabling efficient computation, introduce unique security risks that go beyond traditional software-based vulnerabilities.
Unlike general-purpose software systems, machine learning workflows often process high-value models and sensitive data in performance-constrained environments. This makes them attractive targets not only for software attacks but also for hardware-level exploitation. Vulnerabilities in hardware can expose models to theft, leak user data, disrupt system reliability, or allow adversaries to manipulate inference results. Because hardware operates below the software stack, such attacks can bypass conventional security mechanisms and remain difficult to detect.
These hardware threats arise from multiple sources, including design flaws in hardware architectures, physical tampering, side-channel leakage, and supply chain compromises. Together, they form a critical attack surface that must be addressed to build trustworthy machine learning systems.
Table 15.5 summarizes the major categories of hardware security threats, describing their origins, methods, and implications for machine learning system design and deployment.
Threat Type | Description | Relevance to ML Hardware Security |
---|---|---|
Hardware Bugs | Intrinsic flaws in hardware designs that can compromise system integrity. | Design flaws such as speculative-execution vulnerabilities can leak model and user data. |
Physical Attacks | Direct exploitation of hardware through physical access or manipulation. | Physical access bypasses software-level defenses on deployed ML devices. |
Fault-injection Attacks | Induction of faults to cause errors in hardware operation, leading to potential system crashes. | Induced faults can corrupt model execution or disrupt ML services. |
Side-Channel Attacks | Exploitation of leaked information from hardware operation to extract sensitive data. | Timing, power, or electromagnetic leakage can expose models or inference data. |
Leaky Interfaces | Vulnerabilities arising from interfaces that expose data unintentionally. | Unprotected communication or debug channels can expose data in transit. |
Counterfeit Hardware | Use of unauthorized hardware components that may have security flaws. | Unvetted components undermine trust in the platform executing ML workloads. |
Supply Chain Risks | Risks introduced through the hardware lifecycle, from production to deployment. | Compromise can be introduced at any stage and is difficult to detect after deployment. |
15.6.1 Hardware Bugs
Hardware is not immune to the pervasive issue of design flaws or bugs. Attackers can exploit these vulnerabilities to access, manipulate, or extract sensitive data, breaching the confidentiality and integrity that users and services depend on. One of the most notable examples came with the discovery of Meltdown and Spectre, two vulnerabilities in modern processors that allow malicious programs to bypass memory isolation and read the data of other applications and the operating system (Kocher et al. 2019a, 2019b).
These attacks exploit speculative execution, a performance optimization in CPUs that executes instructions out of order before safety checks are complete. While improving computational speed, this optimization inadvertently exposes sensitive data through microarchitectural side channels, such as CPU caches. The technical sophistication of these attacks highlights the difficulty of eliminating vulnerabilities even with extensive hardware validation.
Further research has revealed that these were not isolated incidents. Variants such as Foreshadow, ZombieLoad, and RIDL target different microarchitectural elements, ranging from secure enclaves to CPU internal buffers, demonstrating that speculative execution flaws are a systemic hardware risk.
While these attacks were first demonstrated on general-purpose CPUs, their implications extend to machine learning accelerators and specialized hardware. ML systems often rely on heterogeneous compute platforms that combine CPUs with GPUs, TPUs, FPGAs, or custom accelerators. These components process sensitive data such as personal information, medical records, or proprietary models. Vulnerabilities in any part of this stack could expose such data to attackers.
For example, an edge device like a smart camera running a face recognition model on an accelerator could be vulnerable if the hardware lacks proper cache isolation. An attacker might exploit this weakness to extract intermediate computations, model parameters, or user data. Similar risks exist in cloud inference services, where hardware multi-tenancy increases the chances of cross-tenant data leakage.
Such vulnerabilities are particularly concerning in privacy-sensitive domains like healthcare, where ML systems routinely handle patient data. A breach could violate privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA), leading to significant legal and ethical consequences.
These examples illustrate that hardware security is not solely about preventing physical tampering. It also requires architectural safeguards to prevent data leakage through the hardware itself. As new vulnerabilities continue to emerge across processors, accelerators, and memory systems, addressing these risks requires continuous mitigation efforts, often involving performance trade-offs, especially in compute- and memory-intensive ML workloads. Proactive solutions, such as confidential computing and trusted execution environments (TEEs), offer promising architectural defenses. However, achieving robust hardware security requires attention at every stage of the system lifecycle, from design to deployment.
15.6.2 Physical Attacks
Physical tampering refers to the direct, unauthorized manipulation of computing hardware to undermine the integrity of machine learning systems. This type of attack is particularly concerning because it bypasses traditional software security defenses, directly targeting the physical components on which machine learning depends. ML systems are especially vulnerable to such attacks because they rely on hardware sensors, accelerators, and storage to process large volumes of data and produce reliable outcomes in real-world environments.
While software security measures, including encryption, authentication, and access control, protect ML systems against remote attacks, they offer little defense against adversaries with physical access to devices. Physical tampering can range from simple actions, like inserting a malicious USB device into an edge server, to highly sophisticated manipulations such as embedding hardware trojans during chip manufacturing. These threats are particularly relevant for machine learning systems deployed at the edge or in physically exposed environments, where attackers may have opportunities to interfere with the hardware directly.
To understand how such attacks affect ML systems in practice, consider the example of an ML-powered drone used for environmental mapping or infrastructure inspection. The drone's navigation depends on machine learning models that process data from GPS, cameras, and inertial measurement units. If an attacker gains physical access to the drone, they could replace or modify its navigation module, embedding a hidden backdoor that alters flight behavior or reroutes data collection. Such manipulation not only compromises the system's reliability but also opens the door to misuse, such as surveillance or smuggling operations.
Physical attacks are not limited to mobility systems. Biometric access control systems, which rely on ML models to process face or fingerprint data, are also vulnerable. These systems typically use embedded hardware to capture and process biometric inputs. An attacker could physically replace a biometric sensor with a modified component designed to capture and transmit personal identification data to an unauthorized receiver. This compromises both security and user privacy, and it can enable future impersonation attacks.
In addition to tampering with external sensors, attackers may target internal hardware subsystems. For example, the sensors used in autonomous vehicles, including cameras, LiDAR, and radar, are essential for ML models that interpret the surrounding environment. A malicious actor could physically misalign or obstruct these sensors, degrading the model's perception capabilities and creating safety hazards.
Hardware trojans pose another serious risk. Malicious modifications introduced during chip fabrication or assembly can embed dormant circuits in ML accelerators or inference chips. These trojans may remain inactive under normal conditions but trigger malicious behavior when specific inputs are processed or system states are reached. Such hidden vulnerabilities can disrupt computations, leak model outputs, or degrade system performance in ways that are extremely difficult to diagnose post-deployment.
Memory subsystems are also attractive targets. Attackers with physical access to edge devices or embedded ML accelerators could manipulate memory chips to extract encrypted model parameters or training data. Fault injection techniques, including voltage manipulation and electromagnetic interference, can further degrade system reliability by corrupting model weights or forcing incorrect computations during inference.
Physical access threats extend to data center and cloud environments as well. Attackers with sufficient access could install hardware implants, such as keyloggers or data interceptors, to capture administrative credentials or monitor data streams. Such implants can provide persistent backdoor access, enabling long-term surveillance or data exfiltration from ML training and inference pipelines.
In summary, physical attacks on machine learning systems threaten both security and reliability across a wide range of deployment environments. Addressing these risks requires a combination of hardware-level protections, tamper detection mechanisms, and supply chain integrity checks. Without these safeguards, even the most secure software defenses may be undermined by vulnerabilities introduced through direct physical manipulation.
15.6.3 Fault Injection Attacks
Fault injection is a powerful class of physical attacks that deliberately disrupts hardware operations to induce errors in computation. These induced faults can compromise the integrity of machine learning models by causing them to produce incorrect outputs, degrade reliability, or leak sensitive information. For ML systems, such faults not only disrupt inference but also expose models to deeper exploitation, including reverse engineering and bypass of security protocols (Joye and Tunstall 2012).
Attackers achieve fault injection by applying precisely timed physical or electrical disturbances to the hardware while it is executing computations. Techniques such as low-voltage manipulation (Barenghi et al. 2010), power spikes (Hutter, Schmidt, and Plos 2009), clock glitches (Amiel, Clavier, and Tunstall 2006), electromagnetic pulses (Agrawal et al. 2007), temperature variations (S. Skorobogatov 2009), and even laser strikes (S. P. Skorobogatov and Anderson 2003) have been demonstrated to corrupt specific parts of a program's execution. These disturbances can cause effects such as bit flips, skipped instructions, or corrupted memory states, which adversaries can exploit to alter ML model behavior or extract sensitive information.
For machine learning systems, these attacks pose several concrete risks. Fault injection can degrade model accuracy, force incorrect classifications, trigger denial of service, or even leak internal model parameters. For example, attackers could inject faults into an embedded ML model running on a microcontroller, forcing it to misclassify inputs in safety-critical applications such as autonomous navigation or medical diagnostics. More sophisticated attackers may target memory or control logic to steal intellectual property, such as proprietary model weights or architecture details.
Experimental demonstrations have shown the feasibility of such attacks. One notable example is the work by Breier et al. (2018), where researchers successfully used a laser fault injection attack on a deep neural network deployed on a microcontroller. By heating specific transistors, as shown in Figure 15.6, they forced the hardware to skip execution steps, including a ReLU activation function.
This manipulation is illustrated in Figure 15.7, which shows a segment of assembly code implementing the ReLU activation function. Normally, the code compares the most significant bit (MSB) of the accumulator to zero and uses a brge (branch if greater or equal) instruction to skip the assignment if the value is non-positive. However, the fault injection suppresses the branch, causing the processor to always execute the "else" block. As a result, the neuron's output is forcibly zeroed out, regardless of the input value.
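To make the effect concrete, the following sketch models the high-level logic in Python (the actual attack targets AVR assembly on the microcontroller; the fault_injected flag here simply stands in for the suppressed branch):

def relu(acc, fault_injected=False):
    # Intended behavior: pass positive activations through, zero the rest.
    if acc > 0 and not fault_injected:  # the comparison the laser fault suppresses
        return acc
    # With the branch suppressed, execution always falls through to this path,
    # forcing the neuron's output to zero regardless of its input.
    return 0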
Fault injection attacks can also be combined with side-channel analysis, where attackers first observe power or timing characteristics to infer model structure or data flow. This reconnaissance allows them to target specific layers or operations, such as activation functions or final decision layers, maximizing the impact of the injected faults.
Embedded and edge ML systems are particularly vulnerable because they often lack physical hardening and operate under resource constraints that limit runtime defenses. Without tamper-resistant packaging or secure hardware enclaves, attackers may gain direct access to system buses and memory, enabling precise fault manipulation. Furthermore, many embedded ML models are designed to be lightweight, leaving them with little redundancy or error correction to recover from induced faults.
Mitigating fault injection requires a multi-layered defense strategy. Physical protections, such as tamper-proof enclosures and design obfuscation, help limit physical access. Anomaly detection techniques can monitor sensor inputs or model outputs for signs of fault-induced inconsistencies (Hsiao et al. 2023). Error-correcting memories and secure firmware can reduce the likelihood of silent corruption. Techniques such as model watermarking may provide traceability if stolen models are later deployed by an adversary.
However, these protections are difficult to implement in cost- and power-constrained environments, where adding cryptographic hardware or redundancy may not be feasible. As a result, achieving resilience to fault injection requires cross-layer design considerations that span electrical, firmware, software, and system architecture levels. Without such holistic design practices, ML systems deployed in the field may remain exposed to these low-cost yet highly effective physical attacks.
15.6.4 Side-Channel Attacks
Side-channel attacks constitute a class of security breaches that exploit information inadvertently revealed through the physical implementation of computing systems. In contrast to direct attacks that target software or network vulnerabilities, these attacks leverage the systemâs hardware characteristics, including power consumption, electromagnetic emissions, or timing behavior, to extract sensitive information.
The fundamental premise of a side-channel attack is that a device's operation can leak information through observable physical signals. Such leaks may originate from the electrical power the device consumes (Kocher, Jaffe, and Jun 1999), the electromagnetic fields it emits (Gandolfi, Mourtel, and Olivier 2001), the time it takes to complete computations, or even the acoustic noise it produces. By carefully measuring and analyzing these signals, attackers can infer internal system states or recover secret data.
Although these techniques are commonly discussed in cryptography, they are equally relevant to machine learning systems. ML models deployed on hardware accelerators, embedded devices, or edge systems often process sensitive data. Even when these models are protected by secure algorithms or encryption, their physical execution may leak side-channel signals that can be exploited by adversaries.
One of the most widely studied examples involves Advanced Encryption Standard (AES) implementations. While AES is mathematically secure, the physical process of computing its encryption functions leaks measurable signals. Techniques such as Differential Power Analysis (DPA) (Kocher et al. 2011), Differential Electromagnetic Analysis (DEMA), and Correlation Power Analysis (CPA) exploit these physical signals to recover secret keys.
A useful example of this attack technique can be seen in a power analysis of a password authentication process. Consider a device that verifies a 5-byte password, in this case 0x61, 0x52, 0x77, 0x6A, 0x73. During authentication, the device receives each byte sequentially over a serial interface, and its power consumption pattern reveals how the system responds as it processes these inputs.
Figure 15.8 shows the device's behavior when the correct password is entered. The red waveform captures the serial data stream, marking each byte as it is received. The blue curve records the device's power consumption over time. When the full, correct password is supplied, the power profile remains stable and consistent across all five bytes, providing a clear baseline for comparison with failed attempts.
When an incorrect password is entered, the power analysis chart changes, as shown in Figure 15.9. In this case, the first three bytes (0x61, 0x52, 0x77) are correct, so the power patterns closely match the correct password up to that point. However, when the fourth byte (0x42) is processed and found to be incorrect, the device halts authentication. This change is reflected in the sudden jump in the blue power line, indicating that the device has stopped processing and entered an error state.
Figure 15.10 shows the case where the password is entirely incorrect (0x30, 0x30, 0x30, 0x30, 0x30). Here, the device detects the mismatch immediately after the first byte and halts processing much earlier. This is again visible in the power profile, where the blue line exhibits a sharp jump following the first byte, reflecting the device's early termination of authentication.
These examples demonstrate how attackers can exploit observable power consumption differences to reduce the search space and eventually recover secret data through brute-force analysis. For a more detailed walkthrough, Video 15.3 provides a step-by-step demonstration of how these attacks are performed.
Such attacks are not limited to cryptographic systems. Machine learning applications face similar risks. For example, an ML-based speech recognition system processing voice commands on a local device could leak timing or power signals that reveal which commands are being processed. Even subtle acoustic or electromagnetic emissions may expose operational patterns that an adversary could exploit to infer user behavior.
Historically, side-channel attacks have been used to bypass even the most secure cryptographic systems. In the 1960s, British intelligence agency MI5 famously exploited acoustic emissions from a cipher machine in the Egyptian Embassy (Burnet and Thomas 1989). By capturing the mechanical clicks of the machine's rotors, MI5 analysts were able to dramatically reduce the complexity of breaking encrypted messages. This early example illustrates that side-channel vulnerabilities are not confined to the digital age but are rooted in the physical nature of computation.
Today, these techniques have advanced to include attacks such as keyboard eavesdropping (Asonov and Agrawal, n.d.), power analysis on cryptographic hardware (Gnad, Oboril, and Tahoori 2017), and voltage-based attacks on ML accelerators (Zhao and Suh 2018). Timing attacks, electromagnetic leakage, and thermal emissions continue to provide adversaries with indirect channels for observing system behavior.
Machine learning systems deployed on specialized accelerators or embedded platforms are especially at risk. Attackers may exploit side-channel signals to infer model structure, steal parameters, or reconstruct private training data. As ML becomes increasingly deployed in cloud, edge, and embedded environments, these side-channel vulnerabilities pose significant challenges to system security.
Understanding the persistence and evolution of side-channel attacks is essential for building resilient machine learning systems. By recognizing that where there is a signal, there is potential for exploitation, system designers can begin to address these risks through a combination of hardware shielding, algorithmic defenses, and operational safeguards.
15.6.5 Leaky Interfaces
Interfaces in computing systems are essential for enabling communication, diagnostics, and updates. However, these same interfaces can become significant security vulnerabilities when they unintentionally expose sensitive information or accept unverified inputs. Such leaky interfaces often go unnoticed during system design, yet they provide attackers with powerful entry points to extract data, manipulate functionality, or introduce malicious code.
A leaky interface is any access point that reveals more information than intended, often because of weak authentication, lack of encryption, or inadequate isolation. These issues have been widely demonstrated across consumer, medical, and industrial systems.
For example, many WiFi-enabled baby monitors have been found to expose unsecured remote access ports, allowing attackers to intercept live audio and video feeds from inside private homes. Similarly, researchers have identified wireless vulnerabilities in pacemakers that could allow attackers to manipulate cardiac functions if exploited, raising life-threatening safety concerns.
A notable case involving smart lightbulbs demonstrated that accessible debug ports left on production devices leaked unencrypted WiFi credentials. This security oversight provided attackers with a pathway to infiltrate home networks without needing to bypass standard security mechanisms. In the automotive domain, unsecured OBD-II diagnostic ports have allowed attackers to manipulate braking and steering functions in connected vehicles, as demonstrated in the well-known Jeep Cherokee hack.
While these examples do not target machine learning systems directly, they illustrate architectural patterns that are highly relevant to ML-enabled devices. Consider a smart home security system that uses machine learning to detect user routines and automate responses. Such a system may include a maintenance or debug interface for software updates. If this interface lacks proper authentication or transmits data unencrypted, attackers on the same network could gain unauthorized access. This intrusion could expose user behavior patterns, compromise model integrity, or disable security features altogether.
Leaky interfaces in ML systems can also expose training data, model parameters, or intermediate outputs. Such exposure can enable attackers to craft adversarial examples, steal proprietary models, or reverse-engineer system behavior. Worse still, these interfaces may allow attackers to tamper with firmware, introducing malicious code that disables devices or recruits them into botnets.
Mitigating these risks requires multi-layered defenses. Technical safeguards such as strong authentication, encrypted communications, and runtime anomaly detection are essential. Organizational practices such as interface inventories, access control policies, and ongoing audits are equally important. Adopting a zero-trust architecture, where no interface is trusted by default, further reduces exposure by limiting access to only what is strictly necessary.
For designers of ML-powered systems, securing interfaces must be a first-class concern alongside algorithmic and data-centric design. Whether the system operates in the cloud, on the edge, or in embedded environments, failure to secure these access points risks undermining the entire system's trustworthiness.
15.6.6 Counterfeit Hardware
Machine learning systems depend on the reliability and security of the hardware on which they run. Yet, in today's globalized hardware ecosystem, the risk of counterfeit or cloned hardware has emerged as a serious threat to system integrity. Counterfeit components refer to unauthorized reproductions of genuine parts, designed to closely imitate their appearance and functionality. These components can enter machine learning systems through complex procurement and manufacturing processes that span multiple vendors and regions.
A single lapse in component sourcing can introduce counterfeit hardware into critical systems. For example, a facial recognition system deployed for secure facility access might unknowingly rely on counterfeit processors. These unauthorized components could fail to process biometric data correctly or introduce hidden vulnerabilities that allow attackers to bypass authentication controls.
The risks posed by counterfeit hardware are multifaceted. From a reliability perspective, such components often degrade faster, perform unpredictably, or fail under load due to substandard manufacturing. From a security perspective, counterfeit hardware may include hidden backdoors or malicious circuitry, providing attackers with undetectable pathways to compromise machine learning systems. A cloned network router installed in a data center, for instance, could silently intercept model predictions or user data, undermining both system security and user privacy.
Legal and regulatory risks further compound the problem. Organizations that unknowingly integrate counterfeit components into their ML systems may face serious legal consequences, including penalties for violating safety, privacy, or cybersecurity regulations. This is particularly concerning in sectors such as healthcare and finance, where compliance with industry standards is non-negotiable.
Economic pressures often incentivize sourcing from lower-cost suppliers without rigorous verification, increasing the likelihood of counterfeit parts entering production systems. Detection is especially challenging, as counterfeit components are designed to mimic legitimate ones. Identifying them may require specialized equipment or forensic analysis, making prevention far more practical than remediation.
The stakes are particularly high in machine learning applications that require high reliability and low latency, such as real-time decision-making in autonomous vehicles, industrial automation, or critical healthcare diagnostics. Hardware failure in these contexts can lead not only to system downtime but also to significant safety risks.
As machine learning continues to expand into safety-critical and high-value applications, counterfeit hardware presents a growing risk that must be recognized and addressed. Organizations must treat hardware trustworthiness as a fundamental design requirement, on par with algorithmic accuracy and data security, to ensure that ML systems can operate reliably and securely in the real world.
15.6.7 Supply Chain Risks
While counterfeit hardware presents a serious challenge, it is only one part of the larger problem of securing the global hardware supply chain. Machine learning systems are built from components that pass through complex supply networks involving design, fabrication, assembly, distribution, and integration. Each of these stages presents opportunities for tampering, substitution, or counterfeitingâoften without the knowledge of those deploying the final system.
Malicious actors can exploit these vulnerabilities in various ways. A contracted manufacturer might unknowingly receive recycled electronic waste that has been relabeled as new components. A distributor might deliberately mix cloned parts into otherwise legitimate shipments. Insiders at manufacturing facilities might embed hardware Trojans that are nearly impossible to detect once the system is deployed. Advanced counterfeits can be particularly deceptive, with refurbished or repackaged components designed to pass visual inspection while concealing inferior or malicious internals.
Identifying such compromises typically requires sophisticated analysis, including micrography, X-ray screening, and functional testing. However, these methods are costly and impractical for large-scale procurement. As a result, many organizations deploy systems without fully verifying the authenticity and security of every component.
The risks extend beyond individual devices. Machine learning systems often rely on heterogeneous hardware platforms, integrating CPUs, GPUs, memory, and specialized accelerators sourced from a global supply base. Any compromise in one part of this chain can undermine the security of the entire system. These risks are further amplified when systems operate in shared or multi-tenant environments, such as cloud data centers or federated edge networks, where hardware-level isolation is critical to preventing cross-tenant attacks.
The 2018 Bloomberg Businessweek report alleging that Chinese state actors inserted spy chips into Supermicro server motherboards brought these risks to mainstream attention. While the claims remain disputed, the story underscored the industry's limited visibility into its own hardware supply chains. Companies often rely on complex, opaque manufacturing and distribution networks, leaving them vulnerable to hidden compromises. Over-reliance on single manufacturers or regions, including the semiconductor industry's reliance on TSMC, further concentrates this risk. This recognition has driven policy responses like the U.S. CHIPS and Science Act, which aims to bring semiconductor production onshore and strengthen supply chain resilience.
Securing machine learning systems requires moving beyond trust-by-default models toward zero-trust supply chain practices. This includes screening suppliers, validating component provenance, implementing tamper-evident protections, and continuously monitoring system behavior for signs of compromise. Building fault-tolerant architectures that detect and contain failures provides an additional layer of defense.
Ultimately, supply chain risks must be treated as a first-class concern in ML system design. Trust in the computational models and data pipelines that power machine learning depends fundamentally on the trustworthiness of the hardware on which they run. Without securing the hardware foundation, even the most sophisticated models remain vulnerable to compromise.
15.6.8 Case Study: The Supermicro Hardware Security Controversy
In 2018, Bloomberg Businessweek published a widely discussed report alleging that Chinese state-sponsored actors had secretly implanted tiny surveillance chips on server motherboards manufactured by Supermicro (Robertson and Riley 2018). These compromised servers were reportedly deployed by more than 30 major companies, including Apple and Amazon. The chips, described as no larger than a grain of rice, were said to provide attackers with backdoor access to sensitive data and systems.
The allegations sparked immediate concern across the technology industry, raising questions about the security of global supply chains and the potential for state-level hardware manipulation. However, the companies named in the report publicly denied the claims. Apple, Amazon, and Supermicro stated that they had found no evidence of the alleged implants after conducting thorough internal investigations. Industry experts and government agencies also expressed skepticism, noting the lack of verifiable technical evidence presented in the report.
Despite these denials, the story had a lasting impact on how organizations and policymakers view hardware supply chain security. Whether or not the specific claims were accurate, the report highlighted the real and growing concern that hardware supply chains are difficult to fully audit and secure. It underscored how geopolitical tensions, manufacturing outsourcing, and the complexity of modern hardware ecosystems make it increasingly challenging to guarantee the integrity of hardware components.
The Supermicro case illustrates a broader truth: once a product enters a complex global supply chain, it becomes difficult to ensure that every component is free from tampering or unauthorized modification. This risk is particularly acute for machine learning systems, which depend on a wide range of hardware accelerators, memory modules, and processing units sourced from multiple vendors across the globe.
In response to these risks, both industry and government stakeholders have begun to invest in supply chain security initiatives. The U.S. governmentâs CHIPS and Science Act is one such effort, aiming to bring semiconductor manufacturing back onshore to improve transparency and reduce dependency on foreign suppliers. While these efforts are valuable, they do not fully eliminate supply chain risks. They must be complemented by technical safeguards, such as component validation, runtime monitoring, and fault-tolerant system design.
The Supermicro controversy serves as a cautionary tale for the machine learning community. It demonstrates that hardware security cannot be taken for granted, even when working with reputable suppliers. Ensuring the integrity of ML systems requires rigorous attention to the entire hardware lifecycle, from design and fabrication to deployment and maintenance. This case reinforces the need for organizations to adopt comprehensive supply chain security practices as a foundational element of trustworthy ML system design.
15.7 Defensive Strategies
Designing secure and privacy-preserving machine learning systems requires more than identifying individual threats. It demands a layered defense strategy, which begins with protecting the data that powers models and extends through model design, deployment safeguards, runtime monitoring, and ultimately, the hardware that anchors trust. Each layer contributes to the system's overall resilience and must be tailored to the specific threat surfaces introduced by machine learning workflows. Unlike traditional software systems, ML systems are particularly vulnerable to input manipulation, data leakage, model extraction, and runtime abuse, all amplified by tight coupling between data, model behavior, and infrastructure.
This section presents a structured framework for defensive strategies, progressing from data-centric protections to infrastructure-level enforcement. These strategies span differential privacy and federated learning, robust model architectures, secure deployment pipelines, runtime validation and monitoring, and hardware-based trust anchors such as secure boot and TEEs. By integrating safeguards across layers, organizations can build ML systems that not only perform reliably but also withstand adversarial pressure in production environments.
Figure 15.11 shows a layered defense stack for machine learning systems. The stack progresses from foundational hardware-based security mechanisms to runtime system protections, model-level controls, and privacy-preserving techniques at the data level. Each layer builds on the trust guarantees of the layer below it, forming an end-to-end strategy for deploying ML systems securely. We will progressively explore each of these layers, highlighting their roles in securing machine learning systems against a range of threats.
15.7.1 Data Privacy Techniques
Protecting the privacy of individuals whose data fuels machine learning systems is a foundational requirement for trustworthy AI. Unlike traditional systems where data is often masked or anonymized before processing, ML workflows typically rely on access to raw, high-fidelity data to train effective models. This tension between utility and privacy has motivated a diverse set of techniques aimed at minimizing data exposure while preserving learning performance.
Differential Privacy
One of the most widely adopted frameworks for formalizing privacy guarantees is differential privacy (DP). DP provides a rigorous mathematical definition of privacy loss, ensuring that the inclusion or exclusion of a single individual's data has a provably limited effect on the model's output. A randomized algorithm \(\mathcal{A}\) is said to be \(\epsilon\)-differentially private if, for all adjacent datasets \(D\) and \(D'\) differing in one record, and for all outputs \(S \subseteq \text{Range}(\mathcal{A})\), the following holds: \[ \Pr[\mathcal{A}(D) \in S] \leq e^{\epsilon} \Pr[\mathcal{A}(D') \in S] \]
This bound ensures that the algorithm's behavior remains statistically indistinguishable regardless of whether any individual's data is present, thereby limiting the information that can be inferred about that individual. In practice, DP is implemented by adding calibrated noise to model updates or query responses, using mechanisms such as the Laplace or Gaussian mechanism. Training techniques like differentially private stochastic gradient descent (DP-SGD) integrate noise into the optimization process to ensure per-iteration privacy guarantees.
While differential privacy offers strong theoretical assurances, it introduces a trade-off between privacy and utility. Increasing the noise to reduce \(\epsilon\) may degrade model accuracy, especially in low-data regimes or fine-grained classification tasks. Consequently, DP is often applied selectively, either during training on sensitive datasets or at inference when returning aggregate statistics, to balance privacy with performance goals (Dwork and Roth 2013).
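As a concrete illustration of the noise-addition step, the sketch below applies the Laplace mechanism to a simple counting query. It is a simplified example for intuition rather than a production DP implementation; the count and epsilon value are hypothetical.

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Noise scale is sensitivity / epsilon: a smaller epsilon adds more noise
    # and yields a stronger privacy guarantee.
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# A counting query changes by at most 1 when any single record is added or
# removed, so its sensitivity is 1.
true_count = 1340
private_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
print(f"Noisy count released: {private_count:.1f}")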
Federated Learning
Complementary to DP, federated learning (FL) reduces privacy risks by restructuring the learning process itself. Rather than aggregating raw data at a central location, FL distributes the training across a set of client devices, each holding local data (McMahan et al. 2017). Clients compute model updates locally and share only parameter deltas with a central server for aggregation: \[ \theta_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} \cdot \theta_{t}^{(k)} \]
Here, \(\theta_{t}^{(k)}\) represents the model update from client \(k\), \(n_k\) the number of samples held by that client, and \(n\) the total number of samples across all clients. This weighted aggregation allows the global model to learn from distributed data without direct access to it. While FL reduces the exposure of raw data, it still leaks information through gradients, motivating the use of DP, secure aggregation, and hardware-based protections in federated settings.
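The aggregation rule above reduces to a weighted average of client updates. The sketch below illustrates a single FedAvg round with hypothetical client parameter vectors and sample counts.

import numpy as np

def fedavg(client_params, client_sizes):
    # Weight each client's update by its share of the total training samples.
    total = sum(client_sizes)
    return sum((n / total) * params
               for params, n in zip(client_params, client_sizes))

# Hypothetical updates from three clients holding 100, 50, and 150 samples.
client_params = [np.array([0.2, -0.1]),
                 np.array([0.3, 0.0]),
                 np.array([0.1, -0.2])]
client_sizes = [100, 50, 150]
print(fedavg(client_params, client_sizes))  # weighted average of the updates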
To address scenarios requiring computation on encrypted data, homomorphic encryption (HE) and secure multiparty computation (SMPC) allow models to perform inference or training over encrypted inputs. In the case of HE, operations on ciphertexts correspond to operations on plaintexts, enabling encrypted inference: \[ \text{Enc}(f(x)) = f(\text{Enc}(x)) \]
This property supports privacy-preserving computation in untrusted environments, such as cloud inference over sensitive health or financial records. However, the computational cost of HE remains high, making it more suitable for fixed-function models and low-latency batch tasks. SMPC, by contrast, distributes the computation across multiple parties such that no single party learns the complete input or output. This is particularly useful in joint training across institutions with strict data-use policies, such as hospitals or banks.
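To give a flavor of how secret sharing hides individual inputs in SMPC, the toy sketch below uses additive secret sharing: each party holds a random-looking share, no single share reveals the secret, yet the shares can be summed to compute an aggregate. Real protocols add secure multiplication, communication rounds, and protections against malicious parties.

import secrets

P = 2**61 - 1  # modulus defining the field the shares live in

def share(secret, n_parties=3):
    # Split a secret into n additive shares that sum to it modulo P.
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two private inputs are shared; parties add their shares locally, and only
# the aggregate is ever reconstructed.
a_shares, b_shares = share(42), share(100)
sum_shares = [(a + b) % P for a, b in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 142, without revealing 42 or 100 individually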
Synthetic Data Generation
A more pragmatic and increasingly popular alternative involves the use of synthetic data generation. By training generative models on real datasets and sampling new instances from the learned distribution, organizations can create datasets that approximate the statistical properties of the original data without retaining identifiable details (Goncalves et al. 2020). While this approach reduces the risk of direct reidentification, it does not offer formal privacy guarantees unless combined with DP constraints during generation.
Together, these techniques reflect a shift from isolating data as the sole path to privacy toward embedding privacy-preserving mechanisms into the learning process itself. Each method offers distinct guarantees and trade-offs depending on the application context, threat model, and regulatory constraints. Effective system design often combines multiple approaches, such as applying differential privacy within a federated learning setup, or employing homomorphic encryption for critical inference stages, to build ML systems that are both useful and respectful of user privacy.
Comparative Properties
These privacy-preserving techniques differ not only in the guarantees they offer but also in their system-level implications. For practitioners, the choice of mechanism depends on factors such as computational constraints, deployment architecture, and regulatory requirements.
Table 15.6 summarizes the comparative properties of these methods, focusing on privacy strength, runtime overhead, maturity, and common use cases. Understanding these trade-offs is essential for designing privacy-aware machine learning systems that operate under real-world constraints.
| Technique | Privacy Guarantee | Computational Overhead | Deployment Maturity | Typical Use Case | Trade-offs |
|---|---|---|---|---|---|
| Differential Privacy | Formal (ε-DP) | Moderate to High | Production | Training with sensitive or regulated data | Reduced accuracy; careful tuning of ε/noise required to balance utility and protection |
| Federated Learning | Structural | Moderate | Production | Cross-device or cross-org collaborative learning | Gradient leakage risk; requires secure aggregation and orchestration infrastructure |
| Homomorphic Encryption | Strong (Encrypted) | High | Experimental | Inference in untrusted cloud environments | High latency and memory usage; suitable for limited-scope inference on fixed-function models |
| Secure MPC | Strong (Distributed) | Very High | Experimental | Joint training across mutually untrusted parties | Expensive communication; challenging to scale to many participants or deep models |
| Synthetic Data | Weak (if standalone) | Low to Moderate | Emerging | Data sharing, benchmarking without direct access to raw data | May leak sensitive patterns if training process is not differentially private or audited for fidelity |
15.7.2 Secure Model Design
Security begins at the design phase of a machine learning system. While downstream mechanisms such as access control and encryption protect models once deployed, many vulnerabilities can be mitigated earlier: through architectural choices, defensive training strategies, and mechanisms that embed resilience directly into the model's structure or behavior. By considering security as a design constraint, system developers can reduce the model's exposure to attacks, limit its ability to leak sensitive information, and provide verifiable ownership protection.
One important design strategy is to build robust-by-construction models that reduce the risk of exploitation at inference time. For instance, models with confidence calibration or abstention mechanisms can be trained to avoid making predictions when input uncertainty is high. These techniques can help prevent overconfident misclassifications in response to adversarial or out-of-distribution inputs. Models may also employ output smoothing, regularizing the output distribution to reduce sharp decision boundaries that are especially susceptible to adversarial perturbations.
Certain application contexts may also benefit from choosing simpler or compressed architectures. While not universally appropriate, limiting model capacity can reduce opportunities for memorization of sensitive training data and complicate efforts to reverse-engineer the model from output behavior. For embedded or on-device settings, smaller models are also easier to secure, as they typically require less memory and compute, lowering the likelihood of side-channel leakage or runtime manipulation.
Another design-stage consideration is the use of model watermarking, a technique for embedding verifiable ownership signatures directly into the model's parameters or output behavior (Adi et al. 2018). A watermark might be implemented, for example, as a hidden response pattern triggered by specific inputs, or as a parameter-space perturbation that does not affect accuracy but is statistically identifiable. These watermarks can be used to detect and prove misuse of stolen models in downstream deployments. Watermarking strategies must be carefully designed to remain robust to model compression, fine-tuning, and format conversion.
For example, in a keyword spotting system deployed on embedded hardware for voice activation (e.g., "Hey Alexa" or "OK Google"), a secure design might use a lightweight convolutional neural network with confidence calibration to avoid false activations on uncertain audio. The model might also include an abstention threshold, below which it produces no activation at all. To protect intellectual property, a designer could embed a watermark by training the model to respond with a unique label only when presented with a specific, unused audio trigger known only to the developer. These design choices not only improve robustness and accountability, but also support future verification in case of IP disputes or performance failures in the field.
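One common verification pattern for such trigger-based watermarks is sketched below. This is a hypothetical check: the model.predict interface, trigger inputs, and threshold are placeholders rather than a specific published scheme.

def verify_watermark(model, trigger_inputs, trigger_labels, threshold=0.9):
    # A watermarked model was trained to emit the developer's pre-assigned
    # labels on these secret triggers; an independently trained model should
    # match them only by chance.
    predictions = [model.predict(x) for x in trigger_inputs]
    matches = sum(int(p == y) for p, y in zip(predictions, trigger_labels))
    match_rate = matches / len(trigger_labels)
    return match_rate >= threshold, match_rate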
In high-risk applications, such as medical diagnosis, autonomous vehicles, or financial decision systems, designers may also prioritize interpretable model architectures, such as decision trees, rule-based classifiers, or sparsified networks, to enhance system auditability. These models are often easier to understand and explain, making it simpler to identify potential vulnerabilities or biases. Using interpretable models allows developers to provide clearer insights into how the system arrived at a particular decision, which is crucial for building trust with users and regulators.
Model design choices often reflect trade-offs between accuracy, robustness, transparency, and system complexity. However, when viewed from a systems perspective, early-stage design decisions frequently yield the highest leverage for long-term security. They shape what the model can learn, how it behaves under uncertainty, and what guarantees can be made about its provenance, interpretability, and resilience.
15.7.3 Secure Model Deployment
Protecting machine learning models from theft, abuse, and unauthorized manipulation requires security considerations throughout both the design and deployment phases. A model's vulnerability is not solely determined by its training procedure or architecture, but also by how it is serialized, packaged, deployed, and accessed during inference. As models are increasingly embedded into edge devices, served through public APIs, or integrated into multi-tenant platforms, robust security practices are essential to ensure the integrity, confidentiality, and availability of model behavior.
This section addresses security mechanisms across three key stages: model design, secure packaging and serialization, and deployment and access control.
From a design perspective, architectural choices can reduce a modelâs exposure to adversarial manipulation and unauthorized use. For example, models can incorporate confidence calibration or abstention mechanisms that allow them to reject uncertain or anomalous inputs rather than producing potentially misleading outputs. Designing models with simpler or compressed architectures can also reduce the risk of reverse engineering or information leakage through side-channel analysis. In some cases, model designers may embed imperceptible watermarks, which are unique signatures embedded in the parameters or behavior of the model, that can later be used to demonstrate ownership in cases of misappropriation (Uchida et al. 2017). These design-time protections are particularly important for commercially valuable models, where intellectual property rights are at stake.
Once training is complete, the model must be securely packaged for deployment. Storing models in plaintext formats, including unencrypted ONNX or PyTorch checkpoint files, can expose internal structures and parameters to attackers with access to the file system or memory. To mitigate this risk, models should be encrypted, obfuscated, or wrapped in secure containers. Decryption keys should be made available only at runtime and only within trusted environments. Additional mechanisms, such as quantization-aware encryption or integrity-checking wrappers, can prevent tampering and offline model theft.
Deployment environments must also enforce strong access control policies to ensure that only authorized users and services can interact with inference endpoints. Authentication protocols, including OAuth tokens, mutual TLS, or API keys, should be combined with role-based access control (RBAC) to restrict access according to user roles and operational context. For instance, OpenAI's hosted model APIs require users to include an OPENAI_API_KEY when submitting inference requests. This key authenticates the client and enables the backend to enforce usage policies, monitor for abuse, and log access patterns. A simplified example of secure usage is shown in Listing 15.1, where the API key is securely loaded from an environment variable before being used to authenticate requests.
import openai
import os

# Securely load the API key from an environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")

# Submit a prompt to the model
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": (
                "Summarize the principles of "
                "differential privacy."
            ),
        }
    ],
)

print(response.choices[0].message["content"])
In this example, the API key is retrieved from an environment variable, avoiding the security risk of hardcoding it into source code or exposing it to the client side. Such key-based access control mechanisms are simple to implement but require careful key management and monitoring to prevent misuse, unauthorized access, or model extraction.
Beyond endpoint access, the integrity of the deployment pipeline itself must also be protected. Continuous integration and deployment (CI/CD) workflows that automate model updates should enforce cryptographic signing of artifacts, dependency validation, and infrastructure hardening. Without these controls, adversaries could inject malicious models or alter existing ones during the build and deployment process. Verifying model signatures and maintaining audit trails helps ensure that only authorized models are deployed into production.
When applied together, these practices protect against a range of threats, from model theft and unauthorized inference access to tampering during deployment and output manipulation at runtime. No single mechanism suffices in isolation, but a layered strategy, beginning at the design phase and extending through deployment, provides a strong foundation for securing machine learning systems under real-world conditions.
15.7.4 System-level Monitoring
Even with robust design and deployment safeguards, machine learning systems remain vulnerable to runtime threats. Attackers may craft inputs that bypass validation, exploit model behavior, or target system-level infrastructure. As ML systems enter production, particularly in cloud, edge, or embedded deployments, defensive strategies must extend beyond static protection to include real-time monitoring, threat detection, and incident response. This section outlines operational defenses that maintain system trust under adversarial conditions.
Runtime monitoring encompasses a range of techniques for observing system behavior, detecting anomalies, and triggering mitigation. These techniques can be grouped into three categories: input validation, output monitoring, and system integrity checks.
Input Validation
Input validation is the first line of defense at runtime. It ensures that incoming data conforms to expected formats, statistical properties, or semantic constraints before it is passed to a machine learning model. Without these safeguards, models are vulnerable to adversarial inputs, which are crafted examples designed to trigger incorrect predictions, or to malformed inputs that cause unexpected behavior in preprocessing or inference.
Machine learning models, unlike traditional rule-based systems, often do not fail safely. Small, carefully chosen changes to input data can cause models to make high-confidence but incorrect predictions. Input validation helps detect and reject such inputs early in the pipeline (I. J. Goodfellow, Shlens, and Szegedy 2014).
Validation techniques range from low-level checks (e.g., input size, type, and value ranges) to semantic filters (e.g., verifying whether an image contains a recognizable object or whether a voice recording includes speech). For example, a facial recognition system might validate that the uploaded image is within a certain resolution range (e.g., 224×224 to 1024×1024 pixels), contains RGB channels, and passes a lightweight face detection filter. This prevents inputs like blank images, text screenshots, or synthetic adversarial patterns from reaching the model. Similarly, a voice assistant might require that incoming audio files be between 1 and 5 seconds long, have a valid sampling rate (e.g., 16kHz), and contain detectable human speech using a speech activity detector (SAD). This ensures that empty recordings, music clips, or noise bursts are filtered before model inference.
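Checks like these translate into simple pre-inference guards. The sketch below validates an image along the lines described above; the resolution window and blank-image heuristic are illustrative, and a comment marks where a lightweight face detector would run.

import numpy as np

def validate_image(image: np.ndarray):
    # Structural checks: three color channels and an accepted resolution range.
    if image.ndim != 3 or image.shape[2] != 3:
        return False, "expected an RGB image with 3 channels"
    h, w, _ = image.shape
    if not (224 <= h <= 1024 and 224 <= w <= 1024):
        return False, "resolution outside the accepted 224-1024 pixel range"
    # Cheap semantic check: reject blank or near-constant images.
    if image.std() < 1.0:
        return False, "image appears blank or synthetic"
    # A lightweight face detection filter would run here before the input is
    # forwarded to the recognition model.
    return True, "ok"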
In generative systems such as DALL·E, Stable Diffusion, or Sora, input validation often involves prompt filtering. This includes scanning the user's text prompt for banned terms, brand names, profanity, or misleading medical claims. For example, a user prompt like "Generate an image of a medication bottle labeled with Pfizer's logo" might be rejected or rewritten due to trademark concerns. Filters may operate using keyword lists, regular expressions, or lightweight classifiers that assess prompt intent. These filters prevent the generative model from being used to produce harmful, illegal, or misleading content, even before sampling begins.
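A minimal prompt filter of this kind can combine a keyword blocklist with regular expressions, as sketched below. The terms and patterns are illustrative placeholders, not an actual production policy.

import re

BLOCKED_TERMS = {"pfizer", "acme pharma"}  # illustrative brand/term blocklist
MEDICAL_CLAIM_PATTERN = re.compile(r"\b(cures?|treats?)\b", re.IGNORECASE)

def filter_prompt(prompt):
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return False, "blocked term"
    if MEDICAL_CLAIM_PATTERN.search(prompt):
        return False, "unsupported medical claim"
    return True, "ok"

print(filter_prompt(
    "Generate an image of a medication bottle labeled with Pfizer's logo"
))  # (False, 'blocked term')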
In some applications, distributional checks are also used. These assess whether the incoming data statistically resembles what the model saw during training. For instance, a computer vision pipeline might compare the color histogram of the input image to a baseline distribution, flagging outliers for manual review or rejection.
These validations can be lightweight (heuristics or threshold rules) or learned (small models trained to detect distribution shift or adversarial artifacts). In either case, input validation serves as a critical pre-inference firewall, reducing exposure to adversarial behavior, improving system stability, and increasing trust in downstream model decisions.
Output Monitoring
Even when inputs pass validation, adversarial or unexpected behavior may still emerge at the model's output. Output monitoring helps detect such anomalies by analyzing model predictions in real time. These mechanisms observe how the model behaves across inputs, by tracking its confidence, prediction entropy, class distribution, or response patterns, to flag deviations from expected behavior.
A key target for monitoring is prediction confidence. For example, if a classification model begins assigning high confidence to low-frequency or previously rare classes, this may indicate the presence of adversarial inputs or a shift in the underlying data distribution. Monitoring the entropy of the output distribution can similarly reveal when the model is overly certain in ambiguous contexts, an early signal of possible manipulation.
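The sketch below shows one way such a monitor might flag overconfident outputs by combining the top-class probability with the entropy of the predicted distribution. The thresholds are illustrative and would be tuned against a baseline.

import numpy as np

def entropy(probs):
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-np.sum(probs * np.log(probs)))

def monitor_prediction(probs, max_conf=0.99, min_entropy=0.05):
    # Flag outputs that are suspiciously certain: a very high top-class
    # probability combined with very low distribution entropy.
    if float(np.max(probs)) >= max_conf and entropy(probs) <= min_entropy:
        return "alert: overconfident prediction, route for review"
    return "ok"

print(monitor_prediction([0.999, 0.0005, 0.0005]))  # triggers an alert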
In content moderation systems, a model that normally outputs neutral or "safe" labels may suddenly begin producing high-confidence "safe" labels for inputs containing offensive or restricted content. Output monitoring can detect this mismatch by comparing predictions against auxiliary signals or known-safe reference sets. When deviations are detected, the system may trigger a fallback policy, such as escalating the content for human review or switching to a conservative baseline model.
Time-series models also benefit from output monitoring. For instance, an anomaly detection model used in fraud detection might track predicted fraud scores for sequences of financial transactions. A sudden drop in fraud scores, especially during periods of high transaction volume, may indicate model tampering, label leakage, or evasion attempts. Monitoring the temporal evolution of predictions provides a broader perspective than static, pointwise classification.
Generative models, such as text-to-image systems, introduce unique output monitoring challenges. These models can produce high-fidelity imagery that may inadvertently violate content safety policies, platform guidelines, or user expectations. To mitigate these risks, post-generation classifiers are commonly employed to assess generated content for objectionable characteristics such as violence, nudity, or brand misuse. These classifiers operate downstream of the generative model and can suppress, blur, or reject outputs based on predefined thresholds. Some systems also inspect internal representations (e.g., attention maps or latent embeddings) to anticipate potential misuse before content is rendered.
However, prompt filtering alone is insufficient for safety. Research has shown that text-to-image systems can be manipulated through implicitly adversarial prompts, which are queries that appear benign but lead to policy-violating outputs. The Adversarial Nibbler project introduces an open red teaming methodology that identifies such prompts and demonstrates how models like Stable Diffusion can produce unintended content despite the absence of explicit trigger phrases (Quaye et al. 2024). These failure cases often bypass prompt filters because their risk arises from model behavior during generation, not from syntactic or lexical cues.
As shown in Figure 15.12, even prompts that appear innocuous can trigger unsafe generations. Such examples highlight the limitations of pre-generation safety checks and reinforce the necessity of output-based monitoring as a second line of defense. This two-stage pipeline, consisting of prompt filtering followed by post-hoc content analysis, is essential for ensuring the safe deployment of generative models in open-ended or user-facing environments.
In the domain of language generation, output monitoring plays a different but equally important role. Here, the goal is often to detect toxicity, hallucinated claims, or off-distribution responses. For example, a customer support chatbot may be monitored for keyword presence, tonal alignment, or semantic coherence. If a response contains profanity, unsupported assertions, or syntactically malformed text, the system may trigger a rephrasing, initiate a fallback to scripted templates, or halt the response altogether.
Effective output monitoring combines rule-based heuristics with learned detectors trained on historical outputs. These detectors are deployed to flag deviations in real time and feed alerts into incident response pipelines. In contrast to model-centric defenses like adversarial training, which aim to improve model robustness, output monitoring emphasizes containment and remediation. Its role is not to prevent exploitation but to detect its symptoms and initiate appropriate countermeasures (Savas et al. 2022). In safety-critical or policy-sensitive applications, such mechanisms form a critical layer of operational resilience.
These principles have been implemented in recent output filtering frameworks. For example, LLM Guard combines transformer-based classifiers with safety dimensions such as toxicity, misinformation, and illegal content to assess and reject prompts or completions in instruction-tuned LLMs (Inan et al. 2023). Similarly, ShieldGemma, developed as part of Google's open Gemma model release, applies configurable scoring functions to detect and filter undesired outputs during inference. Both systems exemplify how safety classifiers and output monitors are being integrated into the runtime stack to support scalable, policy-aligned deployment of generative language models.
Integrity Checks
While input and output monitoring focus on model behavior, system integrity checks ensure that the underlying model files, execution environment, and serving infrastructure remain untampered throughout deployment. These checks detect unauthorized modifications, verify that the model running in production is authentic, and alert operators to suspicious system-level activity.
One of the most common integrity mechanisms is cryptographic model verification. Before a model is loaded into memory, the system can compute a cryptographic hash (e.g., SHA-256) of the model file and compare it against a known-good signature. This process ensures that the model has not been altered during transit or storage. For example, a PyTorch .pt or TensorFlow .pb model artifact stored in object storage (e.g., S3) might be verified using a signed hash from a deployment registry before loading into a production container. If the verification fails, the system can block inference, alert an operator, or revert to a trusted model version.
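The sketch below illustrates this pre-load check, assuming the deployment registry publishes a known-good SHA-256 digest for each model artifact; the file paths and the error-handling policy are illustrative.

```python
# A minimal sketch of pre-load model integrity verification.
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large model artifacts do not exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model_artifact(path: str, expected_digest: str) -> None:
    """Block loading if the artifact does not match the registry's signed digest."""
    actual = sha256_of_file(path)
    if actual != expected_digest:
        # In production this would also alert operators and trigger rollback.
        raise RuntimeError(f"model integrity check failed for {path}: "
                           f"expected {expected_digest}, got {actual}")
```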
Access control and audit logging complement cryptographic checks. ML systems should restrict access to model files using role-based permissions and monitor file access patterns. For instance, repeated attempts to read model checkpoints from a non-standard path, or inference requests from unauthorized IP ranges, may indicate tampering, privilege escalation, or insider threats.
In cloud environments, container- or VM-based isolation helps enforce process and memory boundaries, but these protections can erode over time due to misconfiguration or supply chain vulnerabilities. To reinforce runtime assurance, systems may deploy periodic attestation checks that verify not just the model artifact but also the software environment, installed dependencies, and hardware identity. These techniques are often backed by hardware trust anchors (e.g., TPMs or TEEs) discussed later in this chapter.
For example, in a regulated healthcare ML deployment, integrity checks might include: verifying the model hash against a signed manifest, validating that the runtime environment uses only approved Python packages, and checking that inference occurs inside a signed and attested virtual machine. These checks ensure compliance, limit the risk of silent failures, and create a forensic trail in case of audit or breach.
Some systems also implement runtime memory verification, such as scanning for unexpected model parameter changes or checking that memory-mapped model weights remain unaltered during execution. While more common in high-assurance systems, such checks are becoming more feasible with the adoption of secure enclaves and trusted runtimes.
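As a simplified illustration, the sketch below fingerprints a PyTorch model's parameters and buffers at load time and re-checks them during serving. The check interval, the response to a mismatch, and the assumption of numpy-convertible weight dtypes are deployment-specific assumptions rather than fixed requirements.

```python
# A minimal sketch of runtime weight verification for a PyTorch model.
import hashlib
import torch

def weights_fingerprint(model: torch.nn.Module) -> str:
    """Hash all parameters and buffers in a deterministic order."""
    digest = hashlib.sha256()
    for name, tensor in sorted(model.state_dict().items()):
        digest.update(name.encode())
        digest.update(tensor.detach().cpu().numpy().tobytes())
    return digest.hexdigest()

# At load time, record the fingerprint of the verified model.
model = torch.nn.Linear(4, 2)              # stand-in for the production model
baseline = weights_fingerprint(model)

# During serving, re-check periodically and alert on any drift.
def weights_unaltered(model: torch.nn.Module, baseline: str) -> bool:
    return weights_fingerprint(model) == baseline
```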
Taken together, system integrity checks play a critical role in protecting machine learning systems from low-level attacks that bypass the model interface. When coupled with input/output monitoring, they provide layered assurance that both the model and its execution environment remain trustworthy under adversarial conditions.
Response and Rollback
When a security breach, anomaly, or performance degradation is detected in a deployed machine learning system, rapid and structured incident response is critical to minimizing impact. The goal is not only to contain the issue but to restore system integrity and ensure that future deployments benefit from the insights gained. Unlike traditional software systems, ML responses may require handling model state, data drift, or inference behavior, making recovery more complex.
The first step is to define incident detection thresholds that trigger escalation. These thresholds may come from input validation (e.g., invalid input rates), output monitoring (e.g., drop in prediction confidence), or system integrity checks (e.g., failed model signature verification). When a threshold is crossed, the system should initiate an automated or semi-automated response protocol.
One common strategy is model rollback, where the system reverts to a previously verified version of the model. For instance, if a newly deployed fraud detection model begins misclassifying transactions, the system may fall back to the last known-good checkpoint, restoring service while the affected version is quarantined. Rollback mechanisms require version-controlled model storage, typically supported by MLOps platforms such as MLflow, TFX, or SageMaker.
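A minimal sketch of this rollback logic is shown below. It assumes a registry that records versions in deployment order and flags which versions passed verification; the registry schema and the `load_model` callable are illustrative stand-ins for the bookkeeping that MLOps platforms provide.

```python
# A minimal sketch of rollback to the last known-good model version.
from typing import Any, Callable

def rollback(registry: list[dict], load_model: Callable[[str], Any]) -> Any:
    """registry entries: {'version': str, 'uri': str, 'verified': bool}, newest last."""
    # Quarantine the newest (faulty) version and walk back to a verified one.
    for entry in reversed(registry[:-1]):
        if entry["verified"]:
            print(f"rolling back to version {entry['version']}")
            return load_model(entry["uri"])
    raise RuntimeError("no verified model version available for rollback")
```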
In high-availability environments, model isolation may be used to contain failures. The affected model instance can be removed from load balancers or shadowed in a canary deployment setup. This allows continued service with unaffected replicas while maintaining forensic access to the compromised model for analysis.
Traffic throttling is another immediate response tool. If an adversarial actor is probing a public inference API at high volume, the system can rate-limit or temporarily block offending IP ranges while continuing to serve trusted clients. This containment technique helps prevent abuse without requiring full system shutdown.
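The sketch below shows one way to implement this containment step, using a token-bucket limiter keyed by client IP; the per-second rate and burst size are illustrative values, and a real deployment would typically enforce this at the API gateway.

```python
# A minimal sketch of per-client rate limiting for an inference API.
import time
from collections import defaultdict

class TokenBucketLimiter:
    def __init__(self, rate_per_sec: float = 5.0, burst: int = 20):
        self.rate, self.burst = rate_per_sec, burst
        self.tokens = defaultdict(lambda: burst)     # available tokens per client
        self.last = defaultdict(time.monotonic)      # last request time per client

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(self.burst,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False  # throttle: reject or delay this request
```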
Once immediate containment is in place, investigation and recovery can begin. This may include forensic analysis of input logs, parameter deltas between model versions, or memory snapshots from inference containers. In regulated environments, organizations may also need to notify users or auditors, particularly if personal or safety-critical data was affected.
Recovery typically involves retraining or patching the model. This must occur through a secure update process, using signed artifacts, trusted build pipelines, and validated data. To prevent recurrence, the incident should feed back into model evaluation pipelines: updating tests, refining monitoring thresholds, or hardening input defenses. For example, if a prompt injection attack bypassed a content filter in a generative model, retraining might include adversarially crafted prompts, and the prompt validation logic would be updated to reflect newly discovered patterns.
Finally, organizations should establish post-incident review practices. This includes documenting root causes, identifying gaps in detection or response, and updating policies and playbooks. Incident reviews help translate operational failures into actionable improvements across the design-deploy-monitor lifecycle.
15.7.5 Hardware-based Security
Machine learning systems are increasingly deployed in environments where hardware-based security features can provide additional layers of protection. These features can help ensure the integrity of model execution, protect sensitive data, and prevent unauthorized access to system resources. This section discusses several key hardware-based security mechanisms that can enhance the security posture of machine learning systems.
Trusted Execution Environments
A Trusted Execution Environment (TEE) is a hardware-isolated region within a processor designed to protect sensitive computations and data from potentially compromised software. TEEs enforce confidentiality, integrity, and runtime isolation, ensuring that even if the host operating system or application layer is attacked, sensitive operations within the TEE remain secure.
In the context of machine learning, TEEs are increasingly important for preserving the confidentiality of models, securing sensitive user data during inference, and ensuring that model outputs remain trustworthy. For example, a TEE can protect model parameters from being extracted by malicious software running on the same device, or ensure that computations involving biometric inputs, including facial data or fingerprint data, are performed securely. This capability is particularly critical in applications where model integrity, user privacy, or regulatory compliance are non-negotiable.
One widely deployed example is Apple's Secure Enclave, which provides isolated execution and secure key storage for iOS devices. By separating cryptographic operations and biometric data from the main processor, the Secure Enclave ensures that user credentials and Face ID features remain protected, even in the event of a broader system compromise.
Trusted Execution Environments are essential across a range of industries with high security requirements. In telecommunications, TEEs are used to safeguard encryption keys and secure critical 5G control-plane operations. In finance, they enable secure mobile payments and protect PIN-based authentication workflows. In healthcare, TEEs help enforce patient data confidentiality during edge-based ML inference on wearable or diagnostic devices. In the automotive industry, they are deployed in advanced driver-assistance systems (ADAS) to ensure that safety-critical perception and decision-making modules operate on verified software.
In machine learning systems, TEEs can provide several important protections. They secure the execution of model inference or training, shielding intermediate computations and final predictions from system-level observation. They protect the confidentiality of sensitive inputs, including biometric or clinical signals, used in personal identification or risk scoring tasks. TEEs also serve to prevent reverse engineering of deployed models by restricting access to weights and architecture internals. When models are updated, TEEs ensure the authenticity of new parameters and block unauthorized tampering. Furthermore, in distributed ML settings, TEEs can protect data exchanged between components by enabling encrypted and attested communication channels.
The core security properties of a TEE are achieved through four mechanisms: isolated execution, secure storage, integrity protection, and in-TEE data encryption. Code that runs inside the TEE is executed in a separate processor mode, inaccessible to the normal-world operating system. Sensitive assets such as cryptographic keys or authentication tokens are stored in memory that only the TEE can access. Code and data can be verified for integrity before execution using hardware-anchored hashes or signatures. Finally, data processed inside the TEE can be encrypted, ensuring that even intermediate results are inaccessible without appropriate keys, which are also managed internally by the TEE.
Several commercial platforms provide TEE functionality tailored for different deployment contexts. ARM TrustZone offers secure and normal world execution on ARM-based systems and is widely used in mobile and IoT applications. Intel SGX implements enclave-based security for cloud and desktop systems, enabling secure computation even on untrusted infrastructure. Qualcomm's Secure Execution Environment supports secure mobile transactions and user authentication. Apple's Secure Enclave remains a canonical example of a hardware-isolated security coprocessor for consumer devices.
Figure 15.13 illustrates a secure enclave integrated into a system-on-chip (SoC) architecture. The enclave includes a dedicated processor, an AES engine, a true random number generator (TRNG), a public key accelerator (PKA), and a secure I²C interface to nonvolatile storage. These components operate in isolation from the main application processor and memory subsystem. A memory protection engine enforces access control, while cryptographic operations such as NAND flash encryption are handled internally using enclave-managed keys. By physically separating secure execution and key management from the main system, this architecture limits the impact of system-level compromises and forms the foundation of hardware-enforced trust.
This architecture underpins the secure deployment of machine learning applications on consumer devices. For example, Apple's Face ID system uses a secure enclave to perform facial recognition entirely within a hardware-isolated environment. The face embedding model is executed inside the enclave, and biometric templates are stored in secure nonvolatile memory accessible only via the enclave's I²C interface. During authentication, input data from the infrared camera is processed locally, and no facial features or predictions ever leave the secure region. Even if the application processor or operating system is compromised, the enclave prevents access to sensitive model inputs, parameters, and outputs, ensuring that biometric identity remains protected end to end.
Despite their strengths, Trusted Execution Environments come with notable trade-offs. Implementing a TEE increases both direct hardware costs and indirect costs associated with developing and maintaining secure software. Integrating TEEs into existing systems may require architectural redesigns, especially for legacy infrastructure. Developers must adhere to strict protocols for isolation, attestation, and secure update management, which can extend development cycles and complicate testing workflows. TEEs can also introduce performance overhead, particularly when cryptographic operations are involved, or when context switching between trusted and untrusted modes is frequent.
Energy efficiency is another consideration, particularly in battery-constrained devices. TEEs typically consume additional power due to secure memory accesses, cryptographic computation, and hardware protection logic. In resource-limited embedded systems, these costs may limit their use. In terms of scalability and flexibility, the secure boundaries enforced by TEEs may complicate distributed training or federated inference workloads, where secure coordination between enclaves is required.
Market demand also varies. In some consumer applications, perceived threat levels may be too low to justify the integration of TEEs. Moreover, systems with TEEs may be subject to formal security certifications, such as Common Criteria or evaluation under ENISA, which can introduce additional time and expense. For this reason, TEEs are typically adopted only when the expected threat model, including adversarial users, cloud tenants, and malicious insiders, justifies the investment.
Nonetheless, TEEs remain a powerful hardware primitive in the machine learning security landscape. When paired with software- and system-level defenses, they provide a trusted foundation for executing ML models securely, privately, and verifiably, especially in scenarios where adversarial compromise of the host environment is a serious concern.
Secure Boot
Secure Boot is a mechanism that ensures a device only boots software components that are cryptographically verified and explicitly authorized by the manufacturer. At startup, each stage of the boot process, comprising the bootloader, kernel, and base operating system, is checked against a known-good digital signature. If any signature fails verification, the boot sequence is halted, preventing unauthorized or malicious code from executing. This chain-of-trust model establishes system integrity from the very first instruction executed.
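The following sketch models this chain-of-trust check in simplified form: each boot stage is verified against a vendor public key before the next stage runs, and any failed signature halts the sequence. The stage list, the choice of Ed25519, and the `verify_chain` helper are illustrative assumptions, not a real bootloader interface.

```python
# A minimal sketch of a secure-boot style chain of trust, assuming each stage
# ships with an Ed25519 signature produced by the vendor's signing key.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_chain(root_pubkey_bytes: bytes,
                 stages: list[tuple[str, bytes, bytes]]) -> bool:
    """stages: (name, image_bytes, signature), verified in boot order."""
    pubkey = Ed25519PublicKey.from_public_bytes(root_pubkey_bytes)
    for name, image, signature in stages:
        try:
            pubkey.verify(signature, image)  # raises on any mismatch
        except InvalidSignature:
            print(f"halting boot: signature check failed at stage '{name}'")
            return False
    return True  # every stage verified; hand off to the ML runtime
```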
In ML systems, especially those deployed on embedded or edge hardware, Secure Boot plays an important role. A compromised boot process may result in malicious software loading before the ML runtime begins, enabling attackers to intercept model weights, tamper with training data, or reroute inference results. Such breaches can lead to incorrect or manipulated predictions, unauthorized data access, or device repurposing for botnets or crypto-mining.
For machine learning systems, Secure Boot offers several guarantees. First, it protects model-related data, such as training data, inference inputs, and outputs, during the boot sequence, preventing pre-runtime tampering. Second, it ensures that only authenticated model binaries and supporting software are loaded, which helps guard against deployment-time model substitution. Third, Secure Boot enables secure model updates by verifying that firmware or model changes are signed and have not been altered in transit.
Secure Boot frequently works in tandem with hardware-based Trusted Execution Environments (TEEs) to create a fully trusted execution stack. As shown in Figure 15.14, this layered boot process verifies firmware, operating system components, and TEE integrity before permitting execution of cryptographic operations or ML workloads. In embedded systems, this architecture provides resilience even under severe adversarial conditions or physical device compromise.
A well-known real-world implementation of Secure Boot appears in Apple's Face ID system, which leverages advanced machine learning for facial recognition. For Face ID to operate securely, the entire device stack, from the initial power-on to the execution of the model, must be verifiably trusted.
Upon device startup, Secure Boot initiates within Apple's Secure Enclave, a dedicated security coprocessor that handles biometric data. The firmware loaded onto the Secure Enclave is digitally signed by Apple, and any unauthorized modification causes the boot process to fail. Once verified, the Secure Enclave performs continuous checks in coordination with the central processor to maintain a trusted boot chain. Each system component, ranging from the iOS kernel to the application-level code, is verified using cryptographic signatures.
After completing the secure boot sequence, the Secure Enclave activates the ML-based Face ID system. The facial recognition system projects over 30,000 infrared points to map a user's face, generating a depth image and computing a mathematical representation that is compared against a securely stored profile. These facial data artifacts are never written to disk, transmitted off-device, or shared externally. All processing occurs within the enclave to protect against eavesdropping or exfiltration, even in the presence of a compromised kernel.
To support continued integrity, Secure Boot also governs software updates. Only firmware or model updates signed by Apple are accepted, ensuring that even over-the-air patches do not introduce risk. This process maintains a robust chain of trust over time, enabling the secure evolution of the ML system while preserving user privacy and device security.
While Secure Boot provides strong protection, its adoption presents technical and operational challenges. Managing the cryptographic keys used to sign and verify system components is complex, especially at scale. Enterprises must securely provision, rotate, and revoke keys, ensuring that no trusted root is compromised. Any such breach would undermine the entire security chain.
Performance is also a consideration. Verifying signatures during the boot process introduces latency, typically on the order of tens to hundreds of milliseconds per component. Although acceptable in many applications, these delays may be problematic for real-time or power-constrained systems. Developers must also ensure that all components, including bootloaders, firmware, kernels, drivers, and even ML models, are correctly signed. Integrating third-party software into a Secure Boot pipeline introduces additional complexity.
Some systems limit user control in favor of vendor-locked security models, restricting upgradability or customization. In response, open-source bootloaders like u-boot and coreboot have emerged, offering Secure Boot features while supporting extensibility and transparency. To further scale trusted device deployments, emerging industry standards such as the Device Identifier Composition Engine (DICE) and IEEE 802.1AR IDevID provide mechanisms for secure device identity, key provisioning, and cross-vendor trust assurance.
Secure Boot, when implemented carefully and complemented by trusted hardware and secure software update processes, forms the backbone of system integrity for embedded and distributed ML. It provides the assurance that the machine learning model running in production is not only the correct version, but is also executing in a known-good environment, anchored to hardware-level trust.
Hardware Security Modules
A Hardware Security Module (HSM) is a tamper-resistant physical device designed to perform cryptographic operations and securely manage digital keys. HSMs are widely used across security-critical industries such as finance, defense, and cloud infrastructure, and they are increasingly relevant for securing the machine learning pipeline, particularly in deployments where key confidentiality, model integrity, and regulatory compliance are essential.
HSMs provide an isolated, hardened environment for performing sensitive operations such as key generation, digital signing, encryption, and decryption. Unlike general-purpose processors, they are engineered to withstand physical tampering and side-channel attacks, and they typically include protected storage, cryptographic accelerators, and internal audit logging. HSMs may be implemented as standalone appliances, plug-in modules, or integrated chips embedded within broader systems.
In machine learning systems, HSMs enhance security across several dimensions. They are commonly used to protect encryption keys associated with sensitive data that may be processed during training or inference. These keys might encrypt data at rest in model checkpoints or enable secure transmission of inference requests across networked environments. By ensuring that the keys are generated, stored, and used exclusively within the HSM, the system minimizes the risk of key leakage, unauthorized reuse, or tampering.
HSMs also play a role in maintaining the integrity of machine learning models. In many production pipelines, models must be signed before deployment to ensure that only verified versions are accepted into runtime environments. The signing keys used to authenticate models can be stored and managed within the HSM, providing cryptographic assurance that the deployed artifact is authentic and untampered. Similarly, secure firmware updates and configuration changes, regardless of whether they pertain to models, hyperparameters, or supporting infrastructure, can be validated using signatures produced by the HSM.
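The sketch below outlines this signing and verification flow. The `HsmSigner` class is a hypothetical stand-in for a vendor SDK or PKCS#11 wrapper (the signing key never leaves the module; only digests and signatures cross its boundary), while verification uses the module's exported public key. Ed25519 is assumed purely for illustration.

```python
# A minimal sketch of HSM-backed model signing and deploy-time verification.
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

class HsmSigner:
    """Hypothetical HSM session; sign() is serviced inside the hardware module."""
    def sign(self, digest: bytes) -> bytes:
        raise NotImplementedError("backed by the HSM vendor API in practice")

def sign_model(hsm: HsmSigner, artifact: bytes) -> bytes:
    # Only the digest is sent to the HSM; the private key stays inside.
    return hsm.sign(hashlib.sha256(artifact).digest())

def verify_model(pubkey_bytes: bytes, artifact: bytes, signature: bytes) -> bool:
    digest = hashlib.sha256(artifact).digest()
    try:
        Ed25519PublicKey.from_public_bytes(pubkey_bytes).verify(signature, digest)
        return True
    except InvalidSignature:
        return False  # reject the artifact and alert operators
```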
In addition to protecting inference workloads, HSMs can be used to secure model training. During training, data may originate from distributed and potentially untrusted sources. HSM-backed protocols can help ensure that training pipelines perform encryption, integrity checks, and access control enforcement securely and in compliance with organizational or legal requirements. In regulated industries such as healthcare and finance, such protections are often mandatory.
Despite these benefits, incorporating HSMs into embedded or resource-constrained ML systems introduces several trade-offs. First, HSMs are specialized hardware components and often come at a premium. Their cost may be justified in data center settings or safety-critical applications but can be prohibitive for low-margin embedded products or wearables. Physical space is also a concern. Embedded systems often operate under strict size, weight, and form factor constraints, and integrating an HSM may require redesigning circuit layouts or sacrificing other functionality.
From a performance standpoint, HSMs introduce latency, particularly for operations like key exchange, signature verification, or on-the-fly decryption. In real-time inference systems, including autonomous vehicles, industrial robotics, and live translation devices, these delays can affect responsiveness. While HSMs are typically optimized for cryptographic throughput, they are not general-purpose processors, and offloading secure operations must be carefully coordinated.
Power consumption is another concern. The continuous secure handling of keys, signing of transactions, and cryptographic validations can consume more power than basic embedded components, impacting battery life in mobile or remote deployments.
Integration complexity also grows when HSMs are introduced into existing ML pipelines. Interfacing between the HSM and the host processor requires dedicated APIs and often specialized software development. Firmware and model updates must be routed through secure, signed channels, and update orchestration must account for device-specific key provisioning. These requirements increase the operational burden, especially in large deployments.
Scalability presents its own set of challenges. Managing a distributed fleet of HSM-equipped devices requires secure provisioning of individual keys, secure identity binding, and coordinated trust management. In large ML deployments, including fleets of smart sensors or edge inference nodes, ensuring uniform security posture across all devices is nontrivial.
Finally, the use of HSMs often requires organizations to engage in certification and compliance processes, particularly when handling regulated data. Meeting standards such as FIPS 140-2 or Common Criteria adds time and cost to development. Access to the HSM is typically restricted to a small set of authorized personnel, which can complicate development workflows and slow iteration cycles.
Despite these operational complexities, HSMs remain a valuable option for machine learning systems that require high assurance of cryptographic integrity and access control. When paired with TEEs, secure boot, and software-based defenses, HSMs contribute to a multilayered security model that spans hardware, system software, and ML runtime.
Physical Unclonable Functions
Physical Unclonable Functions (PUFs) provide a hardware-intrinsic mechanism for cryptographic key generation and device authentication by leveraging physical randomness in semiconductor fabrication (Gassend et al. 2002). Unlike traditional keys stored in memory, a PUF generates secret values based on microscopic variations in a chip's physical properties. These variations are inherent to manufacturing processes and difficult to clone or predict, even by the manufacturer.
These variations arise from uncontrollable physical factors such as doping concentration, line edge roughness, and dielectric thickness. As a result, even chips fabricated with the same design masks exhibit small but measurable differences in timing, power consumption, or voltage behavior. PUF circuits amplify these variations to produce a device-unique digital output. When a specific input challenge is applied to a PUF, it generates a corresponding response based on the chip's physical fingerprint. Because these characteristics are effectively impossible to replicate, the same challenge will yield different responses across devices.
This challenge-response mechanism allows PUFs to serve several cryptographic purposes. They can be used to derive device-specific keys that never need to be stored externally, reducing the attack surface for key exfiltration. The same mechanism also supports secure authentication and attestation, where devices must prove their identity to trusted servers or hardware gateways. These properties make PUFs a natural fit for machine learning systems deployed in embedded and distributed environments.
In ML applications, PUFs offer unique advantages for securing resource-constrained systems. For example, consider a smart camera drone that uses onboard computer vision to track objects. A PUF embedded in the drone's processor can generate a device-specific key to encrypt the model during boot. Even if the model were extracted, it would be unusable on another device lacking the same PUF response. That same PUF-derived key could also be used to watermark the model parameters, creating a cryptographically verifiable link between a deployed model and its origin hardware. If the model were leaked or pirated, the embedded watermark could help prove the source of the compromise.
PUFs also support authentication in distributed ML pipelines. If the drone offloads computation to a cloud server, the PUF can help verify that the drone has not been cloned or tampered with. The cloud backend can issue a challenge, verify the correct response from the device, and permit access only if the PUF proves device authenticity. These protections enhance trust not only in the model and data, but in the execution environment itself.
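The sketch below simulates this enrollment-and-verification flow. A randomly drawn per-device secret stands in for the silicon fingerprint, and an HMAC stands in for the physical challenge-response mapping; a real PUF derives its responses from device physics and never stores the secret at all.

```python
# A minimal sketch of PUF-style challenge-response authentication (simulated).
import hashlib
import hmac
import os
import secrets

def puf_response(device_fingerprint: bytes, challenge: bytes) -> bytes:
    # Stand-in for the physical mapping: deterministic per device,
    # unpredictable without the device's fingerprint.
    return hmac.new(device_fingerprint, challenge, hashlib.sha256).digest()

# Enrollment: the verifier records challenge-response pairs at provisioning time.
device_fp = os.urandom(32)                  # simulated silicon fingerprint
enrolled = {}
for _ in range(4):
    c = os.urandom(16)
    enrolled[c] = puf_response(device_fp, c)

# Authentication: the verifier replays a stored challenge and checks the reply.
challenge = next(iter(enrolled))
reply = puf_response(device_fp, challenge)  # computed on-device, never stored
authenticated = secrets.compare_digest(reply, enrolled[challenge])
```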
The internal operation of a PUF is illustrated in Figure 15.15. At a high level, a PUF accepts a challenge input and produces a unique response determined by the physical microstructure of the chip (Gao, Al-Sarawi, and Abbott 2020). Variants include optical PUFs, in which the challenge consists of a light pattern and the response is a speckle image, and electronic PUFs such as Arbiter PUFs (APUFs), where timing differences between circuit paths produce a binary output. Another common implementation is the SRAM PUF, which exploits the power-up state of uninitialized SRAM cells: due to threshold voltage mismatch, each cell tends to settle into a preferred value when power is first applied. These response patterns form a stable, reproducible hardware fingerprint.
Despite their promise, PUFs present several challenges in system design. Their outputs can be sensitive to environmental variation, such as changes in temperature or voltage, which can introduce instability or bit errors in the response. To ensure reliability, PUF systems must often incorporate error correction codes or helper data schemes. Managing large sets of challenge-response pairs also raises questions about storage, consistency, and revocation. Additionally, the unique statistical structure of PUF outputs may make them vulnerable to machine learning-based modeling attacks if not carefully shielded from external observation.
From a manufacturing perspective, incorporating PUF technology can increase device cost or require additional layout complexity. While PUFs eliminate the need for external key storage, thereby reducing long-term security risk and provisioning cost, they may require calibration and testing during fabrication to ensure consistent performance across environmental conditions and device aging.
Nevertheless, Physical Unclonable Functions remain a compelling building block for securing embedded machine learning systems. By embedding hardware identity directly into the chip, PUFs support lightweight cryptographic operations, reduce key management burden, and help establish root-of-trust anchors in distributed or resource-constrained environments. When integrated thoughtfully, they complement other hardware-assisted security mechanisms such as Secure Boot, TEEs, and HSMs to provide defense-in-depth across the ML system lifecycle.
Mechanisms Comparison
Hardware-assisted security mechanisms play a foundational role in establishing trust within modern machine learning systems. While software-based defenses offer flexibility, they ultimately rely on the security of the hardware platform. As machine learning workloads increasingly operate on edge devices, embedded platforms, and untrusted infrastructure, hardware-backed protections become essential for maintaining system integrity, confidentiality, and trust.
Trusted Execution Environments (TEEs) provide runtime isolation for model inference and sensitive data handling. Secure Boot enforces integrity from power-on, ensuring that only verified software is executed. Hardware Security Modules (HSMs) offer tamper-resistant storage and cryptographic processing for secure key management, model signing, and firmware validation. Physical Unclonable Functions (PUFs) bind secrets and authentication to the physical characteristics of a specific device, enabling lightweight and unclonable identities.
These mechanisms address different layers of the system stack, ranging from initialization and attestation to runtime protection and identity binding, and complement one another when deployed together. Table 15.7 below compares their roles, use cases, and trade-offs in machine learning system design.
| Mechanism | Primary Function | Common Use in ML | Trade-offs |
|---|---|---|---|
| Trusted Execution Environment (TEE) | Isolated runtime environment for secure computation | Secure inference and on-device privacy for sensitive inputs and outputs | Added complexity, memory limits, performance cost; requires trusted code development |
| Secure Boot | Verified boot sequence and firmware validation | Ensures only signed ML models and firmware execute on embedded devices | Key management complexity, vendor lock-in; performance impact during startup |
| Hardware Security Module (HSM) | Secure key generation and storage, crypto-processing | Signing ML models, securing training pipelines, verifying firmware | High cost, integration overhead, limited I/O; requires infrastructure-level provisioning |
| Physical Unclonable Function (PUF) | Hardware-bound identity and key derivation | Model binding, device authentication, protecting IP in embedded deployments | Environmental sensitivity, modeling attacks; needs error correction and calibration |
Together, these hardware primitives form the foundation of a defense-in-depth strategy for securing ML systems in adversarial environments. Their integration is especially important in domains that demand provable trust, such as autonomous vehicles, healthcare devices, federated learning systems, and critical infrastructure.
15.7.6 Toward Trustworthy Systems
Defending machine learning systems against adversarial threats, misuse, and system compromise requires more than isolated countermeasures. As this section has shown, effective defense emerges from the careful integration of mechanisms at multiple layers of the ML stack, from privacy-preserving data handling and robust model design to runtime monitoring and hardware-enforced isolation. No single component can provide complete protection; instead, a trustworthy system is the result of coordinated design decisions that address risk across the data, model, system, and infrastructure layers.
Defensive strategies must align with the deployment context and threat model. What is appropriate for a public cloud API may differ from the requirements of an embedded medical device or a fleet of edge-deployed sensors. Design choices must balance security, performance, and usability, recognizing that protections often introduce operational trade-offs. Monitoring and incident response mechanisms ensure resilience during live operation, while hardware-based roots of trust ensure system integrity even when higher layers are compromised.
As machine learning continues to expand into safety-critical, privacy-sensitive, and decentralized environments, the need for robust, end-to-end defense becomes increasingly urgent. Building ML systems that are not only accurate but also secure, private, and auditable is fundamental to long-term deployment success and public trust. The principles introduced in this section lay the groundwork for such systems while connecting forward to broader concerns explored in subsequent chapters, including robustness, responsible AI, and MLOps.
The process of engineering trustworthy ML systems requires a structured approach that connects threat modeling to layered defenses and runtime resilience. Figure 15.16 provides a conceptual framework to guide this process across technical and deployment dimensions. The design flow begins with a thorough assessment of the threat model and deployment context, which informs the selection of appropriate defenses across the system stack. This includes data-layer protections such as differential privacy (DP), federated learning (FL), and encryption; model-layer defenses like robustness techniques, watermarking, and secure deployment practices; runtime-layer measures such as input validation and output monitoring; and hardware-layer solutions including TEEs, secure boot, and PUFs.
This design flow emphasizes the importance of a comprehensive approach to security, where each layer of the system is fortified against potential threats while remaining adaptable to evolving risks. By integrating these principles into the design and deployment of machine learning systems, organizations can build solutions that are not only effective but also resilient, trustworthy, and aligned with ethical standards.
15.8 Offensive Capabilities
While machine learning systems are often treated as assets to protect, they may also serve as tools for launching attacks. In adversarial settings, the same models used to enhance productivity, automate perception, or assist decision-making can be repurposed to execute or amplify offensive operations. This dual-use characteristic of machine learning, its capacity to secure systems as well as to subvert them, marks a fundamental shift in how ML must be considered within system-level threat models.
An offensive use of machine learning refers to any scenario in which a machine learning model is employed to facilitate the compromise of another system. In such cases, the model itself is not the object under attack, but the mechanism through which an adversary advances their objectives. These applications may involve reconnaissance, inference, subversion, impersonation, or the automation of exploit strategies that would otherwise require manual execution.
Importantly, such offensive applications are not speculative. Attackers are already integrating machine learning into their toolchains across a wide range of activities, from spam-filter evasion to model-driven malware generation. What distinguishes these scenarios is the deliberate use of learning-based systems to extract, manipulate, or generate information in ways that undermine the confidentiality, integrity, or availability of targeted components.
To clarify the diversity and structure of these applications, Table 15.8 summarizes several representative use cases. For each, the table identifies the type of machine learning model typically employed, the underlying system vulnerability it exploits, and the primary advantage conferred by the use of machine learning.
| Offensive Use Case | ML Model Type | Targeted System Vulnerability | Advantage of ML |
|---|---|---|---|
| Phishing and Social Engineering | Large Language Models (LLMs) | Human perception and communication systems | Personalized, context-aware message crafting |
| Reconnaissance and Fingerprinting | Supervised classifiers, clustering models | System configuration, network behavior | Scalable, automated profiling of system behavior |
| Exploit Generation | Code generation models, fine-tuned transformers | Software bugs, insecure code patterns | Automated discovery of candidate exploits |
| Data Extraction (Inference Attacks) | Classification models, inversion models | Privacy leakage through model outputs | Inference with limited or black-box access |
| Evasion of Detection Systems | Adversarial input generators | Detection boundaries in deployed ML systems | Crafting minimally perturbed inputs to evade filters |
| Hardware-Level Attacks | CNNs and RNNs for time-series analysis | Physical side-channels (e.g., power, timing, EM) | Learning leakage patterns directly from raw signals |
Each of these scenarios illustrates how machine learning models can serve as amplifiers of adversarial capability. For example, language models enable more convincing and adaptable phishing attacks, while clustering and classification algorithms facilitate reconnaissance by learning system-level behavioral patterns. Similarly, adversarial example generators and inference models systematically uncover weaknesses in decision boundaries or data privacy protections, often requiring only limited external access to deployed systems. In hardware contexts, as discussed in the next section, deep neural networks trained on side-channel data can automate the extraction of cryptographic secrets from physical measurements, transforming an expert-driven process into a learnable pattern recognition task.
Although these applications differ in technical implementation, they share a common foundation: the adversary replaces a static exploit with a learned model capable of approximating or adapting to the target's vulnerable behavior. This shift increases flexibility, reduces manual overhead, and improves robustness in the face of evolving or partially obscured defenses.
What makes this class of threats particularly significant is their favorable scaling behavior. Just as accuracy in computer vision or language modeling improves with additional data, larger architectures, and greater compute resources, so too does the performance of attack-oriented machine learning models. A model trained on larger corpora of phishing attempts or power traces, for instance, may generalize more effectively, evade more detectors, or require fewer inputs to succeed. The same ecosystem that drives innovation in beneficial AI, including public datasets, open-source tooling, and scalable infrastructure, also lowers the barrier to developing effective offensive models.
This dynamic creates an asymmetry between attacker and defender. While defensive measures are bounded by deployment constraints, latency budgets, and regulatory requirements, attackers can scale training pipelines with minimal marginal cost. The widespread availability of pretrained models and public ML platforms further reduces the expertise required to develop high-impact attacks.
As a result, any comprehensive treatment of machine learning system security must consider not only the vulnerabilities of ML systems themselves but also the ways in which machine learning can be harnessed to compromise other components, whether software, data, or hardware. Understanding the offensive potential of machine-learned systems is essential for designing resilient, trustworthy, and forward-looking defenses.
15.8.1 Case Study: Deep Learning for SCA
One of the most well-known and reproducible demonstrations of deep-learning-assisted side-channel analysis (SCA) is the SCAAML framework (Side-Channel Attacks Assisted with Machine Learning) (Bursztein et al. 2019). Developed by researchers at Google, SCAAML provides a practical, end-to-end implementation of this class of attack.
As shown in Figure 15.17, cryptographic computations exhibit data-dependent variations in their power consumption. These variations, while subtle, are measurable and reflect the internal state of the algorithm at specific points in time.
In traditional side-channel attacks, experts rely on statistical techniques to extract these differences. However, a neural network can learn to associate the shape of these signals with the specific data values being processed, effectively learning to decode the signal in a manner that mimics expert-crafted models, yet with enhanced flexibility and generalization. The model is trained on labeled examples of power traces and their corresponding intermediate values (e.g., output of an S-box operation). Over time, it learns to associate patterns in the trace, similar to those depicted in Figure 15.17, with secret-dependent computational behavior. This transforms the key recovery task into a classification problem, where the goal is to infer the correct key byte based on trace shape alone.
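As an illustration of this framing, the sketch below defines a small 1D convolutional classifier that maps a raw power trace to one of 256 intermediate-value classes, such as an S-box output byte. The architecture, trace length, and layer sizes are illustrative assumptions and are not the SCAAML model.

```python
# A minimal sketch of a trace classifier for deep-learning side-channel analysis.
import torch
import torch.nn as nn

class TraceClassifier(nn.Module):
    def __init__(self, trace_len: int = 5000, n_classes: int = 256):
        super().__init__()
        # Two convolution/pooling stages extract local leakage patterns.
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=11, padding=5), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=11, padding=5), nn.ReLU(), nn.MaxPool1d(4),
        )
        # A linear head predicts the intermediate value (e.g., S-box output byte).
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (trace_len // 16), n_classes),
        )

    def forward(self, traces: torch.Tensor) -> torch.Tensor:
        # traces: (batch, 1, trace_len) raw power measurements
        return self.head(self.features(traces))

model = TraceClassifier()
logits = model(torch.randn(8, 1, 5000))  # one predicted byte distribution per trace
```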
In their study, Bursztein et al. (2019) trained a convolutional neural network to extract AES keys from power traces collected on an STM32F415 microcontroller running the open-source TinyAES implementation. The model was trained to predict intermediate values of the AES algorithm, such as the output of the S-box in the first round, directly from raw power traces. Remarkably, the trained model was able to recover the full 128-bit key using only a small number of traces per byte.
The traces were collected using a ChipWhisperer setup with a custom STM32F target board, shown in Figure 15.18. This board executes AES operations while allowing external equipment to monitor power consumption with high temporal precision. The experimental setup captures how even inexpensive, low-power embedded devices can leak information through side channels, leakage that modern machine learning models can learn to exploit.
Subsequent work expanded on this approach by introducing long-range models capable of leveraging broader temporal dependencies in the traces, improving performance even under noise and desynchronization (Bursztein et al. 2024). These developments highlight the potential for machine learning models to serve as offensive cryptanalysis tools, especially in the analysis of secure hardware.
The implications extend beyond academic interest. As deep learning models continue to scale, their application to side-channel contexts is likely to lower the cost, skill threshold, and trace requirements of hardware-level attacks, posing a growing challenge for the secure deployment of embedded machine learning systems, cryptographic modules, and trusted execution environments.
15.9 Conclusion
Security and privacy are foundational to the deployment of machine learning systems in real-world environments. As ML moves beyond the lab and into production, as it is deployed across cloud services, edge devices, mobile platforms, and critical infrastructure, the threats it faces become more complex and more consequential. From model theft and data leakage to adversarial manipulation and hardware compromise, securing ML systems requires a comprehensive understanding of the entire software and hardware stack.
This chapter explored these challenges from multiple angles. We began by examining real-world security incidents and threat models that impact ML systems, including attacks on training data, inference pipelines, and deployed models. We then discussed defense strategies that operate at different layers of the system: from data privacy techniques like differential privacy and federated learning, to robust model design, secure deployment practices, runtime monitoring, and hardware-enforced trust. Each of these layers addresses a distinct surface of vulnerability, and together they form the basis of a defense-in-depth approach.
Importantly, security is not a static checklist. It is an evolving process shaped by the deployment context, the capabilities of adversaries, and the risk tolerance of stakeholders. What protects a publicly exposed API may not suffice for an embedded medical device or a distributed fleet of autonomous systems. The effectiveness of any given defense depends on how well it fits into the larger system and how it interacts with other components, users, and constraints.
The goal of this chapter was not to catalog every threat or prescribe a fixed set of solutions. Rather, it was to help build the mindset needed to design secure, private, and trustworthy ML systems: systems that perform reliably under pressure, protect the data they rely on, and respond gracefully when things go wrong.
As we look ahead, security and privacy will remain intertwined with other system concerns: robustness, fairness, sustainability, and operational scale. In the chapters that follow, we will explore these additional dimensions and extend the foundation laid here toward the broader challenge of building ML systems that are not only performant, but responsible, reliable, and resilient by design.
15.10 Resources
- Coming soon.