18 Robust AI
Purpose
How do we develop fault-tolerant and resilient machine learning systems for real-world deployment?
The integration of machine learning systems into real-world applications demands fault-tolerant execution. However, these systems are inherently vulnerable to a spectrum of challenges that can degrade their capabilities. From subtle hardware anomalies to sophisticated adversarial attacks and the unpredictable nature of real-world data, the potential for failure is ever-present. This reality underscores the need to fundamentally rethink how AI systems are designed and deployed, placing robustness and trustworthiness at the forefront. Building resilient machine learning systems is not merely a technical objective; it is a foundational requirement for ensuring their safe and effective operation in dynamic and uncertain environments.
Learning Objectives

- Identify common hardware faults impacting AI performance.
- Explain how hardware faults (transient and permanent) affect AI systems.
- Define adversarial attacks and their impact on ML models.
- Recognize the vulnerabilities of ML models to data poisoning.
- Explain the challenges posed by distribution shifts in ML models.
- Describe the role of software fault detection and mitigation in AI systems.
- Understand the importance of a holistic software development approach for robust AI.
18.1 Overview
As ML systems become increasingly integrated into various domains—from cloud-based services to edge devices and embedded systems—the impact of hardware and software faults on their performance and reliability grows more pronounced. Looking ahead, as these systems become more complex and are deployed in safety-critical applications, the need for robust and fault-tolerant designs becomes paramount.
ML systems are expected to play critical roles in autonomous vehicles, smart cities, healthcare, and industrial automation. In these domains, the consequences of systemic failures can be severe: hardware and software faults, malicious inputs such as adversarial attacks and data poisoning, and environmental shifts can all potentially result in loss of life, economic disruption, or environmental harm.
To address these risks, researchers and engineers must develop advanced techniques for fault detection, isolation, and recovery, ensuring the reliable operation of future ML systems.
Robust Artificial Intelligence (Robust AI) refers to the ability of AI systems to maintain performance and reliability in the presence of internal and external system errors, malicious inputs, and changes to the data or environment. Robust AI systems are designed to be fault-tolerant and error-resilient, capable of functioning effectively despite variations and errors within the operational environment. Achieving Robust AI involves strategies for fault detection, mitigation, and recovery, as well as prioritizing resilience throughout the AI development lifecycle.
We focus specifically on three categories of faults and errors that can impact the robustness of ML systems: faults arising in the underlying hardware, threats to the model from malicious manipulation and environmental change, and defects in the supporting software.
Systemic hardware failures present significant challenges across computing systems. Whether transient, permanent, or intermittent, these faults can corrupt computations and degrade system performance. The impact ranges from temporary glitches to complete component failures, requiring robust detection and mitigation strategies to maintain reliable operation.
Malicious manipulation of ML models remains a critical concern as ML systems face various threats to their integrity. Adversarial attacks, data poisoning attempts, and distribution shifts can cause models to misclassify inputs, exhibit distorted behavior patterns, or produce unreliable outputs. These vulnerabilities underscore the importance of developing resilient architectures and defensive mechanisms to protect model performance.
Software faults introduce another dimension of potential failures that must be carefully managed. Bugs, design flaws, and implementation errors within algorithms, libraries, and frameworks can propagate through the system, creating systemic vulnerabilities. Rigorous testing, monitoring, and quality control processes help identify and address these software-related issues before they impact production systems.
The specific approaches to achieving robustness vary significantly based on deployment context and system constraints. Large-scale cloud computing environments and data centers typically emphasize fault tolerance through redundancy, distributed processing architectures, and sophisticated error detection mechanisms. In contrast, edge devices and embedded systems must address robustness challenges within strict computational, memory, and energy limitations. This necessitates careful optimization and targeted hardening strategies appropriate for resource-constrained environments.
Regardless of deployment context, the essential characteristics of a robust ML system include fault tolerance, error resilience, and sustained performance. By understanding and addressing these multifaceted challenges, it is possible to develop reliable ML systems capable of operating effectively in real-world environments.
This chapter not only explores the tools, frameworks, and techniques used to detect and mitigate faults, attacks, and distribution shifts, but also emphasizes the importance of prioritizing resilience throughout the AI development lifecycle—from data collection and model training to deployment and monitoring. Proactively addressing robustness challenges is key to unlocking the full potential of ML technologies while ensuring their safe, dependable, and responsible deployment.
18.2 Real-World Applications
Understanding the importance of robustness in machine learning systems requires examining how faults manifest in practice. Real-world case studies illustrate the consequences of hardware and software faults across cloud, edge, and embedded environments. These examples highlight the critical need for fault-tolerant design, rigorous testing, and robust system architectures to ensure reliable operation in diverse deployment scenarios.
18.2.1 Cloud
In February 2017, Amazon Web Services (AWS) experienced a significant outage due to human error during routine maintenance. An engineer inadvertently entered an incorrect command, resulting in the shutdown of multiple servers. This outage disrupted many AWS services, including Amazon’s AI-powered assistant, Alexa. As a consequence, Alexa-enabled devices—such as Amazon Echo and third-party products using Alexa Voice Service—were unresponsive for several hours. This incident underscores the impact of human error on cloud-based ML systems and the importance of robust maintenance protocols and failsafe mechanisms.
In another case (Vangal et al. 2021), Facebook encountered a silent data corruption (SDC) issue in its distributed querying infrastructure, illustrated in Figure 18.1. SDC refers to undetected errors during computation or data transfer that propagate silently through system layers. Facebook’s system processed SQL-like queries across datasets and supported a compression application designed to reduce data storage footprints. Files were compressed when not in use and decompressed upon read requests. A size check was performed before decompression to ensure the file was valid. However, an unexpected fault occasionally returned a file size of zero for valid files, leading to decompression failures and missing entries in the output database. The issue appeared sporadically, with some computations returning correct file sizes, making it particularly difficult to diagnose.
This case illustrates how silent data corruption can propagate across multiple layers of the application stack, resulting in data loss and application failures in large-scale distributed systems. Left unaddressed, such errors can degrade ML system performance. For example, corrupted training data or inconsistencies in data pipelines due to SDC may compromise model accuracy and reliability. Similar challenges have been reported by other major companies. As shown in Figure 18.2, Jeff Dean, Chief Scientist at Google DeepMind and Google Research, highlighted these issues in AI hypercomputers during a keynote at MLSys 2024.
18.2.2 Edge
In the edge computing domain, self-driving vehicles provide prominent examples of how faults can critically affect ML systems. These vehicles depend on machine learning for perception, decision-making, and control, making them particularly vulnerable to both hardware and software faults.
In May 2016, a fatal crash occurred when a Tesla Model S operating in Autopilot mode1 collided with a white semi-trailer truck. The system, relying on computer vision and ML algorithms, failed to distinguish the trailer against a bright sky, leading to a high-speed impact. The driver, reportedly distracted at the time, did not intervene, as shown in Figure 18.3. This incident raised serious concerns about the reliability of AI-based perception systems and emphasized the need for robust failsafe mechanisms in autonomous vehicles. A similar case occurred in March 2018, when an Uber self-driving test vehicle struck and killed a pedestrian in Tempe, Arizona. The accident was attributed to a flaw in the vehicle’s object recognition software, which failed to classify the pedestrian as an obstacle requiring avoidance.
1 Autopilot: Tesla’s driver assistance system that provides semi-autonomous capabilities like steering, braking, and acceleration while requiring active driver supervision.
18.2.3 Embedded
Embedded systems operate in resource-constrained and often safety-critical environments. As AI capabilities are increasingly integrated into these systems, the complexity and consequences of faults grow significantly.
One example comes from space exploration. In 1999, NASA’s Mars Polar Lander mission experienced a catastrophic failure due to a software error in its touchdown detection system (Figure 18.4). The lander’s software misinterpreted the vibrations from the deployment of its landing legs as a successful touchdown, prematurely shutting off its engines and causing a crash. This incident underscores the importance of rigorous software validation and robust system design, particularly for remote missions where recovery is impossible. As AI becomes more integral to space systems, ensuring robustness and reliability will be essential to mission success.
Another example occurred in 2015, when a Boeing 787 Dreamliner experienced a complete electrical shutdown mid-flight due to a software bug in its generator control units. The failure stemmed from a scenario in which powering up all four generator control units simultaneously—after 248 days of continuous operation—caused them to enter failsafe mode, disabling all AC electrical power.
“If the four main generator control units (associated with the engine-mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.” — Federal Aviation Administration directive (2015)
As AI is increasingly applied in aviation—for tasks such as autonomous flight control and predictive maintenance—the robustness of embedded systems becomes critical for passenger safety.
Finally, consider the case of implantable medical devices. For instance, a smart pacemaker that experiences a fault or unexpected behavior due to software or hardware failure could place a patient’s life at risk. As AI systems take on perception, decision-making, and control roles in such applications, new sources of vulnerability emerge, including data-related errors, model uncertainty2, and unpredictable behaviors in rare edge cases. Moreover, the opaque nature of some AI models complicates fault diagnosis and recovery.
2 Model Uncertainty: The inadequacy of a machine learning model to capture the full complexity of the underlying data-generating process.
18.3 Hardware Faults
Hardware faults are a significant challenge in computing systems, including both traditional and ML systems. These faults occur when physical components—such as processors, memory modules, storage devices, or interconnects—malfunction or behave abnormally. Hardware faults can cause incorrect computations, data corruption, system crashes, or complete system failure, compromising the integrity and trustworthiness of the computations performed by the system (Jha et al. 2019). A complete system failure refers to a situation where the entire computing system becomes unresponsive or inoperable due to a critical hardware malfunction. This type of failure is the most severe, as it renders the system unusable and may lead to data loss or corruption, requiring manual intervention to repair or replace the faulty components.
ML systems depend on complex hardware architectures and large-scale computations to train and deploy models that learn from data and make intelligent predictions. As a result, hardware faults can disrupt the MLOps pipeline, introducing errors that compromise model accuracy, robustness, and reliability (G. Li et al. 2017). Understanding the types of hardware faults, their mechanisms, and their impact on system behavior is essential for developing strategies to detect, mitigate, and recover from these issues.
The following sections will explore the three main categories of hardware faults: transient, permanent, and intermittent. We will discuss their definitions, characteristics, causes, mechanisms, and examples of how they manifest in computing systems. Detection and mitigation techniques specific to each fault type will also be covered.
Transient Faults: Transient faults are temporary and non-recurring. They are often caused by external factors such as cosmic rays, electromagnetic interference, or power fluctuations. A common example of a transient fault is a bit flip, where a single bit in a memory location or register changes its value unexpectedly. Transient faults can lead to incorrect computations or data corruption, but they do not cause permanent damage to the hardware.
Permanent Faults: Permanent faults, also called hard errors, are irreversible and persist over time. They are typically caused by physical defects or wear-out of hardware components. Examples of permanent faults include stuck-at faults, where a bit or signal is permanently set to a specific value (e.g., always 0 or always 1), and device failures, such as a malfunctioning processor or a damaged memory module. Permanent faults can result in complete system failure or significant performance degradation.
Intermittent Faults: Intermittent faults are recurring faults that appear and disappear intermittently. Unstable hardware conditions, such as loose connections, aging components, or manufacturing defects, often cause them. Intermittent faults can be challenging to diagnose and reproduce because they may occur sporadically and under specific conditions. Examples include intermittent short circuits or contact resistance issues. These faults can lead to unpredictable system behavior and sporadic errors.
Understanding this fault taxonomy and its relevance to both traditional computing and ML systems provides a foundation for making informed decisions when designing, implementing, and deploying fault-tolerant solutions. This knowledge is crucial for improving the reliability and trustworthiness of computing systems and ML applications.
18.3.1 Transient Faults
Transient faults in hardware can manifest in various forms, each with its own unique characteristics and causes. These faults are temporary in nature and do not result in permanent damage to the hardware components.
Characteristics
All transient faults are characterized by their short duration and non-permanent nature. They do not persist or leave any lasting impact on the hardware. However, they can still lead to incorrect computations, data corruption, or system misbehavior if not properly handled, as exemplified by bit-flip errors, where a single bit in memory unexpectedly changes state and potentially alters critical data or computations, as shown in Figure 18.5.
Some of the common types of transient faults include:

- Single Event Upsets (SEUs) caused by ionizing radiation
- Voltage fluctuations due to power supply noise or electromagnetic interference (Reddi and Gupta 2013)
- Electromagnetic Interference (EMI) induced by external electromagnetic fields
- Electrostatic Discharge (ESD) resulting from sudden static electricity flow
- Crosstalk caused by unintended signal coupling
- Ground bounce triggered by simultaneous switching of multiple outputs
- Timing violations due to breaches of signal timing constraints
- Soft errors in combinational logic that affect the output of logic circuits (Mukherjee, Emer, and Reinhardt, n.d.)

Understanding these different types of transient faults is crucial for designing robust and resilient hardware systems that can mitigate their impact and ensure reliable operation.
Causes
Transient faults can be attributed to various external factors. One common cause is cosmic rays—high-energy particles originating from outer space. When these particles strike sensitive areas of the hardware, such as memory cells or transistors, they can induce charge disturbances that alter the stored or transmitted data. This is illustrated in Figure 18.6. Another cause of transient faults is electromagnetic interference (EMI) from nearby devices or power fluctuations. EMI can couple with the circuits and cause voltage spikes or glitches that temporarily disrupt the normal operation of the hardware.
Mechanisms
Transient faults can manifest through different mechanisms depending on the affected hardware component. In memory devices like DRAM or SRAM, transient faults often lead to bit flips, where a single bit changes its value from 0 to 1 or vice versa. This can corrupt the stored data or instructions. In logic circuits, transient faults can cause glitches3 or voltage spikes propagating through the combinational logic4, resulting in incorrect outputs or control signals. Transient faults can also affect communication channels, causing bit errors or packet losses during data transmission.
3 Glitches: Momentary deviation in voltage, current, or signal, often causing incorrect operation.
4 Combinational logic: Digital logic, wherein the output depends only on the current input states, not any past states.
Impact on ML
A common example of a transient fault is a bit flip in the main memory. If an important data structure or critical instruction is stored in the affected memory location, it can lead to incorrect computations or program misbehavior. For instance, a bit flip in the memory storing a loop counter can cause the loop to execute indefinitely or terminate prematurely. Transient faults in control registers or flag bits can alter the flow of program execution, leading to unexpected jumps or incorrect branch decisions. In communication systems, transient faults can corrupt transmitted data packets, resulting in retransmissions or data loss.
In ML systems, transient faults can have significant implications during the training phase (He et al. 2023). ML training involves iterative computations and updates to model parameters based on large datasets. If a transient fault occurs in the memory storing the model weights or gradients, it can lead to incorrect updates and compromise the convergence and accuracy of the training process. For example, a bit flip in the weight matrix of a neural network can cause the model to learn incorrect patterns or associations, leading to degraded performance (Wan et al. 2021). Transient faults in the data pipeline, such as corruption of training samples or labels, can also introduce noise and affect the quality of the learned model.
As shown in Figure 18.7, a real-world example from Google’s production fleet highlights how an SDC anomaly caused a significant deviation in the gradient norm—a measure of the magnitude of updates to the model parameters. Such deviations can disrupt the optimization process, leading to slower convergence or failure to reach an optimal solution.
During the inference phase, transient faults can impact the reliability and trustworthiness of ML predictions. If a transient fault occurs in the memory storing the trained model parameters or during the computation of inference results, it can lead to incorrect or inconsistent predictions. For instance, a bit flip in the activation values of a neural network can alter the final classification or regression output (Mahmoud et al. 2020). In safety-critical applications, such as autonomous vehicles or medical diagnosis, these faults can have severe consequences, resulting in incorrect decisions or actions that may compromise safety or lead to system failures (G. Li et al. 2017; Jha et al. 2019).
5 Stochastic Computing: A collection of techniques using random bits and logic operations to perform arithmetic and data processing, promising better fault tolerance.
Transient faults can be amplified in resource-constrained environments like TinyML, where limited computational and memory resources exacerbate their impact. One prominent example is Binarized Neural Networks (BNNs) (Courbariaux et al. 2016), which represent network weights in single-bit precision to achieve computational efficiency and faster inference times. While this binary representation is advantageous for resource-constrained systems, it also makes BNNs particularly fragile to bit-flip errors. For instance, prior work (Aygun, Gunes, and De Vleeschouwer 2021) has shown that a two-hidden-layer BNN architecture for a simple task such as MNIST classification suffers performance degradation from 98% test accuracy to 70% when random bit-flip soft errors are injected into the model weights with a 10% probability. To address these vulnerabilities, techniques like flip-aware training and emerging approaches such as stochastic computing5 are being explored to enhance fault tolerance.
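To build intuition for why a single flipped bit can be so damaging, the short Python sketch below flips individual bits of a float32 value, the representation used for most network weights. The helper name and the example weight are illustrative; only the standard library is used.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 value and return the corrupted result."""
    # Reinterpret the float32 as a 32-bit unsigned integer, toggle one bit, convert back.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    corrupted = as_int ^ (1 << bit)
    (as_float,) = struct.unpack("<f", struct.pack("<I", corrupted))
    return as_float

weight = 0.125  # a typical small network weight
for bit in (0, 20, 30):  # low mantissa bit, high mantissa bit, exponent bit
    print(f"bit {bit:2d}: {weight} -> {flip_bit(weight, bit)}")
# Flipping a low mantissa bit barely changes the value, while flipping an
# exponent bit changes its magnitude by many orders of magnitude, which is
# why a single upset in a weight or activation can derail training or inference.
```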
18.3.2 Permanent Faults
Permanent faults are hardware defects that persist and cause irreversible damage to the affected components. These faults are characterized by their persistent nature and require repair or replacement of the faulty hardware to restore normal system functionality.
Characteristics
Permanent faults cause persistent and irreversible malfunctions in hardware components. The faulty component remains non-operational until it is repaired or replaced. These faults are consistent and reproducible, meaning the faulty behavior is observed every time the affected component is used. They can impact processors, memory modules, storage devices, or interconnects—potentially leading to system crashes, data corruption, or complete system failure.
One notable example of a permanent fault is the Intel FDIV bug, discovered in 1994. This flaw affected the floating-point division (FDIV) units of certain Intel Pentium processors, causing incorrect results for specific division operations and leading to inaccurate calculations.
The FDIV bug occurred due to an error in the lookup table6 used by the division unit. In rare cases, the processor would fetch an incorrect value, resulting in a slightly less precise result than expected. For instance, Figure 18.8 shows a fraction 4195835/3145727 plotted on a Pentium processor with the FDIV fault. The triangular regions highlight where erroneous calculations occurred. Ideally, all correct values would round to 1.3338, but the faulty results showed 1.3337, indicating a mistake in the 5th digit.
6 Lookup Table: A data structure used to replace a runtime computation with a simpler array indexing operation.
Although the error was small, it could compound across many operations, significantly affecting results in precision-critical applications such as scientific simulations, financial calculations, and computer-aided design. The bug ultimately led to incorrect outcomes in these domains and underscored the severe consequences permanent faults can have.
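The flaw can be illustrated with a few lines of arithmetic. The check below, widely circulated at the time, evaluates to zero in exact arithmetic; a correct floating-point unit returns (approximately) zero, while flawed Pentium chips reportedly returned 256 because of the bad lookup-table entries. The snippet is a sketch run on modern, correct hardware, so it simply confirms the expected result.

```python
a, b = 4195835.0, 3145727.0

quotient = a / b                 # correct value is approximately 1.333820449
residual = a - quotient * b      # exactly zero in ideal arithmetic

print(f"{a:.0f} / {b:.0f} = {quotient:.9f}")
print(f"residual = {residual}")  # ~0 on a correct FPU; 256 on a flawed Pentium
```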
The FDIV bug serves as a cautionary tale for ML systems. In such systems, permanent faults in hardware components can result in incorrect computations, impacting model accuracy and reliability. For example, if an ML system relies on a processor with a faulty floating-point unit, similar to the FDIV bug, it could introduce persistent errors during training or inference. These errors may propagate through the model, leading to inaccurate predictions or skewed learning outcomes.
This is especially critical in safety-sensitive applications like autonomous driving, medical diagnosis, or financial forecasting, where the consequences of incorrect computations can be severe. ML practitioners must be aware of these risks and incorporate fault-tolerant techniques—such as hardware redundancy, error detection and correction, and robust algorithm design—to mitigate them. Additionally, thorough hardware validation and testing can help identify and resolve permanent faults before they affect system performance and reliability.
Causes
Permanent faults can arise from two primary sources: manufacturing defects and wear-out mechanisms. Manufacturing defects are flaws introduced during the fabrication process, including improper etching, incorrect doping, or contamination. These defects may result in non-functional or partially functional components. In contrast, wear-out mechanisms occur over time due to prolonged use and operational stress. Phenomena like electromigration7, oxide breakdown8, and thermal stress9 degrade component integrity, eventually leading to permanent failure.
7 The movement of metal atoms in a conductor under the influence of an electric field.
8 The failure of an oxide layer in a transistor due to excessive electric field stress.
9 Degradation caused by repeated cycling through high and low temperatures.
Mechanisms
Permanent faults manifest through several mechanisms, depending on their nature and location. A common example is the stuck-at fault (Seong et al. 2010), where a signal or memory cell becomes permanently fixed at either 0 or 1, regardless of the intended input. This type of fault can occur in logic gates, memory cells, or interconnects and typically results in incorrect computations or persistent data corruption.
Other mechanisms include device failures, in which hardware components such as transistors or memory cells cease functioning entirely due to manufacturing defects or degradation over time. Bridging faults, which occur when two or more signal lines are unintentionally connected, can introduce short circuits or incorrect logic behaviors that are difficult to isolate.
In more subtle cases, delay faults can arise when the propagation time of a signal exceeds the allowed timing constraints. Although the logical values may be correct, the violation of timing expectations can still result in erroneous behavior. Similarly, interconnect faults—such as open circuits caused by broken connections, high-resistance paths that impede current flow, or increased capacitance that distorts signal transitions—can significantly degrade circuit performance and reliability.
Memory subsystems are particularly vulnerable to permanent faults. Transition faults can prevent a memory cell from successfully changing its state, while coupling faults result from unwanted interference between adjacent cells, leading to unintentional state changes. Additionally, neighborhood pattern sensitive faults occur when the state of a memory cell is incorrectly influenced by the data stored in nearby cells, reflecting a more complex interaction between circuit layout and logic behavior.
Finally, permanent faults can also occur in critical infrastructure components such as the power supply network or clock distribution system. Failures in these subsystems can affect circuit-wide functionality, introduce timing errors, or cause widespread operational instability.
Taken together, these mechanisms illustrate the varied and often complex ways in which permanent faults can undermine the behavior of computing systems. For ML applications in particular, where correctness and consistency are vital, understanding these fault modes is essential for developing resilient hardware and software solutions.
Impact on ML
Permanent faults can severely disrupt the behavior and reliability of computing systems. For example, a stuck-at fault in a processor’s arithmetic logic unit (ALU) can produce persistent computational errors, leading to incorrect program behavior or crashes. In memory modules, such faults may corrupt stored data, while in storage devices, they can result in bad sectors or total data loss. Interconnect faults may interfere with data transmission, leading to system hangs or corruption.
For ML systems, these faults pose significant risks in both the training and inference phases. During training, permanent faults in processors or memory can lead to incorrect gradient calculations, corrupt model parameters, or prematurely halted training processes (He et al. 2023). Similarly, faults in storage can compromise training datasets or saved models, affecting consistency and reliability.
In the inference phase, faults can distort prediction results or lead to runtime failures. For instance, errors in the hardware storing model weights might lead to outdated or corrupted models being used, while processor faults could yield incorrect outputs (J. J. Zhang et al. 2018).
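To make the contrast with transient faults concrete, the sketch below models a stuck-at fault on one bit position of an int8 quantized weight buffer: every value read through the faulty line has that bit forced to 1, so the corruption is consistent and reproducible rather than sporadic. The function name, bit position, and toy weights are illustrative; NumPy is assumed.

```python
import numpy as np

def stuck_at(weights_int8: np.ndarray, bit: int, value: int) -> np.ndarray:
    """Model a stuck-at fault on one bit position of an int8 weight buffer."""
    raw = weights_int8.view(np.uint8)            # reinterpret the bits, keep the shape
    mask = np.uint8(1 << bit)
    if value == 1:
        faulty = raw | mask                      # stuck-at-1: the bit always reads 1
    else:
        faulty = raw & ~mask                     # stuck-at-0: the bit always reads 0
    return faulty.view(np.int8)

rng = np.random.default_rng(0)
weights = rng.integers(-20, 20, size=8, dtype=np.int8)   # toy quantized weights
print("original           :", weights)
print("stuck-at-1 on bit 6:", stuck_at(weights, bit=6, value=1))
# Unlike a transient flip, every inference that reads these weights sees the
# same corrupted values, so the resulting errors are persistent and reproducible.
```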
To mitigate these impacts, ML systems must incorporate both hardware and software fault-tolerant techniques. Hardware-level methods include component redundancy and error-correcting codes (Kim, Sullivan, and Erez 2015).10 Software approaches, like checkpoint and restart mechanisms11 (Egwutuoha et al. 2013), allow systems to recover to a known-good state after a failure. Regular monitoring, testing, and maintenance can also help detect and replace failing components before critical errors occur.
10 Error-Correcting Codes: Methods used in data storage and transmission to detect and correct errors.
11 Checkpoint and Restart Mechanisms: Techniques that periodically save a program’s state so it can resume from the last saved state after a failure.
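A minimal sketch of the checkpoint-and-restart pattern is shown below, using PyTorch as an assumed framework; any framework whose training state can be serialized works the same way. The file path, checkpoint interval, and placeholder model are illustrative.

```python
import os
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                         # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
CKPT_PATH = "train_ckpt.pt"                      # illustrative path

def save_checkpoint(epoch: int) -> None:
    # Persist everything needed to resume: weights, optimizer state, and progress.
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT_PATH)

def restore_checkpoint() -> int:
    # After a crash or detected fault, resume from the last known-good state.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"] + 1

start_epoch = restore_checkpoint()
for epoch in range(start_epoch, 10):
    ...                                          # one epoch of training would run here
    if epoch % 2 == 0:                           # checkpoint periodically
        save_checkpoint(epoch)
```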
Ultimately, designing ML systems with built-in fault tolerance is essential to ensure resilience. Incorporating redundancy, error-checking, and fail-safe mechanisms helps preserve model integrity, accuracy, and trustworthiness—even in the face of permanent hardware faults.
18.3.3 Intermittent Faults
Intermittent faults are hardware faults that occur sporadically and unpredictably in a system. An example is illustrated in Figure 18.10, where cracks in the material can introduce increased resistance in circuitry. These faults are particularly challenging to detect and diagnose because they appear and disappear intermittently, making it difficult to reproduce and isolate the root cause. Depending on their frequency and location, intermittent faults can lead to system instability, data corruption, and performance degradation.
Characteristics
Intermittent faults are defined by their sporadic and non-deterministic behavior. They occur irregularly and may manifest for short durations, disappearing without a consistent pattern. Unlike permanent faults, they do not appear every time the affected component is used, which makes them particularly difficult to detect and reproduce. These faults can affect a variety of hardware components, including processors, memory modules, storage devices, and interconnects. As a result, they may lead to transient errors, unpredictable system behavior, or data corruption.
Their impact on system reliability can be significant. For instance, an intermittent fault in a processor’s control logic may disrupt the normal execution path, causing irregular program flow or unexpected system hangs. In memory modules, such faults can alter stored values inconsistently, leading to errors that are difficult to trace. Storage devices affected by intermittent faults may suffer from sporadic read/write errors or data loss, while intermittent faults in communication channels can cause data corruption, packet loss, or unstable connectivity. Over time, these failures can accumulate, degrading system performance and reliability (Rashid, Pattabiraman, and Gopalakrishnan 2015).
Causes
The causes of intermittent faults are diverse, ranging from physical degradation to environmental influences. One common cause is the aging and wear-out of electronic components. As hardware endures prolonged operation, thermal cycling, and mechanical stress, it may develop cracks, fractures, or fatigue that introduce intermittent faults. For instance, solder joints in ball grid arrays (BGAs) or flip-chip packages can degrade over time, leading to intermittent open circuits or short circuits.
Manufacturing defects and process variations can also introduce marginal components that behave reliably under most circumstances but fail intermittently under stress or extreme conditions. For example, Figure 18.11 shows a residue-induced intermittent fault in a DRAM chip that leads to sporadic failures.
Environmental factors such as thermal cycling, humidity, mechanical vibrations, or electrostatic discharge can exacerbate these weaknesses and trigger faults that would not otherwise appear. Loose or degrading physical connections—such as those in connectors or printed circuit boards—are also common sources of intermittent failures, particularly in systems exposed to movement or temperature variation.
Mechanisms
Intermittent faults can manifest through various physical and logical mechanisms depending on their root causes. One such mechanism is the intermittent open or short circuit, where physical discontinuities or partial connections cause signal paths to behave unpredictably. These faults may momentarily disrupt signal integrity, leading to glitches or unexpected logic transitions.
Another common mechanism is the intermittent delay fault (J. Zhang et al. 2018), where signal propagation times fluctuate due to marginal timing conditions, resulting in synchronization issues and incorrect computations. In memory cells or registers, intermittent faults can appear as transient bit flips or soft errors, corrupting data in ways that are difficult to detect or reproduce. Because these faults are often condition-dependent, they may only emerge under specific thermal, voltage, or workload conditions, adding further complexity to their diagnosis.
Impact on ML
Intermittent faults pose significant challenges for ML systems by undermining computational consistency and model reliability. During the training phase, such faults in processing units or memory can cause sporadic errors in the computation of gradients, weight updates, or loss values. These errors may not be persistent but can accumulate across iterations, degrading convergence and leading to unstable or suboptimal models. Intermittent faults in storage may corrupt input data or saved model checkpoints, further affecting the training pipeline (He et al. 2023).
In the inference phase, intermittent faults may result in inconsistent or erroneous predictions. Processing errors or memory corruption can distort activations, outputs, or intermediate representations of the model, particularly when faults affect model parameters or input data. Intermittent faults in data pipelines, such as unreliable sensors or storage systems, can introduce subtle input errors that degrade model robustness and output accuracy. In high-stakes applications like autonomous driving or medical diagnosis, these inconsistencies can result in dangerous decisions or failed operations.
Mitigating the effects of intermittent faults in ML systems requires a multi-layered approach (Rashid, Pattabiraman, and Gopalakrishnan 2012). At the hardware level, robust design practices, environmental controls, and the use of higher-quality or more reliable components can reduce susceptibility to fault conditions. Redundancy and error detection mechanisms can help identify and recover from transient manifestations of intermittent faults.
At the software level, techniques such as runtime monitoring, anomaly detection, and adaptive control strategies can provide resilience. Data validation checks, outlier detection, model ensembling, and runtime model adaptation are examples of fault-tolerant methods that can be integrated into ML pipelines to improve reliability in the presence of sporadic errors.
Ultimately, designing ML systems that can gracefully handle intermittent faults is essential to maintaining their accuracy, consistency, and dependability. This involves proactive fault detection, regular system monitoring, and ongoing maintenance to ensure early identification and remediation of issues. By embedding resilience into both the architecture and operational workflow, ML systems can remain robust even in environments prone to sporadic hardware failures.
18.3.4 Detection and Mitigation
A range of fault detection techniques, spanning hardware-level and software-level approaches, together with effective mitigation strategies, can enhance the resilience of ML systems. Resilient system design considerations, case studies, and ongoing research directions in fault-tolerant ML further inform how robust systems are built.
Detection Techniques
Fault detection techniques are important for identifying and localizing hardware faults in ML systems. These techniques can be broadly categorized into hardware-level and software-level approaches, each offering unique capabilities and advantages.
Hardware-Level Detection
Hardware-level fault detection techniques are implemented at the physical level of the system and aim to identify faults in the underlying hardware components. There are several hardware techniques, but broadly, we can bucket these different mechanisms into the following categories.
Built-in self-test (BIST) Mechanisms
BIST is a powerful technique for detecting faults in hardware components (Bushnell and Agrawal 2002). It involves incorporating additional hardware circuitry into the system for self-testing and fault detection. BIST can be applied to various components, such as processors, memory modules, or application-specific integrated circuits (ASICs). For example, BIST can be implemented in a processor using scan chains12, which are dedicated paths that allow access to internal registers and logic for testing purposes.
12 Scan Chains: Dedicated paths incorporated within a processor that grant access to internal registers and logic for testing.
During the BIST process, predefined test patterns are applied to the processor’s internal circuitry, and the responses are compared against expected values. Any discrepancies indicate the presence of faults. Intel’s Xeon processors, for instance, include BIST mechanisms to test the CPU cores, cache memory, and other critical components during system startup.
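While BIST itself lives in silicon, the underlying idea of writing known patterns and comparing read-backs can be sketched in a few lines of Python. The buffer, patterns, and function name below are illustrative; a real memory BIST engine runs analogous march-style tests directly in hardware.

```python
def march_test(buffer: bytearray) -> list[int]:
    """Software sketch of a BIST-style memory test: write patterns, read, compare."""
    faulty_addresses = []
    for pattern in (0x00, 0xFF, 0xAA, 0x55):
        for addr in range(len(buffer)):          # write phase
            buffer[addr] = pattern
        for addr in range(len(buffer)):          # read-and-compare phase
            if buffer[addr] != pattern:
                faulty_addresses.append(addr)
    return faulty_addresses

memory = bytearray(1024)                         # stand-in for a RAM region under test
print("faulty cells:", march_test(memory))       # expect [] for healthy "memory"
```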
Error Detection Codes
Error detection codes are widely used to detect data storage and transmission errors (Hamming 1950)13. These codes add redundant bits to the original data, allowing the detection of bit errors. Example: Parity checks14 are a simple form of error detection code, as shown in Figure 18.12. In a single-bit parity scheme, an extra bit is appended to each data word, making the total number of 1s in the word even (even parity) or odd (odd parity).
13 R. W. Hamming’s seminal paper introduced error detection and correction codes, significantly advancing digital communication reliability.
14 In parity checks, an extra bit accounts for the total number of 1s in a data word, enabling fundamental error detection.
When reading the data, the parity is checked, and if it doesn’t match the expected value, an error is detected. More advanced error detection codes, such as cyclic redundancy checks (CRC), calculate a checksum based on the data and append it to the message. The checksum is recalculated at the receiving end and compared with the transmitted checksum to detect errors. Error-correcting code (ECC) memory modules, commonly used in servers and critical systems, employ advanced error detection and correction codes to detect and correct single-bit or multi-bit errors in memory.
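The sketch below illustrates both ideas in Python: a single even-parity bit that catches a one-bit flip in a data word, and a CRC over a longer message using the standard library's zlib.crc32. The data values are illustrative.

```python
import zlib

def even_parity_bit(word: int, width: int = 8) -> int:
    """Parity bit that makes the total number of 1s (data + parity) even."""
    return bin(word & ((1 << width) - 1)).count("1") % 2

def parity_ok(word_with_parity: int, width: int = 9) -> bool:
    """An even-parity word is valid when its total number of 1s is even."""
    return bin(word_with_parity & ((1 << width) - 1)).count("1") % 2 == 0

data = 0b1011_0010
stored = (data << 1) | even_parity_bit(data)     # append the parity bit on write
corrupted = stored ^ (1 << 5)                    # a transient fault flips one bit

print("stored word passes check:   ", parity_ok(stored))      # True
print("corrupted word passes check:", parity_ok(corrupted))   # False -> error detected

# A CRC extends the same idea to whole messages: the sender appends a checksum
# and the receiver recomputes it to detect corruption in transit or storage.
message = b"model weights shard 7"
checksum = zlib.crc32(message)
print("CRC matches:", zlib.crc32(message) == checksum)
```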
Hardware redundancy and voting mechanisms
Hardware redundancy involves duplicating critical components and comparing their outputs to detect and mask faults (Sheaffer, Luebke, and Skadron 2007). Voting mechanisms, such as double modular redundancy (DMR)15 or triple modular redundancy (TMR)16, employ multiple instances of a component and compare their outputs to identify and mask faulty behavior (Arifeen, Hassan, and Lee 2020).
15 Double Modular Redundancy (DMR): A fault-tolerance process in which computations are duplicated to identify and correct errors.
16 Triple Modular Redundancy (TMR): A fault-tolerance process where three instances of a computation are performed to identify and correct errors.
In a DMR or TMR system, two or three identical instances of a hardware component, such as a processor or a sensor, perform the same computation in parallel. The outputs of these instances are fed into a voting circuit, which compares the results and selects the majority value as the final output. If one of the instances produces an incorrect result due to a fault, the voting mechanism masks the error and maintains the correct output. TMR is commonly used in aerospace and aviation systems, where high reliability is critical. For instance, the Boeing 777 aircraft employs TMR in its primary flight computer system to ensure the availability and correctness of flight control functions (Yeh, n.d.).
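Conceptually, the voting circuit reduces to a majority function over the redundant outputs. The sketch below shows that logic in Python; the class labels are illustrative, and in a real system the voter is implemented in hardware or in a hard-real-time software layer.

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority vote over the outputs of three redundant compute units."""
    winner, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("No majority: all three units disagree")
    return winner

# Three identical classifiers run in parallel; one unit suffers a fault and
# returns a wrong label, but the voter masks the error.
unit_outputs = ["pedestrian", "pedestrian", "speed_limit_sign"]
print(tmr_vote(unit_outputs))   # -> "pedestrian"
```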
Tesla’s self-driving computers, on the other hand, employ a DMR architecture to ensure the safety and reliability of critical functions such as perception, decision-making, and vehicle control, as shown in Figure 18.13. In Tesla’s implementation, two identical hardware units, often called “redundant computers” or “redundant control units,” perform the same computations in parallel. Each unit independently processes sensor data, executes algorithms, and generates control commands for the vehicle’s actuators, such as steering, acceleration, and braking (Bannon et al. 2019).
The outputs of these two redundant units are continuously compared to detect any discrepancies or faults. If the outputs match, the system assumes that both units function correctly, and the control commands are sent to the vehicle’s actuators. However, if there is a mismatch between the outputs, the system identifies a potential fault in one of the units and takes appropriate action to ensure safe operation.
DMR in Tesla’s self-driving computer provides an extra safety and fault tolerance layer. By having two independent units performing the same computations, the system can detect and mitigate faults that may occur in one of the units. This redundancy helps prevent single points of failure and ensures that critical functions remain operational despite hardware faults.
The system may employ additional mechanisms to determine which unit is faulty in a mismatch. This can involve using diagnostic algorithms, comparing the outputs with data from other sensors or subsystems, or analyzing the consistency of the outputs over time. Once the faulty unit is identified, the system can isolate it and continue operating using the output from the non-faulty unit.
Tesla also incorporates redundancy mechanisms beyond DMR. For example, they use redundant power supplies, steering and braking systems, and diverse sensor suites (e.g., cameras, radar, and ultrasonic sensors) to provide multiple layers of fault tolerance. These redundancies collectively contribute to the overall safety and reliability of the self-driving system.
It’s important to note that while DMR provides fault detection and some level of fault tolerance, TMR may provide a different level of fault masking. In DMR, if both units experience simultaneous faults or the fault affects the comparison mechanism, the system may be unable to identify the fault. Therefore, Tesla’s SDCs rely on a combination of DMR and other redundancy mechanisms to achieve a high level of fault tolerance.
The use of DMR in Tesla’s self-driving computer highlights the importance of hardware redundancy in safety-critical applications. By employing redundant computing units and comparing their outputs, the system can detect and mitigate faults, enhancing the overall safety and reliability of the self-driving functionality.
Another approach to hardware redundancy is the use of hot spares17, as employed by Google in its data centers to address SDC during ML training. Unlike DMR and TMR, which rely on parallel processing and voting mechanisms to detect and mask faults, hot spares provide fault tolerance by maintaining backup hardware units that can seamlessly take over computations when a fault is detected. As illustrated in Figure 18.14, during normal ML training, multiple synchronous training workers process data in parallel. However, if a worker becomes defective and causes SDC, an SDC checker automatically identifies the issues. Upon detecting the SDC, the SDC checker moves the training to a hot spare and sends the defective machine for repair. This redundancy safeguards the continuity and reliability of ML training, effectively minimizing downtime and preserving data integrity.
17 Hot Spares: In a system redundancy design, these are the backup components kept ready to instantaneously replace failing components without disrupting the operation.
Watchdog timers
Watchdog timers are hardware components that monitor the execution of critical tasks or processes (Pont and Ong 2002). They are commonly used to detect and recover from software or hardware faults that cause a system to become unresponsive or stuck in an infinite loop. In an embedded system, a watchdog timer can be configured to monitor the execution of the main control loop, as illustrated in Figure 18.15. The software periodically resets the watchdog timer to indicate that it functions correctly. Suppose the software fails to reset the timer within a specified time limit (timeout period). In that case, the watchdog timer assumes that the system has encountered a fault and triggers a predefined recovery action, such as resetting the system or switching to a backup component. Watchdog timers are widely used in automotive electronics, industrial control systems, and other safety-critical applications to ensure the timely detection and recovery from faults.
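Hardware watchdogs are dedicated circuits, but the interaction pattern can be sketched in software. Below is a minimal Python analogue built on threading.Timer: the monitored loop must call kick() before the timeout elapses, otherwise a recovery action fires. The class name, timeout value, and recovery action are illustrative.

```python
import threading

class SoftwareWatchdog:
    """Software analogue of a hardware watchdog timer."""

    def __init__(self, timeout_s: float, on_timeout):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout      # recovery action, e.g., reset a subsystem
        self._timer = None
        self.kick()                       # arm the watchdog

    def kick(self):
        # Resetting the timer signals that the monitored loop is still alive.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout_s, self.on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()

def recover():
    print("Watchdog expired: restarting the control loop")

watchdog = SoftwareWatchdog(timeout_s=0.5, on_timeout=recover)
for _ in range(10):                       # healthy control loop
    ...                                   # one iteration of real work would run here
    watchdog.kick()                       # reset the timer each iteration
watchdog.stop()
```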
Software-Level Detection
Software-level fault detection techniques rely on software algorithms and monitoring mechanisms to identify system faults. These techniques can be implemented at various levels of the software stack, including the operating system, middleware, or application level.
Runtime monitoring and anomaly detection
Runtime monitoring involves continuously observing the behavior of the system and its components during execution (Francalanza et al. 2017). It helps detect anomalies, errors, or unexpected behavior that may indicate the presence of faults. For example, consider an ML-based image classification system deployed in a self-driving car. Runtime monitoring can be implemented to track the classification model’s performance and behavior (Mahmoud et al. 2021).
Anomaly detection algorithms, such as statistical outlier detection or machine learning-based approaches (e.g., One-Class SVM or autoencoders), can be applied to the model's predictions or intermediate layer activations (Chandola, Banerjee, and Kumar 2009). Figure 18.16 shows an example of anomaly detection. If the monitoring system detects a significant deviation from expected patterns, such as a sudden drop in classification accuracy or out-of-distribution samples, it can raise an alert indicating a potential fault in the model or the input data pipeline. This early detection allows timely intervention and fault mitigation strategies to be applied.
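As a sketch of what such a monitor could look like, the snippet below fits a One-Class SVM (one of the approaches mentioned above) on activations collected during normal operation and flags runtime inputs whose activations fall outside that distribution. Scikit-learn is an assumed dependency, and the random arrays stand in for real penultimate-layer activations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Stand-ins for penultimate-layer activations: collected during normal operation
# (for fitting) and observed at runtime (for monitoring).
normal_activations = rng.normal(loc=0.0, scale=1.0, size=(500, 64))
runtime_batch = np.vstack([
    rng.normal(0.0, 1.0, size=(9, 64)),          # in-distribution inputs
    rng.normal(6.0, 1.0, size=(1, 64)),          # one out-of-distribution input
])

detector = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
detector.fit(normal_activations)

flags = detector.predict(runtime_batch)          # +1 = normal, -1 = anomaly
for i, flag in enumerate(flags):
    if flag == -1:
        print(f"Sample {i}: anomalous activations, possible fault or OOD input")
```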
Consistency checks and data validation
Consistency checks and data validation techniques ensure data integrity and correctness at different processing stages in an ML system (Lindholm et al. 2019). These checks help detect data corruption, inconsistencies, or errors that may propagate and affect the system’s behavior. Example: In a distributed ML system where multiple nodes collaborate to train a model, consistency checks can be implemented to validate the integrity of the shared model parameters. Each node can compute a checksum or hash of the model parameters before and after the training iteration, as shown in Figure 18.16. Any inconsistencies or data corruption can be detected by comparing the checksums across nodes. Additionally, range checks can be applied to the input data and model outputs to ensure they fall within expected bounds. For instance, if an autonomous vehicle’s perception system detects an object with unrealistic dimensions or velocities, it can indicate a fault in the sensor data or the perception algorithms (Wan et al. 2023).
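The sketch below shows both checks in Python: a SHA-256 digest over the parameter tensors that replicas can exchange to verify they hold identical copies, and a simple range check on a perception output. The parameter names, injected corruption, and velocity bound are illustrative; NumPy and the standard library are assumed.

```python
import hashlib
import numpy as np

def params_digest(params: dict) -> str:
    """Hash model parameters so replicas can verify they hold identical copies."""
    h = hashlib.sha256()
    for name in sorted(params):                  # fixed order -> reproducible digest
        h.update(name.encode())
        h.update(np.ascontiguousarray(params[name]).tobytes())
    return h.hexdigest()

params = {"w": np.ones((4, 4), dtype=np.float32), "b": np.zeros(4, dtype=np.float32)}
digest_node_a = params_digest(params)

corrupted = {k: v.copy() for k, v in params.items()}
corrupted["w"][0, 0] = 1e9                       # silent corruption on another node
digest_node_b = params_digest(corrupted)
print("replicas consistent:", digest_node_a == digest_node_b)   # False -> raise an alarm

# Range checks catch physically implausible values in inputs or outputs.
detected_speed_mps = 512.0                       # illustrative perception output
if not (0.0 <= detected_speed_mps <= 100.0):
    print("Range check failed: implausible object velocity, flagging a sensor fault")
```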
Heartbeat and timeout mechanisms
Heartbeat mechanisms and timeouts are commonly used to detect faults in distributed systems and ensure the liveness and responsiveness of components (Kawazoe Aguilera, Chen, and Toueg 1997). These are quite similar to the watchdog timers found in hardware. For example, in a distributed ML system, where multiple nodes collaborate to perform tasks such as data preprocessing, model training, or inference, heartbeat mechanisms can be implemented to monitor the health and availability of each node. Each node periodically sends a heartbeat message to a central coordinator or its peer nodes, indicating its status and availability. Suppose a node fails to send a heartbeat within a specified timeout period, as shown in Figure 18.17. In that case, it is considered faulty, and appropriate actions can be taken, such as redistributing the workload or initiating a failover mechanism. Timeouts can also be used to detect and handle hanging or unresponsive components. For example, if a data loading process exceeds a predefined timeout threshold, it may indicate a fault in the data pipeline, and the system can take corrective measures.
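A coordinator-side sketch of this pattern is shown below: workers report heartbeats, and any node whose last heartbeat is older than the timeout is presumed faulty. The node names, timeout, and recovery action are illustrative.

```python
import time

HEARTBEAT_TIMEOUT_S = 5.0
last_heartbeat = {}                              # node id -> time of last heartbeat

def record_heartbeat(node_id: str) -> None:
    """Called by the coordinator whenever a worker's heartbeat message arrives."""
    last_heartbeat[node_id] = time.monotonic()

def find_failed_nodes() -> list:
    """Nodes whose heartbeat is overdue are presumed faulty."""
    now = time.monotonic()
    return [node for node, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT_S]

record_heartbeat("worker-0")
record_heartbeat("worker-1")
last_heartbeat["worker-1"] -= 10.0               # simulate a stale heartbeat for the demo
for node in find_failed_nodes():
    print(f"{node} missed its heartbeat; redistributing its work to healthy workers")
```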
Software-implemented fault tolerance (SIFT) techniques
SIFT techniques introduce redundancy and fault detection mechanisms at the software level to improve the reliability and fault tolerance of the system (Reis et al., n.d.). Example: N-version programming is a SIFT technique in which multiple functionally equivalent versions of a software component are developed independently by different teams. This can be applied to critical components such as the model inference engine in an ML system. Multiple versions of the inference engine can be executed in parallel, and their outputs compared for consistency. If most versions produce the same output, that output is taken as the correct result; if there is a discrepancy, it indicates a potential fault in one or more versions, and appropriate error-handling mechanisms can be triggered. Another example is the use of software-based error correction codes, such as Reed-Solomon codes (Plank 1997), to detect and correct errors in data storage or transmission, as shown in Figure 18.18. These codes add redundancy to the data, enabling the detection and correction of certain errors and enhancing the system's fault tolerance.
In this Colab, play the role of an AI fault detective! You’ll build an autoencoder-based anomaly detector to pinpoint errors in heart health data. Learn how to identify malfunctions in ML systems, a vital skill for creating dependable AI. We’ll use Keras Tuner to fine-tune your autoencoder for top-notch fault detection. This experience directly links to the Robust AI chapter, demonstrating the importance of fault detection in real-world applications like healthcare and autonomous systems. Get ready to strengthen the reliability of your AI creations!
18.3.5 Summary
Table 18.1 provides a comparative analysis of transient, permanent, and intermittent faults. It outlines the primary characteristics or dimensions that distinguish these fault types. Here, we summarize the relevant dimensions we examined and explore the nuances that differentiate transient, permanent, and intermittent faults in greater detail.
Dimension | Transient Faults | Permanent Faults | Intermittent Faults |
---|---|---|---|
Duration | Short-lived, temporary | Persistent, remains until repair or replacement | Sporadic, appears and disappears intermittently |
Persistence | Disappears after the fault condition passes | Consistently present until addressed | Recurs irregularly, not always present |
Causes | External factors (e.g., electromagnetic interference, cosmic rays) | Hardware defects, physical damage, wear-out | Unstable hardware conditions, loose connections, aging components |
Manifestation | Bit flips, glitches, temporary data corruption | Stuck-at faults, broken components, complete device failures | Occasional bit flips, intermittent signal issues, sporadic malfunctions |
Impact on ML Systems | Introduces temporary errors or noise in computations | Causes consistent errors or failures, affecting reliability | Leads to sporadic and unpredictable errors, challenging to diagnose and mitigate |
Detection | Error detection codes, comparison with expected values | Built-in self-tests, error detection codes, consistency checks | Monitoring for anomalies, analyzing error patterns and correlations |
Mitigation | Error correction codes, redundancy, checkpoint and restart | Hardware repair or replacement, component redundancy, failover mechanisms | Robust design, environmental control, runtime monitoring, fault-tolerant techniques |
18.4 Model Robustness
18.4.1 Adversarial Attacks
We first introduced adversarial attacks when discussing how slight changes to input data can trick a model into making incorrect predictions. These attacks often involve adding small, carefully designed perturbations to input data, which can cause the model to misclassify it, as shown in Figure 18.19. In this section, we will look at the different types of adversarial attacks and their impact on machine learning models. Understanding these attacks highlights why it is important to build models that are robust and able to handle these kinds of challenges.
Mechanisms
Gradient-based Attacks
One prominent category of adversarial attacks is gradient-based attacks. These attacks leverage the gradients of the ML model’s loss function to craft adversarial examples. The Fast Gradient Sign Method (FGSM) is a well-known technique in this category. FGSM perturbs the input data by adding small noise in the direction of the gradient of the loss with respect to the input. The goal is to maximize the model’s prediction error with minimal distortion to the original input.
The adversarial example is generated using the following formula: \[ x_{\text{adv}} = x + \epsilon \cdot \text{sign}\big(\nabla_x J(\theta, x, y)\big) \] Where:
- \(x\) is the original input,
- \(y\) is the true label,
- \(\theta\) represents the model parameters,
- \(J(\theta, x, y)\) is the loss function,
- \(\epsilon\) is a small scalar that controls the magnitude of the perturbation.
This method allows for fast and efficient generation of adversarial examples by taking a single step in the direction that increases the loss most rapidly, as shown in Figure 18.20.
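The update above translates almost directly into code. The sketch below assumes a PyTorch classifier with inputs scaled to [0, 1]; the toy model, image sizes, and epsilon value are illustrative.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Craft an FGSM adversarial example: x_adv = x + epsilon * sign(grad_x J)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()                                  # gradient of the loss w.r.t. the input
    x_adv = x_adv + epsilon * x_adv.grad.sign()      # one step in the loss-increasing direction
    return x_adv.clamp(0.0, 1.0).detach()            # keep pixels in the valid range

# Toy usage with an untrained classifier on random "images".
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
images = torch.rand(4, 3, 32, 32)
labels = torch.randint(0, 10, (4,))
adv_images = fgsm_attack(model, images, labels, epsilon=0.03)
print("max per-pixel perturbation:", (adv_images - images).abs().max().item())
```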
Another variant, the Projected Gradient Descent (PGD) attack, extends FGSM by iteratively applying the gradient update step, allowing for more refined and powerful adversarial examples. PGD projects each perturbation step back into a constrained norm ball around the original input, ensuring that the adversarial example remains within a specified distortion limit. This makes PGD a stronger white-box attack and a benchmark for evaluating model robustness.
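A sketch of PGD under the same assumptions (a PyTorch classifier, inputs in [0, 1], illustrative step size and iteration count) makes the projection step explicit:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=0.03, alpha=0.007, steps=10):
    """Iterative FGSM with projection back into the L-infinity ball around x."""
    x_orig = x.clone().detach()
    x_adv = x_orig + torch.empty_like(x_orig).uniform_(-epsilon, epsilon)  # random start
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()                    # gradient step
            x_adv = x_orig + (x_adv - x_orig).clamp(-epsilon, epsilon)   # project into the ball
            x_adv = x_adv.clamp(0.0, 1.0)                                # stay a valid image
    return x_adv.detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
images, labels = torch.rand(2, 3, 32, 32), torch.randint(0, 10, (2,))
adv = pgd_attack(model, images, labels)
print("L-infinity distortion:", (adv - images).abs().max().item())       # <= epsilon
```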
The Jacobian-based Saliency Map Attack (JSMA) is another gradient-based approach that identifies the most influential input features and perturbs them to create adversarial examples. By constructing a saliency map based on the Jacobian of the model’s outputs with respect to inputs, JSMA selectively alters a small number of input dimensions that are most likely to influence the target class. This makes JSMA more precise and targeted than FGSM or PGD, often requiring fewer perturbations to fool the model.
Gradient-based attacks are particularly effective in white-box settings, where the attacker has access to the model’s architecture and gradients. Their efficiency and relative simplicity have made them popular tools for both attacking and evaluating model robustness in research.
Optimization-based Attacks
These attacks formulate the generation of adversarial examples as an optimization problem. The Carlini and Wagner (C&W) attack is a prominent example in this category. It finds the smallest perturbation that can cause misclassification while maintaining the perceptual similarity to the original input. The C&W attack employs an iterative optimization process to minimize the perturbation while maximizing the model’s prediction error. It uses a customized loss function with a confidence term to generate more confident misclassifications.
C&W attacks are especially difficult to detect because the perturbations are typically imperceptible to humans, and they often bypass many existing defenses. The attack can be formulated under various norm constraints (e.g., L2, L∞) depending on the desired properties of the adversarial perturbation.
Another optimization-based approach is the Elastic Net Attack to DNNs (EAD), which incorporates elastic net regularization (a combination of L1 and L2 penalties) to generate adversarial examples with sparse perturbations. This can lead to minimal and localized changes in the input, which are harder to identify and filter. EAD is particularly useful in settings where perturbations need to be constrained in both magnitude and spatial extent.
These attacks are more computationally intensive than gradient-based methods but offer finer control over the adversarial example’s properties. They are often used in high-stakes domains where stealth and precision are critical.
Transfer-based Attacks
Transfer-based attacks exploit the transferability property of adversarial examples. Transferability refers to the phenomenon where adversarial examples crafted for one ML model can often fool other models, even if they have different architectures or were trained on different datasets. This enables attackers to generate adversarial examples using a surrogate model and then transfer them to the target model without requiring direct access to its parameters or gradients.
This property underlies the feasibility of black-box attacks, where the adversary cannot query gradients but can still fool a model by crafting attacks on a publicly available or similar substitute model. Transfer-based attacks are particularly relevant in practical threat scenarios, such as attacking commercial ML APIs, where the attacker can observe inputs and outputs but not internal computations.
Attack success often depends on factors like similarity between models, alignment in training data, and the regularization techniques used. Techniques like input diversity (random resizing, cropping) and momentum during optimization can be used to increase transferability.
Physical-world Attacks
Physical-world attacks bring adversarial examples into the realm of real-world scenarios. These attacks involve creating physical objects or manipulations that can deceive ML models when captured by sensors or cameras. Adversarial patches, for example, are small, carefully designed patterns that can be placed on objects to fool object detection or classification models. These patches are designed to work under varying lighting conditions, viewing angles, and distances, making them robust in real-world environments.
When attached to real-world objects, such as a stop sign or a piece of clothing, these patches can cause models to misclassify or fail to detect the objects accurately. Notably, these attacks remain effective even when the patches are printed out and viewed through a camera lens, bridging the digital and physical divide in adversarial ML.
Adversarial objects, such as 3D-printed sculptures or modified road signs, can also be crafted to deceive ML systems in physical environments. For example, a 3D turtle object was shown to be consistently classified as a rifle by an image classifier, even when viewed from different angles. These attacks underscore the risks facing AI systems deployed in physical spaces, such as autonomous vehicles, drones, and surveillance systems.
Research into physical-world attacks also includes efforts to develop universal adversarial perturbations—perturbations that can fool a wide range of inputs and models. These threats raise serious questions about safety, robustness, and generalization in AI systems.
Summary
Table 18.2 provides a concise overview of the different categories of adversarial attacks, including gradient-based attacks (FGSM, PGD, JSMA), optimization-based attacks (C&W, EAD), transfer-based attacks, and physical-world attacks (adversarial patches and objects). Each attack is briefly described, highlighting its key characteristics and mechanisms.
Attack Category | Attack Name | Description |
---|---|---|
Gradient-based | Fast Gradient Sign Method (FGSM) | Perturbs input data by adding small noise in the gradient direction to maximize prediction error. |
Gradient-based | Projected Gradient Descent (PGD) | Extends FGSM by iteratively applying the gradient update step for more refined adversarial examples. |
Gradient-based | Jacobian-based Saliency Map Attack (JSMA) | Identifies influential input features and perturbs them to create adversarial examples. |
Optimization-based | Carlini and Wagner (C&W) Attack | Finds the smallest perturbation that causes misclassification while maintaining perceptual similarity. |
Optimization-based | Elastic Net Attack to DNNs (EAD) | Incorporates elastic net regularization to generate adversarial examples with sparse perturbations. |
Transfer-based | Transferability-based Attacks | Exploits the transferability of adversarial examples across different models, enabling black-box attacks. |
Physical-world | Adversarial Patches | Small, carefully designed patches placed on objects to fool object detection or classification models. |
Physical-world | Adversarial Objects | Physical objects (e.g., 3D-printed sculptures, modified road signs) crafted to deceive ML systems in real-world scenarios. |
The mechanisms of adversarial attacks reveal the intricate interplay between the ML model’s decision boundaries, the input data, and the attacker’s objectives. By carefully manipulating the input data, attackers can exploit the model’s sensitivities and blind spots, leading to incorrect predictions. The success of adversarial attacks highlights the need for a deeper understanding of ML models’ robustness and generalization properties.
Defending against adversarial attacks requires a multifaceted approach. Adversarial training is one common defense strategy in which models are trained on adversarial examples to improve robustness. Exposing the model to adversarial examples during training teaches it to classify them correctly and become more resilient to attacks. Defensive distillation, input preprocessing, and ensemble methods are other techniques that can help mitigate the impact of adversarial attacks.
As adversarial machine learning evolves, researchers explore new attack mechanisms and develop more sophisticated defenses. The arms race between attackers and defenders drives the need for constant innovation and vigilance in securing ML systems against adversarial threats. Understanding the mechanisms of adversarial attacks is crucial for developing robust and reliable ML models that can withstand the ever-evolving landscape of adversarial examples.
Impact on ML
Adversarial attacks on machine learning systems have emerged as a significant concern in recent years, highlighting the potential vulnerabilities and risks associated with the widespread adoption of ML technologies. These attacks involve carefully crafted perturbations to input data that can deceive or mislead ML models, leading to incorrect predictions or misclassifications, as shown in Figure 18.21. The impact of adversarial attacks on ML systems is far-reaching and can have serious consequences in various domains.
One striking example of the impact of adversarial attacks was demonstrated by researchers in 2017. They experimented with small black and white stickers on stop signs (Eykholt et al. 2017). To the human eye, these stickers did not obscure the sign or prevent its interpretability. However, when images of the sticker-modified stop signs were fed into standard traffic sign classification ML models, a shocking result emerged. The models misclassified the stop signs as speed limit signs over 85% of the time.
This demonstration shed light on the alarming potential of simple adversarial stickers to trick ML systems into misreading critical road signs. The implications of such attacks in the real world are significant, particularly in the context of autonomous vehicles. If deployed on actual roads, these adversarial stickers could cause self-driving cars to misinterpret stop signs as speed limits, leading to dangerous situations, as shown in Figure 18.22. Researchers warned that this could result in rolling stops or unintended acceleration into intersections, endangering public safety.
The case study of the adversarial stickers on stop signs provides a concrete illustration of how adversarial examples exploit how ML models recognize patterns. By subtly manipulating the input data in ways that are invisible to humans, attackers can induce incorrect predictions and create serious risks, especially in safety-critical applications like autonomous vehicles. The attack’s simplicity highlights the vulnerability of ML models to even minor changes in the input, emphasizing the need for robust defenses against such threats.
The impact of adversarial attacks extends beyond the degradation of model performance. These attacks raise significant security and safety concerns, particularly in domains where ML models are relied upon for critical decision-making. In healthcare applications, adversarial attacks on medical imaging models could lead to misdiagnosis or incorrect treatment recommendations, jeopardizing patient well-being (M.-J. Tsai, Lin, and Lee 2023). In financial systems, adversarial attacks could enable fraud or manipulation of trading algorithms, resulting in substantial economic losses.
Moreover, adversarial vulnerabilities undermine the trustworthiness and interpretability of ML models. If carefully crafted perturbations can easily fool models, confidence in their predictions and decisions erodes. Adversarial examples expose the models’ reliance on superficial patterns and the inability to capture the true underlying concepts, challenging the reliability of ML systems (Fursov et al. 2021).
Defending against adversarial attacks often requires additional computational resources and can impact the overall system performance. Techniques like adversarial training, where models are trained on adversarial examples to improve robustness, can significantly increase training time and computational requirements (Bai et al. 2021). Runtime detection and mitigation mechanisms, such as input preprocessing (Addepalli et al. 2020) or prediction consistency checks, introduce latency and affect the real-time performance of ML systems.
The presence of adversarial vulnerabilities also complicates the deployment and maintenance of ML systems. System designers and operators must consider the potential for adversarial attacks and incorporate appropriate defenses and monitoring mechanisms. Regular updates and retraining of models become necessary to adapt to new adversarial techniques and maintain system security and performance over time.
The impact of adversarial attacks on ML systems is significant and multifaceted. These attacks expose ML models’ vulnerabilities, from degrading model performance and raising security and safety concerns to challenging model trustworthiness and interpretability. Developers and researchers must prioritize the development of robust defenses and countermeasures to mitigate the risks posed by adversarial attacks. By addressing these challenges, we can build more secure, reliable, and trustworthy ML systems that can withstand the ever-evolving landscape of adversarial threats.
18.4.2 Data Poisoning
Data poisoning presents a critical challenge to the integrity and reliability of machine learning systems. By introducing carefully crafted malicious data into the training pipeline, adversaries can subtly manipulate model behavior in ways that are difficult to detect through standard validation procedures. Unlike adversarial examples, which target models at inference time, poisoning attacks exploit upstream components of the system—such as data collection, labeling, or ingestion. As ML systems are increasingly deployed in automated and high-stakes environments, understanding how poisoning occurs and how it propagates through the system is essential for developing effective defenses.
Characteristics
Data poisoning is an attack in which the training data is deliberately manipulated to compromise the performance or behavior of a machine learning model, as described in (Biggio, Nelson, and Laskov 2012) and illustrated in Figure 18.23. Attackers may alter existing training samples, introduce malicious examples, or interfere with the data collection pipeline. The result is a model that learns biased, inaccurate, or exploitable patterns.
In most cases, data poisoning unfolds in three stages.
In the injection stage, the attacker introduces poisoned samples into the training dataset. These samples may be altered versions of existing data or entirely new instances designed to blend in with clean examples. While they appear benign on the surface, these inputs are engineered to influence model behavior in subtle but deliberate ways. The attacker may target specific classes, insert malicious triggers, or craft outliers intended to distort the decision boundary.
During the training phase, the machine learning model incorporates the poisoned data and learns spurious or misleading patterns. These learned associations may bias the model toward incorrect classifications, introduce vulnerabilities, or embed backdoors. Because the poisoned data is often statistically similar to clean data, the corruption process typically goes unnoticed during standard model training and evaluation.
Finally, in the deployment stage, the attacker leverages the compromised model for malicious purposes. This could involve triggering specific behaviors—such as misclassifying an input containing a hidden pattern—or simply exploiting the model’s degraded accuracy in production. In real-world systems, such attacks can be difficult to trace back to training data, especially if the system’s behavior appears erratic only in edge cases or under adversarial conditions.
The consequences of such manipulation are especially severe in high-stakes domains like healthcare, where even small disruptions to training data can lead to dangerous misdiagnoses or loss of trust in AI-based systems (Marulli, Marrone, and Verde 2022).
Four main categories of poisoning attacks have been identified in the literature (Oprea, Singhal, and Vassilev 2022). In availability attacks, a substantial portion of the training data is poisoned with the aim of degrading overall model performance. A classic example involves flipping labels—for instance, systematically changing instances with true label \(y = 1\) to \(y = 0\) in a binary classification task. These attacks render the model unreliable across a wide range of inputs, effectively making it unusable.
In contrast, targeted poisoning attacks aim to compromise only specific classes or instances. Here, the attacker modifies just enough data to cause a small set of inputs to be misclassified, while overall accuracy remains relatively stable. This subtlety makes targeted attacks especially hard to detect.
Backdoor poisoning introduces hidden triggers into training data—subtle patterns or features that the model learns to associate with a particular output. When the trigger appears at inference time, the model is manipulated into producing a predetermined response. These attacks are often effective even if the trigger pattern is imperceptible to human observers.
Subpopulation poisoning focuses on compromising a specific subset of the data population. While similar in intent to targeted attacks, subpopulation poisoning applies availability-style degradation to a localized group—such as a particular demographic or feature cluster—while leaving the rest of the model’s performance intact. This distinction makes such attacks both highly effective and especially dangerous in fairness-sensitive applications.
A common thread across these poisoning strategies is their subtlety. Manipulated samples are typically indistinguishable from clean data, making them difficult to identify through casual inspection or standard data validation. These manipulations might involve small changes to numeric values, slight label inconsistencies, or embedded visual patterns—each designed to blend into the data distribution while still affecting model behavior.
Such attacks may be carried out by internal actors, like data engineers or annotators with privileged access, or by external adversaries who exploit weak points in the data collection pipeline. In crowdsourced environments or open data collection scenarios, poisoning can be as simple as injecting malicious samples into a shared dataset or influencing user-generated content.
Crucially, poisoning attacks often target the early stages of the ML pipeline—collection and preprocessing—where there may be limited oversight. If data is pulled from unverified sources or lacks strong validation protocols, attackers can slip in poisoned data that appears statistically normal. The absence of integrity checks, robust outlier detection, or lineage tracking only heightens the risk.
Ultimately, the goal of these attacks is to corrupt the learning process itself. A model trained on poisoned data may learn spurious correlations, overfit to false signals, or become vulnerable to highly specific exploit conditions. Whether the result is a degraded model or one with a hidden exploit path, the trustworthiness and safety of the system are fundamentally compromised.
Mechanisms
Data poisoning can be implemented through a variety of mechanisms, depending on the attacker’s access to the system and understanding of the data pipeline. These mechanisms reflect different strategies for how the training data can be corrupted to achieve malicious outcomes.
One of the most direct approaches involves modifying the labels of training data. In this method, an attacker selects a subset of training samples and alters their labels—flipping \(y = 1\) to \(y = 0\), or reassigning categories in multi-class settings. As shown in Figure 18.24, even small-scale label inconsistencies can lead to significant distributional shifts and learning disruptions.
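The sketch below shows, purely for illustration, how an availability-style label-flipping attack could be simulated on a binary classification dataset; `y` is assumed to be a NumPy array of 0/1 labels and `flip_fraction` is a hypothetical attack budget. Simulating poisoning this way is also a useful tool for stress-testing defenses.

```python
import numpy as np

def flip_labels(y, flip_fraction=0.1, seed=0):
    """Randomly flip a fraction of binary labels (0 <-> 1) to simulate poisoning."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    n_flip = int(flip_fraction * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]
    return y_poisoned, idx   # returning the flipped indices lets the injection be audited later
```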
Another mechanism involves modifying the input features of training examples without changing the labels. This might include imperceptible pixel-level changes in images, subtle perturbations in structured data, or embedding fixed patterns that act as triggers for backdoor attacks. These alterations are often designed using optimization techniques that maximize their influence on the model while minimizing detectability.
More sophisticated attacks generate entirely new, malicious training examples. These synthetic samples may be created using adversarial methods, generative models, or even data synthesis tools. The aim is to carefully craft inputs that will distort the decision boundary of the model when incorporated into the training set. Such inputs may appear natural and legitimate but are engineered to introduce vulnerabilities.
Other attackers focus on weaknesses in data collection and preprocessing. If the training data is sourced from web scraping, social media, or untrusted user submissions, poisoned samples can be introduced upstream. These samples may pass through insufficient cleaning or validation checks, reaching the model in a “trusted” form. This is particularly dangerous in automated pipelines where human review is limited or absent.
In physically deployed systems, attackers may manipulate data at the source—for example, altering the environment captured by a sensor. A self-driving car might encounter poisoned data if visual markers on a road sign are subtly altered, causing the model to misclassify it during training. This kind of environmental poisoning blurs the line between adversarial attacks and data poisoning, but the mechanism—compromising the training data—is the same.
Online learning systems represent another unique attack surface. These systems continuously adapt to new data streams, making them particularly susceptible to gradual poisoning. An attacker may introduce malicious samples incrementally, causing slow but steady shifts in model behavior. This form of attack is illustrated in Figure 18.25.
Insider collaboration adds a final layer of complexity. Malicious actors with legitimate access to training data—such as annotators, researchers, or data vendors—can craft poisoning strategies that are more targeted and subtle than external attacks. These insiders may have knowledge of the model architecture or training procedures, giving them an advantage in designing effective poisoning schemes.
Defending against these diverse mechanisms requires a multi-pronged approach: secure data collection protocols, anomaly detection, robust preprocessing pipelines, and strong access control. Validation mechanisms must be sophisticated enough to detect not only outliers but also cleverly disguised poisoned samples that sit within the statistical norm.
Impact on ML
The effects of data poisoning extend far beyond simple accuracy degradation. In the most general sense, a poisoned dataset leads to a corrupted model. But the specific consequences depend on the attack vector and the adversary’s objective.
One common outcome is the degradation of overall model performance. When large portions of the training set are poisoned—often through label flipping or noisy features—the model struggles to identify valid patterns, leading to lower accuracy, recall, or precision. In mission-critical applications like medical diagnosis or fraud detection, even small performance losses can result in significant real-world harm.
Targeted poisoning presents a different kind of danger. Rather than undermining the model’s general performance, these attacks cause specific misclassifications. A malware detector, for instance, may be engineered to ignore one particular signature, allowing a single attack to bypass security. Similarly, a facial recognition model might be manipulated to misidentify a specific individual, while functioning normally for others.
Some poisoning attacks introduce hidden vulnerabilities in the form of backdoors or trojans. These poisoned models behave as expected during evaluation but respond in a malicious way when presented with specific triggers. In such cases, attackers can “activate” the exploit on demand, bypassing system protections without triggering alerts.
Bias is another insidious impact of data poisoning. If an attacker poisons samples tied to a specific demographic or feature group, they can skew the model’s outputs in biased or discriminatory ways. Such attacks threaten fairness, amplify existing societal inequities, and are difficult to diagnose if the overall model metrics remain high.
Ultimately, data poisoning undermines the trustworthiness of the system itself. A model trained on poisoned data cannot be considered reliable, even if it performs well in benchmark evaluations. This erosion of trust has profound implications, particularly in fields like autonomous systems, financial modeling, and public policy.
Case Study: Art Protection via Poisoning
Interestingly, not all data poisoning is malicious. Researchers have begun to explore its use as a defensive tool, particularly in the context of protecting creative work from unauthorized use by generative AI models.
A compelling example is Nightshade, developed by researchers at the University of Chicago to help artists prevent their work from being scraped and used to train image generation models without consent (Shan et al. 2023). Nightshade allows artists to apply subtle perturbations to their images before publishing them online. These changes are invisible to human viewers but cause serious degradation in generative models that incorporate them into training.
When Stable Diffusion was trained on just 300 poisoned images, the model began producing bizarre outputs—such as cows when prompted with “car,” or cat-like creatures in response to “dog.” These results, visualized in Figure 18.26, show how effectively poisoned samples can distort a model’s conceptual associations.
What makes Nightshade especially potent is the cascading effect of poisoned concepts. Because generative models rely on semantic relationships between categories, a poisoned “car” can bleed into related concepts like “truck,” “bus,” or “train,” leading to widespread hallucinations.
However, like any powerful tool, Nightshade also introduces risks. The same technique used to protect artistic content could be repurposed to sabotage legitimate training pipelines, highlighting the dual-use dilemma18 at the heart of modern machine learning security.
18 Dual-use Dilemma: In AI, the challenge of mitigating misuse of technology that has both positive and negative potential uses.
18.4.3 Distribution Shifts
Characteristics
Distribution shift refers to the phenomenon where the data distribution encountered by a machine learning model during deployment differs from the distribution it was trained on, as shown in Figure 18.27. This change in distribution is not necessarily the result of a malicious attack. Rather, it often reflects the natural evolution of real-world environments over time. In essence, the statistical properties, patterns, or assumptions in the data may change between training and inference phases, which can lead to unexpected or degraded model performance.
A distribution shift typically takes one of several forms:
- Covariate shift, where the input distribution \(P(x)\) changes while the conditional label distribution \(P(y \mid x)\) remains stable.
- Label shift, where the label distribution \(P(y)\) changes while \(P(x \mid y)\) stays the same.
- Concept drift, where the relationship between inputs and outputs—\(P(y \mid x)\)—evolves over time.
These formal definitions help frame more intuitive examples of shift that are commonly encountered in practice.
One of the most common causes is domain mismatch, where the model is deployed on data from a different domain than it was trained on. For example, a sentiment analysis model trained on movie reviews may perform poorly when applied to tweets, due to differences in language, tone, and structure. In this case, the model has learned domain-specific features that do not generalize well to new contexts.
Another major source is temporal drift, where the input distribution evolves gradually or suddenly over time. In production settings, data changes due to new trends, seasonal effects, or shifts in user behavior. For instance, in a fraud detection system, fraud patterns may evolve as adversaries adapt. Without ongoing monitoring or retraining, models become stale and ineffective. This form of shift is visualized in Figure 18.28.
Contextual changes arise when deployment environments differ from training conditions due to external factors such as lighting, sensor variation, or user behavior. For example, a vision model trained in a lab under controlled lighting may underperform when deployed in outdoor or dynamic environments.
Another subtle but critical factor is unrepresentative training data. If the training dataset fails to capture the full variability of the production environment, the model may generalize poorly. For example, a facial recognition model trained predominantly on one demographic group may produce biased or inaccurate predictions when deployed more broadly. In this case, the shift reflects missing diversity or structure in the training data.
Distribution shifts like these can dramatically reduce the performance and reliability of ML models in production. Building robust systems requires not only understanding these shifts, but actively detecting and responding to them as they emerge.
Mechanisms
Distribution shifts arise from a variety of underlying mechanisms—both natural and system-driven. Understanding these mechanisms helps practitioners detect, diagnose, and design mitigation strategies.
One common mechanism is a change in data sources. When data collected at inference time comes from different sensors, APIs, platforms, or hardware than the training data, even subtle differences in resolution, formatting, or noise can introduce significant shifts. For example, a speech recognition model trained on audio from one microphone type may struggle with data from a different device.
Temporal evolution refers to changes in the underlying data over time. In recommendation systems, user preferences shift. In finance, market conditions change. These shifts may be slow and continuous or abrupt and disruptive. Without temporal awareness or continuous evaluation, models can become obsolete—often without warning.
Domain-specific variation arises when a model trained on one setting is applied to another. A medical diagnosis model trained on data from one hospital may underperform in another due to differences in equipment, demographics, or clinical workflows. These variations often require explicit adaptation strategies, such as domain generalization or fine-tuning.
Selection bias occurs when the training data does not accurately reflect the target population. This may result from sampling strategies, data access constraints, or labeling choices. The result is a model that overfits to specific segments and fails to generalize. Addressing this requires thoughtful data collection and continuous validation.
Feedback loops are a particularly subtle mechanism. In some systems, model predictions influence user behavior, which in turn affects future inputs. For instance, a dynamic pricing model might set prices that change buying patterns, which then distort the distribution of future training data. These loops can reinforce narrow patterns and make model behavior difficult to predict.
Lastly, adversarial manipulation can induce distribution shifts deliberately. Attackers may introduce out-of-distribution samples or craft inputs that exploit weak spots in the model’s decision boundary. These inputs may lie far from the training distribution and can cause unexpected or unsafe predictions.
These mechanisms often interact, making real-world distribution shift detection and mitigation complex. From a systems perspective, this complexity necessitates ongoing monitoring, logging, and feedback pipelines—features often absent in early-stage or static ML deployments.
Impact on ML
Distribution shift can affect nearly every dimension of ML system performance, from prediction accuracy and latency to user trust and system maintainability.
A common and immediate consequence is degraded predictive performance. When the data at inference time differs from training data, the model may produce systematically inaccurate or inconsistent predictions. This erosion of accuracy is particularly dangerous in high-stakes applications like fraud detection, autonomous vehicles, or clinical decision support.
Another serious effect is loss of reliability and trustworthiness. As distribution shifts, users may notice inconsistent or erratic behavior. For example, a recommendation system might begin suggesting irrelevant or offensive content. Even if overall accuracy metrics remain acceptable, loss of user trust can undermine the system’s value.
Distribution shift also amplifies model bias. If certain groups or data segments are underrepresented in the training data, the model may fail more frequently on those groups. Under shifting conditions, these failures can become more pronounced, resulting in discriminatory outcomes or fairness violations.
There is also a rise in uncertainty and operational risk. In many production settings, model decisions feed directly into business operations or automated actions. Under shift, these decisions become less predictable and harder to validate, increasing the risk of cascading failures or poor decisions downstream.
From a system maintenance perspective, distribution shifts complicate retraining and deployment workflows. Without robust mechanisms for drift detection and performance monitoring, shifts may go unnoticed until performance degrades significantly. Once detected, retraining may be required—raising challenges related to data collection, labeling, model rollback, and validation. This creates friction in continuous integration and deployment (CI/CD) workflows and can significantly slow down iteration cycles.
Moreover, distribution shift increases vulnerability to adversarial attacks. Attackers can exploit the model’s poor calibration on unfamiliar data, using slight perturbations to push inputs outside the training distribution and cause failures. This is especially concerning when system feedback loops or automated decisioning pipelines are in place.
From a systems perspective, distribution shift is not just a modeling concern—it is a core operational challenge. It requires end-to-end system support: mechanisms for data logging, drift detection, automated alerts, model versioning, and scheduled retraining. ML systems must be designed to detect when performance degrades in production, diagnose whether a distribution shift is the cause, and trigger appropriate mitigation actions. This might include human-in-the-loop review, fallback strategies, model retraining pipelines, or staged deployment rollouts.
In mature ML systems, handling distribution shift becomes a matter of infrastructure, observability, and automation, not just modeling technique. Failing to account for it risks silent model failure in dynamic, real-world environments—precisely where ML systems are expected to deliver the most value.
A summary of common types of distribution shifts, their effects on model performance, and potential system-level responses is shown in Table 18.3.
Summary of Distribution Shifts and System Implications
Type of Shift | Cause or Example | Consequence for Model | System-Level Response |
---|---|---|---|
Covariate Shift | Change in input features (e.g., sensor calibration drift) | Model misclassifies new inputs despite consistent labels | Monitor input distributions; retrain with updated features |
Label Shift | Change in label distribution (e.g., new class frequencies in usage) | Prediction probabilities become skewed | Track label priors; reweight or adapt output calibration |
Concept Drift | Evolving relationship between inputs and outputs (e.g. fraud tactics) | Model performance degrades over time | Retrain frequently; use continual or online learning |
Domain Mismatch | Train on reviews, deploy on tweets | Poor generalization due to different vocabularies or styles | Use domain adaptation or fine-tuning |
Contextual Change | New deployment environment (e.g., lighting, user behavior) | Performance varies by context | Collect contextual data; monitor conditional accuracy |
Selection Bias | Underrepresentation during training | Biased predictions for unseen groups | Validate dataset balance; augment training data |
Feedback Loops | Model outputs affect future inputs (e.g., recommender systems) | Reinforced drift, unpredictable patterns | Monitor feedback effects; consider counterfactual logging |
Adversarial Shift | Attackers introduce OOD inputs or perturbations | Model becomes vulnerable to targeted failures | Use robust training; detect out-of-distribution inputs |
18.4.4 Detection and Mitigation
The detection and mitigation of threats to ML systems requires combining defensive strategies across multiple layers. These include techniques to identify and counter adversarial attacks, data poisoning attempts, and distribution shifts that can degrade model performance and reliability. Through systematic application of these protections, ML systems can maintain robustness when deployed in dynamic real-world environments.
Adversarial Attacks
As discussed earlier, adversarial attacks pose a significant threat to the robustness and reliability of ML systems. These attacks involve crafting carefully designed inputs, known as adversarial examples, to deceive ML models and cause them to make incorrect predictions. To safeguard ML systems against such attacks, it is crucial to develop effective techniques for detecting and mitigating these threats.
Detection Techniques
Detecting adversarial examples is the first line of defense against adversarial attacks. Several techniques have been proposed to identify and flag suspicious inputs that may be adversarial.
Statistical methods aim to detect adversarial examples by analyzing the statistical properties of the input data. These methods often compare the input data distribution to a reference distribution, such as the training data distribution or a known benign distribution. Techniques like the Kolmogorov-Smirnov test (Berger and Zhou 2014) or the Anderson-Darling test can be used to measure the discrepancy between the distributions and flag inputs that deviate significantly from the expected distribution.
Kernel density estimation (KDE) is a non-parametric technique used to estimate the probability density function of a dataset. In the context of adversarial example detection, KDE can be used to estimate the density of benign examples in the input space. Adversarial examples often lie in low-density regions and can be detected by comparing their estimated density to a threshold. Inputs with an estimated density below the threshold are flagged as potential adversarial examples.
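A minimal sketch of this idea using scikit-learn's `KernelDensity`, assuming `X_benign` is a 2-D array of flattened benign feature vectors; the Gaussian bandwidth and the percentile used to set the rejection threshold are illustrative choices that would need tuning in practice.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_density_detector(X_benign, bandwidth=0.5, percentile=1.0):
    """Fit a KDE on benign features and derive a low-density rejection threshold."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X_benign)
    scores = kde.score_samples(X_benign)            # log-density of benign data
    threshold = np.percentile(scores, percentile)   # e.g., the 1st percentile
    return kde, threshold

def flag_low_density(kde, threshold, X):
    """Boolean mask of inputs whose estimated density is suspiciously low."""
    return kde.score_samples(X) < threshold
```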
Another technique is feature squeezing (Panda, Chakraborty, and Roy 2019), which reduces the complexity of the input space by applying dimensionality reduction or discretization. The idea behind feature squeezing is that adversarial examples often rely on small, imperceptible perturbations that can be eliminated or reduced through these transformations. Inconsistencies can be detected by comparing the model’s predictions on the original input and the squeezed input, indicating the presence of adversarial examples.
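The following sketch illustrates one common squeezer, bit-depth reduction, together with a prediction-consistency score; `predict_fn` is a hypothetical function returning class probabilities, and inputs are assumed to be scaled to [0, 1].

```python
import numpy as np

def reduce_bit_depth(x, bits=4):
    """Quantize pixel values in [0, 1] to 2**bits levels (a simple feature squeezer)."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def squeezing_score(predict_fn, x, bits=4):
    """L1 gap between predictions on the original and squeezed inputs.

    Large gaps suggest the input may be adversarial; a detection threshold
    would be calibrated on benign data.
    """
    p_original = predict_fn(x)
    p_squeezed = predict_fn(reduce_bit_depth(x, bits))
    return np.abs(p_original - p_squeezed).sum(axis=-1)
```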
Model uncertainty estimation techniques try to quantify the confidence or uncertainty associated with a model’s predictions. Adversarial examples often exploit regions of high uncertainty in the model’s decision boundary. By estimating the uncertainty using techniques like Bayesian neural networks, dropout-based uncertainty estimation, or ensemble methods, inputs with high uncertainty can be flagged as potential adversarial examples.
Defense Strategies
Once adversarial examples are detected, various defense strategies can be employed to mitigate their impact and improve the robustness of ML models.
Adversarial training is a technique that involves augmenting the training data with adversarial examples and retraining the model on this augmented dataset. Exposing the model to adversarial examples during training teaches it to classify them correctly and makes it more robust to adversarial attacks. Adversarial training can be performed using various attack methods, such as the Fast Gradient Sign Method or Projected Gradient Descent (Madry et al. 2017).
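As a rough sketch of how this looks in practice, the loop below reuses the `pgd_attack` sketch from earlier in this chapter and assumes a standard PyTorch `model`, `DataLoader`, and `optimizer`; details such as mixing clean and adversarial batches are omitted.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=0.03):
    """One epoch of adversarial training: fit the model on PGD versions of each batch."""
    model.train()
    total_loss = 0.0
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, epsilon=epsilon)  # from the PGD sketch above
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / max(len(loader), 1)
```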
Defensive distillation (Papernot et al. 2016) is a technique that trains a second model (the student model) to mimic the behavior of the original model (the teacher model). The student model is trained on the soft labels produced by the teacher model, which are less sensitive to small perturbations. Using the student model for inference can reduce the impact of adversarial perturbations, as the student model learns to generalize better and is less sensitive to adversarial noise.
Input preprocessing and transformation techniques try to remove or mitigate the effect of adversarial perturbations before feeding the input to the ML model. These techniques include image denoising, JPEG compression, random resizing, padding, or applying random transformations to the input data. By reducing the impact of adversarial perturbations, these preprocessing steps can help improve the model’s robustness to adversarial attacks.
Ensemble methods combine multiple models to make more robust predictions. The ensemble can reduce the impact of adversarial attacks by using a diverse set of models with different architectures, training data, or hyperparameters. Adversarial examples that fool one model may not fool others in the ensemble, leading to more reliable and robust predictions. Model diversification techniques, such as using different preprocessing techniques or feature representations for each model in the ensemble, can further enhance the robustness.
Evaluation and Testing
Thorough evaluation and testing are needed to assess the effectiveness of adversarial defense techniques and to measure the robustness of ML models.
Adversarial robustness metrics quantify the model’s resilience to adversarial attacks. These metrics can include the model’s accuracy on adversarial examples, the average distortion required to fool the model, or the model’s performance under different attack strengths. By comparing these metrics across different models or defense techniques, practitioners can assess and compare their robustness levels.
Standardized adversarial attack benchmarks and datasets provide a common ground for evaluating and comparing the robustness of ML models. These benchmarks include datasets with pre-generated adversarial examples and tools and frameworks for generating adversarial attacks. Examples of popular adversarial attack benchmarks include the MNIST-C, CIFAR-10-C, and ImageNet-C (Hendrycks and Dietterich 2019) datasets, which contain corrupted or perturbed versions of the original datasets.
Practitioners can develop more robust and resilient ML systems by leveraging these adversarial example detection techniques, defense strategies, and robustness evaluation methods. However, it is important to note that adversarial robustness is an ongoing research area, and no single technique provides complete protection against all types of adversarial attacks. A comprehensive approach that combines multiple defense mechanisms and regular testing is essential to maintain the security and reliability of ML systems in the face of evolving adversarial threats.
Data Poisoning
Recall that data poisoning is an attack that targets the integrity of the training data used to build ML models. By manipulating or corrupting the training data, attackers can influence the model’s behavior and cause it to make incorrect predictions or perform unintended actions. Detecting and mitigating data poisoning attacks is crucial to ensure the trustworthiness and reliability of ML systems, as shown in Figure 18.30.
Anomaly Detection Techniques
Statistical outlier detection methods identify data points that deviate significantly from most data. These methods assume that poisoned data instances are likely to be statistical outliers. Techniques such as the Z-score method, Tukey’s method, or the Mahalanobis distance can be used to measure the deviation of each data point from the central tendency of the dataset. Data points that exceed a predefined threshold are flagged as potential outliers and considered suspicious for data poisoning.
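A minimal Z-score screen over numeric features might look like the sketch below; the threshold of three standard deviations is a common but arbitrary convention, and `X` is assumed to be a 2-D NumPy array of training features.

```python
import numpy as np

def zscore_outliers(X, threshold=3.0):
    """Flag rows whose largest absolute per-feature Z-score exceeds the threshold."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-12          # guard against zero-variance features
    z = np.abs((X - mu) / sigma)
    return z.max(axis=1) > threshold       # boolean mask of suspicious rows
```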
Clustering-based methods group similar data points together based on their features or attributes. The assumption is that poisoned data instances may form distinct clusters or lie far away from the normal data clusters. By applying clustering algorithms like K-means, DBSCAN, or hierarchical clustering, anomalous clusters or data points that do not belong to any cluster can be identified. These anomalous instances are then treated as potentially poisoned data.
Autoencoders are neural networks trained to reconstruct the input data from a compressed representation, as shown in Figure 18.31. They can be used for anomaly detection by learning the normal patterns in the data and identifying instances that deviate from them. During training, the autoencoder is trained on clean, unpoisoned data. At inference time, the reconstruction error for each data point is computed. Data points with high reconstruction errors are considered abnormal and potentially poisoned, as they do not conform to the learned normal patterns.
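A compact PyTorch sketch of this approach is shown below; the architecture and hidden size are illustrative, the training loop on clean data is omitted, and `X` is assumed to be a float tensor of flattened features.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """A small fully connected autoencoder for flattened or tabular inputs."""
    def __init__(self, d, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, d)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_errors(model, X):
    """Per-sample mean squared reconstruction error; unusually high values are suspicious."""
    model.eval()
    with torch.no_grad():
        return ((model(X) - X) ** 2).mean(dim=1)
```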
Sanitization and Preprocessing
Data cleaning helps mitigate poisoning by identifying and removing or correcting noisy, incomplete, or inconsistent data points. Techniques such as data deduplication, missing value imputation, and outlier removal can be applied to improve the quality of the training data. By eliminating or filtering out suspicious or anomalous data points, the impact of poisoned instances can be reduced.
Data validation involves verifying the integrity and consistency of the training data. This can include checking for data type consistency, range validation, and cross-field dependencies. By defining and enforcing data validation rules, anomalous or inconsistent data points indicative of data poisoning can be identified and flagged for further investigation.
Data provenance and lineage tracking involve maintaining a record of data’s origin, transformations, and movements throughout the ML pipeline. By documenting the data sources, preprocessing steps, and any modifications made to the data, practitioners can trace anomalies or suspicious patterns back to their origin. This helps identify potential points of data poisoning and facilitates the investigation and mitigation process.
Robust Training
Robust optimization techniques can be used to modify the training objective to minimize the impact of outliers or poisoned instances. This can be achieved by using robust loss functions less sensitive to extreme values, such as the Huber loss or the modified Huber loss19. Regularization techniques20, such as L1 or L2 regularization, can also help in reducing the model’s sensitivity to poisoned data by constraining the model’s complexity and preventing overfitting.
19 Huber Loss: A loss function used in robust regression that is less sensitive to outliers in data than squared error loss.
20 Regularization: A method used in neural networks to prevent overfitting in models by adding a cost term to the loss function.
21 Minimax: A decision-making strategy, used in game theory and decision theory, which tries to minimize the maximum possible loss.
Robust loss functions are designed to be less sensitive to outliers or noisy data points. Examples include the modified Huber loss, the Tukey loss (Beaton and Tukey 1974), and the trimmed mean loss. These loss functions down-weight or ignore the contribution of abnormal instances during training, reducing their impact on the model’s learning process. Robust objective functions, such as the minimax21 or distributionally robust objective, aim to optimize the model’s performance under worst-case scenarios or in the presence of adversarial perturbations.
Data augmentation techniques involve generating additional training examples by applying random transformations or perturbations to the existing data, as shown in Figure 18.32. This helps in increasing the diversity and robustness of the training dataset. By introducing controlled variations in the data, the model becomes less sensitive to specific patterns or artifacts that may be present in poisoned instances. Randomization techniques, such as random subsampling or bootstrap aggregating, can also help reduce the impact of poisoned data by training multiple models on different subsets of the data and combining their predictions.
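For image data, a randomized augmentation pipeline such as the torchvision sketch below ensures that each epoch sees slightly different variants of every sample, diluting the influence of any single, possibly poisoned, example; the specific transforms and parameters are illustrative.

```python
from torchvision import transforms

# Randomized augmentations applied to PIL images at training time.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```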
Secure Data Sourcing
Implementing the best data collection and curation practices can help mitigate the risk of data poisoning. This includes establishing clear data collection protocols, verifying the authenticity and reliability of data sources, and conducting regular data quality assessments. Sourcing data from trusted and reputable providers and following secure data handling practices can reduce the likelihood of introducing poisoned data into the training pipeline.
Strong data governance and access control mechanisms are essential to prevent unauthorized modifications or tampering with the training data. This involves defining clear roles and responsibilities for data access, implementing access control policies based on the principle of least privilege,22 and monitoring and logging data access activities. By restricting access to the training data and maintaining an audit trail, potential data poisoning attempts can be detected and investigated.
22 Principle of Least Privilege: A security concept in which a user is given the minimum levels of access necessary to complete his/her job functions.
23 Data Sanitization: The process of deliberately, permanently, and irreversibly removing or destroying the data stored on a memory device to make it unrecoverable.
Detecting and mitigating data poisoning attacks requires a multifaceted approach that combines anomaly detection, data sanitization,23 robust training techniques, and secure data sourcing practices. By implementing these measures, ML practitioners can improve the resilience of their models against data poisoning and ensure the integrity and trustworthiness of the training data. However, it is important to note that data poisoning is an active area of research, and new attack vectors and defense mechanisms continue to emerge. Staying informed about the latest developments and adopting a proactive and adaptive approach to data security is crucial for maintaining the robustness of ML systems.
Distribution Shifts
Detection and Mitigation
Recall that distribution shifts occur when the data distribution encountered by a machine learning (ML) model during deployment differs from the distribution it was trained on. These shifts can significantly impact the model’s performance and generalization ability, leading to suboptimal or incorrect predictions. Detecting and mitigating distribution shifts is crucial to ensure the robustness and reliability of ML systems in real-world scenarios.
Detection Techniques
Statistical tests can be used to compare the distributions of the training and test data to identify significant differences. Techniques such as the Kolmogorov-Smirnov test or the Anderson-Darling test measure the discrepancy between two distributions and provide a quantitative assessment of the presence of distribution shift. By applying these tests to the input features or the model’s predictions, practitioners can detect if there is a statistically significant difference between the training and test distributions.
Divergence metrics quantify the dissimilarity between two probability distributions. Commonly used divergence metrics include the Kullback-Leibler (KL) divergence and the Jensen-Shannon (JS) divergence. By calculating the divergence between the training and test data distributions, practitioners can assess the extent of the distribution shift. High divergence values indicate a significant difference between the distributions, suggesting the presence of a distribution shift.
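The sketch below applies both ideas to a single feature, using SciPy's two-sample Kolmogorov-Smirnov test and the Jensen-Shannon distance over shared histograms; `train_feature` and `prod_feature` are assumed to be 1-D NumPy arrays, and the bin count is an illustrative choice.

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

def drift_report(train_feature, prod_feature, bins=30):
    """Compare one feature's training vs. production distribution."""
    ks = ks_2samp(train_feature, prod_feature)

    # Histogram both samples on a shared grid, then compute the JS distance
    # (the square root of the JS divergence).
    lo = min(train_feature.min(), prod_feature.min())
    hi = max(train_feature.max(), prod_feature.max())
    p, _ = np.histogram(train_feature, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(prod_feature, bins=bins, range=(lo, hi), density=True)

    return {"ks_stat": ks.statistic, "ks_p_value": ks.pvalue,
            "js_distance": jensenshannon(p, q)}
```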
Uncertainty quantification techniques, such as Bayesian neural networks24 or ensemble methods25, can estimate the uncertainty associated with the model’s predictions. When a model is applied to data from a different distribution, its predictions may have higher uncertainty. By monitoring the uncertainty levels, practitioners can detect distribution shifts. If the uncertainty consistently exceeds a predetermined threshold for test samples, it suggests that the model is operating outside its trained distribution.
24 Bayesian Neural Networks: Neural networks that incorporate probability distributions over their weights, enabling uncertainty quantification in predictions and more robust decision making.
25 Ensemble Methods: An ML approach that combines several models to improve prediction accuracy.
In addition, domain classifiers are trained to distinguish between different domains or distributions. Practitioners can detect distribution shifts by training a classifier to differentiate between the training and test domains. If the domain classifier achieves high accuracy in distinguishing between the two domains, it indicates a significant difference in the underlying distributions. The performance of the domain classifier serves as a measure of the distribution shift.
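A simple version of this check, sketched below with scikit-learn, trains a logistic regression to separate training from production samples; a cross-validated AUC near 0.5 means the two sets are hard to tell apart, while values approaching 1.0 signal a pronounced shift. The feature matrices `X_train` and `X_prod` are assumed inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_classifier_auc(X_train, X_prod):
    """Cross-validated AUC of a classifier that separates training from production data."""
    X = np.vstack([X_train, X_prod])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```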
Mitigation Techniques
Transfer learning leverages knowledge gained from one domain to improve performance in another, as shown in Figure 18.33. By using pre-trained models or transferring learned features from a source domain to a target domain, transfer learning can help mitigate the impact of distribution shifts. The pre-trained model can be fine-tuned on a small amount of labeled data from the target domain, allowing it to adapt to the new distribution. Transfer learning is particularly effective when the source and target domains share similar characteristics or when labeled data in the target domain is scarce.
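A common pattern, sketched below assuming a recent version of torchvision, is to load an ImageNet-pretrained backbone, optionally freeze it, and replace the classification head before fine-tuning on the target domain; the choice of ResNet-18 is purely illustrative.

```python
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes, freeze_backbone=True):
    """Load a pretrained ResNet-18 and swap in a new head for the target domain."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False          # keep pretrained features fixed
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head is trainable
    return model
```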
Continual learning, also known as lifelong learning, enables ML models to learn continuously from new data distributions while retaining knowledge from previous distributions. Techniques such as elastic weight consolidation (EWC) (Kirkpatrick et al. 2017) or gradient episodic memory (GEM) (Lopez-Paz and Ranzato 2017) allow models to adapt to evolving data distributions over time. These techniques aim to balance the plasticity of the model (ability to learn from new data) with the stability of the model (retaining previously learned knowledge). By incrementally updating the model with new data and mitigating catastrophic forgetting, continual learning helps models stay robust to distribution shifts.
Data augmentation techniques, such as those we have seen previously, involve applying transformations or perturbations to the existing training data to increase its diversity and improve the model’s robustness to distribution shifts. By introducing variations in the data, such as rotations, translations, scaling, or adding noise, data augmentation helps the model learn invariant features and generalize better to unseen distributions. Data augmentation can be performed during training and inference to improve the model’s ability to handle distribution shifts.
Ensemble methods combine multiple models to make predictions more robust to distribution shifts. By training models on different subsets of the data, using different algorithms, or with different hyperparameters, ensemble methods can capture diverse aspects of the data distribution. When presented with a shifted distribution, the ensemble can leverage the strengths of individual models to make more accurate and stable predictions. Techniques like bagging, boosting, or stacking can create effective ensembles.
Regularly updating models with new data from the target distribution is crucial to mitigate the impact of distribution shifts. As the data distribution evolves, models should be retrained or fine-tuned on the latest available data to adapt to the changing patterns. Monitoring model performance and data characteristics can help detect when an update is necessary. By keeping the models up to date, practitioners can ensure they remain relevant and accurate in the face of distribution shifts.
Evaluating models using robust metrics less sensitive to distribution shifts can provide a more reliable assessment of model performance. Metrics such as the area under the precision-recall curve (AUPRC) or the F1 score26 are more robust to class imbalance and can better capture the model’s performance across different distributions. Additionally, using domain-specific evaluation metrics that align with the desired outcomes in the target domain can provide a more meaningful measure of the model’s effectiveness.
26 F1 Score: A measure of a model’s accuracy that combines precision (correct positive predictions) and recall (proportion of actual positives identified) into a single metric. Calculated as the harmonic mean of precision and recall.
Detecting and mitigating distribution shifts is an ongoing process that requires continuous monitoring, adaptation, and improvement. By employing a combination of detection techniques and mitigation strategies, ML practitioners can proactively identify and address distribution shifts, ensuring the robustness and reliability of their models in real-world deployments. It is important to note that distribution shifts can take various forms and may require domain-specific approaches depending on the nature of the data and the application. Staying informed about the latest research and best practices in handling distribution shifts is essential for building resilient ML systems.
18.5 Software Faults
Machine learning systems rely on complex software infrastructures that extend far beyond the models themselves. These systems are built on top of frameworks, libraries, and runtime environments that facilitate model training, evaluation, and deployment. As with any large-scale software system, the components that support ML workflows are susceptible to faults—unintended behaviors resulting from defects, bugs, or design oversights in the software. These faults can manifest across all stages of an ML pipeline and, if not identified and addressed, may impair performance, compromise security, or even invalidate results. This section examines the nature, causes, and consequences of software faults in ML systems, as well as strategies for their detection and mitigation.
18.5.1 Characteristics
Software faults in ML frameworks originate from various sources, including programming errors, architectural misalignments, and version incompatibilities. These faults exhibit several important characteristics that influence how they arise and propagate in practice.
One defining feature of software faults is their diversity. Faults can range from syntactic and logical errors to more complex manifestations such as memory leaks, concurrency bugs, or failures in integration logic. The broad variety of potential fault types complicates both their identification and resolution, as they often surface in non-obvious ways.
A second key characteristic is their tendency to propagate across system boundaries. An error introduced in a low-level module—such as a tensor allocation routine or a preprocessing function—can produce cascading effects that disrupt model training, inference, or evaluation. Because ML frameworks are often composed of interconnected components, a fault in one part of the pipeline can introduce failures in seemingly unrelated modules.
Some faults are intermittent, manifesting only under specific conditions such as high system load, particular hardware configurations, or rare data inputs. These transient faults are notoriously difficult to reproduce and diagnose, as they may not consistently appear during standard testing procedures.
Furthermore, software faults may subtly interact with ML models themselves. For example, a bug in a data transformation script might introduce systematic noise or shift the distribution of inputs, leading to biased or inaccurate predictions. Similarly, faults in the serving infrastructure may result in discrepancies between training-time and inference-time behaviors, undermining deployment consistency.
The consequences of software faults extend to a range of system properties. Faults may impair performance by introducing latency or inefficient memory usage; they may reduce scalability by limiting parallelism; or they may compromise reliability and security by exposing the system to unexpected behaviors or malicious exploitation.
Finally, the manifestation of software faults is often shaped by external dependencies, such as hardware platforms, operating systems, or third-party libraries. Incompatibilities arising from version mismatches or hardware-specific behavior may result in subtle, hard-to-trace bugs that only appear under certain runtime conditions.
A thorough understanding of these characteristics is essential for developing robust software engineering practices in ML. It also provides the foundation for the detection and mitigation strategies described later in this section.
18.5.2 Mechanisms
Software faults in ML frameworks arise through a variety of mechanisms, reflecting the complexity of modern ML pipelines and the layered architecture of supporting tools. These mechanisms correspond to specific classes of software failures that commonly occur in practice.
One prominent class involves resource mismanagement, particularly with respect to memory. Improper memory allocation—such as failing to release buffers or file handles—can lead to memory leaks and, eventually, to resource exhaustion. This is especially detrimental in deep learning applications, where large tensors and GPU memory allocations are common. As shown in Figure 18.34, inefficient memory usage or the failure to release GPU resources can cause training procedures to halt or significantly degrade runtime performance.
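A minimal sketch of the kind of memory hygiene that helps avoid this failure mode in a PyTorch inference loop is shown below; the structure is illustrative rather than prescriptive, and `run_inference` is a hypothetical helper.

```python
import torch

def run_inference(model, batches, device="cuda"):
    """Keep per-batch allocations from accumulating across a long-running job."""
    model.to(device).eval()
    results = []
    with torch.no_grad():                     # avoid holding gradient buffers
        for batch in batches:
            batch = batch.to(device)
            out = model(batch)
            results.append(out.cpu())         # move results off the GPU
            del batch, out                    # drop references promptly
    if torch.cuda.is_available():
        torch.cuda.empty_cache()              # return cached blocks to the driver
    return results
```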
Another recurring fault mechanism stems from concurrency and synchronization errors. In distributed or multi-threaded environments, incorrect coordination among parallel processes can lead to race conditions, deadlocks, or inconsistent states. These issues are often tied to the improper use of asynchronous operations, such as non-blocking I/O or parallel data ingestion. Synchronization bugs can corrupt the consistency of training states or produce unreliable model checkpoints.
Compatibility problems frequently arise from changes to the software environment. For example, upgrading a third-party library without validating downstream effects may introduce subtle behavioral changes or break existing functionality. These issues are exacerbated when the training and inference environments differ in hardware, operating system, or dependency versions. Reproducibility in ML experiments often hinges on managing these environmental inconsistencies.
Faults related to numerical instability are also common in ML systems, particularly in optimization routines. Improper handling of floating-point precision, division by zero, or underflow/overflow conditions can introduce instability into gradient computations and convergence procedures. Over many layers of computation, accumulated rounding errors can distort learned parameters or delay convergence.
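The classic example is an unnormalized softmax: shifting the logits by their maximum is mathematically a no-op but prevents overflow. The comparison below, written with NumPy, illustrates the difference.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)              # overflows for large logits
    return e / e.sum()

def softmax_stable(z):
    z = z - np.max(z)          # shift so the largest exponent is 0
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(logits))   # [nan nan nan] -- overflow produces inf/inf
print(softmax_stable(logits))  # [0.0900 0.2447 0.6652]
```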
Exception handling, though often overlooked, plays a crucial role in the stability of ML pipelines. Inadequate or overly generic exception management can cause systems to fail silently or crash under non-critical errors. Moreover, ambiguous error messages and poor logging practices impede diagnosis and prolong resolution times.
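A small sketch of the intended practice is shown below: catch specific exception types, log enough context to diagnose the failure, and decide deliberately whether to propagate or recover. The `load_batch` helper and its policy choices are illustrative.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("training")

def load_batch(path):
    """Illustrative loader: fail loudly with context instead of silently or generically."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        logger.error("Missing data shard: %s", path)
        raise                      # propagate: the pipeline cannot continue meaningfully
    except OSError as exc:
        logger.warning("Transient I/O error on %s (%s); caller may retry", path, exc)
        return None                # caller can retry or skip this shard
```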
These fault mechanisms, while diverse in origin, share the potential to significantly impair ML systems. Understanding how they arise provides the basis for effective system-level safeguards.
18.5.3 Impact on ML
The consequences of software faults can be profound, affecting not only the correctness of model outputs but also the broader usability and reliability of an ML system in production.
Performance degradation is a common symptom, often resulting from memory leaks, inefficient resource scheduling, or contention between concurrent threads. These issues tend to accumulate over time, leading to increased latency, reduced throughput, or even system crashes. As noted by Maas et al. (2024), the accumulation of performance regressions across components can severely restrict the operational capacity of ML systems deployed at scale.
In addition to slowing system performance, faults can lead to inaccurate predictions. For example, preprocessing errors or inconsistencies in feature encoding can subtly alter the input distribution seen by the model, producing biased or unreliable outputs. These kinds of faults are particularly insidious, as they may not trigger any obvious failure but still compromise downstream decisions. Over time, rounding errors and precision loss can amplify inaccuracies, particularly in deep architectures with many layers or long training durations.
Reliability is also undermined by software faults. Systems may crash unexpectedly, fail to recover from errors, or behave inconsistently across repeated executions. Intermittent faults are especially problematic in this context, as they erode user trust while eluding conventional debugging efforts. In distributed settings, faults in checkpointing or model serialization can cause training interruptions or data loss, reducing the resilience of long-running training pipelines.
Security vulnerabilities frequently arise from overlooked software faults. Buffer overflows, improper validation, or unguarded inputs can open the system to manipulation or unauthorized access. Attackers may exploit these weaknesses to alter the behavior of models, extract private data, or induce denial-of-service conditions. As described by Q. Li et al. (2023), such vulnerabilities pose serious risks, particularly when ML systems are integrated into critical infrastructure or handle sensitive user data.
Moreover, the presence of faults complicates development and maintenance. Debugging becomes more time-consuming, especially when fault behavior is non-deterministic or dependent on external configurations. Frequent software updates or library patches may introduce regressions that require repeated testing. This increased engineering overhead can slow iteration, inhibit experimentation, and divert resources from model development.
Taken together, these impacts underscore the importance of systematic software engineering practices in ML—practices that anticipate, detect, and mitigate the diverse failure modes introduced by software faults.
18.5.4 Detection and Mitigation
Addressing software faults in ML systems requires an integrated strategy that spans development, testing, deployment, and monitoring. A comprehensive mitigation framework should combine proactive detection methods with robust design patterns and operational safeguards.
To help summarize these techniques and clarify where each strategy fits in the ML lifecycle, Table 18.4 below categorizes detection and mitigation approaches by phase and objective. This table provides a high-level overview that complements the detailed explanations that follow.
| Category | Technique | Purpose | When to Apply |
|---|---|---|---|
| Testing and Validation | Unit testing, integration testing, regression testing | Verify correctness and identify regressions | During development |
| Static Analysis and Linting | Static analyzers, linters, code reviews | Detect syntax errors, unsafe operations, enforce best practices | Before integration |
| Runtime Monitoring & Logging | Metric collection, error logging, profiling | Observe system behavior, detect anomalies | During training and deployment |
| Fault-Tolerant Design | Exception handling, modular architecture, checkpointing | Minimize impact of failures, support recovery | Design and implementation phase |
| Update Management | Dependency auditing, test staging, version tracking | Prevent regressions and compatibility issues | Before system upgrades or deployment |
| Environment Isolation | Containerization (e.g., Docker, Kubernetes), virtual environments | Ensure reproducibility, avoid environment-specific bugs | Development, testing, deployment |
| CI/CD and Automation | Automated test pipelines, monitoring hooks, deployment gates | Enforce quality assurance and catch faults early | Continuously throughout development |
The first line of defense involves systematic testing. Unit testing verifies that individual components behave as expected under normal and edge-case conditions. Integration testing ensures that modules interact correctly across boundaries, while regression testing detects errors introduced by code changes. Continuous testing is essential in fast-moving ML environments, where pipelines evolve rapidly and small modifications may have system-wide consequences. As shown in Figure 18.35, automated regression tests help preserve functional correctness over time.
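As a concrete illustration, the pytest-style tests below exercise a hypothetical `normalize` preprocessing step under typical, degenerate, and invalid inputs; the function and its edge-case policy are examples, not a prescribed implementation.

```python
import numpy as np
import pytest

def normalize(x, eps=1e-8):
    """Scale features to zero mean and unit variance (illustrative pipeline step)."""
    x = np.asarray(x, dtype=np.float64)
    if x.size == 0:
        raise ValueError("cannot normalize an empty array")
    return (x - x.mean()) / (x.std() + eps)

def test_normalize_typical_input():
    out = normalize([1.0, 2.0, 3.0])
    assert abs(out.mean()) < 1e-9
    assert abs(out.std() - 1.0) < 1e-6

def test_normalize_constant_input_does_not_divide_by_zero():
    out = normalize([5.0, 5.0, 5.0])        # edge case: zero variance
    assert np.all(np.isfinite(out))

def test_normalize_rejects_empty_input():
    with pytest.raises(ValueError):
        normalize([])
```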
Static code analysis tools complement dynamic tests by identifying potential issues at compile time. These tools catch common errors such as variable misuse, unsafe operations, or violation of language-specific best practices. Combined with code reviews and consistent style enforcement, static analysis reduces the incidence of avoidable programming faults.
Runtime monitoring is critical for observing system behavior under real-world conditions. Logging frameworks should capture key signals such as memory usage, input/output traces, and exception events. Monitoring tools can track model throughput, latency, and failure rates, providing early warnings of software faults. Profiling helps identify performance bottlenecks and inefficiencies indicative of deeper architectural issues.
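A minimal sketch of this kind of instrumentation is shown below, using Python's standard `logging` module to record per-batch latency and failures around an inference call; the wrapper name and threshold are illustrative.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("serving")

def timed_predict(model_fn, inputs, slow_threshold_s=0.5):
    """Wrap inference with latency and failure logging (names are illustrative)."""
    start = time.perf_counter()
    try:
        outputs = model_fn(inputs)
    except Exception:
        logger.exception("Inference failed for batch of size %d", len(inputs))
        raise
    latency = time.perf_counter() - start
    logger.info("latency_s=%.4f batch_size=%d", latency, len(inputs))
    if latency > slow_threshold_s:
        logger.warning("Slow inference: %.3fs exceeds %.2fs budget", latency, slow_threshold_s)
    return outputs
```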
Robust system design further improves fault tolerance. Structured exception handling and assertion checks prevent small errors from cascading into system-wide failures. Redundant computations, fallback models, and failover mechanisms improve availability in the presence of component failures. Modular architectures that encapsulate state and isolate side effects make it easier to diagnose and contain faults. Checkpointing techniques, such as those discussed in (Eisenman et al. 2022), enable recovery from mid-training interruptions without data loss.
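The sketch below shows one common checkpointing pattern in PyTorch: write training state to a temporary file, rename it atomically, and resume from the latest checkpoint if one exists. The helper names and file layout are illustrative.

```python
import os
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    """Write training state atomically so a crash mid-write cannot corrupt it."""
    tmp_path = path + ".tmp"
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        tmp_path,
    )
    os.replace(tmp_path, path)      # atomic rename on the same filesystem

def resume_if_available(model, optimizer, path="checkpoint.pt"):
    """Return the epoch to resume from (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```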
Keeping ML software up to date is another key strategy. Applying regular updates and security patches helps address known bugs and vulnerabilities. However, updates must be validated through test staging environments to avoid regressions. Reviewing release notes and change logs ensures teams are aware of any behavioral changes introduced in new versions.
Containerization technologies like Docker and Kubernetes allow teams to define reproducible runtime environments that mitigate compatibility issues. By isolating system dependencies, containers prevent faults introduced by system-level discrepancies across development, testing, and production.
Finally, automated pipelines built around continuous integration and continuous deployment (CI/CD) provide an infrastructure for enforcing fault-aware development. Testing, validation, and monitoring can be embedded directly into the CI/CD flow. As shown in Figure 18.36, such pipelines reduce the risk of unnoticed regressions and ensure only tested code reaches deployment environments.
Together, these practices form a holistic approach to software fault management in ML systems. When adopted comprehensively, they reduce the likelihood of system failures, improve long-term maintainability, and foster trust in model performance and reproducibility.
18.6 Tools and Frameworks
Given the importance of developing robust AI systems, researchers and practitioners have in recent years built a wide range of tools and frameworks to understand how hardware faults manifest and propagate through ML systems. These tools and frameworks play a crucial role in evaluating the resilience of ML systems to hardware faults by simulating various fault scenarios and analyzing their impact on the system’s performance. This enables designers to identify potential vulnerabilities and develop effective mitigation strategies, ultimately creating more robust and reliable ML systems that can operate safely despite hardware faults. This section provides an overview of widely used fault models in the literature and the tools and frameworks developed to evaluate the impact of such faults on ML systems.
18.6.1 Fault and Error Models
As discussed previously, hardware faults can manifest in various ways, including transient, permanent, and intermittent faults. In addition to the type of fault under study, how the fault manifests is also important. For example, does the fault happen in a memory cell or during the computation of a functional unit? Is the impact on a single bit, or does it impact multiple bits? Does the fault propagate all the way and impact the application (causing an error), or does it get masked quickly and is considered benign? All these details impact what is known as the fault model, which plays a major role in simulating and measuring what happens to a system when a fault occurs.
To effectively study and understand the impact of hardware faults on ML systems, it is essential to understand the concepts of fault models and error models. A fault model describes how a hardware fault manifests itself in the system, while an error model represents how the fault propagates and affects the system’s behavior.
Fault models are often classified by several key properties. First, they can be defined by their duration: transient faults are temporary and vanish quickly; permanent faults persist indefinitely; and intermittent faults occur sporadically, making them particularly difficult to identify or predict. Another dimension is fault location, with faults arising in hardware components such as memory cells, functional units, or interconnects. Faults can also be characterized by their granularity—some faults affect only a single bit (e.g., a bitflip), while others impact multiple bits simultaneously, as in burst errors.
Error models, in contrast, describe the behavioral effects of faults as they propagate through the system. These models help researchers understand how initial hardware-level disturbances might manifest in the system’s behavior, such as through corrupted weights or miscomputed activations in an ML model. These models may operate at various abstraction levels, from low-level hardware errors to higher-level logical errors in ML frameworks.
The choice of fault or error model is central to robustness evaluation. For example, a system built to study single-bit transient faults (Sangchoolie, Pattabiraman, and Karlsson 2017) will not offer meaningful insight into the effects of permanent multi-bit faults (Wilkening et al. 2014), since its design and assumptions are grounded in a different fault model entirely.
It’s also important to consider how and where an error model is implemented. A single-bit flip at the architectural register level, modeled using simulators like gem5 (Binkert et al. 2011), differs meaningfully from a similar bit flip in a PyTorch model’s weight tensor. While both simulate value-level perturbations, the lower-level model captures microarchitectural effects that are often abstracted away in software frameworks.
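To make the contrast concrete, the snippet below shows what a software-level single-bit flip in a float32 weight looks like when the tensor's bytes are reinterpreted as integers; dedicated tools wrap this kind of perturbation with systematic fault-site selection and instrumentation, but the core operation is similar. The `flip_bit` helper and the chosen bit position are illustrative.

```python
import numpy as np
import torch

def flip_bit(weights: torch.Tensor, index: int, bit: int) -> torch.Tensor:
    """Flip one bit of one float32 weight by reinterpreting its bytes as int32."""
    flat = weights.detach().cpu().numpy().ravel().copy()
    as_int = flat.view(np.int32)           # reinterpret the same bytes
    as_int[index] ^= np.int32(1 << bit)    # XOR toggles the chosen bit
    return torch.from_numpy(flat.reshape(weights.shape))

w = torch.tensor([[0.5, -1.25], [2.0, 3.75]], dtype=torch.float32)
corrupted = flip_bit(w, index=0, bit=30)   # flip a high exponent bit of w[0, 0]
print(w[0, 0].item(), "->", corrupted[0, 0].item())
```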
Interestingly, some patterns hold consistently across abstraction levels. Research has shown that single-bit faults tend to be more disruptive than multi-bit faults, regardless of where in the system the model is defined (Sangchoolie, Pattabiraman, and Karlsson 2017; Papadimitriou and Gizopoulos 2021). However, other behaviors—such as error masking (Mohanram and Touba, n.d.)—may only be visible at lower levels, since higher-level tools may obscure or overlook the fault entirely. This behavior is illustrated in Figure 18.37.
To address these discrepancies, tools like Fidelity (He, Balaprakash, and Li 2020) have been developed to align fault models across abstraction layers. By mapping software-observed fault behaviors to corresponding hardware-level patterns (Cheng et al. 2016), Fidelity offers a more accurate means of simulating hardware faults at the software level. While lower-level tools capture the true propagation of errors through a hardware system, they are generally slower and more complex. Software-level tools, such as those implemented in PyTorch or TensorFlow, are faster and easier to use for large-scale robustness testing, albeit with less precision.
18.6.2 Hardware-Based Fault Injection
Hardware-based fault injection methods allow researchers to directly introduce faults into physical systems and observe their effects on machine learning (ML) models. These approaches are essential for validating assumptions made in software-level fault injection tools and for studying how real-world hardware faults influence system behavior. While most error injection tools used in ML robustness research are software-based—due to speed and scalability—hardware-based approaches remain critical for grounding higher-level error models. They are considered the most accurate means of studying the impact of faults on ML systems by manipulating the hardware directly to introduce errors.
These methods enable researchers to observe the system’s behavior under real-world fault conditions. Both software-based and hardware-based error injection tools are described in this section in more detail.
Methods
Two of the most common hardware-based fault injection methods are FPGA-based fault injection and radiation or beam testing.
FPGA-based Fault Injection. Field-Programmable Gate Arrays (FPGAs) are reconfigurable integrated circuits that can be programmed to implement various hardware designs. In the context of fault injection, FPGAs offer high precision and accuracy, as researchers can target specific bits or sets of bits within the hardware. By modifying the FPGA configuration, faults can be introduced at specific locations and times during the execution of an ML model. FPGA-based fault injection allows for fine-grained control over the fault model, enabling researchers to study the impact of different types of faults, such as single-bit flips or multi-bit errors. This level of control makes FPGA-based fault injection a valuable tool for understanding the resilience of ML systems to hardware faults.
While FPGA-based methods allow precise, controlled fault injection, other approaches aim to replicate fault conditions found in natural environments.
Radiation or Beam Testing. Radiation or beam testing (Velazco, Foucard, and Peronnard 2010) involves exposing the hardware running an ML model to high-energy particles, such as protons or neutrons, as illustrated in Figure 18.39. These particles can cause bitflips or other types of faults in the hardware, mimicking the effects of real-world radiation-induced faults. Beam testing is widely regarded as a highly accurate method for measuring the error rate induced by particle strikes on a running application. It provides a realistic representation of the faults encountered in real-world environments, particularly in applications exposed to high radiation levels, such as space systems or particle physics experiments. However, unlike FPGA-based fault injection, beam testing lacks the precision to target specific bits or components, as it is difficult to aim particle beams at precise hardware locations. Despite its operational complexity and cost, beam testing remains a well-regarded industry practice for ensuring hardware reliability.
Limitations
Despite their high accuracy, hardware-based fault injection methods have several limitations that can hinder their widespread adoption.
First, cost is a major barrier. Both FPGA-based and beam testing27 approaches require specialized hardware and facilities, which can be expensive to set up and maintain. This makes them less accessible to research groups with limited funding or infrastructure.
27 Beam Testing: A testing method that exposes hardware to controlled particle radiation to evaluate its resilience to soft errors. Common in aerospace, medical devices, and high-reliability computing.
Second, these methods face challenges in scalability. Injecting faults and collecting data directly on hardware is time-consuming, which limits the number of experiments that can be run in a reasonable timeframe. This is especially restrictive when analyzing large ML systems or performing statistical evaluations across many fault scenarios.
Third, there are flexibility limitations. Hardware-based methods may not be as adaptable as software-based alternatives when modeling a wide variety of fault and error types. Changing the experimental setup to accommodate a new fault model often requires time-intensive hardware reconfiguration.
Despite these limitations, hardware-based fault injection remains essential for validating the accuracy of software-based tools and for studying system behavior under real-world fault conditions. By combining the high fidelity of hardware-based methods with the scalability and flexibility of software-based tools, researchers can develop a more complete understanding of ML systems’ resilience to hardware faults and craft effective mitigation strategies.
18.6.3 Software-Based Fault Injection
As machine learning frameworks like TensorFlow, PyTorch, and Keras have become the dominant platforms for developing and deploying ML models, software-based fault injection tools have emerged as a flexible and scalable way to evaluate the robustness of these systems to hardware faults. Unlike hardware-based approaches, which operate directly on physical systems, software-based methods simulate the effects of hardware faults by modifying a model’s underlying computational graph, tensor values, or intermediate computations.
These tools have become increasingly popular in recent years because they integrate directly with ML development pipelines, require no specialized hardware, and allow researchers to conduct large-scale fault injection experiments quickly and cost-effectively. By simulating hardware-level faults—such as bit flips in weights, activations, or gradients—at the software level, these tools enable efficient testing of fault tolerance mechanisms and provide valuable insight into model vulnerabilities.
In the remainder of this section, we will examine the advantages and limitations of software-based fault injection methods, introduce major classes of tools (both general-purpose and domain-specific), and discuss how they contribute to building resilient ML systems.
Advantages and Trade-offs
Software-based fault injection tools offer several advantages that make them attractive for studying the resilience of ML systems.
One of the primary benefits is speed. Since these tools operate entirely within the software stack, they avoid the overhead associated with modifying physical hardware or configuring specialized test environments. This efficiency enables researchers to perform a large number of fault injection experiments in significantly less time. The ability to simulate a wide range of faults quickly makes these tools particularly useful for stress-testing large-scale ML models or conducting statistical analyses that require thousands of injections.
Another major advantage is flexibility. Software-based fault injectors can be easily adapted to model various types of faults. Researchers can simulate single-bit flips, multi-bit corruptions, or even more complex behaviors such as burst errors or partial tensor corruption. Additionally, software tools allow faults to be injected at different stages of the ML pipeline—during training, inference, or gradient computation—enabling precise targeting of different system components or layers.
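As an illustration of this flexibility, the sketch below uses a PyTorch forward hook to corrupt one activation at inference time, emulating a stuck-at-0 fault in a single layer's output; the model, hook, and fault location are placeholders rather than the API of any specific tool.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

def corrupt_activation(module, inputs, output):
    """Forward hook: zero out one activation element to emulate a stuck-at-0 fault."""
    corrupted = output.clone()
    corrupted[..., 0] = 0.0        # illustrative fault location
    return corrupted

# Attach the fault to the first linear layer's output during inference only.
handle = model[0].register_forward_hook(corrupt_activation)
with torch.no_grad():
    out = model(torch.randn(2, 8))
handle.remove()                     # restore fault-free behavior
```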
These tools are also highly accessible, as they require only standard ML development environments. Unlike hardware-based methods, there is no need for costly experimental setups, custom circuitry, or radiation testing facilities. This accessibility opens up fault injection research to a broader range of institutions and developers, including those working in academia, startups, or resource-constrained environments.
However, these advantages come with certain trade-offs. Chief among them is accuracy. Because software-based tools model faults at a higher level of abstraction, they may not fully capture the low-level hardware interactions that influence how faults actually propagate. For example, a simulated bit flip in an ML framework may not account for how data is buffered, cached, or manipulated at the hardware level, potentially leading to oversimplified conclusions.
Closely related is the issue of fidelity. While it is possible to approximate real-world fault behaviors, software-based tools may diverge from true hardware behavior, particularly when it comes to subtle interactions like masking, timing, or data movement. The results of such simulations depend heavily on the underlying assumptions of the error model and may require validation against real hardware measurements to be reliable.
Despite these limitations, software-based fault injection tools play an indispensable role in the study of ML robustness. Their speed, flexibility, and accessibility allow researchers to perform wide-ranging evaluations and inform the development of fault-tolerant ML architectures. In subsequent sections, we explore the major tools in this space, highlighting their capabilities and use cases.
Limitations
While software-based fault injection tools offer significant advantages in terms of speed, flexibility, and accessibility, they are not without limitations. These constraints can impact the accuracy and realism of fault injection experiments, particularly when assessing the robustness of ML systems to real-world hardware faults.
One major concern is accuracy. Because software-based tools operate at higher levels of abstraction, they may not always capture the full spectrum of effects that hardware faults can produce. Low-level hardware interactions—such as subtle timing errors, voltage fluctuations, or architectural side effects—can be missed entirely in high-level simulations. As a result, fault injection studies that rely solely on software models may under- or overestimate a system’s true vulnerability to certain classes of faults.
Closely related is the issue of fidelity. While software-based methods are often designed to emulate specific fault behaviors, the extent to which they reflect real-world hardware conditions can vary. For example, simulating a single-bit flip in the value of a neural network weight may not fully replicate how that same bit error would propagate through memory hierarchies or affect computation units on an actual chip. The more abstract the tool, the greater the risk that the simulated behavior will diverge from physical behavior under fault conditions.
Moreover, because software-based tools are easier to modify, there is a risk of unintentionally deviating from realistic fault assumptions. This can occur if the chosen fault model is overly simplified or not grounded in empirical data from actual hardware behavior. As discussed later in the section on bridging the hardware-software gap, tools like Fidelity (He, Balaprakash, and Li 2020) attempt to address these concerns by aligning software-level models with known hardware-level fault characteristics.
Despite these limitations, software-based fault injection remains a critical part of the ML robustness research toolkit. When used appropriately—especially in conjunction with hardware-based validation—these tools provide a scalable and efficient way to explore large design spaces, identify vulnerable components, and develop mitigation strategies. As fault modeling techniques continue to evolve, the integration of hardware-aware insights into software-based tools will be key to improving their realism and impact.
Tool Types
Over the past several years, software-based fault injection tools have been developed for a wide range of ML frameworks and use cases. These tools vary in their level of abstraction, target platforms, and the types of faults they can simulate. Many are built to integrate with popular machine learning libraries such as PyTorch and TensorFlow, making them accessible to researchers and practitioners already working within those ecosystems.
One of the earliest and most influential tools is Ares (Reagen et al. 2018), initially designed for the Keras framework. Developed at a time when deep neural networks (DNNs) were growing in popularity, Ares was one of the first tools to systematically explore the effects of hardware faults on DNNs. It provided support for injecting single-bit flips and evaluating bit-error rates (BER) across weights and activation values. Importantly, Ares was validated against a physical DNN accelerator implemented in silicon, demonstrating its relevance for hardware-level fault modeling. As the field matured, Ares was extended to support PyTorch, allowing researchers to analyze fault behavior in more modern ML settings.
Building on this foundation, PyTorchFI (Mahmoud et al. 2020) was introduced as a dedicated fault injection library for PyTorch. Developed in collaboration with Nvidia Research, PyTorchFI allows fault injection into key components of ML models, including weights, activations, and gradients. Its native support for GPU acceleration makes it especially well-suited for evaluating large models efficiently. As shown in Figure 18.40, even simple bit-level faults can cause severe visual and classification errors—such as “phantom” objects appearing in images—which could have downstream safety implications in domains like autonomous driving.
The modular and accessible design of PyTorchFI has led to its adoption in several follow-on projects. For example, PyTorchALFI (developed by Intel xColabs) extends PyTorchFI’s capabilities to evaluate system-level safety in automotive applications. Similarly, Dr. DNA (Ma et al. 2024) from Meta introduces a more streamlined, Pythonic API to simplify fault injection workflows. Another notable extension is GoldenEye (Mahmoud et al. 2022), which incorporates alternative numeric datatypes—such as AdaptivFloat (Tambe et al. 2020) and BlockFloat (bfloat16)—to study the fault tolerance of non-traditional number formats under hardware-induced bit errors.
For researchers working within the TensorFlow ecosystem, TensorFI (Chen et al. 2020) provides a parallel solution. Like PyTorchFI, TensorFI enables fault injection into the TensorFlow computational graph and supports a variety of fault models. One of TensorFI’s strengths is its broad applicability—it can be used to evaluate many types of ML models beyond DNNs. Additional extensions such as BinFi (Chen et al. 2019) aim to accelerate the fault injection process by focusing on the most critical bits in a model. This prioritization can help reduce simulation time while still capturing the most meaningful error patterns.
At a lower level of the software stack, NVBitFI (T. Tsai et al. 2021) offers a platform-independent tool for injecting faults directly into GPU assembly code. Developed by Nvidia, NVBitFI is capable of performing fault injection on any GPU-accelerated application, not just ML workloads. This makes it an especially powerful tool for studying resilience at the instruction level, where errors can propagate in subtle and complex ways. NVBitFI represents an important complement to higher-level tools like PyTorchFI and TensorFI, offering fine-grained control over GPU-level behavior and supporting a broader class of applications beyond machine learning.
Together, these tools offer a wide spectrum of fault injection capabilities. While some are tightly integrated with high-level ML frameworks for ease of use, others enable lower-level fault modeling with higher fidelity. By choosing the appropriate tool based on the level of abstraction, performance needs, and target application, researchers can tailor their studies to gain more actionable insights into the robustness of ML systems. The next section focuses on how these tools are being applied in domain-specific contexts, particularly in safety-critical systems such as autonomous vehicles and robotics.
Domain-Specific Examples
To address the unique challenges posed by specific application domains, researchers have developed specialized fault injection tools tailored to different machine learning (ML) systems. In high-stakes environments such as autonomous vehicles and robotics, domain-specific tools play a crucial role in evaluating system safety and reliability under hardware fault conditions. This section highlights three such tools: DriveFI and PyTorchALFI, which focus on autonomous vehicles, and MAVFI, which targets uncrewed aerial vehicles (UAVs). Each tool enables the injection of faults into mission-critical components—such as perception, control, and sensor systems—providing researchers with insights into how hardware errors may propagate through real-world ML pipelines.
DriveFI (Jha et al. 2019) is a fault injection tool developed for autonomous vehicle systems. It facilitates the injection of hardware faults into the perception and control pipelines, enabling researchers to study how such faults affect system behavior and safety. Notably, DriveFI integrates with industry-standard platforms like Nvidia DriveAV and Baidu Apollo, offering a realistic environment for testing. Through this integration, DriveFI enables practitioners to evaluate the end-to-end resilience of autonomous vehicle architectures in the presence of fault conditions.
PyTorchALFI (Gräfe et al. 2023) extends the capabilities of PyTorchFI for use in the autonomous vehicle domain. Developed by Intel xColabs, PyTorchALFI enhances the underlying fault injection framework with domain-specific features. These include the ability to inject faults into multimodal sensor data28, such as inputs from cameras and LiDAR systems. This allows for a deeper examination of how perception systems in autonomous vehicles respond to underlying hardware faults, further refining our understanding of system vulnerabilities and potential failure modes.
28 Multimodal Sensor Data: Information collected simultaneously from multiple types of sensors (e.g., cameras, LiDAR, radar) to provide complementary perspectives of the environment. Critical for robust perception in autonomous systems.
MAVFI (Hsiao et al. 2023) is a domain-specific fault injection framework tailored for robotics applications, particularly uncrewed aerial vehicles. Built atop the Robot Operating System (ROS), MAVFI provides a modular and extensible platform for injecting faults into various UAV subsystems, including sensors, actuators, and flight control algorithms. By assessing how injected faults impact flight stability and mission success, MAVFI offers a practical means for developing and validating fault-tolerant UAV architectures.
Together, these tools demonstrate the growing sophistication of fault injection research across application domains. By enabling fine-grained control over where and how faults are introduced, domain-specific tools provide actionable insights that general-purpose frameworks may overlook. Their development has greatly expanded the ML community’s capacity to design and evaluate resilient systems—particularly in contexts where reliability, safety, and real-time performance are critical.
18.6.4 Bridging Hardware-Software Gap
While software-based fault injection tools offer many advantages in speed, flexibility, and accessibility, they do not always capture the full range of effects that hardware faults can impose on a system. This is largely due to the abstraction gap: software-based tools operate at a higher level and may overlook low-level hardware interactions or nuanced error propagation mechanisms that influence the behavior of ML systems in critical ways.
As discussed in the work by (Bolchini et al. 2023), hardware faults can exhibit complex spatial distribution patterns that are difficult to replicate using purely software-based fault models. They identify four characteristic fault propagation patterns: single point, where the fault corrupts a single value in a feature map; same row, where a partial or entire row in a feature map is corrupted; bullet wake, where the same location across multiple feature maps is affected; and shatter glass, a more complex combination of both same row and bullet wake behaviors. These diverse patterns, visualized in Figure 18.41, highlight the limits of simplistic injection strategies and emphasize the need for hardware-aware modeling when evaluating ML system robustness.
To address this abstraction gap, researchers have developed tools that explicitly aim to map low-level hardware error behavior to software-visible effects. One such tool is Fidelity, which bridges this gap by studying how hardware-level faults propagate and become observable at higher software layers. The next section discusses Fidelity in more detail.
Fidelity
Fidelity (He, Balaprakash, and Li 2020) is a tool designed to model hardware faults more accurately within software-based fault injection experiments. Its core goal is to bridge the gap between low-level hardware fault behavior and the higher-level effects observed in machine learning systems by simulating how faults propagate through the compute stack.
The central insight behind Fidelity is that not all faults need to be modeled individually at the hardware level to yield meaningful results. Instead, Fidelity focuses on how faults manifest at the software-visible state and identifies equivalence relationships that allow representative modeling of entire fault classes. To accomplish this, it relies on several key principles:
First, fault propagation is studied to understand how a fault originating in hardware can move through various layers—such as architectural registers, memory hierarchies, or numerical operations—eventually altering values in software. Fidelity captures these pathways to ensure that injected faults in software reflect the way faults would actually manifest in a real system.
Second, the tool identifies fault equivalence, which refers to grouping hardware faults that lead to similar observable outcomes in software. By focusing on representative examples rather than modeling every possible hardware bit flip individually, Fidelity allows more efficient simulations without sacrificing accuracy.
Finally, Fidelity uses a layered modeling approach, capturing the system’s behavior at various abstraction levels—from hardware fault origin to its effect in the ML model’s weights, activations, or predictions. This layering ensures that the impact of hardware faults is realistically simulated in the context of the ML system.
By combining these techniques, Fidelity allows researchers to run fault injection experiments that closely mirror the behavior of real hardware systems, but with the efficiency and flexibility of software-based tools. This makes Fidelity especially valuable in safety-critical settings, where the cost of failure is high and an accurate understanding of hardware-induced faults is essential.
Capturing Hardware Behavior
Capturing the true behavior of hardware faults in software-based fault injection tools is critical for advancing the reliability and robustness of ML systems. This fidelity becomes especially important when hardware faults have subtle but significant effects that may not be evident when modeled at a high level of abstraction.
There are several reasons why accurately reflecting hardware behavior is essential. First, accuracy is paramount. Software-based tools that mirror the actual propagation and manifestation of hardware faults provide more dependable insights into how faults influence model behavior. These insights are crucial for designing and validating fault-tolerant architectures and ensuring that mitigation strategies are grounded in realistic system behavior.
Second, reproducibility is improved when hardware effects are faithfully captured. This allows fault injection results to be reliably reproduced across different systems and environments, which is a cornerstone of rigorous scientific research. Researchers can better compare results, validate findings, and ensure consistency across studies.
Third, efficiency is enhanced when fault models focus on the most representative and impactful fault scenarios. Rather than exhaustively simulating every possible bit flip, tools can target a subset of faults that are known—through accurate modeling—to affect the system in meaningful ways. This selective approach saves computational resources while still providing comprehensive insights.
Finally, understanding how hardware faults appear at the software level is essential for designing effective mitigation strategies. When researchers know how specific hardware-level issues affect different components of an ML system, they can develop more targeted hardening techniques—such as retraining specific layers, applying redundancy selectively, or improving architectural resilience in bottleneck components.
Tools like Fidelity are central to this effort. By establishing mappings between low-level hardware behavior and higher-level software effects, Fidelity and similar tools empower researchers to conduct fault injection experiments that are not only faster and more scalable, but also grounded in real-world system behavior.
As ML systems continue to increase in scale and are deployed in increasingly safety-critical environments, this kind of hardware-aware modeling will become even more important. Ongoing research in this space aims to further refine the translation between hardware and software fault models and to develop tools that offer both efficiency and realism in evaluating ML system resilience. These advances will provide the community with more powerful, reliable methods for understanding and defending against the effects of hardware faults.
18.7 Conclusion
The pursuit of robust AI is a multifaceted endeavor that is critical for the reliable deployment of machine learning systems in real-world environments. As ML systems move from controlled research settings to practical applications, robustness becomes not just a desirable feature but a foundational requirement. Deploying AI in practice means engaging directly with the challenges that can compromise system performance, safety, and reliability.
We examined the broad spectrum of issues that threaten AI robustness, beginning with hardware-level faults. Transient faults may introduce temporary computational errors, while permanent faults—such as the well-known Intel FDIV bug—can lead to persistent inaccuracies that affect system behavior over time.
Beyond hardware, machine learning models themselves are susceptible to a variety of threats. Adversarial examples, such as the misclassification of modified stop signs, reveal how subtle input manipulations can cause erroneous outputs. Likewise, data poisoning techniques, exemplified by the Nightshade project, illustrate how malicious training data can degrade model performance or implant hidden backdoors, posing serious security risks in practical deployments.
The chapter also addressed the impact of distribution shifts, which often result from temporal evolution or domain mismatches between training and deployment environments. Such shifts challenge a model’s ability to generalize and perform reliably under changing conditions. Compounding these issues are faults in the software infrastructure—spanning frameworks, libraries, and runtime components—which can propagate unpredictably and undermine system integrity.
To navigate these risks, the use of robust tools and evaluation frameworks is essential. Tools such as PyTorchFI and Fidelity enable researchers and practitioners to simulate fault scenarios, assess vulnerabilities, and systematically improve system resilience. These resources are critical for translating theoretical robustness principles into operational safeguards.
Ultimately, building robust AI requires a comprehensive and proactive approach. Fault tolerance, security mechanisms, and continuous monitoring must be embedded throughout the AI development lifecycle—from data collection and model training to deployment and maintenance. As this chapter has demonstrated, applying AI in real-world contexts means addressing these robustness challenges head-on to ensure that systems operate safely, reliably, and effectively in complex and evolving environments.
18.8 Resources
Here is a curated list of resources to support students and instructors in their learning and teaching journeys. We are continuously working on expanding this collection and will add new exercises soon.
These slides are a valuable tool for instructors to deliver lectures and for students to review the material at their own pace. We encourage both students and instructors to leverage these slides to improve their understanding and facilitate effective knowledge transfer.
- Coming soon.
- Coming soon.
- Coming soon.