15 Responsible AI
As machine learning models grow across various domains, these algorithms have the potential to perpetuate historical biases, breach privacy, or enable unethical automated decisions if developed without thoughtful consideration of their societal impacts. Even systems created with good intentions can ultimately discriminate against certain demographic groups, enable surveillance, or lack transparency into their behaviors and decision-making processes. As such, machine learning engineers and companies have an ethical responsibility to proactively ensure principles of fairness, accountability, safety, and transparency are reflected in their models to prevent harm and build public trust.
Understand responsible AI’s core principles and motivations, including fairness, transparency, privacy, safety, and accountability.
Learn technical methods for implementing responsible AI principles, such as detecting dataset biases, building interpretable models, adding noise for privacy, and testing model robustness.
Recognize organizational and social challenges to achieving responsible AI, including data quality, model objectives, communication, and job impacts.
Gain knowledge of ethical frameworks and considerations for AI systems, spanning AI safety, human autonomy, and economic consequences.
Appreciate the increased complexity and costs of developing ethical, trustworthy AI systems compared to unprincipled AI.
15.1 Introduction
Machine learning models are increasingly used to automate decisions in high-stakes social domains like healthcare, criminal justice, and employment. However, without deliberate care, these algorithms can perpetuate biases, breach privacy, or cause other harm. For instance, a loan approval model solely trained on data from high-income neighborhoods could disadvantage applicants from lower-income areas. This motivates the need for responsible machine learning - creating fair, accountable, transparent, and ethical models.
Several core principles underlie responsible ML. Fairness ensures models do not discriminate based on gender, race, age, and other attributes. Explainability enables humans to interpret model behaviors and improve transparency. Robustness and safety techniques prevent vulnerabilities like adversarial examples. Rigorous testing and validation help reduce unintended model weaknesses or side effects.
Implementing responsible ML presents both technical and ethical challenges. Developers must grapple with defining fairness mathematically, balancing competing objectives like accuracy vs interpretability, and securing quality training data. Organizations must also align incentives, policies, and culture to uphold ethical AI.
This chapter will equip you to critically evaluate AI systems and contribute to developing beneficial and ethical machine learning applications by covering the foundations, methods, and real-world implications of responsible ML. The responsible ML principles discussed are crucial knowledge as algorithms mediate more aspects of human society.
15.2 Definition
Responsible AI is about developing AI that positively impacts society in line with human ethics and values. It refers to designing, developing, and deploying artificial intelligence systems in an ethical, socially beneficial way, with the core goal of creating trustworthy, unbiased, fair, transparent, accountable, and safe AI. While there is no universally agreed-upon definition, responsible AI is generally considered to encompass principles such as:
Fairness: Avoiding biases, discrimination, and potential harm to certain groups or populations
Explainability: Enabling humans to understand and interpret how AI models make decisions
Transparency: Openly communicating how AI systems operate, are built, and are evaluated
Accountability: Having processes to determine responsibility and liability for AI failures or negative impacts
Robustness: Ensuring AI systems are secure, reliable, and behave as intended
Privacy: Protecting sensitive user data and adhering to privacy laws and ethics
Putting these principles into practice involves technical techniques, corporate policies, governance frameworks, and moral philosophy. There are also ongoing debates around defining ambiguous concepts like fairness and determining how to balance competing objectives.
15.3 Principles and Concepts
15.3.1 Transparency and Explainability
Machine learning models are often criticized as mysterious “black boxes” - opaque systems where it’s unclear how they arrived at particular predictions or decisions. For example, an AI system called COMPAS used to assess criminal recidivism risk in the US was found to be racially biased against black defendants. Still, the opacity of the algorithm made it difficult to understand and fix the problem. This lack of transparency can obscure biases, errors, and deficiencies.
Explaining model behaviors helps engender trust from the public and domain experts and enables identifying issues to address. Interpretability techniques like LIME, Shapley values, and saliency maps empower humans to understand and validate model logic. Laws like the EU’s GDPR also mandate transparency, which requires explainability for certain automated decisions. Overall, transparency and explainability are critical pillars of responsible AI.
15.3.2 Fairness, Bias, and Discrimination
ML models trained on historically biased data often perpetuate and amplify those prejudices. Healthcare algorithms have been shown to disadvantage black patients by underestimating their needs (Obermeyer et al. 2019). Facial recognition systems have been shown to be less accurate for women and people of color. Such algorithmic discrimination can negatively impact people’s lives in profound ways.
Different philosophical perspectives also exist on fairness - for example, is it fairer to treat all individuals equally or try to achieve equal outcomes for groups? Ensuring fairness requires proactively detecting and mitigating biases in data and models. However, achieving perfect fairness is tremendously difficult due to contrasting mathematical definitions and ethical perspectives. Still, promoting algorithmic fairness and non-discrimination is a key responsibility in AI development.
15.3.3 Privacy and Data Governance
Maintaining individuals’ privacy is an ethical obligation and legal requirement for organizations deploying AI systems. Regulations like the EU’s GDPR mandate data privacy protections and rights, such as the ability to access and delete one’s data.
However, maximizing the utility and accuracy of data for training models can conflict with preserving privacy - modeling disease progression could benefit from access to patients’ full genomes, but sharing such data widely violates privacy.
Responsible data governance involves carefully anonymizing data, controlling access with encryption, getting informed consent from data subjects, and collecting the minimum data needed. Honoring privacy is challenging but critical as AI capabilities and adoption expand.
15.3.4 Safety and Robustness
Putting AI systems into real-world operation requires ensuring they are safe, reliable, and robust, especially for human interaction scenarios. Self-driving cars from Uber and Tesla have been involved in deadly crashes due to unsafe behaviors.
Adversarial attacks that subtly alter input data can also fool ML models and cause dangerous failures if systems are not resistant. Deepfakes represent another emerging threat area.
Video 15.1 is a deepfake video of Barack Obama that went viral a few years ago.
Promoting safety requires extensive testing, risk analysis, human oversight, and designing systems that combine multiple weak models to avoid single points of failure. Rigorous safety mechanisms are essential for the responsible deployment of capable AI.
15.3.5 Accountability and Governance
When AI systems eventually fail or produce harmful outcomes, mechanisms must exist to address resultant issues, compensate affected parties, and assign responsibility. Both corporate accountability policies and government regulations are indispensable for responsible AI governance. For instance, Illinois’ Artificial Intelligence Video Interview Act requires companies to disclose and obtain consent for AI video analysis, promoting accountability.
Without clear accountability, even harms caused unintentionally could go unresolved, furthering public outrage and distrust. Oversight boards, impact assessments, grievance redress processes, and independent audits promote responsible development and deployment.
15.4 Cloud, Edge & Tiny ML
While these principles broadly apply across AI systems, certain responsible AI considerations are unique or pronounced when dealing with machine learning on embedded devices versus traditional server-based modeling. Therefore, we present a high-level taxonomy comparing responsible AI considerations across cloud, edge, and TinyML systems.
15.4.1 Summary
Table 15.1 summarizes how responsible AI principles manifest differently across cloud, edge, and TinyML architectures and how core considerations tie into their unique capabilities and limitations. Each environment’s constraints and tradeoffs shape how we approach transparency, accountability, governance, and other pillars of responsible AI.
| Principle | Cloud ML | Edge ML | TinyML |
|---|---|---|---|
| Explainability | Complex models supported | Lightweight required | Severe limits |
| Fairness | Broad data available | On-device biases | Limited data labels |
| Privacy | Cloud data vulnerabilities | More sensitive data | Data dispersed |
| Safety | Hacking threats | Real-world interaction | Autonomous devices |
| Accountability | Corporate policies | Supply chain issues | Component tracing |
| Governance | External oversight feasible | Self-governance needed | Protocol constraints |
15.4.2 Explainability
For cloud-based machine learning, explainability techniques can leverage significant compute resources, enabling complex methods like SHAP values or sampling-based approaches to interpret model behaviors. For example, Microsoft’s InterpretML toolkit provides explainability techniques tailored for cloud environments.
However, edge ML operates on resource-constrained devices, requiring more lightweight explainability methods that can run locally without excessive latency. Techniques like LIME (Ribeiro, Singh, and Guestrin 2016) approximate model explanations using simple surrogates such as linear models or decision trees, which at first glance seems well suited to such devices. In practice, though, LIME must query the model on hundreds or even thousands of perturbed inputs to fit a good explanation, which is often infeasible given edge computing constraints. In contrast, gradient-based saliency methods are often much faster, typically requiring only a single forward and backward pass through the network to estimate feature importance. This greater efficiency makes such methods better suited to edge devices with limited compute resources where low-latency explanations are critical.
Given tiny hardware capabilities, embedded systems pose the most significant challenges for explainability. More compact models and limited data can make inherent model transparency easier, but computing explanations may not be feasible on microcontrollers that are highly optimized for size and power. DARPA’s Transparent Computing program tries to develop extremely low overhead explainability, especially for TinyML devices like sensors and wearables.
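To make the cost difference concrete, here is a minimal sketch using only the Python standard library. It is not the LIME library and not the authors' method: `model` is a hypothetical three-feature logistic model, `gradient_saliency` uses its analytic gradient, and `perturbation_surrogate` is a simplified LIME-style estimate that fits a per-feature slope from many perturbed queries. All names and parameters are assumptions chosen for illustration.

```python
import math
import random

W = (2.0, -1.0, 0.0)   # hypothetical model weights (third feature is unused)
B = 0.1                # hypothetical bias

def model(x):
    """Black-box stand-in: a tiny logistic model over three features."""
    z = sum(wi * xi for wi, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def gradient_saliency(x):
    """One model evaluation plus an analytic gradient: dp/dx_i = p(1-p)w_i."""
    p = model(x)
    return [p * (1.0 - p) * wi for wi in W]

def perturbation_surrogate(x, n_samples=500, sigma=0.1, seed=0):
    """LIME-style idea: query the model on many perturbed inputs and fit a
    local slope per feature. Perturbations are independent per feature, so
    each slope can be estimated separately by least squares."""
    rng = random.Random(seed)
    base = model(x)
    num = [0.0] * len(x)
    den = [0.0] * len(x)
    for _ in range(n_samples):               # n_samples extra model queries
        delta = [rng.gauss(0.0, sigma) for _ in x]
        dy = model([xi + di for xi, di in zip(x, delta)]) - base
        for i, di in enumerate(delta):
            num[i] += di * dy
            den[i] += di * di
    return [n / d for n, d in zip(num, den)]

x = [0.5, 0.2, 1.0]
saliency = gradient_saliency(x)          # 1 model evaluation
surrogate = perturbation_surrogate(x)    # 501 model evaluations
```

Both attributions rank the first feature as most influential here, but the surrogate needs hundreds of model queries per explanation, which is the overhead that matters on a constrained device.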
15.4.3 Fairness
For cloud machine learning, vast datasets and computing power enable detecting biases across large heterogeneous populations and mitigating them through techniques like re-weighting data samples. However, biases may emerge from the broad behavioral data used to train cloud models. Meta’s Fairness Flow framework helps assess cloud ML fairness.
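As a small illustration of re-weighting, the sketch below assigns each sample a weight inversely proportional to the size of its subgroup, so that every subgroup contributes equally to a weighted training loss. The function name and the toy group labels are hypothetical, and this is only one of several possible re-weighting schemes.

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Weight each sample by n / (k * count[group]), where n is the number
    of samples and k the number of groups, so that the weights of every
    group sum to the same total (n / k)."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

groups = ["a"] * 8 + ["b"] * 2          # hypothetical, imbalanced subgroups
weights = inverse_frequency_weights(groups)

total_a = sum(w for w, g in zip(weights, groups) if g == "a")
total_b = sum(w for w, g in zip(weights, groups) if g == "b")
```

These weights would typically be passed to a training routine that accepts per-sample weights; how much they help depends on the model and data.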
Edge ML relies on limited on-device data, making analyzing biases across diverse groups harder. However, edge devices interact closely with individuals, providing an opportunity to adapt locally for fairness. Google’s Federated Learning distributes model training across devices to incorporate individual differences.
TinyML poses unique challenges for fairness with highly dispersed specialized hardware and minimal training data. Bias testing is difficult across diverse devices. Collecting representative data from many devices to mitigate bias has scale and privacy hurdles. DARPA’s Assured Neuro Symbolic Learning and Reasoning (ANSR) efforts are geared toward developing fairness techniques given extreme hardware constraints.
15.4.4 Safety
Key safety risks for cloud ML include model hacking, data poisoning, and malware disrupting cloud services. Robustness techniques like adversarial training, anomaly detection, and diversified models aim to harden cloud ML against attacks. Redundancy can help prevent single points of failure.
Edge ML and TinyML interact with the physical world, so reliability and safety validation are critical. Rigorous testing platforms like Foretellix synthetically generate edge scenarios to validate safety. TinyML safety concerns are magnified by autonomous devices operating with limited supervision. TinyML safety often relies on collective coordination; for example, swarms of drones maintain safety through redundancy. Physical control barriers also constrain unsafe TinyML device behaviors.
In summary, safety is crucial but manifests differently in each domain: cloud ML guards against hacking, edge ML interacts with the physical world and so depends on reliability, and TinyML leverages distributed coordination. Understanding these nuances guides the choice of appropriate safety techniques.
15.4.5 Accountability
Cloud ML’s accountability centers on corporate practices like responsible AI committees, ethical charters, and processes to address harmful incidents. Third-party audits and external government oversight promote cloud ML accountability.
Edge ML accountability is more complex with distributed devices and supply chain fragmentation. Companies are accountable for devices, but components come from various vendors. Industry standards help coordinate edge ML accountability across stakeholders.
With TinyML, accountability mechanisms must be traced across long, complex supply chains of integrated circuits, sensors, and other hardware. TinyML certification schemes help track component provenance. Trade associations should ideally promote shared accountability for ethical TinyML.
15.4.6 Governance
Organizations institute internal governance for cloud ML, such as ethics boards, audits, and model risk management. External governance also applies to cloud ML through regulations on bias and transparency, such as the AI Bill of Rights, the General Data Protection Regulation (GDPR), and the California Consumer Privacy Act (CCPA). Third-party auditing supports cloud ML governance.
Edge ML is more decentralized, requiring responsible self-governance by developers and companies deploying models locally. Industry associations coordinate governance across edge ML vendors, and open software helps align incentives for ethical edge ML.
Extreme decentralization and complexity make external governance infeasible with TinyML. TinyML relies on protocols and standards for self-governance baked into model design and hardware. Cryptography enables the provable trustworthiness of TinyML devices.
15.4.7 Privacy
For cloud ML, vast amounts of user data are concentrated in the cloud, creating risks of exposure through breaches. Differential privacy techniques add noise to cloud data to preserve privacy. Strict access controls and encryption protect cloud data at rest and in transit.
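One common way to "add noise" is the Laplace mechanism. The sketch below releases a count with epsilon-differential privacy; it is a minimal illustration under stated assumptions (a counting query with sensitivity 1, hypothetical function names and data), not a production privacy library.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials,
    each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng):
    """Release a noisy count. Adding or removing one person changes a
    count by at most 1 (sensitivity 1), so the noise scale is 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 52, 29, 67, 44, 38]       # hypothetical user records
true_count = sum(1 for a in ages if a > 40)   # exact answer: 4
releases = [private_count(ages, lambda a: a > 40, 1.0, rng)
            for _ in range(2000)]
mean_release = sum(releases) / len(releases)
```

A smaller `epsilon` gives stronger privacy but noisier answers. Each individual release is noisy, while the average over many releases stays close to the true count, which is why repeated queries must be budgeted in practice.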
Edge ML moves data processing onto user devices, reducing aggregated data collection but increasing potential sensitivity as personal data resides on the device. Apple uses on-device ML and differential privacy to train models while minimizing data sharing. Data anonymization and secure enclaves protect on-device data.
TinyML distributes data across many resource-constrained devices, making centralized breaches unlikely but also making anonymization at scale challenging. Data minimization and using edge devices as intermediaries help preserve TinyML privacy.
So, while cloud ML must protect expansive centralized data, edge ML secures sensitive on-device data, and TinyML aims for minimal distributed data sharing due to constraints. While privacy is vital throughout, techniques must match the environment. Understanding nuances allows for selecting appropriate privacy preservation approaches.
15.5 Technical Aspects
15.5.1 Detecting and Mitigating Bias
A large body of work has demonstrated that machine learning models can exhibit bias, from underperforming for people of a certain identity to making decisions that limit groups’ access to important resources (Buolamwini and Gebru 2018).
Ensuring fair and equitable treatment for all groups affected by machine learning systems is crucial as these models increasingly impact people’s lives in areas like lending, healthcare, and criminal justice. We typically evaluate model fairness by considering “subgroup attributes” that are unrelated to the prediction task and capture identities like race, gender, or religion. For example, a loan default prediction model could be evaluated separately for applicants of each race, gender, or religion. When models are trained naively to maximize accuracy, they often ignore subgroup performance, which can negatively impact marginalized communities.
To illustrate, imagine a model predicting loan repayment where the pluses (+’s) represent repayment and the circles (O’s) represent default, as shown in Figure 15.1. The accuracy-optimal classifier would correctly classify all of Group A while misclassifying some of Group B’s creditworthy applicants as defaults. If positive classifications grant access to loans, Group A would receive many more loans, which would naturally result in a biased outcome.
Alternatively, correcting the biases against Group B would likely increase “false positives” and reduce accuracy for Group A. Or, we could train separate models focused on maximizing true positives for each group. However, this would require explicitly using sensitive attributes like race in the decision process.
As we see, there are inherent tensions around priorities like accuracy versus subgroup fairness and whether to explicitly account for protected classes. Reasonable people can disagree on the appropriate tradeoffs. Constraints around costs and implementation options further complicate matters. Overall, ensuring the fair and ethical use of machine learning involves navigating these complex challenges.
Thus, the fairness literature has proposed three main fairness metrics for quantifying how fair a model performs over a dataset (Hardt, Price, and Srebro 2016). Given a model h and a dataset D consisting of (x,y,s) samples, where x is the data features, y is the label, and s is the subgroup attribute, and we assume there are simply two subgroups a and b, we can define the following.
Demographic Parity asks whether a model assigns positive predictions at the same rate to each subgroup, regardless of the true label. In other words, $P(h(X) = 1 \mid S = a) = P(h(X) = 1 \mid S = b)$
Equalized Odds asks whether a model is equally accurate on positive and negative samples for each subgroup, that is, whether true positive and false positive rates match across subgroups. For each $y \in \{0, 1\}$, $P(h(X) = y \mid S = a, Y = y) = P(h(X) = y \mid S = b, Y = y)$
Equality of Opportunity is a special case of equalized odds that only asks whether a model is equally accurate on positive samples. This is relevant in cases such as resource allocation, where we care about how positive (i.e., resource-allocated) labels are distributed across groups. For example, we care that creditworthy men and women are approved for loans at equal rates. $P(h(X) = 1 \mid S = a, Y = 1) = P(h(X) = 1 \mid S = b, Y = 1)$
Note: These definitions often take a narrow view when considering binary comparisons between two subgroups. Another thread of fair machine learning research focusing on multicalibration and multiaccuracy considers the interactions between an arbitrary number of identities, acknowledging the inherent intersectionality of individual identities in the real world (Hébert-Johnson et al. 2018).
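The group fairness criteria can be checked empirically by comparing simple rates between subgroups. The sketch below follows the usual formulation in Hardt, Price, and Srebro (2016): demographic parity compares positive prediction rates, equality of opportunity compares true positive rates, and equalized odds compares both true positive and false positive rates. The helper names and toy data are hypothetical, and the code assumes each subgroup contains at least one positive and one negative label.

```python
def rate(preds, mask):
    """Fraction of positive predictions among the samples selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected)

def fairness_gaps(preds, labels, groups, a="a", b="b"):
    """Absolute gaps between subgroups a and b for three criteria."""
    def group_rates(g):
        in_g = [s == g for s in groups]
        pos = [m and y == 1 for m, y in zip(in_g, labels)]
        neg = [m and y == 0 for m, y in zip(in_g, labels)]
        # positive rate, true positive rate, false positive rate
        return rate(preds, in_g), rate(preds, pos), rate(preds, neg)

    pr_a, tpr_a, fpr_a = group_rates(a)
    pr_b, tpr_b, fpr_b = group_rates(b)
    return {
        "demographic_parity": abs(pr_a - pr_b),
        "equal_opportunity": abs(tpr_a - tpr_b),
        "equalized_odds": max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)),
    }

# Hypothetical toy data: predictions h(x), labels y, subgroup attribute s.
preds  = [1, 1, 0, 0, 1, 0, 0, 0]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gaps = fairness_gaps(preds, labels, groups)
```

A gap of zero means the criterion holds exactly on this dataset. In the toy data, the model finds every positive in group a but only half of the positives in group b, so the equal opportunity gap is large even though no negatives are misclassified.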
Context Matters
Before making any technical decisions to develop an unbiased ML algorithm, we need to understand the context surrounding our model. Here are some of the key questions to think about:
- Who will this model make decisions for?
- Who is represented in the training data?
- Who is represented, and who is missing at the table of engineers, designers, and managers?
- What sort of long-lasting impacts could this model have? For example, will it impact an individual’s financial security at a generational scale, such as determining college admissions or approving a home loan?
- What historical and systematic biases are present in this setting, and are they present in the training data the model will generalize from?
Understanding a system’s social, ethical, and historical background is critical to preventing harm and should inform decisions throughout the model development lifecycle. After understanding the context, one can make various technical decisions to remove bias. First, one must decide what fairness metric is the most appropriate criterion for optimizing. Next, there are generally three main areas where one can intervene to debias an ML system.
First, preprocessing balances a dataset to ensure fair representation, or even increases the weight on certain underrepresented groups, so that the model performs well for them. Second, in-processing modifies the training process of an ML system so that it prioritizes fairness. This can range from adding a fairness regularizer (Lowy et al. 2021) to training an ensemble of models and sampling from them in a specific manner (Agarwal et al. 2018).
Finally, post-processing debiases a model after the fact, taking a trained model and modifying its predictions in a specific manner to ensure fairness is preserved (Alghamdi et al. 2022; Hardt, Price, and Srebro 2016). Post-processing builds on the preprocessing and in-processing steps by providing another opportunity to address bias and fairness issues in the model after it has already been trained.
The three-step process of preprocessing, in-processing, and post-processing provides a framework for intervening at different stages of model development to mitigate issues around bias and fairness. While preprocessing and in-processing focus on data and training, post-processing allows for adjustments after the model has been fully trained. Together, these three approaches give multiple opportunities to detect and remove unfair bias.
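As one illustration of post-processing, the sketch below picks a separate decision threshold per subgroup so that true positive rates match, leaving the trained model's scores untouched. It is a deliberately simplified, deterministic variant of threshold-based post-processing; the published methods cited above are more general. The function names, scores, and target rate are hypothetical.

```python
def tpr(scores, labels, threshold):
    """True positive rate when predicting positive for score >= threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos)

def threshold_for_tpr(scores, labels, target_tpr):
    """Largest observed score usable as a threshold while still reaching
    at least target_tpr on this group's positives."""
    for t in sorted(set(scores), reverse=True):
        if tpr(scores, labels, t) >= target_tpr:
            return t
    return min(scores)

# Hypothetical model scores for two subgroups (1 = repaid, 0 = defaulted).
scores_a, labels_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2], [1, 1, 1, 1, 0, 0]
scores_b, labels_b = [0.8, 0.6, 0.45, 0.4, 0.3, 0.1], [1, 1, 1, 1, 0, 0]

# A single shared threshold treats the groups unequally.
shared_gap = tpr(scores_a, labels_a, 0.5) - tpr(scores_b, labels_b, 0.5)

# Group-specific thresholds chosen to equalize the true positive rate.
t_a = threshold_for_tpr(scores_a, labels_a, target_tpr=1.0)
t_b = threshold_for_tpr(scores_b, labels_b, target_tpr=1.0)
adjusted_gap = tpr(scores_a, labels_a, t_a) - tpr(scores_b, labels_b, t_b)
```

Note that this approach uses the subgroup attribute at decision time, which raises the question discussed earlier of whether sensitive attributes should be used explicitly in the decision process.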
Thoughtful Deployment
The breadth of existing fairness definitions and debiasing interventions underscores the need for thoughtful assessment before deploying ML systems. As ML researchers and developers, responsible model development requires proactively educating ourselves on the real-world context, consulting domain experts and end-users, and centering harm prevention.
Rather than seeing fairness considerations as a box to check, we must deeply engage with the unique social implications and ethical tradeoffs around each model we build. Every technical choice about datasets, model architectures, evaluation metrics, and deployment constraints embeds values. By broadening our perspective beyond narrow technical metrics, carefully evaluating tradeoffs, and listening to impacted voices, we can work to ensure our systems expand opportunity rather than encode bias.
The path forward lies not in an arbitrary debiasing checklist but in a commitment to understanding and upholding our ethical responsibility at each step: assessing harms, evaluating impacts on different groups, understanding the tradeoffs embedded in our technical choices, and listening to the voices most impacted so that our systems provide opportunity for all groups.
15.5.2 Preserving Privacy
Recent incidents have shed light on how AI models can memorize sensitive user data in ways that violate privacy. Ippolito et al. (2023) demonstrate that language models tend to memorize training data and can even reproduce specific training examples. These risks are amplified with personalized ML systems deployed in intimate environments like homes or wearables. Consider a smart speaker that uses our conversations to improve its service quality for users who appreciate such enhancements. While potentially beneficial, this also creates privacy risks, as malicious actors could attempt to extract what the speaker “remembers.” The issue extends beyond language models. Figure 15.2 showcases how diffusion models can memorize and generate individual training examples (Nicholas Carlini et al. 2023), further demonstrating the potential privacy risks associated with AI systems learning from user data.
As AI becomes increasingly integrated into our daily lives, it becomes more important to examine privacy concerns critically and to develop robust safeguards that protect user information. The challenge lies in balancing the benefits of personalized AI with the fundamental right to privacy.
Adversaries can use these memorization capabilities and train models to detect if specific training data influenced a target model. For example, membership inference attacks train a secondary model that learns to detect a change in the target model’s outputs when making inferences over data it was trained on versus not trained on (Shokri et al. 2017).
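The intuition behind membership inference can be shown with a much simpler attack than the shadow-model approach of Shokri et al.: guess "member" whenever the target model is confident on a sample. In the sketch below, the target is a hypothetical, deliberately overfit kernel classifier that memorizes its training points; the data, function names, bandwidth, and threshold are all assumptions made for illustration.

```python
import math
import random

def make_data(n, rng):
    """Hypothetical 1-D data: two heavily overlapping classes."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(1.0 if y else -1.0, 2.0), y))
    return data

def confidence(train, x, y, bandwidth=0.01):
    """Target model's confidence in label y at input x. The very narrow
    kernel makes the model memorize its training points (an overfit model).
    `train` stands in for the model's internal state; a real attacker would
    only see the returned confidence."""
    weights = [math.exp(-((x - xt) / bandwidth) ** 2) for xt, _ in train]
    total = sum(weights)
    if total == 0.0:          # no training point nearby: uninformed guess
        return 0.5
    same = sum(w for w, (_, yt) in zip(weights, train) if yt == y)
    return same / total

def infer_membership(train, x, y, threshold=0.5):
    """Attack: guess 'member' when the model is confident on (x, y)."""
    return confidence(train, x, y) > threshold

rng = random.Random(0)
members = make_data(200, rng)       # used to "train" the target model
non_members = make_data(200, rng)   # fresh samples, same distribution

hits = sum(infer_membership(members, x, y) for x, y in members)
false_alarms = sum(infer_membership(members, x, y) for x, y in non_members)
attack_accuracy = (hits + (len(non_members) - false_alarms)) / 400
```

Because the model is far more confident on points it has memorized than on fresh points, the attack does better than the 50% expected from random guessing. Models that generalize well leak less through this channel, which is one motivation for the defenses discussed next.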
Personalized ML devices are especially vulnerable because they are trained on user data and deployed in intimate settings such as the home. To combat these privacy issues, private machine learning techniques have evolved to establish safeguards against adversaries, as discussed in the Security and Privacy chapter. Methods like differential privacy add calibrated random noise during training to obscure individual data points’ influence on the model. Popular techniques like DP-SGD (Abadi et al. 2016) also clip gradients to limit what the model leaks about the data. Still, users should also be able to delete the impact of their data after the fact.
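The core of a DP-SGD update can be sketched in a few lines: clip each example's gradient, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, and average. The function below is a minimal standard-library sketch of that single step. The parameter names are assumptions, and it omits the privacy accounting needed to turn the noise level into a concrete privacy guarantee.

```python
import math
import random

def dp_sgd_step(weights, per_example_grads, lr, clip_norm,
                noise_multiplier, rng):
    """One DP-SGD update over a batch of per-example gradients."""
    dim = len(weights)
    total = [0.0] * dim
    for g in per_example_grads:
        # Clip: rescale so the gradient's L2 norm is at most clip_norm.
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            total[i] += g[i] * scale
    # Add Gaussian noise with std noise_multiplier * clip_norm, then average.
    batch = len(per_example_grads)
    sigma = noise_multiplier * clip_norm
    noisy = [(t + rng.gauss(0.0, sigma)) / batch for t in total]
    return [w - lr * n for w, n in zip(weights, noisy)]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]   # hypothetical per-example gradients

# With the noise switched off, only the clipping effect is visible:
# the first gradient (norm 5) is scaled down to norm 1.
clipped_only = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                           noise_multiplier=0.0, rng=rng)

# A private update uses a positive noise multiplier.
private = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                      noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single example can move the model, and the noise masks what remains of that influence. Setting `noise_multiplier=0.0` is only for inspecting the clipping behavior and provides no privacy.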
15.5.3 Machine Unlearning
With ML devices personalized to individual users and then deployed to remote edges without connectivity, a challenge arises—how can models responsively “forget” data points after deployment? If users request their data be removed from a personalized model, the lack of connectivity makes retraining infeasible. Thus, efficient on-device data forgetting is necessary but poses hurdles.
Initial unlearning approaches faced limitations in this context. Given the resource constraints, retraining models from scratch on the device to forget data points proves inefficient or even impossible. Fully retraining also requires retaining all the original training data on the device, which brings its own security and privacy risks. Common machine unlearning techniques (Bourtoule et al. 2021) for remote embedded ML systems fail to enable responsive, secure data removal.
However, newer methods show promise in modifying models to approximately forget data [?] without full retraining. While the accuracy loss from avoiding full rebuilds is modest, guaranteeing data privacy should still be the priority when handling sensitive user information ethically. Even slight exposure to private data can violate user trust. As ML systems become deeply personalized, efficiency and privacy must be designed in from the start rather than treated as afterthoughts.
Recent policies, including the European Union’s General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), Japan’s Act on the Protection of Personal Information (APPI), and Canada’s proposed Consumer Privacy Protection Act (CPPA), require the deletion of private information. These policies, coupled with AI incidents like Stable Diffusion memorizing artist data, have underscored the ethical need for users to delete their data from models after training.
The right to remove data arises from privacy concerns around corporations or adversaries misusing sensitive user information. Machine unlearning refers to removing the influence of specific points from an already-trained model. Naively, this involves full retraining without the deleted data. However, connectivity constraints often make retraining infeasible for ML systems personalized and deployed to remote edges. If a smart speaker learns from private home conversations, retaining access to delete that data is important.
Although still limited, methods are evolving to enable efficient approximations of retraining for unlearning. By modifying models at inference time, they can mimic “forgetting” data without full access to the training data. However, most current techniques are restricted to simple models, still have resource costs, and trade away some accuracy. Enabling efficient data removal and respecting user privacy nonetheless remains imperative for responsible TinyML deployment.
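For very simple models, forgetting can even be exact. The sketch below is an illustrative special case, not one of the approximate methods referenced above: a one-dimensional least-squares model stored as running sums, so a point's contribution can be subtracted without keeping the rest of the training data on the device. The class and method names are hypothetical.

```python
class UnlearnableLinearModel:
    """1-D least squares (y ~ slope * x + intercept) kept as running sums,
    so a point's influence can be removed exactly without the other data."""

    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def _update(self, x, y, sign):
        self.n += sign
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.sxy += sign * x * y

    def learn(self, x, y):
        self._update(x, y, +1.0)

    def unlearn(self, x, y):
        """Forget a previously learned point (the caller must supply it)."""
        self._update(x, y, -1.0)

    def params(self):
        """Closed-form least-squares fit; needs at least two distinct x."""
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept

model = UnlearnableLinearModel()
for x, y in [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 100.0)]:
    model.learn(x, y)

model.unlearn(3.0, 100.0)            # user asks to delete this record
slope, intercept = model.params()    # same fit as training without it
```

The result matches retraining from scratch on the remaining three points, at constant memory and compute cost. Deep networks have no such additive summary, which is why unlearning for them relies on approximations.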
15.5.4 Adversarial Examples and Robustness
Machine learning models, especially deep neural networks, have a well-documented Achilles heel: they often break when even tiny perturbations are made to their inputs (Szegedy et al. 2014). Deep networks thus show an almost paradoxical dual nature, combining human-like proficiency on their training distributions with extreme fragility on slightly altered or out-of-sample inputs. This fragility exposes gaps in model generalization and in standard ML procedures, and it threatens real-world deployment in high-stakes domains. It also opens the door to adversarial attacks: attackers can deliberately search for model-breaking inputs that humans would not perceive as different.
Figure 15.3 includes an example of a small meaningless perturbation that changes a model prediction. This fragility has real-world impacts: lack of robustness undermines trust in deploying models for high-stakes applications like self-driving cars or medical diagnosis. Moreover, the vulnerability leads to security threats: attackers can deliberately craft adversarial examples that are perceptually indistinguishable from normal data but cause model failures.
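A small perturbation of this kind can be constructed with the fast gradient sign method (FGSM), a standard attack from the adversarial examples literature that is not described in this chapter. The sketch below applies it to a hypothetical logistic regression model, where the input gradient has a closed form; the weights, input, and `epsilon` are made up for illustration.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(w, b, x, y, epsilon):
    """Fast gradient sign method: move each input feature by epsilon in the
    direction that increases the loss. For cross-entropy loss on a logistic
    model, d(loss)/d(x_i) = (p - y) * w_i."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

w, b = [1.5, -2.0, 0.5], 0.0      # hypothetical trained weights
x, y = [0.2, -0.1, 0.3], 1        # an input the model classifies correctly

x_adv = fgsm(w, b, x, y, epsilon=0.25)
p_clean = predict(w, b, x)        # above 0.5: predicted class 1
p_adv = predict(w, b, x_adv)      # below 0.5: prediction flips to class 0
```

Every feature moves by at most `epsilon`, yet the prediction flips. In high-dimensional inputs such as images, many tiny per-pixel changes add up in the same way, which is why the perturbation can be imperceptible to humans.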
For instance, past work shows successful attacks that trick models for tasks like NSFW detection (Bhagoji et al. 2018), ad-blocking (Tramèr et al. 2019), and speech recognition (Nicholas Carlini et al. 2016). While errors in these domains already pose security risks, the problem extends beyond IT security. Recently, adversarial robustness has been proposed as an additional performance metric by approximating worst-case behavior.
The surprising model fragility highlighted above casts doubt on real-world reliability and opens the door to adversarial manipulation. This growing vulnerability underscores several needs. First, robustness evaluations are essential for quantifying model vulnerabilities before deployment. Approximating worst-case behavior surfaces blindspots.
Second, effective defenses across domains must be developed to close these robustness gaps. With security on the line, developers cannot ignore the threat of attacks exploiting model weaknesses. Moreover, we cannot afford any fragility-induced failures for safety-critical applications like self-driving vehicles and medical diagnosis. Lives are at stake.
Finally, the research community continues mobilizing rapidly in response. Interest in adversarial machine learning has exploded as attacks reveal the need to bridge the robustness gap between synthetic and real-world data. Conferences now commonly feature defenses for securing and stabilizing models. The community recognizes that model fragility is a critical issue that must be addressed through robustness testing, defense development, and ongoing research. By surfacing blindspots and responding with principled defenses, we can work to ensure reliability and safety for machine learning systems, especially in high-stakes domains.
15.5.5 Building Interpretable Models
As models are deployed more frequently in high-stakes settings, practitioners, developers, downstream end-users, and increasing regulation have highlighted the need for explainability in machine learning. The goal of many interpretability and explainability methods is to provide practitioners with more information about the models’ overall behavior or the behavior given a specific input. This allows users to decide whether or not a model’s output or prediction is trustworthy.
Such analysis can help developers debug models and improve performance by pointing out biases, spurious correlations, and failure modes of models. In cases where models can surpass human performance on a task, interpretability can help users and researchers better understand relationships in their data and previously unknown patterns.
There are many classes of explainability/interpretability methods, including post hoc explainability, inherent interpretability, and mechanistic interpretability. These methods aim to make complex machine learning models more understandable and ensure users can trust model predictions, especially in critical settings. By providing transparency into model behavior, explainability techniques are an important tool for developing safe, fair, and reliable AI systems.
Post Hoc Explainability
Post hoc explainability methods typically explain the output behavior of a black-box model on a specific input. Popular methods include counterfactual explanations, feature attribution methods, and concept-based explanations.
Counterfactual explanations, also frequently called algorithmic recourse, take the form “If X had not occurred, Y would not have occurred” (Wachter, Mittelstadt, and Russell 2017). For example, consider a person applying for a bank loan whose application is rejected by a model. They may ask their bank for recourse, that is, what they would need to change to become eligible for a loan. A counterfactual explanation would tell them which features they need to change and by how much such that the model’s prediction changes.
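For a linear scoring model, the loan example can be sketched in closed form: the smallest (Euclidean) change that flips the decision moves the mutable features along the model's weight vector. The weights, feature meanings, and approval rule (score at or above zero) below are hypothetical, and real recourse methods must also handle non-linear models and feasibility constraints.

```python
def score(w, b, x):
    """Linear credit score; the (hypothetical) rule approves when score >= 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def counterfactual(w, b, x, mutable, margin=0.01):
    """Smallest L2 change to the mutable features that lifts the score to +margin."""
    s = score(w, b, x)
    if s >= 0:
        return list(x)  # already approved, nothing to change
    norm_sq = sum(w[i] ** 2 for i in mutable)
    if norm_sq == 0:
        raise ValueError("no mutable feature affects the score")
    step = (margin - s) / norm_sq
    return [xi + step * w[i] if i in mutable else xi for i, xi in enumerate(x)]

# Hypothetical features: income and debt (units of $10k), age (decades).
w = [0.8, -1.0, 0.1]
b = -4.0
x = [4.0, 1.5, 3.0]   # rejected applicant
mutable = {0, 1}      # age cannot be changed by the applicant

x_cf = counterfactual(w, b, x, mutable)
changes = [round(new - old, 3) for old, new in zip(x, x_cf)]
print("required change per feature:", changes)
```

The output reads as recourse: raise income and reduce debt by the printed amounts, with age left untouched.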
Feature attribution methods highlight the input features that are important or necessary for a particular prediction. For a computer vision model, this would mean highlighting the individual pixels that contributed most to the predicted label of the image. Note that these methods do not explain how those pixels/features impact the prediction, only that they do. Common methods include input gradients, GradCAM (Selvaraju et al. 2017), SmoothGrad (Smilkov et al. 2017), LIME (Ribeiro, Singh, and Guestrin 2016), and SHAP (Lundberg and Lee 2017).
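As a minimal sketch of the simplest attribution method listed, input gradients, the snippet below estimates the gradient of a model's output with respect to each input feature using finite differences and ranks features by gradient magnitude. The `model` function is a hypothetical stand-in for a black box; in practice the gradient would come from automatic differentiation rather than finite differences.

```python
import math

def model(x):
    """Hypothetical black-box scorer; feature 2 is deliberately ignored."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 1.0 * x[1] + 0.0 * x[2])))

def input_gradient(f, x, h=1e-5):
    """Central-difference estimate of df/dx_i for each input feature."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grads.append((f(up) - f(down)) / (2 * h))
    return grads

x = [0.5, 1.0, 2.0]
saliency = [abs(g) for g in input_gradient(model, x)]
# Feature indices ordered from most to least influential for this input.
ranking = sorted(range(len(x)), key=lambda i: saliency[i], reverse=True)
print("saliency:", [round(s, 3) for s in saliency], "ranking:", ranking)
```

The ranking says which features matter for this particular input, but not how they are combined inside the model, which is the limitation noted above.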
By providing examples of changes to input features that would alter a prediction (counterfactuals) or indicating the most influential features for a given prediction (attribution), these post hoc explanation techniques shed light on model behavior for individual inputs. This granular transparency helps users determine whether they can trust and act upon specific model outputs.
Concept-based explanations aim to explain model behavior and outputs using a pre-defined set of semantic concepts (e.g., the model recognizes scene class “bedroom” based on the presence of concepts “bed” and “pillow”). Recent work shows that users often prefer these explanations to attribution and example-based explanations because they “resemble human reasoning and explanations” (Vikram V. Ramaswamy et al. 2023b). Popular concept-based explanation methods include TCAV (Cai et al. 2019), Network Dissection (Bau et al. 2017), and interpretable basis decomposition (Zhou et al. 2018).
Note that these methods are extremely sensitive to the size and quality of the concept set, and there is a tradeoff between their accuracy and faithfulness and their interpretability or understandability to humans (Vikram V. Ramaswamy et al. 2023a). However, by mapping model predictions to human-understandable concepts, concept-based explanations can provide transparency into the reasoning behind model outputs.
Inherent Interpretability
Inherently interpretable models are constructed such that their explanations are part of the model architecture and are thus naturally faithful, which sometimes makes them preferable to post-hoc explanations applied to black-box models, especially in high-stakes domains where transparency is imperative (Rudin 2019). Often, these models are constrained so that the relationships between input features and predictions are easy for humans to follow (linear models, decision trees, decision sets, k-NN models), or they obey structural knowledge of the domain, such as monotonicity (Gupta et al. 2016), causality, or additivity (Lou et al. 2013; Beck and Jackman 1998).
However, more recent works have relaxed the restrictions on inherently interpretable models, using black-box models for feature extraction and a simpler inherently interpretable model for classification, allowing for faithful explanations that relate high-level features to prediction. For example, Concept Bottleneck Models (Koh et al. 2020) predict a concept set c that is passed into a linear classifier. ProtoPNets (Chen et al. 2019) dissect inputs into linear combinations of similarities to prototypical parts from the training set.
Mechanistic Interpretability
Mechanistic interpretability methods seek to reverse engineer neural networks, often analogizing them to how one might reverse engineer a compiled binary or how neuroscientists attempt to decode the function of individual neurons and circuits in brains. Most research in mechanistic interpretability views models as a computational graph (Geiger et al. 2021), and circuits are subgraphs with distinct functionality (Wang and Zhan 2019). Current approaches to extracting circuits from neural networks and understanding their functionality rely on human manual inspection of visualizations produced by circuits (Olah et al. 2020).
Alternatively, some approaches build sparse autoencoders that encourage neurons to encode disentangled interpretable features (Davarzani et al. 2023). This field is much newer than existing areas in explainability and interpretability, and as such, most works are generally exploratory rather than solution-oriented.
There are many problems in mechanistic interpretability, including the polysemanticity of neurons and circuits, the inconvenience and subjectivity of human labeling, and the exponential search space for identifying circuits in large models with billions or trillions of parameters.
Challenges and Considerations
As methods for interpreting and explaining models progress, it is important to note that humans overtrust and misuse interpretability tools (Kaur et al. 2020) and that a user’s trust in a model due to an explanation can be independent of the correctness of the explanations (Lakkaraju and Bastani 2020). As such, beyond assessing the faithfulness and correctness of explanations, researchers must also ensure that interpretability methods are developed and deployed with a specific user in mind and that user studies are performed to evaluate their efficacy and usefulness in practice.
Furthermore, explanations should be tailored to the user’s expertise, the task they are using the explanation for, and the minimal amount of information required for the explanation to be useful, so as to prevent information overload.
While interpretability/explainability are popular areas in machine learning research, very few works study their intersection with TinyML and edge computing. Given that a significant application of TinyML is healthcare, which often requires high transparency and interpretability, existing techniques must be tested for scalability and efficiency on edge devices. Many methods rely on extra forward and backward passes, and some even require extensive training of proxy models, which is infeasible on resource-constrained microcontrollers.
That said, explainability methods can be highly useful in developing models for edge devices, as they can give insights into how input data and models can be compressed and how representations may change post-compression. Furthermore, many interpretable models are often smaller than their black-box counterparts, which could benefit TinyML applications.
15.5.6 Monitoring Model Performance
While developers may train models that seem adversarially robust, fair, and interpretable before deployment, it is imperative that both the users and the model owners continue to monitor the model’s performance and trustworthiness during the model’s full lifecycle. Data is frequently changing in practice, which can often result in distribution shifts. These distribution shifts can profoundly impact both the model’s plain predictive performance and its trustworthiness (fairness, robustness, and interpretability) on real-world data.
Furthermore, definitions of fairness frequently change with time, such as what society considers a protected attribute, and the expertise of the users asking for explanations may also change.
To ensure that models keep up to date with such changes in the real world, developers must continually evaluate their models on current and representative data and standards and update models when necessary.
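One lightweight way to watch for the distribution shifts described above is to compare a feature's values at training time with its values in production. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distributions) and raises a flag above a threshold. The heart-rate readings and the 0.2 threshold are hypothetical; a real monitor would choose features, window sizes, and thresholds for the application at hand.

```python
from bisect import bisect_right

def ks_statistic(ref, cur):
    """Largest gap between the empirical CDFs of two samples."""
    ref, cur = sorted(ref), sorted(cur)
    return max(
        abs(bisect_right(ref, v) / len(ref) - bisect_right(cur, v) / len(cur))
        for v in ref + cur
    )

# Hypothetical heart-rate readings (bpm): training window vs. production window.
reference = list(range(60, 100))
production = [v + 10 for v in reference]  # every reading shifted up by 10 bpm

DRIFT_THRESHOLD = 0.2  # assumed alert level, to be tuned per application
drift = ks_statistic(reference, production)
drifted = drift > DRIFT_THRESHOLD
print(f"KS statistic: {drift:.2f}  drift flagged: {drifted}")
```

A flag like this does not say whether accuracy or fairness has degraded, only that the inputs no longer look like the training data, which is the cue to re-evaluate the model on current data.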
15.6 Implementation Challenges
15.6.1 Organizational and Cultural Structures
While innovation and regulation are often seen as having competing interests, many countries have found it necessary to provide oversight as AI systems expand into more sectors. As illustrated in Figure 15.4, this oversight has become crucial as these systems continue permeating various industries and impacting people’s lives (see Human-Centered AI, Chapter 8, “Government Interventions and Regulations”).
Examples of such oversight include:
The European Union’s General Data Protection Regulation (GDPR)
The European Commission’s White Paper on Artificial Intelligence: a European approach to excellence and trust
The UK’s Information Commissioner’s Office and Alan Turing Institute’s co-badged guidance on explaining AI decisions to the individuals affected by them.
15.6.2 Obtaining Quality and Representative Data
As discussed in the Data Engineering chapter, responsible AI design must occur at all pipeline stages, including data collection. This raises the question: what does it mean for data to be high-quality and representative? Consider the following scenarios that hinder the representativeness of data:
Subgroup Imbalance
This is likely what comes to mind when hearing “representative data.” Subgroup imbalance means the dataset contains relatively more data from one subgroup than another. This imbalance can negatively affect the downstream ML model by causing it to overfit a subgroup of people while performing poorly on another.
One example consequence of subgroup imbalance is racial discrimination in facial recognition technology (Buolamwini and Gebru 2018); commercial facial recognition algorithms have up to 34% worse error rates on darker-skinned females than lighter-skinned males.
Note that data imbalance goes both ways, and subgroups can also be harmfully overrepresented in the dataset. For example, the Allegheny Family Screening Tool (AFST) predicts the likelihood that a child will eventually be removed from a home. The AFST produces disproportionate scores for different subgroups, one of the reasons being that it is trained on historically biased data, sourced from juvenile and adult criminal legal systems, public welfare agencies, and behavioral health agencies and programs.
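A basic audit for subgroup imbalance reports, for each subgroup, its share of the dataset and the model's error rate on it. The sketch below uses a tiny made-up dataset with hypothetical group labels "A" and "B"; it illustrates the bookkeeping only and is not a substitute for a full fairness evaluation.

```python
from collections import defaultdict

def subgroup_report(groups, y_true, y_pred):
    """Per-subgroup share of the data and error rate of the predictions."""
    stats = defaultdict(lambda: {"n": 0, "errors": 0})
    for g, t, p in zip(groups, y_true, y_pred):
        stats[g]["n"] += 1
        stats[g]["errors"] += int(t != p)
    total = sum(s["n"] for s in stats.values())
    return {
        g: {"share": s["n"] / total, "error_rate": s["errors"] / s["n"]}
        for g, s in stats.items()
    }

# Made-up labels and predictions: group "A" dominates the dataset.
groups = ["A"] * 8 + ["B"] * 2
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]

report = subgroup_report(groups, y_true, y_pred)
for g, r in report.items():
    print(f"group {g}: share={r['share']:.2f} error_rate={r['error_rate']:.3f}")
```

Here the overall error rate (2 of 10) hides the fact that the underrepresented group fares much worse than the majority group, which is why per-subgroup reporting matters.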
Quantifying Target Outcomes
This occurs in applications where the ground-truth label cannot be measured or is difficult to represent in a single quantity. For example, an ML model in a mobile wellness application may want to predict individual stress levels. The true stress labels themselves are impossible to obtain directly and must be inferred from other biosignals, such as heart rate variability and user self-reported data. In these situations, noise is built into the data by design, making this a challenging ML task.
Distribution Shift
Data may no longer represent a task if a major external event causes the data source to change drastically. The most common way to think about distribution shifts is with respect to time; for example, data on consumer shopping habits collected pre-COVID may no longer be representative of consumer behavior today.
Transferring a model to a new setting causes another form of distribution shift. For instance, when applying a triage system that was trained on data from one hospital to another, a distribution shift may occur if the two hospitals are very different.
Gathering Data
A reasonable solution for many of the above problems with non-representative or low-quality data is to collect more; we can collect more data targeting an underrepresented subgroup or from the target hospital to which our model might be transferred. However, for several reasons, gathering more data can be an inappropriate or infeasible solution for the task at hand.
Data collection can be harmful. This is the paradox of exposure, the situation in which those who stand to significantly gain from their data being collected are also those who are put at risk by the collection process (D’Ignazio and Klein 2023, Chapter 4). For example, collecting more data on non-binary individuals may be important for ensuring the fairness of the ML application, but it also puts them at risk, depending on who is collecting the data and how (whether the data is easily identifiable, contains sensitive content, etc.).
Data collection can be costly. In some domains, such as healthcare, obtaining data can be costly in terms of time and money.
Biased data collection. Electronic Health Records are a huge data source for ML-driven healthcare applications. Issues of subgroup representation aside, the data itself may be collected in a biased manner. For example, negative language (“nonadherent,” “unwilling”) is disproportionately used for Black patients (Himmelstein, Bates, and Zhou 2022).
We conclude with several additional strategies for maintaining data quality. First, fostering a deeper understanding of the data is crucial. This can be achieved through the implementation of standardized labels and measures of data quality, such as in the Data Nutrition Project. Collaborating with organizations responsible for collecting data helps ensure the data is interpreted correctly. Second, employing effective tools for data exploration is important. Visualization techniques and statistical analyses can reveal issues with the data. Finally, establishing a feedback loop within the ML pipeline is essential for understanding the real-world implications of the data. Metrics, such as fairness measures, allow us to define “data quality” in the context of the downstream application; improving fairness may directly improve the quality of the predictions that the end users receive.
15.6.3 Balancing Accuracy and Other Objectives
Machine learning models are often evaluated on accuracy alone, but this single metric cannot fully capture model performance and tradeoffs for responsible AI systems. Other ethical dimensions, such as fairness, robustness, interpretability, and privacy, may compete with pure predictive accuracy during model development. For instance, inherently interpretable models such as small decision trees or linear classifiers with simplified features intentionally trade some accuracy for transparency in the model behavior and predictions. While these simplified models achieve lower accuracy by not capturing all the complexity in the dataset, improved interpretability builds trust by enabling direct analysis by human practitioners.
Additionally, certain techniques meant to improve adversarial robustness, such as adversarial training or dimensionality reduction, can degrade accuracy on clean validation data. In sensitive applications like healthcare, focusing narrowly on state-of-the-art accuracy carries ethical risks if it allows models to rely more on spurious correlations that introduce bias or use opaque reasoning. Therefore, the appropriate performance objectives depend greatly on the sociotechnical context.
Methodologies like Value Sensitive Design provide frameworks for formally evaluating the priorities of various stakeholders within the real-world deployment system. These frameworks surface the tensions between values like accuracy, interpretability, and fairness, which can then guide responsible tradeoff decisions. For a medical diagnosis system, achieving the highest accuracy may not be the singular goal: improving transparency to build practitioner trust or reducing bias towards minority groups could justify small losses in accuracy. Analyzing the sociotechnical context is key for setting these objectives.
By taking a holistic view, we can responsibly balance accuracy with other ethical objectives for model success. Ongoing performance monitoring along multiple dimensions is crucial as the system evolves after deployment.
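One simple way to make such a tradeoff explicit is to treat a non-accuracy objective as a constraint and pick the most accurate model that satisfies it. The sketch below is only an illustration: the candidate names, their accuracy and fairness-gap numbers, and the tolerance are all made up, and the right constraint depends on the stakeholder analysis described above.

```python
# Hypothetical evaluation results for three candidate models.
candidates = [
    {"name": "deep_net", "accuracy": 0.94, "parity_gap": 0.12},
    {"name": "boosted_trees", "accuracy": 0.92, "parity_gap": 0.04},
    {"name": "small_tree", "accuracy": 0.88, "parity_gap": 0.03},
]

def select_model(candidates, max_gap):
    """Most accurate candidate whose fairness gap is within the tolerance."""
    eligible = [c for c in candidates if c["parity_gap"] <= max_gap]
    if not eligible:
        return None  # no candidate meets the constraint
    return max(eligible, key=lambda c: c["accuracy"])

chosen = select_model(candidates, max_gap=0.05)
print("selected:", chosen["name"])
```

With a tolerance of 0.05 the most accurate model is ruled out and a slightly less accurate one is selected, which is the kind of small accuracy loss the text argues can be justified.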
15.7 Ethical Considerations in AI Design
This section discusses some of the many ethical issues at stake in designing and applying AI systems, along with diverse frameworks for approaching them, including those from AI safety, Human-Computer Interaction (HCI), and Science, Technology, and Society (STS).
15.7.1 AI Safety and Value Alignment
In 1960, Norbert Wiener wrote, “if we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we desire” (Wiener 1960).
In recent years, as the capabilities of deep learning models have matched, and sometimes even surpassed, human abilities, the issue of creating AI systems that act in accord with human intentions instead of pursuing unintended or undesirable goals has become a source of concern (Russell 2021). Within the field of AI safety, a particular goal concerns “value alignment,” or the problem of how to code the “right” purpose into machines (see Human-Compatible Artificial Intelligence). Present AI research assumes we know the objectives we want to achieve and “studies the ability to achieve objectives, not the design of those objectives.”
However, complex real-world deployment contexts make explicitly defining “the right purpose” for machines difficult, requiring frameworks for responsible and ethical goal-setting. Methodologies like Value Sensitive Design provide formal mechanisms to surface tensions between stakeholder values and priorities.
By taking a holistic sociotechnical view, we can better ensure intelligent systems pursue objectives that align with broad human intentions rather than maximizing narrow metrics like accuracy alone. Achieving this in practice remains an open and critical research question as AI capabilities advance rapidly.
The absence of this alignment can lead to several AI safety issues, as has been documented in a variety of deep learning models. A common feature of systems that optimize for an objective is that variables not directly included in the objective may be set to extreme values to help optimize for that objective, leading to issues characterized as specification gaming, reward hacking, etc., in reinforcement learning (RL).
In recent years, a particularly popular implementation of RL has been models pre-trained using self-supervised learning and fine-tuned using reinforcement learning from human feedback (RLHF) (Christiano et al. 2017). Ngo, Chan, and Mindermann (2022) argue that by rewarding models for appearing harmless and ethical while also maximizing useful outcomes, RLHF could encourage the emergence of three problematic properties: situationally aware reward hacking, where policies exploit human fallibility to gain high reward; misaligned internally-represented goals that generalize beyond the RLHF fine-tuning distribution; and power-seeking strategies.
Similarly, Van Noorden (2016) outlines five concrete problems for AI safety: avoiding negative side effects, avoiding reward hacking, scalable oversight for aspects of the objective that are too expensive to be frequently evaluated during training, safe exploration strategies that encourage creativity while preventing harm, and robustness to distributional shift in unseen testing environments.
15.7.2 Autonomous Systems, Control, and Trust
The consequences of autonomous systems that act independently of human oversight and often outside human judgment have been well documented across several industries and use cases. Most recently, the California Department of Motor Vehicles suspended Cruise’s deployment and testing permits for its autonomous vehicles citing “unreasonable risks to public safety”. One such accident occurred when a vehicle struck a pedestrian who stepped into a crosswalk after the stoplight had turned green, and the vehicle was allowed to proceed. In 2018, a pedestrian crossing the street with her bike was killed when a self-driving Uber car, which was operating in autonomous mode, failed to accurately classify her moving body as an object to be avoided.
Autonomous systems beyond self-driving vehicles are also susceptible to such issues, with potentially graver consequences, as remotely operated drones are already reshaping warfare. While such incidents bring up important ethical questions regarding who should be held responsible when these systems fail, they also highlight the technical challenges of giving full control of complex, real-world tasks to machines.
At its core, there is a tension between human and machine autonomy. Engineering and computer science disciplines have tended to focus on machine autonomy. For example, as of 2019, a search for the word “autonomy” in the Digital Library of the Association for Computing Machinery (ACM) reveals that of the top 100 most cited papers, 90% are on machine autonomy (Calvo et al. 2020). In an attempt to build systems for the benefit of humanity, these disciplines have taken, without question, increasing productivity, efficiency, and automation as primary strategies for benefiting humanity.
These goals put machine automation at the forefront, often at the expense of the human. This approach suffers from inherent challenges, as noted since the early days of AI through the frame problem and the qualification problem, which formalize the observation that it is impossible to specify all the preconditions needed for a real-world action to succeed (McCarthy 1981).
These logical limitations have given rise to mathematical approaches such as Responsibility-sensitive safety (RSS) (Shalev-Shwartz, Shammah, and Shashua 2017), which is aimed at breaking down the end goal of an automated driving system (namely safety) into concrete and checkable conditions that can be rigorously formulated in mathematical terms. The goal of RSS is that those safety rules guarantee Automated Driving System (ADS) safety in the rigorous form of mathematical proof. However, such approaches tend towards using automation to address the problems of automation and are susceptible to many of the same issues.
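To give a flavor of what a "concrete and checkable condition" looks like, the sketch below computes a minimum safe following distance in the style of the RSS longitudinal rule: the rear car may accelerate during its response time and then brake gently, while the front car brakes as hard as possible. The formula is written here as it is commonly stated from the RSS paper and should be checked against the original; the function name and all parameter values are illustrative assumptions.

```python
def rss_min_gap(v_rear, v_front, rho, a_max_accel, b_min_brake, b_max_brake):
    """Minimum safe gap (m) between a rear and a front car in the same lane.

    v_rear, v_front : current speeds (m/s)
    rho             : response time of the rear car (s)
    a_max_accel     : worst-case acceleration of the rear car during rho (m/s^2)
    b_min_brake     : braking the rear car is guaranteed to apply afterwards (m/s^2)
    b_max_brake     : hardest braking the front car might apply (m/s^2)
    """
    v_rear_after = v_rear + rho * a_max_accel
    gap = (
        v_rear * rho
        + 0.5 * a_max_accel * rho ** 2
        + v_rear_after ** 2 / (2 * b_min_brake)
        - v_front ** 2 / (2 * b_max_brake)
    )
    return max(gap, 0.0)  # a negative value means any gap is safe

# Illustrative numbers only: both cars at 20 m/s, 1 s response time.
required = rss_min_gap(20.0, 20.0, rho=1.0, a_max_accel=2.0,
                       b_min_brake=4.0, b_max_brake=8.0)
print(f"required gap: {required} m")
```

A rule of this form can be checked at every time step, but the guarantee only holds if the assumed bounds on acceleration, braking, and response time hold in the real world.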
Another approach to combating these issues is to focus on the human-centered design of interactive systems that incorporate human control. Value-sensitive design (Friedman 1996) described four key design factors for a user interface that impact autonomy: system capability, complexity, misrepresentation, and fluidity. A more recent model, called METUX (A Model for Motivation, Engagement, and Thriving in the User Experience), leverages insights from Self-determination Theory (SDT) in Psychology to identify six distinct spheres of technology experience that contribute to the design of systems that promote well-being and human flourishing (Peters, Calvo, and Ryan 2018). SDT defines autonomy as acting in accordance with one’s goals and values, which is distinct from the use of autonomy as simply a synonym for either independence or being in control (Ryan and Deci 2000).
Calvo et al. (2020) elaborate on METUX and its six “spheres of technology experience” in the context of AI-recommender systems. They propose these spheres (Adoption, Interface, Tasks, Behavior, Life, and Society) as a way of organizing thinking and evaluation of technology design in order to appropriately capture contradictory and downstream impacts on human autonomy when interacting with AI systems.
15.7.3 Economic Impacts on Jobs, Skills, Wages
A major concern of the current rise of AI technologies is widespread unemployment. As AI systems’ capabilities expand, many fear these technologies will cause an absolute loss of jobs as they replace current workers and overtake alternative employment roles across industries. However, economic landscapes changing at the hands of automation is not new, and historically such changes have been found to reflect patterns of displacement rather than replacement (Shneiderman 2022, Chapter 4). In particular, automation usually lowers costs and increases quality, greatly increasing access and demand. The need to serve these growing markets pushes production, creating new jobs.
Furthermore, studies have found that attempts to achieve “lights-out” automation (productive and flexible automation with a minimal number of human workers) have been unsuccessful. Attempts to do so have led to what the MIT Work of the Future task force has termed “zero-sum automation,” in which process flexibility is sacrificed for increased productivity.
In contrast, the task force proposes a “positive-sum automation” approach in which flexibility is increased by designing technology that strategically incorporates humans where they are most needed, making it easier for line employees to train and debug robots, using a bottom-up approach to identify which tasks should be automated, and choosing the right metrics for measuring success (see MIT’s Work of the Future).
However, the optimism of the high-level outlook does not preclude individual harm, especially to those whose skills and jobs will be rendered obsolete by automation. Public and legislative pressure, as well as corporate social responsibility efforts, will need to be directed at creating policies that share the benefits of automation with workers and result in higher minimum wages and benefits.
15.7.4 Scientific Communication and AI Literacy
A 1993 survey of 3000 North American adults’ beliefs about the “electronic thinking machine” revealed two primary perspectives of the early computer: the “beneficial tool of man” perspective and the “awesome thinking machine” perspective. The attitudes contributing to the “awesome thinking machine” view in this and other studies revealed a characterization of computers as “intelligent brains, smarter than people, unlimited, fast, mysterious, and frightening” (Martin 1993). These fears highlight an easily overlooked component of responsible AI, especially amidst the rush to commercialize such technologies: scientific communication that accurately conveys the capabilities and limitations of these systems while being transparent about the limits of experts’ own knowledge of them.
As AI systems’ capabilities expand beyond most people’s comprehension, there is a natural tendency to assume the kinds of apocalyptic worlds painted by our media. This is partly due to the apparent difficulty of assimilating scientific information, even in technologically advanced cultures, which leads to the products of science being perceived as magic—“understandable only in terms of what it did, not how it worked” (Handlin 1965).
While tech companies should be held responsible for limiting grandiose claims and not falling into cycles of hype, research studying scientific communication, especially concerning (generative) AI, will also be useful in tracking and correcting public understanding of these technologies. An analysis of the Scopus scholarly database found that such research is scarce, with only a handful of papers mentioning both “science communication” and “artificial intelligence” (Schäfer 2023).
Research that exposes the perspectives, frames, and images of the future promoted by academic institutions, tech companies, stakeholders, regulators, journalists, NGOs, and others will also help to identify potential gaps in AI literacy among adults (Lindgren 2023). Increased focus on AI literacy from all stakeholders will be important in helping people whose skills are rendered obsolete by AI automation (Ng et al. 2021).
“But even those who never acquire that understanding need assurance that there is a connection between the goals of science and their welfare, and above all, that the scientist is not a man altogether apart but one who shares some of their values.” (Handlin, 1965)
15.8 Conclusion
Responsible artificial intelligence is crucial as machine learning systems exert growing influence across healthcare, employment, finance, and criminal justice sectors. While AI promises immense benefits, thoughtlessly designed models risk perpetrating harm through biases, privacy violations, unintended behaviors, and other pitfalls.
Upholding principles of fairness, explainability, accountability, safety, and transparency enables the development of ethical AI aligned with human values. However, implementing these principles involves surmounting complex technical and social challenges around detecting dataset biases, choosing appropriate model tradeoffs, securing quality training data, and more. Frameworks like value-sensitive design guide balancing accuracy versus other objectives based on stakeholder needs.
Looking forward, advancing responsible AI necessitates continued research and industry commitment. More standardized benchmarks are required to compare model biases and robustness. As personalized TinyML expands, enabling efficient transparency and user control for edge devices warrants focus. Revised incentive structures and policies must encourage deliberate, ethical development before reckless deployment. Education around AI literacy and its limitations will further contribute to public understanding.
Responsible methods underscore that while machine learning offers immense potential, thoughtless application risks adverse consequences. Cross-disciplinary collaboration and human-centered design are imperative so AI can promote broad social benefit. The path ahead lies not in an arbitrary checklist but in a steadfast commitment to understand and uphold our ethical responsibility at each step. By taking conscientious action, the machine learning community can lead AI toward empowering all people equitably and safely.