Lecture: AI Ethics Case Studies

Learning Objectives

Apply ethical analysis frameworks to real AI deployment scenarios
Identify stakeholders, harms, and tradeoffs in AI systems
Evaluate competing fairness and accountability claims
Practice structured ethical argumentation

Introduction: From Abstract to Applied

AI ethics is easy to discuss in the abstract. Applied to real systems with real data and real affected populations, the easy principles ("AI should be fair") reveal themselves as complex tradeoffs with no clean answers. This lecture examines three case studies in depth, giving you frameworks and vocabulary for this analysis.

For each case study: (1) identify the stakeholders, (2) identify the potential harms and benefits, (3) apply a relevant ethical framework, (4) consider the distributional effects, and (5) assess what should be done.

Case Study 1: Algorithmic Hiring Screening

Scenario: A Fortune 500 company deploys an AI system to pre-screen 50,000 job applications per year for engineering roles. The model predicts "likelihood of success" based on resume content. It reduces screening time from 6 weeks to 3 days. An internal audit finds that the model recommends female candidates at 75% of the rate of equivalent male candidates. HR says this mirrors historical hiring rates; the ML team says the model is simply "reflecting reality."

The "reflecting reality" argument, examined: If women were historically hired into engineering roles at 75% of the rate of men, and that gap resulted from bias (gender discrimination, hostile work culture, unequal educational opportunity) rather than merit, then a model trained on those outcomes encodes the bias. The model does not merely reflect reality; it perpetuates it at scale, with an algorithmic authority that may be harder to challenge than individual human decisions.
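The audit figure in the scenario can be compared against the "four-fifths rule" from the EEOC's Uniform Guidelines, a rule of thumb under which a selection rate below 80% of the highest group's rate is generally treated as evidence of adverse impact. The sketch below is a minimal illustration, not a legal test: the function name and the two selection rates are hypothetical, chosen only so that their ratio matches the 75% reported in the scenario.

```python
FOUR_FIFTHS_THRESHOLD = 0.8  # rule-of-thumb cutoff, not a legal determination


def adverse_impact_ratio(rate_group: float, rate_reference: float) -> float:
    """Return a group's selection rate divided by the reference group's rate."""
    if rate_reference <= 0:
        raise ValueError("reference selection rate must be positive")
    return rate_group / rate_reference


# Hypothetical selection rates; only their ratio (75%) comes from the scenario.
male_rate = 0.20
female_rate = 0.15

ratio = adverse_impact_ratio(female_rate, male_rate)
flagged = ratio < FOUR_FIFTHS_THRESHOLD
print(f"impact ratio = {ratio:.2f}, below four-fifths threshold: {flagged}")
```

A ratio of 0.75 falls below the 0.8 threshold, so under this rule of thumb the audit result would normally call for closer scrutiny instead of being accepted as "reflecting reality."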

Stakeholders:

Applicants (especially women, whom the audit shows are screened out disproportionately, and potentially other underrepresented groups)
Hiring managers (who save screening time but may be relying on a systematically biased tool)
The company (efficiency gains, but legal risk under Title VII, EEOC guidelines)
Future employees of the company (workforce diversity affects culture and problem-solving)

Discussion questions:

Is training on historical data that reflects discrimination ethically permissible?
Which fairness metric should be applied: demographic parity (equal recommendation rates across groups) or equalized odds (equal true-positive and false-positive rates across groups, that is, equal recommendation rates conditional on actual job performance)?
What transparency obligations exist to applicants who are screened out by this system?
When a company audit finds bias, as it did here, what are the company's obligations: to fix the system, or to stop using it?
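The two fairness metrics named in the questions above can be computed side by side. The sketch below is a minimal illustration on made-up data: the groups "A" and "B", the record counts, and the helper names group_rates and make_records are all assumptions for this example. The "successful" label stands in for actual job performance, which in practice is observed only for people who were hired.

```python
from collections import defaultdict


def make_records(group, n_tp, n_fn, n_fp, n_tn):
    """Build (group, recommended, successful) tuples from confusion-matrix counts."""
    return (
        [(group, 1, 1)] * n_tp      # recommended and successful
        + [(group, 0, 1)] * n_fn    # not recommended but successful
        + [(group, 1, 0)] * n_fp    # recommended but not successful
        + [(group, 0, 0)] * n_tn    # not recommended and not successful
    )


def group_rates(records):
    """Return selection rate, true-positive rate, and false-positive rate per group."""
    counts = defaultdict(lambda: {"n": 0, "rec": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for group, recommended, successful in records:
        c = counts[group]
        c["n"] += 1
        c["rec"] += recommended
        if successful:
            c["pos"] += 1
            c["tp"] += recommended
        else:
            c["neg"] += 1
            c["fp"] += recommended
    return {
        g: {
            "selection_rate": c["rec"] / c["n"],  # compared under demographic parity
            "tpr": c["tp"] / c["pos"],            # compared under equalized odds
            "fpr": c["fp"] / c["neg"],            # compared under equalized odds
        }
        for g, c in counts.items()
    }


# Toy data: group A has 6 successful candidates out of 10, group B has 2 out of 10.
records = make_records("A", 3, 3, 1, 3) + make_records("B", 1, 1, 2, 6)
rates = group_rates(records)
for group, r in rates.items():
    print(group, r)
```

In this toy data both groups have the same true-positive and false-positive rates, so equalized odds is satisfied, yet their selection rates differ (0.4 versus 0.3) because the share of successful candidates differs between the groups. When base rates differ, a model whose recommendations carry any information about the outcome cannot satisfy both metrics at once, which is why the choice of metric is an ethical decision and not only a technical one.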

Case Study 2: AI Diagnostic Aid in Emergency Medicine

Scenario: A large hospital network deploys an AI system that analyzes chest X-rays to flag potential pneumonia for urgent review. The system was trained on 2 million X-rays from academic medical centers. It has 92% sensitivity and 88% specificity overall. An equity audit reveals: 86% sensitivity and 84% specificity for Black patients, 88%/86% for Hispanic patients, and 95%/91% for white patients. The performance gap is attributed to underrepresentation of Black and Hispanic patients in the training X-rays, compounded by X-ray exposure settings that vary from site to site.

The harm quantified: At 92% sensitivity for the overall population, the system flags 92 of every 100 pneumonia cases for urgent review. For Black patients, sensitivity is only 86%: the system misses 14 of every 100 cases, compared with 5 of every 100 for white patients. That is 9 more missed cases per 100 than for white patients, and 6 more than the overall rate. In emergency medicine, a missed pneumonia diagnosis significantly increases mortality risk. The disparity is literally a matter of life and death.
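The arithmetic behind these figures can be laid out explicitly. The sketch below uses only the sensitivities stated in the scenario; the variable and function names are illustrative.

```python
# Sensitivity in percent, taken from the equity audit in the scenario.
sensitivity_pct = {"overall": 92, "Black": 86, "Hispanic": 88, "white": 95}


def missed_per_100(sens_pct: int) -> int:
    """Pneumonia cases the system fails to flag, per 100 true cases."""
    return 100 - sens_pct


missed = {group: missed_per_100(s) for group, s in sensitivity_pct.items()}

# Extra missed cases per 100, relative to white patients and to the overall rate.
gap_vs_white = {g: m - missed["white"] for g, m in missed.items()}
gap_vs_overall = {g: m - missed["overall"] for g, m in missed.items()}

for group in sensitivity_pct:
    print(f"{group}: {missed[group]} missed per 100 "
          f"({gap_vs_white[group]:+d} vs white, {gap_vs_overall[group]:+d} vs overall)")
```

The choice of baseline matters: for Black patients the gap is 6 missed cases per 100 when measured against the overall rate, but 9 per 100 when measured against white patients, the best-served group.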

The deployment dilemma: Even for Black patients, the system is probably better than no AI assistance at all. But is it ethical to deploy a system with known disparate performance that may widen health disparities? Conversely, is it ethical to withhold it, given that Black patients also benefit from the 86% sensitivity? What are the disclosure obligations to patients?

Discussion questions:

Should the hospital deploy the system with the known disparity while working to address it?
Should the hospital develop separate models by demographic group? What problems does this raise?
What are the disclosure obligations to patients whose X-rays are screened by this system?
Who is responsible for the underrepresentation in the training data: the model developer, the hospital, or the training data providers?

Case Study 3: Content Moderation at Scale

Scenario: A social media platform with 500 million users uses ML models to detect and remove policy-violating content (hate speech, harassment, misinformation). The models process 10 million posts per day and flag 2% for removal. Performance: 78% precision (78% of removed posts genuinely violate policy) and 65% recall (65% of violating posts are caught). The error analysis reveals that satirical content, posts in non-English languages (especially African languages), and political speech about marginalized communities are overrepresented among false positives.

The precision-recall tradeoff as an ethics question: For a fixed model, the tradeoff is set by the decision threshold. Raising precision (fewer innocent posts removed) means accepting lower recall (more harmful content stays up). Raising recall (more harmful content removed) means accepting lower precision (more innocent content removed). Who bears the costs of each type of error? Innocent posts that are removed come disproportionately from underrepresented language communities. Harmful content that remains reaches users who may be harassed or radicalized. Neither error is costless, and both fall disproportionately on specific populations.
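The tradeoff described above comes from moving a single decision threshold over a model's scores. The sketch below shows this on ten made-up posts: the scores, labels, and thresholds are invented for illustration and do not come from the scenario.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every post scoring >= threshold is removed."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)          # violating, removed
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)      # innocent, removed
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)      # violating, left up
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores (higher = more likely violating) and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, True, False, True, False, False, False]

results = {t: precision_recall(scores, labels, t) for t in (0.85, 0.55, 0.25)}
for threshold, (precision, recall) in results.items():
    print(f"threshold {threshold:.2f}: precision {precision:.3f}, recall {recall:.3f}")
```

As the threshold drops from 0.85 to 0.25 in this toy data, recall rises from 0.4 to 1.0 while precision falls from 1.0 to 0.625. Choosing the threshold is therefore a choice about which group absorbs the errors: people whose innocent posts are removed, or people exposed to harmful content that stays up.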

The scale problem: At 10 million posts per day, even a very accurate system makes a large absolute number of errors; a 1% error rate alone means 100,000 wrong decisions daily. Human review of flagged posts is expensive and psychologically damaging to reviewers, so automated systems are often the only economically feasible option at this scale. But automated systems handle nuance poorly.
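The daily error volume implied by the scenario's own figures can be worked out directly. The sketch below assumes that every flagged post is removed and that the stated precision and recall hold across the whole daily stream; both are simplifying assumptions, and the variable names are illustrative.

```python
posts_per_day = 10_000_000
flag_rate = 0.02   # share of posts flagged for removal
precision = 0.78   # share of removed posts that truly violate policy
recall = 0.65      # share of violating posts that are caught

removed = round(posts_per_day * flag_rate)
true_positives = round(removed * precision)         # violating posts removed
false_positives = removed - true_positives          # innocent posts removed
violating_total = round(true_positives / recall)    # all violating posts that day
false_negatives = violating_total - true_positives  # violating posts left up
total_errors = false_positives + false_negatives

print(f"removed per day:         {removed:,}")
print(f"innocent posts removed:  {false_positives:,}")
print(f"violating posts missed:  {false_negatives:,}")
print(f"total errors per day:    {total_errors:,}")
```

Under these assumptions the platform wrongly removes about 44,000 posts and misses about 84,000 violating posts every day, roughly 128,000 errors in total. Any plan for human review or appeals has to be sized against numbers of that order.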

Discussion Exercise (30 minutes)

For each case study, write a 200-word recommendation memo. Your memo should: identify the 3 most important stakeholders, state the key ethical tradeoff, take a position, and explain what monitoring or mitigation you'd require. Compare your recommendations with classmates. Where do people disagree? Why?

Research Exercise

Find one real-world case from the past 3 years that resembles one of these scenarios. What actually happened? What would you have recommended? What does the actual outcome teach you about the gap between ethical recommendation and organizational reality?