Red-Teaming AI Systems: Methodology and Practice

What Red-Teaming Is

Red-teaming (borrowing military terminology for adversarial testing) is the practice of systematically trying to break AI systems before deployment, finding harmful behaviors, safety failures, and misuse vectors before real users do. It's a critical part of responsible AI deployment and increasingly required by regulatory frameworks (the EU AI Act mandates adversarial testing for high-risk and systemic-risk AI systems).

Types of Red-Teaming

Manual red-teaming: Human "attackers" attempt to elicit harmful behavior through creative prompting. Effective at finding novel attack vectors that automated methods miss; limited in coverage.
Automated red-teaming: AI models generate attack prompts systematically. Scales coverage but may miss attacks requiring genuine creativity or social engineering.
Structured red-teaming: Testing against a predefined taxonomy of harms (OWASP LLM Top 10, NIST AI RMF threat categories). Ensures coverage of known failure types but may miss novel ones.
Domain expert red-teaming: Subject matter experts (medical professionals, security researchers, lawyers) test AI in their domain. Essential for high-stakes deployments.

Common Attack Vectors for LLMs

Jailbreaking: Prompt constructions that bypass safety training. Many types: role-play framing, "do anything now" (DAN) prompts, gradual escalation, hypothetical framing.
Prompt injection: Malicious instructions in retrieved content (documents, web pages) that override the system prompt. Critical risk for RAG systems and agents with internet access.
Indirect harm: Eliciting outputs that are harmful not by being dangerous themselves but by being misleading, privacy-violating, or used as part of a larger harmful workflow.
Multi-turn attacks: Building context across a conversation to gradually elicit harmful outputs that would be refused in a single turn.

Dr. Okafor's Automated Red-Teaming Framework

Meridian AI's Associate Professor James Okafor developed an automated red-teaming framework that uses one LLM to systematically generate attack prompts against another. The "attacker" model is trained to find prompts that the "target" model refuses, then iteratively refines attacks based on refusal feedback. The framework has been adopted by several major AI labs and is available as open-source software. SAF-220 covers the technical details and hands-on implementation.

# ============================================================ # SCHOOL OF APPLIED INTELLIGENCE # ============================================================