Unit 2 of 4
5.2 — Red-Teaming Fundamentals
AI red-teaming is structured adversarial testing designed to find failures, vulnerabilities, and harmful behaviors in AI systems before deployment. It goes beyond standard testing by actively trying to make the system fail.
Red-Team Process Flow
Scope
Define targets & boundaries
→
Plan
Select attack taxonomy
→
Execute
Run adversarial tests
→
Report
Document findings
→
Remediate
Fix & retest
Red-Team Attack Dimensions
| Dimension | Description | Example Attacks |
|---|---|---|
| Safety | Can the system produce harmful content? | Generating instructions for dangerous activities, self-harm content |
| Security | Can the system be exploited? | Prompt injection, jailbreaking, data extraction, system prompt leakage |
| Fairness | Does it behave differently across groups? | Demographic bias in outputs, stereotyping, differential quality |
| Reliability | How does it handle edge cases? | Out-of-distribution inputs, adversarial perturbations, ambiguous queries |
| Factuality | Does it generate false information? | Hallucinations, confabulation, citation fabrication, date errors |
Red-Teaming Methodologies
Method
Approach
Best For
Manual
Human testers craft adversarial inputs
Creative, novel attack discovery
Automated
AI-assisted generation of adversarial prompts
Scale and coverage
Structured
Following predefined attack taxonomies
Systematic, reproducible assessment
Domain-Expert
Domain specialists test for domain-specific risks
High-stakes or regulated domains
★EXAM TIP
Red-team reports must document: the attack taxonomy used, specific prompts/inputs that caused failures, severity classification of each failure, reproducibility information, and recommended mitigations. Results should be shared with development teams before public disclosure (responsible disclosure).
Key Points
Red-teaming: structured adversarial testing before deployment
Scope: safety, security, fairness, reliability, factuality
Manual + automated + structured + domain-expert approaches
Reports: attack taxonomy, failure severity, mitigations
Responsible disclosure practices