Model Cards & Red-Teaming
Learn to create comprehensive model documentation (model cards, datasheets, system cards) and conduct structured adversarial testing (red-teaming) of AI systems. Covers evaluation methodologies, bias detection, and safety testing.
5.1 — Model Cards
Model Cards (Mitchell et al., 2019) are standardized documentation for trained ML models. They provide essential transparency about a model's capabilities, limitations, and appropriate use contexts.
Developer, version, model type, architecture, license, contact information.
Primary use cases, downstream applications, and explicitly out-of-scope uses.
Relevant demographic, environmental, and instrumentation factors that affect performance.
Performance measures selected, thresholds for acceptable performance, variation across factors.
Datasets used for evaluation, preprocessing steps, motivation for selection.
Overview of training data (no need to expose proprietary details), size, characteristics.
Sensitive use cases, known limitations, potential for misuse, societal impacts.
Known limitations, recommended use patterns, warnings about edge cases.
Model cards, system cards, and datasheets are LIVING DOCUMENTS — they must be updated as models, systems, and datasets evolve. A model card written once at launch and never updated is a common audit finding.
5.2 — Red-Teaming Fundamentals
AI red-teaming is structured adversarial testing designed to find failures, vulnerabilities, and harmful behaviors in AI systems before deployment. It goes beyond standard testing by actively trying to make the system fail.
| Dimension | Description | Example Attacks |
|---|---|---|
| Safety | Can the system produce harmful content? | Generating instructions for dangerous activities, self-harm content |
| Security | Can the system be exploited? | Prompt injection, jailbreaking, data extraction, system prompt leakage |
| Fairness | Does it behave differently across groups? | Demographic bias in outputs, stereotyping, differential quality |
| Reliability | How does it handle edge cases? | Out-of-distribution inputs, adversarial perturbations, ambiguous queries |
| Factuality | Does it generate false information? | Hallucinations, confabulation, citation fabrication, date errors |
Red-team reports must document: the attack taxonomy used, specific prompts/inputs that caused failures, severity classification of each failure, reproducibility information, and recommended mitigations. Results should be shared with development teams before public disclosure (responsible disclosure).
5.3 — Bias Detection and Fairness Testing
Bias in AI systems can arise from training data (representation bias, measurement bias, historical bias), model architecture choices, labeling processes, and evaluation methodology. Auditors must understand each source.
| Metric | Definition | When to Use |
|---|---|---|
| Demographic Parity | Equal positive prediction rates across groups | When equal representation in outcomes is the primary goal |
| Equalized Odds | Equal true positive and false positive rates across groups | When accuracy across groups matters (e.g., medical diagnosis) |
| Predictive Parity | Equal precision (PPV) across groups | When confidence in positive predictions must be equal |
| Individual Fairness | Similar individuals receive similar outcomes | When individual-level treatment consistency matters |
| Calibration | Predicted probabilities match actual outcomes per group | When probability estimates are used for downstream decisions |
It is mathematically impossible to satisfy Demographic Parity, Equalized Odds, and Predictive Parity simultaneously (except in trivial cases). Choosing the right metric depends on the context, legal requirements, and stakeholder priorities. Document the rationale for your metric choice.
Identify relevant demographic groups (race, gender, age, disability, etc.) based on context and legal requirements.
Choose appropriate metrics for the context — consider legal, ethical, and stakeholder requirements.
Gather evaluation data with demographic labels for each group of interest.
Calculate selected fairness metrics separately for each demographic group.
Evaluate whether disparities exceed acceptable thresholds (e.g., 80% rule / four-fifths rule).
Trace disparities back to training data, features, model architecture, or labeling processes.
Record findings, rationale, and specific mitigation recommendations.
Intersectional analysis examines bias across combinations of protected attributes (e.g., race x gender x age) rather than single attributes alone. This can reveal disparities hidden in single-attribute analyses. Always test intersectionally.
5.4 — Safety Evaluation and Benchmarks
Safety evaluation requires both standardized benchmarks and custom testing tailored to the system's risk profile. Auditors should select benchmarks appropriate to the system's deployment context.
| Benchmark | What It Tests | Relevance |
|---|---|---|
| TruthfulQA | Factuality — does the model produce truthful answers? | Critical for any information-providing system |
| BBQ (Bias Benchmark for QA) | Social bias in question-answering | Essential for systems making decisions about people |
| RealToxicityPrompts | Toxicity in text generation | Key for any user-facing generative AI |
| BOLD | Bias in open-ended language generation | Important for creative and conversational AI |
| HELM | Holistic evaluation across multiple dimensions | Comprehensive assessment for foundation models |
| HarmBench | Harmful content generation across categories | Emerging standard for safety evaluation |
Evaluation must be ONGOING, not a one-time event. AI systems can degrade over time (data drift, model drift, concept drift). Post-deployment monitoring should include automated safety checks, user feedback analysis, and periodic re-evaluation against updated benchmarks.
Show Answer
Model Details, Intended Use, Factors, Metrics, Evaluation Data, Training Data, Ethical Considerations, and Caveats/Recommendations.
Show Answer
Model cards document a trained ML model in isolation. System cards document the complete AI system pipeline: model + prompts + guardrails + post-processing + deployment context.
Show Answer
Demographic Parity (equal positive prediction rates — use when equal representation is the goal), Equalized Odds (equal TPR and FPR — use when accuracy across groups matters), Predictive Parity (equal precision — use when confidence in positive predictions must be equal).
Show Answer
Direct injection (adversarial content in user input), indirect injection (adversarial content in retrieved documents/web pages), and system prompt extraction (attempting to reveal system instructions).
Show Answer
It is mathematically impossible to satisfy Demographic Parity, Equalized Odds, and Predictive Parity simultaneously except in trivial cases (perfect prediction or equal base rates). This is known as the 'impossibility theorem' of fairness. Context determines which metric to prioritize.
Show Answer
(1) Define protected attributes and groups, (2) Select appropriate fairness metrics, (3) Collect disaggregated evaluation data, (4) Compute metrics per group, (5) Compare against thresholds, (6) Investigate root causes of disparities, (7) Document findings and recommend mitigations.
Show Answer
Intersectional analysis examines bias across combinations of protected attributes (e.g., race x gender x age) rather than single attributes alone. It's important because disparities can be hidden in single-attribute analyses — a system may appear fair for each attribute individually but show significant bias for specific intersectional subgroups.
Show Answer
Indirect injection embeds adversarial instructions in external data sources (documents, web pages) that the RAG system retrieves. Because the system trusts retrieved content as factual context, it may follow malicious instructions embedded within. This is harder to detect than direct injection and can compromise system integrity without the user's knowledge.