MODULE 05 · ~2.5 hrs

Model Cards & Red-Teaming

Learn to create comprehensive model documentation (model cards, datasheets, system cards) and conduct structured adversarial testing (red-teaming) of AI systems. Covers evaluation methodologies, bias detection, and safety testing.

5.1 — Model Cards

Model Cards (Mitchell et al., 2019) are standardized documentation for trained ML models. They provide essential transparency about a model's capabilities, limitations, and appropriate use contexts.

Eight Sections of a Model Card

Model Details

Developer, version, model type, architecture, license, contact information.

Intended Use

Primary use cases, downstream applications, and explicitly out-of-scope uses.

Factors

Relevant demographic, environmental, and instrumentation factors that affect performance.

Metrics

Performance measures selected, thresholds for acceptable performance, variation across factors.

Evaluation Data

Datasets used for evaluation, preprocessing steps, motivation for selection.

Training Data

Overview of training data (no need to expose proprietary details), size, characteristics.

Ethical Considerations

Sensitive use cases, known limitations, potential for misuse, societal impacts.

Caveats & Recommendations

Known limitations, recommended use patterns, warnings about edge cases.

Documentation Types Comparison

Feature

Model Card

System Card

Datasheet

What it documents

A trained ML model

Complete AI system pipeline

A dataset

Introduced by

Mitchell et al. (2019)

Meta, OpenAI, Google

Gebru et al. (2021)

Scope

Single model in isolation

Model + prompts + guardrails + deployment

Training or evaluation data

Key focus

Performance, limitations, intended use

End-to-end behavior, safety measures

Composition, collection, bias, provenance

Updates

With each model version

With system changes

With dataset updates

★EXAM TIP

Model cards, system cards, and datasheets are LIVING DOCUMENTS — they must be updated as models, systems, and datasets evolve. A model card written once at launch and never updated is a common audit finding.

Key Points

Model Cards: standardized model documentation (Mitchell 2019)

Eight key sections covering model details through ethical considerations

System Cards: document the complete AI system pipeline

Datasheets for Datasets: document training/evaluation data provenance

Living documents — must be updated as models evolve

5.2 — Red-Teaming Fundamentals

AI red-teaming is structured adversarial testing designed to find failures, vulnerabilities, and harmful behaviors in AI systems before deployment. It goes beyond standard testing by actively trying to make the system fail.

Red-Team Process Flow

Scope

Define targets & boundaries

→

Plan

Select attack taxonomy

→

Execute

Run adversarial tests

→

Report

Document findings

→

Remediate

Fix & retest

Red-Team Attack Dimensions

Dimension	Description	Example Attacks
Safety	Can the system produce harmful content?	Generating instructions for dangerous activities, self-harm content
Security	Can the system be exploited?	Prompt injection, jailbreaking, data extraction, system prompt leakage
Fairness	Does it behave differently across groups?	Demographic bias in outputs, stereotyping, differential quality
Reliability	How does it handle edge cases?	Out-of-distribution inputs, adversarial perturbations, ambiguous queries
Factuality	Does it generate false information?	Hallucinations, confabulation, citation fabrication, date errors

Red-Teaming Methodologies

Method

Approach

Best For

Manual

Human testers craft adversarial inputs

Creative, novel attack discovery

Automated

AI-assisted generation of adversarial prompts

Scale and coverage

Structured

Following predefined attack taxonomies

Systematic, reproducible assessment

Domain-Expert

Domain specialists test for domain-specific risks

High-stakes or regulated domains

★EXAM TIP

Red-team reports must document: the attack taxonomy used, specific prompts/inputs that caused failures, severity classification of each failure, reproducibility information, and recommended mitigations. Results should be shared with development teams before public disclosure (responsible disclosure).

Key Points

Red-teaming: structured adversarial testing before deployment

Scope: safety, security, fairness, reliability, factuality

Manual + automated + structured + domain-expert approaches

Reports: attack taxonomy, failure severity, mitigations

Responsible disclosure practices

5.3 — Bias Detection and Fairness Testing

Bias in AI systems can arise from training data (representation bias, measurement bias, historical bias), model architecture choices, labeling processes, and evaluation methodology. Auditors must understand each source.

Fairness Metrics Reference

Metric	Definition	When to Use
Demographic Parity	Equal positive prediction rates across groups	When equal representation in outcomes is the primary goal
Equalized Odds	Equal true positive and false positive rates across groups	When accuracy across groups matters (e.g., medical diagnosis)
Predictive Parity	Equal precision (PPV) across groups	When confidence in positive predictions must be equal
Individual Fairness	Similar individuals receive similar outcomes	When individual-level treatment consistency matters
Calibration	Predicted probabilities match actual outcomes per group	When probability estimates are used for downstream decisions

⚠FAIRNESS METRICS CAN CONFLICT

It is mathematically impossible to satisfy Demographic Parity, Equalized Odds, and Predictive Parity simultaneously (except in trivial cases). Choosing the right metric depends on the context, legal requirements, and stakeholder priorities. Document the rationale for your metric choice.

7-Step Bias Testing Process

Define Protected Attributes

Identify relevant demographic groups (race, gender, age, disability, etc.) based on context and legal requirements.

Select Fairness Metrics

Choose appropriate metrics for the context — consider legal, ethical, and stakeholder requirements.

Collect Disaggregated Data

Gather evaluation data with demographic labels for each group of interest.

Compute Metrics Per Group

Calculate selected fairness metrics separately for each demographic group.

Compare Against Thresholds

Evaluate whether disparities exceed acceptable thresholds (e.g., 80% rule / four-fifths rule).

Investigate Root Causes

Trace disparities back to training data, features, model architecture, or labeling processes.

Document & Recommend

Record findings, rationale, and specific mitigation recommendations.

★EXAM TIP

Intersectional analysis examines bias across combinations of protected attributes (e.g., race x gender x age) rather than single attributes alone. This can reveal disparities hidden in single-attribute analyses. Always test intersectionally.

Key Points

Bias sources: data, architecture, labeling, evaluation

Key metrics: Demographic Parity, Equalized Odds, Predictive Parity

Fairness metrics can conflict — context determines choice

Intersectional analysis across multiple attributes

Seven-step bias testing process

5.4 — Safety Evaluation and Benchmarks

Safety evaluation requires both standardized benchmarks and custom testing tailored to the system's risk profile. Auditors should select benchmarks appropriate to the system's deployment context.

Key AI Safety Benchmarks

Benchmark	What It Tests	Relevance
TruthfulQA	Factuality — does the model produce truthful answers?	Critical for any information-providing system
BBQ (Bias Benchmark for QA)	Social bias in question-answering	Essential for systems making decisions about people
RealToxicityPrompts	Toxicity in text generation	Key for any user-facing generative AI
BOLD	Bias in open-ended language generation	Important for creative and conversational AI
HELM	Holistic evaluation across multiple dimensions	Comprehensive assessment for foundation models
HarmBench	Harmful content generation across categories	Emerging standard for safety evaluation

Prompt Injection Attack Types

Type

How It Works

Risk Level

Direct Injection

Adversarial content placed directly in user input

High — but easier to detect and filter

Indirect Injection

Adversarial content embedded in retrieved documents or web pages (via RAG)

Very High — harder to detect, exploits trust in data sources

System Prompt Extraction

Attempts to reveal the system's hidden instructions or configuration

Medium — reveals system design, enables further attacks

★EXAM TIP

Evaluation must be ONGOING, not a one-time event. AI systems can degrade over time (data drift, model drift, concept drift). Post-deployment monitoring should include automated safety checks, user feedback analysis, and periodic re-evaluation against updated benchmarks.

Key Points

Standard benchmarks: TruthfulQA, BBQ, RealToxicityPrompts, HELM

Adversarial robustness: perturbation, multilingual, role-play attacks

Prompt injection: direct, indirect, and system prompt extraction

Continuous evaluation — not just pre-deployment

Post-deployment monitoring for drift

// Practice Questions

Q1: What are the key sections of a model card?

Show Answer

Model Details, Intended Use, Factors, Metrics, Evaluation Data, Training Data, Ethical Considerations, and Caveats/Recommendations.

Q2: What is the difference between model cards and system cards?

Show Answer

Model cards document a trained ML model in isolation. System cards document the complete AI system pipeline: model + prompts + guardrails + post-processing + deployment context.

Q3: Name three fairness metrics and when you might choose each.

Show Answer

Demographic Parity (equal positive prediction rates — use when equal representation is the goal), Equalized Odds (equal TPR and FPR — use when accuracy across groups matters), Predictive Parity (equal precision — use when confidence in positive predictions must be equal).

Q4: What are the three types of prompt injection attacks?

Show Answer

Direct injection (adversarial content in user input), indirect injection (adversarial content in retrieved documents/web pages), and system prompt extraction (attempting to reveal system instructions).

Q5: Why can't you satisfy all fairness metrics simultaneously?

Show Answer

It is mathematically impossible to satisfy Demographic Parity, Equalized Odds, and Predictive Parity simultaneously except in trivial cases (perfect prediction or equal base rates). This is known as the 'impossibility theorem' of fairness. Context determines which metric to prioritize.

Q6: Describe the 7-step bias testing process.

Show Answer

(1) Define protected attributes and groups, (2) Select appropriate fairness metrics, (3) Collect disaggregated evaluation data, (4) Compute metrics per group, (5) Compare against thresholds, (6) Investigate root causes of disparities, (7) Document findings and recommend mitigations.

Q7: What is intersectional bias analysis and why is it important?

Show Answer

Intersectional analysis examines bias across combinations of protected attributes (e.g., race x gender x age) rather than single attributes alone. It's important because disparities can be hidden in single-attribute analyses — a system may appear fair for each attribute individually but show significant bias for specific intersectional subgroups.

Q8: Why must indirect prompt injection be tested in RAG-based systems?

Show Answer

Indirect injection embeds adversarial instructions in external data sources (documents, web pages) that the RAG system retrieves. Because the system trusts retrieved content as factual context, it may follow malicious instructions embedded within. This is harder to detect than direct injection and can compromise system integrity without the user's knowledge.

← 04. India DPDP Act + RBI AI/ML Guidelines 06. Audit Documentation & Governance →