← All Modules
MODULE 05 · ~2.5 hrs

Model Cards & Red-Teaming

Learn to create comprehensive model documentation (model cards, datasheets, system cards) and conduct structured adversarial testing (red-teaming) of AI systems. Covers evaluation methodologies, bias detection, and safety testing.

5.1 — Model Cards

Model Cards (Mitchell et al., 2019) are standardized documentation for trained ML models. They provide essential transparency about a model's capabilities, limitations, and appropriate use contexts.

Eight Sections of a Model Card
01
Model Details

Developer, version, model type, architecture, license, contact information.

02
Intended Use

Primary use cases, downstream applications, and explicitly out-of-scope uses.

03
Factors

Relevant demographic, environmental, and instrumentation factors that affect performance.

04
Metrics

Performance measures selected, thresholds for acceptable performance, variation across factors.

05
Evaluation Data

Datasets used for evaluation, preprocessing steps, motivation for selection.

06
Training Data

Overview of training data (no need to expose proprietary details), size, characteristics.

07
Ethical Considerations

Sensitive use cases, known limitations, potential for misuse, societal impacts.

08
Caveats & Recommendations

Known limitations, recommended use patterns, warnings about edge cases.

Documentation Types Comparison
Feature
Model Card
System Card
Datasheet
What it documents
A trained ML model
Complete AI system pipeline
A dataset
Introduced by
Mitchell et al. (2019)
Meta, OpenAI, Google
Gebru et al. (2021)
Scope
Single model in isolation
Model + prompts + guardrails + deployment
Training or evaluation data
Key focus
Performance, limitations, intended use
End-to-end behavior, safety measures
Composition, collection, bias, provenance
Updates
With each model version
With system changes
With dataset updates
EXAM TIP

Model cards, system cards, and datasheets are LIVING DOCUMENTS — they must be updated as models, systems, and datasets evolve. A model card written once at launch and never updated is a common audit finding.

Key Points
Model Cards: standardized model documentation (Mitchell 2019)
Eight key sections covering model details through ethical considerations
System Cards: document the complete AI system pipeline
Datasheets for Datasets: document training/evaluation data provenance
Living documents — must be updated as models evolve

5.2 — Red-Teaming Fundamentals

AI red-teaming is structured adversarial testing designed to find failures, vulnerabilities, and harmful behaviors in AI systems before deployment. It goes beyond standard testing by actively trying to make the system fail.

Red-Team Process Flow
Scope
Define targets & boundaries
Plan
Select attack taxonomy
Execute
Run adversarial tests
Report
Document findings
Remediate
Fix & retest
Red-Team Attack Dimensions
DimensionDescriptionExample Attacks
SafetyCan the system produce harmful content?Generating instructions for dangerous activities, self-harm content
SecurityCan the system be exploited?Prompt injection, jailbreaking, data extraction, system prompt leakage
FairnessDoes it behave differently across groups?Demographic bias in outputs, stereotyping, differential quality
ReliabilityHow does it handle edge cases?Out-of-distribution inputs, adversarial perturbations, ambiguous queries
FactualityDoes it generate false information?Hallucinations, confabulation, citation fabrication, date errors
Red-Teaming Methodologies
Method
Approach
Best For
Manual
Human testers craft adversarial inputs
Creative, novel attack discovery
Automated
AI-assisted generation of adversarial prompts
Scale and coverage
Structured
Following predefined attack taxonomies
Systematic, reproducible assessment
Domain-Expert
Domain specialists test for domain-specific risks
High-stakes or regulated domains
EXAM TIP

Red-team reports must document: the attack taxonomy used, specific prompts/inputs that caused failures, severity classification of each failure, reproducibility information, and recommended mitigations. Results should be shared with development teams before public disclosure (responsible disclosure).

Key Points
Red-teaming: structured adversarial testing before deployment
Scope: safety, security, fairness, reliability, factuality
Manual + automated + structured + domain-expert approaches
Reports: attack taxonomy, failure severity, mitigations
Responsible disclosure practices

5.3 — Bias Detection and Fairness Testing

Bias in AI systems can arise from training data (representation bias, measurement bias, historical bias), model architecture choices, labeling processes, and evaluation methodology. Auditors must understand each source.

Fairness Metrics Reference
MetricDefinitionWhen to Use
Demographic ParityEqual positive prediction rates across groupsWhen equal representation in outcomes is the primary goal
Equalized OddsEqual true positive and false positive rates across groupsWhen accuracy across groups matters (e.g., medical diagnosis)
Predictive ParityEqual precision (PPV) across groupsWhen confidence in positive predictions must be equal
Individual FairnessSimilar individuals receive similar outcomesWhen individual-level treatment consistency matters
CalibrationPredicted probabilities match actual outcomes per groupWhen probability estimates are used for downstream decisions
FAIRNESS METRICS CAN CONFLICT

It is mathematically impossible to satisfy Demographic Parity, Equalized Odds, and Predictive Parity simultaneously (except in trivial cases). Choosing the right metric depends on the context, legal requirements, and stakeholder priorities. Document the rationale for your metric choice.

7-Step Bias Testing Process
01
Define Protected Attributes

Identify relevant demographic groups (race, gender, age, disability, etc.) based on context and legal requirements.

02
Select Fairness Metrics

Choose appropriate metrics for the context — consider legal, ethical, and stakeholder requirements.

03
Collect Disaggregated Data

Gather evaluation data with demographic labels for each group of interest.

04
Compute Metrics Per Group

Calculate selected fairness metrics separately for each demographic group.

05
Compare Against Thresholds

Evaluate whether disparities exceed acceptable thresholds (e.g., 80% rule / four-fifths rule).

06
Investigate Root Causes

Trace disparities back to training data, features, model architecture, or labeling processes.

07
Document & Recommend

Record findings, rationale, and specific mitigation recommendations.

EXAM TIP

Intersectional analysis examines bias across combinations of protected attributes (e.g., race x gender x age) rather than single attributes alone. This can reveal disparities hidden in single-attribute analyses. Always test intersectionally.

Key Points
Bias sources: data, architecture, labeling, evaluation
Key metrics: Demographic Parity, Equalized Odds, Predictive Parity
Fairness metrics can conflict — context determines choice
Intersectional analysis across multiple attributes
Seven-step bias testing process

5.4 — Safety Evaluation and Benchmarks

Safety evaluation requires both standardized benchmarks and custom testing tailored to the system's risk profile. Auditors should select benchmarks appropriate to the system's deployment context.

Key AI Safety Benchmarks
BenchmarkWhat It TestsRelevance
TruthfulQAFactuality — does the model produce truthful answers?Critical for any information-providing system
BBQ (Bias Benchmark for QA)Social bias in question-answeringEssential for systems making decisions about people
RealToxicityPromptsToxicity in text generationKey for any user-facing generative AI
BOLDBias in open-ended language generationImportant for creative and conversational AI
HELMHolistic evaluation across multiple dimensionsComprehensive assessment for foundation models
HarmBenchHarmful content generation across categoriesEmerging standard for safety evaluation
Prompt Injection Attack Types
Type
How It Works
Risk Level
Direct Injection
Adversarial content placed directly in user input
High — but easier to detect and filter
Indirect Injection
Adversarial content embedded in retrieved documents or web pages (via RAG)
Very High — harder to detect, exploits trust in data sources
System Prompt Extraction
Attempts to reveal the system's hidden instructions or configuration
Medium — reveals system design, enables further attacks
EXAM TIP

Evaluation must be ONGOING, not a one-time event. AI systems can degrade over time (data drift, model drift, concept drift). Post-deployment monitoring should include automated safety checks, user feedback analysis, and periodic re-evaluation against updated benchmarks.

Key Points
Standard benchmarks: TruthfulQA, BBQ, RealToxicityPrompts, HELM
Adversarial robustness: perturbation, multilingual, role-play attacks
Prompt injection: direct, indirect, and system prompt extraction
Continuous evaluation — not just pre-deployment
Post-deployment monitoring for drift
// Practice Questions
Q1: What are the key sections of a model card?
Show Answer

Model Details, Intended Use, Factors, Metrics, Evaluation Data, Training Data, Ethical Considerations, and Caveats/Recommendations.

Q2: What is the difference between model cards and system cards?
Show Answer

Model cards document a trained ML model in isolation. System cards document the complete AI system pipeline: model + prompts + guardrails + post-processing + deployment context.

Q3: Name three fairness metrics and when you might choose each.
Show Answer

Demographic Parity (equal positive prediction rates — use when equal representation is the goal), Equalized Odds (equal TPR and FPR — use when accuracy across groups matters), Predictive Parity (equal precision — use when confidence in positive predictions must be equal).

Q4: What are the three types of prompt injection attacks?
Show Answer

Direct injection (adversarial content in user input), indirect injection (adversarial content in retrieved documents/web pages), and system prompt extraction (attempting to reveal system instructions).

Q5: Why can't you satisfy all fairness metrics simultaneously?
Show Answer

It is mathematically impossible to satisfy Demographic Parity, Equalized Odds, and Predictive Parity simultaneously except in trivial cases (perfect prediction or equal base rates). This is known as the 'impossibility theorem' of fairness. Context determines which metric to prioritize.

Q6: Describe the 7-step bias testing process.
Show Answer

(1) Define protected attributes and groups, (2) Select appropriate fairness metrics, (3) Collect disaggregated evaluation data, (4) Compute metrics per group, (5) Compare against thresholds, (6) Investigate root causes of disparities, (7) Document findings and recommend mitigations.

Q7: What is intersectional bias analysis and why is it important?
Show Answer

Intersectional analysis examines bias across combinations of protected attributes (e.g., race x gender x age) rather than single attributes alone. It's important because disparities can be hidden in single-attribute analyses — a system may appear fair for each attribute individually but show significant bias for specific intersectional subgroups.

Q8: Why must indirect prompt injection be tested in RAG-based systems?
Show Answer

Indirect injection embeds adversarial instructions in external data sources (documents, web pages) that the RAG system retrieves. Because the system trusts retrieved content as factual context, it may follow malicious instructions embedded within. This is harder to detect than direct injection and can compromise system integrity without the user's knowledge.

04. India DPDP Act + RBI AI/ML Guidelines06. Audit Documentation & Governance