Unit 4 of 4
5.4 — Safety Evaluation and Benchmarks
Safety evaluation requires both standardized benchmarks and custom testing tailored to the system's risk profile. Auditors should select benchmarks appropriate to the system's deployment context.
Key AI Safety Benchmarks
| Benchmark | What It Tests | Relevance |
|---|---|---|
| TruthfulQA | Truthfulness: does the model avoid repeating common misconceptions and falsehoods? | Critical for any information-providing system |
| BBQ (Bias Benchmark for QA) | Social bias in question-answering | Essential for systems making decisions about people |
| RealToxicityPrompts | Toxicity in text generation | Key for any user-facing generative AI |
| BOLD | Bias in open-ended language generation | Important for creative and conversational AI |
| HELM | Holistic evaluation across multiple dimensions | Comprehensive assessment for foundation models |
| HarmBench | Harmful content generation across categories | Emerging standard for safety evaluation |
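
One way to make benchmark selection concrete is to tag each benchmark with the deployment concern it addresses and filter by the system's risk profile. The sketch below is illustrative only: the tag names and the `select_benchmarks` helper are assumptions made for this example, loosely summarizing the Relevance column above, and are not part of any benchmark's own tooling.

```python
# Hypothetical tags loosely summarizing the "Relevance" column of the table.
# The tag vocabulary is an assumption for illustration, not a standard taxonomy.
BENCHMARK_TAGS = {
    "TruthfulQA": {"informational"},
    "BBQ": {"decisions_about_people"},
    "RealToxicityPrompts": {"user_facing_generation"},
    "BOLD": {"creative", "conversational"},
    "HELM": {"foundation_model"},
    "HarmBench": {"harmful_content"},
}


def select_benchmarks(risk_profile):
    """Return the benchmarks whose tags overlap the risk profile, sorted by name."""
    profile = set(risk_profile)
    return sorted(
        name for name, tags in BENCHMARK_TAGS.items() if tags & profile
    )


if __name__ == "__main__":
    # A user-facing system that answers factual questions.
    print(select_benchmarks({"informational", "user_facing_generation"}))
    # -> ['RealToxicityPrompts', 'TruthfulQA']
```

A filter like this only produces a starting list. Custom testing tailored to the system's risk profile is still needed alongside the standardized benchmarks.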
Prompt Injection Attack Types
| Type | How It Works | Risk Level |
|---|---|---|
| Direct Injection | Adversarial content placed directly in user input | High, but easier to detect and filter |
| Indirect Injection | Adversarial content embedded in retrieved documents or web pages (via RAG) | Very High: harder to detect, exploits trust in data sources |
| System Prompt Extraction | Attempts to reveal the system's hidden instructions or configuration | Medium: reveals system design, enables further attacks |
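
The three attack types differ mainly in where the adversarial text enters the system, and a test harness can record that channel. The following is a minimal sketch, not a real defense: the regular expressions, the `CANARY` token, and the function names are all hypothetical, and pattern matching of this kind catches only the most obvious attacks.

```python
import re

# Illustrative patterns only; real attacks are far more varied than this list.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"(reveal|print|repeat|show) (your|the) (system prompt|hidden instructions)", re.IGNORECASE),
    re.compile(r"you are now (in )?developer mode", re.IGNORECASE),
]

# Hypothetical canary token planted in the system prompt to detect extraction.
CANARY = "CANARY-7f3a91"


def looks_injected(text):
    """Return True if the text matches any of the suspicious patterns."""
    return any(pattern.search(text) for pattern in SUSPICIOUS_PATTERNS)


def audit_request(user_input, retrieved_docs):
    """List findings by channel: 'direct' for user input, 'indirect' for retrieved content."""
    findings = []
    if looks_injected(user_input):
        findings.append(("direct", "user_input"))
    for index, doc in enumerate(retrieved_docs):
        if looks_injected(doc):
            findings.append(("indirect", f"retrieved_doc[{index}]"))
    return findings


def leaked_system_prompt(model_output):
    """Return True if the canary from the system prompt appears in the model output."""
    return CANARY in model_output
```

Scanning retrieved documents separately from user input reflects the point in the table: indirect injection arrives through data sources the system tends to trust, so a check applied only to user input would miss it.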
★EXAM TIP
Evaluation must be ONGOING, not a one-time event. AI systems can degrade over time (data drift, model drift, concept drift). Post-deployment monitoring should include automated safety checks, user feedback analysis, and periodic re-evaluation against updated benchmarks.
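
An automated safety check for periodic re-evaluation can be as simple as comparing current scores against a stored baseline and flagging drops. The sketch below assumes every metric is scored so that higher is better. The metric names, the scores, and the 0.02 tolerance are invented for illustration and are not results from any real system.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics whose current score fell more than `tolerance` below baseline.

    Assumes higher scores are better for every metric. A metric missing from
    `current` is also reported, so a skipped check does not pass silently.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        score = current.get(metric)
        if score is None or base_score - score > tolerance:
            regressions[metric] = (base_score, score)
    return regressions


if __name__ == "__main__":
    # Made-up scores: pre-deployment baseline vs. a later re-evaluation.
    baseline = {"truthfulness": 0.80, "non_toxicity": 0.95, "bias": 0.90}
    current = {"truthfulness": 0.81, "non_toxicity": 0.88}
    for metric, (before, after) in find_regressions(baseline, current).items():
        print(f"{metric}: baseline={before}, current={after}")
```

A check like this covers only the re-evaluation part of monitoring. User feedback analysis and input-drift detection would need their own signals.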
Key Points
Standard benchmarks: TruthfulQA, BBQ, RealToxicityPrompts, BOLD, HELM, HarmBench
Adversarial robustness: perturbation, multilingual, role-play attacks
Prompt injection: direct, indirect, and system prompt extraction
Continuous evaluation — not just pre-deployment
Post-deployment monitoring for drift