Unit 4 of 4

5.4 — Safety Evaluation and Benchmarks

Safety evaluation requires both standardized benchmarks and custom testing tailored to the system's risk profile. Auditors should select benchmarks appropriate to the system's deployment context.

Key AI Safety Benchmarks
BenchmarkWhat It TestsRelevance
TruthfulQAFactuality — does the model produce truthful answers?Critical for any information-providing system
BBQ (Bias Benchmark for QA)Social bias in question-answeringEssential for systems making decisions about people
RealToxicityPromptsToxicity in text generationKey for any user-facing generative AI
BOLDBias in open-ended language generationImportant for creative and conversational AI
HELMHolistic evaluation across multiple dimensionsComprehensive assessment for foundation models
HarmBenchHarmful content generation across categoriesEmerging standard for safety evaluation
Prompt Injection Attack Types
Type
How It Works
Risk Level
Direct Injection
Adversarial content placed directly in user input
High — but easier to detect and filter
Indirect Injection
Adversarial content embedded in retrieved documents or web pages (via RAG)
Very High — harder to detect, exploits trust in data sources
System Prompt Extraction
Attempts to reveal the system's hidden instructions or configuration
Medium — reveals system design, enables further attacks
EXAM TIP

Evaluation must be ONGOING, not a one-time event. AI systems can degrade over time (data drift, model drift, concept drift). Post-deployment monitoring should include automated safety checks, user feedback analysis, and periodic re-evaluation against updated benchmarks.

Key Points
Standard benchmarks: TruthfulQA, BBQ, RealToxicityPrompts, HELM
Adversarial robustness: perturbation, multilingual, role-play attacks
Prompt injection: direct, indirect, and system prompt extraction
Continuous evaluation — not just pre-deployment
Post-deployment monitoring for drift
← Previous unitModule overview →