Unit 4 of 4

5.4 — Safety Evaluation and Benchmarks

Safety evaluation requires both standardized benchmarks and custom testing tailored to the system's risk profile. Auditors should select benchmarks appropriate to the system's deployment context.

Key AI Safety Benchmarks

Benchmark	What It Tests	Relevance
TruthfulQA	Factuality — does the model produce truthful answers?	Critical for any information-providing system
BBQ (Bias Benchmark for QA)	Social bias in question-answering	Essential for systems making decisions about people
RealToxicityPrompts	Toxicity in text generation	Key for any user-facing generative AI
BOLD	Bias in open-ended language generation	Important for creative and conversational AI
HELM	Holistic evaluation across multiple dimensions	Comprehensive assessment for foundation models
HarmBench	Harmful content generation across categories	Emerging standard for safety evaluation

Prompt Injection Attack Types

Type

How It Works

Risk Level

Direct Injection

Adversarial content placed directly in user input

High — but easier to detect and filter

Indirect Injection

Adversarial content embedded in retrieved documents or web pages (via RAG)

Very High — harder to detect, exploits trust in data sources

System Prompt Extraction

Attempts to reveal the system's hidden instructions or configuration

Medium — reveals system design, enables further attacks

★EXAM TIP

Evaluation must be ONGOING, not a one-time event. AI systems can degrade over time (data drift, model drift, concept drift). Post-deployment monitoring should include automated safety checks, user feedback analysis, and periodic re-evaluation against updated benchmarks.

Key Points

Standard benchmarks: TruthfulQA, BBQ, RealToxicityPrompts, HELM

Adversarial robustness: perturbation, multilingual, role-play attacks

Prompt injection: direct, indirect, and system prompt extraction

Continuous evaluation — not just pre-deployment

Post-deployment monitoring for drift

← Previous unit Module overview →