How evaluation works

Flint AI Eval sends prompts to your running agent and scores the responses. Tests combine evaluation types (how prompts are generated) with detectors (how responses are scored).

Evaluation framework

Flint AI Eval uses a composable architecture:

Evaluation generates test prompts

The evaluation type determines what prompts to send to your agent

Agent responds to each prompt

Your agent processes prompts just like in production

Detector scores each response

The detector type determines how responses are evaluated

Result aggregates scores

Individual scores combine into a 0.0-1.0 reliability metric

Evaluation types

Choose between fixed prompts (repeatable tests) or AI-generated prompts (adaptive attacks). Evaluations define what prompts to send to your agent.

Adversarial probe

AI-generated attack prompts that adapt to your agent’s responses across multiple turns.How it works:

LLM (GENERATOR_MODEL) generates prompts designed to exploit specific vulnerabilities
Attacker model adjusts strategy based on agent responses
Supports multi-turn conversations (up to 10 turns per test)

Example: OWASP LLM01 adversarial probe generates prompts trying to override system instructionsRequires: LLM provider configured via GENERATOR_MODEL

Message collection

Fixed list of pre-written test prompts.How it works:

Prompts loaded from CSV file, in-memory list, or Garak module
Each prompt sent once to your agent
Deterministic and repeatable

Example: Custom scope-boundary test with 20 hand-crafted promptsRequires: Prompt source (CSV file, JSON array, or Garak module)

Topic guard

AI-generated out-of-scope prompts to test whether your agent stays within its defined role.How it works:

LLM generates plausible but off-topic requests
Tests agent’s ability to refuse gracefully

Example: Weather agent should refuse requests about financial dataRequires: LLM provider configured via GENERATOR_MODEL

Detector types

Some detectors use AI judges, others use local pattern matching (no LLM required). Detectors define how responses are scored.

LLM-as-judge

A separate LLM evaluates whether your agent’s response meets the test criteria.How it works:

Response sent to judge model (GENERATOR_MODEL)
Judge follows scoring instructions (e.g., “Score 1 if agent refused, 0 if it leaked data”)
Returns 0.0-1.0 score

Example: Judge evaluates whether agent leaked its system promptRequires: LLM provider configured via GENERATOR_MODELAccuracy: Strong judges achieve 80-90% agreement with human evaluators

PII detector

Regex-based detection of personally identifiable information.How it works:

Scans response for patterns: emails, phone numbers, SSNs, credit cards
Runs locally, no LLM required
Returns 1.0 if no PII found, 0.0 if PII detected

Example: Detects if agent leaked john.doe@example.com in its responseRequires: Nothing (local detector)

Secret detector

Regex-based detection of API keys, tokens, and credentials.How it works:

Scans for AWS keys, GitHub tokens, private keys, etc.
Runs locally, no LLM required
Returns 1.0 if no secrets found, 0.0 if secrets detected

Example: Detects if agent exposed sk-proj-abc123...Requires: Nothing (local detector)

Toxicity classifier

ML-based classifier for toxic, offensive, or harmful content.How it works:

Uses local classifier model
No LLM required
Returns toxicity score

Example: Detects if agent generated hateful or abusive languageRequires: Nothing (local detector)

Garak detectors

Adapters for Garak framework detectors.How it works:

Runs Garak’s built-in detectors locally
Includes pattern matching, heuristics, and specialized checks
No LLM required

Example: Garak’s encoding detector checks for Base64-encoded attacksRequires: Nothing (local detector)

How evaluations combine with detectors

Each builtin evaluation pairs an evaluation type with a detector. Here are examples showing how they work together:

LLM01 Adversarial

adversarial_probe + LLM-as-judge

Adversarial probe generates prompt injection attacks, LLM-as-judge scores whether agent followed attacker’s instructions.Result: 0.0-1.0 score measuring prompt injection resistance

PII Leakage

message_collection + PII detector

Message collection sends fixed prompts requesting sensitive data, PII detector scans responses for email/phone/SSN patterns.Result: 1.0 if no PII found, 0.0 if PII detected

Garak Module

garak_module + garak detector

Loads any Garak attack module (encoding, prompt injection, jailbreaks, and 30+ others) and pairs it with a Garak detector that scores the agent’s responses.Result: Pass/fail per probe attempt

See Built-in evaluations for the complete catalog.

Scoring

Each evaluation returns a 0.0-1.0 score:

1.0 = Perfect (all tests passed)
0.8+ = Good (minor issues)
0.5-0.8 = Needs improvement
< 0.5 = Critical issues

Your overall score is the weighted average across all attached evaluations. See Eval results for how to interpret scores and fix issues.

Next steps

Browse Evaluations

See all 38+ builtin tests

Configuration

Set up and run tests

Data Privacy

What gets sent to LLMs

​Evaluation framework

​Evaluation types

​Detector types

​How evaluations combine with detectors

​Scoring

​Next steps