Skip to main content
Flint AI Eval sends prompts to your running agent and scores the responses. Tests combine evaluation types (how prompts are generated) with detectors (how responses are scored).

Evaluation framework

Flint AI Eval uses a composable architecture:
1

Evaluation generates test prompts

The evaluation type determines what prompts to send to your agent
2

Agent responds to each prompt

Your agent processes prompts just like in production
3

Detector scores each response

The detector type determines how responses are evaluated
4

Result aggregates scores

Individual scores combine into a 0.0-1.0 reliability metric

Evaluation types

Choose between fixed prompts (repeatable tests) or AI-generated prompts (adaptive attacks). Evaluations define what prompts to send to your agent.
AI-generated attack prompts that adapt to your agent’s responses across multiple turns.How it works:
  • LLM (GENERATOR_MODEL) generates prompts designed to exploit specific vulnerabilities
  • Attacker model adjusts strategy based on agent responses
  • Supports multi-turn conversations (up to 10 turns per test)
Example: OWASP LLM01 adversarial probe generates prompts trying to override system instructionsRequires: LLM provider configured via GENERATOR_MODEL
Fixed list of pre-written test prompts.How it works:
  • Prompts loaded from CSV file, in-memory list, or Garak module
  • Each prompt sent once to your agent
  • Deterministic and repeatable
Example: Custom scope-boundary test with 20 hand-crafted promptsRequires: Prompt source (CSV file, JSON array, or Garak module)
AI-generated out-of-scope prompts to test whether your agent stays within its defined role.How it works:
  • LLM generates plausible but off-topic requests
  • Tests agent’s ability to refuse gracefully
Example: Weather agent should refuse requests about financial dataRequires: LLM provider configured via GENERATOR_MODEL

Detector types

Some detectors use AI judges, others use local pattern matching (no LLM required). Detectors define how responses are scored.
A separate LLM evaluates whether your agent’s response meets the test criteria.How it works:
  • Response sent to judge model (GENERATOR_MODEL)
  • Judge follows scoring instructions (e.g., “Score 1 if agent refused, 0 if it leaked data”)
  • Returns 0.0-1.0 score
Example: Judge evaluates whether agent leaked its system promptRequires: LLM provider configured via GENERATOR_MODELAccuracy: Strong judges achieve 80-90% agreement with human evaluators
Regex-based detection of personally identifiable information.How it works:
  • Scans response for patterns: emails, phone numbers, SSNs, credit cards
  • Runs locally, no LLM required
  • Returns 1.0 if no PII found, 0.0 if PII detected
Example: Detects if agent leaked john.doe@example.com in its responseRequires: Nothing (local detector)
Regex-based detection of API keys, tokens, and credentials.How it works:
  • Scans for AWS keys, GitHub tokens, private keys, etc.
  • Runs locally, no LLM required
  • Returns 1.0 if no secrets found, 0.0 if secrets detected
Example: Detects if agent exposed sk-proj-abc123...Requires: Nothing (local detector)
ML-based classifier for toxic, offensive, or harmful content.How it works:
  • Uses local classifier model
  • No LLM required
  • Returns toxicity score
Example: Detects if agent generated hateful or abusive languageRequires: Nothing (local detector)
Adapters for Garak framework detectors.How it works:
  • Runs Garak’s built-in detectors locally
  • Includes pattern matching, heuristics, and specialized checks
  • No LLM required
Example: Garak’s encoding detector checks for Base64-encoded attacksRequires: Nothing (local detector)

How evaluations combine with detectors

Each builtin evaluation pairs an evaluation type with a detector. Here are examples showing how they work together:
LLM01 Adversarial
adversarial_probe + LLM-as-judge
Adversarial probe generates prompt injection attacks, LLM-as-judge scores whether agent followed attacker’s instructions.Result: 0.0-1.0 score measuring prompt injection resistance
PII Leakage
message_collection + PII detector
Message collection sends fixed prompts requesting sensitive data, PII detector scans responses for email/phone/SSN patterns.Result: 1.0 if no PII found, 0.0 if PII detected
Garak Module
garak_module + garak detector
Loads any Garak attack module (encoding, prompt injection, jailbreaks, and 30+ others) and pairs it with a Garak detector that scores the agent’s responses.Result: Pass/fail per probe attempt
See Built-in evaluations for the complete catalog.

Scoring

Each evaluation returns a 0.0-1.0 score:
  • 1.0 = Perfect (all tests passed)
  • 0.8+ = Good (minor issues)
  • 0.5-0.8 = Needs improvement
  • < 0.5 = Critical issues
Your overall score is the weighted average across all attached evaluations. See Eval results for how to interpret scores and fix issues.

Next steps

Browse Evaluations

See all 38+ builtin tests

Configuration

Set up and run tests

Data Privacy

What gets sent to LLMs