Eval your agent - Flint AI

Test factual accuracy, instruction adherence, prompt injection, jailbreaks, and more. Tests are framework-agnostic and produce a 0.0-1.0 score that quantifies agent reliability.

Install our MCP server in Claude Code or your AI code assistant, then ask: “Help me set up Flint AI Eval” to get live guidance, troubleshoot issues, and work through these steps together. Learn how →

Evaluate your agent at runtime

Verify flintai-cli is installed

flintai --version

Need setup?

If not installed:

pip install flintai-cli
flintai init

Full installation guide →
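In a setup script, the check above can be automated. The sketch below assumes only that the CLI is exposed on PATH as flintai, as in the commands above; require_cli is a hypothetical helper name, not part of Flint AI.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
# Usage: require_cli COMMAND
require_cli() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}
```

For example, `require_cli flintai || pip install flintai-cli` installs the CLI only when it is missing.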

Start your agent and verify it's running

Check if your agent responds on the expected port:

curl http://localhost:8000/health
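If the agent takes a few seconds to start, a single curl may fail even though nothing is wrong. A minimal polling sketch follows, assuming the agent exposes the /health endpoint shown above and that curl is installed; wait_for_agent is a hypothetical helper name.

```shell
# Hypothetical helper: poll a URL until it answers with a non-error HTTP status.
# Usage: wait_for_agent URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_agent() (
  url=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors (>= 400) as failures; -s: silent; body is discarded
    if curl -fs -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      echo "agent is up at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "agent did not respond at $url after $attempts attempts" >&2
  return 1
)
```

For example, `wait_for_agent http://localhost:8000/health 30 1` waits up to roughly 30 seconds before giving up with a non-zero exit status.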

Add your agent

Create or update your agent config file with connection details:

{
  "models": [
    {
      "id": "my-agent",
      "type": "adk",
      "name": "My Agent",
      "host": "http://localhost:8000"
    }
  ]
}

Important: The host field must match where your agent is actually running.
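For scripted setups, the starter config above can be written from the shell. This is a sketch that assumes the default location ~/.flintai/config.json and deliberately never overwrites an existing file; init_flintai_config is a hypothetical helper name, and the values are the example ones from this page.

```shell
# Hypothetical helper: write a starter config unless one already exists.
# Usage: init_flintai_config [CONFIG_DIR]   (defaults to ~/.flintai)
init_flintai_config() (
  dir=${1:-"$HOME/.flintai"}
  file="$dir/config.json"
  mkdir -p "$dir"
  if [ -e "$file" ]; then
    # Leave any existing configuration untouched.
    echo "kept existing $file"
    return 0
  fi
  cat > "$file" <<'EOF'
{
  "models": [
    {
      "id": "my-agent",
      "type": "adk",
      "name": "My Agent",
      "host": "http://localhost:8000"
    }
  ]
}
EOF
  echo "wrote $file"
)
```

Run `init_flintai_config` once, then edit the id, type, name, and host values to match your own agent.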

How do I edit my config file?

The config file is stored in ~/.flintai/config.json (where ~ means your home directory).

Folders starting with a dot are hidden from Finder and File Explorer. Use the commands below to create and open the file automatically.

macOS / Linux

These commands create the .flintai directory and an empty config file if needed, then open the file in TextEdit (macOS). On Linux, replace the last command with your editor of choice, for example nano ~/.flintai/config.json:

mkdir -p ~/.flintai
touch ~/.flintai/config.json
open -e ~/.flintai/config.json

Add your agent’s connection details and save (Cmd+S or File → Save).

Windows PowerShell

These commands create the .flintai directory if needed, then open the config file in Notepad:

New-Item -ItemType Directory -Force "$HOME\.flintai" | Out-Null
notepad "$HOME\.flintai\config.json"

Add your agent’s connection details and save (Ctrl+S or File → Save).

What do these config fields mean?

id - Unique ID for this model or agent. You’ll use it in commands like --model my-agent.
type - Your agent’s framework (see Supported agent types below).
name - The label that will identify this agent in results and logs.
host - Base URL for the target endpoint, if this type connects over HTTP.

Important: Your agent must be running and accessible via HTTP before you can run evaluations.

Supported agent types

adk - Google ADK agents
openai_agent - OpenAI Agents SDK
langchain - LangChain agents
crewai - CrewAI agents

See Configuration for all types and options.

Attach evaluations

No evaluations run by default. You must attach at least one evaluation before running flintai eval run.

Browse built-in evaluations to see available tests, then attach them to your agent:

flintai eval model-evaluations attach \
  --model my-agent \
  --eval eval-llm09-fixed

flintai eval model-evaluations attach \
  --model my-agent \
  --eval eval-llm01-fixed

How do I attach multiple evaluations at once?

Use --eval-tag to batch-attach evaluations by tag:

flintai eval model-evaluations attach --model my-agent --eval-tag owasp_code=LLM01

This attaches all evaluations tagged with owasp_code=LLM01 (prompt injection tests) in a single command. See built-in evaluations for all available tests and tags.
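When the evaluations you want do not share a tag, looping over the single-attach command is one alternative. The sketch below uses only the attach command and flags shown on this page; attach_evals and the DRY_RUN switch are hypothetical conveniences, not Flint AI features.

```shell
# Hypothetical helper: attach several evaluations to one model.
# Set DRY_RUN=1 to print the commands instead of running them.
# Usage: attach_evals MODEL_ID EVAL_ID...
attach_evals() (
  model=$1
  shift
  for eval_id in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "flintai eval model-evaluations attach --model $model --eval $eval_id"
    else
      # Stop at the first failed attach so errors are not silently skipped.
      flintai eval model-evaluations attach --model "$model" --eval "$eval_id" || return 1
    fi
  done
)
```

For example, `DRY_RUN=1 attach_evals my-agent eval-llm09-fixed eval-llm01-fixed` prints the two attach commands without contacting the agent.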

Run evaluation

Execute all attached tests:

flintai eval run --model my-agent

flintai eval sends test prompts to your agent, judges the responses using LLM-as-judge, and scores reliability on a 0.0-1.0 scale.

Evaluations can take several minutes depending on the number of tests. Progress updates appear in the CLI, and a summary displays when complete. Results are saved to eval_<timestamp>.json.
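Because each run writes a new eval_<timestamp>.json, scripts often need the most recent one. The sketch below picks it by modification time; it assumes the files land in the directory you pass in (the current directory by default), which this page does not specify. latest_eval_result is a hypothetical helper name.

```shell
# Hypothetical helper: print the most recently modified eval_*.json in a directory.
# Usage: latest_eval_result [DIR]   (defaults to the current directory)
latest_eval_result() (
  cd "${1:-.}" || return 1
  # ls -t sorts newest first; errors are silenced when nothing matches.
  newest=$(ls -t eval_*.json 2>/dev/null | head -n 1)
  if [ -z "$newest" ]; then
    echo "no eval_*.json files found" >&2
    return 1
  fi
  echo "$newest"
)
```

In a CI job, the printed file name can be passed to whatever step uploads build artifacts.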

Integrate with CI/CD. Save eval results as build artifacts to prove agent reliability before deployment. See CI/CD integration guide →

Ship with confidence

What the score means:

0.8+ - Production-ready
0.6 to below 0.8 - Needs improvement
<0.6 - Not ready for production
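The bands above can be turned into a simple deployment gate. The sketch below takes the score as an argument and treats exactly 0.8 as production-ready, which is an assumption about the boundary; classify_score is a hypothetical helper name. Extracting the score from the results file is left out because the result schema is not shown on this page.

```shell
# Hypothetical helper: map a 0.0-1.0 score to the bands listed above.
# Exit status is 0 only for the production-ready band.
# Usage: classify_score SCORE
classify_score() {
  awk -v s="$1" 'BEGIN {
    # s + 0 forces a numeric comparison
    if (s + 0 >= 0.8)      { print "Production-ready";         exit 0 }
    else if (s + 0 >= 0.6) { print "Needs improvement";        exit 1 }
    else                   { print "Not ready for production"; exit 1 }
  }'
}
```

For example, `classify_score 0.72` prints "Needs improvement" and returns a non-zero status, which is enough to fail a CI step.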

Next steps:

Interpret your results

Understand score breakdowns and track improvement over time

How evaluation works

Learn the LLM-as-judge methodology and scoring calculation

Scan agent code

Find agent code issues before deployment with Flint AI Scan
