Test factual accuracy, instruction adherence, prompt injection, jailbreaks, and more. Tests are framework-agnostic and provide a 0.0-1.0 score proving agent reliability.
Install our MCP server in Claude Code or your AI code assistant, then ask: “Help me set up Flint AI Eval” to get live guidance, troubleshoot issues, and work through these steps together. Learn how →

Evaluate your agent at runtime

1. Verify flintai-cli is installed

flintai --version
If not installed:
pip install flintai-cli
flintai init
Full installation guide →
2. Start your agent and verify it's running

Check if your agent responds on the expected port:
curl http://localhost:8000/health
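If your agent takes a few seconds to start, a single curl can fail even though the agent is fine. The following is a minimal sketch of a retrying health check, assuming the agent exposes the /health path used above and answers with a 2xx status when ready. The function name wait_for_agent and its parameters are illustrative and not part of flintai-cli.

```python
import http.client
import time
import urllib.request


def wait_for_agent(url: str, attempts: int = 10, delay: float = 1.0,
                   timeout: float = 5.0) -> bool:
    """Return True once `url` answers with HTTP 2xx, False if it never does."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if 200 <= response.status < 300:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or an HTTP error status:
            # treat all of these as "not ready yet" and retry.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_for_agent("http://localhost:8000/health") before running evaluations gives the agent roughly ten seconds to come up with the default settings.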
3. Add your agent

Create or update your agent config file with connection details:
{
  "models": [
    {
      "id": "my-agent",
      "type": "adk",
      "name": "My Agent",
      "host": "http://localhost:8000"
    }
  ]
}
Important: The host field must match where your agent is actually running.
The config file is stored in ~/.flintai/config.json (where ~ means your home directory).
Folders starting with a dot are hidden from Finder and File Explorer. Use the commands below to create and open the file automatically.
On macOS, these commands create the .flintai directory and an empty config file if needed, then open the file in TextEdit:
mkdir -p ~/.flintai
touch ~/.flintai/config.json
open -e ~/.flintai/config.json
Add your agent’s connection details and save (Cmd+S or File → Save).
  • id - Unique ID for this model or agent. You’ll use it in commands like --model my-agent.
  • type - Your agent’s framework (see the supported types below).
  • name - The label that will identify this agent in results and logs.
  • host - Base URL for the target endpoint, if this type connects over HTTP.
Important: Your agent must be running and accessible via HTTP before you can run evaluations.
Supported types:
  • adk - Google ADK agents
  • openai_agent - OpenAI Agents SDK
  • langchain - LangChain agents
  • crewai - CrewAI agents
See Configuration for all types and options.
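If you would rather not edit the file by hand, or you are on a platform where open -e is unavailable, a short script can write the entry instead. This is a minimal sketch, assuming the config file is plain JSON with the models list shown above. The names upsert_agent, DEFAULT_CONFIG, and EXAMPLE_AGENT are illustrative and not part of flintai-cli.

```python
import json
from pathlib import Path

# Location stated in this guide: ~/.flintai/config.json
DEFAULT_CONFIG = Path.home() / ".flintai" / "config.json"

# Same values as the example config above.
EXAMPLE_AGENT = {
    "id": "my-agent",
    "type": "adk",
    "name": "My Agent",
    "host": "http://localhost:8000",
}


def upsert_agent(config_path: Path, agent: dict) -> dict:
    """Add `agent` to the "models" list, replacing any entry with the same id."""
    text = config_path.read_text() if config_path.exists() else ""
    # Keep any other top-level keys that already exist in the file.
    config = json.loads(text) if text.strip() else {}
    models = [m for m in config.get("models", []) if m.get("id") != agent["id"]]
    models.append(agent)
    config["models"] = models
    # Create the hidden .flintai directory if it does not exist yet.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Running upsert_agent(DEFAULT_CONFIG, EXAMPLE_AGENT) writes the example entry. Running it again with a different host updates the existing entry instead of adding a duplicate id.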
4. Attach evaluations

No evaluations run by default. You must attach at least one evaluation before running flintai eval run.
Browse built-in evaluations to see available tests, then attach them to your agent:
flintai eval model-evaluations attach \
  --model my-agent \
  --eval eval-llm09-fixed

flintai eval model-evaluations attach \
  --model my-agent \
  --eval eval-llm01-fixed
Use --eval-tag to batch-attach evaluations by tag:
flintai eval model-evaluations attach --model my-agent --eval-tag owasp_code=LLM01
This attaches all evaluations tagged with owasp_code=LLM01 (prompt injection tests) in a single command. See built-in evaluations for all available tests and tags.
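When you want to attach a specific list of evaluations that do not share a tag, you can script the per-evaluation command. The sketch below only builds and runs the same flintai eval model-evaluations attach invocation shown above, using the --model and --eval flags from this guide. The helper names attach_command and attach_all are illustrative, and dry_run defaults to True so nothing runs until you opt in.

```python
import shlex
import subprocess


def attach_command(model: str, eval_id: str) -> list:
    """Build the argument list for attaching one evaluation to one model."""
    return ["flintai", "eval", "model-evaluations", "attach",
            "--model", model, "--eval", eval_id]


def attach_all(model: str, eval_ids: list, dry_run: bool = True) -> list:
    """Attach each evaluation in turn; with dry_run, only print the commands."""
    commands = [attach_command(model, eval_id) for eval_id in eval_ids]
    for command in commands:
        if dry_run:
            print(shlex.join(command))
        else:
            # check=True stops at the first attach that exits non-zero.
            subprocess.run(command, check=True)
    return commands
```

For example, attach_all("my-agent", ["eval-llm09-fixed", "eval-llm01-fixed"]) prints the two commands from this step. Pass dry_run=False to execute them.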
5. Run evaluation

Execute all attached tests:
flintai eval run --model my-agent
flintai eval sends test prompts to your agent, judges the responses using LLM-as-judge, and scores reliability on a 0.0-1.0 scale.
Evaluations can take several minutes depending on the number of tests. Progress updates appear in the CLI, and a summary displays when complete. Results are saved to eval_<timestamp>.json.
Integrate with CI/CD. Save eval results as build artifacts to prove agent reliability before deployment. See CI/CD integration guide →
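Because each run writes a new eval_<timestamp>.json file, a CI job that uploads results as an artifact first has to find the newest one. The following is a minimal sketch, assuming the files land in the directory where you ran the command (this guide does not state the location) and picking the newest by modification time, since the timestamp format is not specified here. The function name latest_results is illustrative.

```python
from pathlib import Path
from typing import Optional


def latest_results(directory: Path = Path(".")) -> Optional[Path]:
    """Return the most recently modified eval_*.json in `directory`, if any."""
    candidates = list(directory.glob("eval_*.json"))
    if not candidates:
        return None
    # Compare by modification time rather than parsing the file name.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

A CI step could call latest_results() after flintai eval run and pass the returned path to its artifact upload command.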

Ship with confidence

What the score means:
  • 0.8+ - Production-ready
  • 0.6 to below 0.8 - Needs improvement
  • <0.6 - Not ready for production
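The bands above can be turned into a simple deployment gate. This sketch takes the score as a plain number because the layout of the results file is not described in this guide. Treating exactly 0.8 as production-ready follows the "0.8+" band. The names score_band and exit_code_for, and the idea of failing a build below a threshold, are illustrative assumptions rather than flintai-cli behavior.

```python
def score_band(score: float) -> str:
    """Map a 0.0-1.0 score to the readiness band listed in this guide."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.8:
        return "Production-ready"
    if score >= 0.6:
        return "Needs improvement"
    return "Not ready for production"


def exit_code_for(score: float, threshold: float = 0.8) -> int:
    """Return 0 when the score meets the threshold, 1 otherwise (CI-friendly)."""
    return 0 if score >= threshold else 1
```

In a pipeline, passing exit_code_for(score) to sys.exit would fail the build whenever the score falls below the threshold you choose.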
Next steps:

Interpret your results

Understand score breakdowns and track improvement over time

How evaluation works

Learn the LLM-as-judge methodology and scoring calculation

Scan agent code

Find agent code issues before deployment with Flint AI Scan