Evaluate your agent at runtime
Verify flintai-cli is installed
Need setup?
Add your agent
Create or update your agent config file with connection details:
Important: The host field must match where your agent is actually running.
How do I edit my config file?
The config file is stored in ~/.flintai/config.json (where ~ means your home directory).
- macOS / Linux
- Windows PowerShell
These commands create the .flintai directory if needed, then open the config file in TextEdit:
Add your agent’s connection details and save (Cmd+S or File → Save).
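As a platform-neutral alternative, the directory-and-file step can be sketched in Python. This is a minimal sketch, not part of flintai-cli: the function name `ensure_config_file` and its `home` parameter are hypothetical, and it only creates an empty file at the documented path. You still need to add your agent’s connection details in an editor afterwards.

```python
from pathlib import Path
from typing import Optional


def ensure_config_file(home: Optional[Path] = None) -> Path:
    """Create <home>/.flintai/config.json if it is missing and return its path.

    `home` defaults to the current user's home directory. Passing another
    directory is only meant for trying the function out safely.
    """
    base = home if home is not None else Path.home()
    config_dir = base / ".flintai"
    # Create the .flintai directory only if it does not exist yet.
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    # touch() creates an empty file and leaves existing contents unchanged.
    config_path.touch(exist_ok=True)
    return config_path


if __name__ == "__main__":
    print(ensure_config_file())
```

Running the script prints the path of the config file, which you can then open in any text editor.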
What do these config fields mean?
- id - Unique ID for this model or agent. You’ll use it in commands like --model my-agent.
- type - Your agent’s framework (expand supported types below).
- name - The label that will identify this agent in results and logs.
- host - Base URL for the target endpoint, if this type connects over HTTP.
Important: Your agent must be running and accessible via HTTP before you can run evaluations.
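One way to check this requirement before starting an evaluation is a quick connectivity probe. The sketch below is an assumption-laden illustration rather than a flintai-cli feature: `host_is_reachable` is a hypothetical helper, and it only confirms that something accepts TCP connections at the host and port in the URL. It does not confirm that the agent behind it responds correctly.

```python
import socket
from urllib.parse import urlsplit


def host_is_reachable(host: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the host's address succeeds.

    `host` is expected to be an http(s) base URL, as in the config file.
    """
    parts = urlsplit(host)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        # Fall back to the scheme's default port when the URL has none.
        port = parts.port or (443 if parts.scheme == "https" else 80)
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError: refused, timed out, or unresolved; ValueError: bad port.
        return False
```

Call it with the same value you put in the host field, for example `host_is_reachable("http://localhost:8000")`, where the URL is a placeholder for wherever your agent runs.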
Supported agent types
- adk - Google ADK agents
- openai_agent - OpenAI Agents SDK
- langchain - LangChain agents
- crewai - CrewAI agents
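The field descriptions and supported types above can be turned into a small sanity check for one agent entry. This is a sketch under stated assumptions: the exact layout of config.json is not shown here, so the code validates a single entry as a plain dictionary. Treating id, type, and name as required and host as optional is an inference from the field descriptions, and the example values (including the host URL) are placeholders.

```python
# Types taken from the supported agent types list.
SUPPORTED_TYPES = {"adk", "openai_agent", "langchain", "crewai"}
# Assumption: these three fields are always required; host is optional.
REQUIRED_FIELDS = ("id", "type", "name")


def validate_agent_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            problems.append(f"missing field: {field}")
    agent_type = entry.get("type")
    if agent_type and agent_type not in SUPPORTED_TYPES:
        problems.append(f"unsupported type: {agent_type}")
    host = entry.get("host")
    if host is not None and not str(host).startswith(("http://", "https://")):
        problems.append("host should be an http(s) base URL")
    return problems


# Hypothetical entry for illustration only.
example = {
    "id": "my-agent",
    "type": "langchain",
    "name": "My Agent",
    "host": "http://localhost:8000",
}

if __name__ == "__main__":
    print(validate_agent_entry(example) or "entry looks valid")
```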
Attach evaluations
Browse built-in evaluations to see available tests, then attach them to your agent:
No evaluations run by default. You must attach at least one evaluation before running flintai eval run.
How do I attach multiple evaluations at once?
Use --eval-tag to batch-attach evaluations by tag:
This attaches all evaluations tagged with owasp_code=LLM01 (prompt injection tests) in a single command. See built-in evaluations for all available tests and tags.
Run evaluation
Execute all attached tests:
flintai eval sends test prompts to your agent, judges the responses using LLM-as-judge, and scores reliability on a 0.0-1.0 scale.
Evaluations can take several minutes depending on the number of tests. Progress updates appear in the CLI, and a summary displays when complete. Results are saved to eval_<timestamp>.json.
Ship with confidence
What the score means:
- 0.8+ - Production-ready
- 0.6-0.8 - Needs improvement
- <0.6 - Not ready for production
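The score bands above map onto a simple threshold check. The following is a minimal sketch with a hypothetical function name, `readiness`. It assumes that a score of exactly 0.8 counts as production-ready (from "0.8+") and that exactly 0.6 falls in the "needs improvement" band. How the score is stored inside the results file is not specified here, so the function takes the number directly.

```python
def readiness(score: float) -> str:
    """Map a reliability score on the 0.0-1.0 scale to its band label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.8:
        return "Production-ready"
    if score >= 0.6:
        return "Needs improvement"
    return "Not ready for production"


if __name__ == "__main__":
    for s in (0.92, 0.7, 0.41):
        print(s, readiness(s))
```

A check like this could gate a deployment step, for example by failing a pipeline whenever the label is anything other than "Production-ready".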
Interpret your results
Understand score breakdowns and track improvement over time
How evaluation works
Learn the LLM-as-judge methodology and scoring calculation
Scan agent code
Find agent code issues before deployment with Flint AI Scan