Evaluation & Testing
Build test suites to validate agent behavior, catch regressions, and enable safe continuous improvement.
Evals help you test and benchmark your AI analyst's reliability.
Why Evals Matter
A successful AI analyst is a reliable AI analyst. Evals establish benchmarks for agent reliability while enabling safe iteration on instructions, context, and LLM configurations.
Two Evaluation Approaches
Deterministic Tests (Create Data Rules)
Machine-checkable assertions that verify:
- Specific tables or columns are used
- Row counts meet expected thresholds
- Generated code is valid SQL
- Output matches expected patterns
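As a rough illustration, deterministic expectations like these boil down to plain assertions over the generated SQL and its result rows. The sketch below is not the product's actual API: `tables_used` and `check_expectations` are hypothetical helpers, and the table extraction is intentionally naive.

```python
import re

def tables_used(sql: str) -> set[str]:
    """Naively extract table names that follow FROM/JOIN keywords."""
    return {m.lower() for m in re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, re.IGNORECASE)}

def check_expectations(sql: str, rows: list[dict],
                       expected_tables: set[str], min_rows: int) -> list[str]:
    """Return a list of failed assertions; an empty list means the test passes."""
    failures = []
    used = tables_used(sql)
    if not expected_tables <= used:
        failures.append(f"missing tables: {expected_tables - used}")
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below threshold {min_rows}")
    return failures
```

A real implementation would use a proper SQL parser rather than a regex, but the shape is the same: each expectation is a machine-checkable predicate that either passes or reports a failure.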
Judge Tests (LLM Judge)
Natural-language evaluation using a lightweight LLM that assesses:
- Presentation quality
- User experience
- Reasoning quality
- Adherence to custom rubrics
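In spirit, a judge test wraps the rubric, the user's question, and the agent's answer into a grading prompt for a lightweight model, then parses a pass/fail verdict. A minimal sketch, with hypothetical helper names and a deliberately simple verdict format:

```python
def build_judge_prompt(rubric: str, user_prompt: str, agent_answer: str) -> str:
    """Compose the evaluation prompt sent to a lightweight judge model."""
    return (
        "You are grading an AI analyst's answer.\n"
        f"Rubric: {rubric}\n"
        f"User question: {user_prompt}\n"
        f"Agent answer: {agent_answer}\n"
        "Reply with PASS or FAIL and one sentence of justification."
    )

def parse_verdict(reply: str) -> bool:
    """Treat any judge reply that starts with PASS (case-insensitive) as a pass."""
    return reply.strip().upper().startswith("PASS")
```

Constraining the judge to a fixed reply format keeps the verdict machine-readable while the rubric itself stays in plain English.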
Creating Tests
When adding a test, specify:
- User prompt — The question to test (e.g., "revenue by film chart")
- Data sources — Which connections to use
- LLM — Which model to evaluate
- File attachments — Optional supporting files
- Expectations — Pass/fail criteria
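Conceptually, each test is just a record with those five fields. A minimal sketch as a Python dataclass — the field names, connection name, and model name below are placeholders, not the product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalTest:
    prompt: str                                  # the user question to test
    data_sources: list[str]                      # which connections to use
    llm: str                                     # which model to evaluate
    attachments: list[str] = field(default_factory=list)   # optional files
    expectations: list[dict] = field(default_factory=list) # pass/fail criteria

test = EvalTest(
    prompt="revenue by film chart",
    data_sources=["analytics_db"],
    llm="example-model",
    expectations=[{"type": "create_data", "tables": ["payment", "film"]}],
)
```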
Expectation Types
| Type | Description |
|---|---|
| Create Data | Conditions checking tables used, columns, row counts, or code validity |
| Clarify | Requires the agent to ask clarifying questions for ambiguous prompts |
| Judge | Evaluator model applies plain-English rubrics to assess quality |
Test Suites
Group tests into suites (e.g., "Finance", "Marketing") for organized batch execution. Results display logs, expectation outcomes, and generated artifacts including code and visualizations.
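The grouping and batch execution described above can be sketched as a small runner that buckets tests by suite and collects per-test outcomes. The dictionary shape and `run_test` callable are illustrative assumptions, not the product's interface:

```python
from collections import defaultdict

def run_suites(tests: list[dict], run_test) -> dict:
    """Group tests by suite name and collect (test name, passed?) pairs per suite."""
    results = defaultdict(list)
    for t in tests:
        results[t["suite"]].append((t["name"], run_test(t)))
    return dict(results)

tests = [
    {"suite": "Finance", "name": "revenue by film"},
    {"suite": "Marketing", "name": "campaign CTR"},
]
# Stubbed runner that marks every test as passing, for demonstration only.
report = run_suites(tests, run_test=lambda t: True)
```

A real runner would also capture logs and generated artifacts alongside each outcome, as the suite results view does.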
Best Practices
- Start with deterministic checks for foundational data integrity
- Layer judge rubrics for workflow and reasoning quality
- Use realistic prompts that match actual user behavior
- Run suites regularly after instruction or context changes