
How to Evaluate an Analytics Agent: A Practical Guide with nao test

A step-by-step guide to evaluating your analytics agent's reliability using nao's built-in unit test framework — from writing your first test to running the visual dashboard.


26 February 2026

By Claire, Co-founder & CEO

Most analytics agent failures happen quietly. The agent returns an answer. The answer looks plausible. No one checks whether the SQL was correct until a stakeholder catches a discrepancy in a board meeting.

The fix is not a better model. It is a proper evaluation framework: a set of unit tests that run against your agent, compare outputs to known-correct results, and tell you whether your context is working.

nao ships a built-in evaluation command — nao test — designed for exactly this. Here is how to set it up and use it.

The Core Idea: Unit Tests for SQL

The model for nao's evaluation is simple: write questions your users actually ask, pair each one with the SQL query that produces the correct answer, and let the framework run the agent against all of them.

```yaml
name: total_revenue
prompt: What is the total revenue from all orders?
sql: |
  SELECT SUM(amount) as total_revenue FROM orders
```

Each test has three fields:

  • name — a descriptive identifier
  • prompt — the natural language question
  • sql — the query that produces the ground truth

The framework sends the prompt to your running agent, captures the answer, executes your SQL to get the expected data, and compares the two. A test passes only when the data matches exactly — not when the SQL looks similar.

This approach catches the most dangerous failure mode in production analytics agents: SQL that is syntactically valid but semantically wrong.
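As an illustration of that failure mode, two queries can both run cleanly and still answer different questions. The `orders` schema here is hypothetical:

```sql
-- Looks like "total revenue", but silently includes cancelled and refunded orders
SELECT SUM(amount) FROM orders;

-- What most revenue definitions actually mean
SELECT SUM(amount) FROM orders WHERE status = 'completed';
```

Both are valid SQL. Only a comparison against ground-truth data tells them apart.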

Setting Up Your Test Suite

Create a tests/ folder in your nao project root:

```text
your-project/
├── nao_config.yaml
├── RULES.md
├── tests/
│   ├── total_revenue.yml
│   ├── weekly_signups.yml
│   └── outputs/        ← auto-generated results
```

Write your test files. Start with the 10 to 20 questions your team asks most often — funnel, retention, revenue, churn. Include questions with known correct answers so you have real ground truth to compare against.

```yaml
name: signups_weekly
prompt: How many signups did we have per week in the last 4 weeks?
sql: |
  select
    week,
    sum(n_new_users) as n_signups
  from prod_silver.fct_users_activity_weekly
  where week >= date_trunc(current_date - interval 4 week, isoweek)
    and week < date_trunc(current_date, isoweek)
  group by week
  order by week
```

Add edge cases too: conflicting metric names, ambiguous date windows, multi-table joins. Happy-path accuracy does not predict production reliability — edge case accuracy does.
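An edge-case test might pin an ambiguous date window down explicitly. A sketch in the same format — the table and column names below are illustrative, not from a real project:

```yaml
name: churn_last_complete_month
prompt: What was our churn rate last month?
sql: |
  select
    countif(churned) * 1.0 / count(*) as churn_rate
  from prod_silver.fct_customer_status_monthly
  -- "last month" pinned to the last *complete* calendar month
  where month = date_trunc(date_sub(current_date, interval 1 month), month)
```

If the agent interprets "last month" as a rolling 30-day window instead, the data will not match and the test fails — which is exactly the ambiguity you want surfaced.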

Running the Evaluation

Start your nao chat server, then run:

```bash
nao test
```

The command discovers all .yml files in your tests/ folder, runs each through your agent (defaulting to openai:gpt-4.1), and displays a summary table with pass/fail status, token usage, cost, execution time, and tool call count.

To run faster with parallel threads:

```bash
nao test -t 4
```

To test a specific model:

```bash
nao test -m openai:gpt-4.1
```

Results are saved automatically to tests/outputs/results_TIMESTAMP.json — full conversation history, tool calls, actual vs expected data, and a detailed diff on failures.
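Because the results are plain JSON, you can script over them. A minimal sketch that lists failing test names from a results file — the field names (`results`, `name`, `status`) are assumptions about the output schema, so verify them against a real file first:

```python
import json
from pathlib import Path

def failing_tests(results_path):
    """Return the names of failed tests in a nao results JSON file.

    Assumes a top-level "results" list whose entries carry "name" and
    "status" fields -- check these keys against your actual output.
    """
    data = json.loads(Path(results_path).read_text())
    return [r["name"] for r in data.get("results", []) if r.get("status") != "pass"]
```

Point it at the newest file in tests/outputs/ for a quick failure list after each run.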

How It Works Under the Hood

For each test, the framework:

  1. Sends the prompt to your agent and runs it normally — the agent may execute SQL, search context, use tools
  2. Captures the full conversation history and response
  3. Extracts the agent's final answer as structured data using a secondary LLM call
  4. Executes your expected SQL against the warehouse to get ground truth
  5. Compares the two datasets: exact match first, then approximate match with numeric tolerance for floating-point differences
  6. If both comparisons fail, generates a detailed diff showing exactly where values diverge

The comparison is strict by design. Row count mismatch = fail. Semantic equivalence is not enough — the data has to match.
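The exact-then-approximate idea from steps 5 and 6 can be sketched as follows. This is an illustration of the comparison logic, not nao's actual implementation:

```python
import math

def datasets_match(actual, expected, rel_tol=1e-6):
    """Compare two result sets (lists of row tuples):
    exact match first, then approximate match with numeric tolerance."""
    if actual == expected:            # exact match, including order
        return True
    if len(actual) != len(expected):  # row count mismatch = fail, by design
        return False
    for row_a, row_e in zip(actual, expected):
        if len(row_a) != len(row_e):
            return False
        for a, e in zip(row_a, row_e):
            if isinstance(a, float) or isinstance(e, float):
                # tolerate floating-point noise, not real value differences
                if not math.isclose(float(a), float(e), rel_tol=rel_tol):
                    return False
            elif a != e:
                return False
    return True
```

Under this sketch, `(1, 0.3000000001)` matches `(1, 0.3)`, but a missing row or a mismatched string fails immediately.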

Visualizing Results with nao test server

Once you have run nao test at least once, start the visual interface:

```bash
nao test server
```

This opens a dashboard at http://localhost:8765 with:

  • Summary cards — pass rate, total tests, token cost, total duration
  • Results table — all test runs with status, metrics, and inline details
  • Detail view — click any test to see the full response, tool calls, actual vs expected data, and a diff for failures

[Image: nao test server dashboard]

The visual interface makes it easy to spot patterns: which question types are failing, where tool call counts spike, which context changes caused regressions.

What Good Evaluation Looks Like in Practice

The first run of nao test on a fresh context setup will not score 100%. That is expected and useful. The number tells you where your context needs work.

From there, the workflow is:

  1. Run nao test — identify which questions fail
  2. Look at the detail view in nao test server — trace the failure to a specific context gap (wrong table selected, missing metric definition, bad join key)
  3. Fix the context: update RULES.md, add a column description, clarify a join rule
  4. Re-run nao test — confirm the fix did not introduce a regression

We run this loop every time we change context. The full test suite takes under a minute for 40 questions with -t 4. It is fast enough to run on every context commit.
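If you want the suite to run automatically on context changes, you can wire it into CI. A hedged GitHub Actions sketch — the install step, the secret name, and whether the chat server needs starting first all depend on your setup:

```yaml
name: context-eval
on:
  push:
    paths:
      - "RULES.md"
      - "tests/**"
      - "nao_config.yaml"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumption: adjust to however you actually install and start nao,
      # including launching the chat server if your setup requires it.
      - run: pip install nao
      - run: nao test -t 4
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```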

For reference: our best-performing context configuration — schema + data sampling + RULES.md, restricted to the silver layer — reached 45% reliability on our 40-question test suite on the first run. Context iteration from there is what moves that number. See the context impact study for the full breakdown of which context pieces actually move the needle.

The Self-Learning Loop: From Feedback to Better Context

Unit tests tell you whether your agent is correct on known questions. But once you roll out to users, you need a second signal: what are real users asking, and when does the agent get it wrong?

nao closes this loop with a built-in feedback and monitoring system.

In the chat interface, users can rate any answer with a thumbs up or thumbs down. They can optionally add a reason. This creates a lightweight feedback signal directly tied to real production queries — not your test suite.

In the admin monitoring dashboard, you see all feedback aggregated:

  • Which answers received negative ratings
  • What the user asked and what the agent responded
  • Who flagged it and when

This is the memory component. Every negative feedback is a signal that something in your context is missing or wrong — a metric definition, a join rule, an undocumented edge case.

The improvement cycle:

  1. Run nao test → fix context gaps on your known question set
  2. Roll out to users → collect real feedback from the chat
  3. Review negative feedback in the monitoring dashboard → identify new gaps
  4. Update RULES.md or schema context to address them
  5. Add the failing question to your tests/ folder as a new unit test
  6. Re-run nao test → confirm the fix did not regress anything

Over time, your test suite grows to reflect the actual questions your users ask. Your context gets progressively more accurate. The agent self-improves — not through model fine-tuning, but through structured context iteration driven by real usage.

This is why evaluation and monitoring are not separate concerns. They are the same loop, running at two speeds: automated unit tests for fast iteration, user feedback for continuous production improvement.

Best Practices

Start with critical queries. Test the questions that matter most to your business users — the ones that would cause real damage if wrong. Add edge cases after you have covered the high-value baseline.

Keep tests focused. Each test should verify one behavior. A test that asks about MRR by segment by channel by month is hard to debug when it fails. Break it into smaller tests.

Avoid leakage. Do not include exact answers or overly specific details in your RULES.md that allow the agent to pattern-match rather than reason. The test measures whether your context enables correct reasoning — not whether the agent memorized your test cases.

Commit your tests folder to git. Your test suite is a versioned record of what "correct" means for your data model. It should live next to your context files, not in a separate spreadsheet.

Run tests after every context change. Agent reliability can regress when you update RULES.md, add new tables, or change metric definitions. Automated evaluation catches this before users do.

Full documentation for nao test and nao test server is at docs.getnao.io/nao-agent/context-engineering/evaluation.
