How to Build Your In-House Analytics Agent Fully with Open Source
A practical 7-step framework to build your in-house analytics agent fully with open source tooling, from context engineering to evaluation and rollout.

5 March 2026
By Claire Gouzé, Founder @ nao

The most successful analytics agents are increasingly built in-house.
You can see the pattern in public examples like OpenAI's in-house data agent story (Inside our in-house data agent) and Vercel's tooling simplification work (We removed 80% of our agents tools).
Building in-house with open source components gives teams better control and better adoption, but it still takes real work to rebuild the full loop: agent runtime, context builder, analytics UI with visualization, and evaluation workflow.
This guide is both a process guide and a technical guide for assembling those existing bricks into a reliable, trusted analytics agent that teams actually use to chat with data.
Step 1. Pick one painful analytics workflow
Start with a use case where demand is high, the question comes back often, and the ROI is easy to prove.
The best first workflows are usually things like:
- answering recurring questions on main company metrics,
- handling repeated data retrieval requests from business teams,
- checking weekly sales or pipeline performance,
- reviewing retention or activation metrics,
- explaining monthly finance variance.
These are good starting points because they are frequent, visible, and easy to validate: if the agent gets them right and saves real time, people come back quickly and adoption starts fast.
Step 2. Choose your analytics agent stack
At this stage, the goal is to choose a stack that stays fully open source while still being maintainable in production.
In practice, most teams evaluating analytics agents look at five main solution paths:
- Claude + MCP: plug your context as MCPs plus rules inside a Claude team org,
- LangChain: Python agent framework for building fully customized agents,
- LibreChat: open-source chat interface you can extend with MCPs for data connection,
- Vercel Knowledge Agent: open template you can adapt to analytics by adding SQL/dbt/retrieval and an evaluation loop,
- nao: vertical analytics-agent stack combining context builder, analytics chat UI, and evaluation in one workflow.
If you want to compare these approaches in more detail, the nao benchmark comparison pages are a useful place to start.
With nao, context building, analytics UI, and evaluation run in one product workflow. That is why it is the only vertical, all-in-one option in this list.
Step 3. Define clear context on a small scope
The goal here is to give the agent only the minimum high-quality context needed to be reliable.
Start with 5 to 10 tables and structure context in four blocks:
- Databases: schema metadata, sampled rows, profiling signals,
- Repos: your dbt project and analytics codebase,
- Rules context: business rules, joins logic, SQL guardrails, metric definitions,
- Custom context: lightweight business notes (data_quality.md, KPI references, figures).
In nao, this is configured in YAML by listing the context sources for the agent: databases, repos (dbt/semantic-layer assets), business tools (Notion/Jira), skills, and MCPs. See databases, repos, rules context, and custom context.
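As a sketch, such a configuration might look like the following. The keys below are illustrative assumptions, not nao's exact schema, so check the nao docs for the real format:

```yaml
# Illustrative context configuration; keys are assumptions, not nao's exact schema.
databases:
  - name: warehouse
    tables:               # keep the initial scope small: 5 to 10 tables
      - fct_orders
      - dim_customers
repos:
  - path: ./dbt_project   # dbt models and semantic-layer assets
rules:
  - RULES.md              # business rules, joins logic, SQL guardrails
custom:
  - context/data_quality.md
  - context/kpi_reference.md
```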
One practical rule matters a lot here: it is better to have an agent with 5 tables used by 100 people than an agent with 100 tables used by 5 people. With 5 tables for 100 people, quality is usually higher, trust grows faster, and teams naturally ask for broader coverage. With 100 tables for 5 people, reliability drops, trust breaks early, and rollout usually stalls before it reaches the wider company.
Step 4. Build your agent unit tests & eval framework
At this point, reliability has to become measurable before you roll the agent out more broadly.
Convert real business questions into unit tests that pair a natural-language prompt with ground-truth SQL.
Keep each test simple and explicit: one name, one natural-language prompt, one ground-truth SQL statement.
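As an illustration, a single test could be written like this in YAML; the table names and field names are hypothetical and may differ from nao's actual test format:

```yaml
# One unit test: natural-language prompt paired with ground-truth SQL.
# Table and field names are hypothetical illustrations.
- name: weekly_revenue
  prompt: "What was total revenue last week?"
  expected_sql: |
    SELECT SUM(amount) AS revenue
    FROM fct_orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
```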
Then define KPI targets and success criteria up front:
- pass rate target,
- metric correctness on high-priority questions,
- latency and cost per run,
- regression tolerance after context changes.
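These targets can be checked mechanically on every run. A minimal sketch, assuming a simple per-test result format and illustrative thresholds:

```python
# Check an evaluation run against KPI targets defined up front.
# The `results` structure and the threshold values are illustrative assumptions.

def check_targets(results, pass_rate_target=0.8, max_cost_usd=0.05, max_latency_s=30.0):
    """Return a dict of KPI -> bool for one evaluation run."""
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    avg_cost = sum(r["cost_usd"] for r in results) / len(results)
    avg_latency = sum(r["latency_s"] for r in results) / len(results)
    return {
        "pass_rate_ok": pass_rate >= pass_rate_target,
        "cost_ok": avg_cost <= max_cost_usd,
        "latency_ok": avg_latency <= max_latency_s,
    }

results = [
    {"passed": True, "cost_usd": 0.02, "latency_s": 12.0},
    {"passed": True, "cost_usd": 0.03, "latency_s": 18.0},
    {"passed": False, "cost_usd": 0.04, "latency_s": 25.0},
]
print(check_targets(results))  # 2/3 pass rate, below the 0.8 target
```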
If you want the exact mechanics, the nao evaluation docs and How to Evaluate an Analytics Agent: A Practical Guide with nao test go deeper on the setup.
In nao, the testing framework is already built in. Tests are written in YAML files, then nao test evaluates them automatically across key KPIs like reliability percentage, LLM cost, and answer time, while nao test server helps inspect failures and regressions. See evaluation docs.
Step 5. Run tests and measure quality
This step is where you establish a baseline trust score for the agent.
Run the full suite and track pass rate, accuracy by question type, latency, and cost. That baseline tells you how much you can trust current answers and how much context improvement is still required.
Our latest benchmark suggests ~80% is a reasonable reliability target, and much of the remaining gap is often interpretation edge cases, not only SQL generation. See How I improved my analytics agent.
This is also the right moment to compare your result against what good looks like. If you are still far below the ~80% reliability range, the benchmark is telling you the agent is not ready for broad trust yet.
In nao, the practical move is to run nao test on every context change and compare results over time in tests/outputs and nao test server.
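Whatever tooling you use, the comparison itself is simple: diff two runs test by test and flag anything that used to pass and now fails. A sketch, assuming a {test_name: passed} run format:

```python
# Compare two evaluation runs and list tests that regressed.
# The run format ({test_name: passed_bool}) is an illustrative assumption.

def find_regressions(baseline, current):
    """Tests that passed in the baseline run but fail in the current one."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )

baseline = {"weekly_revenue": True, "retention_rate": True, "pipeline_value": False}
current = {"weekly_revenue": True, "retention_rate": False, "pipeline_value": True}
print(find_regressions(baseline, current))  # ['retention_rate']
```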
Step 6. Iterate on context quality
The goal now is to improve quality by fixing context, not by blindly changing prompts.
For each test, inspect both failures and successes. When a test fails, identify the exact gap: definitions, joins, metric context, or rules. When it succeeds, make the context that helped easier to retrieve. If performance improves when RULES.md is removed, that file is adding noise, so clean or rewrite it. Also test where a semantic layer plus dbt should help, and document why it did not when it falls short.
The article What Context Has the Most Impact on Analytics Agent Performance? is useful if you want the full breakdown of which context pieces moved reliability the most.
In nao, this usually means updating context files (RULES.md, custom context, repo/database scope), rerunning nao test, and keeping only changes that improve benchmark outcomes.
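Tallying failures by the gap that caused them makes it obvious which context to fix first. A small sketch, where the failure-record format is an illustrative assumption:

```python
from collections import Counter

# Tally failed tests by the context gap that caused each failure.
# The failure-record format is an illustrative assumption.
failures = [
    {"test": "weekly_revenue", "gap": "metric_definition"},
    {"test": "churn_by_cohort", "gap": "joins"},
    {"test": "pipeline_value", "gap": "metric_definition"},
]
gap_counts = Counter(f["gap"] for f in failures)
print(gap_counts.most_common())  # metric definition gaps dominate in this sample
```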
Step 7. Roll out and monitor
The final step is turning benchmark reliability into sustained real-world adoption.
Roll out progressively, review real chats, track repeated failure patterns, and keep a steady evaluation cadence.
In nao, that means collecting in-product user feedback, reviewing monitoring signals, and feeding production failures back into your unit-test suite. Chat replay is also being added to accelerate production QA.
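Feeding production failures back into the suite can be as simple as converting a failed chat, plus a human-verified SQL answer, into a new test entry. The field names below are illustrative assumptions:

```python
# Turn a failed production chat into a new unit-test entry.
# Field names are illustrative assumptions, not nao's actual format.

def chat_to_test(chat, corrected_sql):
    """Build a test case from a failed chat and the SQL a human verified."""
    return {
        "name": f"prod_{chat['id']}",
        "prompt": chat["question"],
        "expected_sql": corrected_sql,
    }

failed_chat = {"id": "1042", "question": "What was churn in Q3?"}
test_case = chat_to_test(failed_chat, "SELECT ...")  # verified SQL goes here
print(test_case["name"])  # prod_1042
```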
Final takeaway
A reliable analytics agent is not one clever prompt. It is a system: focused scope, explicit context, repeatable tests, measurable quality, and continuous iteration across your data stack.
It is also not a one-time job. Context drifts as schemas, definitions, and business logic change, so evaluation has to stay active over time. A strong practice is to run these checks in your CI/CD pipeline so regressions are caught before users lose trust.
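For example, a CI job could rerun the suite on every pull request that touches context files. This GitHub Actions sketch is an assumption about your setup, not an official nao workflow:

```yaml
# .github/workflows/agent-tests.yml (illustrative, not an official nao workflow)
name: analytics-agent-tests
on:
  pull_request:
    paths:
      - "context/**"
      - "RULES.md"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run agent unit tests
        run: nao test   # fail the build if reliability regresses
```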
Related articles
product updates
We're launching the first Open Source Analytics Agent Builder
We're open sourcing nao — an analytics agent framework built on context engineering. Here's our vision for what comes after black-box BI.
Technical Guide
How to Set Up an AI Analytics Slack Bot with an Open Source Framework
A practical step-by-step guide to set up an AI analytics Slack bot with an open source framework so your team can chat with data directly in Slack.
Technical Guide
How to Set Up an AI Analytics Teams Bot with an Open Source Framework
A step-by-step guide to set up an AI analytics Microsoft Teams bot with an open source framework so your team can chat with data directly in Teams.

Claire, for the nao team