
ContextAtlas v1.0 — what I built, why, and what the numbers say

May 18, 2026

When the "Build with Opus 4.7" hackathon was announced, I didn't have a strong idea of what I wanted to build. The private projects I'd been working on were either things I wasn't willing to publicize or things I didn't think mattered to anyone but me. But one problem I was starting to obsess over was the tokenomics of agents: how to make your tokens go further. I'd tried a number of open-source repos that spun up agents, and I kept finding that little care went into cost effectiveness, model selection, or getting the most useful work out of a token budget. I shared my interest in token optimization in the application even though I didn't yet have a concrete idea, because I think the direction the tech world is heading (co-workers, teams, and companies competing to see who can use the most tokens) is unsustainable. The competition should be over who can produce the best work with the fewest tokens.

So I started thinking about how to use Claude Code more efficiently and more cheaply while solving the problems I kept hitting: token burn, scope drift, and quality. That became ContextAtlas. I spent a few days hashing out the idea and getting the scaffolding and scope ready in case I was selected. When the results came through that Tuesday morning and I hadn't been selected, I still believed the project was worth pursuing. I'm happy to announce that as of this post, v1.0 is out, and I'd like to share what I've built.

The problem

Claude Code currently learns your codebase by brute force. Every session starts fresh. Every "where is OrderProcessor?" triggers a flurry of greps. Every "what depends on AuthMiddleware?" is another round of file reads. On a mid-sized codebase, answering one architectural question can consume 40+ tool calls and 100,000+ tokens before Claude has enough context to reason well.

Worse: the architectural decisions that govern your code — the ADRs, the design docs, the "we did it this way because" reasoning — are completely invisible to it. You wrote in docs/adr/ADR-07.md that OrderProcessor must be idempotent. Claude doesn't know that. It'll happily propose a change that breaks the constraint, and either you catch it in review or you don't.

And none of yesterday's understanding carries to today. Every conversation starts from zero. Your ADRs, your commit history, your test coverage — none of it is on the agent's table.

What ContextAtlas does

ContextAtlas is an MCP server that runs underneath Claude Code and gives it a curated atlas of your codebase before it starts grepping around for context. The thesis is simple: if Claude already knows where your symbols live, what your architectural rules say, what's been touched recently, and which tests cover what, it doesn't have to burn 40 tool calls reconstructing all of that every session.

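ContextAtlas closes the gap by fusing four signals into one pre-computed index: where your symbols live, what your architectural rules say, what's been touched recently, and which tests cover what.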

When Claude calls get_symbol_context("OrderProcessor"), it gets all four in one response. Definition, references, governing ADR constraints, recent commits, related tests. One call. Dense format optimized for LLM consumption. The thing it would have otherwise spent 40 tool calls reconstructing.

SYM OrderProcessor@src/orders/processor.ts:42 class
  SIG class OrderProcessor extends BaseProcessor<Order>
  INTENT ADR-07 hard "must be idempotent"
    RATIONALE "All order processing must be safely retryable."
  REFS 23 [billing:14 admin:9]
  GIT hot last=2026-03-14
  TESTS src/orders/processor.test.ts (+11)

Now when Claude is asked to modify OrderProcessor, it sees the idempotency constraint before it proposes changes — not after a review catches the violation.
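To make the shape of that response concrete, here is a minimal sketch of how a bundle like the one above could be modeled and rendered into the dense text format. The interface, its field names, the `renderSymbolContext` function, and the "cold" label are all hypothetical, inferred from the example output rather than taken from the ContextAtlas source.

```typescript
// Hypothetical data model for one symbol-context bundle.
// Field names are illustrative assumptions, not the real ContextAtlas schema.
interface IntentClaim {
  adr: string; // e.g. "ADR-07"
  severity: "hard" | "soft";
  claim: string;
  rationale?: string;
}

interface SymbolContext {
  name: string;
  file: string;
  line: number;
  kind: string;
  signature: string;
  intents: IntentClaim[];
  refsByModule: Record<string, number>; // reference counts grouped by module
  git: { hot: boolean; lastCommit: string };
  tests: { primary: string; additional: number };
}

// Render the bundle as compact, line-oriented text.
function renderSymbolContext(ctx: SymbolContext): string {
  const lines: string[] = [];
  lines.push(`SYM ${ctx.name}@${ctx.file}:${ctx.line} ${ctx.kind}`);
  lines.push(`  SIG ${ctx.signature}`);
  for (const intent of ctx.intents) {
    lines.push(`  INTENT ${intent.adr} ${intent.severity} "${intent.claim}"`);
    if (intent.rationale) {
      lines.push(`    RATIONALE "${intent.rationale}"`);
    }
  }
  const entries = Object.entries(ctx.refsByModule);
  const total = entries.reduce((sum, [, count]) => sum + count, 0);
  const breakdown = entries.map(([mod, count]) => `${mod}:${count}`).join(" ");
  lines.push(`  REFS ${total} [${breakdown}]`);
  // "cold" is an assumed label for symbols without recent churn.
  lines.push(`  GIT ${ctx.git.hot ? "hot" : "cold"} last=${ctx.git.lastCommit}`);
  const extra = ctx.tests.additional > 0 ? ` (+${ctx.tests.additional})` : "";
  lines.push(`  TESTS ${ctx.tests.primary}${extra}`);
  return lines.join("\n");
}

// The OrderProcessor example from this post, expressed in the sketch's model.
const orderProcessor: SymbolContext = {
  name: "OrderProcessor",
  file: "src/orders/processor.ts",
  line: 42,
  kind: "class",
  signature: "class OrderProcessor extends BaseProcessor<Order>",
  intents: [
    {
      adr: "ADR-07",
      severity: "hard",
      claim: "must be idempotent",
      rationale: "All order processing must be safely retryable.",
    },
  ],
  refsByModule: { billing: 14, admin: 9 },
  git: { hot: true, lastCommit: "2026-03-14" },
  tests: { primary: "src/orders/processor.test.ts", additional: 11 },
};

console.log(renderSymbolContext(orderProcessor));
```

The point of the line-oriented format is that one short response carries what would otherwise take many separate lookups, and each prefix (SYM, INTENT, REFS, GIT, TESTS) maps to one of the fused signals.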

[Figure: ContextAtlas demo]

The architecture, briefly

The thing that makes this work is a hard split between index time and query time.

Index time is where the expensive reasoning happens. Opus 4.7 reads your ADRs, docstrings, and filtered commit messages once per change, producing structured claims keyed to LSP-resolved symbol IDs. The output is atlas.json, a committable artifact that lives in your repo alongside your code.

Query time is where Claude actually works. Every get_symbol_context call is a local SQLite lookup plus local LSP calls. Zero API calls at query time. No model in the loop. No embedding service. No external network. That's a hard architectural invariant documented in ADR-02: expensive reasoning happens once at index time and never again.

That's also why I went LSP-grounded rather than graph-based. There are MCP servers in this space — Graphify and others — that build knowledge graphs over codebases. Real category overlap, worth being straight about. But the architectural bets differ. Graphify derives structure by parsing and extracting; I delegate every structural question to the language server, which has spent 20+ years getting symbol resolution right. Graphify exposes graph primitives (get_neighbors, shortest_path) that callers compose; ContextAtlas returns pre-composed bundles in one call. Graphify ingests documentation, diagrams, and research papers; I index code, prose, and git, and nothing else. Whether those bets produce better results is an empirical question — see the numbers below.

For search ranking inside the atlas, I chose local FTS5 with BM25 over embeddings (ADR-09). Embeddings are fuzzy; LSP symbols are exact. For code, exactness wins, and it keeps the tool offline-first.
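For readers who haven't met BM25, here is a small sketch of the textbook scoring formula over pre-tokenized documents. It is not how SQLite's FTS5 implements ranking internally and not ContextAtlas code; the `bm25Scores` name, the `Doc` type, and the defaults k1 = 1.2 and b = 0.75 are assumptions chosen for illustration, and the IDF uses the common non-negative variant.

```typescript
// Illustrative, in-memory BM25. Names and parameters are assumptions
// for this sketch, not the ContextAtlas or FTS5 implementation.
type Doc = { id: string; tokens: string[] };

function bm25Scores(
  docs: Doc[],
  query: string[],
  k1 = 1.2,
  b = 0.75
): Map<string, number> {
  const scores = new Map<string, number>();
  const n = docs.length;
  if (n === 0) return scores;

  const avgLen = docs.reduce((sum, d) => sum + d.tokens.length, 0) / n;
  for (const d of docs) scores.set(d.id, 0);

  // Score each distinct query term once.
  for (const term of Array.from(new Set(query))) {
    const df = docs.filter((d) => d.tokens.includes(term)).length;
    if (df === 0) continue;
    // Non-negative IDF variant: rarer terms contribute more.
    const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
    for (const d of docs) {
      const tf = d.tokens.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Length normalization: long documents are penalized via b.
      const norm = tf + k1 * (1 - b + b * (d.tokens.length / avgLen));
      const current = scores.get(d.id) ?? 0;
      scores.set(d.id, current + (idf * tf * (k1 + 1)) / norm);
    }
  }
  return scores;
}
```

Because scoring is driven by exact token matches, a document only earns a score for terms it literally contains, which is the exactness property the paragraph above relies on. Nothing here needs a network call or a model.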

The numbers

I'll save the deep methodology for the benchmarks repo and RUBRIC.md, but the headline: across three target repos in three languages — hono (TypeScript), httpx (Python), cobra (Go) — ContextAtlas reduces token consumption by 45–72% on architectural-intent prompts vs vanilla Claude Code, with zero quality regression across measured axes. Replicated across all three languages. The most dramatic single case ran 7.3× cheaper at equivalent answer depth.

The quality axis is where I wanted to push past vibes. v0.5 shipped a blind-graded LLM-judge methodology under paired-mode anonymization (ADR-19). The analysis is a paired t-test over N=27 differences per axis, with pre-registered thresholds locked before the precision values were computed.

A 76% tie rate from the judge confirms the anonymization worked — three-quarters of the time it couldn't tell which condition was which. I'm not going to pretend this is a randomized clinical trial, but for a hackathon-shaped project it's more rigor than most people expect, and that was the point. I wanted measurements, not vibes.
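For anyone unfamiliar with the paired-t analysis mentioned above, this is a minimal sketch of the statistic itself: take the per-prompt difference between the two conditions, then divide the mean difference by its standard error. The `pairedT` function and its argument names are hypothetical and are not part of the benchmarks repo.

```typescript
// Paired t statistic over matched scores. A sketch under stated assumptions:
// both arrays hold judge scores for the same prompts, in the same order.
function pairedT(
  treatment: number[],
  baseline: number[]
): { meanDiff: number; t: number; df: number } {
  if (treatment.length !== baseline.length) {
    throw new Error("paired samples must have the same length");
  }
  if (treatment.length < 2) {
    throw new Error("need at least two pairs to estimate variance");
  }

  const diffs = treatment.map((x, i) => x - baseline[i]);
  const n = diffs.length;
  const meanDiff = diffs.reduce((sum, d) => sum + d, 0) / n;

  // Sample variance of the differences (n - 1 in the denominator).
  const variance =
    diffs.reduce((sum, d) => sum + (d - meanDiff) * (d - meanDiff), 0) /
    (n - 1);
  const standardError = Math.sqrt(variance / n);

  // If every difference is identical, standardError is 0 and t is not finite.
  return { meanDiff, t: meanDiff / standardError, df: n - 1 };
}
```

With N=27 differences per axis, the statistic would be compared against a t distribution with 26 degrees of freedom. Pairing matters here because each prompt acts as its own control, so prompt difficulty cancels out of the comparison.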

The full Phase 5–10 reference runs are in the benchmarks repo if you want the per-cell breakdown, the cost decomposition, or the nine named findings per cycle.

Beyond tokens — the design-alignment angle

The cheaper-tokens story sells the README, but it isn't the part I find most interesting. That came out of a self-dogfood experiment during v0.3.

I gave Claude Code the same code-change task twice — fix a real bug in the FTS5 tokenizer where identifier-shaped feature names like narrow_attribution were getting split into narrow + attribution on indexing. Two clones, identical prompt, identical atlas. Only difference: one session had ContextAtlas MCP active, the other didn't.

Both arms fixed the bug. But they fixed it differently.

Arm A (no MCP) reached for query-time OR-expansion: take narrow_attribution, expand it to "narrow_attribution" OR "narrow" OR "attribution", let BM25 sort it out. Higher recall, but the canonical-identifier claim now had to compete with split-part-hit noise from unrelated docs.

Arm B (MCP active) reached for index-time dual-form indexing: store both the identifier and its split-word form at write time, then match the identifier form exactly at query time. Higher precision, no noise competition.
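The difference between the two fixes is easier to see in a toy model. The sketch below is a simplified in-memory stand-in, not the FTS5 tokenizer and not the code either arm actually wrote; every function name and sample document in it is hypothetical.

```typescript
type Doc = { id: string; text: string };

// Word-level tokens: underscores act as separators.
function wordTokens(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Identifier-level tokens: underscores are kept, so narrow_attribution survives.
function identifierTokens(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9_]+/).filter((t) => t.length > 0);
}

// Arm A style: expand the identifier at query time and OR the parts together
// against a word-level view of each document.
function searchWithOrExpansion(docs: Doc[], identifier: string): string[] {
  const terms = new Set([identifier.toLowerCase(), ...wordTokens(identifier)]);
  return docs
    .filter((d) => wordTokens(d.text).some((t) => terms.has(t)))
    .map((d) => d.id);
}

// Arm B style: store both forms at index time.
function buildDualFormIndex(docs: Doc[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  const add = (token: string, docId: string): void => {
    const bucket = index.get(token) ?? new Set<string>();
    bucket.add(docId);
    index.set(token, bucket);
  };
  for (const d of docs) {
    for (const t of identifierTokens(d.text)) add(t, d.id);
    for (const t of wordTokens(d.text)) add(t, d.id);
  }
  return index;
}

// At query time, match the identifier form exactly.
function searchExact(index: Map<string, Set<string>>, identifier: string): string[] {
  const hits = index.get(identifier.toLowerCase());
  return hits ? Array.from(hits) : [];
}

// Made-up sample corpus for illustration only.
const exampleDocs: Doc[] = [
  { id: "d1", text: "narrow_attribution limits claims to the resolved symbol" },
  { id: "d2", text: "a narrow window for retries" },
  { id: "d3", text: "attribution of commits to authors" },
];

console.log(searchWithOrExpansion(exampleDocs, "narrow_attribution"));
console.log(searchExact(buildDualFormIndex(exampleDocs), "narrow_attribution"));
```

In this toy corpus the OR-expansion query matches all three documents, because "narrow" and "attribution" each appear in unrelated text, while the exact lookup against the dual-form index returns only the document that contains the identifier. The split-word form is still in the index, so plain word queries keep working.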

I picked Arm B's fix and shipped it (it's ADR-17 now), not because it ran faster, but because it fit the thesis. The whole point of paying extraction-time API cost is to enable query-time precision. Arm B saw that thesis in the ADRs and aligned with it. Arm A, with no architectural intent in front of it, defaulted to generic IR recall-first instincts. Both fixes worked. One was correct.

That's the value prop that goes beyond tokens: agents with architectural intent in their context window don't just answer faster, they make better design-shaped choices. The full write-up is in the v0.3 Round 3 dogfood evidence doc.

The team artifact

One thing I think genuinely differentiates this from session-memory tools (claude-mem, engram, anamnesis) is that ContextAtlas's data source is your repo itself, not your accumulated chat history. The atlas commits to your repo as atlas.json. New teammate clones the repo, opens Claude Code, and immediately gets the same context quality as someone who's been working with the project for months. No "session 1 is dumb because the AI hasn't learned anything yet." It's a team artifact, not a personal cache.

PR reviewers see atlas diffs alongside code diffs the same way they see schema changes. Open-source contributors get the maintainers' accumulated architectural knowledge for free. Developers returning to a project after months away pull main and the atlas reflects everything the team did while they were gone. The pattern is documented in ADR-06.

For teams that can't commit the atlas, set atlas.committed: false in the config — every developer runs their own extraction; the team-artifact benefit is lost but the tool still works.

Two paths to set it up

You can run ContextAtlas two ways, and they produce identical atlases (ADR-02 locks the substrate equivalence).

Skills path (/index-atlas, /generate-adrs, /prime-atlas) — if you're already in Claude Code and want zero setup friction. No API key required; runs under your Claude subscription.

CLI path (contextatlas init && contextatlas index) — if you want to script atlas refresh in CI/CD, integrate with non-Claude-Code agents, or run unattended. Anthropic API direct; typical refresh ~$0.20–1, first-time ADR scaffolding ~$5–15 per repo.

Pick whichever fits your workflow. The atlas downstream is the same.

Honest limits

I want to be straight about what I don't claim.

Quality measurements use a single judge model (Sonnet 4.6) with within-judge consistency ≥80% per axis. Graduating to a cross-vendor judge panel is post-v1.0 work. Quantitative claims are bounded to the three benchmark targets (hono, httpx, cobra) plus my own dogfooding. The v0.5 substrate is 5 anchor cells × n≥5 trials, not a full-matrix replication. A v0.6 cross-cycle subset surfaced an atlas-substrate-version confound that is still under investigation; it is disclosed in the Phase-10 reference doc.

Tie- and trick-bucket prompts routinely show ContextAtlas net-negative. Bucket-aware methodology surfaces those rather than burying them.

Favorable and unfavorable results both ship. The cross-cycle finding that one of my own hypotheses was falsified is in the README. That's the bar I want to hold to.

Try it

npm install -g contextatlas
contextatlas init
contextatlas index
# then add the MCP server entry to your Claude Code config
# (snippet in the README)

Repo: github.com/traviswye/ContextAtlas. Full README, install instructions, the four-condition benchmark matrix, and all 21 ADRs are there. Cycle-by-cycle development history at docs/release-history.md.

Language adapters for Rust, Java, and C# are the obvious gaps next. The LanguageAdapter interface is small and stable enough that they're realistic community contributions — see docs/language-adapter-guide.md. v1.1 is shaping up around developer onboarding flows and post-launch cohort exposure.

If you've ever watched Claude Code grep its way through a codebase and felt the tokens tick by, I'd love your eyes on this. Star the repo if you want to follow along, file an issue if it breaks for you, and please be honest — this only gets better with feedback from people running it on real codebases. Happy to answer anything in the comments.
