Real benchmark: five developer questions about the antilist repo (Convex + React, 66 files, ~50 k tokens of source), run twice — with raw file context and with Codegraph MCP tools. Measured on gpt-4o-mini and local llama3.1:8b.
Same repo, same questions, two runs. Direct loads source files into the prompt; MCP queries the graph through Codegraph tools instead.

- … save function in convex/userData.ts used for, and what calls it?
- … App component in App.tsx do and which child screens does it render?
- … convex/_generated/api)?
- … HomeScreen in components/HomeScreen.tsx connect to Convex mutations or queries?

| Q | Direct (in) | MCP (in) | Tools | Direct (ms) | MCP (ms) | Saved |
|---|---|---|---|---|---|---|
| Q1 | 50,322 | 8,978 | 6 | 8,000 | 7,034 | 82% |
| Q2 | 50,327 | 5,533 | 3 | 13,230 | 4,644 | 89% |
| Q3 | 50,324 | 9,766 | 2 | 8,803 | 4,476 | 81% |
| Q4 | 50,323 | 2,229 | 1 | 3,181 | 2,045 | 96% |
| Q5 | 50,324 | 11,912 | 5 | 9,186 | 8,085 | 76% |
| Total | 251,620 | 38,418 | 17 | 42,400 | 26,284 | 85% |
Cost on gpt-4o-mini ($0.15 / M in, $0.60 / M out): $0.0387 → $0.0062 across all five questions. Same model, same answers, 84% cheaper and 38% faster end-to-end. On Claude Sonnet or GPT-4 the dollar savings scale proportionally.
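
The percentages in the table can be re-derived from its raw columns. The sketch below is only a check on that arithmetic: it copies the five rows, recomputes the Saved column and the totals, and prices the input tokens at the quoted $0.15 / M. Output-token counts are not listed in the table, so the dollar figures here cover input only; the function names are illustrative.

```python
# Rows copied from the gpt-4o-mini table above:
# (question, direct input tokens, MCP input tokens, direct ms, MCP ms)
ROWS = [
    ("Q1", 50_322, 8_978, 8_000, 7_034),
    ("Q2", 50_327, 5_533, 13_230, 4_644),
    ("Q3", 50_324, 9_766, 8_803, 4_476),
    ("Q4", 50_323, 2_229, 3_181, 2_045),
    ("Q5", 50_324, 11_912, 9_186, 8_085),
]

INPUT_PRICE_PER_M = 0.15  # USD per million input tokens, as quoted in the text


def saved_pct(direct: int, mcp: int) -> int:
    """Percentage reduction from direct to MCP, rounded to a whole percent."""
    return round(100 * (1 - mcp / direct))


def summarize(rows):
    """Recompute the per-question and total figures from the raw columns."""
    direct_in = sum(r[1] for r in rows)
    mcp_in = sum(r[2] for r in rows)
    direct_ms = sum(r[3] for r in rows)
    mcp_ms = sum(r[4] for r in rows)
    return {
        "per_question_saved": {r[0]: saved_pct(r[1], r[2]) for r in rows},
        "tokens_saved_pct": saved_pct(direct_in, mcp_in),
        "time_saved_pct": saved_pct(direct_ms, mcp_ms),
        # Input-only cost; output tokens are not listed in the table.
        "direct_input_usd": direct_in * INPUT_PRICE_PER_M / 1_000_000,
        "mcp_input_usd": mcp_in * INPUT_PRICE_PER_M / 1_000_000,
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(key, value)
```

The input-only costs come out at about $0.0377 and $0.0058, slightly below the quoted $0.0387 and $0.0062, which is consistent with the quoted totals also including output tokens.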
Codegraph is provider-agnostic. The same five questions answered with a local 8B model via Ollama — no API key, no network, no marginal cost.
| Q | Direct (in) | MCP (in) | Tools | Direct (ms) | MCP (ms) | Saved |
|---|---|---|---|---|---|---|
| Q1 | 4,096 | 2,238 | 3 | 31,808 | 16,252 | 45% |
| Q2 | 4,096 | 1,280 | 0 | 30,974 | 18,033 | 69% |
| Q3 | 4,096 | 1,277 | 0 | 36,180 | 11,593 | 69% |
| Q4 | 4,096 | 1,506 | 1 | 28,825 | 9,843 | 63% |
| Q5 | 4,096 | 1,277 | 0 | 45,800 | 13,242 | 69% |
| Total | 20,480 | 7,578 | 4 | 173,587 | 68,963 | 63% |
Direct context is capped at the 8B model's 4 096-token window. Codegraph still trims another 63% off that and runs 2.5× faster (173 s → 69 s), because the model processes a smaller, structured input. Cost on both runs: $0.
One semantic-search call against a different repo (Camwatcher) for "Where is object detection written?". Ten ranked symbol pointers; zero file contents transferred.
| Symbol | Kind | File | Relevance |
|---|---|---|---|
| motion_detector | Var | backend/app/ai/motion_detector.py:78 | |
| yolo_detector | Var | backend/app/ai/yolo_detector.py:11 | |
| YOLODetector | Class | backend/app/ai/yolo_detector.py:5 | |
| get_motion_detection | Fn | backend/app/integrations/tapo_client.py:180 | |
| detect | Fn | backend/app/ai/yolo_detector.py:6 | |
| MotionDetector | Class | backend/app/ai/motion_detector.py:15 | |
| set_motion_detection | Fn | backend/app/integrations/tapo_client.py:184 | |
| step_motion_detection | Fn | backend/test_tapo.py:96 | |
| handle_event | Fn | backend/app/events/event_pipeline.py:37 | |
| _get_subtractor | Fn | backend/app/ai/motion_detector.py:30 | |
The MCP returned ~600 tokens of symbol metadata. Without it, Claude would have loaded ~6 detection-related files (4 000–6 000 tokens) just to find the same starting points.
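
To illustrate why pointers are enough as starting points, the sketch below models each result as a small record and groups the ten results by file. The `SymbolPointer` type and both helper functions are assumptions made for this example; the actual response schema of the Codegraph tools is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolPointer:
    """Hypothetical record for one search result (not the real tool schema)."""
    name: str
    kind: str  # "Var", "Class" or "Fn", as in the table
    path: str
    line: int


def parse_pointer(name: str, kind: str, location: str) -> SymbolPointer:
    """Split a 'path:line' location as shown in the File column."""
    path, _, line = location.rpartition(":")
    return SymbolPointer(name, kind, path, int(line))


# The ten results from the table, in ranked order.
RESULTS = [
    parse_pointer("motion_detector", "Var", "backend/app/ai/motion_detector.py:78"),
    parse_pointer("yolo_detector", "Var", "backend/app/ai/yolo_detector.py:11"),
    parse_pointer("YOLODetector", "Class", "backend/app/ai/yolo_detector.py:5"),
    parse_pointer("get_motion_detection", "Fn", "backend/app/integrations/tapo_client.py:180"),
    parse_pointer("detect", "Fn", "backend/app/ai/yolo_detector.py:6"),
    parse_pointer("MotionDetector", "Class", "backend/app/ai/motion_detector.py:15"),
    parse_pointer("set_motion_detection", "Fn", "backend/app/integrations/tapo_client.py:184"),
    parse_pointer("step_motion_detection", "Fn", "backend/test_tapo.py:96"),
    parse_pointer("handle_event", "Fn", "backend/app/events/event_pipeline.py:37"),
    parse_pointer("_get_subtractor", "Fn", "backend/app/ai/motion_detector.py:30"),
]


def group_by_file(pointers):
    """Map each file to the lines worth opening, keeping files in rank order."""
    grouped = defaultdict(list)
    for p in pointers:
        grouped[p.path].append(p.line)
    return dict(grouped)


if __name__ == "__main__":
    for path, lines in group_by_file(RESULTS).items():
        print(path, lines)
```

Grouped this way, the ten pointers cover five distinct files, so a follow-up read can target specific line ranges in those files instead of loading whole modules.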
Direct context scales with repo size; MCP returns symbol pointers and stays nearly flat. The 66-file row is from the benchmark above; the larger rows are projected from the same access pattern.
MCP input stays flat because the graph index returns symbol pointers, not source. Direct context grows linearly with codebase size until it exceeds the model's context window — gpt-4o-mini caps at 128 k, Claude Sonnet at 200 k.
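
A rough way to see where direct context stops fitting is to extrapolate linearly from the benchmark. The sketch below assumes that tokens per file stay constant at the antilist average (about 762 tokens per file), which is a simplification; the window sizes are the round figures quoted above, and the function names are illustrative.

```python
BENCH_FILES = 66
BENCH_TOKENS = 50_322  # direct input tokens for Q1 in the benchmark
# Assumption: every repo averages the same tokens per file as antilist.
TOKENS_PER_FILE = BENCH_TOKENS / BENCH_FILES

# Context windows as quoted in the text.
CONTEXT_WINDOWS = {"gpt-4o-mini": 128_000, "Claude Sonnet": 200_000}


def direct_tokens(files: int) -> int:
    """Projected direct-context size for a repo with this many files."""
    return round(files * TOKENS_PER_FILE)


def models_that_fit(files: int) -> list[str]:
    """Models whose window can still hold the whole projected repo."""
    needed = direct_tokens(files)
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


if __name__ == "__main__":
    for files in (66, 200, 1000):
        print(files, direct_tokens(files), models_that_fit(files))
```

Under this assumption a 200-file repo needs roughly 152 k tokens, which fits a 200 k window but not a 128 k one, and a 1 000-file repo needs roughly 762 k tokens, which fits neither.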
The 5-question benchmark above ran on gpt-4o-mini. Below: the same input pattern applied to Claude Sonnet 4.6 at $3 / M in · $15 / M out across codebases of different sizes.
| Project | Without MCP | With MCP | Reduction |
|---|---|---|---|
| Camwatcher (~50 files) | $0.1548 | $0.0252 | 84% |
| Medium (~200 files) | $0.4538 | $0.0390 | 91% |
| Large (1 000+ files) | won't fit | $0.0480 | 100%* |
MCP cost is nearly fixed: ~8 000 input + ~150 output tokens per question. Direct cost grows with repo size: past ~200 files, the full codebase no longer fits in the 128 k to 200 k context windows cited above, so retrieval is the only viable mode on those models. *Large-repo direct mode is undefined; MCP is the path that actually works.
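
The per-question cost follows directly from the token counts and the quoted Sonnet prices. The sketch below applies that formula to the approximate MCP profile (~8 000 input, ~150 output) and to the benchmark's direct input size; using 150 output tokens for direct mode as well is an assumption, since direct-mode output counts are not given.

```python
SONNET_IN_PER_M = 3.00    # USD per million input tokens, as quoted in the text
SONNET_OUT_PER_M = 15.00  # USD per million output tokens, as quoted in the text


def question_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one question at the quoted per-million-token prices."""
    return (input_tokens * SONNET_IN_PER_M + output_tokens * SONNET_OUT_PER_M) / 1_000_000


if __name__ == "__main__":
    mcp = question_cost(8_000, 150)      # approximate MCP profile from the text
    direct = question_cost(50_322, 150)  # output size assumed equal to MCP mode
    print(f"MCP:    ${mcp:.4f}")
    print(f"Direct: ${direct:.4f}")
    print(f"Reduction: {round(100 * (1 - mcp / direct))}%")
```

With these rounded inputs the formula gives about $0.026 per MCP question and about $0.153 per direct question, close to but not identical with the table, which presumably uses the measured token counts.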
Projection from the antilist benchmark — $0.1548 direct vs $0.0252 with MCP per question.
| Queries / month | Without MCP | With MCP | Monthly saving | Annual saving |
|---|---|---|---|---|
| 50 | $7.74 | $1.26 | $6.48 | $77.76 |
| 200 | $30.96 | $5.04 | $25.92 | $311.04 |
| 500 | $77.40 | $12.60 | $64.80 | $777.60 |
| 1 000 | $154.80 | $25.20 | $129.60 | $1 555.20 |
| 2 500 | $387.00 | $63.00 | $324.00 | $3 888.00 |
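Savings compound on larger codebases. A team of 5 developers each running 500 queries / month on this baseline (2 500 queries in total) saves roughly $3 888 / year; at the 200-file per-question costs in the pricing table above ($0.4538 vs $0.0390), the same volume saves roughly $12 400 / year.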
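
The projection table is plain multiplication of the two per-question costs by query volume. A minimal sketch, using `Decimal` to avoid floating-point noise in the cents; the function name and dictionary keys are illustrative.

```python
from decimal import Decimal

# Per-question costs quoted for the projection.
DIRECT_PER_QUESTION = Decimal("0.1548")
MCP_PER_QUESTION = Decimal("0.0252")


def projection(queries_per_month: int) -> dict:
    """Monthly cost with and without MCP, plus monthly and annual savings."""
    without_mcp = queries_per_month * DIRECT_PER_QUESTION
    with_mcp = queries_per_month * MCP_PER_QUESTION
    monthly_saving = without_mcp - with_mcp
    return {
        "without_mcp": without_mcp,
        "with_mcp": with_mcp,
        "monthly_saving": monthly_saving,
        "annual_saving": monthly_saving * 12,
    }


if __name__ == "__main__":
    for volume in (50, 200, 500, 1_000, 2_500):
        row = projection(volume)
        print(volume, *(f"${value:.2f}" for value in row.values()))
```

Calling `projection(5 * 500)` reproduces the team example: 2 500 queries per month, $324.00 saved per month, $3 888.00 per year.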
The model receives raw file contents as context. On the 66-file antilist benchmark that meant 50 322 input tokens per question — 100% of the repo, every turn.
Context size grows linearly with codebase size, then hits the model's context-window ceiling and stops working entirely.
The model calls typed graph tools (3.4 per question on average in the gpt-4o-mini run) and receives symbol names, paths, signatures, and relevance scores: about 7 700 tokens per question, nearly independent of repo size.
The model reads compact pointers, not source. Same answers, fraction of the cost, and 1.6× faster end-to-end.
find_callers and blast_radius walk typed edges in Kuzu; Claude receives exact structural answers.
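
As a rough picture of what such edge walks compute, the sketch below runs the two queries over a tiny in-memory call graph. It is a plain breadth-first search in Python, not Codegraph's implementation and not a Kuzu query; the function names are borrowed from the text, and the edge list and symbol names are invented for illustration.

```python
from collections import defaultdict, deque

# Hypothetical (caller, callee) edges; not taken from any real repo.
CALLS = [
    ("ui.render", "api.save"),
    ("ui.submit", "api.save"),
    ("api.save", "db.write"),
    ("jobs.sync", "db.write"),
    ("app.main", "ui.render"),
]


def build_reverse_index(edges):
    """Index the graph by callee so callers can be looked up directly."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    return callers


def find_callers(callers, symbol):
    """Direct callers of a symbol (one hop along reversed call edges)."""
    return sorted(callers.get(symbol, ()))


def blast_radius(callers, symbol):
    """Every symbol that can reach `symbol` through calls, found by BFS."""
    seen = set()
    queue = deque([symbol])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


if __name__ == "__main__":
    index = build_reverse_index(CALLS)
    print(find_callers(index, "db.write"))
    print(blast_radius(index, "db.write"))
```

The point of the example is the shape of the answer: a short list of symbol names produced by walking edges, with no source text involved.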