Cost & Token Analysis

Real benchmark: five developer questions about the antilist repo (Convex + React, 66 files, ~50 k tokens of source), run twice — with raw file context and with Codegraph MCP tools. Measured on gpt-4o-mini and local llama3.1:8b.

85% fewer input tokens used

84% cost reduction per session

3.4 avg tool calls per question

Benchmark · 5 questions on gpt-4o-mini

Same repo, same questions, two runs. Direct loads source files into the prompt; MCP queries the graph through Codegraph tools instead.

Q1 Where is the Convex schema defined and what tables or fields does it declare?
Q2 What is the save function in convex/userData.ts used for, and what calls it?
Q3 What does the App component in App.tsx do and which child screens does it render?
Q4 Which source files import from the Convex generated API (convex/_generated/api)?
Q5 How does HomeScreen in components/HomeScreen.tsx connect to Convex mutations or queries?

Q	Direct (in)	MCP (in)	Tools	Direct (ms)	MCP (ms)	Saved
Q1	50,322	8,978	6	8,000	7,034	82%
Q2	50,327	5,533	3	13,230	4,644	89%
Q3	50,324	9,766	2	8,803	4,476	81%
Q4	50,323	2,229	1	3,181	2,045	96%
Q5	50,324	11,912	5	9,186	8,085	76%
Total	251,620	38,418	17	42,400	26,284	85%

Cost on gpt-4o-mini ($0.15 / M in, $0.60 / M out): $0.0387 → $0.0062 across all five questions. Same model, same answers, 84% cheaper and 38% faster end-to-end (42.4 s → 26.3 s). On pricier models such as Claude Sonnet or GPT-4, the same percentage reduction translates into proportionally larger dollar savings.
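The input side of that cost figure can be reproduced from the table totals. The sketch below is illustrative only: it prices input tokens alone, because the table does not break out output tokens, so its results sit slightly below the reported $0.0387 and $0.0062 totals (the gap is presumably the output-token share).

```python
# Input-token totals from the benchmark table above.
DIRECT_INPUT_TOKENS = 251_620
MCP_INPUT_TOKENS = 38_418

# gpt-4o-mini input price quoted in the text, in dollars per million tokens.
PRICE_PER_M_INPUT = 0.15


def input_cost(tokens: int, price_per_million: float = PRICE_PER_M_INPUT) -> float:
    """Dollar cost of `tokens` input tokens at the given per-million price."""
    return tokens * price_per_million / 1_000_000


direct_cost = input_cost(DIRECT_INPUT_TOKENS)
mcp_cost = input_cost(MCP_INPUT_TOKENS)
reduction = 1 - mcp_cost / direct_cost

print(f"direct input cost: ${direct_cost:.4f}")
print(f"MCP input cost:    ${mcp_cost:.4f}")
print(f"input-cost reduction: {reduction:.1%}")
```

Swapping in another model's input price changes the dollar amounts but not the percentage, since both runs are priced at the same rate.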

Same benchmark · local Ollama (llama3.1:8b)

Codegraph is provider-agnostic. The same five questions answered with a local 8B model via Ollama — no API key, no network, no marginal cost.

Q	Direct (in)	MCP (in)	Tools	Direct (ms)	MCP (ms)	Saved
Q1	4,096	2,238	3	31,808	16,252	45%
Q2	4,096	1,280	0	30,974	18,033	69%
Q3	4,096	1,277	0	36,180	11,593	69%
Q4	4,096	1,506	1	28,825	9,843	63%
Q5	4,096	1,277	0	45,800	13,242	69%
Total	20,480	7,578	4	173,587	68,963	63%

Direct context is capped at the 8B model's 4 096-token window, so each direct prompt sees only a truncated slice of the ~50 k-token repo. Codegraph still trims another 63% off that and runs 2.5× faster (174 s → 69 s), because the model processes smaller, structured input. Cost on both runs: $0.
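As a quick check on the speedup figure, the per-question latencies from the table can be summed directly. This is a minimal sketch; the dictionary simply restates the table's numbers.

```python
# (direct_ms, mcp_ms) per question, copied from the Ollama table above.
LATENCIES_MS = {
    "Q1": (31_808, 16_252),
    "Q2": (30_974, 18_033),
    "Q3": (36_180, 11_593),
    "Q4": (28_825, 9_843),
    "Q5": (45_800, 13_242),
}

direct_total = sum(direct for direct, _ in LATENCIES_MS.values())
mcp_total = sum(mcp for _, mcp in LATENCIES_MS.values())
speedup = direct_total / mcp_total

print(f"direct: {direct_total / 1000:.1f} s, MCP: {mcp_total / 1000:.1f} s")
print(f"speedup: {speedup:.1f}x")
```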

What an MCP answer looks like

One semantic-search call against a different repo (Camwatcher) for "Where is object detection written?". Ten ranked symbol pointers; zero file contents transferred.

Symbol	Kind	File	Relevance
motion_detector	Var	backend/app/ai/motion_detector.py:78	50.6%
yolo_detector	Var	backend/app/ai/yolo_detector.py:11	48.5%
YOLODetector	Class	backend/app/ai/yolo_detector.py:5	48.4%
get_motion_detection	Fn	backend/app/integrations/tapo_client.py:180	47.8%
detect	Fn	backend/app/ai/yolo_detector.py:6	46.2%
MotionDetector	Class	backend/app/ai/motion_detector.py:15	45.8%
set_motion_detection	Fn	backend/app/integrations/tapo_client.py:184	44.4%
step_motion_detection	Fn	backend/test_tapo.py:96	42.9%
handle_event	Fn	backend/app/events/event_pipeline.py:37	35.3%
_get_subtractor	Fn	backend/app/ai/motion_detector.py:30	33.7%

The MCP returned ~600 tokens of symbol metadata. Without it, Claude would have loaded ~6 detection-related files (4 000–6 000 tokens) just to find the same starting points.
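To make the "pointers, not source" idea concrete, here is a hypothetical shape for such a result. The `SymbolHit` class and `render` function are invented for illustration and are not Codegraph's actual response schema; the sample rows are taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolHit:
    """One ranked pointer: where a symbol lives, not what it contains."""
    symbol: str
    kind: str         # "Fn", "Class" or "Var", as in the table above
    location: str     # "path:line"
    relevance: float  # 0.0 to 1.0


def render(hits: list[SymbolHit]) -> str:
    """Serialize hits as compact tab-separated lines, best match first."""
    ranked = sorted(hits, key=lambda h: h.relevance, reverse=True)
    return "\n".join(
        f"{h.symbol}\t{h.kind}\t{h.location}\t{h.relevance:.1%}" for h in ranked
    )


hits = [
    SymbolHit("YOLODetector", "Class", "backend/app/ai/yolo_detector.py:5", 0.484),
    SymbolHit("motion_detector", "Var", "backend/app/ai/motion_detector.py:78", 0.506),
    SymbolHit("detect", "Fn", "backend/app/ai/yolo_detector.py:6", 0.462),
]
payload = render(hits)
print(payload)
```

Each hit costs roughly one short line of text, which is why a ten-hit answer stays in the hundreds of tokens while the files themselves would run to thousands.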

Input tokens per query — measured + projected

Direct context scales with repo size; MCP returns symbol pointers and stays nearly flat. The 66-file row is from the benchmark above; the larger rows are projected from the same access pattern.

Repo	Direct (in)	With MCP (in)
antilist (66 files · measured)	50 322	7 684
medium (~200 files · projected)	150 000	~12 000
large (1 000+ files · projected)	won't fit	~15 000

MCP input stays nearly flat because the graph index returns symbol pointers, not source. Direct context grows linearly with codebase size until it exceeds the model's context window: gpt-4o-mini caps at 128 k, Claude Sonnet at 200 k.
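The linear projection for direct context can be sketched as follows. It assumes every file costs the average measured on antilist (about 762 tokens per file), which is a simplification; real repos vary. Under that assumption a 200-file repo comes out near 152 k tokens, in line with the ~150 000 projection.

```python
# Measured on the 66-file antilist repo (first benchmark question).
MEASURED_FILES = 66
MEASURED_DIRECT_TOKENS = 50_322
TOKENS_PER_FILE = MEASURED_DIRECT_TOKENS / MEASURED_FILES  # about 762

# Context windows quoted in the text.
CONTEXT_WINDOWS = {"gpt-4o-mini": 128_000, "Claude Sonnet": 200_000}


def projected_direct_tokens(files: int) -> int:
    """Linear projection: every file costs the measured average."""
    return round(files * TOKENS_PER_FILE)


def fits(files: int, window: int) -> bool:
    """True if the whole projected repo fits in one prompt of `window` tokens."""
    return projected_direct_tokens(files) <= window


for files in (66, 200, 1_000):
    tokens = projected_direct_tokens(files)
    verdict = {name: fits(files, window) for name, window in CONTEXT_WINDOWS.items()}
    print(files, tokens, verdict)
```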

How cost scales — Claude Sonnet 4.6 projection

The 5-question benchmark above ran on gpt-4o-mini. Below: the same input pattern applied to Claude Sonnet 4.6 at $3 / M in · $15 / M out across codebases of different sizes.

Project	Without MCP	With MCP	Reduction
Camwatcher (~50 files)	$0.1548	$0.0252	84%
Medium (~200 files)	$0.4538	$0.0390	91%
Large (1 000+ files)	won't fit	$0.0480	100%*

MCP cost is nearly fixed: ~8 000 input + ~150 output tokens per question. Direct cost grows with repo size: at ~200 files (~150 k tokens) the full codebase already exceeds gpt-4o-mini's 128 k window, and at 1 000+ files it exceeds Claude Sonnet's 200 k window as well, so retrieval is the only viable mode. *Large-repo direct mode is undefined; MCP is the path that actually works.
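A per-question cost under these prices is just input and output tokens weighted by their rates. The sketch below assumes ~150 output tokens for every row and uses the measured and projected MCP input sizes from this page; the exact token counts behind the table are not stated, so the results land within about $0.001 of the With MCP column rather than matching it exactly.

```python
# Claude Sonnet 4.6 prices quoted in the text, dollars per million tokens.
PRICE_IN = 3.00
PRICE_OUT = 15.00

# Assumption for this sketch: ~150 output tokens per question, as the text
# states for MCP answers.
ASSUMED_OUTPUT_TOKENS = 150


def question_cost(input_tokens: int, output_tokens: int = ASSUMED_OUTPUT_TOKENS) -> float:
    """Dollar cost of one question at the quoted per-million prices."""
    return (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1_000_000


# MCP input tokens per question: measured (66 files) and projected (larger repos).
MCP_INPUT = {"66 files": 7_684, "~200 files": 12_000, "1 000+ files": 15_000}

mcp_costs = {repo: question_cost(tokens) for repo, tokens in MCP_INPUT.items()}
for repo, cost in mcp_costs.items():
    print(f"{repo:>13}: ${cost:.4f} per question with MCP")
```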

Monthly savings — Sonnet 4.6 on a ~66-file repo

Projection from the antilist benchmark — $0.1548 direct vs $0.0252 with MCP per question.

Queries / month	Without MCP	With MCP	Monthly saving	Annual saving
50	$7.74	$1.26	$6.48	$77.76
200	$30.96	$5.04	$25.92	$311.04
500	$77.40	$12.60	$64.80	$777.60
1 000	$154.80	$25.20	$129.60	$1 555.20
2 500	$387.00	$63.00	$324.00	$3 888.00

Savings grow with codebase size. A team of 5 developers each running 500 queries / month (2 500 in total) on this baseline saves roughly $3 888 / year; on a 200-file project, the per-question figures above ($0.4538 direct vs $0.0390 with MCP) put that at roughly $12 400 / year.
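The monthly table follows from the two per-question costs by simple multiplication. A minimal sketch, using only the figures quoted in this section:

```python
# Per-question Sonnet 4.6 costs quoted in the text for the antilist repo.
DIRECT_PER_QUESTION = 0.1548
MCP_PER_QUESTION = 0.0252


def monthly_saving(queries_per_month: int) -> float:
    """Dollars saved per month by answering via MCP instead of direct context."""
    return queries_per_month * (DIRECT_PER_QUESTION - MCP_PER_QUESTION)


for queries in (50, 200, 500, 1_000, 2_500):
    monthly = monthly_saving(queries)
    print(f"{queries:>5} queries: ${monthly:,.2f}/month, ${monthly * 12:,.2f}/year")

# Five developers at 500 queries each is the 2 500-query row.
team_annual = monthly_saving(5 * 500) * 12
print(f"team of 5: ${team_annual:,.2f}/year")
```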

Why the difference

Without Codegraph MCP

The model receives raw file contents as context. On the 66-file antilist benchmark that meant 50 322 input tokens per question — 100% of the repo, every turn.

Context size grows linearly with codebase size, then hits the model's context-window ceiling and stops working entirely.

With Codegraph MCP

The model calls typed graph tools (3.4 calls per question on average) and receives symbol names, paths, signatures, and relevance scores: about 7 700 tokens per question on this benchmark, projected to stay around 15 000 even on 1 000+ file repos.

The model reads compact pointers, not source. Same answers, fraction of the cost, and 1.6× faster end-to-end.

How accuracy is maintained

1 Parsed, not guessed — tree-sitter extracts every symbol, call edge, and import from your source. No LLM hallucination in the index.

2 Vector KNN — each symbol is embedded at index time. Queries find semantically similar symbols even with different naming.

3 Graph traversal — tools like find_callers and blast_radius walk typed edges in Kuzu; Claude receives exact structural answers.

4 Incremental re-index — only changed files are re-parsed on each run, so the graph stays current without full re-embedding.
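Two of the ideas in the list above, nearest-neighbour lookup over symbol embeddings and walking call edges backwards, can be sketched with toy in-memory data. This is not Kuzu's query API or Codegraph's implementation; the function names, the two-dimensional vectors, and the call graph are all made up for illustration.

```python
import math
from collections import deque


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(query: list[float], embeddings: dict[str, list[float]], k: int = 3) -> list[str]:
    """The k symbols whose embeddings are most similar to the query vector."""
    ranked = sorted(embeddings, key=lambda s: cosine(query, embeddings[s]), reverse=True)
    return ranked[:k]


def find_callers(symbol: str, calls: dict[str, set[str]]) -> set[str]:
    """All symbols that reach `symbol` through one or more call edges."""
    # Invert caller -> callees into callee -> callers.
    callers_of: dict[str, set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers_of.setdefault(callee, set()).add(caller)

    # Breadth-first walk over the reversed edges.
    seen: set[str] = set()
    queue = deque([symbol])
    while queue:
        current = queue.popleft()
        for caller in callers_of.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


# Toy data: hypothetical embeddings and call edges.
embeddings = {"detect": [0.9, 0.1], "save": [0.0, 1.0], "YOLODetector": [0.8, 0.3]}
calls = {"App": {"HomeScreen"}, "HomeScreen": {"save"}, "Settings": {"save"}}

nearest = knn([1.0, 0.0], embeddings, k=2)
callers = find_callers("save", calls)
print(nearest, sorted(callers))
```

Because the traversal runs over parsed edges rather than model output, the caller set is exact for whatever the index contains; only the embedding lookup is approximate.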