@graphpilot-oss/graphpilot 0.0.1 → 0.1.0
- package/CHANGELOG.md +73 -126
- package/README.md +359 -101
- package/dist/cli.js +20 -0
- package/dist/cli.js.map +1 -1
- package/dist/indexer.js +3 -3
- package/dist/indexer.js.map +1 -1
- package/dist/init.d.ts +28 -0
- package/dist/init.js +112 -0
- package/dist/init.js.map +1 -0
- package/dist/interactions.d.ts +5 -4
- package/dist/interactions.js +0 -0
- package/dist/interactions.js.map +1 -1
- package/dist/mcp.js +126 -46
- package/dist/mcp.js.map +1 -1
- package/dist/repo-resolve.d.ts +47 -0
- package/dist/repo-resolve.js +195 -0
- package/dist/repo-resolve.js.map +1 -0
- package/dist/storage.js +10 -1
- package/dist/storage.js.map +1 -1
- package/dist/validation.js +30 -4
- package/dist/validation.js.map +1 -1
- package/dist/watcher.d.ts +10 -0
- package/dist/watcher.js +70 -7
- package/dist/watcher.js.map +1 -1
- package/examples/README.md +105 -0
- package/examples/claude-code/README.md +125 -0
- package/examples/claude-code/claude-routing.md +102 -0
- package/examples/claude-code/claude_config.json +8 -0
- package/examples/cline/.clinerules +39 -0
- package/examples/cline/README.md +104 -0
- package/examples/cline/cline_mcp_settings.json +10 -0
- package/examples/continue/.continuerules +39 -0
- package/examples/continue/README.md +98 -0
- package/examples/continue/config.json +13 -0
- package/examples/cursor/.cursorrules +39 -0
- package/examples/cursor/README.md +98 -0
- package/examples/cursor/mcp.json +11 -0
- package/examples/windsurf/.windsurfrules +39 -0
- package/examples/windsurf/README.md +85 -0
- package/examples/windsurf/mcp_config.json +8 -0
- package/package.json +12 -3
- package/.editorconfig +0 -15
- package/.github/CODEOWNERS +0 -22
- package/.github/FUNDING.yml +0 -1
- package/.github/ISSUE_TEMPLATE/bug_report.md +0 -33
- package/.github/ISSUE_TEMPLATE/config.yml +0 -5
- package/.github/ISSUE_TEMPLATE/feature_request.md +0 -23
- package/.github/PULL_REQUEST_TEMPLATE.md +0 -19
- package/.github/dependabot.yml +0 -15
- package/.github/workflows/ci.yml +0 -62
- package/.github/workflows/release.yml +0 -50
- package/.prettierignore +0 -19
- package/.prettierrc.json +0 -20
- package/CODE_OF_CONDUCT.md +0 -83
- package/CONTRIBUTING.md +0 -111
- package/bench/README.md +0 -544
- package/bench/results/agent-tier-2026-05-22.md +0 -28
- package/bench/results/agent-tier-summary.md +0 -44
- package/bench/results/baseline-tier-2026-05-22.md +0 -23
- package/bench/results/baseline.json +0 -810
- package/bench/results/baseline.md +0 -28
- package/bench/run-agent-tier-automated.ts +0 -234
- package/bench/run-agent-tier.md +0 -125
- package/bench/run-baseline-tier.ts +0 -200
- package/bench/run.ts +0 -210
- package/bench/runner-baseline.ts +0 -177
- package/bench/runner-graphpilot.ts +0 -131
- package/bench/score-agent-tier.ts +0 -191
- package/bench/score.ts +0 -59
- package/bench/tasks.ts +0 -236
- package/dist/provenance.d.ts +0 -74
- package/dist/provenance.js +0 -95
- package/dist/provenance.js.map +0 -1
- package/docs/architecture.md +0 -311
- package/docs/limitations.md +0 -156
- package/docs/mcp-setup.md +0 -231
- package/docs/quickstart.md +0 -202
- package/eslint.config.js +0 -148
- package/lefthook.yml +0 -81
- package/pnpm-workspace.yaml +0 -6
- package/scripts/smoke-stdio.mjs +0 -97
- package/src/cli.ts +0 -171
- package/src/edges.ts +0 -202
- package/src/git.ts +0 -255
- package/src/graph-schema.ts +0 -229
- package/src/impact.ts +0 -218
- package/src/indexer.ts +0 -152
- package/src/interactions.ts +0 -0
- package/src/mcp.ts +0 -652
- package/src/parser.ts +0 -138
- package/src/provenance.ts +0 -115
- package/src/query.ts +0 -148
- package/src/redact.ts +0 -122
- package/src/storage.ts +0 -115
- package/src/symbols.ts +0 -173
- package/src/validation.ts +0 -69
- package/src/validators.ts +0 -253
- package/src/watcher.ts +0 -383
- package/tests/edges.test.ts +0 -175
- package/tests/fixtures/sample.ts +0 -32
- package/tests/git.test.ts +0 -303
- package/tests/graph-schema.test.ts +0 -321
- package/tests/impact.test.ts +0 -454
- package/tests/interactions.test.ts +0 -180
- package/tests/lint-policy.test.ts +0 -106
- package/tests/mcp-stdio.test.ts +0 -171
- package/tests/mcp.test.ts +0 -335
- package/tests/parser.test.ts +0 -31
- package/tests/provenance.test.ts +0 -132
- package/tests/query.test.ts +0 -160
- package/tests/redact.test.ts +0 -167
- package/tests/security.test.ts +0 -144
- package/tests/symbols.test.ts +0 -78
- package/tests/validators.test.ts +0 -193
- package/tests/watcher.test.ts +0 -250
- package/tsconfig.json +0 -18
package/bench/README.md
DELETED
|
@@ -1,544 +0,0 @@
|
|
|
1
|
-
# GraphPilot Benchmarks
|
|
2
|
-
|
|
3
|
-
This directory contains reproducible benchmarks measuring GraphPilot's correctness and effectiveness for agent-assisted refactoring tasks.
|
|
4
|
-
|
|
5
|
-
## Quick Start
|
|
6
|
-
|
|
7
|
-
Run all benchmarks:
|
|
8
|
-
|
|
9
|
-
```bash
|
|
10
|
-
npm run bench
|
|
11
|
-
```
|
|
12
|
-
|
|
13
|
-
This runs:
|
|
14
|
-
|
|
15
|
-
1. **Tier-A (Tool Correctness):** Raw tool output quality (deterministic, <1s)
|
|
16
|
-
2. **Tier-B (Agent Success):** Agent task success rate vs baseline (automated simulation, ~5s)
|
|
17
|
-
|
|
18
|
-
Results are written to `bench/results/` as Markdown tables.
|
|
19
|
-
|
|
20
|
-
---
|
|
21
|
-
|
|
22
|
-
## Benchmark Tiers Explained
|
|
23
|
-
|
|
24
|
-
### Tier-A: Tool Correctness (Deterministic)
|
|
25
|
-
|
|
26
|
-
**What it measures:** Does GraphPilot's index return the correct results?
|
|
27
|
-
|
|
28
|
-
**Method:** Run 10 structural queries on GraphPilot's own codebase (42 files, 205 symbols).
|
|
29
|
-
|
|
30
|
-
**Example queries:**
|
|
31
|
-
|
|
32
|
-
- "Find all callers of `analyzeImpact`"
|
|
33
|
-
- "What breaks if I rename `indexDirectory`? (depth 2)"
|
|
34
|
-
- "Which test files exercise `parseFile`?"
|
|
35
|
-
|
|
36
|
-
**Metrics:**
|
|
37
|
-
|
|
38
|
-
- **F1 Score** (accuracy): TP / (TP + 0.5(FP + FN))
|
|
39
|
-
- **Precision**: TP / (TP + FP) — how many results are correct?
|
|
40
|
-
- **Recall**: TP / (TP + FN) — did we find all correct answers?
|
|
41
|
-
- **Token savings**: Bytes agent reads with GP vs grep
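The metric definitions above can be sketched as a small set-based scorer. This is a minimal illustration, not the contents of `bench/score.ts`: the names `Score` and `scoreSets` are hypothetical, and the handling of empty sets (empty result set counts as precision 1, empty truth set counts as recall 1) is an assumption inferred from the per-task result tables.

```typescript
// Hypothetical helper; names are illustrative, not taken from bench/score.ts.
interface Score {
  precision: number;
  recall: number;
  f1: number;
}

function scoreSets(returned: string[], groundTruth: string[]): Score {
  const truth = new Set<string>(groundTruth);
  const got = new Set<string>(returned);

  // True positives: returned names that are also in the ground truth.
  let tp = 0;
  got.forEach((name) => {
    if (truth.has(name)) tp += 1;
  });
  const fp = got.size - tp; // returned but wrong
  const fn = truth.size - tp; // correct but missed

  // Assumed convention: nothing returned => precision 1, nothing expected => recall 1.
  const precision = got.size === 0 ? 1 : tp / got.size;
  const recall = truth.size === 0 ? 1 : tp / truth.size;

  // F1 = TP / (TP + 0.5 * (FP + FN)); both sets empty is treated as a perfect score.
  const denom = tp + 0.5 * (fp + fn);
  const f1 = denom === 0 ? 1 : tp / denom;

  return { precision, recall, f1 };
}
```

For example, returning two names when only one is correct gives recall 1, precision 0.5 and F1 of about 0.67.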
|
|
42
|
-
|
|
43
|
-
**Results:**
|
|
44
|
-
|
|
45
|
-
| Metric | GraphPilot | grep | Improvement |
|
|
46
|
-
| -------------- | ---------- | ----------- | ---------------------------- |
|
|
47
|
-
| **F1 Score** | 0.89 | 0.42 | +112% |
|
|
48
|
-
| **Precision** | 0.96 | 0.18 | +433% |
|
|
49
|
-
| **Recall** | 0.83 | 1.0 | Grep is exhaustive but noisy |
|
|
50
|
-
| **Bytes read** | 721 B | 528 KB | **99.9% fewer** |
|
|
51
|
-
| **Token cost** | 180 tokens | 132k tokens | **99.9% savings** |
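The byte and token rows in the table are related by simple arithmetic. The sketch below assumes roughly 4 bytes per token and 1 KB = 1000 B; both are assumptions that happen to reproduce the table's figures, not constants confirmed by the benchmark code. The function names are hypothetical.

```typescript
// Assumed conversion ratio; the benchmark's real ratio is not stated in this README.
const BYTES_PER_TOKEN = 4;

// Rough token estimate for a payload of the given size.
function estimateTokens(bytes: number): number {
  return Math.round(bytes / BYTES_PER_TOKEN);
}

// Percentage by which `small` undercuts `large`.
function reductionPercent(small: number, large: number): number {
  return (1 - small / large) * 100;
}

const gpBytes = 721;
const grepBytes = 528000; // 528 KB, taking 1 KB = 1000 B (assumption)

console.log(estimateTokens(gpBytes)); // 180
console.log(estimateTokens(grepBytes)); // 132000
console.log(reductionPercent(gpBytes, grepBytes).toFixed(1)); // "99.9"
```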
|
|
52
|
-
|
|
53
|
-
**Why it matters:**
|
|
54
|
-
|
|
55
|
-
- Fewer tokens = faster, cheaper agents
|
|
56
|
-
- Higher F1 = smarter refactoring decisions
|
|
57
|
-
- Precision matters for safety (false positives break code)
|
|
58
|
-
|
|
59
|
-
**How to reproduce:**
|
|
60
|
-
|
|
61
|
-
```bash
|
|
62
|
-
npx tsx bench/run.ts
|
|
63
|
-
# Outputs: bench/results/baseline.md
|
|
64
|
-
```
|
|
65
|
-
|
|
66
|
-
---
|
|
67
|
-
|
|
68
|
-
### Tier-B: Agent Success Rate (Realistic)
|
|
69
|
-
|
|
70
|
-
**What it measures:** Can agents solve real refactor tasks using the tools?
|
|
71
|
-
|
|
72
|
-
**Method:** 13 refactor-analysis tasks, compared across two scenarios:
|
|
73
|
-
|
|
74
|
-
1. **Baseline:** vanilla grep (no structured index)
|
|
75
|
-
2. **GraphPilot:** our index with gp\_\* tools
|
|
76
|
-
|
|
77
|
-
Each task is scored on:
|
|
78
|
-
|
|
79
|
-
- Task success (did the agent reach the right conclusion?)
|
|
80
|
-
- Hallucination count (false positives)
|
|
81
|
-
- Evidence anchor resolution (file:line @ sha citations)
|
|
82
|
-
|
|
83
|
-
**Example tasks:**
|
|
84
|
-
|
|
85
|
-
| # | Task | GraphPilot Win? | Why |
|
|
86
|
-
| --- | ------------------------------------ | --------------- | -------------------------------------------- |
|
|
87
|
-
| t01 | Find callers of `analyzeImpact` | ✅ | Structural index is precise |
|
|
88
|
-
| t02 | Find callers of `extractSymbols` | ✅ | Same |
|
|
89
|
-
| t06 | Compute blast radius (depth 2) | ✅ | grep can't compute graph traversal |
|
|
90
|
-
| t11 | Differential impact (`since: main`) | ✅ | GraphPilot exclusive feature |
|
|
91
|
-
| t12 | Evidence anchors on results | ✅ | GraphPilot only; proof against hallucination |
|
|
92
|
-
| t10 | Find string literal `MAX_FILE_BYTES` | ❌ | grep wins (text search, not structure) |
|
|
93
|
-
|
|
94
|
-
**Results:**
|
|
95
|
-
|
|
96
|
-
| Metric | Baseline (grep) | GraphPilot | Improvement |
|
|
97
|
-
| -------------------- | --------------- | ---------- | --------------------- |
|
|
98
|
-
| **Tasks passed** | 4/13 (31%) | 7/13 (54%) | +75% |
|
|
99
|
-
| **Mean F1** | 0.33 | 0.70 | +112% |
|
|
100
|
-
| **Hallucinations** | 480 | 6 | −98.75% |
|
|
101
|
-
| **Evidence anchors** | 0% | 100% | Perfect citation rate |
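The "Improvement" column is a relative change against the grep baseline. A minimal sketch of that arithmetic follows; the helper names are hypothetical.

```typescript
// Relative change of `value` against `baseline`, in percent.
function relativeChangePercent(baseline: number, value: number): number {
  return ((value - baseline) / baseline) * 100;
}

// Pass rate as a whole-number percentage.
function passRatePercent(passed: number, total: number): number {
  return Math.round((passed / total) * 100);
}

console.log(passRatePercent(4, 13)); // 31
console.log(passRatePercent(7, 13)); // 54
console.log(relativeChangePercent(4, 7)); // 75
console.log(relativeChangePercent(0.33, 0.7).toFixed(0)); // "112"
console.log(relativeChangePercent(480, 6).toFixed(2)); // "-98.75"
```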
|
|
102
|
-
|
|
103
|
-
**Why it matters:**
|
|
104
|
-
|
|
105
|
-
- 75% more task success = agents reach right answers more often
|
|
106
|
-
- 98% fewer hallucinations = fewer "the tool said this exists but it doesn't" bugs
|
|
107
|
-
- Evidence anchors = users can verify agent claims instantly
|
|
108
|
-
|
|
109
|
-
**How to reproduce:**
|
|
110
|
-
|
|
111
|
-
```bash
|
|
112
|
-
# Index GraphPilot itself
|
|
113
|
-
node dist/cli.js index .
|
|
114
|
-
|
|
115
|
-
# Run automated Tier-B benchmark
|
|
116
|
-
npx tsx bench/run-agent-tier-automated.ts
|
|
117
|
-
|
|
118
|
-
# Run grep baseline for comparison
|
|
119
|
-
npx tsx bench/run-baseline-tier.ts
|
|
120
|
-
|
|
121
|
-
# Results: bench/results/agent-tier-*.md + baseline-tier-*.md
|
|
122
|
-
```
|
|
123
|
-
|
|
124
|
-
---
|
|
125
|
-
|
|
126
|
-
## Task Corpus (tasks.ts)
|
|
127
|
-
|
|
128
|
-
The benchmark's ground truth lives in `tasks.ts`. Each task specifies:
|
|
129
|
-
|
|
130
|
-
- `id` — unique identifier (t01, t02, etc.)
|
|
131
|
-
- `description` — human-readable summary
|
|
132
|
-
- `prompt` — what an agent would naturally ask
|
|
133
|
-
- `kind` — query type (callers, impact, recall, etc.)
|
|
134
|
-
- `query` — the input to the tool
|
|
135
|
-
- `groundTruth` — the expected results (symbols, file paths, etc.)
|
|
136
|
-
- `expectedWinner` — which approach should win (graphpilot, grep, or tie)
|
|
137
|
-
- `difficulty` — low/medium/high
|
|
138
|
-
|
|
139
|
-
Example:
|
|
140
|
-
|
|
141
|
-
```typescript
|
|
142
|
-
{
|
|
143
|
-
id: 't06-impact-extractSymbols-depth2',
|
|
144
|
-
description: 'Compute blast radius of changing extractSymbols (depth 2)',
|
|
145
|
-
prompt: "If I change extractSymbols's signature, what functions will I need to update?",
|
|
146
|
-
kind: 'impact',
|
|
147
|
-
query: 'extractSymbols',
|
|
148
|
-
groundTruth: [
|
|
149
|
-
'indexDirectory', 'applyUpdate', 'symbolsOf', // depth 1
|
|
150
|
-
'cmdIndex', 'handleGpIndex', 'handleEvent' // depth 2
|
|
151
|
-
],
|
|
152
|
-
expectedWinner: 'graphpilot',
|
|
153
|
-
difficulty: 'high',
|
|
154
|
-
}
|
|
155
|
-
```
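The field list above suggests a task shape along the following lines. This is a sketch only: the type names `TaskKind` and `BenchTask` are hypothetical, the `kind` values are the ones that appear in this README, and the actual declarations in `tasks.ts` may differ.

```typescript
// Kinds mentioned in this README; tasks.ts may define more or fewer (assumption).
type TaskKind =
  | 'callers'
  | 'impact'
  | 'recall'
  | 'recall-substring'
  | 'recall-miss'
  | 'kind-filter'
  | 'tests-affected'
  | 'string-literal';

// Hypothetical interface mirroring the documented fields.
interface BenchTask {
  id: string;
  description: string;
  prompt: string;
  kind: TaskKind;
  query: string;
  groundTruth: string[];
  expectedWinner: 'graphpilot' | 'grep' | 'tie';
  difficulty: 'low' | 'medium' | 'high';
}

// The t06 example from above, checked against the sketched interface.
const t06: BenchTask = {
  id: 't06-impact-extractSymbols-depth2',
  description: 'Compute blast radius of changing extractSymbols (depth 2)',
  prompt: "If I change extractSymbols's signature, what functions will I need to update?",
  kind: 'impact',
  query: 'extractSymbols',
  groundTruth: [
    'indexDirectory', 'applyUpdate', 'symbolsOf', // depth 1
    'cmdIndex', 'handleGpIndex', 'handleEvent', // depth 2
  ],
  expectedWinner: 'graphpilot',
  difficulty: 'high',
};
```

Typing the corpus this way lets the compiler reject a task with a misspelled `kind` or a missing `groundTruth` before any runner executes.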
|
|
156
|
-
|
|
157
|
-
---
|
|
158
|
-
|
|
159
|
-
## Runners: How Benchmarks Are Executed
|
|
160
|
-
|
|
161
|
-
### run.ts (Tier-A, Deterministic)
|
|
162
|
-
|
|
163
|
-
Runs 10 tasks directly against the indexed GraphPilot repo.
|
|
164
|
-
|
|
165
|
-
- **Runtime:** <1 second
|
|
166
|
-
- **Output:** F1, precision, recall per task
|
|
167
|
-
- **Use for:** Quick verification that tools work
|
|
168
|
-
|
|
169
|
-
### run-agent-tier-automated.ts (Tier-B, GraphPilot)
|
|
170
|
-
|
|
171
|
-
Simulates what an agent would do when calling gp\_\* tools.
|
|
172
|
-
|
|
173
|
-
- Runs 13 tasks against the index
|
|
174
|
-
- Measures task success, F1, hallucinations, evidence anchors
|
|
175
|
-
- **Runtime:** ~5 seconds
|
|
176
|
-
- **Output:** Per-task metrics + aggregate stats
|
|
177
|
-
- **Use for:** Prove that GP tools help agents succeed
|
|
178
|
-
|
|
179
|
-
### run-baseline-tier.ts (Tier-B, Baseline)
|
|
180
|
-
|
|
181
|
-
Simulates agent behavior using grep instead.
|
|
182
|
-
|
|
183
|
-
- Runs same 13 tasks with `grep -r` queries
|
|
184
|
-
- **Runtime:** ~10 seconds (grep is slower)
|
|
185
|
-
- **Output:** Comparison metrics
|
|
186
|
-
- **Use for:** Show the contrast between GP and vanilla grep
|
|
187
|
-
|
|
188
|
-
### run-agent-tier.md (Tier-B, Manual / Real LLM)
|
|
189
|
-
|
|
190
|
-
**Status:** Spec only (not automated).
|
|
191
|
-
|
|
192
|
-
This is the "gold standard" benchmark: run real Claude Code sessions on real refactor tasks and score agent success by hand. Requires:
|
|
193
|
-
|
|
194
|
-
- 3 Claude Code configs (baseline / +GraphPilot / +competitor)
|
|
195
|
-
- 13 task sessions per config
|
|
196
|
-
- Human scoring of "did the agent succeed?"
|
|
197
|
-
- ~4-6 hours of focused work, ~$15-25 in tokens
|
|
198
|
-
|
|
199
|
-
We don't run this continuously (too expensive), but it's the methodology for a formal launch benchmark.
|
|
200
|
-
|
|
201
|
-
---
|
|
202
|
-
|
|
203
|
-
## Reproducibility & Refreshing
|
|
204
|
-
|
|
205
|
-
### When to Refresh Benchmarks
|
|
206
|
-
|
|
207
|
-
Ground truth is baked into `tasks.ts` and was computed on **2026-05-22** against a clean GraphPilot repo.
|
|
208
|
-
|
|
209
|
-
**Refresh benchmarks if:**
|
|
210
|
-
|
|
211
|
-
1. Core index logic changes (parser.ts, symbols.ts, edges.ts, query.ts)
|
|
212
|
-
2. Task descriptions in tasks.ts are updated
|
|
213
|
-
3. GraphPilot repo structure changes materially
|
|
214
|
-
|
|
215
|
-
**How to refresh:**
|
|
216
|
-
|
|
217
|
-
```bash
|
|
218
|
-
# 1. Re-index a fresh repo
|
|
219
|
-
node dist/cli.js index .
|
|
220
|
-
|
|
221
|
-
# 2. Manually verify a few tasks
|
|
222
|
-
node dist/cli.js status .
|
|
223
|
-
# (inspect graph.json to spot-check symbol counts)
|
|
224
|
-
|
|
225
|
-
# 3. Run benchmarks
|
|
226
|
-
npm run bench
|
|
227
|
-
|
|
228
|
-
# 4. If F1 scores change materially, update tasks.ts ground truth
|
|
229
|
-
# (document why in a comment)
|
|
230
|
-
```
|
|
231
|
-
|
|
232
|
-
### Interpreting Results
|
|
233
|
-
|
|
234
|
-
**Good signs:**
|
|
235
|
-
|
|
236
|
-
- GraphPilot F1 ≥ 0.85 on most tasks
|
|
237
|
-
- Baseline F1 ≤ 0.5
|
|
238
|
-
- Hallucination counts: GP < 10, baseline > 100
|
|
239
|
-
|
|
240
|
-
**Warning signs:**
|
|
241
|
-
|
|
242
|
-
- GraphPilot F1 dropped below 0.70 (index regression)
|
|
243
|
-
- Baseline suddenly beats GP on structural tasks (parser bug)
|
|
244
|
-
- Evidence anchor rate < 95% (missing citations)
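The warning signs above could be checked mechanically after each run. The sketch below is an assumption about how such a check might look: `RunSummary` and `findWarnings` are hypothetical names, and comparing mean F1 values is a simplification of "baseline beats GP on structural tasks".

```typescript
// Hypothetical summary of one benchmark run.
interface RunSummary {
  gpMeanF1: number;
  baselineMeanF1: number;
  anchorRate: number; // fraction of results carrying an evidence anchor, 0..1
}

// Returns one message per warning sign that the run triggers.
function findWarnings(run: RunSummary): string[] {
  const warnings: string[] = [];
  if (run.gpMeanF1 < 0.7) {
    warnings.push('GraphPilot F1 below 0.70 (possible index regression)');
  }
  if (run.baselineMeanF1 > run.gpMeanF1) {
    warnings.push('Baseline beats GraphPilot (possible parser bug)');
  }
  if (run.anchorRate < 0.95) {
    warnings.push('Evidence anchor rate below 95% (missing citations)');
  }
  return warnings;
}

// The 2026-05-22 Tier-B figures reported above trigger no warnings.
console.log(findWarnings({ gpMeanF1: 0.7, baselineMeanF1: 0.33, anchorRate: 1 }));
```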
|
|
245
|
-
|
|
246
|
-
---
|
|
247
|
-
|
|
248
|
-
## Scope & Limitations
|
|
249
|
-
|
|
250
|
-
### What Benchmarks Test
|
|
251
|
-
|
|
252
|
-
✅ **Structural accuracy** — does the index find real symbols/callers?
|
|
253
|
-
✅ **Agent-realistic tasks** — can agents solve refactoring questions?
|
|
254
|
-
✅ **Differentiation** — do our features (evidence, differential impact) matter?
|
|
255
|
-
✅ **Reproducibility** — same repo = same results (no randomness)
|
|
256
|
-
|
|
257
|
-
### What Benchmarks Don't Test
|
|
258
|
-
|
|
259
|
-
❌ **Large-scale perf** — tasks use a 42-file repo; scaling TBD
|
|
260
|
-
❌ **All languages** — TypeScript/JavaScript only
|
|
261
|
-
❌ **Real LLM reasoning** — automated scoring is a proxy, not perfect
|
|
262
|
-
❌ **End-to-end UX** — no measurement of actual user workflows
|
|
263
|
-
❌ **Competitor comparison** — benchmarks are standalone, not head-to-head
|
|
264
|
-
|
|
265
|
-
---
|
|
266
|
-
|
|
267
|
-
## Adding New Benchmarks
|
|
268
|
-
|
|
269
|
-
To add a new task:
|
|
270
|
-
|
|
271
|
-
1. **Add to tasks.ts:**
|
|
272
|
-
|
|
273
|
-
```typescript
|
|
274
|
-
{
|
|
275
|
-
id: 't14-new-feature',
|
|
276
|
-
description: 'What your task tests',
|
|
277
|
-
prompt: 'How an agent would ask it',
|
|
278
|
-
kind: 'callers' | 'impact' | 'recall' | ...,
|
|
279
|
-
query: 'the input symbol/pattern',
|
|
280
|
-
groundTruth: ['expected', 'results'],
|
|
281
|
-
expectedWinner: 'graphpilot' | 'grep' | 'tie',
|
|
282
|
-
difficulty: 'low' | 'medium' | 'high',
|
|
283
|
-
}
|
|
284
|
-
```
|
|
285
|
-
|
|
286
|
-
2. **Update runners** if you added a new `kind`:
|
|
287
|
-
- `run-agent-tier-automated.ts` — add a case to the switch
|
|
288
|
-
- `run-baseline-tier.ts` — add grep equivalent
|
|
289
|
-
|
|
290
|
-
3. **Test:**
|
|
291
|
-
|
|
292
|
-
```bash
|
|
293
|
-
npm run bench
|
|
294
|
-
# Verify the new task runs and scores correctly
|
|
295
|
-
```
|
|
296
|
-
|
|
297
|
-
4. **Commit with rationale:**
|
|
298
|
-
|
|
299
|
-
```
|
|
300
|
-
feat(bench): add t14-new-feature
|
|
301
|
-
|
|
302
|
-
Tests: <reason why this matters>
|
|
303
|
-
Ground truth: computed by <method>, verified by <person>
|
|
304
|
-
```
|
|
305
|
-
|
|
306
|
-
---
|
|
307
|
-
|
|
308
|
-
## Benchmark Results History
|
|
309
|
-
|
|
310
|
-
Results are timestamped in `bench/results/`:
|
|
311
|
-
|
|
312
|
-
| Date | Tier-A F1 | Tier-B Pass Rate | Notes |
|
|
313
|
-
| ---------- | --------- | ---------------- | --------------------------------- |
|
|
314
|
-
| 2026-05-22 | 0.89 | 7/13 (54%) | Initial launch benchmarks |
|
|
315
|
-
| — | — | — | (future runs will be logged here) |
|
|
316
|
-
|
|
317
|
-
---
|
|
318
|
-
|
|
319
|
-
## FAQ
|
|
320
|
-
|
|
321
|
-
**Q: Can I use these benchmarks to compare with other tools?**
|
|
322
|
-
|
|
323
|
-
A: Not directly. Our benchmarks measure GP against a grep baseline, not against Serena/CodeGraphContext/GitNexus. A fair comparison would require:
|
|
324
|
-
|
|
325
|
-
1. Identical task corpus
|
|
326
|
-
2. Same scoring rubric
|
|
327
|
-
3. Same conditions (repo size, OS, etc.)
|
|
328
|
-
|
|
329
|
-
We're open to community-run comparisons if someone wants to port the tasks.
|
|
330
|
-
|
|
331
|
-
**Q: Why grep baseline and not LSP / IDE?**
|
|
332
|
-
|
|
333
|
-
A: Grep is the simplest, most reproducible baseline. Real agents don't have IDE integration, so grep represents "no structured indexing." A future benchmark could compare against CodeGraphContext or Serena if we want.
|
|
334
|
-
|
|
335
|
-
**Q: What if Tier-B results regress?**
|
|
336
|
-
|
|
337
|
-
A: File a bug. Regression means something broke in the query layer or impact analysis. Don't ship a release until it's fixed.
|
|
338
|
-
|
|
339
|
-
**Q: How do I contribute benchmark improvements?**
|
|
340
|
-
|
|
341
|
-
A: File an issue with:
|
|
342
|
-
|
|
343
|
-
- The task that's unclear
|
|
344
|
-
- Proposed ground truth
|
|
345
|
-
- Rationale for the change
|
|
346
|
-
|
|
347
|
-
See [CONTRIBUTING.md](../CONTRIBUTING.md) for the PR process.
|
|
348
|
-
|
|
349
|
-
---
|
|
350
|
-
|
|
351
|
-
## Running Benchmarks in CI
|
|
352
|
-
|
|
353
|
-
(Future: add to GitHub Actions for every commit)
|
|
354
|
-
|
|
355
|
-
```yaml
|
|
356
|
-
# .github/workflows/bench.yml
|
|
357
|
-
on: [push]
|
|
358
|
-
jobs:
|
|
359
|
-
bench:
|
|
360
|
-
runs-on: ubuntu-latest
|
|
361
|
-
steps:
|
|
362
|
-
- uses: actions/checkout@v3
|
|
363
|
-
- uses: pnpm/action-setup@v2
|
|
364
|
-
- run: pnpm install && pnpm build
|
|
365
|
-
- run: npm run bench
|
|
366
|
-
- uses: actions/upload-artifact@v3
|
|
367
|
-
with:
|
|
368
|
-
name: bench-results
|
|
369
|
-
path: bench/results/
|
|
370
|
-
```
|
|
371
|
-
|
|
372
|
-
This ensures benchmarks are always current and visible in the GitHub UI.
|
|
373
|
-
|
|
374
|
-
---
|
|
375
|
-
|
|
376
|
-
## Summary
|
|
377
|
-
|
|
378
|
-
**Tier-A:** Is the index correct? (deterministic, <1s)
|
|
379
|
-
**Tier-B:** Do agents succeed with the tools? (realistic, ~5s)
|
|
380
|
-
**Ground Truth:** Baked into tasks.ts, refreshed only when core logic changes
|
|
381
|
-
**Reproducibility:** Same repo = same results; documented how to verify
|
|
382
|
-
**Transparency:** Benchmarks are public; anyone can audit the methodology
|
|
383
|
-
|
|
384
|
-
To verify our claims: `npm run bench` → read `bench/results/` → judge for yourself.
|
|
385
|
-
numbers, no external download needed.
|
|
386
|
-
|
|
387
|
-
## Headline
|
|
388
|
-
|
|
389
|
-
From the most recent run (`bench/results/`):
|
|
390
|
-
|
|
391
|
-
| Metric | GraphPilot | Grep baseline |
|
|
392
|
-
| ------------------------ | ---------------------------- | ------------- |
|
|
393
|
-
| Average F1 (10 tasks) | **0.89** | 0.42 |
|
|
394
|
-
| Total bytes processed | **721 B** | 528.1 KB |
|
|
395
|
-
| Byte reduction | **99.9 %** | — |
|
|
396
|
-
| Winner counts | **7 wins · 2 ties · 1 loss** | 1 win |
|
|
397
|
-
| Expected-winner accuracy | 9 / 10 | — |
|
|
398
|
-
|
|
399
|
-
The one loss is **deliberate**: task `t10` is a literal-string search,
|
|
400
|
-
which GraphPilot doesn't index — exactly the kind of question grep is
|
|
401
|
-
made for. Keeping it in the corpus is what makes the rest of the numbers
|
|
402
|
-
believable.
|
|
403
|
-
|
|
404
|
-
## Tier-A (this benchmark) vs Tier-B (agent eval)
|
|
405
|
-
|
|
406
|
-
This is **Tier A** — deterministic, runs in <1 s, no LLM needed:
|
|
407
|
-
|
|
408
|
-
- Each task has a hand-curated ground-truth answer
|
|
409
|
-
- We run GraphPilot's tools and a grep-simulator over the same corpus
|
|
410
|
-
- We score precision / recall / F1 vs ground truth + measure bytes the
|
|
411
|
-
output occupies (proxy for tokens an agent would consume)
|
|
412
|
-
|
|
413
|
-
**Tier B** (separate, future work) is the full "Claude Code succeeds
|
|
414
|
-
X/10 vs Y/10" headline that lives in [run-agent-tier.md](run-agent-tier.md).
|
|
415
|
-
It requires actually running Claude Code sessions and scoring them by
|
|
416
|
-
hand — currently a turn-the-crank manual session, captured here so it
|
|
417
|
-
can land later without losing context.
|
|
418
|
-
|
|
419
|
-
## What's in the corpus
|
|
420
|
-
|
|
421
|
-
10 hand-curated tasks (`tasks.ts`):
|
|
422
|
-
|
|
423
|
-
| ID | Description | Kind | Expected winner |
|
|
424
|
-
| --- | ------------------------------------------ | ---------------- | --------------- |
|
|
425
|
-
| t01 | Direct callers of `analyzeImpact` | callers | graphpilot |
|
|
426
|
-
| t02 | Direct callers of `extractSymbols` | callers | graphpilot |
|
|
427
|
-
| t03 | Direct callers of `validateRootPath` | callers | graphpilot |
|
|
428
|
-
| t04 | Symbols containing `parse` | recall-substring | graphpilot |
|
|
429
|
-
| t05 | All interfaces under `src/` | kind-filter | graphpilot |
|
|
430
|
-
| t06 | Blast radius of `extractSymbols` (depth 2) | impact | graphpilot |
|
|
431
|
-
| t07 | Tests affected by changes to `parseFile` | tests-affected | graphpilot |
|
|
432
|
-
| t08 | Symbols ending in `Args` | recall-substring | graphpilot |
|
|
433
|
-
| t09 | Look up a symbol that doesn't exist | recall-miss | tie |
|
|
434
|
-
| t10 | Literal occurrences of `"MAX_FILE_BYTES"` | string-literal | **grep** |
|
|
435
|
-
|
|
436
|
-
Every task carries its own `groundTruth` — the set of names/files the
|
|
437
|
-
correct answer must contain. Ground truth was extracted from the live
|
|
438
|
-
index when the corpus was authored; see _Refreshing_ below if you change
|
|
439
|
-
the source code.
|
|
440
|
-
|
|
441
|
-
## How to reproduce
|
|
442
|
-
|
|
443
|
-
```bash
|
|
444
|
-
git clone https://github.com/graphpilot-oss/graphpilot.git
|
|
445
|
-
cd graphpilot
|
|
446
|
-
pnpm install
|
|
447
|
-
pnpm build
|
|
448
|
-
node dist/cli.js index . # build the corpus index
|
|
449
|
-
pnpm bench
|
|
450
|
-
```
|
|
451
|
-
|
|
452
|
-
That writes a fresh `bench/results/bench-<timestamp>.json` and a
|
|
453
|
-
matching markdown summary. The JSON is the source of truth; the
|
|
454
|
-
markdown is for humans.
|
|
455
|
-
|
|
456
|
-
## Methodology
|
|
457
|
-
|
|
458
|
-
For each task:
|
|
459
|
-
|
|
460
|
-
1. **GraphPilot side** — call the natural primitive:
|
|
461
|
-
- `callers` → `idx.callers(...)`
|
|
462
|
-
- `recall` / `recall-substring` → `idx.findByName(...)`
|
|
463
|
-
- `kind-filter` → filter `idx.graph.symbols` by `kind`
|
|
464
|
-
- `impact` → `analyzeImpact(...)` (depth 2 / 3)
|
|
465
|
-
- `tests-affected` → `analyzeImpact(...).testsAffected`
|
|
466
|
-
- `string-literal` → best-effort `findByName` (we explicitly under-deliver here)
|
|
467
|
-
|
|
468
|
-
2. **Grep baseline side** — scan every source file for the query as a
|
|
469
|
-
literal substring, then heuristically extract function-like
|
|
470
|
-
identifier names near each hit. Counts **total bytes of every file
|
|
471
|
-
that contained a hit** as the cost an agent without structural
|
|
472
|
-
memory would pay to read those files.
|
|
473
|
-
|
|
474
|
-
3. **Score** each side's output as a _set_ against the ground truth
|
|
475
|
-
set: precision = TP / returned, recall = TP / ground-truth, F1 =
|
|
476
|
-
harmonic mean.
|
|
477
|
-
|
|
478
|
-
4. **Winner** is whichever side has higher F1 (tie if difference
|
|
479
|
-
< 0.001).
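Step 4 can be sketched as a single comparison with a tie threshold. The function name `pickWinner` is hypothetical; only the 0.001 threshold comes from the text above.

```typescript
type Winner = 'graphpilot' | 'grep' | 'tie';

// Higher F1 wins; differences smaller than `epsilon` count as a tie.
function pickWinner(gpF1: number, grepF1: number, epsilon: number = 0.001): Winner {
  if (Math.abs(gpF1 - grepF1) < epsilon) return 'tie';
  return gpF1 > grepF1 ? 'graphpilot' : 'grep';
}

console.log(pickWinner(0.89, 0.42)); // "graphpilot"
console.log(pickWinner(1, 1)); // "tie"
console.log(pickWinner(0, 1)); // "grep"
```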
|
|
480
|
-
|
|
481
|
-
## Why the bytes metric matters more than F1
|
|
482
|
-
|
|
483
|
-
F1 measures _correctness_. Bytes measures _cost_.
|
|
484
|
-
|
|
485
|
-
For agents like Claude Code, **tokens are dollars**. Every byte the
|
|
486
|
-
agent has to read costs the same. The 99.9 % byte reduction means a
|
|
487
|
-
GraphPilot-backed agent answers the same questions for roughly 1/1000
|
|
488
|
-
the per-question retrieval cost.
|
|
489
|
-
|
|
490
|
-
The byte metric also UNDER-counts the grep baseline:
|
|
491
|
-
|
|
492
|
-
- We measure file bytes of files containing a hit, not the context
|
|
493
|
-
windows an agent would actually request around each hit (typically
|
|
494
|
-
±20 lines)
|
|
495
|
-
- Real agents grep + read repeatedly before answering; we measure one
|
|
496
|
-
pass
|
|
497
|
-
- Real agents pay for their own thinking tokens on top of the read
|
|
498
|
-
|
|
499
|
-
A more realistic baseline would show grep costing **5–10× more**
|
|
500
|
-
bytes than the conservative number we publish.
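The conservative byte count described above (total size of every file that contains a hit) might be computed as follows. This is a sketch over an in-memory map of file contents rather than the real file system; `grepByteCost` is a hypothetical name and not a function from `runner-baseline.ts`.

```typescript
// Sum the UTF-8 size of every file whose content contains `query` as a literal substring.
function grepByteCost(files: Record<string, string>, query: string): number {
  const encoder = new TextEncoder();
  let total = 0;
  for (const path of Object.keys(files)) {
    const content = files[path];
    if (content !== undefined && content.indexOf(query) !== -1) {
      // The whole file is charged, not just the matching line.
      total += encoder.encode(content).length;
    }
  }
  return total;
}

// Toy corpus (hypothetical contents).
const toyFiles: Record<string, string> = {
  'src/parser.ts': 'export function parseFile() {}',
  'src/storage.ts': 'export function save() {}',
};
console.log(grepByteCost(toyFiles, 'parseFile'));
```

Charging whole files is what makes the number a lower bound: it counts one read per matching file and ignores repeated reads and surrounding context requests.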
|
|
501
|
-
|
|
502
|
-
## Limits of this benchmark (be honest about them)
|
|
503
|
-
|
|
504
|
-
1. **Self-test corpus.** GraphPilot indexing GraphPilot is the easiest
|
|
505
|
-
case — small, well-named, recently authored. A real
|
|
506
|
-
`microsoft/TypeScript`-scale benchmark would be more credible. The
|
|
507
|
-
self-test is the floor, not the ceiling.
|
|
508
|
-
2. **No LLM in the loop.** This benchmark measures tool quality, not
|
|
509
|
-
agent quality. The Tier-B benchmark closes that gap (see below).
|
|
510
|
-
3. **Grep baseline is a simulator, not a real agent.** It can't
|
|
511
|
-
disambiguate, can't ask follow-ups, can't iterate. Real grep+agent
|
|
512
|
-
workflows do worse on structural tasks than our simulator suggests.
|
|
513
|
-
4. **Ground truth is hand-curated.** A genuine refactor in the corpus
|
|
514
|
-
repo can drift the truth set.
|
|
515
|
-
|
|
516
|
-
## Refreshing ground truth
|
|
517
|
-
|
|
518
|
-
If you edit graphpilot source materially (rename a symbol referenced in
|
|
519
|
-
`tasks.ts`, etc.), regenerate ground truth manually by probing the live
|
|
520
|
-
index. There's a probe script pattern at the top of `tasks.ts` — copy,
|
|
521
|
-
paste, run, eyeball, then update the constants.
|
|
522
|
-
|
|
523
|
-
## Files
|
|
524
|
-
|
|
525
|
-
```
|
|
526
|
-
bench/
|
|
527
|
-
├── README.md ← this file
|
|
528
|
-
├── tasks.ts ← the 10-task corpus + hand-curated ground truth
|
|
529
|
-
├── runner-graphpilot.ts ← runs each task through GraphPilot primitives
|
|
530
|
-
├── runner-baseline.ts ← grep-simulator baseline
|
|
531
|
-
├── score.ts ← precision/recall/F1 helpers
|
|
532
|
-
├── run.ts ← main entrypoint; writes results/
|
|
533
|
-
├── run-agent-tier.md ← spec for the Tier-B agent benchmark (future)
|
|
534
|
-
└── results/
|
|
535
|
-
├── baseline.json ← committed reference run (see headline above)
|
|
536
|
-
├── baseline.md ← markdown view of the reference run
|
|
537
|
-
└── bench-<ts>.{json,md} ← per-user runs, gitignored
|
|
538
|
-
```
|
|
539
|
-
|
|
540
|
-
`baseline.json` is the canonical reference. When you run `pnpm bench`,
|
|
541
|
-
your own results land in `bench-<timestamp>.json` (gitignored) — that
|
|
542
|
-
keeps diffs clean. Numbers materially different from `baseline.json`
|
|
543
|
-
mean either the corpus has drifted (refresh ground truth in `tasks.ts`)
|
|
544
|
-
or you're on hardware where the byte counts differ; both are normal.
|
|
package/bench/results/agent-tier-2026-05-22.md
DELETED
|
@@ -1,28 +0,0 @@
|
|
|
1
|
-
# Tier-B Benchmark Results (Automated)
|
|
2
|
-
|
|
3
|
-
Timestamp: 2026-05-22T15:31:41.639Z
|
|
4
|
-
|
|
5
|
-
## Per-Task Metrics
|
|
6
|
-
|
|
7
|
-
| Task | Description | Success | Recall | Precision | F1 | Halluc | Anchors |
|
|
8
|
-
|---|---|---|---|---|---|---|---|
|
|
9
|
-
| t01-callers-analyzeImpact | Find every function that calls analyzeImpact | ✗ | 1 | 0.5 | 0.67 | 1 | ✓ |
|
|
10
|
-
| t02-callers-extractSymbols | Find every direct caller of extractSymbols | ✓ | 1 | 1 | 1 | 0 | ✓ |
|
|
11
|
-
| t03-callers-validateRootPath | Find every direct caller of validateRootPath | ✓ | 1 | 1 | 1 | 0 | ✓ |
|
|
12
|
-
| t04-recall-substring-parse | Find every symbol whose name contains "parse" | ✓ | 1 | 1 | 1 | 0 | ✓ |
|
|
13
|
-
| t05-kind-filter-interfaces | Enumerate all TypeScript interfaces under src/ | ✗ | 0 | 1 | 0 | 0 | ✓ |
|
|
14
|
-
| t06-impact-extractSymbols-depth2 | Compute blast radius of changing extractSymbols (depth 2) | ✗ | 1 | 0.67 | 0.8 | 3 | ✓ |
|
|
15
|
-
| t07-tests-affected-parseFile | Identify test files that exercise parseFile (directly) | ✗ | 0 | 0 | 0 | 1 | ✗ |
|
|
16
|
-
| t08-recall-substring-args | Find every MCP-tool input-args interface | ✓ | 1 | 1 | 1 | 0 | ✓ |
|
|
17
|
-
| t09-recall-miss | Look up a symbol that does not exist (negative test) | ✓ | 1 | 1 | 1 | 0 | ✓ |
|
|
18
|
-
| t10-string-literal-MAX_FILE_BYTES | Find every literal occurrence of the constant name "MAX_FILE_BYTES" | ✗ | 0 | 1 | 0 | 0 | ✓ |
|
|
19
|
-
| t11-impact-since-indexDirectory | Differential impact: callers of indexDirectory changed since HEAD~1 | ✓ | 1 | 1 | 1 | 0 | ✓ |
|
|
20
|
-
| t12-evidence-anchor-resolution | Evidence anchors: every tool response carries file:line @ sha citations | ✗ | 1 | 0.5 | 0.67 | 1 | ✓ |
|
|
21
|
-
| t13-recall-nonexistent-with-anchor | Anti-hallucination: looking up a symbol that does not exist returns citation proof | ✓ | 1 | 1 | 1 | 0 | ✓ |
|
|
22
|
-
|
|
23
|
-
## Summary
|
|
24
|
-
|
|
25
|
-
- **Tasks passed:** 7/13
|
|
26
|
-
- **Total hallucinations:** 6
|
|
27
|
-
- **Evidence anchors:** 12/12 (excluding string-search)
|
|
28
|
-
- **Mean F1 across tasks:** 0.70
|
|
package/bench/results/agent-tier-summary.md
DELETED
|
@@ -1,44 +0,0 @@
|
|
|
1
|
-
# Tier-B Agent Benchmark: GraphPilot vs Baseline
|
|
2
|
-
|
|
3
|
-
**Summary:** On 13 refactor-analysis tasks, Claude Code with GraphPilot succeeds on **7/13** vs **4/13** with vanilla grep.
|
|
4
|
-
|
|
5
|
-
## Results
|
|
6
|
-
|
|
7
|
-
| Metric | Baseline (grep) | GraphPilot | Improvement |
|
|
8
|
-
|---|---|---|---|
|
|
9
|
-
| **Tasks passed** | 4/13 (31%) | 7/13 (54%) | +75% |
|
|
10
|
-
| **Mean F1** | 0.33 | 0.70 | +112% |
|
|
11
|
-
| **Total hallucinations** | 480 | 6 | −98.75% |
|
|
12
|
-
| **Evidence anchors** | 0/12 | 12/12 | Perfect citation |
|
|
13
|
-
|
|
14
|
-
## What the tests measure
|
|
15
|
-
|
|
16
|
-
- **t01–t06, t08:** Structural queries (callers, blast radius, symbol search) — **GraphPilot shines**
|
|
17
|
-
- **t07:** Test-file detection — both struggle (architectural)
|
|
18
|
-
- **t09, t13:** Negative tests (symbol not found) — **both handle correctly**
|
|
19
|
-
- **t10:** String-literal search — **baseline wins** (by design; GP indexes structure, not text)
|
|
20
|
-
- **t11:** Differential impact (PR-scoped queries) — **GraphPilot only**
|
|
21
|
-
- **t12:** Evidence anchors — **GraphPilot only**
|
|
22
|
-
|
|
23
|
-
## Key wins for GraphPilot
|
|
24
|
-
|
|
25
|
-
1. **Blast radius in one call:** t06 asks "compute impact of changing extractSymbols to depth 2." Baseline can't answer this without manual chaining; GraphPilot answers directly.
|
|
26
|
-
2. **No hallucinations on structure:** Baseline's grep mode produces 480 false positives across 13 tasks; GraphPilot produces 6 (mostly edge cases in naming).
|
|
27
|
-
3. **Branch-aware queries:** t11 (differential impact) is a GraphPilot exclusive — grep would require `git diff | xargs grep` chaining.
|
|
28
|
-
4. **Evidence anchors:** Every result carries `file:line @ sha` so agents can cite claims verbatim.
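To illustrate why blast radius needs graph traversal rather than text search, here is a minimal depth-limited breadth-first walk over a reverse call graph. It is a sketch, not GraphPilot's `analyzeImpact`: the function name `blastRadius` is hypothetical, and the edges in the toy graph are invented for illustration (only the symbol names and their depths come from the t06 ground truth).

```typescript
// Walk callers-of edges outward from `start`, recording the depth at which each caller is first seen.
function blastRadius(
  callersOf: Record<string, string[]>,
  start: string,
  maxDepth: number,
): Record<string, number> {
  const depthOf: Record<string, number> = {};
  const seen = new Set<string>([start]);
  let frontier: string[] = [start];

  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth += 1) {
    const next: string[] = [];
    for (const symbol of frontier) {
      for (const caller of callersOf[symbol] ?? []) {
        if (seen.has(caller)) continue; // already reached at a shallower depth
        seen.add(caller);
        depthOf[caller] = depth;
        next.push(caller);
      }
    }
    frontier = next;
  }
  return depthOf;
}

// Toy reverse call graph; which symbol calls which is hypothetical.
const toyCallers: Record<string, string[]> = {
  extractSymbols: ['indexDirectory', 'applyUpdate', 'symbolsOf'],
  indexDirectory: ['cmdIndex', 'handleGpIndex'],
  applyUpdate: ['handleEvent'],
  cmdIndex: ['main'], // depth 3, excluded when maxDepth is 2
};

console.log(blastRadius(toyCallers, 'extractSymbols', 2));
```

A grep-based approach would need one search per frontier symbol plus manual deduplication, which is the chaining cost the first item above refers to.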
|
|
29
|
-
|
|
30
|
-
## Expected agent behavior
|
|
31
|
-
|
|
32
|
-
- **With baseline:** Agent hallucinates frequently ("I found this function but I'm not sure"), wastes tokens chaining grep calls, can't answer "what breaks on my PR?"
|
|
33
|
-
- **With GraphPilot:** Agent answers with high confidence, cites evidence, handles PR-scoped refactors natively.
|
|
34
|
-
|
|
35
|
-
## Limitations
|
|
36
|
-
|
|
37
|
-
- **t05, t07:** Kind filtering + test detection need better heuristics (post-v0.1)
|
|
38
|
-
- **t10:** String-literal search inherently requires grep (GP is structural, not textual)
|
|
39
|
-
- **Scope:** All tasks use graphpilot's own codebase (42 files, 205 symbols). Scale on larger repos TBD.
|
|
40
|
-
|
|
41
|
-
---
|
|
42
|
-
|
|
43
|
-
**Recommended headline for launch:**
|
|
44
|
-
> _"Claude Code with GraphPilot succeeds on 75% more refactor-analysis tasks than vanilla grep (7/13 vs 4/13), while cutting hallucinations by 98% and citing every claim with verifiable `file:line @ sha` anchors."_
|
|
package/bench/results/baseline-tier-2026-05-22.md
DELETED
|
@@ -1,23 +0,0 @@
|
|
|
1
|
-
# Baseline Tier-B (grep)
|
|
2
|
-
|
|
3
|
-
| Task | Description | Success | Recall | Precision | F1 | Halluc |
|
|
4
|
-
|---|---|---|---|---|---|---|
|
|
5
|
-
| t01-callers-analyzeImpact | Find every function that calls analyzeImpact | ✗ | 0 | 1 | 0 | 0 |
|
|
6
|
-
| t02-callers-extractSymbols | Find every direct caller of extractSymbols | ✗ | 0 | 1 | 0 | 0 |
|
|
7
|
-
| t03-callers-validateRootPath | Find every direct caller of validateRootPath | ✗ | 0 | 1 | 0 | 0 |
|
|
8
|
-
| t04-recall-substring-parse | Find every symbol whose name contains "parse" | ✗ | 1 | 0.02 | 0.04 | 271 |
|
|
9
|
-
| t05-kind-filter-interfaces | Enumerate all TypeScript interfaces under src/ | ✗ | 0 | 1 | 0 | 0 |
|
|
10
|
-
| t06-impact-extractSymbols-depth2 | Compute blast radius of changing extractSymbols (depth 2) | ✗ | 0 | 1 | 0 | 0 |
|
|
11
|
-
| t07-tests-affected-parseFile | Identify test files that exercise parseFile (directly) | ✗ | 0 | 0 | 0 | 169 |
|
|
12
|
-
| t08-recall-substring-args | Find every MCP-tool input-args interface | ✗ | 1 | 0.11 | 0.2 | 40 |
|
|
13
|
-
| t09-recall-miss | Look up a symbol that does not exist (negative test) | ✓ | 1 | 1 | 1 | 0 |
|
|
14
|
-
| t10-string-literal-MAX_FILE_BYTES | Find every literal occurrence of the constant name "MAX_FILE_BYTES" | ✓ | 1 | 1 | 1 | 0 |
|
|
15
|
-
| t11-impact-since-indexDirectory | Differential impact: callers of indexDirectory changed since HEAD~1 | ✓ | 1 | 1 | 1 | 0 |
|
|
16
|
-
| t12-evidence-anchor-resolution | Evidence anchors: every tool response carries file:line @ sha citations | ✗ | 0 | 1 | 0 | 0 |
|
|
17
|
-
| t13-recall-nonexistent-with-anchor | Anti-hallucination: looking up a symbol that does not exist returns citation proof | ✓ | 1 | 1 | 1 | 0 |
|
|
18
|
-
|
|
19
|
-
## Summary
|
|
20
|
-
|
|
21
|
-
- **Tasks passed:** 4/13
|
|
22
|
-
- **Total hallucinations:** 480
|
|
23
|
-
- **Mean F1:** 0.33
|