pi-agent-extensions 0.1.0

# Handoff Eval Strategy

A practical guide for evaluating and improving the `/handoff` command's extraction quality.

## Overview

The handoff command uses an LLM to extract context from a conversation. The quality of this extraction directly impacts whether the next session can successfully continue the work. This document outlines how to:

1. Collect real handoff examples
2. Define what "good" looks like
3. Build a test dataset
4. Measure quality
5. Iterate on the extraction prompt

## Core Principles

From the llm-evals skill:

1. **Look at data constantly** - Read traces until you stop learning
2. **Binary pass/fail** - Not 1-5 scales. Pass/fail with a written critique
3. **Domain experts set the bar** - You (the user) define what's good
4. **Task-specific** - Generic benchmarks don't apply here
5. **Start small** - 30 hand-picked examples before automation

---
## Phase 1: Trace Collection

### What to Capture

For each handoff run, save:

```json
{
  "trace_id": "handoff_001",
  "timestamp": "2026-02-04T10:30:00Z",
  "session_file": "~/.pi/sessions/abc123.jsonl",
  "conversation_summary": "Implementing handoff extension for Pi",
  "conversation_length": 45,
  "user_goal": "implement file validation and improve extraction prompt",
  "extraction_input": {
    "conversation_text": "...(serialized conversation)...",
    "goal": "implement file validation..."
  },
  "extraction_output": {
    "relevantFiles": [...],
    "relevantCommands": [...],
    "relevantInformation": [...],
    "decisions": [...],
    "openQuestions": [...]
  },
  "final_prompt": "# Handoff Context\n...",
  "model_used": "anthropic/claude-sonnet-4-5",
  "latency_ms": 2150,
  "tokens": { "input": 8500, "output": 450 }
}
```
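The schema above only helps if saved traces actually conform to it, so it is worth validating a trace before it enters the dataset. A minimal sketch; `validateTrace` and `REQUIRED_KEYS` are illustrative names, not part of the extension:

```typescript
// Illustrative trace validation: report which required fields are missing
// before a trace file is added to the eval dataset.
const REQUIRED_KEYS = [
  "trace_id",
  "timestamp",
  "user_goal",
  "extraction_output",
  "final_prompt",
] as const;

function validateTrace(trace: Record<string, unknown>): string[] {
  // Return the missing keys; an empty array means the trace is usable.
  return REQUIRED_KEYS.filter((key) => !(key in trace));
}
```

Running this over `evals/handoff/traces/` before each eval run catches half-written traces early.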
### How to Collect

#### Option A: Manual Collection (Recommended to Start)

1. Use `/handoff` naturally during your work
2. Before submitting, copy the generated prompt
3. After using the new session, note whether the handoff was successful
4. Save examples to `evals/handoff/traces/`

#### Option B: Instrumented Collection

Add logging to the handoff extension:

```typescript
// In index.ts, after extraction
import { writeFileSync } from "node:fs";
import { join } from "node:path";

if (process.env.HANDOFF_TRACE_DIR) {
  const trace = {
    trace_id: `handoff_${Date.now()}`,
    timestamp: new Date().toISOString(),
    session_file: currentSessionFile,
    user_goal: goal,
    extraction_output: extractionResult.extraction,
    final_prompt: handoffPrompt,
    model_used: extractionModel.id,
  };
  const tracePath = join(
    process.env.HANDOFF_TRACE_DIR,
    `${trace.trace_id}.json`
  );
  writeFileSync(tracePath, JSON.stringify(trace, null, 2));
}
```
#### Option C: Session Replay

Pi sessions are stored as JSONL. Build a replay tool:

```bash
# Replay a session through handoff extraction
node scripts/replay-handoff.ts \
  --session ~/.pi/sessions/abc123.jsonl \
  --goal "implement feature X" \
  --output evals/handoff/traces/
```
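The core of such a replay tool is reconstructing a conversation string from the session JSONL. A sketch, assuming each line is a message with `role` and `content` fields (the real Pi session schema may differ):

```typescript
// Sketch: turn a JSONL session file's contents into a serialized
// conversation suitable for feeding to the extraction prompt.
// Assumes each line holds { role, content }; adjust for the real schema.
function serializeSession(jsonl: string): string {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as { role?: string; content?: string })
    .filter((msg) => msg.role && msg.content)
    .map((msg) => `${msg.role}: ${msg.content}`)
    .join("\n\n");
}
```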
### Collection Target

- **Initial goal**: 30 diverse examples
- **Ongoing**: Add 5-10 per week from real usage
- **Regression**: Add any failure as a test case

---
## Phase 2: Dataset Design

### Dimensions to Cover

| Dimension | Examples |
|-----------|----------|
| Session length | Short (5 turns), Medium (20 turns), Long (50+ turns) |
| Task type | Feature implementation, bug fix, refactoring, exploration |
| Complexity | Single file, multi-file, cross-cutting |
| Goal specificity | Vague, specific, highly detailed |
| Context density | Few files mentioned, many files, many commands |

### Category Distribution

Following llm-evals guidance:

| Category | % | Description |
|----------|---|-------------|
| Happy path | 60% | Clear sessions with good goals |
| Edge cases | 25% | Compacted sessions, minimal context, ambiguous goals |
| Adversarial | 10% | Goals that could mislead, sessions with noise |
| Regression | 5% | Previously failed cases that were fixed |
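A quick way to keep the dataset honest against this mix is to recompute the actual category shares as cases accumulate. A sketch; the category keys and the 10% tolerance are assumptions, not fixed by this doc:

```typescript
// Hypothetical distribution check: flag categories whose share of the
// dataset drifts more than `tolerance` from the target mix above.
const TARGET_MIX: Record<string, number> = {
  happy_path: 0.6,
  edge_case: 0.25,
  adversarial: 0.1,
  regression: 0.05,
};

function distributionGaps(categories: string[], tolerance = 0.1): string[] {
  const counts = new Map<string, number>();
  for (const c of categories) counts.set(c, (counts.get(c) ?? 0) + 1);
  return Object.entries(TARGET_MIX)
    .filter(([cat, target]) => {
      const actual = (counts.get(cat) ?? 0) / categories.length;
      return Math.abs(actual - target) > tolerance;
    })
    .map(([cat]) => cat);
}
```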
### Dataset Schema

Store in `evals/handoff/dataset.jsonl`:

```json
{
  "id": "handoff_001",
  "category": "happy_path",
  "session_type": "feature_implementation",
  "session_length": "medium",
  "goal": "implement file validation and improve extraction prompt",
  "conversation_file": "traces/handoff_001_conversation.txt",
  "expected_files": [
    "extensions/handoff/parser.ts",
    "extensions/handoff/extraction.ts"
  ],
  "expected_commands": ["npm test"],
  "expected_context": [
    "file validation filters hallucinated paths",
    "extraction prompt focuses on goal-relevant context"
  ],
  "expected_decisions": [
    "Use case-insensitive matching for file validation"
  ],
  "pass_criteria": {
    "files_coverage": 0.8,
    "context_coverage": 0.7,
    "no_hallucinated_files": true,
    "no_completed_tasks_in_context": true
  },
  "notes": "Session where we implemented v1.1 improvements"
}
```
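Loading this file is one `JSON.parse` per JSONL line. A minimal loader sketch; the `TestCase` interface deliberately types only a few fields from the schema above:

```typescript
import { readFileSync } from "node:fs";

// Minimal JSONL dataset loader. TestCase is intentionally partial:
// it types only the fields this sketch needs.
interface TestCase {
  id: string;
  category: string;
  goal: string;
  expected_files: string[];
}

function parseDataset(jsonl: string): TestCase[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as TestCase);
}

function loadDataset(path: string): TestCase[] {
  return parseDataset(readFileSync(path, "utf8"));
}
```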
---

## Phase 3: Pass/Fail Criteria

### What Makes a Good Handoff?

A handoff extraction PASSES if:

1. **File coverage** - At least 80% of expected files are included
2. **No hallucinations** - No invented file paths that weren't in the conversation
3. **Context coverage** - At least 70% of expected context facts are captured
4. **Goal-relevance** - Extracted info relates to the stated goal
5. **No history dump** - Doesn't list completed tasks ("We implemented X, Y, Z")
6. **No obvious actions** - Doesn't tell the agent to "run tests" or "build"
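The two numeric criteria can be scored mechanically once expected/got lists exist. A sketch using exact matching; real scoring may also want suffix matching, and deciding whether a context fact was "captured" usually needs a human or a judge. `meetsCoverageBar` is a hypothetical helper name:

```typescript
// Coverage = fraction of expected items present in the extraction.
function coverage(expected: string[], got: string[]): number {
  if (expected.length === 0) return 1;
  const gotSet = new Set(got);
  return expected.filter((e) => gotSet.has(e)).length / expected.length;
}

// Apply the numeric pass criteria above (0.8 for files, 0.7 for context).
function meetsCoverageBar(
  expectedFiles: string[],
  gotFiles: string[],
  expectedContext: string[],
  capturedContext: string[]
): boolean {
  return (
    coverage(expectedFiles, gotFiles) >= 0.8 &&
    coverage(expectedContext, capturedContext) >= 0.7
  );
}
```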
### Failure Modes

| Failure Mode | Description | Severity |
|--------------|-------------|----------|
| Hallucinated files | Files extracted that weren't mentioned | High |
| Missing critical file | Key file for goal not included | High |
| History dump | Lists what was done instead of what's needed | Medium |
| Obvious actions | Tells agent to run tests, build, etc. | Low |
| Missing gotcha | Doesn't capture learned convention/behavior | Medium |
| Vague context | Generic statements that don't help | Low |
| Wrong priorities | Includes irrelevant files, misses relevant ones | Medium |

---
## Phase 4: Evaluation Process

### Manual Evaluation (Start Here)

For each test case:

1. Load the conversation and goal
2. Run the extraction
3. Compare output to expected values
4. Score pass/fail on each criterion
5. Write a critique explaining the judgment

#### Scoring Template

```markdown
## Case: handoff_001

**Goal**: implement file validation and improve extraction prompt

### File Coverage
- Expected: parser.ts, extraction.ts, types.ts
- Got: parser.ts, extraction.ts, index.ts
- Score: 2/3 (67%) - FAIL (below 80%)
- Critique: Missed types.ts which had the config changes

### Hallucination Check
- Invented files: None
- Score: PASS

### Context Coverage
- Expected: "file validation filters hallucinated paths"
- Got: "Added validateFilesAgainstConversation function"
- Score: PASS (captured the concept)

### No History Dump
- Found: "We implemented the following changes..."
- Score: FAIL - Should not list completed work

### Overall: FAIL
Priority fix: Remove history dump pattern from prompt
```
### Automated Checks (Level 1)

Add to test suite:

```typescript
// tests/handoff/eval.test.ts
import { describe, it } from "node:test";
import assert from "node:assert";
import { readFileSync } from "node:fs";
import { basename } from "node:path";
import { runExtraction, extractMentionedFiles } from "./helpers.js";

// The dataset is JSONL: one test case per line
const dataset = readFileSync(
  new URL("../../evals/handoff/dataset.jsonl", import.meta.url),
  "utf8"
)
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

describe("handoff extraction quality", () => {
  for (const testCase of dataset.slice(0, 10)) { // Smoke subset
    it(`${testCase.id}: ${testCase.goal.slice(0, 50)}...`, async () => {
      const result = await runExtraction(testCase.conversation_file, testCase.goal);

      // Check no hallucinated files
      const mentionedFiles = extractMentionedFiles(testCase.conversation_file);
      for (const file of result.relevantFiles) {
        assert(
          mentionedFiles.has(file.path) || mentionedFiles.has(basename(file.path)),
          `Hallucinated file: ${file.path}`
        );
      }

      // Check file coverage
      const gotPaths = new Set(result.relevantFiles.map(f => f.path));
      const coverage = testCase.expected_files.filter(f =>
        gotPaths.has(f) || [...gotPaths].some(g => g.endsWith(basename(f)))
      ).length / testCase.expected_files.length;
      assert(coverage >= 0.8, `File coverage ${coverage} < 0.8`);
    });
  }
});
```
### LLM-as-Judge (Level 2)

After manual labeling stabilizes, build an automated judge:

```typescript
const JUDGE_PROMPT = `You are evaluating the quality of a handoff extraction.

## Conversation Summary
{conversation_summary}

## User's Goal
{goal}

## Extraction Output
{extraction_json}

## Evaluation Criteria

1. **File Relevance**: Are the extracted files actually relevant to the goal?
2. **No Hallucinations**: Were all files actually mentioned in the conversation?
3. **Context Quality**: Does relevantInformation capture what the next agent needs?
4. **No History Dump**: Does it avoid listing completed work?
5. **Captures Gotchas**: Are learned conventions/behaviors included?

## Examples

### Good Extraction (PASS)
Goal: "add dark mode support"
relevantInformation:
- "Theme colors are defined in src/theme.ts"
- "Use CSS variables for dynamic theming"
- "The app uses Tailwind, so use dark: prefix"
Critique: Captures specific, actionable context for the goal.

### Bad Extraction (FAIL)
Goal: "add dark mode support"
relevantInformation:
- "We discussed the theme system"
- "Made several changes to the codebase"
- "Tests are passing"
Critique: Vague, lists completed work, no actionable context.

## Your Task

Evaluate the extraction and output JSON:
{
  "file_relevance": { "pass": true/false, "critique": "..." },
  "no_hallucinations": { "pass": true/false, "critique": "..." },
  "context_quality": { "pass": true/false, "critique": "..." },
  "no_history_dump": { "pass": true/false, "critique": "..." },
  "captures_gotchas": { "pass": true/false, "critique": "..." },
  "overall_pass": true/false,
  "priority_fix": "What should be improved first"
}`;
```
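The judge's reply still needs defensive parsing: models sometimes wrap JSON in a markdown fence, and it is safer to recompute the overall verdict from the per-criterion flags than to trust `overall_pass`. A hypothetical parser sketch:

```typescript
interface Judgment {
  pass: boolean;
  critique: string;
}

// Parse the judge's JSON reply: strip an optional markdown fence, keep
// every { pass, critique } criterion, and recompute the overall verdict.
function parseJudgeReply(raw: string): {
  criteria: Record<string, Judgment>;
  overallPass: boolean;
} {
  const cleaned = raw
    .replace(/^```(?:json)?\s*/, "")
    .replace(/```\s*$/, "")
    .trim();
  const parsed = JSON.parse(cleaned) as Record<string, unknown>;
  const criteria: Record<string, Judgment> = {};
  for (const [key, value] of Object.entries(parsed)) {
    if (value !== null && typeof value === "object" && "pass" in value) {
      criteria[key] = value as Judgment;
    }
  }
  const overallPass = Object.values(criteria).every((j) => j.pass);
  return { criteria, overallPass };
}
```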
---

## Phase 5: Iteration Workflow

### The Prompt Tuning Loop

```
1. Run eval dataset
2. Identify top failure mode
3. Add/modify prompt instruction
4. Re-run affected cases
5. Verify fix, check for regressions
6. Commit if improved
7. Repeat
```

### Tracking Progress

Maintain a metrics table:

| Date | Pass Rate | Top Failure | Action Taken |
|------|-----------|-------------|--------------|
| 2026-02-04 | 65% | Hallucinated files | Added file validation |
| 2026-02-04 | 78% | History dump | Updated prompt guidelines |
| 2026-02-05 | 85% | Missing gotchas | Added "What to Extract" section |
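This table can be generated from the judgment files rather than maintained by hand. A sketch; the `overall_pass` and `priority_fix` field names mirror the judge output, while `summarize` is a hypothetical helper:

```typescript
interface EvalRecord {
  overall_pass: boolean;
  priority_fix?: string;
}

// Summarize one eval run: pass rate plus the most common priority fix,
// which becomes the "Top Failure" column in the metrics table.
function summarize(records: EvalRecord[]): {
  passRate: number;
  topFailure?: string;
} {
  const passRate =
    records.filter((r) => r.overall_pass).length / records.length;
  const counts = new Map<string, number>();
  for (const r of records) {
    if (!r.overall_pass && r.priority_fix) {
      counts.set(r.priority_fix, (counts.get(r.priority_fix) ?? 0) + 1);
    }
  }
  const top = [...counts.entries()].sort((a, b) => b[1] - a[1])[0];
  return { passRate, topFailure: top?.[0] };
}
```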
### When to Stop

Stop tuning when:

- Pass rate > 85% on full dataset
- No new failure modes in 10 consecutive cases
- Marginal improvements < 2% per iteration

---
## Phase 6: CI Integration

### Test Levels

| Level | What | When | Time |
|-------|------|------|------|
| 1 | Unit tests (schema, parsing) | Every commit | <5s |
| 1 | Smoke eval (5 cases) | Every commit | <30s |
| 2 | Full eval (30+ cases) | Daily/weekly | <5min |
| 2 | LLM judge review | Before release | <10min |

### CI Configuration

```yaml
# .github/workflows/eval.yml
name: Handoff Evals

on:
  push:
    paths:
      - 'extensions/handoff/**'
  schedule:
    - cron: '0 6 * * *' # Daily at 6am

jobs:
  smoke-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npm test # Includes smoke eval cases

  full-eval:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: npm run eval:full
      - run: npm run eval:report
```
---

## Directory Structure

```
evals/
└── handoff/
    ├── dataset.jsonl          # Test cases with expected values
    ├── traces/                # Raw conversation + extraction pairs
    │   ├── handoff_001.json
    │   ├── handoff_002.json
    │   └── ...
    ├── judgments/             # Manual and LLM judge results
    │   ├── 2026-02-04.jsonl
    │   └── ...
    ├── metrics/               # Historical metrics
    │   └── history.jsonl
    └── scripts/
        ├── run-eval.ts        # Run extraction on dataset
        ├── judge.ts           # LLM judge implementation
        ├── report.ts          # Generate metrics report
        └── replay.ts          # Replay sessions for testing
```
---

## Quick Start Checklist

- [ ] Create `evals/handoff/` directory structure
- [ ] Collect first 10 handoff traces manually
- [ ] Define expected values for each trace
- [ ] Run extractions and score pass/fail
- [ ] Identify top 3 failure modes
- [ ] Update extraction prompt
- [ ] Re-run and verify improvement
- [ ] Add smoke tests to CI
- [ ] Collect 20 more examples
- [ ] Build LLM judge when manual labels stabilize
- [ ] Set up daily eval runs

---

## References

- [Hamel Husain's LLM Evals Course](https://maven.com/parlance-labs/evals)
- [Nicolay Gerold: How I Built Handoff](https://nicolaygerold.com/posts/how-i-built-handoff-in-amp)
- llm-evals skill: `~/.config/opencode/skill/llm-evals/`
- Handoff spec: `docs/spec_handoff.md`