@protolabsai/proto 0.21.0 → 0.23.0

package/README.md CHANGED
@@ -262,9 +262,74 @@ A `MEMORY.md` index is auto-generated and loaded into the system prompt at the s
262
262
 
263
263
  After each conversation turn, a background extraction agent reviews recent messages and auto-creates memories for notable facts. This runs fire-and-forget with restricted tools (read/write/glob in the memory directory only).
264
264
 
265
+ ## Agent Harness
266
+
267
+ proto includes a harness system that enforces quality gates, limits scope, and recovers from failures automatically.
268
+
269
+ ### Sprint Contract (Scope Lock)
270
+
271
+ Prevents agents from modifying files outside an agreed scope. Before coding begins, negotiate a contract that defines exactly which files will be created or modified. Once the contract is confirmed, the scope lock is armed and any write outside the scope is rejected with a recovery message.
272
+
273
+ **Workflow:**
274
+
275
+ ```bash
276
+ proto
277
+ /sprint-contract
278
+ > Task: Refactor auth module
279
+ > Files: src/auth.ts, src/utils.ts
280
+ > Confirm
281
+ ```
282
+
283
+ **Behavior:**
284
+
285
+ - Write to `src/auth.ts` → ALLOWED
286
+ - Write to `tests/foo.test.ts` → BLOCKED with scope violation message
287
+
288
+ Contracts persist at `.proto/sprint-contract.json` and auto-restore on session resume.
289
+
290
+ ### Behavior Verification Gate
291
+
292
+ Post-run smoke tests that verify changes actually work. After a subagent completes, the gate runs your defined scenarios (shell commands) in parallel. Failures inject a remediation message back to the agent for self-correction.
293
+
294
+ **Setup** — create `.proto/verify-scenarios.json`:
295
+
296
+ ```json
297
+ [
298
+ { "name": "tests pass", "command": "npm test -- --run", "timeoutMs": 60000 },
299
+ { "name": "build works", "command": "npm run build", "timeoutMs": 30000 },
300
+ { "name": "no TypeScript errors", "command": "npm run typecheck" }
301
+ ]
302
+ ```
303
+
304
+ **Behavior:**
305
+
306
+ 1. Agent completes task, reports GOAL
307
+ 2. Gate fires, runs all scenarios in parallel
308
+ 3. If any fail → remediation message injected, agent self-corrects
309
+ 4. Gate fires again until all pass
310
+
311
+ ### Multi-Sample Retry
312
+
313
+ When a subagent fails (ERROR, MAX_TURNS, or TIMEOUT), proto retries up to 2 more times with escalating temperatures (0.7 → 1.0 → 1.3). Each retry gets a `[RETRY CONTEXT]` block summarizing previous failures. All attempts are scored, and the highest-scoring result is returned.
314
+
315
+ This reduces false negatives from single-run failures and gives the model multiple chances with different sampling strategies.
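The retry behavior described above can be sketched roughly as follows. This is an illustration only, not proto's implementation: the `AgentStatus`, `Attempt`, `buildRetryContext`, and `runWithRetries` names, the `runOnce` callback, and the exact layout of the `[RETRY CONTEXT]` block are all assumptions. Only the temperature schedule and the status names come from the text.

```typescript
// Hypothetical types: the README names the statuses and temperatures, not this API.
type AgentStatus = "GOAL" | "ERROR" | "MAX_TURNS" | "TIMEOUT";

interface Attempt {
  status: AgentStatus;
  temperature: number;
  summary: string; // short description of what happened in this attempt
}

// Temperature schedule from the README: the first attempt plus up to 2 retries.
const TEMPERATURES = [0.7, 1.0, 1.3];

// Builds the block prepended to a retry prompt (the line format is an assumption).
function buildRetryContext(previous: Attempt[]): string {
  const lines = previous.map(
    (a, i) => `Attempt ${i + 1} (temperature ${a.temperature}): ${a.status} - ${a.summary}`,
  );
  return ["[RETRY CONTEXT]", ...lines].join("\n");
}

// Runs the task until it reaches GOAL or the temperature schedule is exhausted.
function runWithRetries(
  runOnce: (
    temperature: number,
    retryContext: string | null,
  ) => { status: AgentStatus; summary: string },
): Attempt[] {
  const attempts: Attempt[] = [];
  for (const temperature of TEMPERATURES) {
    // The first attempt has no failures to summarize.
    const context = attempts.length > 0 ? buildRetryContext(attempts) : null;
    const result = runOnce(temperature, context);
    attempts.push({ ...result, temperature });
    if (result.status === "GOAL") break; // only failures trigger a retry
  }
  return attempts;
}
```

The function returns every attempt rather than a single winner so that a separate scoring step can choose among them.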
316
+
317
+ ### Repo Map
318
+
319
+ PageRank-based file importance ranking. Analyzes the project's TypeScript/JS import graph to surface the most central files. Useful for understanding codebase structure or finding related files.
320
+
321
+ **Usage:**
322
+
323
+ ```bash
324
+ proto -p "Use the repo_map tool to find the most important files in this codebase"
325
+ proto -p "Use repo_map with seedFiles=['src/auth.ts'] to find related files"
326
+ ```
327
+
328
+ Results are cached at `.proto/repo-map-cache.json` and auto-invalidate on file changes.
329
+
265
330
  ## Skills
266
331
 
267
- proto ships with 16 bundled skills for agentic workflows:
332
+ proto ships with 21 bundled skills for agentic workflows:
268
333
 
269
334
  - **brainstorming** — Structured ideation
270
335
  - **dispatching-parallel-agents** — Fan-out/fan-in subagent patterns
@@ -0,0 +1,146 @@
1
+ ---
2
+ name: harness-reference
3
+ description: Reference guide for all agent harness safety features — doom loop detection, scope lock, git checkpoints, observation masking, sprint contract, reminders, repo map, behavior verification, and multi-sample retry
4
+ ---
5
+
6
+ # Agent Harness Reference
7
+
8
+ The proto harness is a set of safety and reliability features that wrap every agent execution. They fire automatically — you don't need to invoke them manually. This skill documents each feature so you can understand what's protecting you and how to configure it.
9
+
10
+ ## Features
11
+
12
+ ### Doom Loop Detection
13
+
14
+ **What it does:** Detects when the agent is repeating the same tool call pattern in a sliding 20-call window. If the same fingerprint (tool + args hash) appears 3+ times, the harness injects a recovery message and records a Langfuse span.
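As a rough illustration of the sliding-window check described above, the sketch below counts repeated fingerprints over the last 20 calls. The `ToolCall` shape, the `DoomLoopDetector` class, and the use of `JSON.stringify` in place of a real hash are assumptions made for the example; only the window size and the threshold come from the text.

```typescript
const WINDOW_SIZE = 20; // sliding window length from the description above
const REPEAT_THRESHOLD = 3; // "3+ times" in the description above

// Hypothetical shape of a recorded tool call.
interface ToolCall {
  tool: string;
  args: unknown;
}

// Stand-in fingerprint: tool name plus serialized args instead of a hash.
// Note that JSON.stringify is sensitive to key order.
function fingerprint(call: ToolCall): string {
  return `${call.tool}:${JSON.stringify(call.args)}`;
}

class DoomLoopDetector {
  private recent: string[] = [];

  // Records a call and returns true when its fingerprint has appeared
  // REPEAT_THRESHOLD or more times within the last WINDOW_SIZE calls.
  record(call: ToolCall): boolean {
    const fp = fingerprint(call);
    this.recent.push(fp);
    if (this.recent.length > WINDOW_SIZE) this.recent.shift();
    const count = this.recent.filter((f) => f === fp).length;
    return count >= REPEAT_THRESHOLD;
  }
}
```

A caller would invoke `record` once per tool call and inject a recovery message whenever it returns `true`.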
15
+
16
+ **You don't need to do anything.** The harness detects this automatically.
17
+
18
+ ### Scope Lock (Sprint Contract)
19
+
20
+ **What it does:** Before coding, the `sprint-contract` skill negotiates an explicit contract — the set of files that may change. Once activated, any write outside that set is blocked with a structured error.
21
+
22
+ **To activate:** Use the `sprint-contract` skill at the start of an implementation task. It writes `.proto/sprint-contract.json` and arms the in-memory scope lock. The lock is restored on session restart.
23
+
24
+ **To check status:** If a write is blocked, the error message tells you the violating path and the permitted set.
25
+
26
+ ### Git Checkpoints
27
+
28
+ **What it does:** Before every file-mutating tool call (`write_file`, `edit`, `replace`), the harness creates a shadow-repo commit. This lets you diff or roll back to any pre-edit state.
29
+
30
+ **To roll back:** Use `git log` to find the checkpoint commit and `git checkout <hash> -- <file>` to restore.
31
+
32
+ ### Observation Masking
33
+
34
+ **What it does:** When the context window gets large, the harness applies a rolling verbatim window — tool-call/result pairs older than the window are summarized as `[OBSERVATION_MASK: N pairs omitted]`. This keeps recent context intact while reducing token usage.
35
+
36
+ **You don't need to do anything.** Fires automatically during LLM compaction.
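A minimal sketch of the rolling verbatim window, assuming tool calls and results are stored as simple pairs. The `ToolPair` type, the `maskObservations` name, and the `windowSize` parameter are hypothetical; the placeholder text follows the format quoted above.

```typescript
// Hypothetical representation of one tool-call/result pair.
interface ToolPair {
  call: string;
  result: string;
}

// Keeps the newest `windowSize` pairs verbatim and collapses everything
// older into a single placeholder string.
function maskObservations(
  pairs: ToolPair[],
  windowSize: number,
): (ToolPair | string)[] {
  const omitted = Math.max(0, pairs.length - windowSize);
  if (omitted === 0) return [...pairs]; // nothing is old enough to mask
  const kept = pairs.slice(omitted);
  return [`[OBSERVATION_MASK: ${omitted} pairs omitted]`, ...kept];
}
```

With five pairs and a window of two, the result is one placeholder reporting three omitted pairs followed by the two most recent pairs unchanged.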
37
+
38
+ ### Harness Reminders
39
+
40
+ **What it does:** The harness injects periodic reminders into context based on three triggers:
41
+
42
+ - Every 50 tool calls: warns about high tool usage
43
+ - After 3 consecutive test failures: suggests pausing to diagnose
44
+ - After 8 turns without any file write: suggests the agent may be over-analyzing
45
+
46
+ **You don't need to do anything.** The harness injects these automatically.
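The three triggers can be pictured as checks over a few running counters. In the sketch below the thresholds come from the list above, while the `HarnessCounters` shape, the `dueReminders` function, and the reminder wording are assumptions. Whether the real harness fires a reminder once or on every turn past the threshold is not stated, so this version simply reports every condition that currently holds.

```typescript
// Thresholds from the list above.
const TOOL_CALL_INTERVAL = 50;
const TEST_FAILURE_STREAK = 3;
const TURNS_WITHOUT_WRITE = 8;

// Hypothetical counters the harness would maintain during a session.
interface HarnessCounters {
  toolCalls: number; // total tool calls so far
  consecutiveTestFailures: number; // reset when a test run passes
  turnsSinceLastWrite: number; // reset when any file is written
}

// Returns the reminders that apply to the current counter values.
function dueReminders(c: HarnessCounters): string[] {
  const reminders: string[] = [];
  if (c.toolCalls > 0 && c.toolCalls % TOOL_CALL_INTERVAL === 0) {
    reminders.push("High tool usage: check that you are still making progress.");
  }
  if (c.consecutiveTestFailures >= TEST_FAILURE_STREAK) {
    reminders.push("Repeated test failures: pause and diagnose before retrying.");
  }
  if (c.turnsSinceLastWrite >= TURNS_WITHOUT_WRITE) {
    reminders.push("No file writes recently: you may be over-analyzing.");
  }
  return reminders;
}
```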
47
+
48
+ ### Repo Map (`repo_map` tool)
49
+
50
+ **What it does:** Analyzes the import graph of the codebase and runs PageRank to surface the most-connected (and most-relevant) files. Call it at the start of any exploration or implementation task for fast orientation.
51
+
52
+ **To use:**
53
+
54
+ ```
55
+ repo_map {} # globally most-connected files
56
+ repo_map { seedFiles: ["/abs/path"] } # personalized from known-relevant files
57
+ ```
58
+
59
+ Results are cached at `.proto/repo-map-cache.json` and invalidated on file changes.
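To make the ranking idea concrete, here is a small power-iteration PageRank over an import graph, with optional seed files for the personalized variant. It is a textbook sketch rather than the `repo_map` implementation: the `ImportGraph` type, the function signature, the damping factor of 0.85, and the fixed iteration count are all assumptions.

```typescript
// Hypothetical import graph: each file maps to the files it imports.
type ImportGraph = Record<string, string[]>;

function pageRank(
  graph: ImportGraph,
  seedFiles: string[] = [],
  damping = 0.85,
  iterations = 50,
): Map<string, number> {
  const files = Object.keys(graph);
  const known = new Set(files);
  const seeds = seedFiles.filter((f) => known.has(f));

  // Teleport distribution: uniform by default, concentrated on seeds if given.
  const teleportOf = (f: string): number =>
    seeds.length > 0
      ? seeds.indexOf(f) >= 0
        ? 1 / seeds.length
        : 0
      : 1 / files.length;

  let rank = new Map<string, number>();
  for (const f of files) rank.set(f, teleportOf(f));

  for (let i = 0; i < iterations; i++) {
    const next = new Map<string, number>();
    for (const f of files) next.set(f, (1 - damping) * teleportOf(f));
    for (const f of files) {
      const share = damping * (rank.get(f) ?? 0);
      const targets = (graph[f] ?? []).filter((t) => known.has(t));
      if (targets.length === 0) {
        // A file with no imports spreads its rank along the teleport distribution.
        for (const g of files) next.set(g, (next.get(g) ?? 0) + share * teleportOf(g));
      } else {
        // Rank flows from the importer to each file it imports.
        for (const t of targets) next.set(t, (next.get(t) ?? 0) + share / targets.length);
      }
    }
    rank = next;
  }
  return rank;
}
```

Because rank flows from importer to imported file, modules that many other files depend on rise to the top. Passing seed files biases the walk toward their neighborhood.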
60
+
61
+ ### Behavior Verification Gate
62
+
63
+ **What it does:** After every subagent task that completes successfully, the harness runs user-configured "verification scenarios" — shell commands that check your feature actually works. Failures are injected back to the agent for self-correction.
64
+
65
+ **To configure:** Create `.proto/verify-scenarios.json`:
66
+
67
+ ```json
68
+ [
69
+ {
70
+ "name": "Unit tests pass",
71
+ "command": "npm test -- --run",
72
+ "timeoutMs": 60000
73
+ },
74
+ {
75
+ "name": "Build succeeds",
76
+ "command": "npm run build",
77
+ "timeoutMs": 30000
78
+ },
79
+ {
80
+ "name": "API health check",
81
+ "command": "curl -sf http://localhost:3000/health",
82
+ "expectedPattern": "ok",
83
+ "timeoutMs": 5000
84
+ }
85
+ ]
86
+ ```
87
+
88
+ See `.proto/verify-scenarios.example.json` for a full reference.
89
+
90
+ ### Multi-Sample Retry (`multi_sample: true`)
91
+
92
+ **What it does:** When a subagent fails (doom loop, error, or max turns exceeded), the harness automatically retries up to 2 more times with escalating temperatures (0.7 → 1.0 → 1.3) and injects the failure context into each retry prompt. Every attempt is scored, and the highest-scoring result is returned.
93
+
94
+ **Scoring:**
95
+
96
+ - GOAL + behavior gate pass → 3 (perfect)
97
+ - GOAL + no gate configured → 3
98
+ - GOAL + gate fail → 2 (completed but not verified)
99
+ - MAX_TURNS / TIMEOUT → 1 (partial)
100
+ - ERROR → 0 (failure)
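The scoring table above translates directly into a small function. The sketch below is illustrative: the `ScoredAttempt` and `GateResult` types and both function names are assumptions. The tie-break that keeps the earlier, lower-temperature attempt follows the rule stated in the sub-agent documentation.

```typescript
type AgentStatus = "GOAL" | "MAX_TURNS" | "TIMEOUT" | "ERROR";
// "none" means no verification scenarios are configured.
type GateResult = "pass" | "fail" | "none";

// Hypothetical record of one attempt.
interface ScoredAttempt {
  status: AgentStatus;
  gate: GateResult;
  temperature: number;
}

// Maps an attempt to the 0-3 score from the table above.
function scoreAttempt(a: ScoredAttempt): number {
  if (a.status === "GOAL") return a.gate === "fail" ? 2 : 3;
  if (a.status === "MAX_TURNS" || a.status === "TIMEOUT") return 1;
  return 0; // ERROR
}

// Returns the highest-scoring attempt; attempts are assumed to be in run order.
function pickBest(attempts: ScoredAttempt[]): ScoredAttempt | undefined {
  let best: ScoredAttempt | undefined;
  for (const a of attempts) {
    // The strict ">" keeps the earlier attempt when scores tie.
    if (best === undefined || scoreAttempt(a) > scoreAttempt(best)) best = a;
  }
  return best;
}
```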
101
+
102
+ **To enable:** Set `multi_sample: true` on the Agent tool call:
103
+
104
+ ```
105
+ Agent {
106
+ subagent_type: "general-purpose",
107
+ prompt: "implement the auth service",
108
+ multi_sample: true
109
+ }
110
+ ```
111
+
112
+ Use for complex tasks with a history of failure, not for simple searches.
113
+
114
+ ### Sprint Contract Service
115
+
116
+ **What it does:** Manages the full sprint contract lifecycle — parse, activate scope lock, persist to disk, load on resume. See the `sprint-contract` skill for usage.
117
+
118
+ **Files involved:**
119
+
120
+ - `.proto/sprint-contract.json` — persisted contract (restored on session start)
121
+ - `SprintContractService` — programmatic API
122
+
123
+ ## Langfuse Fine-Tuning Data
124
+
125
+ All harness interventions emit OTel spans routed to Langfuse via OTLP → Tempo. To build fine-tuning datasets:
126
+
127
+ 1. In Langfuse > Traces, filter by span name = `harness.intervention`
128
+ 2. Use `harness.intervention.type` attribute to segment by type:
129
+ - `doom_loop` — recovery from loops
130
+ - `scope_violation` — scope lock enforcement
131
+ - `verification_failed` — post-edit and behavior gate failures
132
+ - `reminder.*` — context reminders
133
+ 3. Export matching traces → dataset items
134
+ 4. Annotate `harness.outcome` = `"recovered"` | `"not_recovered"`
135
+ 5. Train on (input_context, intervention_message) pairs where outcome = recovered
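Steps 1, 2, and 5 amount to a filter over exported spans. The sketch below assumes a simplified export shape: the `ExportedSpan` fields `inputContext` and `interventionMessage`, the `TrainingPair` type, and the `buildDataset` function are all hypothetical. A real Langfuse export will look different and would need to be mapped into this form first. Only the span name and attribute keys are taken from the list above.

```typescript
// Hypothetical, simplified shape of one exported span.
interface ExportedSpan {
  name: string;
  attributes: Record<string, string>;
  inputContext: string; // context the model saw before the intervention
  interventionMessage: string; // message the harness injected
}

interface TrainingPair {
  type: string;
  input: string;
  target: string;
}

// Keeps harness interventions that were annotated as recovered and
// converts them into (input_context, intervention_message) pairs.
function buildDataset(spans: ExportedSpan[]): TrainingPair[] {
  return spans
    .filter((s) => s.name === "harness.intervention")
    .filter((s) => s.attributes["harness.outcome"] === "recovered")
    .map((s) => ({
      type: s.attributes["harness.intervention.type"] ?? "unknown",
      input: s.inputContext,
      target: s.interventionMessage,
    }));
}
```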
136
+
137
+ ## Configuration Summary
138
+
139
+ | Feature | Config location | Default |
140
+ | ----------------------- | ------------------------------------------------ | ------------------------------------ |
141
+ | Doom loop threshold | Code constant (`DOOM_REPEAT_THRESHOLD = 3`) | Always on |
142
+ | Scope lock | `.proto/sprint-contract.json` | Off until sprint-contract skill runs |
143
+ | Behavior gate scenarios | `.proto/verify-scenarios.json` | No scenarios (off) |
144
+ | Multi-sample retry | `multi_sample: true` on Agent call | Off (opt-in) |
145
+ | Observation mask window | Code constant (`INCREMENTAL_PROTECTED_TAIL`) | Always on |
146
+ | Harness reminders | Code constants (50 calls / 3 failures / 8 turns) | Always on |
@@ -125,12 +125,58 @@ When multiple agents share the same name, higher-priority location wins.
125
125
 
126
126
  Four agents are always available:
127
127
 
128
- | Agent | Purpose | Tools |
129
- | ----------------- | ---------------------------------------------------- | ------------------ |
130
- | `general-purpose` | Complex multi-step tasks, code search | All (except Agent) |
131
- | `Explore` | Fast codebase search and analysis | Read-only |
132
- | `verify` | Review changes for correctness before finalizing | Read-only |
133
- | `coordinator` | Orchestrate multi-agent work with task decomposition | All + Agent |
128
+ | Agent | Purpose | Tools |
129
+ | ----------------- | ---------------------------------------------------- | ------------------- |
130
+ | `general-purpose` | Complex multi-step tasks, code search | All (except Agent) |
131
+ | `Explore` | Fast codebase search and analysis | Read-only + RepoMap |
132
+ | `verify` | Review changes for correctness before finalizing | Read-only |
133
+ | `coordinator` | Orchestrate multi-agent work with task decomposition | All + Agent |
134
+
135
+ The `Explore` and `Plan` agents use the `repo_map` tool automatically at the start of tasks on large codebases to orient themselves via import-graph PageRank before diving in. You can also call `repo_map` explicitly from any agent. See [Agent Harness — Repo map](../../developers/harness#repo-map) for details.
136
+
137
+ ## Multi-sample retry
138
+
139
+ For high-stakes tasks where a single failed attempt is costly, set `multi_sample: true` on the Agent tool call. The harness will automatically retry up to 2 more times with escalating temperatures (0.7 → 1.0 → 1.3) if the first attempt fails, and return the best result.
140
+
141
+ ```json
142
+ {
143
+ "subagent_type": "general-purpose",
144
+ "description": "Implement the auth service",
145
+ "prompt": "...",
146
+ "multi_sample": true
147
+ }
148
+ ```
149
+
150
+ Each retry includes a `[RETRY CONTEXT]` block summarizing what went wrong in the previous attempt. Attempts are scored (GOAL with verification passing or no gate configured = 3, GOAL with verification failing = 2, partial result from MAX_TURNS or TIMEOUT = 1, error = 0) and the highest-scoring result is returned. When scores tie, the earlier (lower-temperature) attempt wins.
151
+
152
+ Use multi-sample for complex implementation tasks, not for searches or read-only queries.
153
+
154
+ See [Agent Harness — Multi-sample retry](../../developers/harness#multi-sample-retry) for the full scoring and temperature reference.
155
+
156
+ ## Behavior verification gate
157
+
158
+ You can configure post-task verification scenarios that run automatically after a subagent completes successfully. If any scenario fails, the output is fed back to the agent so it can self-correct.
159
+
160
+ Create `.proto/verify-scenarios.json` in your project root:
161
+
162
+ ```json
163
+ [
164
+ {
165
+ "name": "Unit tests pass",
166
+ "command": "npm test -- --run",
167
+ "timeoutMs": 60000
168
+ },
169
+ {
170
+ "name": "Build succeeds",
171
+ "command": "npm run build",
172
+ "timeoutMs": 30000
173
+ }
174
+ ]
175
+ ```
176
+
177
+ Scenarios run in parallel. Each has a `name`, a shell `command`, an optional `expectedPattern` (regex the stdout must match), and an optional `timeoutMs`. Exit code 0 is a pass when no pattern is specified.
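The pass rule can be sketched as a pure function over a command's result. The `VerifyScenario` fields mirror the ones listed above; the `CommandResult` shape and the `scenarioPassed` function are hypothetical. Two behaviors here are assumptions the text does not confirm: a timed-out command counts as a failure, and when `expectedPattern` is set the stdout match alone decides the outcome.

```typescript
// Fields as described above.
interface VerifyScenario {
  name: string;
  command: string;
  expectedPattern?: string; // regex the stdout must match
  timeoutMs?: number;
}

// Hypothetical result of running one scenario's command.
interface CommandResult {
  exitCode: number;
  stdout: string;
  timedOut: boolean;
}

function scenarioPassed(scenario: VerifyScenario, result: CommandResult): boolean {
  // Assumption: a command that hit its timeout is treated as a failure.
  if (result.timedOut) return false;
  if (scenario.expectedPattern !== undefined) {
    // Assumption: with a pattern, the stdout match decides the outcome.
    return new RegExp(scenario.expectedPattern).test(result.stdout);
  }
  // No pattern: exit code 0 is a pass.
  return result.exitCode === 0;
}
```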
178
+
179
+ See [Agent Harness — Behavior verification gate](../../developers/harness#behavior-verification-gate) for the complete field reference.
134
180
 
135
181
  ## Background execution
136
182
 
@@ -0,0 +1,62 @@
1
+ ---
2
+ name: sprint-contract
3
+ description: Negotiate a sprint contract before coding — locks down exactly which files will be touched, what will change, and the acceptance criteria. Activates the scope lock to prevent scope creep.
4
+ ---
5
+
6
+ # Sprint Contract
7
+
8
+ Produce an explicit, machine-readable sprint contract before writing any code.
9
+ The contract defines the permitted file set (scope lock), acceptance criteria,
10
+ and a sequenced implementation plan.
11
+
12
+ **Announce at start:** "I'm using the sprint-contract skill to negotiate the contract before coding."
13
+
14
+ ## Process
15
+
16
+ 1. **Read the task** — understand exactly what is being asked
17
+ 2. **Explore** — use `fff__grep` and `fff__find_files` to locate relevant files; read key files to understand current state
18
+ 3. **Identify the change surface** — determine the minimum set of files that must change
19
+ 4. **Produce the contract** — output a JSON contract (see format below)
20
+ 5. **Activate scope lock** — write the contract to `.proto/sprint-contract.json` so the harness can enforce the file set
21
+
22
+ ## Contract Format
23
+
24
+ Output a JSON block with this exact structure:
25
+
26
+ ```json
27
+ {
28
+ "task": "one-sentence description of what will be built",
29
+ "filesToCreate": ["/absolute/path/to/new/file.ts"],
30
+ "filesToModify": ["/absolute/path/to/existing/file.ts"],
31
+ "functionsToChange": {
32
+ "/absolute/path/to/file.ts": ["functionName", "ClassName.methodName"]
33
+ },
34
+ "acceptanceCriteria": [
35
+ "The X test passes",
36
+ "Feature Y is accessible via Z",
37
+ "No existing tests are broken"
38
+ ],
39
+ "implementationSequence": [
40
+ "1. Add type definitions to types.ts",
41
+ "2. Implement service in service.ts",
42
+ "3. Wire into existing call site in client.ts",
43
+ "4. Add tests"
44
+ ],
45
+ "risks": ["Changing X may affect Y — verify after"]
46
+ }
47
+ ```
48
+
49
+ ## Rules
50
+
51
+ - **Minimize scope**: only include files that genuinely need to change
52
+ - **Absolute paths**: all file paths must be absolute
53
+ - **No speculation**: only include files you have verified exist (via read or search)
54
+ - **Testable criteria**: each acceptance criterion must be objectively verifiable
55
+ - **Sequenced implementation**: order steps to minimize breakage (types → impl → tests)
56
+
57
+ ## After the Contract
58
+
59
+ Write the JSON to `.proto/sprint-contract.json` in the project root.
60
+ Then report: "Sprint contract negotiated. Scope lock activated for N files."
61
+
62
+ The harness will automatically prevent edits to files outside the contract's file set.
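As an illustration of how a file set from the contract could gate writes, the sketch below unions `filesToCreate` and `filesToModify` and checks each write against that set. Only those field names come from the contract format above. The `permittedFiles` and `checkWrite` functions, the `WriteDecision` type, and the error wording are assumptions, and exact string comparison is used because the rules require absolute paths.

```typescript
// Subset of the contract format above that matters for scope checks.
interface SprintContract {
  task: string;
  filesToCreate: string[];
  filesToModify: string[];
}

// The permitted set is every file the contract says may be created or modified.
function permittedFiles(contract: SprintContract): Set<string> {
  return new Set(contract.filesToCreate.concat(contract.filesToModify));
}

// Hypothetical structured result for a write attempt.
type WriteDecision = { allowed: true } | { allowed: false; message: string };

function checkWrite(contract: SprintContract, path: string): WriteDecision {
  const allPermitted = contract.filesToCreate.concat(contract.filesToModify);
  if (permittedFiles(contract).has(path)) return { allowed: true };
  return {
    allowed: false,
    // Report the violating path and the permitted set.
    message: `Scope violation: ${path} is outside the contract. Permitted: ${allPermitted.join(", ")}`,
  };
}
```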
@@ -120,6 +120,16 @@ Implementer subagents report one of four statuses. Handle each appropriately:
120
120
 
121
121
  **Never** ignore an escalation or force the same model to retry without changes. If the implementer said it's stuck, something needs to change.
122
122
 
123
+ ## Harness Features
124
+
125
+ The harness provides automatic safety nets you can leverage when dispatching implementers:
126
+
127
+ **Multi-sample retry** (`multi_sample: true`): For complex or high-risk tasks, set this on the Agent tool call. If the implementer fails (doom loop, error, max turns), the harness automatically retries up to 2 more times with escalating temperatures (0.7 → 1.0 → 1.3) and injects the failure context into each retry prompt. Returns the best result. Use for tasks that have previously failed or that touch many files.
128
+
129
+ **Behavior verification gate**: If `.proto/verify-scenarios.json` exists in the project, the harness runs those scenarios after every successful implementer completion. Failures are injected back to the model for self-correction. Add scenarios for smoke tests, build checks, and HTTP health checks.
130
+
131
+ **Sprint contract scope lock**: If the implementing agent was given a sprint contract (via the `sprint-contract` skill), the scope lock prevents it from writing files outside the agreed set. Any violation is blocked and reported.
132
+
123
133
  ## Prompt Templates
124
134
 
125
135
  - `./implementer-prompt.md` - Dispatch implementer subagent