@exaudeus/workrail 3.35.1 → 3.37.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (55)
  1. package/dist/config/config-file.js +2 -0
  2. package/dist/console-ui/assets/{index-D7jQyCSD.js → index-o-p__sHJ.js} +1 -1
  3. package/dist/console-ui/index.html +1 -1
  4. package/dist/daemon/workflow-runner.d.ts +5 -0
  5. package/dist/daemon/workflow-runner.js +131 -1
  6. package/dist/manifest.json +39 -31
  7. package/dist/mcp/handlers/v2-advance-events.js +1 -1
  8. package/dist/mcp/handlers/v2-execution/start.d.ts +1 -0
  9. package/dist/mcp/handlers/v2-execution/start.js +3 -2
  10. package/dist/trigger/notification-service.d.ts +42 -0
  11. package/dist/trigger/notification-service.js +164 -0
  12. package/dist/trigger/trigger-listener.js +7 -1
  13. package/dist/trigger/trigger-router.d.ts +3 -1
  14. package/dist/trigger/trigger-router.js +4 -1
  15. package/dist/v2/durable-core/schemas/export-bundle/index.d.ts +64 -32
  16. package/dist/v2/durable-core/schemas/session/events.d.ts +20 -10
  17. package/dist/v2/durable-core/schemas/session/events.js +1 -1
  18. package/dist/v2/durable-core/schemas/session/gaps.d.ts +8 -8
  19. package/dist/v2/durable-core/schemas/session/gaps.js +1 -1
  20. package/docs/design/agent-behavior-patterns-discovery.md +312 -0
  21. package/docs/design/agent-engine-communication-discovery.md +390 -0
  22. package/docs/design/agent-loop-architecture-alternatives-discovery.md +531 -0
  23. package/docs/design/agent-loop-error-handling-contract.md +238 -0
  24. package/docs/design/complete-step-approach-validation-discovery.md +344 -0
  25. package/docs/design/daemon-stuck-detection-discovery.md +174 -0
  26. package/docs/design/mcp-server-disconnect-discovery.md +245 -0
  27. package/docs/design/mcp-server-epipe-crash.md +198 -0
  28. package/docs/design/notification-design-candidates.md +131 -0
  29. package/docs/design/notification-design-review.md +84 -0
  30. package/docs/design/notification-implementation-plan.md +181 -0
  31. package/docs/design/spawn-agent-failure-modes.md +161 -0
  32. package/docs/design/spawn-agent-result-handling-implementation-plan.md +186 -0
  33. package/docs/design/stdio-simplification-design-candidates.md +341 -0
  34. package/docs/design/stdio-simplification-design-review.md +93 -0
  35. package/docs/design/stdio-simplification-implementation-plan.md +317 -0
  36. package/docs/design/structured-output-tools-coexist-findings.md +288 -0
  37. package/docs/discovery/coordinator-script-design.md +745 -0
  38. package/docs/discovery/coordinator-ux-discovery.md +471 -0
  39. package/docs/discovery/spawn-agent-failure-modes.md +309 -0
  40. package/docs/discovery/workflow-selection-for-discovery-tasks.md +336 -0
  41. package/docs/discovery/worktrain-status-briefing.md +325 -0
  42. package/docs/discovery/worktrain-status-design-candidates.md +202 -0
  43. package/docs/discovery/worktrain-status-design-review-findings.md +86 -0
  44. package/docs/ideas/backlog.md +688 -1
  45. package/docs/ideas/daemon-structured-output-vs-tool-calls.md +344 -0
  46. package/docs/ideas/design-candidates-backlog-consolidation.md +85 -0
  47. package/docs/ideas/design-candidates-spawn-agent-task.md +178 -0
  48. package/docs/ideas/design-review-findings-backlog-consolidation.md +39 -0
  49. package/docs/ideas/design-review-findings-spawn-agent-task.md +139 -0
  50. package/docs/ideas/implementation_plan_backlog_consolidation.md +117 -0
  51. package/docs/ideas/implementation_plan_spawn_agent.md +217 -0
  52. package/docs/plans/authoring-doc-staleness-enforcement-candidates.md +251 -0
  53. package/docs/plans/authoring-doc-staleness-enforcement-review.md +99 -0
  54. package/docs/plans/authoring-doc-staleness-enforcement.md +463 -0
  55. package/package.json +1 -1
@@ -0,0 +1,238 @@
# AgentLoop Error Handling Contract

**Status:** Final recommendation -- ready for implementation
**Date:** 2026-04-16
**Scope:** `src/daemon/agent-loop.ts` `_executeTools`, PR #515 (`501df000`)

---

## Context / Ask

The daemon `AgentLoop._executeTools` has been changed twice in opposite directions:

- **Commit `3929c39f`** (initial first-party implementation): `_executeTools` had no try/catch -- tool throws propagated to `prompt()`.
- **PR #495 (`954450ae`)** (current HEAD on this branch): Added try/catch in `_executeTools` converting tool throws to `isError: true` tool_results. The module-header comment and `AgentTool` JSDoc still say "THROWS on failure (pi-agent-core contract)."
- **PR #515 (`501df000`)** (on a separate unmerged branch): Removed the try/catch again, restoring throw propagation, with the comment "A tool that exists but throws is a programmer-visible failure that should not be silently swallowed."

The design question: which contract is correct, and should it be uniform across all tools?

---
## Path Recommendation

**`full_spectrum`** -- both landscape grounding (what the code does today) and reframing (should the contract differ by tool type?) are needed to answer the real question.

Justification over alternatives:
- `landscape_first` alone would miss the deeper question of whether a single uniform contract is even correct.
- `design_first` alone would miss the concrete implementation details that constrain the answer.

---

## Constraints / Anti-goals

**Constraints:**
- The LLM must be able to see tool errors for user-facing tools (Bash, Read, Write) to retry or adapt.
- `continue_workflow` failures with a bad/expired token must not cause infinite retry loops.
- Session progress must not be silently lost; crash recovery (daemon-sessions files) must remain coherent.
- The `WorkflowRunResult` discriminated union is the outer error-as-data boundary -- `runWorkflow()` never throws.

**Anti-goals:**
- Do not create a bespoke error-classification system per tool -- complexity without clear benefit.
- Do not change the outer `runWorkflow()` contract (it already catches everything and returns a discriminated union).

---
## Landscape Packet

### Current code state (HEAD `954450ae`)

`_executeTools` in `src/daemon/agent-loop.ts` (lines 449-459) contains a try/catch:

```typescript
try {
  result = await tool.execute(block.id, params);
} catch (err: unknown) {
  const message = err instanceof Error ? err.message : String(err);
  results.push({
    toolCallId: block.id,
    toolName: block.name,
    result: { content: [{ type: 'text', text: `Tool execution failed: ${message}` }], details: null },
    isError: true,
  });
  continue;
}
```

This is **Option B**: all tool throws become `isError: true` tool_results visible to the LLM.

### What PR #515 did

PR #515 (`501df000`) removed this try/catch entirely, restoring **Option A**: any tool throw propagates from `_executeTools` through `_runLoop` (unhandled) -> `prompt()` -> caught by `runWorkflow()`'s outer try/catch -> session dies with `WorkflowRunResult { _tag: 'error' }`.

### What `workflow-runner.ts` expects

`src/daemon/workflow-runner.ts` line 13 says: "Tools THROW on failure (AgentLoop contract). `runWorkflow()` catches and returns a `WorkflowRunResult` discriminated union."

The `continue_workflow` tool's `execute()` at line 828 does:

```typescript
if (result.isErr()) {
  throw new Error(`continue_workflow failed: ${result.error.kind} -- ${JSON.stringify(result.error)}`);
}
```

This throw is supposed to propagate up to kill the session.

### What the test actually tests

The test `agent-loop.test.ts > tool errors (throwing tools) > propagates tool throws to the prompt() caller` (lines 444-458) asserts:

```typescript
await expect(agent.prompt(USER_MSG)).rejects.toThrow('Tool execution failed');
```

This test was written for PR #515's behavior (Option A). It currently **FAILS** on HEAD because `_executeTools` has the try/catch -- `prompt()` resolves instead of rejecting. It asserts a contract that does NOT exist on the current branch.

### `continue_workflow` error behavior

When `executeContinueWorkflow` returns `isErr()` (bad token, session state error, etc.), the `continue_workflow` tool throws. Under Option B (current HEAD), that throw is caught in `_executeTools` and returned as an `isError: true` tool_result -- the LLM sees "continue_workflow failed: invalid_token" and can retry. Under Option A (PR #515), that throw kills the session.

---

## Problem Frame Packet

### The real question is not "A or B" -- it is "which error categories are recoverable?"

The binary framing (Option A vs B) obscures the actual decision surface. There are three distinct error categories:

| Category | Example | Correct handling |
|---|---|---|
| **LLM recoverable** | Bash exits 1, file not found, tool returns bad output | Option B: isError tool_result, LLM retries |
| **State-fatal, non-retryable** | Bad/expired continueToken, session state corruption | Option A: propagate and kill session |
| **Transient retryable** | Network timeout, rate limit, intermittent API error | Option B or retry logic inside the tool |

Option A (PR #515's approach) treats ALL tool throws as state-fatal. This is wrong for user-facing tools.

Option B (current HEAD's approach) treats ALL tool throws as LLM-recoverable. This is wrong for `continue_workflow` with a bad token -- the LLM will retry indefinitely with the same bad token, burning context and making no progress.

### The `continue_workflow` infinite loop risk

Under Option B: the LLM calls `continue_workflow` with an expired token. `executeContinueWorkflow` returns `isErr()`. `continue_workflow.execute()` throws. `_executeTools` catches the throw, returns `isError: true` with message "continue_workflow failed: invalid_token". The LLM sees this and... retries with the same token. Same result. Loop until max_turns or wall-clock timeout.

This is a real risk, not theoretical. The system prompt instructs the agent: "round-trip the continueToken exactly." The LLM has no way to fix a bad token -- it can only loop.

### The `continue_workflow` kill-the-session risk (Option A)

Under Option A: `continue_workflow` throws on ANY error, including transient errors (e.g., DB write failed due to disk pressure, rare engine bug). The entire session dies. All progress is lost. Crash recovery may have the previous token but not the notes the agent just wrote.

---

## Candidate Directions

### Direction 1: Uniform Option B with loop-break heuristic (current HEAD + guard)

Keep the try/catch in `_executeTools`. Add a heuristic in the `continue_workflow` tool's error path: if the error kind is `invalid_token` or `session_not_found`, include a `FATAL:` prefix in the error text. The system prompt instructs the agent to stop retrying on `FATAL:` errors.

**Pros:** Simple, no type-level distinction needed.
**Cons:** Relies on LLM instruction-following to avoid the infinite loop. Fragile.

### Direction 2: Uniform Option A with outer catch (PR #515 approach)

Remove the try/catch from `_executeTools`. All tool throws propagate and kill the session. Bash, Read, and Write tools must NOT throw -- they must encode errors in their return value (non-throwing tools already do this in the current implementation).

**Pros:** Clean "throws = programmer error" invariant. Outer boundary always catches.
**Cons:** The Bash tool must never throw (currently it does throw on execAsync failure). Requires auditing all tools. Any unintended throw anywhere kills the session silently from the LLM's perspective.

### Direction 3: Two-tier contract (recommended)

Differentiate at the `AgentTool` level:

- **`recoverableOnError: true` (default):** Tool throws -> `_executeTools` catches -> `isError: true` tool_result. The LLM can see the error and adapt. This is the correct default for Bash, Read, Write, and most user-facing tools.
- **`recoverableOnError: false` / `fatalOnError: true`:** Tool throws -> propagates immediately, session dies. This is correct for `continue_workflow` when the error is a bad/expired token.

But `continue_workflow` already distinguishes internally: `isErr()` throws, `out.kind === 'blocked'` returns a recoverable result. The question is which `isErr()` cases are truly fatal.

**Better framing for `continue_workflow`:** the tool should NOT throw at all for most `isErr()` cases. It should return an `isError: true` tool_result with a `FATAL:` marker only for `invalid_token` / `session_not_found`. The agent system prompt should instruct: "If you see FATAL in a continue_workflow error, stop immediately -- do not retry."

**Pros:** Explicit, type-safe if annotated, handles the infinite loop risk without a uniform kill.
**Cons:** Requires updating `continue_workflow.execute()` to not throw on all `isErr()`.

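As a concrete illustration of the `FATAL:`-marker error path shared by Directions 1 and 3, here is a minimal sketch. The helper name, result shape, and the exact set of fatal error kinds are illustrative assumptions, not the actual `workflow-runner.ts` code:

```typescript
// Illustrative sketch of the FATAL-marker heuristic described above.
// `continueWorkflowErrorResult` and the kind list are hypothetical names.
type ToolResultSketch = {
  content: { type: 'text'; text: string }[];
  isError: boolean;
};

// Error kinds the LLM cannot fix by retrying (per the framing above).
const FATAL_KINDS = new Set(['invalid_token', 'session_not_found']);

// Convert a continue_workflow engine error into an isError tool_result,
// prefixing non-recoverable kinds with FATAL: so the system prompt can
// instruct the agent to stop retrying.
function continueWorkflowErrorResult(kind: string, detail: string): ToolResultSketch {
  const prefix = FATAL_KINDS.has(kind) ? 'FATAL: ' : '';
  return {
    content: [{ type: 'text', text: `${prefix}continue_workflow failed: ${kind} -- ${detail}` }],
    isError: true,
  };
}
```

As the Cons note says, this remains a prompt-level convention: nothing in the type system stops the agent from retrying anyway.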
---

## Challenge Notes

1. **The test is broken on HEAD.** `propagates tool throws to the prompt() caller` expects Option A behavior, but HEAD has Option B. This test will pass on the PR #515 branch and fail on main. This is a test/implementation mismatch that needs to be resolved regardless of which direction is chosen.

2. **The module-header comment contradicts the implementation.** The comment at line 21 says "Tools throw on failure (pi-agent-core contract). AgentLoop propagates throws to prompt()'s caller." The `AgentTool` JSDoc at line 81 says "THROWS on failure (pi-agent-core contract) -- do not encode errors in content." Both contradict the current try/catch. Whichever contract wins this design question, the comments must be updated to match.

3. **The `continue_workflow` infinite loop is the decisive argument against uniform Option B.** Without a mechanism to break the loop, uniform Option B is not safe for production.

4. **Bash tool behavior.** The Bash tool in `workflow-runner.ts` (lines 947-984) has its OWN internal try/catch. For exit code 1 with no stderr, it returns a successful result (POSIX "no match found" semantics). For exit 2+ or signal kills, it **throws** at line 982: `throw new Error('Command failed: ...')`. Under Option A (no try/catch in `_executeTools`), this throw kills the entire session -- the LLM never sees the stdout/stderr. Under Option B (current HEAD), `_executeTools` catches it and returns an `isError: true` tool_result that includes the full stdout/stderr. This is concrete proof that Option A is wrong for Bash. The LLM MUST see the stderr to reason about what failed.

---

## Resolution Notes

### Correct contract per tool type

**User-facing tools (Bash, Read, Write, report_issue):** Option B is correct. A bash command that exits 1 is not a programmer error -- it is normal operation. The LLM must see the stderr and exit code to retry or report to the user. Converting to an `isError: true` tool_result is the right behavior.

**`continue_workflow` with truly fatal errors (invalid/expired token, session not found):** Option A is correct -- but only for these specific error kinds. The current implementation throws for ALL `isErr()` results, which is too broad. A transient write error should not kill the session.

**`continue_workflow` with retryable errors (blocked step, validation failure):** Already handled correctly -- returns a recoverable result, not a throw.

### Was PR #515 right or wrong?

**PR #515 was wrong** for its stated goal. Removing the try/catch wholesale means any Bash command failure that throws will kill the entire autonomous session -- the LLM never gets to see the error message and adapt. This is the worst possible user experience for an autonomous agent. The PR's own comment says "A tool that exists but throws is a programmer-visible failure" -- but a failed bash command is NOT a programmer-visible failure; it is a completely normal operational event.

However, PR #515 correctly identified the real problem: `continue_workflow` should not have its errors swallowed. The fix was applied with too broad a brush.

### What the fix should be

1. **Keep Option B in `_executeTools`** (the current HEAD try/catch). This is correct for user-facing tools.

2. **Fix `continue_workflow.execute()`** to distinguish fatal vs. non-fatal `isErr()` cases:
   - `invalid_token`, `session_not_found`, `session_state_corrupted` -> should propagate (throw) even under Option B. These errors can be flagged with a special error type so `_executeTools` re-throws them specifically.
   - Transient errors -> return as `isError: true` tool_result with clear guidance.

3. **Update the `AgentTool` JSDoc and module header** to reflect the actual contract: "tools may throw; throws are converted to `isError: true` tool_results. Use a `FatalToolError` subclass to signal that a throw should propagate through to kill the session."

4. **Fix or rewrite the broken test.** `propagates tool throws to the prompt() caller` should either be deleted (if Option B is canonical) or rewritten to test the `FatalToolError` subclass propagation path.

---

## Decision Log

| Date | Decision | Rationale |
|---|---|---|
| 2026-04-16 | Option B (try/catch) is correct for user-facing tools | LLM must see bash failures to adapt |
| 2026-04-16 | PR #515 was wrong -- too broad | Removing the try/catch kills sessions on any tool error |
| 2026-04-16 | `continue_workflow` needs selective fatal error propagation | Uniform Option B creates infinite loop risk on bad token |
| 2026-04-16 | Recommended fix: `FatalToolError` subclass | `_executeTools` re-throws only `FatalToolError`; all others become tool_results |

---

## Final Summary

The correct error handling contract is **not uniform** across tool types:

- **Bash/Read/Write/report_issue**: always Option B (convert throws to `isError` tool_results). These are operational errors the LLM must see.
- **`continue_workflow`**: mostly Option B, but fatal errors (bad/expired tokens, session not found) should propagate and kill the session to prevent infinite retry loops.

**Implementation mechanism: `FatalToolError` subclass (Candidate 2)**

- Export `class FatalToolError extends Error {}` from `agent-loop.ts`
- In the `_executeTools` catch block: `if (err instanceof FatalToolError) throw err;` before converting to an isError tool_result
- In `workflow-runner.ts` line 828: throw `new FatalToolError(...)` instead of `new Error(...)`
- Update `AgentTool.execute()` JSDoc to document both throw types
- Update or split the test `propagates tool throws to the prompt() caller`

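The bullets above can be sketched end to end. This is a simplified model, not the actual `agent-loop.ts` code -- the tool and result shapes are assumptions; only the re-throw rule mirrors the plan (in the real module, `FatalToolError` would be exported):

```typescript
// Sketch of the recommended mechanism: Option B with a fatal escape hatch.
// Ordinary throws become isError tool_results; FatalToolError propagates.
class FatalToolError extends Error {}

interface SimpleTool {
  name: string;
  execute(): Promise<string>;
}

interface SimpleToolResult {
  toolName: string;
  text: string;
  isError: boolean;
}

async function executeTools(tools: SimpleTool[]): Promise<SimpleToolResult[]> {
  const results: SimpleToolResult[] = [];
  for (const tool of tools) {
    try {
      results.push({ toolName: tool.name, text: await tool.execute(), isError: false });
    } catch (err: unknown) {
      if (err instanceof FatalToolError) throw err; // selective propagation -> kills the session
      const message = err instanceof Error ? err.message : String(err);
      results.push({ toolName: tool.name, text: `Tool execution failed: ${message}`, isError: true });
    }
  }
  return results;
}
```

With this in place, a Bash throw surfaces to the LLM as an `isError` tool_result, while a `FatalToolError` from `continue_workflow` propagates to `runWorkflow()`'s outer boundary.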
**PR #515 verdict:** Wrong. Diagnosed the right symptom (bad token errors shouldn't be swallowed) but applied the wrong fix (removed ALL error recovery). The Bash tool at line 982 throws on exit 2+ -- under PR #515's approach, any bash command failure kills the entire session and the LLM never sees stderr.

**Confidence: HIGH** -- 0 RED findings in review, all 5 acceptance criteria verified.

**Residual risks:**
1. FM3 (transient continue_workflow error kills session) -- mitigated by crash recovery; defer error-kind discrimination
2. FM1 (tool author throws plain Error when FatalToolError intended) -- detectable in tests; low probability

**Supporting documents:**
- `docs/design/agent-loop-error-handling-candidates.md` -- full candidate analysis
- `docs/design/agent-loop-error-handling-review.md` -- tradeoff review findings
@@ -0,0 +1,344 @@
# Discovery: Validate complete_step Approach vs Alternatives

**Date:** 2026-04-18
**Status:** In Progress
**Branch:** feat/daemon-complete-step-tool (PR #569)

---

## Context / Ask

PR #569 implements `complete_step`: a new daemon tool that hides the `continueToken` from the LLM using a late-binding closure getter `() => currentContinueToken`. The LLM calls `complete_step({ notes, artifacts, context })` and the daemon injects the token internally.

**Original stated goal:** Is the `complete_step` tool implementation in PR #569 the best approach, or is there a superior alternative?

**Reframed problem:** The daemon agent loop needs to advance workflow state reliably without requiring the LLM to reproduce an opaque HMAC-signed token it received in a prior turn.

---
## Path Recommendation

**design_first** -- the goal is a solution statement, and the structured output prototype has already proven a materially different alternative (Option A in `structured-output-tools-coexist-findings.md`). This discovery needs to compare the two approaches against first principles before concluding PR #569 is correct.

Justification:
- `landscape_first` would be redundant: the landscape is already documented in backlog.md and the structured output findings doc.
- `full_spectrum` adds reframing work that's already been done (challenge step completed above).
- `design_first` focuses on the core tension: is `complete_step` the right design, or is structured output better?

---

## Constraints / Anti-goals

- Do NOT break the MCP tool API surface for users calling `continue_workflow` from non-daemon contexts
- Do NOT require the beta API if it introduces unacceptable stability risk
- Do NOT assume the structured output path is better just because it's newer
- MUST work on both Anthropic direct and Amazon Bedrock (confirmed in findings doc)

---
## Artifact Strategy

This document is for human readers (PR reviewers, future maintainers). It is NOT the execution truth for this workflow -- all decisions and findings are captured in WorkRail step notes and context variables. If a rewind occurs, the notes and context survive; this file may not.

**Canonical truth lives in:** WorkRail session notes (concatenated across steps).
**This file is:** A readable summary for review and reference.

---
## Landscape Packet

### Sources Read
- `src/daemon/workflow-runner.ts` -- `makeCompleteStepTool()`, `runWorkflow()`, `onAdvance`, `onTokenUpdate`
- `src/daemon/agent-loop.ts` -- sequential tool execution (`toolExecution: 'sequential'`), `_executeTools()`, event loop
- `src/v2/durable-core/tokens/short-token.ts` -- token format, length validation, HMAC verification
- `docs/design/structured-output-tools-coexist-findings.md` -- confirmed: beta API `output_config + tools` coexist on both providers
- `docs/ideas/backlog.md` -- first-principles design alternatives (Apr 18, 2026 entry)
- `docs/design/daemon-complete-step-tool-design-review.md` -- prior design review for PR #569

### Current State
`continue_workflow` requires the LLM to round-trip a 27-char HMAC-signed `continueToken`. The token is small by design (v2 short tokens replaced 162-char v1 tokens specifically because agents mangled them). But even 27-char tokens get corrupted -- wrong characters, truncation, or mangling during context handling causes TOKEN_BAD_SIGNATURE or SHORT_TOKEN_INVALID_LENGTH errors that kill sessions.

### Token Format
- 27 chars total: `ct_` prefix + 24 base64url chars (18 payload bytes: 12 nonce + 6 HMAC-SHA256)
- SHORT_TOKEN_BAD_SIGNATURE = HMAC mismatch (token valid format but wrong key or mangled HMAC bytes)
- SHORT_TOKEN_INVALID_LENGTH = decoded bytes != 18 (token truncated or characters appended)
- Both errors kill the session immediately -- no recovery without a checkpoint token

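A minimal sketch of the shape rules above, assuming Node's `base64url` `Buffer` encoding. The helper names and return values are illustrative, not the actual `short-token.ts` API:

```typescript
import { createHmac } from 'node:crypto';

// Illustrative: checks only the length rule described above (ct_ prefix,
// 24 base64url chars decoding to 18 bytes). Not the real short-token.ts.
function checkTokenShape(token: string): 'ok' | 'SHORT_TOKEN_INVALID_LENGTH' {
  if (!token.startsWith('ct_')) return 'SHORT_TOKEN_INVALID_LENGTH';
  const payload = Buffer.from(token.slice(3), 'base64url');
  return payload.length === 18 ? 'ok' : 'SHORT_TOKEN_INVALID_LENGTH'; // 12 nonce + 6 HMAC
}

// Mint a well-formed token: 12 nonce bytes + first 6 bytes of the
// HMAC-SHA256 over the nonce. 18 bytes encode to exactly 24 base64url chars.
function mintToken(nonce: Buffer, key: Buffer): string {
  const mac = createHmac('sha256', key).update(nonce).digest().subarray(0, 6);
  return 'ct_' + Buffer.concat([nonce, mac]).toString('base64url');
}
```

Any character lost or gained changes the decoded byte count (length error) or flips HMAC bits (signature error), which is why a mangled token is unrecoverable by the LLM.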
### Main Existing Approaches
1. **complete_step tool (PR #569):** LLM calls tool with notes; daemon injects token from closure variable. Token absent from LLM context entirely.
2. **Structured output (findings doc):** `client.beta.messages.create()` + `output_config.format` = JSON schema enforced at `end_turn`. Daemon parses JSON, no tool call needed. Token absent from LLM context entirely.
3. **Legacy continue_workflow:** LLM round-trips token. Still supported for non-daemon contexts (MCP tool users). Deprecated in daemon sessions.

### Hard Constraints
- `AgentLoop._executeTools()` runs tools **sequentially** -- no concurrent step execution possible within a session
- `runWorkflow()` creates one `currentContinueToken` closure variable per session -- no cross-session contamination
- `onAdvance` and the `onTokenUpdate` callback are the only two write paths to `currentContinueToken` -- they are in mutually exclusive response branches (`kind: 'ok'` vs `kind: 'blocked'`)
- Token update (via `persistTokens()`) happens BEFORE `onAdvance`/`onTokenUpdate` -- crash safety invariant

### Obvious Contradictions
- `daemon-complete-step-tool-design-review.md` recommends PR #569 as correct and ready to merge
- `structured-output-tools-coexist-findings.md` recommends REMOVING `complete_step` and replacing it with `output_config`
- These are not actually contradictory -- they have different timelines. PR #569 solves the immediate problem; structured output is the longer-term architecture. The question is whether to solve both at once or in sequence.

### Evidence Gaps
- No production telemetry on TOKEN_BAD_SIGNATURE error frequency in the field
- No data on LLM tool-hallucination rate for `complete_step` vs `continue_workflow` in practice
- Beta API stability: `output_config` is in beta -- no SLA on when it becomes stable
- `AgentClientInterface` would need updating for the beta API (currently typed to `messages.create()`, not `beta.messages.create()`)

---

## Problem Frame Packet

**Core tension:** Two valid approaches exist.

**Approach A: complete_step tool (PR #569)**
- LLM calls `complete_step({ notes, artifacts, context })`
- Daemon injects `continueToken` via closure getter `() => currentContinueToken`
- Token is never in LLM context
- Works with current `client.messages.create()` (stable API)
- Risk: LLM still calls a tool; tool hallucination is a failure mode

**Approach B: Structured output (end_turn JSON)**
- LLM ends turn with `{"step_complete": true, "notes": "..."}`
- Daemon detects `stop_reason: end_turn`, parses JSON, injects token, calls `continue_workflow`
- Token never passed through LLM at all (not even a tool call)
- Requires `client.beta.messages.create()` with `output_config`
- Risk: beta API instability; requires `AgentClientInterface` update

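Approach B's daemon-side detection could look roughly like the following sketch. The function name and payload shape are assumptions drawn from the bullets above, not code from the findings doc:

```typescript
// Sketch of Approach B: detect an end_turn whose final text is the
// step-completion JSON. Shapes are illustrative, not the real daemon code.
interface StepComplete {
  step_complete: true;
  notes: string;
}

function parseStepComplete(stopReason: string, finalText: string): StepComplete | null {
  if (stopReason !== 'end_turn') return null;
  try {
    const parsed = JSON.parse(finalText);
    if (parsed && parsed.step_complete === true && typeof parsed.notes === 'string') {
      return parsed as StepComplete;
    }
  } catch {
    // Not JSON. With output_config enforcing the schema this should not
    // happen, but the daemon must not crash on a malformed final turn.
  }
  return null;
}
```

On a non-null result the daemon would inject the token it already holds and call `continue_workflow` itself -- the token never transits the model.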
---

## Candidate Directions

### Generation Expectations (design_first + THOROUGH)
- Must include at least one direction that meaningfully reframes the problem, not just packages the obvious options
- Must cover: merge PR #569 as-is, skip to structured output now, hybrid (merge PR #569 as a bridge), and at least one reframe
- Must NOT all cluster around "close to the current plan" -- one direction must represent a genuinely different architecture
- For THOROUGH: if the first spread feels clustered or too safe, push for one more direction

### Candidate Set

**Direction 1: Merge PR #569 as-is (complete_step tool, interim)**
- `complete_step` is the right solution NOW, before structured output is stable enough
- Resolves FM1-FM5 immediately; token never in context; notes validated at boundary
- Migration path: structured output replaces it later (separate PR)
- Risks: two-PR cost; `complete_step` is transitional complexity that gets removed

**Direction 2: Skip PR #569, ship structured output directly**
- Replace `complete_step` with beta API output_config + end_turn JSON parsing
- Token never touches the LLM at all (not even a tool call)
- Requires: AgentClientInterface update, agent-loop.ts update
- Risks: beta API stability; more complex change than PR #569; takes longer

**Direction 3: Merge PR #569 now, migrate to structured output in same milestone**
- Ship PR #569 immediately to fix the production problem
- In the same sprint, ship a follow-up PR that adds structured output support
- `complete_step` becomes a compatibility shim or is removed after the structured output PR merges
- Risks: coordination overhead; two PRs in quick succession

**Direction 1: Merge PR #569 as-is (simplest correct change)**
- `makeCompleteStepTool()` with `() => string` closure getter; first in tools array; `continue_workflow` deprecated
- `currentContinueToken` is a `let` in the `runWorkflow()` closure, updated by `onAdvance` + inline `onTokenUpdate` callback
- Resolves: token never in context, notes validated at boundary, FM4 risk minimized (no token in initial prompt)
- Accepts: migration cost to structured output, mutable closure variable, LLM tool hallucination possible
- Follows existing factory function pattern exactly (`makeBashTool`, `makeReadTool`)
- **Scope: best-fit**

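The late-binding closure getter in the first bullet can be sketched as follows. The shapes are simplified assumptions, not the actual `makeCompleteStepTool()` signature:

```typescript
// Sketch of the closure-getter pattern: the tool captures a getter, not a
// token value, so it always reads the CURRENT token at call time.
interface CompleteStepToolSketch {
  name: 'complete_step';
  execute(notes: string): { advancedWith: string; notes: string };
}

function makeCompleteStepTool(getToken: () => string): CompleteStepToolSketch {
  return {
    name: 'complete_step',
    execute(notes: string) {
      // Token injected daemon-side; the LLM only ever supplies notes.
      return { advancedWith: getToken(), notes };
    },
  };
}

// Session-scoped mutable token, as the runWorkflow() closure would hold it.
let currentContinueToken = 'ct_initial';
const tool = makeCompleteStepTool(() => currentContinueToken);
currentContinueToken = 'ct_after_advance'; // onAdvance / onTokenUpdate write path
```

Because the tool holds `() => currentContinueToken` rather than a snapshot, a token rotated by `onAdvance` between turns is picked up automatically on the next call.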
**Direction 2: Skip PR #569, ship structured output directly**
- Remove `complete_step` as a tool; update `AgentClientInterface` to `beta.messages.create()`; add `output_config.format` JSON schema to `AgentLoopOptions`; parse `stop_reason: end_turn` JSON in `_runLoop`; call `executeContinueWorkflow` from workflow-runner.ts `turn_end` path
- Resolves: no closure variable needed, LLM literally cannot touch token, architecturally cleanest
- Accepts: beta API stability risk, larger scope, `AgentClientInterface` change propagates to all consumers
- Departs from existing tool-call pattern; introduces end_turn JSON parsing as new pattern
- **Scope: possibly too broad for one PR**

**Direction 3: Merge PR #569 + structured output shim in same PR (hedge)**
- Ship PR #569 AND add end_turn JSON detection in `workflow-runner.ts` `turn_end` subscriber simultaneously
- Both mechanisms call `executeContinueWorkflow`; LLM can advance via either
- Accepts: two advancement paths doubles test surface; messy interaction; AgentClientInterface change still needed
- **Scope: too broad -- should be two separate PRs**

**Direction 4 (Reframe): Replace complete_step's implementation with end_turn JSON, keeping the concept name**
- LLM instructed: output `{"complete_step": true, "notes": "..."}` as final text
- Daemon detects in turn_end subscriber, parses, calls `executeContinueWorkflow`
- NO `complete_step` in tools array -- concept preserved in system prompt, implementation changes underneath
- Resolves: same as Direction 2 plus DX stability (system prompt and mental model unchanged)
- Accepts: beta API required, system prompt change, same AgentClientInterface scope as Direction 2
- **Scope: same as Direction 2**

168
+ ---
169
+
170
+ ## Problem Frame Packet
171
+
172
+ ### Primary Users
173
+ - **Daemon users:** Running automated workflow sessions. Need zero session kills from token errors. Accept any hidden complexity.
174
+ - **Non-daemon MCP users:** Calling `continue_workflow` directly from Claude Code or other MCP clients. Must not be broken by daemon changes.
175
+ - **Daemon maintainers:** Need code that is readable, safe to change, and has clear invariants.
176
+ - **Future structured-output implementors:** Need a migration path that doesn't require breaking changes.
177
+
178
### Pains / Tensions

**T1: Correctness now vs elegance later**
`complete_step` solves the immediate token problem correctly. But it creates a migration cost if structured output is the future architecture. Shipping PR #569 now may entrench the tool-call approach for longer than intended.

**T2: Beta API stability**
Structured output requires `client.beta.messages.create()`. The beta endpoint is used by production tools (web_search, code execution) but has no stability SLA. Shipping it as the primary workflow control mechanism carries risk.

**T3: Notes validation layer**
The runtime `notes.length < 50` check is inside `execute()` (correct per the design review). The JSON Schema `minLength` is informational only. This follows "validate at boundaries" -- the tool boundary is the right place. Not a tension but a confirmed correct decision.
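A minimal sketch of the boundary check T3 describes. The constant name, result shape, and error wording are assumptions for illustration, not the PR's actual code; the point is that the schema `minLength` only hints to the LLM, so `execute()` must re-check at the boundary.

```typescript
// Illustrative runtime boundary check for notes. The JSON Schema
// minLength is advisory to the LLM; this check is what enforces it.
const MIN_NOTES_LENGTH = 50; // assumed threshold from the review

type NotesCheck = { ok: true } | { ok: false; error: string };

function validateNotes(notes: string): NotesCheck {
  if (notes.length < MIN_NOTES_LENGTH) {
    return {
      ok: false,
      error:
        `Notes too short: ${notes.length} characters (minimum ${MIN_NOTES_LENGTH}). ` +
        `Notes must describe what you did, what you produced, and any key decisions.`,
    };
  }
  return { ok: true };
}
```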

**T4: Deprecated `continue_workflow` still callable**
FM4 risk: during the transition the LLM might call `continue_workflow` with the token it sees... but the token is not in the initial prompt (confirmed at lines 1981-1984). So the LLM has no token to pass. The FM4 risk is lower than the design review assumed.

**T5: `onTokenUpdate` vs `onAdvance` -- two write paths**
There are two mutation paths for `currentContinueToken` (lines 1784 and 1925 in workflow-runner.ts). These are confirmed mutually exclusive (`kind: ok` vs `kind: blocked`). Sequential execution prevents races. Manageable complexity.

### Success Criteria
1. TOKEN_BAD_SIGNATURE errors reach zero in daemon sessions
2. Session transcripts (LLM context) never contain a `continueToken` string
3. Works on Anthropic and Bedrock without provider-specific conditional logic
4. Token flow is traceable by reading `makeCompleteStepTool()` and `runWorkflow()` alone
5. Migration to structured output doesn't require a breaking change to the MCP tool API surface

### Framing Risk

**Primary framing risk:** The entire `complete_step` PR might be unnecessary if structured output is viable to ship NOW. The findings doc (written today) recommends Option A (replace `complete_step` with `output_config`). If that recommendation is correct and the beta API is stable enough, PR #569 introduces complexity that will simply be removed in the next PR.

What would confirm this risk: if the beta API has been stable in production for 3+ months and the `AgentClientInterface` change is small, the cost of skipping PR #569 and going straight to structured output is low.

What would refute this risk: if `AgentClientInterface` changes require significant refactoring, or if the beta API has shown instability in prior WorkRail usage, then PR #569 as an interim step is the right call.

---

## Challenge Notes

**Assumption 1: LLM inference is the token corruption source**
- Might be wrong: it could be SDK serialization, context truncation, or encoding
- Evidence: v2 short tokens are only 27 chars; even a small truncation breaks the HMAC

**Assumption 2: An LLM-callable tool is the right abstraction**
- Might be wrong: the daemon already owns the loop; structured output eliminates the tool layer entirely
- Evidence: the structured output prototype confirms feasibility on both providers

**Assumption 3: The closure getter is race-free**
- Might be wrong: JS async boundaries could allow mutation between the LLM's decision and execution
- Evidence: agent-loop.ts confirms sequential tool execution (`toolExecution: 'sequential'`)

---

## Resolution Notes

### Recommendation: Merge PR #569 as-is (Candidate 1)

**Decision rationale:**

1. **Scope is best-fit.** Candidates 2 and 4 require `AgentClientInterface` changes that propagate to all `AgentLoop` consumers. This is larger than a targeted token-mangling fix.

2. **`complete_step` and structured output are NOT mutually exclusive.** The daemon can adopt structured output in a follow-up PR. `executeContinueWorkflow` is the single advance point for both approaches -- migration is low-cost.

3. **Beta API risk is real and unquantified for daemon use.** `output_config` has no SLA. For workflow control where session kills are costly, this risk needs a deliberate architectural decision, not a bundled fix.

4. **The closure getter is proven safe.** Sequential execution is confirmed; the two mutation paths are confirmed mutually exclusive; the WHY comments are load-bearing but present.

5. **YAGNI.** PR #569 solves the immediate problem with zero new abstractions. Structured output belongs in a deliberate follow-up.

### What would tip the decision to Candidate 2/4

- The daemon team commits to the structured output migration in the same sprint
- The `AgentClientInterface` change is one line (minimal propagation)
- Evidence that `complete_step` tool hallucination is a real problem in practice

### Improvements before merge

1. **`workrailSessionId` ordering concern (low risk, should add a comment):** The tool is constructed before `workrailSessionId` is populated (the token decode happens after construction, at line 1863). The closure capture by reference is correct because the agent loop starts after the decode. But this ordering dependency is implicit. Add a comment: "WHY tool constructed before workrailSessionId is populated: the closure captures by reference; workrailSessionId is assigned before the agent loop starts at the prompt() call below."

2. **`params: any` type (cosmetic, low priority):** Consider `const params = block.input as CompleteStepParams` with a defined interface instead of `params: any` + an eslint disable.
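Improvement 2 could look roughly like this. `CompleteStepParams` and the shape of `block.input` are assumed for illustration; they are not the actual PR #569 types.

```typescript
// Illustrative typing for the tool input, replacing `params: any`.
// CompleteStepParams and `block.input` are assumed shapes, not the
// actual PR #569 code.
interface CompleteStepParams {
  notes: string;
}

// A type guard goes one step further than the suggested cast:
// malformed input fails loudly here instead of propagating `any`
// (and the eslint-disable comment) downstream.
function isCompleteStepParams(input: unknown): input is CompleteStepParams {
  return (
    typeof input === "object" &&
    input !== null &&
    typeof (input as { notes?: unknown }).notes === "string"
  );
}
```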

3. **Notes error message (minor improvement):** The current message says "Current length: N characters." Better: also add "Notes must describe what you did, what you produced, and any key decisions."

### Relationship to the structured output future

PR #569 is correct as an interim step AND as a permanent solution if structured output is never adopted. The two approaches are complementary:
- PR #569 ships now: zero token errors, zero session kills
- A structured output PR ships later (optional): eliminates the `currentContinueToken` mutation, removes the tool call requirement, makes illegal states unrepresentable
- Migration path: replace the `complete_step` tool with end_turn JSON detection in `workflow-runner.ts`; update `AgentClientInterface` to `beta.messages.create()` in `agent-loop.ts`; update the system prompt to remove the tool reference

Neither PR blocks or invalidates the other.

---

## Decision Log

### Selected direction: Candidate 1 (Merge PR #569 as-is)

**Why C1 won:**
1. Best-fit scope for a token-fix PR
2. Complementary with structured output (not competing) -- `executeContinueWorkflow` is the shared seam
3. Proven safe by code analysis (sequential execution, mutually exclusive write paths, load-bearing WHY comments)
4. YAGNI -- zero new abstractions for the immediate fix

**Why the runner-up (C2/C4) lost:**
- Beta API risk is real and unquantified for daemon workflow control
- The `AgentClientInterface` change propagates to all `AgentLoop` consumers -- bigger than a fix warrants
- But: C2/C4 would win if the team commits to the migration in the same sprint

### Challenge results

**Challenge 1 (beta API risk overstated):** Partially stands. The beta risk is not zero, but the findings doc provides real data. Recommendation unchanged.

**Challenge 2 (structured output will never ship after C1):** REAL concern. Mitigation: creating the follow-up issue is a REQUIREMENT of merge, not a suggestion.

**Challenge 3 (FM4 trigger in the continue_workflow description):** NEW FINDING. The `continue_workflow` tool description contains "The continueToken from the previous...call" -- this tells the LLM that a token exists and to find one. This is an FM4 vector independent of whether the token is in the prompt. **Action: update the continue_workflow description to remove the token-seeking language; replace it with a clear deprecation directive.**

**Challenge 4 (invariant not enforced at the type level):** NEW FINDING. The mutual exclusivity of the onAdvance vs onTokenUpdate paths is documented but not type-enforced. If a third response kind is added, the invariant could be silently broken. **Action: add an exhaustiveness check (a TypeScript switch over out.kind) so the compiler catches missing cases.**

### Final improvements list (5 items, not 3)

1. Add a WHY comment documenting the workrailSessionId ordering dependency
2. Improve the notes error message to describe the content requirement
3. **Create a follow-up issue for the structured output migration (REQUIRED on merge)**
4. **Update the continue_workflow description: remove the token-seeking language, replace it with a deprecation directive**
5. **Add an exhaustiveness check on out.kind in makeCompleteStepTool to enforce the mutual-exclusivity invariant**

---

## Final Summary

### Verdict: Merge PR #569 with 2 required and 3 recommended revisions

**Confidence band: HIGH**

### Is the complete_step tool the best approach?

**For now: YES.** PR #569's approach is correct, safe, and minimal. The closure getter mechanism is proven safe by code analysis (sequential tool execution, mutually exclusive write paths, load-bearing WHY comments). Notes validation is at the correct boundary. The token is never exposed to the LLM. The approach aligns with the YAGNI and validate-at-boundaries principles.

**For the long term: the structured output architecture is better.** It eliminates the mutable closure variable entirely, makes it structurally impossible for the LLM to touch the token, and is validated on both Anthropic and Bedrock (findings doc). But it requires beta API + `AgentClientInterface` changes that are out of scope for a token-fix PR.

**Are they competing?** No. They are complementary. `executeContinueWorkflow` is the shared seam -- structured output just adds a new trigger path (end_turn JSON detection) alongside the existing tool-call trigger path.

### Key questions answered

**Q1: Is the late-binding closure getter `() => currentContinueToken` the right mechanism?**
YES. The getter is safe because tool execution is strictly sequential (confirmed in the agent-loop.ts `_executeTools()` for...of loop). `onAdvance` (for successful advances) and the inline `onTokenUpdate` callback (for blocked retries) are mutually exclusive response branches. No race condition is possible.
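A minimal model of the pattern Q1 describes, assuming names from this review rather than the real workflow-runner.ts code:

```typescript
// Minimal model of the late-binding getter. The token lives in a
// closure the LLM never sees; the getter reads it at call time, so a
// strictly sequential tool loop always observes the token written by
// the most recent advance or blocked-retry update.
function makeTokenCell(initial: string) {
  let currentContinueToken = initial;
  return {
    // Late-binding read: evaluated when complete_step executes,
    // not when the tool is constructed.
    getToken: () => currentContinueToken,
    // Mutually exclusive write paths: ok -> onAdvance,
    // blocked -> onTokenUpdate (retry token).
    onAdvance: (next: string) => { currentContinueToken = next; },
    onTokenUpdate: (retry: string) => { currentContinueToken = retry; },
  };
}
```

Because writes only happen between sequential tool executions, each `getToken()` call sees a consistent value; no read can interleave with a write.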

**Q2: Should `complete_step` eventually be replaced by structured output?**
YES -- but it doesn't need to be, and doing so is NOT urgent. PR #569 is a correct interim solution that can remain indefinitely. The structured output migration is an improvement in architectural cleanliness, not a correctness fix.

**Q3: Is the notes min-50-char enforcement at the right layer?**
YES. The JSON Schema `minLength` is informational to the LLM but not enforced by AgentLoop. The runtime check in `execute()` is the correct validation boundary per "validate at boundaries, trust inside."

**Q4: Is the blocked/retryable path reliable?**
YES. The `onTokenUpdate` callback updates `currentContinueToken` to the retry token before the LLM's next `complete_step` call. `persistTokens()` is called before either callback fires. Crash safety is preserved.

**Q5: Does PR #569 correctly remove the continueToken from `initialPrompt`?**
YES. Lines 1981-1984 confirm there is no token in the initial prompt. Implication: FM4 risk (the LLM calling `continue_workflow` with a token) is significantly reduced because the LLM has no token to copy. The only remaining FM4 vector is the `continue_workflow` tool description itself (Revision 1).

### Required revisions (merge blockers)

1. Remove "Requires a continueToken that you must round-trip exactly" from the `continue_workflow` tool description (FM4 fix)
2. Create a follow-up issue for the structured output migration on merge

### Residual risks (acceptable)

- FM4 remains possible if the LLM invents a token after being told the tool requires one; mitigated by Revision 1 and the HMAC validation backstop
- The structured output migration may be perpetually deferred; mitigated by Revision 2 (the required follow-up issue)
- The mutable closure variable is permanent technical debt until the structured output migration ships