@exaudeus/workrail 3.71.0 → 3.72.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -5,7 +5,89 @@ For historical narrative and sprint journals, see `docs/history/worktrain-journa
5
5
 
6
6
  ---
7
7
 
8
- ## Daemon and WorkTrain
8
+ ## P0 / Critical (blocks WorkTrain from working correctly)
9
+
10
+ ### Agent is doing coordinator work
11
+
12
+ **Status: bug** | Priority: P0
13
+
14
+ The agent ran `cd /path/to/main-checkout && git log`, ran `gh issue view`, read roadmap docs, and checked open PRs -- all coordinator work. The agent should never do this. It is a worker: receive a scoped task, produce output, call `complete_step`. All environment setup, context gathering, git operations, worktree management, PR creation, and orchestration are coordinator responsibility.
15
+
16
+ The coordinator should: create the worktree before the agent starts, pass a clean context packet (issue body, relevant code, what to produce), handle all git operations after the agent finishes, and spawn specialized sub-agents for subtasks.
17
+
18
+ **Near-term mitigation:** Inject `sessionWorkspacePath` (the worktree) into the system prompt instead of `trigger.workspacePath` (main checkout), and explicitly tell the agent "do not run git commands, do not read roadmap docs -- that is coordinator work." Partial fix held pending full redesign.
19
+
20
+ **Full fix:** Coordinator-heavy pipeline redesign (see below).
21
+
22
+ ---
23
+
24
+ ### Wrong directory: agent worked in main checkout instead of worktree
25
+
26
+ **Status: bug** | Priority: P0
27
+
28
+ All bash commands used `cd /main-checkout` instead of the worktree. Code changes went nowhere: delivery found nothing to commit and was silently skipped. Root cause: the system prompt names `trigger.workspacePath`, not `sessionWorkspacePath`.
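The mitigation described above (prefer the session worktree over the trigger's main checkout, and make any fallback loud) can be sketched as a small selection function. Names here (`SessionPaths`, `resolveAgentWorkspace`) are illustrative, not the actual workrail API.

```typescript
// Hypothetical sketch -- not the real workrail types.
interface SessionPaths {
  sessionWorkspacePath?: string; // per-session worktree, if one was created
  triggerWorkspacePath: string;  // main checkout from the trigger config
}

// Prefer the worktree. Silently falling back to the main checkout is exactly
// the failure mode described above, so the fallback warns.
function resolveAgentWorkspace(paths: SessionPaths): string {
  if (paths.sessionWorkspacePath) return paths.sessionWorkspacePath;
  console.warn('no session worktree -- agent will run in the main checkout');
  return paths.triggerWorkspacePath;
}
```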
29
+
30
+ ---
31
+
32
+ ### Agent faked commit SHAs in handoff block
33
+
34
+ **Status: bug** | Priority: high
35
+
36
+ The handoff block's `agentCommitShas` field contained existing main-branch SHAs from `git log`, not new commits. Fix: the coordinator records commit SHAs itself (before/after diff) rather than trusting the agent.
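The before/after approach reduces to a pure set difference: snapshot `git rev-list HEAD` in the worktree before the agent runs and again after; the new commits are the after-set minus the before-set. A minimal sketch (the function name is hypothetical):

```typescript
// Coordinator-side SHA recording: a faked SHA that already existed on the
// main branch is in `before` and therefore filtered out.
function newCommitShas(before: readonly string[], after: readonly string[]): string[] {
  const seen = new Set(before);
  return after.filter((sha) => !seen.has(sha));
}
```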
37
+
38
+ ---
39
+
40
+ ### `taskComplexity=Small` misclassification
41
+
42
+ **Status: bug** | Priority: medium
43
+
44
+ Issue #241 (TTL eviction across multiple files + new tests) was classified as Small, skipping design review, planning audit, and verification loops. Consider requiring human confirmation on Small classification before bypassing phases.
45
+
46
+ ---
47
+
48
+ ### Daemon binary stale after rebuild, no indication to user
49
+
50
+ **Status: ux gap** | Priority: medium
51
+
52
+ After `npm run build`, `worktrain daemon --start` launches the old binary with no warning. Fix: compare the built binary's mtime to the running process's binary and warn if stale.
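The staleness check itself is a one-line comparison once both timestamps are in hand; a sketch, assuming the binary's mtime comes from `fs.statSync(binPath).mtimeMs` and the daemon's start time from `launchctl`/`ps` (both assumptions, not the shipped implementation):

```typescript
// If the binary on disk was rebuilt after the daemon process started, the
// daemon is running stale code. Both timestamps are epoch milliseconds.
function binaryIsStale(binaryMtimeMs: number, daemonStartMs: number): boolean {
  return binaryMtimeMs > daemonStartMs;
}
```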
53
+
54
+ ---
55
+
56
+ ### `worktrain daemon --start` reports success even when daemon crashes immediately
57
+
58
+ **Status: bug** | Priority: medium
59
+
60
+ The health check waits 1 second, then checks `launchctl list`. If the daemon crashes in under 1 second, the check still sees a PID and reports success. Fix: poll for up to 5 seconds and verify the daemon is still running at the end of the window.
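The fix can be modeled as a pure decision over a series of liveness samples (e.g. one `launchctl list` probe every ~500ms across the 5-second window); the sampling loop and function name are illustrative:

```typescript
// Success requires the daemon to survive the whole window: an early sample
// seeing a PID is not enough, only the final sample decides.
function startupVerdict(aliveSamples: readonly boolean[]): 'success' | 'crashed' {
  if (aliveSamples.length === 0) return 'crashed';
  return aliveSamples[aliveSamples.length - 1] ? 'success' : 'crashed';
}
```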
61
+
62
+ ---
63
+
64
+ ### Handoff block not surfaced to operator
65
+
66
+ **Status: ux gap** | Priority: medium
67
+
68
+ The agent writes a complete handoff block (commitType, prTitle, prBody, filesChanged) to the session store, but it is invisible to the operator without digging through event logs. Fix: `worktrain status <sessionId>` should show it, and the console session detail should surface it prominently.
69
+
70
+ ---
71
+
72
+ ## WorkTrain Daemon
73
+
74
+ The autonomous workflow runner (`worktrain daemon`). Completely separate from the MCP server -- calls the engine directly in-process.
75
+
76
+
77
+ ### `wr.refactoring` workflow (Apr 28, 2026)
78
+
79
+ **Status: idea** | Priority: medium
80
+
81
+ A dedicated `wr.refactoring` workflow for structural refactors that don't change behavior. Distinct from `wr.coding-task` because refactors have a different shape: no new features, no bug fixes, just architecture alignment. The workflow should enforce:
82
+ - **Discovery phase**: understand current state, identify violations, classify scope
83
+ - **Test-first phase**: write tests for any extracted pure functions BEFORE extracting them (TDD red)
84
+ - **Extraction phase**: one slice at a time, tests green after each
85
+ - **Verification phase**: full suite green, build clean, no behavior changes
86
+ - **Doc update phase**: update any reference docs that describe the changed invariants
87
+
88
+ The `wr.coding-task` workflow has too much overhead for pure refactors (design review, risk assessment gating, PR strategy) and not enough refactor-specific discipline (test-first enforcement, behavior-unchanged verification).
89
+
90
+ ---
9
91
 
10
92
  ### API key baked into launchd plist at install time (Apr 24, 2026)
11
93
 
@@ -23,9 +105,9 @@ For historical narrative and sprint journals, see `docs/history/worktrain-journa
23
105
 
24
106
  ### runWorkflow() functional core refactor -- Phase 2 (Apr 24, 2026)
25
107
 
26
- **Status: idea** | Priority: medium
108
+ **Status: done** | Shipped in PR #830 (Apr 29, 2026)
27
109
 
28
- Phase 1 landed in PR #818: extracted `tagToStatsOutcome`, `buildAgentClient`, `evaluateStuckSignals`, `SessionState`, and `finalizeSession`. `runWorkflow()` is still ~880 lines with I/O and pure logic interleaved in the setup phase.
110
+ Phase 1 landed in PR #818: extracted `tagToStatsOutcome`, `buildAgentClient`, `evaluateStuckSignals`, `SessionState`, and `finalizeSession`. Phase 2 landed in PR #830.
29
111
 
30
112
  **What remains:**
31
113
 
@@ -55,18 +137,32 @@ function buildSessionContext(
55
137
 
56
138
  The shell then does:
57
139
  1. All I/O in sequence: `loadDaemonSoul`, `loadWorkspaceContext`, `loadSessionNotes`, `git worktree add`, `executeStartWorkflow`, `parseContinueTokenOrFail`, `persistTokens`
58
- 2. One pure call: `buildSessionContext(trigger, client, modelId, soul, ctx, notes, state, ...)`
59
- 3. Run the agent loop with the returned config
60
-
61
- **Why this matters:** `buildSessionContext` would be unit-testable without any filesystem, LLM, or session store. The system prompt assembly, tool list construction, and session limit calculation are currently untestable in isolation. This is the biggest remaining testability gap.
140
+ **What Phase 2 delivered (PR #830):**
141
+ - `PreAgentSession` interface + `PreAgentSessionResult` discriminated union -- all early-exit paths type-enforced
142
+ - `buildPreAgentSession()` -- all pre-agent I/O extracted; steer+daemon registries registered after all failing I/O (FM1 invariant)
143
+ - `constructTools()` -- explicitly impure named function, `state` as explicit parameter
144
+ - `persistTokens()` returns `Promise<Result<void, PersistTokensError>>` using `src/runtime/result.ts`
145
+ - `sidecardLifecycleFor()` pure function with `assertNever` exhaustiveness
146
+ - TDZ hazard fixed: `abortRegistry.set()` now registered after `const agent = new AgentLoop()`
147
+
148
+ **Phase 3 (PRs #835, #837)** continued the refactor:
149
+ - `buildTurnEndSubscriber()` extracted -- runWorkflow() body: 539 → 426 lines
150
+ - Tool param validation at LLM boundary (8 tool factories)
151
+ - `buildAgentCallbacks()` + `buildSessionResult()` pure functions -- body: 426 → 308 lines
152
+ - Test flakiness fix: `settleFireAndForget()` + `retry: 2` in vitest config
153
+
154
+ **Still deferred:**
155
+ - `CriticalEffect<T>` / `ObservabilityEffect` type distinction
156
+ - `StateRef` mutation wrapper
157
+ - Zod tool param validation (replacing manual typeof checks -- requires zodToJsonSchema or two sources of truth)
158
+ - `wr.refactoring` workflow (see backlog entry above)
62
159
 
63
- **Prerequisite:** Phase 1 (PR #818) -- done. `SessionState` already exists and can be passed in.
160
+ ---
64
161
 
65
- **Scope:** Single file (`src/daemon/workflow-runner.ts`). New exports: `buildSessionContext`, `SessionContext`. No public API changes.
162
+ ## Shared / Engine
66
163
 
67
- ---
164
+ The durable session store, v2 engine, and workflow authoring features shared by all three systems.
68
165
 
69
- ## Engine and MCP
70
166
 
71
167
  ### Improve commit SHA gathering consistency in wr.coding-task
72
168
 
@@ -264,7 +360,10 @@ Surface in: `worktrain status`, `worktrain health <sessionId>`, console session
264
360
 
265
361
  ---
266
362
 
267
- ## Daemon and Coordinator
363
+ ## WorkTrain Daemon -- Coordinator patterns
364
+
365
+ Coordinator design patterns for WorkTrain's autonomous pipeline.
366
+
268
367
 
269
368
  ### Event-driven agent coordination (coordinator as event bus)
270
369
 
@@ -516,6 +615,12 @@ Step-level `systemPrompt` overrides workflow-level for that step.
516
615
 
517
616
  ---
518
617
 
618
+ ## WorkRail MCP Server
619
+
620
+ The stdio/HTTP MCP server that Claude Code (and other MCP clients) connect to. MUST be bulletproof -- crashes kill all in-flight Claude Code sessions.
621
+
622
+
623
+
519
624
  ## Console
520
625
 
521
626
  ### Console interactivity and liveliness
@@ -562,6 +667,40 @@ Ghost nodes represent steps that were compiled into the DAG but skipped at runti
562
667
 
563
668
  ## Workflow Library
564
669
 
670
+ ### General-purpose workflow / intelligent dispatcher
671
+
672
+ **Status: idea** | Priority: medium
673
+
674
+ Two related ideas:
675
+
676
+ **`wr.quick-task`** -- the simplest possible workflow. Two steps: do the work, call `complete_step`. No complexity routing, no design review, no phased implementation. For tasks under ~10 minutes. Currently small tasks go through `wr.coding-task`'s Small fast-path, which is still heavier than needed.
677
+
678
+ **`wr.dispatch`** -- an intelligent routing workflow. Given a goal, classify it and route to the right workflow: `wr.quick-task` | `wr.research` | `wr.coding-task` | `wr.mr-review` | `wr.competitive-analysis`. The general-purpose entry point -- not a workflow that does everything, but one that decides which workflow to use. The adaptive pipeline coordinator already does this for the queue-poll trigger; the question is whether to expose it as a named user-facing workflow.
679
+
680
+ Open questions: does `wr.dispatch` replace `workflowId` in trigger config, or coexist alongside it? How does it handle tasks that don't fit any known workflow?
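Whatever answers those open questions get, the routing core of `wr.dispatch` is a small table from classification label to workflow id. A sketch using the five workflows named above (the classification step itself, LLM or heuristic, is out of scope here, and `GoalKind` is a hypothetical name):

```typescript
// Hypothetical routing table for wr.dispatch.
type GoalKind = 'quick' | 'research' | 'coding' | 'mr-review' | 'competitive-analysis';

const ROUTES: Record<GoalKind, string> = {
  quick: 'wr.quick-task',
  research: 'wr.research',
  coding: 'wr.coding-task',
  'mr-review': 'wr.mr-review',
  'competitive-analysis': 'wr.competitive-analysis',
};

function dispatch(kind: GoalKind): string {
  return ROUTES[kind];
}
```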
681
+
682
+ ---
683
+
684
+ ### MR review session count inflation
685
+
686
+ **Status: idea** | Priority: medium
687
+
688
+ A single PR review dispatches 6-12 autonomous sessions (one per reviewer family: correctness_invariants, runtime_production_risk, missed_issue_hunter, etc.). This inflates session counts, complicates cost attribution, and makes ROI calculations imprecise. Worth investigating: are all 6 families catching distinct issues, or is there significant overlap? Should families be parallelized into a single session with sub-agents rather than separate top-level sessions?
689
+
690
+ ---
691
+
692
+ ### Session trigger source attribution (daemon vs MCP)
693
+
694
+ **Status: idea** | Priority: high
695
+
696
+ No reliable way to determine whether a session was started by the daemon (WorkTrain) or a human via MCP (Claude Code). Every session-level metric and ROI calculation is ambiguous without this.
697
+
698
+ **Fix:** Add `triggerSource: 'daemon' | 'mcp'` to `run_started` event data. One-line change at each entry point, makes attribution permanent and queryable from the event log.
699
+
700
+ Files: `src/v2/durable-core/schemas/session/events.ts`, `src/mcp/handlers/v2-execution/start-workflow.ts`, `src/daemon/workflow-runner.ts`.
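The proposed field can be sketched as a discriminant plus one small builder per entry point. This only illustrates the shape; the real event schema lives in `src/v2/durable-core/schemas/session/events.ts` and its actual field names are not assumed here beyond `triggerSource`:

```typescript
// Sketch of the proposed triggerSource discriminant on run_started.
type TriggerSource = 'daemon' | 'mcp';

interface RunStartedEvent {
  type: 'run_started';
  sessionId: string;
  triggerSource: TriggerSource; // set at each entry point, queryable forever
}

function runStarted(sessionId: string, triggerSource: TriggerSource): RunStartedEvent {
  return { type: 'run_started', sessionId, triggerSource };
}
```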
701
+
702
+ ---
703
+
565
704
  ### Standup status generator
566
705
 
567
706
  **Status: idea** | Priority: low
@@ -962,6 +1101,8 @@ WorkTrain is a persistent background daemon that initiates workflows autonomousl
962
1101
  - `worktrain init` soul setup
963
1102
  - Per-trigger crash safety (`persistTokens`)
964
1103
  - Worktree orphan cleanup on delivery failure
1104
+ - runWorkflow() Phase 2 architecture (PR #830): `PreAgentSession`/`buildPreAgentSession`, `constructTools`, `persistTokens` Result type, `sidecardLifecycleFor` pure function, TDZ hazard fix for abort registry
1105
+ - runWorkflow() Phase 3 architecture (PRs #835, #837): `buildTurnEndSubscriber` (539→426 lines), tool param validation at LLM boundary (8 factories), `buildAgentCallbacks` + `buildSessionResult` pure functions (426→308 lines), test flakiness fix (settleFireAndForget + retry:2)
965
1106
 
966
1107
  ### WorkRail engine / MCP features
967
1108
 
@@ -969,6 +1110,7 @@ WorkTrain is a persistent background daemon that initiates workflows autonomousl
969
1110
 
970
1111
  - Assessment gates v1 with consequences
971
1112
  - Loop control -- all four types (`while`, `until`, `for`, `forEach`) implemented
1113
+ - Fix: sequential `artifact_contract` while loops -- stale stop artifacts from earlier loops no longer contaminate later loops (PR #830). Root cause: `collectArtifactsForEvaluation()` passed full session history to `interpreter.next()`; fix passes only `inputArtifacts` (current step's submitted artifacts).
972
1114
  - Subagent guidance feature
973
1115
  - References system (local file refs)
974
1116
  - Routine/templateCall injection
@@ -1012,3 +1154,14 @@ The agent is expensive, inconsistent, and slow. Scripts are free, deterministic,
1012
1154
  ### Metrics outcome validation
1013
1155
 
1014
1156
  **Status: done** -- `checkContextBudget` validates `metrics_outcome` enum (PR f0a1822a). SHA validation (Gap 3 above) is still open.
1157
+
1158
+ ### wr.coding-task architecture enforcement + retrospective (v1.3.0)
1159
+
1160
+ **Status: done** -- shipped in PR #830 (Apr 29, 2026)
1161
+
1162
+ - Phase 0 architecture alignment check: agent scans candidate files and names philosophy violations explicitly by function name; captures `architectureViolations` and `architectureStartsFromScratch`
1163
+ - Phase 1c conditional fragment: when `architectureStartsFromScratch = true`, blocks adapting existing violations as valid design candidates
1164
+ - Phase 8 post-implementation retrospective: runs for all tasks (no complexity gate); four practical questions applicable to any task; requires 2-4 concrete observations with explicit disposition
1165
+
1166
+ ---
1167
+
@@ -52,19 +52,23 @@ Each `runWorkflow()` call writes a per-session sidecar file at `~/.workrail/daem
52
52
 
53
53
  ### 2.1 Sidecar is written before the agent loop starts
54
54
 
55
- `persistTokens()` is called immediately after `executeStartWorkflow()` succeeds and the `continueToken` is available. A crash between `executeStartWorkflow()` returning and the first LLM call is recoverable.
55
+ `persistTokens()` is called inside `buildPreAgentSession()` immediately after `executeStartWorkflow()` succeeds and the `continueToken` is available. `buildPreAgentSession()` returns `{ kind: 'complete', result: { _tag: 'error' } }` if `persistTokens()` fails -- no agent loop starts without a valid sidecar.
56
+
57
+ `persistTokens()` returns `Promise<Result<void, PersistTokensError>>` (not throws). Callers in the setup phase treat `err` as fatal (abort); callers inside tool closures treat `err` as degraded-but-continue (log and still call `onAdvance`/`onTokenUpdate` -- see invariant 4.3).
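The two caller policies can be sketched against the Result shape. The `_tag` discriminant matches the `{ _tag: 'error' }` usage above; the exact shape of `src/runtime/result.ts` beyond that is an assumption, as are the policy function names:

```typescript
// Assumed Result shape (grounded only in the `_tag` discriminant seen above).
type Result<T, E> = { _tag: 'ok'; value: T } | { _tag: 'error'; error: E };
type PersistTokensError = { reason: string };

// Setup-phase policy: a persist failure before the agent loop is fatal.
function setupPolicy(r: Result<void, PersistTokensError>): 'continue' | 'abort' {
  return r._tag === 'ok' ? 'continue' : 'abort';
}

// Tool-closure policy (invariant 4.3): persist failure degrades crash
// recovery, but the session is live -- log and continue.
function toolClosurePolicy(r: Result<void, PersistTokensError>): 'continue' {
  if (r._tag === 'error') console.warn('persistTokens failed:', r.error.reason);
  return 'continue';
}
```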
56
58
 
57
59
  **Exception:** If `continueToken` is undefined (instant single-step completion, or `_preAllocatedStartResponse` with no token), `persistTokens()` is skipped. There is nothing to recover.
58
60
 
59
61
  ### 2.2 Sidecar is deleted on every non-worktree terminal path
60
62
 
63
+ The sidecar lifecycle decision is delegated to `sidecardLifecycleFor(tag, branchStrategy)` in `workflow-runner.ts`. That function is the authoritative source for this table; its `assertNever` default case ensures a compile error when `WorkflowRunResult` gains new variants without updating the rules.
64
+
61
65
  | Outcome | Sidecar deleted? |
62
66
  |---|---|
63
- | `success` (non-worktree) | Yes -- in `runWorkflow()` before returning |
67
+ | `success` (non-worktree) | Yes -- `finalizeSession()` deletes via `sidecardLifecycleFor` |
64
68
  | `success` (worktree) | No -- `TriggerRouter.maybeRunDelivery()` deletes it after delivery |
65
- | `error` | Yes |
66
- | `timeout` | Yes |
67
- | `stuck` | Yes |
69
+ | `error` | Yes -- `finalizeSession()` deletes via `sidecardLifecycleFor` |
70
+ | `timeout` | Yes -- `finalizeSession()` deletes via `sidecardLifecycleFor` |
71
+ | `stuck` | Yes -- `finalizeSession()` deletes via `sidecardLifecycleFor` |
68
72
 
69
73
  **Why worktree sessions differ:** Delivery (git commit, git push, gh pr create) runs inside the worktree after `runWorkflow()` returns. The sidecar must exist until delivery completes so `runStartupRecovery()` can find the worktree path if the daemon crashes during delivery.
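The table above can be reconstructed as a pure function with `assertNever` exhaustiveness, in the spirit of `sidecardLifecycleFor`. This is an illustrative sketch, not the real signature in `workflow-runner.ts`:

```typescript
// Hypothetical reconstruction of the sidecar lifecycle table.
type OutcomeTag = 'success' | 'error' | 'timeout' | 'stuck';
type BranchStrategy = 'worktree' | 'in-place';

function assertNever(x: never): never {
  throw new Error(`unhandled outcome: ${String(x)}`);
}

function sidecarAction(tag: OutcomeTag, branch: BranchStrategy): 'delete' | 'keep' {
  switch (tag) {
    case 'success':
      // Worktree success: the sidecar must survive until delivery completes.
      return branch === 'worktree' ? 'keep' : 'delete';
    case 'error':
    case 'timeout':
    case 'stuck':
      return 'delete';
    default:
      return assertNever(tag); // compile error if OutcomeTag gains a variant
  }
}
```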
70
74
 
@@ -92,11 +96,21 @@ Three registries track in-flight daemon sessions:
92
96
  | `SteerRegistry` | `workrailSessionId` | `(text: string) => void` | Mid-session coordinator injection |
93
97
  | `AbortRegistry` | `workrailSessionId` | `() => void` | SIGTERM graceful shutdown |
94
98
 
95
- ### 3.1 All registries are deregistered in the `finally` block
99
+ ### 3.1 Registry registration and deregistration
100
+
101
+ **Registration** happens in two places:
102
+
103
+ - `steerRegistry` and `DaemonRegistry` are registered inside `buildPreAgentSession()` -- AFTER all potentially-failing I/O (executeStartWorkflow, persistTokens, worktree creation). Error paths that return before registration have nothing to clean up. The single-step completion path (which returns success without running an agent loop) explicitly calls `steerRegistry.delete()` and `daemonRegistry.unregister()` before returning.
104
+
105
+ - `abortRegistry` is registered in `runWorkflow()` immediately after `const agent = new AgentLoop(...)`. The closure `() => agent.abort()` references `agent` -- registering before agent construction would be a TDZ hazard.
96
106
 
97
- `steerRegistry.delete()` and `abortRegistry.delete()` are called in the `finally` block of `runWorkflow()`. This ensures cleanup happens even if an exception is thrown in the agent loop or in the post-finally result handling.
107
+ **Deregistration**:
98
108
 
99
- **Why `finally` and not per-result-path:** A stale steer or abort callback on a dead session would cause `POST /sessions/:id/steer` to return 200 (calling the closed-over callback) or the shutdown handler to call `abort()` on an already-exited session. Both are silent correctness bugs.
109
+ - `steerRegistry.delete()` and `abortRegistry.delete()` are called in the `finally` block of `runWorkflow()`. This ensures cleanup happens even if an exception is thrown in the agent loop.
110
+
111
+ - `daemonRegistry.unregister()` is called at each result path (success, error, timeout, stuck) via `finalizeSession()`. It is NOT in `finally` because the completion status ('completed' vs 'failed') differs by path.
112
+
113
+ **Why stale entries are bugs:** a stale steer callback on a dead session makes `POST /sessions/:id/steer` return 200 (calling the closed-over callback) instead of 404. A stale abort callback makes the shutdown handler call `abort()` on an already-exited session. Both are silent correctness bugs.
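The `finally`-block discipline above guarantees cleanup even when the loop throws. A minimal sketch (registry and session names are illustrative):

```typescript
// Deregistration in `finally`: a throw inside loop() cannot leave a stale
// steer callback behind.
function runSession(
  sessionId: string,
  steerRegistry: Map<string, (text: string) => void>,
  loop: () => void,
): void {
  steerRegistry.set(sessionId, (text) => console.log('steer:', text));
  try {
    loop();
  } finally {
    steerRegistry.delete(sessionId); // runs even when loop() throws
  }
}
```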
100
114
 
101
115
  ### 3.2 `DaemonRegistry` is unregistered at every result path
102
116
 
@@ -110,7 +124,11 @@ If `parseContinueTokenOrFail()` fails (unusual -- the token just came from `exec
110
124
 
111
125
  ### 3.4 Registration gap is documented
112
126
 
113
- There is a ~50ms window between `executeStartWorkflow()` returning and `steerRegistry.set()` being called (after `parseContinueTokenOrFail()` completes). A `POST /sessions/:id/steer` call in this window receives 404. Coordinators should retry once on 404 during session startup.
127
+ **SteerRegistry gap (~50ms):** There is a ~50ms window between `executeStartWorkflow()` returning and `steerRegistry.set()` being called (after `parseContinueTokenOrFail()` completes). A `POST /sessions/:id/steer` call in this window receives 404. Coordinators should retry once on 404 during session startup.
128
+
129
+ **AbortRegistry gap (~200-500ms):** `abortRegistry.set()` is registered _after_ `const agent = new AgentLoop(...)` is constructed, which happens after the context-loading phase (`loadDaemonSoul`, `loadWorkspaceContext`, `loadSessionNotes` in parallel). This means there is a ~200-500ms window where SIGTERM will not abort an in-flight session. Sessions in this window run to completion or hit the wall-clock timeout.
130
+
131
+ **Why the abort gap is wider than the steer gap:** `abortRegistry.set` registers `() => agent.abort()`, which closes over `agent`. Registering this callback before `agent` is constructed would be a TDZ (Temporal Dead Zone) hazard: `agent` is declared with `const`, so a shutdown handler firing before initialization would invoke the closure while `agent` is still in the TDZ and throw a ReferenceError. Registering after `agent` construction eliminates the hazard at the cost of a wider registration window. The accepted tradeoff is the same as for the steer gap.
114
132
 
115
133
  ---
116
134
 
@@ -134,6 +152,8 @@ Both are guarded by the sequential tool execution invariant (no concurrent token
134
152
 
135
153
  `persistTokens()` is called inside `makeCompleteStepTool.execute()` and `makeContinueWorkflowTool.execute()` before `onAdvance()` or `onTokenUpdate()` are called. A crash between the engine returning a new token and `persistTokens()` completing would leave an unrecoverable state.
136
154
 
155
+ `persistTokens()` returns `Promise<Result<void, PersistTokensError>>`. On `err` inside a tool closure, the policy is **log and continue** -- `onAdvance()` / `onTokenUpdate()` are still called even when persistence fails. Rationale: a persist failure degrades crash recovery but the session is still live. Killing the session on persist failure would lose in-progress work, which is strictly worse.
156
+
137
157
  **Note:** The sidecar write uses the atomic temp-rename pattern (`writeFile(tmp) → rename(tmp, final)`) to prevent corrupt partial writes.
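The temp-rename pattern named in the note can be sketched directly; a synchronous version for clarity (the real write is presumably async, and the function name is illustrative):

```typescript
import { writeFileSync, renameSync, readFileSync, mkdtempSync } from 'node:fs';
import { join } from 'node:path';
import { tmpdir } from 'node:os';

// Atomic temp-rename: write the full payload to a temp file, then rename over
// the final path. rename() within one filesystem is atomic on POSIX, so a
// reader (or crash-recovery scan) never observes a partially written sidecar.
function writeSidecarAtomic(finalPath: string, payload: string): void {
  const tmp = `${finalPath}.tmp`;
  writeFileSync(tmp, payload, 'utf8');
  renameSync(tmp, finalPath); // atomic replace of any previous sidecar
}
```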
138
158
 
139
159
  ### 4.4 Stuck detection is non-blocking for the session result
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@exaudeus/workrail",
3
- "version": "3.71.0",
3
+ "version": "3.72.0",
4
4
  "description": "Step-by-step workflow enforcement for AI agents via MCP",
5
5
  "license": "MIT",
6
6
  "repository": {
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "id": "wr.coding-task",
3
3
  "name": "Agentic Task Dev Workflow",
4
- "version": "1.2.0",
4
+ "version": "1.3.0",
5
5
  "description": "Use this to implement a software feature or task. Follows a plan-then-execute approach with architecture decisions, invariant tracking, and final verification.",
6
6
  "about": "## Agentic Coding Task Workflow\n\nThis workflow structures the full lifecycle of a software implementation task: from understanding and classifying the work, through architecture decisions and incremental implementation, to final verification and handoff.\n\n### What it does\n\nThe workflow guides an AI agent through a disciplined plan-then-execute process. It begins by analyzing the task to determine complexity, risk, and the right level of rigor (QUICK, STANDARD, or THOROUGH). For non-trivial tasks, it then gathers codebase context, surfaces invariants and non-goals, generates competing design candidates, and selects an approach before writing a single line of code. Implementation proceeds slice by slice, with built-in verification gates after each slice. A final integration verification pass confirms acceptance criteria are met before handoff.\n\n### Upstream context (Phase 0.5)\n\nPhase 0.5 looks for any upstream document that has already defined what to build -- a Shape Up pitch, PRD, BRD, RFC, design doc, user story with acceptance criteria, Jira epic, or equivalent. The agent uses whatever tools are available (repo search, WebFetch, Confluence/Notion/Glean MCPs, Memory MCP) to find it. If found, two flags are set: `upstreamSpecDetected` (something exists) and `solutionFixed` (whether the document commits to a specific technical direction). When `solutionFixed = true`, design ideation phases (1a-1c) are skipped and Phase 1d translates the upstream constraints directly into an engineering approach. When `solutionFixed = false`, design ideation runs normally but is constrained by whatever the upstream document does specify. The plan audit (Phase 4) checks for drift against `upstreamBoundaries` whenever an upstream document was found.\n\n### When to use it\n\nUse this workflow whenever you are implementing a feature, fixing a non-trivial bug, or making an architectural change in a real codebase. 
It is especially valuable when:\n- The task touches multiple files or systems\n- There is meaningful risk of regressions or invariant violations\n- You want the agent to surface trade-offs and commit to a reasoned design decision rather than guessing\n- You need a resumable, auditable record of what was decided and why\n\nFor quick one-liner fixes or very small changes, the workflow includes a fast path that skips heavyweight planning.\n\n### What it produces\n\n- An `implementation_plan.md` artifact covering the selected approach, vertical slices, test design, and philosophy alignment\n- A `spec.md` for large or high-risk tasks, capturing observable behavior and acceptance criteria\n- Step-level notes in WorkRail that serve as a durable execution log\n- A PR-ready handoff summary with acceptance criteria status, invariant proofs, and follow-up tickets\n\n### How to get good results\n\n- Provide a clear task description and at least partial acceptance criteria before starting\n- If you have coding philosophy or project conventions configured in session rules or Memory MCP, the workflow will apply them automatically as a design lens\n- Let the workflow classify complexity and rigor itself; override only if the classification is clearly wrong\n- For large or high-risk tasks, review the architecture decision step before implementation begins",
7
7
  "examples": [
@@ -143,7 +143,7 @@
143
143
  "SUBAGENT SYNTHESIS: treat subagent output as evidence, not conclusions. State your hypothesis before delegating, then interrogate what came back: what was missed, wrong, or new? Say what changed your mind or what you still reject, and why.",
144
144
  "PARALLELISM: when reads, audits, or delegations are independent, run them in parallel inside the phase. Parallelize cognition; serialize synthesis and canonical writes.",
145
145
  "PHILOSOPHY LENS: apply the user's coding philosophy (from active session rules) as the evaluation lens. Flag violations by principle name, not as generic feedback. If principles conflict, surface the tension explicitly instead of silently choosing.",
146
- "VALIDATION: prefer static/compile-time safety over runtime checks. Use build, type-checking, and tests as the primary proof of correctness \u2014 in that order of reliability.",
146
+ "VALIDATION: prefer static/compile-time safety over runtime checks. Use build, type-checking, and tests as the primary proof of correctness in that order of reliability.",
147
147
  "DRIFT HANDLING: when reality diverges from the plan, update the plan artifact and re-audit deliberately rather than accumulating undocumented drift.",
148
148
  "NEVER COMMIT MARKDOWN FILES UNLESS USER EXPLICITLY ASKS.",
149
149
  "SLICE DISCIPLINE: Phase 6 is a loop -- implement ONE slice per iteration. Do not implement multiple slices at once. The verification loop exists to catch drift per slice, not retroactively."
@@ -152,7 +152,7 @@
152
152
  {
153
153
  "id": "phase-0-understand-and-classify",
154
154
  "title": "Phase 0: Understand & Classify",
155
- "prompt": "Understand this before you touch anything.\n\nMake sure the expected behavior is clear enough to proceed. If it really isn't, ask me only what you can't answer yourself. Don't ask me things you can find with tools.\n\nThen dig through the code. Figure out:\n- where this starts and what the call chain looks like\n- which files, modules, and functions matter\n- what patterns this should follow\n- how this repo verifies similar work\n- what the real risks, invariants, and non-goals are\n\nFigure out what philosophy to use while doing the work. Prefer, in order: Memory MCP (`mcp_memory_conventions`, `mcp_memory_prefer`, `mcp_memory_recall`), active session/Firebender rules, repo patterns, then me only if those still conflict or aren't enough.\n\nRecord where that philosophy lives, not a summary. If the stated rules and repo patterns disagree, capture the conflict.\n\nOnce you actually understand the task, classify it:\n- `taskComplexity`: Small / Medium / Large\n- `riskLevel`: Low / Medium / High\n- `rigorMode`: QUICK / STANDARD / THOROUGH\n- `automationLevel`: High / Medium / Low\n- `prStrategy`: SinglePR / MultiPR\n\nUse this guidance:\n- QUICK: small, low-risk, clear path, little ambiguity\n- STANDARD: medium scope or moderate risk\n- THOROUGH: large scope, architectural uncertainty, or high-risk change\n\nThen force a context-clarity check. 
Score each from 0-2 and give one sentence of evidence for each score:\n- `entryPointClarity`: 0 = clear entry point and call chain, 1 = partial chain with gaps, 2 = still unclear where behavior starts or flows\n- `boundaryClarity`: 0 = clear boundary, 1 = likely boundary but some uncertainty, 2 = patch-vs-boundary decision still unclear\n- `invariantClarity`: 0 = important invariants are explicit, 1 = some are inferred or uncertain, 2 = important invariants are still unclear\n- `verificationClarity`: 0 = clear deterministic verification path, 1 = partial verification path, 2 = verification is still weak or unclear\n\nUse the rubric, not vibes:\n- QUICK: do not run the deeper context batch; if the rubric says you're missing too much context, your classification is probably wrong and you should reclassify upward before moving on\n- STANDARD: run the deeper context batch if the total score is 3 or more, or if `boundaryClarity`, `invariantClarity`, or `verificationClarity` is 2\n- THOROUGH: always run the deeper context batch\n\nThe deeper context batch is:\n- `routine-context-gathering` with `focus=COMPLETENESS`\n- `routine-context-gathering` with `focus=DEPTH`\n\nAfter the batch, synthesize what changed, what stayed the same, and what is still unknown. If the extra context changes the classification, update it before you leave this step.\n\nCapture:\n- `taskComplexity`\n- `riskLevel`\n- `rigorMode`\n- `automationLevel`\n- `prStrategy`\n- `contextSummary`\n- `candidateFiles`\n- `invariants`\n- `nonGoals`\n- `openQuestions` (only real human-decision questions)\n- `philosophySources`\n- `philosophyConflicts`",
155
+ "prompt": "Understand this before you touch anything.\n\nMake sure the expected behavior is clear enough to proceed. If it really isn't, ask me only what you can't answer yourself. Don't ask me things you can find with tools.\n\nThen dig through the code. Figure out:\n- where this starts and what the call chain looks like\n- which files, modules, and functions matter\n- what patterns this should follow\n- how this repo verifies similar work\n- what the real risks, invariants, and non-goals are\n\nFigure out what philosophy to use while doing the work. Prefer, in order: Memory MCP (`mcp_memory_conventions`, `mcp_memory_prefer`, `mcp_memory_recall`), active session/Firebender rules, repo patterns, then me only if those still conflict or aren't enough.\n\nRecord where that philosophy lives, not a summary. If the stated rules and repo patterns disagree, capture the conflict.\n\nOnce you actually understand the task, classify it:\n- `taskComplexity`: Small / Medium / Large\n- `riskLevel`: Low / Medium / High\n- `rigorMode`: QUICK / STANDARD / THOROUGH\n- `automationLevel`: High / Medium / Low\n- `prStrategy`: SinglePR / MultiPR\n\nUse this guidance:\n- QUICK: small, low-risk, clear path, little ambiguity\n- STANDARD: medium scope or moderate risk\n- THOROUGH: large scope, architectural uncertainty, or high-risk change\n\nThen force a context-clarity check. 
Score each from 0-2 and give one sentence of evidence for each score:\n- `entryPointClarity`: 0 = clear entry point and call chain, 1 = partial chain with gaps, 2 = still unclear where behavior starts or flows\n- `boundaryClarity`: 0 = clear boundary, 1 = likely boundary but some uncertainty, 2 = patch-vs-boundary decision still unclear\n- `invariantClarity`: 0 = important invariants are explicit, 1 = some are inferred or uncertain, 2 = important invariants are still unclear\n- `verificationClarity`: 0 = clear deterministic verification path, 1 = partial verification path, 2 = verification is still weak or unclear\n\nUse the rubric, not vibes:\n- QUICK: do not run the deeper context batch; if the rubric says you're missing too much context, your classification is probably wrong and you should reclassify upward before moving on\n- STANDARD: run the deeper context batch if the total score is 3 or more, or if `boundaryClarity`, `invariantClarity`, or `verificationClarity` is 2\n- THOROUGH: always run the deeper context batch\n\nThe deeper context batch is:\n- `routine-context-gathering` with `focus=COMPLETENESS`\n- `routine-context-gathering` with `focus=DEPTH`\n\nAfter the batch, synthesize what changed, what stayed the same, and what is still unknown. If the extra context changes the classification, update it before you leave this step.\n\nCapture:\n- `taskComplexity`\n- `riskLevel`\n- `rigorMode`\n- `automationLevel`\n- `prStrategy`\n- `contextSummary`\n- `candidateFiles`\n- `invariants`\n- `nonGoals`\n- `openQuestions` (only real human-decision questions)\n- `philosophySources`\n- `philosophyConflicts`\n\n**Architecture alignment check (do this last, after candidateFiles is known):**\n\nFor each candidate file, scan for violations of the philosophy you just discovered. Name each violation explicitly and specifically (e.g. 
\"runWorkflow() is 4900 lines -- violates compose-with-small-pure-functions\", \"tool factories use params: any -- violates validate-at-boundaries\"). Do not assert absence without checking. If you found no violations, list the principles you checked and why each passes.\n\nThen decide:\n- `architectureViolations`: list of specific violations found (may be empty)\n- `architectureStartsFromScratch`: true if violations are significant enough that the correct design starts from the user's philosophy rather than adapting the existing code. False if violations are minor or out of scope for this task.\n\nIf `architectureStartsFromScratch` is true, the design phase will be constrained: Candidate A (simplest) must still honor the philosophy -- adapting an existing violation is not a valid candidate. Record this now so the design phase uses it.",
156
156
  "requireConfirmation": {
157
157
  "or": [
158
158
  {
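The Phase 0 prompt above defines a 0-2 context-clarity rubric and rules for when each rigor mode runs the deeper context batch. A minimal sketch of that gate, with hypothetical names (`ClarityScores`, `shouldRunDeeperBatch` are illustrative, not the package's API):

```typescript
// Illustrative model of the context-clarity gate from the Phase 0 prompt.
// Type and function names here are assumptions, not exported by workrail.

type ClarityScores = {
  entryPointClarity: 0 | 1 | 2;
  boundaryClarity: 0 | 1 | 2;
  invariantClarity: 0 | 1 | 2;
  verificationClarity: 0 | 1 | 2;
};

type RigorMode = "QUICK" | "STANDARD" | "THOROUGH";

function clarityTotal(s: ClarityScores): number {
  return (
    s.entryPointClarity +
    s.boundaryClarity +
    s.invariantClarity +
    s.verificationClarity
  );
}

// QUICK never runs the deeper batch; THOROUGH always does; STANDARD runs it
// when the total is 3+ or any judgment-heavy dimension scores 2.
function shouldRunDeeperBatch(mode: RigorMode, s: ClarityScores): boolean {
  if (mode === "THOROUGH") return true;
  if (mode === "QUICK") return false;
  return (
    clarityTotal(s) >= 3 ||
    s.boundaryClarity === 2 ||
    s.invariantClarity === 2 ||
    s.verificationClarity === 2
  );
}
```

Note the QUICK asymmetry: a bad rubric score under QUICK does not trigger the batch; per the prompt, it triggers reclassification upward instead.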
@@ -218,7 +218,7 @@
218
218
  },
219
219
  {
220
220
  "id": "phase-1b-design-deep",
221
- "title": "Phase 1b: Design Generation (Injected Routine \u2014 Tension-Driven Design)",
221
+ "title": "Phase 1b: Design Generation (Injected Routine -- Tension-Driven Design)",
222
222
  "runCondition": {
223
223
  "and": [
224
224
  {
@@ -257,7 +257,7 @@
257
257
  }
258
258
  ]
259
259
  },
260
- "prompt": "Read `design-candidates.md`, compare it to your original guess, and make the call.\n\nBe explicit about three things:\n- what the design work confirmed\n- what changed your mind\n- what you missed the first time\n\nThen pressure-test the leading option:\n- what's the strongest case against it?\n- what assumption breaks it?\n\nAfter the challenge batch, say:\n- what changed your mind\n- what didn't\n- which findings you reject and why\n\nPick the approach yourself. Don't hide behind the artifact. If the simplest thing works, prefer it. If the front-runner stops looking right after challenge, switch.\n\nCapture:\n- `selectedApproach` \u2014 chosen design with rationale tied to tensions\n- `runnerUpApproach` \u2014 next-best option and why it lost\n- `architectureRationale` \u2014 tensions resolved vs accepted\n- `pivotTriggers` \u2014 conditions under which you'd switch to the runner-up\n- `keyRiskToMonitor` \u2014 failure mode of the selected approach\n- `acceptedTradeoffs`\n- `identifiedFailureModes`",
260
+ "prompt": "Read `design-candidates.md`, compare it to your original guess, and make the call.\n\nBe explicit about three things:\n- what the design work confirmed\n- what changed your mind\n- what you missed the first time\n\nThen pressure-test the leading option:\n- what's the strongest case against it?\n- what assumption breaks it?\n\nAfter the challenge batch, say:\n- what changed your mind\n- what didn't\n- which findings you reject and why\n\nPick the approach yourself. Don't hide behind the artifact. If the simplest thing works, prefer it. If the front-runner stops looking right after challenge, switch.\n\nCapture:\n- `selectedApproach`: chosen design with rationale tied to tensions\n- `runnerUpApproach`: next-best option and why it lost\n- `architectureRationale`: tensions resolved vs accepted\n- `pivotTriggers`: conditions under which you'd switch to the runner-up\n- `keyRiskToMonitor`: failure mode of the selected approach\n- `acceptedTradeoffs`\n- `identifiedFailureModes`",
261
261
  "promptFragments": [
262
262
  {
263
263
  "id": "phase-1c-challenge-standard",
@@ -277,6 +277,14 @@
277
277
  "equals": "THOROUGH"
278
278
  },
279
279
  "text": "Also run `routine-execution-simulation` on the three most likely failure paths before you decide."
280
+ },
281
+ {
282
+ "id": "phase-1c-architecture-first",
283
+ "when": {
284
+ "var": "architectureStartsFromScratch",
285
+ "equals": true
286
+ },
287
+ "text": "Architecture-first constraint is active (`architectureStartsFromScratch = true`). Before selecting a candidate, verify: does the leading option start from the user's philosophy, or does it adapt an existing violation? Adapting a code structure that was already identified as a philosophy violation is NOT a valid candidate -- even as Candidate A (simplest). The simplest valid candidate is the simplest design that honors the philosophy."
280
288
  }
281
289
  ],
282
290
  "assessmentRefs": [
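The `phase-1c-architecture-first` fragment added in this hunk is gated by a `when` clause (`var` / `equals`) on the session variable `architectureStartsFromScratch`. A sketch of how such conditional fragments might be resolved, assuming a simple strict-equality semantics (the helper names are hypothetical, not the package's implementation):

```typescript
// Illustrative resolution of conditional promptFragments.
// Assumption: a fragment with no `when` clause is always included; otherwise
// the named context variable must strictly equal the expected value.

type WhenClause = { var: string; equals: unknown };

type PromptFragment = {
  id: string;
  when?: WhenClause;
  text: string;
};

function activeFragments(
  fragments: PromptFragment[],
  context: Record<string, unknown>
): PromptFragment[] {
  return fragments.filter(
    (f) => !f.when || context[f.when.var] === f.when.equals
  );
}
```

Under this reading, the architecture-first constraint text only reaches the agent when Phase 0 actually recorded `architectureStartsFromScratch = true`.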
@@ -421,7 +429,7 @@
421
429
  "var": "taskComplexity",
422
430
  "not_equals": "Small"
423
431
  },
424
- "prompt": "Turn the decision into a plan someone else could execute without guessing.\n\n**Open questions gate:** check `openQuestions` from Phase 0. If any remain unanswered and would materially affect implementation quality, either resolve them now with tools or record them in the risk register with an explicit decision about how to proceed without them. Do not silently carry unanswered questions into implementation.\n\nUpdate `implementation_plan.md`.\n\nIt should cover:\n1. Problem statement\n2. Acceptance criteria (mirror `spec.md` if it exists; `spec.md` owns observable behavior)\n3. Non-goals\n4. Philosophy-driven constraints\n5. Invariants\n6. Selected approach + rationale + runner-up\n7. Vertical slices\n8. Work packages only if they actually help\n9. Test design\n10. Risk register\n11. PR packaging strategy\n12. Philosophy alignment per slice:\n - [principle] -> [satisfied / tension / violated + 1-line why]\n\nCapture:\n- `implementationPlan`\n- `slices`\n- `testDesign`\n- `estimatedPRCount`\n- `followUpTickets` (initialize if needed)\n- `unresolvedUnknownCount` \u2014 count of open questions that would materially affect implementation quality\n- `planConfidenceBand` \u2014 Low / Medium / High\n\nThe plan is the deliverable for this step. Do not implement anything -- not a \"quick win\", not a file read that bleeds into edits, nothing. Execution begins in Phase 6, one slice at a time. If you find yourself writing code or editing source files right now, stop immediately.",
432
+ "prompt": "Turn the decision into a plan someone else could execute without guessing.\n\n**Open questions gate:** check `openQuestions` from Phase 0. If any remain unanswered and would materially affect implementation quality, either resolve them now with tools or record them in the risk register with an explicit decision about how to proceed without them. Do not silently carry unanswered questions into implementation.\n\nUpdate `implementation_plan.md`.\n\nIt should cover:\n1. Problem statement\n2. Acceptance criteria (mirror `spec.md` if it exists; `spec.md` owns observable behavior)\n3. Non-goals\n4. Philosophy-driven constraints\n5. Invariants\n6. Selected approach + rationale + runner-up\n7. Vertical slices\n8. Work packages only if they actually help\n9. Test design\n10. Risk register\n11. PR packaging strategy\n12. Philosophy alignment per slice:\n - [principle] -> [satisfied / tension / violated + 1-line why]\n\nCapture:\n- `implementationPlan`\n- `slices`\n- `testDesign`\n- `estimatedPRCount`\n- `followUpTickets` (initialize if needed)\n- `unresolvedUnknownCount`: count of open questions that would materially affect implementation quality\n- `planConfidenceBand`: Low / Medium / High\n\nThe plan is the deliverable for this step. Do not implement anything -- not a \"quick win\", not a file read that bleeds into edits, nothing. Execution begins in Phase 6, one slice at a time. If you find yourself writing code or editing source files right now, stop immediately.",
425
433
  "assessmentRefs": [
426
434
  "plan-completeness-gate",
427
435
  "invariant-clarity-gate",
@@ -535,7 +543,7 @@
535
543
  {
536
544
  "id": "phase-4b-loop-decision",
537
545
  "title": "Loop Exit Decision",
538
- "prompt": "Decide whether the plan needs another pass.\n\nIf `planFindings` is non-empty, keep going.\nIf it's empty, stop \u2014 but say what you checked so the clean pass means something.\nIf you've hit the limit, stop and record what still bothers you.\n\nThen emit the required loop-control artifact in this shape (`decision` must be `continue` or `stop`):\n```json\n{\n \"artifacts\": [{\n \"kind\": \"wr.loop_control\",\n \"decision\": \"continue\"\n }]\n}\n```",
546
+ "prompt": "Decide whether the plan needs another pass.\n\nIf `planFindings` is non-empty, keep going.\nIf it's empty, stop, but say what you checked so the clean pass means something.\nIf you've hit the limit, stop and record what still bothers you.\n\nThen emit the required loop-control artifact in this shape (`decision` must be `continue` or `stop`):\n```json\n{\n \"artifacts\": [{\n \"kind\": \"wr.loop_control\",\n \"decision\": \"continue\"\n }]\n}\n```",
539
547
  "requireConfirmation": true,
540
548
  "outputContract": {
541
549
  "contractRef": "wr.contracts.loop_control"
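The loop-exit step requires a `wr.loop_control` artifact whose `decision` is `continue` or `stop`, enforced via `wr.contracts.loop_control`. As a standalone sketch of that shape check, with the parser name being an assumption rather than the package's actual contract machinery:

```typescript
// Hedged sketch: one way to validate the wr.loop_control artifact shape.
// The real enforcement lives behind wr.contracts.loop_control; this checker
// is illustrative only.

type LoopControl = {
  kind: "wr.loop_control";
  decision: "continue" | "stop";
};

function parseLoopControl(raw: unknown): LoopControl {
  const artifacts = (raw as { artifacts?: unknown[] })?.artifacts;
  if (!Array.isArray(artifacts)) {
    throw new Error("missing artifacts array");
  }
  const found = artifacts.find(
    (a) => (a as { kind?: string })?.kind === "wr.loop_control"
  ) as { kind: string; decision?: string } | undefined;
  if (!found) {
    throw new Error("no wr.loop_control artifact present");
  }
  if (found.decision !== "continue" && found.decision !== "stop") {
    throw new Error('decision must be "continue" or "stop"');
  }
  return { kind: "wr.loop_control", decision: found.decision };
}
```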
@@ -693,6 +701,13 @@
693
701
  }
694
702
  }
695
703
  ]
704
+ },
705
+ {
706
+ "id": "phase-8-retrospective",
707
+ "title": "Phase 8: Retrospective",
708
+ "requireConfirmation": false,
709
+ "prompt": "The implementation is done and verified. Now look back.\n\nThis is not a re-run of tests. It is a short honest look at the work you just did.\n\nAsk yourself:\n\n1. **What would you do differently?** Now that the implementation is real, what approach, boundary, or decision looks wrong in hindsight?\n\n2. **What adjacent problems did this reveal?** Did the implementation expose gaps, tech debt, or fragile assumptions in the surrounding code that were not in scope but are worth noting?\n\n3. **What follow-up work is now visible?** What is the natural next step that became clear only after doing this work?\n\n4. **What was harder or easier than expected?** Were there surprises -- good or bad -- that would change how similar tasks are approached next time?\n\nProduce 2-4 concrete observations. Each should be specific enough to act on.\n\nFor each observation:\n- **File as follow-up**: add to backlog or open a ticket if it warrants tracking\n- **Accept**: note it explicitly if it is a known limitation you are consciously leaving\n- **Fix now**: if it is small and low-risk, fix it before closing\n\nCapture:\n- `retrospectiveObservations`: list of observations with disposition (filed/accepted/fixed)\n- `followUpTickets`: any new tickets created (append to existing list)"
696
710
  }
697
- ]
711
+ ],
712
+ "validatedAgainstSpecVersion": 3
698
713
  }
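The new Phase 8 retrospective captures observations, each with a disposition of filed, accepted, or fixed, and appends new tickets to `followUpTickets`. One possible data shape for that capture, with all names hypothetical:

```typescript
// Illustrative shape for the Phase 8 retrospective capture.
// These types and the helper are assumptions, not part of workrail's API.

type Disposition = "filed" | "accepted" | "fixed";

type RetrospectiveObservation = {
  summary: string;
  disposition: Disposition;
};

// Only observations filed as follow-ups should produce new backlog tickets.
function filedObservations(
  obs: RetrospectiveObservation[]
): RetrospectiveObservation[] {
  return obs.filter((o) => o.disposition === "filed");
}
```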