@exaudeus/workrail 3.71.0 → 3.72.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -5,7 +5,89 @@ For historical narrative and sprint journals, see `docs/history/worktrain-journa
5
5
 
6
6
  ---
7
7
 
8
- ## Daemon and WorkTrain
8
+ ## P0 / Critical (blocks WorkTrain from working correctly)
9
+
10
+ ### Agent is doing coordinator work
11
+
12
+ **Status: bug** | Priority: P0
13
+
14
+ The agent ran `cd /path/to/main-checkout && git log`, ran `gh issue view`, read roadmap docs, and checked open PRs -- all coordinator work. The agent should never do this. It is a worker: receive a scoped task, produce output, call `complete_step`. All environment setup, context gathering, git operations, worktree management, PR creation, and orchestration are coordinator responsibility.
15
+
16
+ The coordinator should: create the worktree before the agent starts, pass a clean context packet (issue body, relevant code, what to produce), handle all git operations after the agent finishes, and spawn specialized sub-agents for subtasks.
17
+
18
+ **Near-term mitigation:** Inject `sessionWorkspacePath` (the worktree) into the system prompt instead of `trigger.workspacePath` (main checkout), and explicitly tell the agent "do not run git commands, do not read roadmap docs -- that is coordinator work." Partial fix held pending full redesign.
19
+
20
+ **Full fix:** Coordinator-heavy pipeline redesign (see below).
21
+
22
+ ---
23
+
24
+ ### Wrong directory: agent worked in main checkout instead of worktree
25
+
26
+ **Status: bug** | Priority: P0
27
+
28
+ All bash commands used `cd /main-checkout` instead of the worktree. Code changes went nowhere: delivery found nothing to commit and was silently skipped. Root cause: the system prompt names `trigger.workspacePath`, not `sessionWorkspacePath`.
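The mitigation described above (prefer the session worktree over the trigger's main checkout, and make any fallback loud) can be sketched as a small selection function. Names here (`SessionPaths`, `resolveAgentWorkspace`) are illustrative, not the actual workrail API.

```typescript
// Hypothetical sketch -- not the real workrail types.
interface SessionPaths {
  sessionWorkspacePath?: string; // per-session worktree, if one was created
  triggerWorkspacePath: string;  // main checkout from the trigger config
}

// Prefer the worktree. Silently falling back to the main checkout is exactly
// the failure mode described above, so the fallback warns.
function resolveAgentWorkspace(paths: SessionPaths): string {
  if (paths.sessionWorkspacePath) return paths.sessionWorkspacePath;
  console.warn('no session worktree -- agent will run in the main checkout');
  return paths.triggerWorkspacePath;
}
```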
29
+
30
+ ---
31
+
32
+ ### Agent faked commit SHAs in handoff block
33
+
34
+ **Status: bug** | Priority: high
35
+
36
+ The handoff block's `agentCommitShas` field contained existing main-branch SHAs from `git log`, not new commits. Fix: the coordinator records commit SHAs itself (before/after diff) rather than trusting the agent.
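The before/after approach reduces to a pure set difference: snapshot `git rev-list HEAD` in the worktree before the agent runs and again after; the new commits are the after-set minus the before-set. A minimal sketch (the function name is hypothetical):

```typescript
// Coordinator-side SHA recording: a faked SHA that already existed on the
// main branch is in `before` and therefore filtered out.
function newCommitShas(before: readonly string[], after: readonly string[]): string[] {
  const seen = new Set(before);
  return after.filter((sha) => !seen.has(sha));
}
```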
37
+
38
+ ---
39
+
40
+ ### `taskComplexity=Small` misclassification
41
+
42
+ **Status: bug** | Priority: medium
43
+
44
+ Issue #241 (TTL eviction across multiple files + new tests) was classified as Small, skipping design review, planning audit, and verification loops. Consider requiring human confirmation on Small classification before bypassing phases.
45
+
46
+ ---
47
+
48
+ ### Daemon binary stale after rebuild, no indication to user
49
+
50
+ **Status: ux gap** | Priority: medium
51
+
52
+ After `npm run build`, `worktrain daemon --start` launches the old binary with no warning. Fix: compare the built binary's mtime to the running process's binary and warn if stale.
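The staleness check itself is a one-line comparison once both timestamps are in hand; a sketch, assuming the binary's mtime comes from `fs.statSync(binPath).mtimeMs` and the daemon's start time from `launchctl`/`ps` (both assumptions, not the shipped implementation):

```typescript
// If the binary on disk was rebuilt after the daemon process started, the
// daemon is running stale code. Both timestamps are epoch milliseconds.
function binaryIsStale(binaryMtimeMs: number, daemonStartMs: number): boolean {
  return binaryMtimeMs > daemonStartMs;
}
```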
53
+
54
+ ---
55
+
56
+ ### `worktrain daemon --start` reports success even when daemon crashes immediately
57
+
58
+ **Status: bug** | Priority: medium
59
+
60
+ The health check waits 1 second, then checks `launchctl list`. If the daemon crashes in under 1 second, the check still sees a PID and reports success. Fix: poll for up to 5 seconds and verify the daemon is still running at the end of the window.
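The fix can be modeled as a pure decision over a series of liveness samples (e.g. one `launchctl list` probe every ~500ms across the 5-second window); the sampling loop and function name are illustrative:

```typescript
// Success requires the daemon to survive the whole window: an early sample
// seeing a PID is not enough, only the final sample decides.
function startupVerdict(aliveSamples: readonly boolean[]): 'success' | 'crashed' {
  if (aliveSamples.length === 0) return 'crashed';
  return aliveSamples[aliveSamples.length - 1] ? 'success' : 'crashed';
}
```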
61
+
62
+ ---
63
+
64
+ ### Handoff block not surfaced to operator
65
+
66
+ **Status: ux gap** | Priority: medium
67
+
68
+ The agent writes a complete handoff block (commitType, prTitle, prBody, filesChanged) to the session store, but it is invisible to the operator without digging through event logs. Fix: `worktrain status <sessionId>` should show it, and the console session detail should surface it prominently.
69
+
70
+ ---
71
+
72
+ ## WorkTrain Daemon
73
+
74
+ The autonomous workflow runner (`worktrain daemon`). Completely separate from the MCP server -- calls the engine directly in-process.
75
+
76
+
77
+ ### `wr.refactoring` workflow (Apr 28, 2026)
78
+
79
+ **Status: idea** | Priority: medium
80
+
81
+ A dedicated `wr.refactoring` workflow for structural refactors that don't change behavior. Distinct from `wr.coding-task` because refactors have a different shape: no new features, no bug fixes, just architecture alignment. The workflow should enforce:
82
+ - **Discovery phase**: understand current state, identify violations, classify scope
83
+ - **Test-first phase**: write tests for any extracted pure functions BEFORE extracting them (TDD red)
84
+ - **Extraction phase**: one slice at a time, tests green after each
85
+ - **Verification phase**: full suite green, build clean, no behavior changes
86
+ - **Doc update phase**: update any reference docs that describe the changed invariants
87
+
88
+ The `wr.coding-task` workflow has too much overhead for pure refactors (design review, risk assessment gating, PR strategy) and not enough refactor-specific discipline (test-first enforcement, behavior-unchanged verification).
89
+
90
+ ---
9
91
 
10
92
  ### API key baked into launchd plist at install time (Apr 24, 2026)
11
93
 
@@ -23,9 +105,9 @@ For historical narrative and sprint journals, see `docs/history/worktrain-journa
23
105
 
24
106
  ### runWorkflow() functional core refactor -- Phase 2 (Apr 24, 2026)
25
107
 
26
- **Status: idea** | Priority: medium
108
+ **Status: done** | Shipped in PR #830 (Apr 29, 2026)
27
109
 
28
- Phase 1 landed in PR #818: extracted `tagToStatsOutcome`, `buildAgentClient`, `evaluateStuckSignals`, `SessionState`, and `finalizeSession`. `runWorkflow()` is still ~880 lines with I/O and pure logic interleaved in the setup phase.
110
+ Phase 1 landed in PR #818: extracted `tagToStatsOutcome`, `buildAgentClient`, `evaluateStuckSignals`, `SessionState`, and `finalizeSession`. Phase 2 landed in PR #830.
29
111
 
30
112
  **What remains:**
31
113
 
@@ -55,18 +137,32 @@ function buildSessionContext(
55
137
 
56
138
  The shell then does:
57
139
  1. All I/O in sequence: `loadDaemonSoul`, `loadWorkspaceContext`, `loadSessionNotes`, `git worktree add`, `executeStartWorkflow`, `parseContinueTokenOrFail`, `persistTokens`
58
- 2. One pure call: `buildSessionContext(trigger, client, modelId, soul, ctx, notes, state, ...)`
59
- 3. Run the agent loop with the returned config
60
-
61
- **Why this matters:** `buildSessionContext` would be unit-testable without any filesystem, LLM, or session store. The system prompt assembly, tool list construction, and session limit calculation are currently untestable in isolation. This is the biggest remaining testability gap.
140
+ **What Phase 2 delivered (PR #830):**
141
+ - `PreAgentSession` interface + `PreAgentSessionResult` discriminated union -- all early-exit paths type-enforced
142
+ - `buildPreAgentSession()` -- all pre-agent I/O extracted; steer+daemon registries registered after all failing I/O (FM1 invariant)
143
+ - `constructTools()` -- explicitly impure named function, `state` as explicit parameter
144
+ - `persistTokens()` returns `Promise<Result<void, PersistTokensError>>` using `src/runtime/result.ts`
145
+ - `sidecardLifecycleFor()` pure function with `assertNever` exhaustiveness
146
+ - TDZ hazard fixed: `abortRegistry.set()` now registered after `const agent = new AgentLoop()`
147
+
148
+ **Phase 3 (PRs #835, #837)** continued the refactor:
149
+ - `buildTurnEndSubscriber()` extracted -- runWorkflow() body: 539 → 426 lines
150
+ - Tool param validation at LLM boundary (8 tool factories)
151
+ - `buildAgentCallbacks()` + `buildSessionResult()` pure functions -- body: 426 → 308 lines
152
+ - Test flakiness fix: `settleFireAndForget()` + `retry: 2` in vitest config
153
+
154
+ **Still deferred:**
155
+ - `CriticalEffect<T>` / `ObservabilityEffect` type distinction
156
+ - `StateRef` mutation wrapper
157
+ - Zod tool param validation (replacing manual typeof checks -- requires zodToJsonSchema or two sources of truth)
158
+ - `wr.refactoring` workflow (see backlog entry above)
62
159
 
63
- **Prerequisite:** Phase 1 (PR #818) -- done. `SessionState` already exists and can be passed in.
160
+ ---
64
161
 
65
- **Scope:** Single file (`src/daemon/workflow-runner.ts`). New exports: `buildSessionContext`, `SessionContext`. No public API changes.
162
+ ## Shared / Engine
66
163
 
67
- ---
164
+ The durable session store, v2 engine, and workflow authoring features shared by all three systems.
68
165
 
69
- ## Engine and MCP
70
166
 
71
167
  ### Improve commit SHA gathering consistency in wr.coding-task
72
168
 
@@ -264,7 +360,10 @@ Surface in: `worktrain status`, `worktrain health <sessionId>`, console session
264
360
 
265
361
  ---
266
362
 
267
- ## Daemon and Coordinator
363
+ ## WorkTrain Daemon -- Coordinator patterns
364
+
365
+ Coordinator design patterns for WorkTrain's autonomous pipeline.
366
+
268
367
 
269
368
  ### Event-driven agent coordination (coordinator as event bus)
270
369
 
@@ -516,6 +615,12 @@ Step-level `systemPrompt` overrides workflow-level for that step.
516
615
 
517
616
  ---
518
617
 
618
+ ## WorkRail MCP Server
619
+
620
+ The stdio/HTTP MCP server that Claude Code (and other MCP clients) connect to. MUST be bulletproof -- crashes kill all in-flight Claude Code sessions.
621
+
622
+
623
+
519
624
  ## Console
520
625
 
521
626
  ### Console interactivity and liveliness
@@ -562,6 +667,40 @@ Ghost nodes represent steps that were compiled into the DAG but skipped at runti
562
667
 
563
668
  ## Workflow Library
564
669
 
670
+ ### General-purpose workflow / intelligent dispatcher
671
+
672
+ **Status: idea** | Priority: medium
673
+
674
+ Two related ideas:
675
+
676
+ **`wr.quick-task`** -- the simplest possible workflow. Two steps: do the work, call `complete_step`. No complexity routing, no design review, no phased implementation. For tasks under ~10 minutes. Currently small tasks go through `wr.coding-task`'s Small fast-path, which is still heavier than needed.
677
+
678
+ **`wr.dispatch`** -- an intelligent routing workflow. Given a goal, classify it and route to the right workflow: `wr.quick-task` | `wr.research` | `wr.coding-task` | `wr.mr-review` | `wr.competitive-analysis`. The general-purpose entry point -- not a workflow that does everything, but one that decides which workflow to use. The adaptive pipeline coordinator already does this for the queue-poll trigger; the question is whether to expose it as a named user-facing workflow.
679
+
680
+ Open questions: does `wr.dispatch` replace `workflowId` in trigger config, or coexist alongside it? How does it handle tasks that don't fit any known workflow?
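Whatever answers those open questions get, the routing core of `wr.dispatch` is a small table from classification label to workflow id. A sketch using the five workflows named above (the classification step itself, LLM or heuristic, is out of scope here, and `GoalKind` is a hypothetical name):

```typescript
// Hypothetical routing table for wr.dispatch.
type GoalKind = 'quick' | 'research' | 'coding' | 'mr-review' | 'competitive-analysis';

const ROUTES: Record<GoalKind, string> = {
  quick: 'wr.quick-task',
  research: 'wr.research',
  coding: 'wr.coding-task',
  'mr-review': 'wr.mr-review',
  'competitive-analysis': 'wr.competitive-analysis',
};

function dispatch(kind: GoalKind): string {
  return ROUTES[kind];
}
```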
681
+
682
+ ---
683
+
684
+ ### MR review session count inflation
685
+
686
+ **Status: idea** | Priority: medium
687
+
688
+ A single PR review dispatches 6-12 autonomous sessions (one per reviewer family: correctness_invariants, runtime_production_risk, missed_issue_hunter, etc.). This inflates session counts, complicates cost attribution, and makes ROI calculations imprecise. Worth investigating: are all 6 families catching distinct issues, or is there significant overlap? Should families be parallelized into a single session with sub-agents rather than separate top-level sessions?
689
+
690
+ ---
691
+
692
+ ### Session trigger source attribution (daemon vs MCP)
693
+
694
+ **Status: idea** | Priority: high
695
+
696
+ No reliable way to determine whether a session was started by the daemon (WorkTrain) or a human via MCP (Claude Code). Every session-level metric and ROI calculation is ambiguous without this.
697
+
698
+ **Fix:** Add `triggerSource: 'daemon' | 'mcp'` to `run_started` event data. One-line change at each entry point, makes attribution permanent and queryable from the event log.
699
+
700
+ Files: `src/v2/durable-core/schemas/session/events.ts`, `src/mcp/handlers/v2-execution/start-workflow.ts`, `src/daemon/workflow-runner.ts`.
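The proposed field can be sketched as a discriminant plus one small builder per entry point. This only illustrates the shape; the real event schema lives in `src/v2/durable-core/schemas/session/events.ts` and its actual field names are not assumed here beyond `triggerSource`:

```typescript
// Sketch of the proposed triggerSource discriminant on run_started.
type TriggerSource = 'daemon' | 'mcp';

interface RunStartedEvent {
  type: 'run_started';
  sessionId: string;
  triggerSource: TriggerSource; // set at each entry point, queryable forever
}

function runStarted(sessionId: string, triggerSource: TriggerSource): RunStartedEvent {
  return { type: 'run_started', sessionId, triggerSource };
}
```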
701
+
702
+ ---
703
+
565
704
  ### Standup status generator
566
705
 
567
706
  **Status: idea** | Priority: low
@@ -962,6 +1101,8 @@ WorkTrain is a persistent background daemon that initiates workflows autonomousl
962
1101
  - `worktrain init` soul setup
963
1102
  - Per-trigger crash safety (`persistTokens`)
964
1103
  - Worktree orphan cleanup on delivery failure
1104
+ - runWorkflow() Phase 2 architecture (PR #830): `PreAgentSession`/`buildPreAgentSession`, `constructTools`, `persistTokens` Result type, `sidecardLifecycleFor` pure function, TDZ hazard fix for abort registry
1105
+ - runWorkflow() Phase 3 architecture (PRs #835, #837): `buildTurnEndSubscriber` (539→426 lines), tool param validation at LLM boundary (8 factories), `buildAgentCallbacks` + `buildSessionResult` pure functions (426→308 lines), test flakiness fix (settleFireAndForget + retry:2)
965
1106
 
966
1107
  ### WorkRail engine / MCP features
967
1108
 
@@ -969,6 +1110,7 @@ WorkTrain is a persistent background daemon that initiates workflows autonomousl
969
1110
 
970
1111
  - Assessment gates v1 with consequences
971
1112
  - Loop control -- all four types (`while`, `until`, `for`, `forEach`) implemented
1113
+ - Fix: sequential `artifact_contract` while loops -- stale stop artifacts from earlier loops no longer contaminate later loops (PR #830). Root cause: `collectArtifactsForEvaluation()` passed full session history to `interpreter.next()`; fix passes only `inputArtifacts` (current step's submitted artifacts).
972
1114
  - Subagent guidance feature
973
1115
  - References system (local file refs)
974
1116
  - Routine/templateCall injection
@@ -1012,3 +1154,14 @@ The agent is expensive, inconsistent, and slow. Scripts are free, deterministic,
1012
1154
  ### Metrics outcome validation
1013
1155
 
1014
1156
  **Status: done** -- `checkContextBudget` validates `metrics_outcome` enum (PR f0a1822a). SHA validation (Gap 3 above) is still open.
1157
+
1158
+ ### wr.coding-task architecture enforcement + retrospective (v1.3.0)
1159
+
1160
+ **Status: done** -- shipped in PR #830 (Apr 29, 2026)
1161
+
1162
+ - Phase 0 architecture alignment check: agent scans candidate files and names philosophy violations explicitly by function name; captures `architectureViolations` and `architectureStartsFromScratch`
1163
+ - Phase 1c conditional fragment: when `architectureStartsFromScratch = true`, blocks adapting existing violations as valid design candidates
1164
+ - Phase 8 post-implementation retrospective: runs for all tasks (no complexity gate); four practical questions applicable to any task; requires 2-4 concrete observations with explicit disposition
1165
+
1166
+ ---
1167
+
@@ -52,19 +52,23 @@ Each `runWorkflow()` call writes a per-session sidecar file at `~/.workrail/daem
52
52
 
53
53
  ### 2.1 Sidecar is written before the agent loop starts
54
54
 
55
- `persistTokens()` is called immediately after `executeStartWorkflow()` succeeds and the `continueToken` is available. A crash between `executeStartWorkflow()` returning and the first LLM call is recoverable.
55
+ `persistTokens()` is called inside `buildPreAgentSession()` immediately after `executeStartWorkflow()` succeeds and the `continueToken` is available. `buildPreAgentSession()` returns `{ kind: 'complete', result: { _tag: 'error' } }` if `persistTokens()` fails -- no agent loop starts without a valid sidecar.
56
+
57
+ `persistTokens()` returns `Promise<Result<void, PersistTokensError>>` (not throws). Callers in the setup phase treat `err` as fatal (abort); callers inside tool closures treat `err` as degraded-but-continue (log and still call `onAdvance`/`onTokenUpdate` -- see invariant 4.3).
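The two caller policies can be sketched against the Result shape. The `_tag` discriminant matches the `{ _tag: 'error' }` usage above; the exact shape of `src/runtime/result.ts` beyond that is an assumption, as are the policy function names:

```typescript
// Assumed Result shape (grounded only in the `_tag` discriminant seen above).
type Result<T, E> = { _tag: 'ok'; value: T } | { _tag: 'error'; error: E };
type PersistTokensError = { reason: string };

// Setup-phase policy: a persist failure before the agent loop is fatal.
function setupPolicy(r: Result<void, PersistTokensError>): 'continue' | 'abort' {
  return r._tag === 'ok' ? 'continue' : 'abort';
}

// Tool-closure policy (invariant 4.3): persist failure degrades crash
// recovery, but the session is live -- log and continue.
function toolClosurePolicy(r: Result<void, PersistTokensError>): 'continue' {
  if (r._tag === 'error') console.warn('persistTokens failed:', r.error.reason);
  return 'continue';
}
```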
56
58
 
57
59
  **Exception:** If `continueToken` is undefined (instant single-step completion, or `_preAllocatedStartResponse` with no token), `persistTokens()` is skipped. There is nothing to recover.
58
60
 
59
61
  ### 2.2 Sidecar is deleted on every non-worktree terminal path
60
62
 
63
+ The sidecar lifecycle decision is delegated to `sidecardLifecycleFor(tag, branchStrategy)` in `workflow-runner.ts`. That function is the authoritative source for this table; its `assertNever` default case ensures a compile error when `WorkflowRunResult` gains new variants without updating the rules.
64
+
61
65
  | Outcome | Sidecar deleted? |
62
66
  |---|---|
63
- | `success` (non-worktree) | Yes -- in `runWorkflow()` before returning |
67
+ | `success` (non-worktree) | Yes -- `finalizeSession()` deletes via `sidecardLifecycleFor` |
64
68
  | `success` (worktree) | No -- `TriggerRouter.maybeRunDelivery()` deletes it after delivery |
65
- | `error` | Yes |
66
- | `timeout` | Yes |
67
- | `stuck` | Yes |
69
+ | `error` | Yes -- `finalizeSession()` deletes via `sidecardLifecycleFor` |
70
+ | `timeout` | Yes -- `finalizeSession()` deletes via `sidecardLifecycleFor` |
71
+ | `stuck` | Yes -- `finalizeSession()` deletes via `sidecardLifecycleFor` |
68
72
 
69
73
  **Why worktree sessions differ:** Delivery (git commit, git push, gh pr create) runs inside the worktree after `runWorkflow()` returns. The sidecar must exist until delivery completes so `runStartupRecovery()` can find the worktree path if the daemon crashes during delivery.
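The table above can be reconstructed as a pure function with `assertNever` exhaustiveness, in the spirit of `sidecardLifecycleFor`. This is an illustrative sketch, not the real signature in `workflow-runner.ts`:

```typescript
// Hypothetical reconstruction of the sidecar lifecycle table.
type OutcomeTag = 'success' | 'error' | 'timeout' | 'stuck';
type BranchStrategy = 'worktree' | 'in-place';

function assertNever(x: never): never {
  throw new Error(`unhandled outcome: ${String(x)}`);
}

function sidecarAction(tag: OutcomeTag, branch: BranchStrategy): 'delete' | 'keep' {
  switch (tag) {
    case 'success':
      // Worktree success: the sidecar must survive until delivery completes.
      return branch === 'worktree' ? 'keep' : 'delete';
    case 'error':
    case 'timeout':
    case 'stuck':
      return 'delete';
    default:
      return assertNever(tag); // compile error if OutcomeTag gains a variant
  }
}
```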
70
74
 
@@ -92,11 +96,21 @@ Three registries track in-flight daemon sessions:
92
96
  | `SteerRegistry` | `workrailSessionId` | `(text: string) => void` | Mid-session coordinator injection |
93
97
  | `AbortRegistry` | `workrailSessionId` | `() => void` | SIGTERM graceful shutdown |
94
98
 
95
- ### 3.1 All registries are deregistered in the `finally` block
99
+ ### 3.1 Registry registration and deregistration
100
+
101
+ **Registration** happens in two places:
102
+
103
+ - `steerRegistry` and `DaemonRegistry` are registered inside `buildPreAgentSession()` -- AFTER all potentially-failing I/O (executeStartWorkflow, persistTokens, worktree creation). Error paths that return before registration have nothing to clean up. The single-step completion path (which returns success without running an agent loop) explicitly calls `steerRegistry.delete()` and `daemonRegistry.unregister()` before returning.
104
+
105
+ - `abortRegistry` is registered in `runWorkflow()` immediately after `const agent = new AgentLoop(...)`. The closure `() => agent.abort()` references `agent` -- registering before agent construction would be a TDZ hazard.
96
106
 
97
- `steerRegistry.delete()` and `abortRegistry.delete()` are called in the `finally` block of `runWorkflow()`. This ensures cleanup happens even if an exception is thrown in the agent loop or in the post-finally result handling.
107
+ **Deregistration**:
98
108
 
99
- **Why `finally` and not per-result-path:** A stale steer or abort callback on a dead session would cause `POST /sessions/:id/steer` to return 200 (calling the closed-over callback) or the shutdown handler to call `abort()` on an already-exited session. Both are silent correctness bugs.
109
+ - `steerRegistry.delete()` and `abortRegistry.delete()` are called in the `finally` block of `runWorkflow()`. This ensures cleanup happens even if an exception is thrown in the agent loop.
110
+
111
+ - `daemonRegistry.unregister()` is called at each result path (success, error, timeout, stuck) via `finalizeSession()`. It is NOT in `finally` because the completion status ('completed' vs 'failed') differs by path.
112
+
113
+ **Why stale entries are bugs:** a stale steer callback on a dead session makes `POST /sessions/:id/steer` return 200 (calling the closed-over callback) instead of 404. A stale abort callback makes the shutdown handler call `abort()` on an already-exited session. Both are silent correctness bugs.
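The `finally`-block discipline above guarantees cleanup even when the loop throws. A minimal sketch (registry and session names are illustrative):

```typescript
// Deregistration in `finally`: a throw inside loop() cannot leave a stale
// steer callback behind.
function runSession(
  sessionId: string,
  steerRegistry: Map<string, (text: string) => void>,
  loop: () => void,
): void {
  steerRegistry.set(sessionId, (text) => console.log('steer:', text));
  try {
    loop();
  } finally {
    steerRegistry.delete(sessionId); // runs even when loop() throws
  }
}
```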
100
114
 
101
115
  ### 3.2 `DaemonRegistry` is unregistered at every result path
102
116
 
@@ -110,7 +124,11 @@ If `parseContinueTokenOrFail()` fails (unusual -- the token just came from `exec
110
124
 
111
125
  ### 3.4 Registration gap is documented
112
126
 
113
- There is a ~50ms window between `executeStartWorkflow()` returning and `steerRegistry.set()` being called (after `parseContinueTokenOrFail()` completes). A `POST /sessions/:id/steer` call in this window receives 404. Coordinators should retry once on 404 during session startup.
127
+ **SteerRegistry gap (~50ms):** There is a ~50ms window between `executeStartWorkflow()` returning and `steerRegistry.set()` being called (after `parseContinueTokenOrFail()` completes). A `POST /sessions/:id/steer` call in this window receives 404. Coordinators should retry once on 404 during session startup.
128
+
129
+ **AbortRegistry gap (~200-500ms):** `abortRegistry.set()` is registered _after_ `const agent = new AgentLoop(...)` is constructed, which happens after the context-loading phase (`loadDaemonSoul`, `loadWorkspaceContext`, `loadSessionNotes` in parallel). This means there is a ~200-500ms window where SIGTERM will not abort an in-flight session. Sessions in this window run to completion or hit the wall-clock timeout.
130
+
131
+ **Why the abort gap is wider than the steer gap:** `abortRegistry.set` registers `() => agent.abort()`, which closes over `agent`. Registering this callback before `agent` is constructed would be a TDZ (Temporal Dead Zone) hazard: `agent` is declared with `const`, so a shutdown handler firing before initialization would invoke the closure while `agent` is still in the TDZ and throw a ReferenceError. Registering after `agent` construction eliminates the hazard at the cost of a wider registration window. The accepted tradeoff is the same as for the steer gap.
114
132
 
115
133
  ---
116
134
 
@@ -134,6 +152,8 @@ Both are guarded by the sequential tool execution invariant (no concurrent token
134
152
 
135
153
  `persistTokens()` is called inside `makeCompleteStepTool.execute()` and `makeContinueWorkflowTool.execute()` before `onAdvance()` or `onTokenUpdate()` are called. A crash between the engine returning a new token and `persistTokens()` completing would leave an unrecoverable state.
136
154
 
155
+ `persistTokens()` returns `Promise<Result<void, PersistTokensError>>`. On `err` inside a tool closure, the policy is **log and continue** -- `onAdvance()` / `onTokenUpdate()` are still called even when persistence fails. Rationale: a persist failure degrades crash recovery but the session is still live. Killing the session on persist failure would lose in-progress work, which is strictly worse.
156
+
137
157
  **Note:** The sidecar write uses the atomic temp-rename pattern (`writeFile(tmp) → rename(tmp, final)`) to prevent corrupt partial writes.
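The temp-rename pattern named in the note can be sketched directly; a synchronous version for clarity (the real write is presumably async, and the function name is illustrative):

```typescript
import { writeFileSync, renameSync, readFileSync, mkdtempSync } from 'node:fs';
import { join } from 'node:path';
import { tmpdir } from 'node:os';

// Atomic temp-rename: write the full payload to a temp file, then rename over
// the final path. rename() within one filesystem is atomic on POSIX, so a
// reader (or crash-recovery scan) never observes a partially written sidecar.
function writeSidecarAtomic(finalPath: string, payload: string): void {
  const tmp = `${finalPath}.tmp`;
  writeFileSync(tmp, payload, 'utf8');
  renameSync(tmp, finalPath); // atomic replace of any previous sidecar
}
```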
138
158
 
139
159
  ### 4.4 Stuck detection is non-blocking for the session result
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@exaudeus/workrail",
3
- "version": "3.71.0",
3
+ "version": "3.72.0",
4
4
  "description": "Step-by-step workflow enforcement for AI agents via MCP",
5
5
  "license": "MIT",
6
6
  "repository": {
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "id": "wr.coding-task",
3
3
  "name": "Agentic Task Dev Workflow",
4
- "version": "1.2.0",
4
+ "version": "1.3.0",
5
5
  "description": "Use this to implement a software feature or task. Follows a plan-then-execute approach with architecture decisions, invariant tracking, and final verification.",
6
6
  "about": "## Agentic Coding Task Workflow\n\nThis workflow structures the full lifecycle of a software implementation task: from understanding and classifying the work, through architecture decisions and incremental implementation, to final verification and handoff.\n\n### What it does\n\nThe workflow guides an AI agent through a disciplined plan-then-execute process. It begins by analyzing the task to determine complexity, risk, and the right level of rigor (QUICK, STANDARD, or THOROUGH). For non-trivial tasks, it then gathers codebase context, surfaces invariants and non-goals, generates competing design candidates, and selects an approach before writing a single line of code. Implementation proceeds slice by slice, with built-in verification gates after each slice. A final integration verification pass confirms acceptance criteria are met before handoff.\n\n### Upstream context (Phase 0.5)\n\nPhase 0.5 looks for any upstream document that has already defined what to build -- a Shape Up pitch, PRD, BRD, RFC, design doc, user story with acceptance criteria, Jira epic, or equivalent. The agent uses whatever tools are available (repo search, WebFetch, Confluence/Notion/Glean MCPs, Memory MCP) to find it. If found, two flags are set: `upstreamSpecDetected` (something exists) and `solutionFixed` (whether the document commits to a specific technical direction). When `solutionFixed = true`, design ideation phases (1a-1c) are skipped and Phase 1d translates the upstream constraints directly into an engineering approach. When `solutionFixed = false`, design ideation runs normally but is constrained by whatever the upstream document does specify. The plan audit (Phase 4) checks for drift against `upstreamBoundaries` whenever an upstream document was found.\n\n### When to use it\n\nUse this workflow whenever you are implementing a feature, fixing a non-trivial bug, or making an architectural change in a real codebase. 
It is especially valuable when:\n- The task touches multiple files or systems\n- There is meaningful risk of regressions or invariant violations\n- You want the agent to surface trade-offs and commit to a reasoned design decision rather than guessing\n- You need a resumable, auditable record of what was decided and why\n\nFor quick one-liner fixes or very small changes, the workflow includes a fast path that skips heavyweight planning.\n\n### What it produces\n\n- An `implementation_plan.md` artifact covering the selected approach, vertical slices, test design, and philosophy alignment\n- A `spec.md` for large or high-risk tasks, capturing observable behavior and acceptance criteria\n- Step-level notes in WorkRail that serve as a durable execution log\n- A PR-ready handoff summary with acceptance criteria status, invariant proofs, and follow-up tickets\n\n### How to get good results\n\n- Provide a clear task description and at least partial acceptance criteria before starting\n- If you have coding philosophy or project conventions configured in session rules or Memory MCP, the workflow will apply them automatically as a design lens\n- Let the workflow classify complexity and rigor itself; override only if the classification is clearly wrong\n- For large or high-risk tasks, review the architecture decision step before implementation begins",
7
7
  "examples": [
@@ -143,7 +143,7 @@
143
143
  "SUBAGENT SYNTHESIS: treat subagent output as evidence, not conclusions. State your hypothesis before delegating, then interrogate what came back: what was missed, wrong, or new? Say what changed your mind or what you still reject, and why.",
144
144
  "PARALLELISM: when reads, audits, or delegations are independent, run them in parallel inside the phase. Parallelize cognition; serialize synthesis and canonical writes.",
145
145
  "PHILOSOPHY LENS: apply the user's coding philosophy (from active session rules) as the evaluation lens. Flag violations by principle name, not as generic feedback. If principles conflict, surface the tension explicitly instead of silently choosing.",
146
- "VALIDATION: prefer static/compile-time safety over runtime checks. Use build, type-checking, and tests as the primary proof of correctness \u2014 in that order of reliability.",
146
+ "VALIDATION: prefer static/compile-time safety over runtime checks. Use build, type-checking, and tests as the primary proof of correctness in that order of reliability.",
147
147
  "DRIFT HANDLING: when reality diverges from the plan, update the plan artifact and re-audit deliberately rather than accumulating undocumented drift.",
148
148
  "NEVER COMMIT MARKDOWN FILES UNLESS USER EXPLICITLY ASKS.",
149
149
  "SLICE DISCIPLINE: Phase 6 is a loop -- implement ONE slice per iteration. Do not implement multiple slices at once. The verification loop exists to catch drift per slice, not retroactively."
@@ -152,7 +152,7 @@
152
152
  {
153
153
  "id": "phase-0-understand-and-classify",
154
154
  "title": "Phase 0: Understand & Classify",
155
- "prompt": "Understand this before you touch anything.\n\nMake sure the expected behavior is clear enough to proceed. If it really isn't, ask me only what you can't answer yourself. Don't ask me things you can find with tools.\n\nThen dig through the code. Figure out:\n- where this starts and what the call chain looks like\n- which files, modules, and functions matter\n- what patterns this should follow\n- how this repo verifies similar work\n- what the real risks, invariants, and non-goals are\n\nFigure out what philosophy to use while doing the work. Prefer, in order: Memory MCP (`mcp_memory_conventions`, `mcp_memory_prefer`, `mcp_memory_recall`), active session/Firebender rules, repo patterns, then me only if those still conflict or aren't enough.\n\nRecord where that philosophy lives, not a summary. If the stated rules and repo patterns disagree, capture the conflict.\n\nOnce you actually understand the task, classify it:\n- `taskComplexity`: Small / Medium / Large\n- `riskLevel`: Low / Medium / High\n- `rigorMode`: QUICK / STANDARD / THOROUGH\n- `automationLevel`: High / Medium / Low\n- `prStrategy`: SinglePR / MultiPR\n\nUse this guidance:\n- QUICK: small, low-risk, clear path, little ambiguity\n- STANDARD: medium scope or moderate risk\n- THOROUGH: large scope, architectural uncertainty, or high-risk change\n\nThen force a context-clarity check. 
Score each from 0-2 and give one sentence of evidence for each score:\n- `entryPointClarity`: 0 = clear entry point and call chain, 1 = partial chain with gaps, 2 = still unclear where behavior starts or flows\n- `boundaryClarity`: 0 = clear boundary, 1 = likely boundary but some uncertainty, 2 = patch-vs-boundary decision still unclear\n- `invariantClarity`: 0 = important invariants are explicit, 1 = some are inferred or uncertain, 2 = important invariants are still unclear\n- `verificationClarity`: 0 = clear deterministic verification path, 1 = partial verification path, 2 = verification is still weak or unclear\n\nUse the rubric, not vibes:\n- QUICK: do not run the deeper context batch; if the rubric says you're missing too much context, your classification is probably wrong and you should reclassify upward before moving on\n- STANDARD: run the deeper context batch if the total score is 3 or more, or if `boundaryClarity`, `invariantClarity`, or `verificationClarity` is 2\n- THOROUGH: always run the deeper context batch\n\nThe deeper context batch is:\n- `routine-context-gathering` with `focus=COMPLETENESS`\n- `routine-context-gathering` with `focus=DEPTH`\n\nAfter the batch, synthesize what changed, what stayed the same, and what is still unknown. If the extra context changes the classification, update it before you leave this step.\n\nCapture:\n- `taskComplexity`\n- `riskLevel`\n- `rigorMode`\n- `automationLevel`\n- `prStrategy`\n- `contextSummary`\n- `candidateFiles`\n- `invariants`\n- `nonGoals`\n- `openQuestions` (only real human-decision questions)\n- `philosophySources`\n- `philosophyConflicts`",
155
+ "prompt": "Understand this before you touch anything.\n\nMake sure the expected behavior is clear enough to proceed. If it really isn't, ask me only what you can't answer yourself. Don't ask me things you can find with tools.\n\nThen dig through the code. Figure out:\n- where this starts and what the call chain looks like\n- which files, modules, and functions matter\n- what patterns this should follow\n- how this repo verifies similar work\n- what the real risks, invariants, and non-goals are\n\nFigure out what philosophy to use while doing the work. Prefer, in order: Memory MCP (`mcp_memory_conventions`, `mcp_memory_prefer`, `mcp_memory_recall`), active session/Firebender rules, repo patterns, then me only if those still conflict or aren't enough.\n\nRecord where that philosophy lives, not a summary. If the stated rules and repo patterns disagree, capture the conflict.\n\nOnce you actually understand the task, classify it:\n- `taskComplexity`: Small / Medium / Large\n- `riskLevel`: Low / Medium / High\n- `rigorMode`: QUICK / STANDARD / THOROUGH\n- `automationLevel`: High / Medium / Low\n- `prStrategy`: SinglePR / MultiPR\n\nUse this guidance:\n- QUICK: small, low-risk, clear path, little ambiguity\n- STANDARD: medium scope or moderate risk\n- THOROUGH: large scope, architectural uncertainty, or high-risk change\n\nThen force a context-clarity check. 
Score each from 0-2 and give one sentence of evidence for each score:\n- `entryPointClarity`: 0 = clear entry point and call chain, 1 = partial chain with gaps, 2 = still unclear where behavior starts or flows\n- `boundaryClarity`: 0 = clear boundary, 1 = likely boundary but some uncertainty, 2 = patch-vs-boundary decision still unclear\n- `invariantClarity`: 0 = important invariants are explicit, 1 = some are inferred or uncertain, 2 = important invariants are still unclear\n- `verificationClarity`: 0 = clear deterministic verification path, 1 = partial verification path, 2 = verification is still weak or unclear\n\nUse the rubric, not vibes:\n- QUICK: do not run the deeper context batch; if the rubric says you're missing too much context, your classification is probably wrong and you should reclassify upward before moving on\n- STANDARD: run the deeper context batch if the total score is 3 or more, or if `boundaryClarity`, `invariantClarity`, or `verificationClarity` is 2\n- THOROUGH: always run the deeper context batch\n\nThe deeper context batch is:\n- `routine-context-gathering` with `focus=COMPLETENESS`\n- `routine-context-gathering` with `focus=DEPTH`\n\nAfter the batch, synthesize what changed, what stayed the same, and what is still unknown. If the extra context changes the classification, update it before you leave this step.\n\nCapture:\n- `taskComplexity`\n- `riskLevel`\n- `rigorMode`\n- `automationLevel`\n- `prStrategy`\n- `contextSummary`\n- `candidateFiles`\n- `invariants`\n- `nonGoals`\n- `openQuestions` (only real human-decision questions)\n- `philosophySources`\n- `philosophyConflicts`\n\n**Architecture alignment check (do this last, after candidateFiles is known):**\n\nFor each candidate file, scan for violations of the philosophy you just discovered. Name each violation explicitly and specifically (e.g. 
\"runWorkflow() is 4900 lines -- violates compose-with-small-pure-functions\", \"tool factories use params: any -- violates validate-at-boundaries\"). Do not assert absence without checking. If you found no violations, list the principles you checked and why each passes.\n\nThen decide:\n- `architectureViolations`: list of specific violations found (may be empty)\n- `architectureStartsFromScratch`: true if violations are significant enough that the correct design starts from the user's philosophy rather than adapting the existing code. False if violations are minor or out of scope for this task.\n\nIf `architectureStartsFromScratch` is true, the design phase will be constrained: Candidate A (simplest) must still honor the philosophy -- adapting an existing violation is not a valid candidate. Record this now so the design phase uses it.",
156
156
  "requireConfirmation": {
157
157
  "or": [
158
158
  {
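The Phase 0 prompt above defines a 0-2 context-clarity rubric and rules for when each rigor mode runs the deeper context batch. A minimal sketch of that gate, with hypothetical names (`ClarityScores`, `shouldRunDeeperBatch` are illustrative, not the package's API):

```typescript
// Illustrative model of the context-clarity gate from the Phase 0 prompt.
// Type and function names here are assumptions, not exported by workrail.

type ClarityScores = {
  entryPointClarity: 0 | 1 | 2;
  boundaryClarity: 0 | 1 | 2;
  invariantClarity: 0 | 1 | 2;
  verificationClarity: 0 | 1 | 2;
};

type RigorMode = "QUICK" | "STANDARD" | "THOROUGH";

function clarityTotal(s: ClarityScores): number {
  return (
    s.entryPointClarity +
    s.boundaryClarity +
    s.invariantClarity +
    s.verificationClarity
  );
}

// QUICK never runs the deeper batch; THOROUGH always does; STANDARD runs it
// when the total is 3+ or any judgment-heavy dimension scores 2.
function shouldRunDeeperBatch(mode: RigorMode, s: ClarityScores): boolean {
  if (mode === "THOROUGH") return true;
  if (mode === "QUICK") return false;
  return (
    clarityTotal(s) >= 3 ||
    s.boundaryClarity === 2 ||
    s.invariantClarity === 2 ||
    s.verificationClarity === 2
  );
}
```

Note the QUICK asymmetry: a bad rubric score under QUICK does not trigger the batch; per the prompt, it triggers reclassification upward instead.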
@@ -218,7 +218,7 @@
218
218
  },
219
219
  {
220
220
  "id": "phase-1b-design-deep",
221
- "title": "Phase 1b: Design Generation (Injected Routine \u2014 Tension-Driven Design)",
221
+ "title": "Phase 1b: Design Generation (Injected Routine -- Tension-Driven Design)",
222
222
  "runCondition": {
223
223
  "and": [
224
224
  {
@@ -257,7 +257,7 @@
257
257
  }
258
258
  ]
259
259
  },
260
- "prompt": "Read `design-candidates.md`, compare it to your original guess, and make the call.\n\nBe explicit about three things:\n- what the design work confirmed\n- what changed your mind\n- what you missed the first time\n\nThen pressure-test the leading option:\n- what's the strongest case against it?\n- what assumption breaks it?\n\nAfter the challenge batch, say:\n- what changed your mind\n- what didn't\n- which findings you reject and why\n\nPick the approach yourself. Don't hide behind the artifact. If the simplest thing works, prefer it. If the front-runner stops looking right after challenge, switch.\n\nCapture:\n- `selectedApproach` \u2014 chosen design with rationale tied to tensions\n- `runnerUpApproach` \u2014 next-best option and why it lost\n- `architectureRationale` \u2014 tensions resolved vs accepted\n- `pivotTriggers` \u2014 conditions under which you'd switch to the runner-up\n- `keyRiskToMonitor` \u2014 failure mode of the selected approach\n- `acceptedTradeoffs`\n- `identifiedFailureModes`",
260
+ "prompt": "Read `design-candidates.md`, compare it to your original guess, and make the call.\n\nBe explicit about three things:\n- what the design work confirmed\n- what changed your mind\n- what you missed the first time\n\nThen pressure-test the leading option:\n- what's the strongest case against it?\n- what assumption breaks it?\n\nAfter the challenge batch, say:\n- what changed your mind\n- what didn't\n- which findings you reject and why\n\nPick the approach yourself. Don't hide behind the artifact. If the simplest thing works, prefer it. If the front-runner stops looking right after challenge, switch.\n\nCapture:\n- `selectedApproach`: chosen design with rationale tied to tensions\n- `runnerUpApproach`: next-best option and why it lost\n- `architectureRationale`: tensions resolved vs accepted\n- `pivotTriggers`: conditions under which you'd switch to the runner-up\n- `keyRiskToMonitor`: failure mode of the selected approach\n- `acceptedTradeoffs`\n- `identifiedFailureModes`",
261
261
  "promptFragments": [
262
262
  {
263
263
  "id": "phase-1c-challenge-standard",
@@ -277,6 +277,14 @@
277
277
  "equals": "THOROUGH"
278
278
  },
279
279
  "text": "Also run `routine-execution-simulation` on the three most likely failure paths before you decide."
280
+ },
281
+ {
282
+ "id": "phase-1c-architecture-first",
283
+ "when": {
284
+ "var": "architectureStartsFromScratch",
285
+ "equals": true
286
+ },
287
+ "text": "Architecture-first constraint is active (`architectureStartsFromScratch = true`). Before selecting a candidate, verify: does the leading option start from the user's philosophy, or does it adapt an existing violation? Adapting a code structure that was already identified as a philosophy violation is NOT a valid candidate -- even as Candidate A (simplest). The simplest valid candidate is the simplest design that honors the philosophy."
280
288
  }
281
289
  ],
282
290
  "assessmentRefs": [
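The `phase-1c-architecture-first` fragment added in this hunk is gated by a `when` clause (`var` / `equals`) on the session variable `architectureStartsFromScratch`. A sketch of how such conditional fragments might be resolved, assuming a simple strict-equality semantics (the helper names are hypothetical, not the package's implementation):

```typescript
// Illustrative resolution of conditional promptFragments.
// Assumption: a fragment with no `when` clause is always included; otherwise
// the named context variable must strictly equal the expected value.

type WhenClause = { var: string; equals: unknown };

type PromptFragment = {
  id: string;
  when?: WhenClause;
  text: string;
};

function activeFragments(
  fragments: PromptFragment[],
  context: Record<string, unknown>
): PromptFragment[] {
  return fragments.filter(
    (f) => !f.when || context[f.when.var] === f.when.equals
  );
}
```

Under this reading, the architecture-first constraint text only reaches the agent when Phase 0 actually recorded `architectureStartsFromScratch = true`.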
@@ -421,7 +429,7 @@
421
429
  "var": "taskComplexity",
422
430
  "not_equals": "Small"
423
431
  },
424
- "prompt": "Turn the decision into a plan someone else could execute without guessing.\n\n**Open questions gate:** check `openQuestions` from Phase 0. If any remain unanswered and would materially affect implementation quality, either resolve them now with tools or record them in the risk register with an explicit decision about how to proceed without them. Do not silently carry unanswered questions into implementation.\n\nUpdate `implementation_plan.md`.\n\nIt should cover:\n1. Problem statement\n2. Acceptance criteria (mirror `spec.md` if it exists; `spec.md` owns observable behavior)\n3. Non-goals\n4. Philosophy-driven constraints\n5. Invariants\n6. Selected approach + rationale + runner-up\n7. Vertical slices\n8. Work packages only if they actually help\n9. Test design\n10. Risk register\n11. PR packaging strategy\n12. Philosophy alignment per slice:\n - [principle] -> [satisfied / tension / violated + 1-line why]\n\nCapture:\n- `implementationPlan`\n- `slices`\n- `testDesign`\n- `estimatedPRCount`\n- `followUpTickets` (initialize if needed)\n- `unresolvedUnknownCount` \u2014 count of open questions that would materially affect implementation quality\n- `planConfidenceBand` \u2014 Low / Medium / High\n\nThe plan is the deliverable for this step. Do not implement anything -- not a \"quick win\", not a file read that bleeds into edits, nothing. Execution begins in Phase 6, one slice at a time. If you find yourself writing code or editing source files right now, stop immediately.",
432
+ "prompt": "Turn the decision into a plan someone else could execute without guessing.\n\n**Open questions gate:** check `openQuestions` from Phase 0. If any remain unanswered and would materially affect implementation quality, either resolve them now with tools or record them in the risk register with an explicit decision about how to proceed without them. Do not silently carry unanswered questions into implementation.\n\nUpdate `implementation_plan.md`.\n\nIt should cover:\n1. Problem statement\n2. Acceptance criteria (mirror `spec.md` if it exists; `spec.md` owns observable behavior)\n3. Non-goals\n4. Philosophy-driven constraints\n5. Invariants\n6. Selected approach + rationale + runner-up\n7. Vertical slices\n8. Work packages only if they actually help\n9. Test design\n10. Risk register\n11. PR packaging strategy\n12. Philosophy alignment per slice:\n - [principle] -> [satisfied / tension / violated + 1-line why]\n\nCapture:\n- `implementationPlan`\n- `slices`\n- `testDesign`\n- `estimatedPRCount`\n- `followUpTickets` (initialize if needed)\n- `unresolvedUnknownCount`: count of open questions that would materially affect implementation quality\n- `planConfidenceBand`: Low / Medium / High\n\nThe plan is the deliverable for this step. Do not implement anything -- not a \"quick win\", not a file read that bleeds into edits, nothing. Execution begins in Phase 6, one slice at a time. If you find yourself writing code or editing source files right now, stop immediately.",
425
433
  "assessmentRefs": [
426
434
  "plan-completeness-gate",
427
435
  "invariant-clarity-gate",
@@ -535,7 +543,7 @@
535
543
  {
536
544
  "id": "phase-4b-loop-decision",
537
545
  "title": "Loop Exit Decision",
538
- "prompt": "Decide whether the plan needs another pass.\n\nIf `planFindings` is non-empty, keep going.\nIf it's empty, stop \u2014 but say what you checked so the clean pass means something.\nIf you've hit the limit, stop and record what still bothers you.\n\nThen emit the required loop-control artifact in this shape (`decision` must be `continue` or `stop`):\n```json\n{\n \"artifacts\": [{\n \"kind\": \"wr.loop_control\",\n \"decision\": \"continue\"\n }]\n}\n```",
546
+ "prompt": "Decide whether the plan needs another pass.\n\nIf `planFindings` is non-empty, keep going.\nIf it's empty, stop, but say what you checked so the clean pass means something.\nIf you've hit the limit, stop and record what still bothers you.\n\nThen emit the required loop-control artifact in this shape (`decision` must be `continue` or `stop`):\n```json\n{\n \"artifacts\": [{\n \"kind\": \"wr.loop_control\",\n \"decision\": \"continue\"\n }]\n}\n```",
539
547
  "requireConfirmation": true,
540
548
  "outputContract": {
541
549
  "contractRef": "wr.contracts.loop_control"
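The loop-exit step requires a `wr.loop_control` artifact whose `decision` is `continue` or `stop`, enforced via `wr.contracts.loop_control`. As a standalone sketch of that shape check, with the parser name being an assumption rather than the package's actual contract machinery:

```typescript
// Hedged sketch: one way to validate the wr.loop_control artifact shape.
// The real enforcement lives behind wr.contracts.loop_control; this checker
// is illustrative only.

type LoopControl = {
  kind: "wr.loop_control";
  decision: "continue" | "stop";
};

function parseLoopControl(raw: unknown): LoopControl {
  const artifacts = (raw as { artifacts?: unknown[] })?.artifacts;
  if (!Array.isArray(artifacts)) {
    throw new Error("missing artifacts array");
  }
  const found = artifacts.find(
    (a) => (a as { kind?: string })?.kind === "wr.loop_control"
  ) as { kind: string; decision?: string } | undefined;
  if (!found) {
    throw new Error("no wr.loop_control artifact present");
  }
  if (found.decision !== "continue" && found.decision !== "stop") {
    throw new Error('decision must be "continue" or "stop"');
  }
  return { kind: "wr.loop_control", decision: found.decision };
}
```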
@@ -693,6 +701,13 @@
693
701
  }
694
702
  }
695
703
  ]
704
+ },
705
+ {
706
+ "id": "phase-8-retrospective",
707
+ "title": "Phase 8: Retrospective",
708
+ "requireConfirmation": false,
709
+ "prompt": "The implementation is done and verified. Now look back.\n\nThis is not a re-run of tests. It is a short honest look at the work you just did.\n\nAsk yourself:\n\n1. **What would you do differently?** Now that the implementation is real, what approach, boundary, or decision looks wrong in hindsight?\n\n2. **What adjacent problems did this reveal?** Did the implementation expose gaps, tech debt, or fragile assumptions in the surrounding code that were not in scope but are worth noting?\n\n3. **What follow-up work is now visible?** What is the natural next step that became clear only after doing this work?\n\n4. **What was harder or easier than expected?** Were there surprises -- good or bad -- that would change how similar tasks are approached next time?\n\nProduce 2-4 concrete observations. Each should be specific enough to act on.\n\nFor each observation:\n- **File as follow-up**: add to backlog or open a ticket if it warrants tracking\n- **Accept**: note it explicitly if it is a known limitation you are consciously leaving\n- **Fix now**: if it is small and low-risk, fix it before closing\n\nCapture:\n- `retrospectiveObservations`: list of observations with disposition (filed/accepted/fixed)\n- `followUpTickets`: any new tickets created (append to existing list)"
696
710
  }
697
- ]
711
+ ],
712
+ "validatedAgainstSpecVersion": 3
698
713
  }
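The new Phase 8 retrospective captures observations, each with a disposition of filed, accepted, or fixed, and appends new tickets to `followUpTickets`. One possible data shape for that capture, with all names hypothetical:

```typescript
// Illustrative shape for the Phase 8 retrospective capture.
// These types and the helper are assumptions, not part of workrail's API.

type Disposition = "filed" | "accepted" | "fixed";

type RetrospectiveObservation = {
  summary: string;
  disposition: Disposition;
};

// Only observations filed as follow-ups should produce new backlog tickets.
function filedObservations(
  obs: RetrospectiveObservation[]
): RetrospectiveObservation[] {
  return obs.filter((o) => o.disposition === "filed");
}
```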