npm - @exaudeus/workrail - Versions diffs - 3.35.1 → 3.37.0 - Mend

@exaudeus/workrail 3.35.1 → 3.37.0


Files changed (55)

package/dist/config/config-file.js +2 -0
package/dist/console-ui/assets/{index-D7jQyCSD.js → index-o-p__sHJ.js} +1 -1
package/dist/console-ui/index.html +1 -1
package/dist/daemon/workflow-runner.d.ts +5 -0
package/dist/daemon/workflow-runner.js +131 -1
package/dist/manifest.json +39 -31
package/dist/mcp/handlers/v2-advance-events.js +1 -1
package/dist/mcp/handlers/v2-execution/start.d.ts +1 -0
package/dist/mcp/handlers/v2-execution/start.js +3 -2
package/dist/trigger/notification-service.d.ts +42 -0
package/dist/trigger/notification-service.js +164 -0
package/dist/trigger/trigger-listener.js +7 -1
package/dist/trigger/trigger-router.d.ts +3 -1
package/dist/trigger/trigger-router.js +4 -1
package/dist/v2/durable-core/schemas/export-bundle/index.d.ts +64 -32
package/dist/v2/durable-core/schemas/session/events.d.ts +20 -10
package/dist/v2/durable-core/schemas/session/events.js +1 -1
package/dist/v2/durable-core/schemas/session/gaps.d.ts +8 -8
package/dist/v2/durable-core/schemas/session/gaps.js +1 -1
package/docs/design/agent-behavior-patterns-discovery.md +312 -0
package/docs/design/agent-engine-communication-discovery.md +390 -0
package/docs/design/agent-loop-architecture-alternatives-discovery.md +531 -0
package/docs/design/agent-loop-error-handling-contract.md +238 -0
package/docs/design/complete-step-approach-validation-discovery.md +344 -0
package/docs/design/daemon-stuck-detection-discovery.md +174 -0
package/docs/design/mcp-server-disconnect-discovery.md +245 -0
package/docs/design/mcp-server-epipe-crash.md +198 -0
package/docs/design/notification-design-candidates.md +131 -0
package/docs/design/notification-design-review.md +84 -0
package/docs/design/notification-implementation-plan.md +181 -0
package/docs/design/spawn-agent-failure-modes.md +161 -0
package/docs/design/spawn-agent-result-handling-implementation-plan.md +186 -0
package/docs/design/stdio-simplification-design-candidates.md +341 -0
package/docs/design/stdio-simplification-design-review.md +93 -0
package/docs/design/stdio-simplification-implementation-plan.md +317 -0
package/docs/design/structured-output-tools-coexist-findings.md +288 -0
package/docs/discovery/coordinator-script-design.md +745 -0
package/docs/discovery/coordinator-ux-discovery.md +471 -0
package/docs/discovery/spawn-agent-failure-modes.md +309 -0
package/docs/discovery/workflow-selection-for-discovery-tasks.md +336 -0
package/docs/discovery/worktrain-status-briefing.md +325 -0
package/docs/discovery/worktrain-status-design-candidates.md +202 -0
package/docs/discovery/worktrain-status-design-review-findings.md +86 -0
package/docs/ideas/backlog.md +688 -1
package/docs/ideas/daemon-structured-output-vs-tool-calls.md +344 -0
package/docs/ideas/design-candidates-backlog-consolidation.md +85 -0
package/docs/ideas/design-candidates-spawn-agent-task.md +178 -0
package/docs/ideas/design-review-findings-backlog-consolidation.md +39 -0
package/docs/ideas/design-review-findings-spawn-agent-task.md +139 -0
package/docs/ideas/implementation_plan_backlog_consolidation.md +117 -0
package/docs/ideas/implementation_plan_spawn_agent.md +217 -0
package/docs/plans/authoring-doc-staleness-enforcement-candidates.md +251 -0
package/docs/plans/authoring-doc-staleness-enforcement-review.md +99 -0
package/docs/plans/authoring-doc-staleness-enforcement.md +463 -0
package/package.json +1 -1

package/docs/discovery/spawn-agent-failure-modes.md ADDED

@@ -0,0 +1,309 @@
+# Discovery: spawn_agent Failure Modes and Multi-Agent Coordination
+## Context / Ask
+**Original goal:** Discovery (precedent + failure modes angle): what can we learn from how spawn_agent actually works today, what failure modes exist, and what do competitors/references do for multi-agent coordination?
+**Problem statement:** WorkTrain needs a coordinator that can spawn fix agents, await their results, and act on outcomes -- but the failure modes of that coordination loop are not yet cataloged, and no precedent has been reviewed to validate the design choices.
+**Desired outcome:** A failure mode catalog with recommended mitigations, a "minimum viable robustness" checklist, and concrete precedent from reference architectures that directly applies to WorkTrain's coordinator design.
+## Path Recommendation
+**Chosen path:** `landscape_first`
+**Rationale:** The dominant need is grounding -- understanding how spawn_agent actually works in the current codebase, what the existing error paths look like, and what reference architectures did for similar coordination problems. Reframing is secondary (the problem is already well-scoped). Design work comes after this grounding, not before.
+Alternative paths considered:
+- `full_spectrum`: Would add a reframing step, but the problem is already well-framed. The original goal is problem-shaped, not solution-shaped.
+- `design_first`: Wrong here -- the risk is not solving the wrong problem, it is shipping a coordinator with unhandled failure modes.
+## Constraints / Anti-goals
+**Core constraints:**
+- WorkTrain's first real run must not embarrass the project -- robustness over cleverness
+- The coordinator must work with spawn_agent as it exists today, not a hypothetical redesign
+- Depth limiting must account for real timeout/hang scenarios, not just recursion counts
+**Anti-goals:**
+- Do not over-engineer for theoretical failure modes that have no evidence in the codebase
+- Do not adopt reference architecture patterns wholesale -- extract the principle, not the mechanism
+- Do not build a new event bus, message queue, or distributed coordination layer
+**Primary uncertainty:** Whether the current timeout path in worktrain-await.ts actually terminates cleanly when a spawned session hangs at max_turns.
+**Known approaches to multi-agent coordination (to evaluate):**
+- OpenClaw nexus-core: referenced in backlog deep-dive
+- pi-mono: referenced in backlog
+- Semaphore-based depth limiting (current WorkRail approach)
+- Polling loop with not_awaited outcome (current worktrain-await approach)
+## Artifact Strategy
+This document is a **human-readable artifact** -- it is for people to read and reference. It is NOT execution memory.
+- Execution truth lives in WorkRail step notes and context variables.
+- If a chat rewind occurs, the durable notes/context survive; this file may not.
+- This file is updated at each research step for readability, but workflow state does not depend on it.
+## Capability Status
+- **Delegation (subagent spawning):** Available via `mcp__nested-subagent__Task`
+- **Web browsing:** Not available (WebFetch tool not active); all research is from codebase and checked-in docs only
+## Landscape Packet
+### spawn_agent mechanics (makeSpawnAgentTool)
+**Function:** `makeSpawnAgentTool` in `src/daemon/workflow-runner.ts:1415-1591`
+**Four error paths:**
+1. **Depth limit exceeded (pre-spawn):** Synchronous check `currentDepth >= maxDepth`. Returns `{outcome: 'error', childSessionId: null}`. No child created. Fail-fast.
+2. **Child session start failure:** `executeStartWorkflow()` returns `Err`. Returns `{outcome: 'error', childSessionId: null}`.
+3. **Token decode failure (silent):** `parseContinueTokenOrFail()` fails -- logs a console warning, but child session still runs with `childSessionId: null`. Zombie risk: session runs but coordinator cannot trace it.
+4. **WorkflowRunResult variants:** `success` -> `'success'`; `error` -> `'error'`; `timeout` -> `'timeout'`; `delivery_failed` -> **`'success'`** (silent bug: work done, notification failed, treated as success).
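The fourth error path can be sketched as a plain mapping function. This is a minimal illustration under stated assumptions, not the code in `makeSpawnAgentTool`: the type names `RunResultKind` and `SpawnOutcome` and both function names are hypothetical, since the real `WorkflowRunResult` shape is not shown here. The first function mirrors the mapping described above; the second shows one way the `delivery_failed` case could surface as an error instead.

```typescript
// Hypothetical stand-ins for the result kinds listed above.
type RunResultKind = "success" | "error" | "timeout" | "delivery_failed";
type SpawnOutcome = "success" | "error" | "timeout";

// Mapping as described in the text: delivery_failed is reported as success.
function mapOutcomeCurrent(kind: RunResultKind): SpawnOutcome {
  switch (kind) {
    case "success":
      return "success";
    case "error":
      return "error";
    case "timeout":
      return "timeout";
    case "delivery_failed":
      return "success"; // the silent bug: the failed delivery is hidden
  }
}

// One possible corrected mapping: delivery_failed becomes an error,
// every other kind passes through unchanged.
function mapOutcomeFixed(kind: RunResultKind): SpawnOutcome {
  return kind === "delivery_failed" ? "error" : kind;
}
```

With the second mapping, a coordinator that branches only on `outcome` no longer needs special knowledge of delivery failures.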
+**Depth limit enforcement:**
+- Depth is passed as a closure parameter, not a global semaphore or counter.
+- Each tree path enforces independently. Siblings at the same depth do NOT share a pool.
+- Default `maxDepth = 3` (line 2207). Root sessions start at depth 0 (line 2206).
+- Enforcement is per-tree-path, not global.
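The closure-based depth check described above might look roughly like the following. This is a simplified model, not the real tool: `makeSpawnTool`, `SpawnTool`, and the synthetic session IDs are hypothetical, and no workflow is actually run. It only shows why enforcement is per tree path: each tool instance captures its own `currentDepth`, so siblings never share a counter.

```typescript
interface SpawnResult {
  outcome: "success" | "error";
  childSessionId: string | null;
}

// A spawn tool returns the result plus the tool the child would receive.
type SpawnTool = (task: string) => {
  result: SpawnResult;
  childTool: SpawnTool | null;
};

// Hypothetical factory: depth is captured in the closure, not in global state.
function makeSpawnTool(currentDepth: number, maxDepth: number): SpawnTool {
  return (task) => {
    // Fail fast before any child is created.
    if (currentDepth >= maxDepth) {
      return { result: { outcome: "error", childSessionId: null }, childTool: null };
    }
    // Synthetic ID for illustration only.
    const childSessionId = `${task}@depth${currentDepth + 1}`;
    return {
      result: { outcome: "success", childSessionId },
      childTool: makeSpawnTool(currentDepth + 1, maxDepth),
    };
  };
}
```

Under these assumptions, a root tool created with `makeSpawnTool(0, 3)` allows three nested spawns along any one path, and two sibling spawns from the same parent both succeed because neither consumes a shared budget.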
+**Semaphore bypass:**
+- `dispatch()` uses a global Semaphore for concurrency limiting.
+- `makeSpawnAgentTool` calls `runWorkflow()` directly -- **bypasses the semaphore entirely**.
+- Reason (in code comment): dispatch() is fire-and-forget; calling it from inside a running session would deadlock.
+- Child sessions pass `undefined` for `daemonRegistry` -- invisible to `worktrain status` and console live-session heartbeat.
+- Consequence: a single root session can spawn multiple children that are all untracked by daemon tooling.
+**Not used in practice:** Session store search found no spawn_agent calls in 3,278 daemon events (Apr 17-18 2026). Analysis is code-only, not runtime-observed.
+### worktrain-await.ts
+**Poll interval:** 3000ms (3 seconds). Configurable via `opts.pollInterval` for tests.
+**Default timeout:** 30 minutes (1,800,000ms). Accepts duration strings like `"30m"`, `"1h"`, `"90s"`.
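As a worked example of the duration strings above, a parser for the three listed forms could be sketched as follows. The function name `parseDurationMs` and the decision to return `null` on unrecognized input are assumptions for illustration; the actual parser in worktrain-await.ts may accept more formats or fail differently.

```typescript
// Hypothetical helper: converts strings such as "30m", "1h", "90s" to milliseconds.
function parseDurationMs(input: string): number | null {
  const match = /^(\d+)(s|m|h)$/.exec(input.trim());
  if (match === null) {
    return null; // unrecognized format
  }
  const value = Number(match[1]);
  const msPerUnit = { s: 1_000, m: 60_000, h: 3_600_000 };
  const unit = match[2] as "s" | "m" | "h";
  return value * msPerUnit[unit];
}
```

Here `"30m"` yields 1,800,000 ms, which matches the default timeout stated above.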
+**Timeout handling:** Timeout is checked once per loop iteration (not per session poll). When timeout fires:
+- All remaining pending sessions marked `{outcome: 'timeout', status: null}`.
+- Loop breaks immediately.
+- Exit code 1.
+- The coordinator receives `'timeout'` -- then what? See failure mode catalog.
+**`not_awaited` outcome:** Only fires when `--mode any`. When first session returns `success`, all other pending sessions are marked `not_awaited` with `status: null`. They were still running and healthy -- we just stopped waiting. Exit code 0.
+**Race conditions identified:**
+1. **No atomic timeout per session (high-risk):** The timeout check runs once per loop iteration, not per poll call. If 10 sessions are polled serially and the network is slow, sessions 6-10 may time out before being polled that round. The recorded `durationMs` reflects when the check ran, not when each session was last polled.
+2. **Concurrent timeout + poll result (low-risk):** Network delay between polling session A and session B could push next iteration past timeout boundary. Window is negligible at 3s poll / 30m timeout.
+3. **`--mode any` early exit with stale status (by design):** Session B may have completed 1ms after Session A but gets marked `not_awaited`. This is intentional for latency optimization.
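The loop behavior described above (one timeout check per iteration, `not_awaited` on early exit) can be modeled with a small synchronous simulation. Everything here is an assumption made for illustration: the real loop is presumably asynchronous, the `"all"` mode name is a guess at the counterpart of `--mode any`, and the injected `now`/`sleep` functions exist only so the sketch can run without real time passing.

```typescript
type PollStatus = "running" | "success" | "error";
type AwaitOutcome = "success" | "error" | "timeout" | "not_awaited";

interface AwaitOptions {
  timeoutMs: number;
  mode: "all" | "any"; // "all" is an assumed name for the default mode
  now: () => number;   // injected clock
  sleep: () => void;   // injected wait between iterations
}

function awaitSessions(
  sessionIds: string[],
  poll: (sessionId: string) => PollStatus,
  opts: AwaitOptions,
): Map<string, AwaitOutcome> {
  const startedAt = opts.now();
  const results = new Map<string, AwaitOutcome>();
  let pending = [...sessionIds];

  while (pending.length > 0) {
    // Timeout is checked once per iteration, before any session is polled.
    if (opts.now() - startedAt >= opts.timeoutMs) {
      pending.forEach((id) => results.set(id, "timeout"));
      break;
    }
    const stillRunning: string[] = [];
    let stopEarly = false;
    for (const id of pending) {
      if (stopEarly) {
        results.set(id, "not_awaited"); // never polled this round
        continue;
      }
      const status = poll(id);
      if (status === "running") {
        stillRunning.push(id);
        continue;
      }
      results.set(id, status);
      if (opts.mode === "any" && status === "success") {
        stopEarly = true;
      }
    }
    if (stopEarly) {
      // Sessions polled earlier this round and still running are abandoned too.
      stillRunning.forEach((id) => results.set(id, "not_awaited"));
      break;
    }
    pending = stillRunning;
    opts.sleep();
  }
  return results;
}
```

Because the deadline is evaluated only at the top of the loop, a slow serial poll of many sessions can overshoot it by up to one full round, which is the first race condition listed above.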
+### Reference architectures
+**OpenClaw (nexus-core)** -- Interactive, session-coordination system (NOT batch/DAG)
+- `AcpSessionStore`: In-memory, 5k sessions, 24h TTL, LRU eviction. Not durable.
+- `SessionActorQueue`: Serializes messages per session to prevent concurrent modification.
+- `SpawnAcpParams`: Minimal spawn API (task, label, agentId, resumeSessionId, cwd, mode).
+- Task flow chaining: workflow A completion auto-triggers workflow B via `linkTaskToFlowById`.
+- **Transferable:** Session actor queue pattern (serialization per session).
+- **Not transferable:** In-memory store (violates WorkRail's durability guarantee).
+**pi-mono** -- Library of coordination primitives (not a system)
+- `agentLoop()` returns `EventStream<AgentEvent, AgentMessage[]>` -- handles multi-turn without context degradation.
+- `BeforeToolCallResult`: Can block a tool call with a reason.
+- `AfterToolCallResult`: Can override tool result content.
+- `ChannelQueue` (KeyedAsyncQueue): Serializes messages per channel.
+- **Transferable:** Tool call hooks pattern (block/override tool calls from coordinator level).
+- **Not transferable:** Entire library (WorkRail has its own agent loop).
+**Claude Code (closest analog)** -- Interactive IDE agent with coordinator/subagent model
+- Coordinator holds tokens; subagents report via durable store (not context).
+- `PreToolUse` / `PostToolUse` hooks for evidence collection.
+- Three compaction tiers: session memory > full compaction > microcompaction.
+- **Transferable:** State-via-store pattern (WorkRail already does this). Evidence collection hooks.
+**LangGraph** -- Batch/DAG pipeline (LOW comparability)
+- Time-travel checkpointing (`fork` source) -- useful for WorkRail's rewind feature.
+- Interrupt mechanism: node re-runs from scratch on resume (requires idempotency) -- NOT how WorkRail works.
+- **Not transferable:** Core interrupt/resume model.
+- **Partially transferable:** Checkpoint fork pattern.
+**Temporal.io** -- Event-sourced code-defined workflows (MEDIUM comparability)
+- Worker polling vs webhook push model.
+- Workflow versioning via `patched()`.
+- **Transferable:** Crash recovery patterns, namespace isolation for multi-tenant.
+**Assumption revision:** Assumption 2 (reference architectures are batch/DAG systems) was partially wrong. OpenClaw and Claude Code are interactive session-coordination systems, making them more comparable than expected. This strengthens the transferability of their patterns.
+## Problem Frame Packet
+**Reframed problem:** What is the minimum coordinator design that handles the real failure modes in spawn_agent today, informed by precedent, without over-engineering for failures not yet observed?
+**Primary stakeholders:**
+- Etienne (WorkTrain developer and primary user) -- needs the coordinator to work on the first real run, robustness over cleverness
+- WorkTrain session initiators running autonomous fix pipelines -- need predictable outcomes and visible failures
+**Core tension:** The coordinator must be robust enough to handle failure modes, but WorkTrain's explicit anti-goal is "don't over-engineer." The minimum viable robustness point is: handle failures that would silently corrupt the coordinator's state or leave orphaned sessions. Adding observability, atomic timeouts, and session tracking beyond that is premature optimization.
+**Framing risks:**
+1. The coordinator doesn't exist yet -- all failure mode analysis covers infrastructure (spawn_agent + worktrain-await). The real design decisions happen in the coordinator layer, which is the missing piece. We may be analyzing the wrong layer.
+2. spawn_agent has never been used in practice (0 calls in 3,278 daemon events) -- all identified failure modes are theoretical. A real run may surface completely different issues.
+3. The "first real run must not embarrass" constraint could push toward over-engineering -- the right balance is a coordinator that fails loudly (not silently) and stops cleanly.
+**HMW questions:**
+- How might we design a coordinator that degrades gracefully when a spawned session times out, without requiring the coordinator to know why it timed out?
+- How might we make spawn_agent's zombie risk (silent token decode failure) visible without requiring daemon tooling changes?
+**Challenged assumptions (updated after landscape research):**
+1. spawn_agent mechanics are the right research focus -- partially confirmed: coordinator protocol level is where the real design decisions happen, but spawn_agent has real silent failure modes that need fixing
+2. Competitor/reference architectures are batch/DAG systems -- WRONG: OpenClaw and Claude Code are interactive session-coordination systems. Comparability is higher than assumed.
+3. Depth=3 is the right safety boundary -- confirmed: the real question is timeout robustness, not depth arithmetic. A depth-1 coordinator can hang if the spawned session hangs.
+## Candidate Directions
+### Candidate Generation Expectations
+This is a `landscape_first` + `THOROUGH` pass. Candidates must:
+1. **Anchor to landscape precedents.** Each candidate must reference at least one observed precedent (OpenClaw, pi-mono, Claude Code, worktrain-await design, or spawn_agent behavior) -- not free invention.
+2. **Cover the failure mode space.** Each of the 5 decision criteria must be addressed by at least one candidate; no criterion may be left unaddressed across the whole set.
+3. **Spread across the simplicity-completeness axis.** At least one candidate at each pole: a minimal wrapper that adds almost nothing, and a structured handoff protocol that addresses all 5 criteria.
+4. **THOROUGH push:** If the first spread feels clustered around the middle, add one more candidate that is either maximally simple (borderline too simple) or takes a structurally different approach to the termination guarantee problem.
+5. **No invented infrastructure.** Candidates must not require new daemon tooling, event buses, or message queues. They must work with spawn_agent and worktrain-await as they exist today.
+### Candidates
+*To be populated in candidate-generation step.*
+## Challenge Notes
+*To be populated after research.*
+## Resolution Notes
+### Recommendation
+**v1 coordinator design: 5 components, no new infrastructure, all justified by evidence.**
+1. **Infrastructure fix:** Change `delivery_failed` to return `outcome: 'error'` (not `'success'`) in `makeSpawnAgentTool` (~line 1580 of `src/daemon/workflow-runner.ts`). This is the highest-consequence silent failure. One-line change. **This is a hard blocker -- coordinator must not ship without it.**
+2. **Hardcoded child session timeout:** Pass `agentConfig: { maxSessionMinutes: 15 }` in all spawn triggers. No LLM arithmetic. The hardcoded value is conservative but correct under uncertainty (being too conservative is recoverable; no timeout is not).
+3. **Coordinator rule -- null childSessionId:** After spawn, if `childSessionId === null` with any outcome, treat as error. This catches the token decode failure zombie case (separate from the delivery_failed fix).
+4. **Coordinator rule -- go/no-go time check:** Before spawning, if remaining session time < 20 minutes, do not spawn. Return error with reason "insufficient session time remaining." Prevents coordinator death in edge cases without LLM arithmetic.
+5. **Layer D traceability:** Record spawn result JSON block in step notes BEFORE acting: `{ childSessionId, outcome, notes (truncated), spawnedAtEpochMs, durationMs }`. Step notes ARE injected into subsequent steps (MAX_SESSION_RECAP_NOTES=3 mechanism confirmed in code). This enables observability on first real run.
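Components 2 through 5 above can be sketched as a handful of small pure functions. This is a hedged illustration only: the coordinator is a workflow, not necessarily TypeScript code, and every name here (`SpawnResult`, `checkGoNoGo`, `effectiveOutcome`, `buildSpawnRecord`, the 200-character notes limit) is hypothetical. Only the 15-minute and 20-minute values, the error reason string, and the record fields come from the list above.

```typescript
type SpawnOutcome = "success" | "error" | "timeout";

interface SpawnResult {
  childSessionId: string | null;
  outcome: SpawnOutcome;
  notes: string;
}

// Component 2: hardcoded child timeout, no dynamic calculation.
const SPAWN_AGENT_CONFIG = { maxSessionMinutes: 15 };

// Component 4: refuse to spawn when too little session time remains.
const MIN_REMAINING_MINUTES = 20;

function checkGoNoGo(remainingMinutes: number): { spawn: boolean; reason?: string } {
  if (remainingMinutes < MIN_REMAINING_MINUTES) {
    return { spawn: false, reason: "insufficient session time remaining" };
  }
  return { spawn: true };
}

// Component 3: a null childSessionId is an error regardless of reported outcome.
function effectiveOutcome(result: SpawnResult): SpawnOutcome {
  return result.childSessionId === null ? "error" : result.outcome;
}

// Component 5: the record written to step notes before acting on the result.
function buildSpawnRecord(
  result: SpawnResult,
  spawnedAtEpochMs: number,
  nowEpochMs: number,
  maxNotesChars = 200, // assumed truncation limit
) {
  return {
    childSessionId: result.childSessionId,
    outcome: result.outcome,
    notes: result.notes.slice(0, maxNotesChars),
    spawnedAtEpochMs,
    durationMs: nowEpochMs - spawnedAtEpochMs,
  };
}
```

Keeping the thresholds as constants rather than computed values reflects the stated preference for avoiding LLM arithmetic in the coordinator.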
+### Strongest Alternative (Runner-Up)
+**B+C+D composition:** Coordinator-owned timeout budget (Layer B) + CoordinatorSpawnResult discriminated type (Layer C) + notes-as-retry-ledger (Layer D). Correct for v2 after empirical data. Loses for v1 because Layer B has silent failure modes (LLM arithmetic) and Layer C is a prompt-level workaround for a one-line infrastructure bug.
+### Confidence Band
+**HIGH.** Direction is grounded in code reading, challenged by an adversarial reviewer, and confirmed by design review. The challenger's false positive (Layer D notes not readable) was refuted by code evidence.
+### Residual Risks
+1. **Empirical validation of 15-min timeout.** The value is a heuristic. Real fix agents may need more or less time. Revisit after 3 real runs.
+2. **delivery_failed frequency unknown.** Zero spawn_agent calls in 3,278 daemon events. The delivery_failed -> success bug is a real code path but may never fire in typical usage. The infra fix is still correct architecture, but urgency is not yet validated empirically.
+### Constraints on Selected Direction
+- **Single-spawn per coordinator session.** Multi-spawn (diagnose + fix + verify) requires Layer B (dynamic budget) and is a v2 concern. This constraint must be explicit in the coordinator design spec.
+- **Infra fix is a hard blocker.** Do not ship coordinator without the delivery_failed -> error fix.
+## Decision Log
+| Date | Decision | Rationale |
+|------|----------|-----------|
+| 2026-04-18 | Path: landscape_first | Dominant need is grounding in current code and precedent, not reframing |
+| 2026-04-18 | v1 coordinator: infra fix + hardcoded timeout + notes traceability | spawn_agent is unused in practice (0 calls). The B+C+D composition is over-engineered for v1. Layer C is a prompt-level workaround for a one-line infrastructure bug. Layer B relies on LLM arithmetic, which has a silent failure mode. Layer D (notes) does work (notes are injected), but retry is premature. v2 can add dynamic budgeting and result type mapping after observing real usage. |
+| 2026-04-18 | Runner-up: B+C+D composition | Correct for v2+ after empirical data from real runs. Not justified for v1 on theoretical grounds alone. |
+## Final Summary
+### Selected Path
+`landscape_first` -- grounding in current spawn_agent code and reference architecture comparisons. The problem was already well-framed; no reframing step needed.
+### Problem Framing
+WorkTrain needs a coordinator that can spawn fix agents, await results, and act on outcomes. The failure modes of that coordination loop are not yet cataloged. The real design seam is the coordinator layer (which doesn't exist yet), not spawn_agent infrastructure (which does).
+### Landscape Takeaways
+- spawn_agent bypasses the global semaphore (direct runWorkflow call, not dispatch). Child sessions are invisible to daemon tooling.
+- `delivery_failed` is explicitly mapped to `outcome: 'success'` in makeSpawnAgentTool -- a silent failure the coordinator must guard against.
+- Token decode failure proceeds with `childSessionId: null` and `outcome: 'success'` -- a separate zombie case.
+- Reference architectures (OpenClaw, Claude Code) are interactive session-coordination systems, not batch/DAG pipelines. They are more comparable than initially assumed.
+- Most transferable patterns: pi-mono tool call hooks, OpenClaw session actor queue, Claude Code state-via-store model (WorkRail already uses this).
+- spawn_agent has never been used in practice (0 calls in 3,278 daemon events). All analysis is code-only.
+### Chosen Direction
+**v1: Infrastructure fix + 4 coordinator rules (hardcoded timeout, null childSessionId check, go/no-go check, notes traceability)**
+1. Fix `delivery_failed -> 'error'` in `makeSpawnAgentTool` (hard blocker -- must ship with coordinator)
+2. Hardcode `agentConfig: { maxSessionMinutes: 15 }` in all spawn calls
+3. Coordinator rule: `childSessionId === null` with any outcome = error
+4. Coordinator rule: go/no-go check -- if < 20 min session time remaining, do not spawn
+5. Layer D: record spawn result JSON in step notes before acting
+**Key constraint:** Single-spawn per coordinator session. Multi-spawn requires Layer B (v2).
+### Strongest Alternative (Runner-Up)
+B+C+D composition: coordinator-owned timeout budget (Layer B) + CoordinatorSpawnResult type mapping (Layer C) + notes-as-retry-ledger (Layer D). Loses for v1 because Layer B has silent failure modes (LLM arithmetic) and Layer C is a workaround for a one-line infrastructure bug.
+### Why It Won
+- Infrastructure fix addresses the root cause at the correct abstraction layer (not a prompt-level workaround)
+- Hardcoded timeout eliminates the silent failure mode of LLM arithmetic
+- No speculative abstractions -- every component is justified by an identified failure mode
+- Adversarial review validated 3 of 4 components; false positive on Layer D was refuted by code evidence
+### Confidence Band
+HIGH. Direction grounded, challenged, reviewed, confirmed. Remaining gaps are empirical.
+### Failure Mode Catalog
+| # | Failure Mode | Mechanism | Severity | Mitigation | Status |
+|---|-------------|-----------|----------|------------|--------|
+| FM1 | delivery_failed treated as success | makeSpawnAgentTool maps delivery_failed -> 'success' | HIGH | Fix in infrastructure: return outcome: 'error' | Required (hard blocker) |
+| FM2 | Zombie session (null childSessionId) | Token decode failure proceeds silently with childSessionId: null and outcome: 'success' | HIGH | Coordinator rule: treat null childSessionId as error | Required |
+| FM3 | Spawned session hangs at max_turns | No bounded timeout on spawned session | HIGH | Hardcode maxSessionMinutes: 15 | Required |
+| FM4 | Coordinator dies waiting for child | Nested timeout: coordinator and child have same timeout budget | HIGH | Go/no-go check: < 20 min remaining = don't spawn | Required |
+| FM5 | Fix agent introduces new bug | Coordinator has no way to verify fix quality | MEDIUM | Out of scope for v1; requires verification spawn | Accepted / v2 |
+| FM6 | Concurrent coordinators on same repo | spawn_agent bypasses semaphore | MEDIUM | WorkRail session queue prevents races at higher level | Already handled |
+| FM7 | worktrain-await race condition | Timeout checked once per loop, not per poll | LOW | Negligible at 15-min sessions / 3s poll | Accepted (file as separate bug) |
+| FM8 | Context compaction strips retry state | Context variables may be compacted | MEDIUM | Layer D: step notes are durable (MAX_SESSION_RECAP_NOTES=3 injects prior notes) | Mitigated |
+### Minimum Viable Robustness Checklist
+A pre-ship reviewer can use this to verify coordinator v1 is ready:
+- [ ] **Infrastructure fix landed:** `makeSpawnAgentTool` returns `outcome: 'error'` for `delivery_failed` (not `'success'`). Check `src/daemon/workflow-runner.ts` ~line 1580.
+- [ ] **Hardcoded timeout set:** All spawn triggers in the coordinator pass `agentConfig: { maxSessionMinutes: 15 }`. No dynamic calculation.
+- [ ] **Null childSessionId check present:** Coordinator explicitly checks `childSessionId !== null` before treating outcome as success. If null, treats as error.
+- [ ] **Go/no-go check present:** Coordinator checks remaining session time before spawning. If < 20 minutes, returns error without spawning.
+- [ ] **Spawn record written to notes:** Before acting on spawn result, coordinator writes a JSON record to step notes: `{ childSessionId, outcome, notes (truncated), spawnedAtEpochMs, durationMs }`.
+- [ ] **Single-spawn constraint documented:** Coordinator design spec explicitly states this coordinator makes one spawn per session. Multi-spawn is not supported.
+- [ ] **Real run performed:** At least 1 real coordinator session run before declaring v1 stable. Timeout values revisited after.
+### Reference Architecture Precedents Applied
+| System | Pattern | Applied In |
+|--------|---------|------------|
+| OpenClaw | Session actor queue: serialize messages per session | WorkRail's DaemonSessionManager already does this |
+| pi-mono | Tool call hooks: BeforeToolCallResult / AfterToolCallResult | Evidence gating pattern (v2 concern) |
+| Claude Code | State-via-store: subagents report to durable store, not context | Already in WorkRail's design; coordinator uses session store, not context |
+| LangGraph | Time-travel checkpointing | WorkRail's checkpoint/rewind feature (existing) |
+| nexus-core | Knowledge injection before each LLM call | WorkRail's session recap (MAX_SESSION_RECAP_NOTES) does this |
+### Next Actions
+1. **Now:** Fix `delivery_failed -> 'error'` in `makeSpawnAgentTool`. This is the infrastructure fix that unblocks coordinator design.
+2. **Now:** Design coordinator workflow with the 4 coordinator rules above. Use `design-candidates-spawn-agent.md` as the design spec.
+3. **After first 3 real runs:** Revisit 15-min timeout value. File worktrain-await race condition as a separate bug.
+4. **v2:** Add Layer B (dynamic timeout budgeting at infrastructure level, not LLM arithmetic) and multi-spawn support.
+### Residual Risks
+1. 15-min timeout value is a heuristic with no empirical validation. May be too conservative or too liberal.
+2. delivery_failed frequency unknown in practice (0 production spawn calls). The fix is correct architecture regardless.

package/docs/discovery/workflow-selection-for-discovery-tasks.md ADDED

@@ -0,0 +1,336 @@
+# Workflow Selection for Discovery-Only Tasks
+**Status:** Discovery in progress
+**Session:** wr.discovery
+**Date:** 2026-04-17
+**Artifact strategy:** This document is for human reading. Execution truth (context variables, step notes) lives in WorkRail session state, not here. This doc is updated at each phase but is not the primary memory -- it can be reconstructed from notes if lost.
+---
+## Context / Ask
+A daemon session was dispatched using `coding-task-workflow-agentic` with a goal that said "Discovery only -- Do NOT write any code". The session ran 11 advances, produced good design candidate notes, stopped at event 74 with no `run_completed`, and the later advances had no note output (likely conditional skips).
+The question: for a discovery-only task (no code, just a design document), should we use `coding-task-workflow-agentic` or `wr.discovery`? And can `coding-task-workflow-agentic` be trusted to stay in discovery mode when the goal explicitly says no code?
+---
+## Path Recommendation
+**Path:** `landscape_first`
+**Rationale:** The dominant need here is to understand the current structure of two specific workflows and compare their fitness for a known task class (discovery-only). The answer is primarily a landscape/comparison problem, not an ambiguous framing problem. `landscape_first` is the right fit. `full_spectrum` is not needed because we are not uncertain about what the problem is -- we have a concrete incident and two concrete artifacts. `design_first` would be appropriate only if we suspected the stated problem was the wrong problem, and we do not.
+---
+## Constraints / Anti-goals
+**Constraints:**
+- We have two concrete workflow JSON files to analyze
+- We have a concrete triggers.yml with one `workflowId` configured
+- The daemon session behavior is a real observed incident, not a hypothesis
+**Anti-goals:**
+- Do not redesign either workflow
+- Do not recommend changes to workflow step content
+- Do not propose a new workflow; only decide which existing one to use
+---
+## Landscape Packet
+### Current state summary
+`coding-task-workflow-agentic` (lean v2, v1.1.0) is a full implementation lifecycle workflow. Its `about` field says: "Use this to implement a software feature or task." Its preconditions include "A deterministic validation path exists (tests, build, or an explicit verification strategy)." It explicitly describes what it produces: `implementation_plan.md`, `spec.md`, code slices, and a PR-ready handoff with commit JSON.
+`wr.discovery` (v3.1.0) is a structured thinking/design workflow. Its `about` field says: "Use this to explore and think through a problem end-to-end." Its metaGuidance explicitly states: "Boundary: this workflow can end with a recommendation memo, prototype or test plan, or a research-informed direction. It should not implement production code."
+### Step structure analysis: coding-task-workflow-agentic
+| Step | Condition | Discovery-relevant? |
+|------|-----------|---------------------|
+| phase-0: Understand & Classify | always runs | Yes -- classifies complexity/rigor |
+| phase-1a: State Hypothesis | `taskComplexity != Small AND rigorMode != QUICK` | Yes |
+| phase-1b-design-quick: Lightweight Design | `taskComplexity != Small AND rigorMode == QUICK` | Yes |
+| phase-1b-design-deep: Tension-Driven Design | `taskComplexity != Small AND rigorMode != QUICK` | Yes |
+| phase-1c: Challenge and Select | `taskComplexity != Small` | Yes |
+| phase-2: Design Review loop | `taskComplexity != Small` | Yes |
+| phase-3: Slice, Plan, and Test Design | `taskComplexity != Small` | Implementation planning |
+| phase-3b: Spec (Observable Behavior) | `taskComplexity != Small AND (Large OR High risk)` | Implementation planning |
+| phase-4: Plan Audit loop | `taskComplexity != Small AND rigorMode != QUICK` | Implementation planning |
+| phase-5: Small Task Fast Path | `taskComplexity == Small` | Implementation (code required) |
+| phase-6: Implement Slice-by-Slice loop | `taskComplexity != Small` | **Code writing** |
+| phase-7: Final Verification loop | `taskComplexity != Small` | **Code verification** |
+**Key finding:** For a task classified as `Small`, the workflow skips phases 1a, 1b, 1c, 2, 3, 3b, 4, 6, 7 and runs only phase-0 and phase-5. Phase-5 (Small Task Fast Path) **explicitly requires writing code** and producing a handoff JSON block with `filesChanged`. There is no "Small + discovery only" path.
+For Medium/Large tasks, the workflow runs the full design pipeline (phases 0-4) which produces `design-candidates.md` -- but it then continues directly into implementation (phases 6-7). There is no early exit after design.
+**Does coding-task-workflow-agentic have a "discovery only" mode?** No. It has no `runCondition` or context variable that would stop the run before implementation when a goal says "no code". The only escape hatch would be the agent choosing to stop itself based on the goal text -- which relies on honor-system trust, not a structural guarantee.
+### What phases run for Small vs Medium/Large
+**Small task path:**
+- phase-0 (classify)
+- phase-5 (fast path -- writes code, produces a handoff JSON block with `filesChanged`)
+- All other phases are skipped: each includes `taskComplexity != Small` in its `runCondition`, which evaluates to false; only phase-5 uses `taskComplexity == Small`
+**Medium/Large task path:**
+- phase-0 (classify)
+- phase-1a/1b/1c (design candidates)
+- phase-2 (design review loop)
+- phase-3 (implementation plan)
+- phase-3b (spec, if Large or High risk)
+- phase-4 (plan audit loop)
+- phase-6 (implement slice loop -- **writes code**)
+- phase-7 (final verification loop)
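The two paths above can be reproduced with a small model of the step table. This is a minimal sketch, not the workflow engine's implementation: the `Classification` and `Phase` types, the `touchesCode` flag, and the `NON_QUICK` placeholder for every rigor mode other than `QUICK` are assumptions made for illustration. The predicates mirror the Condition column of the table.

```typescript
// Illustrative model of the step table; names and types here are assumptions,
// not the real workflow schema.
type Complexity = "Small" | "Medium" | "Large";
// Only QUICK is named in the table; NON_QUICK stands in for every other mode.
type RigorMode = "QUICK" | "NON_QUICK";

interface Classification {
  taskComplexity: Complexity;
  rigorMode: RigorMode;
  highRisk: boolean;
}

interface Phase {
  id: string;
  touchesCode: boolean; // true for phases that write or verify code
  runs: (c: Classification) => boolean;
}

const notSmall = (c: Classification): boolean => c.taskComplexity !== "Small";
const notQuick = (c: Classification): boolean => c.rigorMode !== "QUICK";

// One entry per table row, in table order.
const PHASES: Phase[] = [
  { id: "phase-0", touchesCode: false, runs: () => true },
  { id: "phase-1a", touchesCode: false, runs: (c) => notSmall(c) && notQuick(c) },
  { id: "phase-1b-design-quick", touchesCode: false, runs: (c) => notSmall(c) && !notQuick(c) },
  { id: "phase-1b-design-deep", touchesCode: false, runs: (c) => notSmall(c) && notQuick(c) },
  { id: "phase-1c", touchesCode: false, runs: notSmall },
  { id: "phase-2", touchesCode: false, runs: notSmall },
  { id: "phase-3", touchesCode: false, runs: notSmall },
  { id: "phase-3b", touchesCode: false,
    runs: (c) => notSmall(c) && (c.taskComplexity === "Large" || c.highRisk) },
  { id: "phase-4", touchesCode: false, runs: (c) => notSmall(c) && notQuick(c) },
  { id: "phase-5", touchesCode: true, runs: (c) => c.taskComplexity === "Small" },
  { id: "phase-6", touchesCode: true, runs: notSmall },
  { id: "phase-7", touchesCode: true, runs: notSmall },
];

// Ids of the phases whose condition holds for this classification.
function plannedPhases(c: Classification): string[] {
  return PHASES.filter((p) => p.runs(c)).map((p) => p.id);
}

// True when at least one code-touching phase is on the planned path.
function reachesCode(c: Classification): boolean {
  return PHASES.some((p) => p.touchesCode && p.runs(c));
}

console.log(plannedPhases({ taskComplexity: "Small", rigorMode: "QUICK", highRisk: false }));
console.log(plannedPhases({ taskComplexity: "Medium", rigorMode: "NON_QUICK", highRisk: false }));
```

Under this model `reachesCode` is true for every classification, which restates the key finding: no combination of complexity and rigor yields a path without a code-touching phase.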
+The daemon session ran 11 advances and stopped at event 74. Given the step structure, for a Medium/Large non-QUICK classification, 11 advances would likely cover phases 0-4 (design + planning), stopping before phase-6 (implementation). If so, the session exhausted the design pipeline but never reached code-writing. That would not be because the workflow has a discovery mode; it would be because execution stopped before phase-6, possibly because:
+1. The goal text said "no code" and the agent respected it
+2. A loop condition evaluation or `requireConfirmation` gate paused/stopped execution
+3. The session timed out or the MCP connection dropped before the loop started
+The "no note output on later advances" is consistent with conditional steps being skipped (e.g., phase-3b skipped because not Large/High-risk, or loop steps stopping early).
+### wr.discovery landscape
+`wr.discovery` runs: path selection -> capability setup -> landscape understanding -> problem framing -> re-triage (conditional) -> synthesis -> candidate generation -> challenge/selection -> direction review loop -> uncertainty resolution (direct recommendation / research loop / prototype loop) -> final validation -> handoff.
+It explicitly cannot produce production code. It always ends with a design document, recommendation memo, or prototype spec. There is no implementation path in the workflow.
+### Option categories
+1. **Use wr.discovery** for discovery tasks, `coding-task-workflow-agentic` for implementation tasks
+2. **Use coding-task-workflow-agentic for everything**, trusting the agent to stop early when goal says "no code"
+3. **Add a discovery-mode flag** to `coding-task-workflow-agentic` via a `runCondition` on phases 6-7
+4. **Use separate triggers** in triggers.yml with different `workflowId` per task type
+### Contradictions / disagreements
+- The daemon session with `coding-task-workflow-agentic` produced "good design candidates notes" -- so the workflow does good design work even though it is intended for implementation. The design pipeline (phases 1-4) is legitimate and high quality.
+- The risk is not that `coding-task-workflow-agentic` does bad design work. The risk is that (a) it might not stop before phase-6 reliably, and (b) it carries implementation framing (slices, spec, PR handoff) that pollutes a pure discovery context.
+### Evidence gaps
+- We do not know the exact event log from the stopped daemon session -- we cannot confirm whether it stopped naturally or by connection drop
+- We do not know whether the agent in that session reached phase-6 or stopped before it
+- We cannot test "honor system" reliability without more session data
+---
+## Problem Frame Packet
+### Users / stakeholders
+- Daemon dispatcher: needs to select the right `workflowId` in triggers.yml
+- Agent executing the session: needs structural guarantees, not honor-system constraints
+- Developer (you): needs the design document output to be pure and trustworthy
+### Jobs / goals / outcomes
+- Dispatch a session that produces a design document and nothing else
+- Know with certainty that no code will be written, regardless of agent judgment
+- Get a high-quality, structured design output comparable to what the design phases of `coding-task-workflow-agentic` produce
+### Pains / tensions / constraints
+- The daemon currently has ONE `workflowId` in triggers.yml -- no per-task routing
+- `coding-task-workflow-agentic` is trusted for design quality but is not structurally bounded to stop before code
+- `wr.discovery` is structurally bounded to no-code but may produce different design output depth
+### Success criteria
+1. A discovery-only task produces only a design document, never code or a PR
+2. The selection is structural (a wrong `workflowId` cannot accidentally write code), not honor-system
+3. The design quality is not degraded by switching to `wr.discovery`
+### Assumptions
+- The daemon reads `workflowId` directly from triggers.yml and cannot dynamically select based on goal text
+- `wr.discovery` produces design candidates comparable in quality to what phases 1-4 of `coding-task-workflow-agentic` produce
+- triggers.yml supports multiple trigger entries with different `workflowId` values
+### Reframes / HMW questions
+- HMW: How might we route discovery tasks to `wr.discovery` and implementation tasks to `coding-task-workflow-agentic` at the dispatcher level instead of relying on agent judgment?
+- HMW: How might we make "discovery only" a structural guarantee rather than a goal-text instruction?
+### What would make this framing wrong
+- If the daemon cannot support multiple triggers, option 4 (separate triggers) is blocked
+- If `wr.discovery` produces materially weaker design output for technical workflow questions, the quality tradeoff matters
+---
+## Candidate Generation Expectations (landscape_first)
+Because this is a `landscape_first` path, the candidate set must:
+- Clearly reflect the landscape findings (the actual step structure of both workflows, the triggers.yml constraint)
+- Not invent options that contradict what was observed in the workflow files
+- Include at least one option that uses existing structure without any modification
+- Include the runner-up option that would feel like a real alternative, not just a straw man
+The three candidates (A, B, C) below were derived directly from the landscape analysis, not from free invention.
+---
+## Candidate Directions
+### Direction A: Use wr.discovery for discovery tasks (structural routing)
+Configure a second trigger entry in triggers.yml with `workflowId: wr.discovery` for discovery-only goals. The structural guarantee is that `wr.discovery` cannot write code -- it does not have those steps. The daemon would need to support routing (two triggers, each with a matching rule or explicit goal flag).
+**Why it fits:** Structural guarantee. `wr.discovery` was explicitly designed for this use case. Its metaGuidance says "should not implement production code."
+**Strongest evidence for it:** The session incident shows the risk of relying on honor-system stop behavior in `coding-task-workflow-agentic`. Structural routing removes the risk entirely.
+**Strongest risk against it:** triggers.yml currently defines a single trigger. If the daemon cannot support multiple triggers with per-task routing, this requires daemon work. Also, `wr.discovery` produces a recommendation memo/design doc, not the same `design-candidates.md` artifact shape that `coding-task-workflow-agentic` phases 1-4 produce.
+**When it should win:** Always, for any task where the desired output is a design document and there is no intent to implement code in the same session.
+---
+### Direction B: Trust coding-task-workflow-agentic with honor-system stop
+Keep triggers.yml as-is. Rely on the goal text ("Discovery only -- Do NOT write any code") to instruct the agent to stop before phase-6.
+**Why it fits (weakly):** The prior session actually did produce design candidates and apparently stopped before code. It worked once.
+**Strongest evidence for it:** The 11-advance session with good design notes suggests the agent did respect the goal text.
+**Strongest risk against it:** The workflow has no structural stop before phase-6. A future session could classify the task differently, run through phases 0-4 faster, and reach phase-6 before the session ends. Phase-6 will attempt to implement code. The only protection is the agent re-reading the goal and choosing not to implement -- which is fragile under long sessions, context window pressure, or agent model changes.
+**When it should win:** Never for production use. Acceptable as a short-term workaround only.
+---
+### Direction C: Add discoveryMode flag to coding-task-workflow-agentic
+Modify `coding-task-workflow-agentic` to support a `discoveryMode` context variable. Add `runCondition: { var: "discoveryMode", not_equals: true }` to phases 6 and 7. Pass `discoveryMode: true` via the goal or a trigger-level context override.
+**Why it fits:** Preserves the high-quality design pipeline of `coding-task-workflow-agentic` while adding a structural stop before implementation.
+**Strongest evidence for it:** The design phases (1-4) of `coding-task-workflow-agentic` are well-designed and familiar. Reusing them avoids duplication.
+**Strongest risk against it:** This requires modifying a core workflow file. It adds complexity to a workflow that was designed for a different purpose. It creates a hybrid that does neither thing cleanly. And triggers.yml still only has one trigger, so the `discoveryMode` value must come from somewhere (goal text parse? trigger-level context?).
+**When it should win:** If modifying `wr.discovery` or the daemon is unavailable, and modifying `coding-task-workflow-agentic` is cheap and acceptable.
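To make the proposed `runCondition` concrete, here is a minimal sketch of how such a condition might be evaluated. The `NotEqualsCondition` shape is copied from the snippet in Direction C; the `shouldRun` helper, the `SessionContext` type, and the strict-equality semantics are assumptions, and the real engine may behave differently.

```typescript
// Assumed condition shape, copied from the Direction C snippet. The real
// engine may support other operators or different comparison semantics.
interface NotEqualsCondition {
  var: string;
  not_equals: unknown;
}

type SessionContext = { [key: string]: unknown };

// A step with no runCondition always runs; otherwise it runs unless the
// context variable is strictly equal to the excluded value.
function shouldRun(cond: NotEqualsCondition | undefined, ctx: SessionContext): boolean {
  if (cond === undefined) return true;
  return ctx[cond.var] !== cond.not_equals;
}

const skipInDiscovery: NotEqualsCondition = { var: "discoveryMode", not_equals: true };

console.log(shouldRun(skipInDiscovery, { discoveryMode: true }));   // false: phase skipped
console.log(shouldRun(skipInDiscovery, {}));                         // true: flag unset, phase runs
console.log(shouldRun(skipInDiscovery, { discoveryMode: "true" })); // true: string is not boolean true
```

The last two calls illustrate the sourcing problem raised above: if the flag is missing, or arrives as a string parsed from goal text, a strict comparison like this one lets phases 6-7 run.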
+---
+## Challenge Notes
+**Against Direction A (wr.discovery):** The design output format differs. `coding-task-workflow-agentic` produces `design-candidates.md` via the `tension-driven-design` routine, followed by a `design-review-findings.md` and a full `implementation_plan.md`. `wr.discovery` produces a design doc with Candidate Directions and a recommendation. For a technical question about workflow architecture, the `wr.discovery` output (a recommendation memo) is actually _more_ appropriate than `implementation_plan.md`. The format difference is not a disadvantage.
+**Against Direction B:** The incident already showed the risk. The session stopped at event 74 with no `run_completed`. We do not know if it stopped intentionally or by timeout/connection drop. If it stopped by timeout, the next session might not stop in the same place. Structural guarantees are always preferred over honor-system constraints when the downside (code written to a wrong branch) is recoverable but costly.
+**Against Direction C:** Modifying `coding-task-workflow-agentic` for a use case it was not designed for violates the "make illegal states unrepresentable" principle. It is better to use the right tool than to add a mode switch to the wrong tool.
+---
+## Resolution Notes
+**Direction A wins.** `wr.discovery` is the right workflow for discovery-only tasks. The structural guarantee -- no implementation steps exist in the workflow -- is strictly better than an honor-system stop. The triggers.yml configuration needs to evolve to support per-task workflow routing.
+---
+## Decision Log
+| Decision | Rationale |
+|----------|-----------|
+| `landscape_first` path chosen | We have two concrete artifacts to compare; this is a comparison/routing question, not an ambiguous framing problem |
+| Direction A selected | Structural guarantees are always preferred over honor-system constraints; `wr.discovery` was built for this |
+| Direction B rejected | Honor-system stop is fragile; the incident exposed the risk, since its stop reason is still unknown |
+| Direction C rejected | Adding a mode switch to the wrong tool is worse than using the right tool |
+| Multiple triggers confirmed | Read `src/trigger/trigger-store.ts` and `src/trigger/trigger-router.ts`. `loadTriggerConfig()` loads all entries; `buildTriggerIndex()` maps by unique `id`; `route()` dispatches by `triggerId`. A second trigger entry with `workflowId: wr.discovery` works today with zero code changes. |
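The routing behavior described in the last row can be sketched as an index keyed by trigger `id`. This is a hedged illustration only: `TriggerEntry`, `indexById`, and `workflowFor` are hypothetical names, and the signatures of the real `loadTriggerConfig()`, `buildTriggerIndex()`, and `route()` are not reproduced here.

```typescript
// Hypothetical, reduced trigger shape; the real config carries more fields.
interface TriggerEntry {
  id: string;
  workflowId: string;
}

// Build an index keyed by trigger id, rejecting duplicates so that each
// id maps to exactly one workflow.
function indexById(entries: TriggerEntry[]): { [id: string]: TriggerEntry } {
  const index: { [id: string]: TriggerEntry } = {};
  for (const entry of entries) {
    if (Object.prototype.hasOwnProperty.call(index, entry.id)) {
      throw new Error("duplicate trigger id: " + entry.id);
    }
    index[entry.id] = entry;
  }
  return index;
}

// Resolve the workflow for an incoming triggerId, or undefined if unknown.
function workflowFor(index: { [id: string]: TriggerEntry }, triggerId: string): string | undefined {
  if (!Object.prototype.hasOwnProperty.call(index, triggerId)) return undefined;
  const entry = index[triggerId];
  return entry ? entry.workflowId : undefined;
}

const triggerIndex = indexById([
  { id: "test-task", workflowId: "coding-task-workflow-agentic" },
  { id: "discovery-task", workflowId: "wr.discovery" },
]);

console.log(workflowFor(triggerIndex, "discovery-task"));
console.log(workflowFor(triggerIndex, "test-task"));
```

The point of the sketch is that the workflow is fixed by the trigger entry, not by anything the agent reads in the goal text.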
+---
+## Final Summary
+### Selected direction: Direction A -- use wr.discovery for discovery-only tasks
+**Confidence band: High**
+#### Recommendation
+For a discovery-only task (no code, just a design document):
+- **Use `wr.discovery`**, not `coding-task-workflow-agentic`
+- Add a second trigger entry to `triggers.yml` with a unique `id` and `workflowId: wr.discovery`
+- The daemon's `trigger-store.ts` and `trigger-router.ts` already support multiple triggers with different `workflowId` values -- no code change required
+#### Example triggers.yml configuration
+```yaml
+triggers:
+  - id: test-task
+    provider: generic
+    workflowId: coding-task-workflow-agentic
+    workspacePath: /Users/etienneb/git/personal/workrail
+    goal: "Add the evidenceFrom field to AssessmentDimension..."
+    concurrencyMode: parallel
+    autoCommit: false
+    agentConfig:
+      maxSessionMinutes: 60
+  - id: discovery-task
+    provider: generic
+    workflowId: wr.discovery
+    workspacePath: /Users/etienneb/git/personal/workrail
+    goal: "Discovery only: ..."
+    concurrencyMode: parallel
+    autoCommit: false
+    agentConfig:
+      maxSessionMinutes: 60
+```
+The caller must send the correct `triggerId` (`discovery-task` vs `test-task`) when firing the webhook.
+#### Why coding-task-workflow-agentic cannot be trusted in discovery mode
+`coding-task-workflow-agentic` has no structural stop before phase-6 (Implement Slice-by-Slice). For Small tasks, phase-5 (Small Task Fast Path) explicitly requires writing code. For Medium/Large tasks, the design pipeline (phases 0-4) produces good design work, then phase-6 writes code. The only protection against code-writing is the agent choosing to stop based on goal text -- an honor-system constraint that can fail under context window pressure.
+The prior session stopped at event 74 (likely after phase-4, before phase-6) -- but we cannot confirm whether this was agent judgment or a connection drop. With `wr.discovery`, the question is irrelevant: there are no phases 6-7 to reach.
+#### What phases coding-task-workflow-agentic skips for Small tasks
+- Skips: phase-1a (hypothesis), phase-1b (design), phase-1c (challenge), phase-2 (design review), phase-3 (plan), phase-3b (spec), phase-4 (plan audit), phase-6 (implementation), phase-7 (verification)
+- Runs: phase-0 (classify) and phase-5 (Small Task Fast Path -- **writes code**)
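+For Medium/Large tasks, every phase except phase-5 runs in sequence, subject to the `rigorMode` and risk conditions on phases 1a, 1b, 3b, and 4, and the run always includes phase-6 (implementation).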
+#### Would wr.discovery have been a better choice?
+Yes, without qualification. `wr.discovery` was designed for exactly this use case. Its metaGuidance states: "should not implement production code." All paths end with a recommendation memo, prototype spec, or research plan. It uses the same `tension-driven-design` routine as phase 1b of `coding-task-workflow-agentic`, so design quality should be comparable.
+#### How to configure triggers.yml for discovery vs implementation
+- **Implementation tasks**: `workflowId: coding-task-workflow-agentic` -- use the existing `test-task` trigger or rename it
+- **Discovery tasks**: `workflowId: wr.discovery` -- add a new trigger entry (e.g., `id: discovery-task`)
+- Route by sending the correct `triggerId` in the webhook
+#### Workflow selection strategy when the daemon has ONE workflowId configured
+The current `test-task` trigger always dispatches to `coding-task-workflow-agentic`. For discovery tasks, either:
+1. Add a second trigger entry (preferred -- structural routing, zero code change)
+2. Temporarily change the trigger's `workflowId` to `wr.discovery` for discovery sessions, then change it back (workable but manual and error-prone)
+3. Use console AUTO dispatch and set `workflowId: wr.discovery` explicitly in the dispatch request (for console-dispatched sessions only)
+Option 1 is the right answer.
+### Strongest alternative: Direction C (add discoveryMode flag to coding-task-workflow-agentic)
+If two-trigger routing were unavailable (it is available today), adding `runCondition: { var: "discoveryMode", not_equals: true }` to phases 6-7 would also provide structural enforcement. It would cost workflow cleanliness, YAGNI compliance, and reversibility, so it is not recommended while Direction A is available.
+### Residual risks
+1. **Console dispatch scope boundary** (Yellow): console AUTO dispatch uses `workflowId` directly, not `triggerId`. For console-dispatched discovery sessions, the caller must explicitly set `workflowId: wr.discovery`. The two-trigger triggers.yml setup covers webhook-triggered sessions only.
+2. **Prior session at event 74**: stop reason unknown. If it was a connection drop, the design pipeline output may be incomplete. Review the session artifacts before using them. Direction A eliminates this risk for future sessions.
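For the first residual risk, one possible mitigation is a caller-side check before a console dispatch is sent. The sketch below is an assumption throughout: `ConsoleDispatchRequest`, the `discoveryOnly` flag, the `NO_CODE_WORKFLOWS` list, and `dispatchProblem` are hypothetical and do not describe the console's actual request format.

```typescript
// Hypothetical request shape; the real console dispatch payload may differ.
interface ConsoleDispatchRequest {
  workflowId: string;
  goal: string;
  discoveryOnly: boolean; // hypothetical flag set by the caller
}

// Workflows assumed to contain no implementation steps.
const NO_CODE_WORKFLOWS: string[] = ["wr.discovery"];

// Returns a problem description, or null when the request looks safe to send.
function dispatchProblem(req: ConsoleDispatchRequest): string | null {
  if (req.discoveryOnly && NO_CODE_WORKFLOWS.indexOf(req.workflowId) === -1) {
    return "discovery-only request targets '" + req.workflowId +
      "', which is not a known no-code workflow";
  }
  return null;
}

console.log(dispatchProblem({
  workflowId: "coding-task-workflow-agentic",
  goal: "Discovery only: ...",
  discoveryOnly: true,
}));
```

A check like this is still weaker than the two-trigger setup, because it depends on the caller setting the flag correctly.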
+### Next actions
+1. Add a second trigger entry to `triggers.yml` with `id: discovery-task` and `workflowId: wr.discovery`
+2. Route discovery-only goals to the `discovery-task` trigger ID when firing webhooks
+3. Review the prior session's design artifacts for completeness