@exaudeus/workrail 3.36.0 → 3.37.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. package/dist/config/config-file.js +2 -0
  2. package/dist/console-ui/assets/{index-n8cJrS4v.js → index-t8Wi304z.js} +1 -1
  3. package/dist/console-ui/index.html +1 -1
  4. package/dist/daemon/workflow-runner.d.ts +1 -0
  5. package/dist/daemon/workflow-runner.js +3 -6
  6. package/dist/infrastructure/session/SessionManager.js +17 -4
  7. package/dist/manifest.json +25 -17
  8. package/dist/trigger/notification-service.d.ts +42 -0
  9. package/dist/trigger/notification-service.js +164 -0
  10. package/dist/trigger/trigger-listener.js +7 -1
  11. package/dist/trigger/trigger-router.d.ts +3 -1
  12. package/dist/trigger/trigger-router.js +4 -1
  13. package/docs/design/agent-behavior-patterns-discovery.md +312 -0
  14. package/docs/design/agent-engine-communication-discovery.md +390 -0
  15. package/docs/design/agent-loop-architecture-alternatives-discovery.md +531 -0
  16. package/docs/design/agent-loop-error-handling-contract.md +238 -0
  17. package/docs/design/complete-step-approach-validation-discovery.md +344 -0
  18. package/docs/design/daemon-stuck-detection-discovery.md +174 -0
  19. package/docs/design/mcp-server-disconnect-discovery.md +245 -0
  20. package/docs/design/mcp-server-epipe-crash.md +198 -0
  21. package/docs/design/notification-design-candidates.md +131 -0
  22. package/docs/design/notification-design-review.md +84 -0
  23. package/docs/design/notification-implementation-plan.md +181 -0
  24. package/docs/design/spawn-agent-failure-modes.md +161 -0
  25. package/docs/design/spawn-agent-result-handling-implementation-plan.md +186 -0
  26. package/docs/design/stdio-simplification-design-candidates.md +341 -0
  27. package/docs/design/stdio-simplification-design-review.md +93 -0
  28. package/docs/design/stdio-simplification-implementation-plan.md +317 -0
  29. package/docs/design/structured-output-tools-coexist-findings.md +288 -0
  30. package/docs/discovery/coordinator-script-design.md +745 -0
  31. package/docs/discovery/coordinator-ux-discovery.md +471 -0
  32. package/docs/discovery/spawn-agent-failure-modes.md +309 -0
  33. package/docs/discovery/workflow-selection-for-discovery-tasks.md +336 -0
  34. package/docs/discovery/worktrain-status-briefing.md +325 -0
  35. package/docs/discovery/worktrain-status-design-candidates.md +202 -0
  36. package/docs/discovery/worktrain-status-design-review-findings.md +86 -0
  37. package/docs/ideas/backlog.md +608 -0
  38. package/docs/ideas/daemon-structured-output-vs-tool-calls.md +344 -0
  39. package/docs/ideas/design-candidates-backlog-consolidation.md +85 -0
  40. package/docs/ideas/design-review-findings-backlog-consolidation.md +39 -0
  41. package/docs/ideas/implementation_plan_backlog_consolidation.md +117 -0
  42. package/docs/plans/authoring-doc-staleness-enforcement-candidates.md +251 -0
  43. package/docs/plans/authoring-doc-staleness-enforcement-review.md +99 -0
  44. package/docs/plans/authoring-doc-staleness-enforcement.md +463 -0
  45. package/package.json +1 -1
@@ -1779,6 +1779,322 @@ No PR merges without passing all required gates for its classification. The coor
1779
1779
  Right now, "has this been reviewed and audited?" is a question that requires reading through PRs and session notes. With proof records, it's a query: `SELECT * FROM proof_records WHERE module='src/trigger/' AND kind='production_audit' AND outcome='pass' AND timestamp > NOW() - INTERVAL '30 days'`. The knowledge graph stores these records. The watchdog checks them on a schedule. The coordinator gates on them before merging. Verification becomes infrastructure, not process.
1780
1780
  ---
1781
1781
 
1782
+ ### Scripts-first coordinator: avoid the main agent wherever possible (Apr 15, 2026)
1783
+
1784
+ **The insight:** In the coordinator workflow we built manually today, the main agent spent most of its time on mechanical work -- reading PR lists, checking CI status, deciding whether findings are blocking, sequencing merges. That's all deterministic logic. An LLM is expensive, slow, and inconsistent for deterministic work.
1785
+
1786
+ **The principle extended to coordinators:** the scripts-over-agent rule applies at the coordinator level too. The coordinator's job is to drive a DAG of child sessions. The DAG structure, routing decisions, and termination conditions should be scripts, not LLM reasoning.
1787
+
1788
+ **What this means concretely:**
1789
+
1790
+ Instead of a coordinator *agent* that reads MR review findings and decides what to do, use a **coordinator script** that:
1791
+ 1. Calls `gh pr list` → list of PRs (script)
1792
+ 2. For each PR, calls `spawn_session(mr-review-workflow-agentic)` → session handles (script)
1793
+ 3. Calls `await_sessions(handles)` → structured findings (script waits)
1794
+ 4. Parses the findings JSON block from each session's output (script)
1795
+ 5. Routes: clean → merge queue, minor → spawn fix agent, blocking → escalate (script decision tree)
1796
+ 6. Calls `spawn_session(coding-task-workflow-agentic, fix: <finding>)` for each fix needed (script)
1797
+ 7. Awaits fix agents, re-queues for re-review (script loop)
1798
+ 8. Executes merge sequence when queue is empty (script)
1799
+
1800
+ The agent is only invoked for the *leaf work* -- the actual MR review, the actual coding fix. All coordination, routing, sequencing, and decision-making is a script.
1801
+
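Step 5's decision tree is what makes "routing is a script" concrete: it can be a pure function, testable without spawning a single session. A sketch, assuming the severity names from the mr-review pipeline (Critical/Major/Minor/Nit) and a finding shape invented for illustration:

```typescript
// Hypothetical finding shape -- the real structured-findings JSON may differ.
type Finding = { severity: "Critical" | "Major" | "Minor" | "Nit" };
type Route = "merge-queue" | "spawn-fix-agent" | "escalate";

// Deterministic routing: the same findings always produce the same route.
function routePr(findings: Finding[]): Route {
  if (findings.some(f => f.severity === "Critical" || f.severity === "Major")) {
    return "escalate"; // blocking -> human
  }
  if (findings.length > 0) return "spawn-fix-agent"; // Minor/Nit only
  return "merge-queue"; // clean
}
```

Because the function is pure, mocking spawn/await and unit-testing the routing in isolation is trivial.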
1802
+ **What the coordinator workflow looks like under this model:**
1803
+
1804
+ Not a workflow that a single LLM session runs end-to-end. Instead, a **script-driven workflow** where each step is a shell/TypeScript script that calls WorkTrain's API to spawn/await child sessions and route based on their structured outputs. WorkTrain provides:
1805
+ - `worktrain spawn --workflow <id> --goal <text>` → prints sessionHandle
1806
+ - `worktrain await --sessions <handle1,handle2>` → prints structured results JSON
1807
+ - `worktrain merge --pr <number>` → runs the merge sequence
1808
+
1809
+ The coordinator "workflow" is then a shell script or TypeScript file that composes these commands. Fully deterministic, fully auditable, no tokens burned on routing decisions.
1810
+
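In the TypeScript variant, composing those three commands is a few lines of `child_process` glue. The flags match the CLI listed above; the quoting helper and the JSON-parsing of `worktrain await` output are assumptions about the exact output format:

```typescript
import { execSync } from "node:child_process";

// Build the CLI invocations listed above. JSON.stringify doubles as shell
// quoting for the goal text (sufficient for simple goals -- an assumption).
const spawnCmd = (workflow: string, goal: string): string =>
  `worktrain spawn --workflow ${workflow} --goal ${JSON.stringify(goal)}`;

const awaitCmd = (handles: string[]): string =>
  `worktrain await --sessions ${handles.join(",")}`;

// Thin wrappers: spawn prints a sessionHandle, await prints results JSON.
const run = (cmd: string): string => execSync(cmd, { encoding: "utf8" }).trim();
const spawn = (workflow: string, goal: string): string => run(spawnCmd(workflow, goal));
const awaitSessions = (handles: string[]): unknown => JSON.parse(run(awaitCmd(handles)));
```

Everything downstream of these wrappers is ordinary code: conditionals, loops, and `set -x`-style logging, with no model in the loop.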
1811
+ **Why this is better than a coordinator agent:**
1812
+ - Zero LLM cost for coordination -- only leaf sessions burn tokens
1813
+ - Fully deterministic routing -- the same PR list always produces the same execution plan
1814
+ - Trivially auditable -- `set -x` on the shell script shows every decision
1815
+ - Trivially testable -- mock `worktrain spawn` and `worktrain await`, test the routing logic in isolation
1816
+ - Reusable across teams -- share the script, not the prompt
1817
+
1818
+ **Build order for this model:**
1819
+ 1. `worktrain spawn` / `worktrain await` CLI commands that wrap the session engine
1820
+ 2. Structured output format for leaf sessions (the handoff artifact JSON block already exists)
1821
+ 3. A reference `coordinator-groom-prs.sh` (or `.ts`) as the first coordinator template
1822
+ 4. Console DAG view updated to show coordinator-script-spawned sessions with parent-child relationships
1823
+
1824
+ **The long-term vision:** WorkTrain workflows handle the hard cognitive work. WorkTrain scripts handle orchestration, routing, and sequencing. Together they make the system fully autonomous with full observability and zero wasted tokens.
1825
+
1826
+ ---
1827
+
1828
+ ### Full development pipeline: coordinator scripts drive multi-phase autonomous work (Apr 15, 2026)
1829
+
1830
+ The coordinator isn't just for review → fix → merge. The full pipeline we run manually covers every phase of software development, with different phases triggered based on task classification.
1831
+
1832
+ **Full pipeline DAG:**
1833
+
1834
+ ```
1835
+ trigger: "implement feature X"
1836
+
1837
+ ├── [always] classify-task
1838
+ │ outputs: taskComplexity, riskLevel, hasUI, touchesArchitecture
1839
+
1840
+ ├── [if taskComplexity != Small] discovery
1841
+ │ workflow: routine-context-gathering (COMPLETENESS + DEPTH in parallel)
1842
+ │ outputs: context bundle, candidate files, invariants
1843
+
1844
+ ├── [if hasUI] ux-design
1845
+ │ workflow: ux-design-workflow (mockups, component spec, interaction model)
1846
+ │ outputs: design-spec.md, component-list
1847
+
1848
+ ├── [if touchesArchitecture] architecture-design
1849
+ │ workflow: coding-task-workflow-agentic (design phases only)
1850
+ │ outputs: design-candidates.md, selected approach
1851
+ │ └── arch-review (parallel, 2 auditors)
1852
+ │ workflow: routine-hypothesis-challenge + routine-philosophy-alignment
1853
+ │ outputs: findings → revise design if RED/ORANGE
1854
+
1855
+ ├── [always] coding-task
1856
+ │ workflow: coding-task-workflow-agentic
1857
+ │ inputs: context bundle + design spec + arch decision
1858
+ │ outputs: implementation + handoff artifact (commitType, prTitle, filesChanged)
1859
+
1860
+ ├── [always] mr-review
1861
+ │ workflow: mr-review-workflow-agentic
1862
+ │ outputs: findings with severity
1863
+ │ ├── [if clean] → auto-commit → auto-pr → merge
1864
+ │ ├── [if Minor/Nit] → spawn fix agent → re-review (max 3 passes)
1865
+ │ └── [if Critical/Major] → escalate to human (Slack/GitLab comment)
1866
+
1867
+ ├── [if riskLevel == High] prod-risk-audit
1868
+ │ workflow: production-risk-audit-workflow
1869
+ │ outputs: go / no-go + risk register
1870
+ │ └── [if no-go] → escalate, block merge
1871
+
1872
+ └── [if merged] notify
1873
+ script: post summary to Slack/GitLab with session DAG link
1874
+ ```
1875
+
1876
+ **The key insight:** the coordinator script reads the `taskComplexity`, `riskLevel`, `hasUI`, and `touchesArchitecture` flags from the classify step's output and uses them to decide which phases to spawn. A one-line bug fix runs: classify → coding-task → mr-review. A new UI feature runs everything. The same coordinator script handles both -- the DAG is dynamic, driven by structured outputs.
1877
+
1878
+ **Workflow library needed (not all exist yet):**
1879
+
1880
+ | Workflow | Status |
1881
+ |----------|--------|
1882
+ | `coding-task-workflow-agentic` | ✅ `coding-task-workflow-agentic.lean.v2.json` |
1883
+ | `mr-review-workflow-agentic` | ✅ `mr-review-workflow.agentic.v2.json` |
1884
+ | `routine-context-gathering` | ✅ `routines/` |
1885
+ | `routine-hypothesis-challenge` | ✅ `routines/` |
1886
+ | `routine-philosophy-alignment` | ✅ `routines/` |
1887
+ | `ux-design-workflow` | ✅ `ui-ux-design-workflow.json` |
1888
+ | `production-risk-audit-workflow` | ✅ `production-readiness-audit.json` |
1889
+ | `architecture-review-workflow` | ✅ `architecture-scalability-audit.json` |
1890
+ | `bug-investigation-workflow` | ✅ `bug-investigation.agentic.v2.json` |
1891
+ | `discovery-workflow` | ✅ `wr.discovery.json` |
1892
+ | `classify-task-workflow` | ❌ needs authoring -- fast, 1-step, outputs taskComplexity/riskLevel/hasUI/touchesArchitecture |
1893
+
1894
+ **The classify step is the gate.** A cheap, fast workflow that takes a task description and returns structured vars. This is where the coordinator decides what to run. It's the single most important missing workflow -- without it, the coordinator has to spawn everything for every task, which is wasteful.
1895
+
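The structured vars it returns might look like the sketch below. The field names come from the DAG above; the value enums are assumptions, not a shipped schema:

```typescript
// Hypothetical classify-task-workflow output; field names from the DAG above.
interface TaskClassification {
  taskComplexity: "Small" | "Medium" | "Large"; // "Small" skips discovery
  riskLevel: "Low" | "Medium" | "High";         // "High" gates prod-risk-audit
  hasUI: boolean;                               // gates ux-design
  touchesArchitecture: boolean;                 // gates architecture-design
}

const example: TaskClassification = {
  taskComplexity: "Small",
  riskLevel: "Low",
  hasUI: false,
  touchesArchitecture: false,
};

// Which optional phases would this classification trigger?
const phases = [
  example.taskComplexity !== "Small" ? "discovery" : null,
  example.hasUI ? "ux-design" : null,
  example.touchesArchitecture ? "architecture-design" : null,
  example.riskLevel === "High" ? "prod-risk-audit" : null,
].filter((p): p is string => p !== null);
// A classification like this yields no optional phases: the minimal path
// classify -> coding-task -> mr-review.
```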
1896
+ **The coordinator script for this pipeline:**
1897
+ ```typescript
1898
+ // coordinator-implement-feature.ts
1899
+ const { taskComplexity, riskLevel, hasUI, touchesArchitecture } =
1900
+ await runWorkflow('classify-task-workflow', { goal: taskDescription });
1901
+
1902
+ const contextHandle = taskComplexity !== 'Small'
1903
+ ? spawnSession('routine-context-gathering', { goal: taskDescription })
1904
+ : null;
1905
+
1906
+ const uxHandle = hasUI
1907
+ ? spawnSession('ux-design-workflow', { goal: taskDescription })
1908
+ : null;
1909
+
1910
+ // assumes awaitSessions resolves a null handle to null, keeping the destructure aligned
+ const [context, uxSpec] = await awaitSessions([contextHandle, uxHandle]);
1911
+
1912
+ // ... arch design if needed, then coding, then review, then audit
1913
+ ```
1914
+
1915
+ Zero coordinator LLM calls. Every decision is a script condition on structured output.
1916
+
1917
+ **Audit workflows the coordinator can chain:**
1918
+ Beyond MR review, the same pattern applies to any quality gate:
1919
+ - **Production risk audit** -- scans for: exposed secrets, missing rate limits, no-rollback schema changes, unguarded env vars
1920
+ - **Architecture audit** -- scans for: coupling violations, missing abstractions, incorrect layer dependencies
1921
+ - **Test coverage audit** -- identifies untested paths on changed files
1922
+ - **Performance audit** -- scans for N+1 queries, missing indexes, unbounded loops on hot paths
1923
+ - **Security audit** -- OWASP top 10 scan on changed surfaces
1924
+
1925
+ Each is a workflow. The coordinator decides which to run based on `riskLevel`, what files changed, and what domain the task touches. All feed findings back to the coordinator script which routes: fix, skip, or escalate.
1926
+
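That selection is, again, a script condition on structured output. A sketch of the audit picker, with file-pattern heuristics invented for illustration (only the workflow names come from the table above):

```typescript
// Map risk level + changed files to the audit workflows listed above.
// The path heuristics are illustrative assumptions, not shipped rules.
function selectAudits(riskLevel: string, changedFiles: string[]): string[] {
  const audits: string[] = [];
  if (riskLevel === "High") audits.push("production-risk-audit-workflow");
  if (changedFiles.some(f => f.includes("migrations/") || f.endsWith(".sql"))) {
    audits.push("production-risk-audit-workflow"); // no-rollback schema changes
  }
  if (changedFiles.some(f => f.startsWith("src/"))) {
    audits.push("architecture-review-workflow");
  }
  return [...new Set(audits)]; // dedupe when multiple rules fire
}
```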
1927
+ ---
1928
+
1929
+ ### Additional coordinator pipeline templates (Apr 15, 2026)
1930
+
1931
+ Beyond the feature implementation pipeline, three more coordinator templates are high value:
1932
+
1933
+ ---
1934
+
1935
+ #### Backlog grooming coordinator
1936
+
1937
+ ```
1938
+ trigger: "groom backlog" (cron: weekly, or manual dispatch)
1939
+
1940
+ ├── [for each open issue] classify-issue
1941
+ │ outputs: issueType (bug/feature/tech-debt/question), priority, complexity, stale?
1942
+
1943
+ ├── [for stale issues > 90 days with no activity] auto-close-or-ping
1944
+ │ script: post "Still relevant?" comment, label as stale
1945
+
1946
+ ├── [for unclassified issues] label-and-size
1947
+ │ script: apply labels (bug/enhancement/question), size estimate (XS/S/M/L)
1948
+
1949
+ ├── [for duplicate issues] detect-duplicates
1950
+ │ workflow: semantic search over existing issues, flag likely dupes
1951
+ │ script: comment "possible duplicate of #X", label as needs-triage
1952
+
1953
+ ├── [for high-priority bugs with no assignee] suggest-fix-approach
1954
+ │ workflow: bug-investigation-agentic (surface root cause + candidate fix)
1955
+ │ outputs: investigation summary posted as issue comment
1956
+
1957
+ └── produce grooming summary
1958
+ script: post weekly digest to Slack -- issues triaged, dupes found, investigations run
1959
+ ```
1960
+
1961
+ No human needed for any of this. The coordinator classifies, labels, pings stale items, and runs investigations on the important ones. The human reviews the digest and acts on what needs judgment.
1962
+
1963
+ ---
1964
+
1965
+ #### Bug investigation + fix coordinator
1966
+
1967
+ ```
1968
+ trigger: new issue labeled "bug" OR incident alert from monitoring
1969
+
1970
+ ├── bug-investigation-agentic
1971
+ │ outputs: root cause hypothesis, affected files, severity, reproduction steps
1972
+
1973
+ ├── [if severity == Critical] page-oncall
1974
+ │ script: post to Slack #incidents with investigation summary + session link
1975
+
1976
+ ├── [if severity <= High and hypothesis_confidence >= 0.8] attempt-fix
1977
+ │ workflow: coding-task-workflow-agentic (targeted fix)
1978
+ │ inputs: investigation findings, affected files, reproduction steps
1979
+ │ outputs: implementation + handoff artifact
1980
+ │ │
1981
+ │ ├── mr-review
1982
+ │ │ └── [if clean] auto-commit → auto-pr
1983
+ │ │
1984
+ │ └── regression-test
1985
+ │ script: run test suite against affected paths
1986
+ │ outputs: pass/fail
1987
+
1988
+ ├── [if severity == Critical OR hypothesis_confidence < 0.8] escalate
1989
+ │ script: post investigation summary to issue + tag team lead
1990
+
1991
+ └── close-or-update-issue
1992
+ script: if fix merged → close with "Fixed in PR #X". if escalated → update with findings.
1993
+ ```
1994
+
1995
+ The daemon can go from "bug filed" to "fix merged" with zero human involvement for well-understood bugs with high-confidence hypotheses. Critical bugs and uncertain root causes always escalate to a human -- the investigation is done for them, not by them.
1996
+
1997
+ **What makes this work:**
1998
+ - `bug-investigation-agentic` already exists and produces structured findings
1999
+ - The `hypothesis_confidence` output from the investigation gates the auto-fix attempt
2000
+ - The coordinator script decides: high confidence + not critical = try to fix autonomously
2001
+ - The circuit breaker (max 3 fix attempts) prevents infinite loops on hard bugs
2002
+ - The human always gets the investigation findings, whether the fix succeeded or not
2003
+
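The confidence gate plus circuit breaker reduces to a small loop. The 0.8 threshold and the 3-attempt cap come from the pipeline above; `spawnFix` and `review` are stand-ins for the real spawn/await calls:

```typescript
type Verdict = "clean" | "findings";

// Gate the autonomous fix attempt, then loop fix -> re-review with a hard
// cap so hard bugs can't spin forever.
async function autoFix(
  severity: string,
  confidence: number,
  spawnFix: () => Promise<void>,  // stand-in for spawn(coding-task-workflow-agentic)
  review: () => Promise<Verdict>, // stand-in for spawn(mr-review-workflow-agentic)
): Promise<"merged" | "escalated"> {
  if (severity === "Critical" || confidence < 0.8) return "escalated";
  for (let attempt = 1; attempt <= 3; attempt++) { // circuit breaker: max 3 attempts
    await spawnFix();
    if ((await review()) === "clean") return "merged";
  }
  return "escalated"; // still failing after 3 passes -- hand to a human
}
```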
2004
+ ---
2005
+
2006
+ #### Incident monitoring coordinator
2007
+
2008
+ ```
2009
+ trigger: monitoring alert (CPU spike, error rate increase, latency P99 > threshold)
2010
+
2011
+ ├── triage-alert
2012
+ │ workflow: classify if real incident vs noise (check recent deploys, known issues)
2013
+ │ outputs: isRealIncident, likelyCause, affectedServices
2014
+
2015
+ ├── [if isRealIncident] investigate
2016
+ │ workflow: bug-investigation-agentic (logs, traces, recent changes)
2017
+ │ outputs: root cause, blast radius, mitigation options
2018
+
2019
+ ├── [if mitigation is config change or rollback] auto-mitigate
2020
+ │ script: execute safe mitigations (feature flag flip, config change)
2021
+ │ -- NEVER auto-rollback code without human approval
2022
+
2023
+ ├── page-oncall
2024
+ │ script: post to Slack #incidents with full context + session DAG link
2025
+ │ content: what fired, what was found, what was auto-mitigated, what needs human action
2026
+
2027
+ └── follow-up
2028
+ cron: 30 min later → check if resolved, post update
2029
+ ```
2030
+
2031
+ The operator gets paged with a complete picture: what happened, likely why, what was already done automatically, and exactly what decision they need to make. No more waking up to an alert with no context.
2032
+
2033
+ ---
2034
+
2035
+ ### Interactive ideation: WorkTrain as a thinking partner with full project context (Apr 15, 2026)
2036
+
2037
+ **What this is:** The ability to have a conversation with WorkTrain the way we've been talking today -- bouncing ideas, asking "what if", surfacing tradeoffs, refining designs -- and have WorkTrain respond with full awareness of what's been built, what's in flight, what's in the backlog, and what decisions were made and why.
2038
+
2039
+ Today this requires a human (Claude Code + a long conversation) to maintain context across everything. WorkTrain should be able to do this natively because it already has:
2040
+ - The session store (every step note from every session ever run)
2041
+ - The knowledge graph (structural understanding of the codebase)
2042
+ - The backlog (design decisions, research findings, priorities)
2043
+ - In-flight agent state (what's running, what's been found)
2044
+
2045
+ **The gap:** there's no conversational interface that pulls all of this together. The console shows sessions. The backlog is a markdown file. There's no "talk to WorkTrain about the project" entry point.
2046
+
2047
+ **What it needs:**
2048
+
2049
+ 1. **A "talk" command** -- `worktrain talk` opens an interactive session that starts with a synthesized context bundle: recent session outcomes, open PRs, backlog top items, any findings from in-flight agents. The user types naturally; WorkTrain responds with awareness of all of it.
2050
+
2051
+ 2. **Project memory** -- WorkTrain maintains a synthesized "project state" that's updated after each coordinator run or major session batch. Answers questions like: "what did we build today?", "why did we choose polling triggers over webhooks?", "what's the biggest gap right now?", "what would happen if we removed pi-mono?" without requiring the user to re-explain context.
2052
+
2053
+ 3. **Idea capture** -- when the conversation surfaces something new (a gap, an architectural insight, a design decision), WorkTrain should offer to record it to the backlog or open a GitHub issue immediately, right from the conversation.
2054
+
2055
+ 4. **Context awareness** -- WorkTrain knows which agents are running, what they've found so far, and can report on it during a conversation: "the #400 review just came back with a fetch timeout blocker -- want me to queue a fix agent?"
2056
+
2057
+ **What makes this different from just using Claude Code:** Claude Code has no persistent project context -- every conversation starts from scratch. WorkTrain's ideation session starts with everything loaded: session history, knowledge graph results for relevant files, backlog items, open PRs. The conversation is grounded in the actual project state, not just what the user remembers to paste in.
2058
+
2059
+ **Architecture:** this is a new `talk` workflow -- a conversational loop workflow with no fixed step count. The agent has access to `query_knowledge_graph`, `read_session_notes`, `read_backlog`, `list_in_flight_agents`, and `append_to_backlog` as tools. It maintains the conversation as a standard message history. The session never "completes" -- it ends when the user exits.
2060
+
2061
+ ---
2062
+
2063
+ ### Automatic gap and improvement detection: proactive WorkTrain (Apr 15, 2026)
2064
+
2065
+ **What this is:** WorkTrain notices things without being asked. After a batch of work lands, it scans for gaps, inconsistencies, missed connections, and improvement opportunities -- and surfaces them proactively.
2066
+
2067
+ **Examples of what it would have caught today without human prompting:**
2068
+ - "PR #400 delivery client has no fetch timeout -- delivery could hang indefinitely" (caught by MR review, but WorkTrain could catch this pre-review)
2069
+ - "PR #391 picked up GAP-1 crash recovery code it shouldn't have -- scope leak" (caught by the reviewer)
2070
+ - "The backlog says knowledge graph should be persistent but the spike uses in-memory DuckDB" (gap between spec and impl)
2071
+ - "Three open PRs all modify workflow-runner.ts -- they're going to conflict when merged sequentially"
2072
+ - "Issue #393 filed for loadSessionNotes coverage -- this is related to the GAP-2 PR that's open, might as well fix both together"
2073
+ - "The classify-task-workflow was just authored but it's not referenced in the coordinator spec yet"
2074
+
2075
+ **Two modes:**
2076
+
2077
+ **1. Event-triggered scans** -- fires after significant events:
2078
+ - After a batch of PRs merge: scan for spec/impl gaps, check if any backlog items are now addressable
2079
+ - After a new workflow is authored: check if it should be added to the coordinator pipeline
2080
+ - After a bug is filed: check if any recent changes are likely culprits
2081
+ - After a coordinator run: check if findings surfaced any architectural concerns not in the backlog
2082
+
2083
+ **2. Periodic health checks** -- runs on a schedule (e.g. weekly):
2084
+ - Are there backlog items that have all their prerequisites met but haven't been started?
2085
+ - Are there open issues that are actually already fixed by merged PRs?
2086
+ - Are there PRs that have been approved but not merged for more than N days?
2087
+ - Is the knowledge graph stale (files changed since last index)?
2088
+ - Are any daemon sessions orphaned (in daemon-sessions/ but older than 24h)?
2089
+
2090
+ **Architecture:** a `watchdog` workflow that runs on a cron trigger. It queries the knowledge graph, reads recent session notes, lists open PRs and issues, reads the backlog priorities, and produces a `gap-report.md` with actionable findings. Each finding is either: auto-actionable (spawn a fix agent), conversation-worthy (add to the ideation queue), or escalation-worthy (post to Slack/file a GitHub issue).
2091
+
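The three-way dispatch on findings is another script-side table lookup. The entry shape is a guess at what `gap-report.md` items would carry; the disposition values mirror the three outcomes described above:

```typescript
// Hypothetical gap-report entry shape.
interface GapFinding {
  summary: string;
  disposition: "auto-actionable" | "conversation-worthy" | "escalation-worthy";
}

// Disposition -> action, mirroring the three outcomes described above.
const routes = {
  "auto-actionable": "spawn-fix-agent",
  "conversation-worthy": "ideation-queue",
  "escalation-worthy": "notify-human",
} as const;

function dispatch(f: GapFinding): string {
  return `${routes[f.disposition]}: ${f.summary}`;
}
```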
2092
+ **The key difference from the coordinator:** the coordinator executes a known plan. The watchdog discovers things that aren't in any plan yet. It's the system's immune response -- continuously scanning for drift between intention and reality.
2093
+
2094
+ **What makes this tractable:** WorkTrain already has all the inputs. The knowledge graph has the structural state. The session store has the history. The backlog has the intentions. The gap detection is the synthesis layer that connects them -- "what was planned" vs "what was built" vs "what's in flight". This is exactly the kind of thing an LLM is good at: cross-referencing multiple sources and identifying inconsistencies.
2095
+
2096
+ ---
2097
+
1782
2098
  ### Dynamic model selection: right model for the right task (Apr 15, 2026)
1783
2099
 
1784
2100
  **The principle:** not every task needs Sonnet 4.6. Not every task should be locked to Anthropic. The coordinator and the task classifier should be able to select the model dynamically based on what the task actually needs.
@@ -5225,3 +5541,295 @@ With `complete_step` + `spawn_agent`:
5225
5541
  3. **Notifications** -- macOS notification + generic webhook. ~30 min implementation.
5226
5542
  4. **Late-bound goals** -- default `goalTemplate: "{{$.goal}}"` when no static goal. 10-line fix in trigger-store.ts.
5227
5543
  5. **Artifacts store foundation** -- `~/.workrail/artifacts/` directory structure. Step 1 of the first-class artifacts vision.
5544
+
5545
+ ---
5546
+
5547
+ ## What WorkTrain is currently capable of (as of v3.36.0, Apr 18, 2026)
5548
+
5549
+ Tested empirically today. This is what actually works, not what's specced.
5550
+
5551
+ ---
5552
+
5553
+ ### Autonomous workflow execution
5554
+
5555
+ **Confirmed working:**
5556
+ - Accepts webhook triggers and dispatches workflow sessions autonomously
5557
+ - `mr-review-workflow-agentic` v2.6 runs end-to-end: context gathering, parallel reviewer phases, synthesis loop, validation, structured handoff. **Confirmed today** (sess_3bmj..., APPROVE verdict).
5558
+ - `coding-task-workflow-agentic` (lean v2) runs end-to-end for Small tasks. **Confirmed today** (evidenceFrom field implementation, completed successfully).
5559
+ - `wr.discovery` v3.2.0 runs with goal reframing. **Confirmed today** (spawn_agent architecture discovery).
5560
+ - Sessions advance through 8+ workflow steps autonomously (36 step advances today across 6 sessions).
5561
+ - 402 LLM turns + 660 tool calls executed autonomously today.
5562
+
5563
+ **Known reliability issues:**
5564
+ - `wr.discovery` hit timeout once today -- multi-step discovery workflows can run long and hit the 60-min limit
5565
+ - One coding task failed (error) -- assessment gate or tool issue, still being investigated
5566
+ - One MR review timed out -- complex PRs need more time than the configured limit
5567
+
5568
+ ---
5569
+
5570
+ ### Trigger system
5571
+
5572
+ **Confirmed working:**
5573
+ - Generic webhook trigger (fire-and-forget via `POST /webhook/<id>`)
5574
+ - GitHub Issues polling (no webhook registration needed)
5575
+ - GitLab MR polling (no webhook registration needed)
5576
+ - Multiple triggers in one triggers.yml
5577
+ - WorkflowId validation at startup (wrong IDs caught before traffic arrives)
5578
+ - `goalTemplate` interpolation from webhook payload
5579
+
5580
+ **Not yet working:**
5581
+ - Native cron trigger (requires OS crontab workaround)
5582
+ - Late-bound goals (static goal required in triggers.yml, dynamic goal via payload requires `goalTemplate`)
5583
+
5584
+ ---
5585
+
5586
+ ### Agent capabilities inside sessions
5587
+
5588
+ **Confirmed working:**
5589
+ - Bash (read files, run commands, git, gh CLI)
5590
+ - Read (read files)
5591
+ - Write (write files -- used by coding tasks)
5592
+ - `complete_step` (daemon-managed token, LLM never handles continueToken)
5593
+ - `continue_workflow` (deprecated but functional for backward compat)
5594
+ - `report_issue` (agents call this when stuck, logged to `~/.workrail/issues/`)
5595
+ - `spawn_agent` (spawns child WorkRail sessions in-process, v3.35.1+)
5596
+ - Assessment artifact submission (`artifacts` field in complete_step)
5597
+
5598
+ **Not yet working in production:**
5599
+ - `spawn_agent` just shipped (v3.35.1) -- untested in real workflows yet
5600
+ - `complete_step` just shipped (v3.34.1) -- daemon now using it but not yet validated end-to-end through full assessment-gate workflow
5601
+
5602
+ ---
5603
+
5604
+ ### Observability
5605
+
5606
+ **Confirmed working:**
5607
+ - Daemon event log (`~/.workrail/events/daemon/YYYY-MM-DD.jsonl`) -- every LLM turn, tool call, session lifecycle event
5608
+ - `worktrain logs --follow` -- real-time event stream
5609
+ - `worktrain status <sessionId>` -- session health summary with stuck detection
5610
+ - Console (`http://localhost:3456/console`) -- live sessions, step notes, repoRoot grouping, `isLive` from event log
5611
+ - Stuck detection -- `agent_stuck` events emitted for repeated tool calls, no-progress, timeout imminent
5612
+ - `issue_reported` events when agents hit walls
5613
+
5614
+ **Known gaps:**
5615
+ - Console shows flat session list, not work-unit tree (parentSessionId data exists, visualization not built)
5616
+ - `isLive` only covers today's event log (cross-midnight limitation)
5617
+ - No push notifications when daemon completes work
5618
+
5619
+ ---
5620
+
5621
+ ### Infrastructure
5622
+
5623
+ **Confirmed working:**
5624
+ - MCP server stable (v3.36.0, bridge removed, EPIPE fixed)
5625
+ - `worktrain daemon --install` creates launchd service (daemon survives MCP reconnects)
5626
+ - `worktrain console` standalone (independent of daemon and MCP server)
5627
+ - `worktrain init` guided onboarding
5628
+ - `worktrain tell` / `worktrain inbox` message queue
5629
+ - `worktrain spawn` / `worktrain await` CLI (primitives exist, no coordinator templates yet)
5630
+ - Crash recovery (orphaned sessions detected and cleared on startup)
5631
+ - Workspace context injection (CLAUDE.md, AGENTS.md, daemon-soul.md)
5632
+ - maxConcurrentSessions semaphore (default 3)
5633
+ - Per-trigger timeout + max-turn limits
5634
+
5635
+ ---
5636
+
5637
+ ### What WorkTrain cannot do yet (key gaps for autonomous production use)
5638
+
5639
+ 1. **Multi-phase work is invisible** -- sessions are flat in console. A 5-session MR review pipeline looks like 5 unrelated sessions.
5640
+ 2. **No coordinator scripts** -- spawn_agent and spawn/await exist but there's no coordinator template to run a full pipeline.
5641
+ 3. **No auto-commit** -- agents write code but don't commit or open PRs autonomously (merge workflow exists in spec, not in production use).
5642
+ 4. **No notifications** -- daemon completes work silently.
5643
+ 5. **Assessment gates unreliable** -- complete_step fixes the token issue but full assessment-gate workflows not yet validated end-to-end.
5644
+ 6. **Subagent delegation invisible** -- spawn_agent creates proper child sessions, but workflows still use mcp__nested-subagent__Task for most delegation (invisible black box).
5645
+ 7. **No artifact store** -- agents dump markdown in the repo as a workaround.
5646
+ 8. **Context poverty** -- each session starts from scratch, no persistent knowledge graph.
5647

---

### WorkTrain benchmarking: prove it's better, publish the results (Apr 18, 2026)

**The opportunity:** if WorkTrain can demonstrably outperform one-shot LLM calls and human-in-the-loop review for specific task types, with reproducible benchmarks published on GitHub and visible in the console, that's the killer adoption argument. Not "trust us, it's better" -- actual numbers.

**What to benchmark:**

| Dimension | WorkTrain | One-shot | Human-in-loop |
|-----------|-----------|----------|---------------|
| MR review finding rate (Critical/Major caught) | ? | ? | ? |
| False positive rate (findings that were wrong) | ? | ? | ? |
| Coding task correctness (builds + tests pass) | ? | ? | ? |
| Coding task completeness (wiring, exports, tests) | ? | ? | ? |
| Bug investigation accuracy (correct root cause) | ? | ? | ? |
| Time to complete | ? | ? | ? |
| Token cost per task | ? | ? | ? |

**Model comparison within WorkTrain:**
- Haiku (fast, cheap) vs Sonnet (balanced) vs Opus (best) for each task type
- Other providers: GPT-4o, Gemini 1.5 Pro, Llama 3 (via Ollama) -- can WorkTrain run on any model?
- Does the workflow structure make Haiku competitive with one-shot Sonnet? (hypothesis: yes, for structured tasks)

**The benchmark suite:**

1. **MR review benchmark** -- 50 PRs with known ground truth (bugs that were later filed, correct implementations that had no bugs). Score: recall (caught real issues) + precision (didn't flag non-issues).
2. **Coding task benchmark** -- 50 tasks with objective completion criteria (build passes, tests pass, correct wiring). Score: % completing correctly on the first autonomous run.
3. **Bug investigation benchmark** -- 30 real bugs with known root causes. Score: % identifying the correct root cause.
4. **Discovery quality benchmark** -- 20 design questions with expert-evaluated answers. Score: coverage of key tradeoffs, identification of non-obvious alternatives.

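Scoring for the MR review benchmark reduces to recall and precision over labeled findings. A minimal sketch of what such a scorer could look like -- the types and function are hypothetical illustrations, not part of the package:

```typescript
// Hypothetical scorer for the MR review benchmark: compare an agent's
// reported findings against an expert-labeled ground-truth set per PR.
interface BenchmarkCase {
  groundTruth: string[]; // IDs of real issues known to exist in the PR
  reported: string[];    // IDs of issues the agent flagged
}

function scoreMrReview(cases: BenchmarkCase[]): { recall: number; precision: number } {
  let realIssues = 0;     // total real issues across all PRs
  let caught = 0;         // distinct real issues the agent flagged
  let flagged = 0;        // total findings the agent reported
  let truePositives = 0;  // reported findings that matched a real issue
  for (const c of cases) {
    const truth = new Set(c.groundTruth);
    realIssues += truth.size;
    flagged += c.reported.length;
    const hits = c.reported.filter((id) => truth.has(id));
    truePositives += hits.length;
    caught += new Set(hits).size;
  }
  return {
    recall: realIssues === 0 ? 1 : caught / realIssues,
    precision: flagged === 0 ? 1 : truePositives / flagged,
  };
}
```

Note that a clean PR (empty `groundTruth`) only affects precision: anything flagged there counts as a false positive, which is exactly why the suite needs cleanly-shipped PRs alongside buggy ones.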
**How to publish:**

- `docs/benchmarks/` directory in the repo -- YAML results files, one per benchmark run
- GitHub Actions CI job that runs the benchmark suite on each release and commits results
- Console "Benchmarks" tab showing historical performance by model and workflow version
- Public benchmark page (once cloud hosting exists) showing WorkTrain vs alternatives
- Badge in the README: "MR review recall: 87% (Sonnet 4.6, v3.36.0)"

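Keeping the YAML results machine-checkable argues for pinning a schema for one run record early. A sketch of what a single entry might contain -- all field names here are illustrative; no such schema exists in the repo yet:

```typescript
// Hypothetical shape of one entry in a docs/benchmarks/ results file.
// Field names are illustrative, not a defined schema.
interface BenchmarkResult {
  benchmark: "mr-review" | "coding" | "bug-investigation" | "discovery";
  workflowVersion: string; // e.g. "3.36.0"
  model: string;           // e.g. "claude-sonnet-4-6"
  runs: number;            // sample size for this row
  recall?: number;         // mr-review only
  precision?: number;      // mr-review only
  passRate?: number;       // coding / bug-investigation benchmarks
  tokensPerTask: number;   // mean token cost
  wallClockSeconds: number;
}

const example: BenchmarkResult = {
  benchmark: "mr-review",
  workflowVersion: "3.36.0",
  model: "claude-sonnet-4-6",
  runs: 40,
  recall: 0.87,
  precision: 0.91,
  tokensPerTask: 215_000,
  wallClockSeconds: 640,
};
```

A typed record like this is what would let the CI job diff a new run against the previous release and fail on regression.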
**Why this matters for adoption:**
- Developers are skeptical of autonomous agents -- "it probably makes stuff up"
- Hard numbers cut through skepticism instantly
- Showing WorkTrain-with-Haiku beating one-shot Opus on structured tasks is a compelling cost argument
- Showing improvement across workflow versions gives teams confidence the system is getting better
- The benchmark suite is also a regression test -- if a workflow change degrades performance, CI catches it

**What makes this hard:**
- Ground truth is expensive to establish (expert-labeled evaluation sets)
- Some tasks are inherently subjective (discovery quality)
- Benchmarks can be gamed (optimizing for the benchmark, not real performance)
- Results need enough volume to be statistically meaningful

**Starting point:** the mr-review workflow is the easiest to benchmark objectively. Start with 20 PRs where bugs were later discovered and 20 PRs that shipped cleanly. Run each through `mr-review-workflow-agentic` on several model tiers. Measure recall and precision. That's a publishable result with one weekend of work.

---

### Self-healing daemon: detect internal failures, kill, diagnose, fix, reboot, resume (Apr 18, 2026)

**The problem:** today, if WorkRail's MCP connection drops or the daemon's internal tooling fails, agents continue running without enforcement -- producing unverified output that looks correct but bypassed all workflow gates. The user has no way to know this happened until they manually inspect session completion events.

**What happened today:** WorkRail MCP went down mid-session across ~10 concurrent agents. All sessions show INCOMPLETE (no `run_completed` event). Agents produced PRs, reviews, and merges -- two PRs landed on main -- without any confirmed workflow completion. A manual audit was required after the fact.

**What WorkTrain needs:**

**1. Detect its own tooling failures**
- Monitor whether `complete_step` / `continue_workflow` tool calls are succeeding or timing out
- Detect when the WorkRail session store becomes unreachable
- Detect when the MCP connection (for agents that use MCP mode) is lost
- Distinguish: "agent is thinking" vs "agent is stuck" vs "agent's tools are broken"

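The thinking/stuck/broken distinction can be approximated at the tool-call boundary: a call that rejects means the tooling is broken, and a call that stays silent past its deadline means the agent may be stuck. A sketch of such a watchdog, assuming the daemon can wrap each engine tool call -- this is a hypothetical helper, not the daemon's actual code:

```typescript
// Watchdog around a single engine tool call. A rejection (e.g. EPIPE or an
// MCP disconnect) means broken tooling; silence past the deadline means the
// agent may be stuck; resolution within the deadline means healthy.
const TIMED_OUT: unique symbol = Symbol("timed-out");

type Verdict = "ok" | "tooling_broken" | "possibly_stuck";

async function watchToolCall<T>(
  call: () => Promise<T>,
  deadlineMs: number,
): Promise<{ verdict: Verdict; value?: T }> {
  const deadline = new Promise<typeof TIMED_OUT>((resolve) =>
    setTimeout(() => resolve(TIMED_OUT), deadlineMs),
  );
  try {
    const result = await Promise.race([call(), deadline]);
    return result === TIMED_OUT
      ? { verdict: "possibly_stuck" }
      : { verdict: "ok", value: result };
  } catch {
    return { verdict: "tooling_broken" };
  }
}
```

"Possibly stuck" is deliberately soft: a long deadline overrun is only evidence, and the real policy would combine it with turn counts and the per-trigger timeouts that already exist.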
**2. Kill the agent cleanly on detected failure**
- When internal tooling is detected as broken, stop the agent immediately -- do NOT let it continue without enforcement
- Retain the full conversation history and step notes up to the point of failure
- Mark the session as `interrupted_tooling_failure` (distinct from `error` or `timeout`)
- Write the failure event to the daemon event log with the exact cause

**3. Self-diagnose**
- Run a lightweight health check: can we reach the session store? Can we decode the continueToken? Is the WorkRail engine responding?
- Identify the root cause: MCP disconnect? Session store corruption? Token decode failure? Port conflict?
- Distinguish recoverable failures (restart and resume) from non-recoverable ones (session data corrupted, must restart from scratch)

**4. Fix and reboot**
- For recoverable failures: restart the WorkRail engine in-process, re-register tools, and verify health before resuming
- For MCP-mode failures: reconnect without killing the parent session
- For port conflicts: clear the lock and rebind
- All of this happens automatically, without user intervention

**5. Resume with context**
- Resume the session from the last confirmed `complete_step` / `advance_recorded` event
- Inject a `<context>` block into the resumed session: "Your previous session was interrupted due to [reason]. Here is what you completed before the interruption: [last 3 step notes]. Resume from where you left off."
- The agent never knows the interruption happened -- it just continues
- If resume fails (token expired, session corrupted), escalate to the user with full context on what was and wasn't completed

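The injected `<context>` block is just a string assembled from the failure reason and the last confirmed step notes. A sketch, with the helper name and note formatting as illustrative assumptions:

```typescript
// Build the <context> block injected into a resumed session, using the
// last three confirmed step notes (wording follows the plan above; the
// function itself is hypothetical).
function buildResumeContext(reason: string, stepNotes: string[]): string {
  const recent = stepNotes
    .slice(-3) // only the last three notes, oldest first
    .map((note, i) => `${i + 1}. ${note}`)
    .join("\n");
  return [
    "<context>",
    `Your previous session was interrupted due to ${reason}.`,
    "Here is what you completed before the interruption:",
    recent,
    "Resume from where you left off.",
    "</context>",
  ].join("\n");
}
```

Because the block is rebuilt from the event log, it stays truthful even if the agent's own transcript was cut off mid-turn.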
**6. Audit trail**
- Every self-heal event is recorded in `~/.workrail/events/daemon/YYYY-MM-DD.jsonl` as `tooling_failure_detected`, `self_heal_started`, and `self_heal_succeeded` / `self_heal_failed`
- The console shows a `⚠️ self-healed` badge on sessions that were interrupted and resumed
- The user can query "which sessions were interrupted today?" and get the full list with causes

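Appending one JSON object per line to the daily file keeps the audit trail crash-safe and trivially greppable. A sketch -- the event names come from the plan above, but the helpers and record fields beyond `type` are illustrative:

```typescript
import { appendFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

type SelfHealEventType =
  | "tooling_failure_detected"
  | "self_heal_started"
  | "self_heal_succeeded"
  | "self_heal_failed";

// Build one JSONL line for the daemon event log.
function selfHealEvent(type: SelfHealEventType, detail: Record<string, unknown>): string {
  return JSON.stringify({ type, at: new Date().toISOString(), ...detail });
}

// Append it to ~/.workrail/events/daemon/YYYY-MM-DD.jsonl
function recordSelfHealEvent(type: SelfHealEventType, detail: Record<string, unknown>): void {
  const dir = join(homedir(), ".workrail", "events", "daemon");
  mkdirSync(dir, { recursive: true }); // no-op if the directory exists
  const day = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
  appendFileSync(join(dir, `${day}.jsonl`), selfHealEvent(type, detail) + "\n");
}
```

Splitting the pure line-builder from the file append keeps the event format testable without touching the filesystem.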
**Implementation path:**
- Phase 1: detection only -- log `tooling_failure_detected` when a session produces output without `run_completed`. Surface it in the console and the status command.
- Phase 2: kill on detection -- stop agents immediately when a tooling failure is detected. No more unverified output reaching main.
- Phase 3: auto-resume -- restart and resume for recoverable failures.
- Phase 4: full self-heal loop -- diagnose, fix, reboot, and resume automatically.

**The invariant that must hold:** no output from a WorkTrain agent should be acted on (committed, merged, posted) unless its session has a confirmed `run_completed` event. This is the enforcement guarantee. Self-healing is what makes that guarantee survivable.
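The invariant can be enforced with a trivial gate anywhere output is about to be acted on. A sketch, with the event shape and helper names as illustrative assumptions:

```typescript
// Enforcement gate for the invariant above: refuse to commit/merge/post
// unless the session's event log contains a run_completed event.
interface SessionEvent {
  type: string; // e.g. "advance_recorded", "run_completed"
}

function hasConfirmedCompletion(events: SessionEvent[]): boolean {
  return events.some((e) => e.type === "run_completed");
}

function assertActionable(sessionId: string, events: SessionEvent[]): void {
  if (!hasConfirmedCompletion(events)) {
    throw new Error(
      `refusing to act on session ${sessionId}: no run_completed event recorded`,
    );
  }
}
```

Phase 1 above is exactly this check run in reverse: scan for sessions with output but no `run_completed` and flag them.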

---

### CRITICAL architectural clarity: three systems, one shared engine (permanent reference)

This is the most important architectural fact about this codebase. Every agent and contributor must understand it before touching anything.

**WorkRail/WorkTrain is three separate systems sharing one engine:**

```
                   Shared core
          ┌─────────────────────────────┐
          │ WorkRail engine             │
          │ src/v2/durable-core/        │
          │ ~/.workrail/data/sessions/  │
          │ workflow registry           │
          └────┬──────────┬──────────┬──┘
               │          │          │
┌──────────────▼─┐  ┌─────▼──────┐  ┌▼─────────────┐
│ WorkRail MCP   │  │ WorkTrain  │  │ WorkRail     │
│ Server         │  │ Daemon     │  │ Console      │
│ workrail start │  │ worktrain  │  │ worktrain    │
│ src/mcp/       │  │ daemon     │  │ console      │
│                │  │ src/daemon/│  │ src/console/ │
│ Claude Code    │  │ src/trigger│  │              │
│ connects here  │  │            │  │ Shows BOTH   │
│ via stdio      │  │ autonomous │  │ MCP + daemon │
│                │  │ agent loop │  │ sessions     │
└────────────────┘  └────────────┘  └──────────────┘
```

**WorkRail MCP Server:** `workrail start` → stdio MCP server. Claude Code connects here. Provides `start_workflow`, `complete_step`, `list_workflows`, etc. as MCP tools. Source: `src/mcp/`. Must be bulletproof -- a crash kills all Claude Code workflow tools.

**WorkTrain Daemon:** `worktrain daemon` → autonomous agent runner. Drives sessions without human involvement. Calls the WorkRail engine **directly in-process** -- it does NOT go through the MCP server. Source: `src/daemon/`, `src/trigger/`.

**WorkRail Console:** `worktrain console` → unified read-only session viewer. Shows sessions from **both** the MCP server and the daemon (they share the same session store). Requires neither the MCP server nor the daemon to be running.

**Rules that follow from this:**
- MCP server code must never import from `src/daemon/` or `src/trigger/`
- Daemon code must never depend on the MCP server being alive
- Console code reads the session store directly -- no IPC with the MCP server or the daemon is needed
- These are separate processes. A crash in one does not affect the others.

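The import boundary in the first rule can be enforced mechanically rather than by convention, for example with an ESLint `no-restricted-imports` override scoped to the MCP server sources. A sketch only -- the repo's actual lint configuration may differ:

```jsonc
// .eslintrc fragment (illustrative): fail the lint whenever code under
// src/mcp/ imports from the daemon or trigger modules.
{
  "overrides": [
    {
      "files": ["src/mcp/**/*.ts"],
      "rules": {
        "no-restricted-imports": [
          "error",
          { "patterns": ["**/daemon/**", "**/trigger/**"] }
        ]
      }
    }
  ]
}
```

A lint rule catches the violation at commit time, long before a daemon import can drag daemon lifecycle assumptions into the MCP server process.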
---