@exaudeus/workrail 3.36.0 → 3.37.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. package/dist/config/config-file.js +2 -0
  2. package/dist/console-ui/assets/{index-n8cJrS4v.js → index-t8Wi304z.js} +1 -1
  3. package/dist/console-ui/index.html +1 -1
  4. package/dist/daemon/workflow-runner.d.ts +1 -0
  5. package/dist/daemon/workflow-runner.js +3 -6
  6. package/dist/infrastructure/session/SessionManager.js +17 -4
  7. package/dist/manifest.json +25 -17
  8. package/dist/trigger/notification-service.d.ts +42 -0
  9. package/dist/trigger/notification-service.js +164 -0
  10. package/dist/trigger/trigger-listener.js +7 -1
  11. package/dist/trigger/trigger-router.d.ts +3 -1
  12. package/dist/trigger/trigger-router.js +4 -1
  13. package/docs/design/agent-behavior-patterns-discovery.md +312 -0
  14. package/docs/design/agent-engine-communication-discovery.md +390 -0
  15. package/docs/design/agent-loop-architecture-alternatives-discovery.md +531 -0
  16. package/docs/design/agent-loop-error-handling-contract.md +238 -0
  17. package/docs/design/complete-step-approach-validation-discovery.md +344 -0
  18. package/docs/design/daemon-stuck-detection-discovery.md +174 -0
  19. package/docs/design/mcp-server-disconnect-discovery.md +245 -0
  20. package/docs/design/mcp-server-epipe-crash.md +198 -0
  21. package/docs/design/notification-design-candidates.md +131 -0
  22. package/docs/design/notification-design-review.md +84 -0
  23. package/docs/design/notification-implementation-plan.md +181 -0
  24. package/docs/design/spawn-agent-failure-modes.md +161 -0
  25. package/docs/design/spawn-agent-result-handling-implementation-plan.md +186 -0
  26. package/docs/design/stdio-simplification-design-candidates.md +341 -0
  27. package/docs/design/stdio-simplification-design-review.md +93 -0
  28. package/docs/design/stdio-simplification-implementation-plan.md +317 -0
  29. package/docs/design/structured-output-tools-coexist-findings.md +288 -0
  30. package/docs/discovery/coordinator-script-design.md +745 -0
  31. package/docs/discovery/coordinator-ux-discovery.md +471 -0
  32. package/docs/discovery/spawn-agent-failure-modes.md +309 -0
  33. package/docs/discovery/workflow-selection-for-discovery-tasks.md +336 -0
  34. package/docs/discovery/worktrain-status-briefing.md +325 -0
  35. package/docs/discovery/worktrain-status-design-candidates.md +202 -0
  36. package/docs/discovery/worktrain-status-design-review-findings.md +86 -0
  37. package/docs/ideas/backlog.md +608 -0
  38. package/docs/ideas/daemon-structured-output-vs-tool-calls.md +344 -0
  39. package/docs/ideas/design-candidates-backlog-consolidation.md +85 -0
  40. package/docs/ideas/design-review-findings-backlog-consolidation.md +39 -0
  41. package/docs/ideas/implementation_plan_backlog_consolidation.md +117 -0
  42. package/docs/plans/authoring-doc-staleness-enforcement-candidates.md +251 -0
  43. package/docs/plans/authoring-doc-staleness-enforcement-review.md +99 -0
  44. package/docs/plans/authoring-doc-staleness-enforcement.md +463 -0
  45. package/package.json +1 -1
@@ -1779,6 +1779,322 @@ No PR merges without passing all required gates for its classification. The coor
1779
1779
  Right now, "has this been reviewed and audited?" is a question that requires reading through PRs and session notes. With proof records, it's a query: `SELECT * FROM proof_records WHERE module='src/trigger/' AND kind='production_audit' AND outcome='pass' AND timestamp > NOW() - INTERVAL '30 days'`. The knowledge graph stores these records. The watchdog checks them on a schedule. The coordinator gates on them before merging. Verification becomes infrastructure, not process.
1780
1780
  ---
1781
1781
 
1782
+ ### Scripts-first coordinator: avoid the main agent wherever possible (Apr 15, 2026)
1783
+
1784
+ **The insight:** In the coordinator workflow we built manually today, the main agent spent most of its time on mechanical work -- reading PR lists, checking CI status, deciding whether findings are blocking, sequencing merges. That's all deterministic logic. An LLM is expensive, slow, and inconsistent for deterministic work.
1785
+
1786
+ **The principle extended to coordinators:** the scripts-over-agent rule applies at the coordinator level too. The coordinator's job is to drive a DAG of child sessions. The DAG structure, routing decisions, and termination conditions should be scripts, not LLM reasoning.
1787
+
1788
+ **What this means concretely:**
1789
+
1790
+ Instead of a coordinator *agent* that reads MR review findings and decides what to do, use a **coordinator script** that:
1791
+ 1. Calls `gh pr list` → list of PRs (script)
1792
+ 2. For each PR, calls `spawn_session(mr-review-workflow-agentic)` → session handles (script)
1793
+ 3. Calls `await_sessions(handles)` → structured findings (script waits)
1794
+ 4. Parses the findings JSON block from each session's output (script)
1795
+ 5. Routes: clean → merge queue, minor → spawn fix agent, blocking → escalate (script decision tree)
1796
+ 6. Calls `spawn_session(coding-task-workflow-agentic, fix: <finding>)` for each fix needed (script)
1797
+ 7. Awaits fix agents, re-queues for re-review (script loop)
1798
+ 8. Executes merge sequence when queue is empty (script)
1799
+
1800
+ The agent is only invoked for the *leaf work* -- the actual MR review, the actual coding fix. All coordination, routing, sequencing, and decision-making is a script.
1801
+
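Step 5's decision tree is what makes "routing is a script" concrete: it can be a pure function, testable without spawning a single session. A sketch, assuming the severity names from the mr-review pipeline (Critical/Major/Minor/Nit) and a finding shape invented for illustration:

```typescript
// Hypothetical finding shape -- the real structured-findings JSON may differ.
type Finding = { severity: "Critical" | "Major" | "Minor" | "Nit" };
type Route = "merge-queue" | "spawn-fix-agent" | "escalate";

// Deterministic routing: the same findings always produce the same route.
function routePr(findings: Finding[]): Route {
  if (findings.some(f => f.severity === "Critical" || f.severity === "Major")) {
    return "escalate"; // blocking -> human
  }
  if (findings.length > 0) return "spawn-fix-agent"; // Minor/Nit only
  return "merge-queue"; // clean
}
```

Because the function is pure, mocking spawn/await and unit-testing the routing in isolation is trivial.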
1802
+ **What the coordinator workflow looks like under this model:**
1803
+
1804
+ Not a workflow that a single LLM session runs end-to-end. Instead, a **script-driven workflow** where each step is a shell/TypeScript script that calls WorkTrain's API to spawn/await child sessions and route based on their structured outputs. WorkTrain provides:
1805
+ - `worktrain spawn --workflow <id> --goal <text>` → prints sessionHandle
1806
+ - `worktrain await --sessions <handle1,handle2>` → prints structured results JSON
1807
+ - `worktrain merge --pr <number>` → runs the merge sequence
1808
+
1809
+ The coordinator "workflow" is then a shell script or TypeScript file that composes these commands. Fully deterministic, fully auditable, no tokens burned on routing decisions.
1810
+
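In the TypeScript variant, composing those three commands is a few lines of `child_process` glue. The flags match the CLI listed above; the quoting helper and the JSON-parsing of `worktrain await` output are assumptions about the exact output format:

```typescript
import { execSync } from "node:child_process";

// Build the CLI invocations listed above. JSON.stringify doubles as shell
// quoting for the goal text (sufficient for simple goals -- an assumption).
const spawnCmd = (workflow: string, goal: string): string =>
  `worktrain spawn --workflow ${workflow} --goal ${JSON.stringify(goal)}`;

const awaitCmd = (handles: string[]): string =>
  `worktrain await --sessions ${handles.join(",")}`;

// Thin wrappers: spawn prints a sessionHandle, await prints results JSON.
const run = (cmd: string): string => execSync(cmd, { encoding: "utf8" }).trim();
const spawn = (workflow: string, goal: string): string => run(spawnCmd(workflow, goal));
const awaitSessions = (handles: string[]): unknown => JSON.parse(run(awaitCmd(handles)));
```

Everything downstream of these wrappers is ordinary code: conditionals, loops, and `set -x`-style logging, with no model in the loop.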
1811
+ **Why this is better than a coordinator agent:**
1812
+ - Zero LLM cost for coordination -- only leaf sessions burn tokens
1813
+ - Fully deterministic routing -- the same PR list always produces the same execution plan
1814
+ - Trivially auditable -- `set -x` on the shell script shows every decision
1815
+ - Trivially testable -- mock `worktrain spawn` and `worktrain await`, test the routing logic in isolation
1816
+ - Reusable across teams -- share the script, not the prompt
1817
+
1818
+ **Build order for this model:**
1819
+ 1. `worktrain spawn` / `worktrain await` CLI commands that wrap the session engine
1820
+ 2. Structured output format for leaf sessions (the handoff artifact JSON block already exists)
1821
+ 3. A reference `coordinator-groom-prs.sh` (or `.ts`) as the first coordinator template
1822
+ 4. Console DAG view updated to show coordinator-script-spawned sessions with parent-child relationships
1823
+
1824
+ **The long-term vision:** WorkTrain workflows handle the hard cognitive work. WorkTrain scripts handle orchestration, routing, and sequencing. Together they make the system fully autonomous with full observability and zero wasted tokens.
1825
+
1826
+ ---
1827
+
1828
+ ### Full development pipeline: coordinator scripts drive multi-phase autonomous work (Apr 15, 2026)
1829
+
1830
+ The coordinator isn't just for review → fix → merge. The full pipeline we run manually covers every phase of software development, with different phases triggered based on task classification.
1831
+
1832
+ **Full pipeline DAG:**
1833
+
1834
+ ```
1835
+ trigger: "implement feature X"
1836
+
1837
+ ├── [always] classify-task
1838
+ │ outputs: taskComplexity, riskLevel, hasUI, touchesArchitecture
1839
+
1840
+ ├── [if taskComplexity != Small] discovery
1841
+ │ workflow: routine-context-gathering (COMPLETENESS + DEPTH in parallel)
1842
+ │ outputs: context bundle, candidate files, invariants
1843
+
1844
+ ├── [if hasUI] ux-design
1845
+ │ workflow: ux-design-workflow (mockups, component spec, interaction model)
1846
+ │ outputs: design-spec.md, component-list
1847
+
1848
+ ├── [if touchesArchitecture] architecture-design
1849
+ │ workflow: coding-task-workflow-agentic (design phases only)
1850
+ │ outputs: design-candidates.md, selected approach
1851
+ │ └── arch-review (parallel, 2 auditors)
1852
+ │ workflow: routine-hypothesis-challenge + routine-philosophy-alignment
1853
+ │ outputs: findings → revise design if RED/ORANGE
1854
+
1855
+ ├── [always] coding-task
1856
+ │ workflow: coding-task-workflow-agentic
1857
+ │ inputs: context bundle + design spec + arch decision
1858
+ │ outputs: implementation + handoff artifact (commitType, prTitle, filesChanged)
1859
+
1860
+ ├── [always] mr-review
1861
+ │ workflow: mr-review-workflow-agentic
1862
+ │ outputs: findings with severity
1863
+ │ ├── [if clean] → auto-commit → auto-pr → merge
1864
+ │ ├── [if Minor/Nit] → spawn fix agent → re-review (max 3 passes)
1865
+ │ └── [if Critical/Major] → escalate to human (Slack/GitLab comment)
1866
+
1867
+ ├── [if riskLevel == High] prod-risk-audit
1868
+ │ workflow: production-risk-audit-workflow
1869
+ │ outputs: go / no-go + risk register
1870
+ │ └── [if no-go] → escalate, block merge
1871
+
1872
+ └── [if merged] notify
1873
+ script: post summary to Slack/GitLab with session DAG link
1874
+ ```
1875
+
1876
+ **The key insight:** the coordinator script reads the `taskComplexity`, `riskLevel`, `hasUI`, and `touchesArchitecture` flags from the classify step's output and uses them to decide which phases to spawn. A one-line bug fix runs: classify → coding-task → mr-review. A new UI feature runs everything. The same coordinator script handles both -- the DAG is dynamic, driven by structured outputs.
1877
+
1878
+ **Workflow library needed (not all exist yet):**
1879
+
1880
+ | Workflow | Status |
1881
+ |----------|--------|
1882
+ | `coding-task-workflow-agentic` | ✅ `coding-task-workflow-agentic.lean.v2.json` |
1883
+ | `mr-review-workflow-agentic` | ✅ `mr-review-workflow.agentic.v2.json` |
1884
+ | `routine-context-gathering` | ✅ `routines/` |
1885
+ | `routine-hypothesis-challenge` | ✅ `routines/` |
1886
+ | `routine-philosophy-alignment` | ✅ `routines/` |
1887
+ | `ux-design-workflow` | ✅ `ui-ux-design-workflow.json` |
1888
+ | `production-risk-audit-workflow` | ✅ `production-readiness-audit.json` |
1889
+ | `architecture-review-workflow` | ✅ `architecture-scalability-audit.json` |
1890
+ | `bug-investigation-workflow` | ✅ `bug-investigation.agentic.v2.json` |
1891
+ | `discovery-workflow` | ✅ `wr.discovery.json` |
1892
+ | `classify-task-workflow` | ❌ needs authoring -- fast, 1-step, outputs taskComplexity/riskLevel/hasUI/touchesArchitecture |
1893
+
1894
+ **The classify step is the gate.** A cheap, fast workflow that takes a task description and returns structured vars. This is where the coordinator decides what to run. It's the single most important missing workflow -- without it, the coordinator has to spawn everything for every task, which is wasteful.
1895
+
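The structured vars it returns might look like the sketch below. The field names come from the DAG above; the value enums are assumptions, not a shipped schema:

```typescript
// Hypothetical classify-task-workflow output; field names from the DAG above.
interface TaskClassification {
  taskComplexity: "Small" | "Medium" | "Large"; // "Small" skips discovery
  riskLevel: "Low" | "Medium" | "High";         // "High" gates prod-risk-audit
  hasUI: boolean;                               // gates ux-design
  touchesArchitecture: boolean;                 // gates architecture-design
}

const example: TaskClassification = {
  taskComplexity: "Small",
  riskLevel: "Low",
  hasUI: false,
  touchesArchitecture: false,
};

// Which optional phases would this classification trigger?
const phases = [
  example.taskComplexity !== "Small" ? "discovery" : null,
  example.hasUI ? "ux-design" : null,
  example.touchesArchitecture ? "architecture-design" : null,
  example.riskLevel === "High" ? "prod-risk-audit" : null,
].filter((p): p is string => p !== null);
// A classification like this yields no optional phases: the minimal path
// classify -> coding-task -> mr-review.
```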
1896
+ **The coordinator script for this pipeline:**
1897
+ ```typescript
1898
+ // coordinator-implement-feature.ts
1899
+ const { taskComplexity, riskLevel, hasUI, touchesArchitecture } =
1900
+ await runWorkflow('classify-task-workflow', { goal: taskDescription });
1901
+
1902
+ const contextHandle = taskComplexity !== 'Small'
1903
+ ? spawnSession('routine-context-gathering', { goal: taskDescription })
1904
+ : null;
1905
+
1906
+ const uxHandle = hasUI
1907
+ ? spawnSession('ux-design-workflow', { goal: taskDescription })
1908
+ : null;
1909
+
1910
+ // assumes awaitSessions resolves a null handle to null, keeping the destructure aligned
+ const [context, uxSpec] = await awaitSessions([contextHandle, uxHandle]);
1911
+
1912
+ // ... arch design if needed, then coding, then review, then audit
1913
+ ```
1914
+
1915
+ Zero coordinator LLM calls. Every decision is a script condition on structured output.
1916
+
1917
+ **Audit workflows the coordinator can chain:**
1918
+ Beyond MR review, the same pattern applies to any quality gate:
1919
+ - **Production risk audit** -- scans for: exposed secrets, missing rate limits, no-rollback schema changes, unguarded env vars
1920
+ - **Architecture audit** -- scans for: coupling violations, missing abstractions, incorrect layer dependencies
1921
+ - **Test coverage audit** -- identifies untested paths on changed files
1922
+ - **Performance audit** -- scans for N+1 queries, missing indexes, unbounded loops on hot paths
1923
+ - **Security audit** -- OWASP top 10 scan on changed surfaces
1924
+
1925
+ Each is a workflow. The coordinator decides which to run based on `riskLevel`, what files changed, and what domain the task touches. All feed findings back to the coordinator script which routes: fix, skip, or escalate.
1926
+
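That selection is, again, a script condition on structured output. A sketch of the audit picker, with file-pattern heuristics invented for illustration (only the workflow names come from the table above):

```typescript
// Map risk level + changed files to the audit workflows listed above.
// The path heuristics are illustrative assumptions, not shipped rules.
function selectAudits(riskLevel: string, changedFiles: string[]): string[] {
  const audits: string[] = [];
  if (riskLevel === "High") audits.push("production-risk-audit-workflow");
  if (changedFiles.some(f => f.includes("migrations/") || f.endsWith(".sql"))) {
    audits.push("production-risk-audit-workflow"); // no-rollback schema changes
  }
  if (changedFiles.some(f => f.startsWith("src/"))) {
    audits.push("architecture-review-workflow");
  }
  return [...new Set(audits)]; // dedupe when multiple rules fire
}
```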
1927
+ ---
1928
+
1929
+ ### Additional coordinator pipeline templates (Apr 15, 2026)
1930
+
1931
+ Beyond the feature implementation pipeline, three more coordinator templates are high value:
1932
+
1933
+ ---
1934
+
1935
+ #### Backlog grooming coordinator
1936
+
1937
+ ```
1938
+ trigger: "groom backlog" (cron: weekly, or manual dispatch)
1939
+
1940
+ ├── [for each open issue] classify-issue
1941
+ │ outputs: issueType (bug/feature/tech-debt/question), priority, complexity, stale?
1942
+
1943
+ ├── [for stale issues > 90 days with no activity] auto-close-or-ping
1944
+ │ script: post "Still relevant?" comment, label as stale
1945
+
1946
+ ├── [for unclassified issues] label-and-size
1947
+ │ script: apply labels (bug/enhancement/question), size estimate (XS/S/M/L)
1948
+
1949
+ ├── [for duplicate issues] detect-duplicates
1950
+ │ workflow: semantic search over existing issues, flag likely dupes
1951
+ │ script: comment "possible duplicate of #X", label as needs-triage
1952
+
1953
+ ├── [for high-priority bugs with no assignee] suggest-fix-approach
1954
+ │ workflow: bug-investigation-agentic (surface root cause + candidate fix)
1955
+ │ outputs: investigation summary posted as issue comment
1956
+
1957
+ └── produce grooming summary
1958
+ script: post weekly digest to Slack -- issues triaged, dupes found, investigations run
1959
+ ```
1960
+
1961
+ No human needed for any of this. The coordinator classifies, labels, pings stale items, and runs investigations on the important ones. The human reviews the digest and acts on what needs judgment.
1962
+
1963
+ ---
1964
+
1965
+ #### Bug investigation + fix coordinator
1966
+
1967
+ ```
1968
+ trigger: new issue labeled "bug" OR incident alert from monitoring
1969
+
1970
+ ├── bug-investigation-agentic
1971
+ │ outputs: root cause hypothesis, affected files, severity, reproduction steps
1972
+
1973
+ ├── [if severity == Critical] page-oncall
1974
+ │ script: post to Slack #incidents with investigation summary + session link
1975
+
1976
+ ├── [if severity <= High and hypothesis_confidence >= 0.8] attempt-fix
1977
+ │ workflow: coding-task-workflow-agentic (targeted fix)
1978
+ │ inputs: investigation findings, affected files, reproduction steps
1979
+ │ outputs: implementation + handoff artifact
1980
+ │ │
1981
+ │ ├── mr-review
1982
+ │ │ └── [if clean] auto-commit → auto-pr
1983
+ │ │
1984
+ │ └── regression-test
1985
+ │ script: run test suite against affected paths
1986
+ │ outputs: pass/fail
1987
+
1988
+ ├── [if severity == Critical OR hypothesis_confidence < 0.8] escalate
1989
+ │ script: post investigation summary to issue + tag team lead
1990
+
1991
+ └── close-or-update-issue
1992
+ script: if fix merged → close with "Fixed in PR #X". if escalated → update with findings.
1993
+ ```
1994
+
1995
+ The daemon can go from "bug filed" to "fix merged" with zero human involvement for well-understood bugs with high-confidence hypotheses. Critical bugs and uncertain root causes always escalate to a human -- the investigation is done for them, not by them.
1996
+
1997
+ **What makes this work:**
1998
+ - `bug-investigation-agentic` already exists and produces structured findings
1999
+ - The `hypothesis_confidence` output from the investigation gates the auto-fix attempt
2000
+ - The coordinator script decides: high confidence + not critical = try to fix autonomously
2001
+ - The circuit breaker (max 3 fix attempts) prevents infinite loops on hard bugs
2002
+ - The human always gets the investigation findings, whether the fix succeeded or not
2003
+
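The confidence gate plus circuit breaker reduces to a small loop. The 0.8 threshold and the 3-attempt cap come from the pipeline above; `spawnFix` and `review` are stand-ins for the real spawn/await calls:

```typescript
type Verdict = "clean" | "findings";

// Gate the autonomous fix attempt, then loop fix -> re-review with a hard
// cap so hard bugs can't spin forever.
async function autoFix(
  severity: string,
  confidence: number,
  spawnFix: () => Promise<void>,  // stand-in for spawn(coding-task-workflow-agentic)
  review: () => Promise<Verdict>, // stand-in for spawn(mr-review-workflow-agentic)
): Promise<"merged" | "escalated"> {
  if (severity === "Critical" || confidence < 0.8) return "escalated";
  for (let attempt = 1; attempt <= 3; attempt++) { // circuit breaker: max 3 attempts
    await spawnFix();
    if ((await review()) === "clean") return "merged";
  }
  return "escalated"; // still failing after 3 passes -- hand to a human
}
```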
2004
+ ---
2005
+
2006
+ #### Incident monitoring coordinator
2007
+
2008
+ ```
2009
+ trigger: monitoring alert (CPU spike, error rate increase, latency P99 > threshold)
2010
+
2011
+ ├── triage-alert
2012
+ │ workflow: classify if real incident vs noise (check recent deploys, known issues)
2013
+ │ outputs: isRealIncident, likelyCause, affectedServices
2014
+
2015
+ ├── [if isRealIncident] investigate
2016
+ │ workflow: bug-investigation-agentic (logs, traces, recent changes)
2017
+ │ outputs: root cause, blast radius, mitigation options
2018
+
2019
+ ├── [if mitigation is config change or rollback] auto-mitigate
2020
+ │ script: execute safe mitigations (feature flag flip, config change)
2021
+ │ -- NEVER auto-rollback code without human approval
2022
+
2023
+ ├── page-oncall
2024
+ │ script: post to Slack #incidents with full context + session DAG link
2025
+ │ content: what fired, what was found, what was auto-mitigated, what needs human action
2026
+
2027
+ └── follow-up
2028
+ cron: 30 min later → check if resolved, post update
2029
+ ```
2030
+
2031
+ The operator gets paged with a complete picture: what happened, likely why, what was already done automatically, and exactly what decision they need to make. No more waking up to an alert with no context.
2032
+
2033
+ ---
2034
+
2035
+ ### Interactive ideation: WorkTrain as a thinking partner with full project context (Apr 15, 2026)
2036
+
2037
+ **What this is:** The ability to have a conversation with WorkTrain the way we've been talking today -- bouncing ideas, asking "what if", surfacing tradeoffs, refining designs -- and have WorkTrain respond with full awareness of what's been built, what's in flight, what's in the backlog, and what decisions were made and why.
2038
+
2039
+ Today this requires a human (Claude Code + a long conversation) to maintain context across everything. WorkTrain should be able to do this natively because it already has:
2040
+ - The session store (every step note from every session ever run)
2041
+ - The knowledge graph (structural understanding of the codebase)
2042
+ - The backlog (design decisions, research findings, priorities)
2043
+ - In-flight agent state (what's running, what's been found)
2044
+
2045
+ **The gap:** there's no conversational interface that pulls all of this together. The console shows sessions. The backlog is a markdown file. There's no "talk to WorkTrain about the project" entry point.
2046
+
2047
+ **What it needs:**
2048
+
2049
+ 1. **A "talk" command** -- `worktrain talk` opens an interactive session that starts with a synthesized context bundle: recent session outcomes, open PRs, backlog top items, any findings from in-flight agents. The user types naturally; WorkTrain responds with awareness of all of it.
2050
+
2051
+ 2. **Project memory** -- WorkTrain maintains a synthesized "project state" that's updated after each coordinator run or major session batch. Answers questions like: "what did we build today?", "why did we choose polling triggers over webhooks?", "what's the biggest gap right now?", "what would happen if we removed pi-mono?" without requiring the user to re-explain context.
2052
+
2053
+ 3. **Idea capture** -- when the conversation surfaces something new (a gap, an architectural insight, a design decision), WorkTrain should offer to record it to the backlog or open a GitHub issue immediately, right from the conversation.
2054
+
2055
+ 4. **Context awareness** -- WorkTrain knows which agents are running, what they've found so far, and can report on it during a conversation: "the #400 review just came back with a fetch timeout blocker -- want me to queue a fix agent?"
2056
+
2057
+ **What makes this different from just using Claude Code:** Claude Code has no persistent project context -- every conversation starts from scratch. WorkTrain's ideation session starts with everything loaded: session history, knowledge graph results for relevant files, backlog items, open PRs. The conversation is grounded in the actual project state, not just what the user remembers to paste in.
2058
+
2059
+ **Architecture:** this is a new `talk` workflow -- a conversational loop workflow with no fixed step count. The agent has access to `query_knowledge_graph`, `read_session_notes`, `read_backlog`, `list_in_flight_agents`, and `append_to_backlog` as tools. It maintains the conversation as a standard message history. The session never "completes" -- it ends when the user exits.
2060
+
2061
+ ---
2062
+
2063
+ ### Automatic gap and improvement detection: proactive WorkTrain (Apr 15, 2026)
2064
+
2065
+ **What this is:** WorkTrain notices things without being asked. After a batch of work lands, it scans for gaps, inconsistencies, missed connections, and improvement opportunities -- and surfaces them proactively.
2066
+
2067
+ **Examples of what it would have caught today without human prompting:**
2068
+ - "PR #400 delivery client has no fetch timeout -- delivery could hang indefinitely" (caught by MR review, but WorkTrain could catch this pre-review)
2069
+ - "PR #391 picked up GAP-1 crash recovery code it shouldn't have -- scope leak" (caught by the reviewer)
2070
+ - "The backlog says knowledge graph should be persistent but the spike uses in-memory DuckDB" (gap between spec and impl)
2071
+ - "Three open PRs all modify workflow-runner.ts -- they're going to conflict when merged sequentially"
2072
+ - "Issue #393 filed for loadSessionNotes coverage -- this is related to the GAP-2 PR that's open, might as well fix both together"
2073
+ - "The classify-task-workflow was just authored but it's not referenced in the coordinator spec yet"
2074
+
2075
+ **Two modes:**
2076
+
2077
+ **1. Event-triggered scans** -- fires after significant events:
2078
+ - After a batch of PRs merge: scan for spec/impl gaps, check if any backlog items are now addressable
2079
+ - After a new workflow is authored: check if it should be added to the coordinator pipeline
2080
+ - After a bug is filed: check if any recent changes are likely culprits
2081
+ - After a coordinator run: check if findings surfaced any architectural concerns not in the backlog
2082
+
2083
+ **2. Periodic health checks** -- runs on a schedule (e.g. weekly):
2084
+ - Are there backlog items that have all their prerequisites met but haven't been started?
2085
+ - Are there open issues that are actually already fixed by merged PRs?
2086
+ - Are there PRs that have been approved but not merged for more than N days?
2087
+ - Is the knowledge graph stale (files changed since last index)?
2088
+ - Are any daemon sessions orphaned (in daemon-sessions/ but older than 24h)?
2089
+
2090
+ **Architecture:** a `watchdog` workflow that runs on a cron trigger. It queries the knowledge graph, reads recent session notes, lists open PRs and issues, reads the backlog priorities, and produces a `gap-report.md` with actionable findings. Each finding is either: auto-actionable (spawn a fix agent), conversation-worthy (add to the ideation queue), or escalation-worthy (post to Slack/file a GitHub issue).
2091
+
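The three-way dispatch on findings is another script-side table lookup. The entry shape is a guess at what `gap-report.md` items would carry; the disposition values mirror the three outcomes described above:

```typescript
// Hypothetical gap-report entry shape.
interface GapFinding {
  summary: string;
  disposition: "auto-actionable" | "conversation-worthy" | "escalation-worthy";
}

// Disposition -> action, mirroring the three outcomes described above.
const routes = {
  "auto-actionable": "spawn-fix-agent",
  "conversation-worthy": "ideation-queue",
  "escalation-worthy": "notify-human",
} as const;

function dispatch(f: GapFinding): string {
  return `${routes[f.disposition]}: ${f.summary}`;
}
```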
2092
+ **The key difference from the coordinator:** the coordinator executes a known plan. The watchdog discovers things that aren't in any plan yet. It's the system's immune response -- continuously scanning for drift between intention and reality.
2093
+
2094
+ **What makes this tractable:** WorkTrain already has all the inputs. The knowledge graph has the structural state. The session store has the history. The backlog has the intentions. The gap detection is the synthesis layer that connects them -- "what was planned" vs "what was built" vs "what's in flight". This is exactly the kind of thing an LLM is good at: cross-referencing multiple sources and identifying inconsistencies.
2095
+
2096
+ ---
2097
+
1782
2098
  ### Dynamic model selection: right model for the right task (Apr 15, 2026)
1783
2099
 
1784
2100
  **The principle:** not every task needs Sonnet 4.6. Not every task should be locked to Anthropic. The coordinator and the task classifier should be able to select the model dynamically based on what the task actually needs.
@@ -5225,3 +5541,295 @@ With `complete_step` + `spawn_agent`:
5225
5541
  3. **Notifications** -- macOS notification + generic webhook. ~30 min implementation.
5226
5542
  4. **Late-bound goals** -- default `goalTemplate: "{{$.goal}}"` when no static goal. 10-line fix in trigger-store.ts.
5227
5543
  5. **Artifacts store foundation** -- `~/.workrail/artifacts/` directory structure. Step 1 of the first-class artifacts vision.
5544
+
5545
+ ---
5546
+
5547
+ ## What WorkTrain is currently capable of (as of v3.36.0, Apr 18, 2026)
5548
+
5549
+ Tested empirically today. This is what actually works, not what's specced.
5550
+
5551
+ ---
5552
+
5553
+ ### Autonomous workflow execution
5554
+
5555
+ **Confirmed working:**
5556
+ - Accepts webhook triggers and dispatches workflow sessions autonomously
5557
+ - `mr-review-workflow-agentic` v2.6 runs end-to-end: context gathering, parallel reviewer phases, synthesis loop, validation, structured handoff. **Confirmed today** (sess_3bmj..., APPROVE verdict).
5558
+ - `coding-task-workflow-agentic` (lean v2) runs end-to-end for Small tasks. **Confirmed today** (evidenceFrom field implementation, completed successfully).
5559
+ - `wr.discovery` v3.2.0 runs with goal reframing. **Confirmed today** (spawn_agent architecture discovery).
5560
+ - Sessions advance through 8+ workflow steps autonomously (36 step advances today across 6 sessions).
5561
+ - 402 LLM turns + 660 tool calls executed autonomously today.
5562
+
5563
+ **Known reliability issues:**
5564
+ - `wr.discovery` hit timeout once today -- multi-step discovery workflows can run long and hit the 60-min limit
5565
+ - One coding task failed (error) -- assessment gate or tool issue, still being investigated
5566
+ - One MR review timed out -- complex PRs need more time than the configured limit
5567
+
5568
+ ---
5569
+
5570
+ ### Trigger system
5571
+
5572
+ **Confirmed working:**
5573
+ - Generic webhook trigger (fire-and-forget via `POST /webhook/<id>`)
5574
+ - GitHub Issues polling (no webhook registration needed)
5575
+ - GitLab MR polling (no webhook registration needed)
5576
+ - Multiple triggers in one triggers.yml
5577
+ - WorkflowId validation at startup (wrong IDs caught before traffic arrives)
5578
+ - `goalTemplate` interpolation from webhook payload
5579
+
5580
+ **Not yet working:**
5581
+ - Native cron trigger (requires OS crontab workaround)
5582
+ - Late-bound goals (static goal required in triggers.yml, dynamic goal via payload requires `goalTemplate`)
5583
+
5584
+ ---
5585
+
5586
+ ### Agent capabilities inside sessions
5587
+
5588
+ **Confirmed working:**
5589
+ - Bash (read files, run commands, git, gh CLI)
5590
+ - Read (read files)
5591
+ - Write (write files -- used by coding tasks)
5592
+ - `complete_step` (daemon-managed token, LLM never handles continueToken)
5593
+ - `continue_workflow` (deprecated but functional for backward compat)
5594
+ - `report_issue` (agents call this when stuck, logged to `~/.workrail/issues/`)
5595
+ - `spawn_agent` (spawns child WorkRail sessions in-process, v3.35.1+)
5596
+ - Assessment artifact submission (`artifacts` field in complete_step)
5597
+
5598
+ **Not yet working in production:**
5599
+ - `spawn_agent` just shipped (v3.35.1) -- untested in real workflows yet
5600
+ - `complete_step` just shipped (v3.34.1) -- daemon now using it but not yet validated end-to-end through full assessment-gate workflow
5601
+
5602
+ ---
5603
+
5604
+ ### Observability
5605
+
5606
+ **Confirmed working:**
5607
+ - Daemon event log (`~/.workrail/events/daemon/YYYY-MM-DD.jsonl`) -- every LLM turn, tool call, session lifecycle event
5608
+ - `worktrain logs --follow` -- real-time event stream
5609
+ - `worktrain status <sessionId>` -- session health summary with stuck detection
5610
+ - Console (`http://localhost:3456/console`) -- live sessions, step notes, repoRoot grouping, `isLive` from event log
5611
+ - Stuck detection -- `agent_stuck` events emitted for repeated tool calls, no-progress, timeout imminent
5612
+ - `issue_reported` events when agents hit walls
5613
+
5614
+ **Known gaps:**
5615
+ - Console shows flat session list, not work-unit tree (parentSessionId data exists, visualization not built)
5616
+ - `isLive` only covers today's event log (cross-midnight limitation)
5617
+ - No push notifications when daemon completes work
5618
+
5619
+ ---
5620
+
5621
+ ### Infrastructure
5622
+
5623
+ **Confirmed working:**
5624
+ - MCP server stable (v3.36.0, bridge removed, EPIPE fixed)
5625
+ - `worktrain daemon --install` creates launchd service (daemon survives MCP reconnects)
5626
+ - `worktrain console` standalone (independent of daemon and MCP server)
5627
+ - `worktrain init` guided onboarding
5628
+ - `worktrain tell` / `worktrain inbox` message queue
5629
+ - `worktrain spawn` / `worktrain await` CLI (primitives exist, no coordinator templates yet)
5630
+ - Crash recovery (orphaned sessions detected and cleared on startup)
5631
+ - Workspace context injection (CLAUDE.md, AGENTS.md, daemon-soul.md)
5632
+ - maxConcurrentSessions semaphore (default 3)
5633
+ - Per-trigger timeout + max-turn limits
5634
+
5635
+ ---
5636
+
5637
+ ### What WorkTrain cannot do yet (key gaps for autonomous production use)
5638
+
5639
+ 1. **Multi-phase work is invisible** -- sessions are flat in console. A 5-session MR review pipeline looks like 5 unrelated sessions.
5640
+ 2. **No coordinator scripts** -- spawn_agent and spawn/await exist but there's no coordinator template to run a full pipeline.
5641
+ 3. **No auto-commit** -- agents write code but don't commit or open PRs autonomously (merge workflow exists in spec, not in production use).
5642
+ 4. **No notifications** -- daemon completes work silently.
5643
+ 5. **Assessment gates unreliable** -- complete_step fixes the token issue but full assessment-gate workflows not yet validated end-to-end.
5644
+ 6. **Subagent delegation invisible** -- spawn_agent creates proper child sessions, but workflows still use mcp__nested-subagent__Task for most delegation (invisible black box).
5645
+ 7. **No artifact store** -- agents dump markdown in the repo as a workaround.
5646
+ 8. **Context poverty** -- each session starts from scratch, no persistent knowledge graph.
5647

---

### WorkTrain benchmarking: prove it's better, publish the results (Apr 18, 2026)

**The opportunity:** if WorkTrain can demonstrably outperform one-shot LLM calls and human-in-the-loop review for specific task types, with reproducible benchmarks published on GitHub and visible in the console, that's the killer adoption argument. Not "trust us, it's better" -- actual numbers.

**What to benchmark:**

| Dimension | WorkTrain | One-shot | Human-in-loop |
|-----------|-----------|----------|---------------|
| MR review finding rate (Critical/Major caught) | ? | ? | ? |
| False positive rate (findings that were wrong) | ? | ? | ? |
| Coding task correctness (builds + tests pass) | ? | ? | ? |
| Coding task completeness (wiring, exports, tests) | ? | ? | ? |
| Bug investigation accuracy (correct root cause) | ? | ? | ? |
| Time to complete | ? | ? | ? |
| Token cost per task | ? | ? | ? |

**Model comparison within WorkTrain:**
- Haiku (fast, cheap) vs Sonnet (balanced) vs Opus (best) for each task type
- Other providers: GPT-4o, Gemini 1.5 Pro, Llama 3 (via Ollama) -- can WorkTrain run on any model?
- Does the workflow structure make Haiku competitive with one-shot Sonnet? (hypothesis: yes, for structured tasks)

**The benchmark suite:**

1. **MR review benchmark** -- 50 PRs with known ground truth (bugs that were later filed, correct implementations that had no bugs). Score: recall (caught real issues) + precision (didn't flag non-issues).
2. **Coding task benchmark** -- 50 tasks with objective completion criteria (build passes, tests pass, correct wiring). Score: % completing correctly on the first autonomous run.
3. **Bug investigation benchmark** -- 30 real bugs with known root causes. Score: % identifying the correct root cause.
4. **Discovery quality benchmark** -- 20 design questions with expert-evaluated answers. Score: coverage of key tradeoffs, identification of non-obvious alternatives.

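Scoring for the MR review benchmark reduces to recall and precision over labeled findings. A minimal sketch of what such a scorer could look like -- the types and function are hypothetical illustrations, not part of the package:

```typescript
// Hypothetical scorer for the MR review benchmark: compare an agent's
// reported findings against an expert-labeled ground-truth set per PR.
interface BenchmarkCase {
  groundTruth: string[]; // IDs of real issues known to exist in the PR
  reported: string[];    // IDs of issues the agent flagged
}

function scoreMrReview(cases: BenchmarkCase[]): { recall: number; precision: number } {
  let realIssues = 0;     // total real issues across all PRs
  let caught = 0;         // distinct real issues the agent flagged
  let flagged = 0;        // total findings the agent reported
  let truePositives = 0;  // reported findings that matched a real issue
  for (const c of cases) {
    const truth = new Set(c.groundTruth);
    realIssues += truth.size;
    flagged += c.reported.length;
    const hits = c.reported.filter((id) => truth.has(id));
    truePositives += hits.length;
    caught += new Set(hits).size;
  }
  return {
    recall: realIssues === 0 ? 1 : caught / realIssues,
    precision: flagged === 0 ? 1 : truePositives / flagged,
  };
}
```

Note that a clean PR (empty `groundTruth`) only affects precision: anything flagged there counts as a false positive, which is exactly why the suite needs cleanly-shipped PRs alongside buggy ones.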
**How to publish:**

- `docs/benchmarks/` directory in the repo -- YAML results files, one per benchmark run
- GitHub Actions CI job that runs the benchmark suite on each release and commits results
- Console "Benchmarks" tab showing historical performance by model and workflow version
- Public benchmark page (once cloud hosting exists) showing WorkTrain vs alternatives
- Badge in the README: "MR review recall: 87% (Sonnet 4.6, v3.36.0)"

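Keeping the YAML results machine-checkable argues for pinning a schema for one run record early. A sketch of what a single entry might contain -- all field names here are illustrative; no such schema exists in the repo yet:

```typescript
// Hypothetical shape of one entry in a docs/benchmarks/ results file.
// Field names are illustrative, not a defined schema.
interface BenchmarkResult {
  benchmark: "mr-review" | "coding" | "bug-investigation" | "discovery";
  workflowVersion: string; // e.g. "3.36.0"
  model: string;           // e.g. "claude-sonnet-4-6"
  runs: number;            // sample size for this row
  recall?: number;         // mr-review only
  precision?: number;      // mr-review only
  passRate?: number;       // coding / bug-investigation benchmarks
  tokensPerTask: number;   // mean token cost
  wallClockSeconds: number;
}

const example: BenchmarkResult = {
  benchmark: "mr-review",
  workflowVersion: "3.36.0",
  model: "claude-sonnet-4-6",
  runs: 40,
  recall: 0.87,
  precision: 0.91,
  tokensPerTask: 215_000,
  wallClockSeconds: 640,
};
```

A typed record like this is what would let the CI job diff a new run against the previous release and fail on regression.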
**Why this matters for adoption:**
- Developers are skeptical of autonomous agents -- "it probably makes stuff up"
- Hard numbers cut through skepticism instantly
- Showing WorkTrain-with-Haiku beating one-shot Opus on structured tasks is a compelling cost argument
- Showing improvement across workflow versions gives teams confidence the system is getting better
- The benchmark suite is also a regression test -- if a workflow change degrades performance, CI catches it

**What makes this hard:**
- Ground truth is expensive to establish (expert-labeled evaluation sets)
- Some tasks are inherently subjective (discovery quality)
- Benchmarks can be gamed (optimizing for the benchmark, not real performance)
- Results need enough volume to be statistically meaningful

**Starting point:** the mr-review workflow is the easiest to benchmark objectively. Start with 20 PRs where bugs were later discovered and 20 PRs that shipped cleanly. Run each through `mr-review-workflow-agentic` on several model tiers. Measure recall and precision. That's a publishable result with one weekend of work.

---

### Self-healing daemon: detect internal failures, kill, diagnose, fix, reboot, resume (Apr 18, 2026)

**The problem:** today, if WorkRail's MCP connection drops or the daemon's internal tooling fails, agents continue running without enforcement -- producing unverified output that looks correct but bypassed all workflow gates. The user has no way to know this happened until they manually inspect session completion events.

**What happened today:** WorkRail MCP went down mid-session across ~10 concurrent agents. All sessions show INCOMPLETE (no `run_completed` event). Agents produced PRs, reviews, and merges -- two PRs landed on main -- without any confirmed workflow completion. A manual audit was required after the fact.

**What WorkTrain needs:**

**1. Detect its own tooling failures**
- Monitor whether `complete_step` / `continue_workflow` tool calls are succeeding or timing out
- Detect when the WorkRail session store becomes unreachable
- Detect when the MCP connection (for agents that use MCP mode) is lost
- Distinguish: "agent is thinking" vs "agent is stuck" vs "agent's tools are broken"

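The thinking/stuck/broken distinction can be approximated at the tool-call boundary: a call that rejects means the tooling is broken, and a call that stays silent past its deadline means the agent may be stuck. A sketch of such a watchdog, assuming the daemon can wrap each engine tool call -- this is a hypothetical helper, not the daemon's actual code:

```typescript
// Watchdog around a single engine tool call. A rejection (e.g. EPIPE or an
// MCP disconnect) means broken tooling; silence past the deadline means the
// agent may be stuck; resolution within the deadline means healthy.
const TIMED_OUT: unique symbol = Symbol("timed-out");

type Verdict = "ok" | "tooling_broken" | "possibly_stuck";

async function watchToolCall<T>(
  call: () => Promise<T>,
  deadlineMs: number,
): Promise<{ verdict: Verdict; value?: T }> {
  const deadline = new Promise<typeof TIMED_OUT>((resolve) =>
    setTimeout(() => resolve(TIMED_OUT), deadlineMs),
  );
  try {
    const result = await Promise.race([call(), deadline]);
    return result === TIMED_OUT
      ? { verdict: "possibly_stuck" }
      : { verdict: "ok", value: result };
  } catch {
    return { verdict: "tooling_broken" };
  }
}
```

"Possibly stuck" is deliberately soft: a long deadline overrun is only evidence, and the real policy would combine it with turn counts and the per-trigger timeouts that already exist.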
**2. Kill the agent cleanly on detected failure**
- When internal tooling is detected as broken, stop the agent immediately -- do NOT let it continue without enforcement
- Retain the full conversation history and step notes up to the point of failure
- Mark the session as `interrupted_tooling_failure` (distinct from `error` or `timeout`)
- Write the failure event to the daemon event log with the exact cause

**3. Self-diagnose**
- Run a lightweight health check: can we reach the session store? Can we decode the continueToken? Is the WorkRail engine responding?
- Identify the root cause: MCP disconnect? Session store corruption? Token decode failure? Port conflict?
- Distinguish recoverable failures (restart and resume) from non-recoverable ones (session data corrupted, must restart from scratch)

**4. Fix and reboot**
- For recoverable failures: restart the WorkRail engine in-process, re-register tools, and verify health before resuming
- For MCP-mode failures: reconnect without killing the parent session
- For port conflicts: clear the lock and rebind
- All of this happens automatically, without user intervention

**5. Resume with context**
- Resume the session from the last confirmed `complete_step` / `advance_recorded` event
- Inject a `<context>` block into the resumed session: "Your previous session was interrupted due to [reason]. Here is what you completed before the interruption: [last 3 step notes]. Resume from where you left off."
- The agent never knows the interruption happened -- it just continues
- If resume fails (token expired, session corrupted), escalate to the user with full context on what was and wasn't completed

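The injected `<context>` block is just a string assembled from the failure reason and the last confirmed step notes. A sketch, with the helper name and note formatting as illustrative assumptions:

```typescript
// Build the <context> block injected into a resumed session, using the
// last three confirmed step notes (wording follows the plan above; the
// function itself is hypothetical).
function buildResumeContext(reason: string, stepNotes: string[]): string {
  const recent = stepNotes
    .slice(-3) // only the last three notes, oldest first
    .map((note, i) => `${i + 1}. ${note}`)
    .join("\n");
  return [
    "<context>",
    `Your previous session was interrupted due to ${reason}.`,
    "Here is what you completed before the interruption:",
    recent,
    "Resume from where you left off.",
    "</context>",
  ].join("\n");
}
```

Because the block is rebuilt from the event log, it stays truthful even if the agent's own transcript was cut off mid-turn.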
**6. Audit trail**
- Every self-heal event is recorded in `~/.workrail/events/daemon/YYYY-MM-DD.jsonl` as `tooling_failure_detected`, `self_heal_started`, and `self_heal_succeeded` / `self_heal_failed`
- The console shows a `⚠️ self-healed` badge on sessions that were interrupted and resumed
- The user can query "which sessions were interrupted today?" and get the full list with causes

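Appending one JSON object per line to the daily file keeps the audit trail crash-safe and trivially greppable. A sketch -- the event names come from the plan above, but the helpers and record fields beyond `type` are illustrative:

```typescript
import { appendFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

type SelfHealEventType =
  | "tooling_failure_detected"
  | "self_heal_started"
  | "self_heal_succeeded"
  | "self_heal_failed";

// Build one JSONL line for the daemon event log.
function selfHealEvent(type: SelfHealEventType, detail: Record<string, unknown>): string {
  return JSON.stringify({ type, at: new Date().toISOString(), ...detail });
}

// Append it to ~/.workrail/events/daemon/YYYY-MM-DD.jsonl
function recordSelfHealEvent(type: SelfHealEventType, detail: Record<string, unknown>): void {
  const dir = join(homedir(), ".workrail", "events", "daemon");
  mkdirSync(dir, { recursive: true }); // no-op if the directory exists
  const day = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
  appendFileSync(join(dir, `${day}.jsonl`), selfHealEvent(type, detail) + "\n");
}
```

Splitting the pure line-builder from the file append keeps the event format testable without touching the filesystem.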
**Implementation path:**
- Phase 1: detection only -- log `tooling_failure_detected` when a session produces output without `run_completed`. Surface it in the console and the status command.
- Phase 2: kill on detection -- stop agents immediately when a tooling failure is detected. No more unverified output reaching main.
- Phase 3: auto-resume -- restart and resume for recoverable failures.
- Phase 4: full self-heal loop -- diagnose, fix, reboot, and resume automatically.

**The invariant that must hold:** no output from a WorkTrain agent should be acted on (committed, merged, posted) unless its session has a confirmed `run_completed` event. This is the enforcement guarantee. Self-healing is what makes that guarantee survivable.
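The invariant can be enforced with a trivial gate anywhere output is about to be acted on. A sketch, with the event shape and helper names as illustrative assumptions:

```typescript
// Enforcement gate for the invariant above: refuse to commit/merge/post
// unless the session's event log contains a run_completed event.
interface SessionEvent {
  type: string; // e.g. "advance_recorded", "run_completed"
}

function hasConfirmedCompletion(events: SessionEvent[]): boolean {
  return events.some((e) => e.type === "run_completed");
}

function assertActionable(sessionId: string, events: SessionEvent[]): void {
  if (!hasConfirmedCompletion(events)) {
    throw new Error(
      `refusing to act on session ${sessionId}: no run_completed event recorded`,
    );
  }
}
```

Phase 1 above is exactly this check run in reverse: scan for sessions with output but no `run_completed` and flag them.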

---

### CRITICAL architectural clarity: three systems, one shared engine (permanent reference)

This is the most important architectural fact about this codebase. Every agent and contributor must understand it before touching anything.

**WorkRail/WorkTrain is three separate systems sharing one engine:**

```
                   Shared core
          ┌─────────────────────────────┐
          │ WorkRail engine             │
          │ src/v2/durable-core/        │
          │ ~/.workrail/data/sessions/  │
          │ workflow registry           │
          └────┬──────────┬──────────┬──┘
               │          │          │
┌──────────────▼─┐  ┌─────▼──────┐  ┌▼─────────────┐
│ WorkRail MCP   │  │ WorkTrain  │  │ WorkRail     │
│ Server         │  │ Daemon     │  │ Console      │
│ workrail start │  │ worktrain  │  │ worktrain    │
│ src/mcp/       │  │ daemon     │  │ console      │
│                │  │ src/daemon/│  │ src/console/ │
│ Claude Code    │  │ src/trigger│  │              │
│ connects here  │  │            │  │ Shows BOTH   │
│ via stdio      │  │ autonomous │  │ MCP + daemon │
│                │  │ agent loop │  │ sessions     │
└────────────────┘  └────────────┘  └──────────────┘
```

**WorkRail MCP Server:** `workrail start` → stdio MCP server. Claude Code connects here. Provides `start_workflow`, `complete_step`, `list_workflows`, etc. as MCP tools. Source: `src/mcp/`. Must be bulletproof -- a crash kills all Claude Code workflow tools.

**WorkTrain Daemon:** `worktrain daemon` → autonomous agent runner. Drives sessions without human involvement. Calls the WorkRail engine **directly in-process** -- it does NOT go through the MCP server. Source: `src/daemon/`, `src/trigger/`.

**WorkRail Console:** `worktrain console` → unified read-only session viewer. Shows sessions from **both** the MCP server and the daemon (they share the same session store). Requires neither the MCP server nor the daemon to be running.

**Rules that follow from this:**
- MCP server code must never import from `src/daemon/` or `src/trigger/`
- Daemon code must never depend on the MCP server being alive
- Console code reads the session store directly -- no IPC with the MCP server or the daemon is needed
- These are separate processes. A crash in one does not affect the others.

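The import boundary in the first rule can be enforced mechanically rather than by convention, for example with an ESLint `no-restricted-imports` override scoped to the MCP server sources. A sketch only -- the repo's actual lint configuration may differ:

```jsonc
// .eslintrc fragment (illustrative): fail the lint whenever code under
// src/mcp/ imports from the daemon or trigger modules.
{
  "overrides": [
    {
      "files": ["src/mcp/**/*.ts"],
      "rules": {
        "no-restricted-imports": [
          "error",
          { "patterns": ["**/daemon/**", "**/trigger/**"] }
        ]
      }
    }
  ]
}
```

A lint rule catches the violation at commit time, long before a daemon import can drag daemon lifecycle assumptions into the MCP server process.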
---