@exaudeus/workrail 3.27.0 → 3.29.0

Files changed (160)
  1. package/dist/console/assets/{index-FtTaDku8.js → index-BZ6HkxGf.js} +1 -1
  2. package/dist/console/index.html +1 -1
  3. package/dist/manifest.json +3 -3
  4. package/docs/README.md +57 -0
  5. package/docs/adrs/001-hybrid-storage-backend.md +38 -0
  6. package/docs/adrs/002-four-layer-context-classification.md +38 -0
  7. package/docs/adrs/003-checkpoint-trigger-strategy.md +35 -0
  8. package/docs/adrs/004-opt-in-encryption-strategy.md +36 -0
  9. package/docs/adrs/005-agent-first-workflow-execution-tokens.md +105 -0
  10. package/docs/adrs/006-append-only-session-run-event-log.md +76 -0
  11. package/docs/adrs/007-resume-and-checkpoint-only-sessions.md +51 -0
  12. package/docs/adrs/008-blocked-nodes-architectural-upgrade.md +178 -0
  13. package/docs/adrs/009-bridge-mode-single-instance-mcp.md +195 -0
  14. package/docs/adrs/010-release-pipeline.md +89 -0
  15. package/docs/architecture/README.md +7 -0
  16. package/docs/architecture/refactor-audit.md +364 -0
  17. package/docs/authoring-v2.md +527 -0
  18. package/docs/authoring.md +873 -0
  19. package/docs/changelog-recent.md +201 -0
  20. package/docs/configuration.md +505 -0
  21. package/docs/ctc-mcp-proposal.md +518 -0
  22. package/docs/design/README.md +22 -0
  23. package/docs/design/agent-cascade-protocol.md +96 -0
  24. package/docs/design/autonomous-console-design-candidates.md +253 -0
  25. package/docs/design/autonomous-console-design-review.md +111 -0
  26. package/docs/design/autonomous-platform-mvp-discovery.md +525 -0
  27. package/docs/design/claude-code-source-deep-dive.md +713 -0
  28. package/docs/design/console-cyberpunk-ui-discovery.md +504 -0
  29. package/docs/design/console-execution-trace-candidates-final.md +160 -0
  30. package/docs/design/console-execution-trace-candidates.md +211 -0
  31. package/docs/design/console-execution-trace-design-candidates-v2.md +113 -0
  32. package/docs/design/console-execution-trace-design-review.md +74 -0
  33. package/docs/design/console-execution-trace-discovery.md +394 -0
  34. package/docs/design/console-execution-trace-final-review.md +77 -0
  35. package/docs/design/console-execution-trace-review.md +92 -0
  36. package/docs/design/console-performance-discovery.md +415 -0
  37. package/docs/design/console-ui-backlog.md +280 -0
  38. package/docs/design/daemon-architecture-discovery.md +853 -0
  39. package/docs/design/daemon-design-candidates.md +318 -0
  40. package/docs/design/daemon-design-review-findings.md +119 -0
  41. package/docs/design/daemon-engine-design-candidates.md +210 -0
  42. package/docs/design/daemon-engine-design-review.md +131 -0
  43. package/docs/design/daemon-execution-engine-discovery.md +280 -0
  44. package/docs/design/daemon-gap-analysis.md +554 -0
  45. package/docs/design/daemon-owns-console-plan.md +168 -0
  46. package/docs/design/daemon-owns-console-review.md +91 -0
  47. package/docs/design/daemon-owns-console.md +195 -0
  48. package/docs/design/data-model-erd.md +11 -0
  49. package/docs/design/design-candidates-consolidate-dev-staleness.md +98 -0
  50. package/docs/design/design-candidates-walk-cache-depth-limit.md +80 -0
  51. package/docs/design/design-review-consolidate-dev-staleness.md +54 -0
  52. package/docs/design/design-review-walk-cache-depth-limit.md +48 -0
  53. package/docs/design/implementation-plan-consolidate-dev-staleness.md +142 -0
  54. package/docs/design/implementation-plan-walk-cache-depth-limit.md +141 -0
  55. package/docs/design/layer3b-ghost-nodes-design-candidates.md +229 -0
  56. package/docs/design/layer3b-ghost-nodes-design-review.md +93 -0
  57. package/docs/design/layer3b-ghost-nodes-implementation-plan.md +219 -0
  58. package/docs/design/list-workflows-latency-fix-plan.md +128 -0
  59. package/docs/design/list-workflows-latency-fix-review.md +55 -0
  60. package/docs/design/list-workflows-latency-fix.md +109 -0
  61. package/docs/design/native-context-management-api.md +11 -0
  62. package/docs/design/performance-sweep-2026-04.md +96 -0
  63. package/docs/design/routines-guide.md +219 -0
  64. package/docs/design/sequence-diagrams.md +11 -0
  65. package/docs/design/subagent-design-principles.md +220 -0
  66. package/docs/design/temporal-patterns-design-candidates.md +312 -0
  67. package/docs/design/temporal-patterns-design-review-findings.md +163 -0
  68. package/docs/design/test-isolation-from-config-file.md +335 -0
  69. package/docs/design/v2-core-design-locks.md +2746 -0
  70. package/docs/design/v2-lock-registry.json +734 -0
  71. package/docs/design/workflow-authoring-v2.md +1044 -0
  72. package/docs/design/workflow-docs-spec.md +218 -0
  73. package/docs/design/workflow-extension-points.md +687 -0
  74. package/docs/design/workrail-auto-trigger-system.md +359 -0
  75. package/docs/design/workrail-config-file-discovery.md +513 -0
  76. package/docs/docker.md +110 -0
  77. package/docs/generated/v2-lock-closure-plan.md +26 -0
  78. package/docs/generated/v2-lock-coverage.json +797 -0
  79. package/docs/generated/v2-lock-coverage.md +177 -0
  80. package/docs/ideas/backlog.md +3927 -0
  81. package/docs/ideas/design-candidates-mcp-resilience.md +208 -0
  82. package/docs/ideas/design-review-findings-mcp-resilience.md +119 -0
  83. package/docs/ideas/implementation_plan.md +249 -0
  84. package/docs/ideas/third-party-workflow-setup-design-thinking.md +1948 -0
  85. package/docs/implementation/02-architecture.md +316 -0
  86. package/docs/implementation/04-testing-strategy.md +124 -0
  87. package/docs/implementation/09-simple-workflow-guide.md +835 -0
  88. package/docs/implementation/13-advanced-validation-guide.md +874 -0
  89. package/docs/implementation/README.md +21 -0
  90. package/docs/integrations/claude-code.md +300 -0
  91. package/docs/integrations/firebender.md +315 -0
  92. package/docs/migration/v0.1.0.md +147 -0
  93. package/docs/naming-conventions.md +45 -0
  94. package/docs/planning/README.md +104 -0
  95. package/docs/planning/github-ticketing-playbook.md +195 -0
  96. package/docs/plans/README.md +24 -0
  97. package/docs/plans/agent-managed-ticketing-design.md +605 -0
  98. package/docs/plans/agentic-orchestration-roadmap.md +112 -0
  99. package/docs/plans/assessment-gates-engine-handoff.md +536 -0
  100. package/docs/plans/content-coherence-and-references.md +151 -0
  101. package/docs/plans/library-extraction-plan.md +340 -0
  102. package/docs/plans/mr-review-workflow-redesign.md +1451 -0
  103. package/docs/plans/native-context-management-epic.md +11 -0
  104. package/docs/plans/perf-fixes-design-candidates.md +225 -0
  105. package/docs/plans/perf-fixes-design-review-findings.md +61 -0
  106. package/docs/plans/perf-fixes-new-issues-candidates.md +264 -0
  107. package/docs/plans/perf-fixes-new-issues-review.md +110 -0
  108. package/docs/plans/prompt-fragments.md +53 -0
  109. package/docs/plans/ui-ux-workflow-design-candidates.md +120 -0
  110. package/docs/plans/ui-ux-workflow-discovery.md +100 -0
  111. package/docs/plans/ui-ux-workflow-review.md +48 -0
  112. package/docs/plans/v2-followup-enhancements.md +587 -0
  113. package/docs/plans/workflow-categories-candidates.md +105 -0
  114. package/docs/plans/workflow-categories-discovery.md +110 -0
  115. package/docs/plans/workflow-categories-review.md +51 -0
  116. package/docs/plans/workflow-discovery-model-candidates.md +94 -0
  117. package/docs/plans/workflow-discovery-model-discovery.md +74 -0
  118. package/docs/plans/workflow-discovery-model-review.md +48 -0
  119. package/docs/plans/workflow-source-setup-phase-1.md +245 -0
  120. package/docs/plans/workflow-source-setup-phase-2.md +361 -0
  121. package/docs/plans/workflow-staleness-detection-candidates.md +104 -0
  122. package/docs/plans/workflow-staleness-detection-review.md +58 -0
  123. package/docs/plans/workflow-staleness-detection.md +80 -0
  124. package/docs/plans/workflow-v2-design.md +69 -0
  125. package/docs/plans/workflow-v2-roadmap.md +74 -0
  126. package/docs/plans/workflow-validation-design.md +98 -0
  127. package/docs/plans/workflow-validation-roadmap.md +108 -0
  128. package/docs/plans/workrail-platform-vision.md +420 -0
  129. package/docs/reference/agent-context-cleaner-snippet.md +94 -0
  130. package/docs/reference/agent-context-guidance.md +140 -0
  131. package/docs/reference/context-optimization.md +284 -0
  132. package/docs/reference/example-workflow-repository-template/.github/workflows/validate.yml +125 -0
  133. package/docs/reference/example-workflow-repository-template/README.md +268 -0
  134. package/docs/reference/example-workflow-repository-template/workflows/example-workflow.json +80 -0
  135. package/docs/reference/external-workflow-repositories.md +916 -0
  136. package/docs/reference/feature-flags-architecture.md +472 -0
  137. package/docs/reference/feature-flags.md +349 -0
  138. package/docs/reference/god-tier-workflow-validation.md +272 -0
  139. package/docs/reference/loop-optimization.md +209 -0
  140. package/docs/reference/loop-validation.md +176 -0
  141. package/docs/reference/loops.md +465 -0
  142. package/docs/reference/mcp-platform-constraints.md +59 -0
  143. package/docs/reference/recovery.md +88 -0
  144. package/docs/reference/releases.md +177 -0
  145. package/docs/reference/troubleshooting.md +105 -0
  146. package/docs/reference/workflow-execution-contract.md +998 -0
  147. package/docs/roadmap/README.md +22 -0
  148. package/docs/roadmap/legacy-planning-status.md +103 -0
  149. package/docs/roadmap/now-next-later.md +70 -0
  150. package/docs/roadmap/open-work-inventory.md +389 -0
  151. package/docs/tickets/README.md +39 -0
  152. package/docs/tickets/next-up.md +76 -0
  153. package/docs/workflow-management.md +317 -0
  154. package/docs/workflow-templates.md +423 -0
  155. package/docs/workflow-validation.md +184 -0
  156. package/docs/workflows.md +254 -0
  157. package/package.json +3 -1
  158. package/spec/authoring-spec.json +61 -16
  159. package/workflows/workflow-for-workflows.json +252 -93
  160. package/workflows/workflow-for-workflows.v2.json +188 -77
@@ -0,0 +1,168 @@
1
+ # Implementation Plan: Daemon-Owned HTTP Console
2
+
3
+ **Status:** Ready for implementation
4
+ **Branch:** `feat/daemon-owns-console`
5
+ **Date:** 2026-04-16
6
+
7
+ ---
8
+
9
+ ## 1. Problem Statement
10
+
11
+ The console dashboard (`localhost:3456`) goes down whenever the MCP server crashes or Claude Code restarts, because the console is hosted by whichever MCP server wins the primary election. The daemon -- the actual long-running process -- has no role in hosting the console today. The fix: the daemon starts and owns the HTTP console on port 3456.
12
+
13
+ ---
14
+
15
+ ## 2. Acceptance Criteria
16
+
17
+ - [ ] When `workrail daemon` starts, the console dashboard is available at `http://localhost:3456/console`
18
+ - [ ] The console stays available as long as the daemon process is running
19
+ - [ ] The console goes down when the daemon stops (SIGTERM or SIGINT)
20
+ - [ ] `POST /api/v2/auto/dispatch` on the daemon console dispatches through the daemon's `TriggerRouter` queue
21
+ - [ ] When port 3456 is already held by an MCP server, the daemon logs a clear human-readable message and continues (trigger listener on 3200 still starts)
22
+ - [ ] When the daemon is running, the MCP server's `HttpServer` becomes secondary (no port conflict, no double console)
23
+ - [ ] The `workrail daemon` process does NOT exit on port 3456 conflict -- only the console is degraded
24
+
25
+ ---
26
+
27
+ ## 3. Non-Goals
28
+
29
+ - Merging port 3200 (webhook) and port 3456 (console) into a single port
30
+ - Modifying `HttpServer.ts` class internals
31
+ - Streaming live agent output in the console (separate backlog item)
32
+ - Adding token auth to `POST /api/v2/auto/dispatch` (filed as a TODO in console-routes.ts)
33
+ - Adding an integration test for the full daemon startup flow (out of scope, complex process harness needed)
34
+ - Exposing the `timingRingBuffer` / perf endpoint on the daemon console (dev-only, acceptable gap)
35
+
36
+ ---
37
+
38
+ ## 4. Philosophy-Driven Constraints
39
+
40
+ | Constraint | Source |
41
+ |------------|--------|
42
+ | `startDaemonConsole()` must return `Result<DaemonConsoleHandle, DaemonConsoleError>`, not throw | CLAUDE.md: errors are data |
43
+ | `DaemonConsoleHandle` properties are `readonly` | CLAUDE.md: immutability by default |
44
+ | `ctx: V2ToolContext` is injected at the composition root, not read from env inside the function | CLAUDE.md: dependency injection for boundaries |
45
+ | No `composeServer()` lock check (deferred) -- accept cosmetic double-watcher | CLAUDE.md: YAGNI with discipline |
46
+ | Port conflict is surfaced to the user via log message, not silently swallowed | CLAUDE.md: validate at boundaries |
47
+
48
+ ---
49
+
50
+ ## 5. Invariants
51
+
52
+ 1. The daemon console and MCP server console never both serve traffic on port 3456 simultaneously (enforced by OS port exclusivity -- EADDRINUSE)
53
+ 2. When `DaemonConsoleHandle.stop()` is called, the `fs.watch` disposer from `mountConsoleRoutes()` is called before the HTTP server closes
54
+ 3. `daemon-console.lock` contains `{ pid: number, port: number }` and is deleted by `stop()` or on daemon exit
55
+ 4. The trigger listener on port 3200 starts regardless of whether the daemon console starts successfully
56
+
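Invariant 3 fixes the lock file payload as `{ pid: number, port: number }`. A minimal sketch of how a reader of that file might validate it before trusting it is shown here. The names `DaemonConsoleLock` and `parseDaemonConsoleLock` are hypothetical and do not come from the plan. The range checks are assumptions about what counts as a sane pid and port.

```typescript
// Hypothetical helper: the plan only specifies the payload shape { pid, port }.
interface DaemonConsoleLock {
  readonly pid: number;
  readonly port: number;
}

// Returns the parsed lock, or null when the content is missing, corrupt,
// or has the wrong shape. Callers can treat null as "no daemon console".
function parseDaemonConsoleLock(raw: string): DaemonConsoleLock | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // truncated or non-JSON content
  }
  if (typeof parsed !== 'object' || parsed === null) return null;

  const { pid, port } = parsed as { pid?: unknown; port?: unknown };
  if (typeof pid !== 'number' || !Number.isInteger(pid) || pid <= 0) return null;
  if (typeof port !== 'number' || !Number.isInteger(port) || port < 1 || port > 65535) {
    return null;
  }
  return { pid, port };
}
```

Returning `null` instead of throwing keeps an unreadable lock on the same path as a missing one, which matches the "treat missing/unreadable lock as no-daemon" fallback described in the design review.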
57
+ ---
58
+
59
+ ## 6. Selected Approach
60
+
61
+ **Simpler hybrid of Candidate B:** Extract `startDaemonConsole()` with Result type and lifecycle handle. Write `daemon-console.lock`. Do NOT add lock check to `composeServer()` (deferred).
62
+
63
+ **Runner-up:** Candidate A (inline, no guard). Lost because it doesn't extract the function (poor testability) and doesn't write the lock file (useful for future tooling).
64
+
65
+ **Rationale:** Follows established `TriggerListenerHandle` pattern exactly. Minimal surface area (2 files). Clean error handling. Deferred `composeServer()` optimization avoids premature complexity.
66
+
67
+ ---
68
+
69
+ ## 7. Vertical Slices
70
+
71
+ ### Slice 1: `src/trigger/daemon-console.ts` (new file)
72
+
73
+ **Scope:** New module that starts a standalone Express HTTP server on port 3456, mounts console routes, writes the lock file.
74
+
75
+ **Acceptance criteria:**
76
+ - Exports `DaemonConsoleHandle: { readonly port: number; stop(): Promise<void> }`
77
+ - Exports `DaemonConsoleError: { kind: 'port_conflict'; port: number } | { kind: 'io_error'; message: string }`
78
+ - Exports `startDaemonConsole(ctx: V2ToolContext, options: StartDaemonConsoleOptions): Promise<Result<DaemonConsoleHandle, DaemonConsoleError>>`
79
+ where `StartDaemonConsoleOptions = { port?: number; triggerRouter?: TriggerRouter; serverVersion?: string; workflowService?: WorkflowService }`
80
+ - Creates Express app with CORS middleware (`cors({ origin: '*', methods: ['GET', 'HEAD', 'OPTIONS', 'POST'], allowedHeaders: ['Content-Type'] })`)
81
+ - Creates `ConsoleService` from `ctx.v2`
82
+ - Calls `mountConsoleRoutes(app, consoleService, workflowService, undefined, undefined, serverVersion, ctx, triggerRouter)`
83
+ - Binds to `127.0.0.1:port` (default 3456)
84
+ - On bind success: writes `~/.workrail/daemon-console.lock = JSON.stringify({ pid: process.pid, port: actualPort })`
85
+ - Returns `{ port: actualPort, stop: async () => { stopWatcher(); server.close(); await releaseLock(); } }`
86
+ - On EADDRINUSE: returns `err({ kind: 'port_conflict', port })`
87
+ - On other bind error: returns `err({ kind: 'io_error', message })`
88
+ - Lock file write failures are non-fatal: log warning, proceed
89
+
90
+ **Files changed:** `src/trigger/daemon-console.ts` (new)
91
+
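The two error bullets in Slice 1 (EADDRINUSE versus any other bind error) can be sketched as a single classification step. This is an illustration under stated assumptions, not the module's code: the `Result` shape below is a stand-in for whatever `ok`/`err` type the repo already uses, and `classifyBindError` is a hypothetical helper name.

```typescript
// Stand-in Result type; the repo's real ok/err helpers may differ in shape.
type Result<T, E> =
  | { readonly ok: true; readonly value: T }
  | { readonly ok: false; readonly error: E };

const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Error union as specified in the Slice 1 acceptance criteria.
type DaemonConsoleError =
  | { readonly kind: 'port_conflict'; readonly port: number }
  | { readonly kind: 'io_error'; readonly message: string };

// Node system errors carry a string `code` such as 'EADDRINUSE'.
// Hypothetical helper: maps a bind failure to the error union above.
function classifyBindError(
  e: { code?: string; message: string },
  port: number,
): Result<never, DaemonConsoleError> {
  return e.code === 'EADDRINUSE'
    ? err({ kind: 'port_conflict', port })
    : err({ kind: 'io_error', message: e.message });
}
```

In a real implementation this would typically be called from the server's `'error'` event handler registered before `listen()`, so that the promise returned by `startDaemonConsole()` resolves with an `err(...)` value rather than rejecting.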
92
+ ### Slice 2: `src/cli.ts` daemon command wiring
93
+
94
+ **Scope:** After `startTriggerListener()` succeeds, call `startDaemonConsole()`. Handle port conflict with clear log. Add shutdown cleanup.
95
+
96
+ **Acceptance criteria:**
97
+ - `startDaemonConsole(ctx, { triggerRouter: handle.router, workflowService: rawCtx.workflowService })` is called after trigger listener starts
98
+ - On `ok(consoleHandle)`: log `WorkRail console available at http://localhost:${consoleHandle.port}/console`
99
+ - On `err({ kind: 'port_conflict' })`: log `[DaemonConsole] Port ${port} is already held (likely by an MCP server). The daemon is running but the console is unavailable. Restart the MCP server while the daemon is running to enable the daemon console.`
100
+ - On `err({ kind: 'io_error' })`: log the error, continue (non-fatal)
101
+ - In `shutdown()`: if `consoleHandle` exists, `await consoleHandle.stop()` before `await handle.stop()`
102
+
103
+ **Files changed:** `src/cli.ts`
104
+
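The three outcomes in the Slice 2 acceptance criteria map naturally onto one exhaustive branch over the result. The sketch below only builds the log line. `ConsoleStart` and `describeConsoleStart` are hypothetical names, the result shape is a stand-in for the repo's own `Result` type, and the `io_error` wording is an assumption because the plan says only "log the error".

```typescript
type DaemonConsoleError =
  | { readonly kind: 'port_conflict'; readonly port: number }
  | { readonly kind: 'io_error'; readonly message: string };

// Stand-in for Result<DaemonConsoleHandle, DaemonConsoleError>; only `port` is needed here.
type ConsoleStart =
  | { readonly ok: true; readonly value: { readonly port: number } }
  | { readonly ok: false; readonly error: DaemonConsoleError };

// Hypothetical helper: turns the startup result into the message the CLI would log.
function describeConsoleStart(result: ConsoleStart): string {
  if (result.ok) {
    return `WorkRail console available at http://localhost:${result.value.port}/console`;
  }
  const error = result.error;
  switch (error.kind) {
    case 'port_conflict':
      return (
        `[DaemonConsole] Port ${error.port} is already held (likely by an MCP server). ` +
        'The daemon is running but the console is unavailable. ' +
        'Restart the MCP server while the daemon is running to enable the daemon console.'
      );
    case 'io_error':
      // Assumed wording; non-fatal, the daemon keeps running.
      return `[DaemonConsole] Console failed to start: ${error.message}`;
  }
}
```

Keeping the message construction in a pure function makes the port-conflict wording unit-testable without starting a server, while the CLI action stays responsible for the actual logging and for continuing after an error.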
105
+ ---
106
+
107
+ ## 8. Test Design
108
+
109
+ ### Unit tests for `startDaemonConsole()`
110
+
111
+ **File:** `src/trigger/daemon-console.test.ts`
112
+
113
+ **Test cases:**
114
+ 1. **Happy path**: `startDaemonConsole()` with a mock `V2ToolContext` and port=0 (OS-assigned). Verify returns `ok(handle)`, handle.port is non-zero, `GET /api/v2/sessions` returns 200.
115
+ 2. **Port conflict**: Start two instances on the same port. Second returns `err({ kind: 'port_conflict' })`.
116
+ 3. **stop() cleans up**: After `handle.stop()`, the port is released (can bind again). Verify `stopWatcher` was called (spy).
117
+ 4. **Lock file written**: After start, `~/.workrail/daemon-console.lock` exists with correct `{ pid, port }`.
118
+ 5. **Lock file deleted on stop**: After `handle.stop()`, lock file is deleted.
119
+
120
+ **Pattern:** Follow `trigger-listener.test.ts` for test structure. Use `port: 0` for OS-assigned ports to avoid conflicts.
121
+
122
+ ---
123
+
124
+ ## 9. Risk Register
125
+
126
+ | Risk | Likelihood | Impact | Mitigation |
127
+ |------|-----------|--------|------------|
128
+ | EADDRINUSE because MCP server starts before daemon | High | Low -- daemon still works on 3200 | Clear log message (Slice 2) |
129
+ | stopWatcher disposer not called in stop() | Low | Low -- fd leak | Implementation requirement, verified by test 3 |
130
+ | Lock file on NFS home directory (latency) | Very low | None -- lock write is fire-and-forget | Non-fatal lock write (log warning, proceed) |
131
+ | CORS headers missing on daemon Express app | Was a risk | None | Fixed in Slice 1 |
132
+
133
+ ---
134
+
135
+ ## 10. PR Packaging Strategy
136
+
137
+ Single PR: `feat/daemon-owns-console`
138
+ - Slice 1 and Slice 2 together (they're co-dependent -- no value without both)
139
+ - Include test file
140
+ - PR description links to `docs/design/daemon-owns-console.md`
141
+
142
+ **estimatedPRCount:** 1
143
+
144
+ ---
145
+
146
+ ## 11. Philosophy Alignment
147
+
148
+ | Slice | Principle | Status |
149
+ |-------|-----------|--------|
150
+ | Slice 1 | Errors are data | Satisfied -- `Result<DaemonConsoleHandle, DaemonConsoleError>` |
151
+ | Slice 1 | Immutability by default | Satisfied -- `readonly port` |
152
+ | Slice 1 | Dependency injection | Satisfied -- ctx injected, no env reads inside function |
153
+ | Slice 1 | Compose with small pure functions | Satisfied -- delegates to `mountConsoleRoutes()` |
154
+ | Slice 2 | Validate at boundaries | Satisfied -- port conflict logged with clear UX message |
155
+ | Slice 2 | YAGNI with discipline | Satisfied -- no composeServer() lock check |
156
+
157
+ ---
158
+
159
+ ## 12. Follow-up Tickets
160
+
161
+ 1. **Add token auth to `POST /api/v2/auto/dispatch`** -- the daemon console makes the always-up no-auth endpoint more exposed (existing TODO in console-routes.ts)
162
+ 2. **Add `composeServer()` lock check** -- if MaxListenersExceededWarning is observed in practice during daemon+MCP co-runs, add the lock file check to prevent watcher accumulation
163
+ 3. **NFS latency mitigation** -- if someone reports slow daemon startup on NFS home dir, add a 100ms timeout on the lock file write
164
+
165
+ ---
166
+
167
+ **planConfidenceBand:** High
168
+ **unresolvedUnknownCount:** 0
@@ -0,0 +1,91 @@
1
+ # Design Review Findings: Daemon-Owned HTTP Console
2
+
3
+ **Design doc:** `docs/design/daemon-owns-console.md`
4
+ **Review date:** 2026-04-16
5
+
6
+ ---
7
+
8
+ ## Tradeoff Review
9
+
10
+ ### T1: Async file I/O in startup path
11
+ **Status:** Acceptable.
12
+ `composeServer()` already does far heavier I/O at startup (keyring load, token alias store). A `fs.readFile` on a ~100-byte lock file is negligible on local disk. Edge case: NFS home directory. Mitigated by try-catch with graceful fallback (treat missing/unreadable lock as no-daemon).
13
+
14
+ ### T2: New lock file convention
15
+ **Status:** Acceptable.
16
+ Pattern is identical to `dashboard.lock`. Stale lock after daemon crash fully mitigated by pid-liveness check (`process.kill(pid, 0)`). Directory creation must use `{ recursive: true }` to handle fresh installs.
17
+
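The pid-liveness check named in T2 relies on signal `0`, which performs the existence and permission checks without delivering a signal. A minimal sketch, assuming a Node.js runtime follows. `isPidAlive` is a hypothetical name, and treating `EPERM` as "alive" is an assumption about the desired behavior when the process belongs to another user.

```typescript
// Hypothetical helper illustrating the `process.kill(pid, 0)` probe from T2.
function isPidAlive(pid: number): boolean {
  try {
    // Signal 0 delivers nothing; it only tests whether the pid can be signalled.
    process.kill(pid, 0);
    return true;
  } catch (e) {
    // ESRCH: no such process (stale lock).
    // EPERM: the process exists but is owned by another user, so it is still alive.
    return (e as { code?: string }).code === 'EPERM';
  }
}
```

A stale lock left by a crashed daemon would then be detected as `isPidAlive(lock.pid) === false`. Pid reuse by an unrelated process remains possible in principle, which is a known limit of this style of check.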
18
+ ### T3: `timingRingBuffer` unavailable in daemon console
19
+ **Status:** Acceptable.
20
+ Perf endpoint (`/api/v2/perf/tool-calls`) mounts but returns empty observations in daemon console context. This is dev-mode only. A log note at daemon console startup makes the limitation visible.
21
+
22
+ ---
23
+
24
+ ## Failure Mode Review
25
+
26
+ ### FM1: Stale lock after daemon crash
27
+ **Coverage:** Fully handled by pid-liveness check. MCP server proceeds to mount normally after crash.
28
+
29
+ ### FM2: EADDRINUSE when daemon starts after MCP server (most likely in practice)
30
+ **Coverage:** Handled -- `startDaemonConsole()` returns `err({ kind: 'port_conflict' })`. Daemon still runs (trigger listener on 3200 works). Console unavailable until MCP server releases port.
31
+ **Gap:** No user-facing log message explaining how to recover. Needs: `[DaemonConsole] Port 3456 held by MCP server; restart the MCP server with the daemon already running to enable the daemon console.`
32
+
33
+ ### FM3: Startup race (daemon writes lock, MCP reads stale state)
34
+ **Coverage:** Accepted as cosmetic -- MCP server mounts routes on a server that never listens (secondary path). One extra file watcher per concurrent MCP start. Bounded, not accumulating.
35
+
36
+ ### FM4: `stopWatcher` disposer not called on daemon stop
37
+ **Coverage:** Implementation requirement: `DaemonConsoleHandle.stop()` must call the disposer returned by `mountConsoleRoutes()`. Same pattern as `HttpServer._runStop()` calling `_routeDisposers`.
38
+
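The ordering requirement in FM4 (watcher disposer first, then server close, then lock release) can be captured in a small factory that receives the three steps as injected functions. This is a sketch under assumptions: `makeStop` is a hypothetical name, the idempotency guard is an addition not required by the review, and `closeServer` is assumed to be a promise wrapper around the callback-based `server.close()`.

```typescript
// Hypothetical factory for DaemonConsoleHandle.stop(); all three steps are injected.
function makeStop(
  stopWatcher: () => void,            // disposer returned by mountConsoleRoutes()
  closeServer: () => Promise<void>,   // assumed promise wrapper around server.close()
  releaseLock: () => Promise<void>,   // deletes daemon-console.lock
): () => Promise<void> {
  let stopped = false;
  return async () => {
    if (stopped) return; // assumption: a second stop() call is a no-op
    stopped = true;
    stopWatcher();       // release the fs.watch before the server goes away
    await closeServer(); // wait until the port is actually released
    await releaseLock(); // remove the lock file last
  };
}
```

Injecting the steps lets a unit test assert the call order with plain spies, which is the verification the plan's third test case ("`stopWatcher` was called") is aiming for.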
39
+ ---
40
+
41
+ ## Runner-Up / Simpler Alternative Review
42
+
43
+ **Revised recommendation:** The simpler hybrid (Candidate B minus `composeServer()` lock check) is sufficient:
44
+ - Extracts `startDaemonConsole()` with Result type and clean stop lifecycle
45
+ - Writes `daemon-console.lock` (available for future tooling)
46
+ - Does NOT add lock check to `composeServer()` -- deferred (YAGNI)
47
+ - 2 files changed (new `daemon-console.ts`, modified `cli.ts`) instead of 3
48
+
49
+ The cosmetic double-watcher from the startup race is acceptable. MaxListenersExceededWarning would require 10+ simultaneous MCP server starts -- implausible.
50
+
51
+ ---
52
+
53
+ ## Philosophy Alignment
54
+
55
+ | Principle | Status |
56
+ |-----------|--------|
57
+ | Errors are data | Satisfied -- `Result<DaemonConsoleHandle, DaemonConsoleError>` |
58
+ | YAGNI with discipline | Satisfied -- deferred `composeServer()` lock check |
59
+ | Immutability by default | Satisfied -- readonly handle properties |
60
+ | Dependency injection | Satisfied -- `ctx: V2ToolContext` injected at composition root |
61
+ | Make illegal states unrepresentable | Acceptable tension -- process-boundary invariants cannot be type-enforced |
62
+ | Validate at boundaries | Minor tension -- `composeServer()` skips daemon check. Consequence is cosmetic. |
63
+
64
+ ---
65
+
66
+ ## Findings
67
+
68
+ ### Yellow: Missing user-facing log on port conflict
69
+ When `startDaemonConsole()` returns `port_conflict`, the daemon CLI should log a clear message explaining the start-order dependency. Without this, users will be confused why the console isn't available.
70
+
71
+ ### Yellow: `stopWatcher` disposer must be called in `stop()`
72
+ Implementation-level requirement: the disposer returned by `mountConsoleRoutes()` must be stored in the `DaemonConsoleHandle` closure and called in `stop()`. Missing this leaks an `fs.watch` on the sessions directory.
73
+
74
+ ### Yellow: CORS headers needed on daemon console Express app
75
+ `mountConsoleRoutes()` does not add CORS headers -- that's the caller's responsibility. The daemon console's Express app must add CORS middleware (same as `HttpServer.setupMiddleware()`) or browser clients will be blocked when accessing from a dev tool that runs on a different origin.
76
+
77
+ ---
78
+
79
+ ## Recommended Revisions
80
+
81
+ 1. In `startDaemonConsole()`, add CORS middleware (`cors({ origin: '*', methods: ['GET', 'HEAD', 'OPTIONS'] })`) to the Express app before calling `mountConsoleRoutes()`.
82
+ 2. In `cli.ts daemon`, log a clear message when `startDaemonConsole()` returns `port_conflict`.
83
+ 3. Ensure `DaemonConsoleHandle.stop()` calls the `stopWatcher` disposer before closing the server.
84
+
85
+ ---
86
+
87
+ ## Residual Concerns
88
+
89
+ 1. **NFS home directory latency**: If `~/.workrail/` is on a slow network filesystem, the lock file write at daemon startup may add visible latency. Not a correctness concern.
90
+ 2. **No integration test**: The daemon startup path (`workrail daemon` command) is not covered by existing automated tests. The new `startDaemonConsole()` function should have unit tests, but end-to-end daemon startup testing is out of scope for this task.
91
+ 3. **Future: when to add the `composeServer()` lock check**: If `MaxListenersExceededWarning` is observed in practice during daemon+MCP co-runs, the lock check should be added to `composeServer()` at that point.
@@ -0,0 +1,195 @@
1
+ # Design: Daemon-Owned HTTP Console
2
+
3
+ **Status:** Discovery complete, ready for implementation
4
+ **Date:** 2026-04-16
5
+
6
+ ---
7
+
8
+ ## Problem Understanding
9
+
10
+ ### Core tensions
11
+
12
+ 1. **Daemon-owned console vs. DI-managed HttpServer**: The `HttpServer` class in DI is a singleton with primary/secondary election, heartbeat, and lock logic. The daemon already uses DI. But `HttpServer` has `SessionManager` injected as a required dependency -- the daemon's use case is different (it needs `ConsoleService` routes from `console-routes.ts`, not the old V1 session routes). Tension: use the existing `HttpServer` class (complex, unneeded deps) vs. a minimal standalone server (simpler, parallel infrastructure).
13
+
14
+ 2. **`mountRoutes()` called before `start()` in `composeServer()`**: Today the MCP server calls `mountRoutes()` in `composeServer()` before the transport entry point calls `start()`. If the daemon holds port 3456, `start()` returns `null` (secondary), but `mountRoutes()` was already called -- the SSE file watcher and enrichment callback are already running on a server that will never serve traffic. This causes file descriptor waste and potential `MaxListenersExceededWarning` on stderr (which corrupts the MCP stdio channel).
15
+
16
+ 3. **Lock file reliability vs. complexity**: A `daemon-console.lock` file is the most explicit signal, but it adds a new file, a new lifecycle, and a new failure mode (stale lock after daemon crash). The existing `dashboard.lock` in `HttpServer.ts` already handles this pattern. The lock file check must include a pid-liveness probe (same pattern as `HttpServer.shouldReclaimLock()`) to handle stale locks after daemon crash.
17
+
18
+ 4. **AUTO dispatch queue ownership**: `POST /api/v2/auto/dispatch` uses `triggerRouter.dispatch()` when a `TriggerRouter` is provided. If the daemon owns the console, it should pass its `TriggerRouter` instance (from `TriggerListenerHandle.router`) so dispatched workflows go through the daemon's serialized queue, not direct `runWorkflow()`.
19
+
20
+ ### What makes this hard
21
+
22
+ `mountConsoleRoutes()` registers `fs.watch` on the sessions directory **immediately** on call -- not just when the server starts listening. So even in the MCP server's secondary path (where it never serves HTTP traffic), the watcher is live. A naive fix (guard on whether `start()` returns null) doesn't work because `start()` is called from the transport entry point, after `composeServer()` has already called `mountRoutes()`.
23
+
24
+ The correct fix requires preventing `mountRoutes()` from being called at all in the secondary path, which means the mount guard must live in `composeServer()` or the mount call must move to after `start()` in the transport entry point.
25
+
26
+ ### Likely seam
27
+
28
+ - Daemon console startup: `cli.ts daemon` command action (composition root)
29
+ - MCP server mount guard: `composeServer()` in `server.ts` OR `stdio-entry.ts` after `httpServer.start()`
30
+
31
+ ---
32
+
33
+ ## Philosophy Constraints
34
+
35
+ From `CLAUDE.md` (authoritative):
36
+ - **YAGNI with discipline**: don't create elaborate abstractions, but fix real seams when they're wrong
37
+ - **Errors are data**: new `startDaemonConsole()` should return a `Result` type, not throw
38
+ - **Explicit domain types**: `DaemonConsoleHandle = { port: number; stop(): Promise<void> }` mirrors `TriggerListenerHandle`
39
+ - **Validate at boundaries**: port-conflict and lock-check are boundary validations, belong in the composition root
40
+
41
+ Repo patterns (consistent with CLAUDE.md):
42
+ - `trigger-listener.ts` uses `http.createServer(express())` directly -- the daemon console follows this pattern
43
+ - Lock files use `~/.workrail/` base directory (`DashboardHeartbeat.ts`, `HttpServer.ts`)
44
+ - Result types (`ok`/`err`) used consistently
45
+
46
+ No philosophy conflicts observed.
47
+
48
+ ---
49
+
50
+ ## Impact Surface
51
+
52
+ Changes that must stay consistent:
53
+ - `TriggerListenerHandle` (no change -- the daemon console handle is a separate type)
54
+ - `mountConsoleRoutes()` signature (no change -- all params remain optional)
55
+ - `composeServer()` return type (no change -- `mountRoutes()` call moves or is guarded)
56
+ - `stdio-entry.ts` transport entry point (requires `mountRoutes()` + `finalize()` after `start()` if daemon not live)
57
+ - Any other transport entry points that call `httpServer.start()` must be checked
58
+
59
+ Other transport entry points to check: `src/mcp/transports/` -- need to enumerate to ensure none are missed.
60
+
61
+ ---
62
+
63
+ ## Candidates
64
+
65
+ ### Candidate A: Minimal inline console in daemon, no MCP guard
66
+
67
+ **Summary:** In `cli.ts daemon`, after `startTriggerListener()`, inline `express()` + `mountConsoleRoutes()` + `http.createServer().listen(3456)`. No changes to `server.ts` or `HttpServer.ts`.
68
+
69
+ **Tensions resolved:** Daemon owns the console. Console stays up as long as daemon runs.
70
+
71
+ **Tensions accepted:** Double SSE watcher problem persists (MCP server still calls `mountRoutes()` unconditionally). Port 3456 EADDRINUSE if MCP server is already running -- daemon console silently fails unless error is surfaced.
72
+
73
+ **Boundary:** `cli.ts daemon` action -- correct composition root.
74
+
75
+ **Failure mode:** EADDRINUSE when MCP server starts before daemon. Silent watcher accumulation in secondary MCP server processes.
76
+
77
+ **Repo pattern:** Follows `trigger-listener.ts` inline Express pattern exactly.
78
+
79
+ **Gains/losses:** Minimal diff (~25 lines, 1 file). Loses: no error handling, no watcher guard.
80
+
81
+ **Scope:** Too narrow -- leaves watcher leak unfixed.
82
+
83
+ **Philosophy:** Honors YAGNI. Conflicts with "validate at boundaries" (no EADDRINUSE feedback).
84
+
85
+ ---
86
+
87
+ ### Candidate B: `startDaemonConsole()` + `daemon-console.lock` + move `mountRoutes()` to transport entry (RECOMMENDED)
88
+
89
+ **Summary:** New `src/trigger/daemon-console.ts` exports `startDaemonConsole(ctx, options)` returning `Result<DaemonConsoleHandle, DaemonConsoleError>`. Writes `~/.workrail/daemon-console.lock` with `{ pid, port }`. In `stdio-entry.ts` (and other transports), move `mountRoutes()` + `finalize()` to after `httpServer.start()`; skip when daemon-console.lock has a live pid. `cli.ts daemon` calls `startDaemonConsole()` after trigger listener starts.
90
+
91
+ **Tensions resolved:** Daemon owns 3456 with explicit lifecycle. MCP server SSE watcher is never registered when daemon is live. Clean error handling on port conflict. Testable via extracted function.
92
+
93
+ **Tensions accepted:** More surface area -- new file, new lock convention, change to transport entry points.
94
+
95
+ **Boundary:** `cli.ts` for daemon startup, `stdio-entry.ts` for MCP mount guard.
96
+
97
+ **Failure mode:** Stale lock after daemon crash -> MCP server perpetually skips console mount. Mitigated by pid-liveness check (same pattern as `HttpServer.shouldReclaimLock()`).
98
+
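The pid-liveness mitigation can be sketched as follows. This is a minimal illustration, not the repo's `HttpServer.shouldReclaimLock()`: only the `{ pid, port }` lock shape comes from the design text, while `parseLock`, `isPidAlive`, and `shouldSkipConsoleMount` are hypothetical names. It assumes Node.js, where `process.kill(pid, 0)` tests whether a process exists without delivering a signal.

```typescript
// Hypothetical helper names; only the { pid, port } lock shape comes from the design text.
interface DaemonConsoleLock { pid: number; port: number; }

// Parse the lock file contents; return null for anything malformed.
function parseLock(raw: string): DaemonConsoleLock | null {
  try {
    const v = JSON.parse(raw);
    if (v !== null && typeof v === "object" && Number.isInteger(v.pid) && Number.isInteger(v.port)) {
      return { pid: v.pid, port: v.port };
    }
    return null;
  } catch {
    return null;
  }
}

// Signal 0 performs the existence check without delivering a signal.
// ESRCH means no such process; EPERM means it exists but belongs to another user.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (e) {
    return (e as { code?: string }).code === "EPERM";
  }
}

// Skip mounting only when a well-formed lock names a live process.
// A stale or corrupt lock falls through to "mount", so a daemon crash cannot block the console forever.
function shouldSkipConsoleMount(
  raw: string | null,
  alive: (pid: number) => boolean = isPidAlive,
): boolean {
  if (raw === null) return false;
  const lock = parseLock(raw);
  return lock !== null && alive(lock.pid);
}
```

Passing `alive` as a parameter keeps the decision testable without spawning or killing real processes.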
99
+ **Repo pattern:** `DaemonConsoleHandle` mirrors `TriggerListenerHandle`. Lock file follows `dashboard.lock` pattern. Moving `mountRoutes()` to transport entry is an architectural improvement, not a new pattern.
100
+
101
+ **Gains/losses:** Clean lifecycle, no double watchers, testable, proper error reporting. Loses: 4 files changed instead of 1.
102
+
103
+ **Scope:** Best-fit.
104
+
105
+ **Philosophy:** Honors YAGNI with discipline (seam was architecturally wrong, fixing it is correct), errors as data, explicit domain types, validate at boundaries.
106
+
107
+ ---
108
+
109
+ ### Candidate C: Daemon uses full `HttpServer` class with env var gate
110
+
111
+ **Summary:** Daemon starts the `HttpServer` DI singleton. Sets `WORKRAIL_DAEMON_CONSOLE=3456`; MCP server checks this in `composeServer()` and skips `mountRoutes()`.
112
+
113
+ **Tensions resolved:** Reuses existing `HttpServer` infrastructure. No new lock file.
114
+
115
+ **Tensions accepted:** `HttpServer` requires `SessionManager` injection (unneeded). Env var check is less reliable than a lock file (inherited by child processes, not owned by a specific process). Daemon must run full primary election machinery unnecessarily.
116
+
117
+ **Failure mode:** `SessionManager` initialization overhead. Env var persists after daemon death if MCP server is a subprocess.
118
+
119
+ **Repo pattern:** Forces `HttpServer` into a use case it wasn't designed for. Env-var coupling between processes is an antipattern in this codebase.
120
+
121
+ **Scope:** Too broad.
122
+
123
+ **Philosophy:** Conflicts with YAGNI (forces unneeded `SessionManager` dep), conflicts with "make illegal states unrepresentable" (env var as cross-process signal is fragile).
124
+
125
+ ---
126
+
127
+ ## Comparison and Recommendation
128
+
129
+ | | A | B | C |
130
+ |---|---|---|---|
131
+ | Daemon owns 3456 | Yes | Yes | Yes |
132
+ | No MCP watcher leak | No | Yes | Fragile |
133
+ | Error handling | No | Yes | Partial |
134
+ | Testable | Poor | Good | Poor |
135
+ | Files changed | 1 | 4 | 5+ |
136
+ | Follows repo patterns | Yes | Yes | Forced fit |
137
+
138
+ **Recommendation: Candidate B**
139
+
140
+ The watcher leak is the key differentiator. `fs.watch` in secondary-mode MCP server processes doesn't serve traffic but wastes file descriptors and generates `MaxListenersExceededWarning` to stderr -- which is the MCP stdio channel, and stderr writes corrupt it. This is a real crash driver (referenced in the backlog). Candidate A leaves it unfixed.
141
+
142
+ Candidate B moves `mountRoutes()` to after `httpServer.start()` in the transport entry point, which is architecturally correct (the routes should only be mounted if the server is actually listening). This is not speculative -- it fixes a real seam that was wrong.
143
+
144
+ ### Concrete implementation shape
145
+
146
+ **New file: `src/trigger/daemon-console.ts`**
147
+ ```typescript
148
+ export interface DaemonConsoleHandle { port: number; stop(): Promise<void>; }
149
+ export type DaemonConsoleError = { kind: 'port_conflict'; port: number } | { kind: 'io_error'; message: string };
150
+ export async function startDaemonConsole(
151
+   ctx: V2ToolContext,
152
+   options: { port?: number; triggerRouter?: TriggerRouter; serverVersion?: string }
153
+ ): Promise<Result<DaemonConsoleHandle, DaemonConsoleError>>
154
+ ```
155
+ - Creates `ConsoleService` from `ctx.v2`
156
+ - Creates Express app, calls `mountConsoleRoutes(app, consoleService, ctx.workflowService, undefined, undefined, serverVersion, ctx, triggerRouter)`
157
+ - Writes `~/.workrail/daemon-console.lock: { pid, port }`
158
+ - Binds to port 3456 (default) on `127.0.0.1`
159
+ - Returns a handle whose `stop()` closes the server, deletes the lock file, and calls the `stopWatcher` disposer
160
+
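One way the `DaemonConsoleError` union could be produced from a failed bind is sketched below. The union is copied from the signature above; `classifyListenError` and `describeError` are hypothetical helpers, and the sketch assumes the error object carries the `code` string that Node.js sets on system errors (`EADDRINUSE` when the port is taken).

```typescript
// DaemonConsoleError is copied from the design text; the two helpers are illustrative only.
type DaemonConsoleError =
  | { kind: "port_conflict"; port: number }
  | { kind: "io_error"; message: string };

// Map a Node.js listen() failure to the domain error union.
function classifyListenError(
  err: { code?: string; message: string },
  port: number,
): DaemonConsoleError {
  return err.code === "EADDRINUSE"
    ? { kind: "port_conflict", port }
    : { kind: "io_error", message: err.message };
}

// Render the error for the daemon's startup output; the switch is exhaustive over `kind`.
function describeError(e: DaemonConsoleError): string {
  switch (e.kind) {
    case "port_conflict":
      return `daemon console: port ${e.port} is already in use`;
    case "io_error":
      return `daemon console: ${e.message}`;
  }
}
```

Keeping the classification in a pure function means the port-conflict path can be unit tested without binding a socket.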
161
+ **Modified: `src/mcp/server.ts` `composeServer()`**
162
+ - Remove the `if (ctx.v2 && ctx.httpServer && ...) { ctx.httpServer.mountRoutes(...); ctx.httpServer.finalize(); }` block
163
+ - Export a new `mountConsoleRoutesIfOwner(httpServer, ctx, timingRingBuffer, toolCallsPerfFile, serverVersion): void` function that performs the lock check and mount
164
+
165
+ **Modified: `src/mcp/transports/stdio-entry.ts`**
166
+ - After `httpServer.start()` call, invoke `mountConsoleRoutesIfOwner(httpServer, ctx, ...)`
167
+
168
+ **Modified: `src/cli.ts` daemon command**
169
+ - After `startTriggerListener()` succeeds, call `startDaemonConsole(ctx, { triggerRouter: handle.router })`
170
+ - In `shutdown()`, call `await consoleHandle.stop()` before `handle.stop()`
171
+
172
+ ---
173
+
174
+ ## Self-Critique
175
+
176
+ **Strongest counter-argument against B:**
177
+ Moving `mountRoutes()` from `composeServer()` to transport entry points creates a contract where every transport entry point must remember to call `mountConsoleRoutesIfOwner()` after `start()`. If a new transport entry point is added and misses this call, the console silently doesn't mount. Candidate A avoids this risk entirely.
178
+
179
+ **Pivot condition:** If there are more than 2 transport entry points (stdio + HTTP), or if any transport entry point has complex branching that makes inserting the mount call risky, fall back to a lock-file check inside `composeServer()` (without moving `mountRoutes()`). The check would be: read `daemon-console.lock` before `mountRoutes()`; if daemon pid is live, skip. This is slightly worse (the watcher leak persists in the race window at startup) but avoids modifying transport entry points.
180
+
181
+ **Narrower option (A) loses because:** The double-watcher problem is a real bug that corrupts the MCP stdio channel. Leaving it unfixed would mean the daemon-console feature creates a new crash scenario (the MCP server writes watcher warnings to stderr while the daemon console is up, corrupting the JSON channel).
182
+
183
+ **Assumption that would invalidate this design:** If `MaxListenersExceededWarning` from `fs.watch` accumulation is NOT the actual mechanism for MCP stdio corruption, then Candidate A's simplicity wins. But the watcher accumulation is observable (each `mountConsoleRoutes()` call adds one watcher), so the fix is correct regardless.
184
+
185
+ ---
186
+
187
+ ## Open Questions for the Main Agent
188
+
189
+ 1. **Transport entry points inventory**: Are there transport entry points beyond `stdio-entry.ts`? Check `src/mcp/transports/` for all files that call `httpServer.start()`. Any missed entry point will silently drop console mounting.
190
+
191
+ 2. **Lock file name convention**: Use `daemon-console.lock` (new, explicit) or repurpose/extend `dashboard.lock` (existing, but conflates daemon console with MCP primary)? The latter avoids a new file but creates coupling between two different ownership concepts.
192
+
193
+ 3. **`timingRingBuffer` for daemon console**: The daemon doesn't create a `ToolCallTimingRingBuffer`. Pass `undefined` to `mountConsoleRoutes()` for this param. The `WORKRAIL_DEV=1` perf endpoint will be unavailable from the daemon console -- is this acceptable?
194
+
195
+ 4. **CORS for daemon console**: The standalone Express server in `startDaemonConsole()` should add CORS headers (same as `HttpServer.setupMiddleware()`). The `mountConsoleRoutes()` function doesn't add CORS itself -- it relies on the caller's Express app middleware.
@@ -0,0 +1,11 @@
1
+ # Data Model (ERD) for Native Context Management
2
+
3
+ > **Not pursuing**
4
+ >
5
+ > WorkRail is not planning to implement native context management.
6
+ >
7
+ > This file is kept only as a stable tombstone so old links do not break.
8
+ >
9
+ > See:
10
+ > - `docs/roadmap/legacy-planning-status.md`
11
+ > - `docs/plans/native-context-management-epic.md`
@@ -0,0 +1,98 @@
1
+ # Design Candidates: Consolidate WORKRAIL_DEV_STALENESS into WORKRAIL_DEV
2
+
3
+ ## Problem Understanding
4
+
5
+ **Goal:** Remove the separate `WORKRAIL_DEV_STALENESS` env var so that `WORKRAIL_DEV=1` controls all dev features, including staleness visibility for all workflow categories.
6
+
7
+ **Core tensions:**
8
+ 1. Module-load-time vs call-time evaluation -- `DEV_STALENESS` is evaluated once at module load; `isDevMode()` evaluates per call via DI. Not a real tension in practice because the flag doesn't change mid-request.
9
+ 2. Backward compat vs clean design -- `WORKRAIL_DEV_STALENESS` was never publicly documented and never allowed in `~/.workrail/config.json`. Dropping it is clean with essentially zero user impact.
10
+
11
+ **Likely seam:** The default parameter value of `shouldShowStaleness()` in `src/mcp/handlers/v2-workflow.ts` line 55. This is both where the symptom appears and the correct fix location.
12
+
13
+ **What makes it hard:** Nothing structurally hard. The main subtlety is that the replacement (`isDevMode()`) is a function, not a constant, so it must be in a default parameter (call-time) rather than a module-level assignment (load-time).
14
+
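The load-time versus call-time distinction can be shown with a simplified sketch. The `isDevMode()` here is a stand-in backed by a mutable variable, not the repo's DI-backed accessor, and `showAtLoad` / `showAtCall` are hypothetical reductions of `shouldShowStaleness` that drop the category argument.

```typescript
// Stand-in for the DI-backed accessor; the real isDevMode() lives elsewhere in the repo.
let devFlag = false;
function isDevMode(): boolean {
  return devFlag;
}

// Load-time pattern: the value is captured once, when this module is evaluated.
const DEV_AT_LOAD = isDevMode();
function showAtLoad(devMode: boolean = DEV_AT_LOAD): boolean {
  return devMode;
}

// Call-time pattern: a default parameter expression is re-evaluated
// on every call that omits the argument.
function showAtCall(devMode: boolean = isDevMode()): boolean {
  return devMode;
}

devFlag = true; // simulate the flag being resolved after module load

console.log(showAtLoad());      // false: still the load-time snapshot
console.log(showAtCall());      // true: reads the current value
console.log(showAtCall(false)); // false: an explicit argument always wins
```

The last call is why existing tests that pass `devMode` explicitly are unaffected by the change of default.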
15
+ ---
16
+
17
+ ## Philosophy Constraints
18
+
19
+ From `CLAUDE.md` and `AGENTS.md`:
20
+ - **Dependency injection for boundaries** -- use DI-injected flags, not raw `process.env` reads in handler code.
21
+ - **Determinism** -- call-time evaluation of `isDevMode()` is deterministic; the flag doesn't change during a request.
22
+ - **YAGNI** -- drop `WORKRAIL_DEV_STALENESS` entirely; no deprecated alias needed.
23
+ - **Document "why", not "what"** -- update JSDoc to explain the consolidation.
24
+
25
+ No conflicts between stated philosophy and repo patterns.
26
+
27
+ ---
28
+
29
+ ## Impact Surface
30
+
31
+ **Must stay consistent:**
32
+ - `shouldShowStaleness()` exported function signature -- stays the same (optional `devMode` param)
33
+ - `tests/unit/mcp/workflow-staleness.test.ts` -- tests pass `devMode` explicitly, no impact
34
+ - `buildV2WorkflowListItem` (line 528) and inspect handler (line 435) -- both call `shouldShowStaleness(visibility?.category)` with no second arg; they will now get `isDevMode()` as the default
35
+
36
+ **Docs requiring updates:**
37
+ - `src/config/feature-flags.ts` line 109: description says staleness is controlled by `WORKRAIL_DEV_STALENESS` separately
38
+ - `AGENTS.md` line 141: documents `WORKRAIL_DEV_STALENESS=1`
39
+ - `docs/authoring-v2.md` line 495: says `Set WORKRAIL_DEV_STALENESS=1`
40
+ - `docs/configuration.md` line 44: lists `WORKRAIL_DEV_STALENESS` as excluded key
41
+ - `src/config/config-file.ts` lines 10, 32: comments mention `WORKRAIL_DEV_STALENESS`
42
+
43
+ ---
44
+
45
+ ## Candidates
46
+
47
+ ### Candidate 1: Change default parameter to `isDevMode()` (recommended)
48
+
49
+ **Summary:** Remove the `DEV_STALENESS` const; change `shouldShowStaleness`'s default from `= DEV_STALENESS` to `= isDevMode()`; import `isDevMode`.
50
+
51
+ - **Tensions resolved:** Eliminates second env var; staleness now flows through DI like other dev features.
52
+ - **Accepts:** Tiny per-call DI resolution overhead (already done by other dev features; negligible).
53
+ - **Boundary:** Default parameter in `shouldShowStaleness` -- exactly where the flag is consumed.
54
+ - **Failure mode:** If `isDevMode()` is called in a context where DI isn't initialized. Mitigated by `dev-mode.ts` fallback to `process.env['WORKRAIL_DEV']`.
55
+ - **Repo pattern:** Follows -- perf timing and perf endpoint already use `isDevMode()` at call time.
56
+ - **Gain:** Single flag; DI-consistent; config-file compatible (`WORKRAIL_DEV` can be set in `~/.workrail/config.json`).
57
+ - **Give up:** `WORKRAIL_DEV_STALENESS` env var no longer works. Acceptable since it was never publicly documented.
58
+ - **Scope judgment:** best-fit.
59
+ - **Philosophy:** Honors DI-for-boundaries, determinism, YAGNI. No conflicts.
60
+
61
+ ### Candidate 2: Pass `isDevMode()` explicitly at each call site
62
+
63
+ **Summary:** Remove `DEV_STALENESS` const; pass `isDevMode()` as the second argument at each of the 2 call sites rather than via default parameter.
64
+
65
+ - **Tensions resolved:** Same runtime behavior as Candidate 1; slightly more explicit at call sites.
66
+ - **Accepts:** More code to maintain; must update 2 call sites instead of 1 location.
67
+ - **Failure mode:** Easy to miss a call site if more are added later.
68
+ - **Scope judgment:** Slightly too verbose -- 2 call sites, no benefit over Candidate 1.
69
+ - **Philosophy:** Marginally better on "explicit over implicit" but outweighed by the existing default-parameter design.
70
+
71
+ ---
72
+
73
+ ## Comparison and Recommendation
74
+
75
+ Both candidates produce identical runtime behavior. Candidate 1 is preferred:
76
+ - Touches fewer lines (1 location vs 3)
77
+ - Preserves the existing public API of `shouldShowStaleness` (tests already pass explicit values)
78
+ - Follows the established pattern in the file
79
+
80
+ **Recommendation: Candidate 1**
81
+
82
+ ---
83
+
84
+ ## Self-Critique
85
+
86
+ **Strongest counter-argument:** Candidate 2 makes the dev flag visible at the call site, which is slightly more transparent. With only 2 call sites, this would be readable. However, the existing default-parameter design was deliberately chosen for testability, and Candidate 1 respects that invariant.
87
+
88
+ **Narrower option that could work:** Only update the source code but leave the docs unchanged. Would still work but creates documentation drift -- rejected.
89
+
90
+ **Broader option:** Inject `devMode` into the handler via `ToolContext` and thread it through. This would be a larger architectural change with no benefit for a single boolean flag that already has a clean accessor via `isDevMode()`.
91
+
92
+ **Assumption that would invalidate:** If `isDevMode()` DI resolution became unreliable or slow. Currently it's a simple map lookup; this assumption is safe.
93
+
94
+ ---
95
+
96
+ ## Open Questions
97
+
98
+ None. The path is clear.
@@ -0,0 +1,80 @@
1
+ # Design Candidates: Walk Cache, Depth Limit, Skip Dirs, Graceful Degradation
2
+
3
+ ## Problem Understanding
4
+
5
+ ### Tensions
6
+ 1. **Walk cancellation vs simplicity**: `withTimeout` races the promise but cannot cancel the Node.js fs walk. The walk continues in the background. Subsequent calls within the 30s TTL window will hit the cache (acceptable tradeoff; must be documented).
7
+ 2. **Cache freshness vs staleness window**: A 30s TTL means a `.workrail` dir created inside a remembered root within that window returns stale (missing) results. Spec explicitly accepts this.
8
+ 3. **Module-level mutable state vs testability**: The cache is process-global. `clearWalkCacheForTesting()` is the escape hatch for tests.
9
+ 4. **Graceful degradation vs error visibility**: `listRememberedRoots` currently throws, which is actually LESS visible (propagates uncaught). Changing to return `[]` + log is more operator-visible.
10
+
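The cache tensions above can be sketched with a small process-global TTL map. Only the 30s TTL and the `clearWalkCacheForTesting()` name come from the text; `cachedWalk`, `walkCache`, and the injectable `now` parameter are assumptions, and the real walk is presumably asynchronous whereas this sketch is synchronous for brevity.

```typescript
// Hypothetical names except clearWalkCacheForTesting and the 30s TTL.
const WALK_CACHE_TTL_MS = 30 * 1000;

interface CacheEntry {
  paths: readonly string[];
  expiresAt: number;
}

// Process-global mutable state: the accepted tradeoff discussed above.
const walkCache = new Map<string, CacheEntry>();

// Return cached paths for `root` while the entry is fresh; otherwise run `walk` and store its result.
// `now` is injectable so the TTL boundary can be exercised without real waiting.
function cachedWalk(
  root: string,
  walk: (root: string) => string[],
  now: number = Date.now(),
): readonly string[] {
  const hit = walkCache.get(root);
  if (hit !== undefined && now < hit.expiresAt) return hit.paths;
  const paths = walk(root);
  walkCache.set(root, { paths, expiresAt: now + WALK_CACHE_TTL_MS });
  return paths;
}

// Test-only escape hatch for the process-global state.
function clearWalkCacheForTesting(): void {
  walkCache.clear();
}
```

A `.workrail` directory created inside `root` during the TTL window is invisible until the entry expires, which is the stale window the spec accepts.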
11
+ ### Likely Seam
12
+ All changes are at the correct seams: `request-workflow-reader.ts` owns walk/discovery/cache logic; `v2-workflow.ts` owns handler error boundaries.
13
+
14
+ ### What Makes This Hard
15
+ - **Depth guard placement**: The spec says "A `.workrail` entry at depth 5 must still be discoverable -- the limit stops recursing into children, not reading the current directory's entries." If the guard is placed at the TOP of `walkForRootedWorkflowDirectories` (`if (depth >= MAX_WALK_DEPTH) return`), then `.workrail` at depth 5 is missed (the function returns before reading any entries). The correct placement is INSIDE the loop, AFTER the `.workrail` check, BEFORE the recursive call:
16
+ ```typescript
17
+ if (entry.name === '.workrail') { ... continue; }
18
+ if (depth >= MAX_WALK_DEPTH) { /* log if dev mode */ continue; }
19
+ await walkForRootedWorkflowDirectories(entryPath, discoveredPaths, depth + 1);
20
+ ```
21
+
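The difference between the two guard placements can be demonstrated on an in-memory tree. This is a sketch under assumptions: the `Dir` type stands in for real filesystem entries, the root is walked at depth 0, `MAX_WALK_DEPTH` is taken to be 5, and both function names are hypothetical.

```typescript
// In-memory stand-in for the filesystem: each key is a directory name.
type Dir = { [name: string]: Dir };
const MAX_WALK_DEPTH = 5;

// Guard inside the loop, after the `.workrail` check:
// entries of the current directory are always read.
function walkGuardInLoop(dir: Dir, path: string, found: string[], depth = 0): void {
  for (const [name, child] of Object.entries(dir)) {
    const entryPath = `${path}/${name}`;
    if (name === ".workrail") { found.push(entryPath); continue; }
    if (depth >= MAX_WALK_DEPTH) continue; // stop descending, keep scanning siblings
    walkGuardInLoop(child, entryPath, found, depth + 1);
  }
}

// Guard at the top: a directory at the limit returns before its entries are read.
function walkGuardAtTop(dir: Dir, path: string, found: string[], depth = 0): void {
  if (depth >= MAX_WALK_DEPTH) return;
  for (const [name, child] of Object.entries(dir)) {
    const entryPath = `${path}/${name}`;
    if (name === ".workrail") { found.push(entryPath); continue; }
    walkGuardAtTop(child, entryPath, found, depth + 1);
  }
}

// root (depth 0) -> a -> b -> c -> d -> e (depth 5).
// `e` holds a `.workrail` entry plus a deeper directory `f` with its own `.workrail`.
const tree: Dir = { a: { b: { c: { d: { e: { ".workrail": {}, f: { ".workrail": {} } } } } } } };

const inLoop: string[] = [];
walkGuardInLoop(tree, "", inLoop);
const atTop: string[] = [];
walkGuardAtTop(tree, "", atTop);

console.log(inLoop); // ["/a/b/c/d/e/.workrail"]
console.log(atTop);  // []
```

The in-loop variant finds the depth-5 entry and still refuses to descend into `f`, while the top-of-function variant misses both.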
22
+ ## Philosophy Constraints
23
+ - **Errors are data**: `listRememberedRoots` must return `[]` instead of throwing (matching `listManagedSourceRecords` pattern)
24
+ - **Immutability by default**: `SKIP_DIRS` is a `const Set` (immutable after module load)
25
+ - **Document why not what**: Comments required on cache TTL tradeoff, non-cancelling timeout, why `listRememberedRoots` returns instead of throwing
26
+ - **Validate at boundaries**: Timeout and cache are boundary concerns in `createWorkflowReaderForRequest`
27
+ - **Module-level mutable Map**: Technically violates "immutability by default" but is the correct tradeoff for a process-global TTL cache; mutation is minimal and confined
28
+
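The errors-are-data constraint for `listRememberedRoots` could look roughly like the wrapper below. It is a sketch only: `listRootsOrEmpty`, the `load` callback, and the log message format are hypothetical, and the real function is likely asynchronous and backed by a store that this document does not show.

```typescript
// Hypothetical wrapper; the real listRememberedRoots and its storage layer are not shown here.
function listRootsOrEmpty(
  load: () => string[],
  log: (message: string) => void,
): string[] {
  try {
    return load();
  } catch (e) {
    // Why return instead of throw: an unreadable roots store should degrade
    // discovery to "no remembered roots", not fail the whole request.
    const reason = e instanceof Error ? e.message : String(e);
    log(`remembered roots unavailable: ${reason}`);
    return [];
  }
}
```

The logger is passed in rather than hard-coded so that the caller decides where operator-visible output goes.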
29
+ ## Impact Surface
30
+ - Existing tests for `discoverRootedWorkflowDirectories` use unique `fs.mkdtempSync` paths so they will always miss the cache (no cross-contamination)
31
+ - New cache tests must call `clearWalkCacheForTesting()` in `afterEach`
32
+ - Both `handleV2ListWorkflows` and `handleV2InspectWorkflow` in `v2-workflow.ts` have the same `createWorkflowReaderForRequest` bare-await pattern -- both need fixing
33
+
34
+ ## Candidates
35
+
36
+ ### Candidate 1: Implement as Specified (only real candidate)
37
+
38
+ **Summary:** Apply all 10 changes exactly as specified, following established patterns already present in the file.
39
+
40
+ **Tensions resolved:**
41
+ - Walk cancellation: documented with comment, accepted as by-design
42
+ - Cache freshness: 30s TTL with explicit stale-window comment
43
+ - Error visibility: `listRememberedRoots` logs and returns `[]`
44
+ - Handler error boundary: try/catch returning `errNotRetryable`
45
+
46
+ **Boundary:** `request-workflow-reader.ts` for walk/discovery/cache; `v2-workflow.ts` for handler errors.
47
+
48
+ **Failure mode:** Depth guard placement. If placed at top of function, `.workrail` at depth 5 is missed. Must be inside the loop after `.workrail` check.
49
+
50
+ **Repo pattern:** Follows. `listManagedSourceRecords` already does graceful `{ records: [], storeError }`. `withTimeout` already in `v2-workflow.ts`. `errNotRetryable` used throughout.
51
+
52
+ **Gains:** Bounded walk, process-level caching, handlers never throw unexpectedly, operator-visible walk errors.
53
+
54
+ **Losses:** Module-level mutable state (acceptable).
55
+
56
+ **Scope:** Best-fit. Changes confined to two specified files plus test file.
57
+
58
+ **Philosophy:** Honors errors-as-data, immutability for the Set, document-why, validate-at-boundaries. Minor conflict with immutability for the module-level Map (accepted, documented).
59
+
60
+ ## Comparison and Recommendation
61
+
62
+ Only one real candidate. The spec fully prescribes the design -- data structures (Map with TTL, Set for skip dirs), function signatures (depth param), and behavior (graceful degradation). No architectural choice needed.
63
+
64
+ **Recommendation:** Implement as specified.
65
+
66
+ ## Self-Critique
67
+
68
+ **Strongest counter-argument:** The module-level Map cache could instead be passed as a dependency injection parameter to `discoverRootedWorkflowDirectories`. This would be stricter about "immutability by default". It loses because: no repo precedent for injecting caches, the cache is a pure optimization, `clearWalkCacheForTesting()` provides sufficient testability.
69
+
70
+ **Narrower option:** Just add depth limit + skip dirs, skip the cache and timeout. Loses because the spec requires both.
71
+
72
+ **Broader option:** Make TTL configurable via env var. Not justified -- YAGNI applies, no evidence of multiple TTL requirements.
73
+
74
+ **Invalidating assumption:** If `discoverRootedWorkflowDirectories` were called from multiple threads simultaneously, the module-level Map would need synchronization. Node.js runs JavaScript on a single thread per isolate (worker threads load their own copy of the module), so this does not apply.
75
+
76
+ **Pivot condition:** If depth-5 boundary test fails, the guard is likely placed at the top of the function instead of inside the loop.
77
+
78
+ ## Open Questions for the Main Agent
79
+
80
+ None -- the spec is complete and the implementation path is clear.