@exaudeus/workrail 3.28.0 → 3.30.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (160)
  1. package/dist/console/assets/{index-C146q2kN.js → index-Bl5-Ghuu.js} +1 -1
  2. package/dist/console/index.html +1 -1
  3. package/dist/manifest.json +3 -3
  4. package/docs/README.md +57 -0
  5. package/docs/adrs/001-hybrid-storage-backend.md +38 -0
  6. package/docs/adrs/002-four-layer-context-classification.md +38 -0
  7. package/docs/adrs/003-checkpoint-trigger-strategy.md +35 -0
  8. package/docs/adrs/004-opt-in-encryption-strategy.md +36 -0
  9. package/docs/adrs/005-agent-first-workflow-execution-tokens.md +105 -0
  10. package/docs/adrs/006-append-only-session-run-event-log.md +76 -0
  11. package/docs/adrs/007-resume-and-checkpoint-only-sessions.md +51 -0
  12. package/docs/adrs/008-blocked-nodes-architectural-upgrade.md +178 -0
  13. package/docs/adrs/009-bridge-mode-single-instance-mcp.md +195 -0
  14. package/docs/adrs/010-release-pipeline.md +89 -0
  15. package/docs/architecture/README.md +7 -0
  16. package/docs/architecture/refactor-audit.md +364 -0
  17. package/docs/authoring-v2.md +527 -0
  18. package/docs/authoring.md +873 -0
  19. package/docs/changelog-recent.md +201 -0
  20. package/docs/configuration.md +505 -0
  21. package/docs/ctc-mcp-proposal.md +518 -0
  22. package/docs/design/README.md +22 -0
  23. package/docs/design/agent-cascade-protocol.md +96 -0
  24. package/docs/design/autonomous-console-design-candidates.md +253 -0
  25. package/docs/design/autonomous-console-design-review.md +111 -0
  26. package/docs/design/autonomous-platform-mvp-discovery.md +525 -0
  27. package/docs/design/claude-code-source-deep-dive.md +713 -0
  28. package/docs/design/console-cyberpunk-ui-discovery.md +504 -0
  29. package/docs/design/console-execution-trace-candidates-final.md +160 -0
  30. package/docs/design/console-execution-trace-candidates.md +211 -0
  31. package/docs/design/console-execution-trace-design-candidates-v2.md +113 -0
  32. package/docs/design/console-execution-trace-design-review.md +74 -0
  33. package/docs/design/console-execution-trace-discovery.md +394 -0
  34. package/docs/design/console-execution-trace-final-review.md +77 -0
  35. package/docs/design/console-execution-trace-review.md +92 -0
  36. package/docs/design/console-performance-discovery.md +415 -0
  37. package/docs/design/console-ui-backlog.md +280 -0
  38. package/docs/design/daemon-architecture-discovery.md +853 -0
  39. package/docs/design/daemon-design-candidates.md +318 -0
  40. package/docs/design/daemon-design-review-findings.md +119 -0
  41. package/docs/design/daemon-engine-design-candidates.md +210 -0
  42. package/docs/design/daemon-engine-design-review.md +131 -0
  43. package/docs/design/daemon-execution-engine-discovery.md +280 -0
  44. package/docs/design/daemon-gap-analysis.md +554 -0
  45. package/docs/design/daemon-owns-console-plan.md +168 -0
  46. package/docs/design/daemon-owns-console-review.md +91 -0
  47. package/docs/design/daemon-owns-console.md +195 -0
  48. package/docs/design/data-model-erd.md +11 -0
  49. package/docs/design/design-candidates-consolidate-dev-staleness.md +98 -0
  50. package/docs/design/design-candidates-walk-cache-depth-limit.md +80 -0
  51. package/docs/design/design-review-consolidate-dev-staleness.md +54 -0
  52. package/docs/design/design-review-walk-cache-depth-limit.md +48 -0
  53. package/docs/design/implementation-plan-consolidate-dev-staleness.md +142 -0
  54. package/docs/design/implementation-plan-walk-cache-depth-limit.md +141 -0
  55. package/docs/design/layer3b-ghost-nodes-design-candidates.md +229 -0
  56. package/docs/design/layer3b-ghost-nodes-design-review.md +93 -0
  57. package/docs/design/layer3b-ghost-nodes-implementation-plan.md +219 -0
  58. package/docs/design/list-workflows-latency-fix-plan.md +128 -0
  59. package/docs/design/list-workflows-latency-fix-review.md +55 -0
  60. package/docs/design/list-workflows-latency-fix.md +109 -0
  61. package/docs/design/native-context-management-api.md +11 -0
  62. package/docs/design/performance-sweep-2026-04.md +96 -0
  63. package/docs/design/routines-guide.md +219 -0
  64. package/docs/design/sequence-diagrams.md +11 -0
  65. package/docs/design/subagent-design-principles.md +220 -0
  66. package/docs/design/temporal-patterns-design-candidates.md +312 -0
  67. package/docs/design/temporal-patterns-design-review-findings.md +163 -0
  68. package/docs/design/test-isolation-from-config-file.md +335 -0
  69. package/docs/design/v2-core-design-locks.md +2746 -0
  70. package/docs/design/v2-lock-registry.json +734 -0
  71. package/docs/design/workflow-authoring-v2.md +1044 -0
  72. package/docs/design/workflow-docs-spec.md +218 -0
  73. package/docs/design/workflow-extension-points.md +687 -0
  74. package/docs/design/workrail-auto-trigger-system.md +359 -0
  75. package/docs/design/workrail-config-file-discovery.md +513 -0
  76. package/docs/docker.md +110 -0
  77. package/docs/generated/v2-lock-closure-plan.md +26 -0
  78. package/docs/generated/v2-lock-coverage.json +797 -0
  79. package/docs/generated/v2-lock-coverage.md +177 -0
  80. package/docs/ideas/backlog.md +3927 -0
  81. package/docs/ideas/design-candidates-mcp-resilience.md +208 -0
  82. package/docs/ideas/design-review-findings-mcp-resilience.md +119 -0
  83. package/docs/ideas/implementation_plan.md +249 -0
  84. package/docs/ideas/third-party-workflow-setup-design-thinking.md +1948 -0
  85. package/docs/implementation/02-architecture.md +316 -0
  86. package/docs/implementation/04-testing-strategy.md +124 -0
  87. package/docs/implementation/09-simple-workflow-guide.md +835 -0
  88. package/docs/implementation/13-advanced-validation-guide.md +874 -0
  89. package/docs/implementation/README.md +21 -0
  90. package/docs/integrations/claude-code.md +300 -0
  91. package/docs/integrations/firebender.md +315 -0
  92. package/docs/migration/v0.1.0.md +147 -0
  93. package/docs/naming-conventions.md +45 -0
  94. package/docs/planning/README.md +104 -0
  95. package/docs/planning/github-ticketing-playbook.md +195 -0
  96. package/docs/plans/README.md +24 -0
  97. package/docs/plans/agent-managed-ticketing-design.md +605 -0
  98. package/docs/plans/agentic-orchestration-roadmap.md +112 -0
  99. package/docs/plans/assessment-gates-engine-handoff.md +536 -0
  100. package/docs/plans/content-coherence-and-references.md +151 -0
  101. package/docs/plans/library-extraction-plan.md +340 -0
  102. package/docs/plans/mr-review-workflow-redesign.md +1451 -0
  103. package/docs/plans/native-context-management-epic.md +11 -0
  104. package/docs/plans/perf-fixes-design-candidates.md +225 -0
  105. package/docs/plans/perf-fixes-design-review-findings.md +61 -0
  106. package/docs/plans/perf-fixes-new-issues-candidates.md +264 -0
  107. package/docs/plans/perf-fixes-new-issues-review.md +110 -0
  108. package/docs/plans/prompt-fragments.md +53 -0
  109. package/docs/plans/ui-ux-workflow-design-candidates.md +120 -0
  110. package/docs/plans/ui-ux-workflow-discovery.md +100 -0
  111. package/docs/plans/ui-ux-workflow-review.md +48 -0
  112. package/docs/plans/v2-followup-enhancements.md +587 -0
  113. package/docs/plans/workflow-categories-candidates.md +105 -0
  114. package/docs/plans/workflow-categories-discovery.md +110 -0
  115. package/docs/plans/workflow-categories-review.md +51 -0
  116. package/docs/plans/workflow-discovery-model-candidates.md +94 -0
  117. package/docs/plans/workflow-discovery-model-discovery.md +74 -0
  118. package/docs/plans/workflow-discovery-model-review.md +48 -0
  119. package/docs/plans/workflow-source-setup-phase-1.md +245 -0
  120. package/docs/plans/workflow-source-setup-phase-2.md +361 -0
  121. package/docs/plans/workflow-staleness-detection-candidates.md +104 -0
  122. package/docs/plans/workflow-staleness-detection-review.md +58 -0
  123. package/docs/plans/workflow-staleness-detection.md +80 -0
  124. package/docs/plans/workflow-v2-design.md +69 -0
  125. package/docs/plans/workflow-v2-roadmap.md +74 -0
  126. package/docs/plans/workflow-validation-design.md +98 -0
  127. package/docs/plans/workflow-validation-roadmap.md +108 -0
  128. package/docs/plans/workrail-platform-vision.md +420 -0
  129. package/docs/reference/agent-context-cleaner-snippet.md +94 -0
  130. package/docs/reference/agent-context-guidance.md +140 -0
  131. package/docs/reference/context-optimization.md +284 -0
  132. package/docs/reference/example-workflow-repository-template/.github/workflows/validate.yml +125 -0
  133. package/docs/reference/example-workflow-repository-template/README.md +268 -0
  134. package/docs/reference/example-workflow-repository-template/workflows/example-workflow.json +80 -0
  135. package/docs/reference/external-workflow-repositories.md +916 -0
  136. package/docs/reference/feature-flags-architecture.md +472 -0
  137. package/docs/reference/feature-flags.md +349 -0
  138. package/docs/reference/god-tier-workflow-validation.md +272 -0
  139. package/docs/reference/loop-optimization.md +209 -0
  140. package/docs/reference/loop-validation.md +176 -0
  141. package/docs/reference/loops.md +465 -0
  142. package/docs/reference/mcp-platform-constraints.md +59 -0
  143. package/docs/reference/recovery.md +88 -0
  144. package/docs/reference/releases.md +177 -0
  145. package/docs/reference/troubleshooting.md +105 -0
  146. package/docs/reference/workflow-execution-contract.md +998 -0
  147. package/docs/roadmap/README.md +22 -0
  148. package/docs/roadmap/legacy-planning-status.md +103 -0
  149. package/docs/roadmap/now-next-later.md +70 -0
  150. package/docs/roadmap/open-work-inventory.md +389 -0
  151. package/docs/tickets/README.md +39 -0
  152. package/docs/tickets/next-up.md +76 -0
  153. package/docs/workflow-management.md +317 -0
  154. package/docs/workflow-templates.md +423 -0
  155. package/docs/workflow-validation.md +184 -0
  156. package/docs/workflows.md +254 -0
  157. package/package.json +4 -1
  158. package/spec/authoring-spec.json +61 -16
  159. package/workflows/workflow-for-workflows.json +3 -3
  160. package/workflows/workflow-for-workflows.v2.json +3 -3
package/docs/design/daemon-design-candidates.md
@@ -0,0 +1,318 @@
+ # WorkRail Daemon Architecture: Design Candidates
+
+ > Raw investigative material for main-agent synthesis. Not a final decision.
+ > Generated: 2026-04-14.
+
+ ---
+
+ ## Problem Understanding
+
+ ### Core tensions
+
+ 1. **Single-process simplicity vs. concurrent-session correctness.**
+ The `engineActive` guard exists because the DI container is a global singleton.
+ Sequential sessions (one at a time) are safe but not scalable. Concurrent sessions
+ require either exposing a single shared engine instance (same process) or using
+ separate processes (isolated engines). These have different deployment implications.
+
+ 2. **Direct handler calls vs. MCP tool protocol portability.**
+ `engine-factory.ts` calls `executeStartWorkflow` / `executeContinueWorkflow` directly,
+ bypassing JSON-RPC. This is faster and already built, but it couples the daemon to
+ the internal handler API. Changes to handler signatures require daemon changes.
+ The MCP protocol layer is what makes callers swappable.
+
+ 3. **Freestanding vs. dependency-rich agent loop.**
+ `npx -y @exaudeus/workrail` portability is a core feature. Adding pi-mono as a
+ dependency doubles the surface area. But building a custom agent loop from scratch
+ converges on the same patterns.
+
+ 4. **Self-enforced trust model.**
+ In autonomous mode, the daemon is both driver and enforced entity. The HMAC token
+ protocol prevents token forgery, but the daemon can choose not to call `continueWorkflow`
+ at all. Enforcement integrity relies on the daemon being well-behaved.
+
+ ### Likely seam
+
+ The seam between the engine and the agent loop is clean and already designed:
+ - Engine produces: `{ pending: { prompt: string, stepId, title } }`
+ - Agent loop consumes: the prompt, calls LLM, executes tool calls, returns
+ `{ notesMarkdown: string, context: Record<string, unknown> }`
+ - Engine accepts: `continueWorkflow(stateToken, ackToken, output, context)`
+
+ The daemon's job is to close this loop. The seam is at `pending.prompt` going in and
+ `{ notesMarkdown, context }` coming out.
+
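+ A minimal TypeScript sketch of this seam, assuming illustrative names for
+ everything the bullets do not pin down (`PendingStep`, `StepOutput`, `runAgentLoop`):
+
+ ```typescript
+ // Shapes taken from the seam description above.
+ interface PendingStep {
+   prompt: string;
+   stepId: string;
+   title: string;
+ }
+
+ interface StepOutput {
+   notesMarkdown: string;
+   context: Record<string, unknown>;
+ }
+
+ // The loop the daemon must close. `runAgentLoop` (LLM + tool calls) and the
+ // `continueWorkflow` parameter shape are sketches, not the shipped engine API.
+ async function driveStep(
+   pending: PendingStep,
+   runAgentLoop: (prompt: string) => Promise<StepOutput>,
+   continueWorkflow: (
+     stateToken: string,
+     ackToken: string,
+     output: string,
+     context: Record<string, unknown>,
+   ) => Promise<unknown>,
+   stateToken: string,
+   ackToken: string,
+ ): Promise<unknown> {
+   const result = await runAgentLoop(pending.prompt); // prompt goes in
+   return continueWorkflow(stateToken, ackToken, result.notesMarkdown, result.context); // output comes out
+ }
+ ```
+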
+ ### What makes this hard (junior developer blind spots)
+
+ 1. **Token durability across the agent loop.** The `continueToken` must survive process
+ crashes between LLM calls. If the process crashes after the LLM responds but before
+ `continueWorkflow` is called, the step is re-attempted on next start. The dedup key
+ system (`advance_recorded:sessionId:nodeId:attemptId`) handles this correctly, but
+ the daemon must persist the token durably.
+
+ 2. **Tool call routing within the agent loop.** LLM responses contain tool calls that
+ must be executed and results returned to the LLM BEFORE the LLM produces the final
+ `notesMarkdown` output to feed to `continueWorkflow`. This is the `agentLoop` pattern
+ from pi-mono -- it is a multi-turn LLM loop, not a single call.
+
+ 3. **Session lifecycle vs. agent context lifecycle.** A WorkRail session is durable (event
+ log, tokens, steps). An LLM context window is ephemeral (messages, tool results). The
+ daemon must coordinate these two lifecycles: the WorkRail session survives context
+ compaction; the LLM context window does not.
+
+ ---
+
+ ## Philosophy Constraints
+
+ From CLAUDE.md (system instructions), confirmed by code patterns:
+
+ - **Errors are data (neverthrow ResultAsync):** All handler code uses `RA` (ResultAsync)
+ chains. Daemon code MUST follow the same pattern. No try/catch in the agent loop.
+ - **Branded types:** Daemon config must use branded types for credentials
+ (`AnthropicApiKey`, `GitLabToken`, etc.), not primitive strings.
+ - **DI for boundaries:** The agent loop (LLM caller) MUST be injected as an
+ `AgentLoopPort`, not hardcoded to the Anthropic SDK. This makes the LLM provider swappable.
+ - **Exhaustive discriminated unions:** The `DaemonConfig` process-boundary choice should be
+ a discriminated union, not a boolean flag.
+ - **YAGNI with discipline:** The REST control plane is not speculative -- it is required
+ for human oversight in the changed trust model. The `engineActive` guard change is not
+ speculative -- it is required for concurrent sessions.
+
+ **Philosophy conflicts: none.** The codebase exactly embodies the CLAUDE.md principles.
+ No conflict between stated philosophy and repo patterns.
+
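+ As a hedged illustration of the branded-types and discriminated-union constraints
+ together (the member names here are assumptions, not the shipped types):
+
+ ```typescript
+ // Branded credentials: a plain string cannot be passed where a key is expected.
+ type AnthropicApiKey = string & { readonly __brand: 'AnthropicApiKey' };
+ type GitLabToken = string & { readonly __brand: 'GitLabToken' };
+
+ // Process-boundary choice as a discriminated union, not a boolean flag.
+ type DaemonConfig =
+   | { mode: 'same-process'; maxConcurrentSessions: number }
+   | { mode: 'separate-process'; controlPort: number; sharedDataDir: string };
+
+ function describeBoundary(config: DaemonConfig): string {
+   // Switching on the discriminant lets the compiler check exhaustiveness.
+   switch (config.mode) {
+     case 'same-process':
+       return `shared engine, up to ${config.maxConcurrentSessions} concurrent sessions`;
+     case 'separate-process':
+       return `own engine, control plane on port ${config.controlPort}`;
+   }
+ }
+ ```
+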
+ ---
+
+ ## Impact Surface
+
+ ### What must stay consistent if the daemon is added
+
+ - `engine-factory.ts`: the `engineActive` guard and `createWorkRailEngine()` API. The
+ daemon must not require breaking changes to this API.
+ - `src/mcp-server.ts`: the MCP server entry point must be unchanged. Existing Claude
+ Code / Cursor users must see no difference.
+ - The session store append-only invariant: daemon sessions write to the same store as
+ MCP sessions. The lock protocol must hold under concurrent access.
+ - The token protocol: daemon-initiated sessions produce HMAC tokens identical to
+ MCP-initiated sessions. The console must not distinguish them.
+ - Existing workflows: zero changes required. The daemon reads `pending.prompt` from the
+ workflow step; the workflow does not know who is driving.
+
+ ### Nearby callers and consumers
+
+ - Console (`console/`) -- reads session history from the same store. If daemon sessions
+ are added, they appear in the session list immediately (no console changes needed for
+ basic visibility; REST control plane needed for live view).
+ - `src/engine/index.ts` -- the library export surface. If `createWorkRailEngine()` is
+ changed, consumers of the engine library must be updated.
+ - `src/di/container.ts` -- the DI container. Any change to the `engineActive` guard
+ touches the container's initialization path.
+
+ ---
+
+ ## Candidates
+
+ ### Candidate 1: Minimal Sequential Daemon
+
+ **Summary:** `src/daemon/entry.ts` calls `createWorkRailEngine()`, runs one session at a
+ time via FIFO queue, drives the agent loop with direct Anthropic SDK calls, exits after
+ each session.
+
+ **Tensions resolved:** Simplicity; single-process; `engineActive` guard avoided via queue.
+ **Tensions accepted:** No concurrent sessions; no live view; no human override.
+ **Boundary:** `src/daemon/` only. No changes to engine, MCP server, or console.
+ **Why this boundary:** The minimum viable boundary that proves autonomous execution works.
+ **Failure mode:** Session throughput bottleneck. If sessions are 30-60 min each, the
+ queue grows unbounded. Acceptable for MVP; unacceptable for production.
+ **Repo pattern:** Directly follows `engine-factory.ts`. The engine was built for this.
+ **Gain:** Ships fast, proves the agent loop concept, zero architectural risk.
+ **Give up:** No concurrency, no live view, no human override.
+ **Scope:** Too narrow for 12-month platform. Best-fit for 3-month proof-of-concept.
+ **Philosophy:** Perfect fit. YAGNI applied correctly for a proof-of-concept target.
+
+ ---
+
+ ### Candidate 2: Pure MCP Client Daemon
+
+ **Summary:** A separate `workrail-daemon` process connects to the running WorkRail MCP
+ server over HTTP, calls `start_workflow` / `continue_workflow` via JSON-RPC, with no
+ direct engine access.
+
+ **Concrete shape:**
+ - `packages/daemon/src/mcp-client.ts` -- `call(toolName, input)` over HTTP JSON-RPC
+ - `packages/daemon/src/trigger/` -- trigger listeners
+ - `packages/daemon/src/agent-loop/` -- same structure as C1 but calls MCP client
+ - Deployment: two Docker services (MCP server + daemon)
+
+ **Tensions resolved:** Clean process boundary; no `engineActive` concern; maximally
+ decoupled; crash isolation; deployable anywhere.
+ **Tensions accepted:** JSON-RPC overhead per step; two-process local dev; MCP server is
+ single point of failure for both human and autonomous sessions.
+ **Boundary:** Separate package/process. MCP HTTP transport is the interface.
+ **Why this boundary:** The MCP protocol is the stable public interface. Calling it from
+ the daemon ensures the daemon is never coupled to handler internals.
+ **Failure mode:** MCP server crash stops both Claude Code users and autonomous sessions.
+ **Repo pattern:** Departs from `engine-factory.ts`. Treats the MCP server as a black box.
+ **Gain:** Maximum decoupling; daemon can be any language; natural cloud model.
+ **Give up:** Two-process deployment; HTTP overhead; MCP server as prerequisite.
+ **Scope:** Best-fit for 18-24 month distributed cloud. Too broad for MVP.
+ **Philosophy:** Honors DI at the process level (ultimate boundary). Mild YAGNI conflict.
+
+ ---
+
+ ### Candidate 3: Composite Same-Process (recommended)
+
+ **Summary:** `src/daemon/` calls the engine via a shared instance (not two separate
+ `createWorkRailEngine()` calls), with concurrent sessions managed by `DaemonSessionManager`,
+ and a thin REST/SSE control plane added to the existing HTTP server.
+
+ **Concrete shape:**
+ - `src/engine/engine-factory.ts` change: instead of a boolean `engineActive` guard, expose
+ a `getSharedEngine(config): WorkRailEngine` that creates the engine once and returns the
+ same instance to all callers (MCP server entry + daemon entry). The guard becomes:
+ "container initialized: yes/no" rather than "engine in use: yes/no."
+ - `src/daemon/session-manager.ts` -- `DaemonSessionManager`: `Map<SessionId, DaemonSession>`,
+ each session running as an independent `Promise` chain with its own `continueToken`.
+ `DaemonSession = { continueToken: string; status: 'running' | 'paused' | 'complete' | 'failed'; abortController: AbortController }`
+ (see the sketch after this list).
+ - `src/daemon/agent-loop/step-runner.ts` -- `runStep(pending: PendingStep, toolExecutor, llmPort): Promise<StepOutput>`.
+ Multi-turn loop: call LLM -> execute tool calls -> return to LLM -> repeat until
+ LLM produces `continueWorkflow` output. `StepOutput = { notesMarkdown: string; context: Record<string, unknown> }`
+ - `src/daemon/trigger/gitlab-webhook.ts` -- HTTP listener, parses MR opened events.
+ - `src/daemon/tool-executor/local.ts` -- `Bash(cmd: string): RA<string, ToolError>`,
+ `Read(path: string): RA<string, ToolError>`, `Write(path, content): RA<void, ToolError>`.
+ Also `BashInRepo(repo: string, cmd: string)` for cross-repo routing.
+ - REST additions to the existing HTTP server:
+   - `GET /api/v2/sessions/:id/daemon-status` -- `{ status, currentStepTitle, startedAt }`
+   - `POST /api/v2/sessions/:id/pause`
+   - `POST /api/v2/sessions/:id/resume`
+   - `DELETE /api/v2/sessions/:id` (cancel + abort LLM call)
+
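+ A hedged sketch of the `DaemonSessionManager` shape described above (method names
+ and the `SessionId` alias are assumptions for illustration):
+
+ ```typescript
+ type SessionId = string; // assumed alias
+
+ interface DaemonSession {
+   continueToken: string;
+   status: 'running' | 'paused' | 'complete' | 'failed';
+   abortController: AbortController;
+ }
+
+ class DaemonSessionManager {
+   private readonly sessions = new Map<SessionId, DaemonSession>();
+
+   start(id: SessionId, continueToken: string): DaemonSession {
+     const session: DaemonSession = {
+       continueToken,
+       status: 'running',
+       abortController: new AbortController(),
+     };
+     this.sessions.set(id, session);
+     return session;
+   }
+
+   // Backs DELETE /api/v2/sessions/:id -- aborts the in-flight LLM call.
+   cancel(id: SessionId): void {
+     const session = this.sessions.get(id);
+     if (!session) return;
+     session.abortController.abort();
+     session.status = 'failed';
+   }
+ }
+ ```
+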
+ **Tensions resolved:** Single deployment; concurrent sessions; live view; human override;
+ `engineActive` solved by shared instance (not relaxed); correct enforcement (HMAC identical
+ for daemon and manual sessions).
+ **Tensions accepted:** Shared process means shared failure domain (daemon crash affects MCP
+ server -- use process supervisor to restart).
+ **Boundary:** Same process; new `src/daemon/` module; minor HTTP server additions.
+ **Why this boundary:** The 12-month success criteria require concurrent sessions AND live
+ view AND single deployment. This is the only candidate that satisfies all three.
+ **Failure mode:** If the daemon's agent loop has a bug that crashes the process, the MCP
+ server also goes down. Mitigation: crash isolation within the daemon (try/catch at the
+ session manager boundary, not inside handlers) and a process supervisor.
+ **Repo pattern:** Directly adapts `engine-factory.ts`. Requires one targeted change to the
+ engine factory (shared instance pattern). All other code follows existing patterns.
+ **Gain:** Single process, concurrent sessions, live view, human control, upgrade path to C4.
+ **Give up:** Shared failure domain (vs. C4's isolated processes).
+ **Scope:** Best-fit for the 12-month vision.
+ **Philosophy:** Honors DI (engine injected, `AgentLoopPort` injected), errors-as-data
+ (ResultAsync throughout), immutability (session store append-only), exhaustiveness
+ (`DaemonSession.status` is a discriminated union). No conflicts.
+
+ ---
+
+ ### Candidate 4: Composite Separate-Process
+
+ **Summary:** The daemon runs as a separate process with its own `createWorkRailEngine()`
+ instance, sharing durable session state with the MCP server through a shared `dataDir`.
+
+ **Concrete shape:**
+ - `packages/daemon/src/entry.ts` -- new process, calls `createWorkRailEngine({ dataDir: sharedPath })`
+ - Same `DaemonSessionManager`, `step-runner`, `trigger/`, `tool-executor/` as C3
+ - Separate HTTP port (default: 3101) for the daemon control plane
+ - `withHealthySessionLock` file locking ensures safe cross-process session store writes
+ - Console proxies to both port 3100 (MCP server) and port 3101 (daemon) for live view
+
+ **Tensions resolved:** No `engineActive` guard change needed (each process has one engine);
+ crash isolation (daemon crash does not affect MCP server); natural cloud upgrade path.
+ **Tensions accepted:** Two-process deployment for local dev; filesystem coordination limits
+ deployment to a single machine without a shared volume; separate HTTP port for daemon control.
+ **Boundary:** Separate process with shared filesystem state.
+ **Why this boundary:** Cleanest architectural expression. No guard changes. Natural path
+ to cloud (swap `LocalDataDirV2` for a remote-backed store port).
+ **Failure mode:** Lock contention on the shared session store under high concurrent load.
+ `withHealthySessionLock` handles this, but cross-process file locking has not been tested
+ with two WorkRail processes.
+ **Repo pattern:** Adapts `engine-factory.ts` correctly (one engine per process). Departs
+ from the single-process assumption in `mcp-server.ts`.
+ **Gain:** Clean process boundary; no guard change; independent scaling; cloud-natural.
+ **Give up:** Two-process local dev; lock contention risk; more complex setup.
+ **Scope:** Best-fit for the 18-month cloud target. Slightly broad for 12-month local-first.
+ **Philosophy:** Architecturally the purest expression of all principles. One engine per
+ process, no guard relaxation, cleanest DI. No conflicts.
+
+ ---
+
+ ## Comparison and Recommendation
+
+ | Criterion | C1 | C2 | C3 | C4 |
+ |---|---|---|---|---|
+ | Single deployment | Yes | No | Yes | No |
+ | Concurrent sessions | No | Yes | Yes | Yes |
+ | Human override (live view) | No | Partial | Yes | Yes |
+ | engineActive change | None | None | Shared instance | None |
+ | Cloud upgrade path | Hard | Native | Port swap | Natural |
+ | Repo pattern fit | Perfect | Departs | Perfect + extend | Perfect + extend |
+ | Philosophy fit | Perfect | Good | Perfect | Best |
+ | Ship complexity | Low | High | Medium | High |
+
+ **Recommendation: Candidate 3.**
+
+ The 12-month success criteria require concurrent sessions AND live view AND single
+ deployment. Only C3 satisfies all three. The safety concern (concurrent handler calls)
+ is resolved by code analysis -- `V2Dependencies` is stateless and the session store
+ serializes per-session writes. The `engineActive` guard change (boolean -> shared
+ instance) is a targeted, well-understood change.
+
+ ---
+
+ ## Self-Critique
+
+ ### Strongest counter-argument
+
+ C1 (sequential) ships faster and proves the actual unknowns: does the agent loop work?
+ Does the trigger system work? Does the daemon produce correct `notesMarkdown`? The REST
+ control plane is a developer experience feature, not a correctness feature. If the
+ primary goal is "demonstrate autonomous execution," C1 is the better choice. C3 adds
+ scope before the core concept is proven.
+
+ ### Narrower option that could work
+
+ C1 with a note: "expand to C3 in the next iteration." C1 is a strict subset of C3 --
+ the FIFO queue is a degenerate case of C3's `DaemonSessionManager` (concurrency = 1).
+ Starting with C1 and expanding to C3 is a valid staged approach.
+
+ ### Broader option and what would justify it
+
+ C4 (separate process) is justified if cloud deployment becomes a committed 12-month
+ goal. The migration from C3 to C4 is: extract `src/daemon/` into `packages/daemon/`,
+ add a separate process entry point, verify cross-process lock safety. The daemon code
+ itself does not change -- only the process boundary changes.
+
+ ### Assumption that would invalidate this recommendation
+
+ If `withHealthySessionLock` does NOT safely handle concurrent callers within the same
+ process (i.e., if the lock is not reentrant-safe for async calls), then concurrent
+ sessions in C3 would corrupt the session store. This is unlikely (the lock is designed
+ for concurrent writes) but must be verified before shipping C3 with concurrency enabled.
+
+ ---
+
+ ## Open Questions for the Main Agent
+
+ 1. Should the first daemon version use C1 (sequential, ship fast) or C3 (concurrent,
+ full scope) as the initial implementation target?
+
+ 2. The `AgentLoopPort` interface -- should it abstract the full LLM conversation turn
+ (multi-turn tool call loop) or just the single LLM API call? A full-turn abstraction
+ is cleaner but harder to design. A single-call abstraction leaks the tool call loop
+ into the daemon.
+
+ 3. Is pi-mono's `agentLoop` the right reference for the agent loop implementation, or
+ should WorkRail build a minimal implementation against the Anthropic SDK directly?
+
+ 4. Cross-repo execution: is it a 12-month must-have or a post-12-month feature? If it
+ is a must-have, `BashInRepo` / `ReadRepo` must be designed now. If it is post-12-month,
+ the tool executor can be simpler (single-workspace Bash/Read/Write).
+
+ 5. The `engineActive` guard change: should it be a shared singleton instance pattern, or
+ a ref-counted guard, or something else? The choice affects how tests isolate engine
+ instances.
package/docs/design/daemon-design-review-findings.md
@@ -0,0 +1,119 @@
+ # WorkRail Daemon Architecture: Design Review Findings
+
+ > Review output for the selected direction: Candidate 3 (Composite Same-Process) with
+ > Candidate 1 safety defaults (maxConcurrentSessions: 1 for v1).
+ > Generated: 2026-04-14.
+
+ ---
+
+ ## Tradeoff Review
+
+ | Tradeoff | Acceptable? | Failure Condition | Hidden Assumption |
+ |----------|-------------|------------------|-------------------|
+ | Shared process failure domain | Yes (local-first 12-month scope) | WorkRail deployed as shared multi-user server | Daemon agent loop is well-behaved (AbortController timeouts required) |
+ | Process-level init change (`initializeWorkRailProcess`) | Yes (internal, invisible to users) | `runtimeMode` discriminant insufficient for combined mode -- may need third mode or flags object | DI container initialization has no entry-point-specific services that conflict |
+ | Cross-repo deferred to post-MVP | Yes (backlog explicitly says post-MVP) | First real use case (MR review) requires cross-repo | MVP MR review workflow is single-repo -- must be confirmed with actual first workflow target |
+
+ ---
+
+ ## Failure Mode Review
+
+ | Failure Mode | Design Handling | Missing Mitigation | Risk Level |
+ |---|---|---|---|
+ | Hanging agent loops | `AbortController` in `DaemonSession`; REST cancel calls `abort()` | `runStep` must accept an `AbortSignal` parameter -- currently not in spec | **ORANGE** -- manageable but must be explicit in design |
+ | Two `initializeContainer()` calls corrupting DI state | Process-level `initializeWorkRailProcess()` called once | Exact interface (`SharedEngineContext`) not yet specified; `mcp-server.ts` startup path needs refactor | **ORANGE** -- must be designed and tested first; highest-risk change |
+ | Lock contention under high concurrent load | `withHealthySessionLock` per session; v1 uses queue (concurrency = 1) | Session concurrency limit in `DaemonSessionManager` (max N for v1.5) | **YELLOW** -- performance concern, not correctness |
+
+ ---
+
+ ## Runner-Up / Simpler Alternative Review
+
+ **Runner-up (Candidate 1: Sequential):**
+ - C1's FIFO queue ensures the `engineActive` guard is never violated without requiring the guard to change
+ - This strength is worth borrowing: v1 runs with `maxConcurrentSessions: 1`
+ - C1 loses because it provides no live view and no human override path -- unacceptable for the trust model change that autonomous execution represents
+
+ **Simpler alternative (C3 without REST control plane):**
+ - Saves ~150 lines of code in v1
+ - Loses: operators cannot pause a runaway autonomous session
+ - For local dev (one developer, their own machine), acceptable
+ - For team deployment, an unacceptable safety gap
+ - Decision: include the REST control plane in v1; keep it simple (3-4 routes)
+
+ **Hybrid adopted: C3 with `maxConcurrentSessions: 1` default**
+ - C1 safety (queue) + C3 architecture (DaemonSessionManager, REST control plane)
+ - No `engineActive` guard change needed in v1 (queue ensures one engine call at a time)
+ - Path to full concurrency: design `SharedEngineContext`, enable in v1.5
+
+ ---
+
+ ## Philosophy Alignment
+
+ **Satisfied clearly:**
+ - Errors as data (ResultAsync throughout)
+ - Immutability (append-only events, typed status transitions)
+ - Make illegal states unrepresentable (`DaemonSession.status` discriminated union)
+ - Explicit domain types (`AnthropicApiKey`, `GitLabToken` branded types)
+ - Validate at boundaries (Zod for trigger payloads)
+ - DI for boundaries (`AgentLoopPort`, `ToolExecutorPort` injected)
+ - YAGNI with discipline (`maxConcurrentSessions: 1`, cross-repo deferred)
+
+ **Under tension (all acceptable):**
+ - Determinism: LLM outputs are non-deterministic by nature; WorkRail's value is structural enforcement, not content determinism
+ - Pure functions: the multi-turn LLM loop is inherently stateful; the `runStep` API is as pure as possible
+ - Architectural fixes over patches: the queue is a deliberate v1 design, not a hidden workaround; `SharedEngineContext` is designed and documented
+
+ ---
+
+ ## Findings
+
+ ### RED (blocking -- must be resolved before implementation begins)
+
+ None.
+
+ ### ORANGE (must address before shipping)
+
+ **[ORANGE-1] `runStep` missing `AbortSignal` parameter**
+ - Finding: The `step-runner` spec does not include an `AbortSignal` parameter. Without it, `DaemonSession.abortController.abort()` does not propagate to LLM calls or Bash subprocesses.
+ - Required fix: `runStep(pending: PendingStep, toolExecutor: ToolExecutorPort, llmPort: AgentLoopPort, signal: AbortSignal): RA<StepOutput, StepError>`
+ - Impact: REST `DELETE /api/v2/sessions/:id` (cancel) and `POST pause` do not work without this
+
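+ A hedged sketch of the required fix (the port shapes and error type are
+ illustrative assumptions; only the `runStep` signature comes from the finding above):
+
+ ```typescript
+ import { errAsync, ResultAsync } from 'neverthrow';
+
+ interface PendingStep { prompt: string; stepId: string; title: string }
+ interface StepOutput { notesMarkdown: string; context: Record<string, unknown> }
+ type StepError = { kind: 'aborted' } | { kind: 'llm_failed'; cause: unknown };
+
+ // Assumed port shapes: both take the signal, so that
+ // DaemonSession.abortController.abort() reaches LLM calls and subprocesses.
+ interface ToolExecutorPort {
+   Bash(cmd: string, signal: AbortSignal): Promise<string>;
+ }
+ interface AgentLoopPort {
+   complete(prompt: string, tools: ToolExecutorPort, signal: AbortSignal): Promise<StepOutput>;
+ }
+
+ function runStep(
+   pending: PendingStep,
+   toolExecutor: ToolExecutorPort,
+   llmPort: AgentLoopPort,
+   signal: AbortSignal,
+ ): ResultAsync<StepOutput, StepError> {
+   if (signal.aborted) return errAsync<StepOutput, StepError>({ kind: 'aborted' });
+   return ResultAsync.fromPromise(
+     llmPort.complete(pending.prompt, toolExecutor, signal),
+     (cause): StepError =>
+       signal.aborted ? { kind: 'aborted' } : { kind: 'llm_failed', cause },
+   );
+ }
+ ```
+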
+ **[ORANGE-2] `SharedEngineContext` interface not specified**
+ - Finding: The process-level `initializeWorkRailProcess()` function is identified as needed but its return type and contract are not designed.
+ - Required: `initializeWorkRailProcess(config: ProcessConfig): Promise<SharedEngineContext>` where `SharedEngineContext` exposes the engine instance + DI-resolved ports that both MCP server and daemon entry points need.
+ - Impact: If both entry points call `initializeContainer()` independently, DI container state is indeterminate. This is the highest-risk change; must be designed and tested first.
+ - Note: For v1 with `maxConcurrentSessions: 1`, the queue ensures the `engineActive` boolean guard is never violated -- so this is not needed for v1. But it must be designed in v1 and tested before enabling concurrency in v1.5.
+
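+ A hedged sketch of one possible shape, purely to make the ask concrete (everything
+ except the two names from the finding is an assumption):
+
+ ```typescript
+ // Placeholder types for illustration.
+ interface WorkRailEngine { /* startWorkflow, continueWorkflow, ... */ }
+ interface AgentLoopPort { /* one LLM conversation turn */ }
+ interface ProcessConfig { dataDir: string; runtimeMode: 'library' | 'server' }
+
+ // What both entry points (MCP server, daemon) would receive.
+ interface SharedEngineContext {
+   engine: WorkRailEngine;
+   agentLoop: AgentLoopPort;
+ }
+
+ let shared: SharedEngineContext | undefined;
+ declare function buildContainerOnce(config: ProcessConfig): Promise<SharedEngineContext>;
+
+ // Idempotent: the second entry point gets the same instance, so the
+ // underlying initializeContainer() runs exactly once per process.
+ async function initializeWorkRailProcess(config: ProcessConfig): Promise<SharedEngineContext> {
+   if (!shared) shared = await buildContainerOnce(config);
+   return shared;
+ }
+ ```
+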
+ ### YELLOW (should address, not blocking)
+
+ **[YELLOW-1] Session concurrency limit not specified**
+ - Finding: `DaemonSessionManager` has no upper bound on concurrent sessions even in the v1.5 full-concurrency mode.
+ - Recommendation: Add `maxConcurrentSessions: number` to `DaemonConfig` with a safe default (e.g., 10). Sessions beyond the limit are queued, not rejected.
+
+ **[YELLOW-2] `runtimeMode` may be insufficient**
+ - Finding: The current `runtimeMode` discriminant (`library` | `server`) does not express "server + daemon combined" mode. A third mode or flags object may be needed.
+ - Recommendation: Evaluate whether `initializeContainer({ runtimeMode: 'server', daemon: true })` is sufficient or whether a new mode value is needed when designing `SharedEngineContext`.
+
+ **[YELLOW-3] Cross-repo tool executor interface not extensible**
+ - Finding: The v1 tool executor spec (`Bash`, `Read`, `Write`) is single-workspace. If the first real use case requires cross-repo, the interface must be redesigned.
+ - Recommendation: Design `ToolExecutorPort` to support an optional `repo` parameter from day one: `Bash(cmd: string, opts?: { repo?: string }): RA<string, ToolError>`. Single-workspace behavior when `repo` is absent; cross-repo routing when present. Costs nothing to include; prevents a breaking interface change later.
+
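+ A hedged sketch of that port, assuming `RA` is the codebase's `ResultAsync` alias
+ and inventing a minimal `ToolError` for illustration:
+
+ ```typescript
+ import { ResultAsync } from 'neverthrow';
+
+ type RA<T, E> = ResultAsync<T, E>; // assumed alias, as used in the spec above
+ type ToolError = { kind: 'exec_failed'; message: string };
+
+ interface ToolExecutorPort {
+   // Single-workspace behavior when `opts.repo` is absent; cross-repo routing
+   // when present. Optional from day one, so adding it later breaks nothing.
+   Bash(cmd: string, opts?: { repo?: string }): RA<string, ToolError>;
+   Read(path: string, opts?: { repo?: string }): RA<string, ToolError>;
+   Write(path: string, content: string, opts?: { repo?: string }): RA<void, ToolError>;
+ }
+ ```
+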
+ ---
+
+ ## Recommended Revisions
+
+ 1. **[Required for v1]** Add `signal: AbortSignal` to the `runStep` signature.
+ 2. **[Required for v1]** Design the `SharedEngineContext` interface and `initializeWorkRailProcess()` signature (even if only the queue-mode path is enabled in v1).
+ 3. **[Strongly recommended for v1]** Design `ToolExecutorPort` with an optional `repo` parameter to avoid a future breaking change.
+ 4. **[v1.5]** Evaluate `runtimeMode` extension before enabling full concurrency.
+ 5. **[v1.5]** Add a session concurrency limit with a safe default.
+
+ ---
+
+ ## Residual Concerns
+
+ 1. **Agent loop correctness is the riskiest unknown.** The step-runner must correctly handle multi-turn LLM conversations with tool calls (not just single LLM calls). This is the piece that has never been built before in WorkRail. The pi-mono `agentLoop` reference is the best existing implementation to study. Whether to use pi-mono directly or implement from scratch against the Anthropic SDK is an unresolved dependency decision.
+
+ 2. **`mcp-server.ts` refactor scope is uncertain.** Introducing `initializeWorkRailProcess()` requires refactoring how `startStdioServer` and `startHttpServer` initialize the container. The scope of this refactor depends on how deeply initialization is entangled in each transport entry point. This should be the first code spike when C3 implementation begins.
+
+ 3. **Console integration is not specified.** The REST control plane additions are specified (`daemon-status`, `pause`, `resume`, `cancel`). How these appear in the console UI is not -- that is a separate design question for the console team / next iteration.
package/docs/design/daemon-engine-design-candidates.md
@@ -0,0 +1,210 @@
+ # Daemon Execution Engine -- Design Candidates
+
+ **Status:** Raw investigative material -- for main agent review
+ **Date:** 2026-04-14
+ **Context:** Architecture decision for WorkRail's autonomous execution daemon
+
+ ---
+
+ ## Problem Understanding
+
+ ### Core tensions
+
+ 1. **Build speed vs. structural correctness**: `engine-factory.ts` is 477 lines and already wraps the exact same v2 handlers the MCP tools call. Using it from a daemon looks like a 50-line win. But it has two hard correctness bugs when used alongside a running MCP server.
+
+ 2. **Colocation vs. isolation**: The DI container is a module-level tsyringe global singleton. Its design invariant is one container per process. Forcing two concurrent execution paths through it violates the invariant by design, not by accident.
+
+ 3. **API surface stability vs. speed of access**: Option A gives the daemon access to internal handler functions (`executeStartWorkflow`, `executeContinueWorkflow`) -- no versioning, no contract boundary. Option B uses the MCP HTTP API -- versioned, stable, Zod-validated.
+
+ 4. **Testing simplicity vs. deployment correctness**: Option A requires no running HTTP server in tests. Option B does (or a mock MCP server). This is a real cost, not a theoretical one.
+
+ ### Likely seam
+
+ **OS process boundary.** WorkRail already has two modes: library (same process, no signals) and server (own process, HTTP transport). The daemon belongs in a third role: a separate process that is a consumer of the server's MCP HTTP API. The seam is the `/mcp` endpoint.
+
+ ### What makes this hard / what a junior developer would miss
+
+ - `engineActive = false` in `engine-factory.ts` is a hard block, not advisory
+ - `process.kill(pid, 0)` in `LocalSessionLockV2` cannot distinguish two call paths that share a PID -- it will treat a daemon-held lock as valid even after the daemon crashes, because the process (MCP server) is still alive
+ - The DI global singleton means both paths share keyring material -- a compromise or divergence in one affects the other
+ - `ThrowingProcessTerminator` in library mode throws instead of calling `process.exit()` -- fine for embedding, but if the daemon has an invariant violation in this mode, the event loop continues in a possibly corrupt state
+
+ ---
+
+ ## Philosophy Constraints
+
+ From `/Users/etienneb/CLAUDE.md` and codebase patterns:
+
+ **Principles under most pressure:**
+ - **Make illegal states unrepresentable**: Option A makes a lock-held-by-same-process state possible and undetectable. Option B makes it structurally impossible.
+ - **Dependency injection for boundaries**: Option A collapses the boundary between daemon and server infrastructure. Option B preserves it -- the MCP API is the injected boundary.
+
+ **Principles that both options satisfy:**
+ - **Errors are data** (`neverthrow` / `ResultAsync`): Both options can surface typed errors.
+ - **YAGNI with discipline**: Neither option over-engineers.
+
+ **No stated-vs-practiced conflicts** observed in the codebase.
+
+ ---
+
+ ## Impact Surface
+
+ If Option B is chosen:
+ - `http-entry.ts` / `http-listener.ts`: no changes needed -- the HTTP server is already multi-client
+ - `StreamableHTTPServerTransport` with `sessionIdGenerator: crypto.randomUUID`: already handles concurrent MCP clients
+ - Daemon uses `@modelcontextprotocol/sdk/client` (the MCP SDK is already a dependency)
+ - `continueToken` / `checkpointToken` are the only state the daemon carries between steps
+
+ If Option A were chosen (rejected):
+ - The `engineActive` guard would need to be bypassed or disabled -- violates explicit design intent
+ - `LocalSessionLockV2.acquire()` would give false "lock is valid" results for daemon-held locks after daemon crashes
+ - Both paths would share the same `DI.V2.Keyring` instance -- any key rotation in one path invalidates tokens in the other
+
+ ---
+
+ ## Candidates
+
+ ### Candidate 1: Daemon as MCP HTTP client (RECOMMENDED)
+
+ **Summary:** Separate OS process. Uses `@modelcontextprotocol/sdk/client` pointed at `localhost:3100/mcp`. Drives sessions via `start_workflow` / `continue_workflow` HTTP calls. No shared in-process state with the MCP server.
+
+ **Tensions resolved:**
+ - Colocation vs. isolation: resolved -- different PIDs, separate DI containers
+ - API stability: resolved -- MCP HTTP contract is versioned and Zod-validated
+ - Cloud/Docker portability: resolved -- HTTP over localhost = HTTP over private network
+
+ **Tension accepted:** Testing requires a running HTTP server or a mock MCP server.
+
+ **Boundary solved at:** OS process boundary via HTTP. This is the boundary WorkRail already establishes for its MCP transport.
+
+ **Why this boundary is the best fit:** The `engineActive` guard, the DI global singleton, and the PID-based lock check are all designed around the process boundary as the fundamental isolation unit. Respecting this boundary means working with the codebase's invariants, not against them.
+
+ **Failure mode:** MCP session token propagation -- `StreamableHTTPServerTransport` requires `Mcp-Session-Id` headers after session establishment. The MCP SDK client handles this automatically; using raw `fetch` would require manual header management. Mitigation: use the SDK client.
+
+ **Repo-pattern relationship:** Follows `http-entry.ts` (HTTP transport already established). Adapts pi-mono's `agentLoop` (stateless loop calling external tools). No departure from existing patterns.
+
+ **Gains:**
+ - Zero session lock contention
+ - Zero DI collision
+ - Cloud/Docker portable without code changes
+ - Independently testable with a mock server
+ - Daemon crash only affects its own sessions (no shared state to corrupt)
+
+ **Gives up:**
+ - Direct function call latency (~1-5ms per step vs. microseconds)
+ - Requires the MCP HTTP server to be running at startup
+
+ **Scope judgment:** Best-fit. Directly addresses the problem without over-engineering.
+
+ **Philosophy fit:**
+ - Honors: make-illegal-states-unrepresentable, DI-for-boundaries, validate-at-boundaries, errors-are-data
+ - Conflicts: none
+
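+ A hedged sketch of the daemon-side client using the MCP TypeScript SDK (the tool
+ argument shapes are assumptions; the import paths follow the SDK's documented client API):
+
+ ```typescript
+ import { Client } from '@modelcontextprotocol/sdk/client/index.js';
+ import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';
+
+ async function main(): Promise<void> {
+   // The SDK transport manages the Mcp-Session-Id header automatically,
+   // which is the failure mode called out above for raw fetch.
+   const transport = new StreamableHTTPClientTransport(new URL('http://localhost:3100/mcp'));
+   const client = new Client({ name: 'workrail-daemon', version: '0.1.0' });
+   await client.connect(transport);
+
+   // Tool names come from the candidate summary; argument shapes are illustrative.
+   const started = await client.callTool({
+     name: 'start_workflow',
+     arguments: { workflowId: 'mr-review' },
+   });
+   console.log(started);
+
+   // ...run the agent loop on the pending prompt, then:
+   // await client.callTool({ name: 'continue_workflow', arguments: { /* tokens, output, context */ } });
+
+   await client.close();
+ }
+
+ main().catch((err) => {
+   console.error(err);
+   process.exit(1);
+ });
+ ```
+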
+ ---
+
+ ### Candidate 2: engine-factory via child_process.fork + IPC
+
+ **Summary:** Daemon forks a child process that exclusively runs `createWorkRailEngine()`. Parent sends JSON-serialized `{ kind: 'start' | 'continue' | 'checkpoint', ... }` messages over `process.send()` IPC. Child responds with serialized `EngineResult`. Different PIDs -- lock check works.
+
+ **Tensions resolved:**
+ - Colocation vs. isolation: resolved (different PIDs via fork)
+ - Test ergonomics: resolved (no HTTP server needed)
+
+ **Tensions accepted:**
+ - API surface stability: weak -- IPC message format is ad-hoc, not versioned
+ - New keyring divergence risk: if two processes load the same keyring file independently and either rotates keys, token signatures from the other process become invalid
+ - Cloud/Docker: forks don't cross container boundaries; must be replaced with a network transport for cloud
+
+ **Boundary solved at:** OS process boundary via `child_process.fork()` IPC.
+
+ **Why this boundary is NOT the best fit:** It introduces a novel IPC protocol not present in the codebase and creates a keyring divergence risk that does not exist in Option B. The fork model works locally but fails in Docker multi-container or any distributed deployment.
+
+ **Failure mode (critical):** Keyring divergence. Two processes loading the same `~/.workrail/keyring.json` get the same initial HMAC keys. But if either process rotates keys (key expiry, re-keying), the other process's in-memory keyring diverges. Tokens signed by process A may fail validation in process B. This is a non-obvious correctness risk with no mitigation short of external coordination.
+
+ **Repo-pattern relationship:** Departs from existing patterns. No `child_process.fork()` or IPC in the codebase.
+
+ **Gains:**
+ - PID isolation (lock check works)
+ - No HTTP server dependency
+ - Reuses the typed `WorkRailEngine` API
+
+ **Gives up:**
+ - Introduces an ad-hoc IPC protocol
+ - Keyring divergence risk
+ - No cloud portability
+ - More maintenance burden than Option B
+
+ **Scope judgment:** Too broad. Solves the PID problem but introduces a new correctness risk. More complex than Option B without the deployment benefits.
+
+ **Philosophy fit:**
+ - Honors: make-illegal-states-unrepresentable (PIDs now different), errors-are-data
+ - Conflicts: YAGNI (novel IPC layer), validate-at-boundaries (IPC serialization is unvalidated), architectural-fixes-over-patches (this is a patch, not a fix)
+
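+ For concreteness, a hedged sketch of the ad-hoc IPC shape this candidate implies
+ (illustrative only -- this is the protocol the critique above argues against):
+
+ ```typescript
+ import { fork } from 'node:child_process';
+
+ // The message union from the summary, serialized as plain JSON over
+ // process.send() -- no versioning, no Zod validation at the boundary.
+ type EngineRequest =
+   | { kind: 'start'; workflowId: string }
+   | { kind: 'continue'; continueToken: string; output: string }
+   | { kind: 'checkpoint'; continueToken: string };
+
+ const child = fork('./engine-child.js'); // child would run createWorkRailEngine()
+
+ child.on('message', (result) => {
+   // The serialized EngineResult arrives untyped on this side.
+   console.log('engine result', result);
+ });
+
+ const request: EngineRequest = { kind: 'start', workflowId: 'mr-review' };
+ child.send(request);
+ ```
+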
+ ---
+
+ ### Candidate 3: Hybrid -- MCP HTTP in production, engine-factory in test/local
+
+ **Summary:** At startup, the daemon checks: if `WORKRAIL_TRANSPORT=http` and `localhost:{port}/mcp` is reachable, use Candidate 1 (MCP HTTP client). Otherwise, use `createWorkRailEngine()` directly (Option A). Internal `DaemonTransport` union type: `{ kind: 'mcp_http'; client: McpClient } | { kind: 'direct'; engine: WorkRailEngine }`.
+
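+ A hedged sketch of that startup check (the probe helper and placeholder types are
+ assumptions; only the union shape comes from the summary above):
+
+ ```typescript
+ interface McpClient {}       // placeholder for the SDK client
+ interface WorkRailEngine {}  // placeholder for the engine-factory result
+ declare function connectMcpClient(url: URL): Promise<McpClient>;
+ declare function createWorkRailEngine(): WorkRailEngine;
+
+ type DaemonTransport =
+   | { kind: 'mcp_http'; client: McpClient }
+   | { kind: 'direct'; engine: WorkRailEngine };
+
+ async function resolveDaemonTransport(port: number): Promise<DaemonTransport> {
+   if (process.env.WORKRAIL_TRANSPORT === 'http') {
+     try {
+       // Reachability probe; fall through to the direct path on failure.
+       const client = await connectMcpClient(new URL(`http://localhost:${port}/mcp`));
+       return { kind: 'mcp_http', client };
+     } catch {
+       // unreachable server: fall back below
+     }
+   }
+   // Advisory only: nothing structural stops this path from co-existing
+   // with a running MCP server -- the aliasing risk the review calls out.
+   return { kind: 'direct', engine: createWorkRailEngine() };
+ }
+ ```
+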
+ **Tensions resolved:**
+ - Test ergonomics: resolved (direct path needs no HTTP server)
+ - Build speed: partially resolved (reuse engine-factory for local scenarios)
+
+ **Tension accepted:** Two code paths must be maintained. Any new engine feature must be reflected in both, or the direct path diverges from the MCP path over time.
+
+ **Boundary solved at:** Startup-time capability detection. Adapts the `resolveTransportMode()` pattern from `mcp-server.ts`.
+
+ **Why this boundary is not the best fit:** The `direct` path still has the `engineActive` singleton constraint and the PID aliasing risk if an MCP server is accidentally co-located. The runtime check protecting against this is advisory (a thrown Error), not structural (a type system constraint).
+
+ **Failure mode:** If `WORKRAIL_DAEMON_TRANSPORT=direct` is set in a production environment where an MCP server is also running, the PID aliasing bug returns silently. There is no compile-time protection.
+
+ **Repo-pattern relationship:** Adapts `resolveTransportMode()` from `mcp-server.ts`. Reasonable adaptation.
+
+ **Gains:**
+ - Test ergonomics (no HTTP server needed in direct mode)
+ - Migration path for library embedding use cases
+
+ **Gives up:**
+ - Two-path maintenance burden
+ - Direct path safety is advisory, not structural
+ - Adds conditional logic to every daemon session operation
+
+ **Scope judgment:** Slightly too broad for the production case. Best-fit only if test ergonomics is a primary blocking concern.
+
+ **Philosophy fit:**
+ - Honors: YAGNI (reuses existing API), errors-are-data
+ - Conflicts: make-illegal-states-unrepresentable (direct path allows aliasing), architectural-fixes-over-patches (hybrid is a patch)
+
+ ---
+
+ ## Comparison and Recommendation
+
+ | Criterion | C1 (MCP HTTP) | C2 (fork+IPC) | C3 (hybrid) |
+ |-----------|---------------|---------------|-------------|
+ | Lock contention | Resolved structurally | Resolved via fork | Resolved in prod path |
+ | DI isolation | Resolved | Resolved | Resolved in prod path |
+ | Keyring safety | N/A (server owns it) | New risk | N/A in MCP path |
+ | Cloud/Docker | Excellent | Poor (no cross-container fork) | Good only in MCP path |
+ | API stability | Strong (MCP contract) | Weak (ad-hoc IPC) | Mixed |
+ | Test ergonomics | Needs mock server | No server needed | No server needed (direct) |
+ | Maintenance burden | Low (single path) | High (IPC layer) | Medium (two paths) |
+ | Repo pattern fit | Excellent | Poor | Acceptable |
+ | Philosophy alignment | Strong | Weak | Mixed |
+
+ **Recommendation: Candidate 1 (MCP HTTP client)**
+
+ All five decision criteria are satisfied structurally, not by advisory guards or operational conventions. Cloud portability is zero-cost. The implementation is the shortest path to correctness (~100 lines for the daemon's MCP client wrapper + agent loop), not the shortest path to a running prototype.
+
+ ---
+
+ ## Self-Critique
+
+ **Strongest counter-argument against C1:**
+ The MCP HTTP server must be running before the daemon operates. In a single-binary deployment, this requires orchestrating two processes. In practice: use a process supervisor (PM2, systemd, Docker Compose `depends_on`), or have the daemon implement a startup retry loop. This is operational boilerplate, not a correctness problem.
+
+ **What narrower option might still work:**
+ Candidate 3 (hybrid) satisfies all criteria in its MCP path. It loses because the two-path maintenance burden and the advisory-only guard on the direct path make it structurally weaker than C1, with no material benefit that C1 cannot achieve with a test-only mock server.
+
+ **What broader option might be justified:**
+ A full job queue (Redis/BullMQ backing the daemon) for multi-tenant SaaS scale. Evidence required: concurrent sessions, multiple daemon instances, distributed scheduling. Not in scope.
+
+ **Assumption that would invalidate this design:**
+ The daemon must operate in an air-gapped/offline environment with no localhost HTTP available. In that case, Candidate 2 (fork+IPC) is the right shape -- but it requires resolving the keyring divergence risk first (e.g., move token signing to a shared file-based signing service, or use the same process for both keyring and daemon).
+
+ ---
+
+ ## Open Questions for the Main Agent
+
+ 1. Is the MCP HTTP startup dependency acceptable, or is there a single-binary deployment requirement that makes Candidate 1 impractical?
+ 2. Should the daemon use a test-mode mock MCP server (simulated in-memory) or a real HTTP server in unit tests?
+ 3. Should the `DaemonTransport` abstraction from Candidate 3 be built as a future extension point even if only the MCP path is implemented initially?
+ 4. Is the 1-5ms HTTP overhead per step a real concern for the planned step intervals (seconds to minutes), or is it safe to ignore?
+ 5. What is the expected concurrency model for the daemon -- one session at a time, or multiple concurrent sessions driving different workflows?