@exaudeus/workrail 3.28.0 → 3.29.0

Files changed (160)
  1. package/dist/console/assets/{index-C146q2kN.js → index-BZ6HkxGf.js} +1 -1
  2. package/dist/console/index.html +1 -1
  3. package/dist/manifest.json +3 -3
  4. package/docs/README.md +57 -0
  5. package/docs/adrs/001-hybrid-storage-backend.md +38 -0
  6. package/docs/adrs/002-four-layer-context-classification.md +38 -0
  7. package/docs/adrs/003-checkpoint-trigger-strategy.md +35 -0
  8. package/docs/adrs/004-opt-in-encryption-strategy.md +36 -0
  9. package/docs/adrs/005-agent-first-workflow-execution-tokens.md +105 -0
  10. package/docs/adrs/006-append-only-session-run-event-log.md +76 -0
  11. package/docs/adrs/007-resume-and-checkpoint-only-sessions.md +51 -0
  12. package/docs/adrs/008-blocked-nodes-architectural-upgrade.md +178 -0
  13. package/docs/adrs/009-bridge-mode-single-instance-mcp.md +195 -0
  14. package/docs/adrs/010-release-pipeline.md +89 -0
  15. package/docs/architecture/README.md +7 -0
  16. package/docs/architecture/refactor-audit.md +364 -0
  17. package/docs/authoring-v2.md +527 -0
  18. package/docs/authoring.md +873 -0
  19. package/docs/changelog-recent.md +201 -0
  20. package/docs/configuration.md +505 -0
  21. package/docs/ctc-mcp-proposal.md +518 -0
  22. package/docs/design/README.md +22 -0
  23. package/docs/design/agent-cascade-protocol.md +96 -0
  24. package/docs/design/autonomous-console-design-candidates.md +253 -0
  25. package/docs/design/autonomous-console-design-review.md +111 -0
  26. package/docs/design/autonomous-platform-mvp-discovery.md +525 -0
  27. package/docs/design/claude-code-source-deep-dive.md +713 -0
  28. package/docs/design/console-cyberpunk-ui-discovery.md +504 -0
  29. package/docs/design/console-execution-trace-candidates-final.md +160 -0
  30. package/docs/design/console-execution-trace-candidates.md +211 -0
  31. package/docs/design/console-execution-trace-design-candidates-v2.md +113 -0
  32. package/docs/design/console-execution-trace-design-review.md +74 -0
  33. package/docs/design/console-execution-trace-discovery.md +394 -0
  34. package/docs/design/console-execution-trace-final-review.md +77 -0
  35. package/docs/design/console-execution-trace-review.md +92 -0
  36. package/docs/design/console-performance-discovery.md +415 -0
  37. package/docs/design/console-ui-backlog.md +280 -0
  38. package/docs/design/daemon-architecture-discovery.md +853 -0
  39. package/docs/design/daemon-design-candidates.md +318 -0
  40. package/docs/design/daemon-design-review-findings.md +119 -0
  41. package/docs/design/daemon-engine-design-candidates.md +210 -0
  42. package/docs/design/daemon-engine-design-review.md +131 -0
  43. package/docs/design/daemon-execution-engine-discovery.md +280 -0
  44. package/docs/design/daemon-gap-analysis.md +554 -0
  45. package/docs/design/daemon-owns-console-plan.md +168 -0
  46. package/docs/design/daemon-owns-console-review.md +91 -0
  47. package/docs/design/daemon-owns-console.md +195 -0
  48. package/docs/design/data-model-erd.md +11 -0
  49. package/docs/design/design-candidates-consolidate-dev-staleness.md +98 -0
  50. package/docs/design/design-candidates-walk-cache-depth-limit.md +80 -0
  51. package/docs/design/design-review-consolidate-dev-staleness.md +54 -0
  52. package/docs/design/design-review-walk-cache-depth-limit.md +48 -0
  53. package/docs/design/implementation-plan-consolidate-dev-staleness.md +142 -0
  54. package/docs/design/implementation-plan-walk-cache-depth-limit.md +141 -0
  55. package/docs/design/layer3b-ghost-nodes-design-candidates.md +229 -0
  56. package/docs/design/layer3b-ghost-nodes-design-review.md +93 -0
  57. package/docs/design/layer3b-ghost-nodes-implementation-plan.md +219 -0
  58. package/docs/design/list-workflows-latency-fix-plan.md +128 -0
  59. package/docs/design/list-workflows-latency-fix-review.md +55 -0
  60. package/docs/design/list-workflows-latency-fix.md +109 -0
  61. package/docs/design/native-context-management-api.md +11 -0
  62. package/docs/design/performance-sweep-2026-04.md +96 -0
  63. package/docs/design/routines-guide.md +219 -0
  64. package/docs/design/sequence-diagrams.md +11 -0
  65. package/docs/design/subagent-design-principles.md +220 -0
  66. package/docs/design/temporal-patterns-design-candidates.md +312 -0
  67. package/docs/design/temporal-patterns-design-review-findings.md +163 -0
  68. package/docs/design/test-isolation-from-config-file.md +335 -0
  69. package/docs/design/v2-core-design-locks.md +2746 -0
  70. package/docs/design/v2-lock-registry.json +734 -0
  71. package/docs/design/workflow-authoring-v2.md +1044 -0
  72. package/docs/design/workflow-docs-spec.md +218 -0
  73. package/docs/design/workflow-extension-points.md +687 -0
  74. package/docs/design/workrail-auto-trigger-system.md +359 -0
  75. package/docs/design/workrail-config-file-discovery.md +513 -0
  76. package/docs/docker.md +110 -0
  77. package/docs/generated/v2-lock-closure-plan.md +26 -0
  78. package/docs/generated/v2-lock-coverage.json +797 -0
  79. package/docs/generated/v2-lock-coverage.md +177 -0
  80. package/docs/ideas/backlog.md +3927 -0
  81. package/docs/ideas/design-candidates-mcp-resilience.md +208 -0
  82. package/docs/ideas/design-review-findings-mcp-resilience.md +119 -0
  83. package/docs/ideas/implementation_plan.md +249 -0
  84. package/docs/ideas/third-party-workflow-setup-design-thinking.md +1948 -0
  85. package/docs/implementation/02-architecture.md +316 -0
  86. package/docs/implementation/04-testing-strategy.md +124 -0
  87. package/docs/implementation/09-simple-workflow-guide.md +835 -0
  88. package/docs/implementation/13-advanced-validation-guide.md +874 -0
  89. package/docs/implementation/README.md +21 -0
  90. package/docs/integrations/claude-code.md +300 -0
  91. package/docs/integrations/firebender.md +315 -0
  92. package/docs/migration/v0.1.0.md +147 -0
  93. package/docs/naming-conventions.md +45 -0
  94. package/docs/planning/README.md +104 -0
  95. package/docs/planning/github-ticketing-playbook.md +195 -0
  96. package/docs/plans/README.md +24 -0
  97. package/docs/plans/agent-managed-ticketing-design.md +605 -0
  98. package/docs/plans/agentic-orchestration-roadmap.md +112 -0
  99. package/docs/plans/assessment-gates-engine-handoff.md +536 -0
  100. package/docs/plans/content-coherence-and-references.md +151 -0
  101. package/docs/plans/library-extraction-plan.md +340 -0
  102. package/docs/plans/mr-review-workflow-redesign.md +1451 -0
  103. package/docs/plans/native-context-management-epic.md +11 -0
  104. package/docs/plans/perf-fixes-design-candidates.md +225 -0
  105. package/docs/plans/perf-fixes-design-review-findings.md +61 -0
  106. package/docs/plans/perf-fixes-new-issues-candidates.md +264 -0
  107. package/docs/plans/perf-fixes-new-issues-review.md +110 -0
  108. package/docs/plans/prompt-fragments.md +53 -0
  109. package/docs/plans/ui-ux-workflow-design-candidates.md +120 -0
  110. package/docs/plans/ui-ux-workflow-discovery.md +100 -0
  111. package/docs/plans/ui-ux-workflow-review.md +48 -0
  112. package/docs/plans/v2-followup-enhancements.md +587 -0
  113. package/docs/plans/workflow-categories-candidates.md +105 -0
  114. package/docs/plans/workflow-categories-discovery.md +110 -0
  115. package/docs/plans/workflow-categories-review.md +51 -0
  116. package/docs/plans/workflow-discovery-model-candidates.md +94 -0
  117. package/docs/plans/workflow-discovery-model-discovery.md +74 -0
  118. package/docs/plans/workflow-discovery-model-review.md +48 -0
  119. package/docs/plans/workflow-source-setup-phase-1.md +245 -0
  120. package/docs/plans/workflow-source-setup-phase-2.md +361 -0
  121. package/docs/plans/workflow-staleness-detection-candidates.md +104 -0
  122. package/docs/plans/workflow-staleness-detection-review.md +58 -0
  123. package/docs/plans/workflow-staleness-detection.md +80 -0
  124. package/docs/plans/workflow-v2-design.md +69 -0
  125. package/docs/plans/workflow-v2-roadmap.md +74 -0
  126. package/docs/plans/workflow-validation-design.md +98 -0
  127. package/docs/plans/workflow-validation-roadmap.md +108 -0
  128. package/docs/plans/workrail-platform-vision.md +420 -0
  129. package/docs/reference/agent-context-cleaner-snippet.md +94 -0
  130. package/docs/reference/agent-context-guidance.md +140 -0
  131. package/docs/reference/context-optimization.md +284 -0
  132. package/docs/reference/example-workflow-repository-template/.github/workflows/validate.yml +125 -0
  133. package/docs/reference/example-workflow-repository-template/README.md +268 -0
  134. package/docs/reference/example-workflow-repository-template/workflows/example-workflow.json +80 -0
  135. package/docs/reference/external-workflow-repositories.md +916 -0
  136. package/docs/reference/feature-flags-architecture.md +472 -0
  137. package/docs/reference/feature-flags.md +349 -0
  138. package/docs/reference/god-tier-workflow-validation.md +272 -0
  139. package/docs/reference/loop-optimization.md +209 -0
  140. package/docs/reference/loop-validation.md +176 -0
  141. package/docs/reference/loops.md +465 -0
  142. package/docs/reference/mcp-platform-constraints.md +59 -0
  143. package/docs/reference/recovery.md +88 -0
  144. package/docs/reference/releases.md +177 -0
  145. package/docs/reference/troubleshooting.md +105 -0
  146. package/docs/reference/workflow-execution-contract.md +998 -0
  147. package/docs/roadmap/README.md +22 -0
  148. package/docs/roadmap/legacy-planning-status.md +103 -0
  149. package/docs/roadmap/now-next-later.md +70 -0
  150. package/docs/roadmap/open-work-inventory.md +389 -0
  151. package/docs/tickets/README.md +39 -0
  152. package/docs/tickets/next-up.md +76 -0
  153. package/docs/workflow-management.md +317 -0
  154. package/docs/workflow-templates.md +423 -0
  155. package/docs/workflow-validation.md +184 -0
  156. package/docs/workflows.md +254 -0
  157. package/package.json +3 -1
  158. package/spec/authoring-spec.json +61 -16
  159. package/workflows/workflow-for-workflows.json +3 -3
  160. package/workflows/workflow-for-workflows.v2.json +3 -3
@@ -0,0 +1,415 @@
1
+ # Console Performance: Architecture Discovery
2
+
3
+ **Status:** In Progress
4
+ **Date:** 2026-04-06
5
+ **Path:** full_spectrum
6
+ **Artifact strategy:** This document is the human-facing canonical artifact. Notes and execution truth live in-session.
7
+
8
+ ---
9
+
10
+ ## Context / Ask
11
+
12
+ The workrail MCP server is consuming 140% CPU due to a three-part feedback loop in the console's live session/worktree tracking system.
13
+
14
+ **The loop:**
15
+ ```
16
+ session write -> fs.watch fires -> SSE 'change' broadcast ->
17
+ queryClient.invalidateQueries(['worktrees']) ->
18
+ /api/v2/worktrees: 606 concurrent git subprocesses (12.5s) ->
19
+ session write from the next continue_workflow -> repeat
20
+ ```
21
+
22
+ **Three root causes:**
23
+ 1. `watchSessionsDir()` in `console-routes.ts` uses `fs.watch(sessionsDir, { recursive: true })`. Every `continue_workflow` call writes 2+ files, each firing the watcher. A 200ms debounce exists but is insufficient - it collapses writes within 200ms but a session write from the _response_ to a worktrees fetch triggers a new event.
24
+
25
+ 2. `useWorkspaceEvents()` in `console/src/api/hooks.ts` calls `queryClient.invalidateQueries(['worktrees'])` on every SSE `change` event with no cooldown, bypassing the `staleTime: 20_000` entirely. `invalidateQueries` marks the cache entry as stale and triggers an immediate refetch whenever the query is actively observed (that is, a component using it is mounted).
26
+
27
+ 3. `/api/v2/worktrees` in `console-routes.ts` calls `getWorktreeList()` which fans out to all repo roots from sessions. With 101 discovered worktrees (79 from a stale zillow-android-2 session), each enriched with 6 git commands run in parallel, that is 606 concurrent git subprocesses. Each request takes 12.5 seconds.
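The limit of the 200ms debounce described in root cause 1 can be seen on a small timeline model. This is a minimal sketch, not the code in `console-routes.ts`: `debouncedFireTimes` is a hypothetical helper, and the write timestamps are invented for illustration. It assumes a trailing-edge debounce, where each write restarts the timer and a broadcast goes out once the timer expires.

```typescript
// Trailing-edge debounce on a timeline: every event (re)starts a timer, and a
// broadcast fires `waitMs` after the last event of a burst.
// Hypothetical helper for illustration only.
function debouncedFireTimes(eventTimesMs: number[], waitMs: number): number[] {
  const fires: number[] = [];
  const sorted = [...eventTimesMs].sort((a, b) => a - b);
  let pending: number | null = null; // when the currently armed timer would fire

  for (const t of sorted) {
    if (pending !== null && t >= pending) {
      fires.push(pending); // the timer expired before this event arrived
    }
    pending = t + waitMs; // arm or re-arm the timer
  }
  if (pending !== null) fires.push(pending);
  return fires;
}

// One continue_workflow call writing three files within 50ms: a single broadcast.
const burst = debouncedFireTimes([0, 20, 50], 200);

// Three continue_workflow calls a few seconds apart (invented timings):
// the debounce cannot merge them, so each call still produces a broadcast.
const spaced = debouncedFireTimes([0, 20, 4000, 4020, 8000, 8020], 200);

console.log(burst);  // [250]
console.log(spaced); // [220, 4220, 8220]
```

Under these assumptions the debounce reduces writes per call to one broadcast, but it does nothing about broadcasts per call, which is the rate that drives the worktrees refetch.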
28
+
29
+ **Goal:** Keep the console live and reactive. Fix the CPU spiral architecturally - not with band-aids.
30
+
31
+ ## Path Recommendation
32
+
33
+ **Selected path:** `full_spectrum`
34
+
35
+ **Rationale:**
36
+ - `landscape_first` would miss the framing risk: "is the current model right at all?" The worktrees view is modeled as a live-refresh-on-session-change resource, but worktrees change on a fundamentally different timescale than sessions. That mismatch is a design flaw, not just a performance gap.
37
+ - `design_first` would miss the landscape: there are well-established patterns for live git status (VS Code Source Control, GitLens, tig, LazyGit) that are directly relevant to the solution.
38
+ - `full_spectrum` is right because both landscape grounding (what models work for live git status at scale?) and reframing (are sessions and worktrees the right unit of reactivity?) are equally important.
39
+
40
+ ## Constraints / Anti-goals
41
+
42
+ **Constraints:**
43
+ - Console must stay reactive - real-time session state and worktree status are core value
44
+ - No feature removal
45
+ - Architectural fix - must change invariants, not add special cases
46
+ - The fix must hold even when a single active `continue_workflow` session is running (the primary use case)
47
+
48
+ **Anti-goals:**
49
+ - Do not just add a client-side debounce timer
50
+ - Do not remove SSE-driven live updates
51
+ - Do not reduce information density of the worktrees view
52
+ - Do not require rewriting the session persistence layer
53
+
54
+ ---
55
+
56
+ ## Landscape Packet
57
+
58
+ ### Current State Summary
59
+
60
+ The system has three layers:
61
+ 1. **Server-side watch** (`console-routes.ts`): `fs.watch` on `~/.workrail/sessions/` with a 200ms debounce. Every file write (session events, snapshots, recaps) triggers the watcher. A single `continue_workflow` call writes 2-4 files.
62
+ 2. **Client-side SSE consumer** (`hooks.ts`): `useWorkspaceEvents()` subscribes to `/api/v2/workspace/events` and calls `queryClient.invalidateQueries` for both `['sessions']` and `['worktrees']` on every `change` event. `staleTime` on `useWorktreeList` is 20s, but `invalidateQueries` bypasses it by marking the entry stale before the timer fires.
63
+ 3. **Worktree enrichment** (`worktree-service.ts`): `getWorktreeList()` reads repo roots from sessions (no TTL on active sessions, 60s TTL on the repo root set). For each repo root, runs `git worktree list --porcelain` then for each worktree runs 6 git commands in parallel via `Promise.allSettled`. No concurrency cap across repos or worktrees.
64
+
65
+ ### Existing Approaches / Precedents
66
+
67
+ **VS Code Source Control / GitLens:**
68
+ - Use `fs.watch` on `.git/` directory only (not recursive on the whole worktree).
69
+ - Maintain a git status cache keyed by repo root.
70
+ - Debounce the watcher with 400-800ms delays.
71
+ - Only re-scan dirty repos (tracking which repos have pending changes).
72
+ - Do NOT invalidate git status on every file write in the workspace.
73
+
74
+ **LazyGit / tig:**
75
+ - Poll on a fixed interval (configurable, default 2-4s).
76
+ - Do not react to every file change.
77
+ - Separate "refresh" (full rescan) from "watch" (detect dirty state cheaply).
78
+
79
+ **Tower / Sourcetree (GUI git clients):**
80
+ - Maintain a persistent background process per repo.
81
+ - Use a dedicated watcher per `.git/HEAD` and `.git/refs/` to detect branch/commit changes.
82
+ - Separate worktree file status from branch/commit status.
83
+
84
+ **React Query patterns:**
85
+ - `invalidateQueries` is designed for explicit user actions or focused events (e.g., "the user just did X that would change Y").
86
+ - For background sync, the correct pattern is `refetchInterval` + `staleTime`, not invalidation on every external event.
87
+ - Invalidation is appropriate when the event type is semantically tied to the data type. A session write event is NOT semantically tied to worktree git status.
88
+
89
+ **SSE patterns for developer tooling:**
90
+ - The standard pattern is to send typed/scoped events, not a generic `change` broadcast. Scoped events let subscribers react only to what concerns them.
91
+ - e.g., `{ type: "session:updated", sessionId: "..." }` vs `{ type: "change" }`.
92
+ - Generic `change` events force every subscriber to decide whether their data is affected, with no information to make that decision correctly.
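The difference between scoped and generic events can be sketched as a discriminated union. Only the `type` strings and the `sessionId` / `repoRoot` fields come from the text; `WorkspaceEvent`, `parseWorkspaceEvent`, and `queryKeysToInvalidate` are hypothetical names, and the wire format is assumed to be JSON.

```typescript
// Hypothetical typed event shapes for the SSE stream.
type WorkspaceEvent =
  | { type: "session:updated"; sessionId: string }
  | { type: "worktree:dirty"; repoRoot: string };

type QueryKey = readonly string[];

// Each event type maps to exactly the query keys it can affect.
function queryKeysToInvalidate(event: WorkspaceEvent): QueryKey[] {
  switch (event.type) {
    case "session:updated":
      return [["sessions"]];
    case "worktree:dirty":
      return [["worktrees"]];
  }
}

// Parse an SSE data payload; anything unrecognised (including the generic
// { type: "change" } shape) yields null and therefore invalidates nothing.
function parseWorkspaceEvent(raw: string): WorkspaceEvent | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const rec = data as Record<string, unknown>;
  const sessionId = rec["sessionId"];
  const repoRoot = rec["repoRoot"];
  if (rec["type"] === "session:updated" && typeof sessionId === "string") {
    return { type: "session:updated", sessionId };
  }
  if (rec["type"] === "worktree:dirty" && typeof repoRoot === "string") {
    return { type: "worktree:dirty", repoRoot };
  }
  return null;
}
```

With this shape a subscriber no longer has to guess whether its data is affected: a session write can only ever produce the `['sessions']` key.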
93
+
94
+ ### Option Categories
95
+
96
+ Four broad option categories emerge:
97
+
98
+ **A. Fix the SSE semantics (event scoping):**
99
+ Replace the generic `change` event with typed, scoped events (`session:updated`, `worktree:dirty`). The worktrees view only invalidates when a worktree-specific event arrives, not on session writes.
100
+
101
+ **B. Fix the query invalidation strategy (decouple sessions from worktrees):**
102
+ Do not invalidate `['worktrees']` on SSE events at all. Let worktrees be governed by `refetchInterval` alone (e.g., 60s). Session events only invalidate `['sessions']`. The worktrees view becomes "near-realtime" (60s lag) rather than "instant."
103
+
104
+ **C. Fix the git subprocess fan-out (worktree data model):**
105
+ Cache git enrichment results per worktree with a TTL. Only re-enrich worktrees that have actually changed (by comparing HEAD hash or index mtime). Bound concurrency with a semaphore. Separate expensive enrichment (git log, status, ahead/behind) from cheap existence checks.
106
+
107
+ **D. Combine A + C (scoped events + server-side git caching):**
108
+ Server knows which repos/worktrees are dirty because it watches `.git/` directories. It only re-enriches dirty repos when a worktree-scoped event fires. Clients get typed events and only refetch worktrees when the server says something relevant changed.
109
+
110
+ ### Contradictions / Disagreements
111
+
112
+ 1. **The 200ms debounce exists but is insufficient:** The debounce collapses rapid writes, but each worktrees refetch takes 12.5s, and session writes from the calling agent keep arriving while it is in flight, each producing a new SSE event. The debounce is therefore circumvented at a higher level. A longer debounce (e.g., 2s) would reduce frequency but not eliminate the loop.
113
+
114
+ 2. **`staleTime: 20_000` was designed to prevent thrash but is bypassed:** The intent was clearly to limit worktrees fetches to once per 20s, but `invalidateQueries` bypasses stale time. This is a usage error in the hook, not a React Query limitation.
115
+
116
+ 3. **The worktrees view conflates two different timescales:** Session events are high-frequency (every continue_workflow call). Git worktree state changes are low-frequency (developer switches branches, commits, etc.). Coupling them in the same invalidation event creates an impedance mismatch.
117
+
118
+ 4. **The stale zillow-android-2 session is a symptom, not a root cause:** Even with 1 repo and 5 worktrees, the loop would still exist. The 79-worktree case makes the cost visible but removing that session is a band-aid.
119
+
120
+ ### Evidence Gaps
121
+
122
+ 1. How does the actual CPU usage break down? Is the 140% from git subprocess spawning, Node.js event loop overhead from SSE broadcasts, or I/O wait from the 12.5s requests queuing up?
123
+ 2. Is there a way to detect when a worktree's git state has actually changed without running all 6 git commands? (e.g., watching `.git/refs/` and `.git/HEAD`)
124
+ 3. How many concurrent console clients are typically active? (1 vs 5 changes the SSE broadcast impact significantly)
125
+
126
+ ---
127
+
128
+ ## Problem Frame Packet
129
+
130
+ ### Users / Stakeholders
131
+
132
+ - **Primary user:** The developer running workrail locally while actively using it with an AI agent (Claude Code). They want to see live session progress and worktree state as they work.
133
+ - **Secondary user:** The developer checking the console dashboard between sessions. They want a quick overview of all their work across repos.
134
+ - **System stakeholder:** The MCP server itself. When the console consumes 140% CPU, the MCP server is starved of resources, degrading the actual agent execution that the console is meant to observe.
135
+
136
+ ### Jobs / Goals / Outcomes
137
+
138
+ - **See session progress live** (high frequency need - every continue_workflow step)
139
+ - **See which worktrees are active, dirty, ahead of main** (low frequency need - changes on developer action)
140
+ - **Not have the developer tooling destroy the developer's machine performance** (system-level need)
141
+
142
+ ### Pains / Tensions / Constraints
143
+
144
+ - **Tension 1:** Reactivity vs. cost. The most reactive system (re-fetch everything on every event) is also the most expensive. The goal is targeted reactivity - fast for what matters, lazy for what doesn't.
145
+ - **Tension 2:** Server simplicity vs. correctness. The simplest server-side fix (just adding a longer debounce or a server-side rate limiter on the worktrees endpoint) doesn't fix the root cause - the events and queries are semantically mismatched.
146
+ - **Tension 3:** The worktrees endpoint does real work (git). Other queries (sessions list) are cheap. Treating them identically in the invalidation strategy is wrong.
147
+
148
+ ### Success Criteria
149
+
150
+ 1. A single active `continue_workflow` session does not cause CPU to exceed 20% (down from 140%)
151
+ 2. Session list updates still appear within 1-2 seconds of a session state change
152
+ 3. Worktree status is still live (updates within 30-60s of a developer action, or faster if the server can detect the change cheaply)
153
+ 4. The solution works correctly with 100+ worktrees across multiple repos without degradation
154
+ 5. No regressions: stale session cleanup, session detail view, node detail view all still work
155
+
156
+ ### Assumptions
157
+
158
+ - The console runs locally, co-located with the MCP server. Latency is not a concern.
159
+ - There is typically 1 console client connected at a time (single developer use case).
160
+ - Git worktree state changes at developer-action frequency (minutes to hours), not session-event frequency (seconds).
161
+ - The sessions directory watch is the right mechanism for session updates; the question is only what actions it should trigger.
162
+
163
+ ### Reframes / HMW Questions
164
+
165
+ **Reframe 1:** "How might we make the worktrees view update when worktrees actually change, rather than when sessions change?"
166
+ - This reframe surfaces that the current coupling (session event -> worktrees refetch) is incorrect at the semantic level. Sessions and worktrees are different entities with different update rates.
167
+
168
+ **Reframe 2:** "How might we make the server do less work per request rather than making the client ask less often?"
169
+ - This reframe surfaces the server-side caching angle. Even with the SSE loop fixed, a 12.5s worktrees endpoint is too slow for a responsive console. The fix needs to address both the trigger frequency AND the per-request cost.
170
+
171
+ ### What Would Make This Framing Wrong
172
+
173
+ - If it turns out the MCP server CPU is actually from something else entirely (e.g., the `fs.watch` watcher itself has a bug causing recursive firing), then fixing the invalidation chain is a red herring.
174
+ - If git operations on these repos are intrinsically slow for reasons other than concurrency (e.g., large pack files, network mounts), then concurrency capping won't help much.
175
+
176
+ ---
177
+
178
+ ## Phase 2: Decision Shape Synthesis
179
+
180
+ The landscape and framing stories are in agreement on the core issue and converge on a clear decision shape.
181
+
182
+ ### Core Opportunity
183
+
184
+ The system has three coupled bugs that together create a feedback loop. Any single fix reduces harm but doesn't eliminate the loop. A proper architectural fix must address all three, or at minimum break the loop at one point while separately addressing the cost of the remaining path.
185
+
186
+ The dominant insight from the landscape: **the worktrees view and the sessions view should not share an invalidation trigger.** They are semantically different resources that change at different rates. The current design couples them through a single generic `change` event, which is the root cause of the loop.
187
+
188
+ ### Decision Criteria (the winning direction must satisfy all of these)
189
+
190
+ 1. **Breaks the feedback loop permanently** - a session write must not be able to trigger a 12.5s worktrees refetch in a tight loop
191
+ 2. **Maintains session reactivity** - the sessions list and session detail must update within ~2 seconds of a session event
192
+ 3. **Maintains worktree live-ness** - worktrees must update on a reasonable cadence that reflects actual git state changes, not just "slower"
193
+ 4. **Scales to 100+ worktrees** - a large repo with many worktrees must not degrade the console
194
+ 5. **Architecturally correct** - the fix must change the coupling invariants, not just slow down the loop
195
+
196
+ ### Riskiest Assumption
197
+
198
+ The riskiest assumption is that server-side git caching (memoizing enrichment results with a TTL) will be sufficient to make the worktrees endpoint fast enough to serve on a short polling interval without needing the SSE-driven invalidation. If git enrichment is intrinsically slow regardless of caching (e.g., even a single pass of 101 worktrees takes 12.5s and there's no incremental approach), then the caching direction would need to be combined with a complete decoupling of worktrees from SSE.
199
+
200
+ ### Remaining Uncertainty
201
+
202
+ Categorized as **recommendation uncertainty** - the evidence is sufficient to recommend a direction, but there is residual uncertainty about the right caching granularity and TTL values that can only be resolved through implementation.
203
+
204
+ ### Candidate Count Target: 3-4 (STANDARD rigor)
205
+
206
+ ---
207
+
208
+ ## Candidate Directions
209
+
210
+ *(Path is `full_spectrum`, STANDARD rigor. Setup expectations: candidates must reflect both the landscape and the reframing. At least one direction must meaningfully change the coupling model, not just tune parameters.)*
211
+
212
+ ---
213
+
214
+ ### Candidate A: Typed SSE Events + Client-Side Routing (Event Scoping)
215
+
216
+ **Summary:** Replace the generic `{ type: "change" }` SSE event with typed, scoped events: `{ type: "session:updated", sessionId: "..." }` for session writes, and a separate `{ type: "worktree:dirty", repoRoot: "..." }` event that the server only emits when it detects actual git state changes (by watching `.git/HEAD` and `.git/refs/` directories per known repo). The client routes each event type to the appropriate query invalidation.
217
+
218
+ **Why it fits the path:**
219
+ Directly addresses the semantic mismatch identified in the reframe. Sessions and worktrees become independently reactive. The server controls which events fire and when, so the client cannot accidentally over-invalidate.
220
+
221
+ **Strongest evidence for it:**
222
+ This is the standard pattern used by VS Code, GitLens, and other dev tools. Typed events are more informative, more extensible, and prevent the coupling by construction. The landscape showed that generic `change` events force every subscriber to decide whether their data is affected - typed events eliminate that decision.
223
+
224
+ **Strongest risk against it:**
225
+ The server now needs to watch `.git/HEAD` and `.git/refs/` per known repo. This adds watcher instances that need lifecycle management. If the known-repos set changes (new session from a new repo), the server needs to add a watcher dynamically. This adds complexity to `console-routes.ts`.
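The lifecycle concern can be made concrete with a small reconciliation sketch. `RepoWatcherRegistry` and its methods are hypothetical and not part of the current codebase. The watcher factory is injected, so the sketch does not assume how `.git/HEAD` and `.git/refs/` would actually be watched; in Node it could wrap `fs.watch`, whose return value has a `close()` method.

```typescript
// Anything that can be stopped; e.g. the object returned by Node's fs.watch.
interface Closable {
  close(): void;
}

// Hypothetical registry that keeps one watcher per known repo root.
class RepoWatcherRegistry {
  private readonly watchers = new Map<string, Closable>();
  private readonly startWatcher: (repoRoot: string) => Closable;

  constructor(startWatcher: (repoRoot: string) => Closable) {
    this.startWatcher = startWatcher;
  }

  // Reconcile running watchers with the current known-repos set:
  // close watchers for roots that disappeared, start watchers for new roots.
  sync(knownRepoRoots: readonly string[]): void {
    const wanted = new Set(knownRepoRoots);
    this.watchers.forEach((watcher, root) => {
      if (!wanted.has(root)) {
        watcher.close();
        this.watchers.delete(root);
      }
    });
    wanted.forEach((root) => {
      if (!this.watchers.has(root)) {
        this.watchers.set(root, this.startWatcher(root));
      }
    });
  }

  // Stop everything, e.g. on server shutdown.
  closeAll(): void {
    this.watchers.forEach((watcher) => watcher.close());
    this.watchers.clear();
  }

  watchedRoots(): string[] {
    return Array.from(this.watchers.keys()).sort();
  }
}
```

Calling `sync` whenever the known-repos set is recomputed keeps the add and remove paths in one place, which is one way to contain the complexity this risk describes.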
226
+
227
+ **When it wins:**
228
+ When the primary goal is architectural correctness and the team has tolerance for the additional server-side watcher complexity. Best when the console's role as a developer tool (not an ops dashboard) is emphasized - developers trigger git state changes deliberately, and the server can detect them cheaply.
229
+
230
+ ---
231
+
232
+ ### Candidate B: Decouple Worktrees from SSE (Polling-Only for Worktrees)
233
+
234
+ **Summary:** Stop invalidating `['worktrees']` on SSE events entirely. Session events only affect `['sessions']`. The worktrees query is governed solely by `refetchInterval: 60_000` (or a configurable value). Separately, fix the git subprocess concurrency with a bounded semaphore (e.g., max 8 concurrent git processes). The feedback loop is broken because the trigger that caused it (SSE -> worktrees invalidation) is removed.
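A minimal sketch of the decoupled handler follows. `InvalidatingClient` is a hypothetical stand-in for the small part of the React Query client used here, with the array-argument call form quoted elsewhere in this document; `onWorkspaceChange` and `worktreeQueryOptions` are also illustrative names, not the actual contents of `hooks.ts`.

```typescript
// Minimal stand-in for the query client; only the call used below is modelled.
interface InvalidatingClient {
  invalidateQueries(queryKey: readonly string[]): void;
}

// Worktrees are governed by time alone. The staleTime value is the one quoted
// in this document; the refetchInterval is Candidate B's proposed cadence.
const worktreeQueryOptions = {
  queryKey: ["worktrees"] as const,
  staleTime: 20_000,
  refetchInterval: 60_000,
};

// SSE `change` handler under Candidate B: session data only.
function onWorkspaceChange(client: InvalidatingClient): void {
  client.invalidateQueries(["sessions"]);
  // Deliberately no invalidateQueries(["worktrees"]) here: that call is the
  // edge of the feedback loop this candidate removes.
}
```

Because the worktrees key is never touched by the handler, no number of session writes can shorten the interval between worktrees fetches.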
235
+
236
+ **Why it fits the path:**
237
+ The simplest intervention that breaks the loop. Respects the reality that worktrees change at developer-action frequency, not session-event frequency. The reframe "make worktrees update when worktrees change" is satisfied by removing the incorrect trigger.
238
+
239
+ **Strongest evidence for it:**
240
+ The terminal git clients in the landscape (LazyGit, tig) poll on a fixed interval rather than reacting to every file change. A 60s polling interval with concurrency capping would replace the sustained 140% CPU with a brief spike once per minute. LazyGit's short default refresh (2-4s, per the landscape above) is viable only because it runs a very cheap "is anything dirty?" check first.
241
+
242
+ **Strongest risk against it:**
243
+ Worktrees feel "stale" when a developer switches branches or commits while the console is open. The 60s lag is acceptable for passive observation but feels wrong for active use (e.g., the developer wants to see their new commit reflected immediately). The solution partially degrades the "live" feel of the worktrees panel.
244
+
245
+ **When it wins:**
246
+ When simplicity of implementation is paramount and the user accepts a 60s polling cadence for worktrees. This is the minimum viable fix that breaks the loop. It's also a necessary precondition for any other option - even if Candidate A is chosen, the worktrees query should not be on SSE without server-side filtering.
247
+
248
+ ---
249
+
250
+ ### Candidate C: Server-Side Git Cache with Incremental Enrichment
251
+
252
+ **Summary:** Add a server-side git enrichment cache keyed by `(repoRoot, branch, headHash)`. After the first full enrichment of a worktree, subsequent requests for the same HEAD commit return the cached result immediately (since git log and ahead/behind are deterministic for a given commit; working-tree status is not, and is covered under the risk below). Only worktrees where HEAD or index mtime has changed trigger re-enrichment. The worktrees endpoint goes from 12.5s to ~50ms for a cache-hit scenario. With fast responses, the worktrees query can be invalidated on session events without causing a spiral (since each invalidation is cheap).
253
+
254
+ **Why it fits the path:**
255
+ Addresses the "server does too much work per request" dimension that the reframe surfaced. With a fast worktrees endpoint, the cost of occasional over-invalidation becomes negligible. This is an architectural fix to the data model layer.
256
+
257
+ **Strongest evidence for it:**
258
+ Git log and ahead/behind are deterministic for a given HEAD commit (ahead/behind additionally assumes the ref being compared against has not moved). For that commit-level data, a cache keyed on HEAD hash is correct by construction - stale data cannot appear as long as HEAD changes are detected. VS Code uses exactly this pattern: caches git decorations per commit hash.
259
+
260
+ **Strongest risk against it:**
261
+ `git status --short` is NOT deterministic for a given HEAD - uncommitted working directory changes are not captured by HEAD hash. So the status (dirty/clean, changed files) would need a separate, cheaper invalidation mechanism (file watcher on the worktree directory, or a short TTL of 5-10s). This splits the cache into two tiers: commit-level (permanent) and working-directory-level (short TTL). Complexity increases.
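The two tiers can be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: `TwoTierWorktreeCache` and its method names are hypothetical, the loaders stand in for the real git calls (and are synchronous only to keep the sketch short), and the clock is injected so the TTL behaviour can be checked without waiting.

```typescript
// Hypothetical two-tier cache: commit-level data is kept per
// (repoRoot, branch, headHash); working-directory status expires after a TTL.
class TwoTierWorktreeCache<CommitInfo, StatusInfo> {
  private readonly commitTier = new Map<string, CommitInfo>();
  private readonly statusTier = new Map<string, { value: StatusInfo; expiresAt: number }>();
  private readonly statusTtlMs: number;
  private readonly now: () => number;

  constructor(statusTtlMs: number, now: () => number = Date.now) {
    this.statusTtlMs = statusTtlMs;
    this.now = now;
  }

  // Commit tier: a new headHash produces a new key, so a moved HEAD misses.
  commitInfo(repoRoot: string, branch: string, headHash: string, load: () => CommitInfo): CommitInfo {
    const key = JSON.stringify([repoRoot, branch, headHash]);
    if (this.commitTier.has(key)) return this.commitTier.get(key) as CommitInfo;
    const value = load();
    this.commitTier.set(key, value);
    return value;
  }

  // Status tier: uncommitted changes are invisible to headHash, so entries
  // are only trusted for statusTtlMs.
  status(worktreePath: string, load: () => StatusInfo): StatusInfo {
    const hit = this.statusTier.get(worktreePath);
    if (hit !== undefined && hit.expiresAt > this.now()) return hit.value;
    const value = load();
    this.statusTier.set(worktreePath, { value, expiresAt: this.now() + this.statusTtlMs });
    return value;
  }
}
```

The sketch never evicts commit-tier entries and ignores movement of the base ref used for ahead/behind; a real version would need to handle both.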
262
+
263
+ **When it wins:**
264
+ When high-frequency worktree refreshes (< 5s) are needed and the team has the appetite to implement a two-tier cache. Best paired with Candidate A (typed events) so the cache is only invalidated when real changes occur.
265
+
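The two-tier split described above can be sketched roughly as follows. This is a minimal illustration, not the workrail implementation: the class name `TwoTierGitCache`, its method names, and the synchronous `compute` callbacks are all hypothetical, and a real version would wrap async git calls.

```typescript
// Hypothetical sketch of a two-tier cache; none of these names exist in the codebase.
class TwoTierGitCache<C, S> {
  // Tier 1: commit-level data keyed by (repoRoot, branch, headHash). Never expires.
  private readonly commitTier = new Map<string, C>();
  // Tier 2: working-directory status keyed by worktree path. Expires after a short TTL.
  private readonly statusTier = new Map<string, { value: S; expiresAt: number }>();
  private readonly statusTtlMs: number;
  private readonly now: () => number;

  constructor(statusTtlMs: number, now: () => number = () => Date.now()) {
    this.statusTtlMs = statusTtlMs;
    this.now = now;
  }

  getCommitInfo(repoRoot: string, branch: string, headHash: string, compute: () => C): C {
    const key = JSON.stringify([repoRoot, branch, headHash]);
    if (!this.commitTier.has(key)) this.commitTier.set(key, compute());
    return this.commitTier.get(key) as C;
  }

  getStatus(worktreePath: string, compute: () => S): S {
    const hit = this.statusTier.get(worktreePath);
    if (hit && hit.expiresAt > this.now()) return hit.value;
    const value = compute();
    this.statusTier.set(worktreePath, { value, expiresAt: this.now() + this.statusTtlMs });
    return value;
  }
}
```

A new HEAD hash produces a new key, so commit-level entries never need explicit invalidation; only the status tier relies on the TTL (or a file watcher, if one replaces it).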
266
+ ---
267
+
268
+ ### Candidate D: Compound Fix - Decouple + Concurrency Cap + Session-Only SSE (Selected)
269
+
270
+ **Summary:** Three targeted changes, plus one optional cleanup, that together eliminate the loop at every layer:
271
+ 1. **SSE scoping** (`console-routes.ts`): The SSE `change` event is still generic, but the client only invalidates `['sessions']` in response to it (not `['worktrees']`). This breaks the loop immediately.
272
+ 2. **Worktrees polling** (`hooks.ts`): `useWorktreeList` keeps its `refetchInterval: 30_000` but `useWorkspaceEvents` stops invalidating `['worktrees']`. The `staleTime` of 20s is actually respected now.
273
+ 3. **Concurrency cap** (`worktree-service.ts`): Add a simple semaphore (max 8 concurrent git subprocesses) to `enrichWorktree`. This bounds the per-request fan-out from 606 concurrent subprocesses to 8 at a time. Estimated cost: 101 worktrees x 6 commands = 606 commands; 606 / 8 concurrent = ~76 sequential batches; 76 x ~50ms avg = ~3.8s per full request (vs ~12.5s today).
274
+ 4. **Stale session cleanup** (optional but high leverage): The `remembered-roots-store` should evict roots from stale/completed sessions older than N days. This is a separate concern but reduces the worktree count from 101 to something more reasonable.
275
+
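The batch arithmetic in item 3 can be reproduced with a small helper. This is only a back-of-envelope model: the function name is made up for illustration, and it assumes every git command takes the same ~50ms and that batches run strictly back to back, which real subprocesses will not do exactly.

```typescript
// Hypothetical helper: estimates request time under a concurrency cap.
function estimateEnrichmentMs(
  worktrees: number,
  commandsPerWorktree: number,
  maxConcurrent: number,
  avgCommandMs: number,
): number {
  const totalCommands = worktrees * commandsPerWorktree;
  // Commands run in waves of at most `maxConcurrent`.
  const batches = Math.ceil(totalCommands / maxConcurrent);
  return batches * avgCommandMs;
}

// The figures quoted above: 101 worktrees, 6 commands each, cap of 8, ~50ms per command.
const today = estimateEnrichmentMs(101, 6, 8, 50); // 3800

// Illustrative only: if the 79 stale worktrees were evicted, 22 would remain.
const afterCleanup = estimateEnrichmentMs(22, 6, 8, 50); // 850
```

Under this model the optional stale-root cleanup has more leverage on request time than the cap itself, since cost scales linearly with the worktree count.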
276
+ **Why it fits the path:**
277
+ Satisfies all 5 decision criteria. Breaks the feedback loop (change 1). Maintains session reactivity (SSE still fires for session events). Maintains worktree live-ness via polling (30s cadence). Scales better (concurrency cap). Architecturally correct (the coupling invariant is changed: sessions and worktrees are decoupled in the invalidation model).
278
+
279
+ **Strongest evidence for it:**
280
+ The combination of "correct invalidation semantics" (change 1) and "bounded concurrency" (change 3) maps directly to what the landscape showed: no production tool couples session writes to worktree git fetches. The 30s polling interval for worktrees is consistent with what Tower and Sourcetree use in the background.
281
+
282
+ **Strongest risk against it:**
283
+ The 30s polling cadence may feel unresponsive for active use. A developer who commits and switches branches wants the console to reflect that promptly. This is mitigated by: (a) the worktrees endpoint being faster due to the concurrency cap, so a manual refresh is snappy; (b) a future addition of Candidate A's typed SSE events for git state changes, which can be layered on top.
284
+
285
+ **What would change my mind between D and A:**
286
+ If the implementation cost of server-side `.git/` watchers is low (it's maybe 20 lines of code per repo), Candidate A is strictly better and should replace the polling fallback in D. The key question is whether the `remembered-roots` set changes frequently enough that managing watcher lifecycle is a real problem or a theoretical one.
287
+
288
+ ---
289
+
290
+ ## Challenge Notes
291
+
292
+ **Strongest argument against the leading option (D):**
293
+
294
+ The 30s polling for worktrees is not actually a full fix - it's a regression in live-ness. If a developer is actively working (frequent commits, branch switches) the console will feel sluggish. The "real" fix requires the server to know when git state has changed. Candidate D essentially says "we'll check less often" rather than "we'll check at the right time."
295
+
296
+ Counter-argument: The current state is 140% CPU and a broken console. A 30s poll that is always correct is better than a "live" console that brings the machine to a halt. More importantly, the polling fix is independently valuable and is a precondition for everything else. Even if Candidate A is the long-term target, Candidate D is the right first ship. The loop fix (removing worktrees from SSE invalidation) is the core change; the concurrency cap is risk reduction; the polling cadence can be tightened later with the server-side watcher.
297
+
298
+ **Adversarial challenge: what if the framing is wrong?**
299
+
300
+ The framing assumes the CPU cost is from git subprocess fan-out. But what if the real cost is from the `fs.watch` watcher misfiring - e.g., the watcher itself has a bug on macOS where it fires continuously even without writes?
301
+
302
+ Check: the code shows a 200ms debounce is already in place. If the watcher misfired continuously, the debounce would absorb it. The problem description says "every continue_workflow call writes 2+ files" - this is the trigger, not a watcher bug. The framing is correct.
303
+
304
+ **What challenge pressure changed:**
305
+ The challenge confirmed that the polling regression is real but acceptable as a first step. It also sharpened the recommendation to call out "Candidate A as a natural follow-on" explicitly in the handoff rather than presenting D as the final answer.
306
+
307
+ ---
308
+
309
+ ## Decision Log
310
+
311
+ **Winner: Candidate D (Compound Fix)**
312
+ **Runner-up: Candidate A (Typed SSE Events)**
313
+
314
+ **Why D won:**
315
+ - Breaks the feedback loop with the minimum number of changes
316
+ - All three changes are independent and can be shipped separately
317
+ - The concurrency cap is valuable independent of the loop fix (protects against large repos)
318
+ - The session/worktree decoupling is architecturally correct and does not degrade any existing functionality
319
+ - Can be implemented in hours, not days
320
+
321
+ **Why A lost (runner-up):**
322
+ - Server-side `.git/` watchers per known repo are the right long-term model but add lifecycle complexity
323
+ - Requires adding watcher management for a dynamic set of repos (as sessions discover new repos)
324
+ - The value of A is captured by adding it on top of D in a second pass
325
+ - A is strictly better for user experience (true live worktree updates) but D is better as the immediate fix
326
+
327
+ **Accepted tradeoffs:**
328
+ - Worktrees view will have a 30s refresh cadence instead of being SSE-driven
329
+ - The concurrency cap (8 processes) means the first cold request after restart may take longer to complete than a fully parallel request, but it won't saturate the CPU
330
+
331
+ **Identified failure modes:**
332
+ - If the sessions directory watch fires for reasons other than session writes (e.g., temp files from other processes), the loop could re-emerge. Mitigation: filter the watcher to only fire on `.jsonl` file changes.
333
+ - If the semaphore implementation is buggy (e.g., never releases), the worktrees endpoint hangs. Mitigation: use a well-tested semaphore pattern with a timeout.
334
+ - The stale session problem (79 zillow-android-2 worktrees) is not fixed by D. Even with concurrency capping, 101 worktrees takes ~3.8s. This needs a separate `remembered-roots` TTL or explicit session eviction.
335
+
336
+ **Switch triggers:**
337
+ - If after implementing D the worktrees view still feels unresponsive enough to impair daily use, add Candidate A's `.git/` watchers to enable typed SSE events for worktree state changes.
338
+ - If the worktrees endpoint is still slow after concurrency capping due to inherently slow git on the large repo, add Candidate C's per-worktree cache keyed on HEAD hash.
339
+
340
+ ---
341
+
342
+ ## Resolution Notes
343
+
344
+ **Resolution mode:** `direct_recommendation`
345
+
346
+ **Confidence band:** High (85-90%)
347
+
348
+ The three-part fix is grounded in:
349
+ - Direct code reading (not assumptions) of all three affected files
350
+ - Established precedent from VS Code, GitLens, and git GUI tools
351
+ - Clear causal chain from the problem description to the fix
352
+ - Independent testability of each fix
353
+
354
+ **Residual risks:**
355
+ 1. Stale session problem (79 worktrees) needs separate attention - not addressed by the concurrency cap alone
356
+ 2. The `.git/` watcher approach (Candidate A) is the right long-term model and should be tracked as a follow-on
357
+ 3. `fs.watch` reliability: the current watcher uses `{ recursive: true }`, which is not supported uniformly across platforms (historically macOS and Windows only, with Linux support added in newer Node.js releases). The fix in change 1 (stopping worktrees invalidation) makes the watcher's behavior less critical, but the underlying reliability concern remains.
358
+
359
+ ---
360
+
361
+ ## Final Summary
362
+
363
+ ### Selected Path
364
+ `full_spectrum` - because the problem required both landscape grounding (what do real git tools do?) and reframing (are sessions and worktrees the right reactivity unit?).
365
+
366
+ ### Problem Framing
367
+ A feedback loop caused by semantic mismatch: session writes (high frequency, every continue_workflow step) are coupled to worktree git status fetches (should be low frequency, developer-action-triggered) through a generic SSE `change` event and an unconditional `invalidateQueries` call.
368
+
369
+ ### Landscape Takeaways
370
+ - No production git tool couples session writes to git status fetches
371
+ - All live git tools either poll on an interval (LazyGit, tig) or watch `.git/HEAD` and `.git/refs/` specifically (VS Code, GitLens)
372
+ - `invalidateQueries` is for explicit user actions, not background sync; `staleTime` + `refetchInterval` is the correct React Query pattern for background data
373
+ - Typed SSE events are the standard for live developer tools; generic `change` events are an antipattern
374
+
375
+ ### Chosen Direction: Candidate D (Compound Fix)
376
+
377
+ **Three changes, each independently valuable:**
378
+
379
+ **Change 1 - Break the loop (console/src/api/hooks.ts):**
380
+ ```typescript
381
+ // Remove this line from useWorkspaceEvents():
382
+ void queryClient.invalidateQueries({ queryKey: ['worktrees'] });
383
+ // Keep only:
384
+ void queryClient.invalidateQueries({ queryKey: ['sessions'] });
385
+ ```
386
+ This alone breaks the feedback loop. Worktrees are now governed solely by `refetchInterval`.
387
+
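With the invalidation removed, the worktrees query is refreshed only by its own timing options. A rough sketch of those options, using the 20s `staleTime` and 30s `refetchInterval` quoted in the Candidate D summary; the constant name `worktreeQueryTiming` is hypothetical, and the real `useWorktreeList` hook may be structured differently.

```typescript
// Hypothetical constant; the values are the ones quoted in this document.
const worktreeQueryTiming = {
  queryKey: ['worktrees'] as const,
  staleTime: 20_000,       // data is treated as fresh for 20s
  refetchInterval: 30_000, // background poll every 30s
};
```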
388
+ **Change 2 - Bound git concurrency (src/v2/usecases/worktree-service.ts):**
389
+ Add a process-level semaphore capping concurrent git subprocesses to 8. The `enrichWorktree` function currently runs 6 git commands in parallel per worktree; with 101 worktrees all enriched simultaneously, that's 606 concurrent processes. A semaphore wrapper around `enrichWorktree` (or around the `git()` helper) bounds this.
390
+
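One possible shape for the semaphore in Change 2, as a minimal sketch. The `Semaphore` class and the `gitSemaphore` name are assumptions for illustration, not code from `worktree-service.ts`, and the timeout mentioned under the failure modes is deliberately left out to keep the example short.

```typescript
// Minimal counting semaphore; a sketch, not the project's implementation.
class Semaphore {
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly max: number;

  constructor(max: number) {
    this.max = max;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active < this.max) {
      this.active++;
    } else {
      // Park until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) next();   // transfer the slot directly to the next waiter
      else this.active--; // nobody waiting: free the slot
    }
  }
}

// Hypothetical usage: one process-wide instance wrapping each git subprocess call.
const gitSemaphore = new Semaphore(8);
// const out = await gitSemaphore.run(() => git(['status', '--short'], cwd));
```

Releasing in `finally` means a failing git command still frees its slot, which covers the "never releases" failure mode. Wrapping the `git()` helper rather than `enrichWorktree` caps subprocesses directly, since each worktree enrichment spawns six of them.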
391
+ **Change 3 - Filter SSE watcher events (src/v2/usecases/console-routes.ts):**
392
+ The `fs.watch` callback currently fires for any file change. Filter to only broadcast when the change is to a `.jsonl` file (session event log), not every temp file write. This reduces SSE noise.
393
+
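The filter in Change 3 could be as small as a predicate applied inside the watcher callback. A sketch under assumptions: `shouldBroadcast` is a hypothetical name, and the broadcast call in the usage comment stands in for whatever `console-routes.ts` actually does.

```typescript
// Hypothetical predicate: only session event logs should trigger an SSE broadcast.
// fs.watch may report a null filename on some platforms, so that case is rejected too.
function shouldBroadcast(filename: string | null): boolean {
  return filename !== null && filename.endsWith('.jsonl');
}

// Usage sketch inside the existing watcher callback (names are illustrative):
// fs.watch(sessionsDir, { recursive: true }, (_eventType, filename) => {
//   if (shouldBroadcast(filename)) broadcastChange();
// });
```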
394
+ **Optional Change 4 - Stale session root eviction:**
395
+ Add a TTL (e.g., 30 days) to the `remembered-roots-store` so stale sessions from inactive repos don't permanently inflate the worktree count.
396
+
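A TTL eviction pass might look like the following. The `RememberedRoot` shape (a path plus a last-seen timestamp) and the function name are assumptions; the actual `remembered-roots-store` may track different fields.

```typescript
// Hypothetical record shape; the real store may differ.
interface RememberedRoot {
  path: string;
  lastSeenMs: number; // epoch milliseconds of the most recent session activity
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns only the roots seen within the last `ttlDays` days.
function evictStaleRoots(
  roots: readonly RememberedRoot[],
  nowMs: number,
  ttlDays: number,
): RememberedRoot[] {
  const cutoff = nowMs - ttlDays * MS_PER_DAY;
  return roots.filter((root) => root.lastSeenMs >= cutoff);
}
```

Passing `nowMs` in explicitly keeps the function pure, so the eviction rule can be tested without mocking the clock.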
397
+ ### Strongest Alternative
398
+ Candidate A (Typed SSE Events + server-side `.git/` watchers per repo). It's strictly better for user experience (true live worktree updates instead of 30s polling) but adds lifecycle complexity to the server. It's the right follow-on after D is shipped.
399
+
400
+ **Why it lost:** Implementation complexity of dynamic watcher management per known repo, and the loop fix in Change 1 is needed regardless. D is the right immediate ship; A is the right next evolution.
401
+
402
+ ### Confidence Band
403
+ High (85-90%). All three fixes are grounded in direct code reading. The main residual uncertainty is whether the stale session problem (101 worktrees, 79 from zillow-android-2) needs addressing alongside the loop fix for the console to feel good in practice.
404
+
405
+ ### Residual Risks
406
+ 1. Stale session problem: 101 worktrees even with concurrency capping means ~3.8s per full worktrees request. Tracked separately.
407
+ 2. Candidate A (`.git/` watchers) is the right long-term model - should be a follow-on ticket.
408
+ 3. `fs.watch({ recursive: true })` has platform quirks on macOS/Linux - consider a future migration to a more reliable file watching library (chokidar).
409
+
410
+ ### Next Actions
411
+ 1. **Immediate:** Implement Change 1 (one-line fix in `hooks.ts`). This alone breaks the loop.
412
+ 2. **Same PR:** Implement Change 2 (concurrency cap in `worktree-service.ts`). Protects against large repos.
413
+ 3. **Same PR:** Implement Change 3 (filter watcher to `.jsonl` only). Reduces SSE noise.
414
+ 4. **Follow-on ticket:** Add TTL to `remembered-roots-store` to evict stale session roots.
415
+ 5. **Future ticket:** Implement Candidate A's typed SSE events (`session:updated`, `worktree:dirty`) with server-side `.git/` directory watchers for true live worktree reactivity.
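
For the future ticket in item 5, typed events let the client decide which query to invalidate per event instead of reacting to a generic `change`. A minimal sketch: the event names come from the item above, while the payload fields and the `queryKeysFor` function are hypothetical.

```typescript
// Event names are from the proposal; payload fields are assumptions for illustration.
type ConsoleEvent =
  | { type: 'session:updated'; sessionId: string }
  | { type: 'worktree:dirty'; repoRoot: string };

// Maps each typed event to the query keys it should invalidate.
function queryKeysFor(event: ConsoleEvent): string[][] {
  switch (event.type) {
    case 'session:updated':
      return [['sessions']];
    case 'worktree:dirty':
      return [['worktrees']];
  }
}
```

With this mapping a session write can never trigger a worktrees refetch, so the decoupling from Change 1 is preserved while git state changes still reach the client promptly.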