@exaudeus/workrail 3.36.0 → 3.37.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44) hide show
  1. package/dist/config/config-file.js +2 -0
  2. package/dist/console-ui/assets/{index-n8cJrS4v.js → index-o-p__sHJ.js} +1 -1
  3. package/dist/console-ui/index.html +1 -1
  4. package/dist/daemon/workflow-runner.d.ts +1 -0
  5. package/dist/daemon/workflow-runner.js +3 -6
  6. package/dist/manifest.json +23 -15
  7. package/dist/trigger/notification-service.d.ts +42 -0
  8. package/dist/trigger/notification-service.js +164 -0
  9. package/dist/trigger/trigger-listener.js +7 -1
  10. package/dist/trigger/trigger-router.d.ts +3 -1
  11. package/dist/trigger/trigger-router.js +4 -1
  12. package/docs/design/agent-behavior-patterns-discovery.md +312 -0
  13. package/docs/design/agent-engine-communication-discovery.md +390 -0
  14. package/docs/design/agent-loop-architecture-alternatives-discovery.md +531 -0
  15. package/docs/design/agent-loop-error-handling-contract.md +238 -0
  16. package/docs/design/complete-step-approach-validation-discovery.md +344 -0
  17. package/docs/design/daemon-stuck-detection-discovery.md +174 -0
  18. package/docs/design/mcp-server-disconnect-discovery.md +245 -0
  19. package/docs/design/mcp-server-epipe-crash.md +198 -0
  20. package/docs/design/notification-design-candidates.md +131 -0
  21. package/docs/design/notification-design-review.md +84 -0
  22. package/docs/design/notification-implementation-plan.md +181 -0
  23. package/docs/design/spawn-agent-failure-modes.md +161 -0
  24. package/docs/design/spawn-agent-result-handling-implementation-plan.md +186 -0
  25. package/docs/design/stdio-simplification-design-candidates.md +341 -0
  26. package/docs/design/stdio-simplification-design-review.md +93 -0
  27. package/docs/design/stdio-simplification-implementation-plan.md +317 -0
  28. package/docs/design/structured-output-tools-coexist-findings.md +288 -0
  29. package/docs/discovery/coordinator-script-design.md +745 -0
  30. package/docs/discovery/coordinator-ux-discovery.md +471 -0
  31. package/docs/discovery/spawn-agent-failure-modes.md +309 -0
  32. package/docs/discovery/workflow-selection-for-discovery-tasks.md +336 -0
  33. package/docs/discovery/worktrain-status-briefing.md +325 -0
  34. package/docs/discovery/worktrain-status-design-candidates.md +202 -0
  35. package/docs/discovery/worktrain-status-design-review-findings.md +86 -0
  36. package/docs/ideas/backlog.md +608 -0
  37. package/docs/ideas/daemon-structured-output-vs-tool-calls.md +344 -0
  38. package/docs/ideas/design-candidates-backlog-consolidation.md +85 -0
  39. package/docs/ideas/design-review-findings-backlog-consolidation.md +39 -0
  40. package/docs/ideas/implementation_plan_backlog_consolidation.md +117 -0
  41. package/docs/plans/authoring-doc-staleness-enforcement-candidates.md +251 -0
  42. package/docs/plans/authoring-doc-staleness-enforcement-review.md +99 -0
  43. package/docs/plans/authoring-doc-staleness-enforcement.md +463 -0
  44. package/package.json +1 -1
@@ -0,0 +1,471 @@
1
+ # Coordinator UX Discovery
2
+
3
+ **Date:** 2026-04-18
4
+ **Status:** In progress (WorkRail wr.discovery session)
5
+
6
+ **Artifact strategy:** This document is for human readability only. Execution truth (findings, decisions, context variables) lives in WorkRail session notes and context fields. If a chat rewind occurs, this file may diverge from the canonical record -- consult session notes for ground truth.
7
+
8
+ ---
9
+
10
+ ## Context / Ask
11
+
12
+ **Stated goal (solution-framed):** What should the first coordinator script template look like from the user's perspective -- how does someone invoke it, what do they see, what does it produce?
13
+
14
+ **Reframed problem (problem-framed):** How can a user kick off a multi-session AI workflow with a single action, stay informed without babysitting it, and receive a structured result -- all with less total effort than running Claude Code sessions manually?
15
+
16
+ **Original solution framing rejected because:** "coordinator script template" prescribes the mechanism without first identifying what pain it solves. A declarative invocation (`worktrain run pr-review`) could satisfy the same need with less user friction than a script the user maintains.
17
+
18
+ ---
19
+
20
+ ## Path Recommendation
21
+
22
+ **Path chosen: `design_first`**
23
+
24
+ Rationale: The goal was a solution-statement. The risk is designing the wrong interface, not lacking enough landscape knowledge. The backlog already contains a detailed vision of the coordinator architecture (Apr 15 entries). The gap is not "what are the options" -- it is "what does the user actually experience, step by step, and where does that differ from what the backlog assumes?"
25
+
26
+ ---
27
+
28
+ ## Constraints / Anti-goals
29
+
30
+ **Core constraints:**
31
+ - Must be better than Claude Code at the specific "review this PR" task (measurably, not just narratively)
32
+ - Must not require the user to write or maintain scripts (that is the developer's job, not the user's)
33
+ - First version must prove coordination works with real data, not a demo
34
+ - Must be buildable against what `worktrain spawn` / `worktrain await` already ship (Tier 3, Apr 18 backlog)
35
+
36
+ **Anti-goals:**
37
+ - Not a general workflow builder / no-code tool
38
+ - Not a replacement for Claude Code on single-session tasks
39
+ - Not dependent on cloud infrastructure or webhooks for the first version
40
+ - Not an analytics dashboard (that is a separate backlog item)
41
+
42
+ ---
43
+
44
+ ## Landscape Packet
45
+
46
+ ### What exists today (Apr 18, 2026)
47
+
48
+ **CLI commands available:**
49
+ - `worktrain init` -- onboarding, setup
50
+ - `worktrain tell <message>` -- queue async message to daemon
51
+ - `worktrain inbox` -- read daemon outbox messages
52
+ - `worktrain spawn -w <workflow> -g <goal> -W <workspace>` -- non-interactive session start, prints session handle
53
+ - `worktrain await -s <handles> [-m all|any] [-t timeout]` -- block until sessions complete, prints JSON results
54
+ - `worktrain daemon [--install|--uninstall|--status]` -- daemon lifecycle
55
+ - `worktrain console` -- standalone console UI
56
+ - `worktrain logs [--follow] [--session <id>]` -- daemon event log reader
57
+ - `worktrain status <sessionId>` -- per-session health summary
58
+
59
+ **What spawn/await enable today:**
60
+ `worktrain spawn` prints a session handle to stdout. `worktrain await` blocks until sessions complete and prints structured JSON results. These two commands are the building blocks for a coordinator script.
61
+
62
+ **The MR review workflow (mr-review-workflow-agentic):**
63
+ - 8+ phases: locate/bound/classify, hypothesis, freeze fact packet, reviewer family bundle (parallel), contradiction resolution, adversarial validation, final recommendation, handoff
64
+ - Produces: severity-graded findings (Critical/Major/Minor/Nit), recommendation (approve/request changes/needs discussion), confidence band, coverage ledger, ready-to-post MR comments
65
+ - Output lives in `notesMarkdown` and context variables -- no automatic post to GitHub
66
+ - `requireConfirmation` gates exist for THOROUGH/High-risk reviews
67
+
68
+ **What WorkTrain cannot do yet (key gaps from backlog):**
69
+ 1. Multi-phase work is invisible -- sessions are flat in console; a 5-session pipeline looks like 5 unrelated sessions
70
+ 2. No coordinator scripts -- spawn/await exist but no coordinator template runs a full pipeline
71
+ 3. No auto-commit -- agents write code but don't commit or open PRs autonomously
72
+ 4. No notifications -- daemon completes work silently
73
+ 5. Assessment gates unreliable -- not yet validated end-to-end
74
+ 6. Subagent delegation invisible -- spawn_agent creates proper child sessions, but workflows still use mcp__nested-subagent__Task for most delegation (invisible black box)
75
+ 7. No artifact store -- agents dump markdown in the repo as a workaround
76
+ 8. Context poverty -- each session starts from scratch, no persistent knowledge graph
77
+
78
+ **Daemon event log format (real session, Apr 18):**
79
+ - Events are JSONL: `{"kind":"session_started","sessionId":"...","workflowId":"...","workspacePath":"...","ts":...}`
80
+ - `tool_called` events show tool name and truncated args summary
81
+ - `session_completed` shows outcome: success/error/timeout
82
+ - Events are correlated by `sessionId` (UUID) -- no human-readable session name
83
+ - `worktrain status <sessionId>` aggregates: LLM turns, step advances, tool calls, issues, last activity
84
+
85
+ **Scripts-first coordinator (backlog Apr 15):**
86
+ The backlog explicitly describes the coordinator as a shell/TypeScript script that uses `worktrain spawn` / `worktrain await` as primitives. The coordinator is NOT an LLM agent -- it is a deterministic script driving a DAG of leaf sessions. This is the architectural commitment already made.
87
+
88
+ **Live status briefings (backlog Apr 15):**
89
+ Vision exists for `worktrain status --workspace` that produces a human-readable briefing (what's running, why, where it is, queue, recently completed, blocked). Not yet implemented.
90
+
91
+ ---
92
+
93
+ ## Problem Frame Packet
94
+
95
+ ### The real gap
96
+
97
+ The user today does the coordinator's job manually:
98
+ 1. Opens a new Claude Code session
99
+ 2. Types "review PR #47"
100
+ 3. Waits ~20-40 minutes
101
+ 4. Reads the output
102
+ 5. Decides what to do next (fix something? re-review? merge?)
103
+
104
+ With a coordinator, steps 1-5 become one command that returns a result. The coordinator handles the sequencing. The user only re-engages when the pipeline produces a result or hits a decision point.
105
+
106
+ **What makes this categorically better than Claude Code:**
107
+ - Parallelism: reviewer families run simultaneously (Claude Code is serial)
108
+ - Persistence: coordinator can restart a failed session without user intervention
109
+ - Structure: output is a graded findings JSON, not a prose wall
110
+ - Routing: coordinator decides "clean -> merge queue, needs fixes -> spawn fix agent" without asking the user
111
+ - History: `worktrain logs --session <id>` shows exactly what happened and why
112
+
113
+ **The "needs human" path:**
114
+ Not every PR review ends cleanly. The coordinator must surface decision points without requiring the user to babysit. Today: silent. The user must poll `worktrain inbox`. Needed: a push notification (even just a Terminal notification or a message written to a designated file the user watches).
115
+
116
+ ---
117
+
118
+ ## Landscape Synthesis
119
+
120
+ **Precedents:**
121
+ 1. Scripts-first coordinator pattern (backlog Apr 15) -- the committed architecture: deterministic shell/TS script calling `worktrain spawn` / `worktrain await`, not an LLM coordinator
122
+ 2. `worktrain status --workspace` vision (backlog Apr 15) -- committed UX vision for a human-readable briefing during execution (not yet implemented)
123
+ 3. Real event log format (daemon JSONL, Apr 18) -- events are UUID-correlated, tool_called with truncated args; the raw material for a feedback layer
124
+
125
+ **Hard constraints grounded in code:**
126
+ - No auto-post to GitHub -- MR review output lives in session notes; posting requires explicit user action
127
+ - No push notifications -- daemon completes work silently; user must poll `worktrain inbox` or watch `worktrain logs`
128
+ - Sessions appear flat in console -- no parent-child grouping for coordinator-spawned sessions
129
+ - `worktrain spawn/await` merged but not yet tested end-to-end in a real coordinator
130
+
131
+ **Contradictions:**
132
+ 1. "Coordinator script template" in the framing implies the user writes or maintains scripts -- but the intended design is that the developer provides a pre-built `coordinator-groom-prs.sh`; the user just invokes it. The template is a developer artifact, not a user artifact.
133
+ 2. The event log is UUID-keyed with no human names -- but a useful feedback loop needs to surface "which PR is being reviewed" not "sess_abc123 advanced step 3". A translation layer is needed.
134
+
135
+ **Evidence gaps:**
136
+ 1. No real-world end-to-end test of `worktrain spawn/await` in a coordinator (known, from backlog)
137
+ 2. No empirical data on full pipeline duration (how long does a 3-5 session MR review take total?)
138
+
139
+ ## Candidate Directions
140
+
141
+ **Generation expectations (must be met for this design_first, full-rigor pass):**
142
+
143
+ 1. **At least one reframing direction** -- not just "a coordinator with better UX" but a direction that challenges the coordinator abstraction itself or inverts the design assumptions
144
+ 2. **Span the invocation spectrum** -- from "user types nothing, daemon just runs" to "user types a named command" to "user runs a script directly"
145
+ 3. **Include the minimal viable single-session option** -- explicitly compare a single `worktrain spawn` with notification against a full coordinator; this tests the riskiest assumption
146
+ 4. **Concrete terminal session transcripts** -- each candidate must include a 3-5 line sample of what the user actually sees, not just a description
147
+ 5. **No clustering** -- if 3 of 4 candidates are "coordinator with different flag styles," one of them must be replaced with a genuinely different direction
148
+
149
+ ### Candidate A: Minimal viable single-session with notify (tests riskiest assumption)
150
+
151
+ **One sentence:** Add a `worktrain review <pr>` command that spawns a single MR review session, prints the handle, and writes the result to `worktrain inbox` when done -- no coordinator, no script to maintain.
152
+
153
+ **Terminal session:**
154
+ ```
155
+ $ worktrain review 47
156
+ Reviewing PR #47: "feat(engine): add OAuth refresh token rotation"
157
+ Session started: sess_4cd0b579 (mr-review-workflow-agentic)
158
+ Run 'worktrain logs --follow --session sess_4cd0b579' to watch progress.
159
+ Run 'worktrain inbox' when done.
160
+ ^ Takes ~20-30 min. You can do other things.
161
+ ```
162
+
163
+ - Tensions resolved: invocation simplicity, output discoverability (inbox)
164
+ - Tensions accepted: no parallelism (serial session), no coordinator routing, user decides next step manually
165
+ - Failure mode: session silently fails or gets stuck; no retry logic; user polls inbox and finds nothing
166
+ - Pattern relationship: follows existing spawn command; adds a thin `review` alias with PR context lookup
167
+ - Philosophy: honors 'validate at boundaries' (PR number validated immediately), 'errors as data' (failure appears in inbox)
168
+ - Scope judgment: **too narrow** -- proves the MVP invocation but not coordination; does not address the parallelism/persistence advantages of a real coordinator
169
+
170
+ ---
171
+
172
+ ### Candidate B: Named coordinator pipeline -- `worktrain run pr-review <pr>` (recommended)
173
+
174
+ **One sentence:** Add a `worktrain run pr-review <pr>` command that drives a pre-built TypeScript coordinator (`src/coordinators/pr-review.ts`), runs reviewer families in parallel via `spawn` + `await`, prints progress lines during execution, and produces a `pr-review-<n>.md` file and a `gh pr comment` command.
175
+
176
+ **Terminal session:**
177
+ ```
178
+ $ worktrain run pr-review 47
179
+ PR #47: feat(engine): add OAuth refresh token rotation (3 files, +142/-18)
180
+ Base: main (merge-base: abc1234)
181
+
182
+ [1/3] Context gathering... done (1m 12s)
183
+ [2/3] Running 3 reviewer families in parallel...
184
+ correctness_invariants running (3m 20s)
185
+ philosophy_alignment done (2m 45s)
186
+ runtime_production_risk running (4m 01s)
187
+ [2/3] Reviewer families complete (4m 33s)
188
+ [3/3] Synthesizing... done (45s)
189
+
190
+ RESULT: Request changes (HIGH CONFIDENCE)
191
+ Critical: Token expiry not validated before refresh -- sessions silently extend past max lifetime
192
+ Coverage: correctness_invariants ✓ philosophy_alignment ✓ runtime_production_risk ✓
193
+
194
+ Full review: pr-review-47.md
195
+ Post comment: gh pr comment 47 --body-file pr-review-47-comment.md
196
+ ```
197
+
198
+ - Tensions resolved: invocation simplicity, feedback richness (progress lines with timing), structured output (file + 3-line summary), happy path vs. needs-human distinction (RESULT line)
199
+ - Tensions accepted: script maintenance (TypeScript coordinator module), discoverability (user must know `worktrain run` exists)
200
+ - Failure mode: one reviewer session fails; coordinator produces partial results with explicit coverage gaps
201
+ - Pattern relationship: adapts existing CLI command pattern (commander.js) and spawn/await primitives; adds `run` subcommand with coordinator registry
202
+ - Philosophy: honors 'errors as data' (partial failure appears as coverage gap, not silent exit), 'make illegal states unrepresentable' (RESULT line forces unambiguous recommendation), 'determinism over cleverness' (routing is scripted, not LLM), 'surface information' (progress lines)
203
+ - Scope judgment: **best-fit** -- delivers coordination, parallelism, and structured output; builds on today's primitives; no new daemon infrastructure needed; can evolve toward daemon-native (Candidate C) later
204
+
205
+ ---
206
+
207
+ ### Candidate C: Daemon-native coordinator with `worktrain tell` (too broad for v1)
208
+
209
+ **One sentence:** The coordinator lives in the daemon as a named pipeline; the user invokes it with `worktrain tell "review PR #47"` and checks `worktrain inbox` or `worktrain console` for the result.
210
+
211
+ **Terminal session:**
212
+ ```
213
+ $ worktrain tell "review PR #47"
214
+ Queued: pr-review pipeline for PR #47 (check inbox or console for result)
215
+
216
+ [...20-30 min later...]
217
+
218
+ $ worktrain inbox
219
+ [2026-04-18 14:32] PR #47 review complete
220
+ RESULT: Approve (MEDIUM CONFIDENCE)
221
+ 1 minor finding: missing test for refresh token edge case
222
+ Full review: ~/git/repo/pr-review-47.md
223
+ Post: gh pr comment 47 --body-file ~/git/repo/pr-review-47-comment.md
224
+ ```
225
+
226
+ - Tensions resolved: invocation simplicity (natural language), integration with daemon ecosystem
227
+ - Tensions accepted: feedback gap (async; no real-time progress without console running), dependency on console for visibility
228
+ - Failure mode: daemon dies mid-pipeline; user polls inbox and finds nothing; silent failure
229
+ - Pattern relationship: follows tell/inbox pattern already in CLI; requires daemon pipeline routing logic (does NOT exist yet)
230
+ - Philosophy: honors 'validate at boundaries' (daemon validates message format), but struggles with 'make illegal states unrepresentable' (polling is silent on failure)
231
+ - Scope judgment: **too broad for v1** -- requires daemon pipeline routing, console DAG grouping (neither exists); right direction for v2 after Candidate B proves the pipeline
232
+
233
+ ---
234
+
235
+ ### Candidate D: Coordinator shell script in repo with `worktrain run` alias (backlog-committed design)
236
+
237
+ **One sentence:** A `scripts/coordinator-pr-review.sh` at repo root calls `worktrain spawn` + `worktrain await` + `jq` to produce a review; `worktrain run pr-review 47` is a thin alias that finds and runs it.
238
+
239
+ **Terminal session (what the user types -- in a second tab, optional):**
240
+ ```
241
+ # Tab 1: run the coordinator
242
+ $ worktrain run pr-review 47
243
+ Spawning review sessions for PR #47...
244
+ sess_abc123 (context-gathering)
245
+ sess_def456 (reviewer-families)
246
+ sess_ghi789 (synthesizer)
247
+ Awaiting all sessions (timeout: 30m)...
248
+ [...silence for 20+ min unless user opens Tab 2...]
249
+ Done.
250
+
251
+ RESULT: Request changes
252
+ Critical: Token expiry not validated
253
+ pr-review-47.md written.
254
+
255
+ # Tab 2 (optional): watch progress
256
+ $ worktrain logs --follow
257
+ [14:02:15] [sess_abc] step_advanced -> step advanced
258
+ [14:04:32] [sess_def] tool_called tool=Bash args="gh pr diff 47 | head..."
259
+ ```
260
+
261
+ - Tensions resolved: flexibility (script is readable/modifiable), transparency, buildable on today's primitives
262
+ - Tensions accepted: feedback gap (Tab 2 required for visibility; UUIDs in log require --session filter), script discoverability
263
+ - Failure mode: user does not open Tab 2; 20+ minutes of silence is the DEFAULT experience; not better than Claude Code on the feedback dimension
264
+ - Pattern relationship: directly implements backlog Apr 15 `coordinator-groom-prs.sh` design
265
+ - Philosophy: honors 'determinism' and 'functional/declarative' (script is pure shell logic); struggles with 'surface information' (silence is default)
266
+ - Scope judgment: **best-fit for first prototype** but worst on feedback UX; evolves naturally into Candidate B by adding the progress line printing to the script
267
+
268
+ ---
269
+
270
+ ## Problem Frame Packet
271
+
272
+ ### Primary users
273
+
274
+ **Autonomous developer (primary):** Uses WorkTrain to run multi-session work while doing other things. CLI-comfortable, knows what `worktrain spawn` does, but does not want to babysit the pipeline. Wants to delegate the cognitive work and be notified of the result.
275
+
276
+ **WorkTrain developer (secondary):** Builds the coordinator template. This discovery is about the end-user experience, not the template author.
277
+
278
+ ### Jobs / outcomes
279
+
280
+ - "I want to review PR #47 before merging it, without spending 30 minutes in a Claude Code session"
281
+ - "I want to know the review is happening without watching it happen"
282
+ - "I want the result to tell me whether to merge, fix, or escalate -- not just a wall of findings"
283
+
284
+ ### Tensions
285
+
286
+ 1. **Feedback tension:** Silent coordination feels like a black box, but verbose coordination interrupts focus. The ideal is ambient awareness -- the pipeline runs, the user gets a summary when it matters.
287
+ 2. **Trust tension:** "Did it really review all the important things?" The coverage ledger is crucial, but currently buried in session notes. Trust requires the coverage ledger to be surfaced in the output.
288
+ 3. **Decision tension:** The coordinator should auto-route clean PRs, but humans need to stay in the loop for blocking findings. The boundary between "coordinator decides" and "coordinator asks human" is the hardest design choice.
289
+ 4. **Invocation tension:** Multiple valid invocation models exist -- named pipeline (`worktrain run pr-review 47`), script (`./scripts/groom-prs.sh`), or daemon tell (`worktrain tell "review PR #47"`). Each has different friction profiles.
290
+ 5. **Output tension:** MR review produces ready-to-post comments but does not auto-post. First version should produce a file + print a `gh` command, not auto-post -- removes friction while preserving human review of the review.
291
+
292
+ ### Success criteria (concrete, observable)
293
+
294
+ 1. User types one command referencing the PR (number or URL) and nothing else
295
+ 2. Within 30 seconds of invocation, something appears showing the pipeline is running (not silence)
296
+ 3. When the pipeline completes, the user sees a 3-line summary: recommendation, top finding, suggested action
297
+ 4. Full findings are in a deterministic file location (e.g., `mr-review-47.md` or a `gh pr comment` draft)
298
+ 5. User can distinguish "pipeline ran to completion" from "pipeline stopped due to an error"
299
+ 6. Total time the user spends actively engaged (not waiting) is under 2 minutes
300
+
301
+ ### HMW reframes
302
+
303
+ - "HMW make the feedback loop feel like a teammate update rather than a log stream?" -- reframes UX from "what events are happening" to "what should I know right now"
304
+ - "HMW design the invocation so users never have to remember flags or workflow IDs?" -- reframes from "CLI command with flags" to "named workflow shortcut"
305
+
306
+ ### Primary framing risk
307
+
308
+ The design assumes the user wants to manually trigger reviews. But if `triggers.yml` already auto-triggers MR review on every push, then the invocation problem is already solved -- and the real gap is OUTPUT and FEEDBACK design, not invocation. If daemon auto-trigger is the primary path, "what does the user type to kick off a review?" is the wrong question. The right question becomes: "what does the user see when the daemon auto-triggers a review it already started?"
309
+
310
+ ## Challenge Notes
311
+
312
+ ### Challenged assumptions
313
+
314
+ 1. **Script template is the right abstraction** -- users may not want to write/maintain scripts. They want `worktrain run pr-review #47`. The "template" is for the developer building WorkTrain, not the end user. The user-facing interface should feel like a named command, not a script invocation.
315
+
316
+ 2. **Coordination is the first thing to prove** -- visibility may be the bigger gap. If a coordinator runs silently for 30 minutes and the user has no idea what's happening, it feels worse than Claude Code (where you at least see the typing). The first coordinator template must include a feedback loop, even a minimal one.
317
+
318
+ 3. **Better than Claude Code is achievable on the first template** -- only if it adds something Claude Code structurally cannot do. Parallel reviewer families is the clearest candidate. A sequential coordinator that is just slower Claude Code with more setup cost is a regression.
319
+
320
+ ---
321
+
322
+ ## Resolution Notes
323
+
324
+ *(To be populated after candidate generation and challenge steps)*
325
+
326
+ ---
327
+
328
+ ## Decision Log
329
+
330
+ ### Decision 1: Selected direction -- Candidate B with Candidate D as proof-of-concept step
331
+
332
+ **Date:** 2026-04-18
333
+
334
+ **Winner: Candidate B** -- `worktrain run pr-review <pr>` with TypeScript coordinator module + progress lines + structured output file
335
+
336
+ **Runner-up: Candidate D** -- shell script coordinator; acceptable fallback if TypeScript module proves too costly
337
+
338
+ **Why B won:**
339
+ - Only candidate with real-time feedback built in (progress lines during execution)
340
+ - RESULT line enforces unambiguous happy path vs. needs-human distinction
341
+ - Structured output (file + gh command) is cleanest delivery mechanism
342
+ - Routing and persistence value (coordinator decides next step) is the real differentiator over single-session, not parallelism
343
+
344
+ **Why the runner-up is real:**
345
+ D can reach B-quality UX by adding progress lines to the shell script. If B's TypeScript module proves too costly to build or maintain, D with progress lines is an acceptable substitute.
346
+
347
+ **Challenge findings that changed the analysis:**
348
+ 1. **Progress lines require a custom polling loop, not just `worktrain await`.** The current `await` command blocks until done; it does not stream. Real-time progress lines require the coordinator to poll `worktrain status <session-id>` in a loop while waiting. This is extra implementation work beyond the basic spawn/await pattern.
349
+ 2. **Reviewer family parallelism already exists within a single session.** The MR review workflow (Phase 3) already runs reviewer families in parallel via `mcp__nested-subagent__Task`. The coordinator's value for single-PR review is routing and persistence, not parallelism. This makes Candidate A (single session) closer to B than initially assessed for single-PR use -- but B still wins on the output/routing/persistence dimensions.
350
+
351
+ **Accepted tradeoffs:**
352
+ - Custom polling loop needed for progress lines -- **REQUIRED for first version** (not optional; without it the core feedback promise is broken)
353
+ - TypeScript coordinator module must track workflow output schema changes -- **schema validation gate REQUIRED for first version**
354
+ - Coordinator value for single-PR is routing + persistence (not parallelism -- that's already in the workflow)
355
+ - Partial failure marking REQUIRED -- failed reviewer session must be explicitly labeled, not silently absent (honoring "make illegal states unrepresentable")
356
+
357
+ **Failure modes:**
358
+ 1. `worktrain await` not streaming -- coordinator polling loop must be built
359
+ 2. Coordinator output schema couples to workflow handoff artifact key names
360
+ 3. If workflow updates without coordinator update, partial results appear without explanation
361
+
362
+ **Switch trigger to Candidate D:**
363
+ If TypeScript coordinator module requires >2 days to build or couples too tightly to CLI infrastructure.
364
+
365
+ ---
366
+
367
+ ## Final Summary
368
+
369
+ ### Recommendation
370
+
371
+ **Selected direction: Candidate B as UX spec** -- `worktrain run pr-review <pr>` with a coordinator that uses spawn + await + progress polling loop + structured output file.
372
+
373
+ **Build sequence (Candidate D first, B target):**
374
+ 1. Build `scripts/coordinator-pr-review.sh` (shell script using `worktrain spawn` + `worktrain await` + progress polling loop + `jq` output transform)
375
+ 2. Add `worktrain run pr-review <pr>` as a thin wrapper calling the script
376
+ 3. If shell script proves sufficient: DONE. No TypeScript migration needed.
377
+ 4. If shell script becomes unwieldy: migrate logic to `src/coordinators/pr-review.ts` TypeScript module
378
+
379
+ **Confidence band: HIGH**
380
+
381
+ **Strongest alternative: Candidate D with progress lines** -- the shell script approach satisfies ALL acceptance criteria; TypeScript migration is optional.
382
+
383
+ **What "better than Claude Code" actually means for this use case:**
384
+ - NOT parallelism within a single PR review (reviewer families already parallel inside the workflow)
385
+ - YES: routing decisions (coordinator decides next step without user; reading session JSON is the coordinator's job)
386
+ - YES: persistence (coordinator can restart a failed session without user intervention)
387
+ - YES: structured output (file + gh command; not a prose wall in session notes)
388
+ - YES: zero babysitting minutes (user types one command and is done; checks back when done)
389
+
390
+ ### Required first-version implementation constraints
391
+
392
+ 1. **Progress polling loop** (REQUIRED) -- coordinator must poll `worktrain status <id>` every 30s during await; `worktrain await` alone is silent for 20+ min
393
+ 2. **Output schema validation** (REQUIRED) -- validate all expected JSON keys from await output; emit explicit error on missing keys; never print recommendation without backing data
394
+ 3. **Partial failure marking** (REQUIRED) -- failed reviewer session must be explicitly labeled in output; not silently absent
395
+
396
+ ### Residual risks
397
+
398
+ 1. **Spawn/await end-to-end not validated** -- per backlog Apr 18, `worktrain spawn/await` is "merged but needs real-world test." The first coordinator build will surface any integration bugs.
399
+ 2. **Schema coupling** -- coordinator reads `findings`, `recommendation`, `coverageLedger` from await JSON; if workflow output schema changes, coordinator must update. Mitigation: schema validation gate + workflow version header.
400
+ 3. **Riskiest assumption still open** -- if users find that `worktrain review 47` (single session, Candidate A) satisfies their needs on first 5 PRs, the coordinator layer may be unnecessary for single-PR use. Observable after first real-world use.
401
+
402
+ ### Exact invocation (concrete UX spec)
403
+
404
+ **What the user types:**
405
+ ```
406
+ worktrain run pr-review 47
407
+ ```
408
+ (or with full URL: `worktrain run pr-review https://github.com/owner/repo/pull/47`)
409
+
410
+ **What the user sees during execution (Candidate D -- proof-of-concept, single session):**
411
+ ```
412
+ PR #47: feat(engine): add OAuth refresh token rotation (3 files, +142/-18)
413
+ Base: main merge-base: abc1234
414
+ Session: sess_4cd0b579 (mr-review-workflow-agentic)
415
+
416
+ Watch raw events: worktrain logs --follow --session sess_4cd0b579
417
+
418
+ Running review... step 1/8 (0:15)
419
+ Running review... step 2/8 (1:30)
420
+ Running review... step 4/8 (5:22)
421
+ Running review... step 7/8 (18:45)
422
+ Done (22:31)
423
+ ```
424
+
425
+ **What the user sees during execution (Candidate B -- target, multi-session coordinator):**
426
+ ```
427
+ PR #47: feat(engine): add OAuth refresh token rotation (3 files, +142/-18)
428
+ Base: main merge-base: abc1234
429
+
430
+ Watch raw events: worktrain logs --follow
431
+
432
+ [1/3] Context gathering... running (0:15)
433
+ [1/3] Context gathering... done (1:12)
434
+ [2/3] Reviewer families (3 parallel)... running
435
+ correctness_invariants done (2:45)
436
+ philosophy_alignment done (2:01)
437
+ runtime_production_risk running (3:20)
438
+ [2/3] Reviewer families... done (4:33)
439
+ [3/3] Synthesizing... done (0:45)
440
+ ```
441
+
442
+ *Note: The multi-session display (B) is the target UX. The single-session display (D) is what ships first.*
443
+
444
+ **What the user gets when it's done:**
445
+ ```
446
+ RESULT: Request changes [HIGH CONFIDENCE]
447
+ CRITICAL Token expiry not validated before refresh -- sessions extend past max lifetime
448
+ src/auth/token-service.ts:142
449
+
450
+ Coverage: correctness_invariants OK philosophy_alignment OK runtime_production_risk OK
451
+
452
+ Full review: ./pr-review-47.md
453
+ Post comment: gh pr comment 47 --body-file ./pr-review-47-comment.md
454
+ ```
455
+
456
+ **Happy path vs. "needs human" distinction:**
457
+ - `RESULT: Approve` -- coordinator ran to completion; PR is clean; user can post comment and merge
458
+ - `RESULT: Request changes` -- coordinator ran to completion; blocking or critical findings; user reads findings file and decides
459
+ - `RESULT: [PARTIAL] Unable to complete` -- one or more reviewer sessions failed; user must review manually with `worktrain logs --session <id>`
460
+ - All three states are UNAMBIGUOUS from the RESULT line
461
+
462
+ **Minimum viable version that proves coordination works:**
463
+ The shell script coordinator (`scripts/coordinator-pr-review.sh`) with:
464
+ - `worktrain spawn` for each session
465
+ - Progress polling loop (30s status prints)
466
+ - `worktrain await` for structured JSON results
467
+ - `jq` for output transformation
468
+ - Schema validation and partial failure marking
469
+ - Written to `pr-review-<n>.md` + printed `gh pr comment` command
470
+
471
+ This is the minimum that is BETTER than Claude Code in ways that matter: zero babysitting, structured output, explicit routing decision, unambiguous result state.