@exaudeus/workrail 3.66.0 → 3.68.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/application/services/compiler/template-registry.js +10 -1
- package/dist/application/validation.js +1 -1
- package/dist/cli/commands/worktrain-init.js +1 -1
- package/dist/console/standalone-console.js +4 -1
- package/dist/console-ui/assets/{index-BynU38Vu.js → index-CyzltI6D.js} +1 -1
- package/dist/console-ui/index.html +1 -1
- package/dist/coordinators/modes/full-pipeline.js +4 -4
- package/dist/coordinators/modes/implement-shared.js +5 -5
- package/dist/coordinators/modes/implement.js +4 -4
- package/dist/coordinators/pr-review.js +4 -4
- package/dist/daemon/workflow-runner.d.ts +1 -0
- package/dist/daemon/workflow-runner.js +1 -0
- package/dist/infrastructure/storage/schema-validating-workflow-storage.d.ts +21 -2
- package/dist/infrastructure/storage/schema-validating-workflow-storage.js +48 -0
- package/dist/manifest.json +41 -41
- package/dist/mcp/handlers/v2-workflow.js +24 -7
- package/dist/mcp/output-schemas.d.ts +36 -0
- package/dist/mcp/output-schemas.js +11 -1
- package/dist/mcp/workflow-protocol-contracts.js +2 -2
- package/dist/v2/projections/session-metrics.d.ts +1 -1
- package/dist/v2/projections/session-metrics.js +16 -35
- package/dist/v2/usecases/console-routes.d.ts +2 -2
- package/docs/authoring-v2.md +4 -4
- package/docs/changelog-recent.md +3 -3
- package/docs/configuration.md +1 -1
- package/docs/design/adaptive-coordinator-context-candidates.md +1 -1
- package/docs/design/adaptive-coordinator-context.md +1 -1
- package/docs/design/adaptive-coordinator-routing-candidates.md +18 -18
- package/docs/design/adaptive-coordinator-routing-review.md +1 -1
- package/docs/design/adaptive-coordinator-routing.md +34 -34
- package/docs/design/agent-cascade-protocol.md +2 -2
- package/docs/design/console-daemon-separation-discovery.md +323 -0
- package/docs/design/context-assembly-design-candidates.md +1 -1
- package/docs/design/context-assembly-implementation-plan.md +1 -1
- package/docs/design/context-assembly-layer.md +2 -2
- package/docs/design/context-assembly-review-findings.md +1 -1
- package/docs/design/coordinator-access-audit.md +293 -0
- package/docs/design/coordinator-architecture-audit.md +62 -0
- package/docs/design/coordinator-error-handling-audit.md +240 -0
- package/docs/design/coordinator-testability-audit.md +426 -0
- package/docs/design/daemon-architecture-discovery.md +1 -1
- package/docs/design/daemon-console-separation-discovery.md +242 -0
- package/docs/design/daemon-memory-audit.md +203 -0
- package/docs/design/design-candidates-console-daemon-separation.md +256 -0
- package/docs/design/design-candidates-discovery-loop-fix.md +141 -0
- package/docs/design/design-review-findings-console-daemon-separation.md +106 -0
- package/docs/design/design-review-findings-discovery-loop-fix.md +81 -0
- package/docs/design/discovery-loop-fix-candidates.md +161 -0
- package/docs/design/discovery-loop-fix-design-review.md +106 -0
- package/docs/design/discovery-loop-fix-validation.md +258 -0
- package/docs/design/discovery-loop-investigation-A.md +188 -0
- package/docs/design/discovery-loop-investigation-B.md +287 -0
- package/docs/design/exploration-workflow-candidates.md +205 -0
- package/docs/design/exploration-workflow-design-review.md +166 -0
- package/docs/design/exploration-workflow-discovery.md +443 -0
- package/docs/design/ide-context-files-candidates.md +231 -0
- package/docs/design/ide-context-files-design-review.md +85 -0
- package/docs/design/ide-context-files.md +615 -0
- package/docs/design/implementation-plan-discovery-loop-fix.md +199 -0
- package/docs/design/implementation-plan-queue-poll-rotation.md +102 -0
- package/docs/design/in-process-http-audit.md +190 -0
- package/docs/design/layer3b-ghost-nodes-design-candidates.md +2 -2
- package/docs/design/loadSessionNotes-candidates.md +108 -0
- package/docs/design/loadSessionNotes-test-coverage-discovery.md +297 -0
- package/docs/design/loadSessionNotes-test-coverage-session4.md +209 -0
- package/docs/design/loadSessionNotes-test-coverage-v3.md +321 -0
- package/docs/design/probe-session-design-candidates.md +261 -0
- package/docs/design/probe-session-phase0.md +490 -0
- package/docs/design/routines-guide.md +7 -7
- package/docs/design/session-metrics-attribution-candidates.md +250 -0
- package/docs/design/session-metrics-attribution-design-review.md +115 -0
- package/docs/design/session-metrics-attribution-discovery.md +319 -0
- package/docs/design/session-metrics-candidates.md +227 -0
- package/docs/design/session-metrics-design-review.md +104 -0
- package/docs/design/session-metrics-discovery.md +454 -0
- package/docs/design/spawn-session-debug.md +202 -0
- package/docs/design/trigger-validator-candidates.md +214 -0
- package/docs/design/trigger-validator-review.md +109 -0
- package/docs/design/trigger-validator-shaping-phase0.md +239 -0
- package/docs/design/trigger-validator.md +454 -0
- package/docs/design/v2-core-design-locks.md +2 -2
- package/docs/design/workflow-extension-points.md +15 -15
- package/docs/design/workflow-id-validation-at-startup.md +1 -1
- package/docs/design/workflow-id-validation-implementation-plan.md +2 -2
- package/docs/design/workflow-trigger-lifecycle-audit.md +175 -0
- package/docs/design/worktrain-task-queue-candidates.md +5 -5
- package/docs/design/worktrain-task-queue.md +4 -4
- package/docs/discovery/coordinator-script-design.md +1 -1
- package/docs/discovery/coordinator-ux-discovery.md +3 -3
- package/docs/discovery/simulation-report.md +1 -1
- package/docs/discovery/workflow-modernization-discovery.md +326 -0
- package/docs/discovery/workflow-selection-for-discovery-tasks.md +33 -33
- package/docs/discovery/worktrain-status-briefing.md +1 -1
- package/docs/discovery/wr-discovery-goal-reframing.md +1 -1
- package/docs/docker.md +1 -1
- package/docs/ideas/backlog.md +227 -0
- package/docs/ideas/third-party-workflow-setup-design-thinking.md +1 -1
- package/docs/integrations/claude-code.md +5 -5
- package/docs/integrations/firebender.md +1 -1
- package/docs/plans/agentic-orchestration-roadmap.md +2 -2
- package/docs/plans/mr-review-workflow-redesign.md +9 -9
- package/docs/plans/ui-ux-workflow-design-candidates.md +4 -4
- package/docs/plans/ui-ux-workflow-discovery.md +2 -2
- package/docs/plans/workflow-categories-candidates.md +8 -8
- package/docs/plans/workflow-categories-discovery.md +4 -4
- package/docs/plans/workflow-modernization-design.md +430 -0
- package/docs/plans/workflow-staleness-detection-candidates.md +11 -11
- package/docs/plans/workflow-staleness-detection-review.md +4 -4
- package/docs/plans/workflow-staleness-detection.md +9 -9
- package/docs/plans/workrail-platform-vision.md +3 -3
- package/docs/reference/agent-context-cleaner-snippet.md +1 -1
- package/docs/reference/agent-context-guidance.md +4 -4
- package/docs/reference/context-optimization.md +2 -2
- package/docs/roadmap/now-next-later.md +2 -2
- package/docs/roadmap/open-work-inventory.md +16 -16
- package/docs/workflows.md +31 -31
- package/package.json +1 -1
- package/spec/workflow-tags.json +47 -47
- package/workflows/adaptive-ticket-creation.json +16 -16
- package/workflows/architecture-scalability-audit.json +22 -22
- package/workflows/bug-investigation.agentic.v2.json +3 -3
- package/workflows/classify-task-workflow.json +1 -1
- package/workflows/coding-task-workflow-agentic.json +6 -6
- package/workflows/cross-platform-code-conversion.v2.json +8 -8
- package/workflows/document-creation-workflow.json +8 -8
- package/workflows/documentation-update-workflow.json +8 -8
- package/workflows/intelligent-test-case-generation.json +2 -2
- package/workflows/learner-centered-course-workflow.json +2 -2
- package/workflows/mr-review-workflow.agentic.v2.json +4 -4
- package/workflows/personal-learning-materials-creation-branched.json +8 -8
- package/workflows/presentation-creation.json +5 -5
- package/workflows/production-readiness-audit.json +1 -1
- package/workflows/relocation-workflow-us.json +31 -31
- package/workflows/routines/context-gathering.json +1 -1
- package/workflows/routines/design-review.json +1 -1
- package/workflows/routines/execution-simulation.json +1 -1
- package/workflows/routines/feature-implementation.json +3 -3
- package/workflows/routines/final-verification.json +1 -1
- package/workflows/routines/hypothesis-challenge.json +1 -1
- package/workflows/routines/ideation.json +1 -1
- package/workflows/routines/parallel-work-partitioning.json +3 -3
- package/workflows/routines/philosophy-alignment.json +2 -2
- package/workflows/routines/plan-analysis.json +1 -1
- package/workflows/routines/plan-generation.json +1 -1
- package/workflows/routines/tension-driven-design.json +6 -6
- package/workflows/scoped-documentation-workflow.json +26 -26
- package/workflows/ui-ux-design-workflow.json +14 -14
- package/workflows/workflow-diagnose-environment.json +1 -1
- package/workflows/workflow-for-workflows.json +32 -77
- package/workflows/workflow-for-workflows.v2.json +0 -788
package/docs/design/daemon-memory-audit.md
@@ -0,0 +1,203 @@

# Daemon Memory Audit

_Audit date: 2026-04-19. Auditor: Claude (wr.bug-investigation workflow)._

## Context

The WorkTrain daemon runs continuously for days. This audit catalogues every module-level or instance-level mutable variable, cache, Map, Set, and Array in six specified files, assesses whether growth is bounded, identifies the cleanup mechanism, and rates severity.

**Files audited:**

1. `src/trigger/trigger-router.ts`
2. `src/trigger/polling-scheduler.ts`
3. `src/v2/usecases/worktree-service.ts`
4. `src/v2/infra/in-memory/daemon-registry/index.ts`
5. `src/v2/infra/in-memory/keyed-async-queue/index.ts`
6. `src/daemon/daemon-events.ts`

**Out of scope:** `src/mcp/`, `src/v2/durable-core/`, all test files.

---

## Findings Table

| # | File | Variable | Type | Holds | Growth Bound | Cleanup Mechanism | Severity | Notes |
|---|------|----------|------|-------|-------------|-------------------|----------|-------|
| 1 | `polling-scheduler.ts` | `appendQueuePollLog` writes to `queue-poll.jsonl` | Disk file (not in-memory) | One JSON line per poll-cycle event | **Unbounded** | None -- single fixed path, `appendFile` only | **Critical** | See Finding 1 |
| 2 | `worktree-service.ts` | `enrichmentQueue` (module-level) | `Array<() => void>` | Pending waiter callbacks for foreground git semaphore slots | Unbounded in theory | `releaseEnrichmentSlot()` shifts and calls next | **Major** | See Finding 2 |
| 3 | `worktree-service.ts` | `backgroundEnrichmentQueue` (module-level) | `Array<() => void>` | Pending waiter callbacks for background git semaphore slots | Unbounded in theory | `releaseBackgroundSlot()` shifts and calls next | **Major** | See Finding 2 |
| 4 | `trigger-router.ts` | `_recentAdaptiveDispatches` (instance) | `Map<string, number>` | `${goal}::${workspace}` -> last dispatch timestamp | Bounded at 30s TTL, but only cleaned on next dispatch | Cleanup-on-entry: purge stale before check/insert | **Minor** | See Finding 3 |
| 5 | `polling-scheduler.ts` | `dispatchingIssues` (instance) | `Set<number>` | Issue numbers whose `dispatchAdaptivePipeline` Promise is in-flight | Bounded by session timeout (max ~65 min) | `.then()` + `.catch()` unconditionally delete | **Minor** | See Finding 4 |
| 6 | `daemon-registry/index.ts` | `entries` (instance) | `Map<string, DaemonEntry>` | Active autonomous session liveness records | Bounded by active session count | `unregister()` in `runWorkflow()` finally block | **Minor** | See Finding 5 |
| 7 | `keyed-async-queue/index.ts` | `queues` (instance) | `Map<string, Promise<void>>` | Per-key async serialization chains | Bounded by keys with pending work | `.finally()` identity check deletes on completion | **Acceptable** | Correct design |
| 8 | `trigger-router.ts` | `semaphore.waiters` (instance) | `Array<() => void>` | Callers waiting for a concurrency slot | Bounded by dispatch rate / completion rate | `semaphore.release()` shifts and calls next | **Acceptable** | Transient, not a leak |
| 9 | `worktree-service.ts` | `worktreeCache` (module-level) | `WorktreeCache \| null` | Single enriched worktree scan result | Bounded: single object, replaced on TTL expiry | Replaced every 45s (WORKTREE_CACHE_TTL_MS) | **Acceptable** | Large for big repos but not accumulating |
| 10 | `worktree-service.ts` | `backgroundEnrichmentInFlight` (module-level) | `boolean` | Single dedup guard flag | n/a (scalar) | Reset in `finally` block of `runBackgroundEnrichment` | **Acceptable** | No growth |
| 11 | `worktree-service.ts` | `onEnrichmentComplete` (module-level) | `(() => void) \| null` | Single SSE broadcast callback | n/a (single reference) | Overwritten by `setEnrichmentCompleteCallback` | **Acceptable** | No growth |
| 12 | `worktree-service.ts` | `activeEnrichments` (module-level) | `number` | Count of running foreground enrichments | Bounded by MAX_CONCURRENT_ENRICHMENTS=8 | Decremented in `releaseEnrichmentSlot()` | **Acceptable** | No growth |
| 13 | `worktree-service.ts` | `activeBackgroundEnrichments` (module-level) | `number` | Count of running background enrichments | Bounded by MAX_BACKGROUND_ENRICHMENTS=16 | Decremented in `releaseBackgroundSlot()` | **Acceptable** | No growth |
| 14 | `daemon-events.ts` | `_dir` (instance) | `string` | Output directory path | n/a (immutable after construction) | n/a | **Acceptable** | No state accumulation |
| 15 | `polling-scheduler.ts` | `intervals` (instance) | `Map<string, interval handle>` | Active setInterval/setTimeout handles | Bounded by trigger count (fixed at startup) | `stop()` clears all handles | **Acceptable** | Fixed at startup |
| 16 | `polling-scheduler.ts` | `polling` (instance) | `Map<string, boolean>` | Skip-cycle guard flags per trigger | Bounded by trigger count (fixed at startup) | Reset in `runPollCycle()` finally block | **Acceptable** | Fixed at startup |
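Row 7's `.finally()` identity check is the pattern that keeps the queue Map bounded. A minimal sketch of that pattern follows -- a hypothetical reconstruction; the class name, `run` signature, and `size` getter are illustrative, not the actual `keyed-async-queue/index.ts` source:

```typescript
// Hypothetical reconstruction of the per-key serialization pattern
// rated Acceptable in row 7. Names and signatures are illustrative.
class KeyedAsyncQueue {
  private queues = new Map<string, Promise<void>>();

  // Serialize `task` behind any pending work for `key`.
  run<T>(key: string, task: () => Promise<T>): Promise<T> {
    const prev = this.queues.get(key) ?? Promise.resolve();
    const result = prev.then(task, task);
    // Swallow errors on the chain tail so one failure does not poison the key.
    const tail = result.then(() => undefined, () => undefined);
    this.queues.set(key, tail);
    // Identity check on cleanup: delete the entry only if no newer task has
    // replaced the tail -- this is what keeps the Map bounded.
    tail.finally(() => {
      if (this.queues.get(key) === tail) this.queues.delete(key);
    });
    return result;
  }

  get size(): number {
    return this.queues.size;
  }
}
```

The identity check matters: a plain `delete` in `finally` would race with a newer task that re-inserted the key, leaking its chain.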

---

## Detailed Findings

### Finding 1 -- Critical: `queue-poll.jsonl` grows without bound

**File:** `src/trigger/polling-scheduler.ts`, `appendQueuePollLog()` at line 800
**Also:** `src/cli-worktrain.ts` line 717

`queue-poll.jsonl` is written to a single fixed path (`~/.workrail/queue-poll.jsonl`) using `fs.appendFile`. There is no rotation, no size cap, and no TTL. The file grows by one or more JSON lines per poll cycle for every event (task_selected, task_skipped, poll_cycle_complete, poll_cycle_skipped).

`cli-worktrain.ts` lines 717-718 state explicitly:

> "WHY constants: queue-poll and stderr are permanent files that never rotate."

This is a design choice, not an oversight, but it creates a disk-exhaustion risk for any 24/7 deployment.

**Growth estimate:**

| Poll interval | Events per cycle | Lines per day | Size per day | Size per month |
|--------------|-----------------|---------------|--------------|----------------|
| 5 min | 5 | 1,440 | ~290 KB | ~8.7 MB |
| 1 min | 10 | 14,400 | ~2.9 MB | ~87 MB |

**Recommended fix:** Add size-capped rotation. When `queue-poll.jsonl` exceeds a configurable threshold (default: 50 MB), rename it to `queue-poll.jsonl.1` and start a new file. Keep at most 2 rotated files. Alternatively, switch to the daily-rotation pattern already used by `DaemonEventEmitter` (daily JSONL files under `~/.workrail/events/daemon/YYYY-MM-DD.jsonl`), which handles rotation by filename.

**What happens if cleanup fails:** The file continues to grow. On a 1-min polling daemon with many skipped issues, the file could reach gigabyte scale within weeks.
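The size-capped rotation could be sketched as follows. The helper name `appendWithRotation`, the `maxBytes` parameter, and the exact suffix handling are assumptions for illustration, not existing WorkRail code:

```typescript
// Sketch of size-capped rotation for an append-only JSONL log.
// `appendWithRotation` is a hypothetical helper, not WorkRail API.
import { promises as fs } from "node:fs";

const MAX_LOG_BYTES = 50 * 1024 * 1024; // default threshold from the fix above
const MAX_ROTATED = 2;                  // keep at most .1 and .2

async function appendWithRotation(
  path: string,
  line: string,
  maxBytes: number = MAX_LOG_BYTES,
): Promise<void> {
  let size = 0;
  try {
    size = (await fs.stat(path)).size;
  } catch {
    // File does not exist yet -- nothing to rotate.
  }
  if (size >= maxBytes) {
    // Shift rotations up (.1 -> .2, live file -> .1); the oldest is dropped.
    for (let i = MAX_ROTATED; i >= 1; i--) {
      const from = i === 1 ? path : `${path}.${i - 1}`;
      await fs.rename(from, `${path}.${i}`).catch(() => {});
    }
  }
  await fs.appendFile(path, line + "\n");
}
```

On POSIX, `rename` atomically replaces the destination, so the oldest rotation is discarded without a separate unlink step.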

---

### Finding 2 -- Major: `enrichmentQueue` and `backgroundEnrichmentQueue` have no depth cap

**File:** `src/v2/usecases/worktree-service.ts`, `acquireEnrichmentSlot()` at line 143 and `acquireBackgroundSlot()` at line 182

Both functions implement a Promise-based semaphore by pushing a resolve callback to a module-level Array when all slots are occupied. There is no `if (queue.length >= MAX)` guard before the push.

```typescript
// Current code -- no capacity check:
enrichmentQueue.push(resolve);
```

**Impact:** If an HTTP client disconnects before the request completes, the dangling resolve callback is retained in the array indefinitely (there is no AbortController or request-lifecycle cleanup). If many concurrent requests arrive (e.g., a frontend bug that loops on failure), the array grows without bound.

**Normal operation:** Under single-client usage, the `backgroundEnrichmentInFlight` guard prevents multiple background scans from running simultaneously (one scan per 45s TTL window). The foreground queue only fills during burst scenarios. In practice, arrays stay near-empty under normal single-browser-client console usage.

**Recommended fix:** Add a max-depth guard before each push:

```typescript
const MAX_ENRICHMENT_QUEUE_DEPTH = 32; // or configurable

// In acquireEnrichmentSlot:
if (enrichmentQueue.length >= MAX_ENRICHMENT_QUEUE_DEPTH) {
  reject(new Error('enrichment queue full -- too many concurrent worktree requests'));
  return;
}
enrichmentQueue.push(resolve);
```

For correctness, also add cleanup when an HTTP request is cancelled (tie into `req.on('close', ...)` and remove the specific resolve from the array, or replace it with a no-op).
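That cancellation hook might look like the following sketch. `acquireSlotCancellable` is a hypothetical helper (the real `acquireEnrichmentSlot` differs), shown only to illustrate removing a queued waiter on client disconnect:

```typescript
// Sketch: a queued waiter that can be withdrawn when the client disconnects.
// `acquireSlotCancellable` is illustrative, not the actual worktree-service API.
type Waiter = () => void;
const enrichmentQueue: Waiter[] = [];

function acquireSlotCancellable(): { promise: Promise<void>; cancel: () => void } {
  let resolve!: Waiter;
  const promise = new Promise<void>((r) => { resolve = r; });
  enrichmentQueue.push(resolve);
  return {
    promise,
    // Intended for req.on('close', ...): drop the waiter if it is still
    // queued, so a disconnected client cannot leak its callback.
    cancel: () => {
      const idx = enrichmentQueue.indexOf(resolve);
      if (idx !== -1) enrichmentQueue.splice(idx, 1);
    },
  };
}
```

The route handler would wire `cancel` into the request's `close` event; a waiter that already received its slot is not in the array, so a late `cancel` is a harmless no-op.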

**What happens if cleanup fails:** Waiter callbacks accumulate in memory. On a long-running daemon with a misbehaving frontend or test harness, this could eventually cause OOM. Under normal usage, the risk is low.

---

### Finding 3 -- Minor: `_recentAdaptiveDispatches` retains stale entries during idle periods

**File:** `src/trigger/trigger-router.ts`, `_recentAdaptiveDispatches` at line 500

The cleanup-on-entry pattern is documented and intentional (avoids background timers). Stale entries older than `ADAPTIVE_DEDUPE_TTL_MS` (30s) are purged when the next dispatch arrives. During idle periods (no dispatches), stale entries from the last active burst persist.
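A minimal sketch of that cleanup-on-entry dedupe (hypothetical -- the real `TriggerRouter` keeps this state on the instance and may differ in detail):

```typescript
// Sketch of the purge-before-check/insert dedupe described above.
// Standalone reconstruction; the actual TriggerRouter code may differ.
const ADAPTIVE_DEDUPE_TTL_MS = 30_000;
const recent = new Map<string, number>();

// Returns true if this goal/workspace pair was dispatched within the TTL.
function isDuplicateDispatch(goal: string, workspace: string, now: number): boolean {
  // Purge stale entries before the check -- no background timer needed.
  for (const [key, ts] of recent) {
    if (now - ts >= ADAPTIVE_DEDUPE_TTL_MS) recent.delete(key);
  }
  const key = `${goal}::${workspace}`;
  if (recent.has(key)) return true;
  recent.set(key, now);
  return false;
}
```

The tradeoff audited here is visible in the sketch: purging only happens on entry, so entries from the last burst survive until the next call.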

**Impact:** Each entry is a `string -> number` pair (goal::workspace key + timestamp). The key length is bounded by the goal string length (capped implicitly by webhook payload size). After a burst of N unique dispatches, N entries persist until the next dispatch. For a daemon that processes 100 unique GitHub issues per hour during peak and then goes idle for 12 hours, 100 entries remain in memory. At ~200 bytes each, that is ~20 KB -- negligible.

**Recommended fix:** Either accept this as a design tradeoff (entries are small and bounded by the last burst size), or add a low-frequency background sweep:

```typescript
// In constructor, after semaphore init:
setInterval(() => {
  const now = Date.now();
  for (const [key, ts] of this._recentAdaptiveDispatches) {
    if (now - ts >= TriggerRouter.ADAPTIVE_DEDUPE_TTL_MS) {
      this._recentAdaptiveDispatches.delete(key);
    }
  }
}, TriggerRouter.ADAPTIVE_DEDUPE_TTL_MS * 2).unref();
```

The `.unref()` prevents the timer from keeping the process alive.

**What happens if cleanup fails:** Stale entries persist beyond their TTL. The maximum impact is a slight memory overhead proportional to the last dispatch burst size. There is no correctness impact (entries only guard against duplicates within the 30s window, which has already expired).

---

### Finding 4 -- Minor: `dispatchingIssues` bounded but imperfectly

**File:** `src/trigger/polling-scheduler.ts`, `dispatchingIssues` at line 113

Issue numbers are added before `dispatchAdaptivePipeline()` (I1) and removed unconditionally in both `.then()` and `.catch()` (I2). The Promise returned by `dispatchAdaptivePipeline` (which calls `runAdaptivePipeline` -> executor -> `runWorkflow`) is guaranteed to settle within the configured session timeout because `workflow-runner.ts` line 3792 uses `Promise.race([agentLoop, timeoutPromise])`.

**Residual risk:** If the `timeoutHandle` in `workflow-runner.ts` is somehow garbage-collected before firing (extremely unlikely in V8), or if the WorkRail MCP server becomes permanently unresponsive while the Promise is awaiting it (no top-level abort signal), the Promise could hang and the issue number would persist in `dispatchingIssues` until daemon restart.

**Recommended fix:** Add a conservative defense-in-depth timeout in the polling scheduler:

```typescript
const ISSUE_DISPATCH_TIMEOUT_MS = 4 * 60 * 60 * 1000; // 4h safety ceiling

const dispatchP = this.router.dispatchAdaptivePipeline(...);
const timeoutP = new Promise<void>((_, reject) =>
  setTimeout(() => reject(new Error('dispatch timeout')), ISSUE_DISPATCH_TIMEOUT_MS)
);
void Promise.race([dispatchP, timeoutP])
  .then(() => { this.dispatchingIssues.delete(top.issue.number); })
  .catch(() => { this.dispatchingIssues.delete(top.issue.number); });
```

**What happens if cleanup fails:** The issue is permanently skipped with reason `active_session_in_process`. On a long-running daemon, progressive issue numbers accumulate in the set. After N such leaks, N issues are silently skipped every poll cycle, degrading queue throughput.

---

### Finding 5 -- Minor: `DaemonRegistry.entries` leak on abnormal session exit

**File:** `src/v2/infra/in-memory/daemon-registry/index.ts`

`DaemonRegistry.unregister()` is called from `runWorkflow()` in a `finally` block, which executes on both success and failure paths. This is the correct pattern.

**Residual risk:** If the outer async chain holding `runWorkflow()` is abandoned (e.g., the process receives SIGKILL mid-session), the `finally` block does not run and the entry leaks. However, `DaemonRegistry` is an instance-level variable that is cleared on process restart. The console's `AUTONOMOUS_HEARTBEAT_THRESHOLD_MS` check provides a secondary liveness gate -- stale entries stop showing as "live" once their `lastHeartbeatMs` ages beyond the threshold.

**Recommended fix:** No urgent action needed. Optionally, add a scheduled cleanup pass that removes entries where `lastHeartbeatMs` has not been updated within 2x the heartbeat interval:

```typescript
// In DaemonRegistry: periodic stale-entry sweep
sweep(maxStalenessMs: number): void {
  const now = Date.now();
  for (const [id, entry] of this.entries) {
    if (now - entry.lastHeartbeatMs > maxStalenessMs) {
      this.entries.delete(id);
    }
  }
}
```

**What happens if cleanup fails:** Stale entries show as "running" in the console beyond SIGKILL scenarios. Bounded by the heartbeat threshold check; no correctness impact on session execution.

---

## Summary by Severity

| Severity | Count | Items |
|----------|-------|-------|
| Critical | 1 | queue-poll.jsonl unbounded disk growth |
| Major | 2 | enrichmentQueue depth (no cap), backgroundEnrichmentQueue depth (no cap) |
| Minor | 3 | _recentAdaptiveDispatches idle stale entries, dispatchingIssues session-bounded leak, DaemonRegistry SIGKILL leak |
| Acceptable | 10 | All others -- correct cleanup, bounded, or scalar |

---

## Recommended Fix Priority

1. **queue-poll.jsonl rotation** -- disk exhaustion risk on any always-on deployment. Implement size-capped rotation or daily files (1-2 hours of work).
2. **enrichmentQueue/backgroundEnrichmentQueue depth cap** -- add a max-depth guard before push (30 min of work). Low-impact under normal usage but a correctness risk under misbehaving-client scenarios.
3. **_recentAdaptiveDispatches background sweep** -- add a `.unref()` setInterval sweep (15 min of work). Can also be deferred as an accepted tradeoff.
4. **dispatchingIssues defense timeout** -- add a 4-hour `Promise.race` ceiling as defense-in-depth (15 min of work).
5. **DaemonRegistry stale-entry sweep** -- add an optional scheduled sweep (15 min of work). Not urgent.
package/docs/design/design-candidates-console-daemon-separation.md
@@ -0,0 +1,256 @@

# Design Candidates: Console-Daemon Separation

**Generated by:** wr.discovery workflow
**Date:** 2026-04-21
**Status:** Raw investigative material -- not a final decision

---

## Problem Understanding

### Core Tensions

1. **Convenience vs. separation**: `daemon-console.ts` exists purely for convenience: it lets the daemon auto-start the console server. Removing it means users must run `worktrain console` separately. This is a real UX regression for the daemon workflow.

2. **Control actions vs. read-only console**: The browser dispatch button (`POST /api/v2/auto/dispatch`) requires a live V2ToolContext. A filesystem-only console cannot serve this. Strict separation means the button returns 503 when the console has no daemon context.

3. **Single-origin frontend vs. split backends**: ALL API calls in `console/src/api/hooks.ts` use relative URLs (`/api/v2/sessions`, `/api/v2/auto/dispatch`, etc.). There is no `VITE_API_BASE_URL` or origin abstraction. Any "split by port" approach requires frontend changes to use absolute URLs for control actions.

4. **Redundancy vs. coupling**: `daemon-console.ts` and `standalone-console.ts` do the same thing. One has daemon coupling; the other does not. Both write to the same `daemon-console.lock` file -- they cannot coexist on port 3456 simultaneously.

### What Makes This Hard

The frontend single-origin assumption is the hidden constraint. Anyone proposing "split by port at the browser level" without reading `console/src/api/hooks.ts` will miss that ALL four control endpoints use relative URLs. The real work in split-by-port is the frontend changes, not the server split.

### Real Seam / Where the Problem Lives

The seam is in the **daemon startup path** -- specifically, whether it calls `startDaemonConsole()` at all. If that call is removed, `daemon-console.ts` becomes dead code.

The problem does NOT live in `mountConsoleRoutes()` (already correct -- all daemon params are optional with 503 fallbacks) or in `standalone-console.ts` (already the correct architecture).

### Critical Codebase Finding

`src/console/standalone-console.ts` is already the correct standalone implementation:
- Zero imports from `src/daemon/` or `src/trigger/`
- Constructs its own infrastructure adapters independently
- Calls `mountConsoleRoutes()` with `undefined` for all daemon params
- Working today as the `worktrain console` command

`mountConsoleRoutes()` has 503/empty-list fallbacks for all four control endpoints:
- `POST /api/v2/auto/dispatch` -- returns 503 if no `v2ToolContext`
- `GET /api/v2/triggers` -- returns an empty list if no `triggerRouter`
- `POST /api/v2/triggers/:id/poll` -- returns 503 if no `pollingScheduler`
- `POST /api/v2/sessions/:id/steer` -- returns 503 if no `steerRegistry`
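The optional-dependency pattern behind those fallbacks can be sketched framework-agnostically. The `ConsoleDeps` shape, the `handleDispatch` name, and the 202 success code are assumptions for illustration; only the 503-when-no-context behavior comes from the codebase finding above:

```typescript
// Sketch of the "optional daemon handle, 503 fallback" route pattern.
// Interfaces are illustrative, not the real mountConsoleRoutes signature.
interface V2ToolContext {
  dispatch(goal: string): Promise<void>;
}

interface ConsoleDeps {
  v2ToolContext?: V2ToolContext; // undefined in the standalone console
}

function handleDispatch(deps: ConsoleDeps, goal: string): { status: number; body: string } {
  if (!deps.v2ToolContext) {
    // Standalone console: no live daemon context, so fail loudly and early.
    return { status: 503, body: "dispatch requires a running daemon" };
  }
  void deps.v2ToolContext.dispatch(goal);
  return { status: 202, body: "dispatched" };
}
```

The same route code serves both deployments; which one you get is decided purely by what the caller injects.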
|
|
44
|
+
|
|
45
|
+
`TriggerListenerHandle` (port 3200) already exposes `router`, `steerRegistry`, `scheduler` as public fields -- designed for the caller to pass to `startDaemonConsole()`.
|
|
46
|
+
|
|
47
|
+
---
|
|
48
|
+
|
|
49
|
+
## Philosophy Constraints
|
|
50
|
+
|
|
51
|
+
From `AGENTS.md` and `CLAUDE.md`:
|
|
52
|
+
|
|
53
|
+
| Principle | Relevance to This Problem |
|
|
54
|
+
|---|---|
|
|
55
|
+
| Architectural fixes over patches | Argues for deleting daemon-console.ts entirely rather than reducing its imports |
|
|
56
|
+
| YAGNI with discipline | Argues against adding frontend URL abstraction or subprocess management unless clearly needed |
|
|
57
|
+
| Make illegal states unrepresentable | The dual-console ambiguous state (which one wrote the lock file?) is representable today -- should be eliminated |
|
|
58
|
+
| Dependency injection for boundaries | Already practiced: mountConsoleRoutes accepts optional daemon handles |
|
|
59
|
+
| Errors are data | Proxy failures should be explicit 503/504, not silent timeouts |
|
|
60
|
+
|
|
61
|
+
No philosophy conflicts detected in the existing codebase -- it already follows these principles.
|
|
62
|
+
|
|
63
|
+
---
|
|
64
|
+
|
|
65
|
+
## Impact Surface
|
|
66
|
+
|
|
67
|
+
**Must remain consistent if this changes:**
|
|
68
|
+
- `src/cli/commands/worktrain-daemon.ts` (or equivalent daemon startup) -- the caller of `startDaemonConsole()`; must stop calling it in Candidate A
|
|
69
|
+
- `~/.workrail/daemon-console.lock` -- written by both `daemon-console.ts` and `standalone-console.ts`; only one can own it cleanly
|
|
70
|
+
- `src/mcp/handlers/session.ts` (`handleOpenDashboard`) -- reads the lock file to find the console port; soft coupling via filesystem, acceptable
|
|
71
|
+
- `console/src/api/hooks.ts` -- changes only in Candidate B
|
|
72
|
+
- Tests that start the daemon and expect the console to be up

---

## Candidates

### Candidate A: Delete daemon-console.ts (Simplest + Reframing)

**Summary:** Delete `daemon-console.ts` and its call site in daemon startup. `standalone-console.ts` becomes the only console server. Users run `worktrain console` separately.

**Tensions resolved:**
- Complete import separation (daemon-console.ts deleted)
- No lock-file ambiguity
- No redundant code

**Tensions accepted:**
- UX: users must start console separately when running the daemon
- Browser dispatch returns 503 in standalone console (no daemon context)

**Boundary solved at:** Daemon startup path (remove `startDaemonConsole()` call). ~220 lines deleted.

**Why this boundary is best-fit:** The dual-console architecture was the root cause, not a symptom. `standalone-console.ts` already does what `daemon-console.ts` does, without the coupling.

**Failure mode:** Dispatch button shows a confusing 503. Mitigation: improve error message to say "Autonomous dispatch requires a running daemon. Use `worktrain dispatch` CLI instead."

**Repo-pattern relationship:** Follows existing pattern. `standalone-console.ts` is already the target architecture.

**Gains:** Zero net code written. ~220 lines deleted. Clean architecture. No lock-file race.
**Gives up:** Daemon no longer auto-starts the console. Browser dispatch returns 503.

**Scope:** Best-fit. Single call site removed, one file deleted.

**Philosophy fit:** Full alignment. "Architectural fixes over patches" (delete, not patch). "YAGNI" (remove unnecessary code). "Make illegal states unrepresentable" (only one console server can exist).

---

### Candidate B: Control endpoints on trigger-listener:3200 + frontend absolute URLs

**Summary:** Add `POST /dispatch`, `POST /sessions/:id/steer`, `POST /triggers/:id/poll`, `GET /triggers` to the daemon's existing HTTP server (port 3200). Frontend calls `http://localhost:3200/...` for control actions.

**Tensions resolved:**
- Complete import separation (console server stays filesystem-only)
- Browser dispatch works when daemon running

**Tensions accepted:**
- Frontend must use absolute URLs for 4 endpoints + handle daemon unavailability
- CORS headers needed on trigger-listener.ts (currently has none)
- `createTriggerApp()` purpose changes from "webhook receiver" to "daemon HTTP API"

**Boundary solved at:** `src/trigger/trigger-listener.ts` (new routes) + `console/src/api/hooks.ts` (URL changes) + CORS middleware.

**Why this boundary is not best-fit:** trigger-listener.ts was designed as a webhook-only receiver. Adding console API routes to it creates a new cross-boundary coupling in the other direction. The frontend URL abstraction is sticky (hard to remove once added).

**Failure mode:** Frontend must detect ECONNREFUSED when daemon is down and disable buttons. React Query retry behavior may cause confusing UX. Also: CORS misconfiguration could silently block requests.

**Repo-pattern relationship:** Departs. No frontend absolute URL abstraction exists. `createTriggerApp()` is currently pure.

**Gains:** Complete separation + working browser dispatch.
**Gives up:** Frontend complexity, CORS on webhook receiver, availability detection.

**Scope:** Too broad for a feature (browser dispatch) used only by the project owner.

**Philosophy fit:** Partial. "Architectural fixes" (A). YAGNI conflict (frontend abstraction layer). "Dependency injection" (A -- control deps injected at 3200).

---

### Candidate C: Thin HTTP proxy on standalone console to daemon:3200

**Summary:** The standalone console (port 3456) proxies the 4 control endpoints to `http://127.0.0.1:3200` via HTTP fetch; returns 503 when the daemon is unreachable.

**Tensions resolved:**
- Zero frontend changes (relative URLs continue to work)
- Browser dispatch works when daemon running
- Console server has no daemon object imports

**Tensions accepted:**
- Soft runtime coupling: console server must know daemon's HTTP port (3200 or `WORKRAIL_TRIGGER_PORT`)
- New failure mode: proxy timeout when daemon is slow
- A new lock file or env var convention is needed to communicate daemon's control port

**Boundary solved at:** `src/console/standalone-console.ts` (add proxy routes using native `fetch`; 3 routes if steer is omitted, per open question 5) + a daemon `daemon-control.lock` file (or reuse `WORKRAIL_TRIGGER_PORT` env var).

**Why this boundary is best-fit:** `standalone-console.ts` is already the correct entry point. Adding HTTP proxy routes there keeps the frontend unchanged and avoids CORS complexity. The 503 failure mode already works (useTriggerList has a 503 fallback in hooks.ts).

**Failure mode:** Proxy silently times out when daemon is slow (fetch hangs). Mitigation: AbortController with a 5-second timeout; return 504 on timeout.
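
The mitigation can be sketched as a small forwarding helper. The function name (`proxyToDaemon`), the parameter shapes, and the JSON error body are assumptions for illustration, not the actual standalone-console.ts code; only the AbortController-plus-503/504 shape is the point.

```typescript
// Hypothetical sketch of one proxy route's forwarding helper (Node 18+ fetch).
const PROXY_TIMEOUT_MS = 5000;

async function proxyToDaemon(
  daemonBase: string, // e.g. "http://127.0.0.1:3200", discovered via lock file or env var
  path: string,
  init: RequestInit = {},
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), PROXY_TIMEOUT_MS);
  try {
    return await fetch(`${daemonBase}${path}`, { ...init, signal: controller.signal });
  } catch {
    // Errors are data: 504 when the daemon is slow (the abort fired),
    // 503 when it is down (connection refused before the deadline).
    const status = controller.signal.aborted ? 504 : 503;
    return new Response(JSON.stringify({ error: "daemon unreachable or timed out" }), {
      status,
      headers: { "content-type": "application/json" },
    });
  } finally {
    clearTimeout(timer);
  }
}
```

Because the helper converts both failure modes into ordinary `Response` objects, the route handlers never need their own try/catch, and the frontend's existing 503 handling keeps working unchanged.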

**Repo-pattern relationship:** Adapts the existing lock-file discovery pattern. `readConsoleLockPort()` in session.ts already reads a lock file for port discovery.

**Gains:** Zero frontend changes. Server-side separation. Dispatch works. Consistent with existing patterns.
**Gives up:** New HTTP-level runtime dependency from console to daemon. Proxy adds ~2-10ms latency on control actions.

**Scope:** Best-fit. 3 proxy routes added to standalone-console.ts, and daemon startup writes a daemon-control.lock.

**Philosophy fit:** Mostly aligned. "Errors are data" (A -- 503/504 on failure). "Validate at boundaries" (A -- checks daemon availability). "Architectural fixes over patches" (PARTIAL -- proxy is a pattern that hides the dependency rather than eliminating it).

---

### Candidate D: Daemon spawns standalone-console as a subprocess

**Summary:** Replace `startDaemonConsole()` with spawning `worktrain console` as a child process; the subprocess runs the standalone console with full process isolation.

**Tensions resolved:**
- True process separation (module scopes isolated)
- Daemon auto-starts console (UX convenience preserved)

**Tensions accepted:**
- Subprocess management complexity (wait for port bind, handle crashes, clean up on daemon shutdown)
- Binary path resolution must work in all install scenarios
- Dispatch still returns 503 (subprocess has no daemon handles) unless Candidate C's proxy is also added

**Boundary solved at:** Daemon startup code -- replace `startDaemonConsole()` with `execFile('worktrain', ['console'])` + port-wait loop.
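
A minimal sketch of what this startup path would involve, assuming the `worktrain` binary resolves on PATH (the fragile part) and using `spawn` rather than `execFile` since the child is long-lived; the helper names and timings are illustrative assumptions:

```typescript
import { spawn } from "node:child_process";
import { connect } from "node:net";

// Poll until something accepts TCP connections on the port, or give up.
function waitForPort(port: number, timeoutMs: number): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  return new Promise((resolve, reject) => {
    const attempt = () => {
      const sock = connect({ port, host: "127.0.0.1" }, () => {
        sock.end();
        resolve();
      });
      sock.on("error", () => {
        if (Date.now() > deadline) {
          reject(new Error(`port ${port} not listening after ${timeoutMs}ms`));
        } else {
          setTimeout(attempt, 200); // retry until the deadline
        }
      });
    };
    attempt();
  });
}

// Hypothetical replacement for startDaemonConsole(): spawn the standalone
// console and wait for it to bind. A daemon crash without cleanup can still
// orphan the child, which is the zombie-process risk noted for this candidate.
async function startConsoleSubprocess(): Promise<void> {
  const child = spawn("worktrain", ["console"], { stdio: "ignore" });
  child.on("error", (err) => console.error("console subprocess failed to start:", err));
  await waitForPort(3456, 10_000);
}
```

Even this compact version carries the lifecycle obligations the candidate accepts: crash handling, shutdown cleanup, and PATH resolution across install scenarios.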

**Failure mode:** Path resolution fails in some environments (global vs. local npm install). Zombie child processes if daemon crashes without cleanup.

**Repo-pattern relationship:** Departs significantly. No subprocess management exists in the current daemon startup path.

**Gains:** True process isolation.
**Gives up:** Subprocess management complexity, path resolution fragility, dispatch still 503s.

**Scope:** Too broad. Subprocess lifecycle management is significant complexity for no architectural benefit over Candidate A.

**Philosophy fit:** "Make illegal states unrepresentable" (A -- process boundary enforces separation). YAGNI violation.

---

## Comparison and Recommendation

### Comparison Matrix

| Tension | A | B | C | D |
|---|---|---|---|---|
| No daemon imports in console server | Full | Full | Full (no imports, HTTP proxy only) | Full |
| Browser dispatch when daemon running | No (503) | Yes | Yes | No (503 without C's proxy) |
| Frontend changes required | None | Significant | None | None |
| Simplicity / YAGNI | Best | Worst | Moderate | Worst |
| Reversibility | Easy | Hard (sticky URL abstraction) | Easy | Hard |
| Scope | Best-fit | Too broad | Best-fit | Too broad |

### Recommendation: **Candidate A first; Candidate C as an optional follow-on**

Candidate A is the correct architectural fix. The standalone console already exists and is correct. `daemon-console.ts` is redundant. Deleting it achieves the goal with zero net code written.

The dispatch button returning 503 is acceptable: the project owner is the only daemon user. A CLI alternative (`worktrain dispatch`) already handles this use case. The 503 message can be improved to explain the gap.

If the owner decides live browser dispatch is important after using the CLI for a while, Candidate C is the right follow-on. It adds 3 proxy routes to `standalone-console.ts` that forward control actions to daemon:3200 when reachable. No frontend changes needed. The 503 fallback behavior is already tested (useTriggerList has a 503 catch in hooks.ts).

**Candidates B and D are too broad.** B requires frontend changes for all 4 control endpoints plus CORS on a previously pure webhook receiver. D adds subprocess management complexity for no additional architectural benefit over A.

---

## Self-Critique

### Strongest Counter-Argument Against Candidate A

The dispatch button in the browser is a useful, discoverable affordance. Replacing it with "run a CLI command" is a regression that may feel jarring. If the owner uses browser dispatch as part of their workflow (trigger a session while reviewing session output in the same browser window), Candidate A removes that capability permanently until Candidate C is also implemented.

### What Would Tip the Decision to Candidate C

If the owner says "I use browser dispatch regularly and want it to work while viewing the console" -- Candidate C directly addresses this without frontend changes.

### Narrower Option That Almost Won

A + just improving the 503 error message to say "Use `worktrain dispatch` CLI while daemon is running" might be sufficient if the owner is CLI-comfortable.

### Broader Option That Might Be Justified

Candidate B would be justified if: (a) trigger-listener.ts was already growing into a general "daemon API" server, or (b) the frontend already had a base-URL abstraction for other reasons. Neither is true today.

### Assumption That Would Invalidate Candidate A

If there is a test or automation script that starts the daemon and expects the console to be reachable at 3456 without a separate `worktrain console` process, Candidate A breaks it. The daemon startup tests in `tests/` should be checked before implementing.

---

## Open Questions for the Main Agent

1. **Does the owner use browser dispatch regularly?** This is the single most important question. If yes, Candidate C > Candidate A. If no, Candidate A is clearly better.

2. **Are there tests that start the daemon and expect the console to be up?** If so, they must be updated in Candidate A.

3. **What is the daemon startup entrypoint?** The call to `startDaemonConsole()` needs to be removed. Identify the exact file and line.

4. **Should `daemon-console.ts` be deleted immediately or deprecated first?** Given it's a single-owner project, immediate deletion is fine -- no need for a deprecation phase.

5. **Does the Candidate C proxy need to handle the `steer` endpoint?** Steer is not currently called from the frontend (confirmed by codebase search). If steer is a coordinator-only call (not browser UI), it can be deferred or omitted from Candidate C.

@@ -0,0 +1,141 @@

# Design Candidates: Discovery Loop Fix

**Date:** 2026-04-19
**Task:** Thread session timeouts, inspect PipelineOutcome, add sidecar idempotency

---

## Problem Understanding

### Core Tensions

1. **Fire-and-forget vs. outcome inspection**: The poller uses `void dispatchP.then(...)` intentionally to avoid blocking the poll cycle. Fix 2 needs to inspect the `PipelineOutcome` -- still inside the async callback, not blocking the caller. Adding inspection doesn't change fire-and-forget semantics but adds I/O (a GitHub API call) inside a previously side-effect-free callback.

2. **Interface evolution vs. backward compat**: Adding `agentConfig` to `CoordinatorDeps.spawnSession` changes a widely-used interface. All existing callers that pass 3-4 args must continue to compile and work. An optional 5th param resolves this but requires all test fakes to accept (and ignore) the new param.

3. **Cross-restart idempotency vs. simplicity**: The sidecar (Fix 3) adds filesystem I/O to the poll cycle path. `checkIdempotency` must now detect a different file format. The sidecar uses a predictable filename (`queue-issue-<N>.json`) so it can be found by name rather than content scan.

4. **Single source of truth for issue ownership**: Fix 2 (GitHub label) and Fix 3 (sidecar file) provide overlapping protection. The label is persistent and cross-process; the sidecar is local and TTL-based. Both are needed: the label handles the post-completion case, and the sidecar handles the crash-during-dispatch window.
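
The interface evolution in tension 2 can be sketched as follows. The existing parameter list and the `AgentConfig` shape here are assumptions for illustration, not the real signature; the point is that an optional trailing parameter is additive for both callers and fakes.

```typescript
// Hypothetical shape of the additive change; real parameter names may differ.
interface AgentConfig {
  readonly maxSessionMinutes?: number;
}

interface CoordinatorDeps {
  spawnSession(
    workflowId: string,
    prompt: string,
    cwd?: string,
    env?: Readonly<Record<string, string>>,
    agentConfig?: AgentConfig, // new optional 5th param: 3-4 arg callers still compile
  ): Promise<string>;
}

// A test fake that ignores the new param still satisfies the interface,
// because TypeScript allows implementations with fewer parameters.
const fakeDeps: CoordinatorDeps = {
  spawnSession: async (workflowId) => `session-for-${workflowId}`,
};
```

This is why existing test fakes only need to *tolerate* the new argument rather than handle it: structural typing accepts the shorter signature unchanged.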

### Likely Seam

- Fix 1: `CoordinatorDeps.spawnSession` in `pr-review.ts` (interface), `trigger-listener.ts` (impl), `full-pipeline.ts` (call sites)
- Fix 2: `polling-scheduler.ts` around L605-624 (outcome handler) + new `applyGitHubLabel` method
- Fix 3: `polling-scheduler.ts` `doPollGitHubQueue` (sidecar write/delete) + `github-queue-poller.ts` `checkIdempotency` (sidecar read)
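
The Fix 2 outcome handler can be sketched like this. The three `PipelineOutcome` kind names below are hypothetical stand-ins (the real union lives in the codebase); what matters is that the inspection stays inside the callback and the switch is exhaustive.

```typescript
// Hypothetical kinds; the real PipelineOutcome union has its own names.
type PipelineOutcome =
  | { readonly kind: "completed"; readonly prUrl: string }
  | { readonly kind: "no-change-needed" }
  | { readonly kind: "failed"; readonly reason: string };

// Runs inside the existing `void dispatchP.then(...)` callback, so the
// poll cycle itself never blocks on this inspection.
function describeOutcome(outcome: PipelineOutcome): string {
  switch (outcome.kind) {
    case "completed":
      return `label issue; PR at ${outcome.prUrl}`;
    case "no-change-needed":
      return "label issue; no PR";
    case "failed":
      return `leave unlabeled: ${outcome.reason}`;
    default: {
      const exhaustive: never = outcome; // compile error if a 4th kind appears
      return exhaustive;
    }
  }
}
```

The `never`-typed default branch is what gives the exhaustiveness guarantee: adding a new outcome kind later fails the build until this handler is updated.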

### What Makes This Hard

- Sidecar TTL requires `DISCOVERY_TIMEOUT_MS` -- that constant lives in `adaptive-pipeline.ts`. Options: (a) import it into `polling-scheduler.ts`, or (b) store the resolved TTL in the sidecar file itself. Option (b) is cleaner -- `checkIdempotency` reads `dispatchedAt + ttlMs` from the file without needing to know the constant.
- `applyGitHubLabel` uses `(this.fetchFn ?? globalThis.fetch)` -- tests that verify this call must configure the fetchFn mock to handle the labels endpoint.
- The existing conservative behavior of `checkIdempotency` (return 'active' on ANY parse error) means a malformed sidecar permanently blocks the issue until the file is deleted. Since the sidecar is written by controlled code, this is acceptable.
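
A sketch of the labels call using the injected-fetch pattern described above. The endpoint path follows GitHub's real `POST /repos/{owner}/{repo}/issues/{n}/labels` API, but the function signature, label handling, and boolean return are assumptions, not the actual method:

```typescript
type FetchFn = (url: string, init?: RequestInit) => Promise<Response>;

// Hypothetical standalone version; the real method would live on
// PollingScheduler and resolve (this.fetchFn ?? globalThis.fetch).
async function applyGitHubLabel(
  repo: string, // "owner/name"
  issueNumber: number,
  label: string,
  token: string,
  fetchFn: FetchFn = globalThis.fetch,
): Promise<boolean> {
  const res = await fetchFn(
    `https://api.github.com/repos/${repo}/issues/${issueNumber}/labels`,
    {
      method: "POST",
      headers: {
        authorization: `Bearer ${token}`,
        accept: "application/vnd.github+json",
        "content-type": "application/json",
      },
      body: JSON.stringify({ labels: [label] }),
    },
  );
  return res.ok; // errors are data: the caller logs a false result; 4xx never throws here
}
```

Injecting `fetchFn` is exactly what lets the tests mentioned above stub the labels endpoint without touching the network.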

---

## Philosophy Constraints

From `CLAUDE.md`:
- **Immutability by default** -- all new interface fields use `readonly`
- **Errors are data** -- a sidecar write failure is logged but doesn't block dispatch (fire-and-forget I/O is acceptable for non-fatal defense-in-depth)
- **Type safety as first line of defense** -- changing `Promise<unknown>` to `Promise<PipelineOutcome>` enforces the type at compile time
- **Exhaustiveness** -- the `.then()` handler handles all 3 `PipelineOutcome` kinds
- **Dependency injection for boundaries** -- `applyGitHubLabel` uses the injected `fetchFn`
- **YAGNI** -- no new abstractions beyond what the spec requires

**No philosophy conflicts**. The fire-and-forget pattern in the poller is an intentional design decision, not a violation.

---

## Impact Surface

Changes to `CoordinatorDeps.spawnSession` affect:
- `pr-review.ts` callers: 3 call sites (`spawnResult` for mr-review L991, fix-agent L1243, re-review L1309) -- all pass 3-4 args, no 5th arg needed, backward-compatible
- All test fakes in `adaptive-full-pipeline.test.ts` and `coordinator-pr-review.test.ts` that mock `spawnSession`
- `trigger-listener.ts` implementation (must forward the new param)

Changes to `checkIdempotency` in `github-queue-poller.ts` affect:
- `polling-scheduler.ts` (the only caller)
- `tests/unit/github-queue-poller.test.ts` (existing idempotency tests must still pass)

---

## Candidates

### Candidate A: Exact spec implementation (recommended)

**Summary**: Implement all 3 fixes exactly per the spec. Sidecar TTL stored in the file. `checkIdempotency` extended to also scan for `queue-issue-*.json` files by filename pattern and check `dispatchedAt + ttlMs > Date.now()`.
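
The TTL branch can be sketched as a small pure helper. The sidecar field names follow the TTL-in-file option above; the helper name and the three-way status are assumptions, kept separate from file I/O so the time-dependent logic is trivially testable.

```typescript
// Sidecar shape under the TTL-in-file option: the resolved timeout is
// copied into the file at dispatch time, so no coordinator constant is needed here.
interface QueueSidecar {
  readonly issueNumber: number;
  readonly dispatchedAt: number; // epoch ms at dispatch time
  readonly ttlMs: number;        // resolved DISCOVERY_TIMEOUT_MS, copied in
}

type SidecarStatus = "active" | "expired" | "absent";

// Pure evaluation of raw sidecar contents; `raw` is undefined when no
// queue-issue-<N>.json file exists, `now` is injected for deterministic tests.
function evaluateSidecar(raw: string | undefined, now: number): SidecarStatus {
  if (raw === undefined) return "absent";
  try {
    const sidecar = JSON.parse(raw) as QueueSidecar;
    if (typeof sidecar.dispatchedAt !== "number" || typeof sidecar.ttlMs !== "number") {
      return "active"; // conservative: a malformed sidecar blocks the issue
    }
    return sidecar.dispatchedAt + sidecar.ttlMs > now ? "active" : "expired";
  } catch {
    return "active"; // conservative on parse errors, matching existing behavior
  }
}
```

Keeping the evaluation pure means `checkIdempotency` only adds a file read plus one call to this helper, and tests never need to mock `Date.now()`.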

**Tensions resolved**:
- Fire-and-forget: preserved (outcome inspection inside the async callback only)
- Interface compat: optional 5th param is additive
- Idempotency: single function handles both session files and sidecar files
- Source of truth: label (persistent) + sidecar (crash-window) coexist

**Tensions accepted**:
- `checkIdempotency` gains time-dependent behavior (TTL check)
- `applyGitHubLabel` adds network I/O inside the async callback

**Boundary**: `pr-review.ts` (interface), `trigger-listener.ts` (impl), `full-pipeline.ts` (call sites), `polling-scheduler.ts` (outcome + sidecar), `github-queue-poller.ts` (checkIdempotency)

**Failure mode**: Malformed sidecar -> conservative 'active' -> issue blocked indefinitely. Acceptable since the sidecar is written by controlled code and has a known format.

**Repo-pattern relationship**: Follows all established patterns (injectable deps, Result types, fetchFn, fire-and-forget async, conservative idempotency)

**Gains**: Complete fix, crash-safe, type-enforced outcomes
**Losses**: `checkIdempotency` is slightly more complex

**Scope**: best-fit -- exactly the files specified in the task

**Philosophy fit**: Honors immutability, type safety, exhaustiveness, DI, YAGNI

---

### Candidate B: Separate sidecar check function

**Summary**: Leave `checkIdempotency` unchanged. Add a new exported `checkQueueSidecar(issueNumber, sessionsDir)` function. Call both from `polling-scheduler.ts`.

**Tensions resolved**:
- `checkIdempotency` semantics unchanged (no time-dependency added)

**Tensions accepted**:
- Two call sites in `polling-scheduler.ts` -- future idempotency mechanisms need to be added in two places
- More surface area to test

**Failure mode**: A future refactor adds one check and misses the other

**Scope**: slightly broader (new exported function)

**Philosophy fit**: Honors "compose with small pure functions" but creates coordination risk

---

### Candidate C: Config-based timeout (rejected)

**Summary**: Set `agentConfig.maxSessionMinutes: 65` in the trigger definition's config file (`triggers.yml`) instead of making code changes.

**Rejected because**: Creates a dual source of truth. The coordinator already has per-phase timeouts as constants. Config can drift. Explicitly rejected in `discovery-loop-investigation-B.md` as Option B.

---

## Comparison and Recommendation

**Recommendation: Candidate A**

Candidate A matches the spec exactly and keeps idempotency checking in one function. The TTL-in-file approach keeps `checkIdempotency` free of coordinator constants, maintaining clean module boundaries. The fire-and-forget semantics are preserved throughout.

Candidate B creates a maintenance coordination risk with no clear benefit. The two-function pattern is only beneficial if `checkIdempotency` is used in contexts where sidecar checking is unwanted -- there's only one call site.

---

## Self-Critique

**Strongest argument against Candidate A**: Modifying `checkIdempotency` adds time-dependent behavior (the TTL check) to what was a pure scan. This makes the function harder to test deterministically without mocking `Date.now()`.

**Mitigation**: Tests can control the `dispatchedAt` and `ttlMs` values in the sidecar file. An expired sidecar has `dispatchedAt + ttlMs < Date.now()` -- easily crafted in tests with `dispatchedAt: 0, ttlMs: 1`.

**Assumption that would invalidate this**: If the `this.fetchFn` type in `PollingScheduler` were too narrow to call the GitHub Labels API (e.g., typed to only accept GitLab URLs). The actual type is `FetchFn | undefined` where `FetchFn = (url: string, init?: RequestInit) => Promise<Response>` -- generic enough.

---

## Open Questions for Main Agent

None. The spec is fully determined. No human-decision questions remain.
|