npm - @exaudeus/workrail - Versions diffs - 3.31.1 → 3.33.0 - Mend

@exaudeus/workrail 3.31.1 → 3.33.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (82)

package/dist/cli/commands/index.d.ts +1 -0
package/dist/cli/commands/index.js +3 -1
package/dist/cli/commands/worktrain-await.js +11 -9
package/dist/cli/commands/worktrain-daemon-install.d.ts +35 -0
package/dist/cli/commands/worktrain-daemon-install.js +291 -0
package/dist/cli/commands/worktrain-daemon.d.ts +31 -0
package/dist/cli/commands/worktrain-daemon.js +272 -0
package/dist/cli/commands/worktrain-spawn.js +11 -9
package/dist/cli-worktrain.js +329 -0
package/dist/cli.js +4 -22
package/dist/console/standalone-console.d.ts +28 -0
package/dist/console/standalone-console.js +142 -0
package/dist/{console/assets/index-6H9DeFxj.js → console-ui/assets/index-BuJFLLfY.js} +1 -1
package/dist/{console → console-ui}/index.html +1 -1
package/dist/daemon/agent-loop.d.ts +26 -0
package/dist/daemon/agent-loop.js +53 -2
package/dist/daemon/daemon-events.d.ts +103 -0
package/dist/daemon/daemon-events.js +56 -0
package/dist/daemon/workflow-runner.d.ts +6 -3
package/dist/daemon/workflow-runner.js +229 -33
package/dist/infrastructure/session/HttpServer.js +133 -34
package/dist/manifest.json +134 -70
package/dist/mcp/output-schemas.d.ts +30 -30
package/dist/mcp/transports/bridge-events.d.ts +4 -0
package/dist/mcp/transports/fatal-exit.js +4 -0
package/dist/mcp/transports/http-entry.js +2 -0
package/dist/mcp/transports/stdio-entry.js +26 -6
package/dist/mcp/v2/tools.d.ts +4 -4
package/dist/trigger/adapters/github-poller.d.ts +44 -0
package/dist/trigger/adapters/github-poller.js +190 -0
package/dist/trigger/adapters/gitlab-poller.d.ts +27 -0
package/dist/trigger/adapters/gitlab-poller.js +81 -0
package/dist/trigger/delivery-client.d.ts +2 -1
package/dist/trigger/delivery-client.js +4 -1
package/dist/trigger/index.d.ts +4 -1
package/dist/trigger/index.js +5 -1
package/dist/trigger/polled-event-store.d.ts +22 -0
package/dist/trigger/polled-event-store.js +173 -0
package/dist/trigger/polling-scheduler.d.ts +20 -0
package/dist/trigger/polling-scheduler.js +249 -0
package/dist/trigger/trigger-listener.d.ts +5 -0
package/dist/trigger/trigger-listener.js +53 -4
package/dist/trigger/trigger-router.d.ts +4 -2
package/dist/trigger/trigger-router.js +7 -4
package/dist/trigger/trigger-store.js +114 -33
package/dist/trigger/types.d.ts +17 -1
package/dist/v2/durable-core/schemas/export-bundle/index.d.ts +224 -224
package/dist/v2/durable-core/schemas/session/events.d.ts +42 -42
package/dist/v2/durable-core/schemas/session/manifest.d.ts +6 -6
package/dist/v2/durable-core/schemas/session/validation-event.d.ts +2 -2
package/dist/v2/durable-core/tokens/payloads.d.ts +52 -52
package/dist/v2/usecases/console-routes.js +3 -3
package/dist/v2/usecases/console-service.js +133 -9
package/dist/v2/usecases/console-types.d.ts +7 -0
package/docs/design/daemon-conversation-logging-plan.md +98 -0
package/docs/design/daemon-conversation-logging-review.md +55 -0
package/docs/design/daemon-conversation-logging.md +129 -0
package/docs/design/github-polling-adapter-design-candidates.md +226 -0
package/docs/design/github-polling-adapter-design-review-findings.md +131 -0
package/docs/design/github-polling-adapter-implementation-plan.md +284 -0
package/docs/design/implementation_plan.md +192 -0
package/docs/design/workflow-id-validation-at-startup.md +146 -0
package/docs/design/workflow-id-validation-design-review.md +87 -0
package/docs/design/workflow-id-validation-implementation-plan.md +185 -0
package/docs/design/worktrain-system-prompt-report-issue-candidates.md +135 -0
package/docs/design/worktrain-system-prompt-report-issue-design-review.md +73 -0
package/docs/ideas/backlog.md +465 -0
package/package.json +1 -1
package/workflows/architecture-scalability-audit.json +1 -1
package/workflows/bug-investigation.agentic.v2.json +3 -3
package/workflows/coding-task-workflow-agentic.json +32 -32
package/workflows/coding-task-workflow-agentic.lean.v2.json +1 -1
package/workflows/coding-task-workflow-agentic.v2.json +7 -7
package/workflows/mr-review-workflow.agentic.v2.json +21 -12
package/workflows/personal-learning-materials-creation-branched.json +2 -2
package/workflows/production-readiness-audit.json +1 -1
package/workflows/relocation-workflow-us.json +2 -2
package/workflows/ui-ux-design-workflow.json +14 -14
package/workflows/workflow-for-workflows.json +3 -3
package/workflows/workflow-for-workflows.v2.json +2 -2
package/workflows/wr.discovery.json +1 -1
/package/dist/{console → console-ui}/assets/index-8dh0Psu-.css +0 -0

package/docs/design/implementation_plan.md ADDED

@@ -0,0 +1,192 @@
+# Implementation Plan: WorkTrain System Prompt Preamble + report_issue Tool
+## Problem Statement
+The WorkTrain daemon's system prompt preamble is thin (15 lines) and relies on the soul file for behavioral guidance. This leaves unattended agents without explicit direction on self-directed reasoning, the oracle hierarchy, or what to do when things go wrong. Additionally, there is no structured way for agents to record issues/errors for a future auto-fix coordinator -- failures either go unrecorded or end up buried in step notes.
+---
+## Acceptance Criteria
+1. `buildSystemPrompt()` output contains a richer preamble (~55 lines) that:
+   - Opens with "You are WorkRail Auto, an autonomous agent..." (existing test assertion preserved)
+   - Includes `## Your tools` section listing all 5 tools (existing test assertion)
+   - Includes `## Execution contract` section (existing test assertion)
+   - Adds `## What you are`, `## Your oracle`, `## Self-directed reasoning`, `## The workflow is the contract`, `## Silent failure is the worst outcome`, `## Tools are your hands not your voice`, `## You don't have a user` sections
+   - All existing `workflow-runner-system-prompt.test.ts` tests pass without modification
+2. `makeReportIssueTool(sessionId, emitter?, issuesDirOverride?)` is exported from `workflow-runner.ts`:
+   - Tool name: `report_issue`
+   - Input schema accepts: `kind` (5-value literal enum), `severity` (4-value literal enum), `summary` (string, required), `context` (string, optional), `toolName` (string, optional), `command` (string, optional), `suggestedFix` (string, optional), `continueToken` (string, optional)
+   - `execute()` appends one JSON line to `~/.workrail/issues/<sessionId>.jsonl` (or `issuesDirOverride/<sessionId>.jsonl` in tests) -- fire-and-forget (void+catch)
+   - `execute()` emits a `DaemonEventEmitter` event with `kind: 'issue_reported'`
+   - For non-fatal severity: returns `"Issue recorded (severity=<severity>). Continue with your work unless this is fatal."`
+   - For fatal severity: returns `"FATAL issue recorded. Call continue_workflow with notes explaining the blocker, then the session will end."`
+   - Wired into `runWorkflow()` tools array
+3. `IssueReportedEvent` is added to `DaemonEvent` union in `daemon-events.ts`:
+   - `kind: 'issue_reported'`, `sessionId: string`, `issueKind` (5-value literal union), `severity` (4-value literal union), `summary: string`, `continueToken?: string`
+4. `npm run build` succeeds (no TS errors)
+5. `npx vitest run` passes (all existing tests + new tests)
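+The sketch below shows how criteria 2 and 3 could fit together. It assumes illustrative literal values: only `'fatal'` is named in this plan, so the other severity and kind values are hypothetical placeholders, as are the helper names `toIssueReportedEvent` and `reportIssueResultText`. It maps the tool input's `kind` onto the event's `issueKind`, which leaves `kind: 'issue_reported'` free to act as the union discriminant.
```typescript
// Hypothetical literal values: the plan fixes only the counts (5 kinds, 4 severities) and 'fatal'.
type IssueKind = 'tool_failure' | 'blocked' | 'unexpected_state' | 'workflow_bug' | 'other';
type IssueSeverity = 'info' | 'warn' | 'error' | 'fatal';

// Tool input per criterion 2 (context/toolName/command/suggestedFix omitted for brevity).
interface ReportIssueInput {
  readonly kind: IssueKind;
  readonly severity: IssueSeverity;
  readonly summary: string;
  readonly continueToken?: string;
}

// Event variant per criterion 3: readonly fields, literal unions, fixed discriminant.
interface IssueReportedEvent {
  readonly kind: 'issue_reported';
  readonly sessionId: string;
  readonly issueKind: IssueKind;
  readonly severity: IssueSeverity;
  readonly summary: string;
  readonly continueToken?: string;
}

// Maps the input's `kind` onto `issueKind` so it cannot collide with the event discriminant.
function toIssueReportedEvent(sessionId: string, input: ReportIssueInput): IssueReportedEvent {
  return {
    kind: 'issue_reported',
    sessionId,
    issueKind: input.kind,
    severity: input.severity,
    summary: input.summary,
    // Include continueToken only when the caller supplied one.
    ...(input.continueToken !== undefined ? { continueToken: input.continueToken } : {}),
  };
}

// Return strings taken verbatim from criterion 2.
function reportIssueResultText(severity: IssueSeverity): string {
  return severity === 'fatal'
    ? 'FATAL issue recorded. Call continue_workflow with notes explaining the blocker, then the session will end.'
    : `Issue recorded (severity=${severity}). Continue with your work unless this is fatal.`;
}
```
+The risk register below calls out the same `kind`/`issueKind` split. The event's own `kind` field is the `DaemonEvent` discriminant, so reusing it for the issue category would break exhaustive switching.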
+---
+## Non-Goals
+- No auto-fix coordinator implementation
+- No IssueStore class (YAGNI -- extract when coordinator needs it)
+- No changes to `soul-template.ts`, `triggers.yml`, or `src/v2/`
+- No changes to the soul file template/default
+- No changes to AgentLoop behavior (fatal severity does not abort the loop)
+- No async changes to `buildSystemPrompt()` (must remain synchronous and pure)
+---
+## Philosophy-Driven Constraints
+- `buildSystemPrompt()` must remain a pure, synchronous function (no I/O, no side effects)
+- All `DaemonEvent` variants must use `readonly` fields only
+- `IssueReportedEvent.issueKind` and `.severity` must be literal union types (not `string`)
+- JSONL write must be fire-and-forget: `void appendIssueAsync().catch(() => {})`
+- `mkdir({ recursive: true })` before every appendFile (handles missing dir silently)
+- `issuesDirOverride` parameter for test isolation (mirrors DaemonEventEmitter constructor)
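+The fire-and-forget write described above could look roughly like the following. It assumes Node's `fs/promises` API. The `IssueRecord` shape and the `recordIssue` wrapper are hypothetical names for illustration, and the record fields come from the test design (kind, severity, summary, ts, sessionId).
```typescript
import { appendFile, mkdir } from 'node:fs/promises';
import { homedir } from 'node:os';
import { join } from 'node:path';

// Hypothetical record shape; field list taken from the plan's test design.
interface IssueRecord {
  readonly sessionId: string;
  readonly kind: string;
  readonly severity: string;
  readonly summary: string;
  readonly ts: string;
}

// Appends one JSON line to <issuesDir>/<sessionId>.jsonl.
// issuesDirOverride mirrors the plan's test-isolation parameter; default is ~/.workrail/issues.
async function appendIssueAsync(record: IssueRecord, issuesDirOverride?: string): Promise<void> {
  const dir = issuesDirOverride ?? join(homedir(), '.workrail', 'issues');
  await mkdir(dir, { recursive: true }); // succeeds silently if the directory already exists
  await appendFile(join(dir, `${record.sessionId}.jsonl`), JSON.stringify(record) + '\n', 'utf8');
}

// Fire-and-forget call site: never awaited, rejections swallowed,
// so a failed write can neither block nor throw out of execute().
function recordIssue(record: IssueRecord, issuesDirOverride?: string): void {
  void appendIssueAsync(record, issuesDirOverride).catch(() => {});
}
```
+`recordIssue` returns synchronously, so `execute()` can build its return string immediately. A failed `mkdir` or `appendFile` is swallowed by the `.catch`, which is the trade-off the plan accepts for silent write failures.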
+---
+## Invariants
+1. `buildSystemPrompt()` is pure and synchronous -- verified by existing tests calling it directly
+2. `'You are WorkRail Auto'` is present in `buildSystemPrompt()` output -- verified by test L29
+3. `'## Your tools'` is present in `buildSystemPrompt()` output -- verified by test L30
+4. `'## Execution contract'` is present in `buildSystemPrompt()` output -- verified by test L32
+5. All `DaemonEvent` variants use `readonly` fields -- verified by TS compiler
+6. `DaemonEvent` union is exhaustive -- TS compiler enforces at every switch site
+7. `report_issue.execute()` never throws -- returns `AgentToolResult` always
+8. JSONL write never blocks `execute()` return -- `void` Promise
+---
+## Selected Approach + Rationale
+**Part 1:** Module-private `BASE_SYSTEM_PROMPT` string constant defined above `buildSystemPrompt()`. The function uses it as the start of the lines array. Rationale: named constant is readable as a document; testable via `buildSystemPrompt()` output; follows `soul-template.ts` precedent for stable-content constants.
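+Here is a sketch of the Part 1 shape, assuming placeholder prompt wording. The headings come from acceptance criterion 1, but their order is illustrative. The `sessionSections` parameter is a hypothetical stand-in for whatever `buildSystemPrompt()` actually appends after the preamble.
```typescript
/**
 * Stable preamble shared by every session (module-private constant).
 * Placeholder wording only; headings follow acceptance criterion 1.
 */
const BASE_SYSTEM_PROMPT: string = [
  'You are WorkRail Auto, an autonomous agent ...', // placeholder text
  '',
  '## What you are',
  '## Your oracle',
  '## Self-directed reasoning',
  '## The workflow is the contract',
  '## Silent failure is the worst outcome',
  '## Tools are your hands not your voice',
  "## You don't have a user",
  '## Your tools',
  '## Execution contract',
].join('\n');

// Pure and synchronous: no I/O, and the same input always yields the same output.
// The constant is the first element of the lines array; later sections are appended after it.
function buildSystemPrompt(sessionSections: readonly string[]): string {
  const lines: string[] = [BASE_SYSTEM_PROMPT];
  for (const section of sessionSections) {
    lines.push('', section);
  }
  return lines.join('\n');
}
```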
+**Part 2:** `makeReportIssueTool(sessionId, emitter?, issuesDirOverride?)` inline tool factory following the exact shape of `makeReadTool`/`makeWriteTool`. Private `appendIssueAsync()` helper for JSONL write. `issuesDirOverride` for test isolation (hybrid of inline factory + runner-up's dirOverride). Rationale: YAGNI -- no IssueStore class until coordinator exists; hybrid resolves testability without over-engineering.
+**Runner-up:** IssueStore class (Candidate B). Lost to YAGNI -- one caller, no coordinator yet.
+---
+## Vertical Slices
+### Slice 1: Create feature branch
+- Create `feat/worktrain-system-prompt-and-report-issue` from current main
+- Verify clean state
+### Slice 2: Add IssueReportedEvent to daemon-events.ts
+- Add `IssueReportedEvent` interface
+- Add to `DaemonEvent` union
+- Verify TS compiles
+### Slice 3: Replace buildSystemPrompt() preamble
+- Define `BASE_SYSTEM_PROMPT` constant above `buildSystemPrompt()`
+- Replace lines 1087-1108 to use the constant
+- Verify all existing system-prompt tests pass
+### Slice 4: Implement makeReportIssueTool
+- Add private `appendIssueAsync()` helper
+- Add `makeReportIssueTool()` factory
+- Wire into `runWorkflow()` tools array
+### Slice 5: Tests
+- Add tests for `makeReportIssueTool` -- verify JSONL write with temp dir, verify event emitted, verify return strings, verify fatal vs non-fatal
+- Verify all existing tests still pass
+### Slice 6: Build + full test run
+- `npm run build` -- zero errors
+- `npx vitest run` -- all pass
+### Slice 7: PR
+- Commit with conventional commit message
+- Open PR to main
+---
+## Test Design
+### Existing tests (must pass unchanged)
+- `tests/unit/workflow-runner-system-prompt.test.ts` -- all 11 tests
+- `tests/unit/daemon-events.test.ts` -- all existing tests
+### New tests to add
+File: `tests/unit/workflow-runner-report-issue.test.ts`
+Test cases:
+1. `makeReportIssueTool` -- returns correct tool name and description
+2. `execute()` with non-fatal severity -- returns confirmation string with severity
+3. `execute()` with fatal severity -- returns FATAL message
+4. `execute()` -- writes JSON line to issuesDirOverride/<sessionId>.jsonl
+5. `execute()` -- written JSON contains kind, severity, summary, ts, sessionId
+6. `execute()` -- creates dir if it doesn't exist (mkdir recursive)
+7. `execute()` -- emits `issue_reported` event via emitter
+8. `execute()` -- optional fields (context, toolName, command, suggestedFix, continueToken) present in JSON when provided
+9. `execute()` -- does not throw when write fails (fire-and-forget)
+---
+## Risk Register
+| Risk | Likelihood | Impact | Mitigation |
+|---|---|---|---|
+| BASE_SYSTEM_PROMPT missing required test strings | Low | High (CI break) | Include `'You are WorkRail Auto'`, `'## Your tools'`, `'## Execution contract'` explicitly; tests catch immediately |
+| IssueReportedEvent `issueKind` vs tool input `kind` confusion | Low | Medium (runtime behavior ok, TS shape wrong) | Use `issueKind` in event interface; keep `kind` in input schema |
+| Silent JSONL write failure not caught in tests | Low | Low (fire-and-forget is intentional) | issuesDirOverride isolates write path; test case #9 verifies no throw |
+| Agent ignores fatal severity | Medium | Medium (tokens wasted) | Out of scope; coordinator detects post-hoc |
+---
+## PR Packaging Strategy
+Single PR: `feat/worktrain-system-prompt-and-report-issue`
+- All 3 files changed: `src/daemon/workflow-runner.ts`, `src/daemon/daemon-events.ts`, `tests/unit/workflow-runner-report-issue.test.ts`
+- Commit message: `feat(mcp): richer daemon system prompt and report_issue tool for auto-fix coordinator`
+- Scope: these are daemon changes, but `daemon` is not an allowed scope. CLAUDE.md allows `console`, `mcp`, `workflows`, `engine`, `schema`, and `docs`. The daemon is part of the WorkRail server and lives under `mcp` in this codebase, so use `mcp` (not `console`).
+---
+## Philosophy Alignment Per Slice
+### Slice 2 (daemon-events.ts)
+- Exhaustiveness everywhere -> satisfied (new union variant, TS enforces handling)
+- Make illegal states unrepresentable -> satisfied (literal unions for issueKind/severity)
+- Immutability by default -> satisfied (readonly fields)
+### Slice 3 (BASE_SYSTEM_PROMPT)
+- Functional core, imperative shell -> satisfied (buildSystemPrompt remains pure)
+- Immutability by default -> satisfied (const)
+- Document why not what -> satisfied (JSDoc on constant)
+- YAGNI with discipline -> satisfied (no speculative additions)
+### Slice 4 (makeReportIssueTool)
+- Observability as a constraint -> satisfied (fire-and-forget, never blocks)
+- Errors are data -> satisfied (execute() returns AgentToolResult, never throws)
+- Prefer fakes over mocks -> satisfied (issuesDirOverride for tests)
+- YAGNI with discipline -> satisfied (no IssueStore class)
+- Exhaustiveness everywhere -> satisfied (return value handles all severity levels)
+### Slice 5 (tests)
+- Prefer fakes over mocks -> satisfied (temp dir, no fs mocking)
+- Determinism -> satisfied (all test writes go to unique temp dirs)
+---
+## Summary
+- `estimatedPRCount`: 1
+- `planConfidenceBand`: High
+- `unresolvedUnknownCount`: 0
+- `followUpTickets`: Extract IssueStore class when auto-fix coordinator is built

package/docs/design/workflow-id-validation-at-startup.md ADDED

@@ -0,0 +1,146 @@
+# Design: Workflow ID Validation at Daemon Startup
+**Status:** Decision made -- implement Candidate A
+**Date:** 2026-04-16
+**Context:** Backlog item "Workflow ID validation at startup" (Tier 1, groomed Apr 18)
+---
+## Problem Understanding
+### The Bug
+A user writes `workflowId: coding-task-workflow-agentic.lean.v2` (filename without extension) instead of `coding-task-workflow-agentic` (the actual workflow ID). The daemon starts fine, accepts webhooks, but every dispatch silently fails with `workflow_not_found`. The error only surfaces in logs, not at startup. The operator has no way to know their trigger is broken until they watch logs during an actual webhook event.
+### Core Tensions
+1. **Testability vs. production simplicity** -- `ctx.workflowService.getWorkflowById` is available in production but tests use `FAKE_CTX = {} as V2ToolContext` where `workflowService` is `undefined`. Requires an injectable function approach, not direct ctx access.
+2. **Warn+skip consistency vs. fail-fast** -- `loadTriggerConfig` already chose warn+skip for invalid triggers. A hard-fail here would create two conflicting behaviors in the same startup path.
+3. **Where to wire the lookup** -- `StartTriggerListenerOptions` injectable (matches existing `runWorkflowFn` pattern) vs. direct `ctx` access.
+### Likely Seam
+`startTriggerListener` in `src/trigger/trigger-listener.ts`, after `buildTriggerIndex()` returns ok (~line 235), before `new TriggerRouter(...)`. This is the correct seam -- triggers are loaded and indexed, but no webhooks can arrive yet.
+### What Makes This Hard
+- `FAKE_CTX = {} as V2ToolContext` in tests -- direct `ctx.workflowService` use breaks existing test infrastructure without any compile-time warning.
+- Need to decide what happens when `getWorkflowByIdFn` is not provided (backward compat: skip validation entirely).
+- Workflows are static definition files -- if not found at startup, they will never be found at dispatch time either. No "not found now, maybe later" case exists.
+---
+## Philosophy Constraints
+**Sources:**
+- `/Users/etienneb/CLAUDE.md`: "Dependency injection for boundaries -- inject external effects (I/O, clocks, randomness) to keep core logic testable"
+- `/Users/etienneb/CLAUDE.md`: "Validate at boundaries, trust inside -- do input validation at system edges"
+- Repo pattern: `runWorkflowFn?: RunWorkflowFn` in `StartTriggerListenerOptions` -- exact injectable pattern to follow
+- Repo pattern: `loadTriggerConfig` warn+skip -- policy to remain consistent with
+**No conflicts.** All sources agree on: DI injectable for testability, warn+skip policy, validate at the startup boundary.
+---
+## Impact Surface
+- **`src/trigger/trigger-listener.ts`** -- primary change. New validation loop and new `StartTriggerListenerOptions` field.
+- **`tests/unit/trigger-router.test.ts`** -- add new test cases. Existing tests unaffected (they don't provide `getWorkflowByIdFn`, so validation is skipped -- same behavior as today).
+- **`src/trigger/trigger-router.ts`** -- no change. Router already handles `workflow_not_found` at dispatch; this is an earlier defense layer.
+- **`src/trigger/trigger-store.ts`** -- no change. YAML parsing is separate from workflow ID resolution.
+- **`src/trigger/types.ts`** -- no change. `TriggerDefinition` shape unchanged.
+---
+## Candidates
+### Candidate A -- Injectable function on StartTriggerListenerOptions (RECOMMENDED)
+**Summary:** Add `getWorkflowByIdFn?: (id: string) => Promise<boolean>` to `StartTriggerListenerOptions`. The production caller passes `(id) => ctx.workflowService.getWorkflowById(id).then(w => w !== null)`. When the fn is not provided, validation is skipped (backward compat for existing tests).
+**Tensions resolved:** Testability (tests inject stub), warn+skip consistency, DI principle.
+**Tensions accepted:** Slight verbosity (new option field). Validation silently skipped if fn not provided (intentional).
+**Boundary:** `startTriggerListener`, after `buildTriggerIndex()` returns ok.
+**Why this boundary:** Single assembly point before the router accepts any traffic. Earlier (store layer) would require making `loadTriggerConfig` async. Later (dispatch time) is too late -- that's the bug we're fixing.
+**Failure mode:** Existing tests that don't inject `getWorkflowByIdFn` silently skip validation. This is intentional backward compat, not a latent bug -- they still test all other startup behavior.
+**Repo pattern:** Exact match to `runWorkflowFn?: RunWorkflowFn` in the same `StartTriggerListenerOptions` interface.
+**Gains:** Full testability, no changes to existing tests, clean DI seam, consistent with all philosophy principles.
+**Losses:** Caller must inject the fn to get validation. If someone creates a new caller of `startTriggerListener` without providing it, they get no validation. (Low risk: only one production caller.)
+**Scope judgment:** Best-fit. Changes only `trigger-listener.ts` and adds tests. No interface changes to store or router.
+**Philosophy fit:** Honors "Dependency injection for boundaries", "Validate at boundaries, trust inside". No conflicts.
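+The following sketch illustrates the DI seam using hypothetical minimal types; `WorkflowLookup`, `resolverFromService`, and `fakeResolver` are not names from the codebase. Production adapts the workflow service into a boolean resolver. Tests instead pass a fake backed by a fixed ID set and never touch `ctx`.
```typescript
type GetWorkflowByIdFn = (id: string) => Promise<boolean>;

// Hypothetical minimal options shape; the real interface has more fields.
interface StartTriggerListenerOptions {
  // Optional, like runWorkflowFn: when omitted, startup validation is skipped.
  readonly getWorkflowByIdFn?: GetWorkflowByIdFn;
}

// Hypothetical stand-in for the workflow service: resolves to null for unknown IDs.
interface WorkflowLookup {
  getWorkflowById(id: string): Promise<object | null>;
}

// Production wiring: adapt the service into a boolean resolver.
function resolverFromService(service: WorkflowLookup): GetWorkflowByIdFn {
  return (id) => service.getWorkflowById(id).then((w) => w !== null);
}

// Test wiring: a fake backed by a fixed set of known IDs; no ctx required.
function fakeResolver(knownIds: readonly string[]): GetWorkflowByIdFn {
  const known = new Set(knownIds);
  return async (id) => known.has(id);
}
```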
+---
+### Candidate B -- Use ctx.workflowService directly with null guard
+**Summary:** Call `ctx.workflowService?.getWorkflowById(id)` directly in the validation loop, skipping the whole loop if `ctx.workflowService` is undefined.
+**Tensions resolved:** Production simplicity (no new option field).
+**Tensions accepted:** Testability gap -- the warn+skip behavior can't be tested without constructing a real `workflowService` in `ctx`.
+**Failure mode:** New validation behavior is untestable with the existing `FAKE_CTX` test infrastructure.
+**Repo pattern:** Departs from `runWorkflowFn` injectable pattern. Conflicts with DI principle.
+**Scope judgment:** Best-fit for production behavior, too narrow for test coverage.
+**Philosophy fit:** Conflicts with "Dependency injection for boundaries".
+---
+### Candidate C -- Validate inside loadTriggerConfig (store layer)
+**Summary:** Add `workflowResolver?: (id: string) => Promise<boolean>` to `loadTriggerConfig`, filtering unknown workflowId triggers at parse time.
+**Tensions resolved:** Centralizes all trigger validation.
+**Tensions accepted:** `trigger-store.ts` is a pure synchronous YAML parser; making it async for the resolver breaks its pure/impure boundary and all existing sync call sites.
+**Failure mode:** Breaks `loadTriggerConfig`'s synchronous interface contract. All existing callers would need updating.
+**Repo pattern:** Departs from the pure-sync design of `trigger-store.ts`.
+**Scope judgment:** Too broad -- adds async I/O to a pure parsing module with no justification beyond this feature.
+**Philosophy fit:** Conflicts with "Compose with small, pure functions".
+---
+## Comparison and Recommendation
+| Tension | A (Injectable) | B (ctx direct) | C (store layer) |
+|---------|---------------|----------------|-----------------|
+| Testability | Wins | Loses | N/A |
+| Warn+skip consistency | Wins | Wins | Breaks pure boundary |
+| DI principle | Honors | Conflicts | Conflicts |
+| Repo pattern fit | Exact match | Departs | Departs |
+| Reversibility | Easy | Easy | Hard |
+**Recommendation: Candidate A.** It resolves all tensions, is a direct repo-pattern match, requires minimal code change, and leaves all existing tests unchanged.
+---
+## Self-Critique
+**Strongest counter-argument:** "Why add a new option when `ctx.workflowService` is already there? That's extra API surface for a one-time startup check." -- Response: `FAKE_CTX = {} as V2ToolContext` (line 33, `trigger-router.test.ts`) means `ctx.workflowService` is `undefined` at test runtime. Without the injectable, the new validation behavior is untestable. Fixing a silent-failure bug without being able to test it is unacceptable.
+**Narrower option that lost:** Candidate B (ctx direct with null guard). Loses because new behavior is untestable.
+**Broader option that would need evidence:** Candidate C (store layer) would be justified if multiple callers of `loadTriggerConfig` needed workflow ID validation -- but there is only one production caller. The scope increase is not warranted.
+**Invalidating assumption:** If `FAKE_CTX` were replaced by a real mock with a `workflowService`, Candidate B would be equally valid. But that's a larger test infrastructure change that's out of scope.
+---
+## Open Questions for the Main Agent
+None. All design decisions are resolved. Implementation is straightforward:
+1. Add `getWorkflowByIdFn?: (id: string) => Promise<boolean>` to `StartTriggerListenerOptions`
+2. After `buildTriggerIndex()` returns ok, if `getWorkflowByIdFn` is provided, iterate `triggerIndex`, call fn for each `workflowId`, warn and delete unknowns
+3. Production caller: pass `(id) => ctx.workflowService.getWorkflowById(id).then(w => w !== null)` as `getWorkflowByIdFn` (callers that omit it, such as existing tests, skip validation)
+4. Add test cases for: warn+skip on unknown workflowId, valid workflowId passes through, fn not provided skips validation

package/docs/design/workflow-id-validation-design-review.md ADDED

@@ -0,0 +1,87 @@
+# Design Review: Workflow ID Validation at Daemon Startup
+**Design under review:** Candidate A -- injectable `getWorkflowByIdFn` on `StartTriggerListenerOptions`
+**Date:** 2026-04-16
+---
+## Tradeoff Review
+| Tradeoff | Acceptable? | Condition that breaks it |
+|----------|-------------|--------------------------|
+| Validation silently skipped when fn not provided | Yes | A new production call site added without the fn |
+| New option field (API surface) | Yes | Interface is internal, non-breaking |
+Hidden assumption: single production call site for `startTriggerListener`. True today.
+**Mitigation added:** Log message when fn not provided, making the skip visible in startup logs.
+---
+## Failure Mode Review
+| Failure Mode | Handled? | Action Required |
+|--------------|----------|-----------------|
+| FM1: getWorkflowByIdFn throws/rejects | NOT YET | Add try/catch around each fn call; warn+skip on error |
+| FM2: transient workflow unavailability | Acceptable | warn+skip behavior is correct here |
+| FM3: Map mutation during iteration | NOT YET | Collect unknowns in first pass, delete in second pass |
+| FM4: ctx.workflowService undefined in production | Needs guard | Use optional chaining `?.` in default fn |
+**Highest-risk:** FM1. An unhandled rejection would crash `startTriggerListener`. Must fix.
+---
+## Runner-Up / Simpler Alternative Review
+- Runner-up (Candidate B, ctx direct): no elements to borrow. Testability loss outweighs API surface saving.
+- No simpler variant satisfies both testability and correctness requirements.
+- Candidate A is already minimum viable for the acceptance criteria.
+---
+## Philosophy Alignment
+**Satisfied:** Dependency injection, validate at boundaries, errors are data, YAGNI, surface information.
+**Under tension (acceptable):**
+- "Make illegal states unrepresentable" -- TriggerDefinition can still hold invalid workflowIds. Compile-time enforcement would require two-phase types; over-engineering for a Small task.
+- "Immutability by default" -- triggerIndex Map is mutated, but mutation is local (created and modified within `startTriggerListener`, not shared until passed to TriggerRouter).
+---
+## Findings
+**ORANGE -- FM1: Unhandled rejection from getWorkflowByIdFn**
+The current design has no try/catch around the fn call. An I/O error in the workflow storage lookup would propagate as an unhandled rejection and crash `startTriggerListener`. Fix: wrap each `await getWorkflowByIdFn(trigger.workflowId)` in try/catch; on error, log warning and skip that trigger (same policy as other validation failures).
+**YELLOW -- FM3: Two-pass Map deletion**
+Must not delete from `triggerIndex` while iterating it. Fix: collect unknown IDs in an array during the loop, then delete in a second pass.
+**YELLOW -- FM4: ctx.workflowService guard**
+The production resolver expression `ctx.workflowService.getWorkflowById(id).then(...)` will throw if `workflowService` is undefined. Fix: use optional chaining inside an async wrapper, `async (id) => (await ctx.workflowService?.getWorkflowById(id)) !== null`. The fn then always returns a `Promise<boolean>`. An undefined service yields `undefined`, which is `!== null`, so the trigger is treated as "found" (skip validation rather than crash).
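+The FM4 guard can be sketched as a small async factory. It assumes a hypothetical `WorkflowLookup` interface in place of the real service type. Making the function `async` keeps the return type `Promise<boolean>` whether or not the service exists.
```typescript
// Hypothetical stand-in for the workflow service; null means "unknown ID".
interface WorkflowLookup {
  getWorkflowById(id: string): Promise<object | null>;
}

// `service` may be undefined (e.g. a partially built context).
// Optional chaining short-circuits to undefined, and `undefined !== null` is true,
// so a missing service means "treat as found" instead of throwing.
function guardedResolver(service: WorkflowLookup | undefined): (id: string) => Promise<boolean> {
  return async (id) => (await service?.getWorkflowById(id)) !== null;
}
```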
+---
+## Recommended Revisions
+1. **Required:** Add try/catch in the validation loop; treat fn errors as warn+skip.
+2. **Required:** Collect unknowns first, delete after iteration.
+3. **Required:** Use optional chaining for the default production fn.
+4. **Nice-to-have:** Log `[TriggerListener] workflowId validation skipped (no resolver provided)` when fn is absent, for observability.
+---
+## Residual Concerns
+- The "silent skip when fn not provided" is acceptable but relies on a naming convention (option field) to communicate intent. Future callers won't get a compile-time reminder to provide the fn. This is a documentation concern, not a correctness concern.
+- `onComplete.workflowId` is not validated by this design. Out of scope for this task; should be a follow-up if `onComplete` usage grows.
+- No RED findings. All issues are fixable at implementation time with minor code additions.
+---
+## Pass 2 Findings (incremental)
+No new RED or ORANGE findings. Design revisions from pass 1 are sufficient.
+**New observation (YELLOW):** `onComplete.workflowId` (secondary workflow for completion hooks) is not validated. Accepted as out of scope -- add a comment in the implementation noting this limitation.
+**Performance:** Sequential validation of N triggers is acceptable at expected trigger counts (1-10). No action needed.

package/docs/design/workflow-id-validation-implementation-plan.md ADDED

@@ -0,0 +1,185 @@
+# Implementation Plan: Workflow ID Validation at Daemon Startup
+**Status:** Ready to implement
+**Branch:** `fix/workflow-id-validation-at-startup`
+**Date:** 2026-04-16
+---
+## 1. Problem Statement
+When a user writes an incorrect `workflowId` in `triggers.yml` (e.g., `coding-task-workflow-agentic.lean.v2` instead of `coding-task-workflow-agentic`), the daemon starts successfully, accepts webhooks, but every dispatch silently fails with `workflow_not_found`. The error only appears in logs during actual webhook events -- not at startup. This is a silent-failure bug.
+---
+## 2. Acceptance Criteria
+- [x] At daemon startup, after loading and indexing triggers, validate that each trigger's `workflowId` resolves to a known workflow
+- [x] Triggers with unknown `workflowId` are logged with a clear warning (naming the triggerId and the bad workflowId) and removed from the active index
+- [x] Triggers with valid `workflowId` start normally
+- [x] If `getWorkflowByIdFn` throws or rejects, that trigger is also warned+skipped (not a daemon crash)
+- [x] Existing behavior when `getWorkflowByIdFn` is not provided: validation is skipped (backward compat, logged)
+- [x] Existing tests continue to pass without modification
+---
+## 3. Non-Goals
+- No hard-fail policy (daemon does not refuse to start; it starts with fewer triggers)
+- No validation of `onComplete.workflowId` (secondary workflow ID -- follow-up ticket)
+- No changes to `trigger-store.ts` or `TriggerDefinition` type
+- No re-validation on webhook arrival
+- No dynamic reload / hot-reload of trigger config
+- No change to `trigger-router.ts` (it already handles `workflow_not_found` at dispatch)
+---
+## 4. Philosophy-Driven Constraints
+- **Dependency injection**: `getWorkflowByIdFn` must be injectable -- no direct `ctx.workflowService` access
+- **Validate at boundaries**: validation runs at `startTriggerListener` (startup boundary), not inside routing
+- **Errors are data**: validation failures are warnings + skip, not thrown exceptions
+- **Document why**: implementation must include WHY comments on the key decisions
+- **Warn+skip over hard-fail**: consistent with `loadTriggerConfig` existing behavior
+---
+## 5. Invariants
+1. The `triggerIndex` passed to `TriggerRouter` contains ONLY triggers whose `workflowId` was confirmed to exist (when `getWorkflowByIdFn` is provided)
+2. The validation loop MUST NOT mutate `triggerIndex` during iteration (collect unknowns first, delete after)
+3. A `getWorkflowByIdFn` rejection/throw MUST NOT propagate -- it is caught, the trigger is warned+skipped
+4. When `getWorkflowByIdFn` is absent, validation is skipped entirely (backward compat) and a log message says so
+5. `DefaultWorkflowService.getWorkflowById` delegates directly to storage (no compilation cache interference) -- validation results are authoritative
+---
+## 6. Selected Approach
+**Add `getWorkflowByIdFn?: (id: string) => Promise<boolean>` to `StartTriggerListenerOptions`.**
+In `startTriggerListener`, after `buildTriggerIndex()` returns ok, if `getWorkflowByIdFn` is provided:
+1. Iterate `triggerIndex` entries (read-only pass), collect unknown workflowIds
+2. For each, try `await getWorkflowByIdFn(trigger.workflowId)` -- catch rejection, treat as false
+3. Collect trigger IDs where result is false or threw
+4. After iteration: delete collected IDs from `triggerIndex`, log warnings
+5. Log summary if any were skipped
+Production default (not on the option -- called inline): `async (id) => (await ctx.workflowService?.getWorkflowById(id)) !== null`.
+**Runner-up:** Candidate B (ctx direct with null guard) -- lost because `FAKE_CTX = {} as V2ToolContext` makes the behavior untestable.
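A minimal, self-contained TypeScript sketch of steps 1-5, extracted into a standalone helper so it can run outside `startTriggerListener`. The `Trigger` shape, the `validateTriggerWorkflowIds` name, and the injectable `warn` logger are hypothetical; the plan itself keeps this logic inline and logs via `console.warn`.

```typescript
// Hypothetical minimal trigger shape; the real TriggerDefinition has more fields.
interface Trigger {
  readonly workflowId: string;
}

type WorkflowResolver = (id: string) => Promise<boolean>;

// Removes triggers whose workflowId cannot be resolved and returns the removed trigger IDs.
async function validateTriggerWorkflowIds(
  triggerIndex: Map<string, Trigger>,
  getWorkflowByIdFn: WorkflowResolver,
  warn: (msg: string) => void = console.warn,
): Promise<string[]> {
  // Pass 1 (read-only): collect unknown trigger IDs without touching the map.
  const unknownTriggerIds: string[] = [];
  for (const [triggerId, trigger] of triggerIndex) {
    let found: boolean;
    try {
      found = await getWorkflowByIdFn(trigger.workflowId);
    } catch (e) {
      // A resolver failure counts as "not found" so it never aborts startup.
      found = false;
      warn(`Error validating workflowId '${trigger.workflowId}' for trigger '${triggerId}': ${String(e)}`);
    }
    if (!found) {
      unknownTriggerIds.push(triggerId);
      warn(`Skipping trigger '${triggerId}': workflowId '${trigger.workflowId}' not found`);
    }
  }
  // Pass 2: delete after iteration has finished.
  for (const id of unknownTriggerIds) triggerIndex.delete(id);
  if (unknownTriggerIds.length > 0) {
    warn(`Skipped ${unknownTriggerIds.length} trigger(s) with unknown workflowId(s)`);
  }
  return unknownTriggerIds;
}

// Usage: one valid and one unknown workflowId.
const index = new Map<string, Trigger>([
  ["pr-review", { workflowId: "coding-task-workflow-agentic" }],
  ["typo", { workflowId: "no-such-workflow" }],
]);
void validateTriggerWorkflowIds(index, async (id) => id === "coding-task-workflow-agentic")
  .then((skipped) => console.log(skipped, [...index.keys()]));
```

With this stub, `typo` is reported as skipped and only `pr-review` remains in the index. Returning the skipped IDs is a convenience for assertions; the inline version in the plan only needs the side effect on `triggerIndex`.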
+---
+## 7. Vertical Slices
+### Slice 1 -- Core validation logic in `startTriggerListener`
+**Files:** `src/trigger/trigger-listener.ts`
+**Changes:**
+- Add `getWorkflowByIdFn?: (id: string) => Promise<boolean>` to `StartTriggerListenerOptions`
+- After `buildTriggerIndex()` returns ok (before `new TriggerRouter`): add validation block
+- Validation block logic:
+  ```ts
+  if (getWorkflowByIdFn) {
+    // WHY collect-then-delete: keeps the loop a read-only pass over triggerIndex (invariant 2)
+    const unknownTriggerIds: string[] = []
+    for (const [triggerId, trigger] of triggerIndex) {
+      let found: boolean
+      try {
+        found = await getWorkflowByIdFn(trigger.workflowId)
+      } catch (e) {
+        // WHY: a resolver error must not abort startup; treat it as not found (invariant 3)
+        found = false
+        console.warn(`[TriggerListener] Error validating workflowId '${trigger.workflowId}' for trigger '${triggerId}': ${String(e)}`)
+      }
+      if (!found) {
+        unknownTriggerIds.push(triggerId)
+        console.warn(`[TriggerListener] Skipping trigger '${triggerId}': workflowId '${trigger.workflowId}' not found`)
+      }
+    }
+    // WHY warn+skip instead of throwing: matches loadTriggerConfig; the daemon starts with fewer triggers
+    for (const id of unknownTriggerIds) { triggerIndex.delete(id) }
+    if (unknownTriggerIds.length > 0) {
+      console.warn(`[TriggerListener] Skipped ${unknownTriggerIds.length} trigger(s) with unknown workflowId(s)`)
+    }
+  } else {
+    // WHY: no resolver provided, so keep every trigger for backward compatibility (invariant 4)
+    console.log(`[TriggerListener] workflowId validation skipped (no resolver provided)`)
+  }
+  ```
+- Add production default invocation: pass `async (id) => (await ctx.workflowService?.getWorkflowById(id)) !== null` as the default when option not provided
+**Acceptance criterion:** Unknown workflowId triggers are not present in the index passed to `TriggerRouter`.
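The production default `async (id) => (await ctx.workflowService?.getWorkflowById(id)) !== null` adapts a nullable lookup into the boolean resolver. The sketch below isolates that adapter; `Workflow`, `WorkflowServiceLike`, `ContextLike`, and `makeDefaultResolver` are hypothetical stand-ins, and it assumes `getWorkflowById` resolves to `null` for an unknown ID. Note the strict `!== null`: when the service is absent, optional chaining yields `undefined`, which compares as "found", so every trigger is kept rather than dropped.

```typescript
// Hypothetical stand-ins for the real workflow and context types.
interface Workflow {
  readonly id: string;
}
interface WorkflowServiceLike {
  getWorkflowById(id: string): Promise<Workflow | null>;
}
interface ContextLike {
  readonly workflowService?: WorkflowServiceLike;
}

// Adapts a nullable lookup into the boolean resolver expected by getWorkflowByIdFn.
function makeDefaultResolver(ctx: ContextLike): (id: string) => Promise<boolean> {
  // Strict `!== null`: a missing service yields undefined, which counts as found,
  // so a context without a workflowService fails open instead of skipping every trigger.
  return async (id) => (await ctx.workflowService?.getWorkflowById(id)) !== null;
}

// Usage with an in-memory service holding a single workflow.
const known = new Map<string, Workflow>([
  ["coding-task-workflow-agentic", { id: "coding-task-workflow-agentic" }],
]);
const resolver = makeDefaultResolver({
  workflowService: { getWorkflowById: async (id) => known.get(id) ?? null },
});
```

Using `!= null` instead would make a missing service reject every trigger; the strict comparison keeps the daemon's existing behavior in that case, at the cost of silently not validating.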
+### Slice 2 -- Tests in `trigger-router.test.ts`
+**Files:** `tests/unit/trigger-router.test.ts`
+**New test cases (in a new describe block `startTriggerListener workflowId validation`):**
+1. Triggers with unknown workflowId are warned and skipped (index excludes them, server starts)
+2. Triggers with valid workflowId are kept in the index
+3. When `getWorkflowByIdFn` is not provided, validation is skipped and all triggers are kept
+4. When `getWorkflowByIdFn` rejects, that trigger is warned and skipped (daemon doesn't crash)
+5. Mix: some valid, some invalid -- only valid triggers remain
+**Acceptance criterion:** All 5 test cases pass.
+---
+## 8. Test Design
+**Pattern to follow:** `startTriggerListener` tests in `trigger-router.test.ts` (~line 432). Same structure:
+- Use `tmpPath()` for workspacePath
+- `env: { WORKRAIL_TRIGGERS_ENABLED: 'true' }`
+- `port: 0` for OS-assigned port
+- `runWorkflowFn: vi.fn()`
+- `workspaces: {}` to skip workspace config loading
+**Fixtures needed:**
+- A minimal `triggers.yml` with two triggers: one with valid workflowId, one with invalid
+- `getWorkflowByIdFn` stub: `vi.fn().mockImplementation(async (id: string) => id === 'coding-task-workflow-agentic')`
+**Note:** Tests write real `triggers.yml` files to `tmpPath()` directories (pattern established in existing tests). Check how existing `startTriggerListener` tests set up workspace directories.
+---
+## 9. Risk Register
+| Risk | Likelihood | Impact | Mitigation |
+|------|-----------|--------|-----------|
+| `ctx.workflowService` undefined in production | Low | Medium | Optional chaining `?.` in default fn; `undefined !== null` is true, so all triggers are kept (fail-open) instead of crashing |
+| FM1: getWorkflowByIdFn throws | Low-Medium | High | try/catch per call, warn+skip |
+| FM3: Map mutation during iteration | Low (easy to avoid) | Medium | Two-pass (collect then delete) |
+| False positive (valid workflow not found due to I/O error) | Low | Low | Same as FM1 -- warn+skip, operator can restart |
+| `onComplete.workflowId` still silent-fails | Medium | Low | Accepted, documented, follow-up ticket |
+---
+## 10. PR Packaging
+**Single PR.** Small task, all changes in 2 files. Branch: `fix/workflow-id-validation-at-startup`.
+PR title: `fix(trigger): warn and skip triggers with unknown workflowId at startup`
+---
+## 11. Philosophy Alignment Per Slice
+| Principle | Slice 1 | Slice 2 |
+|-----------|---------|---------|
+| Dependency injection for boundaries | Satisfied -- fn injectable | Satisfied -- tests inject stub |
+| Validate at boundaries | Satisfied -- startup boundary | N/A |
+| Errors are data | Satisfied -- warn+skip, no throw | Satisfied -- tests verify no crash |
+| Document why | Satisfied -- WHY comments required | N/A |
+| Warn+skip over hard-fail | Satisfied | Verified by tests |
+| Immutability by default | Tension -- triggerIndex mutated, but local scope | N/A |
+---
+## 12. Follow-up Tickets
+- `onComplete.workflowId` validation (secondary workflow IDs in completion hooks)
+---
+**unresolvedUnknownCount:** 0
+**planConfidenceBand:** High
+**estimatedPRCount:** 1