@exaudeus/workrail 3.31.1 → 3.33.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (82)
  1. package/dist/cli/commands/index.d.ts +1 -0
  2. package/dist/cli/commands/index.js +3 -1
  3. package/dist/cli/commands/worktrain-await.js +11 -9
  4. package/dist/cli/commands/worktrain-daemon-install.d.ts +35 -0
  5. package/dist/cli/commands/worktrain-daemon-install.js +291 -0
  6. package/dist/cli/commands/worktrain-daemon.d.ts +31 -0
  7. package/dist/cli/commands/worktrain-daemon.js +272 -0
  8. package/dist/cli/commands/worktrain-spawn.js +11 -9
  9. package/dist/cli-worktrain.js +329 -0
  10. package/dist/cli.js +4 -22
  11. package/dist/console/standalone-console.d.ts +28 -0
  12. package/dist/console/standalone-console.js +142 -0
  13. package/dist/{console/assets/index-6H9DeFxj.js → console-ui/assets/index-BuJFLLfY.js} +1 -1
  14. package/dist/{console → console-ui}/index.html +1 -1
  15. package/dist/daemon/agent-loop.d.ts +26 -0
  16. package/dist/daemon/agent-loop.js +53 -2
  17. package/dist/daemon/daemon-events.d.ts +103 -0
  18. package/dist/daemon/daemon-events.js +56 -0
  19. package/dist/daemon/workflow-runner.d.ts +6 -3
  20. package/dist/daemon/workflow-runner.js +229 -33
  21. package/dist/infrastructure/session/HttpServer.js +133 -34
  22. package/dist/manifest.json +134 -70
  23. package/dist/mcp/output-schemas.d.ts +30 -30
  24. package/dist/mcp/transports/bridge-events.d.ts +4 -0
  25. package/dist/mcp/transports/fatal-exit.js +4 -0
  26. package/dist/mcp/transports/http-entry.js +2 -0
  27. package/dist/mcp/transports/stdio-entry.js +26 -6
  28. package/dist/mcp/v2/tools.d.ts +4 -4
  29. package/dist/trigger/adapters/github-poller.d.ts +44 -0
  30. package/dist/trigger/adapters/github-poller.js +190 -0
  31. package/dist/trigger/adapters/gitlab-poller.d.ts +27 -0
  32. package/dist/trigger/adapters/gitlab-poller.js +81 -0
  33. package/dist/trigger/delivery-client.d.ts +2 -1
  34. package/dist/trigger/delivery-client.js +4 -1
  35. package/dist/trigger/index.d.ts +4 -1
  36. package/dist/trigger/index.js +5 -1
  37. package/dist/trigger/polled-event-store.d.ts +22 -0
  38. package/dist/trigger/polled-event-store.js +173 -0
  39. package/dist/trigger/polling-scheduler.d.ts +20 -0
  40. package/dist/trigger/polling-scheduler.js +249 -0
  41. package/dist/trigger/trigger-listener.d.ts +5 -0
  42. package/dist/trigger/trigger-listener.js +53 -4
  43. package/dist/trigger/trigger-router.d.ts +4 -2
  44. package/dist/trigger/trigger-router.js +7 -4
  45. package/dist/trigger/trigger-store.js +114 -33
  46. package/dist/trigger/types.d.ts +17 -1
  47. package/dist/v2/durable-core/schemas/export-bundle/index.d.ts +224 -224
  48. package/dist/v2/durable-core/schemas/session/events.d.ts +42 -42
  49. package/dist/v2/durable-core/schemas/session/manifest.d.ts +6 -6
  50. package/dist/v2/durable-core/schemas/session/validation-event.d.ts +2 -2
  51. package/dist/v2/durable-core/tokens/payloads.d.ts +52 -52
  52. package/dist/v2/usecases/console-routes.js +3 -3
  53. package/dist/v2/usecases/console-service.js +133 -9
  54. package/dist/v2/usecases/console-types.d.ts +7 -0
  55. package/docs/design/daemon-conversation-logging-plan.md +98 -0
  56. package/docs/design/daemon-conversation-logging-review.md +55 -0
  57. package/docs/design/daemon-conversation-logging.md +129 -0
  58. package/docs/design/github-polling-adapter-design-candidates.md +226 -0
  59. package/docs/design/github-polling-adapter-design-review-findings.md +131 -0
  60. package/docs/design/github-polling-adapter-implementation-plan.md +284 -0
  61. package/docs/design/implementation_plan.md +192 -0
  62. package/docs/design/workflow-id-validation-at-startup.md +146 -0
  63. package/docs/design/workflow-id-validation-design-review.md +87 -0
  64. package/docs/design/workflow-id-validation-implementation-plan.md +185 -0
  65. package/docs/design/worktrain-system-prompt-report-issue-candidates.md +135 -0
  66. package/docs/design/worktrain-system-prompt-report-issue-design-review.md +73 -0
  67. package/docs/ideas/backlog.md +465 -0
  68. package/package.json +1 -1
  69. package/workflows/architecture-scalability-audit.json +1 -1
  70. package/workflows/bug-investigation.agentic.v2.json +3 -3
  71. package/workflows/coding-task-workflow-agentic.json +32 -32
  72. package/workflows/coding-task-workflow-agentic.lean.v2.json +1 -1
  73. package/workflows/coding-task-workflow-agentic.v2.json +7 -7
  74. package/workflows/mr-review-workflow.agentic.v2.json +21 -12
  75. package/workflows/personal-learning-materials-creation-branched.json +2 -2
  76. package/workflows/production-readiness-audit.json +1 -1
  77. package/workflows/relocation-workflow-us.json +2 -2
  78. package/workflows/ui-ux-design-workflow.json +14 -14
  79. package/workflows/workflow-for-workflows.json +3 -3
  80. package/workflows/workflow-for-workflows.v2.json +2 -2
  81. package/workflows/wr.discovery.json +1 -1
  82. /package/dist/{console → console-ui}/assets/index-8dh0Psu-.css +0 -0
@@ -3925,3 +3925,468 @@ More critically: if a session is restarted by the daemon but then stalls (Bedroc
3925
3925
  3. **Orphaned session cleanup should be user-facing.** `worktrain cleanup` or `worktrain status` should surface orphaned sessions with their age and offer to clear them. Right now they silently accumulate.
3926
3926
 
3927
3927
  4. **Better logging when runWorkflow() swallows errors.** The `void runWorkflow(...)` pattern in `console-routes.ts` and `trigger-router.ts` drops errors silently. Every path that ends in silence (no log, no session advance, no error) should at minimum log `[WorkflowRunner] Session died silently` with the session ID.
3928
+
3929
+ ---
3930
+
3931
+ ### Observability and logging as first-class citizens (Apr 17, 2026)
3932
+
3933
+ **The principle:** WorkTrain should never be a black box. Every action, decision, failure, and state transition should be traceable after the fact -- by a human, by another agent, or by a coordinator script. Logging and observability are not afterthoughts; they are core infrastructure.
3934
+
3935
+ **What "first-class" means:**
3936
+
3937
+ 1. **Structured, not prose.** Every log line should be machine-parseable. Use consistent prefixes (`[WorkflowRunner]`, `[TriggerRouter]`, `[DaemonConsole]`), consistent key=value pairs, and structured JSON for rich payloads. No freeform strings that require regex to parse. A sketch follows this list.
3938
+
3939
+ 2. **Levels matter.** INFO for normal operations, WARN for recoverable anomalies, ERROR for failures that need attention. Silence should mean "actively working", never "state unknown" -- a session that produces no logs for 5+ minutes should emit a heartbeat.
3940
+
3941
+ 3. **Every state transition logged.** Session start, step advance, tool call, tool result (including errors), session end (success/timeout/error). No silent gaps. The daemon observability logs (#442) are a start -- extend this everywhere.
3942
+
3943
+ 4. **Errors always include context.** Not just the message -- which session, which tool, which step, which trigger, how long it had been running, what the last successful action was. Enough to diagnose without re-running.
3944
+
3945
+ 5. **Correlation IDs.** Every session has a `sessionId`. Every tool call has a `toolCallId`. Log entries should include the relevant ID so you can filter across a full session's history. Today the daemon logs include `sessionId` -- extend this to trigger IDs, workflow IDs, and step IDs.
3946
+
3947
+ 6. **Log destinations are configurable.** Today: stdout → daemon.log file via redirect. Long-term: structured JSON to a log aggregator (Datadog, CloudWatch, file), separate log files per workspace, log rotation. The daemon should accept a `--log-level` flag and a `--log-format json|human` flag.
3948
+
3949
+ 7. **The session store IS the audit log.** Every `advance_recorded`, `node_output_appended`, `validation_performed` event is a durable structured record. The session store should be queryable as a post-mortem tool. `worktrain session logs <id>` should reconstruct the full story of what happened.
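+
+ A rough sketch of what points 1, 2, and 5 could look like in practice (illustrative only -- the helper name and field names are placeholders, not an existing API):
+
+ ```typescript
+ // Sketch: a structured emitter that enforces component prefix, level, and correlation IDs.
+ type LogLevel = 'info' | 'warn' | 'error';
+
+ interface LogFields {
+   sessionId?: string;
+   toolCallId?: string;
+   triggerId?: string;
+   stepId?: string;
+   [key: string]: string | number | boolean | undefined;
+ }
+
+ function emitLog(component: string, level: LogLevel, message: string, fields: LogFields = {}): void {
+   // One JSON object per line: machine-parseable, no regex required.
+   console.log(JSON.stringify({ ts: new Date().toISOString(), level, component, message, ...fields }));
+ }
+
+ // e.g. the makeBashTool gap listed below: include exit code and output length, keyed to the session.
+ emitLog('WorkflowRunner', 'info', 'bash tool completed', { sessionId: 'sess_abc', toolCallId: 'call_1', exitCode: 0, outputLength: 1432 });
+ ```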
3950
+
3951
+ **Specific gaps to close:**
3952
+
3953
+ - `continue_workflow` tool: log the step ID and notes length being submitted, not just "continue_workflow called"
3954
+ - `makeBashTool`: log exit code and output length in addition to the command
3955
+ - `makeReadTool` / `makeWriteTool`: log file path and bytes
3956
+ - `AgentLoop`: log each LLM turn (turn number, stop reason, tool count) -- today nothing is logged between tool calls
3957
+ - `TriggerRouter`: log when a session is queued (semaphore at capacity) and when it dequeues
3958
+ - `PollingScheduler`: log each poll cycle result (N events found, N new, N dispatched)
3959
+ - `DeliveryClient`: log delivery attempt, HTTP status, response time
3960
+ - `DaemonConsole`: log when the console HTTP server starts, stops, or fails a request
3961
+
3962
+ **The `worktrain logs` command:**
3963
+ ```bash
3964
+ worktrain logs # tail daemon.log
3965
+ worktrain logs --session sess_abc123 # replay full session from event store
3966
+ worktrain logs --trigger test-task # all sessions for this trigger
3967
+ worktrain logs --level error # only errors across all sources
3968
+ worktrain logs --since 1h # last hour
3969
+ worktrain logs --format json # machine-readable output
3970
+ ```
3971
+
3972
+ **Self-healing dependency:** The automatic gap detection, WORKTRAIN_STUCK routing, and coordinator self-healing patterns all depend on logs being structured and complete. You can't auto-fix what you can't observe. Logging quality is a prerequisite for autonomous operation at scale.
3973
+
3974
+ ---
3975
+
3976
+ ### Event sourcing for orchestration: extend the session store to daemon and coordinator events (Apr 17, 2026)
3977
+
3978
+ **The decision:** extend the existing WorkRail event store infrastructure to cover orchestration-level events, not build a separate system. The session store is already append-only, crash-safe, content-addressed, and queryable -- rebuilding those properties would be wasteful.
3979
+
3980
+ **The model: multiple event streams, same infrastructure**
3981
+
3982
+ ```
3983
+ ~/.workrail/events/
3984
+ sessions/ ← already exists (per-session workflow events)
3985
+ daemon/ ← new: lifecycle, triggers, delivery, errors
3986
+ triggers/ ← new: per-trigger poll history and outcomes
3987
+ coordinator/ ← future: coordinator script decisions and routing
3988
+ ```
3989
+
3990
+ Each stream is append-only JSONL with the same segment/manifest pattern as the session store. The `worktrain logs` command queries across streams. Watchdog and coordinator scripts subscribe to streams.
3991
+
3992
+ **Daemon event stream: what gets recorded**
3993
+
3994
+ Every significant daemon action becomes a structured event:
3995
+
3996
+ ```jsonl
3997
+ {"ts":"2026-04-17T...","kind":"daemon_started","port":3200,"workspacePath":"...","version":"3.31.0"}
3998
+ {"ts":"...","kind":"trigger_fired","triggerId":"test-task","workflowId":"coding-task-workflow-agentic"}
3999
+ {"ts":"...","kind":"session_queued","sessionId":"sess_abc","triggerId":"test-task","queueDepth":0}
4000
+ {"ts":"...","kind":"session_started","sessionId":"sess_abc","workflowId":"coding-task-workflow-agentic","modelId":"..."}
4001
+ {"ts":"...","kind":"tool_called","sessionId":"sess_abc","tool":"Bash","command":"ls docs/ | grep trigger"}
4002
+ {"ts":"...","kind":"tool_error","sessionId":"sess_abc","tool":"Bash","error":"exit 1","isError":true}
4003
+ {"ts":"...","kind":"step_advanced","sessionId":"sess_abc","stepId":"phase-0-triage-and-mode","advance":1}
4004
+ {"ts":"...","kind":"session_completed","sessionId":"sess_abc","stopReason":"stop","durationMs":1847000}
4005
+ {"ts":"...","kind":"delivery_attempted","sessionId":"sess_abc","callbackUrl":"https://...","status":200}
4006
+ {"ts":"...","kind":"poll_cycle","triggerId":"pr-review","eventsFound":3,"newEvents":1,"dispatched":1}
4007
+ ```
4008
+
4009
+ **`DaemonEventEmitter`:** thin wrapper around the event store, called from TriggerRouter, workflow-runner, delivery-client, and polling-scheduler. Each call appends one event to `~/.workrail/events/daemon/YYYY-MM-DD.jsonl`. Zero overhead when nothing is listening.
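+
+ A sketch of the append path (not the shipped implementation -- the class and method names here are illustrative):
+
+ ```typescript
+ // Sketch: append one JSON line per event to the daily daemon stream.
+ import { appendFileSync, mkdirSync } from 'node:fs';
+ import { join } from 'node:path';
+ import { homedir } from 'node:os';
+
+ interface DaemonEvent { kind: string; [key: string]: unknown; }
+
+ class DaemonEventEmitterSketch {
+   private readonly dir = join(homedir(), '.workrail', 'events', 'daemon');
+
+   emit(event: DaemonEvent): void {
+     const day = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
+     mkdirSync(this.dir, { recursive: true });
+     appendFileSync(join(this.dir, `${day}.jsonl`), JSON.stringify({ ts: new Date().toISOString(), ...event }) + '\n');
+   }
+ }
+
+ // e.g. from TriggerRouter when a trigger fires:
+ new DaemonEventEmitterSketch().emit({ kind: 'trigger_fired', triggerId: 'test-task', workflowId: 'coding-task-workflow-agentic' });
+ ```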
4010
+
4011
+ **`worktrain logs` CLI:** reads from both session store and daemon event stream, correlates by `sessionId`, presents a unified timeline:
4012
+
4013
+ ```
4014
+ worktrain logs # tail current daemon events
4015
+ worktrain logs --session sess_abc123 # full timeline: trigger → steps → delivery
4016
+ worktrain logs --trigger test-task # all sessions for this trigger
4017
+ worktrain logs --level error # only errors across all streams
4018
+ worktrain logs --since 1h # last hour of activity
4019
+ worktrain logs --format json # machine-readable for scripts
4020
+ ```
4021
+
4022
+ **SSE extension:** the console already streams session events via SSE. Extend to also stream daemon events so the console live feed shows everything: trigger fires, tool calls, delivery attempts, errors -- not just step advances. This is the "more than just the DAG" console improvement.
4023
+
4024
+ **Why this matters for self-healing:** The coordinator self-healing pattern requires the coordinator to observe what happened. Today it reads `lastStepNotes` and session store snapshots -- both batch reads after the fact. With a subscribable daemon event stream, the coordinator can react in real time: "tool_error event for session X → spawn diagnostic sub-session now" rather than "check for WORKTRAIN_STUCK markers after the fact."
4025
+
4026
+ **Build order:**
4027
+ 1. `DaemonEventEmitter` + daemon event stream file (append-only JSONL, no fancy infra needed to start)
4028
+ 2. Wire emitter calls into TriggerRouter, workflow-runner, delivery-client
4029
+ 3. `worktrain logs` CLI commands (reads files, correlates by sessionId)
4030
+ 4. SSE extension in DaemonConsole for live event streaming
4031
+ 5. Coordinator script subscription to event streams (replaces polling session store)
4032
+
4033
+ ---
4034
+
4035
+ ### Subagent context packaging: the main agent assumes too much (Apr 17, 2026)
4036
+
4037
+ **The problem:** When a main agent spawns a subagent, the work package it creates is usually too thin. The main agent has rich context from the full conversation -- why this task matters, what was already tried, what constraints were discovered -- but it packages the subagent task as if that context is shared. The subagent gets a one-liner and has to rediscover everything from scratch.
4038
+
4039
+ This is the same problem as a developer handing a junior a vague JIRA ticket instead of a proper brief. The subagent wastes tokens re-deriving what the main agent already knows, or worse, makes wrong assumptions.
4040
+
4041
+ **Where this manifests:**
4042
+ - Coding task subagents that don't know why a specific approach was chosen
4043
+ - MR review subagents that don't know what invariants matter for this codebase
4044
+ - Discovery subagents that re-read files the main agent just read
4045
+ - Fix subagents that don't know what was already tried and failed
4046
+
4047
+ **Three solution directions:**
4048
+
4049
+ **Option A: Better instructions to the main agent (prompt engineering)**
4050
+ Add explicit guidance to the WorkTrain system prompt: "When spawning a subagent, include: (1) what you already know that the subagent won't, (2) what was already tried, (3) why this specific approach was chosen, (4) what constraints or invariants matter, (5) what 'done' looks like." This is the cheapest fix but depends on the main agent reliably following it.
4051
+
4052
+ **Option B: Platform-assisted package creation (structured)**
4053
+ The `worktrain spawn` command (or the `spawn_session` tool) takes a structured work package:
4054
+ ```typescript
4055
+ spawnSession({
4056
+ workflowId: 'coding-task-workflow-agentic',
4057
+ goal: '...',
4058
+ context: {
4059
+ whyThisApproach: '...', // what the main agent knows about the decision
4060
+ alreadyTried: [...], // what failed
4061
+ knownConstraints: [...], // invariants the subagent must respect
4062
+ relevantFiles: [...], // files the main agent already read
4063
+ completionCriteria: '...' // what done actually looks like
4064
+ }
4065
+ })
4066
+ ```
4067
+ The platform validates that the package is complete before spawning -- missing fields emit a warning or block the spawn. The subagent's system prompt is enriched with this context automatically, without the main agent having to think about how to format it.
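+
+ A minimal sketch of the completeness check (field names mirror the example above; the warn-vs-block policy is an open question):
+
+ ```typescript
+ // Sketch: flag a work package whose context is too thin to brief a subagent.
+ interface WorkPackageContext {
+   whyThisApproach?: string;
+   alreadyTried?: string[];
+   knownConstraints?: string[];
+   relevantFiles?: string[];
+   completionCriteria?: string;
+ }
+
+ function missingContextFields(context: WorkPackageContext): string[] {
+   const required: (keyof WorkPackageContext)[] = ['whyThisApproach', 'alreadyTried', 'knownConstraints', 'completionCriteria'];
+   return required.filter((field) => {
+     const value = context[field];
+     return value === undefined || (Array.isArray(value) ? value.length === 0 : value.trim() === '');
+   });
+ }
+
+ const gaps = missingContextFields({ whyThisApproach: 'execFile avoids shell injection', alreadyTried: [] });
+ if (gaps.length > 0) console.warn(`spawn package incomplete: missing ${gaps.join(', ')}`);
+ ```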
4068
+
4069
+ **Option C: Platform-mediated context transfer (autonomous)**
4070
+ The platform automatically packages context from the spawning session into the child session. When the main agent calls `spawn_session`, the platform reads the current session's step notes and recent advances, synthesizes a context bundle, and injects it into the child's system prompt. No explicit packaging required from the main agent.
4071
+
4072
+ This is the most powerful but also the most complex -- requires the platform to understand what's relevant, not just what's recent.
4073
+
4074
+ **Recommended approach: B + A**
4075
+ Option B (structured work package with validation) as the primary mechanism. Option A (better main agent instructions) as a fallback. Option C as a long-term goal once the knowledge graph and session event stream are queryable enough to synthesize context automatically.
4076
+
4077
+ **The `context` field in the structured package is the key addition.** Today `worktrain spawn` takes `goal`, `workflowId`, `workspacePath`. Adding a structured `context` object that the platform validates and injects gives subagents the brief they need without depending on the main agent to remember to include it.
4078
+
4079
+ **Connection to knowledge graph:** Once the structural knowledge graph is built, `relevantFiles` can be auto-populated from a graph query rather than requiring the main agent to list them. The platform asks "what files are relevant to this goal?" and includes them automatically. This is how the context packaging problem gets solved at scale -- the platform knows what the subagent needs without the main agent having to enumerate it.
4080
+
4081
+ **Session knowledge log (extends Option B):**
4082
+ As the main agent progresses, it continuously appends to a structured `session-knowledge.jsonl` for the session. Not step notes (those are workflow artifacts) -- this is a running record of things that would matter to any agent picking up this work:
4083
+
4084
+ ```jsonl
4085
+ {"kind":"decision","summary":"Using execFile not exec for all subprocess calls","reason":"Shell injection risk with user-controlled content","ts":1234567890}
4086
+ {"kind":"user_pushback","summary":"User rejected the polling approach","detail":"Wants webhook-based solution instead","ts":...}
4087
+ {"kind":"relevant_file","path":"src/trigger/trigger-router.ts","why":"Core routing logic, all trigger changes flow through here","ts":...}
4088
+ {"kind":"constraint","summary":"Never modify triggers.yml autonomously","source":"daemon-soul.md","ts":...}
4089
+ {"kind":"tried_and_failed","summary":"Tried npx approach, got version mismatch","detail":"Local build is different from installed package","ts":...}
4090
+ {"kind":"external_ref","url":"https://github.com/...","why":"Design doc for the delivery pattern","ts":...}
4091
+ {"kind":"plan","path":"implementation_plan.md","summary":"3-slice plan for the feature","ts":...}
4092
+ ```
4093
+
4094
+ When spawning a subagent, the platform automatically includes the session knowledge log in the work package. The subagent gets the full brief without the main agent having to reconstruct it.
4095
+
4096
+ **Blank subagents (intentionally uncontextualized):**
4097
+ Sometimes you explicitly DON'T want context from the main session -- fresh eyes are the point. A hypothesis challenge subagent should challenge the leading hypothesis, not be anchored to it. An adversarial reviewer should find problems without knowing the main agent thinks the approach is sound.
4098
+
4099
+ The `spawn_session` call should have an explicit `context: 'inherit' | 'blank' | 'custom'` field:
4100
+ - `inherit` -- auto-package from session knowledge log (default for most tasks)
4101
+ - `blank` -- no session context injected, subagent starts fresh (for adversarial roles)
4102
+ - `custom` -- explicit structured package (for precise control)
4103
+
4104
+ **Subagent types with specialized system prompts and tools:**
4105
+
4106
+ Different tasks need different cognitive profiles. A subagent type bundles: system prompt, available tools, and context mode:
4107
+
4108
+ | Type | System prompt focus | Tools | Context |
4109
+ |------|---------------------|-------|---------|
4110
+ | `researcher` | Thorough, neutral, evidence-first | Read, Bash (read-only), Glob, Grep | inherit |
4111
+ | `challenger` | Adversarial, finds holes, challenges assumptions | Read, Bash | blank (intentionally unanchored) |
4112
+ | `implementer` | Precise, follows plans, no improvisation | Read, Write, Bash, continue_workflow | inherit |
4113
+ | `reviewer` | Finds bugs, security issues, philosophy violations | Read, Bash | blank |
4114
+ | `verifier` | Confirms claims with evidence, runs commands | Read, Bash | inherit |
4115
+ | `coordinator` | Routes work, reads event streams, dispatches | worktrain_spawn, worktrain_await | inherit |
4116
+
4117
+ The type determines the system prompt variant, not just the tools. A `challenger` gets a system prompt that explicitly says "your job is to find problems, not solve them -- do not offer solutions." A `verifier` gets "do not trust claims without running the commands yourself."
4118
+
4119
+ This is the WorkTrain equivalent of cognitive specialization -- different agents for different modes of thought, not just different tasks. The workflow step can specify which subagent type to spawn: `spawn_session({ type: 'challenger', goal: '...' })`.
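+
+ One possible shape for the registry behind `spawn_session({ type: ... })` (illustrative only -- prompts abbreviated, names not final):
+
+ ```typescript
+ // Sketch: a subagent type bundles cognitive mode, tool allowlist, and default context packaging.
+ type ContextMode = 'inherit' | 'blank' | 'custom';
+
+ interface SubagentType {
+   systemPrompt: string; // cognitive mode, layered on top of the base WorkTrain prompt
+   tools: string[];      // tool allowlist for this profile
+   context: ContextMode; // default context packaging
+ }
+
+ const SUBAGENT_TYPES: Record<string, SubagentType> = {
+   researcher: { systemPrompt: 'Thorough, neutral, evidence-first.', tools: ['Read', 'Bash', 'Glob', 'Grep'], context: 'inherit' },
+   challenger: { systemPrompt: 'Your job is to find problems, not solve them -- do not offer solutions.', tools: ['Read', 'Bash'], context: 'blank' },
+   verifier:   { systemPrompt: 'Do not trust claims without running the commands yourself.', tools: ['Read', 'Bash'], context: 'inherit' },
+ };
+ ```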
4120
+
4121
+ ---
4122
+
4123
+ ### Workflow-scoped system prompts for subagents (Apr 17, 2026)
4124
+
4125
+ **The idea:** Workflows (and individual steps within them) can declare a `systemPrompt` field that gets injected into subagent sessions spawned by that workflow step. The workflow author encodes the cognitive mode directly rather than describing it in step prose that the agent has to interpret.
4126
+
4127
+ **Why this is the right layer:**
4128
+ The workflow already controls: what steps run, what tools are available, what the output contract is, what assessments are required. The cognitive mode -- how the agent should think -- is a natural extension of that. A workflow that says "run as adversarial challenger" should be able to enforce that at the platform level, not just suggest it in a prompt.
4129
+
4130
+ **Two levels:**
4131
+
4132
+ **1. Workflow-level `systemPrompt`** -- applies to all subagents spawned by this workflow:
4133
+ ```json
4134
+ {
4135
+ "id": "mr-review-workflow.agentic.v2",
4136
+ "systemPrompt": "You are an adversarial code reviewer. Your job is to find problems, not validate the approach. Do not offer solutions -- only surface issues with evidence. Treat every claim as unproven until you verify it yourself.",
4137
+ "steps": [...]
4138
+ }
4139
+ ```
4140
+
4141
+ **2. Step-level `systemPrompt`** -- overrides the workflow-level prompt for a specific step:
4142
+ ```json
4143
+ {
4144
+ "id": "phase-hypothesis-challenge",
4145
+ "systemPrompt": "You are a devil's advocate. For every assumption in the hypothesis, find the strongest counterargument. Do not be balanced -- be adversarial.",
4146
+ "prompt": "Challenge the leading hypothesis..."
4147
+ }
4148
+ ```
4149
+
4150
+ **How it composes with the base system prompt:**
4151
+ The final subagent system prompt is assembled in layers:
4152
+ 1. WorkTrain base prompt (execution contract, oracle priority, tools)
4153
+ 2. Workflow-level `systemPrompt` (cognitive mode for this workflow)
4154
+ 3. Step-level `systemPrompt` (cognitive override for this step)
4155
+ 4. Soul file (operator behavioral rules)
4156
+ 5. AGENTS.md / workspace context
4157
+ 6. Session knowledge log (inherited context, if `context: 'inherit'`)
4158
+ 7. Step prompt (the actual work instruction)
4159
+
4160
+ The workflow author controls layers 2-3. The operator controls layer 4. The platform assembles 1 and 5-7 automatically. Clear separation of concerns.
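+
+ A sketch of the assembly (illustrative -- whether the step-level prompt replaces or extends the workflow-level one is a design choice; replacement is assumed here):
+
+ ```typescript
+ // Sketch: assemble the subagent system prompt from the layers listed above; empty layers drop out.
+ interface PromptLayers {
+   base: string;                  // 1. WorkTrain base prompt
+   workflowSystemPrompt?: string; // 2. workflow-level cognitive mode
+   stepSystemPrompt?: string;     // 3. step-level override
+   soul?: string;                 // 4. operator soul file
+   workspaceContext?: string;     // 5. AGENTS.md / workspace context
+   sessionKnowledge?: string;     // 6. inherited context, if context: 'inherit'
+   stepPrompt: string;            // 7. the work instruction
+ }
+
+ function assembleSystemPrompt(layers: PromptLayers): string {
+   return [
+     layers.base,
+     layers.stepSystemPrompt ?? layers.workflowSystemPrompt, // step-level wins when both are set
+     layers.soul,
+     layers.workspaceContext,
+     layers.sessionKnowledge,
+     layers.stepPrompt,
+   ].filter((part): part is string => Boolean(part)).join('\n\n');
+ }
+ ```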
4161
+
4162
+ **This also enables the subagent type system** (from the previous backlog entry) to be workflow-driven rather than call-site-driven. Instead of `spawn_session({ type: 'challenger' })`, the workflow step that spawns a challenger simply declares `systemPrompt: "you are adversarial..."` -- the cognitive mode travels with the workflow definition, not the spawn call.
4163
+
4164
+ **Schema addition:**
4165
+ ```typescript
4166
+ interface WorkflowDefinition {
4167
+ systemPrompt?: string; // workflow-level, injected into all subagent sessions
4168
+ steps: WorkflowStep[];
4169
+ }
4170
+
4171
+ interface WorkflowStep {
4172
+ systemPrompt?: string; // step-level, overrides workflow-level for this step
4173
+ prompt: string;
4174
+ // ...existing fields
4175
+ }
4176
+ ```
4177
+
4178
+ **Authoring implication:** The `workflow-for-workflows` meta-workflow should guide authors to write cognitive mode as `systemPrompt` rather than embedding it in `prompt` prose. "What mode should the agent be in?" is a structural question, not a content question.
4179
+
4180
+ ---
4181
+
4182
+ ### Console as the unified WorkRail dashboard -- standalone, file-reading, zero coupling (Apr 18, 2026)
4183
+
4184
+ **The insight:** The console is the unified view of all WorkRail activity -- whether sessions were started by the autonomous daemon or by a human working interactively through the MCP server. It doesn't care how a session was created. It reads the same session store either way.
4185
+
4186
+ The console doesn't need a live connection to either the daemon or the MCP server. It reads files. The current architecture where the console is owned by whichever process wins a port election is wrong -- it's a legacy of when the MCP server was the only long-running process.
4187
+
4188
+ **Target architecture -- zero coupling:**
4189
+
4190
+ ```
4191
+ Daemon → writes ~/.workrail/data/sessions/
4192
+ → writes ~/.workrail/events/daemon/
4193
+ → serves :3200 (webhooks only)
4194
+
4195
+ MCP server → reads/writes session store (same files as daemon)
4196
+ → serves :3100 (Claude Code bridge only)
4197
+
4198
+ Console → reads ~/.workrail/data/sessions/ (file watch, not HTTP)
4199
+ → reads ~/.workrail/events/daemon/ (file watch)
4200
+ → reads git for PR/commit context
4201
+ → serves :3456 (browser UI only)
4202
+ → `worktrain console` -- fully standalone binary
4203
+ ```
4204
+
4205
+ **No startup coordination. No lock files. No port election. No coupling.**
4206
+
4207
+ The console works whether the daemon is running or not, whether the MCP server is running or not. Start it once, leave it running permanently. It shows whatever is in the files.
4208
+
4209
+ **How it gets live updates without HTTP:** FSEvents (macOS) / inotify (Linux) file watching on the session store and daemon event stream. When a new event is appended, the console picks it up within milliseconds and pushes to the browser via SSE -- same latency as today, no polling, no HTTP connection to the daemon required.
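+
+ A sketch of the watch-and-push loop (illustrative; assumes Node's built-in `fs.watch`, which maps to FSEvents on macOS and inotify on Linux -- recursive watching on Linux needs a recent Node):
+
+ ```typescript
+ // Sketch: watch the event directories and fan changes out to connected browsers over SSE.
+ import { createServer, ServerResponse } from 'node:http';
+ import { watch } from 'node:fs';
+ import { join } from 'node:path';
+ import { homedir } from 'node:os';
+
+ const eventsDir = join(homedir(), '.workrail', 'events');
+ const clients = new Set<ServerResponse>();
+
+ createServer((req, res) => {
+   if (req.url === '/events') {
+     res.writeHead(200, { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache', Connection: 'keep-alive' });
+     clients.add(res);
+     req.on('close', () => clients.delete(res));
+   } else {
+     res.writeHead(404);
+     res.end();
+   }
+ }).listen(3456);
+
+ watch(eventsDir, { recursive: true }, (_eventType, filename) => {
+   for (const res of clients) res.write(`data: ${JSON.stringify({ changed: filename })}\n\n`);
+ });
+ ```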
4210
+
4211
+ **The `worktrain console` command:**
4212
+ ```bash
4213
+ worktrain console # start on default port 3456
4214
+ worktrain console --port 4000 # custom port
4215
+ worktrain console --workspace ~/git/myproject # workspace-scoped view
4216
+ ```
4217
+
4218
+ **Migration:** Remove console startup from both the daemon command and the MCP server startup. The primary election logic (`DashboardLock`, `bindWithPortFallback`) becomes unnecessary. The `DaemonConsole` module in `src/trigger/daemon-console.ts` becomes `src/console/standalone-console.ts` with a simpler interface.
4219
+
4220
+ **Why this matters:** Today the console goes down whenever the MCP server crashes. With this architecture, the console is as stable as the filesystem. The daemon crashing doesn't affect the console. The MCP server crashing doesn't affect the console. The only thing that can take down the console is killing the `worktrain console` process itself.
4221
+
4222
+ ---
4223
+
4224
+ ## WorkTrain sprint: Apr 17-18, 2026 -- shipped and current state
4225
+
4226
+ ### What shipped (Apr 17-18)
4227
+
4228
+ **Daemon stabilization:**
4229
+ - ✅ `report_issue` tool -- agents call this instead of dying silently; structured JSON written to `~/.workrail/issues/<sessionId>.jsonl`, event emitted to daemon stream, WORKTRAIN_STUCK marker in `WorkflowRunResult`
4230
+ - ✅ Richer `BASE_SYSTEM_PROMPT` -- baked-in behavioral principles (oracle hierarchy, self-directed reasoning, workflow-as-contract, silent failure policy) rather than relying on soul file alone
4231
+ - ✅ `/bin/bash` for Bash tool -- process substitution `<(...)` and other bash-specific syntax now works
4232
+ - ✅ `DaemonEventEmitter` -- structured event stream at `~/.workrail/events/daemon/YYYY-MM-DD.jsonl`
4233
+ - ✅ Self-configuration -- `triggers.yml`, upgraded `daemon-soul.md` (WorkRail-specific rules + coding philosophy), `AGENTS.md` WorkTrain section
4234
+
4235
+ **Workflow library:**
4236
+ - ✅ mr-review v2.6 -- `philosophy_alignment` reviewer family; scoped philosophy extraction in fact packet; 7th coverage domain; "is this the right design?" framing
4237
+ - ✅ wfw v2.5 -- phases 2 and 3 split into dedicated prep-step design steps (2a/2b, 3a/3b); principle: assessments need dedicated prep steps, not on-the-fly evidence gathering
4238
+ - ✅ Clean workflow display names across library (removed `v2 •`, `Lean •`, etc.)
4239
+ - ✅ `philosophy.mdc` created at `~/.firebender/commands/philosophy.mdc` -- MR review subagents now evaluate findings against coding philosophy
4240
+
4241
+ **Integrations and infrastructure:**
4242
+ - ✅ GitLab polling triggers fully merged (#404) -- zero-webhook MR polling
4243
+ - ✅ TS6 forward-compat tsconfig fixes (#401) -- unblocks TypeScript 6 dep bumps
4244
+ - ✅ Standalone console spec -- `worktrain console` as independent file-reading binary, zero coupling to daemon or MCP server
4245
+
4246
+ ---
4247
+
4248
+ ### Current state (Apr 18, 2026)
4249
+
4250
+ **What works:**
4251
+ - Daemon runs autonomously on webhook triggers
4252
+ - Sessions advance through full workflow steps
4253
+ - Console at `:3456` when daemon starts before MCP server
4254
+ - Daemon event stream logging every tool call
4255
+ - GitLab + GitHub polling (no webhooks needed)
4256
+ - Philosophy-aligned MR reviews
4257
+ - `report_issue` tool available to agents
4258
+
4259
+ **Known issues / active bugs:**
4260
+
4261
+ 1. **Daemon killed by MCP server reconnects** (CRITICAL) -- the daemon and MCP server share process infrastructure via the bridge mechanism. When Claude Code reconnects and a new MCP server process starts, it displaces the running daemon. The daemon must be run from a separate terminal or as a `launchd` service to survive MCP reconnects. Root fix: decouple daemon from the MCP server process tree entirely.
4262
+
4263
+ 2. **Console unstable** -- the console port (3456) is contested between daemon and MCP server. Whoever starts first wins. When the MCP server reconnects, it takes the port and the daemon console goes down. Root fix: standalone `worktrain console` binary (spec in backlog).
4264
+
4265
+ 3. **`workflow_not_found` on first test** -- trigger used `coding-task-workflow-agentic.lean.v2` (filename) instead of `coding-task-workflow-agentic` (workflow ID). Fixed in triggers.yml. Symptom of workflow ID vs filename confusion -- worth a validator that catches this at `worktrain daemon` startup.
4266
+
4267
+ 4. **Session advances 0 when daemon crashes** -- if daemon dies mid-Phase-0 (before any `continue_workflow` call), the session is orphaned at `observation_recorded(8)` with 0 advances and no output. No automatic recovery. Crash recovery reads the daemon-session token file but can't resume a session that never advanced. No fix yet.
4268
+
4269
+ ---
4270
+
4271
+ ### Next priorities (groomed Apr 18)
4272
+
4273
+ **Tier 1 -- Must fix for reliable autonomous operation:**
4274
+ 1. **Daemon as a launchd service** -- run daemon outside Claude Code's process tree so MCP reconnects can't kill it. `worktrain daemon --install` creates a launchd plist and starts it.
4275
+ 2. **Standalone `worktrain console`** -- file-watching binary independent of daemon/MCP. Zero coupling. Spec in backlog.
4276
+ 3. **Workflow ID validation at startup** -- `workrail daemon` should validate that all `workflowId` values in triggers.yml resolve to real workflows before starting, not fail silently at dispatch time.
4277
+
4278
+ **Tier 2 -- Workflow quality:**
4279
+ 4. **mr-review prep steps** -- the audit identified missing dedicated prep steps for philosophy extraction, pattern baseline, and design decision reconstruction. These are described in the backlog but not yet in the workflow JSON. wfw v2.5 guides new workflows to add them; the mr-review workflow itself still needs a v2.7 pass to implement them.
4280
+ 5. **Autonomous workflow variants** -- audit `requireConfirmation` gates across all workflows; confirm daemon's `autonomy: full` setting correctly bypasses the right ones.
4281
+
4282
+ **Tier 3 -- Features:**
4283
+ 6. **`worktrain spawn` / `worktrain await`** -- already merged, needs real-world test
4284
+ 7. **Auto-commit from handoff artifact** -- merged but untested end-to-end
4285
+ 8. **Session knowledge log** -- continuous context accumulation for subagent packaging
4286
+ 9. **TypeScript 6 dep bump** -- tsconfig fixes are in (#401), unblocks #244 and #231
4287
+
4288
+ **Open PRs (only dep bumps remain):**
4289
+ - #330, #287, #288 -- vitest 4 + vite 8 (major version, needs testing)
4290
+ - #244, #231 -- TypeScript 6.0.2 (now unblocked by #401)
4291
+
4292
+ ---
4293
+
4294
+ ### Duplicate task detection: prevent agents from doing the same work twice (Apr 18, 2026)
4295
+
4296
+ **The problem:** with multiple agents running concurrently and a persistent work queue, it's easy to accidentally start two agents on the same task -- especially when the queue drains items from external sources (GitHub issues, Jira) that may be added again after a sync. Today, two agents can independently pick up the same issue, do the same investigation, and open duplicate PRs.
4297
+
4298
+ **Detection sources:**
4299
+ 1. **Open PRs**: before starting any coding task, check `gh pr list --state open` -- if a PR already exists that addresses the same issue/goal, skip it
4300
+ 2. **Active sessions**: the session store knows which workflows are currently running and what their goals are; a new dispatch can check for semantic overlap before starting
4301
+ 3. **Queue deduplication**: the work queue should deduplicate by external item ID (GitHub issue number, Jira ticket key) so the same item can't be enqueued twice
4302
+ 4. **Session history**: before starting an investigation, check recent session notes for the same workflowId + goal combination -- if it was completed in the last 24 hours with a successful result, skip or ask the user
4303
+
4304
+ **Implementation approach:**
4305
+ - Queue-level dedup is the simplest and most reliable: each queue item from an external source carries its `sourceId` (e.g. `github:EtienneBBeaulac/workrail:issues:123`). On enqueue, check if `sourceId` already exists in the queue (pending or active) -- if so, skip with a log. A sketch follows this list.
4306
+ - PR-level dedup: before `worktrain spawn` dispatches a coding task, run `gh pr list --search "<issue title keywords>"` and check for matches. If found, add to outbox ("task already in progress as PR #X") and skip.
4307
+ - Session-level dedup: the coordinator script checks active session goals before spawning a new one with the same goal text.
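+
+ A sketch of the queue-level check (illustrative -- the queue shape is assumed; only the `sourceId` convention comes from above):
+
+ ```typescript
+ // Sketch: exact-match dedup on enqueue, keyed by the external sourceId.
+ interface QueueItem {
+   sourceId: string; // e.g. "github:EtienneBBeaulac/workrail:issues:123"
+   goal: string;
+   status: 'pending' | 'active' | 'done';
+ }
+
+ function enqueue(queue: QueueItem[], item: QueueItem): boolean {
+   const duplicate = queue.some((existing) => existing.sourceId === item.sourceId && existing.status !== 'done');
+   if (duplicate) {
+     console.log(`[Queue] skipped duplicate sourceId=${item.sourceId}`);
+     return false;
+   }
+   queue.push(item);
+   return true;
+ }
+ ```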
4308
+
4309
+ **The classify-task-workflow role:** when a task is classified, it can also output a `deduplicationKey` (e.g. `fix:trigger-store:error-kind-consistency`) that is stored with the queue item. Queue items with the same key are considered duplicates.
4310
+
4311
+ **What makes this hard:** semantic dedup (two tasks described differently but solving the same problem) requires embedding-based similarity, not exact match. For MVP, exact `sourceId` match + approximate PR title search is sufficient. Semantic dedup is a post-knowledge-graph feature.
4312
+
4313
+ ---
4314
+
4315
+ ### Agent actions as first-class events in the session event log (Apr 18, 2026)
4316
+
4317
+ **The vision:** the console should be able to reconstruct exactly what an agent did in a session -- every tool call, every argument, every result, every decision -- by reading the event log alone. No log files, no stdout parsing, no separate monitoring infrastructure. The session event store IS the audit trail.
4318
+
4319
+ **What's already in the event log:**
4320
+ - `session_created`, `run_started`, `run_completed`
4321
+ - `node_created`, `edge_created`, `advance_recorded`
4322
+ - `node_output_appended` (step notes)
4323
+ - `preferences_changed`, `context_set`, `observation_recorded`
4324
+
4325
+ **What's missing -- agent-level actions:**
4326
+ - `tool_call_started` -- which tool was called, with what arguments, at what timestamp
4327
+ - `tool_call_completed` -- result (truncated), duration, success/error
4328
+ - `llm_turn_started` -- model, token count estimate, step context
4329
+ - `llm_turn_completed` -- stop reason, output tokens, whether steer() was injected
4330
+ - `steer_injected` -- what context was injected and why (session recap, workspace context)
4331
+ - `report_issue_recorded` -- the structured issue from the `report_issue` tool
4332
+ - `worktrain_stuck` -- when WORKTRAIN_STUCK marker is emitted
4333
+
4334
+ **Why this matters:**
4335
+ Today the `DaemonEventEmitter` writes to `~/.workrail/events/daemon/YYYY-MM-DD.jsonl` separately from the session store. That's two places to look -- and they're not correlated to specific sessions. Putting agent actions into the session event log means:
4336
+ - Console can show a session timeline: "Phase 0: called `bash` 3 times (12ms, 8ms, 45ms) → called `read` 2 times → advanced to Phase 1"
4337
+ - The proof record (verification chain spec) can link specific tool calls to assessment gate evidence
4338
+ - Crash recovery knows exactly where in the agent's execution it died
4339
+ - The knowledge graph can be updated from session events without re-reading step notes
4340
+
4341
+ **The event schema (additions to the existing event store format):**
4342
+
4343
+ ```typescript
4344
+ // Tool call lifecycle
4345
+ { kind: 'tool_call_started', tool: 'bash', args: { command: 'git status' }, nodeId, ts }
4346
+ { kind: 'tool_call_completed', tool: 'bash', durationMs: 45, exitCode: 0, resultSummary: '...', nodeId, ts }
4347
+ { kind: 'tool_call_failed', tool: 'bash', durationMs: 45, error: 'ENOENT', nodeId, ts }
4348
+
4349
+ // LLM turn lifecycle
4350
+ { kind: 'llm_turn_started', model: 'claude-sonnet-4-6', inputTokens: 12000, nodeId, ts }
4351
+ { kind: 'llm_turn_completed', stopReason: 'tool_use', outputTokens: 450, toolsRequested: ['bash'], nodeId, ts }
4352
+
4353
+ // Steer injection
4354
+ { kind: 'steer_injected', reason: 'session_recap', contentLength: 800, nodeId, ts }
4355
+
4356
+ // Agent self-reporting
4357
+ { kind: 'report_issue_recorded', severity: 'warning', summary: '...', sessionId, ts }
4358
+ ```
4359
+
4360
+ **Where to emit them:**
4361
+ - In `src/daemon/agent-loop.ts` -- before and after each `tool.execute()` call, before and after each LLM call (see the sketch after this list)
4362
+ - In `src/daemon/workflow-runner.ts` -- for steer injection and report_issue recording
4363
+ - Use the existing `V2ToolContext` session store to append events (same mechanism as `continue_workflow` and `start_workflow`)
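+
+ A sketch of the agent-loop wrapping (illustrative -- `appendSessionEvent` stands in for whatever the real session store append mechanism looks like):
+
+ ```typescript
+ // Sketch: wrap each tool.execute() call with started/completed/failed events.
+ async function executeWithEvents(
+   tool: { name: string; execute: (args: unknown) => Promise<unknown> },
+   args: unknown,
+   nodeId: string,
+   appendSessionEvent: (event: Record<string, unknown>) => Promise<void>
+ ): Promise<unknown> {
+   const started = Date.now();
+   await appendSessionEvent({ kind: 'tool_call_started', tool: tool.name, args, nodeId, ts: started });
+   try {
+     const result = await tool.execute(args);
+     await appendSessionEvent({ kind: 'tool_call_completed', tool: tool.name, durationMs: Date.now() - started, nodeId, ts: Date.now() });
+     return result;
+   } catch (error) {
+     await appendSessionEvent({ kind: 'tool_call_failed', tool: tool.name, durationMs: Date.now() - started, error: String(error), nodeId, ts: Date.now() });
+     throw error;
+   }
+ }
+ ```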
4364
+
4365
+ **Console rendering:**
4366
+ Each session detail view gets a "Timeline" tab alongside "Steps" and "Notes":
4367
+ ```
4368
+ Phase 0: Understand & Classify [2m 14s]
4369
+ ├── llm_turn 450 tokens → 3 tool calls
4370
+ ├── bash: git status 45ms ✓
4371
+ ├── bash: gh pr list 180ms ✓
4372
+ ├── read: AGENTS.md 8ms ✓
4373
+ └── llm_turn 280 tokens → advance
4374
+ Phase 1a: State Hypothesis [0m 38s]
4375
+ ├── llm_turn 310 tokens → advance
4376
+ ...
4377
+ ```
4378
+
4379
+ **Relationship to DaemonEventEmitter:**
4380
+ The existing `DaemonEventEmitter` (written in #498) writes to a separate daily log file. Once agent actions are first-class session events, the daemon event emitter can be simplified or removed -- the session event log is the canonical record. The console reads session events, not daemon event files.
4381
+
4382
+ **Build order:**
4383
+ 1. Add `tool_call_started`/`tool_call_completed` events to `agent-loop.ts` -- smallest change, highest value
4384
+ 2. Add `llm_turn_started`/`llm_turn_completed` events
4385
+ 3. Console Timeline tab reads and renders the new event kinds
4386
+ 4. Wire `report_issue_recorded` and `steer_injected` events
4387
+
4388
+ ---
4389
+
4390
+ ### FatalToolError: distinguish recoverable from non-recoverable tool failures (follow-up from PR #523)
4391
+ The blanket try/catch in AgentLoop._executeTools() converts ALL tool throws to isError tool_results. This is correct for Bash/Read/Write (LLM can see and retry), but potentially wrong for continue_workflow failures (LLM retrying with a broken token loops). The discovery agent proposed a FatalToolError subclass: tools throw FatalToolError for non-recoverable errors (session corruption, bad tokens), plain Error for recoverable failures. _executeTools catches plain Error and returns isError; FatalToolError propagates and kills the session. Combined with the DEFAULT_MAX_TURNS cap (PR followup), this provides defense-in-depth.
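+
+ A sketch of the split (illustrative -- names follow the proposal above, not existing code):
+
+ ```typescript
+ // Sketch: recoverable errors come back as isError tool_results; fatal ones propagate and end the session.
+ class FatalToolError extends Error {}
+
+ async function executeTool(run: () => Promise<string>): Promise<{ content: string; isError: boolean }> {
+   try {
+     return { content: await run(), isError: false };
+   } catch (error) {
+     if (error instanceof FatalToolError) throw error; // non-recoverable: end the session
+     return { content: String(error), isError: true };  // recoverable: the LLM sees it and can retry
+   }
+ }
+ ```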
4392
+ 5. Deprecate `DaemonEventEmitter` once console reads from session events
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@exaudeus/workrail",
3
- "version": "3.31.1",
3
+ "version": "3.33.0",
4
4
  "description": "Step-by-step workflow enforcement for AI agents via MCP",
5
5
  "license": "MIT",
6
6
  "repository": {
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "id": "architecture-scalability-audit",
3
- "name": "Architecture Scalability Audit (v1 • Evidence-Driven • Dimension-Scoped • rigorMode-Adaptive)",
3
+ "name": "Architecture Scalability Audit",
4
4
  "version": "0.1.0",
5
5
  "description": "Use this to audit a bounded codebase scope for architecture scalability. Declare which scalability dimensions matter (load, data volume, team size, feature extensibility, operational); the workflow investigates each and produces evidence-grounded findings.",
6
6
  "about": "## Architecture Scalability Audit\n\nThis workflow audits a bounded codebase scope for scalability across the dimensions you care about. It does not produce generic \"won't scale\" warnings -- every finding must cite a specific file, class, method, or pattern, and every concern must name a concrete growth scenario (e.g. 10x traffic, 100x records, 3x team size).\n\n**What it does:**\nYou declare the scope boundary and the scalability dimensions that matter for your context. The workflow reads the codebase to understand the architecture, assigns one dedicated reviewer family per dimension, runs them in parallel from a shared fact packet, reconciles contradictions and blind spots through a synthesis loop, and delivers a per-dimension verdict (will_break / risk / fine) with an overall scalability readiness verdict.\n\n**The five scalability dimensions you can select:**\n- **load** -- handles more requests, users, or throughput\n- **data_volume** -- handles more records, storage, or query size\n- **team_org** -- more teams or developers working on this scope without friction\n- **feature_extensibility** -- more features added without rearchitecting\n- **operational** -- more deployments, environments, or operational complexity\n\n**When to use it:**\n- Before investing significantly in a component you expect to grow\n- When planning capacity for a new traffic tier or data volume increase\n- When evaluating a codebase acquired through a merger, partnership, or open-source adoption\n- When a team is growing and you want to know if the architecture will hold under parallel development\n\n**What it produces:**\nAn overall scalability verdict, per-dimension findings with specific code references and growth scenarios, cross-cutting concerns that span multiple dimensions, a prioritized concern list, and explicit callouts of what is already well-designed for scale.\n\n**How to get good results:**\nBe specific about the scope boundary -- name the service, module, or feature explicitly and say what is out of scope. Choose the dimensions relevant to your actual growth pressures; the workflow will not add dimensions you did not select. If you know a specific growth target (e.g. \"we expect 50x user growth in 18 months\"), mention it.",
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "id": "bug-investigation-agentic",
3
- "name": "Bug Investigation (v2 \u2022 Notes-First \u2022 WorkRail Executor)",
3
+ "name": "Bug Investigation",
4
4
  "version": "2.0.0",
5
5
  "description": "Use this to diagnose a bug or unexpected behavior in code. Builds a hypothesis, gathers evidence, and proves or disproves the root cause before concluding.",
6
6
  "about": "## Bug Investigation Workflow\n\nThis workflow guides an AI agent through a rigorous, evidence-driven investigation of a bug or unexpected behavior. It is designed to prevent the most common failure mode in AI debugging: jumping to a plausible-sounding conclusion without sufficient proof.\n\n**What it does:**\nThe workflow moves through triage, context gathering, hypothesis generation, evidence planning, iterative evidence collection, diagnosis validation, and a final handoff. It explicitly distinguishes between theories (formed by reading code) and proof (confirmed by running tests or reproducing the failure). The final output is a diagnosis with a confidence rating, the strongest alternative explanations that were ruled out, and a high-level fix direction -- not a patch.\n\n**When to use it:**\n- You have a specific bug report, failing test, or production incident to investigate\n- The root cause is not immediately obvious and multiple explanations are plausible\n- You want a trustworthy diagnosis before spending time writing a fix\n- The bug carries enough risk that you need to be confident before changing code\n\n**What it produces:**\nA structured investigation handoff covering: root cause type (single cause, multi-factor, working as designed, etc.), proof summary, ruled-out alternatives, residual uncertainty, likely files involved, and verification steps for whoever implements the fix.\n\n**How to get good results:**\nProvide repro steps, observed symptoms, and expected behavior upfront. Include any relevant logs, failing test commands, or environment details you already have. The more concrete the repro, the faster the workflow can gather real evidence rather than theorizing. If the bug is intermittent, say so -- the workflow adapts its rigor based on reproducibility confidence.",
@@ -57,7 +57,7 @@
57
57
  "steps": [
58
58
  {
59
59
  "id": "phase-0-triage-and-intake",
60
- "title": "Phase 0: Triage (Bug Intake \u2022 Risk \u2022 Mode)",
60
+ "title": "Phase 0: Triage (Bug Intake Risk Mode)",
61
61
  "prompt": "Understand the bug report and choose the right rigor.\n\nCapture:\n- `bugSummary`: concise statement of the issue\n- `reproSummary`: repro steps, symptoms, expected behavior, environment notes\n- `investigationComplexity`: Small / Medium / Large\n- `riskLevel`: Low / Medium / High\n- `rigorMode`: QUICK / STANDARD / THOROUGH\n- `automationLevel`: High / Medium / Low\n- `maxParallelism`: 0 / 2 / 3\n\nDecision guidance:\n- QUICK: clear repro, narrow surface area, low ambiguity\n- STANDARD: moderate ambiguity, moderate system breadth, or meaningful risk\n- THOROUGH: high ambiguity, high-risk production impact, broad surface area, or multiple plausible causes\n\nSet context variables:\n- `bugSummary`\n- `reproSummary`\n- `investigationComplexity`\n- `riskLevel`\n- `rigorMode`\n- `automationLevel`\n- `maxParallelism`\n- `reproducibilityConfidence` (High / Medium / Low)\n\nAsk for confirmation only if the chosen rigor materially affects expectations or if critical repro details are still missing.",
62
62
  "requireConfirmation": true
63
63
  },
@@ -140,7 +140,7 @@
140
140
  {
141
141
  "id": "phase-4b-loop-decision",
142
142
  "title": "Evidence Loop Decision",
143
- "prompt": "Decide whether the evidence loop should continue.\n\nDecision rules:\n- if `contradictionCount > 0` \u2192 continue\n- else if `unresolvedEvidenceGapCount > 0` \u2192 continue\n- else if `hasStrongAlternative = true` and the alternative is not meaningfully weaker \u2192 continue\n- else if `diagnosisType = inconclusive_but_narrowed` and further evidence is not realistically available \u2192 stop with bounded uncertainty\n- else \u2192 stop\n\nOutput exactly:\n```json\n{\n \"artifacts\": [{\n \"kind\": \"wr.loop_control\",\n \"decision\": \"continue\"\n }]\n}\n```",
143
+ "prompt": "Decide whether the evidence loop should continue.\n\nDecision rules:\n- if `contradictionCount > 0` continue\n- else if `unresolvedEvidenceGapCount > 0` continue\n- else if `hasStrongAlternative = true` and the alternative is not meaningfully weaker continue\n- else if `diagnosisType = inconclusive_but_narrowed` and further evidence is not realistically available stop with bounded uncertainty\n- else stop\n\nOutput exactly:\n```json\n{\n \"artifacts\": [{\n \"kind\": \"wr.loop_control\",\n \"decision\": \"continue\"\n }]\n}\n```",
144
144
  "requireConfirmation": true,
145
145
  "outputContract": {
146
146
  "contractRef": "wr.contracts.loop_control"