@exaudeus/workrail 3.79.0 → 3.79.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/console-ui/assets/{index-C1ORzpX3.js → index-aaVieSSm.js} +1 -1
- package/dist/console-ui/index.html +1 -1
- package/dist/daemon/agent-loop.d.ts +1 -1
- package/dist/daemon/agent-loop.js +5 -2
- package/dist/daemon/daemon-events.d.ts +1 -0
- package/dist/daemon/daemon-events.js +2 -2
- package/dist/daemon/tools/bash.js +7 -4
- package/dist/daemon/tools/continue-workflow.js +2 -2
- package/dist/daemon/tools/file-tools.js +3 -3
- package/dist/daemon/tools/glob-grep.js +2 -2
- package/dist/daemon/tools/report-issue.js +1 -1
- package/dist/daemon/tools/signal-coordinator.js +1 -1
- package/dist/daemon/tools/spawn-agent.js +1 -1
- package/dist/manifest.json +31 -31
- package/dist/trigger/trigger-store.js +28 -0
- package/dist/trigger/types.d.ts +1 -1
- package/docs/ideas/backlog.md +822 -1
- package/docs/vision.md +10 -1
- package/package.json +1 -1
package/docs/ideas/backlog.md
CHANGED
@@ -152,6 +152,25 @@ Issue #241 (TTL eviction across multiple files + new tests) was classified as Sm
 
 ---
 
+### `worktrain doctor`: typed service config audit and auto-repair (May 7, 2026)
+
+**Status: idea** | Priority: high
+
+**Score: 11** | Cor:2 Cap:2 Eff:2 Lev:3 Con:2 | Blocked: no
+
+There is no command that audits whether the WorkTrain daemon is correctly configured and healthy before an operator relies on it for overnight work. Config problems (stale binary, wrong launchd plist path, missing env vars, port mismatch, bad model override) surface as silent failures at runtime -- often at 3am. The operator has no way to proactively detect or repair them.
+
+OpenClaw ships a `SERVICE_AUDIT_CODES` system (typed issue codes: `gatewayCommandMissing`, `gatewayEntrypointMismatch`, `launchdKeepAlive`, `gatewayRuntimeBun`, etc.) with a `doctor --fix` command that auto-repairs the most common issues. A `worktrain doctor` command with the same pattern -- audit returns typed `ServiceConfigIssue[]`, severity `warning | error`, suggested fix per code -- would surface config problems before they become silent 3am failures.
+
+This item is distinct from the "daemon --start reports success on crash" fix (PR #898, done): that fix verifies liveness after start; doctor would verify config correctness before start and at any time.
+
+**Things to hash out:**
+- Which audit codes are most valuable for Phase 1? Candidates: stale binary (binary mtime check), missing ANTHROPIC_API_KEY, triggers.yml parse error, launchd plist mismatch (plist path points to wrong binary), port conflict.
+- Should `doctor --fix` auto-repair or only print instructions? Auto-repair for simple cases (rewrite plist path), print instructions for secrets.
+- Where does this live -- `worktrain doctor` as a new subcommand, or integrated into `worktrain daemon --status`?
+
+---
+
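As a sketch of the typed-audit shape this entry borrows from OpenClaw -- every name below (`AuditCode`, `ServiceConfigIssue`, `summarize`) is hypothetical, not existing WorkTrain code, and the Phase 1 codes are the candidates listed under "Things to hash out":

```typescript
// Hypothetical typed audit vocabulary for a `worktrain doctor` command.
type AuditCode =
  | 'staleBinary'
  | 'missingApiKey'
  | 'triggersParseError'
  | 'launchdPlistMismatch'
  | 'portConflict';

interface ServiceConfigIssue {
  code: AuditCode;
  severity: 'warning' | 'error';
  message: string;
  suggestedFix: string;  // printed by `doctor`; applied by `doctor --fix` when safe
  autoFixable: boolean;  // true for e.g. plist rewrites, false for secrets
}

// Render the audit result the way a doctor command might print it.
function summarize(issues: readonly ServiceConfigIssue[]): string {
  if (issues.length === 0) return 'doctor: all checks passed';
  return issues
    .map((i) => `[${i.severity}] ${i.code}: ${i.message}\n  fix: ${i.suggestedFix}`)
    .join('\n');
}
```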
 ### Daemon binary stale after rebuild, no indication to user
 
 **Status: ux gap** | Priority: medium
@@ -214,6 +233,118 @@ All three bugs fixed. `WorkflowContextSlots` typed interface + `extractContextSl
 
 ---
 
+### Unified daemon event schema: merge daemon event log and v2 session store into one trace format (May 7, 2026)
+
+**Status: idea** | Priority: high
+
+**Score: 11** | Cor:1 Cap:2 Eff:2 Lev:3 Con:2 | Blocked: no
+
+WorkTrain writes two separate event stores: the daemon event log at `~/.workrail/events/daemon/<date>.jsonl` (tool calls, session lifecycle, trigger fired) and the v2 session event store at `~/.workrail/data/sessions/<id>/`. Every console feature that needs both structured session data and daemon-layer events (goal text, trigger provenance, tool call history) must bridge two storage formats. The status briefing discovery doc explicitly flagged this as a blocking split: "Two storage systems: Daemon event log vs per-session store. Bridging them requires either reading two systems or accepting one system's incomplete picture."
+
+OpenClaw ships a unified `TrajectoryEvent` schema (`traceId`, `source: "runtime" | "transcript" | "export"`, `type`, `ts`, `seq`, `sessionId`, `runId`, `workspaceDir`, `provider`, `modelId`) covering all event kinds from both sources. A single schema means tooling (console, export, replay, search) is built once.
+
+The migration path does not require replacing either store immediately -- it can start by making the daemon event log speak the same schema fields, so the console can query either source without a bridge layer.
+
+**Things to hash out:**
+- Phase 1 scope: unify schema fields only (no storage migration), or actually consolidate to one storage backend?
+- Does the v2 engine event format need to change, or can the daemon event log adopt v2-compatible fields without touching the engine?
+- What is the correct `source` discriminant for WorkTrain? Candidates: `daemon` (trigger/session lifecycle), `engine` (step advances, token ops), `agent` (tool calls, LLM turns).
+
+---
+
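A minimal sketch of what the unified envelope could look like, using the `source` candidates from "Things to hash out" -- `UnifiedEvent` and `toUnifiedEvent` are hypothetical names, and the fields are adapted from the OpenClaw `TrajectoryEvent` list quoted above, not an existing WorkTrain type:

```typescript
// Hypothetical unified envelope; Phase 1 would only add these fields to the
// daemon log format, not migrate storage.
interface UnifiedEvent {
  traceId: string;
  source: 'daemon' | 'engine' | 'agent';
  type: string;       // e.g. 'session_started', 'tool_call_started'
  ts: number;         // epoch millis
  seq: number;        // monotonic within a trace
  sessionId?: string;
  runId?: string;
  workspaceDir?: string;
  provider?: string;
  modelId?: string;
  payload?: unknown;  // event-type-specific body
}

// Phase 1 bridge: wrap an existing daemon-log line in the shared envelope so
// the console can query both sources through one shape.
function toUnifiedEvent(
  line: { type: string; ts: number; workrailSessionId?: string },
  traceId: string,
  seq: number,
): UnifiedEvent {
  return {
    traceId,
    source: 'daemon',
    type: line.type,
    ts: line.ts,
    seq,
    sessionId: line.workrailSessionId,
  };
}
```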
+### Pluggable context assembly: replace hardcoded `buildSystemPrompt()` with an injectable interface (May 7, 2026)
+
+**Status: idea** | Priority: medium
+
+**Score: 9** | Cor:1 Cap:2 Eff:1 Lev:2 Con:2 | Blocked: no
+
+WorkTrain's context injection is a hardcoded pipeline of pure section functions in `buildSystemPrompt()` (`sectionWorktreeScope`, `sectionWorkspaceContext`, `sectionAssembledContext`, `sectionPriorWorkspaceNotes`, `sectionChangedFiles`, `sectionReferenceUrls`). This works for the current fixed context set, but cannot support: (a) dynamic token-budget management (truncation today is a hard 8KB ceiling on assembled context with no compaction), (b) retrieval-augmented context injection (pull relevant prior session notes by semantic similarity rather than recency), (c) per-workflow context policies (a discovery session needs different context than an implementation session). Adding any of these requires modifying `buildSystemPrompt()` directly rather than composing a new context strategy.
+
+OpenClaw ships a `ContextEngine` interface (`assemble(params: { tokenBudget, availableTools, model, prompt }) => Promise<{ messages, estimatedTokens, systemPromptAddition }>`, `compact()`, `maintain()` with background/foreground mode, `ingest()`, `rewriteTranscriptEntries()` via runtime callback) that is fully pluggable. Any context strategy -- windowed recency, semantic retrieval, summary-based compaction -- can be implemented behind the interface without touching the agent loop.
+
+Phase 1 is not a full interface extraction -- it is scoping the problem: identify which callers of `buildSystemPrompt()` would benefit from a budget parameter, and what the minimal injectable seam looks like.
+
+**Things to hash out:**
+- What is the right Phase 1 scope? Candidates: (a) add a `tokenBudget` parameter to `buildSystemPrompt()` and use it for truncation decisions, (b) extract a `ContextAssembler` port with a single `assemble()` method, (c) define the full interface and ship a `DefaultContextAssembler` that wraps the current behavior unchanged.
+- Does this depend on the unified event schema (above), or is it independent? If context retrieval needs session history, it depends on MemoryStore, which depends on indexed session history.
+- Effort: Phase 1 (budget parameter only) is hours. Full interface extraction is days. The backlog score reflects Phase 1.
+
+---
+
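Candidate (c) could be sketched as follows -- `ContextAssembler`, `DefaultContextAssembler`, and the 4-chars-per-token estimate are all assumptions for illustration, not the existing `buildSystemPrompt()` contract:

```typescript
// Hypothetical minimal port: one assemble() method, budget-aware.
interface AssembleParams {
  tokenBudget: number;
  prompt: string;
}

interface AssembledContext {
  systemPrompt: string;
  estimatedTokens: number;
}

interface ContextAssembler {
  assemble(params: AssembleParams): Promise<AssembledContext>;
}

// Wraps an existing prompt builder unchanged; truncation respects the caller's
// budget instead of a hard 8KB ceiling. Rough 4-chars-per-token estimate.
class DefaultContextAssembler implements ContextAssembler {
  constructor(private readonly buildSystemPrompt: () => string) {}

  async assemble({ tokenBudget }: AssembleParams): Promise<AssembledContext> {
    const full = this.buildSystemPrompt();
    const maxChars = tokenBudget * 4;
    const systemPrompt = full.length > maxChars ? full.slice(0, maxChars) : full;
    return { systemPrompt, estimatedTokens: Math.ceil(systemPrompt.length / 4) };
  }
}
```

A semantic-retrieval or compaction strategy would then be a second `ContextAssembler` implementation, with no change to the agent loop.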
+### Add runId, provider, and modelId to DaemonEvent (May 7, 2026)
+
+**Status: idea** | Priority: high
+
+**Score: 10** | Cor:1 Cap:2 Eff:3 Lev:2 Con:3 | Blocked: no
+
+The console correlates daemon event log entries with v2 session store entries via `workrailSessionId`, but that field is only available after the continueToken is decoded (~50ms after session start). Events emitted before decode (notably `session_started`) have no `workrailSessionId`, creating a gap where the console cannot link early lifecycle events to the correct session.
+
+Adding `runId` (set to the process-local `sessionId` UUID at the top of `runWorkflow()`) as an optional field on all per-session DaemonEvent interfaces closes this gap without migration: `runId` is available immediately, constant for the session's lifetime, and already threaded through `buildPreAgentSession()`. Adding `provider` and `modelId` at the same time gives the event log the model attribution that OpenClaw's trajectory schema has and WorkTrain's currently lacks.
+
+Pattern source: OpenClaw `src/trajectory/types.ts` `TrajectoryEvent.runId` / `.provider` / `.modelId`.
+
+**Philosophy note:** `runId` should be a branded type (`type RunId = string & { readonly _brand: 'RunId' }`) so it cannot be accidentally swapped with `sessionId` or `workrailSessionId` at call sites -- explicit domain types over primitives. All three fields should be optional in the event union (additive, backward-compatible) but required in a separate `SessionEventContext` helper type that is constructed once per session and passed through explicitly -- single source of state truth, no scattered `runId` assignments.
+
+**Done looks like:** All per-session event interfaces (`SessionStartedEvent`, `ToolCalledEvent`, `StepAdvancedEvent`, etc.) have an optional `runId?: string` field. `runWorkflow()` assigns `runId = sessionId` and passes it wherever `workrailSessionId` is currently passed. `provider` and `modelId` are populated at session start from `buildAgentClient()`.
+
+---
+
+### QueuedFileWriter for DaemonEventEmitter (May 7, 2026)
+
+**Status: idea** | Priority: medium
+
+**Score: 9** | Cor:2 Cap:1 Eff:3 Lev:1 Con:3 | Blocked: no
+
+`DaemonEventEmitter._append()` calls `fs.appendFile()` concurrently. Under burst writes (a turn with many tool calls emitting `tool_call_started`, `tool_call_completed` pairs in rapid succession), concurrent appends can interleave JSONL lines, producing a corrupt log that `JSON.parse()` fails on. The failure is silent -- `emit()` is fire-and-forget and errors are swallowed.
+
+The fix is a per-file promise chain: `this._writers.set(path, (this._writers.get(path) ?? Promise.resolve()).then(() => fs.appendFile(...)))`. Each write chains onto the previous write for the same file, serializing them without a mutex. ~10-line change.
+
+Pattern source: OpenClaw `src/trajectory/runtime.ts` `QueuedFileWriter` per-session writer map.
+
+**Philosophy note:** The promise-chain approach is the right fix over a mutex or a lock file: it is purely functional, uses no shared mutable state beyond the Map, and composes cleanly with the existing fire-and-forget emit() contract. The Map entry should be cleaned up when the promise resolves to prevent unbounded accumulation -- determinism over cleverness, no hidden state.
+
+**Done looks like:** `DaemonEventEmitter._append()` uses a `Map<string, Promise<void>>` to serialize writes per file path. Existing tests pass. A new test asserts that 50 concurrent emits produce 50 valid JSONL lines in the correct order.
+
+---
+
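The one-liner above expands to roughly this -- `QueuedAppender` is a hypothetical standalone name (the real change would live inside `DaemonEventEmitter._append()`), and the error-swallowing on the chain is an assumption to match the fire-and-forget `emit()` contract:

```typescript
import { promises as fs } from 'node:fs';

// Per-file promise chain: each append waits for the previous append to the
// same path, so JSONL lines cannot interleave.
class QueuedAppender {
  private readonly writers = new Map<string, Promise<void>>();

  append(path: string, line: string): Promise<void> {
    const prev = this.writers.get(path) ?? Promise.resolve();
    // Swallow the prior error before chaining so one failed append does not
    // poison every later write to this file.
    const next = prev.catch(() => {}).then(() => fs.appendFile(path, line + '\n'));
    this.writers.set(path, next);
    // Clean up the map entry once this write settles, if it is still the tail
    // of the chain -- prevents unbounded accumulation.
    next
      .finally(() => {
        if (this.writers.get(path) === next) this.writers.delete(path);
      })
      .catch(() => {});
    return next;
  }
}
```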
+### Two-layer path validation for config-derived file paths (May 7, 2026)
+
+**Status: idea** | Priority: medium
+
+**Score: 9** | Cor:2 Cap:1 Eff:3 Lev:1 Con:3 | Blocked: no
+
+`workflowId` and `triggerId` values from config are validated at parse time for format correctness, but not at file-path construction time. If a malformed value slipped through (e.g. `wr/../../../etc/passwd`), file operations using those values as path segments could escape the intended directory. WorkTrain's `sessionId` values are `randomUUID()` (safe), but `workflowId` and `triggerId` are operator-supplied strings.
+
+The pattern: `assertSafeFileSegment(id: string): string` rejects `/`, `\`, and null bytes and returns the sanitized id or throws. Then `isPathInside(safeDir, resolvedPath)` as a second structural containment check. Also: add `fs.chmod(filePath, 0o600)` to sidecar writes in `persistTokens()` so session recovery files are not world-readable.
+
+Pattern source: OpenClaw `src/cron/run-log.ts` `assertSafeCronRunLogJobId()` + `isPathInside()`.
+
+**Philosophy note:** This is validate-at-boundaries in practice: `workflowId` and `triggerId` are external inputs (operator config) and must be re-validated at every boundary where they're used to construct a file path, not just at parse time. `assertSafeFileSegment()` should return a branded type (`SafeFileSegment`) so the compiler enforces that path construction only ever uses validated segments -- type safety as the first line of defense. `isPathInside()` is the runtime guard for cases where the type system can't help.
+
+**Done looks like:** A `src/infra/safe-path.ts` module exports `assertSafeFileSegment()`. Applied at all sites where `workflowId` or `triggerId` is used to construct a file path. `persistTokens()` adds `chmod 0o600` after the atomic rename. Tests cover the rejection cases.
+
+---
+
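A sketch of the two layers under the assumptions above (the `.` / `..` rejections are an addition beyond the three characters the entry lists, since both would also escape or collide):

```typescript
import path from 'node:path';

// Layer 1: lexical check on the segment itself, returning a branded type so
// path construction can require a validated value at compile time.
type SafeFileSegment = string & { readonly _brand: 'SafeFileSegment' };

function assertSafeFileSegment(id: string): SafeFileSegment {
  if (id.length === 0 || /[/\\\0]/.test(id) || id === '.' || id === '..') {
    throw new Error(`unsafe file segment: ${JSON.stringify(id)}`);
  }
  return id as SafeFileSegment;
}

// Layer 2: structural containment -- after resolution, the candidate path must
// still live inside safeDir. Guards the cases the type system cannot see.
function isPathInside(safeDir: string, candidate: string): boolean {
  const rel = path.relative(path.resolve(safeDir), path.resolve(candidate));
  return rel !== '' && !rel.startsWith('..') && !path.isAbsolute(rel);
}
```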
+### Spawn allowlist policy: restrict which workflows a session can spawn (May 7, 2026)
+
+**Status: idea** | Priority: medium
+
+**Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no
+
+WorkTrain's `spawn_agent` tool allows a parent session to spawn any `workflowId` without restriction. There is no mechanism for an operator to limit which child workflows a given trigger's sessions may delegate to. A misconfigured or misbehaving agent could spawn arbitrary long-running sessions, consuming queue slots and API budget.
+
+A `spawnPolicy` field on `TriggerDefinition` (or on the `agentConfig` block) would let operators declare an allowlist: `allowedWorkflows: ['wr.review', 'wr.coding-task']` or `'*'` for unrestricted. `makeSpawnAgentTool()` checks the allowlist before calling `executeStartWorkflow()` and returns a typed error if the requested `workflowId` is not permitted.
+
+Pattern source: OpenClaw `src/agents/subagent-target-policy.ts` `resolveSubagentTargetPolicy()` -- pure function, `{ ok: true } | { ok: false, allowedText, error }`.
+
+**Philosophy note:** The allowlist check must be a pure function with a discriminated union result -- `{ ok: true } | { ok: false; allowedText: string; error: string }` -- not a boolean or an exception. Errors are data. The check belongs in `makeSpawnAgentTool()` before `executeStartWorkflow()` is called, not inside the engine. The allowed workflows type should use a discriminated union: `'*'` (unrestricted) vs `{ kind: 'allowlist'; workflows: readonly string[] }` -- make illegal states unrepresentable, no stringly-typed `'*'` mixed with arrays.
+
+**Things to hash out:**
+- Should this be on `TriggerDefinition` (per-trigger restriction) or on `agentConfig` (inheritable by child sessions)? Per-trigger is simpler; agentConfig inheritance is more flexible.
+- Default when absent: `'*'` (current behavior, no restriction) or an explicit opt-in? A safe default would restrict to the same `workflowId` as the parent, but that would break coordinator patterns that intentionally spawn different workflows.
+
+---
+
 ### MemoryStore: indexed session history as a coordinator and enricher dependency (Apr 30, 2026)
 
 **Status: idea** | Priority: medium
@@ -610,6 +741,64 @@ Tier 0 injection needs a dedicated system prompt section separate from `assemble
 
 ---
 
+### Interpretation checkpoint for coding workflow: Candidate 5 (May 6, 2026)
+
+**Status: done** | Shipped in PR #962 (feat/etienneb/interpretation-checkpoint, May 7, 2026)
+
+**Score: 12** | Cor:2 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
+
+Added `phase-0c-assumption-verification` step to `wr.coding-task` (v1.3.0 → v1.4.0) between Phase 0 (classify) and Phase 0.5 (upstream context). The step requires the coding agent to state a one-sentence interpretation before listing any assumptions, produce exactly 3 codebase assumptions with predicted locations and severity labels, verify each assumption by reading the predicted location, and output an `InterpretationArtifact` context key with `ambiguityLevel: clear | uncertain`. High-severity refutations surface to operator via `report_issue`. Also appended Subtype A/B classification prompt to the retrospective step for distribution measurement.
+
+This is the first step of the intent gap intervention sequence: Candidate 5 (shipped) → Candidate 4 (git-grounded context, next) → Candidate 1 or 3 gated on Subtype A/B empirical data.
+
+---
+
+### External assumption ranking for interpretation checkpoint (May 6, 2026)
+
+**Status: idea** | Priority: medium
+
+**Score: 10** | Cor:2 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no (Candidate 5 shipped PR #962, May 7, 2026)
+
+The interpretation checkpoint (Candidate 5) asks the coding agent to label each of its own assumptions as `severity: high` or `severity: low`. This self-labeling is a known weak point: an agent with a confident wrong prior may mislabel its most dangerous architectural assumption as low-severity to avoid triggering the gate. Self-assessed severity is the single lowest-confidence element in the pitch (confidence: 0.55).
+
+An external agent -- one that did not produce the assumptions -- can independently rank them by actual risk before verification runs. The external agent receives only the ticket and the assumption list (not the producing agent's full context or reasoning) and answers: which of these assumptions is most load-bearing? Which, if wrong, would cause the most damage? Are there high-risk areas this agent didn't surface at all?
+
+The producing agent then verifies in order of externally-ranked risk rather than self-assessed severity. Severity classification moves from self-labeling to an independent signal, removing the 0.55-confidence gap entirely.
+
+**Relationship to targeted session review:** the external ranking agent's output is also a high-signal review moment -- if the external agent flags assumptions the producing agent didn't think to surface, that delta is direct evidence of an interpretation gap.
+
+**Things to hash out:**
+- What context does the ranking agent receive? Ticket + assumption list only, or also the affected file list and design lock references? More context improves ranking quality but risks contaminating the independence.
+- Is this a lightweight parallel call (runs simultaneously with verification setup) or a blocking step?
+- How are conflicts between self-assessed severity and external ranking resolved? External ranking should win, but the producing agent should see the disagreement and explain it.
+- Cost: one additional inference call per session. Acceptable for standard/thorough sessions; probably skip for QUICK mode.
+
+---
+
+### Intent gap correction: fix the interpretation after assumption refutation (May 6, 2026)
+
+**Status: idea** | Priority: medium
+
+**Score: 10** | Cor:2 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no (Candidate 5 shipped PR #962, May 7, 2026)
+
+When an agent's assumption-surfacing step (Candidate 5) refutes a high-severity assumption, the current scoped fix is to surface the refutation to the operator and halt. But the real problem is deeper: the wrong prior that caused the refuted assumption may have already contaminated earlier context -- the upstream context harvest, the problem framing, the `reframedProblem` and `challengedAssumptions` context keys. A simple "re-read the file and try again" doesn't fix a wrong model; it patches the symptom in one step while leaving the contaminated context intact. Long-term, a refuted assumption that reflects a codebase-specific wrong prior (Subtype B) should also update the Memory store and eventually the knowledge graph so future sessions don't repeat the mistake.
+
+This is explicitly out of scope for the Candidate 5 pitch -- detection is the right first boundary. Correction is a separate, larger problem that depends on session context rollback, Memory store integration, and eventually the knowledge graph.
+
+**Done looks like:** when a high-severity assumption is refuted mid-session, the system can: (1) identify which prior context keys were formed under the wrong prior, (2) trigger a targeted correction sub-flow that re-derives those keys with the corrected interpretation, (3) write the correction back to the Memory store so future sessions in this workspace start with the right prior.
+
+**Things to hash out:**
+- What is the right granularity for context rollback? Rolling back individual keys vs. re-running entire prior phases are very different costs.
+- How do you distinguish "assumption was wrong about this specific file" (local fix) from "assumption reflects a systematic wrong prior about this codebase pattern" (Memory store update warranted)?
+- What is the trigger for a Memory store write -- every refuted high-severity assumption, or only ones confirmed as Subtype B by retrospective labeling?
+- How does this interact with the knowledge graph when it ships? The assumption store (Candidate 2 from the intent gap discovery) and the knowledge graph are both candidates for receiving the correction signal.
+
+**Relationship to existing entries:**
+- Blocked by: Candidate 5 (assumption surfacing step) -- detection must exist before correction can be designed
+- Related to: Subtype B intent failure (below), Knowledge graph (backlog), Memory store / living work context (shipped PR #939, #948, #952)
+
+---
+
 ### Subtype B intent failure: agent has a wrong prior about what this codebase does (May 5, 2026)
 
 **Status: idea -- needs empirical study before design** | Priority: high
@@ -1362,6 +1551,143 @@ Essential before WorkTrain manages more than 2-3 repos.
 
 ---
 
+
### Self-improvement loop MVP: WorkTrain picks up and ships workrail issues end-to-end (May 8, 2026)
|
|
1555
|
+
|
|
1556
|
+
**Status: idea** | Priority: high
|
|
1557
|
+
|
|
1558
|
+
**Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: yes (blocked by: convention enforcement in review, scope filter at dispatch, protected file gate, interpretation checkpoint operator approval, verification agent, resolver agent)
|
|
1559
|
+
|
|
1560
|
+
The self-improvement loop is the vision's north star and the primary test of whether WorkTrain works. This item defines the minimum viable version: WorkTrain picks up a labeled workrail issue, runs the full pipeline, and produces a correct, convention-compliant, reviewed PR -- without the operator intervening between phases except at the interpretation checkpoint.
|
|
1561
|
+
|
|
1562
|
+
The quality bar is non-negotiable: WorkTrain produces exemplary code that passes the same review standard it applies to others. Every finding gets fixed before merge, regardless of severity. No "we'll note it and move on."
|
|
1563
|
+
|
|
1564
|
+
---
|
|
1565
|
+
|
|
1566
|
+
**Full gate sequence:**
|
|
1567
|
+
|
|
1568
|
+
```
|
|
1569
|
+
Issue labeled worktrain:ready on workrail repo
|
|
1570
|
+
↓
|
|
1571
|
+
[Gate 0: Scope filter] -- coordinator checks issue is safe to dispatch
|
|
1572
|
+
PASS: single subsystem, no protected files, backlog Effort:3 or less
|
|
1573
|
+
FAIL: comment on issue, remove label, do not dispatch
|
|
1574
|
+
(pure TypeScript, no LLM, reads issue body + changed-files prediction)
|
|
1575
|
+
↓
|
|
1576
|
+
Adaptive coordinator: classify and select pipeline mode
|
|
1577
|
+
(implement for scoped bugs/features, full for anything needing discovery)
|
|
1578
|
+
↓
|
|
1579
|
+
Discovery + shaping phases (if full pipeline)
|
|
1580
|
+
Shaping output becomes the verifier's specification of "done"
|
|
1581
|
+
↓
|
|
1582
|
+
[Gate 1: Interpretation checkpoint -- ALWAYS requires operator approval]
|
|
1583
|
+
Operator approves: coding begins
|
|
1584
|
+
Operator edits: revised interpretation injected as coding context
|
|
1585
|
+
Operator rejects: issue returned to queue, label removed
|
|
1586
|
+
NOTE: no auto_confirm for the self-improvement loop, ever
|
|
1587
|
+
↓
|
|
1588
|
+
Coding phase (isolated worktree, branchStrategy: 'worktree')
|
|
1589
|
+
↓
|
|
1590
|
+
[Gate 2: Protected file check]
|
|
1591
|
+
Checks git diff for: daemon-soul.md, triggers.yml,
|
|
1592
|
+
src/v2/durable-core/ HMAC layer, docs/design/v2-core-design-locks.md,
|
|
1593
|
+
src/daemon/ session lifecycle core
|
|
1594
|
+
ANY HIT → stop immediately, escalate to operator, do not open PR
|
|
1595
|
+
(deterministic script, no LLM)
|
|
1596
|
+
↓
|
|
1597
|
+
PR opened automatically
|
|
1598
|
+
↓
|
|
1599
|
+
[Gate 3: CI]
|
|
1600
|
+
PASS → continue
|
|
1601
|
+
FAIL → WorkTrain reads CI output, attempts one targeted fix,
|
|
1602
|
+
pushes to same branch, re-runs CI
|
|
1603
|
+
STILL FAILING → escalate to operator, do not proceed
|
|
1604
|
+
↓
|
|
1605
|
+
[Gate 4: Verification agent] -- independent QA agent, adversarial stance
|
|
1606
|
+
Fed: original issue + shaped pitch (if exists) + implementation diff
|
|
1607
|
+
Tools: Read, Glob, Grep, constrained Bash (npx vitest, git diff, git show only)
|
|
1608
|
+
Job: prove each requirement is met with explicit evidence
|
|
1609
|
+
|
|
1610
|
+
VerificationRecord output contract:
|
|
1611
|
+
requirementsCoveredCount: number
|
|
1612
|
+
requirementsTotal: number
|
|
1613
|
+
evidencePerRequirement: Array<{
|
|
1614
|
+
requirement: string
|
|
1615
|
+
evidence: string // test output, grep result, etc.
|
|
1616
|
+
confident: boolean
|
|
1617
|
+
}>
|
|
1618
|
+
gapsFound: ReadonlyArray<string> // asked for, not implemented
|
|
1619
|
+
unexpectedScope: ReadonlyArray<string> // implemented, not asked for
|
|
1620
|
+
verdict: 'approved' | 'gaps_found' | 'disputed' | 'uncertain'
|
|
1621
|
+
|
|
1622
|
+
'approved' → proceed to review
|
|
1623
|
+
'gaps_found' → back to coding agent with gaps as context (one retry only)
|
|
1624
|
+
'uncertain' → escalate to operator
|
|
1625
|
+
'disputed' → Resolver agent (see below)
|
|
1626
|
+
↓
|
|
1627
|
+
[Gate 4b: Resolver agent -- fires only on 'disputed']
|
|
1628
|
+
Third independent agent, no stake in either position
|
|
1629
|
+
Fed: original requirement text, coding agent's implementation rationale
|
|
1630
|
+
(from session notes), verifier's specific objection + evidence
|
|
1631
|
+
Tools: same constrained Bash as verifier
|
|
1632
|
+
Job: determine ground truth -- does the code satisfy the requirement?
|
|
1633
|
+
|
|
1634
|
+
Resolver verdict (binding -- coding agent cannot argue back):
|
|
1635
|
+
'satisfied' → proceed to review
|
|
1636
|
+
'not_satisfied' → back to coding agent with resolver's rationale (final)
|
|
1637
|
+
'requirement_ambiguous' → escalate to operator with:
|
|
1638
|
+
original requirement + both agents' positions + resolver's analysis
|
|
1639
|
+
+ suggested clarification for the operator to add to the issue
|
|
1640
|
+
|
|
1641
|
+
NOTE: ACP (agent-to-agent real-time messaging) is a future enhancement
|
|
1642
|
+
to this gate. When ACP ships, verifier and resolver could exchange
|
|
1643
|
+
structured messages through the coordinator rather than using a batch
|
|
1644
|
+
three-agent panel. The coordinator-mediated panel is the MVP approach.
|
|
1645
|
+
↓
|
|
1646
|
+
[Gate 5: wr.mr-review -- calibrated for workrail]
|
|
1647
|
+
Loaded with: coding philosophy principles, daemon invariants doc,
|
|
1648
|
+
design locks, commit message rules, neverthrow/assertNever conventions
|
|
1649
|
+
ALL findings fixed before proceeding, regardless of severity
|
|
1650
|
+
WorkTrain fixes, re-reviews until clean -- no threshold, no "note it"
|
|
1651
|
+
↓
|
|
1652
|
+
Auto-merge (squash, delete worktree)
|
|
1653
|
+
↓
|
|
1654
|
+
Issue closed, backlog item marked done
|
|
1655
|
+
```
|
|
1656
|
+
|
|
1657
|
+
---
|
|
1658
|
+
|
|
1659
|
+
**What needs to be built (in dependency order):**
|
|
1660
|
+
|
|
1661
|
+
1. **Convention enforcement in wr.mr-review for workrail** -- workspace context that injects coding philosophy, design locks, and daemon invariants as explicit review criteria. Without this, Gate 5 is generic and toothless.
|
|
1662
|
+
|
|
1663
|
+
2. **Scope filter at dispatch (Gate 0)** -- pure TypeScript coordinator check. Refuses dispatch if: protected files predicted in scope, issue touches multiple subsystems, issue is architectural. Reads issue body + label set.
|
|
1664
|
+
|
|
1665
|
+
3. **Protected file gate (Gate 2)** -- post-coding delivery-layer script, runs `git diff --name-only` against a blocklist. Hard stop, no retry.
|
|
1666
|
+
|
|
1667
|
+
4. **Interpretation checkpoint wired to operator approval (Gate 1)** -- in daemon sessions on the workrail repo, the interpretation checkpoint must never auto-confirm. Coordinator sets `requireInterpretationApproval: true` for this trigger.
|
|
1668
|
+
|
|
1669
|
+
5. **Verification agent (Gate 4)** -- new agent role with dedicated system prompt (adversarial QA stance), constrained Bash tool variant, and `VerificationRecord` output contract enforced by the engine.
|
|
1670
|
+
|
|
1671
|
+
6. **Resolver agent (Gate 4b)** -- new agent role, binding verdict, same constrained Bash. Coordinator spawns only on `disputed` verdict.
|
|
1672
|
+
|
|
1673
|
+
7. **CI failure one-retry loop (Gate 3)** -- coordinator reads CI status via `gh`, spawns a targeted-fix session if failing, re-polls.
|
|
1674
|
+
|
|
1675
|
+
---

**Spawn depth:** The full gate sequence uses coordinator → coding → verifier → (if disputed) resolver. That's depth 3, hitting the current default `maxSubagentDepth: 3`. The workrail self-improvement trigger needs `maxSubagentDepth: 4` in `triggers.yml`.

**MVP issue scope:** Start with issues that are: scoped to one file or directory, have Effort:3 in the backlog (hours to a day), and don't touch `src/v2/durable-core/` or `src/daemon/` session lifecycle. Good candidates: infra utilities, CLI commands, test coverage, observability additions.

**Trust ramp:** For the first 10 issues, operator reviews the interpretation checkpoint output in full before approving. After 10 clean runs (no gaps found by verifier, no resolver invocations), interpretation checkpoint can be reviewed asynchronously (operator approves via `worktrain inbox`). After 20 clean runs, the gate timing can be relaxed further. The loop tightens based on track record, not a timer.

**Things to hash out:**
- The constrained Bash tool for the verifier/resolver is a new tool variant not yet in the codebase -- `makeConstrainedBashTool(allowedPrefixes: readonly string[])`. Where does it live and how is it enforced?
- `requireInterpretationApproval: true` is not a current field on `TriggerDefinition`. Does it go on `agentConfig` or as a top-level trigger field?
- How does the operator approve/reject at Gate 1 in practice? Via `worktrain inbox`? Via console? This needs to work reliably at 3am when the operator isn't watching.
- The one-retry CI fix (Gate 3) spawns a child session -- that child needs access to CI failure output. Does the coordinator fetch it via `gh` and inject it as context, or does the child agent fetch it itself?

---

### Demo repo feedback loop: WorkTrain improves itself via real task execution (Apr 20, 2026)

**Status: idea** | Priority: high

@@ -1652,6 +1978,77 @@ Combined with the `DEFAULT_MAX_TURNS` cap, this provides defense-in-depth agains

The durable session store, v2 engine, and workflow authoring features shared by all three systems.

### Coordinator-managed typed output vocabulary: agent emits typed events, coordinator reacts per type (May 7, 2026)

**Status: idea** | Priority: high

**Score: 12** | Cor:2 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no

Today, agent output is largely untyped -- notes, artifacts, context keys. The coordinator reacts to typed handoff artifacts at phase boundaries, but within a session the agent's observations, decisions, findings, and suggestions are all prose. The coordinator cannot programmatically react to them.

The idea: the coordinator owns a vocabulary of typed output kinds that it supports. Before a session starts, it injects that vocabulary into the agent's context -- the agent knows exactly what typed things it can emit and what each one means. When the agent emits a typed output, the coordinator reacts with the appropriate process for that type. The reaction is deterministic coordinator logic (not LLM reasoning), specified per type.

**Examples of typed output kinds and coordinator reactions:**

- `suggestion(kind: "abstraction_extraction")` → coordinator fires targeted verification: "what are the three future cases this serves?"
- `finding(severity: "critical", area: "security")` → coordinator routes to immediate review, may block merge
- `decision(chose: X, over: Y, rationale: ...)` → coordinator checks for conflicts with prior decisions in the session store
- `scope_change(direction: "larger", reason: ...)` → coordinator re-evaluates task complexity, may re-route to a heavier workflow
- `blocker(kind: "missing_context", what: ...)` → coordinator attempts to resolve the blocker from known sources before surfacing to operator
- `learning(claim: ..., area: ..., confidence: ...)` → coordinator writes to the assumption store for future sessions
- `assumption(claim: ..., severity: ...)` → coordinator gates on verification before proceeding (Candidate 5 is a specific instance of this)

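Sketched in TypeScript, the vocabulary-plus-reaction shape could look like the following. All names here are hypothetical illustrations, not current WorkTrain API; the point is the discriminated union plus an exhaustive, LLM-free reaction function.

```typescript
// Hypothetical sketch: typed agent outputs as a discriminated union, with
// deterministic per-kind coordinator reactions. Names are illustrative only.
type AgentOutput =
  | { kind: 'finding'; severity: 'critical' | 'warning'; area: string; detail: string }
  | { kind: 'decision'; chose: string; over: string; rationale: string }
  | { kind: 'scope_change'; direction: 'larger' | 'smaller'; reason: string }
  | { kind: 'blocker'; blockerKind: 'missing_context'; what: string };

type Reaction = {
  action: 'route_to_review' | 'check_conflicts' | 'reevaluate_complexity' | 'resolve_blocker';
};

function assertNever(x: never): never {
  throw new Error(`unhandled output kind: ${JSON.stringify(x)}`);
}

// Pure coordinator reaction: no LLM turn, exhaustive over the vocabulary.
// Adding a new output kind breaks the build until every handler covers it.
function reactTo(output: AgentOutput): Reaction {
  switch (output.kind) {
    case 'finding':
      return { action: 'route_to_review' };
    case 'decision':
      return { action: 'check_conflicts' };
    case 'scope_change':
      return { action: 'reevaluate_complexity' };
    case 'blocker':
      return { action: 'resolve_blocker' };
    default:
      return assertNever(output);
  }
}
```

The injected-vocabulary half of the idea would then be a serialization of this union (kind names plus field descriptions) into the agent's system prompt.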
**What makes this powerful:**
The agent doesn't need to know what happens next when it emits a typed output -- that's the coordinator's job. The agent just has to recognize "this is an assumption I'm making" or "this is a scope change I'm noticing" and emit the right type. The coordinator's reaction logic handles the rest deterministically, without LLM turns.

**Relationship to existing entries:**
- "Typed suggestion artifacts with workflow-directed verification" (below): a specific application of this pattern to suggestions
- "Coordinator mid-session hooks": the coordinator's reaction to typed outputs is exactly a mid-session hook triggered by a specific event type
- "Candidate 5 / interpretation checkpoint": the assumption verification step is a manually-implemented instance of this pattern for one output type
- "Coordinator session store awareness": the coordinator's reaction to a `learning` or `decision` type can write to the session store for future sessions

**Things to hash out:**
- Who defines the vocabulary of supported types -- the engine (closed set), the workflow author (per-workflow), or the coordinator (per-deployment)?
- How does the agent learn what types are available? Injected in the system prompt, declared in the workflow, or both?
- What is the API surface for emitting a typed output? A dedicated tool, a structured artifact field, a reserved context key pattern?
- How are reactions defined? TypeScript in the coordinator script, declarative rules in triggers.yml, or something else?
- What happens when the agent emits a type the coordinator doesn't handle? Silent drop, warning, or error?
- Should typed outputs be visible in the console as first-class events, or only in the raw session log?

---

### Typed suggestion artifacts with workflow-directed verification (May 7, 2026)

**Status: idea** | Priority: medium

**Score: 11** | Cor:2 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: no

Agents frequently make suggestions mid-workflow -- propose an abstraction, recommend a deferral, flag a scope expansion, suggest a performance optimization. Today these live in plain prose notes. The workflow cannot distinguish one type of suggestion from another, cannot apply targeted follow-up logic, and cannot verify that the suggestion was actually scrutinized before being accepted. A suggestion that warrants architectural review gets the same treatment as one that warrants nothing.

The idea: a typed `suggestion` tool call that the agent makes instead of embedding the suggestion in prose. The artifact carries a `kind` field (closed enum, workflow-declared) that tells the engine what type of suggestion this is. The workflow author declares, per suggestion kind, what verification the engine should require before the suggestion is accepted.

**Example suggestion kinds and their natural follow-up scrutiny:**
- `abstraction_extraction` -- "is this premature? what are the three concrete future cases this serves? does any of them exist in the current backlog? does this introduce coupling that didn't exist before?"
- `architectural_change` -- "does this conflict with any design locks? what breaks downstream?"
- `scope_expansion` -- "is this actually in scope? is this the scope rationalization failure mode -- the agent declaring it's a separate ticket to avoid doing the work?"
- `deferral` -- "is this genuinely separate work, or is the agent completing checkboxes while leaving real work undone?"
- `performance_optimization` -- "is this premature? what is the actual measured bottleneck? what evidence justifies this now?"

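One way the per-kind declaration could be shaped, as a minimal sketch -- the type names, the criteria table, and `followupFor` are all hypothetical, not an existing WorkTrain schema:

```typescript
// Illustrative sketch: the workflow author declares, per suggestion kind,
// the follow-up questions the engine requires before the suggestion passes.
type SuggestionKind =
  | 'abstraction_extraction'
  | 'architectural_change'
  | 'scope_expansion'
  | 'deferral'
  | 'performance_optimization';

const verificationCriteria: Readonly<Record<SuggestionKind, readonly string[]>> = {
  abstraction_extraction: [
    'What are the three concrete future cases this serves?',
    'Does this introduce coupling that did not exist before?',
  ],
  architectural_change: ['Does this conflict with any design locks?', 'What breaks downstream?'],
  scope_expansion: ['Is this actually in scope, or scope rationalization?'],
  deferral: ['Is this genuinely separate work?'],
  performance_optimization: ['What is the actual measured bottleneck?'],
};

// The engine would build a require_followup consequence from the declared
// criteria when the agent emits a typed suggestion of that kind.
function followupFor(kind: SuggestionKind): {
  consequence: 'require_followup';
  questions: readonly string[];
} {
  return { consequence: 'require_followup', questions: verificationCriteria[kind] };
}
```

Because the criteria are data, not prompt prose, the friction stays bounded: each kind carries only its own short question list.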
**Mechanism:** fits naturally with the assessment gate system. A `suggestion_quality` assessment with dimensions specific to the suggestion kind. The workflow author declares which dimensions apply to each kind. When the agent emits a typed suggestion, the engine fires a `require_followup` consequence requiring the agent to answer the verification criteria for that kind before proceeding. If the agent cannot answer them satisfactorily, the suggestion does not pass.

**API shape is open:** the typed suggestion could be a dedicated tool call (`suggest(type: "abstraction_extraction", ...)`), a structured artifact field in `continue_workflow`, a special context key, or something else entirely. The key property is that it is machine-readable and has a `kind` field the engine can act on -- not prose. The exact surface needs design work.

**The friction concern:** if suggestions require too much overhead, agents will stop surfacing them or bury them in prose to avoid the gate. The verification criteria must be targeted and lightweight -- not a full review pass, just the specific questions that matter for that kind. "What are the three future cases this abstraction serves?" is lightweight. "Run a full architecture review" is not.

**Things to hash out:**
- What is the closed set of suggestion kinds for the initial version? Too many kinds creates complexity; too few misses the point.
- Should suggestion kinds be workflow-declared (each workflow author defines their own) or engine-owned (a closed set the engine enforces)? Engine-owned is more consistent but less flexible.
- How does the agent signal that a suggestion was considered and rejected, not just overlooked? A declined suggestion should be as visible as an accepted one.
- Does the verification happen inline (a `require_followup` on the same step) or as a separate verification step? Inline is lower friction; a separate step is more auditable.
- How does this interact with the existing `report_issue` mechanism? Some suggestions that fail verification should surface to the operator, not just loop back to the agent.

---

### WorkTrain as the canonical workflow author -- MCP as a derived runtime (Apr 30, 2026)

**Status: idea** | Priority: high

@@ -1862,6 +2259,80 @@ This is already how mid-run resume works. The same mechanism extends naturally t

---

### `withTimeout` + `withRetry` as first-class async boundary utilities (May 7, 2026)

**Status: idea** | Priority: medium

**Score: 10** | Cor:2 Cap:1 Eff:3 Lev:2 Con:3 | Blocked: no

WorkTrain's daemon has no composable timeout or retry primitives. `AgentLoop` has a stall timer wired in internally; `PollingScheduler` has no error backoff; `startup-recovery.ts` makes one attempt per session with no retry. The vision says "cancellation/timeouts are first-class" and "overnight-safe" -- but the infrastructure for that is scattered or absent.

The pattern (from `etienne-clone/src/types.ts`):

```typescript
withTimeout<T>(fn: (signal: AbortSignal) => Promise<T>, ms: number, label: string): Promise<Result<T, TimeoutError>>
withRetry<T, E>(fn: () => Promise<Result<T, E>>, config: RetryConfig): Promise<Result<T, E>>
```

`RetryConfig` has `retryOn: (error: unknown) => boolean` -- the caller decides what's retryable, not the primitive. `withTimeout` threads `AbortSignal` into the function so cancellation propagates correctly. Both return `Result` types, never throw.

**Adaptation note:** etienne-clone rolls its own `Result<T,E>`; WorkTrain uses `neverthrow`. The `RetryConfig` shape and `retryOn` predicate are directly portable. The function bodies need to be rewritten against `ResultAsync` from neverthrow rather than copied verbatim.

**Philosophy note:** "Higher-order functions as a tool" -- retry and timeout are cross-cutting behaviors that should be composed around functions, not scattered across call sites. "Cancellation/timeouts are first-class" is a stated coding principle. These primitives make it structurally impossible to call an async boundary without deciding upfront whether it can timeout or retry.

**Done looks like:** `src/infra/async-boundaries.ts` exports `withTimeout()` and `withRetry()` using neverthrow `ResultAsync`. Used at: coordinator `callbackUrl` POST retries, polling error recovery, startup-recovery rehydrate attempts.

---

### `OrchestratorWorkflowAvailability` pattern: make missing-workflow states unrepresentable (May 7, 2026)

**Status: idea** | Priority: medium

**Score: 9** | Cor:2 Cap:1 Eff:3 Lev:1 Con:3 | Blocked: no

WorkTrain's adaptive coordinator checks whether a requested workflow exists at dispatch time, but the check is implicit -- a missing workflow ID returns `workflow_not_found` from the engine at runtime, mid-session, after a session slot has been consumed. There is no compile-time or startup-time guarantee that the coordinator cannot route to a non-existent workflow.

The pattern (from `etienne-clone/src/pipeline/orchestrator-workflow-selection.ts`):

```typescript
type OrchestratorWorkflowAvailability =
  | { kind: 'standard_only' }
  | { kind: 'standard_and_focused' }

assessAvailableOrchestratorWorkflows(
  availableWorkflowIds: readonly string[],
  workflowIds: OrchestratorWorkflowIds,
): Result<OrchestratorWorkflowAvailability, string>
```

The `standard=false` state cannot be constructed -- `assessAvailableOrchestratorWorkflows` returns `err` if the standard workflow is missing. The coordinator only ever holds a valid `OrchestratorWorkflowAvailability` value, so its dispatch logic cannot route to a missing workflow.

**Philosophy note:** "Make illegal states unrepresentable" -- the type system enforces that routing only happens after availability is confirmed. A bare string `workflowId` can point to anything; a typed `WorkflowAvailability` discriminated union can only be constructed when the workflows actually exist. This is the difference between a label and a constraint.

**Done looks like:** WorkTrain's adaptive coordinator calls `assessAvailableWorkflows(ctx, workflowIds)` at startup or pre-dispatch. Returns `Result<WorkflowAvailability, string>`. The dispatch function takes `WorkflowAvailability` as a parameter -- it is structurally impossible to call without first confirming availability.

---

### Per-session cost tracking: estimated spend visible in execution stats and console (May 7, 2026)

**Status: idea** | Priority: medium

**Score: 8** | Cor:1 Cap:2 Eff:3 Lev:1 Con:3 | Blocked: no

WorkTrain records `inputTokens` and `outputTokens` per session in `LlmTurnCompletedEvent` and step-level metrics, but never converts them to an estimated dollar cost. Operators have no way to see how much a session cost, which workflows are expensive, or when a stuck session has burned a disproportionate budget.

The pattern (from `etienne-clone/src/observability/cost.ts`): `estimateCost(modelId, usage, env)` is a pure function that returns `Result<number, CostEstimationError>`. Pricing is overridable via `LLM_INPUT_COST_PER_1M` and `LLM_OUTPUT_COST_PER_1M` env vars (injected for testability). Token counts are already in `LlmTurnCompletedEvent` and the v2 session store.

**Philosophy note:** "Observability as a constraint" -- cost is a first-class observable dimension of a session, not a post-hoc calculation. The env-injection pattern (`env: EnvRecord = process.env`) is already how WorkTrain tests env-dependent code. The function is pure and trivially testable.

**Done looks like:** `src/observability/cost.ts` (port from etienne-clone, adapting to WorkTrain's model IDs). `estimatedCostUsd` added to `execution-stats.jsonl` rows and `SessionCompletedEvent`. Console session detail shows cost. Alert threshold emits `orchestrator_review_warning`-equivalent event when a session exceeds a configured per-workflow cost cap.

**Things to hash out:**
- Should cost alerts trigger escalation (surface to operator) or just log? A stuck session burning $5 should probably escalate.
- Do per-workflow cost caps belong in `TriggerDefinition.agentConfig`, or in a separate `costPolicy` block?

---

### Extensible output contract registration: coordinator-owned schemas, engine-enforced (Apr 30, 2026)

**Status: idea** | Priority: medium

@@ -1954,6 +2425,186 @@ Surface in: `worktrain status`, `worktrain health <sessionId>`, console session

Coordinator design patterns for WorkTrain's autonomous pipeline.

### Reliable synthetic human gates: mimicking operator approval and refusal in autonomous pipelines (May 6, 2026)

**Status: idea** | Priority: high

**Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no

WorkTrain's pipeline has several points where a human operator would naturally approve, reject, or redirect -- confirming an interpretation before coding starts, approving a direction from discovery, accepting a shaped pitch. In guided MCP sessions these gates fire as `requireConfirmation` steps. In fully autonomous daemon sessions, they either don't fire or surface to the operator outbox and wait indefinitely. There is currently no reliable mechanism for the coordinator to make these gate decisions autonomously in a way that is trustworthy enough to substitute for human judgment.

The problem is not just "add an LLM to make the decision." An LLM making approval decisions is subject to the same sycophancy, self-enhancement bias, and overconfidence problems the rest of the pipeline has. A naïve "spawn an agent to approve this" produces rubber-stamping, not genuine gatekeeping. What is needed is a structured, auditable, multi-signal gate that approximates what a careful human reviewer would do -- checking specific criteria, flagging specific concerns, requiring specific evidence before proceeding.

**What a strong synthetic gate needs:**
- Typed criteria against which the artifact is evaluated (not free-form "does this look good?")
- An independent agent that did not produce the artifact being evaluated
- A cross-family challenger where possible (different model family = different correlated blind spots)
- A structured verdict with explicit rationale tied to the criteria, not a confidence score
- An escalation path when the synthetic gate is uncertain -- surface to operator rather than rubber-stamp

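The verdict half of those requirements can be sketched as a typed shape. Everything here is hypothetical design vocabulary, not an existing WorkTrain type; the point is that escalation is a first-class outcome and rationale is required per criterion:

```typescript
// Hypothetical sketch: a synthetic gate verdict tied to typed criteria,
// with escalation as its own variant rather than a low-confidence approval.
interface CriterionCheck {
  readonly criterion: string;
  readonly passed: boolean;
  readonly rationale: string; // required: the audit trail for headless gates
}

type GateVerdict =
  | { kind: 'approve'; checks: readonly CriterionCheck[] }
  | { kind: 'reject'; checks: readonly CriterionCheck[] }
  | { kind: 'escalate_to_operator'; reason: string };

// Deterministic aggregation over the gate agent's per-criterion answers:
// any failed check rejects; incomplete criteria coverage escalates instead
// of rubber-stamping.
function aggregate(checks: readonly CriterionCheck[], expectedCount: number): GateVerdict {
  if (checks.length < expectedCount) {
    return { kind: 'escalate_to_operator', reason: 'incomplete criteria coverage' };
  }
  return checks.every((c) => c.passed) ? { kind: 'approve', checks } : { kind: 'reject', checks };
}
```

The LLM stays confined to producing the per-criterion checks; the approve/reject/escalate decision itself is pure TypeScript.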
**Use cases that need this:**
- Interpretation checkpoint: does the coded assumption set actually cover the architectural risks for this ticket?
- Shaping approval: does the pitch have genuine acceptance criteria or are they vague enough to accept anything?
- Discovery direction: is the selected direction actually distinct from the runner-up, or are they the same approach with different labels?
- Review verdict: is this finding severe enough to block merge, or is it a style preference being inflated?

**Things to hash out:**
- What is the right abstraction? A reusable `synthetic-gate` routine that takes typed criteria + artifact and returns a structured verdict? Or specialized gates per use case?
- How do you prevent the synthetic gate from being gamed by the same agent that produced the artifact? The gate agent must not have access to the producing agent's reasoning, only its output.
- What is the confidence threshold below which the synthetic gate escalates to a human rather than deciding? And how is that threshold configured per trigger?
- How do you validate that a synthetic gate is actually performing the function of a human gate -- not just producing confident verdicts? Requires a calibration dataset of known-correct and known-incorrect artifacts with human ground truth.
- Relationship to the `requireConfirmation` gate mechanism: the synthetic gate is the autonomous equivalent. It should produce the same typed routing signal the human confirmation gate produces, so the coordinator routing logic doesn't need to know which kind of gate fired.

---

### Multi-score deterministic workflow routing: replace string-matching coordinator dispatch with typed scoring (May 7, 2026)

**Status: idea** | Priority: high

**Score: 12** | Cor:2 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no

WorkTrain's adaptive coordinator selects which pipeline to run (quick_review, review_only, implement, full) based on task content. The current dispatch uses heuristics and LLM-assisted classification. This violates vision principle #1: zero LLM turns for routing. Coordinator decisions must be deterministic TypeScript code, not LLM reasoning.

The pattern: compute independent typed scores (size, complexity, risk, breadth) over the task's structured metadata, classify the task into a typed shape discriminated union (`isolated_fix`, `small_cohesive_behavior`, `broad_or_risky`, etc.), find hard blockers (conditions that force a specific pipeline regardless of scores), and return a typed `WorkflowAssessment` with score breakdown, shape, blockers, and a human-readable reason string. No LLM call in the routing path.

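A compressed sketch of that score-classify-block-assess chain, adapted to WorkTrain vocabulary. The dimensions, thresholds, and the `security`-label blocker are illustrative assumptions, not settled design:

```typescript
// Sketch: pure, LLM-free routing. Scores and blocker rules are placeholders.
type TaskShape = 'isolated_fix' | 'small_cohesive_behavior' | 'broad_or_risky';
type PipelineMode = 'quick_review' | 'review_only' | 'implement' | 'full';

interface TaskScores {
  readonly size: number;    // 0..3
  readonly risk: number;    // 0..3
  readonly breadth: number; // 0..3
}

interface WorkflowAssessment {
  readonly mode: PipelineMode;
  readonly shape: TaskShape;
  readonly blockers: readonly string[];
  readonly reason: string; // human-readable, traceable without transcripts
}

function classifyShape(s: TaskScores): TaskShape {
  if (s.risk >= 2 || s.breadth >= 2) return 'broad_or_risky';
  return s.size <= 1 ? 'isolated_fix' : 'small_cohesive_behavior';
}

function assessWorkflowRoute(s: TaskScores, labels: readonly string[]): WorkflowAssessment {
  // Hard blockers force a pipeline regardless of scores.
  const blockers = labels.includes('security') ? ['security label forces full pipeline'] : [];
  const shape = classifyShape(s);
  const mode: PipelineMode =
    blockers.length > 0 || shape === 'broad_or_risky' ? 'full'
    : shape === 'isolated_fix' ? 'implement'
    : 'review_only';
  return { mode, shape, blockers, reason: `shape=${shape}, blockers=${blockers.length}` };
}
```

Because the function is pure, a test can pin the routing decision for any task shape without mocking an LLM -- which is exactly the "done looks like" condition below.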
Pattern source: `etienne-clone/src/pipeline/orchestrator-workflow-selection.ts` -- `classifyMrShape()`, `assessWorkflowRoute()`, `chooseOrchestratorWorkflow()`. Also: `OrchestratorWorkflowAvailability` discriminated union ensures the "standard workflow missing" state cannot be constructed -- `assessAvailableOrchestratorWorkflows()` returns `Result<OrchestratorWorkflowAvailability, string>` so callers only hold valid states.

**Philosophy note:** This is the vision's "control flow from data state" principle made concrete: routing decisions derive from an explicit typed state machine over task scores, not from an LLM's implicit reasoning. Exhaustiveness on the shape union (`assertNever` in the confidence function) makes the routing logic refactor-safe. The `WorkflowAssessment` return type (not a bare string) makes every decision traceable without reading session transcripts.

**Done looks like:** The adaptive coordinator dispatches entirely from a `WorkflowAssessment` value produced by a pure TypeScript function over the task's metadata. No LLM call occurs before `runAdaptivePipeline()` selects a mode. A test can assert the routing decision for any task shape without mocking an LLM.

**Things to hash out:**
- What dimensions make sense for WorkTrain tasks? MR review uses size/cohesion/risk/breadth over file diffs. WorkTrain tasks may need different dimensions (specificity, ambiguity, scope breadth, ticket maturity).
- Where does the input data come from? The task candidate (from queue poll) has title, body, labels, issue number. Is that enough to score confidently without an LLM?
- Should hard blockers be configurable per-trigger (e.g. "tasks labeled `security` always use the full pipeline") or hardcoded in the assessment function?

---

### Required `reason` field on coordinator signals: typed audit trail for headless gate decisions (May 7, 2026)

**Status: idea** | Priority: high

**Score: 11** | Cor:2 Cap:2 Eff:3 Lev:2 Con:3 | Blocked: no

When an agent calls `signal_coordinator` with `kind: 'approval_needed'` or `kind: 'blocked'`, the coordinator receives a signal but not necessarily the agent's reasoning. The coordinator (and operator) must read the full session transcript to understand why the agent requested approval. At scale -- dozens of sessions per day -- transcript reading is impractical. Signals without stated reasoning are unauditable.

Vision principle: "every decision visible in the session store." A signal that doesn't include the agent's stated reasoning violates this -- the decision (to pause and request approval) is visible, but the reason for it is buried in prose.

The fix is structural: make `reason` a required field on `approval_needed` and `blocked` signal kinds. The engine or coordinator rejects signals that omit it. This is the same pattern as `SelfConfirmEvent.validationReason` in `etienne-clone/src/types.ts`, where the comment reads: "`validationReason` is REQUIRED -- it's the audit trail for headless gates."

Pattern source: `etienne-clone/src/types.ts` `SelfConfirmEvent` -- `readonly validationReason: string` required (not optional) on the self-confirm event interface.

**Philosophy note:** "Errors are data" applies to under-specified signals too. A signal with no stated reason is incomplete data -- the receiver cannot act on it deterministically. Making `reason` required at the schema level turns a runtime ambiguity into a compile-time constraint. Capability-based: the signal type declares what information it carries, enforced by the schema, not by convention.

**Done looks like:** `CoordinatorSignalKindSchema` for `approval_needed` and `blocked` includes `reason: z.string().min(10)`. The `signal_emitted` daemon event includes the reason. `worktrain inbox` shows it. Coordinators can filter and route based on `reason` text without reading session transcripts.

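A dependency-free sketch of the constraint -- in the real schema layer this would be zod's `reason: z.string().min(10)`; the signal shapes below are illustrative, not the current `signal_coordinator` contract:

```typescript
// Sketch: reason is required (not optional) on the gate-relevant signal
// kinds, and length-checked at the boundary. Plain TS in place of zod.
type CoordinatorSignal =
  | { kind: 'approval_needed'; reason: string } // reason required, not optional
  | { kind: 'blocked'; reason: string }
  | { kind: 'progress' };

type ValidationResult = { ok: true } | { ok: false; error: string };

function validateSignal(signal: CoordinatorSignal): ValidationResult {
  if (signal.kind === 'approval_needed' || signal.kind === 'blocked') {
    if (signal.reason.trim().length < 10) {
      return { ok: false, error: `${signal.kind} signal requires a reason of at least 10 chars` };
    }
  }
  return { ok: true };
}
```

The type makes omission unrepresentable at compile time; the runtime check catches the degenerate "reason: 'n/a'" case at the schema boundary.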
---

### Typed finding extraction with multi-strategy fallback: enforce structured output contracts at coordinator boundaries (May 7, 2026)

**Status: idea** | Priority: high

**Score: 12** | Cor:3 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

WorkTrain's coordinators read structured phase handoff artifacts (review verdicts, discovery summaries, shaped pitches) from session output. Today, if an agent doesn't produce a clean JSON handoff, the coordinator either fails hard or reads free-text that it can't reliably parse. There is no graceful degradation ladder for malformed or missing structured output.

Vision principle: "structured outputs at every boundary" and "typed contracts make phases composable." This requires not just that the engine validates artifacts when they're present, but that the coordinator has a systematic recovery strategy when they aren't -- without silently accepting garbage.

The pattern (from `etienne-clone/src/findings/extract.ts`):
1. Try the ideal path: find the typed artifact in the last `complete_step` output
2. Fallback: parse a JSON block from the agent's final text response, normalize field names and casing
3. Fallback: reconstruct from observable side effects (e.g. tool calls that imply what the agent concluded)
4. All paths return `Result<T, ExtractionError>` with a typed error union (`no_handoff_output`, `invalid_json`, `schema_mismatch`) -- never throw, never silently accept

The normalization layer (`normalizeRawFindingsOutput`) handles the practical reality that agents produce `"APPROVE"` when the schema says `"approve"`, or `reviewFindings` when the field should be `findings`. This is boundary validation done right.

**Philosophy note:** "Validate at boundaries, trust inside" -- this is exactly the boundary. The coordinator trusts the extracted artifact once it passes validation; it never trusts raw agent output. "Errors are data" -- `ExtractionError` is a typed discriminated union that lets coordinators route on failure kind, not parse error messages.

**Done looks like:** Every coordinator phase that reads a structured handoff uses an `extractHandoff<T>()` function that applies the three-strategy fallback chain and returns `Result<T, HandoffExtractionError>`. The coordinator handles `err` cases explicitly -- escalate to operator, retry the phase, or degrade gracefully -- rather than crashing or silently accepting bad output.

**Things to hash out:**
- Strategy 3 (reconstruct from tool calls) is highly session-type-specific. Should each coordinator define its own reconstruction strategy, or is there a generic fallback (e.g. "use the last `complete_step` notes as free text")?
- Should extraction errors produce a `report_issue` record automatically, or is that the coordinator's responsibility?

---

### `InteractionIntent` discriminated union: platform-neutral operator decision model (May 7, 2026)

**Status: idea** | Priority: high

**Score: 12** | Cor:2 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no

WorkTrain has no typed model for operator decisions on in-flight pipeline outcomes -- approving a PR, unblocking a stuck session, dropping an escalated finding, confirming an interpretation. Today, operator actions arrive as CLI commands (`worktrain tell`) or console HTTP calls, but the domain has no typed representation of what the operator actually decided. The coordinator cannot route on decision kind without parsing free text.

The pattern (from `etienne-clone/src/messaging/types.ts`): `InteractionIntent` is an exhaustive discriminated union of everything an operator can decide -- `approve_all`, `skip`, `post_finding`, `drop_finding`, `edit_finding`, `batch_post`, `batch_drop`. Platform adapters (Slack buttons, console HTTP, CLI) translate operator input into these intents. The domain handler switches on them exhaustively with `assertNever`. Neither layer knows about the other.

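Translated to WorkTrain vocabulary, a minimal sketch of the union and its exhaustive handler -- the intent kinds follow this entry's own candidate list, and the handler's return strings are purely illustrative:

```typescript
// Illustrative sketch: a WorkTrain-flavored OperatorIntent union with an
// exhaustive, channel-agnostic handler. Not an existing WorkTrain type.
type OperatorIntent =
  | { kind: 'approve_pr'; sessionId: string }
  | { kind: 'block_pr'; sessionId: string; reason: string }
  | { kind: 'unblock_session'; sessionId: string }
  | { kind: 'confirm_interpretation'; sessionId: string };

function assertNever(x: never): never {
  throw new Error(`unhandled intent: ${JSON.stringify(x)}`);
}

// Console, CLI, or a future chat adapter all produce the same typed intents;
// this handler never parses free text, and a new intent kind fails to compile
// until every switch covers it.
function handleIntent(intent: OperatorIntent): string {
  switch (intent.kind) {
    case 'approve_pr':
      return `merge queued for ${intent.sessionId}`;
    case 'block_pr':
      return `merge blocked for ${intent.sessionId}: ${intent.reason}`;
    case 'unblock_session':
      return `resume signal sent to ${intent.sessionId}`;
    case 'confirm_interpretation':
      return `interpretation confirmed for ${intent.sessionId}`;
    default:
      return assertNever(intent);
  }
}
```

A test can exercise every branch by constructing intent values directly, with no UI in the loop.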
This directly addresses WorkTrain's open gap around human-in-the-loop for critical findings and stuck session escalation: the console or CLI emits a typed `OperatorIntent`, the coordinator handles it the same way regardless of channel.
|
|
2542
|
+
|
|
2543
|
+
Pattern source: `etienne-clone/src/messaging/types.ts` `InteractionIntent` + `etienne-clone/src/messaging/review-handler.ts` `createReviewInteractionHandler()`.
|
|
2544
|
+
|
|
2545
|
+
**Philosophy note:** "Make illegal states unrepresentable" -- a bare string `action=approve` can carry anything; a typed `{ kind: 'approve_all'; sessionId }` cannot. "Exhaustiveness everywhere" -- the switch on `intent.kind` with `assertNever` ensures adding a new decision kind forces every handler to be updated. "Capability-based" -- the operator is given exactly the decisions they can make, enforced by the type, not by convention.
|
|
2546
|
+
|
|
2547
|
+
**Done looks like:** `OperatorIntent` discriminated union covers the decisions WorkTrain surfaces to operators: `approve_pr`, `block_pr`, `unblock_session`, `drop_escalation`, `confirm_interpretation`. Console and CLI translate input into `OperatorIntent` values. The coordinator handler switches on them. A test can assert coordinator behavior for any intent without mocking a UI.
|
|
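The union and its exhaustive handler can be sketched as follows. This is a sketch, not shipped code: the intent kinds are the ones proposed in this entry, and the payload fields (`sessionId`, `reason`, `escalationId`) are assumptions.

```typescript
// Sketch of the proposed OperatorIntent union. Payload fields are
// illustrative assumptions, not an existing WorkTrain type.
type SessionId = string;

type OperatorIntent =
  | { kind: 'approve_pr'; sessionId: SessionId }
  | { kind: 'block_pr'; sessionId: SessionId; reason: string }
  | { kind: 'unblock_session'; sessionId: SessionId }
  | { kind: 'drop_escalation'; sessionId: SessionId; escalationId: string }
  | { kind: 'confirm_interpretation'; sessionId: SessionId };

// Exhaustiveness guard: adding a new kind makes every switch that lacks a
// case for it fail to compile, because the default branch no longer
// receives `never`.
function assertNever(x: never): never {
  throw new Error(`Unhandled intent: ${JSON.stringify(x)}`);
}

function describeIntent(intent: OperatorIntent): string {
  switch (intent.kind) {
    case 'approve_pr':
      return `approve PR for ${intent.sessionId}`;
    case 'block_pr':
      return `block PR for ${intent.sessionId}: ${intent.reason}`;
    case 'unblock_session':
      return `unblock ${intent.sessionId}`;
    case 'drop_escalation':
      return `drop escalation ${intent.escalationId}`;
    case 'confirm_interpretation':
      return `confirm interpretation for ${intent.sessionId}`;
    default:
      return assertNever(intent);
  }
}

console.log(describeIntent({ kind: 'unblock_session', sessionId: 's-42' }));
// unblock s-42
```

Because the coordinator switches on `intent.kind`, a test can drive it with a plain object literal; no UI mocking is needed.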
2548
|
+
|
|
2549
|
+
**Things to hash out:**
|
|
2550
|
+
- Which decisions should be in scope for v1? Not all decisions need to be interactive -- many can be autonomous. The union should cover only the cases where human judgment is genuinely required.
|
|
2551
|
+
- Should `OperatorIntent` carry a `reason?: string` (optional for human input, vs required for synthetic gates)? Or separate types for human vs synthetic decisions?
|
|
2552
|
+
|
|
2553
|
+
---
|
|
2554
|
+
|
|
2555
|
+
### `PendingDecision` store with pure state transitions: immutable approval state for operator-gated pipeline actions (May 7, 2026)
|
|
2556
|
+
|
|
2557
|
+
**Status: idea** | Priority: high
|
|
2558
|
+
|
|
2559
|
+
**Score: 10** | Cor:2 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: yes (needs InteractionIntent above)
|
|
2560
|
+
|
|
2561
|
+
When a WorkTrain pipeline reaches a point requiring operator approval -- a critical finding before merge, an interpretation confirmation before coding, a PR before auto-merge -- it currently either blocks indefinitely (waiting for an outbox message) or skips the gate. There is no structured in-memory state tracking what's pending, what the operator decided, and what the disposition of each item is.
|
|
2562
|
+
|
|
2563
|
+
The pattern (from `etienne-clone/src/messaging/pending-reviews.ts`): `PendingReview` is a fully immutable snapshot with `approvedFindings: ReadonlySet<FindingId>`, `droppedFindings: ReadonlySet<FindingId>`, `editedBodies: ReadonlyMap<FindingId, string>`, `status: 'pending' | 'posted' | 'skipped'`. All state transitions are pure functions `(state: PendingDecision) => PendingDecision` composed with `store.update(id, fn)` -- `approveFinding`, `dropFinding`, `editFinding`, `approveAllBySeverity`. The store is a thin `Map` wrapper with `add`, `get`, `update(id, fn)`, `cleanup(olderThanMs)`.
|
|
2564
|
+
|
|
2565
|
+
Pattern source: `etienne-clone/src/messaging/pending-reviews.ts`.
|
|
2566
|
+
|
|
2567
|
+
**Philosophy note:** "Derive state, don't accumulate it" -- each transition is a pure function over the current state, not an imperative mutation. `ReadonlySet` and `ReadonlyMap` enforce immutability at the type level. "Single source of state truth" -- the `PendingDecision` store is the one place that tracks what an operator has decided; no parallel flags or session-level booleans.
|
|
2568
|
+
|
|
2569
|
+
**Done looks like:** `PendingDecisionStore` in `src/coordinator/pending-decisions.ts`. Pipeline sessions that reach an operator gate call `pendingDecisions.add(decision)`. The coordinator polls or subscribes and calls `pendingDecisions.update(id, applyIntent(intent))` when the operator acts. `cleanup(24 * 60 * 60 * 1000)` runs at startup to expire stale pending decisions.
|
|
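A minimal sketch of the store and one pure transition, assuming the field names described above; none of this exists in WorkTrain yet, and the snapshot shape is illustrative.

```typescript
// Sketch of the proposed PendingDecisionStore. Field names follow the
// pattern description above; the shape is an assumption.
type FindingId = string;

interface PendingDecision {
  readonly id: string;
  readonly approvedFindings: ReadonlySet<FindingId>;
  readonly droppedFindings: ReadonlySet<FindingId>;
  readonly status: 'pending' | 'posted' | 'skipped';
  readonly createdAt: number;
}

// Pure transition: returns a new snapshot, never mutates the old one.
const approveFinding =
  (findingId: FindingId) =>
  (state: PendingDecision): PendingDecision => ({
    ...state,
    approvedFindings: new Set([...state.approvedFindings, findingId]),
  });

class PendingDecisionStore {
  private readonly decisions = new Map<string, PendingDecision>();

  add(decision: PendingDecision): void {
    this.decisions.set(decision.id, decision);
  }

  get(id: string): PendingDecision | undefined {
    return this.decisions.get(id);
  }

  update(id: string, fn: (state: PendingDecision) => PendingDecision): void {
    const current = this.decisions.get(id);
    if (current) this.decisions.set(id, fn(current));
  }

  // Drop decisions older than the given age (e.g. 24h, run at startup).
  cleanup(olderThanMs: number, now = Date.now()): void {
    for (const [id, decision] of this.decisions) {
      if (now - decision.createdAt > olderThanMs) this.decisions.delete(id);
    }
  }
}

const store = new PendingDecisionStore();
store.add({
  id: 'pr-12',
  approvedFindings: new Set(),
  droppedFindings: new Set(),
  status: 'pending',
  createdAt: Date.now(),
});
store.update('pr-12', approveFinding('f-1'));
console.log(store.get('pr-12')?.approvedFindings.has('f-1')); // true
```

The `ReadonlySet` fields mean a handler cannot mutate a snapshot in place; the only write path is `update(id, fn)` with a pure function.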
2570
|
+
|
|
2571
|
+
---
|
|
2572
|
+
|
|
2573
|
+
### `BriefingDelivery` + `InteractionPort` ports: hexagonal adapter contract for operator notification channels (May 7, 2026)
|
|
2574
|
+
|
|
2575
|
+
**Status: idea** | Priority: medium
|
|
2576
|
+
|
|
2577
|
+
**Score: 10** | Cor:1 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: yes (needs InteractionIntent above)
|
|
2578
|
+
|
|
2579
|
+
WorkTrain's notification path (when it ships) will need to deliver pipeline results to operators and receive their decisions back. If the notification logic is coupled to a specific channel (Slack, console, CLI), adding a second channel requires duplicating the entire delivery + interaction wiring. There is no current abstraction that separates "what to deliver" from "how to deliver it."
|
|
2580
|
+
|
|
2581
|
+
The pattern (from `etienne-clone/src/messaging/port.ts`): two tiny focused interfaces -- `BriefingDelivery` (domain → adapter: `deliverBriefing`, `updateStatus`, `markCompleted`, `notifySkipped`) and `InteractionPort` (adapter → domain: `start`, `stop`, `onInteraction`). The adapter owns all layout decisions (threads, message IDs, button rendering). The domain never sees channel-specific types. Any channel -- Slack, console, webhook, CLI -- implements the same two interfaces.
|
|
2582
|
+
|
|
2583
|
+
Pattern source: `etienne-clone/src/messaging/port.ts`.
|
|
2584
|
+
|
|
2585
|
+
**Philosophy note:** "Keep interfaces small and focused" -- two interfaces with four methods each, no leakage of platform details into the domain. "Dependency injection for boundaries" -- the coordinator receives `BriefingDelivery` and `InteractionPort` as injected dependencies; swapping Slack for the console is a constructor argument change. "Capability-based architecture" -- the coordinator can only do what the `BriefingDelivery` interface exposes, not arbitrary channel operations.
|
|
2586
|
+
|
|
2587
|
+
**Done looks like:** `PipelineDelivery` and `OperatorInteractionPort` interfaces in `src/coordinator/ports.ts`. Console HTTP adapter and CLI adapter each implement both. The coordinator takes them as constructor injections. Adding a Slack adapter requires only implementing the two interfaces, no coordinator changes.
|
|
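The two ports and their injection into the coordinator could look like the sketch below. Method names mirror the pattern description; the `PipelineBriefing` payload and the `MemoryDelivery` adapter are placeholders for illustration.

```typescript
// Sketch of the proposed ports. Payload shapes are assumptions; only the
// method names come from the pattern description above.
interface PipelineBriefing {
  sessionId: string;
  summary: string;
}

type OperatorIntent = { kind: string; sessionId: string };

// Domain -> adapter: how pipeline results reach the operator.
interface BriefingDelivery {
  deliverBriefing(briefing: PipelineBriefing): Promise<void>;
  updateStatus(sessionId: string, status: string): Promise<void>;
  markCompleted(sessionId: string): Promise<void>;
  notifySkipped(sessionId: string, reason: string): Promise<void>;
}

// Adapter -> domain: how operator decisions come back.
interface InteractionPort {
  start(): Promise<void>;
  stop(): Promise<void>;
  onInteraction(handler: (intent: OperatorIntent) => void): void;
}

// The coordinator only sees the ports; swapping Slack for the console is a
// constructor-argument change, never a coordinator change.
class Coordinator {
  constructor(
    private readonly delivery: BriefingDelivery,
    private readonly interactions: InteractionPort,
  ) {}

  async start(): Promise<void> {
    await this.interactions.start();
  }

  async notify(briefing: PipelineBriefing): Promise<void> {
    await this.delivery.deliverBriefing(briefing);
  }
}

// Hypothetical in-memory adapter, useful for tests and the console path.
class MemoryDelivery implements BriefingDelivery {
  delivered: string[] = [];
  async deliverBriefing(b: PipelineBriefing) { this.delivered.push(b.sessionId); }
  async updateStatus() {}
  async markCompleted() {}
  async notifySkipped() {}
}
```

A channel adapter that needs fewer arguments than the interface declares (as `MemoryDelivery` does) is still assignable, which keeps trivial adapters short.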
2588
|
+
|
|
2589
|
+
---
|
|
2590
|
+
|
|
2591
|
+
### `CorrectionEvent` with typed `correctionType`: structured learning from operator edits to pipeline outputs (May 7, 2026)
|
|
2592
|
+
|
|
2593
|
+
**Status: idea** | Priority: medium
|
|
2594
|
+
|
|
2595
|
+
**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: yes (needs PendingDecision store above)
|
|
2596
|
+
|
|
2597
|
+
When an operator edits a WorkTrain output -- rewrites a PR description, changes a finding severity, adjusts a summary -- that edit is currently invisible to the system. The session event log records that the pipeline ran; it does not record what the operator thought was wrong with the output. This is the core gap in the per-run retrospective backlog item: there is no structured way to capture what the operator corrected.
|
|
2598
|
+
|
|
2599
|
+
The pattern (from `etienne-clone/src/messaging/review-handler.ts` `emitCorrection()`): when an operator edits a finding, `CorrectionEvent` is emitted with a typed `correctionType: { textChanged, severityChanged, locationChanged, principleChanged, dropped }`. The event also carries `{ original, edited, context }`. Repeated correction patterns across sessions become training signal for improving prompts and workflows.
|
|
2600
|
+
|
|
2601
|
+
Pattern source: `etienne-clone/src/messaging/review-handler.ts` `emitCorrection()` + `etienne-clone/src/types.ts` `CorrectionEvent`.
|
|
2602
|
+
|
|
2603
|
+
**Philosophy note:** "Observability as a constraint" -- operator corrections are first-class events in the session store, not manual notes. "Errors are data" -- a correction is structured data about what the system got wrong, not a free-text annotation. The typed `correctionType` makes the correction machine-readable without LLM parsing.
|
|
2604
|
+
|
|
2605
|
+
**Done looks like:** `PipelineCorrectionEvent` added to `DaemonEvent` union. When an operator edits a WorkTrain output via the `PendingDecision` flow, the correction is classified and emitted. `worktrain logs` surfaces corrections. A future analytics pass can aggregate correction patterns per workflow and per step to identify systematic output quality gaps.
|
|
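A sketch of the event and a snapshot-diffing classifier, assuming a reduced set of correction flags; the snapshot shape and classifier are illustrative, not the cited `CorrectionEvent` type.

```typescript
// Sketch of the proposed PipelineCorrectionEvent. Flag names follow the
// correctionType pattern cited above; the snapshot shape is an assumption.
interface CorrectionType {
  textChanged: boolean;
  severityChanged: boolean;
  dropped: boolean;
}

interface FindingSnapshot {
  body: string;
  severity: 'low' | 'medium' | 'high';
}

interface PipelineCorrectionEvent {
  kind: 'pipeline_correction';
  sessionId: string;
  correctionType: CorrectionType;
  original: FindingSnapshot;
  edited: FindingSnapshot | null; // null when the finding was dropped
}

// Derive the typed flags by diffing snapshots -- machine-readable without
// any LLM parsing of the operator's edit.
function classifyCorrection(
  sessionId: string,
  original: FindingSnapshot,
  edited: FindingSnapshot | null,
): PipelineCorrectionEvent {
  return {
    kind: 'pipeline_correction',
    sessionId,
    correctionType: {
      textChanged: edited !== null && edited.body !== original.body,
      severityChanged: edited !== null && edited.severity !== original.severity,
      dropped: edited === null,
    },
    original,
    edited,
  };
}

const event = classifyCorrection(
  's-7',
  { body: 'possible null deref', severity: 'high' },
  { body: 'possible null deref', severity: 'medium' },
);
console.log(event.correctionType);
// { textChanged: false, severityChanged: true, dropped: false }
```

Because the flags are derived rather than reported, an analytics pass can aggregate them per workflow and step with a plain group-by.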
2606
|
+
|
|
2607
|
+
---
|
|
1957
2608
|
|
|
1958
2609
|
### Agents must not perform delivery actions -- only the coordinator's delivery layer can (Apr 30, 2026)
|
|
1959
2610
|
|
|
@@ -2464,6 +3115,67 @@ Ghost nodes represent steps that were compiled into the DAG but skipped at runti
|
|
|
2464
3115
|
|
|
2465
3116
|
## Workflow Library
|
|
2466
3117
|
|
|
3118
|
+
### Pre-specialized expert agents: on-demand consultants for main agents (May 7, 2026)
|
|
3119
|
+
|
|
3120
|
+
**Status: idea** | Priority: high
|
|
3121
|
+
|
|
3122
|
+
**Score: 13** | Cor:2 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
|
|
3123
|
+
|
|
3124
|
+
The main agent running a coding, review, or investigation workflow is not the expert. It is the orchestrator. When it needs specialized input -- "is this Kotlin idiomatic?", "does this violate any payments module invariants?", "what are the FP patterns this codebase uses for this?" -- it should be able to ask a pre-specialized consultant agent and get a bounded, expert answer back.
|
|
3125
|
+
|
|
3126
|
+
These expert agents are not running the main workflow. They do not own any phase or make any final decisions. They are consulted: spawned with a specific question, pre-loaded with dense expertise in a specific domain, and they return a bounded answer. The main agent synthesizes the input and retains full ownership.
|
|
3127
|
+
|
|
3128
|
+
**Examples:**
|
|
3129
|
+
- A Kotlin idioms expert pre-loaded with Kotlin best practices, common pitfalls, and idiomatic patterns -- queried when the coding or review agent wants to know "is this idiomatic Kotlin?"
|
|
3130
|
+
- A functional programming expert pre-loaded with the FP philosophy and patterns relevant to this codebase (from CLAUDE.md, design docs, etc.) -- queried when the agent is making decisions that touch FP style
|
|
3131
|
+
- A payments module expert pre-loaded with the payments execution paths, known invariants, and past design decisions -- queried when the task touches payments code
|
|
3132
|
+
- A security expert pre-loaded with the codebase's auth model, known vulnerabilities, and security invariants -- queried during review of auth-adjacent changes
|
|
3133
|
+
|
|
3134
|
+
**Two distinct usage patterns -- both valid:**
|
|
3135
|
+
|
|
3136
|
+
*Consultant mode:* The main agent mid-task asks a specific question ("is this Kotlin idiomatic?"), a pre-specialized agent is spawned with that question and its expertise briefing, it returns a bounded answer, the main agent synthesizes and moves on. Lightweight, on-demand, the main agent drives the interaction.
|
|
3137
|
+
|
|
3138
|
+
*Parallel specialist mode:* The coordinator spawns multiple pre-specialized agents simultaneously for a phase of work -- e.g. an MR review that launches a Kotlin expert, a payments module expert, and an FP patterns expert in parallel, each reviewing the same diff through their lens. The main agent or coordinator synthesizes. This is the 3-angle executor pattern from wr.discovery applied to expertise curation rather than framing angles. Each specialist contributes their perspective; no single agent has to cover everything.
|
|
3139
|
+
|
|
3140
|
+
The parallel specialist mode is conceptually similar to the existing reviewer families in wr.mr-review, but with expertise injection replacing role prompts. "You are a correctness reviewer" and "you are an agent briefed on this codebase's actual invariants, the past bugs in this module, and the specific patterns we use here" are very different levels of specificity.
|
|
3141
|
+
|
|
3142
|
+
**What makes expert consultants distinct from existing reviewer families (MR review):**
|
|
3143
|
+
Existing reviewer families are top-level sessions running the full review workflow independently. Expert consultants (in consultant mode) are lightweight bounded spawns -- more like calling a function than running a parallel pipeline. In parallel specialist mode they are closer to reviewer families, but curated for the specific task rather than generically role-assigned.
|
|
3144
|
+
|
|
3145
|
+
**What makes this distinct from existing context injection:**
|
|
3146
|
+
Existing context injection (living work context, assembledContextSummary) threads pipeline state between phases -- history of what happened. Expert consultants carry curated domain expertise -- best practices, idioms, invariants, patterns. The content type is different: not "what was done" but "what is true about this domain."
|
|
3147
|
+
|
|
3148
|
+
**Implementation shape -- specialized workflows, not just context injection:**
|
|
3149
|
+
|
|
3150
|
+
The most powerful form of a specialist is not an agent that receives a big expertise briefing at spawn time and then works freely. It is an agent running a purpose-built specialized workflow that contains both the expertise and the process for applying it systematically.
|
|
3151
|
+
|
|
3152
|
+
A `wr.kotlin-review` workflow contains: the Kotlin expertise in `metaGuidance` and `references`, and a structured procedure -- "step 1: check null safety patterns at these call sites; step 2: evaluate coroutine usage against these criteria; step 3: check data class conventions..." Breaking the domain into steps ensures the specialist covers everything the domain requires, in the right order, with the right depth. A pure context dump leaves coverage to chance; a workflow enforces it.
|
|
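A hypothetical shape for such a specialist workflow, sketched as a TypeScript literal: the `metaGuidance` and `references` field names come from this entry, but the step schema and all content strings are invented for illustration and are not WorkRail's actual workflow format.

```typescript
// Hypothetical specialist workflow shape -- NOT WorkRail's real schema.
// Only metaGuidance/references as field names come from the entry above.
interface SpecialistStep {
  id: string;
  instruction: string;
}

interface SpecialistWorkflow {
  id: string;
  metaGuidance: string[]; // dense domain expertise, available at every step
  references: string[];   // sources the briefing was curated from
  steps: SpecialistStep[]; // the procedure that enforces coverage
}

const kotlinReview: SpecialistWorkflow = {
  id: 'wr.kotlin-review',
  metaGuidance: [
    'Prefer val over var; flag mutable state without justification.',
    'Coroutine scopes must be structured; flag unscoped launches.',
  ],
  references: ['docs/kotlin-style.md'],
  steps: [
    { id: 'null-safety', instruction: 'Check null safety patterns at each changed call site.' },
    { id: 'coroutines', instruction: 'Evaluate coroutine usage against the structured-concurrency criteria.' },
    { id: 'data-classes', instruction: 'Check data class conventions in changed types.' },
  ],
};

console.log(kotlinReview.steps.map((s) => s.id).join(' -> '));
// null-safety -> coroutines -> data-classes
```

The point of the step array is exactly the coverage argument above: the session store records which step IDs ran, so an auditor can verify the specialist covered every required dimension.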
3153
|
+
|
|
3154
|
+
This also makes specialists auditable: you can see in the session store exactly which steps the specialist ran, what it found, and whether it covered all required dimensions. And specialized workflows improve over time via `wr.workflow-for-workflows`, compounding quality the same way all bundled workflows do.
|
|
3155
|
+
|
|
3156
|
+
For dynamic specialists (payments module expert, specific subsystem expert), the workflow defines the process for generating the briefing dynamically -- walk these execution paths, read these design docs, extract these invariants -- rather than containing a static briefing.
|
|
3157
|
+
|
|
3158
|
+
**What needs to be built:**
|
|
3159
|
+
- A catalog of specialized workflows: static domain specialists (wr.kotlin-review, wr.fp-patterns-review) and dynamic module specialists (wr.module-expert with a briefing-generation phase)
|
|
3160
|
+
- A matching mechanism: given the task's affected files and domains, which specialist workflows are relevant?
|
|
3161
|
+
- A consultation protocol: how does the main agent query a specialist? How does the specialist return a typed artifact the main agent can act on?
|
|
3162
|
+
- Dynamic briefing generation: for module-specific specialists, a workflow phase that walks affected execution paths and generates the curated briefing before the expert work begins
|
|
3163
|
+
|
|
3164
|
+
**Relationship to existing entries:**
|
|
3165
|
+
- "Knowledge graph": the long-term structural ground truth version of this. Expert briefings are the lower-cost precursor that doesn't require the full graph.
|
|
3166
|
+
- "Assumption store": verified codebase facts are one input to the module expert briefing.
|
|
3167
|
+
- "Coordinator mid-session hooks": expert consultation could be triggered mid-session by the coordinator when specific signals fire (e.g. agent touches a known-tricky module).
|
|
3168
|
+
|
|
3169
|
+
**Things to hash out:**
|
|
3170
|
+
- What is the right format for an expertise briefing? Prose vs structured facts vs a combination?
|
|
3171
|
+
- How are static briefings maintained? They go stale as language versions change and codebases evolve.
|
|
3172
|
+
- How are dynamic briefings generated? Static analysis? LLM-assisted code walk? What is the cost and freshness guarantee?
|
|
3173
|
+
- How does the main agent know which experts are available and when to consult them? Explicit workflow step, or opportunistic mid-task consultation?
|
|
3174
|
+
- Token budget: expert consultation adds turns and tokens. When is the cost worth it vs. the main agent just proceeding with its own judgment?
|
|
3175
|
+
- How does the consultation differ from just giving the main agent a bigger context window? The answer should be "specificity and freshness" -- a consultant briefed on this specific module is better than a general agent with everything injected.
|
|
3176
|
+
|
|
3177
|
+
---
|
|
3178
|
+
|
|
2467
3179
|
### Automatic root cause analysis when MR review finds issues post-coding (Apr 30, 2026)
|
|
2468
3180
|
|
|
2469
3181
|
**Status: idea** | Priority: high
|
|
@@ -2480,6 +3192,8 @@ When an MR review session (run by a WorkTrain agent) finds issues in a coding se
|
|
|
2480
3192
|
|
|
2481
3193
|
**Why this matters**: every finding that slips through is a signal about a workflow or process gap. Today that signal is lost. Capturing it systematically and feeding it back into workflow improvement closes the quality loop.
|
|
2482
3194
|
|
|
3195
|
+
**Concrete model:** CodeRabbit does this for MR reviews -- when a human reviewer corrects a CodeRabbit finding or points out something it missed, CodeRabbit extracts a structured learning (`{ claim, repo, file context, timestamp }`) and injects it into future review sessions for the same repo. WorkTrain should do the same, and broader: learnings from coding corrections (not just review corrections) feed into the per-workspace codebase assumption store, which directly addresses Subtype B intent failures. Human feedback on WorkTrain's PRs is the write path for that store.
|
|
3196
|
+
|
|
2483
3197
|
**Things to hash out:**
|
|
2484
3198
|
- How does WorkTrain detect that a human has commented on a PR post-review? This requires monitoring the PR for new review activity after WorkTrain's session completed -- either webhook events or polling.
|
|
2485
3199
|
- What does the analysis session actually produce? A structured finding about the gap? A concrete proposal for workflow improvement? Both?
|
|
@@ -2487,6 +3201,21 @@ When an MR review session (run by a WorkTrain agent) finds issues in a coding se
|
|
|
2487
3201
|
- How do you distinguish "the workflow is fine but this was a genuinely hard edge case" from "the workflow has a systematic gap"? A single miss doesn't prove a gap; multiple misses of the same kind do.
|
|
2488
3202
|
- Should the analysis result feed directly into `workflow-effectiveness-assessment`, or is it a separate concern?
|
|
2489
3203
|
- For the "coding agent missed it" case: is the right fix to change the coding workflow, or to make the review workflow more adversarial?
|
|
3204
|
+
- How are codebase-specific learnings extracted from free-form human review comments? A structured extraction step (similar to CodeRabbit's learning extraction) is needed to turn "actually this is wrong because X" into a typed store entry.
|
|
3205
|
+
- How are extracted learnings scoped and invalidated over time? Per-repo scope is right for codebase-specific facts, but learnings go stale after refactors. A `lastVerified` + staleness mechanism is needed.
|
|
3206
|
+
- Relationship to the assumption store (Candidate 2 from the intent gap discovery): human PR corrections are the primary write path for the per-workspace codebase assumption store. These two entries should be designed together.
|
|
3207
|
+
|
|
3208
|
+
---
|
|
3209
|
+
|
|
3210
|
+
### wr.discovery recommendation quality improvements v3.5 (May 6, 2026)
|
|
3211
|
+
|
|
3212
|
+
**Status: done** | Shipped in PR #951 (feat/etienneb/discovery-workflow-v35, May 6, 2026)
|
|
3213
|
+
|
|
3214
|
+
**Score: 13** | Cor:2 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
|
|
3215
|
+
|
|
3216
|
+
Evidence-based redesign of `wr.discovery` (v3.4.0 → v3.5.0) addressing three failure modes -- coverage (right answer never generated), quality (wrong answer selected), and selection (right answer not selected). Key changes:
- all three assessment gates now have `assessmentConsequences` that block on failure
- Phase 3d/3e split isolates external challenge from fresh-context selection
- typed `SelectionOutput` tier (`strong_recommendation | provisional_recommendation | insufficient_signal`) driven by observable signals
- `FrameValidityCheck` at the landscape-to-frame transition
- verbalized sampling + ordinary persona rotation in executor goal strings
- `recommendationConfidenceBand` downgrade-only invariant across resolution phases
- Phase 6 restructured as a falsification-shaped fresh-context validator
- `selectionTier` added to the `wr.discovery_handoff` artifact
|
|
3217
|
+
|
|
3218
|
+
Full audit at `.workrail/discovery-workflow-audit.md`, implementation plan at `.workrail/discovery-workflow-implementation-plan.md`.
|
|
2490
3219
|
|
|
2491
3220
|
---
|
|
2492
3221
|
|
|
@@ -2601,6 +3330,45 @@ Some workflows want notes to consistently capture current understanding, key fin
|
|
|
2601
3330
|
|
|
2602
3331
|
---
|
|
2603
3332
|
|
|
3333
|
+
### Targeted session review: extract high-signal moments instead of reviewing full transcripts (May 6, 2026)
|
|
3334
|
+
|
|
3335
|
+
**Status: idea** | Priority: high
|
|
3336
|
+
|
|
3337
|
+
**Score: 12** | Cor:2 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
|
|
3338
|
+
|
|
3339
|
+
Reviewing a full agent session transcript to evaluate quality is prohibitively expensive -- long sessions have hundreds of tool calls, file reads, and reasoning steps. But most of the signal about whether a session went well lives in a small number of high-signal moments: confirmation gates, places where the agent flagged uncertainty or divergence, steps where the agent's output failed to match the expected contract, and points where the agent encountered reality and had to adapt. Reviewing those moments selectively is 10-50x cheaper than reading the full transcript and captures most of the quality signal.
|
|
3340
|
+
|
|
3341
|
+
**High-signal moments worth targeting:**
|
|
3342
|
+
|
|
3343
|
+
1. **Confirmation gate outcomes** -- when a `requireConfirmation` gate fired, what did the agent report? Did it accurately represent the state of the work? Was the decision the right one in hindsight?
|
|
3344
|
+
|
|
3345
|
+
2. **Agent self-reported issues** -- calls to `report_issue` or `signal_coordinator` during the session. These are the agent's own flags that something was wrong. Each one warrants inspection: was the issue real, was the agent's characterization accurate, was the resolution appropriate?
|
|
3346
|
+
|
|
3347
|
+
3. **Contract validation failures** -- steps where the engine returned a `blocked` or `require_followup` response. The agent's output failed the output contract. What did it produce, and why?
|
|
3348
|
+
|
|
3349
|
+
4. **Agent-workflow friction points** -- places where the agent deviated from the expected step procedure, added divergence markers, or explicitly noted a gap between the workflow instructions and the reality it encountered. These are the inputs to workflow improvement.
|
|
3350
|
+
|
|
3351
|
+
5. **Interpretation vs outcome delta** -- the gap between what the agent stated it was building (interpretation checkpoint, once it exists) and what it actually produced. The delta is the intent gap in concrete form.
|
|
3352
|
+
|
|
3353
|
+
6. **Sycophancy signals** -- position changes without new evidence, position reversals after challenge, confidence-accuracy mismatches visible in the notes.
|
|
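An extraction pass over the existing append-only JSONL event log could be as simple as the sketch below. The event type names are assumptions derived from the moments listed above; the real log schema may differ.

```typescript
// Sketch of high-signal moment extraction from a session event log.
// Event type names are assumptions, not the daemon's actual event schema.
interface SessionEvent {
  type: string;
  step?: string;
  payload?: unknown;
}

const HIGH_SIGNAL_TYPES = new Set([
  'confirmation_gate',
  'report_issue',
  'contract_validation_failed',
  'divergence_marker',
]);

// Parse the JSONL log and keep only the moments worth reviewing -- far
// cheaper than feeding the full transcript to a review agent.
function extractHighSignalMoments(jsonl: string): SessionEvent[] {
  return jsonl
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as SessionEvent)
    .filter((event) => HIGH_SIGNAL_TYPES.has(event.type));
}

const log = [
  '{"type":"step_advanced","step":"plan"}',
  '{"type":"report_issue","payload":{"severity":"high"}}',
  '{"type":"step_advanced","step":"code"}',
  '{"type":"confirmation_gate","step":"verify"}',
].join('\n');

console.log(extractHighSignalMoments(log).map((e) => e.type));
// [ 'report_issue', 'confirmation_gate' ]
```

Sampling policy (100% for gates and `report_issue`, a fraction for routine step advances) would layer on top of this filter rather than replace it.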
3354
|
+
|
|
3355
|
+
**Why this matters:** without targeted review, session quality is only observable at the PR level (did the output pass review?). That's a lagging indicator: the failure has already shipped its cost by the time it's caught. Targeted review of high-signal moments catches failures mid-session or immediately post-session, before the cost compounds.
|
|
3356
|
+
|
|
3357
|
+
**Relationship to existing entries:**
|
|
3358
|
+
- "Agent-reportable workflow bugs" (below) -- the agent's own flags are one of the primary review targets
|
|
3359
|
+
- "Synthetic human gates" -- the targeted review output is what a synthetic gate would consume to make an approval decision
|
|
3360
|
+
- "Automatic root cause analysis" -- targeted review is the cheaper precursor that identifies which sessions warrant full root cause analysis
|
|
3361
|
+
- "Per-run workflow improvement retrospective" -- the session retrospective is one moment in the targeted review; this entry is about the full set of moments across a session
|
|
3362
|
+
|
|
3363
|
+
**Things to hash out:**
|
|
3364
|
+
- What is the right extraction mechanism? The session event log already records every tool call, step advance, and artifact. A targeted review agent reads selected event types rather than the full log. What is the right query interface?
|
|
3365
|
+
- Which moments are always reviewed vs. sampled? Confirmation gates and `report_issue` calls probably warrant 100% review; routine step advances can be sampled.
|
|
3366
|
+
- Should targeted review happen synchronously (coordinator waits before proceeding) or asynchronously (review happens in parallel, findings surface to operator outbox)?
|
|
3367
|
+
- How are review findings acted on? They could feed into: (a) the synthetic gate decision for the current session, (b) the workflow improvement retrospective, (c) the assumption store if codebase-specific learnings are extracted.
|
|
3368
|
+
- What does the targeted review agent actually produce? A structured verdict per moment reviewed, a severity-tagged list of concerns, or a binary pass/fail?
|
|
3369
|
+
|
|
3370
|
+
---
|
|
3371
|
+
|
|
2604
3372
|
### Agent-reportable workflow bugs (Apr 28, 2026)
|
|
2605
3373
|
|
|
2606
3374
|
**Status: idea** | Priority: high
|
|
@@ -2617,6 +3385,7 @@ A mechanism for agents to report problems with the WorkRail system itself during
|
|
|
2617
3385
|
- Should reports survive session cleanup, or is their lifetime tied to the session?
|
|
2618
3386
|
- Who owns acting on these reports -- the operator, the workflow author, or an automated system?
|
|
2619
3387
|
- Should this be available in interactive (MCP) sessions, or daemon sessions only?
|
|
3388
|
+
- Relationship to "Targeted session review": agent-reported workflow bugs are one of the primary high-signal moments that targeted session review would extract and inspect.
|
|
2620
3389
|
|
|
2621
3390
|
---
|
|
2622
3391
|
|
|
@@ -2673,6 +3442,40 @@ A proof record contains: `prNumber`, `goal`, `verificationChain` (array of `{ ki
|
|
|
2673
3442
|
|
|
2674
3443
|
---
|
|
2675
3444
|
|
|
3445
|
+
### Coordinator mid-session hooks: react to workflow events without waiting for session completion (May 6, 2026)
|
|
3446
|
+
|
|
3447
|
+
**Status: idea** | Priority: high
|
|
3448
|
+
|
|
3449
|
+
**Score: 12** | Cor:2 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
|
|
3450
|
+
|
|
3451
|
+
The coordinator currently acts only between sessions -- it spawns a session, awaits its completion, reads the typed output artifact, and decides what to do next. It has no mechanism to react to events that happen inside a running session. This means the coordinator cannot spawn helper agents mid-session (e.g. an external assumption ranker when the interpretation checkpoint fires), cannot intercept a confirmation gate and satisfy it autonomously, and cannot act on a step completion artifact before the full session finishes.
|
|
3452
|
+
|
|
3453
|
+
The gap: workflow lifecycle events (step completed, gate fired, artifact emitted, `report_issue` called) are currently only visible after the session ends via the session store. The coordinator needs a way to subscribe to these events as they happen and act on them -- spawning agents, injecting steer messages, or making routing decisions -- without waiting for session completion.
|
|
3454
|
+
|
|
3455
|
+
**Concrete use cases this unlocks:**
|
|
3456
|
+
- Spawn an external assumption-ranking agent when the interpretation checkpoint step completes, inject its ranking back into the session before verification runs
|
|
3457
|
+
- Auto-satisfy a `requireConfirmation` gate in autonomous mode by running a synthetic gate evaluation and steering the session with the result
|
|
3458
|
+
- Spawn a targeted review agent when a specific step artifact is emitted, surface findings before the session proceeds to the next phase
|
|
3459
|
+
- React to a `report_issue` call mid-session by spawning an investigation agent immediately rather than waiting for the full session to fail
|
|
3460
|
+
|
|
3461
|
+
**What this requires:**
|
|
3462
|
+
- A real-time or near-real-time event subscription mechanism from the coordinator to the session event log (the append-only JSONL already has all the events; the coordinator needs a watch/poll interface on it)
|
|
3463
|
+
- A `steer` injection path from the coordinator into a running session (the steer endpoint already exists at `POST /sessions/:id/steer`)
|
|
3464
|
+
- A coordinator hook registry: declarative rules of the form "when session X emits event type Y with artifact kind Z, execute hook H"
|
|
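The hook registry could be sketched as below. The rule fields and the event shape are assumptions; only the steer endpoint mentioned above is known to exist, and the hook body here just records what it would do.

```typescript
// Sketch of a declarative coordinator hook registry. Event and rule shapes
// are assumptions; the dispatch source (poll or push) is left open.
interface SessionEvent {
  sessionId: string;
  type: string;
  artifactKind?: string;
}

interface HookRule {
  eventType: string;
  artifactKind?: string; // optional narrowing: only fire on this artifact kind
  run: (event: SessionEvent) => void;
}

class HookRegistry {
  private readonly rules: HookRule[] = [];

  register(rule: HookRule): void {
    this.rules.push(rule);
  }

  // Called once per event the coordinator observes on the session log.
  dispatch(event: SessionEvent): void {
    for (const rule of this.rules) {
      const typeMatches = rule.eventType === event.type;
      const kindMatches = !rule.artifactKind || rule.artifactKind === event.artifactKind;
      if (typeMatches && kindMatches) rule.run(event);
    }
  }
}

const fired: string[] = [];
const registry = new HookRegistry();
registry.register({
  eventType: 'artifact_emitted',
  artifactKind: 'interpretation_checkpoint',
  run: (e) => fired.push(`rank-assumptions:${e.sessionId}`),
});

registry.dispatch({ sessionId: 's-9', type: 'artifact_emitted', artifactKind: 'interpretation_checkpoint' });
registry.dispatch({ sessionId: 's-9', type: 'step_advanced' });
console.log(fired); // [ 'rank-assumptions:s-9' ]
```

Keeping the match predicate this small is deliberate: an auditable rule is "event type + artifact kind", and anything richer pushes toward the imperative-script end of the tradeoff discussed below.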
3465
|
+
|
|
3466
|
+
**Relationship to existing entries:**
|
|
3467
|
+
- "Scripts-first coordinator" (below): the hooks would be coordinator scripts reacting to events, not LLM reasoning
|
|
3468
|
+
- "Native multi-agent orchestration": `spawn_session` + `await_sessions` handles between-session orchestration; this handles within-session coordination
|
|
3469
|
+
- "Workflow runtime adapter": mid-session hooks are how the daemon adapter satisfies `requireConfirmation` gates autonomously
|
|
3470
|
+
|
|
3471
|
+
**Things to hash out:**
|
|
3472
|
+
- Poll vs push: the session event log is append-only JSONL. The coordinator can poll it efficiently (a `tail -f` equivalent), but a proper event bus (the daemon event emitter already exists) would be cleaner. Which is the right mechanism?
|
|
3473
|
+
- Hook registry format: declarative JSON rules in `triggers.yml`, or imperative TypeScript in the coordinator script? The declarative approach is more auditable; the imperative approach is more flexible.
|
|
3474
|
+
- Ordering guarantees: if the coordinator injects a steer message in response to a step completion, does the session engine guarantee the steer is processed before the next step begins? Race condition risk.
|
|
3475
|
+
- Blast radius: a hook that fires incorrectly (wrong event matched, wrong steer injected) could derail a running session in a hard-to-debug way. What are the rollback and auditability guarantees?
|
|
3476
|
+
|
|
3477
|
+
---
|
|
3478
|
+
|
|
2676
3479
|
### Scripts-first coordinator: avoid the main agent wherever possible (Apr 15, 2026)
|
|
2677
3480
|
|
|
2678
3481
|
**Status: partial** | Foundation shipped PR #908 (Apr 30, 2026)
|
|
@@ -4758,7 +5561,25 @@ The agent is expensive, inconsistent, and slow. Scripts are free, deterministic,
|
|
|
4758
5561
|
|
|
4759
5562
|
### Dynamic model selection
|
|
4760
5563
|
|
|
4761
|
-
**Status:
|
|
5564
|
+
**Status: partial** -- raw model ID (`agentConfig.model`) shipped in `triggers.yml`. Two gaps remain: (1) no validation at trigger parse or startup -- a bad model ID is only caught when the first LLM call fires; (2) every trigger hardcodes a provider-specific ID, which breaks when the inference profile naming convention changes (e.g. `us.anthropic.claude-haiku-4-5-20251001` vs `us.anthropic.claude-haiku-4-5-20251001-v1:0`).
|
|
5565
|
+
|
|
5566
|
+
### Model tier abstraction: cheap / medium / expensive (May 7, 2026)
|
|
5567
|
+
|
|
5568
|
+
**Status: idea** | Priority: medium
|
|
5569
|
+
|
|
5570
|
+
**Score: 12** | Cor:2 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
|
|
5571
|
+
|
|
5572
|
+
**The problem:** Triggers hardcode provider-specific model IDs (`amazon-bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0`). When inference profile naming conventions change, or when switching providers/regions, every trigger must be updated manually. The daemon's adaptive coordinator already makes implicit cost/quality tradeoffs (Haiku for routing, Sonnet for coding) but has no first-class mechanism to express them -- it's locked to whatever IDs are in `agentConfig.model`.
|
|
5573
|
+
|
|
5574
|
+
**The idea:** Introduce a tier abstraction. Triggers and workflow phases declare a tier (`cheap | medium | expensive`). The daemon resolves tiers to concrete model IDs from a tier map in `~/.workrail/config.json`. The adaptive coordinator picks tiers per phase: cheap for classification and routing, medium for coding, expensive for architectural review. Changing provider or region means updating the tier map once.
|
|
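The resolution logic is small enough to sketch. The tier map shape and the raw-ID-wins fallback are assumptions; of the model IDs below, only the Haiku profile is quoted in this entry, and the other two are placeholders.

```typescript
// Sketch of tier -> model resolution. Config shape is an assumption; only
// agentConfig.model and ~/.workrail/config.json are mentioned in the entry.
type ModelTier = 'cheap' | 'medium' | 'expensive';

type TierMap = Record<ModelTier, string>;

// Would be loaded from ~/.workrail/config.json. The medium/expensive IDs
// below are placeholders, not verified inference profiles.
const tierMap: TierMap = {
  cheap: 'amazon-bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0',
  medium: 'amazon-bedrock/example-medium-model-id',
  expensive: 'amazon-bedrock/example-expensive-model-id',
};

// A trigger may declare either a tier or a raw model ID; a raw ID wins so
// existing triggers keep working unchanged.
function resolveModel(
  agentConfig: { model?: string; tier?: ModelTier },
  map: TierMap,
): string {
  if (agentConfig.model) return agentConfig.model;
  return map[agentConfig.tier ?? 'medium'];
}

console.log(resolveModel({ tier: 'cheap' }, tierMap));
```

Switching provider or region then means editing the three entries of `tierMap` once, instead of every trigger.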
5575
|
+
|
|
5576
|
+
**Validation is a prerequisite.** Before tiers make sense, bad model IDs need to be caught at startup rather than at first LLM call. See "Model ID validation at daemon startup" below.
|
|
5577
|
+
|
|
5578
|
+
**Things to hash out:**
|
|
5579
|
+
- Where does the tier map live? `~/.workrail/config.json` (global) vs. `triggers.yml` (per-workspace) vs. both with cascade.
|
|
5580
|
+
- Does the tier map need to carry both a Bedrock and a direct-API model per tier, or does one path own the daemon?
|
|
5581
|
+
- Should the adaptive coordinator receive the tier map as a dependency, or should it always spawn sessions with explicit `agentConfig.model` set by the coordinator?
|
|
5582
|
+
- How do you handle models that exist on one provider but not another (e.g. Opus available on Bedrock but not via the direct API under certain rate limits)?
|
|
4762
5583
|
|
|
4763
5584
|
### Multi-agent support (spawn_agent + coordinator sessions)
|
|
4764
5585
|
|