@exaudeus/workrail 3.67.0 → 3.68.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/application/services/compiler/template-registry.js +10 -1
- package/dist/cli/commands/worktrain-init.js +1 -1
- package/dist/console-ui/assets/{index-tOl8Vowf.js → index-DPdRJHMX.js} +1 -1
- package/dist/console-ui/index.html +1 -1
- package/dist/coordinators/modes/full-pipeline.js +4 -4
- package/dist/coordinators/modes/implement-shared.js +5 -5
- package/dist/coordinators/modes/implement.js +4 -4
- package/dist/coordinators/pr-review.js +4 -4
- package/dist/daemon/workflow-runner.d.ts +1 -0
- package/dist/daemon/workflow-runner.js +1 -0
- package/dist/manifest.json +31 -31
- package/dist/mcp/handlers/v2-context-budget.js +18 -0
- package/dist/mcp/handlers/v2-workflow.js +1 -1
- package/dist/mcp/workflow-protocol-contracts.js +2 -2
- package/dist/v2/durable-core/constants.d.ts +2 -0
- package/dist/v2/durable-core/constants.js +2 -1
- package/dist/v2/projections/session-metrics.js +1 -1
- package/docs/authoring-v2.md +4 -4
- package/docs/changelog-recent.md +3 -3
- package/docs/configuration.md +1 -1
- package/docs/design/adaptive-coordinator-context-candidates.md +1 -1
- package/docs/design/adaptive-coordinator-context.md +1 -1
- package/docs/design/adaptive-coordinator-routing-candidates.md +18 -18
- package/docs/design/adaptive-coordinator-routing-review.md +1 -1
- package/docs/design/adaptive-coordinator-routing.md +34 -34
- package/docs/design/agent-cascade-protocol.md +2 -2
- package/docs/design/console-daemon-separation-discovery.md +323 -0
- package/docs/design/context-assembly-design-candidates.md +1 -1
- package/docs/design/context-assembly-implementation-plan.md +1 -1
- package/docs/design/context-assembly-layer.md +2 -2
- package/docs/design/context-assembly-review-findings.md +1 -1
- package/docs/design/coordinator-access-audit.md +293 -0
- package/docs/design/coordinator-architecture-audit.md +62 -0
- package/docs/design/coordinator-error-handling-audit.md +240 -0
- package/docs/design/coordinator-testability-audit.md +426 -0
- package/docs/design/daemon-architecture-discovery.md +1 -1
- package/docs/design/daemon-console-separation-discovery.md +242 -0
- package/docs/design/daemon-memory-audit.md +203 -0
- package/docs/design/design-candidates-console-daemon-separation.md +256 -0
- package/docs/design/design-candidates-discovery-loop-fix.md +141 -0
- package/docs/design/design-review-findings-console-daemon-separation.md +106 -0
- package/docs/design/design-review-findings-discovery-loop-fix.md +81 -0
- package/docs/design/discovery-loop-fix-candidates.md +161 -0
- package/docs/design/discovery-loop-fix-design-review.md +106 -0
- package/docs/design/discovery-loop-fix-validation.md +258 -0
- package/docs/design/discovery-loop-investigation-A.md +188 -0
- package/docs/design/discovery-loop-investigation-B.md +287 -0
- package/docs/design/exploration-workflow-candidates.md +205 -0
- package/docs/design/exploration-workflow-design-review.md +166 -0
- package/docs/design/exploration-workflow-discovery.md +443 -0
- package/docs/design/ide-context-files-candidates.md +231 -0
- package/docs/design/ide-context-files-design-review.md +85 -0
- package/docs/design/ide-context-files.md +615 -0
- package/docs/design/implementation-plan-discovery-loop-fix.md +199 -0
- package/docs/design/implementation-plan-queue-poll-rotation.md +102 -0
- package/docs/design/in-process-http-audit.md +190 -0
- package/docs/design/layer3b-ghost-nodes-design-candidates.md +2 -2
- package/docs/design/loadSessionNotes-candidates.md +108 -0
- package/docs/design/loadSessionNotes-test-coverage-discovery.md +297 -0
- package/docs/design/loadSessionNotes-test-coverage-session4.md +209 -0
- package/docs/design/loadSessionNotes-test-coverage-v3.md +321 -0
- package/docs/design/probe-session-design-candidates.md +261 -0
- package/docs/design/probe-session-phase0.md +490 -0
- package/docs/design/routines-guide.md +7 -7
- package/docs/design/session-metrics-attribution-candidates.md +250 -0
- package/docs/design/session-metrics-attribution-design-review.md +115 -0
- package/docs/design/session-metrics-attribution-discovery.md +319 -0
- package/docs/design/session-metrics-candidates.md +227 -0
- package/docs/design/session-metrics-design-review.md +104 -0
- package/docs/design/session-metrics-discovery.md +454 -0
- package/docs/design/spawn-session-debug.md +202 -0
- package/docs/design/trigger-validator-candidates.md +214 -0
- package/docs/design/trigger-validator-review.md +109 -0
- package/docs/design/trigger-validator-shaping-phase0.md +239 -0
- package/docs/design/trigger-validator.md +454 -0
- package/docs/design/v2-core-design-locks.md +2 -2
- package/docs/design/workflow-extension-points.md +15 -15
- package/docs/design/workflow-id-validation-at-startup.md +1 -1
- package/docs/design/workflow-id-validation-implementation-plan.md +2 -2
- package/docs/design/workflow-trigger-lifecycle-audit.md +175 -0
- package/docs/design/worktrain-task-queue-candidates.md +5 -5
- package/docs/design/worktrain-task-queue.md +4 -4
- package/docs/discovery/coordinator-script-design.md +1 -1
- package/docs/discovery/coordinator-ux-discovery.md +3 -3
- package/docs/discovery/simulation-report.md +1 -1
- package/docs/discovery/workflow-modernization-discovery.md +326 -0
- package/docs/discovery/workflow-selection-for-discovery-tasks.md +33 -33
- package/docs/discovery/worktrain-status-briefing.md +1 -1
- package/docs/discovery/wr-discovery-goal-reframing.md +1 -1
- package/docs/docker.md +1 -1
- package/docs/ideas/backlog.md +227 -0
- package/docs/ideas/third-party-workflow-setup-design-thinking.md +1 -1
- package/docs/integrations/claude-code.md +5 -5
- package/docs/integrations/firebender.md +1 -1
- package/docs/plans/agentic-orchestration-roadmap.md +2 -2
- package/docs/plans/mr-review-workflow-redesign.md +9 -9
- package/docs/plans/ui-ux-workflow-design-candidates.md +4 -4
- package/docs/plans/ui-ux-workflow-discovery.md +2 -2
- package/docs/plans/workflow-categories-candidates.md +8 -8
- package/docs/plans/workflow-categories-discovery.md +4 -4
- package/docs/plans/workflow-modernization-design.md +430 -0
- package/docs/plans/workflow-staleness-detection-candidates.md +11 -11
- package/docs/plans/workflow-staleness-detection-review.md +4 -4
- package/docs/plans/workflow-staleness-detection.md +9 -9
- package/docs/plans/workrail-platform-vision.md +3 -3
- package/docs/reference/agent-context-cleaner-snippet.md +1 -1
- package/docs/reference/agent-context-guidance.md +4 -4
- package/docs/reference/context-optimization.md +2 -2
- package/docs/roadmap/now-next-later.md +2 -2
- package/docs/roadmap/open-work-inventory.md +16 -16
- package/docs/workflows.md +31 -31
- package/package.json +1 -1
- package/spec/workflow-tags.json +47 -47
- package/workflows/adaptive-ticket-creation.json +16 -16
- package/workflows/architecture-scalability-audit.json +22 -22
- package/workflows/bug-investigation.agentic.v2.json +3 -3
- package/workflows/classify-task-workflow.json +1 -1
- package/workflows/coding-task-workflow-agentic.json +6 -6
- package/workflows/cross-platform-code-conversion.v2.json +8 -8
- package/workflows/document-creation-workflow.json +8 -8
- package/workflows/documentation-update-workflow.json +8 -8
- package/workflows/intelligent-test-case-generation.json +2 -2
- package/workflows/learner-centered-course-workflow.json +2 -2
- package/workflows/mr-review-workflow.agentic.v2.json +4 -4
- package/workflows/personal-learning-materials-creation-branched.json +8 -8
- package/workflows/presentation-creation.json +5 -5
- package/workflows/production-readiness-audit.json +1 -1
- package/workflows/relocation-workflow-us.json +31 -31
- package/workflows/routines/context-gathering.json +1 -1
- package/workflows/routines/design-review.json +1 -1
- package/workflows/routines/execution-simulation.json +1 -1
- package/workflows/routines/feature-implementation.json +3 -3
- package/workflows/routines/final-verification.json +1 -1
- package/workflows/routines/hypothesis-challenge.json +1 -1
- package/workflows/routines/ideation.json +1 -1
- package/workflows/routines/parallel-work-partitioning.json +3 -3
- package/workflows/routines/philosophy-alignment.json +2 -2
- package/workflows/routines/plan-analysis.json +1 -1
- package/workflows/routines/plan-generation.json +1 -1
- package/workflows/routines/tension-driven-design.json +6 -6
- package/workflows/scoped-documentation-workflow.json +26 -26
- package/workflows/ui-ux-design-workflow.json +14 -14
- package/workflows/workflow-diagnose-environment.json +1 -1
- package/workflows/workflow-for-workflows.json +1 -1

@@ -0,0 +1,258 @@
# Discovery Loop Fix Validation

**Date:** 2026-04-21
**Scope:** WorkTrain autonomous pipeline -- `wr.discovery` re-runs indefinitely on issue #393
**Prior investigations:** `discovery-loop-investigation-A.md`, `discovery-loop-investigation-B.md`

---

## Context / Ask

WorkTrain's pipeline re-ran `wr.discovery` on issue #393 at least 73 times over 19+ hours. Two independent investigations (A and B) identified four root causes. This document validates three proposed fixes against actual source code, identifies gaps, and recommends a final fix set in priority order.

**Original goal (solution statement):** Validate whether three proposed fixes are correct, complete, and sufficient.

**Reframed problem:** WorkTrain's issue selection has no effective termination gate. Once an issue enters the discovery pipeline, no code path reliably marks it as processed or blocked, so it loops indefinitely.

---

## Source Code Verified

All findings below are based on direct source inspection as of 2026-04-21.

| File | Lines inspected | Key finding |
|---|---|---|
| `src/trigger/trigger-listener.ts` | 430-500 | `spawnSession` has 4-parameter signature -- no `agentConfig` parameter |
| `src/coordinators/adaptive-pipeline.ts` | 1-120 | `DISCOVERY_TIMEOUT_MS = 55 * 60 * 1000` confirmed |
| `src/trigger/polling-scheduler.ts` | 480-636 | `dispatchP` typed as `Promise<unknown>` -- outcome discarded confirmed |
| `src/trigger/adapters/github-queue-poller.ts` | 280-360 | `checkIdempotency` logic confirmed; explicitly handles missing `context` field as `'clear'` |
| `src/daemon/workflow-runner.ts` | 700-720 | `persistTokens` writes `{ continueToken, checkpointToken, ts, [worktreePath] }` -- no `context` field |

---

## Root Cause Recap (Verified)

Four root causes are confirmed by source inspection. The three proposed fixes address three of them.

| # | Root cause | Status |
|---|---|---|
| RC1 | `spawnSession` passes no `agentConfig` to `routerRef.dispatch()` -- session inherits `DEFAULT_SESSION_TIMEOUT_MINUTES=30` | **Confirmed** (trigger-listener.ts:492-498) |
| RC2 | `PipelineOutcome` silently discarded in polling-scheduler `.then()` -- no label applied | **Confirmed** (polling-scheduler.ts:605-624, `Promise<unknown>` cast) |
| RC3 | `checkIdempotency` sidecar scan is dead -- `persistTokens` never writes `context` field | **Confirmed** (workflow-runner.ts:714-716, github-queue-poller.ts:330-335) |
| RC4 | In-memory `dispatchingIssues` lost on daemon restart | **Confirmed** (polling-scheduler.ts:113) -- not addressed by the three proposed fixes |

---

## Fix Validation

### Fix 1 -- Thread `maxSessionMinutes` through `spawnSession()`

**Proposed:** Extend `CoordinatorDeps.spawnSession` to accept an optional `agentConfig` parameter and pass it through to `routerRef.dispatch()` in `trigger-listener.ts`. In `full-pipeline.ts`, pass `{ maxSessionMinutes: Math.ceil(DISCOVERY_TIMEOUT_MS / 60_000) }` (55 minutes) when spawning `wr.discovery`.

**Is the analysis correct?** Yes. Source confirms:
- `spawnSession` in `trigger-listener.ts` (lines 436-501) has signature `(workflowId, goal, workspace, context?)` -- no 5th `agentConfig` parameter.
- `routerRef.dispatch()` at line 492 passes only `{ workflowId, goal, workspacePath, context, _preAllocatedStartResponse }` -- no `agentConfig`.
- `DEFAULT_SESSION_TIMEOUT_MINUTES = 30` at `workflow-runner.ts:83` applies to all sessions without an explicit override.
- `DISCOVERY_TIMEOUT_MS = 55 * 60 * 1000` at `adaptive-pipeline.ts:39` -- the 25-minute gap is real.

**Is the fix correct?** Yes. Adding `agentConfig` as a 5th parameter to `spawnSession`, forwarding it to `dispatch()`, and calling it with `{ maxSessionMinutes: 55 }` in `full-pipeline.ts` closes the mismatch.
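
A minimal sketch of that shape, assuming the interface and call-site names used in this document (`CoordinatorDeps`, `deps.spawnSession`, `DISCOVERY_TIMEOUT_MS`); the return type and other members are placeholders, not the real module:

```typescript
// Sketch only -- the real CoordinatorDeps interface and spawn sites may differ in detail.
export interface AgentConfig {
  readonly maxSessionMinutes?: number;
  readonly maxTurns?: number;
}

export interface CoordinatorDeps {
  spawnSession(
    workflowId: string,
    goal: string,
    workspace: string,
    context?: Record<string, unknown>,
    agentConfig?: AgentConfig, // new optional 5th parameter, forwarded to routerRef.dispatch()
  ): Promise<unknown>; // return type elided in this sketch
  // ...other deps unchanged
}

// full-pipeline.ts -- discovery spawn site (shaping and coding sites analogous):
const discovery = await deps.spawnSession(
  'wr.discovery',
  opts.goal,
  opts.workspace,
  undefined,
  { maxSessionMinutes: Math.ceil(DISCOVERY_TIMEOUT_MS / 60_000) }, // 55 minutes
);
```

Keeping the parameter optional means existing callers and test fakes keep compiling until they are updated deliberately.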

**Is the fix complete?** Mostly. The same mismatch affects **shaping** (35 min > 30 min) and **coding** (65 min > 30 min) sessions. The fix should be applied at all three `spawnSession` call sites in `full-pipeline.ts`, not just discovery. If only discovery is fixed, shaping and coding will exhibit the same timeout bug when those phases run.

**Risks:**
- Signature change to `CoordinatorDeps.spawnSession` requires updating the interface in `pr-review.ts` and all implementations (test fakes, the real implementation in `trigger-listener.ts`). This is mechanical but must be done consistently.
- The `AdaptivePipelineOpts` interface already has `taskCandidate` (line 119) threaded separately; `agentConfig` should follow the same pattern to avoid coupling coordinator internals to queue-poll concerns.

**Simpler alternative:** Set `agentConfig.maxSessionMinutes: 60` in `triggers.yml` for the queue-poll trigger. This is an operator-level config change with no code changes -- but it is fragile (the coordinator already holds the right per-phase value; externalizing it to config creates a dual source of truth that can drift).

**Verdict:** Correct, mostly complete (needs all three spawn sites). Recommended. The type-safe approach (Option A from Investigation B) is preferred over config-file override (Option B).

---

### Fix 2 -- Inspect `PipelineOutcome` in polling-scheduler; apply `worktrain:blocked` on escalation

**Proposed:** In polling-scheduler's `.then()` handler, inspect the `PipelineOutcome.kind` value and apply a `worktrain:blocked` label to the issue when `kind === 'escalated'`.

**Is the analysis correct?** Yes. Source confirms:
- At `polling-scheduler.ts:605-610`, `dispatchAdaptivePipeline` is cast to `Promise<unknown>` (not `Promise<PipelineOutcome>`), making the return value untyped and leaving the `.then()` handler free to ignore it.
- The `.then()` callback at lines 617-620 has no parameter -- it ignores whatever the pipeline returns.
- `PipelineOutcome` is a discriminated union with kinds `'merged'`, `'escalated'`, `'dry_run'` (confirmed in `adaptive-pipeline.ts:84-93`).
- The `excludeLabels` check at `polling-scheduler.ts:501-502` runs before maturity scoring; any label in `queueConfig.excludeLabels` would block re-selection.

**Is the fix correct?** Yes, and it addresses the primary observable symptom. Applying a `worktrain:blocked` (or `worktrain:failed`) label to the GitHub issue on escalation puts a persistent, cross-restart, cross-process marker on the issue that survives daemon restarts. The `excludeLabels` check at line 501 will then skip the issue on every subsequent poll cycle.

**Critical prerequisite:** The label name used (`worktrain:blocked` in the proposed fix) must be present in `queueConfig.excludeLabels`. If it is not, the label is applied but never checked, and the loop continues. This needs to be verified or enforced (e.g., auto-add the label to `excludeLabels` in the same code path, or use an already-excluded label like `worktrain:in-progress`).

**Is the fix complete?** Partially. It stops re-selection on escalation but:
1. It does not handle `kind === 'merged'` -- a merged issue should close automatically via the PR, but an explicit log or assertion is worth adding.
2. It does not handle the case where the pipeline resolves but `outcome.kind` is unrecognized (forward-compat safety).
3. The label is not removed on a subsequent success if the issue is re-opened or retried manually -- this may be acceptable, but should be documented.
4. The cast `Promise<unknown>` at line 610 should be changed to `Promise<PipelineOutcome>` so TypeScript enforces that the outcome is handled exhaustively. Without this, a future rename of `kind` values will silently break the handler.

**Could Fix 2 alone stop the loop?** Yes. Applying `worktrain:blocked` on escalation would stop re-selection on the very next poll cycle after the label is set. Investigation A explicitly confirms: "Fix 2 alone would stop the loop immediately because the `worktrain:in-progress` label check runs before `checkIdempotency` in `polling-scheduler.ts` line 505, and does not depend on any file state or memory state." However, Fix 2 alone does not fix the session timeout mismatch (Fix 1) or the cross-restart idempotency gap (Fix 3). The loop is stopped by labeling, but future cycles where the label is removed would immediately re-trigger the same session timeout bug.

**Should the coordinator close/unassign on success instead of just labeling on failure?** Yes, this is the cleaner termination contract. On `kind === 'merged'`, the PR should close the issue automatically (GitHub's "closes #N" mechanism). On `kind === 'escalated'`, applying a label is correct -- the issue should not be closed because a human needs to review the escalation. The important addition is: on escalation, the issue should be **unassigned** from `worktrain-etienneb` and the `worktrain:in-progress` label removed, so a human can clearly see it needs attention without the bot re-picking it up.

**Verdict:** Correct. Complete with the caveats above. This is the highest-priority fix -- it stops the loop on its own. Recommended as the first fix deployed.

---

### Fix 3 -- Write sidecar from `spawnSession()` to fix cross-restart idempotency

**Proposed:** Before calling `dispatchAdaptivePipeline`, write a sidecar file to `~/.workrail/daemon-sessions/<issueNumber>.json` containing `{ "context": { "taskCandidate": { "issueNumber": N } } }`. Remove this file in the `.then`/`.catch` completion handler.

**Is the analysis correct?** Yes. Source confirms:
- `checkIdempotency` at `github-queue-poller.ts:305-357` scans `daemon-sessions/*.json` and checks `context.taskCandidate.issueNumber`.
- At lines 330-335: if `context` is not an object or is null, it hits `continue` -- the file is treated as not owning any issue.
- `persistTokens` at `workflow-runner.ts:714-716` writes only `{ continueToken, checkpointToken, ts, [worktreePath] }` -- no `context` field.
- Therefore, all session files written by `persistTokens` are unconditionally treated as `'clear'` by `checkIdempotency`. The guard is dead.

**Is the fix correct?** Mostly. Writing a sidecar file with `{ context: { taskCandidate: { issueNumber: N } } }` before dispatch and removing it on completion would make `checkIdempotency` functional. However, the proposed location (in `spawnSession()` in trigger-listener.ts) is wrong -- `spawnSession` is called by the coordinator to spawn child sessions (e.g., `wr.discovery`, `wr.shaping`), not to represent the outer pipeline session. The sidecar should be written by the queue poller (`doPollGitHubQueue`) immediately after `this.dispatchingIssues.add(top.issue.number)`, and removed in the `.then`/`.catch` handler.

**Is the fix complete?** No. Three gaps:
1. **Placement:** The sidecar should be written at the polling-scheduler level (before `dispatchAdaptivePipeline` is called), not inside `spawnSession`. The outer pipeline's issue ownership should be tracked at the dispatch site.
2. **Deletion on timeout:** The session `.json` file written by `persistTokens` is explicitly deleted on wall_clock timeout (`workflow-runner.ts:~3867`). A sidecar written at the poller level and removed in `.then`/`.catch` avoids this deletion, but the poller's completion handler only fires when the pipeline Promise resolves -- not when the daemon crashes. A crash between sidecar write and pipeline resolution leaves a stale sidecar. The file should include a `ts` field so it expires after a reasonable duration (e.g., 4 hours), or the daemon should clean stale sidecars on startup.
3. **Fix 2 supersedes this for loop prevention:** With Fix 2 applying a persistent GitHub label on escalation, cross-restart idempotency for completed/escalated pipelines is already handled by the label check (which runs before `checkIdempotency` in the polling loop). Fix 3 is valuable for the concurrent-dispatch protection window (restart during an active pipeline), but is not a loop-stopper on its own.

**Risks:**
- File write before dispatch means a failure in `fs.writeFile` could block the dispatch. This should be fire-and-forget (log and continue, don't block dispatch on sidecar failure).
- If the sidecar uses the issue number as the filename, multiple issues cannot have sidecars simultaneously under that naming scheme -- but that is fine since only one issue is dispatched per cycle.

**Verdict:** Conceptually correct but placement and durability need adjustment. Not a loop-stopper on its own (Fix 2 covers the loop). Important for crash safety and concurrent dispatch protection. Should be implemented after Fix 2.

---

## Gaps Not Addressed by the Three Fixes

### Gap 1: Root Cause 4 is not addressed -- in-memory `dispatchingIssues` lost on restart

The three proposed fixes do not address the fact that `dispatchingIssues` is in-memory and lost on every daemon restart. Fix 3 (corrected) would address this for the cross-restart window. Fix 2 (label application) addresses it for the post-completion/escalation state. Together they cover the cross-restart gap, but Fix 3 needs the placement correction noted above.

### Gap 2: No escalation outbox notification

Neither investigation's fix set includes writing to the operator outbox when the pipeline escalates. The operator cannot distinguish "currently in progress" from "failed 12 times" without reading logs. Adding a comment to the GitHub issue (or posting to the outbox) explaining the escalation reason is important for operator visibility but does not affect the loop behavior.

### Gap 3: Issue assignment not cleared on escalation

When the pipeline escalates, issue #393 remains assigned to `worktrain-etienneb`. With Fix 2 applying `worktrain:blocked`, re-selection is prevented -- but the issue looks "in progress" to a human reviewer. The bot should unassign itself and remove the `worktrain:in-progress` label (if present) when applying the blocked label.

### Gap 4: Session timeout mismatch affects all phases, not just discovery

Fix 1 addresses the `wr.discovery` timeout but the same mismatch applies to shaping (35 min vs 30 min default) and coding (65 min vs 30 min default). All three spawn sites in `full-pipeline.ts` need the `agentConfig` threading.

---

## Recommended Final Fix Set

Listed in deployment priority order (Fix 2 is the only blocker; others reduce risk and improve correctness).

### Priority 1 (Deploy immediately): Fix 2 -- PipelineOutcome inspection with label application

**Why first:** This is the only fix that stops the loop immediately. The label persists across daemon restarts and is checked before any in-memory or sidecar guard.

**Minimum viable change** (sketched below):
1. Change the `dispatchAdaptivePipeline` cast at lines 605-610 from `Promise<unknown>` to `Promise<PipelineOutcome>` (import the type).
2. Change `.then(() => {` to `.then((outcome) => {`.
3. On `outcome.kind === 'escalated'`: add `worktrain:blocked` label to the issue via GitHub API, post a comment with `outcome.escalationReason`.
4. Confirm `worktrain:blocked` (or the chosen label) is in `queueConfig.excludeLabels`.
5. On escalation: unassign the bot from the issue and remove `worktrain:in-progress` if present.

**Outcome:** Loop stops on the next poll cycle after any escalation.
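
A minimal sketch of steps 1-5. `GithubIssueApi` and `Logger` are hypothetical stand-ins for the poller's real GitHub client and logger, and the import path for `PipelineOutcome` is assumed; the real handler in `polling-scheduler.ts` will differ in detail:

```typescript
// Sketch only -- not the real polling-scheduler code; collaborator types are placeholders.
import type { PipelineOutcome } from '../coordinators/adaptive-pipeline';

interface GithubIssueApi {
  addLabels(issueNumber: number, labels: string[]): Promise<void>;
  removeLabel(issueNumber: number, label: string): Promise<void>;
  removeAssignee(issueNumber: number, login: string): Promise<void>;
  comment(issueNumber: number, body: string): Promise<void>;
}
interface Logger {
  info(msg: string, data?: unknown): void;
  warn(msg: string, data?: unknown): void;
  error(msg: string, data?: unknown): void;
}

function handleQueueDispatch(
  dispatchP: Promise<PipelineOutcome>,   // step 1: keep the real type instead of Promise<unknown>
  issueNumber: number,
  gh: GithubIssueApi,
  log: Logger,
  dispatchingIssues: Set<number>,
): void {
  dispatchP
    .then(async (outcome) => {           // step 2: accept the outcome
      switch (outcome.kind) {
        case 'escalated':                // steps 3 and 5
          await gh.addLabels(issueNumber, ['worktrain:blocked']); // label must also be in excludeLabels (step 4)
          await gh.comment(issueNumber, `WorkTrain escalated: ${JSON.stringify(outcome.escalationReason)}`);
          await gh.removeAssignee(issueNumber, 'worktrain-etienneb');
          await gh.removeLabel(issueNumber, 'worktrain:in-progress').catch(() => {});
          break;
        case 'merged':
        case 'dry_run':
          log.info('queue-poll pipeline finished', { issueNumber, kind: outcome.kind });
          break;
        default:                         // forward-compat safety for unrecognized kinds
          log.warn('queue-poll pipeline returned unknown outcome', { issueNumber, outcome });
      }
    })
    .catch((err) => log.error('queue-poll pipeline failed', { issueNumber, err }))
    .finally(() => dispatchingIssues.delete(issueNumber));
}
```

Typing the promise as `Promise<PipelineOutcome>` is what turns the `default` branch into a safety net: an unrecognized `kind` is logged instead of silently dropped.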

### Priority 2 (Deploy with or shortly after Fix 2): Fix 1 -- Thread `maxSessionMinutes` through `spawnSession()`

**Why second:** The session timeout mismatch means `wr.discovery` can never complete within the coordinator's expected window, so Fix 2 will immediately apply a blocked label. Fixing the timeout allows future discovery sessions to actually complete. If Fix 1 is not deployed alongside Fix 2, the pipeline will always escalate at discovery (sessions time out at 30 min, coordinator waits 55 min) and every new issue will be immediately labeled `worktrain:blocked`.

**Required scope:**
1. Extend the `CoordinatorDeps.spawnSession` interface to accept a 5th parameter `agentConfig?: { readonly maxSessionMinutes?: number; readonly maxTurns?: number }`.
2. Update the `trigger-listener.ts` implementation to forward `agentConfig` to `routerRef.dispatch()`.
3. Update all call sites in `full-pipeline.ts`: discovery (55 min), shaping (35 min), coding (65 min).
4. Update test fakes to accept the 5th parameter.

### Priority 3 (Deploy after 1 and 2): Fix 3 -- Write sidecar for cross-restart idempotency

**Why third:** After Fixes 1 and 2 are in place, the loop is stopped and sessions can complete. Fix 3 adds defense-in-depth for crash scenarios where the daemon restarts while a pipeline is actively running.

**Corrected placement** (a sketch follows the list):
1. Write sidecar in `doPollGitHubQueue` immediately after `dispatchingIssues.add(top.issue.number)` -- not inside `spawnSession`.
2. Sidecar path: `<sessionsDir>/<issueNumber>-queue.json` to distinguish from regular session files.
3. Sidecar content: `{ context: { taskCandidate: { issueNumber: N } }, ts: <epoch_ms>, expires: <epoch_ms + 4h> }`.
4. Remove sidecar in the `.then`/`.catch` handler.
5. In `checkIdempotency` or daemon startup: skip/delete sidecar files with `expires < Date.now()`.
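
A sketch of the write/remove helpers under those assumptions (only the Node built-ins are real APIs; the file-name convention and 4-hour TTL follow the list above):

```typescript
// Sketch only -- helper names and call sites are assumptions, per the corrected placement above.
import { promises as fs } from 'node:fs';
import path from 'node:path';

const SIDECAR_TTL_MS = 4 * 60 * 60 * 1000; // 4-hour expiry, per the stale-sidecar note in Fix 3

function sidecarPath(sessionsDir: string, issueNumber: number): string {
  return path.join(sessionsDir, `${issueNumber}-queue.json`);
}

// Called in doPollGitHubQueue right after dispatchingIssues.add(top.issue.number).
async function writeSidecar(sessionsDir: string, issueNumber: number): Promise<void> {
  const now = Date.now();
  const body = JSON.stringify(
    { context: { taskCandidate: { issueNumber } }, ts: now, expires: now + SIDECAR_TTL_MS },
    null,
    2,
  );
  // errors swallowed: a sidecar write failure must not block the dispatch itself
  await fs.writeFile(sidecarPath(sessionsDir, issueNumber), body).catch(() => {});
}

// Called from the pipeline Promise's .then/.catch completion handler.
async function removeSidecar(sessionsDir: string, issueNumber: number): Promise<void> {
  await fs.unlink(sidecarPath(sessionsDir, issueNumber)).catch(() => {});
}
```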

---

## Summary Assessment

| Fix | Correct? | Complete? | Loop-stopper alone? | Deploy order |
|---|---|---|---|---|
| Fix 1 (timeout threading) | Yes | Needs all 3 spawn sites | No | 2nd |
| Fix 2 (PipelineOutcome + label) | Yes | Needs: correct label in excludeLabels, unassign on escalation | **Yes** | 1st |
| Fix 3 (sidecar idempotency) | Conceptually yes | Needs placement correction + expiry + daemon cleanup | No | 3rd |

**The three proposed fixes are sufficient to stop the loop** (Fix 2 alone stops it) **and together provide a robust termination contract.** The gaps are correctness/completeness issues, not show-stoppers.

**Fix 2 alone is sufficient to stop the infinite re-selection loop.** However, without Fix 1, every new issue dispatched through the FULL pipeline will immediately escalate at discovery (30-minute timeout vs 55-minute budget) and be labeled on its first attempt. Fix 1 is needed to make the pipeline functionally completable.

**Coordinator close/unassign on success:** The pipeline does not need to explicitly close the issue on success -- that is handled by GitHub's "closes #N" keyword in the PR description. But on escalation, the bot should unassign itself to make the issue's human-actionable status clear. This is an addition to Fix 2, not a separate fix.

---

## Decision Log

**Direction confirmed after review:** Candidate 2 (complete fix set: Fix 1+2+3).

**Revision from review:** Use `worktrain:in-progress` as the escalation label instead of `worktrain:blocked`. Rationale: `worktrain:in-progress` is already in `queueConfig.excludeLabels` (confirmed at `polling-scheduler.ts:505`). Introducing a new label creates a deploy-time config dependency with no code enforcement -- if the label is applied but not in `excludeLabels`, the loop continues silently. Using the existing excluded label eliminates this risk.

**Runner-up:** Candidate 1 (Fix 2 only) -- valid only if deployed in the same PR as Fix 1. Deploying Fix 2 alone would permanently label every new issue after its first 30-minute timeout, creating operational toil.

**Deployment sequence:**
- PR #1: Fix 2 (PipelineOutcome inspection + label application). Independent and immediately verifiable.
- PR #2: Fix 1 + Fix 3 (timeout threading + sidecar idempotency). Dependent on PR #1 for full correctness but can be developed in parallel.

**Required revisions to Fix 2 implementation:**
1. Change the `dispatchAdaptivePipeline` cast from `Promise<unknown>` to `Promise<PipelineOutcome>` (import the type)
2. Change `.then(() => {` to `.then((outcome) => {`
3. Handle all three `PipelineOutcome` kinds exhaustively (merged: log; escalated: apply label + unassign; dry_run: log)
4. Apply `worktrain:in-progress` (not a new label) on escalation
5. Unassign bot from issue on escalation
6. Add explicit `catch` on GitHub API label write with structured log entry

**Confidence:** High -- source code verified line by line, root causes confirmed, all three fixes validated.

**Residual risks:**
- `full-pipeline.ts` spawn site line numbers not directly verified (high confidence from matching TIMEOUT constants; low verification risk)
- N-strike mechanism (label only after 3+ escalations) deferred to follow-on
- Coordinator-owns-termination (Candidate 3) logged as architectural technical debt

**Follow-on issues to file:**
- N-strike mechanism for `worktrain:in-progress` label application
- Coordinator-Owns-Termination refactor (move GitHub lifecycle management from poller to coordinator)
- Exhaustive outcome handling in `.then()` -- enforce via TypeScript switch exhaustiveness

---

## Final Summary

**Selected path:** full_spectrum (source code verification + gap analysis)

**Problem framing:** WorkTrain's issue selection has no effective termination gate. Once an issue enters the discovery pipeline, no code path reliably marks it as processed or blocked, so it loops indefinitely. The three proposed fixes target the correct code paths.

**Landscape takeaways:** All three proposed fixes are confirmed correct by direct source inspection. The `Promise<unknown>` cast in `polling-scheduler.ts:610` is the most dangerous antipattern -- it silently discards type information that would otherwise prevent the loop. The `persistTokens` function has never written a `context` field in its history, making the idempotency guard permanently dead.

**Chosen direction:** Complete fix set (Fix 1+2+3), deployed as two sequential PRs.

**Strongest alternative:** Fix 2 only -- valid for immediate loop stopping but leaves the pipeline unable to complete (Fix 1 is necessary for sessions to run to completion).

**Why Candidate 2 won:** Deploying Fix 2 alone would cause every future FULL-mode issue to immediately fail at discovery (30-min timeout, labeled, stopped permanently). Fix 1 is the necessary pairing. Fix 3 adds crash safety at low cost.

**Confidence:** High -- all evidence is from direct source inspection, root causes confirmed, no contradictions.

**Residual risks:**
- `full-pipeline.ts` spawn site line numbers not directly verified (high confidence from matching TIMEOUT constants)
- N-strike escalation mechanism deferred to follow-on

**Next actions:**
1. Open PR #1: Fix 2 (inspect PipelineOutcome, apply `worktrain:in-progress` on escalation, unassign bot, add catch on label write, handle all 3 outcome kinds)
2. Open PR #2: Fix 1 (thread `maxSessionMinutes` through `spawnSession` for all 3 spawn sites) + Fix 3 (sidecar in `doPollGitHubQueue` with `<issueNumber>-queue.json`, expires TTL)
3. File follow-on issues: N-strike mechanism, Coordinator-Owns-Termination refactor

@@ -0,0 +1,188 @@
# Discovery Loop Investigation A

**Date:** 2026-04-21
**Bug:** WorkTrain autonomous pipeline repeatedly runs `wr.discovery` sessions on GitHub issue #393 without progressing past discovery.

---

## 1. What the Logs Actually Show

### 1.1 Timeline

| Time (approx) | Event |
|---|---|
| 18:16 | `trigger_fired` + `session_started` for `wr.coding-task` (issue #393) |
| 18:19 | Session completes `outcome=success`. Discovery step done, but issue still open. |
| 19:53 | First `session_started` for `wr.discovery` (no trigger_fired -- in-process dispatch) |
| 19:53 - 23:54 | Six `wr.discovery` sessions start in rapid succession without any `session_completed` events between them (concurrent spawning race) |
| 23:24 - 07:24 | Hourly `wr.discovery` cycle: `session_started` at :24, `session_completed outcome=timeout detail=wall_clock` 30 min later at :54, followed immediately by next start |

### 1.2 Evidence from `queue-poll.jsonl`

- Issue #393 is selected on **every single poll cycle** (every 5 minutes) starting at 19:10 on Apr 20.
- The selection reason never changes: `maturity=specced, reason="has acceptance criteria or checklist"`.
- The `task_skipped` entries with `reason=active_session_in_process` appear only briefly (during the few minutes the `dispatchingIssues` in-memory Set holds the issue number). They disappear as soon as the pipeline Promise settles.
- No `task_skipped` with `reason=active_session` (idempotency file scan) ever appears for issue #393.

### 1.3 Evidence from `daemon/2026-04-21.jsonl`

All 14 `wr.discovery` `session_completed` events have `outcome=timeout, detail=wall_clock`. None have `outcome=success`. The pipeline never logs escalation to an outbox nor advances to shaping. No `session_started` event ever appears for `wr.shaping` or the coding workflow for this issue.

### 1.4 Evidence from `daemon.stdout.log`

The repeating pattern (confirmed at every cycle):

```
[QueuePoll] in-flight-clear #393 reason=completed
[QueuePoll] cycle start repo=EtienneBBeaulac/workrail issues_fetched=1
[QueuePoll] selected #393 ... maturity=specced
[QueuePoll] in-flight-add #393
[QueuePoll] dispatched via adaptivePipeline
[TriggerRouter] Pre-allocated session dispatched: workflowId=wr.discovery
[WorkflowRunner] Session started: workflowId=wr.discovery
... (30 min elapses) ...
[TriggerRouter] Dispatch timed out: workflowId=wr.discovery reason=wall_clock message=Workflow timed out after 30 minutes
[QueuePoll] in-flight-clear #393 reason=completed
(next cycle repeats)
```

Crucially, `[full-pipeline]` log lines never appear. The coordinator's inner log output is absent because `runAdaptivePipeline` runs inside `dispatchAdaptivePipeline` on the async queue, and no `stderr` bridge surfaces those logs in `daemon.stdout.log`. The FULL pipeline IS running, but it is invisible except for the child `wr.discovery` session dispatches it produces.

---

## 2. Exact Code Path

The loop follows this execution path on every cycle:

```
PollingScheduler.doPollGitHubQueue()
  -> issue #393 not in dispatchingIssues (cleared from previous cycle)
  -> checkIdempotency(393, sessionsDir) -> 'clear' [** root cause #1 **]
  -> no 'worktrain:in-progress' label [** root cause #2 **]
  -> dispatchingIssues.add(393)
  -> router.dispatchAdaptivePipeline(goal, workspace) [fire-and-forget Promise]
    -> runAdaptivePipeline(deps, opts) -> FULL mode
      -> runFullPipelineCore()
        -> spawnSession('wr.discovery', goal, workspace) [no context passed]
          -> router.dispatch({ workflowId: 'wr.discovery', goal, ... })
            -> runWorkflow() -> wall_clock timer = 30 min (DEFAULT_SESSION_TIMEOUT_MINUTES)
            -> After 30 min: daemonRegistry.unregister(id, 'failed')
            -> session .json file DELETED (line 3867 of workflow-runner.ts)
            -> returns _tag:'timeout'
        -> awaitSessions([discoveryHandle], 55_min)
          -> polls ConsoleService.getSessionDetail(discoveryHandle)
          -> engineState.kind !== 'complete' (agent was aborted mid-run)
          -> status = 'dormant' (no new events, no completion)
          -> does NOT match 'complete' | 'complete_with_gaps' | 'blocked'
          -> keeps polling for 55 minutes
          -> after 55 min: returns outcome='timeout' for discoveryHandle
        -> discoveryResult.outcome === 'timeout' != 'success'
        -> returns { kind: 'escalated', escalationReason: { phase: 'discovery' } }
  -> dispatchAdaptivePipeline Promise resolves
  -> dispatchP.then(): dispatchingIssues.delete(393)

(55-60 min later, next QueuePoll cycle)
  -> issue #393 not in dispatchingIssues (cleared)
  -> checkIdempotency(393) -> 'clear' again
  -> LOOP REPEATS
```

**Total cycle duration:** ~55 minutes (dominated by awaitSessions timeout). This matches the ~1-hour spacing between `session_started` events in the logs (e.g. 23:24, 00:24, 01:24, 02:24, 03:24, 04:24, 05:24, 06:24).

---

## 3. Root Causes

### Root Cause #1 (Primary): `checkIdempotency` never fires for queue-originated sessions

`checkIdempotency` in `src/trigger/adapters/github-queue-poller.ts` scans `~/.workrail/daemon-sessions/*.json` for files where `context.taskCandidate.issueNumber === issueNumber`.

`persistTokens` in `src/daemon/workflow-runner.ts` (line 714) writes:
```json
{ "continueToken": "...", "checkpointToken": "...", "ts": 1234567890 }
```

There is **no `context` field** in this file. The `taskCandidate` context is passed to `dispatchAdaptivePipeline` but is never threaded into the `spawnSession` call for `wr.discovery` (see `full-pipeline.ts` lines 206-210 -- `spawnSession('wr.discovery', opts.goal, opts.workspace)` with no 4th `context` argument).

Because `persistTokens` never writes `context`, `checkIdempotency` hits the `continue` branch for every session file and returns `'clear'`. The idempotency guard designed to prevent re-dispatch is permanently bypassed.

Additionally, on wall_clock timeout, the `.json` file is **explicitly deleted** (workflow-runner.ts line 3867):
```typescript
await fs.unlink(path.join(DAEMON_SESSIONS_DIR, `${sessionId}.json`)).catch(() => {});
```
Even if the context were present, the file is gone before the next poll cycle.

### Root Cause #2 (Secondary): No `worktrain:in-progress` label is ever applied to the issue

The H3 guard in `polling-scheduler.ts` (line 505) checks for the `worktrain:in-progress` label. This label is never added to issue #393 -- not by the queue poller at dispatch time, and not by the `wr.discovery` agent during its run. Without this label, the H3 guard provides no protection.

### Root Cause #3 (Contributing): `wr.discovery` session timeout mismatch

`DEFAULT_SESSION_TIMEOUT_MINUTES = 30` (workflow-runner.ts line 83) applies to `wr.discovery` sessions dispatched via `router.dispatch()`. `DISCOVERY_TIMEOUT_MS = 55 minutes` (adaptive-pipeline.ts line 39) is how long `awaitSessions` waits for the session to complete. The session always dies at 30 minutes, but `awaitSessions` wastes the remaining 25 minutes polling a `dormant` session before timing out itself. This 55-minute wait (25 minutes of it wasted) is why the loop period is ~1 hour rather than ~30 minutes.

The mismatch happens because `spawnSession` calls `router.dispatch()` without passing `agentConfig.maxSessionMinutes`, so the inner `wr.discovery` session inherits the global default (30 min) rather than the coordinator's expected 55-minute budget.

### Root Cause #4 (Contributing): Escalation is silent

When the FULL pipeline escalates at the discovery phase, no signal is sent to the human outbox and the GitHub issue remains open and assigned. The queue poller has no way to distinguish "this issue has been attempted and failed" from "this issue has never been attempted." The same issue is therefore eligible for re-selection on every cycle.

---

## 4. Recommended Fixes

### Fix 1 (Required): Thread `issueNumber` into `spawnSession` context for the outer session

In `trigger-listener.ts`, the `spawnSession` implementation should accept and write `context` into the `persistTokens` payload. Specifically: when `dispatchAdaptivePipeline` is called with a `context` containing `taskCandidate.issueNumber`, that context should be stored in the session sidecar so `checkIdempotency` can find it.

One clean approach: pass `taskCandidate` as a context field in `spawnSession`, and extend `persistTokens` to write `context` alongside the tokens:

```typescript
// In persistTokens (workflow-runner.ts):
const state = JSON.stringify({
  continueToken,
  checkpointToken,
  ts: Date.now(),
  ...(worktreePath !== undefined ? { worktreePath } : {}),
  ...(context !== undefined ? { context } : {}), // ADD THIS
}, null, 2);
```

And in `trigger-listener.ts`, thread `opts.taskCandidate` into the `spawnSession` → `dispatch` call for the top-level pipeline session.
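
With `context` persisted, the sidecar scan has something to match. A simplified illustration of the ownership check -- not the real `checkIdempotency` implementation -- using the file layout described above:

```typescript
// Illustration only: with the change above, a session file would contain e.g.
//   { "continueToken": "...", "checkpointToken": "...", "ts": 1234567890,
//     "context": { "taskCandidate": { "issueNumber": 393 } } }
// so a scan of this shape would report the issue as owned instead of 'clear'.
import { promises as fs } from 'node:fs';
import path from 'node:path';

async function issueHasActiveSession(sessionsDir: string, issueNumber: number): Promise<boolean> {
  for (const file of await fs.readdir(sessionsDir)) {
    if (!file.endsWith('.json')) continue;
    const raw = JSON.parse(await fs.readFile(path.join(sessionsDir, file), 'utf8'));
    if (raw?.context?.taskCandidate?.issueNumber === issueNumber) return true;
  }
  return false;
}
```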

### Fix 2 (Required): Apply `worktrain:in-progress` label at dispatch time

When `doPollGitHubQueue` selects an issue and calls `dispatchAdaptivePipeline`, it should immediately apply a `worktrain:in-progress` label to the GitHub issue via the API. This provides a cross-restart, cross-process guard that survives daemon restarts and does not depend on in-memory state or sidecar file scanning.

Remove the label when the pipeline completes (success or escalation). This is the most robust idempotency mechanism for a persistent external queue.
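
A minimal sketch of the two API calls, assuming an Octokit REST client (the poller's actual GitHub client wrapper may differ); the repo and issue number are the ones from the logs above:

```typescript
import { Octokit } from '@octokit/rest';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const repo = { owner: 'EtienneBBeaulac', repo: 'workrail' };

// Immediately after the issue is selected, before dispatchAdaptivePipeline:
await octokit.rest.issues.addLabels({ ...repo, issue_number: 393, labels: ['worktrain:in-progress'] });

// In the pipeline Promise's completion handler (success or escalation):
await octokit.rest.issues
  .removeLabel({ ...repo, issue_number: 393, name: 'worktrain:in-progress' })
  .catch(() => { /* label may already be gone; never fail the poll loop over this */ });
```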

### Fix 3 (Required): Pass `maxSessionMinutes` to inner `wr.discovery` spawn

In `full-pipeline.ts`, pass `agentConfig.maxSessionMinutes` when calling `spawnSession` for `wr.discovery`:

```typescript
const discoverySpawnResult = await deps.spawnSession(
  'wr.discovery',
  opts.goal,
  opts.workspace,
  undefined, // no taskCandidate context for the child session
  { maxSessionMinutes: Math.floor(DISCOVERY_TIMEOUT_MS / 60_000) }, // 55 min
);
```

This aligns the child session's wall_clock timeout with the coordinator's `awaitSessions` timeout and eliminates the 25-minute polling waste.

### Fix 4 (Required): Write to outbox on discovery escalation

In `full-pipeline.ts` (and the overall FULL pipeline escalation path), call `deps.postToOutbox` when discovery times out, so the operator knows the pipeline failed and can intervene. The current silence means the issue stays open and the loop continues indefinitely.
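
A sketch of that notice, as a fragment placed in the discovery-escalation branch; the payload fields and the `postToOutbox` signature are illustrative assumptions, not verified against source:

```typescript
// Illustrative fragment only -- `deps`, `opts`, and `discoveryResult` are the surrounding
// names from this document; the payload shape is an assumption.
if (discoveryResult.outcome !== 'success') {
  await deps.postToOutbox({
    kind: 'pipeline_escalated',
    issueNumber: opts.taskCandidate?.issueNumber,
    phase: 'discovery',
    reason: 'wr.discovery timed out (wall_clock) without completing',
  });
}
```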

### Fix 5 (Optional but recommended): Add `in_progress` → escalation-marker to session status mapping

In `awaitSessions`, treat a session whose `daemonRegistry` status is `'failed'` as outcome `'timeout'` immediately rather than polling until `awaitSessions` itself times out. This requires surfacing the registry status through `ConsoleService.getSessionDetail`, which is a more invasive change -- implement after Fixes 1-4.

---

## 5. Issue Triage

The correct fix ordering is: **Fix 2 (label) first** for immediate protection, then **Fix 1 (context threading)** to make the sidecar guard work properly, then **Fix 3 (timeout alignment)** to eliminate wasted polling, then **Fix 4 (escalation notice)** for operator visibility.

Fix 2 alone would stop the loop immediately because the `worktrain:in-progress` label check runs before `checkIdempotency` in `polling-scheduler.ts` line 505, and does not depend on any file state or memory state.