npm - @yemi33/minions - Versions diffs - 0.1.2043 → 0.1.2045 - Mend

@yemi33/minions 0.1.2043 → 0.1.2045


Files changed (21)

package/dashboard/js/command-center.js +64 -7
package/dashboard/js/refresh.js +146 -3
package/dashboard/js/render-prs.js +43 -9
package/dashboard/js/settings.js +4 -0
package/dashboard/styles.css +21 -0
package/dashboard.js +21 -79
package/docs/auto-discovery.md +3 -1
package/docs/qa-runbook-lifecycle.md +71 -0
package/docs/qa-runbooks.md +6 -5
package/docs/runtime-adapters.md +1 -1
package/docs/security.md +2 -1
package/docs/watches.md +19 -19
package/engine/cleanup.js +84 -2
package/engine/dispatch.js +6 -0
package/engine/kb-sweep.js +127 -0
package/engine/lifecycle.js +18 -0
package/engine/queries.js +84 -7
package/engine/shared.js +36 -0
package/engine/timeout.js +4 -0
package/engine.js +240 -11
package/package.json +1 -1

package/docs/qa-runbook-lifecycle.md ADDED

@@ -0,0 +1,71 @@
+# QA runbook lifecycle (W-mpeiwz6k0005bf34)
+Validation runbooks dispatched against live managed instances. Mirrors the
+managed-spawn lifecycle (declare → engine spawns → healthcheck → observable)
+but optimized for human/agent-driven smoke + E2E flows. Surfaced on the
+`/qa` dashboard page (`dashboard/pages/qa.html`, `dashboard/js/qa.js`); full
+live-process inventory remains on `/engine` (do NOT mirror it; see
+W-mpdad3mq000m53bb).
+## Runbook location
+`qa-runbooks.json` (engine state, JSON list keyed by `id`). Each entry:
+`{ id, name, target, steps, expectedArtifacts, createdAt, createdBy }`.
+CRUD via `GET/POST /api/qa/runbooks` (POST returns the new runbook with
+engine-assigned `id`). `target` is a name from `/api/managed-processes` or
+`/api/keep-processes` (deduped by `<project>::<name>`, managed wins).
+## Run-record path
+`qa-runs.json` (newest-first, capped). Each run:
+`{ id, runbookId, runbookName, target, status, startedAt, completedAt, workItemId, agentId, artifacts }`.
+`status` ∈ `pending|dispatched|running|passed|failed|error`. Created by
+`POST /api/qa/runbooks/run` (`{ id }`), which dispatches a `qa-validate`
+work item and seeds the run with `dispatched`. Read via
+`GET /api/qa/runs?limit=50` — UI polls every 5s while the QA page is active
+and clears the interval on page navigation via the `switchPage` wrapper in
+`dashboard/js/qa.js` (matches `_stopPlanPoll`/`_stopMeetingPoll` pattern in
+`dashboard/js/state.js`).
+## Artifact contract
+`engine/qa-artifacts/<runId>/<file>`, served via
+`GET /api/qa/artifacts/<runId>/<file>`. Files are agent-uploaded screenshots,
+video recordings, and log captures listed in the run record's
+`artifacts: [{ file, kind }]`. UI auto-detects:
+`screenshot`/`png|jpg|jpeg|gif|webp|svg` → `<img>`;
+`video`/`mp4|webm|ogg|mov` → `<video controls>`; everything else → log
+preview (first 40 lines fetched lazily) with a `View full` link to the same
+endpoint. **No direct filesystem paths are exposed** — every artifact URL
+goes through `/api/qa/artifacts/` so path traversal is server-gated.
+Optional config: `engine.qaArtifactsMaxBytes` caps per-file upload size;
+when set, dashboard Settings exposes a matching toggle (CLAUDE.md
+best-practice #15).
+## Agent sidecar shape
+The `qa-validate` agent writes `agents/<id>/qa-run.json` before exit:
+```json
+{ "runId": "qa-run-<id>",
+  "status": "passed|failed|error",
+  "summary": "...",
+  "artifacts": [ { "file": "dashboard.png", "kind": "screenshot" },
+                 { "file": "test.log",      "kind": "log" } ],
+  "written_by": "<agentId>", "wi_id": "<workItemId>" }
+```
+The engine reads the sidecar in `onAgentClose`, copies the listed files into
+`engine/qa-artifacts/<runId>/`, and stamps the run record with `status`,
+`completedAt`, and the recorded `artifacts`. Sidecar validation failure is
+non-retryable (`failure_class: invalid-qa-run`); listed files outside the
+agent worktree or larger than `qaArtifactsMaxBytes` are rejected without
+stamping the run.
+## Entry point
+`playbooks/qa-validate.md`. Dispatched by `POST /api/qa/runbooks/run`;
+receives `target`, `steps`, `expectedArtifacts` as template vars; required
+to write the sidecar above before exit. Routing line in `routing.md` maps
+the synthetic `qa-validate` task-type to the playbook so manual dispatches
+work too.
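
The artifact auto-detection rule described in the added doc (the recorded `kind` or the file extension decides the preview element) could look roughly like the sketch below. It is not taken from `dashboard/js/qa.js`, which this diff does not show. `previewKind` and `artifactUrl` are hypothetical names, and treating either a matching `kind` or a matching extension as sufficient is an assumption.

```javascript
// Hypothetical helpers illustrating the documented preview rule.
// The real UI code lives in dashboard/js/qa.js and is not part of this diff.
const IMAGE_EXT = new Set(['png', 'jpg', 'jpeg', 'gif', 'webp', 'svg']);
const VIDEO_EXT = new Set(['mp4', 'webm', 'ogg', 'mov']);

// Returns 'img', 'video', or 'log' for an entry of the run record's
// `artifacts: [{ file, kind }]` list.
function previewKind(artifact) {
  const file = String((artifact && artifact.file) || '');
  const kind = String((artifact && artifact.kind) || '').toLowerCase();
  const dot = file.lastIndexOf('.');
  const ext = dot >= 0 ? file.slice(dot + 1).toLowerCase() : '';
  if (kind === 'screenshot' || IMAGE_EXT.has(ext)) return 'img';
  if (kind === 'video' || VIDEO_EXT.has(ext)) return 'video';
  return 'log'; // everything else gets the lazy log preview
}

// Artifact URLs always go through the API endpoint, never a filesystem path.
function artifactUrl(runId, file) {
  return `/api/qa/artifacts/${encodeURIComponent(runId)}/${encodeURIComponent(file)}`;
}
```

With the sidecar example from the doc, `dashboard.png` would map to `'img'` and `test.log` to `'log'`.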

package/docs/qa-runbooks.md CHANGED

@@ -95,10 +95,11 @@ All writes use `mutateJsonFileLocked` per the repo convention. Deletes use
 unlink (so an in-progress `saveRunbook` rename can't race with the
 unlink).
-## Out of scope (deferred items)
+## Run records, artifacts, and UI
-This module deliberately does NOT:
+The deferred follow-up items (W-mpeiwz6k0005bf34-b/c/d) have since landed. Brief pointers — see [CLAUDE.md](../CLAUDE.md) → "QA validation runs" for the deep dive:
-- Spawn a QA agent or dispatch a run (W-mpeiwz6k0005bf34-c).
-- Persist run records or artifacts (W-mpeiwz6k0005bf34-b).
-- Render any UI (W-mpeiwz6k0005bf34-d).
+- **Run dispatch + persistence** (`engine/qa-runs.js`): `POST /api/qa/runbooks/run` creates a `qa-runs.json` record with `status ∈ pending|dispatched|running|passed|failed|error` and dispatches a `qa-validate` work item against the runbook's `target`. Read via `GET /api/qa/runs?limit=N&status=...` and `GET /api/qa/runs/<id>`.
+- **Artifact contract**: the `qa-validate` agent writes `agents/<id>/qa-run.json` before exit; the engine copies listed files into `engine/qa-artifacts/<runId>/` and serves them via `GET /api/qa/artifacts/<runId>/<file>` (path-traversal-gated, 403 on escape). Per-file size cap: `engine.qaArtifactsMaxBytes`.
+- **UI**: `/qa` dashboard page (`dashboard/pages/qa.html`, `dashboard/js/qa.js`) polls `GET /api/qa/runs` every 5s while active; auto-detects screenshots/videos/logs for inline preview.
+- **Playbook**: `playbooks/qa-validate.md` (routed via the synthetic `qa-validate` task-type in `routing.md`).
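
The "path-traversal-gated, 403 on escape" behaviour mentioned for `GET /api/qa/artifacts/<runId>/<file>` is commonly implemented by resolving the requested path and checking that it stays under the run directory. The sketch below shows that general technique only. `resolveArtifactPath` is a hypothetical name and the actual handler in the package is not shown in this diff.

```javascript
const path = require('path');

// Hypothetical resolver: returns the absolute artifact path, or null when
// either `runId` or `file` would escape the artifacts root. A caller would
// answer 403 on null, matching the documented behaviour.
function resolveArtifactPath(artifactsRoot, runId, file) {
  const root = path.resolve(artifactsRoot);
  const runDir = path.resolve(root, String(runId));
  // The run directory itself must be a strict child of the root.
  if (!runDir.startsWith(root + path.sep)) return null;
  const full = path.resolve(runDir, String(file));
  // The file must be a strict child of the run directory.
  if (!full.startsWith(runDir + path.sep)) return null;
  return full;
}
```

Comparing against `runDir + path.sep` rather than `runDir` alone avoids accepting a sibling directory that merely shares the prefix (for example `run1-evil` next to `run1`).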

package/docs/runtime-adapters.md CHANGED

@@ -103,7 +103,7 @@ directly.
 Agent dispatch resolves the runtime once at spawn time:
 ```js
-// engine.js spawnAgent (~line 1158)
+// engine.js spawnAgent (~line 1866)
 const runtime = resolveRuntime(shared.resolveAgentCli(agentConfig, engineConfig));
 ```

package/docs/security.md CHANGED

@@ -60,7 +60,8 @@ system. Its threat model:
   operator visits could in principle issue requests to `http://127.0.0.1:7331`.
   The dashboard defends against this with:
   - An **Origin gate** on mutating methods (`POST`/`PUT`/`PATCH`/`DELETE`)
-    and CORS preflights — see `dashboard.js` ~3677–3730 and
+    and CORS preflights — see `dashboard.js` ~4565–4609 (and additional
+    `isAllowedOrigin` enforcement points in the SSE/CC handlers) and
     `shared.isAllowedOrigin` / `shared.buildSecurityHeaders` in
     [`engine/shared.js`](../engine/shared.js). Requests whose `Origin` (or
     `Referer`, if `Origin` is absent) is not in the local allowlist are

package/docs/watches.md CHANGED

@@ -20,11 +20,11 @@ A watch is a small JSON record persisted to `engine/watches.json`. It binds:
 | `requires`   | Optional guard: array of predicate objects evaluated against `state` / `entity` / `prevState`; trigger is suppressed when any guard fails (false-or-error). Used to gate a watch on "PR is mergeable AND build passing" etc. |
 | `status`     | `WATCH_STATUS.ACTIVE` \| `PAUSED` \| `TRIGGERED` \| `EXPIRED`            |
-`createWatch()` allocates a `watch-<uid>` id, defaults the fields above, and persists atomically via `mutateJsonFileLocked` *(source: `engine/watches.js:184-247`)*.
+`createWatch()` allocates a `watch-<uid>` id, defaults the fields above, and persists atomically via `mutateJsonFileLocked` *(source: `engine/watches.js:184-248`)*.
 ## Lifecycle (`WATCH_STATUS`)
-Defined in `engine/shared.js:1875`:
+Defined in `engine/shared.js:2523`:
 | Status      | Meaning                                                                 |
 |-------------|-------------------------------------------------------------------------|
@@ -37,10 +37,10 @@ Pause/resume flips the `status` field via `POST /api/watches/update` *(source: `
 ## Conditions (`WATCH_CONDITION`)
-Defined in `engine/shared.js:1891-1929`. Conditions split into two families:
+Defined in `engine/shared.js:2539-2577`. Conditions split into two families:
 ### Absolute conditions (`WATCH_ABSOLUTE_CONDITIONS`)
-*(source: `engine/shared.js:1938-1954`)*
+*(source: `engine/shared.js:2586-2602`)*
 `merged`, `build-fail`, `build-pass`, `completed`, `failed`, `concluded`, `approved`, `rejected`, `ready-for-merge`, `retry-limit-reached`, `all-items-done`, `item-failed-n-times`.
@@ -49,12 +49,12 @@ When `stopAfter === 0`, these are **fire-once** — the engine flips the watch t
 > **Per-target override (W-mp7hg58e000b5212):** the global `WATCH_ABSOLUTE_CONDITIONS` set is the legacy fallback. Each target type now declares its own `absoluteConditions: [...]` array in its spec; `registerTargetType` normalizes that into a `Set` that takes precedence at evaluation time. The plugin contract (see below) uses this to keep absolute-vs-change semantics local to each target type. Plugins that omit `absoluteConditions` get an empty set (all change-based).
 ### Change-based conditions
-`status-change`, `any`, `new-comments`, `vote-change`, `stage-complete`, `ran`, `enabled`, `disabled`, `activity-change`, plus the predicate conditions added under P-w4e2f6a1 / P-w5b8d2c9 for the `pr`, `work-item`, `plan`, and `pipeline` target types (`head-commit-change`, `mergeable-flipped`, `behind-master`, `draft-flipped`, `stalled`, `dependency-met`, `stage-advanced`, `stuck-in-stage`). See `engine/shared.js:2422-2460` for the canonical enum.
+`status-change`, `any`, `new-comments`, `vote-change`, `stage-complete`, `ran`, `enabled`, `disabled`, `activity-change`, plus the predicate conditions added under P-w4e2f6a1 / P-w5b8d2c9 for the `pr`, `work-item`, `plan`, and `pipeline` target types (`head-commit-change`, `mergeable-flipped`, `behind-master`, `draft-flipped`, `stalled`, `dependency-met`, `stage-advanced`, `stuck-in-stage`). See `engine/shared.js:2539-2577` for the canonical enum.
 These compare the live entity against the watch's `_lastState` snapshot and run forever when `stopAfter === 0`. Baseline `_lastState` is captured on the first check so the very next change triggers the watch *(source: `engine/watches.js:434, 520`)*.
 ### Tick-counted conditions
-`stalled`, `stuck-in-stage` — require N consecutive unchanged captures (default `WATCH_STALLED_DEFAULT_TICKS = 12`, `WATCH_STUCK_STAGE_DEFAULT_TICKS = 12`, both in `engine/shared.js:1934-1935`). Counters (`_unchangedTicks`, `_stuckStageTicks`) are recomputed inside `_captureState` by comparing the fresh snapshot against `prevState`.
+`stalled`, `stuck-in-stage` — require N consecutive unchanged captures (default `WATCH_STALLED_DEFAULT_TICKS = 12`, `WATCH_STUCK_STAGE_DEFAULT_TICKS = 12`, both in `engine/shared.js:2582-2583`). Counters (`_unchangedTicks`, `_stuckStageTicks`) are recomputed inside `_captureState` by comparing the fresh snapshot against `prevState`.
 ### Predicate conditions
@@ -65,11 +65,11 @@ Several condition keys evaluate a derived predicate on the captured entity/state
 - **plan** — `all-items-done` (`items_done === items_total > 0`), `item-failed-n-times` (any `missing_features[*]._retryCount >= ENGINE_DEFAULTS.maxRetries`).
 - **pipeline** — `stage-advanced` (`current_stage_id` changed within the same `runId`), `stuck-in-stage` (current stage unchanged for `WATCH_STUCK_STAGE_DEFAULT_TICKS` checks, default 12).
-Compound state-assertion predicates (`ready-for-merge`, `retry-limit-reached`, `all-items-done`, `item-failed-n-times`) live in `WATCH_ABSOLUTE_CONDITIONS` so they fire-once when `stopAfter === 0` — without that they would re-fire every tick while the assertion holds *(source: `engine/shared.js:1938` `WATCH_ABSOLUTE_CONDITIONS`)*.
+Compound state-assertion predicates (`ready-for-merge`, `retry-limit-reached`, `all-items-done`, `item-failed-n-times`) live in `WATCH_ABSOLUTE_CONDITIONS` so they fire-once when `stopAfter === 0` — without that they would re-fire every tick while the assertion holds *(source: `engine/shared.js:2586` `WATCH_ABSOLUTE_CONDITIONS`)*.
 ## Target Types — `TARGET_TYPES` Registry
-Target-type behavior in `engine/watches.js` is **data-driven via a registry** *(source: `engine/watches.js:124-160`)*. Each spec must provide:
+Target-type behavior in `engine/watches.js` is **data-driven via a registry** *(source: `engine/watches.js:124-153`)*. Each spec must provide:
 - `label` — human name shown in dashboard pickers
 - `description` — short help text
@@ -79,17 +79,17 @@ Target-type behavior in `engine/watches.js` is **data-driven via a registry** *(
 - `captureState(entity)` — snapshot used for change-detection diffs
 - `evaluate(condition, entity, prevState, target)` — returns `{ triggered, message }`
-The registry IS the allowlist for `createWatch` and `/api/watches/target-types`; the old hardcoded "pr or work-item" check is gone. Add a new target type at runtime with `registerTargetType(type, spec)` and look one up with `getTargetType(type)`. `listTargetTypes()` returns the serializable form used by the dashboard *(source: `engine/watches.js:124-178`)*.
+The registry IS the allowlist for `createWatch` and `/api/watches/target-types`; the old hardcoded "pr or work-item" check is gone. Add a new target type at runtime with `registerTargetType(type, spec)` and look one up with `getTargetType(type)`. `listTargetTypes()` returns the serializable form used by the dashboard *(source: `engine/watches.js:124-174`)*.
 ### User-extensible via `watches.d/` (W-mp7hg58e000b5212)
-At engine boot, every `*.js` file in `<MINIONS_DIR>/watches.d/` is auto-loaded **after** the built-in registrations *(source: `engine/watches.js:1314-1340`)*, so plugins can both add new target types and override built-ins. A plugin file exports either `{ name, spec }` or an array of such objects. Failures are logged-and-skipped — one bad plugin must not break boot or block other plugins. Reloads require an engine restart.
+At engine boot, every `*.js` file in `<MINIONS_DIR>/watches.d/` is auto-loaded **after** the built-in registrations *(source: `engine/watches.js:1319-1354`)*, so plugins can both add new target types and override built-ins. A plugin file exports either `{ name, spec }` or an array of such objects. Failures are logged-and-skipped — one bad plugin must not break boot or block other plugins. Reloads require an engine restart.
 Canonical example: `watches.d/http.js` (W-mp7i22mu00191b07) — a generic HTTP poller covering the full plugin contract including `extractState` (custom snapshot fields not on the entity itself) and `extendTemplateVars` (custom action-template vars like `{{httpStatus}}`, `{{prevExtracted}}`).
 ### Built-in target types
-The eight built-ins are registered at module load *(source: `engine/watches.js:672-1313`)*. Constants live at `engine/shared.js:2412-2421` (`WATCH_TARGET_TYPE`).
+The eight built-ins are registered at module load *(source: `engine/watches.js:672-1313`)*. Constants live at `engine/shared.js:2529-2538` (`WATCH_TARGET_TYPE`).
 | `targetType`  | Target value                         | Conditions                                                                 | Notes |
 |---------------|--------------------------------------|----------------------------------------------------------------------------|-------|
@@ -102,11 +102,11 @@ The eight built-ins are registered at module load *(source: `engine/watches.js:6
 | `dispatch`    | Dispatch entry id                    | `completed`, `failed`, `status-change`, `any`                              | Looks across `pending` / `active` / `completed` lists |
 | `agent`       | Agent id                             | `activity-change`, `status-change`, `any`                                  | `activity-change` fires only on transitions in/out of `'working'` |
-`evaluateWatch` dispatches to `tt.evaluate(...)`; unknown target types return `"Unknown target type: ..."` and unknown conditions return `"Unknown condition: ..."` — both are non-triggering *(source: `engine/watches.js:318-360`)*.
+`evaluateWatch` dispatches to `tt.evaluate(...)`; unknown target types return `"Unknown target type: ..."` and unknown conditions return `"Unknown condition: ..."` — both are non-triggering *(source: `engine/watches.js:318-371`)*.
 ### Plugin folder (`watches.d/`) — user-extensible target types
-W-mp7hg58e000b5212 added a **plugin folder** so operators can register new target types without editing engine source. At engine boot, `engine/watches.js` scans `<MINIONS_DIR>/watches.d/*.js` *after* the eight built-ins are registered (so plugins can override a built-in by re-using its key — last-write-wins) and calls `registerTargetType()` for each export *(source: `engine/watches.js:1313-1354`)*.
+W-mp7hg58e000b5212 added a **plugin folder** so operators can register new target types without editing engine source. At engine boot, `engine/watches.js` scans `<MINIONS_DIR>/watches.d/*.js` *after* the eight built-ins are registered (so plugins can override a built-in by re-using its key — last-write-wins) and calls `registerTargetType()` for each export *(source: `engine/watches.js:1319-1354`)*.
 Each `watches.d/<name>.js` file must export `{ name, spec }` (or an array of those):
@@ -133,7 +133,7 @@ Resolution is `path.join(shared.MINIONS_DIR, 'watches.d')` so it works in both d
 ## Tick Integration
-`engine.js` calls `checkWatches(config, state)` every 3 ticks (~3 min at the default 60s tick) inside its own `safe('checkWatches', ...)` block *(source: `engine.js:5432-5485`)*. The engine builds the state object from cached project files + module reads:
+`engine.js` calls `checkWatches(config, state)` every 3 ticks (~3 min at the default 60s tick) inside its own `safe('checkWatches', ...)` block *(source: `engine.js:6386-6440`)*. The engine builds the state object from cached project files + module reads:
 ```
 {
@@ -144,7 +144,7 @@ Resolution is `path.join(shared.MINIONS_DIR, 'watches.d')` so it works in both d
 }
 ```
-`checkWatches` walks every active watch and, inside a single `mutateJsonFileLocked` callback *(source: `engine/watches.js:410-560`)*:
+`checkWatches` walks every active watch and, inside a single `mutateJsonFileLocked` callback *(source: `engine/watches.js:410-561`)*:
 1. Skips paused/expired watches and any watch checked within its `interval`.
 2. Captures a baseline `_lastState` on first check (so change conditions have something to diff).
@@ -158,7 +158,7 @@ I/O happens **outside the lock**: notifications via `writeToInbox`, follow-up ac
 ## Follow-Up Actions on Trigger
-`watch.action` is an optional structured action that runs after the inbox notification fires. Action types live in a sibling registry in `engine/watch-actions.js` and are validated at create/update time *(source: `engine/watches.js:184-247` `createWatch`, `engine/watch-actions.js:223-330` `registerActionType`)*. `GET /api/watches/action-types` returns the live list for dashboard pickers.
+`watch.action` is an optional structured action that runs after the inbox notification fires. Action types live in a sibling registry in `engine/watch-actions.js` and are validated at create/update time *(source: `engine/watches.js:184-248` `createWatch`, `engine/watch-actions.js:56` `registerActionType`)*. `GET /api/watches/action-types` returns the live list for dashboard pickers.
 ### Built-in actions
@@ -174,7 +174,7 @@ I/O happens **outside the lock**: notifications via `writeToInbox`, follow-up ac
 | `archive-plan`         | Set PRD `status="archived"` + `archivedAt`                                                    |
 | `resume-plan`          | Set PRD `status=PLAN_STATUS.ACTIVE` and clear `planStale`                                     |
-Constants live in `WATCH_ACTION_TYPE` (`engine/shared.js:2491`); handlers in `engine/watch-actions.js`.
+Constants live in `WATCH_ACTION_TYPE` (`engine/shared.js:2608`); handlers in `engine/watch-actions.js`.
 ### Templating
@@ -241,11 +241,11 @@ Absolute conditions firing under `stopAfter === 0` flip `status` to `expired`; `
 | Webhook action returns `"only http/https allowed"`                      | URLs must use `http://` or `https://` schemes; other protocols are rejected by design *(source: `engine/watch-actions.js` `WEBHOOK` handler)* |
 | Trigger fires but follow-up `dispatch-work-item` is missing              | Check the engine log for `Watch <id> action <type>: <summary>`. Common reasons: missing `title`, the project's `work-items.json` couldn't be written, or the WI landed in central `work-items.json` because no project was specified |
 | Watch `_lastActionResult` shows `"timeout"` for webhook                 | Webhooks have a 10s safety timeout to keep the watches tick fast *(source: `engine/watch-actions.js:482-484`)* |
-| `checkWatches` block crashes silently                                   | Wrapped in `safe('checkWatches', ...)` so one failure doesn't abort the tick *(source: `engine.js:5432`)*. Inspect `engine/log.json` for `Watch check error (<id>)` lines. Regression #1088: the block must use `getProjects(config)`, never the long-removed `PROJECTS` constant |
+| `checkWatches` block crashes silently                                   | Wrapped in `safe('checkWatches', ...)` so one failure doesn't abort the tick *(source: `engine.js:6386`)*. Inspect `engine/log.json` for `Watch check error (<id>)` lines. Regression #1088: the block must use `getProjects(config)`, never the long-removed `PROJECTS` constant |
 ## See Also
-- `engine/shared.js:2406-2500` — `WATCH_STATUS`, `WATCH_TARGET_TYPE`, `WATCH_CONDITION`, `WATCH_ABSOLUTE_CONDITIONS`, `WATCH_ACTION_TYPE` constants
+- `engine/shared.js:2523-2618` — `WATCH_STATUS`, `WATCH_TARGET_TYPE`, `WATCH_CONDITION`, `WATCH_ABSOLUTE_CONDITIONS`, `WATCH_ACTION_TYPE` constants
 - `engine/watches.js` — registry, lifecycle, tick integration, `watches.d/` plugin loader
 - `engine/watch-actions.js` — action registry and built-in handlers (including `minions-api`)
 - `watches.d/http.js` — canonical user-extensible target type plugin

package/engine/cleanup.js CHANGED

@@ -419,6 +419,35 @@ async function runCleanup(config, verbose = false) {
     }
   }
+  // 2c. Reap orphan agents/temp-* dirs whose dispatch is no longer referenced
+  // anywhere in dispatch.json. Temp-agent dirs are created by the engine for
+  // ephemeral temp-<uid> agents; once the dispatch ages out of dispatch
+  // history they're never touched again, accumulating MB of live-output.log
+  // tails over weeks. 1h mtime gate prevents reaping a still-spawning temp
+  // agent that races dispatch.json visibility.
+  cleaned.orphanTempAgentDirs = 0;
+  try {
+    const dispatch = getDispatch();
+    const referencedAgents = new Set();
+    for (const seg of [dispatch.pending || [], dispatch.active || [], dispatch.completed || [], dispatch.history || []]) {
+      for (const e of seg) if (e?.agent) referencedAgents.add(String(e.agent));
+    }
+    let entries;
+    try { entries = fs.readdirSync(AGENTS_DIR, { withFileTypes: true }); } catch { entries = []; }
+    for (const entry of entries) {
+      if (!entry.isDirectory()) continue;
+      if (!entry.name.startsWith('temp-')) continue;
+      if (referencedAgents.has(entry.name)) continue;
+      const full = path.join(AGENTS_DIR, entry.name);
+      let stat; try { stat = fs.statSync(full); } catch { continue; }
+      if (stat.mtimeMs >= oneHourAgo) continue;
+      try {
+        fs.rmSync(full, { recursive: true, force: true });
+        cleaned.orphanTempAgentDirs++;
+      } catch (err) { log('warn', `orphan temp agent dir ${entry.name}: ${err.message}`); }
+    }
+  } catch (e) { log('warn', `orphan temp agent sweep: ${e.message}`); }
   // 2b. Detect git worktrees registered inside any linked project's working tree.
   // Nested worktrees cause glob/grep tools running with cwd=projectRoot to match
   // BOTH copies of every file; a single Edit/MultiEdit then writes the same
@@ -452,6 +481,57 @@ async function runCleanup(config, verbose = false) {
     }
   }
+  // 2d. Reap on-disk worktree dirs not registered in `git worktree list`. Can
+  // be left behind when removeWorktree fails mid-way, when `git worktree prune`
+  // ran without a follow-up rm -rf, or after manual `git worktree remove
+  // --force` leaves an empty dir. Phase 3 below only walks dirs already in
+  // git's list, so these are invisible to it. 2h mtime gate matches the
+  // existing age sweep further down.
+  cleaned.orphanWorktreeDirs = 0;
+  const _twoHoursAgo = Date.now() - 7200000;
+  const _scannedWtRoots = new Set();
+  for (const project of projects) {
+    const root = project.localPath ? path.resolve(project.localPath) : null;
+    if (!root || !fs.existsSync(root)) continue;
+    const wtRoots = new Set();
+    const configuredRoot = path.resolve(root, config.engine?.worktreeRoot || '../worktrees');
+    if (fs.existsSync(configuredRoot)) wtRoots.add(configuredRoot);
+    for (const d of ['worktrees', '.claude/worktrees'].map(d => path.join(root, d))) {
+      if (fs.existsSync(d)) wtRoots.add(d);
+    }
+    let registered = null;
+    for (const wtRoot of wtRoots) {
+      if (_scannedWtRoots.has(wtRoot)) continue;
+      _scannedWtRoots.add(wtRoot);
+      // Resolve `git worktree list` once per project; reused across its roots.
+      if (registered === null) {
+        try {
+          const raw = String(shared.execSilent('git worktree list --porcelain', { cwd: root, timeout: 10000, windowsHide: true }) || '');
+          registered = new Set(shared.parseWorktreePorcelain(raw).map(wt => path.resolve(wt.path)));
+        } catch (e) {
+          log('warn', `orphan worktree dir scan for ${project.name || root}: ${e.message}`);
+          registered = new Set();
+        }
+      }
+      let entries;
+      try { entries = fs.readdirSync(wtRoot, { withFileTypes: true }); } catch { continue; }
+      for (const entry of entries) {
+        if (!entry.isDirectory()) continue;
+        const full = path.resolve(wtRoot, entry.name);
+        if (registered.has(full)) continue;
+        let stat; try { stat = fs.statSync(full); } catch { continue; }
+        if (stat.mtimeMs >= _twoHoursAgo) continue;
+        try {
+          fs.rmSync(full, { recursive: true, force: true });
+          cleaned.orphanWorktreeDirs++;
+          log('info', `Cleanup: removed orphan worktree dir ${full} (not registered in git)`);
+        } catch (err) {
+          log('warn', `orphan worktree dir ${full}: ${err.message}`);
+        }
+      }
+    }
+  }
   // 3. Clean git worktrees for merged/abandoned PRs
   const _attemptedWorktreePaths = new Set(); // dedup across projects sharing a worktreeRoot
   for (const project of projects) {
@@ -909,8 +989,10 @@ async function runCleanup(config, verbose = false) {
     }
   } catch (e) { log('warn', 'prune orphaned dispatches: ' + e.message); }
-  if (cleaned.tempFiles + cleaned.liveOutputs + cleaned.worktrees + cleaned.zombies + (cleaned.files || 0) + cleaned.orphanedDispatches + (cleaned.nestedWorktrees || 0) > 0) {
-    log('info', `Cleanup: ${cleaned.tempFiles} temp, ${cleaned.liveOutputs} live outputs, ${cleaned.worktrees} worktrees, ${cleaned.zombies} zombies, ${cleaned.files || 0} archives, ${cleaned.orphanedDispatches} orphaned dispatches, ${cleaned.nestedWorktrees || 0} nested worktrees flagged`);
+  const _orphanTemp = cleaned.orphanTempAgentDirs || 0;
+  const _orphanWt = cleaned.orphanWorktreeDirs || 0;
+  if (cleaned.tempFiles + cleaned.liveOutputs + cleaned.worktrees + cleaned.zombies + (cleaned.files || 0) + cleaned.orphanedDispatches + (cleaned.nestedWorktrees || 0) + _orphanTemp + _orphanWt > 0) {
+    log('info', `Cleanup: ${cleaned.tempFiles} temp, ${cleaned.liveOutputs} live outputs, ${cleaned.worktrees} worktrees, ${cleaned.zombies} zombies, ${cleaned.files || 0} archives, ${cleaned.orphanedDispatches} orphaned dispatches, ${cleaned.nestedWorktrees || 0} nested worktrees flagged, ${_orphanTemp} orphan temp dirs, ${_orphanWt} orphan worktree dirs`);
   }
   // 8. Clean swept KB files older than 7 days
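
Phases 2c and 2d above share one pattern: list a directory, skip entries that are still referenced, skip entries modified too recently, and remove the rest. A standalone sketch of that pattern, useful for exercising the mtime gate in isolation, might look like this. `reapOrphanDirs` and its parameters are hypothetical and do not exist in `engine/cleanup.js`.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical standalone version of the "readdir, skip referenced,
// skip fresh, rm -rf" loop used by phases 2c and 2d.
// Returns the names of the directories that were removed.
function reapOrphanDirs(rootDir, isReferenced, minAgeMs, now = Date.now()) {
  let entries;
  try {
    entries = fs.readdirSync(rootDir, { withFileTypes: true });
  } catch {
    return []; // missing or unreadable root: nothing to reap
  }
  const cutoff = now - minAgeMs;
  const removed = [];
  for (const entry of entries) {
    if (!entry.isDirectory()) continue;
    if (isReferenced(entry.name)) continue;
    const full = path.join(rootDir, entry.name);
    let stat;
    try { stat = fs.statSync(full); } catch { continue; }
    if (stat.mtimeMs >= cutoff) continue; // too fresh, may still be in use
    try {
      fs.rmSync(full, { recursive: true, force: true });
      removed.push(entry.name);
    } catch {
      // leave the directory for the next sweep
    }
  }
  return removed;
}
```

Phase 2c would correspond to `minAgeMs = 3600000` with a `temp-` prefix check folded into `isReferenced`, and phase 2d to `minAgeMs = 7200000` with membership in the `git worktree list` set.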

package/engine/dispatch.js CHANGED

@@ -696,6 +696,12 @@ function completeDispatch(id, result = DISPATCH_RESULT.SUCCESS, reason = '', res
                   // (overloaded_error / 503). Empty string clears any stale
                   // value from an earlier failure cycle.
                   wi._lastFailureClass = failureClass || '';
+                  // W-mpmwxn1j — Bump per-agent retry count so the next dispatch
+                  // can reassign to a different eligible agent once the same
+                  // agent hits the threshold. Skip when no agent is resolvable
+                  // (anonymous failures shouldn't corrupt the map shape).
+                  const failedAgent = item.agent || wi.dispatched_to;
+                  if (failedAgent) shared.bumpAgentRetryCount(wi, failedAgent);
                   delete wi.failReason;
                   delete wi.failedAt;
                   delete wi.dispatched_at;
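
The hunk calls `shared.bumpAgentRetryCount(wi, failedAgent)`, but the `engine/shared.js` changes are not shown in this view. One plausible shape for a per-agent retry map is sketched below, purely as an illustration. The field name `_agentRetryCounts`, the function names, and the threshold of 2 are all assumptions and not the package's actual implementation.

```javascript
// Hypothetical field name; the real one in engine/shared.js is not visible here.
const AGENT_RETRY_FIELD = '_agentRetryCounts';

// Increment the failure count for one agent on a work item and return the
// new count. Returns 0 and leaves the item untouched when there is no agent,
// mirroring the "anonymous failures shouldn't corrupt the map shape" comment.
function bumpAgentRetryCountSketch(wi, agentId) {
  if (!wi || !agentId) return 0;
  const current = wi[AGENT_RETRY_FIELD];
  const counts = (current && typeof current === 'object' && !Array.isArray(current))
    ? current
    : {};
  const key = String(agentId);
  counts[key] = (Number(counts[key]) || 0) + 1;
  wi[AGENT_RETRY_FIELD] = counts;
  return counts[key];
}

// Assumed threshold check a dispatcher could use to pick a different agent.
function shouldReassign(wi, agentId, threshold = 2) {
  const counts = wi && wi[AGENT_RETRY_FIELD];
  return !!counts && (Number(counts[String(agentId)]) || 0) >= threshold;
}
```

Keeping the counts keyed by agent id means a failure by one agent does not push a different agent toward the reassignment threshold.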

package/engine/kb-sweep.js CHANGED

@@ -23,6 +23,8 @@ const KB_SWEEP_STATE_PATH = path.join(ENGINE_DIR, 'kb-sweep-state.json');
 const KB_SWEEP_LOG_PATH = path.join(ENGINE_DIR, 'kb-sweep.log');
 const KB_SWEEP_RUNNER_PATH = path.join(__dirname, 'kb-sweep-runner.js');
 const SWEPT_RETENTION_MS = 30 * 24 * 60 * 60 * 1000;
+const AUTO_SWEEP_INTERVAL_MS = 4 * 60 * 60 * 1000;
+const KB_SWEPT_PATH = path.join(ENGINE_DIR, 'kb-swept.json');
 const COMPRESS_THRESHOLD_BYTES = 5000;
 const LLM_BATCH_SIZE = 30;
 const NORMALIZE_CONCURRENCY = 5;
@@ -555,6 +557,127 @@ async function _runKbSweepImpl(opts = {}) {
   return summary;
 }
+/**
+ * Spawn the KB sweep runner (`engine/kb-sweep-runner.js`) as a detached child.
+ * Shared between dashboard's POST /api/knowledge/sweep handler and the engine
+ * tick's auto-sweep phase. Performs the same synchronous "starting" → "in-flight"
+ * CAS dance the dashboard handler used to do inline.
+ *
+ * Callers are responsible for the in-flight / stale-guard check BEFORE calling
+ * (so they can return distinct HTTP responses or log levels).
+ *
+ * @param {object} opts
+ * @param {string[]} [opts.pinnedKeys] - extra pinned KB keys to skip in the sweep
+ * @param {boolean}  [opts.dryRun]      - dry-run mode for the runner
+ * @param {string}   [opts.cwd=MINIONS_DIR] - working directory for the spawned runner
+ * @param {(level:string,msg:string)=>void} [opts.log] - logger (defaults to console)
+ * @returns {{ sweepToken:string, pid:number|null, bodyFile:string|null,
+ *             ok:boolean, error?:string }}
+ *           ok=false + error on synchronous spawn failure; the "starting" claim is
+ *           released so the caller can retry immediately.
+ */
+function spawnSweepRunnerDetached(opts = {}) {
+  const fsLocal = require('fs');
+  const { spawn: cpSpawn } = require('child_process');
+  const logFn = typeof opts.log === 'function'
+    ? opts.log
+    : (level, msg) => { (level === 'error' ? console.error : console.log)(`[kb-sweep] ${msg}`); };
+  const cwd = opts.cwd || require('./queries').MINIONS_DIR;
+  const startedAt = Date.now();
+  const sweepToken = `${startedAt}-${Math.random().toString(36).slice(2, 8)}`;
+  try {
+    safeWrite(KB_SWEEP_STATE_PATH, JSON.stringify({
+      status: 'starting', startedAt, startedAtIso: new Date().toISOString(),
+      sweepToken, pid: null,
+    }));
+  } catch (e) {
+    logFn('error', `failed to write starting state: ${e.message}`);
+  }
+  let bodyFile = null;
+  const hasBody = (Array.isArray(opts.pinnedKeys) && opts.pinnedKeys.length > 0)
+    || opts.dryRun != null;
+  if (hasBody) {
+    bodyFile = path.join(ENGINE_DIR, `tmp-kb-sweep-body-${sweepToken}.json`);
+    try {
+      safeWrite(bodyFile, JSON.stringify({
+        pinnedKeys: Array.isArray(opts.pinnedKeys) ? opts.pinnedKeys : undefined,
+        dryRun: opts.dryRun != null ? !!opts.dryRun : undefined,
+      }));
+    } catch (e) {
+      logFn('error', `failed to write body-file ${bodyFile}: ${e.message}`);
+      bodyFile = null;
+    }
+  }
+  let logFdNum = null;
+  let stdio = ['ignore', 'ignore', 'ignore'];
+  try {
+    logFdNum = fsLocal.openSync(KB_SWEEP_LOG_PATH, 'a');
+    stdio = ['ignore', logFdNum, logFdNum];
+  } catch (e) {
+    logFn('error', `failed to open log ${KB_SWEEP_LOG_PATH}: ${e.message}`);
+  }
+  const spawnArgs = ['--sweep-token', sweepToken];
+  if (bodyFile) spawnArgs.push('--body-file', bodyFile);
+  let proc;
+  try {
+    proc = cpSpawn(process.execPath, [KB_SWEEP_RUNNER_PATH, ...spawnArgs], {
+      cwd, stdio, detached: true, windowsHide: true,
+      env: { ...process.env },
+    });
+  } catch (e) {
+    if (logFdNum != null) try { fsLocal.closeSync(logFdNum); } catch { /* ignore */ }
+    if (bodyFile) try { fsLocal.unlinkSync(bodyFile); } catch { /* ignore */ }
+    try { shared.safeUnlink(KB_SWEEP_STATE_PATH); } catch { /* ignore */ }
+    return { ok: false, error: `spawn failed: ${e.message}`, sweepToken, pid: null, bodyFile: null };
+  }
+  if (logFdNum != null) try { fsLocal.closeSync(logFdNum); } catch { /* ignore */ }
+  try {
+    const current = safeJson(KB_SWEEP_STATE_PATH);
+    if (current && current.status === 'starting' && current.sweepToken === sweepToken) {
+      safeWrite(KB_SWEEP_STATE_PATH, JSON.stringify({
+        status: 'in-flight', startedAt, startedAtIso: new Date().toISOString(),
+        sweepToken, pid: proc.pid,
+      }));
+    }
+  } catch { /* best-effort */ }
+  proc.unref();
+  return { ok: true, sweepToken, pid: proc.pid, bodyFile };
+}
+/**
+ * Decide whether the engine tick should auto-spawn a sweep right now.
+ * Read-only function (reads disk, no side effects). Used by the tick's
+ * auto-sweep phase.
+ *
+ * @param {object} [opts]
+ * @param {number} [opts.now=Date.now()]                injectable clock (tests)
+ * @param {number} [opts.intervalMs=AUTO_SWEEP_INTERVAL_MS]
+ * @param {object} [opts.liveness]                      pre-computed liveness (optional)
+ * @returns {{ shouldSpawn:boolean, reason:string, lastCompletedAt:number|null }}
+ */
+function shouldAutoSweep(opts = {}) {
+  const now = Number(opts.now) || Date.now();
+  const intervalMs = Number(opts.intervalMs) || AUTO_SWEEP_INTERVAL_MS;
+  const liveness = opts.liveness || readSweepLiveness({ entryCount: opts.entryCount || 0, now });
+  if (liveness.inFlight && liveness.alive && !liveness.stale) {
+    return { shouldSpawn: false, reason: 'sweep-in-flight', lastCompletedAt: null };
+  }
+  const swept = safeJson(KB_SWEPT_PATH);
+  const sweptTs = swept && swept.timestamp ? Date.parse(swept.timestamp) : NaN;
+  const lastCompletedAt = Number.isFinite(sweptTs) ? sweptTs : null;
+  if (lastCompletedAt != null && (now - lastCompletedAt) < intervalMs) {
+    return { shouldSpawn: false, reason: 'within-interval', lastCompletedAt };
+  }
+  return { shouldSpawn: true, reason: lastCompletedAt == null ? 'no-prior-sweep' : 'interval-elapsed', lastCompletedAt };
+}
 /** Compute a dynamic stale-guard timeout based on KB size. */
 function staleGuardMs(entryCount) {
   // 30 minutes minimum, plus 1 second per entry (for the rewrite pass)
@@ -566,6 +689,10 @@ module.exports = {
   staleGuardMs,
   readSweepLiveness,
   reconcileSweepStateOnBoot,
+  spawnSweepRunnerDetached,
+  shouldAutoSweep,
+  AUTO_SWEEP_INTERVAL_MS,
+  KB_SWEPT_PATH,
   KB_SWEEP_STATE_PATH,
   KB_SWEEP_LOG_PATH,
   KB_SWEEP_RUNNER_PATH,
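
The "starting" → "in-flight" step in `spawnSweepRunnerDetached` only overwrites the state file when the caller's own `sweepToken` is still on disk. A minimal sketch of that token guard follows, assuming a throwaway temp file in place of `KB_SWEEP_STATE_PATH` and plain `fs` calls in place of `safeWrite`/`safeJson`. The helper names `claimStarting` and `promoteToInFlight` are hypothetical and do not appear in the package.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Assumption: a temp file stands in for the real sweep state file.
const statePath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'kb-sweep-')), 'state.json');

// Read and parse the state file; a missing or corrupt file reads as null.
function readState() {
  try { return JSON.parse(fs.readFileSync(statePath, 'utf8')); } catch { return null; }
}

// Hypothetical helper: record a "starting" claim tagged with the caller's token.
function claimStarting(sweepToken, startedAt) {
  fs.writeFileSync(statePath, JSON.stringify({ status: 'starting', startedAt, sweepToken, pid: null }));
}

// Hypothetical helper: promote to "in-flight" only if our own claim is still on disk.
function promoteToInFlight(sweepToken, startedAt, pid) {
  const current = readState();
  if (!current || current.status !== 'starting' || current.sweepToken !== sweepToken) {
    return false; // someone else owns the state file now; leave it alone
  }
  fs.writeFileSync(statePath, JSON.stringify({ status: 'in-flight', startedAt, sweepToken, pid }));
  return true;
}

claimStarting('token-a', 1000);
const promotedOwn = promoteToInFlight('token-a', 1000, 4242);   // claim intact
claimStarting('token-b', 2000);                                  // a later caller takes over
const promotedStale = promoteToInFlight('token-a', 1000, 4242); // token mismatch, no write
const finalState = readState();
```

The read-then-write here is not atomic, so it narrows the race window without closing it; that appears consistent with the diff treating this step as best-effort.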

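The decision order in `shouldAutoSweep` (in-flight check first, then the interval check against the last completed sweep) can be exercised without touching disk. The sketch below is a hypothetical variant, `decideAutoSweep`, that takes the liveness object and the last-sweep timestamp as arguments instead of reading `KB_SWEPT_PATH`. The six-hour interval is illustrative only, since the value of `AUTO_SWEEP_INTERVAL_MS` is not shown in this hunk.

```javascript
// Hypothetical disk-free variant of shouldAutoSweep: all inputs are injected.
function decideAutoSweep({ now, intervalMs, liveness, sweptTimestamp }) {
  // A live, non-stale sweep always blocks a new spawn.
  if (liveness.inFlight && liveness.alive && !liveness.stale) {
    return { shouldSpawn: false, reason: 'sweep-in-flight', lastCompletedAt: null };
  }
  const parsed = sweptTimestamp ? Date.parse(sweptTimestamp) : NaN;
  const lastCompletedAt = Number.isFinite(parsed) ? parsed : null;
  // A recent completed sweep suppresses the spawn until the interval elapses.
  if (lastCompletedAt != null && (now - lastCompletedAt) < intervalMs) {
    return { shouldSpawn: false, reason: 'within-interval', lastCompletedAt };
  }
  return {
    shouldSpawn: true,
    reason: lastCompletedAt == null ? 'no-prior-sweep' : 'interval-elapsed',
    lastCompletedAt,
  };
}

const HOUR = 60 * 60 * 1000;
const intervalMs = 6 * HOUR; // illustrative value, not the package's constant
const now = Date.parse('2024-01-01T12:00:00.000Z');
const idle = { inFlight: false, alive: false, stale: false };
const running = { inFlight: true, alive: true, stale: false };

const first = decideAutoSweep({ now, intervalMs, liveness: idle, sweptTimestamp: null });
const recent = decideAutoSweep({ now, intervalMs, liveness: idle, sweptTimestamp: '2024-01-01T10:00:00.000Z' });
const due = decideAutoSweep({ now, intervalMs, liveness: idle, sweptTimestamp: '2024-01-01T05:00:00.000Z' });
const busy = decideAutoSweep({ now, intervalMs, liveness: running, sweptTimestamp: null });
```

Passing the clock and liveness in mirrors the `opts.now` and `opts.liveness` parameters the diff already exposes for tests.
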
package/engine/lifecycle.js CHANGED

@@ -595,6 +595,7 @@ function updateWorkItemStatus(meta, status, reason) {
         delete target.failReason;
         delete target.failedAt;
         delete target._retryCount;
+        delete target._retriesByAgent;
         target.completedAgents = Object.entries(target.agentResults)
           .filter(([, r]) => r.status === WI_STATUS.DONE)
           .map(([a]) => a);
@@ -611,6 +612,7 @@ function updateWorkItemStatus(meta, status, reason) {
         delete target.failReason;
         delete target.failedAt;
         delete target._retryCount;
+        delete target._retriesByAgent;
         // P-e0b4f7a5 — successful completion (including a phantom-retry
         // succeeding) clears the phantom markers so cleanup can reap the
         // worktree on the next sweep.
@@ -3218,6 +3220,14 @@ function _deferRetryWithCounter(meta, detection, counterField, maxCount, pending
         w._lastRetryAt = ts();
         w._lastRetryReason = reason;
         w._pendingReason = pendingReason;
+        // W-mpmwxn1j — only the standard PR-attachment / nonterminal counter
+        // (_retryCount) participates in per-agent reassignment. Phantom
+        // retries (runtime crashes before any work product) are not
+        // agent-specific failures, so we don't bump _retriesByAgent for them.
+        if (counterField === '_retryCount') {
+          const failedAgent = meta?._agentId || w.dispatched_to;
+          if (failedAgent) shared.bumpAgentRetryCount(w, failedAgent);
+        }
         // P-e0b4f7a5 — phantom-retry path stamps _phantomCompletion +
         // _phantomBranch so cleanup.js can preserve the worktree across the
         // re-dispatch window. Only set for the phantom counter; nonterminal
@@ -4018,6 +4028,10 @@ async function runPostCompletionHooks(dispatchItem, agentId, code, stdout, confi
               w._retryCount = retries + 1;
               w._lastRetryAt = ts();
               w._lastRetryReason = 'no review verdict';
+              // W-mpmwxn1j — bump per-agent counter so a reviewer who never
+              // emits a verdict gets reassigned after maxRetriesPerAgent hits.
+              const failedAgent = meta?._agentId || w.dispatched_to;
+              if (failedAgent) shared.bumpAgentRetryCount(w, failedAgent);
               delete w.dispatched_at;
               delete w.completedAt;
               delete w._pendingReason;
@@ -4125,6 +4139,10 @@ async function runPostCompletionHooks(dispatchItem, agentId, code, stdout, confi
             if (retries < ENGINE_DEFAULTS.maxRetries) {
               w.status = WI_STATUS.PENDING;
               w._retryCount = retries + 1;
+              // W-mpmwxn1j — bump per-agent counter so a planner that never
+              // writes the PRD gets reassigned after maxRetriesPerAgent hits.
+              const failedAgent = meta?._agentId || w.dispatched_to;
+              if (failedAgent) shared.bumpAgentRetryCount(w, failedAgent);
               delete w.dispatched_at;
               delete w.completedAt;
               log('warn', `plan-to-prd ${meta.item.id} completed without PRD file — auto-retry ${retries + 1}/${ENGINE_DEFAULTS.maxRetries}`);
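
The hunks above call `shared.bumpAgentRetryCount(w, failedAgent)` and clear `_retriesByAgent` on successful completion, but the helper's body lives in `engine/shared.js` and is not shown here. The following is only a guess at the shape of such a counter, assuming `_retriesByAgent` is a plain map from agent id to retry count. The names `bumpAgentRetryCountSketch` and `shouldReassign`, and the limit of 2, are hypothetical.

```javascript
// Hypothetical stand-in for shared.bumpAgentRetryCount (real body not shown in the diff).
function bumpAgentRetryCountSketch(workItem, agentId) {
  if (!workItem._retriesByAgent || typeof workItem._retriesByAgent !== 'object') {
    workItem._retriesByAgent = {};
  }
  workItem._retriesByAgent[agentId] = (workItem._retriesByAgent[agentId] || 0) + 1;
  return workItem._retriesByAgent[agentId];
}

// Hypothetical check: has this agent used up its per-agent retry budget?
function shouldReassign(workItem, agentId, maxRetriesPerAgent) {
  const used = (workItem._retriesByAgent || {})[agentId] || 0;
  return used >= maxRetriesPerAgent;
}

const w = { id: 'W-example', dispatched_to: 'agent-a' };
const meta = {}; // no _agentId, so the fallback to dispatched_to applies
const failedAgent = meta?._agentId || w.dispatched_to; // same expression as the diff

bumpAgentRetryCountSketch(w, failedAgent);
const afterOne = shouldReassign(w, failedAgent, 2);

bumpAgentRetryCountSketch(w, failedAgent);
const afterTwo = shouldReassign(w, failedAgent, 2);

// Mirrors `delete target._retriesByAgent` on successful completion.
delete w._retriesByAgent;
const afterReset = shouldReassign(w, failedAgent, 2);
```

Keeping the counter per agent rather than per work item means a second agent starts with a fresh budget even when `_retryCount` is already nonzero, which seems to be the intent of the W-mpmwxn1j comments.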