@yemi33/minions 0.1.1985 → 0.1.1987

package/docs/README.md CHANGED
@@ -15,6 +15,7 @@ Architecture, design proposals, and lifecycle references for people working on t
15
15
 
16
16
  - [command-center.md](command-center.md) — Command Center (CC) chat panel: persistent Sonnet sessions, `--resume` semantics, system-prompt invalidation, and per-tab session storage.
17
17
  - [completion-reports.md](completion-reports.md) — Canonical schema for the per-spawn completion JSON: trust nonce, `failure_class` enum, `noop` semantics, `retryable` / `needs_rerun` shape, and the artifacts array.
18
+ - [constellation-bridge.md](constellation-bridge.md) — Read-only cross-repo bridge: `engine.constellationBridge.enabled` flag, marker-file contract, and the `minions bridge` subcommand for local debugging.
18
19
  - [copilot-cli-schema.md](copilot-cli-schema.md) — Behavior and schema reference for the GitHub Copilot CLI adapter (capability flags, stdin vs `-p`, model discovery, effort levels).
19
20
  - [design-state-storage.md](design-state-storage.md) — Design proposal evaluating five database options for replacing Minions' file-based JSON state; recommends `node:sqlite` as the medium-term target.
20
21
  - [kb-sweep.md](kb-sweep.md) — Knowledge-base consolidation sweep (hash dedup → LLM batch dedup/reclassify → per-entry compress) and the detached runner that keeps it alive across `minions restart`.
@@ -39,6 +40,7 @@ Operational runbooks for engine operators and fleet maintainers.
39
40
  - [human-vs-automated.md](human-vs-automated.md) — Quick reference table of which features humans start, run, decide, and recover, and the two human approval gates.
40
41
  - [kb-sweep.md](kb-sweep.md) — Knowledge-base sweep runbook: how `engine/kb-sweep.js` consolidates `notes/inbox/` into `knowledge/` and survives `minions restart`.
41
42
  - [onboarding.md](onboarding.md) — First-30-minutes walkthrough for a new operator: install, init, dispatch a first work item, watch it land.
43
+ - [security.md](security.md) — Threat model: single-user/loopback deployment assumptions, dashboard Origin gate, data-flow trust boundaries, secret handling, and known residual risks (CSRF sweep, prompt injection, log-redactor audit).
42
44
 
43
45
  ---
44
46
 
@@ -0,0 +1,94 @@
1
+ # Constellation bridge
2
+
3
+ Minions ships a small read-only surface that the [Constellation](https://office.visualstudio.com/DefaultCollection/ISS/_git/constellation) dashboard polls to project Minions engine state (agents, dispatch queue, PR pipeline) into its HUD. This page documents the Minions side of that contract.
4
+
5
+ > **The bridge polling logic itself lives in the Constellation repo** (`packages/agent/src/bridges/`). Minions only owns the on/off flag, the marker-file contract, and the `minions bridge` subcommand for local debugging.
6
+
7
+ ## Quick start
8
+
9
+ ```bash
10
+ minions bridge status # Show enabled flag + last-seen Constellation agent
11
+ minions bridge health # Probe http://127.0.0.1:7331/api/status and print the projection
12
+ minions bridge enable # Set engine.constellationBridge.enabled = true
13
+ minions bridge disable # Set engine.constellationBridge.enabled = false
14
+ ```
15
+
16
+ ## Config flag
17
+
18
+ ```json
19
+ {
20
+ "engine": {
21
+ "constellationBridge": {
22
+ "enabled": false
23
+ }
24
+ }
25
+ }
26
+ ```
27
+
28
+ Default: `false`. The flag is backfilled into existing configs on `minions init` (including the implicit init that runs on every `minions update`).
29
+
30
+ **Strict semantics.** Only the literal boolean `true` enables the bridge. Any missing, malformed, or non-boolean value (e.g. the string `"true"`, the number `1`, `null`) is treated as **disabled**. The Constellation-side reader MUST mirror this check — no truthy coercion — so a typo'd `"enabled": "false"` does not silently turn the bridge on.
31
+
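A minimal sketch of a reader that mirrors the strict check, for illustration only. `readBridgeFlag` is a hypothetical helper name that exists in neither repo; the comparison itself is the same `=== true` test that `isBridgeEnabled` in `engine/bridge.js` uses.

```javascript
// Hypothetical reader-side helper: only the literal boolean `true` enables the bridge.
function readBridgeFlag(configText) {
  let config;
  try {
    config = JSON.parse(configText);
  } catch {
    return false; // malformed JSON is treated as disabled
  }
  // Optional chaining covers missing or non-object intermediate fields.
  return config?.engine?.constellationBridge?.enabled === true;
}

const samples = [
  '{"engine":{"constellationBridge":{"enabled":true}}}',
  '{"engine":{"constellationBridge":{"enabled":"true"}}}',
  '{"engine":{"constellationBridge":{"enabled":1}}}',
  '{"engine":{}}',
  'not json',
];
for (const text of samples) {
  console.log(readBridgeFlag(text), text);
}
```

Only the first sample enables the bridge; every other shape falls through to disabled.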
32
+ Toggle without editing JSON by hand:
33
+
34
+ ```bash
35
+ minions bridge enable
36
+ minions bridge disable
37
+ ```
38
+
39
+ Both subcommands write `~/.minions/config.json` atomically via `mutateJsonFileLocked`, so concurrent edits from the engine/dashboard cannot tear the file.
40
+
41
+ ## Marker-file contract
42
+
43
+ The Constellation agent's bridge writes a marker file on every successful poll. Minions reads it with `minions bridge status` so a local operator can verify the bridge is alive and Constellation is talking to it.
44
+
45
+ - **Path:** `~/.minions/engine/constellation-bridge.json` (exposed as `CONSTELLATION_BRIDGE_MARKER_PATH` in `engine/shared.js`).
46
+ - **Owner:** the Constellation agent. The Minions engine **never** writes this file — it is a one-way breadcrumb from Constellation → Minions.
47
+
48
+ ### Schema (`schemaVersion: 1`)
49
+
50
+ ```json
51
+ {
52
+ "schemaVersion": 1,
53
+ "lastSeenAt": "2026-05-19T04:41:23.123Z",
54
+ "agentVersion": "0.2.3",
55
+ "source": "constellation-agent"
56
+ }
57
+ ```
58
+
59
+ | Field | Required | Notes |
60
+ | --------------- | -------- | ------------------------------------------------------------------------ |
61
+ | `schemaVersion` | yes | Must equal `1`. Any other value causes Minions to ignore the marker entirely (treated as no-marker, same as a missing file). New fields are added behind a deliberate version bump. |
62
+ | `lastSeenAt` | yes | ISO-8601 UTC timestamp of the last successful poll. |
63
+ | `agentVersion` | no | Constellation agent semver string, surfaced in `bridge status`. |
64
+ | `source` | no | Free-form identifier, expected `"constellation-agent"` today. |
65
+
66
+ Writers MUST use an atomic-replace pattern (`write to tmp + rename`) so a partial write never leaves Minions reading a half-baked JSON blob.
67
+
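The atomic-replace pattern could look roughly like the sketch below. `writeMarkerAtomically` and the temp-file naming are assumptions made for illustration, not the Constellation agent's actual writer; the marker fields come from the schema above. The sketch relies on rename being atomic within a single filesystem, which holds on POSIX systems.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical writer-side helper: write to a sibling temp file, then rename over the target.
function writeMarkerAtomically(markerPath, agentVersion) {
  const marker = {
    schemaVersion: 1,
    lastSeenAt: new Date().toISOString(),
    agentVersion,
    source: 'constellation-agent',
  };
  // Keep the temp file in the target's directory so the rename stays on one filesystem.
  const tmpPath = `${markerPath}.${process.pid}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(marker, null, 2));
  fs.renameSync(tmpPath, markerPath);
  return marker;
}

// Demo against a throwaway directory rather than ~/.minions/engine.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'bridge-marker-'));
const demoPath = path.join(demoDir, 'constellation-bridge.json');
writeMarkerAtomically(demoPath, '0.2.3');
console.log(fs.readFileSync(demoPath, 'utf8'));
```

A reader that opens the marker path sees either the previous complete file or the new complete file, never a partially written one.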
68
+ ## `minions bridge health`
69
+
70
+ `bridge health` performs a synchronous probe of the Minions dashboard's `GET /api/status` endpoint and prints a **curated subset** — the same fields the Constellation bridge would project into its own data model. This is intentionally small: full `/api/status` is large, unstable, and may expose unrelated local state.
71
+
72
+ Sample output:
73
+
74
+ ```text
75
+ bridge: dashboard reachable on http://127.0.0.1:7331
76
+ projection (same fields the Constellation bridge would read):
77
+ engineState: running
78
+ enginePid: 1234
79
+ minionsVersion: 0.1.1984
80
+ agentCount: 5
81
+ activeAgentCount: 2
82
+ dispatchPending: 1
83
+ dispatchActive: 2
84
+ dispatchCompleted: 14
85
+ projectCount: 3
86
+ ```
87
+
88
+ If the dashboard is not listening, `bridge health` prints ``dashboard not running on :7331 — start it with `minions dash` `` and exits 1. Use this exit code to gate scripted health checks.
89
+
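One way to gate a script on that exit code, sketched under assumptions: `dashboardIsHealthy` is a hypothetical wrapper, and it presumes a `minions` executable is resolvable on `PATH` (on Windows a `.cmd` shim may need a shell). The demo call substitutes the current Node binary so the sketch runs without Minions installed.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical wrapper: true only when the command exits with status 0.
// A missing binary yields status === null, which also counts as unhealthy.
function dashboardIsHealthy(command = 'minions', args = ['bridge', 'health']) {
  const result = spawnSync(command, args, { stdio: 'ignore' });
  return result.status === 0;
}

// Stand-in child that exits 1, mimicking a dashboard that is not listening.
console.log(dashboardIsHealthy(process.execPath, ['-e', 'process.exit(1)']));
```

In a real script the default arguments would apply, and a `false` result would skip whatever step depends on the dashboard.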
90
+ ## Cross-repo coordination
91
+
92
+ The Constellation-side PR ([P-wi1-bridge-readonly](https://office.visualstudio.com/DefaultCollection/ISS/_git/constellation)) lands independently of this Minions PR. The Minions side merges first with the default `enabled: false`, then the Constellation side lights up bridge polling. Operators flip the flag to `true` only after both sides are deployed.
93
+
94
+ The Constellation agent's bridge reads `~/.minions/config.json` directly (no Minions HTTP API call) so config edits propagate without waiting for the engine to restart.
@@ -0,0 +1,177 @@
1
+ # Minions Security Model
2
+
3
+ This document records the threat model for Minions today. It is intentionally
4
+ narrow: it describes what the engine and dashboard are designed to protect
5
+ against, what they are **not** designed to protect against, and the residual
6
+ risks an operator should know about. It is the source of truth for "is X a
7
+ vulnerability or a documented assumption?" questions.
8
+
9
+ If you are implementing a change that touches authentication, the dashboard
10
+ HTTP surface, secret handling, or the agent prompt boundary, read this first
11
+ and update it in the same PR if the model changes.
12
+
13
+ ## 1. Deployment model
14
+
15
+ Minions is designed as a **single-user, localhost-only, single-tenant**
16
+ developer tool:
17
+
18
+ - One human operator runs `minions start` on a workstation (or a remote DevBox
19
+ they treat as their own workstation). Agents dispatched by the engine run as
20
+ the same OS user as the engine and dashboard.
21
+ - The dashboard binds `127.0.0.1` only (see
22
+ [`dashboard.js`](../dashboard.js) — `server.listen(PORT, '127.0.0.1', ...)`)
23
+ and is **not** intended to be reachable from any other host.
24
+ - Configuration, runtime state, secrets, project worktrees, and agent output
25
+ all live on the same machine under `MINIONS_DIR` and the operator's git
26
+ worktrees.
27
+
28
+ **Multi-tenant deployment is explicitly out of scope.** Minions is not a
29
+ hosted service. The engine, dashboard, agents, MCP helpers, and any tools the
30
+ operator invokes from the same shell session form one trust domain. Anything
31
+ that could allow a second human to share that trust domain — exposing the
32
+ dashboard port, mirroring `MINIONS_DIR` to another user, running the engine as
33
+ a service account read by multiple operators — is unsupported and not
34
+ defended against here.
35
+
36
+ ## 2. Dashboard threat model
37
+
38
+ The dashboard (`dashboard.js`, port 7331) is the only HTTP surface in the
39
+ system. Its threat model:
40
+
41
+ ### In scope (intentional, not vulnerabilities)
42
+
43
+ - **Loopback bind.** The dashboard binds `127.0.0.1` only; no LAN, container,
44
+ or VPN client can reach it. Operators who tunnel the port elsewhere (SSH
45
+ port forward, `ngrok`, etc.) opt out of this assumption and inherit
46
+ responsibility for whatever auth gate they place in front.
47
+ - **Same-user process access.** Any process running as the same OS user as
48
+ the engine (other agent runtimes, MCP helpers, `curl` from a terminal,
49
+ `minions` CLI subcommands) can call `/api/*`. This is intentional — it is
50
+ how `minions dispatch`, the Copilot/Claude runtimes, and operator scripts
51
+ drive the engine. We do not attempt to authenticate same-user callers.
52
+ - **No authentication gate.** There is no login, no session cookie, no
53
+ per-user ACL. The single-user assumption above is the entire authn story.
54
+ Adding authn would not increase security in the single-user model; it would
55
+ only break local CLI/MCP tooling.
56
+
57
+ ### Residual risks defended today
58
+
59
+ - **Cross-origin browser requests / CSRF / DNS rebinding.** A browser tab the
60
+ operator visits could in principle issue requests to `http://127.0.0.1:7331`.
61
+ The dashboard defends against this with:
62
+ - An **Origin gate** on mutating methods (`POST`/`PUT`/`PATCH`/`DELETE`)
63
+ and CORS preflights — see `dashboard.js` ~3677–3730 and
64
+ `shared.isAllowedOrigin` / `shared.buildSecurityHeaders` in
65
+ [`engine/shared.js`](../engine/shared.js). Requests whose `Origin` (or
66
+ `Referer`, if `Origin` is absent) is not in the local allowlist are
67
+ rejected with HTTP 403. Callers without an `Origin` header at all
68
+ (Node `http.request`, `curl` without `-H Origin`, CLI tooling) are
69
+ allowed through to preserve local automation.
70
+ - Baseline **security headers** (CSP, `X-Content-Type-Options`,
71
+ `Referrer-Policy`, clickjacking protections) applied to every response
72
+ via `shared.buildSecurityHeaders()`.
73
+
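The Origin gate described above can be sketched as follows. This is not the code in `dashboard.js` or `shared.isAllowedOrigin`: the function names, the allowlist entries, and the "0 means pass" return convention are all assumptions chosen for illustration.

```javascript
const MUTATING_METHODS = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);
// Assumed allowlist entries; the real list sits behind shared.isAllowedOrigin.
const ALLOWED_ORIGINS = new Set(['http://127.0.0.1:7331', 'http://localhost:7331']);

// Normalize an Origin or Referer value to scheme://host:port, or null if unparseable.
function originOf(value) {
  try {
    return new URL(value).origin;
  } catch {
    return null;
  }
}

// Returns 403 when the request should be rejected, 0 when it may proceed.
// `headers` uses lower-case keys, as Node's req.headers does.
function originGate(method, headers) {
  const isPreflight = method === 'OPTIONS' && 'access-control-request-method' in headers;
  if (!MUTATING_METHODS.has(method) && !isPreflight) return 0;
  const source = headers.origin ?? headers.referer;
  if (source === undefined) return 0; // no Origin or Referer: curl, CLI tooling, Node http.request
  return ALLOWED_ORIGINS.has(originOf(source)) ? 0 : 403;
}

console.log(originGate('POST', { origin: 'https://evil.example' }));
console.log(originGate('POST', {}));
```

The missing-header pass-through is the part that keeps local automation working, and it is also what the stricter `same-origin` proposal later in this section would remove.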
74
+ ### Residual risks tracked elsewhere
75
+
76
+ - **CSRF hardening sweep.** A broader hardening pass — deny-by-default CORS,
77
+ `Sec-Fetch-Site: same-origin` enforcement on mutating endpoints, and an
78
+ optional bearer-token gate as a secondary defense — is **deferred to a
79
+ separate plan** (`D-f8-csrf` in
80
+ `prd/security-fix-plan-from-weekly-review-2026-05-18.json`, open question
81
+ `Q-csrf-followup`). This document does not gate that work; if and when the
82
+ CSRF follow-up plan ships, update §2 to reflect the new posture.
83
+
84
+ ### Recommended hardening (if F8 ever moves from docs to code)
85
+
86
+ If we revisit this assumption — e.g. the dashboard ever serves more than one
87
+ operator, or we want defense-in-depth beyond Origin checks — the recommended
88
+ shape is:
89
+
90
+ 1. Reject mutating requests whose `Origin` / `Sec-Fetch-Site` is not
91
+ `same-origin`, instead of the current allowlist + missing-header pass-through.
92
+ 2. Switch CORS to deny-by-default and explicitly opt specific endpoints in.
93
+ 3. Add an optional bearer token (operator-supplied via env) as a secondary
94
+ gate; require it on mutating endpoints when set.
95
+ 4. Document the resulting break in CLI/MCP tooling and provide a token
96
+ injection path for it.
97
+
98
+ ## 3. Data flow trust boundaries
99
+
100
+ Minions reads from several sources with very different trust levels. The
101
+ engine and dashboard treat them differently on purpose:
102
+
103
+ | Source | Trust | Examples | Handling |
104
+ |---|---|---|---|
105
+ | **Operator config** | Trusted | `config.json`, `projects/`, `notes.md`, `notes/inbox/*` authored by the human, `pinned.md` | Read as-is. The operator is assumed to control these files. |
106
+ | **Agent output** | Semi-trusted | Completion reports, fenced `completion` blocks, learnings notes, PR comments authored by the shared `gh` identity | Schema-validated; completion JSON is gated by the per-spawn nonce (`MINIONS_COMPLETION_NONCE`, see [`completion-reports.md`](completion-reports.md) → "Trust boundary"). Reports without a valid nonce are rejected with `failure_class: 'completion-nonce-mismatch'`. |
107
+ | **External APIs** | Untrusted | GitHub REST/GraphQL responses, Azure DevOps REST responses, GitHub/ADO PR comment bodies, CI/run logs | Validated and shape-checked before persistence. Strings sourced from these responses are never passed as raw arguments to shells or `git`; see F2 (gh shell-injection fix) and F7 (`git log` execFile conversion) in the same security plan. |
108
+ | **Agent-controlled paths** | Untrusted | Paths supplied to dashboard endpoints by agents (e.g. `/api/agent-output`) | Normalized through `shared.sanitizePath` to constrain to the expected root; see F4 in the same plan. |
109
+
110
+ The agent-output trust boundary deserves emphasis: completion reports are the
111
+ single most powerful signal an agent can emit (they advance work-item status,
112
+ mark PRs reviewed, trigger merges). The nonce gate exists specifically so a
113
+ report written by an unrelated process — or by a stale dispatch from a
114
+ previous tick — cannot be silently consumed. Anything in the report body
115
+ itself remains agent-controlled and is treated as such (no `eval`, no shell
116
+ interpolation, schema-validated fields only).
117
+
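A rough sketch of what a per-spawn nonce gate can look like. The `nonce` field name on the report and both function names are assumptions for illustration; the canonical schema is in `completion-reports.md`, and only the `completion-nonce-mismatch` failure class is taken from this document.

```javascript
const crypto = require('node:crypto');

// Hypothetical check: compare the report's nonce with the one issued at spawn time.
function nonceMatches(expectedNonce, report) {
  const actual = report?.nonce;
  if (typeof expectedNonce !== 'string' || typeof actual !== 'string') return false;
  const a = Buffer.from(expectedNonce);
  const b = Buffer.from(actual);
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

// Hypothetical gate: reject before any field of the report body is consumed.
function acceptReport(expectedNonce, report) {
  if (!nonceMatches(expectedNonce, report)) {
    return { accepted: false, failure_class: 'completion-nonce-mismatch' };
  }
  return { accepted: true };
}

const issued = crypto.randomBytes(16).toString('hex');
console.log(acceptReport(issued, { nonce: issued, status: 'done' }));
console.log(acceptReport(issued, { nonce: 'stale-nonce-from-previous-tick' }));
```

A passing nonce only establishes which spawn wrote the report; the fields inside it still need schema validation.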
118
+ ## 4. Secret management
119
+
120
+ - **PATs and API tokens live in environment variables only.** GitHub tokens
121
+ (`GH_TOKEN`, `COPILOT_GITHUB_TOKEN`), Azure DevOps PATs, and any
122
+ runtime-specific credentials are read from the engine's process
123
+ environment. They are not persisted to `config.json`, work-item state,
124
+ PR metadata, or any other on-disk JSON Minions owns.
125
+ - **Tokens are never intentionally logged.** Engine code that shells out to
126
+ `gh` or the ADO CLI threads the token via per-call `GH_TOKEN=...`
127
+ environment injection (see `engine/gh-token.js`), so the value never
128
+ appears on a command line or in `live-output.log`.
129
+ - **The log redactor is best-effort, not authoritative.** The
130
+ redactor scrubs token-shaped strings from logs and agent output, but its
131
+ coverage has **not** been formally audited (deferred as `D-f9`, open
132
+ question `Q-f9-log-redactor`). Treat redaction as a defense-in-depth nicety,
133
+ not a guarantee — do not rely on it to keep a leaked token out of an
134
+ uploaded log bundle. If a token may have appeared in output, rotate it.
135
+
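Per-call environment injection, as described in the second bullet, might be sketched like this. `runWithToken` is a hypothetical helper and not the contents of `engine/gh-token.js`; the child process is the current Node binary standing in for `gh`.

```javascript
const { execFileSync } = require('node:child_process');

// Hypothetical helper: pass the token through the child's environment, never through argv.
function runWithToken(file, args, token) {
  return execFileSync(file, args, {
    env: { ...process.env, GH_TOKEN: token },
    encoding: 'utf8',
  });
}

// Stand-in child that reports whether it can see the variable, without echoing its value.
const childScript = 'console.log(process.env.GH_TOKEN ? "token present" : "token missing")';
const out = runWithToken(process.execPath, ['-e', childScript], 'example-token-value');
console.log(out.trim());
```

Because `execFileSync` takes an argument array and spawns no shell, the token never appears in a command string that could be logged or listed by process tools.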
136
+ ## 5. Known limitations
137
+
138
+ These are accepted limitations of the current model. They are documented
139
+ rather than fixed because (a) they are out of scope for the single-user
140
+ threat model, (b) they are tracked under other items, or (c) a fix would
141
+ break operator workflows we want to preserve.
142
+
143
+ - **No authentication gate on the dashboard.** Intentional — see §2. The
144
+ single-user UX (and `minions` CLI, MCP integrations, and operator scripts
145
+ that POST to `/api/*` without juggling a token) depends on this. Revisit
146
+ only if the deployment model in §1 changes.
147
+ - **Prompt-injection surface from PR comments and inbox notes.** Agent
148
+ prompts splice in human-authored content (pinned notes, `notes/inbox/*`,
149
+ PR comment bodies, `pendingHumanFeedback`) without a fenced delimiter
150
+ separating "instructions" from "data." A malicious PR comment author
151
+ could attempt to steer an agent that reads the comment thread. Mitigation
152
+ (F5 — delimited untrusted content blocks) is **blocked on an open
153
+ question** (`Q-f5-delimiter`) about which delimiter token to standardize
154
+ on. Until F5 lands, operators should treat external PR comment threads
155
+ as a low-but-nonzero injection surface.
156
+ - **Temp-file predictability.** Per-dispatch temp paths can be predictable
157
+ in some shells, opening a narrow TOCTOU window for a same-user process to
158
+ race the engine. Tracked as **F6** in this same security plan
159
+ (`P-f6-tmp-toctou`); the fix moves dispatch temp dirs to per-spawn unique
160
+ paths with restrictive permissions.
161
+ - **Log redactor coverage is unaudited.** See §4 and `D-f9` /
162
+ `Q-f9-log-redactor`. Until the audit lands, treat any log bundle that
163
+ might contain agent output, CI logs, or `live-output.log` excerpts as
164
+ potentially containing tokens, and rotate accordingly.
165
+ - **CSRF hardening sweep is deferred.** See §2. Origin gate + security
166
+ headers are in place today; the broader sweep (deny-by-default CORS,
167
+ `Sec-Fetch-Site` enforcement, optional bearer token) is `D-f8-csrf` /
168
+ `Q-csrf-followup`.
169
+
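For the temp-file item above, per-spawn unique directories with restrictive permissions can be sketched with Node's standard library. `makeDispatchTmpDir` and the demo root are assumptions for illustration and do not describe the actual F6 implementation; only the `dispatch-` prefix mirrors the `engine/tmp/dispatch-*` layout referenced in this release.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: one unpredictable directory per spawn, owner-only access.
function makeDispatchTmpDir(root) {
  fs.mkdirSync(root, { recursive: true });
  // mkdtempSync appends random characters to the prefix and creates the directory in one call.
  const dir = fs.mkdtempSync(path.join(root, 'dispatch-'));
  // Set owner-only permissions explicitly (Windows largely ignores POSIX mode bits).
  fs.chmodSync(dir, 0o700);
  return dir;
}

// Demo under the OS temp directory rather than engine/tmp.
const demoRoot = path.join(os.tmpdir(), 'minions-tmp-sketch');
const first = makeDispatchTmpDir(demoRoot);
const second = makeDispatchTmpDir(demoRoot);
console.log(path.basename(first), path.basename(second));
```

Creating the directory and choosing its name in a single call removes the gap between "pick a path" and "create it" that a same-user process could otherwise race.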
170
+ ---
171
+
172
+ **Updating this doc:** If you change the dashboard's bind address, add or
173
+ remove an authn/authz mechanism, change how completion reports are trusted,
174
+ change how secrets are read, or land any of F5 / F6 / F9 / the CSRF
175
+ follow-up, update the relevant section here in the same PR. Keep the
176
+ "in-scope vs residual vs deferred" split — it is the part reviewers come
177
+ back to.
@@ -0,0 +1,124 @@
1
+ /**
2
+ * engine/bridge.js — Constellation-bridge config + marker accessors.
3
+ *
4
+ * Pure helpers for the `minions bridge ...` subcommand and any future
5
+ * code that wants to inspect or mutate the read-only Constellation bridge
6
+ * surface. The bridge polling logic itself lives in the Constellation
7
+ * repo (P-wi1-bridge-readonly) — this file owns ONLY the on/off flag, the
8
+ * marker-file contract, and atomic config writes for the toggle.
9
+ *
10
+ * Strict semantics: only the literal boolean `true` enables the bridge.
11
+ * Mirror this check on the Constellation reader to avoid truthy coercion
12
+ * silently flipping bridge state on a typo'd `"enabled": "false"`.
13
+ */
14
+
15
+ const path = require('path');
16
+ const shared = require('./shared');
17
+
18
+ const {
19
+ MINIONS_DIR,
20
+ CONSTELLATION_BRIDGE_MARKER_PATH,
21
+ CONSTELLATION_BRIDGE_MARKER_SCHEMA_VERSION,
22
+ safeJson,
23
+ mutateJsonFileLocked,
24
+ } = shared;
25
+
26
+ const CONFIG_PATH = path.join(MINIONS_DIR, 'config.json');
27
+
28
+ /**
29
+ * Strict check: `true` ⇔ bridge enabled. Any other shape (missing field,
30
+ * non-object, string "true", etc.) returns false.
31
+ */
32
+ function isBridgeEnabled(config) {
33
+ return config?.engine?.constellationBridge?.enabled === true;
34
+ }
35
+
36
+ /**
37
+ * Read the cross-repo marker written by the Constellation agent. Returns
38
+ * `null` when the file is missing, unreadable, or schema-mismatched.
39
+ *
40
+ * Marker shape (see ENGINE_DEFAULTS.constellationBridge docstring):
41
+ * { schemaVersion: 1, lastSeenAt: ISO8601,
42
+ * agentVersion?: string, source?: 'constellation-agent' }
43
+ */
44
+ function readBridgeMarker(markerPath = CONSTELLATION_BRIDGE_MARKER_PATH) {
45
+ const raw = safeJson(markerPath);
46
+ if (!raw || typeof raw !== 'object' || Array.isArray(raw)) return null;
47
+ if (raw.schemaVersion !== CONSTELLATION_BRIDGE_MARKER_SCHEMA_VERSION) return null;
48
+ if (typeof raw.lastSeenAt !== 'string') return null;
49
+ return {
50
+ schemaVersion: raw.schemaVersion,
51
+ lastSeenAt: raw.lastSeenAt,
52
+ agentVersion: typeof raw.agentVersion === 'string' ? raw.agentVersion : null,
53
+ source: typeof raw.source === 'string' ? raw.source : null,
54
+ };
55
+ }
56
+
57
+ /**
58
+ * Flip `config.engine.constellationBridge.enabled` atomically via
59
+ * mutateJsonFileLocked. Returns `{ previous: bool, current: bool }`.
60
+ * `configPath` override exists for unit tests.
61
+ */
62
+ function setBridgeEnabled(enabled, configPath = CONFIG_PATH) {
63
+ const next = enabled === true;
64
+ let previous = false;
65
+ mutateJsonFileLocked(configPath, (cfg) => {
66
+ if (!cfg || typeof cfg !== 'object' || Array.isArray(cfg)) return cfg;
67
+ cfg.engine = cfg.engine || {};
68
+ cfg.engine.constellationBridge = cfg.engine.constellationBridge || {};
69
+ previous = cfg.engine.constellationBridge.enabled === true;
70
+ cfg.engine.constellationBridge.enabled = next;
71
+ return cfg;
72
+ });
73
+ return { previous, current: next };
74
+ }
75
+
76
+ /**
77
+ * Human-readable relative age string (e.g. "12s ago", "3m ago", "2h ago").
78
+ * Caps at "1d+ ago" — anything older than the bridge polling cadence is
79
+ * already actionable as "stale".
80
+ */
81
+ function formatRelativeAge(isoTimestamp, nowMs = Date.now()) {
82
+ const t = Date.parse(isoTimestamp);
83
+ if (!Number.isFinite(t)) return '(unknown)';
84
+ const deltaSec = Math.max(0, Math.round((nowMs - t) / 1000));
85
+ if (deltaSec < 60) return `${deltaSec}s ago`;
86
+ if (deltaSec < 3600) return `${Math.round(deltaSec / 60)}m ago`;
87
+ if (deltaSec < 86400) return `${Math.round(deltaSec / 3600)}h ago`;
88
+ return '1d+ ago';
89
+ }
90
+
91
+ /**
92
+ * Compose the small stable projection the Constellation bridge consumes
93
+ * from `/api/status`. Kept narrow on purpose: full /api/status is large,
94
+ * unstable, and may surface unrelated local state. New fields go behind
95
+ * a deliberate schema version bump.
96
+ */
97
+ function projectStatusForBridge(statusJson) {
98
+ if (!statusJson || typeof statusJson !== 'object') return null;
99
+ const dispatch = statusJson.dispatch || {};
100
+ const queueCount = (arr) => (Array.isArray(arr) ? arr.length : 0);
101
+ return {
102
+ engineState: statusJson.control?.state ?? null,
103
+ enginePid: statusJson.control?.pid ?? null,
104
+ minionsVersion: statusJson.version ?? null,
105
+ agentCount: Array.isArray(statusJson.agents) ? statusJson.agents.length : null,
106
+ activeAgentCount: Array.isArray(statusJson.agents)
107
+ ? statusJson.agents.filter(a => a && a.status && a.status !== 'idle').length
108
+ : null,
109
+ dispatchPending: queueCount(dispatch.pending),
110
+ dispatchActive: queueCount(dispatch.active),
111
+ dispatchCompleted: queueCount(dispatch.completed),
112
+ projectCount: Array.isArray(statusJson.projects) ? statusJson.projects.length : null,
113
+ };
114
+ }
115
+
116
+ module.exports = {
117
+ isBridgeEnabled,
118
+ readBridgeMarker,
119
+ setBridgeEnabled,
120
+ formatRelativeAge,
121
+ projectStatusForBridge,
122
+ CONSTELLATION_BRIDGE_MARKER_PATH,
123
+ CONSTELLATION_BRIDGE_MARKER_SCHEMA_VERSION,
124
+ };
@@ -136,6 +136,12 @@ class Worker {
136
136
  this.killed = false;
137
137
  this.spawnError = null;
138
138
  this.firstSystemPromptSent = false;
139
+ // In-flight spawn+initialize+session/new promise. Set by getSession()
140
+ // before the worker is registered in _tabs, cleared after the handshake
141
+ // settles. Racing getSession() callers await this to avoid the
142
+ // "warm-reuse path returns sessionId=null while init is still pending"
143
+ // hang on first message of a freshly-warmed tab (W-mpd45blx00072f04).
144
+ this.initPromise = null;
139
145
  }
140
146
 
141
147
  // ── Spawn + initialize handshake ────────────────────────────────────────
@@ -499,6 +505,31 @@ async function getSession({ tabId, model, effort, mcpServers, systemPromptHash,
499
505
  // 'cold-spawn' — fresh proc + initialize + session/new
500
506
  let lifecycle = 'warm-reuse';
501
507
 
508
+ if (worker) {
509
+ // W-mpd45blx00072f04: if the existing worker is still mid-init (warm
510
+ // fired but session/new hasn't resolved yet), await the in-flight init
511
+ // BEFORE evaluating warm-reuse / newSession / cold-spawn — otherwise we
512
+ // return a SessionHandle with sessionId=null and the caller's first
513
+ // session/prompt fires with a null sessionId, causing every subsequent
514
+ // session/update notification to be dropped by _handleMessage's
515
+ // sessionId-match guard. User-visible symptom: first message on a
516
+ // freshly-warmed CC tab hangs (no chunks streamed, eventual onDone
517
+ // with empty text).
518
+ if (worker.initPromise) {
519
+ try {
520
+ await worker.initPromise;
521
+ } catch (err) {
522
+ // Warm init failed (e.g., auth). The originating call has already
523
+ // (or is about to) delete _tabs[tabId] and close the worker in its
524
+ // own catch handler. Surface the same error to this caller so the
525
+ // dashboard's spawn-failed path runs instead of hanging.
526
+ throw err;
527
+ }
528
+ // Re-read in case the failing initPromise's cleanup already ran.
529
+ worker = _tabs.get(tabId) || null;
530
+ }
531
+ }
532
+
502
533
  if (worker) {
503
534
  if (worker.killed) {
504
535
  _tabs.delete(tabId);
@@ -533,8 +564,24 @@ async function getSession({ tabId, model, effort, mcpServers, systemPromptHash,
533
564
  tabId, model, effort, mcpServers, mcpServersHash, systemPromptHash, cwd,
534
565
  });
535
566
  _tabs.set(tabId, worker);
567
+ // Set initPromise BEFORE awaiting so concurrent getSession() callers
568
+ // landing during the spawn+initialize+session/new round-trip can detect
569
+ // and await it (W-mpd45blx00072f04). Clear on settle so callers that
570
+ // arrive AFTER init succeeds skip the no-op await. Attach the clear
571
+ // handler as both success+failure listeners (not .finally()) so the
572
+ // chained promise has a rejection handler and doesn't surface as an
573
+ // unhandled rejection when init throws.
574
+ const initPromise = worker._spawnAndInit();
575
+ worker.initPromise = initPromise;
576
+ const clearInit = () => {
577
+ // Only clear if we're still the active promise — defensive against
578
+ // a future refactor that calls _spawnAndInit twice for the same
579
+ // Worker (current code path never does).
580
+ if (worker.initPromise === initPromise) worker.initPromise = null;
581
+ };
582
+ initPromise.then(clearInit, clearInit);
536
583
  try {
537
- await worker._spawnAndInit();
584
+ await initPromise;
538
585
  } catch (err) {
539
586
  _tabs.delete(tabId);
540
587
  try { worker.close(); } catch { /* already torn down */ }
package/engine/cleanup.js CHANGED
@@ -273,35 +273,42 @@ function _killProcessInWorktree(dir, activeProcesses, activeIds) {
273
273
  log('info', `Killed orphaned process for dispatch ${id} before worktree removal`);
274
274
  }
275
275
 
276
- // Check PID files in engine/tmp/ — only kill if no active dispatch matches
276
+ // Check PID files in engine/tmp/ — both legacy flat layout and per-dispatch
277
+ // dirs (P-f6-tmp-toctou). Only kill if no active dispatch matches.
277
278
  try {
278
- const tmpDir = path.join(ENGINE_DIR, 'tmp');
279
- for (const f of fs.readdirSync(tmpDir)) {
280
- if (!f.startsWith('pid-') || !f.endsWith('.pid')) continue;
281
- const pidFileName = f.replace(/^pid-/, '').replace(/\.pid$/, '');
282
- if (!dirLower.includes(pidFileName.slice(-8))) continue;
279
+ shared.forEachPidFile((pidFilePath, fileName, layout) => {
280
+ const pidFileName = fileName.replace(/^pid-/, '').replace(/\.pid$/, '');
281
+ if (!dirLower.includes(pidFileName.slice(-8))) return;
283
282
  // Verify this PID file's dispatch is not active
284
283
  let isActive = false;
285
284
  for (const id of activeIds) { if (pidFileName.includes(id.slice(-8))) { isActive = true; break; } }
286
- if (isActive) continue; // still active — do not kill
287
- const pid = parseInt(fs.readFileSync(path.join(tmpDir, f), 'utf8').trim(), 10);
285
+ if (isActive) return; // still active — do not kill
286
+ let pid;
287
+ try { pid = parseInt(fs.readFileSync(pidFilePath, 'utf8').trim(), 10); }
288
+ catch { return; }
288
289
  if (pid > 0) {
289
290
  // Verify the PID still belongs to a Minions runtime process before killing.
290
291
  // The shared helper inspects the PID's full command line for `claude` /
291
292
  // `copilot` so a recycled PID running an unrelated process is skipped.
292
293
  try {
293
294
  if (process.platform === 'win32') {
294
- if (!shared.isProcessCommandLineMatchingAgent(pid)) continue;
295
+ if (!shared.isProcessCommandLineMatchingAgent(pid)) return;
295
296
  exec(`taskkill /F /T /PID ${pid}`, { stdio: 'pipe', timeout: 5000, windowsHide: true });
296
297
  } else {
297
- if (!shared.isProcessCommandLineMatchingAgent(pid)) continue;
298
+ if (!shared.isProcessCommandLineMatchingAgent(pid)) return;
298
299
  try { process.kill(-pid, 'SIGKILL'); } catch { process.kill(pid, 'SIGKILL'); }
299
300
  }
300
- log('info', `Killed orphaned PID ${pid} (${f}) before worktree removal`);
301
+ log('info', `Killed orphaned PID ${pid} (${fileName}, ${layout}) before worktree removal`);
301
302
  } catch {} // process may already be dead
302
303
  }
303
- try { fs.unlinkSync(path.join(tmpDir, f)); } catch {}
304
- }
304
+ if (layout === 'dispatch-dir') {
305
+ // Remove the entire per-dispatch dir — its remaining sidecars are
306
+ // orphans of the same dead process.
307
+ try { shared.removeDispatchTmpDir(path.dirname(pidFilePath)); } catch {}
308
+ } else {
309
+ try { fs.unlinkSync(pidFilePath); } catch {}
310
+ }
311
+ });
305
312
  } catch {} // tmp dir may not exist
306
313
  }
307
314
 
@@ -313,9 +320,35 @@ async function runCleanup(config, verbose = false) {
313
320
  let cleaned = { tempFiles: 0, liveOutputs: 0, worktrees: 0, zombies: 0 };
314
321
 
315
322
  // 1. Clean stale temp prompt/sysprompt files and orphaned safeWrite .tmp.* files (older than 1 hour)
323
+ // P-f6-tmp-toctou: also sweep abandoned per-dispatch dirs (engine/tmp/dispatch-*),
324
+ // and recurse into them so leftover prompt/sysprompt sidecars from crashed
325
+ // dispatches don't accumulate.
316
326
  const oneHourAgo = Date.now() - 3600000;
317
327
  const tmpDir = path.join(ENGINE_DIR, 'tmp');
318
328
  const scanDirs = [ENGINE_DIR, ...(fs.existsSync(tmpDir) ? [tmpDir] : [])];
329
+ // Discover dispatch-* dirs under engine/tmp/ and scan their contents too.
330
+ if (fs.existsSync(tmpDir)) {
331
+ try {
332
+ for (const entry of fs.readdirSync(tmpDir, { withFileTypes: true })) {
333
+ if (!entry.isDirectory()) continue;
334
+ if (!entry.name.startsWith('dispatch-')) continue;
335
+ const full = path.join(tmpDir, entry.name);
336
+ if (!shared.validateDispatchTmpDir(full)) continue;
337
+ scanDirs.push(full);
338
+ }
339
+ } catch { /* tmp dir may be empty/missing */ }
340
+ }
341
+ // Track which dispatch dirs we touch so we can rm empty ones whose owning
342
+ // dispatch is no longer in the active set.
343
+ const activeDispatchTmpDirs = new Set();
344
+ try {
345
+ const dispatch = getDispatch();
346
+ for (const queue of ['pending', 'active']) {
347
+ for (const e of dispatch[queue] || []) {
348
+ if (e?.tmpDir) activeDispatchTmpDirs.add(path.resolve(e.tmpDir));
349
+ }
350
+ }
351
+ } catch { /* dispatch.json may be empty */ }
319
352
  for (const dir of scanDirs) {
320
353
  // Each directory gets its own try-catch so one failure doesn't abort other directories (Bug #27)
321
354
  let dirEntries;
@@ -341,6 +374,22 @@ async function runCleanup(config, verbose = false) {
341
374
  }
342
375
  }
343
376
  }
377
+ // Reap empty/stale per-dispatch tmp dirs not referenced by an active entry.
378
+ cleaned.dispatchDirs = 0;
379
+ if (fs.existsSync(tmpDir)) {
380
+ try {
381
+ for (const entry of fs.readdirSync(tmpDir, { withFileTypes: true })) {
382
+ if (!entry.isDirectory() || !entry.name.startsWith('dispatch-')) continue;
383
+ const full = path.join(tmpDir, entry.name);
384
+ if (!shared.validateDispatchTmpDir(full)) continue;
385
+ if (activeDispatchTmpDirs.has(path.resolve(full))) continue;
386
+ let stat;
387
+ try { stat = fs.statSync(full); } catch { continue; }
388
+ if (stat.mtimeMs >= oneHourAgo) continue;
389
+ if (shared.removeDispatchTmpDir(full)) cleaned.dispatchDirs++;
390
+ }
391
+ } catch { /* sweep is best-effort */ }
392
+ }
344
393
 
345
394
  // 2. Clean live-output.log and live-output-prev.log for idle agents (not currently working)
346
395
  for (const [agentId] of Object.entries(config.agents || {})) {
@@ -1111,31 +1160,31 @@ async function runCleanup(config, verbose = false) {
1111
1160
  } catch (e) { log('warn', 'cap cooldowns: ' + e.message); }
1112
1161
 
1113
1162
  // 12. Clean stale PID files — remove PID files whose process is no longer running
1163
+ // P-f6-tmp-toctou: walks BOTH legacy flat layout and per-dispatch-dir layout
1164
+ // via shared.forEachPidFile.
1114
1165
  cleaned.pidFiles = 0;
1115
1166
  try {
1116
1167
  const tmpDir = path.join(ENGINE_DIR, 'tmp');
1117
1168
  if (fs.existsSync(tmpDir)) {
1118
- let pidDirEntries;
1119
- try { pidDirEntries = fs.readdirSync(tmpDir); } catch { pidDirEntries = []; }
1120
1169
  const activePids = new Set();
1121
1170
  for (const [, info] of activeProcesses) {
1122
1171
  if (info.proc?.pid) activePids.add(String(info.proc.pid));
1123
1172
  }
1124
- for (const f of pidDirEntries) {
1125
- if (!f.startsWith('pid-') || !f.endsWith('.pid')) continue;
1126
- const fp = path.join(tmpDir, f);
1173
+ shared.forEachPidFile((pidFilePath, fileName, layout) => {
1127
1174
  try {
1128
- const pidStr = fs.readFileSync(fp, 'utf8').trim();
1175
+ const pidStr = fs.readFileSync(pidFilePath, 'utf8').trim();
1129
1176
  // Skip if actively tracked
1130
- if (activePids.has(pidStr)) continue;
1177
+ if (activePids.has(pidStr)) return;
1131
1178
  // Check if file is stale (>1 hour old)
1132
- const stat = fs.statSync(fp);
1179
+ const stat = fs.statSync(pidFilePath);
1133
1180
  if (stat.mtimeMs < oneHourAgo) {
1134
- fs.unlinkSync(fp);
1181
+ fs.unlinkSync(pidFilePath);
1135
1182
  cleaned.pidFiles++;
1183
+ // For dispatch-dir layout, the empty/stale dispatch dir gets reaped
1184
+ // by the stale-dispatch-dir sweep in step 1.
1136
1185
  }
1137
1186
  } catch { /* cleanup */ }
1138
- }
1187
+ });
1139
1188
  }
1140
1189
  } catch (e) { log('warn', 'clean stale PID files: ' + e.message); }
1141
1190