@tekyzinc/gsd-t 3.18.17 → 3.19.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -2,6 +2,75 @@
 
 All notable changes to GSD-T are documented here. Updated with each release.
 
+ ## [3.19.0] - 2026-04-23
+
+ ### Added — M46 Milestone: Unattended Iter-Parallel + Worker Fan-Out Completion
+
+ Closes the two biggest gaps from the 2026-04-23 five-surface parallelism audit: (2A) unattended multi-iteration parallelism and (2B) worker-side sub-fan-out.
+
+ **D1 — Iteration-parallel supervisor scaffold (`bin/gsd-t-unattended.cjs`):**
+ - `_runOneIter(state, opts) → IterResult` — extracted from the while-loop body at line 969 (68-line delta, zero behavior change when called sequentially)
+ - `_computeIterBatchSize(state, opts) → number` — mode-safety gates: `verify-needed → 1`, `milestone-boundary → 1`, `complete-milestone → 1`. Production default returns 1 (serial) unless the caller passes `opts.maxIterParallel` — the iter-parallel path stays opt-in pending the full concurrent-state-safety rewrite (backlog #24).
+ - `_runIterParallel(state, opts, iterFn, batchSize) → Promise<IterResult[]>` — uses `Promise.allSettled` for per-iter error isolation; one rejection does not cancel siblings (see the sketch after this list).
+ - `_reconcile(state, results)` — deduped union on `completedTasks`, OR on `verifyNeeded`, append-only `artifacts`, last-writer-wins `status`, writes `lastBatch` metadata. **Does NOT advance `state.iter`** — that invariant is owned by the main while loop via `_runOneIter`.
+ - `module.exports.__test__` exposes all four helpers to unit tests.
+
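A minimal sketch of the batch/error-isolation semantics described above — the helper body and the `IterResult` fields here are assumptions inferred from this entry, not the shipped code:

```js
// Sketch: run up to batchSize iters concurrently; a rejected iter becomes an
// error-shaped IterResult instead of cancelling its siblings.
async function runIterParallelSketch(state, opts, iterFn, batchSize) {
  const settled = await Promise.allSettled(
    Array.from({ length: batchSize }, () => iterFn(state, opts)),
  );
  return settled.map((s) =>
    s.status === "fulfilled"
      ? s.value
      : { status: "error", completedTasks: [], verifyNeeded: true, artifacts: [], error: s.reason },
  );
}
```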
+ **D2 — Worker sub-dispatch production path (`bin/gsd-t-worker-dispatch.cjs`):**
+ - `dispatchWorkerTasks({projectDir, parentSessionId, tasks, maxParallel=4}) → Promise<{parallel, taskResults, wallClockMs, reason}>` — deterministic probe + delegation to `bin/gsd-t-parallel.cjs::runDispatch` when the task graph is file-disjoint and `tasks.length > 1`.
+ - Resolves with one of the reason strings `no-tasks`, `single-task`, `file-overlap`, or `dispatch-error:*` (short-circuits), or `dispatched` (fan-out taken) — see the probe sketch after this list.
+ - CLI entry: `node bin/gsd-t-worker-dispatch.cjs --parent-session <id> --tasks <path> --max-parallel <n>` — emits a JSON result on stdout.
+ - `commands/gsd-t-resume.md` Step 0 `GSD_T_UNATTENDED_WORKER=1` branch gains an additive sub-dispatch block (no existing prose deleted).
+ - `bin/spawn-plan-writer.cjs` kind enum extended with `unattended-worker-sub`.
+
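A sketch of the deterministic probe implied by the reason strings above; the task shape (`{ id, files }`) and helper name are hypothetical:

```js
// Sketch: fan out only when no two tasks touch the same file.
function probeFileDisjoint(tasks) {
  if (!Array.isArray(tasks) || tasks.length === 0) return { parallel: false, reason: "no-tasks" };
  if (tasks.length === 1) return { parallel: false, reason: "single-task" };
  const seen = new Set();
  for (const t of tasks) {
    for (const f of t.files || []) {
      if (seen.has(f)) return { parallel: false, reason: "file-overlap" };
      seen.add(f);
    }
  }
  // Disjoint: the caller then delegates to runDispatch and reports
  // "dispatched", or "dispatch-error:<message>" if the delegation throws.
  return { parallel: true };
}
```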
+ ### Contracts
+
+ - **`.gsd-t/contracts/iter-parallel-contract.md`** — NEW v1.0.0. Batch semantics, `IterResult` shape, reconciliation rule, stop-check batch-boundary invariant (stop-check is never polled mid-batch).
+ - **`.gsd-t/contracts/headless-default-contract.md`** — v2.0.0 → v2.1.0 (additive §Worker Sub-Dispatch documenting the `unattended-worker-sub` kind and the resume-path hand-off).
+
+ ### Measurements — both proofs passed their thresholds
+
+ - **D1 iter-proof (`bin/m46-iter-proof.cjs`)**: 10-iter synthetic workload, 200ms of work per iter, batch=4 vs serial. Result: `T_serial=2022.6ms`, `T_par=602.9ms`, **speedup=3.35×**, `T_par/T_serial=0.298` — passes thresholds `speedup ≥ 3.0` and `T_par/T_serial ≤ 0.35` (arithmetic check after this list).
+ - **D2 worker-proof (`bin/m46-worker-proof.cjs`)**: 6-task file-disjoint synthetic workload, serial vs fan-out via `runDispatch`. Result: `T_serial=12134ms`, `T_par=2034ms`, **speedup=5.96×** — passes threshold `speedup ≥ 2.5`.
+ - Reports: `.gsd-t/metrics/m46-iter-proof.json` + `.gsd-t/metrics/m46-worker-proof.json`.
+
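The speedup figures are plain wall-clock ratios; checking them against the thresholds (numbers taken from the entries above):

```js
// speedup = T_serial / T_par, as reported by the proof harnesses.
const d1Speedup = 2022.6 / 602.9; // ≈ 3.35  → passes speedup ≥ 3.0
const d1Ratio = 602.9 / 2022.6;   // ≈ 0.298 → passes T_par/T_serial ≤ 0.35
const d2Speedup = 12134 / 2034;   // ≈ 5.966 (reported 5.96×) → passes speedup ≥ 2.5
console.log(d1Speedup >= 3.0, d1Ratio <= 0.35, d2Speedup >= 2.5); // true true true
```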
+ ### Tests
+
+ - **`test/m46-d1-iter-parallel.test.js`** — 12 unit tests covering serial fallback, 3-way parallel batch concurrency (<200ms for three 100ms iters; see the timing sketch after this list), mode-safety gates (verify-needed / complete-milestone / milestoneBoundary), error isolation (one rejection, two siblings succeed), the stop-check batch-boundary invariant, and `_reconcile` semantics.
+ - **`test/m46-d2-worker-subdispatch.test.js`** — 6 unit tests covering disjoint fan-out, single-task short-circuit, file-overlap detection, dispatcher error surfacing, and the CLI JSON-stdout contract.
+ - Full suite: **1946/1946 pass**, zero regressions (M43 heartbeat-watchdog + M44 planner-wire + M45 suites all green).
+
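A sketch of what the 3-way timing assertion might look like with `node:test` (the real test file is not in this diff, and it would exercise `_runIterParallel` via `module.exports.__test__` rather than the inline `Promise.all` used here):

```js
const test = require("node:test");
const assert = require("node:assert");

test("three 100ms iters run as one concurrent batch", async () => {
  const iterFn = () => new Promise((resolve) => setTimeout(() => resolve({ status: "ok" }), 100));
  const t0 = Date.now();
  await Promise.all([iterFn(), iterFn(), iterFn()]);
  // Concurrent: ~100ms. Serial would be ~300ms, so <200ms proves overlap.
  assert.ok(Date.now() - t0 < 200);
});
```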
+ ### Regression caught and fixed mid-milestone
+
+ - **Double-increment of `state.iter`** between `_reconcile` and `_runOneIter` tripped 4 tests (m43-heartbeat-watchdog `staleHeartbeat res → exitCode 125`, m44-wire-unattended-to-planner iter-count / fallback / sequential). Root cause: `_reconcile` was advancing `state.iter` by `results.length` while `_runOneIter` had already advanced it by 1 per iter. Fix: `_reconcile` leaves `state.iter` untouched — the main loop owns that invariant (see the sketch below); two m46-d1 tests were updated to match the new contract.
+
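A self-contained model of the fixed ownership rule (simplified names; the real helpers carry far more state):

```js
// The iter function is the sole owner of the counter; reconcile merges
// results but never touches it — the double-increment was the regression.
function runOneIter(state) {
  state.iter += 1;
  return { completedTasks: [`T${state.iter}`] };
}
function reconcile(state, results) {
  for (const r of results) for (const t of r.completedTasks) state.completedTasks.add(t);
  // deliberately no state.iter mutation here
}

const state = { iter: 0, completedTasks: new Set() };
reconcile(state, [runOneIter(state), runOneIter(state)]);
console.log(state.iter); // 2 — with the old _reconcile it would have been 4
```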
+ ### Follow-up backlog
+
+ - **#24 — Dynamic work-stealing rewrite** (full concurrent-safe state isolation for `_runIterParallel` batches >1). Covers the `state.iter` / heartbeat / writeState shared-mutation issue that keeps iter-parallel engagement opt-in rather than default-on.
+ - **D2-T11 integration smoke** deferred — unit tests + the proof harness cover the surface.
+
+ ## [3.18.18] - 2026-04-23
+
+ ### Added — Model-aware worker spawn in `runDispatch`
+
+ - **`bin/gsd-t-parallel.cjs::runDispatch`**: fan-out workers now default to `claude-sonnet-4-6` via the new constant `DEFAULT_WORKER_MODEL` (previously they inherited the orchestrator's `ANTHROPIC_MODEL`, which is `claude-opus-4-7` in this user's global settings). Callers override via `opts.workerModel`: alias strings (`"opus"` / `"sonnet"` / `"haiku"`) resolve to full model IDs; explicit full IDs pass through; `workerModel: false` opts out of the override entirely and inherits the parent (see the usage sketch below). **Why**: the 2026-04-23 M46 Wave 1 dispatch rate-limited all 8 Opus workers (Max 20x subscription concurrent-session throttle). Sonnet lives in a separate rate bucket, so an Opus orchestrator with Sonnet workers lifts the concurrency ceiling. Per-task Opus opt-in via an `[opus]` marker on tasks.md lines is future work (surfaces in planner metadata).
+ - **`bin/headless-auto-spawn.cjs::autoSpawnHeadless`**: accepts `workerModel?: string` and sets `ANTHROPIC_MODEL` in the child env after the caller's `envOverride` merge (so the caller always wins if they explicitly set `ANTHROPIC_MODEL` in `env`).
+
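A usage sketch of the override knob (assumes `runDispatch` is exported, which this diff does not show; other `runDispatch` options elided):

```js
const { runDispatch } = require("./bin/gsd-t-parallel.cjs");

runDispatch({ projectDir: process.cwd(), command: "execute" });                       // default: claude-sonnet-4-6
runDispatch({ projectDir: process.cwd(), command: "execute", workerModel: "haiku" }); // alias → claude-haiku-4-5-20251001
runDispatch({ projectDir: process.cwd(), command: "execute", workerModel: false });   // inherit parent's ANTHROPIC_MODEL
```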
+ ### Added — Spawn stagger to avoid burst spikes
+
+ - **`bin/gsd-t-parallel.cjs::runDispatch`**: new `opts.spawnStaggerMs` (default **3000** ms) delays each spawn after the first. Implemented via `Atomics.wait` on a `SharedArrayBuffer`, so the blocking wait releases the CPU (no spin loop). 2026-04-23 observation: 8 concurrent `claude -p` processes spawned within 700 ms → all 429 rate-limited; a 3 s stagger avoids the burst. Set `spawnStaggerMs: 0` for pre-v3.18.18 behavior.
+
+ ### Added — Cache-warming probe (opt-in)
+
+ - **`bin/gsd-t-parallel.cjs::_runCacheWarmProbe`** + opt-in flag `opts.cacheWarm` / env `GSD_T_CACHE_WARM=1`. Before fan-out, fires a short `claude -p` that reads CLAUDE.md + progress.md + top-level contracts using the **same model** the workers will run, so Anthropic's 5-minute prompt cache is populated and the workers' identical initial reads return cache-read tokens (free for the ITPM budget, lower rate-limit pressure). The probe is synchronous (workers land inside the warm window rather than racing it), has a 60 s timeout, and its failure does not block fan-out. Dependency-injection hook `opts.cacheWarmProbeImpl` for tests. Gated behind opt-in until backlog #23 (mitmproxy header instrumentation) measures the actual delta; flips to default-on if measurement confirms the ITPM savings are real.
+
+ ### Tests
+
+ - **`test/m44-run-dispatch.test.js`**: 4 new tests for model selection (default Sonnet / alias resolution / explicit opt-out / stagger timing) + 3 new tests for cache-warming (opt-in gating / probe model matches workers / probe failure does not block fan-out). Full suite **2023/2023** pass (baseline 2016 + 7 new).
+
+ ### Incident — 2026-04-23 M46 Wave 1 rate-limit
+
+ - Root cause: all 8 headless workers inherited `ANTHROPIC_MODEL=claude-opus-4-7` from `~/.claude/settings.json` (this user runs Max 20x on Opus globally) and spawned in a 700 ms burst. The Max subscription's concurrent-session throttle fired ~1 s into each worker's first tool call. Anthropic Console dashboards showed flat-zero API usage — confirming the throttle is subscription-side, not API-key-side. Mitigations shipped in this release (model mix + stagger + opt-in cache-warm), plus scoped backlog items #22 (coord-gate runtime coordination) and #23 (mitmproxy header instrumentation for calibration).
+
 ## [3.18.17] - 2026-04-23
 
 ### Fixed — `npm test` picks up `worker-sim.js` fixture
@@ -79,7 +79,7 @@ function detectMode(opts, env) {
 // ─── CLI arg parsing ──────────────────────────────────────────────────────
 
 function parseArgv(argv) {
-  const out = { help: false, dryRun: false, mode: null, milestone: null, domain: null, command: null };
+  const out = { help: false, dryRun: false, mode: null, milestone: null, domain: null, command: null, maxWorkers: null, stagger: null };
   for (let i = 0; i < argv.length; i++) {
     const a = argv[i];
     if (a === "--help" || a === "-h") out.help = true;
@@ -92,6 +92,10 @@ function parseArgv(argv) {
     else if (a.startsWith("--domain=")) out.domain = a.slice("--domain=".length);
     else if (a === "--command") out.command = argv[++i] || null;
     else if (a.startsWith("--command=")) out.command = a.slice("--command=".length);
+    else if (a === "--max-workers") out.maxWorkers = parseInt(argv[++i], 10);
+    else if (a.startsWith("--max-workers=")) out.maxWorkers = parseInt(a.slice("--max-workers=".length), 10);
+    else if (a === "--stagger") out.stagger = parseInt(argv[++i], 10);
+    else if (a.startsWith("--stagger=")) out.stagger = parseInt(a.slice("--stagger=".length), 10);
   }
   return out;
 }
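Illustrative invocation of the two new flags (values hypothetical; per `runCli` further down, `--stagger` is given in seconds and multiplied by 1000 into `spawnStaggerMs`): `node bin/gsd-t-parallel.cjs --command execute --max-workers 2 --stagger 10` caps fan-out at two workers and leaves a 10 s gap between spawns.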
@@ -325,6 +329,69 @@ function _partitionTaskIds(taskIds, workerCount) {
   return buckets.filter((b) => b.length > 0);
 }
 
+/**
+ * _runCacheWarmProbe — fire a single short `claude -p` before fan-out so the
+ * Anthropic prompt cache (5-min TTL) is pre-populated with the files every
+ * worker will read. When workers spawn within the warm window, their initial
+ * Read(CLAUDE.md), Read(progress.md), Read(contracts/*.md) return cache-read
+ * tokens (free for ITPM budget, lower rate-limit pressure).
+ *
+ * Returns `{ok, filesRead, error}`. Best-effort; failures do not block fan-out.
+ *
+ * The probe reads the existing files (skips missing ones silently) and asks
+ * the child to print the literal string "warm" — cheap, deterministic, fast.
+ * Cache key matches exactly when `model` equals the workers' model and the
+ * workers use the same tool-call shape (Read on the same paths) within TTL.
+ */
+function _runCacheWarmProbe(opts) {
+  const projectDir = (opts && opts.projectDir) || process.cwd();
+  const model = opts && opts.model;
+  const timeoutMs = Number.isFinite(opts && opts.timeoutMs) ? opts.timeoutMs : 60000;
+
+  const candidates = [
+    "CLAUDE.md",
+    ".gsd-t/progress.md",
+    ".gsd-t/contracts/headless-default-contract.md",
+    ".gsd-t/contracts/wave-join-contract.md",
+  ];
+  const filesRead = candidates.filter((rel) => {
+    try {
+      return fs.statSync(path.join(projectDir, rel)).isFile();
+    } catch {
+      return false;
+    }
+  });
+  if (filesRead.length === 0) return { ok: false, filesRead: [], error: "no_warm_files" };
+
+  const { spawnSync } = require("node:child_process");
+  const prompt =
+    "Read the following files so they enter the prompt cache for subsequent workers, " +
+    "then reply with the single word `warm` and nothing else:\n" +
+    filesRead.map((f) => `- ${f}`).join("\n");
+
+  const env = Object.assign({}, process.env);
+  if (model) env.ANTHROPIC_MODEL = model;
+
+  try {
+    const r = spawnSync(
+      "claude",
+      ["-p", prompt, "--dangerously-skip-permissions"],
+      {
+        cwd: projectDir,
+        env,
+        encoding: "utf8",
+        timeout: timeoutMs,
+        stdio: ["ignore", "pipe", "pipe"],
+      },
+    );
+    if (r.error) return { ok: false, filesRead, error: r.error.message };
+    if (r.status !== 0) return { ok: false, filesRead, error: `exit_${r.status}` };
+    return { ok: true, filesRead };
+  } catch (e) {
+    return { ok: false, filesRead, error: (e && e.message) || "spawn_error" };
+  }
+}
+
 /**
  * runDispatch — the single instrument every command delegates to.
  *
@@ -389,8 +456,30 @@ function runDispatch(opts) {
     };
   }
 
-  const workerCount = Number(result.workerCount) || 0;
+  const plannerWorkerCount = Number(result.workerCount) || 0;
   const parallelTasks = Array.isArray(result.parallelTasks) ? result.parallelTasks : [];
+
+  // Concurrency cap (v3.18.19) — caller may clamp the planner-selected worker
+  // count via `opts.maxWorkers`. Motivated by the 2026-04-23 incident: the Max
+  // subscription concurrent-session throttle rate-limits `claude -p` bursts
+  // regardless of model choice (since all spawns inherit the parent's Max
+  // OAuth, not an API key — see feedback_anthropic_key_measurement_only). The
+  // planner has no knowledge of this throttle; callers who know they're near
+  // the ceiling need a direct cap.
+  const cap = Number.isFinite(opts && opts.maxWorkers) && opts.maxWorkers > 0
+    ? Math.floor(opts.maxWorkers)
+    : 2; // conservative default when the caller passes no cap
+  const workerCount = Math.min(plannerWorkerCount, cap);
+  if (workerCount < plannerWorkerCount) {
+    appendEvent(projectDir, {
+      type: "parallelism_reduced",
+      source: "dispatch_max_workers_cap",
+      original_count: plannerWorkerCount,
+      reduced_count: workerCount,
+      reason: `max_workers_cap:${cap}`,
+      ts: new Date().toISOString(),
+    });
+  }
   const subsets = workerCount >= 2 ? _partitionTaskIds(parallelTasks, workerCount) : [];
 
   if (subsets.length < 2) {
@@ -433,8 +522,90 @@ function runDispatch(opts) {
     ts: new Date().toISOString(),
   });
 
+  // Worker model selection (v3.18.18) — mechanical fan-out defaults to Sonnet
+  // so the orchestrator's Opus bucket isn't the bottleneck. Caller may
+  // override via `opts.workerModel` ("opus" | "sonnet" | "haiku" | full ID).
+  // A task can opt back to Opus by declaring "[opus]" in its tasks.md line;
+  // the planner surfaces this via per-task metadata (future; today the per-
+  // subset opt-in is an all-or-nothing knob passed by the caller).
+  const DEFAULT_WORKER_MODEL = "claude-sonnet-4-6";
+  const modelAlias = {
+    opus: "claude-opus-4-7",
+    sonnet: "claude-sonnet-4-6",
+    haiku: "claude-haiku-4-5-20251001",
+  };
+  const callerModel = opts && opts.workerModel;
+  const workerModel = callerModel === false
+    ? null // explicit opt-out: inherit parent's ANTHROPIC_MODEL
+    : (modelAlias[callerModel] || callerModel || DEFAULT_WORKER_MODEL);
+
+  // Stagger between spawns — the 10s default was empirically validated against
+  // the Max-subscription concurrent-session throttle (2026-04-23 M46 probe: two
+  // 10s-staggered 2-parallel rounds of real work, both exit 0, no 429; the prior
+  // 3s default burst at >2 workers hit rate limits). Caller may override via
+  // `opts.spawnStaggerMs` (0 = no delay, previous burst behavior).
+  const staggerMs = Number.isFinite(opts && opts.spawnStaggerMs)
+    ? Math.max(0, opts.spawnStaggerMs)
+    : 10000;
+  const busyWait = (ms) => {
+    if (!ms) return;
+    // Synchronous sleep that releases the CPU (Atomics.wait on a dummy
+    // SharedArrayBuffer — pattern used in Node REPL/sync-sleep helpers).
+    // Keeps runDispatch's sync return contract without pegging a core.
+    // Total wall-clock added to startup: (subsets - 1) * staggerMs.
+    try {
+      const sab = new SharedArrayBuffer(4);
+      const view = new Int32Array(sab);
+      Atomics.wait(view, 0, 0, ms);
+    } catch (_) {
+      // Atomics unavailable — fall back to a coarse spin.
+      const until = Date.now() + ms;
+      while (Date.now() < until) { /* spin */ }
+    }
+  };
+
+  // Cache-warming probe (v3.18.19) — opt-in via GSD_T_CACHE_WARM=1 or
+  // opts.cacheWarm. Anthropic's prompt cache has a 5-minute TTL keyed on the
+  // exact system-prompt + tool-call prefix. One leader probe that reads the
+  // same foundational files every worker will read (CLAUDE.md, progress.md,
+  // top-level contracts) populates the cache so the first N seconds of every
+  // subsequent worker hit cache-read tokens (free for ITPM budget, lower
+  // rate-limit pressure). Probe runs synchronously so workers land inside
+  // the warm window rather than racing it. Gated behind opt-in until
+  // backlog #23 (mitmproxy instrumentation) measures the actual delta.
+  const warmEnv = (opts && opts.env) || process.env;
+  const cacheWarmEnabled =
+    (opts && opts.cacheWarm === true) ||
+    (!(opts && opts.cacheWarm === false) && warmEnv.GSD_T_CACHE_WARM === "1");
+  if (cacheWarmEnabled) {
+    const warmStart = Date.now();
+    let warmResult = { ok: false, error: "not_run" };
+    try {
+      const probeImpl = (opts && opts.cacheWarmProbeImpl) || _runCacheWarmProbe;
+      warmResult = probeImpl({
+        projectDir,
+        model: workerModel, // same model as workers so the cache key matches
+        timeoutMs: (opts && Number.isFinite(opts.cacheWarmTimeoutMs))
+          ? opts.cacheWarmTimeoutMs
+          : 60000,
+      });
+    } catch (e) {
+      warmResult = { ok: false, error: (e && e.message) || "unknown" };
+    }
+    appendEvent(projectDir, {
+      type: "cache_warm_probe",
+      source: "dispatch",
+      ok: !!warmResult.ok,
+      duration_ms: Date.now() - warmStart,
+      error: warmResult.error,
+      files_read: warmResult.filesRead,
+      ts: new Date().toISOString(),
+    });
+  }
+
   const workerResults = [];
   for (let i = 0; i < subsets.length; i++) {
+    if (i > 0) busyWait(staggerMs);
    const subset = subsets[i];
    const workerEnv = {
      GSD_T_WORKER_TASK_IDS: subset.join(","),
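The sync-sleep pattern used by `busyWait`, isolated as a runnable snippet (Node permits `Atomics.wait` on the main thread, unlike browsers):

```js
// Blocks ~250ms without spinning: the wait always times out because nothing
// ever notifies index 0 of the dummy buffer.
const view = new Int32Array(new SharedArrayBuffer(4));
const t0 = Date.now();
Atomics.wait(view, 0, 0, 250);
console.log(`slept ${Date.now() - t0} ms`); // ≈ 250
```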
@@ -450,6 +621,7 @@
       projectDir,
       env: workerEnv,
       spawnType: "primary",
+      workerModel,
     });
   } catch (e) {
     spawnError = (e && e.message) || "unknown";
@@ -549,14 +721,21 @@ function runCli(argv, env) {
   // --command: live dispatch. The single instrument that command files
   // delegate to instead of re-implementing probe-and-branch logic.
   if (args.command) {
-    const dispatch = runDispatch({
+    const dispatchOpts = {
       projectDir: process.cwd(),
       mode,
       milestone: args.milestone,
       domain: args.domain,
       command: args.command,
       env,
-    });
+    };
+    if (Number.isFinite(args.maxWorkers) && args.maxWorkers > 0) {
+      dispatchOpts.maxWorkers = args.maxWorkers;
+    }
+    if (Number.isFinite(args.stagger) && args.stagger >= 0) {
+      dispatchOpts.spawnStaggerMs = args.stagger * 1000; // --stagger is given in seconds
+    }
+    const dispatch = runDispatch(dispatchOpts);
     if (dispatch.decision === "fan_out") {
       process.stdout.write(
         `gsd-t parallel — fan_out command=${args.command} mode=${dispatch.mode} workers=${dispatch.fanOutCount}\n`,
@@ -606,6 +785,7 @@ module.exports = {
   _detectMode: detectMode,
   _appendEvent: appendEvent,
   _partitionTaskIds,
+  _runCacheWarmProbe,
   _HELP_TEXT: HELP_TEXT,
 };
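With `_runCacheWarmProbe` exported, its no-files short-circuit is checkable without spawning `claude` — a sketch (the file name and relative require path are assumptions):

```js
// test/cache-warm-probe.sketch.test.js — an empty project dir has none of the
// candidate warm files, so the probe returns before any spawn.
const test = require("node:test");
const assert = require("node:assert");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const { _runCacheWarmProbe } = require("../bin/gsd-t-parallel.cjs");

test("probe short-circuits when no warm files exist", () => {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "gsd-t-warm-"));
  assert.deepStrictEqual(_runCacheWarmProbe({ projectDir: dir }), {
    ok: false,
    filesRead: [],
    error: "no_warm_files",
  });
});
```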