@tekyzinc/gsd-t 3.18.17 → 3.19.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -2,6 +2,75 @@
 
 All notable changes to GSD-T are documented here. Updated with each release.
 
+ ## [3.19.0] - 2026-04-23
+
+ ### Added — M46 Milestone: Unattended Iter-Parallel + Worker Fan-Out Completion
+
+ Closes the two biggest gaps from the 2026-04-23 five-surface parallelism audit: (2A) unattended multi-iteration parallelism and (2B) worker-side sub-fan-out.
+
+ **D1 — Iteration-parallel supervisor scaffold (`bin/gsd-t-unattended.cjs`):**
+ - `_runOneIter(state, opts) → IterResult` — extracted from the while-loop body at line 969 (68-line delta, zero behavior change when called sequentially)
+ - `_computeIterBatchSize(state, opts) → number` — mode-safety gates: `verify-needed → 1`, `milestone-boundary → 1`, `complete-milestone → 1`. Production default returns 1 (serial) unless the caller passes `opts.maxIterParallel` — the iter-parallel path stays opt-in pending the full concurrent-state-safety rewrite (backlog #24).
+ - `_runIterParallel(state, opts, iterFn, batchSize) → Promise<IterResult[]>` — uses `Promise.allSettled` for per-iter error isolation; one rejection does not cancel siblings (see the sketch after this list).
+ - `_reconcile(state, results)` — deduped union on `completedTasks`, OR on `verifyNeeded`, append-only `artifacts`, last-writer-wins `status`, writes `lastBatch` metadata. **Does NOT advance `state.iter`** — that invariant is owned by the main while loop via `_runOneIter`.
+ - `module.exports.__test__` exposes all four helpers to unit tests.
+
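A minimal sketch of the batch/error-isolation semantics described above — the helper body and the `IterResult` fields here are assumptions inferred from this entry, not the shipped code:

```js
// Sketch: run up to batchSize iters concurrently; a rejected iter becomes an
// error-shaped IterResult instead of cancelling its siblings.
async function runIterParallelSketch(state, opts, iterFn, batchSize) {
  const settled = await Promise.allSettled(
    Array.from({ length: batchSize }, () => iterFn(state, opts)),
  );
  return settled.map((s) =>
    s.status === "fulfilled"
      ? s.value
      : { status: "error", completedTasks: [], verifyNeeded: true, artifacts: [], error: s.reason },
  );
}
```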
+ **D2 — Worker sub-dispatch production path (`bin/gsd-t-worker-dispatch.cjs`):**
+ - `dispatchWorkerTasks({projectDir, parentSessionId, tasks, maxParallel=4}) → Promise<{parallel, taskResults, wallClockMs, reason}>` — deterministic probe + delegation to `bin/gsd-t-parallel.cjs::runDispatch` when the task graph is file-disjoint and `tasks.length > 1`.
+ - Resolves with one of the reason strings `no-tasks`, `single-task`, `file-overlap`, or `dispatch-error:*` (short-circuits), or `dispatched` (fan-out taken) — see the probe sketch after this list.
+ - CLI entry: `node bin/gsd-t-worker-dispatch.cjs --parent-session <id> --tasks <path> --max-parallel <n>` — emits a JSON result on stdout.
+ - `commands/gsd-t-resume.md` Step 0 `GSD_T_UNATTENDED_WORKER=1` branch gains an additive sub-dispatch block (no existing prose deleted).
+ - `bin/spawn-plan-writer.cjs` kind enum extended with `unattended-worker-sub`.
+
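A sketch of the deterministic probe implied by the reason strings above; the task shape (`{ id, files }`) and helper name are hypothetical:

```js
// Sketch: fan out only when no two tasks touch the same file.
function probeFileDisjoint(tasks) {
  if (!Array.isArray(tasks) || tasks.length === 0) return { parallel: false, reason: "no-tasks" };
  if (tasks.length === 1) return { parallel: false, reason: "single-task" };
  const seen = new Set();
  for (const t of tasks) {
    for (const f of t.files || []) {
      if (seen.has(f)) return { parallel: false, reason: "file-overlap" };
      seen.add(f);
    }
  }
  // Disjoint: the caller then delegates to runDispatch and reports
  // "dispatched", or "dispatch-error:<message>" if the delegation throws.
  return { parallel: true };
}
```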
+ ### Contracts
+
+ - **`.gsd-t/contracts/iter-parallel-contract.md`** — NEW v1.0.0. Batch semantics, `IterResult` shape, reconciliation rule, stop-check batch-boundary invariant (stop-check is never polled mid-batch).
+ - **`.gsd-t/contracts/headless-default-contract.md`** — v2.0.0 → v2.1.0 (additive §Worker Sub-Dispatch documenting the `unattended-worker-sub` kind and the resume-path hand-off).
+
+ ### Measurements — both proofs passed their thresholds
+
+ - **D1 iter-proof (`bin/m46-iter-proof.cjs`)**: 10-iter synthetic workload, 200ms of work per iter, batch=4 vs serial. Result: `T_serial=2022.6ms`, `T_par=602.9ms`, **speedup=3.35×**, `T_par/T_serial=0.298` — passes thresholds `speedup ≥ 3.0` and `T_par/T_serial ≤ 0.35` (arithmetic check after this list).
+ - **D2 worker-proof (`bin/m46-worker-proof.cjs`)**: 6-task file-disjoint synthetic workload, serial vs fan-out via `runDispatch`. Result: `T_serial=12134ms`, `T_par=2034ms`, **speedup=5.96×** — passes threshold `speedup ≥ 2.5`.
+ - Reports: `.gsd-t/metrics/m46-iter-proof.json` + `.gsd-t/metrics/m46-worker-proof.json`.
+
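The speedup figures are plain wall-clock ratios; checking them against the thresholds (numbers taken from the entries above):

```js
// speedup = T_serial / T_par, as reported by the proof harnesses.
const d1Speedup = 2022.6 / 602.9; // ≈ 3.35  → passes speedup ≥ 3.0
const d1Ratio = 602.9 / 2022.6;   // ≈ 0.298 → passes T_par/T_serial ≤ 0.35
const d2Speedup = 12134 / 2034;   // ≈ 5.966 (reported 5.96×) → passes speedup ≥ 2.5
console.log(d1Speedup >= 3.0, d1Ratio <= 0.35, d2Speedup >= 2.5); // true true true
```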
+ ### Tests
+
+ - **`test/m46-d1-iter-parallel.test.js`** — 12 unit tests covering serial fallback, 3-way parallel batch concurrency (<200ms for three 100ms iters; see the timing sketch after this list), mode-safety gates (verify-needed / complete-milestone / milestoneBoundary), error isolation (one rejection, two siblings succeed), the stop-check batch-boundary invariant, and `_reconcile` semantics.
+ - **`test/m46-d2-worker-subdispatch.test.js`** — 6 unit tests covering disjoint fan-out, single-task short-circuit, file-overlap detection, dispatcher error surfacing, and the CLI JSON-stdout contract.
+ - Full suite: **1946/1946 pass**, zero regressions (M43 heartbeat-watchdog + M44 planner-wire + M45 suites all green).
+
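A sketch of what the 3-way timing assertion might look like with `node:test` (the real test file is not in this diff, and it would exercise `_runIterParallel` via `module.exports.__test__` rather than the inline `Promise.all` used here):

```js
const test = require("node:test");
const assert = require("node:assert");

test("three 100ms iters run as one concurrent batch", async () => {
  const iterFn = () => new Promise((resolve) => setTimeout(() => resolve({ status: "ok" }), 100));
  const t0 = Date.now();
  await Promise.all([iterFn(), iterFn(), iterFn()]);
  // Concurrent: ~100ms. Serial would be ~300ms, so <200ms proves overlap.
  assert.ok(Date.now() - t0 < 200);
});
```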
+ ### Regression caught and fixed mid-milestone
+
+ - **Double-increment of `state.iter`** between `_reconcile` and `_runOneIter` tripped 4 tests (m43-heartbeat-watchdog `staleHeartbeat res → exitCode 125`, m44-wire-unattended-to-planner iter-count / fallback / sequential). Root cause: `_reconcile` was advancing `state.iter` by `results.length` while `_runOneIter` had already advanced it by 1 per iter. Fix: `_reconcile` leaves `state.iter` untouched — the main loop owns that invariant (see the sketch below); two m46-d1 tests were updated to match the new contract.
+
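A self-contained model of the fixed ownership rule (simplified names; the real helpers carry far more state):

```js
// The iter function is the sole owner of the counter; reconcile merges
// results but never touches it — the double-increment was the regression.
function runOneIter(state) {
  state.iter += 1;
  return { completedTasks: [`T${state.iter}`] };
}
function reconcile(state, results) {
  for (const r of results) for (const t of r.completedTasks) state.completedTasks.add(t);
  // deliberately no state.iter mutation here
}

const state = { iter: 0, completedTasks: new Set() };
reconcile(state, [runOneIter(state), runOneIter(state)]);
console.log(state.iter); // 2 — with the old _reconcile it would have been 4
```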
+ ### Follow-up backlog
+
+ - **#24 — Dynamic work-stealing rewrite** (full concurrent-safe state isolation for `_runIterParallel` batches >1). Covers the `state.iter` / heartbeat / writeState shared-mutation issue that keeps iter-parallel engagement opt-in rather than default-on.
+ - **D2-T11 integration smoke** deferred — unit tests + the proof harness cover the surface.
+
+ ## [3.18.18] - 2026-04-23
+
+ ### Added — Model-aware worker spawn in `runDispatch`
+
+ - **`bin/gsd-t-parallel.cjs::runDispatch`**: fan-out workers now default to `claude-sonnet-4-6` via the new constant `DEFAULT_WORKER_MODEL` (previously they inherited the orchestrator's `ANTHROPIC_MODEL`, which is `claude-opus-4-7` in this user's global settings). Callers override via `opts.workerModel`: alias strings (`"opus"` / `"sonnet"` / `"haiku"`) resolve to full model IDs; explicit full IDs pass through; `workerModel: false` opts out of the override entirely and inherits the parent (see the usage sketch below). **Why**: the 2026-04-23 M46 Wave 1 dispatch rate-limited all 8 Opus workers (Max 20x subscription concurrent-session throttle). Sonnet lives in a separate rate bucket, so an Opus orchestrator with Sonnet workers lifts the concurrency ceiling. Per-task Opus opt-in via an `[opus]` marker on tasks.md lines is future work (surfaces in planner metadata).
+ - **`bin/headless-auto-spawn.cjs::autoSpawnHeadless`**: accepts `workerModel?: string` and sets `ANTHROPIC_MODEL` in the child env after the caller's `envOverride` merge (so the caller always wins if they explicitly set `ANTHROPIC_MODEL` in `env`).
+
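A usage sketch of the override knob (assumes `runDispatch` is exported, which this diff does not show; other `runDispatch` options elided):

```js
const { runDispatch } = require("./bin/gsd-t-parallel.cjs");

runDispatch({ projectDir: process.cwd(), command: "execute" });                       // default: claude-sonnet-4-6
runDispatch({ projectDir: process.cwd(), command: "execute", workerModel: "haiku" }); // alias → claude-haiku-4-5-20251001
runDispatch({ projectDir: process.cwd(), command: "execute", workerModel: false });   // inherit parent's ANTHROPIC_MODEL
```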
+ ### Added — Spawn stagger to avoid burst spikes
+
+ - **`bin/gsd-t-parallel.cjs::runDispatch`**: new `opts.spawnStaggerMs` (default **3000** ms) delays each spawn after the first. Implemented via `Atomics.wait` on a `SharedArrayBuffer`, so the blocking wait releases the CPU (no spin loop). 2026-04-23 observation: 8 concurrent `claude -p` processes spawned within 700 ms → all 429 rate-limited; a 3 s stagger avoids the burst. Set `spawnStaggerMs: 0` for pre-v3.18.18 behavior.
+
+ ### Added — Cache-warming probe (opt-in)
+
+ - **`bin/gsd-t-parallel.cjs::_runCacheWarmProbe`** + opt-in flag `opts.cacheWarm` / env `GSD_T_CACHE_WARM=1`. Before fan-out, fires a short `claude -p` that reads CLAUDE.md + progress.md + top-level contracts using the **same model** the workers will run, so Anthropic's 5-minute prompt cache is populated and the workers' identical initial reads return cache-read tokens (free for the ITPM budget, lower rate-limit pressure). The probe is synchronous (workers land inside the warm window rather than racing it), has a 60 s timeout, and its failure does not block fan-out. Dependency-injection hook `opts.cacheWarmProbeImpl` for tests. Gated behind opt-in until backlog #23 (mitmproxy header instrumentation) measures the actual delta; flips to default-on if measurement confirms the ITPM savings are real.
+
+ ### Tests
+
+ - **`test/m44-run-dispatch.test.js`**: 4 new tests for model selection (default Sonnet / alias resolution / explicit opt-out / stagger timing) + 3 new tests for cache-warming (opt-in gating / probe model matches workers / probe failure does not block fan-out). Full suite **2023/2023** pass (baseline 2016 + 7 new).
+
+ ### Incident — 2026-04-23 M46 Wave 1 rate-limit
+
+ - Root cause: all 8 headless workers inherited `ANTHROPIC_MODEL=claude-opus-4-7` from `~/.claude/settings.json` (this user runs Max 20x on Opus globally) and spawned in a 700 ms burst. The Max subscription's concurrent-session throttle fired ~1 s into each worker's first tool call. Anthropic Console dashboards showed flat-zero API usage — confirming the throttle is subscription-side, not API-key-side. Mitigations shipped in this release (model mix + stagger + opt-in cache-warm), plus scoped backlog items #22 (coord-gate runtime coordination) and #23 (mitmproxy header instrumentation for calibration).
+
 ## [3.18.17] - 2026-04-23
 
 ### Fixed — `npm test` picks up `worker-sim.js` fixture
@@ -79,7 +79,7 @@ function detectMode(opts, env) {
 // ─── CLI arg parsing ──────────────────────────────────────────────────────
 
 function parseArgv(argv) {
-  const out = { help: false, dryRun: false, mode: null, milestone: null, domain: null, command: null };
+  const out = { help: false, dryRun: false, mode: null, milestone: null, domain: null, command: null, maxWorkers: null, stagger: null };
   for (let i = 0; i < argv.length; i++) {
     const a = argv[i];
     if (a === "--help" || a === "-h") out.help = true;
@@ -92,6 +92,10 @@ function parseArgv(argv) {
     else if (a.startsWith("--domain=")) out.domain = a.slice("--domain=".length);
     else if (a === "--command") out.command = argv[++i] || null;
     else if (a.startsWith("--command=")) out.command = a.slice("--command=".length);
+    else if (a === "--max-workers") out.maxWorkers = parseInt(argv[++i], 10);
+    else if (a.startsWith("--max-workers=")) out.maxWorkers = parseInt(a.slice("--max-workers=".length), 10);
+    else if (a === "--stagger") out.stagger = parseInt(argv[++i], 10);
+    else if (a.startsWith("--stagger=")) out.stagger = parseInt(a.slice("--stagger=".length), 10);
   }
   return out;
 }
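Illustrative invocation of the two new flags (values hypothetical; per `runCli` further down, `--stagger` is given in seconds and multiplied by 1000 into `spawnStaggerMs`): `node bin/gsd-t-parallel.cjs --command execute --max-workers 2 --stagger 10` caps fan-out at two workers and leaves a 10 s gap between spawns.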
@@ -325,6 +329,69 @@ function _partitionTaskIds(taskIds, workerCount) {
   return buckets.filter((b) => b.length > 0);
 }
 
+/**
+ * _runCacheWarmProbe — fire a single short `claude -p` before fan-out so the
+ * Anthropic prompt cache (5-min TTL) is pre-populated with the files every
+ * worker will read. When workers spawn within the warm window, their initial
+ * Read(CLAUDE.md), Read(progress.md), Read(contracts/*.md) return cache-read
+ * tokens (free for ITPM budget, lower rate-limit pressure).
+ *
+ * Returns `{ok, filesRead, error}`. Best-effort; failures do not block fan-out.
+ *
+ * The probe reads the existing files (skips missing ones silently) and asks
+ * the child to print the literal string "warm" — cheap, deterministic, fast.
+ * Cache key matches exactly when `model` equals the workers' model and the
+ * workers use the same tool-call shape (Read on the same paths) within TTL.
+ */
+function _runCacheWarmProbe(opts) {
+  const projectDir = (opts && opts.projectDir) || process.cwd();
+  const model = opts && opts.model;
+  const timeoutMs = Number.isFinite(opts && opts.timeoutMs) ? opts.timeoutMs : 60000;
+
+  const candidates = [
+    "CLAUDE.md",
+    ".gsd-t/progress.md",
+    ".gsd-t/contracts/headless-default-contract.md",
+    ".gsd-t/contracts/wave-join-contract.md",
+  ];
+  const filesRead = candidates.filter((rel) => {
+    try {
+      return fs.statSync(path.join(projectDir, rel)).isFile();
+    } catch {
+      return false;
+    }
+  });
+  if (filesRead.length === 0) return { ok: false, filesRead: [], error: "no_warm_files" };
+
+  const { spawnSync } = require("node:child_process");
+  const prompt =
+    "Read the following files so they enter the prompt cache for subsequent workers, " +
+    "then reply with the single word `warm` and nothing else:\n" +
+    filesRead.map((f) => `- ${f}`).join("\n");
+
+  const env = Object.assign({}, process.env);
+  if (model) env.ANTHROPIC_MODEL = model;
+
+  try {
+    const r = spawnSync(
+      "claude",
+      ["-p", prompt, "--dangerously-skip-permissions"],
+      {
+        cwd: projectDir,
+        env,
+        encoding: "utf8",
+        timeout: timeoutMs,
+        stdio: ["ignore", "pipe", "pipe"],
+      },
+    );
+    if (r.error) return { ok: false, filesRead, error: r.error.message };
+    if (r.status !== 0) return { ok: false, filesRead, error: `exit_${r.status}` };
+    return { ok: true, filesRead };
+  } catch (e) {
+    return { ok: false, filesRead, error: (e && e.message) || "spawn_error" };
+  }
+}
+
 /**
  * runDispatch — the single instrument every command delegates to.
  *
@@ -389,8 +456,30 @@ function runDispatch(opts) {
     };
   }
 
-  const workerCount = Number(result.workerCount) || 0;
+  const plannerWorkerCount = Number(result.workerCount) || 0;
   const parallelTasks = Array.isArray(result.parallelTasks) ? result.parallelTasks : [];
+
+  // Concurrency cap (v3.18.19) — caller may clamp the planner-selected worker
+  // count via `opts.maxWorkers`. Motivated by the 2026-04-23 incident: the Max
+  // subscription concurrent-session throttle rate-limits `claude -p` bursts
+  // regardless of model choice (since all spawns inherit the parent's Max
+  // OAuth, not an API key — see feedback_anthropic_key_measurement_only). The
+  // planner has no knowledge of this throttle; callers who know they're near
+  // the ceiling need a direct cap.
+  const cap = Number.isFinite(opts && opts.maxWorkers) && opts.maxWorkers > 0
+    ? Math.floor(opts.maxWorkers)
+    : 2; // conservative default when the caller passes no cap
+  const workerCount = Math.min(plannerWorkerCount, cap);
+  if (workerCount < plannerWorkerCount) {
+    appendEvent(projectDir, {
+      type: "parallelism_reduced",
+      source: "dispatch_max_workers_cap",
+      original_count: plannerWorkerCount,
+      reduced_count: workerCount,
+      reason: `max_workers_cap:${cap}`,
+      ts: new Date().toISOString(),
+    });
+  }
   const subsets = workerCount >= 2 ? _partitionTaskIds(parallelTasks, workerCount) : [];
 
   if (subsets.length < 2) {
@@ -433,8 +522,90 @@ function runDispatch(opts) {
     ts: new Date().toISOString(),
   });
 
+  // Worker model selection (v3.18.18) — mechanical fan-out defaults to Sonnet
+  // so the orchestrator's Opus bucket isn't the bottleneck. Caller may
+  // override via `opts.workerModel` ("opus" | "sonnet" | "haiku" | full ID).
+  // A task can opt back to Opus by declaring "[opus]" in its tasks.md line;
+  // the planner surfaces this via per-task metadata (future; today the per-
+  // subset opt-in is an all-or-nothing knob passed by the caller).
+  const DEFAULT_WORKER_MODEL = "claude-sonnet-4-6";
+  const modelAlias = {
+    opus: "claude-opus-4-7",
+    sonnet: "claude-sonnet-4-6",
+    haiku: "claude-haiku-4-5-20251001",
+  };
+  const callerModel = opts && opts.workerModel;
+  const workerModel = callerModel === false
+    ? null // explicit opt-out: inherit parent's ANTHROPIC_MODEL
+    : (modelAlias[callerModel] || callerModel || DEFAULT_WORKER_MODEL);
+
+  // Stagger between spawns — the 10s default was empirically validated against
+  // the Max-subscription concurrent-session throttle (2026-04-23 M46 probe: two
+  // 10s-staggered 2-parallel rounds of real work, both exit 0, no 429; the prior
+  // 3s default burst at >2 workers hit rate limits). Caller may override via
+  // `opts.spawnStaggerMs` (0 = no delay, previous burst behavior).
+  const staggerMs = Number.isFinite(opts && opts.spawnStaggerMs)
+    ? Math.max(0, opts.spawnStaggerMs)
+    : 10000;
+  const busyWait = (ms) => {
+    if (!ms) return;
+    // Synchronous sleep that releases the CPU (Atomics.wait on a dummy
+    // SharedArrayBuffer — pattern used in Node REPL/sync-sleep helpers).
+    // Keeps runDispatch's sync return contract without pegging a core.
+    // Total wall-clock added to startup: (subsets - 1) * staggerMs.
+    try {
+      const sab = new SharedArrayBuffer(4);
+      const view = new Int32Array(sab);
+      Atomics.wait(view, 0, 0, ms);
+    } catch (_) {
+      // Atomics unavailable — fall back to a coarse spin.
+      const until = Date.now() + ms;
+      while (Date.now() < until) { /* spin */ }
+    }
+  };
+
+  // Cache-warming probe (v3.18.19) — opt-in via GSD_T_CACHE_WARM=1 or
+  // opts.cacheWarm. Anthropic's prompt cache has a 5-minute TTL keyed on the
+  // exact system-prompt + tool-call prefix. One leader probe that reads the
+  // same foundational files every worker will read (CLAUDE.md, progress.md,
+  // top-level contracts) populates the cache so the first N seconds of every
+  // subsequent worker hit cache-read tokens (free for ITPM budget, lower
+  // rate-limit pressure). Probe runs synchronously so workers land inside
+  // the warm window rather than racing it. Gated behind opt-in until
+  // backlog #23 (mitmproxy instrumentation) measures the actual delta.
+  const warmEnv = (opts && opts.env) || process.env;
+  const cacheWarmEnabled =
+    (opts && opts.cacheWarm === true) ||
+    (!(opts && opts.cacheWarm === false) && warmEnv.GSD_T_CACHE_WARM === "1");
+  if (cacheWarmEnabled) {
+    const warmStart = Date.now();
+    let warmResult = { ok: false, error: "not_run" };
+    try {
+      const probeImpl = (opts && opts.cacheWarmProbeImpl) || _runCacheWarmProbe;
+      warmResult = probeImpl({
+        projectDir,
+        model: workerModel, // same model as workers so the cache key matches
+        timeoutMs: (opts && Number.isFinite(opts.cacheWarmTimeoutMs))
+          ? opts.cacheWarmTimeoutMs
+          : 60000,
+      });
+    } catch (e) {
+      warmResult = { ok: false, error: (e && e.message) || "unknown" };
+    }
+    appendEvent(projectDir, {
+      type: "cache_warm_probe",
+      source: "dispatch",
+      ok: !!warmResult.ok,
+      duration_ms: Date.now() - warmStart,
+      error: warmResult.error,
+      files_read: warmResult.filesRead,
+      ts: new Date().toISOString(),
+    });
+  }
+
   const workerResults = [];
   for (let i = 0; i < subsets.length; i++) {
+    if (i > 0) busyWait(staggerMs);
    const subset = subsets[i];
    const workerEnv = {
      GSD_T_WORKER_TASK_IDS: subset.join(","),
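The sync-sleep pattern used by `busyWait`, isolated as a runnable snippet (Node permits `Atomics.wait` on the main thread, unlike browsers):

```js
// Blocks ~250ms without spinning: the wait always times out because nothing
// ever notifies index 0 of the dummy buffer.
const view = new Int32Array(new SharedArrayBuffer(4));
const t0 = Date.now();
Atomics.wait(view, 0, 0, 250);
console.log(`slept ${Date.now() - t0} ms`); // ≈ 250
```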
@@ -450,6 +621,7 @@
       projectDir,
       env: workerEnv,
       spawnType: "primary",
+      workerModel,
     });
   } catch (e) {
     spawnError = (e && e.message) || "unknown";
@@ -549,14 +721,21 @@ function runCli(argv, env) {
   // --command: live dispatch. The single instrument that command files
   // delegate to instead of re-implementing probe-and-branch logic.
   if (args.command) {
-    const dispatch = runDispatch({
+    const dispatchOpts = {
       projectDir: process.cwd(),
       mode,
       milestone: args.milestone,
       domain: args.domain,
       command: args.command,
       env,
-    });
+    };
+    if (Number.isFinite(args.maxWorkers) && args.maxWorkers > 0) {
+      dispatchOpts.maxWorkers = args.maxWorkers;
+    }
+    if (Number.isFinite(args.stagger) && args.stagger >= 0) {
+      dispatchOpts.spawnStaggerMs = args.stagger * 1000; // --stagger is given in seconds
+    }
+    const dispatch = runDispatch(dispatchOpts);
     if (dispatch.decision === "fan_out") {
       process.stdout.write(
         `gsd-t parallel — fan_out command=${args.command} mode=${dispatch.mode} workers=${dispatch.fanOutCount}\n`,
@@ -606,6 +785,7 @@ module.exports = {
   _detectMode: detectMode,
   _appendEvent: appendEvent,
   _partitionTaskIds,
+  _runCacheWarmProbe,
   _HELP_TEXT: HELP_TEXT,
 };
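With `_runCacheWarmProbe` exported, its no-files short-circuit is checkable without spawning `claude` — a sketch (the file name and relative require path are assumptions):

```js
// test/cache-warm-probe.sketch.test.js — an empty project dir has none of the
// candidate warm files, so the probe returns before any spawn.
const test = require("node:test");
const assert = require("node:assert");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const { _runCacheWarmProbe } = require("../bin/gsd-t-parallel.cjs");

test("probe short-circuits when no warm files exist", () => {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "gsd-t-warm-"));
  assert.deepStrictEqual(_runCacheWarmProbe({ projectDir: dir }), {
    ok: false,
    filesRead: [],
    error: "no_warm_files",
  });
});
```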