npm - @tekyzinc/gsd-t - Versions diffs - 2.74.12 → 2.76.10 - Mend

@tekyzinc/gsd-t 2.74.12 → 2.76.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (61) hide show

package/CHANGELOG.md +130 -0
package/README.md +71 -1
package/bin/advisor-integration.js +93 -0
package/bin/check-headless-sessions.js +140 -0
package/bin/context-meter-config.cjs +101 -0
package/bin/context-meter-config.test.cjs +101 -0
package/bin/gsd-t.js +710 -16
package/bin/headless-auto-spawn.js +290 -0
package/bin/model-selector.js +224 -0
package/bin/runway-estimator.js +242 -0
package/bin/token-budget.js +96 -89
package/bin/token-optimizer.js +471 -0
package/bin/token-telemetry.js +246 -0
package/commands/gsd-t-audit.md +3 -3
package/commands/gsd-t-backlog-list.md +38 -0
package/commands/gsd-t-brainstorm.md +3 -3
package/commands/gsd-t-complete-milestone.md +24 -0
package/commands/gsd-t-debug.md +124 -7
package/commands/gsd-t-discuss.md +10 -3
package/commands/gsd-t-doc-ripple.md +32 -4
package/commands/gsd-t-execute.md +107 -52
package/commands/gsd-t-help.md +19 -0
package/commands/gsd-t-integrate.md +67 -4
package/commands/gsd-t-optimization-apply.md +91 -0
package/commands/gsd-t-optimization-reject.md +94 -0
package/commands/gsd-t-partition.md +7 -0
package/commands/gsd-t-pause.md +3 -0
package/commands/gsd-t-plan.md +10 -3
package/commands/gsd-t-prd.md +3 -3
package/commands/gsd-t-quick.md +71 -9
package/commands/gsd-t-reflect.md +3 -7
package/commands/gsd-t-resume.md +36 -0
package/commands/gsd-t-status.md +31 -0
package/commands/gsd-t-test-sync.md +7 -0
package/commands/gsd-t-verify.md +12 -5
package/commands/gsd-t-visualize.md +3 -7
package/commands/gsd-t-wave.md +82 -18
package/docs/GSD-T-README.md +52 -0
package/docs/architecture.md +95 -0
package/docs/infrastructure.md +117 -0
package/docs/methodology.md +36 -0
package/docs/prd-harness-evolution.md +51 -37
package/docs/requirements.md +66 -0
package/package.json +1 -1
package/scripts/context-meter/count-tokens-client.js +221 -0
package/scripts/context-meter/count-tokens-client.test.js +308 -0
package/scripts/context-meter/test-injector.js +55 -0
package/scripts/context-meter/threshold.js +88 -0
package/scripts/context-meter/threshold.test.js +255 -0
package/scripts/context-meter/transcript-parser.js +252 -0
package/scripts/context-meter/transcript-parser.test.js +320 -0
package/scripts/gsd-t-context-meter.e2e.test.js +415 -0
package/scripts/gsd-t-context-meter.js +350 -0
package/scripts/gsd-t-context-meter.test.js +417 -0
package/scripts/gsd-t-heartbeat.js +2 -2
package/scripts/gsd-t-statusline.js +23 -8
package/templates/CLAUDE-global.md +5 -1
package/templates/CLAUDE-project.md +26 -6
package/templates/context-meter-config.json +10 -0
package/templates/prompts/README.md +1 -1
package/bin/task-counter.cjs +0 -161

package/docs/architecture.md CHANGED Viewed

@@ -368,6 +368,101 @@ During partition, UI/frontend projects automatically receive `.gsd-t/contracts/d
 When Playwright MCP is registered in Claude Code settings, QA agents get 3 minutes of interactive exploration and Red Team gets 5 minutes after all scripted tests pass. Findings are tagged `[EXPLORATORY]` in qa-issues.md and red-team-report.md, and tracked separately in QA calibration (category key: `exploratory` — does NOT count against scripted pass/fail ratio). Silent skip when Playwright MCP absent. Wired into: execute, quick, integrate, debug.
+## Context Meter Architecture (M34, v2.75.10+)
+The Context Meter is the authoritative source for session context-burn measurement in GSD-T. It replaces the v2.74.12 `bin/task-counter.cjs` proxy (and the pre-v2.74.12 `CLAUDE_CONTEXT_TOKENS_USED` env-var approach, which never worked because Claude Code does not export those vars).
+**Data flow:**
+```
+Claude Code tool call finishes
+  │
+  ▼
+PostToolUse hook (~/.claude/settings.json registered)
+  │
+  ▼
+scripts/gsd-t-context-meter.js (runMeter)
+  │
+  ├── 1. loadConfig(.gsd-t/context-meter-config.json)
+  ├── 2. check-frequency gate — short-circuits if tool-call % freq != 0
+  ├── 3. parseTranscript(hook.transcript_path)
+  │         → { system, messages } shaped for count_tokens
+  ├── 4. countTokens({apiKey, model, system, messages, timeoutMs:200})
+  │         → POST https://api.anthropic.com/v1/messages/count_tokens
+  │         → 200 { input_tokens }  |  failure → null
+  ├── 5. computePct(inputTokens, modelWindowSize)
+  ├── 6. bandFor(pct) → "normal" | "warn" | "stop"   (v3.0.0 three-band model)
+  └── 7. atomic write .gsd-t/.context-meter-state.json
+           { version, timestamp, inputTokens, modelWindowSize, pct, threshold, checkCount, lastError? }
+  │
+  ▼
+bin/token-budget.js getSessionStatus(projectDir)      ── v3.0.0: normal/warn/stop only
+  │
+  ├── readContextMeterState(dir)
+  │      if fresh (timestamp within 5 min):
+  │        return { consumed, estimated_remaining, pct, threshold }
+  │      else: null
+  │
+  └── fallback: readSessionConsumed(dir) from .gsd-t/token-log.md (heuristic)
+  │
+  ▼
+bin/runway-estimator.js estimateRunway({command, domain_type, remaining_tasks})
+  │        reads current_pct from .context-meter-state.json
+  │        queries .gsd-t/token-metrics.jsonl for historical pct-delta per spawn
+  │        projects current_pct + pct_per_task × remaining_tasks × skew
+  │        confidence: high ≥50 records, medium ≥10, low <10 (+1.25× skew)
+  │        returns {can_start, projected_end_pct, confidence, recommendation}
+  ▼
+Command file Step 0 — runway gate (execute/wave/quick/integrate/debug):
+  if (!decision.can_start) {
+    print ⛔ banner
+    autoSpawnHeadless({command, continue_from: '.'})    ── bin/headless-auto-spawn.js
+    process.exit(0)                                      ── never prompts user
+  } else {
+    proceed to Step 0.1 (Verify Context Gate Readiness) and Step 1
+  }
+  │
+  ▼
+bin/headless-auto-spawn.js (when refused)
+  │        detached child: node bin/gsd-t.js headless {command} --log
+  │        child.unref(); interactive session returns immediately
+  │        writes .gsd-t/headless-sessions/{id}.json (status: "running")
+  │        2s poll watcher: process.kill(pid, 0) → mac osascript notification on exit
+  │
+  ▼
+Orchestrator Context Gate — v3.0.0 semantics:
+  normal → proceed
+  warn   → log to .gsd-t/token-log.md, proceed at full quality (informational only)
+  stop   → halt cleanly, runway estimator hands off to headless-auto-spawn
+```
+**Key constraints:**
+- **Fail-open**: every stage catches errors and writes a partial state file. Never crashes Claude Code.
+- **No message content in state or log files** — only token counts, band names, error codes.
+- **Never logs or writes the API key** anywhere.
+- **State staleness window**: 5 minutes — after that, heuristic fallback takes over.
+- **Hook latency budget**: 200ms (timeoutMs on the HTTP call), enforced by `req.setTimeout` + `req.destroy()`.
+**Contracts:**
+- `.gsd-t/contracts/context-meter-contract.md` — schema, state file format, hook I/O
+- `.gsd-t/contracts/context-observability-contract.md` v2.0.0 — Ctx% as the real session-wide signal (replaces Tasks-Since-Reset)
+- `.gsd-t/contracts/token-budget-contract.md` v3.0.0 — three-band stop-at-85 (M35 clean break from v2.0.0)
+- `.gsd-t/contracts/token-telemetry-contract.md` v1.0.0 — per-spawn 18-field JSONL at `.gsd-t/token-metrics.jsonl`
+- `.gsd-t/contracts/runway-estimator-contract.md` v1.0.0 — pre-flight projection, confidence grading, refusal/headless handoff
+- `.gsd-t/contracts/headless-auto-spawn-contract.md` v1.0.0 — detached continuation, session schema, macOS notification channel
+- `.gsd-t/contracts/model-selection-contract.md` v1.0.0 — per-phase tier mapping + complexity-signal escalation, consumed by `bin/model-selector.js`
+**M35 supporting components** (outside the context-meter dataflow):
+- `bin/model-selector.js` — declarative rules table mapping phases to haiku/sonnet/opus; consulted at plan time, never at runtime under pressure
+- `bin/token-optimizer.js` — at `complete-milestone`, scans the last 3 milestones of `.gsd-t/token-metrics.jsonl` and appends recalibration recommendations to `.gsd-t/optimization-backlog.md` (never auto-applied; user promotes via `/user:gsd-t-optimization-apply` or rejects via `/user:gsd-t-optimization-reject` with 5-milestone cooldown)
+- `bin/check-headless-sessions.js` — renders the read-back banner on `/user:gsd-t-resume` and `/user:gsd-t-status` for completed-but-not-yet-surfaced headless sessions
+**Installer integration** (`bin/gsd-t.js`):
+- `install` / `init` — copy hook runtime, merge PostToolUse entry into `~/.claude/settings.json`, copy config template, prompt for API key (skippable, TTY-only)
+- `doctor` — RED on missing API key, missing hook, missing script, invalid config, failed count_tokens dry-run
+- `status` — displays `Context: {pct}% of {window} tokens ({band}) — last check {rel}` line
+- `update-all` — one-shot task-counter retirement migration (deletes legacy files, writes `.gsd-t/.task-counter-retired-v1` marker)
 ## Planned Architecture Changes (M23-M24)
 **M23: Headless Mode**

package/docs/infrastructure.md CHANGED Viewed

@@ -195,6 +195,123 @@ gsd-t-verify:
     ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
 ```
+## Context Meter Setup (M34, v2.75.10+)
+The Context Meter is a PostToolUse hook that measures real context consumption via the Anthropic `count_tokens` API. It is **required** for the `gsd-t-execute`, `gsd-t-wave`, `gsd-t-quick`, `gsd-t-integrate`, and `gsd-t-debug` session-stop gates to work.
+**API key — required**
+```bash
+# Shell profile (~/.zshrc, ~/.bashrc, ~/.config/fish/config.fish)
+export ANTHROPIC_API_KEY="sk-ant-..."
+# Verify
+echo $ANTHROPIC_API_KEY | head -c 10
+```
+Get a key at [https://console.anthropic.com](https://console.anthropic.com). Free tier is sufficient — `count_tokens` is billed per call at negligible cost.
+**CI/CD**: set `ANTHROPIC_API_KEY` as a secret (`secrets.ANTHROPIC_API_KEY` in GitHub Actions, masked variables in GitLab CI). The existing verify workflow example above already threads the secret through.
+**Per-project `.env.local` (optional)**: drop `ANTHROPIC_API_KEY=sk-ant-...` into the project's ignored env file and source it in your shell profile if you need per-project keys.
+**Verify with doctor**:
+```bash
+npx @tekyzinc/gsd-t doctor
+```
+Expected GREEN output:
+```
+Context Meter
+  ✅ API key set (ANTHROPIC_API_KEY)
+  ✅ PostToolUse hook registered in ~/.claude/settings.json
+  ✅ scripts/gsd-t-context-meter.js exists
+  ✅ .gsd-t/context-meter-config.json loads cleanly
+  ✅ count_tokens dry-run: 7 tokens
+```
+If any check is RED, doctor exits with code 1.
+**Config file** — `.gsd-t/context-meter-config.json`:
+```json
+{
+  "enabled": true,
+  "apiKeyEnvVar": "ANTHROPIC_API_KEY",
+  "modelWindowSize": 200000,
+  "thresholdPct": 85,
+  "checkFrequency": 1
+}
+```
+**Threshold bands** (M35 v3.0.0 — three bands, lower-bound inclusive):
+| Band   | Range    | Orchestrator action                                                              |
+|--------|----------|----------------------------------------------------------------------------------|
+| normal | 0–69%    | Proceed                                                                          |
+| warn   | 70–84%   | Log warning; cue for explicit pause/resume at the next clean boundary            |
+| stop   | ≥85%     | Halt cleanly with resume instruction; command refuses to start if runway crosses |
+**Zero silent quality degradation.** There is no `downgrade` band and no `conserve` band. Models are **never** swapped at runtime under context pressure — model choice is a plan-time decision made by `bin/model-selector.js`, and quality-critical phases (Red Team, doc-ripple, Design Verify) always run at their designated tier. See `.gsd-t/contracts/token-budget-contract.md` v3.0.0.
+**Structural guarantee**: because the runway estimator refuses runs that project past 85% and the stop band fires at 85%, the runtime's 95% native compact is structurally unreachable under healthy operation. `halt_type: native-compact` in `.gsd-t/token-metrics.jsonl` is a defect signal.
+**Upgrading from pre-M34**: `gsd-t update-all` runs a one-time task-counter retirement migration in every registered project (deletes `bin/task-counter.cjs`, `.gsd-t/task-counter-config.json`, `.gsd-t/.task-counter-state.json`, and the `.gsd-t/.task-counter` file; writes `.gsd-t/.task-counter-retired-v1` marker). After upgrade you **must** set `ANTHROPIC_API_KEY` — doctor will fail otherwise.
+## Runway-Protected Execution (M35)
+M35 adds four components on top of the Context Meter. Together they replace graduated degradation with a pre-flight gate + pause/resume model.
+### Per-phase model selection — `bin/model-selector.js`
+Declarative rules table mapping each phase (`plan`, `execute`, `red-team`, `doc-ripple`, `design-verify`, `qa`, `integrate`, ...) to a default tier (`haiku`|`sonnet`|`opus`). Complexity signals promoted from the task plan (`cross_module_refactor`, `security_boundary`, `data_loss_risk`, `contract_design`) escalate sonnet→opus. Each command file documents its assignments in a `## Model Assignment` block.
+Contract: `.gsd-t/contracts/model-selection-contract.md` v1.0.0
+### Pre-flight runway estimator — `bin/runway-estimator.js`
+Reads historical per-spawn consumption from `.gsd-t/token-metrics.jsonl` via a three-tier query fallback (exact match on `{command, phase, domain}` → command+phase → command) and produces a confidence-weighted projection of end-of-run `pct`. If the projection would cross `STOP_THRESHOLD_PCT = 85`, the command refuses to start — the interactive session exits cleanly and an autonomous headless continuation is auto-spawned. The user never types `/clear`.
+### Headless auto-spawn — `bin/headless-auto-spawn.js`
+Detached child-process spawn (`child_process.spawn` with `detached:true`, `stdio:['ignore', fd, fd]`, `child.unref()`). Writes `.gsd-t/headless-sessions/{session-id}.json` with session metadata, polls with `process.kill(pid, 0)` liveness probe (timer `.unref()`-ed), marks `status: completed`, and posts a macOS `osascript` notification when done (graceful no-op on non-darwin). `bin/check-headless-sessions.js` renders the read-back banner on the next `gsd-t-resume` / `gsd-t-status`.
+Directory: `.gsd-t/headless-sessions/` — one JSON per session, plus optional `{id}-context.json` and log files.
+### Per-spawn token telemetry — `.gsd-t/token-metrics.jsonl`
+Frozen 18-field JSONL schema, one record per subagent spawn. Written by the orchestrator, consumed by the runway estimator and the token optimizer.
+Contract: `.gsd-t/contracts/token-telemetry-contract.md` v1.0.0
+Key fields: `timestamp`, `session_id`, `command`, `phase`, `domain`, `task_id`, `model`, `complexity_signals[]`, `input_tokens`, `output_tokens`, `duration_seconds`, `start_pct`, `end_pct`, `halt_type`, `halt_reason`, `exit_code`, `run_type` (`interactive`|`headless`), `projection_variance`.
+`halt_type` values: `clean`, `stop-band`, `runway-refuse`, `native-compact` (defect), `crash`.
+### Optimization backlog — `.gsd-t/optimization-backlog.md`
+Append-only markdown file of recalibration recommendations. `bin/token-optimizer.js` scans the last 3 milestones of telemetry at `complete-milestone` time and appends detected recommendations. Recommendations are **never** auto-applied — the user promotes via `/user:gsd-t-optimization-apply {ID}` or rejects via `/user:gsd-t-optimization-reject {ID} [--reason "..."]`. Rejected items are fingerprinted and cooled down for 5 milestones before re-surfacing.
+Detection rules: `demote` (opus phase with ≥90% success, ≥3 volume), `escalate` (sonnet phase with ≥30% failure, ≥5 volume), `runway-tune` (projection vs. actual divergence >15%), `investigate` (per-phase p95 > 2× median, ≥10 volume).
+## Metrics CLI — `gsd-t metrics`
+Read-only surface onto the token telemetry stream. Backward-compatible with pre-M35 `metrics` output; new flags surface M35 data.
+| Flag                  | Output                                                                                 |
+|-----------------------|----------------------------------------------------------------------------------------|
+| (no flag)             | Task telemetry, process ELO, domain health (pre-M35 behavior)                          |
+| `--tokens`            | Per-command / per-phase token usage summary from `.gsd-t/token-metrics.jsonl`          |
+| `--halts`             | Count + breakdown of `halt_type` values — flags any `native-compact` as a defect       |
+| `--context-window`    | Trailing 20-run window of `end_pct` with runway headroom                               |
+| `--cross-project`     | Cross-project ranking (pre-M35, unchanged)                                             |
+## `/advisor` escalation convention
+When a sonnet-default phase hits a complexity signal that warrants opus (e.g., cross-module refactor detected mid-execution), the command may emit an `/advisor` hook line in its output — a structured suggestion for the orchestrator to escalate the **next** spawn of that phase to opus. This is a plan-time signal, not a runtime swap: the current spawn completes at its assigned tier, and the escalation applies to subsequent work. See `.gsd-t/contracts/model-selection-contract.md` for the hook schema.
 ## Security Notes
 - Zero npm dependencies — no supply chain risk

package/docs/methodology.md CHANGED Viewed

@@ -83,3 +83,39 @@ Teams burn tokens fast. Use them strategically:
 - Planning (always — need full cross-domain context)
 - Integration (always — need to see all seams)
 - The task is straightforward
+## Context Awareness: From Proxy to Real Measurement (M34)
+GSD-T has always needed a reliable signal for "how much of the context window is consumed right now" so the orchestrator can decide whether to continue, pause, or hand off to a headless continuation. The journey to real measurement is instructive:
+1. **v1.0 era — env var check.** Early GSD-T read `CLAUDE_CONTEXT_TOKENS_USED` / `CLAUDE_CONTEXT_TOKENS_MAX` environment variables, assuming Claude Code exported them. It does not. The check was always inert — `pct` was effectively zero forever, and the stop gate never fired. The first symptom was not a crash, it was silent context exhaustion leading to mid-session compaction.
+2. **v2.74.12 — task-counter proxy.** To patch the regression, `bin/task-counter.cjs` tracked the number of tasks completed since the last `/clear` and assumed a linear correspondence between task count and context percentage (e.g., 5 tasks ≈ 80%). This was better than nothing but fundamentally a proxy — it could not distinguish a task that read three files from a task that ran a full-project grep and a Playwright suite.
+3. **v2.75.10 (M34) — real measurement.** The Context Meter PostToolUse hook streams the current transcript to the Anthropic `count_tokens` API after every tool call and writes the exact `input_tokens` count to `.gsd-t/.context-meter-state.json`. `bin/token-budget.js` `getSessionStatus()` reads that state file as the authoritative signal. Proxies are retired.
+**Why this matters**: Opus-primary sessions compound context risk (larger system prompts, deeper reasoning, longer tool outputs). A proxy with ±20% error is fine for an undercommitted Sonnet session but causes silent compaction on a busy Opus session. Real measurement is the only durable fix.
+**Fail-open principle**: the meter hook never blocks tool calls or crashes Claude Code. Every failure mode (missing API key, network error, malformed transcript, rate limit) catches and writes a partial state file with `lastError.code` set. The orchestrator treats a missing or stale state file as "fall back to heuristic" rather than "stop immediately" — the user never loses work to a meter hiccup.
+## From Silent Degradation to Aggressive Pause-Resume (M35)
+Between v2.74 and v2.75, GSD-T attempted to cope with context pressure through **graduated degradation** — downgrading subagent models (opus→sonnet, sonnet→haiku), checkpointing early, skipping "non-essential" phases (Red Team, doc-ripple, Design Verify). The idea was well-intentioned: burn less context, finish more work before hitting the runtime's native compact at 95%.
+**It was the wrong framing.** Degradation is invisible to the user: a task that silently dropped from opus to haiku still reports "completed," and a skipped Red Team pass still looks like a green wave. Several regressions showed up where bugs made it through QA because Red Team had been silently skipped under context pressure, and where cross-module refactors made on haiku introduced subtle type errors that opus would have caught. The quality floor had become conditional on context pressure — a load-bearing invariant that the user could not see or control.
+**M35 (v2.76.10) replaces graduated degradation with aggressive pause-resume.** The core principles:
+1. **Quality is non-negotiable.** No phase is "non-essential." No model is downgraded under pressure. Red Team, doc-ripple, and Design Verify always run at their designated tier. If a task can't fit, the task pauses — it does not degrade.
+2. **Explicit per-phase model selection.** `bin/model-selector.js` carries a declarative rules table with ≥13 phase mappings. Complexity signals (`cross_module_refactor`, `security_boundary`, `data_loss_risk`, `contract_design`) escalate sonnet→opus at plan time. Each command file carries a `## Model Assignment` block documenting its assignments. Model choice is now a plan-time decision, not a runtime pressure response.
+3. **User never types `/clear`.** When the runway estimator (`bin/runway-estimator.js`) projects a run would cross 85% context, the command refuses to start — then auto-spawns a detached headless continuation via `bin/headless-auto-spawn.js`. The interactive session sees a single ⛔ banner and exits cleanly. The user gets a macOS notification when the headless run finishes and a read-back banner on the next `gsd-t-resume` or `gsd-t-status` call.
+4. **Data before optimization.** Per-spawn token telemetry (`.gsd-t/token-metrics.jsonl`, 18-field frozen schema) is the raw material. The runway estimator reads historical consumption to project future runs. The token optimizer (`bin/token-optimizer.js`) runs at `complete-milestone` and appends retrospective recalibration recommendations to `.gsd-t/optimization-backlog.md`. Recommendations are **never auto-applied** — the user promotes or rejects deliberately. Tier calibration is a data-driven human decision, not a runtime heuristic.
+5. **Clean break, no compat shim.** The v2.0.0 `token-budget-contract.md` defined `downgrade` and `conserve` bands with `modelOverrides` and `skipPhases` fields. v3.0.0 drops all of it. The contract is a clean three-band model (`normal` < 70%, `warn` 70–85%, `stop` ≥ 85%) and the response object is just `{band, pct, message}`. No backwards-compat translation layer — the old API is gone.
+**Structural guarantee.** Because `STOP_THRESHOLD_PCT = 85` and the runway estimator refuses runs that would project past 85%, the runtime's 95% native compact is now structurally unreachable under healthy operation. `halt_type: native-compact` in `.gsd-t/token-metrics.jsonl` is a defect signal — if it appears, the estimator needs re-tuning.
+**Message content is never logged**: the meter writes only token counts, band names, and error category codes. Never transcript text, never API response bodies, never the API key itself. See `docs/architecture.md` for the full data-flow diagram and `.gsd-t/contracts/context-meter-contract.md` for the schema.

package/docs/prd-harness-evolution.md CHANGED Viewed

@@ -299,49 +299,63 @@ Evolve GSD-T from a static methodology framework into a self-calibrating quality
 ---
-### 3.7 Token-Aware Orchestration (MEDIUM-HIGH PRIORITY — Tier 1)
+### 3.7 Context Gate + Surgical Model Escalation (REWRITTEN IN M35 — v2.76.10)
-**Problem**: GSD-T runs on Claude's $200 Max plan, where tokens are a hard daily/weekly ceiling — not a variable API expense. A typical milestone spawns 30-50+ subagents across all phases. With tiered models, this consumes roughly 50-80% of a daily budget. Without budget awareness, the orchestrator can exhaust tokens mid-milestone, leaving uncommitted work scattered across subagents and forcing a wait until limits reset.
+**Status**: The original v2 design for this section described graduated degradation with `downgrade` and `conserve` bands that silently demoted models and skipped Red Team / doc-ripple phases under context pressure. **M35 removed that behavior entirely** (see `token-budget-contract.md` v3.0.0 and `.gsd-t/M35-definition.md`). The historical text is preserved in git history; this section documents the replacement.
-The article's harness doesn't address this because it operates on API billing where cost is variable. On a Max plan, token exhaustion is a binary failure mode — you either have capacity or you don't.
+**Problem**: Silent quality degradation is the wrong answer to context pressure. If the orchestrator is allowed to swap opus for sonnet, sonnet for haiku, or skip Red Team / doc-ripple / Design Verification when context is tight, the user silently receives lower-quality work than they asked for. GSD-T's core principle is excellent, deeply-tested results — not "best effort under pressure."
-**Solution**: Make the wave and execute orchestrators aware of aggregate session-level token consumption, with graceful degradation as limits approach.
+**Solution**: Three strict components that replace the old graduated-degradation model.
 **Mechanism**:
-1. **Session budget tracking** — The orchestrator tracks cumulative tokens consumed across all subagent spawns within a session. Uses the existing observability logging data (token-log.md) plus `CLAUDE_CONTEXT_TOKENS_USED` environment variable.
-2. **Budget estimation before spawn** — Before spawning a subagent, estimate the token cost based on: model tier (Opus ~5x Sonnet, Sonnet ~5x Haiku), task complexity (from plan-time scoring if available), and historical average from token-log.md for similar tasks.
-3. **Graduated degradation thresholds**:
-| Session Budget Consumed | Action |
-|------------------------|--------|
-| < 60% | Normal operation — all models at assigned tiers |
-| 60-70% | **WARN**: Display budget alert to user. Reduce iteration budgets to minimum (2). |
-| 70-85% | **DOWNGRADE**: Non-critical Sonnet tasks demoted to Haiku. Skip exploratory testing (3.5). Disable shadow-mode audit (3.1). |
-| 85-95% | **CONSERVE**: Pause non-essential phases (doc-ripple, design brief generation). Checkpoint all progress to disk. |
-| > 95% | **STOP**: Hard stop. Save all progress. Display: "Token budget nearly exhausted. Progress saved. Resume with `/gsd-t-resume` after limit resets." |
-4. **Model-tier-aware budgeting** — The budget tracker understands that one Opus call ≈ 5 Sonnet calls ≈ 25 Haiku calls in token terms. Degradation actions (downgrading Sonnet → Haiku) are chosen to maximize remaining capacity for high-value tasks.
-5. **Milestone pre-flight check** — Before starting a wave or execute run, estimate total token cost for the remaining work. If estimated cost exceeds available budget, warn the user: "This milestone has ~{N} tasks remaining, estimated at ~{X}% of daily budget. Proceed or split across sessions?"
-6. **Integration with iteration budget (3.6)** — When budget is constrained (>60%), iteration budgets are automatically reduced. At >70%, the system prefers model escalation (Haiku → Sonnet) over additional iterations at the same tier, since one Sonnet attempt is more likely to converge than three Haiku attempts.
-**Files affected**:
-- MODIFY: `commands/gsd-t-execute.md` — pre-spawn budget check, degradation logic
-- MODIFY: `commands/gsd-t-wave.md` — milestone pre-flight estimate, per-phase budget check
-- MODIFY: `commands/gsd-t-quick.md` — budget-aware model selection
-- MODIFY: `templates/CLAUDE-global.md` — document token-aware orchestration
-- MODIFY: `templates/CLAUDE-project.md` — optional `Daily Token Budget` field
-- NEW: `bin/token-budget.js` — budget estimation, tracking, and threshold logic (Node.js built-ins only)
+1. **Three-band context gate** (`bin/token-budget.js` v3.0.0, `token-budget-contract.md` v3.0.0):
+| Session Context Consumed | Band | Action |
+|--------------------------|------|--------|
+| < 70% | `normal` | Proceed at full quality. |
+| 70–85% | `warn` | Log to `.gsd-t/token-log.md` and proceed at **full quality**. **Never** downgrade models, skip Red Team, skip doc-ripple, or skip Design Verification. |
+| ≥ 85% | `stop` | Halt cleanly. Checkpoint progress. Hand off to runway estimator / headless auto-spawn (see below). |
+   The bands `downgrade` and `conserve` that existed in v2.x are deleted. `applyModelOverride`, `skipPhases`, and all related machinery are deleted.
+2. **Surgical per-phase model selection** (`bin/model-selector.js`, `model-selection-contract.md` v1.0.0 — m35-model-selector-advisor):
+   - Declarative phase→tier mapping: `haiku` for strictly mechanical work (test runners, file-existence checks, JSON validation, branch guards), `sonnet` for routine code work (execute step 2, test-sync, doc-ripple wiring), `opus` for high-stakes reasoning (partition, discuss, Red Team, verify judgment, debug root-cause, architecture/contract design).
+   - Sonnet is the **routine default** — opus is applied surgically at declared escalation points, not as a fallback for "important-looking" work.
+   - Dual-layer: `ANTHROPIC_MODEL=opus` for the interactive session, `model:` directive overrides per-spawn.
+   - Optional `/advisor` escalation hook at declared points; convention-based fallback if the native tool is not programmable.
+3. **Runway estimator + headless auto-spawn** (`bin/runway-estimator.js`, `bin/headless-auto-spawn.js` — m35-runway-estimator, m35-headless-auto-spawn):
+   - Pre-flight: before spawning a phase or task subagent, estimate the token cost of the remaining work from `.gsd-t/token-metrics.jsonl` history. If the projected runway crosses the `stop` threshold, **refuse the run and auto-spawn a headless session** to continue the work.
+   - The user never types `/clear` under normal operation. The interactive session stays idle; the headless run consumes the fresh context window and reports results back via `.gsd-t/M35-headless-results-*.md`.
+   - The only time a user sees a `/clear` prompt is when headless auto-spawn itself fails — an explicit degradation path, not a silent one.
+4. **Per-spawn telemetry** (`bin/token-telemetry.js`, `token-telemetry-contract.md` v1.0.0 — m35-token-telemetry):
+   - Every Task subagent spawn is wrapped in a token bracket that records `{timestamp, milestone, command, phase, step, domain, task, model, duration_s, input_tokens_before, input_tokens_after, tokens_consumed, context_window_pct_before, context_window_pct_after, outcome, halt_type, escalated_via_advisor}` to `.gsd-t/token-metrics.jsonl`.
+   - `gsd-t metrics --tokens [--by model,command,phase,milestone]`, `gsd-t metrics --halts`, `gsd-t metrics --context-window` surface the history.
+   - Telemetry feeds the runway estimator (historical cost-per-phase) and the optimization backlog (`bin/token-optimizer.js` → `.gsd-t/optimization-backlog.md`, detect-only — user selectively promotes via `/user:gsd-t-optimization-apply|reject`).
+**Files affected (M35 active set)**:
+- MODIFY: `bin/token-budget.js` — three-band `getDegradationActions`, `WARN_THRESHOLD_PCT = 70`, `STOP_THRESHOLD_PCT = 85`
+- MODIFY: `.gsd-t/contracts/token-budget-contract.md` — v3.0.0 rewrite (Option X clean break, no compat shim)
+- NEW: `bin/model-selector.js`, `bin/advisor-integration.js`, `bin/runway-estimator.js`, `bin/headless-auto-spawn.js`, `bin/token-telemetry.js`, `bin/token-optimizer.js`
+- NEW: `.gsd-t/contracts/model-selection-contract.md`, `runway-estimator-contract.md`, `token-telemetry-contract.md`, `headless-auto-spawn-contract.md` (all v1.0.0)
+- MODIFY: `commands/gsd-t-execute.md`, `gsd-t-wave.md`, `gsd-t-quick.md`, `gsd-t-integrate.md`, `gsd-t-debug.md`, `gsd-t-doc-ripple.md` — three-band handlers, Model Assignment blocks, per-spawn token brackets, runway estimator wires
+- MODIFY: `templates/CLAUDE-global.md`, `templates/CLAUDE-project.md` — rewrite Token-Aware Orchestration section
 **Success criteria**:
-- [ ] Orchestrator estimates token cost before each subagent spawn
-- [ ] Cumulative session usage is tracked and displayed at each phase boundary
-- [ ] Degradation actions trigger at 60%, 70%, 85%, and 95% thresholds
-- [ ] Non-critical Sonnet tasks are demoted to Haiku when budget is constrained
-- [ ] Milestone pre-flight check warns when estimated cost exceeds available budget
-- [ ] Progress is always saved before a hard stop — no lost work
-- [ ] Default behavior (no budget concern) is unchanged — thresholds only fire when budget tracking detects pressure
+- [ ] Zero `downgrade`, `conserve`, `modelOverride`, `skipPhases` references in live code under `bin/`, `scripts/`, `commands/`, `templates/` (prose discussing the removal is acceptable)
+- [ ] `bin/model-selector.js` contains ≥ 8 declarative phase mappings
+- [ ] Runway estimator wired into ≥ 5 command files
+- [ ] Per-spawn token telemetry captures the full 18-field schema
+- [ ] Headless auto-spawn tested end-to-end
+- [ ] ~1030/1030 tests green (954 baseline at M35 Wave 1 exit)
+- [ ] M35 dogfooded itself from Wave 3 onward
+- [ ] `v2.76.10` tagged
+**Non-goals**: cost modeling (M35 tracks tokens, not dollars), auto-optimization (backlog is detect-only), model marketplace (Claude models only), runtime Claude Code patches (convention-based `/advisor` fallback if no programmable API).
-**Acceptance test**: Start a wave with a 4-domain milestone. After 3 domains complete (simulating ~70% budget consumed), verify the orchestrator displays a budget warning, reduces iteration budgets, and demotes non-critical tasks to Haiku. Verify all progress is saved and the user sees a clear "resume" instruction.
+**Acceptance test**: Start a multi-wave milestone. Drive session context above 70% mid-wave. Verify: (a) the orchestrator logs a `warn` entry and proceeds at full quality with no model swaps and no phase skips; (b) when the runway estimator projects a `stop` crossing before the next phase completes, it halts cleanly and auto-spawns a headless session to continue; (c) the user never sees a `/clear` prompt unless the headless handoff itself fails; (d) `.gsd-t/token-metrics.jsonl` contains one record per spawn with the full schema.
 ---
@@ -498,7 +512,7 @@ All new code (`bin/qa-calibrator.js`, component registry logic) uses Node.js bui
 | Playwright MCP not widely available            | Medium     | Low    | Graceful degradation — feature skips if MCP absent.              |
 | Higher iteration budgets waste context tokens  | Low        | Medium | Budget governance (M26) caps aggregate rework. Telemetry tracks. |
 | Command file sizes grow beyond readability     | Medium     | High   | Each injection is max 5-10 lines. Total overhead auditable via 3.1. |
-| **Token exhaustion mid-milestone**             | **High**   | **High** | **Token-aware orchestration (3.7) with graduated degradation. Progress always checkpointed before hard stop.** |
+| **Token exhaustion mid-milestone**             | **High**   | **High** | **Runway-protected execution (3.7) with pre-flight estimator + headless auto-spawn. No graduated degradation; quality never lowered under pressure. Clean stop at 85% with automatic headless continuation.** |
 | **QA promotion to Sonnet exceeds token budget** | Low       | Medium | QA calls are small relative to execute. Net impact ~10-15% more tokens. Token orchestrator manages ceiling. |
 | **Budget estimation inaccuracy**               | Medium     | Medium | Estimates improve over time using historical data from token-log.md. Conservative defaults (overestimate). |
@@ -566,10 +580,10 @@ There are two distinct token budget concerns:
 - QA model promotion (Haiku → Sonnet): +10-15% tokens per milestone
 - Red Team model promotion (Sonnet → Opus): +3-5% tokens per milestone (only 1 call)
 - Exploratory testing: +5-10% tokens per milestone (when MCP available)
-- Higher iteration budgets: variable, capped by token-aware orchestrator (3.7)
+- Higher iteration budgets: variable, bounded by the M35 runway estimator (3.7) which refuses runs projected to cross 85%
 - Harness audit: opt-in only, not counted in normal milestone budgets
-**Maximum additional session cost**: ~25-30% more tokens per milestone vs. current. The token-aware orchestrator (3.7) ensures this stays within daily limits through graduated degradation.
+**Maximum additional session cost**: ~25-30% more tokens per milestone vs. current. The M35 runway estimator + headless auto-spawn (3.7) ensures this stays within daily limits through pre-flight refusal + fresh-context headless continuation — **not** through graduated degradation. When a run would exceed the budget, the interactive session hands off cleanly to a headless continuation with a fresh context window; quality is never reduced.
 ### Command file discipline

package/docs/requirements.md CHANGED Viewed

@@ -210,6 +210,72 @@
 **Orphaned requirements**: None — all M32 REQs mapped to tasks.
 **Unanchored tasks**: None — all 3 domain tasks map directly to functional requirements.
+## Requirements Traceability (updated by plan phase — M34)
+| REQ-ID  | Requirement Summary                                                                     | Domain                      | Task(s)  | Status   |
+|---------|------------------------------------------------------------------------------------------|-----------------------------|----------|----------|
+| REQ-063 | Context Meter PostToolUse hook — count_tokens API call, state file, fail-open            | context-meter-hook          | Tasks 1–5 | complete |
+| REQ-064 | Context Meter config schema — apiKeyEnvVar, modelWindowSize, thresholdPct, checkFrequency | context-meter-config        | Tasks 1–4 | complete |
+| REQ-065 | Installer integration — install/init hook, doctor gate, status line, update-all migration | installer-integration       | Tasks 1–6 | complete |
+| REQ-066 | bin/token-budget.js v2.0.0 — real-source `getSessionStatus()` reading the meter state file | token-budget-replacement    | Tasks 1–10 | complete |
+| REQ-067 | Command file migration — execute/wave/quick/integrate/debug use CTX_PCT, no task-counter  | token-budget-replacement    | Tasks 6–10 | complete |
+| REQ-068 | Docs + tests — README/GSD-T-README/templates/docs/CHANGELOG updated, integration tests added | m34-docs-and-tests          | Tasks 1–9 | in_progress |
+**M34 Functional Requirements:**
+- **REQ-063**: The PostToolUse hook must measure the real transcript token count after every tool call (subject to `checkFrequency`), write the result atomically to `.gsd-t/.context-meter-state.json`, and never crash Claude Code on error.
+- **REQ-064**: The config loader must validate `apiKeyEnvVar` is a string, `modelWindowSize` > 0, `thresholdPct` in (0, 100), `checkFrequency` ≥ 1. Missing config = use defaults.
+- **REQ-065**: `gsd-t doctor` must hard-gate on: API key set, hook registered, script present, config valid, live `count_tokens` dry-run succeeds. Exit code 1 if any RED.
+- **REQ-066**: `bin/token-budget.js` `getSessionStatus()` must read `.gsd-t/.context-meter-state.json` when fresh (within 5 minutes of timestamp) and fall back to a historical heuristic otherwise. Public API shape unchanged from v1.x — callers see no breakage.
+- **REQ-067**: No command file may reference `task-counter.cjs` or `CLAUDE_CONTEXT_TOKENS_*` env vars. All session-stop gates must call `token-budget.getSessionStatus()`.
+- **REQ-068**: All downstream docs (README.md, docs/GSD-T-README.md, templates/CLAUDE-*, docs/*.md, CHANGELOG.md, package.json) must describe M34 by the time the milestone is marked complete.
+**M34 Non-Functional Requirements:**
+- Hook latency ≤ 200ms P99 (enforced by `req.setTimeout` + `req.destroy()` in the HTTPS client)
+- Zero external npm dependencies (same as the rest of GSD-T)
+- Zero message content in state files, log files, or diagnostics — only token counts, band names, error category codes
+- Zero API-key material written to disk — env var read only, never persisted
+**Orphaned requirements**: None — all M34 REQs mapped to tasks.
+**Unanchored tasks**: None — all 34 M34 tasks map directly to functional or non-functional requirements.
+---
+## Requirements Traceability (updated by plan phase — M35)
+| REQ-ID  | Requirement Summary                                                                     | Domain                      | Task(s)   | Status   |
+|---------|-----------------------------------------------------------------------------------------|-----------------------------|-----------|----------|
+| REQ-069 | Silent degradation bands removed — `getDegradationActions()` returns only `{band: 'normal'\|'warn'\|'stop'}` | degradation-rip-out | T1 | complete (Wave 1) |
+| REQ-070 | Three-band model only — `WARN_THRESHOLD_PCT=70`, `STOP_THRESHOLD_PCT=85`, no model overrides or phase skips | degradation-rip-out | T1, T2 | complete (Waves 1–2) |
+| REQ-071 | Surgical per-phase model selection via `bin/model-selector.js` — ≥8 phase mappings, declarative rules table | model-selector-advisor | T2 | complete (Wave 2) |
+| REQ-072 | `/advisor` escalation with graceful fallback — convention-based if API not programmable | model-selector-advisor | T1, T3 | complete (Wave 2) |
+| REQ-073 | Pre-flight runway estimator refuses runs projected to cross 85% stop threshold | runway-estimator | T1–T5 | complete (Wave 3) |
+| REQ-074 | Per-spawn token telemetry to `.gsd-t/token-metrics.jsonl` with frozen 18-field schema | token-telemetry | T1–T3 | complete (Waves 1–2) |
+| REQ-075 | `gsd-t metrics` CLI: `--tokens [--by ...]`, `--halts`, `--tokens --context-window` | token-telemetry | T4–T6 | complete (Wave 2) |
+| REQ-076 | Optimization backlog — detect only, never auto-apply, user promotes or rejects | optimization-backlog | T1–T4 | complete (Wave 4) |
+| REQ-077 | Headless auto-spawn on runway refusal — user never sees a `/clear` prompt | headless-auto-spawn | T1–T5 | complete (Waves 3–4) |
+| REQ-078 | Structural elimination of native compact messages — `halt_type: native-compact` count is 0 during M35 execution | runway-estimator + headless-auto-spawn | T1–T5 (RE), T1–T5 (HAS) | complete (Wave 5) |
+**M35 Functional Requirements:**
+- **REQ-069**: `bin/token-budget.js` `getDegradationActions()` must return `{band: 'normal'|'warn'|'stop', pct: number, message: string}` only. No `modelOverride`, no `skipPhases`, no `checkpoint` side-channel.
+- **REQ-070**: `WARN_THRESHOLD_PCT = 70`, `STOP_THRESHOLD_PCT = 85`. `grep -r "downgrade\|conserve\|modelOverride\|skipPhases" bin/ commands/ docs/ templates/` returns zero hits in live code.
+- **REQ-071**: `bin/model-selector.js` exists with a declarative rules table, at least 8 phase mappings across all three tiers (haiku/sonnet/opus), and unit tests for each mapping.
+- **REQ-072**: `bin/advisor-integration.js` exists. If `/advisor` is programmable: calls it. If not: convention-based fallback block injection. Graceful degradation: missed escalations logged, caller never blocked.
+- **REQ-073**: Every long-running command (execute, wave, integrate, debug, quick) invokes `estimateRunway()` at Step 0. On refusal: prints ⛔ block, calls `autoSpawnHeadless()`, exits cleanly. User never sees "run /clear."
+- **REQ-074**: Every Task subagent spawn in execute/wave/quick/integrate/debug/doc-ripple is wrapped in a token bracket. Records appended to `.gsd-t/token-metrics.jsonl` with all 18 schema fields.
+- **REQ-075**: `gsd-t metrics --tokens [--by <field>...]` prints an aggregated table. `gsd-t metrics --halts` shows halt_type breakdown with native-compact warning. `gsd-t metrics --tokens --context-window` buckets by pct_before.
+- **REQ-076**: `bin/token-optimizer.js` runs at `complete-milestone`, appends to `.gsd-t/optimization-backlog.md`. Recommendations never auto-applied. User promotes via `gsd-t-optimization-apply {ID}` or rejects via `gsd-t-optimization-reject {ID}`.
+- **REQ-077**: `bin/headless-auto-spawn.js` `autoSpawnHeadless()` spawns a detached child process, writes `.gsd-t/headless-sessions/{id}.json`, fires macOS notification on completion, and surfaces results banner on next interactive command.
+- **REQ-078**: The combination of `STOP_THRESHOLD_PCT=85` and runway estimator refusing runs that project past 85% means the runtime's 95% native compact is structurally unreachable. `halt_type: native-compact` in `token-metrics.jsonl` is a defect signal.
+**M35 Non-Functional Requirements:**
+- Zero external npm dependencies (GSD-T mandate — token-telemetry.js, runway-estimator.js, headless-auto-spawn.js must use Node.js built-ins only)
+- `autoSpawnHeadless()` must return control to the interactive session in < 500ms (detached spawn is immediate)
+- `estimateRunway()` must complete in < 100ms (reads two local files, no network)
+- Full test suite: target ~1030 tests total after M35; quality over count
+**Orphaned requirements**: None — all M35 REQs mapped to tasks.
+**Unanchored tasks**: None — all 38 M35 tasks trace to REQ-069–REQ-078 or REQ-063–068 (existing requirements that M35 code continues to satisfy).
 ---
 ## M17: Scan Visual Output — Feature Specification

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@tekyzinc/gsd-t",
-  "version": "2.74.12",
+  "version": "2.76.10",
   "description": "GSD-T: Contract-Driven Development for Claude Code — 56 slash commands with headless CI/CD mode, graph-powered code analysis, real-time agent dashboard, execution intelligence, task telemetry, doc-ripple enforcement, backlog management, impact analysis, test sync, milestone archival, and PRD generation",
   "author": "Tekyz, Inc.",
   "license": "MIT",