@tekyzinc/gsd-t 2.50.12 → 2.53.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (99)
  1. package/CHANGELOG.md +24 -0
  2. package/README.md +379 -372
  3. package/bin/component-registry.js +250 -0
  4. package/bin/graph-cgc.js +510 -510
  5. package/bin/graph-indexer.js +147 -147
  6. package/bin/graph-overlay.js +195 -195
  7. package/bin/graph-parsers.js +327 -327
  8. package/bin/graph-query.js +453 -452
  9. package/bin/graph-store.js +154 -154
  10. package/bin/qa-calibrator.js +194 -0
  11. package/bin/scan-data-collector.js +153 -153
  12. package/bin/scan-diagrams-generators.js +187 -187
  13. package/bin/scan-diagrams.js +79 -79
  14. package/bin/scan-renderer.js +92 -92
  15. package/bin/scan-report-sections.js +121 -121
  16. package/bin/scan-report.js +184 -184
  17. package/bin/scan-schema-parsers.js +199 -199
  18. package/bin/scan-schema.js +103 -103
  19. package/bin/token-budget.js +246 -0
  20. package/commands/Claude-md.md +10 -10
  21. package/commands/branch.md +15 -15
  22. package/commands/checkin.md +45 -45
  23. package/commands/global-change.md +209 -209
  24. package/commands/gsd-t-audit.md +199 -0
  25. package/commands/gsd-t-backlog-add.md +94 -94
  26. package/commands/gsd-t-backlog-edit.md +111 -111
  27. package/commands/gsd-t-backlog-list.md +63 -63
  28. package/commands/gsd-t-backlog-move.md +94 -94
  29. package/commands/gsd-t-backlog-promote.md +123 -123
  30. package/commands/gsd-t-backlog-remove.md +86 -86
  31. package/commands/gsd-t-backlog-settings.md +158 -158
  32. package/commands/gsd-t-complete-milestone.md +528 -515
  33. package/commands/gsd-t-debug.md +506 -399
  34. package/commands/gsd-t-discuss.md +174 -174
  35. package/commands/gsd-t-execute.md +758 -634
  36. package/commands/gsd-t-feature.md +276 -276
  37. package/commands/gsd-t-health.md +142 -142
  38. package/commands/gsd-t-help.md +465 -457
  39. package/commands/gsd-t-impact.md +302 -302
  40. package/commands/gsd-t-init.md +320 -280
  41. package/commands/gsd-t-integrate.md +365 -249
  42. package/commands/gsd-t-milestone.md +87 -87
  43. package/commands/gsd-t-partition.md +442 -361
  44. package/commands/gsd-t-pause.md +82 -82
  45. package/commands/gsd-t-plan.md +345 -344
  46. package/commands/gsd-t-populate.md +111 -111
  47. package/commands/gsd-t-prd.md +326 -326
  48. package/commands/gsd-t-project.md +211 -211
  49. package/commands/gsd-t-promote-debt.md +123 -123
  50. package/commands/gsd-t-prompt.md +137 -137
  51. package/commands/gsd-t-qa.md +266 -266
  52. package/commands/gsd-t-quick.md +357 -234
  53. package/commands/gsd-t-reflect.md +134 -134
  54. package/commands/gsd-t-resume.md +72 -72
  55. package/commands/gsd-t-scan.md +615 -615
  56. package/commands/gsd-t-setup.md +76 -0
  57. package/commands/gsd-t-status.md +192 -166
  58. package/commands/gsd-t-test-sync.md +381 -381
  59. package/commands/gsd-t-triage-and-merge.md +171 -171
  60. package/commands/gsd-t-verify.md +382 -382
  61. package/commands/gsd-t-visualize.md +118 -118
  62. package/commands/gsd-t-wave.md +401 -378
  63. package/docs/GSD-T-README.md +425 -422
  64. package/docs/architecture.md +385 -369
  65. package/docs/harness-design-analysis.md +371 -0
  66. package/docs/infrastructure.md +205 -205
  67. package/docs/prd-graph-engine.md +398 -398
  68. package/docs/prd-gsd2-hybrid.md +559 -559
  69. package/docs/prd-harness-evolution.md +583 -0
  70. package/docs/requirements.md +14 -0
  71. package/docs/workflows.md +226 -226
  72. package/examples/.gsd-t/domains/example-domain/scope.md +13 -13
  73. package/package.json +40 -40
  74. package/scripts/gsd-t-auto-route.js +39 -39
  75. package/scripts/gsd-t-dashboard-mockup.html +1143 -1143
  76. package/scripts/gsd-t-dashboard-server.js +171 -171
  77. package/scripts/gsd-t-dashboard.html +262 -262
  78. package/scripts/gsd-t-event-writer.js +128 -128
  79. package/scripts/gsd-t-statusline.js +94 -94
  80. package/scripts/gsd-t-tools.js +175 -175
  81. package/templates/CLAUDE-global.md +639 -614
  82. package/templates/CLAUDE-project.md +24 -0
  83. package/templates/backlog-settings.md +18 -18
  84. package/templates/backlog.md +1 -1
  85. package/templates/progress.md +40 -40
  86. package/templates/shared-services-contract.md +60 -60
  87. package/templates/stacks/desktop.ini +2 -2
  88. package/bin/desktop.ini +0 -2
  89. package/commands/desktop.ini +0 -2
  90. package/docs/ci-examples/desktop.ini +0 -2
  91. package/docs/desktop.ini +0 -2
  92. package/examples/.gsd-t/contracts/desktop.ini +0 -2
  93. package/examples/.gsd-t/desktop.ini +0 -2
  94. package/examples/.gsd-t/domains/desktop.ini +0 -2
  95. package/examples/.gsd-t/domains/example-domain/desktop.ini +0 -2
  96. package/examples/desktop.ini +0 -2
  97. package/examples/rules/desktop.ini +0 -2
  98. package/scripts/desktop.ini +0 -2
  99. package/templates/desktop.ini +0 -2
@@ -0,0 +1,583 @@
1
+ # PRD: Harness Evolution — Self-Calibrating Quality Infrastructure
2
+
3
+ ## Document Info
4
+ | Field | Value |
5
+ |-------------------|-----------------------------------------------------------------------|
6
+ | **PRD ID** | PRD-HARNESS-001 |
7
+ | **Date** | 2026-04-01 |
8
+ | **Author** | GSD-T Team |
9
+ | **Status** | DRAFT |
10
+ | **Milestones** | M31 (Tier 1), M32 (Tier 2), M33 (Tier 3) |
11
+ | **Version Target**| 2.52.10 (M31), 2.53.10 (M32), 2.54.10 (M33) |
12
+ | **Priority** | P0 — framework self-improvement and quality convergence |
13
+ | **Predecessor** | M30 (Stack Rules Engine), M26 (Rule Engine + Patch Lifecycle), M29 (Debug Loop) |
14
+ | **Successor** | Production quality parity with human-curated codebases |
15
+ | **Related** | PRD-GSD2-001 (M22-M24), PRD-GRAPH-001 (M20-M21) |
16
+
17
+ ---
18
+
19
+ ## Revision History
20
+
21
+ | Date | Version | Changes |
22
+ |------------|---------|------------------------|
23
+ | 2026-04-01 | v1 | Initial DRAFT — 6 enhancements across 3 tiers |
24
+ | 2026-04-01 | v2 | Added token-constraint analysis: enhancement 3.7 (token-aware orchestration), 3.8 (model tier refinement), updated risk/success metrics |
25
+
26
+ ---
27
+
28
+ ## 1. Problem Statement
29
+
30
+ GSD-T's quality infrastructure has grown significantly through M25-M30: telemetry collection, declarative rule engine, patch lifecycle, Red Team adversarial QA, stack rules, and compaction-proof debug loops. These components are individually effective, but six structural gaps prevent them from reaching their full potential:
31
+
32
+ 1. **Framework bloat without pruning** — GSD-T only grows, never sheds. Every milestone adds new checks, rules, and enforcement mechanisms to subagent prompts. There is no mechanism to determine whether a component (Red Team, stack rules, observability logging) is still earning its keep. A component that added value at M26 may be pure overhead at M35, consuming precious context tokens without preventing real failures.
33
+
34
+ 2. **Static QA prompts** — The QA subagent prompt is frozen at the text written during its creation. Red Team regularly finds bugs that QA missed, but this signal is never fed back to improve QA's detection capability. The rule engine (M26) and ELO system (M25) exist but are not connected to QA calibration. QA's miss rate is measurable but not actionable.
35
+
36
+ 3. **Procedural prompts without quality vision** — Every subagent prompt is purely procedural: "do X, check Y, report Z." Research on long-running AI harnesses shows that injecting an aspirational quality statement (a "quality persona") shifts output quality more effectively than adding more procedural checks. A phrase like "museum-quality code" or "production-ready from line one" changes the agent's default quality threshold. GSD-T has no mechanism for this.
37
+
38
+ 4. **Aesthetic drift in UI-heavy projects** — When GSD-T executes UI tasks, each subagent makes independent aesthetic decisions. Task 1 might pick rounded corners and soft shadows, Task 3 might use sharp borders and flat design. Contracts define functional interfaces but not visual language. There is no design brief artifact that flows through execution like contracts do.
39
+
40
+ 5. **Scripted-only test evaluation** — QA and Red Team agents can only evaluate through scripted Playwright assertions. They cannot interactively explore the application the way a human tester would — clicking around, trying unexpected flows, observing visual glitches. The article's evaluator harness found issues through dynamic interaction that scripted tests missed entirely.
41
+
42
+ 6. **Fixed iteration budget** — GSD-T hardcodes "2 fix attempts" before escalating to the headless debug loop. Research shows that 5-15 iterations drive convergence to significantly better quality for complex tasks. The current budget is one-size-fits-all: simple tasks get 2 attempts (enough), complex tasks get 2 attempts (not enough), and the headless debug loop is a heavyweight escalation that disrupts the normal execution flow.
43
+
44
+ **Root cause**: GSD-T was designed as a *methodology* that prescribes process. It now needs to become a *self-calibrating system* that measures its own component effectiveness, tunes its quality signals based on outcomes, and adapts its iteration depth to task complexity.
45
+
46
+ ---
47
+
48
+ ## 2. Objective
49
+
50
+ Evolve GSD-T from a static methodology framework into a self-calibrating quality system across 8 enhancements in 3 tiers.
51
+
52
+ **Primary goals** (in priority order):
53
+ 1. **Self-awareness** — GSD-T can measure whether its own components add value, and disable ones that don't
54
+ 2. **Closed-loop QA** — QA miss rates feed back into QA prompt tuning automatically
55
+ 3. **Quality culture** — Subagent prompts carry a project-level quality aspiration, not just procedural rules
56
+ 4. **Aesthetic coherence** — UI projects get a design brief that flows through execution like contracts
57
+ 5. **Exploratory evaluation** — QA/Red Team can interact with running applications, not just run scripts
58
+ 6. **Adaptive iteration** — Iteration budgets scale with task complexity, not a fixed constant
59
+ 7. **Token-budget awareness** — On a token-limited plan ($200 Max), the orchestrator must manage session-level token consumption to prevent mid-milestone exhaustion
60
+
61
+ **Core principle**: Every quality mechanism must prove its value through measurable outcomes. Components that cannot demonstrate impact are candidates for removal, not preservation.
62
+
63
+ **Operational constraint**: GSD-T runs on Claude's $200 Max plan, where tokens are a hard daily/weekly ceiling — not a variable cost. Running out mid-milestone is a workflow-breaking event. All enhancements must be designed with token conservation as a first-class concern.
64
+
65
+ ---
66
+
67
+ ## 3. Enhancements
68
+
69
+ ### 3.1 Harness Audit Capability (HIGH PRIORITY — Tier 1)
70
+
71
+ **Problem**: The framework accumulates enforcement mechanisms (Red Team, stack rules, observability logging, E2E enforcement, doc-ripple checks) but has no way to determine if they are still earning their context-token cost. Over time, the aggregate prompt overhead may exceed the value delivered.
72
+
73
+ **Solution**: A `gsd-t-audit` command and supporting infrastructure that stress-tests GSD-T's own components by selectively disabling them and comparing outcomes.
74
+
75
+ **Mechanism**:
76
+ 1. **Component registry** — A structured file (`.gsd-t/component-registry.jsonl`) listing every enforcement mechanism: name, injection point (which command files), approximate token cost (lines of prompt text), date added, last measured impact.
77
+ 2. **Audit mode** — `gsd-t-audit` runs a milestone's worth of tasks twice: once with all components active (control), once with a target component disabled (experiment). Compares: bugs caught, test pass rates, rework cycles, and time-to-completion.
78
+ 3. **Shadow mode** — For components that can't be cleanly disabled (like the pre-commit gate), audit mode runs them in "shadow" — the check executes but its result is logged, not enforced. This measures whether it would have caught something without blocking execution.
79
+ 4. **Cost/benefit ledger** — `.gsd-t/metrics/component-impact.jsonl` tracks per-component: token cost per invocation, bugs prevented (from QA/Red Team logs), false positives generated, context % consumed. Components with cost > benefit for 3+ milestones are flagged for deprecation review.
80
+ 5. **Integration with rule engine** — Components flagged for deprecation become candidates in the patch lifecycle (M26). A "disable component X" patch follows the same candidate -> measured -> promoted -> graduated flow.
81
+
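A minimal sketch of the deprecation flagging described in the cost/benefit ledger step above. The PRD lists what the ledger tracks but gives no schema, so the field names (`component`, `tokenCost`, `benefit`) are hypothetical, as is the idea of expressing benefit in the same units as token cost. "3+ milestones" is read here as the three most recent ledger rows for a component, which is one possible interpretation.

```javascript
// Parse JSONL text (one JSON object per line) into an array of rows.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}

// Return the names of components whose cost exceeded their benefit in each
// of their last `minMilestones` ledger rows. Rows are assumed to be in
// chronological order, one row per component per milestone.
function flagForDeprecation(rows, minMilestones = 3) {
  const byComponent = new Map();
  for (const row of rows) {
    if (!byComponent.has(row.component)) byComponent.set(row.component, []);
    byComponent.get(row.component).push(row);
  }
  const flagged = [];
  for (const [name, history] of byComponent) {
    const recent = history.slice(-minMilestones);
    const enoughData = recent.length >= minMilestones;
    if (enoughData && recent.every((r) => r.tokenCost > r.benefit)) {
      flagged.push(name);
    }
  }
  return flagged;
}

// Hypothetical ledger content for illustration only.
const sampleLedger = [
  '{"component":"red-team","milestone":"M31","tokenCost":1200,"benefit":3000}',
  '{"component":"doc-ripple","milestone":"M31","tokenCost":400,"benefit":100}',
  '{"component":"red-team","milestone":"M32","tokenCost":1200,"benefit":2500}',
  '{"component":"doc-ripple","milestone":"M32","tokenCost":400,"benefit":0}',
  '{"component":"red-team","milestone":"M33","tokenCost":1200,"benefit":2000}',
  '{"component":"doc-ripple","milestone":"M33","tokenCost":400,"benefit":50}',
].join("\n");

console.log(flagForDeprecation(parseJsonl(sampleLedger)));
```

Requiring a full window of data before flagging means a newly added component cannot be marked for deprecation until it has been measured across three milestones.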
82
+ **Files affected**:
83
+ - NEW: `commands/gsd-t-audit.md` — audit command (new command, count goes to 52)
84
+ - NEW: `.gsd-t/component-registry.jsonl` template
85
+ - MODIFY: `bin/gsd-t.js` — command count update
86
+ - MODIFY: `commands/gsd-t-complete-milestone.md` — component impact evaluation in distillation step
87
+ - MODIFY: `templates/CLAUDE-global.md`, `commands/gsd-t-help.md`, `README.md`, `GSD-T-README.md` — command reference updates
88
+
89
+ **Success criteria**:
90
+ - [ ] Component registry lists all enforcement mechanisms with token cost estimates
91
+ - [ ] `gsd-t-audit` can disable a named component and run comparison tasks
92
+ - [ ] Shadow mode logs enforcement results without blocking execution
93
+ - [ ] Cost/benefit ledger accumulates per-milestone impact data
94
+ - [ ] Components with 3+ milestones of negative ROI are flagged in `gsd-t-status`
95
+ - [ ] Flagged components enter the patch lifecycle as deprecation candidates
96
+
97
+ **Acceptance test**: Run `gsd-t-audit --component=red-team` on a small milestone. Verify the audit produces a comparison report showing bugs caught vs. context tokens consumed, and the result is persisted in `component-impact.jsonl`.
98
+
99
+ ---
100
+
101
+ ### 3.2 QA Calibration Feedback Loop (HIGH PRIORITY — Tier 1)
102
+
103
+ **Problem**: QA subagent prompts are static text. Red Team finds bugs that QA missed, but this signal is discarded. The rule engine (M26) and ELO system (M25) exist but are not connected to QA prompt tuning.
104
+
105
+ **Solution**: Wire Red Team miss-rate data back into QA prompt generation, creating a closed-loop calibration system.
106
+
107
+ **Mechanism**:
108
+ 1. **Miss-rate tracking** — After Red Team completes, compare its findings against QA's report. Bugs found by Red Team but missed by QA are logged to `.gsd-t/metrics/qa-miss-log.jsonl` with category tags (contract violation, boundary input, state transition, error path, missing flow, regression, E2E gap).
109
+ 2. **Category aggregation** — `bin/qa-calibrator.js` reads `qa-miss-log.jsonl` and computes miss rates per category across milestones. Categories with miss rates > 30% are "weak spots."
110
+ 3. **Dynamic QA prompt injection** — During `gsd-t-execute` Step 2 (QA subagent spawn), the orchestrator calls `qa-calibrator.js` to get current weak spots. These are injected into the QA prompt as priority focus areas: "PRIORITY: Your historical miss rate for {category} is {N}%. Pay extra attention to: {specific patterns from miss log}."
111
+ 4. **Calibration rules** — Weak spots that persist for 3+ milestones generate a rule engine candidate patch that adds a permanent check to the QA prompt template. Weak spots that drop below 10% miss rate for 2+ milestones have their priority injection removed.
112
+ 5. **ELO integration** — QA miss rates factor into the process ELO calculation. A milestone with high QA miss rates gets a lower ELO delta, incentivizing the system toward better first-pass QA detection.
113
+
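The aggregation and injection steps above might be sketched as follows. This is not the contents of `bin/qa-calibrator.js`. The miss log only records what QA missed, so a rate needs a denominator: the sketch assumes a hypothetical `qaCaught` map of bugs QA did catch per category, which the PRD does not describe.

```javascript
// misses: array of { category } entries, as they might appear in qa-miss-log.jsonl.
// qaCaught: hypothetical map of category -> number of bugs QA caught itself.
function missRates(misses, qaCaught) {
  const missCount = {};
  for (const m of misses) {
    missCount[m.category] = (missCount[m.category] || 0) + 1;
  }
  const categories = new Set([...Object.keys(missCount), ...Object.keys(qaCaught)]);
  const rates = {};
  for (const category of categories) {
    const missed = missCount[category] || 0;
    const total = missed + (qaCaught[category] || 0);
    rates[category] = total === 0 ? 0 : missed / total;
  }
  return rates;
}

// Categories whose miss rate is strictly above the threshold (30% in the PRD),
// worst first.
function weakSpots(rates, threshold = 0.3) {
  return Object.entries(rates)
    .filter(([, rate]) => rate > threshold)
    .map(([category, rate]) => ({ category, rate }))
    .sort((a, b) => b.rate - a.rate);
}

// Render the priority lines using the wording quoted in the PRD. The
// "specific patterns" part is left generic because its source format is unknown.
function priorityInjection(spots) {
  return spots
    .map(
      (s) =>
        `PRIORITY: Your historical miss rate for ${s.category} is ` +
        `${Math.round(s.rate * 100)}%. Pay extra attention to this category.`
    )
    .join("\n");
}

const rates = missRates(
  [{ category: "boundary input" }, { category: "boundary input" }, { category: "regression" }],
  { "boundary input": 2, regression: 9 }
);
console.log(priorityInjection(weakSpots(rates)));
```

With two misses against two catches, boundary input sits at 50% and is injected, while regression at 10% is not.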
114
+ **Files affected**:
115
+ - NEW: `bin/qa-calibrator.js` — miss-rate aggregation and weak-spot detection
116
+ - MODIFY: `commands/gsd-t-execute.md` — inject weak spots into QA subagent prompt
117
+ - MODIFY: `commands/gsd-t-quick.md` — same injection for inline QA
118
+ - MODIFY: `commands/gsd-t-integrate.md` — same injection for integration QA
119
+ - MODIFY: `templates/CLAUDE-global.md` — document QA calibration in QA Agent section
120
+ - MODIFY: `bin/metrics-rollup.js` — incorporate QA miss rates into ELO calculation
121
+
122
+ **Success criteria**:
123
+ - [ ] Red Team findings not in QA report are logged to `qa-miss-log.jsonl` with category tags
124
+ - [ ] `qa-calibrator.js` computes per-category miss rates and identifies weak spots (>30%)
125
+ - [ ] QA subagent prompts include dynamic weak-spot injections during execute
126
+ - [ ] Weak spots persisting 3+ milestones generate rule engine candidate patches
127
+ - [ ] Weak spots dropping below 10% for 2+ milestones have injections removed
128
+ - [ ] QA miss rate is reflected in process ELO calculation
129
+
130
+ **Acceptance test**: After a Red Team run that finds 2 boundary-input bugs QA missed, verify `qa-miss-log.jsonl` contains the entries, `qa-calibrator.js` reports boundary-input as a weak spot, and the next QA subagent spawn includes the priority injection.
131
+
132
+ ---
133
+
134
+ ### 3.3 Quality North Star Injection (MEDIUM PRIORITY — Tier 2)
135
+
136
+ **Problem**: Every subagent prompt is procedural. Research shows that an aspirational quality statement ("quality persona") shifts output quality more effectively than adding procedural checks. GSD-T has no mechanism for project-level quality aspiration.
137
+
138
+ **Solution**: A configurable quality persona that gets prepended to every subagent prompt, set once during `gsd-t-init` or `gsd-t-setup` and stored in project CLAUDE.md.
139
+
140
+ **Mechanism**:
141
+ 1. **Quality persona field** — A new section in the project CLAUDE.md template: `## Quality North Star`. Contains a 1-3 sentence aspirational quality statement. Examples: "Museum-quality code — every function reads like it was written for a textbook", "Production-ready from line one — no TODOs, no shortcuts, no 'fix later'", "Enterprise-grade reliability — every error path is handled, every edge case is tested."
142
+ 2. **Default personas** — `gsd-t-init` offers 3 preset personas based on project type detection:
143
+ - **Library/package**: "This code will be read by thousands of developers. Every public API must be self-documenting, every edge case must be handled, every error message must be actionable."
144
+ - **Web application**: "Every user interaction must feel instant, every error must be recoverable, every page must be accessible. Ship quality a designer would screenshot."
145
+ - **CLI tool**: "Every command must complete in under 2 seconds, every error must suggest a fix, every flag must have a help string. The --help output is the product."
146
+ - **Custom**: User writes their own statement.
147
+ 3. **Injection point** — The quality persona is prepended to every subagent prompt (execute, quick, debug, integrate, wave) immediately before the task-specific instructions. It sits above procedural checks, framing the agent's quality default before any rules are read.
148
+ 4. **Stack rule integration** — Quality persona is injected before stack rules, so the aspirational statement colors how the agent interprets and applies the rules.
149
+
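One way the injection order above could look in code, as a sketch only. It assumes the persona is the text under a `## Quality North Star` heading in the project CLAUDE.md, and that stack rules and task instructions are already available as strings. The function names are hypothetical.

```javascript
// Extract the text under "## Quality North Star", stopping at the next
// level-1 or level-2 heading. Returns null when the section is absent or empty.
function extractPersona(claudeMd) {
  const lines = claudeMd.split("\n");
  const start = lines.findIndex((l) => l.trim() === "## Quality North Star");
  if (start === -1) return null;
  const body = [];
  for (const line of lines.slice(start + 1)) {
    if (/^#{1,2}\s/.test(line)) break;
    body.push(line);
  }
  const text = body.join("\n").trim();
  return text === "" ? null : text;
}

// Assemble the prompt in the order the PRD describes: persona first, then
// stack rules, then task-specific instructions. Missing parts are skipped,
// so projects without a persona get the same prompt as before.
function buildSubagentPrompt({ persona, stackRules, taskInstructions }) {
  return [persona, stackRules, taskInstructions].filter(Boolean).join("\n\n");
}

const claudeMd = [
  "# Example Project",
  "",
  "## Quality North Star",
  "Museum-quality code.",
  "",
  "## Stack",
  "Node.js",
].join("\n");

console.log(
  buildSubagentPrompt({
    persona: extractPersona(claudeMd),
    stackRules: "STACK RULES: ...",
    taskInstructions: "TASK: ...",
  })
);
```

Filtering out empty parts is what gives the silent skip for projects that have no persona configured.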
150
+ **Files affected**:
151
+ - MODIFY: `templates/CLAUDE-project.md` — add Quality North Star section with placeholder
152
+ - MODIFY: `commands/gsd-t-init.md` — persona selection during init
153
+ - MODIFY: `commands/gsd-t-setup.md` — persona configuration
154
+ - MODIFY: `commands/gsd-t-execute.md` — inject persona into subagent prompt
155
+ - MODIFY: `commands/gsd-t-quick.md` — same injection
156
+ - MODIFY: `commands/gsd-t-debug.md` — same injection
157
+ - MODIFY: `commands/gsd-t-integrate.md` — same injection
158
+ - MODIFY: `commands/gsd-t-wave.md` — same injection
159
+ - MODIFY: `templates/CLAUDE-global.md` — document the feature
160
+
161
+ **Success criteria**:
162
+ - [ ] CLAUDE-project.md template includes Quality North Star section
163
+ - [ ] `gsd-t-init` prompts for or auto-selects a quality persona
164
+ - [ ] Quality persona is prepended to all subagent prompts (execute, quick, debug, integrate, wave)
165
+ - [ ] Persona injection occurs before stack rules and procedural checks
166
+ - [ ] Projects without a persona skip injection silently (backward compatible)
167
+
168
+ **Acceptance test**: Set persona to "Museum-quality code" in CLAUDE.md. Run `gsd-t-execute` on a task. Verify the subagent prompt starts with the persona statement before any procedural instructions.
169
+
170
+ ---
171
+
172
+ ### 3.4 Design Brief Artifact (MEDIUM PRIORITY — Tier 2)
173
+
174
+ **Problem**: UI-heavy projects suffer aesthetic drift when different execution subagents make independent visual decisions per task. Contracts define functional interfaces (component props, API shapes) but not visual language (colors, spacing, typography, interaction patterns).
175
+
176
+ **Solution**: A design brief artifact generated during `gsd-t-partition` or `gsd-t-plan` that flows into execute subagents alongside contracts.
177
+
178
+ **Mechanism**:
179
+ 1. **Design brief detection** — During `gsd-t-partition`, if any domain contains UI/frontend tasks (detected by: React/Vue/Svelte/Flutter in stack, component files in scope, CSS/styling files), prompt generation of a design brief.
180
+ 2. **Design brief structure** — `.gsd-t/contracts/design-brief.md`:
181
+ - **Color palette**: Primary, secondary, accent, background, text colors with hex values
182
+ - **Typography**: Font families, size scale, weight usage
183
+ - **Spacing system**: Base unit, scale (4px, 8px, 12px, 16px, 24px, 32px, 48px)
184
+ - **Component patterns**: Border radius, shadow levels, hover/active states, transition durations
185
+ - **Layout principles**: Grid system, breakpoints, container widths
186
+ - **Interaction patterns**: Loading states, error states, empty states, success feedback
187
+ - **Tone**: Formal/casual, dense/spacious, minimal/rich
188
+ 3. **Injection** — Design brief is injected into subagent prompts for UI-related tasks (same injection point as contracts). Non-UI tasks skip it.
189
+ 4. **Sources** — If the project already has a design system (Tailwind config, theme file, Storybook), the brief is derived from existing sources. If none exist, the brief is generated from the quality persona + project type.
190
+ 5. **Existing project support** — `gsd-t-setup` can generate a design brief for an existing project by analyzing current CSS/theme files.
191
+
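The detection step above can be illustrated with a small sketch. The domain shape (`stack`, `files`) and the file-extension list are assumptions for illustration. The PRD names the signals (framework in stack, component files, CSS/styling files) but not how they are represented.

```javascript
// Frameworks named in the PRD as UI signals.
const UI_FRAMEWORKS = ["react", "vue", "svelte", "flutter"];

// Hypothetical extension list standing in for "component files" and
// "CSS/styling files".
const UI_FILE_PATTERN = /\.(jsx|tsx|vue|svelte|css|scss)$/i;

// A domain counts as UI-related if its stack names a UI framework or its
// scope contains component or styling files.
function isUiDomain(domain) {
  const stackHit = (domain.stack || []).some((s) => UI_FRAMEWORKS.includes(s.toLowerCase()));
  const fileHit = (domain.files || []).some((f) => UI_FILE_PATTERN.test(f));
  return stackHit || fileHit;
}

// A design brief is only generated when at least one domain is UI-related,
// so backend-only milestones pay no overhead.
function needsDesignBrief(domains) {
  return domains.some(isUiDomain);
}

const domains = [
  { name: "api", stack: ["Express"], files: ["src/routes/users.js"] },
  { name: "frontend", stack: ["React"], files: ["src/components/Button.tsx"] },
];
console.log(domains.filter(isUiDomain).map((d) => d.name));
```

The same `isUiDomain` predicate could gate the per-task injection, so a backend task never receives the brief.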
192
+ **Files affected**:
193
+ - MODIFY: `commands/gsd-t-partition.md` — design brief detection and generation step
194
+ - MODIFY: `commands/gsd-t-plan.md` — reference design brief in UI task descriptions
195
+ - MODIFY: `commands/gsd-t-execute.md` — inject design brief for UI tasks
196
+ - MODIFY: `commands/gsd-t-quick.md` — same injection for quick UI tasks
197
+ - MODIFY: `commands/gsd-t-setup.md` — design brief generation for existing projects
198
+ - NEW: Template section in `templates/CLAUDE-project.md` referencing design brief convention
199
+
200
+ **Success criteria**:
201
+ - [ ] `gsd-t-partition` detects UI-heavy domains and triggers design brief generation
202
+ - [ ] Design brief is stored in `.gsd-t/contracts/design-brief.md`
203
+ - [ ] Design brief is injected into subagent prompts for UI-related tasks only
204
+ - [ ] Non-UI tasks do not receive design brief injection
205
+ - [ ] Existing Tailwind/theme configs are parsed to pre-populate the brief
206
+ - [ ] Projects without UI domains skip design brief entirely (no overhead)
207
+
208
+ **Acceptance test**: Partition a milestone with a React frontend domain. Verify `design-brief.md` is generated with color palette and component patterns. Execute a UI task and verify the subagent prompt includes the design brief. Execute a backend task and verify it does not.
209
+
210
+ ---
211
+
212
+ ### 3.5 Evaluator Interactivity (MEDIUM PRIORITY — Tier 2)
213
+
214
+ **Problem**: QA and Red Team agents evaluate through scripted Playwright assertions only. They cannot explore the running application dynamically — clicking unexpected paths, trying edge-case inputs, observing visual rendering. Scripted tests verify known requirements; exploratory testing finds unknown bugs.
215
+
216
+ **Solution**: Give QA and Red Team agents access to Playwright MCP for interactive browser testing beyond their scripted assertions.
217
+
218
+ **Mechanism**:
219
+ 1. **MCP detection** — Before spawning QA/Red Team subagents, check if Playwright MCP server is registered in Claude Code settings (`.claude/settings.local.json` or global settings). If available, include MCP access instructions in the subagent prompt.
220
+ 2. **Exploratory testing prompt** — After scripted tests pass, the QA/Red Team subagent receives an additional instruction block:
221
+ ```
222
+ "You have access to Playwright MCP for interactive browser testing.
223
+ After scripted tests pass, spend up to 3 minutes exploring:
224
+ - Navigate to every route in the application
225
+ - Try unexpected inputs in every form field
226
+ - Click UI elements in unexpected order
227
+ - Resize the browser to test responsive behavior
228
+ - Check console for errors after each action
229
+ Report any issues found as EXPLORATORY findings (separate from scripted test results)."
230
+ ```
231
+ 3. **Time budget** — Exploratory testing has a configurable time budget (default: 3 minutes for QA, 5 minutes for Red Team). This prevents runaway exploration while allowing meaningful coverage.
232
+ 4. **Finding classification** — Exploratory findings are tagged `[EXPLORATORY]` in qa-issues.md and red-team-report.md to distinguish them from scripted test failures. They feed into the QA calibration loop (3.2) as a separate category.
233
+ 5. **Graceful degradation** — If Playwright MCP is not available, exploratory testing is skipped silently. The feature is purely additive.
234
+
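A sketch of the detection and graceful-degradation logic above. The PRD names the settings files but not their schema, so looking for a server whose name contains "playwright" under a `mcpServers` key is an assumption made for illustration and should be checked against the actual settings format. The function names are hypothetical and the prompt text is abridged from the block quoted in the PRD.

```javascript
// Default time budgets in minutes, taken from the PRD text.
const DEFAULT_BUDGET_MINUTES = { qa: 3, "red-team": 5 };

// ASSUMPTION: registered MCP servers appear as keys of a `mcpServers` object
// in the parsed settings. This schema is not confirmed by the source.
function hasPlaywrightMcp(settings) {
  const servers = (settings && settings.mcpServers) || {};
  return Object.keys(servers).some((name) => name.toLowerCase().includes("playwright"));
}

// Build the extra instruction block for a QA or Red Team subagent.
// Returns null when Playwright MCP is not detected, so callers can skip it silently.
function exploratoryBlock(role, settings, overrides = {}) {
  if (!hasPlaywrightMcp(settings)) return null;
  const minutes = overrides[role] ?? DEFAULT_BUDGET_MINUTES[role];
  return [
    "You have access to Playwright MCP for interactive browser testing.",
    `After scripted tests pass, spend up to ${minutes} minutes exploring.`,
    "Report any issues found as EXPLORATORY findings (separate from scripted test results).",
  ].join("\n");
}

console.log(exploratoryBlock("qa", {}));
console.log(exploratoryBlock("red-team", { mcpServers: { playwright: {} } }));
```

Returning `null` instead of throwing keeps the feature purely additive: a project without the MCP server gets exactly the prompt it had before.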
235
+ **Files affected**:
236
+ - MODIFY: `commands/gsd-t-execute.md` — add exploratory testing block to QA/Red Team prompts
237
+ - MODIFY: `commands/gsd-t-quick.md` — same for inline QA
238
+ - MODIFY: `commands/gsd-t-integrate.md` — same for integration QA/Red Team
239
+ - MODIFY: `commands/gsd-t-debug.md` — same for debug verification
240
+ - MODIFY: `templates/CLAUDE-global.md` — document evaluator interactivity in QA Agent section
241
+ - MODIFY: `templates/CLAUDE-project.md` — optional `Evaluator Time Budget` field
242
+
243
+ **Success criteria**:
244
+ - [ ] QA/Red Team subagent prompts include exploratory testing instructions when Playwright MCP is detected
245
+ - [ ] Exploratory testing has configurable time budgets (default 3min QA, 5min Red Team)
246
+ - [ ] Exploratory findings are tagged `[EXPLORATORY]` in reports
247
+ - [ ] Exploratory findings feed into QA calibration feedback loop (3.2)
248
+ - [ ] Missing Playwright MCP causes graceful skip, not failure
249
+ - [ ] Scripted tests still run first and must pass before exploratory testing begins
250
+
251
+ **Acceptance test**: With Playwright MCP registered, run `gsd-t-execute` on a web application task. Verify the QA subagent runs scripted tests, then performs exploratory testing via MCP, and any findings are tagged `[EXPLORATORY]` in qa-issues.md.
252
+
253
+ ---
254
+
255
+ ### 3.6 Configurable Iteration Budget (LOW-MEDIUM PRIORITY — Tier 3)
256
+
257
+ **Problem**: GSD-T hardcodes "2 fix attempts" before escalating to the headless debug loop (M29). Research shows 5-15 iterations drive convergence to significantly better quality for complex tasks. Simple tasks need 1-2 attempts; complex tasks need 5-10. The current one-size-fits-all budget under-serves complex tasks and the headless debug loop is a heavyweight escalation.
258
+
259
+ **Solution**: Allow domains and individual tasks to specify iteration budgets, with intelligent defaults based on task complexity signals.
260
+
261
+ **Mechanism**:
262
+ 1. **Budget specification** — Three levels of override:
263
+ - **Project-level default**: `Iteration Budget: N` in CLAUDE.md (default: 2 if unset, preserving current behavior)
264
+ - **Domain-level override**: `iteration_budget: N` in domain `constraints.md` (overrides project default)
265
+ - **Task-level override**: `[budget:N]` tag in task description in `tasks.md` (overrides domain default)
266
+ 2. **Complexity-based defaults** — During `gsd-t-plan`, each task gets a complexity score based on:
267
+ - File count in scope (>5 files = +1)
268
+ - Cross-domain dependencies (any = +1)
269
+ - New vs. modify (new file = +0, modify existing = +1)
270
+ - Test requirements (E2E = +1, unit-only = +0)
271
+ - Historical failure rate for similar domain types (from rule engine)
272
+ - Score-to-budget mapping: score 0-1 = budget 2, score 2-3 = budget 4, score 4+ = budget 6
273
+ 3. **In-context vs. headless threshold** — The iteration budget applies to in-context fix attempts. The headless debug loop (M29) is still the escalation path, but it activates after the full budget is exhausted (not after a fixed 2). This makes the headless loop a true last resort.
274
+ 4. **Budget telemetry** — Each task's actual iteration count is logged in task-metrics.jsonl. Over time, this data refines the complexity-based defaults through the rule engine.
275
+ 5. **Budget governance** — The quality budget system (M26) still applies. If a milestone's aggregate rework rate exceeds the ceiling, the system tightens constraints rather than increasing iteration budgets.
276
+
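The scoring and override rules above might be sketched like this. The task field names are hypothetical, and the historical failure-rate signal is left out because the PRD gives it no weight. How a plan-time complexity budget interacts with explicit overrides is not specified either, so the two are kept as separate functions.

```javascript
// Sum the unit signals listed in the PRD. Field names are assumptions.
function complexityScore(task) {
  let score = 0;
  if (task.fileCount > 5) score += 1; // more than 5 files in scope
  if (task.crossDomainDeps > 0) score += 1; // any cross-domain dependency
  if (task.modifiesExisting) score += 1; // modify existing = +1, new file = +0
  if (task.requiresE2E) score += 1; // E2E = +1, unit-only = +0
  return score;
}

// Score 0-1 -> budget 2, score 2-3 -> budget 4, score 4+ -> budget 6.
function budgetFromScore(score) {
  if (score <= 1) return 2;
  if (score <= 3) return 4;
  return 6;
}

// Read a [budget:N] tag from a task description, or null if absent.
function parseBudgetTag(description) {
  const match = /\[budget:(\d+)\]/.exec(description || "");
  return match ? Number(match[1]) : null;
}

// Task tag overrides domain, domain overrides project, and the fallback of 2
// preserves the current behavior.
function resolveBudget({ taskDescription, domainBudget, projectBudget }) {
  return parseBudgetTag(taskDescription) ?? domainBudget ?? projectBudget ?? 2;
}

console.log(resolveBudget({ taskDescription: "Refactor auth [budget:8]", projectBudget: 5 }));
console.log(resolveBudget({ taskDescription: "Rename helper", projectBudget: 5 }));
```

This mirrors the acceptance test: a project budget of 5 with a `[budget:8]` tag resolves to 8, and an untagged task falls back to 5.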
277
+ **Files affected**:
278
+ - MODIFY: `commands/gsd-t-plan.md` — complexity scoring and budget assignment per task
279
+ - MODIFY: `commands/gsd-t-execute.md` — read task budget, use as fix-attempt limit instead of hardcoded 2
280
+ - MODIFY: `commands/gsd-t-quick.md` — same budget-aware fix attempts
281
+ - MODIFY: `commands/gsd-t-debug.md` — same
282
+ - MODIFY: `commands/gsd-t-wave.md` — same
283
+ - MODIFY: `commands/gsd-t-test-sync.md` — same
284
+ - MODIFY: `commands/gsd-t-verify.md` — same
285
+ - MODIFY: `templates/CLAUDE-project.md` — add Iteration Budget field
286
+ - MODIFY: `templates/CLAUDE-global.md` — document iteration budget system
287
+
288
+ **Success criteria**:
289
+ - [ ] Project CLAUDE.md supports `Iteration Budget: N` setting
290
+ - [ ] Domain `constraints.md` supports `iteration_budget: N` override
291
+ - [ ] Task descriptions support `[budget:N]` tag
292
+ - [ ] `gsd-t-plan` assigns complexity-based default budgets to tasks
293
+ - [ ] Execute commands respect task budget instead of hardcoded 2
294
+ - [ ] Headless debug loop activates only after full budget exhaustion
295
+ - [ ] Actual iteration counts are logged in task-metrics.jsonl
296
+ - [ ] Default behavior (no budget set) preserves current 2-attempt limit
297
+
298
+ **Acceptance test**: Set project budget to 5. Create a task with `[budget:8]` tag. Verify execute allows up to 8 fix attempts before escalating to headless debug loop. Create a task with no tag — verify it uses project default of 5.
299
+
300
+ ---
301
+
302
+ ### 3.7 Token-Aware Orchestration (MEDIUM-HIGH PRIORITY — Tier 1)
303
+
304
+ **Problem**: GSD-T runs on Claude's $200 Max plan, where tokens are a hard daily/weekly ceiling — not a variable API expense. A typical milestone spawns 30-50+ subagents across all phases. With tiered models, this consumes roughly 50-80% of a daily budget. Without budget awareness, the orchestrator can exhaust tokens mid-milestone, leaving uncommitted work scattered across subagents and forcing a wait until limits reset.
305
+
306
+ The article's harness doesn't address this because it operates on API billing where cost is variable. On a Max plan, token exhaustion is a binary failure mode — you either have capacity or you don't.
307
+
308
+ **Solution**: Make the wave and execute orchestrators aware of aggregate session-level token consumption, with graceful degradation as limits approach.
309
+
310
+ **Mechanism**:
311
+ 1. **Session budget tracking** — The orchestrator tracks cumulative tokens consumed across all subagent spawns within a session. It uses the existing observability log (`token-log.md`) plus the `CLAUDE_CONTEXT_TOKENS_USED` environment variable.
312
+ 2. **Budget estimation before spawn** — Before spawning a subagent, estimate the token cost based on: model tier (Opus ~5x Sonnet, Sonnet ~5x Haiku), task complexity (from plan-time scoring if available), and historical average from token-log.md for similar tasks.
313
+ 3. **Graduated degradation thresholds**:
314
+
315
+ | Session Budget Consumed | Action |
316
+ |------------------------|--------|
317
+ | < 60% | Normal operation — all models at assigned tiers |
318
+ | 60-70% | **WARN**: Display budget alert to user. Reduce iteration budgets to minimum (2). |
319
+ | 70-85% | **DOWNGRADE**: Non-critical Sonnet tasks demoted to Haiku. Skip exploratory testing (3.5). Disable shadow-mode audit (3.1). |
320
+ | 85-95% | **CONSERVE**: Pause non-essential phases (doc-ripple, design brief generation). Checkpoint all progress to disk. |
321
+ | > 95% | **STOP**: Hard stop. Save all progress. Display: "Token budget nearly exhausted. Progress saved. Resume with `/gsd-t-resume` after limit resets." |
322
+
323
+ 4. **Model-tier-aware budgeting** — The budget tracker understands that one Opus call ≈ 5 Sonnet calls ≈ 25 Haiku calls in token terms. Degradation actions (downgrading Sonnet → Haiku) are chosen to maximize remaining capacity for high-value tasks.
324
+ 5. **Milestone pre-flight check** — Before starting a wave or execute run, estimate total token cost for the remaining work. If estimated cost exceeds available budget, warn the user: "This milestone has ~{N} tasks remaining, estimated at ~{X}% of daily budget. Proceed or split across sessions?"
325
+ 6. **Integration with iteration budget (3.6)** — When budget is constrained (>60% consumed), iteration budgets are automatically reduced. At >70% consumed, the system prefers model escalation (Haiku → Sonnet) over additional iterations at the same tier, since one Sonnet attempt is more likely to converge than three Haiku attempts.
326
+
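The graduated degradation table in step 3 maps a consumption percentage to one action level. A minimal sketch of that mapping follows. The table leaves the shared boundaries (70%, 85%) ambiguous, so this sketch assumes a boundary value belongs to the higher band; `degradationLevel` is a hypothetical name, not the API of `bin/token-budget.js`.

```javascript
// Map percent of session budget consumed to a degradation level.
// Assumption: a boundary value belongs to the higher band (70 -> DOWNGRADE),
// except 95, which stays in CONSERVE because STOP is defined as "> 95%".
function degradationLevel(consumedPct) {
  if (!Number.isFinite(consumedPct) || consumedPct < 0) {
    throw new RangeError("consumedPct must be a non-negative number");
  }
  if (consumedPct > 95) return "STOP";
  if (consumedPct >= 85) return "CONSERVE";
  if (consumedPct >= 70) return "DOWNGRADE";
  if (consumedPct >= 60) return "WARN";
  return "NORMAL";
}
```

Checking from the highest threshold down keeps each band a single comparison and makes the boundary rule easy to change in one place.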
327
+ **Files affected**:
328
+ - MODIFY: `commands/gsd-t-execute.md` — pre-spawn budget check, degradation logic
329
+ - MODIFY: `commands/gsd-t-wave.md` — milestone pre-flight estimate, per-phase budget check
330
+ - MODIFY: `commands/gsd-t-quick.md` — budget-aware model selection
331
+ - MODIFY: `templates/CLAUDE-global.md` — document token-aware orchestration
332
+ - MODIFY: `templates/CLAUDE-project.md` — optional `Daily Token Budget` field
333
+ - NEW: `bin/token-budget.js` — budget estimation, tracking, and threshold logic (Node.js built-ins only)
334
+
335
+ **Success criteria**:
336
+ - [ ] Orchestrator estimates token cost before each subagent spawn
337
+ - [ ] Cumulative session usage is tracked and displayed at each phase boundary
338
+ - [ ] Degradation actions trigger at 60%, 70%, 85%, and 95% thresholds
339
+ - [ ] Non-critical Sonnet tasks are demoted to Haiku when budget is constrained
340
+ - [ ] Milestone pre-flight check warns when estimated cost exceeds available budget
341
+ - [ ] Progress is always saved before a hard stop — no lost work
342
+ - [ ] Default behavior (no budget concern) is unchanged — thresholds only fire when budget tracking detects pressure
343
+
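The pre-spawn estimate and the milestone pre-flight check can be sketched with the tier ratios given in the mechanism (one Opus call ≈ 5 Sonnet calls ≈ 25 Haiku calls). This is an illustration under stated assumptions: the task shape, `tokensPerHaikuCall` (an average that would come from `token-log.md` history in practice), and the function names are all hypothetical.

```javascript
// Relative cost weights from the text: 1 Opus ≈ 5 Sonnet ≈ 25 Haiku calls.
const TIER_WEIGHT = { haiku: 1, sonnet: 5, opus: 25 };

// Estimate total tokens for a list of tasks, each with a `model` tier.
// `tokensPerHaikuCall` is an assumed historical average, not a real constant.
function estimateTokens(tasks, tokensPerHaikuCall) {
  return tasks.reduce((sum, task) => {
    const weight = TIER_WEIGHT[task.model];
    if (weight === undefined) throw new Error(`Unknown model tier: ${task.model}`);
    return sum + weight * tokensPerHaikuCall;
  }, 0);
}

// Compare the estimate for the remaining work against what is left today.
function preflight(tasks, tokensPerHaikuCall, dailyBudget, tokensUsed) {
  const estimated = estimateTokens(tasks, tokensPerHaikuCall);
  const available = Math.max(0, dailyBudget - tokensUsed);
  return {
    taskCount: tasks.length,
    estimatedPctOfDaily: Math.round((estimated / dailyBudget) * 100),
    exceedsAvailable: estimated > available,
  };
}
```

When `exceedsAvailable` is true, the orchestrator would show the "Proceed or split across sessions?" prompt, filling `{N}` from `taskCount` and `{X}` from `estimatedPctOfDaily`.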
344
+ **Acceptance test**: Start a wave with a 4-domain milestone. After 3 domains complete (simulating just over 70% budget consumed), verify the orchestrator displays a budget warning, reduces iteration budgets, and demotes non-critical tasks to Haiku. Then simulate >95% consumption and verify all progress is saved and the user sees a clear "resume" instruction.
345
+
346
+ ---
347
+
348
+ ### 3.8 Refined Model Tier Assignments (IMMEDIATE — Pre-Milestone)
349
+
350
+ **Problem**: The current model assignments have QA running on Haiku. The analysis (see `docs/harness-design-analysis.md`) identifies this as the single largest source of quality gaps — QA on Haiku produces superficial evaluations that Red Team consistently catches.
351
+
352
+ **Solution**: Promote QA from Haiku to Sonnet across all command files. Narrow Haiku's scope to strictly mechanical (zero-judgment) tasks.
353
+
354
+ **Mechanism**:
355
+ This is a search-and-replace operation in existing command files, not a new system:
356
+
357
+ | Role | Current Model | New Model | Rationale |
358
+ |------|-------------|-----------|-----------|
359
+ | Task execution | Sonnet | Sonnet | No change |
360
+ | QA evaluation | Haiku | **Sonnet** | Biggest quality-per-token improvement |
361
+ | Red Team | Sonnet | **Opus** | Adversarial reasoning benefits most from top-tier |
362
+ | Test running (count pass/fail) | Haiku | Haiku | Mechanical — no judgment needed |
363
+ | File existence checks | Haiku | Haiku | Mechanical |
364
+ | Branch guards | Haiku | Haiku | Mechanical |
365
+ | Orchestration | Opus | Opus | No change |
366
+
367
+ **Token cost impact**: QA calls increase ~3-5x per call (Haiku → Sonnet), but QA calls are small relative to execute calls. Red Team increases ~3-5x per call (Sonnet → Opus), but there's only 1 Red Team call per milestone. Net impact: ~10-15% more tokens per milestone — well within the daily budget with the token-aware orchestration (3.7) managing the ceiling.
368
+
369
+ **Files affected**:
370
+ - MODIFY: `commands/gsd-t-execute.md` — QA model: haiku → sonnet, Red Team model annotation
371
+ - MODIFY: `commands/gsd-t-quick.md` — QA model annotation
372
+ - MODIFY: `commands/gsd-t-integrate.md` — QA model annotation
373
+ - MODIFY: `templates/CLAUDE-global.md` — model assignment table
374
+
375
+ **This can be done immediately as a standalone change**, before M31-M33. No new infrastructure required.
376
+
377
+ ---
378
+
379
+ ## 4. Milestone Plan
380
+
381
+ ### Pre-Milestone: Refined Model Tiers (v2.51.11)
382
+
383
+ **Scope**: Enhancement 3.8 (Refined Model Tier Assignments)
384
+
385
+ **Rationale**: This is a search-and-replace operation that addresses the #1 quality gap (QA on Haiku) with zero new infrastructure. It should be done immediately, before M31, because the QA calibration system (3.2) will produce better baseline data when QA is already running on Sonnet.
386
+
387
+ **Estimated effort**: 1-2 hours (direct edits to 3 command files + 1 template)
388
+ **Predecessor**: None — standalone change
389
+
390
+ ---
391
+
392
+ ### M31: Self-Calibrating QA (Tier 1) — v2.52.10
393
+
394
+ **Scope**: Enhancements 3.1 (Harness Audit) + 3.2 (QA Calibration Feedback Loop) + 3.7 (Token-Aware Orchestration)
395
+
396
+ **Rationale**: These three enhancements are complementary — all three measure and manage GSD-T's resource effectiveness. The harness audit measures component-level ROI; the QA calibration measures QA-specific detection quality; the token-aware orchestrator ensures the framework can complete milestones within daily token limits. Together they establish the "self-awareness" foundation that all other enhancements benefit from.
397
+
398
+ **Estimated domains**: 4-5
399
+ - `harness-audit` — component registry, audit command, shadow mode, cost/benefit ledger
400
+ - `qa-calibrator` — miss-rate tracking, category aggregation, weak-spot detection, dynamic injection
401
+ - `token-orchestrator` — budget estimation, tracking, graduated degradation, pre-flight checks
402
+ - `command-integration` — wire audit into complete-milestone, wire calibrator into execute/quick/integrate, wire budget checks into wave/execute
403
+ - `telemetry-extension` — extend metrics-rollup with QA miss rates and component impact
404
+
405
+ **Estimated tasks**: 14-18
406
+ **Predecessor**: M30 (Stack Rules Engine — for component registry baseline), Pre-Milestone model tier refinement
407
+
408
+ ### M32: Quality Culture & Design (Tier 2) — v2.53.10
409
+
410
+ **Scope**: Enhancements 3.3 (Quality North Star) + 3.4 (Design Brief) + 3.5 (Evaluator Interactivity)
411
+
412
+ **Rationale**: These three enhancements share a theme: raising the quality ceiling through non-procedural means. Quality persona raises the baseline aspiration. Design brief ensures aesthetic coherence. Evaluator interactivity finds bugs that procedural checks miss. Grouping them ensures they are designed to work together — the quality persona influences how the design brief is generated, and evaluator interactivity tests against both.
413
+
414
+ **Estimated domains**: 3-4
415
+ - `quality-persona` — CLAUDE.md section, init/setup integration, prompt injection
416
+ - `design-brief` — detection, generation, contract storage, injection for UI tasks
417
+ - `evaluator-interactivity` — MCP detection, exploratory testing prompts, finding classification
418
+ - `command-integration` — wire all three into execute/quick/debug/integrate/wave
419
+
420
+ **Estimated tasks**: 10-12
421
+ **Predecessor**: M31 (QA calibration must exist for exploratory findings to feed into)
422
+
423
+ ### M33: Adaptive Iteration (Tier 3) — v2.54.10
424
+
425
+ **Scope**: Enhancement 3.6 (Configurable Iteration Budget)
426
+
427
+ **Rationale**: This enhancement depends on telemetry data from M31 (QA miss rates, component impact) and M32 (exploratory findings) to make intelligent budget decisions. It also modifies the most command files (7), so it should be last to minimize merge conflicts with M31/M32 changes.
428
+
429
+ **Estimated domains**: 2-3
430
+ - `complexity-scoring` — plan-time complexity analysis, budget assignment, defaults
431
+ - `budget-execution` — budget-aware fix attempts in all 7 execution commands
432
+ - `telemetry-extension` — iteration count tracking in task-metrics.jsonl
433
+
434
+ **Estimated tasks**: 8-10
435
+ **Predecessor**: M32
436
+
437
+ ---
438
+
439
+ ## 5. Impact Analysis
440
+
441
+ ### Existing commands modified
442
+
443
+ | Command | M31 | M32 | M33 | Changes |
444
+ |----------------------|-----|-----|-----|----------------------------------------------------------------|
445
+ | `gsd-t-execute` | X | X | X | QA calibration injection, persona injection, design brief, budget |
446
+ | `gsd-t-quick` | X | X | X | Same as execute (inline variants) |
447
+ | `gsd-t-integrate` | X | X | | QA calibration injection, persona, design brief |
448
+ | `gsd-t-debug` | | X | X | Persona injection, budget |
449
+ | `gsd-t-wave` | | X | X | Persona injection, budget |
450
+ | `gsd-t-plan` | | | X | Complexity scoring, budget assignment |
451
+ | `gsd-t-partition` | | X | | Design brief detection and generation |
452
+ | `gsd-t-init` | | X | | Quality persona selection |
453
+ | `gsd-t-setup` | | X | | Quality persona + design brief configuration |
454
+ | `gsd-t-test-sync` | | | X | Budget-aware fix attempts |
455
+ | `gsd-t-verify` | | | X | Budget-aware fix attempts |
456
+ | `gsd-t-complete-milestone` | X | | | Component impact evaluation in distillation |
457
+ | `gsd-t-status` | X | | | Show flagged components + QA miss rate summary |
458
+ | `gsd-t-help` | X | | | New audit command entry |
459
+
460
+ ### New artifacts
461
+
462
+ | Artifact | Milestone | Purpose |
463
+ |----------------------------------------|-----------|--------------------------------------------|
464
+ | `commands/gsd-t-audit.md` | M31 | Harness audit command |
465
+ | `bin/qa-calibrator.js` | M31 | QA miss-rate aggregation + weak-spot detection |
466
+ | `bin/token-budget.js` | M31 | Token budget estimation, tracking, thresholds |
467
+ | `.gsd-t/component-registry.jsonl` | M31 | Component inventory with cost tracking |
468
+ | `.gsd-t/metrics/qa-miss-log.jsonl` | M31 | Red Team findings QA missed |
469
+ | `.gsd-t/metrics/component-impact.jsonl`| M31 | Per-component cost/benefit ledger |
470
+ | `.gsd-t/contracts/design-brief.md` | M32 | Design language for UI-heavy projects |
471
+
472
+ ### Backward compatibility
473
+
474
+ All enhancements are **purely additive**:
475
+ - Projects without a quality persona skip injection silently
476
+ - Projects without UI domains skip design brief entirely
477
+ - Projects without Playwright MCP skip exploratory testing
478
+ - Projects without iteration budget settings use current 2-attempt default
479
+ - The audit command is opt-in — it never runs automatically
480
+ - QA calibration activates only when Red Team data exists (Red Team was added in v2.51.10)
481
+
482
+ No existing behavior changes unless the user explicitly enables the new features.
483
+
484
+ ### Zero-dependency constraint
485
+
486
+ All new code (`bin/qa-calibrator.js`, `bin/token-budget.js`, component registry logic) uses Node.js built-ins only. No external npm dependencies. This is non-negotiable per TECH-001.
487
+
488
+ ---
489
+
490
+ ## 6. Risk Assessment
491
+
492
+ | Risk | Likelihood | Impact | Mitigation |
493
+ |-----------------------------------------------|------------|--------|------------------------------------------------------------------|
494
+ | Harness audit doubles execution time | Medium | High | Audit is opt-in, never automatic. Budget per audit session. |
495
+ | QA calibration creates feedback oscillation | Low | Medium | Damping: changes only after 3+ milestones of consistent signal. |
496
+ | Quality persona is ignored by subagent | Medium | Low | Minimal cost (2-3 lines of prompt). Measure via A/B in audit. |
497
+ | Design brief is too prescriptive | Low | Medium | Brief sets direction, not pixel specs. Execution agents adapt. |
498
+ | Playwright MCP not widely available | Medium | Low | Graceful degradation — feature skips if MCP absent. |
499
+ | Higher iteration budgets waste context tokens | Low | Medium | Budget governance (M26) caps aggregate rework. Telemetry tracks. |
500
+ | Command file sizes grow beyond readability | Medium | High | Each injection is max 5-10 lines. Total overhead auditable via 3.1. |
501
+ | **Token exhaustion mid-milestone** | **High** | **High** | **Token-aware orchestration (3.7) with graduated degradation. Progress always checkpointed before hard stop.** |
502
+ | **QA promotion to Sonnet exceeds token budget** | Low | Medium | QA calls are small relative to execute. Net impact ~10-15% more tokens. Token orchestrator manages ceiling. |
503
+ | **Budget estimation inaccuracy** | Medium | Medium | Estimates improve over time using historical data from token-log.md. Conservative defaults (overestimate). |
504
+
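The damping mitigation for feedback oscillation ("changes only after 3+ milestones of consistent signal") could look roughly like the check below. It is a sketch under assumptions: "consistent" is taken to mean that the most recent per-milestone miss rates all sit on the same side of a threshold, and `hasConsistentSignal` is a hypothetical name.

```javascript
// Hypothetical damping check: act on a category only when the most recent
// `minMilestones` miss rates are all above, or all below, the threshold.
// A value exactly equal to the threshold counts as neither.
function hasConsistentSignal(missRates, threshold, minMilestones = 3) {
  if (missRates.length < minMilestones) return false;
  const recent = missRates.slice(-minMilestones);
  return recent.every((r) => r > threshold) || recent.every((r) => r < threshold);
}
```

A single noisy milestone then cannot flip the calibrator's behavior, because one out-of-pattern value resets the run.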
505
+ ---
506
+
507
+ ## 7. Success Metrics
508
+
509
+ | Metric | Baseline (current) | Target (post-M33) |
510
+ |--------------------------------------|-----------------------|----------------------------|
511
+ | QA-to-Red-Team miss rate | Not tracked | < 15% per category |
512
+ | Component ROI visibility | None | 100% of components tracked |
513
+ | Subagent prompt quality signal | Procedural only | Persona + procedural |
514
+ | UI aesthetic consistency (manual) | Per-task independent | Brief-governed coherence |
515
+ | Exploratory bugs found | 0 (no capability) | > 0 per UI milestone |
516
+ | Iteration convergence (complex tasks)| Fixed 2 attempts | Adaptive 2-8 attempts |
517
+ | Framework bloat detection | None | Components flagged at 3+ milestone negative ROI |
518
+ | **Token exhaustion incidents** | **Not tracked** | **Zero — milestones always complete or checkpoint gracefully** |
519
+ | **Milestones per daily budget** | **~1-2 (unmanaged)** | **2-3 with budget-aware orchestration** |
520
+ | **QA model effectiveness** | **Haiku (baseline)** | **Sonnet (post-3.8 promotion)** |
521
+
522
+ ---
523
+
524
+ ## 8. Out of Scope
525
+
526
+ - **Automated component removal** — Flagging is automated; actual removal requires user approval (Destructive Action Guard)
527
+ - **Visual regression testing** — Design brief ensures consistency via prompt engineering, not pixel-diff tooling
528
+ - **Custom LLM model selection** — Quality persona works within the refined haiku/sonnet/opus model assignments. Fully dynamic model selection per-task is a future enhancement beyond 3.8's static tier refinement
529
+ - **Cross-project QA calibration** — QA calibration is per-project in this PRD. Cross-project propagation follows the M27 pattern if warranted later
530
+ - **Real-time quality dashboard** — The existing dashboard (M15) displays events; adding QA calibration visualizations is a future enhancement
531
+ - **AI-generated design systems** — The design brief is a coordination artifact, not a Figma export or component library generator
532
+
533
+ ---
534
+
535
+ ## 9. Dependencies
536
+
537
+ | Dependency | Type | Status | Required By |
538
+ |--------------------------------|----------|-------------|-------------|
539
+ | Rule Engine (M26) | Internal | COMPLETE | M31 (audit integration with patch lifecycle) |
540
+ | Patch Lifecycle (M26) | Internal | COMPLETE | M31 (component deprecation as patches) |
541
+ | Red Team (v2.51.10) | Internal | COMPLETE | M31 (miss-rate source data) |
542
+ | Metrics Rollup (M25) | Internal | COMPLETE | M31 (ELO integration for QA miss rates) |
543
+ | Headless Debug Loop (M29) | Internal | COMPLETE | M33 (budget exhaustion triggers headless) |
544
+ | Stack Rules Engine (M30) | Internal | COMPLETE | M32 (persona injected before stack rules) |
545
+ | Playwright MCP | External | Optional | M32 (evaluator interactivity — graceful skip if absent) |
546
+ | Claude Code Agent tool | External | Available | All (subagent spawning) |
547
+
548
+ ---
549
+
550
+ ## 10. Implementation Notes
551
+
552
+ ### Token budget awareness
553
+
554
+ There are two distinct token budget concerns:
555
+
556
+ **A. Per-subagent prompt overhead** — how many tokens each enhancement adds to individual subagent prompts:
557
+ - Quality persona: 2-3 lines (~50 tokens)
558
+ - QA weak-spot injection: 3-5 lines (~100 tokens) — only when weak spots exist
559
+ - Design brief: 15-30 lines (~300 tokens) — only for UI tasks
560
+ - Exploratory testing instructions: 8-10 lines (~150 tokens) — only when MCP available
561
+ - Iteration budget: 1 line (~20 tokens) — always injected
562
+
563
+ **Maximum additional overhead per subagent**: ~620 tokens (UI task with all features active). This is well within the per-agent context budget and measurable via the harness audit (3.1).
564
+
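The ~620-token figure is the sum of the per-feature estimates in the list above. A small sketch of that arithmetic, with hypothetical key and function names:

```javascript
// Approximate per-feature prompt overhead in tokens, taken from the list above.
const OVERHEAD_TOKENS = {
  qualityPersona: 50,
  qaWeakSpots: 100, // only when weak spots exist
  designBrief: 300, // only for UI tasks
  exploratoryTesting: 150, // only when MCP is available
  iterationBudget: 20, // always injected
};

// Sum the overhead for whichever features are active for a given subagent.
function promptOverhead(activeFeatures) {
  return activeFeatures.reduce((sum, name) => {
    if (!(name in OVERHEAD_TOKENS)) throw new Error(`Unknown feature: ${name}`);
    return sum + OVERHEAD_TOKENS[name];
  }, 0);
}

// Worst case: a UI task with every feature active.
const worstCase = promptOverhead(Object.keys(OVERHEAD_TOKENS)); // 620
```

A non-UI task without MCP and without weak spots would carry only the persona and the budget line, about 70 tokens.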
565
+ **B. Session-level token consumption** — how all enhancements affect daily/weekly token limits on the $200 Max plan:
566
+ - QA model promotion (Haiku → Sonnet): +10-15% tokens per milestone
567
+ - Red Team model promotion (Sonnet → Opus): +3-5% tokens per milestone (only 1 call)
568
+ - Exploratory testing: +5-10% tokens per milestone (when MCP available)
569
+ - Higher iteration budgets: variable, capped by token-aware orchestrator (3.7)
570
+ - Harness audit: opt-in only, not counted in normal milestone budgets
571
+
572
+ **Maximum additional session cost**: ~25-30% more tokens per milestone vs. current. The token-aware orchestrator (3.7) ensures this stays within daily limits through graduated degradation.
573
+
574
+ ### Command file discipline
575
+
576
+ Each command file modification is a targeted injection (5-15 lines), not a restructure. The existing step numbering and flow are preserved. New injection points follow the established pattern: read a state file, conditionally inject content into the subagent prompt.
577
+
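The injection pattern described here (read a state file, conditionally inject content into the subagent prompt) is expressed in the command files as Markdown instructions rather than code. As an illustration of the logic only, a Node.js equivalent might look like this; `injectIfPresent` and the heading format are hypothetical.

```javascript
const fs = require("node:fs");

// Hypothetical helper: return a prompt fragment built from a state file,
// or an empty string when the file is missing or empty (silent skip).
function injectIfPresent(filePath, heading) {
  let content;
  try {
    content = fs.readFileSync(filePath, "utf8").trim();
  } catch (err) {
    if (err.code === "ENOENT") return ""; // no state file: skip silently
    throw err; // any other I/O error is a real problem
  }
  return content ? `## ${heading}\n${content}\n` : "";
}
```

Returning an empty string for a missing file is what keeps projects without the new state files on their existing behavior.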
578
+ ### Testing strategy
579
+
580
+ - `bin/qa-calibrator.js` — unit tests in `test/qa-calibrator.test.js` (JSONL parsing, miss-rate math, weak-spot detection)
581
+ - Component registry — unit tests in `test/component-registry.test.js` (CRUD, cost calculation, flagging logic)
582
+ - Integration — manual CLI testing of `gsd-t-audit` command, verified via existing `gsd-t-verify` gates
583
+ - No new external dependencies for testing (stays with Node.js built-in `node --test`)
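As a rough idea of what the JSONL parsing and miss-rate math under test might involve, here is a sketch using built-ins only. The record shape (`category`, `missed`) and the definition of miss rate (findings QA missed divided by all findings in a category) are assumptions; the actual `qa-miss-log.jsonl` schema is not shown in this document.

```javascript
// Hypothetical record shape for one JSONL line: { "category": string, "missed": boolean }.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // tolerate blank lines and a trailing newline
    .map((line) => JSON.parse(line));
}

// Assumed definition: miss rate = findings QA missed / all findings, per category.
function missRateByCategory(records) {
  const totals = new Map();
  for (const { category, missed } of records) {
    const entry = totals.get(category) ?? { missed: 0, total: 0 };
    entry.total += 1;
    if (missed) entry.missed += 1;
    totals.set(category, entry);
  }
  const rates = {};
  for (const [category, { missed, total }] of totals) {
    rates[category] = missed / total;
  }
  return rates;
}
```

Functions of this shape are pure, so a `node --test` suite can cover them with inline fixture strings and no temporary files.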
@@ -68,6 +68,9 @@
68
68
  | REQ-057 | Stack Rule Templates — best practice rule files in `templates/stacks/` for React, TypeScript, and Node.js API. Each file follows a standard structure (mandatory framing, numbered sections, GOOD/BAD examples, verification checklist) and stays under 200 lines. Universal templates (`_` prefix) always injected; stack-specific templates injected when detected. | P1 | complete | M30: templates/stacks/ (4 files: _security.md, react.md, typescript.md, node-api.md) |
69
69
  | REQ-058 | Stack Detection Engine — auto-detect project tech stack from manifest files (package.json, requirements.txt, go.mod, Cargo.toml) at subagent spawn time. Match detected stacks against available templates. Inject matched rules into subagent prompts with mandatory enforcement framing. Resilient: skip silently if no templates exist or no matches found. | P1 | complete | M30: 5 command files (execute, quick, integrate, wave, debug) |
70
70
  | REQ-059 | Stack Rule QA Enforcement — QA subagent prompts include stack rule compliance validation. Stack rule violations have the same severity as contract violations — they fail the task, not warn. Report format includes "Stack rules: compliant/N violations". | P1 | complete | M30: execute QA prompt + all 5 commands |
71
+ | REQ-060 | Quality North Star Persona — project CLAUDE.md can define a `## Quality North Star` section (1-3 sentences) with a project quality identity. gsd-t-init auto-detects preset (library/web-app/cli) or prompts user. gsd-t-setup offers persona config for existing projects. Persona is injected at subagent spawn time; skips silently if section absent (backward compatible). | P2 | complete | M32: templates/CLAUDE-project.md, gsd-t-init.md, gsd-t-setup.md |
72
+ | REQ-061 | Design Brief Generation — during partition, if UI/frontend signals detected (React/Vue/Svelte/Flutter, CSS/SCSS, component files, or Tailwind config), generate `.gsd-t/contracts/design-brief.md` with color palette, typography, spacing, component patterns, layout principles, interaction patterns, and tone/voice. Skip for non-UI projects. Do not overwrite existing briefs. Referenced in plan for UI task descriptions. | P2 | complete | M32: gsd-t-partition.md, gsd-t-plan.md, gsd-t-setup.md |
73
+ | REQ-062 | Exploratory Testing Blocks — after scripted tests pass, if Playwright MCP is registered, QA agents get 3 minutes and Red Team gets 5 minutes of interactive exploration using Playwright MCP. All findings tagged [EXPLORATORY] in qa-issues.md and red-team-report.md. Feeds into M31 QA calibration as separate category. Silent skip when Playwright MCP absent. Injected into execute, quick, integrate, debug. | P2 | complete | M32: gsd-t-execute.md, gsd-t-quick.md, gsd-t-integrate.md, gsd-t-debug.md |
71
74
 
72
75
  ## Technical Requirements
73
76
 
@@ -196,6 +199,17 @@
196
199
  **Orphaned requirements**: None — all M30 REQs mapped to tasks.
197
200
  **Unanchored tasks**: command-integration Task 3 (tests) is QA infrastructure supporting REQ-057 through REQ-059. command-integration Task 4 (reference docs) supports Pre-Commit Gate compliance.
198
201
 
202
+ ## Requirements Traceability (updated by plan phase — M32)
203
+
204
+ | REQ-ID | Requirement Summary | Domain | Task(s) | Status |
205
+ |---------|--------------------------------------------------------------|-------------------------|---------|---------|
206
+ | REQ-060 | Quality North Star Persona — CLAUDE-project template + init/setup detection and config | quality-persona | Task 1 | complete |
207
+ | REQ-061 | Design Brief Generation — partition detection + plan note + setup option | design-brief | Task 1 | complete |
208
+ | REQ-062 | Exploratory Testing Blocks — post-scripted Playwright MCP exploration in 4 commands | evaluator-interactivity | Task 1 | complete |
209
+
210
+ **Orphaned requirements**: None — all M32 REQs mapped to tasks.
211
+ **Unanchored tasks**: None — all 3 domain tasks map directly to functional requirements.
212
+
199
213
  ---
200
214
 
201
215
  ## M17: Scan Visual Output — Feature Specification