@tekyzinc/gsd-t 2.50.12 → 2.53.10
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +24 -0
- package/README.md +379 -372
- package/bin/component-registry.js +250 -0
- package/bin/graph-cgc.js +510 -510
- package/bin/graph-indexer.js +147 -147
- package/bin/graph-overlay.js +195 -195
- package/bin/graph-parsers.js +327 -327
- package/bin/graph-query.js +453 -452
- package/bin/graph-store.js +154 -154
- package/bin/qa-calibrator.js +194 -0
- package/bin/scan-data-collector.js +153 -153
- package/bin/scan-diagrams-generators.js +187 -187
- package/bin/scan-diagrams.js +79 -79
- package/bin/scan-renderer.js +92 -92
- package/bin/scan-report-sections.js +121 -121
- package/bin/scan-report.js +184 -184
- package/bin/scan-schema-parsers.js +199 -199
- package/bin/scan-schema.js +103 -103
- package/bin/token-budget.js +246 -0
- package/commands/Claude-md.md +10 -10
- package/commands/branch.md +15 -15
- package/commands/checkin.md +45 -45
- package/commands/global-change.md +209 -209
- package/commands/gsd-t-audit.md +199 -0
- package/commands/gsd-t-backlog-add.md +94 -94
- package/commands/gsd-t-backlog-edit.md +111 -111
- package/commands/gsd-t-backlog-list.md +63 -63
- package/commands/gsd-t-backlog-move.md +94 -94
- package/commands/gsd-t-backlog-promote.md +123 -123
- package/commands/gsd-t-backlog-remove.md +86 -86
- package/commands/gsd-t-backlog-settings.md +158 -158
- package/commands/gsd-t-complete-milestone.md +528 -515
- package/commands/gsd-t-debug.md +506 -399
- package/commands/gsd-t-discuss.md +174 -174
- package/commands/gsd-t-execute.md +758 -634
- package/commands/gsd-t-feature.md +276 -276
- package/commands/gsd-t-health.md +142 -142
- package/commands/gsd-t-help.md +465 -457
- package/commands/gsd-t-impact.md +302 -302
- package/commands/gsd-t-init.md +320 -280
- package/commands/gsd-t-integrate.md +365 -249
- package/commands/gsd-t-milestone.md +87 -87
- package/commands/gsd-t-partition.md +442 -361
- package/commands/gsd-t-pause.md +82 -82
- package/commands/gsd-t-plan.md +345 -344
- package/commands/gsd-t-populate.md +111 -111
- package/commands/gsd-t-prd.md +326 -326
- package/commands/gsd-t-project.md +211 -211
- package/commands/gsd-t-promote-debt.md +123 -123
- package/commands/gsd-t-prompt.md +137 -137
- package/commands/gsd-t-qa.md +266 -266
- package/commands/gsd-t-quick.md +357 -234
- package/commands/gsd-t-reflect.md +134 -134
- package/commands/gsd-t-resume.md +72 -72
- package/commands/gsd-t-scan.md +615 -615
- package/commands/gsd-t-setup.md +76 -0
- package/commands/gsd-t-status.md +192 -166
- package/commands/gsd-t-test-sync.md +381 -381
- package/commands/gsd-t-triage-and-merge.md +171 -171
- package/commands/gsd-t-verify.md +382 -382
- package/commands/gsd-t-visualize.md +118 -118
- package/commands/gsd-t-wave.md +401 -378
- package/docs/GSD-T-README.md +425 -422
- package/docs/architecture.md +385 -369
- package/docs/harness-design-analysis.md +371 -0
- package/docs/infrastructure.md +205 -205
- package/docs/prd-graph-engine.md +398 -398
- package/docs/prd-gsd2-hybrid.md +559 -559
- package/docs/prd-harness-evolution.md +583 -0
- package/docs/requirements.md +14 -0
- package/docs/workflows.md +226 -226
- package/examples/.gsd-t/domains/example-domain/scope.md +13 -13
- package/package.json +40 -40
- package/scripts/gsd-t-auto-route.js +39 -39
- package/scripts/gsd-t-dashboard-mockup.html +1143 -1143
- package/scripts/gsd-t-dashboard-server.js +171 -171
- package/scripts/gsd-t-dashboard.html +262 -262
- package/scripts/gsd-t-event-writer.js +128 -128
- package/scripts/gsd-t-statusline.js +94 -94
- package/scripts/gsd-t-tools.js +175 -175
- package/templates/CLAUDE-global.md +639 -614
- package/templates/CLAUDE-project.md +24 -0
- package/templates/backlog-settings.md +18 -18
- package/templates/backlog.md +1 -1
- package/templates/progress.md +40 -40
- package/templates/shared-services-contract.md +60 -60
- package/templates/stacks/desktop.ini +2 -2
- package/bin/desktop.ini +0 -2
- package/commands/desktop.ini +0 -2
- package/docs/ci-examples/desktop.ini +0 -2
- package/docs/desktop.ini +0 -2
- package/examples/.gsd-t/contracts/desktop.ini +0 -2
- package/examples/.gsd-t/desktop.ini +0 -2
- package/examples/.gsd-t/domains/desktop.ini +0 -2
- package/examples/.gsd-t/domains/example-domain/desktop.ini +0 -2
- package/examples/desktop.ini +0 -2
- package/examples/rules/desktop.ini +0 -2
- package/scripts/desktop.ini +0 -2
- package/templates/desktop.ini +0 -2
@@ -0,0 +1,583 @@

# PRD: Harness Evolution — Self-Calibrating Quality Infrastructure

## Document Info

| Field | Value |
|-------------------|-----------------------------------------------------------------------|
| **PRD ID** | PRD-HARNESS-001 |
| **Date** | 2026-04-01 |
| **Author** | GSD-T Team |
| **Status** | DRAFT |
| **Milestones** | M31 (Tier 1), M32 (Tier 2), M33 (Tier 3) |
| **Version Target**| 2.52.10 (M31), 2.53.10 (M32), 2.54.10 (M33) |
| **Priority** | P0 — framework self-improvement and quality convergence |
| **Predecessor** | M30 (Stack Rules Engine), M26 (Rule Engine + Patch Lifecycle), M29 (Debug Loop) |
| **Successor** | Production quality parity with human-curated codebases |
| **Related** | PRD-GSD2-001 (M22-M24), PRD-GRAPH-001 (M20-M21) |

---

## Revision History

| Date | Version | Changes |
|------------|---------|------------------------|
| 2026-04-01 | v1 | Initial DRAFT — 6 enhancements across 3 tiers |
| 2026-04-01 | v2 | Added token-constraint analysis: enhancement 3.7 (token-aware orchestration), 3.8 (model tier refinement), updated risk/success metrics |

---

## 1. Problem Statement

GSD-T's quality infrastructure has grown significantly through M25-M30: telemetry collection, declarative rule engine, patch lifecycle, Red Team adversarial QA, stack rules, and compaction-proof debug loops. These components are individually effective, but six structural gaps prevent them from reaching their full potential:

1. **Framework bloat without pruning** — GSD-T only grows, never sheds. Every milestone adds new checks, rules, and enforcement mechanisms to subagent prompts. There is no mechanism to determine whether a component (Red Team, stack rules, observability logging) is still earning its keep. A component that added value at M26 may be pure overhead at M35, consuming precious context tokens without preventing real failures.

2. **Static QA prompts** — The QA subagent prompt is frozen at the text written during its creation. Red Team regularly finds bugs that QA missed, but this signal is never fed back to improve QA's detection capability. The rule engine (M26) and ELO system (M25) exist but are not connected to QA calibration. QA's miss rate is measurable but not actionable.

3. **Procedural prompts without quality vision** — Every subagent prompt is purely procedural: "do X, check Y, report Z." Research on long-running AI harnesses shows that injecting an aspirational quality statement (a "quality persona") shifts output quality more effectively than adding more procedural checks. A phrase like "museum-quality code" or "production-ready from line one" changes the agent's default quality threshold. GSD-T has no mechanism for this.

4. **Aesthetic drift in UI-heavy projects** — When GSD-T executes UI tasks, each subagent makes independent aesthetic decisions. Task 1 might pick rounded corners and soft shadows, Task 3 might use sharp borders and flat design. Contracts define functional interfaces but not visual language. There is no design brief artifact that flows through execution like contracts do.

5. **Scripted-only test evaluation** — QA and Red Team agents can only evaluate through scripted Playwright assertions. They cannot interactively explore the application the way a human tester would — clicking around, trying unexpected flows, observing visual glitches. The article's evaluator harness found issues through dynamic interaction that scripted tests missed entirely.

6. **Fixed iteration budget** — GSD-T hardcodes "2 fix attempts" before escalating to the headless debug loop. Research shows that 5-15 iterations drive convergence to significantly better quality for complex tasks. The current budget is one-size-fits-all: simple tasks get 2 attempts (enough), complex tasks get 2 attempts (not enough), and the headless debug loop is a heavyweight escalation that disrupts the normal execution flow.

**Root cause**: GSD-T was designed as a *methodology* that prescribes process. It now needs to become a *self-calibrating system* that measures its own component effectiveness, tunes its quality signals based on outcomes, and adapts its iteration depth to task complexity.

---

## 2. Objective

Evolve GSD-T from a static methodology framework into a self-calibrating quality system across 6 enhancements in 3 tiers.

**Primary goals** (in priority order):
1. **Self-awareness** — GSD-T can measure whether its own components add value, and disable ones that don't
2. **Closed-loop QA** — QA miss rates feed back into QA prompt tuning automatically
3. **Quality culture** — Subagent prompts carry a project-level quality aspiration, not just procedural rules
4. **Aesthetic coherence** — UI projects get a design brief that flows through execution like contracts
5. **Exploratory evaluation** — QA/Red Team can interact with running applications, not just run scripts
6. **Adaptive iteration** — Iteration budgets scale with task complexity, not a fixed constant
7. **Token-budget awareness** — On a token-limited plan ($200 Max), the orchestrator must manage session-level token consumption to prevent mid-milestone exhaustion

**Core principle**: Every quality mechanism must prove its value through measurable outcomes. Components that cannot demonstrate impact are candidates for removal, not preservation.

**Operational constraint**: GSD-T runs on Claude's $200 Max plan, where tokens are a hard daily/weekly ceiling — not a variable cost. Running out mid-milestone is a workflow-breaking event. All enhancements must be designed with token conservation as a first-class concern.

---

## 3. Enhancements

### 3.1 Harness Audit Capability (HIGH PRIORITY — Tier 1)

**Problem**: The framework accumulates enforcement mechanisms (Red Team, stack rules, observability logging, E2E enforcement, doc-ripple checks) but has no way to determine if they are still earning their context-token cost. Over time, the aggregate prompt overhead may exceed the value delivered.

**Solution**: A `gsd-t-audit` command and supporting infrastructure that stress-tests GSD-T's own components by selectively disabling them and comparing outcomes.

**Mechanism**:
1. **Component registry** — A structured file (`.gsd-t/component-registry.jsonl`) listing every enforcement mechanism: name, injection point (which command files), approximate token cost (lines of prompt text), date added, last measured impact.
2. **Audit mode** — `gsd-t-audit` runs a milestone's worth of tasks twice: once with all components active (control), once with a target component disabled (experiment). Compares: bugs caught, test pass rates, rework cycles, and time-to-completion.
3. **Shadow mode** — For components that can't be cleanly disabled (like the pre-commit gate), audit mode runs them in "shadow" — the check executes but its result is logged, not enforced. This measures whether it would have caught something without blocking execution.
4. **Cost/benefit ledger** — `.gsd-t/metrics/component-impact.jsonl` tracks per-component: token cost per invocation, bugs prevented (from QA/Red Team logs), false positives generated, context % consumed. Components with cost > benefit for 3+ milestones are flagged for deprecation review.
5. **Integration with rule engine** — Components flagged for deprecation become candidates in the patch lifecycle (M26). A "disable component X" patch follows the same candidate -> measured -> promoted -> graduated flow.
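
The deprecation rule in step 4 can be sketched as below. This is a minimal illustration, not the shipped implementation: the PRD does not define the ledger schema or how "benefit" is quantified, so the fields `component`, `milestone`, `tokenCost`, and `benefit` are hypothetical, and "3+ milestones" is read here as "the three most recent milestones all show cost above benefit".

```javascript
// Sketch only: the field names (component, milestone, tokenCost, benefit)
// are hypothetical; the PRD does not specify the ledger schema.

// Parse JSONL text into an array of records, skipping blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Turn a milestone id such as "M31" into the number 31 for ordering.
function milestoneNumber(id) {
  return Number(String(id).replace(/^M/i, ""));
}

// Return the names of components whose cost exceeded their benefit in each
// of their `streak` most recent milestones (assumed reading of "3+").
function flagForDeprecation(records, streak = 3) {
  const byComponent = new Map();
  for (const rec of records) {
    if (!byComponent.has(rec.component)) byComponent.set(rec.component, []);
    byComponent.get(rec.component).push(rec);
  }
  const flagged = [];
  for (const [name, recs] of byComponent) {
    recs.sort((a, b) => milestoneNumber(a.milestone) - milestoneNumber(b.milestone));
    const recent = recs.slice(-streak);
    if (recent.length === streak && recent.every((r) => r.tokenCost > r.benefit)) {
      flagged.push(name);
    }
  }
  return flagged.sort();
}

// Illustrative ledger contents (made-up numbers).
const ledger = [
  '{"component":"red-team","milestone":"M31","tokenCost":900,"benefit":2400}',
  '{"component":"red-team","milestone":"M32","tokenCost":900,"benefit":1800}',
  '{"component":"red-team","milestone":"M33","tokenCost":900,"benefit":2100}',
  '{"component":"doc-ripple","milestone":"M31","tokenCost":400,"benefit":100}',
  '{"component":"doc-ripple","milestone":"M32","tokenCost":400,"benefit":0}',
  '{"component":"doc-ripple","milestone":"M33","tokenCost":400,"benefit":50}',
  '{"component":"observability","milestone":"M32","tokenCost":300,"benefit":100}',
  '{"component":"observability","milestone":"M33","tokenCost":300,"benefit":200}',
].join("\n");

console.log(flagForDeprecation(parseJsonl(ledger))); // [ 'doc-ripple' ]
```

In this made-up data, `observability` is not flagged because it has only two measured milestones; a real implementation would also have to decide how to convert "bugs prevented" and "false positives" into a single benefit figure.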

**Files affected**:
- NEW: `commands/gsd-t-audit.md` — audit command (new command, count goes to 52)
- NEW: `.gsd-t/component-registry.jsonl` template
- MODIFY: `bin/gsd-t.js` — command count update
- MODIFY: `commands/gsd-t-complete-milestone.md` — component impact evaluation in distillation step
- MODIFY: `templates/CLAUDE-global.md`, `commands/gsd-t-help.md`, `README.md`, `GSD-T-README.md` — command reference updates

**Success criteria**:
- [ ] Component registry lists all enforcement mechanisms with token cost estimates
- [ ] `gsd-t-audit` can disable a named component and run comparison tasks
- [ ] Shadow mode logs enforcement results without blocking execution
- [ ] Cost/benefit ledger accumulates per-milestone impact data
- [ ] Components with 3+ milestones of negative ROI are flagged in `gsd-t-status`
- [ ] Flagged components enter the patch lifecycle as deprecation candidates

**Acceptance test**: Run `gsd-t-audit --component=red-team` on a small milestone. Verify the audit produces a comparison report showing bugs caught vs. context tokens consumed, and the result is persisted in `component-impact.jsonl`.

---

### 3.2 QA Calibration Feedback Loop (HIGH PRIORITY — Tier 1)

**Problem**: QA subagent prompts are static text. Red Team finds bugs that QA missed, but this signal is discarded. The rule engine (M26) and ELO system (M25) exist but are not connected to QA prompt tuning.

**Solution**: Wire Red Team miss-rate data back into QA prompt generation, creating a closed-loop calibration system.

**Mechanism**:
1. **Miss-rate tracking** — After Red Team completes, compare its findings against QA's report. Bugs found by Red Team but missed by QA are logged to `.gsd-t/metrics/qa-miss-log.jsonl` with category tags (contract violation, boundary input, state transition, error path, missing flow, regression, E2E gap).
2. **Category aggregation** — `bin/qa-calibrator.js` reads `qa-miss-log.jsonl` and computes miss rates per category across milestones. Categories with miss rates > 30% are "weak spots."
3. **Dynamic QA prompt injection** — During `gsd-t-execute` Step 2 (QA subagent spawn), the orchestrator calls `qa-calibrator.js` to get current weak spots. These are injected into the QA prompt as priority focus areas: "PRIORITY: Your historical miss rate for {category} is {N}%. Pay extra attention to: {specific patterns from miss log}."
4. **Calibration rules** — Weak spots that persist for 3+ milestones generate a rule engine candidate patch that adds a permanent check to the QA prompt template. Weak spots that drop below 10% miss rate for 2+ milestones have their priority injection removed.
5. **ELO integration** — QA miss rates factor into the process ELO calculation. A milestone with high QA miss rates gets a lower ELO delta, incentivizing the system toward better first-pass QA detection.
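
Steps 2 to 4 can be illustrated with a small, self-contained sketch. It is not the contents of `bin/qa-calibrator.js`: the record shape (`category`, `missedByQa`, `pattern`) is an assumption, as is the choice of denominator (all findings in a category, whether QA caught them or not), since the PRD only says that misses are logged. The thresholds (30%, 10%, 3 and 2 milestones) and the injection wording come from the text above.

```javascript
// Sketch only: record fields (category, missedByQa, pattern) are hypothetical.

// Aggregate findings into a miss rate per category.
function computeMissRates(findings) {
  const stats = new Map();
  for (const f of findings) {
    const s = stats.get(f.category) || { total: 0, missed: 0, patterns: [] };
    s.total += 1;
    if (f.missedByQa) {
      s.missed += 1;
      if (f.pattern) s.patterns.push(f.pattern);
    }
    stats.set(f.category, s);
  }
  return [...stats.entries()].map(([category, s]) => ({
    category,
    missRate: s.missed / s.total,
    patterns: s.patterns,
  }));
}

// Categories above the 30% threshold are "weak spots".
function weakSpots(rates, threshold = 0.3) {
  return rates.filter((r) => r.missRate > threshold);
}

// Render the priority line using the template quoted in step 3.
function priorityInjection(spot) {
  const pct = Math.round(spot.missRate * 100);
  return (
    `PRIORITY: Your historical miss rate for ${spot.category} is ${pct}%. ` +
    `Pay extra attention to: ${spot.patterns.join("; ")}.`
  );
}

// Step 4: decide what to do with one category given its per-milestone
// miss rates, oldest first.
function calibrationAction(history) {
  const last3 = history.slice(-3);
  const last2 = history.slice(-2);
  if (last3.length === 3 && last3.every((r) => r > 0.3)) return "propose-permanent-check";
  if (last2.length === 2 && last2.every((r) => r < 0.1)) return "remove-injection";
  return "keep";
}

// Illustrative data (made up): 2 of 4 boundary-input bugs were missed by QA.
const findings = [
  { category: "boundary input", missedByQa: true, pattern: "empty string in name field" },
  { category: "boundary input", missedByQa: true, pattern: "negative quantity" },
  { category: "boundary input", missedByQa: false },
  { category: "boundary input", missedByQa: false },
  { category: "contract violation", missedByQa: true, pattern: "missing status field" },
  { category: "contract violation", missedByQa: false },
  { category: "contract violation", missedByQa: false },
  { category: "contract violation", missedByQa: false },
  { category: "contract violation", missedByQa: false },
];

for (const spot of weakSpots(computeMissRates(findings))) {
  console.log(priorityInjection(spot));
}
```

With this data only `boundary input` (50%) crosses the threshold; `contract violation` sits at 20% and produces no injection.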

**Files affected**:
- NEW: `bin/qa-calibrator.js` — miss-rate aggregation and weak-spot detection
- MODIFY: `commands/gsd-t-execute.md` — inject weak spots into QA subagent prompt
- MODIFY: `commands/gsd-t-quick.md` — same injection for inline QA
- MODIFY: `commands/gsd-t-integrate.md` — same injection for integration QA
- MODIFY: `templates/CLAUDE-global.md` — document QA calibration in QA Agent section
- MODIFY: `bin/metrics-rollup.js` — incorporate QA miss rates into ELO calculation

**Success criteria**:
- [ ] Red Team findings not in QA report are logged to `qa-miss-log.jsonl` with category tags
- [ ] `qa-calibrator.js` computes per-category miss rates and identifies weak spots (>30%)
- [ ] QA subagent prompts include dynamic weak-spot injections during execute
- [ ] Weak spots persisting 3+ milestones generate rule engine candidate patches
- [ ] Weak spots dropping below 10% for 2+ milestones have injections removed
- [ ] QA miss rate is reflected in process ELO calculation

**Acceptance test**: After a Red Team run that finds 2 boundary-input bugs QA missed, verify `qa-miss-log.jsonl` contains the entries, `qa-calibrator.js` reports boundary-input as a weak spot, and the next QA subagent spawn includes the priority injection.

---

### 3.3 Quality North Star Injection (MEDIUM PRIORITY — Tier 2)

**Problem**: Every subagent prompt is procedural. Research shows that an aspirational quality statement ("quality persona") shifts output quality more effectively than adding procedural checks. GSD-T has no mechanism for project-level quality aspiration.

**Solution**: A configurable quality persona that gets prepended to every subagent prompt, set once during `gsd-t-init` or `gsd-t-setup` and stored in project CLAUDE.md.

**Mechanism**:
1. **Quality persona field** — A new section in the project CLAUDE.md template: `## Quality North Star`. Contains a 1-3 sentence aspirational quality statement. Examples: "Museum-quality code — every function reads like it was written for a textbook", "Production-ready from line one — no TODOs, no shortcuts, no 'fix later'", "Enterprise-grade reliability — every error path is handled, every edge case is tested."
2. **Default personas** — `gsd-t-init` offers 3 preset personas based on project type detection:
   - **Library/package**: "This code will be read by thousands of developers. Every public API must be self-documenting, every edge case must be handled, every error message must be actionable."
   - **Web application**: "Every user interaction must feel instant, every error must be recoverable, every page must be accessible. Ship quality a designer would screenshot."
   - **CLI tool**: "Every command must complete in under 2 seconds, every error must suggest a fix, every flag must have a help string. The --help output is the product."
   - **Custom**: User writes their own statement.
3. **Injection point** — The quality persona is prepended to every subagent prompt (execute, quick, debug, integrate, wave) immediately before the task-specific instructions. It sits above procedural checks, framing the agent's quality default before any rules are read.
4. **Stack rule integration** — Quality persona is injected before stack rules, so the aspirational statement colors how the agent interprets and applies the rules.
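
The injection order described in steps 3 and 4 (persona first, then stack rules, then task instructions, with a silent skip when no persona is set) can be sketched as follows. The function names and the way the section is located in CLAUDE.md are illustrative assumptions; the PRD only fixes the section heading `## Quality North Star` and the ordering.

```javascript
// Sketch only: extractQualityPersona and buildSubagentPrompt are hypothetical
// helper names, not functions the package is known to export.

// Return the text under "## Quality North Star", or null if absent or empty.
function extractQualityPersona(claudeMd) {
  const lines = claudeMd.split("\n");
  const start = lines.findIndex((l) => l.trim() === "## Quality North Star");
  if (start === -1) return null;
  const body = [];
  for (const line of lines.slice(start + 1)) {
    if (/^#{1,2}\s/.test(line)) break; // stop at the next h1/h2 heading
    body.push(line);
  }
  const text = body.join("\n").trim();
  return text.length > 0 ? text : null;
}

// Assemble the prompt: persona (if any), then stack rules, then the task.
function buildSubagentPrompt({ claudeMd, stackRules, taskInstructions }) {
  const persona = extractQualityPersona(claudeMd);
  const parts = [];
  if (persona) parts.push(persona);
  if (stackRules) parts.push(stackRules);
  parts.push(taskInstructions);
  return parts.join("\n\n");
}

const claudeMd = [
  "# Project",
  "",
  "## Quality North Star",
  "Museum-quality code.",
  "",
  "## Stack",
  "Node",
].join("\n");

console.log(
  buildSubagentPrompt({
    claudeMd,
    stackRules: "Use strict mode.",
    taskInstructions: "Implement the parser.",
  })
);
```

A project whose CLAUDE.md lacks the section gets the same prompt as before, which matches the backward-compatibility criterion listed below.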

**Files affected**:
- MODIFY: `templates/CLAUDE-project.md` — add Quality North Star section with placeholder
- MODIFY: `commands/gsd-t-init.md` — persona selection during init
- MODIFY: `commands/gsd-t-setup.md` — persona configuration
- MODIFY: `commands/gsd-t-execute.md` — inject persona into subagent prompt
- MODIFY: `commands/gsd-t-quick.md` — same injection
- MODIFY: `commands/gsd-t-debug.md` — same injection
- MODIFY: `commands/gsd-t-integrate.md` — same injection
- MODIFY: `commands/gsd-t-wave.md` — same injection
- MODIFY: `templates/CLAUDE-global.md` — document the feature

**Success criteria**:
- [ ] CLAUDE-project.md template includes Quality North Star section
- [ ] `gsd-t-init` prompts for or auto-selects a quality persona
- [ ] Quality persona is prepended to all subagent prompts (execute, quick, debug, integrate, wave)
- [ ] Persona injection occurs before stack rules and procedural checks
- [ ] Projects without a persona skip injection silently (backward compatible)

**Acceptance test**: Set persona to "Museum-quality code" in CLAUDE.md. Run `gsd-t-execute` on a task. Verify the subagent prompt starts with the persona statement before any procedural instructions.

---

### 3.4 Design Brief Artifact (MEDIUM PRIORITY — Tier 2)

**Problem**: UI-heavy projects suffer aesthetic drift when different execution subagents make independent visual decisions per task. Contracts define functional interfaces (component props, API shapes) but not visual language (colors, spacing, typography, interaction patterns).

**Solution**: A design brief artifact generated during `gsd-t-partition` or `gsd-t-plan` that flows into execute subagents alongside contracts.

**Mechanism**:
1. **Design brief detection** — During `gsd-t-partition`, if any domain contains UI/frontend tasks (detected by: React/Vue/Svelte/Flutter in stack, component files in scope, CSS/styling files), prompt generation of a design brief.
2. **Design brief structure** — `.gsd-t/contracts/design-brief.md`:
   - **Color palette**: Primary, secondary, accent, background, text colors with hex values
   - **Typography**: Font families, size scale, weight usage
   - **Spacing system**: Base unit, scale (4px, 8px, 12px, 16px, 24px, 32px, 48px)
   - **Component patterns**: Border radius, shadow levels, hover/active states, transition durations
   - **Layout principles**: Grid system, breakpoints, container widths
   - **Interaction patterns**: Loading states, error states, empty states, success feedback
   - **Tone**: Formal/casual, dense/spacious, minimal/rich
3. **Injection** — Design brief is injected into subagent prompts for UI-related tasks (same injection point as contracts). Non-UI tasks skip it.
4. **Sources** — If the project already has a design system (Tailwind config, theme file, Storybook), the brief is derived from existing sources. If none exist, the brief is generated from the quality persona + project type.
5. **Existing project support** — `gsd-t-setup` can generate a design brief for an existing project by analyzing current CSS/theme files.
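
Steps 1 and 3 amount to a detection predicate plus a conditional injection. The sketch below shows one way that could look; the framework list comes from the text, while the file-extension list, the `domain` and `task` object shapes, and both function names are assumptions made for illustration.

```javascript
// Sketch only: object shapes and the extension list are hypothetical.

const UI_FRAMEWORKS = ["react", "vue", "svelte", "flutter"]; // named in the PRD
const UI_EXTENSIONS = [".css", ".scss", ".jsx", ".tsx", ".vue", ".svelte"]; // assumed

// A domain counts as UI-heavy if its stack names a UI framework or its
// scope contains component or styling files.
function isUiDomain(domain) {
  const stackHit = (domain.stack || []).some((s) =>
    UI_FRAMEWORKS.includes(s.toLowerCase())
  );
  const fileHit = (domain.files || []).some((f) =>
    UI_EXTENSIONS.some((ext) => f.toLowerCase().endsWith(ext))
  );
  return stackHit || fileHit;
}

// Contracts are always injected; the design brief only for UI tasks.
function contextForTask(task, contracts, designBrief) {
  const sections = [contracts];
  if (task.ui && designBrief) sections.push(designBrief);
  return sections.join("\n\n");
}

const frontend = { stack: ["React"], files: ["src/App.tsx"] };
const backend = { stack: ["express"], files: ["src/server.js"] };

console.log(isUiDomain(frontend)); // true
console.log(isUiDomain(backend)); // false
console.log(contextForTask({ ui: false }, "CONTRACTS", "BRIEF")); // CONTRACTS
```

Keeping detection separate from injection means a project with no UI domains never generates a brief, so `designBrief` stays empty and non-UI projects pay no prompt overhead.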

**Files affected**:
- MODIFY: `commands/gsd-t-partition.md` — design brief detection and generation step
- MODIFY: `commands/gsd-t-plan.md` — reference design brief in UI task descriptions
- MODIFY: `commands/gsd-t-execute.md` — inject design brief for UI tasks
- MODIFY: `commands/gsd-t-quick.md` — same injection for quick UI tasks
- MODIFY: `commands/gsd-t-setup.md` — design brief generation for existing projects
- NEW: Template section in `templates/CLAUDE-project.md` referencing design brief convention

**Success criteria**:
- [ ] `gsd-t-partition` detects UI-heavy domains and triggers design brief generation
- [ ] Design brief is stored in `.gsd-t/contracts/design-brief.md`
- [ ] Design brief is injected into subagent prompts for UI-related tasks only
- [ ] Non-UI tasks do not receive design brief injection
- [ ] Existing Tailwind/theme configs are parsed to pre-populate the brief
- [ ] Projects without UI domains skip design brief entirely (no overhead)

**Acceptance test**: Partition a milestone with a React frontend domain. Verify `design-brief.md` is generated with color palette and component patterns. Execute a UI task and verify the subagent prompt includes the design brief. Execute a backend task and verify it does not.

---

### 3.5 Evaluator Interactivity (MEDIUM PRIORITY — Tier 2)

**Problem**: QA and Red Team agents evaluate through scripted Playwright assertions only. They cannot explore the running application dynamically — clicking unexpected paths, trying edge-case inputs, observing visual rendering. Scripted tests verify known requirements; exploratory testing finds unknown bugs.

**Solution**: Give QA and Red Team agents access to Playwright MCP for interactive browser testing beyond their scripted assertions.

**Mechanism**:
1. **MCP detection** — Before spawning QA/Red Team subagents, check if Playwright MCP server is registered in Claude Code settings (`.claude/settings.local.json` or global settings). If available, include MCP access instructions in the subagent prompt.
2. **Exploratory testing prompt** — After scripted tests pass, the QA/Red Team subagent receives an additional instruction block:
   ```
   "You have access to Playwright MCP for interactive browser testing.
   After scripted tests pass, spend up to 3 minutes exploring:
   - Navigate to every route in the application
   - Try unexpected inputs in every form field
   - Click UI elements in unexpected order
   - Resize the browser to test responsive behavior
   - Check console for errors after each action
   Report any issues found as EXPLORATORY findings (separate from scripted test results)."
   ```
3. **Time budget** — Exploratory testing has a configurable time budget (default: 3 minutes for QA, 5 minutes for Red Team). This prevents runaway exploration while allowing meaningful coverage.
4. **Finding classification** — Exploratory findings are tagged `[EXPLORATORY]` in qa-issues.md and red-team-report.md to distinguish them from scripted test failures. They feed into the QA calibration loop (3.2) as a separate category.
5. **Graceful degradation** — If Playwright MCP is not available, exploratory testing is skipped silently. The feature is purely additive.
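
Steps 2, 3, and 5 can be combined into one prompt-building helper, sketched below. How MCP availability is actually detected is left out on purpose: the sketch takes a boolean `mcpAvailable` rather than guessing at the settings-file schema. The function names and the `overrides` parameter are hypothetical; the prompt wording, the `[EXPLORATORY]` tag, and the 3/5-minute defaults come from the text above.

```javascript
// Sketch only: helper names and the overrides shape are hypothetical.

// Default exploratory time budgets, in minutes, per evaluator role.
const DEFAULT_BUDGET_MINUTES = { qa: 3, "red-team": 5 };

// Build the exploratory instruction block, or "" when MCP is unavailable
// (graceful degradation: the caller simply appends nothing).
function exploratoryBlock(role, mcpAvailable, overrides = {}) {
  if (!mcpAvailable) return "";
  const minutes = overrides[role] ?? DEFAULT_BUDGET_MINUTES[role];
  return [
    "You have access to Playwright MCP for interactive browser testing.",
    `After scripted tests pass, spend up to ${minutes} minutes exploring:`,
    "- Navigate to every route in the application",
    "- Try unexpected inputs in every form field",
    "- Click UI elements in unexpected order",
    "- Resize the browser to test responsive behavior",
    "- Check console for errors after each action",
    "Report any issues found as EXPLORATORY findings (separate from scripted test results).",
  ].join("\n");
}

// Tag a finding so reports can separate it from scripted failures.
function tagExploratory(finding) {
  return `[EXPLORATORY] ${finding}`;
}

console.log(exploratoryBlock("red-team", true));
console.log(tagExploratory("Console error after resizing below 320px"));
```

Returning an empty string instead of throwing keeps the feature additive: prompts for projects without Playwright MCP are byte-for-byte unchanged.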

**Files affected**:
- MODIFY: `commands/gsd-t-execute.md` — add exploratory testing block to QA/Red Team prompts
- MODIFY: `commands/gsd-t-quick.md` — same for inline QA
- MODIFY: `commands/gsd-t-integrate.md` — same for integration QA/Red Team
- MODIFY: `commands/gsd-t-debug.md` — same for debug verification
- MODIFY: `templates/CLAUDE-global.md` — document evaluator interactivity in QA Agent section
- MODIFY: `templates/CLAUDE-project.md` — optional `Evaluator Time Budget` field

**Success criteria**:
- [ ] QA/Red Team subagent prompts include exploratory testing instructions when Playwright MCP is detected
- [ ] Exploratory testing has configurable time budgets (default 3min QA, 5min Red Team)
- [ ] Exploratory findings are tagged `[EXPLORATORY]` in reports
- [ ] Exploratory findings feed into QA calibration feedback loop (3.2)
- [ ] Missing Playwright MCP causes graceful skip, not failure
- [ ] Scripted tests still run first and must pass before exploratory testing begins

**Acceptance test**: With Playwright MCP registered, run `gsd-t-execute` on a web application task. Verify the QA subagent runs scripted tests, then performs exploratory testing via MCP, and any findings are tagged `[EXPLORATORY]` in qa-issues.md.

---

### 3.6 Configurable Iteration Budget (LOW-MEDIUM PRIORITY — Tier 3)

**Problem**: GSD-T hardcodes "2 fix attempts" before escalating to the headless debug loop (M29). Research shows 5-15 iterations drive convergence to significantly better quality for complex tasks. Simple tasks need 1-2 attempts; complex tasks need 5-10. The current one-size-fits-all budget under-serves complex tasks and the headless debug loop is a heavyweight escalation.

**Solution**: Allow domains and individual tasks to specify iteration budgets, with intelligent defaults based on task complexity signals.

**Mechanism**:
1. **Budget specification** — Three levels of override:
   - **Project-level default**: `Iteration Budget: N` in CLAUDE.md (default: 2 if unset, preserving current behavior)
   - **Domain-level override**: `iteration_budget: N` in domain `constraints.md` (overrides project default)
   - **Task-level override**: `[budget:N]` tag in task description in `tasks.md` (overrides domain default)
2. **Complexity-based defaults** — During `gsd-t-plan`, each task gets a complexity score based on:
   - File count in scope (>5 files = +1)
   - Cross-domain dependencies (any = +1)
   - New vs. modify (new file = +0, modify existing = +1)
   - Test requirements (E2E = +1, unit-only = +0)
   - Historical failure rate for similar domain types (from rule engine)
   - Complexity score 0-1 = budget 2, score 2-3 = budget 4, score 4+ = budget 6
3. **In-context vs. headless threshold** — The iteration budget applies to in-context fix attempts. The headless debug loop (M29) is still the escalation path, but it activates after the full budget is exhausted (not after a fixed 2). This makes the headless loop a true last resort.
4. **Budget telemetry** — Each task's actual iteration count is logged in task-metrics.jsonl. Over time, this data refines the complexity-based defaults through the rule engine.
5. **Budget governance** — The quality budget system (M26) still applies. If a milestone's aggregate rework rate exceeds the ceiling, the system tightens constraints rather than increasing iteration budgets.
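
The override chain in step 1 and the scoring table in step 2 can be sketched as three small functions. The input object shapes and function names are assumptions. Two points are left open by the text and are handled conservatively here: the historical-failure-rate signal has no stated weight, so it is omitted from the score, and the PRD does not say how a complexity-derived budget interacts with explicit overrides, so the two calculations are kept separate.

```javascript
// Sketch only: function names and input shapes are hypothetical.

// Score a task using the four signals that have explicit weights in the PRD.
// The historical failure rate is omitted because no weight is specified.
function complexityScore(task) {
  let score = 0;
  if (task.fileCount > 5) score += 1;
  if (task.crossDomainDeps > 0) score += 1;
  if (task.modifiesExisting) score += 1;
  if (task.requiresE2E) score += 1;
  return score;
}

// Map a score to a budget: 0-1 -> 2, 2-3 -> 4, 4+ -> 6.
function budgetForScore(score) {
  if (score <= 1) return 2;
  if (score <= 3) return 4;
  return 6;
}

// Resolve explicit settings: task tag beats domain, domain beats project,
// and an unset project falls back to the current behavior of 2 attempts.
function resolveBudget({ projectBudget, domainBudget, taskDescription }) {
  const tag = /\[budget:(\d+)\]/.exec(taskDescription || "");
  if (tag) return Number(tag[1]);
  if (domainBudget != null) return domainBudget;
  if (projectBudget != null) return projectBudget;
  return 2;
}

// Mirrors the acceptance test: project budget 5, one task tagged [budget:8].
console.log(resolveBudget({ projectBudget: 5, taskDescription: "Refactor parser [budget:8]" })); // 8
console.log(resolveBudget({ projectBudget: 5, taskDescription: "Rename helper" })); // 5
console.log(resolveBudget({ taskDescription: "Rename helper" })); // 2

const hardTask = { fileCount: 7, crossDomainDeps: 1, modifiesExisting: true, requiresE2E: true };
console.log(budgetForScore(complexityScore(hardTask))); // 6
```

The resolved number is the in-context fix-attempt limit; escalation to the headless debug loop would begin only once that many attempts have failed.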
|
|
276
|
+
|
|
277
|
+
**Files affected**:
|
|
278
|
+
- MODIFY: `commands/gsd-t-plan.md` — complexity scoring and budget assignment per task
|
|
279
|
+
- MODIFY: `commands/gsd-t-execute.md` — read task budget, use as fix-attempt limit instead of hardcoded 2
|
|
280
|
+
- MODIFY: `commands/gsd-t-quick.md` — same budget-aware fix attempts
|
|
281
|
+
- MODIFY: `commands/gsd-t-debug.md` — same
|
|
282
|
+
- MODIFY: `commands/gsd-t-wave.md` — same
|
|
283
|
+
- MODIFY: `commands/gsd-t-test-sync.md` — same
|
|
284
|
+
- MODIFY: `commands/gsd-t-verify.md` — same
|
|
285
|
+
- MODIFY: `templates/CLAUDE-project.md` — add Iteration Budget field
|
|
286
|
+
- MODIFY: `templates/CLAUDE-global.md` — document iteration budget system
|
|
287
|
+
|
|
288
|
+
**Success criteria**:
|
|
289
|
+
- [ ] Project CLAUDE.md supports `Iteration Budget: N` setting
|
|
290
|
+
- [ ] Domain `constraints.md` supports `iteration_budget: N` override
|
|
291
|
+
- [ ] Task descriptions support `[budget:N]` tag
|
|
292
|
+
- [ ] `gsd-t-plan` assigns complexity-based default budgets to tasks
|
|
293
|
+
- [ ] Execute commands respect task budget instead of hardcoded 2
|
|
294
|
+
- [ ] Headless debug loop activates only after full budget exhaustion
|
|
295
|
+
- [ ] Actual iteration counts are logged in task-metrics.jsonl
|
|
296
|
+
- [ ] Default behavior (no budget set) preserves current 2-attempt limit
|
|
297
|
+
|
|
298
|
+
**Acceptance test**: Set project budget to 5. Create a task with `[budget:8]` tag. Verify execute allows up to 8 fix attempts before escalating to headless debug loop. Create a task with no tag — verify it uses project default of 5.
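
The acceptance test can be pictured with a small resolver. This is a sketch under stated assumptions, not the shipped logic: the names `parseBudgetTag` and `resolveBudget` are hypothetical, and the precedence order (task tag, then domain override, then project setting, then the current default of 2) is inferred from the success criteria rather than stated explicitly.

```javascript
const DEFAULT_BUDGET = 2; // current hardcoded fix-attempt limit

// Extracts N from a "[budget:N]" tag in a task description, or null if absent.
function parseBudgetTag(description) {
  const match = /\[budget:(\d+)\]/.exec(description);
  return match ? Number(match[1]) : null;
}

// Assumed precedence: task tag > domain override > project setting > default.
function resolveBudget({ description = "", domainBudget = null, projectBudget = null }) {
  return parseBudgetTag(description) ?? domainBudget ?? projectBudget ?? DEFAULT_BUDGET;
}

console.log(resolveBudget({ description: "Fix login flow [budget:8]", projectBudget: 5 })); // 8
console.log(resolveBudget({ description: "Fix login flow", projectBudget: 5 })); // 5
console.log(resolveBudget({ description: "Fix login flow" })); // 2
```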
|
|
299
|
+
|
|
300
|
+
---
|
|
301
|
+
|
|
302
|
+
### 3.7 Token-Aware Orchestration (MEDIUM-HIGH PRIORITY — Tier 1)
|
|
303
|
+
|
|
304
|
+
**Problem**: GSD-T runs on Claude's $200 Max plan, where tokens are a hard daily/weekly ceiling — not a variable API expense. A typical milestone spawns 30-50+ subagents across all phases. With tiered models, this consumes roughly 50-80% of a daily budget. Without budget awareness, the orchestrator can exhaust tokens mid-milestone, leaving uncommitted work scattered across subagents and forcing a wait until limits reset.
|
|
305
|
+
|
|
306
|
+
The article's harness doesn't address this because it operates on API billing where cost is variable. On a Max plan, token exhaustion is a binary failure mode — you either have capacity or you don't.
|
|
307
|
+
|
|
308
|
+
**Solution**: Make the wave and execute orchestrators aware of aggregate session-level token consumption, with graceful degradation as limits approach.
|
|
309
|
+
|
|
310
|
+
**Mechanism**:
|
|
311
|
+
1. **Session budget tracking** — The orchestrator tracks cumulative tokens consumed across all subagent spawns within a session. It uses the existing observability log (token-log.md) plus the `CLAUDE_CONTEXT_TOKENS_USED` environment variable.
|
|
312
|
+
2. **Budget estimation before spawn** — Before spawning a subagent, estimate the token cost based on: model tier (Opus ~5x Sonnet, Sonnet ~5x Haiku), task complexity (from plan-time scoring if available), and historical average from token-log.md for similar tasks.
|
|
313
|
+
3. **Graduated degradation thresholds**:
|
|
314
|
+
|
|
315
|
+
| Session Budget Consumed | Action |
|
|
316
|
+
|------------------------|--------|
|
|
317
|
+
| < 60% | Normal operation — all models at assigned tiers |
|
|
318
|
+
| 60-70% | **WARN**: Display budget alert to user. Reduce iteration budgets to minimum (2). |
|
|
319
|
+
| 70-85% | **DOWNGRADE**: Non-critical Sonnet tasks demoted to Haiku. Skip exploratory testing (3.5). Disable shadow-mode audit (3.1). |
|
|
320
|
+
| 85-95% | **CONSERVE**: Pause non-essential phases (doc-ripple, design brief generation). Checkpoint all progress to disk. |
|
|
321
|
+
| > 95% | **STOP**: Hard stop. Save all progress. Display: "Token budget nearly exhausted. Progress saved. Resume with `/gsd-t-resume` after limit resets." |
|
|
322
|
+
|
|
323
|
+
4. **Model-tier-aware budgeting** — The budget tracker understands that one Opus call ≈ 5 Sonnet calls ≈ 25 Haiku calls in token terms. Degradation actions (downgrading Sonnet → Haiku) are chosen to maximize remaining capacity for high-value tasks.
|
|
324
|
+
5. **Milestone pre-flight check** — Before starting a wave or execute run, estimate total token cost for the remaining work. If estimated cost exceeds available budget, warn the user: "This milestone has ~{N} tasks remaining, estimated at ~{X}% of daily budget. Proceed or split across sessions?"
|
|
325
|
+
6. **Integration with iteration budget (3.6)** — When budget is constrained (>60%), iteration budgets are automatically reduced. At >70%, the system prefers model escalation (Haiku → Sonnet) over additional iterations at the same tier, since one Sonnet attempt is more likely to converge than three Haiku attempts.
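
The graduated thresholds above could be expressed as a pure function. The sketch below is illustrative only and is not the API of `bin/token-budget.js`; the names `degradationAction` and `effectiveIterationBudget` are hypothetical, and treating each band's lower bound as inclusive is an assumption, since the table's ranges share their endpoints.

```javascript
// Maps percent of session budget consumed to the action named in the table.
// Assumption: lower bounds are inclusive; STOP applies strictly above 95%.
function degradationAction(percentConsumed) {
  if (percentConsumed > 95) return "STOP";
  if (percentConsumed >= 85) return "CONSERVE";
  if (percentConsumed >= 70) return "DOWNGRADE";
  if (percentConsumed >= 60) return "WARN";
  return "NORMAL";
}

// From the WARN band upward, iteration budgets fall back to the minimum of 2.
function effectiveIterationBudget(taskBudget, percentConsumed) {
  const MIN_BUDGET = 2;
  return percentConsumed >= 60 ? MIN_BUDGET : taskBudget;
}

console.log(degradationAction(72)); // "DOWNGRADE"
console.log(effectiveIterationBudget(6, 72)); // 2
```

Keeping the threshold lookup free of side effects means the orchestrator can call it before every spawn and decide separately how to act on the result.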
|
|
326
|
+
|
|
327
|
+
**Files affected**:
|
|
328
|
+
- MODIFY: `commands/gsd-t-execute.md` — pre-spawn budget check, degradation logic
|
|
329
|
+
- MODIFY: `commands/gsd-t-wave.md` — milestone pre-flight estimate, per-phase budget check
|
|
330
|
+
- MODIFY: `commands/gsd-t-quick.md` — budget-aware model selection
|
|
331
|
+
- MODIFY: `templates/CLAUDE-global.md` — document token-aware orchestration
|
|
332
|
+
- MODIFY: `templates/CLAUDE-project.md` — optional `Daily Token Budget` field
|
|
333
|
+
- NEW: `bin/token-budget.js` — budget estimation, tracking, and threshold logic (Node.js built-ins only)
|
|
334
|
+
|
|
335
|
+
**Success criteria**:
|
|
336
|
+
- [ ] Orchestrator estimates token cost before each subagent spawn
|
|
337
|
+
- [ ] Cumulative session usage is tracked and displayed at each phase boundary
|
|
338
|
+
- [ ] Degradation actions trigger at 60%, 70%, 85%, and 95% thresholds
|
|
339
|
+
- [ ] Non-critical Sonnet tasks are demoted to Haiku when budget is constrained
|
|
340
|
+
- [ ] Milestone pre-flight check warns when estimated cost exceeds available budget
|
|
341
|
+
- [ ] Progress is always saved before a hard stop — no lost work
|
|
342
|
+
- [ ] Default behavior (no budget concern) is unchanged — thresholds only fire when budget tracking detects pressure
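
As a rough illustration of the pre-spawn estimate and the milestone pre-flight check, the sketch below prices tasks in Haiku-equivalent units using the tier ratios from the mechanism (one Opus call ≈ 5 Sonnet calls ≈ 25 Haiku calls). The names `TIER_WEIGHT`, `estimateCost`, and `preflight`, and the per-task `baseUnits` size factor, are assumptions made for the example.

```javascript
// Relative cost per call in Haiku-equivalent units (ratios from the mechanism).
const TIER_WEIGHT = { haiku: 1, sonnet: 5, opus: 25 };

// Sums the estimated cost of a task list. `baseUnits` is a hypothetical
// per-task size factor (for example, a historical average); it defaults to 1.
function estimateCost(tasks) {
  return tasks.reduce((sum, task) => {
    const weight = TIER_WEIGHT[task.model];
    if (weight === undefined) throw new Error(`unknown model tier: ${task.model}`);
    return sum + weight * (task.baseUnits ?? 1);
  }, 0);
}

// Compares the estimate against the remaining budget before a run starts.
function preflight(tasks, remainingUnits) {
  const estimated = estimateCost(tasks);
  return { estimated, fits: estimated <= remainingUnits };
}

const tasks = [{ model: "opus" }, { model: "sonnet" }, { model: "haiku" }];
console.log(preflight(tasks, 30)); // { estimated: 31, fits: false }
```

Under these weights, demoting a single Sonnet task to Haiku saves 4 units, which is why the DOWNGRADE action targets non-critical Sonnet work first.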
|
|
343
|
+
|
|
344
|
+
**Acceptance test**: Start a wave with a 4-domain milestone. After 3 domains complete (simulating ~70% budget consumed), verify the orchestrator displays a budget warning, reduces iteration budgets, and demotes non-critical tasks to Haiku. Then simulate consumption above 95% and verify that all progress is saved and the user sees a clear "resume" instruction.
|
|
345
|
+
|
|
346
|
+
---
|
|
347
|
+
|
|
348
|
+
### 3.8 Refined Model Tier Assignments (IMMEDIATE — Pre-Milestone)
|
|
349
|
+
|
|
350
|
+
**Problem**: The current model assignments have QA running on Haiku. The analysis (see `docs/harness-design-analysis.md`) identifies this as the single largest source of quality gaps — QA on Haiku produces superficial evaluations that Red Team consistently catches.
|
|
351
|
+
|
|
352
|
+
**Solution**: Promote QA from Haiku to Sonnet across all command files. Narrow Haiku's scope to strictly mechanical (zero-judgment) tasks.
|
|
353
|
+
|
|
354
|
+
**Mechanism**:
|
|
355
|
+
This is a search-and-replace operation in existing command files, not a new system:
|
|
356
|
+
|
|
357
|
+
| Role | Current Model | New Model | Rationale |
|
|
358
|
+
|------|-------------|-----------|-----------|
|
|
359
|
+
| Task execution | Sonnet | Sonnet | No change |
|
|
360
|
+
| QA evaluation | Haiku | **Sonnet** | Biggest quality-per-token improvement |
|
|
361
|
+
| Red Team | Sonnet | **Opus** | Adversarial reasoning benefits most from top-tier |
|
|
362
|
+
| Test running (count pass/fail) | Haiku | Haiku | Mechanical — no judgment needed |
|
|
363
|
+
| File existence checks | Haiku | Haiku | Mechanical |
|
|
364
|
+
| Branch guards | Haiku | Haiku | Mechanical |
|
|
365
|
+
| Orchestration | Opus | Opus | No change |
|
|
366
|
+
|
|
367
|
+
**Token cost impact**: QA calls increase ~3-5x per call (Haiku → Sonnet), but QA calls are small relative to execute calls. Red Team increases ~3-5x per call (Sonnet → Opus), but there's only 1 Red Team call per milestone. Net impact: ~10-15% more tokens per milestone — well within the daily budget with the token-aware orchestration (3.7) managing the ceiling.
|
|
368
|
+
|
|
369
|
+
**Files affected**:
|
|
370
|
+
- MODIFY: `commands/gsd-t-execute.md` — QA model: haiku → sonnet, Red Team model annotation
|
|
371
|
+
- MODIFY: `commands/gsd-t-quick.md` — QA model annotation
|
|
372
|
+
- MODIFY: `commands/gsd-t-integrate.md` — QA model annotation
|
|
373
|
+
- MODIFY: `templates/CLAUDE-global.md` — model assignment table
|
|
374
|
+
|
|
375
|
+
**This can be done immediately as a standalone change**, before M31-M33. No new infrastructure required.
|
|
376
|
+
|
|
377
|
+
---
|
|
378
|
+
|
|
379
|
+
## 4. Milestone Plan
|
|
380
|
+
|
|
381
|
+
### Pre-Milestone: Refined Model Tiers (v2.51.11)
|
|
382
|
+
|
|
383
|
+
**Scope**: Enhancement 3.8 (Refined Model Tier Assignments)
|
|
384
|
+
|
|
385
|
+
**Rationale**: This is a search-and-replace operation that addresses the #1 quality gap (QA on Haiku) with zero new infrastructure. It should be done immediately, before M31, because the QA calibration system (3.2) will produce better baseline data when QA is already running on Sonnet.
|
|
386
|
+
|
|
387
|
+
**Estimated effort**: 1-2 hours (direct edits to the 3 command files + 1 template listed under 3.8)
|
|
388
|
+
**Predecessor**: None — standalone change
|
|
389
|
+
|
|
390
|
+
---
|
|
391
|
+
|
|
392
|
+
### M31: Self-Calibrating QA (Tier 1) — v2.52.10
|
|
393
|
+
|
|
394
|
+
**Scope**: Enhancements 3.1 (Harness Audit) + 3.2 (QA Calibration Feedback Loop) + 3.7 (Token-Aware Orchestration)
|
|
395
|
+
|
|
396
|
+
**Rationale**: These three enhancements are complementary — all three measure and manage GSD-T's resource effectiveness. The harness audit measures component-level ROI; the QA calibration measures QA-specific detection quality; the token-aware orchestrator ensures the framework can complete milestones within daily token limits. Together they establish the "self-awareness" foundation that all other enhancements benefit from.
|
|
397
|
+
|
|
398
|
+
**Estimated domains**: 4-5
|
|
399
|
+
- `harness-audit` — component registry, audit command, shadow mode, cost/benefit ledger
|
|
400
|
+
- `qa-calibrator` — miss-rate tracking, category aggregation, weak-spot detection, dynamic injection
|
|
401
|
+
- `token-orchestrator` — budget estimation, tracking, graduated degradation, pre-flight checks
|
|
402
|
+
- `command-integration` — wire audit into complete-milestone, wire calibrator into execute/quick/integrate, wire budget checks into wave/execute
|
|
403
|
+
- `telemetry-extension` — extend metrics-rollup with QA miss rates and component impact
|
|
404
|
+
|
|
405
|
+
**Estimated tasks**: 14-18
|
|
406
|
+
**Predecessor**: M30 (Stack Rules Engine — for component registry baseline), Pre-Milestone model tier refinement
|
|
407
|
+
|
|
408
|
+
### M32: Quality Culture & Design (Tier 2) — v2.53.10
|
|
409
|
+
|
|
410
|
+
**Scope**: Enhancements 3.3 (Quality North Star) + 3.4 (Design Brief) + 3.5 (Evaluator Interactivity)
|
|
411
|
+
|
|
412
|
+
**Rationale**: These three enhancements share a theme: raising the quality ceiling through non-procedural means. Quality persona raises the baseline aspiration. Design brief ensures aesthetic coherence. Evaluator interactivity finds bugs that procedural checks miss. Grouping them ensures they are designed to work together — the quality persona influences how the design brief is generated, and evaluator interactivity tests against both.
|
|
413
|
+
|
|
414
|
+
**Estimated domains**: 3-4
|
|
415
|
+
- `quality-persona` — CLAUDE.md section, init/setup integration, prompt injection
|
|
416
|
+
- `design-brief` — detection, generation, contract storage, injection for UI tasks
|
|
417
|
+
- `evaluator-interactivity` — MCP detection, exploratory testing prompts, finding classification
|
|
418
|
+
- `command-integration` — wire all three into execute/quick/debug/integrate/wave
|
|
419
|
+
|
|
420
|
+
**Estimated tasks**: 10-12
|
|
421
|
+
**Predecessor**: M31 (QA calibration must exist for exploratory findings to feed into)
|
|
422
|
+
|
|
423
|
+
### M33: Adaptive Iteration (Tier 3) — v2.54.10
|
|
424
|
+
|
|
425
|
+
**Scope**: Enhancement 3.6 (Configurable Iteration Budget)
|
|
426
|
+
|
|
427
|
+
**Rationale**: This enhancement depends on telemetry data from M31 (QA miss rates, component impact) and M32 (exploratory findings) to make intelligent budget decisions. It also modifies the most command files (7), so it should be last to minimize merge conflicts with M31/M32 changes.
|
|
428
|
+
|
|
429
|
+
**Estimated domains**: 2-3
|
|
430
|
+
- `complexity-scoring` — plan-time complexity analysis, budget assignment, defaults
|
|
431
|
+
- `budget-execution` — budget-aware fix attempts in all 7 execution commands
|
|
432
|
+
- `telemetry-extension` — iteration count tracking in task-metrics.jsonl
|
|
433
|
+
|
|
434
|
+
**Estimated tasks**: 8-10
|
|
435
|
+
**Predecessor**: M32
|
|
436
|
+
|
|
437
|
+
---
|
|
438
|
+
|
|
439
|
+
## 5. Impact Analysis
|
|
440
|
+
|
|
441
|
+
### Existing commands modified
|
|
442
|
+
|
|
443
|
+
| Command | M31 | M32 | M33 | Changes |
|
|
444
|
+
|----------------------|-----|-----|-----|----------------------------------------------------------------|
|
|
445
|
+
| `gsd-t-execute` | X | X | X | QA calibration injection, persona injection, design brief, budget |
|
|
446
|
+
| `gsd-t-quick` | X | X | X | Same as execute (inline variants) |
|
|
447
|
+
| `gsd-t-integrate` | X | X | | QA calibration injection, persona, design brief |
|
|
448
|
+
| `gsd-t-debug` | | X | X | Persona injection, budget |
|
|
449
|
+
| `gsd-t-wave` | | X | X | Persona injection, budget |
|
|
450
|
+
| `gsd-t-plan` | | | X | Complexity scoring, budget assignment |
|
|
451
|
+
| `gsd-t-partition` | | X | | Design brief detection and generation |
|
|
452
|
+
| `gsd-t-init` | | X | | Quality persona selection |
|
|
453
|
+
| `gsd-t-setup` | | X | | Quality persona + design brief configuration |
|
|
454
|
+
| `gsd-t-test-sync` | | | X | Budget-aware fix attempts |
|
|
455
|
+
| `gsd-t-verify` | | | X | Budget-aware fix attempts |
|
|
456
|
+
| `gsd-t-complete-milestone` | X | | | Component impact evaluation in distillation |
|
|
457
|
+
| `gsd-t-status` | X | | | Show flagged components + QA miss rate summary |
|
|
458
|
+
| `gsd-t-help` | X | | | New audit command entry |
|
|
459
|
+
|
|
460
|
+
### New artifacts
|
|
461
|
+
|
|
462
|
+
| Artifact | Milestone | Purpose |
|
|
463
|
+
|----------------------------------------|-----------|--------------------------------------------|
|
|
464
|
+
| `commands/gsd-t-audit.md` | M31 | Harness audit command |
|
|
465
|
+
| `bin/qa-calibrator.js` | M31 | QA miss-rate aggregation + weak-spot detection |
|
|
466
|
+
| `bin/token-budget.js` | M31 | Token budget estimation, tracking, thresholds |
|
|
467
|
+
| `.gsd-t/component-registry.jsonl` | M31 | Component inventory with cost tracking |
|
|
468
|
+
| `.gsd-t/metrics/qa-miss-log.jsonl` | M31 | Red Team findings QA missed |
|
|
469
|
+
| `.gsd-t/metrics/component-impact.jsonl`| M31 | Per-component cost/benefit ledger |
|
|
470
|
+
| `.gsd-t/contracts/design-brief.md` | M32 | Design language for UI-heavy projects |
|
|
471
|
+
|
|
472
|
+
### Backward compatibility
|
|
473
|
+
|
|
474
|
+
All enhancements are **purely additive**:
|
|
475
|
+
- Projects without a quality persona skip injection silently
|
|
476
|
+
- Projects without UI domains skip design brief entirely
|
|
477
|
+
- Projects without Playwright MCP skip exploratory testing
|
|
478
|
+
- Projects without iteration budget settings use current 2-attempt default
|
|
479
|
+
- The audit command is opt-in — it never runs automatically
|
|
480
|
+
- QA calibration activates only when Red Team data exists (Red Team was added in v2.51.10)
|
|
481
|
+
|
|
482
|
+
No existing behavior changes unless the user explicitly enables the new features.
|
|
483
|
+
|
|
484
|
+
### Zero-dependency constraint
|
|
485
|
+
|
|
486
|
+
All new code (`bin/qa-calibrator.js`, `bin/token-budget.js`, component registry logic) uses Node.js built-ins only, with no external npm dependencies. This is non-negotiable per TECH-001.
|
|
487
|
+
|
|
488
|
+
---
|
|
489
|
+
|
|
490
|
+
## 6. Risk Assessment
|
|
491
|
+
|
|
492
|
+
| Risk | Likelihood | Impact | Mitigation |
|
|
493
|
+
|-----------------------------------------------|------------|--------|------------------------------------------------------------------|
|
|
494
|
+
| Harness audit doubles execution time | Medium | High | Audit is opt-in, never automatic. Budget per audit session. |
|
|
495
|
+
| QA calibration creates feedback oscillation | Low | Medium | Damping: changes only after 3+ milestones of consistent signal. |
|
|
496
|
+
| Quality persona is ignored by subagent | Medium | Low | Minimal cost (2-3 lines of prompt). Measure via A/B in audit. |
|
|
497
|
+
| Design brief is too prescriptive | Low | Medium | Brief sets direction, not pixel specs. Execution agents adapt. |
|
|
498
|
+
| Playwright MCP not widely available | Medium | Low | Graceful degradation — feature skips if MCP absent. |
|
|
499
|
+
| Higher iteration budgets waste context tokens | Low | Medium | Budget governance (M26) caps aggregate rework. Telemetry tracks. |
|
|
500
|
+
| Command file sizes grow beyond readability | Medium | High | Each injection is max 5-10 lines. Total overhead auditable via 3.1. |
|
|
501
|
+
| **Token exhaustion mid-milestone** | **High** | **High** | **Token-aware orchestration (3.7) with graduated degradation. Progress always checkpointed before hard stop.** |
|
|
502
|
+
| **QA promotion to Sonnet exceeds token budget** | Low | Medium | QA calls are small relative to execute. Net impact ~10-15% more tokens. Token orchestrator manages ceiling. |
|
|
503
|
+
| **Budget estimation inaccuracy** | Medium | Medium | Estimates improve over time using historical data from token-log.md. Conservative defaults (overestimate). |
|
|
504
|
+
|
|
505
|
+
---
|
|
506
|
+
|
|
507
|
+
## 7. Success Metrics
|
|
508
|
+
|
|
509
|
+
| Metric | Baseline (current) | Target (post-M33) |
|
|
510
|
+
|--------------------------------------|-----------------------|----------------------------|
|
|
511
|
+
| QA-to-Red-Team miss rate | Not tracked | < 15% per category |
|
|
512
|
+
| Component ROI visibility | None | 100% of components tracked |
|
|
513
|
+
| Subagent prompt quality signal | Procedural only | Persona + procedural |
|
|
514
|
+
| UI aesthetic consistency (manual) | Per-task independent | Brief-governed coherence |
|
|
515
|
+
| Exploratory bugs found | 0 (no capability) | > 0 per UI milestone |
|
|
516
|
+
| Iteration convergence (complex tasks)| Fixed 2 attempts | Adaptive 2-8 attempts |
|
|
517
|
+
| Framework bloat detection | None | Components flagged at 3+ milestone negative ROI |
|
|
518
|
+
| **Token exhaustion incidents** | **Not tracked** | **Zero — milestones always complete or checkpoint gracefully** |
|
|
519
|
+
| **Milestones per daily budget** | **~1-2 (unmanaged)** | **2-3 with budget-aware orchestration** |
|
|
520
|
+
| **QA model effectiveness** | **Haiku (baseline)** | **Sonnet (post-3.8 promotion)** |
|
|
521
|
+
|
|
522
|
+
---
|
|
523
|
+
|
|
524
|
+
## 8. Out of Scope
|
|
525
|
+
|
|
526
|
+
- **Automated component removal** — Flagging is automated; actual removal requires user approval (Destructive Action Guard)
|
|
527
|
+
- **Visual regression testing** — Design brief ensures consistency via prompt engineering, not pixel-diff tooling
|
|
528
|
+
- **Custom LLM model selection** — Quality persona works within the refined haiku/sonnet/opus model assignments. Fully dynamic model selection per-task is a future enhancement beyond 3.8's static tier refinement
|
|
529
|
+
- **Cross-project QA calibration** — QA calibration is per-project in this PRD. Cross-project propagation follows the M27 pattern if warranted later
|
|
530
|
+
- **Real-time quality dashboard** — The existing dashboard (M15) displays events; adding QA calibration visualizations is a future enhancement
|
|
531
|
+
- **AI-generated design systems** — The design brief is a coordination artifact, not a Figma export or component library generator
|
|
532
|
+
|
|
533
|
+
---
|
|
534
|
+
|
|
535
|
+
## 9. Dependencies
|
|
536
|
+
|
|
537
|
+
| Dependency | Type | Status | Required By |
|
|
538
|
+
|--------------------------------|----------|-------------|-------------|
|
|
539
|
+
| Rule Engine (M26) | Internal | COMPLETE | M31 (audit integration with patch lifecycle) |
|
|
540
|
+
| Patch Lifecycle (M26) | Internal | COMPLETE | M31 (component deprecation as patches) |
|
|
541
|
+
| Red Team (v2.51.10) | Internal | COMPLETE | M31 (miss-rate source data) |
|
|
542
|
+
| Metrics Rollup (M25) | Internal | COMPLETE | M31 (ELO integration for QA miss rates) |
|
|
543
|
+
| Headless Debug Loop (M29) | Internal | COMPLETE | M33 (budget exhaustion triggers headless) |
|
|
544
|
+
| Stack Rules Engine (M30) | Internal | COMPLETE | M32 (persona injected before stack rules) |
|
|
545
|
+
| Playwright MCP | External | Optional | M32 (evaluator interactivity — graceful skip if absent) |
|
|
546
|
+
| Claude Code Agent tool | External | Available | All (subagent spawning) |
|
|
547
|
+
|
|
548
|
+
---
|
|
549
|
+
|
|
550
|
+
## 10. Implementation Notes
|
|
551
|
+
|
|
552
|
+
### Token budget awareness
|
|
553
|
+
|
|
554
|
+
There are two distinct token budget concerns:
|
|
555
|
+
|
|
556
|
+
**A. Per-subagent prompt overhead** — how many tokens each enhancement adds to individual subagent prompts:
|
|
557
|
+
- Quality persona: 2-3 lines (~50 tokens)
|
|
558
|
+
- QA weak-spot injection: 3-5 lines (~100 tokens) — only when weak spots exist
|
|
559
|
+
- Design brief: 15-30 lines (~300 tokens) — only for UI tasks
|
|
560
|
+
- Exploratory testing instructions: 8-10 lines (~150 tokens) — only when MCP available
|
|
561
|
+
- Iteration budget: 1 line (~20 tokens) — always injected
|
|
562
|
+
|
|
563
|
+
**Maximum additional overhead per subagent**: ~620 tokens (UI task with all features active). This is well within the per-agent context budget and measurable via the harness audit (3.1).
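
The ~620-token figure is the sum of the approximate per-feature estimates listed above. A quick check, with hypothetical key names:

```javascript
// Approximate per-subagent prompt overhead in tokens, from the list above.
const overheadTokens = {
  qualityPersona: 50,
  qaWeakSpots: 100, // only when weak spots exist
  designBrief: 300, // only for UI tasks
  exploratoryTesting: 150, // only when MCP is available
  iterationBudget: 20, // always injected
};

// Worst case: a UI task with every feature active.
const maxOverhead = Object.values(overheadTokens).reduce((a, b) => a + b, 0);
console.log(maxOverhead); // 620
```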
|
|
564
|
+
|
|
565
|
+
**B. Session-level token consumption** — how all enhancements affect daily/weekly token limits on the $200 Max plan:
|
|
566
|
+
- QA model promotion (Haiku → Sonnet): +10-15% tokens per milestone
|
|
567
|
+
- Red Team model promotion (Sonnet → Opus): +3-5% tokens per milestone (only 1 call)
|
|
568
|
+
- Exploratory testing: +5-10% tokens per milestone (when MCP available)
|
|
569
|
+
- Higher iteration budgets: variable, capped by token-aware orchestrator (3.7)
|
|
570
|
+
- Harness audit: opt-in only, not counted in normal milestone budgets
|
|
571
|
+
|
|
572
|
+
**Maximum additional session cost**: ~25-30% more tokens per milestone vs. current. The token-aware orchestrator (3.7) ensures this stays within daily limits through graduated degradation.
|
|
573
|
+
|
|
574
|
+
### Command file discipline
|
|
575
|
+
|
|
576
|
+
Each command file modification is a targeted injection (5-15 lines), not a restructure. The existing step numbering and flow are preserved. New injection points follow the established pattern: read a state file, conditionally inject content into the subagent prompt.
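
The "read a state file, conditionally inject" pattern is written as prompt instructions in the command files themselves. Purely to illustrate the control flow, here is a JavaScript sketch; the function name `injectIfPresent` and the heading format are hypothetical.

```javascript
const fs = require("node:fs");

// Returns the prompt with the state file's content appended under a heading.
// When the file is missing or empty, the prompt is returned unchanged
// (a silent skip, which preserves backward compatibility).
function injectIfPresent(prompt, stateFilePath, heading) {
  let content;
  try {
    content = fs.readFileSync(stateFilePath, "utf8").trim();
  } catch (err) {
    if (err.code === "ENOENT") return prompt; // no state file: skip silently
    throw err; // surface unexpected I/O errors
  }
  if (content === "") return prompt;
  return `${prompt}\n\n## ${heading}\n${content}`;
}
```

For example, `injectIfPresent(basePrompt, ".gsd-t/contracts/design-brief.md", "Design Brief")` would leave `basePrompt` untouched in a project that has no design brief.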
|
|
577
|
+
|
|
578
|
+
### Testing strategy
|
|
579
|
+
|
|
580
|
+
- `bin/qa-calibrator.js` — unit tests in `test/qa-calibrator.test.js` (JSONL parsing, miss-rate math, weak-spot detection)
|
|
581
|
+
- Component registry — unit tests in `test/component-registry.test.js` (CRUD, cost calculation, flagging logic)
|
|
582
|
+
- Integration — manual CLI testing of `gsd-t-audit` command, verified via existing `gsd-t-verify` gates
|
|
583
|
+
- No new external dependencies for testing (stays with Node.js built-in `node --test`)
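
To give a sense of what the unit tests for JSONL parsing, miss-rate math, and weak-spot detection might exercise, here is a sketch that is not the code in `bin/qa-calibrator.js`. The record shape (`category`, `caughtByQa`), the function names, and the use of the 15% per-category target as the weak-spot threshold are all assumptions made for illustration.

```javascript
// Parses JSONL text into an array of objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "")
    .map((line) => JSON.parse(line));
}

// Hypothetical record shape: { category: string, caughtByQa: boolean }.
// Miss rate per category = findings QA missed / all findings in that category.
function missRatesByCategory(findings) {
  const stats = new Map();
  for (const { category, caughtByQa } of findings) {
    const s = stats.get(category) ?? { total: 0, missed: 0 };
    s.total += 1;
    if (!caughtByQa) s.missed += 1;
    stats.set(category, s);
  }
  const rates = {};
  for (const [category, s] of stats) rates[category] = s.missed / s.total;
  return rates;
}

// Categories at or above the threshold count as weak spots (assumed 15%).
function weakSpots(rates, threshold = 0.15) {
  return Object.keys(rates).filter((category) => rates[category] >= threshold);
}

const log = [
  '{"category":"security","caughtByQa":false}',
  '{"category":"security","caughtByQa":true}',
  '{"category":"ui","caughtByQa":true}',
].join("\n");
console.log(weakSpots(missRatesByCategory(parseJsonl(log)))); // [ 'security' ]
```

Because these helpers are pure functions over parsed records, they can be checked with the built-in `node --test` runner without fixtures on disk.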
|
package/docs/requirements.md
CHANGED
|
@@ -68,6 +68,9 @@
|
|
|
68
68
|
| REQ-057 | Stack Rule Templates — best practice rule files in `templates/stacks/` for React, TypeScript, and Node.js API. Each file follows a standard structure (mandatory framing, numbered sections, GOOD/BAD examples, verification checklist) and stays under 200 lines. Universal templates (`_` prefix) always injected; stack-specific templates injected when detected. | P1 | complete | M30: templates/stacks/ (4 files: _security.md, react.md, typescript.md, node-api.md) |
|
|
69
69
|
| REQ-058 | Stack Detection Engine — auto-detect project tech stack from manifest files (package.json, requirements.txt, go.mod, Cargo.toml) at subagent spawn time. Match detected stacks against available templates. Inject matched rules into subagent prompts with mandatory enforcement framing. Resilient: skip silently if no templates exist or no matches found. | P1 | complete | M30: 5 command files (execute, quick, integrate, wave, debug) |
|
|
70
70
|
| REQ-059 | Stack Rule QA Enforcement — QA subagent prompts include stack rule compliance validation. Stack rule violations have the same severity as contract violations — they fail the task, not warn. Report format includes "Stack rules: compliant/N violations". | P1 | complete | M30: execute QA prompt + all 5 commands |
|
|
71
|
+
| REQ-060 | Quality North Star Persona — project CLAUDE.md can define a `## Quality North Star` section (1-3 sentences) with a project quality identity. gsd-t-init auto-detects preset (library/web-app/cli) or prompts user. gsd-t-setup offers persona config for existing projects. Persona is injected at subagent spawn time; skips silently if section absent (backward compatible). | P2 | complete | M32: templates/CLAUDE-project.md, gsd-t-init.md, gsd-t-setup.md |
|
|
72
|
+
| REQ-061 | Design Brief Generation — during partition, if UI/frontend signals detected (React/Vue/Svelte/Flutter, CSS/SCSS, component files, or Tailwind config), generate `.gsd-t/contracts/design-brief.md` with color palette, typography, spacing, component patterns, layout principles, interaction patterns, and tone/voice. Skip for non-UI projects. Do not overwrite existing briefs. Referenced in plan for UI task descriptions. | P2 | complete | M32: gsd-t-partition.md, gsd-t-plan.md, gsd-t-setup.md |
|
|
73
|
+
| REQ-062 | Exploratory Testing Blocks — after scripted tests pass, if Playwright MCP is registered, QA agents get 3 minutes and Red Team gets 5 minutes of interactive exploration using Playwright MCP. All findings tagged [EXPLORATORY] in qa-issues.md and red-team-report.md. Feeds into M31 QA calibration as separate category. Silent skip when Playwright MCP absent. Injected into execute, quick, integrate, debug. | P2 | complete | M32: gsd-t-execute.md, gsd-t-quick.md, gsd-t-integrate.md, gsd-t-debug.md |
|
|
71
74
|
|
|
72
75
|
## Technical Requirements
|
|
73
76
|
|
|
@@ -196,6 +199,17 @@
|
|
|
196
199
|
**Orphaned requirements**: None — all M30 REQs mapped to tasks.
|
|
197
200
|
**Unanchored tasks**: command-integration Task 3 (tests) is QA infrastructure supporting REQ-057 through REQ-059. command-integration Task 4 (reference docs) supports Pre-Commit Gate compliance.
|
|
198
201
|
|
|
202
|
+
## Requirements Traceability (updated by plan phase — M32)
|
|
203
|
+
|
|
204
|
+
| REQ-ID | Requirement Summary | Domain | Task(s) | Status |
|
|
205
|
+
|---------|--------------------------------------------------------------|-------------------------|---------|---------|
|
|
206
|
+
| REQ-060 | Quality North Star Persona — CLAUDE-project template + init/setup detection and config | quality-persona | Task 1 | complete |
|
|
207
|
+
| REQ-061 | Design Brief Generation — partition detection + plan note + setup option | design-brief | Task 1 | complete |
|
|
208
|
+
| REQ-062 | Exploratory Testing Blocks — post-scripted Playwright MCP exploration in 4 commands | evaluator-interactivity | Task 1 | complete |
|
|
209
|
+
|
|
210
|
+
**Orphaned requirements**: None — all M32 REQs mapped to tasks.
|
|
211
|
+
**Unanchored tasks**: None — all 3 domain tasks map directly to functional requirements.
|
|
212
|
+
|
|
199
213
|
---
|
|
200
214
|
|
|
201
215
|
## M17: Scan Visual Output — Feature Specification
|