@tekyzinc/gsd-t 2.74.10 → 2.74.12

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -2,6 +2,18 @@
 
  You are the lead agent coordinating task execution across domains. Choose solo or team mode based on the plan.
 
+ ## Step 0: Reset Task-Count Gate (MANDATORY — first thing in a fresh session)
+
+ Run via Bash:
+
+ ```bash
+ node bin/task-counter.cjs reset
+ ```
+
+ This clears `.gsd-t/.task-counter` so the new session starts at 0. The reset is the SIGNAL that this is a clean post-`/clear` orchestrator. Do this exactly ONCE per `/user:gsd-t-execute` invocation, immediately on entry. The gate logic is in Step 3.5; do NOT skip it. If `bin/task-counter.cjs` is missing in this project, reinstall it via `gsd-t install`, then retry — the gate is required.
+
+ Why: every `/user:gsd-t-execute` invocation is a fresh orchestrator session. Without the reset, the counter from the previous session would still be at the limit and the gate would refuse to spawn anything. Reset is the only acceptable way to return the counter to 0.
+
  ## Step 1: Load State
 
  Read:
@@ -95,24 +107,15 @@ Each domain's work runs via a lightweight domain task-dispatcher. The dispatcher
  **OBSERVABILITY LOGGING (MANDATORY) — repeat for every task subagent spawn:**
 
  Before spawning — run via Bash:
- `T_START=$(date +%s) && DT_START=$(date +"%Y-%m-%d %H:%M") && TOK_START=${CLAUDE_CONTEXT_TOKENS_USED:-0} && TOK_MAX=${CLAUDE_CONTEXT_TOKENS_MAX:-200000}`
+ `T_START=$(date +%s) && DT_START=$(date +"%Y-%m-%d %H:%M")`
 
  After subagent returns — run via Bash:
- `T_END=$(date +%s) && DT_END=$(date +"%Y-%m-%d %H:%M") && TOK_END=${CLAUDE_CONTEXT_TOKENS_USED:-0} && DURATION=$((T_END-T_START))`
+ `T_END=$(date +%s) && DT_END=$(date +"%Y-%m-%d %H:%M") && DURATION=$((T_END-T_START))`
 
- Compute tokens and compaction:
- - No compaction (TOK_END >= TOK_START): `TOKENS=$((TOK_END-TOK_START))`, COMPACTED=null
- - Compaction detected (TOK_END < TOK_START): `TOKENS=$(((TOK_MAX-TOK_START)+TOK_END))`, COMPACTED=$DT_END
+ Append to `.gsd-t/token-log.md` (create with header `| Datetime-start | Datetime-end | Command | Step | Model | Duration(s) | Notes | Domain | Task | Tasks-Since-Reset |` if missing):
+ `| {DT_START} | {DT_END} | gsd-t-execute | task:{task-id} | sonnet | {DURATION}s | {pass/fail} | {domain-name} | task-{task-id} | {COUNTER} |`
 
- Compute context utilization — run via Bash:
- `if [ "${CLAUDE_CONTEXT_TOKENS_MAX:-0}" -gt 0 ]; then CTX_PCT=$(echo "scale=1; ${CLAUDE_CONTEXT_TOKENS_USED:-0} * 100 / ${CLAUDE_CONTEXT_TOKENS_MAX}" | bc); else CTX_PCT="N/A"; fi`
-
- Alert on context thresholds (display to user inline):
- - If CTX_PCT is a number and >= 85: `echo "🔴 CRITICAL: Context at ${CTX_PCT}% — compaction likely. Task MUST be split."`
- - If CTX_PCT is a number and >= 70: `echo "⚠️ WARNING: Context at ${CTX_PCT}% — approaching compaction threshold. Consider splitting in plan."`
-
- Append to `.gsd-t/token-log.md` (create with header `| Datetime-start | Datetime-end | Command | Step | Model | Duration(s) | Notes | Tokens | Compacted | Domain | Task | Ctx% |` if missing):
- `| {DT_START} | {DT_END} | gsd-t-execute | task:{task-id} | sonnet | {DURATION}s | {pass/fail} | {TOKENS} | {COMPACTED} | {domain-name} | task-{task-id} | {CTX_PCT} |`
+ Where `{COUNTER}` is the value returned by `node bin/task-counter.cjs status` (see Step 3.5). Note: the legacy `Tokens`, `Compacted`, and `Ctx%` columns were removed in v2.74.12 because Claude Code does not export `CLAUDE_CONTEXT_TOKENS_USED`/`_MAX`, so those columns always wrote zeros and the orchestrator self-check based on them was inert. The real burn signal is now `Tasks-Since-Reset`, which the task-counter gate in Step 3.5 enforces.
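The append-with-header step above can be sketched in plain shell. The placeholder values filled in below are example assumptions, not part of the package:

```shell
# Sketch of the append-with-header logging step; placeholder values are examples.
LOG=".gsd-t/token-log.md"
DT_START="2026-04-13 09:00"; DT_END="2026-04-13 09:04"
DURATION=240; DOMAIN="api"; TASK_ID="3"; COUNTER=2

mkdir -p "$(dirname "$LOG")"
# Create the log with its header row only if the file does not exist yet.
if [ ! -f "$LOG" ]; then
  echo '| Datetime-start | Datetime-end | Command | Step | Model | Duration(s) | Notes | Domain | Task | Tasks-Since-Reset |' > "$LOG"
fi
printf '| %s | %s | gsd-t-execute | task:%s | sonnet | %ss | pass | %s | task-%s | %s |\n' \
  "$DT_START" "$DT_END" "$TASK_ID" "$DURATION" "$DOMAIN" "$TASK_ID" "$COUNTER" >> "$LOG"
```

Creating the header lazily keeps the append idempotent across sessions: re-running only ever adds data rows.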
 
  **For each domain (in wave order), run the domain task-dispatcher:**
 
@@ -373,32 +376,10 @@ Execute the task above:
  - Completed after a fix → prefix `[learning]`
  - Deferred to .gsd-t/deferred-items.md → prefix `[deferred]`
  - Failed after 3 attempts → prefix `[failure]`
- 13. Spawn QA subagent (model: sonnet) after completing the task:
- 'Run ALL configured test suites — detect and run every one:
- a. Unit tests (vitest/jest/mocha): run the full suite, report pass/fail counts
- b. E2E tests: check for playwright.config.* or cypress.config.*; if found, run the FULL E2E suite
- c. NEVER skip E2E when a config file exists. Running only unit tests is a QA FAILURE.
- d. Read .gsd-t/contracts/ for contract definitions. Check contract compliance.
- e. AUDIT E2E test quality: Review each Playwright spec — if any test only checks
- element existence (isVisible, toBeAttached, toBeEnabled) without verifying functional
- behavior (state changes, data loaded, content updated after actions), flag it as
- "SHALLOW TEST — needs functional assertions" in the gap report. A test suite where
- every spec passes but no feature actually works is a QA FAILURE.
- Report format: "Unit: X/Y pass | E2E: X/Y pass (or N/A if no config) | Contract: compliant/violations | Shallow tests: N (list) | Stack rules: compliant/N violations"
- f. Validate compliance with Stack Rules (if injected in the work subagent's prompt).
- Stack rule violations have the same severity as contract violations — report as failures, not warnings.
-
- ## Exploratory Testing (if Playwright MCP available)
-
- After all scripted tests pass:
- 1. Check if Playwright MCP is registered in Claude Code settings (look for "playwright" in mcpServers)
- 2. If available: spend 3 minutes on interactive exploration using Playwright MCP
- - Try variations of happy paths with unexpected inputs
- - Probe for race conditions, double-submits, empty states
- - Test accessibility (keyboard navigation, screen reader flow)
- 3. Tag all findings [EXPLORATORY] in your report and append to .gsd-t/qa-issues.md with [EXPLORATORY] prefix
- 4. If Playwright MCP is not available: skip this section silently
- Note: Exploratory findings do NOT count against the scripted test pass/fail ratio.'
+ 13. Spawn QA subagent (model: sonnet) after completing the task. Resolve the templated prompt path first so the orchestrator never holds the full prompt body in its own context:
+ Run via Bash: `QA_PROMPT="$(npm root -g 2>/dev/null)/@tekyzinc/gsd-t/templates/prompts/qa-subagent.md"; [ -f "$QA_PROMPT" ] || QA_PROMPT="templates/prompts/qa-subagent.md"`
+ Then spawn the subagent with this short prompt:
+ 'You are the QA agent. Read `'"$QA_PROMPT"'` and follow it exactly. Do not deviate from the protocol in that file. Context for this run: domain={domain-name}, task=task-{task-id}, files-modified={list-from-task-summary}.'
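The one-line path resolution above expands to this equivalent form (a sketch; like the one-liner, it assumes a missing global install should fall back to the repo-local template):

```shell
# Expanded form of the QA prompt path resolution one-liner.
GLOBAL_ROOT="$(npm root -g 2>/dev/null || true)"   # empty if npm is unavailable
QA_PROMPT="$GLOBAL_ROOT/@tekyzinc/gsd-t/templates/prompts/qa-subagent.md"
if [ ! -f "$QA_PROMPT" ]; then
  # Global template missing: use the project-local copy instead.
  QA_PROMPT="templates/prompts/qa-subagent.md"
fi
```

The `[ -f ... ] ||` form in the original compresses exactly this if/else into one line.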
  If QA fails OR shallow tests are found, fix before proceeding. Append issues to .gsd-t/qa-issues.md.
  14. Write task summary to .gsd-t/domains/{domain-name}/task-{task-id}-summary.md:
  ## Task {task-id} Summary — {domain-name}
@@ -476,6 +457,12 @@ Report back:
 
  h. The revised `tasks.md` files are now on disk — the next domain's dispatcher will read the updated version automatically (disk-based handoff, no in-memory state sharing needed).
 
+ 5. **Per-domain Design Verification** — if `.gsd-t/contracts/design-contract.md` exists AND this domain modified UI files, invoke Step 5.25 (Design Verification Agent) NOW for this domain. Otherwise skip.
+
+ 6. **Per-domain Red Team** — invoke Step 5.5 (Red Team) NOW for this domain. This is the first place Red Team runs in v2.74.12 — there is no global post-execute Red Team anymore. If Red Team returns FAIL, fix bugs and re-run before proceeding to the next domain (max 2 fix-and-verify cycles); if bugs persist, log to `.gsd-t/deferred-items.md` and present to user.
+
+ 7. **Task-count gate re-check** — run `node bin/task-counter.cjs should-stop`. If exit code is `10`, follow the Step 3.5 STOP procedure now (do NOT spawn the next domain).
+
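The exit-code contract used by the re-check (per Step 3.5: `0` means proceed, `10` means stop, anything else is treated as stop under the fail-safe rule) can be handled like this; the wrapper function is a sketch, not part of the package:

```shell
# Map the task-counter gate's exit code to an orchestrator action.
gate_action() {
  rc=0
  node bin/task-counter.cjs should-stop 2>/dev/null || rc=$?
  case $rc in
    0)  echo "continue" ;;          # gate open: spawn the next task/domain
    10) echo "stop" ;;              # limit reached: run the STOP procedure
    *)  echo "fail-safe-stop" ;;    # counter missing or errored: treat as STOP
  esac
}
gate_action
```

Capturing the exit code with `|| rc=$?` keeps the check safe even under `set -e`.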
  ### Team Mode (when agent teams are enabled)
  Spawn teammates for domains within the same wave. Only domains in the same wave can run in parallel — do not spawn teammates for domains in different waves simultaneously. Each teammate uses the **domain task-dispatcher pattern** — one subagent per task within their domain (same as solo mode).
 
@@ -618,28 +605,62 @@ After all merges complete (whether all passed, some rolled back, or errors occur
  Cleanup is not optional — orphaned worktrees waste disk space and can confuse subsequent executions. Always run cleanup, even if earlier steps failed.
  ```
 
- ## Step 3.5: Orchestrator Context Self-Check (MANDATORY)
+ ## Step 3.5: Orchestrator Task-Count Gate (MANDATORY)
 
- After EVERY domain completes (and after every checkpoint), the orchestrator MUST check its own context utilization:
+ The orchestrator MUST check `bin/task-counter.cjs` BEFORE every task subagent spawn AND immediately AFTER every domain completes. This is the real context-burn guardrail. The previous version of this step relied on `CLAUDE_CONTEXT_TOKENS_USED`/`_MAX` env vars which Claude Code does not export — that check was inert and silently let the orchestrator drain context until forced compaction. The replacement below uses a deterministic on-disk task counter.
 
- Run via Bash:
- `if [ "${CLAUDE_CONTEXT_TOKENS_MAX:-0}" -gt 0 ]; then CTX_PCT=$(echo "scale=1; ${CLAUDE_CONTEXT_TOKENS_USED:-0} * 100 / ${CLAUDE_CONTEXT_TOKENS_MAX}" | bc); else CTX_PCT="N/A"; fi && echo "Orchestrator context: ${CTX_PCT}%"`
+ **Before each task spawn — gate check:**
+
+ ```bash
+ node bin/task-counter.cjs should-stop
+ ```
+
+ If the exit code is `10` (counter is at or past its limit), STOP immediately. Do NOT spawn the next task. Jump straight to the checkpoint/STOP procedure below.
+
+ If the exit code is `0`, proceed to spawn the task.
+
+ **After each task subagent returns — increment:**
+
+ ```bash
+ node bin/task-counter.cjs increment task
+ ```
+
+ This prints a JSON status line like `{"count":3,"limit":5,"remaining":2,"should_stop":false,...}`. Use this status when writing the token-log row (the `Tasks-Since-Reset` column).
+
+ If `should_stop` is `true` after the increment, STOP after this task completes — even if more tasks remain in the current domain.
+
+ **STOP procedure (when `should_stop` is true):**
 
- **If CTX_PCT >= 70:**
  1. **Save checkpoint to disk** — update `.gsd-t/progress.md` with:
  - Which domains are complete, which remain
  - Current wave, next domain to execute
- - Any checkpoint results
+ - Last completed task id and the next pending task id
  2. **Instruct user**: Output exactly:
  ```
- ⚠️ Orchestrator context at {CTX_PCT}% — approaching limit.
- Progress saved. Run `/clear` then `/user:gsd-t-execute` to continue from the next domain.
+ ⏸️ Orchestrator task-count gate reached ({count}/{limit} tasks in this session).
+ Progress saved. Run `/clear` then `/user:gsd-t-execute` to continue from the next task.
  ```
- 3. **STOP execution.** Do NOT spawn another domain subagent. The next session will resume from saved state.
+ 3. **STOP execution.** Do NOT spawn another task or domain subagent. The next session resumes from saved state. The first thing the resumed orchestrator does in Step 0 is run `node bin/task-counter.cjs reset` (see below).
+
+ **Configuring the limit:**
+
+ The default limit is 5 tasks per session — conservative, designed for the model+harness combination as of 2026-04-13. Override per-project via `.gsd-t/task-counter-config.json`:
+
+ ```json
+ { "limit": 8 }
+ ```
 
- **If CTX_PCT < 70:** Continue normally to the next domain/wave.
+ Or per-session via env var: `GSD_T_TASK_LIMIT=8 /user:gsd-t-execute`.
 
- This prevents the orchestrator from running out of context mid-milestone, which causes session breaks and summary-based recovery.
+ **On resume (Step 0 — the first thing the orchestrator does in a fresh session):**
+
+ ```bash
+ node bin/task-counter.cjs reset
+ ```
+
+ This clears the counter so the new session starts fresh. The reset is the SIGNAL that this is a clean post-`/clear` session — never reset mid-session.
+
+ This deterministic gate replaces the vaporware env-var check. It is fail-safe: if `bin/task-counter.cjs` is missing for any reason, the `should-stop` command exits non-zero (treated as STOP) rather than silently allowing unlimited spawns.
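For illustration, the counter semantics above can be mirrored in a few lines of plain shell. This is a hypothetical re-implementation for reading purposes only: the counter file format and the `GSD_T_TASK_LIMIT` fallback are assumptions, and the real logic lives in `bin/task-counter.cjs`.

```shell
# Hypothetical mirror of the task-counter gate (illustration only).
COUNTER_FILE=".gsd-t/.task-counter"
LIMIT="${GSD_T_TASK_LIMIT:-5}"

tc_reset()     { mkdir -p "$(dirname "$COUNTER_FILE")"; echo 0 > "$COUNTER_FILE"; }
tc_count()     { cat "$COUNTER_FILE" 2>/dev/null || echo 0; }
tc_increment() { echo $(( $(tc_count) + 1 )) > "$COUNTER_FILE"; }
# Return 10 when the counter is at or past the limit, mirroring `should-stop`.
tc_should_stop() {
  if [ "$(tc_count)" -ge "$LIMIT" ]; then return 10; else return 0; fi
}

tc_reset
tc_increment
```

With one task counted against the default limit of 5, `tc_should_stop` still returns 0, so the gate stays open.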
 
  ## Step 4: Checkpoint Handling
 
@@ -680,285 +701,44 @@ A teammate finishes independent tasks and is waiting on a checkpoint:
  2. If not, have the teammate work on documentation, tests, or code cleanup within their domain
  3. Or shut down the teammate and respawn when unblocked
 
- ## Step 5.25: Design Verification Agent (MANDATORY when design contract exists)
+ ## Step 5.25: Design Verification Agent (per-domain, MANDATORY when design contract exists)
+
+ **IMPORTANT — frequency change in v2.74.12**: Design Verification was previously run once at the end of every execute run, regardless of how many domains existed. It is now run ONCE PER COMPLETED DOMAIN — call this step from the "After all tasks in a domain complete" block (Step 3.5 area), not from a global post-execute hook. This keeps verification co-located with the changes that introduced visual deviation, but stops the agent from re-materializing on every task spawn (which is what commit `b68353e` accidentally caused).
 
- After all domain tasks complete and QA passes, check if `.gsd-t/contracts/design-contract.md` exists. If it does NOT exist, skip this step entirely.
+ After all tasks in the CURRENT DOMAIN complete and per-task QA has passed, check if `.gsd-t/contracts/design-contract.md` exists. If it does NOT exist, skip this step entirely.
 
- If it DOES exist — spawn a **dedicated Design Verification Agent**. This agent's ONLY job is to open a browser, compare the built frontend against the original design, and produce a structured comparison table. It writes NO feature code. Separation of concerns: the coding agent codes, the verification agent verifies.
+ If it DOES exist AND this domain touched UI files — spawn the **Design Verification Agent**. This agent's ONLY job is to open a browser, compare the built frontend against the original design, and produce a structured comparison table. It writes NO feature code.
 
- ⚙ [{model}] Design Verification → visual comparison of built frontend vs design
+ ⚙ [opus] Design Verification → visual comparison for domain {domain-name}
 
  **OBSERVABILITY LOGGING (MANDATORY):**
  Before spawning — run via Bash:
- `T_START=$(date +%s) && DT_START=$(date +"%Y-%m-%d %H:%M") && TOK_START=${CLAUDE_CONTEXT_TOKENS_USED:-0} && TOK_MAX=${CLAUDE_CONTEXT_TOKENS_MAX:-200000}`
-
- ```
- Task subagent (general-purpose, model: opus):
- "You are the Design Verification Agent. Your ONLY job is to visually compare
- the built frontend against the original design and produce a structured
- comparison table. You write ZERO feature code. Your sole deliverable is
- the comparison table and verification results.
-
- FAIL-BY-DEFAULT: Every visual element starts as UNVERIFIED. You must prove
- each one matches — not assume it does. 'Looks close' is not a verdict.
- 'Appears to match' is not a verdict. The only valid verdicts are MATCH
- (with proof) or DEVIATION (with specifics).
-
- ## Step 0: Element Count Reconciliation (MANDATORY — run BEFORE anything else)
-
- Before any visual or property comparison, verify the built page has the
- correct NUMBER of elements. A missing widget is the easiest deviation to
- miss in a 30+ row comparison table — and the most catastrophic.
-
- 1. Read INDEX.md (hierarchical) or design-contract.md (flat) to get the
- Figma element counts:
- - Per-page: how many widgets on this page? How many total elements
- (including widget-internal charts, legends, cards, controls)?
- 2. Count the built page's distinct visual elements via Playwright:
- - Widgets/cards (top-level visual groups)
- - Charts, tables, stat cards, legends, controls within each widget
- 3. Compare:
- - Figma widget count vs built widget count → mismatch = ❌ CRITICAL
- - Figma element count vs built element count → mismatch = ❌ CRITICAL
- 4. If counts match → proceed to Step 0.5
- If counts DON'T match → identify WHICH elements are missing or extra:
- 'Figma has {N} widgets, built page has {M}. MISSING: {list}. EXTRA: {list}'
- Log as CRITICAL deviation — do NOT skip, continue with remaining steps.
-
- ## Step 0.5: Data-Labels Cross-Check (MANDATORY)
-
- Before any visual comparison, verify the built UI is rendering the CORRECT
- DATA from the design. This is the most common failure mode: agents use
- placeholder data (Calculator/Planner/Tracker) while the design shows real
- labels (Steps to Stay Covered/Broker Contact). The verifier then compares
- bar-shapes only and declares MATCH — while the content is catastrophically
- wrong.
-
- 1. For EACH element contract under .gsd-t/contracts/design/elements/
- (or each section of flat .gsd-t/contracts/design-contract.md):
- a. Read the 'Test Fixture' section — extract every label, value, percentage
- b. Open the built UI in the browser
- c. Inspect the rendered element (via DOM or screenshot OCR)
- d. For EACH label/value/percentage in the Test Fixture:
- - Does it appear verbatim in the rendered UI?
- - If NO → immediate ❌ DEVIATION (severity CRITICAL)
- Log: 'Test Fixture label {X} not found in rendered UI. Found instead: {Y}.'
- - If YES → ✅ MATCH for that specific label/value
-
- 2. Count: '{N}/{total} labels+values from Test Fixture appear correctly in UI'
-
- 3. If ANY Test Fixture label or value is missing from the rendered UI:
- The component is rendering WRONG DATA. This is a CRITICAL deviation.
- No amount of visual polish can redeem wrong data. Mark the element
- DEVIATION and continue (do not skip the rest — but flag the severity).
-
- ## Step 1: Get the Design Reference
-
- Read .gsd-t/contracts/design-contract.md for the source reference.
- - If Figma MCP available → call `get_metadata` to enumerate widget/component nodes,
- then call `get_design_context` per widget node to extract structured data
- (code, component properties, design tokens, text content, layout values).
- ⚠ Do NOT use `get_screenshot` for Figma data extraction — it returns pixels
- you cannot extract exact values from. `get_design_context` returns structured
- code and tokens. Use `get_design_context` for extraction, `get_screenshot`
- ONLY if you need a visual reference image for side-by-side comparison.
- - If design image files → locate them from the contract's Source Reference field
- - If no MCP and no images → log CRITICAL blocker to .gsd-t/qa-issues.md and STOP
- You MUST have structured design data (or reference images) before proceeding.
-
- ## Step 2: Build the Element Inventory
-
- Before ANY comparison, enumerate every distinct visual element in the design.
- Walk the design top-to-bottom, left-to-right. For each section:
- - Section title text and icon
- - Every chart/visualization (type, orientation, labels, legend, series count)
- - Every data table (columns, row structure, sort indicators)
- - Every KPI/stat card (value, label, icon, trend indicator)
- - Every button, toggle, tab, dropdown
- - Every text element (headings, body, captions, labels)
- - Every spacing boundary (section gaps, card padding, element margins)
- - Every color usage (backgrounds, borders, text, chart fills)
- Write each element as a row for the comparison table.
- If the inventory has fewer than 20 elements for a full page, you missed items.
-
- Data visualizations MUST expand into multiple rows:
- Chart type, chart orientation, axis labels, axis grid lines, legend position,
- data labels placement, chart colors per series, bar width/spacing,
- center text (donut/pie), tooltip style — each a SEPARATE element.
-
- ## Step 3: Open Side-by-Side Browser Sessions
-
- Start the dev server (npm run dev, or project equivalent).
- Open TWO browser views simultaneously for direct visual comparison:
-
- VIEW 1 — BUILT FRONTEND:
- Open the implemented page using Claude Preview, Chrome MCP, or Playwright.
- Navigate to the exact route/component being verified.
- You MUST see real rendered output — not just read the code.
-
- VIEW 2 — ORIGINAL DESIGN REFERENCE (structured data, not just images):
- If Figma MCP available → you already have `get_design_context` data from Step 1.
- Use the STRUCTURED DATA (component properties, text content, layout values,
- colors, spacing) as the authoritative design reference — not screenshots.
- Optionally open the Figma URL in a browser for visual context, but extract
- values from `get_design_context` responses, not from visual inspection.
- If design image file → open the image in a browser tab/window.
- Use: file://{absolute-path-to-image} or render in an HTML page.
- If no Figma MCP → use reference images from the design contract.
-
- COMPARISON APPROACH:
- For each widget/component, compare the BUILT DOM/styles against the
- STRUCTURED values from `get_design_context`:
- - Chart type: does the built component match the Figma node's structure?
- - Text content: do titles, labels, legends match `get_design_context` text?
- - Layout: do spacing, alignment, sizing match the structured properties?
- - Colors: do fills, strokes, text colors match the exact hex values?
- Capture implementation screenshots at each target breakpoint:
- Mobile (375px), Tablet (768px), Desktop (1280px) minimum.
- Compare screenshots against Figma for overall visual impression,
- but use `get_design_context` data for the authoritative value comparison.
-
- If Claude Preview, Chrome MCP, and Playwright are ALL unavailable:
- This is a CRITICAL blocker. Log to .gsd-t/qa-issues.md:
- 'CRITICAL: No browser tools available for visual verification.'
- STOP — the verification CANNOT proceed without a browser.
-
- ## Step 4: Structured Element-by-Element Comparison (MANDATORY FORMAT)
-
- Produce a comparison table with this exact structure. Every element from
- the inventory gets its own row. No summarizing, no grouping, no prose.
-
- | # | Section | Element | Design (specific) | Implementation (specific) | Verdict |
- |---|---------|---------|-------------------|--------------------------|---------|
- | 1 | Summary | Chart type | Horizontal stacked bar | Vertical grouped bar | ❌ DEVIATION |
- | 2 | Summary | Chart colors | #4285F4, #34A853, #FBBC04 | #4285F4, #34A853, #FBBC04 | ✅ MATCH |
-
- Rules:
- - 'Design' column: SPECIFIC values from `get_design_context` structured data
- (chart type name, hex color, px size, font weight, text content)
- - 'Implementation' column: SPECIFIC observed values from the built page DOM/styles
- - Verdict: only ✅ MATCH or ❌ DEVIATION — never 'appears to match' or 'need to verify'
- - NEVER write 'Appears to match' or 'Looks correct' — measure and verify
- - If the table has fewer than 30 rows for a full-page comparison, you skipped elements
-
- ## Step 5: SVG Structural Overlay Comparison (MANDATORY)
-
- After the property-level comparison, run a mechanical SVG-based diff to catch
- aggregate visual drift that individual property checks miss.
-
- 1. Export the Figma frame as SVG:
- - Use the Figma REST API or MCP to export the page/frame as SVG
- - If export is unavailable, ask the user to export and provide the SVG path
- - Store the SVG at .gsd-t/design-verify/{page-name}-figma.svg
- 2. Parse the SVG DOM: extract every <rect>, <text>, <circle>, <path>, <g>
- with their positions (x, y), dimensions (width, height), fills, strokes,
- and text content
- 3. Screenshot the built page at the same viewport width via Playwright
- 4. Inspect the built page DOM: extract element bounding boxes, computed
- styles (colors, dimensions), and text content
- 5. Map SVG elements → built DOM elements by:
- - Text content matching (highest confidence)
- - Position proximity (x,y within 10px tolerance)
- - Dimensional similarity (width/height within 10% tolerance)
- 6. For each mapped pair, compare:
- - Position: SVG (x,y) vs DOM bounding box (x,y). Within 2px = MATCH
- - Dimensions: SVG (w,h) vs DOM (w,h). Within 2px = MATCH
- - Colors: SVG fill/stroke vs computed CSS color. Exact hex = MATCH
- - Text: SVG <text> content vs DOM textContent. Exact = MATCH
- 7. Produce an SVG structural diff table:
- | # | SVG Element | SVG Position | Built Position | Δ px | Verdict |
- Threshold: ≤2px = ✅ MATCH, 3-5px = ⚠ REVIEW, >5px = ❌ DEVIATION
- 8. Unmapped SVG elements (no DOM match) → flag as MISSING IN BUILD
- Unmapped DOM elements (no SVG match) → flag as EXTRA IN BUILD
- 9. Generate a visual overlay image (optional but recommended):
- - Render SVG in browser at target viewport size
- - Overlay on built page screenshot with 50% opacity or difference blend
- - Save to .gsd-t/design-verify/{page-name}-overlay.png
-
- This step catches spacing rhythm, alignment drift, and proportion issues
- that pass the property-level check but are visually wrong in aggregate.
-
- ## Step 5.5: DOM Box Model Inspection (MANDATORY for fixed-height containers)
-
- The property table catches wrong values. The SVG overlay catches wrong positions.
- This step catches wrong SPACE DISTRIBUTION — elements whose box model is inflated
- by flex growth, pushing siblings out of position even when the visual appears close.
-
- For each card/widget with a fixed height (container_height is not 'auto'):
-
- 1. Use Playwright to evaluate in the browser:
- ```javascript
- // For each child element of the card body:
- const children = await page.$$eval('.card-body > *', els =>
- els.map(el => ({
- selector: el.className,
- offsetHeight: el.offsetHeight,
- scrollHeight: el.scrollHeight,
- computedFlex: getComputedStyle(el).flex,
- computedFlexGrow: getComputedStyle(el).flexGrow,
- }))
- );
- ```
-
- 2. Flag any element where `offsetHeight > scrollHeight * 1.5`:
- This means the element's layout box is ≥50% larger than its content.
- Symptom: element is using `flex: 1` or `flex-grow: 1` and inflating.
- ❌ DEVIATION (severity HIGH): '{selector} offsetHeight={X}px but
- content only needs {scrollHeight}px — inflated by flex growth.
- Fix: remove flex:1 from this element, apply justify-content:center
- on its parent container instead.'
-
- 3. Verify layout arithmetic:
- - Read the widget contract's Internal Layout Arithmetic section
- - Sum all child offsetHeights + computed gaps
- - Compare against the card body's offsetHeight
- - If sum > body height → ❌ DEVIATION: content overflows
- - If sum < body height by >20px with no centering strategy → ❌ DEVIATION
-
- 4. Produce box model table:
- | # | Element | offsetHeight | scrollHeight | flex-grow | Verdict |
- |---|---------|-------------|-------------|-----------|---------|
- | 1 | .kpi | 144px | 40px | 1 | ❌ INFLATED |
- | 2 | .chart | 74px | 74px | 0 | ✅ MATCH |
-
- ## Step 6: Report Deviations
-
- For each ❌ DEVIATION, write a specific finding:
- 'Design: {exact value}. Implementation: {exact value}. File: {path}:{line}'
+ `T_START=$(date +%s) && DT_START=$(date +"%Y-%m-%d %H:%M")`
 
- Write the FULL comparison table (property-level from Step 4 + SVG structural
- from Step 5) to .gsd-t/contracts/design-contract.md under a
- '## Verification Status' section.
+ Resolve the templated prompt path first so the orchestrator never holds the full ~3500-token verification protocol in its own context:
 
- Any ❌ DEVIATION → also append to .gsd-t/qa-issues.md with severity HIGH
- and tag [VISUAL]:
- | {date} | gsd-t-execute | Step 5.25 | opus | {duration} | HIGH | [VISUAL] {description} |
-
- ## Step 7: Verdict
-
- Count results: '{MATCH_COUNT}/{TOTAL} elements match at {breakpoints} breakpoints'
-
- VERDICT:
- - ALL rows ✅ MATCH → DESIGN VERIFIED
- - ANY rows ❌ DEVIATION → DESIGN DEVIATIONS FOUND ({count} deviations)
+ ```bash
+ DV_PROMPT="$(npm root -g 2>/dev/null)/@tekyzinc/gsd-t/templates/prompts/design-verify-subagent.md"
+ [ -f "$DV_PROMPT" ] || DV_PROMPT="templates/prompts/design-verify-subagent.md"
+ ```
 
- Write verdict to .gsd-t/contracts/design-contract.md Verification Status section.
+ Then spawn the subagent with this short prompt:
 
- Report back:
- - Verdict: DESIGN VERIFIED | DESIGN DEVIATIONS FOUND
- - Match count: {N}/{total}
- - Breakpoints verified: {list}
- - Deviations: {count with summary of each}
- - Comparison table: {the full table}"
+ ```
+ Task subagent (general-purpose, model: opus):
+ "You are the Design Verification Agent. Read $DV_PROMPT and follow it exactly.
+ Do not deviate from that protocol. Context for this run:
+ - domain: {domain-name}
+ - design contract: .gsd-t/contracts/design-contract.md
+ - files modified by this domain: {list}
+ Report back the verdict, match count, breakpoints verified, deviation count
+ and summary, and the full comparison table per the protocol's Step 7."
  ```
 
  After subagent returns — run via Bash:
- `T_END=$(date +%s) && DT_END=$(date +"%Y-%m-%d %H:%M") && TOK_END=${CLAUDE_CONTEXT_TOKENS_USED:-0} && DURATION=$((T_END-T_START))`
- Compute tokens and compaction:
- - No compaction (TOK_END >= TOK_START): `TOKENS=$((TOK_END-TOK_START))`, COMPACTED=null
- - Compaction detected (TOK_END < TOK_START): `TOKENS=$(((TOK_MAX-TOK_START)+TOK_END))`, COMPACTED=$DT_END
+ `T_END=$(date +%s) && DT_END=$(date +"%Y-%m-%d %H:%M") && DURATION=$((T_END-T_START)) && COUNTER_JSON=$(node bin/task-counter.cjs status 2>/dev/null || echo '{}') && COUNTER=$(echo "$COUNTER_JSON" | node -e "let s=''; process.stdin.on('data',d=>s+=d).on('end',()=>{try{process.stdout.write(String(JSON.parse(s).count||''))}catch(_){process.stdout.write('')}})")`
 
  Append to `.gsd-t/token-log.md`:
- `| {DT_START} | {DT_END} | gsd-t-execute | Design Verify | opus | {DURATION}s | {VERDICT} — {MATCH}/{TOTAL} elements | {TOKENS} | {COMPACTED} | | | {CTX_PCT} |`
+ `| {DT_START} | {DT_END} | gsd-t-execute | Design Verify | opus | {DURATION}s | {VERDICT} — {MATCH}/{TOTAL} elements for {domain-name} | | | {COUNTER} |`
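If a second `node` round-trip is undesirable, the `count` field can also be pulled from the flat status line with `sed`. This is a deliberately fragile sketch that assumes the single-line, unnested JSON shape shown in Step 3.5:

```shell
# Extract "count" from a flat, single-line status JSON (example input shown).
COUNTER_JSON='{"count":3,"limit":5,"remaining":2,"should_stop":false}'
COUNTER=$(printf '%s' "$COUNTER_JSON" | sed -n 's/.*"count":\([0-9]*\).*/\1/p')
```

A real JSON parser remains the safer choice if the status format ever gains nesting or reordered keys.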
 
  **Artifact Gate (MANDATORY):**
  After the Design Verification Agent returns, check `.gsd-t/contracts/design-contract.md`:
@@ -975,113 +755,43 @@ After the Design Verification Agent returns, check `.gsd-t/contracts/design-cont
 
  **If VERDICT is DESIGN VERIFIED:** Proceed to Red Team.
 
- ## Step 5.5: Red Team — Adversarial QA (MANDATORY)
+ ## Step 5.5: Red Team — Adversarial QA (per-domain, MANDATORY)
+
+ **IMPORTANT — frequency change in v2.74.12**: Red Team was promoted to per-task by commit `da6d3ae` on the assumption that the orchestrator would catch context drain via the `CLAUDE_CONTEXT_TOKENS_USED` self-check. That env var is never set by Claude Code, so the check was inert and the per-task spawning of ~10k-token Red Team subagents was the largest single contributor to the v2.74.x context-burn regression. Red Team is now run ONCE PER COMPLETED DOMAIN — call this step from the "After all tasks in a domain complete" block, not from a per-task hook.
 
- After all domain tasks pass their tests, spawn an adversarial Red Team agent. This agent's sole purpose is to BREAK the code that was just built. It operates with inverted incentives — its success is measured by bugs found, not tests passed.
+ After all tasks in the CURRENT DOMAIN pass their tests, spawn an adversarial Red Team agent. Its sole purpose is to BREAK the domain that was just built.
 
- ⚙ [{model}] Red Team → adversarial validation of executed domains
+ ⚙ [opus] Red Team → adversarial validation for domain {domain-name}
 
  **OBSERVABILITY LOGGING (MANDATORY):**
  Before spawning — run via Bash:
- `T_START=$(date +%s) && DT_START=$(date +"%Y-%m-%d %H:%M") && TOK_START=${CLAUDE_CONTEXT_TOKENS_USED:-0} && TOK_MAX=${CLAUDE_CONTEXT_TOKENS_MAX:-200000}`
+ `T_START=$(date +%s) && DT_START=$(date +"%Y-%m-%d %H:%M")`
+
+ Resolve the templated prompt path so the orchestrator never holds the full ~3500-token Red Team protocol in its own context:
+
+ ```bash
+ RT_PROMPT="$(npm root -g 2>/dev/null)/@tekyzinc/gsd-t/templates/prompts/red-team-subagent.md"
+ [ -f "$RT_PROMPT" ] || RT_PROMPT="templates/prompts/red-team-subagent.md"
+ ```
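The same prefer-global-then-local resolution pattern can be sketched with a hypothetical template path — `@example/pkg` and `templates/example.md` below are illustrative names, not part of gsd-t:

```shell
# Hedged sketch of the resolve-with-fallback pattern above.
# `@example/pkg` and `templates/example.md` are hypothetical names.
GLOBAL_ROOT="$(npm root -g 2>/dev/null)"                 # empty when npm is absent
PROMPT="$GLOBAL_ROOT/@example/pkg/templates/example.md"  # preferred: global install
[ -f "$PROMPT" ] || PROMPT="templates/example.md"        # fallback: repo-local copy
echo "$PROMPT"
```

Because `-f` tests for an existing regular file, the fallback also covers the case where `npm root -g` succeeds but the package is not installed globally.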
+
+ Then spawn the subagent with this short prompt:
 
  ```
  Task subagent (general-purpose, model: opus):
- "You are a Red Team QA adversary. Your job is to BREAK the code that was just written.
-
- Your value is measured by REAL bugs found. More bugs = more value.
- If you find zero bugs, you must prove you were thorough — list every
- attack vector you tried and why it didn't break. A short list means
- you didn't try hard enough.
-
- Rules:
- - False positives DESTROY your credibility. If you report something
- as a bug and it's actually correct behavior, that's worse than
- missing a real bug. Never report something you haven't reproduced.
- - Style opinions are not bugs. Theoretical concerns are not bugs.
- A bug is: 'I did X, expected Y, got Z.' With proof.
- - You are done ONLY when you have exhausted every category below
- and either found a bug or documented exactly what you tried.
-
- ## Attack Categories (exhaust ALL of these)
-
- 1. **Contract Violations**: Read .gsd-t/contracts/. Does the code EXACTLY
- match every contract? Test each endpoint/interface/schema shape.
- 2. **Boundary Inputs**: Empty strings, null, undefined, huge payloads,
- special characters, SQL injection attempts, XSS payloads, path traversal.
- 3. **State Transitions**: What happens when actions are performed out of
- order? Double-submit? Concurrent access? Refresh mid-flow?
- 4. **Error Paths**: Remove env vars. Kill the database. Send malformed
- requests. Does the code handle failures gracefully or crash?
- 5. **Missing Flows**: Read docs/requirements.md. Are there user flows that
- exist in requirements but have NO test coverage? Write tests for them.
- 6. **Regression**: Run the FULL test suite. Did any existing tests break?
- 7. **E2E Functional Gaps**: Review ALL Playwright specs. Do they test actual
- behavior (state changes, data loaded, navigation works) or just check
- that elements exist? Flag and rewrite any shallow/layout tests.
- 8. **Design Fidelity** (if .gsd-t/contracts/design-contract.md exists):
- FAIL-BY-DEFAULT: assume NOTHING matches. Prove each element individually.
- a. Open every implemented screen in a real browser. Screenshot at mobile
- (375px), tablet (768px), desktop (1280px). Get Figma reference via
- `get_design_context` per widget node (structured data — NOT `get_screenshot`).
- b. Build an element inventory: enumerate every distinct visual element
- in the design top-to-bottom. Every chart, label, icon, heading, card,
- spacing boundary, and color. Data visualizations expand: chart type,
- orientation, axis labels, legend position, bar colors, data labels,
- grid lines, center text — each a separate item.
- c. Produce a structured comparison table (MANDATORY):
- | # | Section | Element | Design (specific) | Implementation (specific) | Verdict |
- Every element gets specific values in both columns (hex colors, chart
- type names, px sizes, font weights — never vague descriptions).
- Only valid verdicts: ✅ MATCH or ❌ DEVIATION.
- NEVER write "appears to match" or "looks correct."
- d. Any ❌ DEVIATION is a CRITICAL bug with full reproduction:
- 'Design: horizontal stacked bar with % labels inside bars.
- Build: vertical grouped bar with labels above bars.' — this is a bug.
- 'Design: 32px Inter SemiBold. Build: 24px Inter Regular.' — this is a bug.
- e. If the comparison table has fewer than 30 rows for a full page, the
- audit is incomplete — go back and find the missing elements.
-
- ## Exploratory Testing (if Playwright MCP available)
-
- After all scripted tests pass:
- 1. Check if Playwright MCP is registered in Claude Code settings (look for "playwright" in mcpServers)
- 2. If available: spend 5 minutes on adversarial interactive exploration using Playwright MCP
- - Attempt race conditions, double-submits, concurrent access patterns
- - Try unexpected input sequences, boundary values, rapid state transitions
- - Probe error recovery: does the app recover after failures or get stuck?
- 3. Tag all findings [EXPLORATORY] in your report
- 4. If Playwright MCP is not available: skip this section silently
- Note: Exploratory findings are additive — they do not replace scripted test results.
-
- ## Report Format
-
- For each bug found:
- - **BUG-{N}**: {severity: CRITICAL/HIGH/MEDIUM/LOW}
- - **Reproduction**: {exact steps to reproduce}
- - **Expected**: {what should happen}
- - **Actual**: {what actually happens}
- - **Proof**: {test file or command that demonstrates the bug}
-
- Summary:
- - BUGS FOUND: {count} (with severity breakdown)
- - COVERAGE GAPS: {untested flows from requirements}
- - SHALLOW TESTS REWRITTEN: {count}
- - CONTRACTS VERIFIED: {N}/{total}
- - ATTACK VECTORS TRIED: {list every category attempted and results}
- - VERDICT: FAIL ({N} bugs found) | GRUDGING PASS (exhaustive search, nothing found)
-
- Write all findings to .gsd-t/red-team-report.md.
- If bugs found, also append to .gsd-t/qa-issues.md."
+ "You are a Red Team QA adversary. Read $RT_PROMPT and follow it exactly.
+ Do not deviate from that protocol. Context for this run:
+ - domain: {domain-name}
+ - files modified by this domain: {list}
+ - tasks just completed: {task-id list}
+ Report back the verdict (FAIL or GRUDGING PASS), bugs found by severity,
+ attack categories exhausted, and the path to the written
+ .gsd-t/red-team-report.md."
  ```
 
  After subagent returns — run via Bash:
- `T_END=$(date +%s) && DT_END=$(date +"%Y-%m-%d %H:%M") && TOK_END=${CLAUDE_CONTEXT_TOKENS_USED:-0} && DURATION=$((T_END-T_START))`
- Compute tokens and compaction:
- - No compaction (TOK_END >= TOK_START): `TOKENS=$((TOK_END-TOK_START))`, COMPACTED=null
- - Compaction detected (TOK_END < TOK_START): `TOKENS=$(((TOK_MAX-TOK_START)+TOK_END))`, COMPACTED=$DT_END
+ `T_END=$(date +%s) && DT_END=$(date +"%Y-%m-%d %H:%M") && DURATION=$((T_END-T_START)) && COUNTER=$(node bin/task-counter.cjs status 2>/dev/null | node -e "let s='';process.stdin.on('data',d=>s+=d).on('end',()=>{try{process.stdout.write(String(JSON.parse(s).count||''))}catch(_){process.stdout.write('')}})")`
 
  Append to `.gsd-t/token-log.md`:
- `| {DT_START} | {DT_END} | gsd-t-execute | Red Team | sonnet | {DURATION}s | {VERDICT} — {N} bugs found | {TOKENS} | {COMPACTED} | | | {CTX_PCT} |`
+ `| {DT_START} | {DT_END} | gsd-t-execute | Red Team | opus | {DURATION}s | {VERDICT} — {N} bugs found in {domain-name} | | | {COUNTER} |`
 
  **If Red Team VERDICT is FAIL:**
  1. Fix all CRITICAL and HIGH bugs immediately (up to 2 fix attempts per bug)