npm - @sun-asterisk/sungen - Versions diffs - 3.2.2-beta.1 → 3.2.2-beta.10 - Mend

@sun-asterisk/sungen 3.2.2-beta.1 → 3.2.2-beta.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (45) hide show

package/src/orchestrator/templates/ai-instructions/claude-cmd-run-test.md CHANGED Viewed

@@ -9,6 +9,8 @@ allowed-tools: Read, Grep, Bash, Glob, Edit, Write, AskUserQuestion, mcp__playwr
 You are a **Senior Developer**. Use `sungen-selector-fix`, `sungen-selector-keys`, and `sungen-error-mapping` skills.
+> ⛔ **Source of truth — the live page is NOT the oracle; `.feature`/`test-data`/`spec.md` are.** Auto-fix is for **selector-resolution** failures (wrong locator → fix `selectors.yaml`). An **assertion-value** failure where the app contradicts the spec is a **CANDIDATE BUG → report it, let it FAIL** — never loosen the rule, weaken the assertion, edit the expected value/`.feature`, or hand-edit the generated `.spec.ts` to make it pass. See `sungen-error-mapping` § "Source of truth". (A `password > 8` test that fails on 6 chars is a bug to report, not a `>= 6` edit.)
 ## Parameters
 Parse from `$ARGUMENTS`:
@@ -102,6 +104,7 @@ If the unit is **api-first**, skip every selector/capture phase (an API test has
 9. **Integrity check & trace (always run after the final run).**
    - `sungen script-check --screen <name>` — verify the generated spec is a **1:1** of the Gherkin (every non-@manual scenario ↔ one `test()`, no drift). If it reports **DRIFT** (spec hand-edited or stale), re-run `sungen generate --screen <name>` so the spec matches the feature, then re-run — **never hand-edit the generated spec** (auto-fix must edit `selectors.yaml`, not the `.spec.ts`).
    - `sungen ledger record --screen <name> --step run --ms <elapsed>` (record this run), then `sungen trace --screen <name>` — show the process map + bottlenecks + **HUMAN-LOOP FOCUS** (the @manual scenarios the QA must verify) to the user.
+   - **Phase gate (boundary — do NOT skip).** `sungen gate --screen <name> --phase run` (exit 2 = HALT): the run-boundary obligations (incl. automation) must be **satisfied or explicitly waived**. On **HALT**, classify per `sungen-error-mapping` § Source of truth (#387): a **selector-resolution** failure → fix `selectors.yaml` + re-run; an **assertion-vs-spec** failure → **report it as a candidate bug / leave it FAIL** (never weaken the assertion or edit the expected to pass); a genuinely-accepted gap → `sungen journey --screen <name> --waive <OB> --reason "..."`. Do **not** declare the run "done" past a HALT without a fix, a reported bug, or a reasoned waiver.
 10. **Capability-pending offer (consent-gated).** If `sungen audit --screen <name>` reports `AUTOMATION-READY-PENDING` (or the run shows `@requires:<cap>` tests skipped "requires …"), these are **automation-ready** scenarios waiting on an opt-in driver. Use `AskUserQuestion` to offer: *"N scenario(s) are automation-ready — enable `<cap>` to run them? (`sungen capability add <cap>`)"*. **Only on the user's yes** run `sungen capability add <cap>` then re-run those specs; on no, leave them skipped (they are NOT failures and NOT manual). **Never auto-install.**
 ## Playwright command guidelines

package/src/orchestrator/templates/ai-instructions/claude-skill-error-mapping.md CHANGED Viewed

@@ -21,6 +21,23 @@ Then choose the fix from the patterns below.
 ---
+## ⛔ Source of truth — classify EVERY failure before you "fix" it
+`.feature` + `test-data.yaml` + `spec.md` are the **oracle**. The **live page is NOT** — it may be the thing that's broken. A failing test is not automatically a test to "make pass". Classify first:
+- **Selector-resolution failure** (element not found / wrong locator / strict-mode / wrong element type) → the test looked in the wrong place. **Fix the locator in `selectors.yaml`** (re-snapshot, copy the exact accessible name). Legit auto-fix.
+- **Assertion-value failure** (element FOUND, but observed value ≠ expected) → STOP and ask: *is the TEST wrong, or is the APP wrong?*
+  - Expected value/rule is wrong **relative to `spec.md`** (typo, stale test-data) → fix `test-data.yaml`/`.feature` so it matches the **spec** — never the live page.
+  - App behaviour contradicts `spec.md` (spec says X, app shows Y) → **CANDIDATE BUG**. **Report it** (let the test FAIL / surface to the QA in the run summary). **NEVER** change the expected value, loosen the rule, weaken the assertion (`toHaveText`→`toContainText` to dodge a mismatch), edit `.feature`, or edit the generated `.spec.ts` to make it pass.
+> **Cardinal sin (do NOT do this):** a `password > 8 chars` rule fails on a 6-char input → "fix" it to `>= 6` so the test passes. The logic is now meaningless. A failing assertion is a **finding**, not a chore.
+**Auto-fix loop scope:** the run-test auto-fix loop engages ONLY on **selector-resolution** failures. On an assertion-value failure where the app contradicts the spec → **HALT and report**, do not loop it into passing.
+**Never hand-edit the generated `.spec.ts`** (e.g. inserting `page.evaluate`/`fetch` to bypass a broken control). `sungen script-check` regenerates the spec from `.feature` and flags any edit as DRIFT — regenerate, don't patch.
+---
 ## Fix Priority (try in order)
 1. **Auth issue** — page redirected to login? Fix auth first, everything else is noise
@@ -43,11 +60,13 @@ Then choose the fix from the patterns below.
 | not a select | Custom dropdown, not native `<select>` | Set `variant: 'custom'` |
 | Frame not found | iframe selector wrong or doesn't exist | Fix `frame` value, verify iframe in snapshot |
-### Assertion errors → fix in `test-data.yaml` or `.feature`
+### Assertion errors → apply the Source-of-truth gate above FIRST
-| Error | Diagnosis | Fix |
+> The "Fix" column below applies **only when the expected value was wrong relative to `spec.md`** (a test defect). If the app's value contradicts the spec, the row is a **candidate bug → report it, do not edit the expected to match live**. Never weaken `toHaveText`→`toContainText` just to pass.
+| Error | Diagnosis | Fix (only if the TEST was wrong per spec) |
 |---|---|---|
-| toHaveText mismatch | Expected text differs from actual | Fix value in test-data. If element is input type → change Gherkin type to `field`/`textarea` (triggers `toHaveValue` instead) |
+| toHaveText mismatch | Expected text differs from actual | If the test's expected was wrong per spec → fix value in test-data. If element is input type → change Gherkin type to `field`/`textarea` (triggers `toHaveValue`). If the app value contradicts spec → **report as bug**. |
 | toHaveValue mismatch | Expected value differs from actual | Fix value in test-data |
 | toContainText mismatch | Partial text not found | Fix expected partial text in test-data |
 | toBeVisible timeout | Element exists but hidden, or name wrong | Check: is element conditionally visible? Wrong name? Inside dialog? |

package/src/orchestrator/templates/ai-instructions/claude-skill-tc-generation.md CHANGED Viewed

@@ -275,6 +275,7 @@ Security:         [S1 – admin only]
 **Depth is a GATE dimension (harness-roadmap P1) — self-raise, never silently go shallow:**
 - For every data-correctness theme the catalog marks `depth.requires: data-assertion`, emit its `depth.template` shape by **default** — don't wait for the repair loop. `sungen audit` measures `businessDepth` (ratio of these scenarios that assert data) against an intent threshold (functional ≥ 0.70); below it the **gate FAILs**.
+- **Verify depth deterministically before the gate:** run `sungen depth-lint --screen <name>`. It classifies every shallow business-critical scenario into **deepen-in-place** (add the theme's value assertion — the printed `template` is a hint, fit it to the actual claim) vs **cross-screen** (route to a flow / `@manual:Mx`). Clear the `deepen` list first — this is the mechanical way to hit `businessDepth` on the first pass instead of churning repair rounds. Never fake a value assertion onto a visibility/behavior scenario the lint over-counts; leave it and note the over-count.
 - `depth.cross_screen: true` (cart / detail / filter / brand correctness) → write the deep capture/compare shape as an **automated flow scenario** (in the flow — do NOT leave a full-step `@manual` duplicate on the screen). `@manual` is **only** for genuine judgment (M6 visual/UX · M8 not-worth · M9 human) or a missing capability (M1–M5/M7), and it **must** carry a reason code (`@manual:Mx`, or a reason comment the planner can infer). A `@manual` scenario that still has full automatable steps (a data assertion, no visual/mock/a11y judgment) is now flagged by `sungen audit` as `MANUAL-AUTOMATABLE`, and business-critical scenarios you defer to `@manual` are reported as `DEPTH-DEFERRED` (they do NOT silently inflate `businessDepth`). Deferring automatable work to `@manual` lowers quality — automate it in the flow instead.
 - **Pick the right `@manual:Mx` code — it decides which driver can later automate the case** (`sungen audit` flags a code↔reason mismatch). Tag the code that matches the **oracle the reason describes**:

package/src/orchestrator/templates/ai-instructions/copilot-cmd-create-test.md CHANGED Viewed

@@ -66,7 +66,9 @@ If the unit is **api-first** (`qa/api/<name>/` or `qa/api/flows/<name>/`), the d
 4. Follow the `sungen-tc-generation` skill for section identification, viewpoint generation, and output format. **For flows**, use the "Flow Test Generation" section in the skill. When requirements exist, use the "Requirements-Driven Generation" strategy. **For Tier 1**, apply the **Lightweight Guard** — verify required fields, validation rules, business rules, security checks, and key state transitions all have TCs after generation. **For Tier 2+**, **MUST** apply the full **Mapping Contract** — walk every `spec.md` section top-to-bottom and produce the indicated TCs per Table 1; handle `test-viewpoint.md` per Table 2. Do not silently skip sections. Present sections as a numbered list and let user pick.
 5. Generate or update `.feature` + `test-data.yaml` following `sungen-gherkin-syntax` and `sungen-tc-generation` skills. Generate **group-by-group** (one viewpoint group at a time, tier-by-tier `Write`/`Edit` batches) to stay under the output-token cap. **For flows**: use `[Screen:Element]` namespace format, namespace test-data by phase, add `@flow` tag.
    > **No parallel fan-out here.** Copilot has no sub-agents, so generation is sequential (the Claude Code variant fans out one `sungen-generator` per viewpoint group and merges). Same output, no speedup.
+5.4. **Depth self-check (deterministic — BEFORE the audit).** Run `sungen depth-lint --screen ${input:name}`. It splits every shallow business-critical scenario into **DEEPEN IN PLACE** (add a real value assertion — the printed `template` is a theme-keyed hint, apply judgment to the actual claim; never fake one onto a visibility/behavior scenario) and **CROSS-SCREEN** (route to a flow / tag `@manual:Mx` + reason — removes it from the depth denominator honestly). Act on both, re-run until `deepen` is empty (or only honest over-counts remain), THEN gate. Lifts first-pass `businessDepth` mechanically instead of via 2–3 repair rounds.
 5.5. **Quality gate & repair (harness — always run).** Per `sungen-harness-audit`: run `sungen audit --screen ${input:name}` (structural), THEN do an **independent semantic review inline** using the `sungen-reviewer` criteria (does each scenario's steps PROVE its title/viewpoint? observable Thens? business-critical assertion depth?). Merge both sets of issues; if gate FAILs / findings exist, repair (budget 3) and re-audit — GATE missing theme → generate it (cross-screen → **automate it in the flow** via `/sungen:add-flow`, NOT a full `@manual` screen duplicate — `sungen audit` flags an automatable `@manual` as `MANUAL-AUTOMATABLE`; reserve `@manual:Mx` for true judgment/missing-capability); DEPTH → add data assertions; BALANCE → add business-core first; TRACE → align VP ids. Never fake a pass.
+5.5b. **Phase gate (boundary — do NOT skip).** Run `sungen gate --screen ${input:name} --phase create` (exit 2 = HALT): every required obligation (spec · coverage · depth · trace) must be **satisfied or explicitly waived**. On **HALT**, keep repairing within budget; a genuinely-accepted gap → `sungen journey --screen ${input:name} --waive <OB> --reason "..."` (reason mandatory). Do **not** converge (step 6) past a HALT without a fix or a reasoned waiver.
 5.6. **Record.** `sungen manifest --screen ${input:name}`. Ledger **each phase** (not just repair) — pick one `runId` at the start and pass it so `trace`/`ledger report` show THIS run, not a mix: `sungen ledger record --screen ${input:name} --run <runId> --step <discovery|viewpoint|gherkin|audit|repair:N> --ms <elapsed>`. On re-run, start with `sungen manifest --screen ${input:name} --diff` and only regenerate changed sections.
 6. **Converge — show the trace.** Run `sungen trace --screen ${input:name}` and present: process map (phases + repair rounds), bottlenecks, **HUMAN-LOOP FOCUS** (@manual to verify), audit score + gate + residual gaps. Then offer next steps based on which tier was just generated:

package/src/orchestrator/templates/ai-instructions/copilot-cmd-run-test.md CHANGED Viewed

@@ -9,6 +9,8 @@ tools: [read, execute, edit, vscode/askQuestions, playwright/*]
 You are a **Senior Developer**. Use `sungen-selector-fix`, `sungen-selector-keys`, and `sungen-error-mapping` skills.
+> ⛔ **Source of truth — the live page is NOT the oracle; `.feature`/`test-data`/`spec.md` are.** Auto-fix is for **selector-resolution** failures (wrong locator → fix `selectors.yaml`). An **assertion-value** failure where the app contradicts the spec is a **CANDIDATE BUG → report it, let it FAIL** — never loosen the rule, weaken the assertion, edit the expected value/`.feature`, or hand-edit the generated `.spec.ts` to make it pass. See `sungen-error-mapping` § "Source of truth". (A `password > 8` test that fails on 6 chars is a bug to report, not a `>= 6` edit.)
 ## Parameters
 Parse from `$ARGUMENTS`:
@@ -93,6 +95,7 @@ If the unit is **api-first**, skip every selector/capture phase (an API test has
 7. **Phase 3 — Full Run**: Run all tests. Fix only **new** failures (elements unique to `@normal`/`@low`). Max 1 attempt. Don't loop on low-priority failures.
 8. **Phase 4 — Regression**: One final full run. Report results. No more fix loops.
 9. **Integrity & trace (always run after the final run).** `sungen script-check --screen <name>` — verify the spec is a **1:1** of the Gherkin; if **DRIFT**, re-run `sungen generate --screen <name>` (never hand-edit the `.spec.ts` — auto-fix edits `selectors.yaml`). Then `sungen ledger record --screen <name> --step run --ms <elapsed>` and `sungen trace --screen <name>` to show the process map + bottlenecks + **HUMAN-LOOP FOCUS**.
+9b. **Phase gate (boundary — do NOT skip).** `sungen gate --screen <name> --phase run` (exit 2 = HALT): run-boundary obligations (incl. automation) must be **satisfied or explicitly waived**. On HALT, classify per `sungen-error-mapping` § Source of truth (#387): selector-resolution failure → fix `selectors.yaml` + re-run; assertion-vs-spec failure → **report as a candidate bug / leave it FAIL** (never weaken to pass); accepted gap → `sungen journey --screen <name> --waive <OB> --reason "..."`. Don't declare "done" past a HALT without a fix, a reported bug, or a reasoned waiver.
 10. **Capability-pending offer (consent-gated).** If `sungen audit` reports `AUTOMATION-READY-PENDING` (or `@requires:<cap>` tests are skipped "requires …"), offer: *"N scenario(s) are automation-ready — enable `<cap>` to run them? (`sungen capability add <cap>`)"*. Only on the user's yes, run `sungen capability add <cap>` + re-run; on no, leave skipped (not failures, not manual). **Never auto-install.**
 ## Playwright command guidelines

package/src/orchestrator/templates/ai-instructions/github-skill-sungen-error-mapping.md CHANGED Viewed

@@ -21,6 +21,23 @@ Then choose the fix from the patterns below.
 ---
+## ⛔ Source of truth — classify EVERY failure before you "fix" it
+`.feature` + `test-data.yaml` + `spec.md` are the **oracle**. The **live page is NOT** — it may be the thing that's broken. A failing test is not automatically a test to "make pass". Classify first:
+- **Selector-resolution failure** (element not found / wrong locator / strict-mode / wrong element type) → the test looked in the wrong place. **Fix the locator in `selectors.yaml`** (re-snapshot, copy the exact accessible name). Legit auto-fix.
+- **Assertion-value failure** (element FOUND, but observed value ≠ expected) → STOP and ask: *is the TEST wrong, or is the APP wrong?*
+  - Expected value/rule is wrong **relative to `spec.md`** (typo, stale test-data) → fix `test-data.yaml`/`.feature` so it matches the **spec** — never the live page.
+  - App behaviour contradicts `spec.md` (spec says X, app shows Y) → **CANDIDATE BUG**. **Report it** (let the test FAIL / surface to the QA in the run summary). **NEVER** change the expected value, loosen the rule, weaken the assertion (`toHaveText`→`toContainText` to dodge a mismatch), edit `.feature`, or edit the generated `.spec.ts` to make it pass.
+> **Cardinal sin (do NOT do this):** a `password > 8 chars` rule fails on a 6-char input → "fix" it to `>= 6` so the test passes. The logic is now meaningless. A failing assertion is a **finding**, not a chore.
+**Auto-fix loop scope:** the run-test auto-fix loop engages ONLY on **selector-resolution** failures. On an assertion-value failure where the app contradicts the spec → **HALT and report**, do not loop it into passing.
+**Never hand-edit the generated `.spec.ts`** (e.g. inserting `page.evaluate`/`fetch` to bypass a broken control). `sungen script-check` regenerates the spec from `.feature` and flags any edit as DRIFT — regenerate, don't patch.
+---
 ## Fix Priority (try in order)
 1. **Auth issue** — page redirected to login? Fix auth first, everything else is noise
@@ -43,11 +60,13 @@ Then choose the fix from the patterns below.
 | not a select | Custom dropdown, not native `<select>` | Set `variant: 'custom'` |
 | Frame not found | iframe selector wrong or doesn't exist | Fix `frame` value, verify iframe in snapshot |
-### Assertion errors → fix in `test-data.yaml` or `.feature`
+### Assertion errors → apply the Source-of-truth gate above FIRST
-| Error | Diagnosis | Fix |
+> The "Fix" column below applies **only when the expected value was wrong relative to `spec.md`** (a test defect). If the app's value contradicts the spec, the row is a **candidate bug → report it, do not edit the expected to match live**. Never weaken `toHaveText`→`toContainText` just to pass.
+| Error | Diagnosis | Fix (only if the TEST was wrong per spec) |
 |---|---|---|
-| toHaveText mismatch | Expected text differs from actual | Fix value in test-data. If element is input type → change Gherkin type to `field`/`textarea` (triggers `toHaveValue` instead) |
+| toHaveText mismatch | Expected text differs from actual | If the test's expected was wrong per spec → fix value in test-data. If element is input type → change Gherkin type to `field`/`textarea` (triggers `toHaveValue`). If the app value contradicts spec → **report as bug**. |
 | toHaveValue mismatch | Expected value differs from actual | Fix value in test-data |
 | toContainText mismatch | Partial text not found | Fix expected partial text in test-data |
 | toBeVisible timeout | Element exists but hidden, or name wrong | Check: is element conditionally visible? Wrong name? Inside dialog? |

package/src/orchestrator/templates/ai-instructions/github-skill-sungen-tc-generation.md CHANGED Viewed

@@ -275,6 +275,7 @@ Security:         [S1 – admin only]
 **Depth is a GATE dimension (harness-roadmap P1) — self-raise, never silently go shallow:**
 - For every data-correctness theme the catalog marks `depth.requires: data-assertion`, emit its `depth.template` shape by **default** — don't wait for the repair loop. `sungen audit` measures `businessDepth` (ratio of these scenarios that assert data) against an intent threshold (functional ≥ 0.70); below it the **gate FAILs**.
+- **Verify depth deterministically before the gate:** run `sungen depth-lint --screen <name>`. It classifies every shallow business-critical scenario into **deepen-in-place** (add the theme's value assertion — the printed `template` is a hint, fit it to the actual claim) vs **cross-screen** (route to a flow / `@manual:Mx`). Clear the `deepen` list first — this is the mechanical way to hit `businessDepth` on the first pass instead of churning repair rounds. Never fake a value assertion onto a visibility/behavior scenario the lint over-counts; leave it and note the over-count.
 - `depth.cross_screen: true` (cart / detail / filter / brand correctness) → write the deep capture/compare shape as an **automated flow scenario** (in the flow — do NOT leave a full-step `@manual` duplicate on the screen). `@manual` is **only** for genuine judgment (M6 visual/UX · M8 not-worth · M9 human) or a missing capability (M1–M5/M7), and it **must** carry a reason code (`@manual:Mx`, or a reason comment the planner can infer). A `@manual` scenario that still has full automatable steps (a data assertion, no visual/mock/a11y judgment) is now flagged by `sungen audit` as `MANUAL-AUTOMATABLE`, and business-critical scenarios you defer to `@manual` are reported as `DEPTH-DEFERRED` (they do NOT silently inflate `businessDepth`). Deferring automatable work to `@manual` lowers quality — automate it in the flow instead.
 - **Pick the right `@manual:Mx` code — it decides which driver can later automate the case** (`sungen audit` flags a code↔reason mismatch). Tag the code that matches the **oracle the reason describes**: