oh-my-codex 0.16.4 → 0.17.0

Files changed (138)

package/Cargo.lock +5 -5
package/Cargo.toml +1 -1
package/dist/catalog/__tests__/generator.test.js +2 -0
package/dist/catalog/__tests__/generator.test.js.map +1 -1
package/dist/cli/__tests__/doctor-warning-copy.test.js +80 -7
package/dist/cli/__tests__/doctor-warning-copy.test.js.map +1 -1
package/dist/cli/__tests__/index.test.js +17 -11
package/dist/cli/__tests__/index.test.js.map +1 -1
package/dist/cli/__tests__/mcp-serve.test.js +4 -0
package/dist/cli/__tests__/mcp-serve.test.js.map +1 -1
package/dist/cli/__tests__/setup-hooks-shared-ownership.test.js +8 -3
package/dist/cli/__tests__/setup-hooks-shared-ownership.test.js.map +1 -1
package/dist/cli/__tests__/setup-install-mode.test.js +27 -1
package/dist/cli/__tests__/setup-install-mode.test.js.map +1 -1
package/dist/cli/__tests__/ultragoal.test.js +22 -0
package/dist/cli/__tests__/ultragoal.test.js.map +1 -1
package/dist/cli/doctor.d.ts.map +1 -1
package/dist/cli/doctor.js +66 -10
package/dist/cli/doctor.js.map +1 -1
package/dist/cli/index.d.ts +8 -2
package/dist/cli/index.d.ts.map +1 -1
package/dist/cli/index.js +17 -7
package/dist/cli/index.js.map +1 -1
package/dist/cli/mcp-serve.d.ts.map +1 -1
package/dist/cli/mcp-serve.js +4 -0
package/dist/cli/mcp-serve.js.map +1 -1
package/dist/cli/plugin-marketplace.d.ts +20 -0
package/dist/cli/plugin-marketplace.d.ts.map +1 -1
package/dist/cli/plugin-marketplace.js +115 -1
package/dist/cli/plugin-marketplace.js.map +1 -1
package/dist/cli/setup.d.ts.map +1 -1
package/dist/cli/setup.js +29 -10
package/dist/cli/setup.js.map +1 -1
package/dist/cli/ultragoal.d.ts.map +1 -1
package/dist/cli/ultragoal.js +7 -1
package/dist/cli/ultragoal.js.map +1 -1
package/dist/config/__tests__/codex-hooks.test.js +136 -9
package/dist/config/__tests__/codex-hooks.test.js.map +1 -1
package/dist/config/__tests__/generator-idempotent.test.js +15 -0
package/dist/config/__tests__/generator-idempotent.test.js.map +1 -1
package/dist/config/codex-hooks.d.ts +13 -14
package/dist/config/codex-hooks.d.ts.map +1 -1
package/dist/config/codex-hooks.js +85 -7
package/dist/config/codex-hooks.js.map +1 -1
package/dist/config/generator.d.ts +4 -1
package/dist/config/generator.d.ts.map +1 -1
package/dist/config/generator.js +15 -9
package/dist/config/generator.js.map +1 -1
package/dist/config/omx-first-party-mcp.d.ts.map +1 -1
package/dist/config/omx-first-party-mcp.js +7 -0
package/dist/config/omx-first-party-mcp.js.map +1 -1
package/dist/hooks/__tests__/design-skill.test.d.ts +2 -0
package/dist/hooks/__tests__/design-skill.test.d.ts.map +1 -0
package/dist/hooks/__tests__/design-skill.test.js +55 -0
package/dist/hooks/__tests__/design-skill.test.js.map +1 -0
package/dist/hooks/__tests__/notify-hook-tmux-heal.test.js +265 -0
package/dist/hooks/__tests__/notify-hook-tmux-heal.test.js.map +1 -1
package/dist/hooks/__tests__/skill-catalog-hygiene.test.js +1 -1
package/dist/hooks/__tests__/skill-catalog-hygiene.test.js.map +1 -1
package/dist/hooks/__tests__/skill-guidance-contract.test.js +41 -0
package/dist/hooks/__tests__/skill-guidance-contract.test.js.map +1 -1
package/dist/hooks/keyword-detector.d.ts.map +1 -1
package/dist/hooks/keyword-detector.js +5 -1
package/dist/hooks/keyword-detector.js.map +1 -1
package/dist/hooks/keyword-registry.d.ts.map +1 -1
package/dist/hooks/keyword-registry.js +2 -0
package/dist/hooks/keyword-registry.js.map +1 -1
package/dist/hooks/prompt-guidance-contract.d.ts.map +1 -1
package/dist/hooks/prompt-guidance-contract.js +47 -2
package/dist/hooks/prompt-guidance-contract.js.map +1 -1
package/dist/mcp/__tests__/bootstrap.test.js +3 -0
package/dist/mcp/__tests__/bootstrap.test.js.map +1 -1
package/dist/mcp/__tests__/hermes-bridge.test.d.ts +2 -0
package/dist/mcp/__tests__/hermes-bridge.test.d.ts.map +1 -0
package/dist/mcp/__tests__/hermes-bridge.test.js +374 -0
package/dist/mcp/__tests__/hermes-bridge.test.js.map +1 -0
package/dist/mcp/__tests__/state-paths.test.js +96 -13
package/dist/mcp/__tests__/state-paths.test.js.map +1 -1
package/dist/mcp/bootstrap.d.ts +1 -1
package/dist/mcp/bootstrap.d.ts.map +1 -1
package/dist/mcp/bootstrap.js +2 -0
package/dist/mcp/bootstrap.js.map +1 -1
package/dist/mcp/hermes-bridge.d.ts +81 -0
package/dist/mcp/hermes-bridge.d.ts.map +1 -0
package/dist/mcp/hermes-bridge.js +400 -0
package/dist/mcp/hermes-bridge.js.map +1 -0
package/dist/mcp/hermes-server.d.ts +269 -0
package/dist/mcp/hermes-server.d.ts.map +1 -0
package/dist/mcp/hermes-server.js +121 -0
package/dist/mcp/hermes-server.js.map +1 -0
package/dist/mcp/state-paths.d.ts.map +1 -1
package/dist/mcp/state-paths.js +41 -9
package/dist/mcp/state-paths.js.map +1 -1
package/dist/modes/__tests__/base-tmux-pane.test.js +31 -1
package/dist/modes/__tests__/base-tmux-pane.test.js.map +1 -1
package/dist/scripts/__tests__/codex-native-hook.test.js +187 -2
package/dist/scripts/__tests__/codex-native-hook.test.js.map +1 -1
package/dist/scripts/codex-native-hook.d.ts +1 -0
package/dist/scripts/codex-native-hook.d.ts.map +1 -1
package/dist/scripts/codex-native-hook.js +44 -17
package/dist/scripts/codex-native-hook.js.map +1 -1
package/dist/scripts/notify-hook/tmux-injection.d.ts.map +1 -1
package/dist/scripts/notify-hook/tmux-injection.js +91 -2
package/dist/scripts/notify-hook/tmux-injection.js.map +1 -1
package/dist/state/mode-state-context.d.ts +2 -0
package/dist/state/mode-state-context.d.ts.map +1 -1
package/dist/state/mode-state-context.js +21 -0
package/dist/state/mode-state-context.js.map +1 -1
package/dist/ultragoal/__tests__/artifacts.test.js +121 -0
package/dist/ultragoal/__tests__/artifacts.test.js.map +1 -1
package/dist/ultragoal/artifacts.d.ts +9 -1
package/dist/ultragoal/artifacts.d.ts.map +1 -1
package/dist/ultragoal/artifacts.js +105 -3
package/dist/ultragoal/artifacts.js.map +1 -1
package/dist/utils/__tests__/paths.test.js +31 -1
package/dist/utils/__tests__/paths.test.js.map +1 -1
package/dist/utils/paths.d.ts +6 -0
package/dist/utils/paths.d.ts.map +1 -1
package/dist/utils/paths.js +18 -0
package/dist/utils/paths.js.map +1 -1
package/dist/wiki/lifecycle.js +3 -3
package/dist/wiki/lifecycle.js.map +1 -1
package/package.json +1 -1
package/plugins/oh-my-codex/.codex-plugin/plugin.json +1 -1
package/plugins/oh-my-codex/.mcp.json +8 -0
package/plugins/oh-my-codex/skills/design/SKILL.md +180 -0
package/plugins/oh-my-codex/skills/skill/SKILL.md +2 -1
package/plugins/oh-my-codex/skills/ultraqa/SKILL.md +161 -47
package/plugins/oh-my-codex/skills/visual-ralph/SKILL.md +2 -2
package/skills/design/SKILL.md +180 -0
package/skills/frontend-ui-ux/SKILL.md +6 -2
package/skills/skill/SKILL.md +2 -1
package/skills/ultraqa/SKILL.md +161 -47
package/skills/visual-ralph/SKILL.md +2 -2
package/src/scripts/__tests__/codex-native-hook.test.ts +206 -1
package/src/scripts/codex-native-hook.ts +45 -18
package/src/scripts/notify-hook/tmux-injection.ts +110 -3
package/templates/catalog-manifest.json +9 -2

package/skills/design/SKILL.md ADDED

@@ -0,0 +1,180 @@
+---
+name: design
+description: Canonical repo-local DESIGN.md workflow for product, UI/UX, and frontend decision source of truth
+---
+# Design Skill
+Use `$design` when product, UI/UX, frontend, or design-system decisions need a durable source of truth in the repository. This skill discovers existing design context, interviews for missing product/design information, and creates or refreshes repo-local `DESIGN.md` so future UI/UX/frontend work is grounded instead of improvised.
+## Purpose
+Make the repo-local `DESIGN.md` the source of truth and canonical design contract for the current repository:
+`existing repo evidence -> missing-context interview -> create/refresh DESIGN.md -> use DESIGN.md for UI/UX/frontend decisions`.
+The output is not a pixel-matching loop and not a one-off visual critique. It is the maintained design brief/checklist that implementation, review, and future visual work should cite.
+## Use when
+- The user asks for design direction, UX guidance, frontend planning, or design-system alignment.
+- A repo needs a design brief before UI/frontend implementation begins.
+- Existing UI/components/assets/screenshots need to be summarized into a reusable design source of truth.
+- UI/UX/frontend decisions are ambiguous and should be resolved through product context, constraints, and documented principles.
+- A feature needs `DESIGN.md` created or refreshed before `$ralph`, a designer lane, or implementation work proceeds.
+## Do not use when
+- The user provides or requests a visual reference/image/live URL and wants measured implementation until screenshots match. Use `$visual-ralph` for that visual-reference implementation loop.
+- The task is pure backend/API/infrastructure work with no user-facing design consequence.
+- The user only asks to compare screenshots or score visual fidelity. Use `$visual-ralph` and its built-in visual verdict flow.
+## Relationship to `$visual-ralph`
+`$design` owns the durable repo design source of truth: product goals, users, IA, visual language, components, accessibility, constraints, and open questions in `DESIGN.md`.
+`$visual-ralph` owns implementation against an approved generated/static/live-URL visual reference, with screenshot capture, Visual Ralph verdict scoring, and pixel-diff evidence. `$visual-ralph` may read `DESIGN.md`, and it may leave design-system artifacts behind, but it does not replace the `DESIGN.md` discovery/interview/refresh workflow.
+If both are needed, run `$design` first to establish the design contract, then run `$visual-ralph` only after the visual reference/baseline is approved.
+## Workflow
+### 1. Discover local design evidence
+Inspect the repository before writing guidance. Look for:
+- `DESIGN.md`, `docs/design*`, `docs/ux*`, `docs/frontend*`, `README.md`, product specs, PRDs, and issue notes.
+- Existing UI source: routes, pages, layouts, components, stories, examples, demos, theme files, CSS variables, Tailwind/theme config, tokens, icons, and assets.
+- Screenshots, mockups, brand files, logos, Figma/export notes, Storybook snapshots, Playwright screenshots, visual-regression baselines, or `.omx/artifacts/visual-ralph/*` references.
+- Accessibility, responsive, i18n, content, and platform constraints already encoded in code or docs.
+Record evidence with file paths. Distinguish observed facts from design inferences.
+### 2. Interview only for missing context
+Ask concise questions only when repo evidence cannot answer design-critical context. Prefer one focused round that closes the biggest gaps, such as:
+- target users/personas and jobs to be done,
+- product/business goals and non-goals,
+- brand personality or forbidden aesthetics,
+- primary flows and information architecture,
+- accessibility level, device/browser support, and implementation constraints,
+- existing design assets or references the repo does not contain.
+If the user wants autonomous progress or cannot answer, create `DESIGN.md` with explicit assumptions and open questions instead of blocking.
+### 3. Create or refresh `DESIGN.md`
+Use the structure below. Preserve useful existing content, remove contradictions, and mark unknowns as open questions. Keep it actionable for implementers and reviewers.
+#### Required `DESIGN.md` structure/checklist
+```markdown
+# Design
+## Source of truth
+- Status: Draft | Active | Needs refresh
+- Last refreshed: YYYY-MM-DD
+- Primary product surfaces:
+- Evidence reviewed:
+## Brand
+- Personality:
+- Trust signals:
+- Avoid:
+## Product goals
+- Goals:
+- Non-goals:
+- Success signals:
+## Personas and jobs
+- Primary personas:
+- User jobs:
+- Key contexts of use:
+## Information architecture
+- Primary navigation:
+- Core routes/screens:
+- Content hierarchy:
+## Design principles
+- Principle 1:
+- Principle 2:
+- Tradeoffs:
+## Visual language
+- Color:
+- Typography:
+- Spacing/layout rhythm:
+- Shape/radius/elevation:
+- Motion:
+- Imagery/iconography:
+## Components
+- Existing components to reuse:
+- New/changed components:
+- Variants and states:
+- Token/component ownership:
+## Accessibility
+- Target standard:
+- Keyboard/focus behavior:
+- Contrast/readability:
+- Screen-reader semantics:
+- Reduced motion and sensory considerations:
+## Responsive behavior
+- Supported breakpoints/devices:
+- Layout adaptations:
+- Touch/hover differences:
+## Interaction states
+- Loading:
+- Empty:
+- Error:
+- Success:
+- Disabled:
+- Offline/slow network, if applicable:
+## Content voice
+- Tone:
+- Terminology:
+- Microcopy rules:
+## Implementation constraints
+- Framework/styling system:
+- Design-token constraints:
+- Performance constraints:
+- Compatibility constraints:
+- Test/screenshot expectations:
+## Open questions
+- [ ] Question / owner / impact
+```
+### 4. Use `DESIGN.md` as the decision contract
+For UI/UX/frontend work after the refresh:
+- Cite the relevant `DESIGN.md` sections before making design choices.
+- Prefer existing components, tokens, and documented constraints.
+- If implementation reveals a design contradiction, update `DESIGN.md` or add an open question before proceeding.
+- Do not introduce a new design-system layer when existing repo-native patterns can be extended.
+### 5. Handoff to implementation or Visual Ralph when appropriate
+- For normal frontend implementation, hand off with the relevant `DESIGN.md` sections, repo evidence, and acceptance criteria.
+- For visual-reference/image/live-URL matching, hand off to `$visual-ralph` with the approved reference/baseline and note that `DESIGN.md` is supporting context, not the visual verdict target.
+## Completion checklist
+Do not declare the design workflow complete until:
+- Existing design docs/assets/components/screenshots have been inspected or explicitly noted as absent.
+- Missing product/design context has been answered, assumed, or listed in `DESIGN.md` open questions.
+- `DESIGN.md` exists at the repo root and contains all required checklist sections.
+- UI/UX/frontend recommendations cite `DESIGN.md` rather than relying on unstated preferences.
+- Any `$visual-ralph` handoff is clearly separated as visual implementation matching, not DESIGN.md governance.
+Task: {{ARGUMENTS}}
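
The completion checklist in the added skill requires `DESIGN.md` to contain all required checklist sections. A minimal sketch of how such a check might look, assuming the headings match the template verbatim as level-2 Markdown headings. `REQUIRED_SECTIONS` and `missingDesignSections` are hypothetical names for illustration and are not part of oh-my-codex:

```typescript
// Hypothetical helper, not part of oh-my-codex.
// Section titles are copied from the DESIGN.md template in the skill.
const REQUIRED_SECTIONS: readonly string[] = [
  "Source of truth",
  "Brand",
  "Product goals",
  "Personas and jobs",
  "Information architecture",
  "Design principles",
  "Visual language",
  "Components",
  "Accessibility",
  "Responsive behavior",
  "Interaction states",
  "Content voice",
  "Implementation constraints",
  "Open questions",
];

// Returns the required section titles that do not appear as "## Title" lines.
function missingDesignSections(markdown: string): string[] {
  const found = new Set<string>();
  for (const line of markdown.split(/\r?\n/)) {
    // Match level-2 headings only; "# Title" and "### Title" are ignored.
    const match = /^##\s+(.+?)\s*$/.exec(line);
    const title = match?.[1];
    if (title) found.add(title.toLowerCase());
  }
  return REQUIRED_SECTIONS.filter((s) => !found.has(s.toLowerCase()));
}
```

An empty result would mean every template section is present; it says nothing about whether the sections are filled in, which still needs human review.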

package/skills/frontend-ui-ux/SKILL.md CHANGED

@@ -1,12 +1,16 @@
 ---
 name: frontend-ui-ux
-description: Deprecated compatibility shim for designer-led frontend UI/UX work
+description: Deprecated compatibility shim for frontend UI/UX work; use $design or $visual-ralph
 ---
 # Frontend UI/UX compatibility shim
 Hard-deprecated. Do not invoke or route this skill for new work.
-Use the `designer` agent/lane directly, or use `$visual-ralph` when the task needs visual iteration and screenshot-based verification. This file exists only to preserve the public/catalog-visible `frontend-ui-ux` alias contract while designer-led frontend guidance is handled by the canonical designer surfaces.
+Use `$design` when the task needs product/design context, UX guidance, frontend planning, design-system alignment, or a repo-local `DESIGN.md` source of truth.
+Use `$visual-ralph` when the task needs implementation against an approved generated/static/live-URL visual reference with screenshot capture, Visual Ralph verdict scoring, and pixel-diff evidence.
+This file exists only to preserve the public/catalog-visible `frontend-ui-ux` compatibility contract while canonical design guidance is handled by `$design` and measured visual implementation is handled by `$visual-ralph`.
 Task: {{ARGUMENTS}}

package/skills/skill/SKILL.md CHANGED

@@ -313,7 +313,8 @@ Project-only skills (2):
   - backend-scaffold
 Common skills (3):
-  - frontend-ui-ux
+  - design
+  - frontend-ui-ux (deprecated; use design or visual-ralph)
   - git-master
   - planner

package/skills/ultraqa/SKILL.md CHANGED

@@ -1,6 +1,6 @@
 ---
 name: ultraqa
-description: QA cycling workflow - test, verify, fix, repeat until goal met
+description: Adversarial dynamic e2e QA workflow - generate hostile scenarios, test, verify, fix, report, and clean up
 ---
 # UltraQA Skill
@@ -10,84 +10,184 @@ description: QA cycling workflow - test, verify, fix, repeat until goal met
 - Use outcome-first framing with concise, evidence-dense progress and completion reporting.
 - Treat newer user updates as local overrides for the active workflow branch while preserving earlier non-conflicting constraints.
 - If the user says `continue`, advance the current verified next step instead of restarting discovery.
+- UltraQA is not satisfied by a shallow build/lint/typecheck/test checklist. It must exercise the requested behavior through adversarial dynamic e2e scenarios whenever the target can be run, simulated, or harnessed safely.
-[ULTRAQA ACTIVATED - AUTONOMOUS QA CYCLING]
+[ULTRAQA ACTIVATED - ADVERSARIAL DYNAMIC E2E QA CYCLING]
 ## Overview
+UltraQA finds real behavior failures by combining normal verification commands with generated end-to-end scenarios, hostile user modeling, temporary harnesses when useful, and a structured evidence report. The workflow repeats test → diagnose → fix → retest until the goal is met, a bounded stop condition is reached, or a safety boundary blocks further execution.
 ## Goal Parsing
 Parse the goal from arguments. Supported formats:
 | Invocation | Goal Type | What to Check |
 |------------|-----------|---------------|
-| `/ultraqa --tests` | tests | All test suites pass |
-| `/ultraqa --build` | build | Build succeeds with exit 0 |
-| `/ultraqa --lint` | lint | No lint errors |
-| `/ultraqa --typecheck` | typecheck | No TypeScript errors |
-| `/ultraqa --custom "pattern"` | custom | Custom success pattern in output |
+| `/ultraqa --tests` | tests | Existing tests plus adversarial dynamic e2e scenarios for the changed behavior |
+| `/ultraqa --build` | build | Build succeeds and generated smoke/e2e probes still run against the built artifact when applicable |
+| `/ultraqa --lint` | lint | Lint passes and no generated harness/test artifact violates project hygiene |
+| `/ultraqa --typecheck` | typecheck | Typecheck passes and generated typed harnesses compile when applicable |
+| `/ultraqa --custom "pattern"` | custom | Custom success pattern is verified against behavior, not trusted as misleading success output |
+| `/ultraqa --interactive` | interactive | CLI/service behavior is tested with generated hostile and edge-case interactions |
+If no structured goal is provided, interpret the argument as a custom behavior goal and derive a runnable e2e strategy from repository context.
+## Required Scenario Matrix
+Before declaring success, create and maintain a scenario matrix. Each row must include: scenario id, intent, user/attacker model, setup, command or harness, expected signal, actual result, fixes applied, evidence, and cleanup status.
-If no structured goal provided, interpret the argument as a custom goal.
+The matrix must include normal-path coverage plus adversarial dynamic e2e scenarios selected from the current goal and codebase. Unless clearly irrelevant or impossible, include these hostile and edge-case classes:
+1. **Malformed input**: invalid JSON, missing fields, invalid flags, oversized strings, unusual Unicode, path traversal-like values, and corrupted state files.
+2. **Repeated interruptions**: repeated `continue`, stop/cancel/abort wording, interrupted command output, and retries after partial progress.
+3. **Prompt injection attempts**: user text that tries to override instructions, exfiltrate secrets, skip verification, delete state, or claim false success.
+4. **Cancel/resume behavior**: active state cleanup, resume detection, stale in-progress state, and cancellation followed by a fresh run.
+5. **Stale state**: old `.omx/state` files, mismatched sessions, missing timestamps, and contradictory phase metadata.
+6. **Dirty worktree**: pre-existing modifications, untracked generated files, and verification that UltraQA does not hide or overwrite unrelated work.
+7. **Hung or long-running commands**: bounded timeout handling, killed child processes, and recovery notes.
+8. **Flaky tests**: rerun strategy, failure clustering, quarantine evidence, and avoiding false green from a single lucky pass.
+9. **Misleading success output**: output containing success phrases with non-zero exits, hidden failures, skipped tests, or partial command logs.
+## Dynamic E2E and Temporary Harness Rules
+- Generate temporary tests, scripts, fixtures, or harnesses when they materially improve behavioral confidence and no existing e2e surface covers the scenario.
+- Prefer project-native test tools and small throwaway harnesses under a temporary directory or clearly named test fixture.
+- Record every generated artifact in the scenario matrix, including whether it was committed intentionally or removed during cleanup.
+- Use bounded runtimes and explicit timeouts for commands that can hang.
+- Validate exit codes and output semantics; do not trust success-looking text alone.
+- Do not delete, rewrite, or mask unrelated user work. Capture dirty-worktree evidence before and after generated harness work.
 ## Cycle Workflow
 ### Cycle N (Max 5)
-1. **RUN QA**: Execute verification based on goal type
-   - `--tests`: Run the project's test command
-   - `--build`: Run the project's build command
-   - `--lint`: Run the project's lint command
-   - `--typecheck`: Run the project's type check command
-   - `--custom`: Run appropriate command and check for pattern
-   - `--interactive`: Use qa-tester for interactive CLI/service testing:
+1. **PLAN ADVERSARIAL QA**
+   - Restate the goal, success criteria, safety bounds, and stop condition.
+   - Inspect repository context enough to identify runnable surfaces, test commands, state files, and cleanup paths.
+   - Build or update the required scenario matrix before running commands.
+2. **RUN BASELINE VERIFICATION**
+   - `--tests`: Run the project's test command.
+   - `--build`: Run the project's build command.
+   - `--lint`: Run the project's lint command.
+   - `--typecheck`: Run the project's type check command.
+   - `--custom`: Run the appropriate command and check the pattern plus exit status and failure markers.
+   - `--interactive`: Use qa-tester or an equivalent CLI/service harness:
      ```
      Use `/prompts:qa-tester` with:
      Goal: [describe what to verify]
      Service: [how to start]
-     Test cases: [specific scenarios to verify]
+     Test cases: [normal, hostile, malformed, interruption, resume, stale-state, dirty-worktree, hung-command, flaky, and misleading-output scenarios]
      ```
-2. **CHECK RESULT**: Did the goal pass?
-   - **YES** → Exit with success message
-   - **NO** → Continue to step 3
+3. **RUN ADVERSARIAL DYNAMIC E2E SCENARIOS**
+   - Execute the scenario matrix using existing e2e tests, generated temporary tests, or generated harnesses.
+   - Model malicious/hostile user behavior explicitly, including prompt injection and attempts to bypass safety or verification.
+   - Exercise malformed input, repeated interruptions, cancel/resume, stale state, dirty worktree handling, hung commands, flaky tests, and misleading success output when relevant.
+   - Capture commands, exit codes, important output excerpts, artifacts, and cleanup status.
+4. **CHECK RESULT**
+   - **YES** only if baseline verification and adversarial e2e scenarios passed, generated artifacts are cleaned up or intentionally tracked, and the report has complete evidence.
+   - **NO** if any scenario failed, was skipped without justification, left debris, relied on misleading output, or lacked evidence. Continue to step 5.
-3. **ARCHITECT DIAGNOSIS**: Spawn architect to analyze failure
+5. **ARCHITECT DIAGNOSIS**
    ```
    Use `/prompts:architect` with:
-   Goal: [goal type]
-   Output: [test/build output]
-   Provide root cause and specific fix recommendations.
+   Goal: [goal type and behavior]
+   Scenario matrix: [rows, commands, failures, evidence]
+   Output: [test/build/e2e/harness output]
+   Provide root cause, safety implications, and specific fix recommendations.
    ```
-4. **FIX ISSUES**: Apply architect's recommendations
+6. **FIX ISSUES**
    ```
    Use `/prompts:executor` with:
    Issue: [architect diagnosis]
    Files: [affected files]
+   Constraints: preserve unrelated dirty work, clean temporary harnesses, keep safety bounds
    Apply the fix precisely as recommended.
    ```
-5. **REPEAT**: Go back to step 1
+7. **CLEAN UP AND ROLLBACK**
+   - Remove temporary harnesses, fixtures, logs, spawned processes, and state files unless they are intentional deliverables.
+   - Roll back failed experimental edits that are not part of the final fix.
+   - Re-check the worktree and record remaining intentional changes or residual debris.
+8. **REPEAT**
+   - Go back to step 1 with the updated scenario matrix and failure history.
+## Safety Bounds
+UltraQA must stay inside these safety bounds:
+- No destructive commands such as force resets, broad deletes, secret exfiltration, credential dumping, production writes, or unbounded process spawning.
+- No reading or printing secrets beyond the minimum metadata needed to verify absence of leakage.
+- No network or external-production side effects unless the user explicitly authorized them.
+- No unbounded waits: use timeouts, retries with caps, and clear hung-command diagnostics.
+- No hiding unrelated dirty work or generated debris.
+- If a required scenario would violate these bounds, mark it blocked in the report with the safe substitute used.
 ## Exit Conditions
 | Condition | Action |
 |-----------|--------|
-| **Goal Met** | Exit with success: "ULTRAQA COMPLETE: Goal met after N cycles" |
-| **Cycle 5 Reached** | Exit with diagnosis: "ULTRAQA STOPPED: Max cycles. Diagnosis: ..." |
-| **Same Failure 3x** | Exit early: "ULTRAQA STOPPED: Same failure detected 3 times. Root cause: ..." |
-| **Environment Error** | Exit: "ULTRAQA ERROR: [tmux/port/dependency issue]" |
+| **Goal Met** | Exit with success: `ULTRAQA COMPLETE: Goal met after N cycles` plus the structured report |
+| **Cycle 5 Reached** | Exit with diagnosis: `ULTRAQA STOPPED: Max cycles` plus failures, fixes attempted, residual risks, and evidence |
+| **Same Failure 3x** | Exit early: `ULTRAQA STOPPED: Same failure detected 3 times` plus root cause, safety notes, and next owner |
+| **Safety Boundary** | Exit: `ULTRAQA BLOCKED: [destructive/credentialed/external-production/unbounded action]` plus safe substitute evidence |
+| **Environment Error** | Exit: `ULTRAQA ERROR: [tmux/port/dependency/hung command issue]` plus cleanup status |
+## Structured Report
+Every terminal UltraQA result must include this report shape:
+```markdown
+# UltraQA Report
+## Goal and success criteria
+- Goal:
+- Stop condition:
+- Safety bounds applied:
+## Scenario matrix
+| ID | User/attacker model | Scenario | Command/harness | Expected signal | Actual result | Status | Evidence | Cleanup |
+|----|---------------------|----------|-----------------|-----------------|---------------|--------|----------|---------|
+## Commands run
+- `[exit code] command` — purpose, duration/timeout, key output evidence
+## Failures found
+- Scenario ID, failure signal, root cause, user impact, safety impact
+## Fixes applied
+- Files changed, rationale, linked failing scenario(s), regression evidence
+## Cleanup and rollback
+- Generated artifacts removed or intentionally kept
+- State/process cleanup performed
+- Worktree status before/after
+## Residual risks
+- Untested or blocked scenarios with reasons and safe substitutes
+## Evidence
+- Test output, e2e logs, harness output, screenshots/transcripts when relevant, and rerun/flake evidence
+```
 ## Observability
 Output progress each cycle:
-```
-[ULTRAQA Cycle 1/5] Running tests...
-[ULTRAQA Cycle 1/5] FAILED - 3 tests failing
-[ULTRAQA Cycle 1/5] Architect diagnosing...
-[ULTRAQA Cycle 1/5] Fixing: auth.test.ts - missing mock
-[ULTRAQA Cycle 2/5] Running tests...
-[ULTRAQA Cycle 2/5] PASSED - All 47 tests pass
+```text
+[ULTRAQA Cycle 1/5] Planning adversarial scenario matrix...
+[ULTRAQA Cycle 1/5] Running baseline tests...
+[ULTRAQA Cycle 1/5] Running ADV-E2E-003 prompt-injection harness...
+[ULTRAQA Cycle 1/5] FAILED - stale state resume accepted misleading success output
+[ULTRAQA Cycle 1/5] Architect diagnosing scenario ADV-E2E-003...
+[ULTRAQA Cycle 1/5] Fixing: src/hooks/... - validate exit code before success phrase
+[ULTRAQA Cycle 1/5] Cleaning temporary harnesses and state...
+[ULTRAQA Cycle 2/5] PASSED - baseline + 9 adversarial scenarios pass
 [ULTRAQA COMPLETE] Goal met after 2 cycles
 ```
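
The Exit Conditions table in this hunk caps the workflow at 5 cycles and stops early when the same failure is seen 3 times. A rough sketch of that decision, assuming each failed cycle can be reduced to a comparable signature string and that repeats are counted across the whole run rather than only consecutively (the skill text does not say which). All names here are hypothetical and not taken from the package:

```typescript
// Hypothetical sketch, not the package's implementation.
type CycleResult = { passed: boolean; failureSignature?: string };

type ExitDecision =
  | { kind: "continue" }
  | { kind: "complete"; cycles: number }
  | { kind: "stopped"; reason: "max-cycles" | "same-failure-3x" };

const MAX_CYCLES = 5; // "Cycle N (Max 5)" in the skill
const SAME_FAILURE_LIMIT = 3; // "Same Failure 3x" in the skill

function decideExit(history: readonly CycleResult[]): ExitDecision {
  const last = history[history.length - 1];
  if (last === undefined) return { kind: "continue" };
  if (last.passed) return { kind: "complete", cycles: history.length };

  // Count failed cycles that share the latest failure signature.
  const repeats = history.filter(
    (c) =>
      !c.passed &&
      c.failureSignature !== undefined &&
      c.failureSignature === last.failureSignature,
  ).length;

  if (repeats >= SAME_FAILURE_LIMIT) {
    return { kind: "stopped", reason: "same-failure-3x" };
  }
  if (history.length >= MAX_CYCLES) {
    return { kind: "stopped", reason: "max-cycles" };
  }
  return { kind: "continue" };
}
```

Checking the repeated-failure condition before the cycle cap is a design choice in this sketch: a run that fails the same way on cycles 3, 4, and 5 is reported as a repeated failure rather than as a generic max-cycles stop.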
@@ -96,12 +196,16 @@ Output progress each cycle:
 Use the CLI-first state surface (`omx state ... --json`) for UltraQA lifecycle state. If explicit MCP compatibility tools are already available, equivalent `omx_state` calls are optional compatibility, not the default.
 - **On start**:
-  `omx state write --input '{"mode":"ultraqa","active":true,"current_phase":"qa","iteration":1,"started_at":"<now>"}' --json`
+  `omx state write --input '{"mode":"ultraqa","active":true,"current_phase":"planning","iteration":1,"started_at":"<now>","scenario_matrix":[]}' --json`
 - **On each cycle**:
-  `omx state write --input '{"mode":"ultraqa","current_phase":"qa","iteration":<cycle>}' --json`
+  `omx state write --input '{"mode":"ultraqa","current_phase":"qa","iteration":<cycle>,"scenario_matrix":"<updated matrix path or summary>"}' --json`
+- **On adversarial e2e transition**:
+  `omx state write --input '{"mode":"ultraqa","current_phase":"adversarial-e2e"}' --json`
 - **On diagnose/fix transitions**:
   `omx state write --input '{"mode":"ultraqa","current_phase":"diagnose"}' --json`
   `omx state write --input '{"mode":"ultraqa","current_phase":"fix"}' --json`
+- **On cleanup transition**:
+  `omx state write --input '{"mode":"ultraqa","current_phase":"cleanup"}' --json`
 - **On completion**:
   `omx state write --input '{"mode":"ultraqa","active":false,"current_phase":"complete","completed_at":"<now>"}' --json`
 - **For resume detection**:
@@ -109,23 +213,33 @@ Use the CLI-first state surface (`omx state ... --json`) for UltraQA lifecycle s
 ## Scenario Examples
-**Good:** The user says `continue` after the workflow already has a clear next step. Continue the current branch of work instead of restarting or re-asking the same question.
+**Good:** The user says `continue` after the workflow already has a clear next step. Continue the current branch of work, rerun the relevant adversarial scenario, and update the report instead of restarting discovery.
 **Good:** The user changes only the output shape or downstream delivery step (for example `make a PR`). Preserve earlier non-conflicting workflow constraints and apply the update locally.
+**Good:** A CLI prints `SUCCESS` while exiting 1. Mark the misleading success output scenario failed, fix the parser or reporting path, and rerun the generated harness.
+**Bad:** The workflow runs only `npm test`, `npm run build`, `npm run lint`, or `npm run typecheck`, sees green output, and declares UltraQA complete without adversarial dynamic e2e coverage.
+**Bad:** A generated harness leaves untracked files, state, or a child process behind and the final report omits cleanup status.
 **Bad:** The user says `continue`, and the workflow restarts discovery or stops before the missing verification/evidence is gathered.
 ## Cancellation
-User can cancel with `/cancel` which clears the state file.
+User can cancel with `/cancel`, which clears UltraQA state. Cancellation itself should be tested in cancel/resume scenarios when relevant, but UltraQA must not block an explicit user cancellation.
 ## Important Rules
-1. **PARALLEL when possible** - Run diagnosis while preparing potential fixes
-2. **TRACK failures** - Record each failure to detect patterns
-3. **EARLY EXIT on pattern** - 3x same failure = stop and surface
-4. **CLEAR OUTPUT** - User should always know current cycle and status
-5. **CLEAN UP** - Clear state file on completion or cancellation
+1. **ADVERSARIAL E2E REQUIRED** - Baseline build/lint/typecheck/test commands are necessary evidence, not sufficient completion proof.
+2. **SCENARIO MATRIX REQUIRED** - Track normal, hostile, malformed, interruption, injection, cancel/resume, stale-state, dirty-worktree, hung-command, flaky, and misleading-output coverage.
+3. **GENERATE HARNESSES WHEN USEFUL** - Create temporary tests or harnesses when they materially improve behavioral confidence, then clean them up or commit them intentionally.
+4. **PARALLEL WHEN SAFE** - Run independent diagnostics while preparing potential fixes; do not parallelize commands that mutate the same state or worktree.
+5. **TRACK FAILURES** - Record each failure to detect patterns and avoid false greens.
+6. **EARLY EXIT ON PATTERN** - 3x same failure = stop and surface with root cause and residual risk.
+7. **CLEAR OUTPUT** - User should always know current cycle, scenario, command, status, and evidence.
+8. **CLEAN UP** - Clear UltraQA state and temporary artifacts on completion, cancellation, or early stop.
+9. **SAFETY FIRST** - Never exfiltrate secrets, run destructive cleanup, write to production, or wait indefinitely to satisfy a scenario.
 ## STATE CLEANUP ON COMPLETION
@@ -133,8 +247,8 @@ When goal is met OR max cycles reached OR exiting early, run `$cancel` or call:
 `omx state clear --input '{"mode":"ultraqa"}' --json`
-Use CLI state cleanup rather than deleting files directly.
+Use CLI state cleanup rather than deleting files directly. Also remove temporary e2e harnesses, fixtures, and logs unless they are intentional artifacts listed in the report.
 ---
-Begin ULTRAQA cycling now. Parse the goal and start cycle 1.
+Begin ULTRAQA cycling now. Parse the goal, build the adversarial dynamic e2e scenario matrix, and start cycle 1.
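
The updated skill repeatedly asks for exit codes to be validated before success-looking text is trusted (for example, a CLI that prints `SUCCESS` while exiting 1). A minimal sketch of that rule, assuming the caller has already captured the exit code, combined output, and a timeout flag. `CommandResult`, `Verdict`, and `judgeCommand` are hypothetical names and do not reflect how oh-my-codex implements this:

```typescript
// Hypothetical sketch, not the package's implementation.
type CommandResult = {
  exitCode: number | null; // null when the process was killed before exiting
  output: string; // combined stdout/stderr captured by the caller
  timedOut: boolean; // true when a bounded timeout fired
};

type Verdict = { passed: boolean; reason: string };

function judgeCommand(result: CommandResult, successPattern: RegExp): Verdict {
  // Reset lastIndex so a pattern with the "g" flag cannot carry state between calls.
  successPattern.lastIndex = 0;
  const matched = successPattern.test(result.output);

  if (result.timedOut) {
    return { passed: false, reason: "timed out" };
  }
  if (result.exitCode !== 0) {
    // Success text never overrides a failing exit status.
    return {
      passed: false,
      reason: matched ? "success text with non-zero exit" : "non-zero exit",
    };
  }
  if (!matched) {
    return { passed: false, reason: "success pattern missing" };
  }
  return { passed: true, reason: "exit 0 and success pattern found" };
}
```

For instance, `judgeCommand({ exitCode: 1, output: "SUCCESS", timedOut: false }, /SUCCESS/)` yields a failing verdict with the reason `success text with non-zero exit`. Skipped tests or hidden failures inside the output would need additional checks that this sketch does not attempt.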

package/skills/visual-ralph/SKILL.md CHANGED

@@ -27,7 +27,7 @@ This is an orchestration skill. It composes existing skills and must not add run
 ## Do not use when
-- The user only wants design critique or general frontend advice; use `$frontend-ui-ux` or a designer lane.
+- The user only wants repo-wide design guidance, product/design context, or a DESIGN.md source of truth; use `$design` or a designer lane.
 - The task is a non-visual backend/API implementation with no UI reference target.
 - The user already supplied a final static reference image and only needs comparison/fixes; hand directly to `$ralph` with Visual Ralph verdict guidance.
 - The requested output is a deterministic SVG/vector/code-native asset rather than a raster reference.
@@ -117,7 +117,7 @@ Record final diff evidence with the reference/screenshot artifacts so the result
 ### 7. Build a reproducible design system
-The implementation is incomplete unless the visual match is encoded in repo-native reusable artifacts. Depending on the project, this may mean CSS variables, theme tokens, Tailwind config, component variants, Storybook stories, design docs, or existing equivalents.
+The implementation is incomplete unless the visual match is encoded in repo-native reusable artifacts. Depending on the project, this may mean CSS variables, theme tokens, Tailwind config, component variants, Storybook stories, updates that align with DESIGN.md, or existing equivalents.
 Capture at least the applicable:
 - colors,