npm - openhermes - Versions diffs - 4.3.0 → 4.11.2 - Mend

openhermes 4.3.0 → 4.11.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (143) hide show

package/CONTEXT.md +10 -1
package/README.md +54 -42
package/bootstrap.ts +396 -142
package/harness/agents/oh-browser.md +97 -0
package/harness/agents/oh-builder.md +78 -0
package/harness/agents/oh-facade.md +75 -0
package/harness/agents/oh-fusion.md +45 -0
package/harness/agents/oh-gauntlet.md +71 -0
package/harness/agents/oh-grill.md +71 -0
package/harness/agents/oh-investigate.md +60 -0
package/harness/agents/oh-manifest.md +95 -0
package/harness/agents/oh-plan-review.md +40 -0
package/harness/agents/oh-planner.md +50 -0
package/harness/agents/oh-refactor.md +37 -0
package/harness/agents/oh-retro.md +46 -0
package/harness/agents/oh-review.md +85 -0
package/harness/agents/oh-security.md +83 -0
package/harness/agents/oh-ship.md +76 -0
package/harness/agents/oh-skill-craft.md +38 -0
package/harness/agents/openhermes.md +28 -73
package/harness/codex/AUTOPILOT.md +235 -87
package/harness/codex/CHARTER.md +80 -0
package/harness/instructions/SHELL.md +76 -0
package/harness/lib/background/background.test.ts +197 -0
package/harness/lib/background/index.ts +7 -0
package/harness/lib/background/interfaces.ts +31 -0
package/harness/lib/background/manager.ts +320 -0
package/harness/lib/composer/compose.test.ts +168 -0
package/harness/lib/composer/compose.ts +65 -0
package/harness/lib/composer/fragments/01-identity.md +1 -0
package/harness/lib/composer/fragments/02-delegation.md +6 -0
package/harness/lib/composer/fragments/03-permissions.md +13 -0
package/harness/lib/composer/fragments/04-task-flow.md +15 -0
package/harness/lib/composer/fragments/05-confidence.md +5 -0
package/harness/lib/composer/fragments/06-parallelization.md +17 -0
package/harness/lib/composer/fragments/07-shell.md +41 -0
package/harness/lib/composer/fragments/08-routing.md +8 -0
package/harness/lib/composer/fragments/09-guardrails.md +12 -0
package/harness/lib/composer/index.ts +1 -0
package/harness/lib/hooks/builtins/confidence-gate-hook.ts +70 -0
package/harness/lib/hooks/builtins/delegation-depth-hook.ts +59 -0
package/harness/lib/hooks/builtins/error-recovery-hook.ts +107 -0
package/harness/lib/hooks/builtins/memory-sync-hook.ts +73 -0
package/harness/lib/hooks/builtins/plan-check-hook.ts +43 -0
package/harness/lib/hooks/builtins/route-tracking-hook.ts +147 -0
package/harness/lib/hooks/builtins/sanity-check-hook.ts +52 -0
package/harness/lib/hooks/builtins/shell-detect-hook.ts +96 -0
package/harness/lib/hooks/hooks.test.ts +1016 -0
package/harness/lib/hooks/index.ts +30 -0
package/harness/lib/hooks/registry.ts +416 -0
package/harness/lib/hooks/types.ts +71 -0
package/harness/lib/memory/index.ts +18 -0
package/harness/lib/memory/interfaces.ts +53 -0
package/harness/lib/memory/memory-manager.ts +205 -0
package/harness/lib/memory/memory.test.ts +491 -0
package/harness/lib/memory/plan-store.ts +366 -0
package/harness/lib/recovery/handler.ts +243 -0
package/harness/lib/recovery/index.ts +14 -0
package/harness/lib/recovery/interfaces.ts +48 -0
package/harness/lib/recovery/patterns.ts +149 -0
package/harness/lib/recovery/recovery.test.ts +312 -0
package/harness/lib/sanity/anomaly-tracker.ts +127 -0
package/harness/lib/sanity/checker.ts +178 -0
package/harness/lib/sanity/index.ts +13 -0
package/harness/lib/sanity/interfaces.ts +24 -0
package/harness/lib/sanity/sanity.test.ts +472 -0
package/harness/lib/sync/file-watcher.ts +174 -0
package/harness/lib/sync/index.ts +11 -0
package/harness/lib/sync/interfaces.ts +27 -0
package/harness/lib/sync/plan-sync.ts +536 -0
package/harness/lib/sync/sync.test.ts +832 -0
package/harness/skills/oh-ascii/DEEP.md +292 -0
package/harness/skills/oh-ascii/SKILL.md +31 -0
package/harness/skills/oh-ascii/scripts/check_ascii_alignment.py +596 -0
package/harness/skills/oh-browser/DEEP.md +54 -0
package/harness/skills/oh-browser/SKILL.md +30 -0
package/harness/skills/oh-builder/DEEP.md +63 -0
package/harness/skills/oh-builder/SKILL.md +12 -90
package/harness/skills/oh-expert/DEEP.md +85 -0
package/harness/skills/oh-expert/SKILL.md +13 -106
package/harness/skills/oh-facade/DEEP.md +182 -0
package/harness/skills/oh-facade/SKILL.md +15 -279
package/harness/skills/oh-freeze/DEEP.md +18 -0
package/harness/skills/oh-freeze/SKILL.md +10 -19
package/harness/skills/oh-full-output/DEEP.md +25 -0
package/harness/skills/oh-full-output/SKILL.md +12 -65
package/harness/skills/oh-fusion/DEEP.md +120 -0
package/harness/skills/oh-fusion/SKILL.md +17 -295
package/harness/skills/oh-gauntlet/DEEP.md +77 -0
package/harness/skills/oh-gauntlet/SKILL.md +13 -105
package/harness/skills/oh-grill/DEEP.md +51 -0
package/harness/skills/oh-grill/SKILL.md +12 -63
package/harness/skills/oh-guard/DEEP.md +19 -0
package/harness/skills/oh-guard/SKILL.md +10 -24
package/harness/skills/oh-handoff/DEEP.md +48 -0
package/harness/skills/oh-handoff/SKILL.md +13 -23
package/harness/skills/oh-health/DEEP.md +74 -0
package/harness/skills/oh-health/SKILL.md +13 -76
package/harness/skills/oh-init/DEEP.md +85 -0
package/harness/skills/oh-init/SKILL.md +13 -127
package/harness/skills/oh-investigate/DEEP.md +171 -0
package/harness/skills/oh-investigate/SKILL.md +13 -66
package/harness/skills/oh-issue/DEEP.md +21 -0
package/harness/skills/oh-issue/SKILL.md +11 -27
package/harness/skills/oh-manifest/DEEP.md +92 -0
package/harness/skills/oh-manifest/SKILL.md +12 -109
package/harness/skills/oh-plan-review/DEEP.md +90 -0
package/harness/skills/oh-plan-review/SKILL.md +13 -115
package/harness/skills/oh-planner/DEEP.md +172 -0
package/harness/skills/oh-planner/SKILL.md +12 -149
package/harness/skills/oh-prd/DEEP.md +45 -0
package/harness/skills/oh-prd/SKILL.md +10 -26
package/harness/skills/oh-refactor/DEEP.md +122 -0
package/harness/skills/oh-refactor/SKILL.md +17 -410
package/harness/skills/oh-retro/DEEP.md +26 -0
package/harness/skills/oh-retro/SKILL.md +12 -24
package/harness/skills/oh-review/DEEP.md +87 -0
package/harness/skills/oh-review/SKILL.md +11 -97
package/harness/skills/oh-security/DEEP.md +83 -0
package/harness/skills/oh-security/SKILL.md +14 -96
package/harness/skills/oh-ship/DEEP.md +141 -0
package/harness/skills/oh-ship/SKILL.md +14 -32
package/harness/skills/oh-skill-craft/DEEP.md +369 -0
package/harness/skills/oh-skill-craft/SKILL.md +13 -177
package/harness/skills/oh-skills-link/DEEP.md +16 -0
package/harness/skills/oh-skills-link/SKILL.md +10 -20
package/harness/skills/oh-skills-list/DEEP.md +20 -0
package/harness/skills/oh-skills-list/SKILL.md +9 -22
package/harness/skills/oh-triage/DEEP.md +23 -0
package/harness/skills/oh-triage/SKILL.md +8 -24
package/harness/skills/oh-worktree/DEEP.md +169 -0
package/harness/skills/oh-worktree/SKILL.md +32 -0
package/lib/harness-resolver.ts +8 -10
package/package.json +7 -5
package/tsconfig.json +1 -1
package/harness/codex/CONSTITUTION.md +0 -73
package/harness/codex/ROUTING.md +0 -92
package/harness/commands/oh-doctor.md +0 -26
package/harness/commands/oh-log.md +0 -18
package/harness/instructions/RUNTIME.md +0 -30
package/harness/skills/oh-caveman/SKILL.md +0 -42
package/harness/skills/oh-learn/SKILL.md +0 -101
package/lib/logger.ts +0 -75

package/harness/skills/oh-fusion/SKILL.md CHANGED Viewed

@@ -1,24 +1,7 @@
 ---
 name: oh-fusion
-description: "Skill ingestion pipeline: discover, analyze, filter, adapt, fuse, and integrate external skills into the OH harness. Use when the user has an existing skill, finds a skill in their .agents/skills, or wants to bring an external capability into OH."
+description: "Use when the user has an existing skill, finds a skill in their .agents/skills, or wants to bring an external capability into OH as a skill."
 tier: 3
-benefits-from: [oh-skill-craft, oh-skills-link, oh-expert]
-triggers:
-  - "import skill"
-  - "ingest skill"
-  - "fuse skill"
-  - "merge skills"
-  - "port skill"
-  - "add skill from"
-  - "make this OH-native"
-  - "skill fusion"
-  - "oh-fusion"
-  - "integrate skill"
-  - "convert skill"
-  - "bring in a skill"
-  - "transfer skill"
-  - "copy skill"
-  - "adopt skill"
 route:
   pass:
     - oh-skills-link
@@ -29,286 +12,25 @@ route:
 # oh-fusion
-The skill ingestion pipeline: discover external skills, evaluate signal quality, filter out noise, adapt to OH conventions, fuse multiple into one, and integrate into the harness.
+Skill ingestion pipeline: Discover → Analyze → Decide → Adapt → Fuse → Integrate.
-Every skill you run through `oh-fusion` becomes part of the closed loop — wired into AUTOPILOT, ROUTING.md, AGENTS.md, and the self-driving engine.
+## Steps
-## When to Use
-- The user points at a skill in `.agents/skills` and says "make this OH-native"
-- The user has a skill from `npx skills` ecosystem they want integrated
-- The user provides raw skill content and asks "is this worth keeping?"
-- Multiple skills need fusing into one (like the `oh-facade` fusion in this session)
-- Any external capability needs to become an `oh-*` skill with full wiring
-## Pipeline
-6-phase closed loop:
-```
-Discovery → Analysis → Decision → Adaptation → Fusion (opt) → Integration
-                                                                      ↓
-                                                            oh-skills-link (verify)
-```
----
-## Phase 1: Discovery
-Input: user's skill source
-Output: raw skill content loaded for analysis
-### Sources
-| Source | How to access |
-|---|---|
-| `.agents/skills/<name>/SKILL.md` | Read the file directly |
-| `npx skills` package | Run `npx skills find <query>` or check `skills.sh` |
-| URL to a skill | Fetch the content via web fetch |
-| User-provided path | Resolve and read |
-| User-provided content inline | Capture the raw text |
-| Multiple skills (for fusion) | Load all, enter Phase 2 on each |
-### Discovery Checklist
-Before proceeding, confirm:
-- [ ] Skill content is loaded and readable
-- [ ] Frontmatter is present (name, description)
-- [ ] There are no access restrictions or permissions needed
-- [ ] For multiple skills: all are loaded and ready for comparison
----
-## Phase 2: Analysis
-Input: raw skill content
-Output: structured analysis report with signal score
-### 2a. Depth Scoring
-Measure the skill's substantive content:
-| Metric | How to assess |
-|---|---|
-| Total lines | SKILL.md length |
-| Concrete rules count | Number of "must", "never", "always", "banned" directives |
-| Example count | Number of code blocks showing before/after or usage |
-| Anti-patterns listed | Explicit "don't do this" sections |
-| Workflow steps | Number of sequential, actionable steps |
-| Routing table | Does it define pass/fail/blocker routing? |
-**Scoring:**
-- **High signal** (70-100): Multiple concrete rules, examples, anti-patterns, workflow steps, routing
-- **Medium signal** (30-69): Some structure but thin on specifics, few examples
-- **Low signal** (0-29): Vague descriptions, no concrete rules, no anti-patterns, "be creative" level
-### 2b. Overlap Detection
-Compare against all existing OH skills (`harness/skills/oh-*/SKILL.md`):
-- Does any existing OH skill cover the same domain?
-- Is the overlap partial (complementary) or complete (redundant)?
-- Does the external skill have unique content OH lacks?
-### 2c. Convention Check
-Does the skill follow good practices?
-- [ ] Has clear description for triggering
-- [ ] Has concrete, actionable instructions (not just philosophy)
-- [ ] Has anti-patterns or failure modes documented
-- [ ] Has examples or code blocks
-- [ ] Has measurable outcomes (not subjective "make it good")
-- [ ] Avoids time-sensitive references (dates, version numbers)
-- [ ] Avoids platform-specific assumptions that don't apply
-### 2d. Report
-Output a structured report:
-```markdown
-## Analysis: <skill-name>
-**Source:** <path or origin>
-**Depth score:** <0-100> — <High/Medium/Low>
-**Total lines:** <N>  |  Concrete rules: <N>  |  Examples: <N>  |  Anti-patterns: <N>
-**Overlap:** <existing OH skill> — <none/partial/complete>
-**Verdict:** <keep / fuse / discard / ask>
-**Strengths:**
-- <what this skill does well>
-**Weaknesses:**
-- <what is missing or weak>
-**Recommended action:** <port directly / fuse with X > / discard>
-```
----
-## Phase 3: Decision
-Based on the analysis, decide what to do:
-| Verdict | Action |
-|---|---|
-| **Keep** | High signal, no overlap, OH conventions missing. Port directly to `oh-<name>`. |
-| **Fuse** | Medium-high signal, partial overlap with existing OH skill(s). Merge complementary DNA. |
-| **Discard** | Low signal, complete overlap, too niche, or no actionable content. Surface reasoning. |
-| **Ask** | Ambiguous quality, unclear domain fit, or user needs to choose between approaches. Surface findings. |
-**Decision principles:**
-- When in doubt between keep and fuse, prefer fuse — conserves routing slots and reduces surface area
-- When in doubt between keep and discard, prefer keep if there is ANY unique signal — the autopilot won't load it unless triggered
-- Never fuse incompatible domains (e.g., UI design into a security skill) — the result is confusing
----
-## Phase 4: Adaptation
-Input: raw skill content to keep/fuse
-Output: OH-native SKILL.md
-### 4a. Rewrite Frontmatter
-```markdown
----
-name: oh-<new-name>
-description: "Adapted from <source>. <Core function>. Use when <triggers>."
-tier: <2|3|4>
-benefits-from: [<relevant oh- skills this depends on>]
-triggers:
-  - "<trigger phrase from original, adapted>"
-  - "<new trigger phrases for OH context>"
----
-```
-### 4b. Structure the Body
-OH skill structure:
-1. **Summary** — one paragraph of what the skill does
-2. **When to Use** — clear triggering context
-3. **Workflow** — numbered steps (the core of the skill)
-4. **Anti-patterns** — what NOT to do
-5. **Routing** — pass/fail/blocker table
-Adaptation rules:
-- Remove all emojis from content
-- Replace ecosystem-specific terminology with OH equivalents
-- Convert relative paths to OH harness conventions
-- Add routing table based on skill's purpose
-- Keep all concrete rules, examples, and anti-patterns from the original
-- Discard fluff, philosophy, and motivational language
-- Preserve the original's unique signal — that's why you're importing it
-### 4c. Naming
-- Name must match `^[a-z0-9]+(-[a-z0-9]+)*$`
-- Prefix with `oh-`
-- Use the original name if it maps well, adapt if not
-- For fusions: invent a new name that captures the combined purpose
----
-## Phase 5: Fusion (optional — skip for single-skill imports)
-Input: 2+ analyzed skill contents with "fuse" verdict
-Output: one unified skill that merges complementary DNA
-### 5a. Identify Complementary DNA
-For each skill being fused, identify:
-- **Unique rules/concepts** — content that only this skill has
-- **Overlapping content** — same idea expressed differently (keep the better version)
-- **Conflicting directives** — skills that say opposite things (surface to user)
-### 5b. Merge Architecture
-Structure the fused skill so each source contributes its strength:
-```markdown
-## <Combined Workflow>
-### Phase A: <from skill 1>
-<what skill 1 contributes>
-### Phase B: <from skill 2>
-<what skill 2 contributes>
-### Phase C: <from skill 3>
-<what skill 3 contributes>
-```
-Do NOT just concatenate. The fused skill must read as a single coherent workflow, not three documents glued together.
-### 5c. Name the Fusion
-The name should signal the combined purpose, not the individual sources.
-- `oh-facade` (from redesign + design-taste + high-end-visual) — not `oh-redesign-plus-taste`
-- Apply the same principle here
----
-## Phase 6: Integration
-Input: OH-native SKILL.md
-Output: skill fully wired into the harness
-### 6a. Create the Skill File
-Write to `~/.config/opencode/skills/oh-<name>/SKILL.md` (user dir, survives npm updates).
-If the user has an alternative preference (`~/.agents/skills/`), use that instead.
-The file structure follows the standard OH skill template.
-### 6b. Wire into AUTOPILOT
-Add an entry to the auto-classify matrix in `harness/codex/AUTOPILOT.md`:
-- Signal keywords that should trigger this skill
-- Classification label
-- Action: "Load **oh-<name>**. Do not ask."
-### 6c. Wire routing into frontmatter
-Add `route:` frontmatter to the skill — no ROUTING.md edit needed. The dynamic routing system reads `route.pass`, `route.fail`, and `route.blocker` directly from the skill's own `SKILL.md`. The skill becomes routable automatically:
-```yaml
-route:
-  pass: <next skill or done>
-  fail: <fallback skill or surface>
-  blocker: surface
-```
-### 6d. Wire into AGENTS.md
-Add to the skills table in `AGENTS.md`:
-- Skill, tier, purpose
-- Increment the total count
-### 6e. Wire into openhermes.md
-Add to the orchestrator's skill list in `harness/agents/openhermes.md`.
-### 6f. Verify Discovery
-Route to `oh-skills-link` to confirm the skill is discoverable by OpenCode.
----
+1. Load skill content — read from `.agents/skills/`, `npx skills`, URL, user path, or inline text.
+2. Analyze depth — score by lines, concrete rules, examples, anti-patterns, workflow steps, and routing.
+3. Detect overlap — compare against existing `oh-*` skills. Report none/partial/complete.
+4. Decide verdict — Keep (high signal, no overlap), Fuse (partial overlap, merge DNA), Discard (low/no signal), or Ask (ambiguous).
+5. Adapt to OH-native format — remove emojis, convert paths, add routing, preserve unique signal.
+6. Fuse if merging — identify unique concepts from each source, resolve conflicts, write one coherent workflow.
+7. Integrate — create skill file, wire AUTOPILOT, routing, AGENTS.md, openhermes.md.
+8. Verify — route to oh-skills-link to confirm discovery.
 ## Routing
 | Outcome | Route |
-|---|---|
-| integration complete | -> oh-skills-link (verify discovery) |
-| fusion with iteration needed | -> oh-skill-craft (optimize via eval loop) |
-| analysis: discard | -> surface findings to user |
-| analysis: ask | -> surface findings + recommendations to user |
-| blocker | -> surface to user |
-## Anti-patterns
-- Importing a skill without analyzing it first — always run Phase 2
-- Keeping everything from the source — 50% of most external skills is fluff. Be ruthless.
-- Fusing incompatible domains — the result confuses both the model and the user
-- Naming after the source ("oh-tailwind-v2") instead of the capability ("oh-styles")
-- Skipping route frontmatter — a skill without `route.pass`/`route.fail`/`route.blocker` won't auto-route
-- Overwriting existing routing entries without checking for collisions
+|---------|-------|
+| Integration complete | → oh-skills-link (verify discovery) |
+| Fusion needs iteration | → oh-skill-craft |
+| Analysis: discard | → surface |
+| Analysis: ask | → surface with recs |
+| Blocker | → surface |

package/harness/skills/oh-gauntlet/DEEP.md ADDED Viewed

@@ -0,0 +1,77 @@
+# oh-gauntlet — Deep Reference
+## Stage 1: Test Suite
+Run all tests. Check they test behavior (not implementation). Flag gaps in edge case coverage. Do NOT add tests — surface as findings.
+**TDD Iron Law:** `NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST`. If code was written before its test — flag as severe quality gap.
+**RED-GREEN-REFACTOR verification:** For new code in diff, verify each function has a test, the test was seen to fail before implementation (commit history), minimal code was written to pass each test, and tests use real code (not mocks unless unavoidable).
+**Rationalization Table:**
+| Excuse | Reality |
+|--------|---------|
+| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
+| "I'll test after" | Tests passing immediately prove nothing. |
+| "Already manually tested" | Ad-hoc ≠ systematic. Can't re-run. |
+| "Deleting X hours is wasteful" | Sunk cost fallacy. Keeping unverified code is technical debt. |
+| "TDD will slow me down" | TDD faster than debugging. |
+| "Existing code has no tests" | You're improving it. Add tests for existing code. |
+**Red Flags** (any = quality gap): Code before test · Test passes immediately · Can't explain why test failed · Rationalizing "just this once" · "Keep as reference" or "adapt existing code" · "Already spent X hours, deleting is wasteful" · "TDD is dogmatic, I'm being pragmatic" · "Tests after achieve the same purpose"
+**TDD Verification Checklist:**
+- [ ] Every new function has a test
+- [ ] Watched each test fail before implementing (evidence)
+- [ ] Wrote minimal code to pass each test
+- [ ] All tests pass
+- [ ] Output pristine (no errors/warnings)
+- [ ] Tests use real code (mocks only if unavoidable)
+- [ ] Edge cases and errors covered
+## Stage 2: Dual-Axis Review (parallel sub-agents)
+- **Standards** — read documented standards (CONTEXT.md, AGENTS.md, eslint, ADRs). Report every violation. Cite source. Distinguish hard violations from judgment calls.
+- **Spec** — read spec source (plan/issue/PRD). Report missing/partial requirements, scope creep, wrong implementations. Quote the spec.
+Report independently. Do not merge or rank.
+## Stage 3: Edge Case Sweep
+- Error states — invalid inputs, missing files, network failure
+- Concurrency — races, deadlocks, stale state
+- Security — injection, auth bypass, data leakage
+- Performance — N+1, unbounded loops, leaks
+- State transitions — invalid transitions, partial updates
+Per finding: severity (critical/major/minor), location, reproduction.
+## Stage 4: QA Sweep (tiered)
+Quick (critical only) / Standard (+ medium) / Exhaustive (+ cosmetic). Execute flows, log findings, fix highest severity first, re-verify after each fix.
+## Stage 5: Canary (post-deploy)
+Capture pre-deploy baselines. Deploy. Navigate key flows. Diff against baselines. Surface anomalies. Suggest rollback if critical.
+## Stage 6: Manual Verification
+- Happy path, error path, no regression, logging covers failures, docs match behavior.
+## Loop Protocol
+1. Run all stages (skip 5 if not deploying)
+2. 0 critical + 0 major → DONE
+3. Criticals/majors exist → fix highest severity, re-run affected stages
+4. Fix impossible → BLOCKER: `<what> | Options: A, B, C`
+## Anti-patterns
+- Sequential when parallel possible
+- Mixing Standards and Spec findings (keep axes separate)
+- Skipping edge case sweep because tests pass
+- Ignoring minors (accumulated design debt)
+- Pushing critical failures without surfacing
+- Skipping TDD — writing code without a failing test first
+- Tests that pass immediately without having failed first (might test wrong thing)

package/harness/skills/oh-gauntlet/SKILL.md CHANGED Viewed

@@ -1,17 +1,7 @@
 ---
 name: oh-gauntlet
-description: "Rigorous multi-axis testing gauntlet: unit, integration, edge cases, dual-axis review. Loops until done or blocker."
+description: "Use when code is ready for thorough testing — unit tests, integration, edge cases, dual-axis review, and QA. Loops until done or blocker."
 tier: 4
-benefits-from: [oh-expert, oh-builder]
-triggers:
-  - "run the gauntlet on"
-  - "test everything"
-  - "rigorous testing"
-  - "review all angles"
-  - "qa the feature"
-  - "full review of the code"
-  - "validate this feature"
-  - "thorough testing"
 route:
   pass: oh-ship
   fail: oh-builder
@@ -20,104 +10,22 @@ route:
 # oh-gauntlet
-Runs the current build through a multi-axis gauntlet: tests, edge cases, standards review, spec review. Spawns parallel sub-agents for independent axes. Loops until everything passes or a blocker is surfaced.
+Multi-axis testing: test suite, dual-axis review, edge case sweep, QA, canary.
-## Gauntlet Stages
+## Steps
-Each stage runs independently (parallel where possible). A stage that fails loops: fix → re-run → verify → pass or blocker.
-### Stage 1: Test Suite
-Run all existing tests. Check both that they pass and that they actually test the right things:
-- **Unit tests** — do they pass? Are they testing behavior or implementation?
-- **Integration tests** — do the real code paths work end-to-end?
-- **Edge case coverage** — empty states, error states, boundary conditions, concurrency
-If tests are missing or weak, flag what should be added. Do not add them here — surface as finding.
-### Stage 2: Dual-Axis Review (parallel sub-agents)
-Spawn two sub-agents simultaneously:
-**Standards sub-agent:** Read the repo's documented standards (CONTEXT.md, AGENTS.md, eslint config, ADRs). Then read the diff. Report every place the diff violates a documented standard. Cite the standard source. Distinguish hard violations from judgement calls.
-**Spec sub-agent:** Read the spec source (plan.md, issue, PRD, or user's description). Then read the diff. Report: (a) requirements that are missing or partial, (b) scope creep (behavior not asked for), (c) requirements that look implemented but wrong. Quote the spec.
-Report both axes independently — do not merge or rank. A change can pass one and fail the other.
-### Stage 3: Edge Case Sweep
-Systematic edge case analysis for the changed code:
-- Error states — what happens when inputs are invalid, files are missing, network fails?
-- Concurrency — race conditions, deadlocks, stale state
-- Security — injection, auth bypass, data leakage, permission escalation
-- Performance — N+1 queries, unbounded loops, memory leaks, unnecessary allocations
-- State transitions — invalid state transitions, partial updates, rollback gaps
-For each finding: severity (critical/major/minor), location, reproduction path.
-### Stage 4: QA Sweep (tiered)
-Systematic testing with iterative fix-verify cycles. Choose tier based on risk:
-- **Quick** — critical/high severity flows only
-- **Standard** — critical + medium severity, full edge case sweep
-- **Exhaustive** — all of the above + cosmetic, edge cases, cross-browser
-1. Execute tests against each user flow, edge case, error state
-2. Log findings with severity, reproduction steps, evidence
-3. Fix highest-severity first, commit each fix atomically
-4. Re-verify after each fix — confirm fix, check for regressions
-5. Produce health scores (before/after)
-### Stage 5: Canary (post-deploy)
-If deploying to production:
-1. **Set baseline** — capture pre-deploy screenshots and metrics
-2. **Deploy check** — verify deploy completed successfully
-3. **Canary run** — navigate key user flows, capture screenshots, log console errors
-4. **Compare** — diff against pre-deploy baselines
-5. **Alert** — surface anomalies, performance regressions, new errors
-6. **Recovery** — if critical issues found, suggest rollback
-Output: health status, screenshots (before/after), error log, performance diff, ship/no-go verdict.
-### Stage 6: Manual Verification Checklist
-Based on the plan's verification criteria or spec:
-- [ ] Happy path works end-to-end
-- [ ] Error path degrades gracefully
-- [ ] No regression in adjacent areas
-- [ ] Logging/monitoring covers failure modes
-- [ ] Documentation matches behavior (if applicable)
-## Loop Protocol
-1. Run all 6 stages (skip Stage 5 if not deploying)
-2. Collect findings by severity
-3. If 0 criticals and 0 majors → DONE
-4. If criticals or majors exist → fix highest severity first
-5. After fix → re-run affected stages only
-6. If fix is impossible within scope → surface BLOCKER
-## Blocker Protocol
-```
-BLOCKER: <what failed>
-Context: <what was attempted, why it cannot proceed>
-Options:
-  A: <scope reduction>
-  B: <alternative approach>
-  C: <dependency change>
-```
-## Anti-patterns
-- Running stages sequentially when they can be parallel (Standards and Spec reviews are independent)
-- Mixing Standards and Spec findings (keep axes separate — one can pass while the other fails)
-- Skipping edge case sweep because tests pass (tests confirm behavior, not absence of edge cases)
-- Ignoring minors because no criticals exist (accumulated minors signal design debt)
-- Pushing through critical failures without surfacing blocker
+1. Run all tests — verify TDD Iron Law (no production code without failing test first). Flag gaps in edge case coverage.
+2. Run dual-axis review — spawn parallel Standards and Spec sub-agents. Report independently. Do not merge or rank.
+3. Sweep edge cases — error states, concurrency, security, performance, state transitions. Assign severity (critical/major/minor).
+4. Run QA sweep — tiered (quick/standard/exhaustive). Fix highest severity first. Re-verify after each fix.
+5. Deploy canary if applicable — capture pre-deploy baselines, navigate flows, diff anomalies, suggest rollback if critical.
+6. Run manual verification — happy path, error path, no regression, logging covers failures, docs match behavior.
+7. Apply loop protocol — 0 critical + 0 major → done. Fix highest severity, re-run affected stages. Surface blocker with options.
 ## Routing
 | Outcome | Route |
 |---------|-------|
-| pass | → oh-ship (all checks pass) |
-| fail | → oh-builder (fix issues found) |
-| blocker | → surface to user |
+| pass | → oh-ship |
+| fail | → oh-builder (fix issues) |
+| blocker | → surface |

package/harness/skills/oh-grill/DEEP.md ADDED Viewed

@@ -0,0 +1,51 @@
+# oh-grill — Deep Reference
+## When to Use
+Before committing to a plan. "Writing exactly what I asked for and it's still wrong" = design concept not shared. Cheaper in conversation than in code.
+**Example:** User shares a plan. You respond with: "Have you considered the failure mode where X happens?" — then walk through the grill modes.
+## When NOT to Use
+- Clear vetted plan needing execution
+- User needs builder, not critic
+- Trivial decisions
+## Modes / Workflow
+### Mode A: Grill (quick)
+1. Read plan/design doc
+2. Interview one decision at a time — each answer reveals new branches
+3. Resolve each branch before moving on
+4. Surface: contradictions, blind spots, unstated assumptions, ambiguous terms
+5. Propose recommended answer per decision
+6. Output: verified plan with flagged ambiguities
+### Mode B: Grill with Docs (thorough)
+Same + persists to CONTEXT.md, ADRs, and DDD ubiquitous-language glossary.
+1. Load CONTEXT.md + ADRs
+2. Grill decision tree — each resolution may: update CONTEXT.md terms, create ADR, flag glossary ambiguity
+3. **Ubiquitous Language extraction** — scan for domain nouns/verbs/concepts. Identify: same word different concepts, different words same concept, vague terms. Propose canonical glossary with grouped tables. Write example dialogue (3-5 exchanges). Flag ambiguities.
+4. Persist CONTEXT.md changes immediately as language firms
+5. Output: updated CONTEXT.md + ADRs + UBIQUITOUS_LANGUAGE.md (if significant) + verified plan
+## Technique
+- One question at a time
+- Propose recommended answer per decision
+- Walk full decision tree before accepting
+- Reference CONTEXT.md glossary for ambiguous terms
+- Cross-reference ADRs for architecture decisions
+## Anti-patterns
+- Grilling for sake of grilling
+- Questions you could answer by reading plan/codebase
+- ADRs for trivial decisions
+- Polishing CONTEXT.md before concepts settled
+- Updating terms mid-discussion (let conversation resolve)
+- Not distinguishing "must resolve now" vs "figure out later"