npm - openhermes - Versions diffs - 4.3.0 → 4.9.2 - Mend

openhermes 4.3.0 → 4.9.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (96)

package/CONTEXT.md +9 -0
package/README.md +26 -15
package/bootstrap.ts +161 -124
package/harness/agents/oh-browser.md +97 -0
package/harness/agents/oh-builder.md +78 -0
package/harness/agents/oh-facade.md +75 -0
package/harness/agents/oh-fusion.md +45 -0
package/harness/agents/oh-gauntlet.md +71 -0
package/harness/agents/oh-grill.md +71 -0
package/harness/agents/oh-investigate.md +60 -0
package/harness/agents/oh-manifest.md +95 -0
package/harness/agents/oh-plan-review.md +40 -0
package/harness/agents/oh-planner.md +50 -0
package/harness/agents/oh-refactor.md +37 -0
package/harness/agents/oh-retro.md +46 -0
package/harness/agents/oh-review.md +85 -0
package/harness/agents/oh-security.md +83 -0
package/harness/agents/oh-ship.md +76 -0
package/harness/agents/oh-skill-craft.md +38 -0
package/harness/agents/openhermes.md +107 -53
package/harness/codex/AUTOPILOT.md +143 -91
package/harness/codex/CHARTER.md +81 -0
package/harness/commands/oh-doctor.md +193 -14
package/harness/instructions/SHELL.md +76 -0
package/harness/skills/oh-ascii/DEEP.md +292 -0
package/harness/skills/oh-ascii/SKILL.md +31 -0
package/harness/skills/oh-ascii/scripts/check_ascii_alignment.py +596 -0
package/harness/skills/oh-browser/DEEP.md +54 -0
package/harness/skills/oh-browser/SKILL.md +30 -0
package/harness/skills/oh-builder/DEEP.md +63 -0
package/harness/skills/oh-builder/SKILL.md +12 -90
package/harness/skills/oh-expert/DEEP.md +85 -0
package/harness/skills/oh-expert/SKILL.md +13 -106
package/harness/skills/oh-facade/DEEP.md +182 -0
package/harness/skills/oh-facade/SKILL.md +15 -279
package/harness/skills/oh-freeze/DEEP.md +18 -0
package/harness/skills/oh-freeze/SKILL.md +10 -19
package/harness/skills/oh-full-output/DEEP.md +25 -0
package/harness/skills/oh-full-output/SKILL.md +12 -65
package/harness/skills/oh-fusion/DEEP.md +120 -0
package/harness/skills/oh-fusion/SKILL.md +17 -295
package/harness/skills/oh-gauntlet/DEEP.md +77 -0
package/harness/skills/oh-gauntlet/SKILL.md +13 -105
package/harness/skills/oh-grill/DEEP.md +51 -0
package/harness/skills/oh-grill/SKILL.md +12 -63
package/harness/skills/oh-guard/DEEP.md +19 -0
package/harness/skills/oh-guard/SKILL.md +10 -24
package/harness/skills/oh-handoff/DEEP.md +48 -0
package/harness/skills/oh-handoff/SKILL.md +13 -23
package/harness/skills/oh-health/DEEP.md +74 -0
package/harness/skills/oh-health/SKILL.md +13 -76
package/harness/skills/oh-init/DEEP.md +85 -0
package/harness/skills/oh-init/SKILL.md +13 -127
package/harness/skills/oh-investigate/DEEP.md +171 -0
package/harness/skills/oh-investigate/SKILL.md +13 -66
package/harness/skills/oh-issue/DEEP.md +21 -0
package/harness/skills/oh-issue/SKILL.md +11 -27
package/harness/skills/oh-learn/DEEP.md +44 -0
package/harness/skills/oh-learn/SKILL.md +12 -83
package/harness/skills/oh-manifest/DEEP.md +92 -0
package/harness/skills/oh-manifest/SKILL.md +11 -108
package/harness/skills/oh-plan-review/DEEP.md +90 -0
package/harness/skills/oh-plan-review/SKILL.md +13 -115
package/harness/skills/oh-planner/DEEP.md +172 -0
package/harness/skills/oh-planner/SKILL.md +12 -149
package/harness/skills/oh-prd/DEEP.md +45 -0
package/harness/skills/oh-prd/SKILL.md +10 -26
package/harness/skills/oh-refactor/DEEP.md +122 -0
package/harness/skills/oh-refactor/SKILL.md +17 -410
package/harness/skills/oh-retro/DEEP.md +26 -0
package/harness/skills/oh-retro/SKILL.md +12 -24
package/harness/skills/oh-review/DEEP.md +87 -0
package/harness/skills/oh-review/SKILL.md +11 -97
package/harness/skills/oh-security/DEEP.md +83 -0
package/harness/skills/oh-security/SKILL.md +14 -96
package/harness/skills/oh-ship/DEEP.md +141 -0
package/harness/skills/oh-ship/SKILL.md +13 -31
package/harness/skills/oh-skill-craft/DEEP.md +369 -0
package/harness/skills/oh-skill-craft/SKILL.md +17 -178
package/harness/skills/oh-skills-link/DEEP.md +16 -0
package/harness/skills/oh-skills-link/SKILL.md +10 -20
package/harness/skills/oh-skills-list/DEEP.md +20 -0
package/harness/skills/oh-skills-list/SKILL.md +9 -22
package/harness/skills/oh-triage/DEEP.md +23 -0
package/harness/skills/oh-triage/SKILL.md +8 -24
package/harness/skills/oh-worktree/DEEP.md +169 -0
package/harness/skills/oh-worktree/SKILL.md +32 -0
package/lib/harness-resolver.ts +8 -10
package/package.json +5 -3
package/scripts/count-tokens.mjs +158 -0
package/scripts/oh-doctor.ps1 +342 -0
package/harness/codex/CONSTITUTION.md +0 -73
package/harness/codex/ROUTING.md +0 -92
package/harness/instructions/RUNTIME.md +0 -30
package/harness/skills/oh-caveman/SKILL.md +0 -42
package/lib/logger.ts +0 -75

package/harness/skills/oh-skill-craft/DEEP.md ADDED

@@ -0,0 +1,369 @@
+# oh-skill-craft — Deep Reference
+## Skill Structure and Template
+### Directory Layout
+Every skill lives in its own directory under the harness:
+```
+harness/skills/<oh-name>/
+├── SKILL.md        # Main instructions (required)
+├── REFERENCE.md    # Extended docs (if SKILL.md > 100 lines)
+└── scripts/        # For deterministic operations (validation, formatting)
+```
+### Template
+Every SKILL.md follows this structure:
+```markdown
+---
+name: oh-<name>
+description: "Brief. Use when [triggers]."
+tier: <2|3|4>
+route:
+  pass: <next>
+  fail: <fallback>
+  blocker: surface
+---
+# oh-<name>
+<one-paragraph summary>
+## When to Use
+## Steps
+1. Step
+## Anti-patterns
+- List
+```
+### Field Guide
+| Field | Purpose |
+|-------|---------|
+| `name` | Must match `^[a-z0-9]+(-[a-z0-9]+)*$` and directory name |
+| `description` | Max 200 chars. First sentence = function, second = trigger context. |
+| `tier` | 2=tool (deterministic), 3=strategic (analysis/decisions), 4=autonomous (multi-step process) |
+| `route.pass` | Next skill after successful completion |
+| `route.fail` | Fallback skill on failure or edge case |
+| `route.blocker` | Where to surface blockers (usually "surface") |
+## Output Location and Review Checklist
+### Output Location
+Skills are stored in two locations with a precedence rule:
+| Location | Path | Behavior |
+|----------|------|----------|
+| **User-written** | `~/.config/opencode/skills/` | Survives npm update. User edits persist across reinstalls. |
+| **Built-in** | `harness/skills/` in the package | Gets replaced on package update. |
+**Name conflict rule**: On name conflict, user version wins. If a user has `~/.config/opencode/skills/oh-expert/SKILL.md`, that takes precedence over the built-in version.
+### Review Checklist
+Before marking a skill complete, verify every item:
+- [ ] **Description includes triggers** — "Use when..." phrasing in description field
+- [ ] **SKILL.md under 100 lines** — If longer, chunk into sections or create REFERENCE.md
+- [ ] **No time-sensitive info** — No dates, version numbers, or ephemeral references
+- [ ] **Consistent oh- prefix and terminology** — Follow naming conventions from existing skills
+- [ ] **Concrete examples included** — Show real usage, not abstract descriptions
+- [ ] **Anti-patterns documented** — What NOT to do, common mistakes
+- [ ] **Tests still pass** — Run `npm test` (or project-equivalent) to verify no regressions
+## Eval-Driven Iteration
+After drafting a skill, iterate with evidence — not guessing. Test prompts should be substantive multi-step tasks that mirror real usage. The model handles simple tasks without a skill — evals reveal whether the skill pulls its weight on hard cases.
+### 6-Step Loop
+#### 1. Create Test Cases
+Write 2-3 realistic multi-step prompts that mirror real usage. Save to `evals/evals.json`:
+```json
+{
+  "skill_name": "oh-<name>",
+  "evals": [
+    {
+      "id": 1,
+      "prompt": "Realistic multi-step task the skill should handle",
+      "expected_output": "Concrete expected result description",
+      "files": []
+    }
+  ]
+}
+```
+#### 2. Spawn Runs
+Launch parallel subagents: **with-skill** (load the skill, execute prompt) vs **baseline** (no skill loaded, or previous version). Use the same prompt for both.
+Save outputs to:
+- `iteration-N/eval-ID/with_skill/`
+- `iteration-N/eval-ID/baseline/`
+#### 3. Draft Assertions
+While runs execute, draft objectively verifiable assertions for each test case. Good assertions:
+- Have descriptive names
+- Can be checked programmatically (output contains X, file Y was created, step Z was followed)
+- Update `evals/evals.json` with these assertions
+#### 4. Grade
+Aggregate results per assertion:
+- **Pass rates** — Which assertions pass/fail with vs without the skill
+- **Timing** — How long each run takes
+- **Token usage** — Cost comparison
+Look for:
+- **Non-discriminating** assertions — always pass regardless of skill → remove them (they add noise)
+- **High-variance** results — possibly flaky tests → investigate
+- **Time/token tradeoffs** — does the skill justify its cost in latency and tokens?
+#### 5. Improve
+Revise the skill based on failures. Important rules:
+- **Generalize** from specific failure patterns — don't overfit to 2-3 test cases
+- The goal is a skill that works across a million prompts, not just your test set
+- Keep instructions lean — every word should earn its place
+#### 6. Loop
+Rerun all tests into a new iteration directory. Repeat until one of:
+- User is satisfied
+- All feedback is positive
+- No meaningful progress between iterations
+## Description Optimization
+After the skill body is solid, optimize its `description` field for triggering accuracy. The description is what the routing system uses to match queries to skills.
+### Process
+#### 1. Create 20 Eval Queries
+Construct a balanced eval set:
+| Type | Count | Purpose |
+|------|-------|---------|
+| **Should-trigger** | 10 | Different phrasings and contexts where this skill is the correct answer |
+| **Should-not-trigger** | 10 | Near-misses that share keywords but need a different skill |
+#### 2. Quality Rules
+- Queries must be **realistic** — phrases users actually type, not academic exercises
+- Include **concrete details** — "create a skill for validating YAML configs" not "make a skill"
+- Should-not-trigger queries should be **genuinely confusable** — if they're obviously unrelated, the test is useless
+#### 3. Iterate Description
+Write candidate descriptions. For each candidate:
+- Score it against the eval set
+- How many should-trigger queries does it catch?
+- How many should-not-trigger does it correctly reject?
+- Tune phrasing, keywords, and structure
+#### 4. Select Winner
+The description with the best precision/recall balance wins. Record it in the skill frontmatter.
+## Effectiveness and Testing
+### 1. TDD for Skills Methodology
+**Writing skills IS Test-Driven Development applied to process documentation.**
+You write test cases (pressure scenarios), watch them fail (baseline agent behavior), write the skill (the documentation), watch tests pass (agents comply), and refactor (close loopholes).
+**Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.
+#### TDD Mapping
+| TDD Concept | Skill Creation |
+|-------------|----------------|
+| **Test case** | Pressure scenario with subagent |
+| **Production code** | Skill document (`SKILL.md`) |
+| **Test fails (RED)** | Agent violates rule without skill (baseline) |
+| **Test passes (GREEN)** | Agent complies with skill present |
+| **Refactor** | Close loopholes while maintaining compliance |
+| **Write test first** | Run baseline scenario BEFORE writing skill |
+| **Watch it fail** | Document exact rationalizations agent uses |
+| **Minimal code** | Write skill addressing those specific violations |
+| **Watch it pass** | Verify agent now complies |
+| **Refactor cycle** | Find new rationalizations → plug → re-verify |
+#### The Iron Law
+```
+NO SKILL WITHOUT A FAILING TEST FIRST
+```
+This applies to NEW skills AND EDITS to existing skills. No exceptions.
+#### RED Phase — Write Failing Test
+Run a pressure scenario with a subagent WITHOUT the skill. Document exact behavior:
+- What choices did they make?
+- What rationalizations did they use (verbatim)?
+- Which pressures triggered violations?
+#### GREEN Phase — Write Minimal Skill
+Write a skill that addresses those specific rationalizations. Do not add extra content for hypothetical cases. Run the same scenario WITH the skill. The agent should now comply.
+#### REFACTOR Phase — Close Loopholes
+Agent found a new rationalization? Add an explicit counter. Re-test until bulletproof.
+### 2. Claude Search Optimization (CSO)
+**Critical for discovery — and for correct behavior.** The description field determines both *whether* a skill is loaded and *how* the agent uses it.
+#### Description = When to Use, NOT What the Skill Does
+When a description summarizes the skill's workflow, the agent may follow the description instead of reading the full skill content. The skill body becomes documentation the agent skips if the description gives away the process.
+#### Bad vs. Good Descriptions
+```yaml
+# ❌ BAD: Summarizes workflow
+description: Use when executing plans - dispatches subagent per task with code review between tasks
+# ❌ BAD: Too much process detail
+description: Use for TDD - write test first, watch it fail, write minimal code, refactor
+# ❌ BAD: Too abstract, vague
+description: For async testing
+# ❌ BAD: First person
+description: I can help you with async tests when they're flaky
+# ✅ GOOD: Just triggering conditions, no workflow summary
+description: Use when executing implementation plans with independent tasks in the current session
+# ✅ GOOD: Describes the problem
+description: Use when tests have race conditions, timing dependencies, or pass/fail inconsistently
+```
+**Rules:**
+- Start with "Use when..." to focus on triggering conditions
+- Describe the *problem*, not *language-specific symptoms*
+- Keep triggers technology-agnostic unless the skill itself is technology-specific
+- Write in third person
+- **NEVER summarize the skill's process or workflow**
+#### Keyword Coverage
+Use words the agent would search for:
+- **Error messages:** "Hook timed out", "ENOTEMPTY", "race condition"
+- **Symptoms:** "flaky", "hanging", "zombie", "pollution"
+- **Synonyms:** "timeout/hang/freeze", "cleanup/teardown/afterEach"
+- **Tools:** Actual commands, library names, file types
+### 3. Bulletproofing Techniques
+Skills that enforce discipline need to resist rationalization. Agents are smart and will find loopholes when under pressure.
+#### Close Every Loophole Explicitly
+Don't just state the rule — forbid specific workarounds:
+```markdown
+<!-- ✅ Good -->
+Write code before test? Delete it. Start over.
+No exceptions:
+- Don't keep it as "reference"
+- Don't "adapt" it while writing tests
+- Don't look at it
+- Delete means delete
+```
+#### Address "Spirit vs Letter" Arguments
+Add a foundational principle early:
+```markdown
+**Violating the letter of the rules is violating the spirit of the rules.**
+```
+#### Build Rationalization Table
+| Excuse | Reality |
+|--------|---------|
+| "Skill is obviously clear" | Clear to you ≠ clear to other agents. Test it. |
+| "It's just a reference" | References can have gaps, unclear sections. Test retrieval. |
+| "Testing is overkill" | Untested skills have issues. Always. 15 min testing saves hours. |
+| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE deploying. |
+| "Too tedious to test" | Testing is less tedious than debugging bad skill in production. |
+| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |
+| "Academic review is enough" | Reading ≠ using. Test application scenarios. |
+| "No time to test" | Deploying untested skill wastes more time fixing it later. |
+**All of these mean: Test before deploying. No exceptions.**
+#### Create Red Flags List
+```markdown
+## Red Flags — STOP and Start Over
+- Code before test
+- "I already manually tested it"
+- "Tests after achieve the same purpose"
+- "It's about spirit not ritual"
+- "This is different because..."
+```
+### 4. Pressure Testing Methodology
+Different skill types need different test approaches.
+| Skill Type | Test Approach | Success Criteria |
+|------------|---------------|-----------------|
+| Discipline-Enforcing | Academic questions, pressure scenarios, multiple pressures combined | Agent follows rule under maximum pressure |
+| Technique | Application scenarios, variation, missing information tests | Agent applies technique to new scenario |
+| Pattern | Recognition scenarios, application, counter-examples | Agent correctly identifies when/how to apply |
+| Reference | Retrieval scenarios, application, gap testing | Agent finds and applies reference information |
+#### Combine 3+ Pressures
+For discipline-enforcing skills, combine multiple pressures to find breaking points:
+| Pressure Type | Description |
+|---------------|-------------|
+| **Time** | "This is urgent, just this once skip the rule" |
+| **Sunk cost** | "I already wrote the code, starting over wastes work" |
+| **Authority** | "The user asked me to do it this way" |
+| **Exhaustion** | "After 10 tests, one shortcut won't matter" |
+| **Social** | "Other agents skip this step, it's fine" |
+| **Economic** | "Testing takes too many tokens" |
+#### Meta-Testing
+After the agent chooses wrong, ask: "How could the skill be written differently to prevent this?" Use the answer to improve the skill.
+### 5. Token Efficiency Targets
+Every word in a skill costs context.
+| Skill Type | Target |
+|------------|--------|
+| Getting-started workflows | **<150 words** each |
+| Frequently-loaded skills | **<200 words** total |
+| Other skills | **<500 words** |
+#### Techniques
+- **Move details to tool help** — Reference `--help` instead of documenting all flags
+- **Use cross-references** — Reference other skills instead of repeating workflow
+- **Compress examples** — Keep examples minimal
+- **Eliminate redundancy** — Don't repeat what's in cross-referenced skills
+### 6. Persuasion Principles
+Discipline-enforcing skills benefit from systematic persuasion mapping (based on Cialdini's principles):
+| Principle | Application in Skills |
+|-----------|----------------------|
+| **Authority** | State "Required", "Mandatory", "The Iron Law" |
+| **Commitment** | "Once you start, follow through. No shortcuts." |
+| **Social Proof** | "Every agent follows this rule. No exceptions." |
+| **Scarcity** | "You only get one chance to do this right." |
+| **Liking** | "Your human partner trusts you to follow this." |
+| **Unity** | "We follow quality processes here." |
+### 7. Flowchart Usage Guidance
+Flowcharts are a precision tool. Use them only where they add clarity.
+#### Use Flowcharts ONLY For
+- Non-obvious decision points
+- Process loops (where an agent might stop too early)
+- "When to use A vs B" decisions
+#### Never Use Flowcharts For
+- Reference material → Use tables or lists
+- Code examples → Use markdown code blocks
+- Linear instructions → Use numbered lists
+- Labels without semantic meaning → Every node label must explain the decision or action
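
Note on the Description Optimization section in the hunk above: it scores each candidate description against 20 eval queries and picks the one with the best precision/recall balance. Below is a minimal sketch of that scoring in Python. It assumes each query outcome is recorded as a `(should_trigger, triggered)` pair. The function name `score_description` and the use of F1 as the balance measure are illustrative assumptions, not something the package specifies.

```python
from typing import Iterable, Tuple


def score_description(results: Iterable[Tuple[bool, bool]]) -> dict:
    """Score one candidate description (hypothetical helper).

    Each result is (should_trigger, triggered): the label from the eval set
    and whether the skill was actually selected for that query.
    """
    tp = fp = fn = 0
    for should_trigger, triggered in results:
        if triggered and should_trigger:
            tp += 1  # correctly caught a should-trigger query
        elif triggered and not should_trigger:
            fp += 1  # wrongly fired on a near-miss
        elif should_trigger and not triggered:
            fn += 1  # missed a should-trigger query
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is one common way to balance the two; choosing it is an assumption.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical run: 10 should-trigger queries (8 caught, 2 missed)
# and 10 near-misses (2 wrongly caught, 8 correctly rejected).
results = (
    [(True, True)] * 8
    + [(True, False)] * 2
    + [(False, True)] * 2
    + [(False, False)] * 8
)
scores = score_description(results)
```

Comparing the `f1` value across candidates gives one way to pick a winner; a project that cares more about avoiding false triggers could compare `precision` first instead.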

package/harness/skills/oh-skill-craft/SKILL.md CHANGED

@@ -1,195 +1,34 @@
 ---
 name: oh-skill-craft
-description: "Create new agent skills with proper structure, frontmatter, progressive disclosure, and bundled resources. Meta-skill for growing the harness."
+description: "Use when a new OH skill needs to be created, existing skill needs review against standards, or an external capability should be integrated as a skill. Meta-skill for growing the harness."
 tier: 2
-benefits-from: [oh-expert]
-triggers:
-  - "create a skill"
-  - "write a skill"
-  - "new skill"
-  - "skill-craft"
-  - "meta-skill"
-  - "add a capability"
 route:
-  pass: oh-skills-link
+  pass:
+    - oh-skills-link
+    - oh-learn
   fail: oh-expert
   blocker: surface
 ---
 # oh-skill-craft
-Create new agent skills for the OpenHermes harness. Skills are the unit of progressive disclosure — loaded on demand, not preloaded.
+Create new agent skills for the OpenHermes harness.
-## Skill Structure
+## Steps
-```
-harness/skills/<oh-name>/
-├── SKILL.md           # Main instructions (required)
-├── REFERENCE.md       # Detailed docs (if SKILL.md exceeds 100 lines)
-└── scripts/           # Utility scripts (if deterministic operations needed)
-```
-## SKILL.md Template
-```markdown
----
-name: oh-<name>
-description: "Brief description. Use when [specific triggers]."
-tier: <2|3|4>
-benefits-from: [<skill-dependencies>]
-triggers:
-  - "<trigger phrase>"
-  - "<another trigger>"
----
-# oh-<name>
-<one-paragraph summary>
-## When to Use
-<when to invoke this skill>
-## Workflow
-1. <step>
-2. <step>
-3. <step>
-## Anti-patterns
-- <anti-pattern 1>
-- <anti-pattern 2>
-```
-## Description Requirements
-The description is the only thing the agent sees when deciding which skill to load. Make it actionable:
-**Good:** "Create new agent skills with proper structure, frontmatter, and bundled resources. Use when user wants to create, write, or build a new skill."
-**Bad:** "Helps with skills."
-## Field Guide
-| Frontmatter Field | Required | Purpose |
-|---|---|---|
-| `name` | yes | Must match `^[a-z0-9]+(-[a-z0-9]+)*$` and directory name |
-| `description` | yes | Max 200 chars. First sentence = what it does. Second = when to use. |
-| `tier` | no | 2=tool, 3=strategic, 4=autonomous. Controls preamble verbosity. |
-| `benefits-from` | no | Skill dependencies. Listed skills should be loaded first. |
-| `triggers` | no | Natural language patterns that should route to this skill. |
-## When to Add Scripts
-- Operation is deterministic (validation, formatting)
-- Same code would be generated repeatedly
-- Errors need explicit handling
-Scripts save tokens and improve reliability vs generated code.
-## Output Location
-Skills created with oh-skill-craft should be written to `~/.config/opencode/skills/` (or `~/.agents/skills/` if the user prefers). Built-in skills live in the package `harness/skills/` and get replaced on npm update. User-written skills in `~/.config/opencode/skills/` survive updates and are auto-discovered on every session. On name conflict with a built-in skill, the user version wins.
-## When to Split Files
-- SKILL.md exceeds 100 lines
-- Content has distinct domains
-- Advanced features are rarely used (put in REFERENCE.md)
-## Review Checklist
-- [ ] Description includes triggers ("Use when...")
-- [ ] SKILL.md under 100 lines
-- [ ] No time-sensitive info (dates, versions, deprecation warnings)
-- [ ] Consistent oh- prefix and terminology
-- [ ] Concrete examples included
-- [ ] Anti-patterns documented
-- [ ] Tests still pass after adding (`npm test`)
-## Eval-Driven Iteration
-After writing the initial skill draft, iterate using test cases and evidence rather than guessing.
-### 1. Create Test Cases
-Come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Save to `evals/evals.json`:
-```json
-{
-  "skill_name": "oh-<name>",
-  "evals": [
-    {
-      "id": 1,
-      "prompt": "User's realistic task prompt",
-      "expected_output": "Description of expected result",
-      "files": []
-    }
-  ]
-}
-```
-Good test prompts are substantive multi-step tasks — not simple queries like "read this file." The model can handle simple tasks without a skill. Complex, multi-step, or specialized queries reveal whether the skill is pulling its weight.
-### 2. Spawn Runs
-For each test case, spawn two subagents in parallel:
-- **With-skill run** — load the skill, execute the task
-- **Baseline run** — same prompt without the skill (for new skills) or with the previous version (for improvements)
-Save outputs to `iteration-<N>/eval-<ID>/with_skill/outputs/` and `iteration-<N>/eval-<ID>/without_skill/outputs/`.
-### 3. Draft Assertions
-While runs execute, draft objectively verifiable assertions for each test case. Good assertions have descriptive names and can be checked programmatically where possible. Update `evals/evals.json` with the assertions.
-### 4. Grade and Compare
-Grade runs against assertions. Aggregate results into pass rates, timing, and token usage. Look for:
-- Assertions that always pass regardless of skill (non-discriminating — remove them)
-- High-variance evals (possibly flaky tests)
-- Time/token tradeoffs between skill and baseline
-### 5. Improve
-Based on results, revise the skill. Generalize from specific failures rather than overfitting to the test cases. The goal is a skill that works across a million different prompts, not just 2-3 examples. Keep instructions lean — remove anything not pulling its weight.
-### 6. Loop
-Rerun all test cases into a new iteration directory. Repeat until:
-- User says they're happy
-- All feedback is positive
-- No meaningful progress between iterations
-## Description Optimization
-The description field in frontmatter is the primary mechanism for skill triggering. After the skill is solid, optimize the description for accuracy.
-### Trigger Eval Queries
-Create 20 eval queries — a mix of should-trigger and should-not-trigger cases:
-```json
-[
-  {"query": "realistic user prompt that should trigger", "should_trigger": true},
-  {"query": "near-miss prompt that should NOT trigger", "should_trigger": false}
-]
-```
-Key principles:
-- **Should-trigger** (8-10): different phrasings of the same intent — formal, casual. Include edge cases and contexts where this skill competes with another but should win.
-- **Should-not-trigger** (8-10): near-misses that share keywords but need a different skill. Avoid obviously irrelevant queries — the hard cases are the adjacent ones.
-Queries must be realistic — what a user would actually type, with concrete details, not abstract descriptions.
-### Run Optimization
-Iterate the description: test current, propose improvements based on failures, re-test. Select the description that scores best on held-out test data. Apply the winner to the skill's frontmatter.
+1. Create skill directory — `harness/skills/<oh-name>/` with `SKILL.md`. Follow template structure (frontmatter, summary, Workflow, Anti-patterns).
+2. Write frontmatter — name (regex `^[a-z0-9]+(-[a-z0-9]+)*$`), description (max 200 chars, "Use when..."), tier (2/3/4), triggers, route.
+3. Draft skill body — When to Use, Workflow (numbered steps), Anti-patterns with concrete examples, Routing table.
+4. Review against checklist — description includes triggers, SKILL.md under 100 lines, no time-sensitive info, tests pass, consistent oh- prefix.
+5. Run eval-driven iteration — create test cases, spawn with-skill vs baseline sub-agents, grade pass rates/timing/tokens, improve and generalize.
+6. Optimize description — create 20 eval queries (10 should-trigger, 10 should-not), iterate description against eval set, select best precision/recall.
+7. Close loopholes — build rationalization table, create red flags, apply bulletproofing techniques (forbid specific workarounds).
 ## Routing
 | Outcome | Route |
 |---------|-------|
-| pass | → oh-skills-link (verify skill discovery) |
-| iteration data available | → oh-learn (extract patterns from eval results) |
-| fail | → oh-expert (diagnose skill creation issues) |
-| blocker | → surface to user |
+| pass | → oh-skills-link (verify discovery) |
+| iteration data | → oh-learn (extract patterns) |
+| fail | → oh-expert (diagnose) |
+| blocker | → surface |
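
Note on step 2 of the new `Steps` list above: the frontmatter constraints (name regex, 200-character description with "Use when..." phrasing, tier 2/3/4) are all mechanically checkable. The following Python sketch shows one way to check them. It assumes the frontmatter has already been parsed into a dict; `validate_frontmatter` and its messages are hypothetical and are not part of the package.

```python
import re

# Constraints quoted from the diff; the constant names are illustrative.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
ALLOWED_TIERS = {2, 3, 4}
MAX_DESCRIPTION_CHARS = 200


def validate_frontmatter(frontmatter: dict, directory_name: str) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    name = frontmatter.get("name", "")
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"name {name!r} does not match the required pattern")
    if name != directory_name:
        problems.append(f"name {name!r} differs from directory {directory_name!r}")

    description = frontmatter.get("description", "")
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append("description exceeds 200 characters")
    # Heuristic for the checklist item about "Use when..." phrasing.
    if "use when" not in description.lower():
        problems.append('description lacks "Use when..." phrasing')

    if frontmatter.get("tier") not in ALLOWED_TIERS:
        problems.append("tier must be 2, 3, or 4")

    return problems


# Frontmatter values taken from the oh-skill-craft hunk above.
ok = validate_frontmatter(
    {
        "name": "oh-skill-craft",
        "description": (
            "Use when a new OH skill needs to be created, existing skill needs "
            "review against standards, or an external capability should be "
            "integrated as a skill. Meta-skill for growing the harness."
        ),
        "tier": 2,
    },
    directory_name="oh-skill-craft",
)

# A deliberately broken, made-up example.
bad = validate_frontmatter(
    {"name": "Oh_Bad", "description": "Helps with skills.", "tier": 1},
    directory_name="oh-bad",
)
```

Under this heuristic, a description that opens with "Use after..." (as the updated `oh-skills-link` description does) would be flagged, so the phrasing check may need loosening in practice.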

package/harness/skills/oh-skills-link/DEEP.md ADDED

@@ -0,0 +1,16 @@
+# oh-skills-link — Deep Reference
+## When to Use
+After installing or updating skills. Verify OpenCode discovers the package-local directory.
+## Anti-patterns
+- Linking without verifying files exist
+- Copying to global config during normal operation
+- Overwriting user-modified skills without intent
+- Linking broken/incomplete skills
+## Reference
+**Example:** After running npm update, run oh-skills-link. It reads harness/skills/, confirms config paths, reports any missing or new skills.
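
Note on the "Overwriting user-modified skills without intent" anti-pattern above: the oh-skill-craft reference in this same diff states that on a name conflict the user version wins. A small Python sketch of that precedence rule follows. It assumes skills are represented as name-to-path mappings; `resolve_skills` is a hypothetical helper and the package's own resolver may work differently.

```python
def resolve_skills(built_in: dict, user: dict) -> tuple:
    """Merge skill name -> path maps so that user entries win on conflict.

    Returns the merged map plus the sorted names that were overridden,
    so a caller can report them instead of silently replacing anything.
    """
    resolved = dict(built_in)
    resolved.update(user)  # later update wins: user paths replace built-in ones
    overridden = sorted(set(built_in) & set(user))
    return resolved, overridden


# Paths follow the two locations named in the diff; the skill set is illustrative.
built_in = {
    "oh-expert": "harness/skills/oh-expert/SKILL.md",
    "oh-learn": "harness/skills/oh-learn/SKILL.md",
}
user = {"oh-expert": "~/.config/opencode/skills/oh-expert/SKILL.md"}

resolved, overridden = resolve_skills(built_in, user)
```

Returning the overridden names separately keeps the override visible, which matches the intent of not touching user-modified skills without explicit intent.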

package/harness/skills/oh-skills-link/SKILL.md CHANGED

@@ -1,11 +1,7 @@
 ---
 name: oh-skills-link
-description: "Verify that OpenCode can discover the package-local skills directory"
+description: "Use after installing or updating skills to verify OpenCode discovers the package-local skills directory."
 tier: 2
-triggers:
-  - "verify skills"
-  - "check skill discovery"
-  - "link skills"
 route:
   pass: surface
   fail: oh-skill-craft
@@ -14,25 +10,19 @@ route:
 # oh-skills-link
-## When to Use
-After installing new skills or updating existing ones. Verifies that OpenCode can discover the package-local skills directory.
+Verify OpenCode discovers the package-local skills directory after install/update.
-## Workflow
-1. Read skills from `harness/skills/`
-2. Confirm `config.skills.paths` points at the package-local harness path
-3. Skip unchanged skills when checking manifests
-4. Log missing, invalid, or newly added skills
+## Steps
-## Anti-patterns
-- Linking skills without verifying they exist in harness
-- Copying skills into global config during normal operation
-- Overwriting user-modified skills without explicit intent
-- Linking broken or incomplete skills
+1. Read `harness/skills/` directory listing
+2. Confirm `config.skills.paths` points at harness path
+3. Skip skills that are unchanged
+4. Log missing, invalid, or newly added skills
 ## Routing
 | Outcome | Route |
 |---------|-------|
-| pass | → [report link status to user] |
-| fail | → oh-skill-craft (fix or rebuild broken skill) |
-| blocker | → surface to user |
+| pass | → surface |
+| fail | → oh-skill-craft |
+| blocker | → surface |
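
Note on steps 1 and 4 of the new `Steps` list above (read the directory listing, then log missing, invalid, or newly added skills): the Python sketch below shows one possible shape for that scan. It assumes a skill directory is valid when it contains a `SKILL.md` file and that the caller supplies the set of previously known skill names. `scan_skills` is hypothetical, and the `config.skills.paths` check and change detection from steps 2 and 3 are not covered.

```python
import tempfile
from pathlib import Path


def scan_skills(skills_root: Path, known: set) -> dict:
    """Classify skills under skills_root relative to a set of known names."""
    found, invalid = set(), set()
    for entry in sorted(skills_root.iterdir()):
        if not entry.is_dir():
            continue  # ignore stray files at the top level
        if (entry / "SKILL.md").is_file():
            found.add(entry.name)
        else:
            invalid.add(entry.name)  # directory exists but has no SKILL.md
    return {
        "missing": sorted(known - found - invalid),
        "invalid": sorted(invalid),
        "new": sorted(found - known),
    }


# Demo against a throwaway directory; the layout is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("oh-expert", "oh-worktree"):
        (root / name).mkdir()
        (root / name / "SKILL.md").write_text(f"---\nname: {name}\n---\n")
    (root / "oh-broken").mkdir()  # incomplete skill: no SKILL.md
    report = scan_skills(root, known={"oh-expert", "oh-learn"})
```

In this demo `oh-learn` is reported as missing, `oh-broken` as invalid, and `oh-worktree` as new.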

package/harness/skills/oh-skills-list/DEEP.md ADDED

@@ -0,0 +1,20 @@
+# oh-skills-list — Deep Reference
+## When to Use
+User wants to see available skills. Lists all oh-* skills with tier and description.
+## Anti-patterns
+- Filtering skills (show everything — let user decide)
+- Including non-OH skills in the output
+## Reference
+**Example:** User asks "what skills do you have?" → outputs a table of all oh-* skills with tier and purpose.
+### Output Format
+| Skill | Tier | Purpose |
+|-------|------|---------|
+| oh-<name> | 2/3/4 | <description> |
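
Note on the Output Format above: the table can be generated from skill metadata in a few lines. The Python sketch below assumes each skill is a dict with `name`, `tier`, and `description` keys. `render_skills_table` and the sample entries are hypothetical. It lists every `oh-` skill without further filtering and leaves out non-OH entries, following the two anti-patterns listed in the hunk.

```python
def render_skills_table(skills: list) -> str:
    """Render all oh-* skills as a markdown table, sorted by name."""
    lines = ["| Skill | Tier | Purpose |", "|-------|------|---------|"]
    for skill in sorted(skills, key=lambda s: s["name"]):
        if not skill["name"].startswith("oh-"):
            continue  # non-OH skills are excluded from the output
        # Escape pipes so a description cannot break the table layout.
        purpose = str(skill["description"]).replace("|", "\\|")
        lines.append(f"| {skill['name']} | {skill['tier']} | {purpose} |")
    return "\n".join(lines)


# Made-up entries for illustration only.
table = render_skills_table(
    [
        {"name": "oh-example-b", "tier": 3, "description": "Use when B applies."},
        {"name": "oh-example-a", "tier": 2, "description": "Use when A | A2 applies."},
        {"name": "other-tool", "tier": 2, "description": "Not an OH skill."},
    ]
)
```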