opencodekit 0.18.4 → 0.18.6
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/index.js +491 -47
- package/dist/template/.opencode/AGENTS.md +13 -1
- package/dist/template/.opencode/agent/build.md +4 -1
- package/dist/template/.opencode/agent/explore.md +25 -58
- package/dist/template/.opencode/command/ship.md +7 -5
- package/dist/template/.opencode/command/verify.md +63 -12
- package/dist/template/.opencode/memory/research/benchmark-framework.md +162 -0
- package/dist/template/.opencode/memory/research/effectiveness-audit.md +213 -0
- package/dist/template/.opencode/memory.db +0 -0
- package/dist/template/.opencode/memory.db-shm +0 -0
- package/dist/template/.opencode/memory.db-wal +0 -0
- package/dist/template/.opencode/opencode.json +1429 -1678
- package/dist/template/.opencode/package.json +1 -1
- package/dist/template/.opencode/plugin/lib/memory-helpers.ts +3 -129
- package/dist/template/.opencode/plugin/lib/memory-hooks.ts +4 -60
- package/dist/template/.opencode/plugin/memory.ts +0 -3
- package/dist/template/.opencode/skill/agent-teams/SKILL.md +16 -1
- package/dist/template/.opencode/skill/beads/SKILL.md +22 -0
- package/dist/template/.opencode/skill/brainstorming/SKILL.md +28 -0
- package/dist/template/.opencode/skill/code-navigation/SKILL.md +130 -0
- package/dist/template/.opencode/skill/condition-based-waiting/SKILL.md +12 -0
- package/dist/template/.opencode/skill/context-management/SKILL.md +122 -113
- package/dist/template/.opencode/skill/defense-in-depth/SKILL.md +20 -0
- package/dist/template/.opencode/skill/design-system-audit/SKILL.md +113 -112
- package/dist/template/.opencode/skill/dispatching-parallel-agents/SKILL.md +8 -0
- package/dist/template/.opencode/skill/executing-plans/SKILL.md +156 -132
- package/dist/template/.opencode/skill/memory-system/SKILL.md +50 -266
- package/dist/template/.opencode/skill/mockup-to-code/SKILL.md +21 -6
- package/dist/template/.opencode/skill/receiving-code-review/SKILL.md +8 -0
- package/dist/template/.opencode/skill/root-cause-tracing/SKILL.md +15 -0
- package/dist/template/.opencode/skill/session-management/SKILL.md +4 -103
- package/dist/template/.opencode/skill/subagent-driven-development/SKILL.md +23 -2
- package/dist/template/.opencode/skill/swarm-coordination/SKILL.md +17 -1
- package/dist/template/.opencode/skill/systematic-debugging/SKILL.md +21 -0
- package/dist/template/.opencode/skill/tool-priority/SKILL.md +34 -16
- package/dist/template/.opencode/skill/ui-ux-research/SKILL.md +5 -127
- package/dist/template/.opencode/skill/verification-before-completion/SKILL.md +36 -0
- package/dist/template/.opencode/skill/verification-before-completion/references/VERIFICATION_PROTOCOL.md +133 -29
- package/dist/template/.opencode/skill/visual-analysis/SKILL.md +20 -7
- package/dist/template/.opencode/skill/writing-plans/SKILL.md +7 -0
- package/dist/template/.opencode/tool/context7.ts +9 -1
- package/dist/template/.opencode/tool/grepsearch.ts +9 -1
- package/package.json +1 -1
@@ -220,7 +220,7 @@ For major tracked work:
 
 ## Edit Protocol
 
-`str_replace` failures are the #1 source of LLM coding failures.
+`str_replace` failures are the #1 source of LLM coding failures. When tilth MCP is available with `--edit`, prefer hash-anchored edits (see below). Otherwise, use structured edits:
 
 1. **LOCATE** — Use LSP tools (goToDefinition, findReferences) to find exact positions
 2. **READ** — Get fresh file content around target (offset: line-10, limit: 30)
@@ -241,6 +241,18 @@ Files over ~500 lines become hard to maintain and review. Extract helpers, split
 
 **Use the `structured-edit` skill for complex edits.**
 
+### Hash-Anchored Edits (MCP)
+
+When tilth MCP is available with `--edit` mode, use hash-anchored edits for higher reliability:
+
+1. **READ** via `tilth_read` — output includes `line:hash|content` format per line
+2. **EDIT** via `tilth_edit` — reference lines by their `line:hash` anchor
+3. **REJECT** — if file changed since last read, hashes won't match; re-read and retry
+
+**Benefits**: Eliminates `str_replace` failures entirely. If the file changed between read and edit, the operation fails safely (no silent corruption).
+
+**Fallback**: Without tilth, use the standard LOCATE→READ→VERIFY→EDIT→CONFIRM flow above.
+
 ---
 
 ## Output Style
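The `line:hash|content` read format that the new section describes can be illustrated with a small shell sketch. This is hypothetical — the actual hash algorithm and anchor length `tilth_read` uses are not specified in this diff — but it shows why a stale anchor fails loudly instead of corrupting the file:

```bash
# Create a small demo file (stand-in for a real source file).
printf 'alpha\nbeta\n' > demo.txt

# Hypothetical sketch of line:hash addressing (tilth's real hash algorithm
# and length may differ). Each line is keyed by (number, content hash), so
# an edit anchored to e.g. "2:<hash>" is rejected if line 2 changed since
# the last read.
n=0
while IFS= read -r line; do
  n=$((n + 1))
  h=$(printf '%s' "$line" | shasum -a 256 | cut -c1-8)
  printf '%s:%s|%s\n' "$n" "$h" "$line"
done < demo.txt
```

An edit request then carries the anchor it last saw; if the recomputed hash for that line differs, the edit is refused and the caller re-reads.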
@@ -79,7 +79,9 @@ Implement requested work, verify with fresh evidence, and coordinate subagents o
 
 - No success claims without fresh verification output
 - Verification failures are **signals, not condemnations** — adjust and proceed
-- Re-run typecheck/lint/tests after meaningful edits
+- Re-run typecheck/lint/tests after meaningful edits (use incremental mode — changed files only)
+- Run typecheck + lint in parallel, then tests sequentially
+- Check `.beads/verify.log` cache before re-running — skip if no changes since last PASS
 - If verification fails twice on the same approach, **escalate with learnings**, not frustration
 
 ## Ritual Structure
@@ -170,6 +172,7 @@ Load contextually when needed:
 | UI work | `frontend-design`, `react-best-practices` |
 | Parallel orchestration | `swarm-coordination`, `beads-bridge` |
 | Before completion | `requesting-code-review`, `finishing-a-development-branch` |
+| Codebase exploration | `code-navigation` |
 
 ## Execution Mode
 
@@ -11,6 +11,9 @@ tools:
   memory-update: false
   observation: false
   question: false
+  websearch: false
+  webfetch: false
+  codesearch: false
 ---
 
 You are OpenCode, the best coding agent on the planet.
@@ -19,8 +22,6 @@ You are OpenCode, the best coding agent on the planet.
 
 **Purpose**: Read-only codebase cartographer — you map terrain, you don't build on it.
 
-> _"Agency is knowing where the levers are before you pull them."_
-
 ## Identity
 
 You are a read-only codebase explorer. You output concise, evidence-backed findings with absolute paths only.
@@ -29,75 +30,41 @@ You are a read-only codebase explorer. You output concise, evidence-backed findi
 
 Find relevant files, symbols, and usage paths quickly for the caller.
 
-##
+## Tools — Use These for Local Code Search
 
-
-
-
-
-
-
-
+| Tool | Use For | Example |
+|------|---------|--------|
+| `grep` | Find text/regex patterns in files | `grep(pattern: "PatchEntry", include: "*.ts")` |
+| `glob` | Find files by name/pattern | `glob(pattern: "src/**/*.ts")` |
+| `lsp` (goToDefinition) | Jump to symbol definition | `lsp(operation: "goToDefinition", filePath: "...", line: N, character: N)` |
+| `lsp` (findReferences) | Find all usages of a symbol | `lsp(operation: "findReferences", ...)` |
+| `lsp` (hover) | Get type info and docs | `lsp(operation: "hover", ...)` |
+| `read` | Read file content | `read(filePath: "src/utils/patch.ts")` |
 
-
-2. Validate symbol flow with LSP (`goToDefinition`, `findReferences`)
-3. Use `grep` for targeted pattern checks
-4. Read only relevant sections
-5. Return findings + next steps
-
-## Thoroughness Levels
-
-| Level | Scope | Use When |
-| --------------- | ----------------------------- | ------------------------------------------ |
-| `quick` | 1-3 files, direct answer | Simple lookups, known symbol names |
-| `medium` | 3-6 files, include call paths | Understanding feature flow |
-| `very thorough` | Dependency map + edge cases | Complex refactor prep, architecture review |
-
-## Output
-
-- **Files**: absolute paths with line refs
-- **Findings**: concise, evidence-backed
-- **Next Steps** (optional): recommended actions for the caller
-
-## Identity
-
-You are a read-only codebase explorer. You output concise, evidence-backed findings with absolute paths only.
-
-## Task
-
-Find relevant files, symbols, and usage paths quickly for the caller.
+**NEVER** use `websearch`, `webfetch`, or `codesearch` — those search the internet, not your project.
 
 ## Rules
 
 - Never modify files — read-only is a hard constraint
 - Return absolute paths in final output
 - Cite `file:line` evidence whenever possible
--
--
+- **Always start with `grep` or `glob`** to locate files and symbols — do NOT read directories to browse
+- Use LSP for precise navigation after finding candidate locations
+- Stop when you can answer with concrete evidence
 
-##
+## Navigation Patterns
 
-
-
-
-
-- **Cite evidence**: Every finding needs file:line reference
+1. **Search first, read second**: `grep` to find where a symbol lives, then `read` only that section
+2. **Don't re-read**: If you already read a file, reference what you learned — don't read it again
+3. **Follow the chain**: definition → usages → callers via LSP findReferences
+4. **Target ≤3 tool calls per symbol**: grep → read section → done
 
 ## Workflow
 
-1.
-2.
-3.
-4.
-5. Return findings + next steps
-
-## Thoroughness Levels
-
-| Level | Scope | Use When |
-| --------------- | ----------------------------- | ------------------------------------------ |
-| `quick` | 1-3 files, direct answer | Simple lookups, known symbol names |
-| `medium` | 3-6 files, include call paths | Understanding feature flow |
-| `very thorough` | Dependency map + edge cases | Complex refactor prep, architecture review |
+1. `grep` or `glob` to discover candidate files
+2. `lsp` goToDefinition/findReferences for precise symbol navigation
+3. `read` only the relevant sections (use offset/limit)
+4. Return findings with file:line evidence
 
 ## Output
 
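The "search first, read second" pattern in the rewritten explore agent maps directly onto ordinary shell tools. A minimal, self-contained sketch — the symbol name and paths here are placeholders, not files from this package:

```bash
# Illustrative only — "PatchEntry" and the src/ tree are placeholders.
mkdir -p src/utils
printf 'export interface PatchEntry { line: number }\n' > src/utils/patch.ts

# 1. Search first: locate the symbol without browsing directories.
grep -rn "PatchEntry" --include='*.ts' src/
# → src/utils/patch.ts:1:export interface PatchEntry { line: number }

# 2. Read second: only the relevant section, not the whole file.
sed -n '1p' src/utils/patch.ts
```

Two tool calls answer the question with `file:line` evidence, within the "≤3 tool calls per symbol" budget the diff sets.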
@@ -75,13 +75,15 @@ ls .beads/artifacts/$ARGUMENTS/
 
 If `plan.md` exists with dependency graph:
 
-1. **
-2. **
-3. **
-
+1. **Load skill:** `skill({ name: "executing-plans" })`
+2. **Parse waves** from dependency graph section
+3. **Execute wave-by-wave:**
+   - Single-task wave → execute directly (no subagent overhead)
+   - Multi-task wave → dispatch parallel `task({ subagent_type: "general" })` subagents, one per task
+4. **Review after each wave** — run verification gates, report, wait for feedback
 5. **Continue** until all waves complete
 
-**Parallel safety:** Only tasks within same wave run in parallel. Tasks in Wave N+1 wait for Wave N.
+**Parallel safety:** Only tasks within same wave run in parallel. Tasks must NOT share files. Tasks in Wave N+1 wait for Wave N.
 
 ### Phase 3A: PRD Task Loop (Sequential Fallback)
 
@@ -1,6 +1,6 @@
 ---
 description: Verify implementation completeness, correctness, and coherence
-argument-hint: "<bead-id> [--quick] [--full] [--fix]"
+argument-hint: "<bead-id> [--quick] [--full] [--fix] [--no-cache]"
 agent: review
 ---
 
@@ -17,12 +17,13 @@ skill({ name: "verification-before-completion" });
 
 ## Parse Arguments
 
-| Argument
-|
-| `<bead-id>`
-| `--quick`
-| `--full`
-| `--fix`
+| Argument | Default | Description |
+| ------------ | -------- | ---------------------------------------------- |
+| `<bead-id>` | required | The bead to verify |
+| `--quick` | false | Gates only, skip coherence check |
+| `--full` | false | Force full verification mode (non-incremental) |
+| `--fix` | false | Auto-fix lint/format issues |
+| `--no-cache` | false | Bypass verification cache, force fresh run |
 
 ## Determine Input Type
 
@@ -39,6 +40,32 @@ skill({ name: "verification-before-completion" });
 - **Run the gates**: Build, test, lint, typecheck are non-negotiable
 - **Use project conventions**: Check `package.json` scripts first
 
+## Phase 0: Check Verification Cache
+
+Before running any gates, check if a recent verification is still valid:
+
+```bash
+# Compute current state fingerprint (commit hash + full diff + untracked files)
+CURRENT_STAMP=$(printf '%s\n%s\n%s' \
+  "$(git rev-parse HEAD)" \
+  "$(git diff HEAD -- '*.ts' '*.tsx' '*.js' '*.jsx')" \
+  "$(git ls-files --others --exclude-standard -- '*.ts' '*.tsx' '*.js' '*.jsx' | xargs cat 2>/dev/null)" \
+  | shasum -a 256 | cut -d' ' -f1)
+LAST_STAMP=$(tail -1 .beads/verify.log 2>/dev/null | awk '{print $1}')
+```
+
+| Condition | Action |
+| ----------------------------------------- | ------------------------------------------------------ |
+| `--no-cache` or `--full` | Skip cache check, run fresh |
+| `CURRENT_STAMP == LAST_STAMP` | Report **cached PASS**, skip to Phase 2 (completeness) |
+| `CURRENT_STAMP != LAST_STAMP` or no cache | Run gates normally |
+
+When cache hits, report:
+
+```text
+Verification: cached PASS (no changes since <timestamp from verify.log>)
+```
+
 ## Phase 1: Gather Context
 
 ```bash
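Given `CURRENT_STAMP` and `LAST_STAMP` computed as in the Phase 0 hunk, the cache decision itself reduces to a string comparison. A minimal sketch with placeholder stamp values (real runs would use the git-derived fingerprints above):

```bash
# Placeholder stamps — in practice these come from the Phase 0 fingerprint
# computation and the last line of .beads/verify.log.
CURRENT_STAMP="abc123"
LAST_STAMP="abc123"
NO_CACHE=false   # true when --no-cache or --full was passed

if [ "$NO_CACHE" = true ] || [ -z "$LAST_STAMP" ] || [ "$CURRENT_STAMP" != "$LAST_STAMP" ]; then
  echo "run gates"
else
  echo "Verification: cached PASS"
fi
```

Because the fingerprint hashes the full diff and untracked sources, any edit — even an uncommitted one — changes `CURRENT_STAMP` and forces a fresh run.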
@@ -66,10 +93,34 @@ Extract all requirements/tasks from the PRD and verify each is implemented:
 
 Follow the [Verification Protocol](../skill/verification-before-completion/references/VERIFICATION_PROTOCOL.md):
 
-
-
-
-
+**Default: incremental mode** (changed files only, parallel gates).
+
+| Mode | When | Behavior |
+| ----------- | ----------------------------------------- | -------------------------------- |
+| Incremental | Default, <20 changed files | Lint changed files, test changed |
+| Full | `--full` flag, >20 changed files, or ship | Lint all, test all |
+
+**Execution order:**
+
+1. **Parallel**: typecheck + lint (simultaneously)
+2. **Sequential** (after parallel passes): test, then build (ship only)
+
+Report results with mode column:
+
+```text
+| Gate | Status | Mode | Time |
+|-----------|--------|-------------|--------|
+| Typecheck | PASS | full | 2.1s |
+| Lint | PASS | incremental | 0.3s |
+| Test | PASS | incremental | 1.2s |
+| Build | SKIP | — | — |
+```
+
+**After all gates pass**, record to verification cache:
+
+```bash
+echo "$CURRENT_STAMP $(date -u +%Y-%m-%dT%H:%M:%SZ) PASS" >> .beads/verify.log
+```
 
 If `--fix` flag provided, run the project's auto-fix command (e.g., `npm run lint:fix`, `ruff check --fix`, `cargo clippy --fix`).
 
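The parallel-then-sequential execution order can be sketched with shell job control. The gate bodies below are placeholders — substitute the project's real scripts (e.g. `npm run typecheck`, `npm run lint`, `npm test` from `package.json`), which this diff assumes exist but does not name:

```bash
# Placeholder gates — replace the bodies with the project's actual commands.
typecheck() { true; }   # e.g. npm run typecheck
lint()      { true; }   # e.g. npm run lint
run_tests() { true; }   # e.g. npm test

# 1. Parallel: typecheck + lint run simultaneously.
typecheck & TC=$!
lint      & LN=$!
wait "$TC" && wait "$LN" || { echo "parallel gates FAILED"; exit 1; }

# 2. Sequential: tests run only after both parallel gates pass.
run_tests && echo "gates PASS"
```

`wait "$PID"` returns each background job's exit status, so a failing typecheck or lint short-circuits before any test time is spent.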
@@ -93,7 +144,7 @@ Output:
 
 1. **Result**: READY TO SHIP / NEEDS WORK / BLOCKED
 2. **Completeness**: score and status
-3. **Correctness**: gate results
+3. **Correctness**: gate results (with mode column)
 4. **Coherence**: contradictions found (if not --quick)
 5. **Blocking issues** to fix before shipping
 6. **Next step**: `/ship $ARGUMENTS` if ready, or list fixes needed
@@ -0,0 +1,162 @@
+---
+purpose: Scoring rubric for evaluating template agent effectiveness
+updated: 2026-03-08
+based-on: tilth research (measurable pattern adoption improvements)
+---
+
+# Agent Effectiveness Benchmark Framework
+
+## Purpose
+
+Evaluate whether skills, tools, and commands in the OpenCodeKit template actually help AI agents perform better. Based on tilth's methodology: they measured accuracy, cost/correct answer, and tool adoption rates to prove what works.
+
+## Scoring Dimensions
+
+7 dimensions, each scored 0–2. Max score: **14**.
+
+### 1. Trigger Clarity (WHEN/SKIP)
+
+Does the description clearly specify when to load AND when NOT to?
+
+| Score | Criteria |
+| ----- | ------------------------------------------- |
+| 0 | Vague or missing trigger conditions |
+| 1 | Has WHEN but not WHEN NOT (or vice versa) |
+| 2 | Clear WHEN and WHEN NOT (SKIP) binary gates |
+
+**Why it matters:** tilth found explicit WHEN/SKIP gates are the single most effective pattern for correct tool routing. Without them, agents either over-load (waste tokens) or under-load (miss relevant skills).
+
+### 2. "Replaces X" Framing
+
+Does it explicitly state what behavior, tool, or workflow it replaces?
+
+| Score | Criteria |
+| ----- | ---------------------------------------------- |
+| 0 | No replacement framing |
+| 1 | Implied replacement or "better than X" |
+| 2 | Explicit "Replaces X" statement in description |
+
+**Why it matters:** tilth measured +36 percentage points adoption on Haiku when tool descriptions included "Replaces X" framing. Models route better when they know what's superseded.
+
+### 3. Concrete Examples
+
+Does it provide working code with actual tool calls, not just prose?
+
+| Score | Criteria |
+| ----- | ----------------------------------------------------------------------- |
+| 0 | No examples |
+| 1 | Prose descriptions or generic prompt templates |
+| 2 | Working code examples with actual tool calls / before-after comparisons |
+
+**Why it matters:** Models follow examples more reliably than instructions. Prompt templates ("Analyze this image: [attach]") score 1, not 2 — they lack tool integration.
+
+### 4. Anti-Patterns
+
+Does it show what NOT to do?
+
+| Score | Criteria |
+| ----- | -------------------------------------------------------------- |
+| 0 | No anti-patterns section |
+| 1 | Brief "don't do X" mentions |
+| 2 | Wrong/right comparison table or detailed anti-patterns section |
+
+**Why it matters:** Failure prevention is as valuable as success instruction. tilth's evidence-based feature removal (disabling `--map` because 62% of losing tasks used it) proves tracking what fails matters.
+
+### 5. Verification Integration
+
+Does it reference or require verification steps?
+
+| Score | Criteria |
+| ----- | -------------------------------------------------------------- |
+| 0 | No mention of verification |
+| 1 | Mentions verification in passing |
+| 2 | Integrates verification steps into workflow (run X, confirm Y) |
+
+**Why it matters:** Skills that don't include verification produce unverified outputs. The build loop is perceive → create → **verify** → ship.
+
+### 6. Token Efficiency
+
+Is the token cost proportional to value delivered?
+
+| Score | Criteria |
+| ----- | ------------------------------------------------------------------------------ |
+| 0 | >2500 tokens with low value density (filler, repetition, obvious instructions) |
+| 1 | Reasonable size OR moderate value density |
+| 2 | <1500 tokens with high value density, OR larger with proportional density |
+
+**Why it matters:** Every loaded skill consumes context budget. A 4000-token skill that could be 1500 tokens is actively harmful — it displaces working memory.
+
+### 7. Cross-References
+
+Does it link to related skills for next steps?
+
+| Score | Criteria |
+| ----- | ---------------------------------------------------------------------------- |
+| 0 | No references to other skills |
+| 1 | Mentions related skills in text |
+| 2 | Clear "Related Skills" table or "Next Phase" with skill loading instructions |
+
+**Why it matters:** Skills that exist in isolation force agents to discover connections. Explicit connections reduce routing failures.
+
+## Score Interpretation
+
+| Range | Tier | Meaning |
+| ----- | ---------- | ----------------------------------------------------------- |
+| 12–14 | Exemplary | Ready to ship — high adoption, measurable value |
+| 8–11 | Adequate | Functional but missing patterns that would improve adoption |
+| 4–7 | Needs Work | Significant gaps — may load but produce suboptimal results |
+| 0–3 | Poor | Should be rewritten or merged into another skill |
+
+## Category Assessment
+
+Beyond individual scoring, evaluate each skill's **category fit**:
+
+| Category | Expected Traits |
+| -------------------- | -------------------------------------------------------------------------------- |
+| Core Workflow | Loaded frequently, high token ROI, tight integration with other core skills |
+| Planning & Lifecycle | Clear phase transitions, handoff points between skills |
+| Debugging & Quality | Real examples from actual debugging sessions, measurable impact |
+| Code Review | Severity levels, actionable findings format |
+| Design & UI | Visual reference integration, component breakdown |
+| Agent Orchestration | Parallelism rules, coordination protocols |
+| External Integration | API examples, auth handling, error patterns |
+| Platform Specific | Version-pinned APIs, migration guidance |
+| Meta Skills | Self-referential consistency (does the skill-about-skills follow its own rules?) |
+
+## Audit Process
+
+1. **Inventory** — List all skills with token size
+2. **Sample** — Read representative skills from each category
+3. **Score** — Apply 7 dimensions to each sampled skill
+4. **Classify** — Assign tier and category
+5. **Identify** — Flag overlaps, dead weight, and upgrade candidates
+6. **Prioritize** — Rank improvements by impact (core skills first)
+
+## Effectiveness Signals (Observable)
+
+Beyond the rubric, track these runtime signals when possible:
+
+| Signal | Indicates |
+| ------------------------------------------ | ------------------------------------------------------------ |
+| Skill loaded but instructions not followed | Trigger too broad OR instructions too vague |
+| Skill never loaded despite relevant tasks | Trigger too narrow OR description doesn't match task framing |
+| Agent re-reads files after skill search | Skill examples insufficient — agent needs more context |
+| Verification skipped after skill workflow | Skill doesn't integrate verification |
+| Agent loads 5+ skills simultaneously | Skills too granular — should be merged |
+
+## Template-Level Metrics
+
+For the overall template (all skills + tools + commands):
+
+| Metric | Target | Current |
+| ----------------------------- | ------ | ------- |
+| Core skills at Exemplary tier | 100% | (audit) |
+| No skills at Poor tier | 0 | (audit) |
+| Average token cost per skill | <1500 | (audit) |
+| Skills with WHEN/SKIP gates | 100% | (audit) |
+| Skills with anti-patterns | >75% | (audit) |
+| Overlap/redundancy pairs | 0 | (audit) |
+
+---
+
+_Apply this framework during effectiveness audits. Update scoring criteria as new evidence emerges._