opencode-goal-mode 0.2.4 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

package/ARCHITECTURE.md +31 -0
package/CHANGELOG.md +18 -0
package/README.md +64 -24
package/benchmarks/build-external-corpus.mjs +177 -0
package/benchmarks/external-corpus.json +3540 -0
package/benchmarks/external.mjs +110 -0
package/benchmarks/run.mjs +78 -24
package/commands/goal.md +16 -1
package/docs/benchmarks/detection-by-family.svg +2 -2
package/docs/benchmarks/external-scorecard.svg +32 -0
package/docs/benchmarks/latency.svg +3 -3
package/docs/benchmarks/overall-scorecard.svg +2 -2
package/docs/benchmarks/results.json +112 -71
package/docs/benchmarks/truthfulness-score.svg +2 -2
package/package.json +3 -1
package/plugins/goal-guard/config.js +9 -0
package/plugins/goal-guard/shell.js +4 -3
package/plugins/goal-guard/sidebar-data.js +71 -0
package/plugins/goal-guard/summary.js +34 -0
package/plugins/goal-guard/tools.js +8 -2
package/plugins/goal-guard.js +13 -0
package/plugins/goal-sidebar.js +141 -0
package/research/benchmarks.md +75 -69

package/ARCHITECTURE.md CHANGED

@@ -15,6 +15,10 @@ configuration directory:
    — a runtime guard that enforces review discipline, blocks destructive shell
    commands, preserves state across compaction and restarts, and exposes
    first-class `goal_*` tools.
+4. **An experimental TUI companion** (`plugins/goal-sidebar.js`) — a separate
+   `{ tui }` plugin module that renders the active goal as a yellow sidebar
+   banner. It is *paired* with the server plugin purely through the on-disk state
+   snapshot (no extra IPC) and no-ops on any runtime without the slot API.
 This document focuses on the plugin, where the engineering lives.
@@ -48,7 +52,9 @@ as plugins. Each module is independently unit-tested.
 | `goal-guard/events.js` | Shared edit/verification/evidence mutators. |
 | `goal-guard/summary.js` | State summaries, status reports, and evidence-map projections. |
 | `goal-guard/system.js` | Live state block injected into the system prompt. |
+| `goal-guard/summary.js` | Status/evidence projections, the short goal label, and the sidebar view. |
 | `goal-guard/tools.js` | The `goal_status` / `goal_evidence_map` / `goal_reviewer_memory` / `goal_contract` / `goal_evidence` / `goal_reset` tools. |
+| `goal-guard/sidebar-data.js` | Pure reader that projects the persisted snapshot into the sidebar banner model. |
 | `goal-guard/logger.js` | Best-effort logging/toasts over the OpenCode client. |
 ## Hooks used
@@ -157,6 +163,25 @@ The `@opencode-ai/plugin` import they need is isolated to `tools.js` and loaded
 via a guarded dynamic import, so if the host cannot resolve it the core guard
 hooks still load.
+## TUI companion (experimental)
+`plugins/goal-sidebar.js` is a TUI plugin module — `export const tui = async (api)
+=> …` — distinct from the server plugin (`@opencode-ai/plugin` types it as a
+`{ tui }` module, mutually exclusive with `{ server }`). It registers a
+`sidebar_content` slot via `api.slots.register({ slots: { sidebar_content } })`
+and renders, in the configured colour (`#FFD700` by default), the short goal
+label plus a `passing/total gates · dirty/ready` line.
+It is *paired* with the server plugin only through the persisted state file:
+`sidebar-data.js` recomputes the same `stateBaseDir`/`projectKey` path the guard
+writes to and projects the active session via `summary.sidebarView`. That keeps
+the pure projection logic Node-testable (`tests/sidebar.test.mjs`) even though the
+JSX renderer itself can only run inside OpenCode's (Bun) TUI runtime. Everything
+in the `tui` entry is wrapped so a missing slot API, missing JSX runtime, or read
+error degrades to rendering nothing — it can never break the TUI. The server plugin
+also emits review-verdict and completion-unlock toasts (`toastOnReview`) so review
+progress is visible even without the banner.
 ## Configuration
 `config.js` merges, in increasing precedence: built-in defaults, environment
@@ -182,8 +207,14 @@ manifest of the file hashes it wrote. On upgrade it distinguishes files it owns
 - `tests/plugin.test.mjs` — hook behavior, gating, verdicts, completion, tools, isolation.
 - `tests/truthfulness-benchmark.test.mjs` — false-completion corpus and truthfulness scoring.
 - `tests/state.test.mjs` — store, seq ordering, eviction, persistence round-trips.
+- `tests/sidebar.test.mjs` — short goal label, sidebar projection, snapshot reader, new destructive bins.
+- `tests/toast.test.mjs` — review-verdict and completion-unlock toasts.
 - `tests/agents.test.mjs` / `tests/commands.test.mjs` — frontmatter and contracts.
 - `tests/install.test.mjs` — recursive copy, manifest upgrades, uninstall.
+The shell guard's headline accuracy is measured on an external, third-party
+corpus (`benchmarks/external.mjs` over `external-corpus.json`), not on the curated
+fixtures — see [research/benchmarks.md](research/benchmarks.md).
 `npm run validate` runs the tests, the structural config validator, the publish
 readiness check, and an `npm pack --dry-run`.
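
For orientation, the `passing/total gates · dirty/ready` status line described in the TUI companion section of this file's changes could come from a pure projection along the following lines. This is a minimal sketch, not the package's `summary.sidebarView`: the snapshot shape (`goal`, `gates`, `dirty`) and the function name `sidebarLine` are hypothetical, since the persisted state format is not shown in this diff.

```javascript
// Hypothetical snapshot shape (an assumption, not the package's real format):
// { goal: string, gates: { [name]: "pass" | "fail" | "pending" }, dirty: boolean }

/** Project a snapshot into a banner model, or null when there is nothing to show. */
function sidebarLine(snapshot, color = "#FFD700") {
  if (!snapshot || typeof snapshot.goal !== "string" || !snapshot.goal.trim()) {
    return null; // degrade to rendering nothing, as the diff describes
  }
  const gates = Object.values(snapshot.gates ?? {});
  const passing = gates.filter((g) => g === "pass").length;
  const state = snapshot.dirty ? "dirty" : "ready";
  return {
    color,
    label: snapshot.goal.trim(),
    status: `${passing}/${gates.length} gates · ${state}`,
  };
}

const banner = sidebarLine({
  goal: "Ship v0.3.0",
  gates: { review: "pass", tests: "pending" },
  dirty: true,
});
console.log(banner.status); // 1/2 gates · dirty
```

Keeping the projection free of JSX and I/O is what would let it run under plain Node in a test, while the renderer stays inside the TUI runtime.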

package/CHANGELOG.md CHANGED

@@ -1,5 +1,23 @@
 # Changelog
+## v0.3.0
+- Honest benchmarks: add an EXTERNAL corpus of 704 real third-party commands from
+  tldr-pages (`benchmarks/external.mjs`, `npm run bench:external`) as the headline
+  detection/false-positive measure (93.3% vs 53.8% legacy; ~0% real false
+  positives). Reframe the curated 71-command set and 9 completion cases as
+  regression *fixtures*, not measured accuracy, and reword the README/charts to
+  stop overclaiming.
+- Stronger guard: block `mkfs.<fstype>` variants, `srm`, and `mkswap`
+  (genuine destructive commands the external corpus exposed as misses).
+- Deeper TUI embedding: toast on each review verdict (PASS/FAIL) and once when the
+  last required gate clears (`toastOnReview`); `goal_status` now surfaces the goal.
+- Experimental TUI sidebar banner (`plugins/goal-sidebar.js`): the active goal in
+  shining yellow with a live gate-status line, paired with the guard via persisted
+  state. No-ops on any runtime without the TUI slot API. New options
+  `sidebarBanner` / `sidebarColor` (`GOAL_GUARD_SIDEBAR_*`).
+- Tighter `/goal` flow that seeds the Goal Contract via the `goal_contract` tool.
 ## v0.2.4
 - Add Reviewer Memory for unresolved/resolved reviewer findings across cycles.

package/README.md CHANGED

@@ -38,37 +38,50 @@ honest caveats, in [research/goal-mode-comparison.md](research/goal-mode-compari
 - **Destructive commands are blocked by a real shell tokenizer**, not a regex.
   Claude Code's own docs call Bash argument-matching *"fragile"*.
-### Benchmarks: shell guard + truthfulness
+### Benchmarks (honest edition)
-The guard replaced a boundary-anchored regex classifier. On a labeled corpus of
-71 real commands (`npm run bench` from a repository checkout, reproducible — see
-[research/benchmarks.md](research/benchmarks.md)):
+The headline number is measured on commands **the analyzer was never fitted to**:
+704 real example commands from [tldr-pages](https://github.com/tldr-pages/tldr)
+(common/linux/osx), authored by hundreds of contributors who have never seen
+this guard. Ground-truth labels come from a deliberately simple, analyzer-*independent*
+rule (see [build-external-corpus.mjs](benchmarks/build-external-corpus.mjs)).
+Reproduce with `npm run bench` or `node benchmarks/external.mjs`.
-![Destructive-command detection rate by family](docs/benchmarks/detection-by-family.svg)
+![Guard accuracy on real third-party commands](docs/benchmarks/external-scorecard.svg)
-![Overall guard accuracy: detection rate vs false-positive rate](docs/benchmarks/overall-scorecard.svg)
-| | Legacy regex guard | Goal Mode analyzer |
+| On 704 real third-party commands | Legacy regex guard | Goal Mode analyzer |
 | --- | --- | --- |
-| Destructive-command detection | **20.8%** | **100%** |
-| False positives on safe commands | **21.7%** | **0%** |
-| Obfuscated bypasses caught (`$(…)`, `bash -c`, `sudo -u`, interpreters) | 0% | 100% |
-| Remote exec (`curl \| sh`) caught | 0% | 100% |
-The deeper analysis costs a few microseconds per command on this machine
-(hundreds of thousands of classifications per second) — negligible for a
-per-tool-call guard:
+| Destructive-command detection | 53.8% | **93.3%** |
+| False positives on safe commands | 0.2% | **0.2%** |
+Honest caveats, because the point of this rewrite was to stop overclaiming:
+- The ~7 remaining "misses" are almost all un-flagged single-target `rm <file>`,
+  which the guard **intentionally permits** (plain `rm` is common and the guard
+  blocks `rm -r`/`rm -f`, `$(rm …)`, `bash -c`, interpreters, etc.). Under a
+  strict every-`rm`-is-destructive labeling those count against it.
+- The single counted false positive (`git filter-repo …`) actually *is* a
+  history-rewriting command, so the real-world false-positive rate is effectively
+  zero. `node benchmarks/external.mjs --json` lists every miss and false positive
+  so you can audit the disagreements yourself.
+Two **curated fixture sets** also ship — and they are explicitly *fixtures*, not
+an unbiased benchmark. They define the patterns the analyzer must catch and guard
+against regressions, so they pass by construction; do not read the 100%/0% there
+as measured accuracy:
+- `benchmarks/corpus.mjs` — 71 destructive patterns (incl. `$(…)`, `bash -c`,
+  `sudo -u`, `/bin/rm`, `git -C … reset --hard`, `curl | sh`, interpreter
+  deletes) and their safe look-alikes (`git checkout -b`, `echo "rm -rf /"`).
+- `benchmarks/completion-corpus.mjs` — 9 completion-claim policy cases (missing
+  review-cycle line, stale review after edit, missing contextual gate, inactive
+  session, custom marker). `npm run bench:truthfulness` prints them.
+The analysis costs ~1µs per command (hundreds of thousands of classifications per
+second) — negligible for a per-tool-call guard:
 ![Per-command analysis latency](docs/benchmarks/latency.svg)
-Goal Mode also ships a **False Completion Dataset** for completion-claim
-truthfulness: `npm run bench` regenerates the scorecard and
-`npm run bench:truthfulness` prints the labeled-case JSON for premature and valid
-completion claims, including missing review-cycle lines, stale reviews after
-edits, missing contextual gates, inactive sessions, and custom completion markers.
-![Benchmark Truthfulness Score](docs/benchmarks/truthfulness-score.svg)
 ## Requirements
 - Node.js 20.11 or newer.
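
As a sanity check on the headline table in the hunk above, the percentages can be reproduced from raw counts. The counts here are assumptions rather than figures quoted by the README: they suppose the corpus was built with the default `--limit 600` for safe commands, which would leave 104 destructive commands out of 704, and they take the "~7" misses and the single counted false positive at face value.

```javascript
// Assumed counts (inferred, not stated verbatim in the README).
const total = 704;
const safe = 600;                 // builder's default --limit for safe commands
const destructive = total - safe; // 104

const misses = 7;         // "~7 remaining misses" per the README caveats
const falsePositives = 1; // the single counted `git filter-repo` case

// Format a ratio as a percentage with one decimal place.
const pct = (n, d) => `${((100 * n) / d).toFixed(1)}%`;

const detection = pct(destructive - misses, destructive);
const fpRate = pct(falsePositives, safe);
console.log(detection, fpRate); // 93.3% 0.2%
```

Under those assumptions the arithmetic agrees with the reported 93.3% detection and 0.2% false-positive rate. The external benchmark's `--json` output remains the authoritative source for the actual counts.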
@@ -98,9 +111,33 @@ edits, missing contextual gates, inactive sessions, and custom completion marker
     `goal_reviewer_memory`, `goal_status`, `goal_reset`.
   - **Live state injection** into the system prompt so the model always knows
     what the guard requires.
+  - **TUI toasts**: a toast on each review verdict (PASS/FAIL) and a single
+    "completion unlocked" toast the moment the last required gate clears.
+- An **experimental** companion TUI plugin (`plugins/goal-sidebar.js`) that shows
+  the active goal as a shining-yellow banner in the sidebar with a compact gate
+  status line. See [TUI integration](#tui-integration).
 - A test suite validating the analyzer, plugin hooks, state store, install
   safety, and config compatibility.
+## TUI integration
+Goal Mode is a **plugin pair**: the server-side `goal-guard` plugin owns
+enforcement and writes its state to disk, and an experimental TUI plugin
+(`plugins/goal-sidebar.js`) reads that same state to render a live banner.
+- **Sidebar goal banner (experimental).** The current goal renders in shining
+  yellow in the sidebar (`sidebar_content` slot), with a `passing/total gates ·
+  dirty/ready` status line, and updates as reviews land. It requires a
+  TUI-plugin-capable OpenCode (one exposing `api.slots.register`); on any older
+  runtime it silently no-ops, so it can never break your TUI. Set
+  `sidebarBanner: false` (or `GOAL_GUARD_SIDEBAR_BANNER=0`) to disable, or
+  `sidebarColor` to recolour it. Because no local environment can run OpenCode's
+  TUI runtime, this banner is shipped best-effort and should be verified in your
+  own TUI.
+- **Toasts.** Review verdicts and completion-unlock events surface as toasts
+  (`toastOnReview`), and blocked destructive commands / premature completions
+  toast as before (`toastOnBlock`).
 ## Install globally
 ```bash
@@ -162,6 +199,9 @@ Or via environment variables (`GOAL_GUARD_*`):
 | `maxSessions` / `GOAL_GUARD_MAX_SESSIONS` | `200` | Session cache size. |
 | `sessionTtlMs` / `GOAL_GUARD_SESSION_TTL_MS` | `86400000` | Idle session TTL. |
 | `toastOnBlock` / `GOAL_GUARD_TOAST_ON_BLOCK` | `true` | Toast when something is blocked. |
+| `toastOnReview` / `GOAL_GUARD_TOAST_ON_REVIEW` | `true` | Toast on each review verdict and when completion unlocks. |
+| `sidebarBanner` / `GOAL_GUARD_SIDEBAR_BANNER` | `true` | Show the experimental yellow goal banner in the TUI sidebar. |
+| `sidebarColor` / `GOAL_GUARD_SIDEBAR_COLOR` | `#FFD700` | Foreground colour of the sidebar goal banner. |
 ## Custom tools
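
The three options added to the configuration table above can also be set through `GOAL_GUARD_*` environment variables. The sketch below shows one way such overrides might be resolved against the documented defaults. It is not the package's `config.js`: the helper names `parseBool` and `resolveSidebarOptions` are hypothetical, and only `GOAL_GUARD_SIDEBAR_BANNER=0` is a value the README actually documents.

```javascript
// Defaults as listed in the README table.
const DEFAULTS = { toastOnReview: true, sidebarBanner: true, sidebarColor: "#FFD700" };

// Hypothetical helper: "0" is the documented way to disable; treating
// "false", "no" and "off" the same way is an assumption of this sketch.
function parseBool(value, fallback) {
  if (value === undefined || value === "") return fallback;
  return !["0", "false", "no", "off"].includes(String(value).trim().toLowerCase());
}

// Resolve the three new options from an environment-like object.
function resolveSidebarOptions(env = {}) {
  return {
    toastOnReview: parseBool(env.GOAL_GUARD_TOAST_ON_REVIEW, DEFAULTS.toastOnReview),
    sidebarBanner: parseBool(env.GOAL_GUARD_SIDEBAR_BANNER, DEFAULTS.sidebarBanner),
    sidebarColor: env.GOAL_GUARD_SIDEBAR_COLOR || DEFAULTS.sidebarColor,
  };
}

const opts = resolveSidebarOptions({ GOAL_GUARD_SIDEBAR_BANNER: "0" });
console.log(opts);
```

Passing `process.env` instead of the literal object would apply the same resolution to the real environment.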

package/benchmarks/build-external-corpus.mjs ADDED

@@ -0,0 +1,177 @@
+#!/usr/bin/env node
+/**
+ * Build an EXTERNAL, third-party-authored shell-command corpus for the guard
+ * benchmark, so the reported detection / false-positive numbers measure
+ * real-world behavior instead of a self-authored set the analyzer was tuned on.
+ *
+ * Source: the tldr-pages project (https://github.com/tldr-pages/tldr, CC-BY).
+ * Every example command in the English `common`, `linux`, and `osx` pages is a
+ * real invocation documented by hundreds of contributors who have never seen
+ * this analyzer — so the analyzer cannot have been fitted to them.
+ *
+ * Ground-truth labels come from `labelDestructive()` below: a deliberately
+ * SIMPLE, transparent rule based on the primary utility and a fixed list of
+ * irreversible operations. It is intentionally independent of the analyzer's
+ * own classification logic. It is not perfect (no automatic labeler is) — the
+ * benchmark reports raw agreement and discloses the labeler so disagreements
+ * are auditable rather than hidden.
+ *
+ * Usage:
+ *   node benchmarks/build-external-corpus.mjs --tldr /path/to/tldr [--limit 600]
+ *   TLDR_DIR=/path/to/tldr node benchmarks/build-external-corpus.mjs
+ *
+ * Writes benchmarks/external-corpus.json (committed, so `npm run bench` is
+ * reproducible without a tldr checkout). Re-run this to regenerate it.
+ */
+import { readFileSync, readdirSync, writeFileSync, existsSync } from "node:fs";
+import { join, dirname } from "node:path";
+import { fileURLToPath } from "node:url";
+import { parseArgs } from "node:util";
+const { values } = parseArgs({
+  options: {
+    tldr: { type: "string" },
+    limit: { type: "string", default: "600" },
+  },
+});
+const here = dirname(fileURLToPath(import.meta.url));
+const tldrDir = values.tldr || process.env.TLDR_DIR;
+const safeLimit = Math.max(50, Number.parseInt(values.limit, 10) || 600);
+if (!tldrDir || !existsSync(tldrDir)) {
+  console.error(
+    "Need a tldr-pages checkout. Pass --tldr <dir> or set TLDR_DIR.\n" +
+      "  git clone --depth 1 https://github.com/tldr-pages/tldr.git",
+  );
+  process.exit(1);
+}
+/** Pinned provenance for reproducibility — resolves a symbolic HEAD to its SHA. */
+function tldrCommit() {
+  try {
+    const head = readFileSync(join(tldrDir, ".git", "HEAD"), "utf8").trim();
+    const ref = head.match(/^ref:\s*(.+)$/);
+    if (!ref) return head;
+    return readFileSync(join(tldrDir, ".git", ref[1]), "utf8").trim();
+  } catch {
+    return "unknown";
+  }
+}
+/**
+ * Turn a tldr example line into a real, literal shell command:
+ *  - `{{placeholder}}` → its inner text (a realistic argument).
+ *  - `[-f|--force]` / `[-r|--recursive]` alternative-flag notation → the first
+ *    form (`-f`, `-r`), so the result is a command a shell would actually accept
+ *    rather than tldr documentation syntax.
+ */
+function fillPlaceholders(cmd) {
+  return cmd
+    .replace(/\{\{(.*?)\}\}/g, (_, inner) => String(inner).trim() || "arg")
+    .replace(/\[([^\]|]+)\|[^\]]+\]/g, (_, first) => String(first).trim());
+}
+/** Independent, transparent destructive-intent labeler (NOT the analyzer). */
+function labelDestructive(cmd) {
+  const c = cmd.trim();
+  // Remote code execution: fetch piped into a shell.
+  if (/\b(curl|wget|fetch)\b[^|]*\|\s*(sudo\s+)?(sh|bash|zsh|dash|ksh)\b/.test(c)) return true;
+  // Strip a leading wrapper so `sudo rm` / `time rm` resolve to their target.
+  const stripped = c.replace(/^(sudo|time|nice|ionice|nohup|env)\s+(-\S+\s+)*/, "");
+  const m = stripped.match(/^(\/[^\s]*\/)?([a-zA-Z0-9_.-]+)\b(.*)$/);
+  if (!m) return false;
+  const bin = m[2];
+  const rest = m[3] || "";
+  const DESTRUCTIVE_BINS = new Set([
+    "rm", "rmdir", "shred", "srm", "dd", "mkfs", "fdisk", "parted",
+    "wipefs", "mkswap", "blkdiscard", "sgdisk", "unlink",
+  ]);
+  if (/^mkfs\./.test(bin)) return true;
+  if (DESTRUCTIVE_BINS.has(bin)) {
+    if (bin === "dd") return /\bof=\/dev\//.test(rest);
+    if (bin === "rmdir") return false; // only removes empty dirs
+    return true;
+  }
+  if (bin === "git") {
+    if (/\breset\s+--hard\b/.test(rest)) return true;
+    if (/\bclean\b.*\s-\S*f/.test(rest)) return true;
+    if (/\bpush\b.*(--force\b|\s-f\b)/.test(rest)) return true;
+    if (/\bbranch\b.*\s-D\b/.test(rest)) return true;
+    if (/\breflog\s+expire\b/.test(rest)) return true;
+    if (/\bgc\b.*--prune/.test(rest)) return true;
+    if (/\bfilter-branch\b/.test(rest)) return true;
+  }
+  return false;
+}
+const dirs = ["common", "linux", "osx"]
+  .map((d) => join(tldrDir, "pages", d))
+  .filter((d) => existsSync(d));
+const seen = new Set();
+const destructive = [];
+const safe = [];
+for (const dir of dirs) {
+  const family = dir.split("/").slice(-1)[0];
+  for (const file of readdirSync(dir)) {
+    if (!file.endsWith(".md")) continue;
+    const page = file.replace(/\.md$/, "");
+    const text = readFileSync(join(dir, file), "utf8");
+    for (const line of text.split("\n")) {
+      const trimmed = line.trim();
+      // tldr example commands are fenced in single backticks on their own line.
+      if (!trimmed.startsWith("`") || !trimmed.endsWith("`") || trimmed.length < 4) continue;
+      const raw = fillPlaceholders(trimmed.slice(1, -1)).trim();
+      if (!raw || raw.length > 240) continue;
+      if (!/^[a-zA-Z/.~$]/.test(raw)) continue; // must start like a command
+      if (seen.has(raw)) continue;
+      seen.add(raw);
+      const entry = { cmd: raw, page, family };
+      if (labelDestructive(raw)) destructive.push(entry);
+      else safe.push(entry);
+    }
+  }
+}
+/** Deterministic evenly-spaced stride sample (no RNG, so the build is stable). */
+function stride(list, target) {
+  if (list.length <= target) return list.slice();
+  const step = list.length / target;
+  const out = [];
+  for (let i = 0; i < target; i += 1) out.push(list[Math.floor(i * step)]);
+  return out;
+}
+// Enrich ALL destructive examples (they are rare in real docs) and stride-sample
+// safe ones up to the limit. This is disclosed in the report so the imbalance is
+// not mistaken for the natural base rate.
+destructive.sort((a, b) => a.cmd.localeCompare(b.cmd));
+safe.sort((a, b) => a.cmd.localeCompare(b.cmd));
+const sampledSafe = stride(safe, safeLimit);
+const corpus = {
+  source: "tldr-pages",
+  url: "https://github.com/tldr-pages/tldr",
+  license: "CC-BY-4.0",
+  commit: tldrCommit(),
+  pages: dirs.map((d) => d.split("/").slice(-2).join("/")),
+  labeler: "benchmarks/build-external-corpus.mjs labelDestructive() — independent of the analyzer",
+  totals: {
+    uniqueCommandsScanned: seen.size,
+    destructiveFound: destructive.length,
+    safeFound: safe.length,
+    safeSampled: sampledSafe.length,
+  },
+  entries: [...destructive, ...sampledSafe],
+};
+const outPath = join(here, "external-corpus.json");
+writeFileSync(outPath, JSON.stringify(corpus, null, 2));
+console.log(
+  `Wrote ${corpus.entries.length} external commands ` +
+    `(${destructive.length} destructive + ${sampledSafe.length}/${safe.length} safe sampled) ` +
+    `from ${seen.size} unique tldr examples @ ${corpus.commit.slice(0, 12)} → ${outPath}`,
+);
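
To see how the builder's three pure helpers behave, they can be exercised in isolation. The function bodies below are copied from the file above with comments trimmed. The sample inputs are illustrative only and are not taken from `external-corpus.json`.

```javascript
function fillPlaceholders(cmd) {
  return cmd
    .replace(/\{\{(.*?)\}\}/g, (_, inner) => String(inner).trim() || "arg")
    .replace(/\[([^\]|]+)\|[^\]]+\]/g, (_, first) => String(first).trim());
}

function labelDestructive(cmd) {
  const c = cmd.trim();
  if (/\b(curl|wget|fetch)\b[^|]*\|\s*(sudo\s+)?(sh|bash|zsh|dash|ksh)\b/.test(c)) return true;
  const stripped = c.replace(/^(sudo|time|nice|ionice|nohup|env)\s+(-\S+\s+)*/, "");
  const m = stripped.match(/^(\/[^\s]*\/)?([a-zA-Z0-9_.-]+)\b(.*)$/);
  if (!m) return false;
  const bin = m[2];
  const rest = m[3] || "";
  const DESTRUCTIVE_BINS = new Set([
    "rm", "rmdir", "shred", "srm", "dd", "mkfs", "fdisk", "parted",
    "wipefs", "mkswap", "blkdiscard", "sgdisk", "unlink",
  ]);
  if (/^mkfs\./.test(bin)) return true;
  if (DESTRUCTIVE_BINS.has(bin)) {
    if (bin === "dd") return /\bof=\/dev\//.test(rest);
    if (bin === "rmdir") return false;
    return true;
  }
  if (bin === "git") {
    if (/\breset\s+--hard\b/.test(rest)) return true;
    if (/\bclean\b.*\s-\S*f/.test(rest)) return true;
    if (/\bpush\b.*(--force\b|\s-f\b)/.test(rest)) return true;
    if (/\bbranch\b.*\s-D\b/.test(rest)) return true;
    if (/\breflog\s+expire\b/.test(rest)) return true;
    if (/\bgc\b.*--prune/.test(rest)) return true;
    if (/\bfilter-branch\b/.test(rest)) return true;
  }
  return false;
}

function stride(list, target) {
  if (list.length <= target) return list.slice();
  const step = list.length / target;
  const out = [];
  for (let i = 0; i < target; i += 1) out.push(list[Math.floor(i * step)]);
  return out;
}

// Placeholder and alternative-flag notation collapse to a literal command.
const filled = fillPlaceholders("rm {{[-f|--force]}} {{path/to/file}}");
console.log(filled); // rm -f path/to/file

// Illustrative inputs paired with the label the rule assigns.
const labels = Object.fromEntries(
  [
    "sudo rm -rf /tmp/build",
    "mkfs.ext4 /dev/sdb1",
    "dd if=disk.img of=backup.img",
    "rmdir empty_dir",
    "git filter-repo --path src",
    "curl https://example.com/install.sh | sh",
  ].map((cmd) => [cmd, labelDestructive(cmd)]),
);
console.log(labels);

// Evenly spaced, deterministic sample of 4 items out of 10.
const sample = stride([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 4);
console.log(sample); // [ 0, 2, 5, 7 ]
```

Two results are worth noting. `dd` writing to a regular file is labelled safe because only `of=/dev/…` targets count. `git filter-repo` is also labelled safe, since the rule lists only `filter-branch`, which is consistent with the README counting it as the analyzer's single false positive.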