npm - runcap - Versions diffs - 0.3.1 → 0.5.0 - Mend

runcap 0.3.1 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15) hide show

package/README.md +211 -9
package/bin/runcap.mjs +153 -0
package/examples/outcome-demo/agent-fixes.mjs +24 -0
package/examples/outcome-demo/agent-spins.mjs +20 -0
package/examples/outcome-demo/broken.mjs +5 -0
package/examples/outcome-demo/verify.mjs +7 -0
package/package.json +11 -2
package/scripts/guard-test.mjs +76 -0
package/scripts/make-demo-svg.mjs +20 -20
package/scripts/mission-test.mjs +148 -0
package/scripts/outcome-test.mjs +48 -0
package/scripts/policy-test.mjs +121 -0
package/scripts/render-media-screenshots.mjs +37 -0
package/src/mission-control.mjs +441 -1
package/src/policy.mjs +208 -0

package/README.md CHANGED Viewed

@@ -2,30 +2,52 @@
 [![CI](https://github.com/kirder24-code/ai-agent-manager/actions/workflows/ci.yml/badge.svg)](https://github.com/kirder24-code/ai-agent-manager/actions/workflows/ci.yml)
-![Runcap terminal demo: estimate, cap, compress, stop](docs/assets/demo.svg)
+![Runcap terminal demo: estimate, cap, verify integrity, mission PASS - then a tampered run graded BLOCKED on the PR](docs/assets/demo.svg)
-**Your AI coding agent re-reads the same files over and over and quietly burns your money. Runcap estimates the bill before you build, hard-caps the spend so it physically stops at your ceiling, and losslessly compresses every call. Free, MIT, 100% local. Your code and tokens never touch a server.**
+**Runcap stops AI-agent spend before it runs - and shows whether that spend produced a verified result. Free, MIT, 100% local. Your code and tokens never touch a server.**
-On a real OpenAI call, one edited-file re-read dropped from **1,186 to 737 prompt tokens (37.9% saved)** with the model still answering correctly about the changed line. No other proxy does this:
+> **An agent passing CI is not enough.**
+> Runcap verifies whether the evidence of success was altered during the mission.
+| Status | Meaning |
+|---|---|
+| `VERIFIED_STRONG` | Result passed an unchanged verifier and a clean-worktree replay. |
+| `VERIFIED_WEAK` | Result passed, but some integrity evidence is missing. |
+| `UNVERIFIED` | Verification did not pass. |
+| `VERIFIER_COMPROMISED` | The agent changed protected verification evidence. |
+```text
+Estimate the run  →  Cap the spend  →  Verify the outcome
+```
+Every other tool measures **tokens**. You don't buy tokens - you buy a result that passes a check. So Runcap measures the only number that matters:
+> **Verified Outcome Cost = total run cost / tasks that passed verification.** An agent that talks but never fixes the bug can cost *more* than one that does - and a token dashboard calls it "cheaper."
 | | Without Runcap | With Runcap |
 |---|---|---|
-| Re-read of an edited file | 1,186 prompt tokens | **737 prompt tokens** |
 | You find out the cost | when the invoice arrives | **before you press go, capped at your ceiling** |
 | When the agent gets stuck | it keeps spending | **run stops, you get the exact rescue prompt** |
+| What you paid for | tokens (delivered or not) | **dollars per *verified* result** |
+In a 6-run test on the same task, the run that **delivered nothing** cost *more* than the one that delivered, and the cheapest verified result was ~43x cheaper than the most expensive - same passing test. ([full table below](#real-results-6-runs-same-task-reproducible-offline))
-> Every other tool here is a rear-view mirror - it shows you the bill *after* you paid it. Runcap estimates the bill *before* you start and caps it. It is a circuit breaker, not a dashboard.
+> Every other tool here is a rear-view mirror - it shows you the bill *after* you paid it. Runcap estimates the bill *before* you start, caps it, and tells you if the spend actually delivered. It is a circuit breaker with a receipt, not a dashboard.
+> If Runcap caps a run for you or compresses a call, please **star the repo** - it is the one signal that tells me to keep building it in the open.
 ## Why
 **Agents loop on the same error, rewrite plans, and re-read files they just edited - every loop is tokens you pay for.** Multi-agent coding runs burn roughly **15x more tokens** than a single chat ([Anthropic engineering](https://www.anthropic.com/engineering/built-multi-agent-research-system)). They hand you a confident summary while the task is not actually done, and you find out what it cost when the invoice - or the subscription limit - arrives.
-Observability tools (Langfuse, Helicone, LangSmith, AgentOps) measure the past. Gateways (LiteLLM, Portkey, OpenRouter) route the present. None of them stop the spend *before* it happens. Runcap does the one thing the rear-view mirror can't:
+Observability tools (Langfuse, Helicone, LangSmith, AgentOps) measure the past. Gateways (LiteLLM, Portkey, OpenRouter) route the present. None of them stop the spend *before* it happens, and none tell you whether the spend produced a verified result. Runcap does the things the rear-view mirror can't:
 ```text
-estimate before build  →  cap during run  →  compress every call  →  rescue when stuck
+estimate before build  →  cap during run  →  compress every call  →  rescue when stuck  →  verify the outcome
 ```
+It also quietly trims waste on the way through: on a real OpenAI call, one edited-file re-read dropped from **1,186 to 737 prompt tokens (37.9% saved)** with the model still answering correctly about the changed line - lossless, and no other proxy does it ([details below](#token-compression-built-in-no-extra-deps)).
 ## The honest claim
 Runcap does **not** promise an exact cost oracle. Agent trajectories are stochastic - nobody, including the model labs, can predict the exact token count of a run. So Runcap gives you a **range plus a hard cap**:
@@ -48,6 +70,8 @@ If you have those, Runcap caps your spend in one command. If you are looking for
 No API key required.
+![Runcap terminal proof: estimate, cap spend, compress tokens, verify integrity, and block a compromised mission](docs/assets/media/demo.png)
 ```bash
 git clone https://github.com/kirder24-code/ai-agent-manager.git
 cd ai-agent-manager
@@ -106,6 +130,7 @@ Or run from source with `node ./bin/runcap.mjs <command>`.
 runcap plan --fuel 24 -- "build a small auth feature and verify it"   # range + recommended cap, before you spend
 runcap preflight -- claude "build a full SaaS app"                     # is this prompt too broad?
 runcap run --label fix -- claude "fix one failing check. stop if blocked."  # wrap any agent/command
+runcap outcome run --task "..." --verify "pnpm test" -- claude "fix it"     # cost of a VERIFIED result, not tokens
 runcap report                                                          # human-readable rescue report
 runcap export                                                          # evidence JSON with truth labels
 runcap dashboard                                                       # local cockpit at :8791
@@ -125,6 +150,10 @@ OPENAI_API_KEY=sk-... AIM_DAILY_BUDGET_USD=5 runcap gateway
 # Anthropic-native (Claude Code, /v1/messages)
 ANTHROPIC_API_KEY=sk-ant-... AIM_DAILY_BUDGET_USD=5 runcap gateway
 #   then: ANTHROPIC_BASE_URL=http://127.0.0.1:8792/v1
+# DeepSeek (OpenAI-compatible, much cheaper - same one command)
+OPENAI_API_KEY=sk-... AIM_UPSTREAM_BASE_URL=https://api.deepseek.com AIM_DAILY_BUDGET_USD=5 runcap gateway
+#   then point your agent at: OPENAI_BASE_URL=http://127.0.0.1:8792/v1  (model: deepseek-chat)
 ```
 When spend crosses the ceiling, the next call returns `429 budget_guard` instead of money leaving your account. Try it with no key: `runcap gateway --mock`.
@@ -137,7 +166,7 @@ Every request that passes through the gateway is compressed before it's forwarde
 2. **Identical-block dedup** - when the exact same file dump or tool_result ships again in the same request, the repeat is replaced with a deterministic stub.
 3. **Delta-encoding of near-duplicates** - the layer no other proxy has. When the agent reads a file, edits one line, and re-reads it, the block is *similar but not identical*, so plain dedup saves nothing. Runcap sends a readable line-diff against the version the model already saw, and the model reconstructs the current file from it. On a real OpenAI call, an edited-file re-read dropped from **1186 to 737 prompt tokens - 37.9% saved, with the model still answering correctly about the changed line.** Proof and reproduction steps: [docs/delta-encoding-evidence.md](https://github.com/kirder24-code/ai-agent-manager/blob/main/docs/delta-encoding-evidence.md).
-It's pure Node with **zero ML or native dependencies**, so it installs everywhere without the build pain heavier compressors have.
+It's pure Node with **zero native or ML dependencies** (the only runtime dependency is `js-yaml`, pure JS), so it installs everywhere without the build pain heavier compressors have.
 The dashboard shows the result as one number: **"You saved $X · N tokens compressed · would have spent $Y."** Disable it with `AIM_COMPRESS=off` if you ever want raw passthrough.
@@ -147,9 +176,178 @@ The hard case in stuck-detection is the agent that keeps producing output but is
 This is a **calculated** signal, not a proven dollar-saving: it tells you *"the agent has sent 3 near-identical prompts in a row with no progress"* so you can step in before the loop burns more budget. Tune or disable it with `AIM_LOOP_DETECT=off`. (Today's [`detectStuck`](src/mission-control.mjs) post-run score is outcome-based: exit code, parsed errors, and zero-diff. The loop signal adds the missing in-flight behavioral signal on top of it.)
+## Verified Outcome Cost (`runcap outcome`)
+Tokens are the wrong unit. You don't buy tokens, you buy a **result that passes a check**. So Runcap measures the only number that tracks what you actually paid for:
+> **Verified Outcome Cost = total run cost / tasks that passed verification.**
+Wrap your agent and hand it a verification command. Its exit code is the oracle:
+```bash
+runcap outcome run \
+  --task "Fix the failing test" \
+  --verify "pnpm test && pnpm build" \
+  -- claude "fix one failing check, then stop"
+runcap outcome show          # re-print the latest receipt
+```
+The run produces an **outcome receipt** where every field carries a truth label:
+```
+Outcome:               VERIFIED
+Actual cost:           $0.000677  (2 priced LLM calls, calculated_from_provider_usage_and_sourced_price_table)
+Verified Outcome Cost: $0.000677  (money that bought a verified result)
+```
+If verification fails, the receipt is honest about it instead of pretending the spend delivered something:
+```
+Outcome:               UNVERIFIED
+Verified Outcome Cost: N/A  (verification did not pass)
+Money spent without verified delivery: $0.001012
+```
+That second case is the whole point: an agent that talks but never fixes the bug can cost **more** than one that does, while a token dashboard calls it "cheaper." Try both, offline, no API key:
+```bash
+runcap outcome run --task "Fix sum() so it adds" --verify "node examples/outcome-demo/verify.mjs" --mock -- node examples/outcome-demo/agent-fixes.mjs   # VERIFIED
+runcap outcome run --task "Fix sum() so it adds" --verify "node examples/outcome-demo/verify.mjs" --mock -- node examples/outcome-demo/agent-spins.mjs   # UNVERIFIED
+```
+Receipts are written to `.runcap/outcomes/<id>/receipt.json`. Run the same task across several agents and you get the **Agent Economics Index** - the same board, priced by verified delivery instead of tokens.
+### Real results (6 runs, same task, reproducible offline)
+Same task - *fix a broken `sum()` so the test passes* - across different models and two agent behaviors. Every number below is measured from the gateway ledger, not estimated:
+| # | Model | Agent | LLM calls | Actual cost | Outcome | Verified Outcome Cost | Money, nothing delivered |
+|---|---|---|---|---|---|---|---|
+| 1 | gpt-4o | writes fix | 2 | $0.000677 | VERIFIED | **$0.000677** | $0 |
+| 2 | gpt-4o | spins, no fix | 3 | $0.001012 | UNVERIFIED | **N/A** | **$0.001012** |
+| 3 | gpt-5.4 | writes fix | 2 | $0.000957 | VERIFIED | **$0.000957** | $0 |
+| 4 | deepseek-chat | writes fix | 2 | $0.000022 | VERIFIED | **$0.000022** | $0 |
+| 5 | deepseek-chat | spins, no fix | 3 | $0.000033 | UNVERIFIED | **N/A** | **$0.000033** |
+| 6 | claude-sonnet-4 | writes fix | 2 | $0.000981 | VERIFIED | **$0.000981** | $0 |
+Two facts a token dashboard can't show you: on gpt-4o the run that **delivered nothing** (row 2) cost *more* than the run that delivered (row 1); and the cheapest verified result (deepseek-chat, $0.000022) bought the same passing test as gpt-5.4 at ~43x the price. Reproduce any row with `OUTCOME_DEMO_MODEL=<model>` in front of the command above. (One run per row in v0.1 - the point is the unit, not a vendor ranking; a ranking needs N>=5 runs/agent and a pass-rate column.)
+## Verification Integrity (`runcap outcome --guard`)
+A green test only means something if the agent passed it *fairly*. An agent under pressure can turn a check green without doing the work: delete the failing test, rewrite the assertion, repoint the `npm test` script at `true`, disable TypeScript strict mode, mock the real API, or hardcode the expected answer. Exit code 0 - and a plain Verified Outcome Cost calls it delivered. So does every token dashboard.
+`--guard` verifies the verifier. Before the agent runs, Runcap freezes a **Task Contract**: the baseline git commit, a SHA-256 of every file the verify command names, a snapshot of `package.json` scripts, and a check that the task *actually fails today* (a pass on an already-green tree proves nothing). After the run it re-checks all of it, and re-runs the verify command from the baseline commit in a throwaway git worktree with only the agent's changed files copied in - so a green that depended on uncommitted local junk dies in the clean room.
+The result is a trust grade, not a binary:
+| Status | Meaning |
+|---|---|
+| `VERIFIED_STRONG` | Passed, verifier untouched, the task really failed before, and the pass survives a clean checkout. |
+| `VERIFIED_WEAK` | Passed and verifier untouched, but a strong condition was missed (e.g. baseline failure not reproduced). |
+| `UNVERIFIED` | Verification did not pass. |
+| `VERIFIER_COMPROMISED` | Passed, **but the verifier itself was modified during the run.** The green light cannot be trusted. |
+```bash
+runcap outcome run --guard \
+  --task "Fix the failing test" \
+  --verify "pnpm test" \
+  --allow src/ \
+  -- claude "fix one failing check, then stop"
+```
+`--protect <path>` marks extra paths the agent must not touch (tests, config, and `package.json` are protected by default); `--allow <path>` declares the only paths a legitimate fix should change, so out-of-scope edits drop the grade. The receipt gains a `verificationIntegrity` block listing every check, every truth label, and exactly which file was tampered with if the grade is `VERIFIER_COMPROMISED`.
+> **One honesty note that rides on every receipt:** Verified Outcome Cost is the **LLM spend that bought the result** - it does *not* include subscriptions, CI minutes, sandbox compute, or human review time. For real agent economics you want **Expected Verified Outcome Cost = total spend across N attempts / strongly-verified outcomes**, which needs N>=5 runs. That's the v0.2 unit; v0.1 measures the one cost the gateway can observe honestly.
+## Mission Policy & CI enforcement (`runcap mission` / `policy` / `ci`)
+Everything above lives in one developer's terminal. A platform, VP-Eng, or FinOps owner can't act on it: there's no way to declare the rules of a mission *once*, in the repo, and no way to fail a pull request when an agent breaches budget or tampers with the evidence of its own success.
+A **mission policy** closes that gap. You declare the rules once in `.runcap/mission.yaml`, enforce them during the run, and grade the result into a **PASS / BLOCKED** verdict a GitHub Action turns into a red/green check on the PR.
+```yaml
+# .runcap/mission.yaml
+version: v1
+identity:
+  project: checkout
+  team: payments
+mission:
+  name: Fix the failing checkout test
+  task_class: bugfix
+budget:
+  mission_hard_limit_usd: 10      # required - per-mission hard cap (the gateway enforces it live)
+  max_llm_calls: 12               # optional - BLOCK if exceeded
+  max_runtime_minutes: 30         # optional - BLOCK if exceeded
+verification:
+  command: "pnpm test && pnpm build"   # required - the oracle (exit 0 = delivered)
+  guard: strict                        # strict (default) freezes + re-checks the verifier
+  protect: ["tests/**", "package.json"]  # paths the agent must not touch
+  allow:   ["src/checkout/**"]           # the only paths a legit fix should change
+```
+Validate it, then run the agent under it:
+```bash
+runcap policy validate                    # is .runcap/mission.yaml well-formed?
+runcap mission run -- claude "fix the failing checkout test, then stop"
+```
+`mission run` enforces the per-mission hard cap through the gateway, always runs the verification guard, and writes an outcome receipt that now carries a **policy block** - the org attribution, the limits, the **SHA-256 of the exact policy text that graded the run**, and the verdict. It exits `1` on `BLOCKED`, so it fails CI on its own. The mission is **BLOCKED** when any of these is true:
+- the verifier was tampered with (`VERIFIER_COMPROMISED`);
+- verification did not pass (`UNVERIFIED`);
+- a change landed outside the declared `allow` scope;
+- spend exceeded `mission_hard_limit_usd`, or the gateway's budget guard tripped mid-run;
+- `max_llm_calls` or `max_runtime_minutes` was exceeded.
+### The GitHub Action (the red/green PR check)
+```yaml
+# .github/workflows/runcap.yml
+jobs:
+  runcap-mission:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with: { node-version: 20 }
+      # 1. Run the agent under the policy (produces a graded receipt + exits 1 on BLOCKED)
+      - run: npx runcap mission run -- <your agent command>
+      # 2. Grade the latest receipt against the committed policy and annotate the PR
+      - uses: kirder24-code/ai-agent-manager@v1
+        with:
+          policy: .runcap/mission.yaml
+```
+`runcap ci` (what the Action runs) re-grades a receipt **against the committed policy text**, not whatever was stamped at run time, and appends the verdict to `$GITHUB_STEP_SUMMARY` as the PR annotation. Already have a receipt from an earlier job? Grade it directly:
+```bash
+runcap ci --policy .runcap/mission.yaml --receipt .runcap/outcomes/<id>/receipt.json
+```
+A reviewer sees one of two things:
+```
+Mission verdict: PASS
+  project checkout / team payments
+  Mission cost $0.0007 / $10.00
+  Policy hash: c857d10c…
+```
+```
+Mission verdict: BLOCKED
+  Blocked because:
+    - VERIFIER_COMPROMISED: the agent changed protected verification evidence (verifier_file_unchanged:app/verify.mjs).
+```
+Because the verdict is recomputed from the committed policy and the receipt records the policy hash, a reviewer can confirm *which rules graded the run* - the verdict is only as trustworthy as the policy hash it carries.
 ## Pricing table
-Costs are calculated from a sourced multi-provider table - Anthropic (Opus / Sonnet / Haiku) and OpenAI (GPT-5 family + legacy GPT-4), with cache-read and batch discounts handled - labeled with source and verification date. When a model is unknown, Runcap says `unknown_price` rather than guessing.
+Costs are calculated from a sourced multi-provider table - Anthropic (Opus / Sonnet / Haiku), OpenAI (GPT-5 family + legacy GPT-4), and DeepSeek (V4 Flash / V4 Pro) - with cache-read and batch discounts handled, labeled with source and verification date. When a model is unknown, Runcap says `unknown_price` rather than guessing.
+DeepSeek matters because its API is OpenAI-compatible: point the gateway at `https://api.deepseek.com` with your DeepSeek key and Runcap prices, caps, and compresses it with zero extra setup - the same one command as OpenAI. At roughly $0.14 / $0.28 per million input/output tokens it is far cheaper than the US frontier models, so the people running the biggest agent loops on it are exactly the ones a hard cap protects.
 ## Trust model
@@ -187,6 +385,10 @@ A working local tool, not a hosted SaaS. Ready for: wrapping real Codex / Claude
 - [Integrations](docs/integrations.md)
 - [Trust model](docs/trust-model.md)
+## Built by
+Runcap is built and maintained by Kirill D., a solo AI and automation consultant based in Calgary, Canada. He helps solo SaaS founders and service businesses ship AI features that hold up in production - cost control, vibe-code audits, and reliable automation. More at [launchsoloai.com](https://launchsoloai.com).
 ---
 The thesis: **AI agents need managers.**

package/bin/runcap.mjs CHANGED Viewed

@@ -11,7 +11,10 @@ import {
   preflightMission,
   recordFuel,
   renderReport,
+  renderOutcome,
+  latestOutcomeId,
   runMission,
+  runOutcome,
   setupProject,
   startDashboard,
   startGateway,
@@ -31,6 +34,14 @@ import {
   planToRun
 } from "../src/cloud.mjs";
 import { alertsCommand } from "../src/alerts.mjs";
+import {
+  loadPolicy,
+  validatePolicy,
+  evaluatePolicyVerdict,
+  policyMeta,
+  formatPolicyBlock
+} from "../src/policy.mjs";
+import { readFileSync, appendFileSync } from "node:fs";
 const args = process.argv.slice(2);
 const command = args[0] ?? "welcome";
@@ -42,6 +53,14 @@ Usage:
   runcap run [--label name] [--cap|--no-cap] [--mock] -- <command...>
                                  (auto-enforces your cap; no manual gateway/base-URL setup)
   runcap plan [--fuel 24] [--quality high|balanced|cheap] [--apply-cap] -- <goal...>
+  runcap outcome run --task "..." --verify "<cmd>" [--label name] [--mock] -- <agent cmd...>
+                                 (runs the agent, then verifies; reports Verified Outcome Cost)
+  runcap outcome [show]          (print the latest outcome receipt)
+  runcap policy validate [path]  (check .runcap/mission.yaml is well-formed)
+  runcap mission run [--policy path] [--task override] [--mock] -- <agent cmd...>
+                                 (enforce the repo policy; exit 1 if the mission is BLOCKED)
+  runcap ci [--policy path] [--receipt path]
+                                 (grade a receipt against the policy; writes PR summary, exit 1 on BLOCKED)
   runcap plans
   runcap cap <usd>               (set the hard cap the gateway enforces)
   runcap cap show                (show the current cap)
@@ -90,6 +109,17 @@ function takeFlag(input, name) {
   return true;
 }
+// Collect every occurrence of a repeatable option, e.g. --allow src --allow lib.
+function takeAll(input, name) {
+  const values = [];
+  let index;
+  while ((index = input.indexOf(name)) !== -1) {
+    values.push(input[index + 1]);
+    input.splice(index, 2);
+  }
+  return values.filter(Boolean);
+}
 // A real call can cost a fraction of a cent. toFixed(2)/(4) would print $0.00 or
 // $0.0000 and read as "nothing was recorded", so show a meaningful figure for
 // sub-cent spend instead of rounding a real charge down to zero.
@@ -100,6 +130,19 @@ function fmtUsd(n) {
   return `$${n.toPrecision(2)}`;
 }
+// In GitHub Actions, $GITHUB_STEP_SUMMARY is a file the runner renders as the
+// job's PR annotation. Append the verdict there so a reviewer sees red/green
+// without opening logs. A no-op off CI.
+function writeCiSummary(markdown) {
+  const target = process.env.GITHUB_STEP_SUMMARY;
+  if (!target) return;
+  try {
+    appendFileSync(target, markdown + "\n");
+  } catch {
+    // best-effort annotation only - never fail the verdict on a write error.
+  }
+}
 try {
   if (command === "welcome") {
     console.log(await welcome());
@@ -184,6 +227,116 @@ try {
     const sync = await syncRun(planToRun(plan));
     if (sync === "synced") console.log("Cloud: synced to your Runcap Pro dashboard.");
     else if (sync && sync.startsWith("sync_failed")) console.log(`Cloud: ${sync}`);
+  } else if (command === "outcome") {
+    const sub = args[1] ?? "show";
+    if (sub === "run") {
+      const oArgs = args.slice(2);
+      const task = takeOption(oArgs, "--task");
+      const verify = takeOption(oArgs, "--verify");
+      const label = takeOption(oArgs, "--label");
+      const mock = takeFlag(oArgs, "--mock");
+      const guard = takeFlag(oArgs, "--guard");
+      const protect = takeAll(oArgs, "--protect");
+      const allow = takeAll(oArgs, "--allow");
+      const separator = oArgs.indexOf("--");
+      const childArgs = separator === -1 ? [] : oArgs.slice(separator + 1);
+      if (!task) throw new Error("Missing --task \"description\".");
+      if (!verify) throw new Error("Missing --verify \"<command>\" (e.g. --verify \"npm test && npm run build\").");
+      if (childArgs.length === 0) throw new Error("Missing agent command after `--`.");
+      const result = await runOutcome({ task, verify, command: childArgs, label, mock, guard, protect, allow });
+      console.log("\n" + result.summary);
+      console.log(`Receipt: .runcap/outcomes/${result.id}/receipt.json`);
+    } else {
+      const id = sub === "show" ? await latestOutcomeId() : sub;
+      if (!id) throw new Error("No outcome receipt found. Run `runcap outcome run ...` first.");
+      console.log(await renderOutcome(id));
+    }
+  } else if (command === "policy") {
+    const sub = args[1] ?? "validate";
+    if (sub === "validate") {
+      const explicit = args[2] && !args[2].startsWith("--") ? args[2] : undefined;
+      const loaded = loadPolicy(process.cwd(), explicit);
+      if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass a path).");
+      const { ok, errors, warnings } = validatePolicy(loaded.policy);
+      console.log(`Policy: ${loaded.source}`);
+      console.log(`Hash:   ${loaded.hash}`);
+      for (const w of warnings) console.log(`  warning: ${w}`);
+      if (ok) {
+        console.log("Policy is valid.");
+      } else {
+        for (const e of errors) console.log(`  error: ${e}`);
+        console.log("Policy is INVALID.");
+        process.exitCode = 1;
+      }
+    } else {
+      throw new Error(`Unknown policy subcommand: ${sub}. Try \`runcap policy validate [path]\`.`);
+    }
+  } else if (command === "mission") {
+    const sub = args[1] ?? "run";
+    if (sub !== "run") throw new Error(`Unknown mission subcommand: ${sub}. Try \`runcap mission run -- <agent cmd>\`.`);
+    const mArgs = args.slice(2);
+    const policyPath = takeOption(mArgs, "--policy");
+    const taskOverride = takeOption(mArgs, "--task");
+    const mock = takeFlag(mArgs, "--mock");
+    const separator = mArgs.indexOf("--");
+    const childArgs = separator === -1 ? [] : mArgs.slice(separator + 1);
+    if (childArgs.length === 0) throw new Error("Missing agent command after `--`.");
+    const loaded = loadPolicy(process.cwd(), policyPath);
+    if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass --policy <path>).");
+    const { ok, errors } = validatePolicy(loaded.policy);
+    if (!ok) {
+      for (const e of errors) console.error(`  policy error: ${e}`);
+      throw new Error("Mission policy is invalid - fix it before running the mission.");
+    }
+    const p = loaded.policy;
+    const mission = p.mission ?? {};
+    const verification = p.verification ?? {};
+    const budget = p.budget ?? {};
+    const task = taskOverride || [mission.name, mission.task_class].filter(Boolean).join(" - ") || mission.name;
+    const result = await runOutcome({
+      task,
+      verify: verification.command,
+      command: childArgs,
+      label: mission.name,
+      mock,
+      guard: true,
+      protect: Array.isArray(verification.protect) ? verification.protect : [],
+      allow: Array.isArray(verification.allow) ? verification.allow : [],
+      capUsd: budget.mission_hard_limit_usd ?? null,
+      policy: loaded
+    });
+    console.log("\n" + result.summary);
+    console.log(`Receipt: .runcap/outcomes/${result.id}/receipt.json`);
+    if (result.receipt.policy?.verdict === "BLOCKED") process.exitCode = 1;
+  } else if (command === "ci") {
+    const ciArgs = args.slice(1);
+    const policyPath = takeOption(ciArgs, "--policy");
+    const receiptPath = takeOption(ciArgs, "--receipt");
+    const loaded = loadPolicy(process.cwd(), policyPath);
+    if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass --policy <path>).");
+    const { ok, errors } = validatePolicy(loaded.policy);
+    if (!ok) {
+      for (const e of errors) console.error(`  policy error: ${e}`);
+      writeCiSummary(["## Runcap mission: policy INVALID", "", ...errors.map((e) => `- ${e}`)].join("\n"));
+      throw new Error("Mission policy is invalid.");
+    }
+    let receipt;
+    if (receiptPath) {
+      receipt = JSON.parse(readFileSync(receiptPath, "utf8"));
+    } else {
+      const id = await latestOutcomeId();
+      if (!id) throw new Error("No outcome receipt found. Run `runcap mission run ...` first, or pass --receipt <path>.");
+      receipt = JSON.parse(readFileSync(`.runcap/outcomes/${id}/receipt.json`, "utf8"));
+    }
+    const verdict = evaluatePolicyVerdict(receipt, loaded.policy);
+    const block = formatPolicyBlock({ ...policyMeta(loaded), ...verdict });
+    console.log(block.join("\n"));
+    writeCiSummary(["## Runcap mission verdict: " + verdict.verdict, "", "```", ...block, "```"].join("\n"));
+    if (verdict.verdict === "BLOCKED") process.exitCode = 1;
   } else if (command === "login") {
     console.log(await loginCommand(args[1]));
   } else if (command === "logout") {

package/examples/outcome-demo/agent-fixes.mjs ADDED Viewed

@@ -0,0 +1,24 @@
+// A stand-in coding agent that (1) spends money via the gateway and (2) actually
+// fixes the bug. It points at whatever base URL Runcap injected (the cap
+// gateway), so the spend is recorded and priced exactly like a real agent's.
+import { writeFile } from "node:fs/promises";
+import path from "node:path";
+const base = process.env.OPENAI_BASE_URL || "https://api.openai.com/v1";
+const model = process.env.OUTCOME_DEMO_MODEL || "gpt-4o";
+async function call(prompt) {
+  const res = await fetch(`${base}/chat/completions`, {
+    method: "POST",
+    headers: { "content-type": "application/json", authorization: "Bearer demo" },
+    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] })
+  });
+  return res.text();
+}
+await call("Read broken.mjs. The sum() function subtracts instead of adds. Plan the one-line fix.");
+await call("Apply the fix: sum should return a + b.");
+const file = path.join(process.cwd(), "examples/outcome-demo/broken.mjs");
+await writeFile(file, "export function sum(a, b) {\n  return a + b;\n}\n");
+console.log("agent-fixes: rewrote sum() to add");

package/examples/outcome-demo/agent-spins.mjs ADDED Viewed

@@ -0,0 +1,20 @@
+// A stand-in coding agent that spends money via the gateway but never fixes the
+// bug - it circles, re-reading and re-planning, the way a stuck agent burns
+// budget while reporting confident progress. broken.mjs is left untouched.
+const base = process.env.OPENAI_BASE_URL || "https://api.openai.com/v1";
+const model = process.env.OUTCOME_DEMO_MODEL || "gpt-4o";
+async function call(prompt) {
+  const res = await fetch(`${base}/chat/completions`, {
+    method: "POST",
+    headers: { "content-type": "application/json", authorization: "Bearer demo" },
+    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] })
+  });
+  return res.text();
+}
+await call("Read broken.mjs and explain the bug in sum().");
+await call("Re-read broken.mjs once more to be sure, then restate the plan.");
+await call("Describe at length how you would fix it, but do not write the file yet.");
+console.log("agent-spins: lots of talk, no fix written");
+process.exit(0);

package/examples/outcome-demo/broken.mjs ADDED Viewed

@@ -0,0 +1,5 @@
+// Intentionally wrong: sum() subtracts. The task is to make verify.mjs pass.
+// The agent under test rewrites this file (or fails to).
+export function sum(a, b) {
+  return a - b;
+}

package/examples/outcome-demo/verify.mjs ADDED Viewed

@@ -0,0 +1,7 @@
+// The verification oracle. Exit 0 means the task is actually delivered.
+import { sum } from "./broken.mjs";
+import assert from "node:assert";
+assert.strictEqual(sum(2, 3), 5, `sum(2,3) should be 5, got ${sum(2, 3)}`);
+assert.strictEqual(sum(10, 5), 15, `sum(10,5) should be 15, got ${sum(10, 5)}`);
+console.log("verify: PASSED (sum is correct)");

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "runcap",
-  "version": "0.3.1",
+  "version": "0.5.0",
   "description": "Cap every agent run before it starts: estimate cost, set a hard ceiling that stops the run, rescue stuck agents. Local, MIT, nothing uploaded.",
   "license": "MIT",
   "type": "module",
@@ -45,7 +45,12 @@
     "acceptance": "node ./scripts/acceptance.mjs",
     "smoke": "node ./bin/runcap.mjs run --label smoke -- npm --prefix examples/broken-ts-app run build",
     "demo:broken": "node ./bin/runcap.mjs run --label broken-ts-demo -- npm --prefix examples/broken-ts-app run build",
-    "test": "node ./scripts/delta-test.mjs && node ./scripts/loop-test.mjs && node ./scripts/loop-e2e.mjs && node ./scripts/validate-demo.mjs",
+    "test": "node ./scripts/delta-test.mjs && node ./scripts/loop-test.mjs && node ./scripts/loop-e2e.mjs && node ./scripts/validate-demo.mjs && node ./scripts/outcome-test.mjs && node ./scripts/guard-test.mjs && node ./scripts/policy-test.mjs && node ./scripts/mission-test.mjs",
+    "test:outcome": "node ./scripts/outcome-test.mjs",
+    "test:guard": "node ./scripts/guard-test.mjs",
+    "test:policy": "node ./scripts/policy-test.mjs",
+    "test:mission": "node ./scripts/mission-test.mjs",
+    "outcome": "node ./bin/runcap.mjs outcome",
     "test:delta": "node ./scripts/delta-test.mjs",
     "test:loop": "node ./scripts/loop-test.mjs",
     "status": "node ./bin/runcap.mjs status",
@@ -53,11 +58,15 @@
     "export": "node ./bin/runcap.mjs export",
     "templates": "node ./bin/runcap.mjs templates",
     "dashboard": "node ./bin/runcap.mjs dashboard",
+    "screenshots": "node ./scripts/render-media-screenshots.mjs",
     "gateway": "node ./bin/runcap.mjs gateway",
     "fuel": "node ./bin/runcap.mjs fuel",
     "check": "node --check ./bin/runcap.mjs && node --check ./src/mission-control.mjs"
   },
   "engines": {
     "node": ">=20"
+  },
+  "dependencies": {
+    "js-yaml": "^4.1.0"
   }
 }