npm - runcap - Versions diffs - 0.3.1 → 0.6.0 - Mend

runcap 0.3.1 → 0.6.0

Files changed (18)

package/README.md +235 -20
package/bin/runcap.mjs +171 -0
package/examples/outcome-demo/agent-fixes.mjs +24 -0
package/examples/outcome-demo/agent-spins.mjs +20 -0
package/examples/outcome-demo/broken.mjs +5 -0
package/examples/outcome-demo/verify.mjs +7 -0
package/examples/runcap-adjudicate.yml +57 -0
package/package.json +24 -12
package/scripts/adjudicate-test.mjs +334 -0
package/scripts/guard-test.mjs +76 -0
package/scripts/make-demo-svg.mjs +20 -20
package/scripts/mission-test.mjs +148 -0
package/scripts/outcome-test.mjs +48 -0
package/scripts/policy-test.mjs +121 -0
package/scripts/render-media-screenshots.mjs +37 -0
package/src/adjudicate.mjs +508 -0
package/src/mission-control.mjs +441 -1
package/src/policy.mjs +208 -0

package/README.md CHANGED

@@ -2,30 +2,75 @@
 [![CI](https://github.com/kirder24-code/ai-agent-manager/actions/workflows/ci.yml/badge.svg)](https://github.com/kirder24-code/ai-agent-manager/actions/workflows/ci.yml)
-![Runcap terminal demo: estimate, cap, compress, stop](docs/assets/demo.svg)
+![Runcap terminal demo: estimate, cap, verify integrity, mission PASS - then a tampered run graded BLOCKED on the PR](docs/assets/demo.svg)
-**Your AI coding agent re-reads the same files over and over and quietly burns your money. Runcap estimates the bill before you build, hard-caps the spend so it physically stops at your ceiling, and losslessly compresses every call. Free, MIT, 100% local. Your code and tokens never touch a server.**
+**An AI coding agent can pass CI by editing the test that proves its own success. Runcap caps the spend before the run and issues evidence about whether that success check can be trusted. Free, MIT, local-first. Local runs keep Runcap's control plane on your machine; optional CI adjudication runs in your GitHub Actions environment.**
-On a real OpenAI call, one edited-file re-read dropped from **1,186 to 737 prompt tokens (37.9% saved)** with the model still answering correctly about the changed line. No other proxy does this:
+> **An agent passing CI is not enough.**
+> Runcap verifies whether the evidence of success was altered during the mission.
+| Status | Meaning |
+|---|---|
+| `VERIFIED_STRONG` | Result passed an unchanged verifier and a clean-worktree replay. |
+| `VERIFIED_WEAK` | Result passed, but some integrity evidence is missing. |
+| `UNVERIFIED` | Verification did not pass. |
+| `VERIFIER_COMPROMISED` | The agent changed protected verification evidence. |
+```text
+Estimate the run  →  Cap the spend  →  Verify the outcome
+```
+Most cost and observability tools measure **tokens** used. But you don't buy tokens - you buy a result that passes a check. So Runcap measures the number that actually decides whether the spend was worth it:
+> **Verified Outcome Cost = total run cost / tasks that passed verification.** An agent that talks but never fixes the bug can cost *more* than one that does - and a token dashboard calls it "cheaper."
 | | Without Runcap | With Runcap |
 |---|---|---|
-| Re-read of an edited file | 1,186 prompt tokens | **737 prompt tokens** |
 | You find out the cost | when the invoice arrives | **before you press go, capped at your ceiling** |
 | When the agent gets stuck | it keeps spending | **run stops, you get the exact rescue prompt** |
+| What you paid for | tokens (delivered or not) | **dollars per *verified* result** |
+In a 6-run test on the same task, the run that **delivered nothing** cost *more* than the one that delivered, and the cheapest verified result was ~45x cheaper than the most expensive verified one - same passing test. ([full table below](#real-results-6-runs-same-task-reproducible-offline))
+> Most tools here are a rear-view mirror - they show you the bill *after* you paid it. Runcap estimates the bill *before* you start, caps it during the run, and issues evidence about whether the declared verification can be trusted. It is a circuit breaker with a receipt, not a dashboard.
+> If Runcap caps a run for you or compresses a call, please **star the repo** - it is the one signal that tells me to keep building it in the open.
+## Make a change earn merge eligibility
-> Every other tool here is a rear-view mirror - it shows you the bill *after* you paid it. Runcap estimates the bill *before* you start and caps it. It is a circuit breaker, not a dashboard.
+> AI can propose a change.
+> Runcap makes it earn merge eligibility.
+`runcap ci --mode adjudicate` is a required PR check that does not trust the agent or its receipt. It recomputes the merge decision in a clean CI job from the pull request's **base commit**:
+```text
+AI-generated PR
+  → Runcap action pinned to an immutable release commit
+  → policy / verifier / dependencies read from the PR base commit
+  → clean CI replay
+  → PASS / BLOCKED / HUMAN_APPROVAL_REQUIRED
+```
+- **`PASS`** - the base verifier failed, the replay passed, and the change was allowed text-only edits inside scope.
+- **`BLOCKED`** - a scope violation, an unsafe diff type (delete / rename / binary / symlink / submodule / mode change), an unresolved base/head identity, or a failed replay.
+- **`HUMAN_APPROVAL_REQUIRED`** - the change touches the policy, a workflow, a verifier file, a dependency manifest/lockfile, or a protected path. Runcap does not auto-approve changes to its own rules or evidence; a human CODEOWNER must approve.
+The verdict is a **CI-attested replay under a documented hardened GitHub profile**. It is *not* "unspoofable," *not* "fully independent," and it is *not* independent budget enforcement - its integrity rests on the [required GitHub setup](docs/trust-model.md#required-github-setup) being in place. The agent's receipt never decides the verdict: the required gate does not read it.
+See the [trust model](docs/trust-model.md#ci-adjudication-v06) for exactly what v0.6 proves and what it does not, and [Install in a consumer repo](#install-in-a-consumer-repo) to wire it up.
 ## Why
 **Agents loop on the same error, rewrite plans, and re-read files they just edited - every loop is tokens you pay for.** Multi-agent coding runs burn roughly **15x more tokens** than a single chat ([Anthropic engineering](https://www.anthropic.com/engineering/built-multi-agent-research-system)). They hand you a confident summary while the task is not actually done, and you find out what it cost when the invoice - or the subscription limit - arrives.
-Observability tools (Langfuse, Helicone, LangSmith, AgentOps) measure the past. Gateways (LiteLLM, Portkey, OpenRouter) route the present. None of them stop the spend *before* it happens. Runcap does the one thing the rear-view mirror can't:
+Observability tools (Langfuse, Helicone, LangSmith, AgentOps) measure the past, and some run evals on outputs. Gateways (LiteLLM, Portkey, OpenRouter) route the present. What they don't do is enforce a mission policy *during* the run - a hard spend cap, allowed scope, protected verification evidence - then issue evidence about whether the agent's own success check can be trusted. Runcap does the things the rear-view mirror can't:
 ```text
-estimate before build  →  cap during run  →  compress every call  →  rescue when stuck
+estimate before build  →  cap during run  →  compress every call  →  rescue when stuck  →  verify the outcome
 ```
+It also quietly trims waste on the way through: on a real OpenAI call, one edited-file re-read dropped from **1,186 to 737 prompt tokens (37.9% saved)** with the model still answering correctly about the changed line - lossless, and no other proxy does it ([details below](#token-compression-built-in-no-extra-deps)).
 ## The honest claim
 Runcap does **not** promise an exact cost oracle. Agent trajectories are stochastic - nobody, including the model labs, can predict the exact token count of a run. So Runcap gives you a **range plus a hard cap**:
@@ -48,6 +93,8 @@ If you have those, Runcap caps your spend in one command. If you are looking for
 No API key required.
+![Runcap terminal proof: estimate, cap spend, compress tokens, verify integrity, and block a compromised mission](docs/assets/media/demo.png)
 ```bash
 git clone https://github.com/kirder24-code/ai-agent-manager.git
 cd ai-agent-manager
@@ -106,6 +153,7 @@ Or run from source with `node ./bin/runcap.mjs <command>`.
 runcap plan --fuel 24 -- "build a small auth feature and verify it"   # range + recommended cap, before you spend
 runcap preflight -- claude "build a full SaaS app"                     # is this prompt too broad?
 runcap run --label fix -- claude "fix one failing check. stop if blocked."  # wrap any agent/command
+runcap outcome run --task "..." --verify "pnpm test" -- claude "fix it"     # cost of a VERIFIED result, not tokens
 runcap report                                                          # human-readable rescue report
 runcap export                                                          # evidence JSON with truth labels
 runcap dashboard                                                       # local cockpit at :8791
@@ -125,6 +173,10 @@ OPENAI_API_KEY=sk-... AIM_DAILY_BUDGET_USD=5 runcap gateway
 # Anthropic-native (Claude Code, /v1/messages)
 ANTHROPIC_API_KEY=sk-ant-... AIM_DAILY_BUDGET_USD=5 runcap gateway
 #   then: ANTHROPIC_BASE_URL=http://127.0.0.1:8792/v1
+# DeepSeek (OpenAI-compatible, much cheaper - same one command)
+OPENAI_API_KEY=sk-... AIM_UPSTREAM_BASE_URL=https://api.deepseek.com AIM_DAILY_BUDGET_USD=5 runcap gateway
+#   then point your agent at: OPENAI_BASE_URL=http://127.0.0.1:8792/v1  (model: deepseek-chat)
 ```
 When spend crosses the ceiling, the next call returns `429 budget_guard` instead of money leaving your account. Try it with no key: `runcap gateway --mock`.
@@ -135,9 +187,9 @@ Every request that passes through the gateway is compressed before it's forwarde
 1. **Per-field trim** - embedded JSON re-serialized compactly, long log/stack-trace dumps collapsed to head + tail, trailing whitespace squeezed.
 2. **Identical-block dedup** - when the exact same file dump or tool_result ships again in the same request, the repeat is replaced with a deterministic stub.
-3. **Delta-encoding of near-duplicates** - the layer no other proxy has. When the agent reads a file, edits one line, and re-reads it, the block is *similar but not identical*, so plain dedup saves nothing. Runcap sends a readable line-diff against the version the model already saw, and the model reconstructs the current file from it. On a real OpenAI call, an edited-file re-read dropped from **1186 to 737 prompt tokens - 37.9% saved, with the model still answering correctly about the changed line.** Proof and reproduction steps: [docs/delta-encoding-evidence.md](https://github.com/kirder24-code/ai-agent-manager/blob/main/docs/delta-encoding-evidence.md).
+3. **Delta-encoding of near-duplicates.** When the agent reads a file, edits one line, and re-reads it, the block is *similar but not identical*, so plain dedup saves nothing. Runcap sends a readable line-diff against the version the model already saw, and the model reconstructs the current file from it. On a real OpenAI call, an edited-file re-read dropped from **1186 to 737 prompt tokens - 37.9% saved, with the model still answering correctly about the changed line.** Proof and reproduction steps: [docs/delta-encoding-evidence.md](https://github.com/kirder24-code/ai-agent-manager/blob/main/docs/delta-encoding-evidence.md).
-It's pure Node with **zero ML or native dependencies**, so it installs everywhere without the build pain heavier compressors have.
+It's pure Node with **zero native or ML dependencies** (the only runtime dependency is `js-yaml`, pure JS), so it installs everywhere without the build pain heavier compressors have.
 The dashboard shows the result as one number: **"You saved $X · N tokens compressed · would have spent $Y."** Disable it with `AIM_COMPRESS=off` if you ever want raw passthrough.
@@ -147,9 +199,172 @@ The hard case in stuck-detection is the agent that keeps producing output but is
 This is a **calculated** signal, not a proven dollar-saving: it tells you *"the agent has sent 3 near-identical prompts in a row with no progress"* so you can step in before the loop burns more budget. Tune or disable it with `AIM_LOOP_DETECT=off`. (Today's [`detectStuck`](src/mission-control.mjs) post-run score is outcome-based: exit code, parsed errors, and zero-diff. The loop signal adds the missing in-flight behavioral signal on top of it.)
+## Verified Outcome Cost (`runcap outcome`)
+Tokens are the wrong unit. You don't buy tokens, you buy a **result that passes a check**. So Runcap measures the only number that tracks what you actually paid for:
+> **Verified Outcome Cost = total run cost / tasks that passed verification.**
+Wrap your agent and hand it a verification command. Its exit code is the oracle:
+```bash
+runcap outcome run \
+  --task "Fix the failing test" \
+  --verify "pnpm test && pnpm build" \
+  -- claude "fix one failing check, then stop"
+runcap outcome show          # re-print the latest receipt
+```
+The run produces an **outcome receipt** where every field carries a truth label:
+```
+Outcome:               VERIFIED
+Actual cost:           $0.000677  (2 priced LLM calls, calculated_from_provider_usage_and_sourced_price_table)
+Verified Outcome Cost: $0.000677  (money that bought a verified result)
+```
+If verification fails, the receipt is honest about it instead of pretending the spend delivered something:
+```
+Outcome:               UNVERIFIED
+Verified Outcome Cost: N/A  (verification did not pass)
+Money spent without verified delivery: $0.001012
+```
+That second case is the whole point: an agent that talks but never fixes the bug can cost **more** than one that does, while a token dashboard calls it "cheaper." Try both, offline, no API key:
+```bash
+runcap outcome run --task "Fix sum() so it adds" --verify "node examples/outcome-demo/verify.mjs" --mock -- node examples/outcome-demo/agent-fixes.mjs   # VERIFIED
+runcap outcome run --task "Fix sum() so it adds" --verify "node examples/outcome-demo/verify.mjs" --mock -- node examples/outcome-demo/agent-spins.mjs   # UNVERIFIED
+```
+Receipts are written to `.runcap/outcomes/<id>/receipt.json`. Run the same task across several agents and you get the **Agent Economics Index** - the same board, priced by verified delivery instead of tokens.
+### Real results (6 runs, same task, reproducible offline)
+Same task - *fix a broken `sum()` so the test passes* - across different models and two agent behaviors. Every number below is measured from the gateway ledger, not estimated:
+| # | Model | Agent | LLM calls | Actual cost | Outcome | Verified Outcome Cost | Money, nothing delivered |
+|---|---|---|---|---|---|---|---|
+| 1 | gpt-4o | writes fix | 2 | $0.000677 | VERIFIED | **$0.000677** | $0 |
+| 2 | gpt-4o | spins, no fix | 3 | $0.001012 | UNVERIFIED | **N/A** | **$0.001012** |
+| 3 | gpt-5.4 | writes fix | 2 | $0.000957 | VERIFIED | **$0.000957** | $0 |
+| 4 | deepseek-chat | writes fix | 2 | $0.000022 | VERIFIED | **$0.000022** | $0 |
+| 5 | deepseek-chat | spins, no fix | 3 | $0.000033 | UNVERIFIED | **N/A** | **$0.000033** |
+| 6 | claude-sonnet-4 | writes fix | 2 | $0.000981 | VERIFIED | **$0.000981** | $0 |
+Two facts a token dashboard can't show you: on gpt-4o the run that **delivered nothing** (row 2) cost *more* than the run that delivered (row 1); and the cheapest verified result (deepseek-chat, $0.000022) bought the same passing test as gpt-5.4 at ~43x the price. Reproduce any row with `OUTCOME_DEMO_MODEL=<model>` in front of the command above. (One run per row in v0.1 - the point is the unit, not a vendor ranking; a ranking needs N>=5 runs/agent and a pass-rate column.)
+## Verification Integrity (`runcap outcome --guard`)
+A green test only means something if the agent passed it *fairly*. An agent under pressure can turn a check green without doing the work: delete the failing test, rewrite the assertion, repoint the `npm test` script at `true`, disable TypeScript strict mode, mock the real API, or hardcode the expected answer. Exit code 0 - and a plain Verified Outcome Cost calls it delivered. So does every token dashboard.
+`--guard` verifies the verifier. Before the agent runs, Runcap freezes a **Task Contract**: the baseline git commit, a SHA-256 of every file the verify command names, a snapshot of `package.json` scripts, and a check that the task *actually fails today* (a pass on an already-green tree proves nothing). After the run it re-checks all of it, and re-runs the verify command from the baseline commit in a throwaway git worktree with only the agent's changed files copied in - so a green that depended on uncommitted local junk dies in the clean room.
+The result is a trust grade, not a binary:
+| Status | Meaning |
+|---|---|
+| `VERIFIED_STRONG` | Passed, verifier untouched, the task really failed before, and the pass survives a clean checkout. |
+| `VERIFIED_WEAK` | Passed and verifier untouched, but a strong condition was missed (e.g. baseline failure not reproduced). |
+| `UNVERIFIED` | Verification did not pass. |
+| `VERIFIER_COMPROMISED` | Passed, **but the verifier itself was modified during the run.** The green light cannot be trusted. |
+```bash
+runcap outcome run --guard \
+  --task "Fix the failing test" \
+  --verify "pnpm test" \
+  --allow src/ \
+  -- claude "fix one failing check, then stop"
+```
+`--protect <path>` marks extra paths the agent must not touch (tests, config, and `package.json` are protected by default); `--allow <path>` declares the only paths a legitimate fix should change, so out-of-scope edits drop the grade. The receipt gains a `verificationIntegrity` block listing every check, every truth label, and exactly which file was tampered with if the grade is `VERIFIER_COMPROMISED`.
+> **One honesty note that rides on every receipt:** Verified Outcome Cost is the **LLM spend that bought the result** - it does *not* include subscriptions, CI minutes, sandbox compute, or human review time. For real agent economics you want **Expected Verified Outcome Cost = total spend across N attempts / strongly-verified outcomes**, which needs N>=5 runs. That's the v0.2 unit; v0.1 measures the one cost the gateway can observe honestly.
+## Mission Policy & CI enforcement (`runcap mission` / `policy` / `ci`)
+Everything above lives in one developer's terminal. A platform, VP-Eng, or FinOps owner can't act on it: there's no way to declare the rules of a mission *once*, in the repo, and no way to fail a pull request when an agent breaches budget or tampers with the evidence of its own success.
+A **mission policy** closes that gap. You declare the rules once in `.runcap/mission.yaml`, enforce them during the run, and grade the result into a **PASS / BLOCKED** verdict a GitHub Action turns into a red/green check on the PR.
+```yaml
+# .runcap/mission.yaml
+version: v1
+identity:
+  project: checkout
+  team: payments
+mission:
+  name: Fix the failing checkout test
+  task_class: bugfix
+budget:
+  mission_hard_limit_usd: 10      # required - per-mission hard cap (the gateway enforces it live)
+  max_llm_calls: 12               # optional - BLOCK if exceeded
+  max_runtime_minutes: 30         # optional - BLOCK if exceeded
+verification:
+  command: "pnpm test && pnpm build"   # required - the oracle (exit 0 = delivered)
+  guard: strict                        # strict (default) freezes + re-checks the verifier
+  protect: ["tests/**", "package.json"]  # paths the agent must not touch
+  allow:   ["src/checkout/**"]           # the only paths a legit fix should change
+```
+Validate it, then run the agent under it:
+```bash
+runcap policy validate                    # is .runcap/mission.yaml well-formed?
+runcap mission run -- claude "fix the failing checkout test, then stop"
+```
+`mission run` enforces the per-mission hard cap through the gateway, always runs the verification guard, and writes an outcome receipt that now carries a **policy block** - the org attribution, the limits, the **SHA-256 of the exact policy text that graded the run**, and the verdict. It exits `1` on `BLOCKED`, so it fails CI on its own. The mission is **BLOCKED** when any of these is true:
+- the verifier was tampered with (`VERIFIER_COMPROMISED`);
+- verification did not pass (`UNVERIFIED`);
+- a change landed outside the declared `allow` scope;
+- spend exceeded `mission_hard_limit_usd`, or the gateway's budget guard tripped mid-run;
+- `max_llm_calls` or `max_runtime_minutes` was exceeded.
+### The local grade vs. the CI adjudication
+There are two ways the policy verdict reaches a PR, and they trust different things:
+- **`runcap mission run`** (local / same-job) grades the run it just executed and re-checks it against the committed policy text. Useful, but the receipt it produces is *agent-side* evidence.
+- **`runcap ci --mode adjudicate`** (the required PR check) trusts none of that. It recomputes the verdict in a clean CI job from the PR's base commit and **never reads the agent receipt**. This is the gate that decides merge eligibility.
+### Install in a consumer repo
+Make the adjudication a required red/green PR check in your own repo:
+1. Add `.runcap/mission.yaml` (the policy - see the example above).
+2. Copy `examples/runcap-adjudicate.yml` into `.github/workflows/`.
+3. Replace the all-zero `RUNCAP_ACTION_SHA` placeholder with the full immutable commit SHA of the released version (resolve it with `gh api repos/kirder24-code/ai-agent-manager/git/refs/tags/vX.Y.Z --jq '.object.sha'`).
+4. Configure the hardened GitHub branch profile (protected branch, required check, up-to-date-before-merge, dismiss stale approvals, CODEOWNERS for workflow/policy/verifier/dependency/protected paths, no bypass for ordinary authors) - the full list is in the [trust model](docs/trust-model.md#required-github-setup).
+5. Make `Runcap adjudicate` a required status check.
+> The template ships with an all-zero placeholder SHA and is **intentionally not runnable until you insert the release SHA**. This is deliberate: the judge must be an immutable release commit that lives outside the candidate PR, so a malicious PR cannot rewrite its own judge.
+A reviewer sees one of two things:
+```
+Mission verdict: PASS
+  project checkout / team payments
+  Mission cost $0.0007 / $10.00
+  Policy hash: c857d10c…
+```
+```
+Mission verdict: BLOCKED
+  Blocked because:
+    - VERIFIER_COMPROMISED: the agent changed protected verification evidence (verifier_file_unchanged:app/verify.mjs).
+```
+Because the verdict is recomputed from the committed policy and the receipt records the policy hash, a reviewer can confirm *which rules graded the run* - the verdict is only as trustworthy as the policy hash it carries.
 ## Pricing table
-Costs are calculated from a sourced multi-provider table - Anthropic (Opus / Sonnet / Haiku) and OpenAI (GPT-5 family + legacy GPT-4), with cache-read and batch discounts handled - labeled with source and verification date. When a model is unknown, Runcap says `unknown_price` rather than guessing.
+Costs are calculated from a sourced multi-provider table - Anthropic (Opus / Sonnet / Haiku), OpenAI (GPT-5 family + legacy GPT-4), and DeepSeek (V4 Flash / V4 Pro) - with cache-read and batch discounts handled, labeled with source and verification date. When a model is unknown, Runcap says `unknown_price` rather than guessing.
+DeepSeek matters because its API is OpenAI-compatible: point the gateway at `https://api.deepseek.com` with your DeepSeek key and Runcap prices, caps, and compresses it with zero extra setup - the same one command as OpenAI. At roughly $0.14 / $0.28 per million input/output tokens it is far cheaper than the US frontier models, so the people running the biggest agent loops on it are exactly the ones a hard cap protects.
 ## Trust model
@@ -163,20 +378,16 @@ Runcap is built not to fake certainty. Every important output carries a truth la
 If it cannot prove something, it says so.
-## Pricing (the product, not the tokens)
+## Availability
-| Tier | Price | What you get |
-|---|---|---|
-| **OSS** (MIT, local) | $0 forever | All local runs, cost estimation, hard cap, run wrapping, stuck detection, rescue prompts, local dashboard. Never crippleware. |
-| **Founding Pro** (limited) | **$49 once** | Lifetime Pro at the founder price - pay once, keep Pro forever, before it moves to $19/mo. |
-| **Pro** | $19/mo | Cloud sync across machines, hosted dashboard, estimate-vs-actual trends, shareable reports, alerts on cap breach |
-| **Team** | $49/seat/mo | Shared budget pools, org-wide ceilings, per-project rollups, role-based caps |
+Runcap v0.6 is open-source and free under MIT.
-The local core is free forever. Only persistence, collaboration, and aggregation are paid - the things that only matter once data leaves your laptop.
+The local CLI and CI adjudication mode are available now.
+Hosted sync, team budget pools, organization reporting, and paid plans are future ideas only. They are not available for purchase today.
 ## Current stage
-A working local tool, not a hosted SaaS. Ready for: wrapping real Codex / Claude / Cursor sessions, catching stuck agents, and proving rescue prompts save time. Not yet: a hosted cloud platform or a universal observability standard. It is not trying to replace Langfuse or LiteLLM - it does the thing they don't.
+A working local tool plus an optional CI adjudication mode, not a hosted SaaS. Ready for: wrapping real Codex / Claude / Cursor sessions, catching stuck agents, proving rescue prompts save time, and gating AI-generated pull requests in GitHub Actions. Not yet: a hosted cloud platform or a universal observability standard. It is not trying to replace Langfuse or LiteLLM; it focuses on a different layer - pre-run cost caps and merge-eligibility evidence.
 ## Documentation
@@ -187,6 +398,10 @@ A working local tool, not a hosted SaaS. Ready for: wrapping real Codex / Claude
 - [Integrations](docs/integrations.md)
 - [Trust model](docs/trust-model.md)
+## Built by
+Runcap is built and maintained by Kirill D., a solo AI and automation consultant based in Calgary, Canada. He helps solo SaaS founders and service businesses ship AI features that hold up in production - cost control, vibe-code audits, and reliable automation. More at [launchsoloai.com](https://launchsoloai.com).
 ---
-The thesis: **AI agents need managers.**
+The thesis: **AI can propose a change. It should not certify its own success.**

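The README's formula, Verified Outcome Cost = total run cost / tasks that passed verification, can be checked against its own six-row results table. The sketch below is illustrative only: `verifiedOutcomeCost`, `undeliveredSpend`, and `byModel` are hypothetical names, not exports of the runcap package. Pooling several runs per model is one reading of the formula; the README itself reports one run per row.

```javascript
// Hypothetical helpers, not part of the runcap package.
// The rows are copied from the README's "Real results" table.
const runs = [
  { model: "gpt-4o", cost: 0.000677, verified: true },
  { model: "gpt-4o", cost: 0.001012, verified: false },
  { model: "gpt-5.4", cost: 0.000957, verified: true },
  { model: "deepseek-chat", cost: 0.000022, verified: true },
  { model: "deepseek-chat", cost: 0.000033, verified: false },
  { model: "claude-sonnet-4", cost: 0.000981, verified: true },
];

// Total run cost divided by the number of runs that passed verification.
// Returns null when nothing passed, mirroring the README's "N/A".
function verifiedOutcomeCost(rows) {
  const total = rows.reduce((sum, r) => sum + r.cost, 0);
  const passed = rows.filter((r) => r.verified).length;
  return passed === 0 ? null : total / passed;
}

// Spend on runs that did not pass verification ("Money, nothing delivered").
function undeliveredSpend(rows) {
  return rows.filter((r) => !r.verified).reduce((sum, r) => sum + r.cost, 0);
}

// Select every run recorded for one model.
const byModel = (name) => runs.filter((r) => r.model === name);

// Pooled over both gpt-4o runs: the failed run's cost is charged
// against the single verified result.
console.log("gpt-4o pooled:", verifiedOutcomeCost(byModel("gpt-4o")));
console.log("undelivered spend, all runs:", undeliveredSpend(runs));
```

Under this pooled reading, the unverified gpt-4o run raises the cost of the one verified gpt-4o result instead of vanishing from the total, which is the effect the README argues a token dashboard hides.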
package/bin/runcap.mjs CHANGED

@@ -11,7 +11,10 @@ import {
   preflightMission,
   recordFuel,
   renderReport,
+  renderOutcome,
+  latestOutcomeId,
   runMission,
+  runOutcome,
   setupProject,
   startDashboard,
   startGateway,
@@ -31,6 +34,15 @@ import {
   planToRun
 } from "../src/cloud.mjs";
 import { alertsCommand } from "../src/alerts.mjs";
+import {
+  loadPolicy,
+  validatePolicy,
+  evaluatePolicyVerdict,
+  policyMeta,
+  formatPolicyBlock
+} from "../src/policy.mjs";
+import { adjudicate, formatAdjudication, exitCodeFor } from "../src/adjudicate.mjs";
+import { readFileSync, appendFileSync } from "node:fs";
 const args = process.argv.slice(2);
 const command = args[0] ?? "welcome";
@@ -42,6 +54,17 @@ Usage:
   runcap run [--label name] [--cap|--no-cap] [--mock] -- <command...>
                                  (auto-enforces your cap; no manual gateway/base-URL setup)
   runcap plan [--fuel 24] [--quality high|balanced|cheap] [--apply-cap] -- <goal...>
+  runcap outcome run --task "..." --verify "<cmd>" [--label name] [--mock] -- <agent cmd...>
+                                 (runs the agent, then verifies; reports Verified Outcome Cost)
+  runcap outcome [show]          (print the latest outcome receipt)
+  runcap policy validate [path]  (check .runcap/mission.yaml is well-formed)
+  runcap mission run [--policy path] [--task override] [--mock] -- <agent cmd...>
+                                 (enforce the repo policy; exit 1 if the mission is BLOCKED)
+  runcap ci [--policy path] [--receipt path]
+                                 (grade a receipt against the policy; writes PR summary, exit 1 on BLOCKED)
+  runcap ci --mode adjudicate [--policy path] [--base sha --head sha]
+                                 (Tier 3: recompute the verdict in CI from the PR's base commit -
+                                  never trusts the agent's receipt; exit 1 on BLOCKED)
   runcap plans
   runcap cap <usd>               (set the hard cap the gateway enforces)
   runcap cap show                (show the current cap)
@@ -90,6 +113,17 @@ function takeFlag(input, name) {
   return true;
 }
+// Collect every occurrence of a repeatable option, e.g. --allow src --allow lib.
+function takeAll(input, name) {
+  const values = [];
+  let index;
+  while ((index = input.indexOf(name)) !== -1) {
+    values.push(input[index + 1]);
+    input.splice(index, 2);
+  }
+  return values.filter(Boolean);
+}
 // A real call can cost a fraction of a cent. toFixed(2)/(4) would print $0.00 or
 // $0.0000 and read as "nothing was recorded", so show a meaningful figure for
 // sub-cent spend instead of rounding a real charge down to zero.
@@ -100,6 +134,19 @@ function fmtUsd(n) {
   return `$${n.toPrecision(2)}`;
 }
+// In GitHub Actions, $GITHUB_STEP_SUMMARY is a file the runner renders as the
+// job summary on the workflow run page. Append the verdict there so a reviewer
+// sees red/green without opening logs. A no-op off CI.
+function writeCiSummary(markdown) {
+  const target = process.env.GITHUB_STEP_SUMMARY;
+  if (!target) return;
+  try {
+    appendFileSync(target, markdown + "\n");
+  } catch {
+    // best-effort annotation only - never fail the verdict on a write error.
+  }
+}
 try {
   if (command === "welcome") {
     console.log(await welcome());
@@ -184,6 +231,130 @@ try {
     const sync = await syncRun(planToRun(plan));
     if (sync === "synced") console.log("Cloud: synced to your Runcap Pro dashboard.");
     else if (sync && sync.startsWith("sync_failed")) console.log(`Cloud: ${sync}`);
+  } else if (command === "outcome") {
+    const sub = args[1] ?? "show";
+    if (sub === "run") {
+      const oArgs = args.slice(2);
+      const task = takeOption(oArgs, "--task");
+      const verify = takeOption(oArgs, "--verify");
+      const label = takeOption(oArgs, "--label");
+      const mock = takeFlag(oArgs, "--mock");
+      const guard = takeFlag(oArgs, "--guard");
+      const protect = takeAll(oArgs, "--protect");
+      const allow = takeAll(oArgs, "--allow");
+      const separator = oArgs.indexOf("--");
+      const childArgs = separator === -1 ? [] : oArgs.slice(separator + 1);
+      if (!task) throw new Error("Missing --task \"description\".");
+      if (!verify) throw new Error("Missing --verify \"<command>\" (e.g. --verify \"npm test && npm run build\").");
+      if (childArgs.length === 0) throw new Error("Missing agent command after `--`.");
+      const result = await runOutcome({ task, verify, command: childArgs, label, mock, guard, protect, allow });
+      console.log("\n" + result.summary);
+      console.log(`Receipt: .runcap/outcomes/${result.id}/receipt.json`);
+    } else {
+      const id = sub === "show" ? await latestOutcomeId() : sub;
+      if (!id) throw new Error("No outcome receipt found. Run `runcap outcome run ...` first.");
+      console.log(await renderOutcome(id));
+    }
+  } else if (command === "policy") {
+    const sub = args[1] ?? "validate";
+    if (sub === "validate") {
+      const explicit = args[2] && !args[2].startsWith("--") ? args[2] : undefined;
+      const loaded = loadPolicy(process.cwd(), explicit);
+      if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass a path).");
+      const { ok, errors, warnings } = validatePolicy(loaded.policy);
+      console.log(`Policy: ${loaded.source}`);
+      console.log(`Hash:   ${loaded.hash}`);
+      for (const w of warnings) console.log(`  warning: ${w}`);
+      if (ok) {
+        console.log("Policy is valid.");
+      } else {
+        for (const e of errors) console.log(`  error: ${e}`);
+        console.log("Policy is INVALID.");
+        process.exitCode = 1;
+      }
+    } else {
+      throw new Error(`Unknown policy subcommand: ${sub}. Try \`runcap policy validate [path]\`.`);
+    }
+  } else if (command === "mission") {
+    const sub = args[1] ?? "run";
+    if (sub !== "run") throw new Error(`Unknown mission subcommand: ${sub}. Try \`runcap mission run -- <agent cmd>\`.`);
+    const mArgs = args.slice(2);
+    const policyPath = takeOption(mArgs, "--policy");
+    const taskOverride = takeOption(mArgs, "--task");
+    const mock = takeFlag(mArgs, "--mock");
+    const separator = mArgs.indexOf("--");
+    const childArgs = separator === -1 ? [] : mArgs.slice(separator + 1);
+    if (childArgs.length === 0) throw new Error("Missing agent command after `--`.");
+    const loaded = loadPolicy(process.cwd(), policyPath);
+    if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass --policy <path>).");
+    const { ok, errors } = validatePolicy(loaded.policy);
+    if (!ok) {
+      for (const e of errors) console.error(`  policy error: ${e}`);
+      throw new Error("Mission policy is invalid - fix it before running the mission.");
+    }
+    const p = loaded.policy;
+    const mission = p.mission ?? {};
+    const verification = p.verification ?? {};
+    const budget = p.budget ?? {};
+    const task = taskOverride || [mission.name, mission.task_class].filter(Boolean).join(" - ") || mission.name;
+    const result = await runOutcome({
+      task,
+      verify: verification.command,
+      command: childArgs,
+      label: mission.name,
+      mock,
+      guard: true,
+      protect: Array.isArray(verification.protect) ? verification.protect : [],
+      allow: Array.isArray(verification.allow) ? verification.allow : [],
+      capUsd: budget.mission_hard_limit_usd ?? null,
+      policy: loaded
+    });
+    console.log("\n" + result.summary);
+    console.log(`Receipt: .runcap/outcomes/${result.id}/receipt.json`);
+    if (result.receipt.policy?.verdict === "BLOCKED") process.exitCode = 1;
+  } else if (command === "ci") {
+    const ciArgs = args.slice(1);
+    const mode = takeOption(ciArgs, "--mode");
+    const policyPath = takeOption(ciArgs, "--policy");
+    const receiptPath = takeOption(ciArgs, "--receipt");
+    if (mode === "adjudicate") {
+      // Tier 3: recompute the verdict from the BASE commit of the PR in a clean
+      // checkout. Trusts only the base/head SHAs from the pull_request event (or
+      // explicit --base/--head for local runs); never the agent's receipt.
+      const baseFlag = takeOption(ciArgs, "--base");
+      const headFlag = takeOption(ciArgs, "--head");
+      const verdict = await adjudicate({ cwd: process.cwd(), baseFlag, headFlag, policyPath });
+      const lines = formatAdjudication(verdict);
+      console.log(lines.join("\n"));
+      writeCiSummary(["## Runcap CI adjudication: " + verdict.verdict, "", "```", ...lines, "```"].join("\n"));
+      process.exitCode = exitCodeFor(verdict.verdict);
+    } else {
+      const loaded = loadPolicy(process.cwd(), policyPath);
+      if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass --policy <path>).");
+      const { ok, errors } = validatePolicy(loaded.policy);
+      if (!ok) {
+        for (const e of errors) console.error(`  policy error: ${e}`);
+        writeCiSummary(["## Runcap mission: policy INVALID", "", ...errors.map((e) => `- ${e}`)].join("\n"));
+        throw new Error("Mission policy is invalid.");
+      }
+      let receipt;
+      if (receiptPath) {
+        receipt = JSON.parse(readFileSync(receiptPath, "utf8"));
+      } else {
+        const id = await latestOutcomeId();
+        if (!id) throw new Error("No outcome receipt found. Run `runcap mission run ...` first, or pass --receipt <path>.");
+        receipt = JSON.parse(readFileSync(`.runcap/outcomes/${id}/receipt.json`, "utf8"));
+      }
+      const verdict = evaluatePolicyVerdict(receipt, loaded.policy);
+      const block = formatPolicyBlock({ ...policyMeta(loaded), ...verdict });
+      console.log(block.join("\n"));
+      writeCiSummary(["## Runcap mission verdict: " + verdict.verdict, "", "```", ...block, "```"].join("\n"));
+      if (verdict.verdict === "BLOCKED") process.exitCode = 1;
+    }
   } else if (command === "login") {
     console.log(await loginCommand(args[1]));
   } else if (command === "logout") {
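
Note on the argument handling in the hunk above: `outcome run` and `mission run` both pull their own options out of the argument list and then treat everything after a bare `--` as the agent command. The bodies of `takeOption`, `takeFlag` and `takeAll` are not part of this hunk, so the following is only a minimal sketch of how such helpers could behave, assuming they remove what they consume from the array. `splitAtSeparator` is a hypothetical name introduced here; it splits before any option is read, so a `--task` belonging to the agent command cannot be picked up by the CLI. Whether the real helpers stop at the separator is not visible in this diff.

```javascript
// Hypothetical re-implementations for illustration; the real helpers are not shown in this diff.

// Split argv at the first bare "--": left side belongs to the CLI, right side is the agent command.
function splitAtSeparator(argv) {
  const i = argv.indexOf("--");
  return i === -1
    ? { own: [...argv], child: [] }
    : { own: argv.slice(0, i), child: argv.slice(i + 1) };
}

// Remove `name` and the value after it; return the value, or undefined if absent.
function takeOption(args, name) {
  const i = args.indexOf(name);
  if (i === -1 || i + 1 >= args.length) return undefined;
  const [, value] = args.splice(i, 2);
  return value;
}

// Remove `name` if present and report whether it was there.
function takeFlag(args, name) {
  const i = args.indexOf(name);
  if (i === -1) return false;
  args.splice(i, 1);
  return true;
}

// Collect every value of a repeatable option such as --protect.
function takeAll(args, name) {
  const values = [];
  let value;
  while ((value = takeOption(args, name)) !== undefined) values.push(value);
  return values;
}

const argv = [
  "--task", "fix sum", "--verify", "node verify.mjs",
  "--protect", "verify.mjs", "--protect", "package.json", "--guard",
  "--", "node", "agent.mjs", "--task", "inner"
];
const { own, child } = splitAtSeparator(argv);
const parsed = {
  task: takeOption(own, "--task"),
  verify: takeOption(own, "--verify"),
  guard: takeFlag(own, "--guard"),
  protect: takeAll(own, "--protect"),
  child
};
console.log(parsed);
```

In this sketch the second `--task` stays inside `child` and is handed to the agent untouched.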

package/examples/outcome-demo/agent-fixes.mjs ADDED

@@ -0,0 +1,24 @@
+// A stand-in coding agent that (1) spends money via the gateway and (2) actually
+// fixes the bug. It points at whatever base URL Runcap injected (the cap
+// gateway), so the spend is recorded and priced exactly like a real agent's.
+import { writeFile } from "node:fs/promises";
+import path from "node:path";
+const base = process.env.OPENAI_BASE_URL || "https://api.openai.com/v1";
+const model = process.env.OUTCOME_DEMO_MODEL || "gpt-4o";
+async function call(prompt) {
+  const res = await fetch(`${base}/chat/completions`, {
+    method: "POST",
+    headers: { "content-type": "application/json", authorization: "Bearer demo" },
+    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] })
+  });
+  return res.text();
+}
+await call("Read broken.mjs. The sum() function subtracts instead of adds. Plan the one-line fix.");
+await call("Apply the fix: sum should return a + b.");
+const file = path.join(process.cwd(), "examples/outcome-demo/broken.mjs");
+await writeFile(file, "export function sum(a, b) {\n  return a + b;\n}\n");
+console.log("agent-fixes: rewrote sum() to add");
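
Note on the base URL used by the demo agent above: it reads `OPENAI_BASE_URL` and falls back to the public API when the variable is unset. A minimal sketch of how a wrapper could inject a gateway address into the child's environment is given below. `GATEWAY_URL`, `childEnv` and `resolveBase` are hypothetical names, and the address is a placeholder; the value Runcap actually injects is not shown in this diff.

```javascript
// Hypothetical gateway address; the value Runcap really injects is not shown in this diff.
const GATEWAY_URL = "http://127.0.0.1:8787/v1";

// Build the environment a wrapper could hand to the child agent process.
// The parent environment is copied, never mutated.
function childEnv(parentEnv, gatewayUrl) {
  return { ...parentEnv, OPENAI_BASE_URL: gatewayUrl };
}

// Same fallback rule as the demo agents: injected URL first, public API otherwise.
function resolveBase(env) {
  return env.OPENAI_BASE_URL || "https://api.openai.com/v1";
}

const wrapped = childEnv({ PATH: "/usr/bin" }, GATEWAY_URL);
console.log(resolveBase(wrapped)); // the injected gateway
console.log(resolveBase({}));      // the public API fallback
```

An object like `wrapped` could be passed as the `env` option of `spawn` from `node:child_process`, so that only the child sees the override.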

package/examples/outcome-demo/agent-spins.mjs ADDED

@@ -0,0 +1,20 @@
+// A stand-in coding agent that spends money via the gateway but never fixes the
+// bug - it circles, re-reading and re-planning, the way a stuck agent burns
+// budget while reporting confident progress. broken.mjs is left untouched.
+const base = process.env.OPENAI_BASE_URL || "https://api.openai.com/v1";
+const model = process.env.OUTCOME_DEMO_MODEL || "gpt-4o";
+async function call(prompt) {
+  const res = await fetch(`${base}/chat/completions`, {
+    method: "POST",
+    headers: { "content-type": "application/json", authorization: "Bearer demo" },
+    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] })
+  });
+  return res.text();
+}
+await call("Read broken.mjs and explain the bug in sum().");
+await call("Re-read broken.mjs once more to be sure, then restate the plan.");
+await call("Describe at length how you would fix it, but do not write the file yet.");
+console.log("agent-spins: lots of talk, no fix written");
+process.exit(0);

package/examples/outcome-demo/broken.mjs ADDED

@@ -0,0 +1,5 @@
+// Intentionally wrong: sum() subtracts. The task is to make verify.mjs pass.
+// The agent under test rewrites this file (or fails to).
+export function sum(a, b) {
+  return a - b;
+}

package/examples/outcome-demo/verify.mjs ADDED

@@ -0,0 +1,7 @@
+// The verification oracle. Exit 0 means the task is actually delivered.
+import { sum } from "./broken.mjs";
+import assert from "node:assert";
+assert.strictEqual(sum(2, 3), 5, `sum(2,3) should be 5, got ${sum(2, 3)}`);
+assert.strictEqual(sum(10, 5), 15, `sum(10,5) should be 15, got ${sum(10, 5)}`);
+console.log("verify: PASSED (sum is correct)");
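
Note on the demo files above: `agent-spins.mjs` exits 0 without touching `broken.mjs`, so the agent's own exit status says nothing about delivery; only the verifier's exit code does. The sketch below restates that idea under stated assumptions. `verifyExitCode` mirrors the two assertions in `verify.mjs` but returns a code instead of throwing, and `classify` with its three labels is hypothetical, not Runcap's receipt format.

```javascript
// Stand-ins for broken.mjs before and after the fix written by agent-fixes.mjs.
const brokenSum = (a, b) => a - b;
const fixedSum = (a, b) => a + b;

// Mirrors the two checks in verify.mjs, returning an exit code instead of throwing.
function verifyExitCode(sum) {
  const cases = [[2, 3, 5], [10, 5, 15]];
  return cases.every(([a, b, want]) => sum(a, b) === want) ? 0 : 1;
}

// Hypothetical labels: delivery is decided by the verifier,
// never by the agent's own exit status.
function classify({ agentExit, verifyExit }) {
  if (verifyExit === 0) return "delivered";
  return agentExit === 0 ? "claimed-but-unverified" : "failed";
}

// agent-spins: exits 0, file unchanged.
console.log(classify({ agentExit: 0, verifyExit: verifyExitCode(brokenSum) }));
// agent-fixes: exits 0, file rewritten.
console.log(classify({ agentExit: 0, verifyExit: verifyExitCode(fixedSum) }));
```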