runcap 0.3.1 → 0.6.0

package/README.md CHANGED
@@ -2,30 +2,75 @@
2
2
 
3
3
  [![CI](https://github.com/kirder24-code/ai-agent-manager/actions/workflows/ci.yml/badge.svg)](https://github.com/kirder24-code/ai-agent-manager/actions/workflows/ci.yml)
4
4
 
5
- ![Runcap terminal demo: estimate, cap, compress, stop](docs/assets/demo.svg)
5
+ ![Runcap terminal demo: estimate, cap, verify integrity, mission PASS - then a tampered run graded BLOCKED on the PR](docs/assets/demo.svg)
6
6
 
7
- **Your AI coding agent re-reads the same files over and over and quietly burns your money. Runcap estimates the bill before you build, hard-caps the spend so it physically stops at your ceiling, and losslessly compresses every call. Free, MIT, 100% local. Your code and tokens never touch a server.**
7
+ **An AI coding agent can pass CI by editing the test that proves its own success. Runcap caps the spend before the run and issues evidence about whether that success check can be trusted. Free, MIT, local-first. Local runs keep Runcap's control plane on your machine; optional CI adjudication runs in your GitHub Actions environment.**
8
8
 
9
- On a real OpenAI call, one edited-file re-read dropped from **1,186 to 737 prompt tokens (37.9% saved)** with the model still answering correctly about the changed line. No other proxy does this:
9
+ > **An agent passing CI is not enough.**
10
+ > Runcap verifies whether the evidence of success was altered during the mission.
11
+
12
+ | Status | Meaning |
13
+ |---|---|
14
+ | `VERIFIED_STRONG` | Result passed an unchanged verifier and a clean-worktree replay. |
15
+ | `VERIFIED_WEAK` | Result passed, but some integrity evidence is missing. |
16
+ | `UNVERIFIED` | Verification did not pass. |
17
+ | `VERIFIER_COMPROMISED` | The agent changed protected verification evidence. |
18
+
19
+ ```text
20
+ Estimate the run → Cap the spend → Verify the outcome
21
+ ```
22
+
23
+ Most cost and observability tools measure **tokens** used. But you don't buy tokens - you buy a result that passes a check. So Runcap measures the number that actually decides whether the spend was worth it:
24
+
25
+ > **Verified Outcome Cost = total run cost / tasks that passed verification.** An agent that talks but never fixes the bug can cost *more* than one that does - and a token dashboard calls it "cheaper."
10
26
 
11
27
  | | Without Runcap | With Runcap |
12
28
  |---|---|---|
13
- | Re-read of an edited file | 1,186 prompt tokens | **737 prompt tokens** |
14
29
  | You find out the cost | when the invoice arrives | **before you press go, capped at your ceiling** |
15
30
  | When the agent gets stuck | it keeps spending | **run stops, you get the exact rescue prompt** |
31
+ | What you paid for | tokens (delivered or not) | **dollars per *verified* result** |
32
+
33
+ In a 6-run test on the same task, the gpt-4o run that **delivered nothing** cost *more* than the gpt-4o run that delivered, and the cheapest verified result (deepseek-chat) cost ~43x less than the gpt-5.4 one - same passing test. ([full table below](#real-results-6-runs-same-task-reproducible-offline))
34
+
35
+ > Most tools here are a rear-view mirror - they show you the bill *after* you paid it. Runcap estimates the bill *before* you start, caps it during the run, and issues evidence about whether the declared verification can be trusted. It is a circuit breaker with a receipt, not a dashboard.
36
+
37
+ > If Runcap caps a run for you or compresses a call, please **star the repo** - it is the one signal that tells me to keep building it in the open.
38
+
39
+ ## Make a change earn merge eligibility
16
40
 
17
- > Every other tool here is a rear-view mirror - it shows you the bill *after* you paid it. Runcap estimates the bill *before* you start and caps it. It is a circuit breaker, not a dashboard.
41
+ > AI can propose a change.
42
+ > Runcap makes it earn merge eligibility.
43
+
44
+ `runcap ci --mode adjudicate` is a required PR check that does not trust the agent or its receipt. It recomputes the merge decision in a clean CI job from the pull request's **base commit**:
45
+
46
+ ```text
47
+ AI-generated PR
48
+ → Runcap action pinned to an immutable release commit
49
+ → policy / verifier / dependencies read from the PR base commit
50
+ → clean CI replay
51
+ → PASS / BLOCKED / HUMAN_APPROVAL_REQUIRED
52
+ ```
53
+
54
+ - **`PASS`** - the verifier failed on the base commit, the replay passed, and the change consists only of allowed text edits inside scope.
55
+ - **`BLOCKED`** - a scope violation, an unsafe diff type (delete / rename / binary / symlink / submodule / mode change), an unresolved base/head identity, or a failed replay.
56
+ - **`HUMAN_APPROVAL_REQUIRED`** - the change touches the policy, a workflow, a verifier file, a dependency manifest/lockfile, or a protected path. Runcap does not auto-approve changes to its own rules or evidence; a human CODEOWNER must approve.
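The three verdicts above can be read as a small decision function. This is a minimal sketch, not the code in `src/adjudicate.mjs`: the function name `sketchVerdict`, the input fields (`identityResolved`, `baseVerifierFailed`, `replayPassed`, `changes`), and the diff-type strings are all hypothetical. The README does not say which verdict wins when a change is both blocked and sensitive, or what happens when the base verifier was already green, so the ordering below is an assumption.

```javascript
// Hypothetical labels for the unsafe diff types the README lists.
const UNSAFE_TYPES = new Set(["delete", "rename", "binary", "symlink", "submodule", "mode-change"]);

// changes: [{ path, type, inScope, sensitive }]
// "sensitive" stands for: policy, workflow, verifier file, dependency
// manifest/lockfile, or protected path.
function sketchVerdict({ identityResolved, baseVerifierFailed, replayPassed, changes }) {
  const blocked =
    !identityResolved ||                               // unresolved base/head identity
    !replayPassed ||                                   // failed clean replay
    changes.some((c) => UNSAFE_TYPES.has(c.type)) ||   // unsafe diff type
    changes.some((c) => !c.inScope && !c.sensitive);   // plain scope violation
  if (blocked) return "BLOCKED";                       // assumption: BLOCKED is checked first
  if (changes.some((c) => c.sensitive)) return "HUMAN_APPROVAL_REQUIRED";
  // Assumption: a base commit that was already green cannot earn PASS.
  return baseVerifierFailed ? "PASS" : "BLOCKED";
}

const edit = { path: "src/checkout/sum.js", type: "modify", inScope: true, sensitive: false };
console.log(sketchVerdict({ identityResolved: true, baseVerifierFailed: true, replayPassed: true, changes: [edit] }));
```

Keeping the sensitive-path check separate from the scope check lets a CODEOWNER-reviewed policy edit surface as a request for approval rather than a hard failure, which matches the README's wording but is still a design guess.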
57
+
58
+ The verdict is a **CI-attested replay under a documented hardened GitHub profile**. It is *not* "unspoofable," *not* "fully independent," and it is *not* independent budget enforcement - its integrity rests on the [required GitHub setup](docs/trust-model.md#required-github-setup) being in place. The agent's receipt never decides the verdict: the required gate does not read it.
59
+
60
+ See the [trust model](docs/trust-model.md#ci-adjudication-v06) for exactly what v0.6 proves and what it does not, and [Install in a consumer repo](#install-in-a-consumer-repo) to wire it up.
18
61
 
19
62
  ## Why
20
63
 
21
64
  **Agents loop on the same error, rewrite plans, and re-read files they just edited - every loop is tokens you pay for.** Multi-agent coding runs burn roughly **15x more tokens** than a single chat ([Anthropic engineering](https://www.anthropic.com/engineering/built-multi-agent-research-system)). They hand you a confident summary while the task is not actually done, and you find out what it cost when the invoice - or the subscription limit - arrives.
22
65
 
23
- Observability tools (Langfuse, Helicone, LangSmith, AgentOps) measure the past. Gateways (LiteLLM, Portkey, OpenRouter) route the present. None of them stop the spend *before* it happens. Runcap does the one thing the rear-view mirror can't:
66
+ Observability tools (Langfuse, Helicone, LangSmith, AgentOps) measure the past, and some run evals on outputs. Gateways (LiteLLM, Portkey, OpenRouter) route the present. What they don't do is enforce a mission policy *during* the run - a hard spend cap, allowed scope, protected verification evidence - then issue evidence about whether the agent's own success check can be trusted. Runcap does the things the rear-view mirror can't:
24
67
 
25
68
  ```text
26
- estimate before build → cap during run → compress every call → rescue when stuck
69
+ estimate before build → cap during run → compress every call → rescue when stuck → verify the outcome
27
70
  ```
28
71
 
72
+ It also quietly trims waste on the way through: on a real OpenAI call, one edited-file re-read dropped from **1,186 to 737 prompt tokens (37.9% saved)** with the model still answering correctly about the changed line - lossless, and no other proxy does it ([details below](#token-compression-built-in-no-extra-deps)).
73
+
29
74
  ## The honest claim
30
75
 
31
76
  Runcap does **not** promise an exact cost oracle. Agent trajectories are stochastic - nobody, including the model labs, can predict the exact token count of a run. So Runcap gives you a **range plus a hard cap**:
@@ -48,6 +93,8 @@ If you have those, Runcap caps your spend in one command. If you are looking for
48
93
 
49
94
  No API key required.
50
95
 
96
+ ![Runcap terminal proof: estimate, cap spend, compress tokens, verify integrity, and block a compromised mission](docs/assets/media/demo.png)
97
+
51
98
  ```bash
52
99
  git clone https://github.com/kirder24-code/ai-agent-manager.git
53
100
  cd ai-agent-manager
@@ -106,6 +153,7 @@ Or run from source with `node ./bin/runcap.mjs <command>`.
106
153
  runcap plan --fuel 24 -- "build a small auth feature and verify it" # range + recommended cap, before you spend
107
154
  runcap preflight -- claude "build a full SaaS app" # is this prompt too broad?
108
155
  runcap run --label fix -- claude "fix one failing check. stop if blocked." # wrap any agent/command
156
+ runcap outcome run --task "..." --verify "pnpm test" -- claude "fix it" # cost of a VERIFIED result, not tokens
109
157
  runcap report # human-readable rescue report
110
158
  runcap export # evidence JSON with truth labels
111
159
  runcap dashboard # local cockpit at :8791
@@ -125,6 +173,10 @@ OPENAI_API_KEY=sk-... AIM_DAILY_BUDGET_USD=5 runcap gateway
125
173
  # Anthropic-native (Claude Code, /v1/messages)
126
174
  ANTHROPIC_API_KEY=sk-ant-... AIM_DAILY_BUDGET_USD=5 runcap gateway
127
175
  # then: ANTHROPIC_BASE_URL=http://127.0.0.1:8792/v1
176
+
177
+ # DeepSeek (OpenAI-compatible, much cheaper - same one command)
178
+ OPENAI_API_KEY=sk-... AIM_UPSTREAM_BASE_URL=https://api.deepseek.com AIM_DAILY_BUDGET_USD=5 runcap gateway
179
+ # then point your agent at: OPENAI_BASE_URL=http://127.0.0.1:8792/v1 (model: deepseek-chat)
128
180
  ```
129
181
 
130
182
  When spend crosses the ceiling, the next call returns `429 budget_guard` instead of money leaving your account. Try it with no key: `runcap gateway --mock`.
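A rough sketch of the budget-guard idea, for orientation only. The function name `guardRequest`, the response body shape, and the choice to trip at "reached or exceeded" are assumptions; only the `429` status and the `budget_guard` label come from the text above.

```javascript
// Decide whether the next upstream call may go out, given spend so far.
// Assumption: the guard trips once spend has reached the ceiling.
function guardRequest(spentUsd, dailyBudgetUsd) {
  if (spentUsd >= dailyBudgetUsd) {
    return {
      allow: false,
      status: 429,
      // Hypothetical body shape; the real gateway's payload may differ.
      body: { error: "budget_guard", spentUsd, dailyBudgetUsd }
    };
  }
  return { allow: true };
}

console.log(guardRequest(4.2, 5)); // still under the ceiling
console.log(guardRequest(5.01, 5)); // refused before any upstream call
```

The point of refusing before forwarding is that the refused call never reaches the provider, so it is never billed.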
@@ -135,9 +187,9 @@ Every request that passes through the gateway is compressed before it's forwarde
135
187
 
136
188
  1. **Per-field trim** - embedded JSON re-serialized compactly, long log/stack-trace dumps collapsed to head + tail, trailing whitespace squeezed.
137
189
  2. **Identical-block dedup** - when the exact same file dump or tool_result ships again in the same request, the repeat is replaced with a deterministic stub.
138
- 3. **Delta-encoding of near-duplicates** - the layer no other proxy has. When the agent reads a file, edits one line, and re-reads it, the block is *similar but not identical*, so plain dedup saves nothing. Runcap sends a readable line-diff against the version the model already saw, and the model reconstructs the current file from it. On a real OpenAI call, an edited-file re-read dropped from **1186 to 737 prompt tokens - 37.9% saved, with the model still answering correctly about the changed line.** Proof and reproduction steps: [docs/delta-encoding-evidence.md](https://github.com/kirder24-code/ai-agent-manager/blob/main/docs/delta-encoding-evidence.md).
190
+ 3. **Delta-encoding of near-duplicates.** When the agent reads a file, edits one line, and re-reads it, the block is *similar but not identical*, so plain dedup saves nothing. Runcap sends a readable line-diff against the version the model already saw, and the model reconstructs the current file from it. On a real OpenAI call, an edited-file re-read dropped from **1186 to 737 prompt tokens - 37.9% saved, with the model still answering correctly about the changed line.** Proof and reproduction steps: [docs/delta-encoding-evidence.md](https://github.com/kirder24-code/ai-agent-manager/blob/main/docs/delta-encoding-evidence.md).
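To make layer 2 concrete, here is a minimal sketch of identical-block dedup. It is not Runcap's implementation: the function name `dedupeBlocks` and the stub wording are hypothetical, and the real stub format is not shown in this README.

```javascript
// Replace exact repeats of a block with a short deterministic stub that
// points back at the first occurrence. The stub text is illustrative only.
function dedupeBlocks(blocks) {
  const firstSeen = new Map(); // block text -> index of first occurrence
  return blocks.map((text, index) => {
    if (!firstSeen.has(text)) {
      firstSeen.set(text, index);
      return text;
    }
    const stub = `[duplicate of block ${firstSeen.get(text)}: ${text.length} chars omitted]`;
    // Only substitute when the stub is actually shorter than the repeat.
    return stub.length < text.length ? stub : text;
  });
}

const fileDump = "export function sum(a, b) { return a + b; }\n".repeat(5);
console.log(dedupeBlocks([fileDump, "short note", fileDump]));
```

Because the stub depends only on the first-occurrence index and the block length, the same request always compresses to the same output. Near-duplicates fall through untouched here, which is the gap layer 3 addresses.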
139
191
 
140
- It's pure Node with **zero ML or native dependencies**, so it installs everywhere without the build pain heavier compressors have.
192
+ It's pure Node with **zero native or ML dependencies** (the only runtime dependency is `js-yaml`, pure JS), so it installs everywhere without the build pain heavier compressors have.
141
193
 
142
194
  The dashboard shows the result as one number: **"You saved $X · N tokens compressed · would have spent $Y."** Disable it with `AIM_COMPRESS=off` if you ever want raw passthrough.
143
195
 
@@ -147,9 +199,172 @@ The hard case in stuck-detection is the agent that keeps producing output but is
147
199
 
148
200
  This is a **calculated** signal, not a proven dollar-saving: it tells you *"the agent has sent 3 near-identical prompts in a row with no progress"* so you can step in before the loop burns more budget. Tune or disable it with `AIM_LOOP_DETECT=off`. (Today's [`detectStuck`](src/mission-control.mjs) post-run score is outcome-based: exit code, parsed errors, and zero-diff. The loop signal adds the missing in-flight behavioral signal on top of it.)
149
201
 
202
+ ## Verified Outcome Cost (`runcap outcome`)
203
+
204
+ Tokens are the wrong unit. You don't buy tokens, you buy a **result that passes a check**. So Runcap measures the only number that tracks what you actually paid for:
205
+
206
+ > **Verified Outcome Cost = total run cost / tasks that passed verification.**
207
+
208
+ Wrap your agent and hand it a verification command. Its exit code is the oracle:
209
+
210
+ ```bash
211
+ runcap outcome run \
212
+ --task "Fix the failing test" \
213
+ --verify "pnpm test && pnpm build" \
214
+ -- claude "fix one failing check, then stop"
215
+
216
+ runcap outcome show # re-print the latest receipt
217
+ ```
218
+
219
+ The run produces an **outcome receipt** where every field carries a truth label:
220
+
221
+ ```text
222
+ Outcome: VERIFIED
223
+ Actual cost: $0.000677 (2 priced LLM calls, calculated_from_provider_usage_and_sourced_price_table)
224
+ Verified Outcome Cost: $0.000677 (money that bought a verified result)
225
+ ```
226
+
227
+ If verification fails, the receipt is honest about it instead of pretending the spend delivered something:
228
+
229
+ ```text
230
+ Outcome: UNVERIFIED
231
+ Verified Outcome Cost: N/A (verification did not pass)
232
+ Money spent without verified delivery: $0.001012
233
+ ```
234
+
235
+ That second case is the whole point: an agent that talks but never fixes the bug can cost **more** than one that does, while a token dashboard calls it "cheaper." Try both, offline, no API key:
236
+
237
+ ```bash
238
+ runcap outcome run --task "Fix sum() so it adds" --verify "node examples/outcome-demo/verify.mjs" --mock -- node examples/outcome-demo/agent-fixes.mjs # VERIFIED
239
+ runcap outcome run --task "Fix sum() so it adds" --verify "node examples/outcome-demo/verify.mjs" --mock -- node examples/outcome-demo/agent-spins.mjs # UNVERIFIED
240
+ ```
241
+
242
+ Receipts are written to `.runcap/outcomes/<id>/receipt.json`. Run the same task across several agents and you get the **Agent Economics Index** - a comparison board priced by verified delivery instead of tokens.
243
+
244
+ ### Real results (6 runs, same task, reproducible offline)
245
+
246
+ Same task - *fix a broken `sum()` so the test passes* - across different models and two agent behaviors. Every number below is measured from the gateway ledger, not estimated:
247
+
248
+ | # | Model | Agent | LLM calls | Actual cost | Outcome | Verified Outcome Cost | Money, nothing delivered |
249
+ |---|---|---|---|---|---|---|---|
250
+ | 1 | gpt-4o | writes fix | 2 | $0.000677 | VERIFIED | **$0.000677** | $0 |
251
+ | 2 | gpt-4o | spins, no fix | 3 | $0.001012 | UNVERIFIED | **N/A** | **$0.001012** |
252
+ | 3 | gpt-5.4 | writes fix | 2 | $0.000957 | VERIFIED | **$0.000957** | $0 |
253
+ | 4 | deepseek-chat | writes fix | 2 | $0.000022 | VERIFIED | **$0.000022** | $0 |
254
+ | 5 | deepseek-chat | spins, no fix | 3 | $0.000033 | UNVERIFIED | **N/A** | **$0.000033** |
255
+ | 6 | claude-sonnet-4 | writes fix | 2 | $0.000981 | VERIFIED | **$0.000981** | $0 |
256
+
257
+ Two facts a token dashboard can't show you: on gpt-4o the run that **delivered nothing** (row 2) cost *more* than the run that delivered (row 1); and the cheapest verified result (deepseek-chat, $0.000022) bought the same passing test that cost ~43x more on gpt-5.4 ($0.000957). Reproduce any row with `OUTCOME_DEMO_MODEL=<model>` in front of the command above. (One run per row in v0.1 - the point is the unit, not a vendor ranking; a ranking needs N>=5 runs/agent and a pass-rate column.)
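The arithmetic behind the table can be checked by hand. The sketch below applies the Verified Outcome Cost formula to the six rows as one batch. The field names (`costUsd`, `verified`) and helper names are illustrative and are not the receipt schema; the dollar figures are copied from the table.

```javascript
// Rows from the results table above; field names are hypothetical.
const runs = [
  { model: "gpt-4o", costUsd: 0.000677, verified: true },
  { model: "gpt-4o", costUsd: 0.001012, verified: false },
  { model: "gpt-5.4", costUsd: 0.000957, verified: true },
  { model: "deepseek-chat", costUsd: 0.000022, verified: true },
  { model: "deepseek-chat", costUsd: 0.000033, verified: false },
  { model: "claude-sonnet-4", costUsd: 0.000981, verified: true }
];

// Verified Outcome Cost = total run cost / tasks that passed verification.
// Returns null (the receipt's "N/A") when nothing was verified.
function verifiedOutcomeCost(rows) {
  const total = rows.reduce((sum, r) => sum + r.costUsd, 0);
  const verifiedCount = rows.filter((r) => r.verified).length;
  return verifiedCount === 0 ? null : total / verifiedCount;
}

// Money spent on runs that delivered nothing.
function undeliveredSpend(rows) {
  return rows.filter((r) => !r.verified).reduce((sum, r) => sum + r.costUsd, 0);
}

console.log({
  batchVerifiedOutcomeCost: verifiedOutcomeCost(runs),
  undeliveredSpend: undeliveredSpend(runs),
  gpt54OverDeepseek: 0.000957 / 0.000022 // the "~43x" ratio quoted above
});
```

Note that the batch figure folds the two failed runs into the cost of the four successes, so it is higher than the average cost of the verified rows alone.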
258
+
259
+ ## Verification Integrity (`runcap outcome --guard`)
260
+
261
+ A green test only means something if the agent passed it *fairly*. An agent under pressure can turn a check green without doing the work: delete the failing test, rewrite the assertion, repoint the `npm test` script at `true`, disable TypeScript strict mode, mock the real API, or hardcode the expected answer. Exit code 0 - and a plain Verified Outcome Cost calls it delivered. So does every token dashboard.
262
+
263
+ `--guard` verifies the verifier. Before the agent runs, Runcap freezes a **Task Contract**: the baseline git commit, a SHA-256 of every file the verify command names, a snapshot of `package.json` scripts, and a check that the task *actually fails today* (a pass on an already-green tree proves nothing). After the run it re-checks all of it, and re-runs the verify command from the baseline commit in a throwaway git worktree with only the agent's changed files copied in - so a green that depended on uncommitted local junk dies in the clean room.
264
+
265
+ The result is a trust grade, not a binary:
266
+
267
+ | Status | Meaning |
268
+ |---|---|
269
+ | `VERIFIED_STRONG` | Passed, verifier untouched, the task really failed before, and the pass survives a clean checkout. |
270
+ | `VERIFIED_WEAK` | Passed and verifier untouched, but a strong condition was missed (e.g. baseline failure not reproduced). |
271
+ | `UNVERIFIED` | Verification did not pass. |
272
+ | `VERIFIER_COMPROMISED` | Passed, **but the verifier itself was modified during the run.** The green light cannot be trusted. |
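Read as code, the table is a short cascade of checks. This is a sketch of the table only, not the guard's real implementation: the function name `gradeOutcome` and its four boolean inputs are hypothetical, and treating a failed verification as `UNVERIFIED` even when the verifier was also modified is an assumption drawn from the "Passed, but..." wording.

```javascript
// Map the four conditions from the table onto a trust grade.
function gradeOutcome({ passed, verifierUnchanged, baselineFailed, cleanReplayPassed }) {
  if (!passed) return "UNVERIFIED";                      // verification did not pass
  if (!verifierUnchanged) return "VERIFIER_COMPROMISED"; // green, but the check was edited
  if (baselineFailed && cleanReplayPassed) return "VERIFIED_STRONG";
  return "VERIFIED_WEAK";                                // passed, some strong condition missing
}

console.log(gradeOutcome({ passed: true, verifierUnchanged: true, baselineFailed: true, cleanReplayPassed: true }));
console.log(gradeOutcome({ passed: true, verifierUnchanged: false, baselineFailed: true, cleanReplayPassed: true }));
```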
273
+
274
+ ```bash
275
+ runcap outcome run --guard \
276
+ --task "Fix the failing test" \
277
+ --verify "pnpm test" \
278
+ --allow src/ \
279
+ -- claude "fix one failing check, then stop"
280
+ ```
281
+
282
+ `--protect <path>` marks extra paths the agent must not touch (tests, config, and `package.json` are protected by default); `--allow <path>` declares the only paths a legitimate fix should change, so out-of-scope edits drop the grade. The receipt gains a `verificationIntegrity` block listing every check, every truth label, and exactly which file was tampered with if the grade is `VERIFIER_COMPROMISED`.
283
+
284
+ > **One honesty note that rides on every receipt:** Verified Outcome Cost is the **LLM spend that bought the result** - it does *not* include subscriptions, CI minutes, sandbox compute, or human review time. For real agent economics you want **Expected Verified Outcome Cost = total spend across N attempts / strongly-verified outcomes**, which needs N>=5 runs. That's the v0.2 unit; v0.1 measures the one cost the gateway can observe honestly.
285
+
286
+ ## Mission Policy & CI enforcement (`runcap mission` / `policy` / `ci`)
287
+
288
+ Everything above lives in one developer's terminal. A platform, VP-Eng, or FinOps owner can't act on it: there's no way to declare the rules of a mission *once*, in the repo, and no way to fail a pull request when an agent breaches budget or tampers with the evidence of its own success.
289
+
290
+ A **mission policy** closes that gap. You declare the rules once in `.runcap/mission.yaml`, enforce them during the run, and grade the result into a **PASS / BLOCKED** verdict that a GitHub Action turns into a red/green check on the PR.
291
+
292
+ ```yaml
293
+ # .runcap/mission.yaml
294
+ version: v1
295
+ identity:
296
+ project: checkout
297
+ team: payments
298
+ mission:
299
+ name: Fix the failing checkout test
300
+ task_class: bugfix
301
+ budget:
302
+ mission_hard_limit_usd: 10 # required - per-mission hard cap (the gateway enforces it live)
303
+ max_llm_calls: 12 # optional - BLOCK if exceeded
304
+ max_runtime_minutes: 30 # optional - BLOCK if exceeded
305
+ verification:
306
+ command: "pnpm test && pnpm build" # required - the oracle (exit 0 = delivered)
307
+ guard: strict # strict (default) freezes + re-checks the verifier
308
+ protect: ["tests/**", "package.json"] # paths the agent must not touch
309
+ allow: ["src/checkout/**"] # the only paths a legit fix should change
310
+ ```
311
+
312
+ Validate it, then run the agent under it:
313
+
314
+ ```bash
315
+ runcap policy validate # is .runcap/mission.yaml well-formed?
316
+ runcap mission run -- claude "fix the failing checkout test, then stop"
317
+ ```
318
+
319
+ `mission run` enforces the per-mission hard cap through the gateway, always runs the verification guard, and writes an outcome receipt that now carries a **policy block** - the org attribution, the limits, the **SHA-256 of the exact policy text that graded the run**, and the verdict. It exits `1` on `BLOCKED`, so it fails CI on its own. The mission is **BLOCKED** when any of these is true:
320
+
321
+ - the verifier was tampered with (`VERIFIER_COMPROMISED`);
322
+ - verification did not pass (`UNVERIFIED`);
323
+ - a change landed outside the declared `allow` scope;
324
+ - spend exceeded `mission_hard_limit_usd`, or the gateway's budget guard tripped mid-run;
325
+ - `max_llm_calls` or `max_runtime_minutes` was exceeded.
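The five conditions above can be sketched as one function that collects reasons. The `budget` keys mirror the YAML example (`mission_hard_limit_usd`, `max_llm_calls`, `max_runtime_minutes`); everything on the `run` side (`integrity`, `outOfScopeFiles`, `costUsd`, `budgetGuardTripped`, `llmCalls`, `runtimeMinutes`) and the function name are hypothetical, and this is not the logic in `src/policy.mjs`.

```javascript
// Collect every BLOCKED reason rather than stopping at the first one,
// so a reviewer sees the full list.
function missionVerdict(budget, run) {
  const reasons = [];
  if (run.integrity === "VERIFIER_COMPROMISED") reasons.push("verifier was tampered with");
  if (run.integrity === "UNVERIFIED") reasons.push("verification did not pass");
  if (run.outOfScopeFiles.length > 0) reasons.push(`changes outside allow scope: ${run.outOfScopeFiles.join(", ")}`);
  if (run.costUsd > budget.mission_hard_limit_usd || run.budgetGuardTripped) reasons.push("spend limit breached");
  // Optional limits only apply when the policy declares them.
  if (budget.max_llm_calls !== undefined && run.llmCalls > budget.max_llm_calls) reasons.push("max_llm_calls exceeded");
  if (budget.max_runtime_minutes !== undefined && run.runtimeMinutes > budget.max_runtime_minutes) reasons.push("max_runtime_minutes exceeded");
  return { verdict: reasons.length === 0 ? "PASS" : "BLOCKED", reasons };
}

const budget = { mission_hard_limit_usd: 10, max_llm_calls: 12, max_runtime_minutes: 30 };
const run = { integrity: "VERIFIED_STRONG", outOfScopeFiles: [], costUsd: 0.0007, budgetGuardTripped: false, llmCalls: 2, runtimeMinutes: 1 };
console.log(missionVerdict(budget, run));
```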
326
+
327
+ ### The local grade vs. the CI adjudication
328
+
329
+ There are two ways the policy verdict reaches a PR, and they trust different things:
330
+
331
+ - **`runcap mission run`** (local / same-job) grades the run it just executed and re-checks it against the committed policy text. Useful, but the receipt it produces is *agent-side* evidence.
332
+ - **`runcap ci --mode adjudicate`** (the required PR check) trusts none of that. It recomputes the verdict in a clean CI job from the PR's base commit and **never reads the agent receipt**. This is the gate that decides merge eligibility.
333
+
334
+ ### Install in a consumer repo
335
+
336
+ Make the adjudication a required red/green PR check in your own repo:
337
+
338
+ 1. Add `.runcap/mission.yaml` (the policy - see the example above).
339
+ 2. Copy `examples/runcap-adjudicate.yml` into `.github/workflows/`.
340
+ 3. Replace the all-zero `RUNCAP_ACTION_SHA` placeholder with the full immutable commit SHA of the released version (resolve it with `gh api repos/kirder24-code/ai-agent-manager/git/refs/tags/vX.Y.Z --jq '.object.sha'`).
341
+ 4. Configure the hardened GitHub branch profile (protected branch, required check, up-to-date-before-merge, dismiss stale approvals, CODEOWNERS for workflow/policy/verifier/dependency/protected paths, no bypass for ordinary authors) - the full list is in the [trust model](docs/trust-model.md#required-github-setup).
342
+ 5. Make `Runcap adjudicate` a required status check.
343
+
344
+ > The template ships with an all-zero placeholder SHA and is **intentionally not runnable until you insert the release SHA**. This is deliberate: the judge must be an immutable release commit that lives outside the candidate PR, so a malicious PR cannot rewrite its own judge.
345
+
346
+ A reviewer sees one of two things:
347
+
348
+ ```text
349
+ Mission verdict: PASS
350
+ project checkout / team payments
351
+ Mission cost $0.0007 / $10.00
352
+ Policy hash: c857d10c…
353
+ ```
354
+
355
+ ```text
356
+ Mission verdict: BLOCKED
357
+ Blocked because:
358
+ - VERIFIER_COMPROMISED: the agent changed protected verification evidence (verifier_file_unchanged:app/verify.mjs).
359
+ ```
360
+
361
+ Because the verdict is recomputed from the committed policy and the receipt records the policy hash, a reviewer can confirm *which rules graded the run* - the verdict is only as trustworthy as the policy hash it carries.
362
+
150
363
  ## Pricing table
151
364
 
152
- Costs are calculated from a sourced multi-provider table - Anthropic (Opus / Sonnet / Haiku) and OpenAI (GPT-5 family + legacy GPT-4), with cache-read and batch discounts handled - labeled with source and verification date. When a model is unknown, Runcap says `unknown_price` rather than guessing.
365
+ Costs are calculated from a sourced multi-provider table - Anthropic (Opus / Sonnet / Haiku), OpenAI (GPT-5 family + legacy GPT-4), and DeepSeek (V4 Flash / V4 Pro) - with cache-read and batch discounts handled, labeled with source and verification date. When a model is unknown, Runcap says `unknown_price` rather than guessing.
366
+
367
+ DeepSeek matters because its API is OpenAI-compatible: point the gateway at `https://api.deepseek.com` with your DeepSeek key and Runcap prices, caps, and compresses it with zero extra setup - the same one command as OpenAI. At roughly $0.14 / $0.28 per million input/output tokens it is far cheaper than the US frontier models, so the people running the biggest agent loops on it are exactly the ones a hard cap protects.
153
368
 
154
369
  ## Trust model
155
370
 
@@ -163,20 +378,16 @@ Runcap is built not to fake certainty. Every important output carries a truth la
163
378
 
164
379
  If it cannot prove something, it says so.
165
380
 
166
- ## Pricing (the product, not the tokens)
381
+ ## Availability
167
382
 
168
- | Tier | Price | What you get |
169
- |---|---|---|
170
- | **OSS** (MIT, local) | $0 forever | All local runs, cost estimation, hard cap, run wrapping, stuck detection, rescue prompts, local dashboard. Never crippleware. |
171
- | **Founding Pro** (limited) | **$49 once** | Lifetime Pro at the founder price - pay once, keep Pro forever, before it moves to $19/mo. |
172
- | **Pro** | $19/mo | Cloud sync across machines, hosted dashboard, estimate-vs-actual trends, shareable reports, alerts on cap breach |
173
- | **Team** | $49/seat/mo | Shared budget pools, org-wide ceilings, per-project rollups, role-based caps |
383
+ Runcap v0.6 is open-source and free under MIT.
174
384
 
175
- The local core is free forever. Only persistence, collaboration, and aggregation are paid - the things that only matter once data leaves your laptop.
385
+ The local CLI and CI adjudication mode are available now.
386
+ Hosted sync, team budget pools, organization reporting, and paid plans are future ideas only. They are not available for purchase today.
176
387
 
177
388
  ## Current stage
178
389
 
179
- A working local tool, not a hosted SaaS. Ready for: wrapping real Codex / Claude / Cursor sessions, catching stuck agents, and proving rescue prompts save time. Not yet: a hosted cloud platform or a universal observability standard. It is not trying to replace Langfuse or LiteLLM - it does the thing they don't.
390
+ A working local tool plus an optional CI adjudication mode, not a hosted SaaS. Ready for: wrapping real Codex / Claude / Cursor sessions, catching stuck agents, proving rescue prompts save time, and gating AI-generated pull requests in GitHub Actions. Not yet: a hosted cloud platform or a universal observability standard. It is not trying to replace Langfuse or LiteLLM; it focuses on a different layer - pre-run cost caps and merge-eligibility evidence.
180
391
 
181
392
  ## Documentation
182
393
 
@@ -187,6 +398,10 @@ A working local tool, not a hosted SaaS. Ready for: wrapping real Codex / Claude
187
398
  - [Integrations](docs/integrations.md)
188
399
  - [Trust model](docs/trust-model.md)
189
400
 
401
+ ## Built by
402
+
403
+ Runcap is built and maintained by Kirill D., a solo AI and automation consultant based in Calgary, Canada. He helps solo SaaS founders and service businesses ship AI features that hold up in production - cost control, vibe-code audits, and reliable automation. More at [launchsoloai.com](https://launchsoloai.com).
404
+
190
405
  ---
191
406
 
192
- The thesis: **AI agents need managers.**
407
+ The thesis: **AI can propose a change. It should not certify its own success.**
package/bin/runcap.mjs CHANGED
@@ -11,7 +11,10 @@ import {
11
11
  preflightMission,
12
12
  recordFuel,
13
13
  renderReport,
14
+ renderOutcome,
15
+ latestOutcomeId,
14
16
  runMission,
17
+ runOutcome,
15
18
  setupProject,
16
19
  startDashboard,
17
20
  startGateway,
@@ -31,6 +34,15 @@ import {
31
34
  planToRun
32
35
  } from "../src/cloud.mjs";
33
36
  import { alertsCommand } from "../src/alerts.mjs";
37
+ import {
38
+ loadPolicy,
39
+ validatePolicy,
40
+ evaluatePolicyVerdict,
41
+ policyMeta,
42
+ formatPolicyBlock
43
+ } from "../src/policy.mjs";
44
+ import { adjudicate, formatAdjudication, exitCodeFor } from "../src/adjudicate.mjs";
45
+ import { readFileSync, appendFileSync } from "node:fs";
34
46
 
35
47
  const args = process.argv.slice(2);
36
48
  const command = args[0] ?? "welcome";
@@ -42,6 +54,17 @@ Usage:
42
54
  runcap run [--label name] [--cap|--no-cap] [--mock] -- <command...>
43
55
  (auto-enforces your cap; no manual gateway/base-URL setup)
44
56
  runcap plan [--fuel 24] [--quality high|balanced|cheap] [--apply-cap] -- <goal...>
57
+ runcap outcome run --task "..." --verify "<cmd>" [--guard] [--protect path] [--allow path] [--label name] [--mock] -- <agent cmd...>
58
+ (runs the agent, then verifies; reports Verified Outcome Cost)
59
+ runcap outcome [show] (print the latest outcome receipt)
60
+ runcap policy validate [path] (check .runcap/mission.yaml is well-formed)
61
+ runcap mission run [--policy path] [--task override] [--mock] -- <agent cmd...>
62
+ (enforce the repo policy; exit 1 if the mission is BLOCKED)
63
+ runcap ci [--policy path] [--receipt path]
64
+ (grade a receipt against the policy; writes PR summary, exit 1 on BLOCKED)
65
+ runcap ci --mode adjudicate [--policy path] [--base sha --head sha]
66
+ (Tier 3: recompute the verdict in CI from the PR's base commit -
67
+ never trusts the agent's receipt; exit 1 on BLOCKED)
45
68
  runcap plans
46
69
  runcap cap <usd> (set the hard cap the gateway enforces)
47
70
  runcap cap show (show the current cap)
@@ -90,6 +113,17 @@ function takeFlag(input, name) {
90
113
  return true;
91
114
  }
92
115
 
116
+ // Collect every occurrence of a repeatable option, e.g. --allow src --allow lib.
117
+ function takeAll(input, name) {
118
+ const values = [];
119
+ let index;
120
+ while ((index = input.indexOf(name)) !== -1) {
121
+ values.push(input[index + 1]);
122
+ input.splice(index, 2);
123
+ }
124
+ return values.filter(Boolean);
125
+ }
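One edge case in `takeAll` as added: it scans the whole argument array, including everything after the `--` separator, and it consumes the token after the flag even when that token is `--` itself. So `--allow -- claude ...` would swallow the separator, and an agent command carrying its own `--allow` would lose it. A possible hardening is sketched below. `takeAllBeforeSeparator` is a hypothetical name, not part of the package.

```javascript
// Like takeAll, but only looks at arguments before the first "--" and never
// treats the separator as a flag value.
function takeAllBeforeSeparator(input, name) {
  const values = [];
  for (;;) {
    const separator = input.indexOf("--");
    const limit = separator === -1 ? input.length : separator;
    const index = input.indexOf(name);
    if (index === -1 || index >= limit) break; // no more occurrences before "--"
    const hasValue = index + 1 < limit;        // is there a value before the separator?
    if (hasValue) values.push(input[index + 1]);
    input.splice(index, hasValue ? 2 : 1);     // drop the flag (and its value, if any)
  }
  return values;
}

const argv = ["--allow", "src", "--allow", "lib", "--", "agent", "--allow", "x"];
console.log(takeAllBeforeSeparator(argv, "--allow")); // [ 'src', 'lib' ]
console.log(argv); // [ '--', 'agent', '--allow', 'x' ]
```

Recomputing the separator position on every pass keeps the bound correct as earlier flags are spliced out.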
126
+
93
127
  // A real call can cost a fraction of a cent. toFixed(2)/(4) would print $0.00 or
94
128
  // $0.0000 and read as "nothing was recorded", so show a meaningful figure for
95
129
  // sub-cent spend instead of rounding a real charge down to zero.
@@ -100,6 +134,19 @@ function fmtUsd(n) {
100
134
  return `$${n.toPrecision(2)}`;
101
135
  }
102
136
 
137
+ // In GitHub Actions, $GITHUB_STEP_SUMMARY is a file the runner renders as the
138
+ // job's PR annotation. Append the verdict there so a reviewer sees red/green
139
+ // without opening logs. A no-op off CI.
140
+ function writeCiSummary(markdown) {
141
+ const target = process.env.GITHUB_STEP_SUMMARY;
142
+ if (!target) return;
143
+ try {
144
+ appendFileSync(target, markdown + "\n");
145
+ } catch {
146
+ // best-effort annotation only - never fail the verdict on a write error.
147
+ }
148
+ }
149
+
103
150
  try {
104
151
  if (command === "welcome") {
105
152
  console.log(await welcome());
@@ -184,6 +231,130 @@ try {
184
231
  const sync = await syncRun(planToRun(plan));
185
232
  if (sync === "synced") console.log("Cloud: synced to your Runcap Pro dashboard.");
186
233
  else if (sync && sync.startsWith("sync_failed")) console.log(`Cloud: ${sync}`);
234
+ } else if (command === "outcome") {
235
+ const sub = args[1] ?? "show";
236
+ if (sub === "run") {
237
+ const oArgs = args.slice(2);
238
+ const task = takeOption(oArgs, "--task");
239
+ const verify = takeOption(oArgs, "--verify");
240
+ const label = takeOption(oArgs, "--label");
241
+ const mock = takeFlag(oArgs, "--mock");
242
+ const guard = takeFlag(oArgs, "--guard");
243
+ const protect = takeAll(oArgs, "--protect");
244
+ const allow = takeAll(oArgs, "--allow");
245
+ const separator = oArgs.indexOf("--");
246
+ const childArgs = separator === -1 ? [] : oArgs.slice(separator + 1);
247
+ if (!task) throw new Error("Missing --task \"description\".");
248
+ if (!verify) throw new Error("Missing --verify \"<command>\" (e.g. --verify \"npm test && npm run build\").");
249
+ if (childArgs.length === 0) throw new Error("Missing agent command after `--`.");
250
+ const result = await runOutcome({ task, verify, command: childArgs, label, mock, guard, protect, allow });
251
+ console.log("\n" + result.summary);
252
+ console.log(`Receipt: .runcap/outcomes/${result.id}/receipt.json`);
253
+ } else {
254
+ const id = sub === "show" ? await latestOutcomeId() : sub;
255
+ if (!id) throw new Error("No outcome receipt found. Run `runcap outcome run ...` first.");
256
+ console.log(await renderOutcome(id));
257
+ }
258
+ } else if (command === "policy") {
259
+ const sub = args[1] ?? "validate";
260
+ if (sub === "validate") {
261
+ const explicit = args[2] && !args[2].startsWith("--") ? args[2] : undefined;
262
+ const loaded = loadPolicy(process.cwd(), explicit);
263
+ if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass a path).");
264
+ const { ok, errors, warnings } = validatePolicy(loaded.policy);
265
+ console.log(`Policy: ${loaded.source}`);
266
+ console.log(`Hash: ${loaded.hash}`);
267
+ for (const w of warnings) console.log(` warning: ${w}`);
268
+ if (ok) {
269
+ console.log("Policy is valid.");
270
+ } else {
271
+ for (const e of errors) console.log(` error: ${e}`);
272
+ console.log("Policy is INVALID.");
273
+ process.exitCode = 1;
274
+ }
275
+ } else {
276
+ throw new Error(`Unknown policy subcommand: ${sub}. Try \`runcap policy validate [path]\`.`);
277
+ }
278
+ } else if (command === "mission") {
279
+ const sub = args[1] ?? "run";
280
+ if (sub !== "run") throw new Error(`Unknown mission subcommand: ${sub}. Try \`runcap mission run -- <agent cmd>\`.`);
281
+ const mArgs = args.slice(2);
282
+ const policyPath = takeOption(mArgs, "--policy");
283
+ const taskOverride = takeOption(mArgs, "--task");
284
+ const mock = takeFlag(mArgs, "--mock");
285
+ const separator = mArgs.indexOf("--");
286
+ const childArgs = separator === -1 ? [] : mArgs.slice(separator + 1);
287
+ if (childArgs.length === 0) throw new Error("Missing agent command after `--`.");
288
+
289
+ const loaded = loadPolicy(process.cwd(), policyPath);
290
+ if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass --policy <path>).");
291
+ const { ok, errors } = validatePolicy(loaded.policy);
292
+ if (!ok) {
293
+ for (const e of errors) console.error(` policy error: ${e}`);
294
+ throw new Error("Mission policy is invalid - fix it before running the mission.");
295
+ }
296
+ const p = loaded.policy;
297
+ const mission = p.mission ?? {};
298
+ const verification = p.verification ?? {};
299
+ const budget = p.budget ?? {};
300
+ const task = taskOverride || [mission.name, mission.task_class].filter(Boolean).join(" - ") || mission.name;
301
+ const result = await runOutcome({
302
+ task,
303
+ verify: verification.command,
304
+ command: childArgs,
305
+ label: mission.name,
306
+ mock,
307
+ guard: true,
308
+ protect: Array.isArray(verification.protect) ? verification.protect : [],
309
+ allow: Array.isArray(verification.allow) ? verification.allow : [],
310
+ capUsd: budget.mission_hard_limit_usd ?? null,
311
+ policy: loaded
312
+ });
313
+ console.log("\n" + result.summary);
314
+ console.log(`Receipt: .runcap/outcomes/${result.id}/receipt.json`);
315
+ if (result.receipt.policy?.verdict === "BLOCKED") process.exitCode = 1;
316
+ } else if (command === "ci") {
317
+ const ciArgs = args.slice(1);
318
+ const mode = takeOption(ciArgs, "--mode");
319
+ const policyPath = takeOption(ciArgs, "--policy");
320
+ const receiptPath = takeOption(ciArgs, "--receipt");
321
+
322
+ if (mode === "adjudicate") {
323
+ // Tier 3: recompute the verdict from the BASE commit of the PR in a clean
324
+ // checkout. Trusts only the base/head SHAs from the pull_request event (or
325
+ // explicit --base/--head for local runs); never the agent's receipt.
326
+ const baseFlag = takeOption(ciArgs, "--base");
327
+ const headFlag = takeOption(ciArgs, "--head");
328
+ const verdict = await adjudicate({ cwd: process.cwd(), baseFlag, headFlag, policyPath });
329
+ const lines = formatAdjudication(verdict);
330
+ console.log(lines.join("\n"));
331
+ writeCiSummary(["## Runcap CI adjudication: " + verdict.verdict, "", "```", ...lines, "```"].join("\n"));
332
+ process.exitCode = exitCodeFor(verdict.verdict);
333
+ } else {
334
+ const loaded = loadPolicy(process.cwd(), policyPath);
335
+ if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass --policy <path>).");
336
+ const { ok, errors } = validatePolicy(loaded.policy);
337
+ if (!ok) {
338
+ for (const e of errors) console.error(` policy error: ${e}`);
339
+ writeCiSummary(["## Runcap mission: policy INVALID", "", ...errors.map((e) => `- ${e}`)].join("\n"));
340
+ throw new Error("Mission policy is invalid.");
341
+ }
342
+
343
+ let receipt;
344
+ if (receiptPath) {
345
+ receipt = JSON.parse(readFileSync(receiptPath, "utf8"));
346
+ } else {
347
+ const id = await latestOutcomeId();
348
+ if (!id) throw new Error("No outcome receipt found. Run `runcap mission run ...` first, or pass --receipt <path>.");
349
+ receipt = JSON.parse(readFileSync(`.runcap/outcomes/${id}/receipt.json`, "utf8"));
350
+ }
351
+
352
+ const verdict = evaluatePolicyVerdict(receipt, loaded.policy);
353
+ const block = formatPolicyBlock({ ...policyMeta(loaded), ...verdict });
354
+ console.log(block.join("\n"));
355
+ writeCiSummary(["## Runcap mission verdict: " + verdict.verdict, "", "```", ...block, "```"].join("\n"));
356
+ if (verdict.verdict === "BLOCKED") process.exitCode = 1;
357
+ }
187
358
  } else if (command === "login") {
188
359
  console.log(await loginCommand(args[1]));
189
360
  } else if (command === "logout") {
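The `outcome run` and `mission run` branches above pull their own options out of the argument list before looking for the `--` separator, so everything after `--` is handed to the agent command untouched. The definitions of `takeOption`, `takeFlag`, and `takeAll` are not part of this diff. The following is a minimal sketch of how such helpers could behave, assuming they mutate the array in place and ignore anything after `--`. The helper `optionEnd` and the sample argument list are hypothetical.

```javascript
// Hypothetical re-implementations for illustration only; the real helpers
// are not shown in this diff and may behave differently.

// Index where Runcap's own options end: the `--` separator, or the array end.
function optionEnd(args) {
  const sep = args.indexOf("--");
  return sep === -1 ? args.length : sep;
}

// Remove `name <value>` from args and return the value, or undefined if absent.
// Occurrences after `--` belong to the agent command and are left alone.
function takeOption(args, name) {
  const i = args.indexOf(name);
  if (i === -1 || i >= optionEnd(args) - 1) return undefined;
  return args.splice(i, 2)[1];
}

// Remove a boolean flag from args and report whether it was present.
function takeFlag(args, name) {
  const i = args.indexOf(name);
  if (i === -1 || i >= optionEnd(args)) return false;
  args.splice(i, 1);
  return true;
}

// Collect every value of a repeatable option, e.g. several --protect paths.
function takeAll(args, name) {
  const values = [];
  let value;
  while ((value = takeOption(args, name)) !== undefined) values.push(value);
  return values;
}

// Sample invocation (made up): the agent's own --task must survive parsing.
const args = [
  "--task", "fix sum",
  "--protect", "verify.mjs", "--protect", "test/",
  "--mock",
  "--", "node", "agent.mjs", "--task", "inner"
];
const task = takeOption(args, "--task");
const protect = takeAll(args, "--protect");
const mock = takeFlag(args, "--mock");
const separator = args.indexOf("--");
const childArgs = separator === -1 ? [] : args.slice(separator + 1);

console.log(task);      // fix sum
console.log(protect);   // [ 'verify.mjs', 'test/' ]
console.log(mock);      // true
console.log(childArgs); // [ 'node', 'agent.mjs', '--task', 'inner' ]
```

Stopping the search at `--` is a design choice of this sketch: without it, an agent command that happens to accept `--task` or `--mock` would have those arguments consumed by the wrapper.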
@@ -0,0 +1,24 @@
1
+ // A stand-in coding agent that (1) spends money via the gateway and (2) actually
2
+ // fixes the bug. It points at whatever base URL Runcap injected (the cap
3
+ // gateway), so the spend is recorded and priced exactly like a real agent's.
4
+ import { writeFile } from "node:fs/promises";
5
+ import path from "node:path";
6
+
7
+ const base = process.env.OPENAI_BASE_URL || "https://api.openai.com/v1";
8
+ const model = process.env.OUTCOME_DEMO_MODEL || "gpt-4o";
9
+
10
+ async function call(prompt) {
11
+ const res = await fetch(`${base}/chat/completions`, {
12
+ method: "POST",
13
+ headers: { "content-type": "application/json", authorization: "Bearer demo" },
14
+ body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] })
15
+ });
16
+ return res.text();
17
+ }
18
+
19
+ await call("Read broken.mjs. The sum() function subtracts instead of adds. Plan the one-line fix.");
20
+ await call("Apply the fix: sum should return a + b.");
21
+
22
+ const file = path.join(process.cwd(), "examples/outcome-demo/broken.mjs");
23
+ await writeFile(file, "export function sum(a, b) {\n return a + b;\n}\n");
24
+ console.log("agent-fixes: rewrote sum() to add");
@@ -0,0 +1,20 @@
1
+ // A stand-in coding agent that spends money via the gateway but never fixes the
2
+ // bug - it circles, re-reading and re-planning, the way a stuck agent burns
3
+ // budget while reporting confident progress. broken.mjs is left untouched.
4
+ const base = process.env.OPENAI_BASE_URL || "https://api.openai.com/v1";
5
+ const model = process.env.OUTCOME_DEMO_MODEL || "gpt-4o";
6
+
7
+ async function call(prompt) {
8
+ const res = await fetch(`${base}/chat/completions`, {
9
+ method: "POST",
10
+ headers: { "content-type": "application/json", authorization: "Bearer demo" },
11
+ body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] })
12
+ });
13
+ return res.text();
14
+ }
15
+
16
+ await call("Read broken.mjs and explain the bug in sum().");
17
+ await call("Re-read broken.mjs once more to be sure, then restate the plan.");
18
+ await call("Describe at length how you would fix it, but do not write the file yet.");
19
+ console.log("agent-spins: lots of talk, no fix written");
20
+ process.exit(0);
@@ -0,0 +1,5 @@
1
+ // Intentionally wrong: sum() subtracts. The task is to make verify.mjs pass.
2
+ // The agent under test rewrites this file (or fails to).
3
+ export function sum(a, b) {
4
+ return a - b;
5
+ }
@@ -0,0 +1,7 @@
1
+ // The verification oracle. Exit 0 means the task is actually delivered.
2
+ import { sum } from "./broken.mjs";
3
+ import assert from "node:assert";
4
+
5
+ assert.strictEqual(sum(2, 3), 5, `sum(2,3) should be 5, got ${sum(2, 3)}`);
6
+ assert.strictEqual(sum(10, 5), 15, `sum(10,5) should be 15, got ${sum(10, 5)}`);
7
+ console.log("verify: PASSED (sum is correct)");
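The demo files above fit together as follows: `verify.mjs` is the oracle, `broken.mjs` is the code under repair, and the two stand-in agents differ only in whether they rewrite `broken.mjs`. As a rough in-memory illustration, the sketch below applies the same two checks as `verify.mjs` to the original and the fixed `sum()`. The function `oraclePasses` is a hypothetical name, and the real flow runs the oracle as a separate process and reads its exit code instead of calling a function.

```javascript
// Hypothetical helper: applies the same two checks as verify.mjs to any sum().
function oraclePasses(sum) {
  return sum(2, 3) === 5 && sum(10, 5) === 15;
}

// broken.mjs as shipped: subtracts instead of adding.
const broken = (a, b) => a - b;
// broken.mjs after agent-fixes rewrites it.
const fixed = (a, b) => a + b;

console.log(oraclePasses(broken)); // false (state left behind by agent-spins)
console.log(oraclePasses(fixed));  // true  (state left behind by agent-fixes)
```

Both agents exit with status 0 and both spend through the gateway, so only the oracle result separates a delivered task from a stalled one.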