runcap 0.3.1 → 0.5.0

package/README.md CHANGED
@@ -2,30 +2,52 @@
2
2
 
3
3
  [![CI](https://github.com/kirder24-code/ai-agent-manager/actions/workflows/ci.yml/badge.svg)](https://github.com/kirder24-code/ai-agent-manager/actions/workflows/ci.yml)
4
4
 
5
- ![Runcap terminal demo: estimate, cap, compress, stop](docs/assets/demo.svg)
5
+ ![Runcap terminal demo: estimate, cap, verify integrity, mission PASS - then a tampered run graded BLOCKED on the PR](docs/assets/demo.svg)
6
6
 
7
- **Your AI coding agent re-reads the same files over and over and quietly burns your money. Runcap estimates the bill before you build, hard-caps the spend so it physically stops at your ceiling, and losslessly compresses every call. Free, MIT, 100% local. Your code and tokens never touch a server.**
7
+ **Runcap stops AI-agent spend before it runs - and shows whether that spend produced a verified result. Free, MIT, 100% local. Your code and tokens never touch a server.**
8
8
 
9
- On a real OpenAI call, one edited-file re-read dropped from **1,186 to 737 prompt tokens (37.9% saved)** with the model still answering correctly about the changed line. No other proxy does this:
9
+ > **An agent passing CI is not enough.**
10
+ > Runcap verifies whether the evidence of success was altered during the mission.
11
+
12
+ | Status | Meaning |
13
+ |---|---|
14
+ | `VERIFIED_STRONG` | Result passed an unchanged verifier and a clean-worktree replay. |
15
+ | `VERIFIED_WEAK` | Result passed, but some integrity evidence is missing. |
16
+ | `UNVERIFIED` | Verification did not pass. |
17
+ | `VERIFIER_COMPROMISED` | The agent changed protected verification evidence. |
18
+
19
+ ```text
20
+ Estimate the run → Cap the spend → Verify the outcome
21
+ ```
22
+
23
+ Every other tool measures **tokens**. You don't buy tokens - you buy a result that passes a check. So Runcap measures the only number that matters:
24
+
25
+ > **Verified Outcome Cost = total run cost / tasks that passed verification.** An agent that talks but never fixes the bug can cost *more* than one that does - and a token dashboard calls it "cheaper."
10
26
 
11
27
  | | Without Runcap | With Runcap |
12
28
  |---|---|---|
13
- | Re-read of an edited file | 1,186 prompt tokens | **737 prompt tokens** |
14
29
  | You find out the cost | when the invoice arrives | **before you press go, capped at your ceiling** |
15
30
  | When the agent gets stuck | it keeps spending | **run stops, you get the exact rescue prompt** |
31
+ | What you paid for | tokens (delivered or not) | **dollars per *verified* result** |
32
+
33
+ In a 6-run test on the same task, the run that **delivered nothing** cost *more* than the one that delivered, and the cheapest verified result (deepseek-chat) was ~43x cheaper than gpt-5.4's verified result for the same passing test. ([full table below](#real-results-6-runs-same-task-reproducible-offline))
16
34
 
17
- > Every other tool here is a rear-view mirror - it shows you the bill *after* you paid it. Runcap estimates the bill *before* you start and caps it. It is a circuit breaker, not a dashboard.
35
+ > Every other tool here is a rear-view mirror - it shows you the bill *after* you paid it. Runcap estimates the bill *before* you start, caps it, and tells you if the spend actually delivered. It is a circuit breaker with a receipt, not a dashboard.
36
+
37
+ > If Runcap caps a run for you or compresses a call, please **star the repo** - it is the one signal that tells me to keep building it in the open.
18
38
 
19
39
  ## Why
20
40
 
21
41
  **Agents loop on the same error, rewrite plans, and re-read files they just edited - every loop is tokens you pay for.** Multi-agent coding runs burn roughly **15x more tokens** than a single chat ([Anthropic engineering](https://www.anthropic.com/engineering/built-multi-agent-research-system)). They hand you a confident summary while the task is not actually done, and you find out what it cost when the invoice - or the subscription limit - arrives.
22
42
 
23
- Observability tools (Langfuse, Helicone, LangSmith, AgentOps) measure the past. Gateways (LiteLLM, Portkey, OpenRouter) route the present. None of them stop the spend *before* it happens. Runcap does the one thing the rear-view mirror can't:
43
+ Observability tools (Langfuse, Helicone, LangSmith, AgentOps) measure the past. Gateways (LiteLLM, Portkey, OpenRouter) route the present. None of them stop the spend *before* it happens, and none tell you whether the spend produced a verified result. Runcap does the things the rear-view mirror can't:
24
44
 
25
45
  ```text
26
- estimate before build → cap during run → compress every call → rescue when stuck
46
+ estimate before build → cap during run → compress every call → rescue when stuck → verify the outcome
27
47
  ```
28
48
 
49
+ It also quietly trims waste on the way through: on a real OpenAI call, one edited-file re-read dropped from **1,186 to 737 prompt tokens (37.9% saved)** with the model still answering correctly about the changed line - lossless, and no other proxy does it ([details below](#token-compression-built-in-no-extra-deps)).
50
+
29
51
  ## The honest claim
30
52
 
31
53
  Runcap does **not** promise an exact cost oracle. Agent trajectories are stochastic - nobody, including the model labs, can predict the exact token count of a run. So Runcap gives you a **range plus a hard cap**:
@@ -48,6 +70,8 @@ If you have those, Runcap caps your spend in one command. If you are looking for
48
70
 
49
71
  No API key required.
50
72
 
73
+ ![Runcap terminal proof: estimate, cap spend, compress tokens, verify integrity, and block a compromised mission](docs/assets/media/demo.png)
74
+
51
75
  ```bash
52
76
  git clone https://github.com/kirder24-code/ai-agent-manager.git
53
77
  cd ai-agent-manager
@@ -106,6 +130,7 @@ Or run from source with `node ./bin/runcap.mjs <command>`.
106
130
  runcap plan --fuel 24 -- "build a small auth feature and verify it" # range + recommended cap, before you spend
107
131
  runcap preflight -- claude "build a full SaaS app" # is this prompt too broad?
108
132
  runcap run --label fix -- claude "fix one failing check. stop if blocked." # wrap any agent/command
133
+ runcap outcome run --task "..." --verify "pnpm test" -- claude "fix it" # cost of a VERIFIED result, not tokens
109
134
  runcap report # human-readable rescue report
110
135
  runcap export # evidence JSON with truth labels
111
136
  runcap dashboard # local cockpit at :8791
@@ -125,6 +150,10 @@ OPENAI_API_KEY=sk-... AIM_DAILY_BUDGET_USD=5 runcap gateway
125
150
  # Anthropic-native (Claude Code, /v1/messages)
126
151
  ANTHROPIC_API_KEY=sk-ant-... AIM_DAILY_BUDGET_USD=5 runcap gateway
127
152
  # then: ANTHROPIC_BASE_URL=http://127.0.0.1:8792/v1
153
+
154
+ # DeepSeek (OpenAI-compatible, much cheaper - same one command)
155
+ OPENAI_API_KEY=sk-... AIM_UPSTREAM_BASE_URL=https://api.deepseek.com AIM_DAILY_BUDGET_USD=5 runcap gateway
156
+ # then point your agent at: OPENAI_BASE_URL=http://127.0.0.1:8792/v1 (model: deepseek-chat)
128
157
  ```
129
158
 
130
159
  When spend crosses the ceiling, the next call returns `429 budget_guard` instead of money leaving your account. Try it with no key: `runcap gateway --mock`.
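The cap behaviour described here can be pictured as a check that runs before each forwarded call. This is a minimal sketch under stated assumptions, not the gateway's code: `guardNextCall` and the shape of its return value are hypothetical, and only the `429` status and the `budget_guard` label come from the README.

```javascript
// Hypothetical budget guard: decide, before forwarding a call, whether the
// recorded spend has already crossed the daily ceiling.
// Assumption: spend exactly equal to the ceiling still forwards.
function guardNextCall(spentUsd, dailyBudgetUsd) {
  if (spentUsd > dailyBudgetUsd) {
    // Refuse locally so no further money leaves the account.
    return { forward: false, status: 429, error: "budget_guard" };
  }
  return { forward: true, status: null, error: null };
}

console.log(guardNextCall(4.2, 5)); // under the ceiling: forwarded upstream
console.log(guardNextCall(5.01, 5)); // over the ceiling: refused locally
```

The design point is that the refusal happens before the upstream request is made, which is what separates a cap from a report.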
@@ -137,7 +166,7 @@ Every request that passes through the gateway is compressed before it's forwarde
137
166
  2. **Identical-block dedup** - when the exact same file dump or tool_result ships again in the same request, the repeat is replaced with a deterministic stub.
138
167
  3. **Delta-encoding of near-duplicates** - the layer no other proxy has. When the agent reads a file, edits one line, and re-reads it, the block is *similar but not identical*, so plain dedup saves nothing. Runcap sends a readable line-diff against the version the model already saw, and the model reconstructs the current file from it. On a real OpenAI call, an edited-file re-read dropped from **1186 to 737 prompt tokens - 37.9% saved, with the model still answering correctly about the changed line.** Proof and reproduction steps: [docs/delta-encoding-evidence.md](https://github.com/kirder24-code/ai-agent-manager/blob/main/docs/delta-encoding-evidence.md).
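To make the delta-encoding idea concrete, here is a deliberately simplified sketch. It assumes the edit changes lines in place (same line count) and falls back to sending the full text otherwise; Runcap's real diff format is not shown in the README, so `encodeDelta`, `applyDelta`, and the delta shape are all hypothetical.

```javascript
// Simplified line-level delta: record only the lines that differ from the
// version the model already saw. Not a general diff algorithm.
function encodeDelta(previousText, currentText) {
  const before = previousText.split("\n");
  const after = currentText.split("\n");
  if (before.length !== after.length) {
    // Insertions or deletions are out of scope for this sketch.
    return { kind: "full", text: currentText };
  }
  const changes = [];
  after.forEach((line, i) => {
    if (line !== before[i]) changes.push({ line: i + 1, text: line });
  });
  return { kind: "delta", changes };
}

// Rebuild the current file from the earlier version plus the delta.
function applyDelta(previousText, delta) {
  if (delta.kind === "full") return delta.text;
  const lines = previousText.split("\n");
  for (const change of delta.changes) lines[change.line - 1] = change.text;
  return lines.join("\n");
}

const v1 = "export function sum(a, b) {\n  return a - b;\n}";
const v2 = "export function sum(a, b) {\n  return a + b;\n}";
const delta = encodeDelta(v1, v2);
console.log(delta); // only the edited line is carried
console.log(applyDelta(v1, delta) === v2); // round-trip check
```

The round trip is what makes the scheme lossless: the receiver can always rebuild the exact current file from what it already holds.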
139
168
 
140
- It's pure Node with **zero ML or native dependencies**, so it installs everywhere without the build pain heavier compressors have.
169
+ It's pure Node with **zero native or ML dependencies** (the only runtime dependency is `js-yaml`, pure JS), so it installs everywhere without the build pain heavier compressors have.
141
170
 
142
171
  The dashboard shows the result as one number: **"You saved $X · N tokens compressed · would have spent $Y."** Disable it with `AIM_COMPRESS=off` if you ever want raw passthrough.
143
172
 
@@ -147,9 +176,178 @@ The hard case in stuck-detection is the agent that keeps producing output but is
147
176
 
148
177
  This is a **calculated** signal, not a proven dollar-saving: it tells you *"the agent has sent 3 near-identical prompts in a row with no progress"* so you can step in before the loop burns more budget. Tune or disable it with `AIM_LOOP_DETECT=off`. (Today's [`detectStuck`](src/mission-control.mjs) post-run score is outcome-based: exit code, parsed errors, and zero-diff. The loop signal adds the missing in-flight behavioral signal on top of it.)
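One way to picture the in-flight loop signal is a similarity check over the most recent prompts. The sketch below is an assumption-laden illustration: Jaccard similarity over word sets, the window of 3, and the 0.9 threshold are my choices, not Runcap's documented metric, and it does not model the "no progress" half of the signal.

```javascript
// Word-set Jaccard similarity between two prompts (1 = same word set).
function jaccard(a, b) {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  for (const word of setA) if (setB.has(word)) shared += 1;
  return shared / (setA.size + setB.size - shared);
}

// Hypothetical loop check: true when each of the last `windowSize` prompts
// is near-identical to the one before it.
function looksLikeLoop(prompts, windowSize = 3, threshold = 0.9) {
  if (prompts.length < windowSize) return false;
  const recent = prompts.slice(-windowSize);
  for (let i = 1; i < recent.length; i += 1) {
    if (jaccard(recent[i - 1], recent[i]) < threshold) return false;
  }
  return true;
}

const stuckPrompts = [
  "Fix the failing test in sum.test.mjs. Error: expected 5, got -1",
  "Fix the failing test in sum.test.mjs. Error: expected 5, got -1",
  "fix the failing test in sum.test.mjs. error: expected 5, got -1",
];
console.log(looksLikeLoop(stuckPrompts));
```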
149
178
 
179
+ ## Verified Outcome Cost (`runcap outcome`)
180
+
181
+ Tokens are the wrong unit. You don't buy tokens, you buy a **result that passes a check**. So Runcap measures the only number that tracks what you actually paid for:
182
+
183
+ > **Verified Outcome Cost = total run cost / tasks that passed verification.**
184
+
185
+ Wrap your agent and hand it a verification command. Its exit code is the oracle:
186
+
187
+ ```bash
188
+ runcap outcome run \
189
+ --task "Fix the failing test" \
190
+ --verify "pnpm test && pnpm build" \
191
+ -- claude "fix one failing check, then stop"
192
+
193
+ runcap outcome show # re-print the latest receipt
194
+ ```
195
+
196
+ The run produces an **outcome receipt** where every field carries a truth label:
197
+
198
+ ```
199
+ Outcome: VERIFIED
200
+ Actual cost: $0.000677 (2 priced LLM calls, calculated_from_provider_usage_and_sourced_price_table)
201
+ Verified Outcome Cost: $0.000677 (money that bought a verified result)
202
+ ```
203
+
204
+ If verification fails, the receipt is honest about it instead of pretending the spend delivered something:
205
+
206
+ ```
207
+ Outcome: UNVERIFIED
208
+ Verified Outcome Cost: N/A (verification did not pass)
209
+ Money spent without verified delivery: $0.001012
210
+ ```
211
+
212
+ That second case is the whole point: an agent that talks but never fixes the bug can cost **more** than one that does, while a token dashboard calls it "cheaper." Try both, offline, no API key:
213
+
214
+ ```bash
215
+ runcap outcome run --task "Fix sum() so it adds" --verify "node examples/outcome-demo/verify.mjs" --mock -- node examples/outcome-demo/agent-fixes.mjs # VERIFIED
216
+ runcap outcome run --task "Fix sum() so it adds" --verify "node examples/outcome-demo/verify.mjs" --mock -- node examples/outcome-demo/agent-spins.mjs # UNVERIFIED
217
+ ```
218
+
219
+ Receipts are written to `.runcap/outcomes/<id>/receipt.json`. Run the same task across several agents and you get the **Agent Economics Index** - the same board, priced by verified delivery instead of tokens.
220
+
221
+ ### Real results (6 runs, same task, reproducible offline)
222
+
223
+ Same task - *fix a broken `sum()` so the test passes* - across different models and two agent behaviors. Every number below is measured from the gateway ledger, not estimated:
224
+
225
+ | # | Model | Agent | LLM calls | Actual cost | Outcome | Verified Outcome Cost | Money, nothing delivered |
226
+ |---|---|---|---|---|---|---|---|
227
+ | 1 | gpt-4o | writes fix | 2 | $0.000677 | VERIFIED | **$0.000677** | $0 |
228
+ | 2 | gpt-4o | spins, no fix | 3 | $0.001012 | UNVERIFIED | **N/A** | **$0.001012** |
229
+ | 3 | gpt-5.4 | writes fix | 2 | $0.000957 | VERIFIED | **$0.000957** | $0 |
230
+ | 4 | deepseek-chat | writes fix | 2 | $0.000022 | VERIFIED | **$0.000022** | $0 |
231
+ | 5 | deepseek-chat | spins, no fix | 3 | $0.000033 | UNVERIFIED | **N/A** | **$0.000033** |
232
+ | 6 | claude-sonnet-4 | writes fix | 2 | $0.000981 | VERIFIED | **$0.000981** | $0 |
233
+
234
+ Two facts a token dashboard can't show you: on gpt-4o the run that **delivered nothing** (row 2) cost *more* than the run that delivered (row 1); and the cheapest verified result (deepseek-chat, $0.000022) bought the same passing test as gpt-5.4 at ~43x the price. Reproduce any row with `OUTCOME_DEMO_MODEL=<model>` in front of the command above. (One run per row in v0.1 - the point is the unit, not a vendor ranking; a ranking needs N>=5 runs/agent and a pass-rate column.)
235
+
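The arithmetic behind the table can be reproduced in a few lines. The rows below are copied from the table; the helper name and the idea of pooling both gpt-4o runs into one figure are illustrative additions, not something the receipts do.

```javascript
// Verified Outcome Cost = total run cost / tasks that passed verification.
const runs = [
  { model: "gpt-4o", costUsd: 0.000677, verified: true },
  { model: "gpt-4o", costUsd: 0.001012, verified: false },
  { model: "gpt-5.4", costUsd: 0.000957, verified: true },
  { model: "deepseek-chat", costUsd: 0.000022, verified: true },
  { model: "deepseek-chat", costUsd: 0.000033, verified: false },
  { model: "claude-sonnet-4", costUsd: 0.000981, verified: true },
];

// Returns null (the receipt's "N/A") when nothing passed verification.
function verifiedOutcomeCost(rows) {
  const total = rows.reduce((sum, row) => sum + row.costUsd, 0);
  const passed = rows.filter((row) => row.verified).length;
  return passed === 0 ? null : total / passed;
}

const gpt4oRuns = runs.filter((row) => row.model === "gpt-4o");
console.log(verifiedOutcomeCost(gpt4oRuns)); // both gpt-4o runs, one verified result
console.log(verifiedOutcomeCost([runs[1]])); // spend with no verified delivery
console.log(runs[2].costUsd / runs[3].costUsd); // gpt-5.4 vs deepseek-chat price ratio
```

Dividing by verified results rather than by calls is what lets a failed attempt raise the price of the eventual success instead of disappearing from the total.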
236
+ ## Verification Integrity (`runcap outcome --guard`)
237
+
238
+ A green test only means something if the agent passed it *fairly*. An agent under pressure can turn a check green without doing the work: delete the failing test, rewrite the assertion, repoint the `npm test` script at `true`, disable TypeScript strict mode, mock the real API, or hardcode the expected answer. Exit code 0 - and a plain Verified Outcome Cost calls it delivered. So does every token dashboard.
239
+
240
+ `--guard` verifies the verifier. Before the agent runs, Runcap freezes a **Task Contract**: the baseline git commit, a SHA-256 of every file the verify command names, a snapshot of `package.json` scripts, and a check that the task *actually fails today* (a pass on an already-green tree proves nothing). After the run it re-checks all of it, and re-runs the verify command from the baseline commit in a throwaway git worktree with only the agent's changed files copied in - so a green that depended on uncommitted local junk dies in the clean room.
241
+
242
+ The result is a trust grade, not a binary:
243
+
244
+ | Status | Meaning |
245
+ |---|---|
246
+ | `VERIFIED_STRONG` | Passed, verifier untouched, the task really failed before, and the pass survives a clean checkout. |
247
+ | `VERIFIED_WEAK` | Passed and verifier untouched, but a strong condition was missed (e.g. baseline failure not reproduced). |
248
+ | `UNVERIFIED` | Verification did not pass. |
249
+ | `VERIFIER_COMPROMISED` | Passed, **but the verifier itself was modified during the run.** The green light cannot be trusted. |
250
+
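Read as a decision procedure, the table could be sketched like this. The check names and the order of evaluation are assumptions made for illustration; the receipt's real `verificationIntegrity` block may be structured differently.

```javascript
// Hypothetical mapping from integrity checks to the four grades in the table.
function gradeOutcome({ passed, verifierUnchanged, baselineFailed, cleanReplayPassed }) {
  if (!passed) return "UNVERIFIED";
  // A pass only counts if the verifier itself was left alone.
  if (!verifierUnchanged) return "VERIFIER_COMPROMISED";
  // Strong needs both: the task failed before, and the pass survives a clean checkout.
  if (baselineFailed && cleanReplayPassed) return "VERIFIED_STRONG";
  return "VERIFIED_WEAK";
}

console.log(gradeOutcome({ passed: true, verifierUnchanged: true, baselineFailed: true, cleanReplayPassed: true }));
console.log(gradeOutcome({ passed: true, verifierUnchanged: false, baselineFailed: true, cleanReplayPassed: true }));
```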
251
+ ```bash
252
+ runcap outcome run --guard \
253
+ --task "Fix the failing test" \
254
+ --verify "pnpm test" \
255
+ --allow src/ \
256
+ -- claude "fix one failing check, then stop"
257
+ ```
258
+
259
+ `--protect <path>` marks extra paths the agent must not touch (tests, config, and `package.json` are protected by default); `--allow <path>` declares the only paths a legitimate fix should change, so out-of-scope edits drop the grade. The receipt gains a `verificationIntegrity` block listing every check, every truth label, and exactly which file was tampered with if the grade is `VERIFIER_COMPROMISED`.
260
+
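The `--protect` / `--allow` rules amount to two set checks over the files the agent changed. A minimal sketch, assuming plain path-prefix matching (the `mission.yaml` example uses glob patterns such as `tests/**`, which this does not implement); `checkScope` is a hypothetical name.

```javascript
// True when `file` is the path itself or lives under it as a directory.
function isUnder(file, prefix) {
  const dir = prefix.endsWith("/") ? prefix : prefix + "/";
  return file === prefix || file.startsWith(dir);
}

// Hypothetical scope check over the agent's changed files.
function checkScope(changedFiles, { allow = [], protect = [] } = {}) {
  const touchedProtected = changedFiles.filter((file) => protect.some((p) => isUnder(file, p)));
  // With no --allow paths declared, nothing is treated as out of scope.
  const outOfScope = allow.length === 0
    ? []
    : changedFiles.filter((file) => !allow.some((a) => isUnder(file, a)));
  return { touchedProtected, outOfScope };
}

const scope = checkScope(
  ["src/sum.mjs", "tests/sum.test.mjs", "package.json"],
  { allow: ["src/"], protect: ["tests/", "package.json"] }
);
console.log(scope);
```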
261
+ > **One honesty note that rides on every receipt:** Verified Outcome Cost is the **LLM spend that bought the result** - it does *not* include subscriptions, CI minutes, sandbox compute, or human review time. For real agent economics you want **Expected Verified Outcome Cost = total spend across N attempts / strongly-verified outcomes**, which needs N>=5 runs. That's the v0.2 unit; v0.1 measures the one cost the gateway can observe honestly.
262
+
263
+ ## Mission Policy & CI enforcement (`runcap mission` / `policy` / `ci`)
264
+
265
+ Everything above lives in one developer's terminal. A platform, VP-Eng, or FinOps owner can't act on it: there's no way to declare the rules of a mission *once*, in the repo, and no way to fail a pull request when an agent breaches budget or tampers with the evidence of its own success.
266
+
267
+ A **mission policy** closes that gap. You declare the rules once in `.runcap/mission.yaml`, enforce them during the run, and grade the result into a **PASS / BLOCKED** verdict a GitHub Action turns into a red/green check on the PR.
268
+
269
+ ```yaml
270
+ # .runcap/mission.yaml
271
+ version: v1
272
+ identity:
273
+   project: checkout
274
+   team: payments
275
+ mission:
276
+   name: Fix the failing checkout test
277
+   task_class: bugfix
278
+ budget:
279
+   mission_hard_limit_usd: 10 # required - per-mission hard cap (the gateway enforces it live)
280
+   max_llm_calls: 12 # optional - BLOCK if exceeded
281
+   max_runtime_minutes: 30 # optional - BLOCK if exceeded
282
+ verification:
283
+   command: "pnpm test && pnpm build" # required - the oracle (exit 0 = delivered)
284
+   guard: strict # strict (default) freezes + re-checks the verifier
285
+   protect: ["tests/**", "package.json"] # paths the agent must not touch
286
+   allow: ["src/checkout/**"] # the only paths a legit fix should change
287
+ ```
288
+
289
+ Validate it, then run the agent under it:
290
+
291
+ ```bash
292
+ runcap policy validate # is .runcap/mission.yaml well-formed?
293
+ runcap mission run -- claude "fix the failing checkout test, then stop"
294
+ ```
295
+
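As a rough picture of what a well-formedness check involves, the sketch below validates an already-parsed policy object. It enforces only the two fields the example marks "required"; the real `runcap policy validate` may check more, and the function name and error wording here are invented.

```javascript
// Hypothetical validator for a parsed mission policy (not src/policy.mjs).
function validatePolicySketch(policy) {
  const errors = [];
  const limit = policy?.budget?.mission_hard_limit_usd;
  if (typeof limit !== "number" || !(limit > 0)) {
    errors.push("budget.mission_hard_limit_usd must be a positive number");
  }
  const command = policy?.verification?.command;
  if (typeof command !== "string" || command.trim() === "") {
    errors.push("verification.command must be a non-empty string");
  }
  return { ok: errors.length === 0, errors };
}

console.log(validatePolicySketch({
  budget: { mission_hard_limit_usd: 10 },
  verification: { command: "pnpm test && pnpm build" },
}));
console.log(validatePolicySketch({ budget: {} }));
```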
296
+ `mission run` enforces the per-mission hard cap through the gateway, always runs the verification guard, and writes an outcome receipt that now carries a **policy block** - the org attribution, the limits, the **SHA-256 of the exact policy text that graded the run**, and the verdict. It exits `1` on `BLOCKED`, so it fails CI on its own. The mission is **BLOCKED** when any of these is true:
297
+
298
+ - the verifier was tampered with (`VERIFIER_COMPROMISED`);
299
+ - verification did not pass (`UNVERIFIED`);
300
+ - a change landed outside the declared `allow` scope;
301
+ - spend exceeded `mission_hard_limit_usd`, or the gateway's budget guard tripped mid-run;
302
+ - `max_llm_calls` or `max_runtime_minutes` was exceeded.
303
+
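The BLOCKED conditions listed above can be sketched as one function that collects reasons. Field names on `run` and `limits` are illustrative, not the receipt's real schema, and this is not the package's `evaluatePolicyVerdict`.

```javascript
// Hypothetical verdict: BLOCKED if any listed condition holds, else PASS.
function missionVerdict(run, limits) {
  const reasons = [];
  if (run.status === "VERIFIER_COMPROMISED") reasons.push("verifier was tampered with");
  if (run.status === "UNVERIFIED") reasons.push("verification did not pass");
  if (run.outOfScopeFiles.length > 0) reasons.push("change outside the allow scope");
  if (run.costUsd > limits.mission_hard_limit_usd || run.budgetGuardTripped) {
    reasons.push("budget exceeded");
  }
  // Optional limits only apply when the policy declares them.
  if (limits.max_llm_calls != null && run.llmCalls > limits.max_llm_calls) {
    reasons.push("max_llm_calls exceeded");
  }
  if (limits.max_runtime_minutes != null && run.runtimeMinutes > limits.max_runtime_minutes) {
    reasons.push("max_runtime_minutes exceeded");
  }
  return { verdict: reasons.length === 0 ? "PASS" : "BLOCKED", reasons };
}

const cleanRun = {
  status: "VERIFIED_STRONG", outOfScopeFiles: [], costUsd: 0.0007,
  budgetGuardTripped: false, llmCalls: 2, runtimeMinutes: 1,
};
const missionLimits = { mission_hard_limit_usd: 10, max_llm_calls: 12, max_runtime_minutes: 30 };
console.log(missionVerdict(cleanRun, missionLimits));
console.log(missionVerdict({ ...cleanRun, status: "VERIFIER_COMPROMISED" }, missionLimits));
```

Collecting every reason instead of returning on the first one gives a reviewer the full list of breaches in a single verdict.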
304
+ ### The GitHub Action (the red/green PR check)
305
+
306
+ ```yaml
307
+ # .github/workflows/runcap.yml
308
+ jobs:
309
+   runcap-mission:
309
+     runs-on: ubuntu-latest
310
+     steps:
311
+       - uses: actions/checkout@v4
312
+       - uses: actions/setup-node@v4
313
+         with: { node-version: 20 }
314
+       # 1. Run the agent under the policy (produces a graded receipt + exits 1 on BLOCKED)
315
+       - run: npx runcap mission run -- <your agent command>
316
+       # 2. Grade the latest receipt against the committed policy and annotate the PR
317
+       - uses: kirder24-code/ai-agent-manager@v1
318
+         with:
319
+           policy: .runcap/mission.yaml
321
+ ```
322
+
323
+ `runcap ci` (what the Action runs) re-grades a receipt **against the committed policy text**, not whatever was stamped at run time, and appends the verdict to `$GITHUB_STEP_SUMMARY`, which GitHub renders as the job summary on the workflow run page. Already have a receipt from an earlier job? Grade it directly:
324
+
325
+ ```bash
326
+ runcap ci --policy .runcap/mission.yaml --receipt .runcap/outcomes/<id>/receipt.json
327
+ ```
328
+
329
+ A reviewer sees one of two things:
330
+
331
+ ```
332
+ Mission verdict: PASS
333
+ project checkout / team payments
334
+ Mission cost $0.0007 / $10.00
335
+ Policy hash: c857d10c…
336
+ ```
337
+
338
+ ```
339
+ Mission verdict: BLOCKED
340
+ Blocked because:
341
+ - VERIFIER_COMPROMISED: the agent changed protected verification evidence (verifier_file_unchanged:app/verify.mjs).
342
+ ```
343
+
344
+ Because the verdict is recomputed from the committed policy and the receipt records the policy hash, a reviewer can confirm *which rules graded the run* - the verdict is only as trustworthy as the policy hash it carries.
345
+
150
346
  ## Pricing table
151
347
 
152
- Costs are calculated from a sourced multi-provider table - Anthropic (Opus / Sonnet / Haiku) and OpenAI (GPT-5 family + legacy GPT-4), with cache-read and batch discounts handled - labeled with source and verification date. When a model is unknown, Runcap says `unknown_price` rather than guessing.
348
+ Costs are calculated from a sourced multi-provider table - Anthropic (Opus / Sonnet / Haiku), OpenAI (GPT-5 family + legacy GPT-4), and DeepSeek (V4 Flash / V4 Pro) - with cache-read and batch discounts handled, labeled with source and verification date. When a model is unknown, Runcap says `unknown_price` rather than guessing.
349
+
350
+ DeepSeek matters because its API is OpenAI-compatible: point the gateway at `https://api.deepseek.com` with your DeepSeek key and Runcap prices, caps, and compresses it with zero extra setup - the same one command as OpenAI. At roughly $0.14 / $0.28 per million input/output tokens it is far cheaper than the US frontier models, so the people running the biggest agent loops on it are exactly the ones a hard cap protects.
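Per-call pricing from a per-million-token rate is simple arithmetic. In the sketch below the $0.14 / $0.28 figures are the approximate DeepSeek rates quoted above; the table shape and function name are illustrative and do not reflect Runcap's actual price table.

```javascript
// Illustrative price table, USD per million tokens.
const PRICES_PER_MILLION_USD = {
  "deepseek-chat": { input: 0.14, output: 0.28 },
};

function callCostUsd(model, inputTokens, outputTokens) {
  if (!Object.hasOwn(PRICES_PER_MILLION_USD, model)) {
    return "unknown_price"; // never guess a price for an unlisted model
  }
  const price = PRICES_PER_MILLION_USD[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

console.log(callCostUsd("deepseek-chat", 1186, 200)); // sub-cent cost of one small call
console.log(callCostUsd("some-unlisted-model", 1186, 200)); // falls back to the label
```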
153
351
 
154
352
  ## Trust model
155
353
 
@@ -187,6 +385,10 @@ A working local tool, not a hosted SaaS. Ready for: wrapping real Codex / Claude
187
385
  - [Integrations](docs/integrations.md)
188
386
  - [Trust model](docs/trust-model.md)
189
387
 
388
+ ## Built by
389
+
390
+ Runcap is built and maintained by Kirill D., a solo AI and automation consultant based in Calgary, Canada. He helps solo SaaS founders and service businesses ship AI features that hold up in production - cost control, vibe-code audits, and reliable automation. More at [launchsoloai.com](https://launchsoloai.com).
391
+
190
392
  ---
191
393
 
192
394
  The thesis: **AI agents need managers.**
package/bin/runcap.mjs CHANGED
@@ -11,7 +11,10 @@ import {
11
11
  preflightMission,
12
12
  recordFuel,
13
13
  renderReport,
14
+ renderOutcome,
15
+ latestOutcomeId,
14
16
  runMission,
17
+ runOutcome,
15
18
  setupProject,
16
19
  startDashboard,
17
20
  startGateway,
@@ -31,6 +34,14 @@ import {
31
34
  planToRun
32
35
  } from "../src/cloud.mjs";
33
36
  import { alertsCommand } from "../src/alerts.mjs";
37
+ import {
38
+ loadPolicy,
39
+ validatePolicy,
40
+ evaluatePolicyVerdict,
41
+ policyMeta,
42
+ formatPolicyBlock
43
+ } from "../src/policy.mjs";
44
+ import { readFileSync, appendFileSync } from "node:fs";
34
45
 
35
46
  const args = process.argv.slice(2);
36
47
  const command = args[0] ?? "welcome";
@@ -42,6 +53,14 @@ Usage:
42
53
  runcap run [--label name] [--cap|--no-cap] [--mock] -- <command...>
43
54
  (auto-enforces your cap; no manual gateway/base-URL setup)
44
55
  runcap plan [--fuel 24] [--quality high|balanced|cheap] [--apply-cap] -- <goal...>
56
+ runcap outcome run --task "..." --verify "<cmd>" [--label name] [--mock] -- <agent cmd...>
57
+ (runs the agent, then verifies; reports Verified Outcome Cost)
58
+ runcap outcome [show] (print the latest outcome receipt)
59
+ runcap policy validate [path] (check .runcap/mission.yaml is well-formed)
60
+ runcap mission run [--policy path] [--task override] [--mock] -- <agent cmd...>
61
+ (enforce the repo policy; exit 1 if the mission is BLOCKED)
62
+ runcap ci [--policy path] [--receipt path]
63
+ (grade a receipt against the policy; writes PR summary, exit 1 on BLOCKED)
45
64
  runcap plans
46
65
  runcap cap <usd> (set the hard cap the gateway enforces)
47
66
  runcap cap show (show the current cap)
@@ -90,6 +109,17 @@ function takeFlag(input, name) {
90
109
  return true;
91
110
  }
92
111
 
112
+ // Collect every occurrence of a repeatable option, e.g. --allow src --allow lib.
113
+ function takeAll(input, name) {
114
+ const values = [];
115
+ let index;
116
+ while ((index = input.indexOf(name)) !== -1) {
117
+ values.push(input[index + 1]);
118
+ input.splice(index, 2);
119
+ }
120
+ return values.filter(Boolean);
121
+ }
122
+
93
123
  // A real call can cost a fraction of a cent. toFixed(2)/(4) would print $0.00 or
94
124
  // $0.0000 and read as "nothing was recorded", so show a meaningful figure for
95
125
  // sub-cent spend instead of rounding a real charge down to zero.
@@ -100,6 +130,19 @@ function fmtUsd(n) {
100
130
  return `$${n.toPrecision(2)}`;
101
131
  }
102
132
 
133
+ // In GitHub Actions, $GITHUB_STEP_SUMMARY is a file whose Markdown the runner
134
+ // renders as the job summary on the workflow run page. Append the verdict
135
+ // there so a reviewer sees red/green without opening logs. A no-op off CI.
136
+ function writeCiSummary(markdown) {
137
+ const target = process.env.GITHUB_STEP_SUMMARY;
138
+ if (!target) return;
139
+ try {
140
+ appendFileSync(target, markdown + "\n");
141
+ } catch {
142
+ // best-effort annotation only - never fail the verdict on a write error.
143
+ }
144
+ }
145
+
103
146
  try {
104
147
  if (command === "welcome") {
105
148
  console.log(await welcome());
@@ -184,6 +227,116 @@ try {
184
227
  const sync = await syncRun(planToRun(plan));
185
228
  if (sync === "synced") console.log("Cloud: synced to your Runcap Pro dashboard.");
186
229
  else if (sync && sync.startsWith("sync_failed")) console.log(`Cloud: ${sync}`);
230
+ } else if (command === "outcome") {
231
+ const sub = args[1] ?? "show";
232
+ if (sub === "run") {
233
+ const oArgs = args.slice(2);
234
+ const task = takeOption(oArgs, "--task");
235
+ const verify = takeOption(oArgs, "--verify");
236
+ const label = takeOption(oArgs, "--label");
237
+ const mock = takeFlag(oArgs, "--mock");
238
+ const guard = takeFlag(oArgs, "--guard");
239
+ const protect = takeAll(oArgs, "--protect");
240
+ const allow = takeAll(oArgs, "--allow");
241
+ const separator = oArgs.indexOf("--");
242
+ const childArgs = separator === -1 ? [] : oArgs.slice(separator + 1);
243
+ if (!task) throw new Error("Missing --task \"description\".");
244
+ if (!verify) throw new Error("Missing --verify \"<command>\" (e.g. --verify \"npm test && npm run build\").");
245
+ if (childArgs.length === 0) throw new Error("Missing agent command after `--`.");
246
+ const result = await runOutcome({ task, verify, command: childArgs, label, mock, guard, protect, allow });
247
+ console.log("\n" + result.summary);
248
+ console.log(`Receipt: .runcap/outcomes/${result.id}/receipt.json`);
249
+ } else {
250
+ const id = sub === "show" ? await latestOutcomeId() : sub;
251
+ if (!id) throw new Error("No outcome receipt found. Run `runcap outcome run ...` first.");
252
+ console.log(await renderOutcome(id));
253
+ }
254
+ } else if (command === "policy") {
255
+ const sub = args[1] ?? "validate";
256
+ if (sub === "validate") {
257
+ const explicit = args[2] && !args[2].startsWith("--") ? args[2] : undefined;
258
+ const loaded = loadPolicy(process.cwd(), explicit);
259
+ if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass a path).");
260
+ const { ok, errors, warnings } = validatePolicy(loaded.policy);
261
+ console.log(`Policy: ${loaded.source}`);
262
+ console.log(`Hash: ${loaded.hash}`);
263
+ for (const w of warnings) console.log(` warning: ${w}`);
264
+ if (ok) {
265
+ console.log("Policy is valid.");
266
+ } else {
267
+ for (const e of errors) console.log(` error: ${e}`);
268
+ console.log("Policy is INVALID.");
269
+ process.exitCode = 1;
270
+ }
271
+ } else {
272
+ throw new Error(`Unknown policy subcommand: ${sub}. Try \`runcap policy validate [path]\`.`);
273
+ }
274
+ } else if (command === "mission") {
275
+ const sub = args[1] ?? "run";
276
+ if (sub !== "run") throw new Error(`Unknown mission subcommand: ${sub}. Try \`runcap mission run -- <agent cmd>\`.`);
277
+ const mArgs = args.slice(2);
278
+ const policyPath = takeOption(mArgs, "--policy");
279
+ const taskOverride = takeOption(mArgs, "--task");
280
+ const mock = takeFlag(mArgs, "--mock");
281
+ const separator = mArgs.indexOf("--");
282
+ const childArgs = separator === -1 ? [] : mArgs.slice(separator + 1);
283
+ if (childArgs.length === 0) throw new Error("Missing agent command after `--`.");
284
+
285
+ const loaded = loadPolicy(process.cwd(), policyPath);
286
+ if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass --policy <path>).");
287
+ const { ok, errors } = validatePolicy(loaded.policy);
288
+ if (!ok) {
289
+ for (const e of errors) console.error(` policy error: ${e}`);
290
+ throw new Error("Mission policy is invalid - fix it before running the mission.");
291
+ }
292
+ const p = loaded.policy;
293
+ const mission = p.mission ?? {};
294
+ const verification = p.verification ?? {};
295
+ const budget = p.budget ?? {};
296
+ const task = taskOverride || [mission.name, mission.task_class].filter(Boolean).join(" - ") || mission.name;
297
+ const result = await runOutcome({
298
+ task,
299
+ verify: verification.command,
300
+ command: childArgs,
301
+ label: mission.name,
302
+ mock,
303
+ guard: true,
304
+ protect: Array.isArray(verification.protect) ? verification.protect : [],
305
+ allow: Array.isArray(verification.allow) ? verification.allow : [],
306
+ capUsd: budget.mission_hard_limit_usd ?? null,
307
+ policy: loaded
308
+ });
309
+ console.log("\n" + result.summary);
310
+ console.log(`Receipt: .runcap/outcomes/${result.id}/receipt.json`);
311
+ if (result.receipt.policy?.verdict === "BLOCKED") process.exitCode = 1;
312
+ } else if (command === "ci") {
313
+ const ciArgs = args.slice(1);
314
+ const policyPath = takeOption(ciArgs, "--policy");
315
+ const receiptPath = takeOption(ciArgs, "--receipt");
316
+
317
+ const loaded = loadPolicy(process.cwd(), policyPath);
318
+ if (!loaded) throw new Error("No policy found. Create .runcap/mission.yaml (or pass --policy <path>).");
319
+ const { ok, errors } = validatePolicy(loaded.policy);
320
+ if (!ok) {
321
+ for (const e of errors) console.error(` policy error: ${e}`);
322
+ writeCiSummary(["## Runcap mission: policy INVALID", "", ...errors.map((e) => `- ${e}`)].join("\n"));
323
+ throw new Error("Mission policy is invalid.");
324
+ }
325
+
326
+ let receipt;
327
+ if (receiptPath) {
328
+ receipt = JSON.parse(readFileSync(receiptPath, "utf8"));
329
+ } else {
330
+ const id = await latestOutcomeId();
331
+ if (!id) throw new Error("No outcome receipt found. Run `runcap mission run ...` first, or pass --receipt <path>.");
332
+ receipt = JSON.parse(readFileSync(`.runcap/outcomes/${id}/receipt.json`, "utf8"));
333
+ }
334
+
335
+ const verdict = evaluatePolicyVerdict(receipt, loaded.policy);
336
+ const block = formatPolicyBlock({ ...policyMeta(loaded), ...verdict });
337
+ console.log(block.join("\n"));
338
+ writeCiSummary(["## Runcap mission verdict: " + verdict.verdict, "", "```", ...block, "```"].join("\n"));
339
+ if (verdict.verdict === "BLOCKED") process.exitCode = 1;
187
340
  } else if (command === "login") {
188
341
  console.log(await loginCommand(args[1]));
189
342
  } else if (command === "logout") {
@@ -0,0 +1,24 @@
1
+ // A stand-in coding agent that (1) spends money via the gateway and (2) actually
2
+ // fixes the bug. It points at whatever base URL Runcap injected (the cap
3
+ // gateway), so the spend is recorded and priced exactly like a real agent's.
4
+ import { writeFile } from "node:fs/promises";
5
+ import path from "node:path";
6
+
7
+ const base = process.env.OPENAI_BASE_URL || "https://api.openai.com/v1";
8
+ const model = process.env.OUTCOME_DEMO_MODEL || "gpt-4o";
9
+
10
+ async function call(prompt) {
11
+ const res = await fetch(`${base}/chat/completions`, {
12
+ method: "POST",
13
+ headers: { "content-type": "application/json", authorization: "Bearer demo" },
14
+ body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] })
15
+ });
16
+ return res.text();
17
+ }
18
+
19
+ await call("Read broken.mjs. The sum() function subtracts instead of adds. Plan the one-line fix.");
20
+ await call("Apply the fix: sum should return a + b.");
21
+
22
+ const file = path.join(process.cwd(), "examples/outcome-demo/broken.mjs");
23
+ await writeFile(file, "export function sum(a, b) {\n return a + b;\n}\n");
24
+ console.log("agent-fixes: rewrote sum() to add");
@@ -0,0 +1,20 @@
1
+ // A stand-in coding agent that spends money via the gateway but never fixes the
2
+ // bug - it circles, re-reading and re-planning, the way a stuck agent burns
3
+ // budget while reporting confident progress. broken.mjs is left untouched.
4
+ const base = process.env.OPENAI_BASE_URL || "https://api.openai.com/v1";
5
+ const model = process.env.OUTCOME_DEMO_MODEL || "gpt-4o";
6
+
7
+ async function call(prompt) {
8
+ const res = await fetch(`${base}/chat/completions`, {
9
+ method: "POST",
10
+ headers: { "content-type": "application/json", authorization: "Bearer demo" },
11
+ body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] })
12
+ });
13
+ return res.text();
14
+ }
15
+
16
+ await call("Read broken.mjs and explain the bug in sum().");
17
+ await call("Re-read broken.mjs once more to be sure, then restate the plan.");
18
+ await call("Describe at length how you would fix it, but do not write the file yet.");
19
+ console.log("agent-spins: lots of talk, no fix written");
20
+ process.exit(0);
@@ -0,0 +1,5 @@
1
+ // Intentionally wrong: sum() subtracts. The task is to make verify.mjs pass.
2
+ // The agent under test rewrites this file (or fails to).
3
+ export function sum(a, b) {
4
+ return a - b;
5
+ }
@@ -0,0 +1,7 @@
1
+ // The verification oracle. Exit 0 means the task is actually delivered.
2
+ import { sum } from "./broken.mjs";
3
+ import assert from "node:assert";
4
+
5
+ assert.strictEqual(sum(2, 3), 5, `sum(2,3) should be 5, got ${sum(2, 3)}`);
6
+ assert.strictEqual(sum(10, 5), 15, `sum(10,5) should be 15, got ${sum(10, 5)}`);
7
+ console.log("verify: PASSED (sum is correct)");
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "runcap",
3
- "version": "0.3.1",
3
+ "version": "0.5.0",
4
4
  "description": "Cap every agent run before it starts: estimate cost, set a hard ceiling that stops the run, rescue stuck agents. Local, MIT, nothing uploaded.",
5
5
  "license": "MIT",
6
6
  "type": "module",
@@ -45,7 +45,12 @@
45
45
  "acceptance": "node ./scripts/acceptance.mjs",
46
46
  "smoke": "node ./bin/runcap.mjs run --label smoke -- npm --prefix examples/broken-ts-app run build",
47
47
  "demo:broken": "node ./bin/runcap.mjs run --label broken-ts-demo -- npm --prefix examples/broken-ts-app run build",
48
- "test": "node ./scripts/delta-test.mjs && node ./scripts/loop-test.mjs && node ./scripts/loop-e2e.mjs && node ./scripts/validate-demo.mjs",
48
+ "test": "node ./scripts/delta-test.mjs && node ./scripts/loop-test.mjs && node ./scripts/loop-e2e.mjs && node ./scripts/validate-demo.mjs && node ./scripts/outcome-test.mjs && node ./scripts/guard-test.mjs && node ./scripts/policy-test.mjs && node ./scripts/mission-test.mjs",
49
+ "test:outcome": "node ./scripts/outcome-test.mjs",
50
+ "test:guard": "node ./scripts/guard-test.mjs",
51
+ "test:policy": "node ./scripts/policy-test.mjs",
52
+ "test:mission": "node ./scripts/mission-test.mjs",
53
+ "outcome": "node ./bin/runcap.mjs outcome",
49
54
  "test:delta": "node ./scripts/delta-test.mjs",
50
55
  "test:loop": "node ./scripts/loop-test.mjs",
51
56
  "status": "node ./bin/runcap.mjs status",
@@ -53,11 +58,15 @@
53
58
  "export": "node ./bin/runcap.mjs export",
54
59
  "templates": "node ./bin/runcap.mjs templates",
55
60
  "dashboard": "node ./bin/runcap.mjs dashboard",
61
+ "screenshots": "node ./scripts/render-media-screenshots.mjs",
56
62
  "gateway": "node ./bin/runcap.mjs gateway",
57
63
  "fuel": "node ./bin/runcap.mjs fuel",
58
64
  "check": "node --check ./bin/runcap.mjs && node --check ./src/mission-control.mjs"
59
65
  },
60
66
  "engines": {
61
67
  "node": ">=20"
68
+ },
69
+ "dependencies": {
70
+ "js-yaml": "^4.1.0"
62
71
  }
63
72
  }