aiforcecli 0.2.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,302 +1,305 @@
1
- # aiforcecli
2
-
3
- **One interface over every coding agent — that you can trust and afford.**
4
-
5
- `aiforcecli` wraps multiple coding agents — **Claude Code**, **OpenAI Codex**, and more —
6
- behind a single normalized interface, and adds the things a single-agent CLI structurally
7
- can't:
8
-
9
- - 🧭 **Advise** — instantly recommend the best agent+model for a task (public benchmarks blended with *your* results), before you spend a cent.
10
- - 🏁 **Race** — run several agents in parallel in isolated git worktrees, **verify each against your tests**, and apply the winner.
11
- - 🔁 **Self-heal** — run verify on failure feed the error back and escalate to a stronger agent until green.
12
- - 📊 **Bench / Eval** — record every outcome and learn which agent wins which task, per dollar, on *your* codebase.
13
- - 💸 **First-class FinOps** — every run's token usage and cost recorded locally; hard budgets enforced before and during a run.
14
-
15
- `aiforcecli` does **not** reimplement any agent. It shells out to the agent CLIs you already
16
- have installed and normalizes their I/O behind one adapter interface so it improves as the
17
- agents do, and you keep your existing auth and subscriptions.
18
-
19
- ## 60-second quickstart
20
-
21
- ```bash
22
- # 1. Install
23
- npm install -g aiforcecli # or: npx aiforcecli <command>
24
-
25
- # 2. Make sure at least one underlying agent CLI is installed
26
- npm install -g @anthropic-ai/claude-code # Claude Code
27
- npm install -g @openai/codex # Codex
28
- aiforcecli agents # shows install status
29
-
30
- # 3. Scaffold config in your project
31
- cd my-project
32
- aiforcecli init
33
-
34
- # 4. Run a task
35
- aiforcecli run "add a health check endpoint and a test for it"
36
-
37
- # 5. See what it cost
38
- aiforcecli cost
39
- ```
40
-
41
- That's it — you're productive on day one.
42
-
43
- ## Commands
44
-
45
- | Command | What it does |
46
- | --- | --- |
47
- | `aiforcecli run "<task>"` | Run a task with an agent. With no `--agent`, `aiforcecli` **auto-routes** to the best agent+model for the task within budget. Add `--heal` to **verify and self-heal** (see below). Flags: `--agent claude-code\|codex`, `--model <m>`, `--cwd <path>`, `--budget <usd>`, `--explain`, `--resume <sessionId>`, `--json`, `--heal`, `--max-attempts <n>`, `--verify <cmd>`, `--skip-verify`. |
48
- | `aiforcecli advise "<task>"` | **Recommend** which agent + model to use instantly, without running anything. Blends a public benchmark scorecard with your private outcomes. `--cwd <path>`, `--json`. |
49
- | `aiforcecli race "<task>"` | Run the task on **several agents in parallel** (isolated git worktrees), verify each, and apply the winner. Flags: `--agents claude-code,codex`, `--budget <usd>` (total, split across agents), `--select cheapest\|fastest\|first-pass`, `--verify <cmd>`, `--keep`, `--json`. |
50
- | `aiforcecli eval` | Run the **private eval suite** to calibrate `advise` on your codebase. `--dir <path>`, `--agents <ids>`, `--json`. |
51
- | `aiforcecli bench` | **Leaderboard** from your local history: which agent/model wins which task class, per dollar. `--since 7d\|24h\|<ISO>`, `--by-task`, `--json`. |
52
- | `aiforcecli agents` | List agents and whether their CLI is installed (`--json`). |
53
- | `aiforcecli cost` | Report spend. `--since 7d\|24h\|<ISO>`, `--by day\|agent\|project`, `--json`. |
54
- | `aiforcecli init` | Scaffold `aiforcecli.config.json` and install bundled assets. `--force` to overwrite. |
55
- | `aiforcecli mcp` | (Phase 3) Start the MCP server. |
56
-
57
- ## How it works
58
-
59
- ```
60
- CLI ── orchestrator ── AgentAdapter ── subprocess ── claude / codex
61
-
62
- ├── BudgetEnforcer (pre-run daily cap + mid-run per-run cap)
63
- └── UsageStore (SQLite: every run tagged agent/project/time/promptHash)
64
- ```
65
-
66
- Each adapter parses its agent's **native** output (Claude Code `--output-format
67
- stream-json`, Codex `exec --json` JSONL) and maps it onto one normalized event shape:
68
-
69
- ```ts
70
- type NormalizedEvent =
71
- | { type: 'token'; text: string }
72
- | { type: 'message'; role: 'assistant' | 'user'; text: string }
73
- | { type: 'tool_call'; name: string; input?: unknown; id?: string }
74
- | { type: 'usage'; usage: Usage; cumulative: boolean }
75
- | { type: 'error'; message: string; fatal: boolean };
76
- ```
77
-
78
- A run resolves to `{ message, usage: { inputTokens, outputTokens, costUsd, costSource },
79
- exitCode, sessionId, aborted }`.
80
-
81
- ## Routing
82
-
83
- Instead of hardcoding which agent and model to call, omit `--agent` and `aiforcecli`
84
- picks for you. It classifies the task (a transparent keyword + length
85
- heuristic), then selects the **most relevant** model whose estimated cost fits
86
- the effective budget the tightest of `--budget`, `maxCostPerRunUsd`, and the
87
- remaining daily headroom (`dailyCapUsd spent today`):
88
-
89
- ```
90
- CLI ── router ── classifyTask (light | standard | heavy)
91
-
92
- ├── model catalog (agent, model, tier, price)
93
- └── effectiveBudget (min of --budget / per-run cap / daily headroom)
94
- ```
95
-
96
- - A model that **meets** the task tier and fits the budget is preferred.
97
- - If budget is too tight, it **downgrades** to the most capable model that fits.
98
- - If nothing fits, it picks the cheapest and warns the budget enforcer still
99
- aborts the run if it overruns.
100
-
101
- ```bash
102
- aiforcecli run "fix a typo in the README" # → cheap, light model
103
- aiforcecli run "refactor the auth architecture" --budget 5 # → top model within $5
104
- aiforcecli run "<task>" --explain # show the decision, don't run
105
- ```
106
-
107
- An explicit `--agent` (and/or `--model`) always wins and skips routing. Tune via
108
- the `routing` block in config: `enabled`, `prefer` (tie-break agent order),
109
- `only` (restrict to catalog keys), and `models` (add custom catalog entries).
110
-
111
- ## Advise — pick the right agent before you start
112
-
113
- Before a complex project, `advise` gives you an instant, no-cost recommendation of which agent and
114
- model to use the cheap predictor that complements `race` (proof) and `run --heal` (verify):
115
-
116
- ```bash
117
- aiforcecli advise "migrate the auth module from sessions to JWT across the API"
118
- ```
119
-
120
- ```
121
- Task brief
122
- type security (auth, security)
123
- complexity heavy
124
- codebase TypeScript · 412 files · tests: yes (npm test)
125
- budget $4.21 headroom
126
-
127
- Recommendation
128
- → claude-code · opus confidence 82%
129
- expected ~$0.90 · ~88% one-shot pass
130
- Why
131
- • public eval: claude-code·opus ~88 on security
132
- your history: 91% pass over 9 security run(s)
133
- Runner-up codex·gpt-5-codex ~$0.40 · ~80% cheaper
134
- Run it: aiforcecli run "<task>" --agent claude-code --model opus --heal
135
- ```
136
-
137
- **How the recommendation is scored.** A transparent blend:
138
-
139
- ```
140
- capability = posterior(prior = public scorecard, evidence = your private pass rate)
141
- score = wCapability·capability + wCost·costFit + wSpeed·speedFit
142
- ```
143
-
144
- - **Public scorecard** (`src/advise/scorecard.ts`) — curated, dated priors per agent·model × task
145
- type, distilled from public coding benchmarks. A *prior*, not gospel; it goes stale, so your own
146
- data overrides it.
147
- - **Private outcomes** — your recorded `--heal`/`race`/`eval` results. Capability is a Bayesian
148
- blend: the public prior dominates until you have evidence, then your numbers take over
149
- (`advise.privatePseudocount` sets how fast). Confidence rises with your sample size.
150
-
151
- ### The learning policy (contextual bandit)
152
-
153
- Under the hood the recommendation is a **contextual bandit**. Each arm `(agent, model)` has a Beta
154
- posterior over its pass probability seeded by the public prior, updated by your verified outcomes
155
- for that task type. `advise` shows the **policy mix**: how often the policy would pick each arm.
156
-
157
- - **Exploit (default):** recommend the top expected arm.
158
- - **Explore (`advise --explore`, or `advise.explore: true` for routed `run`):** pick via **Thompson
159
- Sampling** sample each arm's posterior and take the winner so uncertain-but-promising arms get
160
- tried occasionally. That's how the policy *discovers* which agent quietly wins on your code instead
161
- of only exploiting the current best.
162
-
163
- **Off-policy logging (so it keeps learning):** every verified run records its decision **context**,
164
- the chosen arm's **propensity** (selection probability), and a scalar **reward** (pass, minus
165
- cost/latency/attempt penalties, plus acceptance). That turns your run history into a labeled dataset
166
- for unbiased off-policy improvement the foundation for a learned escalation/route policy next.
167
- The reward signal is automatic and objective because it comes from **your tests**, which is what
168
- makes the learning real rather than guesswork.
169
-
170
- **Calibrate it to your codebase** with the private eval suite. Drop task cases as JSON files in
171
- `.aiforcecli/evals/`:
172
-
173
- ```json
174
- { "prompt": "fix the off-by-one in parseRange", "verify": "npm test", "taskType": "bugfix" }
175
- ```
176
-
177
- Then run them against every candidate agent·model (each in an isolated worktree, verified):
178
-
179
- ```bash
180
- aiforcecli eval # all installed agents × all cases
181
- aiforcecli eval --agents claude-code # restrict the field
182
- ```
183
-
184
- Results are recorded as private evidence, so `advise` (and `bench`) get smarter about *your* repo
185
- with every run. `eval` runs real agents it costs money and time, so run it occasionally.
186
-
187
- ## Verify, self-heal & race
188
-
189
- aiforcecli sits *above* the agents, so it can do something they can't: run the work, **verify it
190
- against your tests**, and act on the result. Verification is auto-detected from the project
191
- (`npm test`, `pytest`, `cargo test`, `go test ./...`, …) and overridable via `verify.command` or
192
- `--verify "<cmd>"`.
193
-
194
- **Self-healing runs** — `aiforcecli run "<task>" --heal`
195
-
196
- ```
197
- run → verify → ✗ fail → retry same agent with the failure → ✗ → escalate to a stronger agent → ✓
198
- ```
199
-
200
- After the run, aiforcecli runs your verify command. On failure it feeds the output back to the agent
201
- (resuming the session), and if it still fails it **escalates to a stronger agent/model** looping
202
- until green, `heal.maxAttempts` is reached, or the budget runs out. A budget breach never escalates.
203
-
204
- **Race** — `aiforcecli race "<task>" --agents claude-code,codex`
205
-
206
- Runs the task on every agent **in parallel**, each in its own throwaway **git worktree** (your
207
- uncommitted changes are carried in, so they build on your current work). Each result is verified, and
208
- the **cheapest passing** diff (or `--select fastest|first-pass`) is applied to your tree the tests
209
- decide, not vibes. With no verify command, you get each candidate's diff and pick one. `--budget` is
210
- the *total* and is split across racers; `--keep` leaves the worktrees for inspection.
211
-
212
- A fresh worktree only has git-tracked files, so the repo's `node_modules` is **symlinked** into each
213
- one automatically otherwise `npm test` couldn't find its runner. For other ecosystems whose deps
214
- are gitignored, list them under `race.linkPaths` (e.g. `[".venv", "target", "vendor"]`).
215
-
216
- ```bash
217
- aiforcecli race "fix the failing auth test" --agents claude-code,codex --budget 2
218
- # claude-code · sonnet ✓ pass 42s $0.31
219
- # codex · gpt-5-codex ✗ fail 51s $0.27
220
- # applied claude-code's diff (cheapest passing) total race cost: $0.58
221
- ```
222
-
223
- **Leaderboard** — `aiforcecli bench`
224
-
225
- Every `--heal` and `race` run records an *outcome* (passed? won? cost? duration?). `bench` reports,
226
- from your own history, which agent/model actually wins which task class per dollar — turning routing
227
- from anecdote into evidence. Opt in to `telemetry` to contribute anonymized outcomes to a shared
228
- leaderboard (off by default; never sends prompts, paths, or output).
229
-
230
- ## FinOps & budgets
231
-
232
- - **Cost source of truth:** `aiforcecli` uses the cost the agent CLI reports when available
233
- (Claude Code reports `total_cost_usd`). When a CLI reports tokens only (Codex), cost is
234
- **computed** from a local pricing table `costSource` records which, so reports are
235
- never silently wrong.
236
- - **Daily cap (`dailyCapUsd`):** checked *before* a run starts. If today's spend already
237
- meets the cap, the run is refused (exit code `2`).
238
- - **Per-run cap (`maxCostPerRunUsd`, or `--budget`):** enforced *during* the run. As usage
239
- events arrive, if cumulative cost crosses the cap the subprocess is aborted
240
- (SIGTERM SIGKILL).
241
- - **Honest limitation:** agents that only report usage at the *end* of a run can only
242
- be caught after the fact. Mid-run enforcement is best-effort for those.
243
- - **Never hangs:** every run has a wall-clock `timeoutMs` and an `inactivityTimeoutMs`
244
- watchdog; a stuck subprocess is always killed.
245
-
246
- ## Configuration
247
-
248
- `aiforcecli.config.json` (or `.ts`) is discovered at two levels, deep-merged (**project wins**), then
249
- CLI flags win over both:
250
-
251
- 1. **User level** your defaults, applied in *every* folder. Put the file here once; you don't need
252
- one per project. The directory is platform-specific:
253
- - **Windows:** `%APPDATA%\aiforcecli\Config\aiforcecli.config.json`
254
- - **macOS:** `~/Library/Preferences/aiforcecli/aiforcecli.config.json`
255
- - **Linux:** `~/.config/aiforcecli/aiforcecli.config.json` (or `$XDG_CONFIG_HOME`)
256
- 2. **Project level** — `aiforcecli.config.json` in the working directory, for per-project overrides only.
257
-
258
- The CLI command itself (installed globally via `npm i -g aiforcecli`, or `npm link` for a dev build)
259
- is separate from config — installing it once is enough; settings come from the files above. See
260
- [`aiforcecli.config.example.json`](./aiforcecli.config.example.json) and the `examples/` directory.
261
-
262
- | Block | Key | Default | Purpose |
263
- | --- | --- | --- | --- |
264
- | `agents.<id>` | `model`, `bin`, `defaultFlags`, `allowedTools` | | Per-agent overrides (binary path, forced model, default flags, tool allow-list). |
265
- | `defaultAgent` | | `claude-code` | Agent used by `run` when routing is off and no `--agent`. |
266
- | `routing` | `enabled`, `prefer`, `only`, `models` | `enabled:true` | Budget-aware auto-routing; `only` restricts the catalog, `models` adds custom entries. |
267
- | `verify` | `enabled`, `command`, `timeoutMs` | `enabled:true`, 300 s | How results are verified; `command` overrides auto-detection. |
268
- | `heal` | `enabled`, `maxAttempts`, `escalate` | `false`, `3`, `true` | Self-healing loop behavior for `run --heal`. |
269
- | `race` | `agents`, `select`, `keepWorktrees`, `linkPaths` | `select:cheapest` | Defaults for `race`; `linkPaths` symlinks extra dep dirs into worktrees. |
270
- | `advise` | `weights`, `privatePseudocount`, `explore` | `0.7/0.2/0.1`, `5`, `false` | Recommendation scoring weights, prior strength, and Thompson-Sampling exploration. |
271
- | `eval` | `dir`, `models` | `.aiforcecli/evals` | Private eval suite location and which catalog models to evaluate. |
272
- | `telemetry` | `enabled`, `endpoint` | `false` | Opt-in anonymized outcome upload for the shared leaderboard. |
273
- | `budgets` | `maxCostPerRunUsd`, `dailyCapUsd` | | Per-run and daily spend caps. |
274
- | top-level | `timeoutMs`, `inactivityTimeoutMs` | 600 s, 120 s | Wall-clock and inactivity watchdogs for every run. |
275
-
276
- ## Development
277
-
278
- ```bash
279
- npm install
280
- npm run build # tsup -> dist/ (ESM)
281
- npm test # vitest — adapter parsers run against recorded fixtures
282
- npm run typecheck
283
- ```
284
-
285
- Adapter parser tests use **recorded real outputs** from each CLI under
286
- `test/fixtures/`. To refresh them, re-capture with:
287
-
288
- ```bash
289
- claude -p "say hi" --output-format stream-json --verbose > test/fixtures/claude-code/hello.stream.jsonl
290
- codex exec --json --sandbox workspace-write "say hi" > test/fixtures/codex/hello.jsonl
291
- ```
292
-
293
- ## Documentation
294
-
295
- - [Product Requirements (PRD)](./docs/PRD.md) — problem, users, features, scope, success metrics.
296
- - [Business Requirements (BRD)](./docs/BRD.md) — market, value proposition, model, KPIs.
297
-
298
- ## Roadmap
299
-
300
- - **Shipped:** adapters (claude-code, codex), config, FinOps + budgets, budget-aware routing, `run` (+ `--heal`), `race`, `advise` (**contextual-bandit policy + off-policy logging**), `eval`, `bench`, `cost`, `agents`, `init`.
301
- - **Next:** learned escalation/route policy trained from the logged (context, arm, reward, propensity) data; surface agent fatal errors in `race`/`eval`; richer task classification (`--deep`); bundled `assets/` via `init`.
302
- - **Later:** `aiforcecli mcp` MCP server with auth + hard budget cap; hosted shared leaderboard + federated global/local policy; team/org budgets and reporting.
1
+ # aiforcecli
2
+
3
+ **One interface over every coding agent — that you can trust and afford.**
4
+
5
+ `aiforcecli` wraps multiple coding agents — **Claude Code**, **OpenAI Codex**, **Aider**
6
+ (which itself routes to DeepSeek, Mistral/Codestral, Qwen, Ollama, OpenRouter, and more), and
7
+ **Antigravity** (Google's `agy`, for Gemini models) — behind a single normalized interface, and
8
+ adds the things a single-agent CLI structurally can't:
9
+
10
+ - 🧭 **Advise** — instantly recommend the best agent+model for a task (public benchmarks blended with *your* results), before you spend a cent.
11
+ - 🏁 **Race** — run several agents in parallel in isolated git worktrees, **verify each against your tests**, and apply the winner.
12
+ - 🔁 **Self-heal** — run verify on failure feed the error back and escalate to a stronger agent until green.
13
+ - 📊 **Bench / Eval** — record every outcome and learn which agent wins which task, per dollar, on *your* codebase.
14
+ - 💸 **First-class FinOps** — every run's token usage and cost recorded locally; hard budgets enforced before and during a run.
15
+
16
+ `aiforcecli` does **not** reimplement any agent. It shells out to the agent CLIs you already
17
+ have installed and normalizes their I/O behind one adapter interface — so it improves as the
18
+ agents do, and you keep your existing auth and subscriptions.
19
+
20
+ ## 60-second quickstart
21
+
22
+ ```bash
23
+ # 1. Install
24
+ npm install -g aiforcecli # or: npx aiforcecli <command>
25
+
26
+ # 2. Make sure at least one underlying agent CLI is installed
27
+ npm install -g @anthropic-ai/claude-code # Claude Code
28
+ npm install -g @openai/codex # Codex
29
+ aiforcecli agents # shows install status
30
+
31
+ # 3. Scaffold config in your project
32
+ cd my-project
33
+ aiforcecli init
34
+
35
+ # 4. Run a task
36
+ aiforcecli run "add a health check endpoint and a test for it"
37
+
38
+ # 5. See what it cost
39
+ aiforcecli cost
40
+ ```
41
+
42
+ That's it — you're productive on day one.
43
+
44
+ ## Commands
45
+
46
+ | Command | What it does |
47
+ | --- | --- |
48
+ | `aiforcecli run "<task>"` | Run a task with an agent. With no `--agent`, `aiforcecli` **auto-routes** to the best agent+model for the task within budget. Add `--heal` to **verify and self-heal** (see below). Flags: `--agent claude-code\|codex\|aider\|antigravity`, `--model <m>`, `--cwd <path>`, `--budget <usd>`, `--route deterministic\|bayesian`, `--explore`/`--no-explore` (bayesian), `--explain`, `--resume <sessionId>`, `--json`, `--heal`, `--max-attempts <n>`, `--verify <cmd>`, `--skip-verify`. |
49
+ | `aiforcecli advise "<task>"` | **Recommend** which agent + model to use instantly, without running anything. Blends a public benchmark scorecard with your private outcomes. `--cwd <path>`, `--budget <usd>` (preview under a cap), `--explore`, `--json`. |
50
+ | `aiforcecli race "<task>"` | Run the task on **several agents in parallel** (isolated git worktrees), verify each, and apply the winner. Flags: `--agents claude-code,codex`, `--budget <usd>` (total, split across agents), `--select cheapest\|fastest\|first-pass`, `--verify <cmd>`, `--keep`, `--json`. |
51
+ | `aiforcecli eval` | Run the **private eval suite** to calibrate `advise` on your codebase. `--dir <path>`, `--agents <ids>`, `--json`. |
52
+ | `aiforcecli bench` | **Leaderboard** from your local history: which agent/model wins which task class, per dollar. `--since 7d\|24h\|<ISO>`, `--by-task`, `--json`. |
53
+ | `aiforcecli agents` | List agents and whether their CLI is installed (`--json`). |
54
+ | `aiforcecli cost` | Report spend. `--since 7d\|24h\|<ISO>`, `--by day\|agent\|project`, `--json`. |
55
+ | `aiforcecli init` | Scaffold `aiforcecli.config.json` and install bundled assets. `--force` to overwrite. |
56
+ | `aiforcecli mcp` | (Phase 3) Start the MCP server. |
57
+
58
+ ## How it works
59
+
60
+ ```
61
+ CLI ── orchestrator ── AgentAdapter ── subprocess ── claude / codex
62
+
63
+ ├── BudgetEnforcer (pre-run daily cap + mid-run per-run cap)
64
+ └── UsageStore (SQLite: every run tagged agent/project/time/promptHash)
65
+ ```
66
+
67
+ Each adapter parses its agent's **native** output (Claude Code `--output-format
68
+ stream-json`, Codex `exec --json` JSONL) and maps it onto one normalized event shape:
69
+
70
+ ```ts
71
+ type NormalizedEvent =
72
+ | { type: 'token'; text: string }
73
+ | { type: 'message'; role: 'assistant' | 'user'; text: string }
74
+ | { type: 'tool_call'; name: string; input?: unknown; id?: string }
75
+ | { type: 'usage'; usage: Usage; cumulative: boolean }
76
+ | { type: 'error'; message: string; fatal: boolean };
77
+ ```
78
+
79
+ A run resolves to `{ message, usage: { inputTokens, outputTokens, costUsd, costSource },
80
+ exitCode, sessionId, aborted }`.
81
+
82
+ ## Routing
83
+
84
+ Instead of hardcoding which agent and model to call, omit `--agent` and `aiforcecli`
85
+ picks for you. It classifies the task (a transparent keyword + length
86
+ heuristic), then selects the **most relevant** model whose estimated cost fits
87
+ the effective budget the tightest of `--budget`, `maxCostPerRunUsd`, and the
88
+ remaining daily headroom (`dailyCapUsd − spent today`):
89
+
90
+ ```
91
+ CLI ── router ── classifyTask (light | standard | heavy)
92
+
93
+ ├── model catalog (agent, model, tier, price)
94
+ └── effectiveBudget (min of --budget / per-run cap / daily headroom)
95
+ ```
96
+
97
+ - A model that **meets** the task tier and fits the budget is preferred.
98
+ - If budget is too tight, it **downgrades** to the most capable model that fits.
99
+ - If nothing fits, it picks the cheapest and warns — the budget enforcer still
100
+ aborts the run if it overruns.
101
+
102
+ ```bash
103
+ aiforcecli run "fix a typo in the README" # → cheap, light model
104
+ aiforcecli run "refactor the auth architecture" --budget 5 # top model within $5
105
+ aiforcecli run "<task>" --explain # show the decision, don't run
106
+ ```
107
+
108
+ An explicit `--agent` (and/or `--model`) always wins and skips routing. Tune via
109
+ the `routing` block in config: `enabled`, `prefer` (tie-break agent order),
110
+ `only` (restrict to catalog keys), and `models` (add custom catalog entries).
111
+
112
+ ## Advise — pick the right agent before you start
113
+
114
+ Before a complex project, `advise` gives you an instant, no-cost recommendation of which agent and
115
+ model to use — the cheap predictor that complements `race` (proof) and `run --heal` (verify):
116
+
117
+ ```bash
118
+ aiforcecli advise "migrate the auth module from sessions to JWT across the API"
119
+ ```
120
+
121
+ ```
122
+ Task brief
123
+ type security (auth, security)
124
+ complexity heavy
125
+ codebase TypeScript · 412 files · tests: yes (npm test)
126
+ budget $4.21 headroom
127
+
128
+ Recommendation
129
+ claude-code · opus confidence 82%
130
+ expected ~$0.90 · ~88% one-shot pass
131
+ Why
132
+ public eval: claude-code·opus ~88 on security
133
+ your history: 91% pass over 9 security run(s)
134
+ Runner-up codex·gpt-5-codex ~$0.40 · ~80% cheaper
135
+ Run it: aiforcecli run "<task>" --agent claude-code --model opus --heal
136
+ ```
137
+
138
+ **How the recommendation is scored.** A transparent blend:
139
+
140
+ ```
141
+ capability = posterior(prior = public scorecard, evidence = your private pass rate)
142
+ score = wCapability·capability + wCost·costFit + wSpeed·speedFit
143
+ ```
144
+
145
+ - **Public scorecard** (`src/advise/scorecard.ts`) curated, dated priors per agent·model × task
146
+ type, distilled from public coding benchmarks. A *prior*, not gospel; it goes stale, so your own
147
+ data overrides it.
148
+ - **Private outcomes** your recorded `--heal`/`race`/`eval` results. Capability is a Bayesian
149
+ blend: the public prior dominates until you have evidence, then your numbers take over
150
+ (`advise.privatePseudocount` sets how fast). Confidence rises with your sample size.
151
+
152
+ ### The learning policy (contextual bandit)
153
+
154
+ Under the hood the recommendation is a **contextual bandit**. Each arm `(agent, model)` has a Beta
155
+ posterior over its pass probability seeded by the public prior, updated by your verified outcomes
156
+ for that task type. `advise` shows the **policy mix**: how often the policy would pick each arm.
157
+
158
+ - **Exploit (default):** recommend the top expected arm.
159
+ - **Explore (`advise --explore`, or `advise.explore: true` for routed `run`):** pick via **Thompson
160
+ Sampling** sample each arm's posterior and take the winner so uncertain-but-promising arms get
161
+ tried occasionally. That's how the policy *discovers* which agent quietly wins on your code instead
162
+ of only exploiting the current best.
163
+
164
+ **Off-policy logging (so it keeps learning):** every verified run records its decision **context**,
165
+ the chosen arm's **propensity** (selection probability), and a scalar **reward** (pass, minus
166
+ cost/latency/attempt penalties, plus acceptance). That turns your run history into a labeled dataset
167
+ for unbiased off-policy improvement the foundation for a learned escalation/route policy next.
168
+ The reward signal is automatic and objective because it comes from **your tests**, which is what
169
+ makes the learning real rather than guesswork.
170
+
171
+ **Calibrate it to your codebase** with the private eval suite. Drop task cases as JSON files in
172
+ `.aiforcecli/evals/`:
173
+
174
+ ```json
175
+ { "prompt": "fix the off-by-one in parseRange", "verify": "npm test", "taskType": "bugfix" }
176
+ ```
177
+
178
+ Then run them against every candidate agent·model (each in an isolated worktree, verified):
179
+
180
+ ```bash
181
+ aiforcecli eval # all installed agents × all cases
182
+ aiforcecli eval --agents claude-code # restrict the field
183
+ ```
184
+
185
+ Results are recorded as private evidence, so `advise` (and `bench`) get smarter about *your* repo
186
+ with every run. `eval` runs real agents — it costs money and time, so run it occasionally.
187
+
188
+ ## Verify, self-heal & race
189
+
190
+ aiforcecli sits *above* the agents, so it can do something they can't: run the work, **verify it
191
+ against your tests**, and act on the result. Verification is auto-detected from the project
192
+ (`npm test`, `pytest`, `cargo test`, `go test ./...`, …) and overridable via `verify.command` or
193
+ `--verify "<cmd>"`.
194
+
195
+ **Self-healing runs** — `aiforcecli run "<task>" --heal`
196
+
197
+ ```
198
+ run → verify → ✗ fail → retry same agent with the failure → ✗ → escalate to a stronger agent → ✓
199
+ ```
200
+
201
+ After the run, aiforcecli runs your verify command. On failure it feeds the output back to the agent
202
+ (resuming the session), and if it still fails it **escalates to a stronger agent/model** looping
203
+ until green, `heal.maxAttempts` is reached, or the budget runs out. A budget breach never escalates.
204
+
205
+ **Race** — `aiforcecli race "<task>" --agents claude-code,codex`
206
+
207
+ Runs the task on every agent **in parallel**, each in its own throwaway **git worktree** (your
208
+ uncommitted changes are carried in, so they build on your current work). Each result is verified, and
209
+ the **cheapest passing** diff (or `--select fastest|first-pass`) is applied to your tree the tests
210
+ decide, not vibes. With no verify command, you get each candidate's diff and pick one. `--budget` is
211
+ the *total* and is split across racers; `--keep` leaves the worktrees for inspection.
212
+
213
+ A fresh worktree only has git-tracked files, so the repo's `node_modules` is **symlinked** into each
214
+ one automatically otherwise `npm test` couldn't find its runner. For other ecosystems whose deps
215
+ are gitignored, list them under `race.linkPaths` (e.g. `[".venv", "target", "vendor"]`).
216
+
217
+ ```bash
218
+ aiforcecli race "fix the failing auth test" --agents claude-code,codex --budget 2
219
+ # claude-code · sonnet ✓ pass 42s $0.31
220
+ # codex · gpt-5-codex ✗ fail 51s $0.27
221
+ # → applied claude-code's diff (cheapest passing) total race cost: $0.58
222
+ ```
223
+
224
+ **Leaderboard** — `aiforcecli bench`
225
+
226
+ Every `--heal` and `race` run records an *outcome* (passed? won? cost? duration?). `bench` reports,
227
+ from your own history, which agent/model actually wins which task class per dollar turning routing
228
+ from anecdote into evidence. Opt in to `telemetry` to contribute anonymized outcomes to a shared
229
+ leaderboard (off by default; never sends prompts, paths, or output).
230
+
231
+ ## FinOps & budgets
232
+
233
+ - **Cost source of truth:** `aiforcecli` uses the cost the agent CLI reports when available
234
+ (Claude Code reports `total_cost_usd`). When a CLI reports tokens only (Codex), cost is
235
+ **computed** from a local pricing table — `costSource` records which, so reports are
236
+ never silently wrong.
237
+ - **Window caps (`dailyCapUsd`, `weeklyCapUsd`, `monthlyCapUsd`):** checked *before* a run
238
+ starts. If spend in any window already meets its cap (today / this calendar week from Monday /
239
+ this calendar month), the run is refused (exit code `2`). They also tighten the routing budget.
240
+ - **Per-run cap (`maxCostPerRunUsd`, or `--budget`):** enforced *during* the run. As usage
241
+ events arrive, if cumulative cost crosses the cap the subprocess is aborted
242
+ (SIGTERM SIGKILL).
243
+ - **Honest limitation:** agents that only report usage at the *end* of a run can only
244
+ be caught after the fact. Mid-run enforcement is best-effort for those.
245
+ - **Never hangs:** every run has a wall-clock `timeoutMs` and an `inactivityTimeoutMs`
246
+ watchdog; a stuck subprocess is always killed.
247
+
248
+ ## Configuration
249
+
250
+ `aiforcecli.config.json` (or `.ts`) is discovered at two levels, deep-merged (**project wins**), then
251
+ CLI flags win over both:
252
+
253
+ 1. **User level** — your defaults, applied in *every* folder. Put the file here once; you don't need
254
+ one per project. The directory is platform-specific:
255
+ - **Windows:** `%APPDATA%\aiforcecli\Config\aiforcecli.config.json`
256
+ - **macOS:** `~/Library/Preferences/aiforcecli/aiforcecli.config.json`
257
+ - **Linux:** `~/.config/aiforcecli/aiforcecli.config.json` (or `$XDG_CONFIG_HOME`)
258
+ 2. **Project level** `aiforcecli.config.json` in the working directory, for per-project overrides only.
259
+
260
+ The CLI command itself (installed globally via `npm i -g aiforcecli`, or `npm link` for a dev build)
261
+ is separate from config — installing it once is enough; settings come from the files above. See
262
+ [`aiforcecli.config.example.json`](./aiforcecli.config.example.json) and the `examples/` directory.
263
+
264
+ | Block | Key | Default | Purpose |
265
+ | --- | --- | --- | --- |
266
+ | `agents.<id>` | `enabled`, `model`, `bin`, `defaultFlags`, `allowedTools` | `enabled:true` | Per-agent settings. `enabled:false` removes the agent everywhere. Plus binary path, forced model, default flags, tool allow-list. |
267
+ | `defaultAgent` | | `claude-code` | Agent used by `run` when routing is off and no `--agent`. |
268
+ | `routing` | `enabled`, `strategy`, `prefer`, `only`, `models` | `enabled:true`, `strategy:deterministic` | Auto-routing. `strategy`: `deterministic` (budget-aware tier router) or `bayesian` (learned engine; honors `advise.explore`). `only` restricts the catalog, `models` adds entries. Per-run override: `run --route`. |
269
+ | `verify` | `enabled`, `command`, `timeoutMs` | `enabled:true`, 300 s | How results are verified; `command` overrides auto-detection. |
270
+ | `heal` | `enabled`, `maxAttempts`, `escalate` | `false`, `3`, `true` | Self-healing loop behavior for `run --heal`. |
271
+ | `race` | `agents`, `select`, `keepWorktrees`, `linkPaths` | `select:cheapest` | Defaults for `race`; `linkPaths` symlinks extra dep dirs into worktrees. |
272
+ | `advise` | `weights`, `privatePseudocount`, `explore` | `0.7/0.2/0.1`, `5`, `false` | Recommendation scoring weights, prior strength, and Thompson-Sampling exploration. |
273
+ | `eval` | `dir`, `models` | `.aiforcecli/evals` | Private eval suite location and which catalog models to evaluate. |
274
+ | `telemetry` | `enabled`, `endpoint` | `false` | Opt-in anonymized outcome upload for the shared leaderboard. |
275
+ | `budgets` | `maxCostPerRunUsd`, `dailyCapUsd`, `weeklyCapUsd`, `monthlyCapUsd` | — | Per-run cap + daily/weekly/monthly spend caps (refuse to start once a window cap is hit). |
276
+ | top-level | `timeoutMs`, `inactivityTimeoutMs` | 600 s, 120 s | Wall-clock and inactivity watchdogs for every run. |
277
+
278
+ ## Development
279
+
280
+ ```bash
281
+ npm install
282
+ npm run build # tsup -> dist/ (ESM)
283
+ npm test # vitest — adapter parsers run against recorded fixtures
284
+ npm run typecheck
285
+ ```
286
+
287
+ Adapter parser tests use **recorded real outputs** from each CLI under
288
+ `test/fixtures/`. To refresh them, re-capture with:
289
+
290
+ ```bash
291
+ claude -p "say hi" --output-format stream-json --verbose > test/fixtures/claude-code/hello.stream.jsonl
292
+ codex exec --json --sandbox workspace-write "say hi" > test/fixtures/codex/hello.jsonl
293
+ ```
294
+
295
+ ## Documentation
296
+
297
+ - [Feature Reference (FEATURES)](./docs/FEATURES.md) — every feature in depth: public eval, the RL/bandit policy, race, heal, FinOps, config, data model.
298
+ - [Product Requirements (PRD)](./docs/PRD.md) — problem, users, features, scope, success metrics.
299
+ - [Business Requirements (BRD)](./docs/BRD.md) — market, value proposition, model, KPIs.
300
+
301
+ ## Roadmap
302
+
303
+ - **Shipped:** adapters (claude-code, codex), config, FinOps + budgets, budget-aware routing, `run` (+ `--heal`), `race`, `advise` (**contextual-bandit policy + off-policy logging**), `eval`, `bench`, `cost`, `agents`, `init`.
304
+ - **Next:** learned escalation/route policy trained from the logged (context, arm, reward, propensity) data; surface agent fatal errors in `race`/`eval`; richer task classification (`--deep`); bundled `assets/` via `init`.
305
+ - **Later:** `aiforcecli mcp` MCP server with auth + hard budget cap; hosted shared leaderboard + federated global/local policy; team/org budgets and reporting.