aiforcecli 0.2.0 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +303 -302
- package/aiforcecli.config.example.json +50 -50
- package/assets/README.md +14 -14
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/package.json +45 -45
package/README.md
CHANGED
|
@@ -1,302 +1,303 @@
|
|
|
1
|
-
# aiforcecli
|
|
2
|
-
|
|
3
|
-
**One interface over every coding agent — that you can trust and afford.**
|
|
4
|
-
|
|
5
|
-
`aiforcecli` wraps multiple coding agents — **Claude Code**, **OpenAI Codex**, and more —
|
|
6
|
-
behind a single normalized interface, and adds the things a single-agent CLI structurally
|
|
7
|
-
can't:
|
|
8
|
-
|
|
9
|
-
- 🧭 **Advise** — instantly recommend the best agent+model for a task (public benchmarks blended with *your* results), before you spend a cent.
|
|
10
|
-
- 🏁 **Race** — run several agents in parallel in isolated git worktrees, **verify each against your tests**, and apply the winner.
|
|
11
|
-
- 🔁 **Self-heal** — run → verify → on failure feed the error back and escalate to a stronger agent until green.
|
|
12
|
-
- 📊 **Bench / Eval** — record every outcome and learn which agent wins which task, per dollar, on *your* codebase.
|
|
13
|
-
- 💸 **First-class FinOps** — every run's token usage and cost recorded locally; hard budgets enforced before and during a run.
|
|
14
|
-
|
|
15
|
-
`aiforcecli` does **not** reimplement any agent. It shells out to the agent CLIs you already
|
|
16
|
-
have installed and normalizes their I/O behind one adapter interface — so it improves as the
|
|
17
|
-
agents do, and you keep your existing auth and subscriptions.
|
|
18
|
-
|
|
19
|
-
## 60-second quickstart
|
|
20
|
-
|
|
21
|
-
```bash
|
|
22
|
-
# 1. Install
|
|
23
|
-
npm install -g aiforcecli # or: npx aiforcecli <command>
|
|
24
|
-
|
|
25
|
-
# 2. Make sure at least one underlying agent CLI is installed
|
|
26
|
-
npm install -g @anthropic-ai/claude-code # Claude Code
|
|
27
|
-
npm install -g @openai/codex # Codex
|
|
28
|
-
aiforcecli agents # shows install status
|
|
29
|
-
|
|
30
|
-
# 3. Scaffold config in your project
|
|
31
|
-
cd my-project
|
|
32
|
-
aiforcecli init
|
|
33
|
-
|
|
34
|
-
# 4. Run a task
|
|
35
|
-
aiforcecli run "add a health check endpoint and a test for it"
|
|
36
|
-
|
|
37
|
-
# 5. See what it cost
|
|
38
|
-
aiforcecli cost
|
|
39
|
-
```
|
|
40
|
-
|
|
41
|
-
That's it — you're productive on day one.
|
|
42
|
-
|
|
43
|
-
## Commands
|
|
44
|
-
|
|
45
|
-
| Command | What it does |
|
|
46
|
-
| --- | --- |
|
|
47
|
-
| `aiforcecli run "<task>"` | Run a task with an agent. With no `--agent`, `aiforcecli` **auto-routes** to the best agent+model for the task within budget. Add `--heal` to **verify and self-heal** (see below). Flags: `--agent claude-code\|codex`, `--model <m>`, `--cwd <path>`, `--budget <usd>`, `--explain`, `--resume <sessionId>`, `--json`, `--heal`, `--max-attempts <n>`, `--verify <cmd>`, `--skip-verify`. |
|
|
48
|
-
| `aiforcecli advise "<task>"` | **Recommend** which agent + model to use — instantly, without running anything. Blends a public benchmark scorecard with your private outcomes. `--cwd <path>`, `--json`. |
|
|
49
|
-
| `aiforcecli race "<task>"` | Run the task on **several agents in parallel** (isolated git worktrees), verify each, and apply the winner. Flags: `--agents claude-code,codex`, `--budget <usd>` (total, split across agents), `--select cheapest\|fastest\|first-pass`, `--verify <cmd>`, `--keep`, `--json`. |
|
|
50
|
-
| `aiforcecli eval` | Run the **private eval suite** to calibrate `advise` on your codebase. `--dir <path>`, `--agents <ids>`, `--json`. |
|
|
51
|
-
| `aiforcecli bench` | **Leaderboard** from your local history: which agent/model wins which task class, per dollar. `--since 7d\|24h\|<ISO>`, `--by-task`, `--json`. |
|
|
52
|
-
| `aiforcecli agents` | List agents and whether their CLI is installed (`--json`). |
|
|
53
|
-
| `aiforcecli cost` | Report spend. `--since 7d\|24h\|<ISO>`, `--by day\|agent\|project`, `--json`. |
|
|
54
|
-
| `aiforcecli init` | Scaffold `aiforcecli.config.json` and install bundled assets. `--force` to overwrite. |
|
|
55
|
-
| `aiforcecli mcp` | (Phase 3) Start the MCP server. |
|
|
56
|
-
|
|
57
|
-
## How it works
|
|
58
|
-
|
|
59
|
-
```
|
|
60
|
-
CLI ── orchestrator ── AgentAdapter ── subprocess ── claude / codex
|
|
61
|
-
│
|
|
62
|
-
├── BudgetEnforcer (pre-run daily cap + mid-run per-run cap)
|
|
63
|
-
└── UsageStore (SQLite: every run tagged agent/project/time/promptHash)
|
|
64
|
-
```
|
|
65
|
-
|
|
66
|
-
Each adapter parses its agent's **native** output (Claude Code `--output-format
|
|
67
|
-
stream-json`, Codex `exec --json` JSONL) and maps it onto one normalized event shape:
|
|
68
|
-
|
|
69
|
-
```ts
|
|
70
|
-
type NormalizedEvent =
|
|
71
|
-
| { type: 'token'; text: string }
|
|
72
|
-
| { type: 'message'; role: 'assistant' | 'user'; text: string }
|
|
73
|
-
| { type: 'tool_call'; name: string; input?: unknown; id?: string }
|
|
74
|
-
| { type: 'usage'; usage: Usage; cumulative: boolean }
|
|
75
|
-
| { type: 'error'; message: string; fatal: boolean };
|
|
76
|
-
```
|
|
77
|
-
|
|
78
|
-
A run resolves to `{ message, usage: { inputTokens, outputTokens, costUsd, costSource },
|
|
79
|
-
exitCode, sessionId, aborted }`.
|
|
80
|
-
|
|
81
|
-
## Routing
|
|
82
|
-
|
|
83
|
-
Instead of hardcoding which agent and model to call, omit `--agent` and `aiforcecli`
|
|
84
|
-
picks for you. It classifies the task (a transparent keyword + length
|
|
85
|
-
heuristic), then selects the **most relevant** model whose estimated cost fits
|
|
86
|
-
the effective budget — the tightest of `--budget`, `maxCostPerRunUsd`, and the
|
|
87
|
-
remaining daily headroom (`dailyCapUsd − spent today`):
|
|
88
|
-
|
|
89
|
-
```
|
|
90
|
-
CLI ── router ── classifyTask (light | standard | heavy)
|
|
91
|
-
│
|
|
92
|
-
├── model catalog (agent, model, tier, price)
|
|
93
|
-
└── effectiveBudget (min of --budget / per-run cap / daily headroom)
|
|
94
|
-
```
|
|
95
|
-
|
|
96
|
-
- A model that **meets** the task tier and fits the budget is preferred.
|
|
97
|
-
- If budget is too tight, it **downgrades** to the most capable model that fits.
|
|
98
|
-
- If nothing fits, it picks the cheapest and warns — the budget enforcer still
|
|
99
|
-
aborts the run if it overruns.
|
|
100
|
-
|
|
101
|
-
```bash
|
|
102
|
-
aiforcecli run "fix a typo in the README" # → cheap, light model
|
|
103
|
-
aiforcecli run "refactor the auth architecture" --budget 5 # → top model within $5
|
|
104
|
-
aiforcecli run "<task>" --explain # show the decision, don't run
|
|
105
|
-
```
|
|
106
|
-
|
|
107
|
-
An explicit `--agent` (and/or `--model`) always wins and skips routing. Tune via
|
|
108
|
-
the `routing` block in config: `enabled`, `prefer` (tie-break agent order),
|
|
109
|
-
`only` (restrict to catalog keys), and `models` (add custom catalog entries).
|
|
110
|
-
|
|
111
|
-
## Advise — pick the right agent before you start
|
|
112
|
-
|
|
113
|
-
Before a complex project, `advise` gives you an instant, no-cost recommendation of which agent and
|
|
114
|
-
model to use — the cheap predictor that complements `race` (proof) and `run --heal` (verify):
|
|
115
|
-
|
|
116
|
-
```bash
|
|
117
|
-
aiforcecli advise "migrate the auth module from sessions to JWT across the API"
|
|
118
|
-
```
|
|
119
|
-
|
|
120
|
-
```
|
|
121
|
-
Task brief
|
|
122
|
-
type security (auth, security)
|
|
123
|
-
complexity heavy
|
|
124
|
-
codebase TypeScript · 412 files · tests: yes (npm test)
|
|
125
|
-
budget $4.21 headroom
|
|
126
|
-
|
|
127
|
-
Recommendation
|
|
128
|
-
→ claude-code · opus confidence 82%
|
|
129
|
-
expected ~$0.90 · ~88% one-shot pass
|
|
130
|
-
Why
|
|
131
|
-
• public eval: claude-code·opus ~88 on security
|
|
132
|
-
• your history: 91% pass over 9 security run(s)
|
|
133
|
-
Runner-up codex·gpt-5-codex ~$0.40 · ~80% — cheaper
|
|
134
|
-
Run it: aiforcecli run "<task>" --agent claude-code --model opus --heal
|
|
135
|
-
```
|
|
136
|
-
|
|
137
|
-
**How the recommendation is scored.** A transparent blend:
|
|
138
|
-
|
|
139
|
-
```
|
|
140
|
-
capability = posterior(prior = public scorecard, evidence = your private pass rate)
|
|
141
|
-
score = wCapability·capability + wCost·costFit + wSpeed·speedFit
|
|
142
|
-
```
|
|
143
|
-
|
|
144
|
-
- **Public scorecard** (`src/advise/scorecard.ts`) — curated, dated priors per agent·model × task
|
|
145
|
-
type, distilled from public coding benchmarks. A *prior*, not gospel; it goes stale, so your own
|
|
146
|
-
data overrides it.
|
|
147
|
-
- **Private outcomes** — your recorded `--heal`/`race`/`eval` results. Capability is a Bayesian
|
|
148
|
-
blend: the public prior dominates until you have evidence, then your numbers take over
|
|
149
|
-
(`advise.privatePseudocount` sets how fast). Confidence rises with your sample size.
|
|
150
|
-
|
|
151
|
-
### The learning policy (contextual bandit)
|
|
152
|
-
|
|
153
|
-
Under the hood the recommendation is a **contextual bandit**. Each arm `(agent, model)` has a Beta
|
|
154
|
-
posterior over its pass probability — seeded by the public prior, updated by your verified outcomes
|
|
155
|
-
for that task type. `advise` shows the **policy mix**: how often the policy would pick each arm.
|
|
156
|
-
|
|
157
|
-
- **Exploit (default):** recommend the top expected arm.
|
|
158
|
-
- **Explore (`advise --explore`, or `advise.explore: true` for routed `run`):** pick via **Thompson
|
|
159
|
-
Sampling** — sample each arm's posterior and take the winner — so uncertain-but-promising arms get
|
|
160
|
-
tried occasionally. That's how the policy *discovers* which agent quietly wins on your code instead
|
|
161
|
-
of only exploiting the current best.
|
|
162
|
-
|
|
163
|
-
**Off-policy logging (so it keeps learning):** every verified run records its decision **context**,
|
|
164
|
-
the chosen arm's **propensity** (selection probability), and a scalar **reward** (pass, minus
|
|
165
|
-
cost/latency/attempt penalties, plus acceptance). That turns your run history into a labeled dataset
|
|
166
|
-
for unbiased off-policy improvement — the foundation for a learned escalation/route policy next.
|
|
167
|
-
The reward signal is automatic and objective because it comes from **your tests**, which is what
|
|
168
|
-
makes the learning real rather than guesswork.
|
|
169
|
-
|
|
170
|
-
**Calibrate it to your codebase** with the private eval suite. Drop task cases as JSON files in
|
|
171
|
-
`.aiforcecli/evals/`:
|
|
172
|
-
|
|
173
|
-
```json
|
|
174
|
-
{ "prompt": "fix the off-by-one in parseRange", "verify": "npm test", "taskType": "bugfix" }
|
|
175
|
-
```
|
|
176
|
-
|
|
177
|
-
Then run them against every candidate agent·model (each in an isolated worktree, verified):
|
|
178
|
-
|
|
179
|
-
```bash
|
|
180
|
-
aiforcecli eval # all installed agents × all cases
|
|
181
|
-
aiforcecli eval --agents claude-code # restrict the field
|
|
182
|
-
```
|
|
183
|
-
|
|
184
|
-
Results are recorded as private evidence, so `advise` (and `bench`) get smarter about *your* repo
|
|
185
|
-
with every run. `eval` runs real agents — it costs money and time, so run it occasionally.
|
|
186
|
-
|
|
187
|
-
## Verify, self-heal & race
|
|
188
|
-
|
|
189
|
-
aiforcecli sits *above* the agents, so it can do something they can't: run the work, **verify it
|
|
190
|
-
against your tests**, and act on the result. Verification is auto-detected from the project
|
|
191
|
-
(`npm test`, `pytest`, `cargo test`, `go test ./...`, …) and overridable via `verify.command` or
|
|
192
|
-
`--verify "<cmd>"`.
|
|
193
|
-
|
|
194
|
-
**Self-healing runs** — `aiforcecli run "<task>" --heal`
|
|
195
|
-
|
|
196
|
-
```
|
|
197
|
-
run → verify → ✗ fail → retry same agent with the failure → ✗ → escalate to a stronger agent → ✓
|
|
198
|
-
```
|
|
199
|
-
|
|
200
|
-
After the run, aiforcecli runs your verify command. On failure it feeds the output back to the agent
|
|
201
|
-
(resuming the session), and if it still fails it **escalates to a stronger agent/model** — looping
|
|
202
|
-
until green, `heal.maxAttempts` is reached, or the budget runs out. A budget breach never escalates.
|
|
203
|
-
|
|
204
|
-
**Race** — `aiforcecli race "<task>" --agents claude-code,codex`
|
|
205
|
-
|
|
206
|
-
Runs the task on every agent **in parallel**, each in its own throwaway **git worktree** (your
|
|
207
|
-
uncommitted changes are carried in, so they build on your current work). Each result is verified, and
|
|
208
|
-
the **cheapest passing** diff (or `--select fastest|first-pass`) is applied to your tree — the tests
|
|
209
|
-
decide, not vibes. With no verify command, you get each candidate's diff and pick one. `--budget` is
|
|
210
|
-
the *total* and is split across racers; `--keep` leaves the worktrees for inspection.
|
|
211
|
-
|
|
212
|
-
A fresh worktree only has git-tracked files, so the repo's `node_modules` is **symlinked** into each
|
|
213
|
-
one automatically — otherwise `npm test` couldn't find its runner. For other ecosystems whose deps
|
|
214
|
-
are gitignored, list them under `race.linkPaths` (e.g. `[".venv", "target", "vendor"]`).
|
|
215
|
-
|
|
216
|
-
```bash
|
|
217
|
-
aiforcecli race "fix the failing auth test" --agents claude-code,codex --budget 2
|
|
218
|
-
# claude-code · sonnet ✓ pass 42s $0.31
|
|
219
|
-
# codex · gpt-5-codex ✗ fail 51s $0.27
|
|
220
|
-
# → applied claude-code's diff (cheapest passing) total race cost: $0.58
|
|
221
|
-
```
|
|
222
|
-
|
|
223
|
-
**Leaderboard** — `aiforcecli bench`
|
|
224
|
-
|
|
225
|
-
Every `--heal` and `race` run records an *outcome* (passed? won? cost? duration?). `bench` reports,
|
|
226
|
-
from your own history, which agent/model actually wins which task class per dollar — turning routing
|
|
227
|
-
from anecdote into evidence. Opt in to `telemetry` to contribute anonymized outcomes to a shared
|
|
228
|
-
leaderboard (off by default; never sends prompts, paths, or output).
|
|
229
|
-
|
|
230
|
-
## FinOps & budgets
|
|
231
|
-
|
|
232
|
-
- **Cost source of truth:** `aiforcecli` uses the cost the agent CLI reports when available
|
|
233
|
-
(Claude Code reports `total_cost_usd`). When a CLI reports tokens only (Codex), cost is
|
|
234
|
-
**computed** from a local pricing table — `costSource` records which, so reports are
|
|
235
|
-
never silently wrong.
|
|
236
|
-
- **Daily cap (`dailyCapUsd`):** checked *before* a run starts. If today's spend already
|
|
237
|
-
meets the cap, the run is refused (exit code `2`).
|
|
238
|
-
- **Per-run cap (`maxCostPerRunUsd`, or `--budget`):** enforced *during* the run. As usage
|
|
239
|
-
events arrive, if cumulative cost crosses the cap the subprocess is aborted
|
|
240
|
-
(SIGTERM → SIGKILL).
|
|
241
|
-
- **Honest limitation:** agents that only report usage at the *end* of a run can only
|
|
242
|
-
be caught after the fact. Mid-run enforcement is best-effort for those.
|
|
243
|
-
- **Never hangs:** every run has a wall-clock `timeoutMs` and an `inactivityTimeoutMs`
|
|
244
|
-
watchdog; a stuck subprocess is always killed.
|
|
245
|
-
|
|
246
|
-
## Configuration
|
|
247
|
-
|
|
248
|
-
`aiforcecli.config.json` (or `.ts`) is discovered at two levels, deep-merged (**project wins**), then
|
|
249
|
-
CLI flags win over both:
|
|
250
|
-
|
|
251
|
-
1. **User level** — your defaults, applied in *every* folder. Put the file here once; you don't need
|
|
252
|
-
one per project. The directory is platform-specific:
|
|
253
|
-
- **Windows:** `%APPDATA%\aiforcecli\Config\aiforcecli.config.json`
|
|
254
|
-
- **macOS:** `~/Library/Preferences/aiforcecli/aiforcecli.config.json`
|
|
255
|
-
- **Linux:** `~/.config/aiforcecli/aiforcecli.config.json` (or `$XDG_CONFIG_HOME`)
|
|
256
|
-
2. **Project level** — `aiforcecli.config.json` in the working directory, for per-project overrides only.
|
|
257
|
-
|
|
258
|
-
The CLI command itself (installed globally via `npm i -g aiforcecli`, or `npm link` for a dev build)
|
|
259
|
-
is separate from config — installing it once is enough; settings come from the files above. See
|
|
260
|
-
[`aiforcecli.config.example.json`](./aiforcecli.config.example.json) and the `examples/` directory.
|
|
261
|
-
|
|
262
|
-
| Block | Key | Default | Purpose |
|
|
263
|
-
| --- | --- | --- | --- |
|
|
264
|
-
| `agents.<id>` | `model`, `bin`, `defaultFlags`, `allowedTools` | — | Per-agent overrides (binary path, forced model, default flags, tool allow-list). |
|
|
265
|
-
| `defaultAgent` | | `claude-code` | Agent used by `run` when routing is off and no `--agent`. |
|
|
266
|
-
| `routing` | `enabled`, `prefer`, `only`, `models` | `enabled:true` | Budget-aware auto-routing; `only` restricts the catalog, `models` adds custom entries. |
|
|
267
|
-
| `verify` | `enabled`, `command`, `timeoutMs` | `enabled:true`, 300 s | How results are verified; `command` overrides auto-detection. |
|
|
268
|
-
| `heal` | `enabled`, `maxAttempts`, `escalate` | `false`, `3`, `true` | Self-healing loop behavior for `run --heal`. |
|
|
269
|
-
| `race` | `agents`, `select`, `keepWorktrees`, `linkPaths` | `select:cheapest` | Defaults for `race`; `linkPaths` symlinks extra dep dirs into worktrees. |
|
|
270
|
-
| `advise` | `weights`, `privatePseudocount`, `explore` | `0.7/0.2/0.1`, `5`, `false` | Recommendation scoring weights, prior strength, and Thompson-Sampling exploration. |
|
|
271
|
-
| `eval` | `dir`, `models` | `.aiforcecli/evals` | Private eval suite location and which catalog models to evaluate. |
|
|
272
|
-
| `telemetry` | `enabled`, `endpoint` | `false` | Opt-in anonymized outcome upload for the shared leaderboard. |
|
|
273
|
-
| `budgets` | `maxCostPerRunUsd`, `dailyCapUsd` | — | Per-run and daily spend caps. |
|
|
274
|
-
| top-level | `timeoutMs`, `inactivityTimeoutMs` | 600 s, 120 s | Wall-clock and inactivity watchdogs for every run. |
|
|
275
|
-
|
|
276
|
-
## Development
|
|
277
|
-
|
|
278
|
-
```bash
|
|
279
|
-
npm install
|
|
280
|
-
npm run build # tsup -> dist/ (ESM)
|
|
281
|
-
npm test # vitest — adapter parsers run against recorded fixtures
|
|
282
|
-
npm run typecheck
|
|
283
|
-
```
|
|
284
|
-
|
|
285
|
-
Adapter parser tests use **recorded real outputs** from each CLI under
|
|
286
|
-
`test/fixtures/`. To refresh them, re-capture with:
|
|
287
|
-
|
|
288
|
-
```bash
|
|
289
|
-
claude -p "say hi" --output-format stream-json --verbose > test/fixtures/claude-code/hello.stream.jsonl
|
|
290
|
-
codex exec --json --sandbox workspace-write "say hi" > test/fixtures/codex/hello.jsonl
|
|
291
|
-
```
|
|
292
|
-
|
|
293
|
-
## Documentation
|
|
294
|
-
|
|
295
|
-
- [
|
|
296
|
-
- [
|
|
297
|
-
|
|
298
|
-
|
|
299
|
-
|
|
300
|
-
|
|
301
|
-
- **
|
|
302
|
-
- **
|
|
1
|
+
# aiforcecli
|
|
2
|
+
|
|
3
|
+
**One interface over every coding agent — that you can trust and afford.**
|
|
4
|
+
|
|
5
|
+
`aiforcecli` wraps multiple coding agents — **Claude Code**, **OpenAI Codex**, and more —
|
|
6
|
+
behind a single normalized interface, and adds the things a single-agent CLI structurally
|
|
7
|
+
can't:
|
|
8
|
+
|
|
9
|
+
- 🧭 **Advise** — instantly recommend the best agent+model for a task (public benchmarks blended with *your* results), before you spend a cent.
|
|
10
|
+
- 🏁 **Race** — run several agents in parallel in isolated git worktrees, **verify each against your tests**, and apply the winner.
|
|
11
|
+
- 🔁 **Self-heal** — run → verify → on failure feed the error back and escalate to a stronger agent until green.
|
|
12
|
+
- 📊 **Bench / Eval** — record every outcome and learn which agent wins which task, per dollar, on *your* codebase.
|
|
13
|
+
- 💸 **First-class FinOps** — every run's token usage and cost recorded locally; hard budgets enforced before and during a run.
|
|
14
|
+
|
|
15
|
+
`aiforcecli` does **not** reimplement any agent. It shells out to the agent CLIs you already
|
|
16
|
+
have installed and normalizes their I/O behind one adapter interface — so it improves as the
|
|
17
|
+
agents do, and you keep your existing auth and subscriptions.
|
|
18
|
+
|
|
19
|
+
## 60-second quickstart
|
|
20
|
+
|
|
21
|
+
```bash
|
|
22
|
+
# 1. Install
|
|
23
|
+
npm install -g aiforcecli # or: npx aiforcecli <command>
|
|
24
|
+
|
|
25
|
+
# 2. Make sure at least one underlying agent CLI is installed
|
|
26
|
+
npm install -g @anthropic-ai/claude-code # Claude Code
|
|
27
|
+
npm install -g @openai/codex # Codex
|
|
28
|
+
aiforcecli agents # shows install status
|
|
29
|
+
|
|
30
|
+
# 3. Scaffold config in your project
|
|
31
|
+
cd my-project
|
|
32
|
+
aiforcecli init
|
|
33
|
+
|
|
34
|
+
# 4. Run a task
|
|
35
|
+
aiforcecli run "add a health check endpoint and a test for it"
|
|
36
|
+
|
|
37
|
+
# 5. See what it cost
|
|
38
|
+
aiforcecli cost
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
That's it — you're productive on day one.
|
|
42
|
+
|
|
43
|
+
## Commands
|
|
44
|
+
|
|
45
|
+
| Command | What it does |
|
|
46
|
+
| --- | --- |
|
|
47
|
+
| `aiforcecli run "<task>"` | Run a task with an agent. With no `--agent`, `aiforcecli` **auto-routes** to the best agent+model for the task within budget. Add `--heal` to **verify and self-heal** (see below). Flags: `--agent claude-code\|codex`, `--model <m>`, `--cwd <path>`, `--budget <usd>`, `--explain`, `--resume <sessionId>`, `--json`, `--heal`, `--max-attempts <n>`, `--verify <cmd>`, `--skip-verify`. |
|
|
48
|
+
| `aiforcecli advise "<task>"` | **Recommend** which agent + model to use — instantly, without running anything. Blends a public benchmark scorecard with your private outcomes. `--cwd <path>`, `--json`. |
|
|
49
|
+
| `aiforcecli race "<task>"` | Run the task on **several agents in parallel** (isolated git worktrees), verify each, and apply the winner. Flags: `--agents claude-code,codex`, `--budget <usd>` (total, split across agents), `--select cheapest\|fastest\|first-pass`, `--verify <cmd>`, `--keep`, `--json`. |
|
|
50
|
+
| `aiforcecli eval` | Run the **private eval suite** to calibrate `advise` on your codebase. `--dir <path>`, `--agents <ids>`, `--json`. |
|
|
51
|
+
| `aiforcecli bench` | **Leaderboard** from your local history: which agent/model wins which task class, per dollar. `--since 7d\|24h\|<ISO>`, `--by-task`, `--json`. |
|
|
52
|
+
| `aiforcecli agents` | List agents and whether their CLI is installed (`--json`). |
|
|
53
|
+
| `aiforcecli cost` | Report spend. `--since 7d\|24h\|<ISO>`, `--by day\|agent\|project`, `--json`. |
|
|
54
|
+
| `aiforcecli init` | Scaffold `aiforcecli.config.json` and install bundled assets. `--force` to overwrite. |
|
|
55
|
+
| `aiforcecli mcp` | (Phase 3) Start the MCP server. |
|
|
56
|
+
|
|
57
|
+
## How it works
|
|
58
|
+
|
|
59
|
+
```
|
|
60
|
+
CLI ── orchestrator ── AgentAdapter ── subprocess ── claude / codex
|
|
61
|
+
│
|
|
62
|
+
├── BudgetEnforcer (pre-run daily cap + mid-run per-run cap)
|
|
63
|
+
└── UsageStore (SQLite: every run tagged agent/project/time/promptHash)
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
Each adapter parses its agent's **native** output (Claude Code `--output-format
|
|
67
|
+
stream-json`, Codex `exec --json` JSONL) and maps it onto one normalized event shape:
|
|
68
|
+
|
|
69
|
+
```ts
|
|
70
|
+
type NormalizedEvent =
|
|
71
|
+
| { type: 'token'; text: string }
|
|
72
|
+
| { type: 'message'; role: 'assistant' | 'user'; text: string }
|
|
73
|
+
| { type: 'tool_call'; name: string; input?: unknown; id?: string }
|
|
74
|
+
| { type: 'usage'; usage: Usage; cumulative: boolean }
|
|
75
|
+
| { type: 'error'; message: string; fatal: boolean };
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
A run resolves to `{ message, usage: { inputTokens, outputTokens, costUsd, costSource },
|
|
79
|
+
exitCode, sessionId, aborted }`.
|
|
80
|
+
|
|
81
|
+
## Routing
|
|
82
|
+
|
|
83
|
+
Instead of hardcoding which agent and model to call, omit `--agent` and `aiforcecli`
|
|
84
|
+
picks for you. It classifies the task (a transparent keyword + length
|
|
85
|
+
heuristic), then selects the **most relevant** model whose estimated cost fits
|
|
86
|
+
the effective budget — the tightest of `--budget`, `maxCostPerRunUsd`, and the
|
|
87
|
+
remaining daily headroom (`dailyCapUsd − spent today`):
|
|
88
|
+
|
|
89
|
+
```
|
|
90
|
+
CLI ── router ── classifyTask (light | standard | heavy)
|
|
91
|
+
│
|
|
92
|
+
├── model catalog (agent, model, tier, price)
|
|
93
|
+
└── effectiveBudget (min of --budget / per-run cap / daily headroom)
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
- A model that **meets** the task tier and fits the budget is preferred.
|
|
97
|
+
- If budget is too tight, it **downgrades** to the most capable model that fits.
|
|
98
|
+
- If nothing fits, it picks the cheapest and warns — the budget enforcer still
|
|
99
|
+
aborts the run if it overruns.
|
|
100
|
+
|
|
101
|
+
```bash
|
|
102
|
+
aiforcecli run "fix a typo in the README" # → cheap, light model
|
|
103
|
+
aiforcecli run "refactor the auth architecture" --budget 5 # → top model within $5
|
|
104
|
+
aiforcecli run "<task>" --explain # show the decision, don't run
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
An explicit `--agent` (and/or `--model`) always wins and skips routing. Tune via
|
|
108
|
+
the `routing` block in config: `enabled`, `prefer` (tie-break agent order),
|
|
109
|
+
`only` (restrict to catalog keys), and `models` (add custom catalog entries).
|
|
110
|
+
|
|
111
|
+
## Advise — pick the right agent before you start
|
|
112
|
+
|
|
113
|
+
Before a complex project, `advise` gives you an instant, no-cost recommendation of which agent and
|
|
114
|
+
model to use — the cheap predictor that complements `race` (proof) and `run --heal` (verify):
|
|
115
|
+
|
|
116
|
+
```bash
|
|
117
|
+
aiforcecli advise "migrate the auth module from sessions to JWT across the API"
|
|
118
|
+
```
|
|
119
|
+
|
|
120
|
+
```
|
|
121
|
+
Task brief
|
|
122
|
+
type security (auth, security)
|
|
123
|
+
complexity heavy
|
|
124
|
+
codebase TypeScript · 412 files · tests: yes (npm test)
|
|
125
|
+
budget $4.21 headroom
|
|
126
|
+
|
|
127
|
+
Recommendation
|
|
128
|
+
→ claude-code · opus confidence 82%
|
|
129
|
+
expected ~$0.90 · ~88% one-shot pass
|
|
130
|
+
Why
|
|
131
|
+
• public eval: claude-code·opus ~88 on security
|
|
132
|
+
• your history: 91% pass over 9 security run(s)
|
|
133
|
+
Runner-up codex·gpt-5-codex ~$0.40 · ~80% — cheaper
|
|
134
|
+
Run it: aiforcecli run "<task>" --agent claude-code --model opus --heal
|
|
135
|
+
```
|
|
136
|
+
|
|
137
|
+
**How the recommendation is scored.** A transparent blend:
|
|
138
|
+
|
|
139
|
+
```
|
|
140
|
+
capability = posterior(prior = public scorecard, evidence = your private pass rate)
|
|
141
|
+
score = wCapability·capability + wCost·costFit + wSpeed·speedFit
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
- **Public scorecard** (`src/advise/scorecard.ts`) — curated, dated priors per agent·model × task
|
|
145
|
+
type, distilled from public coding benchmarks. A *prior*, not gospel; it goes stale, so your own
|
|
146
|
+
data overrides it.
|
|
147
|
+
- **Private outcomes** — your recorded `--heal`/`race`/`eval` results. Capability is a Bayesian
|
|
148
|
+
blend: the public prior dominates until you have evidence, then your numbers take over
|
|
149
|
+
(`advise.privatePseudocount` sets how fast). Confidence rises with your sample size.
|
|
150
|
+
|
|
151
|
+
### The learning policy (contextual bandit)
|
|
152
|
+
|
|
153
|
+
Under the hood the recommendation is a **contextual bandit**. Each arm `(agent, model)` has a Beta
|
|
154
|
+
posterior over its pass probability — seeded by the public prior, updated by your verified outcomes
|
|
155
|
+
for that task type. `advise` shows the **policy mix**: how often the policy would pick each arm.
|
|
156
|
+
|
|
157
|
+
- **Exploit (default):** recommend the top expected arm.
|
|
158
|
+
- **Explore (`advise --explore`, or `advise.explore: true` for routed `run`):** pick via **Thompson
|
|
159
|
+
Sampling** — sample each arm's posterior and take the winner — so uncertain-but-promising arms get
|
|
160
|
+
tried occasionally. That's how the policy *discovers* which agent quietly wins on your code instead
|
|
161
|
+
of only exploiting the current best.
|
|
162
|
+
|
|
163
|
+
**Off-policy logging (so it keeps learning):** every verified run records its decision **context**,
|
|
164
|
+
the chosen arm's **propensity** (selection probability), and a scalar **reward** (pass, minus
|
|
165
|
+
cost/latency/attempt penalties, plus acceptance). That turns your run history into a labeled dataset
|
|
166
|
+
for unbiased off-policy improvement — the foundation for a learned escalation/route policy next.
|
|
167
|
+
The reward signal is automatic and objective because it comes from **your tests**, which is what
|
|
168
|
+
makes the learning real rather than guesswork.
|
|
169
|
+
|
|
170
|
+
**Calibrate it to your codebase** with the private eval suite. Drop task cases as JSON files in
|
|
171
|
+
`.aiforcecli/evals/`:
|
|
172
|
+
|
|
173
|
+
```json
|
|
174
|
+
{ "prompt": "fix the off-by-one in parseRange", "verify": "npm test", "taskType": "bugfix" }
|
|
175
|
+
```
|
|
176
|
+
|
|
177
|
+
Then run them against every candidate agent·model (each in an isolated worktree, verified):
|
|
178
|
+
|
|
179
|
+
```bash
|
|
180
|
+
aiforcecli eval # all installed agents × all cases
|
|
181
|
+
aiforcecli eval --agents claude-code # restrict the field
|
|
182
|
+
```
|
|
183
|
+
|
|
184
|
+
Results are recorded as private evidence, so `advise` (and `bench`) get smarter about *your* repo
|
|
185
|
+
with every run. `eval` runs real agents — it costs money and time, so run it occasionally.
|
|
186
|
+
|
|
187
|
+
## Verify, self-heal & race
|
|
188
|
+
|
|
189
|
+
aiforcecli sits *above* the agents, so it can do something they can't: run the work, **verify it
|
|
190
|
+
against your tests**, and act on the result. Verification is auto-detected from the project
|
|
191
|
+
(`npm test`, `pytest`, `cargo test`, `go test ./...`, …) and overridable via `verify.command` or
|
|
192
|
+
`--verify "<cmd>"`.
|
|
193
|
+
|
|
194
|
+
**Self-healing runs** — `aiforcecli run "<task>" --heal`
|
|
195
|
+
|
|
196
|
+
```
|
|
197
|
+
run → verify → ✗ fail → retry same agent with the failure → ✗ → escalate to a stronger agent → ✓
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
After the run, aiforcecli runs your verify command. On failure it feeds the output back to the agent
|
|
201
|
+
(resuming the session), and if it still fails it **escalates to a stronger agent/model** — looping
|
|
202
|
+
until green, `heal.maxAttempts` is reached, or the budget runs out. A budget breach never escalates.
|
|
203
|
+
|
|
204
|
+
**Race** — `aiforcecli race "<task>" --agents claude-code,codex`
|
|
205
|
+
|
|
206
|
+
Runs the task on every agent **in parallel**, each in its own throwaway **git worktree** (your
|
|
207
|
+
uncommitted changes are carried in, so they build on your current work). Each result is verified, and
|
|
208
|
+
the **cheapest passing** diff (or `--select fastest|first-pass`) is applied to your tree — the tests
|
|
209
|
+
decide, not vibes. With no verify command, you get each candidate's diff and pick one. `--budget` is
|
|
210
|
+
the *total* and is split across racers; `--keep` leaves the worktrees for inspection.
|
|
211
|
+
|
|
212
|
+
A fresh worktree only has git-tracked files, so the repo's `node_modules` is **symlinked** into each
|
|
213
|
+
one automatically — otherwise `npm test` couldn't find its runner. For other ecosystems whose deps
|
|
214
|
+
are gitignored, list them under `race.linkPaths` (e.g. `[".venv", "target", "vendor"]`).
|
|
215
|
+
|
|
216
|
+
```bash
|
|
217
|
+
aiforcecli race "fix the failing auth test" --agents claude-code,codex --budget 2
|
|
218
|
+
# claude-code · sonnet ✓ pass 42s $0.31
|
|
219
|
+
# codex · gpt-5-codex ✗ fail 51s $0.27
|
|
220
|
+
# → applied claude-code's diff (cheapest passing) total race cost: $0.58
|
|
221
|
+
```
|
|
222
|
+
|
|
223
|
+
**Leaderboard** — `aiforcecli bench`
|
|
224
|
+
|
|
225
|
+
Every `--heal` and `race` run records an *outcome* (passed? won? cost? duration?). `bench` reports,
|
|
226
|
+
from your own history, which agent/model actually wins which task class per dollar — turning routing
|
|
227
|
+
from anecdote into evidence. Opt in to `telemetry` to contribute anonymized outcomes to a shared
|
|
228
|
+
leaderboard (off by default; never sends prompts, paths, or output).
|
|
229
|
+
|
|
230
|
+
## FinOps & budgets
|
|
231
|
+
|
|
232
|
+
- **Cost source of truth:** `aiforcecli` uses the cost the agent CLI reports when available
|
|
233
|
+
(Claude Code reports `total_cost_usd`). When a CLI reports tokens only (Codex), cost is
|
|
234
|
+
**computed** from a local pricing table — `costSource` records which, so reports are
|
|
235
|
+
never silently wrong.
|
|
236
|
+
- **Daily cap (`dailyCapUsd`):** checked *before* a run starts. If today's spend already
|
|
237
|
+
meets the cap, the run is refused (exit code `2`).
|
|
238
|
+
- **Per-run cap (`maxCostPerRunUsd`, or `--budget`):** enforced *during* the run. As usage
|
|
239
|
+
events arrive, if cumulative cost crosses the cap the subprocess is aborted
|
|
240
|
+
(SIGTERM → SIGKILL).
|
|
241
|
+
- **Honest limitation:** agents that only report usage at the *end* of a run can only
|
|
242
|
+
be caught after the fact. Mid-run enforcement is best-effort for those.
|
|
243
|
+
- **Never hangs:** every run has a wall-clock `timeoutMs` and an `inactivityTimeoutMs`
|
|
244
|
+
watchdog; a stuck subprocess is always killed.
|
|
245
|
+
|
|
246
|
+
## Configuration
|
|
247
|
+
|
|
248
|
+
`aiforcecli.config.json` (or `.ts`) is discovered at two levels, deep-merged (**project wins**), then
|
|
249
|
+
CLI flags win over both:
|
|
250
|
+
|
|
251
|
+
1. **User level** — your defaults, applied in *every* folder. Put the file here once; you don't need
|
|
252
|
+
one per project. The directory is platform-specific:
|
|
253
|
+
- **Windows:** `%APPDATA%\aiforcecli\Config\aiforcecli.config.json`
|
|
254
|
+
- **macOS:** `~/Library/Preferences/aiforcecli/aiforcecli.config.json`
|
|
255
|
+
- **Linux:** `~/.config/aiforcecli/aiforcecli.config.json` (or `$XDG_CONFIG_HOME`)
|
|
256
|
+
2. **Project level** — `aiforcecli.config.json` in the working directory, for per-project overrides only.
|
|
257
|
+
|
|
258
|
+
The CLI command itself (installed globally via `npm i -g aiforcecli`, or `npm link` for a dev build)
|
|
259
|
+
is separate from config — installing it once is enough; settings come from the files above. See
|
|
260
|
+
[`aiforcecli.config.example.json`](./aiforcecli.config.example.json) and the `examples/` directory.
|
|
261
|
+
|
|
262
|
+
| Block | Key | Default | Purpose |
|
|
263
|
+
| --- | --- | --- | --- |
|
|
264
|
+
| `agents.<id>` | `model`, `bin`, `defaultFlags`, `allowedTools` | — | Per-agent overrides (binary path, forced model, default flags, tool allow-list). |
|
|
265
|
+
| `defaultAgent` | | `claude-code` | Agent used by `run` when routing is off and no `--agent`. |
|
|
266
|
+
| `routing` | `enabled`, `prefer`, `only`, `models` | `enabled:true` | Budget-aware auto-routing; `only` restricts the catalog, `models` adds custom entries. |
|
|
267
|
+
| `verify` | `enabled`, `command`, `timeoutMs` | `enabled:true`, 300 s | How results are verified; `command` overrides auto-detection. |
|
|
268
|
+
| `heal` | `enabled`, `maxAttempts`, `escalate` | `false`, `3`, `true` | Self-healing loop behavior for `run --heal`. |
|
|
269
|
+
| `race` | `agents`, `select`, `keepWorktrees`, `linkPaths` | `select:cheapest` | Defaults for `race`; `linkPaths` symlinks extra dep dirs into worktrees. |
|
|
270
|
+
| `advise` | `weights`, `privatePseudocount`, `explore` | `0.7/0.2/0.1`, `5`, `false` | Recommendation scoring weights, prior strength, and Thompson-Sampling exploration. |
|
|
271
|
+
| `eval` | `dir`, `models` | `.aiforcecli/evals` | Private eval suite location and which catalog models to evaluate. |
|
|
272
|
+
| `telemetry` | `enabled`, `endpoint` | `false` | Opt-in anonymized outcome upload for the shared leaderboard. |
|
|
273
|
+
| `budgets` | `maxCostPerRunUsd`, `dailyCapUsd` | — | Per-run and daily spend caps. |
|
|
274
|
+
| top-level | `timeoutMs`, `inactivityTimeoutMs` | 600 s, 120 s | Wall-clock and inactivity watchdogs for every run. |
|
|
275
|
+
|
|
276
|
+
## Development
|
|
277
|
+
|
|
278
|
+
```bash
|
|
279
|
+
npm install
|
|
280
|
+
npm run build # tsup -> dist/ (ESM)
|
|
281
|
+
npm test # vitest — adapter parsers run against recorded fixtures
|
|
282
|
+
npm run typecheck
|
|
283
|
+
```
|
|
284
|
+
|
|
285
|
+
Adapter parser tests use **recorded real outputs** from each CLI under
|
|
286
|
+
`test/fixtures/`. To refresh them, re-capture with:
|
|
287
|
+
|
|
288
|
+
```bash
|
|
289
|
+
claude -p "say hi" --output-format stream-json --verbose > test/fixtures/claude-code/hello.stream.jsonl
|
|
290
|
+
codex exec --json --sandbox workspace-write "say hi" > test/fixtures/codex/hello.jsonl
|
|
291
|
+
```
|
|
292
|
+
|
|
293
|
+
## Documentation
|
|
294
|
+
|
|
295
|
+
- [Feature Reference (FEATURES)](./docs/FEATURES.md) — every feature in depth: public eval, the RL/bandit policy, race, heal, FinOps, config, data model.
|
|
296
|
+
- [Product Requirements (PRD)](./docs/PRD.md) — problem, users, features, scope, success metrics.
|
|
297
|
+
- [Business Requirements (BRD)](./docs/BRD.md) — market, value proposition, model, KPIs.
|
|
298
|
+
|
|
299
|
+
## Roadmap
|
|
300
|
+
|
|
301
|
+
- **Shipped:** adapters (claude-code, codex), config, FinOps + budgets, budget-aware routing, `run` (+ `--heal`), `race`, `advise` (**contextual-bandit policy + off-policy logging**), `eval`, `bench`, `cost`, `agents`, `init`.
|
|
302
|
+
- **Next:** learned escalation/route policy trained from the logged (context, arm, reward, propensity) data; surface agent fatal errors in `race`/`eval`; richer task classification (`--deep`); bundled `assets/` via `init`.
|
|
303
|
+
- **Later:** `aiforcecli mcp` MCP server with auth + hard budget cap; hosted shared leaderboard + federated global/local policy; team/org budgets and reporting.
|