aiforcecli 0.4.0 → 0.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md DELETED
@@ -1,305 +0,0 @@
1
- # aiforcecli
2
-
3
- **One interface over every coding agent — that you can trust and afford.**
4
-
5
- `aiforcecli` wraps multiple coding agents — **Claude Code**, **OpenAI Codex**, **Aider**
6
- (which itself routes to DeepSeek, Mistral/Codestral, Qwen, Ollama, OpenRouter, and more), and
7
- **Antigravity** (Google's `agy`, for Gemini models) — behind a single normalized interface, and
8
- adds the things a single-agent CLI structurally can't:
9
-
10
- - 🧭 **Advise** — instantly recommend the best agent+model for a task (public benchmarks blended with *your* results), before you spend a cent.
11
- - 🏁 **Race** — run several agents in parallel in isolated git worktrees, **verify each against your tests**, and apply the winner.
12
- - 🔁 **Self-heal** — run → verify → on failure feed the error back and escalate to a stronger agent until green.
13
- - 📊 **Bench / Eval** — record every outcome and learn which agent wins which task, per dollar, on *your* codebase.
14
- - 💸 **First-class FinOps** — every run's token usage and cost recorded locally; hard budgets enforced before and during a run.
15
-
16
- `aiforcecli` does **not** reimplement any agent. It shells out to the agent CLIs you already
17
- have installed and normalizes their I/O behind one adapter interface — so it improves as the
18
- agents do, and you keep your existing auth and subscriptions.
19
-
20
- ## 60-second quickstart
21
-
22
- ```bash
23
- # 1. Install
24
- npm install -g aiforcecli # or: npx aiforcecli <command>
25
-
26
- # 2. Make sure at least one underlying agent CLI is installed
27
- npm install -g @anthropic-ai/claude-code # Claude Code
28
- npm install -g @openai/codex # Codex
29
- aiforcecli agents # shows install status
30
-
31
- # 3. Scaffold config in your project
32
- cd my-project
33
- aiforcecli init
34
-
35
- # 4. Run a task
36
- aiforcecli run "add a health check endpoint and a test for it"
37
-
38
- # 5. See what it cost
39
- aiforcecli cost
40
- ```
41
-
42
- That's it — you're productive on day one.
43
-
44
- ## Commands
45
-
46
- | Command | What it does |
47
- | --- | --- |
48
- | `aiforcecli run "<task>"` | Run a task with an agent. With no `--agent`, `aiforcecli` **auto-routes** to the best agent+model for the task within budget. Add `--heal` to **verify and self-heal** (see below). Flags: `--agent claude-code\|codex\|aider\|antigravity`, `--model <m>`, `--cwd <path>`, `--budget <usd>`, `--route deterministic\|bayesian`, `--explore`/`--no-explore` (bayesian), `--explain`, `--resume <sessionId>`, `--json`, `--heal`, `--max-attempts <n>`, `--verify <cmd>`, `--skip-verify`. |
49
- | `aiforcecli advise "<task>"` | **Recommend** which agent + model to use — instantly, without running anything. Blends a public benchmark scorecard with your private outcomes. `--cwd <path>`, `--budget <usd>` (preview under a cap), `--explore`, `--json`. |
50
- | `aiforcecli race "<task>"` | Run the task on **several agents in parallel** (isolated git worktrees), verify each, and apply the winner. Flags: `--agents claude-code,codex`, `--budget <usd>` (total, split across agents), `--select cheapest\|fastest\|first-pass`, `--verify <cmd>`, `--keep`, `--json`. |
51
- | `aiforcecli eval` | Run the **private eval suite** to calibrate `advise` on your codebase. `--dir <path>`, `--agents <ids>`, `--json`. |
52
- | `aiforcecli bench` | **Leaderboard** from your local history: which agent/model wins which task class, per dollar. `--since 7d\|24h\|<ISO>`, `--by-task`, `--json`. |
53
- | `aiforcecli agents` | List agents and whether their CLI is installed (`--json`). |
54
- | `aiforcecli cost` | Report spend. `--since 7d\|24h\|<ISO>`, `--by day\|agent\|project`, `--json`. |
55
- | `aiforcecli init` | Scaffold `aiforcecli.config.json` and install bundled assets. `--force` to overwrite. |
56
- | `aiforcecli mcp` | (Phase 3) Start the MCP server. |
57
-
58
- ## How it works
59
-
60
- ```
61
- CLI ── orchestrator ── AgentAdapter ── subprocess ── claude / codex
62
-
63
- ├── BudgetEnforcer (pre-run daily cap + mid-run per-run cap)
64
- └── UsageStore (SQLite: every run tagged agent/project/time/promptHash)
65
- ```
66
-
67
- Each adapter parses its agent's **native** output (Claude Code `--output-format
68
- stream-json`, Codex `exec --json` JSONL) and maps it onto one normalized event shape:
69
-
70
- ```ts
71
- type NormalizedEvent =
72
- | { type: 'token'; text: string }
73
- | { type: 'message'; role: 'assistant' | 'user'; text: string }
74
- | { type: 'tool_call'; name: string; input?: unknown; id?: string }
75
- | { type: 'usage'; usage: Usage; cumulative: boolean }
76
- | { type: 'error'; message: string; fatal: boolean };
77
- ```
78
-
79
- A run resolves to `{ message, usage: { inputTokens, outputTokens, costUsd, costSource },
80
- exitCode, sessionId, aborted }`.
81
-
82
- ## Routing
83
-
84
- Instead of hardcoding which agent and model to call, omit `--agent` and `aiforcecli`
85
- picks for you. It classifies the task (a transparent keyword + length
86
- heuristic), then selects the **most relevant** model whose estimated cost fits
87
- the effective budget — the tightest of `--budget`, `maxCostPerRunUsd`, and the
88
- remaining daily headroom (`dailyCapUsd − spent today`):
89
-
90
- ```
91
- CLI ── router ── classifyTask (light | standard | heavy)
92
-
93
- ├── model catalog (agent, model, tier, price)
94
- └── effectiveBudget (min of --budget / per-run cap / daily headroom)
95
- ```
96
-
97
- - A model that **meets** the task tier and fits the budget is preferred.
98
- - If budget is too tight, it **downgrades** to the most capable model that fits.
99
- - If nothing fits, it picks the cheapest and warns — the budget enforcer still
100
- aborts the run if it overruns.
101
-
102
- ```bash
103
- aiforcecli run "fix a typo in the README" # → cheap, light model
104
- aiforcecli run "refactor the auth architecture" --budget 5 # → top model within $5
105
- aiforcecli run "<task>" --explain # show the decision, don't run
106
- ```
107
-
108
- An explicit `--agent` (and/or `--model`) always wins and skips routing. Tune via
109
- the `routing` block in config: `enabled`, `prefer` (tie-break agent order),
110
- `only` (restrict to catalog keys), and `models` (add custom catalog entries).
111
-
112
- ## Advise — pick the right agent before you start
113
-
114
- Before a complex project, `advise` gives you an instant, no-cost recommendation of which agent and
115
- model to use — the cheap predictor that complements `race` (proof) and `run --heal` (verify):
116
-
117
- ```bash
118
- aiforcecli advise "migrate the auth module from sessions to JWT across the API"
119
- ```
120
-
121
- ```
122
- Task brief
123
- type security (auth, security)
124
- complexity heavy
125
- codebase TypeScript · 412 files · tests: yes (npm test)
126
- budget $4.21 headroom
127
-
128
- Recommendation
129
- → claude-code · opus confidence 82%
130
- expected ~$0.90 · ~88% one-shot pass
131
- Why
132
- • public eval: claude-code·opus ~88 on security
133
- • your history: 91% pass over 9 security run(s)
134
- Runner-up codex·gpt-5-codex ~$0.40 · ~80% — cheaper
135
- Run it: aiforcecli run "<task>" --agent claude-code --model opus --heal
136
- ```
137
-
138
- **How the recommendation is scored.** A transparent blend:
139
-
140
- ```
141
- capability = posterior(prior = public scorecard, evidence = your private pass rate)
142
- score = wCapability·capability + wCost·costFit + wSpeed·speedFit
143
- ```
144
-
145
- - **Public scorecard** (`src/advise/scorecard.ts`) — curated, dated priors per agent·model × task
146
- type, distilled from public coding benchmarks. A *prior*, not gospel; it goes stale, so your own
147
- data overrides it.
148
- - **Private outcomes** — your recorded `--heal`/`race`/`eval` results. Capability is a Bayesian
149
- blend: the public prior dominates until you have evidence, then your numbers take over
150
- (`advise.privatePseudocount` sets how fast). Confidence rises with your sample size.
151
-
152
- ### The learning policy (contextual bandit)
153
-
154
- Under the hood the recommendation is a **contextual bandit**. Each arm `(agent, model)` has a Beta
155
- posterior over its pass probability — seeded by the public prior, updated by your verified outcomes
156
- for that task type. `advise` shows the **policy mix**: how often the policy would pick each arm.
157
-
158
- - **Exploit (default):** recommend the top expected arm.
159
- - **Explore (`advise --explore`, or `advise.explore: true` for routed `run`):** pick via **Thompson
160
- Sampling** — sample each arm's posterior and take the winner — so uncertain-but-promising arms get
161
- tried occasionally. That's how the policy *discovers* which agent quietly wins on your code instead
162
- of only exploiting the current best.
163
-
164
- **Off-policy logging (so it keeps learning):** every verified run records its decision **context**,
165
- the chosen arm's **propensity** (selection probability), and a scalar **reward** (pass, minus
166
- cost/latency/attempt penalties, plus acceptance). That turns your run history into a labeled dataset
167
- for unbiased off-policy improvement — the foundation for a learned escalation/route policy next.
168
- The reward signal is automatic and objective because it comes from **your tests**, which is what
169
- makes the learning real rather than guesswork.
170
-
171
- **Calibrate it to your codebase** with the private eval suite. Drop task cases as JSON files in
172
- `.aiforcecli/evals/`:
173
-
174
- ```json
175
- { "prompt": "fix the off-by-one in parseRange", "verify": "npm test", "taskType": "bugfix" }
176
- ```
177
-
178
- Then run them against every candidate agent·model (each in an isolated worktree, verified):
179
-
180
- ```bash
181
- aiforcecli eval # all installed agents × all cases
182
- aiforcecli eval --agents claude-code # restrict the field
183
- ```
184
-
185
- Results are recorded as private evidence, so `advise` (and `bench`) get smarter about *your* repo
186
- with every run. `eval` runs real agents — it costs money and time, so run it occasionally.
187
-
188
- ## Verify, self-heal & race
189
-
190
- aiforcecli sits *above* the agents, so it can do something they can't: run the work, **verify it
191
- against your tests**, and act on the result. Verification is auto-detected from the project
192
- (`npm test`, `pytest`, `cargo test`, `go test ./...`, …) and overridable via `verify.command` or
193
- `--verify "<cmd>"`.
194
-
195
- **Self-healing runs** — `aiforcecli run "<task>" --heal`
196
-
197
- ```
198
- run → verify → ✗ fail → retry same agent with the failure → ✗ → escalate to a stronger agent → ✓
199
- ```
200
-
201
- After the run, aiforcecli runs your verify command. On failure it feeds the output back to the agent
202
- (resuming the session), and if it still fails it **escalates to a stronger agent/model** — looping
203
- until green, `heal.maxAttempts` is reached, or the budget runs out. A budget breach never escalates.
204
-
205
- **Race** — `aiforcecli race "<task>" --agents claude-code,codex`
206
-
207
- Runs the task on every agent **in parallel**, each in its own throwaway **git worktree** (your
208
- uncommitted changes are carried in, so they build on your current work). Each result is verified, and
209
- the **cheapest passing** diff (or `--select fastest|first-pass`) is applied to your tree — the tests
210
- decide, not vibes. With no verify command, you get each candidate's diff and pick one. `--budget` is
211
- the *total* and is split across racers; `--keep` leaves the worktrees for inspection.
212
-
213
- A fresh worktree only has git-tracked files, so the repo's `node_modules` is **symlinked** into each
214
- one automatically — otherwise `npm test` couldn't find its runner. For other ecosystems whose deps
215
- are gitignored, list them under `race.linkPaths` (e.g. `[".venv", "target", "vendor"]`).
216
-
217
- ```bash
218
- aiforcecli race "fix the failing auth test" --agents claude-code,codex --budget 2
219
- # claude-code · sonnet ✓ pass 42s $0.31
220
- # codex · gpt-5-codex ✗ fail 51s $0.27
221
- # → applied claude-code's diff (cheapest passing) total race cost: $0.58
222
- ```
223
-
224
- **Leaderboard** — `aiforcecli bench`
225
-
226
- Every `--heal` and `race` run records an *outcome* (passed? won? cost? duration?). `bench` reports,
227
- from your own history, which agent/model actually wins which task class per dollar — turning routing
228
- from anecdote into evidence. Opt in to `telemetry` to contribute anonymized outcomes to a shared
229
- leaderboard (off by default; never sends prompts, paths, or output).
230
-
231
- ## FinOps & budgets
232
-
233
- - **Cost source of truth:** `aiforcecli` uses the cost the agent CLI reports when available
234
- (Claude Code reports `total_cost_usd`). When a CLI reports tokens only (Codex), cost is
235
- **computed** from a local pricing table — `costSource` records which, so reports are
236
- never silently wrong.
237
- - **Window caps (`dailyCapUsd`, `weeklyCapUsd`, `monthlyCapUsd`):** checked *before* a run
238
- starts. If spend in any window already meets its cap (today / this calendar week from Monday /
239
- this calendar month), the run is refused (exit code `2`). They also tighten the routing budget.
240
- - **Per-run cap (`maxCostPerRunUsd`, or `--budget`):** enforced *during* the run. As usage
241
- events arrive, if cumulative cost crosses the cap the subprocess is aborted
242
- (SIGTERM → SIGKILL).
243
- - **Honest limitation:** agents that only report usage at the *end* of a run can only
244
- be caught after the fact. Mid-run enforcement is best-effort for those.
245
- - **Never hangs:** every run has a wall-clock `timeoutMs` and an `inactivityTimeoutMs`
246
- watchdog; a stuck subprocess is always killed.
247
-
248
- ## Configuration
249
-
250
- `aiforcecli.config.json` (or `.ts`) is discovered at two levels, deep-merged (**project wins**), then
251
- CLI flags win over both:
252
-
253
- 1. **User level** — your defaults, applied in *every* folder. Put the file here once; you don't need
254
- one per project. The directory is platform-specific:
255
- - **Windows:** `%APPDATA%\aiforcecli\Config\aiforcecli.config.json`
256
- - **macOS:** `~/Library/Preferences/aiforcecli/aiforcecli.config.json`
257
- - **Linux:** `~/.config/aiforcecli/aiforcecli.config.json` (or `$XDG_CONFIG_HOME`)
258
- 2. **Project level** — `aiforcecli.config.json` in the working directory, for per-project overrides only.
259
-
260
- The CLI command itself (installed globally via `npm i -g aiforcecli`, or `npm link` for a dev build)
261
- is separate from config — installing it once is enough; settings come from the files above. See
262
- [`aiforcecli.config.example.json`](./aiforcecli.config.example.json) and the `examples/` directory.
263
-
264
- | Block | Key | Default | Purpose |
265
- | --- | --- | --- | --- |
266
- | `agents.<id>` | `enabled`, `model`, `bin`, `defaultFlags`, `allowedTools` | `enabled:true` | Per-agent settings. `enabled:false` removes the agent everywhere. Plus binary path, forced model, default flags, tool allow-list. |
267
- | `defaultAgent` | | `claude-code` | Agent used by `run` when routing is off and no `--agent`. |
268
- | `routing` | `enabled`, `strategy`, `prefer`, `only`, `models` | `enabled:true`, `strategy:deterministic` | Auto-routing. `strategy`: `deterministic` (budget-aware tier router) or `bayesian` (learned engine; honors `advise.explore`). `only` restricts the catalog, `models` adds entries. Per-run override: `run --route`. |
269
- | `verify` | `enabled`, `command`, `timeoutMs` | `enabled:true`, 300 s | How results are verified; `command` overrides auto-detection. |
270
- | `heal` | `enabled`, `maxAttempts`, `escalate` | `false`, `3`, `true` | Self-healing loop behavior for `run --heal`. |
271
- | `race` | `agents`, `select`, `keepWorktrees`, `linkPaths` | `select:cheapest` | Defaults for `race`; `linkPaths` symlinks extra dep dirs into worktrees. |
272
- | `advise` | `weights`, `privatePseudocount`, `explore` | `0.7/0.2/0.1`, `5`, `false` | Recommendation scoring weights, prior strength, and Thompson-Sampling exploration. |
273
- | `eval` | `dir`, `models` | `.aiforcecli/evals` | Private eval suite location and which catalog models to evaluate. |
274
- | `telemetry` | `enabled`, `endpoint` | `false` | Opt-in anonymized outcome upload for the shared leaderboard. |
275
- | `budgets` | `maxCostPerRunUsd`, `dailyCapUsd`, `weeklyCapUsd`, `monthlyCapUsd` | — | Per-run cap + daily/weekly/monthly spend caps (refuse to start once a window cap is hit). |
276
- | top-level | `timeoutMs`, `inactivityTimeoutMs` | 600 s, 120 s | Wall-clock and inactivity watchdogs for every run. |
277
-
278
- ## Development
279
-
280
- ```bash
281
- npm install
282
- npm run build # tsup -> dist/ (ESM)
283
- npm test # vitest — adapter parsers run against recorded fixtures
284
- npm run typecheck
285
- ```
286
-
287
- Adapter parser tests use **recorded real outputs** from each CLI under
288
- `test/fixtures/`. To refresh them, re-capture with:
289
-
290
- ```bash
291
- claude -p "say hi" --output-format stream-json --verbose > test/fixtures/claude-code/hello.stream.jsonl
292
- codex exec --json --sandbox workspace-write "say hi" > test/fixtures/codex/hello.jsonl
293
- ```
294
-
295
- ## Documentation
296
-
297
- - [Feature Reference (FEATURES)](./docs/FEATURES.md) — every feature in depth: public eval, the RL/bandit policy, race, heal, FinOps, config, data model.
298
- - [Product Requirements (PRD)](./docs/PRD.md) — problem, users, features, scope, success metrics.
299
- - [Business Requirements (BRD)](./docs/BRD.md) — market, value proposition, model, KPIs.
300
-
301
- ## Roadmap
302
-
303
- - **Shipped:** adapters (claude-code, codex), config, FinOps + budgets, budget-aware routing, `run` (+ `--heal`), `race`, `advise` (**contextual-bandit policy + off-policy logging**), `eval`, `bench`, `cost`, `agents`, `init`.
304
- - **Next:** learned escalation/route policy trained from the logged (context, arm, reward, propensity) data; surface agent fatal errors in `race`/`eval`; richer task classification (`--deep`); bundled `assets/` via `init`.
305
- - **Later:** `aiforcecli mcp` MCP server with auth + hard budget cap; hosted shared leaderboard + federated global/local policy; team/org budgets and reporting.