aiforcecli-chat 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. package/License.MD +49 -0
  2. package/README.md +642 -0
  3. package/aiforcecli.config.example.json +66 -0
  4. package/assets/README.md +14 -0
  5. package/dist/cli.js +2 -0
  6. package/dist/index.js +2 -0
  7. package/package.json +62 -0
  8. package/tools/scorecard/README.md +92 -0
  9. package/tools/scorecard/config.json +134 -0
  10. package/tools/scorecard/fetch.mjs +335 -0
  11. package/tools/scorecard/generate.mjs +289 -0
  12. package/tools/scorecard/generated/example/invalid-rows.json +1 -0
  13. package/tools/scorecard/generated/example/scorecard-report.md +147 -0
  14. package/tools/scorecard/generated/example/scorecard.compact.json +61 -0
  15. package/tools/scorecard/generated/example/scorecard.json +1492 -0
  16. package/tools/scorecard/generated/example/unmapped-models.json +1492 -0
  17. package/tools/scorecard/generated/raw/aider_polyglot.html +21071 -0
  18. package/tools/scorecard/generated/raw/terminal_bench_2_1.html +2 -0
  19. package/tools/scorecard/generated/scorecard/invalid-rows.json +1 -0
  20. package/tools/scorecard/generated/scorecard/scorecard-report.md +133 -0
  21. package/tools/scorecard/generated/scorecard/scorecard.compact.json +51 -0
  22. package/tools/scorecard/generated/scorecard/scorecard.json +1181 -0
  23. package/tools/scorecard/generated/scorecard/unmapped-models.json +1492 -0
  24. package/tools/scorecard/generated/scorecard-example/invalid-rows.json +1 -0
  25. package/tools/scorecard/generated/scorecard-example/scorecard-report.md +40 -0
  26. package/tools/scorecard/generated/scorecard-example/scorecard.compact.json +22 -0
  27. package/tools/scorecard/generated/scorecard-example/scorecard.json +389 -0
  28. package/tools/scorecard/generated/scorecard-example/unmapped-models.json +1 -0
  29. package/tools/scorecard/generated/scorecard-fetch/raw/aider_polyglot.html +21071 -0
  30. package/tools/scorecard/generated/scorecard-fetch/raw/terminal_bench_2_1.html +2 -0
  31. package/tools/scorecard/snapshots/example.normalized.example.json +38 -0
  32. package/tools/scorecard/snapshots/live.aider_polyglot.json +1318 -0
  33. package/tools/scorecard/snapshots/live.terminal_bench_2_1.json +294 -0
package/License.MD ADDED
@@ -0,0 +1,49 @@
1
+ # Proprietary License
2
+
3
+ Copyright (c) [2026] [Apoorv Iyer]. All rights reserved.
4
+
5
+ This software package, including all source code, compiled code, documentation, examples, assets, and related files, is proprietary and confidential.
6
+
7
+ ## License Grant
8
+
9
+ You may use this software only if you have received explicit written permission from [YOUR NAME OR COMPANY NAME] or have an active paid license, subscription, or agreement that allows you to use it.
10
+
11
+ Subject to that permission, you are granted a limited, non-exclusive, non-transferable, revocable license to install and use this package solely for the purposes permitted by your agreement with [YOUR NAME OR COMPANY NAME].
12
+
13
+ ## Restrictions
14
+
15
+ You may not, without prior written permission:
16
+
17
+ * copy, modify, adapt, translate, or create derivative works based on this software;
18
+ * distribute, publish, sublicense, rent, lease, sell, resell, or otherwise transfer this software;
19
+ * make this software available to any third party, including through a public repository, package registry, SaaS product, or hosted service;
20
+ * reverse engineer, decompile, disassemble, or attempt to derive the source code or underlying structure of this software;
21
+ * remove, alter, or obscure any copyright, trademark, proprietary, or license notices;
22
+ * use this software to build a competing product or service;
23
+ * use this software in violation of any applicable law or regulation.
24
+
25
+ ## No Open Source License
26
+
27
+ This software is not open source. No rights are granted under any open source license. Any access to this package through npm or another package registry does not grant permission to use, copy, modify, or distribute the software except as expressly allowed by this license or a separate written agreement.
28
+
29
+ ## Ownership
30
+
31
+ [YOUR NAME OR COMPANY NAME] retains all ownership, intellectual property rights, and proprietary rights in and to the software. No ownership rights are transferred to you.
32
+
33
+ ## Termination
34
+
35
+ This license automatically terminates if you violate any of its terms. Upon termination, you must immediately stop using the software and delete all copies in your possession or control.
36
+
37
+ ## Disclaimer of Warranty
38
+
39
+ This software is provided “as is” without warranties of any kind, whether express, implied, statutory, or otherwise, including but not limited to warranties of merchantability, fitness for a particular purpose, and non-infringement.
40
+
41
+ ## Limitation of Liability
42
+
43
+ To the maximum extent permitted by law, [YOUR NAME OR COMPANY NAME] shall not be liable for any indirect, incidental, special, consequential, exemplary, or punitive damages, or for any loss of profits, revenue, data, goodwill, or business opportunities arising out of or related to the use of this software.
44
+
45
+ ## Contact
46
+
47
+ For licensing inquiries, contact:
48
+
49
+ [apoorviy@hcltech.com]
package/README.md ADDED
@@ -0,0 +1,642 @@
1
+ # aiforcecli-chat
2
+
3
+ Version: 0.1.0
4
+
5
+ One CLI over multiple coding agents, with routing, verification, racing, local learning, and cost controls.
6
+
7
+ `aiforcecli-chat` does not reimplement an agent or host models. It shells out to the agent CLIs you already have installed and authenticated, normalizes their output, records cost/outcomes locally, and uses your repo's tests as the objective signal for choosing what to trust.
8
+
9
+ ## What It Wraps
10
+
11
+ Built-in adapters:
12
+
13
+ | Agent id | CLI | What it supports |
14
+ | --- | --- | --- |
15
+ | `claude-code` | `claude` | Claude Code via `-p --output-format stream-json`; parses usage/cost from the CLI; supports session resume. |
16
+ | `codex` | `codex` | OpenAI Codex via `codex exec --json`; parses JSONL events; computes cost from token usage; supports thread resume. |
17
+ | `aider` | `aider` | Aider one-shot mode; model-agnostic through Aider's `--model`; parses Aider's token/cost summary when present. |
18
+ | `antigravity` | `agy` | Google's Antigravity/Gemini CLI through `agy -p`; runs under a pseudo-terminal because output is TUI-oriented; no token/cost reporting from the CLI. |
19
+
20
+ Each adapter implements the same contract: detect whether the CLI exists, run a prompt in a working directory, stream normalized events, and optionally resume an existing session.
21
+
22
+ ## Why Use It
23
+
24
+ The product is built around a simple workflow:
25
+
26
+ 1. Use `advise` before spending money to choose an agent/model.
27
+ 2. Use `run --heal` when you want one agent to fix, verify, retry, and escalate.
28
+ 3. Use `race` when quality matters: run multiple agents in isolated git worktrees, verify each result, and apply the passing winner.
29
+ 4. Use `pr` when you want a complete branch -> verify -> commit -> push -> GitHub pull request workflow.
30
+ 5. Use `bench` and `eval` to learn which agent/model actually works on your repo.
31
+ 6. Use `cost` and budgets to avoid bill shock.
32
+
33
+ The strongest differentiator is that `aiforcecli-chat` is horizontal: it compares and governs several coding agents instead of locking you into one.
34
+
35
+ ## Quickstart
36
+
37
+ Install the wrapper:
38
+
39
+ ```bash
40
+ npm install -g aiforcecli-chat
41
+ ```
42
+
43
+ Install at least one underlying agent CLI:
44
+
45
+ ```bash
46
+ npm install -g @anthropic-ai/claude-code
47
+ npm install -g @openai/codex
48
+ python -m pip install aider-chat
49
+ ```
50
+
51
+ Antigravity uses Google's `agy` binary; install and authenticate it through Google's current Antigravity distribution.
52
+
53
+ Check what is available:
54
+
55
+ ```bash
56
+ aiforcecli-chat agents
57
+ aiforcecli-chat models
58
+ ```
59
+
60
+ Scaffold project config:
61
+
62
+ ```bash
63
+ cd my-project
64
+ aiforcecli-chat init
65
+ ```
66
+
67
+ Run a task:
68
+
69
+ ```bash
70
+ aiforcecli-chat run "add a health check endpoint and a test for it"
71
+ ```
72
+
73
+ Run a verified healing loop:
74
+
75
+ ```bash
76
+ aiforcecli-chat run "fix the failing auth test" --heal --verify "npm test"
77
+ ```
78
+
79
+ Race multiple agents:
80
+
81
+ ```bash
82
+ aiforcecli-chat race "fix the failing tests" --agents claude-code,codex --verify "npm test" --select cheapest
83
+ ```
84
+
85
+ Create a tested GitHub pull request:
86
+
87
+ ```bash
88
+ aiforcecli-chat pr "add input validation to the signup form" --agent codex --model gpt-5.4-mini --heal
89
+ ```
90
+
91
+ See spend:
92
+
93
+ ```bash
94
+ aiforcecli-chat cost
95
+ ```
96
+
97
+ ## Commands
98
+
99
+ | Command | Summary |
100
+ | --- | --- |
101
+ | `aiforcecli-chat run "<task>"` | Run a coding task. Without `--agent`, auto-routes to an agent/model. Supports `--agent`, `--model`, `--cwd`, `--budget`, `--resume`, `--explain`, `--json`, `--heal`, `--max-attempts`, `--verify`, `--skip-verify`, `--route deterministic|bayesian`, `--explore`, and `--no-explore`. |
102
+ | `aiforcecli-chat advise "<task>"` | Recommend an agent/model without running an agent. Supports `--cwd`, `--budget`, `--explore`, and `--json`. |
103
+ | `aiforcecli-chat scorecard update` | Manually fetch public benchmark leaderboards and generate local public-prior scorecard artifacts. |
104
+ | `aiforcecli-chat pr "<task>"` | Run a task on a new branch, verify it, commit, push, and open a GitHub pull request. Supports `--agent`, `--model`, `--cwd`, `--budget`, `--branch`, `--base`, `--title`, `--body`, `--commit-message`, `--remote`, `--draft`, `--no-push`, `--no-pr`, `--allow-dirty`, `--heal`, `--max-attempts`, `--verify`, `--skip-verify`, `--route`, `--explore`, and `--no-explore`. |
105
+ | `aiforcecli-chat race "<task>"` | Run several agents in parallel in isolated git worktrees, verify each, and apply the winner. Supports `--agents`, `--cwd`, `--budget`, `--select cheapest|fastest|first-pass`, `--verify`, `--keep`, and `--json`. |
106
+ | `aiforcecli-chat eval` | Run private eval cases against installed agent/model targets to calibrate `advise`. Supports `--cwd`, `--dir`, `--agents`, and `--json`. |
107
+ | `aiforcecli-chat bench` | Local leaderboard from recorded outcomes. Supports `--since`, `--by-task`, `--clean`, and `--json`. |
108
+ | `aiforcecli-chat models` | Show the built-in plus config-added model catalog, tiers, prices, install status, and `routing.only` allow-list status. Supports `--json`. |
109
+ | `aiforcecli-chat agents` | Show agent install status. Supports `--enable <id>`, `--disable <id>`, and `--json`. |
110
+ | `aiforcecli-chat cost` | Report spend by `day`, `agent`, or `project`. Supports `--since`, `--by day|agent|project`, and `--json`. |
111
+ | `aiforcecli-chat init` | Write `aiforcecli.config.json` and install bundled assets. Supports `--cwd` and `--force`. |
112
+ | `aiforcecli-chat mcp` | Reserved for Phase 3. In 0.1.0 it is a stub that exits with a not-implemented message. |
113
+
114
+ ## Public Prior Scorecard
115
+
116
+ Bayesian `advise`, `run --route bayesian`, chat auto-routing, and `pr --route bayesian` use the generated public-prior scorecard by default. Refresh it manually:
117
+
118
+ ```bash
119
+ aiforcecli-chat scorecard update
120
+ ```
121
+
122
+ The generated scorecard is stored under the installed package at `tools/scorecard/generated/scorecard/scorecard.json`. If it is unavailable, routing falls back to the bundled static scorecard.
123
+
124
+ ## Architecture
125
+
126
+ ```text
127
+ CLI
128
+ commands: run, advise, pr, race, eval, bench, models, agents, cost, init, mcp
129
+ adapters: claude-code, codex, aider, antigravity
130
+ routing: catalog, classifier, deterministic router
131
+ advise: task analysis, public scorecard, private stats, scorer, bandit policy
132
+ core: orchestrator, subprocess runner, worktree isolation, race, heal
133
+ verify: test command detection and execution
134
+ finops: SQLite usage store, budgets, pricing, telemetry
135
+ ```
136
+
137
+ Adapters normalize native agent output into this event shape:
138
+
139
+ ```ts
140
+ type NormalizedEvent =
141
+ | { type: 'token'; text: string }
142
+ | { type: 'message'; role: 'assistant' | 'user'; text: string }
143
+ | { type: 'tool_call'; name: string; input?: unknown; id?: string }
144
+ | { type: 'usage'; usage: Usage; cumulative: boolean }
145
+ | { type: 'error'; message: string; fatal: boolean };
146
+ ```
147
+
148
+ A finished run resolves to message, usage, exit code, optional session/thread id, and abort metadata.
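As a rough illustration of how a consumer might fold this event stream into a final result, here is a minimal sketch. The `Usage` fields, the `RunSummary` shape, and the `summarize` helper are assumptions made for this example; the README names `Usage` but does not define it. The sketch also assumes that a `cumulative: true` usage event replaces the running total while a `cumulative: false` event adds to it.

```ts
// Hypothetical Usage shape (not defined in the README).
interface Usage {
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

type NormalizedEvent =
  | { type: 'token'; text: string }
  | { type: 'message'; role: 'assistant' | 'user'; text: string }
  | { type: 'tool_call'; name: string; input?: unknown; id?: string }
  | { type: 'usage'; usage: Usage; cumulative: boolean }
  | { type: 'error'; message: string; fatal: boolean };

// Hypothetical summary of a finished run.
interface RunSummary {
  text: string;
  usage: Usage;
  fatalError: string | undefined;
}

function summarize(events: NormalizedEvent[]): RunSummary {
  let text = '';
  let usage: Usage = { inputTokens: 0, outputTokens: 0, costUsd: 0 };
  let fatalError: string | undefined;

  for (const ev of events) {
    switch (ev.type) {
      case 'message':
        // Keep the latest assistant message as the run's final text.
        if (ev.role === 'assistant') text = ev.text;
        break;
      case 'usage':
        // Assumption: cumulative events replace the total, others are deltas.
        usage = ev.cumulative
          ? { ...ev.usage }
          : {
              inputTokens: usage.inputTokens + ev.usage.inputTokens,
              outputTokens: usage.outputTokens + ev.usage.outputTokens,
              costUsd: usage.costUsd + ev.usage.costUsd,
            };
        break;
      case 'error':
        if (ev.fatal) fatalError = ev.message;
        break;
      default:
        // token and tool_call events do not change this summary.
        break;
    }
  }
  return { text, usage, fatalError };
}
```

Distinguishing cumulative from incremental usage matters because some CLIs report running totals and others report per-step deltas; summing running totals would overcount spend.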
149
+
150
+ ## Model Catalog
151
+
152
+ `aiforcecli-chat models` displays the catalog used for routing estimates and recommendations.
153
+
154
+ Built-in catalog in 0.1.0:
155
+
156
+ | Key | Agent | Model | Tier |
157
+ | --- | --- | --- | --- |
158
+ | `claude-haiku` | `claude-code` | `haiku` | light |
159
+ | `claude-sonnet` | `claude-code` | `sonnet` | standard |
160
+ | `claude-opus` | `claude-code` | `opus` | heavy |
161
+ | `claude-fable` | `claude-code` | `fable` | heavy |
162
+ | `codex-gpt-5.4-mini` | `codex` | `gpt-5.4-mini` | light |
163
+ | `codex-gpt-5.4` | `codex` | `gpt-5.4` | standard |
164
+ | `codex-gpt-5.5` | `codex` | `gpt-5.5` | heavy |
165
+ | `aider-deepseek` | `aider` | `deepseek` | standard |
166
+ | `aider-codestral` | `aider` | `codestral/codestral-latest` | standard |
167
+ | `gemini-3-flash` | `antigravity` | `gemini-3-flash` | light |
168
+ | `gemini-3.5-flash` | `antigravity` | `gemini-3.5-flash` | standard |
169
+ | `gemini-3.1-pro` | `antigravity` | `gemini-3.1-pro` | heavy |
170
+
171
+ You can restrict routing with `routing.only` and add or override catalog entries with `routing.models`.
172
+
173
+ Example:
174
+
175
+ ```json
176
+ {
177
+ "routing": {
178
+ "only": ["claude-sonnet", "codex-gpt-5.4-mini"],
179
+ "models": [
180
+ {
181
+ "key": "codex-mini",
182
+ "agent": "codex",
183
+ "model": "gpt-5.4-mini",
184
+ "tier": 1,
185
+ "price": { "input": 0.1875, "output": 1.13 }
186
+ }
187
+ ]
188
+ }
189
+ }
190
+ ```
191
+
192
+ Prices are USD per 1M tokens and are used for estimates. Runtime cost uses the agent CLI's reported cost when available; otherwise it is computed from the local pricing table.
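The per-1M-token pricing rule can be illustrated with a short sketch. The function names and the `'reported' | 'computed'` source labels are hypothetical and chosen for this example; only the unit (USD per 1M tokens) and the "prefer the CLI's reported cost" rule come from the text above.

```ts
// Prices are USD per 1M tokens, as in the `routing.models` example.
interface Price {
  input: number;
  output: number;
}

// Estimate cost from token counts and a catalog price.
function estimateCostUsd(price: Price, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * price.input + (outputTokens / 1e6) * price.output;
}

// Prefer the cost reported by the agent CLI; otherwise compute it locally.
// The source labels are assumptions for this sketch.
function resolveCostUsd(
  reportedUsd: number | undefined,
  price: Price,
  inputTokens: number,
  outputTokens: number,
): { costUsd: number; source: 'reported' | 'computed' } {
  if (reportedUsd !== undefined) {
    return { costUsd: reportedUsd, source: 'reported' };
  }
  return { costUsd: estimateCostUsd(price, inputTokens, outputTokens), source: 'computed' };
}
```

With the example price above (`input: 0.1875`, `output: 1.13`), a run that uses 200,000 input tokens and 40,000 output tokens would be estimated at roughly $0.08.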
193
+
194
+ Claude Fable is exposed through Claude Code as `--model fable` and is treated as a heavy-tier option. Its estimated price is `$10` input / `$50` output per 1M tokens, so it is best reserved for harder refactors, architecture work, security-sensitive changes, and complex PRs rather than small edits.
195
+
196
+ ## Routing
197
+
198
+ When `run` is called without `--agent`, `aiforcecli-chat` routes the task.
199
+
200
+ Routing strategies:
201
+
202
+ | Strategy | Behavior |
203
+ | --- | --- |
204
+ | `deterministic` | Default. Classifies the prompt as light, standard, or heavy with keyword/length heuristics, estimates candidate costs, and picks the cheapest model that meets the desired tier within the effective budget. |
205
+ | `bayesian` | Uses the same learned recommendation pipeline as `advise`: public priors plus private verified outcomes, with optional Thompson-sampling exploration. |
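The deterministic strategy's selection step might look roughly like the sketch below. The `Candidate` shape and `pickCheapest` name are hypothetical, and the sketch assumes that "meets the desired tier" means "at or above that tier"; the package may interpret it differently.

```ts
type Tier = 'light' | 'standard' | 'heavy';

// Ordering used to compare tiers in this sketch.
const TIER_RANK: Record<Tier, number> = { light: 0, standard: 1, heavy: 2 };

// Hypothetical candidate: a catalog entry with a cost estimate for this task.
interface Candidate {
  key: string;
  tier: Tier;
  estimatedCostUsd: number;
}

// Pick the cheapest candidate at or above the desired tier that fits the budget.
function pickCheapest(
  candidates: Candidate[],
  desired: Tier,
  budgetUsd: number,
): Candidate | undefined {
  const eligible = candidates.filter(
    (c) => TIER_RANK[c.tier] >= TIER_RANK[desired] && c.estimatedCostUsd <= budgetUsd,
  );
  eligible.sort((a, b) => a.estimatedCostUsd - b.estimatedCostUsd);
  return eligible[0];
}
```

Returning `undefined` when nothing fits leaves it to the caller to decide whether to relax the tier or abort the run.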
206
+
207
+ Effective budget is the tightest of:
208
+
209
+ - `--budget`
210
+ - `budgets.maxCostPerRunUsd`
211
+ - remaining `dailyCapUsd`
212
+ - remaining `weeklyCapUsd`
213
+ - remaining `monthlyCapUsd`
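The "tightest of" rule above can be sketched as a minimum over whichever limits are configured. The `WindowSpend` type and the `effectiveBudgetUsd` name are hypothetical; the config keys match the `budgets` block in this README.

```ts
interface BudgetConfig {
  maxCostPerRunUsd?: number;
  dailyCapUsd?: number;
  weeklyCapUsd?: number;
  monthlyCapUsd?: number;
}

// Hypothetical: spend already recorded in each window.
interface WindowSpend {
  dayUsd: number;
  weekUsd: number;
  monthUsd: number;
}

// Effective budget = the smallest of the configured limits.
// Returns Infinity when no limit is configured at all.
function effectiveBudgetUsd(
  flagBudget: number | undefined,
  cfg: BudgetConfig,
  spent: WindowSpend,
): number {
  const limits: number[] = [];
  if (flagBudget !== undefined) limits.push(flagBudget);
  if (cfg.maxCostPerRunUsd !== undefined) limits.push(cfg.maxCostPerRunUsd);
  // Window caps contribute only what remains in the window, never less than 0.
  if (cfg.dailyCapUsd !== undefined) limits.push(Math.max(0, cfg.dailyCapUsd - spent.dayUsd));
  if (cfg.weeklyCapUsd !== undefined) limits.push(Math.max(0, cfg.weeklyCapUsd - spent.weekUsd));
  if (cfg.monthlyCapUsd !== undefined) limits.push(Math.max(0, cfg.monthlyCapUsd - spent.monthUsd));
  return limits.length > 0 ? Math.min(...limits) : Infinity;
}
```

For example, with `--budget 5`, a per-run cap of 1.0, and only $0.50 left in the daily window, the effective budget is $0.50.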
214
+
215
+ Explicit `--agent` always wins. Explicit `--model` overrides the routed or configured model.
216
+
217
+ Examples:
218
+
219
+ ```bash
220
+ aiforcecli-chat run "fix a typo in the README"
221
+ aiforcecli-chat run "refactor the auth architecture" --budget 5 --explain
222
+ aiforcecli-chat run "fix the cache bug" --route bayesian --explore
223
+ aiforcecli-chat run "add validation" --agent codex --model gpt-5.4-mini
224
+ aiforcecli-chat pr "redesign the calculator architecture" --agent claude-code --model fable --branch fable-refactor --base dev-ai-base
225
+ ```
226
+
227
+ ## Advise
228
+
229
+ `advise` is a no-run recommendation engine:
230
+
231
+ ```bash
232
+ aiforcecli-chat advise "migrate auth from sessions to JWT" --budget 2
233
+ ```
234
+
235
+ It produces:
236
+
237
+ - task type: `bugfix`, `feature`, `refactor`, `test`, `docs`, `security`, `perf`, or `general`
238
+ - complexity tier: light, standard, or heavy
239
+ - codebase scan: top languages and file count
240
+ - test detection
241
+ - budget headroom
242
+ - ranked agent/model recommendations
243
+ - confidence, expected cost, estimated capability, and reasons
244
+ - policy mix: how often each arm would be selected under Thompson sampling
245
+
246
+ Scoring:
247
+
248
+ ```text
249
+ capability = posterior(public prior, private pass/fail history)
250
+ score = wCapability * capability + wCost * costFit + wSpeed * speedFit
251
+ ```
252
+
253
+ Defaults:
254
+
255
+ ```json
256
+ {
257
+ "advise": {
258
+ "weights": { "capability": 0.7, "cost": 0.2, "speed": 0.1 },
259
+ "privatePseudocount": 5,
260
+ "explore": false
261
+ }
262
+ }
263
+ ```
264
+
265
+ The public scorecard is a dated, curated prior. Your verified `heal`, `race`, and `eval` outcomes override it as data accumulates.
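One plausible reading of the scoring formula is sketched below. It assumes the public prior acts like `privatePseudocount` pseudo-observations blended with private pass/fail counts (a Beta-style posterior mean). That interpretation, and the function names, are assumptions for illustration; the README does not spell out the exact posterior.

```ts
interface Weights {
  capability: number;
  cost: number;
  speed: number;
}

// Posterior mean of the pass rate, treating the public prior (0..1) as
// `pseudocount` virtual trials. This weighting is an assumption.
function posteriorCapability(
  publicPrior: number,
  passes: number,
  fails: number,
  pseudocount: number = 5,
): number {
  return (publicPrior * pseudocount + passes) / (pseudocount + passes + fails);
}

// score = wCapability * capability + wCost * costFit + wSpeed * speedFit
function score(w: Weights, capability: number, costFit: number, speedFit: number): number {
  return w.capability * capability + w.cost * costFit + w.speed * speedFit;
}
```

Under this reading, a model with a public prior of 0.6 and no private history scores 0.6 on capability, and 15 private passes with no failures would move it to 0.9, which matches the statement that verified outcomes override the prior as data accumulates.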
266
+
267
+ ## Race
268
+
269
+ `race` is the high-trust workflow:
270
+
271
+ ```bash
272
+ aiforcecli-chat race "fix the failing checkout test" --agents claude-code,codex --verify "npm test" --select cheapest
273
+ ```
274
+
275
+ What happens:
276
+
277
+ 1. Finds the local git repo root from `--cwd` or the current directory.
278
+ 2. Captures tracked and untracked dirty changes.
279
+ 3. Creates one detached temp git worktree per agent.
280
+ 4. Carries your current changes into each worktree and commits them as that worktree's base.
281
+ 5. Runs each agent independently.
282
+ 6. Runs the verify command in each worktree.
283
+ 7. Selects a passing non-empty diff by `cheapest`, `fastest`, or `first-pass`.
284
+ 8. Applies the winner's diff back to your real working tree.
285
+ 9. Records outcome, cost, duration, task class/type, reward, and winner flag.
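Step 7 (winner selection) could be sketched as follows. The `RaceResult` fields and `selectWinner` name are hypothetical. The sketch assumes `fastest` compares each racer's own run duration and `first-pass` picks the passing racer that finished earliest on the wall clock; the package's actual semantics for these two options may differ.

```ts
type SelectStrategy = 'cheapest' | 'fastest' | 'first-pass';

// Hypothetical per-racer result collected after verification.
interface RaceResult {
  agent: string;
  passed: boolean;
  diff: string;
  costUsd: number;
  durationMs: number;
  finishedAt: number; // epoch milliseconds
}

// Only passing racers with a non-empty diff are eligible to win.
function selectWinner(results: RaceResult[], strategy: SelectStrategy): RaceResult | undefined {
  const eligible = results.filter((r) => r.passed && r.diff.trim().length > 0);
  const keyOf = (r: RaceResult): number => {
    if (strategy === 'cheapest') return r.costUsd;
    if (strategy === 'fastest') return r.durationMs;
    return r.finishedAt; // 'first-pass'
  };
  let best: RaceResult | undefined;
  for (const r of eligible) {
    if (best === undefined || keyOf(r) < keyOf(best)) best = r;
  }
  return best;
}
```

Filtering out empty diffs prevents a racer that changed nothing from winning just because the tests already passed.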
286
+
287
+ Requirements:
288
+
289
+ - The target directory must be inside a git repo.
290
+ - The repo must have at least one commit, because worktrees are created from `HEAD`.
291
+ - A verify command is strongly recommended. Without one, `race` cannot objectively pick a winner and only offers manual selection in an interactive terminal.
292
+
293
+ Budget behavior:
294
+
295
+ - `--budget` is the total race budget.
296
+ - It is split evenly across racers.
297
+
298
+ Model behavior:
299
+
300
+ - `race --agents` accepts agent ids, not per-agent model syntax.
301
+ - To race specific models, set `agents.<id>.model` in config.
302
+
303
+ Example:
304
+
305
+ ```json
306
+ {
307
+ "agents": {
308
+ "claude-code": { "model": "sonnet" },
309
+ "codex": { "model": "gpt-5.4-mini" }
310
+ }
311
+ }
312
+ ```
313
+
314
+ Then:
315
+
316
+ ```bash
317
+ aiforcecli-chat race "fix the discount bug" --agents claude-code,codex --verify "npm test"
318
+ ```
319
+
320
+ Dependency links:
321
+
322
+ - `node_modules` is linked into worktrees automatically when present.
323
+ - Add other gitignored dependency/build directories with `race.linkPaths`, for example `.venv`, `target`, or `vendor`.
324
+ - Cleanup severs symlinks/junctions before removing worktrees to avoid deleting through dependency links.
325
+
326
+ ## PR Mode
327
+
328
+ `pr` is the one-command workflow for turning an AI task into a normal GitHub pull request:
329
+
330
+ ```bash
331
+ aiforcecli-chat pr "fix the checkout timeout bug" --agent codex --model gpt-5.4 --heal
332
+ ```
333
+
334
+ What happens:
335
+
336
+ 1. Requires a clean git working tree by default so AI edits do not mix with existing local work.
337
+ 2. Creates a new branch, or uses `--branch <name>`.
338
+ 3. Runs the selected or routed agent on the task.
339
+ 4. Runs final verification if a verify command is available.
340
+ 5. Commits the change when verification passes.
341
+ 6. Pushes the branch to the configured remote.
342
+ 7. Opens a GitHub pull request with the GitHub CLI (`gh`).
343
+
344
+ Useful examples:
345
+
346
+ ```bash
347
+ aiforcecli-chat pr "add a contact form" --branch ai-contact-form
348
+ aiforcecli-chat pr "fix the parser bug" --heal --verify "npm test"
349
+ aiforcecli-chat pr "update the About page copy" --no-push
350
+ aiforcecli-chat pr "prepare a docs-only change" --skip-verify --draft
351
+ aiforcecli-chat pr "push a branch but do not open a PR" --no-pr
352
+ ```
353
+
354
+ Requirements:
355
+
356
+ - The target directory must be inside a git repo.
357
+ - Start from a clean working tree unless you intentionally pass `--allow-dirty`.
358
+ - `gh` must be installed and authenticated to open the PR automatically.
359
+ - If `gh` is unavailable, the branch is still pushed and the command prints the `gh pr create` command to run later.
360
+
361
+ ## Self-Healing Runs
362
+
363
+ `run --heal` turns a single agent run into a verify/fix/escalate loop:
364
+
365
+ ```bash
366
+ aiforcecli-chat run "fix the failing parser test" --heal --verify "npm test"
367
+ ```
368
+
369
+ Loop:
370
+
371
+ 1. Run selected agent/model.
372
+ 2. Run verify command.
373
+ 3. If passed, stop.
374
+ 4. If failed and the adapter supports resume, retry the same agent with the verify failure output.
375
+ 5. If still failing and escalation is enabled, select a stronger untried catalog entry.
376
+ 6. Stop on pass, budget breach, or `heal.maxAttempts`.
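The loop above can be sketched with injected callbacks so the control flow is visible in isolation. Everything here is a simplified, synchronous illustration: `Target`, `HealDeps`, `HealOptions`, and the `stoppedBy` labels are hypothetical, "stronger" is modeled as a numeric `rank`, and each target is assumed to get at most one resume-based retry before escalation.

```ts
interface Target { key: string; rank: number } // higher rank = stronger (assumption)

interface HealDeps {
  run(target: Target, feedback?: string): { costUsd: number };
  verify(): { passed: boolean; outputTail: string };
  supportsResume(target: Target): boolean;
}

interface HealOptions { maxAttempts: number; escalate: boolean; budgetUsd: number }

interface HealOutcome {
  passed: boolean;
  attempts: number;
  spentUsd: number;
  lastTarget: string;
  stoppedBy: 'pass' | 'budget' | 'max-attempts' | 'exhausted';
}

function heal(start: Target, catalog: Target[], deps: HealDeps, opts: HealOptions): HealOutcome {
  const tried = new Set<string>();
  let target = start;
  let lastTarget = start.key;
  let feedback: string | undefined;
  let retriedSame = false;
  let spentUsd = 0;
  let attempts = 0;

  while (attempts < opts.maxAttempts) {
    attempts += 1;
    tried.add(target.key);
    lastTarget = target.key;
    spentUsd += deps.run(target, feedback).costUsd;

    const result = deps.verify();
    if (result.passed) return { passed: true, attempts, spentUsd, lastTarget, stoppedBy: 'pass' };
    if (spentUsd >= opts.budgetUsd) return { passed: false, attempts, spentUsd, lastTarget, stoppedBy: 'budget' };

    // Feed the verify failure output into the next attempt.
    feedback = result.outputTail;
    if (!retriedSame && deps.supportsResume(target)) {
      retriedSame = true; // retry the same agent once with the failure output
      continue;
    }

    // Escalate to the weakest untried entry that is stronger than the current one.
    const currentRank = target.rank;
    const next: Target | undefined = opts.escalate
      ? catalog
          .filter((c) => !tried.has(c.key) && c.rank > currentRank)
          .sort((a, b) => a.rank - b.rank)[0]
      : undefined;
    if (next === undefined) return { passed: false, attempts, spentUsd, lastTarget, stoppedBy: 'exhausted' };
    target = next;
    retriedSame = false;
  }
  return { passed: false, attempts, spentUsd, lastTarget, stoppedBy: 'max-attempts' };
}
```

Escalating to the weakest stronger entry, rather than jumping straight to the strongest, keeps cost growth gradual; a real policy could reasonably choose otherwise.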
377
+
378
+ Config:
379
+
380
+ ```json
381
+ {
382
+ "heal": {
383
+ "enabled": false,
384
+ "maxAttempts": 3,
385
+ "escalate": true
386
+ }
387
+ }
388
+ ```
389
+
390
+ `--heal` edits the real working tree by design. Use `race` when you want isolated candidates.
391
+
392
+ ## Verification
393
+
394
+ Verification resolves in this order:
395
+
396
+ 1. `--verify "<cmd>"`
397
+ 2. `verify.command` in config
398
+ 3. auto-detection, if `verify.enabled` is true
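The resolution order can be expressed as a small fall-through function. This is a sketch only: the `resolveVerifyCommand` name and the injected `detect` callback are hypothetical, while `verify.enabled` and `verify.command` are the config keys named in this README.

```ts
interface VerifyConfig {
  enabled: boolean;
  command?: string;
}

// Resolve the verify command: flag first, then config, then auto-detection.
// `detect` stands in for whatever project inspection the tool performs.
function resolveVerifyCommand(
  flag: string | undefined,
  cfg: VerifyConfig,
  detect: () => string | undefined,
): string | undefined {
  if (flag !== undefined && flag.length > 0) return flag;
  if (cfg.command !== undefined && cfg.command.length > 0) return cfg.command;
  return cfg.enabled ? detect() : undefined;
}
```

A result of `undefined` means no verify command is available, in which case a caller would skip verification rather than fail.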
399
+
400
+ Auto-detected commands include:
401
+
402
+ - `npm test`, `pnpm test`, or `yarn test` when `package.json` has a `test` script
403
+ - `pytest`
404
+ - `cargo test`
405
+ - `go test ./...`
406
+ - `bundle exec rake test`
407
+ - `mvn test`
408
+ - `gradle test`
409
+
410
+ The verify command runs through a shell, has a timeout, returns pass/fail, and captures the tail of output for healing.
411
+
412
+ ## Eval And Bench
413
+
414
+ `eval` runs a private suite of representative cases against candidate agent/model pairs.
415
+
416
+ Case example:
417
+
418
+ ```json
419
+ {
420
+ "prompt": "fix the off-by-one in parseRange",
421
+ "verify": "npm test",
422
+ "taskType": "bugfix"
423
+ }
424
+ ```
425
+
426
+ Run:
427
+
428
+ ```bash
429
+ aiforcecli-chat eval
430
+ aiforcecli-chat eval --agents claude-code,codex
431
+ ```
432
+
433
+ Targets come from `eval.models` or all installed agents' catalog entries. Each trial runs in an isolated worktree and records outcomes for `advise` and `bench`.
434
+
435
+ `bench` shows local history:
436
+
437
+ ```bash
438
+ aiforcecli-chat bench
439
+ aiforcecli-chat bench --by-task
440
+ aiforcecli-chat bench --since 7d
441
+ aiforcecli-chat bench --clean
442
+ ```
443
+
444
+ `--clean` hides zero-signal rows such as typoed/retired model ids or unverified runs with no cost/pass/win signal.
445
+
446
+ Note: the current CLI command does not expose a project filter for `bench`; reports are from the local usage database.
447
+
448
+ ## FinOps And Budgets
449
+
450
+ Every recorded run stores:
451
+
452
+ - timestamp
453
+ - agent and model
454
+ - project path
455
+ - prompt hash
456
+ - input/output/cache tokens
457
+ - cost and cost source
458
+ - exit code and abort reason
459
+ - mode: `run`, `heal`, `race`, or `eval`
460
+ - outcome: `pass`, `fail`, or `unknown`
461
+ - duration
462
+ - task class/type
463
+ - race winner flag
464
+ - learning context, propensity, and reward when available
465
+
466
+ Cost source:
467
+
468
+ - Claude Code reports cost directly; that is trusted.
469
+ - Codex reports tokens; cost is computed from local prices.
470
+ - Aider cost is parsed from Aider's summary line when present.
471
+ - Antigravity currently reports no usage/cost in this integration, so recorded cost may be zero even though the underlying service may charge separately.
472
+
473
+ Budget controls:
474
+
475
+ ```json
476
+ {
477
+ "budgets": {
478
+ "maxCostPerRunUsd": 1.0,
479
+ "dailyCapUsd": 10.0,
480
+ "weeklyCapUsd": 50.0,
481
+ "monthlyCapUsd": 150.0
482
+ }
483
+ }
484
+ ```
485
+
486
+ Window caps are checked before starting a run. Per-run caps are enforced during a run when usage events arrive. For agents that only report usage at the end, mid-run enforcement is best effort.
487
+
488
+ Cost reports:
489
+
490
+ ```bash
491
+ aiforcecli-chat cost
492
+ aiforcecli-chat cost --since 24h
493
+ aiforcecli-chat cost --by agent
494
+ aiforcecli-chat cost --by project --json
495
+ ```
496
+
497
+ ## Telemetry
498
+
499
+ Telemetry is opt-in and off by default.
500
+
501
+ When enabled with `telemetry.enabled` and `telemetry.endpoint`, it sends only coarse anonymized outcome fields:
502
+
503
+ - task class
504
+ - agent
505
+ - model
506
+ - outcome
507
+ - winner flag
508
+ - mode
509
+ - cost
510
+ - duration
511
+
512
+ It does not send prompts, prompt hashes, project paths, or output. Upload is fire-and-forget with a short timeout and never blocks a run.
513
+
514
+ In 0.1.0 telemetry is a primitive upload hook, not a hosted shared leaderboard.
515
+
516
+ ## Configuration
517
+
518
+ Config sources are deep-merged in this order:
519
+
520
+ 1. user-level config
521
+ 2. project-level `aiforcecli.config.json` or `aiforcecli.config.ts`
522
+ 3. CLI flags
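To make the merge order concrete, here is a minimal deep-merge sketch. It assumes later layers override earlier ones key by key, and that arrays and scalars are replaced rather than combined; both are common conventions but are assumptions about this package. `deepMerge` and `mergeLayers` are hypothetical names.

```ts
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isObject(v: Json | undefined): v is JsonObject {
  return typeof v === 'object' && v !== null && !Array.isArray(v);
}

// Merge `override` into `base`: nested objects merge recursively,
// everything else (scalars, arrays, null) is replaced by the override.
function deepMerge(base: JsonObject, override: JsonObject): JsonObject {
  const out: JsonObject = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const existing = out[key];
    out[key] = isObject(existing) && isObject(value) ? deepMerge(existing, value) : value;
  }
  return out;
}

// Apply layers in order: user config, then project config, then CLI flags.
function mergeLayers(...layers: JsonObject[]): JsonObject {
  let merged: JsonObject = {};
  for (const layer of layers) {
    merged = deepMerge(merged, layer);
  }
  return merged;
}
```

Under these assumptions, a project config that sets only `budgets.dailyCapUsd` would keep the user-level `weeklyCapUsd` while replacing the daily cap.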
523
+
524
+ User-level paths:
525
+
526
+ | Platform | Path |
527
+ | --- | --- |
528
+ | Windows | `%APPDATA%\aiforcecli\Config\aiforcecli.config.json` |
529
+ | macOS | `~/Library/Preferences/aiforcecli/aiforcecli.config.json` |
530
+ | Linux | `~/.config/aiforcecli/aiforcecli.config.json`, or under `$XDG_CONFIG_HOME` when it is set |
531
+
532
+ Schema summary:
533
+
534
+ | Block | Keys | Purpose |
535
+ | --- | --- | --- |
536
+ | `agents.<id>` | `enabled`, `model`, `bin`, `defaultFlags`, `allowedTools` | Per-agent settings. `enabled:false` removes the agent from registry/routing/race/eval/listing. |
537
+ | `defaultAgent` | string | Agent used when routing is disabled and no explicit agent is passed. |
538
+ | `routing` | `enabled`, `strategy`, `prefer`, `only`, `models` | Auto-routing behavior and model catalog customization. |
539
+ | `verify` | `enabled`, `command`, `timeoutMs` | Verification command resolution and timeout. |
540
+ | `heal` | `enabled`, `maxAttempts`, `escalate` | Self-healing behavior. |
541
+ | `race` | `agents`, `select`, `keepWorktrees`, `linkPaths` | Race defaults, winner selection, worktree retention, dependency links. |
542
+ | `advise` | `weights`, `privatePseudocount`, `explore` | Recommendation scoring and exploration. |
543
+ | `eval` | `dir`, `models` | Private eval suite location and catalog target keys. |
544
+ | `telemetry` | `enabled`, `endpoint` | Opt-in anonymized upload. |
545
+ | `budgets` | `maxCostPerRunUsd`, `dailyCapUsd`, `weeklyCapUsd`, `monthlyCapUsd` | Cost caps. |
546
+ | top-level | `timeoutMs`, `inactivityTimeoutMs` | Agent subprocess watchdogs. |
547
+
548
+ Enable or disable built-in agents globally:
549
+
550
+ ```bash
551
+ aiforcecli-chat agents --disable antigravity
552
+ aiforcecli-chat agents --enable aider
553
+ ```
554
+
555
+ These toggles write to the user-level config.
556
+
557
+ ## `init`
558
+
559
+ `aiforcecli-chat init` writes a project config and installs bundled assets when present:
560
+
561
+ - `assets/skills` -> `.claude/skills`
562
+ - `assets/subagents` -> `.claude/agents`
563
+ - `assets/prompts` -> `.aiforcecli/prompts`
564
+
565
+ Use `--force` to overwrite an existing project config:
566
+
567
+ ```bash
568
+ aiforcecli-chat init --force
569
+ ```
570
+
571
+ ## Known Limitations
572
+
573
+ - `aiforcecli-chat mcp` is not implemented in 0.1.0.
574
+ - `race` and `eval` require a git repo with a commit because they use git worktrees.
575
+ - Verification quality depends on the repo's tests/build command.
576
+ - Task classification is heuristic and keyword-based.
577
+ - The public scorecard is curated and dated, not live.
578
+ - New or custom models without scorecard rows start from a neutral prior until private evidence accumulates.
579
+ - `pr` needs the GitHub CLI (`gh`) installed and authenticated to open pull requests automatically.
580
+ - Some agent fatal errors can still look like no-change/no-usage outcomes in `race` or `eval`; inspect output and use `--keep` when debugging.
581
+ - Codex model availability can differ by account type; configure `routing.only` or `agents.codex.model` to match what your local Codex CLI accepts.
582
+ - Antigravity usage/cost is not reported by the CLI in this integration.
583
+ - Mid-run budget enforcement is best effort for agents that only emit usage at the end.
584
+ - `bench` currently reports from the local global usage DB; there is no CLI `--project` filter yet.
585
+
586
+ ## Development
587
+
588
+ ```bash
589
+ npm install
590
+ npm run build
591
+ npm run typecheck
592
+ npm test
593
+ ```
594
+
595
+ Build:
596
+
597
+ - TypeScript is bundled with `tsup`.
598
+ - `npm run build` runs `tsup` and then `scripts/obfuscate.mjs`.
599
+ - `npm run build:raw` runs `tsup` only.
600
+
601
+ Tests cover:
602
+
603
+ - adapter parsers
604
+ - routing
605
+ - recommendation scoring
606
+ - bandit policy
607
+ - reward
608
+ - budget enforcement
609
+ - SQLite store
610
+ - verify detection
611
+ - worktree isolation
612
+ - self-healing policy
613
+ - race selection
614
+ - PR command helpers
615
+ - command helpers
616
+
617
+ Recorded fixtures live under `test/fixtures`.
618
+
619
+ ## Roadmap
620
+
621
+ Shipped in 0.1.0:
622
+
623
+ - adapters for Claude Code, Codex, Aider, and Antigravity
624
+ - config loading/schema
625
+ - model catalog and `models` command
626
+ - deterministic and bayesian routing
627
+ - `run`, `run --heal`, `pr`, `race`, `advise`, `eval`, `bench`, `cost`, `agents`, `init`
628
+ - local SQLite usage/outcome store
629
+ - budgets and cost reporting
630
+ - opt-in telemetry hook
631
+ - worktree isolation for race/eval
632
+ - contextual-bandit recommendation policy and off-policy logging fields
633
+
634
+ Next:
635
+
636
+ - clearer fatal error surfacing in `race` and `eval`
637
+ - richer task classification
638
+ - Codex model compatibility auto-detection
639
+ - learned route/escalation policy from logged context/action/reward/propensity
640
+ - project-scoped `bench` filtering
641
+ - real MCP server with auth and budget-capped tools
642
+ - hosted or federated aggregate leaderboard