llm-cost-attribution 0.1.0 → 0.2.0

package/README.md CHANGED
@@ -1,102 +1,48 @@
1
1
  # llm-cost-attribution
2
2
 
3
- Per-issue token, turn, and quota analytics for [Claude Code](https://docs.anthropic.com/en/docs/claude-code) and [Codex CLI](https://github.com/openai/codex) sessions. Reads the CLIs' own session JSONLs — **no telemetry pipeline, no database, no API keys**.
3
+ Per-issue cost analytics for [Claude Code](https://docs.anthropic.com/en/docs/claude-code) and [Codex CLI](https://github.com/openai/codex) sessions — how many **tokens** an issue burned, how many **turns** it took (one agent request → response is a turn), and how much of your Codex/Claude plan's rate-limit **quota** it ate. It reads the CLIs' own session logs (JSONL = one JSON record per line) — **no telemetry pipeline, no database, no API keys**.
4
4
 
5
5
  ```bash
6
6
  npx llm-cost-attribution EPAC-1940
7
7
  ```
8
8
 
9
9
  ```
10
- ════════════════════════════════════════════════════════════════════════
11
- LLM COST — EPAC-1940
12
- ════════════════════════════════════════════════════════════════════════
13
- Sessions found: 5
14
- Total turns: 414
15
- Total tokens: 61,357,012
16
-
17
- ────────────────────────────────────────────────────────────────────────
18
- CODEX (4 sessions)
19
- ────────────────────────────────────────────────────────────────────────
20
- Models: gpt-5-codex
21
- Turns: 340
22
- Tokens:
23
- input uncached 1,517,206
24
- cache read 51,024,768
25
- output (visible) 44,683
26
- output (reasoning) 18,649
27
- grand total 52,605,306
28
- Quota (plan_type=pro, 345 samples):
29
- 5h window 58% → 64% used (peak 64%)
30
- 7d window 56% → 57% used (peak 57%)
10
+ LLM COST — EPAC-1940
11
+ Sessions: 5 Turns: 414 Tokens: 61,357,012
12
+
13
+ CODEX (4 sessions) Models: gpt-5-codex Turns: 340
14
+ input uncached 1,517,206
15
+ cache read 51,024,768
16
+ output (visible) 44,683
17
+ output (reasoning) 18,649
18
+ grand total 52,605,306
19
+ Quota (pro, 345 samples): 5h 58%→64% (peak 64%) 7d 56%→57% (peak 57%)
31
20
  ```
32
21
 
33
- ## Designed for Symphony workflows
22
+ Reading that block: **cache read** is tokens the provider served from its prompt cache (cheap, and usually most of the total); **output (reasoning)** is the model's hidden thinking tokens, billed separately from the **visible** answer; **Quota** is how much of your Codex plan's two rolling rate-limit windows — a 5-hour and a 7-day one — these sessions used.
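As a quick check on the sample output, the Codex grand total is the sum of the four token buckets, and cache reads account for almost all of it. The snippet below is a minimal sketch that reuses the numbers printed above; the object and field names are illustrative assumptions, not the shape of the package's rollup.

```javascript
// Token buckets copied from the CODEX block of the sample output.
// Field names are hypothetical, chosen only for this example.
const codexTokens = {
  inputUncached: 1_517_206,
  cacheRead: 51_024_768,
  outputVisible: 44_683,
  outputReasoning: 18_649,
};

// The grand total is the plain sum of the four buckets.
const grandTotal = Object.values(codexTokens).reduce((sum, n) => sum + n, 0);

// Share of the total that was served from the provider's prompt cache.
const cacheReadShare = codexTokens.cacheRead / grandTotal;

console.log(grandTotal.toLocaleString('en-US')); // 52,605,306
console.log(`${(cacheReadShare * 100).toFixed(1)}% cache read`); // 97.0% cache read
```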
34
23
 
35
- [OpenAI Symphony's specification](https://github.com/openai/symphony/blob/main/SPEC.md) requires that each issue gets its own filesystem workspace, and that the coding agent's `cwd` equals that workspace path:
36
-
37
- - **§4.1.4 Workspace** — "Filesystem workspace assigned to one issue identifier."
38
- - **Workspace path formula** — `<workspace.root>/<sanitized_issue_identifier>`.
39
- - **Invariant 1** — "Run the coding agent only in the per-issue workspace path... validate: `cwd == workspace_path`."
40
-
41
- Because of those requirements, the working directory of every Claude Code or Codex CLI session that Symphony (or any Symphony-spec-conformant orchestrator) launches always carries the issue identifier as its last path component. The CLI agents in turn record that `cwd` in every session JSONL they create. So the issue identifier is already in the transcript — no custom telemetry pipeline needed to join.
42
-
43
- This package's default `--cwd-pattern` matches the two most common `workspace.root` configurations:
44
-
45
- 1. The Symphony spec default: `<system-temp>/symphony_workspaces/<ISSUE-ID>` (e.g. `/tmp/symphony_workspaces/EPAC-1940`).
46
- 2. A common in-repo override: `<repo>/.symphony/workspaces/<ISSUE-ID>` (used by Autopilot and the Riddim factory's Symphony config).
47
-
48
- For any other `workspace.root` setting, pass `--cwd-pattern '<regex>'` with one capture group for the issue identifier — see "[The convention](#the-convention)" below.
24
+ Requires Node 20+. Zero runtime dependencies.
49
25
 
50
26
  ## How it works
51
27
 
52
- Both CLIs persist every session they run as JSONL:
53
-
54
- - **Claude Code** writes `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl` for every interactive and non-interactive run (encoded-cwd is the absolute working directory with `/` and `.` replaced by `-`).
55
- - **Codex CLI** writes `~/.codex/sessions/YYYY/MM/DD/rollout-<timestamp>-<id>.jsonl` for every run, with the working directory recorded in the first `session_meta` event.
56
-
57
- Each file carries provider-reported token usage per turn — the same numbers your Anthropic / OpenAI account is billed against:
58
-
59
- | Provider | Tokens captured |
60
- |---|---|
61
- | Claude | `input_tokens`, `cache_read_input_tokens`, `cache_creation.{ephemeral_5m,1h}_input_tokens`, `output_tokens` |
62
- | Codex | `input_tokens`, `cached_input_tokens`, `output_tokens`, `reasoning_output_tokens` (deltaed from cumulative) |
63
- | Codex (additionally) | `rate_limits.{primary,secondary}.used_percent` per turn |
28
+ Both CLIs persist every run as JSONL — Claude Code in `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl` (`<encoded-cwd>` is just the run's working directory with `/` and `.` rewritten to `-`), Codex in `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl` — and each file records, per turn, the provider-reported token counts (the same numbers your account is billed against) plus, for Codex, its rate-limit usage. This package walks both directories, keeps the sessions whose **working directory** matches the issue ID you ask for, and adds them up.
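To make the `<encoded-cwd>` rule concrete, here is a minimal sketch that follows the description above (every `/` and `.` becomes `-`). The helper name `encodeCwd` and the sample path are assumptions for illustration; the package's own encoder may handle cases this sketch does not.

```javascript
// Hypothetical helper mirroring the README's description of <encoded-cwd>:
// every "/" and "." in the working directory is rewritten to "-".
function encodeCwd(cwd) {
  return cwd.replace(/[\/.]/g, '-');
}

// Made-up example path using the in-repo Symphony layout.
const encoded = encodeCwd('/home/dev/repo/.symphony/workspaces/EPAC-1940');
console.log(encoded); // -home-dev-repo--symphony-workspaces-EPAC-1940
```

Because hyphens already present in the path are left alone, the encoding cannot be reversed unambiguously, which is one reason to match on the trailing issue ID rather than trying to decode the whole path.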
64
29
 
65
- This package walks both directories, filters sessions whose working directory matches an issue identifier you ask for, and aggregates.
66
-
67
- ## The convention
68
-
69
- You map sessions to issues via the **working directory at session start**. By default this package matches the Symphony-spec convention:
70
-
71
- ```
72
- <repo>/.symphony/workspaces/<ISSUE-ID>
73
- ```
74
-
75
- A regex extracts `<ISSUE-ID>`. If your workflow uses a different layout, pass `--cwd-pattern '<regex>'` with one capture group:
30
+ How does a session get matched to an issue? By its **working directory** (`cwd`). Under [Symphony](https://github.com/openai/symphony/blob/main/SPEC.md)'s spec — Symphony being an orchestrator that runs coding agents one issue at a time — each agent runs in a directory dedicated to its issue (`<workspace.root>/<ISSUE-ID>`), so the issue ID is already baked into every transcript's path; no custom pipeline needed. The default `--cwd-pattern` (the regex that pulls the issue ID out of that path) matches both the spec default (`<tmp>/symphony_workspaces/<ID>`) and the common in-repo layout (`<repo>/.symphony/workspaces/<ID>`). For any other layout, pass your own regex with one capture group around the ID:
76
31
 
77
32
  ```bash
78
- # Your workflow uses ../repo-worktrees/<ID>
79
- llm-cost FOO-12 --cwd-pattern '-([A-Z]+-\d+)$'
80
-
81
- # Your workflow uses ~/issues/<id>/
82
- llm-cost 1234 --cwd-pattern '/issues/(\d+)$'
33
+ llm-cost FOO-12 --cwd-pattern '-([A-Z]+-\d+)$' # ../repo-worktrees/<ID>
34
+ llm-cost 1234 --cwd-pattern '/issues/(\d+)$' # ~/issues/<id>/
83
35
  ```
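How the single capture group is used can be sketched in a few lines. `extractIssueId` is a hypothetical helper, not the package's API, and the sample paths are made up; only the two regexes come from the examples above.

```javascript
// Hypothetical helper: return the first capture group of `pattern` when it
// matches `cwd`, otherwise null.
function extractIssueId(cwd, pattern) {
  const match = new RegExp(pattern).exec(cwd);
  return match ? match[1] : null;
}

// The two custom patterns from the examples above.
const worktreePattern = '-([A-Z]+-\\d+)$';
const issuesPattern = '/issues/(\\d+)$';

// This pattern requires a hyphen directly before the ID, so it matches
// hyphen-joined forms such as a Claude-encoded directory name.
console.log(extractIssueId('-home-dev-repo-worktrees-FOO-12', worktreePattern)); // FOO-12

// A raw path ending in the numeric ID.
console.log(extractIssueId('/home/dev/issues/1234', issuesPattern)); // 1234

// The "$" anchor means a trailing slash prevents a match.
console.log(extractIssueId('/home/dev/issues/1234/', issuesPattern)); // null
```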
84
36
 
85
- If your workflow doesn't give each issue its own working directory (e.g. you switch branches in a single checkout), this package can't disambiguate sessions for you — see "[What it doesn't (and can't) do](#what-it-doesnt-and-cant-do)" below.
37
+ If your workflow doesn't give each issue its own directory, this package can't disambiguate sessions — see "What it doesn't do."
86
38
 
87
39
  ## Install
88
40
 
89
41
  ```bash
90
- # One-shot via npx
91
- npx llm-cost-attribution EPAC-1940
92
-
93
- # Install globally
94
- npm install -g llm-cost-attribution
95
- llm-cost EPAC-1940
42
+ npx llm-cost-attribution EPAC-1940 # one-shot
43
+ npm install -g llm-cost-attribution # then: llm-cost EPAC-1940
96
44
  ```
97
45
 
98
- Requires Node 20+. Zero runtime dependencies.
99
-
100
46
  ## CLI
101
47
 
102
48
  ```
@@ -104,47 +50,49 @@ llm-cost <ISSUE-ID> [options]
104
50
  llm-cost <ISSUE-ID> --from-usage <usage.jsonl-or-dir>
105
51
  llm-cost list
106
52
  llm-cost backfill --out <usage.jsonl-path>
53
+ llm-cost calibrate <usage.jsonl-or-dir> [--seed N] [--holdout F]
107
54
  llm-cost --help
108
55
 
109
56
  Options:
110
- --cwd-pattern <regex> JS regex matching the cwd; one capture group is the issue ID.
111
- Default matches both `<system-temp>/symphony_workspaces/<ID>`
112
- and `<repo>/.symphony/workspaces/<ID>` (raw or Claude-encoded).
113
- --claude-dir <path> Override ~/.claude/projects.
114
- --codex-dir <path> Override ~/.codex/sessions.
115
- --from-usage <path> Read from a usage.jsonl file or directory of `usage*.jsonl`
116
- files instead of the CLI transcripts. See "Delete transcripts,
117
- keep cost history" below.
118
- --out <path> (backfill only) Destination usage.jsonl path. Appended.
119
- --json Emit JSON instead of a table.
120
- -h, --help Print help.
57
+ --cwd-pattern <regex> JS regex matching the cwd; one capture group = issue ID.
58
+ --claude-dir <path> Override ~/.claude/projects.
59
+ --codex-dir <path> Override ~/.codex/sessions.
60
+ --from-usage <path> Read a baked usage.jsonl file/dir instead of transcripts.
61
+ --out <path> (backfill) Destination usage.jsonl. Appended.
62
+ --seed <int> (calibrate) Held-out split seed. Default 1.
63
+ --holdout <0..1> (calibrate) Fraction held out per cell. Default 0.2.
64
+ --quantile <0..1> (calibrate) Band to test. Default 0.8.
65
+ --threshold <0..1> (calibrate) Flag coverage drift beyond this. Default 0.1.
66
+ --json Emit JSON instead of a table.
67
+ --no-pricing Suppress the dollar block.
121
68
  ```
122
69
 
123
- ## Delete transcripts, keep cost history (optional)
70
+ ## Delete transcripts, keep cost history
124
71
 
125
- Transcripts are large (a few MB per session, growing to gigabytes across an active factory), and most of the bytes are conversation content the cost tool doesn't need. So `llm-cost` can **bake** every transcript into a small append-only JSONL file (~1 KB per turn, no prompt or response content), then read cost queries from that file instead. After the bake, transcripts are safe to delete.
72
+ Transcripts are large (MBs per session, GBs across a factory) and mostly conversation content the cost tool doesn't need. `backfill` bakes every transcript into a small append-only JSONL (~1 KB/turn, no prompt/response content); queries then read that file, and the transcripts are safe to delete:
126
73
 
127
74
  ```bash
128
- # Bake every transcript on this machine into one file.
129
75
  llm-cost backfill --out ~/llm-cost-history.jsonl
130
-
131
- # Cost queries now run against the much smaller file:
132
76
  llm-cost EPAC-1940 --from-usage ~/llm-cost-history.jsonl
133
-
134
- # Once you've verified the numbers match, transcripts are safe to delete:
135
- rm -rf ~/.claude/projects ~/.codex/sessions
77
+ rm -rf ~/.claude/projects ~/.codex/sessions # once numbers verified
136
78
  ```
137
79
 
138
- Real-world numbers from a working factory:
139
-
140
- | | Before backfill | After backfill |
80
+ | | Before | After |
141
81
  |---|---:|---:|
142
- | Disk footprint | 5.0 GB | 125 MB (40× smaller) |
143
- | `llm-cost EPAC-1940` query time | ~3 min (full Codex scan) | ~0.3 s |
82
+ | Disk | 5.0 GB | 125 MB (40× smaller) |
83
+ | Query time | ~3 min | ~0.3 s |
144
84
 
145
- The backfill is lossless for everything the cost analysis cares about — including the Codex per-window quota readout, the Claude cache-tier split (5m vs 1h), and the Codex reasoning-vs-visible output split. Token grand totals, turn counts, models, timestamps, and workspace-path provenance are preserved exactly. The bake file can also be checked into a private repo, shipped to a billing host, or queried from CI without access to the machine that produced the agent sessions.
85
+ The bake is lossless for everything the analysis uses (quota windows, Claude cache tiers, Codex reasoning/visible split, totals, models, timestamps, workspace provenance). The format follows the [Symphony Cost Telemetry Extension spec](https://github.com/RiddimSoftware/groove/blob/main/specs/symphony-cost-telemetry-extension/SPEC.md), so a conformant orchestrator can emit `usage.jsonl` directly and skip the bake. That interop is optional, not required.
146
86
 
147
- This whole flow is a built-in feature of the package; you don't need to know anything about the file format to use it. As a side benefit: the format follows the [Symphony Coding-Agent Cost Telemetry Extension spec](https://github.com/RiddimSoftware/groove/blob/main/specs/symphony-cost-telemetry-extension/SPEC.md), so any other tool that conforms can read or write the same file (e.g. a Symphony-spec-conformant orchestrator can emit `usage.jsonl` directly during runs, skipping the bake step entirely). That interop is purely optional; the package works exactly the same whether you care about the spec or not.
87
+ ## Is the forecast trustworthy? (`calibrate`)
88
+
89
+ A **P80** is the 80th-percentile cost — the number 80% of comparable issues come in at or below. Claiming "P80 = 12K tokens" is only honest if, on issues the forecaster never saw, the real cost actually lands under 12K about 80% of the time; otherwise it's a horoscope. `calibrate` checks exactly that against a local `usage.jsonl` whose records are **estimate-tagged** (each one carries the issue's size estimate). It sorts the records into **cells** — groups of past issues sharing the same `{ size, model }` — holds out a reproducible slice of each cell (`--seed` makes the split repeatable), forecasts from what's left, and measures how often the held-out actuals really fell at or below the predicted P80. Any cell whose hit-rate drifts from 80% by more than `--threshold` is flagged ⚠. On a small dataset the coverage figures are themselves noisy — a cell with only a few held-out issues can read 0% or 100% by luck — so treat per-cell flags as directional until cells are well-populated.
90
+
91
+ ```bash
92
+ llm-cost calibrate ~/backfill.out --seed 1 --holdout 0.2
93
+ ```
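The hold-out check described above can be sketched for a single cell as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the PRNG (mulberry32), the nearest-rank quantile, the function names, and the synthetic data are all choices made for this example, and it assumes the cell has enough issues that the training slice is non-empty.

```javascript
// Small seeded PRNG (mulberry32) so the held-out split is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Nearest-rank quantile: the smallest value with at least q of the data at or below it.
function quantile(values, q) {
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

// Split one cell's per-issue costs into train / held-out with a seeded Fisher-Yates shuffle.
function splitCell(costs, holdoutFraction, seed) {
  const rand = mulberry32(seed);
  const shuffled = [...costs];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const holdoutN = Math.max(1, Math.round(shuffled.length * holdoutFraction));
  return { holdout: shuffled.slice(0, holdoutN), train: shuffled.slice(holdoutN) };
}

// Coverage = share of held-out issues whose actual cost is at or below the
// quantile predicted from the training slice.
function cellCoverage(costs, { seed = 1, holdoutFraction = 0.2, q = 0.8 } = {}) {
  const { train, holdout } = splitCell(costs, holdoutFraction, seed);
  const predicted = quantile(train, q);
  const hits = holdout.filter((cost) => cost <= predicted).length;
  return { predicted, coverage: hits / holdout.length, holdoutN: holdout.length };
}

// Synthetic per-issue token totals for one made-up cell.
const costs = Array.from({ length: 50 }, (_, i) => 1000 + i * 100);
const report = cellCoverage(costs, { seed: 1 });
console.log(report);

// Flag check against the documented default threshold of 0.1.
console.log(Math.abs(report.coverage - 0.8) > 0.1 ? 'flag' : 'ok');
```

With only 10 held-out issues, each one moves coverage by 10 points, which illustrates why small cells should be read as directional.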
94
+
95
+ Read-only and local — the input is never written back or committed (point it at a gitignored file). Committed tests use only synthetic fixtures (`test/forecast-recovers-known-dist.test.mjs`).
148
96
 
149
97
  ## Library
150
98
 
@@ -156,48 +104,23 @@ import {
156
104
  listKnownIssues,
157
105
  } from 'llm-cost-attribution';
158
106
 
159
- // Read from transcripts directly:
160
- const rollup = await computeIssueCost('EPAC-1940');
161
- console.log(rollup.combinedTokens);
162
- console.log(rollup.providerTotals.codex.quotaSamples);
163
-
164
- // Or read from a backfilled usage.jsonl:
107
+ const rollup = await computeIssueCost('EPAC-1940');
165
108
  const rollup2 = await computeIssueCostFromUsage('EPAC-1940', '~/llm-cost-history.jsonl');
166
-
167
- // Backfill programmatically:
168
- const result = await backfillUsageFromTranscripts({
169
- outFile: '/tmp/usage.jsonl',
170
- onProgress: ({ phase, processed, total }) => console.log(`${phase}: ${processed}/${total}`),
171
- });
172
- console.log(`Wrote ${result.recordsWritten} records`);
109
+ const result = await backfillUsageFromTranscripts({ outFile: '/tmp/usage.jsonl' });
173
110
  ```
174
111
 
175
- Pass `{ cwdPattern, claudeProjectsDir, codexSessionsDir }` to override defaults on any of the above.
112
+ Pass `{ cwdPattern, claudeProjectsDir, codexSessionsDir }` to override defaults.
176
113
 
177
114
  ## What it doesn't (and can't) do
178
115
 
179
- - **Story-point estimate axis.** Estimates live in your issue tracker (Linear / Jira / GitHub Projects), not in the CLI transcripts. To get cost-vs-estimate rollups you'd need to join issue-tracker data — out of scope for this package.
180
- - **Attempt counts.** The CLI doesn't record "this was attempt #N of M"; if you ran `claude` 5 times on the same issue, this package sees 5 sessions but can't tell you which one shipped.
181
- - **PR-merge state, CI status, reviewer verdicts.** These come from GitHub, not from the CLIs, and the Symphony spec explicitly places them out of scope (§2.2 Non-Goals, §11.5): ticket mutations and PR outcomes are delegated to the coding agent's tooling, not recorded by the orchestrator. This package stops at the same boundary: "what's in the CLI transcript."
182
- - **Anything in the Claude Desktop app, claude.ai, ChatGPT, or direct API SDK calls.** Only Claude Code CLI and Codex CLI sessions are stored in the directories this package reads.
116
+ - **Story-point estimates** live in your tracker, not the transcripts (see the sibling `llm-cost-estimation`).
117
+ - **Attempt counts:** the CLI doesn't record "attempt #N"; 5 runs look like 5 sessions with no winner marked.
118
+ - **PR / CI / reviewer state** comes from GitHub, not the CLIs; out of scope (matches Symphony §2.2 and §11.5).
119
+ - **Claude Desktop, claude.ai, ChatGPT, raw API SDK:** only Claude Code CLI and Codex CLI sessions are read.
183
120
 
184
121
  ## Pricing
185
122
 
186
- `llm-cost` shows API-equivalent dollar cost per bucket alongside the raw token counts, using a built-in rate table sourced from [anthropic.com/pricing](https://www.anthropic.com/pricing) and [platform.openai.com/docs/pricing](https://platform.openai.com/docs/pricing):
187
-
188
- ```
189
- API-equivalent pricing (gpt-5.5 @ rates verified 2026-05-22):
190
- input uncached $7.59 (1.5M × $5.00/1M)
191
- cache read $25.51 (51.0M × $0.500/1M)
192
- output (visible) $1.34 (44.7K × $30.00/1M)
193
- output (reasoning) $0.56 (18.6K × $30.00/1M)
194
- ───────────────────────────────────────────
195
- total API cost $35.00 [hypothetical — your Codex Pro plan covers this]
196
- ```
197
-
198
- **This is a counterfactual, not your actual spend.** If you're on a subscription plan (Claude Max, Codex Pro, etc.), the dollar number represents what the same token volume would have cost on pay-as-you-go API — useful for comparison, but the marginal cost of running it on your actual plan is captured by the Codex quota readout above (`5h primary 58% → 64% used`), not by the dollar total.
199
-
200
- The CLI warns when the bundled rate table is more than 90 days old. Pass `--no-pricing` to suppress the block entirely.
123
+ `llm-cost` shows API-equivalent dollar cost per bucket from a built-in rate table ([Anthropic](https://www.anthropic.com/pricing), [OpenAI](https://platform.openai.com/docs/pricing)). **This is a counterfactual, not your actual spend:** on a subscription plan (Claude Max, Codex Pro) it's what the same tokens would cost pay-as-you-go — your real marginal cost is the quota readout, not the dollar total. The CLI warns when the table is >90 days old; `--no-pricing` suppresses the block.
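The counterfactual dollar figure is tokens times a per-million rate, summed over buckets. The sketch below reproduces the arithmetic using the illustrative rates from the sample pricing block in the 0.1.0 README (the lines shown as removed in this diff); the rates are not a current price list, and the function and field names are assumptions for this example.

```javascript
// Illustrative per-million-token rates copied from the 0.1.0 README sample.
// They are example values only, not current pricing.
const ratesPerMillion = {
  inputUncached: 5.0,
  cacheRead: 0.5,
  outputVisible: 30.0,
  outputReasoning: 30.0,
};

// Token counts from the CODEX block of the sample output.
const tokenCounts = {
  inputUncached: 1_517_206,
  cacheRead: 51_024_768,
  outputVisible: 44_683,
  outputReasoning: 18_649,
};

// Hypothetical helper: price each bucket at its rate and sum the results.
function apiEquivalentCost(tokens, rates) {
  const perBucket = {};
  let total = 0;
  for (const [bucket, count] of Object.entries(tokens)) {
    perBucket[bucket] = (count * rates[bucket]) / 1_000_000;
    total += perBucket[bucket];
  }
  return { perBucket, total };
}

const { perBucket, total } = apiEquivalentCost(tokenCounts, ratesPerMillion);
console.log(perBucket.cacheRead.toFixed(2)); // 25.51
console.log(total.toFixed(2)); // 35.00
```

Under these example rates, cache reads make up most of the dollar total even at a tenth of the uncached input rate, because they make up most of the tokens.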
201
124
 
202
125
  ## License
203
126
 
package/bin/llm-cost.mjs CHANGED
@@ -21,10 +21,13 @@ import { resolve } from 'node:path';
21
21
  import { parseArgs } from 'node:util';
22
22
  import {
23
23
  backfillUsageFromTranscripts,
24
+ calibrateCoverage,
24
25
  computeIssueCost,
25
26
  computeIssueCostFromUsage,
26
27
  computeWorktreeCost,
27
28
  listKnownIssues,
29
+ readUsageRecords,
30
+ validateUsageRecord,
28
31
  } from '../src/index.mjs';
29
32
  import { DEFAULT_CWD_PATTERN } from '../src/issue-pattern.mjs';
30
33
  import { computeMultiIssueRollup, expandAllIssueArgs } from '../src/multi-issue.mjs';
@@ -47,6 +50,10 @@ async function main() {
47
50
  'no-pricing': { type: 'boolean' },
48
51
  worktree: { type: 'string' },
49
52
  out: { type: 'string' },
53
+ seed: { type: 'string' },
54
+ holdout: { type: 'string' },
55
+ quantile: { type: 'string' },
56
+ threshold: { type: 'string' },
50
57
  json: { type: 'boolean' },
51
58
  help: { type: 'boolean', short: 'h' },
52
59
  },
@@ -63,6 +70,7 @@ async function main() {
63
70
  const options = { cwdPattern };
64
71
  if (values['claude-dir'] !== undefined) options.claudeProjectsDir = values['claude-dir'];
65
72
  if (values['codex-dir'] !== undefined) options.codexSessionsDir = values['codex-dir'];
73
+ if (process.stderr.isTTY) options.onProgress = makeProgressReporter();
66
74
 
67
75
  const withPricing = values['no-pricing'] !== true;
68
76
 
@@ -104,6 +112,44 @@ async function main() {
104
112
  return;
105
113
  }
106
114
 
115
+ // `llm-cost calibrate <path>` backtests the forecaster's P80 band against a
116
+ // local estimate-tagged usage.jsonl and prints an empirical coverage report.
117
+ // The input is read locally only — never written back, never committed.
118
+ if (command === 'calibrate') {
119
+ const inputPath = positionals[1];
120
+ if (inputPath === undefined || inputPath === '') {
121
+ console.error('error: calibrate requires a path to a usage.jsonl file or directory');
122
+ process.exit(1);
123
+ }
124
+ const calOptions = {};
125
+ if (values.seed !== undefined) calOptions.seed = parseIntOption(values.seed, 'seed');
126
+ if (values.holdout !== undefined) calOptions.holdoutFraction = parseFloatOption(values.holdout, 'holdout');
127
+ if (values.quantile !== undefined) calOptions.quantile = parseFloatOption(values.quantile, 'quantile');
128
+ if (values.threshold !== undefined) calOptions.deviationThreshold = parseFloatOption(values.threshold, 'threshold');
129
+
130
+ const records = [];
131
+ let invalidLines = 0;
132
+ for await (const rec of readUsageRecords(inputPath)) {
133
+ if (validateUsageRecord(rec) === null) records.push(rec);
134
+ else invalidLines += 1;
135
+ }
136
+
137
+ let report;
138
+ try {
139
+ report = await calibrateCoverage(records, calOptions);
140
+ } catch (err) {
141
+ console.error(`error: ${err.message}`);
142
+ process.exit(1);
143
+ }
144
+
145
+ if (values.json === true) {
146
+ console.log(JSON.stringify(report, null, 2));
147
+ return;
148
+ }
149
+ printCalibrationReport(report, inputPath, invalidLines);
150
+ return;
151
+ }
152
+
107
153
  if (command === 'list') {
108
154
  const ids = await listKnownIssues(options);
109
155
  if (values.json === true) {
@@ -160,6 +206,117 @@ async function main() {
160
206
  printMultiIssueRollup(multi, fromUsage !== undefined, withPricing);
161
207
  }
162
208
 
209
+ /** Parse a CLI integer option, exiting with a clear error on bad input. */
210
+ function parseIntOption(raw, name) {
211
+ const n = Number(raw);
212
+ if (!Number.isInteger(n)) {
213
+ console.error(`error: --${name} must be an integer (got "${raw}")`);
214
+ process.exit(1);
215
+ }
216
+ return n;
217
+ }
218
+
219
+ /** Parse a CLI float option, exiting with a clear error on bad input. */
220
+ function parseFloatOption(raw, name) {
221
+ const n = Number(raw);
222
+ if (!Number.isFinite(n)) {
223
+ console.error(`error: --${name} must be a number (got "${raw}")`);
224
+ process.exit(1);
225
+ }
226
+ return n;
227
+ }
228
+
229
+ /**
230
+ * Print the calibration coverage report: per-cell and overall empirical
231
+ * coverage of the predicted P80 band, with flags for cells that drift from the
232
+ * target by more than the threshold. Low-confidence cells (too few train/held-out
233
+ * issues) are shown but never flagged.
234
+ */
235
+ function printCalibrationReport(report, inputPath, invalidLines = 0) {
236
+ const pct = (q) => (q == null ? ' —' : `${(q * 100).toFixed(0)}%`);
237
+ const targetPct = (report.quantile * 100).toFixed(0);
238
+ const thresholdPp = (report.deviationThreshold * 100).toFixed(0);
239
+
240
+ console.log(HEAD);
241
+ console.log(`CALIBRATION COVERAGE — ${inputPath}`);
242
+ console.log(HEAD);
243
+ console.log(
244
+ `Target band: P${targetPct} Held-out: ${(report.holdoutFraction * 100).toFixed(0)}% ` +
245
+ `Seed: ${report.seed} Flag threshold: ±${thresholdPp}pp`,
246
+ );
247
+ console.log(
248
+ `Records: ${formatNumber(report.overall.recordsTotal)} read, ` +
249
+ `${formatNumber(report.overall.recordsSkipped)} skipped (no cell / unavailable)` +
250
+ (invalidLines > 0 ? `, ${formatNumber(invalidLines)} invalid` : ''),
251
+ );
252
+ console.log(`Issues: ${formatNumber(report.overall.issuesTotal)} across ${report.overall.cellsTotal} cell${report.overall.cellsTotal === 1 ? '' : 's'}`);
253
+ console.log();
254
+
255
+ if (report.cells.length === 0) {
256
+ console.log('No forecastable cells found — need records tagged with size (or estimate) and model.');
257
+ return;
258
+ }
259
+
260
+ const cellLabel = (c) => `${c.cell.size} / ${c.cell.model}${c.lowConfidence ? ' (low conf)' : ''}`;
261
+ const labelWidth = Math.max(20, ...report.cells.map((c) => cellLabel(c).length));
262
+
263
+ console.log(
264
+ padRight('Cell', labelWidth) +
265
+ ' ' + padLeft('Train', 6) +
266
+ ' ' + padLeft('Holdout', 7) +
267
+ ' ' + padLeft(`Pred P${targetPct}`, 9) +
268
+ ' ' + padLeft('Coverage', 8) +
269
+ ' Flag',
270
+ );
271
+ console.log(SEP);
272
+ for (const c of report.cells) {
273
+ console.log(
274
+ padRight(cellLabel(c), labelWidth) +
275
+ ' ' + padLeft(formatNumber(c.trainN), 6) +
276
+ ' ' + padLeft(formatNumber(c.holdoutN), 7) +
277
+ ' ' + padLeft(c.predictedP80 == null ? '—' : formatTokensCompact(c.predictedP80), 9) +
278
+ ' ' + padLeft(pct(c.coverage), 8) +
279
+ ' ' + (c.flagged ? '⚠ FLAG' : ''),
280
+ );
281
+ }
282
+ console.log(SEP);
283
+ console.log(
284
+ padRight('OVERALL', labelWidth) +
285
+ ' ' + padLeft('', 6) +
286
+ ' ' + padLeft(formatNumber(report.overall.holdoutN), 7) +
287
+ ' ' + padLeft('', 9) +
288
+ ' ' + padLeft(pct(report.overall.coverage), 8) +
289
+ ' ' + (report.overall.flagged ? '⚠ FLAG' : ''),
290
+ );
291
+
292
+ const flagged = report.cells.filter((c) => c.flagged);
293
+ console.log();
294
+ if (flagged.length === 0) {
295
+ console.log(`✓ No cells deviate from P${targetPct} coverage by more than ${thresholdPp} points.`);
296
+ } else {
297
+ console.log(`⚠ ${flagged.length} cell${flagged.length === 1 ? '' : 's'} off target by >${thresholdPp}pp: ${flagged.map((c) => `${c.cell.size} / ${c.cell.model}`).join(', ')}`);
298
+ }
299
+ console.log();
300
+ console.log('Note: the input is read locally only — never written back or committed. Keep it gitignored.');
301
+ }
302
+
303
+ /**
304
+ * Returns an onProgress callback that writes a live scan counter to stderr,
305
+ * overwriting the same line each tick. Clears the line when the Codex phase
306
+ * completes so the output table starts on a clean line.
307
+ * Only wired up when stderr is a TTY (not when piping --json output).
308
+ */
309
+ function makeProgressReporter() {
310
+ return ({ phase, processed, total }) => {
311
+ const pct = total === 0 ? 100 : Math.round((processed / total) * 100);
312
+ process.stderr.write(
313
+ ` scanning ${phase} sessions: ${processed.toLocaleString()} / ${total.toLocaleString()} (${pct}%)\r`,
314
+ );
315
+ // Clear the line once each phase finishes so the results table is uncluttered.
316
+ if (processed === total) process.stderr.write(' '.repeat(60) + '\r');
317
+ };
318
+ }
319
+
163
320
  function attachPricingToRollup(rollup) {
164
321
  for (const provider of ['claude', 'codex']) {
165
322
  const totals = rollup.providerTotals[provider];
@@ -184,6 +341,7 @@ function printUsage() {
184
341
  llm-cost <ISSUE-ID> --from-usage <usage.jsonl-or-dir>
185
342
  llm-cost list
186
343
  llm-cost backfill --out <usage.jsonl-path>
344
+ llm-cost calibrate <usage.jsonl-or-dir> [--seed N] [--holdout F]
187
345
  llm-cost --help
188
346
 
189
347
  Per-issue token, turn, and quota analytics for Claude Code and Codex CLI sessions.
@@ -209,6 +367,13 @@ Options:
209
367
  \`usage*.jsonl\` files (per the cost-telemetry spec)
210
368
  instead of from the CLI transcripts.
211
369
  --out <path> (backfill only) Destination usage.jsonl path. Appended.
370
+ --seed <int> (calibrate only) Seed for the deterministic held-out
371
+ split. Default 1.
372
+ --holdout <0..1> (calibrate only) Fraction of each cell's issues to hold
373
+ out for backtesting. Default 0.2.
374
+ --quantile <0..1> (calibrate only) Quantile band to test. Default 0.8 (P80).
375
+ --threshold <0..1> (calibrate only) Flag a cell when coverage drifts from
376
+ the target by more than this. Default 0.1 (10 points).
212
377
  --json Emit machine-readable JSON instead of the table.
213
378
  -h, --help Print this message.
214
379
 
@@ -226,6 +391,10 @@ Examples:
226
391
  # to rm -rf ~/.claude/projects and ~/.codex/sessions.
227
392
  llm-cost backfill --out ~/llm-cost-history.jsonl
228
393
  llm-cost EPAC-1940 --from-usage ~/llm-cost-history.jsonl
394
+
395
+ # Check whether the forecaster's P80 band is actually calibrated against a
396
+ # local, estimate-tagged dataset. The input stays local — never committed.
397
+ llm-cost calibrate ~/backfill.out --seed 1 --holdout 0.2
229
398
  `);
230
399
  }
231
400
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "llm-cost-attribution",
3
- "version": "0.1.0",
3
+ "version": "0.2.0",
4
4
  "description": "Per-issue token, turn, and quota analytics for Claude Code and Codex CLI sessions. Reads the CLIs' own session JSONLs — no telemetry pipeline required.",
5
5
  "type": "module",
6
6
  "bin": {
@@ -14,7 +14,8 @@
14
14
  "LICENSE"
15
15
  ],
16
16
  "scripts": {
17
- "test": "node --test"
17
+ "test": "node --test && npm run test:boundary",
18
+ "test:boundary": "node scripts/check-boundary.mjs"
18
19
  },
19
20
  "keywords": [
20
21
  "claude",