npm - llm-cost-attribution - Versions diffs - 0.1.1 → 0.2.0 - Mend

llm-cost-attribution 0.1.1 → 0.2.0


Files changed (9)

package/README.md +56 -133
package/bin/llm-cost.mjs +151 -0
package/package.json +3 -2
package/src/calibrate.mjs +310 -0
package/src/forecast.mjs +425 -0
package/src/index.mjs +83 -0
package/src/project-forecast.mjs +329 -0
package/src/quantiles.mjs +39 -0
package/src/synthetic.mjs +153 -0

package/README.md CHANGED

@@ -1,102 +1,48 @@
 # llm-cost-attribution
-Per-issue token, turn, and quota analytics for [Claude Code](https://docs.anthropic.com/en/docs/claude-code) and [Codex CLI](https://github.com/openai/codex) sessions. Reads the CLIs' own session JSONLs — **no telemetry pipeline, no database, no API keys**.
+Per-issue cost analytics for [Claude Code](https://docs.anthropic.com/en/docs/claude-code) and [Codex CLI](https://github.com/openai/codex) sessions — how many **tokens** an issue burned, how many **turns** it took (one agent request → response is a turn), and how much of your Codex/Claude plan's rate-limit **quota** it ate. It reads the CLIs' own session logs (JSONL = one JSON record per line) — **no telemetry pipeline, no database, no API keys**.
 ```bash
 npx llm-cost-attribution EPAC-1940
 ```
 ```
-════════════════════════════════════════════════════════════════════════
-LLM COST  —  EPAC-1940
-════════════════════════════════════════════════════════════════════════
-Sessions found:       5
-Total turns:          414
-Total tokens:         61,357,012
-────────────────────────────────────────────────────────────────────────
-CODEX  (4 sessions)
-────────────────────────────────────────────────────────────────────────
-  Models:             gpt-5-codex
-  Turns:              340
-  Tokens:
-    input uncached         1,517,206
-    cache read            51,024,768
-    output (visible)          44,683
-    output (reasoning)        18,649
-    grand total           52,605,306
-  Quota  (plan_type=pro, 345 samples):
-    5h window  58% → 64% used  (peak 64%)
-    7d window  56% → 57% used  (peak 57%)
+LLM COST — EPAC-1940
+Sessions: 5   Turns: 414   Tokens: 61,357,012
+CODEX  (4 sessions)   Models: gpt-5-codex   Turns: 340
+  input uncached      1,517,206
+  cache read         51,024,768
+  output (visible)       44,683
+  output (reasoning)     18,649
+  grand total        52,605,306
+  Quota (pro, 345 samples):  5h 58%→64% (peak 64%)   7d 56%→57% (peak 57%)
 ```
-## Designed for Symphony workflows
+Reading that block: **cache read** is tokens the provider served from its prompt cache (cheap, and usually most of the total); **output (reasoning)** is the model's hidden thinking tokens, billed separately from the **visible** answer; **Quota** is how much of your Codex plan's two rolling rate-limit windows — a 5-hour and a 7-day one — these sessions used.
-[OpenAI Symphony's specification](https://github.com/openai/symphony/blob/main/SPEC.md) requires that each issue gets its own filesystem workspace, and that the coding agent's `cwd` equals that workspace path:
-- **§4.1.4 Workspace** — "Filesystem workspace assigned to one issue identifier."
-- **Workspace path formula** — `<workspace.root>/<sanitized_issue_identifier>`.
-- **Invariant 1** — "Run the coding agent only in the per-issue workspace path... validate: `cwd == workspace_path`."
-Because of those requirements, the working directory of every Claude Code or Codex CLI session that Symphony (or any Symphony-spec-conformant orchestrator) launches always carries the issue identifier as its last path component. The CLI agents in turn record that `cwd` in every session JSONL they create. So the issue identifier is already in the transcript — no custom telemetry pipeline needed to join.
-This package's default `--cwd-pattern` matches the two most common `workspace.root` configurations:
-1. The Symphony spec default: `<system-temp>/symphony_workspaces/<ISSUE-ID>` (e.g. `/tmp/symphony_workspaces/EPAC-1940`).
-2. A common in-repo override: `<repo>/.symphony/workspaces/<ISSUE-ID>` (used by Autopilot and the Riddim factory's Symphony config).
-For any other `workspace.root` setting, pass `--cwd-pattern '<regex>'` with one capture group for the issue identifier — see "[The convention](#the-convention)" below.
+Requires Node 20+. Zero runtime dependencies.
 ## How it works
-Both CLIs persist every session they run as JSONL:
-- **Claude Code** writes `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl` for every interactive and non-interactive run (encoded-cwd is the absolute working directory with `/` and `.` replaced by `-`).
-- **Codex CLI** writes `~/.codex/sessions/YYYY/MM/DD/rollout-<timestamp>-<id>.jsonl` for every run, with the working directory recorded in the first `session_meta` event.
-Each file carries provider-reported token usage per turn — the same numbers your Anthropic / OpenAI account is billed against:
-| Provider | Tokens captured |
-|---|---|
-| Claude | `input_tokens`, `cache_read_input_tokens`, `cache_creation.{ephemeral_5m,1h}_input_tokens`, `output_tokens` |
-| Codex | `input_tokens`, `cached_input_tokens`, `output_tokens`, `reasoning_output_tokens` (deltaed from cumulative) |
-| Codex (additionally) | `rate_limits.{primary,secondary}.used_percent` per turn |
+Both CLIs persist every run as JSONL — Claude Code in `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl` (`<encoded-cwd>` is just the run's working directory with `/` and `.` rewritten to `-`), Codex in `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl` — and each file records, per turn, the provider-reported token counts (the same numbers your account is billed against) plus, for Codex, its rate-limit usage. This package walks both directories, keeps the sessions whose **working directory** matches the issue ID you ask for, and adds them up.
-This package walks both directories, filters sessions whose working directory matches an issue identifier you ask for, and aggregates.
-## The convention
-You map sessions to issues via the **working directory at session start**. By default this package matches the Symphony-spec convention:
-```
-<repo>/.symphony/workspaces/<ISSUE-ID>
-```
-A regex extracts `<ISSUE-ID>`. If your workflow uses a different layout, pass `--cwd-pattern '<regex>'` with one capture group:
+How does a session get matched to an issue? By its **working directory** (`cwd`). Under [Symphony](https://github.com/openai/symphony/blob/main/SPEC.md)'s spec — Symphony being an orchestrator that runs coding agents one issue at a time — each agent runs in a directory dedicated to its issue (`<workspace.root>/<ISSUE-ID>`), so the issue ID is already baked into every transcript's path; no custom pipeline needed. The default `--cwd-pattern` (the regex that pulls the issue ID out of that path) matches both the spec default (`<tmp>/symphony_workspaces/<ID>`) and the common in-repo layout (`<repo>/.symphony/workspaces/<ID>`). For any other layout, pass your own regex with one capture group around the ID:
 ```bash
-# Your workflow uses ../repo-worktrees/<ID>
-llm-cost FOO-12 --cwd-pattern '-([A-Z]+-\d+)$'
-# Your workflow uses ~/issues/<id>/
-llm-cost 1234 --cwd-pattern '/issues/(\d+)$'
+llm-cost FOO-12 --cwd-pattern '-([A-Z]+-\d+)$'   # ../repo-worktrees/<ID>
+llm-cost 1234   --cwd-pattern '/issues/(\d+)$'    # ~/issues/<id>/
 ```
-If your workflow doesn't give each issue its own working directory (e.g. you switch branches in a single checkout), this package can't disambiguate sessions for you — see "[What it doesn't (and can't) do](#what-it-doesnt-and-cant-do)" below.
+If your workflow doesn't give each issue its own directory, this package can't disambiguate sessions — see "What it doesn't do."
 ## Install
 ```bash
-# One-shot via npx
-npx llm-cost-attribution EPAC-1940
-# Install globally
-npm install -g llm-cost-attribution
-llm-cost EPAC-1940
+npx llm-cost-attribution EPAC-1940     # one-shot
+npm install -g llm-cost-attribution    # then: llm-cost EPAC-1940
 ```
-Requires Node 20+. Zero runtime dependencies.
 ## CLI
 ```
@@ -104,47 +50,49 @@ llm-cost <ISSUE-ID> [options]
 llm-cost <ISSUE-ID> --from-usage <usage.jsonl-or-dir>
 llm-cost list
 llm-cost backfill --out <usage.jsonl-path>
+llm-cost calibrate <usage.jsonl-or-dir> [--seed N] [--holdout F]
 llm-cost --help
 Options:
-  --cwd-pattern <regex>   JS regex matching the cwd; one capture group is the issue ID.
-                          Default matches both `<system-temp>/symphony_workspaces/<ID>`
-                          and `<repo>/.symphony/workspaces/<ID>` (raw or Claude-encoded).
-  --claude-dir <path>     Override ~/.claude/projects.
-  --codex-dir <path>      Override ~/.codex/sessions.
-  --from-usage <path>     Read from a usage.jsonl file or directory of `usage*.jsonl`
-                          files instead of the CLI transcripts. See "Delete transcripts,
-                          keep cost history" below.
-  --out <path>            (backfill only) Destination usage.jsonl path. Appended.
-  --json                  Emit JSON instead of a table.
-  -h, --help              Print help.
+  --cwd-pattern <regex>  JS regex matching the cwd; one capture group = issue ID.
+  --claude-dir <path>    Override ~/.claude/projects.
+  --codex-dir <path>     Override ~/.codex/sessions.
+  --from-usage <path>    Read a baked usage.jsonl file/dir instead of transcripts.
+  --out <path>           (backfill) Destination usage.jsonl. Appended.
+  --seed <int>           (calibrate) Held-out split seed. Default 1.
+  --holdout <0..1>       (calibrate) Fraction held out per cell. Default 0.2.
+  --quantile <0..1>      (calibrate) Band to test. Default 0.8.
+  --threshold <0..1>     (calibrate) Flag coverage drift beyond this. Default 0.1.
+  --json                 Emit JSON instead of a table.
+  --no-pricing           Suppress the dollar block.
 ```
-## Delete transcripts, keep cost history (optional)
+## Delete transcripts, keep cost history
-Transcripts are large — a few MB per session, growing to gigabytes across an active factory — and most of the bytes are conversation content the cost tool doesn't need. So `llm-cost` can **bake** every transcript into a small append-only JSONL file (~1 KB per turn, no prompt or response content), then read cost queries from that file instead. After the bake, transcripts are safe to delete.
+Transcripts are large (MBs per session, GBs across a factory) and mostly conversation content the cost tool doesn't need. `backfill` bakes every transcript into a small append-only JSONL (~1 KB/turn, no prompt/response content); queries then read that file, and the transcripts are safe to delete:
 ```bash
-# Bake every transcript on this machine into one file.
 llm-cost backfill --out ~/llm-cost-history.jsonl
-# Cost queries now run against the much smaller file:
 llm-cost EPAC-1940 --from-usage ~/llm-cost-history.jsonl
-# Once you've verified the numbers match, transcripts are safe to delete:
-rm -rf ~/.claude/projects ~/.codex/sessions
+rm -rf ~/.claude/projects ~/.codex/sessions   # once numbers verified
 ```
-Real-world numbers from a working factory:
-| | Before backfill | After backfill |
+| | Before | After |
 |---|---:|---:|
-| Disk footprint | 5.0 GB | 125 MB (40× smaller) |
-| `llm-cost EPAC-1940` query time | ~3 min (full Codex scan) | ~0.3 s |
+| Disk | 5.0 GB | 125 MB (40× smaller) |
+| Query time | ~3 min | ~0.3 s |
-The backfill is lossless for everything the cost analysis cares about — including the Codex per-window quota readout, the Claude cache-tier split (5m vs 1h), and the Codex reasoning-vs-visible output split. Token grand totals, turn counts, models, timestamps, and workspace-path provenance are preserved exactly. The bake file can also be checked into a private repo, shipped to a billing host, or queried from CI without access to the machine that produced the agent sessions.
+The bake is lossless for everything the analysis uses (quota windows, Claude cache tiers, Codex reasoning/visible split, totals, models, timestamps, workspace provenance). The format follows the [Symphony Cost Telemetry Extension spec](https://github.com/RiddimSoftware/groove/blob/main/specs/symphony-cost-telemetry-extension/SPEC.md), so a conformant orchestrator can emit `usage.jsonl` directly and skip the bake — optional interop, not required.
-This whole flow is a built-in feature of the package — you don't need to know anything about the file format to use it. As a side benefit: the format follows the [Symphony Coding-Agent Cost Telemetry Extension spec](https://github.com/RiddimSoftware/groove/blob/main/specs/symphony-cost-telemetry-extension/SPEC.md), so any other tool that conforms can read or write the same file (e.g. a Symphony-spec-conformant orchestrator can emit `usage.jsonl` directly during runs, skipping the bake step entirely). That interop is purely optional; the package works exactly the same whether you care about the spec or not.
+## Is the forecast trustworthy? (`calibrate`)
+A **P80** is the 80th-percentile cost — the number 80% of comparable issues come in at or below. Claiming "P80 = 12K tokens" is only honest if, on issues the forecaster never saw, the real cost actually lands under 12K about 80% of the time; otherwise it's a horoscope. `calibrate` checks exactly that against a local `usage.jsonl` whose records are **estimate-tagged** (each one carries the issue's size estimate). It sorts the records into **cells** — groups of past issues sharing the same `{ size, model }` — holds out a reproducible slice of each cell (`--seed` makes the split repeatable), forecasts from what's left, and measures how often the held-out actuals really fell at or below the predicted P80. Any cell whose hit-rate drifts from 80% by more than `--threshold` is flagged ⚠. On a small dataset the coverage figures are themselves noisy — a cell with only a few held-out issues can read 0% or 100% by luck — so treat per-cell flags as directional until cells are well-populated.
+```bash
+llm-cost calibrate ~/backfill.out --seed 1 --holdout 0.2
+```
+Read-only and local — the input is never written back or committed (point it at a gitignored file). Committed tests use only synthetic fixtures (`test/forecast-recovers-known-dist.test.mjs`).
 ## Library
@@ -156,48 +104,23 @@ import {
   listKnownIssues,
 } from 'llm-cost-attribution';
-// Read from transcripts directly:
-const rollup = await computeIssueCost('EPAC-1940');
-console.log(rollup.combinedTokens);
-console.log(rollup.providerTotals.codex.quotaSamples);
-// Or read from a backfilled usage.jsonl:
+const rollup  = await computeIssueCost('EPAC-1940');
 const rollup2 = await computeIssueCostFromUsage('EPAC-1940', '~/llm-cost-history.jsonl');
-// Backfill programmatically:
-const result = await backfillUsageFromTranscripts({
-  outFile: '/tmp/usage.jsonl',
-  onProgress: ({ phase, processed, total }) => console.log(`${phase}: ${processed}/${total}`),
-});
-console.log(`Wrote ${result.recordsWritten} records`);
+const result  = await backfillUsageFromTranscripts({ outFile: '/tmp/usage.jsonl' });
 ```
-Pass `{ cwdPattern, claudeProjectsDir, codexSessionsDir }` to override defaults on any of the above.
+Pass `{ cwdPattern, claudeProjectsDir, codexSessionsDir }` to override defaults.
 ## What it doesn't (and can't) do
-- **Story-point estimate axis.** Estimates live in your issue tracker (Linear / Jira / GitHub Projects), not in the CLI transcripts. To get cost-vs-estimate rollups you'd need to join issue-tracker data — out of scope for this package.
-- **Attempt counts.** The CLI doesn't record "this was attempt #N of M"; if you ran `claude` 5 times on the same issue, this package sees 5 sessions but can't tell you which one shipped.
-- **PR-merge state, CI status, reviewer verdicts.** These come from GitHub, not from the CLIs — and the Symphony spec explicitly out-of-scopes them (§2.2 Non-Goals, §11.5): ticket mutations and PR outcomes are delegated to the coding agent's tooling, not recorded by the orchestrator. This package stops at the same boundary: "what's in the CLI transcript."
-- **Anything in the Claude Desktop app, claude.ai, ChatGPT, or direct API SDK calls.** Only Claude Code CLI and Codex CLI sessions are stored in the directories this package reads.
+- **Story-point estimates** — live in your tracker, not the transcripts (see the sibling `llm-cost-estimation`).
+- **Attempt counts** — the CLI doesn't record "attempt #N"; 5 runs look like 5 sessions with no winner marked.
+- **PR / CI / reviewer state** — comes from GitHub, not the CLIs; out of scope (matches Symphony §2.2/§11.5).
+- **Claude Desktop, claude.ai, ChatGPT, raw API SDK** — only Claude Code CLI and Codex CLI sessions are read.
 ## Pricing
-`llm-cost` shows API-equivalent dollar cost per bucket alongside the raw token counts, using a built-in rate table sourced from [anthropic.com/pricing](https://www.anthropic.com/pricing) and [platform.openai.com/docs/pricing](https://platform.openai.com/docs/pricing):
-```
-API-equivalent pricing (gpt-5.5 @ rates verified 2026-05-22):
-    input uncached        $7.59    (1.5M × $5.00/1M)
-    cache read           $25.51    (51.0M × $0.500/1M)
-    output (visible)      $1.34    (44.7K × $30.00/1M)
-    output (reasoning)    $0.56    (18.6K × $30.00/1M)
-    ───────────────────────────────────────────
-    total API cost       $35.00    [hypothetical — your Codex Pro plan covers this]
-```
-**This is a counterfactual, not your actual spend.** If you're on a subscription plan (Claude Max, Codex Pro, etc.), the dollar number represents what the same token volume would have cost on pay-as-you-go API — useful for comparison, but the marginal cost of running it on your actual plan is captured by the Codex quota readout above (`5h primary 58% → 64% used`), not by the dollar total.
-The CLI warns when the bundled rate table is more than 90 days old. Pass `--no-pricing` to suppress the block entirely.
+`llm-cost` shows API-equivalent dollar cost per bucket from a built-in rate table ([Anthropic](https://www.anthropic.com/pricing), [OpenAI](https://platform.openai.com/docs/pricing)). **This is a counterfactual, not your actual spend:** on a subscription plan (Claude Max, Codex Pro) it's what the same tokens would cost pay-as-you-go — your real marginal cost is the quota readout, not the dollar total. The CLI warns when the table is >90 days old; `--no-pricing` suppresses the block.
 ## License
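Note on the README's session-matching description: the sketch below shows how a `--cwd-pattern` style regex with one capture group could pull an issue ID out of a working directory, in both raw and Claude-encoded form. `encodeCwd` and `extractIssueId` are hypothetical names introduced here for illustration, and the sample paths are made up. The package's real matcher lives in `src/issue-pattern.mjs`, which this diff does not show.

```javascript
// Hypothetical helpers for illustration; not the package's implementation.

// Claude Code's directory encoding as the README describes it:
// every "/" and "." in the working directory is rewritten to "-".
function encodeCwd(cwd) {
  return cwd.replace(/[/.]/g, '-');
}

// Apply a --cwd-pattern style regex and return its first capture group,
// or null when the path does not match.
function extractIssueId(cwd, pattern) {
  const match = new RegExp(pattern).exec(cwd);
  return match ? (match[1] ?? null) : null;
}

const raw = '/home/dev/repo-worktrees/FOO-12'; // made-up path
const encoded = encodeCwd(raw);

console.log(encoded);
console.log(extractIssueId('/home/dev/issues/1234', '/issues/(\\d+)$'));
console.log(extractIssueId(encoded, '-([A-Z]+-\\d+)$'));
console.log(extractIssueId(raw, '-([A-Z]+-\\d+)$'));
```

With these made-up paths, the `-([A-Z]+-\d+)$` pattern matches only the encoded form, because the raw path has `/` rather than `-` in front of the ID.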

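Note on the README's pricing paragraph: the API-equivalent figure reduces to tokens divided by one million, times the per-million rate, summed over buckets. As a worked check, the sketch below reruns the sample block that 0.2.0 removed from the README, using the token counts from the EPAC-1940 Codex output and the rates printed in the 0.1.1 example. `apiEquivalentCost` is a hypothetical helper, not the package's API, and the rates are copied from that example rather than from a current price list.

```javascript
// Token counts: EPAC-1940 Codex sample from the README.
// Rates (USD per 1M tokens): the pricing block in the 0.1.1 README.
// Illustrative only; not a current price list.
const buckets = [
  { name: 'input uncached', tokens: 1_517_206, usdPerMillion: 5.0 },
  { name: 'cache read', tokens: 51_024_768, usdPerMillion: 0.5 },
  { name: 'output (visible)', tokens: 44_683, usdPerMillion: 30.0 },
  { name: 'output (reasoning)', tokens: 18_649, usdPerMillion: 30.0 },
];

// Hypothetical helper: cost per bucket = tokens / 1e6 * rate.
function apiEquivalentCost(rows) {
  return rows.map((r) => ({ name: r.name, usd: (r.tokens / 1e6) * r.usdPerMillion }));
}

const costs = apiEquivalentCost(buckets);
const total = costs.reduce((sum, c) => sum + c.usd, 0);

for (const c of costs) console.log(`${c.name.padEnd(20)} $${c.usd.toFixed(2)}`);
console.log(`${'total'.padEnd(20)} $${total.toFixed(2)}`);
```

The per-bucket figures round to the $7.59, $25.51, $1.34 and $0.56 shown in the removed block, and the total rounds to $35.00.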
package/bin/llm-cost.mjs CHANGED

@@ -21,10 +21,13 @@ import { resolve } from 'node:path';
 import { parseArgs } from 'node:util';
 import {
   backfillUsageFromTranscripts,
+  calibrateCoverage,
   computeIssueCost,
   computeIssueCostFromUsage,
   computeWorktreeCost,
   listKnownIssues,
+  readUsageRecords,
+  validateUsageRecord,
 } from '../src/index.mjs';
 import { DEFAULT_CWD_PATTERN } from '../src/issue-pattern.mjs';
 import { computeMultiIssueRollup, expandAllIssueArgs } from '../src/multi-issue.mjs';
@@ -47,6 +50,10 @@ async function main() {
       'no-pricing': { type: 'boolean' },
       worktree: { type: 'string' },
       out: { type: 'string' },
+      seed: { type: 'string' },
+      holdout: { type: 'string' },
+      quantile: { type: 'string' },
+      threshold: { type: 'string' },
       json: { type: 'boolean' },
       help: { type: 'boolean', short: 'h' },
     },
@@ -105,6 +112,44 @@ async function main() {
     return;
   }
+  // `llm-cost calibrate <path>` backtests the forecaster's P80 band against a
+  // local estimate-tagged usage.jsonl and prints an empirical coverage report.
+  // The input is read locally only — never written back, never committed.
+  if (command === 'calibrate') {
+    const inputPath = positionals[1];
+    if (inputPath === undefined || inputPath === '') {
+      console.error('error: calibrate requires a path to a usage.jsonl file or directory');
+      process.exit(1);
+    }
+    const calOptions = {};
+    if (values.seed !== undefined) calOptions.seed = parseIntOption(values.seed, 'seed');
+    if (values.holdout !== undefined) calOptions.holdoutFraction = parseFloatOption(values.holdout, 'holdout');
+    if (values.quantile !== undefined) calOptions.quantile = parseFloatOption(values.quantile, 'quantile');
+    if (values.threshold !== undefined) calOptions.deviationThreshold = parseFloatOption(values.threshold, 'threshold');
+    const records = [];
+    let invalidLines = 0;
+    for await (const rec of readUsageRecords(inputPath)) {
+      if (validateUsageRecord(rec) === null) records.push(rec);
+      else invalidLines += 1;
+    }
+    let report;
+    try {
+      report = await calibrateCoverage(records, calOptions);
+    } catch (err) {
+      console.error(`error: ${err.message}`);
+      process.exit(1);
+    }
+    if (values.json === true) {
+      console.log(JSON.stringify(report, null, 2));
+      return;
+    }
+    printCalibrationReport(report, inputPath, invalidLines);
+    return;
+  }
   if (command === 'list') {
     const ids = await listKnownIssues(options);
     if (values.json === true) {
@@ -161,6 +206,100 @@ async function main() {
   printMultiIssueRollup(multi, fromUsage !== undefined, withPricing);
 }
+/** Parse a CLI integer option, exiting with a clear error on bad input. */
+function parseIntOption(raw, name) {
+  const n = Number(raw);
+  if (!Number.isInteger(n)) {
+    console.error(`error: --${name} must be an integer (got "${raw}")`);
+    process.exit(1);
+  }
+  return n;
+}
+/** Parse a CLI float option, exiting with a clear error on bad input. */
+function parseFloatOption(raw, name) {
+  const n = Number(raw);
+  if (!Number.isFinite(n)) {
+    console.error(`error: --${name} must be a number (got "${raw}")`);
+    process.exit(1);
+  }
+  return n;
+}
+/**
+ * Print the calibration coverage report: per-cell and overall empirical
+ * coverage of the predicted P80 band, with flags for cells that drift from the
+ * target by more than the threshold. Low-confidence cells (too few train/held-out
+ * issues) are shown but never flagged.
+ */
+function printCalibrationReport(report, inputPath, invalidLines = 0) {
+  const pct = (q) => (q == null ? '   —' : `${(q * 100).toFixed(0)}%`);
+  const targetPct = (report.quantile * 100).toFixed(0);
+  const thresholdPp = (report.deviationThreshold * 100).toFixed(0);
+  console.log(HEAD);
+  console.log(`CALIBRATION COVERAGE  —  ${inputPath}`);
+  console.log(HEAD);
+  console.log(
+    `Target band: P${targetPct}   Held-out: ${(report.holdoutFraction * 100).toFixed(0)}%   ` +
+    `Seed: ${report.seed}   Flag threshold: ±${thresholdPp}pp`,
+  );
+  console.log(
+    `Records: ${formatNumber(report.overall.recordsTotal)} read, ` +
+    `${formatNumber(report.overall.recordsSkipped)} skipped (no cell / unavailable)` +
+    (invalidLines > 0 ? `, ${formatNumber(invalidLines)} invalid` : ''),
+  );
+  console.log(`Issues: ${formatNumber(report.overall.issuesTotal)} across ${report.overall.cellsTotal} cell${report.overall.cellsTotal === 1 ? '' : 's'}`);
+  console.log();
+  if (report.cells.length === 0) {
+    console.log('No forecastable cells found — need records tagged with size (or estimate) and model.');
+    return;
+  }
+  const cellLabel = (c) => `${c.cell.size} / ${c.cell.model}${c.lowConfidence ? '  (low conf)' : ''}`;
+  const labelWidth = Math.max(20, ...report.cells.map((c) => cellLabel(c).length));
+  console.log(
+    padRight('Cell', labelWidth) +
+    '  ' + padLeft('Train', 6) +
+    '  ' + padLeft('Holdout', 7) +
+    '  ' + padLeft(`Pred P${targetPct}`, 9) +
+    '  ' + padLeft('Coverage', 8) +
+    '  Flag',
+  );
+  console.log(SEP);
+  for (const c of report.cells) {
+    console.log(
+      padRight(cellLabel(c), labelWidth) +
+      '  ' + padLeft(formatNumber(c.trainN), 6) +
+      '  ' + padLeft(formatNumber(c.holdoutN), 7) +
+      '  ' + padLeft(c.predictedP80 == null ? '—' : formatTokensCompact(c.predictedP80), 9) +
+      '  ' + padLeft(pct(c.coverage), 8) +
+      '  ' + (c.flagged ? '⚠ FLAG' : ''),
+    );
+  }
+  console.log(SEP);
+  console.log(
+    padRight('OVERALL', labelWidth) +
+    '  ' + padLeft('', 6) +
+    '  ' + padLeft(formatNumber(report.overall.holdoutN), 7) +
+    '  ' + padLeft('', 9) +
+    '  ' + padLeft(pct(report.overall.coverage), 8) +
+    '  ' + (report.overall.flagged ? '⚠ FLAG' : ''),
+  );
+  const flagged = report.cells.filter((c) => c.flagged);
+  console.log();
+  if (flagged.length === 0) {
+    console.log(`✓ No cells deviate from P${targetPct} coverage by more than ${thresholdPp} points.`);
+  } else {
+    console.log(`⚠  ${flagged.length} cell${flagged.length === 1 ? '' : 's'} off target by >${thresholdPp}pp: ${flagged.map((c) => `${c.cell.size} / ${c.cell.model}`).join(', ')}`);
+  }
+  console.log();
+  console.log('Note: the input is read locally only — never written back or committed. Keep it gitignored.');
+}
 /**
  * Returns an onProgress callback that writes a live scan counter to stderr,
  * overwriting the same line each tick. Clears the line when the Codex phase
@@ -202,6 +341,7 @@ function printUsage() {
        llm-cost <ISSUE-ID> --from-usage <usage.jsonl-or-dir>
        llm-cost list
        llm-cost backfill --out <usage.jsonl-path>
+       llm-cost calibrate <usage.jsonl-or-dir> [--seed N] [--holdout F]
        llm-cost --help
 Per-issue token, turn, and quota analytics for Claude Code and Codex CLI sessions.
@@ -227,6 +367,13 @@ Options:
                           \`usage*.jsonl\` files (per the cost-telemetry spec)
                           instead of from the CLI transcripts.
   --out <path>            (backfill only) Destination usage.jsonl path. Appended.
+  --seed <int>            (calibrate only) Seed for the deterministic held-out
+                          split. Default 1.
+  --holdout <0..1>        (calibrate only) Fraction of each cell's issues to hold
+                          out for backtesting. Default 0.2.
+  --quantile <0..1>       (calibrate only) Quantile band to test. Default 0.8 (P80).
+  --threshold <0..1>      (calibrate only) Flag a cell when coverage drifts from
+                          the target by more than this. Default 0.1 (10 points).
   --json                  Emit machine-readable JSON instead of the table.
   -h, --help              Print this message.
@@ -244,6 +391,10 @@ Examples:
   # to rm -rf ~/.claude/projects and ~/.codex/sessions.
   llm-cost backfill --out ~/llm-cost-history.jsonl
   llm-cost EPAC-1940 --from-usage ~/llm-cost-history.jsonl
+  # Check whether the forecaster's P80 band is actually calibrated against a
+  # local, estimate-tagged dataset. The input stays local — never committed.
+  llm-cost calibrate ~/backfill.out --seed 1 --holdout 0.2
 `);
 }
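Note on `parseIntOption` and `parseFloatOption` above: `Number('')` evaluates to `0`, so an empty `--seed` value would pass the integer check, and the float parser does not enforce the `0..1` range that the help text documents. `calibrateCoverage` may well validate these itself, since its source is not shown in this excerpt. The sketch below is one possible stricter variant. The names `parseIntStrict` and `parseUnitInterval`, and the choice to throw rather than call `process.exit`, are assumptions made for illustration.

```javascript
// Hypothetical stricter option parsers; not part of the published package.

function rejectBlank(raw, name, kind) {
  // Number('') and Number('   ') are both 0, so blank input is rejected first.
  if (typeof raw !== 'string' || raw.trim() === '') {
    throw new Error(`--${name} must be ${kind} (got "${raw}")`);
  }
}

function parseIntStrict(raw, name) {
  rejectBlank(raw, name, 'an integer');
  const n = Number(raw);
  if (!Number.isInteger(n)) {
    throw new Error(`--${name} must be an integer (got "${raw}")`);
  }
  return n;
}

// For --holdout, --quantile and --threshold, all documented as 0..1.
function parseUnitInterval(raw, name) {
  rejectBlank(raw, name, 'a number between 0 and 1');
  const n = Number(raw);
  if (!Number.isFinite(n) || n < 0 || n > 1) {
    throw new Error(`--${name} must be a number between 0 and 1 (got "${raw}")`);
  }
  return n;
}

console.log(parseIntStrict('7', 'seed'));
console.log(parseUnitInterval('0.2', 'holdout'));
try {
  parseUnitInterval('1.5', 'holdout');
} catch (err) {
  console.log(err.message);
}
```

Throwing instead of exiting keeps the parsers testable. A caller such as `main()` could catch the error and print it before exiting, as the existing `calibrate` branch already does for `calibrateCoverage`.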

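Note on the `calibrate` command: the README describes holding out a seeded slice of each cell, forecasting from the remainder, and measuring how often held-out actuals fall at or below the predicted band. A minimal sketch of that idea for a single cell follows. It is not the contents of `src/calibrate.mjs`, which this excerpt does not show. The PRNG (mulberry32), the nearest-rank quantile rule, and every function name here are assumptions chosen to keep the example short.

```javascript
// Hypothetical single-cell coverage check; not the package's implementation.

// Small seeded PRNG (mulberry32) so the split repeats for a given seed.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Nearest-rank quantile: the smallest sample value with at least a
// fraction q of the sample at or below it.
function quantile(values, q) {
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

// Shuffle one cell's per-issue costs, hold out a fraction, predict the band
// from the rest, and report how often held-out actuals landed at or below it.
function coverageForCell(costs, { seed = 1, holdoutFraction = 0.2, q = 0.8, threshold = 0.1 } = {}) {
  if (costs.length < 2) throw new Error('need at least 2 issues to split a cell');
  const rand = mulberry32(seed);
  const shuffled = [...costs];
  for (let i = shuffled.length - 1; i > 0; i -= 1) {
    const j = Math.floor(rand() * (i + 1)); // Fisher-Yates shuffle
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const wanted = Math.max(1, Math.round(shuffled.length * holdoutFraction));
  const holdoutN = Math.min(shuffled.length - 1, wanted); // keep >= 1 for training
  const holdout = shuffled.slice(0, holdoutN);
  const train = shuffled.slice(holdoutN);
  const predicted = quantile(train, q);
  const coverage = holdout.filter((c) => c <= predicted).length / holdoutN;
  return {
    trainN: train.length,
    holdoutN,
    predicted,
    coverage,
    flagged: Math.abs(coverage - q) > threshold,
  };
}

// Made-up per-issue token totals for one cell.
const sample = Array.from({ length: 50 }, (_, i) => (i + 1) * 1000);
console.log(coverageForCell(sample, { seed: 1 }));
```

With 10 held-out issues, coverage can only move in steps of 10 points, which illustrates the README's caution that per-cell flags are noisy on small datasets.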
package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "llm-cost-attribution",
-  "version": "0.1.1",
+  "version": "0.2.0",
   "description": "Per-issue token, turn, and quota analytics for Claude Code and Codex CLI sessions. Reads the CLIs' own session JSONLs — no telemetry pipeline required.",
   "type": "module",
   "bin": {
@@ -14,7 +14,8 @@
     "LICENSE"
   ],
   "scripts": {
-    "test": "node --test"
+    "test": "node --test && npm run test:boundary",
+    "test:boundary": "node scripts/check-boundary.mjs"
   },
   "keywords": [
     "claude",