@verica-app/cli 0.1.3 → 0.1.5

package/CHANGELOG.md CHANGED
@@ -8,6 +8,24 @@ version and additive features/fixes bump the **patch** (see [Stability](./README
 
  ## [Unreleased]
 
+ ## [0.1.4] - 2026-06-19
+
+ ### Added
+
+ - `--reuse-if-unchanged` (with `--reuse-max-age <hrs>` and `--reuse-same-ref`) to
+   reuse a recent **completed** run instead of executing again when the config
+   (prompt version + model + sampling + dataset snapshot + graders) is unchanged.
+   Opt-in and freshness-bounded (default 24h, max 720) — re-running stays the
+   default, since reuse can't see provider drift. On a cache hit the `--json`
+   element gains `"reused": true` and a `reusedFrom` block (additive — a normal
+   run's payload is unchanged); the API answers `200` instead of `202`. Cannot be
+   combined with `--threshold` / `--baseline-*` (the reused verdict was frozen under
+   the prior gate).
+
+ - `--git-repo-url <url>` (auto-detected from `GITHUB_SERVER_URL`/`GITHUB_REPOSITORY`,
+   else GitLab's `CI_PROJECT_URL`) sent as `git.repoUrl`, so the run UI can link the
+   commit SHA to `<repoUrl>/commit/<sha>` and the branch to `<repoUrl>/tree/<ref>`.
+
  ## [0.1.3] - 2026-06-18
 
  ### Added
package/README.md CHANGED
@@ -47,12 +47,12 @@ evals:
    - id: eval_8x2k9d
      prompt: prompts/support-agent.txt
      systemPrompt: prompts/support-agent.system.txt
-     tools: prompts/support-agent.tools.json # a path to a JSON file…
+     tools: prompts/support-agent.tools.json # a path to a JSON file…
      sampling: { temperature: 0.2, maxTokens: 512 }
      model: gpt-4.1-mini
    - id: eval_3p1m7q
      prompt: prompts/triage.txt
-     tools: # …or an inline array
+     tools: # …or an inline array
        - name: get_order
          description: Look up an order by id
          parameters: { type: object, properties: { id: { type: string } }, required: [id] }
@@ -77,7 +77,8 @@ default — you don't configure a URL.
  - `--junit <file>` · `--junit-mode rows|gate` — JUnit report (default `rows`).
  - `--json` — machine-readable results on stdout.
  - `--threshold <0..1>` · `--baseline-ref <ref>` · `--baseline-run <id>` — override the gate per branch.
- - `--git-sha` / `--git-ref` — provenance (auto-detected from CI env otherwise).
+ - `--reuse-if-unchanged` · `--reuse-max-age <hrs>` · `--reuse-same-ref` — reuse a recent completed run instead of re-executing an unchanged config. See [Reuse](#reuse-skip-re-running-an-unchanged-config).
+ - `--git-sha` / `--git-ref` / `--git-repo-url` — provenance + the repo web base that links the SHA in the run UI (all auto-detected from CI env otherwise).
 
  > Local dev / self-hosting only: point the CLI at another instance with `--base-url`
  > (or the `VERICA_BASE_URL` env var). Clients never need this.
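
For the `--git-repo-url` flag added in this hunk, the changelog says the run UI links the commit SHA to `<repoUrl>/commit/<sha>` and the branch to `<repoUrl>/tree/<ref>`. A minimal sketch of that link construction follows, assuming a hypothetical helper named `buildRunLinks`. It is not part of the CLI, and the run UI's real implementation is not shown in this diff.

```javascript
// Hypothetical helper (not in the package): builds the two links the
// changelog describes from a `git` object shaped like the request's `git` field.
function buildRunLinks(git) {
  if (!git || !git.repoUrl) return {};
  // Trim trailing slashes, mirroring what the CLI does for --git-repo-url.
  const base = git.repoUrl.replace(/\/+$/, "");
  return {
    ...(git.sha ? { commit: `${base}/commit/${git.sha}` } : {}),
    ...(git.ref ? { tree: `${base}/tree/${git.ref}` } : {}),
  };
}

// Example values are placeholders, not real commits.
const exampleLinks = buildRunLinks({
  repoUrl: "https://github.com/acme/widgets/",
  sha: "abc123",
  ref: "main",
});
console.log(exampleLinks);
```

Without a `repoUrl` the helper returns an empty object, which matches the flag being optional: the SHA and ref are still recorded, just not linked.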
@@ -108,15 +109,58 @@ wrapper (auto-unwrapped), so you can paste your real schemas as-is:
 
  ```json
  [
-   { "name": "get_order", "description": "Look up an order by id",
-     "parameters": { "type": "object", "properties": { "id": { "type": "string" } }, "required": ["id"] } },
-   { "type": "function", "function": { "name": "cancel_order", "description": "…", "parameters": { "type": "object" } } }
+   {
+     "name": "get_order",
+     "description": "Look up an order by id",
+     "parameters": {
+       "type": "object",
+       "properties": { "id": { "type": "string" } },
+       "required": ["id"]
+     }
+   },
+   {
+     "type": "function",
+     "function": { "name": "cancel_order", "description": "…", "parameters": { "type": "object" } }
+   }
  ]
  ```
 
- Tools are never executed — the model's *decision* to call one (and with which
+ Tools are never executed — the model's _decision_ to call one (and with which
  arguments) is what the eval grades.
 
+ ## Reuse (skip re-running an unchanged config)
+
+ By default every `verica run` executes — re-running an unchanged eval is often the
+ **point** in CI (it catches silent model drift and run-to-run variance, since an
+ eval's output isn't a pure function of its inputs). When you'd rather save the
+ tokens, opt in with `--reuse-if-unchanged`:
+
+ ```bash
+ verica run --eval eval_8x2k9d --model gpt-4.1-mini --reuse-if-unchanged --wait
+ # if the same config ran & completed in the last 24h, returns that verdict — no new run
+ ```
+
+ "Unchanged" means the **prompt version + model + sampling + dataset snapshot +
+ graders** all match a prior run (the gate is _not_ part of it — it decides the
+ verdict, not the output). On a hit the CLI exits on the prior run's frozen verdict
+ and the `--json` element carries `"reused": true` plus a `reusedFrom` block (the API
+ also answers `200` instead of `202`).
+
+ - `--reuse-max-age <hrs>` — how stale a reusable run may be (default **24**, max
+   **720**). There is no "forever": reuse can't see provider-side drift behind a
+   stable model id, so it's always bounded — that bound is your staleness budget.
+ - `--reuse-same-ref` — only reuse a run on the **same git ref**. Off by default: an
+   identical config produces the same output distribution regardless of branch.
+ - Only **completed** runs are reused (never a partial/failed one).
+ - Incompatible with `--threshold` / `--baseline-ref` / `--baseline-run`. Reuse hands
+   back a _prior_ run's verdict, frozen under the gate that applied when it ran, so a
+   new `--threshold` can't be recomputed against it. `--baseline-ref` is worse than
+   stale: no-regression compares against the _last run on the ref_ — a moving target —
+   so a cached verdict can never be a fresh no-regression check. Gate on either → run
+   fresh (omit reuse).
+
+ Omit `--reuse-if-unchanged` (the default) any time you want a guaranteed fresh run.
+
  ## Exit codes
 
  `0` passed · `1` gate failed · `2` validation/transport error.
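
As an illustration of the Reuse section added in this hunk, a CI script could inspect the `--json` output for the additive `reused` / `reusedFrom` fields and report how old a reused verdict is. The sketch below is hypothetical: `describeReuse` is not a CLI function, the input is assumed to be an array of per-eval elements, and the run ids are made-up placeholders. Only the field names (`evalId`, `reused`, `reusedFrom.runId`, `reusedFrom.finishedAt`) are taken from the diff.

```javascript
// Hypothetical consumer of the CLI's `--json` output.
// Assumption: `elements` is an array with one element per eval.
function describeReuse(elements, now = new Date()) {
  const MS_PER_HOUR = 60 * 60 * 1000;
  return elements
    .filter((el) => el.reused === true)
    .map((el) => {
      // `reusedFrom` is optional in the response schema, so guard for its absence.
      const from = el.reusedFrom;
      const ageHours = from
        ? (now.getTime() - Date.parse(from.finishedAt)) / MS_PER_HOUR
        : null;
      return {
        evalId: el.evalId,
        reusedRunId: from ? from.runId : null,
        ageHours,
      };
    });
}

// Placeholder data shaped like the fields described in the diff.
const sampleElements = [
  {
    evalId: "eval_8x2k9d",
    runId: "run_placeholder_1",
    reused: true,
    reusedFrom: {
      runId: "run_placeholder_0",
      finishedAt: "2026-06-19T00:00:00Z",
      status: "completed",
      gitSha: null,
      gitRef: null,
    },
  },
  { evalId: "eval_3p1m7q", runId: "run_placeholder_2" }, // a normal run: no reuse fields
];

console.log(describeReuse(sampleElements, new Date("2026-06-19T06:00:00Z")));
```

An age close to the `--reuse-max-age` bound suggests the verdict is near the edge of the staleness budget, which may be a reason to run fresh.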
@@ -136,8 +180,8 @@ During `0.x` the **minor** version is the breaking lever, so pin accordingly:
 
  We bump the **minor** for any breaking change (flags, output shapes, push behavior) and
  the **patch** for additive features and fixes. **1.0** will freeze the commands, flags,
- exit codes, and output shapes under standard semver. See
- [CHANGELOG.md](./CHANGELOG.md) for what changed in each release.
+ exit codes, and output shapes under standard semver. See the bundled `CHANGELOG.md`
+ for what changed in each release.
 
  MIT licensed. There's no IP in the client — the engine, graders, gate, and crypto all
  run server-side behind the token API.
package/dist/cli.js CHANGED
@@ -4086,7 +4086,13 @@ var runRequestSchema = external_exports.object({
    /** Commit provenance, stamped on the run. */
    git: external_exports.object({
      sha: external_exports.string().optional(),
-     ref: external_exports.string().optional()
+     ref: external_exports.string().optional(),
+     /**
+      * The repository's web base URL (e.g. `https://github.com/acme/widgets`) so
+      * the run UI can link the SHA → `<repoUrl>/commit/<sha>` and the branch →
+      * `<repoUrl>/tree/<ref>`. The CLI auto-detects it from CI env.
+      */
+     repoUrl: external_exports.string().optional()
    }).optional(),
    /** CLI gate overrides (precedence over the eval's pass_condition). */
    gate: external_exports.object({
@@ -4096,6 +4102,25 @@ var runRequestSchema = external_exports.object({
      baselineRef: external_exports.string().optional(),
      /** Pin a specific baseline run (wins over baselineRef). */
      baselineRunId: external_exports.string().optional()
+   }).optional(),
+   /**
+    * Opt-in cost control: when the merged config (prompt version + model +
+    * sampling + dataset snapshot + graders) matches a recent COMPLETED run, the
+    * server returns that run's frozen verdict instead of executing again. NOT a
+    * default — an eval's output isn't a pure function of its config (generation +
+    * judge are non-deterministic, the model endpoint drifts), so reuse is always
+    * the caller's explicit choice and is bounded by `maxAgeHours`. Incompatible
+    * with `gate`: a reused verdict is frozen under its own gate (a new threshold
+    * can't be recomputed), and no-regression compares against a moving baseline
+    * (the last run on the ref), so a cache can never be a fresh gated check.
+    */
+   reuse: external_exports.object({
+     /** Turn reuse on. The trigger — everything else is just tuning. */
+     ifUnchanged: external_exports.boolean().optional(),
+     /** Max age (hours) of a reusable run; server default 24, cap 720 (30d). No "infinite reuse". */
+     maxAgeHours: external_exports.number().positive().max(720).optional(),
+     /** Also require the prior run's git ref to match (per-branch isolation); default false. */
+     sameRef: external_exports.boolean().optional()
    }).optional()
  });
  var runAcceptedSchema = external_exports.object({
@@ -4104,7 +4129,23 @@ var runAcceptedSchema = external_exports.object({
    promptVersion: external_exports.number().int(),
    /** Whether a NEW prompt version was created (vs. the current one reused). */
    created: external_exports.boolean(),
-   resultUrl: external_exports.string()
+   resultUrl: external_exports.string(),
+   /**
+    * Whether this response reuses a prior run instead of executing a new one (a
+    * cache hit on `reuse.ifUnchanged`). The HTTP status reflects it too: 200 when
+    * reused, 202 when a fresh run was enqueued. Optional so an older API that
+    * predates reuse (omitting it) reads as `false`.
+    */
+   reused: external_exports.boolean().optional(),
+   /** Provenance of the reused run — present iff `reused` is true. */
+   reusedFrom: external_exports.object({
+     runId: external_exports.string(),
+     /** ISO timestamp the reused run finished — shows how stale the verdict is. */
+     finishedAt: external_exports.string(),
+     status: external_exports.literal("completed"),
+     gitSha: external_exports.string().nullable(),
+     gitRef: external_exports.string().nullable()
+   }).optional()
  });
  var runStatusSchema = external_exports.enum([
    "queued",
@@ -4409,16 +4450,21 @@ async function runCommand(opts) {
    const entries = await resolveEntries(opts);
    const git = resolveGit(opts);
    const gate = resolveGate(opts);
+   const reuse = resolveReuse(opts);
    const suites = [];
    const summaries = [];
    let worst = EXIT.pass;
    for (const entry of entries) {
      try {
-       const body = await buildRequest(entry, { samplingFile: opts.samplingFile, git, gate });
+       const body = await buildRequest(entry, { samplingFile: opts.samplingFile, git, gate, reuse });
        const accepted = await client.triggerRun(entry.id, body);
-       err(
-         `\u25B6 ${entry.id}: run ${accepted.runId} queued (prompt v${accepted.promptVersion}${accepted.created ? ", new version" : ", reused"})`
-       );
+       const promptNote = `prompt v${accepted.promptVersion}${accepted.created ? ", new version" : ""}`;
+       if (accepted.reused) {
+         err(`\u267B ${entry.id}: run ${accepted.runId} reused (${promptNote})`);
+         if (accepted.reusedFrom) err(` \u21B3 a completed run from ${accepted.reusedFrom.finishedAt}`);
+       } else {
+         err(`\u25B6 ${entry.id}: run ${accepted.runId} queued (${promptNote})`);
+       }
        err(` ${accepted.resultUrl}`);
        if (!opts.wait) {
          summaries.push(buildSummary(entry.id, { status: "queued", accepted }));
@@ -4435,7 +4481,7 @@ async function runCommand(opts) {
            opts.junitMode === "rows" ? rowsSuite(entry.id, await client.getResults(accepted.runId)) : gateSuite(entry.id, run)
          );
        }
-       summaries.push(buildSummary(entry.id, { status: "waited", runId: accepted.runId, run }));
+       summaries.push(buildSummary(entry.id, { status: "waited", accepted, run }));
      } catch (e) {
        worst = EXIT.error;
        const message = e instanceof Error ? e.message : String(e);
@@ -4487,19 +4533,34 @@ async function buildRequest(entry, ctx) {
      ...prompt ? { prompt } : {},
      ...sampling ? { samplingParams: sampling } : {},
      ...ctx.git ? { git: ctx.git } : {},
-     ...ctx.gate ? { gate: ctx.gate } : {}
+     ...ctx.gate ? { gate: ctx.gate } : {},
+     ...ctx.reuse ? { reuse: ctx.reuse } : {}
    };
  }
  function buildSummary(evalId, outcome) {
    switch (outcome.status) {
      case "queued":
-       return { evalId, runId: outcome.accepted.runId, resultUrl: outcome.accepted.resultUrl };
+       return {
+         evalId,
+         runId: outcome.accepted.runId,
+         resultUrl: outcome.accepted.resultUrl,
+         ...reuseFields(outcome.accepted)
+       };
      case "waited":
-       return { evalId, runId: outcome.runId, ...outcome.run };
+       return {
+         evalId,
+         runId: outcome.accepted.runId,
+         ...outcome.run,
+         ...reuseFields(outcome.accepted)
+       };
      case "error":
        return { evalId, error: outcome.message };
    }
  }
+ function reuseFields(accepted) {
+   if (!accepted.reused) return {};
+   return { reused: true, ...accepted.reusedFrom ? { reusedFrom: accepted.reusedFrom } : {} };
+ }
  async function resolveTools(tools) {
    if (tools === void 0) return void 0;
    const raw = typeof tools === "string" ? JSON.parse(await readFile2(tools, "utf8")) : tools;
@@ -4508,8 +4569,21 @@ async function resolveTools(tools) {
  function resolveGit(opts) {
    const sha = opts.gitSha ?? process.env.GITHUB_SHA ?? process.env.CI_COMMIT_SHA;
    const ref = opts.gitRef ?? process.env.GITHUB_REF ?? process.env.CI_COMMIT_REF_NAME;
-   if (!sha && !ref) return void 0;
-   return { ...sha ? { sha } : {}, ...ref ? { ref } : {} };
+   const repoUrl = resolveRepoUrl(opts);
+   if (!sha && !ref && !repoUrl) return void 0;
+   return {
+     ...sha ? { sha } : {},
+     ...ref ? { ref } : {},
+     ...repoUrl ? { repoUrl } : {}
+   };
+ }
+ function resolveRepoUrl(opts) {
+   if (opts.gitRepoUrl) return opts.gitRepoUrl.replace(/\/+$/, "");
+   const server = process.env.GITHUB_SERVER_URL;
+   const repo = process.env.GITHUB_REPOSITORY;
+   if (server && repo) return `${server.replace(/\/+$/, "")}/${repo}`;
+   if (process.env.CI_PROJECT_URL) return process.env.CI_PROJECT_URL.replace(/\/+$/, "");
+   return void 0;
  }
  function resolveGate(opts) {
    const gate = {};
@@ -4518,6 +4592,14 @@ function resolveGate(opts) {
    if (opts.baselineRun !== void 0) gate.baselineRunId = opts.baselineRun;
    return Object.keys(gate).length > 0 ? gate : void 0;
  }
+ function resolveReuse(opts) {
+   if (!opts.reuseIfUnchanged) return void 0;
+   return {
+     ifUnchanged: true,
+     ...opts.reuseMaxAgeHours !== void 0 ? { maxAgeHours: opts.reuseMaxAgeHours } : {},
+     ...opts.reuseSameRef ? { sameRef: true } : {}
+   };
+ }
  function pct(n) {
    return n == null ? "?" : `${(n * 100).toFixed(1)}%`;
  }
@@ -4562,8 +4644,16 @@ Options:
    --threshold <0..1>      Override the gate's minimum pass rate.
    --baseline-ref <ref>    No-regression baseline = last run on this git ref.
    --baseline-run <id>     No-regression baseline = this specific run.
+   --reuse-if-unchanged    Reuse a recent completed run instead of executing again
+                           when the config (prompt + model + sampling + dataset +
+                           graders) is unchanged. Off by default. Incompatible with
+                           --threshold / --baseline-*.
+   --reuse-max-age <hrs>   Max age of a reusable run (default 24, max 720).
+   --reuse-same-ref        Only reuse a run on the same git ref (default: any ref).
    --git-sha <sha>         Commit SHA (else auto-detected from CI env).
    --git-ref <ref>         Git ref (else auto-detected from CI env).
+   --git-repo-url <url>    Repo web base for the SHA link in the run UI (e.g.
+                           https://github.com/acme/widgets). Auto-detected from CI env.
    --base-url <url>        Override the API base URL (dev/self-host only).
    --poll-interval <sec>   Initial poll interval (default 3).
    --timeout <sec>         Max wait (default 1800).
@@ -4598,8 +4688,12 @@ async function main() {
        threshold: { type: "string" },
        "baseline-ref": { type: "string" },
        "baseline-run": { type: "string" },
+       "reuse-if-unchanged": { type: "boolean", default: false },
+       "reuse-max-age": { type: "string" },
+       "reuse-same-ref": { type: "boolean", default: false },
        "git-sha": { type: "string" },
        "git-ref": { type: "string" },
+       "git-repo-url": { type: "string" },
        "base-url": { type: "string" },
        "poll-interval": { type: "string" },
        timeout: { type: "string" },
@@ -4618,6 +4712,17 @@ async function main() {
    if (values.threshold !== void 0 && threshold === void 0) {
      throw new Error(`--threshold must be a number between 0 and 1 (got "${values.threshold}").`);
    }
+   const reuseMaxAge = finiteNumber(values["reuse-max-age"]);
+   if (values["reuse-max-age"] !== void 0 && (reuseMaxAge === void 0 || reuseMaxAge <= 0 || reuseMaxAge > 720)) {
+     throw new Error(
+       `--reuse-max-age must be a number of hours in (0, 720] (got "${values["reuse-max-age"]}").`
+     );
+   }
+   if (values["reuse-if-unchanged"] && (threshold !== void 0 || values["baseline-ref"] !== void 0 || values["baseline-run"] !== void 0)) {
+     throw new Error(
+       "--reuse-if-unchanged cannot be combined with --threshold / --baseline-ref / --baseline-run: a reused verdict is frozen under its own gate, and no-regression compares against a moving baseline \u2014 neither can be recomputed. Gate on these? Run fresh (drop --reuse-if-unchanged)."
+     );
+   }
    const opts = {
      baseUrl,
      token,
@@ -4635,8 +4740,12 @@ async function main() {
      threshold,
      baselineRef: values["baseline-ref"],
      baselineRun: values["baseline-run"],
+     reuseIfUnchanged: values["reuse-if-unchanged"] ?? false,
+     reuseMaxAgeHours: reuseMaxAge,
+     reuseSameRef: values["reuse-same-ref"] ?? false,
      gitSha: values["git-sha"],
      gitRef: values["git-ref"],
+     gitRepoUrl: values["git-repo-url"],
      pollIntervalMs: (finiteNumber(values["poll-interval"]) ?? 3) * 1e3,
      timeoutMs: (finiteNumber(values.timeout) ?? 1800) * 1e3
    };
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@verica-app/cli",
-   "version": "0.1.3",
+   "version": "0.1.5",
    "private": false,
    "description": "Run a Verica eval from CI and block the merge on the result.",
    "license": "MIT",