@verica-app/cli 0.1.3 → 0.1.4
- package/CHANGELOG.md +18 -0
- package/README.md +47 -7
- package/dist/cli.js +119 -12
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -8,6 +8,24 @@ version and additive features/fixes bump the **patch** (see [Stability](./README

 ## [Unreleased]

+## [0.1.4] - 2026-06-19
+
+### Added
+
+- `--reuse-if-unchanged` (with `--reuse-max-age <hrs>` and `--reuse-same-ref`) to
+  reuse a recent **completed** run instead of executing again when the config
+  (prompt version + model + sampling + dataset snapshot + graders) is unchanged.
+  Opt-in and freshness-bounded (default 24h, max 720) — re-running stays the
+  default, since reuse can't see provider drift. On a cache hit the `--json`
+  element gains `"reused": true` and a `reusedFrom` block (additive — a normal
+  run's payload is unchanged); the API answers `200` instead of `202`. Cannot be
+  combined with `--threshold` / `--baseline-*` (the reused verdict was frozen under
+  the prior gate).
+
+- `--git-repo-url <url>` (auto-detected from `GITHUB_SERVER_URL`/`GITHUB_REPOSITORY`,
+  else GitLab's `CI_PROJECT_URL`) sent as `git.repoUrl`, so the run UI can link the
+  commit SHA to `<repoUrl>/commit/<sha>` and the branch to `<repoUrl>/tree/<ref>`.
+
 ## [0.1.3] - 2026-06-18

 ### Added
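A script that consumes the `--json` output could handle the additive `reused` / `reusedFrom` fields from the changelog entry roughly as follows. This is a minimal sketch, not the CLI's own code: `describeReuse`, the sample element, and both run IDs are hypothetical, and only the field names are taken from the diff.

```javascript
// Hypothetical helper: summarise whether a `--json` element came from a reused run.
// Field names (`reused`, `reusedFrom.runId`, `reusedFrom.finishedAt`) follow the diff;
// the function itself and the sample values are illustrative assumptions.
function describeReuse(element, now = new Date()) {
  if (!element.reused) return { reused: false };
  const from = element.reusedFrom;
  if (!from) return { reused: true };
  const ageMs = now.getTime() - new Date(from.finishedAt).getTime();
  return {
    reused: true,
    sourceRunId: from.runId,
    // Age of the reused verdict in hours, rounded to one decimal place.
    ageHours: Math.round((ageMs / 3600000) * 10) / 10
  };
}

// Hypothetical element, shaped like the reuse fields in the diff.
const sample = {
  evalId: "eval_8x2k9d",
  runId: "run_current_example",
  reused: true,
  reusedFrom: {
    runId: "run_prior_example",
    finishedAt: "2026-06-19T06:00:00.000Z",
    status: "completed",
    gitSha: null,
    gitRef: null
  }
};

console.log(describeReuse(sample, new Date("2026-06-19T12:30:00.000Z")));
```

Because the fields are additive, a consumer that ignores them keeps working; one that cares about staleness can compare `ageHours` with its own budget.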
package/README.md
CHANGED
@@ -47,12 +47,12 @@ evals:
   - id: eval_8x2k9d
     prompt: prompts/support-agent.txt
     systemPrompt: prompts/support-agent.system.txt
-    tools: prompts/support-agent.tools.json
+    tools: prompts/support-agent.tools.json # a path to a JSON file…
     sampling: { temperature: 0.2, maxTokens: 512 }
     model: gpt-4.1-mini
   - id: eval_3p1m7q
     prompt: prompts/triage.txt
-    tools:
+    tools: # …or an inline array
       - name: get_order
         description: Look up an order by id
         parameters: { type: object, properties: { id: { type: string } }, required: [id] }
@@ -77,7 +77,8 @@ default — you don't configure a URL.
 - `--junit <file>` · `--junit-mode rows|gate` — JUnit report (default `rows`).
 - `--json` — machine-readable results on stdout.
 - `--threshold <0..1>` · `--baseline-ref <ref>` · `--baseline-run <id>` — override the gate per branch.
-- `--
+- `--reuse-if-unchanged` · `--reuse-max-age <hrs>` · `--reuse-same-ref` — reuse a recent completed run instead of re-executing an unchanged config. See [Reuse](#reuse-skip-re-running-an-unchanged-config).
+- `--git-sha` / `--git-ref` / `--git-repo-url` — provenance + the repo web base that links the SHA in the run UI (all auto-detected from CI env otherwise).

 > Local dev / self-hosting only: point the CLI at another instance with `--base-url`
 > (or the `VERICA_BASE_URL` env var). Clients never need this.
@@ -108,15 +109,54 @@ wrapper (auto-unwrapped), so you can paste your real schemas as-is:

 ```json
 [
-  {
-    "
-
+  {
+    "name": "get_order",
+    "description": "Look up an order by id",
+    "parameters": {
+      "type": "object",
+      "properties": { "id": { "type": "string" } },
+      "required": ["id"]
+    }
+  },
+  {
+    "type": "function",
+    "function": { "name": "cancel_order", "description": "…", "parameters": { "type": "object" } }
+  }
 ]
 ```

-Tools are never executed — the model's
+Tools are never executed — the model's _decision_ to call one (and with which
 arguments) is what the eval grades.

+## Reuse (skip re-running an unchanged config)
+
+By default every `verica run` executes — re-running an unchanged eval is often the
+**point** in CI (it catches silent model drift and run-to-run variance, since an
+eval's output isn't a pure function of its inputs). When you'd rather save the
+tokens, opt in with `--reuse-if-unchanged`:
+
+```bash
+verica run --eval eval_8x2k9d --model gpt-4.1-mini --reuse-if-unchanged --wait
+# if the same config ran & completed in the last 24h, returns that verdict — no new run
+```
+
+"Unchanged" means the **prompt version + model + sampling + dataset snapshot +
+graders** all match a prior run (the gate is _not_ part of it — it decides the
+verdict, not the output). On a hit the CLI exits on the prior run's frozen verdict
+and the `--json` element carries `"reused": true` plus a `reusedFrom` block (the API
+also answers `200` instead of `202`).
+
+- `--reuse-max-age <hrs>` — how stale a reusable run may be (default **24**, max
+  **720**). There is no "forever": reuse can't see provider-side drift behind a
+  stable model id, so it's always bounded — that bound is your staleness budget.
+- `--reuse-same-ref` — only reuse a run on the **same git ref**. Off by default: an
+  identical config produces the same output distribution regardless of branch.
+- Only **completed** runs are reused (never a partial/failed one).
+- Incompatible with `--threshold` / `--baseline-ref` / `--baseline-run` — a reused
+  verdict was frozen under its own gate, so it can't honor a new one.
+
+Omit `--reuse-if-unchanged` (the default) any time you want a guaranteed fresh run.
+
 ## Exit codes

 `0` passed · `1` gate failed · `2` validation/transport error.
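The freshness bound described in the Reuse section (default 24 hours, capped at 720) can be pictured with the sketch below. The server-side matching logic is not part of this diff, so `isFreshEnough`, its inclusive boundary, and the sample timestamps are assumptions made purely for illustration.

```javascript
const DEFAULT_MAX_AGE_HOURS = 24; // documented default
const MAX_AGE_CAP_HOURS = 720;    // documented cap (30 days)

// Hypothetical check: is a completed run recent enough to be reused?
// Treating the boundary as inclusive is an assumption, not documented behaviour.
function isFreshEnough(finishedAt, now, maxAgeHours = DEFAULT_MAX_AGE_HOURS) {
  if (!(maxAgeHours > 0 && maxAgeHours <= MAX_AGE_CAP_HOURS)) {
    throw new RangeError("maxAgeHours must be in (0, 720]");
  }
  const ageHours = (now.getTime() - new Date(finishedAt).getTime()) / 3600000;
  return ageHours >= 0 && ageHours <= maxAgeHours;
}

const now = new Date("2026-06-19T12:00:00.000Z");
console.log(isFreshEnough("2026-06-18T13:00:00.000Z", now));     // 23h old, default bound
console.log(isFreshEnough("2026-06-18T11:00:00.000Z", now));     // 25h old, default bound
console.log(isFreshEnough("2026-06-18T11:00:00.000Z", now, 48)); // 25h old, 48h bound
```

Raising `--reuse-max-age` widens the window in which an older verdict is accepted, which is why the README calls the bound a staleness budget.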
package/dist/cli.js
CHANGED
@@ -4086,7 +4086,13 @@ var runRequestSchema = external_exports.object({
   /** Commit provenance, stamped on the run. */
   git: external_exports.object({
     sha: external_exports.string().optional(),
-    ref: external_exports.string().optional()
+    ref: external_exports.string().optional(),
+    /**
+     * The repository's web base URL (e.g. `https://github.com/acme/widgets`) so
+     * the run UI can link the SHA → `<repoUrl>/commit/<sha>` and the branch →
+     * `<repoUrl>/tree/<ref>`. The CLI auto-detects it from CI env.
+     */
+    repoUrl: external_exports.string().optional()
   }).optional(),
   /** CLI gate overrides (precedence over the eval's pass_condition). */
   gate: external_exports.object({
@@ -4096,6 +4102,23 @@ var runRequestSchema = external_exports.object({
     baselineRef: external_exports.string().optional(),
     /** Pin a specific baseline run (wins over baselineRef). */
     baselineRunId: external_exports.string().optional()
+  }).optional(),
+  /**
+   * Opt-in cost control: when the merged config (prompt version + model +
+   * sampling + dataset snapshot + graders) matches a recent COMPLETED run, the
+   * server returns that run's frozen verdict instead of executing again. NOT a
+   * default — an eval's output isn't a pure function of its config (generation +
+   * judge are non-deterministic, the model endpoint drifts), so reuse is always
+   * the caller's explicit choice and is bounded by `maxAgeHours`. Incompatible
+   * with `gate` (the cached verdict was frozen under the old gate).
+   */
+  reuse: external_exports.object({
+    /** Turn reuse on. The trigger — everything else is just tuning. */
+    ifUnchanged: external_exports.boolean().optional(),
+    /** Max age (hours) of a reusable run; server default 24, cap 720 (30d). No "infinite reuse". */
+    maxAgeHours: external_exports.number().positive().max(720).optional(),
+    /** Also require the prior run's git ref to match (per-branch isolation); default false. */
+    sameRef: external_exports.boolean().optional()
   }).optional()
 });
 var runAcceptedSchema = external_exports.object({
@@ -4104,7 +4127,23 @@ var runAcceptedSchema = external_exports.object({
   promptVersion: external_exports.number().int(),
   /** Whether a NEW prompt version was created (vs. the current one reused). */
   created: external_exports.boolean(),
-  resultUrl: external_exports.string()
+  resultUrl: external_exports.string(),
+  /**
+   * Whether this response reuses a prior run instead of executing a new one (a
+   * cache hit on `reuse.ifUnchanged`). The HTTP status reflects it too: 200 when
+   * reused, 202 when a fresh run was enqueued. Optional so an older API that
+   * predates reuse (omitting it) reads as `false`.
+   */
+  reused: external_exports.boolean().optional(),
+  /** Provenance of the reused run — present iff `reused` is true. */
+  reusedFrom: external_exports.object({
+    runId: external_exports.string(),
+    /** ISO timestamp the reused run finished — shows how stale the verdict is. */
+    finishedAt: external_exports.string(),
+    status: external_exports.literal("completed"),
+    gitSha: external_exports.string().nullable(),
+    gitRef: external_exports.string().nullable()
+  }).optional()
 });
 var runStatusSchema = external_exports.enum([
   "queued",
@@ -4409,16 +4448,21 @@ async function runCommand(opts) {
   const entries = await resolveEntries(opts);
   const git = resolveGit(opts);
   const gate = resolveGate(opts);
+  const reuse = resolveReuse(opts);
   const suites = [];
   const summaries = [];
   let worst = EXIT.pass;
   for (const entry of entries) {
     try {
-      const body = await buildRequest(entry, { samplingFile: opts.samplingFile, git, gate });
+      const body = await buildRequest(entry, { samplingFile: opts.samplingFile, git, gate, reuse });
       const accepted = await client.triggerRun(entry.id, body);
-
-
-
+      const promptNote = `prompt v${accepted.promptVersion}${accepted.created ? ", new version" : ""}`;
+      if (accepted.reused) {
+        err(`\u267B ${entry.id}: run ${accepted.runId} reused (${promptNote})`);
+        if (accepted.reusedFrom) err(` \u21B3 a completed run from ${accepted.reusedFrom.finishedAt}`);
+      } else {
+        err(`\u25B6 ${entry.id}: run ${accepted.runId} queued (${promptNote})`);
+      }
       err(` ${accepted.resultUrl}`);
       if (!opts.wait) {
         summaries.push(buildSummary(entry.id, { status: "queued", accepted }));
@@ -4435,7 +4479,7 @@ async function runCommand(opts) {
           opts.junitMode === "rows" ? rowsSuite(entry.id, await client.getResults(accepted.runId)) : gateSuite(entry.id, run)
         );
       }
-      summaries.push(buildSummary(entry.id, { status: "waited",
+      summaries.push(buildSummary(entry.id, { status: "waited", accepted, run }));
     } catch (e) {
       worst = EXIT.error;
       const message = e instanceof Error ? e.message : String(e);
@@ -4487,19 +4531,34 @@ async function buildRequest(entry, ctx) {
     ...prompt ? { prompt } : {},
     ...sampling ? { samplingParams: sampling } : {},
     ...ctx.git ? { git: ctx.git } : {},
-    ...ctx.gate ? { gate: ctx.gate } : {}
+    ...ctx.gate ? { gate: ctx.gate } : {},
+    ...ctx.reuse ? { reuse: ctx.reuse } : {}
   };
 }
 function buildSummary(evalId, outcome) {
   switch (outcome.status) {
     case "queued":
-      return {
+      return {
+        evalId,
+        runId: outcome.accepted.runId,
+        resultUrl: outcome.accepted.resultUrl,
+        ...reuseFields(outcome.accepted)
+      };
     case "waited":
-      return {
+      return {
+        evalId,
+        runId: outcome.accepted.runId,
+        ...outcome.run,
+        ...reuseFields(outcome.accepted)
+      };
     case "error":
       return { evalId, error: outcome.message };
   }
 }
+function reuseFields(accepted) {
+  if (!accepted.reused) return {};
+  return { reused: true, ...accepted.reusedFrom ? { reusedFrom: accepted.reusedFrom } : {} };
+}
 async function resolveTools(tools) {
   if (tools === void 0) return void 0;
   const raw = typeof tools === "string" ? JSON.parse(await readFile2(tools, "utf8")) : tools;
@@ -4508,8 +4567,21 @@ async function resolveTools(tools) {
 function resolveGit(opts) {
   const sha = opts.gitSha ?? process.env.GITHUB_SHA ?? process.env.CI_COMMIT_SHA;
   const ref = opts.gitRef ?? process.env.GITHUB_REF ?? process.env.CI_COMMIT_REF_NAME;
-
-
+  const repoUrl = resolveRepoUrl(opts);
+  if (!sha && !ref && !repoUrl) return void 0;
+  return {
+    ...sha ? { sha } : {},
+    ...ref ? { ref } : {},
+    ...repoUrl ? { repoUrl } : {}
+  };
+}
+function resolveRepoUrl(opts) {
+  if (opts.gitRepoUrl) return opts.gitRepoUrl.replace(/\/+$/, "");
+  const server = process.env.GITHUB_SERVER_URL;
+  const repo = process.env.GITHUB_REPOSITORY;
+  if (server && repo) return `${server.replace(/\/+$/, "")}/${repo}`;
+  if (process.env.CI_PROJECT_URL) return process.env.CI_PROJECT_URL.replace(/\/+$/, "");
+  return void 0;
 }
 function resolveGate(opts) {
   const gate = {};
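The detection order in `resolveRepoUrl` (explicit flag, then GitHub Actions variables, then GitLab's `CI_PROJECT_URL`) can be exercised in isolation with the sketch below. It is a standalone variant, not the shipped code: passing `env` as a parameter, the `trimSlashes` helper, and the `commitLink` / `branchLink` builders are assumptions added for illustration. The link shapes follow the schema comment in this diff.

```javascript
// Strip trailing slashes so joined URLs never contain "//".
const trimSlashes = (url) => url.replace(/\/+$/, "");

// Standalone variant of the diff's resolveRepoUrl. The environment is passed in
// explicitly (an assumption made here so the function is easy to exercise).
function resolveRepoUrl(opts, env) {
  if (opts.gitRepoUrl) return trimSlashes(opts.gitRepoUrl); // 1. explicit flag wins
  if (env.GITHUB_SERVER_URL && env.GITHUB_REPOSITORY) {     // 2. GitHub Actions
    return `${trimSlashes(env.GITHUB_SERVER_URL)}/${env.GITHUB_REPOSITORY}`;
  }
  if (env.CI_PROJECT_URL) return trimSlashes(env.CI_PROJECT_URL); // 3. GitLab CI
  return undefined;
}

// Hypothetical link builders mirroring the URL shapes named in the schema comment.
const commitLink = (repoUrl, sha) => `${repoUrl}/commit/${sha}`;
const branchLink = (repoUrl, ref) => `${repoUrl}/tree/${ref}`;

const repoUrl = resolveRepoUrl(
  {},
  { GITHUB_SERVER_URL: "https://github.com/", GITHUB_REPOSITORY: "acme/widgets" }
);
console.log(commitLink(repoUrl, "abc1234")); // https://github.com/acme/widgets/commit/abc1234
console.log(branchLink(repoUrl, "main"));    // https://github.com/acme/widgets/tree/main
```

Trimming trailing slashes at resolution time keeps the later `/commit/` and `/tree/` joins clean regardless of how the base URL was supplied.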
@@ -4518,6 +4590,14 @@ function resolveGate(opts) {
   if (opts.baselineRun !== void 0) gate.baselineRunId = opts.baselineRun;
   return Object.keys(gate).length > 0 ? gate : void 0;
 }
+function resolveReuse(opts) {
+  if (!opts.reuseIfUnchanged) return void 0;
+  return {
+    ifUnchanged: true,
+    ...opts.reuseMaxAgeHours !== void 0 ? { maxAgeHours: opts.reuseMaxAgeHours } : {},
+    ...opts.reuseSameRef ? { sameRef: true } : {}
+  };
+}
 function pct(n) {
   return n == null ? "?" : `${(n * 100).toFixed(1)}%`;
 }
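To see which `reuse` block ends up in the request body, the `resolveReuse` function from this hunk can be run against a few option sets. The function body below restates the diff's logic with `undefined` in place of `void 0`; the sample option objects are hypothetical.

```javascript
// Restatement of the diff's resolveReuse; behaviour is intended to be identical.
function resolveReuse(opts) {
  if (!opts.reuseIfUnchanged) return undefined;
  return {
    ifUnchanged: true,
    ...(opts.reuseMaxAgeHours !== undefined ? { maxAgeHours: opts.reuseMaxAgeHours } : {}),
    ...(opts.reuseSameRef ? { sameRef: true } : {})
  };
}

// Tuning flags alone do nothing: without the trigger no `reuse` block is sent.
console.log(resolveReuse({ reuseIfUnchanged: false, reuseMaxAgeHours: 48 })); // undefined

// The trigger alone leaves the max age to the server default.
console.log(JSON.stringify(resolveReuse({ reuseIfUnchanged: true })));
// {"ifUnchanged":true}

// All three flags set.
console.log(JSON.stringify(resolveReuse({ reuseIfUnchanged: true, reuseMaxAgeHours: 48, reuseSameRef: true })));
// {"ifUnchanged":true,"maxAgeHours":48,"sameRef":true}
```

One consequence worth noting: `--reuse-max-age` and `--reuse-same-ref` are silently dropped from the request when `--reuse-if-unchanged` is absent.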
@@ -4562,8 +4642,16 @@ Options:
   --threshold <0..1>      Override the gate's minimum pass rate.
   --baseline-ref <ref>    No-regression baseline = last run on this git ref.
   --baseline-run <id>     No-regression baseline = this specific run.
+  --reuse-if-unchanged    Reuse a recent completed run instead of executing again
+                          when the config (prompt + model + sampling + dataset +
+                          graders) is unchanged. Off by default. Incompatible with
+                          --threshold / --baseline-*.
+  --reuse-max-age <hrs>   Max age of a reusable run (default 24, max 720).
+  --reuse-same-ref        Only reuse a run on the same git ref (default: any ref).
   --git-sha <sha>         Commit SHA (else auto-detected from CI env).
   --git-ref <ref>         Git ref (else auto-detected from CI env).
+  --git-repo-url <url>    Repo web base for the SHA link in the run UI (e.g.
+                          https://github.com/acme/widgets). Auto-detected from CI env.
   --base-url <url>        Override the API base URL (dev/self-host only).
   --poll-interval <sec>   Initial poll interval (default 3).
   --timeout <sec>         Max wait (default 1800).
@@ -4598,8 +4686,12 @@ async function main() {
       threshold: { type: "string" },
       "baseline-ref": { type: "string" },
       "baseline-run": { type: "string" },
+      "reuse-if-unchanged": { type: "boolean", default: false },
+      "reuse-max-age": { type: "string" },
+      "reuse-same-ref": { type: "boolean", default: false },
       "git-sha": { type: "string" },
       "git-ref": { type: "string" },
+      "git-repo-url": { type: "string" },
       "base-url": { type: "string" },
       "poll-interval": { type: "string" },
       timeout: { type: "string" },
@@ -4618,6 +4710,17 @@ async function main() {
   if (values.threshold !== void 0 && threshold === void 0) {
     throw new Error(`--threshold must be a number between 0 and 1 (got "${values.threshold}").`);
   }
+  const reuseMaxAge = finiteNumber(values["reuse-max-age"]);
+  if (values["reuse-max-age"] !== void 0 && (reuseMaxAge === void 0 || reuseMaxAge <= 0 || reuseMaxAge > 720)) {
+    throw new Error(
+      `--reuse-max-age must be a number of hours in (0, 720] (got "${values["reuse-max-age"]}").`
+    );
+  }
+  if (values["reuse-if-unchanged"] && (threshold !== void 0 || values["baseline-ref"] !== void 0 || values["baseline-run"] !== void 0)) {
+    throw new Error(
+      "--reuse-if-unchanged cannot be combined with --threshold / --baseline-ref / --baseline-run (a reused verdict was frozen under the prior gate)."
+    );
+  }
   const opts = {
     baseUrl,
     token,
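The two new validation rules in this hunk (the `(0, 720]` range and the mutual exclusion with the gate flags) can be tried out with the simplified sketch below. The body of `finiteNumber` is not shown in this diff, so the version here is an assumed stand-in; `validateReuseFlags` is a hypothetical wrapper that checks the raw `threshold` value rather than the parsed one.

```javascript
// Assumed stand-in for the CLI's finiteNumber helper, whose body is not in this diff.
function finiteNumber(value) {
  if (value === undefined || value.trim() === "") return undefined;
  const n = Number(value);
  return Number.isFinite(n) ? n : undefined;
}

// Hypothetical wrapper around the two checks added in this hunk.
// `values` mimics the parsed flag map: string flags are strings, booleans are booleans.
function validateReuseFlags(values) {
  const raw = values["reuse-max-age"];
  const reuseMaxAge = finiteNumber(raw);
  if (raw !== undefined && (reuseMaxAge === undefined || reuseMaxAge <= 0 || reuseMaxAge > 720)) {
    throw new Error(`--reuse-max-age must be a number of hours in (0, 720] (got "${raw}").`);
  }
  const gateFlags = ["threshold", "baseline-ref", "baseline-run"];
  if (values["reuse-if-unchanged"] && gateFlags.some((flag) => values[flag] !== undefined)) {
    throw new Error(
      "--reuse-if-unchanged cannot be combined with --threshold / --baseline-ref / --baseline-run."
    );
  }
  return reuseMaxAge;
}

console.log(validateReuseFlags({ "reuse-if-unchanged": true, "reuse-max-age": "48" })); // 48

try {
  validateReuseFlags({ "reuse-if-unchanged": true, threshold: "0.9" });
} catch (e) {
  console.log(e.message);
}
```

Failing fast in the CLI means an invalid combination is rejected before any request is sent.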
@@ -4635,8 +4738,12 @@ async function main() {
     threshold,
     baselineRef: values["baseline-ref"],
     baselineRun: values["baseline-run"],
+    reuseIfUnchanged: values["reuse-if-unchanged"] ?? false,
+    reuseMaxAgeHours: reuseMaxAge,
+    reuseSameRef: values["reuse-same-ref"] ?? false,
     gitSha: values["git-sha"],
     gitRef: values["git-ref"],
+    gitRepoUrl: values["git-repo-url"],
     pollIntervalMs: (finiteNumber(values["poll-interval"]) ?? 3) * 1e3,
     timeoutMs: (finiteNumber(values.timeout) ?? 1800) * 1e3
   };