@alis-build/harness-eval 0.1.0
- package/LICENSE +201 -0
- package/README.md +700 -0
- package/dist/adapters/claude-code/index.d.ts +3 -0
- package/dist/adapters/claude-code/index.js +2 -0
- package/dist/build-DsVJ_UeU.js +1396 -0
- package/dist/build-DsVJ_UeU.js.map +1 -0
- package/dist/cardinality-DlE44e-4.js +31 -0
- package/dist/cardinality-DlE44e-4.js.map +1 -0
- package/dist/claude-code-ycT0JQZF.js +563 -0
- package/dist/claude-code-ycT0JQZF.js.map +1 -0
- package/dist/cli/bin.d.ts +1 -0
- package/dist/cli/bin.js +623 -0
- package/dist/cli/bin.js.map +1 -0
- package/dist/config/loader.d.ts +2 -0
- package/dist/config/loader.js +2 -0
- package/dist/index-6Z17eKZx.d.ts +72 -0
- package/dist/index.d.ts +725 -0
- package/dist/index.js +5 -0
- package/dist/loader-BCnFJ8rm.js +717 -0
- package/dist/loader-BCnFJ8rm.js.map +1 -0
- package/dist/loader-DTvoVfN0.d.ts +33 -0
- package/dist/rolldown-runtime-D7D4PA-g.js +13 -0
- package/dist/runner/suite.d.ts +2 -0
- package/dist/runner/suite.js +2 -0
- package/dist/suite-BoOvK_lq.d.ts +7 -0
- package/dist/suite-chj0j22j.js +684 -0
- package/dist/suite-chj0j22j.js.map +1 -0
- package/dist/types-B9H4IZtA.d.ts +305 -0
- package/dist/types-BQol062t.d.ts +292 -0
- package/package.json +74 -0
- package/schemas/eval-interchange-agent-trace.schema.json +322 -0
- package/schemas/eval-interchange-proto-instance.schema.json +106 -0
- package/schemas/eval-interchange.schema.json +140 -0
- package/schemas/eval-run-envelope.schema.json +2195 -0
- package/schemas/trajectory-view.schema.json +441 -0
package/README.md (ADDED)
@@ -0,0 +1,700 @@

# @alis-build/harness-eval

Statistical eval framework for **AI coding agent harnesses** (Claude Code today; Cursor and Gemini planned). Run real headless harness sessions, capture tool trajectories, and score behavior and outcomes across many repetitions and configurations.

**Use it to answer:** “When users ask X, does this harness actually call our MCP tools, reliably, in this plugin/model setup?”

---

## Requirements

- Node.js ≥ 22.12 required; Node 24 LTS recommended for development and CI
- `claude` on `PATH` (for the Claude Code adapter)
- Authentication for Claude Code:
  - **Option A:** `claude login` and set `isolateConfig: false` in your suite (uses your normal plugins/MCP setup)
  - **Option B:** `ANTHROPIC_API_KEY` with isolated config per run (default adapter behavior)

---

## Install

**Consumers** can run via npx (no global install required):

```bash
npx @alis-build/harness-eval --help
```

Or install as a project dependency:

```bash
npm install @alis-build/harness-eval
npx @alis-build/harness-eval run examples/basic.yaml --output report.json
```

The npm package name is `@alis-build/harness-eval`; the CLI binary is `harness-eval`. With a single bin entry, `npx @alis-build/harness-eval <command>` invokes it directly.

### Development (clone & build)

Contributors working from a git checkout:

```bash
pnpm install
pnpm run build
node dist/cli/bin.js --help
```

---

## Quick start

### 1. Write a suite

Suites are YAML files. Committed examples:

- [`examples/basic.yaml`](examples/basic.yaml): smoke test using the built-in `Read` tool on this repo's README
- [`examples/matrix.yaml`](examples/matrix.yaml): same idea with a model matrix (sonnet vs opus)
- [`examples/multi-file/`](examples/multi-file/): directory layout with `suite.yaml` plus cases under `cases/`
- [`examples/grading.yaml`](examples/grading.yaml): standalone judge config for `harness-eval grade`

```yaml
adapter: claude-code

defaultConfig:
  model: claude-sonnet-4-6
  timeoutMs: 120000
  cwd: ..
  claudeCode:
    isolateConfig: false # use your logged-in Claude Code config
    permissionMode: bypassPermissions
    allowedTools:
      - Read

matrix:
  - label: sonnet
    config: {}

cases:
  - id: summarize-readme
    prompt: "Read README.md and summarize what harness-eval does in one or two sentences."
    repetitions: 3

    # Behavioral checks (deterministic, on tool trajectory)
    assertions:
      - called: Read
        threshold: 0.8
      - not:
          responded_without_tool_calls: true

    # Outcome checks (LLM judge via `harness-eval grade`)
    expectations:
      - "The response describes an eval framework for AI coding agent harnesses"
      - "The summary is grounded in README content, not a generic refusal"
```

Generic fields (`model`, `cwd`, `timeoutMs`, `env`) sit at the top level. Claude-specific options go under `claudeCode`.

### 2. Run behavioral eval

```bash
npx @alis-build/harness-eval run examples/basic.yaml --output report.json --max-concurrent 1 --format console
```

This spawns Claude Code headless for each (case × matrix cell × repetition), evaluates **assertions** on the captured trajectory, and prints pass rates.

**Progress (stderr):** one line per repetition with ETA by default; use `--quiet` for dots or `--verbose` for tool/assertion detail.

Exit code `0` = all cells passed all assertion thresholds.

### 3. Grade outcomes (optional)

Judge model, timeout, env, and `claudeCode` flags live in a separate **`grading.yaml`** (not in the suite file). See [`examples/grading.yaml`](examples/grading.yaml).

```bash
npx @alis-build/harness-eval grade report.json --config examples/grading.yaml --output grading.json --max-concurrent 1 --format console
```

Runs a separate Claude subprocess as **judge** against the `expectations` in your suite (copied into `report.json`). Produces per-expectation PASS/FAIL with cited evidence.

Exit codes: `0` = all graded expectations passed; `1` = at least one failed; `2` = no expectations or no gradable repetitions.

---

## Data contracts & schemas

harness-eval separates **vendor output** from **eval interchange**. Use the types below when wiring CI, a database, or an external judge, rather than Claude `stream-json` or OTLP as your primary record.

### Layering

| Layer | Type | Where | Use for |
| --- | --- | --- | --- |
| Vendor stream | `StreamEvent` | `src/types/stream.ts` | Claude `stream-json` debug only |
| Harness session | **`TrajectoryView`** | `src/types/trajectory.ts` | Assertions, trajectory queries, judge input |
| Run report | **`SuiteReport`** | `report.json` from `run` | Runner output; full trajectories + assertion stats |
| Eval record | **`EvalRunEnvelope`** | `buildEvalRunEnvelope()` | CI gates, APIs, DB storage |
| Observability | OTLP | `--otel-output` | Tempo / Jaeger side export |

```
Suite YAML → run → TrajectoryView → SuiteReport (report.json)
                                        ↓ optional grade / external judge
                                    EvalRunEnvelope → DB / API / CI gate
```

### `TrajectoryView`

Cross-harness normalized session. Every adapter maps vendor output into this shape.

| Field | Meaning |
| --- | --- |
| `meta` | Session id, model, cwd, available tools, MCP server status |
| `toolCalls` | Every tool call in emission order (`name`, `args`, `result`, `turnIndex`, `callIndex`) |
| `turns` | Per-turn assistant text and tool calls |
| `finalResponse` | Concatenated assistant text (for `response_contains` and judges) |
| `usage` | Tokens, cost, duration, turn count |
| `success` | Whether the harness reported success |

Tool names follow the harness format (e.g. `mcp__plugin_alis-build_api__SearchSkills`). Assertions use `turnIndex` / `callIndex` for ordering, not wall-clock time.

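An ordering check over `toolCalls` might look like the sketch below. The `ToolCall` interface is a local stand-in that models only the fields named in the table, and `calledBefore` is a hypothetical helper, not part of the published API. It assumes that comparing `(turnIndex, callIndex)` pairs reflects emission order.

```typescript
// Local stand-in for the tool-call shape (assumption: only the fields
// listed in the TrajectoryView table are modeled here).
interface ToolCall {
  name: string;
  args: unknown;
  result?: unknown;
  turnIndex: number;
  callIndex: number;
}

// Order two calls by turn first, then by call position; no timestamps involved.
function compareCalls(a: ToolCall, b: ToolCall): number {
  return a.turnIndex - b.turnIndex || a.callIndex - b.callIndex;
}

// Hypothetical helper: true when the first call to `first` precedes the
// first call to `second`. Returns false if either tool was never called.
function calledBefore(calls: ToolCall[], first: string, second: string): boolean {
  const ordered = [...calls].sort(compareCalls);
  const i = ordered.findIndex((c) => c.name === first);
  const j = ordered.findIndex((c) => c.name === second);
  return i !== -1 && j !== -1 && i < j;
}
```

Sorting a copy keeps the input untouched, so the same array can be reused for other checks.
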
### `SuiteReport` (`report.json`)

Produced by `harness-eval run`. Contains everything from the run:

- `cells[]`: one row per (test case × matrix cell)
- `cells[].repetitions[]`: each harness invocation
- `cells[].repetitions[].adapterResult.view`: **`TrajectoryView`** when the harness succeeded
- `cells[].repetitions[].assertionResults`: per-rep behavioral assertion tree
- `cells[].assertionStats`: pass rates across repetitions
- `cells[].expectations`: natural-language outcome checks (copied from suite for judges)

Gate behavioral eval on `cells[].passed` or on assertion stats. This file is enough to hand off to a custom judge without re-running the harness.

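A CI gate on `cells[].passed` could be sketched as follows. The `Report` and `ReportCell` interfaces are minimal local stand-ins that model only the fields listed above, and `summarizeGate` is a hypothetical helper rather than a package export.

```typescript
// Minimal local view of report.json (assumption: only the fields used
// below are modeled; the real SuiteReport carries much more).
interface ReportCell {
  passed: boolean;
  repetitions: { adapterResult?: { view?: unknown } }[];
}

interface Report {
  cells: ReportCell[];
}

interface GateSummary {
  ok: boolean; // true when every cell passed
  failedCells: number;
  gradableRepetitions: number; // repetitions that produced a trajectory view
}

// Hypothetical helper: collapse a report into a pass/fail gate decision.
function summarizeGate(report: Report): GateSummary {
  const failedCells = report.cells.filter((cell) => !cell.passed).length;
  const gradableRepetitions = report.cells
    .flatMap((cell) => cell.repetitions)
    .filter((rep) => rep.adapterResult?.view !== undefined).length;
  return { ok: failedCells === 0, failedCells, gradableRepetitions };
}
```

In a pipeline, the report would typically come from parsing `report.json`, and `ok` would decide the job's exit status.
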
### `EvalRunEnvelope`

Versioned document for **storage and interchange** (`schemaVersion` `1.0`). Build it from a report (and optional grading):

```typescript
import {
  buildEvalRunEnvelope,
  buildEvalRunEnvelopeFromFiles,
} from "@alis-build/harness-eval";

const envelope = buildEvalRunEnvelope(report, {
  grading, // optional: from gradeReport()
  suite: { uri: "./examples/basic.yaml" },
  provenance: { git: { commit: process.env.GITHUB_SHA } },
});

// Or from disk after a CLI run (named separately so `envelope` is not redeclared):
const envelopeFromFiles = await buildEvalRunEnvelopeFromFiles("report.json", {
  gradingPath: "grading.json",
  suitePath: "examples/basic.yaml",
});
```

| Field | Meaning |
| --- | --- |
| `summary.behavioralPass` | All cells passed assertion thresholds |
| `summary.outcomePass` | All graded expectations passed (when outcome layer present) |
| `cells[].repetitions[]` | Unit of work for judges: trajectory, assertion results, optional `outcomeGrades` |
| `cells[].repetitions[].artifacts.transcript` | Text for LLM judges (`trajectoryToTranscript`) |
| `cells[].repetitions[].externalScores` | Attach scores from LangSmith, Braintrust, etc. |

**Full reference:** [docs/eval-record.md](docs/eval-record.md)

### TypeScript types & Zod schemas

| Artifact | Location |
| --- | --- |
| TypeScript interfaces | `@alis-build/harness-eval`: `TrajectoryView`, `EvalRunEnvelope`, `AssertionResult`, … |
| Zod schemas (runtime validation) | `src/schemas/` in repo only (not published on npm) |
| JSON Schema (DB / OpenAPI) | `schemas/*.schema.json` (shipped in the npm package) |

Zod is the **source of truth** for JSON Schema. Each field has `title` and `description` for downstream tooling.

```bash
pnpm run generate-schemas # Zod → schemas/*.schema.json
```

Published JSON Schema files (Draft 2020-12):

- `schemas/trajectory-view.schema.json`: `TrajectoryView` + `schemaVersion`
- `schemas/eval-run-envelope.schema.json`: full run envelope

Canonical `$id` URLs (for validators and `$ref`):

- `https://raw.githubusercontent.com/alis-build/harness-eval-ts/main/schemas/trajectory-view.schema.json`
- `https://raw.githubusercontent.com/alis-build/harness-eval-ts/main/schemas/eval-run-envelope.schema.json`

Source: [github.com/alis-build/harness-eval-ts](https://github.com/alis-build/harness-eval-ts)

Runtime validation (repo development or clone):

```typescript
import { evalRunEnvelopeSchema } from "./src/schemas/eval-run-envelope";
evalRunEnvelopeSchema.parse(envelope);
```

npm consumers validate with the published JSON Schema files or by cloning the repo for Zod imports.

Schema generation uses [Zod 4 `z.toJSONSchema()`](https://zod.dev/json-schema).

---

## External eval frameworks & custom judges

harness-eval is intentionally split: **run the harness and score behavior deterministically**; **outcome quality can live anywhere**.

You do not need `harness-eval grade` if you already have LangSmith, Braintrust, OpenAI Evals, a Python judge, or an internal rubric service.

### What harness-eval provides vs what you can replace

| Concern | harness-eval | External framework / custom judge |
| --- | --- | --- |
| Headless harness runs | `run` / `runSuite` | n/a |
| Tool-call behavior | Assertions on `TrajectoryView` | Optional: re-implement on `toolCalls` |
| Outcome / rubric scoring | `grade` (Claude judge) | Your judge, eval platform, or human review |
| Storage contract | `EvalRunEnvelope` | Same envelope; attach `externalScores` |

### Pattern 1: Behavioral only (no LLM judge)

Run the suite, gate CI on behavioral pass rates, skip outcome grading entirely.

```bash
npx @alis-build/harness-eval run examples/basic.yaml --output report.json
# exit 0 ⇒ all assertion thresholds met
```

Omit `expectations` from the suite, or ignore them. Your pipeline only checks `report.json` assertion stats.

### Pattern 2: Custom judge in TypeScript (`gradeFn`)

Keep the harness-eval grading **workflow** (concurrency, report shape) but swap the judge implementation:

```typescript
import { gradeReport, type GraderFn } from "@alis-build/harness-eval";

const myJudge: GraderFn = async ({ prompt, transcript, expectations }) => {
  // Call your API, rubric service, or local model.
  // `myRubricService` is a placeholder for your own scorer; it is assumed to
  // return one { text, passed, evidence } entry per expectation.
  const results = await myRubricService.score(transcript, expectations);
  const passed = results.filter((r) => r.passed).length;
  return {
    expectations: results,
    // Derive the summary from the results instead of hard-coding counts.
    summary: {
      passed,
      failed: results.length - passed,
      total: results.length,
      passRate: results.length === 0 ? 0 : passed / results.length,
    },
  };
};

const grading = await gradeReport(report, { gradeFn: myJudge });
```

Output is the same `SuiteGradingReport` shape as the built-in Claude grader. Merge it into `EvalRunEnvelope` via `buildEvalRunEnvelope(report, { grading })`.

### Pattern 3: Separate judge pipeline (any language)

1. `npx @alis-build/harness-eval run … --output report.json`
2. Your service reads each repetition:

   ```typescript
   import { trajectoryToTranscript } from "@alis-build/harness-eval";

   // Minimal handoff fields from report.json
   const cell = report.cells[0];
   const rep = cell.repetitions[0];
   const view = rep.adapterResult?.view;
   const prompt = cell.prompt;
   const expectations = cell.expectations ?? [];

   // Prefer transcript for LLM judges
   const transcript = view ? trajectoryToTranscript(view, prompt ?? "") : null;

   // Or use structured toolCalls for deterministic checks
   const toolNames = view?.toolCalls.map((t) => t.name) ?? [];
   ```

3. Write scores to your DB or a sidecar JSON.
4. Optionally merge into an envelope for a unified eval store:

   ```typescript
   const envelope = buildEvalRunEnvelope(report, { grading });
   // Attach platform scores per repetition (not a buildEvalRunEnvelope option today):
   envelope.cells[0].repetitions[0].externalScores = [
     { source: "langsmith", metric: "correctness", value: 0.92 },
   ];
   ```

**Judges should use `trajectoryToTranscript(view, prompt)` or structured `toolCalls`**, not raw Claude `stream-json` (Claude-only and verbose).

### Pattern 4: LangSmith, Braintrust, OpenAI Evals, etc.

Typical flow:

1. **Generate trajectories** with harness-eval (real harness, real MCP tools, statistical repetitions).
2. **Upload or reference** each repetition in your platform:
   - **Input:** `prompt`, `artifacts.transcript` (from envelope), or `TrajectoryView`
   - **Metadata:** `caseId`, `cellLabel`, `axes`, `runId`, git/CI provenance from `EvalRunEnvelope`
3. **Run the platform's evaluators** (LLM judges, human review, custom scorers).
4. **Attach scores** back via `externalScores` on `EvalRepetition` when building the envelope, or store platform run IDs in `provenance`.

harness-eval does not need to own scoring. It owns **faithful harness reproduction** and a **stable trajectory contract**.

### Pattern 5: Behavioral here, outcome elsewhere (recommended split)

```bash
# CI job 1: behavioral gate (fast, deterministic)
npx @alis-build/harness-eval run suite.yaml --output report.json

# CI job 2: your outcome eval (async, platform-specific)
node scripts/push-to-langsmith.mjs report.json
# or: python scripts/run_braintrust_eval.py report.json
```

- Job 1 fails on tool-selection regressions immediately.
- Job 2 scores answer quality without blocking on harness spawn time.

Both can converge on one `EvalRunEnvelope` in your database for dashboards.

### Injecting a custom `GraderInput`

Built-in grader input shape:

```typescript
interface GraderInput {
  prompt: string;
  transcript: string; // from trajectoryToTranscript()
  expectations: string[]; // from suite / report
}
```

Built-in output shape (`outcomeGrades` in the envelope):

```typescript
interface GradedExpectation {
  text: string;
  passed: boolean;
  evidence: string;
}
```

Map your framework's output into these shapes (or use `externalScores`) so CI and DB layers stay consistent.

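When mapping an external judge's output into `GradedExpectation` entries, the summary counts and a `grade`-style exit code can be derived rather than hard-coded. The sketch below is illustrative: `summarizeGrades` and `gradeExitCode` are hypothetical helpers, and the `GradeSummary` field names are taken from the `summary` object shown in the custom-judge example, not from a confirmed type export.

```typescript
interface GradedExpectation {
  text: string;
  passed: boolean;
  evidence: string;
}

// Assumed summary shape: { passed, failed, total, passRate }.
interface GradeSummary {
  passed: number;
  failed: number;
  total: number;
  passRate: number;
}

// Count passes and failures; an empty list yields a pass rate of 0.
function summarizeGrades(expectations: GradedExpectation[]): GradeSummary {
  const total = expectations.length;
  const passed = expectations.filter((e) => e.passed).length;
  return { passed, failed: total - passed, total, passRate: total === 0 ? 0 : passed / total };
}

// Mirror the documented `grade` exit codes:
// 0 = all passed, 1 = at least one failed, 2 = nothing to grade.
function gradeExitCode(summary: GradeSummary): 0 | 1 | 2 {
  if (summary.total === 0) return 2;
  return summary.failed === 0 ? 0 : 1;
}
```

Keeping the exit-code mapping identical to the CLI's means a custom judge can slot into an existing CI step without changing how the step is interpreted.
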
---

## Two layers of evaluation

| Layer | Command | What it checks | Mechanism |
| --- | --- | --- | --- |
| **Behavior** | `run` | Tool calls, order, args, efficiency | Deterministic assertions on `TrajectoryView` |
| **Outcome** | `grade` | Answer quality, grounding, completeness | LLM judge on transcript + `finalResponse` |

Both layers use statistical thresholds: a case runs `repetitions` times per matrix cell, and each assertion/expectation has a pass-rate threshold (default `1.0`).

---

## CLI reference

```bash
npx @alis-build/harness-eval run <suite.yaml> [options]
npx @alis-build/harness-eval grade <report.json> [options]
npx @alis-build/harness-eval envelope <report.json> [options]
npx @alis-build/harness-eval format <report.json> [options]
npx @alis-build/harness-eval --help
```

### `run`

| Option | Description |
| --- | --- |
| `--output <path>` | Write full `SuiteReport` JSON |
| `--otel-output <dir>` | Write OTLP trace JSON per repetition (optional) |
| `--format console\|markdown\|json` | Report format (default: `console`) |
| `--baseline <path>` | Compare against a previous report |
| `--max-concurrent <n>` | Parallel harness processes (default: 4) |
| `--adapter <id>` | Harness adapter (default: `claude-code`) |
| `--quiet` | Progress: dots only (`.` ok, `x` fail) |
| `--verbose` | Progress: per-rep tool counts and assertion summary |
| `--progress <mode>` | `default` \| `quiet` \| `verbose` \| `json` (ndjson on stderr; disables color) |
| `--color` / `--no-color` | Force or disable ANSI colors (auto when stderr is a TTY; `NO_COLOR` / `FORCE_COLOR` env) |

### `grade`

Uses a standalone **`grading.yaml`** for judge model, timeout, env, and `claudeCode` flags (Option B: separate from the suite file).

```yaml
# examples/grading.yaml
judge:
  adapter: claude-code
  model: claude-sonnet-4-6
  timeoutMs: 300000
  maxConcurrent: 1
  claudeCode:
    permissionMode: bypassPermissions
```

```bash
npx @alis-build/harness-eval grade report.json --config examples/grading.yaml --output grading.json
```

| Option | Description |
| --- | --- |
| `--config <path>` | Grading YAML (`judge` block): model, env, timeout, `claudeCode` |
| `--output <path>` | Write grading JSON |
| `--expectations <path>` | Sidecar YAML/JSON if report lacks expectations |
| `--format console\|json` | Output format |
| `--model <id>` | Overrides `judge.model` in config |
| `--binary <path>` | Overrides `judge.claudeCode.binary` |
| `--timeout-ms <n>` | Overrides `judge.timeoutMs` |
| `--max-concurrent <n>` | Overrides `judge.maxConcurrent` (default: 2 if unset) |
| `--quiet` / `--verbose` / `--progress` | Same progress modes as `run` (including `--color` / `--no-color`) |

CLI flags override the YAML file. Expectations still come from `report.json` (copied from the suite at `run` time) unless `--expectations` is set. The grading report may include `gradingConfigPath` when `--config` was used.

The built-in judge spawns Claude with **`--output-format json`** (single-shot response, not `stream-json`). It applies **safe defaults** so Claude Code does not reload plugins/MCP during grading: `maxTurns: 1`, `bare: true`, `disableSlashCommands: true`, `noSessionPersistence: true`, plus `permissionMode: bypassPermissions` on the judge subprocess. Override in `judge.claudeCode` only if you need a different judge setup.

Exit codes: `0` = all expectations passed; `1` = failures; `2` = no expectations or no gradable repetitions (harness failures without trajectories are skipped).

This command is optional; you can use [External eval frameworks & custom judges](#external-eval-frameworks--custom-judges) instead.

### `envelope`

Build the versioned **`EvalRunEnvelope`** (primary eval interchange document) from a harness `report.json`. Optionally merge outcome grades and emit platform-compatible projections.

```bash
npx @alis-build/harness-eval envelope report.json --suite examples/basic.yaml --grading grading.json --output envelope.json

# Interchange projections
npx @alis-build/harness-eval envelope report.json --projection trajectory --output trajectory.jsonl
npx @alis-build/harness-eval envelope report.json --projection instances --output instances.json
npx @alis-build/harness-eval envelope report.json --projection agent-trace --output agent-traces.json
```

| Option | Description |
| --- | --- |
| `--output <path>` | Write output (stdout if omitted) |
| `--grading <path>` | Merge `grading.json` outcome scores into the envelope |
| `--suite <path>` | Suite YAML for provenance (`uri`, `contentHash`) |
| `--projection envelope\|trajectory\|instances\|agent-trace` | Output shape (default: `envelope`) |
| `--include-raw-stream-events` | Include adapter raw stream events in repetition artifacts |
| `--no-transcript` | Omit judge transcript artifacts |

Exit codes: `0` = envelope built and behavioral pass; `1` = built but behavioral failures; `2` = usage or file errors.

### `format`

Re-render an existing `report.json` without re-running the harness.

---

## Output artifacts

After a typical run:

| File | Produced by | Purpose |
| --- | --- | --- |
| **`suite.yaml`** | You | Test spec: prompts, matrix, assertions, expectations |
| **`report.json`** | `run --output` | `SuiteReport`: trajectories, assertion stats, per-rep details |
| **`grading.json`** | `grade --output` | Outcome scores with evidence (optional; or use external judge) |
| **`envelope.json`** | `envelope --output` | Versioned `EvalRunEnvelope` for DB / API / eval platforms |
| **`trajectory.jsonl`** | `envelope --projection trajectory` | Tabular interchange rows (JSONL) |
| **`schemas/*.schema.json`** | `pnpm run generate-schemas` | JSON Schema for validators and OpenAPI |
| **`otel-traces/*.otlp.json`** | `run --otel-output` | OTLP for trace UIs (optional; not the eval contract) |

Write artifact paths with `--output` (and `--otel-output` for traces) wherever your pipeline or CI expects them.

See [Data contracts & schemas](#data-contracts--schemas) for type details.

---

## Suite concepts

### Test case

One prompt + assertions + optional expectations, run N times per matrix cell.

### Matrix cell

One configuration point (plugin version, model, tool allowlist, etc.). Each (case × cell) is one row in the report.

### Config merge order

Later wins: `defaultConfig` → `case.config` → `cell.config`.

List fields like `allowedTools` and `pluginDirs` are **replaced**, not merged.

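The merge rule can be illustrated with a small sketch. `mergeLayers` is a hypothetical function written for this example and is not the package's loader. It assumes nested objects such as `claudeCode` are merged key by key, which the text above does not state; only "later wins" and "lists are replaced" come from the description.

```typescript
type ConfigValue =
  | string
  | number
  | boolean
  | null
  | ConfigValue[]
  | { [key: string]: ConfigValue };
type Config = { [key: string]: ConfigValue };

function isPlainObject(value: ConfigValue | undefined): value is Config {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Later layers win. Arrays are replaced wholesale; nested objects are
// merged recursively (an assumption of this sketch).
function mergeLayers(...layers: Config[]): Config {
  const out: Config = {};
  for (const layer of layers) {
    for (const [key, value] of Object.entries(layer)) {
      const current = out[key];
      out[key] =
        isPlainObject(current) && isPlainObject(value)
          ? mergeLayers(current, value)
          : value;
    }
  }
  return out;
}
```

Called as `mergeLayers(defaultConfig, caseConfig, cellConfig)`, a cell that sets `allowedTools: [Read]` ends up with exactly that list, regardless of what earlier layers allowed.
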
### Thresholds

```yaml
assertions:
  - called: mcp__api__search_skills
    threshold: 0.8 # pass if ≥80% of reps call the tool
```

Default threshold is `1.0` (every evaluated rep must pass). Reps where the harness crashes are excluded from the denominator and counted as `adapterErrors`.

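The pass-rate arithmetic can be sketched like this. `scoreAssertion` and the `RepOutcome` labels are illustrative names, not the package's internals. The sketch also assumes that an assertion with zero evaluated repetitions does not meet its threshold, which is not specified above.

```typescript
// Outcome of one repetition for a single assertion (illustrative labels).
type RepOutcome = "pass" | "fail" | "adapterError";

interface AssertionStat {
  evaluated: number; // repetitions that produced a trajectory
  passed: number;
  adapterErrors: number; // harness crashes, excluded from the denominator
  passRate: number;
  meetsThreshold: boolean;
}

function scoreAssertion(outcomes: RepOutcome[], threshold = 1.0): AssertionStat {
  const adapterErrors = outcomes.filter((o) => o === "adapterError").length;
  const passed = outcomes.filter((o) => o === "pass").length;
  const evaluated = outcomes.length - adapterErrors;
  // Assumption: with nothing evaluated, the pass rate is 0 and the check fails.
  const passRate = evaluated === 0 ? 0 : passed / evaluated;
  return {
    evaluated,
    passed,
    adapterErrors,
    passRate,
    meetsThreshold: evaluated > 0 && passRate >= threshold,
  };
}
```

For example, four passes and one failure give a pass rate of 4/5, which meets a `0.8` threshold but not the default `1.0`.
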
**Full reference:** [docs/assertions.md](docs/assertions.md) covers all assertion kinds, predicates, the statistical model, and how to add new assertion types or harness adapters.

---

## Adding harness adapters

Built-in adapters register at module load. Today only `claude-code` ships; additional harnesses (Codex, Gemini CLI, Antigravity CLI) plug in via the same pattern:

1. Implement `HarnessAdapter` under `src/adapters/<id>/` with a `run(config)` that returns a `TrajectoryView`.
2. Add a nested config key on `SuiteConfig` (e.g. `codex: { ... }`) for harness-specific options.
3. Call `registerAdapter("<id>", adapter)` at startup (built-in registration in `src/adapters/registry.ts`, or from plugin bootstrap code).
4. Set `adapter: <id>` in suite YAML; the runner resolves via `getAdapter(id)`.

```typescript
import {
  registerAdapter,
  listAdapters,
  getAdapter,
} from "@alis-build/harness-eval";

registerAdapter("my-harness", myAdapter);
console.log(listAdapters()); // ["claude-code", "my-harness"]
```

Duplicate registration throws so accidental overrides fail fast during startup or tests.

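The fail-fast registration behavior can be pictured with a standalone sketch. `AdapterRegistry` and `SketchAdapter` are hypothetical names for illustration; this is not the code in `src/adapters/registry.ts`. Throwing on an unknown id in `get` is an assumption of the sketch.

```typescript
// Illustrative adapter shape: `run` resolves to some normalized view.
interface SketchAdapter {
  run(config: Record<string, unknown>): Promise<unknown>;
}

class AdapterRegistry {
  private readonly adapters = new Map<string, SketchAdapter>();

  // Reject duplicates so an accidental override surfaces immediately.
  register(id: string, adapter: SketchAdapter): void {
    if (this.adapters.has(id)) {
      throw new Error(`Adapter "${id}" is already registered`);
    }
    this.adapters.set(id, adapter);
  }

  // Assumption: an unknown id is an error rather than `undefined`.
  get(id: string): SketchAdapter {
    const adapter = this.adapters.get(id);
    if (!adapter) {
      throw new Error(`Unknown adapter "${id}"`);
    }
    return adapter;
  }

  // Ids in registration order (Map preserves insertion order).
  list(): string[] {
    return [...this.adapters.keys()];
  }
}
```

A registry instance created per test keeps registrations from leaking between test files, which a module-level singleton would not.
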
559
|
+
---
|
|
560
|
+
|
|
561
|
+
## Claude Code adapter
|
|
562
|
+
|
|
563
|
+
Nested under `claudeCode` in YAML (or flat in programmatic config). Maps to [Claude Code CLI flags](https://code.claude.com/docs/en/cli-reference#cli-flags).
|
|
564
|
+
|
|
565
|
+
The adapter always passes `-p`, `--output-format stream-json`, and `--verbose`.
|
|
566
|
+
|
|
567
|
+
| Field                             | CLI flag                               | Notes                                                                    |
| --------------------------------- | -------------------------------------- | ------------------------------------------------------------------------ |
| `binary`                          | —                                      | Default `claude`                                                         |
| `pluginDirs`                      | `--plugin-dir`                         | Repeatable                                                               |
| `pluginUrls`                      | `--plugin-url`                         | Repeatable                                                               |
| `addDirs`                         | `--add-dir`                            | Extra readable dirs (repeatable)                                         |
| `mcpConfig`                       | `--mcp-config`                         | MCP config file path                                                     |
| `strictMcpConfig`                 | `--strict-mcp-config`                  | Only MCP servers from `mcpConfig`                                        |
| `model`                           | `--model`                              | Also settable at top level                                               |
| `permissionMode`                  | `--permission-mode`                    | `default`, `acceptEdits`, `plan`, `auto`, `dontAsk`, `bypassPermissions` |
| `effort`                          | `--effort`                             | `low` … `max`                                                            |
| `agent`                           | `--agent`                              | Subagent for session                                                     |
| `fallbackModel`                   | `--fallback-model`                     | Comma-separated fallback chain                                           |
| `tools`                           | `--tools`                              | Restrict built-in tools (`Bash,Edit,Read` or `default`)                  |
| `allowedTools`                    | `--allowedTools`                       | Auto-approve tool patterns                                               |
| `disallowedTools`                 | `--disallowedTools`                    | Deny tool patterns                                                       |
| `maxTurns`                        | `--max-turns`                          | Print-mode turn cap                                                      |
| `maxBudgetUsd`                    | `--max-budget-usd`                     | Print-mode spend cap                                                     |
| `settings`                        | `--settings`                           | Settings JSON file path or inline JSON string                            |
| `settingSources`                  | `--setting-sources`                    | e.g. `user,project`                                                      |
| `systemPrompt`                    | `--system-prompt`                      | Replace default system prompt                                            |
| `systemPromptFile`                | `--system-prompt-file`                 | Replace from file                                                        |
| `appendSystemPrompt`              | `--append-system-prompt`               | Append to default prompt                                                 |
| `appendSystemPromptFile`          | `--append-system-prompt-file`          | Append from file                                                         |
| `debug`                           | `--debug`                              | `true` or category filter string                                         |
| `debugFile`                       | `--debug-file`                         | Debug log path                                                           |
| `includeHookEvents`               | `--include-hook-events`                | Hook events in stream-json                                               |
| `noSessionPersistence`            | `--no-session-persistence`             | Don't save session to disk                                               |
| `disableSlashCommands`            | `--disable-slash-commands`             | Disable skills/commands for session                                      |
| `bare`                            | `--bare`                               | Skip auto-discovery (hooks, skills, plugins, MCP)                        |
| `safeMode`                        | `--safe-mode`                          | Disable customizations for troubleshooting                               |
| `dangerouslySkipPermissions`      | `--dangerously-skip-permissions`       | Same as `bypassPermissions` mode                                         |
| `allowDangerouslySkipPermissions` | `--allow-dangerously-skip-permissions` | Add bypass to mode cycle                                                 |
| `isolateConfig`                   | —                                      | `false` = use your login/plugins; `true` (default) = fresh temp config   |

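To make the field-to-flag mapping concrete, here is a rough sketch of how a few of the fields in the table could be turned into an argument list. `ClaudeCodeOptions` (a small subset of the table) and `buildArgs` are hypothetical names used for illustration, not the adapter's actual implementation; only the flag names and the three always-on flags come from this README, and the example values are placeholders. How the prompt itself reaches the CLI is left out.

```typescript
// Hypothetical subset of the fields listed in the table above.
interface ClaudeCodeOptions {
  model?: string;
  permissionMode?: string;
  maxTurns?: number;
  pluginDirs?: string[];
  addDirs?: string[];
  strictMcpConfig?: boolean;
}

function buildArgs(options: ClaudeCodeOptions): string[] {
  // Flags the README says the adapter always passes.
  const args: string[] = ["-p", "--output-format", "stream-json", "--verbose"];

  // Repeatable flags: emit the flag once per value.
  for (const dir of options.pluginDirs ?? []) args.push("--plugin-dir", dir);
  for (const dir of options.addDirs ?? []) args.push("--add-dir", dir);

  // Single-value flags: emit only when the field is set.
  if (options.model !== undefined) args.push("--model", options.model);
  if (options.permissionMode !== undefined) {
    args.push("--permission-mode", options.permissionMode);
  }
  if (options.maxTurns !== undefined) {
    args.push("--max-turns", String(options.maxTurns));
  }

  // Boolean flags: presence of the flag means enabled.
  if (options.strictMcpConfig) args.push("--strict-mcp-config");

  return args;
}

// Example values are placeholders.
const args = buildArgs({
  model: "sonnet",
  maxTurns: 5,
  pluginDirs: ["./plugins/a", "./plugins/b"],
});
```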
The generic `cwd` option sets the child process's working directory; it is not a Claude flag. Relative paths in `mcpConfig`, `pluginDirs`, `addDirs`, and settings/prompt files resolve against the suite YAML directory.

Not wired, because evals usually start fresh sessions: `--resume`, `--continue`, `--session-id`, `--worktree`, and interactive-only flags.

The adapter captures Claude’s stream-json output and builds a `TrajectoryView`. Unknown stream events are ignored so that schema evolution does not break CI.

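The "ignore unknown events" behavior described above can be illustrated with a small tolerant parser for newline-delimited JSON. This is a sketch under stated assumptions, not the adapter's code: the event shapes (`assistant` events carrying `tool_use` content blocks, a `result` event with a string `result`) are simplified assumptions about the stream, and `ParsedTrajectory` and `parseStream` are hypothetical names.

```typescript
// Assumed, simplified shapes; real stream-json events carry more fields.
interface ToolUse {
  name: string;
  input: unknown;
}

interface ParsedTrajectory {
  toolUses: ToolUse[];
  result?: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function parseStream(ndjson: string): ParsedTrajectory {
  const trajectory: ParsedTrajectory = { toolUses: [] };

  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;

    let event: unknown;
    try {
      event = JSON.parse(line);
    } catch {
      continue; // not valid JSON: skip the line
    }
    if (!isRecord(event)) continue;

    const message = event["message"];
    const result = event["result"];

    if (event["type"] === "assistant" && isRecord(message)) {
      const content = message["content"];
      if (!Array.isArray(content)) continue;
      for (const block of content as unknown[]) {
        if (!isRecord(block)) continue;
        const name = block["name"];
        if (block["type"] === "tool_use" && typeof name === "string") {
          trajectory.toolUses.push({ name, input: block["input"] });
        }
      }
    } else if (event["type"] === "result" && typeof result === "string") {
      trajectory.result = result;
    }
    // Every other event type is ignored on purpose.
  }

  return trajectory;
}
```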
---

## Library API

```typescript
import {
  loadSuite,
  runSuite,
  gradeReport,
  buildEvalRunEnvelope,
  trajectoryToTranscript,
  trajectoryToOtlp,
  resolveGradeOptions,
  gradingReportPassed,
} from "@alis-build/harness-eval";
import { loadGradingConfig } from "@alis-build/harness-eval/config";

const suite = await loadSuite("./examples/basic.yaml");
const report = await runSuite(suite, { maxConcurrent: 2 });

const gradingConfig = await loadGradingConfig("./examples/grading.yaml");
const gradeOpts = resolveGradeOptions(gradingConfig, { maxConcurrent: 2 });
const grading = await gradeReport(report, gradeOpts);

// Export trajectory for custom tooling
const view = report.cells[0].repetitions[0].adapterResult?.view;
if (view) {
  const transcript = trajectoryToTranscript(
    view,
    "Read README.md and summarize harness-eval.",
  );
  const otlp = trajectoryToOtlp(view, { prompt: "..." });
}

// Build versioned envelope for DB / CI (see docs/eval-record.md)
const envelope = buildEvalRunEnvelope(report, {
  grading,
  suite: { uri: "./examples/basic.yaml" },
});
```

Subpath exports: `@alis-build/harness-eval/runner`, `@alis-build/harness-eval/config`, `@alis-build/harness-eval/adapters/claude-code`.

---

## Architecture (brief)

```
Suite YAML → runSuite → Harness adapter → TrajectoryView
                                                ↓
                                assertions (run, in harness-eval)
                                                ↓
                                    SuiteReport (report.json)
                                                ↓
                          ┌─────────────────────┴─────────────────────┐
                          ↓                                         ↓
                 harness-eval grade                  External judge / eval platform
                 (optional built-in)                 (LangSmith, Braintrust, custom)
                          ↓                                         ↓
                          └─────────────────────┬─────────────────────┘
                                                ↓
                                 EvalRunEnvelope → DB / CI / API
```

- **Pluggable harness adapters** — runner and assertions depend only on `TrajectoryView`.
- **Pluggable outcome layer** — built-in `grade`, custom `gradeFn`, or any external workflow.
- **OTLP** — observability side export; not required for scoring.

Details: [Data contracts & schemas](#data-contracts--schemas) · [External eval frameworks](#external-eval-frameworks--custom-judges) · [docs/eval-record.md](docs/eval-record.md)

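Because assertions depend only on the trajectory, an assertion can be thought of as a plain predicate over a `TrajectoryView`, scored across repetitions. The sketch below is a hedged illustration of that idea: the `TrajectoryView` shape is a simplified stand-in, and `Assertion`, `calledTool`, and `passRate` are hypothetical helpers, not the package's assertion DSL (see docs/assertions.md for the real one).

```typescript
// Hypothetical, simplified shapes; the package's real TrajectoryView is richer.
interface ToolCall {
  name: string;
}

interface TrajectoryView {
  toolCalls: ToolCall[];
}

type Assertion = (view: TrajectoryView) => boolean;

// Passes when the named tool was called at least once in the trajectory.
const calledTool = (name: string): Assertion => (view) =>
  view.toolCalls.some((call) => call.name === name);

// Fraction of repetitions whose trajectory satisfies the assertion.
function passRate(views: TrajectoryView[], assertion: Assertion): number {
  if (views.length === 0) return 0;
  const passed = views.filter(assertion).length;
  return passed / views.length;
}

// Four made-up repetitions of the same cell.
const repetitions: TrajectoryView[] = [
  { toolCalls: [{ name: "Read" }] },
  { toolCalls: [{ name: "Bash" }] },
  { toolCalls: [{ name: "Read" }, { name: "Edit" }] },
  { toolCalls: [] },
];

const rate = passRate(repetitions, calledTool("Read")); // 2 of 4 repetitions pass: 0.5
```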
---

## Development

```bash
pnpm install
pnpm run build
pnpm test                  # vitest
pnpm run typecheck
pnpm run generate-schemas  # Zod → schemas/*.schema.json only
```

**Docs:** [Assertion DSL & adapter extension](docs/assertions.md) · [Eval record contract (DB / CI)](docs/eval-record.md)

---

## Related work

- [lastmile-ai/mcp-eval](https://github.com/lastmile-ai/mcp-eval) — model + MCP eval (not harness-specific)
- [alpic-ai/mcp-eval](https://github.com/alpic-ai/mcp-eval) — YAML-driven MCP eval
- [OpenTelemetry GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/) — OTLP export shape

---