@tangle-network/agent-eval 0.17.3 → 0.18.0

@@ -0,0 +1,155 @@
1
+ # Concepts
2
+
3
+ Read this once and the rest of agent-eval makes sense.
4
+
5
+ ## What is agent-eval?
6
+
7
+ A library for **deciding whether a code generator or content generator did its job.** You give it a thing the generator produced (a scaffold, a patch, a tweet, a JSON config), and you get back a structured verdict: pass/fail, dimension scores, a reason in plain English.
8
+
9
+ It exists because LLMs lie about whether they succeeded. A model will say "Done!" and ship code that doesn't compile. agent-eval is the layer between the model's output and your decision to ship.
10
+
11
+ ## The four things you'll touch most
12
+
13
+ | Thing | What it is | One-line example |
14
+ |---|---|---|
15
+ | **Judge** | A function that scores one piece of output. | "Did this scaffold implement async fetching?" |
16
+ | **Rubric** | The recipe a judge uses: what to score on, with what weights. | "Score on buyer_quality (0.5), voice (0.3), signal (0.2)." |
17
+ | **Verifier** | A pipeline of judges run in order, with dependencies. | "install → typecheck → build → semantic" |
18
+ | **Feedback trajectory** | A multi-shot record of attempts, approvals, rejections, edits, metrics, and policy outcomes. | "draft → user rejects → revised draft → approved → measured" |
19
+
20
+ That's the whole framework. Everything else (sessions, traces, layers) is plumbing around those four.
21
+
22
+ When the thing being evaluated is an agent that should keep working, use
23
+ [`runAgentControlLoop`](./control-runtime.md). It turns validators into a
24
+ runtime loop: observe typed state, validate it, decide the next action, act,
25
+ and repeat until the task passes, blocks, times out, spends too much, or stops
26
+ making progress.
27
+
28
+ When normal agent usage should become reusable training/eval signal, use
29
+ [`FeedbackTrajectory`](./feedback-trajectories.md). It captures approvals,
30
+ rejections, edits, option choices, metrics, and policy blocks as portable data
31
+ that can seed memory, replay scenarios, and optimization.
32
+
33
+ ## Vocabulary, plain English
34
+
35
+ | Term | Plain English |
36
+ |---|---|
37
+ | **Artifact** | The thing being judged. Often a workdir of files, sometimes a string of text. |
38
+ | **Snapshot** | A frozen view of an artifact (every file path → content). What the judge actually reads. |
39
+ | **Harness** | A description of *how to run* the artifact: setup command, test command, working dir, timeout. |
40
+ | **Sandbox driver** | The thing that actually executes commands inside the harness. Local subprocess, or remote container. |
41
+ | **Layer** | One stage of a verifier pipeline (install, typecheck, build, semantic, …). |
42
+ | **Finding** | A specific issue a judge found — file, line, severity, message. |
43
+ | **Trace store** | The append-only log of every span/event during a run. Replay = read this back. |
44
+ | **Composite score** | A 0..1 number combining all dimensions. The single number you gate on. |
45
+ | **Rubric version** | A stable hash of the rubric. Scores from different rubric versions are not comparable. |
46
+ | **Muffled gate** | A check that should fail loud but silently passes (e.g. `command || true`). The most expensive bug class in this codebase — see SKILL.md. |
47
+
48
+ ## The feedback trajectory loop
49
+
50
+ For agentic systems, the highest-quality labels often come from normal review
51
+ workflow, not a separate labeling UI:
52
+
53
+ ```text
54
+ agent proposes -> user approves/rejects/edits/selects -> agent revises -> outcome is measured
55
+ ```
56
+
57
+ `FeedbackTrajectory` is the portable record of that loop. Browser agents can
58
+ store task outcomes, coding agents can store patch review plus test results,
59
+ and research agents can store reviewer corrections. The domain changes; the
60
+ shape stays the same.
61
+
62
+ Those trajectories can be converted into preference memory, `DatasetScenario`
63
+ rows, optimizer rows, and held-out examples for overfit checks.
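As a rough illustration of the conversion described above, the sketch below turns a recorded trajectory into preference pairs. `SketchTrajectory`, `PreferencePair`, and `toPreferencePairs` are hypothetical names invented for this example; they are not the `FeedbackTrajectory` API, whose actual shape is covered in feedback-trajectories.md.

```typescript
// Hypothetical shapes for illustration only; not the library's FeedbackTrajectory type.
type Feedback = 'approved' | 'rejected' | 'edited'

interface Attempt {
  output: string
  feedback: Feedback
}

interface SketchTrajectory {
  task: string
  attempts: Attempt[]
}

interface PreferencePair {
  task: string
  preferred: string
  rejected: string
}

// Pair every rejected attempt with the first approved attempt that follows it.
function toPreferencePairs(t: SketchTrajectory): PreferencePair[] {
  const out: PreferencePair[] = []
  t.attempts.forEach((attempt, i) => {
    if (attempt.feedback !== 'rejected') return
    const later = t.attempts.slice(i + 1).filter((a) => a.feedback === 'approved')[0]
    if (later) out.push({ task: t.task, preferred: later.output, rejected: attempt.output })
  })
  return out
}

const pairs = toPreferencePairs({
  task: 'write release note',
  attempts: [
    { output: 'draft v1', feedback: 'rejected' },
    { output: 'draft v2', feedback: 'approved' },
  ],
})
console.log(pairs)
```

The same record could feed other consumers (dataset rows, held-out examples) by swapping the mapping function; only the pairing rule here is specific to preference data.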
64
+
65
+ ## The three-layer eval (for code generators)
66
+
67
+ When the artifact is generated code, agent-eval scores it at three independent layers. Each layer fails differently, and you want to know which one broke:
68
+
69
+ ```text
70
+ L0 builder Did the agent's session itself work?
71
+ (Did it produce an artifact at all?)
72
+
73
+
74
+ L1 app-build Does the artifact build / typecheck / test?
75
+ (Static signal, ground-truth gate.)
76
+
77
+
78
+ L2 app-runtime Does the artifact actually run end-to-end?
79
+ (Dynamic signal — only worth checking if L1 passed.)
80
+ ```
81
+
82
+ `BuilderSession` orchestrates this. It opens the session at `startChat`, runs the build at `ship`, and runs the runtime check at `runAppScenario`. Each layer emits a trace span, and `scoreProject` aggregates the layers into the composite score.
83
+
84
+ Why three? Because each catches a different failure mode:
85
+ - L0 catches a failed session: the agent crashed mid-generation and left a half-written file.
86
+ - L1 catches a broken build: files exist but typecheck fails. LLM judges can't reliably catch this.
87
+ - L2 catches wrong behavior: code compiles but does the wrong thing at runtime.
88
+
89
+ If you only check one layer, you ship the bugs that the other two layers would have caught.
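The gating between layers can be pictured with a small sketch. It is not how `BuilderSession` is implemented; `LayerResult` and `runLayers` are hypothetical names, and the checks are stubs. It only shows the idea that layers run in order and the report names the first one that broke.

```typescript
// Hypothetical result shape; the real BuilderSession output may differ.
type LayerName = 'L0 builder' | 'L1 app-build' | 'L2 app-runtime'

interface LayerResult {
  layer: LayerName
  passed: boolean
  detail: string
}

// Run layers in order and stop at the first failure, so the result names
// the layer that broke instead of a single opaque "fail".
function runLayers(checks: Array<() => LayerResult>): LayerResult[] {
  const results: LayerResult[] = []
  for (const check of checks) {
    const result = check()
    results.push(result)
    if (!result.passed) break // later layers are not worth running
  }
  return results
}

const layerResults = runLayers([
  () => ({ layer: 'L0 builder', passed: true, detail: 'artifact produced' }),
  () => ({ layer: 'L1 app-build', passed: false, detail: 'typecheck failed' }),
  () => ({ layer: 'L2 app-runtime', passed: true, detail: 'never reached' }),
])
const firstFailure = layerResults.filter((r) => !r.passed)[0]
console.log(firstFailure ? firstFailure.layer : 'all layers passed')
```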
90
+
91
+ ## How rubrics work
92
+
93
+ A rubric describes:
94
+ 1. **Dimensions** — the axes you score on (e.g. `buyer_quality`, `voice`, `signal`).
95
+ 2. **Weights** — how to combine dimensions into a composite (`0.5 * buyer_quality + 0.3 * voice + 0.2 * signal`).
96
+ 3. **Failure modes** — named patterns the judge looks for ("ai-cadence", "vague-claim").
97
+ 4. **Wins** — named positive patterns ("specific-component", "earned-detail").
98
+ 5. **System prompt** — what to tell the judging LLM about the persona and the task.
99
+
100
+ Built-in rubrics ship in `src/wire/rubrics.ts` (e.g. `anti-slop` for technical-buyer voice). You can also pass a rubric inline — the same shape, just defined at the call site.
101
+
102
+ A rubric is plain data. The hash of that data is the `rubricVersion`. Two scores are only comparable if they used the same `rubricVersion` — change the rubric and you start a new comparison series.
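To make the weighting and versioning concrete, here is a minimal sketch. `SketchRubric`, `compositeScore`, and `sketchRubricVersion` are hypothetical names, and the FNV-1a-style hash is a stand-in chosen to keep the example dependency-free; the hash the library actually uses is not shown in this document.

```typescript
// Hypothetical rubric shape for illustration; built-in rubrics may carry more fields.
interface SketchRubric {
  dimensions: string[]
  weights: Record<string, number>
  systemPrompt: string
}

// Weighted sum of per-dimension scores, each in 0..1.
function compositeScore(rubric: SketchRubric, scores: Record<string, number>): number {
  return rubric.dimensions.reduce(
    (sum, dim) => sum + (rubric.weights[dim] ?? 0) * (scores[dim] ?? 0),
    0,
  )
}

// Serialize with sorted object keys so key order does not change the hash.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(',')}]`
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
      .join(',')
    return `{${body}}`
  }
  return JSON.stringify(value)
}

// FNV-1a-style 32-bit hash over UTF-16 code units (a stand-in, not the library's hash).
function sketchRubricVersion(rubric: SketchRubric): string {
  const text = canonicalize(rubric)
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return ('00000000' + hash.toString(16)).slice(-8)
}

const rubric: SketchRubric = {
  dimensions: ['buyer_quality', 'voice', 'signal'],
  weights: { buyer_quality: 0.5, voice: 0.3, signal: 0.2 },
  systemPrompt: 'You are a skeptical technical buyer.',
}
const composite = compositeScore(rubric, { buyer_quality: 0.8, voice: 0.6, signal: 1.0 })
const version = sketchRubricVersion(rubric)
console.log(composite.toFixed(2), version)
```

Because the version is derived from the rubric data itself, editing a weight or the system prompt produces a different version string, which is what starts a new comparison series.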
103
+
104
+ ## How verifiers work
105
+
106
+ When you have a multi-step pipeline (install → typecheck → build → lint → semantic), use `MultiLayerVerifier`:
107
+
108
+ ```ts
109
+ const verifier = new MultiLayerVerifier([
110
+ installLayer, // runs `pnpm install`
111
+ typecheckLayer, // runs `tsc --noEmit`, depends on install
112
+ buildLayer, // runs `pnpm build`, depends on typecheck
113
+ semanticLayer, // LLM judge, weight 3, depends on build
114
+ ])
115
+
116
+ const report = await verifier.run({ env: { runner, workdir /* ... */ } })
117
+ report.allPass // boolean — every layer passed
118
+ report.blendedScore // 0..1 — weighted aggregate
119
+ report.layers // per-layer status, findings, duration
120
+ ```
121
+
122
+ Two rules that will save you bugs (paid for in real incidents — see SKILL.md):
123
+
124
+ 1. **Run both gates.** Build gates catch code that doesn't compile; structural assertions catch missing files. Run both unconditionally — they catch orthogonal failures.
125
+
126
+ 2. **Pair LLM judges with build outcomes.** An LLM judge can rate non-compiling code as "looks right" (for example, 0.8). Always short-circuit on `buildOutcome.passed === false` before any LLM judging.
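The second rule can be sketched as a small wrapper. Only `buildOutcome.passed` comes from the text above; `BuildOutcome`, `Verdict`, `LlmJudge`, and `judgeWithBuildGate` are hypothetical names, and the judge is a stub that returns the optimistic 0.8 from the example.

```typescript
// Hypothetical shapes; `buildOutcome.passed` is the only field taken from the rule above.
interface BuildOutcome {
  passed: boolean
  log: string
}

interface Verdict {
  score: number
  reason: string
  judgedByLlm: boolean
}

type LlmJudge = (artifact: string) => Promise<number>

// Refuse to consult the LLM judge when the build failed: ground truth wins.
async function judgeWithBuildGate(
  artifact: string,
  buildOutcome: BuildOutcome,
  llmJudge: LlmJudge,
): Promise<Verdict> {
  if (buildOutcome.passed === false) {
    return { score: 0, reason: `build failed: ${buildOutcome.log}`, judgedByLlm: false }
  }
  const score = await llmJudge(artifact)
  return { score, reason: 'build passed; scored by LLM judge', judgedByLlm: true }
}

// Stub judge that is fooled by plausible-looking code.
let judgeCalls = 0
const optimisticJudge: LlmJudge = async () => {
  judgeCalls += 1
  return 0.8
}

const gated = judgeWithBuildGate(
  'const x: number = "oops"',
  { passed: false, log: 'TS2322' },
  optimisticJudge,
)
gated.then((verdict) => console.log(verdict.score, verdict.reason))
```

Returning a hard 0 with the build log as the reason keeps the failure loud; the optimistic judge is never invoked, so its score cannot leak into the result.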
127
+
128
+ ## The trace model (skip on first read)
129
+
130
+ Every operation emits structured spans into a `TraceStore`. A run is a tree:
131
+
132
+ ```text
133
+ builder-session [span]
134
+ ├── chat-turn [span]
135
+ ├── ship [span]
136
+ │ ├── harness.install [span]
137
+ │ ├── harness.typecheck [span]
138
+ │ └── harness.build [span]
139
+ └── app-runtime [span]
140
+ └── scenario.run [span]
141
+ ```
142
+
143
+ Spans are append-only and have stable ids — replay is reading the same store back. OTLP export ships them out for distributed tracing.
144
+
145
+ You don't need to build the trace tree by hand. `BuilderSession` does it for you. Look at the trace store when you're debugging a flaky run; ignore it otherwise.
146
+
147
+ ## Where to go next
148
+
149
+ - **Need the layman feature map?** → [feature-guide.md](./feature-guide.md) — what each primitive does, when to use it, integration patterns, and guardrails.
150
+ - **Just want to score a string against a rubric?** → [wire-protocol.md](./wire-protocol.md) — HTTP/RPC interface, pluggable from any language.
151
+ - **Need a reusable driver/worker/evaluator loop?** → [control-runtime.md](./control-runtime.md) — generic runtime plus coding, browser, computer-use, and research integration patterns.
152
+ - **Want review feedback to become eval/optimization data?** → [feedback-trajectories.md](./feedback-trajectories.md) — turn feedback into datasets, optimizer rows, and preference memory.
153
+ - **Building a code-generator eval?** → SKILL.md §Minimal working path — the `BuilderSession` recipe.
154
+ - **Multi-layer verifier?** → SKILL.md §Verification pipeline.
155
+ - **Adding a new judge or rubric?** → `src/wire/rubrics.ts` for the cross-language path; `src/anti-slop.ts` and `src/judges.ts` for the in-process path.
@@ -0,0 +1,351 @@
1
+ # Agent Control Runtime
2
+
3
+ `runAgentControlLoop` is the smallest reusable runtime for agentic tasks:
4
+
5
+ ```text
6
+ observe state -> validate state -> decide next action -> act -> repeat
7
+ ```
8
+
9
+ It is intentionally not a topology framework. Direct execution, driver
10
+ intervention, critique/revision, specialist fan-out, and user escalation are
11
+ all just actions selected by policy.
12
+
13
+ Use it when an agent should keep working until objective state says the task is
14
+ done, blocked, too expensive, or no longer improving.
15
+
16
+ ## Core API
17
+
18
+ ```ts
19
+ import {
20
+ objectiveEval,
21
+ runAgentControlLoop,
22
+ subjectiveEval,
23
+ } from '@tangle-network/agent-eval'
24
+
25
+ const result = await runAgentControlLoop({
26
+ intent: 'Create a final answer with citations and no math errors.',
27
+ budget: { maxSteps: 6, maxWallMs: 180_000, maxCostUsd: 1.50 },
28
+
29
+ async observe() {
30
+ return await readCurrentTaskState()
31
+ },
32
+
33
+ async validate({ state }) {
34
+ return [
35
+ objectiveEval({
36
+ id: 'citations-present',
37
+ passed: state.citations.length >= 2,
38
+ severity: 'critical',
39
+ }),
40
+ objectiveEval({
41
+ id: 'math-reconciles',
42
+ passed: state.mathErrors.length === 0,
43
+ severity: 'critical',
44
+ }),
45
+ subjectiveEval({
46
+ id: 'answer-usefulness',
47
+ passed: state.judgeScore >= 0.8,
48
+ score: state.judgeScore,
49
+ severity: 'warning',
50
+ }),
51
+ ]
52
+ },
53
+
54
+ async decide({ evals, history }) {
55
+ const failed = evals.filter((e) => !e.passed)
56
+ if (!failed.length) return { type: 'stop', pass: true, reason: 'done' }
57
+ return {
58
+ type: 'continue',
59
+ action: { type: 'revise', failures: failed.map((e) => e.id) },
60
+ reason: `fix ${failed.map((e) => e.id).join(', ')}`,
61
+ }
62
+ },
63
+
64
+ async act(action) {
65
+ return await worker.act(action)
66
+ },
67
+
68
+ getActionCostUsd: ({ result }) => result.costUsd,
69
+ stopPolicies: {
70
+ maxNoProgressSteps: 2,
71
+ maxRepeatedActions: 3,
72
+ },
73
+ })
74
+ ```
75
+
76
+ ## Design Rules
77
+
78
+ - Keep domain adapters in downstream repos until they are reused by multiple
79
+ integrations.
80
+ - Use the same adapter in product, benchmark replay, and optimization. Swapping
81
+ the state reader or worker implementation is fine; changing validators,
82
+ action semantics, or stop policy means you are no longer measuring what users
83
+ experience.
84
+ - Prefer objective validators over LLM judges. Reserve LLM judges for subjective
85
+ calls: usefulness, clarity, and domain expert review.
86
+ - Treat irreversible external actions as domain policy, not runtime policy.
87
+ The runtime can stop loops; the downstream adapter must decide which actions
88
+ require approval before `act()`.
89
+ - Use typed state. Do not make the policy reason only over transcript text.
90
+ - Make `act()` return cost when possible so `maxCostUsd` can enforce recorded
91
+ spend.
92
+
93
+ ## Product / Eval Contract
94
+
95
+ The runtime is most useful when a downstream product exposes a small adapter:
96
+
97
+ ```ts
98
+ interface ProductControlAdapter<State, Action, ActionResult> {
99
+ observe(): Promise<State>
100
+ validate(state: State): Promise<ControlEvalResult[]>
101
+ decide(ctx: ControlContext<State, Action, ActionResult>): Promise<ControlDecision<Action>>
102
+ act(action: Action): Promise<ActionResult>
103
+ }
104
+ ```
105
+
106
+ Production passes the adapter real sessions, credentials, and storage. Evals
107
+ pass the same adapter replay fixtures, sandboxes, or recorded traces. The
108
+ adapter boundary is the transfer point between training and real usage.
109
+
110
+ Avoid this split:
111
+
112
+ ```text
113
+ benchmark harness has one loop
114
+ product runtime has another loop
115
+ optimizer tunes only the benchmark loop
116
+ ```
117
+
118
+ That creates benchmark wins that do not transfer. Keep one loop and vary only
119
+ the dependencies behind `observe` and `act`.
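One way to keep a single loop is to build the adapter from injected dependencies, so production and replay differ only in what sits behind `observe` and `act`. The sketch below assumes a simplified adapter: `Deps`, `makeAdapter`, and the string-valued decision are hypothetical and much smaller than the `ProductControlAdapter` interface above.

```typescript
// Hypothetical dependency and state shapes, invented for this sketch.
interface Deps {
  readState: () => Promise<{ testsPassing: boolean }>
  runAction: (action: string) => Promise<{ costUsd: number }>
}

// One adapter factory: validators and decision logic live here and never vary.
function makeAdapter(deps: Deps) {
  return {
    observe: deps.readState,
    validate: async (state: { testsPassing: boolean }) => [
      { id: 'tests-pass', passed: state.testsPassing },
    ],
    decide: async (evals: Array<{ id: string; passed: boolean }>) =>
      evals.every((e) => e.passed) ? 'stop' : 'patch',
    act: deps.runAction,
  }
}

// Replay wiring: recorded states stand in for a live session.
const recorded = [{ testsPassing: false }, { testsPassing: true }]
let cursor = 0
const replayAdapter = makeAdapter({
  readState: async () => recorded[Math.min(cursor, recorded.length - 1)],
  runAction: async () => {
    cursor += 1 // advance to the next recorded state
    return { costUsd: 0 }
  },
})

replayAdapter
  .observe()
  .then((state) => replayAdapter.validate(state))
  .then((evals) => replayAdapter.decide(evals))
  .then((decision) => console.log('first decision:', decision))
```

A production wiring would call the same `makeAdapter` with a live state reader and a real worker, leaving `validate` and `decide` untouched.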
120
+
121
+ ## What the Runtime Guarantees
122
+
123
+ - `maxSteps`, `maxWallMs`, and `maxCostUsd` guard runaway loops.
124
+ - repeated-action and no-progress stop policies catch stuck behavior.
125
+ - `actionFailure: 'continue'` records worker failures and lets policy recover.
126
+ - `actionFailure: 'stop'` fails fast for workflows where a failed action should
127
+ abort.
128
+ - observation, validation, decision, stop-policy, and action failures are
129
+ returned as structured `runtimeErrors` instead of disappearing.
130
+ - trace sink and `onStep` callback failures are recorded in `runtimeErrors`
131
+ but do not abort the control loop. Agent progress should not depend on
132
+ telemetry availability.
133
+ - action-policy preflight belongs before `act()`. Use `evaluateActionPolicy`
134
+ to block or label side effects, budget breaches, and missing expected
135
+ outcomes before any irreversible action runs.
136
+ - when a `TraceStore` is supplied, the runtime emits:
137
+ - one run
138
+ - one tool span per control step
139
+ - one judge span per eval result
140
+ - budget ledger entries for recorded spend
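The budget and no-progress guards listed above can be pictured with a toy loop. This is a synchronous simplification, not the implementation of `runAgentControlLoop`: `toyControlLoop`, `LoopOptions`, and `StopReason` are hypothetical, and "progress" is assumed here to mean strictly fewer failing validators than the best seen so far.

```typescript
// A toy, synchronous loop for illustration; far simpler than runAgentControlLoop.
interface LoopOptions<S> {
  maxSteps: number
  maxCostUsd: number
  maxNoProgressSteps: number
  observe: () => S
  failures: (state: S) => number // count of failing validators
  act: () => number // performs one action and returns its cost in USD
}

type StopReason = 'passed' | 'max-steps' | 'max-cost' | 'no-progress'

function toyControlLoop<S>(
  opts: LoopOptions<S>,
): { reason: StopReason; steps: number; spentUsd: number } {
  let spentUsd = 0
  let stalled = 0
  let bestFailures = Number.POSITIVE_INFINITY
  for (let steps = 0; ; steps++) {
    const failing = opts.failures(opts.observe())
    if (failing === 0) return { reason: 'passed', steps, spentUsd }
    if (steps >= opts.maxSteps) return { reason: 'max-steps', steps, spentUsd }
    if (spentUsd >= opts.maxCostUsd) return { reason: 'max-cost', steps, spentUsd }
    // Progress means strictly fewer failing validators than the best seen so far.
    if (failing < bestFailures) {
      bestFailures = failing
      stalled = 0
    } else if (++stalled >= opts.maxNoProgressSteps) {
      return { reason: 'no-progress', steps, spentUsd }
    }
    spentUsd += opts.act()
  }
}

// A worker that never fixes anything is stopped by the no-progress guard.
const stuck = toyControlLoop({
  maxSteps: 6,
  maxCostUsd: 1.5,
  maxNoProgressSteps: 2,
  observe: () => 2,
  failures: (n) => n,
  act: () => 0.25,
})
console.log(stuck.reason, stuck.steps, stuck.spentUsd)
```

Checking the guards before each action, rather than after, means a loop that is already over budget never pays for one more step.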
141
+
142
+ ## Propose / Review Preset
143
+
144
+ `runProposeReviewAsControlLoop` adapts the common artifact-refinement loop onto
145
+ the generic runtime:
146
+
147
+ ```text
148
+ propose -> verify -> review -> propose again
149
+ ```
150
+
151
+ Use it when the task is naturally "produce or improve state until verification
152
+ passes."
153
+
154
+ ```ts
155
+ import { runProposeReviewAsControlLoop } from '@tangle-network/agent-eval'
156
+
157
+ const report = await runProposeReviewAsControlLoop({
158
+ goal: 'Make the implementation pass tests and satisfy the reviewer.',
159
+ initialState: { workdir },
160
+ maxShots: 5,
161
+
162
+ async propose({ state, priorReview }) {
163
+ return await codingAgent.patch({
164
+ workdir: state.workdir,
165
+ instruction: priorReview?.nextShotInstruction,
166
+ })
167
+ },
168
+
169
+ async verify(state) {
170
+ const tests = await runTests(state.workdir)
171
+ return {
172
+ pass: tests.ok,
173
+ score: tests.ok ? 1 : 0,
174
+ failingLayers: tests.ok ? [] : ['tests'],
175
+ details: tests,
176
+ }
177
+ },
178
+
179
+ async review({ verification }) {
180
+ return await reviewer.explainNextShot(verification)
181
+ },
182
+
183
+ failureClassFromVerification(verification) {
184
+ if (verification.failingLayers?.includes('tests')) return 'sandbox_failure'
185
+ return 'unknown'
186
+ },
187
+ })
188
+ ```
189
+
190
+ Long term, `runProposeReview` should remain the stable convenience API, while
191
+ its internals can route through this control-loop preset.
192
+
193
+ ## Domain Patterns
194
+
195
+ These examples show what belongs in product repos. They should not become core
196
+ `agent-eval` adapters until the same adapter shape is reused by multiple
197
+ products.
198
+
199
+ ### Shared Sandbox Execution
200
+
201
+ Harnesses and judges can run against the same sandbox. The common pattern
202
+ is to pass one sandbox driver and one workdir through every layer:
203
+
204
+ ```ts
205
+ const driver = new SubprocessSandboxDriver({ cwd: workdir, env })
206
+ const harness = new SandboxHarness(driver)
207
+ ```
208
+
209
+ Use the same sandbox when checks need shared state:
210
+
211
+ - install dependencies once, then typecheck/build/test in the same workdir
212
+ - run a browser/computer-use scenario against the app the harness just started
213
+ - let a judge inspect files, logs, screenshots, or traces produced by earlier
214
+ layers
215
+
216
+ Use separate sandboxes when checks need isolation:
217
+
218
+ - variants are running in parallel
219
+ - a test mutates global state
220
+ - credentials or network access differ by phase
221
+ - one action can corrupt the workdir for later checks
222
+
223
+ The important rule is explicit ownership: one driver/workdir means shared state;
224
+ multiple drivers/workdirs means isolated state. Do not rely on hidden global
225
+ state.
226
+
227
+ ### Coding Agent
228
+
229
+ ```ts
230
+ interface CodingState {
231
+ workdir: string
232
+ diffSummary: string
233
+ tests: { typecheck: boolean; unit: boolean; e2e?: boolean }
234
+ generatedFiles: string[]
235
+ runtimeTrace?: string
236
+ reviewerFindings: string[]
237
+ }
238
+ ```
239
+
240
+ ```ts
241
+ type CodingAction =
242
+ | { type: 'patch'; instruction: string }
243
+ | { type: 'run_tests'; command: string }
244
+ | { type: 'review_diff' }
245
+ | { type: 'ask_user'; question: string }
246
+ ```
247
+
248
+ Validators:
249
+
250
+ - expected files exist
251
+ - typecheck/build/tests pass
252
+ - generated app or agent completes a representative runtime scenario
253
+ - no hardcoded fake success or placeholder integrations
254
+ - reviewer findings resolved or explicitly accepted
255
+
256
+ Stop policy:
257
+
258
+ - stop when build and runtime validators pass
259
+ - stop on no progress after repeated patch/test cycles
260
+ - ask user when task intent or credentials are missing
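Several of the coding validators listed above can be expressed directly against `CodingState`. The sketch is self-contained, so it uses a local `EvalResult` stand-in rather than importing `objectiveEval`; `validateCodingState` and its `expectedFiles` parameter are hypothetical.

```typescript
// Local stand-in so the sketch is self-contained; a real adapter would presumably
// build these results with objectiveEval from the Core API section.
interface EvalResult {
  id: string
  passed: boolean
  severity: 'critical' | 'warning'
}

interface CodingState {
  workdir: string
  diffSummary: string
  tests: { typecheck: boolean; unit: boolean; e2e?: boolean }
  generatedFiles: string[]
  runtimeTrace?: string
  reviewerFindings: string[]
}

// Map typed state onto objective checks; `expectedFiles` is a hypothetical input.
function validateCodingState(state: CodingState, expectedFiles: string[]): EvalResult[] {
  const missing = expectedFiles.filter((f) => state.generatedFiles.indexOf(f) === -1)
  return [
    { id: 'expected-files-exist', passed: missing.length === 0, severity: 'critical' },
    { id: 'typecheck-passes', passed: state.tests.typecheck, severity: 'critical' },
    { id: 'unit-tests-pass', passed: state.tests.unit, severity: 'critical' },
    // Choice made for this sketch: an e2e run that never happened counts as a
    // failure rather than a silent pass.
    { id: 'e2e-passes', passed: state.tests.e2e === true, severity: 'critical' },
    { id: 'reviewer-findings-resolved', passed: state.reviewerFindings.length === 0, severity: 'warning' },
  ]
}

const codingEvals = validateCodingState(
  {
    workdir: '/tmp/app',
    diffSummary: 'add fetch helper',
    tests: { typecheck: true, unit: false },
    generatedFiles: ['src/index.ts'],
    reviewerFindings: [],
  },
  ['src/index.ts', 'package.json'],
)
const failedIds = codingEvals.filter((e) => !e.passed).map((e) => e.id)
console.log(failedIds)
```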
261
+
262
+ ### Browser / Computer-Use Agent
263
+
264
+ Use this shape when the agent controls a browser, desktop session, or remote
265
+ computer and needs to complete a task end-to-end.
266
+
267
+ ```ts
268
+ interface ComputerUseState {
269
+ url?: string
270
+ goal: string
271
+ screenshot?: string
272
+ accessibilityTree?: unknown
273
+ completedSteps: string[]
274
+ openIssues: string[]
275
+ assertions: Array<{ id: string; passed: boolean; detail?: string }>
276
+ }
277
+ ```
278
+
279
+ ```ts
280
+ type ComputerUseAction =
281
+ | { type: 'navigate'; url: string }
282
+ | { type: 'click'; selectorOrDescription: string }
283
+ | { type: 'type'; selectorOrDescription: string; text: string }
284
+ | { type: 'inspect' }
285
+ | { type: 'ask_user'; question: string }
286
+ ```
287
+
288
+ Validators:
289
+
290
+ - required page or app state is reached
291
+ - no blocking errors are visible
292
+ - expected text, data, or UI state is present
293
+ - screenshots or accessibility tree support the claimed success
294
+ - repeated clicks or navigation loops are detected
295
+
296
+ Stop policy:
297
+
298
+ - stop when objective UI assertions pass
299
+ - stop on repeated action or no-progress policies
300
+ - ask user when credentials, permissions, or ambiguous choices block progress
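Detecting repeated clicks or navigation loops can be as simple as counting identical actions at the tail of the action log. `trailingRepeats` is a hypothetical helper; the runtime's own repeated-action policy may compare actions differently.

```typescript
type ComputerUseAction =
  | { type: 'navigate'; url: string }
  | { type: 'click'; selectorOrDescription: string }
  | { type: 'type'; selectorOrDescription: string; text: string }
  | { type: 'inspect' }
  | { type: 'ask_user'; question: string }

// Count how many times the most recent action repeats at the end of the log.
// Actions are compared by their JSON form, so this assumes they are built
// with a consistent key order.
function trailingRepeats(log: ComputerUseAction[]): number {
  if (log.length === 0) return 0
  const last = JSON.stringify(log[log.length - 1])
  let count = 0
  for (let i = log.length - 1; i >= 0 && JSON.stringify(log[i]) === last; i--) count++
  return count
}

const clickLog: ComputerUseAction[] = [
  { type: 'navigate', url: 'https://example.com/checkout' },
  { type: 'click', selectorOrDescription: 'Submit order' },
  { type: 'click', selectorOrDescription: 'Submit order' },
  { type: 'click', selectorOrDescription: 'Submit order' },
]
const repeats = trailingRepeats(clickLog)
console.log(repeats >= 3 ? 'stuck: same click repeated' : 'ok')
```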
301
+
302
+ ### Research / Documentation Agent
303
+
304
+ Use this shape when the agent produces a brief, explanation, migration guide, or
305
+ technical research note.
306
+
307
+ ```ts
308
+ interface ResearchState {
309
+ question: string
310
+ draft: string
311
+ sources: Array<{ url: string; title?: string; relevant: boolean }>
312
+ unsupportedClaims: string[]
313
+ reviewerFindings: string[]
314
+ }
315
+ ```
316
+
317
+ ```ts
318
+ type ResearchAction =
319
+ | { type: 'search'; query: string }
320
+ | { type: 'read_source'; url: string }
321
+ | { type: 'revise_draft'; failures: string[] }
322
+ | { type: 'ask_user'; question: string }
323
+ ```
324
+
325
+ Validators:
326
+
327
+ - every important claim has a source
328
+ - sources are relevant and current enough for the task
329
+ - unsupported claims are removed or marked as uncertain
330
+ - reviewer findings are resolved
331
+ - final output answers the original question
332
+
333
+ Stop policy:
334
+
335
+ - stop when source coverage and reviewer checks pass
336
+ - ask user when the question scope is ambiguous
337
+ - stop on repeated research queries with no new evidence
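The last stop condition, repeated queries with no new evidence, could be tracked roughly like this. `EvidenceTracker` is a hypothetical class, and "new evidence" is assumed here to mean a source URL that has not been seen before.

```typescript
// Hypothetical tracker, invented for this sketch.
class EvidenceTracker {
  private seenUrls: Record<string, boolean> = {}
  private staleSearches = 0

  // Record the URLs a search returned and report how many consecutive
  // searches have added nothing new.
  recordSearch(urls: string[]): number {
    let added = 0
    for (const url of urls) {
      if (!this.seenUrls[url]) {
        this.seenUrls[url] = true
        added++
      }
    }
    this.staleSearches = added === 0 ? this.staleSearches + 1 : 0
    return this.staleSearches
  }
}

const tracker = new EvidenceTracker()
const staleCounts = [
  tracker.recordSearch(['https://a.example', 'https://b.example']),
  tracker.recordSearch(['https://a.example']),
  tracker.recordSearch(['https://b.example', 'https://a.example']),
]
console.log(staleCounts)
```

A stop policy could then end the run once the returned count reaches a chosen threshold, in the same spirit as `maxNoProgressSteps`.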
338
+
339
+ ## Integration Checklist
340
+
341
+ For a new downstream integration:
342
+
343
+ 1. Define typed state.
344
+ 2. Define domain actions.
345
+ 3. Write objective validators first.
346
+ 4. Add subjective judges only for judgment-heavy dimensions.
347
+ 5. Decide which actions require approval before execution.
348
+ 6. Add cost extraction for expensive actions.
349
+ 7. Add no-progress and repeated-action policies.
350
+ 8. Emit to a `TraceStore` in CI and production-like evals.
351
+ 9. Keep the adapter downstream until it proves reusable.