@tangle-network/agent-eval 0.20.2 → 0.20.3

package/README.md CHANGED
@@ -1,343 +1,162 @@
  # @tangle-network/agent-eval
 
- **A library for deciding whether an LLM-driven generator did its job.**
-
- You hand it the thing the generator produced — a code scaffold, a patch, a tweet, a JSON config — and you get back a structured verdict: pass/fail, dimension scores, plain-English rationale. Built to catch the LLM failure modes that LLM-as-judge alone misses.
-
- ```ts
- import { BuilderSession, SubprocessSandboxDriver, InMemoryTraceStore } from '@tangle-network/agent-eval'
-
- const session = new BuilderSession(new InMemoryTraceStore(), { projectId: 'my-app' }, new SubprocessSandboxDriver())
- await session.startChat()
- const ship = await session.ship({
-   harness: { setupCommand: 'pnpm install', testCommand: 'pnpm exec tsc --noEmit', cwd: scaffoldDir, timeoutMs: 180_000 },
- })
- console.log(ship.result.passed, ship.result.score)
+ Trace-first evaluation infrastructure for agent systems.
+
+ `agent-eval` provides the contracts and runtime primitives for measuring agent
+ behavior: traces, harnesses, verifier pipelines, judges, datasets, holdout
+ gates, failure classification, optimization loops, and release reports.
+
+ It does not own your product state, credentials, UI, or model routing. Product
+ teams keep those boundaries; this package standardizes how runs are recorded,
+ checked, compared, and promoted.
+
+ ## Contents
+
+ - [When To Use It](#when-to-use-it)
+ - [Architecture](#architecture)
+ - [Install](#install)
+ - [Core Primitives](#core-primitives)
+ - [Examples](#examples)
+ - [Documentation](#documentation)
+ - [Development](#development)
+ - [Related Packages](#related-packages)
+
+ ## When To Use It
+
+ Use `agent-eval` when you need one or more of these:
+
+ - A reproducible eval harness for coding agents, builder agents, or multi-tool
+   workflows.
+ - Structured traces for agent runs: spans, artifacts, events, budgets, tool
+   calls, retrieval, judge output, and sandbox execution.
+ - Deterministic gates around build/test/deploy checks.
+ - LLM-as-judge or deterministic judge fleets with calibration and canaries.
+ - Dataset splits, holdouts, paired statistics, and release confidence gates.
+ - Failure taxonomy that distinguishes prompt, tool, sandbox, retrieval,
+   evaluator, and knowledge-readiness failures.
+ - Optimization loops over prompts, steering, code mutations, or full multi-shot
+   trajectories.
+ - Report data for internal launch reviews, CI gates, and research analysis.
+
+ ## Architecture
+
+ ```txt
+ agent/product run
+ -> TraceEmitter / TraceStore
+ -> SandboxHarness / MultiLayerVerifier / JudgeRunner
+ -> failure taxonomy + metrics
+ -> paired stats + held-out gates
+ -> optimization + release confidence + reports
  ```
 
- ## Who this is for
-
- - You ship a code generator (scaffolder, patcher, refactor agent) and need to gate on whether its output actually works.
- - You ship a content generator and need quality signal beyond "the LLM said it's good".
- - You want a release gate that fails on regressions you can name, not vibes.
+ Package responsibilities:
 
- If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — 5-minute mental model then use [`docs/feature-guide.md`](./docs/feature-guide.md) to choose the right primitive.
+ - `agent-eval`: run evidence, eval contracts, verification, statistics,
+   optimization, reporting.
+ - Product app: domain state, tools, credentials, UI, storage, deployment, model
+   gateway.
+ - `agent-runtime`: production agent-loop/session runtime.
+ - `agent-knowledge`: evidence stores, claim/page synthesis, retrieval, knowledge
+   readiness implementation.
 
- ## Quickstart
+ ## Install
 
- ### From any language: HTTP or RPC
+ ```sh
+ pnpm add @tangle-network/agent-eval
+ ```
 
- The fastest path. agent-eval ships a CLI that runs as either an HTTP server or a stdio RPC binary. Drive it from Python, Rust, Go, anything.
+ Wire protocol / CLI:
 
  ```sh
  npm i -g @tangle-network/agent-eval
-
- # HTTP — long-running
  agent-eval serve --port 5005
-
- # stdio RPC — one-shot, batch
- echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judge
  ```
 
- Python:
+ Python client:
+
  ```sh
  pip install tangle-agent-eval
  ```
- ```python
- from tangle_agent_eval import Client
- c = Client()
- r = c.judge(content="our scaffold ships zero-copy IO", rubric_name="anti-slop")
- print(r.composite, r.failure_modes)
- ```
-
- See [`docs/wire-protocol.md`](./docs/wire-protocol.md) for the full surface.
 
- ### From TypeScript: import directly
-
- In-process; no wire round-trip. Use this when your eval lives in the same Node process as your generator.
+ ## Core Primitives
+
+ | Primitive | Purpose |
+ |---|---|
+ | `TraceEmitter`, `TraceStore` | Append-only run/span/event/artifact/budget records. |
+ | `SandboxHarness` | Build/test/runtime checks with captured stdout, stderr, exit codes, wall time, and parsed test counts. |
+ | `MultiLayerVerifier` | Ordered verification stages with dependencies, skip-on-fail, findings, scores, and time caps. |
+ | `JudgeRunner` | Parallel deterministic or LLM-backed judges over the same artifact/run. |
+ | `runAgentControlLoop` | Observe/validate/decide/act loop with budgets, stop policies, and structured eval results. |
+ | `Dataset`, `RunRecord`, `HeldOutGate` | Versioned corpora, reproducible run metadata, and held-out promotion decisions. |
+ | `pairedBootstrap`, `pairedWilcoxon`, `bhAdjust` | Paired experiment statistics and multiple-comparison correction. |
+ | `classifyFailure` | Rule-based failure classification for agent, tool, sandbox, retrieval, and knowledge failures. |
+ | `runMultiShotOptimization` | Optimization over full agent trajectories with actionable side information. |
+ | `runPromptEvolution` | Prompt/steering/code evolution over scenario scores. |
+ | `evaluateReleaseConfidence` | Release scorecard across evidence volume, pass rate, score, overfit, cost, latency, and gates. |
+ | `summaryTable`, `paretoChart`, `gainHistogram` | Report-ready structured outputs. |
+ | `KnowledgeRequirement`, `KnowledgeBundle` | Shared contracts for knowledge readiness. |
+
+ ## Examples
+
+ Runnable examples live in [`examples/`](./examples):
+
+ - [`examples/same-sandbox-harness`](./examples/same-sandbox-harness) - run
+   multiple eval passes against the same workspace.
+ - [`examples/multi-shot-optimization`](./examples/multi-shot-optimization) -
+   optimize full agent trajectories with held-out promotion.
+ - [`examples/benchmarks`](./examples/benchmarks) - benchmark adapter shape and
+   reference benchmark wrappers.
+
+ The examples are intentionally kept outside the README so they can be expanded,
+ tested, and copied without turning this page into a tutorial.
+
+ ## Documentation
+
+ - [Concepts](./docs/concepts.md)
+ - [Feature Guide](./docs/feature-guide.md)
+ - [Control Runtime](./docs/control-runtime.md)
+ - [Knowledge Readiness](./docs/knowledge-readiness.md)
+ - [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
+ - [Feedback Trajectories](./docs/feedback-trajectories.md)
+ - [Wire Protocol](./docs/wire-protocol.md)
+
+ ## Development
 
  ```sh
- pnpm add @tangle-network/agent-eval
- ```
-
- The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](./.claude/skills/agent-eval/SKILL.md#minimal-working-path-builder-of-builders).
-
- ## Two ways to read this repo
-
- - **You're a human onboarding** — read [`docs/concepts.md`](./docs/concepts.md) for the mental model, then [`docs/wire-protocol.md`](./docs/wire-protocol.md) if you'll call from another language, or `SKILL.md` if you'll embed in TS.
- - **You're deciding what to integrate** — read [`docs/feature-guide.md`](./docs/feature-guide.md) for the layman explanation, use cases, feature map, and guardrails.
- - **You're an LLM agent writing integration code** — read `SKILL.md`. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.
-
- ## What's in the box
-
- | Module | What it does | Doc |
- |---|---|---|
- | `BuilderSession` | Three-layer eval orchestrator (builder → app-build → app-runtime) for code generators. | concepts.md §three-layer eval |
- | `MultiLayerVerifier` | Pipeline of layers (install → typecheck → build → semantic). Skip-on-fail, weighted aggregate. | concepts.md §verifiers |
- | `judges`, `createCustomJudge`, `createAntiSlopJudge` | LLM and deterministic judges. | SKILL.md |
- | Wire protocol (`agent-eval serve` / `rpc`) | HTTP and stdio RPC interface for cross-language clients. | wire-protocol.md |
- | `clients/python/` | First-party Python client (`tangle-agent-eval` on PyPI). Version-locked to npm. | clients/python/README.md |
- | `BenchmarkRunner`, `executeScenario`, `ConvergenceTracker` | Multi-turn scenario execution + cross-run tracking. | SKILL.md |
- | `runAgentControlLoop` | Policy-based runtime for agentic tasks: observe typed state, validate, decide, act, repeat with budgets, tracing, and stuck-loop guards. | [control-runtime.md](./docs/control-runtime.md) |
- | `FeedbackTrajectory`, `InMemoryFeedbackTrajectoryStore`, `FileSystemFeedbackTrajectoryStore` | Human/environment feedback loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | [feedback-trajectories.md](./docs/feedback-trajectories.md) |
- | `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
- | `ExperimentTracker`, steering optimizers, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
- | `runMultiShotOptimization`, `trialTraceFromMultiShotTrial` | GEPA-style optimization for variable-length agent trajectories with ASI, paired seeds, and optional held-out promotion gating. | [multi-shot-optimization.md](./docs/multi-shot-optimization.md) |
- | `evaluateReleaseConfidence`, `assertReleaseConfidence` | Release scorecard that composes corpus coverage, search/holdout run evidence, ASI diagnostics, overfit checks, and cost/latency budgets. | §Release confidence |
- | `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
- | `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
- | `correlationStudy`, `OutcomeStore`, `ProductRegistry` | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
- | Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |
-
- ## Release confidence
-
- Use `evaluateReleaseConfidence` at the release boundary for every consuming
- agent surface. It fails closed unless the release has a versioned corpus,
- search and holdout run evidence, score/pass-rate evidence, ASI for failures,
- and budget/overfit checks. Single-shot and multi-shot apps use the same path:
- single-shot traces are just trace evidence with `turnCount: 1`.
-
- ```ts
- import {
-   evaluateReleaseConfidence,
-   releaseTraceEvidenceFromMultiShotTrials,
- } from '@tangle-network/agent-eval'
-
- const scorecard = evaluateReleaseConfidence({
-   target: 'blueprint-agent/autoresearch',
-   candidateId: 'candidate-v3',
-   baselineId: 'baseline',
-   dataset: await dataset.manifest(),
-   runs: [...candidateRuns, ...baselineRuns],
-   traces: releaseTraceEvidenceFromMultiShotTrials(result.evolution.generations.flatMap((g) => g.trials)),
-   gateDecision: result.gate?.decision,
-   thresholds: {
-     minScenarioCount: 50,
-     minSearchRuns: 50,
-     minHoldoutRuns: 20,
-     minPassRate: 0.9,
-     minMeanScore: 0.8,
-     maxOverfitGap: 0.1,
-     maxMeanCostUsd: 0.05,
-     maxP95WallMs: 120_000,
-   },
- })
-
- if (!scorecard.promote) throw new Error(scorecard.summary)
+ pnpm install
+ pnpm typecheck
+ pnpm test
+ pnpm build
+ pnpm openapi
  ```
 
- ## Evolution loop
-
- For agent tasks that run across many chat turns or tool calls, start with
- [`runMultiShotOptimization`](./docs/multi-shot-optimization.md). It runs the
- same prompt-evolution core over full trajectories, carries actionable side
- information into reflection, and separates the search winner from the variant
- that actually passes held-out promotion.
-
- Closing the loop on a prompt or codebase is **two adapters + a config**. Compose `runPromptEvolution` with `createCompositeMutator` (plateau policy) and you get prompt-only optimization until improvement stalls, then automatic switch to code-channel mutations from a coding agent inside a `SandboxPool`.
-
- ```ts
- import {
-   createSandboxPool,
-   createSandboxCodeMutator,
-   createCompositeMutator,
-   buildReflectionPrompt,
-   parseReflectionResponse,
-   runPromptEvolution,
-   MutationTelemetry,
-   LineageRecorder,
-   CostLedger,
-   JsonlTrialCache,
- } from '@tangle-network/agent-eval'
-
- // 1. Prompt mutator — reflective-mutation reasons over top/bottom trials
- const promptMutator = {
-   async mutate({ parent, topTrials, bottomTrials, childCount }) {
-     const ctx = { target: 'forge-prompt', parentPayload: parent.payload, topTrials, bottomTrials, childCount }
-     const reflection = buildReflectionPrompt(ctx)
-     const raw = await yourLlm(reflection)
-     return parseReflectionResponse(raw, childCount).map((p, i) => ({
-       id: `${parent.id}.g${parent.generation + 1}.prompt.${i}`,
-       payload: p.payload,
-       generation: parent.generation + 1,
-       parentId: parent.id,
-       label: p.label,
-       rationale: p.rationale,
-     }))
-   },
- }
-
- // 2. Code mutator — runs a coding agent in a sandbox slot, captures the diff
- const pool = createSandboxPool({
-   size: 4,
-   factory: {
-     async create(id) { return await yourSandboxClient.create({ name: id }) },
-     async reset(slot) { await slot.resource.exec('git reset --hard origin/main && git clean -fd') },
-     async destroy(slot) { await slot.resource.delete() },
-   },
- })
- const codeMutator = createSandboxCodeMutator({
-   pool,
-   runner: async ({ slot, parent, topTrials, bottomTrials }) => {
-     const result = await slot.resource.task(`Improve the prompt at /repo/forge-prompt.ts...`)
-     return [{ ok: true, latencyMs: result.durationMs, costUsd: result.costUsd, artifact: { diff: result.diff } }]
-   },
-   toVariantPayload: (outcome, parent) => ({ ...parent.payload, codeMutation: outcome.artifact }),
- })
-
- // 3. Compose — plateau policy auto-switches when prompt evolution stalls
- const composite = createCompositeMutator({
-   primary: promptMutator,
-   secondary: codeMutator,
-   policy: 'plateau',
-   plateauThreshold: 0.02,
-   plateauPatience: 2,
- })
-
- // 4. Run — durable telemetry to disk, crash-resumable
- const result = await runPromptEvolution({
-   runId: `forge_${Date.now()}`,
-   target: 'forge-prompt',
-   seedVariants: [{ id: 'v0', payload: { text: currentPrompt }, generation: 0, label: 'baseline' }],
-   scenarioIds: referenceCorpus.map(s => s.id),
-   reps: 3,
-   generations: 5,
-   populationSize: 4,
-   scoreAdapter: { /* runs your eval against (variant, scenario, rep) */ },
-   mutateAdapter: composite,
-   cache: new JsonlTrialCache('.evolve/cache.jsonl'),
-   objectives: [
-     { name: 'score', direction: 'maximize', value: a => a.meanScore },
-     { name: 'cost', direction: 'minimize', value: a => a.meanCost },
-   ],
- })
- ```
+ Run the local server:
 
- The `MutationTelemetry`, `LineageRecorder`, and `CostLedger` pass into the `code-mutator` (and any consumer that wants them) — they emit append-only JSONL of every attempt (success + failure with reason) and a snapshot lineage tree, so a finished run leaves a forensically complete trail under one directory.
-
- For the full primitive surface and rationale, read each module's JSDoc — `prompt-evolution.ts`, `composite-mutator.ts`, `sandbox-pool.ts`, `code-mutator.ts`, `reflective-mutation.ts`, `evolution-telemetry.ts`.
-
- ## Feedback trajectory loop
-
- When normal agent usage should generate training/eval signal, use feedback
- trajectories. They turn approvals, rejections, option choices, edits, metrics,
- and policy blocks into reusable examples.
-
- ```ts
- import {
-   createFeedbackTrajectory,
-   summarizePreferenceMemory,
-   feedbackTrajectoriesToDatasetScenarios,
-   feedbackTrajectoriesToOptimizerRows,
- } from '@tangle-network/agent-eval'
-
- const trajectory = createFeedbackTrajectory({
-   projectId: 'research-agent',
-   scenarioId: 'brief-review',
-   task: { intent: 'Revise a research brief until it is specific and sourced.' },
-   attempts: [{
-     id: 'draft-1',
-     stepIndex: 0,
-     artifactType: 'research',
-     artifact: { summary: 'Initial brief with weak sourcing.' },
-     createdAt: new Date().toISOString(),
-   }],
-   labels: [{
-     source: 'user',
-     kind: 'revision_request',
-     value: 'needs stronger evidence',
-     reason: 'add primary sources and remove unsupported claims',
-     severity: 'error',
-     createdAt: new Date().toISOString(),
-   }],
- })
-
- const memory = summarizePreferenceMemory([trajectory])
- const scenarios = feedbackTrajectoriesToDatasetScenarios([trajectory])
- const optimizerRows = feedbackTrajectoriesToOptimizerRows([trajectory])
+ ```sh
+ pnpm build
+ node dist/cli.js serve --port 5005
  ```
 
- This is the bridge between feedback and optimization: review signals become
- immediate memory, replayable eval scenarios, and prompt/signature/code optimizer
- input. See [`docs/feedback-trajectories.md`](./docs/feedback-trajectories.md).
-
- ## v0.16 highlights — production-rigor primitives
-
- These are the primitives any team running prompt-optimization in production needs, regardless of whether they're writing a paper. v0.15 shipped them under "paper-grade" naming; v0.16 corrects that — they're production-first, paper-grade as a side effect.
-
- - `HeldOutGate` — held-out paired-delta gate with `few_runs` /
-   `negative_delta` / `overfit_gap` rejection codes and a full evidence
-   block on every decision. Sits alongside the existing bootstrap-CI
-   `promotion-gate.ts`: that one asks "is this real or noise?", this one
-   asks "is this a real win on held-out and not overfit?". Use both.
- - `RunRecord` — typed run schema with mandatory snapshot-pinned `model`,
-   `promptHash`, `configHash`, `commitSha`, `costUsd`, `splitTag`.
-   Runtime validator throws on missing fields. Reproducibility falls
-   out for free.
- - `pairedBootstrap`, `pairedWilcoxon`, `bhAdjust` — statistical
-   primitives every rigorous A/B test needs. Already-existing primitives
-   are re-exported for paper-style aliases.
- - `runCanaries` — silent judge-fallback, calibration drift (KS test),
-   distribution shift (chi-square). Catches the failure mode where your
-   judge silently degrades to a constant-0.30 confidence and you ship
-   configs graded by a stub.
- - `summaryTable`, `paretoChart`, `gainHistogram` — A/B reporting
-   helpers. `summaryTable` emits markdown with means + 95% bootstrap
-   CIs + paired Wilcoxon p (BH-adjusted) + Cohen's d. Useful for both
-   internal status reports and paper Table 1s.
- - `Researcher` — stable interface for an external agent that drives the
-   meta-loop (`inspectFailures` → `proposeChange` → `applyChange` →
-   `evaluateChange`). Ship a `NoopResearcher` as a placeholder; real
-   implementations live downstream.
- - `benchmarks/routing` — synthetic 16-task router benchmark we own.
-   Ships in the package. Reference wrappers for GSM8K and SWE-Bench
-   Lite live under `examples/benchmarks/` — read, copy, adapt. All
-   three implement one `BenchmarkAdapter` shape with deterministic
-   splits and fail-loud env-var configuration.
-
- ### v0.16 changes from v0.15
-
- - Renamed `paperTable` → `summaryTable`, `paretoFigure` → `paretoChart`,
-   `gainDistributionFigure` → `gainHistogram`. Underlying semantics
-   unchanged. Type names follow (`SummaryTable`, `SummaryTableOptions`,
-   `SummaryTableRow`).
- - File: `src/paper-report.ts` → `src/summary-report.ts`.
- - Drop the "paper-grade" framing — the primitives are production-first.
-
- See `CHANGELOG.md` for the full list. `.claude/skills/agent-eval/SKILL.md`
- covers usage directives and pitfalls.
-
- ## Tech stack
-
- - TypeScript strict, no semicolons, single quotes, 2-space indent
- - `tsup` for bundling, `vitest` for tests
- - `@tangle-network/tcloud` for LLM calls (judges, driver)
- - `hono` + `@asteasolutions/zod-to-openapi` for the wire protocol
-
- ## Develop
+ Python client tests:
 
  ```sh
- pnpm install
- pnpm typecheck
- pnpm test
  pnpm build
- pnpm openapi # write dist/openapi.json from the wire schemas
-
- # Run the server locally
- node dist/cli.js serve --port 5005
-
- # Python client tests (require pnpm build first)
- cd clients/python && pip install -e ".[dev]" && pytest
+ cd clients/python
+ pip install -e ".[dev]"
+ pytest
  ```
 
  ## Release
 
- `@tangle-network/agent-eval` (npm) and `tangle-agent-eval` (PyPI) ship from the same git tag in the same CI workflow. If either fails to publish, neither does. Versions are locked.
+ `@tangle-network/agent-eval` publishes to npm. The Python client lives under
+ `clients/python` and is versioned from this repository.
 
- ## Related
+ ## Related Packages
 
+ - [`@tangle-network/agent-runtime`](https://github.com/tangle-network/agent-runtime)
+ - [`@tangle-network/agent-knowledge`](https://github.com/tangle-network/agent-knowledge)
+ - [`@tangle-network/agent-integrations`](https://github.com/tangle-network/agent-integrations)
  - [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
- - [`@tangle-network/agent-client`](https://github.com/tangle-network/agent-client)
  - [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud)
 
  ## License
@@ -1,6 +1,6 @@
  import {
    callLlmJson
- } from "./chunk-ITN4YOZY.js";
+ } from "./chunk-JAOLXRIA.js";
 
  // src/wire/schemas.ts
  import { extendZodWithOpenApi } from "@asteasolutions/zod-to-openapi";
@@ -591,4 +591,4 @@ export {
    runRpcOnce,
    runRpcBatch
  };
- //# sourceMappingURL=chunk-OZPRSK4A.js.map
+ //# sourceMappingURL=chunk-CJJSB6ZQ.js.map
@@ -76,6 +76,56 @@ function stripFencedJson(raw) {
    const m = trimmed.match(/^```(?:json)?\s*\n?([\s\S]*?)\n?```\s*$/);
    return m ? m[1].trim() : trimmed;
  }
+ function extractJsonPayload(raw) {
+   const stripped = stripFencedJson(raw);
+   try {
+     JSON.parse(stripped);
+     return stripped;
+   } catch {
+   }
+   const starts = [...stripped.matchAll(/[\[{]/g)].map((match) => match.index).filter((index) => index != null);
+   for (const start of starts) {
+     const candidate = extractBalancedJson(stripped, start);
+     if (!candidate) continue;
+     try {
+       JSON.parse(candidate);
+       return candidate;
+     } catch {
+     }
+   }
+   return stripped;
+ }
+ function extractBalancedJson(input, start) {
+   const opener = input[start];
+   const closer = opener === "{" ? "}" : opener === "[" ? "]" : null;
+   if (!closer) return null;
+   const stack = [closer];
+   let isInString = false;
+   let isEscaped = false;
+   for (let i = start + 1; i < input.length; i++) {
+     const char = input[i];
+     if (isEscaped) {
+       isEscaped = false;
+       continue;
+     }
+     if (char === "\\") {
+       isEscaped = isInString;
+       continue;
+     }
+     if (char === '"') {
+       isInString = !isInString;
+       continue;
+     }
+     if (isInString) continue;
+     if (char === "{") stack.push("}");
+     else if (char === "[") stack.push("]");
+     else if (char === stack[stack.length - 1]) {
+       stack.pop();
+       if (stack.length === 0) return input.slice(start, i + 1);
+     }
+   }
+   return null;
+ }
  async function callLlm(req, opts = {}) {
    const baseUrl = (opts.baseUrl ?? DEFAULT_BASE_URL).replace(/\/+$/, "");
    const url = `${baseUrl}/chat/completions`;
@@ -159,7 +209,7 @@ async function callLlmJson(req, opts = {}) {
    }
  }
  function parseJsonSafely(content, model) {
-   const stripped = stripFencedJson(content);
+   const stripped = extractJsonPayload(content);
    try {
      return JSON.parse(stripped);
    } catch (err) {
@@ -212,4 +262,4 @@ export {
    probeLlm,
    LlmClient
  };
- //# sourceMappingURL=chunk-ITN4YOZY.js.map
+ //# sourceMappingURL=chunk-JAOLXRIA.js.map
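
The practical effect of swapping `stripFencedJson` for `extractJsonPayload` in `parseJsonSafely` is that a model reply with prose around the JSON can still parse. The following is a minimal TypeScript sketch of that behavior, not the package's code: it re-implements the balanced scan from the hunk above, omits the fence-stripping step, and replaces the `matchAll` scan with a plain index loop. The `reply` string is an invented example.

```ts
// Sketch only: follows the balanced-bracket scan shown in the diff above.
// Returns the substring from `start` to its matching closer, or null.
function extractBalancedJson(input: string, start: number): string | null {
  const opener = input[start]
  const closer = opener === '{' ? '}' : opener === '[' ? ']' : null
  if (!closer) return null
  const stack: string[] = [closer]
  let isInString = false
  let isEscaped = false
  for (let i = start + 1; i < input.length; i++) {
    const char = input[i]!
    if (isEscaped) { isEscaped = false; continue }
    // A backslash only escapes the next character inside a string literal.
    if (char === '\\') { isEscaped = isInString; continue }
    if (char === '"') { isInString = !isInString; continue }
    if (isInString) continue // brackets inside strings do not count
    if (char === '{') stack.push('}')
    else if (char === '[') stack.push(']')
    else if (char === stack[stack.length - 1]) {
      stack.pop()
      if (stack.length === 0) return input.slice(start, i + 1)
    }
  }
  return null
}

// Assumption: fence stripping is skipped here; the real function calls
// stripFencedJson first.
function extractJsonPayload(raw: string): string {
  const text = raw.trim()
  try {
    JSON.parse(text)
    return text // already valid JSON
  } catch {
    // Not pure JSON; try each opening bracket in turn.
  }
  for (let start = 0; start < text.length; start++) {
    if (text[start] !== '{' && text[start] !== '[') continue
    const candidate = extractBalancedJson(text, start)
    if (!candidate) continue
    try {
      JSON.parse(candidate)
      return candidate
    } catch {
      // Balanced but not JSON (e.g. braces in prose); keep scanning.
    }
  }
  return text // caller's JSON.parse will surface the error
}

// Hypothetical model reply: prose braces first, then the real payload.
const reply = 'Sure {here you go}: {"score": 0.8, "notes": ["a}b"]} Hope that helps.'
const payload = extractJsonPayload(reply)
console.log(JSON.parse(payload).score) // 0.8
```

The first `{` belongs to prose, so its balanced candidate fails `JSON.parse` and the scan moves on; the `}` inside `"a}b"` is ignored because the scanner tracks string state. When nothing parses, the input is returned unchanged so the caller's existing error path still runs.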
@@ -0,0 +1 @@
+ {"version":3,"sources":["../src/llm-client.ts"],"sourcesContent":["/**\n * LLM client with graceful degrade.\n *\n * OpenAI-compatible `/v1/chat/completions` client with:\n * - Exponential-backoff retry on 429 + 5xx gateway errors (502/503/504).\n * - Retry on transient network errors (fetch failed, AbortError, ECONNRESET).\n * - Graceful json_schema → json_object degrade on 400 with schema-reject body.\n * - Fenced-JSON stripping (```json ... ```) for models that wrap structured output.\n * - Configurable base URL + api key / bearer, works with LiteLLM proxies, OpenAI\n * directly, cli-bridge subscriptions, and any router that speaks the spec.\n *\n * Usage:\n * const { value, result } = await callLlmJson<MyType>(\n * { model: 'gpt-4o', messages: [...], jsonSchema: { name: 'x', schema: {...} } },\n * { baseUrl: 'https://router.tangle.tools/v1', apiKey: process.env.KEY },\n * )\n *\n * This is THE llm-calling seam for agent-eval primitives that need structured\n * output (semantic concept judge, reviewer directives, critic scores). Primitives\n * that need free-form text use `callLlm` and parse output themselves.\n */\n\n// ─── Types ──────────────────────────────────────────────────────────────\n\nexport interface LlmMessage {\n role: 'system' | 'user' | 'assistant'\n /**\n * Either a plain text content string OR a multimodal content array\n * (text + image_url parts) for vision-capable models.\n */\n content:\n | string\n | Array<\n | { type: 'text'; text: string }\n | { type: 'image_url'; image_url: { url: string; detail?: 'auto' | 'low' | 'high' } }\n >\n}\n\nexport interface LlmCallRequest {\n model: string\n messages: LlmMessage[]\n /** Optional JSON-mode response format (response_format: json_object). */\n jsonMode?: boolean\n /** Optional structured output via JSON Schema. Falls back to json_object on 400. */\n jsonSchema?: { name: string; schema: Record<string, unknown> }\n temperature?: number\n maxTokens?: number\n /** Per-call timeout, default 60s. 
*/\n timeoutMs?: number\n}\n\nexport interface LlmUsage {\n promptTokens: number\n completionTokens: number\n totalTokens: number\n /** Proxies populate this when prompt caching is on. */\n cachedPromptTokens?: number\n}\n\nexport interface LlmCallResult {\n /** The text content of the first choice. Empty string if none. */\n content: string\n usage: LlmUsage\n /**\n * Cost in USD. Pulled from proxy's `_response_cost` field when present;\n * `null` when neither the proxy nor the caller can derive it.\n */\n costUsd: number | null\n /** Model name actually used (echoed from response). */\n model: string\n /** Wall-clock duration of the HTTP call (last attempt, if retried). */\n durationMs: number\n /** Raw response body. */\n raw: Record<string, unknown>\n}\n\nexport class LlmCallError extends Error {\n constructor(\n message: string,\n public readonly status: number,\n public readonly body: string,\n public readonly model: string,\n ) {\n super(message)\n this.name = 'LlmCallError'\n }\n}\n\nexport interface LlmClientOptions {\n /** Base URL (without trailing slash). Must end at the `/v1` prefix. */\n baseUrl?: string\n /** Bearer token — either `apiKey` or `bearer` populates `Authorization: Bearer ...`. */\n apiKey?: string\n bearer?: string\n /** Override for the `Authorization` header (e.g. `X-Auth: ...`). Takes precedence over apiKey/bearer. */\n authHeader?: { name: string; value: string }\n /** Default timeout in ms. Per-call can override. */\n defaultTimeoutMs?: number\n /** Max retry attempts on retriable errors. Default 3 (1 initial + 2 retries). */\n maxRetries?: number\n /** Fetch implementation — defaults to global `fetch`. Override for custom transport (e.g. tests). 
*/\n fetch?: typeof fetch\n}\n\n// ─── Internals ──────────────────────────────────────────────────────────\n\nconst DEFAULT_BASE_URL = 'https://router.tangle.tools/v1'\nconst DEFAULT_TIMEOUT_MS = 60_000\nconst DEFAULT_MAX_RETRIES = 3\n\nconst RETRYABLE_STATUS = new Set([429, 502, 503, 504])\n\nfunction isRetryableError(err: unknown): boolean {\n if (err instanceof LlmCallError) return RETRYABLE_STATUS.has(err.status)\n if (err instanceof Error) {\n return (\n err.name === 'AbortError' ||\n err.name === 'TimeoutError' ||\n /fetch failed|ECONNRESET|ETIMEDOUT|EAI_AGAIN/i.test(err.message)\n )\n }\n return false\n}\n\nfunction parseRetryAfter(headers: Headers): number | null {\n const h = headers.get('retry-after')\n if (!h) return null\n const asNumber = Number(h)\n if (Number.isFinite(asNumber) && asNumber > 0) return asNumber * 1000\n const asDate = Date.parse(h)\n if (Number.isFinite(asDate)) return Math.max(0, asDate - Date.now())\n return null\n}\n\nfunction backoffMs(attempt: number): number {\n // 500ms, 1s, 2s, 4s, ...\n return Math.min(500 * Math.pow(2, attempt), 16_000)\n}\n\nfunction buildHeaders(opts: LlmClientOptions): Record<string, string> {\n const headers: Record<string, string> = {\n 'Content-Type': 'application/json',\n Accept: 'application/json',\n }\n if (opts.authHeader) {\n headers[opts.authHeader.name] = opts.authHeader.value\n } else if (opts.bearer || opts.apiKey) {\n headers.Authorization = `Bearer ${opts.bearer ?? 
opts.apiKey}`\n }\n return headers\n}\n\nfunction isSchemaRejection(status: number, body: string): boolean {\n if (status !== 400) return false\n const lower = body.toLowerCase()\n return (\n lower.includes('response_format') ||\n lower.includes('json_schema') ||\n lower.includes('is unavailable') ||\n lower.includes('not supported')\n )\n}\n\nfunction buildBody(req: LlmCallRequest, forceJsonObject: boolean): Record<string, unknown> {\n const body: Record<string, unknown> = {\n model: req.model,\n messages: req.messages,\n temperature: req.temperature ?? 0,\n }\n if (req.maxTokens != null) body.max_tokens = req.maxTokens\n\n if (req.jsonSchema && !forceJsonObject) {\n body.response_format = {\n type: 'json_schema',\n json_schema: { name: req.jsonSchema.name, schema: req.jsonSchema.schema, strict: true },\n }\n } else if (req.jsonMode || req.jsonSchema) {\n body.response_format = { type: 'json_object' }\n }\n\n return body\n}\n\nasync function sleep(ms: number): Promise<void> {\n return new Promise((resolve) => setTimeout(resolve, ms))\n}\n\n// ─── Public API ─────────────────────────────────────────────────────────\n\n/**\n * Strip a ```json / ``` code fence if the model emitted one.\n * Idempotent for naked JSON. Some models (claude-code via router, certain\n * deepseek models) wrap output even under json_object.\n */\nexport function stripFencedJson(raw: string): string {\n const trimmed = raw.trim()\n const m = trimmed.match(/^```(?:json)?\\s*\\n?([\\s\\S]*?)\\n?```\\s*$/)\n return m ? 
m[1]!.trim() : trimmed\n}\n\nexport function extractJsonPayload(raw: string): string {\n const stripped = stripFencedJson(raw)\n try {\n JSON.parse(stripped)\n return stripped\n } catch {\n // Continue with balanced extraction below.\n }\n\n const starts = [...stripped.matchAll(/[\\[{]/g)].map((match) => match.index).filter((index) => index != null)\n for (const start of starts) {\n const candidate = extractBalancedJson(stripped, start)\n if (!candidate) continue\n try {\n JSON.parse(candidate)\n return candidate\n } catch {\n // Keep scanning; earlier braces may belong to prose.\n }\n }\n\n return stripped\n}\n\nfunction extractBalancedJson(input: string, start: number): string | null {\n const opener = input[start]\n const closer = opener === '{' ? '}' : opener === '[' ? ']' : null\n if (!closer) return null\n\n const stack: string[] = [closer]\n let isInString = false\n let isEscaped = false\n\n for (let i = start + 1; i < input.length; i++) {\n const char = input[i]!\n if (isEscaped) {\n isEscaped = false\n continue\n }\n if (char === '\\\\') {\n isEscaped = isInString\n continue\n }\n if (char === '\"') {\n isInString = !isInString\n continue\n }\n if (isInString) continue\n\n if (char === '{') stack.push('}')\n else if (char === '[') stack.push(']')\n else if (char === stack[stack.length - 1]) {\n stack.pop()\n if (stack.length === 0) return input.slice(start, i + 1)\n }\n }\n\n return null\n}\n\n/**\n * Low-level call. Returns raw content + usage + cost. Retries on transient\n * failures; does NOT degrade schema here — callers that want graceful\n * degrade use `callLlmJson`.\n */\nexport async function callLlm(\n req: LlmCallRequest,\n opts: LlmClientOptions = {},\n): Promise<LlmCallResult> {\n const baseUrl = (opts.baseUrl ?? DEFAULT_BASE_URL).replace(/\\/+$/, '')\n const url = `${baseUrl}/chat/completions`\n const timeoutMs = req.timeoutMs ?? opts.defaultTimeoutMs ?? DEFAULT_TIMEOUT_MS\n const maxRetries = opts.maxRetries ?? 
DEFAULT_MAX_RETRIES\n const fetchFn = opts.fetch ?? globalThis.fetch\n const headers = buildHeaders(opts)\n\n let lastErr: unknown\n for (let attempt = 0; attempt < maxRetries; attempt++) {\n const controller = new AbortController()\n const timeoutHandle = setTimeout(() => controller.abort(), timeoutMs)\n const started = Date.now()\n\n try {\n const res = await fetchFn(url, {\n method: 'POST',\n headers,\n body: JSON.stringify(buildBody(req, false)),\n signal: controller.signal,\n })\n clearTimeout(timeoutHandle)\n\n if (!res.ok) {\n const body = await res.text()\n const err = new LlmCallError(\n `LLM call ${res.status}: ${body.slice(0, 300)}`,\n res.status,\n body,\n req.model,\n )\n if (RETRYABLE_STATUS.has(res.status) && attempt < maxRetries - 1) {\n lastErr = err\n const retryAfter = parseRetryAfter(res.headers)\n await sleep(retryAfter ?? backoffMs(attempt))\n continue\n }\n throw err\n }\n\n const json = (await res.json()) as Record<string, unknown>\n const choice = (json.choices as Array<{ message?: { content?: string } }> | undefined)?.[0]\n const usageRaw = (json.usage as Record<string, unknown> | undefined) ?? {}\n const costFromProxy = (json._response_cost ?? json.cost_usd) as number | undefined\n\n return {\n content: choice?.message?.content ?? '',\n usage: {\n promptTokens: Number(usageRaw.prompt_tokens ?? 0),\n completionTokens: Number(usageRaw.completion_tokens ?? 0),\n totalTokens: Number(usageRaw.total_tokens ?? 0),\n cachedPromptTokens:\n usageRaw.prompt_tokens_details &&\n typeof usageRaw.prompt_tokens_details === 'object'\n ? Number(\n (usageRaw.prompt_tokens_details as Record<string, unknown>).cached_tokens ?? 0,\n )\n : undefined,\n },\n costUsd: typeof costFromProxy === 'number' ? costFromProxy : null,\n model: (json.model as string) ?? 
req.model,\n durationMs: Date.now() - started,\n raw: json,\n }\n } catch (err) {\n clearTimeout(timeoutHandle)\n lastErr = err\n if (attempt < maxRetries - 1 && isRetryableError(err)) {\n await sleep(backoffMs(attempt))\n continue\n }\n throw err\n }\n }\n throw lastErr instanceof Error ? lastErr : new Error(String(lastErr))\n}\n\n/**\n * Structured-output call. Returns parsed JSON plus the raw result envelope.\n * Degrades `jsonSchema` → `jsonMode` on a 400 that names the schema param —\n * critical for deepseek-v3/v4, kimi-k2.6, and other models that don't accept\n * the `response_format.json_schema` shape but DO accept `json_object`.\n */\nexport async function callLlmJson<T = unknown>(\n req: LlmCallRequest,\n opts: LlmClientOptions = {},\n): Promise<{ value: T; result: LlmCallResult }> {\n try {\n const result = await callLlm({ ...req, jsonMode: req.jsonMode ?? !req.jsonSchema }, opts)\n const value = parseJsonSafely<T>(result.content, result.model)\n return { value, result }\n } catch (err) {\n if (err instanceof LlmCallError && isSchemaRejection(err.status, err.body) && req.jsonSchema) {\n // Degrade to json_object + retry.\n const degradedReq: LlmCallRequest = { ...req, jsonMode: true, jsonSchema: undefined }\n const result = await callLlm(degradedReq, opts)\n const value = parseJsonSafely<T>(result.content, result.model)\n return { value, result }\n }\n throw err\n }\n}\n\nfunction parseJsonSafely<T>(content: string, model: string): T {\n const stripped = extractJsonPayload(content)\n try {\n return JSON.parse(stripped) as T\n } catch (err) {\n throw new Error(\n `LLM returned non-JSON content (model=${model}): ${\n err instanceof Error ? err.message : String(err)\n }\\n--- raw content ---\\n${content.slice(0, 800)}`,\n )\n }\n}\n\n/**\n * Probe whether a model is reachable. Returns latency + null error on\n * success; `ok=false` + error message on any failure (HTTP, timeout,\n * network, parse). 
Designed for sweep preflights — fail loud at the\n * boundary before burning a 30-leaf run on a misconfigured router.\n *\n * Sends a tiny `ping` message with `maxTokens=64`. Reasoning models\n * (glm-5.1, deepseek-v4) can burn the entire budget on internal reasoning\n * for short prompts, so don't tighten this further. We don't validate\n * content; HTTP 200 means reachable.\n */\nexport async function probeLlm(\n model: string,\n opts: LlmClientOptions & { timeoutMs?: number } = {},\n): Promise<{ ok: boolean; latencyMs: number; error: string | null }> {\n const start = Date.now()\n try {\n await callLlm(\n {\n model,\n messages: [{ role: 'user', content: 'ping' }],\n maxTokens: 64,\n timeoutMs: opts.timeoutMs ?? 30_000,\n },\n opts,\n )\n return { ok: true, latencyMs: Date.now() - start, error: null }\n } catch (err) {\n return {\n ok: false,\n latencyMs: Date.now() - start,\n error: err instanceof Error ? err.message : String(err),\n }\n }\n}\n\n/**\n * Stateful client — construct once with defaults, call many times.\n * Thin wrapper around the free functions; exists for callers that want\n * to inject a single configured instance into multiple primitives.\n */\nexport class LlmClient {\n constructor(private readonly opts: LlmClientOptions = {}) {}\n\n call(req: LlmCallRequest, per?: LlmClientOptions): Promise<LlmCallResult> {\n return callLlm(req, { ...this.opts, ...per })\n }\n\n callJson<T = unknown>(\n req: LlmCallRequest,\n per?: LlmClientOptions,\n ): Promise<{ value: T; result: LlmCallResult }> {\n return callLlmJson<T>(req, { ...this.opts, ...per })\n 
}\n}\n"],"mappings":";AA4EO,IAAM,eAAN,cAA2B,MAAM;AAAA,EACtC,YACE,SACgB,QACA,MACA,OAChB;AACA,UAAM,OAAO;AAJG;AACA;AACA;AAGhB,SAAK,OAAO;AAAA,EACd;AAAA,EANkB;AAAA,EACA;AAAA,EACA;AAKpB;AAoBA,IAAM,mBAAmB;AACzB,IAAM,qBAAqB;AAC3B,IAAM,sBAAsB;AAE5B,IAAM,mBAAmB,oBAAI,IAAI,CAAC,KAAK,KAAK,KAAK,GAAG,CAAC;AAErD,SAAS,iBAAiB,KAAuB;AAC/C,MAAI,eAAe,aAAc,QAAO,iBAAiB,IAAI,IAAI,MAAM;AACvE,MAAI,eAAe,OAAO;AACxB,WACE,IAAI,SAAS,gBACb,IAAI,SAAS,kBACb,+CAA+C,KAAK,IAAI,OAAO;AAAA,EAEnE;AACA,SAAO;AACT;AAEA,SAAS,gBAAgB,SAAiC;AACxD,QAAM,IAAI,QAAQ,IAAI,aAAa;AACnC,MAAI,CAAC,EAAG,QAAO;AACf,QAAM,WAAW,OAAO,CAAC;AACzB,MAAI,OAAO,SAAS,QAAQ,KAAK,WAAW,EAAG,QAAO,WAAW;AACjE,QAAM,SAAS,KAAK,MAAM,CAAC;AAC3B,MAAI,OAAO,SAAS,MAAM,EAAG,QAAO,KAAK,IAAI,GAAG,SAAS,KAAK,IAAI,CAAC;AACnE,SAAO;AACT;AAEA,SAAS,UAAU,SAAyB;AAE1C,SAAO,KAAK,IAAI,MAAM,KAAK,IAAI,GAAG,OAAO,GAAG,IAAM;AACpD;AAEA,SAAS,aAAa,MAAgD;AACpE,QAAM,UAAkC;AAAA,IACtC,gBAAgB;AAAA,IAChB,QAAQ;AAAA,EACV;AACA,MAAI,KAAK,YAAY;AACnB,YAAQ,KAAK,WAAW,IAAI,IAAI,KAAK,WAAW;AAAA,EAClD,WAAW,KAAK,UAAU,KAAK,QAAQ;AACrC,YAAQ,gBAAgB,UAAU,KAAK,UAAU,KAAK,MAAM;AAAA,EAC9D;AACA,SAAO;AACT;AAEA,SAAS,kBAAkB,QAAgB,MAAuB;AAChE,MAAI,WAAW,IAAK,QAAO;AAC3B,QAAM,QAAQ,KAAK,YAAY;AAC/B,SACE,MAAM,SAAS,iBAAiB,KAChC,MAAM,SAAS,aAAa,KAC5B,MAAM,SAAS,gBAAgB,KAC/B,MAAM,SAAS,eAAe;AAElC;AAEA,SAAS,UAAU,KAAqB,iBAAmD;AACzF,QAAM,OAAgC;AAAA,IACpC,OAAO,IAAI;AAAA,IACX,UAAU,IAAI;AAAA,IACd,aAAa,IAAI,eAAe;AAAA,EAClC;AACA,MAAI,IAAI,aAAa,KAAM,MAAK,aAAa,IAAI;AAEjD,MAAI,IAAI,cAAc,CAAC,iBAAiB;AACtC,SAAK,kBAAkB;AAAA,MACrB,MAAM;AAAA,MACN,aAAa,EAAE,MAAM,IAAI,WAAW,MAAM,QAAQ,IAAI,WAAW,QAAQ,QAAQ,KAAK;AAAA,IACxF;AAAA,EACF,WAAW,IAAI,YAAY,IAAI,YAAY;AACzC,SAAK,kBAAkB,EAAE,MAAM,cAAc;AAAA,EAC/C;AAEA,SAAO;AACT;AAEA,eAAe,MAAM,IAA2B;AAC9C,SAAO,IAAI,QAAQ,CAAC,YAAY,WAAW,SAAS,EAAE,CAAC;AACzD;AASO,SAAS,gBAAgB,KAAqB;AACnD,QAAM,UAAU,IAAI,KAAK;AACzB,QAAM,IAAI,QAAQ,MAAM,yCAAyC;AACjE,SAAO,IAAI,EAAE,CAAC,EAAG,KAAK,IAAI;AAC5B;AAEO,SAAS,mBAAmB,KAAqB;AACtD,QAAM,WAAW,gBAAgB,GAAG;AACpC,MAAI;AACF,SAAK,MAAM,QAAQ;AACnB,WAAO;AAAA,EACT,QAAQ;AAAA,EAER;AAEA,QAAM
,SAAS,CAAC,GAAG,SAAS,SAAS,QAAQ,CAAC,EAAE,IAAI,CAAC,UAAU,MAAM,KAAK,EAAE,OAAO,CAAC,UAAU,SAAS,IAAI;AAC3G,aAAW,SAAS,QAAQ;AAC1B,UAAM,YAAY,oBAAoB,UAAU,KAAK;AACrD,QAAI,CAAC,UAAW;AAChB,QAAI;AACF,WAAK,MAAM,SAAS;AACpB,aAAO;AAAA,IACT,QAAQ;AAAA,IAER;AAAA,EACF;AAEA,SAAO;AACT;AAEA,SAAS,oBAAoB,OAAe,OAA8B;AACxE,QAAM,SAAS,MAAM,KAAK;AAC1B,QAAM,SAAS,WAAW,MAAM,MAAM,WAAW,MAAM,MAAM;AAC7D,MAAI,CAAC,OAAQ,QAAO;AAEpB,QAAM,QAAkB,CAAC,MAAM;AAC/B,MAAI,aAAa;AACjB,MAAI,YAAY;AAEhB,WAAS,IAAI,QAAQ,GAAG,IAAI,MAAM,QAAQ,KAAK;AAC7C,UAAM,OAAO,MAAM,CAAC;AACpB,QAAI,WAAW;AACb,kBAAY;AACZ;AAAA,IACF;AACA,QAAI,SAAS,MAAM;AACjB,kBAAY;AACZ;AAAA,IACF;AACA,QAAI,SAAS,KAAK;AAChB,mBAAa,CAAC;AACd;AAAA,IACF;AACA,QAAI,WAAY;AAEhB,QAAI,SAAS,IAAK,OAAM,KAAK,GAAG;AAAA,aACvB,SAAS,IAAK,OAAM,KAAK,GAAG;AAAA,aAC5B,SAAS,MAAM,MAAM,SAAS,CAAC,GAAG;AACzC,YAAM,IAAI;AACV,UAAI,MAAM,WAAW,EAAG,QAAO,MAAM,MAAM,OAAO,IAAI,CAAC;AAAA,IACzD;AAAA,EACF;AAEA,SAAO;AACT;AAOA,eAAsB,QACpB,KACA,OAAyB,CAAC,GACF;AACxB,QAAM,WAAW,KAAK,WAAW,kBAAkB,QAAQ,QAAQ,EAAE;AACrE,QAAM,MAAM,GAAG,OAAO;AACtB,QAAM,YAAY,IAAI,aAAa,KAAK,oBAAoB;AAC5D,QAAM,aAAa,KAAK,cAAc;AACtC,QAAM,UAAU,KAAK,SAAS,WAAW;AACzC,QAAM,UAAU,aAAa,IAAI;AAEjC,MAAI;AACJ,WAAS,UAAU,GAAG,UAAU,YAAY,WAAW;AACrD,UAAM,aAAa,IAAI,gBAAgB;AACvC,UAAM,gBAAgB,WAAW,MAAM,WAAW,MAAM,GAAG,SAAS;AACpE,UAAM,UAAU,KAAK,IAAI;AAEzB,QAAI;AACF,YAAM,MAAM,MAAM,QAAQ,KAAK;AAAA,QAC7B,QAAQ;AAAA,QACR;AAAA,QACA,MAAM,KAAK,UAAU,UAAU,KAAK,KAAK,CAAC;AAAA,QAC1C,QAAQ,WAAW;AAAA,MACrB,CAAC;AACD,mBAAa,aAAa;AAE1B,UAAI,CAAC,IAAI,IAAI;AACX,cAAM,OAAO,MAAM,IAAI,KAAK;AAC5B,cAAM,MAAM,IAAI;AAAA,UACd,YAAY,IAAI,MAAM,KAAK,KAAK,MAAM,GAAG,GAAG,CAAC;AAAA,UAC7C,IAAI;AAAA,UACJ;AAAA,UACA,IAAI;AAAA,QACN;AACA,YAAI,iBAAiB,IAAI,IAAI,MAAM,KAAK,UAAU,aAAa,GAAG;AAChE,oBAAU;AACV,gBAAM,aAAa,gBAAgB,IAAI,OAAO;AAC9C,gBAAM,MAAM,cAAc,UAAU,OAAO,CAAC;AAC5C;AAAA,QACF;AACA,cAAM;AAAA,MACR;AAEA,YAAM,OAAQ,MAAM,IAAI,KAAK;AAC7B,YAAM,SAAU,KAAK,UAAoE,CAAC;AAC1F,YAAM,WAAY,KAAK,SAAiD,CAAC;AACzE,YAAM,gBAAiB,KAAK,kBAAkB,KAAK;AAEnD,aAAO;AAAA,QACL,SAAS,QAAQ,SAAS,WAAW;AAAA,QACrC,OAAO;AAAA,UACL,cAAc,OAAO
,SAAS,iBAAiB,CAAC;AAAA,UAChD,kBAAkB,OAAO,SAAS,qBAAqB,CAAC;AAAA,UACxD,aAAa,OAAO,SAAS,gBAAgB,CAAC;AAAA,UAC9C,oBACE,SAAS,yBACT,OAAO,SAAS,0BAA0B,WACtC;AAAA,YACG,SAAS,sBAAkD,iBAAiB;AAAA,UAC/E,IACA;AAAA,QACR;AAAA,QACA,SAAS,OAAO,kBAAkB,WAAW,gBAAgB;AAAA,QAC7D,OAAQ,KAAK,SAAoB,IAAI;AAAA,QACrC,YAAY,KAAK,IAAI,IAAI;AAAA,QACzB,KAAK;AAAA,MACP;AAAA,IACF,SAAS,KAAK;AACZ,mBAAa,aAAa;AAC1B,gBAAU;AACV,UAAI,UAAU,aAAa,KAAK,iBAAiB,GAAG,GAAG;AACrD,cAAM,MAAM,UAAU,OAAO,CAAC;AAC9B;AAAA,MACF;AACA,YAAM;AAAA,IACR;AAAA,EACF;AACA,QAAM,mBAAmB,QAAQ,UAAU,IAAI,MAAM,OAAO,OAAO,CAAC;AACtE;AAQA,eAAsB,YACpB,KACA,OAAyB,CAAC,GACoB;AAC9C,MAAI;AACF,UAAM,SAAS,MAAM,QAAQ,EAAE,GAAG,KAAK,UAAU,IAAI,YAAY,CAAC,IAAI,WAAW,GAAG,IAAI;AACxF,UAAM,QAAQ,gBAAmB,OAAO,SAAS,OAAO,KAAK;AAC7D,WAAO,EAAE,OAAO,OAAO;AAAA,EACzB,SAAS,KAAK;AACZ,QAAI,eAAe,gBAAgB,kBAAkB,IAAI,QAAQ,IAAI,IAAI,KAAK,IAAI,YAAY;AAE5F,YAAM,cAA8B,EAAE,GAAG,KAAK,UAAU,MAAM,YAAY,OAAU;AACpF,YAAM,SAAS,MAAM,QAAQ,aAAa,IAAI;AAC9C,YAAM,QAAQ,gBAAmB,OAAO,SAAS,OAAO,KAAK;AAC7D,aAAO,EAAE,OAAO,OAAO;AAAA,IACzB;AACA,UAAM;AAAA,EACR;AACF;AAEA,SAAS,gBAAmB,SAAiB,OAAkB;AAC7D,QAAM,WAAW,mBAAmB,OAAO;AAC3C,MAAI;AACF,WAAO,KAAK,MAAM,QAAQ;AAAA,EAC5B,SAAS,KAAK;AACZ,UAAM,IAAI;AAAA,MACR,wCAAwC,KAAK,MAC3C,eAAe,QAAQ,IAAI,UAAU,OAAO,GAAG,CACjD;AAAA;AAAA,EAA0B,QAAQ,MAAM,GAAG,GAAG,CAAC;AAAA,IACjD;AAAA,EACF;AACF;AAaA,eAAsB,SACpB,OACA,OAAkD,CAAC,GACgB;AACnE,QAAM,QAAQ,KAAK,IAAI;AACvB,MAAI;AACF,UAAM;AAAA,MACJ;AAAA,QACE;AAAA,QACA,UAAU,CAAC,EAAE,MAAM,QAAQ,SAAS,OAAO,CAAC;AAAA,QAC5C,WAAW;AAAA,QACX,WAAW,KAAK,aAAa;AAAA,MAC/B;AAAA,MACA;AAAA,IACF;AACA,WAAO,EAAE,IAAI,MAAM,WAAW,KAAK,IAAI,IAAI,OAAO,OAAO,KAAK;AAAA,EAChE,SAAS,KAAK;AACZ,WAAO;AAAA,MACL,IAAI;AAAA,MACJ,WAAW,KAAK,IAAI,IAAI;AAAA,MACxB,OAAO,eAAe,QAAQ,IAAI,UAAU,OAAO,GAAG;AAAA,IACxD;AAAA,EACF;AACF;AAOO,IAAM,YAAN,MAAgB;AAAA,EACrB,YAA6B,OAAyB,CAAC,GAAG;AAA7B;AAAA,EAA8B;AAAA,EAA9B;AAAA,EAE7B,KAAK,KAAqB,KAAgD;AACxE,WAAO,QAAQ,KAAK,EAAE,GAAG,KAAK,MAAM,GAAG,IAAI,CAAC;AAAA,EAC9C;AAAA,EAEA,SACE,KACA,KAC8C;AAC9C,WAAO,YAAe,KAAK,EAAE,GAAG,KAAK,MAAM,GAAG,IAAI,CAAC;AAAA,EACrD;AA
CF;","names":[]}
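Reviewer note on the retry loop in the embedded source above: `callLlm` iterates while `attempt < maxRetries` and only sleeps when `attempt < maxRetries - 1`, so `maxRetries` appears to act as a total attempt count rather than a count of retries. The sketch below copies `backoffMs` from that source and adds a hypothetical helper, `retryDelays` (not part of the package), that lists the waits under the assumption that every attempt fails with a retryable error and the server sends no `Retry-After` header.

```ts
// Copied from the embedded source: doubles from 500 ms, capped at 16 s.
function backoffMs(attempt: number): number {
  return Math.min(500 * Math.pow(2, attempt), 16_000)
}

// Hypothetical helper (not in the package): the sleeps a call would go
// through if every attempt failed with a retryable error and no
// Retry-After header overrode the backoff.
function retryDelays(maxRetries: number): number[] {
  const delays: number[] = []
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Mirrors the `attempt < maxRetries - 1` guard: no sleep after the last attempt.
    if (attempt < maxRetries - 1) delays.push(backoffMs(attempt))
  }
  return delays
}

console.log(retryDelays(3)) // [ 500, 1000 ]
```

With the default of 3, a persistently failing call makes three requests and waits 1.5 s in total between them; the 16 s cap only comes into play from the sixth attempt onward.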
package/dist/cli.js CHANGED
@@ -5,8 +5,8 @@ import {
    runRpcBatch,
    runRpcOnce,
    startServer
- } from "./chunk-OZPRSK4A.js";
- import "./chunk-ITN4YOZY.js";
+ } from "./chunk-CJJSB6ZQ.js";
+ import "./chunk-JAOLXRIA.js";
  import "./chunk-PZ5AY32C.js";

  // src/cli.ts
package/dist/index.js CHANGED
@@ -5,7 +5,7 @@ import {
    callLlmJson,
    probeLlm,
    stripFencedJson
- } from "./chunk-ITN4YOZY.js";
+ } from "./chunk-JAOLXRIA.js";
  import {
    __export
  } from "./chunk-PZ5AY32C.js";
@@ -24,8 +24,8 @@ import {
    runRpcBatch,
    runRpcOnce,
    startServer
- } from "../chunk-OZPRSK4A.js";
- import "../chunk-ITN4YOZY.js";
+ } from "../chunk-CJJSB6ZQ.js";
+ import "../chunk-JAOLXRIA.js";
  import "../chunk-PZ5AY32C.js";
  export {
    BUILTIN_RUBRICS,
@@ -0,0 +1,44 @@
+ # Example benchmark wrappers
+
+ Reference implementations of `BenchmarkAdapter` for two public benchmarks. They are NOT bundled — they're intentionally shipped as source you read, copy, and adapt.
+
+ | Wrapper | What it does | Why it's an example, not core |
+ |---|---|---|
+ | [`gsm8k/`](./gsm8k) | Exact-match grading on the final numeric answer of GSM8K (Cobbe et al.) | The dataset isn't ours and isn't bundled. The wrapper points to a local JSONL via `AGENT_EVAL_GSM8K_PATH`. |
+ | [`swebench-lite/`](./swebench-lite) | Pass/fail grading via an external SWE-Bench grader command | The grader is a separate binary; the wrapper stubs the integration via `AGENT_EVAL_SWEBENCH_GRADER_CMD`. |
+
+ The novel benchmark we ship and own — the synthetic routing task — lives in `src/benchmarks/routing/` and IS in the bundle.
+
+ ## Using these wrappers
+
+ Two paths.
+
+ **Option A — read and inline.** Copy the wrapper file into your project. Replace the import paths from `../../../src/benchmarks/types` and `../../../src/run-record` with `@tangle-network/agent-eval`. Done.
+
+ **Option B — import from agent-eval source.** If your project sits in this monorepo (or you've cloned the repo), import directly:
+
+ ```ts
+ import * as gsm8k from '@tangle-network/agent-eval/examples/benchmarks/gsm8k'
+ ```
+
+ This requires adding `examples/**/*.ts` to your TypeScript paths. Easier to just copy.
+
+ ## What every BenchmarkAdapter exports
+
+ ```ts
+ loadDataset(split: 'search' | 'dev' | 'holdout'): Promise<DatasetItem[]>
+ evaluate(item, response): Promise<{ score: number, raw: Record<string, unknown> }>
+ assignSplit(itemId: string): 'search' | 'dev' | 'holdout'
+ ```
+
+ `assignSplit` uses `deterministicSplit(itemId, BENCHMARK_SPLIT_SEED)` — same item gets the same split everywhere. Don't change the seed; it's load-bearing for reproducibility.
+
+ ## Adding a new benchmark
+
+ 1. Create `examples/benchmarks/<your-benchmark>/index.ts`.
+ 2. Export `loadDataset`, `evaluate`, `assignSplit`. Optionally a typed `Adapter` class.
+ 3. Use `deterministicSplit` from `@tangle-network/agent-eval` for split assignment.
+ 4. Fail loud on missing config (env vars, paths). Never default to silent-pass.
+ 5. Document config requirements in a per-benchmark README.
+
+ If your benchmark is novel and broadly useful, propose moving it into `src/benchmarks/` as core surface (PR welcome). The bar is: novel rubric, reusable across projects, low maintenance burden.
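Reviewer note: the "Adding a new benchmark" steps in the README above can be sketched as a small adapter. This is a minimal illustration, not the package's implementation. `standInSplit` is a hypothetical replacement for `deterministicSplit` (whose hashing and split proportions are not shown in this diff), the 60/20/20 buckets are an assumption, and `requireConfig` and the `MY_BENCHMARK_PATH` name are invented for the example.

```ts
type Split = 'search' | 'dev' | 'holdout'

interface Item {
  id: string
  payload: { prompt: string; expected: string }
}

// Hypothetical stand-in for deterministicSplit: an FNV-1a hash of the key,
// bucketed 60/20/20. The real function's behavior is an assumption here.
function standInSplit(key: string): Split {
  let h = 2166136261
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i)
    h = Math.imul(h, 16777619)
  }
  const bucket = (h >>> 0) % 10
  return bucket < 6 ? 'search' : bucket < 8 ? 'dev' : 'holdout'
}

// Step 4: fail loud on missing config instead of scoring zero silently.
function requireConfig(name: string, env: Record<string, string | undefined>): string {
  const value = env[name]
  if (!value) throw new Error(`${name} is not set; refusing to run the benchmark`)
  return value
}

// Step 2: the three functions every adapter provides (loadDataset omitted
// for brevity; it would read the file named by the config and filter by split).
async function evaluate(item: Item, response: string) {
  const observed = response.trim()
  const ok = observed === item.payload.expected
  return { score: ok ? 1 : 0, raw: { expected: item.payload.expected, observed } }
}

// Step 3: namespace the key so the same id in two benchmarks hashes independently.
function assignSplit(itemId: string): Split {
  return standInSplit(`my-benchmark::${itemId}`)
}
```

Prefixing the key with the benchmark name follows the pattern the two wrappers in this diff use (`gsm8k::` and `swebench-lite::`).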
@@ -0,0 +1,126 @@
+ /**
+  * GSM8K wrapper — exact-match grading on the final numeric answer.
+  *
+  * The dataset itself is NOT bundled. `loadDataset` will:
+  *   1. read from `process.env.AGENT_EVAL_GSM8K_PATH` if set (a JSONL
+  *      file with `{ id, question, answer }` records — the standard
+  *      HF mirror layout converted to JSONL);
+  *   2. otherwise throw a clearly-marked error pointing to the loader.
+  *
+  * `evaluate` parses the final number out of the response (last
+  * occurrence of a signed-decimal-or-integer literal, optionally after
+  * `####`, the GSM8K answer convention) and compares to the ground-
+  * truth integer. Floating-point comparisons use a 1e-6 tolerance.
+  */
+
+ import { existsSync, readFileSync } from 'node:fs'
+
+ import type {
+   BenchmarkAdapter,
+   BenchmarkDatasetItem,
+   BenchmarkEvaluation,
+ } from '../../../src/benchmarks/types'
+ import { deterministicSplit } from '../../../src/benchmarks/types'
+ import type { RunSplitTag } from '../../../src/run-record'
+
+ export interface Gsm8kPayload {
+   question: string
+   /** Reference answer, post-#### normalization. May be a number or
+    * a numeric string ("72", "1.5"). */
+   answer: string
+ }
+
+ export type Gsm8kItem = BenchmarkDatasetItem<Gsm8kPayload>
+
+ class Gsm8kAdapter implements BenchmarkAdapter<Gsm8kItem, Gsm8kPayload> {
+   async loadDataset(split: RunSplitTag): Promise<Gsm8kItem[]> {
+     const path = process.env.AGENT_EVAL_GSM8K_PATH
+     if (!path) {
+       throw new Error(
+         'GSM8K dataset not provided. Set AGENT_EVAL_GSM8K_PATH to a JSONL file ' +
+           'with {id, question, answer} records (the HF GSM8K mirror converted to JSONL).',
+       )
+     }
+     if (!existsSync(path)) {
+       throw new Error(`AGENT_EVAL_GSM8K_PATH=${path} does not exist`)
+     }
+     const items = parseJsonl(path).filter((it) => assignSplitImpl(it.id) === split)
+     return items
+   }
+
+   async evaluate(item: Gsm8kItem, response: string): Promise<BenchmarkEvaluation> {
+     const expected = parseGsm8kAnswer(item.payload.answer)
+     const observed = parseGsm8kAnswer(response)
+     if (expected === null) {
+       // Defensive: the dataset should never ship a non-numeric ref.
+       return { score: 0, raw: { reason: 'reference_not_numeric', expected: item.payload.answer } }
+     }
+     if (observed === null) {
+       return { score: 0, raw: { reason: 'no_numeric_in_response', expected, observed: null } }
+     }
+     const ok = Math.abs(expected - observed) < 1e-6
+     return { score: ok ? 1 : 0, raw: { expected, observed, exactMatch: ok } }
+   }
+
+   assignSplit(itemId: string): RunSplitTag {
+     return assignSplitImpl(itemId)
+   }
+ }
+
+ function assignSplitImpl(itemId: string): RunSplitTag {
+   return deterministicSplit(`gsm8k::${itemId}`)
+ }
+
+ function parseJsonl(path: string): Gsm8kItem[] {
+   const raw = readFileSync(path, 'utf8')
+   const out: Gsm8kItem[] = []
+   let lineNo = 0
+   for (const line of raw.split('\n')) {
+     lineNo++
+     const trimmed = line.trim()
+     if (!trimmed) continue
+     let row: Record<string, unknown>
+     try {
+       row = JSON.parse(trimmed) as Record<string, unknown>
+     } catch (e) {
+       throw new Error(`GSM8K JSONL parse error at line ${lineNo}: ${(e as Error).message}`)
+     }
+     const id = String(row.id ?? `gsm8k_${lineNo}`)
+     const question = String(row.question ?? '')
+     const answer = String(row.answer ?? '')
+     if (!question || !answer) {
+       throw new Error(`GSM8K JSONL line ${lineNo} missing question/answer`)
+     }
+     out.push({ id, payload: { question, answer } })
+   }
+   return out
+ }
+
+ /**
+  * Parse a GSM8K-style answer. Honors the dataset's `#### N`
+  * convention (the canonical answer comes after `####`); otherwise
+  * returns the LAST signed numeric literal in the string.
+  */
+ export function parseGsm8kAnswer(text: string): number | null {
+   if (!text) return null
+   const afterMarker = text.match(/####\s*(-?\d[\d,]*\.?\d*)/)
+   if (afterMarker) {
+     const cleaned = afterMarker[1]!.replace(/,/g, '')
+     const v = Number(cleaned)
+     if (Number.isFinite(v)) return v
+   }
+   // Last numeric literal anywhere in the string.
+   const matches = text.match(/-?\d[\d,]*\.?\d*/g)
+   if (!matches || matches.length === 0) return null
+   const last = matches[matches.length - 1]!
+   const cleaned = last.replace(/,/g, '')
+   const v = Number(cleaned)
+   return Number.isFinite(v) ? v : null
+ }
+
+ const adapter = new Gsm8kAdapter()
+
+ export const loadDataset = adapter.loadDataset.bind(adapter)
+ export const evaluate = adapter.evaluate.bind(adapter)
+ export const assignSplit = adapter.assignSplit.bind(adapter)
+ export { Gsm8kAdapter }
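Reviewer note: a few sample inputs show how `parseGsm8kAnswer` behaves. The function body below is copied from the file above so the block runs on its own; the sample strings are invented for illustration. The third case suggests a limitation of the "last numeric literal" fallback: any number that follows the answer in free text is picked up instead.

```ts
// Copied from the GSM8K wrapper in this diff.
function parseGsm8kAnswer(text: string): number | null {
  if (!text) return null
  const afterMarker = text.match(/####\s*(-?\d[\d,]*\.?\d*)/)
  if (afterMarker) {
    const cleaned = afterMarker[1]!.replace(/,/g, '')
    const v = Number(cleaned)
    if (Number.isFinite(v)) return v
  }
  const matches = text.match(/-?\d[\d,]*\.?\d*/g)
  if (!matches || matches.length === 0) return null
  const last = matches[matches.length - 1]!
  const cleaned = last.replace(/,/g, '')
  const v = Number(cleaned)
  return Number.isFinite(v) ? v : null
}

// The `####` marker takes priority over other numbers in the text.
console.log(parseGsm8kAnswer('She buys 5 bags.\n#### 72')) // 72
// Thousands separators are stripped before conversion.
console.log(parseGsm8kAnswer('The total is 1,234.5 dollars')) // 1234.5
// Without a marker, a trailing number shadows the intended answer.
console.log(parseGsm8kAnswer('The answer is 7, found in 2 steps')) // 2
// No digits at all yields null, which `evaluate` scores as 0.
console.log(parseGsm8kAnswer('no digits here')) // null
```

Prompting the model to end its response with `#### <answer>` sidesteps the third case, since the marker branch runs first.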
@@ -0,0 +1,178 @@
+ /**
+  * SWE-Bench Lite wrapper — 30-instance subset.
+  *
+  * Status: STUB. The actual SWE-Bench harness needs a Docker host and
+  * is too heavy to ship inside this package. We expose the contract
+  * (loadDataset, evaluate, assignSplit) so consumers can plug in their
+  * own grader without touching call sites.
+  *
+  * Wire-up paths in priority order:
+  *
+  *   1. `process.env.AGENT_EVAL_SWEBENCH_PATH` → JSONL with the 30
+  *      lite instances + per-instance metadata (instance_id,
+  *      problem_statement, base_commit, repo, FAIL_TO_PASS,
+  *      PASS_TO_PASS).
+  *   2. `process.env.AGENT_EVAL_SWEBENCH_GRADER_CMD` → executable
+  *      that reads `{instance_id, patch}` JSON on stdin and writes
+  *      `{passed, fail_to_pass_passed, pass_to_pass_passed, log}`
+  *      JSON on stdout. Implementations can shell out to the
+  *      official `swebench` runner here.
+  *
+  * If neither is set, every public method throws a clearly-marked
+  * "not implemented" error. The stub fails LOUD; it never silently
+  * scores zero.
+  */
+
+ import { existsSync, readFileSync } from 'node:fs'
+ import { spawn } from 'node:child_process'
+
+ import type {
+   BenchmarkAdapter,
+   BenchmarkDatasetItem,
+   BenchmarkEvaluation,
+ } from '../../../src/benchmarks/types'
+ import { deterministicSplit } from '../../../src/benchmarks/types'
+ import type { RunSplitTag } from '../../../src/run-record'
+
+ export interface SweBenchLitePayload {
+   instanceId: string
+   problemStatement: string
+   baseCommit: string
+   repo: string
+   failToPass: string[]
+   passToPass: string[]
+ }
+
+ export type SweBenchLiteItem = BenchmarkDatasetItem<SweBenchLitePayload>
+
+ class SweBenchLiteAdapter
+   implements BenchmarkAdapter<SweBenchLiteItem, SweBenchLitePayload>
+ {
+   async loadDataset(split: RunSplitTag): Promise<SweBenchLiteItem[]> {
+     const path = process.env.AGENT_EVAL_SWEBENCH_PATH
+     if (!path) {
+       throw new Error(
+         'SWE-Bench Lite dataset not provided. Set AGENT_EVAL_SWEBENCH_PATH to a JSONL file ' +
+           'with the 30 lite instances. STUB: this wrapper does not bundle the dataset; ' +
+           'see https://www.swebench.com/lite.html for the canonical source.',
+       )
+     }
+     if (!existsSync(path)) {
+       throw new Error(`AGENT_EVAL_SWEBENCH_PATH=${path} does not exist`)
+     }
+     const all = parseJsonl(path)
+     return all.filter((it) => assignSplitImpl(it.id) === split)
+   }
+
+   async evaluate(item: SweBenchLiteItem, response: string): Promise<BenchmarkEvaluation> {
+     const cmd = process.env.AGENT_EVAL_SWEBENCH_GRADER_CMD
+     if (!cmd) {
+       throw new Error(
+         'SWE-Bench Lite grader not configured. Set AGENT_EVAL_SWEBENCH_GRADER_CMD to an ' +
+           'executable that reads {instance_id, patch} JSON on stdin and writes ' +
+           '{passed, fail_to_pass_passed, pass_to_pass_passed, log} JSON on stdout. ' +
+           'TODO(swebench-lite): bundle a default Docker-based runner once the SDK ' +
+           'stabilises (https://github.com/swe-bench/SWE-bench).',
+       )
+     }
+     const stdinPayload = JSON.stringify({ instance_id: item.payload.instanceId, patch: response })
+     const result = await runGrader(cmd, stdinPayload)
+     let parsed: Record<string, unknown>
+     try {
+       parsed = JSON.parse(result.stdout) as Record<string, unknown>
+     } catch (e) {
+       throw new Error(
+         `SWE-Bench grader emitted non-JSON stdout: ${(e as Error).message}\n` +
+           `stdout=${result.stdout.slice(0, 400)}\nstderr=${result.stderr.slice(0, 400)}`,
+       )
+     }
+     const passed = Boolean(parsed.passed)
+     return {
+       score: passed ? 1 : 0,
+       raw: {
+         passed,
+         failToPassPassed: Boolean(parsed.fail_to_pass_passed),
+         passToPassPassed: Boolean(parsed.pass_to_pass_passed),
+         graderLog: typeof parsed.log === 'string' ? parsed.log.slice(0, 4000) : '',
+       },
+     }
+   }
+
+   assignSplit(itemId: string): RunSplitTag {
+     return assignSplitImpl(itemId)
+   }
+ }
+
+ function assignSplitImpl(itemId: string): RunSplitTag {
+   return deterministicSplit(`swebench-lite::${itemId}`)
+ }
+
+ function parseJsonl(path: string): SweBenchLiteItem[] {
+   const raw = readFileSync(path, 'utf8')
+   const out: SweBenchLiteItem[] = []
+   let lineNo = 0
+   for (const line of raw.split('\n')) {
+     lineNo++
+     const trimmed = line.trim()
+     if (!trimmed) continue
+     const row = JSON.parse(trimmed) as Record<string, unknown>
+     const instanceId = String(row.instance_id ?? row.instanceId ?? '')
+     if (!instanceId) {
+       throw new Error(`swebench-lite line ${lineNo} missing instance_id`)
+     }
+     out.push({
+       id: instanceId,
+       payload: {
+         instanceId,
+         problemStatement: String(row.problem_statement ?? row.problemStatement ?? ''),
+         baseCommit: String(row.base_commit ?? row.baseCommit ?? ''),
+         repo: String(row.repo ?? ''),
+         failToPass: asStringArray(row.FAIL_TO_PASS ?? row.failToPass),
+         passToPass: asStringArray(row.PASS_TO_PASS ?? row.passToPass),
+       },
+     })
+   }
+   return out
+ }
+
+ function asStringArray(v: unknown): string[] {
+   if (Array.isArray(v)) return v.filter((x): x is string => typeof x === 'string')
+   if (typeof v === 'string') {
+     try {
+       const parsed = JSON.parse(v)
+       if (Array.isArray(parsed)) return parsed.filter((x): x is string => typeof x === 'string')
+     } catch {
+       // Plain string; treat as a single-element list.
+       return [v]
+     }
+   }
+   return []
+ }
+
+ function runGrader(cmd: string, stdin: string): Promise<{ stdout: string; stderr: string }> {
+   return new Promise((resolve, reject) => {
+     const parts = cmd.split(/\s+/)
+     const child = spawn(parts[0]!, parts.slice(1), { stdio: ['pipe', 'pipe', 'pipe'] })
+     let stdout = ''
+     let stderr = ''
+     child.stdout.on('data', (b: Buffer) => (stdout += b.toString('utf8')))
+     child.stderr.on('data', (b: Buffer) => (stderr += b.toString('utf8')))
+     child.on('error', reject)
+     child.on('close', (code) => {
+       if (code !== 0) {
+         reject(new Error(`grader exited with code ${code}: ${stderr.slice(0, 400)}`))
+         return
+       }
+       resolve({ stdout, stderr })
+     })
+     child.stdin.write(stdin)
+     child.stdin.end()
+   })
+ }
+
+ const adapter = new SweBenchLiteAdapter()
+
+ export const loadDataset = adapter.loadDataset.bind(adapter)
+ export const evaluate = adapter.evaluate.bind(adapter)
+ export const assignSplit = adapter.assignSplit.bind(adapter)
+ export { SweBenchLiteAdapter }
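Reviewer note on the grader contract above: `evaluate` coerces the grader's fields with `Boolean(...)`, so a grader that emits the string `"false"` would be read as a pass, because any non-empty string is truthy. A stricter reading of the same stdout could look like the sketch below. `parseGraderVerdict` and `GraderVerdict` are hypothetical names, not part of the package; only the field names come from the wrapper's documented stdout shape.

```ts
interface GraderVerdict {
  passed: boolean
  failToPassPassed: boolean
  passToPassPassed: boolean
  log: string
}

// Hypothetical stricter parser: rejects anything that is not a real boolean
// instead of coercing it, in keeping with the wrapper's "fail loud" rule.
function parseGraderVerdict(stdout: string): GraderVerdict {
  const parsed: unknown = JSON.parse(stdout)
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('grader output is not a JSON object')
  }
  const row = parsed as Record<string, unknown>
  const flag = (key: string): boolean => {
    const v = row[key]
    if (typeof v !== 'boolean') throw new Error(`grader field "${key}" must be a boolean`)
    return v
  }
  return {
    passed: flag('passed'),
    failToPassPassed: flag('fail_to_pass_passed'),
    passToPassPassed: flag('pass_to_pass_passed'),
    // Same 4000-character cap on the log as the wrapper uses.
    log: typeof row.log === 'string' ? row.log.slice(0, 4000) : '',
  }
}
```

Whether to reject or coerce is a design choice; rejecting trades some tolerance for graders with loose output against the risk of a silent false pass.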
@@ -0,0 +1,114 @@
+ import {
+   runMultiShotOptimization,
+   trialTraceFromMultiShotTrial,
+   type MultiShotVariant,
+   type RunRecord,
+ } from '@tangle-network/agent-eval'
+
+ type Payload = {
+   instruction: string
+   quality: number
+ }
+
+ const baseline: MultiShotVariant<Payload> = {
+   id: 'baseline',
+   label: 'baseline',
+   generation: 0,
+   payload: {
+     instruction: 'Complete the user task.',
+     quality: 0.45,
+   },
+ }
+
+ const result = await runMultiShotOptimization<Payload>({
+   runId: 'demo-multi-shot',
+   target: 'demo-agent-system-prompt',
+   seedVariants: [baseline],
+   searchScenarioIds: ['search-brief', 'search-code-review', 'search-research'],
+   reps: 1,
+   generations: 2,
+   populationSize: 2,
+   scoreConcurrency: 2,
+   runner: {
+     async run({ variant, scenarioId }) {
+       return {
+         trace: {
+           scenarioId,
+           turns: [
+             { role: 'user', content: `Run ${scenarioId}` },
+             { role: 'assistant', content: `${variant.payload.instruction} quality=${variant.payload.quality}` },
+           ],
+           output: `quality=${variant.payload.quality}`,
+         },
+         costUsd: 0.01,
+         durationMs: 50,
+       }
+     },
+   },
+   scorer: {
+     async score({ variant }) {
+       return {
+         score: variant.payload.quality,
+         ok: true,
+         asi: variant.payload.quality >= 0.8
+           ? []
+           : [{
+               expectationId: 'complete-task',
+               message: 'The agent did not fully complete the task.',
+               severity: 'error',
+               responsibleSurface: 'system-prompt',
+               suggestion: 'Make completion criteria explicit before final response.',
+             }],
+       }
+     },
+   },
+   mutateAdapter: {
+     async mutate({ parent, bottomTrials, childCount, generation }) {
+       const traces = bottomTrials.map((trial) => trialTraceFromMultiShotTrial(trial))
+       const rationale = traces.flatMap((trace) => (trace.expectations ?? []).map((e) => e.phrase)).join('\n')
+       return Array.from({ length: childCount }, (_, i) => ({
+         id: `${parent.id}.g${generation}.${i}`,
+         label: 'completion-focused',
+         generation,
+         payload: {
+           instruction: `${parent.payload.instruction} Verify every requested step before final answer.`,
+           quality: 0.9,
+         },
+         rationale,
+       }))
+     },
+   },
+   gate: {
+     holdoutScenarioIds: ['holdout-brief', 'holdout-code-review', 'holdout-research'],
+     gate: {
+       baselineKey: 'baseline',
+       minProductiveRuns: 3,
+       pairedDeltaThreshold: 0,
+       seed: 7,
+     },
+     toRunRecord: ({ variant, scenarioId, rep, split, seed, trial }): RunRecord => ({
+       runId: `demo-${variant.id}-${scenarioId}-${rep}-${split}`,
+       experimentId: scenarioId,
+       candidateId: variant.id,
+       seed,
+       model: 'demo-model@2026-01-01',
+       promptHash: 'p'.repeat(64),
+       configHash: 'c'.repeat(64),
+       commitSha: 'deadbeef',
+       wallMs: trial.durationMs ?? 0,
+       costUsd: trial.cost ?? 0,
+       tokenUsage: { input: 1, output: 1 },
+       outcome: {
+         [split === 'holdout' ? 'holdoutScore' : 'searchScore']: trial.score,
+         raw: { score: trial.score },
+       },
+       splitTag: split,
+     }),
+   },
+ })
+
+ console.log({
+   searchBest: result.searchBestVariant.id,
+   promoted: result.promotedVariant.id,
+   gate: result.gate?.decision ?? null,
+ })
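Reviewer note: the `outcome` object in `toRunRecord` above uses a computed key, which is easy to misread in diff form. The sketch below isolates that one expression. `outcomeFor` is a hypothetical helper, and the three-way `Split` type is assumed from the split tags listed in the examples README. Under the expression as written, any split other than `'holdout'` (including `'dev'`) is recorded under `searchScore`; how the real gate treats such records is not shown in this diff.

```ts
type Split = 'search' | 'dev' | 'holdout'

// Hypothetical helper isolating the computed-key expression from toRunRecord.
function outcomeFor(split: Split, score: number): Record<string, unknown> {
  return {
    // Exactly one of holdoutScore / searchScore is set per record.
    [split === 'holdout' ? 'holdoutScore' : 'searchScore']: score,
    raw: { score },
  }
}

console.log(Object.keys(outcomeFor('holdout', 0.9))) // [ 'holdoutScore', 'raw' ]
console.log(Object.keys(outcomeFor('dev', 0.5)))     // [ 'searchScore', 'raw' ]
```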
@@ -0,0 +1,63 @@
+ import {
+   InMemoryTraceStore,
+   SandboxHarness,
+   SubprocessSandboxDriver,
+   TraceEmitter,
+ } from '@tangle-network/agent-eval'
+
+ /**
+  * Same-sandbox pattern:
+  * - one driver owns one workdir
+  * - the harness runs setup/build/test there
+  * - later checks can inspect files/logs/screenshots produced by those phases
+  *
+  * Replace `workdir` with a generated app, browser automation checkout, or
+  * remote computer-use workspace.
+  */
+ export async function runSameSandboxExample(workdir: string) {
+   const store = new InMemoryTraceStore()
+   const driver = new SubprocessSandboxDriver({ cwd: workdir })
+   const harness = new SandboxHarness(driver)
+   const emitter = new TraceEmitter(store)
+   await emitter.startRun({
+     scenarioId: 'same-sandbox-example',
+     layer: 'app-build',
+   })
+
+   const result = await harness.run({
+     setupCommand: 'pnpm install --frozen-lockfile',
+     runCommand: 'pnpm build',
+     testCommand: 'pnpm test',
+     timeoutMs: 180_000,
+   }, emitter)
+
+   const summary = [
+     `passed=${result.passed}`,
+     `score=${result.score}`,
+     `build=${result.run?.exitCode ?? 'not-run'}`,
+     `test=${result.test?.exitCode ?? 'not-run'}`,
+     result.test?.stdout?.slice(-2000) ?? '',
+   ].join('\n')
+
+   const judged = {
+     score: result.passed && summary.includes('test=0') ? 1 : 0,
+     rationale: result.passed
+       ? 'Shared sandbox produced passing build/test evidence.'
+       : 'Shared sandbox did not produce passing build/test evidence.',
+   }
+   await emitter.recordJudge({
+     judgeId: 'same-sandbox-evidence',
+     name: 'same-sandbox-evidence',
+     dimension: 'evidence',
+     score: judged.score,
+     rationale: judged.rationale,
+     evidence: summary,
+   })
+   await emitter.endRun({
+     pass: result.passed,
+     score: result.score,
+     notes: judged.rationale,
+   })
+
+   return { result, judged, traces: await store.listRuns() }
+ }
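The example above derives `judged.score` by searching the joined `summary` string for `test=0`. Because the summary also carries the last 2000 characters of test stdout, that text match could in principle be satisfied by the output itself. A minimal sketch of an alternative that reads the exit code directly is shown below. The `PhaseResult` and `HarnessLikeResult` types are assumptions inferred from the fields the example reads (`passed`, `score`, `run?.exitCode`, `test?.exitCode`, `test?.stdout`), not the package's published types, and `judgeEvidence` is a hypothetical helper.

```typescript
// Hypothetical helper; the result shape is inferred from the fields the
// example reads and may not match the package's actual types.

interface PhaseResult {
  exitCode: number | null
  stdout?: string
}

interface HarnessLikeResult {
  passed: boolean
  score: number
  run?: PhaseResult
  test?: PhaseResult
}

function judgeEvidence(result: HarnessLikeResult): { score: 0 | 1; rationale: string } {
  // Compare the structured exit code rather than matching text in a summary.
  const testExitedCleanly = result.test?.exitCode === 0
  const ok = result.passed && testExitedCleanly
  return {
    score: ok ? 1 : 0,
    rationale: ok
      ? 'Shared sandbox produced passing build/test evidence.'
      : 'Shared sandbox did not produce passing build/test evidence.',
  }
}
```

The design choice is small: the summary string stays useful as human-readable evidence for `recordJudge`, while the score depends only on structured fields.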
package/package.json CHANGED
@@ -1,7 +1,7 @@
  {
    "name": "@tangle-network/agent-eval",
-   "version": "0.20.2",
-   "description": "Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).",
+   "version": "0.20.3",
+   "description": "Trace-first evaluation infrastructure for agent systems: traces, harnesses, verifier pipelines, judges, datasets, gates, optimization, and reporting.",
    "homepage": "https://github.com/tangle-network/agent-eval#readme",
    "repository": {
      "type": "git",
@@ -40,11 +40,21 @@
    },
    "files": [
      "dist",
-     "docs"
+     "docs",
+     "examples"
    ],
    "publishConfig": {
      "access": "public"
    },
+   "scripts": {
+     "build": "tsup",
+     "dev": "tsup --watch",
+     "prepare": "tsup",
+     "test": "vitest run",
+     "test:watch": "vitest",
+     "typecheck": "tsc --noEmit",
+     "openapi": "node dist/cli.js openapi --out dist/openapi.json"
+   },
    "dependencies": {
      "@asteasolutions/zod-to-openapi": "^8.5.0",
      "@ax-llm/ax": "^19.0.25",
@@ -64,12 +74,5 @@
      "node": ">=20"
    },
    "license": "MIT",
-   "scripts": {
-     "build": "tsup",
-     "dev": "tsup --watch",
-     "test": "vitest run",
-     "test:watch": "vitest",
-     "typecheck": "tsc --noEmit",
-     "openapi": "node dist/cli.js openapi --out dist/openapi.json"
-   }
- }
+   "packageManager": "pnpm@10.22.0"
+ }
@@ -1 +0,0 @@
1
- {"version":3,"sources":["../src/llm-client.ts"],"sourcesContent":["/**\n * LLM client with graceful degrade.\n *\n * OpenAI-compatible `/v1/chat/completions` client with:\n * - Exponential-backoff retry on 429 + 5xx gateway errors (502/503/504).\n * - Retry on transient network errors (fetch failed, AbortError, ECONNRESET).\n * - Graceful json_schema → json_object degrade on 400 with schema-reject body.\n * - Fenced-JSON stripping (```json ... ```) for models that wrap structured output.\n * - Configurable base URL + api key / bearer, works with LiteLLM proxies, OpenAI\n * directly, cli-bridge subscriptions, and any router that speaks the spec.\n *\n * Usage:\n * const { value, result } = await callLlmJson<MyType>(\n * { model: 'gpt-4o', messages: [...], jsonSchema: { name: 'x', schema: {...} } },\n * { baseUrl: 'https://router.tangle.tools/v1', apiKey: process.env.KEY },\n * )\n *\n * This is THE llm-calling seam for agent-eval primitives that need structured\n * output (semantic concept judge, reviewer directives, critic scores). Primitives\n * that need free-form text use `callLlm` and parse output themselves.\n */\n\n// ─── Types ──────────────────────────────────────────────────────────────\n\nexport interface LlmMessage {\n role: 'system' | 'user' | 'assistant'\n /**\n * Either a plain text content string OR a multimodal content array\n * (text + image_url parts) for vision-capable models.\n */\n content:\n | string\n | Array<\n | { type: 'text'; text: string }\n | { type: 'image_url'; image_url: { url: string; detail?: 'auto' | 'low' | 'high' } }\n >\n}\n\nexport interface LlmCallRequest {\n model: string\n messages: LlmMessage[]\n /** Optional JSON-mode response format (response_format: json_object). */\n jsonMode?: boolean\n /** Optional structured output via JSON Schema. Falls back to json_object on 400. */\n jsonSchema?: { name: string; schema: Record<string, unknown> }\n temperature?: number\n maxTokens?: number\n /** Per-call timeout, default 60s. 
*/\n timeoutMs?: number\n}\n\nexport interface LlmUsage {\n promptTokens: number\n completionTokens: number\n totalTokens: number\n /** Proxies populate this when prompt caching is on. */\n cachedPromptTokens?: number\n}\n\nexport interface LlmCallResult {\n /** The text content of the first choice. Empty string if none. */\n content: string\n usage: LlmUsage\n /**\n * Cost in USD. Pulled from proxy's `_response_cost` field when present;\n * `null` when neither the proxy nor the caller can derive it.\n */\n costUsd: number | null\n /** Model name actually used (echoed from response). */\n model: string\n /** Wall-clock duration of the HTTP call (last attempt, if retried). */\n durationMs: number\n /** Raw response body. */\n raw: Record<string, unknown>\n}\n\nexport class LlmCallError extends Error {\n constructor(\n message: string,\n public readonly status: number,\n public readonly body: string,\n public readonly model: string,\n ) {\n super(message)\n this.name = 'LlmCallError'\n }\n}\n\nexport interface LlmClientOptions {\n /** Base URL (without trailing slash). Must end at the `/v1` prefix. */\n baseUrl?: string\n /** Bearer token — either `apiKey` or `bearer` populates `Authorization: Bearer ...`. */\n apiKey?: string\n bearer?: string\n /** Override for the `Authorization` header (e.g. `X-Auth: ...`). Takes precedence over apiKey/bearer. */\n authHeader?: { name: string; value: string }\n /** Default timeout in ms. Per-call can override. */\n defaultTimeoutMs?: number\n /** Max retry attempts on retriable errors. Default 3 (1 initial + 2 retries). */\n maxRetries?: number\n /** Fetch implementation — defaults to global `fetch`. Override for custom transport (e.g. tests). 
*/\n fetch?: typeof fetch\n}\n\n// ─── Internals ──────────────────────────────────────────────────────────\n\nconst DEFAULT_BASE_URL = 'https://router.tangle.tools/v1'\nconst DEFAULT_TIMEOUT_MS = 60_000\nconst DEFAULT_MAX_RETRIES = 3\n\nconst RETRYABLE_STATUS = new Set([429, 502, 503, 504])\n\nfunction isRetryableError(err: unknown): boolean {\n if (err instanceof LlmCallError) return RETRYABLE_STATUS.has(err.status)\n if (err instanceof Error) {\n return (\n err.name === 'AbortError' ||\n err.name === 'TimeoutError' ||\n /fetch failed|ECONNRESET|ETIMEDOUT|EAI_AGAIN/i.test(err.message)\n )\n }\n return false\n}\n\nfunction parseRetryAfter(headers: Headers): number | null {\n const h = headers.get('retry-after')\n if (!h) return null\n const asNumber = Number(h)\n if (Number.isFinite(asNumber) && asNumber > 0) return asNumber * 1000\n const asDate = Date.parse(h)\n if (Number.isFinite(asDate)) return Math.max(0, asDate - Date.now())\n return null\n}\n\nfunction backoffMs(attempt: number): number {\n // 500ms, 1s, 2s, 4s, ...\n return Math.min(500 * Math.pow(2, attempt), 16_000)\n}\n\nfunction buildHeaders(opts: LlmClientOptions): Record<string, string> {\n const headers: Record<string, string> = {\n 'Content-Type': 'application/json',\n Accept: 'application/json',\n }\n if (opts.authHeader) {\n headers[opts.authHeader.name] = opts.authHeader.value\n } else if (opts.bearer || opts.apiKey) {\n headers.Authorization = `Bearer ${opts.bearer ?? 
opts.apiKey}`\n }\n return headers\n}\n\nfunction isSchemaRejection(status: number, body: string): boolean {\n if (status !== 400) return false\n const lower = body.toLowerCase()\n return (\n lower.includes('response_format') ||\n lower.includes('json_schema') ||\n lower.includes('is unavailable') ||\n lower.includes('not supported')\n )\n}\n\nfunction buildBody(req: LlmCallRequest, forceJsonObject: boolean): Record<string, unknown> {\n const body: Record<string, unknown> = {\n model: req.model,\n messages: req.messages,\n temperature: req.temperature ?? 0,\n }\n if (req.maxTokens != null) body.max_tokens = req.maxTokens\n\n if (req.jsonSchema && !forceJsonObject) {\n body.response_format = {\n type: 'json_schema',\n json_schema: { name: req.jsonSchema.name, schema: req.jsonSchema.schema, strict: true },\n }\n } else if (req.jsonMode || req.jsonSchema) {\n body.response_format = { type: 'json_object' }\n }\n\n return body\n}\n\nasync function sleep(ms: number): Promise<void> {\n return new Promise((resolve) => setTimeout(resolve, ms))\n}\n\n// ─── Public API ─────────────────────────────────────────────────────────\n\n/**\n * Strip a ```json / ``` code fence if the model emitted one.\n * Idempotent for naked JSON. Some models (claude-code via router, certain\n * deepseek models) wrap output even under json_object.\n */\nexport function stripFencedJson(raw: string): string {\n const trimmed = raw.trim()\n const m = trimmed.match(/^```(?:json)?\\s*\\n?([\\s\\S]*?)\\n?```\\s*$/)\n return m ? m[1]!.trim() : trimmed\n}\n\n/**\n * Low-level call. Returns raw content + usage + cost. Retries on transient\n * failures; does NOT degrade schema here — callers that want graceful\n * degrade use `callLlmJson`.\n */\nexport async function callLlm(\n req: LlmCallRequest,\n opts: LlmClientOptions = {},\n): Promise<LlmCallResult> {\n const baseUrl = (opts.baseUrl ?? 
DEFAULT_BASE_URL).replace(/\\/+$/, '')\n const url = `${baseUrl}/chat/completions`\n const timeoutMs = req.timeoutMs ?? opts.defaultTimeoutMs ?? DEFAULT_TIMEOUT_MS\n const maxRetries = opts.maxRetries ?? DEFAULT_MAX_RETRIES\n const fetchFn = opts.fetch ?? globalThis.fetch\n const headers = buildHeaders(opts)\n\n let lastErr: unknown\n for (let attempt = 0; attempt < maxRetries; attempt++) {\n const controller = new AbortController()\n const timeoutHandle = setTimeout(() => controller.abort(), timeoutMs)\n const started = Date.now()\n\n try {\n const res = await fetchFn(url, {\n method: 'POST',\n headers,\n body: JSON.stringify(buildBody(req, false)),\n signal: controller.signal,\n })\n clearTimeout(timeoutHandle)\n\n if (!res.ok) {\n const body = await res.text()\n const err = new LlmCallError(\n `LLM call ${res.status}: ${body.slice(0, 300)}`,\n res.status,\n body,\n req.model,\n )\n if (RETRYABLE_STATUS.has(res.status) && attempt < maxRetries - 1) {\n lastErr = err\n const retryAfter = parseRetryAfter(res.headers)\n await sleep(retryAfter ?? backoffMs(attempt))\n continue\n }\n throw err\n }\n\n const json = (await res.json()) as Record<string, unknown>\n const choice = (json.choices as Array<{ message?: { content?: string } }> | undefined)?.[0]\n const usageRaw = (json.usage as Record<string, unknown> | undefined) ?? {}\n const costFromProxy = (json._response_cost ?? json.cost_usd) as number | undefined\n\n return {\n content: choice?.message?.content ?? '',\n usage: {\n promptTokens: Number(usageRaw.prompt_tokens ?? 0),\n completionTokens: Number(usageRaw.completion_tokens ?? 0),\n totalTokens: Number(usageRaw.total_tokens ?? 0),\n cachedPromptTokens:\n usageRaw.prompt_tokens_details &&\n typeof usageRaw.prompt_tokens_details === 'object'\n ? Number(\n (usageRaw.prompt_tokens_details as Record<string, unknown>).cached_tokens ?? 0,\n )\n : undefined,\n },\n costUsd: typeof costFromProxy === 'number' ? costFromProxy : null,\n model: (json.model as string) ?? 
req.model,\n durationMs: Date.now() - started,\n raw: json,\n }\n } catch (err) {\n clearTimeout(timeoutHandle)\n lastErr = err\n if (attempt < maxRetries - 1 && isRetryableError(err)) {\n await sleep(backoffMs(attempt))\n continue\n }\n throw err\n }\n }\n throw lastErr instanceof Error ? lastErr : new Error(String(lastErr))\n}\n\n/**\n * Structured-output call. Returns parsed JSON plus the raw result envelope.\n * Degrades `jsonSchema` → `jsonMode` on a 400 that names the schema param —\n * critical for deepseek-v3/v4, kimi-k2.6, and other models that don't accept\n * the `response_format.json_schema` shape but DO accept `json_object`.\n */\nexport async function callLlmJson<T = unknown>(\n req: LlmCallRequest,\n opts: LlmClientOptions = {},\n): Promise<{ value: T; result: LlmCallResult }> {\n try {\n const result = await callLlm({ ...req, jsonMode: req.jsonMode ?? !req.jsonSchema }, opts)\n const value = parseJsonSafely<T>(result.content, result.model)\n return { value, result }\n } catch (err) {\n if (err instanceof LlmCallError && isSchemaRejection(err.status, err.body) && req.jsonSchema) {\n // Degrade to json_object + retry.\n const degradedReq: LlmCallRequest = { ...req, jsonMode: true, jsonSchema: undefined }\n const result = await callLlm(degradedReq, opts)\n const value = parseJsonSafely<T>(result.content, result.model)\n return { value, result }\n }\n throw err\n }\n}\n\nfunction parseJsonSafely<T>(content: string, model: string): T {\n const stripped = stripFencedJson(content)\n try {\n return JSON.parse(stripped) as T\n } catch (err) {\n throw new Error(\n `LLM returned non-JSON content (model=${model}): ${\n err instanceof Error ? err.message : String(err)\n }\\n--- raw content ---\\n${content.slice(0, 800)}`,\n )\n }\n}\n\n/**\n * Probe whether a model is reachable. Returns latency + null error on\n * success; `ok=false` + error message on any failure (HTTP, timeout,\n * network, parse). 
Designed for sweep preflights — fail loud at the\n * boundary before burning a 30-leaf run on a misconfigured router.\n *\n * Sends a tiny `ping` message with `maxTokens=64`. Reasoning models\n * (glm-5.1, deepseek-v4) can burn the entire budget on internal reasoning\n * for short prompts, so don't tighten this further. We don't validate\n * content; HTTP 200 means reachable.\n */\nexport async function probeLlm(\n model: string,\n opts: LlmClientOptions & { timeoutMs?: number } = {},\n): Promise<{ ok: boolean; latencyMs: number; error: string | null }> {\n const start = Date.now()\n try {\n await callLlm(\n {\n model,\n messages: [{ role: 'user', content: 'ping' }],\n maxTokens: 64,\n timeoutMs: opts.timeoutMs ?? 30_000,\n },\n opts,\n )\n return { ok: true, latencyMs: Date.now() - start, error: null }\n } catch (err) {\n return {\n ok: false,\n latencyMs: Date.now() - start,\n error: err instanceof Error ? err.message : String(err),\n }\n }\n}\n\n/**\n * Stateful client — construct once with defaults, call many times.\n * Thin wrapper around the free functions; exists for callers that want\n * to inject a single configured instance into multiple primitives.\n */\nexport class LlmClient {\n constructor(private readonly opts: LlmClientOptions = {}) {}\n\n call(req: LlmCallRequest, per?: LlmClientOptions): Promise<LlmCallResult> {\n return callLlm(req, { ...this.opts, ...per })\n }\n\n callJson<T = unknown>(\n req: LlmCallRequest,\n per?: LlmClientOptions,\n ): Promise<{ value: T; result: LlmCallResult }> {\n return callLlmJson<T>(req, { ...this.opts, ...per })\n 
}\n}\n"],"mappings":";AA4EO,IAAM,eAAN,cAA2B,MAAM;AAAA,EACtC,YACE,SACgB,QACA,MACA,OAChB;AACA,UAAM,OAAO;AAJG;AACA;AACA;AAGhB,SAAK,OAAO;AAAA,EACd;AAAA,EANkB;AAAA,EACA;AAAA,EACA;AAKpB;AAoBA,IAAM,mBAAmB;AACzB,IAAM,qBAAqB;AAC3B,IAAM,sBAAsB;AAE5B,IAAM,mBAAmB,oBAAI,IAAI,CAAC,KAAK,KAAK,KAAK,GAAG,CAAC;AAErD,SAAS,iBAAiB,KAAuB;AAC/C,MAAI,eAAe,aAAc,QAAO,iBAAiB,IAAI,IAAI,MAAM;AACvE,MAAI,eAAe,OAAO;AACxB,WACE,IAAI,SAAS,gBACb,IAAI,SAAS,kBACb,+CAA+C,KAAK,IAAI,OAAO;AAAA,EAEnE;AACA,SAAO;AACT;AAEA,SAAS,gBAAgB,SAAiC;AACxD,QAAM,IAAI,QAAQ,IAAI,aAAa;AACnC,MAAI,CAAC,EAAG,QAAO;AACf,QAAM,WAAW,OAAO,CAAC;AACzB,MAAI,OAAO,SAAS,QAAQ,KAAK,WAAW,EAAG,QAAO,WAAW;AACjE,QAAM,SAAS,KAAK,MAAM,CAAC;AAC3B,MAAI,OAAO,SAAS,MAAM,EAAG,QAAO,KAAK,IAAI,GAAG,SAAS,KAAK,IAAI,CAAC;AACnE,SAAO;AACT;AAEA,SAAS,UAAU,SAAyB;AAE1C,SAAO,KAAK,IAAI,MAAM,KAAK,IAAI,GAAG,OAAO,GAAG,IAAM;AACpD;AAEA,SAAS,aAAa,MAAgD;AACpE,QAAM,UAAkC;AAAA,IACtC,gBAAgB;AAAA,IAChB,QAAQ;AAAA,EACV;AACA,MAAI,KAAK,YAAY;AACnB,YAAQ,KAAK,WAAW,IAAI,IAAI,KAAK,WAAW;AAAA,EAClD,WAAW,KAAK,UAAU,KAAK,QAAQ;AACrC,YAAQ,gBAAgB,UAAU,KAAK,UAAU,KAAK,MAAM;AAAA,EAC9D;AACA,SAAO;AACT;AAEA,SAAS,kBAAkB,QAAgB,MAAuB;AAChE,MAAI,WAAW,IAAK,QAAO;AAC3B,QAAM,QAAQ,KAAK,YAAY;AAC/B,SACE,MAAM,SAAS,iBAAiB,KAChC,MAAM,SAAS,aAAa,KAC5B,MAAM,SAAS,gBAAgB,KAC/B,MAAM,SAAS,eAAe;AAElC;AAEA,SAAS,UAAU,KAAqB,iBAAmD;AACzF,QAAM,OAAgC;AAAA,IACpC,OAAO,IAAI;AAAA,IACX,UAAU,IAAI;AAAA,IACd,aAAa,IAAI,eAAe;AAAA,EAClC;AACA,MAAI,IAAI,aAAa,KAAM,MAAK,aAAa,IAAI;AAEjD,MAAI,IAAI,cAAc,CAAC,iBAAiB;AACtC,SAAK,kBAAkB;AAAA,MACrB,MAAM;AAAA,MACN,aAAa,EAAE,MAAM,IAAI,WAAW,MAAM,QAAQ,IAAI,WAAW,QAAQ,QAAQ,KAAK;AAAA,IACxF;AAAA,EACF,WAAW,IAAI,YAAY,IAAI,YAAY;AACzC,SAAK,kBAAkB,EAAE,MAAM,cAAc;AAAA,EAC/C;AAEA,SAAO;AACT;AAEA,eAAe,MAAM,IAA2B;AAC9C,SAAO,IAAI,QAAQ,CAAC,YAAY,WAAW,SAAS,EAAE,CAAC;AACzD;AASO,SAAS,gBAAgB,KAAqB;AACnD,QAAM,UAAU,IAAI,KAAK;AACzB,QAAM,IAAI,QAAQ,MAAM,yCAAyC;AACjE,SAAO,IAAI,EAAE,CAAC,EAAG,KAAK,IAAI;AAC5B;AAOA,eAAsB,QACpB,KACA,OAAyB,CAAC,GACF;AACxB,QAAM,WAAW,KAAK,WAAW,kBAAkB,QAAQ,QAAQ,EAAE;AACrE,QAAM,MAAM,GAAG,OAAO;AACtB,QAAM,YAAY
,IAAI,aAAa,KAAK,oBAAoB;AAC5D,QAAM,aAAa,KAAK,cAAc;AACtC,QAAM,UAAU,KAAK,SAAS,WAAW;AACzC,QAAM,UAAU,aAAa,IAAI;AAEjC,MAAI;AACJ,WAAS,UAAU,GAAG,UAAU,YAAY,WAAW;AACrD,UAAM,aAAa,IAAI,gBAAgB;AACvC,UAAM,gBAAgB,WAAW,MAAM,WAAW,MAAM,GAAG,SAAS;AACpE,UAAM,UAAU,KAAK,IAAI;AAEzB,QAAI;AACF,YAAM,MAAM,MAAM,QAAQ,KAAK;AAAA,QAC7B,QAAQ;AAAA,QACR;AAAA,QACA,MAAM,KAAK,UAAU,UAAU,KAAK,KAAK,CAAC;AAAA,QAC1C,QAAQ,WAAW;AAAA,MACrB,CAAC;AACD,mBAAa,aAAa;AAE1B,UAAI,CAAC,IAAI,IAAI;AACX,cAAM,OAAO,MAAM,IAAI,KAAK;AAC5B,cAAM,MAAM,IAAI;AAAA,UACd,YAAY,IAAI,MAAM,KAAK,KAAK,MAAM,GAAG,GAAG,CAAC;AAAA,UAC7C,IAAI;AAAA,UACJ;AAAA,UACA,IAAI;AAAA,QACN;AACA,YAAI,iBAAiB,IAAI,IAAI,MAAM,KAAK,UAAU,aAAa,GAAG;AAChE,oBAAU;AACV,gBAAM,aAAa,gBAAgB,IAAI,OAAO;AAC9C,gBAAM,MAAM,cAAc,UAAU,OAAO,CAAC;AAC5C;AAAA,QACF;AACA,cAAM;AAAA,MACR;AAEA,YAAM,OAAQ,MAAM,IAAI,KAAK;AAC7B,YAAM,SAAU,KAAK,UAAoE,CAAC;AAC1F,YAAM,WAAY,KAAK,SAAiD,CAAC;AACzE,YAAM,gBAAiB,KAAK,kBAAkB,KAAK;AAEnD,aAAO;AAAA,QACL,SAAS,QAAQ,SAAS,WAAW;AAAA,QACrC,OAAO;AAAA,UACL,cAAc,OAAO,SAAS,iBAAiB,CAAC;AAAA,UAChD,kBAAkB,OAAO,SAAS,qBAAqB,CAAC;AAAA,UACxD,aAAa,OAAO,SAAS,gBAAgB,CAAC;AAAA,UAC9C,oBACE,SAAS,yBACT,OAAO,SAAS,0BAA0B,WACtC;AAAA,YACG,SAAS,sBAAkD,iBAAiB;AAAA,UAC/E,IACA;AAAA,QACR;AAAA,QACA,SAAS,OAAO,kBAAkB,WAAW,gBAAgB;AAAA,QAC7D,OAAQ,KAAK,SAAoB,IAAI;AAAA,QACrC,YAAY,KAAK,IAAI,IAAI;AAAA,QACzB,KAAK;AAAA,MACP;AAAA,IACF,SAAS,KAAK;AACZ,mBAAa,aAAa;AAC1B,gBAAU;AACV,UAAI,UAAU,aAAa,KAAK,iBAAiB,GAAG,GAAG;AACrD,cAAM,MAAM,UAAU,OAAO,CAAC;AAC9B;AAAA,MACF;AACA,YAAM;AAAA,IACR;AAAA,EACF;AACA,QAAM,mBAAmB,QAAQ,UAAU,IAAI,MAAM,OAAO,OAAO,CAAC;AACtE;AAQA,eAAsB,YACpB,KACA,OAAyB,CAAC,GACoB;AAC9C,MAAI;AACF,UAAM,SAAS,MAAM,QAAQ,EAAE,GAAG,KAAK,UAAU,IAAI,YAAY,CAAC,IAAI,WAAW,GAAG,IAAI;AACxF,UAAM,QAAQ,gBAAmB,OAAO,SAAS,OAAO,KAAK;AAC7D,WAAO,EAAE,OAAO,OAAO;AAAA,EACzB,SAAS,KAAK;AACZ,QAAI,eAAe,gBAAgB,kBAAkB,IAAI,QAAQ,IAAI,IAAI,KAAK,IAAI,YAAY;AAE5F,YAAM,cAA8B,EAAE,GAAG,KAAK,UAAU,MAAM,YAAY,OAAU;AACpF,YAAM,SAAS,MAAM,QAAQ,aAAa,IAAI;AAC9C,YAAM,QAAQ,gBAAmB,OAAO,SAAS,OAAO,KAAK;AAC7D,aAAO,EAAE,OAAO,OAAO;AAAA,IACzB;AACA,UAAM;AAAA,E
ACR;AACF;AAEA,SAAS,gBAAmB,SAAiB,OAAkB;AAC7D,QAAM,WAAW,gBAAgB,OAAO;AACxC,MAAI;AACF,WAAO,KAAK,MAAM,QAAQ;AAAA,EAC5B,SAAS,KAAK;AACZ,UAAM,IAAI;AAAA,MACR,wCAAwC,KAAK,MAC3C,eAAe,QAAQ,IAAI,UAAU,OAAO,GAAG,CACjD;AAAA;AAAA,EAA0B,QAAQ,MAAM,GAAG,GAAG,CAAC;AAAA,IACjD;AAAA,EACF;AACF;AAaA,eAAsB,SACpB,OACA,OAAkD,CAAC,GACgB;AACnE,QAAM,QAAQ,KAAK,IAAI;AACvB,MAAI;AACF,UAAM;AAAA,MACJ;AAAA,QACE;AAAA,QACA,UAAU,CAAC,EAAE,MAAM,QAAQ,SAAS,OAAO,CAAC;AAAA,QAC5C,WAAW;AAAA,QACX,WAAW,KAAK,aAAa;AAAA,MAC/B;AAAA,MACA;AAAA,IACF;AACA,WAAO,EAAE,IAAI,MAAM,WAAW,KAAK,IAAI,IAAI,OAAO,OAAO,KAAK;AAAA,EAChE,SAAS,KAAK;AACZ,WAAO;AAAA,MACL,IAAI;AAAA,MACJ,WAAW,KAAK,IAAI,IAAI;AAAA,MACxB,OAAO,eAAe,QAAQ,IAAI,UAAU,OAAO,GAAG;AAAA,IACxD;AAAA,EACF;AACF;AAOO,IAAM,YAAN,MAAgB;AAAA,EACrB,YAA6B,OAAyB,CAAC,GAAG;AAA7B;AAAA,EAA8B;AAAA,EAA9B;AAAA,EAE7B,KAAK,KAAqB,KAAgD;AACxE,WAAO,QAAQ,KAAK,EAAE,GAAG,KAAK,MAAM,GAAG,IAAI,CAAC;AAAA,EAC9C;AAAA,EAEA,SACE,KACA,KAC8C;AAC9C,WAAO,YAAe,KAAK,EAAE,GAAG,KAAK,MAAM,GAAG,IAAI,CAAC;AAAA,EACrD;AACF;","names":[]}