@tangle-network/agent-eval 0.20.11 → 0.21.0

Files changed (59)

package/CHANGELOG.md +76 -0
package/README.md +137 -170
package/dist/benchmarks/index.d.ts +2 -1
package/dist/{chunk-JAOLXRIA.js → chunk-3GN6U53I.js} +205 -4
package/dist/chunk-3GN6U53I.js.map +1 -0
package/dist/chunk-3IX6QTB7.js +1349 -0
package/dist/chunk-3IX6QTB7.js.map +1 -0
package/dist/chunk-5IIQKMD5.js +236 -0
package/dist/chunk-5IIQKMD5.js.map +1 -0
package/dist/chunk-ARZ6BEV6.js +1310 -0
package/dist/chunk-ARZ6BEV6.js.map +1 -0
package/dist/chunk-HRZELXCR.js +1354 -0
package/dist/chunk-HRZELXCR.js.map +1 -0
package/dist/chunk-KRR4VMH7.js +423 -0
package/dist/chunk-KRR4VMH7.js.map +1 -0
package/dist/chunk-SNUHRBDL.js +154 -0
package/dist/chunk-SNUHRBDL.js.map +1 -0
package/dist/chunk-WOK2RTWG.js +1920 -0
package/dist/chunk-WOK2RTWG.js.map +1 -0
package/dist/{chunk-LSR4IAYN.js → chunk-WOPGKVN4.js} +2 -2
package/dist/chunk-YUFXO3TU.js +148 -0
package/dist/chunk-YUFXO3TU.js.map +1 -0
package/dist/cli.js +3 -2
package/dist/cli.js.map +1 -1
package/dist/control-cxwMOAsy.d.ts +259 -0
package/dist/control.d.ts +6 -0
package/dist/control.js +30 -0
package/dist/control.js.map +1 -0
package/dist/dataset-B9qvlm_o.d.ts +112 -0
package/dist/emitter-B2XqDKFU.d.ts +121 -0
package/dist/feedback-trajectory-CB0A32o3.d.ts +346 -0
package/dist/{index-1PZOtZFr.d.ts → index-c5saLbKD.d.ts} +2 -133
package/dist/index.d.ts +178 -2945
package/dist/index.js +1066 -6185
package/dist/index.js.map +1 -1
package/dist/multi-shot-optimization-Bvtz294B.d.ts +598 -0
package/dist/openapi.json +1 -1
package/dist/optimization.d.ts +146 -0
package/dist/optimization.js +60 -0
package/dist/optimization.js.map +1 -0
package/dist/reporting-Da2ihlcM.d.ts +672 -0
package/dist/reporting.d.ts +5 -0
package/dist/reporting.js +36 -0
package/dist/reporting.js.map +1 -0
package/dist/run-record-CX_jcAyr.d.ts +134 -0
package/dist/store-u47QaJ9G.d.ts +297 -0
package/dist/traces.d.ts +914 -0
package/dist/traces.js +120 -0
package/dist/traces.js.map +1 -0
package/dist/wire/index.js +3 -2
package/docs/concepts.md +16 -11
package/docs/feature-guide.md +10 -17
package/docs/integration-launch-gates.md +77 -0
package/docs/product-eval-adoption.md +27 -0
package/docs/research-report-methodology.md +155 -0
package/docs/trace-analysis.md +75 -0
package/package.json +30 -12
package/dist/chunk-JAOLXRIA.js.map +0 -1
package/dist/{chunk-LSR4IAYN.js.map → chunk-WOPGKVN4.js.map} +0 -0

package/CHANGELOG.md CHANGED

@@ -1,5 +1,81 @@
 # Changelog
+## 0.21.0 — capture integrity + launch-grade reporting
+This release closes the layer-1 gap a downstream consumer surfaced: better
+post-run statistics don't help if the underlying data wasn't captured. 0.21
+adds first-class raw provider-event capture, a fail-loud route guard, a
+run-completion integrity check, and run-complete hooks (with a trace-analyst
+auto-execution helper) so a direct matrix run produces complete forensics
+without out-of-band glue.
+### Added
+- **`RawProviderSink` (capture).** First-class persistence for HTTP-level
+  provider request / response / error payloads alongside the structured
+  `LlmSpan`. `InMemoryRawProviderSink`, `FileSystemRawProviderSink` (NDJSON,
+  rolls at 32 MiB), and `NoopRawProviderSink` ship in core. Default redactor
+  strips `Authorization` / `X-Api-Key` / `Cookie` headers and credential-shaped
+  body fields (`apiKey`, `bearer`, `password`, `secret`, `token`); redacted
+  paths are recorded on `event.redactedFields` so a reviewer can see what was
+  stripped without exposing values. Wired into `callLlm` via
+  `LlmClientOptions.rawSink` — every retry attempt produces a `request` and
+  either a `response` or `error` event with the attempt index attached.
+- **`assertLlmRoute` (route guard).** Pure function that throws
+  `LlmRouteAssertionError` when the configured client doesn't match the
+  caller's route requirements: `requireExplicitBaseUrl`, `allowedBaseUrls`,
+  `blockedBaseUrls`, `requireAuth`, `expectedProvider`. Designed for the
+  matrix-runner preflight — fail loud at the boundary instead of silently
+  falling back to the public/free-tier router.
+- **`assertRunCaptured` (integrity check).** Read-only check on
+  `(store, runId, expectations)` that returns a structured
+  `RunIntegrityReport` with issue codes (`missing_llm_spans`,
+  `missing_raw_events`, `orphan_llm_span`, `no_raw_sink`, `missing_outcome`,
+  …). Pair with the new `requireRawCoverageOfLlmSpans` to assert every
+  `LlmSpan` has a matching raw `request` event. Use directly or via
+  `throwIfRunIncomplete` for strict mode.
+- **`onRunComplete` hooks on `TraceEmitter`.** New
+  `TraceEmitterOptions.onRunComplete` array fires after `endRun` / `abortRun`
+  with full run context (run id, outcome, status, store, emitter). Errors are
+  swallowed and recorded as `log` events by default; opt into propagation via
+  `hookErrors: 'throw'`. `addRunCompleteHook` attaches hooks after construction.
+- **`traceAnalystOnRunComplete` factory.** Drop-in run-complete hook that
+  runs `analyzeTraces` after each run and persists the result. Resolves the
+  "trace analyst never ran on this matrix sweep" complaint by making
+  auto-execution declarative.
+- **`researchReport`** — executive research-report layer for coding-vertical
+  benchmark runs (originally landed in #34, elevated in #35). Composes
+  `summaryTable`, `paretoChart`, `gainHistogram`, held-out gate decisions,
+  and optional `failureClusterView` output into one structured artifact:
+  promote / hold / equivalent / reject / needs-more-data guidance with
+  rationale, risks, next actions, markdown, HTML, and JSON chart specs.
+  - Decisions are made on paired evidence — never on marginal means alone.
+  - ROPE (Region of Practical Equivalence) supported via the `rope` option.
+  - Bayesian-bootstrap-style `Pr(Δ>0)` and `Pr(Δ∈ROPE)` summaries (Rubin 1981).
+  - Per-candidate minimum detectable paired effect via `pairedMde`.
+  - SHA-256 `runFingerprint` and optional `preregistrationHash` linking a
+    signed `HypothesisManifest`.
+  - Embedded methodology + `docs/research-report-methodology.md` companion.
+- **`pairedMde`** in `power-analysis`: closed-form minimum detectable paired
+  effect (inverse to the paired-t / sign-rank power formula).
+### Changed
+- `researchReport` is async (uses Web Crypto via `hashJson` for the run
+  fingerprint).
+- Default `researchReport.minPairs` is 20 (soft floor); hard floor of 6 is
+  enforced regardless via `RESEARCH_REPORT_HARD_PAIR_FLOOR`.
+### Wire-protocol consumers
+No wire-protocol changes. The new capture / integrity / hook primitives are
+TypeScript-only; cross-language consumers continue to use the existing RPC
+surface.
+### Python client
+Locked at `tangle-agent-eval==0.21.0` to match the npm package.
 ## 0.20.10 — hardening audit follow-up
 ### Fixed
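
The 0.21.0 entry above describes a default redactor that strips credential headers and credential-shaped body fields, and records the stripped paths on `event.redactedFields`. This diff does not show the package's implementation, so the following is only a minimal sketch of that idea. `RawEvent`, `redactEvent`, `redactBody`, the `[REDACTED]` placeholder, and the dotted path format are all hypothetical names chosen for illustration.

```ts
// Hypothetical sketch of header/body redaction with recorded paths.
// This is NOT the package's redactor; names and path format are assumptions.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

interface RawEvent {
  headers: Record<string, string>
  body: Json
}

interface RedactedEvent extends RawEvent {
  // Paths that were stripped, so a reviewer sees what was removed without the values.
  redactedFields: string[]
}

// Header and body key names taken from the changelog text, compared case-insensitively.
const SENSITIVE_HEADERS = new Set(['authorization', 'x-api-key', 'cookie'])
const SENSITIVE_BODY_KEYS = new Set(['apikey', 'bearer', 'password', 'secret', 'token'])
const PLACEHOLDER = '[REDACTED]'

// Walk the body recursively, replacing sensitive keys and recording their paths.
function redactBody(value: Json, path: string, redacted: string[]): Json {
  if (Array.isArray(value)) {
    return value.map((item, i) => redactBody(item, `${path}[${i}]`, redacted))
  }
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {}
    for (const [key, child] of Object.entries(value)) {
      const childPath = `${path}.${key}`
      if (SENSITIVE_BODY_KEYS.has(key.toLowerCase())) {
        redacted.push(childPath)
        out[key] = PLACEHOLDER
      } else {
        out[key] = redactBody(child, childPath, redacted)
      }
    }
    return out
  }
  return value
}

// Redact headers first, then the body; the input event is not mutated.
function redactEvent(event: RawEvent): RedactedEvent {
  const redactedFields: string[] = []
  const headers: Record<string, string> = {}
  for (const [name, value] of Object.entries(event.headers)) {
    if (SENSITIVE_HEADERS.has(name.toLowerCase())) {
      redactedFields.push(`headers.${name}`)
      headers[name] = PLACEHOLDER
    } else {
      headers[name] = value
    }
  }
  const body = redactBody(event.body, 'body', redactedFields)
  return { headers, body, redactedFields }
}

const example = redactEvent({
  headers: { Authorization: 'Bearer abc', 'Content-Type': 'application/json' },
  body: { model: 'm', auth: { apiKey: 'k' }, messages: [{ role: 'user', content: 'hi' }] },
})
```

Recording paths rather than values keeps the persisted event auditable: a reviewer can confirm that a credential was present and removed without the sink ever storing it.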

package/README.md CHANGED

@@ -1,65 +1,24 @@
 # @tangle-network/agent-eval
-Evaluation infrastructure for agent systems.
-`agent-eval` gives agent products a reusable way to record what happened,
-verify outcomes, classify failures, compare variants, optimize prompts or
-policies, and make release decisions from evidence instead of anecdotes.
-It does not own your product state, credentials, UI, or model routing. Product
-teams keep those boundaries; this package standardizes how runs are recorded,
-checked, compared, and promoted.
-## Contents
-- [When To Use It](#when-to-use-it)
-- [Architecture](#architecture)
-- [Install](#install)
-- [Quick Start](#quick-start)
-- [Core Primitives](#core-primitives)
-- [Adoption Path](#adoption-path)
-- [Examples](#examples)
-- [Documentation](#documentation)
-- [Development](#development)
-- [Related Packages](#related-packages)
-## When To Use It
-Use `agent-eval` when you need one or more of these:
-- A reproducible eval harness for coding agents, builder agents, or multi-tool
-  workflows.
-- Structured traces for agent runs: spans, artifacts, events, budgets, tool
-  calls, retrieval, judge output, and sandbox execution.
-- Deterministic gates around build/test/deploy checks.
-- LLM-as-judge or deterministic judge fleets with calibration and canaries.
-- Dataset splits, holdouts, paired statistics, and release confidence gates.
-- Failure taxonomy that distinguishes prompt, tool, sandbox, retrieval,
-  evaluator, and knowledge-readiness failures.
-- Optimization loops over prompts, steering, code mutations, or full multi-shot
-  trajectories.
-- Report data for internal launch reviews, CI gates, and research analysis.
-## Architecture
+Evaluation infrastructure for agent products.
+Use it to wrap the real workflow your users run, record what happened, verify
+the result, turn feedback into replay data, compare variants, and ship only
+when the evidence improves.
 ```txt
-agent/product run
-  -> TraceEmitter / TraceStore
-  -> SandboxHarness / MultiLayerVerifier / JudgeRunner
-  -> failure taxonomy + metrics
-  -> paired stats + held-out gates
-  -> optimization + release confidence + reports
+product task
+  -> observe state
+  -> validate with deterministic gates first
+  -> act through the real product adapter
+  -> trace + feedback trajectory
+  -> replay / optimize / release gate
 ```
-Package responsibilities:
-- `agent-eval`: run evidence, eval contracts, verification, statistics,
-  optimization, reporting.
-- Product app: domain state, tools, credentials, UI, storage, deployment, model
-  gateway.
-- `@tangle-network/agent-runtime`: production agent-loop/session runtime.
-- `@tangle-network/agent-knowledge`: evidence stores, claim/page synthesis,
-  retrieval, knowledge readiness implementation.
+`agent-eval` does not own product state, credentials, UI, storage, model
+routing, browser drivers, sandbox policy, or deployment. Products own those.
+This package owns eval contracts, loop mechanics, traces, statistics,
+optimization inputs, and release evidence.
 ## Install
@@ -67,41 +26,23 @@ Package responsibilities:
 pnpm add @tangle-network/agent-eval
 ```
-Wire protocol / CLI:
-```sh
-npm i -g @tangle-network/agent-eval
-agent-eval serve --port 5005
-```
-Python client source lives in `clients/python`. Until the PyPI package is
-published, install it from the repo:
-```sh
-cd clients/python
-pip install -e .
-```
 ## Quick Start
-Wrap the real product loop first. Do not build a toy eval path that users never
-exercise.
 ```ts
 import {
   objectiveEval,
   runAgentControlLoop,
-} from '@tangle-network/agent-eval'
+} from '@tangle-network/agent-eval/control'
 const result = await runAgentControlLoop({
   intent: task.prompt,
   budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
-  async observe() {
-    return productAdapter.readState(task.id)
+  observe() {
+    return product.readState(task.id)
   },
-  async validate({ state }) {
+  validate({ state }) {
     return [
       objectiveEval({
         id: 'build-passes',
@@ -117,128 +58,154 @@ const result = await runAgentControlLoop({
     ]
   },
-  async decide({ evals }) {
-    return evals.every((evalResult) => evalResult.passed)
-      ? { type: 'stop', reason: 'all critical checks passed' }
-      : { type: 'continue', action: { type: 'repair' }, reason: 'checks failed' }
+  decide({ evals }) {
+    const failed = evals.filter((e) => !e.passed)
+    if (failed.length === 0) {
+      return { type: 'stop', pass: true, reason: 'all gates passed' }
+    }
+    return {
+      type: 'continue',
+      action: { type: 'repair', failed: failed.map((e) => e.id) },
+      reason: 'repair failed gates',
+    }
   },
-  async act(action) {
-    return productAdapter.runAgentStep(task.id, action)
+  act(action) {
+    return product.runAgentStep(task.id, action)
   },
 })
-await productAdapter.storeControlResult(task.id, result)
+await product.storeEvalResult(task.id, result)
 ```
-Once this loop represents production behavior, convert completed runs into
-feedback trajectories, split them into train/dev/test/holdout sets, and run
-multi-shot optimization against the same adapter.
-## Core Primitives
-| Primitive | Purpose |
-|---|---|
-| `TraceEmitter`, `TraceStore` | Append-only run/span/event/artifact/budget records. |
-| `SandboxHarness` | Build/test/runtime checks with captured stdout, stderr, exit codes, wall time, and parsed test counts. |
-| `MultiLayerVerifier` | Ordered verification stages with dependencies, skip-on-fail, findings, scores, and time caps. |
-| `JudgeRunner` | Parallel deterministic or LLM-backed judges over the same artifact/run. |
-| `runAgentControlLoop` | Observe/validate/decide/act loop with budgets, stop policies, and structured eval results. |
-| `Dataset`, `RunRecord`, `HeldOutGate` | Versioned corpora, reproducible run metadata, and held-out promotion decisions. |
-| `pairedBootstrap`, `pairedWilcoxon`, `bhAdjust` | Paired experiment statistics and multiple-comparison correction. |
-| `classifyFailure` | Rule-based failure classification for agent, tool, sandbox, retrieval, and knowledge failures. |
-| `runMultiShotOptimization` | Optimization over full agent trajectories with actionable side information. |
-| `runPromptEvolution` | Prompt/steering/code evolution over scenario scores. |
-| `evaluateReleaseConfidence` | Release scorecard across evidence volume, pass rate, score, overfit, cost, latency, and gates. |
-| `summaryTable`, `paretoChart`, `gainHistogram` | Report-ready structured outputs. |
-| `KnowledgeRequirement`, `KnowledgeBundle` | Shared contracts for knowledge readiness. |
-`NoopResearcher` is a fail-loud sentinel for wiring tests. Production systems
-should implement `Researcher` directly or use `CallbackResearcher`.
-## Adoption Path
-1. Choose one real workflow: code generation, browser task, research task,
-   workflow builder, voice interaction, or domain agent task.
-2. Write a product adapter that can observe state and execute one agent step.
-3. Add deterministic validators first: build, test, serve, schema, policy,
-   permission, retrieval, and deployment checks.
-4. Add LLM judges only for subjective quality that deterministic checks cannot
-   measure.
-5. Emit traces and convert successful and failed attempts into
-   `FeedbackTrajectory` records.
-6. Build train/dev/test/holdout scenarios from those trajectories.
-7. Run `runMultiShotOptimization()` or prompt/code evolution on train/dev.
-8. Promote only when test/holdout gates and real product telemetry improve.
-For a complete product integration guide, see
-[Product Eval Adoption](./docs/product-eval-adoption.md).
+That loop should be the same shape in production, replay, benchmark, and
+optimization. Swap dependencies behind `observe()` and `act()`, not the eval
+contract itself.
-## Examples
+## Import Paths
-Runnable examples live in the repository's
-[`examples/`](https://github.com/tangle-network/agent-eval/tree/main/examples)
-directory. They are not part of the published npm package.
+The root export remains available, but new code should prefer focused subpaths:
-- [`examples/same-sandbox-harness`](https://github.com/tangle-network/agent-eval/tree/main/examples/same-sandbox-harness) - run
-  multiple eval passes against the same workspace.
-- [`examples/multi-shot-optimization`](https://github.com/tangle-network/agent-eval/tree/main/examples/multi-shot-optimization) -
-  optimize full agent trajectories with held-out promotion.
-- [`examples/benchmarks`](https://github.com/tangle-network/agent-eval/tree/main/examples/benchmarks) - benchmark adapter shape and
-  reference benchmark wrappers.
+```ts
+import { runAgentControlLoop } from '@tangle-network/agent-eval/control'
+import { runMultiShotOptimization } from '@tangle-network/agent-eval/optimization'
+import { TraceEmitter } from '@tangle-network/agent-eval/traces'
+import { renderReleaseReport } from '@tangle-network/agent-eval/reporting'
+```
-The examples are intentionally kept outside the README so they can be expanded,
-tested, and copied without turning this page into a tutorial.
+| Subpath | Use for |
+| --- | --- |
+| `@tangle-network/agent-eval/control` | `observe -> validate -> decide -> act`, action policy, propose/review loops |
+| `@tangle-network/agent-eval/traces` | trace stores, emitters, TraceAnalyst |
+| `@tangle-network/agent-eval/optimization` | feedback trajectories, multi-shot optimization, prompt evolution |
+| `@tangle-network/agent-eval/reporting` | release confidence, paired stats, report/table/chart specs |
+| `@tangle-network/agent-eval/wire` | HTTP/RPC judge server and schemas |
+| `@tangle-network/agent-eval/benchmarks` | benchmark adapter contracts and reference wrappers |
+## Core Pieces
+| Need | Use |
+| --- | --- |
+| Keep an agent working until objective state passes | `runAgentControlLoop` |
+| Turn user/reviewer feedback into replay data | `FeedbackTrajectory` |
+| Compare prompt/tool/retrieval policies over full trajectories | `runMultiShotOptimization` |
+| Gate releases with paired evidence and holdouts | `evaluateReleaseConfidence`, `HeldOutGate` |
+| Explain regressions across trace corpora | `TraceAnalyst` / `analyzeTraces` |
+| Report a launch decision | `renderReleaseReport`, `researchReport`, `summaryTable`, `paretoChart`, `gainHistogram` |
+| Capture every provider HTTP request / response for forensics | `RawProviderSink`, `LlmClientOptions.rawSink` |
+| Fail loud if an eval would silently use the wrong route | `assertLlmRoute` |
+| Assert at run-end that the artifact is complete | `assertRunCaptured`, `throwIfRunIncomplete` |
+| Auto-execute the trace analyst on every run | `traceAnalystOnRunComplete` + `TraceEmitterOptions.onRunComplete` |
+| Model missing context separately from bad reasoning | `KnowledgeRequirement`, `KnowledgeBundle` |
+### Capture integrity (0.21+)
+Launch-grade benchmark runs need four things that are easy to forget in glue
+code: (1) raw HTTP capture alongside the structured spans so a reviewer can
+verify which route answered, (2) a preflight assertion that the configured
+client points at the intended provider, (3) a run-end assertion that the
+expected events were actually written, and (4) auto-execution of the trace
+analyst as part of the run lifecycle. The wiring fits in a few lines:
-## Documentation
+```ts
+import {
+  TraceEmitter, FileSystemRawProviderSink, callLlm, assertLlmRoute,
+  assertRunCaptured, throwIfRunIncomplete,
+} from '@tangle-network/agent-eval'
+import { traceAnalystOnRunComplete } from '@tangle-network/agent-eval/traces'
-- [Concepts](./docs/concepts.md)
-- [Feature Guide](./docs/feature-guide.md)
-- [Product Eval Adoption](./docs/product-eval-adoption.md)
-- [Control Runtime](./docs/control-runtime.md)
-- [Knowledge Readiness](./docs/knowledge-readiness.md)
-- [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
-- [Feedback Trajectories](./docs/feedback-trajectories.md)
-- [Wire Protocol](./docs/wire-protocol.md)
+const sink = new FileSystemRawProviderSink({ dir: `${workDir}/raw-events` })
+assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, allowedBaseUrls, requireAuth: true })
-## Development
+const emitter = new TraceEmitter(store, {
+  onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save })],
+})
+await emitter.startRun(/* ... */)
+// LLM calls flow through callLlm with `{ rawSink: sink, traceContext: { runId, spanId } }`.
+await emitter.endRun({ pass, score })
-```sh
-pnpm install
-pnpm typecheck
-pnpm test
-pnpm build
-pnpm openapi
+throwIfRunIncomplete(await assertRunCaptured(store, emitter.runId, {
+  llmSpansMin: 1, rawSink: sink, requireRawCoverageOfLlmSpans: true, requireOutcome: true,
+}))
 ```
-Run the local server:
+Directives, rationale, and shipped-bug context are in
+[`SKILL.md` § Capture integrity](./.claude/skills/agent-eval/SKILL.md#capture-integrity-required-for-launch-grade-adoption).
+## Examples
+Runnable examples live in
+[`examples/`](https://github.com/tangle-network/agent-eval/tree/main/examples).
+- [`examples/multi-shot-optimization`](https://github.com/tangle-network/agent-eval/tree/main/examples/multi-shot-optimization):
+  optimize full trajectories with held-out promotion.
+- [`examples/same-sandbox-harness`](https://github.com/tangle-network/agent-eval/tree/main/examples/same-sandbox-harness):
+  run setup/build/test and evidence checks in one workspace.
+- [`examples/benchmarks`](https://github.com/tangle-network/agent-eval/tree/main/examples/benchmarks):
+  benchmark adapter shape and reference wrappers.
+## Docs
+Read in this order:
+1. [Product Eval Adoption](./docs/product-eval-adoption.md)
+2. [Control Runtime](./docs/control-runtime.md)
+3. [Feedback Trajectories](./docs/feedback-trajectories.md)
+4. [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
+5. [Trace Analysis](./docs/trace-analysis.md)
+6. [Knowledge Readiness](./docs/knowledge-readiness.md)
+7. [Integration Launch Gates](./docs/integration-launch-gates.md)
+8. [Wire Protocol](./docs/wire-protocol.md)
+## CLI / Wire Protocol
 ```sh
-pnpm build
-node dist/cli.js serve --port 5005
+npm i -g @tangle-network/agent-eval
+agent-eval serve --port 5005
 ```
-Python client tests:
+The Python client lives in `clients/python`:
 ```sh
-pnpm build
 cd clients/python
-pip install -e ".[dev]"
-pytest
+pip install -e .
 ```
-## Release
+## Development
-`@tangle-network/agent-eval` publishes to npm. The Python client lives under
-`clients/python` and is versioned from this repository.
+```sh
+pnpm install
+pnpm typecheck
+pnpm test
+pnpm build
+pnpm openapi
+```
 ## Related Packages
-- [`@tangle-network/agent-runtime`](https://github.com/tangle-network/agent-runtime)
-- [`@tangle-network/agent-knowledge`](https://github.com/tangle-network/agent-knowledge)
-- [`@tangle-network/agent-integrations`](https://github.com/tangle-network/agent-integrations)
-- [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
-- [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud)
+- `@tangle-network/agent-runtime`: production session/runtime layer.
+- `@tangle-network/agent-knowledge`: source-grounded knowledge bases and readiness.
+- `@tangle-network/agent-integrations`: connection, grant, capability, and integration invocation contracts.
 ## License

package/dist/benchmarks/index.d.ts CHANGED

@@ -1 +1,2 @@
-export { B as BENCHMARK_SPLIT_SEED, b as BenchmarkAdapter, c as BenchmarkDatasetItem, d as BenchmarkEvaluation, i as deterministicSplit, l as routing } from '../index-1PZOtZFr.js';
+export { B as BENCHMARK_SPLIT_SEED, a as BenchmarkAdapter, b as BenchmarkDatasetItem, c as BenchmarkEvaluation, d as deterministicSplit, e as routing } from '../index-c5saLbKD.js';
+import '../run-record-CX_jcAyr.js';