npm - @tangle-network/agent-eval - Versions diffs - 0.50.0 → 0.50.2 - Mend

@tangle-network/agent-eval 0.50.0 → 0.50.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13) hide show

package/CHANGELOG.md +154 -0
package/README.md +235 -331
package/dist/adapters/otel.d.ts +1 -1
package/dist/contract/index.d.ts +2 -2
package/dist/contract/index.js +53 -6
package/dist/contract/index.js.map +1 -1
package/dist/hosted/index.d.ts +1 -1
package/dist/{index-BRxz6qov.d.ts → index-DQHtWQ57.d.ts} +16 -0
package/dist/openapi.json +1 -1
package/docs/concepts.md +20 -0
package/docs/customer-journeys.md +208 -0
package/docs/insight-report.md +337 -0
package/package.json +1 -1

package/CHANGELOG.md CHANGED Viewed

@@ -1,5 +1,159 @@
 # Changelog
+All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-rpc` (Python). The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/); versions are locked across the npm + PyPI packages.
+---
+## [0.50.2] — 2026-05-27 — actionability fixes from real-data dogfood
+### Added
+- **`ScalarDistribution.tailRuns?: Array<{runId, score}>`** — populated for the composite distribution. The report now names the 5 worst runs a customer should inspect first, instead of telling them to "investigate the lower tail" anonymously.
+- **`InsightReport.costQuality.degraded?: {cost?, pareto?}`** — explicit per-axis degradation reasons when `costUsd` is all zero (cost axis carries no signal) or only a single candidate appears (Pareto collapses to a single point). Replaces the prior silent emission of meaningless single-point Pareto figures.
+- **Composite-distribution recommendations.** When `composite.mean < 0.3`, the report emits a `critical/investigate` recommendation with the worst-5 runIds enumerated in the detail. Between 0.3 and 0.5, a `high/investigate` recommendation with the worst-3. Closes the gap where `recommendations: []` was being emitted for completely broken corpora.
+- **Missing-judges flag.** When `judges` is empty across the corpus, the report emits a `medium/expand-corpus` recommendation pointing at `outcome.judgeScores.perJudge` enrichment. Before, the customer had no signal that per-dimension / calibration was unavailable because of input shape, not substrate failure.
+### Fixed
+- `analyzeRuns()` on the legal-agent canonical run (n=36, mean composite = 0.002) now emits actionable recommendations naming specific failing scenarios; previously it returned `recommendations: []` for a fully-broken agent.
+### Notes
+The four behavior changes are additive — fields are optional, no existing field shape changed. Dogfood-driven: surfaced by running `analyzeRuns()` against three real consumer datasets (legal-agent, agent-builder, gtm-agent golden run) and observing where the report was silent when it should have been loud.
+---
+## [0.50.1] — 2026-05-27 — docs + examples
+### Added
+- `README.md` rewritten as a top-tier OSS landing page: table of contents, decision-packet output sample (annotated JSON), comparison matrix vs LangSmith / Braintrust / Phoenix, three customer journey cards.
+- `examples/selfimprove-quickstart/` — minimal closed-loop example with annotated stdout.
+- `examples/customer-feedback-loop/` — Customer A journey: multi-rater approve/reject corpus → `fromFeedbackTable` → `analyzeRuns`.
+- `examples/customer-otel-traces/` — Customer B journey: OTel spans → `fromOtelSpans` → `analyzeRuns`.
+- `docs/insight-report.md` — annotated walkthrough of every section of the decision packet.
+- `docs/customer-journeys.md` — three end-to-end journeys with code + expected output.
+### Changed
+- `docs/concepts.md` — updated mental model for the three top-level entries (`selfImprove`, `analyzeRuns`, intake adapters) and the layering rule.
+### Notes
+Docs-only patch. No code changes, no behavior changes, no API surface changes vs 0.50.0.
+---
+## [0.50.0] — 2026-05-27 — the decision packet
+### Added
+- **`analyzeRuns({ runs, ... }): InsightReport`** in `/contract`. Composes the substrate's statistical / calibration / clustering / Pareto primitives into one rigor packet. Sections populate based on what the input supports: distributional summary always, lift when baseline+candidate are present, judges when run records carry `judgeScores`, inter-rater agreement when `raterScores` are supplied, failure clusters when an `AnalystRegistry` is wired, contamination when canaries are passed, outcome correlation when a downstream signal is supplied.
+- **`InsightReport`** canonical decision-packet shape; reused by `selfImprove()` and emitted on the hosted wire as `EvalRunEvent.insightReport?`.
+- **Intake adapters** in `/contract`:
+  - `fromFeedbackTable({ ratings })` — multi-rater corpus → `RunRecord[] + raterScores`.
+  - `fromOtelSpans({ spans })` — OpenTelemetry spans → `RunRecord[]`, grouped by `tangle.runId` or `traceId`.
+- **`SelfImproveResult.insight: InsightReport`** — `selfImprove()` now returns the full decision packet alongside the existing ship/hold verdict.
+### Changed
+- `selfImprove()` internally calls `analyzeRuns()` on baseline + winner cells; consumers reading `.lift` continue to work unchanged, while `.insight.lift` now carries CI95 + p-value + Cohen's d + MDE + required-n.
+### Test coverage
+1427 / 1427 passing; 11 new integration tests covering lift detection paths, outcome correlation + linear reward model, canary contamination, multi-rater journey end-to-end, OTel journey end-to-end, recommendations shape, JSON-serialisability.
+---
+## [0.49.0] — 2026-05-27 — audit-fix sweep
+### Added
+- `src/adapters/otel.ts` — generic OTel→hosted bridge (`createOtelBridge` / `OtelBridge` / `OtelBridgeOptions`). Stringifies array-valued attributes instead of dropping them.
+- `src/contract/diff.ts` — `keyForCell` uses `JSON.stringify([scenarioId, rep])` (no separator collisions); `Number.isFinite` coercion on dimension deltas (no NaN propagating to dashboards).
+- `examples/hosted-ingest-server/server.ts` — `REFERENCE_RECEIVER_START=1|0` env var as the primary start signal; idempotency cache prunes on read with the wire-spec 24h TTL.
+### Changed
+- Python `TraceSpanEventOuter` exposes `tangle.*` pivots via field aliases (`tangle_run_id`, etc.) and round-trips through `model_dump(by_alias=True)`.
+- Python `_WireModel` emits a `UserWarning` when an extra field is the snake_case shadow of a declared camelCase field (cross-language drift guard).
+### Removed
+- `src/adapters/traceai.ts` — replaced by `src/adapters/otel.ts`. No back-compat shim.
+---
+## [0.48.0] — 2026-05-27 — substrate↔runtime layering fix + diffRuns + Python hosted parity
+### Added
+- `src/verdict.ts` — `DefaultVerdict` substrate primitive (moved DOWN from agent-runtime).
+- `src/contract/diff.ts` — `diffRuns` / `diffGenerations` / `diffRunBaselineToWinner` for v3-vs-v4 dashboard rendering, CI reporting, and any consumer comparing improvement-loop output.
+- `src/adapters/traceai.ts` — OTel→hosted bridge (renamed to `otel.ts` in 0.49.0).
+- `tests/hosted-roundtrip.test.ts` — proves wire-format binary compat between client and reference receiver.
+- Python `HostedClient` (`clients/python/src/agent_eval_rpc/hosted.py`) — TS↔Python wire-format parity with bearer auth, idempotency, and exponential backoff on 5xx/408/429.
+- `CLAUDE.md` repo-layering rule: agent-eval is the substrate; agent-runtime + agent-knowledge depend on it; the reverse is forbidden.
+### Changed
+- `src/campaign/gates/default-production-gate.ts` — `RunRecord` import from local `../../run-record` (was reaching up into agent-runtime).
+- `src/matrix/types.ts` — `DefaultVerdict` import from `../verdict` (was reaching up into agent-runtime).
+### Removed
+- `@tangle-network/agent-runtime` from `peerDependencies`, `devDependencies`, and `pnpm.minimumReleaseAgeExclude` (no upward deps from substrate).
+---
+## [0.47.0] — 2026-05-26 — Phase D hosted-tier substrate
+### Added
+- `src/hosted/` — wire-format types frozen at `HOSTED_WIRE_VERSION = '2026-05-26.v1'`, `createHostedClient` with bearer auth + idempotency + bounded retries.
+- `examples/hosted-ingest-server/` — reference receiver implementing the spec.
+- `docs/hosted-ingest-spec.md` — semver-locked wire spec.
+- `selfImprove({ hostedTenant })` — opt-in hosted ingest; failures logged, never fail the loop.
+---
+## [0.46.0] — `selfImprove()` LAND-tier helper
+`selfImprove({ scenarios, dispatch, judges, baselineSurface })` shipped in `/contract` as the one-shot wrapper around `runImprovementLoop`.
+---
+## [0.45.0] — distributed campaigns
+`/adapters/http` with `httpDispatch` + `runDispatchServer`; `cellPlacement` on `RunCampaignOptions` for cross-region fan-out.
+---
+## [0.44.0] — `/adapters/langchain`
+LangChain runnable → `Dispatch` adapter.
+---
+## [0.43.0] — edge-friendly storage
+`inMemoryCampaignStorage()` for Cloudflare Workers / edge / test environments.
+---
+## [0.42.0] — GEPA driver + legacy deletion
+### Added
+- `gepaDriver` reflective LLM mutation driver.
+- `campaignToRunRecords` adapter.
+### Removed
+- `runMultiShotOptimization` (top-level trajectory-optimizer) — replaced by `runImprovementLoop` + `gepaDriver` composition. The `/multishot` subpath (N-shot persona matrix) is unrelated and remains.
+---
 ## 0.34.0 — 2026-05-23
 ### Eval evolution-tracking — first-class `AgentProfile` + per-cell scorecard