@tangle-network/agent-eval 0.50.0 → 0.50.2

package/CHANGELOG.md CHANGED
@@ -1,5 +1,159 @@
  # Changelog
 
+ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-rpc` (Python). The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/); versions are locked across the npm + PyPI packages.
+
+ ---
+
+ ## [0.50.2] — 2026-05-27 — actionability fixes from real-data dogfood
+
+ ### Added
+
+ - **`ScalarDistribution.tailRuns?: Array<{runId, score}>`** — populated for the composite distribution. The report now names the 5 worst runs a customer should inspect first, instead of telling them to "investigate the lower tail" anonymously.
+ - **`InsightReport.costQuality.degraded?: {cost?, pareto?}`** — explicit per-axis degradation reasons when `costUsd` is all zero (cost axis carries no signal) or only a single candidate appears (Pareto collapses to a single point). Replaces the prior silent emission of meaningless single-point Pareto figures.
+ - **Composite-distribution recommendations.** When `composite.mean < 0.3`, the report emits a `critical/investigate` recommendation with the worst-5 runIds enumerated in the detail. Between 0.3 and 0.5, a `high/investigate` recommendation with the worst-3. Closes the gap where `recommendations: []` was being emitted for completely broken corpora.
+ - **Missing-judges flag.** When `judges` is empty across the corpus, the report emits a `medium/expand-corpus` recommendation pointing at `outcome.judgeScores.perJudge` enrichment. Before, the customer had no signal that per-dimension / calibration was unavailable because of input shape, not substrate failure.
+
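The composite-distribution thresholds described above can be sketched roughly as follows. This is an illustration only, not the package's implementation: the `TailRun` and `RecommendationSketch` types and both function names are hypothetical, and treating 0.5 as an exclusive upper bound is an assumption, since the changelog only says "between 0.3 and 0.5".

```typescript
// Hypothetical local types; the real InsightReport shapes may differ.
interface TailRun {
  runId: string;
  score: number;
}

interface RecommendationSketch {
  severity: "critical" | "high";
  action: "investigate";
  runIds: string[];
}

// Return the k lowest-scoring runs, lowest first, without mutating the input.
function worstRuns(runs: TailRun[], k: number): TailRun[] {
  return [...runs].sort((a, b) => a.score - b.score).slice(0, k);
}

// Map the mean composite score to a recommendation, per the changelog's thresholds.
function compositeRecommendation(runs: TailRun[]): RecommendationSketch | undefined {
  if (runs.length === 0) return undefined;
  const mean = runs.reduce((sum, r) => sum + r.score, 0) / runs.length;
  if (mean < 0.3) {
    // Fully broken corpus: name the five worst runs.
    return { severity: "critical", action: "investigate", runIds: worstRuns(runs, 5).map((r) => r.runId) };
  }
  if (mean < 0.5) {
    // Assumption: the 0.5 upper bound is exclusive.
    return { severity: "high", action: "investigate", runIds: worstRuns(runs, 3).map((r) => r.runId) };
  }
  return undefined;
}
```

A corpus with a mean composite near zero would therefore yield a `critical` entry listing five run IDs instead of an empty recommendations array.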
+ ### Fixed
+
+ - `analyzeRuns()` on the legal-agent canonical run (n=36, mean composite = 0.002) now emits actionable recommendations naming specific failing scenarios; previously it returned `recommendations: []` for a fully-broken agent.
+
+ ### Notes
+
+ The four behavior changes are additive — fields are optional, no existing field shape changed. Dogfood-driven: surfaced by running `analyzeRuns()` against three real consumer datasets (legal-agent, agent-builder, gtm-agent golden run) and observing where the report was silent when it should have been loud.
+
+ ---
+
+ ## [0.50.1] — 2026-05-27 — docs + examples
+
+ ### Added
+
+ - `README.md` rewritten as a top-tier OSS landing page: table of contents, decision-packet output sample (annotated JSON), comparison matrix vs LangSmith / Braintrust / Phoenix, three customer journey cards.
+ - `examples/selfimprove-quickstart/` — minimal closed-loop example with annotated stdout.
+ - `examples/customer-feedback-loop/` — Customer A journey: multi-rater approve/reject corpus → `fromFeedbackTable` → `analyzeRuns`.
+ - `examples/customer-otel-traces/` — Customer B journey: OTel spans → `fromOtelSpans` → `analyzeRuns`.
+ - `docs/insight-report.md` — annotated walkthrough of every section of the decision packet.
+ - `docs/customer-journeys.md` — three end-to-end journeys with code + expected output.
+
+ ### Changed
+
+ - `docs/concepts.md` — updated mental model for the three top-level entries (`selfImprove`, `analyzeRuns`, intake adapters) and the layering rule.
+
+ ### Notes
+
+ Docs-only patch. No code changes, no behavior changes, no API surface changes vs 0.50.0.
+
+ ---
+
+ ## [0.50.0] — 2026-05-27 — the decision packet
+
+ ### Added
+
+ - **`analyzeRuns({ runs, ... }): InsightReport`** in `/contract`. Composes the substrate's statistical / calibration / clustering / Pareto primitives into one rigor packet. Sections populate based on what the input supports: distributional summary always, lift when baseline+candidate are present, judges when run records carry `judgeScores`, inter-rater agreement when `raterScores` are supplied, failure clusters when an `AnalystRegistry` is wired, contamination when canaries are passed, outcome correlation when a downstream signal is supplied.
+ - **`InsightReport`** canonical decision-packet shape; reused by `selfImprove()` and emitted on the hosted wire as `EvalRunEvent.insightReport?`.
+ - **Intake adapters** in `/contract`:
+ - `fromFeedbackTable({ ratings })` — multi-rater corpus → `RunRecord[] + raterScores`.
+ - `fromOtelSpans({ spans })` — OpenTelemetry spans → `RunRecord[]`, grouped by `tangle.runId` or `traceId`.
+ - **`SelfImproveResult.insight: InsightReport`** — `selfImprove()` now returns the full decision packet alongside the existing ship/hold verdict.
+
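The `fromOtelSpans` entry above says spans are grouped by `tangle.runId` or `traceId`. A minimal sketch of that grouping idea, assuming that a non-empty string `tangle.runId` attribute takes precedence and that `traceId` is the fallback (the changelog does not state the precedence). `SpanSketch` and `groupSpansByRun` are hypothetical names, not part of the package's API.

```typescript
// Hypothetical, simplified span shape; real OTel spans carry many more fields.
interface SpanSketch {
  traceId: string;
  name: string;
  attributes: Record<string, unknown>;
}

// Group spans into per-run buckets keyed by `tangle.runId`, else by `traceId`.
function groupSpansByRun(spans: SpanSketch[]): Map<string, SpanSketch[]> {
  const groups = new Map<string, SpanSketch[]>();
  for (const span of spans) {
    const tagged = span.attributes["tangle.runId"];
    // Assumption: only a non-empty string counts as a usable run id.
    const key = typeof tagged === "string" && tagged.length > 0 ? tagged : span.traceId;
    const bucket = groups.get(key);
    if (bucket) {
      bucket.push(span);
    } else {
      groups.set(key, [span]);
    }
  }
  return groups;
}
```

Under this assumption, two spans from different traces that share a `tangle.runId` end up in the same run, while untagged spans fall back to one run per trace.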
+ ### Changed
+
+ - `selfImprove()` internally calls `analyzeRuns()` on baseline + winner cells; consumers reading `.lift` continue to work unchanged, while `.insight.lift` now carries CI95 + p-value + Cohen's d + MDE + required-n.
+
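For readers unfamiliar with one of the statistics named above, Cohen's d can be sketched with the textbook pooled-standard-deviation formula. This is a generic illustration; the changelog does not say which estimator `analyzeRuns()` uses, and the function names here are hypothetical.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Cohen's d: difference of means divided by the pooled standard deviation.
function cohensD(baseline: number[], candidate: number[]): number {
  const n1 = baseline.length;
  const n2 = candidate.length;
  if (n1 < 2 || n2 < 2) {
    throw new Error("each group needs at least two scores");
  }
  const pooledVariance =
    ((n1 - 1) * sampleVariance(baseline) + (n2 - 1) * sampleVariance(candidate)) / (n1 + n2 - 2);
  return (mean(candidate) - mean(baseline)) / Math.sqrt(pooledVariance);
}
```

The sketch does not guard against zero pooled variance, which would produce a non-finite result when all scores in both groups are identical.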
+ ### Test coverage
+
+ 1427 / 1427 passing; 11 new integration tests covering lift detection paths, outcome correlation + linear reward model, canary contamination, multi-rater journey end-to-end, OTel journey end-to-end, recommendations shape, JSON-serialisability.
+
+ ---
+
+ ## [0.49.0] — 2026-05-27 — audit-fix sweep
+
+ ### Added
+
+ - `src/adapters/otel.ts` — generic OTel→hosted bridge (`createOtelBridge` / `OtelBridge` / `OtelBridgeOptions`). Stringifies array-valued attributes instead of dropping them.
+ - `src/contract/diff.ts` — `keyForCell` uses `JSON.stringify([scenarioId, rep])` (no separator collisions); `Number.isFinite` coercion on dimension deltas (no NaN propagating to dashboards).
+ - `examples/hosted-ingest-server/server.ts` — `REFERENCE_RECEIVER_START=1|0` env var as the primary start signal; idempotency cache prunes on read with the wire-spec 24h TTL.
+
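The `keyForCell` and `Number.isFinite` notes above can be illustrated with a small sketch. It shows why a JSON-array key avoids the collisions a separator-joined key can produce, and one way to stop non-finite deltas from propagating. All names are hypothetical; the `string | number` type for `rep` and the fallback value of `0` are assumptions, since the changelog specifies neither.

```typescript
// Assumption: the real `rep` type is not stated in the changelog.
type Rep = string | number;

// Naive key: ambiguous whenever a part contains the separator itself.
function naiveKey(scenarioId: string, rep: Rep): string {
  return `${scenarioId}:${rep}`;
}

// JSON-array key: quoting and escaping keep the part boundaries unambiguous.
function cellKeySketch(scenarioId: string, rep: Rep): string {
  return JSON.stringify([scenarioId, rep]);
}

// Coerce NaN, Infinity, and non-numbers to a fallback (assumed here to be 0).
function finiteOr(value: unknown, fallback = 0): number {
  return typeof value === "number" && Number.isFinite(value) ? value : fallback;
}
```

With these definitions, `naiveKey("a:b", "c")` and `naiveKey("a", "b:c")` both produce `a:b:c`, while `cellKeySketch` yields two distinct strings for the same inputs.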
+ ### Changed
+
+ - Python `TraceSpanEventOuter` exposes `tangle.*` pivots via field aliases (`tangle_run_id`, etc.) and round-trips through `model_dump(by_alias=True)`.
+ - Python `_WireModel` emits a `UserWarning` when an extra field is the snake_case shadow of a declared camelCase field (cross-language drift guard).
+
+ ### Removed
+
+ - `src/adapters/traceai.ts` — replaced by `src/adapters/otel.ts`. No back-compat shim.
+
+ ---
+
+ ## [0.48.0] — 2026-05-27 — substrate↔runtime layering fix + diffRuns + Python hosted parity
+
+ ### Added
+
+ - `src/verdict.ts` — `DefaultVerdict` substrate primitive (moved DOWN from agent-runtime).
+ - `src/contract/diff.ts` — `diffRuns` / `diffGenerations` / `diffRunBaselineToWinner` for v3-vs-v4 dashboard rendering, CI reporting, and any consumer comparing improvement-loop output.
+ - `src/adapters/traceai.ts` — OTel→hosted bridge (renamed to `otel.ts` in 0.49.0).
+ - `tests/hosted-roundtrip.test.ts` — proves wire-format binary compat between client and reference receiver.
+ - Python `HostedClient` (`clients/python/src/agent_eval_rpc/hosted.py`) — TS↔Python wire-format parity with bearer auth, idempotency, and exponential backoff on 5xx/408/429.
+ - `CLAUDE.md` repo-layering rule: agent-eval is the substrate; agent-runtime + agent-knowledge depend on it; the reverse is forbidden.
+
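The retry policy mentioned for the Python `HostedClient` (exponential backoff on 5xx/408/429) can be sketched as below. The sketch is in TypeScript for consistency with the package's main language and is not the client's actual code; the 250 ms base delay, the 8 s cap, and both function names are hypothetical.

```typescript
// Retry only on request timeout, rate limiting, and server errors.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// Delay doubles with each attempt (0-based) and is clamped to a cap.
// Assumption: baseMs and capMs are illustrative defaults, not the client's values.
function backoffDelayMs(attempt: number, baseMs = 250, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

A production client would typically also add random jitter to the delay and bound the total number of attempts; both are omitted here to keep the sketch short.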
+ ### Changed
+
+ - `src/campaign/gates/default-production-gate.ts` — `RunRecord` import from local `../../run-record` (was reaching up into agent-runtime).
+ - `src/matrix/types.ts` — `DefaultVerdict` import from `../verdict` (was reaching up into agent-runtime).
+
+ ### Removed
+
+ - `@tangle-network/agent-runtime` from `peerDependencies`, `devDependencies`, and `pnpm.minimumReleaseAgeExclude` (no upward deps from substrate).
+
+ ---
+
+ ## [0.47.0] — 2026-05-26 — Phase D hosted-tier substrate
+
+ ### Added
+
+ - `src/hosted/` — wire-format types frozen at `HOSTED_WIRE_VERSION = '2026-05-26.v1'`, `createHostedClient` with bearer auth + idempotency + bounded retries.
+ - `examples/hosted-ingest-server/` — reference receiver implementing the spec.
+ - `docs/hosted-ingest-spec.md` — semver-locked wire spec.
+ - `selfImprove({ hostedTenant })` — opt-in hosted ingest; failures logged, never fail the loop.
+
+ ---
+
+ ## [0.46.0] — `selfImprove()` LAND-tier helper
+
+ `selfImprove({ scenarios, dispatch, judges, baselineSurface })` shipped in `/contract` as the one-shot wrapper around `runImprovementLoop`.
+
+ ---
+
+ ## [0.45.0] — distributed campaigns
+
+ `/adapters/http` with `httpDispatch` + `runDispatchServer`; `cellPlacement` on `RunCampaignOptions` for cross-region fan-out.
+
+ ---
+
+ ## [0.44.0] — `/adapters/langchain`
+
+ LangChain runnable → `Dispatch` adapter.
+
+ ---
+
+ ## [0.43.0] — edge-friendly storage
+
+ `inMemoryCampaignStorage()` for Cloudflare Workers / edge / test environments.
+
+ ---
+
+ ## [0.42.0] — GEPA driver + legacy deletion
+
+ ### Added
+
+ - `gepaDriver` reflective LLM mutation driver.
+ - `campaignToRunRecords` adapter.
+
+ ### Removed
+
+ - `runMultiShotOptimization` (top-level trajectory-optimizer) — replaced by `runImprovementLoop` + `gepaDriver` composition. The `/multishot` subpath (N-shot persona matrix) is unrelated and remains.
+
+ ---
+
  ## 0.34.0 — 2026-05-23
 
  ### Eval evolution-tracking — first-class `AgentProfile` + per-cell scorecard