@tangle-network/agent-eval 0.50.0 → 0.50.1

package/CHANGELOG.md CHANGED
@@ -1,5 +1,140 @@
1
1
  # Changelog
2
2
 
3
+ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-rpc` (Python) are documented here. The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/); versions are locked across the npm and PyPI packages.
4
+
5
+ ---
6
+
7
+ ## [0.50.1] — 2026-05-27 — docs + examples
8
+
9
+ ### Added
10
+
11
+ - `README.md` rewritten as a top-tier OSS landing page: table of contents, decision-packet output sample (annotated JSON), comparison matrix vs LangSmith / Braintrust / Phoenix, three customer journey cards.
12
+ - `examples/selfimprove-quickstart/` — minimal closed-loop example with annotated stdout.
13
+ - `examples/customer-feedback-loop/` — Customer A journey: multi-rater approve/reject corpus → `fromFeedbackTable` → `analyzeRuns`.
14
+ - `examples/customer-otel-traces/` — Customer B journey: OTel spans → `fromOtelSpans` → `analyzeRuns`.
15
+ - `docs/insight-report.md` — annotated walkthrough of every section of the decision packet.
16
+ - `docs/customer-journeys.md` — three end-to-end journeys with code + expected output.
17
+
18
+ ### Changed
19
+
20
+ - `docs/concepts.md` — updated mental model for the three top-level entries (`selfImprove`, `analyzeRuns`, intake adapters) and the layering rule.
21
+
22
+ ### Notes
23
+
24
+ Docs-only patch. No code changes, no behavior changes, no API surface changes vs 0.50.0.
25
+
26
+ ---
27
+
28
+ ## [0.50.0] — 2026-05-27 — the decision packet
29
+
30
+ ### Added
31
+
32
+ - **`analyzeRuns({ runs, ... }): InsightReport`** in `/contract`. Composes the substrate's statistical / calibration / clustering / Pareto primitives into one rigor packet. Sections populate based on what the input supports: distributional summary always, lift when baseline+candidate are present, judges when run records carry `judgeScores`, inter-rater agreement when `raterScores` are supplied, failure clusters when an `AnalystRegistry` is wired, contamination when canaries are passed, outcome correlation when a downstream signal is supplied.
33
+ - **`InsightReport`** canonical decision-packet shape; reused by `selfImprove()` and emitted on the hosted wire as `EvalRunEvent.insightReport?`.
34
+ - **Intake adapters** in `/contract`:
35
+ - `fromFeedbackTable({ ratings })` — multi-rater corpus → `RunRecord[] + raterScores`.
36
+ - `fromOtelSpans({ spans })` — OpenTelemetry spans → `RunRecord[]`, grouped by `tangle.runId` or `traceId`.
37
+ - **`SelfImproveResult.insight: InsightReport`** — `selfImprove()` now returns the full decision packet alongside the existing ship/hold verdict.
38
+
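The grouping rule described for `fromOtelSpans` (prefer `tangle.runId`, otherwise fall back to `traceId`) can be sketched as follows. This is an illustration only: `SpanLike`, `runKeyFor`, and `groupSpansByRun` are hypothetical names, and the adapter's real input and output types are not shown in this changelog.

```typescript
// Hypothetical span shape for illustration; not the adapter's real input type.
interface SpanLike {
  traceId: string;
  name: string;
  attributes: Record<string, unknown>;
}

// Choose the grouping key: the `tangle.runId` attribute when it is a
// non-empty string, otherwise the span's traceId.
function runKeyFor(span: SpanLike): string {
  const runId = span.attributes["tangle.runId"];
  return typeof runId === "string" && runId.length > 0 ? runId : span.traceId;
}

// Bucket spans by that key, preserving input order within each bucket.
function groupSpansByRun(spans: SpanLike[]): Map<string, SpanLike[]> {
  const groups = new Map<string, SpanLike[]>();
  for (const span of spans) {
    const key = runKeyFor(span);
    const bucket = groups.get(key);
    if (bucket) {
      bucket.push(span);
    } else {
      groups.set(key, [span]);
    }
  }
  return groups;
}

const exampleSpans: SpanLike[] = [
  { traceId: "t1", name: "plan", attributes: { "tangle.runId": "run-a" } },
  { traceId: "t2", name: "act", attributes: { "tangle.runId": "run-a" } },
  { traceId: "t3", name: "solo", attributes: {} },
];
const groupedSpans = groupSpansByRun(exampleSpans);
```

In this sketch the first two spans land in one group even though their trace IDs differ, because the explicit run ID takes precedence; the third span has no run ID and is keyed by its trace ID.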
39
+ ### Changed
40
+
41
+ - `selfImprove()` internally calls `analyzeRuns()` on baseline + winner cells; consumers reading `.lift` continue to work unchanged, while `.insight.lift` now carries CI95 + p-value + Cohen's d + MDE + required-n.
42
+
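For readers unfamiliar with the statistics named above, here is a minimal sketch of two of them, Cohen's d and a 95% confidence interval on the mean difference. It assumes a normal approximation (z = 1.96) and a pooled standard deviation; the changelog does not say which estimators the package actually uses, and `liftSketch` is a hypothetical name, not part of the library.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance; requires at least two observations.
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
}

interface LiftSketch {
  delta: number; // candidate mean minus baseline mean
  cohensD: number; // delta divided by the pooled standard deviation
  ci95: [number, number]; // normal-approximation interval on delta
}

// Assumption: both samples have n >= 2 and non-zero pooled variance.
function liftSketch(baseline: number[], candidate: number[]): LiftSketch {
  const n1 = baseline.length;
  const n2 = candidate.length;
  const v1 = sampleVariance(baseline);
  const v2 = sampleVariance(candidate);
  const delta = mean(candidate) - mean(baseline);
  const pooledSd = Math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2));
  const standardError = Math.sqrt(v1 / n1 + v2 / n2);
  const z = 1.96; // two-sided 95% under a normal approximation
  return {
    delta,
    cohensD: delta / pooledSd,
    ci95: [delta - z * standardError, delta + z * standardError],
  };
}

const liftExample = liftSketch([0.5, 0.6, 0.7], [0.7, 0.8, 0.9]);
```

With samples this small a t-based or bootstrap interval would usually be more appropriate; the normal approximation is used here only to keep the sketch short.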
43
+ ### Test coverage
44
+
45
+ 1427 / 1427 passing; 11 new integration tests covering lift detection paths, outcome correlation + linear reward model, canary contamination, multi-rater journey end-to-end, OTel journey end-to-end, recommendations shape, JSON-serialisability.
46
+
47
+ ---
48
+
49
+ ## [0.49.0] — 2026-05-27 — audit-fix sweep
50
+
51
+ ### Added
52
+
53
+ - `src/adapters/otel.ts` — generic OTel→hosted bridge (`createOtelBridge` / `OtelBridge` / `OtelBridgeOptions`). Stringifies array-valued attributes instead of dropping them.
54
+ - `src/contract/diff.ts` — `keyForCell` uses `JSON.stringify([scenarioId, rep])` (no separator collisions); `Number.isFinite` coercion on dimension deltas (no NaN propagating to dashboards).
55
+ - `examples/hosted-ingest-server/server.ts` — `REFERENCE_RECEIVER_START=1|0` env var as the primary start signal; idempotency cache prunes on read with the wire-spec 24h TTL.
56
+
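The two `diff.ts` fixes above can be illustrated with a short sketch. The function names ending in `Sketch` are hypothetical, the `rep` type is widened to `number | string` purely to show the collision, and the `null` fallback for non-finite values is an assumption; the changelog does not state what the real code coerces to.

```typescript
// A delimiter-joined key is ambiguous when a part can contain the delimiter.
function naiveKeySketch(scenarioId: string, rep: number | string): string {
  return `${scenarioId}:${rep}`;
}

// JSON.stringify on a tuple quotes and escapes each part, so distinct
// (scenarioId, rep) pairs always produce distinct keys.
function keyForCellSketch(scenarioId: string, rep: number | string): string {
  return JSON.stringify([scenarioId, rep]);
}

// Replace NaN, Infinity, and non-numbers with a fallback before the value
// reaches a dashboard. Assumption: null is the chosen fallback here.
function finiteOrNullSketch(value: unknown): number | null {
  return typeof value === "number" && Number.isFinite(value) ? value : null;
}

const naiveA = naiveKeySketch("a:1", "2"); // "a:1:2"
const naiveB = naiveKeySketch("a", "1:2"); // "a:1:2" as well: a collision
const safeA = keyForCellSketch("a:1", "2"); // '["a:1","2"]'
const safeB = keyForCellSketch("a", "1:2"); // '["a","1:2"]'
const cleanedDelta = finiteOrNullSketch(0 / 0); // NaN becomes null
```

The JSON form also distinguishes the number `1` from the string `"1"`, which a plain string join cannot do.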
57
+ ### Changed
58
+
59
+ - Python `TraceSpanEventOuter` exposes `tangle.*` pivots via field aliases (`tangle_run_id`, etc.) and round-trips through `model_dump(by_alias=True)`.
60
+ - Python `_WireModel` emits a `UserWarning` when an extra field is the snake_case shadow of a declared camelCase field (cross-language drift guard).
61
+
62
+ ### Removed
63
+
64
+ - `src/adapters/traceai.ts` — replaced by `src/adapters/otel.ts`. No back-compat shim.
65
+
66
+ ---
67
+
68
+ ## [0.48.0] — 2026-05-27 — substrate↔runtime layering fix + diffRuns + Python hosted parity
69
+
70
+ ### Added
71
+
72
+ - `src/verdict.ts` — `DefaultVerdict` substrate primitive (moved DOWN from agent-runtime).
73
+ - `src/contract/diff.ts` — `diffRuns` / `diffGenerations` / `diffRunBaselineToWinner` for v3-vs-v4 dashboard rendering, CI reporting, and any consumer comparing improvement-loop output.
74
+ - `src/adapters/traceai.ts` — OTel→hosted bridge (renamed to `otel.ts` in 0.49.0).
75
+ - `tests/hosted-roundtrip.test.ts` — proves wire-format binary compat between client and reference receiver.
76
+ - Python `HostedClient` (`clients/python/src/agent_eval_rpc/hosted.py`) — TS↔Python wire-format parity with bearer auth, idempotency, and exponential backoff on 5xx/408/429.
77
+ - `CLAUDE.md` repo-layering rule: agent-eval is the substrate; agent-runtime + agent-knowledge depend on it; the reverse is forbidden.
78
+
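The retry policy described for the hosted client (exponential backoff on 5xx, 408, and 429) might look roughly like this. The sketch is written in TypeScript for consistency with the rest of the package; it is not the Python `HostedClient` code, and the names, base delay, cap, and attempt count are all assumptions.

```typescript
// Status codes the changelog lists as retryable: any 5xx, plus 408 and 429.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
// Jitter is omitted for determinism; production clients often add it.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

interface ResponseLike {
  status: number;
}

// Calls `send` up to `maxAttempts` times, sleeping between retryable
// responses. Thrown network errors are not retried in this sketch.
async function sendWithRetry(
  send: () => Promise<ResponseLike>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<ResponseLike> {
  let response = await send();
  for (
    let attempt = 0;
    attempt < maxAttempts - 1 && isRetryableStatus(response.status);
    attempt++
  ) {
    await sleep(backoffDelayMs(attempt));
    response = await send();
  }
  return response;
}
```

Passing `send` and `sleep` as parameters keeps the policy testable without a network or real timers. Pairing retries with an idempotency key, as the changelog entry does, is what makes repeating a request safe.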
79
+ ### Changed
80
+
81
+ - `src/campaign/gates/default-production-gate.ts` — `RunRecord` import from local `../../run-record` (was reaching up into agent-runtime).
82
+ - `src/matrix/types.ts` — `DefaultVerdict` import from `../verdict` (was reaching up into agent-runtime).
83
+
84
+ ### Removed
85
+
86
+ - `@tangle-network/agent-runtime` from `peerDependencies`, `devDependencies`, and `pnpm.minimumReleaseAgeExclude` (no upward deps from substrate).
87
+
88
+ ---
89
+
90
+ ## [0.47.0] — 2026-05-26 — Phase D hosted-tier substrate
91
+
92
+ ### Added
93
+
94
+ - `src/hosted/` — wire-format types frozen at `HOSTED_WIRE_VERSION = '2026-05-26.v1'`, `createHostedClient` with bearer auth + idempotency + bounded retries.
95
+ - `examples/hosted-ingest-server/` — reference receiver implementing the spec.
96
+ - `docs/hosted-ingest-spec.md` — semver-locked wire spec.
97
+ - `selfImprove({ hostedTenant })` — opt-in hosted ingest; failures logged, never fail the loop.
98
+
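The idempotency handling mentioned for the hosted client and reference receiver, together with the prune-on-read behaviour and 24h TTL noted under 0.49.0, can be sketched as a small cache. `IdempotencyCacheSketch` is a hypothetical class, not the receiver's actual implementation; only the 24h TTL figure comes from the changelog.

```typescript
const IDEMPOTENCY_TTL_MS = 24 * 60 * 60 * 1000; // 24h, per the changelog

interface CacheEntry<T> {
  value: T;
  storedAt: number;
}

class IdempotencyCacheSketch<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number = IDEMPOTENCY_TTL_MS, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Drop every entry whose age has reached the TTL.
  private prune(): void {
    const cutoff = this.now() - this.ttlMs;
    this.entries.forEach((entry, key) => {
      if (entry.storedAt <= cutoff) this.entries.delete(key);
    });
  }

  // Prune on read: expired keys are removed before the lookup.
  get(key: string): T | undefined {
    this.prune();
    const entry = this.entries.get(key);
    return entry ? entry.value : undefined;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, storedAt: this.now() });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A receiver built this way would return the stored response when it sees a repeated idempotency key within the TTL, and would treat the key as new once the entry has expired.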
99
+ ---
100
+
101
+ ## [0.46.0] — `selfImprove()` LAND-tier helper
102
+
103
+ `selfImprove({ scenarios, dispatch, judges, baselineSurface })` shipped in `/contract` as the one-shot wrapper around `runImprovementLoop`.
104
+
105
+ ---
106
+
107
+ ## [0.45.0] — distributed campaigns
108
+
109
+ `/adapters/http` with `httpDispatch` + `runDispatchServer`; `cellPlacement` on `RunCampaignOptions` for cross-region fan-out.
110
+
111
+ ---
112
+
113
+ ## [0.44.0] — `/adapters/langchain`
114
+
115
+ LangChain runnable → `Dispatch` adapter.
116
+
117
+ ---
118
+
119
+ ## [0.43.0] — edge-friendly storage
120
+
121
+ `inMemoryCampaignStorage()` for Cloudflare Workers / edge / test environments.
122
+
123
+ ---
124
+
125
+ ## [0.42.0] — GEPA driver + legacy deletion
126
+
127
+ ### Added
128
+
129
+ - `gepaDriver` reflective LLM mutation driver.
130
+ - `campaignToRunRecords` adapter.
131
+
132
+ ### Removed
133
+
134
+ - `runMultiShotOptimization` (top-level trajectory-optimizer) — replaced by `runImprovementLoop` + `gepaDriver` composition. The `/multishot` subpath (N-shot persona matrix) is unrelated and remains.
135
+
136
+ ---
137
+
3
138
  ## 0.34.0 — 2026-05-23
4
139
 
5
140
  ### Eval evolution-tracking — first-class `AgentProfile` + per-cell scorecard