@tangle-network/agent-eval 0.43.1 → 0.43.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/dist/openapi.json CHANGED
@@ -2,7 +2,7 @@
2
2
  "openapi": "3.1.0",
3
3
  "info": {
4
4
  "title": "@tangle-network/agent-eval — wire protocol",
5
- "version": "0.43.1",
5
+ "version": "0.43.2",
6
6
  "description": "HTTP and stdio RPC interface to agent-eval. The TypeScript runtime is the source of truth; this spec is the contract that cross-language clients (Python, Rust, Go) generate from.\n\nWire-protocol version: 1.0.0. Bumps on breaking changes to request/response schemas.",
7
7
  "contact": {
8
8
  "name": "Tangle Network",
@@ -0,0 +1,89 @@
1
+ # External-Agent Wedge — plug-in self-improvement for any agent
2
+
3
+ **Status:** proposed · **Owner:** Drew · **Tracking:** ops-board #507 (epic) · **Updated:** 2026-05-26
4
+
5
+ > One sentence: **expose the self-improvement engine our own fleet runs on as a drop-in library any agent builder can install — land free, bill the hosted intelligence.**
6
+
7
+ This doc locks direction so the team executes without thrash. It is deliberately short. If a question isn't answered here, the default is "do the cheapest thing that keeps the LAND tier free and the EXPAND tier billable."
8
+
9
+ ---
10
+
11
+ ## Why now
12
+
13
+ Repeated signal from agent founders (latest: a high-quality GTM/social-marketing agent, sharp CTO): **no evals, no observability, hand-built, no way for the agent to improve itself.** This is the median serious agent team. They don't need another trace viewer — they need their agent to *get better on its own* and proof that it did.
14
+
15
+ We already built that engine and hardened it the hard way: **6 of our own agents** (tax, legal, creative, gtm, agent-builder, physim) now run the *same* `runCampaign` / `runImprovementLoop` / `gepaDriver` / gate surface. It is `@tangle-network/agent-eval` — already on npm, runtime-agnostic, dogfooded in production. Exposing it externally is distribution of an asset we already own, not a new product to invent.
16
+
17
+ ## Positioning — self-improvement first, not observability
18
+
19
+ If we sell "observability," we are a worse Langfuse/Braintrust/Arize and it's a margin race. Our wedge is the **closed self-improvement loop** (eval campaign → reflective `gepaDriver` prompt optimization → gated promotion → Bradley-Terry tournaments → predictive-validity calibration). Observability is the *byproduct*, not the pitch.
20
+
21
+ **The pitch:** "Plug us in. Your agent runs a closed self-improvement loop against your own use case, gated so it never ships a regression — and you get the eval + trace observability for free as it does."
22
+
23
+ ### Backend-agnostic by design — the sandbox is swappable, not required
24
+
25
+ The whole stack already runs **without our sandbox**, at two granularities:
26
+ - **`agent-eval`** (the self-improvement engine) touches the sandbox only as a *type* — the LAND tier needs no sandbox.
27
+ - **`agent-runtime`** is runtime-decoupled too: every sandbox import is `import type`, and the sandbox backend takes an *injected* instance. Its `Backend` seam already ships `createOpenAICompatibleBackend` (pure LLM) and `createIterableBackend` (bring-your-own) next to `createSandboxPromptBackend`. A builder runs the full runtime + self-improvement loop on **their own execution backend**, no Tangle sandbox installed.
28
+
29
+ So adoption is *graduated*, and the builder picks the depth: (1) **trace-analysis only** over their existing runtime; (2) the **full self-improvement loop** on their own backend; (3) **our sandbox as a swap-in backend** for the batteries. Our sandbox becomes the best *option*, never the cost of entry. The only work to make this explicit: mark the `@tangle-network/sandbox` peer `optional` in agent-runtime + document the `Backend` interface as the swap seam (Phase A).
30
+
31
+ ## The surface — three tiers (land → expand → platform)
32
+
33
+ | Tier | What they do | What they get | Billing |
34
+ |---|---|---|---|
35
+ | **LAND** (exists today) | `npm i @tangle-network/agent-eval`, wrap their agent behind one `dispatch` seam, bring a judge | Full self-improvement loop + **local** trace/eval artifacts. Any infra, no sandbox. | Free (lib) |
36
+ | **EXPAND** (the build) | Route trace/eval/labeled-scenario data to our orchestrator | Hosted dashboards, cross-run intelligence, the capture flywheel as a service | **Metered** — composes with existing sandbox Stripe + cost-ledger |
37
+ | **PLATFORM** (the carrot) | Move execution into our sandbox (agent-dev-container) | Substrate + orchestrator data/intelligence pre-wired, batteries included | Sandbox usage |
38
+
39
+ The free lib casts the widest possible net at near-zero cost (it's already published). Value capture is EXPAND: hosting their data/intelligence = a billable surface on the dimensions we already meter (ingested/retained volume, eval-campaign compute, loop runs, seats). "We don't host observability unless they route to us" is the *business model*, not a gap.
40
+
41
+ ## Plan & gates — land-first, validate, then build
42
+
43
+ The non-negotiable discipline: **do not build the hosted/billing tier before the free LAND is validated on a foreign agent.** Reality on someone else's real agent is cheaper than our imagination.
44
+
45
+ - **Phase A — unblock (ungated, now):** upgrade agent-dev-container to the latest substrate; export `OutcomeStore`/`DeploymentOutcome` from `/rl`; mark agent-runtime's `@tangle-network/sandbox` peer `optional` + document the `Backend` swap seam (make sandbox-free adoption explicitly supported). Pure wins, correct regardless of the wedge.
46
+ - **Phase B — design-partner LAND validation (forcing function):** wrap the founder's agent behind `dispatch`, author a marketing-quality judge, run one real campaign + `gepaDriver` loop. Instrument integration friction, judge cold-start, and actual quality lift.
47
+ - **GATE — go/no-go + pricing:** decided from Phase B evidence, not theory.
48
+ - **Phase C — LAND ergonomics:** external 15-minute quickstart + a stable `dispatch`/judge/scenario contract + ≥1 reference framework adapter.
49
+ - **Phase D — EXPAND (gated):** hosted OTLP/eval-run HTTP sink (client in agent-eval) + multi-tenant orchestrator ingest + metered billing + minimal dashboard (server in `@tangle-network/monorepo`).
50
+
51
+ ## Success metrics
52
+
53
+ - **Phase B:** ≥1 measurable quality lift on the partner's own use case; integration ≤ a 1–2 day pairing.
54
+ - **LAND:** time-to-first-self-improvement-loop for a new external agent < 1 day from the quickstart.
55
+ - **EXPAND:** first external tenant routed + first metered dollar.
56
+
57
+ ## Risk register
58
+
59
+ **Knowns (high confidence)**
60
+ - The substrate is a portable lib; it wraps our 6 agents behind `dispatch`/judge/gate in production.
61
+ - It's runtime-agnostic (FS + in-memory storage, Node + edge) and emits OTLP traces.
62
+ - **agent-runtime is already sandbox-decoupled at runtime** — sandbox is type-only + dependency-injected; the `Backend` seam (`createOpenAICompatibleBackend` / `createIterableBackend` / `createSandboxPromptBackend`) makes the sandbox one swappable option. Only the peer-dep label needs to change.
63
+ - The orchestrator/observability/billing platform lives in `@tangle-network/monorepo`.
64
+
65
+ **Known-unknowns (Phase B answers these)**
66
+ - Does `dispatch` wrap a *foreign* agent as cleanly as ours, or are there hidden Tangle assumptions?
67
+ - Judge cold-start: can a team with no prior evals author a usable judge for a subjective domain (marketing quality)?
68
+ - Data cold-start: no evals = no scenarios = no labels; how do they bootstrap the flywheel day 1?
69
+ - Orchestrator multi-tenancy: it's only run internally — external auth/isolation/privacy is unbuilt.
70
+ - Pricing line + OSS boundary (the lib is already public).
71
+
72
+ **Unknown-unknowns (instrument, don't predict)**
73
+ - Integration-surface explosion → keep a tiny stable contract + reference adapters; refuse to special-case.
74
+ - Foreign-domain eval semantics we've never seen → the design partner is the discovery mechanism.
75
+ - Multi-tenant orchestrator failure modes at external/adversarial scale.
76
+ - Supply-chain/trust — we enter their dependency tree; our security posture becomes their procurement question.
77
+ - *Mitigation for all:* land-first-with-a-partner surfaces surprises cheaply, on a real agent, before any hosted spend. Timebox Phase D behind the gate.
78
+
79
+ ## Non-goals (anti-thrash)
80
+
81
+ - **Not** a standalone observability product. Observability ships only as a byproduct of self-improvement.
82
+ - **Not** building the hosted/billing tier before Phase B validates the lib.
83
+ - **Not** per-framework bespoke integrations — one stable contract + a couple of reference adapters.
84
+ - **Not** re-architecting the substrate for hypothetical external needs — extend at the `dispatch`/judge/store seams only.
85
+ - **Not** changing the sandbox/products roadmap — this *exposes the same engine*, it doesn't fork it.
86
+
87
+ ## Tracking
88
+
89
+ ops-board epic **#507** + children: agent-dev-container stack upgrade (eng), foreign-agent adoption surface (eng), design-partner LAND validation (gtm), hosted orchestrator routing + billing (eng, blocked on the gate), GTM-fit + pricing decision (gtm).
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@tangle-network/agent-eval",
3
- "version": "0.43.1",
3
+ "version": "0.43.2",
4
4
  "description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
5
5
  "homepage": "https://github.com/tangle-network/agent-eval#readme",
6
6
  "repository": {
@@ -140,7 +140,7 @@
140
140
  "zod": "^4.3.6"
141
141
  },
142
142
  "peerDependencies": {
143
- "@tangle-network/agent-runtime": "^0.21.0",
143
+ "@tangle-network/agent-runtime": ">=0.21.0 <0.26.0",
144
144
  "@tangle-network/sandbox": ">=0.2.1 <0.4.0"
145
145
  },
146
146
  "peerDependenciesMeta": {
@@ -153,7 +153,7 @@
153
153
  },
154
154
  "devDependencies": {
155
155
  "@biomejs/biome": "^2.4.15",
156
- "@tangle-network/agent-runtime": "^0.21.0",
156
+ "@tangle-network/agent-runtime": ">=0.21.0 <0.26.0",
157
157
  "@tangle-network/sandbox": "0.3.0",
158
158
  "@types/node": "^25.6.0",
159
159
  "husky": "^9.1.7",