@tangle-network/agent-eval 0.53.0 → 0.55.0

Files changed (67)
  1. package/dist/adapters/http.d.ts +1 -1
  2. package/dist/adapters/langchain.d.ts +1 -1
  3. package/dist/adapters/otel.d.ts +7 -6
  4. package/dist/{baseline-4R5deP0N.d.ts → baseline-DE36-Np7.d.ts} +1 -1
  5. package/dist/benchmarks/index.d.ts +3 -2
  6. package/dist/builder-eval/index.d.ts +4 -3
  7. package/dist/campaign/index.d.ts +9 -7
  8. package/dist/campaign/index.js +33 -4
  9. package/dist/campaign/index.js.map +1 -1
  10. package/dist/{chunk-L7XMNXLO.js → chunk-J4DIMSRK.js} +2 -2
  11. package/dist/{chunk-5KSDYBYH.js → chunk-LYL4SOKT.js} +3 -2
  12. package/dist/chunk-LYL4SOKT.js.map +1 -0
  13. package/dist/{chunk-BWZEGTES.js → chunk-NCK5QLGT.js} +1 -1
  14. package/dist/chunk-NCK5QLGT.js.map +1 -0
  15. package/dist/contract/index.d.ts +13 -12
  16. package/dist/contract/index.js +25 -0
  17. package/dist/contract/index.js.map +1 -1
  18. package/dist/{control-ojEWkMfJ.d.ts → control-DjEgwWNo.d.ts} +6 -5
  19. package/dist/{control-runtime-BZ_lVLYW.d.ts → control-runtime-DuFBYg7A.d.ts} +3 -2
  20. package/dist/control.d.ts +7 -6
  21. package/dist/control.js +2 -2
  22. package/dist/{emitter-DP_cSSiw.d.ts → emitter-DEZwY14K.d.ts} +2 -1
  23. package/dist/{failure-cluster-Cw65_5FY.d.ts → failure-cluster-CL7IVgkJ.d.ts} +2 -1
  24. package/dist/{feedback-trajectory-BSxqEpu7.d.ts → feedback-trajectory-DpUmE90J.d.ts} +1 -1
  25. package/dist/governance/index.d.ts +3 -2
  26. package/dist/hosted/index.d.ts +7 -6
  27. package/dist/{index-C7RhhEME.d.ts → index-D2nT6_KT.d.ts} +20 -2
  28. package/dist/{index-0pu_fBwZ.d.ts → index-wlaiph9Y.d.ts} +1 -1
  29. package/dist/index.d.ts +31 -29
  30. package/dist/index.js +3 -3
  31. package/dist/{integrity-CTDhR1Sg.d.ts → integrity-CfXjSqEv.d.ts} +1 -1
  32. package/dist/knowledge/index.d.ts +4 -3
  33. package/dist/meta-eval/index.d.ts +4 -3
  34. package/dist/openapi.json +1 -1
  35. package/dist/pipelines/index.d.ts +7 -6
  36. package/dist/prm/index.d.ts +5 -4
  37. package/dist/{query-DODUYdPg.d.ts → query-CqTxMwDw.d.ts} +2 -1
  38. package/dist/{red-team-30II1T4o.d.ts → red-team-CrC5MZYd.d.ts} +1 -1
  39. package/dist/{registry-8KAs18kY.d.ts → registry-BSWy0rvH.d.ts} +1 -1
  40. package/dist/{release-report-DSu0DWy8.d.ts → release-report-B6l5fi7T.d.ts} +2 -2
  41. package/dist/reporting.d.ts +7 -6
  42. package/dist/{researcher-LZD0qHEa.d.ts → researcher-JP8EvnLv.d.ts} +11 -6
  43. package/dist/rl.d.ts +11 -10
  44. package/dist/rl.js +2 -2
  45. package/dist/{rubric-D5tjHNJQ.d.ts → rubric-BOfxn4ja.d.ts} +3 -2
  46. package/dist/{rubric-predictive-validity-ByZEC3BX.d.ts → rubric-predictive-validity-B3qNa4aY.d.ts} +1 -1
  47. package/dist/{run-improvement-loop-Cc7oZlRP.d.ts → run-improvement-loop-BhfdjrMY.d.ts} +3 -3
  48. package/dist/{run-record-BGY6bHRh.d.ts → run-record-etiCMsUq.d.ts} +11 -3
  49. package/dist/{store-Db2Bv8Cf.d.ts → schema-m0gsnbt3.d.ts} +1 -99
  50. package/dist/store-CKUAgsJz.d.ts +101 -0
  51. package/dist/{summary-report-B7gNRX-r.d.ts → summary-report-DLxh4yWk.d.ts} +2 -2
  52. package/dist/{test-graded-scenario-B2kWEdh9.d.ts → test-graded-scenario-BdVaPyHT.d.ts} +3 -2
  53. package/dist/traces.d.ts +7 -6
  54. package/dist/{trajectory-CnoBo-JY.d.ts → trajectory-GEdXJCL5.d.ts} +2 -1
  55. package/dist/{types-Dbj5gu8n.d.ts → types-BgrxOJSf.d.ts} +31 -1
  56. package/dist/wire/index.d.ts +5 -4
  57. package/docs/pilot/README.md +62 -0
  58. package/docs/pilot/customer-checklist.md +90 -0
  59. package/docs/pilot/integration-foreign-stack.md +296 -0
  60. package/docs/pilot/integration-tangle-stack.md +248 -0
  61. package/docs/pilot/one-pager.md +161 -0
  62. package/docs/pilot/sample-insight-report.json +172 -0
  63. package/docs/research/research-roadmap.md +204 -0
  64. package/package.json +1 -1
  65. package/dist/chunk-5KSDYBYH.js.map +0 -1
  66. package/dist/chunk-BWZEGTES.js.map +0 -1
  67. package/dist/{chunk-L7XMNXLO.js.map → chunk-J4DIMSRK.js.map} +0 -0
@@ -0,0 +1,296 @@
1
+ # Integration — Tangle Intelligence on any stack (OpenRouter, OpenAI, Anthropic, LangChain, LlamaIndex, custom)
2
+
3
+ Companion to `integration-tangle-stack.md`. This doc is the path for customers NOT on `@tangle-network/sandbox` + tcloud — you bring your own agent, your own LLM provider, your own trace format. We meet you where you are.
4
+
5
+ ## Zero-setup demo first
6
+
7
+ ```sh
8
+ npx @tangle-network/intelligence demo
9
+ ```
10
+
11
+ Synthetic agent + scenarios + selfImprove end-to-end with zero setup. Prints the `InsightReport` shape — same output you'll get against your real data. Hosted equivalent: **[staging-intelligence.tangle.tools](https://staging-intelligence.tangle.tools)**.
12
+
13
+ ## Decision tree — pick your starting point
14
+
15
+ ```
16
+ What's your trace source?
17
+ ├── OTel-compatible (Datadog APM, Honeycomb, NewRelic, raw OTLP) → fromOtelSpans
18
+ ├── LangChain (LangSmith traces, LCEL run traces) → fromLangChain (#104, queued)
19
+ ├── LlamaIndex (callback traces) → fromLlamaIndex (#104, queued)
20
+ ├── Anthropic SDK direct (Messages API call logs) → fromAnthropicSDK (#104, queued)
21
+ ├── OpenAI Assistants API (run + step events) → fromOpenAIAssistants (#104, queued)
22
+ ├── Multi-rater human approval/reject corpus → fromFeedbackTable
23
+ ├── Custom (your own logs / DB rows) → 20-line mapper to RunRecord
24
+ └── @tangle-network/sandbox → fromTangleSandbox (see Tangle-stack doc)
25
+
26
+ What's your LLM provider for the closed loop?
27
+ ├── OpenAI → gepaDriver({ llm: { apiKey, baseUrl: 'https://api.openai.com/v1' } })
28
+ ├── Anthropic → gepaDriver({ llm: { apiKey, baseUrl: 'https://api.anthropic.com/v1' } })
29
+ ├── OpenRouter → gepaDriver({ llm: { apiKey, baseUrl: 'https://openrouter.ai/api/v1' } })
30
+ ├── Tangle tcloud → gepaDriver({ llm: { apiKey, baseUrl: 'https://router.tangle.tools/v1' } })
31
+ ├── Azure OpenAI → gepaDriver({ llm: { apiKey, baseUrl: 'https://<resource>.openai.azure.com/...' } })
32
+ ├── Bedrock / Vertex → custom client wrapper (we help)
33
+ └── Self-hosted (vLLM, Ollama, etc.) → OpenAI-compat endpoint works directly
34
+ ```
35
+
36
+ ## OTel → InsightReport (5 minutes)
37
+
38
+ ```ts
39
+ import { fromOtelSpans } from '@tangle-network/agent-eval'
40
+ import { analyzeRuns } from '@tangle-network/agent-eval/contract'
41
+ import { readFileSync } from 'node:fs'
42
+ // You probably already export OTel via OTLP. Pull a JSONL dump of spans
43
+ // for your last week of agent activity.
44
+ const spans = readFileSync('./agent-traces.jsonl', 'utf8').split('\n').filter(Boolean).map((line) => JSON.parse(line)) // one JSON span per line
45
+
46
+ const runs = fromOtelSpans({ spans }) // RunRecord[]
47
+ const report = await analyzeRuns({ runs })
48
+
49
+ console.log(report.composite.mean)
50
+ console.log(report.recommendations)
51
+ ```
52
+
53
+ What `fromOtelSpans` expects in spans (it's flexible — it tries multiple attribute keys):
54
+ - `trace_id` (groups spans into a run)
55
+ - `attributes['llm.tokens.in' | 'llm.input_tokens']` (optional)
56
+ - `attributes['llm.tokens.out' | 'llm.output_tokens']` (optional)
57
+ - `attributes['tool.name']` (optional, for tool-failure-rate analysis)
58
+ - `status.code: 'ERROR' | 'OK'` (for failure detection)
59
+ - `start_unix_nano` + `end_unix_nano` (for duration)
60
+
61
+ If your spans use different attribute keys, pass `{ attributeMap }` to override.
62
+
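The field list above can be read as a minimal span shape. The sketch below is not the package's implementation: `MinimalSpan`, `TraceSummary`, and `summarizeByTrace` are hypothetical names, used only to illustrate how spans that share a `trace_id` might be folded into per-run totals (tokens, failure flag, wall-clock duration) under the attribute keys listed above.

```typescript
// Hypothetical minimal span shape, built from the keys listed above.
// This is NOT a type exported by @tangle-network/agent-eval.
interface MinimalSpan {
  trace_id: string
  attributes?: Record<string, unknown>
  status?: { code: 'ERROR' | 'OK' }
  // Nanosecond timestamps exceed Number.MAX_SAFE_INTEGER, so absolute values
  // lose a few hundred ns of precision as doubles; fine for ms durations.
  start_unix_nano: number
  end_unix_nano: number
}

interface TraceSummary {
  traceId: string
  inputTokens: number
  outputTokens: number
  failed: boolean
  wallMs: number
}

// Return the first numeric attribute found among the candidate keys, else 0.
function firstNumber(attrs: Record<string, unknown>, keys: string[]): number {
  for (const key of keys) {
    const value = attrs[key]
    if (typeof value === 'number') return value
  }
  return 0
}

function summarizeByTrace(spans: MinimalSpan[]): TraceSummary[] {
  // Group spans by trace_id; Map preserves first-seen order.
  const byTrace = new Map<string, MinimalSpan[]>()
  for (const span of spans) {
    const group = byTrace.get(span.trace_id) ?? []
    group.push(span)
    byTrace.set(span.trace_id, group)
  }

  const summaries: TraceSummary[] = []
  byTrace.forEach((group, traceId) => {
    let inputTokens = 0
    let outputTokens = 0
    let failed = false
    let start = Infinity
    let end = -Infinity
    for (const span of group) {
      const attrs = span.attributes ?? {}
      inputTokens += firstNumber(attrs, ['llm.tokens.in', 'llm.input_tokens'])
      outputTokens += firstNumber(attrs, ['llm.tokens.out', 'llm.output_tokens'])
      if (span.status?.code === 'ERROR') failed = true
      start = Math.min(start, span.start_unix_nano)
      end = Math.max(end, span.end_unix_nano)
    }
    // Run duration = earliest start to latest end, converted from ns to ms.
    summaries.push({ traceId, inputTokens, outputTokens, failed, wallMs: (end - start) / 1e6 })
  })
  return summaries
}
```

Running a helper like this over a dump before calling `fromOtelSpans` is one way to spot spans whose attribute keys would need remapping.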
63
+ ## OpenRouter as your closed-loop LLM provider
64
+
65
+ OpenRouter speaks OpenAI-compat. Plug it in directly:
66
+
67
+ ```ts
68
+ import { selfImprove, gepaDriver } from '@tangle-network/agent-eval/contract'
69
+
70
+ const result = await selfImprove({
71
+ scenarios: yourScenarios,
72
+ agent: yourAgent, // your existing dispatch — any framework
73
+ judge: yourJudge,
74
+ baselineSurface: currentPrompt,
75
+ driver: gepaDriver({
76
+ llm: {
77
+ apiKey: process.env.OPENROUTER_KEY!,
78
+ baseUrl: 'https://openrouter.ai/api/v1',
79
+ },
80
+ model: 'anthropic/claude-sonnet-4.6', // any OpenRouter-supported model
81
+ target: 'agent system prompt',
82
+ }),
83
+ budget: { generations: 3, populationSize: 4, holdoutFraction: 0.3, maxUsd: 25 },
84
+ })
85
+ ```
86
+
87
+ Same shape works for: OpenAI direct, Anthropic direct, Azure OpenAI, Vertex (with their OpenAI-compat layer), Bedrock (via LiteLLM proxy), self-hosted vLLM / Ollama / LMStudio.
88
+
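Because only `apiKey` and `baseUrl` change between providers, the choice can be kept in one lookup table. This is a sketch under assumptions: the base URLs are copied from the decision tree above, while `Provider`, `LlmOptions`, and `llmOptionsFor` are hypothetical local names, not exports of the package.

```typescript
// Providers whose base URLs appear verbatim in the decision tree above.
type Provider = 'openai' | 'anthropic' | 'openrouter' | 'tcloud'

// Local stand-in for the `llm` option object passed to gepaDriver.
interface LlmOptions {
  apiKey: string
  baseUrl: string
}

const BASE_URLS: Record<Provider, string> = {
  openai: 'https://api.openai.com/v1',
  anthropic: 'https://api.anthropic.com/v1',
  openrouter: 'https://openrouter.ai/api/v1',
  tcloud: 'https://router.tangle.tools/v1',
}

// Fail fast on a missing key instead of sending unauthenticated requests.
function llmOptionsFor(provider: Provider, apiKey: string | undefined): LlmOptions {
  if (!apiKey) throw new Error(`missing API key for provider "${provider}"`)
  return { apiKey, baseUrl: BASE_URLS[provider] }
}
```

The result would slot in as `gepaDriver({ llm: llmOptionsFor('openrouter', process.env.OPENROUTER_KEY), ... })`, which keeps the rest of the `selfImprove` call unchanged when you switch providers.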
89
+ ## LangChain customer (three-minute integration)
90
+
91
+ While the dedicated `fromLangChain` adapter is queued (#104), the universal path:
92
+
93
+ ```ts
94
+ // Step 1: configure LangSmith to also export OTel
95
+ // (LangSmith → Project Settings → Trace exports → enable OTLP)
96
+
97
+ // Step 2: ingest as OTel
98
+ import { fromOtelSpans } from '@tangle-network/agent-eval'
99
+ const runs = fromOtelSpans({ spans: yourLangSmithOtelDump })
100
+
101
+ // Step 3: analyze
102
+ import { analyzeRuns } from '@tangle-network/agent-eval/contract'
103
+ const report = await analyzeRuns({ runs })
104
+ ```
105
+
106
+ When `fromLangChain` lands, it'll be a one-liner:
107
+
108
+ ```ts
109
+ // Coming in 0.55.0
110
+ import { fromLangChain } from '@tangle-network/agent-eval/adapters/langchain'
111
+ const runs = fromLangChain({ traces: yourLangSmithExport })
112
+ ```
113
+
114
+ Same for LlamaIndex / OpenAI Assistants / Anthropic SDK — direct adapters queued in #104.
115
+
116
+ ## LlamaIndex customer
117
+
118
+ LlamaIndex's callback manager emits OTel spans natively. Wire it once:
119
+
120
+ ```python
121
+ # Python side — your LlamaIndex setup
122
+ from llama_index.callbacks import OpenInferenceCallbackHandler
123
+
124
+ callback_handler = OpenInferenceCallbackHandler()
125
+ # This emits to your OTel exporter
126
+
127
+ # Then on the agent-eval side (TypeScript or Python):
128
+ from agent_eval_rpc import Client
129
+ client = Client(base_url='https://api.tangle.tools/v1', api_key=YOUR_KEY)
130
+ report = client.analyze_runs(spans=your_otel_spans)
131
+ ```
132
+
133
+ The Python client (`agent-eval-rpc@0.53.0`) speaks the same wire protocol — no functional difference between TS and Python customers.
134
+
135
+ ## Anthropic SDK direct (no framework)
136
+
137
+ If you're calling `@anthropic-ai/sdk` directly without an agent framework:
138
+
139
+ ```ts
140
+ import Anthropic from '@anthropic-ai/sdk'
141
+ import { fromOtelSpans } from '@tangle-network/agent-eval'
142
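+ import { analyzeRuns } from '@tangle-network/agent-eval/contract'
143
+ // Step 1: wrap your Anthropic calls to emit OTel
144
+ import { trace } from '@opentelemetry/api'
145
+ const tracer = trace.getTracer('your-agent')
146
+ const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment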
147
+ async function callAnthropic(scenario: Scenario) {
148
+ return tracer.startActiveSpan('agent.turn', async (span) => {
149
+ const result = await anthropic.messages.create({...})
150
+ span.setAttribute('llm.input_tokens', result.usage.input_tokens)
151
+ span.setAttribute('llm.output_tokens', result.usage.output_tokens)
152
+ span.setAttribute('tangle.runId', scenario.id)
153
+ span.end()
154
+ return result
155
+ })
156
+ }
157
+
158
+ // Step 2: same pipeline
159
+ const runs = fromOtelSpans({ spans: yourOtelExport })
160
+ const report = await analyzeRuns({ runs })
161
+ ```
162
+
163
+ About 20 lines of OTel wrapping; everything after that is the same `fromOtelSpans` → `analyzeRuns` pipeline.
164
+
165
+ ## OpenAI Assistants API
166
+
167
+ The Assistants API emits `runs.steps` events natively. Map them to RunRecord:
168
+
169
+ ```ts
170
+ // Custom mapper while fromOpenAIAssistants (queued #104) lands:
171
+ async function mapAssistantRunToRunRecord(threadId: string, runId: string): Promise<RunRecord> {
172
+ const run = await openai.beta.threads.runs.retrieve(threadId, runId)
173
+ const steps = await openai.beta.threads.runs.steps.list(threadId, runId)
174
+
175
+ return {
176
+ runId: run.id,
177
+ experimentId: 'default',
178
+ candidateId: run.assistant_id,
179
+ seed: 0,
180
+ model: run.model,
181
+ promptHash: hashOf(run.instructions),
182
+ configHash: hashOf({ tools: run.tools, model: run.model }),
183
+ commitSha: process.env.GIT_SHA ?? 'unknown',
184
+ wallMs: ((run.completed_at ?? run.created_at) - run.created_at) * 1000, // completed_at is null until the run finishes
185
+ costUsd: estimateCostFromUsage(run.usage),
186
+ tokenUsage: {
187
+ input: run.usage?.prompt_tokens ?? 0,
188
+ output: run.usage?.completion_tokens ?? 0,
189
+ },
190
+ outcome: {
191
+ holdoutScore: yourScoring(run),
192
+ raw: { stepCount: steps.data.length, status: run.status },
193
+ },
194
+ splitTag: 'holdout',
195
+ }
196
+ }
197
+ ```
198
+
199
+ Once the dedicated adapter ships in 0.55.0 this becomes one line.
200
+
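The mapper above calls `hashOf` without defining it. One possible definition is sketched here, assuming all you need is a stable content hash for `promptHash` and `configHash`; `hashOf` and `stableStringify` are hypothetical helpers, not exports of the package, and the package may expect a different hash format.

```typescript
import { createHash } from 'node:crypto'

// Serialize with object keys sorted, so { a, b } and { b, a } hash identically.
function stableStringify(value: unknown): string {
  if (value === undefined) return 'null'
  if (value === null || typeof value !== 'object') return JSON.stringify(value)
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`
  }
  const obj = value as Record<string, unknown>
  const entries = Object.keys(obj)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
  return `{${entries.join(',')}}`
}

// SHA-256 hex digest of the stable serialization.
function hashOf(value: unknown): string {
  return createHash('sha256').update(stableStringify(value)).digest('hex')
}
```

Sorting keys matters for `configHash`: two runs with the same tools and model should not look like different configurations just because the object was built in a different order.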
201
+ ## Custom trace format
202
+
203
+ Your logs / DB rows / proprietary schema → `RunRecord`:
204
+
205
+ ```ts
206
+ import type { RunRecord } from '@tangle-network/agent-eval'
207
+ import { analyzeRuns } from '@tangle-network/agent-eval/contract'
208
+ function mapMyRowToRunRecord(row: MyAgentLog): RunRecord {
209
+ return {
210
+ runId: row.id,
211
+ experimentId: row.experiment_name ?? 'default',
212
+ candidateId: row.model_version,
213
+ seed: row.random_seed ?? 0,
214
+ model: row.model,
215
+ promptHash: row.prompt_hash,
216
+ configHash: row.config_hash,
217
+ commitSha: row.git_sha,
218
+ wallMs: row.duration_ms,
219
+ costUsd: row.cost,
220
+ tokenUsage: {
221
+ input: row.input_tokens,
222
+ output: row.output_tokens,
223
+ },
224
+ outcome: {
225
+ holdoutScore: row.score,
226
+ raw: row.raw_output, // free-form bag for fields the substrate doesn't standardize
227
+ },
228
+ splitTag: row.is_holdout ? 'holdout' : 'search',
229
+ }
230
+ }
231
+
232
+ const runs = myLogs.map(mapMyRowToRunRecord)
233
+ const report = await analyzeRuns({ runs })
234
+ ```
235
+
236
+ That's the worst case. ~20 lines of mapping, and you're in.
237
+
238
+ ## Multi-rater human feedback (no LLM-as-judge yet)
239
+
240
+ If you don't have an automated judge but you DO have human raters approving/rejecting agent outputs:
241
+
242
+ ```ts
243
+ import { fromFeedbackTable } from '@tangle-network/agent-eval'
244
+ import { analyzeRuns } from '@tangle-network/agent-eval/contract'
245
+
246
+ // Your data shape:
247
+ const ratings = [
248
+ { runId: 'r-001', rater: 'alice', score: 1 }, // approved
249
+ { runId: 'r-001', rater: 'bob', score: 0 }, // rejected — disagreement!
250
+ { runId: 'r-002', rater: 'alice', score: 1 },
251
+ { runId: 'r-002', rater: 'bob', score: 1 },
252
+ // ...
253
+ ]
254
+
255
+ const { runs, raterScores } = fromFeedbackTable({ ratings })
256
+ const report = await analyzeRuns({ runs, raterScores })
257
+
258
+ // report.interRater.kappa → how much your raters agree
259
+ // report.interRater.disagreementCases → which runs raters split on
260
+ // → use these to iterate the rubric until kappa > 0.7
261
+ // → then build an LLM-as-judge against that aligned rubric
262
+ ```
263
+
264
+ This is the warm-up path for customers who don't have a judge yet.
265
+
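For intuition about what an agreement statistic such as κ measures, here is a sketch of Cohen's kappa for two raters and binary scores. It is an illustration only: the source does not say which statistic `report.interRater.kappa` uses (with more than two raters it may well be a different one, such as Fleiss' kappa), and `cohensKappa` is a hypothetical helper.

```typescript
interface Rating {
  runId: string
  rater: string
  score: 0 | 1
}

// Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement),
// computed over the runs that BOTH raters scored.
function cohensKappa(ratings: Rating[], raterA: string, raterB: string): number {
  const scoresA = new Map<string, number>()
  const scoresB = new Map<string, number>()
  for (const r of ratings) {
    if (r.rater === raterA) scoresA.set(r.runId, r.score)
    if (r.rater === raterB) scoresB.set(r.runId, r.score)
  }

  let n = 0
  let agree = 0
  let positivesA = 0
  let positivesB = 0
  scoresA.forEach((scoreA, runId) => {
    const scoreB = scoresB.get(runId)
    if (scoreB === undefined) return // skip runs only one rater scored
    n += 1
    if (scoreA === scoreB) agree += 1
    positivesA += scoreA
    positivesB += scoreB
  })
  if (n === 0) return NaN

  const observed = agree / n
  const pA = positivesA / n
  const pB = positivesB / n
  // Chance agreement: both say 1, or both say 0, if they rated independently.
  const expected = pA * pB + (1 - pA) * (1 - pB)
  // When both raters only ever used one identical label, kappa is undefined.
  return expected === 1 ? NaN : (observed - expected) / (1 - expected)
}
```

In the four-row sample above, alice and bob agree on `r-002` and split on `r-001`, so observed agreement is 0.5 and chance agreement is also 0.5, which gives κ = 0. That is the kind of result that would send you back to the rubric before building a judge.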
266
+ ## Hosted vs self-hosted — what's the difference?
267
+
268
+ | | Self-hosted | Hosted (Tangle Intelligence) |
269
+ |---|---|---|
270
+ | Cost | Your LLM bills + your compute | Same LLM bills + hosted-tier subscription |
271
+ | Dashboard | You build it (or use the OSS examples) | Renders InsightReport out of the box |
272
+ | Cron / scheduling | Your CI / cron / GitHub Action | Managed scheduler runs weekly |
273
+ | Slack / email digest | You wire it | Included |
274
+ | Multi-week trends | You persist | Persisted for you |
275
+ | Decision packet generation | Local (free) | API call (same code; we run it) |
276
+ | Closed-loop campaigns | Local (you pay LLM directly) | Pass-through pricing on LLM, plus per-campaign fee |
277
+ | Auto-PR | Your GitHub token | Your GitHub token via OAuth |
278
+
279
+ Both work end-to-end. Hosted tier is convenience; self-hosted is fine for engineering-heavy teams who want full control.
280
+
281
+ ## Common foreign-stack questions
282
+
283
+ **Q: We use vLLM / Ollama / a custom self-hosted LLM. Does the closed-loop driver work?**
284
+ A: Yes, if your server speaks the OpenAI-compatible API (most do). Pass `baseUrl: 'http://localhost:8000/v1'` (or wherever your server listens) and a dummy `apiKey`. We have customers running `selfImprove` against local LMStudio and Ollama.
285
+
286
+ **Q: We're a Python shop, not TypeScript. Does anything change?**
287
+ A: `agent-eval-rpc@0.53.0` on PyPI speaks the same wire protocol. The Python client is a thin wrapper around the hosted endpoints — same `analyzeRuns()` / `selfImprove()` calls, same `InsightReport` shape, same `gateDecision` values.
288
+
289
+ **Q: We have an extremely custom agent (not LLM-call-shaped). Can we still use this?**
290
+ A: Yes. The substrate doesn't care what your agent IS — it only cares that you can express your runs as `RunRecord[]` and your judge as `(artifact) → JudgeScore`. RL-trained agents, multi-step plan-and-execute, browser-driving agents, code-generating agents — all map cleanly.
291
+
292
+ **Q: What's the minimum cost to try it?**
293
+ A: Free. `analyzeRuns()` is deterministic, runs locally, $0 LLM cost. You can ingest your last week of traces and get a real decision packet without spending a cent. The LLM-cost-incurring step is `selfImprove()` and you set the ceiling.
294
+
295
+ **Q: We don't want to send traces to your hosted tier. Self-hosted only — works?**
296
+ A: Yes. Every primitive in this doc runs locally. The package is MIT-licensed, no SaaS lock-in, no required network call.
@@ -0,0 +1,248 @@
1
+ # Integration — Tangle Intelligence on the Tangle stack (sandbox + tcloud)
2
+
3
+ Step-by-step. This is what we run with you on the onboarding call.
4
+
5
+ ## Zero-setup demo first (30 seconds, no install)
6
+
7
+ ```sh
8
+ npx @tangle-network/intelligence demo
9
+ ```
10
+
11
+ End-to-end loop against synthetic data — agent + judge + scenarios + selfImprove. Prints the `InsightReport` shape you'll get on your real data. Useful to confirm the output is what you want before any integration. Hosted equivalent: open **[staging-intelligence.tangle.tools](https://staging-intelligence.tangle.tools)**.
12
+
13
+ ## Prerequisites you already have
14
+
15
+ - `@tangle-network/sandbox` running your agent in a session
16
+ - `@tangle-network/tcloud` for LLM routing (or any OpenAI-compat router)
17
+ - Your scenarios (the inputs your agent handles) listed somewhere — even as YAML or a TS array
18
+ - A judge function for scoring outputs — LLM-as-judge is fine for v1
19
+
20
+ ## Install
21
+
22
+ The CLI scaffolds and runs everything; you only add the substrate package if your code calls primitives directly:
23
+
24
+ ```sh
25
+ # CLI (zero-install via npx, or add to your repo as a dev-dep)
26
+ npx @tangle-network/intelligence init
27
+
28
+ # Optional — only if your code imports analyzeRuns / selfImprove directly
29
+ pnpm add @tangle-network/agent-eval
30
+ # or for Python customers:
31
+ pip install agent-eval-rpc
32
+ ```
33
+
34
+ `@tangle-network/intelligence` is the customer-facing CLI + hosted product (binary `tangle-intel`). `@tangle-network/agent-eval` is the substrate it wraps — install only if you want to script directly against the primitives.
35
+
36
+ ## Step 1 — Ingest your trace stream
37
+
38
+ You already emit traces via sandbox sessions. Pull them into canonical `RunRecord[]`:
39
+
40
+ ```ts
41
+ import { fromTangleSandbox } from '@tangle-network/agent-eval/adapters/sandbox'
42
+
43
+ const runs = await fromTangleSandbox({
44
+ sessionIds: ['session_abc', 'session_def'], // your current week
45
+ fromMs: lastReportTime,
46
+ toMs: Date.now(),
47
+ })
48
+ // runs is RunRecord[] — canonical wire shape, ready for any downstream substrate primitive
49
+ ```
50
+
51
+ If your agent emits OTel directly instead of going through `@tangle-network/sandbox`:
52
+
53
+ ```ts
54
+ import { fromOtelSpans } from '@tangle-network/agent-eval'
55
+
56
+ const runs = fromOtelSpans({ spans: yourOtelSpans })
57
+ ```
58
+
59
+ ## Step 2 — Get the decision packet (no LLM cost)
60
+
61
+ ```ts
62
+ import { analyzeRuns } from '@tangle-network/agent-eval/contract'
63
+
64
+ const report = await analyzeRuns({
65
+ runs: thisWeek,
66
+ baselineRuns: lastWeek, // optional — gives you the "did my change help?" answer
67
+ baselineLabel: 'vs prior 7 days',
68
+ })
69
+
70
+ console.log(report.composite.mean) // overall score
71
+ console.log(report.composite.tailRuns) // worst 5 runs by name
72
+ console.log(report.priorPeriodComparison?.improvedMetrics) // ['composite'] if significantly better
73
+ console.log(report.priorPeriodComparison?.regressedMetrics) // ['cost'] if cost went up significantly
74
+ console.log(report.recommendations) // priority-ranked actions
75
+ ```
76
+
77
+ That's the **full deterministic flow** — no LLM, $0 cost, runs in ms.
78
+
79
+ Render in your dashboard or pipe to Slack:
80
+
81
+ ```ts
82
+ for (const rec of report.recommendations) {
83
+ if (rec.priority === 'critical') {
84
+ await slack.post(`🔴 ${rec.title}\n${rec.detail}`)
85
+ }
86
+ }
87
+ ```
88
+
89
+ ## Step 3 — Wire the closed loop (real LLM cost — opt-in)
90
+
91
+ Pick the surface you want to optimize. For most customers this is the agent's system-prompt addendum:
92
+
93
+ ```ts
94
+ import { selfImprove, gepaDriver } from '@tangle-network/agent-eval/contract'
95
+
96
+ const result = await selfImprove({
97
+ scenarios: yourScenarios, // 20-50 representative inputs
98
+ agent: async (surface, scenario) => {
99
+ // Your existing agent invocation, with the substrate-proposed surface
100
+ // injected as the system-prompt addendum.
101
+ return await runYourAgent({
102
+ ...scenario,
103
+ systemPromptAddendum: surface as string,
104
+ })
105
+ },
106
+ judge: yourJudge, // function (artifact) → { composite, dimensions }
107
+ baselineSurface: currentAddendum, // the production string today
108
+ driver: gepaDriver({
109
+ llm: { apiKey: tcloudKey, baseUrl: 'https://router.tangle.tools/v1' },
110
+ model: 'anthropic/claude-sonnet-4.6',
111
+ target: 'agent system-prompt addendum',
112
+ }),
113
+ budget: {
114
+ generations: 3,
115
+ populationSize: 4,
116
+ holdoutFraction: 0.3,
117
+ maxUsd: 25, // hard ceiling — refuses to overspend
118
+ },
119
+ })
120
+
121
+ console.log(`gate: ${result.gateDecision.kind}`)
122
+ console.log(`lift: ${result.lift.delta.toFixed(3)} CI=[${result.lift.ci95.join(', ')}]`)
123
+ console.log(`cost spent: $${result.totalCostUsd.toFixed(2)}`)
124
+ ```
125
+
126
+ `result.gateDecision` is one of:
127
+ - `ship-substrate` — winner statistically beats baseline; safe to deploy
128
+ - `inconclusive` — CI straddles zero; either run more rollouts or expand corpus
129
+ - `ship-harness` / `merge` — only when `driftPolicy: 'benchmark-branches'` is on (advanced)
130
+
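To make "CI straddles zero" concrete, the sketch below computes a percentile bootstrap 95% interval over per-scenario score deltas (candidate minus baseline) and maps it onto the first two gate kinds. The statistical method is an assumption chosen for illustration, since the source does not say how `result.lift.ci95` is computed. `bootstrapCi95`, `gateFromCi`, and the seeded `mulberry32` generator are hypothetical names, and treating an interval entirely below zero as `inconclusive` is also an assumption.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is reproducible run to run.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0
  return () => {
    state = (state + 0x6d2b79f5) >>> 0
    let t = state
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Percentile bootstrap: resample the deltas with replacement, record each
// resample's mean, then take the 2.5th and 97.5th percentiles of those means.
function bootstrapCi95(deltas: number[], resamples = 2000, seed = 1): [number, number] {
  if (deltas.length === 0) throw new Error('need at least one delta')
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let i = 0; i < resamples; i++) {
    let sum = 0
    for (let j = 0; j < deltas.length; j++) {
      sum += deltas[Math.floor(rand() * deltas.length)] as number
    }
    means.push(sum / deltas.length)
  }
  means.sort((x, y) => x - y)
  const lo = means[Math.floor(0.025 * (resamples - 1))] as number
  const hi = means[Math.ceil(0.975 * (resamples - 1))] as number
  return [lo, hi]
}

type GateKind = 'ship-substrate' | 'inconclusive'

// Ship only when the whole interval sits above zero.
function gateFromCi(ci95: [number, number]): GateKind {
  return ci95[0] > 0 ? 'ship-substrate' : 'inconclusive'
}
```

With only six holdout scenarios the interval is wide, which is why small corpora tend to land on `inconclusive` even when the mean delta looks positive.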
131
+ ## Step 4 — Auto-PR the winner
132
+
133
+ ```ts
134
+ if (result.gateDecision.kind === 'ship-substrate') {
135
+ await openAutoPr({
136
+ title: `eval: auto-improve addendum (composite +${result.lift.delta.toFixed(3)})`,
137
+ body: `${result.gateDecision.reason}\n\n${formatInsight(result.insight)}`,
138
+ filePath: 'src/lib/.server/production-loop/prompt-addendum.ts',
139
+ newContent: result.diff.kind === 'replace' ? result.diff.content : applyDiff(currentAddendum, result.diff),
140
+ })
141
+ }
142
+ ```
143
+
144
+ We ship `openAutoPr` from `@tangle-network/agent-eval/contract`. It wraps the GitHub PR flow with your existing token.
145
+
146
+ ## The full canonical flow (script you copy and run)
147
+
148
+ ```ts
149
+ // scripts/weekly-improvement.ts — run from a cron / GitHub Action
150
+
151
+ import { fromTangleSandbox } from '@tangle-network/agent-eval/adapters/sandbox'
152
+ import {
153
+ analyzeRuns,
154
+ gepaDriver,
155
+ openAutoPr,
156
+ selfImprove,
157
+ } from '@tangle-network/agent-eval/contract'
158
+ import { scenarios } from './eval/scenarios'
159
+ import { judge } from './eval/judges'
160
+ import { runYourAgent } from './src/agent'
161
+ import { PRODUCTION_ADDENDUM } from './src/lib/.server/production-loop/prompt-addendum'
162
+
163
+ const lastWeek = Date.now() - 7 * 24 * 60 * 60 * 1000
164
+ const twoWeeksAgo = lastWeek - 7 * 24 * 60 * 60 * 1000
165
+
166
+ const thisWeekRuns = await fromTangleSandbox({ fromMs: lastWeek, toMs: Date.now() })
167
+ const lastWeekRuns = await fromTangleSandbox({ fromMs: twoWeeksAgo, toMs: lastWeek })
168
+
169
+ // 1. Deterministic packet — always
170
+ const report = await analyzeRuns({
171
+ runs: thisWeekRuns,
172
+ baselineRuns: lastWeekRuns,
173
+ baselineLabel: 'vs prior 7 days',
174
+ })
175
+
176
+ // 2. Closed loop — only if composite regressed OR we haven't tried in a while
177
+ const shouldRun =
178
+ report.priorPeriodComparison?.regressedMetrics.includes('composite') ||
179
+ daysSinceLastImprovement() > 7
180
+
181
+ if (!shouldRun) {
182
+ console.log('No regression + recent run; skipping.')
183
+ process.exit(0)
184
+ }
185
+
186
+ const result = await selfImprove({
187
+ scenarios,
188
+ agent: (surface, scenario) =>
189
+ runYourAgent({ ...scenario, systemPromptAddendum: surface as string }),
190
+ judge,
191
+ baselineSurface: PRODUCTION_ADDENDUM,
192
+ driver: gepaDriver({
193
+ llm: { apiKey: process.env.TANGLE_KEY!, baseUrl: 'https://router.tangle.tools/v1' },
194
+ model: 'anthropic/claude-sonnet-4.6',
195
+ target: 'production agent system-prompt addendum',
196
+ }),
197
+ budget: { generations: 3, populationSize: 4, holdoutFraction: 0.3, maxUsd: 50 },
198
+ })
199
+
200
+ if (result.gateDecision.kind === 'ship-substrate') {
201
+ await openAutoPr({
202
+ title: `eval: auto-improve addendum (composite +${result.lift.delta.toFixed(3)})`,
203
+ body: renderInsightAsPrBody(result.insight),
204
+ filePath: 'src/lib/.server/production-loop/prompt-addendum.ts',
205
+ newContent: result.diff.kind === 'replace' ? result.diff.content : '...',
206
+ })
207
+ }
208
+ ```
209
+
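The script above calls `daysSinceLastImprovement()` without defining it. A minimal sketch of the trigger logic follows, assuming you persist the timestamp of the last `selfImprove` run yourself (a file, a KV entry, a CI cache) and pass it in; `daysSince` and `shouldRunLoop` are hypothetical helpers, not exports of the package.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Days elapsed since the last closed-loop run; Infinity if it has never run.
function daysSince(lastRunMs: number | undefined, nowMs: number = Date.now()): number {
  if (lastRunMs === undefined) return Infinity
  return (nowMs - lastRunMs) / MS_PER_DAY
}

// Mirrors the condition in the script: run when the composite regressed,
// or when more than 7 days have passed since the last attempt.
function shouldRunLoop(
  regressedMetrics: string[] | undefined,
  lastRunMs: number | undefined,
  nowMs: number = Date.now(),
): boolean {
  const compositeRegressed = (regressedMetrics ?? []).includes('composite')
  return compositeRegressed || daysSince(lastRunMs, nowMs) > 7
}
```

Keeping the clock as a parameter makes the trigger easy to unit-test, and returning `Infinity` for "never ran" means the first scheduled run always fires.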
210
+ ## What we'll do together on the onboarding call
211
+
212
+ 1. **Map your existing setup** — where do your traces emit? which sandbox sessions? which scenarios exist already?
213
+ 2. **Stub the judge** — even a single dimension is enough to start
214
+ 3. **Run a deterministic `analyzeRuns()` against your live data** — first decision packet rendered live
215
+ 4. **Wire one selfImprove cycle** — small budget, single generation, see the loop fire
216
+ 5. **Schedule the cron + auto-PR target** — the loop runs autonomously thereafter
217
+
218
+ Time budget: ~90 minutes. By the end you have a working pilot.
219
+
220
+ ## What our hosted tier adds on top
221
+
222
+ - Decision packet rendered weekly in the Intelligence dashboard — no code changes needed
223
+ - Slack / email digest on `regressedMetrics`
224
+ - Pareto chart, judge calibration, failure-cluster drilldown in the UI
225
+ - Multi-week trend lines
226
+ - Stripe-billed usage tracking
227
+
228
+ If you want self-hosted only, every primitive above works locally. The hosted tier is a convenience.
229
+
230
+ ## FAQ
231
+
232
+ **Q: What's the smallest scenario corpus that gives useful results?**
233
+ A: ~15 scenarios for the deterministic packet (you get distributional stats + recommendations). For `selfImprove`'s held-out gate you want ≥20 since `holdoutFraction: 0.3` reserves 6 for the gate. Below that, the gate often returns `inconclusive`.
234
+
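The arithmetic behind that answer can be checked with a small sizing helper. This is a sketch that assumes the holdout count is the rounded product of corpus size and `holdoutFraction`; the package's actual rounding rule is not stated in the source, and `splitSizes` is a hypothetical name.

```typescript
// Split a corpus of `total` scenarios into search and holdout counts.
// Assumption: holdout = round(total * holdoutFraction).
function splitSizes(total: number, holdoutFraction: number): { search: number; holdout: number } {
  if (!Number.isInteger(total) || total < 0) throw new Error('total must be a non-negative integer')
  if (holdoutFraction < 0 || holdoutFraction > 1) throw new Error('holdoutFraction must be in [0, 1]')
  const holdout = Math.round(total * holdoutFraction)
  return { search: total - holdout, holdout }
}
```

Under that assumption, 20 scenarios at `holdoutFraction: 0.3` leave 14 for search and 6 for the gate, while 10 scenarios leave only 3 holdout runs, too few for a tight confidence interval.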
235
+ **Q: What if my judge isn't reliable yet?**
236
+ A: That's normal. Use multi-rater intake (`fromFeedbackTable`) to measure inter-rater agreement (κ) first, then iterate on the rubric and judge until raters agree. The `interRater` block in `InsightReport` shows exactly which runs raters disagree on.
237
+
238
+ **Q: What if a `selfImprove` campaign returns `inconclusive`?**
239
+ A: It refused to claim improvement because the CI straddles zero. Expand the corpus, raise `holdoutFraction`, or run more generations. An honest `inconclusive` is better than shipping noise.
240
+
241
+ **Q: Can I use a non-tcloud LLM provider?**
242
+ A: Yes — `gepaDriver` accepts any `LlmClientOptions` (any OpenAI-compatible endpoint). We default to tcloud because we already have your auth.
243
+
244
+ **Q: How do I see what changed when the gate ships?**
245
+ A: `result.diff` is a structured patch. We also ship `diffRuns()` separately if you want to compare two campaign outputs.
246
+
247
+ **Q: What if my agent self-modifies (Hermes / Claude Code skills)?**
248
+ A: This is the offline/online drift case. We have the architecture spec ready (`docs/specs/profile-versioning.md`) but the implementation is gated on a forcing-function experiment. For v0.5x pilots we assume the substrate is the only writer to your agent's optimizable surface.