@tangle-network/agent-eval 0.53.0 → 0.55.0
- package/dist/adapters/http.d.ts +1 -1
- package/dist/adapters/langchain.d.ts +1 -1
- package/dist/adapters/otel.d.ts +7 -6
- package/dist/{baseline-4R5deP0N.d.ts → baseline-DE36-Np7.d.ts} +1 -1
- package/dist/benchmarks/index.d.ts +3 -2
- package/dist/builder-eval/index.d.ts +4 -3
- package/dist/campaign/index.d.ts +9 -7
- package/dist/campaign/index.js +33 -4
- package/dist/campaign/index.js.map +1 -1
- package/dist/{chunk-L7XMNXLO.js → chunk-J4DIMSRK.js} +2 -2
- package/dist/{chunk-5KSDYBYH.js → chunk-LYL4SOKT.js} +3 -2
- package/dist/chunk-LYL4SOKT.js.map +1 -0
- package/dist/{chunk-BWZEGTES.js → chunk-NCK5QLGT.js} +1 -1
- package/dist/chunk-NCK5QLGT.js.map +1 -0
- package/dist/contract/index.d.ts +13 -12
- package/dist/contract/index.js +25 -0
- package/dist/contract/index.js.map +1 -1
- package/dist/{control-ojEWkMfJ.d.ts → control-DjEgwWNo.d.ts} +6 -5
- package/dist/{control-runtime-BZ_lVLYW.d.ts → control-runtime-DuFBYg7A.d.ts} +3 -2
- package/dist/control.d.ts +7 -6
- package/dist/control.js +2 -2
- package/dist/{emitter-DP_cSSiw.d.ts → emitter-DEZwY14K.d.ts} +2 -1
- package/dist/{failure-cluster-Cw65_5FY.d.ts → failure-cluster-CL7IVgkJ.d.ts} +2 -1
- package/dist/{feedback-trajectory-BSxqEpu7.d.ts → feedback-trajectory-DpUmE90J.d.ts} +1 -1
- package/dist/governance/index.d.ts +3 -2
- package/dist/hosted/index.d.ts +7 -6
- package/dist/{index-C7RhhEME.d.ts → index-D2nT6_KT.d.ts} +20 -2
- package/dist/{index-0pu_fBwZ.d.ts → index-wlaiph9Y.d.ts} +1 -1
- package/dist/index.d.ts +31 -29
- package/dist/index.js +3 -3
- package/dist/{integrity-CTDhR1Sg.d.ts → integrity-CfXjSqEv.d.ts} +1 -1
- package/dist/knowledge/index.d.ts +4 -3
- package/dist/meta-eval/index.d.ts +4 -3
- package/dist/openapi.json +1 -1
- package/dist/pipelines/index.d.ts +7 -6
- package/dist/prm/index.d.ts +5 -4
- package/dist/{query-DODUYdPg.d.ts → query-CqTxMwDw.d.ts} +2 -1
- package/dist/{red-team-30II1T4o.d.ts → red-team-CrC5MZYd.d.ts} +1 -1
- package/dist/{registry-8KAs18kY.d.ts → registry-BSWy0rvH.d.ts} +1 -1
- package/dist/{release-report-DSu0DWy8.d.ts → release-report-B6l5fi7T.d.ts} +2 -2
- package/dist/reporting.d.ts +7 -6
- package/dist/{researcher-LZD0qHEa.d.ts → researcher-JP8EvnLv.d.ts} +11 -6
- package/dist/rl.d.ts +11 -10
- package/dist/rl.js +2 -2
- package/dist/{rubric-D5tjHNJQ.d.ts → rubric-BOfxn4ja.d.ts} +3 -2
- package/dist/{rubric-predictive-validity-ByZEC3BX.d.ts → rubric-predictive-validity-B3qNa4aY.d.ts} +1 -1
- package/dist/{run-improvement-loop-Cc7oZlRP.d.ts → run-improvement-loop-BhfdjrMY.d.ts} +3 -3
- package/dist/{run-record-BGY6bHRh.d.ts → run-record-etiCMsUq.d.ts} +11 -3
- package/dist/{store-Db2Bv8Cf.d.ts → schema-m0gsnbt3.d.ts} +1 -99
- package/dist/store-CKUAgsJz.d.ts +101 -0
- package/dist/{summary-report-B7gNRX-r.d.ts → summary-report-DLxh4yWk.d.ts} +2 -2
- package/dist/{test-graded-scenario-B2kWEdh9.d.ts → test-graded-scenario-BdVaPyHT.d.ts} +3 -2
- package/dist/traces.d.ts +7 -6
- package/dist/{trajectory-CnoBo-JY.d.ts → trajectory-GEdXJCL5.d.ts} +2 -1
- package/dist/{types-Dbj5gu8n.d.ts → types-BgrxOJSf.d.ts} +31 -1
- package/dist/wire/index.d.ts +5 -4
- package/docs/pilot/README.md +62 -0
- package/docs/pilot/customer-checklist.md +90 -0
- package/docs/pilot/integration-foreign-stack.md +296 -0
- package/docs/pilot/integration-tangle-stack.md +248 -0
- package/docs/pilot/one-pager.md +161 -0
- package/docs/pilot/sample-insight-report.json +172 -0
- package/docs/research/research-roadmap.md +204 -0
- package/package.json +1 -1
- package/dist/chunk-5KSDYBYH.js.map +0 -1
- package/dist/chunk-BWZEGTES.js.map +0 -1
- package/dist/{chunk-L7XMNXLO.js.map → chunk-J4DIMSRK.js.map} +0 -0
@@ -0,0 +1,296 @@
# Integration — Tangle Intelligence on any stack (OpenRouter, OpenAI, Anthropic, LangChain, LlamaIndex, custom)

Companion to `integration-tangle-stack.md`. This doc is the path for customers NOT on `@tangle-network/sandbox` + tcloud — you bring your own agent, your own LLM provider, your own trace format. We meet you where you are.

## Zero-setup demo first

```sh
npx @tangle-network/intelligence demo
```

Synthetic agent + scenarios + selfImprove end-to-end with zero setup. Prints the `InsightReport` shape — same output you'll get against your real data. Hosted equivalent: **[staging-intelligence.tangle.tools](https://staging-intelligence.tangle.tools)**.

## Decision tree — pick your starting point

```
What's your trace source?
├── OTel-compatible (Datadog APM, Honeycomb, NewRelic, raw OTLP) → fromOtelSpans
├── LangChain (LangSmith traces, LCEL run traces) → fromLangChain (#104, queued)
├── LlamaIndex (callback traces) → fromLlamaIndex (#104, queued)
├── Anthropic SDK direct (Messages API call logs) → fromAnthropicSDK (#104, queued)
├── OpenAI Assistants API (run + step events) → fromOpenAIAssistants (#104, queued)
├── Multi-rater human approval/reject corpus → fromFeedbackTable
├── Custom (your own logs / DB rows) → 20-line mapper to RunRecord
└── @tangle-network/sandbox → fromTangleSandbox (see Tangle-stack doc)

What's your LLM provider for the closed loop?
├── OpenAI → gepaDriver({ llm: { apiKey, baseUrl: 'https://api.openai.com/v1' } })
├── Anthropic → gepaDriver({ llm: { apiKey, baseUrl: 'https://api.anthropic.com/v1' } })
├── OpenRouter → gepaDriver({ llm: { apiKey, baseUrl: 'https://openrouter.ai/api/v1' } })
├── Tangle tcloud → gepaDriver({ llm: { apiKey, baseUrl: 'https://router.tangle.tools/v1' } })
├── Azure OpenAI → gepaDriver({ llm: { apiKey, baseUrl: 'https://<resource>.openai.azure.com/...' } })
├── Bedrock / Vertex → custom client wrapper (we help)
└── Self-hosted (vLLM, Ollama, etc.) → OpenAI-compat endpoint works directly
```

## OTel → InsightReport (5 minutes)

```ts
import fs from 'node:fs'
import { fromOtelSpans } from '@tangle-network/agent-eval'
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

// You probably already export OTel via OTLP. Pull a JSONL dump of spans
// for your last week of agent activity.
// JSONL holds one JSON object per line, so parse each non-empty line on its own.
const spans = fs
  .readFileSync('./agent-traces.jsonl', 'utf8')
  .split('\n')
  .filter(Boolean)
  .map((line) => JSON.parse(line))

const runs = fromOtelSpans({ spans }) // RunRecord[]
const report = await analyzeRuns({ runs })

console.log(report.composite.mean)
console.log(report.recommendations)
```

What `fromOtelSpans` expects in spans (it's flexible — it tries multiple attribute keys):
- `trace_id` (groups spans into a run)
- `attributes['llm.tokens.in' | 'llm.input_tokens']` (optional)
- `attributes['llm.tokens.out' | 'llm.output_tokens']` (optional)
- `attributes['tool.name']` (optional, for tool-failure-rate analysis)
- `status.code: 'ERROR' | 'OK'` (for failure detection)
- `start_unix_nano` + `end_unix_nano` (for duration)

If your spans use different attribute keys, pass `{ attributeMap }` to override.
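
If you would rather normalize spans yourself before handing them to `fromOtelSpans`, the key list above can be sketched as a small pre-processing step. Everything here is an assumption made for illustration: `RawSpan` is reconstructed from that list and is not the library's exported type, and the alias keys (`genai.prompt_tokens`, `fn.name`, and so on) are hypothetical stand-ins for whatever your exporter emits.

```typescript
// Assumed span shape, reconstructed from the key list above (not the library's type).
interface RawSpan {
  trace_id: string
  attributes: Record<string, string | number>
  status: { code: 'OK' | 'ERROR' }
  start_unix_nano: number
  end_unix_nano: number
}

// Hypothetical in-house attribute names mapped onto keys named in the list above.
const KEY_ALIASES: Record<string, string> = {
  'genai.prompt_tokens': 'llm.input_tokens',
  'genai.completion_tokens': 'llm.output_tokens',
  'fn.name': 'tool.name',
}

// Rename aliased attribute keys; keys without an alias pass through unchanged.
function normalizeSpan(span: RawSpan): RawSpan {
  const attributes: Record<string, string | number> = {}
  Object.keys(span.attributes).forEach((key) => {
    attributes[KEY_ALIASES[key] ?? key] = span.attributes[key]
  })
  return { ...span, attributes }
}

// Group spans by trace_id, mirroring "groups spans into a run".
function groupByTrace(spans: RawSpan[]): Record<string, RawSpan[]> {
  const byTrace: Record<string, RawSpan[]> = {}
  spans.forEach((span) => {
    if (!byTrace[span.trace_id]) byTrace[span.trace_id] = []
    byTrace[span.trace_id].push(span)
  })
  return byTrace
}

// Span duration in milliseconds. Real nanosecond timestamps exceed
// Number.MAX_SAFE_INTEGER, so the difference is rounded to a few hundred
// nanoseconds, which is still far below millisecond resolution.
function durationMs(span: RawSpan): number {
  return (span.end_unix_nano - span.start_unix_nano) / 1e6
}
```

Calling `spans.map(normalizeSpan)` before `fromOtelSpans({ spans })` keeps the mapping in your own code, which may be easier to audit than a config object.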

## OpenRouter as your closed-loop LLM provider

OpenRouter speaks OpenAI-compat. Plug it in directly:

```ts
import { selfImprove, gepaDriver } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios: yourScenarios,
  agent: yourAgent, // your existing dispatch — any framework
  judge: yourJudge,
  baselineSurface: currentPrompt,
  driver: gepaDriver({
    llm: {
      apiKey: process.env.OPENROUTER_KEY!,
      baseUrl: 'https://openrouter.ai/api/v1',
    },
    model: 'anthropic/claude-sonnet-4.6', // any OpenRouter-supported model
    target: 'agent system prompt',
  }),
  budget: { generations: 3, populationSize: 4, holdoutFraction: 0.3, maxUsd: 25 },
})
```

Same shape works for: OpenAI direct, Anthropic direct, Azure OpenAI, Vertex (with their OpenAI-compat layer), Bedrock (via LiteLLM proxy), self-hosted vLLM / Ollama / LMStudio.
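
Since only `apiKey` and `baseUrl` change between providers, one way to keep the switch in a single place is a small lookup table. This is a sketch under stated assumptions: the base URLs are copied from the decision tree in this doc, while `LlmOptions`, `llmOptionsFor`, and `selfHostedOptions` are hypothetical helper names and are not part of the package.

```typescript
// Base URLs copied from the decision tree in this doc. Azure is omitted
// because its URL depends on your resource name.
const BASE_URLS = {
  openai: 'https://api.openai.com/v1',
  anthropic: 'https://api.anthropic.com/v1',
  openrouter: 'https://openrouter.ai/api/v1',
  tcloud: 'https://router.tangle.tools/v1',
} as const

type Provider = keyof typeof BASE_URLS

// Local stand-in for the `llm` object shown in the gepaDriver examples.
interface LlmOptions {
  apiKey: string
  baseUrl: string
}

// Build the `llm` options for a named provider, failing fast on a missing key.
function llmOptionsFor(provider: Provider, apiKey: string): LlmOptions {
  if (!apiKey) throw new Error(`missing API key for ${provider}`)
  return { apiKey, baseUrl: BASE_URLS[provider] }
}

// Self-hosted OpenAI-compatible servers usually ignore the key, so a
// placeholder value is passed along with your own endpoint.
function selfHostedOptions(baseUrl: string): LlmOptions {
  return { apiKey: 'unused', baseUrl }
}
```

With this in place, a call such as `gepaDriver({ llm: llmOptionsFor('openrouter', process.env.OPENROUTER_KEY ?? ''), ... })` would be the only line to touch when you change providers.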

## LangChain customer — three-minute integration

While the dedicated `fromLangChain` adapter is queued (#104), use the universal OTel path:

```ts
// Step 1: configure LangSmith to also export OTel
// (LangSmith → Project Settings → Trace exports → enable OTLP)

// Step 2: ingest as OTel
import { fromOtelSpans } from '@tangle-network/agent-eval'
const runs = fromOtelSpans({ spans: yourLangSmithOtelDump })

// Step 3: analyze
import { analyzeRuns } from '@tangle-network/agent-eval/contract'
const report = await analyzeRuns({ runs })
```

When `fromLangChain` lands, it'll be a one-liner:

```ts
// Coming in 0.55.0
import { fromLangChain } from '@tangle-network/agent-eval/adapters/langchain'
const runs = fromLangChain({ traces: yourLangSmithExport })
```

Same for LlamaIndex / OpenAI Assistants / Anthropic SDK — direct adapters queued in #104.

## LlamaIndex customer

LlamaIndex's callback manager emits OTel spans natively. Wire it once:

```python
# Python side — your LlamaIndex setup
from llama_index.callbacks import OpenInferenceCallbackHandler

callback_handler = OpenInferenceCallbackHandler()
# This emits to your OTel exporter

# Then on the agent-eval side (TypeScript or Python):
from agent_eval_rpc import Client
client = Client(base_url='https://api.tangle.tools/v1', api_key=YOUR_KEY)
report = client.analyze_runs(spans=your_otel_spans)
```

The Python client (`agent-eval-rpc@0.53.0`) speaks the same wire protocol — no functional difference between TS and Python customers.

## Anthropic SDK direct (no framework)

If you're calling `@anthropic-ai/sdk` directly without an agent framework:

```ts
import Anthropic from '@anthropic-ai/sdk'
import { trace } from '@opentelemetry/api'
import { fromOtelSpans } from '@tangle-network/agent-eval'
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

// Step 1: wrap your Anthropic calls to emit OTel
const tracer = trace.getTracer('your-agent')

// `Scenario` is your own type; only `scenario.id` is used here.
async function callAnthropic(scenario: Scenario) {
  return tracer.startActiveSpan('agent.turn', async (span) => {
    try {
      const result = await anthropic.messages.create({ /* ...your request params */ })
      span.setAttribute('llm.input_tokens', result.usage.input_tokens)
      span.setAttribute('llm.output_tokens', result.usage.output_tokens)
      span.setAttribute('tangle.runId', scenario.id)
      return result
    } finally {
      span.end() // end the span even if the call throws
    }
  })
}

// Step 2: same pipeline
const runs = fromOtelSpans({ spans: yourOtelExport })
const report = await analyzeRuns({ runs })
```

About 20 lines of OTel wrapping; the rest is pure substrate.

## OpenAI Assistants API

The Assistants API emits `runs.steps` events natively. Map them to RunRecord:

```ts
// Custom mapper until fromOpenAIAssistants (queued #104) lands.
// `openai` is your OpenAI client instance; `hashOf`, `estimateCostFromUsage`,
// and `yourScoring` are assumed to be helpers defined elsewhere in your code.
async function mapAssistantRunToRunRecord(threadId: string, runId: string): Promise<RunRecord> {
  const run = await openai.beta.threads.runs.retrieve(threadId, runId)
  const steps = await openai.beta.threads.runs.steps.list(threadId, runId)

  return {
    runId: run.id,
    experimentId: 'default',
    candidateId: run.assistant_id,
    seed: 0,
    model: run.model,
    promptHash: hashOf(run.instructions),
    configHash: hashOf({ tools: run.tools, model: run.model }),
    commitSha: process.env.GIT_SHA ?? 'unknown',
    // Timestamps are Unix seconds; completed_at is null until the run finishes.
    wallMs: ((run.completed_at ?? run.created_at) - run.created_at) * 1000,
    costUsd: estimateCostFromUsage(run.usage),
    tokenUsage: {
      input: run.usage?.prompt_tokens ?? 0,
      output: run.usage?.completion_tokens ?? 0,
    },
    outcome: {
      holdoutScore: yourScoring(run),
      raw: { stepCount: steps.data.length, status: run.status },
    },
    splitTag: 'holdout',
  }
}
```

Once the dedicated adapter ships in 0.55.0 this becomes one line.

## Custom trace format

Your logs / DB rows / proprietary schema → `RunRecord`:

```ts
import type { RunRecord } from '@tangle-network/agent-eval'
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

// `MyAgentLog` is your own row type.
function mapMyRowToRunRecord(row: MyAgentLog): RunRecord {
  return {
    runId: row.id,
    experimentId: row.experiment_name ?? 'default',
    candidateId: row.model_version,
    seed: row.random_seed ?? 0,
    model: row.model,
    promptHash: row.prompt_hash,
    configHash: row.config_hash,
    commitSha: row.git_sha,
    wallMs: row.duration_ms,
    costUsd: row.cost,
    tokenUsage: {
      input: row.input_tokens,
      output: row.output_tokens,
    },
    outcome: {
      holdoutScore: row.score,
      raw: row.raw_output, // free-form bag for fields the substrate doesn't standardize
    },
    splitTag: row.is_holdout ? 'holdout' : 'search',
  }
}

const runs = myLogs.map(mapMyRowToRunRecord)
const report = await analyzeRuns({ runs })
```

That's the worst case. ~20 lines of mapping, and you're in.
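
Rows pulled from a database often carry nulls or negative sentinel values, so it can help to sanity-check mapped records before analyzing them. The sketch below is illustrative only: `RunRecordLike` is a local mirror of some fields used in the mapper above (the package's own `RunRecord` type remains the source of truth), and `findProblems` with its specific checks is a hypothetical helper, not something the package is known to ship.

```typescript
// Local mirror of a few fields from the mapper above, for illustration only.
interface RunRecordLike {
  runId: string
  wallMs: number
  costUsd: number
  tokenUsage: { input: number; output: number }
  outcome: { holdoutScore: number; raw?: unknown }
  splitTag: 'holdout' | 'search'
}

// True for real, finite numbers (rejects NaN, Infinity, null, undefined).
function isFiniteNumber(value: unknown): boolean {
  return typeof value === 'number' && isFinite(value)
}

// Return a human-readable list of problems; an empty list means the record
// passed these (assumed, not library-mandated) sanity checks.
function findProblems(run: RunRecordLike): string[] {
  const problems: string[] = []
  if (!run.runId) problems.push('runId is empty')

  const nonNegative: Array<[string, number]> = [
    ['wallMs', run.wallMs],
    ['costUsd', run.costUsd],
    ['tokenUsage.input', run.tokenUsage.input],
    ['tokenUsage.output', run.tokenUsage.output],
  ]
  nonNegative.forEach(([name, value]) => {
    if (!isFiniteNumber(value) || value < 0) {
      problems.push(`${name} must be a finite, non-negative number`)
    }
  })

  if (!isFiniteNumber(run.outcome.holdoutScore)) {
    problems.push('outcome.holdoutScore must be a finite number')
  }
  return problems
}
```

Filtering with `runs.filter((r) => findProblems(r).length === 0)` and logging the rejects keeps one malformed row from skewing the aggregate statistics.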

## Multi-rater human feedback (no LLM-as-judge yet)

If you don't have an automated judge but you DO have human raters approving/rejecting agent outputs:

```ts
import { fromFeedbackTable } from '@tangle-network/agent-eval'
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

// Your data shape:
const ratings = [
  { runId: 'r-001', rater: 'alice', score: 1 }, // approved
  { runId: 'r-001', rater: 'bob', score: 0 }, // rejected — disagreement!
  { runId: 'r-002', rater: 'alice', score: 1 },
  { runId: 'r-002', rater: 'bob', score: 1 },
  // ...
]

const { runs, raterScores } = fromFeedbackTable({ ratings })
const report = await analyzeRuns({ runs, raterScores })

// report.interRater.kappa → how much your raters agree
// report.interRater.disagreementCases → which runs raters split on
// → use these to iterate the rubric until kappa > 0.7
// → then build an LLM-as-judge against that aligned rubric
```

This is the warm-up path for customers who don't have a judge yet.
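
To build intuition for the kappa number, here is a minimal worked sketch of Cohen's kappa for two raters with binary scores. This doc does not say which kappa variant `report.interRater.kappa` uses, so treat Cohen's kappa as an assumption; the `Rating` type mirrors the data shape above, and `cohensKappa` is a hypothetical helper written for this example.

```typescript
interface Rating {
  runId: string
  rater: string
  score: 0 | 1
}

// Cohen's kappa for two named raters, computed over runs that both scored.
// kappa = (observed agreement - chance agreement) / (1 - chance agreement)
function cohensKappa(ratings: Rating[], raterA: string, raterB: string): number {
  const scoresA: Record<string, number> = {}
  const scoresB: Record<string, number> = {}
  ratings.forEach((r) => {
    if (r.rater === raterA) scoresA[r.runId] = r.score
    if (r.rater === raterB) scoresB[r.runId] = r.score
  })

  let n = 0 // runs rated by both
  let agree = 0 // runs where the two scores match
  let aYes = 0 // approvals by rater A
  let bYes = 0 // approvals by rater B
  Object.keys(scoresA).forEach((runId) => {
    if (!(runId in scoresB)) return
    n += 1
    if (scoresA[runId] === scoresB[runId]) agree += 1
    aYes += scoresA[runId]
    bYes += scoresB[runId]
  })
  if (n === 0) return NaN

  const observed = agree / n
  const expected = (aYes / n) * (bYes / n) + (1 - aYes / n) * (1 - bYes / n)
  // Kappa is undefined when chance agreement is already total.
  return expected === 1 ? NaN : (observed - expected) / (1 - expected)
}
```

For example, if alice approves three of four runs and bob approves two of those same three, they agree on 3 of 4 runs (observed 0.75), chance agreement is 0.5, and kappa comes out at 0.5, which is below the 0.7 target mentioned above.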

## Hosted vs self-hosted — what's the difference?

| | Self-hosted | Hosted (Tangle Intelligence) |
|---|---|---|
| Cost | Your LLM bills + your compute | Same LLM bills + hosted-tier subscription |
| Dashboard | You build it (or use the OSS examples) | Renders InsightReport out of the box |
| Cron / scheduling | Your CI / cron / GitHub Action | Managed scheduler runs weekly |
| Slack / email digest | You wire it | Included |
| Multi-week trends | You persist | Persisted for you |
| Decision packet generation | Local (free) | API call (same code; we run it) |
| Closed-loop campaigns | Local (you pay LLM directly) | Pass-through pricing on LLM, plus per-campaign fee |
| Auto-PR | Your GitHub token | Your GitHub token via OAuth |

Both work end-to-end. Hosted tier is convenience; self-hosted is fine for engineering-heavy teams who want full control.

## Common foreign-stack questions

**Q: We use vLLM / Ollama / a custom self-hosted LLM. Does the closed-loop driver work?**
A: Yes, if your server speaks OpenAI-compat (most do). Pass `baseUrl: 'http://localhost:8000/v1'` (or wherever your server listens) and a dummy `apiKey`. We have customers running selfImprove against local LMStudio + Ollama.

**Q: We're a Python shop, not TypeScript. Does anything change?**
A: `agent-eval-rpc@0.53.0` on PyPI speaks the same wire protocol. The Python client is a thin wrapper around the hosted endpoints — same `analyzeRuns()` / `selfImprove()` calls, same `InsightReport` shape, same `gateDecision` values.

**Q: We have an extremely custom agent (not LLM-call-shaped). Can we still use this?**
A: Yes. The substrate doesn't care what your agent IS — it only cares that you can express your runs as `RunRecord[]` and your judge as `(artifact) → JudgeScore`. RL-trained agents, multi-step plan-and-execute, browser-driving agents, code-generating agents — all map cleanly.

**Q: What's the minimum cost to try it?**
A: Free. `analyzeRuns()` is deterministic, runs locally, $0 LLM cost. You can ingest your last week of traces and get a real decision packet without spending a cent. The LLM-cost-incurring step is `selfImprove()` and you set the ceiling.

**Q: We don't want to send traces to your hosted tier. Self-hosted only — works?**
A: Yes. Every primitive in this doc runs locally. The package is MIT-licensed, no SaaS lock-in, no required network call.

@@ -0,0 +1,248 @@
# Integration — Tangle Intelligence on the Tangle stack (sandbox + tcloud)

Step-by-step. This is what we run with you on the onboarding call.

## Zero-setup demo first (30 seconds, no install)

```sh
npx @tangle-network/intelligence demo
```

End-to-end loop against synthetic data — agent + judge + scenarios + selfImprove. Prints the `InsightReport` shape you'll get on your real data. Useful to confirm the output is what you want before any integration. Hosted equivalent: open **[staging-intelligence.tangle.tools](https://staging-intelligence.tangle.tools)**.

## Prerequisites you already have

- `@tangle-network/sandbox` running your agent in a session
- `@tangle-network/tcloud` for LLM routing (or any OpenAI-compat router)
- Your scenarios (the inputs your agent handles) listed somewhere — even as YAML or a TS array
- A judge function for scoring outputs — LLM-as-judge is fine for v1

## Install

The CLI scaffolds and runs everything; you only add the substrate package if your code calls primitives directly:

```sh
# CLI (zero-install via npx, or add to your repo as a dev-dep)
npx @tangle-network/intelligence init

# Optional — only if your code imports analyzeRuns / selfImprove directly
pnpm add @tangle-network/agent-eval
# or for Python customers:
pip install agent-eval-rpc
```

`@tangle-network/intelligence` is the customer-facing CLI + hosted product (binary `tangle-intel`). `@tangle-network/agent-eval` is the substrate it wraps — install only if you want to script directly against the primitives.

## Step 1 — Ingest your trace stream

You already emit traces via sandbox sessions. Pull them into canonical `RunRecord[]`:

```ts
import { fromTangleSandbox } from '@tangle-network/agent-eval/adapters/sandbox'

const runs = await fromTangleSandbox({
  sessionIds: ['session_abc', 'session_def'], // your current week
  fromMs: lastReportTime,
  toMs: Date.now(),
})
// runs is RunRecord[] — canonical wire shape, ready for any downstream substrate primitive
```

If your agent emits OTel directly instead of going through `@tangle-network/sandbox`:

```ts
import { fromOtelSpans } from '@tangle-network/agent-eval'

const runs = fromOtelSpans({ spans: yourOtelSpans })
```

## Step 2 — Get the decision packet (no LLM cost)

```ts
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

const report = await analyzeRuns({
  runs: thisWeek,
  baselineRuns: lastWeek, // optional — gives you the "did my change help?" answer
  baselineLabel: 'vs prior 7 days',
})

console.log(report.composite.mean) // overall score
console.log(report.composite.tailRuns) // worst 5 runs by name
console.log(report.priorPeriodComparison?.improvedMetrics) // ['composite'] if significantly better
console.log(report.priorPeriodComparison?.regressedMetrics) // ['cost'] if cost went up significantly
console.log(report.recommendations) // priority-ranked actions
```

That's the **full deterministic flow** — no LLM, $0 cost, runs in ms.

Render in your dashboard or pipe to Slack:

```ts
for (const rec of report.recommendations) {
  if (rec.priority === 'critical') {
    await slack.post(`🔴 ${rec.title}\n${rec.detail}`)
  }
}
```

## Step 3 — Wire the closed loop (real LLM cost — opt-in)

Pick the surface you want to optimize. For most customers this is the agent's system-prompt addendum:

```ts
import { selfImprove, gepaDriver } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios: yourScenarios, // 20-50 representative inputs
  agent: async (surface, scenario) => {
    // Your existing agent invocation, with the substrate-proposed surface
    // injected as the system-prompt addendum.
    return await runYourAgent({
      ...scenario,
      systemPromptAddendum: surface as string,
    })
  },
  judge: yourJudge, // function (artifact) → { composite, dimensions }
  baselineSurface: currentAddendum, // the production string today
  driver: gepaDriver({
    llm: { apiKey: tcloudKey, baseUrl: 'https://router.tangle.tools/v1' },
    model: 'anthropic/claude-sonnet-4.6',
    target: 'agent system-prompt addendum',
  }),
  budget: {
    generations: 3,
    populationSize: 4,
    holdoutFraction: 0.3,
    maxUsd: 25, // hard ceiling — refuses to overspend
  },
})

console.log(`gate: ${result.gateDecision.kind}`)
console.log(`lift: ${result.lift.delta.toFixed(3)} CI=[${result.lift.ci95.join(', ')}]`)
console.log(`cost spent: $${result.totalCostUsd.toFixed(2)}`)
```

`result.gateDecision.kind` is one of:
- `ship-substrate` — winner statistically beats baseline; safe to deploy
- `inconclusive` — CI straddles zero; either run more rollouts or expand corpus
- `ship-harness` / `merge` — only when `driftPolicy: 'benchmark-branches'` is on (advanced)
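
A script that consumes the gate usually wants one explicit branch per kind, so that a new kind cannot slip through unhandled. The sketch below uses only the four kind strings listed above; `GateDecisionLike`, `NextStep`, and `nextStepFor` are hypothetical names, and routing `ship-harness` / `merge` to manual review is an assumption made for illustration, not documented behavior.

```typescript
// The four kinds named in the list above.
type GateKind = 'ship-substrate' | 'inconclusive' | 'ship-harness' | 'merge'

// Local stand-in for the gate decision object; only `kind` is relied on here.
interface GateDecisionLike {
  kind: GateKind
  reason?: string
}

// Hypothetical follow-up actions for a weekly script.
type NextStep = 'open-pr' | 'collect-more-data' | 'manual-review'

function nextStepFor(decision: GateDecisionLike): NextStep {
  switch (decision.kind) {
    case 'ship-substrate':
      return 'open-pr' // winner beat the baseline on the holdout
    case 'inconclusive':
      return 'collect-more-data' // CI straddles zero: more rollouts or a larger corpus
    case 'ship-harness':
    case 'merge':
      return 'manual-review' // advanced drift-policy outcomes (assumed handling)
    default:
      // Guards against kinds added in a later release.
      throw new Error(`unhandled gate kind: ${String(decision.kind)}`)
  }
}
```

Keeping the mapping in one function means the cron script only has to act on the returned step, and the `default` branch fails loudly instead of silently skipping a decision.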

## Step 4 — Auto-PR the winner

```ts
// `target`, `formatInsight`, `applyDiff`, and `currentAddendum` are assumed
// to be defined elsewhere in your code.
if (result.gateDecision.kind === 'ship-substrate') {
  await openAutoPr({
    title: `eval: auto-improve ${target} (composite +${result.lift.delta.toFixed(3)})`,
    body: `${result.gateDecision.reason}\n\n${formatInsight(result.insight)}`,
    filePath: 'src/lib/.server/production-loop/prompt-addendum.ts',
    newContent: result.diff.kind === 'replace' ? result.diff.content : applyDiff(currentAddendum, result.diff),
  })
}
```

We ship `openAutoPr` from `@tangle-network/agent-eval/contract`. It wraps the GitHub PR flow with your existing token.

## The full canonical flow (script you copy and run)

```ts
// scripts/weekly-improvement.ts — run from a cron / GitHub Action
// `daysSinceLastImprovement` and `renderInsightAsPrBody` are assumed to be
// helpers defined elsewhere in your code.

import { fromTangleSandbox } from '@tangle-network/agent-eval/adapters/sandbox'
import {
  analyzeRuns,
  gepaDriver,
  openAutoPr,
  selfImprove,
} from '@tangle-network/agent-eval/contract'
import { scenarios } from './eval/scenarios'
import { judge } from './eval/judges'
import { runYourAgent } from './src/agent'
import { PRODUCTION_ADDENDUM } from './src/lib/.server/production-loop/prompt-addendum'

const lastWeek = Date.now() - 7 * 24 * 60 * 60 * 1000
const twoWeeksAgo = lastWeek - 7 * 24 * 60 * 60 * 1000

const thisWeekRuns = await fromTangleSandbox({ fromMs: lastWeek, toMs: Date.now() })
const lastWeekRuns = await fromTangleSandbox({ fromMs: twoWeeksAgo, toMs: lastWeek })

// 1. Deterministic packet — always
const report = await analyzeRuns({
  runs: thisWeekRuns,
  baselineRuns: lastWeekRuns,
  baselineLabel: 'vs prior 7 days',
})

// 2. Closed loop — only if composite regressed OR we haven't tried in a while
const shouldRun =
  report.priorPeriodComparison?.regressedMetrics.includes('composite') ||
  daysSinceLastImprovement() > 7

if (!shouldRun) {
  console.log('No regression + recent run; skipping.')
  process.exit(0)
}

const result = await selfImprove({
  scenarios,
  agent: (surface, scenario) =>
    runYourAgent({ ...scenario, systemPromptAddendum: surface as string }),
  judge,
  baselineSurface: PRODUCTION_ADDENDUM,
  driver: gepaDriver({
    llm: { apiKey: process.env.TANGLE_KEY!, baseUrl: 'https://router.tangle.tools/v1' },
    model: 'anthropic/claude-sonnet-4.6',
    target: 'production agent system-prompt addendum',
  }),
  budget: { generations: 3, populationSize: 4, holdoutFraction: 0.3, maxUsd: 50 },
})

if (result.gateDecision.kind === 'ship-substrate') {
  await openAutoPr({
    title: `eval: auto-improve addendum (composite +${result.lift.delta.toFixed(3)})`,
    body: renderInsightAsPrBody(result.insight),
    filePath: 'src/lib/.server/production-loop/prompt-addendum.ts',
    newContent: result.diff.kind === 'replace' ? result.diff.content : '...',
  })
}
```

## What we'll do together on the onboarding call

1. **Map your existing setup** — where do your traces emit? which sandbox sessions? which scenarios exist already?
2. **Stub the judge** — even a single dimension is enough to start
3. **Run a deterministic `analyzeRuns()` against your live data** — first decision packet rendered live
4. **Wire one selfImprove cycle** — small budget, single generation, see the loop fire
5. **Schedule the cron + auto-PR target** — the loop runs autonomously thereafter

Time budget: ~90 minutes. By the end you have a working pilot.

## What our hosted tier adds on top

- Decision packet rendered weekly in the Intelligence dashboard — no code changes needed
- Slack / email digest on `regressedMetrics`
- Pareto chart, judge calibration, failure-cluster drilldown in the UI
- Multi-week trend lines
- Stripe-billed usage tracking

If you want self-hosted only, every primitive above works locally. The hosted tier is a convenience.

## FAQ

**Q: What's the smallest scenario corpus that gives useful results?**
A: ~15 scenarios for the deterministic packet (you get distributional stats + recommendations). For `selfImprove`'s held-out gate you want ≥20 since `holdoutFraction: 0.3` reserves 6 for the gate. Below that, the gate often returns `inconclusive`.
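
The arithmetic behind that threshold can be sketched in a few lines. The rounding rule is an assumption (the answer above only states that 0.3 of 20 reserves 6 scenarios), and `splitCounts` is a hypothetical helper for sizing your corpus, not a package export.

```typescript
// Split a corpus size into search and holdout counts.
// Rounding to the nearest integer is an assumption made for this sketch.
function splitCounts(
  total: number,
  holdoutFraction: number,
): { search: number; holdout: number } {
  if (holdoutFraction < 0 || holdoutFraction > 1) {
    throw new RangeError('holdoutFraction must be between 0 and 1')
  }
  const holdout = Math.round(total * holdoutFraction)
  return { search: total - holdout, holdout }
}
```

With 20 scenarios and `holdoutFraction: 0.3`, this gives 14 for search and 6 for the gate; with smaller corpora the holdout shrinks to only a handful of scenarios, which is consistent with the gate returning `inconclusive` more often.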

**Q: What if my judge isn't reliable yet?**
A: That's normal. Use multi-rater intake (`fromFeedbackTable`) to get inter-rater agreement (κ) first, then iterate on the judge until raters agree. Substrate has an `interRater` block in InsightReport showing exactly which scenarios raters disagree on.

**Q: What if a `selfImprove` campaign returns `inconclusive`?**
A: It refused to claim improvement because the CI straddles zero. Either expand the corpus, raise `holdoutFraction`, or run more generations. Better than shipping noise.

**Q: Can I use a non-tcloud LLM provider?**
A: Yes — `gepaDriver` accepts any `LlmClientOptions` (any OpenAI-compatible endpoint). We default to tcloud because we already have your auth.

**Q: How do I see what changed when the gate ships?**
A: `result.diff` is a structured patch. We also ship `diffRuns()` separately if you want to compare two campaign outputs.

**Q: What if my agent self-modifies (Hermes / Claude Code skills)?**
A: This is the offline/online drift case. We have the architecture spec ready (`docs/specs/profile-versioning.md`) but the implementation is gated on a forcing-function experiment. For v0.5x pilots we assume the substrate is the only writer to your agent's optimizable surface.