@tangle-network/agent-eval 0.52.0 → 0.54.0
- package/CHANGELOG.md +23 -0
- package/dist/adapters/http.d.ts +1 -1
- package/dist/adapters/langchain.d.ts +1 -1
- package/dist/adapters/otel.d.ts +7 -6
- package/dist/{baseline-4R5deP0N.d.ts → baseline-DE36-Np7.d.ts} +1 -1
- package/dist/benchmarks/index.d.ts +3 -2
- package/dist/builder-eval/index.d.ts +4 -3
- package/dist/campaign/index.d.ts +9 -7
- package/dist/campaign/index.js +33 -4
- package/dist/campaign/index.js.map +1 -1
- package/dist/{chunk-L7XMNXLO.js → chunk-J4DIMSRK.js} +2 -2
- package/dist/{chunk-BWZEGTES.js → chunk-NCK5QLGT.js} +1 -1
- package/dist/chunk-NCK5QLGT.js.map +1 -0
- package/dist/{chunk-5KSDYBYH.js → chunk-YXTT6GSZ.js} +2 -2
- package/dist/contract/index.d.ts +25 -12
- package/dist/contract/index.js +171 -0
- package/dist/contract/index.js.map +1 -1
- package/dist/{control-ojEWkMfJ.d.ts → control-DjEgwWNo.d.ts} +6 -5
- package/dist/{control-runtime-BZ_lVLYW.d.ts → control-runtime-DuFBYg7A.d.ts} +3 -2
- package/dist/control.d.ts +7 -6
- package/dist/control.js +2 -2
- package/dist/{emitter-DP_cSSiw.d.ts → emitter-DEZwY14K.d.ts} +2 -1
- package/dist/{failure-cluster-Cw65_5FY.d.ts → failure-cluster-CL7IVgkJ.d.ts} +2 -1
- package/dist/{feedback-trajectory-BSxqEpu7.d.ts → feedback-trajectory-DpUmE90J.d.ts} +1 -1
- package/dist/governance/index.d.ts +3 -2
- package/dist/hosted/index.d.ts +7 -6
- package/dist/{index-DQHtWQ57.d.ts → index-D2nT6_KT.d.ts} +66 -2
- package/dist/{index-0pu_fBwZ.d.ts → index-wlaiph9Y.d.ts} +1 -1
- package/dist/index.d.ts +31 -29
- package/dist/index.js +3 -3
- package/dist/{integrity-CTDhR1Sg.d.ts → integrity-CfXjSqEv.d.ts} +1 -1
- package/dist/knowledge/index.d.ts +4 -3
- package/dist/meta-eval/index.d.ts +4 -3
- package/dist/openapi.json +1 -1
- package/dist/pipelines/index.d.ts +7 -6
- package/dist/prm/index.d.ts +5 -4
- package/dist/{query-DODUYdPg.d.ts → query-CqTxMwDw.d.ts} +2 -1
- package/dist/{red-team-30II1T4o.d.ts → red-team-CrC5MZYd.d.ts} +1 -1
- package/dist/{registry-8KAs18kY.d.ts → registry-BSWy0rvH.d.ts} +1 -1
- package/dist/{release-report-DSu0DWy8.d.ts → release-report-B6l5fi7T.d.ts} +2 -2
- package/dist/reporting.d.ts +7 -6
- package/dist/{researcher-LZD0qHEa.d.ts → researcher-D4AZjxNa.d.ts} +5 -5
- package/dist/rl.d.ts +11 -10
- package/dist/rl.js +2 -2
- package/dist/{rubric-D5tjHNJQ.d.ts → rubric-BOfxn4ja.d.ts} +3 -2
- package/dist/{rubric-predictive-validity-ByZEC3BX.d.ts → rubric-predictive-validity-B3qNa4aY.d.ts} +1 -1
- package/dist/{run-improvement-loop-Cc7oZlRP.d.ts → run-improvement-loop-BhfdjrMY.d.ts} +3 -3
- package/dist/{run-record-BGY6bHRh.d.ts → run-record-etiCMsUq.d.ts} +11 -3
- package/dist/{store-Db2Bv8Cf.d.ts → schema-m0gsnbt3.d.ts} +1 -99
- package/dist/store-CKUAgsJz.d.ts +101 -0
- package/dist/{summary-report-B7gNRX-r.d.ts → summary-report-DLxh4yWk.d.ts} +2 -2
- package/dist/{test-graded-scenario-B2kWEdh9.d.ts → test-graded-scenario-BdVaPyHT.d.ts} +3 -2
- package/dist/traces.d.ts +7 -6
- package/dist/{trajectory-CnoBo-JY.d.ts → trajectory-GEdXJCL5.d.ts} +2 -1
- package/dist/{types-Dbj5gu8n.d.ts → types-BgrxOJSf.d.ts} +31 -1
- package/dist/wire/index.d.ts +5 -4
- package/docs/design/self-improvement-protocol.md +223 -0
- package/docs/pilot/README.md +62 -0
- package/docs/pilot/customer-checklist.md +90 -0
- package/docs/pilot/integration-foreign-stack.md +296 -0
- package/docs/pilot/integration-tangle-stack.md +248 -0
- package/docs/pilot/one-pager.md +161 -0
- package/docs/pilot/sample-insight-report.json +172 -0
- package/docs/research/research-roadmap.md +204 -0
- package/package.json +1 -1
- package/dist/chunk-BWZEGTES.js.map +0 -1
- package/dist/{chunk-L7XMNXLO.js.map → chunk-J4DIMSRK.js.map} +0 -0
- package/dist/{chunk-5KSDYBYH.js.map → chunk-YXTT6GSZ.js.map} +0 -0
|
@@ -0,0 +1,90 @@
# Pre-onboarding checklist — what to have ready

Send this to the customer 48h before the onboarding call. If they show up to the call having done this, the 90-minute slot ends with a working pilot.

## What we need from you before the call

### Credentials

- [ ] **LLM provider API key** — tcloud key, OpenRouter key, OpenAI key, Anthropic key, or any OpenAI-compat router endpoint
- [ ] **GitHub token** with PR-write access to your agent repo (optional — required only if you want auto-PR promotion on green gate decisions)
- [ ] **Sandbox session access** (Tangle stack customers only) — read access to the session IDs we'll analyze

### Data

- [ ] **Trace data** — ONE of:
  - Tangle sandbox session IDs (we use `fromTangleSandbox`)
  - OTel spans dumped as JSONL (we use `fromOtelSpans`)
  - Multi-rater feedback table (CSV with runId / rater / score, we use `fromFeedbackTable`)
  - LangChain / LlamaIndex / OpenAI Assistants trace export (we use the corresponding adapter)
  - Custom trace format (we map it together on the call — usually 20 lines of glue)
- [ ] **Scenarios** — 20-50 representative inputs your agent handles. YAML, JSON, or a TS array is fine; we'll convert to the canonical `DatasetScenario[]` shape together
- [ ] **The system prompt addendum** your agent uses today (or whichever text surface you want to optimize) — the closed loop edits this

### Judge

- [ ] **A judge function or rubric** — one of:
  - An existing function `(artifact) → { composite, dimensions }`
  - A rubric describing what "good output" means (1-2 paragraphs is enough — we'll build the judge on the call)
  - A set of "good" / "bad" labeled examples (we use these as anchors)
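
For the first option, a judge can be as small as the sketch below. It assumes the artifact is a plain object with a `text` field and that scores fall in [0, 1]. The `Artifact` and `JudgeScore` types and the two dimensions are illustrative names, not types exported by the package.

```typescript
// Hypothetical shapes for illustration; the real artifact is whatever your agent returns.
interface Artifact {
  text: string
}

interface JudgeScore {
  composite: number
  dimensions: Record<string, number>
}

// Scores two toy dimensions and averages them into the composite.
function judge(artifact: Artifact): JudgeScore {
  const text = artifact.text.trim()
  const dimensions: Record<string, number> = {
    nonEmpty: text.length > 0 ? 1 : 0, // did the agent produce anything?
    concise: text.length <= 2000 ? 1 : 0, // arbitrary length cap, chosen for the example
  }
  const values = Object.keys(dimensions).map((key) => dimensions[key])
  const composite = values.reduce((a, b) => a + b, 0) / values.length
  return { composite, dimensions }
}
```

A real judge would replace the two toy dimensions with checks drawn from your rubric; only the return shape matters for wiring.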

### Constraints

- [ ] **LLM cost budget for the closed loop** — default $25 per campaign. Tell us if you want a different ceiling
- [ ] **Cadence** — how often should the loop run? Default: weekly. Some customers want daily; others want on-demand only
- [ ] **Deployment gate preference** — do you want:
  - Auto-PR on `ship-substrate` (we open the PR, your team reviews)
  - Manual review only (we report; you decide)
  - Auto-deploy on `ship-substrate` (only with explicit ack; not default)

## Call agenda — 90 minutes

| Time | Topic |
|---|---|
| 0:00 — 0:10 | Walk through your existing setup — what runs where, what scenarios exist, what success looks like for you |
| 0:10 — 0:30 | Pick the right intake adapter; pull traces; run `analyzeRuns()` against last week's data — first decision packet rendered live |
| 0:30 — 0:50 | Build the judge — either wrap your existing one or scaffold a new one from your rubric |
| 0:50 — 1:10 | Fire one `selfImprove` cycle with a small budget ($5, single generation, 2 candidates) — watch the loop run end-to-end |
| 1:10 — 1:25 | Wire the cron + auto-PR target; schedule first weekly run |
| 1:25 — 1:30 | Confirm what we hand back to you between runs and what reaches you when |

If something on the checklist isn't ready, we adapt — just send what you have. Worst case, we spend the first 30 minutes getting unblocked.

## What you'll have at the end of the call

- A working `analyzeRuns()` call against YOUR live trace data, returning a real `InsightReport`
- A judge function (yours or scaffolded) wired to your agent's output shape
- One completed `selfImprove` cycle with a real `gateDecision` + lift CI
- A scheduled cron / GitHub Action that runs the loop weekly
- Optional: an auto-PR target if you want green-gate proposals to land as draft PRs
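
The "lift CI" in the list above is a confidence interval on the score difference between the candidate and the baseline. How `selfImprove` computes it is not specified in this document. The sketch below is a generic percentile bootstrap over paired per-scenario deltas, with a seeded PRNG so the result is reproducible; every name in it is illustrative.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Percentile bootstrap: resample the paired deltas with replacement and take
// the 2.5% and 97.5% quantiles of the resampled means.
function bootstrapLiftCi(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 1,
): { delta: number; ci95: [number, number] } {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('need paired, non-empty score arrays')
  }
  const deltas = candidate.map((score, i) => score - baseline[i])
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(rand() * deltas.length)]
    }
    means.push(sum / deltas.length)
  }
  means.sort((x, y) => x - y)
  const lo = means[Math.floor(0.025 * (resamples - 1))]
  const hi = means[Math.ceil(0.975 * (resamples - 1))]
  const delta = deltas.reduce((a, b) => a + b, 0) / deltas.length
  return { delta, ci95: [lo, hi] }
}
```

Read this way, an interval entirely above zero matches the "winner beats baseline" case, while an interval that straddles zero matches an inconclusive result.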

## After the call

- Day 1-7: first weekly run fires; we monitor + jump in if anything breaks
- Day 7: we send you a `selfImprove`-result summary + the corresponding `InsightReport`
- Day 14-28: 3 more cycles complete; you have enough data to evaluate the pilot
- Day 30: pilot review — what we found, what shipped, what's next

## What we send back to you between runs

- The full `InsightReport` JSON (you render it however you want, or use our hosted dashboard if it's available for your tier)
- Slack / email digest of `regressedMetrics` + critical recommendations (opt-in)
- Cost tally per campaign
- Auto-PR links if green gate verdicts opened any

## Common pre-call questions

**Q: How small a corpus can we start with?**
A: 15 scenarios is enough for the deterministic packet. 25+ is recommended for `selfImprove`'s held-out gate (the default `holdoutFraction: 0.3` reserves ~30% of scenarios for the gate).
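
To see why 25+ is the recommendation, it helps to work out how many scenarios actually reach the gate. The sketch assumes the held-out count is `floor(n × holdoutFraction)`. The exact rounding `selfImprove` uses is not stated here, so treat the numbers as approximate; `splitSizes` is a hypothetical helper.

```typescript
// Assumed split rule: holdout = floor(total * holdoutFraction); the rest is used for search.
function splitSizes(total: number, holdoutFraction = 0.3): { search: number; holdout: number } {
  const holdout = Math.floor(total * holdoutFraction)
  return { search: total - holdout, holdout }
}

for (const total of [15, 25, 45]) {
  const { search, holdout } = splitSizes(total)
  console.log(`${total} scenarios: ${search} for search, ${holdout} held out for the gate`)
}
```

Under that assumption a 15-scenario corpus leaves only a handful of held-out scenarios, which is a thin basis for a statistical gate.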

**Q: What if our judge isn't reliable yet?**
A: Start with multi-rater intake — `fromFeedbackTable` produces inter-rater agreement (κ) so you can see exactly which scenarios humans disagree on. Iterate the judge until κ > 0.7, then move to the closed loop.

**Q: We don't use Tangle's sandbox — can we still pilot?**
A: Yes. We have intake adapters for OTel, LangChain, LlamaIndex, Anthropic SDK, OpenAI Assistants, OpenRouter, multi-rater feedback tables, and custom trace formats. See `integration-foreign-stack.md`.

**Q: We use OpenRouter — does the closed-loop driver work with our routing setup?**
A: Yes. `gepaDriver` accepts any OpenAI-compatible endpoint via its `llm.baseUrl` option. Most customers run their `selfImprove` campaigns through OpenRouter or their existing provider — no migration required.

**Q: What if the pilot fails — what do we get?**
A: You get the deterministic `InsightReport` weekly regardless. Even if no `selfImprove` cycle ever ships a green gate verdict, you get the failure-cluster analysis, regressed-metric detection, and worst-runs surfacing. Those alone replace what most teams currently get from LangSmith / Braintrust / Phoenix scorecards.
@@ -0,0 +1,296 @@
# Integration — Tangle Intelligence on any stack (OpenRouter, OpenAI, Anthropic, LangChain, LlamaIndex, custom)

Companion to `integration-tangle-stack.md`. This doc is the path for customers NOT on `@tangle-network/sandbox` + tcloud — you bring your own agent, your own LLM provider, your own trace format. We meet you where you are.

## Zero-setup demo first

```sh
npx @tangle-network/intelligence demo
```

Synthetic agent + scenarios + selfImprove end-to-end with zero setup. Prints the `InsightReport` shape — same output you'll get against your real data. Hosted equivalent: **[staging-intelligence.tangle.tools](https://staging-intelligence.tangle.tools)**.

## Decision tree — pick your starting point

```
What's your trace source?
├── OTel-compatible (Datadog APM, Honeycomb, NewRelic, raw OTLP) → fromOtelSpans
├── LangChain (LangSmith traces, LCEL run traces) → fromLangChain (#104, queued)
├── LlamaIndex (callback traces) → fromLlamaIndex (#104, queued)
├── Anthropic SDK direct (Messages API call logs) → fromAnthropicSDK (#104, queued)
├── OpenAI Assistants API (run + step events) → fromOpenAIAssistants (#104, queued)
├── Multi-rater human approval/reject corpus → fromFeedbackTable
├── Custom (your own logs / DB rows) → 20-line mapper to RunRecord
└── @tangle-network/sandbox → fromTangleSandbox (see Tangle-stack doc)

What's your LLM provider for the closed loop?
├── OpenAI → gepaDriver({ llm: { apiKey, baseUrl: 'https://api.openai.com/v1' } })
├── Anthropic → gepaDriver({ llm: { apiKey, baseUrl: 'https://api.anthropic.com/v1' } })
├── OpenRouter → gepaDriver({ llm: { apiKey, baseUrl: 'https://openrouter.ai/api/v1' } })
├── Tangle tcloud → gepaDriver({ llm: { apiKey, baseUrl: 'https://router.tangle.tools/v1' } })
├── Azure OpenAI → gepaDriver({ llm: { apiKey, baseUrl: 'https://<resource>.openai.azure.com/...' } })
├── Bedrock / Vertex → custom client wrapper (we help)
└── Self-hosted (vLLM, Ollama, etc.) → OpenAI-compat endpoint works directly
```

## OTel → InsightReport (5 minutes)

```ts
import { readFileSync } from 'node:fs'
import { fromOtelSpans } from '@tangle-network/agent-eval'
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

// You probably already export OTel via OTLP. Pull a JSONL dump of spans
// for your last week of agent activity.
// JSONL holds one JSON object per line, so parse each line on its own.
const spans = readFileSync('./agent-traces.jsonl', 'utf8')
  .split('\n')
  .filter(Boolean)
  .map((line) => JSON.parse(line))

const runs = fromOtelSpans({ spans }) // RunRecord[]
const report = await analyzeRuns({ runs })

console.log(report.composite.mean)
console.log(report.recommendations)
```

What `fromOtelSpans` expects in spans (it's flexible — it tries multiple attribute keys):
- `trace_id` (groups spans into a run)
- `attributes['llm.tokens.in' | 'llm.input_tokens']` (optional)
- `attributes['llm.tokens.out' | 'llm.output_tokens']` (optional)
- `attributes['tool.name']` (optional, for tool-failure-rate analysis)
- `status.code: 'ERROR' | 'OK'` (for failure detection)
- `start_unix_nano` + `end_unix_nano` (for duration)

If your spans use different attribute keys, pass `{ attributeMap }` to override.
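
As a rough picture of what those fields are used for, the sketch below groups spans by `trace_id` and derives a duration and a failure flag per trace. It is not the adapter's implementation. The `OtelSpanLike` type and the `summarizeTraces` helper are hypothetical, and only the field names come from the list above.

```typescript
// Hypothetical span shape using the field names listed above.
// Real epoch nanoseconds exceed Number.MAX_SAFE_INTEGER, so a production
// mapper would likely parse them as BigInt or strings; plain numbers keep the sketch short.
interface OtelSpanLike {
  trace_id: string
  start_unix_nano: number
  end_unix_nano: number
  status?: { code: 'OK' | 'ERROR' }
  attributes?: Record<string, string | number>
}

interface TraceSummary {
  traceId: string
  wallMs: number
  failed: boolean
  spanCount: number
}

function summarizeTraces(spans: OtelSpanLike[]): TraceSummary[] {
  // Group spans that share a trace_id; each group stands for one run.
  const byTrace = new Map<string, OtelSpanLike[]>()
  for (const span of spans) {
    const group = byTrace.get(span.trace_id) ?? []
    group.push(span)
    byTrace.set(span.trace_id, group)
  }
  return Array.from(byTrace.entries()).map(([traceId, group]) => {
    const start = Math.min(...group.map((s) => s.start_unix_nano))
    const end = Math.max(...group.map((s) => s.end_unix_nano))
    return {
      traceId,
      wallMs: (end - start) / 1e6, // nanoseconds to milliseconds
      failed: group.some((s) => s.status?.code === 'ERROR'),
      spanCount: group.length,
    }
  })
}
```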

## OpenRouter as your closed-loop LLM provider

OpenRouter speaks OpenAI-compat. Plug it in directly:

```ts
import { selfImprove, gepaDriver } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios: yourScenarios,
  agent: yourAgent, // your existing dispatch — any framework
  judge: yourJudge,
  baselineSurface: currentPrompt,
  driver: gepaDriver({
    llm: {
      apiKey: process.env.OPENROUTER_KEY!,
      baseUrl: 'https://openrouter.ai/api/v1',
    },
    model: 'anthropic/claude-sonnet-4.6', // any OpenRouter-supported model
    target: 'agent system prompt',
  }),
  budget: { generations: 3, populationSize: 4, holdoutFraction: 0.3, maxUsd: 25 },
})
```

Same shape works for: OpenAI direct, Anthropic direct, Azure OpenAI, Vertex (with their OpenAI-compat layer), Bedrock (via LiteLLM proxy), self-hosted vLLM / Ollama / LMStudio.
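
Since only `llm.baseUrl` changes between providers, one way to keep that choice in configuration is a small lookup table. The URLs below are the ones this document already lists; the `Provider` type, the `BASE_URLS` map, and the `llmConfigFor` helper are illustrative names, not part of the package.

```typescript
type Provider = 'openai' | 'anthropic' | 'openrouter' | 'tcloud' | 'local'

// Base URLs as given in this document; 'local' assumes an OpenAI-compatible
// server (for example vLLM) listening on localhost:8000.
const BASE_URLS: Record<Provider, string> = {
  openai: 'https://api.openai.com/v1',
  anthropic: 'https://api.anthropic.com/v1',
  openrouter: 'https://openrouter.ai/api/v1',
  tcloud: 'https://router.tangle.tools/v1',
  local: 'http://localhost:8000/v1',
}

// Builds the object that would be passed as the `llm` option.
function llmConfigFor(provider: Provider, apiKey: string): { apiKey: string; baseUrl: string } {
  return { apiKey, baseUrl: BASE_URLS[provider] }
}
```

The result would then be passed as `gepaDriver({ llm: llmConfigFor('openrouter', key), ... })`, leaving the rest of the call unchanged.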

## LangChain customer — three-minute integration

While the dedicated `fromLangChain` adapter is queued (#104), the universal path goes through OTel:

```ts
// Step 1: configure LangSmith to also export OTel
// (LangSmith → Project Settings → Trace exports → enable OTLP)

// Step 2: ingest as OTel
import { fromOtelSpans } from '@tangle-network/agent-eval'
const runs = fromOtelSpans({ spans: yourLangSmithOtelDump })

// Step 3: analyze
import { analyzeRuns } from '@tangle-network/agent-eval/contract'
const report = await analyzeRuns({ runs })
```

When `fromLangChain` lands, it'll be a one-liner:

```ts
// Coming in 0.55.0
import { fromLangChain } from '@tangle-network/agent-eval/adapters/langchain'
const runs = fromLangChain({ traces: yourLangSmithExport })
```

Same for LlamaIndex / OpenAI Assistants / Anthropic SDK — direct adapters queued in #104.

## LlamaIndex customer

LlamaIndex's callback manager emits OTel spans natively. Wire it once:

```python
# Python side — your LlamaIndex setup
from llama_index.callbacks import OpenInferenceCallbackHandler

callback_handler = OpenInferenceCallbackHandler()
# This emits to your OTel exporter

# Then on the agent-eval side (TypeScript or Python):
from agent_eval_rpc import Client
client = Client(base_url='https://api.tangle.tools/v1', api_key=YOUR_KEY)
report = client.analyze_runs(spans=your_otel_spans)
```

The Python client (`agent-eval-rpc@0.53.0`) speaks the same wire protocol — no functional difference between TS and Python customers.

## Anthropic SDK direct (no framework)

If you're calling `@anthropic-ai/sdk` directly without an agent framework:

```ts
import Anthropic from '@anthropic-ai/sdk'
import { SpanStatusCode, trace } from '@opentelemetry/api'
import { fromOtelSpans } from '@tangle-network/agent-eval'
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

// Step 1: wrap your Anthropic calls to emit OTel
const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment
const tracer = trace.getTracer('your-agent')

async function callAnthropic(scenario: Scenario) {
  return tracer.startActiveSpan('agent.turn', async (span) => {
    try {
      const result = await anthropic.messages.create({...})
      span.setAttribute('llm.input_tokens', result.usage.input_tokens)
      span.setAttribute('llm.output_tokens', result.usage.output_tokens)
      span.setAttribute('tangle.runId', scenario.id)
      return result
    } catch (err) {
      // Mark the span as failed so failure detection can pick it up.
      span.setStatus({ code: SpanStatusCode.ERROR })
      throw err
    } finally {
      span.end() // always close the span, even when the call throws
    }
  })
}

// Step 2: same pipeline
const runs = fromOtelSpans({ spans: yourOtelExport })
const report = await analyzeRuns({ runs })
```

About 20 lines of OTel wrapping; the rest is pure substrate.

## OpenAI Assistants API

The Assistants API emits `runs.steps` events natively. Map them to RunRecord:

```ts
// Custom mapper while fromOpenAIAssistants (queued #104) lands.
// hashOf, estimateCostFromUsage, and yourScoring are assumed to be defined elsewhere in your code.
async function mapAssistantRunToRunRecord(threadId: string, runId: string): Promise<RunRecord> {
  const run = await openai.beta.threads.runs.retrieve(threadId, runId)
  const steps = await openai.beta.threads.runs.steps.list(threadId, runId)

  return {
    runId: run.id,
    experimentId: 'default',
    candidateId: run.assistant_id,
    seed: 0,
    model: run.model,
    promptHash: hashOf(run.instructions),
    configHash: hashOf({ tools: run.tools, model: run.model }),
    commitSha: process.env.GIT_SHA ?? 'unknown',
    // Timestamps are Unix seconds; completed_at is null until the run finishes.
    wallMs: ((run.completed_at ?? run.created_at) - run.created_at) * 1000,
    costUsd: estimateCostFromUsage(run.usage),
    tokenUsage: {
      input: run.usage?.prompt_tokens ?? 0,
      output: run.usage?.completion_tokens ?? 0,
    },
    outcome: {
      holdoutScore: yourScoring(run),
      raw: { stepCount: steps.data.length, status: run.status },
    },
    splitTag: 'holdout',
  }
}
```

Once the dedicated adapter ships in 0.55.0 this becomes one line.
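
The mapper above leaves `hashOf` undefined. One possible version is sketched below: it serializes the value with sorted object keys so that key order does not change the hash, then applies 32-bit FNV-1a over the UTF-16 code units. This is an assumption about what a suitable helper could look like, not the package's own hashing. FNV-1a is not cryptographic, so swap in SHA-256 if collisions matter.

```typescript
// Serialize with object keys sorted so logically equal configs hash the same.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') {
    return JSON.stringify(value) ?? 'undefined'
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`
  }
  const obj = value as Record<string, unknown>
  const keys = Object.keys(obj).sort()
  return `{${keys.map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`).join(',')}}`
}

// 32-bit FNV-1a, fed the low byte then the high byte of each UTF-16 code unit.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5 // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i)
    for (const byte of [unit & 0xff, unit >>> 8]) {
      hash ^= byte
      hash = Math.imul(hash, 0x01000193) >>> 0 // multiply by the FNV prime, keep 32 bits
    }
  }
  return ('00000000' + hash.toString(16)).slice(-8)
}

// Strings are hashed as-is; everything else goes through stableStringify first.
function hashOf(value: unknown): string {
  return fnv1a(typeof value === 'string' ? value : stableStringify(value))
}
```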

## Custom trace format

Your logs / DB rows / proprietary schema → `RunRecord`:

```ts
import type { RunRecord } from '@tangle-network/agent-eval'
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

// MyAgentLog is your own row type; myLogs is the array of rows you loaded.
function mapMyRowToRunRecord(row: MyAgentLog): RunRecord {
  return {
    runId: row.id,
    experimentId: row.experiment_name ?? 'default',
    candidateId: row.model_version,
    seed: row.random_seed ?? 0,
    model: row.model,
    promptHash: row.prompt_hash,
    configHash: row.config_hash,
    commitSha: row.git_sha,
    wallMs: row.duration_ms,
    costUsd: row.cost,
    tokenUsage: {
      input: row.input_tokens,
      output: row.output_tokens,
    },
    outcome: {
      holdoutScore: row.score,
      raw: row.raw_output, // free-form bag for fields the substrate doesn't standardize
    },
    splitTag: row.is_holdout ? 'holdout' : 'search',
  }
}

const runs = myLogs.map(mapMyRowToRunRecord)
const report = await analyzeRuns({ runs })
```

That's the worst case. ~20 lines of mapping, and you're in.

## Multi-rater human feedback (no LLM-as-judge yet)

If you don't have an automated judge but you DO have human raters approving/rejecting agent outputs:

```ts
import { fromFeedbackTable } from '@tangle-network/agent-eval'
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

// Your data shape:
const ratings = [
  { runId: 'r-001', rater: 'alice', score: 1 }, // approved
  { runId: 'r-001', rater: 'bob', score: 0 }, // rejected — disagreement!
  { runId: 'r-002', rater: 'alice', score: 1 },
  { runId: 'r-002', rater: 'bob', score: 1 },
  // ...
]

const { runs, raterScores } = fromFeedbackTable({ ratings })
const report = await analyzeRuns({ runs, raterScores })

// report.interRater.kappa → how much your raters agree
// report.interRater.disagreementCases → which runs raters split on
// → use these to iterate the rubric until kappa > 0.7
// → then build an LLM-as-judge against that aligned rubric
```

This is the warm-up path for customers who don't have a judge yet.
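
For intuition about the kappa threshold, the sketch below computes Cohen's kappa for two raters from rows shaped like the `ratings` array above. It is the textbook calculation, not the library's implementation, and how `fromFeedbackTable` handles more than two raters is not covered here. `cohensKappa` and the `Rating` type are illustrative names.

```typescript
interface Rating {
  runId: string
  rater: string
  score: number
}

// Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement)
function cohensKappa(ratings: Rating[], raterA: string, raterB: string): number {
  const a = new Map<string, number>()
  const b = new Map<string, number>()
  for (const r of ratings) {
    if (r.rater === raterA) a.set(r.runId, r.score)
    if (r.rater === raterB) b.set(r.runId, r.score)
  }
  // Only runs scored by both raters count.
  const shared = Array.from(a.keys()).filter((id) => b.has(id))
  if (shared.length === 0) throw new Error('no runs rated by both raters')

  let agree = 0
  const countsA = new Map<number, number>()
  const countsB = new Map<number, number>()
  for (const id of shared) {
    const scoreA = a.get(id) as number
    const scoreB = b.get(id) as number
    if (scoreA === scoreB) agree++
    countsA.set(scoreA, (countsA.get(scoreA) ?? 0) + 1)
    countsB.set(scoreB, (countsB.get(scoreB) ?? 0) + 1)
  }

  const n = shared.length
  const observed = agree / n
  // Chance agreement: probability both raters pick the same label independently.
  let expected = 0
  countsA.forEach((count, label) => {
    expected += (count / n) * ((countsB.get(label) ?? 0) / n)
  })
  if (expected === 1) return 1 // both raters used one identical label throughout
  return (observed - expected) / (1 - expected)
}
```

A kappa near 1 means the raters agree far more than chance would predict, and a value near 0 means their agreement is about what random labeling would produce.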

## Hosted vs self-hosted — what's the difference?

| | Self-hosted | Hosted (Tangle Intelligence) |
|---|---|---|
| Cost | Your LLM bills + your compute | Same LLM bills + hosted-tier subscription |
| Dashboard | You build it (or use the OSS examples) | Renders InsightReport out of the box |
| Cron / scheduling | Your CI / cron / GitHub Action | Managed scheduler runs weekly |
| Slack / email digest | You wire it | Included |
| Multi-week trends | You persist | Persisted for you |
| Decision packet generation | Local (free) | API call (same code; we run it) |
| Closed-loop campaigns | Local (you pay LLM directly) | Pass-through pricing on LLM, plus per-campaign fee |
| Auto-PR | Your GitHub token | Your GitHub token via OAuth |

Both work end-to-end. Hosted tier is convenience; self-hosted is fine for engineering-heavy teams who want full control.

## Common foreign-stack questions

**Q: We use vLLM / Ollama / a custom self-hosted LLM. Does the closed-loop driver work?**
A: Yes, if your server speaks OpenAI-compat (most do). Pass `baseUrl: 'http://localhost:8000/v1'` (or wherever your server listens) and a dummy `apiKey`. We have customers running `selfImprove` against local LMStudio and Ollama.

**Q: We're a Python shop, not TypeScript. Does anything change?**
A: `agent-eval-rpc@0.53.0` on PyPI speaks the same wire protocol. The Python client is a thin wrapper around the hosted endpoints — same `analyzeRuns()` / `selfImprove()` calls, same `InsightReport` shape, same `gateDecision` values.

**Q: We have an extremely custom agent (not LLM-call-shaped). Can we still use this?**
A: Yes. The substrate doesn't care what your agent IS — it only cares that you can express your runs as `RunRecord[]` and your judge as `(artifact) → JudgeScore`. RL-trained agents, multi-step plan-and-execute, browser-driving agents, code-generating agents — all map cleanly.

**Q: What's the minimum cost to try it?**
A: Free. `analyzeRuns()` is deterministic, runs locally, $0 LLM cost. You can ingest your last week of traces and get a real decision packet without spending a cent. The LLM-cost-incurring step is `selfImprove()` and you set the ceiling.

**Q: We don't want to send traces to your hosted tier. Self-hosted only — works?**
A: Yes. Every primitive in this doc runs locally. The package is MIT-licensed, no SaaS lock-in, no required network call.
@@ -0,0 +1,248 @@
# Integration — Tangle Intelligence on the Tangle stack (sandbox + tcloud)

Step-by-step. This is what we run with you on the onboarding call.

## Zero-setup demo first (30 seconds, no install)

```sh
npx @tangle-network/intelligence demo
```

End-to-end loop against synthetic data — agent + judge + scenarios + selfImprove. Prints the `InsightReport` shape you'll get on your real data. Useful to confirm the output is what you want before any integration. Hosted equivalent: open **[staging-intelligence.tangle.tools](https://staging-intelligence.tangle.tools)**.

## Prerequisites you already have

- `@tangle-network/sandbox` running your agent in a session
- `@tangle-network/tcloud` for LLM routing (or any OpenAI-compat router)
- Your scenarios (the inputs your agent handles) listed somewhere — even as YAML or a TS array
- A judge function for scoring outputs — LLM-as-judge is fine for v1

## Install

The CLI scaffolds and runs everything; you only add the substrate package if your code calls primitives directly:

```sh
# CLI (zero-install via npx, or add to your repo as a dev-dep)
npx @tangle-network/intelligence init

# Optional — only if your code imports analyzeRuns / selfImprove directly
pnpm add @tangle-network/agent-eval
# or for Python customers:
pip install agent-eval-rpc
```

`@tangle-network/intelligence` is the customer-facing CLI + hosted product (binary `tangle-intel`). `@tangle-network/agent-eval` is the substrate it wraps — install only if you want to script directly against the primitives.

## Step 1 — Ingest your trace stream

You already emit traces via sandbox sessions. Pull them into canonical `RunRecord[]`:

```ts
import { fromTangleSandbox } from '@tangle-network/agent-eval/adapters/sandbox'

const runs = await fromTangleSandbox({
  sessionIds: ['session_abc', 'session_def'], // your current week
  fromMs: lastReportTime,
  toMs: Date.now(),
})
// runs is RunRecord[] — canonical wire shape, ready for any downstream substrate primitive
```

If your agent emits OTel directly instead of going through `@tangle-network/sandbox`:

```ts
import { fromOtelSpans } from '@tangle-network/agent-eval'

const runs = fromOtelSpans({ spans: yourOtelSpans })
```

## Step 2 — Get the decision packet (no LLM cost)

```ts
import { analyzeRuns } from '@tangle-network/agent-eval/contract'

const report = await analyzeRuns({
  runs: thisWeek,
  baselineRuns: lastWeek, // optional — gives you the "did my change help?" answer
  baselineLabel: 'vs prior 7 days',
})

console.log(report.composite.mean) // overall score
console.log(report.composite.tailRuns) // worst 5 runs by name
console.log(report.priorPeriodComparison?.improvedMetrics) // ['composite'] if significantly better
console.log(report.priorPeriodComparison?.regressedMetrics) // ['cost'] if cost went up significantly
console.log(report.recommendations) // priority-ranked actions
```

That's the **full deterministic flow** — no LLM, $0 cost, runs in ms.

Render in your dashboard or pipe to Slack:

```ts
for (const rec of report.recommendations) {
  if (rec.priority === 'critical') {
    await slack.post(`🔴 ${rec.title}\n${rec.detail}`)
  }
}
```

## Step 3 — Wire the closed loop (real LLM cost — opt-in)

Pick the surface you want to optimize. For most customers this is the agent's system-prompt addendum:

```ts
import { selfImprove, gepaDriver } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios: yourScenarios, // 20-50 representative inputs
  agent: async (surface, scenario) => {
    // Your existing agent invocation, with the substrate-proposed surface
    // injected as the system-prompt addendum.
    return await runYourAgent({
      ...scenario,
      systemPromptAddendum: surface as string,
    })
  },
  judge: yourJudge, // function (artifact) → { composite, dimensions }
  baselineSurface: currentAddendum, // the production string today
  driver: gepaDriver({
    llm: { apiKey: tcloudKey, baseUrl: 'https://router.tangle.tools/v1' },
    model: 'anthropic/claude-sonnet-4.6',
    target: 'agent system-prompt addendum',
  }),
  budget: {
    generations: 3,
    populationSize: 4,
    holdoutFraction: 0.3,
    maxUsd: 25, // hard ceiling — refuses to overspend
  },
})

console.log(`gate: ${result.gateDecision.kind}`)
console.log(`lift: ${result.lift.delta.toFixed(3)} CI=[${result.lift.ci95.join(', ')}]`)
console.log(`cost spent: $${result.totalCostUsd.toFixed(2)}`)
```

`result.gateDecision.kind` is one of:
- `ship-substrate` — winner statistically beats baseline; safe to deploy
- `inconclusive` — CI straddles zero; either run more rollouts or expand corpus
- `ship-harness` / `merge` — only when `driftPolicy: 'benchmark-branches'` is on (advanced)
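
Combined with the deployment gate preference from `customer-checklist.md`, the gate kinds can be routed to an action as in the sketch below. The four kind strings come from the list above. The `GatePreference` values, the `Action` names, and the choice to send `ship-harness` / `merge` to manual review are assumptions made for illustration.

```typescript
type GateKind = 'ship-substrate' | 'inconclusive' | 'ship-harness' | 'merge'
type GatePreference = 'auto-pr' | 'manual-review' | 'auto-deploy'
type Action = 'open-pr' | 'deploy' | 'report-only' | 'needs-review'

// Maps a gate verdict plus the customer's stated preference to the next step.
function actionFor(kind: GateKind, preference: GatePreference): Action {
  if (kind === 'ship-substrate') {
    if (preference === 'auto-deploy') return 'deploy'
    if (preference === 'auto-pr') return 'open-pr'
    return 'report-only' // manual review: report the win and let the team decide
  }
  if (kind === 'inconclusive') {
    return 'report-only' // nothing to ship; the report still goes out
  }
  // ship-harness and merge only appear with the advanced drift policy;
  // this sketch routes them to a human.
  return 'needs-review'
}
```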
|
|
130
|
+
|
|
131
|
+
## Step 4 — Auto-PR the winner
|
|
132
|
+
|
|
133
|
+
```ts
|
|
134
|
+
if (result.gateDecision.kind === 'ship-substrate') {
|
|
135
|
+
await openAutoPr({
|
|
136
|
+
title: `eval: auto-improve ${target} (composite +${result.lift.delta.toFixed(3)})`,
|
|
137
|
+
body: `${result.gateDecision.reason}\n\n${formatInsight(result.insight)}`,
|
|
138
|
+
filePath: 'src/lib/.server/production-loop/prompt-addendum.ts',
|
|
139
|
+
newContent: result.diff.kind === 'replace' ? result.diff.content : applyDiff(currentAddendum, result.diff),
|
|
140
|
+
})
|
|
141
|
+
}
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
We ship `openAutoPr` from `@tangle-network/agent-eval/contract`. It wraps the GitHub PR flow with your existing token.
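
Before calling `openAutoPr`, it can help to resolve `newContent` defensively. The sketch below assumes only what this guide shows: a diff with `kind: 'replace'` carries the full new `content`. The `DiffSketch` type and `resolveNewContent` helper are hypothetical. Any other diff kind is rejected rather than guessed at, so a PR is never opened with placeholder content.

```ts
// Hypothetical local mirror of the patch shape. Only `kind: 'replace'` with `content`
// appears in this guide, so every other kind is treated as unsupported here.
interface DiffSketch {
  kind: string
  content?: string
}

// Return the full replacement text, or fail loudly instead of writing a placeholder.
function resolveNewContent(diff: DiffSketch): string {
  if (diff.kind === 'replace' && typeof diff.content === 'string') {
    return diff.content
  }
  throw new Error(`unsupported diff kind "${diff.kind}"; refusing to open a PR`)
}
```

Usage: pass `resolveNewContent(result.diff)` as `newContent`, and let the thrown error fail the scheduled job so the unsupported case gets noticed.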
|
|
145
|
+
|
|
146
|
+
## The full canonical flow (script you copy and run)
|
|
147
|
+
|
|
148
|
+
```ts
|
|
149
|
+
// scripts/weekly-improvement.ts — run from a cron / GitHub Action
|
|
150
|
+
|
|
151
|
+
import { fromTangleSandbox } from '@tangle-network/agent-eval/adapters/sandbox'
|
|
152
|
+
import {
|
|
153
|
+
analyzeRuns,
|
|
154
|
+
gepaDriver,
|
|
155
|
+
openAutoPr,
|
|
156
|
+
selfImprove,
|
|
157
|
+
} from '@tangle-network/agent-eval/contract'
|
|
158
|
+
import { scenarios } from './eval/scenarios'
|
|
159
|
+
import { judge } from './eval/judges'
|
|
160
|
+
import { runYourAgent } from './src/agent'
|
|
161
|
+
import { PRODUCTION_ADDENDUM } from './src/lib/.server/production-loop/prompt-addendum'
|
|
162
|
+
|
|
163
|
+
const lastWeek = Date.now() - 7 * 24 * 60 * 60 * 1000
|
|
164
|
+
const twoWeeksAgo = lastWeek - 7 * 24 * 60 * 60 * 1000
|
|
165
|
+
|
|
166
|
+
const thisWeekRuns = await fromTangleSandbox({ fromMs: lastWeek, toMs: Date.now() })
|
|
167
|
+
const lastWeekRuns = await fromTangleSandbox({ fromMs: twoWeeksAgo, toMs: lastWeek })
|
|
168
|
+
|
|
169
|
+
// 1. Deterministic packet — always
|
|
170
|
+
const report = await analyzeRuns({
|
|
171
|
+
runs: thisWeekRuns,
|
|
172
|
+
baselineRuns: lastWeekRuns,
|
|
173
|
+
baselineLabel: 'vs prior 7 days',
|
|
174
|
+
})
|
|
175
|
+
|
|
176
|
+
// 2. Closed loop — only if composite regressed OR we haven't tried in a while
|
|
177
|
+
const shouldRun =
|
|
178
|
+
report.priorPeriodComparison?.regressedMetrics.includes('composite') ||
|
|
179
|
+
daysSinceLastImprovement() > 7 // helper not defined in this script; supply your own
|
|
180
|
+
|
|
181
|
+
if (!shouldRun) {
|
|
182
|
+
console.log('No regression + recent run; skipping.')
|
|
183
|
+
process.exit(0)
|
|
184
|
+
}
|
|
185
|
+
|
|
186
|
+
const result = await selfImprove({
|
|
187
|
+
scenarios,
|
|
188
|
+
agent: (surface, scenario) =>
|
|
189
|
+
runYourAgent({ ...scenario, systemPromptAddendum: surface as string }),
|
|
190
|
+
judge,
|
|
191
|
+
baselineSurface: PRODUCTION_ADDENDUM,
|
|
192
|
+
driver: gepaDriver({
|
|
193
|
+
llm: { apiKey: process.env.TANGLE_KEY!, baseUrl: 'https://router.tangle.tools/v1' },
|
|
194
|
+
model: 'anthropic/claude-sonnet-4.6',
|
|
195
|
+
target: 'production agent system-prompt addendum',
|
|
196
|
+
}),
|
|
197
|
+
budget: { generations: 3, populationSize: 4, holdoutFraction: 0.3, maxUsd: 50 },
|
|
198
|
+
})
|
|
199
|
+
|
|
200
|
+
if (result.gateDecision.kind === 'ship-substrate') {
|
|
201
|
+
await openAutoPr({
|
|
202
|
+
title: `eval: auto-improve addendum (composite +${result.lift.delta.toFixed(3)})`,
|
|
203
|
+
body: renderInsightAsPrBody(result.insight), // helper not imported above; supply your own formatter
|
|
204
|
+
filePath: 'src/lib/.server/production-loop/prompt-addendum.ts',
|
|
205
|
+
newContent: result.diff.kind === 'replace' ? result.diff.content : '...', // '...' is a placeholder; handle other diff kinds before running this
|
|
206
|
+
})
|
|
207
|
+
}
|
|
208
|
+
```
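
The script above calls `daysSinceLastImprovement()` without defining it. One way to fill that gap is sketched here, assuming you persist the timestamp of the last `selfImprove` attempt yourself (for example in a small state file or a CI variable). `daysSince`, `shouldRunImprovement`, and their parameters are hypothetical names, not part of the package.

```ts
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Fractional days elapsed between two epoch-millisecond timestamps.
function daysSince(lastMs: number, nowMs: number): number {
  return (nowMs - lastMs) / MS_PER_DAY
}

// Mirrors the script's trigger: run on a composite regression, or when the last
// improvement attempt is more than `maxIdleDays` old.
function shouldRunImprovement(
  regressedMetrics: readonly string[] | undefined,
  lastAttemptMs: number,
  nowMs: number,
  maxIdleDays = 7,
): boolean {
  const regressed = regressedMetrics?.includes('composite') ?? false
  return regressed || daysSince(lastAttemptMs, nowMs) > maxIdleDays
}
```

Passing `nowMs` in explicitly, rather than calling `Date.now()` inside the helper, keeps the decision reproducible in tests and consistent across the whole script run.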
|
|
209
|
+
|
|
210
|
+
## What we'll do together on the onboarding call
|
|
211
|
+
|
|
212
|
+
1. **Map your existing setup** — where do your traces emit? which sandbox sessions? which scenarios exist already?
|
|
213
|
+
2. **Stub the judge** — even a single dimension is enough to start
|
|
214
|
+
3. **Run a deterministic `analyzeRuns()` against your live data** — first decision packet rendered live
|
|
215
|
+
4. **Wire one selfImprove cycle** — small budget, single generation, see the loop fire
|
|
216
|
+
5. **Schedule the cron + auto-PR target** — the loop runs autonomously thereafter
|
|
217
|
+
|
|
218
|
+
Time budget: ~90 minutes. By the end you have a working pilot.
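
For step 2, a single-dimension stub judge might look like the sketch below. The only thing taken from this guide is the judge signature, a function from an artifact to `{ composite, dimensions }`. The artifact shape (`output: string`), the `wellFormed` dimension, and the length cap are assumptions made purely for illustration.

```ts
// Assumed artifact shape for illustration; use whatever your agent actually returns.
interface ArtifactSketch {
  output: string
}

interface JudgeScoreSketch {
  composite: number
  dimensions: Record<string, number>
}

// Single-dimension stub: 1 when the output is non-empty and within a length cap, else 0.
function stubJudge(artifact: ArtifactSketch, maxChars = 2000): JudgeScoreSketch {
  const text = artifact.output.trim()
  const wellFormed = text.length > 0 && text.length <= maxChars ? 1 : 0
  // With one dimension, the composite is simply that dimension's score.
  return { composite: wellFormed, dimensions: { wellFormed } }
}
```

A stub like this is deliberately crude. Its job is to let the loop run end to end, so that real dimensions can be swapped in one at a time afterwards.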
|
|
219
|
+
|
|
220
|
+
## What our hosted tier adds on top
|
|
221
|
+
|
|
222
|
+
- Decision packet rendered weekly in the Intelligence dashboard — no code changes needed
|
|
223
|
+
- Slack / email digest on `regressedMetrics`
|
|
224
|
+
- Pareto chart, judge calibration, failure-cluster drilldown in the UI
|
|
225
|
+
- Multi-week trend lines
|
|
226
|
+
- Stripe-billed usage tracking
|
|
227
|
+
|
|
228
|
+
If you want self-hosted only, every primitive above works locally. The hosted tier is a convenience.
|
|
229
|
+
|
|
230
|
+
## FAQ
|
|
231
|
+
|
|
232
|
+
**Q: What's the smallest scenario corpus that gives useful results?**
|
|
233
|
+
A: ~15 scenarios for the deterministic packet (you get distributional stats + recommendations). For `selfImprove`'s held-out gate you want ≥20, since `holdoutFraction: 0.3` then reserves 6 scenarios for the gate. Below that, the gate often returns `inconclusive`.
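
The arithmetic behind that recommendation can be sketched as follows. This assumes the holdout size is the rounded product of corpus size and `holdoutFraction`; the package's exact rounding rule is not stated here, and `splitSizes` is a hypothetical helper.

```ts
// Assumption: holdout size = round(total * holdoutFraction); the rest is used for search.
function splitSizes(
  totalScenarios: number,
  holdoutFraction: number,
): { train: number; holdout: number } {
  if (!Number.isInteger(totalScenarios) || totalScenarios < 0) {
    throw new RangeError('totalScenarios must be a non-negative integer')
  }
  if (holdoutFraction < 0 || holdoutFraction > 1) {
    throw new RangeError('holdoutFraction must be between 0 and 1')
  }
  const holdout = Math.round(totalScenarios * holdoutFraction)
  return { train: totalScenarios - holdout, holdout }
}
```

With 20 scenarios and a fraction of 0.3 this gives 14 for the search and 6 for the gate, matching the figure above.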
|
|
234
|
+
|
|
235
|
+
**Q: What if my judge isn't reliable yet?**
|
|
236
|
+
A: That's normal. Use multi-rater intake (`fromFeedbackTable`) to get inter-rater agreement (κ) first, then iterate on the judge until raters agree. The substrate's `InsightReport` has an `interRater` block that shows exactly which scenarios raters disagree on.
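
For intuition about κ, here is a minimal sketch of Cohen's kappa for two raters labelling the same scenarios. It is a standard textbook formula, not necessarily the statistic the package computes (which may use a multi-rater variant), and `cohensKappa` is a hypothetical helper.

```ts
// Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
function cohensKappa(raterA: readonly string[], raterB: readonly string[]): number {
  if (raterA.length !== raterB.length || raterA.length === 0) {
    throw new Error('raters must label the same non-empty set of items')
  }
  const n = raterA.length
  const countsA = new Map<string, number>()
  const countsB = new Map<string, number>()
  let agreements = 0
  for (let i = 0; i < n; i++) {
    const a = raterA[i]!
    const b = raterB[i]!
    if (a === b) agreements++
    countsA.set(a, (countsA.get(a) ?? 0) + 1)
    countsB.set(b, (countsB.get(b) ?? 0) + 1)
  }
  const observed = agreements / n
  // Chance agreement: sum over labels of P(A picks label) * P(B picks label).
  let expected = 0
  countsA.forEach((countA, label) => {
    expected += (countA / n) * ((countsB.get(label) ?? 0) / n)
  })
  // Both raters used one identical label throughout: treat as perfect agreement.
  if (expected === 1) return 1
  return (observed - expected) / (1 - expected)
}
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so a low κ points at an ambiguous rubric rather than a bad judge.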
|
|
237
|
+
|
|
238
|
+
**Q: What if a `selfImprove` campaign returns `inconclusive`?**
|
|
239
|
+
A: The gate refused to claim improvement because the 95% CI on the lift straddles zero. Expand the corpus, raise `holdoutFraction`, or run more generations; an `inconclusive` result is better than shipping noise.
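
To make "the CI straddles zero" concrete, the sketch below computes a percentile bootstrap interval over per-scenario score deltas (candidate minus baseline on the holdout set). This is one common way to build such an interval and is an assumption on our part; how the package actually computes `lift.ci95` is not described in this guide. All names here are hypothetical.

```ts
// Small seeded PRNG (mulberry32) so the example is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Percentile bootstrap 95% interval for the mean of paired deltas.
function bootstrapCi95(deltas: readonly number[], resamples = 2000, seed = 1): [number, number] {
  if (deltas.length === 0 || resamples < 1) throw new Error('need at least one delta and one resample')
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    for (let i = 0; i < deltas.length; i++) {
      // Draw with replacement from the observed deltas.
      sum += deltas[Math.floor(rand() * deltas.length)]!
    }
    means.push(sum / deltas.length)
  }
  means.sort((x, y) => x - y)
  const lo = means[Math.floor(0.025 * (resamples - 1))]!
  const hi = means[Math.ceil(0.975 * (resamples - 1))]!
  return [lo, hi]
}

// True when zero lies inside the interval, i.e. the lift is not distinguishable from noise.
function straddlesZero([lo, hi]: readonly [number, number]): boolean {
  return lo <= 0 && hi >= 0
}
```

With only a handful of holdout scenarios the resampled means spread widely, which is why small corpora so often land on `inconclusive`.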
|
|
240
|
+
|
|
241
|
+
**Q: Can I use a non-tcloud LLM provider?**
|
|
242
|
+
A: Yes — `gepaDriver` accepts any `LlmClientOptions` (any OpenAI-compatible endpoint). We default to tcloud because we already have your auth.
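
Pointing the driver at another OpenAI-compatible endpoint could be sketched like this. Only the `apiKey` and `baseUrl` fields are taken from the snippets in this guide. The `LlmClientOptionsSketch` type, the `llmOptionsFromEnv` helper, and the `LLM_API_KEY` / `LLM_BASE_URL` variable names are hypothetical.

```ts
// Hypothetical local mirror of the two option fields shown in this guide.
interface LlmClientOptionsSketch {
  apiKey: string
  baseUrl: string
}

// Build options from an environment map, falling back to the tcloud router URL used above.
function llmOptionsFromEnv(env: Record<string, string | undefined>): LlmClientOptionsSketch {
  const apiKey = env.LLM_API_KEY
  if (!apiKey) throw new Error('LLM_API_KEY is not set')
  const baseUrl = env.LLM_BASE_URL ?? 'https://router.tangle.tools/v1'
  // Strip trailing slashes so path joining behaves the same for every provider.
  return { apiKey, baseUrl: baseUrl.replace(/\/+$/, '') }
}
```

Usage: pass the result as `llm` to `gepaDriver`, for example `llm: llmOptionsFromEnv(process.env)`, and set `model` to an identifier your provider recognises.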
|
|
243
|
+
|
|
244
|
+
**Q: How do I see what changed when the gate ships?**
|
|
245
|
+
A: `result.diff` is a structured patch. We also ship `diffRuns()` separately if you want to compare two campaign outputs.
|
|
246
|
+
|
|
247
|
+
**Q: What if my agent self-modifies (Hermes / Claude Code skills)?**
|
|
248
|
+
A: This is the offline/online drift case. We have the architecture spec ready (`docs/specs/profile-versioning.md`) but the implementation is gated on a forcing-function experiment. For v0.5x pilots we assume the substrate is the only writer to your agent's optimizable surface.
|