@tangle-network/agent-eval 0.45.0 → 0.47.0
- package/dist/adapters/http.d.ts +1 -1
- package/dist/adapters/http.js +11 -4
- package/dist/adapters/http.js.map +1 -1
- package/dist/adapters/langchain.d.ts +1 -1
- package/dist/campaign/index.d.ts +3 -3
- package/dist/chunk-ZQABFCVJ.js +85 -0
- package/dist/chunk-ZQABFCVJ.js.map +1 -0
- package/dist/contract/index.d.ts +217 -2
- package/dist/contract/index.js +206 -1
- package/dist/contract/index.js.map +1 -1
- package/dist/hosted/index.d.ts +192 -0
- package/dist/hosted/index.js +10 -0
- package/dist/hosted/index.js.map +1 -0
- package/dist/openapi.json +1 -1
- package/dist/rl.d.ts +1 -1
- package/dist/{run-improvement-loop-pJ4yrx4X.d.ts → run-improvement-loop-Bfam3MT1.d.ts} +2 -2
- package/dist/{types-BURGZ8Ug.d.ts → types-8u72Gc76.d.ts} +1 -1
- package/docs/design/external-agent-wedge.md +2 -2
- package/docs/design/phase-d-rfc.md +125 -0
- package/docs/hosted-ingest-spec.md +204 -0
- package/docs/phase-b-pairing-kit.md +188 -0
- package/docs/phase-b-runbook.md +176 -0
- package/docs/quickstart-external.md +43 -4
- package/package.json +6 -1
|
@@ -0,0 +1,125 @@
|
|
|
1
|
+
# Phase D RFC — hosted-tier substrate
|
|
2
|
+
|
|
3
|
+
Pinned scope decisions for the EXPAND tier. What we built, what we
|
|
4
|
+
deliberately did NOT, and what's gated on Phase B evidence.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## What's in this version
|
|
9
|
+
|
|
10
|
+
**Wire-format substrate (shipped):**
|
|
11
|
+
|
|
12
|
+
1. `@tangle-network/agent-eval/hosted` — public client + types for shipping
|
|
13
|
+
eval-run events + trace spans to any orchestrator that speaks the wire
|
|
14
|
+
format.
|
|
15
|
+
2. `docs/hosted-ingest-spec.md` — semver-committed wire spec
|
|
16
|
+
(`HostedWireVersion = "2026-05-26.v1"`).
|
|
17
|
+
3. `examples/hosted-ingest-server/` — minimal hono-based reference
|
|
18
|
+
receiver (~200 LOC). Executable spec. Stays as the reference even
|
|
19
|
+
after the production orchestrator ships.
|
|
20
|
+
4. `selfImprove({ hostedTenant })` opt-in — when set, the substrate
|
|
21
|
+
POSTs the final eval-run event to the configured endpoint. Failures
|
|
22
|
+
are logged but never fail the loop (LAND tier never blocks on
|
|
23
|
+
EXPAND-tier infra).
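The "logged but never fail the loop" rule in item 4 can be sketched as a wrapper that catches every error from the POST. This is a hypothetical illustration, not the package's implementation: `reportBestEffort`, `Poster`, and `ReportResult` are names invented here.

```typescript
// Hypothetical sketch: these names are illustrative, not the package's API.
type Poster = (body: unknown) => Promise<void>;

interface ReportResult {
  delivered: boolean;
  error?: string;
}

// Run the POST, log any failure, and always resolve, so hosted-tier
// infrastructure problems cannot fail the caller's improvement loop.
async function reportBestEffort(
  post: Poster,
  finalEvent: unknown,
  log: (msg: string) => void = console.warn,
): Promise<ReportResult> {
  try {
    await post(finalEvent);
    return { delivered: true };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    log(`hosted ingest failed (ignored): ${error}`);
    return { delivered: false, error };
  }
}
```

The caller can await the result for diagnostics, but it never needs a `try`/`catch` around the call.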
|
|
24
|
+
|
|
25
|
+
**Production orchestrator (started):**
|
|
26
|
+
|
|
27
|
+
5. HTTP ingest service in `@tangle-network/monorepo` accepting the wire
|
|
28
|
+
format. Lives under the orchestrator app. Tenant auth + isolation
|
|
29
|
+
+ persistent storage + read endpoints. *Started this session — see
|
|
30
|
+
the @tangle-network/agent-dev-container PR. Not feature-complete:
|
|
31
|
+
tenant CRUD + adversarial isolation tests pending.*
|
|
32
|
+
|
|
33
|
+
## What's deliberately deferred
|
|
34
|
+
|
|
35
|
+
The wedge doc gates these on Phase B evidence — partner-validated
|
|
36
|
+
signal about what the hosted product actually needs to do. Shipping
|
|
37
|
+
them without that signal risks building the wrong thing.
|
|
38
|
+
|
|
39
|
+
| Deferred until Phase B passes | Why |
|
|
40
|
+
|---|---|
|
|
41
|
+
| **Metered billing wire-up (Stripe + cost-ledger)** | The billable units (per-eval-run, per-ingested-MB, per-seat) depend on actual partner consumption patterns. Picking dimensions in a vacuum locks us into wrong pricing. |
|
|
42
|
+
| **Multi-tenant dashboard UX** | Partners' first dashboard request defines the right default views. We have a stub list-runs page; the rest is post-signal. |
|
|
43
|
+
| **Webhook callbacks per tenant** | The events partners want pushed (gate-decided, cost-threshold, regression-alert) are partner-shaped. Add them when a partner asks. |
|
|
44
|
+
| **Cross-tenant aggregation / benchmarking** | This is the "Datadog for agents" tier — explicit roadmap, requires user volume we don't have. |
|
|
45
|
+
| **Sandbox-cost roll-up into hosted billing** | Cross-product billing integration requires PLATFORM-tier partners. Out of scope until at least one. |
|
|
46
|
+
| **Trace UI** | OTel-shape spans store fine. Visualization comes after partners ask. Phoenix / Jaeger / any OTLP-compatible viewer covers it in the interim. |
|
|
47
|
+
| **SOC 2 / compliance audit work** | Required for enterprise; not required for design partners. |
|
|
48
|
+
|
|
49
|
+
## Architecture decisions locked
|
|
50
|
+
|
|
51
|
+
These are committed and won't change without a major-version wire bump
|
|
52
|
+
or a documented migration:
|
|
53
|
+
|
|
54
|
+
1. **Wire format is JSON over HTTP**, not gRPC. Reasons: works in
|
|
55
|
+
browsers + edge + node + curl; OTel-compatible at the trace stream
|
|
56
|
+
level; lowest possible barrier to a self-hosted orchestrator.
|
|
57
|
+
2. **Tenant auth is bearer-token + tenant-id header**, not OIDC /
|
|
58
|
+
service-account / mutual-TLS. Reasons: simplest thing that's
|
|
59
|
+
actually secure with proper key handling; defers complex IAM until
|
|
60
|
+
enterprise demand.
|
|
61
|
+
3. **Idempotency via header, not transactional API**. Servers MUST
|
|
62
|
+
dedupe by `(tenantId, Idempotency-Key)` for 24h. Simpler than
|
|
63
|
+
making clients commit transactions.
|
|
64
|
+
4. **Eval-runs and traces are SEPARATE streams** with pivot keys
|
|
65
|
+
(`tangle.runId` etc.) on spans. Reasons: traces can be best-effort
|
|
66
|
+
(lossy) without corrupting eval-run semantics; orchestrators can
|
|
67
|
+
prioritize eval-run durability without forcing trace durability.
|
|
68
|
+
5. **Wire version is a date.v-N string**, not semver. Reasons: dates
|
|
69
|
+
communicate "when was this contract frozen"; v-N captures
|
|
70
|
+
incremental revisions within a date (the wire spec treats a `vN` bump as a minor, additive change and a new date as the breaking one).
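Decision 3 can be sketched as a small store keyed on the `(tenantId, Idempotency-Key)` pair with a 24h window. This is an in-memory sketch under stated assumptions: `IdempotencyStore` and its method names are hypothetical, and a real orchestrator would need durable, shared storage.

```typescript
// Hypothetical in-memory sketch; not taken from the reference receiver.
const DEDUPE_WINDOW_MS = 24 * 60 * 60 * 1000; // the 24h window from decision 3

interface StoredResponse<T> {
  response: T;
  storedAtMs: number;
}

class IdempotencyStore<T> {
  private readonly entries = new Map<string, StoredResponse<T>>();

  // Key on the (tenantId, Idempotency-Key) pair so tenants never collide.
  private static key(tenantId: string, idempotencyKey: string): string {
    return JSON.stringify([tenantId, idempotencyKey]);
  }

  // Return the earlier response if the key was seen within the window;
  // otherwise run the handler once and remember its response.
  handle(tenantId: string, idempotencyKey: string, nowMs: number, run: () => T): T {
    const k = IdempotencyStore.key(tenantId, idempotencyKey);
    const prior = this.entries.get(k);
    if (prior !== undefined && nowMs - prior.storedAtMs < DEDUPE_WINDOW_MS) {
      return prior.response;
    }
    const response = run();
    this.entries.set(k, { response, storedAtMs: nowMs });
    return response;
  }
}
```

Passing `nowMs` in explicitly keeps the window logic testable without a real clock.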
|
|
71
|
+
|
|
72
|
+
## Open questions for Phase B to answer
|
|
73
|
+
|
|
74
|
+
When the design-partner pairing happens, capture answers to these
|
|
75
|
+
explicitly:
|
|
76
|
+
|
|
77
|
+
1. **Surface confidentiality**: do partners want the verbatim surface
|
|
78
|
+
(system prompt) shipped, or just the hash? Today the wire format
|
|
79
|
+
has `surface?` as optional; partner default is what we ship.
|
|
80
|
+
2. **Trace sampling**: at what cells-per-second do trace spans become
|
|
81
|
+
noise? What's the right default sampling rate?
|
|
82
|
+
3. **Cost attribution granularity**: per cell? per generation? per
|
|
83
|
+
run? Per judge dimension? Partner needs determine what we surface
|
|
84
|
+
in billing reports.
|
|
85
|
+
4. **Replay**: do partners want to re-run an old eval-run from the
|
|
86
|
+
stored data? That would require us to store more than the summary —
|
|
87
|
+
actual artifacts + prompts. Storage cost implication.
|
|
88
|
+
5. **PII / sensitive scenarios**: how do partners want to handle
|
|
89
|
+
scenarios containing user data? Encryption-at-rest is table stakes;
|
|
90
|
+
redaction-at-ingest may be required for some.
|
|
91
|
+
|
|
92
|
+
The partner pairing kit (`docs/phase-b-pairing-kit.md`) has discovery
|
|
93
|
+
questions that probe these.
|
|
94
|
+
|
|
95
|
+
## Non-goals (explicit)
|
|
96
|
+
|
|
97
|
+
This RFC does NOT plan for:
|
|
98
|
+
|
|
99
|
+
- Replacing Langfuse / Phoenix / Arize. We INGEST OTel; we don't
|
|
100
|
+
build a generic trace viewer. The dashboard is eval-run-shaped, not
|
|
101
|
+
trace-shaped.
|
|
102
|
+
- Becoming a model gateway. Tangle Router exists; the hosted
|
|
103
|
+
orchestrator routes to Tangle Router by default but doesn't
|
|
104
|
+
duplicate its function.
|
|
105
|
+
- Becoming an LLM-call CDN. Caching is the consumer's job (their
|
|
106
|
+
agent code, their HTTP client). We don't intercept LLM calls.
|
|
107
|
+
- Building an "agents IDE." Substrate, not surface.
|
|
108
|
+
|
|
109
|
+
## Migration path (post Phase B)
|
|
110
|
+
|
|
111
|
+
When Phase B passes the gate, the production orchestrator finishes:
|
|
112
|
+
|
|
113
|
+
1. Replace in-memory store with Postgres (tenant data) + S3 (large
|
|
114
|
+
artifacts) OR Cloudflare D1 + R2 (Workers-native).
|
|
115
|
+
2. Wire metered events to Stripe + the cost-ledger.
|
|
116
|
+
3. Tenant CRUD UI + onboarding flow.
|
|
117
|
+
4. Multi-tenant dashboard MVP (list runs, drill into one, diff
|
|
118
|
+
generations, view shipped prompt).
|
|
119
|
+
5. Adversarial tenant-isolation test battery in CI.
|
|
120
|
+
6. Webhooks + observability for the orchestrator itself.
|
|
121
|
+
|
|
122
|
+
Estimated effort post-Phase-B: ~1 week focused work for one engineer.
|
|
123
|
+
This is fast precisely BECAUSE the wire format is locked and the
|
|
124
|
+
reference receiver exists — the production server is a different
|
|
125
|
+
implementation of the same contract.
|
|
@@ -0,0 +1,204 @@
|
|
|
1
|
+
# Hosted-ingest wire spec — `2026-05-26.v1`
|
|
2
|
+
|
|
3
|
+
The schema **every** orchestrator (ours, partners' self-hosted ones,
|
|
4
|
+
any future open implementation) must accept. Frozen under semver:
|
|
5
|
+
**new minors only add optional fields. Breaking changes mean a major
|
|
6
|
+
bump and a new `HostedWireVersion` literal.**
|
|
7
|
+
|
|
8
|
+
This is the contract that decouples the LAND-tier substrate
|
|
9
|
+
(`@tangle-network/agent-eval`) from the EXPAND-tier hosted product. A
|
|
10
|
+
foreign builder can:
|
|
11
|
+
|
|
12
|
+
- Use our orchestrator at `https://orchestrator.tangle.tools/v1`.
|
|
13
|
+
- Self-host the reference receiver from
|
|
14
|
+
`examples/hosted-ingest-server/`.
|
|
15
|
+
- Implement their own orchestrator against this spec.
|
|
16
|
+
|
|
17
|
+
All three are wire-compatible by definition.
|
|
18
|
+
|
|
19
|
+
---
|
|
20
|
+
|
|
21
|
+
## Transport
|
|
22
|
+
|
|
23
|
+
Two endpoints, both `POST`, both JSON. Headers on every request:
|
|
24
|
+
|
|
25
|
+
| Header | Value |
|
|
26
|
+
|---|---|
|
|
27
|
+
| `Authorization` | `Bearer <tenant-key>` (the orchestrator issues this) |
|
|
28
|
+
| `Content-Type` | `application/json` |
|
|
29
|
+
| `X-Tangle-Tenant-Id` | The tenant's stable id (the orchestrator's primary key for the tenant) |
|
|
30
|
+
| `X-Tangle-Wire-Version` | `2026-05-26.v1` (this spec) |
|
|
31
|
+
| `Idempotency-Key` (optional) | UUID; servers MUST deduplicate requests that repeat a key from the same tenant |
|
|
32
|
+
|
|
33
|
+
Responses are JSON of shape `{ accepted: number, rejected: Array<{ index, reason }> }`. The
|
|
34
|
+
server SHOULD return 202 (accepted, async) or 200 (accepted, synchronous);
|
|
35
|
+
both are equivalent for the wire's purposes.
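A client-side sketch of the header table and status rule above. The header names and version literal come from the spec; `buildIngestHeaders`, `isAcceptedStatus`, and the option names are hypothetical, and the `index: number` / `reason: string` member types are assumptions since the spec leaves them untyped.

```typescript
// Header names and the version literal are from the table above;
// function and option names are hypothetical.
const HOSTED_WIRE_VERSION = "2026-05-26.v1";

interface IngestRequestOptions {
  tenantKey: string;
  tenantId: string;
  idempotencyKey?: string;
}

// ASSUMPTION: index is a number and reason is a string.
interface IngestResponse {
  accepted: number;
  rejected: Array<{ index: number; reason: string }>;
}

function buildIngestHeaders(opts: IngestRequestOptions): Record<string, string> {
  const headers: Record<string, string> = {
    Authorization: `Bearer ${opts.tenantKey}`,
    "Content-Type": "application/json",
    "X-Tangle-Tenant-Id": opts.tenantId,
    "X-Tangle-Wire-Version": HOSTED_WIRE_VERSION,
  };
  // Idempotency-Key is optional, so only send it when the caller supplies one.
  if (opts.idempotencyKey !== undefined) {
    headers["Idempotency-Key"] = opts.idempotencyKey;
  }
  return headers;
}

// 200 and 202 both mean "accepted"; anything else is treated as a failure.
function isAcceptedStatus(status: number): boolean {
  return status === 200 || status === 202;
}
```

The returned object can be passed as `headers` to a standard `fetch` call with `method: "POST"` and a JSON-stringified body.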
|
|
36
|
+
|
|
37
|
+
### `POST /v1/ingest/eval-runs`
|
|
38
|
+
|
|
39
|
+
Body: `IngestEvalRunsRequest = { wireVersion, events: EvalRunEvent[] }`.
|
|
40
|
+
|
|
41
|
+
Each ingest call covers one logical eval-run; generations stream in
|
|
42
|
+
incrementally via repeated calls with the same `runId`. The
|
|
43
|
+
orchestrator deduplicates by `(tenantId, runId, generation.index)`.
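One way to read the dedup rule is a store keyed on `(tenantId, runId)` that keeps one snapshot per generation index. The sketch below assumes first-write-wins for a repeated index, which the spec does not state; `RunGenerationStore` is a hypothetical name.

```typescript
// Hypothetical sketch of dedup by (tenantId, runId, generation.index).
interface GenerationLike {
  index: number;
}

class RunGenerationStore<G extends GenerationLike> {
  private readonly byRun = new Map<string, Map<number, G>>();

  // Returns how many generations this call newly stored.
  // ASSUMPTION: a repeated index keeps the first snapshot received.
  ingest(tenantId: string, runId: string, generations: G[]): number {
    const runKey = JSON.stringify([tenantId, runId]);
    let stored = this.byRun.get(runKey);
    if (stored === undefined) {
      stored = new Map<number, G>();
      this.byRun.set(runKey, stored);
    }
    let added = 0;
    for (const g of generations) {
      if (!stored.has(g.index)) {
        stored.set(g.index, g);
        added += 1;
      }
    }
    return added;
  }

  // Generations for one tenant's run, ordered by index.
  list(tenantId: string, runId: string): G[] {
    const stored = this.byRun.get(JSON.stringify([tenantId, runId]));
    if (stored === undefined) return [];
    return Array.from(stored.values()).sort((a, b) => a.index - b.index);
  }
}
```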
|
|
44
|
+
|
|
45
|
+
### `POST /v1/ingest/traces`
|
|
46
|
+
|
|
47
|
+
Body: `IngestTracesRequest = { wireVersion, spans: TraceSpanEvent[] }`.
|
|
48
|
+
|
|
49
|
+
Standard OTLP-shaped spans with a few additional attributes
|
|
50
|
+
(`tangle.runId`, `tangle.generation`, `tangle.cellId`,
|
|
51
|
+
`tangle.scenarioId`) so the orchestrator can pivot between the
|
|
52
|
+
eval-run stream and the underlying execution trace.
|
|
53
|
+
|
|
54
|
+
---
|
|
55
|
+
|
|
56
|
+
## `EvalRunEvent`
|
|
57
|
+
|
|
58
|
+
```ts
|
|
59
|
+
interface EvalRunEvent {
|
|
60
|
+
runId: string // stable; same id across all generations of one run
|
|
61
|
+
runDir: string // logical run directory (mem://... or filesystem path)
|
|
62
|
+
timestamp: string // ISO-8601
|
|
63
|
+
status: // lifecycle stage this event represents
|
|
64
|
+
| 'started'
|
|
65
|
+
| 'baseline-complete'
|
|
66
|
+
| 'generation-complete'
|
|
67
|
+
| 'gate-decided'
|
|
68
|
+
| 'finished'
|
|
69
|
+
| 'errored'
|
|
70
|
+
labels: Record<string, string> // free-form (env, branch, model id, etc.)
|
|
71
|
+
baseline?: EvalRunGenerationSnapshot // present when status >= baseline-complete
|
|
72
|
+
generations: EvalRunGenerationSnapshot[]
|
|
73
|
+
gateDecision?: // present when status >= gate-decided
|
|
74
|
+
| 'ship' | 'hold' | 'need_more_work' | 'model_ceiling' | 'arch_ceiling'
|
|
75
|
+
holdoutLift?: number // winner-on-holdout - baseline-on-holdout
|
|
76
|
+
totalCostUsd: number
|
|
77
|
+
totalDurationMs: number
|
|
78
|
+
errorMessage?: string // present when status === 'errored'
|
|
79
|
+
}
|
|
80
|
+
```
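The `status >= ...` comments in the interface imply an ordering over the lifecycle stages. A minimal sketch of that reading follows, assuming the order in which the union lists them and treating `'errored'` as outside the ordering; `hasReached` and `checkConditionalFields` are hypothetical helpers, not part of the package.

```typescript
// ASSUMPTION: stage order is the order the status union lists them in.
const LIFECYCLE = [
  "started",
  "baseline-complete",
  "generation-complete",
  "gate-decided",
  "finished",
] as const;

type LifecycleStatus = (typeof LIFECYCLE)[number];
type RunStatus = LifecycleStatus | "errored";

// True when `status` is at or past `stage`. 'errored' can occur at any
// point, so it never implies that a stage was reached.
function hasReached(status: RunStatus, stage: LifecycleStatus): boolean {
  if (status === "errored") return false;
  return LIFECYCLE.indexOf(status) >= LIFECYCLE.indexOf(stage);
}

interface EventShape {
  status: RunStatus;
  baseline?: unknown;
  gateDecision?: string;
  errorMessage?: string;
}

// Collect violations of the "present when ..." comments in the interface.
function checkConditionalFields(e: EventShape): string[] {
  const problems: string[] = [];
  if (hasReached(e.status, "baseline-complete") && e.baseline === undefined) {
    problems.push("baseline missing");
  }
  if (hasReached(e.status, "gate-decided") && e.gateDecision === undefined) {
    problems.push("gateDecision missing");
  }
  if (e.status === "errored" && e.errorMessage === undefined) {
    problems.push("errorMessage missing");
  }
  return problems;
}
```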
|
|
81
|
+
|
|
82
|
+
## `EvalRunGenerationSnapshot`
|
|
83
|
+
|
|
84
|
+
```ts
|
|
85
|
+
interface EvalRunGenerationSnapshot {
|
|
86
|
+
index: number // 0 is baseline; 1..N are improvement generations
|
|
87
|
+
surfaceHash: string // stable hash of the candidate surface (pivot key)
|
|
88
|
+
surface?: MutableSurface // OMITTED to avoid PII when consumer prefers
|
|
89
|
+
cells: EvalRunCellScore[]
|
|
90
|
+
compositeMean: number
|
|
91
|
+
costUsd: number
|
|
92
|
+
durationMs: number
|
|
93
|
+
}
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
## `EvalRunCellScore`
|
|
97
|
+
|
|
98
|
+
```ts
|
|
99
|
+
interface EvalRunCellScore {
|
|
100
|
+
scenarioId: string
|
|
101
|
+
rep: number // 0 for the default; > 0 when reps > 1
|
|
102
|
+
compositeMean: number // composite across all judges + dimensions
|
|
103
|
+
dimensions: Record< // outer key = judge name; inner = dimension name → score
|
|
104
|
+
string,
|
|
105
|
+
Record<string, number>
|
|
106
|
+
>
|
|
107
|
+
errorMessage?: string // present when the dispatch threw
|
|
108
|
+
}
|
|
109
|
+
```
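To make the nested `dimensions` shape concrete, the sketch below flattens judge → dimension → score and averages the result. The spec does not say how `compositeMean` is weighted, so the unweighted mean here is an assumption for illustration only; both function names are hypothetical.

```typescript
type Dimensions = Record<string, Record<string, number>>;

// Flatten judge -> dimension -> score into "judge.dimension" keys.
function flattenDimensions(dimensions: Dimensions): Record<string, number> {
  const flat: Record<string, number> = {};
  for (const [judge, scores] of Object.entries(dimensions)) {
    for (const [dim, score] of Object.entries(scores)) {
      flat[`${judge}.${dim}`] = score;
    }
  }
  return flat;
}

// ASSUMPTION: unweighted mean over every (judge, dimension) score.
// The real composite may weight judges or dimensions differently.
function unweightedComposite(dimensions: Dimensions): number {
  const values = Object.values(flattenDimensions(dimensions));
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```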
|
|
110
|
+
|
|
111
|
+
## `TraceSpanEvent`
|
|
112
|
+
|
|
113
|
+
```ts
|
|
114
|
+
interface TraceSpanEvent {
|
|
115
|
+
// Standard OTel
|
|
116
|
+
traceId: string
|
|
117
|
+
spanId: string
|
|
118
|
+
parentSpanId?: string
|
|
119
|
+
name: string
|
|
120
|
+
startTimeUnixNano: number
|
|
121
|
+
endTimeUnixNano: number
|
|
122
|
+
attributes: Record<string, string | number | boolean>
|
|
123
|
+
events?: Array<{ timeUnixNano: number; name: string; attributes?: Record<string, string | number | boolean> }>
|
|
124
|
+
status?: { code: 'OK' | 'ERROR' | 'UNSET'; message?: string }
|
|
125
|
+
|
|
126
|
+
// Tangle additions (all optional) for pivoting
|
|
127
|
+
'tangle.runId'?: string
|
|
128
|
+
'tangle.generation'?: number
|
|
129
|
+
'tangle.cellId'?: string
|
|
130
|
+
'tangle.scenarioId'?: string
|
|
131
|
+
}
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
---
|
|
135
|
+
|
|
136
|
+
## Server requirements
|
|
137
|
+
|
|
138
|
+
Any orchestrator implementing this spec MUST:
|
|
139
|
+
|
|
140
|
+
1. **Validate auth**: reject without `Authorization` header (401), with a
|
|
141
|
+
mismatched bearer token (401), or without a recognized `X-Tangle-Tenant-Id`
|
|
142
|
+
(404).
|
|
143
|
+
2. **Validate wire version**: reject incompatible wire versions (400 with
|
|
144
|
+
a clear error message). The major component is the breaking-change axis.
|
|
145
|
+
3. **Validate tenant isolation**: queries with `tenantId` X never return
|
|
146
|
+
data tagged with `tenantId` Y. Test this adversarially.
|
|
147
|
+
4. **Honor idempotency**: when an `Idempotency-Key` matches a prior
|
|
148
|
+
request from the same tenant in the last 24h, return the same response
|
|
149
|
+
without double-processing.
|
|
150
|
+
5. **Persist eval-runs durably**: at least the event + cell scores must
|
|
151
|
+
survive an orchestrator restart. Trace spans MAY be best-effort.
|
|
152
|
+
6. **Provide read access**: GET endpoints for the tenant to list + fetch
|
|
153
|
+
their own runs. Wire format for reads is NOT part of this spec — each
|
|
154
|
+
orchestrator can pick its own (REST + JSON, gRPC, GraphQL).
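Requirement 1 maps three failure cases to two status codes. A minimal sketch, assuming tenant keys are held in a plain `Map` and that a malformed (non-Bearer) header is treated like a missing one; `validateAuth` and its types are hypothetical, and the order of the checks is a choice made here, not one the spec fixes.

```typescript
// Hypothetical sketch of requirement 1; tenant records live in a Map here.
interface AuthHeaders {
  authorization?: string; // the Authorization header, if present
  tenantId?: string; // the X-Tangle-Tenant-Id header, if present
}

type AuthResult =
  | { ok: true; tenantId: string }
  | { ok: false; status: 401 | 404 };

function validateAuth(headers: AuthHeaders, tenantKeys: Map<string, string>): AuthResult {
  const auth = headers.authorization;
  if (auth === undefined || !auth.startsWith("Bearer ")) {
    return { ok: false, status: 401 }; // missing (or non-Bearer) Authorization
  }
  const tenantId = headers.tenantId;
  const expectedKey = tenantId === undefined ? undefined : tenantKeys.get(tenantId);
  if (tenantId === undefined || expectedKey === undefined) {
    return { ok: false, status: 404 }; // unrecognized X-Tangle-Tenant-Id
  }
  if (auth.slice("Bearer ".length) !== expectedKey) {
    return { ok: false, status: 401 }; // bearer token does not match the tenant
  }
  return { ok: true, tenantId };
}
```

A production server would likely compare tokens in constant time and store only hashed keys; the plain string comparison keeps the sketch short.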
|
|
155
|
+
|
|
156
|
+
Servers SHOULD also:
|
|
157
|
+
|
|
158
|
+
- Provide a webhook callback per tenant for `gate-decided` events.
|
|
159
|
+
- Provide a billable-events emitter (Stripe meter / equivalent) per ingest
|
|
160
|
+
call so consumption can be metered.
|
|
161
|
+
- Provide a dashboard or API to view + diff per-scenario lifts over time.
|
|
162
|
+
|
|
163
|
+
---
|
|
164
|
+
|
|
165
|
+
## Reference implementation
|
|
166
|
+
|
|
167
|
+
`examples/hosted-ingest-server/` — a minimal hono-based receiver. ~200
|
|
168
|
+
LOC. Validates auth, accepts ingest, stores in memory, exposes a
|
|
169
|
+
read endpoint. Runs anywhere Node runs.
|
|
170
|
+
|
|
171
|
+
```sh
|
|
172
|
+
TENANT_KEY=dev-token TENANT_ID=acme pnpm tsx examples/hosted-ingest-server/server.ts
|
|
173
|
+
```
|
|
174
|
+
|
|
175
|
+
In another terminal:
|
|
176
|
+
|
|
177
|
+
```sh
|
|
178
|
+
HOSTED_ENDPOINT=http://localhost:8080 \
|
|
179
|
+
HOSTED_TENANT_KEY=dev-token \
|
|
180
|
+
HOSTED_TENANT_ID=acme \
|
|
181
|
+
pnpm tsx examples/foreign-agent-quickstart/index.ts
|
|
182
|
+
```
|
|
183
|
+
|
|
184
|
+
The quickstart's eval-run gets POSTed to the reference receiver; the
|
|
185
|
+
receiver's `GET /v1/runs` lists it back.
|
|
186
|
+
|
|
187
|
+
---
|
|
188
|
+
|
|
189
|
+
## Versioning
|
|
190
|
+
|
|
191
|
+
`HostedWireVersion` is `"2026-05-26.v1"`.
|
|
192
|
+
|
|
193
|
+
- Adding an optional field → no version change.
|
|
194
|
+
- Adding a new endpoint or new event type → minor wire bump
|
|
195
|
+
(`2026-05-26.v2`).
|
|
196
|
+
- Changing the shape of an existing field, removing a field, or
|
|
197
|
+
changing semantics of an existing field → major wire bump
|
|
198
|
+
(`2026-11-XX.v1`); a server may accept both versions during a
|
|
199
|
+
transition window.
|
|
200
|
+
|
|
201
|
+
Servers MUST reject requests with `X-Tangle-Wire-Version` they don't
|
|
202
|
+
support, with a 400 listing the versions they DO accept.
|
|
203
|
+
|
|
204
|
+
The version string IS the spec id — pin against it.
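Because the version string is the spec id, a server-side check can be an exact match against a list of accepted literals. The sketch below also parses the `date.vN` form for logging; `parseWireVersion`, `checkWireVersion`, and the shape of the 400 body are assumptions, since the spec only requires that the accepted versions be listed.

```typescript
interface WireVersion {
  date: string; // e.g. "2026-05-26"
  revision: number; // the N in "vN"
}

// Parse "YYYY-MM-DD.vN"; returns null for anything else.
function parseWireVersion(raw: string): WireVersion | null {
  const match = /^(\d{4}-\d{2}-\d{2})\.v(\d+)$/.exec(raw);
  if (match === null) return null;
  return { date: match[1], revision: Number(match[2]) };
}

type VersionCheck =
  | { ok: true }
  | { ok: false; status: 400; error: string; accepted: string[] };

// Exact-match pinning: a version is supported only if it is listed.
// ASSUMPTION: the 400 body carries an `accepted` array.
function checkWireVersion(header: string | undefined, supported: string[]): VersionCheck {
  if (header !== undefined && supported.includes(header)) {
    return { ok: true };
  }
  return {
    ok: false,
    status: 400,
    error: `unsupported wire version: ${header ?? "(missing)"}`,
    accepted: supported,
  };
}
```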
|
|
@@ -0,0 +1,188 @@
|
|
|
1
|
+
# Phase-B partner pairing kit
|
|
2
|
+
|
|
3
|
+
Everything we hand a design partner — the pitch, the discovery doc,
|
|
4
|
+
the judge worksheet, the 4-hour pairing agenda, the success criteria.
|
|
5
|
+
|
|
6
|
+
> This file is **partner-facing**. The internal driving runbook is in
|
|
7
|
+
> [`phase-b-runbook.md`](./phase-b-runbook.md).
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## The pitch (one-pager)
|
|
12
|
+
|
|
13
|
+
You have a working agent. You don't have evals. You don't have a
|
|
14
|
+
self-improvement loop. You don't know which prompt change actually
|
|
15
|
+
made the agent better last week.
|
|
16
|
+
|
|
17
|
+
We have all of that on a shelf — same engine our six internal product
|
|
18
|
+
agents use in production. It's open source, free at the LAND tier, and
|
|
19
|
+
sandbox-free if you don't want our sandbox.
|
|
20
|
+
|
|
21
|
+
**The Phase-B offer:** in one 4-hour pairing, we wrap your agent
|
|
22
|
+
behind our `Dispatch`, author your domain-specific judge with you,
|
|
23
|
+
and run one real campaign + improvement loop on **your actual use
|
|
24
|
+
case**. You walk away with:
|
|
25
|
+
|
|
26
|
+
- A reproducible eval harness against scenarios you control.
|
|
27
|
+
- A judge that scores your outputs on dimensions you defined.
|
|
28
|
+
- One measurable lift on your real product, with a held-out gate.
|
|
29
|
+
- Trace artifacts you own (locally on disk; nothing leaves your
|
|
30
|
+
network unless you point at our hosted tier).
|
|
31
|
+
|
|
32
|
+
What we get: design-partner evidence the substrate works on a foreign
|
|
33
|
+
agent we did not build. That validates the wedge for us. Nothing else
|
|
34
|
+
changes hands.
|
|
35
|
+
|
|
36
|
+
**Cost to you:** 4 hours of pairing + your LLM bill for the campaign
|
|
37
|
+
run (typically $5-$50 depending on model + scenario count). No
|
|
38
|
+
commitment, no contract, no exclusivity. We don't take your code, your
|
|
39
|
+
data, or your secrets.
|
|
40
|
+
|
|
41
|
+
---
|
|
42
|
+
|
|
43
|
+
## Discovery questions (15 min, before the pairing)
|
|
44
|
+
|
|
45
|
+
Send these to the partner ahead of the pairing so they walk in with
|
|
46
|
+
their answers.
|
|
47
|
+
|
|
48
|
+
### About the agent
|
|
49
|
+
|
|
50
|
+
1. What does your agent **do** — one paragraph, end-user perspective?
|
|
51
|
+
2. What's the **input** it accepts and the **output** it produces?
|
|
52
|
+
(Schemas help; English is fine.)
|
|
53
|
+
3. What framework / stack? (LangChain / Mastra / OpenAI Agents SDK /
|
|
54
|
+
bespoke / something else.)
|
|
55
|
+
4. Where does it run? (Local node / serverless / your sandbox /
|
|
56
|
+
browser / mobile / other.)
|
|
57
|
+
5. What model(s) does it use today? Any model-routing layer
|
|
58
|
+
(OpenRouter, Portkey, your own)?
|
|
59
|
+
|
|
60
|
+
### About quality
|
|
61
|
+
|
|
62
|
+
6. How do you currently know your agent is good? (Eyeballing /
|
|
63
|
+
user feedback / metrics / nothing yet — all fine answers.)
|
|
64
|
+
7. What does a **bad** output look like for you? Give 2-3 concrete
|
|
65
|
+
examples. Be specific.
|
|
66
|
+
8. What does a **good** output look like? Same.
|
|
67
|
+
9. Are there outputs that are *technically correct but feel wrong*?
|
|
68
|
+
What's the signal?
|
|
69
|
+
10. How would a senior person on your team **score** an output, if
|
|
70
|
+
they had to give it a 1-10? Walk us through the rubric they'd
|
|
71
|
+
use, even informally.
|
|
72
|
+
|
|
73
|
+
### About the loop
|
|
74
|
+
|
|
75
|
+
11. If we could improve one thing about the agent in 4 hours, what
|
|
76
|
+
would move the needle the most for you?
|
|
77
|
+
12. Are there *prompt* changes you've wanted to try but haven't had
|
|
78
|
+
the loop to validate?
|
|
79
|
+
13. Anything you've explicitly tried that **didn't** work? (Saves us
|
|
80
|
+
suggesting it.)
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## Judge-design worksheet (45 min into the pairing)
|
|
85
|
+
|
|
86
|
+
The judge is the most under-discussed piece of an eval system. Most
|
|
87
|
+
projects fail at the judge, not the agent.
|
|
88
|
+
|
|
89
|
+
We start with a **strawman** — the 6 dimensions in our canonical
|
|
90
|
+
marketing-quality judge:
|
|
91
|
+
|
|
92
|
+
| Dim | What it measures |
|
|
93
|
+
|---|---|
|
|
94
|
+
| hook_strength | Opens with concrete user outcome, not category |
|
|
95
|
+
| voice_match | Reads human-written; no AI slop |
|
|
96
|
+
| cta_clarity | Next step unambiguous for the audience |
|
|
97
|
+
| factual_grounding | Only claims things the brief supports |
|
|
98
|
+
| surface_fit | Length + register correct for medium |
|
|
99
|
+
| audience_specificity | Vocabulary the audience actually responds to |
|
|
100
|
+
|
|
101
|
+
**Your job in this 45 min:** rip this apart. We expect:
|
|
102
|
+
|
|
103
|
+
- **2-3 of these are wrong for you.** Replace them.
|
|
104
|
+
- **2-3 dimensions are missing.** Add them. (E.g., "tone matches our
|
|
105
|
+
brand book" or "safety-critical claim has a citation" or "answer is
|
|
106
|
+
decisive — no hedging when the user wants a recommendation".)
|
|
107
|
+
- **Weights are wrong.** For your use case some dims matter 5x more.
|
|
108
|
+
|
|
109
|
+
The deliverable: a judge with 4-8 dimensions, each scored 0.0 - 1.0,
|
|
110
|
+
each unambiguous enough that two independent humans would score the
|
|
111
|
+
same artifact within 0.1.
|
|
112
|
+
|
|
113
|
+
If a dimension is squishy, throw it out. A noisy judge poisons the
|
|
114
|
+
loop.
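The "within 0.1" bar can be checked mechanically once two people have scored the same artifact. A minimal sketch, assuming each rater's scores are a plain name → score record; `squishyDimensions` is a hypothetical helper, not part of the library.

```typescript
type DimensionScores = Record<string, number>;

// Return the dimensions where two raters disagree by more than `tolerance`.
// A small epsilon keeps floating-point noise (e.g. 0.8 - 0.7) from
// flagging a gap that is exactly at the tolerance.
function squishyDimensions(
  a: DimensionScores,
  b: DimensionScores,
  tolerance = 0.1,
): string[] {
  const flagged: string[] = [];
  for (const [name, scoreA] of Object.entries(a)) {
    const scoreB = b[name];
    if (scoreB === undefined) continue; // only compare shared dimensions
    if (Math.abs(scoreA - scoreB) > tolerance + 1e-9) flagged.push(name);
  }
  return flagged;
}
```

Any dimension this returns is a candidate for sharper anchors or removal.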
|
|
115
|
+
|
|
116
|
+
---
|
|
117
|
+
|
|
118
|
+
## The 4-hour pairing agenda
|
|
119
|
+
|
|
120
|
+
### Hour 1 — Discovery + Dispatch wiring
|
|
121
|
+
|
|
122
|
+
| Time | What | Deliverable |
|
|
123
|
+
|---|---|---|
|
|
124
|
+
| 0:00 - 0:15 | Review discovery answers, align on scope | Shared doc with goals + constraints |
|
|
125
|
+
| 0:15 - 0:45 | Wire `Dispatch` around their agent — typically 1 function | Working `Dispatch<TScenario, TArtifact>` |
|
|
126
|
+
| 0:45 - 1:00 | Run 1-2 scenarios through `Dispatch` manually; see real artifacts | Confirmed wire shape |
|
|
127
|
+
|
|
128
|
+
### Hour 2 — Judge calibration
|
|
129
|
+
|
|
130
|
+
| Time | What | Deliverable |
|
|
131
|
+
|---|---|---|
|
|
132
|
+
| 1:00 - 1:45 | Walk through the strawman judge; redesign dimensions with the partner | Final `JudgeConfig` for their domain |
|
|
133
|
+
| 1:45 - 2:00 | Calibrate judge against the 2 manual outputs from Hour 1 | Confirmed judge gives same scores a human would |
|
|
134
|
+
|
|
135
|
+
### Hour 3 — First campaign + tuning
|
|
136
|
+
|
|
137
|
+
| Time | What | Deliverable |
|
|
138
|
+
|---|---|---|
|
|
139
|
+
| 2:00 - 2:30 | Define 8-15 scenarios with the partner (or use ours as a template) | Scenario set with train + holdout split |
|
|
140
|
+
| 2:30 - 3:00 | Run `runEval` for baseline; review per-scenario scores | Baseline score + identified failure modes |
|
|
141
|
+
|
|
142
|
+
### Hour 4 — Improvement loop + go/no-go
|
|
143
|
+
|
|
144
|
+
| Time | What | Deliverable |
|
|
145
|
+
|---|---|---|
|
|
146
|
+
| 3:00 - 3:30 | Configure `runImprovementLoop` with `gepaDriver` (3 generations, population 2) + `defaultProductionGate` | Improvement run completes |
|
|
147
|
+
| 3:30 - 3:50 | Walk the partner through the gate decision + lift per scenario | Report artifact |
|
|
148
|
+
| 3:50 - 4:00 | Capture: was the lift real? Would they ship the winner? Will they keep using the lib? | **Go/no-go signal for Phase D** |
|
|
149
|
+
|
|
150
|
+
If we're tracking ahead at any hour, use the slack to deepen — add a
|
|
151
|
+
red-team battery, swap the judge model, run more generations. If we're
|
|
152
|
+
behind, cut the scenario set to 6 and ship.
|
|
153
|
+
|
|
154
|
+
---
|
|
155
|
+
|
|
156
|
+
## Success criteria — what counts as Phase B passed
|
|
157
|
+
|
|
158
|
+
For us to greenlight Phase D (hosted orchestrator + metered billing),
|
|
159
|
+
we need ALL of:
|
|
160
|
+
|
|
161
|
+
1. **Real lift.** Held-out winner score > baseline by ≥ 0.05 composite
|
|
162
|
+
points (or the partner's chosen threshold). Not just train; held-out.
|
|
163
|
+
2. **Partner-validated lift.** The partner reads the winner output on
|
|
164
|
+
3+ held-out scenarios and confirms it's actually better.
|
|
165
|
+
3. **Integration time ≤ 1 day.** Discovery + wiring + judge took ≤ 4
|
|
166
|
+
hours for the pairing; partner could reach the same point solo in
|
|
167
|
+
≤ 1 day from the quickstart doc.
|
|
168
|
+
4. **Public commitment.** Partner agrees to a public reference (case
|
|
169
|
+
study / quote / logo) OR commits to running the LAND tier in their
|
|
170
|
+
own product within 2 weeks.
|
|
171
|
+
|
|
172
|
+
3-of-4 = soft pass (revisit Phase D scope but proceed). 4-of-4 = hard
|
|
173
|
+
pass (build Phase D). ≤ 2 = fail (back to substrate iteration).
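The scoring rule above reduces to counting how many of the four criteria were met. A sketch with hypothetical names (`PhaseBCriteria`, `phaseBVerdict`, `meetsRealLift`); the 0.05 default mirrors criterion 1 and can be swapped for the partner's chosen threshold.

```typescript
interface PhaseBCriteria {
  realLift: boolean;
  partnerValidated: boolean;
  integrationWithinOneDay: boolean;
  publicCommitment: boolean;
}

type Verdict = "hard-pass" | "soft-pass" | "fail";

// Criterion 1: the held-out winner beats the held-out baseline by at
// least `threshold`. The epsilon absorbs floating-point rounding.
function meetsRealLift(holdoutWinner: number, holdoutBaseline: number, threshold = 0.05): boolean {
  return holdoutWinner - holdoutBaseline >= threshold - 1e-9;
}

// 4-of-4 is a hard pass, 3-of-4 a soft pass, 2 or fewer a fail.
function phaseBVerdict(c: PhaseBCriteria): Verdict {
  const met = [c.realLift, c.partnerValidated, c.integrationWithinOneDay, c.publicCommitment]
    .filter(Boolean).length;
  if (met === 4) return "hard-pass";
  if (met === 3) return "soft-pass";
  return "fail";
}
```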
|
|
174
|
+
|
|
175
|
+
---
|
|
176
|
+
|
|
177
|
+
## What we don't ask for
|
|
178
|
+
|
|
179
|
+
- Your code. Wire `Dispatch` around your existing API; we never see the
|
|
180
|
+
source.
|
|
181
|
+
- Your customer data. Use synthetic scenarios or anonymized real ones —
|
|
182
|
+
whichever you prefer.
|
|
183
|
+
- Your model keys. You bring your own; if you want, route through Tangle
|
|
184
|
+
Router and we never see the prompts either.
|
|
185
|
+
- Exclusivity, commitment, or contract. Walk away whenever.
|
|
186
|
+
|
|
187
|
+
The point is to learn if the substrate works for someone we didn't
|
|
188
|
+
build it for. That's it.
|
|
@@ -0,0 +1,176 @@
|
|
|
1
|
+
# Phase-B runbook (internal)
|
|
2
|
+
|
|
3
|
+
How we drive a design-partner pairing. Goes alongside
|
|
4
|
+
[`phase-b-pairing-kit.md`](./phase-b-pairing-kit.md) (the partner-facing
|
|
5
|
+
materials) — this file is for us.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Before the pairing
|
|
10
|
+
|
|
11
|
+
- **24-48h prior:** send discovery questions from
|
|
12
|
+
[`phase-b-pairing-kit.md`](./phase-b-pairing-kit.md). Don't run the
|
|
13
|
+
pairing without answers in hand. The pairing fails when we discover
|
|
14
|
+
the partner's quality bar live; we don't have time to interview AND
|
|
15
|
+
build in 4 hours.
|
|
16
|
+
- **48h prior:** run the canonical demo (`pnpm tsx
|
|
17
|
+
examples/marketing-agent-canonical/index.ts`) end-to-end against the
|
|
18
|
+
partner's preferred model. Confirms the substrate + their LLM tier
|
|
19
|
+
compose. If it errors, fix the substrate before the pairing.
|
|
20
|
+
- **24h prior:** mirror the partner's stack locally. If they're on
|
|
21
|
+
Cloudflare Workers, run a Worker. On LangChain, install `@langchain/*`.
|
|
22
|
+
Don't debug their tooling on the call.
|
|
23
|
+
- **1h prior:** open the pairing kit, the agent-eval repo, the partner's
|
|
24
|
+
agent code/endpoint, and a shared doc; have a screenshare ready.
|
|
25
|
+
|
|
26
|
+
## During the pairing
|
|
27
|
+
|
|
28
|
+
### Driving principles
|
|
29
|
+
|
|
30
|
+
- **Talk less, ship more.** The partner is paying with their time and
|
|
31
|
+
attention; every minute we talk we aren't shipping their lift.
|
|
32
|
+
- **They write the judge.** We start with our strawman so they have
|
|
33
|
+
something to react to, but the judge that ends up running is theirs.
|
|
34
|
+
This is the most-discussed seam — they should own it.
|
|
35
|
+
- **No invented features.** Don't promise a capability
("we have a hosted ingest for this") unless it actually exists.
|
|
37
|
+
Phase B is honesty's purest test.
|
|
38
|
+
- **Capture verbatim.** Write down their exact words on what's broken /
|
|
39
|
+
what would change their mind. The wedge-gate evidence is qualitative
|
|
40
|
+
too.
|
|
41
|
+
|
|
42
|
+
### When to escalate to Drew
|
|
43
|
+
|
|
44
|
+
- Partner wants something Phase D would have (hosted dashboard, multi-
|
|
45
|
+
tenant, billing). **Escalate same day** — this is the GTM signal we're
|
|
46
|
+
hunting for; Drew should hear it directly.
|
|
47
|
+
- Partner is the wrong fit (technical or business) and the pairing
|
|
48
|
+
would burn both sides' time. **Pause the pairing**, debrief with Drew,
|
|
49
|
+
reschedule with a better-fit partner.
|
|
50
|
+
- Substrate breaks in a way that requires a published bump. **Pause
|
|
51
|
+
the pairing**, ship the fix in a focused PR, resume.
|
|
52
|
+
|
|
53
|
+
### What to capture for the wedge gate
|
|
54
|
+
|
|
55
|
+
Per [`docs/design/external-agent-wedge.md`](./design/external-agent-wedge.md),
|
|
56
|
+
the gate decision hinges on Phase B evidence. We capture:
|
|
57
|
+
|
|
58
|
+
1. **Quantitative lift** — held-out winner composite vs baseline, per
|
|
59
|
+
scenario + overall. Auto-generated in the report artifact by the
|
|
60
|
+
canonical demo (`.phase-b-runs/<ts>/phase-b-report.md`).
|
|
61
|
+
2. **Qualitative partner-validation** — partner read 3+ winner outputs
|
|
62
|
+
and confirmed they're better. Capture as a 1-paragraph quote.
|
|
63
|
+
3. **Integration friction** — minutes spent on each pairing phase. Were
|
|
64
|
+
any > 2x estimated? What broke?
|
|
65
|
+
4. **Judge-design surprise** — which dimensions the partner added or
|
|
66
|
+
killed vs our strawman. Strong signal about what the substrate's
|
|
67
|
+
default judge templates are missing for adjacent domains.
|
|
68
|
+
5. **Soft commitments** — would they reference us? Would they
|
|
69
|
+
self-serve from the quickstart doc? Would they pay for hosted?
|
|
70
|
+
|
|
71
|
+
Capture into a single `phase-b-debrief.md` per partner. We don't
|
|
72
|
+
publish these; they feed the next substrate iteration + the wedge
|
|
73
|
+
go/no-go.
|
|
74
|
+
|
|
75
|
+
---
|
|
76
|
+
|
|
77
|
+
## Failure modes — what we do NOT do
|
|
78
|
+
|
|
79
|
+
### "We'll just optimize on the train set"
|
|
80
|
+
|
|
81
|
+
Hard no. The held-out gate is the entire point. A win that doesn't
|
|
82
|
+
generalize is worse than no win — it's evidence that the substrate
|
|
83
|
+
overfits, which is the failure mode the wedge tier rewards.
|
|
84
|
+
|
|
85
|
+
If the holdout lift is < threshold but train looks great:
|
|
86
|
+
|
|
87
|
+
1. Show the partner the gap. Explain what overfitting means here.
|
|
88
|
+
2. Try raising `maxGenerations` to 5 (gives GEPA more search budget).
|
|
89
|
+
3. Try widening `populationSize` to 3 (more diverse mutations per gen).
|
|
90
|
+
4. If still no lift on holdout: **report the result honestly**. A
|
|
91
|
+
negative finding is real evidence for us too — tells us this surface
|
|
92
|
+
isn't amenable to prompt-only mutation, and the partner needs Phase
|
|
93
|
+
C (code-tier optimization) or a different approach.
|
|
94
|
+
|
|
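The decision above can be sketched as a small gate function. This is a minimal illustration, not part of the agent-eval API: the `SplitScores` shape, the `holdoutGate` name, and the verdict labels are all hypothetical, and the threshold is whatever the pairing agreed on.

```typescript
// Hypothetical shapes for illustration only; not exported by agent-eval.
interface SplitScores {
  baseline: number; // composite score of the baseline prompt on this split
  winner: number;   // composite score of the winning candidate on this split
}

type GateVerdict = "pass" | "overfit" | "no-lift";

// Classify a run by comparing lift on the train split with lift on the holdout split.
function holdoutGate(
  train: SplitScores,
  holdout: SplitScores,
  threshold: number,
): GateVerdict {
  const trainLift = train.winner - train.baseline;
  const holdoutLift = holdout.winner - holdout.baseline;

  // The holdout split is the only one that can produce a pass.
  if (holdoutLift >= threshold) return "pass";

  // Train looks great but holdout does not: the gap to show the partner.
  return trainLift >= threshold ? "overfit" : "no-lift";
}
```

An `"overfit"` verdict corresponds to steps 1 to 3 of the list; if it persists after widening the search, step 4 applies.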
95
|
+
### "The judge is too noisy"
|
|
96
|
+
|
|
97
|
+
A judge whose two-run variance > 0.1 on the same artifact is broken.
|
|
98
|
+
Fixes, in order:
|
|
99
|
+
|
|
100
|
+
1. Lower temperature to 0.0 (the canonical judge uses 0.2, which is
|
|
101
|
+
already low).
|
|
102
|
+
2. Use a stronger model than the agent. The default is the same model;
|
|
103
|
+
bump the judge to GPT-5.5 / Claude Opus.
|
|
104
|
+
3. Add anchors to each dimension ("0.0 = X, 0.5 = Y, 1.0 = Z").
|
|
105
|
+
4. If still noisy: collapse to fewer, simpler dimensions. 3 unambiguous
|
|
106
|
+
dimensions beat 6 squishy ones.
|
|
107
|
+
|
|
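A quick way to apply the 0.1 rule is to score the same artifact twice and compare per-dimension results. The sketch below is an assumption-laden illustration: it reads "two-run variance" as the absolute difference between the two scores, and `DimensionScores` and `noisyDimensions` are hypothetical names, not agent-eval exports.

```typescript
// Hypothetical shape: one judge run's scores, keyed by dimension name, each in [0, 1].
type DimensionScores = Record<string, number>;

// Return the dimensions whose two scores on the same artifact differ by more than maxSpread.
// Assumption: "two-run variance" is interpreted as the absolute difference between runs.
function noisyDimensions(
  runA: DimensionScores,
  runB: DimensionScores,
  maxSpread = 0.1,
): string[] {
  return Object.keys(runA).filter(
    (dim) => dim in runB && Math.abs(runA[dim] - runB[dim]) > maxSpread,
  );
}
```

Dimensions that keep showing up in the result after fixes 1 to 3 are the candidates to merge or drop under fix 4.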
108
|
+
### "We can't decide what the partner's judge should be"
|
|
109
|
+
|
|
110
|
+
Then we don't have Phase B. The judge IS the partner's quality bar.
|
|
111
|
+
If they can't articulate it in 45 minutes of pairing, we're in the
|
|
112
|
+
wrong pairing — they need to do the interview-themselves work first.
|
|
113
|
+
|
|
114
|
+
**Pause the pairing, send the discovery doc again, regroup in a week.**
|
|
115
|
+
|
|
116
|
+
### "Their agent is slow / expensive"
|
|
117
|
+
|
|
118
|
+
Set `maxConcurrency: 1` and reduce scenarios to 6. Cost scales linearly;
|
|
119
|
+
time scales as `(scenarios × reps × generations × population) /
|
|
120
|
+
concurrency`. Tune until the loop completes in ≤ 30 min.
|
|
121
|
+
|
|
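The scaling rule above can be turned into a rough pre-pairing estimate. This is a sketch under stated assumptions: `LoopBudget`, `totalCalls`, `estimateMinutes`, and `estimateCostUsd` are hypothetical helpers, per-call latency and price are inputs you measure yourself, and only the calls counted by the formula are included.

```typescript
// Hypothetical budget shape mirroring the terms of the formula in the text.
interface LoopBudget {
  scenarios: number;
  reps: number;
  generations: number;
  population: number;
  concurrency: number;
}

// Number of agent calls implied by the formula's numerator.
function totalCalls(b: LoopBudget): number {
  return b.scenarios * b.reps * b.generations * b.population;
}

// Wall-clock estimate: calls divided by concurrency, times measured seconds per call.
function estimateMinutes(b: LoopBudget, secondsPerCall: number): number {
  return ((totalCalls(b) / b.concurrency) * secondsPerCall) / 60;
}

// Cost estimate: linear in the number of calls, independent of concurrency.
function estimateCostUsd(b: LoopBudget, usdPerCall: number): number {
  return totalCalls(b) * usdPerCall;
}

// Example: 6 scenarios, 2 reps, 3 generations, population 2, one call at a time.
const budget: LoopBudget = {
  scenarios: 6,
  reps: 2,
  generations: 3,
  population: 2,
  concurrency: 1,
};
console.log(totalCalls(budget), estimateMinutes(budget, 20), estimateCostUsd(budget, 0.25));
```

With these example numbers the loop makes 72 calls; at an assumed 20 seconds per call that is 24 minutes, which fits the 30-minute target.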
122
|
+
If the per-call cost is > $1, talk to Drew before the pairing — we
|
|
123
|
+
might want to subsidize the partner's first run.
|
|
124
|
+
|
|
125
|
+
### "They want to share their secrets through Tangle Router"
|
|
126
|
+
|
|
127
|
+
Fine — `OPENAI_BASE_URL=https://router.tangle.tools/v1` works. Make
|
|
128
|
+
sure they understand: every call routes through us; the prompts and
|
|
129
|
+
responses are visible to whatever observability we have on the router.
|
|
130
|
+
If they want zero data leaving their network, point at their own
|
|
131
|
+
endpoint, not Tangle Router.
|
|
132
|
+
|
|
133
|
+
---
|
|
134
|
+
|
|
135
|
+
## After the pairing
|
|
136
|
+
|
|
137
|
+
### Same day
|
|
138
|
+
|
|
139
|
+
- Save the `phase-b-report.md` artifact + the partner's debrief notes
|
|
140
|
+
to `~/company/design-partners/<partner>/<date>/`.
|
|
141
|
+
- Send the partner a thank-you with the winner artifact + the next-
|
|
142
|
+
steps doc. Whether or not we proceed to Phase D, leave them with
|
|
143
|
+
something concrete they can ship in their product.
|
|
144
|
+
- Slack Drew the verdict against the [success criteria](./phase-b-pairing-kit.md#success-criteria--what-counts-as-phase-b-passed).
|
|
145
|
+
|
|
146
|
+
### Within a week
|
|
147
|
+
|
|
148
|
+
- If Phase B passed: open the Phase D RFC. Reuse the partner-validated
|
|
149
|
+
judge dimensions + scenarios as the spec for what the hosted tier
|
|
150
|
+
needs to support out of the box.
|
|
151
|
+
- If Phase B failed: substrate iteration ticket(s). Specific gaps the
|
|
152
|
+
pairing surfaced (judge dim defaults, doc clarity, missing helper).
|
|
153
|
+
- Either way: update the wedge doc (`docs/design/external-agent-wedge.md`)
|
|
154
|
+
with the partner name redacted + the qualitative signal.
|
|
155
|
+
|
|
156
|
+
### Within a month (regardless of go/no-go)
|
|
157
|
+
|
|
158
|
+
- Follow up with the partner. If they're still using the lib, capture a
|
|
159
|
+
metric. If they stopped, find out why. Both data points feed product.
|
|
160
|
+
|
|
161
|
+
---
|
|
162
|
+
|
|
163
|
+
## The canonical demo as a forcing function
|
|
164
|
+
|
|
165
|
+
`examples/marketing-agent-canonical/` is the demo we open the pairing
|
|
166
|
+
with. It does three things at once:
|
|
167
|
+
|
|
168
|
+
1. **Proves the substrate works** — they see a real lift on a real-
|
|
169
|
+
feeling agent before we touch their code.
|
|
170
|
+
2. **Sets the bar for the judge conversation** — they react to concrete
|
|
171
|
+
dimensions, not abstract questions.
|
|
172
|
+
3. **Trains us** — running the canonical demo before the pairing
|
|
173
|
+
surfaces substrate bugs on the partner's preferred model BEFORE the
|
|
174
|
+
partner is watching. We hit those bugs first.
|
|
175
|
+
|
|
176
|
+
Run the canonical demo before every Phase-B pairing. It's not optional.
|