@tangle-network/agent-eval 0.45.0 → 0.47.0

@@ -0,0 +1,125 @@
1
+ # Phase D RFC — hosted-tier substrate
2
+
3
+ Pinned scope decisions for the EXPAND tier. What we built, what we
4
+ deliberately did NOT, and what's gated on Phase B evidence.
5
+
6
+ ---
7
+
8
+ ## What's in this version
9
+
10
+ **Wire-format substrate (shipped):**
11
+
12
+ 1. `@tangle-network/agent-eval/hosted` — public client + types for shipping
13
+ eval-run events + trace spans to any orchestrator that speaks the wire
14
+ format.
15
+ 2. `docs/hosted-ingest-spec.md` — semver-committed wire spec
16
+ (`HostedWireVersion = "2026-05-26.v1"`).
17
+ 3. `examples/hosted-ingest-server/` — minimal hono-based reference
18
+ receiver (~200 LOC). Executable spec. Stays as the reference even
19
+ after the production orchestrator ships.
20
+ 4. `selfImprove({ hostedTenant })` opt-in — when set, the substrate
21
+ POSTs the final eval-run event to the configured endpoint. Failures
22
+ are logged but never fail the loop (LAND tier never blocks on
23
+ EXPAND-tier infra).
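The "logged but never fail the loop" behavior in item 4 can be sketched as a best-effort wrapper. This is a minimal illustration under stated assumptions, not the package's implementation: `postBestEffort`, `Send`, and the `log` parameter are hypothetical names that do not appear in the package.

```typescript
// Hypothetical helper: run a hosted-ingest send without letting it fail the caller.
type Send = () => Promise<void>

async function postBestEffort(
  send: Send,
  log: (message: string) => void = console.warn,
): Promise<boolean> {
  try {
    await send()
    return true
  } catch (err) {
    // Swallow the failure: the LAND-tier loop must not block on EXPAND-tier infra.
    const reason = err instanceof Error ? err.message : String(err)
    log(`hosted ingest failed (ignored): ${reason}`)
    return false
  }
}
```

The boolean return value lets a caller record whether the event was delivered without ever having to catch an exception.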
24
+
25
+ **Production orchestrator (started):**
26
+
27
+ 5. HTTP ingest service in `@tangle-network/monorepo` accepting the wire
28
+ format. Lives under the orchestrator app. Tenant auth + isolation
29
+ + persistent storage + read endpoints. *Started this session — see
30
+ the @tangle-network/agent-dev-container PR. Not feature-complete:
31
+ tenant CRUD + adversarial isolation tests pending.*
32
+
33
+ ## What's deliberately deferred
34
+
35
+ The wedge doc gates these on Phase B evidence — partner-validated
36
+ signal about what the hosted product actually needs to do. Shipping
37
+ them without that signal risks building the wrong thing.
38
+
39
+ | Deferred until Phase B passes | Why |
40
+ |---|---|
41
+ | **Metered billing wire-up (Stripe + cost-ledger)** | The billable units (per-eval-run, per-ingested-MB, per-seat) depend on actual partner consumption patterns. Picking dimensions in a vacuum locks us into wrong pricing. |
42
+ | **Multi-tenant dashboard UX** | Partners' first dashboard request defines the right default views. We have a stub list-runs page; the rest is post-signal. |
43
+ | **Webhook callbacks per tenant** | The events partners want pushed (gate-decided, cost-threshold, regression-alert) are partner-shaped. Add them when a partner asks. |
44
+ | **Cross-tenant aggregation / benchmarking** | This is the "Datadog for agents" tier — explicit roadmap, requires user volume we don't have. |
45
+ | **Sandbox-cost roll-up into hosted billing** | Cross-product billing integration requires PLATFORM-tier partners. Out of scope until at least one such partner exists. |
46
+ | **Trace UI** | OTel-shape spans store fine. Visualization comes after partners ask. Phoenix / Jaeger / any OTLP-compatible viewer covers it in the interim. |
47
+ | **SOC 2 / compliance audit work** | Required for enterprise; not required for design partners. |
48
+
49
+ ## Architecture decisions locked
50
+
51
+ These are committed and won't change without a major-version wire bump
52
+ or a documented migration:
53
+
54
+ 1. **Wire format is JSON over HTTP**, not gRPC. Reasons: works in
55
+ browsers + edge + node + curl; OTel-compatible at the trace stream
56
+ level; lowest possible barrier to a self-hosted orchestrator.
57
+ 2. **Tenant auth is bearer-token + tenant-id header**, not OIDC /
58
+ service-account / mutual-TLS. Reasons: simplest thing that's
59
+ actually secure with proper key handling; defers complex IAM until
60
+ enterprise demand.
61
+ 3. **Idempotency via header, not transactional API**. Servers MUST
62
+ dedupe by `(tenantId, Idempotency-Key)` for 24h. Simpler than
63
+ making clients commit transactions.
64
+ 4. **Eval-runs and traces are SEPARATE streams** with pivot keys
65
+ (`tangle.runId` etc.) on spans. Reasons: traces can be best-effort
66
+ (lossy) without corrupting eval-run semantics; orchestrators can
67
+ prioritize eval-run durability without forcing trace durability.
68
+ 5. **Wire version is a date.v-N string**, not semver. Reasons: dates
69
+ communicate "when was this contract frozen"; v-N captures
70
+ incremental additive revisions between dates.
71
+
72
+ ## Open questions for Phase B to answer
73
+
74
+ When the design-partner pairing happens, capture answers to these
75
+ explicitly:
76
+
77
+ 1. **Surface confidentiality**: do partners want the verbatim surface
78
+ (system prompt) shipped, or just the hash? Today the wire format
79
+ has `surface?` as optional; whichever partners prefer becomes the default we ship.
80
+ 2. **Trace sampling**: at what cells-per-second do trace spans become
81
+ noise? What's the right default sampling rate?
82
+ 3. **Cost attribution granularity**: per cell? per generation? per
83
+ run? Per judge dimension? Partner needs determine what we surface
84
+ in billing reports.
85
+ 4. **Replay**: do partners want to re-run an old eval-run from the
86
+ stored data? That would require us to store more than the summary —
87
+ actual artifacts + prompts. Storage cost implication.
88
+ 5. **PII / sensitive scenarios**: how do partners want to handle
89
+ scenarios containing user data? Encryption-at-rest is table stakes;
90
+ redaction-at-ingest may be required for some.
91
+
92
+ The partner pairing kit (`docs/phase-b-pairing-kit.md`) has discovery
93
+ questions that probe these.
94
+
95
+ ## Non-goals (explicit)
96
+
97
+ This RFC does NOT plan for:
98
+
99
+ - Replacing Langfuse / Phoenix / Arize. We INGEST OTel; we don't
100
+ build a generic trace viewer. The dashboard is eval-run-shaped, not
101
+ trace-shaped.
102
+ - Becoming a model gateway. Tangle Router exists; the hosted
103
+ orchestrator routes to Tangle Router by default but doesn't
104
+ duplicate its function.
105
+ - Becoming an LLM-call CDN. Caching is the consumer's job (their
106
+ agent code, their HTTP client). We don't intercept LLM calls.
107
+ - Building an "agents IDE." Substrate, not surface.
108
+
109
+ ## Migration path (post Phase B)
110
+
111
+ When Phase B passes the gate, the production orchestrator finishes:
112
+
113
+ 1. Replace in-memory store with Postgres (tenant data) + S3 (large
114
+ artifacts) OR Cloudflare D1 + R2 (Workers-native).
115
+ 2. Wire metered events to Stripe + the cost-ledger.
116
+ 3. Tenant CRUD UI + onboarding flow.
117
+ 4. Multi-tenant dashboard MVP (list runs, drill into one, diff
118
+ generations, view shipped prompt).
119
+ 5. Adversarial tenant-isolation test battery in CI.
120
+ 6. Webhooks + observability for the orchestrator itself.
121
+
122
+ Estimated effort post-Phase-B: ~1 week of focused work for one engineer.
123
+ This is fast precisely BECAUSE the wire format is locked and the
124
+ reference receiver exists — the production server is a different
125
+ implementation of the same contract.
@@ -0,0 +1,204 @@
1
+ # Hosted-ingest wire spec — `2026-05-26.v1`
2
+
3
+ The schema **every** orchestrator (ours, partners' self-hosted ones,
4
+ any future open implementation) must accept. Frozen under semver-style rules:
5
+ **additive changes never break existing clients. Breaking changes mean a major
6
+ bump and a new dated `HostedWireVersion` literal.**
7
+
8
+ This is the contract that decouples the LAND-tier substrate
9
+ (`@tangle-network/agent-eval`) from the EXPAND-tier hosted product. A
10
+ foreign builder can:
11
+
12
+ - Use our orchestrator at `https://orchestrator.tangle.tools/v1`.
13
+ - Self-host the reference receiver from
14
+ `examples/hosted-ingest-server/`.
15
+ - Implement their own orchestrator against this spec.
16
+
17
+ All three are wire-compatible by definition.
18
+
19
+ ---
20
+
21
+ ## Transport
22
+
23
+ Two endpoints, both `POST`, both JSON. Headers on every request:
24
+
25
+ | Header | Value |
26
+ |---|---|
27
+ | `Authorization` | `Bearer <tenant-key>` (the orchestrator issues this) |
28
+ | `Content-Type` | `application/json` |
29
+ | `X-Tangle-Tenant-Id` | The tenant's stable id (the orchestrator's primary key for the tenant) |
30
+ | `X-Tangle-Wire-Version` | `2026-05-26.v1` (this spec) |
31
+ | `Idempotency-Key` (optional) | UUID; servers MUST deduplicate requests that repeat a key |
32
+
33
+ Responses are JSON of shape `{ accepted: number, rejected: Array<{ index: number, reason: string }> }`. The
34
+ server SHOULD return 202 (accepted, async) or 200 (accepted, synchronous);
35
+ both are equivalent for the wire's purposes.
36
+
37
+ ### `POST /v1/ingest/eval-runs`
38
+
39
+ Body: `IngestEvalRunsRequest = { wireVersion, events: EvalRunEvent[] }`.
40
+
41
+ One `runId` per logical eval-run; generations stream in
42
+ incrementally via repeated calls with the same `runId`. The
43
+ orchestrator deduplicates by `(tenantId, runId, generation.index)`.
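The transport rules above can be sketched as a small request builder. This is an illustration, not the `/hosted` client: `buildEvalRunsRequest`, `IngestConfig`, and `BuiltRequest` are hypothetical names, and `endpoint` is assumed to be a bare origin such as the reference receiver's `http://localhost:8080`.

```typescript
// Wire version literal taken from this spec; everything else is illustrative.
const WIRE_VERSION = '2026-05-26.v1'

interface IngestConfig {
  endpoint: string // assumed to be an origin without a path, e.g. http://localhost:8080
  tenantKey: string
  tenantId: string
}

interface BuiltRequest {
  url: string
  init: { method: 'POST'; headers: Record<string, string>; body: string }
}

// Hypothetical helper: assemble the POST for /v1/ingest/eval-runs.
function buildEvalRunsRequest(
  config: IngestConfig,
  events: unknown[],
  idempotencyKey?: string,
): BuiltRequest {
  const headers: Record<string, string> = {
    Authorization: `Bearer ${config.tenantKey}`,
    'Content-Type': 'application/json',
    'X-Tangle-Tenant-Id': config.tenantId,
    'X-Tangle-Wire-Version': WIRE_VERSION,
  }
  // The idempotency header is optional, so only send it when a key was supplied.
  if (idempotencyKey !== undefined) headers['Idempotency-Key'] = idempotencyKey
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${config.endpoint.replace(/\/+$/, '')}/v1/ingest/eval-runs`,
    init: {
      method: 'POST',
      headers,
      body: JSON.stringify({ wireVersion: WIRE_VERSION, events }),
    },
  }
}
```

A caller would pass the result to `fetch(request.url, request.init)` and treat both 200 and 202 as success.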
44
+
45
+ ### `POST /v1/ingest/traces`
46
+
47
+ Body: `IngestTracesRequest = { wireVersion, spans: TraceSpanEvent[] }`.
48
+
49
+ Standard OTLP-shaped spans with a few additional attributes
50
+ (`tangle.runId`, `tangle.generation`, `tangle.cellId`,
51
+ `tangle.scenarioId`) so the orchestrator can pivot between the
52
+ eval-run stream and the underlying execution trace.
53
+
54
+ ---
55
+
56
+ ## `EvalRunEvent`
57
+
58
+ ```ts
59
+ interface EvalRunEvent {
60
+ runId: string // stable; same id across all generations of one run
61
+ runDir: string // logical run directory (mem://... or filesystem path)
62
+ timestamp: string // ISO-8601
63
+ status: // lifecycle stage this event represents
64
+ | 'started'
65
+ | 'baseline-complete'
66
+ | 'generation-complete'
67
+ | 'gate-decided'
68
+ | 'finished'
69
+ | 'errored'
70
+ labels: Record<string, string> // free-form (env, branch, model id, etc.)
71
+ baseline?: EvalRunGenerationSnapshot // present when status >= baseline-complete
72
+ generations: EvalRunGenerationSnapshot[]
73
+ gateDecision?: // present when status >= gate-decided
74
+ | 'ship' | 'hold' | 'need_more_work' | 'model_ceiling' | 'arch_ceiling'
75
+ holdoutLift?: number // winner-on-holdout - baseline-on-holdout
76
+ totalCostUsd: number
77
+ totalDurationMs: number
78
+ errorMessage?: string // present when status === 'errored'
79
+ }
80
+ ```
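The "present when" comments in `EvalRunEvent` imply a few invariants a receiver could check. The sketch below is one possible reading, not part of the spec: `missingFields`, `EventPresence`, and the stage ordering are assumptions made for illustration, with `errored` treated as outside the ordered lifecycle.

```typescript
type EvalRunStatus =
  | 'started'
  | 'baseline-complete'
  | 'generation-complete'
  | 'gate-decided'
  | 'finished'
  | 'errored'

// Assumed ordering of the non-error lifecycle stages.
const STAGE_ORDER: EvalRunStatus[] = [
  'started',
  'baseline-complete',
  'generation-complete',
  'gate-decided',
  'finished',
]

// Only the fields whose presence depends on status.
interface EventPresence {
  status: EvalRunStatus
  baseline?: unknown
  gateDecision?: string
  errorMessage?: string
}

// Hypothetical check: list the conditional fields that the status implies but the event lacks.
function missingFields(event: EventPresence): string[] {
  const missing: string[] = []
  if (event.status === 'errored') {
    if (event.errorMessage === undefined) missing.push('errorMessage')
    return missing
  }
  const stage = STAGE_ORDER.indexOf(event.status)
  if (stage >= STAGE_ORDER.indexOf('baseline-complete') && event.baseline === undefined) {
    missing.push('baseline')
  }
  if (stage >= STAGE_ORDER.indexOf('gate-decided') && event.gateDecision === undefined) {
    missing.push('gateDecision')
  }
  return missing
}
```

A receiver could report any non-empty result in the `rejected` array of its response.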
81
+
82
+ ## `EvalRunGenerationSnapshot`
83
+
84
+ ```ts
85
+ interface EvalRunGenerationSnapshot {
86
+ index: number // 0 is baseline; 1..N are improvement generations
87
+ surfaceHash: string // stable hash of the candidate surface (pivot key)
88
+ surface?: MutableSurface // omitted when the consumer prefers not to ship it (e.g. to avoid PII)
89
+ cells: EvalRunCellScore[]
90
+ compositeMean: number
91
+ costUsd: number
92
+ durationMs: number
93
+ }
94
+ ```
95
+
96
+ ## `EvalRunCellScore`
97
+
98
+ ```ts
99
+ interface EvalRunCellScore {
100
+ scenarioId: string
101
+ rep: number // 0 for the default; > 0 when reps > 1
102
+ compositeMean: number // composite across all judges + dimensions
103
+ dimensions: Record< // outer key = judge name; inner = dimension name → score
104
+ string,
105
+ Record<string, number>
106
+ >
107
+ errorMessage?: string // present when the dispatch threw
108
+ }
109
+ ```
110
+
111
+ ## `TraceSpanEvent`
112
+
113
+ ```ts
114
+ interface TraceSpanEvent {
115
+ // Standard OTel
116
+ traceId: string
117
+ spanId: string
118
+ parentSpanId?: string
119
+ name: string
120
+ startTimeUnixNano: number
121
+ endTimeUnixNano: number
122
+ attributes: Record<string, string | number | boolean>
123
+ events?: Array<{ timeUnixNano: number; name: string; attributes?: Record<string, string | number | boolean> }>
124
+ status?: { code: 'OK' | 'ERROR' | 'UNSET'; message?: string }
125
+
126
+ // Tangle additions (all optional) for pivoting
127
+ 'tangle.runId'?: string
128
+ 'tangle.generation'?: number
129
+ 'tangle.cellId'?: string
130
+ 'tangle.scenarioId'?: string
131
+ }
132
+ ```
133
+
134
+ ---
135
+
136
+ ## Server requirements
137
+
138
+ Any orchestrator implementing this spec MUST:
139
+
140
+ 1. **Validate auth**: reject without `Authorization` header (401), with a
141
+ mismatched bearer token (401), or without a recognized `X-Tangle-Tenant-Id`
142
+ (404).
143
+ 2. **Validate wire version**: reject incompatible wire versions (400 with
144
+ a clear error message). The date component of the version string is the breaking-change axis.
145
+ 3. **Validate tenant isolation**: queries with `tenantId` X never return
146
+ data tagged with `tenantId` Y. Test this adversarially.
147
+ 4. **Honor idempotency**: when an `Idempotency-Key` matches a prior
148
+ request from the same tenant in the last 24h, return the same response
149
+ without double-processing.
150
+ 5. **Persist eval-runs durably**: at least the event + cell scores must
151
+ survive an orchestrator restart. Trace spans MAY be best-effort.
152
+ 6. **Provide read access**: GET endpoints for the tenant to list + fetch
153
+ their own runs. Wire format for reads is NOT part of this spec — each
154
+ orchestrator can pick its own (REST + JSON, gRPC, GraphQL).
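Requirement 4 can be sketched as a small in-memory cache keyed by tenant and idempotency key. This is an illustration under stated assumptions, not the reference receiver's code: `IdempotencyCache` is a hypothetical name, and a production server would need a store that survives restarts.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000

interface IngestResponse {
  accepted: number
  rejected: Array<{ index: number; reason: string }>
}

// Hypothetical in-memory store for replaying responses to repeated requests.
class IdempotencyCache {
  private readonly entries = new Map<string, { response: IngestResponse; storedAt: number }>()
  private readonly ttlMs: number

  constructor(ttlMs: number = DAY_MS) {
    this.ttlMs = ttlMs
  }

  // JSON-encode the pair so ids containing separator characters cannot collide.
  private key(tenantId: string, idempotencyKey: string): string {
    return JSON.stringify([tenantId, idempotencyKey])
  }

  // Returns the stored response, or undefined when absent or older than the TTL.
  get(tenantId: string, idempotencyKey: string, now: number = Date.now()): IngestResponse | undefined {
    const k = this.key(tenantId, idempotencyKey)
    const entry = this.entries.get(k)
    if (entry === undefined) return undefined
    if (now - entry.storedAt >= this.ttlMs) {
      this.entries.delete(k)
      return undefined
    }
    return entry.response
  }

  set(tenantId: string, idempotencyKey: string, response: IngestResponse, now: number = Date.now()): void {
    this.entries.set(this.key(tenantId, idempotencyKey), { response, storedAt: now })
  }
}
```

Because the tenant id is part of the key, the same `Idempotency-Key` sent by two tenants never resolves to the same entry, which also keeps this path consistent with requirement 3.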
155
+
156
+ Servers SHOULD also:
157
+
158
+ - Provide a webhook callback per tenant for `gate-decided` events.
159
+ - Provide a billable-events emitter (Stripe meter / equivalent) per ingest
160
+ call so consumption can be metered.
161
+ - Provide a dashboard or API to view + diff per-scenario lifts over time.
162
+
163
+ ---
164
+
165
+ ## Reference implementation
166
+
167
+ `examples/hosted-ingest-server/` — a minimal hono-based receiver. ~200
168
+ LOC. Validates auth, accepts ingest, stores in memory, exposes a
169
+ read endpoint. Runs anywhere Node runs.
170
+
171
+ ```sh
172
+ TENANT_KEY=dev-token TENANT_ID=acme pnpm tsx examples/hosted-ingest-server/server.ts
173
+ ```
174
+
175
+ In another terminal:
176
+
177
+ ```sh
178
+ HOSTED_ENDPOINT=http://localhost:8080 \
179
+ HOSTED_TENANT_KEY=dev-token \
180
+ HOSTED_TENANT_ID=acme \
181
+ pnpm tsx examples/foreign-agent-quickstart/index.ts
182
+ ```
183
+
184
+ The quickstart's eval-run gets POSTed to the reference receiver; the
185
+ receiver's `GET /v1/runs` lists it back.
186
+
187
+ ---
188
+
189
+ ## Versioning
190
+
191
+ `HostedWireVersion` is `"2026-05-26.v1"`.
192
+
193
+ - Adding an optional field → no version change.
194
+ - Adding a new endpoint or new event type → minor wire bump
195
+ (`2026-05-26.v2`).
196
+ - Changing the shape of an existing field, removing a field, or
197
+ changing semantics of an existing field → major wire bump
198
+ (`2026-11-XX.v1`); a server may accept both versions during a
199
+ transition window.
200
+
201
+ Servers MUST reject requests with `X-Tangle-Wire-Version` they don't
202
+ support, with a 400 listing the versions they DO accept.
203
+
204
+ The version string IS the spec id — pin against it.
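A server-side version check following these rules might look like the sketch below. `parseWireVersion` and `checkWireVersion` are hypothetical helpers, and the date-plus-revision pattern is inferred from the examples in this section rather than from a published grammar.

```typescript
interface WireVersion {
  date: string // the frozen-contract date, e.g. 2026-05-26
  revision: number // the N in vN
}

// Parse a `YYYY-MM-DD.vN` string; returns null for anything else.
function parseWireVersion(raw: string): WireVersion | null {
  const match = /^(\d{4}-\d{2}-\d{2})\.v(\d+)$/.exec(raw)
  if (match === null) return null
  return { date: match[1], revision: Number(match[2]) }
}

type VersionCheck =
  | { ok: true }
  | { ok: false; status: 400; error: string; supported: readonly string[] }

// Reject missing, malformed, or unsupported versions with a 400 that lists what is accepted.
function checkWireVersion(header: string | undefined, supported: readonly string[]): VersionCheck {
  if (header !== undefined && parseWireVersion(header) !== null && supported.includes(header)) {
    return { ok: true }
  }
  return {
    ok: false,
    status: 400,
    error: `unsupported wire version: ${header ?? '(missing)'}`,
    supported,
  }
}
```

Matching on the exact string, rather than comparing parsed components, follows the advice to pin against the version literal.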
@@ -0,0 +1,188 @@
1
+ # Phase-B partner pairing kit
2
+
3
+ Everything we hand a design partner — the pitch, the discovery doc,
4
+ the judge worksheet, the 4-hour pairing agenda, the success criteria.
5
+
6
+ > This file is **partner-facing**. The internal driving runbook is in
7
+ > [`phase-b-runbook.md`](./phase-b-runbook.md).
8
+
9
+ ---
10
+
11
+ ## The pitch (one-pager)
12
+
13
+ You have a working agent. You don't have evals. You don't have a
14
+ self-improvement loop. You don't know which prompt change actually
15
+ made the agent better last week.
16
+
17
+ We have all of that on a shelf — same engine our six internal product
18
+ agents use in production. It's open source, free at the LAND tier, and
19
+ sandbox-free if you don't want our sandbox.
20
+
21
+ **The Phase-B offer:** in one 4-hour pairing, we wrap your agent
22
+ behind our `Dispatch`, author your domain-specific judge with you,
23
+ and run one real campaign + improvement loop on **your actual use
24
+ case**. You walk away with:
25
+
26
+ - A reproducible eval harness against scenarios you control.
27
+ - A judge that scores your outputs on dimensions you defined.
28
+ - One measurable lift on your real product, with a held-out gate.
29
+ - Trace artifacts you own (locally on disk; nothing leaves your
30
+ network unless you point at our hosted tier).
31
+
32
+ What we get: design-partner evidence the substrate works on a foreign
33
+ agent we did not build. That validates the wedge for us. Nothing else
34
+ changes hands.
35
+
36
+ **Cost to you:** 4 hours of pairing + your LLM bill for the campaign
37
+ run (typically $5-$50 depending on model + scenario count). No
38
+ commitment, no contract, no exclusivity. We don't take your code, your
39
+ data, or your secrets.
40
+
41
+ ---
42
+
43
+ ## Discovery questions (15 min, before the pairing)
44
+
45
+ Send these to the partner ahead of the pairing so they walk in with
46
+ their answers.
47
+
48
+ ### About the agent
49
+
50
+ 1. What does your agent **do** — one paragraph, end-user perspective?
51
+ 2. What's the **input** it accepts and the **output** it produces?
52
+ (Schemas help; English is fine.)
53
+ 3. What framework / stack? (LangChain / Mastra / OpenAI Agents SDK /
54
+ bespoke / something else.)
55
+ 4. Where does it run? (Local node / serverless / your sandbox /
56
+ browser / mobile / other.)
57
+ 5. What model(s) does it use today? Any model-routing layer
58
+ (OpenRouter, Portkey, your own)?
59
+
60
+ ### About quality
61
+
62
+ 6. How do you currently know your agent is good? (Eyeballing /
63
+ user feedback / metrics / nothing yet — all fine answers.)
64
+ 7. What does a **bad** output look like for you? Give 2-3 concrete
65
+ examples. Be specific.
66
+ 8. What does a **good** output look like? Same.
67
+ 9. Are there outputs that are *technically correct but feel wrong*?
68
+ What's the signal?
69
+ 10. How would a senior person on your team **score** an output, if
70
+ they had to give it a 1-10? Walk us through the rubric they'd
71
+ use, even informally.
72
+
73
+ ### About the loop
74
+
75
+ 11. If we could improve one thing about the agent in 4 hours, what
76
+ would move the needle the most for you?
77
+ 12. Are there *prompt* changes you've wanted to try but haven't had
78
+ the loop to validate?
79
+ 13. Anything you've explicitly tried that **didn't** work? (Saves us
80
+ suggesting it.)
81
+
82
+ ---
83
+
84
+ ## Judge-design worksheet (45 min into the pairing)
85
+
86
+ The judge is the most under-discussed piece of an eval system. Most
87
+ projects fail at the judge, not the agent.
88
+
89
+ We start with a **strawman** — the 6 dimensions in our canonical
90
+ marketing-quality judge:
91
+
92
+ | Dim | What it measures |
93
+ |---|---|
94
+ | hook_strength | Opens with concrete user outcome, not category |
95
+ | voice_match | Reads human-written; no AI slop |
96
+ | cta_clarity | Next step unambiguous for the audience |
97
+ | factual_grounding | Only claims things the brief supports |
98
+ | surface_fit | Length + register correct for medium |
99
+ | audience_specificity | Vocabulary the audience actually responds to |
100
+
101
+ **Your job in this 45 min:** rip this apart. We expect:
102
+
103
+ - **2-3 of these are wrong for you.** Replace them.
104
+ - **2-3 dimensions are missing.** Add them. (E.g., "tone matches our
105
+ brand book" or "safety-critical claim has a citation" or "answer is
106
+ decisive — no hedging when the user wants a recommendation".)
107
+ - **Weights are wrong.** For your use case some dims matter 5x more.
108
+
109
+ The deliverable: a judge with 4-8 dimensions, each scored 0.0 - 1.0,
110
+ each unambiguous enough that two independent humans would score the
111
+ same artifact within 0.1.
112
+
113
+ If a dimension is squishy, throw it out. A noisy judge poisons the
114
+ loop.
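The "two independent humans within 0.1" test can be run mechanically once both raters have scored the same artifact. The sketch below is illustrative only; `disagreements` and `DimensionScores` are hypothetical names, not part of the library.

```typescript
// Scores per dimension, each expected to be in the 0.0 - 1.0 range.
type DimensionScores = Record<string, number>

// Return the dimensions where two raters differ by more than the tolerance.
function disagreements(a: DimensionScores, b: DimensionScores, tolerance = 0.1): string[] {
  const names = new Set([...Object.keys(a), ...Object.keys(b)])
  const flagged: string[] = []
  for (const name of names) {
    const x = a[name]
    const y = b[name]
    // A dimension scored by only one rater counts as a disagreement.
    // The small epsilon keeps floating-point noise from flagging an exact 0.1 gap.
    if (x === undefined || y === undefined || Math.abs(x - y) > tolerance + 1e-9) {
      flagged.push(name)
    }
  }
  return flagged
}
```

Any dimension this returns is a candidate for sharper anchors or removal before the judge goes into the loop.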
115
+
116
+ ---
117
+
118
+ ## The 4-hour pairing agenda
119
+
120
+ ### Hour 1 — Discovery + Dispatch wiring
121
+
122
+ | Time | What | Deliverable |
123
+ |---|---|---|
124
+ | 0:00 - 0:15 | Review discovery answers, align on scope | Shared doc with goals + constraints |
125
+ | 0:15 - 0:45 | Wire `Dispatch` around their agent — typically 1 function | Working `Dispatch<TScenario, TArtifact>` |
126
+ | 0:45 - 1:00 | Run 1-2 scenarios through `Dispatch` manually; see real artifacts | Confirmed wire shape |
127
+
128
+ ### Hour 2 — Judge calibration
129
+
130
+ | Time | What | Deliverable |
131
+ |---|---|---|
132
+ | 1:00 - 1:45 | Walk through the strawman judge; redesign dimensions with the partner | Final `JudgeConfig` for their domain |
133
+ | 1:45 - 2:00 | Calibrate judge against the 1-2 manual outputs from Hour 1 | Confirmed judge gives the same scores a human would |
134
+
135
+ ### Hour 3 — First campaign + tuning
136
+
137
+ | Time | What | Deliverable |
138
+ |---|---|---|
139
+ | 2:00 - 2:30 | Define 8-15 scenarios with the partner (or use ours as a template) | Scenario set with train + holdout split |
140
+ | 2:30 - 3:00 | Run `runEval` for baseline; review per-scenario scores | Baseline score + identified failure modes |
141
+
142
+ ### Hour 4 — Improvement loop + go/no-go
143
+
144
+ | Time | What | Deliverable |
145
+ |---|---|---|
146
+ | 3:00 - 3:30 | Configure `runImprovementLoop` with `gepaDriver` (3 generations, population 2) + `defaultProductionGate` | Improvement run completes |
147
+ | 3:30 - 3:50 | Walk the partner through the gate decision + lift per scenario | Report artifact |
148
+ | 3:50 - 4:00 | Capture: was the lift real? Would they ship the winner? Will they keep using the lib? | **Go/no-go signal for Phase D** |
149
+
150
+ If we're tracking ahead at any hour, use the slack to deepen — add a
151
+ red-team battery, swap the judge model, run more generations. If we're
152
+ behind, cut the scenario set to 6 and ship.
153
+
154
+ ---
155
+
156
+ ## Success criteria — what counts as Phase B passed
157
+
158
+ For us to greenlight Phase D (hosted orchestrator + metered billing),
159
+ we need ALL of:
160
+
161
+ 1. **Real lift.** Held-out winner score > baseline by ≥ 0.05 composite
162
+ points (or the partner's chosen threshold). Not just train; held-out.
163
+ 2. **Partner-validated lift.** The partner reads the winner output on
164
+ 3+ held-out scenarios and confirms it's actually better.
165
+ 3. **Integration time ≤ 1 day.** Discovery + wiring + judge took ≤ 4
166
+ hours for the pairing; partner could reach the same point solo in
167
+ ≤ 1 day from the quickstart doc.
168
+ 4. **Public commitment.** Partner agrees to a public reference (case
169
+ study / quote / logo) OR commits to running the LAND tier in their
170
+ own product within 2 weeks.
171
+
172
+ 3-of-4 = soft pass (revisit Phase D scope but proceed). 4-of-4 = hard
173
+ pass (build Phase D). ≤ 2 = fail (back to substrate iteration).
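The scoring rule above reduces to counting met criteria. A minimal sketch, with `PhaseBCriteria` and `phaseBVerdict` as hypothetical names used only for illustration:

```typescript
interface PhaseBCriteria {
  realLift: boolean // criterion 1
  partnerValidatedLift: boolean // criterion 2
  integrationWithinOneDay: boolean // criterion 3
  publicCommitment: boolean // criterion 4
}

type Verdict = 'hard-pass' | 'soft-pass' | 'fail'

// 4 of 4 is a hard pass, 3 of 4 a soft pass, 2 or fewer a fail.
function phaseBVerdict(c: PhaseBCriteria): Verdict {
  const met = [
    c.realLift,
    c.partnerValidatedLift,
    c.integrationWithinOneDay,
    c.publicCommitment,
  ].filter(Boolean).length
  if (met === 4) return 'hard-pass'
  if (met === 3) return 'soft-pass'
  return 'fail'
}
```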
174
+
175
+ ---
176
+
177
+ ## What we don't ask for
178
+
179
+ - Your code. Wire `Dispatch` around your existing API; we never see the
180
+ source.
181
+ - Your customer data. Use synthetic scenarios or anonymized real ones —
182
+ whichever you prefer.
183
+ - Your model keys. You bring your own; if you want, route through Tangle
184
+ Router and we never see the prompts either.
185
+ - Exclusivity, commitment, or contract. Walk away whenever.
186
+
187
+ The point is to learn if the substrate works for someone we didn't
188
+ build it for. That's it.
@@ -0,0 +1,176 @@
1
+ # Phase-B runbook (internal)
2
+
3
+ How we drive a design-partner pairing. Goes alongside
4
+ [`phase-b-pairing-kit.md`](./phase-b-pairing-kit.md) (the partner-facing
5
+ materials) — this file is for us.
6
+
7
+ ---
8
+
9
+ ## Before the pairing
10
+
11
+ - **24-48h prior:** send discovery questions from
12
+ [`phase-b-pairing-kit.md`](./phase-b-pairing-kit.md). Don't run the
13
+ pairing without answers in hand. The pairing fails when we discover
14
+ the partner's quality bar live; we don't have time to interview AND
15
+ build in 4 hours.
16
+ - **48h prior:** run the canonical demo (`pnpm tsx
17
+ examples/marketing-agent-canonical/index.ts`) end-to-end against the
18
+ partner's preferred model. Confirms the substrate + their LLM tier
19
+ compose. If it errors, fix the substrate before the pairing.
20
+ - **24h prior:** mirror the partner's stack locally. If they're on
21
+ Cloudflare Workers, run a Worker. On LangChain, install `@langchain/*`.
22
+ Don't debug their tooling on the call.
23
+ - **1h prior:** have the pairing kit, the agent-eval repo, the partner's
24
+ agent code/endpoint, a shared doc, and a screenshare open and ready.
25
+
26
+ ## During the pairing
27
+
28
+ ### Driving principles
29
+
30
+ - **Talk less, ship more.** The partner is paying with their time and
31
+ attention; every minute we talk we aren't shipping their lift.
32
+ - **They write the judge.** We start with our strawman so they have
33
+ something to react to, but the judge that ends up running is theirs.
34
+ This is the most-discussed seam — they should own it.
35
+ - **No invented features.** Don't promise capabilities that don't exist
36
+ ("we have a hosted ingest for this") unless they actually exist.
37
+ Phase B is honesty's purest test.
38
+ - **Capture verbatim.** Write down their exact words on what's broken /
39
+ what would change their mind. The wedge-gate evidence is qualitative
40
+ too.
41
+
42
+ ### When to escalate to Drew
43
+
44
+ - Partner wants something Phase D would have (hosted dashboard,
45
+ multi-tenant, billing). **Escalate same day**: this is the GTM signal we're
46
+ hunting for; Drew should hear it directly.
47
+ - Partner is the wrong fit (technical or business) and the pairing
48
+ would burn both sides' time. **Pause the pairing**, debrief with Drew,
49
+ reschedule with a better-fit partner.
50
+ - Substrate breaks in a way that requires a published bump. **Pause
51
+ the pairing**, ship the fix in a focused PR, resume.
52
+
53
+ ### What to capture for the wedge gate
54
+
55
+ Per [`docs/design/external-agent-wedge.md`](./design/external-agent-wedge.md),
56
+ the gate decision hinges on Phase B evidence. We capture:
57
+
58
+ 1. **Quantitative lift** — held-out winner composite vs baseline, per
59
+ scenario + overall. Auto-generated in the report artifact by the
60
+ canonical demo (`.phase-b-runs/<ts>/phase-b-report.md`).
61
+ 2. **Qualitative partner-validation** — partner read 3+ winner outputs
62
+ and confirmed they're better. Capture as a 1-paragraph quote.
63
+ 3. **Integration friction** — minutes spent on each pairing phase. Were
64
+ any > 2x estimated? What broke?
65
+ 4. **Judge-design surprise** — which dimensions the partner added or
66
+ killed vs our strawman. Strong signal about what the substrate's
67
+ default judge templates are missing for adjacent domains.
68
+ 5. **Soft commitments** — would they reference us? Would they
69
+ self-serve from the quickstart doc? Would they pay for hosted?
70
+
71
+ Capture into a single `phase-b-debrief.md` per partner. We don't
72
+ publish these; they feed the next substrate iteration + the wedge
73
+ go/no-go.
74
+
75
+ ---
76
+
77
+ ## Failure modes — what we do NOT do
78
+
79
+ ### "We'll just optimize on the train set"
80
+
81
+ Hard no. The held-out gate is the entire point. A win that doesn't
82
+ generalize is worse than no win — it's evidence that the substrate
83
+ overfits, which is the failure mode the wedge tier rewards.
84
+
85
+ If the holdout lift is < threshold but train looks great:
86
+
87
+ 1. Show the partner the gap. Explain what overfitting means here.
88
+ 2. Try raising `maxGenerations` to 5 (gives gepa more search budget).
89
+ 3. Try widening `populationSize` to 3 (more diverse mutations per gen).
90
+ 4. If still no lift on holdout: **report the result honestly**. A
91
+ negative finding is real evidence for us too — tells us this surface
92
+ isn't amenable to prompt-only mutation, and the partner needs Phase
93
+ C (code-tier optimization) or a different approach.
94
+
95
+ ### "The judge is too noisy"
96
+
97
+ A judge whose scores differ by more than 0.1 across two runs on the same artifact is broken.
98
+ Fixes, in order:
99
+
100
+ 1. Lower temperature to 0.0 (the canonical judge uses 0.2, which is
101
+ already low).
102
+ 2. Use a stronger model than the agent. The default is the same model;
103
+ bump the judge to GPT-5.5 / Claude Opus.
104
+ 3. Add anchors to each dimension ("0.0 = X, 0.5 = Y, 1.0 = Z").
105
+ 4. If still noisy: collapse to fewer, simpler dimensions. 3 unambiguous
106
+ dimensions beat 6 squishy ones.
107
+
108
+ ### "We can't decide what the partner's judge should be"
109
+
110
+ Then we don't have Phase B. The judge IS the partner's quality bar.
111
+ If they can't articulate it in 45 minutes of pairing, we're in the
112
+ wrong pairing — they need to do the interview-themselves work first.
113
+
114
+ **Pause the pairing, send the discovery doc again, regroup in a week.**
115
+
116
+ ### "Their agent is slow / expensive"
117
+
118
+ Set `maxConcurrency: 1` and reduce scenarios to 6. Cost scales linearly with the number of calls;
119
+ time scales as `(scenarios × reps × generations × population) /
120
+ concurrency`. Tune until the loop completes in ≤ 30 min.
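The scaling rule above can be turned into a quick pre-pairing estimate. This is a rough sketch: `estimateLoop` and `LoopPlan` are hypothetical names, and the per-call latency and cost are inputs you measure yourself. It ignores the baseline run and judge calls, so treat the result as a lower bound.

```typescript
interface LoopPlan {
  scenarios: number
  reps: number
  generations: number
  population: number
  concurrency: number
  secondsPerCall: number // measured latency of one agent call (assumption: roughly constant)
  usdPerCall: number // measured cost of one agent call
}

// Calls = scenarios x reps x generations x population; time divides by concurrency, cost does not.
function estimateLoop(plan: LoopPlan): { calls: number; minutes: number; costUsd: number } {
  const calls = plan.scenarios * plan.reps * plan.generations * plan.population
  return {
    calls,
    minutes: (calls * plan.secondsPerCall) / plan.concurrency / 60,
    costUsd: calls * plan.usdPerCall,
  }
}
```

If the estimated minutes exceed 30, cut scenarios or generations before the call rather than during it.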
121
+
122
+ If the per-call cost is > $1, talk to Drew before the pairing — we
123
+ might want to subsidize the partner's first run.
124
+
125
+ ### "They want to share their secrets through Tangle Router"
126
+
127
+ Fine — `OPENAI_BASE_URL=https://router.tangle.tools/v1` works. Make
128
+ sure they understand: every call routes through us; the prompts and
129
+ responses are visible to whatever observability we have on the router.
130
+ If they want zero data leaving their network, point at their own
131
+ endpoint, not Tangle Router.
132
+
133
+ ---
134
+
135
+ ## After the pairing
136
+
137
+ ### Same day
138
+
139
+ - Save the `phase-b-report.md` artifact + the partner's debrief notes
140
+ to `~/company/design-partners/<partner>/<date>/`.
141
+ - Send the partner a thank-you with the winner artifact + the
142
+ next-steps doc. Whether or not we proceed to Phase D, leave them with
143
+ something concrete they can ship in their product.
144
+ - Slack Drew the verdict against the [success criteria](./phase-b-pairing-kit.md#success-criteria--what-counts-as-phase-b-passed).
145
+
146
+ ### Within a week
147
+
148
+ - If Phase B passed: open the Phase D RFC. Reuse the partner-validated
149
+ judge dimensions + scenarios as the spec for what the hosted tier
150
+ needs to support out of the box.
151
+ - If Phase B failed: substrate iteration ticket(s). Specific gaps the
152
+ pairing surfaced (judge dim defaults, doc clarity, missing helper).
153
+ - Either way: update the wedge doc (`docs/design/external-agent-wedge.md`)
154
+ with the partner name redacted + the qualitative signal.
155
+
156
+ ### Within a month (regardless of go/no-go)
157
+
158
+ - Follow up with the partner. If they're still using the lib, capture a
159
+ metric. If they stopped, find out why. Both data points feed product.
160
+
161
+ ---
162
+
163
+ ## The canonical demo as a forcing function
164
+
165
+ `examples/marketing-agent-canonical/` is the demo we open the pairing
166
+ with. It does three things at once:
167
+
168
+ 1. **Proves the substrate works**: they see a real lift on a
169
+ real-feeling agent before we touch their code.
170
+ 2. **Sets the bar for the judge conversation** — they react to concrete
171
+ dimensions, not abstract questions.
172
+ 3. **Trains us** — running the canonical demo before the pairing
173
+ surfaces substrate bugs on the partner's preferred model BEFORE the
174
+ partner is watching. We hit those bugs first.
175
+
176
+ Run the canonical demo before every Phase-B pairing. It's not optional.