npm - @tangle-network/agent-eval - Versions diffs - 0.40.3 → 0.40.5 - Mend

@tangle-network/agent-eval 0.40.3 → 0.40.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

package/dist/campaign/index.d.ts +18 -9
package/dist/campaign/index.js +9 -9
package/dist/campaign/index.js.map +1 -1
package/dist/{chunk-TMXPFWC7.js → chunk-YNMCYUWT.js} +10 -10
package/dist/chunk-YNMCYUWT.js.map +1 -0
package/dist/openapi.json +1 -1
package/dist/{run-campaign-JYJXYHHL.js → run-campaign-KEJK5KFT.js} +2 -2
package/docs/design/phase4-consumer-migration.md +70 -0
package/docs/design/primitives-integration-spec.md +393 -0
package/docs/design/product-self-improvement-loop.md +146 -0
package/package.json +1 -1
package/dist/chunk-TMXPFWC7.js.map +0 -1
package/dist/{run-campaign-JYJXYHHL.js.map → run-campaign-KEJK5KFT.js.map} +0 -0

package/docs/design/phase4-consumer-migration.md ADDED

@@ -0,0 +1,70 @@
+# Phase 4 — consumer migration tracking
+Migrate the product repos off their duplicated eval / prompt-evolution
+orchestration onto the published substrate (`@tangle-network/agent-eval@^0.40.3`
++ `@tangle-network/agent-runtime@^0.25.0`). Integration contract:
+[`primitives-integration-spec.md`](./primitives-integration-spec.md).
+**Strategy:** prove **gtm end-to-end first** (the canonical consumer), then fan
+the proven migration pattern to the rest via parallel subagents, each briefed
+with the gtm reference diff + the spec's forbidden-anti-patterns list. Each
+migration is its own reviewable, rollback-able PR.
+## Status board
+| Repo | Deletable orchestration (LOC est.) | Dispatch seam | Status | PR |
+|---|---|---|---|---|
+| gtm-agent | ~2,420 | `runChatThroughRuntime` | **IN PROGRESS** | — |
+| legal-agent | tbd | tbd | queued | — |
+| tax-agent | tbd | tbd | queued | — |
+| creative-agent | tbd | tbd | queued | — |
+| agent-builder | tbd | tbd | queued | — |
+| blueprint-agent | tbd | tbd | queued (Drew dispatching via spec) | — |
+| physim | tbd (MultiLayerVerifier adapter) | tbd | queued | — |
+## Per-repo migration checklist
+For each repo, in order:
+- [ ] **Survey** — inventory eval + prompt-evolution wrappers (file:line + LOC).
+      Identify the dispatch seam, scenarios, judges, mutation strategy.
+- [ ] **Bump deps** — `@tangle-network/agent-eval` → `^0.40.0`,
+      `@tangle-network/agent-runtime` → `^0.25.0`; `pnpm update`; baseline
+      typecheck green.
+- [ ] **Rewire seams** — `dispatch`/`dispatchWithSurface`, `judges`,
+      `scenarios` extracted from the existing wrappers (KEEP domain logic).
+- [ ] **Replace orchestration** — swap the local generation/population/scorecard
+      loop for `runImprovementLoop` (or `runCampaign` for eval-only). DELETE the
+      wrapper body.
+- [ ] **Gate** — compose domain gates with `defaultProductionGate`.
+- [ ] **Dataset** — wire `FsLabeledScenarioStore` with correct `captureSource`.
+- [ ] **Tests** — port wrapper contract tests to assert the substrate wiring;
+      keep judge/scenario tests. Suite green.
+- [ ] **Prove** — one real eval/improve run end-to-end; confirm scorecard +
+      (if applicable) a PR opens on a shipping gate.
+- [ ] **Anti-pattern sweep** — no silent fallbacks, no reimplemented loop, no
+      train/holdout conflation, tracing on, dispatch named.
+- [ ] **PR** — open, independent-review, merge.
+## gtm-agent — migration map (from survey)
+- **Branch base:** off the repo's working branch (`feat/gtm-rich-chat-actions`)
+  or main — confirm before starting.
+- **Dispatch seam:** `runChatThroughRuntime(ctx)`
+  (`src/lib/.server/agent-runtime/chat.ts`) — prompt variant + scenario → real
+  agent run → artifact + events + token usage.
+- **Scenarios:** `src/lib/.server/production-loop/scenarios.ts` (3 holdout) +
+  `eval/business-owner/personas.json` (canonical personas).
+- **Judges:** `src/lib/.server/production-loop/judges.ts` (`runEnsembleJudge`,
+  3-model ensemble) + canonical 12-dimension judges in `eval/canonical.ts`.
+- **Delete (~2,420 LOC orchestration):** the generation/population/reps loop in
+  `src/lib/.server/production-loop/index.ts` (~450), the checkpoint loop in
+  `eval/canonical.ts` (~600), `eval/run-prompt-evolution.ts` wrapper (~800),
+  `eval/analyst-loop.ts` wrapper (~300), `eval/optimization-campaign.ts` (~170),
+  `scripts/evals/run-optimization-campaign.ts` (~100 scaffold).
+- **Rewire:** `buildHoldoutRunner` → `dispatchWithSurface`; `buildScorer` →
+  `judges`; `buildMutator` → `evolutionaryDriver({ mutator })`;
+  `runProductionLoop` → `runImprovementLoop`.
+- **Keep:** judges, scenarios, persona data + reactive driver, deterministic
+  anti-slop/brief checks, GitHub PR wiring, feedback/trace ingestion.
+- **Net:** ~1,400–1,600 LOC reduction.
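
Note on the gtm-agent migration map above (this note is not part of the diffed file): the per-file estimates in the "Delete" bullet can be cross-checked against the "~2,420 LOC" total and the stated net reduction. The sketch below only redoes that arithmetic; the object keys are shorthand labels chosen here, and the "implied added" range is an inference from the document's own numbers, not a figure the document states.

```typescript
// Estimated LOC per deletable item, copied from the "Delete" bullet above.
// The keys are shorthand labels for this sketch, not real module names.
const deletableLoc: Record<string, number> = {
  'production-loop/index.ts (generation loop)': 450,
  'eval/canonical.ts (checkpoint loop)': 600,
  'eval/run-prompt-evolution.ts': 800,
  'eval/analyst-loop.ts': 300,
  'eval/optimization-campaign.ts': 170,
  'scripts/evals/run-optimization-campaign.ts': 100,
}

// Sum of the individual estimates.
const totalDeleted = Object.values(deletableLoc).reduce((sum, n) => sum + n, 0)

// Net reduction range as stated in the "Net" bullet.
const netReduction = { min: 1400, max: 1600 }

// If the net reduction is (deleted - newly written wiring), the wiring code
// the migration adds must fall in this range (an inference, see lead-in).
const impliedAdded = {
  min: totalDeleted - netReduction.max,
  max: totalDeleted - netReduction.min,
}

console.log({ totalDeleted, impliedAdded })
```

The per-file estimates sum to the quoted total, and the implied wiring size is in the same region as the "~800 LOC rewired" figure the integration spec gives for a typical consumer.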

package/docs/design/primitives-integration-spec.md ADDED

@@ -0,0 +1,393 @@
+# Self-improvement primitives — integration spec
+**Audience:** an engineer (or agent) wiring a product onto the Tangle
+self-improvement stack. This is the authoritative "how to use the primitives"
+reference. It is exact: every signature, every seam, every forbidden pattern.
+**Packages (published):**
+- `@tangle-network/agent-eval@^0.40.3` — measurement + improvement loop +
+  worktree adapter + gates + dataset store. The leaf; depends on nothing
+  upstream. Import the loop surface from `@tangle-network/agent-eval/campaign`.
+- `@tangle-network/agent-runtime@^0.25.0` — the runtime-side improvement
+  driver (`improvementDriver`) + generators (`reflectiveGenerator`,
+  `agenticGenerator`). Import from `@tangle-network/agent-runtime/improvement`.
+Read [`loop-taxonomy.md`](./loop-taxonomy.md) (vocabulary) and
+[`self-improvement-engine.md`](./self-improvement-engine.md) (phases) first.
+This doc is the contract-level detail under them.
+---
+## 0. The one-paragraph model
+A **measurement** (`runCampaign`) runs your agent (behind a `dispatch` seam)
+over `scenarios`, judges the outputs, and returns a scorecard with confidence
+intervals. An **improvement loop** (`runImprovementLoop`) drives an
+`ImprovementDriver` to propose candidate **surfaces** (a prompt string, or a
+`CodeSurface` = a git worktree of code edits), measures each on a **holdout**,
+runs a release **gate**, and opens a **PR** for the winner. Every run feeds a
+**dataset** (`LabeledScenarioStore`) — the same corpus the optimizer learns
+from. Three roles, fixed meaning: **driver** decides what's next; **worker** =
+the agent in a sandbox (invoked behind `dispatch`); **measurement** runs the
+worker and scores it.
+---
+## 1. The seams you implement (everything else is substrate)
+You implement exactly three things. The substrate owns the rest.
+| Seam | Type | What it is |
+|---|---|---|
+| `dispatch` | `(scenario, ctx) => Promise<TArtifact>` | invoke YOUR agent on one scenario → the artifact judges score. Topology-opaque: one LLM call, or a driver↔workers-in-a-sandbox loop — substrate doesn't care. |
+| `judges` | `JudgeConfig<TArtifact, TScenario>[]` | score an artifact on named dimensions → composite. Your rubrics. |
+| `scenarios` | `Scenario[]` | the inputs (`{ id, kind, ... }`). Your eval set. |
+If you are also improving a surface, you additionally provide:
+| Seam | Type | What it is |
+|---|---|---|
+| `dispatchWithSurface` | `(surface, scenario, ctx) => Promise<TArtifact>` | like `dispatch`, but takes the candidate surface (prompt string or `CodeSurface`) — swap it into your agent before running. |
+| a **driver** | `ImprovementDriver` | how candidates are proposed (see §4). Use a shipped one; don't hand-roll. |
+| a **gate** | `Gate` | ship/hold decision (use `defaultProductionGate`). |
+**You never implement:** generation loops, population/top-K selection, seed
+propagation, manifest hashing, cell caching, bootstrap CIs, worktree git
+plumbing, PR-opening, or trace capture. Reimplementing any of these is the
+anti-pattern this whole stack exists to delete.
+---
+## 2. `runCampaign` — the measurement primitive
+```ts
+import { runCampaign, type RunCampaignOptions } from '@tangle-network/agent-eval/campaign'
+const result = await runCampaign<MyScenario, MyArtifact>({
+  scenarios,                 // MyScenario[]
+  dispatch,                  // (scenario, ctx) => Promise<MyArtifact>
+  judges,                    // JudgeConfig<MyArtifact, MyScenario>[]  (optional)
+  runDir: '/abs/run/dir',    // REQUIRED — where artifacts + traces land
+  seed: 42,                  // default 42 — reproducibility
+  reps: 1,                   // per-scenario replicates; raise to 5+ for tight CIs
+  maxConcurrency: 2,         // parallel cells
+  costCeiling: 5.0,          // optional USD soft-abort
+  tracing: 'on',             // default on; 'off' refused by improvement loop w/ a driver
+  labeledStore: store,       // optional capture (see §8); 'off' to disable
+  captureSource: 'eval-run', // provenance for captured rows
+})
+```
+Returns `CampaignResult<TArtifact, TScenario>`:
+```ts
+{
+  manifestHash: string        // sha256(scenarios, judges, dispatch ref, seed, reps) — run identity
+  seed: number
+  startedAt, endedAt, durationMs
+  cells: CampaignCellResult[] // one per scenario×rep: { cellId, scenarioId, rep, artifact, judgeScores, costUsd, cached, error? }
+  aggregates: {
+    byJudge:    Record<string, JudgeAggregate>     // { mean, stdev, ci95:[lo,hi], n } — bootstrap CIs
+    byScenario: Record<string, ScenarioAggregate>
+    totalCostUsd, cellsExecuted, cellsSkipped, cellsCached, cellsFailed
+  }
+  runDir, artifactsByPath, scenarios
+}
+```
+**Rules:**
+- `dispatch` must be a *named* function (`dispatch.name` feeds the manifest hash
+  — anonymous arrows weaken reproducibility identity).
+- Inspect `cell.error` before trusting `cell.artifact`. Cells fail-soft
+  individually (one bad scenario doesn't kill the run) but the error is
+  recorded, never swallowed.
+- Re-running the same `runDir` with `resumable: true` (default) skips cached
+  cells by `(manifestHash, scenarioId, rep)`.
+`runEval(opts)` is a thin alias for the scorecard-only case (no improvement).
+---
+## 3. `JudgeConfig`, `Scenario` — the domain types you own
+```ts
+interface Scenario { id: string; kind: string; /* + your fields */ }
+interface JudgeConfig<TArtifact, TScenario = Scenario> {
+  name: string
+  dimensions: { key: string; weight?: number }[]
+  appliesTo?: (scenario: TScenario) => boolean   // scope a judge to some scenarios
+  score(args: { artifact: TArtifact; scenario: TScenario; signal: AbortSignal })
+    : Promise<JudgeScore> | JudgeScore
+}
+interface JudgeScore { composite: number; dimensions: Record<string, number>; notes: string }
+```
+Judges are where your rubric lives. They MUST fail loud: if the judge LLM call
+fails, throw — do not return a `composite: 0` (a fake zero is indistinguishable
+from a real zero and silently corrupts every aggregate downstream).
+---
+## 4. The improvement loop — `runImprovementLoop`
+```ts
+import {
+  runImprovementLoop, defaultProductionGate, evolutionaryDriver,
+} from '@tangle-network/agent-eval/campaign'
+const result = await runImprovementLoop({
+  // --- measurement config (same as runCampaign, minus dispatch) ---
+  scenarios: trainScenarios,
+  judges,
+  runDir,
+  // --- surface improvement ---
+  baselineSurface,                 // string | CodeSurface — current best
+  dispatchWithSurface,             // (surface, scenario, ctx) => artifact
+  driver,                          // ImprovementDriver — see §5/§6
+  populationSize: 4,               // BREADTH: candidates per generation
+  maxGenerations: 3,
+  promoteTopK: 2,
+  maxImprovementShots: 3,          // DEPTH: forwarded to the driver's propose()
+  // --- gated promotion ---
+  holdoutScenarios,                // NEVER in the training pool — gate scores on these
+  gate: defaultProductionGate({ holdoutScenarios, deltaThreshold: 0.02 }),
+  autoOnPromote: 'pr',             // 'pr' | 'none'  (NO 'config' in v0.40 — throws)
+  ghOwner: 'tangle-network',
+  ghRepo: 'gtm-agent',             // required when autoOnPromote: 'pr'
+})
+// → { winnerSurface, winnerSurfaceHash, generations, baselineOnHoldout,
+//     winnerOnHoldout, gateResult, prResult? }
+```
+`runOptimization(opts)` is the loop body without the gate/holdout/PR (use it
+when you want candidates + a winner but will gate yourself).
+**Hard refusals (by design — these throw):**
+- `autoOnPromote: 'config'` → deferred to a later pass (live self-mutation
+  needs the full safety stack). Use `'pr'` or `'none'`.
+- `tracing: 'off'` while a `driver` is wired → an improvement loop that doesn't
+  feed the dataset is unattributable.
+- `autoOnPromote: 'pr'` without `ghOwner`/`ghRepo`.
+---
+## 5. `ImprovementDriver` + `ProposeContext` — the contract
+```ts
+interface ImprovementDriver<TFindings = unknown> {
+  kind: string
+  propose(ctx: ProposeContext<TFindings>): Promise<MutableSurface[]>   // PLAN
+  decide?(args: { history: GenerationRecord[] }): { stop: boolean; reason?: string }
+}
+interface ProposeContext<TFindings = unknown> {
+  currentSurface: MutableSurface
+  history: GenerationRecord[]     // prior generations + scores
+  findings: TFindings[]
+  populationSize: number          // how many candidates to return
+  generation: number
+  signal: AbortSignal
+  report?: unknown                // Phase-2 research report (analyst findings + diff)
+  dataset?: LabeledScenarioStore  // handle to all captured data
+  maxImprovementShots?: number    // DEPTH knob
+}
+type MutableSurface = string | CodeSurface
+interface CodeSurface { kind: 'code'; worktreeRef: string; baseRef?: string; summary?: string }
+```
+`propose()` returns candidates; it does NOT measure (the loop measures). For a
+code-tier driver, `propose()` may itself be agentic (spawn a harness, write a
+worktree) — that's the recursion. Pick a shipped driver:
+---
+## 6. The shipped drivers (use these; don't hand-roll)
+### `evolutionaryDriver` (agent-eval) — prompt mutation, no sandbox
+```ts
+import { evolutionaryDriver } from '@tangle-network/agent-eval/campaign'
+const driver = evolutionaryDriver({
+  mutator: {                       // YOUR Mutator (the only domain bit)
+    kind: 'reflection',
+    async mutate({ currentSurface, populationSize, findings, signal }) {
+      // return N prompt-string variants of currentSurface
+      return [...]
+    },
+  },
+})
+```
+Use when the surface is a **prompt string** and you have a mutation strategy
+(reflection, GEPA, AxGEPA). Cheap, deterministic-friendly.
+### `improvementDriver` + generators (agent-runtime) — one driver, a cost dial
+```ts
+import {
+  improvementDriver, reflectiveGenerator, agenticGenerator,
+} from '@tangle-network/agent-runtime/improvement'
+import { gitWorktreeAdapter } from '@tangle-network/agent-eval/campaign'
+const worktree = gitWorktreeAdapter({ repoRoot: '/abs/repo' })
+// cheap, no sandbox: drafts patches from findings, applies them
+const cheap = improvementDriver({
+  worktree,
+  generator: reflectiveGenerator({ improvementAdapter }), // wraps proposeFromFindings
+  baseRef: 'main',
+})
+// full agentic: a real coding harness edits the worktree, retries up to maxShots
+const deep = improvementDriver({
+  worktree,
+  generator: agenticGenerator({ harness: 'claude' }),     // claude | codex | opencode
+  baseRef: 'main',
+})
+```
+One driver; the generator is the cost dial. Both emit `CodeSurface`s the loop
+measures + gates. `agenticGenerator.generate()` runs the harness with
+`cwd = worktree`, trusts the **git diff** (not harness stdout) to decide
+"applied", and retries up to `maxImprovementShots` on a clean tree.
+---
+## 7. Gates — `defaultProductionGate`, `composeGate`, `heldOutGate`
+```ts
+import { defaultProductionGate, composeGate, heldOutGate } from '@tangle-network/agent-eval/campaign'
+// opinionated default: heldout-delta + budget + red-team + reward-hacking + canary
+const gate = defaultProductionGate({
+  holdoutScenarios,
+  deltaThreshold: 0.02,      // winner must beat baseline by this on holdout
+  budgetUsd: 5,              // optional cost ceiling
+  redTeamBattery: [...],     // optional adversarial probes
+})
+// compose your own: ALL must ship, else the worst verdict wins
+const custom = composeGate(heldOutGate({ scenarios: holdoutScenarios, deltaThreshold: 0.02 }), myDomainGate)
+```
+`Gate.decide(ctx) → GateResult` with a 5-valued verdict:
+`GateDecision = 'ship' | 'hold' | 'need_more_work' | 'model_ceiling' | 'arch_ceiling'`.
+`composeGate` returns `ship` only if all sub-gates ship; otherwise the
+precedence is `arch_ceiling > model_ceiling > hold > need_more_work`. Use the
+non-ship verdicts to route: `need_more_work` → more data, `model_ceiling` →
+try a stronger model, `arch_ceiling` → the surface can't fix it.
+`openAutoPr({ result, gate, promotedDiff, ghOwner, ghRepo })` opens the PR —
+**refuses unless `gate.decision === 'ship'`**, dry-runs without a GH token.
+---
+## 8. The dataset flywheel — `FsLabeledScenarioStore`
+```ts
+import { FsLabeledScenarioStore } from '@tangle-network/agent-eval/campaign'
+const store = new FsLabeledScenarioStore({ root: '/abs/dataset', maxWritesPerMinutePerBucket: 60 })
+// pass to runCampaign({ labeledStore: store, captureSource: 'production-trace' })
+```
+Every campaign cell captures `(scenario, artifact, judgeScore, source)`. This
+corpus IS the optimizer's training set. Discipline enforced at the store:
+- **provenance required** on every write (source / sourceVersionHash /
+  capturedAt / redactionStatus).
+- **temporal split**: `sample()` requires explicit `split` + `capturedBefore`.
+- **`production-trace` is excluded from the train split by default** (no
+  contamination of the holdout it's judged against).
+---
+## 9. The migration recipe (what to DELETE / KEEP / REWIRE)
+For a product that already has eval + prompt-evolution wrappers:
+**DELETE (orchestration the substrate now owns):**
+- generation/population/top-K loops, trial-matrix construction, frontier
+  tracking, seed plumbing, manifest hashing, cell caching, scorecard
+  aggregation, CI math, PR-opening scaffolding, worktree git commands.
+- any local `runProductionLoop` / `runPromptEvolution` / `runAnalystLoop`
+  wrapper whose body is a loop over generations × candidates × reps.
+**KEEP (domain logic — it does not move):**
+- scenarios (your eval inputs) → become `scenarios`.
+- judges/rubrics/dimension weights → become `judges`.
+- the agent-invocation function → becomes `dispatch` / `dispatchWithSurface`.
+- the mutation strategy (reflection prompt) → becomes a `Mutator` or a
+  generator's `buildPrompt`.
+- domain gates (e.g. anti-fabrication) → compose with `defaultProductionGate`.
+**REWIRE:**
+- `buildHoldoutRunner()` → `dispatchWithSurface`.
+- `buildScorer()` → `judges`.
+- `buildMutator()` → `evolutionaryDriver({ mutator })`.
+- `runProductionLoop(...)` → `runImprovementLoop(...)`.
+- `runPromptEvolution(...)` → `runImprovementLoop` (surface = prompt string).
+- `runAnalystLoop(...)` improvement step → `improvementDriver` + a generator;
+  its findings-ledger + knowledge-graph writes stay.
+Net for a typical consumer: ~2,400 LOC of orchestration deleted, ~800 LOC
+rewired into the three seams.
+---
+## 10. Forbidden anti-patterns (a review will reject these)
+1. **No silent fallbacks.** No `catch { return null }`, no `?? 0` on a judge
+   composite, no returning `false`/empty on an error you can't interpret.
+   External-boundary calls return typed outcomes or throw. A git/LLM/subprocess
+   failure is a *throw*, never a fold-into-a-default.
+2. **Don't reimplement the loop.** If you write a `for (gen of generations)`
+   that mutates + scores + selects, you've rebuilt the substrate. Stop; call
+   `runImprovementLoop`.
+3. **Don't conflate train and holdout.** Holdout scenarios never enter the
+   training pool. The gate scores on holdout only.
+4. **Don't trust harness stdout.** For code edits, the git diff is the truth,
+   not what the agent says it did.
+5. **Account for every worktree.** A created worktree is finalized into a
+   surface or discarded — never leaked, even on throw (the shipped
+   `improvementDriver` already guarantees this; preserve it if you extend).
+6. **Don't auto-deploy.** Promotion opens a PR (`autoOnPromote: 'pr'`). Live
+   self-mutation (`'config'`) is deferred behind the full safety stack.
+7. **Tracing stays on when improving.** The loop refuses `tracing: 'off'` with
+   a driver wired — the dataset must be fed.
+8. **Name your `dispatch`.** Anonymous dispatch weakens the manifest-hash
+   reproducibility identity.
+---
+## 11. Minimal end-to-end skeleton
+```ts
+import {
+  runImprovementLoop, defaultProductionGate, evolutionaryDriver,
+  FsLabeledScenarioStore,
+} from '@tangle-network/agent-eval/campaign'
+const store = new FsLabeledScenarioStore({ root: '.dataset' })
+async function dispatchWithSurface(surface: string, scenario: MyScenario) {
+  return runMyAgent({ systemPrompt: surface, input: scenario }) // → MyArtifact
+}
+const judges = [{
+  name: 'quality',
+  dimensions: [{ key: 'grounding' }, { key: 'actionability' }],
+  async score({ artifact, scenario }) { /* → JudgeScore, throw on failure */ },
+}]
+const result = await runImprovementLoop<MyScenario, MyArtifact>({
+  scenarios: train, holdoutScenarios: holdout, judges,
+  baselineSurface: CURRENT_PROMPT,
+  dispatchWithSurface,
+  driver: evolutionaryDriver({ mutator: myReflectionMutator }),
+  populationSize: 4, maxGenerations: 3, promoteTopK: 2,
+  gate: defaultProductionGate({ holdoutScenarios: holdout, deltaThreshold: 0.02 }),
+  autoOnPromote: 'pr', ghOwner: 'tangle-network', ghRepo: 'my-agent',
+  runDir: '.runs/improve', labeledStore: store, captureSource: 'eval-run',
+})
+if (result.gateResult.decision === 'ship') console.log('PR:', result.prResult?.prUrl)
+```
+That is the whole integration. Everything not in this skeleton is substrate.
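
Note on §7 of the spec above (this note is not part of the diffed file): the spec says `composeGate` ships only if every sub-gate ships, and otherwise resolves by the precedence `arch_ceiling > model_ceiling > hold > need_more_work`. The following is a minimal sketch of that verdict-combining rule in isolation. `combineVerdicts` and `SEVERITY` are hypothetical names introduced here, and this is not the package's implementation; throwing on an empty input is also an assumption, chosen to match the spec's "no silent fallbacks" rule.

```typescript
// The five verdicts listed in §7 of the spec.
type GateDecision = 'ship' | 'hold' | 'need_more_work' | 'model_ceiling' | 'arch_ceiling'

// Higher number = worse verdict. The ordering of the non-ship verdicts follows
// the precedence quoted in the spec; the numeric encoding is this sketch's own.
const SEVERITY: Record<GateDecision, number> = {
  ship: 0,
  need_more_work: 1,
  hold: 2,
  model_ceiling: 3,
  arch_ceiling: 4,
}

// Hypothetical helper: fold sub-gate decisions into one verdict.
// Returns 'ship' only when every decision is 'ship'; otherwise the worst one.
function combineVerdicts(decisions: GateDecision[]): GateDecision {
  if (decisions.length === 0) {
    // Assumption: an empty composition is an error rather than a default 'ship'.
    throw new Error('combineVerdicts: no sub-gate decisions to combine')
  }
  return decisions.reduce((worst, d) => (SEVERITY[d] > SEVERITY[worst] ? d : worst))
}

console.log(combineVerdicts(['ship', 'ship']))
console.log(combineVerdicts(['ship', 'need_more_work', 'hold']))
console.log(combineVerdicts(['model_ceiling', 'hold', 'arch_ceiling']))
```

Encoding the precedence as a single severity table keeps the rule in one place, so a product-specific gate composed alongside `defaultProductionGate` cannot accidentally soften a worse verdict from another sub-gate.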

package/docs/design/product-self-improvement-loop.md ADDED

@@ -0,0 +1,146 @@
+# The product self-improvement loop — the finish-line target
+This is the **end state** every Tangle product agent (gtm, legal, tax, creative,
+agent-builder, blueprint, physim) converges to. It is the target the consumer
+migrations build toward — not a 1:1 port of whatever eval/improvement code a
+product has today.
+**Thesis.** A product agent is *one closed, automated self-improvement loop*
+that makes the agent measurably better over time while humans only approve
+PRs. A product should NOT have a "production loop" *and* a pile of `eval/*`
+CLIs *and* bespoke optimization orchestration. It has **one** loop, composed
+from the substrate. Everything else is deleted.
+Primitives reference: [`primitives-integration-spec.md`](./primitives-integration-spec.md).
+Engine internals: [`self-improvement-engine.md`](./self-improvement-engine.md).
+---
+## The loop (7 steps, exact substrate composition)
+```
+1. SAMPLE the eval matrix
+   scenarios = cartesian(
+     profileVariants,        // the surface(s) under test: baseline + candidates
+     productScenarios,       // the hard product tasks (gtm: attribution honesty, …)
+     personas,               // simulated users / drivers
+   )  ∪  productionFailures  // real failures pulled from the LabeledScenarioStore
+                             // (the flywheel: prod traces become eval scenarios)
+2. MEASURE — runCampaign
+   dispatch(scenario) = runMultishot({           // the multi-turn challenging flow
+     persona: scenario.persona,                  //   driver = simulated user
+     profile: productAgentProfile(surface),      //   worker = the agent under test
+     shape:   scenario.flow,                     //   the real task, many turns
+     tools:   productTools,                      //   real tools, real side-effects
+   }) → transcript artifact
+   judges = product ensemble (domain dimensions) → scorecard + bootstrap CIs
+   labeledStore: capture EVERY cell (scenario, artifact, score, source) → the dataset
+3. ANALYZE — trace analysts (runAnalystLoop / AnalystRegistry)
+   read the campaign traces → a research report (failure modes, why, where).
+   This REPLACES bespoke "failure clustering": the analyst is the richer,
+   LLM-driven version of "what should we improve and why".
+4. IMPROVE — runImprovementLoop( improvementDriver + agenticGenerator )
+   driver.propose({ report, dataset, … }) → candidate surfaces.
+   The agentic generator runs a coding harness in a worktree, reading the
+   report + the codebase, making REAL product changes — prompt, tools, AND
+   code — not just an addendum string. Each candidate is measured on a
+   HELD-OUT slice of the matrix.
+5. GATE — defaultProductionGate (+ domain gates, composed)
+   heldout-delta + budget + red-team + reward-hacking + canary, plus any
+   product-specific gate (e.g. anti-fabrication) and an overfit-gap check.
+   Verdict ∈ ship | hold | need_more_work | model_ceiling | arch_ceiling.
+6. PROMOTE — openAutoPr
+   the winning worktree → a PR against the product repo. Human approves → ships.
+   (autoOnPromote: 'pr'. Live self-mutation is deferred behind the full safety
+   stack.)
+7. LOOP
+   the shipped, improved agent runs in production → emits traces → the dataset
+   grows → back to (1). The loop is scheduled (cron) and/or triggered when the
+   analyst report crosses a severity threshold. Autonomous between PR approvals.
+```
+**One entry point, no new abstractions.** A product exposes a single
+`run<Product>ImprovementCycle()` that *composes* the substrate primitives
+above. It does NOT define `runFooPromptEvolution`, `FooOptimizer`,
+`FooProductionLoop`, etc. The substrate carries every name; the product only
+wires its domain pieces into the seams.
+---
+## What each product OWNS vs DELETES vs COMPOSES
+**OWNS (domain — stays, this is the product's value):**
+- `productScenarios` — the hard tasks the agent must handle.
+- `personas` — the simulated users that drive the multi-shot flows.
+- `judges` / rubrics / dimension weights — how "good" is defined.
+- `productTools` — the real tools the agent uses.
+- deterministic checks (anti-slop, format, forbidden-claim) — fast pre-judges.
+- domain gates (e.g. anti-fabrication) — composed into the gate.
+**DELETES (orchestration the substrate now owns):**
+- every `for (gen of generations)` mutate→score→select loop.
+- bespoke prompt-evolution / production-loop / analyst-loop wrappers.
+- trial-matrix construction, frontier tracking, seed plumbing, manifest
+  hashing, cell caching, scorecard aggregation, CI math.
+- PR-opening scaffolding, worktree git plumbing.
+- parallel `eval/*` CLIs that each re-implement a slice of the above.
+**COMPOSES (the substrate, in the one cycle):**
+- `runCampaign` (matrix measurement) · `runMultishot` (the dispatch flow) ·
+  `FsLabeledScenarioStore` (dataset) · analysts (report) ·
+  `runImprovementLoop` + `improvementDriver` + `agenticGenerator` (improve) ·
+  `defaultProductionGate` + `composeGate` (gate) · `openAutoPr` (promote).
+---
+## Definition of done (a product is "at the finish line" when)
+1. **One cycle, one entry.** A single `run<Product>ImprovementCycle()` composes
+   the substrate; the old eval/improvement systems are deleted, not coexisting.
+2. **Matrix eval is real.** `dispatch` runs genuine multi-shot persona↔agent
+   flows with real tools — not single-turn projections, not stubbed workers
+   (non-zero token usage is asserted).
+3. **The dataset is fed.** Every cell captures to `LabeledScenarioStore` with
+   correct provenance; production failures flow back in as scenarios.
+4. **Improvement is code-real.** The agentic generator produces worktree
+   changes (prompt/tools/code), measured on holdout — not just addendum-string
+   mutation.
+5. **The gate is honest.** Composed `defaultProductionGate` + domain gates +
+   overfit-gap; fails closed; holdout never overlaps train.
+6. **Promotion is a PR.** `openAutoPr` opens it; a human approves; nothing
+   auto-deploys.
+7. **It's scheduled + triggered.** Runs on cadence and/or when the analyst
+   report crosses severity; autonomous between approvals.
+8. **Tests + a real proof run.** Contract tests assert the wiring; one real
+   end-to-end cycle produces a scorecard and (on a shipping gate) a PR.
+Anything short of this is mid-migration, not done.
+---
+## gtm-agent — the worked instantiation (first reference build)
+| Loop step | gtm wiring |
+|---|---|
+| SAMPLE | profile variants of `OPERATOR_CEO_SYSTEM_PROMPT` + addendum; `GTM_LOOP_HOLDOUT_SCENARIOS` + `eval/business-owner/personas.json`; production failures from the trace store |
+| MEASURE | `dispatch` = `runMultishot(persona ↔ gtm-agent via runChatThroughRuntime, real tools)`; judges = the 3-model ensemble (`attribution_honesty`, `proposal_grounding`) + canonical 12-dim |
+| ANALYZE | trace analysts over the campaign traces → report (supersedes `FailureClusterConfig` clustering) |
+| IMPROVE | `improvementDriver` + `agenticGenerator` (claude harness) edits prompt/tools/code in a worktree, fed the report |
+| GATE | `composeGate(defaultProductionGate, antiFabricationGate, overfitGapGate)` |
+| PROMOTE | `openAutoPr` → PR against `tangle-network/gtm-agent` |
+**Deleted:** `eval/run-prompt-evolution.ts`, `eval/analyst-loop.ts`,
+`eval/optimization-campaign.ts`, `scripts/evals/*`, the orchestration body of
+`production-loop/index.ts` and `eval/canonical.ts`.
+**Kept:** scenarios, personas, judges, tools, deterministic checks, the
+`composeProductionLoopSystemPrompt` wiring.
+**Result:** one `runGtmImprovementCycle()`; ~3–4k LOC of scattered orchestration
+gone, replaced by a substrate composition.
+This gtm build is the reference the other six products copy.
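
Note on step 1 (SAMPLE) of the loop above (this note is not part of the diffed file): the eval matrix is described as the cartesian product of profile variants, product scenarios, and personas, unioned with production failures from the dataset. A minimal sketch of that construction follows. The `MatrixScenario` fields beyond `id` and `kind`, the id format, and the function names are all assumptions made for illustration; the package may build the matrix differently.

```typescript
// Scenario shape: `id` and `kind` come from the spec's Scenario interface;
// the remaining fields are hypothetical, for this sketch only.
interface MatrixScenario {
  id: string
  kind: string
  profile: string
  task: string
  persona: string
}

// Every (profile, task, persona) combination becomes one scenario.
function cartesian(profiles: string[], tasks: string[], personas: string[]): MatrixScenario[] {
  const out: MatrixScenario[] = []
  for (const profile of profiles) {
    for (const task of tasks) {
      for (const persona of personas) {
        out.push({ id: `${profile}:${task}:${persona}`, kind: 'matrix', profile, task, persona })
      }
    }
  }
  return out
}

// Union with production failures, de-duplicated by id (first occurrence wins).
function sampleMatrix(
  profiles: string[],
  tasks: string[],
  personas: string[],
  productionFailures: MatrixScenario[],
): MatrixScenario[] {
  const byId = new Map<string, MatrixScenario>()
  for (const s of [...cartesian(profiles, tasks, personas), ...productionFailures]) {
    if (!byId.has(s.id)) byId.set(s.id, s)
  }
  return Array.from(byId.values())
}

const matrix = sampleMatrix(
  ['baseline', 'candidate-1'],
  ['attribution-honesty', 'proposal-grounding'],
  ['business-owner'],
  [{ id: 'prod-failure-1', kind: 'production-failure', profile: 'baseline', task: 'attribution-honesty', persona: 'business-owner' }],
)
console.log(matrix.length)
```

De-duplicating by `id` means a production failure that was already promoted into the synthetic matrix is not measured twice, which would otherwise overweight it in the per-scenario aggregates.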

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.40.3",
+  "version": "0.40.5",
   "description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {