@tangle-network/agent-app 0.1.14 → 0.3.0

Files changed (12)

package/.claude/skills/eval-architect/SKILL.md +44 -0
package/.claude/skills/eval-bootstrap/SKILL.md +46 -0
package/.claude/skills/eval-campaign/SKILL.md +116 -0
package/.claude/skills/improve-conductor/SKILL.md +45 -0
package/.claude/skills/measurement-validation/SKILL.md +38 -0
package/.claude/skills/skill-evolution/SKILL.md +42 -0
package/.claude/skills/surface-evolution/SKILL.md +35 -0
package/dist/eval-campaign/index.d.ts +155 -0
package/dist/eval-campaign/index.js +140 -0
package/dist/eval-campaign/index.js.map +1 -0
package/dist/index.js +12 -12
package/package.json +18 -14

package/.claude/skills/eval-architect/SKILL.md ADDED

@@ -0,0 +1,44 @@
+---
+name: eval-architect
+description: Build a measurement that scores an agent's REAL deliverable — not a proxy — for a product you've never seen before. Use when scaffolding or repairing the eval an Improve loop optimizes against. Get this wrong and every downstream optimization perfects a fiction.
+---
+# Eval Architect — measure the real deliverable
+You are building the measurement an improvement loop will optimize against. **The loop optimizes whatever you measure.** If you measure the wrong thing, the loop perfects the wrong thing — confidently, expensively, and invisibly. The measurement is the product. Everything else in the Improve stack is downstream of getting this right.
+This skill is held by the agent that *builds* the eval (often a delegated coding agent). Pair it with `measurement-validation` (the gate that proves your eval is sound before anyone spends money on it).
+## The cardinal question
+**Where does this agent's deliverable actually land?** Prose in the reply? Validated tool calls? Persisted artifacts (vault docs, DB rows)? A PR? A rendered UI? Find out by inspecting *real runs* — never by assuming it's the chat text.
+> Worked failure (legal-agent, this is why the skill exists): the eval scored the assistant's chat prose. A tool-migration moved the deliverable into `submit_proposal` calls + vault docs, leaving the prose empty. Every scorer reading prose silently collapsed to ~0. The loop would have optimized an empty string. The deliverable had *moved* and the measurement didn't follow it.
+## Invariant (non-negotiable — violate these and the loop is a slot machine)
+1. **Score the produced artifact, not the conversation.** Locate the real output channel and score *that*.
+2. **For accumulating-artifact agents, score the CONVERGED multi-shot artifact, not turn 1.** Most real agents build their deliverable over several turns. Define a convergence criterion (e.g. the artifact stops growing for N shots) and score the converged state.
+3. **A held-out split exists and is never trained on.** No held-out → no honest gate → no trustworthy lift.
+4. **Every requirement has gold the scorer matches against, from real records — never fabricated.** A requirement with no gold means there is nothing to verify; fail loud, do not pass-by-default. A fluent hallucination that produced nothing must score 0, not 0.9.
+## Judgment (figure this out per product — the agentic core)
+- What *is* the deliverable here, and where does it persist? Read the runtime events / tool calls / storage, not the transcript.
+- What is the convergence criterion for this agent's artifact? When has it stopped accumulating?
+- What gold defines "correct" for each requirement, and where does it come from (real records, never invented figures)?
+- Which dimensions matter, and what are their weights? What is the one dimension that, if it regresses, kills the deal regardless of the composite (safety, hallucination, the regulated invariant)?
+## Self-test (prove the metric works before trusting it)
+- **Baseline sanity:** run it. Is the score non-zero and plausible for a competent agent? A near-zero baseline usually means you're scoring the wrong channel, not that the agent is terrible.
+- **The mutation test (the one that catches the empty-string bug):** hand-edit the produced artifact to be *obviously better* and *obviously worse*. Does the score move in the right direction and magnitude? A metric that doesn't move under obvious changes is measuring the wrong thing.
+- **Audit EVERY scoring surface together.** Completion, quality, and the optimizer's own scorer all read *something*. When the deliverable's channel moves, all of them that read the old channel silently zero. (Session: completion + quality were fixed; the optimizer's own scorer was missed and only found by tracing. Three surfaces — enumerate them, don't assume one.)
+## Evolves-by
+When a later optimization shows lift on *training* but none on *held-out*, your eval was overfittable or gameable — add the gap it missed as a new judgment rule. The architect's judgment surface is itself optimized by the meta-eval *"did evals built this way yield real held-out lift, no critical regression?"* See `skill-evolution`.
+## Fleet as dogfood
+legal / tax / gtm / creative / insurance each put their deliverable in a *different* channel — filings, forms, published copy, rendered artifacts, routed proposals. The skill is general precisely because it forces you to *locate* the channel for the product in front of you rather than hardcode "the reply text."
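
Note on eval-architect/SKILL.md: the mutation test in the Self-test section can be sketched in a few lines. This is a minimal illustration, not code from the package. The `Artifact` shape, the two scorers, the `GOLD` list, and the `minDelta` threshold are all assumptions made up for the example. It mirrors the worked failure in the skill: the deliverable sits in a tool-call channel while the prose is empty.

```typescript
// Hypothetical shapes for illustration only; none of these names come from the package.
interface Artifact {
  prose: string
  proposals: string[] // items submitted through a tool call
}
type Scorer = (artifact: Artifact) => number

interface MutationResult {
  baseline: number
  better: number
  worse: number
  passed: boolean
}

// Score the produced artifact plus two hand-edited variants, then check that the
// score rises for the better one and falls for the worse one by at least minDelta.
// minDelta is an assumed threshold; choose one that fits your rubric scale.
function mutationTest(
  score: Scorer,
  produced: Artifact,
  better: Artifact,
  worse: Artifact,
  minDelta = 0.05,
): MutationResult {
  const baseline = score(produced)
  const up = score(better)
  const down = score(worse)
  const passed = up - baseline >= minDelta && baseline - down >= minDelta
  return { baseline, better: up, worse: down, passed }
}

// Assumed gold items for the example.
const GOLD = ['clause-a', 'clause-b', 'clause-c', 'clause-d']

// Reads the chat prose: the channel the deliverable has left.
const proseScorer: Scorer = (a) => (a.prose.trim().length > 0 ? 1 : 0)

// Reads the tool-call channel: fraction of gold items actually submitted.
const channelScorer: Scorer = (a) =>
  GOLD.filter((g) => a.proposals.includes(g)).length / GOLD.length

const produced: Artifact = { prose: '', proposals: ['clause-a', 'clause-b'] }
const better: Artifact = { prose: '', proposals: [...GOLD] }
const worse: Artifact = { prose: '', proposals: [] }

const proseCheck = mutationTest(proseScorer, produced, better, worse)
const channelCheck = mutationTest(channelScorer, produced, better, worse)
```

Here `proseCheck.passed` is `false` because the prose scorer returns 0 for all three artifacts, which is the "metric that doesn't move" signal. `channelCheck.passed` is `true`, with scores 0.5, 1, and 0.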

package/.claude/skills/eval-bootstrap/SKILL.md ADDED

@@ -0,0 +1,46 @@
+---
+name: eval-bootstrap
+description: When a product has NO improvement infrastructure yet, build it for real — elicit the RIGHT target, ground the measurement in external truth, and construct a validated harness (often via a delegated agent-runtime build loop) BEFORE any optimization spend. The anti-toy, anti-circular skill: it exists so the improver moves what the user actually wants, not a measurable proxy it invented.
+---
+# Eval Bootstrap — build the apparatus for real, at cold start
+Cold start is the most dangerous moment in the whole stack. There is no eval, so the agent is tempted to invent one — and **an invented eval optimizes an invented target.** Your job here is to build a measurement of the thing the *user* actually values, grounded in something *real*, and prove it works *before a dollar is spent optimizing against it*. You are a builder, not a tuner: if the apparatus does not exist, you construct it — delegating a coding / agent-runtime build loop that runs to completion when the work is substantial.
+This skill is what makes the Improve button honest at cold start. Held by the delegated builder; gated by `improve-conductor`; it hands off to `surface-evolution` only once the harness is validated.
+## The two loops — never collapse them
+- **BUILD loop** — construct + *validate* the harness: representative scenarios, **externally-grounded** gold, a scorer that passes the mutation test, a held-out split, and the wiring for the single surface to evolve. The done-criterion is **`measurement-validation` passes** — not "the files exist." When the build is substantial, mechanical, or long-running, **delegate it to an agent-runtime driven loop** that builds to completion in its own sandbox and returns a validated harness.
+- **IMPROVE loop** — optimize the surface against the now-trusted harness (`surface-evolution`). **Starts only after BUILD exits clean.**
+"I built an eval and it shows improvement," said in one breath, is the toy. The gate between the two loops is the product.
+## Invariant (non-negotiable)
+1. **No optimization spend until: (a) the target is user-confirmed and tied to a product-value claim, (b) the gold is grounded in EXTERNAL real truth, and (c) the harness passed `measurement-validation`.** The BUILD loop returns a *validated* harness or it isn't done.
+2. **Gold is grounded in external reality** — the user's accepted past outputs, reference documents, real records, or human labels. **NEVER gold the agent generates and then optimizes against.** That is grading its own homework: the number always rises and means nothing. If you cannot name the external source of the gold, it is circular — stop.
+3. **The target is the thing the user would REJECT a draft over** — not the easiest thing to measure. If you're scoring length / format / keyword presence while the user cares about correctness / usefulness / persuasiveness, you are building a toy. Re-elicit.
+4. **If grounding does not exist, ACQUIRE it** — ask the user for examples, pull references, label a seed set — do not fabricate it. (This is where `@tangle-network/agent-app/knowledge-loop`'s source-grounded, propose-don't-apply acquisition plugs in.)
+## Judgment (figure this out per product)
+- What does the user *actually* value about this artifact? Extract it, confirm it, phrase it as a product-value claim. The user is often unsure — anticipate the decision-relevant quality and propose it.
+- What external truth can ground it, and is there *enough* (the data threshold)? If not, what's the cheapest way to acquire real grounding?
+- What's the *minimal real* harness — fewest scenarios, simplest scorer — that still measures the real thing? **Small and real beats big and toy.**
+- Build inline, or delegate an agent-runtime loop to construct it? Delegate when it's substantial or long-running; the loop is accountable for returning a *validated* harness.
+## Self-test
+- **The "would the user agree?" test:** show the user 2–3 scored examples — one high, one low. Do they agree with the scores? If not, the measurement is wrong; fix it before optimizing. This single check kills most toys.
+- **The mutation test:** an obviously-better and an obviously-worse artifact move the score in the right direction and magnitude. (A metric that doesn't move is measuring the wrong channel — see `eval-architect`.)
+- **The non-circularity check:** name the external source of the gold. If you can't, it's circular — stop.
+- **It RUNS, not just compiles:** a baseline produces a real, non-zero, plausible score against the grounded gold.
+## Evolves-by
+When an improve loop later ships a "win" the user *rejects*, the bootstrap mis-framed the target or mis-grounded the gold — that rejection becomes a sharper elicitation / grounding rule. The bootstrap's judgment surface is optimized by the meta-eval *"did harnesses built this way produce lifts the user accepted as real?"* See `skill-evolution`.
+## Why this is the accountable skill
+The improver is tasked with *moving the thing the user wants moved* — end to end: elicit the target, ground it, build the apparatus (constructing it for real when it's missing), run the loop, report the honest lift, iterate to threshold or budget. It is accountable to the real improvement, not to "I ran a loop." But it will **not** spend the user's money optimizing a target it made up — it builds the right measurement first, or it tells the user what it needs to.
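
Note on eval-bootstrap/SKILL.md: the non-circularity check ("name the external source of the gold") could be enforced mechanically before any run. The sketch below is an assumption about how one might encode it. `GoldEntry`, `GoldSource`, and `findGroundingProblems` are hypothetical names that do not appear in the package.

```typescript
// Hypothetical types for illustration; not part of @tangle-network/agent-app.
type GoldSource =
  | 'user-accepted-output'
  | 'reference-document'
  | 'real-record'
  | 'human-label'
  | 'agent-generated'

interface GoldEntry {
  requirementId: string
  expected: string
  source: GoldSource
  sourceRef?: string // where the gold came from, e.g. a document id (assumed field)
}

// Returns one problem string per requirement that lacks externally grounded gold.
function findGroundingProblems(requirementIds: string[], gold: GoldEntry[]): string[] {
  const problems: string[] = []
  for (const id of requirementIds) {
    const entry = gold.find((g) => g.requirementId === id)
    if (entry === undefined) {
      problems.push(`${id}: no gold, nothing to verify`)
    } else if (entry.source === 'agent-generated') {
      problems.push(`${id}: gold is agent-generated (circular)`)
    } else if (!entry.sourceRef) {
      problems.push(`${id}: external source not named`)
    }
  }
  return problems
}

// Fail loud: refuse to proceed instead of passing by default.
function assertGrounded(requirementIds: string[], gold: GoldEntry[]): void {
  const problems = findGroundingProblems(requirementIds, gold)
  if (problems.length > 0) {
    throw new Error(`gold is not grounded:\n${problems.join('\n')}`)
  }
}

// Example data, invented for the sketch.
const requirements = ['r1', 'r2', 'r3']
const exampleGold: GoldEntry[] = [
  { requirementId: 'r1', expected: 'filed in Delaware', source: 'real-record', sourceRef: 'filing-2023-001' },
  { requirementId: 'r2', expected: 'net 30 terms', source: 'agent-generated' },
]
const groundingProblems = findGroundingProblems(requirements, exampleGold)
```

In this example `groundingProblems` has two entries: `r2` is circular and `r3` has no gold at all, so `assertGrounded(requirements, exampleGold)` would throw.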

package/.claude/skills/eval-campaign/SKILL.md ADDED

@@ -0,0 +1,116 @@
+---
+name: eval-campaign
+description: Wire a product agent's self-improvement loop (measure → optimize → gate → ship) onto the shared @tangle-network/agent-app/eval-campaign scaffold. Use when adding or refactoring any product agent's eval/ loop.
+---
+# Wiring a product onto the eval-campaign scaffold
+You are integrating a product agent's self-improvement loop. The loop **engine already exists** in the substrate — do not rebuild it. Your job is to supply the three things only the product knows, and call one function.
+## Mental model (read first)
+`selfImprove` (from `@tangle-network/agent-eval/contract`, re-exported here) owns the entire cycle:
+- the **train/holdout split** from a flat `scenarios` array,
+- the **driver** (default `gepaDriver` from your `mutationPrimitives`),
+- the **held-out production gate** (default `defaultProductionGate`, `deltaThreshold` 0.05),
+- **durable provenance** + optional hosted ingest,
+- every budget/seed/storage default.
+A product brings exactly three things:
+1. **`scenarios`** — your corpus (personas / cases / tasks) in the substrate `Scenario` shape.
+2. **`agent`** — `(surface, scenario, ctx) => artifact`: run your agent under the current surface (a system-prompt addendum the loop optimizes) and return the artifact your judge scores. Report real cost via `ctx.cost.observe(...)` so the backend-integrity guard sees a real run.
+3. **`judge`** — score an artifact on your rubric. Use `buildEnsembleJudge` (below) for a multi-model ensemble, or hand-write a `JudgeConfig` for a bespoke composite.
+Everything else is a default you override only when you have a reason.
+## The one import
+```ts
+import {
+  selfImprove,
+  buildEnsembleJudge,
+  type SelfImproveOptions,
+  type JudgeVerdict,
+} from '@tangle-network/agent-app/eval-campaign'
+```
+> Requires `@tangle-network/agent-eval >= 0.81.0` (peer). The scaffold composes the substrate downward; never import a product package from agent-eval (layering rule).
+## Minimal wiring (copy, then fill the three blanks)
+```ts
+const RUBRIC = ['accuracy', 'grounding', 'tone'] as const
+type Dim = (typeof RUBRIC)[number]
+const judge = buildEnsembleJudge<MyArtifact, MyScenario, Dim>({
+  name: 'my-product',
+  rubric: RUBRIC,
+  judgeReps: 3,                         // 3 uncorrelated judges → inter-rater bands
+  async scoreOne({ artifact, scenario, rep }) {
+    const model = JUDGE_MODELS[rep % JUDGE_MODELS.length]   // vary the model per rep
+    try {
+      const v = await callMyJudge(model, artifact, scenario) // → { accuracy, grounding, tone }
+      return { model, perDimension: v, rationale: v.note, costUsd: v.cost }
+    } catch (err) {
+      return { model, perDimension: null, rationale: String(err) } // failure ≠ zero
+    }
+  },
+})
+const result = await selfImprove<MyScenario, MyArtifact>({
+  scenarios: loadMyScenarios(),         // YOU own
+  agent: dispatchUnderSurface,          // YOU own — (surface, scenario, ctx) => artifact
+  judge,                                // built above
+  baselineSurface: '',                  // the addendum the loop optimizes (start empty)
+  mutationPrimitives: MY_DIRECTIVES,    // the optimization levers (default driver mutates toward these)
+  runDir: process.env.MY_RUN_DIR,       // a real path → durable provenance; omit → in-memory
+  // budget / model / gate / hostedTenant all default — override only when needed
+})
+if (result.gate.decision === 'ship') await ship(result.winnerSurface)
+```
+## `buildEnsembleJudge` contract
+- `scoreOne` is called `judgeReps` times per artifact; **vary the model by `rep`** so the ensemble is uncorrelated (judges sharing a base model share its bias).
+- Return `{ model, perDimension: null }` to record a judge failure **without** killing the ensemble — the reducer means over survivors.
+- The reducer (`aggregateJudgeVerdicts`) **throws only if every rep failed** → the campaign records a failed cell, never a silent zero.
+- `weights` (partial) selects-and-weights named dimensions; default is uniform.
+## Config reference (all `SelfImproveOptions`, all optional unless noted)
+| Field | Default | When to set |
+|---|---|---|
+| `scenarios` | — (required) | your corpus |
+| `agent` | — (required) | your dispatch under a surface |
+| `judge` | — (required) | `buildEnsembleJudge` or a `JudgeConfig` |
+| `baselineSurface` | — (required) | the surface the loop optimizes; start `''` |
+| `mutationPrimitives` | gepaDriver's own | your optimization levers (additive directives) |
+| `driver` | `gepaDriver` | pass `evolutionaryDriver({ mutator })` for blind addendum rotation |
+| `gate` | `defaultProductionGate` (Δ 0.05) | `paretoSignificanceGate` for multi-objective; tune `deltaThreshold` for your rubric scale |
+| `budget` | 3 gens × pop 2, 0.25 holdout | `budget.reps` (replicates → tighter CIs), `budget.promoteTopK`, `budget.holdoutScenarios` (explicit split), `budget.dollars` (cost cap) |
+| `expectUsage` | **`'assert'`** | the fail-loud backend-integrity guard. Leave at `'assert'` for real runs (a stub cell throws); set `'off'` ONLY for a deterministic offline/replay run |
+| `labeledStore` | off | capture every artifact + judge score (the dataset you ship + few-shot corpus); set `captureSource` (default `'eval-run'`) |
+| `analyzeGeneration` | — | the per-generation findings producer (EYES→HANDS) — plug a trace-analyst / HALO to refresh `ctx.findings` each round |
+| `runDir` | `mem://…` (non-durable) | a real path to persist provenance + spans |
+| `hostedTenant` | off | ship eval-run events to a hosted orchestrator |
+| `collectWorkerRecords` | — | return the per-call `RunRecord`s your agent accumulated → real backend-integrity verdict |
+| `onProgress` | — | stream baseline/generation/gate events to a UI |
+## Fail-loud contract (do not break)
+- In `agent`, report real cost via `ctx.cost.observe(costUsd, label)` + `ctx.cost.observeTokens(...)`. A dispatch that reports `{0,0}` trips `expectUsage` — that is the honest "ran against a stub" signal; never paper over it.
+- A judge failure is `perDimension: null`, never a fabricated zero.
+- Train and holdout must both be non-empty (`selfImprove` derives the split; supply enough scenarios).
+## Anti-patterns (these are what this scaffold deletes)
+- ❌ Hand-rolling `runImprovementLoop({...})` + `emitLoopProvenance({...})` + a train/holdout split. That is ~100 lines of identical boilerplate per product. Call `selfImprove`.
+- ❌ A per-product copy of the judge-ensemble reducer (survivor-mean / disagreement / cost-sum). Use `buildEnsembleJudge` → `aggregateJudgeVerdicts`.
+- ❌ `import type` from a product package inside the scaffold or substrate (upward dependency — forbidden).
+## Where it lives in the product
+One file: `eval/self-improve.ts`. It exports `runMyEval` (measure: `selfImprove` with `budget.generations = 0`, or `runCampaign`) and `runMySelfImprovement` (optimize: the wiring above). The product's harness/CLI calls these; nothing else duplicates the loop.
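
Note on eval-campaign/SKILL.md: the `buildEnsembleJudge` contract says a failed rep is recorded as `perDimension: null`, the reducer means over survivors, and it throws only when every rep failed. The standalone sketch below illustrates that behavior. It is not the package's `aggregateJudgeVerdicts`, whose actual signature and return shape are not shown in this diff. `RepVerdict` and `meanOverSurvivors` are hypothetical names, and summing `costUsd` across all reps is an assumption.

```typescript
// Hypothetical shapes for illustration; not the real agent-eval types.
interface RepVerdict {
  model: string
  perDimension: Record<string, number> | null // null marks a judge failure
  costUsd?: number
}

interface EnsembleScore {
  perDimension: Record<string, number>
  survivors: number
  costUsd: number
}

function meanOverSurvivors(verdicts: RepVerdict[], dims: readonly string[]): EnsembleScore {
  const scored: Array<{ model: string; perDimension: Record<string, number> }> = []
  let costUsd = 0
  for (const v of verdicts) {
    costUsd += v.costUsd ?? 0 // failed reps may still have cost money
    if (v.perDimension !== null) scored.push({ model: v.model, perDimension: v.perDimension })
  }
  if (scored.length === 0) {
    // Every rep failed: surface a failed cell, never a silent zero.
    throw new Error('every judge rep failed')
  }
  const perDimension: Record<string, number> = {}
  for (const dim of dims) {
    let sum = 0
    for (const s of scored) {
      const value = s.perDimension[dim]
      if (value === undefined) throw new Error(`judge ${s.model} omitted dimension ${dim}`)
      sum += value
    }
    perDimension[dim] = sum / scored.length
  }
  return { perDimension, survivors: scored.length, costUsd }
}

// Example: three reps, one of which failed.
const exampleVerdicts: RepVerdict[] = [
  { model: 'judge-a', perDimension: { accuracy: 1, tone: 0.5 }, costUsd: 0.25 },
  { model: 'judge-b', perDimension: null },
  { model: 'judge-c', perDimension: { accuracy: 0.5, tone: 0.5 }, costUsd: 0.25 },
]
const ensemble = meanOverSurvivors(exampleVerdicts, ['accuracy', 'tone'])
```

With this input `ensemble.survivors` is 2, `accuracy` is 0.75, and `tone` is 0.5. The failed rep neither drags the mean toward zero nor aborts the ensemble.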

package/.claude/skills/improve-conductor/SKILL.md ADDED

@@ -0,0 +1,45 @@
+---
+name: improve-conductor
+description: The user-facing controller for the Improve button. Decide whether a request is improvable, translate a dollar budget into a run, read the verdict honestly, and promote or refuse with a reason. Never promise a lift you cannot measure.
+---
+# Improve Conductor — own the user's trust
+You are the agent the user talks to when they click **Improve**. You do not build the eval or run the loop yourself — you decide *whether* to, *how much* to spend, and *what to tell the user about the result*. The product you are protecting is **trust**, not lift. You would rather say "I could not prove an improvement — here is what another $X buys" than ship noise and call it a win.
+You delegate the building to an agent holding `eval-architect` + `surface-evolution`; you both share `measurement-validation` as the honesty contract.
+## Invariant (non-negotiable)
+1. **Never promise or report a lift you cannot measure with valid paired evidence.** Surface the honest verdict: `ship` / `hold` / `need-more-data` / `invalid`. "Invalid" (incomplete or unpaired evidence) is a first-class outcome — say it plainly, never paper over it with a survivor-mean number.
+2. **Refuse below the data threshold, and say why** — "I have N real outcomes; I won't optimize below M. Here's how to get to M." A refusal with a reason builds more trust than a fabricated win.
+3. **Route correctly.** Improvable by surface-tuning → dispatch `surface-evolution`. Needs a new capability or architecture → escalate and say so; don't pretend tuning will fix a structural gap.
+4. **No optimization spend before the target is confirmed and the measurement is real.** If there is no improvement infrastructure yet, you do NOT improvise a metric and start spending — you dispatch `eval-bootstrap` to BUILD a validated, externally-grounded harness first. The gate between "build the apparatus" and "spend optimizing" is yours to hold.
+## Cold start — no infrastructure yet
+The most dangerous request is "improve this" for a product with no eval. The wrong move is to invent a metric and start a loop — you'll perfect a proxy and report a fake win. The right move is a strict two-step you orchestrate:
+1. **Frame + build (no spend):** confirm with the user *what "better" means* — the thing they'd reject a draft over, tied to a product-value claim — then dispatch `eval-bootstrap` (often a delegated agent-runtime build loop) to construct a harness grounded in **external truth**, exiting only when `measurement-validation` passes. The improver is a *builder* here, not a tuner.
+2. **Then optimize (spend):** only once the harness is validated, dispatch `surface-evolution` against it.
+Never let the user believe step 2 happened when only a toy of step 1 did. If you can't yet build a real measurement (no grounding, target unclear), say so and ask for what you need — that's the honest move, not a loop against an invented number.
+## Judgment (figure this out per request)
+- Is this a surface-tuning problem or an architectural one? (If the agent literally cannot do the task, no prompt edit fixes it.)
+- Translate the user's dollars into a run: more spend = wider candidate search + more reps = tighter CI + higher chance of clearing the gate. $0.20 ≈ one quick generation on a couple scenarios; $50 ≈ multi-generation search with a held-out gate that can actually reach significance.
+- When to stop: threshold met, plateaued, or budget exhausted — and report which.
+## Self-test
+- **Before spending,** you can state out loud: the metric, its variance, the threshold, the held-out set, and what this budget buys. If you can't, you're not ready to charge for the click.
+- **After,** you report the gated lift with its CI and the decision's *reason*. If the run came back `invalid` (a cell errored, evidence unpaired), you tell the user that and offer the re-run — you do not quote the broken number.
+## Evolves-by
+User accept/reject of promotions; spend→lift efficiency; the rate of `invalid` runs. A rising invalid rate is a signal the measurement or the infra needs hardening — route it back to `measurement-validation` / `eval-architect`, don't absorb it silently. See `skill-evolution`.
+## Why this is calibrated, not timid
+A naive Improve button maximizes the displayed number and tells the user "improved +47%". The disciplined one, faced with the same +47, checks the evidence, finds it unpaired, and says "I found a promising candidate but can't yet prove it beats baseline — $X more will confirm it." The second one is the one people pay for twice.
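
Note on improve-conductor/SKILL.md: the routing described in the Invariant and Cold start sections reads like a small decision function. The sketch below is one possible encoding under stated assumptions. The `ImproveRequest` fields, the `Route` type, and the order of the checks are hypothetical and are not defined by the skill or the package.

```typescript
// Hypothetical request and route shapes, invented for illustration.
interface ImproveRequest {
  needsNewCapability: boolean // a structural gap that tuning cannot fix
  targetConfirmed: boolean // the user confirmed what "better" means
  harnessValidated: boolean // measurement-validation has passed
  realOutcomes: number // N real outcomes observed so far
  minOutcomes: number // M, the data threshold
}

type Route =
  | { action: 'escalate'; reason: string }
  | { action: 'refuse'; reason: string }
  | { action: 'dispatch'; skill: 'eval-bootstrap' | 'surface-evolution'; reason: string }

// The order of checks is an assumption; the skill does not prescribe one.
function routeImproveRequest(req: ImproveRequest): Route {
  if (req.needsNewCapability) {
    return { action: 'escalate', reason: 'needs a new capability or architecture; tuning will not fix it' }
  }
  if (!req.targetConfirmed || !req.harnessValidated) {
    return { action: 'dispatch', skill: 'eval-bootstrap', reason: 'no validated harness yet; build before spending' }
  }
  if (req.realOutcomes < req.minOutcomes) {
    return {
      action: 'refuse',
      reason: `I have ${req.realOutcomes} real outcomes; I won't optimize below ${req.minOutcomes}.`,
    }
  }
  return { action: 'dispatch', skill: 'surface-evolution', reason: 'target confirmed and harness validated' }
}

// Example: a cold-start request with no eval in place.
const coldStart = routeImproveRequest({
  needsNewCapability: false,
  targetConfirmed: false,
  harnessValidated: false,
  realOutcomes: 0,
  minOutcomes: 20,
})
```

`coldStart` routes to `eval-bootstrap`, matching the two-step cold-start rule: no optimization spend until the harness is built and validated.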

package/.claude/skills/measurement-validation/SKILL.md ADDED

@@ -0,0 +1,38 @@
+---
+name: measurement-validation
+description: Prove a measurement is sound BEFORE spending money optimizing against it. The gate that decides whether an Improve run is allowed to start, and whether its result is allowed to be believed. Refuse metrics whose noise exceeds the effect, that have no held-out split, or whose evidence is incomplete.
+---
+# Measurement Validation — earn the right to optimize
+Optimization is only as trustworthy as the measurement under it. This skill is the gate on both ends: **before** a paid run (is this metric allowed to be optimized?) and **after** (is this result allowed to be believed?). It is the difference between an Improve button that is a product and one that is a slot machine.
+Held by both the orchestrator (`improve-conductor`) and the builder (`eval-architect`). It is the shared honesty contract.
+## Invariant (non-negotiable)
+1. **Refuse to optimize if CV(metric) > the target delta.** If the run-to-run noise is bigger than the effect you're paying to move, the metric *cannot* validate the change — raise reps or fix the metric first. Do not tune against noise.
+2. **Refuse to report a lift over INCOMPLETE or UNPAIRED evidence.** Every held-out scenario must have a non-errored cell on *both* the baseline and the candidate side. Below the paired-n floor (≥3), the run is **invalid**, not a verdict. A lift computed over survivors is worse than no number.
+   > **Enforced by** `trustVerdicts` from `@tangle-network/agent-app/eval-campaign` — the after-gate: IRR floor + per-item rater spread (within-item, never pooled) + survivor floor, with each failed check named in `trustReasons`.
+3. **Every metric ties to a product-value claim** — "if this number moves, *this* user-visible outcome moves with it." No claim → it's a proxy → don't optimize it.
+4. **Below the data threshold of real outcomes, refuse to optimize** — state N and say why. You cannot improve what you have not yet observed enough of.
+> Worked failures (this is why the skill exists):
+> - **Noise read as signal:** ~6 optimization rounds were burned chasing ±0.15 run-to-run swings as if they were real. The metric's variance was 3× any prompt delta — every conclusion was unprovable. The bug was the *measurement*, not the model.
+> - **A lift that was a lie:** a GEPA run reported `heldOutLift = +47`. Reading the actual cells: 2 of 4 held-out cells had errored, so "baseline" was *delaware alone* (42) and "winner" was *saas alone* (89) — two different personas. The +47 was differencing unlike cells. The gate correctly held (0 valid pairs), but the headline number a naive promoter would have shipped was fiction.
+## Judgment (figure this out per metric)
+- How many reps establish variance for *this* metric? (Noisy targets need 5+, converged-artifact metrics fewer.)
+- Is an observed "noisy" result model variance, or a measurement smell? **Default: suspect the metric** until its CI is shown tighter than the effect.
+- Where might this metric diverge from real value (the Goodhart risk specific to this product)?
+## Self-test
+- Report **mean ± 95% CI over K converged rollouts.** Show CI < target delta *before* greenlighting spend. If you can't, you haven't earned the right to optimize yet.
+- Confirm the held-out split is disjoint from training and large enough that the paired-n floor survives an errored cell.
+- **Verify against ground truth, never the summary.** Read the actual cells / artifacts, not the provenance headline. (The +47 above was sitting right there in the summary; only the cells revealed it was unpaired. A green build-hook is not a successful build; a typechecking harness is not a running one; a reported lift is not a measured lift.)
+## Evolves-by
+Track promotions that passed validation but regressed in production → that's a missed variance source or an unguarded dimension; strengthen the preflight. The validation bar itself is a surface that tightens from its own misses. See `skill-evolution`.
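
Note on measurement-validation/SKILL.md: the first two invariants (noise versus effect, and paired evidence) are easy to get wrong in code, so here is a hedged sketch. It is not the package's `trustVerdicts`; `Cell`, `pairedLift`, and `coefficientOfVariation` are hypothetical names. Treating the target delta as a relative fraction comparable to the CV is an assumption. The example cells reconstruct the "+47" worked failure as I read it: two held-out scenarios, with one errored cell on each side.

```typescript
// Hypothetical cell shape: either a score or an error, never both.
type Cell = { scenarioId: string; score: number } | { scenarioId: string; error: string }

type LiftResult =
  | { status: 'invalid'; pairedN: number }
  | { status: 'valid'; pairedN: number; lift: number }

// Lift is the mean of per-scenario deltas over scenarios scored on BOTH sides.
function pairedLift(baseline: Cell[], candidate: Cell[], pairedFloor = 3): LiftResult {
  const baseScores = new Map<string, number>()
  for (const cell of baseline) {
    if ('score' in cell) baseScores.set(cell.scenarioId, cell.score)
  }
  const deltas: number[] = []
  for (const cell of candidate) {
    if (!('score' in cell)) continue
    const base = baseScores.get(cell.scenarioId)
    if (base !== undefined) deltas.push(cell.score - base)
  }
  if (deltas.length < pairedFloor) return { status: 'invalid', pairedN: deltas.length }
  const lift = deltas.reduce((sum, d) => sum + d, 0) / deltas.length
  return { status: 'valid', pairedN: deltas.length, lift }
}

// Sample coefficient of variation across repeated runs of the same configuration.
function coefficientOfVariation(scores: number[]): number {
  const n = scores.length
  if (n < 2) throw new Error('need at least two reps to estimate variance')
  const mean = scores.reduce((sum, x) => sum + x, 0) / n
  if (mean === 0) throw new Error('mean is zero; CV is undefined')
  const variance = scores.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / (n - 1)
  return Math.sqrt(variance) / Math.abs(mean)
}

// Assumes targetDelta is expressed as a relative fraction, like the CV.
function noiseExceedsEffect(repScores: number[], targetDelta: number): boolean {
  return coefficientOfVariation(repScores) > targetDelta
}

// The "+47" case: survivor means are 42 and 89, but no scenario is scored on both sides.
const fortySeven = pairedLift(
  [{ scenarioId: 'delaware', score: 42 }, { scenarioId: 'saas', error: 'cell errored' }],
  [{ scenarioId: 'delaware', error: 'cell errored' }, { scenarioId: 'saas', score: 89 }],
)
```

`fortySeven` comes back `invalid` with `pairedN` equal to 0, so no lift is reported at all, which is the behavior the skill asks for in place of a survivor-mean headline.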

package/.claude/skills/skill-evolution/SKILL.md ADDED

@@ -0,0 +1,42 @@
+---
+name: skill-evolution
+description: How every skill in the Improve family stays agentic and general instead of rotting into a brittle rulebook. A skill is a measured hypothesis — a few human-owned invariants plus a wide loop-owned judgment surface that improves from outcome data via its own meta-eval. This is the recursion that lets the agent builder learn to build improvable agents.
+---
+# Skill Evolution — the skill is a surface, too
+A skill written as a fixed checklist is brittle: it can't handle the product nobody anticipated, and it goes stale silently. A skill written as vibes is unaccountable. The resolution is to give every skill the *same* structure the Improve loop optimizes — a small frozen core and a wide evolvable surface — and then point the loop at the skill itself.
+This is the meta-skill. It governs `eval-architect`, `measurement-validation`, `surface-evolution`, and `improve-conductor`, and it is what makes them general rather than legal-specific (or any-product-specific).
+## The 4-part contract (every skill in this family follows it)
+- **Invariant** — the 1–2 laws that, if violated, turn the loop into a slot machine. **Human-owned. Frozen.** Few. ("Gate on held-out." "Score the real deliverable." "Fail loud on incomplete evidence." "Never promote a regression on a guarded dimension.")
+- **Judgment** — what the agent figures out for *this* product. **Loop-owned. Wide.** This is the agentic surface — the place the agent is *supposed* to think, not follow steps.
+- **Self-test** — a checkable signal the agent actually ran to verify it did the work right (the mutation test, CI < delta, diff-the-deployed-surface). Not "I followed the procedure" — a *result*.
+- **Evolves-by** — the outcome data that updates the *judgment* surface. Never the invariants.
+**The split is the answer to "how do you keep it agentic and not a dumb rulebook":** few invariants hold the line; judgment is broad and loop-owned; outcomes are measured; the judgment surface self-revises. The agent stays free to solve the novel case — it just cannot violate the handful of laws that make optimization mean anything.
+## The recursion
+Each skill's *judgment* surface is itself an evolvable surface, optimized by the **same loop the skill describes**, with a verifiable reward:
+> *Did following this skill produce an eval that yielded real held-out lift, with no critical-dimension regression?*
+Above a data threshold of real runs, the skill proposes revisions to its own judgment — gated identically (held-out, critical-dimension floor, paired-n ≥ floor). **Invariants are the frozen surface; judgment is the evolvable surface.** A skill improving itself is just `surface-evolution` pointed inward.
+## The north-star: the agent builder, closed-loop
+The agent builder builds agents *and* the evals that improve them. Its success is **not** "wrote an eval file." It is the verifiable reward above, applied to the agent it just built:
+> *The eval the builder produced yields real held-out lift on the agent the builder built — no Goodhart regression.*
+The fleet — **legal, tax, gtm, creative, insurance** — is the training distribution. Each is `{an agent + a known set of gaps}`. The builder's score is how much of each gap its produced eval+loop closes on held-out scenarios. The first dogfood data point already exists: legal-agent's loop, repaired in the session that produced these skills, found a transferable jurisdictional-divergence rule and the gate *correctly refused to ship it until the evidence was valid* — a clean demonstration that the builder's reward must be "real, evidence-backed lift," never "a number went up."
+## Anti-patterns (the rulebook smells)
+- A skill that lists steps but has **no self-test** — you can't tell if following it worked.
+- An "invariant" that's really a **judgment call in disguise** — over-constraining; it should live in Judgment so the agent can adapt it per product.
+- A judgment surface with **no evolves-by hook** — it will rot, and nothing will notice.
+- A reported lift with **no held-out or no paired-n** — the slot machine. This is the one that ends the product.
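
Note on skill-evolution/SKILL.md: two of the four anti-patterns (no self-test, no evolves-by hook) are structural and could be caught by a simple lint over a parsed skill. The sketch below assumes a hypothetical `SkillContract` shape; the package does not define one in this diff. The other two anti-patterns (a judgment call disguised as an invariant, a lift with no held-out or paired-n) need human review or run data and are out of scope here.

```typescript
// Hypothetical representation of a parsed SKILL.md; invented for illustration.
interface SkillContract {
  name: string
  invariants: string[]
  judgment: string[]
  selfTest: string[]
  evolvesBy: string[]
}

// Returns the structural "rulebook smells" that can be detected without running anything.
function contractSmells(skill: SkillContract): string[] {
  const smells: string[] = []
  if (skill.selfTest.length === 0) {
    smells.push('no self-test: cannot tell whether following the skill worked')
  }
  if (skill.judgment.length > 0 && skill.evolvesBy.length === 0) {
    smells.push('judgment surface with no evolves-by hook: it will rot unnoticed')
  }
  return smells
}

// Example contracts, with entries paraphrased from the skills in this diff.
const wellFormed: SkillContract = {
  name: 'measurement-validation',
  invariants: ['Refuse to report a lift over incomplete or unpaired evidence'],
  judgment: ['How many reps establish variance for this metric?'],
  selfTest: ['Report mean with 95% CI over K converged rollouts'],
  evolvesBy: ['Promotions that passed validation but regressed in production'],
}

const checklistOnly: SkillContract = {
  name: 'steps-only',
  invariants: ['Gate on held-out'],
  judgment: ['Which dimensions matter?'],
  selfTest: [],
  evolvesBy: [],
}

const wellFormedSmells = contractSmells(wellFormed)
const checklistSmells = contractSmells(checklistOnly)
```

`wellFormedSmells` is empty, while `checklistSmells` reports both structural smells.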

package/.claude/skills/surface-evolution/SKILL.md ADDED

@@ -0,0 +1,35 @@
+---
+name: surface-evolution
+description: Optimize ONE evolvable surface (a prompt section / tool config) against a validated measurement, gate winners on held-out evidence plus a critical-dimension floor, and promote without offline/online drift. Use to run an Improve loop once eval-architect + measurement-validation pass.
+---
+# Surface Evolution — run the gated loop, promote without drift
+You are a closed-loop controller for agent quality. **Sensor** = the eval (built by `eval-architect`, certified by `measurement-validation`). **Controller** = the driver that proposes surface rewrites. **Actuator** = promotion (writing the surface to the live agent). **Safety interlock** = the gate. The interlock is the entire point: it prefers *under-promotion* to Goodhart. A loop that ships every apparent gain is worthless; a loop that ships only evidence-backed gains is the product.
+The engine exists in the substrate (`@tangle-network/agent-eval/contract` `selfImprove` / `runImprovementLoop`, `gepaDriver`, `defaultProductionGate`) — re-exported via `@tangle-network/agent-app/eval-campaign`. **Do not rebuild it.** This skill is how you wire and run it safely.
+## Invariant (non-negotiable)
+1. **Optimize exactly ONE surface that production renders identically.** The artifact you mutate offline must be the artifact the live agent loads — one source, rendered both places (e.g. an evolvable prompt section materialized from a single file into the live system prompt). If offline and online diverge, the lift is fictional the moment it ships.
+2. **Gate promotion on a held-out split AND a critical-dimension floor.** Never promote a net composite gain that regresses a guarded dimension (safety, hallucination, the regulated invariant). A +10 composite that loses 30 on hallucination is a regression, not a win.
+3. **Budget is a hard ceiling and cost-aware** — skip cells beyond the ceiling, never abort. The user's spend maps to generations × candidates × reps: $0.20 buys one quick generation; $50 buys a wide search with tight CIs.
+4. **Never evolve a frozen surface.** The regulated invariants — human-in-the-loop, the compliance gate, auth/RBAC — are off-limits. Declare exactly what is evolvable; everything else the loop must not touch.
+> Worked result (this is the skill working): a GEPA run on the legal addendum proposed a "multi-jurisdictional divergence handling" section, diagnosed from the worst *training* personas. The gate guarded `hallucination_free` (no regression) and required held-out significance. When the held-out evidence came back incomplete, it **refused to ship** — exactly right. Trust > lift.
+## Judgment (figure this out per product)
+- Which surface is safe to evolve (a guidance section, a tool description, a config knob) vs frozen (the invariants above)?
+- How to scope budget to the gap — one generation to confirm a hunch, many to search a wide space?
+- When has it converged or plateaued? If surface-tuning is exhausted and the gap is architectural, escalate — don't keep spending on a surface that's maxed.
+## Self-test
+- **After a promotion, the live agent renders the exact winning surface** — diff the deployed artifact against the promoted one. They must be byte-identical.
+- **The held-out lift reproduces on a fresh run** — a one-shot gain is noise until it repeats.
+- **The gate's rejections are honest** — a held verdict carries a stated reason ("0 valid paired runs", "regressed guarded dimension"), never a silent pass and never a lift over partial data.
+## Evolves-by
+Production outcomes of shipped surfaces (did the held-out lift actually hold live?) feed back into the driver's mutation priors and the gate's thresholds. A surface that lifted offline but flatlined live tells you the held-out set wasn't representative — widen it. See `skill-evolution`.
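Reviewer note: invariant 2 and the third self-test above can be read as a small decision function. The sketch below is illustrative only. `decide`, `GatePolicy`, and `HeldOutEvidence` are hypothetical names, not the `defaultProductionGate` or `Gate` API from `@tangle-network/agent-eval`, and a plain mean threshold stands in for whatever significance test the real gate applies.

```typescript
// Hypothetical shapes for illustration; NOT the agent-eval Gate interface.
type DimensionScores = Record<string, number>;

interface HeldOutEvidence {
  // Paired held-out runs: the same scenario scored under baseline and candidate.
  pairs: { baseline: DimensionScores; candidate: DimensionScores }[];
}

interface GatePolicy {
  guarded: string[]; // dimensions that must not regress
  compositeOf: (s: DimensionScores) => number;
  minPairs: number; // refuse to decide on fewer valid pairs
  minLift: number; // required mean composite lift (stand-in for a significance test)
  maxGuardedDrop: number; // tolerated mean drop on any guarded dimension
}

type GateDecision = { promote: true } | { promote: false; reason: string };

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function score(s: DimensionScores, dim: string): number {
  const v = s[dim];
  return v === undefined ? 0 : v; // assumption: a missing dimension counts as 0
}

function decide(evidence: HeldOutEvidence, policy: GatePolicy): GateDecision {
  const n = evidence.pairs.length;
  // Partial or missing evidence is a stated refusal, never a silent pass.
  if (n < policy.minPairs) {
    return { promote: false, reason: `${n} valid paired runs < ${policy.minPairs}` };
  }
  // The critical-dimension floor is checked BEFORE the composite lift,
  // so a net gain cannot buy its way past a guarded regression.
  for (const dim of policy.guarded) {
    const drop = mean(evidence.pairs.map((p) => score(p.baseline, dim) - score(p.candidate, dim)));
    if (drop > policy.maxGuardedDrop) {
      return { promote: false, reason: `regressed guarded dimension ${dim}` };
    }
  }
  const lift = mean(
    evidence.pairs.map((p) => policy.compositeOf(p.candidate) - policy.compositeOf(p.baseline)),
  );
  if (lift < policy.minLift) {
    return { promote: false, reason: `held-out lift ${lift} < ${policy.minLift}` };
  }
  return { promote: true };
}

const policy: GatePolicy = {
  guarded: ["hallucination_free"],
  compositeOf: (s) => (score(s, "quality") + score(s, "hallucination_free")) / 2,
  minPairs: 2,
  minLift: 0.0625,
  maxGuardedDrop: 0,
};

// Composite rises (0.75 -> 0.8125) but the guarded dimension falls: held.
const regressedPair = {
  baseline: { quality: 0.5, hallucination_free: 1.0 },
  candidate: { quality: 1.0, hallucination_free: 0.625 },
};
const regressed = decide({ pairs: [regressedPair, regressedPair] }, policy);

// No valid pairs at all: held, with the reason stated.
const starved = decide({ pairs: [] }, policy);

// Quality rises and the guarded dimension holds: promoted.
const cleanPair = {
  baseline: { quality: 0.5, hallucination_free: 1.0 },
  candidate: { quality: 0.75, hallucination_free: 1.0 },
};
const clean = decide({ pairs: [cleanPair, cleanPair] }, policy);
```

Checking the guarded floor before the composite is the design choice that makes "a +10 composite that loses 30 on hallucination" come out as a held verdict rather than a win.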

package/dist/eval-campaign/index.d.ts ADDED

@@ -0,0 +1,155 @@
+import { JudgeVerdict } from '@tangle-network/agent-eval';
+export { EnsembleAggregate, JudgeVerdict, RunRecord, aggregateJudgeVerdicts } from '@tangle-network/agent-eval';
+import { Scenario, JudgeConfig } from '@tangle-network/agent-eval/campaign';
+export { CampaignResult, DispatchContext, Gate, ImprovementDriver, JudgeConfig, JudgeDimension, JudgeScore, LabeledScenarioStore, MutableSurface, Mutator, Scenario, defaultProductionGate, evolutionaryDriver, gepaDriver, paretoSignificanceGate, runCampaign } from '@tangle-network/agent-eval/campaign';
+export { SelfImproveBudget, SelfImproveOptions, SelfImproveResult, selfImprove } from '@tangle-network/agent-eval/contract';
+/**
+ * Trust gate — decides whether an ensemble's scores are allowed to be BELIEVED,
+ * one level up from {@link aggregateJudgeVerdicts} (which only reduces ONE
+ * artifact's raters to a composite). A composite is a number; this is the check
+ * that the number means anything. It is the code "Enforced by" for the
+ * measurement-validation skill's after-gate ("is this result allowed to be
+ * believed").
+ *
+ * Three checks, each fail-loud and named in `trustReasons`:
+ *   (1) inter-rater reliability over the corpus ≥ `irrFloor` — raters that
+ *       disagree no better than chance carry no signal to optimize against.
+ *   (2) per-item rater spread ≤ `spreadCeiling` — for EACH item, raters must
+ *       converge on THAT item.
+ *   (3) surviving raters per item ≥ `minSurvivors` — a mean over one or two
+ *       raters is an anecdote, not an ensemble.
+ *
+ * CRITICAL metric semantics — per-item spread is rater disagreement about the
+ * SAME item: `max(score) − min(score)` across the raters that scored THAT item
+ * (max over its dimensions), never pooled across different items or across the
+ * baseline/candidate sides. Pooling reads a genuine quality gap BETWEEN items as
+ * "the raters split" and so trips the gate exactly when the finding is largest —
+ * the failure mode the after-gate exists to prevent. The corpus IRR (check 1)
+ * leans on the substrate's `interRaterReliability`, whose expected-disagreement
+ * denominator already pools across items, so genuine item-to-item variation
+ * RAISES reliability rather than lowering it.
+ */
+/** One item's raters: the per-judge verdicts {@link aggregateJudgeVerdicts}
+ *  reduces, tagged with the item they scored so spread stays within-item. */
+interface TrustItem<D extends string = string> {
+    /** Stable item identifier — surfaces in `perItemSpread` and `trustReasons`. */
+    itemId: string;
+    /** The raters' verdicts for THIS item (one per judge call). A failed judge
+     *  (`perDimension: null`) is dropped before spread/IRR, never folded as 0. */
+    verdicts: readonly JudgeVerdict<D>[];
+}
+/** Thresholds for {@link trustVerdicts}. All overridable; defaults are the
+ *  conservative after-gate bar. */
+interface TrustThresholds {
+    /** Minimum corpus inter-rater reliability (Krippendorff-style α). Below this
+     *  the raters agree no better than chance. Default 0.2. */
+    irrFloor?: number;
+    /** Maximum per-item rater spread (`max − min` over a single item's surviving
+     *  raters, across its dimensions). Above this the raters split ON THAT ITEM.
+     *  Default 0.5. */
+    spreadCeiling?: number;
+    /** Minimum surviving (non-failed) raters required per item. Default 3. */
+    minSurvivors?: number;
+}
+/** Result of the trust gate. `trustworthy` iff every check passed; `trustReasons`
+ *  is empty iff `trustworthy`. */
+interface TrustVerdict {
+    /** True iff IRR ≥ floor AND every item's spread ≤ ceiling AND every item has
+     *  ≥ `minSurvivors` surviving raters. */
+    trustworthy: boolean;
+    /** One entry per FAILED check, each naming its number + the offending value.
+     *  Empty iff `trustworthy`. */
+    trustReasons: string[];
+    /** Corpus inter-rater reliability actually measured (the check-1 value). */
+    interRaterReliability: number;
+    /** Per-item spread (`max − min` over surviving raters, max over dimensions),
+     *  keyed by `itemId`. The check-2 input, surfaced for drill-down. */
+    perItemSpread: Record<string, number>;
+}
+/**
+ * Decide whether an ensemble's per-item verdicts are trustworthy enough to
+ * believe a lift computed from them. Pure: no LLM, no I/O, no clock, no random —
+ * the same `items` + `thresholds` always yield the same verdict.
+ *
+ * Sibling to {@link aggregateJudgeVerdicts}: that reduces ONE item's raters to a
+ * composite; this audits the raters ACROSS items and reports whether the
+ * composites are believable. Run it on the corpus of held-out items before
+ * reporting any lift over their scores.
+ *
+ * @throws if `items` is empty — an empty corpus has no measurable trust, and a
+ *   silent `trustworthy: true` over zero evidence is the exact lie the gate
+ *   exists to refuse.
+ */
+declare function trustVerdicts<D extends string>(items: readonly TrustItem<D>[], thresholds?: TrustThresholds): TrustVerdict;
+/**
+ * Eval-campaign — the app-shell's curated surface for a product's
+ * self-improvement loop, NOT a reimplementation.
+ *
+ * The loop ENGINE lives in `@tangle-network/agent-eval` (a peer dependency):
+ * `selfImprove` already owns the whole cycle — train/holdout split, the GEPA
+ * driver, the held-out production gate, durable provenance + hosted ingest, and
+ * every default. A product should NOT hand-roll `runImprovementLoop` +
+ * `emitLoopProvenance` around it (that is the boilerplate this surface exists to
+ * delete). It should call `selfImprove` with three things it actually owns:
+ * scenarios, an `agent` dispatch, and a `judge`.
+ *
+ * This module adds the one piece `selfImprove` does not own and which every
+ * multi-model product re-hand-rolls — the ensemble judge:
+ *
+ *   {@link buildEnsembleJudge} — turn a per-rubric `scoreOne` into a
+ *   `JudgeConfig` that fans out N uncorrelated judge calls and reduces them via
+ *   the substrate's `aggregateJudgeVerdicts` (survivor-mean, inter-rater spread,
+ *   fail-loud on all-failed). A product writes its rubric + one judge call; the
+ *   fan-out, partial-failure handling, and composite are the scaffold's.
+ *
+ * Everything else is a curated re-export so a product has ONE eval import:
+ * `selfImprove` + the gates + the drivers + the types. See
+ * `.claude/skills/eval-campaign/SKILL.md` for the wiring contract.
+ */
+/** Config for {@link buildEnsembleJudge}. `D` = the rubric's dimension union. */
+interface EnsembleJudgeConfig<TArtifact, TScenario extends Scenario, D extends string> {
+    /** Judge name — appears in traces and scorecards. */
+    name: string;
+    /** Stable-ordered rubric dimensions. Drives the `JudgeDimension` list AND the
+     *  reducer keys, so a judge that omits a dimension scores it 0 (never silently
+     *  dropped). */
+    rubric: readonly D[];
+    /**
+     * Score ONE artifact on the rubric → a raw per-dimension verdict. Called
+     * `judgeReps` times per artifact; vary the model by `rep` for an uncorrelated
+     * ensemble (judges that share a base model share its bias). Return
+     * `{ model, perDimension: null }` to record a judge failure WITHOUT killing
+     * the ensemble; throw only on an unrecoverable error (the whole rep is then
+     * treated as a failed judge).
+     */
+    scoreOne: (input: {
+        artifact: TArtifact;
+        scenario: TScenario;
+        signal: AbortSignal;
+        rep: number;
+    }) => Promise<JudgeVerdict<D>>;
+    /** Independent judge calls per artifact, reduced by `aggregateJudgeVerdicts`.
+     *  Default 1. Raise (with model variety in `scoreOne`) for inter-rater bands. */
+    judgeReps?: number;
+    /** Per-dimension composite weights. Default: uniform over `rubric`. A partial
+     *  map selects-and-weights exactly the named dimensions. */
+    weights?: Partial<Record<D, number>>;
+    /** Optional human-readable dimension descriptions. Default: the key itself. */
+    describe?: (dim: D) => string;
+}
+/**
+ * Build a `JudgeConfig` whose `score()` fans out `judgeReps` independent
+ * `scoreOne` calls and reduces them with the substrate's
+ * `aggregateJudgeVerdicts`. A single judge call failing does NOT fail the cell
+ * (it is recorded and dropped); only ALL judges failing throws — which the
+ * campaign records as a failed cell, never a silent zero.
+ *
+ * Pass the result straight to `selfImprove({ judge })` (or `runCampaign`).
+ */
+declare function buildEnsembleJudge<TArtifact, TScenario extends Scenario, D extends string>(cfg: EnsembleJudgeConfig<TArtifact, TScenario, D>): JudgeConfig<TArtifact, TScenario>;
+export { type EnsembleJudgeConfig, type TrustItem, type TrustThresholds, type TrustVerdict, buildEnsembleJudge, trustVerdicts };
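Reviewer note: the "CRITICAL metric semantics" comment in the declarations above is easier to see with numbers. The sketch below contrasts within-item spread with the pooled spread the comment warns against. The `Verdict` and `Item` types are local stand-ins, the helper names are hypothetical, and the 0 to 1 score scale is an assumption. None of this is the package's code.

```typescript
// Local stand-ins for the package's verdict shape (assumption for this sketch).
type Verdict = { model: string; perDimension: Record<string, number> | null };
type Item = { itemId: string; verdicts: Verdict[] };

// max - min over a list of scores; 0 when there are fewer than two.
function range(scores: number[]): number {
  if (scores.length < 2) return 0;
  return Math.max(...scores) - Math.min(...scores);
}

// Scores that one item's SURVIVING raters gave on one dimension.
function scoresFor(item: Item, dim: string): number[] {
  const out: number[] = [];
  for (const v of item.verdicts) {
    if (v.perDimension === null) continue; // failed judge: dropped, not a zero
    const s = v.perDimension[dim];
    if (s !== undefined) out.push(s);
  }
  return out;
}

// Within-item spread: the worst dimension range across THIS item's raters.
function withinItemSpread(item: Item, dims: string[]): number {
  let worst = 0;
  for (const d of dims) worst = Math.max(worst, range(scoresFor(item, d)));
  return worst;
}

// The mistake: pool every item's scores together before taking the range.
function pooledSpread(items: Item[], dims: string[]): number {
  let worst = 0;
  for (const d of dims) {
    let all: number[] = [];
    for (const it of items) all = all.concat(scoresFor(it, d));
    worst = Math.max(worst, range(all));
  }
  return worst;
}

const dims = ["accuracy"];

// Raters agree closely that this artifact is strong.
const strong: Item = {
  itemId: "strong",
  verdicts: [
    { model: "judge-a", perDimension: { accuracy: 0.875 } },
    { model: "judge-b", perDimension: { accuracy: 1.0 } },
    { model: "judge-c", perDimension: { accuracy: 0.875 } },
  ],
};

// Raters agree closely that this artifact is weak; one judge call failed.
const weak: Item = {
  itemId: "weak",
  verdicts: [
    { model: "judge-a", perDimension: { accuracy: 0.125 } },
    { model: "judge-b", perDimension: { accuracy: 0.25 } },
    { model: "judge-c", perDimension: { accuracy: 0.125 } },
    { model: "judge-d", perDimension: null },
  ],
};

const strongSpread = withinItemSpread(strong, dims);
const weakSpread = withinItemSpread(weak, dims);
const pooled = pooledSpread([strong, weak], dims);
```

With these inputs each item's spread is 0.125, well under the default ceiling of 0.5, while the pooled figure is 0.875. The raters never disagreed about either item. The pooled number only measures the quality gap between the two items, which is the misreading the comment describes.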

package/dist/eval-campaign/index.js ADDED

@@ -0,0 +1,140 @@
+// src/eval-campaign/index.ts
+import {
+  aggregateJudgeVerdicts
+} from "@tangle-network/agent-eval";
+// src/eval-campaign/trust-gate.ts
+import {
+  interRaterReliability
+} from "@tangle-network/agent-eval";
+var DEFAULT_IRR_FLOOR = 0.2;
+var DEFAULT_SPREAD_CEILING = 0.5;
+var DEFAULT_MIN_SURVIVORS = 3;
+function survivors(item) {
+  return item.verdicts.filter((v) => v.perDimension !== null);
+}
+function itemSpread(survivorVerdicts) {
+  if (survivorVerdicts.length < 2) return 0;
+  const dims = /* @__PURE__ */ new Set();
+  for (const v of survivorVerdicts) {
+    for (const d of Object.keys(v.perDimension)) dims.add(d);
+  }
+  let worst = 0;
+  for (const d of dims) {
+    let min = Infinity;
+    let max = -Infinity;
+    for (const v of survivorVerdicts) {
+      const score = v.perDimension[d];
+      if (score === void 0) continue;
+      if (score < min) min = score;
+      if (score > max) max = score;
+    }
+    if (max > -Infinity && max - min > worst) worst = max - min;
+  }
+  return worst;
+}
+function trustVerdicts(items, thresholds = {}) {
+  if (items.length === 0) {
+    throw new Error("trustVerdicts: items is empty \u2014 no evidence to trust");
+  }
+  const irrFloor = thresholds.irrFloor ?? DEFAULT_IRR_FLOOR;
+  const spreadCeiling = thresholds.spreadCeiling ?? DEFAULT_SPREAD_CEILING;
+  const minSurvivors = thresholds.minSurvivors ?? DEFAULT_MIN_SURVIVORS;
+  const maxRaters = items.reduce((m, it) => Math.max(m, survivors(it).length), 0);
+  const raterSeries = Array.from({ length: maxRaters }, () => []);
+  const perItemSpread = {};
+  const splitItems = [];
+  const starvedItems = [];
+  for (const item of items) {
+    const surv = survivors(item);
+    if (surv.length < minSurvivors) starvedItems.push({ itemId: item.itemId, n: surv.length });
+    const spread = itemSpread(surv);
+    perItemSpread[item.itemId] = spread;
+    if (spread > spreadCeiling) splitItems.push({ itemId: item.itemId, spread });
+    if (surv.length >= 2) {
+      const dims = Array.from(
+        new Set(surv.flatMap((v) => Object.keys(v.perDimension)))
+      ).sort();
+      surv.forEach((v, raterIdx) => {
+        const column = raterSeries[raterIdx] ??= [];
+        const pd = v.perDimension;
+        for (const d of dims) {
+          const score = pd[d];
+          if (score === void 0) continue;
+          column.push({
+            judgeName: v.model,
+            dimension: `${item.itemId}::${d}`,
+            score,
+            reasoning: v.rationale ?? ""
+          });
+        }
+      });
+    }
+  }
+  const irr = interRaterReliability(raterSeries);
+  const trustReasons = [];
+  if (irr < irrFloor) {
+    trustReasons.push(`(1) IRR ${round(irr)} < ${irrFloor}`);
+  }
+  for (const { itemId, spread } of splitItems) {
+    trustReasons.push(`(2) item ${itemId} spread ${round(spread)} > ${spreadCeiling} \u2014 raters split`);
+  }
+  for (const { itemId, n } of starvedItems) {
+    trustReasons.push(`(3) item ${itemId}: ${n} surviving raters < ${minSurvivors}`);
+  }
+  return {
+    trustworthy: trustReasons.length === 0,
+    trustReasons,
+    interRaterReliability: irr,
+    perItemSpread
+  };
+}
+function round(n) {
+  return Math.round(n * 100) / 100;
+}
+// src/eval-campaign/index.ts
+import { aggregateJudgeVerdicts as aggregateJudgeVerdicts2 } from "@tangle-network/agent-eval";
+import {
+  defaultProductionGate,
+  evolutionaryDriver,
+  gepaDriver,
+  paretoSignificanceGate,
+  runCampaign
+} from "@tangle-network/agent-eval/campaign";
+import { selfImprove } from "@tangle-network/agent-eval/contract";
+function buildEnsembleJudge(cfg) {
+  const reps = cfg.judgeReps ?? 1;
+  if (reps < 1) {
+    throw new Error(`buildEnsembleJudge: judgeReps must be >= 1 (got ${reps})`);
+  }
+  if (cfg.rubric.length === 0) {
+    throw new Error("buildEnsembleJudge: rubric is empty");
+  }
+  return {
+    name: cfg.name,
+    dimensions: cfg.rubric.map((key) => ({ key, description: cfg.describe?.(key) ?? key })),
+    async score({ artifact, scenario, signal }) {
+      const settled = await Promise.allSettled(
+        Array.from({ length: reps }, (_, rep) => cfg.scoreOne({ artifact, scenario, signal, rep }))
+      );
+      const verdicts = settled.map(
+        (r, rep) => r.status === "fulfilled" ? r.value : { model: `${cfg.name}-rep${rep}`, perDimension: null, rationale: String(r.reason) }
+      );
+      const agg = aggregateJudgeVerdicts(verdicts, cfg.rubric, cfg.weights);
+      return { composite: agg.composite, dimensions: agg.perDimension, notes: agg.rationale };
+    }
+  };
+}
+export {
+  aggregateJudgeVerdicts2 as aggregateJudgeVerdicts,
+  buildEnsembleJudge,
+  defaultProductionGate,
+  evolutionaryDriver,
+  gepaDriver,
+  paretoSignificanceGate,
+  runCampaign,
+  selfImprove,
+  trustVerdicts
+};
+//# sourceMappingURL=index.js.map
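Reviewer note: `buildEnsembleJudge` delegates the reduction to `aggregateJudgeVerdicts`, whose implementation is not part of this diff. The sketch below is a hypothetical stand-in written only from the behaviour the comments describe: mean over surviving judges, an omitted dimension scored as 0, uniform weights by default, a partial weight map selecting only the named dimensions, and a throw when every judge failed. `reduceSurvivors` is an invented name and the weighted-average composite is an assumption. The real reducer may differ.

```typescript
// Local stand-in for the verdict shape (assumption for this sketch).
type Verdict = { model: string; perDimension: Record<string, number> | null };

interface Reduced {
  composite: number;
  perDimension: Record<string, number>;
  survivors: number;
}

// Hypothetical reducer, NOT the substrate's aggregateJudgeVerdicts.
function reduceSurvivors(
  verdicts: Verdict[],
  rubric: string[],
  weights?: Partial<Record<string, number>>,
): Reduced {
  // Failed judges (perDimension === null) are dropped, never averaged in as 0.
  const alive: Record<string, number>[] = [];
  for (const v of verdicts) {
    if (v.perDimension !== null) alive.push(v.perDimension);
  }
  if (alive.length === 0) {
    throw new Error("reduceSurvivors: every judge failed");
  }

  // Per-dimension survivor mean; a dimension a judge omitted counts as 0.
  const perDimension: Record<string, number> = {};
  for (const dim of rubric) {
    let sum = 0;
    for (const pd of alive) {
      const s = pd[dim];
      sum += s === undefined ? 0 : s;
    }
    perDimension[dim] = sum / alive.length;
  }

  // Weighted average; with a partial map, unnamed dimensions are left out.
  let num = 0;
  let den = 0;
  for (const dim of rubric) {
    const w = weights === undefined ? 1 : weights[dim];
    if (w === undefined) continue;
    num += w * perDimension[dim];
    den += w;
  }
  return { composite: den > 0 ? num / den : 0, perDimension, survivors: alive.length };
}

const rubric = ["accuracy", "safety"];
const verdicts: Verdict[] = [
  { model: "judge-a", perDimension: { accuracy: 1, safety: 0.5 } },
  { model: "judge-b", perDimension: null }, // failed call: dropped
  { model: "judge-c", perDimension: { accuracy: 0.5 } }, // omits safety: scored 0
];

const uniform = reduceSurvivors(verdicts, rubric);
const accuracyOnly = reduceSurvivors(verdicts, rubric, { accuracy: 1 });

let allFailedThrew = false;
try {
  reduceSurvivors([{ model: "judge-a", perDimension: null }], rubric);
} catch {
  allFailedThrew = true;
}
```

Two of three judges survive here, so accuracy averages to 0.75 and safety to 0.25. The failed call changes the survivor count but does not pull any mean toward zero, which is the distinction between "dropped" and "folded as 0" that the declarations insist on.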

package/dist/eval-campaign/index.js.map ADDED

@@ -0,0 +1 @@

+ {"version":3,"sources":["../../src/eval-campaign/index.ts","../../src/eval-campaign/trust-gate.ts"],"sourcesContent":["/**\n * Eval-campaign — the app-shell's curated surface for a product's\n * self-improvement loop, NOT a reimplementation.\n *\n * The loop ENGINE lives in `@tangle-network/agent-eval` (a peer dependency):\n * `selfImprove` already owns the whole cycle — train/holdout split, the GEPA\n * driver, the held-out production gate, durable provenance + hosted ingest, and\n * every default. A product should NOT hand-roll `runImprovementLoop` +\n * `emitLoopProvenance` around it (that is the boilerplate this surface exists to\n * delete). It should call `selfImprove` with three things it actually owns:\n * scenarios, an `agent` dispatch, and a `judge`.\n *\n * This module adds the one piece `selfImprove` does not own and which every\n * multi-model product re-hand-rolls — the ensemble judge:\n *\n * {@link buildEnsembleJudge} — turn a per-rubric `scoreOne` into a\n * `JudgeConfig` that fans out N uncorrelated judge calls and reduces them via\n * the substrate's `aggregateJudgeVerdicts` (survivor-mean, inter-rater spread,\n * fail-loud on all-failed). A product writes its rubric + one judge call; the\n * fan-out, partial-failure handling, and composite are the scaffold's.\n *\n * Everything else is a curated re-export so a product has ONE eval import:\n * `selfImprove` + the gates + the drivers + the types. See\n * `.claude/skills/eval-campaign/SKILL.md` for the wiring contract.\n */\n\nimport {\n aggregateJudgeVerdicts,\n type JudgeVerdict,\n} from '@tangle-network/agent-eval'\nimport type {\n JudgeConfig,\n JudgeScore,\n Scenario,\n} from '@tangle-network/agent-eval/campaign'\n\n/** Config for {@link buildEnsembleJudge}. `D` = the rubric's dimension union. */\nexport interface EnsembleJudgeConfig<TArtifact, TScenario extends Scenario, D extends string> {\n /** Judge name — appears in traces and scorecards. 
*/\n name: string\n /** Stable-ordered rubric dimensions. Drives the `JudgeDimension` list AND the\n * reducer keys, so a judge that omits a dimension scores it 0 (never silently\n * dropped). */\n rubric: readonly D[]\n /**\n * Score ONE artifact on the rubric → a raw per-dimension verdict. Called\n * `judgeReps` times per artifact; vary the model by `rep` for an uncorrelated\n * ensemble (judges that share a base model share its bias). Return\n * `{ model, perDimension: null }` to record a judge failure WITHOUT killing\n * the ensemble; throw only on an unrecoverable error (the whole rep is then\n * treated as a failed judge).\n */\n scoreOne: (input: {\n artifact: TArtifact\n scenario: TScenario\n signal: AbortSignal\n rep: number\n }) => Promise<JudgeVerdict<D>>\n /** Independent judge calls per artifact, reduced by `aggregateJudgeVerdicts`.\n * Default 1. Raise (with model variety in `scoreOne`) for inter-rater bands. */\n judgeReps?: number\n /** Per-dimension composite weights. Default: uniform over `rubric`. A partial\n * map selects-and-weights exactly the named dimensions. */\n weights?: Partial<Record<D, number>>\n /** Optional human-readable dimension descriptions. Default: the key itself. */\n describe?: (dim: D) => string\n}\n\n/**\n * Build a `JudgeConfig` whose `score()` fans out `judgeReps` independent\n * `scoreOne` calls and reduces them with the substrate's\n * `aggregateJudgeVerdicts`. A single judge call failing does NOT fail the cell\n * (it is recorded and dropped); only ALL judges failing throws — which the\n * campaign records as a failed cell, never a silent zero.\n *\n * Pass the result straight to `selfImprove({ judge })` (or `runCampaign`).\n */\nexport function buildEnsembleJudge<TArtifact, TScenario extends Scenario, D extends string>(\n cfg: EnsembleJudgeConfig<TArtifact, TScenario, D>,\n): JudgeConfig<TArtifact, TScenario> {\n const reps = cfg.judgeReps ?? 
1\n if (reps < 1) {\n throw new Error(`buildEnsembleJudge: judgeReps must be >= 1 (got ${reps})`)\n }\n if (cfg.rubric.length === 0) {\n throw new Error('buildEnsembleJudge: rubric is empty')\n }\n return {\n name: cfg.name,\n dimensions: cfg.rubric.map((key) => ({ key, description: cfg.describe?.(key) ?? key })),\n async score({ artifact, scenario, signal }): Promise<JudgeScore> {\n const settled = await Promise.allSettled(\n Array.from({ length: reps }, (_, rep) => cfg.scoreOne({ artifact, scenario, signal, rep })),\n )\n const verdicts: JudgeVerdict<D>[] = settled.map((r, rep) =>\n r.status === 'fulfilled'\n ? r.value\n : { model: `${cfg.name}-rep${rep}`, perDimension: null, rationale: String(r.reason) },\n )\n // Throws iff EVERY rep failed → the campaign records a failed cell.\n const agg = aggregateJudgeVerdicts(verdicts, cfg.rubric, cfg.weights)\n return { composite: agg.composite, dimensions: agg.perDimension, notes: agg.rationale }\n },\n }\n}\n\n// ── Trust gate — the after-gate (\"is this result allowed to be believed\") ────\n// One level up from `aggregateJudgeVerdicts`: it audits the raters ACROSS items\n// and reports whether the composites are believable before a lift is reported.\nexport {\n trustVerdicts,\n type TrustItem,\n type TrustThresholds,\n type TrustVerdict,\n} from './trust-gate'\n\n// ── Curated re-exports — the one eval import for a product loop ──────────────\n// The loop engine + gates + drivers + the ensemble reducer, so a product wires\n// its self-improvement loop from a single module instead of reaching across\n// three agent-eval subpaths. 
All DOWNWARD imports (agent-app consumes the\n// substrate); the layering rule is preserved.\n\nexport { aggregateJudgeVerdicts } from '@tangle-network/agent-eval'\nexport type {\n EnsembleAggregate,\n JudgeVerdict,\n RunRecord,\n} from '@tangle-network/agent-eval'\nexport {\n defaultProductionGate,\n evolutionaryDriver,\n gepaDriver,\n paretoSignificanceGate,\n runCampaign,\n} from '@tangle-network/agent-eval/campaign'\nexport type {\n CampaignResult,\n DispatchContext,\n Gate,\n ImprovementDriver,\n JudgeConfig,\n JudgeDimension,\n JudgeScore,\n LabeledScenarioStore,\n MutableSurface,\n Mutator,\n Scenario,\n} from '@tangle-network/agent-eval/campaign'\nexport { selfImprove } from '@tangle-network/agent-eval/contract'\nexport type {\n SelfImproveBudget,\n SelfImproveOptions,\n SelfImproveResult,\n} from '@tangle-network/agent-eval/contract'\n","/**\n * Trust gate — decides whether an ensemble's scores are allowed to be BELIEVED,\n * one level up from {@link aggregateJudgeVerdicts} (which only reduces ONE\n * artifact's raters to a composite). A composite is a number; this is the check\n * that the number means anything. It is the code \"Enforced by\" for the\n * measurement-validation skill's after-gate (\"is this result allowed to be\n * believed\").\n *\n * Three checks, each fail-loud and named in `trustReasons`:\n * (1) inter-rater reliability over the corpus ≥ `irrFloor` — raters that\n * disagree no better than chance carry no signal to optimize against.\n * (2) per-item rater spread ≤ `spreadCeiling` — for EACH item, raters must\n * converge on THAT item.\n * (3) surviving raters per item ≥ `minSurvivors` — a mean over one or two\n * raters is an anecdote, not an ensemble.\n *\n * CRITICAL metric semantics — per-item spread is rater disagreement about the\n * SAME item: `max(score) − min(score)` across the raters that scored THAT item\n * (max over its dimensions), never pooled across different items or across the\n * baseline/candidate sides. 
Pooling reads a genuine quality gap BETWEEN items as\n * \"the raters split\" and so trips the gate exactly when the finding is largest —\n * the failure mode the after-gate exists to prevent. The corpus IRR (check 1)\n * leans on the substrate's `interRaterReliability`, whose expected-disagreement\n * denominator already pools across items, so genuine item-to-item variation\n * RAISES reliability rather than lowering it.\n */\n\nimport {\n interRaterReliability,\n type JudgeScore,\n type JudgeVerdict,\n} from '@tangle-network/agent-eval'\n\n/** One item's raters: the per-judge verdicts {@link aggregateJudgeVerdicts}\n * reduces, tagged with the item they scored so spread stays within-item. */\nexport interface TrustItem<D extends string = string> {\n /** Stable item identifier — surfaces in `perItemSpread` and `trustReasons`. */\n itemId: string\n /** The raters' verdicts for THIS item (one per judge call). A failed judge\n * (`perDimension: null`) is dropped before spread/IRR, never folded as 0. */\n verdicts: readonly JudgeVerdict<D>[]\n}\n\n/** Thresholds for {@link trustVerdicts}. All overridable; defaults are the\n * conservative after-gate bar. */\nexport interface TrustThresholds {\n /** Minimum corpus inter-rater reliability (Krippendorff-style α). Below this\n * the raters agree no better than chance. Default 0.2. */\n irrFloor?: number\n /** Maximum per-item rater spread (`max − min` over a single item's surviving\n * raters, across its dimensions). Above this the raters split ON THAT ITEM.\n * Default 0.5. */\n spreadCeiling?: number\n /** Minimum surviving (non-failed) raters required per item. Default 3. */\n minSurvivors?: number\n}\n\n/** Result of the trust gate. `trustworthy` iff every check passed; `trustReasons`\n * is empty iff `trustworthy`. */\nexport interface TrustVerdict {\n /** True iff IRR ≥ floor AND every item's spread ≤ ceiling AND every item has\n * ≥ `minSurvivors` surviving raters. 
*/\n trustworthy: boolean\n /** One entry per FAILED check, each naming its number + the offending value.\n * Empty iff `trustworthy`. */\n trustReasons: string[]\n /** Corpus inter-rater reliability actually measured (the check-1 value). */\n interRaterReliability: number\n /** Per-item spread (`max − min` over surviving raters, max over dimensions),\n * keyed by `itemId`. The check-2 input, surfaced for drill-down. */\n perItemSpread: Record<string, number>\n}\n\nconst DEFAULT_IRR_FLOOR = 0.2\nconst DEFAULT_SPREAD_CEILING = 0.5\nconst DEFAULT_MIN_SURVIVORS = 3\n\n/** Surviving (non-failed) verdicts for an item — those with a real\n * `perDimension` map. A failed judge carries no scores and is excluded from\n * every statistic (it is NOT a zero rater). */\nfunction survivors<D extends string>(item: TrustItem<D>): JudgeVerdict<D>[] {\n return item.verdicts.filter((v) => v.perDimension !== null)\n}\n\n/**\n * Within-item rater spread: for each dimension, `max − min` across the item's\n * surviving raters; the item's spread is the max over its dimensions (the worst\n * dimension the raters split on). 
Pooled ONLY within this one item — never\n * across items — so a quality gap between items cannot inflate it.\n */\nfunction itemSpread<D extends string>(survivorVerdicts: JudgeVerdict<D>[]): number {\n if (survivorVerdicts.length < 2) return 0\n const dims = new Set<string>()\n for (const v of survivorVerdicts) {\n for (const d of Object.keys(v.perDimension as Record<string, number>)) dims.add(d)\n }\n let worst = 0\n for (const d of dims) {\n let min = Infinity\n let max = -Infinity\n for (const v of survivorVerdicts) {\n const score = (v.perDimension as Record<string, number>)[d]\n if (score === undefined) continue\n if (score < min) min = score\n if (score > max) max = score\n }\n if (max > -Infinity && max - min > worst) worst = max - min\n }\n return worst\n}\n\n/**\n * Decide whether an ensemble's per-item verdicts are trustworthy enough to\n * believe a lift computed from them. Pure: no LLM, no I/O, no clock, no random —\n * the same `items` + `thresholds` always yield the same verdict.\n *\n * Sibling to {@link aggregateJudgeVerdicts}: that reduces ONE item's raters to a\n * composite; this audits the raters ACROSS items and reports whether the\n * composites are believable. Run it on the corpus of held-out items before\n * reporting any lift over their scores.\n *\n * @throws if `items` is empty — an empty corpus has no measurable trust, and a\n * silent `trustworthy: true` over zero evidence is the exact lie the gate\n * exists to refuse.\n */\nexport function trustVerdicts<D extends string>(\n items: readonly TrustItem<D>[],\n thresholds: TrustThresholds = {},\n): TrustVerdict {\n if (items.length === 0) {\n throw new Error('trustVerdicts: items is empty — no evidence to trust')\n }\n const irrFloor = thresholds.irrFloor ?? DEFAULT_IRR_FLOOR\n const spreadCeiling = thresholds.spreadCeiling ?? DEFAULT_SPREAD_CEILING\n const minSurvivors = thresholds.minSurvivors ?? DEFAULT_MIN_SURVIVORS\n\n // Rater-major JudgeScore series for the substrate's IRR. 
Each item's surviving\n // raters are assigned a stable column index so the same rater across items\n // lines up; per (item, dimension) one JudgeScore per rater, in item-then-\n // dimension order — the layout interRaterReliability chunks back into items.\n const maxRaters = items.reduce((m, it) => Math.max(m, survivors(it).length), 0)\n const raterSeries: JudgeScore[][] = Array.from({ length: maxRaters }, () => [])\n const perItemSpread: Record<string, number> = {}\n const splitItems: Array<{ itemId: string; spread: number }> = []\n const starvedItems: Array<{ itemId: string; n: number }> = []\n\n for (const item of items) {\n const surv = survivors(item)\n if (surv.length < minSurvivors) starvedItems.push({ itemId: item.itemId, n: surv.length })\n\n const spread = itemSpread(surv)\n perItemSpread[item.itemId] = spread\n if (spread > spreadCeiling) splitItems.push({ itemId: item.itemId, spread })\n\n if (surv.length >= 2) {\n const dims = Array.from(\n new Set(surv.flatMap((v) => Object.keys(v.perDimension as Record<string, number>))),\n ).sort()\n surv.forEach((v, raterIdx) => {\n // raterIdx < surv.length ≤ maxRaters = raterSeries.length, so the column\n // always exists; the ??= keeps the access provably defined for the type.\n const column = (raterSeries[raterIdx] ??= [])\n const pd = v.perDimension as Record<string, number>\n for (const d of dims) {\n const score = pd[d]\n if (score === undefined) continue\n column.push({\n judgeName: v.model,\n dimension: `${item.itemId}::${d}`,\n score,\n reasoning: v.rationale ?? 
'',\n })\n }\n })\n }\n }\n\n const irr = interRaterReliability(raterSeries)\n\n const trustReasons: string[] = []\n if (irr < irrFloor) {\n trustReasons.push(`(1) IRR ${round(irr)} < ${irrFloor}`)\n }\n for (const { itemId, spread } of splitItems) {\n trustReasons.push(`(2) item ${itemId} spread ${round(spread)} > ${spreadCeiling} — raters split`)\n }\n for (const { itemId, n } of starvedItems) {\n trustReasons.push(`(3) item ${itemId}: ${n} surviving raters < ${minSurvivors}`)\n }\n\n return {\n trustworthy: trustReasons.length === 0,\n trustReasons,\n interRaterReliability: irr,\n perItemSpread,\n }\n}\n\n/** Round to 2 decimals for stable, readable reason strings. */\nfunction round(n: number): number {\n return Math.round(n * 100) / 100\n}\n"],"mappings":";AA0BA;AAAA,EACE;AAAA,OAEK;;;ACFP;AAAA,EACE;AAAA,OAGK;AA0CP,IAAM,oBAAoB;AAC1B,IAAM,yBAAyB;AAC/B,IAAM,wBAAwB;AAK9B,SAAS,UAA4B,MAAuC;AAC1E,SAAO,KAAK,SAAS,OAAO,CAAC,MAAM,EAAE,iBAAiB,IAAI;AAC5D;AAQA,SAAS,WAA6B,kBAA6C;AACjF,MAAI,iBAAiB,SAAS,EAAG,QAAO;AACxC,QAAM,OAAO,oBAAI,IAAY;AAC7B,aAAW,KAAK,kBAAkB;AAChC,eAAW,KAAK,OAAO,KAAK,EAAE,YAAsC,EAAG,MAAK,IAAI,CAAC;AAAA,EACnF;AACA,MAAI,QAAQ;AACZ,aAAW,KAAK,MAAM;AACpB,QAAI,MAAM;AACV,QAAI,MAAM;AACV,eAAW,KAAK,kBAAkB;AAChC,YAAM,QAAS,EAAE,aAAwC,CAAC;AAC1D,UAAI,UAAU,OAAW;AACzB,UAAI,QAAQ,IAAK,OAAM;AACvB,UAAI,QAAQ,IAAK,OAAM;AAAA,IACzB;AACA,QAAI,MAAM,aAAa,MAAM,MAAM,MAAO,SAAQ,MAAM;AAAA,EAC1D;AACA,SAAO;AACT;AAgBO,SAAS,cACd,OACA,aAA8B,CAAC,GACjB;AACd,MAAI,MAAM,WAAW,GAAG;AACtB,UAAM,IAAI,MAAM,2DAAsD;AAAA,EACxE;AACA,QAAM,WAAW,WAAW,YAAY;AACxC,QAAM,gBAAgB,WAAW,iBAAiB;AAClD,QAAM,eAAe,WAAW,gBAAgB;AAMhD,QAAM,YAAY,MAAM,OAAO,CAAC,GAAG,OAAO,KAAK,IAAI,GAAG,UAAU,EAAE,EAAE,MAAM,GAAG,CAAC;AAC9E,QAAM,cAA8B,MAAM,KAAK,EAAE,QAAQ,UAAU,GAAG,MAAM,CAAC,CAAC;AAC9E,QAAM,gBAAwC,CAAC;AAC/C,QAAM,aAAwD,CAAC;AAC/D,QAAM,eAAqD,CAAC;AAE5D,aAAW,QAAQ,OAAO;AACxB,UAAM,OAAO,UAAU,IAAI;AAC3B,QAAI,KAAK,SAAS,aAAc,cAAa,KAAK,EAAE,QAAQ,KAAK,QAAQ,GAAG,KAAK,OAAO,CAAC;AAEzF,UAAM,SAAS,WAAW,IAAI;AAC9B,kBAAc,KAAK,MAAM,IAAI;AAC7B,QAAI,SAA
S,cAAe,YAAW,KAAK,EAAE,QAAQ,KAAK,QAAQ,OAAO,CAAC;AAE3E,QAAI,KAAK,UAAU,GAAG;AACpB,YAAM,OAAO,MAAM;AAAA,QACjB,IAAI,IAAI,KAAK,QAAQ,CAAC,MAAM,OAAO,KAAK,EAAE,YAAsC,CAAC,CAAC;AAAA,MACpF,EAAE,KAAK;AACP,WAAK,QAAQ,CAAC,GAAG,aAAa;AAG5B,cAAM,SAAU,YAAY,QAAQ,MAAM,CAAC;AAC3C,cAAM,KAAK,EAAE;AACb,mBAAW,KAAK,MAAM;AACpB,gBAAM,QAAQ,GAAG,CAAC;AAClB,cAAI,UAAU,OAAW;AACzB,iBAAO,KAAK;AAAA,YACV,WAAW,EAAE;AAAA,YACb,WAAW,GAAG,KAAK,MAAM,KAAK,CAAC;AAAA,YAC/B;AAAA,YACA,WAAW,EAAE,aAAa;AAAA,UAC5B,CAAC;AAAA,QACH;AAAA,MACF,CAAC;AAAA,IACH;AAAA,EACF;AAEA,QAAM,MAAM,sBAAsB,WAAW;AAE7C,QAAM,eAAyB,CAAC;AAChC,MAAI,MAAM,UAAU;AAClB,iBAAa,KAAK,WAAW,MAAM,GAAG,CAAC,MAAM,QAAQ,EAAE;AAAA,EACzD;AACA,aAAW,EAAE,QAAQ,OAAO,KAAK,YAAY;AAC3C,iBAAa,KAAK,YAAY,MAAM,WAAW,MAAM,MAAM,CAAC,MAAM,aAAa,sBAAiB;AAAA,EAClG;AACA,aAAW,EAAE,QAAQ,EAAE,KAAK,cAAc;AACxC,iBAAa,KAAK,YAAY,MAAM,KAAK,CAAC,uBAAuB,YAAY,EAAE;AAAA,EACjF;AAEA,SAAO;AAAA,IACL,aAAa,aAAa,WAAW;AAAA,IACrC;AAAA,IACA,uBAAuB;AAAA,IACvB;AAAA,EACF;AACF;AAGA,SAAS,MAAM,GAAmB;AAChC,SAAO,KAAK,MAAM,IAAI,GAAG,IAAI;AAC/B;;;AD/EA,SAAS,0BAAAA,+BAA8B;AAMvC;AAAA,EACE;AAAA,EACA;AAAA,EACA;AAAA,EACA;AAAA,EACA;AAAA,OACK;AAcP,SAAS,mBAAmB;AAvErB,SAAS,mBACd,KACmC;AACnC,QAAM,OAAO,IAAI,aAAa;AAC9B,MAAI,OAAO,GAAG;AACZ,UAAM,IAAI,MAAM,mDAAmD,IAAI,GAAG;AAAA,EAC5E;AACA,MAAI,IAAI,OAAO,WAAW,GAAG;AAC3B,UAAM,IAAI,MAAM,qCAAqC;AAAA,EACvD;AACA,SAAO;AAAA,IACL,MAAM,IAAI;AAAA,IACV,YAAY,IAAI,OAAO,IAAI,CAAC,SAAS,EAAE,KAAK,aAAa,IAAI,WAAW,GAAG,KAAK,IAAI,EAAE;AAAA,IACtF,MAAM,MAAM,EAAE,UAAU,UAAU,OAAO,GAAwB;AAC/D,YAAM,UAAU,MAAM,QAAQ;AAAA,QAC5B,MAAM,KAAK,EAAE,QAAQ,KAAK,GAAG,CAAC,GAAG,QAAQ,IAAI,SAAS,EAAE,UAAU,UAAU,QAAQ,IAAI,CAAC,CAAC;AAAA,MAC5F;AACA,YAAM,WAA8B,QAAQ;AAAA,QAAI,CAAC,GAAG,QAClD,EAAE,WAAW,cACT,EAAE,QACF,EAAE,OAAO,GAAG,IAAI,IAAI,OAAO,GAAG,IAAI,cAAc,MAAM,WAAW,OAAO,EAAE,MAAM,EAAE;AAAA,MACxF;AAEA,YAAM,MAAM,uBAAuB,UAAU,IAAI,QAAQ,IAAI,OAAO;AACpE,aAAO,EAAE,WAAW,IAAI,WAAW,YAAY,IAAI,cAAc,OAAO,IAAI,UAAU;AAAA,IACxF;AAAA,EACF;AACF;","names":["aggregateJudgeVerdicts"]}
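
Note on the source map above: the embedded source ends with a three-part trust gate (an inter-rater reliability floor, a per-item spread ceiling, and a minimum count of surviving raters). Below is a minimal sketch of that gate. It assumes simplified inputs: `assessTrust`, `TrustInput`, and `TrustThresholds` are hypothetical names, not the package's exports, and the thresholds are passed in explicitly because the package's default values are not visible in this diff. The reason strings follow the same numbered pattern as the embedded source.

```typescript
// Hypothetical, simplified shapes; the real package types are not shown in this diff.
interface TrustInput {
  irr: number                               // inter-rater reliability, computed elsewhere
  perItemSpread: Record<string, number>     // score spread across raters, per item
  survivorsPerItem: Record<string, number>  // raters that returned a usable verdict, per item
}

interface TrustThresholds {
  irrFloor: number
  spreadCeiling: number
  minSurvivors: number
}

interface TrustReport {
  trustworthy: boolean
  trustReasons: string[]
}

/** Round to 2 decimals for stable, readable reason strings (same helper as the source). */
function round(n: number): number {
  return Math.round(n * 100) / 100
}

/** Collect every violated check; the result is trustworthy only when none fired. */
function assessTrust(input: TrustInput, t: TrustThresholds): TrustReport {
  const trustReasons: string[] = []
  if (input.irr < t.irrFloor) {
    trustReasons.push(`(1) IRR ${round(input.irr)} < ${t.irrFloor}`)
  }
  for (const [itemId, spread] of Object.entries(input.perItemSpread)) {
    if (spread > t.spreadCeiling) {
      trustReasons.push(`(2) item ${itemId} spread ${round(spread)} > ${t.spreadCeiling}: raters split`)
    }
  }
  for (const [itemId, n] of Object.entries(input.survivorsPerItem)) {
    if (n < t.minSurvivors) {
      trustReasons.push(`(3) item ${itemId}: ${n} surviving raters < ${t.minSurvivors}`)
    }
  }
  return { trustworthy: trustReasons.length === 0, trustReasons }
}

// Example thresholds are illustrative values only.
const report = assessTrust(
  { irr: 0.456, perItemSpread: { a: 0.5 }, survivorsPerItem: { a: 1 } },
  { irrFloor: 0.6, spreadCeiling: 0.3, minSurvivors: 2 },
)
console.log(report.trustworthy, report.trustReasons)
```

Accumulating reasons instead of returning at the first failure means a single report lists every check that failed.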

package/dist/index.js CHANGED

@@ -1,3 +1,10 @@
+import {
+  addSecurityHeaders,
+  checkRateLimit,
+  extractRequestContext,
+  parseJsonObjectBody,
+  requireString
+} from "./chunk-CN75FIPT.js";
 import {
   DEFAULT_REDACTION_PATTERNS,
   buildRedactedDocument,
@@ -6,6 +13,11 @@ import {
   redactForIngestion,
   revealSpan
 } from "./chunk-5RMIUJDI.js";
+import {
+  createKnowledgeLoop,
+  createReviewerDecider,
+  reviewCandidate
+} from "./chunk-EEPJGZJW.js";
 import {
   DEFAULT_HARNESS,
   KNOWN_HARNESSES,
@@ -65,13 +77,6 @@ import {
   invokeIntegrationHub,
   resolveIntegrationAction
 } from "./chunk-L2TG5DBW.js";
-import {
-  addSecurityHeaders,
-  checkRateLimit,
-  extractRequestContext,
-  parseJsonObjectBody,
-  requireString
-} from "./chunk-CN75FIPT.js";
 import {
   DEFAULT_APP_TOOL_PATHS,
   DEFAULT_HEADER_NAMES,
@@ -123,11 +128,6 @@ import {
   buildKnowledgeRequirements,
   deriveSignals
 } from "./chunk-ZXNXAQAH.js";
-import {
-  createKnowledgeLoop,
-  createReviewerDecider,
-  reviewCandidate
-} from "./chunk-EEPJGZJW.js";
 export {
   APP_TOOL_NAMES,
   DEFAULT_APP_TOOL_PATHS,

package/package.json CHANGED

@@ -1,7 +1,6 @@
 {
   "name": "@tangle-network/agent-app",
-  "version": "0.1.14",
-  "packageManager": "pnpm@10.33.4",
+  "version": "0.3.0",
   "description": "Application-shell framework for Tangle agent products: a bounded tool loop, the structured agent→app tool side channel, integration-hub client, per-workspace billing, and crypto — composed over the Tangle agent substrate through typed seams.",
   "keywords": [
     "tangle",
@@ -28,7 +27,8 @@
   "main": "./dist/index.js",
   "types": "./dist/index.d.ts",
   "files": [
-    "dist"
+    "dist",
+    ".claude/skills"
   ],
   "exports": {
     ".": {
@@ -61,6 +61,11 @@
       "import": "./dist/eval/index.js",
       "default": "./dist/eval/index.js"
     },
+    "./eval-campaign": {
+      "types": "./dist/eval-campaign/index.d.ts",
+      "import": "./dist/eval-campaign/index.js",
+      "default": "./dist/eval-campaign/index.js"
+    },
     "./knowledge": {
       "types": "./dist/knowledge/index.d.ts",
       "import": "./dist/knowledge/index.js",
@@ -122,16 +127,8 @@
       "default": "./dist/store/index.js"
     }
   },
-  "scripts": {
-    "build": "tsup",
-    "dev": "tsup --watch",
-    "prepare": "tsup",
-    "test": "vitest run",
-    "test:watch": "vitest",
-    "typecheck": "tsc --noEmit"
-  },
   "devDependencies": {
-    "@tangle-network/agent-eval": "^0.70.0",
+    "@tangle-network/agent-eval": "^0.82.0",
     "@tangle-network/agent-integrations": "^0.32.0",
     "@tangle-network/agent-knowledge": "^1.5.2",
     "@types/node": "^25.6.0",
@@ -140,7 +137,7 @@
     "vitest": "^3.0.0"
   },
   "peerDependencies": {
-    "@tangle-network/agent-eval": ">=0.50.0",
+    "@tangle-network/agent-eval": ">=0.82.0",
     "@tangle-network/agent-integrations": ">=0.32.0",
     "@tangle-network/agent-knowledge": ">=1.5.0",
     "@tangle-network/agent-runtime": ">=0.21.0"
@@ -152,5 +149,12 @@
     "@tangle-network/agent-runtime": {
       "optional": true
     }
+  },
+  "scripts": {
+    "build": "tsup",
+    "dev": "tsup --watch",
+    "test": "vitest run",
+    "test:watch": "vitest",
+    "typecheck": "tsc --noEmit"
   }
-}
+}