@tangle-network/agent-eval 0.17.2 → 0.18.0

Files changed (11)

package/README.md +24 -16
package/dist/index.d.ts +271 -75
package/dist/index.js +393 -16
package/dist/index.js.map +1 -1
package/docs/concepts.md +155 -0
package/docs/control-runtime.md +351 -0
package/docs/feature-guide.md +213 -0
package/docs/feedback-trajectories.md +193 -0
package/docs/multi-shot-optimization.md +122 -0
package/docs/wire-protocol.md +199 -0
package/package.json +21 -14

package/README.md CHANGED

@@ -21,7 +21,7 @@ console.log(ship.result.passed, ship.result.score)
 - You ship a content generator and need quality signal beyond "the LLM said it's good".
 - You want a release gate that fails on regressions you can name, not vibes.
-If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — 5-minute mental model — then come back here.
+If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — 5-minute mental model — then use [`docs/feature-guide.md`](./docs/feature-guide.md) to choose the right primitive.
 ## Quickstart
@@ -65,6 +65,7 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 ## Two ways to read this repo
 - **You're a human onboarding** — read [`docs/concepts.md`](./docs/concepts.md) for the mental model, then [`docs/wire-protocol.md`](./docs/wire-protocol.md) if you'll call from another language, or `SKILL.md` if you'll embed in TS.
+- **You're deciding what to integrate** — read [`docs/feature-guide.md`](./docs/feature-guide.md) for the layman explanation, use cases, feature map, and guardrails.
 - **You're an LLM agent writing integration code** — read `SKILL.md`. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.
 ## What's in the box
@@ -78,8 +79,10 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 | `clients/python/` | First-party Python client (`tangle-agent-eval` on PyPI). Version-locked to npm. | clients/python/README.md |
 | `BenchmarkRunner`, `executeScenario`, `ConvergenceTracker` | Multi-turn scenario execution + cross-run tracking. | SKILL.md |
 | `runAgentControlLoop` | Policy-based runtime for agentic tasks: observe typed state, validate, decide, act, repeat with budgets, tracing, and stuck-loop guards. | [control-runtime.md](./docs/control-runtime.md) |
-| `FeedbackTrajectory`, `InMemoryFeedbackTrajectoryStore`, `FileSystemFeedbackTrajectoryStore` | Product-native learning loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | [feedback-trajectories.md](./docs/feedback-trajectories.md) |
+| `FeedbackTrajectory`, `InMemoryFeedbackTrajectoryStore`, `FileSystemFeedbackTrajectoryStore` | Human/environment feedback loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | [feedback-trajectories.md](./docs/feedback-trajectories.md) |
+| `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
 | `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
+| `runMultiShotOptimization`, `trialTraceFromMultiShotTrial` | GEPA-style optimization for variable-length agent trajectories with ASI, paired seeds, and optional held-out promotion gating. | [multi-shot-optimization.md](./docs/multi-shot-optimization.md) |
 | `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
 | `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
 | `correlationStudy`, `OutcomeStore`, `ProductRegistry` | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
@@ -87,6 +90,12 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 ## Evolution loop
+For agent tasks that run across many chat turns or tool calls, start with
+[`runMultiShotOptimization`](./docs/multi-shot-optimization.md). It runs the
+same prompt-evolution core over full trajectories, carries actionable side
+information into reflection, and separates the search winner from the variant
+that actually passes held-out promotion.
 Closing the loop on a prompt or codebase is **two adapters + a config**. Compose `runPromptEvolution` with `createCompositeMutator` (plateau policy) and you get prompt-only optimization until improvement stalls, then automatic switch to code-channel mutations from a coding agent inside a `SandboxPool`.
 ```ts
@@ -170,9 +179,9 @@ The `MutationTelemetry`, `LineageRecorder`, and `CostLedger` pass into the `code
 For the full primitive surface and rationale, read each module's JSDoc — `prompt-evolution.ts`, `composite-mutator.ts`, `sandbox-pool.ts`, `code-mutator.ts`, `reflective-mutation.ts`, `evolution-telemetry.ts`.
-## Product feedback loop
+## Feedback trajectory loop
-When normal product usage should generate training/eval signal, use feedback
+When normal agent usage should generate training/eval signal, use feedback
 trajectories. They turn approvals, rejections, option choices, edits, metrics,
 and policy blocks into reusable examples.
@@ -185,22 +194,21 @@ import {
 } from '@tangle-network/agent-eval'
 const trajectory = createFeedbackTrajectory({
-  projectId: 'gtm-agent',
-  scenarioId: 'ad-positioning-choice',
-  task: { intent: 'Choose a paid-social positioning angle.' },
+  projectId: 'research-agent',
+  scenarioId: 'brief-review',
+  task: { intent: 'Revise a research brief until it is specific and sourced.' },
   attempts: [{
     id: 'draft-1',
     stepIndex: 0,
-    artifactType: 'decision',
-    artifact: { option: 'enterprise procurement language' },
-    options: ['enterprise procurement', 'technical founder pain'],
+    artifactType: 'research',
+    artifact: { summary: 'Initial brief with weak sourcing.' },
     createdAt: new Date().toISOString(),
   }],
   labels: [{
     source: 'user',
-    kind: 'reject',
-    value: 'enterprise procurement',
-    reason: 'too enterprise; our buyer is a technical founder',
+    kind: 'revision_request',
+    value: 'needs stronger evidence',
+    reason: 'add primary sources and remove unsupported claims',
     severity: 'error',
     createdAt: new Date().toISOString(),
   }],
@@ -211,9 +219,9 @@ const scenarios = feedbackTrajectoriesToDatasetScenarios([trajectory])
 const optimizerRows = feedbackTrajectoriesToOptimizerRows([trajectory])
 ```
-This is the bridge between product UX and optimization: normal user feedback
-becomes immediate memory, replayable eval scenarios, and prompt/signature/code
-optimizer input. See [`docs/feedback-trajectories.md`](./docs/feedback-trajectories.md).
+This is the bridge between feedback and optimization: review signals become
+immediate memory, replayable eval scenarios, and prompt/signature/code optimizer
+input. See [`docs/feedback-trajectories.md`](./docs/feedback-trajectories.md).
 ## v0.16 highlights — production-rigor primitives
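
Reviewer note: the README hunk above says `runMultiShotOptimization` "carries actionable side information into reflection", and the accompanying `index.d.ts` change declares `ActionableSideInfo`, `TrialTrace`, and `trialTraceFromMultiShotTrial`. The sketch below shows one plausible way ASI rows could be projected into `TrialTrace.expectations`. It is an illustration under stated assumptions, not the package's implementation: `SketchTrial` and `sketchTrialTrace` are hypothetical names, and `SketchTrial` is a simplified stand-in for `MultiShotTrialResult`, whose base type `TrialResult` is not shown in this diff.

```ts
// Shapes copied (trimmed) from the index.d.ts declarations in this diff.
interface TrialTrace {
  id: string;
  score: number;
  inputName?: string;
  expectations?: Array<{ id: string; phrase: string; matched: boolean }>;
  emitted?: string;
  metrics?: Record<string, number>;
}

interface ActionableSideInfo {
  expectationId?: string;
  message: string;
  severity?: 'info' | 'warning' | 'error' | 'critical';
  evidence?: string;
  suggestion?: string;
  matched?: boolean;
}

// ASSUMPTION: hypothetical, simplified stand-in for MultiShotTrialResult.
interface SketchTrial {
  id: string;
  scenarioId: string;
  score: number;
  asi?: ActionableSideInfo[];
  emitted?: string;
}

// Hypothetical projection; the real trialTraceFromMultiShotTrial may differ.
function sketchTrialTrace(trial: SketchTrial): TrialTrace {
  const expectations = (trial.asi ?? []).map((row, index) => ({
    // Fall back to a positional id when the scorer gave no expectation id.
    id: row.expectationId ?? `asi-${index}`,
    // Fold the suggested fix into the phrase so a reflection prompt can quote it.
    phrase: row.suggestion
      ? `${row.message} (suggested fix: ${row.suggestion})`
      : row.message,
    // The declaration comment says `matched` defaults to false for ASI rows.
    matched: row.matched ?? false,
  }));

  return {
    id: trial.id,
    score: trial.score,
    inputName: trial.scenarioId,
    expectations,
    // Only set `emitted` when the trial actually carried one.
    ...(trial.emitted !== undefined ? { emitted: trial.emitted } : {}),
  };
}

const sketchTrace = sketchTrialTrace({
  id: 'trial-1',
  scenarioId: 'brief-review',
  score: 0.4,
  asi: [
    {
      expectationId: 'cites-primary-source',
      message: 'Brief cites no primary sources',
      suggestion: 'require at least one primary source per claim',
    },
    { message: 'Tool call retried three times without changing arguments' },
  ],
  emitted: 'Initial brief with weak sourcing.',
});

console.log(sketchTrace.expectations);
```

Keeping the projection this thin means the scorer stays the single place where diagnoses are written; the reflection side only needs the `TrialTrace` shape.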

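Reviewer note: the README table now lists `evaluateActionPolicy` as a "generic action preflight for approval, budget, expected-outcome, and kill-criteria checks", and `index.d.ts` adds `ActionExecutionPolicy` and `ActionPolicyDecision`. The diff shows only the declarations, so the following is a rough sketch of how such a preflight might behave. Everything beyond the copied `ActionExecutionPolicy` fields is an assumption: `SketchAction` is a hypothetical stand-in for `ProposedSideEffect` (not shown in this diff), `sketchEvaluateActionPolicy` is a hypothetical name, and the rule that `allowed` means "neither blocked nor awaiting approval" is a guess.

```ts
// Copied from the ActionExecutionPolicy declaration in this diff.
interface ActionExecutionPolicy {
  allowedTypes?: string[];
  blockedTypes?: string[];
  alwaysRequireApprovalTypes?: string[];
  autoApproveTypes?: string[];
  requireApprovalForExternalSideEffects?: boolean;
  requireApprovalAboveCostUsd?: number;
  maxActionCostUsd?: number;
  remainingBudgetUsd?: number;
  expectedOutcomeRequired?: boolean;
  killCriteriaRequired?: boolean;
}

// ASSUMPTION: hypothetical stand-in for ProposedSideEffect.
interface SketchAction {
  type: string;
  estimatedCostUsd?: number;
  externalSideEffect?: boolean;
  expectedOutcome?: string;
  killCriteria?: string;
}

// Mirrors ActionPolicyDecision minus the optional FeedbackLabel.
interface SketchDecision {
  allowed: boolean;
  blocked: boolean;
  requiresApproval: boolean;
  reasons: string[];
}

// Hypothetical preflight; the package's real rules and ordering may differ.
function sketchEvaluateActionPolicy(
  action: SketchAction,
  policy: ActionExecutionPolicy = {},
): SketchDecision {
  const reasons: string[] = [];
  const cost = action.estimatedCostUsd ?? 0;
  let blocked = false;
  let requiresApproval = false;
  const block = (reason: string): void => {
    blocked = true;
    reasons.push(reason);
  };

  // Hard blocks: type lists, cost ceilings, and required planning fields.
  if (policy.blockedTypes?.includes(action.type)) block(`type "${action.type}" is blocked`);
  if (policy.allowedTypes && !policy.allowedTypes.includes(action.type)) {
    block(`type "${action.type}" is not in allowedTypes`);
  }
  if (policy.maxActionCostUsd !== undefined && cost > policy.maxActionCostUsd) {
    block('cost exceeds maxActionCostUsd');
  }
  if (policy.remainingBudgetUsd !== undefined && cost > policy.remainingBudgetUsd) {
    block('cost exceeds remainingBudgetUsd');
  }
  if (policy.expectedOutcomeRequired && !action.expectedOutcome) block('expected outcome is missing');
  if (policy.killCriteriaRequired && !action.killCriteria) block('kill criteria are missing');

  // Approval: "always require" wins over "auto approve" in this sketch.
  const autoApproved = policy.autoApproveTypes?.includes(action.type) ?? false;
  if (policy.alwaysRequireApprovalTypes?.includes(action.type)) {
    requiresApproval = true;
    reasons.push(`type "${action.type}" always requires approval`);
  } else if (!autoApproved) {
    if (policy.requireApprovalForExternalSideEffects && action.externalSideEffect) {
      requiresApproval = true;
      reasons.push('external side effect requires approval');
    }
    if (policy.requireApprovalAboveCostUsd !== undefined && cost > policy.requireApprovalAboveCostUsd) {
      requiresApproval = true;
      reasons.push('cost is above the approval threshold');
    }
  }

  return { allowed: !blocked && !requiresApproval, blocked, requiresApproval, reasons };
}

const needsApproval = sketchEvaluateActionPolicy(
  { type: 'send_email', estimatedCostUsd: 0.5, externalSideEffect: true },
  { requireApprovalForExternalSideEffects: true, maxActionCostUsd: 1 },
);
const blockedDecision = sketchEvaluateActionPolicy(
  { type: 'delete_repo' },
  { blockedTypes: ['delete_repo'] },
);

console.log(needsApproval, blockedDecision);
```

Returning every triggered reason, rather than stopping at the first, matches the `reasons: string[]` field in the declared decision type and gives a reviewer the full picture in one pass.
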
package/dist/index.d.ts CHANGED

@@ -1437,6 +1437,17 @@ interface FeedbackOptimizerRow extends OptimizationExample {
     labelKinds: FeedbackLabelKind[];
     score?: number;
 }
+interface FeedbackReplayResult {
+    trajectoryId: string;
+    pass: boolean;
+    score?: number;
+    labels: FeedbackLabel[];
+    outcome?: FeedbackOutcome;
+    metadata?: Record<string, unknown>;
+}
+interface FeedbackReplayAdapter {
+    replay(trajectory: FeedbackTrajectory): Promise<Omit<FeedbackReplayResult, 'trajectoryId'>> | Omit<FeedbackReplayResult, 'trajectoryId'>;
+}
 declare class InMemoryFeedbackTrajectoryStore implements FeedbackTrajectoryStore {
     private readonly trajectories;
     save(trajectory: FeedbackTrajectory): Promise<void>;
@@ -1479,6 +1490,8 @@ declare function feedbackTrajectoryToDatasetScenario(trajectory: FeedbackTraject
 declare function feedbackTrajectoriesToDatasetScenarios(trajectories: FeedbackTrajectory[]): DatasetScenario[];
 declare function feedbackTrajectoryToOptimizerRow(trajectory: FeedbackTrajectory): FeedbackOptimizerRow;
 declare function feedbackTrajectoriesToOptimizerRows(trajectories: FeedbackTrajectory[]): FeedbackOptimizerRow[];
+declare function replayFeedbackTrajectory(trajectory: FeedbackTrajectory, adapter: FeedbackReplayAdapter): Promise<FeedbackReplayResult>;
+declare function replayFeedbackTrajectories(trajectories: FeedbackTrajectory[], adapter: FeedbackReplayAdapter): Promise<FeedbackReplayResult[]>;
 declare function summarizePreferenceMemory(trajectories: FeedbackTrajectory[], options?: {
     maxEntries?: number;
 }): PreferenceMemoryEntry[];
@@ -1494,6 +1507,29 @@ declare function controlRunToFeedbackTrajectory<TState, TAction, TActionResult>(
     createdAt?: string;
 }): FeedbackTrajectory;
+interface ActionExecutionPolicy {
+    allowedTypes?: string[];
+    blockedTypes?: string[];
+    alwaysRequireApprovalTypes?: string[];
+    autoApproveTypes?: string[];
+    requireApprovalForExternalSideEffects?: boolean;
+    requireApprovalAboveCostUsd?: number;
+    maxActionCostUsd?: number;
+    remainingBudgetUsd?: number;
+    expectedOutcomeRequired?: boolean;
+    killCriteriaRequired?: boolean;
+}
+interface ActionPolicyDecision {
+    allowed: boolean;
+    blocked: boolean;
+    requiresApproval: boolean;
+    reasons: string[];
+    label?: FeedbackLabel;
+}
+declare function evaluateActionPolicy(action: ProposedSideEffect, policy?: ActionExecutionPolicy, options?: {
+    createdAt?: string;
+}): ActionPolicyDecision;
 /**
  * Normalize scores so all dimensions follow "higher = better".
  * Inverted dimensions (hallucination, false_confidence, worst_failure)
@@ -1595,7 +1631,7 @@ declare class ConvergenceTracker {
  * Uses the Web Crypto API (works in Workers, Node 22+, browsers).
  */
 interface PromptHandle {
-    /** Stable human-readable id, e.g. 'legal.system' */
+    /** Stable human-readable id, e.g. 'browser.system' */
     id: string;
     /** Caller-chosen version string, e.g. 'v3' or '2026-04-20' */
     version: string;
@@ -1687,7 +1723,7 @@ declare function analyzeAntiSlop(outputs: string[], config: Omit<Required<AntiSl
  * Artifact validators.
  *
  * Generic "score a produced artifact" primitive. Tax uses it for PDF form
- * correctness, legal for contract clauses, film for script breakdowns, GTM
+ * correctness, research for sourced briefs, browser for task assertions, coding
  * for social posts. One interface, many validators; all plug into
  * `BenchmarkRunner` the same way.
  *
@@ -1975,7 +2011,7 @@ declare class FileSystemExperimentStore implements ExperimentStore {
  * `Run.status` field one-to-one.
  *
  * Why this lives next to `InMemoryExperimentStore`:
- *   - bad-app, legal-agent, gtm-agent, film-agent all run as Workers
+ *   - browser, coding, and computer-use agents can all run as Workers
  *   - Workers cannot use `node:fs`, so `FileSystemExperimentStore` doesn't apply
  *   - Hand-rolling D1 SQL in every consumer is exactly the duplication this
  *     module exists to prevent
@@ -2008,7 +2044,7 @@ interface D1ExperimentStoreOptions {
     db: D1Like;
     /**
      * Optional table-name prefix so multiple ExperimentStores can share a DB
-     * without colliding (e.g. `tax_eval_experiments` vs `legal_eval_experiments`).
+     * without colliding (e.g. `browser_eval_experiments` vs `coding_eval_experiments`).
      * Default: `agent_eval_`.
      */
     tablePrefix?: string;
@@ -2592,7 +2628,7 @@ type HostedRunCriticConfig = Pick<RunCriticOptions, 'weights'> & {
 /**
  * Dual-agent convergence bench.
  *
- * Pattern lifted from tax-agent + legal-agent: two agents take turns until
+ * Pattern lifted from dual-worker review loops: two agents take turns until
  * they converge on a consensus artifact. One proposes, the other critiques;
  * the proposer revises; repeat until a score threshold is hit or max rounds.
  *
@@ -3408,7 +3444,7 @@ declare function evaluateOracles(obs: OracleObservation, oracles: Oracle[]): Ora
 /**
  * Cost tracker — token + USD accounting per scenario and per run.
  *
- * Lifted from tax/legal metrics.ts + tangle-router UsageEvent. Every
+ * Adapted from generic usage-event accounting. Every
  * optimizer needs to know "is the quality gain worth the cost delta?",
  * and every dashboard needs dollars-per-completed-task. MODEL_PRICING
  * from metrics.ts stays authoritative for estimate math; this module
@@ -3619,7 +3655,7 @@ declare function analyzeSeries(values: number[], options?: SeriesConvergenceOpti
  * State continuity scoring — measures how well a resumed/handed-off agent
  * preserves prior work.
  *
- * Lifted from tax-agent's run-resume-eval.ts. When session 2 continues
+ * When session 2 continues
  * session 1's work, the key question is: did it preserve key artifacts,
  * or start over and lose context? Each `ContinuityCheck` inspects one
  * aspect (file preserved, key count grew, status advanced) and yields
@@ -7633,6 +7669,233 @@ interface PromptEvolutionResult<P = unknown> {
 }
 declare function runPromptEvolution<P>(config: PromptEvolutionConfig<P>): Promise<PromptEvolutionResult<P>>;
+/**
+ * Reflective mutation — primitives for trace-conditioned prompt rewriting.
+ *
+ * Used by `prompt-evolution.ts` (and any consumer running iterative
+ * improvement). Given a parent prompt + concrete trace evidence (top trials,
+ * bottom trials, missed expectations), produce an LLM-ready prompt that
+ * proposes targeted mutations — not blind rephrasings.
+ *
+ * Why this lives outside `prompt-evolution.ts`: any consumer that wants to
+ * run reflective rewriting WITHOUT the population/Pareto machinery can
+ * import these primitives directly.
+ *
+ * Quality bar (vs. naive "mutate this prompt"):
+ *   - Show parent ↔ children diff, not just one variant
+ *   - Quote specific missed goldens with their match phrases
+ *   - Surface the model's actual emitted output side-by-side with what was expected
+ *   - Quote concrete mutation primitives so the model has a vocabulary
+ */
+interface TrialTrace {
+    /** Stable id for the trial — surfaces in the prompt for grounding. */
+    id: string;
+    /** Score the trial received on its primary metric. */
+    score: number;
+    /** Candidate inputs the agent was given (e.g., the fixture or scenario). */
+    inputName?: string;
+    /**
+     * Goldens / expectations this trial was tested against, with whether each
+     * was matched. The reflection prompt quotes the missed ones specifically.
+     */
+    expectations?: Array<{
+        id: string;
+        phrase: string;
+        matched: boolean;
+    }>;
+    /** Free-form text — what the agent actually emitted (e.g., findings, plan). */
+    emitted?: string;
+    /** Optional structured metrics (recall, precision, cost, latency). */
+    metrics?: Record<string, number>;
+}
+interface ReflectionContext {
+    /** What is being mutated — appears in the system prompt for orientation. */
+    target: string;
+    /** Current variant's payload — JSON-serialised for the prompt. */
+    parentPayload: unknown;
+    /** Best-performing trials this generation. */
+    topTrials: TrialTrace[];
+    /** Worst-performing trials this generation — the missed-golden source. */
+    bottomTrials: TrialTrace[];
+    /** How many children the mutator should propose. */
+    childCount: number;
+    /** Optional: domain-specific mutation primitives the model can pick from. */
+    mutationPrimitives?: string[];
+}
+declare const DEFAULT_MUTATION_PRIMITIVES: string[];
+/**
+ * Build the LLM-ready reflection prompt. Output is plain text — pass it as
+ * the user message. The system message should be small and stable (e.g.
+ * "Output ONLY a JSON object matching the schema below.").
+ */
+declare function buildReflectionPrompt(ctx: ReflectionContext): string;
+interface ReflectionProposal {
+    label: string;
+    rationale: string;
+    payload: unknown;
+}
+declare function parseReflectionResponse(raw: string, maxProposals?: number): ReflectionProposal[];
+/**
+ * Multi-shot optimization adapter.
+ *
+ * This is the canonical bridge between variable-length agent trajectories
+ * and `runPromptEvolution`. Apps provide four things:
+ *
+ *   - variants: prompt/config/tool-policy candidates
+ *   - runner: executes one full task trajectory for a variant
+ *   - scorer: turns that trajectory into score + actionable side information
+ *   - mutator: proposes new variants from top/bottom scored trials
+ *
+ * The adapter owns the boring but easy-to-get-wrong glue: stable seeds,
+ * score/cost objectives, error-to-trial conversion, ASI metric projection,
+ * and optional paired holdout gating via `HeldOutGate`.
+ */
+type MultiShotSplit = 'search' | 'dev' | 'holdout';
+type AsiSeverity = 'info' | 'warning' | 'error' | 'critical';
+type MultiShotVariant<P = unknown> = PromptVariant<P>;
+interface ActionableSideInfo {
+    /** Stable expectation/check id when available. */
+    expectationId?: string;
+    /** Human-readable diagnosis of what happened. */
+    message: string;
+    severity?: AsiSeverity;
+    /** Concrete trace excerpt, file path, tool call, screenshot id, etc. */
+    evidence?: string;
+    /** Prompt/tool/context surface likely responsible. */
+    responsibleSurface?: string;
+    /** Suggested fix in natural language. */
+    suggestion?: string;
+    /** Whether this expectation was satisfied. Defaults to false for ASI rows. */
+    matched?: boolean;
+    metadata?: Record<string, unknown>;
+}
+interface MultiShotTrace {
+    scenarioId: string;
+    /** Full turn/tool trace. Shape is intentionally app-owned. */
+    turns?: unknown[];
+    toolCalls?: unknown[];
+    artifacts?: unknown[];
+    /** Compact final output or summary used by reflection prompts. */
+    transcript?: string;
+    output?: unknown;
+    metadata?: Record<string, unknown>;
+}
+interface MultiShotRun {
+    trace: MultiShotTrace;
+    costUsd?: number;
+    durationMs?: number;
+    tokenUsage?: {
+        input?: number;
+        output?: number;
+        cached?: number;
+    };
+    metadata?: Record<string, unknown>;
+}
+interface MultiShotRunInput<P = unknown> {
+    variant: PromptVariant<P>;
+    scenarioId: string;
+    rep: number;
+    split: MultiShotSplit;
+    /** Stable paired seed for baseline/candidate comparisons. */
+    seed: number;
+}
+interface MultiShotRunner<P = unknown> {
+    run(input: MultiShotRunInput<P>): Promise<MultiShotRun> | MultiShotRun;
+}
+interface MultiShotScore {
+    /** Primary score in [0,1]. The adapter clamps for safety. */
+    score: number;
+    /** Pass/fail for top/bottom trial selection. Defaults to true. */
+    ok?: boolean;
+    costUsd?: number;
+    durationMs?: number;
+    metrics?: Record<string, number>;
+    asi?: ActionableSideInfo[];
+    /** Optional rich output shown to reflection mutators. */
+    emitted?: string;
+    metadata?: Record<string, unknown>;
+}
+interface MultiShotScorer<P = unknown> {
+    score(input: MultiShotRunInput<P> & {
+        run: MultiShotRun;
+    }): Promise<MultiShotScore> | MultiShotScore;
+}
+interface MultiShotTrialResult extends TrialResult {
+    split: MultiShotSplit;
+    seed: number;
+    trace?: MultiShotTrace;
+    asi?: ActionableSideInfo[];
+    emitted?: string;
+    metadata?: Record<string, unknown>;
+}
+interface MultiShotMutateAdapter<P = unknown> {
+    mutate(args: {
+        parent: PromptVariant<P>;
+        parentAggregate: VariantAggregate;
+        topTrials: MultiShotTrialResult[];
+        bottomTrials: MultiShotTrialResult[];
+        childCount: number;
+        generation: number;
+    }): Promise<PromptVariant<P>[]>;
+}
+interface MultiShotGateConfig<P = unknown> {
+    /** Search rows are optional, but enable HeldOutGate's overfit-gap check. */
+    searchScenarioIds?: string[];
+    holdoutScenarioIds: string[];
+    reps?: number;
+    gate: HeldOutGateConfig;
+    /** Convert scored trajectory runs into paper-grade RunRecords. */
+    toRunRecord(input: {
+        variant: PromptVariant<P>;
+        scenarioId: string;
+        rep: number;
+        split: RunSplitTag;
+        seed: number;
+        trial: MultiShotTrialResult;
+    }): RunRecord;
+}
+interface MultiShotOptimizationConfig<P = unknown> {
+    runId: string;
+    target: string;
+    seedVariants: PromptVariant<P>[];
+    searchScenarioIds: string[];
+    reps: number;
+    generations: number;
+    populationSize: number;
+    scoreConcurrency?: number;
+    runner: MultiShotRunner<P>;
+    scorer: MultiShotScorer<P>;
+    mutateAdapter: MultiShotMutateAdapter<P>;
+    objectives?: Objective<VariantAggregate>[];
+    scalarWeights?: Record<string, number>;
+    cache?: TrialCache;
+    earlyStopOnNoImprovement?: boolean;
+    seedBase?: number;
+    onProgress?: (event: PromptEvolutionEvent) => void;
+    gate?: MultiShotGateConfig<P>;
+}
+interface MultiShotGateResult {
+    decision: GateDecision;
+    candidateRuns: RunRecord[];
+    baselineRuns: RunRecord[];
+}
+interface MultiShotOptimizationResult<P = unknown> {
+    evolution: PromptEvolutionResult<P>;
+    /** Best candidate on the optimizer-visible search split. */
+    searchBestVariant: PromptVariant<P>;
+    searchBestAggregate: VariantAggregate;
+    /** Variant callers should actually ship after optional holdout gating. */
+    promotedVariant: PromptVariant<P>;
+    promotedAggregate: VariantAggregate;
+    /** Null when no gate was configured or the search-best candidate was the baseline. */
+    gate: MultiShotGateResult | null;
+}
+declare function runMultiShotOptimization<P>(config: MultiShotOptimizationConfig<P>): Promise<MultiShotOptimizationResult<P>>;
+declare function defaultMultiShotObjectives(): Objective<VariantAggregate>[];
+declare function trialTraceFromMultiShotTrial(trial: MultiShotTrialResult): TrialTrace;
 /**
  * concurrency — small primitives the evolution loop needs.
  *
@@ -8280,71 +8543,4 @@ declare function judgeReplayGate<TOutput>(args: JudgeReplayGateArgs<TOutput>): P
     candidateSamples: number;
 }>;
-/**
- * Reflective mutation — primitives for trace-conditioned prompt rewriting.
- *
- * Used by `prompt-evolution.ts` (and any consumer running iterative
- * improvement). Given a parent prompt + concrete trace evidence (top trials,
- * bottom trials, missed expectations), produce an LLM-ready prompt that
- * proposes targeted mutations — not blind rephrasings.
- *
- * Why this lives outside `prompt-evolution.ts`: any consumer that wants to
- * run reflective rewriting WITHOUT the population/Pareto machinery can
- * import these primitives directly.
- *
- * Quality bar (vs. naive "mutate this prompt"):
- *   - Show parent ↔ children diff, not just one variant
- *   - Quote specific missed goldens with their match phrases
- *   - Surface the model's actual emitted output side-by-side with what was expected
- *   - Quote concrete mutation primitives so the model has a vocabulary
- */
-interface TrialTrace {
-    /** Stable id for the trial — surfaces in the prompt for grounding. */
-    id: string;
-    /** Score the trial received on its primary metric. */
-    score: number;
-    /** Candidate inputs the agent was given (e.g., the fixture or scenario). */
-    inputName?: string;
-    /**
-     * Goldens / expectations this trial was tested against, with whether each
-     * was matched. The reflection prompt quotes the missed ones specifically.
-     */
-    expectations?: Array<{
-        id: string;
-        phrase: string;
-        matched: boolean;
-    }>;
-    /** Free-form text — what the agent actually emitted (e.g., findings, plan). */
-    emitted?: string;
-    /** Optional structured metrics (recall, precision, cost, latency). */
-    metrics?: Record<string, number>;
-}
-interface ReflectionContext {
-    /** What is being mutated — appears in the system prompt for orientation. */
-    target: string;
-    /** Current variant's payload — JSON-serialised for the prompt. */
-    parentPayload: unknown;
-    /** Best-performing trials this generation. */
-    topTrials: TrialTrace[];
-    /** Worst-performing trials this generation — the missed-golden source. */
-    bottomTrials: TrialTrace[];
-    /** How many children the mutator should propose. */
-    childCount: number;
-    /** Optional: domain-specific mutation primitives the model can pick from. */
-    mutationPrimitives?: string[];
-}
-declare const DEFAULT_MUTATION_PRIMITIVES: string[];
-/**
- * Build the LLM-ready reflection prompt. Output is plain text — pass it as
- * the user message. The system message should be small and stable (e.g.
- * "Output ONLY a JSON object matching the schema below.").
- */
-declare function buildReflectionPrompt(ctx: ReflectionContext): string;
-interface ReflectionProposal {
-    label: string;
-    rationale: string;
-    payload: unknown;
-}
-declare function parseReflectionResponse(raw: string, maxProposals?: number): ReflectionProposal[];
-export { type ActiveLearningOptions, type AdapterRun, AgentDriver, type AgentDriverConfig, type AlignmentOp, type AntiSlopConfig, type AntiSlopIssue, type AntiSlopReport, type Artifact$1 as Artifact, type ArtifactCheck, type Artifact as ArtifactCheckArtifact, type ArtifactResult, type ArtifactValidator, AxGepaSteeringOptimizer, type AxSteeringOptimizerConfig, BENCHMARK_SPLIT_SEED, type BaselineOptions, type BaselineReport, BehaviorAssertion, type BenchmarkAdapter, type BenchmarkDatasetItem, type BenchmarkEvaluation, type BenchmarkReport, BenchmarkRunner, type BenchmarkRunnerConfig, type BestOfNResult, type BisectOptions, type BisectResult, type BisectStep, type BootstrapOptions, type BootstrapResult, BudgetBreachError, type BudgetBreachFinding, type BudgetBreachReport, BudgetGuard, type BudgetLedgerEntry, type BudgetSpec, BuilderSession, type BuilderSessionInit, type CalibrationBin, type CalibrationOptions, type CalibrationReport, type CalibrationResult, CallExpectation, type CanaryAlert, type CanaryKind, type CanaryLeak, type CanaryOptions, type CanaryReport, type CanarySeverity, type CandidateScenario, type CandidateScore, type CausalAttributionReport, type ChatSummary, type CheckResult, type CodeMutationOutcome, type CodeMutationRunner, type CollectedArtifacts, type CommandRunner, type CompletionCriterion, type CompositePolicy, type ConceptComplexity, type ConceptFinding, type ConceptSpec, type ConceptWeightStrategy, type ContinuityCheck, type ContinuityCheckResult, type ContinuityReport, type ContinuitySnapshotPair, type ContractMetric, type ContractReport, type ControlActionFailureMode, type ControlActionOutcome, type ControlBudget, type ControlContext, type ControlDecision, type ControlEvalResult, type ControlRunResult, type ControlRuntimeConfig, type ControlRuntimeError, type ControlSeverity, type ControlStep, type ControlStopPolicies, ConvergenceTracker, type CorrelationReport, type CorrelationResult, type CorrelationStudyOptions, type 
CorrelationStudyResult, type CostEntry, CostLedger, type CostLedgerGeneration, type CostLedgerSnapshot, type CostSummary, CostTracker, type CounterfactualContext, type CounterfactualMutation, type CounterfactualResult, type CounterfactualRunner, type CreateCompositeMutatorOpts, type CreateDefaultReviewerOptions, type CreateSandboxCodeMutatorOpts, type CreateSandboxPoolOpts, type CrossTraceDiff, type CrossTraceDiffOptions, D1ExperimentStore, type D1ExperimentStoreOptions, type D1Like, type D1PreparedStatementLike, DEFAULT_AGENT_SLOS, DEFAULT_COMPLEXITY_WEIGHTS, DEFAULT_RULES as DEFAULT_FAILURE_RULES, DEFAULT_FINDERS, DEFAULT_HARNESS_OBJECTIVES, DEFAULT_MUTATION_PRIMITIVES, DEFAULT_MUTATORS, DEFAULT_REDACTION_RULES, DEFAULT_RED_TEAM_CORPUS, DEFAULT_RUN_SCORE_WEIGHTS, DEFAULT_SEVERITY_WEIGHTS, Dataset, type DatasetDifficulty, type DatasetManifest, type DatasetProvenance, type DatasetScenario, type DatasetSplit, type DeployFamily, type DeployGateLayerInput, type DeployRunResult, type DeployRunner, type DeploymentOutcome, type DirEntry, type Direction, type DivergenceOptions, type DivergenceReport, DockerSandboxDriver, type DriverResult, type DriverState, DualAgentBench, type DualAgentBenchConfig, type DualAgentReport, type DualAgentRound, type DualAgentScenario, type DualAgentScenarioResult, ERROR_COUNT_PATTERNS, type ErrorCountPattern, type EuRiskClass, type EvalMetricSpec, type EvalResult, type EventFilter, type EventKind, type EvolutionRound, type PromptVariant as EvolvableVariant, type ExecutorConfig, type Expectation, type Experiment, type ExperimentPlan, type ExperimentResult, type Run as ExperimentRun, type ExperimentStore, ExperimentTracker, type ExportedRewardModel, type ExtractOptions, type ExtractResult, FAILURE_CLASSES, type FactorContribution, type FactorialCell, type FailureClass, type FailureClassification, type FailureCluster, type FailureClusterReport, type FailureContext, type FailureMode, type FailureRule, type FeedbackArtifactType, type 
FeedbackAttempt, type FeedbackLabel, type FeedbackLabelKind, type FeedbackLabelSource, type FeedbackOptimizerRow, type FeedbackOutcome, type FeedbackPattern, type FeedbackSeverity, type FeedbackSplitPolicy, type FeedbackTask, type FeedbackTrajectory, type FeedbackTrajectoryFilter, type FeedbackTrajectoryStore, FileSystemExperimentStore, type FileSystemExperimentStoreOptions, FileSystemFeedbackTrajectoryStore, FileSystemOutcomeStore, type FileSystemOutcomeStoreOptions, FileSystemTraceStore, type FileSystemTraceStoreOptions, type Finding, type FlowAction, type FlowLayerEnv, type FlowLayerFactoryInput, type FlowRunner, type FlowRunnerStepResult, type FlowSpec, type FlowStep, type GainDistributionBin, type GainDistributionFigureSpec, type GainDistributionOptions, type GateDecision, type GateEvidence, type GenerationReport, type GenericSpan, type GoldenItem, type GoldenSeverity, type GoldenSpec, type GovernanceContext, type GovernanceFinding, type GovernanceReport, type GradedStep, type HarnessAdapter, type HarnessConfig, type HarnessExperimentConfig, type HarnessExperimentResult, type HarnessIntervention, type HarnessRunRequest, type HarnessRunResult, type HarnessScenario, type HarnessSelection, type HarnessVariant, type HarnessVariantReport, HeldOutGate, type HeldOutGateConfig, type HeldOutGateRejectionCode, HoldoutAuditor, HoldoutLockedError, type HostedJudgeConfig, type HostedJudgeDimension, type HostedJudgeRequest, type HostedJudgeResponse, type HostedRunCriticConfig, type HostedRunScoreRequest, type HostedRunScoreResponse, type HypothesisManifest, type HypothesisResult, INTENT_MATCH_JUDGE_VERSION, type ImageData, InMemoryExperimentStore, InMemoryFeedbackTrajectoryStore, InMemoryOutcomeStore, InMemoryTraceStore, InMemoryTrialCache, InMemoryWorkspaceInspector, type InferenceScorer, type InspectorContext, type IntentMatchInput, type IntentMatchOptions, type IntentMatchResult, type InteractionContribution, JsonlTrialCache, type JudgeAgreementReport, type JudgeConfig, 
type JudgeFleetOptions, type JudgeFn, type JudgeInput, type JudgePair, type JudgeReplayGateArgs, type JudgeReplayResult, type JudgeRubric, JudgeRunner, type JudgeScore, type JudgeSpan, type KeywordConceptSpec, type KeywordCoverageFinding, type KeywordCoverageOptions, type KeywordCoverageResult, type LangfuseEnvelope, type LangfuseGeneration, type LangfuseScore, type Layer, type LayerCorrelation, type LayerResult, type LayerStatus, type LineageKind, type LineageKindResolver, type LineageNode, LineageRecorder, LlmCallError, type LlmCallRequest, type LlmCallResult, LlmClient, type LlmClientOptions, type LlmJsonCall, type LlmMessage, type LlmReviewerConfig, type LlmSpan, type LlmUsage, LockedJsonlAppender, MODEL_PRICING, type MatchResult, type MatcherResult, type MeasurementPolicy, type MergeOptions, type Message, type MetricSamples, type MetricVerdict, MetricsCollector, type MuffledFinder, type MuffledFinding, MultiLayerVerifier, type MultiToolchainLayerConfig, type MutateAdapter, type MutationAttempt, type MutationChannel, MutationTelemetry, type Mutator, Mutex, NoopResearcher, OTEL_AGENT_EVAL_SCOPE, type Objective, type OptimizationConfig, type OptimizationExample, OptimizationLoop, type OptimizationLoopConfig, type OptimizationLoopResult, type OptimizationResult, type Oracle, type OracleObservation, type OracleReport, type OracleResult, type OrthogonalityInput, type OrthogonalityResult, type OtlpExport, type OtlpResourceSpans, type OtlpSpan, type OutcomeFilter, type OutcomePair, type OutcomeStore, type PairedBootstrapOptions, type PairedBootstrapResult, type PairwiseComparison, PairwiseSteeringOptimizer, type ParetoFigureSpec, type ParetoPoint, type ParetoResult, type PersonaConfig, type Playbook, type PlaybookEntry, type PoolSlot, type PositionalBiasResult, type PreferenceMemoryEntry, type PrmGradedTrace, PrmGrader, type PrmTrainingSample, ProductClient, type ProductClientConfig, type ProjectKind, ProjectRegistry, type ProjectSummary, type ProjectTimelineEntry, 
type PromptEvolutionConfig, type PromptEvolutionEvent, type PromptEvolutionResult, type PromptHandle, PromptOptimizer, PromptRegistry, type TrialResult as PromptTrialResult, type PromptVariant$1 as PromptVariant, type ProposeFn, type ProposeInput, type ProposeOutput, type ProposeReviewConfig, type ProposeReviewControlAction, type ProposeReviewControlConfig, type ProposeReviewControlResult, type ProposeReviewControlState, type ProposeReviewReport, type ProposeReviewShot, type ProposedSideEffect, REDACTION_VERSION, type RedTeamCase, type RedTeamCategory, type RedTeamFinding, type RedTeamPayload, type RedTeamReport, type RedactionReport, type RedactionRule, type ReferenceMatchResult, type ReferenceReplayAdapter, type ReferenceReplayAdapterFn, type ReferenceReplayAdapterLike, type ReferenceReplayAggregate, type ReferenceReplayCandidate, type ReferenceReplayCase, type ReferenceReplayCaseRun, type ReferenceReplayExecutionScenario, type ReferenceReplayItem, type ReferenceReplayMatch, type ReferenceReplayMatchStrategy, type ReferenceReplayMatcher, type ReferenceReplayPromotionDecision, type ReferenceReplayPromotionPolicy, type ReferenceReplayRun, type ReferenceReplayRunContext, type ReferenceReplayRunOptions, type ReferenceReplayRunStore, type ReferenceReplayScenario, type ReferenceReplayScenarioScore, type ReferenceReplayScore, type ReferenceReplayScoreOptions, type ReferenceReplaySplit, type ReferenceReplaySplitComparison, type ReferenceReplaySteeringRowsOptions, type ReflectionContext, type ReflectionProposal, type RegressionOptions, type RegressionSpec, type Researcher, type RetrievalSpan, type Review, type ReviewFn, type ReviewInput, type ReviewMemoryEntry, type ReviewMemoryStore, type ReviewerMemoryEntry, type ReviewerOutput, type ReviewerPromptInput, type ReviewerSoftFailDefaults, type ReviewerVerificationSummary, type RobustnessResult, type RouteMap, type RubricDimension, type Run$1 as Run, type RunAppScenarioOptions, type RunCommandInput, type RunCommandResult, 
type RunConfig, RunCritic, type RunCriticOptions, type RunDiff, type RunFilter, type RunJudgeMetadata, type RunLayer, type RunOutcome, type RunRecord, RunRecordValidationError, type RunScore, type RunScoreWeights, type RunSplitTag, type RunStatus, type RunTokenUsage, type RunTrace, SEMANTIC_CONCEPT_JUDGE_VERSION, type SandboxDriver, SandboxHarness, type SandboxHarnessResult, type SandboxJudgeKind, type SandboxJudgeResult, type SandboxJudgeSpec, type SandboxPool, type SandboxResult, type SandboxSpan, type ScanOptions, type Scenario, type ScenarioAggregate, type ScenarioCost, type ScenarioFile, ScenarioRegistry, type ScenarioResult, type ScoreAdapter, type ScoredTarget, type SelfPlayOptions, type SelfPlayProposer, type SelfPlayScorer, type SelfPreferenceResult, type SemanticConceptJudgeInput, type SemanticConceptJudgeOptions, type SemanticConceptJudgeResult, type SeriesConvergenceOptions, type SeriesConvergenceResult, type Severity, type ShipOptions, type SignedManifest, type SliceOptions, type Slo, type SloCheckResult, type SloComparator, type SloReport, type SloSeverity, type SlopCategory, type SlotFactory, type Span, type SpanBase, type SpanFilter, type SpanHandle, type SpanKind, type SpanStatus, type SteeringBundle, type SteeringChange, type SteeringDelta, type SteeringEvaluation, type SteeringOptimizationResult, type SteeringOptimizationRow, type SteeringOptimizationSelector, type SteeringOptimizerBackend, type SteeringOptimizerConfig, type SteeringRolePrompt, type SteeringVariantReport, type StepAttribution, type StepContext, type StepRubric, type StopDecision, type StuckLoopFinding, type StuckLoopOptions, type StuckLoopReport, SubprocessSandboxDriver, type SubprocessSandboxDriverOptions, type SummaryTable, type SummaryTableOptions, type SummaryTableRow, type SynthesisReason, type SynthesisTarget, TRACE_SCHEMA_VERSION, type TestGradedRunOptions, type TestGradedRunResult, type TestGradedScenario, type TestOutputParser, type TestResult, type 
ThreeLayerProjectReport, type ThresholdContract, TokenCounter, type TokenSpec, type ToolSpan, type ToolStats, type ToolUseMetrics, type ToolUseOptions, type ToolWasteFinding, type ToolWasteOptions, type ToolWasteReport, TraceEmitter, type TraceEmitterOptions, type TraceEvent, type TraceStore, type Trajectory, type TrajectoryStep, type TrialAttempt, type TrialCache, TrialTelemetry, type TrialTrace, type Turn, type TurnMetrics, type TurnResult, UNIVERSAL_FINDERS, type UseCaseSignals, type ValidationContext, type ValidationIssue, type ValidationResult, type VariantAggregate, type VariantScore, type VerbosityBiasResult, type Verdict, type Verification, type VerificationReport, type VerifyContext, type VerifyFn, type VerifyOptions, type VisualDiffOptions, type VisualDiffResult, type ViteDeployRunnerInput, type WorkflowTopology, type WorkspaceAssertion, type WorkspaceAssertionResult, type WorkspaceInspector, type WorkspaceSnapshot, type WranglerDeployRunnerInput, adversarialJudge, aggregateLlm, aggregateRunScore, allCriticalPassed, analyzeAntiSlop, analyzeSeries, argHash, assignFeedbackSplit, attributeCounterfactuals, deterministicSplit as benchmarkDeterministicSplit, index as benchmarks, benjaminiHochberg, bhAdjust, bisect, bonferroni, bootstrapCi, budgetBreachView, buildReflectionPrompt, buildReviewerPrompt, buildTrajectory, byteLengthRange, calibrateJudge, calibrationCurve, callLlm, callLlmJson, canaryLeakView, causalAttribution, checkCanaries, checkSlos, clamp01, classifyEuAiRisk, classifyFailure, codeExecutionJudge, cohensD, coherenceJudge, collectionPreserved, commitBisect, compareReferenceReplay, compareToBaseline, compilerJudge, composeParsers, composeValidators, computeToolUseMetrics, confidenceInterval, containsAll, controlFailureClassFromVerification, controlRunToFeedbackTrajectory, correlateLayers, correlationStudy, createAntiSlopJudge, createCompositeMutator, createCustomJudge, createDefaultReviewer, createDomainExpertJudge, createFeedbackTrajectory, 
createIntentMatchJudge, createLlmReviewer, createSandboxCodeMutator, createSandboxPool, createSemanticConceptJudge, crossTraceDiff, crowdingDistance, decideReferenceReplayPromotion, decideReferenceReplayRunPromotion, defaultJudges, defaultReferenceReplayMatcher, deployGateLayer, distillPlaybook, dominates, estimateCost, estimateTokens, euAiActReport, evaluateContract, evaluateHypothesis, evaluateOracles, executeScenario, expectAgent, exportRewardModel, exportRunAsOtlp, exportTrainingData, extractAssetUrls, extractErrorCount, failureClusterView, feedbackTrajectoriesToDatasetScenarios, feedbackTrajectoriesToOptimizerRows, feedbackTrajectoryToDatasetScenario, feedbackTrajectoryToOptimizerRow, fileContains, fileExists, findAutoMatchNoExpectation, findConstructorCwdDropped, findFallbackToPass, findLiteralTruePass, findSkipCountsAsPass, firstDivergenceView, flowLayer, formatBenchmarkReport, formatDriverReport, formatFindings, gainHistogram, precision as goldenPrecision, gradeSemanticStatus, groupBy, hashContent, hashScenarios, htmlContainsElement, inMemoryReferenceReplayStore, inMemoryReviewStore, interRaterReliability, iqr, isJudgeSpan, isLlmSpan, isPrmVerdict, isRetrievalSpan, isRunRecord, isSandboxSpan, isToolSpan, jestTestParser, jsonHasKeys, jsonShape, jsonlReferenceReplayStore, jsonlReviewStore, judgeAgreementView, judgeReplayGate, judgeSpans, keyPreserved, linterJudge, llmSpanFromProvider, llmSpans, loadScorerFromGrader, localCommandRunner, lowercaseMutator, mannWhitneyU, matchGoldens, mergeLayerResults, mergeSteeringBundle, multiToolchainLayer, nistAiRmfReport, nonRefusalRubric, normalizeScores, notBlocked, objectiveEval, outputLengthRubric, pairedBootstrap, pairedTTest, pairedWilcoxon, paraphraseRobustness, paretoChart, paretoFrontier, paretoFrontierWithCrowding, parseFeedbackTrajectoriesJsonl, parseReflectionResponse, parseRunRecordSafe, partialCredit, passOrthogonality, pixelDeltaRatio, politenessPrefixMutator, positionalBias, printDriverSummary, prmBestOfN, 
prmEnsembleBestOfN, probeLlm, promptBisect, proposeSynthesisTargets, pytestTestParser, redTeamDataset, redTeamReport, redactString, redactValue, referenceReplayRunsToSteeringRows, referenceReplayScenarioToRunScore, regexMatch, regexMatches, regressionView, renderMarkdown, renderMarkdownReport, renderPlaybookMarkdown, renderPreferenceMemoryMarkdown, renderSteeringText, replayScorerOverCorpus, replayTraceThroughJudge, requiredSampleSize, resetLockedAppendersForTesting, resumeBuilderSession, roundTripRunRecord, rowCount, rowWhere, runAgentControlLoop, runAssertions, runCanaries, runCounterfactual, runE2EWorkflow, runExpectations, runFailureClass, runHarnessExperiment, runIntentMatchJudge, runJudgeFleet, runKeywordCoverageJudge, runKeywordCoverageJudgeUrl, runPromptEvolution, runProposeReview, runProposeReviewAsControlLoop, runReferenceReplay, runSelfPlay, runSemanticConceptJudge, runTestGradedScenario, runsForScenario, scalarScore, scanForMuffledGates, scoreAllProjects, scoreContinuity, scoreProject, scoreRedTeamOutput, scoreReferenceReplay, securityJudge, selectHarnessVariant, selfPreference, sentenceReorderMutator, serializeFeedbackTrajectoriesJsonl, signManifest, soc2Report, statusAdvanced, stopOnNoProgress, stopOnRepeatedAction, stripFencedJson, stuckLoopView, subjectiveEval, summarize, summarizeHarnessResults, summarizePreferenceMemory, summaryTable, testJudge, textInSnapshot, toLangfuseEnvelope, toNdjson, toPrometheusText, toolIntentAlignmentRubric, toolNamesForRun, toolNonRedundantRubric, toolSpans, toolSuccessRubric, toolWasteView, typoMutator, urlContains, validateRunRecord, verbosityBias, verifyManifest, visualDiff, viteDeployRunner, vitestTestParser, weightedMean, weightedRecall, welchsTTest, whitespaceCollapseMutator, wilcoxonSignedRank, withAssignedFeedbackSplit, wranglerDeployRunner };
+export { type ActionExecutionPolicy, type ActionPolicyDecision, type ActionableSideInfo, type ActiveLearningOptions, type AdapterRun, AgentDriver, type AgentDriverConfig, type AlignmentOp, type AntiSlopConfig, type AntiSlopIssue, type AntiSlopReport, type Artifact$1 as Artifact, type ArtifactCheck, type Artifact as ArtifactCheckArtifact, type ArtifactResult, type ArtifactValidator, type AsiSeverity, AxGepaSteeringOptimizer, type AxSteeringOptimizerConfig, BENCHMARK_SPLIT_SEED, type BaselineOptions, type BaselineReport, BehaviorAssertion, type BenchmarkAdapter, type BenchmarkDatasetItem, type BenchmarkEvaluation, type BenchmarkReport, BenchmarkRunner, type BenchmarkRunnerConfig, type BestOfNResult, type BisectOptions, type BisectResult, type BisectStep, type BootstrapOptions, type BootstrapResult, BudgetBreachError, type BudgetBreachFinding, type BudgetBreachReport, BudgetGuard, type BudgetLedgerEntry, type BudgetSpec, BuilderSession, type BuilderSessionInit, type CalibrationBin, type CalibrationOptions, type CalibrationReport, type CalibrationResult, CallExpectation, type CanaryAlert, type CanaryKind, type CanaryLeak, type CanaryOptions, type CanaryReport, type CanarySeverity, type CandidateScenario, type CandidateScore, type CausalAttributionReport, type ChatSummary, type CheckResult, type CodeMutationOutcome, type CodeMutationRunner, type CollectedArtifacts, type CommandRunner, type CompletionCriterion, type CompositePolicy, type ConceptComplexity, type ConceptFinding, type ConceptSpec, type ConceptWeightStrategy, type ContinuityCheck, type ContinuityCheckResult, type ContinuityReport, type ContinuitySnapshotPair, type ContractMetric, type ContractReport, type ControlActionFailureMode, type ControlActionOutcome, type ControlBudget, type ControlContext, type ControlDecision, type ControlEvalResult, type ControlRunResult, type ControlRuntimeConfig, type ControlRuntimeError, type ControlSeverity, type ControlStep, type ControlStopPolicies, ConvergenceTracker, type 
CorrelationReport, type CorrelationResult, type CorrelationStudyOptions, type CorrelationStudyResult, type CostEntry, CostLedger, type CostLedgerGeneration, type CostLedgerSnapshot, type CostSummary, CostTracker, type CounterfactualContext, type CounterfactualMutation, type CounterfactualResult, type CounterfactualRunner, type CreateCompositeMutatorOpts, type CreateDefaultReviewerOptions, type CreateSandboxCodeMutatorOpts, type CreateSandboxPoolOpts, type CrossTraceDiff, type CrossTraceDiffOptions, D1ExperimentStore, type D1ExperimentStoreOptions, type D1Like, type D1PreparedStatementLike, DEFAULT_AGENT_SLOS, DEFAULT_COMPLEXITY_WEIGHTS, DEFAULT_RULES as DEFAULT_FAILURE_RULES, DEFAULT_FINDERS, DEFAULT_HARNESS_OBJECTIVES, DEFAULT_MUTATION_PRIMITIVES, DEFAULT_MUTATORS, DEFAULT_REDACTION_RULES, DEFAULT_RED_TEAM_CORPUS, DEFAULT_RUN_SCORE_WEIGHTS, DEFAULT_SEVERITY_WEIGHTS, Dataset, type DatasetDifficulty, type DatasetManifest, type DatasetProvenance, type DatasetScenario, type DatasetSplit, type DeployFamily, type DeployGateLayerInput, type DeployRunResult, type DeployRunner, type DeploymentOutcome, type DirEntry, type Direction, type DivergenceOptions, type DivergenceReport, DockerSandboxDriver, type DriverResult, type DriverState, DualAgentBench, type DualAgentBenchConfig, type DualAgentReport, type DualAgentRound, type DualAgentScenario, type DualAgentScenarioResult, ERROR_COUNT_PATTERNS, type ErrorCountPattern, type EuRiskClass, type EvalMetricSpec, type EvalResult, type EventFilter, type EventKind, type EvolutionRound, type PromptVariant as EvolvableVariant, type ExecutorConfig, type Expectation, type Experiment, type ExperimentPlan, type ExperimentResult, type Run as ExperimentRun, type ExperimentStore, ExperimentTracker, type ExportedRewardModel, type ExtractOptions, type ExtractResult, FAILURE_CLASSES, type FactorContribution, type FactorialCell, type FailureClass, type FailureClassification, type FailureCluster, type FailureClusterReport, type FailureContext, 
type FailureMode, type FailureRule, type FeedbackArtifactType, type FeedbackAttempt, type FeedbackLabel, type FeedbackLabelKind, type FeedbackLabelSource, type FeedbackOptimizerRow, type FeedbackOutcome, type FeedbackPattern, type FeedbackReplayAdapter, type FeedbackReplayResult, type FeedbackSeverity, type FeedbackSplitPolicy, type FeedbackTask, type FeedbackTrajectory, type FeedbackTrajectoryFilter, type FeedbackTrajectoryStore, FileSystemExperimentStore, type FileSystemExperimentStoreOptions, FileSystemFeedbackTrajectoryStore, FileSystemOutcomeStore, type FileSystemOutcomeStoreOptions, FileSystemTraceStore, type FileSystemTraceStoreOptions, type Finding, type FlowAction, type FlowLayerEnv, type FlowLayerFactoryInput, type FlowRunner, type FlowRunnerStepResult, type FlowSpec, type FlowStep, type GainDistributionBin, type GainDistributionFigureSpec, type GainDistributionOptions, type GateDecision, type GateEvidence, type GenerationReport, type GenericSpan, type GoldenItem, type GoldenSeverity, type GoldenSpec, type GovernanceContext, type GovernanceFinding, type GovernanceReport, type GradedStep, type HarnessAdapter, type HarnessConfig, type HarnessExperimentConfig, type HarnessExperimentResult, type HarnessIntervention, type HarnessRunRequest, type HarnessRunResult, type HarnessScenario, type HarnessSelection, type HarnessVariant, type HarnessVariantReport, HeldOutGate, type HeldOutGateConfig, type HeldOutGateRejectionCode, HoldoutAuditor, HoldoutLockedError, type HostedJudgeConfig, type HostedJudgeDimension, type HostedJudgeRequest, type HostedJudgeResponse, type HostedRunCriticConfig, type HostedRunScoreRequest, type HostedRunScoreResponse, type HypothesisManifest, type HypothesisResult, INTENT_MATCH_JUDGE_VERSION, type ImageData, InMemoryExperimentStore, InMemoryFeedbackTrajectoryStore, InMemoryOutcomeStore, InMemoryTraceStore, InMemoryTrialCache, InMemoryWorkspaceInspector, type InferenceScorer, type InspectorContext, type IntentMatchInput, type 
IntentMatchOptions, type IntentMatchResult, type InteractionContribution, JsonlTrialCache, type JudgeAgreementReport, type JudgeConfig, type JudgeFleetOptions, type JudgeFn, type JudgeInput, type JudgePair, type JudgeReplayGateArgs, type JudgeReplayResult, type JudgeRubric, JudgeRunner, type JudgeScore, type JudgeSpan, type KeywordConceptSpec, type KeywordCoverageFinding, type KeywordCoverageOptions, type KeywordCoverageResult, type LangfuseEnvelope, type LangfuseGeneration, type LangfuseScore, type Layer, type LayerCorrelation, type LayerResult, type LayerStatus, type LineageKind, type LineageKindResolver, type LineageNode, LineageRecorder, LlmCallError, type LlmCallRequest, type LlmCallResult, LlmClient, type LlmClientOptions, type LlmJsonCall, type LlmMessage, type LlmReviewerConfig, type LlmSpan, type LlmUsage, LockedJsonlAppender, MODEL_PRICING, type MatchResult, type MatcherResult, type MeasurementPolicy, type MergeOptions, type Message, type MetricSamples, type MetricVerdict, MetricsCollector, type MuffledFinder, type MuffledFinding, MultiLayerVerifier, type MultiShotGateConfig, type MultiShotGateResult, type MultiShotMutateAdapter, type MultiShotOptimizationConfig, type MultiShotOptimizationResult, type MultiShotRun, type MultiShotRunInput, type MultiShotRunner, type MultiShotScore, type MultiShotScorer, type MultiShotSplit, type MultiShotTrace, type MultiShotTrialResult, type MultiShotVariant, type MultiToolchainLayerConfig, type MutateAdapter, type MutationAttempt, type MutationChannel, MutationTelemetry, type Mutator, Mutex, NoopResearcher, OTEL_AGENT_EVAL_SCOPE, type Objective, type OptimizationConfig, type OptimizationExample, OptimizationLoop, type OptimizationLoopConfig, type OptimizationLoopResult, type OptimizationResult, type Oracle, type OracleObservation, type OracleReport, type OracleResult, type OrthogonalityInput, type OrthogonalityResult, type OtlpExport, type OtlpResourceSpans, type OtlpSpan, type OutcomeFilter, type OutcomePair, type 
OutcomeStore, type PairedBootstrapOptions, type PairedBootstrapResult, type PairwiseComparison, PairwiseSteeringOptimizer, type ParetoFigureSpec, type ParetoPoint, type ParetoResult, type PersonaConfig, type Playbook, type PlaybookEntry, type PoolSlot, type PositionalBiasResult, type PreferenceMemoryEntry, type PrmGradedTrace, PrmGrader, type PrmTrainingSample, ProductClient, type ProductClientConfig, type ProjectKind, ProjectRegistry, type ProjectSummary, type ProjectTimelineEntry, type PromptEvolutionConfig, type PromptEvolutionEvent, type PromptEvolutionResult, type PromptHandle, PromptOptimizer, PromptRegistry, type TrialResult as PromptTrialResult, type PromptVariant$1 as PromptVariant, type ProposeFn, type ProposeInput, type ProposeOutput, type ProposeReviewConfig, type ProposeReviewControlAction, type ProposeReviewControlConfig, type ProposeReviewControlResult, type ProposeReviewControlState, type ProposeReviewReport, type ProposeReviewShot, type ProposedSideEffect, REDACTION_VERSION, type RedTeamCase, type RedTeamCategory, type RedTeamFinding, type RedTeamPayload, type RedTeamReport, type RedactionReport, type RedactionRule, type ReferenceMatchResult, type ReferenceReplayAdapter, type ReferenceReplayAdapterFn, type ReferenceReplayAdapterLike, type ReferenceReplayAggregate, type ReferenceReplayCandidate, type ReferenceReplayCase, type ReferenceReplayCaseRun, type ReferenceReplayExecutionScenario, type ReferenceReplayItem, type ReferenceReplayMatch, type ReferenceReplayMatchStrategy, type ReferenceReplayMatcher, type ReferenceReplayPromotionDecision, type ReferenceReplayPromotionPolicy, type ReferenceReplayRun, type ReferenceReplayRunContext, type ReferenceReplayRunOptions, type ReferenceReplayRunStore, type ReferenceReplayScenario, type ReferenceReplayScenarioScore, type ReferenceReplayScore, type ReferenceReplayScoreOptions, type ReferenceReplaySplit, type ReferenceReplaySplitComparison, type ReferenceReplaySteeringRowsOptions, type ReflectionContext, type 
ReflectionProposal, type RegressionOptions, type RegressionSpec, type Researcher, type RetrievalSpan, type Review, type ReviewFn, type ReviewInput, type ReviewMemoryEntry, type ReviewMemoryStore, type ReviewerMemoryEntry, type ReviewerOutput, type ReviewerPromptInput, type ReviewerSoftFailDefaults, type ReviewerVerificationSummary, type RobustnessResult, type RouteMap, type RubricDimension, type Run$1 as Run, type RunAppScenarioOptions, type RunCommandInput, type RunCommandResult, type RunConfig, RunCritic, type RunCriticOptions, type RunDiff, type RunFilter, type RunJudgeMetadata, type RunLayer, type RunOutcome, type RunRecord, RunRecordValidationError, type RunScore, type RunScoreWeights, type RunSplitTag, type RunStatus, type RunTokenUsage, type RunTrace, SEMANTIC_CONCEPT_JUDGE_VERSION, type SandboxDriver, SandboxHarness, type SandboxHarnessResult, type SandboxJudgeKind, type SandboxJudgeResult, type SandboxJudgeSpec, type SandboxPool, type SandboxResult, type SandboxSpan, type ScanOptions, type Scenario, type ScenarioAggregate, type ScenarioCost, type ScenarioFile, ScenarioRegistry, type ScenarioResult, type ScoreAdapter, type ScoredTarget, type SelfPlayOptions, type SelfPlayProposer, type SelfPlayScorer, type SelfPreferenceResult, type SemanticConceptJudgeInput, type SemanticConceptJudgeOptions, type SemanticConceptJudgeResult, type SeriesConvergenceOptions, type SeriesConvergenceResult, type Severity, type ShipOptions, type SignedManifest, type SliceOptions, type Slo, type SloCheckResult, type SloComparator, type SloReport, type SloSeverity, type SlopCategory, type SlotFactory, type Span, type SpanBase, type SpanFilter, type SpanHandle, type SpanKind, type SpanStatus, type SteeringBundle, type SteeringChange, type SteeringDelta, type SteeringEvaluation, type SteeringOptimizationResult, type SteeringOptimizationRow, type SteeringOptimizationSelector, type SteeringOptimizerBackend, type SteeringOptimizerConfig, type SteeringRolePrompt, type 
SteeringVariantReport, type StepAttribution, type StepContext, type StepRubric, type StopDecision, type StuckLoopFinding, type StuckLoopOptions, type StuckLoopReport, SubprocessSandboxDriver, type SubprocessSandboxDriverOptions, type SummaryTable, type SummaryTableOptions, type SummaryTableRow, type SynthesisReason, type SynthesisTarget, TRACE_SCHEMA_VERSION, type TestGradedRunOptions, type TestGradedRunResult, type TestGradedScenario, type TestOutputParser, type TestResult, type ThreeLayerProjectReport, type ThresholdContract, TokenCounter, type TokenSpec, type ToolSpan, type ToolStats, type ToolUseMetrics, type ToolUseOptions, type ToolWasteFinding, type ToolWasteOptions, type ToolWasteReport, TraceEmitter, type TraceEmitterOptions, type TraceEvent, type TraceStore, type Trajectory, type TrajectoryStep, type TrialAttempt, type TrialCache, TrialTelemetry, type TrialTrace, type Turn, type TurnMetrics, type TurnResult, UNIVERSAL_FINDERS, type UseCaseSignals, type ValidationContext, type ValidationIssue, type ValidationResult, type VariantAggregate, type VariantScore, type VerbosityBiasResult, type Verdict, type Verification, type VerificationReport, type VerifyContext, type VerifyFn, type VerifyOptions, type VisualDiffOptions, type VisualDiffResult, type ViteDeployRunnerInput, type WorkflowTopology, type WorkspaceAssertion, type WorkspaceAssertionResult, type WorkspaceInspector, type WorkspaceSnapshot, type WranglerDeployRunnerInput, adversarialJudge, aggregateLlm, aggregateRunScore, allCriticalPassed, analyzeAntiSlop, analyzeSeries, argHash, assignFeedbackSplit, attributeCounterfactuals, deterministicSplit as benchmarkDeterministicSplit, index as benchmarks, benjaminiHochberg, bhAdjust, bisect, bonferroni, bootstrapCi, budgetBreachView, buildReflectionPrompt, buildReviewerPrompt, buildTrajectory, byteLengthRange, calibrateJudge, calibrationCurve, callLlm, callLlmJson, canaryLeakView, causalAttribution, checkCanaries, checkSlos, clamp01, classifyEuAiRisk, 
classifyFailure, codeExecutionJudge, cohensD, coherenceJudge, collectionPreserved, commitBisect, compareReferenceReplay, compareToBaseline, compilerJudge, composeParsers, composeValidators, computeToolUseMetrics, confidenceInterval, containsAll, controlFailureClassFromVerification, controlRunToFeedbackTrajectory, correlateLayers, correlationStudy, createAntiSlopJudge, createCompositeMutator, createCustomJudge, createDefaultReviewer, createDomainExpertJudge, createFeedbackTrajectory, createIntentMatchJudge, createLlmReviewer, createSandboxCodeMutator, createSandboxPool, createSemanticConceptJudge, crossTraceDiff, crowdingDistance, decideReferenceReplayPromotion, decideReferenceReplayRunPromotion, defaultJudges, defaultMultiShotObjectives, defaultReferenceReplayMatcher, deployGateLayer, distillPlaybook, dominates, estimateCost, estimateTokens, euAiActReport, evaluateActionPolicy, evaluateContract, evaluateHypothesis, evaluateOracles, executeScenario, expectAgent, exportRewardModel, exportRunAsOtlp, exportTrainingData, extractAssetUrls, extractErrorCount, failureClusterView, feedbackTrajectoriesToDatasetScenarios, feedbackTrajectoriesToOptimizerRows, feedbackTrajectoryToDatasetScenario, feedbackTrajectoryToOptimizerRow, fileContains, fileExists, findAutoMatchNoExpectation, findConstructorCwdDropped, findFallbackToPass, findLiteralTruePass, findSkipCountsAsPass, firstDivergenceView, flowLayer, formatBenchmarkReport, formatDriverReport, formatFindings, gainHistogram, precision as goldenPrecision, gradeSemanticStatus, groupBy, hashContent, hashScenarios, htmlContainsElement, inMemoryReferenceReplayStore, inMemoryReviewStore, interRaterReliability, iqr, isJudgeSpan, isLlmSpan, isPrmVerdict, isRetrievalSpan, isRunRecord, isSandboxSpan, isToolSpan, jestTestParser, jsonHasKeys, jsonShape, jsonlReferenceReplayStore, jsonlReviewStore, judgeAgreementView, judgeReplayGate, judgeSpans, keyPreserved, linterJudge, llmSpanFromProvider, llmSpans, loadScorerFromGrader, 
localCommandRunner, lowercaseMutator, mannWhitneyU, matchGoldens, mergeLayerResults, mergeSteeringBundle, multiToolchainLayer, nistAiRmfReport, nonRefusalRubric, normalizeScores, notBlocked, objectiveEval, outputLengthRubric, pairedBootstrap, pairedTTest, pairedWilcoxon, paraphraseRobustness, paretoChart, paretoFrontier, paretoFrontierWithCrowding, parseFeedbackTrajectoriesJsonl, parseReflectionResponse, parseRunRecordSafe, partialCredit, passOrthogonality, pixelDeltaRatio, politenessPrefixMutator, positionalBias, printDriverSummary, prmBestOfN, prmEnsembleBestOfN, probeLlm, promptBisect, proposeSynthesisTargets, pytestTestParser, redTeamDataset, redTeamReport, redactString, redactValue, referenceReplayRunsToSteeringRows, referenceReplayScenarioToRunScore, regexMatch, regexMatches, regressionView, renderMarkdown, renderMarkdownReport, renderPlaybookMarkdown, renderPreferenceMemoryMarkdown, renderSteeringText, replayFeedbackTrajectories, replayFeedbackTrajectory, replayScorerOverCorpus, replayTraceThroughJudge, requiredSampleSize, resetLockedAppendersForTesting, resumeBuilderSession, roundTripRunRecord, rowCount, rowWhere, runAgentControlLoop, runAssertions, runCanaries, runCounterfactual, runE2EWorkflow, runExpectations, runFailureClass, runHarnessExperiment, runIntentMatchJudge, runJudgeFleet, runKeywordCoverageJudge, runKeywordCoverageJudgeUrl, runMultiShotOptimization, runPromptEvolution, runProposeReview, runProposeReviewAsControlLoop, runReferenceReplay, runSelfPlay, runSemanticConceptJudge, runTestGradedScenario, runsForScenario, scalarScore, scanForMuffledGates, scoreAllProjects, scoreContinuity, scoreProject, scoreRedTeamOutput, scoreReferenceReplay, securityJudge, selectHarnessVariant, selfPreference, sentenceReorderMutator, serializeFeedbackTrajectoriesJsonl, signManifest, soc2Report, statusAdvanced, stopOnNoProgress, stopOnRepeatedAction, stripFencedJson, stuckLoopView, subjectiveEval, summarize, summarizeHarnessResults, summarizePreferenceMemory, 
summaryTable, testJudge, textInSnapshot, toLangfuseEnvelope, toNdjson, toPrometheusText, toolIntentAlignmentRubric, toolNamesForRun, toolNonRedundantRubric, toolSpans, toolSuccessRubric, toolWasteView, trialTraceFromMultiShotTrial, typoMutator, urlContains, validateRunRecord, verbosityBias, verifyManifest, visualDiff, viteDeployRunner, vitestTestParser, weightedMean, weightedRecall, welchsTTest, whitespaceCollapseMutator, wilcoxonSignedRank, withAssignedFeedbackSplit, wranglerDeployRunner };