@tangle-network/agent-eval 0.33.1 → 0.34.0

Files changed (6)

package/CHANGELOG.md CHANGED

@@ -1,5 +1,38 @@
 # Changelog
+## 0.34.0 — 2026-05-23
+### Eval evolution-tracking — first-class `AgentProfile` + per-cell scorecard
+The headline shift: a feature PR's eval can now answer the question a single
+run cannot — *did this change regress persona P on profile F, even while the
+aggregate improved?*
+- **`AgentProfile` + `agentProfileHash`** — the harness's unit of variation.
+  Model lives inside the profile (skill/tool order doesn't matter; the `id`
+  label is excluded from identity), so "same model, different skills" is two
+  profiles. (#78)
+- **Append-only JSONL scorecard** keyed `(scenarioId, profileHash)` —
+  `recordRuns` / `recordRunsToScorecard` / `loadScorecard`. Idempotent
+  appends on `eventId` so concurrent campaign runs cannot clobber. (#78)
+- **`diffScorecard`** — per-cell verdict (`improved` / `regressed` / `flat` /
+  `new`) using Cohen's d + Welch's t-test; the keystone CI guard is
+  `diff.cells.filter(c => c.verdict === 'regressed')`. `formatScorecardDiff`
+  renders the PR-facing report. (#78)
+- **Agent profile cells** — `src/agent-profile-cell.ts` extends the profile
+  contract into `RunRecord` rows and `runEvalCampaign` so every campaign row
+  is keyed by `(profile, scenario, seed)` end-to-end. (#79)
+- **Stats consolidation** — `pairedBootstrap`, power analysis, and the
+  paired/Welch primitives now all live in `src/statistics.ts`. (#73)
+- **LLM retry classifier unified** across `llm-client` and `judge-retry`
+  via `isTransientLlmError`. (#74)
+- **`pr-review-benchmark` source committed** — the module was exported from
+  `index.ts` since the run-record refactor but the source files were never
+  committed; CI on `main` has been red on #78/#79/#81 as a result. (#83)
+- **Examples**: `scorecard/`, `held-out-gate/`, `user-simulation-driver/`. (#81)
+No breaking changes — additive across the board.
 ## 0.33.0 — 2026-05-21
 ### Release — `decideNextUserTurn` in the published tarball

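The changelog entry above names `diff.cells.filter(c => c.verdict === 'regressed')` as the keystone CI guard. The following is a minimal sketch of how such a guard might be wired up. It is not code from the package: the `CellDiff` and `Diff` types are local, trimmed copies of the `ScorecardCellDiff` and `ScorecardDiff` declarations added to `index.d.ts` in this release, the helpers `regressedCells` and `describeRegressions` are hypothetical, and the sample data is hand-written for illustration.

```typescript
// Local, trimmed copies of the ScorecardCellDiff / ScorecardDiff shapes declared
// in this release's index.d.ts, re-declared so the sketch runs standalone.
type CellVerdict = 'improved' | 'regressed' | 'flat' | 'new';

interface CellDiff {
  scenarioId: string;
  profileHash: string;
  model: string;
  verdict: CellVerdict;
  current: number;
  baseline: number | null; // null when verdict === 'new'
  delta: number | null;    // null when verdict === 'new'
}

interface Diff {
  cells: CellDiff[];
}

// Hypothetical helper (not part of the package): the filter quoted in the changelog.
function regressedCells(diff: Diff): CellDiff[] {
  return diff.cells.filter((c) => c.verdict === 'regressed');
}

// Hypothetical helper: one log line per regressed cell.
function describeRegressions(diff: Diff): string[] {
  return regressedCells(diff).map((c) => {
    const delta = c.delta === null ? 'n/a' : c.delta.toFixed(3);
    return `${c.scenarioId} @ ${c.profileHash} (${c.model}): delta ${delta}`;
  });
}

// Hand-written sample data; ids, hashes, and model names are illustrative only.
const sampleDiff: Diff = {
  cells: [
    { scenarioId: 'persona-p', profileHash: 'hash-f', model: 'example-model',
      verdict: 'regressed', current: 0.61, baseline: 0.73, delta: -0.12 },
    { scenarioId: 'persona-q', profileHash: 'hash-f', model: 'example-model',
      verdict: 'improved', current: 0.9, baseline: 0.7, delta: 0.2 },
    { scenarioId: 'persona-r', profileHash: 'hash-g', model: 'example-model',
      verdict: 'new', current: 0.5, baseline: null, delta: null },
  ],
};

// The aggregate may look fine while one cell regressed; the guard surfaces that cell.
const failures: string[] = describeRegressions(sampleDiff);
const shouldFailBuild: boolean = failures.length > 0;
```

In a real pipeline the input would presumably come from `diffScorecard(loadScorecard(logPath))`, both of which are declared in the `index.d.ts` diff, and a CI script could exit with a non-zero status when `shouldFailBuild` is true.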
package/dist/index.d.ts CHANGED

@@ -1800,6 +1800,94 @@ declare class MetricsCollector {
     getConvergenceCurve(): number[];
 }
+type PrReviewSource = 'drew' | 'donovan' | 'shady' | 'codex' | 'claude-code' | 'gpt-5.5-high' | 'claude-opus-4.7-high' | 'kimi' | 'opencode' | (string & {});
+type PrReviewSeverity = 'critical' | 'high' | 'medium' | 'low' | 'nit';
+type PrReviewOutcome = 'accepted' | 'fixed' | 'rejected' | 'duplicate' | 'noise' | 'unknown';
+interface PrReviewComment {
+    id: string;
+    source: PrReviewSource;
+    body: string;
+    model?: string;
+    author?: string;
+    path?: string;
+    line?: number;
+    severity?: PrReviewSeverity;
+    outcome?: PrReviewOutcome;
+    createdAt?: string;
+    metadata?: Record<string, unknown>;
+}
+interface PrReviewReferenceFinding {
+    id: string;
+    title: string;
+    severity: PrReviewSeverity;
+    path?: string;
+    line?: number;
+    /**
+     * Stable terms that should appear in a useful finding. Keep these
+     * factual: API names, invariant names, table names, error classes.
+     */
+    keywords?: string[];
+    fixedByCommit?: string;
+    sourceCommentIds?: string[];
+    metadata?: Record<string, unknown>;
+}
+interface PrReviewAuditCase {
+    id: string;
+    repo: string;
+    prNumber?: number;
+    baseSha?: string;
+    headSha?: string;
+    title?: string;
+    diff?: string;
+    split?: 'train' | 'validation' | 'test' | 'holdout' | (string & {});
+    comments: PrReviewComment[];
+    referenceFindings: PrReviewReferenceFinding[];
+    metadata?: Record<string, unknown>;
+}
+interface PrReviewScoreWeights {
+    recall: number;
+    precision: number;
+    actionability: number;
+    severityCalibration: number;
+    lowNoise: number;
+}
+interface PrReviewMatchedFinding {
+    referenceId: string;
+    commentId: string;
+    score: number;
+}
+interface PrReviewScore {
+    caseId: string;
+    source: PrReviewSource;
+    commentCount: number;
+    referenceCount: number;
+    matchedFindings: PrReviewMatchedFinding[];
+    recall: number;
+    precision: number;
+    actionability: number;
+    severityCalibration: number;
+    lowNoise: number;
+    aggregate: number;
+    notes: string[];
+}
+interface PrReviewBenchmarkSummary {
+    source: PrReviewSource;
+    caseCount: number;
+    commentCount: number;
+    aggregateMean: number;
+    recallMean: number;
+    precisionMean: number;
+    actionabilityMean: number;
+    severityCalibrationMean: number;
+    lowNoiseMean: number;
+}
+declare const DEFAULT_PR_REVIEW_SCORE_WEIGHTS: PrReviewScoreWeights;
+declare function commentsForSource(auditCase: PrReviewAuditCase, source: PrReviewSource): PrReviewComment[];
+declare function scorePrReviewSource(auditCase: PrReviewAuditCase, source: PrReviewSource, weights?: Partial<PrReviewScoreWeights>): PrReviewScore;
+declare function scorePrReviewComments(auditCase: PrReviewAuditCase, comments: PrReviewComment[], source: PrReviewSource, weights?: Partial<PrReviewScoreWeights>): PrReviewScore;
+declare function summarizePrReviewBenchmark(scores: PrReviewScore[]): PrReviewBenchmarkSummary[];
+declare function aggregatePrReviewScore(dimensions: Pick<PrReviewScore, 'recall' | 'precision' | 'actionability' | 'severityCalibration' | 'lowNoise'>, weights?: Partial<PrReviewScoreWeights>): number;
 /**
  * ProductionLoop — the substrate that closes eval → prod → eval.
  *
@@ -3029,6 +3117,46 @@ declare class BudgetGuard {
     get state(): Record<keyof BudgetSpec, number>;
 }
+/**
+ * @stable
+ *
+ * AgentProfile — the eval harness's unit of variation.
+ *
+ * A profile pins everything that changes agent behaviour for a benchmark
+ * cell: the model, the active skills, the prompt version, the available
+ * tools. Vary the profile — swap a model, add a skill — and re-run the suite
+ * to benchmark the change. The scorecard keys a cell on
+ * `(scenarioId, profileHash)`, so the model is not a separate axis: it lives
+ * inside the profile, and two profiles with the same model but different
+ * skills are different cells.
+ *
+ * `agentProfileHash` is the profile's behaviour identity. Two profiles that
+ * produce the same agent behaviour share a hash (and a scorecard cell);
+ * reordering `skills` or `tools` does not change it; the human-facing `id`
+ * label does not affect it.
+ */
+interface AgentProfile {
+    /** Human-facing label, e.g. `sonnet-legal-skills-v3`. Not part of the hash. */
+    id: string;
+    /** Model snapshot id this profile pins, e.g. `claude-sonnet-4-6@2025-04-15`. */
+    model: string;
+    /** Skill ids/versions active in this profile — the primary behaviour lever. */
+    skills?: string[];
+    /** Prompt version identifier. */
+    promptVersion?: string;
+    /** Tool ids available to the agent. */
+    tools?: string[];
+    /** Any other behaviour-bearing knobs that should fingerprint into the hash. */
+    metadata?: Record<string, string | number | boolean>;
+}
+/**
+ * Deterministic behaviour identity of a profile — a sha256 over the
+ * behaviour-bearing fields. `skills` and `tools` are order-insensitive; the
+ * `id` label is excluded. Throws on a profile with no `model` — an unkeyable
+ * profile must fail loud rather than collapse into a blank-model cell.
+ */
+declare function agentProfileHash(profile: AgentProfile): string;
 /**
  * Cost tracker — token + USD accounting per scenario and per run.
  *
@@ -3262,6 +3390,138 @@ interface OracleReport {
 /** Run all oracles against one observation and aggregate. */
 declare function evaluateOracles(obs: OracleObservation, oracles: Oracle[]): OracleReport;
+/**
+ * @stable
+ *
+ * Eval scorecard — the persistent (persona × profile) score timeline.
+ *
+ * Every benchmark run folds into per-cell entries; a cell is
+ * `(scenarioId, profileHash)` and its timeline carries one entry per commit.
+ * The scorecard answers the question a single run cannot: did THIS change
+ * regress persona P on profile F, even while the aggregate improved?
+ *
+ * Storage is an append-only JSONL log — one line per (cell, commit). Appends
+ * never read-modify-write, so concurrent campaign runs cannot clobber each
+ * other; `loadScorecard` folds the log into the queryable `Scorecard`, and a
+ * malformed line never breaks the read. `diffScorecard` compares the latest
+ * entry of each cell against its predecessor with Cohen's d + Welch's t-test.
+ */
+/** One commit's measurement of one (scenario, profile) cell. */
+interface ScorecardEntry {
+    commitSha: string;
+    /** ISO timestamp the entry was recorded. */
+    timestamp: string;
+    /** Per-seed (or per-rep) scores for this cell at this commit. */
+    scores: number[];
+    /** Median of `scores` — the cell's headline score for the commit. */
+    composite: number;
+    /** Per-dimension means, when the runs carried a judge breakdown. */
+    perDimension?: Record<string, number>;
+    /** RunRecord ids folded into this entry — provenance. */
+    runIds: string[];
+}
+/** A (scenario, profile) cell and its commit-ordered score timeline. */
+interface ScorecardCell {
+    scenarioId: string;
+    profileHash: string;
+    /** Model id — denormalised from the profile for readable filtering. */
+    model: string;
+    timeline: ScorecardEntry[];
+}
+/** The folded scorecard: every cell, plus the profile definitions by hash. */
+interface Scorecard {
+    cells: ScorecardCell[];
+    /** Profile definitions seen — keeps the scorecard self-describing. */
+    profiles: Record<string, AgentProfile>;
+}
+/** One append-only log line — a single cell's entry for a single commit. */
+interface ScorecardLogLine {
+    scenarioId: string;
+    profileHash: string;
+    model: string;
+    profile: AgentProfile;
+    entry: ScorecardEntry;
+}
+interface RecordRunsOptions {
+    /** The profile that produced these runs — keys the cell. */
+    profile: AgentProfile;
+    commitSha: string;
+    /** Defaults to `new Date().toISOString()`. */
+    timestamp?: string;
+}
+/**
+ * Fold a benchmark's `RunRecord[]` into per-cell scorecard log lines — one
+ * line per scenario the runs cover. All runs are attributed to the single
+ * `profile` in `opts` (the harness ran them under it); the cell key is
+ * `(scenarioId, agentProfileHash(profile))`.
+ */
+declare function recordRuns(runs: RunRecord[], opts: RecordRunsOptions): ScorecardLogLine[];
+/** Append cell entries to the JSONL scorecard log. Creates the file/dir. */
+declare function appendScorecard(logPath: string, lines: ScorecardLogLine[]): void;
+/** Record runs and append them to the log in one call. Returns the lines. */
+declare function recordRunsToScorecard(logPath: string, runs: RunRecord[], opts: RecordRunsOptions): ScorecardLogLine[];
+/**
+ * Fold the JSONL log into a queryable `Scorecard`. A missing file yields an
+ * empty scorecard; a malformed line is skipped — a corrupt append never
+ * breaks the read. Each cell's timeline is sorted chronologically.
+ */
+declare function loadScorecard(logPath: string): Scorecard;
+type CellVerdict = 'improved' | 'regressed' | 'flat' | 'new';
+interface ScorecardCellDiff {
+    scenarioId: string;
+    profileHash: string;
+    model: string;
+    verdict: CellVerdict;
+    /** Composite of the latest entry. */
+    current: number;
+    /** Composite of the comparison entry — `null` when `verdict === 'new'`. */
+    baseline: number | null;
+    /** `current − baseline` — `null` when new. */
+    delta: number | null;
+    /** Cohen's d of current vs baseline samples — `null` when new or n < 2. */
+    cohensD: number | null;
+    /** Welch's t-test p-value — `null` when new or n < 2. */
+    pValue: number | null;
+    currentCommit: string;
+    baselineCommit: string | null;
+}
+interface ScorecardDiff {
+    cells: ScorecardCellDiff[];
+    summary: {
+        improved: number;
+        regressed: number;
+        flat: number;
+        new: number;
+    };
+}
+interface DiffScorecardOptions {
+    /** Compare each cell against this commit instead of its immediate predecessor. */
+    baselineCommit?: string;
+    /** |Cohen's d| at/above which a move counts as real. Default 0.5. */
+    minEffect?: number;
+    /** p-value at/below which a move is significant. Default 0.05. */
+    maxP?: number;
+    /**
+     * |delta| at/above which a move counts when statistics are unavailable
+     * (a cell with fewer than 2 samples on either side). Default 0.05.
+     */
+    minDelta?: number;
+}
+/**
+ * Compare the latest entry of every cell against its predecessor (or against
+ * `baselineCommit`) and classify the move. A move is `improved`/`regressed`
+ * only when it clears both the effect-size and significance gates; otherwise
+ * `flat`. Cells with no prior entry are `new`.
+ */
+declare function diffScorecard(scorecard: Scorecard, opts?: DiffScorecardOptions): ScorecardDiff;
+/**
+ * Render a scorecard diff as a human-readable report — the block a feature
+ * PR prints. Regressions are listed first; flat cells are summarised, not
+ * enumerated.
+ */
+declare function formatScorecardDiff(diff: ScorecardDiff): string;
 /**
  * Series convergence — detects whether a sequence of scalar measurements
  * is stabilizing, drifting, or noisy.
@@ -6018,4 +6278,4 @@ declare function aggregateTrialsByMode(trials: TrialResult[], opts: {
     mode: AggregatorMode;
 }): TrialAggregate;
-export { ANALYST_SEVERITIES, ActionableSideInfo, type ActiveLearningOptions, type AdapterRun, AgentDriver, type AgentDriverConfig, AgentEvalError, type AggregatorMode, type AlignmentOp, type Analyst, type AnalystContext, type AnalystCost, type AnalystFinding, type AnalystHooks, type AnalystInputKind, AnalystRegistry, type AnalystRegistryOptions, type AnalystRequirements, type AnalystRunEvent, type AnalystRunInputs, type AnalystRunResult, type AnalystRunSummary, type AnalystSeverity, AnalyzeTracesOptions, type AntiSlopConfig, type AntiSlopIssue, type AntiSlopReport, Artifact$1 as Artifact, type Artifact as ArtifactCheckArtifact, type ArtifactEventLike, type ArtifactValidator, type AutoPrClient, AxGepaSteeringOptimizer, type AxSteeringOptimizerConfig, BackendIntegrityError, type BackendIntegrityReport, BaselineReport, BehaviorAssertion, BenchmarkReport, BenchmarkRunner, BenchmarkRunnerConfig, type BisectOptions, type BisectResult, type BisectStep, BudgetBreachError, BudgetGuard, BudgetLedgerEntry, type BudgetPolicy, BudgetSpec, CallExpectation, type CanaryAlert, type CanaryKind, type CanaryLeak, type CanaryOptions, type CanaryReport, type CanarySeverity, type CandidateScenario, type CausalAttributionReport, type ChatCallOpts, type ChatClient, type ChatRequest, type ChatResponse, type ChatTransport, CheckResult, type CliBridgeTransportOpts, type CodeMutationOutcome, type CodeMutationRunner, CollectedArtifacts, type CommandRunner, CompletionCriterion, type CompletionRequirement, type CompletionVerdict, type CompositePolicy, type ConceptComplexity, type ConceptFinding, type ConceptSpec, type ConceptWeightStrategy, type ContinuityCheck, type ContinuityCheckResult, type ContinuityReport, type ContinuitySnapshotPair, type ContractMetric, type ContractReport, ControlEvalResult, ConvergenceTracker, type CorrectnessChecker, type CostEntry, CostLedger, type CostLedgerGeneration, type CostLedgerSnapshot, type CostSummary, CostTracker, type CounterfactualContext, type 
CounterfactualMutation, type CounterfactualResult, type CounterfactualRunner, type CreateChatClientOpts, type CreateCompositeMutatorOpts, type CreateDefaultReviewerOptions, type CreateSandboxCodeMutatorOpts, type CreateSandboxPoolOpts, type CreateTraceAnalystKindOpts, type CrossTraceDiff, type CrossTraceDiffOptions, D1ExperimentStore, type D1ExperimentStoreOptions, type D1Like, type D1PreparedStatementLike, DEFAULT_AGENT_SLOS, DEFAULT_COMPLEXITY_WEIGHTS, DEFAULT_FINDERS, DEFAULT_HARNESS_OBJECTIVES, DEFAULT_MUTATORS, DEFAULT_RUN_SCORE_WEIGHTS, DEFAULT_SEVERITY_WEIGHTS, DEFAULT_TRACE_ANALYST_KINDS, Dataset, DatasetScenario, type DecideNextUserTurnOpts, type DeployFamily, type DeployGateLayerInput, type DeployRunResult, type DeployRunner, type DiffPolicy, type DirEntry, type DirectProviderTransportOpts, type DiscoverPersonasOptions, type DiscoveredPersona, DriverResult, DriverState, DualAgentBench, type DualAgentBenchConfig, type DualAgentReport, type DualAgentRound, type DualAgentScenario, type DualAgentScenarioResult, ERROR_COUNT_PATTERNS, type ErrorCountPattern, type EvidenceRef, type EvolutionRound, EvolvableVariant, type ExecutorConfig, type Expectation, type Experiment, type Run as ExperimentRun, type ExperimentStore, ExperimentTracker, type ExportedRewardModel, type ExtractOptions, type ExtractResult, FAILURE_MODE_KIND_SPEC, FINDING_SUBJECT_GRAMMAR_PROMPT, FINDING_SUBJECT_KINDS, type FactorContribution, type FactorialCell, type FailureClusterConfig, FeedbackLabel, FeedbackTrajectory, FeedbackTrajectoryStore, type FileChange, FileSystemExperimentStore, type FileSystemExperimentStoreOptions, type FindingSubject, type FindingSubjectKind, FindingSubjectStringSchema, type FindingsDiff, FindingsStore, type FlowAction, type FlowLayerEnv, type FlowLayerFactoryInput, type FlowRunner, type FlowRunnerStepResult, type FlowSpec, type FlowStep, GateDecision, type GhCliClientOptions, type GoldenSeverity, type GoldenSpec, type HarnessAdapter, HarnessConfig, type 
HarnessExperimentConfig, type HarnessExperimentResult, type HarnessIntervention, type HarnessRunRequest, type HarnessRunResult, type HarnessScenario, type HarnessSelection, type HarnessVariant, type HarnessVariantReport, HeldOutGateConfig, HoldoutAuditor, type HostedJudgeConfig, type HostedJudgeDimension, type HostedJudgeRequest, type HostedJudgeResponse, type HostedRunCriticConfig, type HostedRunScoreRequest, type HostedRunScoreResponse, type HttpGithubClientOptions, type HypothesisManifest, type HypothesisResult, IMPROVEMENT_KIND_SPEC, INTENT_MATCH_JUDGE_VERSION, type ImageData, InMemoryExperimentStore, InMemoryWorkspaceInspector, type InferenceScorer, type InspectorContext, type IntegrationGateSurface, type IntegrationInvokeFailureInput, type IntegrationManifestGateInput, type IntentMatchInput, type IntentMatchOptions, type IntentMatchResult, type InteractionContribution, JsonlTrialCache, type JudgeAdapterOpts, type JudgeFleetOptions, JudgeFn, JudgeInput, type JudgeReplayResult, type JudgeRetryOutcome, type JudgeRetryPolicy, JudgeRunner, KIND_EXPECTED_SUBJECTS, KNOWLEDGE_GAP_KIND_SPEC, KNOWLEDGE_POISONING_KIND_SPEC, type KeywordConceptSpec, type KeywordCoverageFinding, type KeywordCoverageOptions, type KeywordCoverageResult, type LangfuseEnvelope, type LangfuseGeneration, type LangfuseScore, Layer, LayerResult, type LineageKind, type LineageKindResolver, type LineageNode, LineageRecorder, type LiveProofArtifact, type LiveProofConfig, type LiveProofContext, type LiveProofResult, LlmCallRequest, LlmCallResult, LlmClientOptions, type LlmCorrectnessCheckerOpts, LlmSpan, LockedJsonlAppender, MODEL_PRICING, type MatchResult, type MatcherResult, type MeasurementPolicy, type MergeOptions, MetricsCollector, type MockTransportOpts, type MuffledFinder, type MuffledFinding, MultiLayerVerifier, MultiShotMutateAdapter, MultiShotOptimizationResult, MultiShotRunner, MultiShotScorer, MultiShotTrialResult, type MultiToolchainLayerConfig, MutateAdapter, type MutationAttempt, type 
MutationChannel, MutationTelemetry, type Mutator, Mutex, Objective, type Oracle, type OracleObservation, type OracleReport, type OracleResult, type OrthogonalityInput, type OrthogonalityResult, PairwiseSteeringOptimizer, type ParaphraseRobustnessScenarioInput, type ParaphraseRobustnessScenarioResult, ParetoResult, type PersistedFinding, PersonaConfig, type Playbook, type PlaybookEntry, type PoolSlot, type ProducedProposal, type ProducedState, ProductClient, ProductClientConfig, type ProductionEvolveConfig, type ProductionLoopCronConfig, type ProductionLoopDecision, type ProductionLoopRenderContext, type ProductionLoopResult, type ProductionShipConfig, type PromptHandle, PromptRegistry, TrialResult as PromptTrialResult, type ProposalEventLike, type ProposeAutomatedPullRequestInput, type ProposeAutomatedPullRequestResult, RAW_FINDING_SCHEMA_PROMPT, type RawAnalystFinding, RawAnalystFindingSchema, type ReferenceMatchResult, type ReferenceReplayAdapter, type ReferenceReplayAdapterFn, type ReferenceReplayAdapterLike, type ReferenceReplayAggregate, type ReferenceReplayCandidate, type ReferenceReplayCase, type ReferenceReplayCaseRun, type ReferenceReplayExecutionScenario, type ReferenceReplayItem, type ReferenceReplayMatch, type ReferenceReplayMatchStrategy, type ReferenceReplayMatcher, type ReferenceReplayPromotionDecision, type ReferenceReplayPromotionPolicy, type ReferenceReplayRun, type ReferenceReplayRunContext, type ReferenceReplayRunOptions, type ReferenceReplayRunStore, type ReferenceReplayScenario, type ReferenceReplayScenarioScore, type ReferenceReplayScore, type ReferenceReplayScoreOptions, type ReferenceReplaySplit, type ReferenceReplaySplitComparison, type ReferenceReplaySteeringRowsOptions, type RegistryRunOpts, ReleaseConfidenceScorecard, ReleaseConfidenceThresholds, type RepoRef, type RequirementCheck, type ReviewerMemoryEntry, type ReviewerOutput, type ReviewerPromptInput, type ReviewerSoftFailDefaults, type ReviewerVerificationSummary, type 
RobustnessResult, type RouterTransportOpts, Run$1 as Run, type RunCommandInput, type RunCommandResult, type RunConfig, RunCritic, type RunCriticAdapterOpts, type RunCriticOptions, type RunDiff, RunFilter, type RunProductionLoopOptions, RunRecord, type RunScore, type RunScoreWeights, RunSplitTag, type RunTrace, type RuntimeEventLike, SEMANTIC_CONCEPT_JUDGE_VERSION, SandboxDriver, SandboxHarnessResult, type SandboxJudgeKind, type SandboxJudgeResult, type SandboxJudgeSpec, type SandboxPool, type SandboxSdkTransportOpts, type SatisfiedBy, type ScanOptions, Scenario, type ScenarioCost, ScenarioFile, ScenarioRegistry, ScenarioResult, type ScoredTarget, type SelfPlayOptions, type SelfPlayProposer, type SelfPlayScorer, type SemanticConceptJudgeAdapterOpts, type SemanticConceptJudgeInput, type SemanticConceptJudgeOptions, type SemanticConceptJudgeResult, type SeriesConvergenceOptions, type SeriesConvergenceResult, Severity, type SignedManifest, type SignedManifestAlgo, type Slo, type SloCheckResult, type SloComparator, type SloReport, type SloSeverity, type SlopCategory, type SlotFactory, Span, type SteeringBundle, type SteeringDelta, type SteeringOptimizationResult, type SteeringOptimizationRow, type SteeringOptimizationSelector, type SteeringOptimizerBackend, type SteeringOptimizerConfig, type SteeringRolePrompt, type StepAttribution, type SynthesisReason, type SynthesisTarget, type TaskGold, TestResult, type ThresholdContract, TokenCounter, type TokenSpec, type ToolCallEventLike, TraceAnalysisStore, type TraceAnalystAdapterOpts, type TraceAnalystGolden, type TraceAnalystKindSpec, TraceEmitter, TraceEvent, TraceStore, type TraceToolGroupName, Trajectory, TrajectoryStep, type TrialAggregate, type TrialAttempt, TrialCache, TrialTelemetry, TurnMetrics, UNIVERSAL_FINDERS, type ValidationContext, type ValidationIssue, type ValidationResult, VariantAggregate, type VerifierAdapterOpts, VerifyContext, VerifyOptions, type VisualDiffOptions, type VisualDiffResult, type 
ViteDeployRunnerInput, type WorkflowTopology, type WorkspaceAssertion, type WorkspaceAssertionResult, type WorkspaceInspector, type WorkspaceSnapshot, type WranglerDeployRunnerInput, adversarialJudge, aggregateRunScore, aggregateTrialsByMode, analyzeAntiSlop, analyzeSeries, assertRealBackend, attributeCounterfactuals, bisect, buildDriverSystemPrompt, buildReviewerPrompt, buildTraceToolsForGroup, byteLengthRange, canaryLeakView, canonicalize, causalAttribution, checkBehavioralCanary, checkCanaries, checkSlos, clamp01, codeExecutionJudge, coherenceJudge, collectionPreserved, commitBisect, compareReferenceReplay, compilerJudge, composeValidators, computeFindingId, containsAll, createAntiSlopJudge, createChatClient, createCompositeMutator, createCustomJudge, createDefaultReviewer, createDomainExpertJudge, createIntentMatchJudge, createJudgeAdapter, createLlmCorrectnessChecker, createRunCriticAdapter, createSandboxCodeMutator, createSandboxPool, createSemanticConceptJudge, createSemanticConceptJudgeAdapter, createTraceAnalystAdapter, createTraceAnalystKind, createVerifierAdapter, crossTraceDiff, decideNextUserTurn, decideReferenceReplayPromotion, decideReferenceReplayRunPromotion, defaultIsMaterial, defaultJudges, defaultReferenceReplayMatcher, deployGateLayer, diffFindings, discoverPersonas, distillPlaybook, estimateCost, estimateTokens, evaluateContract, evaluateHypothesis, evaluateOracles, executeScenario, expectAgent, exportRewardModel, extractAssetUrls, extractErrorCount, extractProducedState, fileContains, fileExists, findAutoMatchNoExpectation, findConstructorCwdDropped, findFallbackToPass, findLiteralTruePass, findSkipCountsAsPass, flowLayer, formatBenchmarkReport, formatDriverReport, formatFindings, ghCliClient, precision as goldenPrecision, hashContent, hashJson, htmlContainsElement, httpGithubClient, inMemoryReferenceReplayStore, integrationAsi, integrationGateEvals, integrationInvokeFailedPayload, integrationManifestResolvedPayload, 
integrationManifestValidatedPayload, jsonHasKeys, jsonShape, jsonlReferenceReplayStore, keyPreserved, liftSeverity, linterJudge, loadScorerFromGrader, localCommandRunner, lowercaseMutator, makeFinding, matchGoldens, mergeLayerResults, mergeSteeringBundle, multiToolchainLayer, notBlocked, paraphraseRobustness, paraphraseRobustnessScenarios, parseCorrectnessResponse, parseFindingSubject, parseRawFinding, passOrthogonality, pixelDeltaRatio, politenessPrefixMutator, printDriverSummary, promptBisect, proposeAutomatedPullRequest, proposeSynthesisTargets, referenceReplayRunsToSteeringRows, referenceReplayScenarioToRunScore, regexMatch, regexMatches, renderFindingSubject, renderMarkdownReport, renderPlaybookMarkdown, renderPriorFindings, renderSteeringText, replayScorerOverCorpus, replayTraceThroughJudge, resetLockedAppendersForTesting, rowCount, rowWhere, runAssertions, runBehavioralCanaries, runCanaries, runCounterfactual, runE2EWorkflow, runExpectations, runHarnessExperiment, runIntentMatchJudge, runJudgeFleet, runKeywordCoverageJudge, runKeywordCoverageJudgeUrl, runLiveProof, runProductionLoop, runReferenceReplay, runSelfPlay, runSemanticConceptJudge, scanForMuffledGates, scoreContinuity, scoreReferenceReplay, securityJudge, selectHarnessVariant, sentenceReorderMutator, signManifest, statusAdvanced, summarizeBackendIntegrity, summarizeHarnessResults, testJudge, textInSnapshot, toLangfuseEnvelope, toPrometheusText, typoMutator, urlContains, verifyCompletion, verifyManifest, visualDiff, viteDeployRunner, weightedRecall, whitespaceCollapseMutator, withJudgeRetry, wranglerDeployRunner };
+export { ANALYST_SEVERITIES, ActionableSideInfo, type ActiveLearningOptions, type AdapterRun, AgentDriver, type AgentDriverConfig, AgentEvalError, type AgentProfile, type AggregatorMode, type AlignmentOp, type Analyst, type AnalystContext, type AnalystCost, type AnalystFinding, type AnalystHooks, type AnalystInputKind, AnalystRegistry, type AnalystRegistryOptions, type AnalystRequirements, type AnalystRunEvent, type AnalystRunInputs, type AnalystRunResult, type AnalystRunSummary, type AnalystSeverity, AnalyzeTracesOptions, type AntiSlopConfig, type AntiSlopIssue, type AntiSlopReport, Artifact$1 as Artifact, type Artifact as ArtifactCheckArtifact, type ArtifactEventLike, type ArtifactValidator, type AutoPrClient, AxGepaSteeringOptimizer, type AxSteeringOptimizerConfig, BackendIntegrityError, type BackendIntegrityReport, BaselineReport, BehaviorAssertion, BenchmarkReport, BenchmarkRunner, BenchmarkRunnerConfig, type BisectOptions, type BisectResult, type BisectStep, BudgetBreachError, BudgetGuard, BudgetLedgerEntry, type BudgetPolicy, BudgetSpec, CallExpectation, type CanaryAlert, type CanaryKind, type CanaryLeak, type CanaryOptions, type CanaryReport, type CanarySeverity, type CandidateScenario, type CausalAttributionReport, type CellVerdict, type ChatCallOpts, type ChatClient, type ChatRequest, type ChatResponse, type ChatTransport, CheckResult, type CliBridgeTransportOpts, type CodeMutationOutcome, type CodeMutationRunner, CollectedArtifacts, type CommandRunner, CompletionCriterion, type CompletionRequirement, type CompletionVerdict, type CompositePolicy, type ConceptComplexity, type ConceptFinding, type ConceptSpec, type ConceptWeightStrategy, type ContinuityCheck, type ContinuityCheckResult, type ContinuityReport, type ContinuitySnapshotPair, type ContractMetric, type ContractReport, ControlEvalResult, ConvergenceTracker, type CorrectnessChecker, type CostEntry, CostLedger, type CostLedgerGeneration, type CostLedgerSnapshot, type CostSummary, CostTracker, type 
CounterfactualContext, type CounterfactualMutation, type CounterfactualResult, type CounterfactualRunner, type CreateChatClientOpts, type CreateCompositeMutatorOpts, type CreateDefaultReviewerOptions, type CreateSandboxCodeMutatorOpts, type CreateSandboxPoolOpts, type CreateTraceAnalystKindOpts, type CrossTraceDiff, type CrossTraceDiffOptions, D1ExperimentStore, type D1ExperimentStoreOptions, type D1Like, type D1PreparedStatementLike, DEFAULT_AGENT_SLOS, DEFAULT_COMPLEXITY_WEIGHTS, DEFAULT_FINDERS, DEFAULT_HARNESS_OBJECTIVES, DEFAULT_MUTATORS, DEFAULT_PR_REVIEW_SCORE_WEIGHTS, DEFAULT_RUN_SCORE_WEIGHTS, DEFAULT_SEVERITY_WEIGHTS, DEFAULT_TRACE_ANALYST_KINDS, Dataset, DatasetScenario, type DecideNextUserTurnOpts, type DeployFamily, type DeployGateLayerInput, type DeployRunResult, type DeployRunner, type DiffPolicy, type DiffScorecardOptions, type DirEntry, type DirectProviderTransportOpts, type DiscoverPersonasOptions, type DiscoveredPersona, DriverResult, DriverState, DualAgentBench, type DualAgentBenchConfig, type DualAgentReport, type DualAgentRound, type DualAgentScenario, type DualAgentScenarioResult, ERROR_COUNT_PATTERNS, type ErrorCountPattern, type EvidenceRef, type EvolutionRound, EvolvableVariant, type ExecutorConfig, type Expectation, type Experiment, type Run as ExperimentRun, type ExperimentStore, ExperimentTracker, type ExportedRewardModel, type ExtractOptions, type ExtractResult, FAILURE_MODE_KIND_SPEC, FINDING_SUBJECT_GRAMMAR_PROMPT, FINDING_SUBJECT_KINDS, type FactorContribution, type FactorialCell, type FailureClusterConfig, FeedbackLabel, FeedbackTrajectory, FeedbackTrajectoryStore, type FileChange, FileSystemExperimentStore, type FileSystemExperimentStoreOptions, type FindingSubject, type FindingSubjectKind, FindingSubjectStringSchema, type FindingsDiff, FindingsStore, type FlowAction, type FlowLayerEnv, type FlowLayerFactoryInput, type FlowRunner, type FlowRunnerStepResult, type FlowSpec, type FlowStep, GateDecision, type GhCliClientOptions, type 
GoldenSeverity, type GoldenSpec, type HarnessAdapter, HarnessConfig, type HarnessExperimentConfig, type HarnessExperimentResult, type HarnessIntervention, type HarnessRunRequest, type HarnessRunResult, type HarnessScenario, type HarnessSelection, type HarnessVariant, type HarnessVariantReport, HeldOutGateConfig, HoldoutAuditor, type HostedJudgeConfig, type HostedJudgeDimension, type HostedJudgeRequest, type HostedJudgeResponse, type HostedRunCriticConfig, type HostedRunScoreRequest, type HostedRunScoreResponse, type HttpGithubClientOptions, type HypothesisManifest, type HypothesisResult, IMPROVEMENT_KIND_SPEC, INTENT_MATCH_JUDGE_VERSION, type ImageData, InMemoryExperimentStore, InMemoryWorkspaceInspector, type InferenceScorer, type InspectorContext, type IntegrationGateSurface, type IntegrationInvokeFailureInput, type IntegrationManifestGateInput, type IntentMatchInput, type IntentMatchOptions, type IntentMatchResult, type InteractionContribution, JsonlTrialCache, type JudgeAdapterOpts, type JudgeFleetOptions, JudgeFn, JudgeInput, type JudgeReplayResult, type JudgeRetryOutcome, type JudgeRetryPolicy, JudgeRunner, KIND_EXPECTED_SUBJECTS, KNOWLEDGE_GAP_KIND_SPEC, KNOWLEDGE_POISONING_KIND_SPEC, type KeywordConceptSpec, type KeywordCoverageFinding, type KeywordCoverageOptions, type KeywordCoverageResult, type LangfuseEnvelope, type LangfuseGeneration, type LangfuseScore, Layer, LayerResult, type LineageKind, type LineageKindResolver, type LineageNode, LineageRecorder, type LiveProofArtifact, type LiveProofConfig, type LiveProofContext, type LiveProofResult, LlmCallRequest, LlmCallResult, LlmClientOptions, type LlmCorrectnessCheckerOpts, LlmSpan, LockedJsonlAppender, MODEL_PRICING, type MatchResult, type MatcherResult, type MeasurementPolicy, type MergeOptions, MetricsCollector, type MockTransportOpts, type MuffledFinder, type MuffledFinding, MultiLayerVerifier, MultiShotMutateAdapter, MultiShotOptimizationResult, MultiShotRunner, MultiShotScorer, MultiShotTrialResult, 
type MultiToolchainLayerConfig, MutateAdapter, type MutationAttempt, type MutationChannel, MutationTelemetry, type Mutator, Mutex, Objective, type Oracle, type OracleObservation, type OracleReport, type OracleResult, type OrthogonalityInput, type OrthogonalityResult, PairwiseSteeringOptimizer, type ParaphraseRobustnessScenarioInput, type ParaphraseRobustnessScenarioResult, ParetoResult, type PersistedFinding, PersonaConfig, type Playbook, type PlaybookEntry, type PoolSlot, type PrReviewAuditCase, type PrReviewBenchmarkSummary, type PrReviewComment, type PrReviewMatchedFinding, type PrReviewOutcome, type PrReviewReferenceFinding, type PrReviewScore, type PrReviewScoreWeights, type PrReviewSeverity, type PrReviewSource, type ProducedProposal, type ProducedState, ProductClient, ProductClientConfig, type ProductionEvolveConfig, type ProductionLoopCronConfig, type ProductionLoopDecision, type ProductionLoopRenderContext, type ProductionLoopResult, type ProductionShipConfig, type PromptHandle, PromptRegistry, TrialResult as PromptTrialResult, type ProposalEventLike, type ProposeAutomatedPullRequestInput, type ProposeAutomatedPullRequestResult, RAW_FINDING_SCHEMA_PROMPT, type RawAnalystFinding, RawAnalystFindingSchema, type RecordRunsOptions, type ReferenceMatchResult, type ReferenceReplayAdapter, type ReferenceReplayAdapterFn, type ReferenceReplayAdapterLike, type ReferenceReplayAggregate, type ReferenceReplayCandidate, type ReferenceReplayCase, type ReferenceReplayCaseRun, type ReferenceReplayExecutionScenario, type ReferenceReplayItem, type ReferenceReplayMatch, type ReferenceReplayMatchStrategy, type ReferenceReplayMatcher, type ReferenceReplayPromotionDecision, type ReferenceReplayPromotionPolicy, type ReferenceReplayRun, type ReferenceReplayRunContext, type ReferenceReplayRunOptions, type ReferenceReplayRunStore, type ReferenceReplayScenario, type ReferenceReplayScenarioScore, type ReferenceReplayScore, type ReferenceReplayScoreOptions, type ReferenceReplaySplit, 
type ReferenceReplaySplitComparison, type ReferenceReplaySteeringRowsOptions, type RegistryRunOpts, ReleaseConfidenceScorecard, ReleaseConfidenceThresholds, type RepoRef, type RequirementCheck, type ReviewerMemoryEntry, type ReviewerOutput, type ReviewerPromptInput, type ReviewerSoftFailDefaults, type ReviewerVerificationSummary, type RobustnessResult, type RouterTransportOpts, Run$1 as Run, type RunCommandInput, type RunCommandResult, type RunConfig, RunCritic, type RunCriticAdapterOpts, type RunCriticOptions, type RunDiff, RunFilter, type RunProductionLoopOptions, RunRecord, type RunScore, type RunScoreWeights, RunSplitTag, type RunTrace, type RuntimeEventLike, SEMANTIC_CONCEPT_JUDGE_VERSION, SandboxDriver, SandboxHarnessResult, type SandboxJudgeKind, type SandboxJudgeResult, type SandboxJudgeSpec, type SandboxPool, type SandboxSdkTransportOpts, type SatisfiedBy, type ScanOptions, Scenario, type ScenarioCost, ScenarioFile, ScenarioRegistry, ScenarioResult, type Scorecard, type ScorecardCell, type ScorecardCellDiff, type ScorecardDiff, type ScorecardEntry, type ScorecardLogLine, type ScoredTarget, type SelfPlayOptions, type SelfPlayProposer, type SelfPlayScorer, type SemanticConceptJudgeAdapterOpts, type SemanticConceptJudgeInput, type SemanticConceptJudgeOptions, type SemanticConceptJudgeResult, type SeriesConvergenceOptions, type SeriesConvergenceResult, Severity, type SignedManifest, type SignedManifestAlgo, type Slo, type SloCheckResult, type SloComparator, type SloReport, type SloSeverity, type SlopCategory, type SlotFactory, Span, type SteeringBundle, type SteeringDelta, type SteeringOptimizationResult, type SteeringOptimizationRow, type SteeringOptimizationSelector, type SteeringOptimizerBackend, type SteeringOptimizerConfig, type SteeringRolePrompt, type StepAttribution, type SynthesisReason, type SynthesisTarget, type TaskGold, TestResult, type ThresholdContract, TokenCounter, type TokenSpec, type ToolCallEventLike, TraceAnalysisStore, type 
TraceAnalystAdapterOpts, type TraceAnalystGolden, type TraceAnalystKindSpec, TraceEmitter, TraceEvent, TraceStore, type TraceToolGroupName, Trajectory, TrajectoryStep, type TrialAggregate, type TrialAttempt, TrialCache, TrialTelemetry, TurnMetrics, UNIVERSAL_FINDERS, type ValidationContext, type ValidationIssue, type ValidationResult, VariantAggregate, type VerifierAdapterOpts, VerifyContext, VerifyOptions, type VisualDiffOptions, type VisualDiffResult, type ViteDeployRunnerInput, type WorkflowTopology, type WorkspaceAssertion, type WorkspaceAssertionResult, type WorkspaceInspector, type WorkspaceSnapshot, type WranglerDeployRunnerInput, adversarialJudge, agentProfileHash, aggregatePrReviewScore, aggregateRunScore, aggregateTrialsByMode, analyzeAntiSlop, analyzeSeries, appendScorecard, assertRealBackend, attributeCounterfactuals, bisect, buildDriverSystemPrompt, buildReviewerPrompt, buildTraceToolsForGroup, byteLengthRange, canaryLeakView, canonicalize, causalAttribution, checkBehavioralCanary, checkCanaries, checkSlos, clamp01, codeExecutionJudge, coherenceJudge, collectionPreserved, commentsForSource, commitBisect, compareReferenceReplay, compilerJudge, composeValidators, computeFindingId, containsAll, createAntiSlopJudge, createChatClient, createCompositeMutator, createCustomJudge, createDefaultReviewer, createDomainExpertJudge, createIntentMatchJudge, createJudgeAdapter, createLlmCorrectnessChecker, createRunCriticAdapter, createSandboxCodeMutator, createSandboxPool, createSemanticConceptJudge, createSemanticConceptJudgeAdapter, createTraceAnalystAdapter, createTraceAnalystKind, createVerifierAdapter, crossTraceDiff, decideNextUserTurn, decideReferenceReplayPromotion, decideReferenceReplayRunPromotion, defaultIsMaterial, defaultJudges, defaultReferenceReplayMatcher, deployGateLayer, diffFindings, diffScorecard, discoverPersonas, distillPlaybook, estimateCost, estimateTokens, evaluateContract, evaluateHypothesis, evaluateOracles, executeScenario, expectAgent, 
exportRewardModel, extractAssetUrls, extractErrorCount, extractProducedState, fileContains, fileExists, findAutoMatchNoExpectation, findConstructorCwdDropped, findFallbackToPass, findLiteralTruePass, findSkipCountsAsPass, flowLayer, formatBenchmarkReport, formatDriverReport, formatFindings, formatScorecardDiff, ghCliClient, precision as goldenPrecision, hashContent, hashJson, htmlContainsElement, httpGithubClient, inMemoryReferenceReplayStore, integrationAsi, integrationGateEvals, integrationInvokeFailedPayload, integrationManifestResolvedPayload, integrationManifestValidatedPayload, jsonHasKeys, jsonShape, jsonlReferenceReplayStore, keyPreserved, liftSeverity, linterJudge, loadScorecard, loadScorerFromGrader, localCommandRunner, lowercaseMutator, makeFinding, matchGoldens, mergeLayerResults, mergeSteeringBundle, multiToolchainLayer, notBlocked, paraphraseRobustness, paraphraseRobustnessScenarios, parseCorrectnessResponse, parseFindingSubject, parseRawFinding, passOrthogonality, pixelDeltaRatio, politenessPrefixMutator, printDriverSummary, promptBisect, proposeAutomatedPullRequest, proposeSynthesisTargets, recordRuns, recordRunsToScorecard, referenceReplayRunsToSteeringRows, referenceReplayScenarioToRunScore, regexMatch, regexMatches, renderFindingSubject, renderMarkdownReport, renderPlaybookMarkdown, renderPriorFindings, renderSteeringText, replayScorerOverCorpus, replayTraceThroughJudge, resetLockedAppendersForTesting, rowCount, rowWhere, runAssertions, runBehavioralCanaries, runCanaries, runCounterfactual, runE2EWorkflow, runExpectations, runHarnessExperiment, runIntentMatchJudge, runJudgeFleet, runKeywordCoverageJudge, runKeywordCoverageJudgeUrl, runLiveProof, runProductionLoop, runReferenceReplay, runSelfPlay, runSemanticConceptJudge, scanForMuffledGates, scoreContinuity, scorePrReviewComments, scorePrReviewSource, scoreReferenceReplay, securityJudge, selectHarnessVariant, sentenceReorderMutator, signManifest, statusAdvanced, summarizeBackendIntegrity, 
summarizeHarnessResults, summarizePrReviewBenchmark, testJudge, textInSnapshot, toLangfuseEnvelope, toPrometheusText, typoMutator, urlContains, verifyCompletion, verifyManifest, visualDiff, viteDeployRunner, weightedRecall, whitespaceCollapseMutator, withJudgeRetry, wranglerDeployRunner };