npm - @tangle-network/agent-eval - Versions diffs - 0.79.0 → 0.80.0 - Mend

@tangle-network/agent-eval 0.79.0 → 0.80.0


Files changed (39)

package/README.md +50 -19
package/dist/adapters/http.d.ts +1 -1
package/dist/adapters/langchain.d.ts +1 -1
package/dist/adapters/otel.d.ts +1 -1
package/dist/analyst/index.d.ts +3 -3
package/dist/belief-state/index.d.ts +188 -0
package/dist/belief-state/index.js +486 -0
package/dist/belief-state/index.js.map +1 -0
package/dist/calibration-Cpr3WaX3.d.ts +101 -0
package/dist/campaign/index.d.ts +5 -5
package/dist/chunk-4DIJWVUT.js +131 -0
package/dist/chunk-4DIJWVUT.js.map +1 -0
package/dist/chunk-NPCTHQIO.js +91 -0
package/dist/chunk-NPCTHQIO.js.map +1 -0
package/dist/contract/index.d.ts +123 -10
package/dist/contract/index.js +116 -0
package/dist/contract/index.js.map +1 -1
package/dist/governance/index.d.ts +1 -1
package/dist/hosted/index.d.ts +1 -1
package/dist/index.d.ts +5 -5
package/dist/meta-eval/index.d.ts +5 -98
package/dist/meta-eval/index.js +7 -76
package/dist/meta-eval/index.js.map +1 -1
package/dist/off-policy-DiwuKKg7.d.ts +132 -0
package/dist/openapi.json +1 -1
package/dist/{outcome-store-D6KWmYvj.d.ts → outcome-store-rnXLEqSn.d.ts} +1 -1
package/dist/{provenance-CEAJI9rm.d.ts → provenance-jG-Gngg8.d.ts} +2 -2
package/dist/{registry-BmEuU94S.d.ts → registry-BK0Zee01.d.ts} +1 -1
package/dist/reporting.d.ts +2 -2
package/dist/rl.d.ts +6 -136
package/dist/rl.js +6 -120
package/dist/rl.js.map +1 -1
package/dist/{rubric-predictive-validity-CWyWWLBg.d.ts → rubric-predictive-validity-CLPuwiUw.d.ts} +1 -1
package/dist/{run-improvement-loop-Bgu4C59E.d.ts → run-improvement-loop-BAl_aVOZ.d.ts} +1 -1
package/dist/{semantic-concept-judge-Du4ZVyef.d.ts → semantic-concept-judge-qXEUV2w7.d.ts} +1 -1
package/dist/{types-QHG0KnkF.d.ts → types-4mm2msnR.d.ts} +1 -1
package/docs/research/belief-state-agent-eval-roadmap.md +558 -0
package/docs/research/research-roadmap.md +1 -0
package/package.json +7 -2

package/README.md CHANGED

@@ -1,6 +1,8 @@
 # `@tangle-network/agent-eval`
-**Ship better agent prompts with statistical confidence.** One function call returns a decision packet: lift CI, judge calibration, contamination check, failure clusters, cost-quality Pareto, and a ranked action list. Same shape whether you've got a closed improvement loop or just production logs.
+**Decision-grade evals for agents.** One function call returns a decision packet — lift CI, judge calibration, contamination check, failure clusters, cost-quality Pareto, and a ranked action list — with the same shape whether you have a closed improvement loop or just production logs.
+It is the **substrate at the bottom of the stack**: [`@tangle-network/agent-runtime`](https://www.npmjs.com/package/@tangle-network/agent-runtime) runs agents and captures every run as a trace, then delegates scoring and the ship gate here. The dependency arrow only points up — agent-eval never imports the runtime.
 [![npm](https://img.shields.io/npm/v/@tangle-network/agent-eval.svg)](https://www.npmjs.com/package/@tangle-network/agent-eval)
 [![pypi](https://img.shields.io/pypi/v/agent-eval-rpc.svg)](https://pypi.org/project/agent-eval-rpc/)
@@ -207,28 +209,55 @@ Each example: `README.md` + a single `index.ts` runnable via `pnpm tsx`. Prints
 | Subpath | What it gives you |
 |---|---|
-| `@tangle-network/agent-eval/contract` | **The headline surface.** `selfImprove`, `analyzeRuns`, `runImprovementLoop`, `runCampaign`, `runEval`, `diffRuns`, intake adapters (`fromFeedbackTable`, `fromOtelSpans`), drivers (`gepaDriver`, `evolutionaryDriver`), gates (`defaultProductionGate`, `heldOutGate`, `composeGate`), storage. **New code starts here.** |
-| `@tangle-network/agent-eval/hosted` | Hosted-tier wire-format types + `createHostedClient` to ship eval-run events + trace spans to any orchestrator speaking the spec |
-| `@tangle-network/agent-eval/adapters/otel` | `createOtelBridge` — forwards OpenTelemetry-shape spans into the hosted-tier ingest |
-| `@tangle-network/agent-eval/adapters/langchain` | LangChain runnable → `Dispatch` adapter |
-| `@tangle-network/agent-eval/adapters/http` | `httpDispatch` + `runDispatchServer` for distributed campaigns across machines |
-| `@tangle-network/agent-eval/campaign` | Lower-level campaign primitives (storage, drivers, types) |
-| `@tangle-network/agent-eval/multishot` | N-shot persona × shot matrix runner |
-| `@tangle-network/agent-eval/control` | Agent control loop primitives (`runAgentControlLoop`, action policy, propose/review) |
-| `@tangle-network/agent-eval/traces` | Trace stores, emitters, OTLP-JSONL replay |
-| `@tangle-network/agent-eval/reporting` | Release confidence, paired stats, sequential e-values, launch reports |
-| `@tangle-network/agent-eval/rl` | RL bridge — verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, auto-research |
-| `@tangle-network/agent-eval/matrix` | N-axis cartesian over substrate types |
-| `@tangle-network/agent-eval/wire` | HTTP/RPC server + Zod schemas (same protocol the Python client speaks) |
-| `@tangle-network/agent-eval/benchmarks` | Benchmark adapter contracts and reference wrappers |
-The root export remains available for backward compatibility; new code should prefer focused subpaths. Anything under `/rl`, `/pipelines`, `/meta-eval`, `/prm`, or `/builder-eval` is **only** reachable via its subpath.
+| `…/contract` | **The headline, frozen surface — new code starts here.** `selfImprove`, `analyzeRuns`, `runEval`, `runCampaign`, `runImprovementLoop`, `diffRuns`; intake adapters (`fromFeedbackTable`, `fromOtelSpans`); drivers (`gepaDriver`, `evolutionaryDriver`); gates (`defaultProductionGate`, `heldOutGate`, `paretoSignificanceGate`, `composeGate`); the deployment-outcome store; storage; and the five core types `Scenario` / `Dispatch` / `JudgeConfig` / `Mutator` / `Gate`. |
+| `…/hosted` | `createHostedClient` / `hostedClientFromEnv` + the wire types to ship eval-run events + trace spans to a hosted orchestrator (ours or your own implementation of the spec) |
+| `…/adapters/otel` | `createOtelBridge` — forwards OpenTelemetry-shape spans into the hosted-tier ingest, no `@opentelemetry/*` dependency |
+| `…/adapters/langchain` | Wrap any LangChain `Runnable` as a `Dispatch` (or `JudgeConfig`), no `@langchain/core` peer dep |
+| `…/adapters/http` | `httpDispatch` + `runDispatchServer` — run a campaign's worker on another machine (multi-region, driver-as-a-service) |
+| `…/campaign` | **The measurement + improvement engine** (`@experimental`): `runProfileMatrix`, `compareDrivers`, every driver (`gepaDriver`, `haloDriver`, `skillOptDriver`, `aceDriver`, `memoryCurationDriver`, …), the gates, storage backends, and loop provenance. `/contract` re-exports the stable subset. |
+| `…/rl` | RL bridge from eval artifacts to training signal: verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, plus the durable corpus + `buildRlDataset` / datasheet bundle |
+| `…/reporting` | Release-decision statistics: `pairedBootstrap`, `benjaminiHochberg`, anytime-valid sequential e-values, `evaluateReleaseConfidence`, and the report renderers |
+| `…/analyst` | The trace-analyst surface: `AnalystRegistry` + `buildDefaultAnalystRegistry` (run the failure-clustering panel), `FindingsStore`, and the LLM chat transports |
+| `…/traces` | Trace stores + emitters, OTLP-JSONL deterministic replay, `analyzeTraces`, and the `traceAnalystOnRunComplete` hook |
+| `…/control` | Agent control loop: `runAgentControlLoop` (observe → validate → decide → act), action policy, propose/review |
+| `…/matrix` | `runAgentMatrix` — an N-axis cartesian over caller-supplied substrate values, per-axis pass/score/cost/duration |
+| `…/multishot` | N-shot persona × shot matrix runner (`runMultishot` / `runMultishotMatrix`) |
+| `…/wire` | The cross-language HTTP/RPC server + Zod schemas (the source-of-truth protocol the Python client speaks) + the built-in rubric registry |
+| `…/benchmarks` | `BenchmarkAdapter` contract + `deterministicSplit` + the bundled `routing` reference benchmark |
+**Specialized surfaces** (subpath-only): `…/prm` (process-reward grading + best-of-N), `…/meta-eval` (judge calibration + the deployment-outcome store), `…/pipelines` (trace-diagnostic views: budget breach, failure cluster, stuck loop, …), `…/governance` (EU AI Act / NIST AI RMF / SOC2 reports), `…/knowledge` (knowledge-readiness gating before a run), `…/builder-eval` (code-generator three-layer eval), `…/storyboard` (trace → watchable replay), `…/authenticity` (anti-Goodhart "real or convincing BS" scorer over produced files), `…/workflow` (workflow-trace eval + partner export), `…/telemetry` (Workers-safe telemetry client).
+The root export remains available for backward compatibility; new code should prefer the focused subpaths above — `/contract` first.
+---
+## Composition with the stack
+agent-eval is the bottom of the layering: consumers depend on it, it depends on none of them.
+```
+agent-runtime    Runs agents (chat turns, one-shot tasks, multi-attempt loops), captures every
+                 run as a trace, and calls optimizePrompt / runImprovementLoop. Produces the
+                 RunRecords + traces agent-eval scores. Depends on agent-eval.
+agent-eval       selfImprove, analyzeRuns, runCampaign + drivers (gepaDriver, …), the gates
+   (this repo)   (heldOutGate, defaultProductionGate, paretoSignificanceGate), the InsightReport
+                 decision packet, the RL bridge, the wire protocol. Depends on neither consumer.
+agent-knowledge  proposeKnowledgeWrites / applyKnowledgeWriteBlocks. agent-eval's analyst findings
+                 feed it; the knowledge gate consumes them. Depends on agent-eval.
+sandbox          AgentProfile, Sandbox.create, streamPrompt. The execution surface the runtime's
+                 loops run on; agent-eval scores what comes back.
+```
+The rule: **agent-eval has zero upward dependencies on a consumer.** A concept that makes sense *without* a running agent loop — a verdict, a run record, a scenario, a judge score — is substrate and lives here; a runtime-shaped one (a sandbox profile, a validation context with an abort signal) lives in agent-runtime. When in doubt, lean substrate.
 ---
 ## Concepts + design
-- [`docs/concepts.md`](./docs/concepts.md) — five types, three top-level functions, the layering rule, the wire protocol contract
+- [`docs/concepts.md`](./docs/concepts.md) — the three top-level functions, the layering rule, and the wire-protocol contract (the five core contract types are documented in the `/contract` barrel itself)
 - [`docs/insight-report.md`](./docs/insight-report.md) — annotated walkthrough of every section of the decision packet
 - [`docs/customer-journeys.md`](./docs/customer-journeys.md) — three end-to-end journeys with code + expected output
 - [`docs/adapters-observability.md`](./docs/adapters-observability.md) — composing agent-eval with LangSmith, Langfuse, Phoenix, OpenLLMetry, TraceAI
@@ -287,7 +316,9 @@ pnpm test
 ## Stability + versioning
-Public exports carry JSDoc stability markers visible in IDE hover + `.d.ts`:
+The `/contract` surface is the **stability contract**: its barrel freezes the API — a `0.x` minor only *adds*; nothing there changes shape or disappears. Depend on `/contract` (and the documented subpaths) rather than the root barrel.
+In the deeper subpaths, `@stable` / `@experimental` JSDoc markers (visible in IDE hover + `.d.ts`) call out what may still move — most granularly in `/rl` (tagged per export) and `/campaign` (whole barrel `@experimental`, since `/contract` re-exports only its settled subset).
 | Tag | Meaning |
 |---|---|

package/dist/adapters/http.d.ts CHANGED

@@ -1,4 +1,4 @@
-import { S as Scenario, D as DispatchFn, b as DispatchContext } from '../types-QHG0KnkF.js';
+import { S as Scenario, D as DispatchFn, b as DispatchContext } from '../types-4mm2msnR.js';
 import '../run-record-sItO5ftF.js';
 import '../errors-Dwqw-T_m.js';
 import '../schema-m0gsnbt3.js';

package/dist/adapters/langchain.d.ts CHANGED

@@ -1,4 +1,4 @@
-import { S as Scenario, J as JudgeScore, D as DispatchFn, a as JudgeConfig } from '../types-QHG0KnkF.js';
+import { S as Scenario, J as JudgeScore, D as DispatchFn, a as JudgeConfig } from '../types-4mm2msnR.js';
 import '../run-record-sItO5ftF.js';
 import '../errors-Dwqw-T_m.js';
 import '../schema-m0gsnbt3.js';

package/dist/adapters/otel.d.ts CHANGED

@@ -1,5 +1,5 @@
 import { TraceSpanEvent, HostedClient } from '../hosted/index.js';
-import '../types-QHG0KnkF.js';
+import '../types-4mm2msnR.js';
 import '../run-record-sItO5ftF.js';
 import '../errors-Dwqw-T_m.js';
 import '../schema-m0gsnbt3.js';

package/dist/analyst/index.d.ts CHANGED

@@ -1,8 +1,8 @@
 import { AxAIService, AxFunction } from '@ax-llm/ax';
 import { M as MultiLayerVerifier, V as VerifyOptions, S as Severity } from '../multi-layer-verifier-DlWCXuxL.js';
 import { c as RunCritic, a as RunTrace } from '../run-critic-BAIjX99r.js';
-import { S as SemanticConceptJudgeOptions, a as SemanticConceptJudgeInput, B as BehavioralMetrics } from '../semantic-concept-judge-Du4ZVyef.js';
-export { C as CreateAnalystAiConfig, D as DEFAULT_TRACE_ANALYST_KINDS, b as DefaultAnalystRegistryOptions, c as DiffPolicy, F as FAILURE_MODE_KIND_SPEC, d as FINDING_SUBJECT_GRAMMAR_PROMPT, e as FINDING_SUBJECT_KINDS, f as FindingSubject, g as FindingSubjectKind, h as FindingSubjectStringSchema, i as FindingsDiff, j as FindingsStore, I as IMPROVEMENT_KIND_SPEC, K as KIND_EXPECTED_SUBJECTS, k as KNOWLEDGE_GAP_KIND_SPEC, l as KNOWLEDGE_POISONING_KIND_SPEC, P as PersistedFinding, m as SKILL_USAGE_ANALYST, n as SkillUsageAnalyst, o as SkillUsageRecord, p as SkillUsageReport, q as SkillUsageScanConfig, r as buildDefaultAnalystRegistry, s as buildSkillUsageReport, t as createAnalystAi, u as defaultIsMaterial, v as diffFindings, w as emitSkillUsageFindings, x as parseFindingSubject, y as renderFindingSubject } from '../semantic-concept-judge-Du4ZVyef.js';
+import { S as SemanticConceptJudgeOptions, a as SemanticConceptJudgeInput, B as BehavioralMetrics } from '../semantic-concept-judge-qXEUV2w7.js';
+export { C as CreateAnalystAiConfig, D as DEFAULT_TRACE_ANALYST_KINDS, b as DefaultAnalystRegistryOptions, c as DiffPolicy, F as FAILURE_MODE_KIND_SPEC, d as FINDING_SUBJECT_GRAMMAR_PROMPT, e as FINDING_SUBJECT_KINDS, f as FindingSubject, g as FindingSubjectKind, h as FindingSubjectStringSchema, i as FindingsDiff, j as FindingsStore, I as IMPROVEMENT_KIND_SPEC, K as KIND_EXPECTED_SUBJECTS, k as KNOWLEDGE_GAP_KIND_SPEC, l as KNOWLEDGE_POISONING_KIND_SPEC, P as PersistedFinding, m as SKILL_USAGE_ANALYST, n as SkillUsageAnalyst, o as SkillUsageRecord, p as SkillUsageReport, q as SkillUsageScanConfig, r as buildDefaultAnalystRegistry, s as buildSkillUsageReport, t as createAnalystAi, u as defaultIsMaterial, v as diffFindings, w as emitSkillUsageFindings, x as parseFindingSubject, y as renderFindingSubject } from '../semantic-concept-judge-qXEUV2w7.js';
 import { A as AnalyzeTracesOptions } from '../analyst-t7zZS3TV.js';
 import { T as TraceAnalysisStore } from '../store-GmBE2pZZ.js';
 import { b as JudgeFn, a as JudgeInput } from '../types-Croy5h7V.js';
@@ -10,7 +10,7 @@ import { A as Analyst, h as AnalystSeverity, c as AnalystFinding } from '../type
 export { a as AnalystContext, g as AnalystCost, i as AnalystInputKind, j as AnalystRequirements, f as AnalystRunEvent, e as AnalystRunInputs, d as AnalystRunResult, b as AnalystRunSummary, k as ChatCallOpts, C as ChatClient, l as ChatRequest, m as ChatResponse, n as ChatTransport, o as CliBridgeTransportOpts, p as CreateChatClientOpts, D as DirectProviderTransportOpts, E as EvidenceRef, M as MockTransportOpts, R as RouterTransportOpts, S as SandboxSdkTransportOpts, q as computeFindingId, r as createChatClient, s as makeFinding } from '../types-DRvV0zRo.js';
 import { TCloud } from '@tangle-network/tcloud';
 export { A as ANALYST_SEVERITIES, C as CreateTraceAnalystKindOpts, R as RAW_FINDING_SCHEMA_PROMPT, a as RawAnalystFinding, b as RawAnalystFindingSchema, T as TraceAnalystGolden, c as TraceAnalystKindSpec, d as createTraceAnalystKind, p as parseRawFinding, r as renderPriorFindings } from '../kind-factory-DqV2t1Xk.js';
-export { a as AnalystHooks, A as AnalystRegistry, b as AnalystRegistryOptions, B as BudgetPolicy, R as RegistryRunOpts } from '../registry-BmEuU94S.js';
+export { A as AnalystHooks, a as AnalystRegistry, b as AnalystRegistryOptions, B as BudgetPolicy, R as RegistryRunOpts } from '../registry-BK0Zee01.js';
 import { L as LlmClientOptions } from '../llm-client-DbjLfz-K.js';
 import '../schema-m0gsnbt3.js';
 import '../store-CKUAgsJz.js';

package/dist/belief-state/index.d.ts ADDED

@@ -0,0 +1,188 @@
+import { c as CalibrationReport } from '../calibration-Cpr3WaX3.js';
+import { O as OffPolicyEstimate, a as OffPolicyOptions, b as OffPolicyTrajectory } from '../off-policy-DiwuKKg7.js';
+import { T as TraceStore } from '../store-CKUAgsJz.js';
+import '../schema-m0gsnbt3.js';
+import '../outcome-store-rnXLEqSn.js';
+type BeliefDecisionKind = 'continue' | 'verify' | 'ask' | 'retry' | 'stop' | 'memory-write' | 'memory-read' | 'tool-select' | 'skill-select' | 'workflow-select' | 'surface-promote';
+type BeliefEvidenceSource = 'run' | 'span' | 'event' | 'finding' | 'memory' | 'knowledge' | 'policy';
+interface BeliefEvidenceRef {
+    source: BeliefEvidenceSource;
+    id: string;
+    runId?: string;
+    spanId?: string;
+    eventId?: string;
+    detail?: string;
+    metadata?: Record<string, unknown>;
+}
+interface BeliefDecisionOutcome {
+    success?: boolean;
+    score?: number;
+    reward?: number;
+    costUsd?: number;
+    observedAt?: string;
+    metadata?: Record<string, unknown>;
+}
+interface BeliefDecisionPoint {
+    id: string;
+    runId: string;
+    scenarioId?: string;
+    stepIndex: number;
+    kind: BeliefDecisionKind;
+    chosenAction: string;
+    candidateActions?: string[];
+    confidence?: number;
+    behaviorProb?: number;
+    targetProb?: number;
+    qHat?: number | null;
+    costUsd?: number;
+    evidence: BeliefEvidenceRef[];
+    outcome?: BeliefDecisionOutcome;
+    metadata?: Record<string, unknown>;
+}
+interface BeliefDecisionExtractionDiagnostic {
+    runId: string;
+    eventId?: string;
+    severity: 'info' | 'warning' | 'error';
+    reason: string;
+}
+interface BeliefDecisionExtractionReport {
+    decisions: BeliefDecisionPoint[];
+    diagnostics: BeliefDecisionExtractionDiagnostic[];
+}
+type BeliefPolicyAction = 'accept' | 'defer' | 'verify' | 'ask' | 'retry' | 'stop';
+interface BeliefPolicyDecision {
+    action: BeliefPolicyAction;
+    confidence?: number;
+    targetProb?: number;
+    qHat?: number | null;
+    reason?: string;
+}
+interface BeliefSelectivePolicy {
+    id: string;
+    decide(point: BeliefDecisionPoint): BeliefPolicyDecision;
+}
+interface BeliefOpeTargetPolicy {
+    id: string;
+    targetProbOf(point: BeliefDecisionPoint): number | null | undefined;
+    qHatOf?(point: BeliefDecisionPoint): number | null | undefined;
+}
+interface BeliefUtilityOptions {
+    successUtility?: number;
+    failureUtility?: number;
+    deferUtility?: number;
+    verifyCost?: number;
+    askCost?: number;
+    retryCost?: number;
+    stopUtility?: number;
+    costWeight?: number;
+}
+interface BeliefSelectivePolicyMetrics {
+    policyId: string;
+    n: number;
+    accepted: number;
+    rejected: number;
+    coverage: number;
+    acceptedErrorRate: number;
+    baselineUtility: number;
+    policyUtility: number;
+    utilityDelta: number;
+    utilityCi95: {
+        mean: number;
+        lower: number;
+        upper: number;
+    };
+    rejectedMeanReward: number | null;
+    recommendation: 'ship' | 'hold' | 'need_more_data';
+    reasons: string[];
+}
+interface BeliefOpeSupportDiagnostics {
+    supported: boolean;
+    n: number;
+    dropped: number;
+    effectiveSampleSize: number;
+    effectiveSampleRatio: number;
+    maxImportanceWeight: number;
+    reasons: string[];
+}
+interface BeliefOpeReport {
+    targetPolicyId: string;
+    ips: OffPolicyEstimate;
+    snips: OffPolicyEstimate;
+    dr: OffPolicyEstimate;
+    support: BeliefOpeSupportDiagnostics;
+}
+type BeliefEvaluationStatus = 'ship' | 'hold' | 'need_more_data';
+type BeliefCalibrationStatus = 'supported' | 'unsupported';
+type BeliefOpeStatus = 'supported' | 'unsupported' | 'not_requested';
+interface BeliefPolicyEvaluationReport {
+    policyId: string;
+    n: number;
+    status: BeliefEvaluationStatus;
+    selectiveStatus: BeliefEvaluationStatus;
+    calibrationStatus: BeliefCalibrationStatus;
+    opeStatus: BeliefOpeStatus;
+    opeTargetPolicyId?: string;
+    selective: BeliefSelectivePolicyMetrics;
+    calibration?: CalibrationReport;
+    ope?: BeliefOpeReport;
+    diagnostics: string[];
+}
+type BeliefCalibrationRegion = 'all' | 'accepted' | 'rejected';
+interface BeliefCalibrationOptions {
+    bins?: number;
+    minPairs?: number;
+    policy?: BeliefSelectivePolicy;
+    region?: BeliefCalibrationRegion;
+}
+declare function calibrateBeliefDecisions(points: BeliefDecisionPoint[], options?: BeliefCalibrationOptions): CalibrationReport | null;
+interface ExtractBeliefDecisionPointsOptions {
+    runIds?: string[];
+}
+declare function extractBeliefDecisionPoints(store: TraceStore, options?: ExtractBeliefDecisionPointsOptions): Promise<BeliefDecisionExtractionReport>;
+interface BeliefOpeOptions extends OffPolicyOptions {
+    minEffectiveSampleSize?: number;
+    minEffectiveSampleRatio?: number;
+    maxDiagnostics?: number;
+}
+interface BeliefOffPolicyTrajectoryReport {
+    targetPolicyId: string;
+    trajectories: OffPolicyTrajectory[];
+    dropped: number;
+    diagnostics: string[];
+}
+declare function embeddedBeliefOpeTargetPolicy(id?: string): BeliefOpeTargetPolicy;
+declare function beliefDecisionsToOffPolicyTrajectories(points: BeliefDecisionPoint[], targetPolicy: BeliefOpeTargetPolicy, options?: Pick<BeliefOpeOptions, 'maxDiagnostics'>): BeliefOffPolicyTrajectoryReport;
+declare function evaluateBeliefOffPolicy(points: BeliefDecisionPoint[], targetPolicy: BeliefOpeTargetPolicy, options?: BeliefOpeOptions): BeliefOpeReport;
+interface EvaluateBeliefSelectivePolicyOptions {
+    utility?: BeliefUtilityOptions;
+    minN?: number;
+    minAccepted?: number;
+    minUtilityDelta?: number;
+    seed?: number;
+}
+declare function thresholdSelectivePolicy(options: {
+    id?: string;
+    confidenceThreshold: number;
+    belowThresholdAction?: Exclude<BeliefPolicyAction, 'accept'>;
+}): BeliefSelectivePolicy;
+declare function evaluateBeliefSelectivePolicy(points: BeliefDecisionPoint[], policy: BeliefSelectivePolicy, options?: EvaluateBeliefSelectivePolicyOptions): BeliefSelectivePolicyMetrics;
+interface AnalyzeBeliefPolicyOpeOptions extends BeliefOpeOptions {
+    targetPolicy?: BeliefOpeTargetPolicy;
+}
+interface AnalyzeBeliefPolicyOptions {
+    points: BeliefDecisionPoint[];
+    policy: BeliefSelectivePolicy;
+    selective?: EvaluateBeliefSelectivePolicyOptions;
+    calibration?: BeliefCalibrationOptions;
+    ope?: AnalyzeBeliefPolicyOpeOptions;
+    requireOpe?: boolean;
+}
+declare function analyzeBeliefPolicy(options: AnalyzeBeliefPolicyOptions): BeliefPolicyEvaluationReport;
+export { type AnalyzeBeliefPolicyOpeOptions, type AnalyzeBeliefPolicyOptions, type BeliefCalibrationOptions, type BeliefCalibrationRegion, type BeliefCalibrationStatus, type BeliefDecisionExtractionDiagnostic, type BeliefDecisionExtractionReport, type BeliefDecisionKind, type BeliefDecisionOutcome, type BeliefDecisionPoint, type BeliefEvaluationStatus, type BeliefEvidenceRef, type BeliefEvidenceSource, type BeliefOffPolicyTrajectoryReport, type BeliefOpeOptions, type BeliefOpeReport, type BeliefOpeStatus, type BeliefOpeSupportDiagnostics, type BeliefOpeTargetPolicy, type BeliefPolicyAction, type BeliefPolicyDecision, type BeliefPolicyEvaluationReport, type BeliefSelectivePolicy, type BeliefSelectivePolicyMetrics, type BeliefUtilityOptions, type EvaluateBeliefSelectivePolicyOptions, type ExtractBeliefDecisionPointsOptions, analyzeBeliefPolicy, beliefDecisionsToOffPolicyTrajectories, calibrateBeliefDecisions, embeddedBeliefOpeTargetPolicy, evaluateBeliefOffPolicy, evaluateBeliefSelectivePolicy, extractBeliefDecisionPoints, thresholdSelectivePolicy };
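
Reviewer sketch (not part of the package): the new `belief-state` declarations name a threshold-based selective policy, coverage and accepted-error metrics, and off-policy support diagnostics, but the `.d.ts` does not show how any of them are computed. The TypeScript below is a minimal, self-contained illustration of what such quantities conventionally mean, under stated assumptions. Every `Sketch*` / `sketch*` name is hypothetical and local to this example; nothing is imported from `@tangle-network/agent-eval`, and the real `thresholdSelectivePolicy`, `evaluateBeliefSelectivePolicy`, and `evaluateBeliefOffPolicy` may handle missing fields, utilities, and bootstrapping differently. The effective sample size uses the standard Kish formula, (Σw)² / Σw², with importance weight w = targetProb / behaviorProb.

```typescript
// Simplified local mirrors of a few declared fields; nothing is imported from the package.
interface SketchDecisionPoint {
  id: string;
  confidence?: number;
  behaviorProb?: number;
  targetProb?: number;
  outcome?: { success?: boolean };
}

type SketchAction = 'accept' | 'defer' | 'verify' | 'ask' | 'retry' | 'stop';

interface SketchSelectivePolicy {
  id: string;
  decide(point: SketchDecisionPoint): { action: SketchAction; confidence?: number };
}

// Hypothetical stand-in for a confidence-threshold policy.
function sketchThresholdPolicy(
  confidenceThreshold: number,
  belowThresholdAction: Exclude<SketchAction, 'accept'> = 'defer',
): SketchSelectivePolicy {
  return {
    id: `threshold-${confidenceThreshold}`,
    decide(point) {
      // Assumption: a point with no recorded confidence is never accepted.
      if (point.confidence !== undefined && point.confidence >= confidenceThreshold) {
        return { action: 'accept', confidence: point.confidence };
      }
      return { action: belowThresholdAction };
    },
  };
}

interface SketchSelectiveMetrics {
  n: number;
  accepted: number;
  rejected: number;
  coverage: number;
  acceptedErrorRate: number;
}

// Coverage = accepted / n; accepted error rate = failures among accepted points.
function sketchSelectiveMetrics(
  points: SketchDecisionPoint[],
  policy: SketchSelectivePolicy,
): SketchSelectiveMetrics {
  let accepted = 0;
  let acceptedErrors = 0;
  for (const point of points) {
    if (policy.decide(point).action !== 'accept') continue;
    accepted += 1;
    // Assumption: only an explicit `success: false` counts as an error.
    if (point.outcome?.success === false) acceptedErrors += 1;
  }
  const n = points.length;
  return {
    n,
    accepted,
    rejected: n - accepted,
    coverage: n === 0 ? 0 : accepted / n,
    acceptedErrorRate: accepted === 0 ? 0 : acceptedErrors / accepted,
  };
}

interface SketchOpeSupport {
  n: number; // assumption: counts only points that produced a usable weight
  dropped: number;
  effectiveSampleSize: number;
  effectiveSampleRatio: number;
  maxImportanceWeight: number;
}

// Importance-weight diagnostics: points without both probabilities are dropped.
function sketchOpeSupport(points: SketchDecisionPoint[]): SketchOpeSupport {
  const weights: number[] = [];
  for (const point of points) {
    const { behaviorProb, targetProb } = point;
    if (behaviorProb === undefined || targetProb === undefined || behaviorProb <= 0) continue;
    weights.push(targetProb / behaviorProb);
  }
  const sum = weights.reduce((acc, w) => acc + w, 0);
  const sumSquares = weights.reduce((acc, w) => acc + w * w, 0);
  const ess = sumSquares === 0 ? 0 : (sum * sum) / sumSquares; // Kish effective sample size
  return {
    n: weights.length,
    dropped: points.length - weights.length,
    effectiveSampleSize: ess,
    effectiveSampleRatio: weights.length === 0 ? 0 : ess / weights.length,
    maxImportanceWeight: weights.reduce((acc, w) => Math.max(acc, w), 0),
  };
}

// Made-up sample data, for illustration only.
const samplePoints: SketchDecisionPoint[] = [
  { id: 'a', confidence: 0.9, behaviorProb: 0.5, targetProb: 0.5, outcome: { success: true } },
  { id: 'b', confidence: 0.8, behaviorProb: 0.5, targetProb: 0.25, outcome: { success: false } },
  { id: 'c', confidence: 0.4, behaviorProb: 0.25, targetProb: 0.5, outcome: { success: false } },
  { id: 'd', confidence: 0.2, outcome: { success: true } },
];

const sketchMetrics = sketchSelectiveMetrics(samplePoints, sketchThresholdPolicy(0.7));
const sketchSupport = sketchOpeSupport(samplePoints);
console.log(sketchMetrics, sketchSupport);
```

A low effective sample ratio or a large maximum weight suggests the logged behavior policy rarely took the actions the target policy prefers. That is presumably the situation the declared `minEffectiveSampleSize` and `minEffectiveSampleRatio` options are meant to flag, though the package's actual thresholds and defaults are not visible in this diff.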