@tangle-network/agent-eval 0.18.0 → 0.19.1
- package/README.md +39 -1
- package/dist/index.d.ts +192 -220
- package/dist/index.js +375 -238
- package/dist/index.js.map +1 -1
- package/docs/feature-guide.md +2 -2
- package/docs/wire-protocol.md +1 -1
- package/package.json +12 -10
package/README.md
CHANGED
@@ -81,13 +81,51 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 | `runAgentControlLoop` | Policy-based runtime for agentic tasks: observe typed state, validate, decide, act, repeat with budgets, tracing, and stuck-loop guards. | [control-runtime.md](./docs/control-runtime.md) |
 | `FeedbackTrajectory`, `InMemoryFeedbackTrajectoryStore`, `FileSystemFeedbackTrajectoryStore` | Human/environment feedback loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | [feedback-trajectories.md](./docs/feedback-trajectories.md) |
 | `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
-| `ExperimentTracker`,
+| `ExperimentTracker`, steering optimizers, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
 | `runMultiShotOptimization`, `trialTraceFromMultiShotTrial` | GEPA-style optimization for variable-length agent trajectories with ASI, paired seeds, and optional held-out promotion gating. | [multi-shot-optimization.md](./docs/multi-shot-optimization.md) |
+| `evaluateReleaseConfidence`, `assertReleaseConfidence` | Release scorecard that composes corpus coverage, search/holdout run evidence, ASI diagnostics, overfit checks, and cost/latency budgets. | §Release confidence |
 | `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
 | `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
 | `correlationStudy`, `OutcomeStore`, `ProductRegistry` | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
 | Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |
 
+## Release confidence
+
+Use `evaluateReleaseConfidence` at the release boundary for every consuming
+agent surface. It fails closed unless the release has a versioned corpus,
+search and holdout run evidence, score/pass-rate evidence, ASI for failures,
+and budget/overfit checks. Single-shot and multi-shot apps use the same path:
+single-shot traces are just trace evidence with `turnCount: 1`.
+
+```ts
+import {
+  evaluateReleaseConfidence,
+  releaseTraceEvidenceFromMultiShotTrials,
+} from '@tangle-network/agent-eval'
+
+const scorecard = evaluateReleaseConfidence({
+  target: 'blueprint-agent/autoresearch',
+  candidateId: 'candidate-v3',
+  baselineId: 'baseline',
+  dataset: await dataset.manifest(),
+  runs: [...candidateRuns, ...baselineRuns],
+  traces: releaseTraceEvidenceFromMultiShotTrials(result.evolution.generations.flatMap((g) => g.trials)),
+  gateDecision: result.gate?.decision,
+  thresholds: {
+    minScenarioCount: 50,
+    minSearchRuns: 50,
+    minHoldoutRuns: 20,
+    minPassRate: 0.9,
+    minMeanScore: 0.8,
+    maxOverfitGap: 0.1,
+    maxMeanCostUsd: 0.05,
+    maxP95WallMs: 120_000,
+  },
+})
+
+if (!scorecard.promote) throw new Error(scorecard.summary)
+```
+
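Reviewer sketch (not part of the package diff): the added README text says the scorecard "fails closed" when evidence is missing. The block below is a minimal, self-contained illustration of that idea, not the package's implementation. Only the threshold names are taken from the README snippet; the `RunEvidence` shape, the `sketchReleaseGate` function, the nearest-rank p95, and the definition of the overfit gap as search mean score minus holdout mean score are all assumptions made for illustration.

```ts
// Hypothetical shapes for illustration only; these are NOT the package's real types.
type Split = 'search' | 'holdout'

interface RunEvidence {
  split: Split
  passed: boolean
  score: number // assumed to lie in 0..1
  costUsd: number
  wallMs: number
}

// Threshold names mirror the README snippet; their exact semantics are assumed.
interface GateThresholds {
  minSearchRuns: number
  minHoldoutRuns: number
  minPassRate: number
  minMeanScore: number
  maxOverfitGap: number
  maxMeanCostUsd: number
  maxP95WallMs: number
}

interface GateResult {
  promote: boolean
  reasons: string[]
}

// Mean of a list; NaN when the list is empty so later comparisons fail.
const mean = (xs: number[]): number =>
  xs.length === 0 ? NaN : xs.reduce((a, b) => a + b, 0) / xs.length

// Nearest-rank 95th percentile; NaN when there is no data.
function p95(xs: number[]): number {
  if (xs.length === 0) return NaN
  const sorted = [...xs].sort((a, b) => a - b)
  return sorted[Math.ceil(0.95 * sorted.length) - 1]
}

function sketchReleaseGate(runs: RunEvidence[], t: GateThresholds): GateResult {
  const search = runs.filter((r) => r.split === 'search')
  const holdout = runs.filter((r) => r.split === 'holdout')
  const reasons: string[] = []

  // Fail closed: missing evidence counts as a failure, not a skipped check.
  if (search.length < t.minSearchRuns) reasons.push('insufficient search runs')
  if (holdout.length < t.minHoldoutRuns) reasons.push('insufficient holdout runs')

  const passRate = mean(holdout.map((r) => (r.passed ? 1 : 0)))
  const holdoutScore = mean(holdout.map((r) => r.score))
  // Assumed definition: how much better the candidate looks on search than on holdout.
  const overfitGap = mean(search.map((r) => r.score)) - holdoutScore

  // Each check is written as !(ok) so that NaN (no data) also fails.
  if (!(passRate >= t.minPassRate)) reasons.push('holdout pass rate below minimum')
  if (!(holdoutScore >= t.minMeanScore)) reasons.push('holdout mean score below minimum')
  if (!(overfitGap <= t.maxOverfitGap)) reasons.push('overfit gap above maximum')
  if (!(mean(runs.map((r) => r.costUsd)) <= t.maxMeanCostUsd)) reasons.push('mean cost above budget')
  if (!(p95(runs.map((r) => r.wallMs)) <= t.maxP95WallMs)) reasons.push('p95 latency above budget')

  return { promote: reasons.length === 0, reasons }
}
```

Under these assumptions, calling `sketchReleaseGate([], thresholds)` returns `promote: false` with every check listed in `reasons`, which is the fail-closed behavior the README describes: an empty evidence set can never promote a release.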
 ## Evolution loop
 
 For agent tasks that run across many chat turns or tool calls, start with