@tangle-network/agent-eval 0.17.3 → 0.19.0
- package/README.md +8 -1
- package/dist/index.d.ts +303 -279
- package/dist/index.js +332 -210
- package/dist/index.js.map +1 -1
- package/docs/concepts.md +155 -0
- package/docs/control-runtime.md +351 -0
- package/docs/feature-guide.md +213 -0
- package/docs/feedback-trajectories.md +193 -0
- package/docs/multi-shot-optimization.md +122 -0
- package/docs/wire-protocol.md +199 -0
- package/package.json +21 -14
package/README.md
CHANGED
@@ -81,7 +81,8 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 | `runAgentControlLoop` | Policy-based runtime for agentic tasks: observe typed state, validate, decide, act, repeat with budgets, tracing, and stuck-loop guards. | [control-runtime.md](./docs/control-runtime.md) |
 | `FeedbackTrajectory`, `InMemoryFeedbackTrajectoryStore`, `FileSystemFeedbackTrajectoryStore` | Human/environment feedback loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | [feedback-trajectories.md](./docs/feedback-trajectories.md) |
 | `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
-| `ExperimentTracker`,
+| `ExperimentTracker`, steering optimizers, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
+| `runMultiShotOptimization`, `trialTraceFromMultiShotTrial` | GEPA-style optimization for variable-length agent trajectories with ASI, paired seeds, and optional held-out promotion gating. | [multi-shot-optimization.md](./docs/multi-shot-optimization.md) |
 | `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
 | `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
 | `correlationStudy`, `OutcomeStore`, `ProductRegistry` | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
@@ -89,6 +90,12 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 
 ## Evolution loop
 
+For agent tasks that run across many chat turns or tool calls, start with
+[`runMultiShotOptimization`](./docs/multi-shot-optimization.md). It runs the
+same prompt-evolution core over full trajectories, carries actionable side
+information into reflection, and separates the search winner from the variant
+that actually passes held-out promotion.
+
 Closing the loop on a prompt or codebase is **two adapters + a config**. Compose `runPromptEvolution` with `createCompositeMutator` (plateau policy) and you get prompt-only optimization until improvement stalls, then automatic switch to code-channel mutations from a coding agent inside a `SandboxPool`.
 
```ts
|