@tangle-network/agent-eval 0.17.1 → 0.17.3
- package/README.md +49 -1
- package/dist/index.d.ts +1493 -1088
- package/dist/index.js +1487 -187
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
@@ -21,7 +21,7 @@ console.log(ship.result.passed, ship.result.score)
 - You ship a content generator and need quality signal beyond "the LLM said it's good".
 - You want a release gate that fails on regressions you can name, not vibes.
 
-If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — 5-minute mental model — then
+If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — 5-minute mental model — then use [`docs/feature-guide.md`](./docs/feature-guide.md) to choose the right primitive.
 
 ## Quickstart
 
@@ -65,6 +65,7 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 ## Two ways to read this repo
 
 - **You're a human onboarding** — read [`docs/concepts.md`](./docs/concepts.md) for the mental model, then [`docs/wire-protocol.md`](./docs/wire-protocol.md) if you'll call from another language, or `SKILL.md` if you'll embed in TS.
+- **You're deciding what to integrate** — read [`docs/feature-guide.md`](./docs/feature-guide.md) for the layman explanation, use cases, feature map, and guardrails.
 - **You're an LLM agent writing integration code** — read `SKILL.md`. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.
 
 ## What's in the box
@@ -77,6 +78,9 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 | Wire protocol (`agent-eval serve` / `rpc`) | HTTP and stdio RPC interface for cross-language clients. | wire-protocol.md |
 | `clients/python/` | First-party Python client (`tangle-agent-eval` on PyPI). Version-locked to npm. | clients/python/README.md |
 | `BenchmarkRunner`, `executeScenario`, `ConvergenceTracker` | Multi-turn scenario execution + cross-run tracking. | SKILL.md |
+| `runAgentControlLoop` | Policy-based runtime for agentic tasks: observe typed state, validate, decide, act, repeat with budgets, tracing, and stuck-loop guards. | [control-runtime.md](./docs/control-runtime.md) |
+| `FeedbackTrajectory`, `InMemoryFeedbackTrajectoryStore`, `FileSystemFeedbackTrajectoryStore` | Human/environment feedback loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | [feedback-trajectories.md](./docs/feedback-trajectories.md) |
+| `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
 | `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
 | `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
 | `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
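The `runAgentControlLoop` row in the table above describes an observe, validate, decide, act cycle that repeats under a budget with a stuck-loop guard. The sketch below illustrates that general pattern only. It is not the package's implementation: `runLoopSketch`, `LoopPolicy`, `LoopResult`, and the `maxSteps` / `maxRepeats` options are hypothetical names invented for this example, and the real signature should be taken from `docs/control-runtime.md`.

```typescript
// Hypothetical sketch of a policy-driven control loop. None of these names
// come from @tangle-network/agent-eval; they only illustrate the pattern.
type Decision<A> = { kind: 'act'; action: A } | { kind: 'stop'; reason: string }

interface LoopPolicy<S, A> {
  observe(): S
  validate(state: S): boolean // true once the goal is satisfied
  decide(state: S): Decision<A>
  act(action: A): void
}

interface LoopResult<S> {
  status: 'done' | 'stopped' | 'budget_exhausted' | 'stuck'
  steps: number // number of actions taken before the loop ended
  trace: S[] // every observed state, in order
}

function runLoopSketch<S, A>(
  policy: LoopPolicy<S, A>,
  opts: { maxSteps: number; maxRepeats: number },
): LoopResult<S> {
  const trace: S[] = []
  let repeats = 0
  let lastKey: string | undefined
  for (let step = 0; step < opts.maxSteps; step++) {
    const state = policy.observe()
    trace.push(state)
    if (policy.validate(state)) return { status: 'done', steps: step, trace }

    // Stuck-loop guard: identical consecutive observations mean no progress.
    const key = JSON.stringify(state)
    repeats = key === lastKey ? repeats + 1 : 0
    lastKey = key
    if (repeats >= opts.maxRepeats) return { status: 'stuck', steps: step, trace }

    const decision = policy.decide(state)
    if (decision.kind === 'stop') return { status: 'stopped', steps: step, trace }
    policy.act(decision.action)
  }
  // Budget guard: the step limit ran out before validation passed.
  return { status: 'budget_exhausted', steps: opts.maxSteps, trace }
}

// Usage: a policy that increments a counter until it reaches 3.
let counter = 0
const counting: LoopPolicy<number, 'inc'> = {
  observe: () => counter,
  validate: (n) => n >= 3,
  decide: () => ({ kind: 'act', action: 'inc' }),
  act: () => { counter += 1 },
}
const finished = runLoopSketch(counting, { maxSteps: 10, maxRepeats: 2 })

// A policy whose action never changes the state trips the stuck-loop guard.
const frozen: LoopPolicy<number, 'noop'> = {
  observe: () => 0,
  validate: () => false,
  decide: () => ({ kind: 'act', action: 'noop' }),
  act: () => {},
}
const stuck = runLoopSketch(frozen, { maxSteps: 10, maxRepeats: 2 })
```

Comparing serialized states is the simplest possible progress check and is an assumption of this sketch; a production runtime would more plausibly compare a domain-specific fingerprint of the state.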
@@ -168,6 +172,50 @@ The `MutationTelemetry`, `LineageRecorder`, and `CostLedger` pass into the `code
 
 For the full primitive surface and rationale, read each module's JSDoc — `prompt-evolution.ts`, `composite-mutator.ts`, `sandbox-pool.ts`, `code-mutator.ts`, `reflective-mutation.ts`, `evolution-telemetry.ts`.
 
+## Feedback trajectory loop
+
+When normal agent usage should generate training/eval signal, use feedback
+trajectories. They turn approvals, rejections, option choices, edits, metrics,
+and policy blocks into reusable examples.
+
+```ts
+import {
+  createFeedbackTrajectory,
+  summarizePreferenceMemory,
+  feedbackTrajectoriesToDatasetScenarios,
+  feedbackTrajectoriesToOptimizerRows,
+} from '@tangle-network/agent-eval'
+
+const trajectory = createFeedbackTrajectory({
+  projectId: 'research-agent',
+  scenarioId: 'brief-review',
+  task: { intent: 'Revise a research brief until it is specific and sourced.' },
+  attempts: [{
+    id: 'draft-1',
+    stepIndex: 0,
+    artifactType: 'research',
+    artifact: { summary: 'Initial brief with weak sourcing.' },
+    createdAt: new Date().toISOString(),
+  }],
+  labels: [{
+    source: 'user',
+    kind: 'revision_request',
+    value: 'needs stronger evidence',
+    reason: 'add primary sources and remove unsupported claims',
+    severity: 'error',
+    createdAt: new Date().toISOString(),
+  }],
+})
+
+const memory = summarizePreferenceMemory([trajectory])
+const scenarios = feedbackTrajectoriesToDatasetScenarios([trajectory])
+const optimizerRows = feedbackTrajectoriesToOptimizerRows([trajectory])
+```
+
+This is the bridge between feedback and optimization: review signals become
+immediate memory, replayable eval scenarios, and prompt/signature/code optimizer
+input. See [`docs/feedback-trajectories.md`](./docs/feedback-trajectories.md).
+
 ## v0.16 highlights — production-rigor primitives
 
 These are the primitives any team running prompt-optimization in production needs, regardless of whether they're writing a paper. v0.15 shipped them under "paper-grade" naming; v0.16 corrects that — they're production-first, paper-grade as a side effect.
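The "Feedback trajectory loop" section added in this hunk says review labels become optimizer input. The sketch below illustrates that idea with its own simplified types and a hypothetical `toOptimizerRows` helper. It does not call or mirror the package's `feedbackTrajectoriesToOptimizerRows`, whose actual output shape is described in `docs/feedback-trajectories.md`. Field names are borrowed from the README example. Two things are assumptions of this sketch: the scoring rule (an approval scores 1, anything else scores 0) and the choice to attach labels to the last attempt.

```typescript
// Simplified, hypothetical shapes loosely modeled on the README example.
// They are NOT the types exported by @tangle-network/agent-eval.
type LabelKind = 'approval' | 'rejection' | 'revision_request'

interface Attempt { id: string; stepIndex: number; artifact: { summary: string } }
interface Label { kind: LabelKind; reason?: string }
interface Trajectory {
  scenarioId: string
  task: { intent: string }
  attempts: Attempt[]
  labels: Label[]
}

interface OptimizerRow {
  scenarioId: string
  input: string // the task the agent was given
  output: string // what the agent produced
  score: number // 1 = accepted, 0 = rejected or needs revision
  feedback: string[] // reviewer reasons, usable as optimizer hints
}

// Assumed rule: only approvals count as positive examples.
const scoreOf = (kind: LabelKind): number => (kind === 'approval' ? 1 : 0)

function toOptimizerRows(trajectories: Trajectory[]): OptimizerRow[] {
  const rows: OptimizerRow[] = []
  for (const t of trajectories) {
    const last = t.attempts[t.attempts.length - 1]
    if (!last || t.labels.length === 0) continue // nothing to learn from

    // Worst label wins, so one rejection outweighs any approvals.
    let score = 1
    const feedback: string[] = []
    for (const label of t.labels) {
      score = Math.min(score, scoreOf(label.kind))
      if (label.reason) feedback.push(label.reason)
    }
    rows.push({
      scenarioId: t.scenarioId,
      input: t.task.intent,
      output: last.artifact.summary,
      score,
      feedback,
    })
  }
  return rows
}

// Usage with the same data as the README example.
const sketchRows = toOptimizerRows([{
  scenarioId: 'brief-review',
  task: { intent: 'Revise a research brief until it is specific and sourced.' },
  attempts: [{ id: 'draft-1', stepIndex: 0, artifact: { summary: 'Initial brief with weak sourcing.' } }],
  labels: [{ kind: 'revision_request', reason: 'add primary sources and remove unsupported claims' }],
}])
```

Keeping the reviewer's `reason` alongside the score matters for the optimization use case the README describes: a bare 0 tells an optimizer that a draft failed, while the reason says what to change.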