@tangle-network/agent-eval 0.17.1 → 0.17.3

package/README.md CHANGED
@@ -21,7 +21,7 @@ console.log(ship.result.passed, ship.result.score)
21
21
  - You ship a content generator and need quality signal beyond "the LLM said it's good".
22
22
  - You want a release gate that fails on regressions you can name, not vibes.
23
23
 
24
- If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — 5-minute mental model — then come back here.
24
+ If that's you, start with [`docs/concepts.md`](./docs/concepts.md) — 5-minute mental model — then use [`docs/feature-guide.md`](./docs/feature-guide.md) to choose the right primitive.
25
25
 
26
26
  ## Quickstart
27
27
 
@@ -65,6 +65,7 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
65
65
  ## Two ways to read this repo
66
66
 
67
67
  - **You're a human onboarding** — read [`docs/concepts.md`](./docs/concepts.md) for the mental model, then [`docs/wire-protocol.md`](./docs/wire-protocol.md) if you'll call from another language, or `SKILL.md` if you'll embed in TS.
68
+ - **You're deciding what to integrate** — read [`docs/feature-guide.md`](./docs/feature-guide.md) for the layman explanation, use cases, feature map, and guardrails.
68
69
  - **You're an LLM agent writing integration code** — read `SKILL.md`. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.
69
70
 
70
71
  ## What's in the box
@@ -77,6 +78,9 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
77
78
  | Wire protocol (`agent-eval serve` / `rpc`) | HTTP and stdio RPC interface for cross-language clients. | wire-protocol.md |
78
79
  | `clients/python/` | First-party Python client (`tangle-agent-eval` on PyPI). Version-locked to npm. | clients/python/README.md |
79
80
  | `BenchmarkRunner`, `executeScenario`, `ConvergenceTracker` | Multi-turn scenario execution + cross-run tracking. | SKILL.md |
81
+ | `runAgentControlLoop` | Policy-based runtime for agentic tasks: observe typed state, validate, decide, act, repeat with budgets, tracing, and stuck-loop guards. | [control-runtime.md](./docs/control-runtime.md) |
82
+ | `FeedbackTrajectory`, `InMemoryFeedbackTrajectoryStore`, `FileSystemFeedbackTrajectoryStore` | Human/environment feedback loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | [feedback-trajectories.md](./docs/feedback-trajectories.md) |
83
+ | `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
80
84
  | `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
81
85
  | `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
82
86
  | `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
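The `runAgentControlLoop` row above describes an observe, validate, decide, act cycle that repeats under budgets with tracing and stuck-loop guards. As a rough illustration of that shape only, here is a self-contained TypeScript sketch. Every name in it (`LoopPolicy`, `Decision`, `runSketchLoop`, `maxSteps`, `maxRepeats`) is hypothetical and is not the package's API; `docs/control-runtime.md` remains the reference for the real interface.

```ts
// Hypothetical types for illustration; not exported by @tangle-network/agent-eval.
type Decision<A> =
  | { kind: 'act'; action: A }
  | { kind: 'stop'; reason: string }

interface LoopPolicy<S, A> {
  observe(): S                   // read the current typed state
  validate(state: S): boolean    // reject states the loop should not act on
  decide(state: S): Decision<A>  // choose the next action, or stop
  act(action: A): void           // apply the action to the environment
}

interface LoopResult {
  steps: number          // number of actions actually taken
  stoppedBecause: string
  trace: string[]        // one serialized [state, action] entry per action taken
}

function runSketchLoop<S, A>(
  policy: LoopPolicy<S, A>,
  maxSteps: number,   // step budget
  maxRepeats: number, // allowed consecutive repeats of the same [state, action]
): LoopResult {
  const trace: string[] = []
  let lastKey = ''
  let repeats = 0
  for (let step = 0; step < maxSteps; step++) {
    const state = policy.observe()
    if (!policy.validate(state)) {
      return { steps: step, stoppedBecause: 'invalid-state', trace }
    }
    const decision = policy.decide(state)
    if (decision.kind === 'stop') {
      return { steps: step, stoppedBecause: decision.reason, trace }
    }
    // Stuck-loop guard: the same state and action seen back to back.
    const key = JSON.stringify([state, decision.action])
    repeats = key === lastKey ? repeats + 1 : 0
    lastKey = key
    if (repeats >= maxRepeats) {
      return { steps: step, stoppedBecause: 'stuck', trace }
    }
    policy.act(decision.action)
    trace.push(key)
  }
  return { steps: maxSteps, stoppedBecause: 'budget-exhausted', trace }
}
```

In this sketch the guard compares serialized state and action pairs, which only works when the state is JSON-serializable; a real runtime may well use a different notion of "no progress".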
@@ -168,6 +172,50 @@ The `MutationTelemetry`, `LineageRecorder`, and `CostLedger` pass into the `code
168
172
 
169
173
  For the full primitive surface and rationale, read each module's JSDoc — `prompt-evolution.ts`, `composite-mutator.ts`, `sandbox-pool.ts`, `code-mutator.ts`, `reflective-mutation.ts`, `evolution-telemetry.ts`.
170
174
 
175
+ ## Feedback trajectory loop
176
+
177
+ When normal agent usage should generate training/eval signal, use feedback
178
+ trajectories. They turn approvals, rejections, option choices, edits, metrics,
179
+ and policy blocks into reusable examples.
180
+
181
+ ```ts
182
+ import {
183
+   createFeedbackTrajectory,
184
+   summarizePreferenceMemory,
185
+   feedbackTrajectoriesToDatasetScenarios,
186
+   feedbackTrajectoriesToOptimizerRows,
187
+ } from '@tangle-network/agent-eval'
188
+
189
+ const trajectory = createFeedbackTrajectory({
190
+   projectId: 'research-agent',
191
+   scenarioId: 'brief-review',
192
+   task: { intent: 'Revise a research brief until it is specific and sourced.' },
193
+   attempts: [{
194
+     id: 'draft-1',
195
+     stepIndex: 0,
196
+     artifactType: 'research',
197
+     artifact: { summary: 'Initial brief with weak sourcing.' },
198
+     createdAt: new Date().toISOString(),
199
+   }],
200
+   labels: [{
201
+     source: 'user',
202
+     kind: 'revision_request',
203
+     value: 'needs stronger evidence',
204
+     reason: 'add primary sources and remove unsupported claims',
205
+     severity: 'error',
206
+     createdAt: new Date().toISOString(),
207
+   }],
208
+ })
209
+
210
+ const memory = summarizePreferenceMemory([trajectory])
211
+ const scenarios = feedbackTrajectoriesToDatasetScenarios([trajectory])
212
+ const optimizerRows = feedbackTrajectoriesToOptimizerRows([trajectory])
213
+ ```
214
+
215
+ This is the bridge between feedback and optimization: review signals become
216
+ immediate memory, replayable eval scenarios, and prompt/signature/code optimizer
217
+ input. See [`docs/feedback-trajectories.md`](./docs/feedback-trajectories.md).
218
+
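To make "review signals become immediate memory [and] replayable eval scenarios" more concrete, the sketch below shows one plausible way feedback labels could be tallied and trajectories assigned to train/dev/test/holdout splits. It is an illustration under stated assumptions, not how `summarizePreferenceMemory` or `feedbackTrajectoriesToDatasetScenarios` actually work: the types `SketchLabel` and `SketchTrajectory`, the helpers `countLabelKinds` and `assignSplit`, and the 60/20/10/10 split ratio are all hypothetical.

```ts
// Hypothetical shapes that mirror only the fields used in the example above;
// they are not the package's exported types.
type SketchLabel = { source: string; kind: string; value: string }
type SketchTrajectory = { projectId: string; scenarioId: string; labels: SketchLabel[] }
type Split = 'train' | 'dev' | 'test' | 'holdout'

// Tally labels by kind: a crude stand-in for a preference-memory summary.
function countLabelKinds(trajectories: SketchTrajectory[]): Record<string, number> {
  const counts: Record<string, number> = {}
  for (const trajectory of trajectories) {
    for (const label of trajectory.labels) {
      counts[label.kind] = (counts[label.kind] || 0) + 1
    }
  }
  return counts
}

// Deterministic split assignment: hashing the scenario id keeps a scenario in
// the same split across exports, so examples do not leak between train and test.
function assignSplit(scenarioId: string): Split {
  let hash = 0
  for (let i = 0; i < scenarioId.length; i++) {
    hash = (hash * 31 + scenarioId.charCodeAt(i)) >>> 0 // keep within 32 bits
  }
  const bucket = hash % 10
  if (bucket < 6) return 'train' // assumed 60%
  if (bucket < 8) return 'dev'   // assumed 20%
  if (bucket < 9) return 'test'  // assumed 10%
  return 'holdout'               // assumed 10%
}
```

Hashing on the scenario id rather than drawing a random number is the design choice worth noting: repeated feedback on the same scenario stays in one split, which is what makes the resulting examples safe to replay as held-out evals.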
171
219
  ## v0.16 highlights — production-rigor primitives
172
220
 
173
221
  These are the primitives any team running prompt-optimization in production needs, regardless of whether they're writing a paper. v0.15 shipped them under "paper-grade" naming; v0.16 corrects that — they're production-first, paper-grade as a side effect.