@tangle-network/agent-eval 0.17.2 → 0.18.0

@@ -0,0 +1,213 @@
+ # Feature Guide
+
+ This page explains the main `agent-eval` primitives in plain English first,
+ then shows when to use each one.
+
+ ## ELI5
+
+ LLM agents can write code, drafts, research, plans, and actions. The hard part
+ is knowing whether they actually did a good job, whether they should keep
+ trying, and whether a change made them better or worse.
+
+ `agent-eval` gives you reusable tools for that:
+
+ - **Judges** grade one output.
+ - **Verifiers** run several checks in order.
+ - **Control loops** let an agent keep working until it passes, gets blocked, or
+   hits a budget.
+ - **Feedback trajectories** turn normal user approvals/rejections into training
+   and eval data.
+ - **Datasets and holdouts** keep examples organized so you do not overfit.
+ - **Optimizers and mutation loops** try prompt/signature/code variants and keep
+   the ones that really improve.
+ - **Traces and telemetry** show what happened, step by step.
+
+ ## Which Primitive Should I Use?
+
+ | Problem | Use | Why |
+ | --- | --- | --- |
+ | “Did this single answer/draft pass?” | Judge or rubric | Fast quality signal for one artifact. |
+ | “Does generated code actually work?” | `BuilderSession`, `MultiLayerVerifier`, sandbox harness | Build/test/runtime gates catch failures judges miss. |
+ | “Should the agent keep trying?” | `runAgentControlLoop` | Budgeted `observe -> validate -> decide -> act` runtime. |
+ | “The agent should propose, verify, review, and revise.” | `runProposeReviewAsControlLoop` | Reusable preset over the generic control loop. |
+ | “Human feedback should become reusable eval data.” | `FeedbackTrajectory` | Captures approvals, rejections, edits, choices, metrics, and policy blocks. |
+ | “Can this action run, or does it need approval?” | `evaluateActionPolicy` | Generic preflight for side effects, budgets, and required evidence. |
+ | “I need train/dev/test/holdout examples.” | `Dataset` plus feedback trajectory conversion | Stable splits and contamination control. |
+ | “Which prompt or signature wins?” | `PromptOptimizer`, `OptimizationLoop`, steering optimizers | Runs variants on scenarios and compares scores. |
+ | “Improve a multi-turn agent over real task traces.” | `runMultiShotOptimization` | GEPA-style trajectory optimization with actionable side information (ASI) and held-out promotion. |
+ | “Improve prompts, then code if prompts plateau.” | `runPromptEvolution`, composite mutator, code mutator | Bounded evolution with telemetry and lineage. |
+ | “Find why a regression happened.” | bisector, traces, run records | Narrows changes and preserves evidence. |
+ | “Expose evals to another language.” | Wire protocol and Python client | HTTP/RPC boundary for non-TypeScript apps. |
+
+ ## Integration Patterns
+
+ ### Recommended Agent Product Shape
+
+ Use this shape when the product needs to keep pushing work forward instead of
+ only answering once:
+
+ ```text
+ user intent
+   -> product adapter observes typed state
+   -> runAgentControlLoop decides the next driver action
+   -> worker executes in the real product environment
+   -> validators and judges grade the new state
+   -> FeedbackTrajectory records user/environment/reviewer signal
+   -> datasets and optimizers replay the same adapter
+ ```
+
+ The important part is that production and eval do not use different loops. The
+ adapter can swap dependencies (real user session, replay fixture, sandbox), but
+ the state shape, validators, actions, budgets, and stop policies should stay
+ the same. That is what makes benchmark gains transfer to real usage.
+
+ ### Agent Runtime Integration
+
+ Use when you have a coding, browser, computer-use, research, or documentation agent.
+
+ 1. Represent the current task state.
+ 2. Validate that state with objective checks first and judges second.
+ 3. Use `runAgentControlLoop` to decide the next action.
+ 4. Record user feedback as `FeedbackTrajectory`.
+ 5. Convert trajectories into datasets and optimizer rows.
+
+ Result:
+
+ ```text
+ normal agent usage -> labeled examples -> replay/eval -> optimization
+ ```
+
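The first three steps can be sketched as a step-budgeted `observe -> validate -> decide -> act` loop. This is a minimal synchronous illustration, not the `runAgentControlLoop` API: `runBudgetedLoop`, `LoopSpec`, and `LoopResult` are hypothetical names, and a production loop would presumably be asynchronous and also budget wall time and cost.

```typescript
// Hypothetical sketch of a step-budgeted control loop.
// None of these names come from agent-eval.
type Validation = { pass: boolean; reason?: string }

interface LoopSpec<State, Action> {
  observe: () => State // read the current task state
  validate: (state: State) => Validation // objective checks
  decide: (state: State, validation: Validation) => Action | null // null means "blocked"
  act: (action: Action) => void // perform one step
  maxSteps: number // step budget
}

type LoopResult<State> = {
  status: 'passed' | 'blocked' | 'budget_exhausted'
  steps: number
  finalState: State
}

function runBudgetedLoop<State, Action>(spec: LoopSpec<State, Action>): LoopResult<State> {
  let steps = 0
  while (true) {
    const state = spec.observe()
    const validation = spec.validate(state)
    if (validation.pass) return { status: 'passed', steps, finalState: state }
    if (steps >= spec.maxSteps) return { status: 'budget_exhausted', steps, finalState: state }
    const action = spec.decide(state, validation)
    if (action === null) return { status: 'blocked', steps, finalState: state }
    spec.act(action)
    steps += 1
  }
}

// Usage: a toy task that passes once a counter reaches 3.
let toyCounter = 0
const loopResult = runBudgetedLoop<number, 'increment'>({
  observe: () => toyCounter,
  validate: (n) => (n >= 3 ? { pass: true } : { pass: false, reason: 'counter too low' }),
  decide: () => 'increment',
  act: () => { toyCounter += 1 },
  maxSteps: 10,
})
```

Returning a status for every exit path (passed, blocked, budget exhausted) keeps each stop reason available for later labeling.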
+ Implementation ownership:
+
+ - Put reusable loop mechanics, trace schemas, budget accounting, split
+   assignment, and optimizer row conversion in `agent-eval`.
+ - Put product state readers, action executors, approval policy, credentials,
+   workspace paths, and UI-specific storage in the downstream repo.
+ - Promote a product adapter into `agent-eval` only after at least two products
+   need the same adapter shape.
+
+ ### Code Generator
+
+ Use when an agent writes or patches a repo.
+
+ 1. Use `BuilderSession` or `MultiLayerVerifier`.
+ 2. Always run static gates like typecheck/build/tests.
+ 3. Add semantic judges only after build gates pass.
+ 4. Store traces and run records for regression debugging.
+
+ Result:
+
+ ```text
+ generated code -> build/test/runtime gates -> score -> ship or revise
+ ```
+
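The gate ordering in this section can be sketched as a small layered verifier. This is an illustration under assumptions, not the `MultiLayerVerifier` API: `Gate`, `Verdict`, `verifyInLayers`, and the stubbed gates are hypothetical.

```typescript
// Hypothetical sketch: deterministic gates run first and short-circuit;
// the semantic judge only runs when every gate has passed.
type Gate = { name: string; run: () => boolean }
type GateResult = { name: string; pass: boolean }

type Verdict = {
  gates: GateResult[]
  judgeScore: number | null // null when the judge was skipped
  ship: boolean
}

function verifyInLayers(gates: Gate[], judge: () => number, minJudgeScore: number): Verdict {
  const results: GateResult[] = []
  for (const gate of gates) {
    const pass = gate.run()
    results.push({ name: gate.name, pass })
    // A failed gate ends verification; the judge cannot override it.
    if (!pass) return { gates: results, judgeScore: null, ship: false }
  }
  const judgeScore = judge()
  return { gates: results, judgeScore, ship: judgeScore >= minJudgeScore }
}

// Usage with stubbed gates: typecheck passes, tests fail, so the judge never runs.
let judgeCalls = 0
const failedVerdict = verifyInLayers(
  [
    { name: 'typecheck', run: () => true },
    { name: 'tests', run: () => false },
    { name: 'build', run: () => true },
  ],
  () => { judgeCalls += 1; return 0.9 },
  0.7,
)
```

Skipping the judge after a failed gate also saves an LLM call on artifacts that cannot ship anyway.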
104
+ ### Prompt/Signature Optimizer
105
+
106
+ Use when you want Ax/GEPA-style improvement.
107
+
108
+ 1. For variable-length agent tasks, use `runMultiShotOptimization`.
109
+ 2. Build search/dev/test/holdout splits from the real product loop.
110
+ 3. Score full trajectories, not just final text.
111
+ 4. Emit actionable side information for failures the mutator can fix.
112
+ 5. Promote only `promotedVariant`, never a rejected `searchBestVariant`.
113
+ 6. Keep run records with prompt hash, model, config, cost, and commit.
114
+
115
+ Result:
116
+
117
+ ```text
118
+ candidate variant -> repeated evals -> statistical comparison -> promotion gate
119
+ ```
120
+
121
+ Do not optimize a toy harness if users run a different product loop. Build
122
+ training examples from `FeedbackTrajectory` and `controlRunToFeedbackTrajectory`,
123
+ then replay them through the same product adapter used at runtime. Train on
124
+ `train`, tune on `dev`, report on `test`, and keep `holdout` untouched until a
125
+ promotion decision.
126
+
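The `repeated evals -> statistical comparison -> promotion gate` flow can be illustrated with a paired comparison. This is only a sketch: `comparePaired`, `PairedScore`, and the win/loss rule are hypothetical, and a real gate would likely use a proper statistical test rather than this simple threshold.

```typescript
// Hypothetical sketch: compare a candidate to a baseline on paired runs,
// where each pair shares the same scenario and repetition.
type PairedScore = { scenarioId: string; rep: number; baseline: number; candidate: number }

type Comparison = { meanDelta: number; wins: number; losses: number; promote: boolean }

function comparePaired(pairs: PairedScore[], minMeanDelta: number): Comparison {
  if (pairs.length === 0) return { meanDelta: 0, wins: 0, losses: 0, promote: false }
  let total = 0
  let wins = 0
  let losses = 0
  for (const pair of pairs) {
    const delta = pair.candidate - pair.baseline
    total += delta
    if (delta > 0) wins += 1
    else if (delta < 0) losses += 1
  }
  const meanDelta = total / pairs.length
  // Promote only when the average gain clears the margin and wins outnumber losses.
  return { meanDelta, wins, losses, promote: meanDelta >= minMeanDelta && wins > losses }
}

// Usage: four paired runs with scores chosen to be exact in floating point.
const pairedRuns: PairedScore[] = [
  { scenarioId: 's1', rep: 0, baseline: 0.5, candidate: 0.75 },
  { scenarioId: 's1', rep: 1, baseline: 0.5, candidate: 0.75 },
  { scenarioId: 's2', rep: 0, baseline: 0.5, candidate: 0.5 },
  { scenarioId: 's2', rep: 1, baseline: 0.5, candidate: 0.25 },
]
const comparison = comparePaired(pairedRuns, 0.05)
```

Pairing by `(scenarioId, rep)` removes scenario difficulty from the comparison, so the delta reflects the variant rather than the sample.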
127
+ ### Human Feedback Data
128
+
129
+ Use when operator or reviewer interaction should create labels.
130
+
131
+ Capture:
132
+
133
+ - approve/reject
134
+ - select A/B/C
135
+ - edit/rewrite
136
+ - rank/rate
137
+ - comment
138
+ - metric outcome
139
+ - policy block or budget block
140
+
141
+ Store as `FeedbackTrajectory`, then derive:
142
+
143
+ - preference memory for the next run
144
+ - dataset scenarios for regression
145
+ - optimizer rows for prompt/signature/code changes
146
+ - holdout examples to detect overfitting
147
+
148
+ ## Feature Map
149
+
150
+ | Area | Key exports | Best for | Notes |
151
+ | --- | --- | --- | --- |
152
+ | Judging | `createCustomJudge`, `createAntiSlopJudge`, wire rubrics | Content, voice, semantic quality | Pair with objective checks when possible. |
153
+ | Verification | `MultiLayerVerifier`, `JudgeRunner`, sandbox harness | Code and multi-step gates | Do not let semantic judges override failed builds. |
154
+ | Control | `runAgentControlLoop`, `objectiveEval`, `subjectiveEval` | Long-running agent tasks | Supports budgets, cost, stop policies, trace spans. |
155
+ | Propose/review | `runProposeReview`, `runProposeReviewAsControlLoop` | Iterative artifact repair | Good for code, docs, plans, briefs. |
156
+ | Feedback data | `FeedbackTrajectory`, stores, converters | Human/environment labels | Domain adapters live in downstream repos. |
157
+ | Action policy | `evaluateActionPolicy` | Approval/budget preflight | Blocks or labels actions before `act()`. |
158
+ | Datasets | `Dataset`, holdout tools, canaries | Train/dev/test/holdout corpora | Keeps optimization honest. |
159
+ | Optimization | `PromptOptimizer`, `OptimizationLoop`, steering optimizers | Prompt/signature comparison | Use held-out gates before promotion. |
160
+ | Evolution | prompt/code mutators, sandbox pool, telemetry | Autoresearch and mutation loops | Use budgets and lineage; do not run unbounded. |
161
+ | Telemetry | `TraceStore`, OTLP, file sinks | Audit and replay | Treat traces as evidence, not just logs. |
162
+ | Reporting | summaries, pareto, cost tracker | Decision support | Useful for PRs, launch gates, research notes. |
163
+
164
+ ## Guardrails
165
+
166
+ - Prefer deterministic checks before LLM judges.
167
+ - Keep holdout data out of optimization.
168
+ - Record model, prompt hash, config hash, commit, and cost for every serious
169
+ run.
170
+ - Budget every loop: steps, wall time, and dollars.
171
+ - Treat external side effects as downstream policy. The runtime can stop loops,
172
+ but your adapter decides what requires approval.
173
+ - Store user feedback with enough context to replay it later.
174
+ - Run harnesses and judges in the same sandbox when they need shared files,
175
+ logs, screenshots, or browser state. Use separate sandboxes for parallel
176
+ variants or destructive checks.
177
+
178
+ ## Same-Sandbox Example
179
+
180
+ `examples/same-sandbox-harness/` shows the common coding/browser pattern:
181
+
182
+ ```text
183
+ one sandbox/workdir -> install/build/test -> inspect evidence -> emit judge span
184
+ ```
185
+
186
+ Use this when a judge needs evidence produced by earlier harness phases. Use
187
+ isolated sandboxes when variants run in parallel or a phase can corrupt the
188
+ workspace.
189
+ - Treat telemetry as evidence, not control flow. A trace sink outage should be
190
+ visible in `runtimeErrors`, but it should not stop the worker from completing
191
+ the user task.
192
+
193
+ ## Highest-ROI Adoption Order
194
+
195
+ 1. Wrap one real product workflow in `runAgentControlLoop`.
196
+ 2. Add objective validators that can fail without an LLM.
197
+ 3. Emit traces and convert completed runs into feedback trajectories.
198
+ 4. Capture explicit user/reviewer feedback on attempts.
199
+ 5. Replay those trajectories as train/dev/test/holdout scenarios.
200
+ 6. Run prompt/signature optimization against that replay adapter.
201
+ 7. Promote only when held-out trajectories and product telemetry both improve.
202
+
203
+ ## What Stays Out Of Core
204
+
205
+ Domain-specific adapters should usually stay in downstream repos until they prove
206
+ reusable:
207
+
208
+ - browser site-specific actions
209
+ - repo-specific coding commands
210
+ - workspace-specific storage paths
211
+
212
+ Core should provide shapes, stores, runners, scoring, traces, and converters.
213
+ Downstream integrations provide domain state, policy, tools, and storage.
@@ -0,0 +1,193 @@
+ # Feedback Trajectories
+
+ Feedback trajectories are the generic shape behind feedback-driven learning
+ loops:
+
+ ```text
+ candidate artifact/action -> user/judge/environment feedback -> revision chain -> labeled example -> replay/eval/optimization
+ ```
+
+ They are deliberately domain-neutral. Browser task completion, code patch
+ review, and research brief revision all fit the same structure.
+
+ ## Core Shape
+
+ ```ts
+ import {
+   createFeedbackTrajectory,
+   summarizePreferenceMemory,
+   feedbackTrajectoriesToDatasetScenarios,
+   feedbackTrajectoriesToOptimizerRows,
+ } from '@tangle-network/agent-eval'
+
+ const trajectory = createFeedbackTrajectory({
+   projectId: 'research-agent',
+   scenarioId: 'brief-review',
+   task: {
+     intent: 'Revise a research brief until it is specific and sourced.',
+     context: { audience: 'technical reviewer' },
+   },
+   attempts: [
+     {
+       id: 'attempt-1',
+       stepIndex: 0,
+       artifactType: 'research',
+       artifact: { summary: 'Initial brief with weak sourcing.' },
+       createdAt: new Date().toISOString(),
+     },
+   ],
+   labels: [
+     {
+       source: 'user',
+       kind: 'revision_request',
+       value: 'needs stronger evidence',
+       reason: 'add primary sources and remove unsupported claims',
+       severity: 'error',
+       createdAt: new Date().toISOString(),
+     },
+   ],
+ })
+
+ const memory = summarizePreferenceMemory([trajectory])
+ const scenarios = feedbackTrajectoriesToDatasetScenarios([trajectory])
+ const optimizerRows = feedbackTrajectoriesToOptimizerRows([trajectory])
+ ```
+
+ ## What Belongs In Core
+
+ `agent-eval` owns the substrate:
+
+ - trajectory and label schemas
+ - in-memory and JSONL-backed stores
+ - deterministic train/dev/test/holdout splitting
+ - JSONL import/export
+ - conversion into `DatasetScenario`
+ - conversion into optimizer rows
+ - preference-memory distillation
+ - conversion from `runAgentControlLoop` results
+
+ Downstream repos own domain adapters:
+
+ - how review actions map to labels
+ - how generated artifacts are represented
+ - which side effects require approval
+ - which budgets and metrics matter
+ - where task-local data is stored
+
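As an illustration of the JSONL import/export listed above, here is a minimal round-trip sketch. `StoredTrajectory`, `toJsonl`, and `fromJsonl` are hypothetical stand-ins, not the package's store API or schema.

```typescript
// Hypothetical sketch: one JSON object per line; blank lines are ignored on import.
type StoredTrajectory = {
  id: string
  scenarioId: string
  labels: { source: string; kind: string }[]
}

function toJsonl(rows: StoredTrajectory[]): string {
  // JSON.stringify escapes newlines inside strings, so each row stays on one line.
  return rows.map((row) => JSON.stringify(row) + '\n').join('')
}

function fromJsonl(text: string): StoredTrajectory[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as StoredTrajectory)
}

// Usage: export two records and read them back.
const storedRows: StoredTrajectory[] = [
  { id: 't1', scenarioId: 'brief-review', labels: [{ source: 'user', kind: 'reject' }] },
  { id: 't2', scenarioId: 'fix-typecheck', labels: [] },
]
const exportedJsonl = toJsonl(storedRows)
const restoredRows = fromJsonl(exportedJsonl)
```

An append-only, line-per-record file is easy to diff and to stream, which suits trajectories that accumulate over many runs.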
+ ## Label Sources
+
+ Labels can come from multiple places:
+
+ | Source | Example |
+ | --- | --- |
+ | `user` | Approved a draft, rejected a draft, selected option B. |
+ | `judge` | LLM/domain judge scores usefulness or voice. |
+ | `environment` | Browser task completed, tests passed, API call succeeded. |
+ | `metric` | A measured outcome improved or regressed. |
+ | `policy` | Budget cap blocked execution, approval required. |
+ | `system` | Control loop passed or failed. |
+
+ The most useful trajectories combine several label sources: user preference
+ plus objective outcome plus later metric result.
+
+ ## Multi-Shot Revision
+
+ Store every attempt, not only the final artifact:
+
+ ```text
+ draft 1 -> user rejects with reason -> draft 2 -> judge passes -> user approves -> metric outcome
+ ```
+
+ This supports tests that ask "does the agent improve after feedback?" instead
+ of only "was the final answer good?"
+
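A test of the "does the agent improve after feedback?" kind might look like the sketch below. It assumes each stored attempt carries a numeric score, which this guide does not specify; `ScoredAttempt` and `improvementAfterFeedback` are hypothetical names.

```typescript
// Hypothetical sketch: compare the first and last attempt of one trajectory.
type ScoredAttempt = { stepIndex: number; score: number }

type Improvement = { delta: number; improved: boolean }

function improvementAfterFeedback(attempts: ScoredAttempt[]): Improvement | null {
  // With fewer than two attempts there is no revision to measure.
  if (attempts.length < 2) return null
  const ordered = attempts.slice().sort((a, b) => a.stepIndex - b.stepIndex)
  const first = ordered[0]
  const last = ordered[ordered.length - 1]
  if (first === undefined || last === undefined) return null
  const delta = last.score - first.score
  return { delta, improved: delta > 0 }
}

// Usage: attempts may be stored out of order; stepIndex decides the sequence.
const revisionCheck = improvementAfterFeedback([
  { stepIndex: 1, score: 0.75 },
  { stepIndex: 0, score: 0.25 },
])
```

This check is only possible because every attempt was stored; a trajectory holding just the final artifact would always return `null` here.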
+ ## Control Runtime Bridge
+
+ `controlRunToFeedbackTrajectory` turns a finished control-loop run into a
+ trajectory:
+
+ ```ts
+ import { controlRunToFeedbackTrajectory, runAgentControlLoop } from '@tangle-network/agent-eval'
+
+ // Control-loop options are elided here; pass your own configuration.
+ const run = await runAgentControlLoop({ ... })
+ const trajectory = controlRunToFeedbackTrajectory(run, {
+   projectId: 'coding-agent',
+   scenarioId: 'fix-typecheck',
+   artifactType: 'code',
+ })
+ ```
+
+ Use this for tasks where the agent works autonomously and the labels come from
+ validators, policies, or environment outcomes. Use direct trajectory recording
+ for review-heavy workflows where a person approves, rejects, edits, ranks, or
+ comments.
+
+ ## Optimization Loop
+
+ The same stored trajectories can feed three layers:
+
+ 1. **Immediate memory**: distill labels into short instructions.
+ 2. **Replay/eval**: convert trajectories into dataset scenarios.
+ 3. **Prompt/signature/code optimization**: convert trajectories into optimizer
+    rows and evaluate candidate variants on train/dev/holdout splits.
+
+ That is the reusable pattern:
+
+ ```text
+ normal agent usage -> labeled trajectory -> eval dataset -> optimizer input -> replay against held-out feedback
+ ```
+
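The first layer, distilling labels into short instructions, might look like the sketch below. It is not the implementation of `summarizePreferenceMemory`; `distillInstructions` and the simplified `FeedbackLabel` type are assumptions made for illustration.

```typescript
// Hypothetical sketch: keep the distinct reasons given by users, in order, up to a cap.
type FeedbackLabel = { source: string; kind: string; reason?: string }

function distillInstructions(labels: FeedbackLabel[], maxItems: number): string[] {
  const instructions: string[] = []
  for (const label of labels) {
    if (instructions.length >= maxItems) break
    // Only user labels that carry a reason become instructions.
    if (label.source !== 'user' || label.reason === undefined) continue
    const reason = label.reason.trim()
    if (reason.length === 0 || instructions.indexOf(reason) !== -1) continue
    instructions.push(reason)
  }
  return instructions
}

// Usage: duplicates, judge labels, and labels without a reason are dropped.
const distilled = distillInstructions(
  [
    { source: 'user', kind: 'reject', reason: 'add primary sources' },
    { source: 'judge', kind: 'score', reason: 'voice is flat' },
    { source: 'user', kind: 'revision_request', reason: ' add primary sources ' },
    { source: 'user', kind: 'reject', reason: 'shorten the intro' },
    { source: 'user', kind: 'approve' },
  ],
  5,
)
```

Capping the list keeps the memory short enough to prepend to a prompt without crowding out the task itself.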
+ ## Replay Adapter
+
+ Use `replayFeedbackTrajectory` when a stored trajectory should be tested against
+ a new prompt, signature, policy, or code path:
+
+ ```ts
+ import { replayFeedbackTrajectory } from '@tangle-network/agent-eval'
+
+ const result = await replayFeedbackTrajectory(trajectory, {
+   async replay(item) {
+     // runCandidateOn is an app-owned helper that re-runs the task with the new variant.
+     const run = await runCandidateOn(item.task, item.attempts)
+     return {
+       pass: run.pass,
+       score: run.score,
+       labels: run.pass ? [] : [{
+         source: 'environment',
+         kind: 'reject',
+         value: false,
+         reason: run.reason,
+         severity: 'error',
+         createdAt: new Date().toISOString(),
+       }],
+       outcome: { success: run.pass, score: run.score, detail: run.reason },
+     }
+   },
+ })
+ ```
+
+ Replay adapters live downstream because only the integration knows how to
+ re-run a browser task, coding patch, or research brief.
+
+ ## Split Discipline
+
+ Treat feedback data like product analytics with labels:
+
+ - `train`: examples the optimizer can directly learn from.
+ - `dev`: examples used to choose among candidate variants.
+ - `test`: examples used for honest reporting after a variant is chosen.
+ - `holdout`: examples kept untouched until promotion or release review.
+
+ Do not let an optimizer see `test` or `holdout` examples through prompt text,
+ preference memory, few-shot examples, or manual tuning. If a trajectory becomes
+ part of memory, mark which split it came from and keep that memory out of
+ held-out evaluation.
+
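One way to keep split membership stable across runs is to derive it from a hash of the trajectory id. The sketch below is an assumption about how that could work, not the package's splitter: `hashId`, `assignSplit`, and the 70/10/10/10 ratios are hypothetical.

```typescript
// Hypothetical sketch: deterministic split assignment from a stable id.
type Split = 'train' | 'dev' | 'test' | 'holdout'

// FNV-1a style 32-bit hash over UTF-16 code units.
function hashId(id: string): number {
  let h = 0x811c9dc5 // FNV-1a 32-bit offset basis
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i)
    // Multiply by the FNV prime 16777619, written as shifts to stay in 32-bit math.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0
  }
  return h
}

function assignSplit(trajectoryId: string): Split {
  const bucket = hashId(trajectoryId) % 100
  if (bucket < 70) return 'train'
  if (bucket < 80) return 'dev'
  if (bucket < 90) return 'test'
  return 'holdout'
}

// Usage: the same id always lands in the same split, regardless of insertion order.
const splitCounts: { [split: string]: number } = { train: 0, dev: 0, test: 0, holdout: 0 }
for (let i = 0; i < 200; i++) {
  const split = assignSplit('trajectory-' + i)
  splitCounts[split] = (splitCounts[split] || 0) + 1
}
```

Because the assignment depends only on the id, adding new trajectories never moves an existing example from `holdout` into `train`.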
+ ## What To Store
+
+ A useful trajectory has enough information to replay the decision later:
+
+ - user intent and relevant context
+ - every attempted artifact or action
+ - objective validation results
+ - user/reviewer/environment labels with reasons
+ - measured outcome when one exists
+ - model, prompt/config hash, code commit, and cost in metadata
+
+ If the record cannot answer "what did the agent try, why was it judged wrong,
+ and what changed next?", it is not yet useful training data.
@@ -0,0 +1,122 @@
+ # Multi-Shot Optimization
+
+ `runMultiShotOptimization` is the public adapter for GEPA-style optimization over
+ variable-length agent conversations.
+
+ Use it when the thing you want to improve is not a single model call. Typical
+ targets are agent system prompts, tool descriptions, routing policies, retrieval
+ plans, or app-specific scaffolding that affects an entire task trajectory.
+
+ The primitive is intentionally small. Your app owns the domain logic:
+
+ - `seedVariants`: prompt/config/tool-policy candidates
+ - `runner`: executes one complete task trajectory for one variant
+ - `scorer`: scores the trajectory and emits actionable side information
+ - `mutateAdapter`: proposes new variants from top and bottom trials
+
+ `agent-eval` owns the release-critical glue:
+
+ - stable paired seeds
+ - search-split prompt evolution
+ - cost/score Pareto objectives
+ - failed-run conversion into failed trials
+ - ASI projection into reflection traces and numeric metrics
+ - optional paired holdout gating through `HeldOutGate`
+ - validated `RunRecord` rows for promotion evidence
+
+ ## Result Contract
+
+ The return shape separates discovery from promotion:
+
+ - `searchBestVariant`: best variant on the optimizer-visible search scenarios
+ - `searchBestAggregate`: aggregate for that search winner
+ - `promotedVariant`: variant callers should ship
+ - `promotedAggregate`: aggregate for the promoted variant
+ - `gate`: holdout decision and evidence, or `null` when no gate ran
+
+ If a holdout gate is configured and rejects the search winner,
+ `promotedVariant` is the baseline. Do not ship `searchBestVariant` directly
+ unless you intentionally run without a holdout gate.
+
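The promotion rule above can be written down as a small selection function. This is a sketch of the stated contract, not library code: `Variant`, `GateDecision`, and `selectPromoted` are hypothetical, and passing the search winner through when no gate ran is an assumption inferred from the last sentence.

```typescript
// Hypothetical sketch of the discovery/promotion split.
type Variant<P> = { id: string; payload: P }
type GateDecision = { promote: boolean; reason: string }

function selectPromoted<P>(
  baseline: Variant<P>,
  searchBest: Variant<P>,
  gate: GateDecision | null,
): Variant<P> {
  // Assumption: without a gate, the search winner passes through unchanged.
  if (gate === null) return searchBest
  // A rejecting gate falls back to the baseline.
  return gate.promote ? searchBest : baseline
}

// Usage: the gate rejects the search winner, so the baseline is what ships.
const baselineVariant: Variant<{ systemPrompt: string }> = {
  id: 'baseline',
  payload: { systemPrompt: 'v0' },
}
const searchWinner: Variant<{ systemPrompt: string }> = {
  id: 'gen-3-child-1',
  payload: { systemPrompt: 'v3' },
}
const shipped = selectPromoted(baselineVariant, searchWinner, {
  promote: false,
  reason: 'holdout regression',
})
```

Keeping both the search winner and the promoted variant in the result lets a report show what the optimizer found even when the gate refused to ship it.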
+ ## Actionable Side Information
+
+ The scorer should return `asi` rows for concrete failure modes:
+
+ ```ts
+ {
+   expectationId: 'used-primary-sources',
+   message: 'The final answer cited secondary summaries instead of primary sources.',
+   severity: 'error',
+   responsibleSurface: 'retrieval-policy',
+   suggestion: 'Prefer primary-source domains during source-gathering turns.',
+ }
+ ```
+
+ These rows become:
+
+ - reflection expectations via `trialTraceFromMultiShotTrial`
+ - aggregate metrics like `asi.error` and `surface.retrieval-policy`
+ - trace evidence available to downstream reports
+
+ This is the main reason to use this primitive instead of reducing each run to a
+ single scalar reward.
+
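To make the metric projection concrete, the sketch below counts ASI rows by severity and by responsible surface. Whether the library reports counts or rates is not stated here, so treat counting as an assumption; `AsiRow` mirrors the fields shown above and `asiMetrics` is a hypothetical helper.

```typescript
// Hypothetical sketch: project ASI rows into named numeric metrics.
type AsiRow = {
  expectationId: string
  message: string
  severity: string
  responsibleSurface: string
  suggestion?: string
}

function asiMetrics(rows: AsiRow[]): { [metric: string]: number } {
  const metrics: { [metric: string]: number } = {}
  for (const row of rows) {
    const severityKey = 'asi.' + row.severity
    const surfaceKey = 'surface.' + row.responsibleSurface
    metrics[severityKey] = (metrics[severityKey] || 0) + 1
    metrics[surfaceKey] = (metrics[surfaceKey] || 0) + 1
  }
  return metrics
}

// Usage: two retrieval failures and one prompt warning.
const asiSummary = asiMetrics([
  {
    expectationId: 'used-primary-sources',
    message: 'Cited secondary summaries.',
    severity: 'error',
    responsibleSurface: 'retrieval-policy',
  },
  {
    expectationId: 'source-count',
    message: 'Fewer sources than requested.',
    severity: 'error',
    responsibleSurface: 'retrieval-policy',
  },
  {
    expectationId: 'tone',
    message: 'Answer was overly hedged.',
    severity: 'warning',
    responsibleSurface: 'system-prompt',
  },
])
```

Per-surface counts tell a mutator which part of the variant to change, which a single scalar score cannot.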
+ ## Holdout Discipline
+
+ For release gates, configure `gate`. The first seed variant is the baseline and
+ `gate.gate.baselineKey` must match its id.
+
+ Holdout scenarios must be disjoint from `searchScenarioIds`. The adapter runs
+ baseline and candidate with the same `(scenarioId, rep)` seed, validates every
+ row with `validateRunRecord`, then asks `HeldOutGate` whether to promote.
+
+ When `gate.searchScenarioIds` is omitted, the adapter reuses
+ `searchScenarioIds` for the overfit-gap check.
+
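Two of the rules above, paired seeds and disjoint holdout ids, are easy to check in isolation. The sketch below shows one possible way to do both; `hashKey`, `pairedSeed`, and `overlappingIds` are hypothetical and say nothing about how the adapter derives its seeds internally.

```typescript
// Hypothetical sketch: seeds depend only on (scenarioId, rep), never on the variant,
// so baseline and candidate runs of the same pair share a seed.
function hashKey(key: string): number {
  let h = 0x811c9dc5 // FNV-1a 32-bit offset basis
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i)
    // Multiply by the FNV prime 16777619, written as shifts to stay in 32-bit math.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0
  }
  return h
}

function pairedSeed(scenarioId: string, rep: number): number {
  return hashKey(scenarioId + '#' + rep)
}

// Returns the ids present in both lists; an empty result means the splits are disjoint.
function overlappingIds(searchIds: string[], holdoutIds: string[]): string[] {
  return holdoutIds.filter((id) => searchIds.indexOf(id) !== -1)
}

// Usage: both variants get the same seed for ('brief-review', 0),
// and the overlap check reports the one leaked scenario.
const baselineSeed = pairedSeed('brief-review', 0)
const candidateSeed = pairedSeed('brief-review', 0)
const leakedIds = overlappingIds(['s1', 's2'], ['s2', 's3'])
```

Running an overlap check like this before optimization is cheaper than discovering contamination after a promotion decision.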
+ ## Minimal Shape
+
+ ```ts
+ import {
+   runMultiShotOptimization,
+   trialTraceFromMultiShotTrial,
+   type MultiShotVariant,
+ } from '@tangle-network/agent-eval'
+
+ // currentPrompt, searchScenarios, runYourAgentToCompletion, scoreFullTrajectory,
+ // proposePromptMutations, and deploy are app-owned placeholders.
+ type Payload = { systemPrompt: string }
+
+ const baseline: MultiShotVariant<Payload> = {
+   id: 'baseline',
+   label: 'baseline',
+   generation: 0,
+   payload: { systemPrompt: currentPrompt },
+ }
+
+ const result = await runMultiShotOptimization<Payload>({
+   runId: `research-agent-${Date.now()}`,
+   target: 'research-agent-system-prompt',
+   seedVariants: [baseline],
+   searchScenarioIds: searchScenarios.map((s) => s.id),
+   reps: 2,
+   generations: 4,
+   populationSize: 4,
+   scoreConcurrency: 4,
+   runner: {
+     async run({ variant, scenarioId, seed }) {
+       return runYourAgentToCompletion({ scenarioId, seed, prompt: variant.payload.systemPrompt })
+     },
+   },
+   scorer: {
+     async score({ run }) {
+       return scoreFullTrajectory(run.trace)
+     },
+   },
+   mutateAdapter: {
+     async mutate({ parent, bottomTrials, childCount, generation }) {
+       const traces = bottomTrials.map((t) => trialTraceFromMultiShotTrial(t))
+       return proposePromptMutations({ parent, traces, childCount, generation })
+     },
+   },
+ })
+
+ deploy(result.promotedVariant.payload)
+ ```