@tangle-network/agent-eval 0.20.10 → 0.20.11

package/README.md CHANGED
@@ -1,10 +1,10 @@
1
1
  # @tangle-network/agent-eval
2
2
 
3
- Trace-first evaluation infrastructure for agent systems.
3
+ Evaluation infrastructure for agent systems.
4
4
 
5
- `agent-eval` provides the contracts and runtime primitives for measuring agent
6
- behavior: traces, harnesses, verifier pipelines, judges, datasets, holdout
7
- gates, failure classification, optimization loops, and release reports.
5
+ `agent-eval` gives agent products a reusable way to record what happened,
6
+ verify outcomes, classify failures, compare variants, optimize prompts or
7
+ policies, and make release decisions from evidence instead of anecdotes.
8
8
 
9
9
  It does not own your product state, credentials, UI, or model routing. Product
10
10
  teams keep those boundaries; this package standardizes how runs are recorded,
@@ -15,7 +15,9 @@ checked, compared, and promoted.
15
15
  - [When To Use It](#when-to-use-it)
16
16
  - [Architecture](#architecture)
17
17
  - [Install](#install)
18
+ - [Quick Start](#quick-start)
18
19
  - [Core Primitives](#core-primitives)
20
+ - [Adoption Path](#adoption-path)
19
21
  - [Examples](#examples)
20
22
  - [Documentation](#documentation)
21
23
  - [Development](#development)
@@ -80,6 +82,59 @@ cd clients/python
80
82
  pip install -e .
81
83
  ```
82
84
 
85
+ ## Quick Start
86
+
87
+ Wrap the real product loop first. Do not build a toy eval path that users never
88
+ exercise.
89
+
90
+ ```ts
91
+ import {
92
+   objectiveEval,
93
+   runAgentControlLoop,
94
+ } from '@tangle-network/agent-eval'
95
+
96
+ const result = await runAgentControlLoop({
97
+   intent: task.prompt,
98
+   budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
99
+
100
+   async observe() {
101
+     return productAdapter.readState(task.id)
102
+   },
103
+
104
+   async validate({ state }) {
105
+     return [
106
+       objectiveEval({
107
+         id: 'build-passes',
108
+         passed: state.build.exitCode === 0,
109
+         severity: 'critical',
110
+         metadata: state.build,
111
+       }),
112
+       objectiveEval({
113
+         id: 'preview-serves',
114
+         passed: state.preview.httpStatus === 200,
115
+         severity: 'critical',
116
+       }),
117
+     ]
118
+   },
119
+
120
+   async decide({ evals }) {
121
+     return evals.every((evalResult) => evalResult.passed)
122
+       ? { type: 'stop', reason: 'all critical checks passed' }
123
+       : { type: 'continue', action: { type: 'repair' }, reason: 'checks failed' }
124
+   },
125
+
126
+   async act(action) {
127
+     return productAdapter.runAgentStep(task.id, action)
128
+   },
129
+ })
130
+
131
+ await productAdapter.storeControlResult(task.id, result)
132
+ ```
133
+
134
+ Once this loop represents production behavior, convert completed runs into
135
+ feedback trajectories, split them into train/dev/test/holdout sets, and run
136
+ multi-shot optimization against the same adapter.
137
+
83
138
  ## Core Primitives
84
139
 
85
140
  | Primitive | Purpose |
@@ -101,6 +156,24 @@ pip install -e .
101
156
  `NoopResearcher` is a fail-loud sentinel for wiring tests. Production systems
102
157
  should implement `Researcher` directly or use `CallbackResearcher`.
103
158
 
159
+ ## Adoption Path
160
+
161
+ 1. Choose one real workflow: code generation, browser task, research task,
162
+ workflow builder, voice interaction, or domain agent task.
163
+ 2. Write a product adapter that can observe state and execute one agent step.
164
+ 3. Add deterministic validators first: build, test, serve, schema, policy,
165
+ permission, retrieval, and deployment checks.
166
+ 4. Add LLM judges only for subjective quality that deterministic checks cannot
167
+ measure.
168
+ 5. Emit traces and convert successful and failed attempts into
169
+ `FeedbackTrajectory` records.
170
+ 6. Build train/dev/test/holdout scenarios from those trajectories.
171
+ 7. Run `runMultiShotOptimization()` or prompt/code evolution on train/dev.
172
+ 8. Promote only when test/holdout gates and real product telemetry improve.
173
+
174
+ For a complete product integration guide, see
175
+ [Product Eval Adoption](./docs/product-eval-adoption.md).
176
+
104
177
  ## Examples
105
178
 
106
179
  Runnable examples live in the repository's
@@ -121,6 +194,7 @@ tested, and copied without turning this page into a tutorial.
121
194
 
122
195
  - [Concepts](./docs/concepts.md)
123
196
  - [Feature Guide](./docs/feature-guide.md)
197
+ - [Product Eval Adoption](./docs/product-eval-adoption.md)
124
198
  - [Control Runtime](./docs/control-runtime.md)
125
199
  - [Knowledge Readiness](./docs/knowledge-readiness.md)
126
200
  - [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
package/dist/openapi.json CHANGED
@@ -2,7 +2,7 @@
2
2
  "openapi": "3.1.0",
3
3
  "info": {
4
4
  "title": "@tangle-network/agent-eval — wire protocol",
5
- "version": "0.20.10",
5
+ "version": "0.20.11",
6
6
  "description": "HTTP and stdio RPC interface to agent-eval. The TypeScript runtime is the source of truth; this spec is the contract that cross-language clients (Python, Rust, Go) generate from.\n\nWire-protocol version: 1.0.0. Bumps on breaking changes to request/response schemas.",
7
7
  "contact": {
8
8
  "name": "Tangle Network",
package/docs/product-eval-adoption.md ADDED
@@ -0,0 +1,194 @@
1
+ # Product Eval Adoption
2
+
3
+ This guide is for teams adding `@tangle-network/agent-eval` to a real agent
4
+ product. The package supplies evaluation contracts and runtime primitives. Your
5
+ product supplies the actual workflow adapter, state, credentials, tools, UI, and
6
+ storage.
7
+
8
+ ## Goal
9
+
10
+ Use the same loop for production, replay, and optimization:
11
+
12
+ ```txt
13
+ real user task
14
+ -> product adapter observes state
15
+ -> validators and judges grade state
16
+ -> control loop decides next action
17
+ -> product agent acts in the real environment
18
+ -> trace + feedback trajectory are stored
19
+ -> datasets and optimizers replay the same adapter
20
+ ```
21
+
22
+ If production and eval use different loops, benchmark gains will not transfer.
23
+
24
+ ## What The Product Owns
25
+
26
+ The product owns:
27
+
28
+ - task state and domain models
29
+ - credentials, tenant policy, approval, and side-effect rules
30
+ - browser, sandbox, CLI, connector, or voice drivers
31
+ - database and trace persistence
32
+ - user/reviewer feedback collection
33
+ - deployment and live canary routing
34
+ - model gateway configuration
35
+
36
+ `agent-eval` owns:
37
+
38
+ - trace, run, dataset, feedback, and score contracts
39
+ - control-loop mechanics
40
+ - verifier and judge orchestration
41
+ - failure taxonomy
42
+ - paired statistics and holdout gates
43
+ - optimizer inputs and promotion reports
44
+
45
+ ## Minimal Production Adapter
46
+
47
+ Start with a small adapter that mirrors one real workflow.
48
+
49
+ ```ts
50
+ interface ProductEvalAdapter<TState, TAction> {
51
+   observe(taskId: string): Promise<TState>
52
+   validate(state: TState): Promise<ControlEvalResult[]>
53
+   decide(input: {
54
+     state: TState
55
+     evals: ControlEvalResult[]
56
+     history: unknown[]
57
+   }): Promise<TAction | 'stop'>
58
+   act(taskId: string, action: TAction): Promise<void>
59
+ }
60
+ ```
61
+
62
+ Keep the adapter product-owned until at least two products need the same shape.
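
Illustrative sketch, not part of the published file: one way the adapter shape above might be exercised is with an in-memory fake and a small driver loop. Everything here is an assumption made for illustration. `EvalResult` is a local stand-in for `ControlEvalResult` (whose real fields are not shown in this diff), and `createFakeAdapter` and `drive` are hypothetical names, not package exports.

```ts
// Hypothetical local stand-in for the package's ControlEvalResult.
type EvalResult = { id: string; passed: boolean; severity: 'critical' | 'warning' }

// Same shape as the guide's interface, with the stand-in result type.
interface ProductEvalAdapter<TState, TAction> {
  observe(taskId: string): Promise<TState>
  validate(state: TState): Promise<EvalResult[]>
  decide(input: {
    state: TState
    evals: EvalResult[]
    history: unknown[]
  }): Promise<TAction | 'stop'>
  act(taskId: string, action: TAction): Promise<void>
}

type BuildState = { exitCode: number; attempts: number }
type RepairAction = { type: 'repair' }

// In-memory fake: the build keeps failing until two repair steps have run.
function createFakeAdapter(): ProductEvalAdapter<BuildState, RepairAction> {
  const state: BuildState = { exitCode: 1, attempts: 0 }
  return {
    async observe(): Promise<BuildState> {
      return { ...state }
    },
    async validate(s: BuildState): Promise<EvalResult[]> {
      return [{ id: 'build-passes', passed: s.exitCode === 0, severity: 'critical' }]
    },
    async decide({ evals }): Promise<RepairAction | 'stop'> {
      return evals.every((e) => e.passed) ? 'stop' : { type: 'repair' }
    },
    async act(): Promise<void> {
      state.attempts += 1
      if (state.attempts >= 2) state.exitCode = 0
    },
  }
}

// Minimal observe -> validate -> decide -> act loop with a step budget.
async function drive<S, A>(
  adapter: ProductEvalAdapter<S, A>,
  taskId: string,
  maxSteps: number,
): Promise<{ steps: number; stopped: boolean }> {
  const history: unknown[] = []
  for (let step = 0; step < maxSteps; step++) {
    const state = await adapter.observe(taskId)
    const evals = await adapter.validate(state)
    const decision = await adapter.decide({ state, evals, history })
    if (decision === 'stop') return { steps: step, stopped: true }
    await adapter.act(taskId, decision)
    history.push({ state, evals, decision })
  }
  return { steps: maxSteps, stopped: false }
}
```

In a real integration the package's control loop would presumably replace `drive`; the fake is only useful for checking that the adapter wiring terminates within its budget.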
63
+
64
+ ## Validator Order
65
+
66
+ Use deterministic checks before judges.
67
+
68
+ 1. **State validity**: schema, required files, required DB rows, required
69
+ connections.
70
+ 2. **Runtime gates**: install, build, typecheck, tests, serve, deploy smoke.
71
+ 3. **Policy gates**: approvals, side effects, budget, credentials, data
72
+ freshness.
73
+ 4. **Behavior gates**: browser flows, API calls, generated app preview, voice
74
+ transcript checks.
75
+ 5. **Semantic judges**: intent fit, quality, completeness, safety,
76
+ professional correctness.
77
+
78
+ Semantic judges should never turn a failed build into a pass.
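
Illustrative sketch, not part of the published file: the ordering rule above could be enforced by sorting checks into tiers and stopping at the first failure, so a semantic judge never runs (and so cannot override) after a deterministic gate fails. The `Tier`, `Check`, and `runInOrder` names are hypothetical, and short-circuiting is one possible design choice rather than the package's documented behavior.

```ts
type Tier = 'state' | 'runtime' | 'policy' | 'behavior' | 'semantic'

type Check = {
  id: string
  tier: Tier
  run: () => Promise<boolean>
}

// Tier order mirrors the numbered list in the guide.
const TIER_ORDER: Tier[] = ['state', 'runtime', 'policy', 'behavior', 'semantic']

async function runInOrder(
  checks: Check[],
): Promise<{ passed: boolean; failedAt?: string; ran: string[] }> {
  const ran: string[] = []
  // Copy before sorting so the caller's array is left untouched.
  const ordered = [...checks].sort(
    (a, b) => TIER_ORDER.indexOf(a.tier) - TIER_ORDER.indexOf(b.tier),
  )
  for (const check of ordered) {
    ran.push(check.id)
    // The first failure ends the run; later tiers are never consulted.
    if (!(await check.run())) return { passed: false, failedAt: check.id, ran }
  }
  return { passed: true, ran }
}
```

A product that wants judge scores for failed runs anyway (for failure clustering, say) could keep running later tiers and simply ignore them when computing the verdict.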
79
+
80
+ ## Traces And Feedback
81
+
82
+ Every serious run should record:
83
+
84
+ - task id and scenario id
85
+ - git commit
86
+ - model and provider
87
+ - prompt/config hashes
88
+ - tool calls and retrieval spans
89
+ - build/test/deploy output
90
+ - cost, latency, and token use
91
+ - user/reviewer feedback
92
+ - final outcome and failure class
93
+
94
+ Convert runs into `FeedbackTrajectory` records so normal product usage becomes
95
+ replayable eval data.
96
+
97
+ ```txt
98
+ production run -> feedback trajectory -> dataset scenario -> optimizer row
99
+ ```
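
Illustrative sketch, not part of the published file: before a run is converted into a trajectory, a product might check that the evidence listed above was actually captured. `RunRecord` and `missingFields` are hypothetical; the real `FeedbackTrajectory` shape is not shown in this diff and may differ.

```ts
// Hypothetical record covering a subset of the fields the guide lists.
type RunRecord = {
  taskId: string
  scenarioId: string
  gitCommit: string
  model: string
  provider: string
  promptHash: string
  costUsd: number
  latencyMs: number
  outcome: 'pass' | 'fail'
  failureClass?: string
}

// Returns the names of required fields that are absent or empty.
function missingFields(record: Partial<RunRecord>): string[] {
  const required: (keyof RunRecord)[] = [
    'taskId',
    'scenarioId',
    'gitCommit',
    'model',
    'provider',
    'promptHash',
    'outcome',
  ]
  const missing: string[] = required.filter(
    (key) => record[key] === undefined || record[key] === '',
  )
  // A failed run without a failure class cannot be clustered later.
  if (record.outcome === 'fail' && !record.failureClass) missing.push('failureClass')
  return missing
}
```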
100
+
101
+ ## Datasets And Holdouts
102
+
103
+ Use four splits:
104
+
105
+ - `train`: optimizer search.
106
+ - `dev`: tuning and threshold selection.
107
+ - `test`: normal reporting.
108
+ - `holdout`: promotion-only gate.
109
+
110
+ Do not inspect or tune against holdout failures during optimization. If a
111
+ holdout failure reveals a real product bug, fix the bug and rotate the holdout
112
+ set with a signed note.
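
Illustrative sketch, not part of the published file: one common way to keep the four splits stable across runs is to derive the split from a hash of the scenario id, so a scenario cannot drift from holdout into train when the dataset is rebuilt. The 60/15/15/10 ratio, the FNV-1a hash, and the `assignSplit` name are assumptions for illustration, not package behavior.

```ts
type Split = 'train' | 'dev' | 'test' | 'holdout'

// 32-bit FNV-1a over UTF-16 code units; deterministic and dependency-free.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash >>> 0
}

// Maps a scenario id to a split using fixed percentage buckets (assumed ratio).
function assignSplit(scenarioId: string): Split {
  const bucket = fnv1a(scenarioId) % 100
  if (bucket < 60) return 'train'
  if (bucket < 75) return 'dev'
  if (bucket < 90) return 'test'
  return 'holdout'
}
```

Because the assignment depends only on the id, the same scenario always lands in the same split, which makes accidental holdout leakage easier to rule out.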
113
+
114
+ ## Optimization
115
+
116
+ Use `runMultiShotOptimization()` when the system is a multi-step agent, not a
117
+ single prompt.
118
+
119
+ Good optimization targets:
120
+
121
+ - system prompt
122
+ - tool descriptions
123
+ - retrieval policy
124
+ - data acquisition policy
125
+ - user-question policy
126
+ - evaluator threshold
127
+ - agent topology
128
+ - scaffold/template choice
129
+
130
+ Bad optimization targets:
131
+
132
+ - hidden holdout examples
133
+ - production credentials
134
+ - brittle string checks that do not match user value
135
+ - fake workflows that do not call the product adapter
136
+
137
+ Use actionable side information so the optimizer knows whether a failure belongs
138
+ to prompt, tools, retrieval, data acquisition, sandbox, evaluator, or product
139
+ runtime.
140
+
141
+ ## Release Gate
142
+
143
+ A launch or promotion should require:
144
+
145
+ - enough runs for the target risk level
146
+ - paired improvement over the current baseline
147
+ - no critical regression on test
148
+ - holdout pass or explicit rejection
149
+ - cost and latency within budget
150
+ - no unresolved canary or contamination failures
151
+ - trace evidence for representative successes and failures
152
+ - human-readable report with failure clusters and next actions
153
+
154
+ `evaluateReleaseConfidence()` and the paired statistics helpers provide the
155
+ decision data. The product decides the business threshold.
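
Illustrative sketch, not part of the published file: the product-side threshold decision could be a plain function that turns decision data into a list of blockers. The `ReleaseEvidence` and `ReleaseThresholds` shapes and the `releaseBlockers` name are hypothetical; they are not the return type of `evaluateReleaseConfidence()`, whose signature is not shown in this diff.

```ts
// Hypothetical summary of the decision data a product might collect.
type ReleaseEvidence = {
  runs: number
  pairedDeltaLowerBound: number // lower confidence bound of candidate minus baseline
  criticalRegressionsOnTest: number
  holdoutPassed: boolean
  meanCostUsd: number
  p95LatencyMs: number
  unresolvedCanaryFailures: number
}

// Business thresholds stay with the product.
type ReleaseThresholds = {
  minRuns: number
  maxMeanCostUsd: number
  maxP95LatencyMs: number
}

// Returns every reason the candidate should not be promoted; empty means go.
function releaseBlockers(e: ReleaseEvidence, t: ReleaseThresholds): string[] {
  const blockers: string[] = []
  if (e.runs < t.minRuns) blockers.push('not enough runs')
  if (e.pairedDeltaLowerBound <= 0) blockers.push('no paired improvement over baseline')
  if (e.criticalRegressionsOnTest > 0) blockers.push('critical regression on test')
  if (!e.holdoutPassed) blockers.push('holdout gate failed')
  if (e.meanCostUsd > t.maxMeanCostUsd) blockers.push('cost over budget')
  if (e.p95LatencyMs > t.maxP95LatencyMs) blockers.push('latency over budget')
  if (e.unresolvedCanaryFailures > 0) blockers.push('unresolved canary failures')
  return blockers
}
```

Returning all blockers instead of a single boolean keeps the report readable: reviewers see every failed condition at once.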
156
+
157
+ ## Product Patterns
158
+
159
+ ### Coding Or Builder Agent
160
+
161
+ Use sandbox/build/test/serve/browser validators. Add intent and semantic
162
+ concept judges only after the generated app runs.
163
+
164
+ ### Browser Agent
165
+
166
+ Record browser steps, screenshots, network errors, console errors, and final
167
+ state. Use deterministic DOM/API assertions before visual or semantic judges.
168
+
169
+ ### Domain Agent
170
+
171
+ Use domain fixtures, jurisdiction/date metadata, retrieval spans, and
172
+ professional judges. Fail missing/stale evidence separately from bad reasoning.
173
+
174
+ ### Workflow Or Integration Agent
175
+
176
+ Use `@tangle-network/agent-integrations` manifests as readiness inputs. Gate
177
+ missing connections, missing scopes, approval-required writes, and stale tokens
178
+ before blaming the agent prompt.
179
+
180
+ ### Voice Agent
181
+
182
+ Record transcript, timing, interruptions, tool calls, and task outcome. Judge
183
+ conversation quality separately from tool success and policy compliance.
184
+
185
+ ## Anti-Patterns
186
+
187
+ - Evaluating only final prose for an agent that actually builds, browses, or
188
+ calls tools.
189
+ - Letting an LLM judge override failed tests.
190
+ - Optimizing on examples that users will never hit.
191
+ - Recording traces as logs but never converting them to datasets.
192
+ - Calling every failure a prompt failure when context, data, auth, or runtime
193
+ readiness was missing.
194
+ - Shipping reports without run ids, commits, model ids, or evidence links.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@tangle-network/agent-eval",
3
- "version": "0.20.10",
3
+ "version": "0.20.11",
4
4
  "description": "Trace-first evaluation infrastructure for agent systems: traces, harnesses, verifier pipelines, judges, datasets, gates, optimization, and reporting.",
5
5
  "homepage": "https://github.com/tangle-network/agent-eval#readme",
6
6
  "repository": {