@tangle-network/agent-eval 0.17.3 → 0.19.0
- package/README.md +8 -1
- package/dist/index.d.ts +303 -279
- package/dist/index.js +332 -210
- package/dist/index.js.map +1 -1
- package/docs/concepts.md +155 -0
- package/docs/control-runtime.md +351 -0
- package/docs/feature-guide.md +213 -0
- package/docs/feedback-trajectories.md +193 -0
- package/docs/multi-shot-optimization.md +122 -0
- package/docs/wire-protocol.md +199 -0
- package/package.json +21 -14
@@ -0,0 +1,199 @@
# Wire protocol

agent-eval exposes its evaluation logic over a versioned wire protocol so non-TypeScript clients (Python, Rust, Go, …) can drive it without a parallel implementation. The TypeScript runtime is the single source of truth; clients in other languages are *transport adapters*, not ports.

## Mental model

```
your code (any language)
        │
        ▼
thin transport client ──HTTP──▶ agent-eval serve ──┐
        │                                          │
        └─────subprocess────────▶ agent-eval rpc ──┤
                                                   ▼
                                    same TS handlers, same rubrics,
                                    same scoring code
```

Both transports talk to identical handlers. If you need a sustained connection (live agent paths, high-frequency calls), use HTTP. If you need a one-shot (cron, CI, batch), use stdio RPC. The wire shape is the same.

## Two transports, one contract

| | HTTP | stdio RPC |
|---|---|---|
| Start | `agent-eval serve --port 5005` | per-call: `agent-eval rpc <method>` |
| Latency | ~10 ms | ~500 ms (Node startup) |
| Best for | live calls, agent paths, dashboards | cron, CI, batch evaluation |
| Requires | running server | binary on PATH |

## Methods

The current surface is the smallest useful slice. Adding a method is mechanical — see [§Adding a method](#adding-a-method).

### `judge` — score content against a rubric

```http
POST /v1/judge
{
  "rubricName": "anti-slop",
  "content": "We just shipped zero-copy IO between sandboxes",
  "context": { "platform": "x", "author": "drew", "impressions": 1240 }
}
```

```json
{
  "composite": 0.78,
  "dimensions": { "buyer_quality": 0.85, "voice": 0.7, "signal": 0.8 },
  "failureModes": [],
  "wins": ["specific-component", "earned-detail"],
  "rationale": "Specific architectural detail, no AI cadence, technical voice.",
  "rubricVersion": "anti-slop@a4f2b8c1",
  "model": "claude-sonnet-4-6",
  "durationMs": 1840
}
```

Pass either `rubricName` (built-in) or `rubric` (inline definition). Not both. The handler:
1. Resolves the rubric.
2. Calls the judging LLM with a JSON-schema-constrained response.
3. Computes `composite = Σ(weight_i × normalized_score_i) / Σ(weight_i)`.
4. Returns a typed `JudgeResult`.
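
Step 3 is a weighted mean, which can be sketched in a few lines of Python. This is only an illustration of the formula, not the handler's code: `composite_score` is a hypothetical helper, the weights are borrowed from the `anti-slop` entry in the `listRubrics` sample, and the scores are assumed to be already normalized to the 0..1 range.

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean: sum(w_i * s_i) / sum(w_i) over the rubric's dimensions."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[d] * scores[d] for d in weights) / total_weight


# Illustrative inputs only (assumed to be normalized scores).
weights = {"buyer_quality": 0.5, "voice": 0.3, "signal": 0.2}
scores = {"buyer_quality": 0.85, "voice": 0.7, "signal": 0.8}
composite = composite_score(scores, weights)
print(round(composite, 3))  # 0.795
```

The result, 0.795, does not reproduce the `0.78` in the sample `judge` response, so the sample payloads are presumably illustrative rather than computed from these weights.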

`rubricVersion` carries a stable hash of the rubric used; in the sample payloads it takes the form `<name>@<hash>`. Scores are only comparable across runs when this value matches.
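
One way to respect that rule on the client side is to group results by `rubricVersion` before aggregating. The sketch below assumes each result is a dict shaped like the sample `judge` response; the helper name and the second hash are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_composite_by_rubric_version(results: list[dict]) -> dict[str, float]:
    """Average `composite` separately for each distinct `rubricVersion`."""
    groups: dict[str, list[float]] = defaultdict(list)
    for result in results:
        groups[result["rubricVersion"]].append(result["composite"])
    return {version: mean(values) for version, values in groups.items()}


# Hypothetical results; "anti-slop@0000beef" is an invented hash.
runs = [
    {"rubricVersion": "anti-slop@a4f2b8c1", "composite": 0.78},
    {"rubricVersion": "anti-slop@a4f2b8c1", "composite": 0.62},
    {"rubricVersion": "anti-slop@0000beef", "composite": 0.90},
]
by_version = mean_composite_by_rubric_version(runs)
```

Scores from the two versions end up in separate buckets instead of being averaged together.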

### `listRubrics` — discover what's registered

```http
GET /v1/rubrics
```

```json
{
  "rubrics": [
    {
      "name": "anti-slop",
      "description": "Voice and signal quality for technical-buyer content.",
      "dimensions": [
        { "id": "buyer_quality", "description": "Would the target buyer care?", "weight": 0.5 },
        { "id": "voice", "description": "Builder voice, not AI/marketing?", "weight": 0.3 },
        { "id": "signal", "description": "Non-obvious detail or constraint?", "weight": 0.2 }
      ],
      "failureModes": ["ai-cadence", "marketing-tone", "vague-claim", "no-hook", "engagement-bait", "off-icp", "stale-claim"],
      "rubricVersion": "anti-slop@a4f2b8c1"
    }
  ]
}
```

### `version` — server + wire-protocol versions

```http
GET /v1/version
```

```json
{
  "package": "@tangle-network/agent-eval",
  "version": "0.19.0",
  "wireVersion": "1.0.0",
  "apiSurface": ["judge", "listRubrics", "version"]
}
```

`version` matches the npm/PyPI package version. `wireVersion` bumps independently — only on breaking request/response schema changes. Package versions can differ across releases as long as `wireVersion` matches.
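
A client can use the `version` response as a startup compatibility check. The sketch below assumes that a breaking change bumps the major component of `wireVersion`, as semver would suggest; the text above does not say which component moves, and both helper names are hypothetical.

```python
def wire_major(wire_version: str) -> int:
    """Extract the major component from a semver-style string such as "1.0.0"."""
    return int(wire_version.split(".", 1)[0])


def is_wire_compatible(client_wire: str, server_info: dict) -> bool:
    """Compare only the major component (assumed to be what breaking changes bump)."""
    return wire_major(client_wire) == wire_major(server_info["wireVersion"])


# The sample `version` response from above.
server_info = {
    "package": "@tangle-network/agent-eval",
    "version": "0.19.0",
    "wireVersion": "1.0.0",
    "apiSurface": ["judge", "listRubrics", "version"],
}
print(is_wire_compatible("1.0.0", server_info))  # True
```

The package `version` is deliberately ignored here, since only `wireVersion` describes the request/response contract.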

### `GET /healthz` — liveness

For probing whether a server is up. Returns `{ "status": "ok", "uptimeSec": <number> }`.

### `GET /openapi.json` — full spec

Auto-generated from the Zod schemas. This is what code generators consume to produce typed clients in other languages.

## Errors

Every error response uses the same shape:

```json
{
  "error": {
    "code": "rubric_not_found",
    "message": "No built-in rubric named \"missing-name\".",
    "details": null
  }
}
```

| HTTP status | `code` | Meaning |
|---|---|---|
| 400 | `validation_error` | Request didn't match the schema. |
| 404 | `rubric_not_found` | Unknown `rubricName`. |
| 500 | `judge_error` | LLM returned malformed output. |
| 500 | `internal_error` | Unexpected server error. |

stdio RPC uses the same shape inside an envelope: `{"error": {...}}` instead of `{"result": {...}}`. Exit code is non-zero on error.
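
On the client side, the stdio envelope can be unwrapped into either a return value or an exception. This is a minimal sketch, assuming the envelope arrives as a single JSON document; `WireProtocolError` and `unwrap_envelope` are hypothetical names, not part of any published client.

```python
import json


class WireProtocolError(Exception):
    """Client-side mirror of the error shape (hypothetical class)."""

    def __init__(self, code: str, message: str, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details


def unwrap_envelope(raw: str) -> dict:
    """Return the `result` payload, or raise if the envelope carries `error`."""
    envelope = json.loads(raw)
    if "error" in envelope:
        err = envelope["error"]
        raise WireProtocolError(err["code"], err["message"], err.get("details"))
    return envelope["result"]


ok = unwrap_envelope('{"result": {"wireVersion": "1.0.0"}}')
print(ok["wireVersion"])  # 1.0.0
```

Callers can then branch on `exc.code` (for example `rubric_not_found`) rather than parsing messages.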

## Running the server

```sh
agent-eval serve --port 5005 --host 127.0.0.1
```

Defaults to `127.0.0.1:5005`. Bind to `0.0.0.0` only if you trust the network.

```sh
# health
curl http://localhost:5005/healthz

# discover
curl http://localhost:5005/v1/rubrics | jq

# judge
curl -X POST http://localhost:5005/v1/judge \
  -H 'content-type: application/json' \
  -d '{"rubricName":"anti-slop","content":"We just shipped …"}'
```
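
The same `judge` call can be made from Python with only the standard library. The sketch below is not the official client; `build_judge_request` and `judge` are hypothetical helpers, and treating "exactly one of `rubricName` or `rubric`" as required is an assumption drawn from the "not both" rule in the `judge` section.

```python
import json
import urllib.request


def build_judge_request(base_url, content, rubric_name=None, rubric=None, context=None):
    """Build (but do not send) a POST /v1/judge request."""
    if (rubric_name is None) == (rubric is None):
        raise ValueError("pass exactly one of rubric_name or rubric")
    payload = {"content": content}
    if rubric_name is not None:
        payload["rubricName"] = rubric_name
    else:
        payload["rubric"] = rubric
    if context is not None:
        payload["context"] = context
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/judge",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def judge(base_url, content, **kwargs):
    """Send the request; urllib raises HTTPError on 4xx/5xx responses."""
    request = build_judge_request(base_url, content, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


req = build_judge_request(
    "http://localhost:5005",
    "We just shipped zero-copy IO between sandboxes",
    rubric_name="anti-slop",
)
print(req.get_method(), req.full_url)  # POST http://localhost:5005/v1/judge
```

Splitting request construction from sending keeps the payload logic checkable without a running server.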

## Using stdio RPC

```sh
# version
echo '{}' | agent-eval rpc version

# listRubrics
echo '{}' | agent-eval rpc listRubrics

# judge (one-shot)
echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judge

# JSONL batch — one request per line
cat requests.jsonl | agent-eval rpc-batch judge > results.jsonl
```
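
For the batch form, the input file is one JSON request per line. A minimal sketch for producing and reading such files follows; it assumes `results.jsonl` also holds one JSON document per line, which the text above does not spell out, and both helper names are hypothetical.

```python
import json


def to_jsonl(requests: list[dict]) -> str:
    """Serialize one request per line; json.dumps never emits a raw newline."""
    return "".join(json.dumps(request) + "\n" for request in requests)


def from_jsonl(text: str) -> list[dict]:
    """Parse one JSON document per non-empty line."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical batch of judge requests.
batch = [
    {"rubricName": "anti-slop", "content": "First post\nwith a line break"},
    {"rubricName": "anti-slop", "content": "Second post"},
]
jsonl_text = to_jsonl(batch)
print(jsonl_text.count("\n"))  # 2
```

Embedded line breaks in `content` are escaped by the JSON encoder, so each request still occupies exactly one line.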

Each invocation is one process — Node startup adds ~500 ms. For more than a few calls, stand up a server.
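
A one-shot call from Python amounts to spawning the CLI, writing the request to stdin, and reading the envelope back. The sketch below assumes the envelope is written to stdout for both success and error; `call_rpc` is a hypothetical helper, and its `command` parameter exists only so the binary can be swapped out.

```python
import json
import subprocess
from typing import Sequence


def call_rpc(
    method: str,
    params: dict,
    command: Sequence[str] = ("agent-eval", "rpc"),
    timeout: float = 60.0,
) -> dict:
    """Run `<command> <method>` once, feeding params as JSON on stdin."""
    proc = subprocess.run(
        [*command, method],
        input=json.dumps(params),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    try:
        envelope = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError(
            f"rpc {method} exited {proc.returncode} without a JSON envelope: "
            f"{proc.stderr.strip()}"
        ) from exc
    if "error" in envelope:
        err = envelope["error"]
        raise RuntimeError(f"{err.get('code')}: {err.get('message')}")
    return envelope["result"]


# Usage (requires the agent-eval binary on PATH):
# info = call_rpc("version", {})
```

Every call pays the process-startup cost described above, so this shape suits cron and CI jobs more than hot paths.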

## Clients

- **Python**: [`tangle-agent-eval`](../clients/python/README.md) on PyPI. Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
- **TypeScript**: import directly from `@tangle-network/agent-eval` (no wire round-trip needed in-process).
- **Rust / Go / Other**: generate from `dist/openapi.json`. PRs welcome to add an officially-maintained client.
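
The "auto-detect HTTP, fall back to subprocess" behaviour could be approximated by probing `/healthz` with a short timeout. The Python client's actual detection logic is not shown in this document, so treat the following as one possible approach; `pick_transport` is a hypothetical function.

```python
import json
import urllib.error
import urllib.request


def pick_transport(base_url: str = "http://127.0.0.1:5005", timeout: float = 0.5) -> str:
    """Return "http" if /healthz reports status "ok", otherwise "stdio"."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/healthz", timeout=timeout) as resp:
            body = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return "stdio"
    if isinstance(body, dict) and body.get("status") == "ok":
        return "http"
    return "stdio"
```

A short timeout keeps the fallback cheap when no server is running.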

## Adding a method

1. **Schema** — define `XRequestSchema` and `XResponseSchema` in `src/wire/schemas.ts`. Every field gets a `.describe()` so docs flow through to OpenAPI.
2. **Handler** — pure function in `src/wire/handlers.ts`. Throws `WireError` for caller-fixable issues.
3. **Server route** — `app.post('/v1/x', …)` in `src/wire/server.ts`.
4. **RPC case** — add `case 'x':` in `dispatchRpc` in `src/wire/rpc.ts`.
5. **OpenAPI route** — register in `src/wire/openapi.ts` so it shows up in the spec.
6. **Test** — add to `tests/wire/`. At minimum: schema validation, happy-path, error-path.
7. **Python client** — add a method on `Client` in `clients/python/src/tangle_agent_eval/client.py`, plus pydantic models in `models.py` mirroring the new schemas.

The pattern is mechanical. When the surface grows past ~10 methods, swap the hand-written Python models for `datamodel-code-generator -i openapi.json -o models.py`.
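
To give a feel for what a hand-written mirror model in step 7 involves, here is a sketch for the `judge` response. It uses a standard-library dataclass in place of pydantic so it runs without dependencies; the field names come from the sample response, while the class layout and `from_wire` are assumptions rather than the contents of `models.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    """Python-side mirror of the judge response (illustrative only)."""

    composite: float
    dimensions: dict[str, float]
    failure_modes: list[str]
    wins: list[str]
    rationale: str
    rubric_version: str
    model: str
    duration_ms: int

    @classmethod
    def from_wire(cls, payload: dict) -> "JudgeResult":
        # Map camelCase wire fields onto snake_case attributes.
        return cls(
            composite=float(payload["composite"]),
            dimensions=dict(payload["dimensions"]),
            failure_modes=list(payload["failureModes"]),
            wins=list(payload["wins"]),
            rationale=payload["rationale"],
            rubric_version=payload["rubricVersion"],
            model=payload["model"],
            duration_ms=int(payload["durationMs"]),
        )


# The sample judge response from the Methods section.
sample = {
    "composite": 0.78,
    "dimensions": {"buyer_quality": 0.85, "voice": 0.7, "signal": 0.8},
    "failureModes": [],
    "wins": ["specific-component", "earned-detail"],
    "rationale": "Specific architectural detail, no AI cadence, technical voice.",
    "rubricVersion": "anti-slop@a4f2b8c1",
    "model": "claude-sonnet-4-6",
    "durationMs": 1840,
}
result = JudgeResult.from_wire(sample)
print(result.rubric_version)  # anti-slop@a4f2b8c1
```

Each new method needs a pair of such models, which is the repetition the generator is meant to remove.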

## Wire-protocol versioning

`WIRE_VERSION` (in `src/wire/schemas.ts`) is a separate semver from the npm/PyPI package version. It bumps on **breaking** changes to a request/response schema. Additive changes (new optional fields, new methods) don't require a bump.

When `WIRE_VERSION` bumps, every language client gets a new major version; the dual-publish CI (see `.github/workflows/publish.yml`) enforces this lock-step.

package/package.json CHANGED
@@ -1,7 +1,15 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.
+  "version": "0.19.0",
   "description": "Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).",
+  "homepage": "https://github.com/tangle-network/agent-eval#readme",
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/tangle-network/agent-eval.git"
+  },
+  "bugs": {
+    "url": "https://github.com/tangle-network/agent-eval/issues"
+  },
   "type": "module",
   "main": "./dist/index.js",
   "types": "./dist/index.d.ts",
@@ -28,23 +36,15 @@
     }
   },
   "bin": {
-    "agent-eval": "
+    "agent-eval": "dist/cli.js"
   },
   "files": [
-    "dist"
+    "dist",
+    "docs"
   ],
   "publishConfig": {
     "access": "public"
   },
-  "scripts": {
-    "build": "tsup",
-    "dev": "tsup --watch",
-    "prepare": "tsup",
-    "test": "vitest run",
-    "test:watch": "vitest",
-    "typecheck": "tsc --noEmit",
-    "openapi": "node dist/cli.js openapi --out dist/openapi.json"
-  },
   "dependencies": {
     "@asteasolutions/zod-to-openapi": "^8.5.0",
     "@ax-llm/ax": "^19.0.25",
@@ -64,5 +64,12 @@
     "node": ">=20"
   },
   "license": "MIT",
-  "
-
+  "scripts": {
+    "build": "tsup",
+    "dev": "tsup --watch",
+    "test": "vitest run",
+    "test:watch": "vitest",
+    "typecheck": "tsc --noEmit",
+    "openapi": "node dist/cli.js openapi --out dist/openapi.json"
+  }
+}