@tangle-network/agent-eval 0.17.2 → 0.18.0

@@ -0,0 +1,199 @@
+ # Wire protocol
+
+ agent-eval exposes its evaluation logic over a versioned wire protocol so non-TypeScript clients (Python, Rust, Go, …) can drive it without a parallel implementation. The TypeScript runtime is the single source of truth; clients in other languages are *transport adapters*, not ports.
+
+ ## Mental model
+
+ ```
+ your code (any language)
+         │
+         ▼
+ thin transport client ──HTTP──▶ agent-eval serve ──┐
+         │                                          │
+         └─────subprocess────────▶ agent-eval rpc ──┤
+                                                    ▼
+                                     same TS handlers, same rubrics,
+                                     same scoring code
+ ```
+
+ Both transports talk to identical handlers. If you need a sustained connection (live agent paths, high-frequency calls), use HTTP. If you need a one-shot call (cron, CI, batch), use stdio RPC. The wire shape is the same.
+
+ ## Two transports, one contract
+
+ | | HTTP | stdio RPC |
+ |---|---|---|
+ | Start | `agent-eval serve --port 5005` | per-call: `agent-eval rpc <method>` |
+ | Latency | ~10 ms | ~500 ms (Node startup) |
+ | Best for | live calls, agent paths, dashboards | cron, CI, batch evaluation |
+ | Requires | running server | binary on PATH |
+
+ ## Methods
+
+ The current surface is the smallest useful slice. Adding a method is mechanical; see [§Adding a method](#adding-a-method).
+
+ ### `judge`: score content against a rubric
+
+ ```http
+ POST /v1/judge
+ {
+   "rubricName": "anti-slop",
+   "content": "We just shipped zero-copy IO between sandboxes",
+   "context": { "platform": "x", "author": "drew", "impressions": 1240 }
+ }
+ ```
+
+ ```json
+ {
+   "composite": 0.78,
+   "dimensions": { "buyer_quality": 0.85, "voice": 0.7, "signal": 0.8 },
+   "failureModes": [],
+   "wins": ["specific-component", "earned-detail"],
+   "rationale": "Specific architectural detail, no AI cadence, technical voice.",
+   "rubricVersion": "anti-slop@a4f2b8c1",
+   "model": "claude-sonnet-4-6",
+   "durationMs": 1840
+ }
+ ```
+
+ Pass either `rubricName` (a built-in rubric) or `rubric` (an inline definition), but not both. The handler:
+ 1. Resolves the rubric.
+ 2. Calls the judging LLM with a JSON-schema-constrained response.
+ 3. Computes `composite = Σ(weight_i × normalized_score_i) / Σ(weight_i)`.
+ 4. Returns a typed `JudgeResult`.
+
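The composite formula in step 3 can be sketched in a few lines of Python. This is only an illustration, not the package's handler: the function name `composite_score` and the dict-based inputs are assumptions, and it treats the per-dimension values as already normalized to the 0 to 1 range.

```python
from typing import Mapping


def composite_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted mean: sum(weight_i * score_i) / sum(weight_i)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for dimensions: {sorted(missing)}")
    weighted_sum = sum(weights[dim] * scores[dim] for dim in weights)
    return weighted_sum / total_weight


# Dimension scores from the sample judge response; weights from the
# anti-slop entry in the listRubrics sample.
scores = {"buyer_quality": 0.85, "voice": 0.7, "signal": 0.8}
weights = {"buyer_quality": 0.5, "voice": 0.3, "signal": 0.2}
composite = composite_score(scores, weights)
```

With these inputs the weighted mean comes out at roughly 0.795, slightly above the 0.78 shown in the sample response, so the sample numbers are best read as illustrative rather than as a worked calculation.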
+ `rubricVersion` combines the rubric name with a stable hash of the rubric used (for example `anti-slop@a4f2b8c1`). Scores are comparable across runs only when this value matches.
+
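One way to respect that comparability rule on the client side is to bucket results by `rubricVersion` before aggregating. The sketch below assumes each result is a dict shaped like the sample `judge` response; the helper name `mean_composite_by_rubric_version` is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Mapping


def mean_composite_by_rubric_version(results: Iterable[Mapping]) -> dict:
    """Average composite scores separately for each rubricVersion."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result["rubricVersion"]].append(result["composite"])
    # Scores from different rubric versions are never mixed into one average.
    return {version: mean(values) for version, values in buckets.items()}


runs = [
    {"rubricVersion": "anti-slop@a4f2b8c1", "composite": 0.78},
    {"rubricVersion": "anti-slop@a4f2b8c1", "composite": 0.82},
    {"rubricVersion": "anti-slop@00000000", "composite": 0.40},  # made-up hash
]
averages = mean_composite_by_rubric_version(runs)
```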
+ ### `listRubrics`: discover what's registered
+
+ ```http
+ GET /v1/rubrics
+ ```
+
+ ```json
+ {
+   "rubrics": [
+     {
+       "name": "anti-slop",
+       "description": "Voice and signal quality for technical-buyer content.",
+       "dimensions": [
+         { "id": "buyer_quality", "description": "Would the target buyer care?", "weight": 0.5 },
+         { "id": "voice", "description": "Builder voice, not AI/marketing?", "weight": 0.3 },
+         { "id": "signal", "description": "Non-obvious detail or constraint?", "weight": 0.2 }
+       ],
+       "failureModes": ["ai-cadence", "marketing-tone", "vague-claim", "no-hook", "engagement-bait", "off-icp", "stale-claim"],
+       "rubricVersion": "anti-slop@a4f2b8c1"
+     }
+   ]
+ }
+ ```
+
+ ### `version`: server and wire-protocol versions
+
+ ```http
+ GET /v1/version
+ ```
+
+ ```json
+ {
+   "package": "@tangle-network/agent-eval",
+   "version": "0.18.0",
+   "wireVersion": "1.0.0",
+   "apiSurface": ["judge", "listRubrics", "version"]
+ }
+ ```
+
+ `version` matches the npm/PyPI package version. `wireVersion` is bumped independently, and only on breaking request/response schema changes. Releases with different package versions remain wire-compatible as long as `wireVersion` matches.
+
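A client can use the `version` response to refuse to talk to a server with a different wire contract. The following is a minimal sketch, assuming the client pins the exact `wireVersion` it was built against; `check_wire_compat` is a hypothetical helper, not part of the published Python client.

```python
def check_wire_compat(server_info: dict, expected_wire_version: str) -> None:
    """Raise if the server reports a different wireVersion than the client expects."""
    actual = server_info.get("wireVersion")
    if actual != expected_wire_version:
        raise RuntimeError(
            f"wire version mismatch: client expects {expected_wire_version}, "
            f"server reports {actual} (package version {server_info.get('version')})"
        )


# The sample /v1/version response from above.
server_info = {
    "package": "@tangle-network/agent-eval",
    "version": "0.18.0",
    "wireVersion": "1.0.0",
    "apiSurface": ["judge", "listRubrics", "version"],
}
check_wire_compat(server_info, "1.0.0")  # matching versions: returns without raising
```

The package `version` is only included in the error message; the decision itself rests on `wireVersion` alone.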
+ ### `GET /healthz`: liveness
+
+ Use this to probe whether a server is up. It returns `{ "status": "ok", "uptimeSec": <number> }`.
+
+ ### `GET /openapi.json`: full spec
+
+ Auto-generated from the Zod schemas. This is what code generators consume to produce typed clients in other languages.
+
+ ## Errors
+
+ Every error response uses the same shape:
+
+ ```json
+ {
+   "error": {
+     "code": "rubric_not_found",
+     "message": "No built-in rubric named \"missing-name\".",
+     "details": null
+   }
+ }
+ ```
+
+ | HTTP status | code | meaning |
+ |---|---|---|
+ | 400 | `validation_error` | Request didn't match the schema. |
+ | 404 | `rubric_not_found` | Unknown `rubricName`. |
+ | 500 | `judge_error` | LLM returned malformed output. |
+ | 500 | `internal_error` | Unexpected server error. |
+
+ stdio RPC uses the same shape inside an envelope: `{"error": {...}}` instead of `{"result": {...}}`. The exit code is non-zero on error.
+
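Because both transports share one error shape, a client can funnel every decoded body through a single function. This is a sketch under assumptions: `AgentEvalError` and `unwrap` are hypothetical names, and it assumes HTTP success bodies are returned bare (as in the samples above) while stdio RPC wraps them in `{"result": ...}`.

```python
class AgentEvalError(Exception):
    """Client-side mirror of the documented error shape (hypothetical class)."""

    def __init__(self, code: str, message: str, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details


def unwrap(payload: dict) -> dict:
    """Return the success payload, or raise AgentEvalError for an error body."""
    if "error" in payload:
        err = payload["error"]
        raise AgentEvalError(
            err.get("code", "unknown"), err.get("message", ""), err.get("details")
        )
    # stdio RPC envelope: {"result": {...}}; bare HTTP bodies pass through unchanged.
    return payload.get("result", payload)
```

Callers can then branch on `e.code` (for example `rubric_not_found` versus `validation_error`) rather than on the HTTP status or process exit code.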
+ ## Running the server
+
+ ```sh
+ agent-eval serve --port 5005 --host 127.0.0.1
+ ```
+
+ Defaults to `127.0.0.1:5005`. Bind to `0.0.0.0` only if you trust the network.
+
+ ```sh
+ # health
+ curl http://localhost:5005/healthz
+
+ # discover
+ curl http://localhost:5005/v1/rubrics | jq
+
+ # judge
+ curl -X POST http://localhost:5005/v1/judge \
+   -H 'content-type: application/json' \
+   -d '{"rubricName":"anti-slop","content":"We just shipped …"}'
+ ```
+
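The `judge` curl call translates directly to the Python standard library. The sketch below uses only `urllib` and mirrors the documented endpoint and header; the function names are hypothetical and this is not the published `tangle-agent-eval` client.

```python
import json
import urllib.request


def build_judge_request(base_url: str, rubric_name: str, content: str) -> urllib.request.Request:
    """Build the POST /v1/judge request without sending it."""
    body = json.dumps({"rubricName": rubric_name, "content": content}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/judge",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def judge(base_url: str, rubric_name: str, content: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_judge_request(base_url, rubric_name, content)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `judge("http://localhost:5005", "anti-slop", "some content")` requires a running `agent-eval serve`; on an error status, the JSON error body can be read from the raised `HTTPError`.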
+ ## Using stdio RPC
+
+ ```sh
+ # version
+ echo '{}' | agent-eval rpc version
+
+ # listRubrics
+ echo '{}' | agent-eval rpc listRubrics
+
+ # judge (one-shot)
+ echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judge
+
+ # JSONL batch: one request per line
+ cat requests.jsonl | agent-eval rpc-batch judge > results.jsonl
+ ```
+
+ Each invocation is one process, so Node startup adds ~500 ms per call. For more than a few calls, stand up a server.
+
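From Python, the one-shot form maps onto `subprocess.run`. This sketch assumes the `{"result": ...}` / `{"error": ...}` envelope is written to stdout even when the exit code is non-zero, which the text above does not state explicitly; `rpc_call` and its `argv_prefix` parameter are hypothetical.

```python
import json
import subprocess


def rpc_call(method: str, params: dict, argv_prefix=("agent-eval", "rpc"), timeout: float = 60.0) -> dict:
    """Run one `agent-eval rpc <method>` process and return its result payload."""
    proc = subprocess.run(
        [*argv_prefix, method],
        input=json.dumps(params),  # the request goes in on stdin, as with `echo ... |`
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Parse stdout before looking at the exit code: errors still carry an envelope.
    try:
        envelope = json.loads(proc.stdout)
    except json.JSONDecodeError:
        raise RuntimeError(
            f"rpc {method} exited {proc.returncode} without JSON on stdout: "
            f"{proc.stderr.strip()}"
        ) from None
    if "error" in envelope:
        err = envelope["error"]
        raise RuntimeError(f"{err.get('code')}: {err.get('message')}")
    return envelope["result"]
```

`argv_prefix` exists so the binary can be swapped out, for example for `npx` invocations or a stub in tests. Each call still pays the process startup cost noted above.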
+ ## Clients
+
+ - **Python**: [`tangle-agent-eval`](../clients/python/README.md) on PyPI. It auto-detects a running HTTP server and falls back to the subprocess transport. Its version is locked to the npm package.
+ - **TypeScript**: import directly from `@tangle-network/agent-eval` (no wire round-trip needed in-process).
+ - **Rust / Go / Other**: generate a client from `dist/openapi.json`. PRs welcome to add an officially-maintained client.
+
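The "auto-detect HTTP, fall back to subprocess" behaviour described for the Python client could look roughly like the sketch below. It is a guess at the approach, not the client's actual code: `http_is_up` and `pick_transport` are hypothetical names, and probing `/healthz` is an assumption based on the liveness endpoint documented above.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_is_up(base_url: str, timeout: float = 0.5) -> bool:
    """Probe GET /healthz; any network or HTTP failure counts as not up."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False


def pick_transport(base_url: str, probe: Callable[[str], bool] = http_is_up) -> str:
    """Return "http" when the server answers the probe, otherwise "stdio"."""
    return "http" if probe(base_url) else "stdio"
```

Passing the probe as a parameter keeps the selection logic testable without a running server.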
+ ## Adding a method
+
+ 1. **Schema**: define `XRequestSchema` and `XResponseSchema` in `src/wire/schemas.ts`. Every field gets a `.describe()` so docs flow through to OpenAPI.
+ 2. **Handler**: a pure function in `src/wire/handlers.ts`. It throws `WireError` for caller-fixable issues.
+ 3. **Server route**: `app.post('/v1/x', …)` in `src/wire/server.ts`.
+ 4. **RPC case**: add `case 'x':` in `dispatchRpc` in `src/wire/rpc.ts`.
+ 5. **OpenAPI route**: register in `src/wire/openapi.ts` so it shows up in the spec.
+ 6. **Test**: add to `tests/wire/`. At minimum: schema validation, happy-path, error-path.
+ 7. **Python client**: add a method on `Client` in `clients/python/src/tangle_agent_eval/client.py`, plus pydantic models in `models.py` mirroring the new schemas.
+
+ The pattern is mechanical. When the surface grows past ~10 methods, swap the hand-written Python models for ones generated by `datamodel-code-generator`, for example `datamodel-codegen --input openapi.json --output models.py`.
+
+ ## Wire-protocol versioning
+
+ `WIRE_VERSION` (in `src/wire/schemas.ts`) is a separate semver from the npm/PyPI package version. It bumps on **breaking** changes to a request/response schema. Additive changes (new optional fields, new methods) don't require a bump.
+
+ When `WIRE_VERSION` bumps, every language client gets a new major version; the dual-publish CI (see `.github/workflows/publish.yml`) enforces this lock-step.
package/package.json CHANGED
@@ -1,7 +1,15 @@
  {
    "name": "@tangle-network/agent-eval",
-   "version": "0.17.2",
+   "version": "0.18.0",
    "description": "Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).",
+   "homepage": "https://github.com/tangle-network/agent-eval#readme",
+   "repository": {
+     "type": "git",
+     "url": "git+https://github.com/tangle-network/agent-eval.git"
+   },
+   "bugs": {
+     "url": "https://github.com/tangle-network/agent-eval/issues"
+   },
    "type": "module",
    "main": "./dist/index.js",
    "types": "./dist/index.d.ts",
@@ -28,23 +36,15 @@
      }
    },
    "bin": {
-     "agent-eval": "./dist/cli.js"
+     "agent-eval": "dist/cli.js"
    },
    "files": [
-     "dist"
+     "dist",
+     "docs"
    ],
    "publishConfig": {
      "access": "public"
    },
-   "scripts": {
-     "build": "tsup",
-     "dev": "tsup --watch",
-     "prepare": "tsup",
-     "test": "vitest run",
-     "test:watch": "vitest",
-     "typecheck": "tsc --noEmit",
-     "openapi": "node dist/cli.js openapi --out dist/openapi.json"
-   },
    "dependencies": {
      "@asteasolutions/zod-to-openapi": "^8.5.0",
      "@ax-llm/ax": "^19.0.25",
@@ -64,5 +64,12 @@
      "node": ">=20"
    },
    "license": "MIT",
-   "packageManager": "pnpm@10.22.0"
- }
+   "scripts": {
+     "build": "tsup",
+     "dev": "tsup --watch",
+     "test": "vitest run",
+     "test:watch": "vitest",
+     "typecheck": "tsc --noEmit",
+     "openapi": "node dist/cli.js openapi --out dist/openapi.json"
+   }
+ }