npm - @tangle-network/agent-eval - Versions diffs - 0.20.8 → 0.20.9 - Mend

Files changed (21)

package/LICENSE +21 -0
package/README.md +9 -6
package/dist/benchmarks/index.d.ts +1 -0
package/dist/benchmarks/index.js +12 -0
package/dist/benchmarks/index.js.map +1 -0
package/dist/chunk-XDGJUIV2.js +219 -0
package/dist/chunk-XDGJUIV2.js.map +1 -0
package/dist/index-CEWY1rmu.d.ts +290 -0
package/dist/index.d.ts +37 -298
package/dist/index.js +68 -239
package/dist/index.js.map +1 -1
package/dist/openapi.json +477 -0
package/docs/concepts.md +4 -4
package/docs/knowledge-readiness.md +2 -2
package/docs/wire-protocol.md +3 -3
package/package.json +14 -7
package/examples/benchmarks/README.md +0 -44
package/examples/benchmarks/gsm8k/index.ts +0 -126
package/examples/benchmarks/swebench-lite/index.ts +0 -178
package/examples/multi-shot-optimization/index.ts +0 -114
package/examples/same-sandbox-harness/index.ts +0 -63

package/dist/openapi.json ADDED

@@ -0,0 +1,477 @@
+{
+  "openapi": "3.1.0",
+  "info": {
+    "title": "@tangle-network/agent-eval — wire protocol",
+    "version": "0.20.9",
+    "description": "HTTP and stdio RPC interface to agent-eval. The TypeScript runtime is the source of truth; this spec is the contract that cross-language clients (Python, Rust, Go) generate from.\n\nWire-protocol version: 1.0.0. Bumps on breaking changes to request/response schemas.",
+    "contact": {
+      "name": "Tangle Network",
+      "url": "https://github.com/tangle-network/agent-eval"
+    },
+    "license": {
+      "name": "MIT"
+    }
+  },
+  "servers": [
+    {
+      "url": "http://localhost:5005",
+      "description": "Local agent-eval serve"
+    }
+  ],
+  "components": {
+    "schemas": {
+      "JudgeRequest": {
+        "type": "object",
+        "properties": {
+          "rubricName": {
+            "type": "string",
+            "description": "Use a built-in rubric by name. Mutually exclusive with `rubric`."
+          },
+          "rubric": {
+            "$ref": "#/components/schemas/Rubric"
+          },
+          "content": {
+            "type": "string",
+            "minLength": 1,
+            "description": "The text being judged — a tweet, a blog post, a code snippet, anything stringly."
+          },
+          "context": {
+            "type": "object",
+            "additionalProperties": {},
+            "description": "Free-form metadata for the rubric to use — analytics, source URL, author, etc. Surfaced to the LLM."
+          },
+          "model": {
+            "type": "string",
+            "description": "Override the judge model (default routes via tcloud). e.g. \"claude-opus-4-7\"."
+          }
+        },
+        "required": [
+          "content"
+        ]
+      },
+      "Rubric": {
+        "type": "object",
+        "properties": {
+          "name": {
+            "type": "string",
+            "minLength": 1,
+            "description": "Stable name like \"anti-slop\" — used by clients to invoke this rubric."
+          },
+          "description": {
+            "type": "string",
+            "minLength": 1,
+            "description": "What this rubric measures. Shown in /v1/rubrics listing."
+          },
+          "systemPrompt": {
+            "type": "string",
+            "minLength": 1,
+            "description": "Instructs the judging LLM. Should explain the persona (e.g. \"senior engineer reviewing voice\"), what to score on, and what to return."
+          },
+          "dimensions": {
+            "type": "array",
+            "items": {
+              "$ref": "#/components/schemas/RubricDimension"
+            },
+            "minItems": 1,
+            "description": "Scoring axes. The composite score is a weighted sum of these."
+          },
+          "failureModes": {
+            "type": "array",
+            "items": {
+              "$ref": "#/components/schemas/FailureMode"
+            },
+            "default": [],
+            "description": "Patterns to detect; each detected mode appears in the result.failureModes list."
+          },
+          "wins": {
+            "type": "array",
+            "items": {
+              "$ref": "#/components/schemas/FailureMode"
+            },
+            "default": [],
+            "description": "Positive patterns; each detected one appears in the result.wins list."
+          }
+        },
+        "required": [
+          "name",
+          "description",
+          "systemPrompt",
+          "dimensions"
+        ],
+        "description": "Inline rubric definition. Mutually exclusive with `rubricName`."
+      },
+      "RubricDimension": {
+        "type": "object",
+        "properties": {
+          "id": {
+            "type": "string",
+            "minLength": 1,
+            "description": "Short stable id like \"buyer_quality\" — used as the key in scoring output."
+          },
+          "description": {
+            "type": "string",
+            "minLength": 1,
+            "description": "One-line plain-English meaning. Read by humans reviewing low scores."
+          },
+          "weight": {
+            "type": "number",
+            "minimum": 0,
+            "default": 1,
+            "description": "Relative weight in the composite score. Default 1; 0 disables."
+          },
+          "min": {
+            "type": "number",
+            "default": 0,
+            "description": "Lower bound of valid score for this dimension."
+          },
+          "max": {
+            "type": "number",
+            "default": 1,
+            "description": "Upper bound of valid score for this dimension."
+          }
+        },
+        "required": [
+          "id",
+          "description"
+        ]
+      },
+      "FailureMode": {
+        "type": "object",
+        "properties": {
+          "id": {
+            "type": "string",
+            "minLength": 1,
+            "description": "Short stable id like \"ai-cadence\" — used in detection lists."
+          },
+          "description": {
+            "type": "string",
+            "minLength": 1,
+            "description": "Plain-English description of the failure pattern."
+          }
+        },
+        "required": [
+          "id",
+          "description"
+        ]
+      },
+      "JudgeResult": {
+        "type": "object",
+        "properties": {
+          "composite": {
+            "type": "number",
+            "minimum": 0,
+            "maximum": 1,
+            "description": "Weighted combination of dimension scores in 0..1. The single number to gate on."
+          },
+          "dimensions": {
+            "type": "object",
+            "additionalProperties": {
+              "type": "number"
+            },
+            "description": "Per-dimension score, keyed by RubricDimension.id."
+          },
+          "failureModes": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            },
+            "default": [],
+            "description": "Failure-mode ids detected in the content (subset of rubric.failureModes ids)."
+          },
+          "wins": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            },
+            "default": [],
+            "description": "Win ids detected in the content (subset of rubric.wins ids)."
+          },
+          "rationale": {
+            "type": "string",
+            "description": "Plain-English explanation of the score. Surfaced to the human reviewer."
+          },
+          "rubricVersion": {
+            "type": "string",
+            "description": "Stable hash of the rubric used. Scores are only comparable across runs when this matches."
+          },
+          "model": {
+            "type": "string",
+            "description": "Model that produced the judgement, for reproducibility."
+          },
+          "durationMs": {
+            "type": "integer",
+            "minimum": 0,
+            "description": "End-to-end wall time for this call."
+          }
+        },
+        "required": [
+          "composite",
+          "dimensions",
+          "rationale",
+          "rubricVersion",
+          "model",
+          "durationMs"
+        ]
+      },
+      "ListRubricsResponse": {
+        "type": "object",
+        "properties": {
+          "rubrics": {
+            "type": "array",
+            "items": {
+              "$ref": "#/components/schemas/RubricInfo"
+            }
+          }
+        },
+        "required": [
+          "rubrics"
+        ]
+      },
+      "RubricInfo": {
+        "type": "object",
+        "properties": {
+          "name": {
+            "type": "string",
+            "description": "Pass this to /v1/judge as `rubricName`."
+          },
+          "description": {
+            "type": "string",
+            "description": "What this rubric measures."
+          },
+          "dimensions": {
+            "type": "array",
+            "items": {
+              "type": "object",
+              "properties": {
+                "id": {
+                  "type": "string"
+                },
+                "description": {
+                  "type": "string"
+                },
+                "weight": {
+                  "type": "number"
+                }
+              },
+              "required": [
+                "id",
+                "description",
+                "weight"
+              ]
+            },
+            "description": "The scoring axes this rubric uses, with weights."
+          },
+          "failureModes": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            },
+            "default": [],
+            "description": "Failure-mode ids this rubric detects."
+          },
+          "rubricVersion": {
+            "type": "string",
+            "description": "Stable hash — match this to compare scores across runs."
+          }
+        },
+        "required": [
+          "name",
+          "description",
+          "dimensions",
+          "rubricVersion"
+        ]
+      },
+      "VersionResponse": {
+        "type": "object",
+        "properties": {
+          "package": {
+            "type": "string",
+            "description": "Package name (always \"@tangle-network/agent-eval\")."
+          },
+          "version": {
+            "type": "string",
+            "description": "Semver of the running server. Match your client to this."
+          },
+          "wireVersion": {
+            "type": "string",
+            "description": "Wire-protocol semver. Bumps separately from package version when the schema changes."
+          },
+          "apiSurface": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            },
+            "description": "List of supported method names."
+          }
+        },
+        "required": [
+          "package",
+          "version",
+          "wireVersion",
+          "apiSurface"
+        ]
+      },
+      "HealthResponse": {
+        "type": "object",
+        "properties": {
+          "status": {
+            "type": "string",
+            "enum": [
+              "ok"
+            ]
+          },
+          "uptimeSec": {
+            "type": "number"
+          }
+        },
+        "required": [
+          "status",
+          "uptimeSec"
+        ]
+      },
+      "ErrorResponse": {
+        "type": "object",
+        "properties": {
+          "error": {
+            "type": "object",
+            "properties": {
+              "code": {
+                "type": "string",
+                "description": "Machine-readable code: \"validation_error\", \"rubric_not_found\", \"judge_error\"."
+              },
+              "message": {
+                "type": "string",
+                "description": "Human-readable message."
+              },
+              "details": {
+                "description": "Optional structured detail."
+              }
+            },
+            "required": [
+              "code",
+              "message"
+            ],
+            "description": "Errors are always wrapped in this shape across all endpoints."
+          }
+        },
+        "required": [
+          "error"
+        ]
+      }
+    },
+    "parameters": {}
+  },
+  "paths": {
+    "/v1/judge": {
+      "post": {
+        "summary": "Score a piece of content against a rubric",
+        "description": "Runs the judging LLM with the named (or inline) rubric and returns dimension scores, detected failure modes, wins, and a composite score in 0..1.",
+        "requestBody": {
+          "content": {
+            "application/json": {
+              "schema": {
+                "$ref": "#/components/schemas/JudgeRequest"
+              }
+            }
+          }
+        },
+        "responses": {
+          "200": {
+            "description": "Successful judgement",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/JudgeResult"
+                }
+              }
+            }
+          },
+          "400": {
+            "description": "Validation error",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/ErrorResponse"
+                }
+              }
+            }
+          },
+          "404": {
+            "description": "Rubric not found",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/ErrorResponse"
+                }
+              }
+            }
+          },
+          "500": {
+            "description": "Judge error",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/ErrorResponse"
+                }
+              }
+            }
+          }
+        }
+      }
+    },
+    "/v1/rubrics": {
+      "get": {
+        "summary": "List built-in rubrics",
+        "description": "Returns every rubric registered server-side, with their dimensions and stable rubricVersion hash.",
+        "responses": {
+          "200": {
+            "description": "Listing",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/ListRubricsResponse"
+                }
+              }
+            }
+          }
+        }
+      }
+    },
+    "/v1/version": {
+      "get": {
+        "summary": "Server and wire-protocol version",
+        "description": "Match your client version to `version`; check `wireVersion` for compatibility.",
+        "responses": {
+          "200": {
+            "description": "Version info",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/VersionResponse"
+                }
+              }
+            }
+          }
+        }
+      }
+    },
+    "/healthz": {
+      "get": {
+        "summary": "Liveness check",
+        "responses": {
+          "200": {
+            "description": "OK",
+            "content": {
+              "application/json": {
+                "schema": {
+                  "$ref": "#/components/schemas/HealthResponse"
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  },
+  "webhooks": {}
+}
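
The `Rubric` and `JudgeResult` schemas above describe `composite` as a weighted combination of dimension scores in 0..1, but the spec does not state the exact formula. The sketch below is one plausible reading, not the package's implementation: it assumes each score is rescaled from its `[min, max]` range into 0..1 and that the result is divided by the total weight. The helper name `compositeScore` is hypothetical.

```typescript
// Mirrors the RubricDimension schema; optional fields use the schema defaults.
interface RubricDimension {
  id: string;
  description: string;
  weight?: number; // schema default: 1, and 0 disables the dimension
  min?: number;    // schema default: 0
  max?: number;    // schema default: 1
}

// Hypothetical helper (an assumption, not the library's code):
// rescale each score into 0..1, then take the weight-normalised sum.
function compositeScore(
  dimensions: RubricDimension[],
  scores: Record<string, number>,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const dim of dimensions) {
    const weight = dim.weight ?? 1;
    if (weight === 0) continue; // disabled dimension, no score required
    const min = dim.min ?? 0;
    const max = dim.max ?? 1;
    if (max <= min) throw new Error(`dimension "${dim.id}" has an empty range`);
    const raw = scores[dim.id];
    if (raw === undefined) throw new Error(`missing score for dimension "${dim.id}"`);
    const clamped = Math.min(max, Math.max(min, raw));
    weighted += weight * ((clamped - min) / (max - min));
    totalWeight += weight;
  }
  if (totalWeight === 0) throw new Error("all dimensions are disabled");
  return weighted / totalWeight;
}
```

With weights 3 and 1, a full score on the first dimension and 5 out of 10 on the second give (3 × 1 + 1 × 0.5) / 4 = 0.875 under these assumptions.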

package/docs/concepts.md CHANGED

@@ -43,7 +43,7 @@ that can seed memory, replay scenarios, and optimization.
 | **Trace store** | The append-only log of every span/event during a run. Replay = read this back. |
 | **Composite score** | A 0..1 number combining all dimensions. The single number you gate on. |
 | **Rubric version** | A stable hash of the rubric. Scores from different rubric versions are not comparable. |
-| **Muffled gate** | A check that should fail loud but silently passes (e.g. `command || true`). The most expensive bug class in this codebase — see SKILL.md. |
+| **Muffled gate** | A check that should fail loud but silently passes (e.g. `command || true`). The most expensive bug class in this codebase. |
 ## The feedback trajectory loop
@@ -119,7 +119,7 @@ report.blendedScore   // 0..1 — weighted aggregate
 report.layers         // per-layer status, findings, duration
 ```
-Two rules that will save you bugs (paid for in real incidents — see SKILL.md):
+Two rules that will save you bugs:
 1. **Run both gates.** Build gates catch code that doesn't compile; structural assertions catch missing files. Run both unconditionally — they catch orthogonal failures.
@@ -150,6 +150,6 @@ You don't need to build the trace tree by hand. `BuilderSession` does it for you
 - **Just want to score a string against a rubric?** → [wire-protocol.md](./wire-protocol.md) — HTTP/RPC interface, pluggable from any language.
 - **Need a reusable driver/worker/evaluator loop?** → [control-runtime.md](./control-runtime.md) — generic runtime plus coding, browser, computer-use, and research integration patterns.
 - **Want review feedback to become eval/optimization data?** → [feedback-trajectories.md](./feedback-trajectories.md) — turn feedback into datasets, optimizer rows, and preference memory.
-- **Building a code-generator eval?** → SKILL.md §Minimal working path — the `BuilderSession` recipe.
-- **Multi-layer verifier?** → SKILL.md §Verification pipeline.
+- **Building a code-generator eval?** → Start with `BuilderSession`, `SandboxHarness`, and `MultiLayerVerifier`.
+- **Multi-layer verifier?** → Use [control-runtime.md](./control-runtime.md) and `MultiLayerVerifier` for ordered gates with dependencies.
 - **Adding a new judge or rubric?** → `src/wire/rubrics.ts` for the cross-language path; `src/anti-slop.ts` and `src/judges.ts` for the in-process path.
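
The glossary rows in this hunk say that the composite score is the number to gate on, that scores from different rubric versions are not comparable, and that a muffled gate silently passes when it should fail loud. A minimal sketch of a gate that respects all three follows. The names `ScoredRun` and `assertGate` are hypothetical and are not part of the package's API.

```typescript
// Hypothetical shape: just the two JudgeResult fields the gate needs.
interface ScoredRun {
  composite: number;     // expected to be in 0..1
  rubricVersion: string; // stable hash of the rubric
}

// Throws on every failure path so the gate cannot pass silently.
function assertGate(current: ScoredRun, baseline: ScoredRun, threshold: number): void {
  if (current.rubricVersion !== baseline.rubricVersion) {
    throw new Error(
      `rubric version mismatch (${current.rubricVersion} vs ${baseline.rubricVersion}): scores are not comparable`,
    );
  }
  const score = current.composite;
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new Error(`composite score out of range: ${score}`);
  }
  if (score < threshold) {
    throw new Error(`composite ${score} is below the gate threshold ${threshold}`);
  }
}
```

Throwing rather than returning `false` is a deliberate choice here: a caller that forgets to check a boolean recreates the muffled-gate problem.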

package/docs/knowledge-readiness.md CHANGED

@@ -2,8 +2,8 @@
 `agent-eval` owns the contract for deciding whether an agent had enough
 task-world context to run. It does not own web crawling, connector storage, wiki
-pages, credentials, or product policy. Those live in `agent-knowledge` and
-product repos.
+pages, credentials, or product policy. Those live in
+`@tangle-network/agent-knowledge` and product repos.
 The core loop is:

package/docs/wire-protocol.md CHANGED

@@ -96,13 +96,13 @@ GET /v1/version
 ```json
 {
   "package": "@tangle-network/agent-eval",
-  "version": "0.19.0",
+  "version": "0.20.9",
   "wireVersion": "1.0.0",
   "apiSurface": ["judge", "listRubrics", "version"]
 }
 ```
-`version` matches the npm/PyPI package version. `wireVersion` bumps independently — only on breaking request/response schema changes. Package versions can differ across releases as long as `wireVersion` matches.
+`version` matches the package version. `wireVersion` bumps independently — only on breaking request/response schema changes. Package versions can differ across releases as long as `wireVersion` matches.
 ### `GET /healthz` — liveness
@@ -176,7 +176,7 @@ Each invocation is one process — Node startup adds ~500 ms. For more than a fe
 ## Clients
-- **Python**: [`tangle-agent-eval`](../clients/python/README.md) on PyPI. Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
+- **Python**: source lives in [`clients/python`](../clients/python/README.md). Auto-detects HTTP, falls back to subprocess. Version-locked to npm.
 - **TypeScript**: import directly from `@tangle-network/agent-eval` (no wire round-trip needed in-process).
 - **Rust / Go / Other**: generate from `dist/openapi.json`. PRs welcome to add an officially-maintained client.
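
The `GET /v1/version` text in this hunk says that `wireVersion` only bumps on breaking schema changes and that package versions may differ as long as `wireVersion` matches. A client could check this at startup roughly as follows. The sketch assumes that a breaking change bumps the semver major, which the docs do not spell out, and `isWireCompatible` is a hypothetical helper name.

```typescript
// Mirrors the VersionResponse schema from dist/openapi.json.
interface VersionResponse {
  package: string;
  version: string;
  wireVersion: string;
  apiSurface: string[];
}

// Extracts the major component of a semver string such as "1.0.0".
function majorOf(semver: string): number {
  const match = /^(\d+)\.\d+\.\d+/.exec(semver);
  if (!match) throw new Error(`not a semver string: ${semver}`);
  return Number(match[1]);
}

// Assumption: compatibility is decided by the wire-protocol major only.
// The package `version` field is deliberately ignored.
function isWireCompatible(clientWireVersion: string, server: VersionResponse): boolean {
  return majorOf(clientWireVersion) === majorOf(server.wireVersion);
}
```

A stricter client could require an exact `wireVersion` match instead, which is the literal reading of the sentence above.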

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.20.8",
+  "version": "0.20.9",
   "description": "Trace-first evaluation infrastructure for agent systems: traces, harnesses, verifier pipelines, judges, datasets, gates, optimization, and reporting.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {
@@ -33,6 +33,14 @@
       "types": "./dist/wire/index.d.ts",
       "import": "./dist/wire/index.js",
       "default": "./dist/wire/index.js"
+    },
+    "./benchmarks": {
+      "types": "./dist/benchmarks/index.d.ts",
+      "import": "./dist/benchmarks/index.js",
+      "default": "./dist/benchmarks/index.js"
+    },
+    "./openapi.json": {
+      "default": "./dist/openapi.json"
     }
   },
   "bin": {
@@ -40,26 +48,25 @@
   },
   "files": [
     "dist",
-    "docs",
-    "examples"
+    "docs"
   ],
   "publishConfig": {
     "access": "public"
   },
   "scripts": {
-    "build": "tsup",
+    "build": "tsup && node dist/cli.js openapi --out dist/openapi.json",
     "dev": "tsup --watch",
-    "prepare": "tsup",
+    "prepare": "pnpm build",
     "test": "vitest run",
     "test:watch": "vitest",
     "typecheck": "tsc --noEmit",
-    "openapi": "node dist/cli.js openapi --out dist/openapi.json"
+    "openapi": "pnpm build"
   },
   "dependencies": {
     "@asteasolutions/zod-to-openapi": "^8.5.0",
     "@ax-llm/ax": "^19.0.25",
     "@hono/node-server": "^2.0.0",
-    "@tangle-network/tcloud": "^0.2.0",
+    "@tangle-network/tcloud": "^0.4.6",
     "hono": "^4.12.15",
     "zod": "^4.3.6"
   },

package/examples/benchmarks/README.md DELETED

@@ -1,44 +0,0 @@
-# Example benchmark wrappers
-Reference implementations of `BenchmarkAdapter` for two public benchmarks. They are NOT bundled — they're intentionally shipped as source you read, copy, and adapt.
-| Wrapper | What it does | Why it's an example, not core |
-|---|---|---|
-| [`gsm8k/`](./gsm8k) | Exact-match grading on the final numeric answer of GSM8K (Cobbe et al.) | The dataset isn't ours and isn't bundled. The wrapper points to a local JSONL via `AGENT_EVAL_GSM8K_PATH`. |
-| [`swebench-lite/`](./swebench-lite) | Pass/fail grading via an external SWE-Bench grader command | The grader is a separate binary; the wrapper stubs the integration via `AGENT_EVAL_SWEBENCH_GRADER_CMD`. |
-The novel benchmark we ship and own — the synthetic routing task — lives in `src/benchmarks/routing/` and IS in the bundle.
-## Using these wrappers
-Two paths.
-**Option A — read and inline.** Copy the wrapper file into your project. Replace the import paths from `../../../src/benchmarks/types` and `../../../src/run-record` with `@tangle-network/agent-eval`. Done.
-**Option B — import from agent-eval source.** If your project sits in this monorepo (or you've cloned the repo), import directly:
-```ts
-import * as gsm8k from '@tangle-network/agent-eval/examples/benchmarks/gsm8k'
-```
-This requires adding `examples/**/*.ts` to your TypeScript paths. Easier to just copy.
-## What every BenchmarkAdapter exports
-```ts
-loadDataset(split: 'search' | 'dev' | 'holdout'): Promise<DatasetItem[]>
-evaluate(item, response): Promise<{ score: number, raw: Record<string, unknown> }>
-assignSplit(itemId: string): 'search' | 'dev' | 'holdout'
-```
-`assignSplit` uses `deterministicSplit(itemId, BENCHMARK_SPLIT_SEED)` — same item gets the same split everywhere. Don't change the seed; it's load-bearing for reproducibility.
-## Adding a new benchmark
-1. Create `examples/benchmarks/<your-benchmark>/index.ts`.
-2. Export `loadDataset`, `evaluate`, `assignSplit`. Optionally a typed `Adapter` class.
-3. Use `deterministicSplit` from `@tangle-network/agent-eval` for split assignment.
-4. Fail loud on missing config (env vars, paths). Never default to silent-pass.
-5. Document config requirements in a per-benchmark README.
-If your benchmark is novel and broadly useful, propose moving it into `src/benchmarks/` as core surface (PR welcome). The bar is: novel rubric, reusable across projects, low maintenance burden.
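
The removed README states that `assignSplit` relies on `deterministicSplit(itemId, BENCHMARK_SPLIT_SEED)` so that the same item lands in the same split everywhere. The diff does not show how that function works, so the sketch below only illustrates the general technique of seeded, hash-based split assignment. The function `sketchSplit`, the FNV-1a hash, and the 60/20/20 proportions are all assumptions and should not be read as the package's behaviour.

```typescript
type Split = "search" | "dev" | "holdout";

// 32-bit FNV-1a over UTF-16 code units (a simplification of the byte-wise original).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Hypothetical stand-in for deterministicSplit: the same (itemId, seed)
// pair always maps to the same bucket, with no shared state or randomness.
function sketchSplit(itemId: string, seed: string): Split {
  const bucket = fnv1a(`${seed}:${itemId}`) % 100;
  if (bucket < 60) return "search"; // assumed 60% share
  if (bucket < 80) return "dev";    // assumed 20% share
  return "holdout";                 // assumed 20% share
}
```

Because the assignment depends only on the id and the seed, changing the seed reshuffles every item, which is consistent with the README's warning that the seed is load-bearing for reproducibility.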