npm - @tangle-network/agent-eval - Versions diffs - 0.6.0 → 0.7.1 - Mend

@tangle-network/agent-eval 0.6.0 → 0.7.1

Files changed (5)

package/README.md CHANGED

@@ -1,57 +1,32 @@
 # @tangle-network/agent-eval
-Domain-agnostic evaluation framework for Tangle agent apps. Multi-turn scenario execution, multi-judge scoring, agent-driver meta-testing, convergence tracking. Every agent (tax, legal, film, gtm) imports this to get a reproducible quality harness.
+Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).
 ## Install
 ```bash
-npm install @tangle-network/agent-eval
+pnpm add @tangle-network/agent-eval
 ```
 ## Usage
-```ts
-import { BenchmarkRunner, ProductClient, defaultJudges } from '@tangle-network/agent-eval'
-const client = new ProductClient({
-  baseUrl: 'https://my-agent.tangle.tools',
-  routes: {
-    signup: '/api/auth/sign-up/email',
-    chat: '/api/chat',
-    // ...
-  },
-})
-const runner = new BenchmarkRunner(client, {
-  scenarios: myScenarios,
-  judges: defaultJudges('film production'),
-  systemPrompt: MY_SYSTEM_PROMPT,
-})
-const report = await runner.run()
-```
-## What's in the box
+**→ [`.claude/skills/agent-eval/SKILL.md`](./.claude/skills/agent-eval/SKILL.md)** — single source of truth for every usage pattern. Covers: minimal builder-of-builders path, the seven muffled-gate footguns paid for in shipped bugs, the three-layer eval contract, regression tests worth writing, and "when to use what" for the 100+ exports.
-- **ProductClient** — configurable HTTP client (routes are config, not code)
-- **ScenarioRegistry** — auto-discovery + filtering
-- **executeScenario** — multi-turn executor with artifact collection
-- **BenchmarkRunner** — orchestrates scenarios + judges + scoring
-- **AgentDriver** — meta-agent that plays personas against a real product
-- **MetricsCollector** — per-turn product state metrics
-- **ConvergenceTracker** — completion% over turns
-- **Reporter** — markdown + console output
-- **Judges** — 4 built-in (domain expert, code execution, coherence, adversarial) + `createCustomJudge` factory
+If you're an LLM or agent reading this, load the skill file before writing integration code — it encodes 10+ incident-driven directives that will save you from rediscovering them.
-## Tier
+## Dev
-Marketplace tier of the [agent-builder](https://github.com/drewstone/tangle-agent-builder) three-tier architecture. Uses [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud) for judge LLM calls.
+```bash
+pnpm build        # tsup
+pnpm test         # vitest
+pnpm typecheck    # tsc --noEmit
+```
 ## Related
-- [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway) — the gateway agents published through
-- [`@tangle-network/agent-client`](https://github.com/tangle-network/agent-client) — consumer SDK for those endpoints
-- [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud) — platform SDK (used internally by judges)
+- [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
+- [`@tangle-network/agent-client`](https://github.com/tangle-network/agent-client)
+- [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud)
 ## License