@tangle-network/agent-eval 0.6.0 → 0.7.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,57 +1,32 @@
1
1
  # @tangle-network/agent-eval
2
2
 
3
- Domain-agnostic evaluation framework for Tangle agent apps. Multi-turn scenario execution, multi-judge scoring, agent-driver meta-testing, convergence tracking. Every agent (tax, legal, film, gtm) imports this to get a reproducible quality harness.
3
+ Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).
4
4
 
5
5
  ## Install
6
6
 
7
7
  ```bash
8
- npm install @tangle-network/agent-eval
8
+ pnpm add @tangle-network/agent-eval
9
9
  ```
10
10
 
11
11
  ## Usage
12
12
 
13
- ```ts
14
- import { BenchmarkRunner, ProductClient, defaultJudges } from '@tangle-network/agent-eval'
15
-
16
- const client = new ProductClient({
17
- baseUrl: 'https://my-agent.tangle.tools',
18
- routes: {
19
- signup: '/api/auth/sign-up/email',
20
- chat: '/api/chat',
21
- // ...
22
- },
23
- })
24
-
25
- const runner = new BenchmarkRunner(client, {
26
- scenarios: myScenarios,
27
- judges: defaultJudges('film production'),
28
- systemPrompt: MY_SYSTEM_PROMPT,
29
- })
30
-
31
- const report = await runner.run()
32
- ```
33
-
34
- ## What's in the box
13
+ **→ [`.claude/skills/agent-eval/SKILL.md`](./.claude/skills/agent-eval/SKILL.md)** — single source of truth for every usage pattern. Covers: minimal builder-of-builders path, the seven muffled-gate footguns paid for in shipped bugs, the three-layer eval contract, regression tests worth writing, and "when to use what" for the 100+ exports.
35
14
 
36
- - **ProductClient**configurable HTTP client (routes are config, not code)
37
- - **ScenarioRegistry** — auto-discovery + filtering
38
- - **executeScenario** — multi-turn executor with artifact collection
39
- - **BenchmarkRunner** — orchestrates scenarios + judges + scoring
40
- - **AgentDriver** — meta-agent that plays personas against a real product
41
- - **MetricsCollector** — per-turn product state metrics
42
- - **ConvergenceTracker** — completion% over turns
43
- - **Reporter** — markdown + console output
44
- - **Judges** — 4 built-in (domain expert, code execution, coherence, adversarial) + `createCustomJudge` factory
15
+ If you're an LLM or agent reading this, load the skill file before writing integration code it encodes 10+ incident-driven directives that will save you from rediscovering them.
45
16
 
46
- ## Tier
17
+ ## Dev
47
18
 
48
- Marketplace tier of the [agent-builder](https://github.com/drewstone/tangle-agent-builder) three-tier architecture. Uses [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud) for judge LLM calls.
19
+ ```bash
20
+ pnpm build # tsup
21
+ pnpm test # vitest
22
+ pnpm typecheck # tsc --noEmit
23
+ ```
49
24
 
50
25
  ## Related
51
26
 
52
- - [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway) — the gateway agents published through
53
- - [`@tangle-network/agent-client`](https://github.com/tangle-network/agent-client) — consumer SDK for those endpoints
54
- - [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud) — platform SDK (used internally by judges)
27
+ - [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
28
+ - [`@tangle-network/agent-client`](https://github.com/tangle-network/agent-client)
29
+ - [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud)
55
30
 
56
31
  ## License
57
32