@tangle-network/agent-eval 0.6.0 → 0.7.1
- package/README.md +13 -38
- package/dist/index.d.ts +661 -119
- package/dist/index.js +1186 -336
- package/dist/index.js.map +1 -1
- package/package.json +5 -1
package/README.md
CHANGED
@@ -1,57 +1,32 @@
 # @tangle-network/agent-eval
 
-
+Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).
 
 ## Install
 
 ```bash
-
+pnpm add @tangle-network/agent-eval
 ```
 
 ## Usage
 
-
-import { BenchmarkRunner, ProductClient, defaultJudges } from '@tangle-network/agent-eval'
-
-const client = new ProductClient({
-  baseUrl: 'https://my-agent.tangle.tools',
-  routes: {
-    signup: '/api/auth/sign-up/email',
-    chat: '/api/chat',
-    // ...
-  },
-})
-
-const runner = new BenchmarkRunner(client, {
-  scenarios: myScenarios,
-  judges: defaultJudges('film production'),
-  systemPrompt: MY_SYSTEM_PROMPT,
-})
-
-const report = await runner.run()
-```
-
-## What's in the box
+**→ [`.claude/skills/agent-eval/SKILL.md`](./.claude/skills/agent-eval/SKILL.md)** — single source of truth for every usage pattern. Covers: minimal builder-of-builders path, the seven muffled-gate footguns paid for in shipped bugs, the three-layer eval contract, regression tests worth writing, and "when to use what" for the 100+ exports.
 
-
-- **ScenarioRegistry** — auto-discovery + filtering
-- **executeScenario** — multi-turn executor with artifact collection
-- **BenchmarkRunner** — orchestrates scenarios + judges + scoring
-- **AgentDriver** — meta-agent that plays personas against a real product
-- **MetricsCollector** — per-turn product state metrics
-- **ConvergenceTracker** — completion% over turns
-- **Reporter** — markdown + console output
-- **Judges** — 4 built-in (domain expert, code execution, coherence, adversarial) + `createCustomJudge` factory
+If you're an LLM or agent reading this, load the skill file before writing integration code — it encodes 10+ incident-driven directives that will save you from rediscovering them.
 
-##
+## Dev
 
-
+```bash
+pnpm build # tsup
+pnpm test # vitest
+pnpm typecheck # tsc --noEmit
+```
 
 ## Related
 
-- [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
-- [`@tangle-network/agent-client`](https://github.com/tangle-network/agent-client)
-- [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud)
+- [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
+- [`@tangle-network/agent-client`](https://github.com/tangle-network/agent-client)
+- [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud)
 
 ## License
 