@lythos/skill-arena 0.13.3 → 0.14.0
- package/README.md +62 -86
- package/package.json +5 -5
- package/src/cli.ts +163 -0
package/README.md
CHANGED
@@ -1,140 +1,116 @@
 # @lythos/skill-arena
 
-
+> Controlled-variable benchmark for AI agent skills. Test single decks or compare A/B — agent-orchestrated by default, cross-player when you need it.
 
-
+## Modes at a Glance
 
-
+| Mode | How | When |
+|------|-----|------|
+| **Agent-Orchestrated** (DEFAULT) | Agent tool spawns subagents, parallel dispatch, native judge | Single deck test, cross-deck A/B comparison |
+| **Cross-Player** (OPT-IN) | CLI runner spawns different agent binaries via Bun.spawn | Comparing kimi vs codex vs claude |
 
-
-
-`skill-arena` scaffolds isolated environments where subagents complete the same task under different decks. A judge agent scores outputs across multiple dimensions. Supports:
-
-- **Mode 1**: Single-skill comparison (controlled variable — same helper skills, different test skill).
-- **Mode 2**: Full-deck comparison (Pareto frontier — no single winner, only optimal trade-offs).
-
-## Prerequisites
-
-Arena runs AI agents as subprocesses. You need at least one agent CLI installed:
-
-### Kimi CLI (recommended default)
-
-Kimi Code CLI is the default player for arena — it has reliable headless execution with eager tool loading (no deferred tool deadlock).
-
-```bash
-# Install via uv (recommended) — uv is Python's bunx equivalent
-uv tool install kimi-cli
-# Or run without installing:
-uvx kimi-cli --print -p "hello"
-
-# Authenticate
-kimi login
-# Or set API key:
-export KIMI_API_KEY=your_key
-```
-
-Docs: [https://github.com/MoonshotAI/kimi-cli](https://github.com/MoonshotAI/kimi-cli)
-
-### Claude CLI (secondary)
-
-```bash
-npm install -g @anthropic-ai/claude-code
-claude --version # should be ≥ 2.1.128
-```
-
-Note: Claude `-p` mode has known issues with web tools in Bun.spawn (deferred tool deadlock). Kimi is the default for reliability.
+**95% of arena use is agent-orchestrated.** The Agent tool can spawn parallel subagents with isolated workdirs and different decks — zero CLI. Cross-player mode is ONLY needed when comparing different agent CLIs (the Agent tool can only spawn same-type agents).
 
 ## Install
 
 ```bash
 bun add -d @lythos/skill-arena
 # or use directly
-bunx @lythos/skill-arena@0.
+bunx @lythos/skill-arena@0.14.0 <command>
 ```
 
 ## Quick Start
 
 ```bash
-#
+# single — test one deck (most common)
 bunx @lythos/skill-arena@latest single \
   --deck ./examples/decks/scout.toml \
   --brief "Generate auth flow diagram" \
-  --player kimi \
-  --timeout 300000 \
-  --out ./output
-
-# Single with remote deck (URL auto-fetched)
-bunx @lythos/skill-arena@latest single \
-  --deck https://raw.githubusercontent.com/lythos-labs/lythoskill/main/examples/decks/scout.toml \
-  --brief "Generate auth flow diagram" \
   --out ./output
 
-#
-
+# cross-deck vs — compare two decks (agent-orchestrated)
+# Create arena.toml declaring sides with different decks, then:
 bunx @lythos/skill-arena@latest vs --config ./arena.toml
+
+# cross-player vs — compare kimi vs codex (CLI only)
+bunx @lythos/skill-arena@latest vs --config ./arena.toml --player kimi
 ```
 
-**
-- Agent runs in an isolated `/tmp` workdir (no workspace pollution)
-- All artifacts are copied to `--out` after completion
-- Prompt template injects fixed contract (decision-log, robustness, tool preference) + your brief as variable
+**What happens**: Agent creates isolated `/tmp` workdir per side, `deck link` skills, spawns parallel subagents, collects artifacts, judge scores outputs. Parent deck restored after.
 
 ## Commands
 
-###
+### `single` — one deck, one task
 
 ```bash
-
-
-
-
-
+bunx @lythos/skill-arena@latest single \
+  --deck ./deck.toml \
+  --brief "Produce a .docx report with radar chart" \
+  --timeout 600000 \
+  --out ./output
 ```
 
-###
+### `vs` — multi-deck comparison
 
+```bash
+bunx @lythos/skill-arena@latest vs --config ./arena.toml
+bunx @lythos/skill-arena@latest vs --config ./arena.toml --dry-run
 ```
-
-
+
+### `scaffold` — legacy directory setup
+
+```bash
+bunx @lythos/skill-arena@latest scaffold \
+  --task "Generate auth flow diagram" \
+  --decks "./decks/minimal.toml,./decks/rich.toml"
 ```
 
-###
+### `viz` — render results
 
 ```bash
-bunx @lythos/skill-arena@
+bunx @lythos/skill-arena@latest viz runs/arena-<id>/
 ```
 
-##
+## Parameters
 
-
-
-
+| Flag | Command | Description |
+|------|---------|-------------|
+| `--brief "<text>"` | single | Inline task brief |
+| `--deck <path\|url>` | single | Deck file (URL auto-fetched) |
+| `--player <name>` | single, vs | Only for cross-player: kimi\|codex\|deepseek\|claude |
+| `--timeout <ms>` | single | Subagent timeout (300000–600000 for complex tasks) |
+| `--out <dir>` | single, vs | Output directory |
+| `--config <path>` | vs | arena.toml |
+| `--dry-run` | vs | Print plan without execution |
 
-##
+## Prerequisites (cross-player only)
 
-
+For cross-player mode, install at least one agent CLI:
 
-```
-
-
-
+```bash
+uv tool install kimi-cli # kimi (recommended default)
+npm i -g @openai/codex # codex
+# deepseek: bundled with desktop app or pip install deepseek-cli
+# claude: set ANTHROPIC_API_KEY (SDK, no CLI binary needed)
 ```
 
-
+## Skill Documentation
+
+The agent-visible skill layer: [skill/SKILL.md](./skill/SKILL.md)
+
+## Architecture
 
 ```
 arena.toml → ArenaToml (Zod) → ExecutionPlan (pure) → per-cell agent spawn (IO)
-
-
-
-
+↓
+aggregateAllStats (pure) ← verdicts[]
+↓
+runComparativeJudge (IO) → report.md + Pareto frontier
 ```
 
-- **Intent**: `arena.toml` declarative config
+- **Intent**: `arena.toml` declarative config
 - **Plan**: `buildExecutionPlan()`, `aggregateSideStats()`, `computePareto()` — pure functions
-- **Execute**:
-
-Built on `@lythos/test-utils` shared infrastructure.
+- **Execute**: Agent tool spawn (agent-orchestrated) or `AgentAdapter` (cross-player)
 
 ## License
 
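The README's architecture section ends in a "Pareto frontier" and names a pure `computePareto()` function, but this diff does not show how it is implemented. As a rough illustration of the idea only, the sketch below keeps every side that no other side beats on all dimensions. The `SideScore` shape, the dimension names, and the `paretoFrontier` helper are assumptions made for this example, not the package's API.

```typescript
// Hypothetical shape: one aggregated score per dimension for each side (higher is better).
interface SideScore {
  side: string
  scores: Record<string, number>
}

// a dominates b when a is at least as good on every dimension and strictly better on one.
// Assumption: both sides report the same dimensions; a missing one counts as 0.
function dominates(a: SideScore, b: SideScore): boolean {
  let strictlyBetter = false
  for (const dim of Object.keys(a.scores)) {
    const av = a.scores[dim] ?? 0
    const bv = b.scores[dim] ?? 0
    if (av < bv) return false
    if (av > bv) strictlyBetter = true
  }
  return strictlyBetter
}

// The frontier is every side that no other side dominates.
function paretoFrontier(sides: SideScore[]): SideScore[] {
  return sides.filter(
    candidate => !sides.some(other => other !== candidate && dominates(other, candidate)),
  )
}

const frontier = paretoFrontier([
  { side: 'minimal', scores: { quality: 6, speed: 9 } },
  { side: 'rich', scores: { quality: 9, speed: 5 } },
  { side: 'bloated', scores: { quality: 5, speed: 4 } },
])
// 'minimal' and 'rich' trade quality against speed; 'bloated' loses to both.
console.log(frontier.map(s => s.side))
```

With more than one side left on the frontier there is no single winner, which is why a report would list trade-offs rather than a ranking.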
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@lythos/skill-arena",
-  "version": "0.
+  "version": "0.14.0",
   "description": "Skill Arena — benchmark skill effectiveness with controlled-variable comparison",
   "keywords": [
     "ai-agent",
@@ -42,13 +42,13 @@
     "bun": ">=1.0.0"
   },
   "dependencies": {
-    "@lythos/cold-pool": "^0.
-    "@lythos/infra": "^0.
-    "@lythos/test-utils": "^0.
+    "@lythos/cold-pool": "^0.14.0",
+    "@lythos/infra": "^0.14.0",
+    "@lythos/test-utils": "^0.14.0",
     "zod": "^3.24.0",
     "zod-to-json-schema": "^3.25.2"
   },
   "optionalDependencies": {
-    "@lythos/agent-adapter-claude-sdk": "^0.
+    "@lythos/agent-adapter-claude-sdk": "^0.14.0"
   }
 }
package/src/cli.ts
CHANGED
@@ -100,6 +100,8 @@ Examples:
   lythoskill-arena vs --config arena.toml --dry-run
   lythoskill-arena vs --config arena.toml
   lythoskill-arena viz runs/arena-20260504
+  lythoskill-arena prepare-workdir --deck ./decks/scout.toml --out /tmp/arena-20260517-side-a
+  lythoskill-arena archive --from /tmp/arena-20260517 --to playground/arena-20260517 --sides side-a,side-b
 `)
     process.exit(0)
   }
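The two new help examples bracket an agent-orchestrated run: `prepare-workdir` sets up one side's workdir, and `archive` later copies results out. Going by the `archiveRun` code added further down in this file, the archive step skips the setup files and treats a missing `--sides` flag as a single side named `.`. Here is a minimal sketch of just that selection logic; the helper names `parseSides` and `entriesToArchive` and the sample file names are hypothetical.

```typescript
// Setup files the archive step leaves behind (values copied from skipSet in the diff).
const SKIP = new Set(['.claude', 'skill-deck.toml', 'skill-deck.lock', 'AGENTS.md'])

// Hypothetical helper mirroring `opts.sides ? opts.sides.split(',') : ['.']`.
function parseSides(flag: string | undefined): string[] {
  return flag ? flag.split(',') : ['.']
}

// Hypothetical helper: which entries of a side's workdir would be copied.
function entriesToArchive(names: string[], skip: Set<string> = SKIP): string[] {
  return names.filter(name => !skip.has(name))
}

const sides = parseSides('side-a,side-b')
const kept = entriesToArchive([
  '.claude',
  'AGENTS.md',
  'skill-deck.toml',
  'report.docx',
  'decision-log.jsonl',
])
console.log(sides, kept)
```

Keeping the selection separate from the `cpSync` calls is only a choice made for this sketch; the code in the diff does both inline.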
@@ -113,6 +115,8 @@ function cli(args: string[]) {
   if (cmd === 'vs' || cmd === 'compare') return vsRun(rest)
   if (cmd === 'single' || cmd === 'run') return singleRun(rest)
   if (cmd === 'viz') return vizRun(rest)
+  if (cmd === 'prepare-workdir') return prepareWorkdir(rest)
+  if (cmd === 'archive') return archiveRun(rest)
 
   console.error(`Unknown command: ${cmd}`)
   process.exit(1)
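Both new subcommands parse their flags with the same hand-rolled loop (see `prepareWorkdir` and `archiveRun` in the next hunk): match a flag or its short alias, then consume the following token as its value with `args[++i]`. The sketch below generalizes that pattern for illustration. `parseFlags` and the alias map are hypothetical; the real functions hard-code each flag.

```typescript
// Hypothetical generic form of the flag loop used by prepareWorkdir and archiveRun.
function parseFlags(
  args: string[],
  aliases: Map<string, string>,
): Record<string, string | undefined> {
  const opts: Record<string, string | undefined> = {}
  for (let i = 0; i < args.length; i++) {
    const arg = args[i]
    if (arg === undefined) continue
    const key = aliases.get(arg)
    // A known flag consumes the next token as its value; unknown tokens are skipped.
    if (key !== undefined) opts[key] = args[++i]
  }
  return opts
}

const aliases = new Map([
  ['--deck', 'deck'], ['-d', 'deck'],
  ['--out', 'out'], ['-o', 'out'],
  ['--brief', 'brief'], ['-b', 'brief'],
])

const parsed = parseFlags(
  ['--deck', './decks/scout.toml', '-o', '/tmp/arena-side-a', '--verbose'],
  aliases,
)
console.log(parsed)
```

Two properties of this style are worth knowing: unrecognized tokens such as `--verbose` are silently ignored, and a flag given as the last argument ends up with an `undefined` value rather than an error.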
@@ -431,6 +435,165 @@ async function vizRun(args: string[]) {
   console.log(`📈 Arena HTML report not yet implemented. See report.md in ${runsDir}/`)
 }
 
+// ═══════════════════════════════════════════════════════════════════════════
+// ── prepare-workdir: reusable workdir setup (used by both CLI and agent) ──
+// Intent: create an isolated arena workdir with deck linked and ready to run
+
+async function prepareWorkdir(args: string[]) {
+  const opts: Record<string, string | undefined> = {}
+  for (let i = 0; i < args.length; i++) {
+    if (args[i] === '--deck' || args[i] === '-d') opts.deck = args[++i]
+    else if (args[i] === '--out' || args[i] === '-o') opts.out = args[++i]
+    else if (args[i] === '--brief' || args[i] === '-b') opts.brief = args[++i]
+  }
+
+  if (!opts.deck) {
+    console.error(`❌ --deck <path> is required.
+  lythoskill-arena prepare-workdir --deck ./skill-deck.toml --out /tmp/arena-side-a`)
+    process.exit(1)
+  }
+
+  const deckPath = resolve(opts.deck)
+  if (!existsSync(deckPath)) {
+    console.error(`❌ Deck file not found: ${deckPath}`)
+    process.exit(1)
+  }
+
+  const workDir = opts.out
+    ? resolve(opts.out)
+    : join(tmpdir(), `arena-${Date.now()}`)
+  mkdirSync(workDir, { recursive: true })
+
+  // Copy deck into workdir
+  writeFileSync(join(workDir, 'skill-deck.toml'), readFileSync(deckPath, 'utf-8'))
+
+  // Write AGENTS.md (same contract as CLI singleRun)
+  writeFileSync(join(workDir, 'AGENTS.md'), [
+    '# Arena Test Environment',
+    '**Mode**: agent-orchestrated cell',
+    '',
+    '## Setup Order (why this sequence)',
+    '1. `skill-deck.toml` copied here → declares which skills you can use',
+    '2. `deck link` runs → cold pool skills become visible in `.claude/skills/`',
+    '3. Skill existence checked → warns if any declared skill is missing from cold pool',
+    '4. `AGENTS.md` written last → confirms setup succeeded before agent starts',
+    'If setup fails mid-sequence, the workdir is incomplete and nothing runs.',
+    '',
+    '## How This Works',
+    '- Write ALL output files to this directory (CWD).',
+    '- Use available skills — check `ls .claude/skills/`.',
+    '',
+    '## Output Contract',
+    '- MANDATORY: `decision-log.jsonl` — one JSON line per decision:',
+    ' `{"t":<seconds>,"phase":"setup|content|design|output","decision":"...","reason":"..."}`',
+  ].join('\n'))
+
+  // Parse deck for link + checks
+  const deckRaw = readFileSync(join(workDir, 'skill-deck.toml'), 'utf-8')
+  let deckParsed: Record<string, any> = {}
+  try { deckParsed = Bun.TOML.parse(deckRaw) as Record<string, any> } catch {}
+  const hasSkills = parseDeckSkills(deckParsed).length > 0
+
+  if (hasSkills) {
+    const { existsSync: es2 } = await import('node:fs')
+    const localDeckCli = join(import.meta.dir, '..', '..', 'lythoskill-deck', 'src', 'cli.ts')
+    const linkCmd = es2(localDeckCli)
+      ? ['bun', localDeckCli, 'link']
+      : ['bunx', '@lythos/skill-deck', 'link']
+    const linkProc = Bun.spawn(linkCmd,
+      { cwd: workDir, env: { ...process.env, HOME: process.env.HOME! } },
+    )
+    await linkProc.exited
+    const linkStderr = await new Response(linkProc.stderr).text()
+    const linkResult = validateLinkResult(linkProc.exitCode, linkStderr)
+    if (!linkResult.ok) {
+      console.error(`❌ ${linkResult.error}`)
+      process.exit(1)
+    }
+  } else {
+    console.log('ℹ️ No skills declared in deck — skipping link')
+  }
+
+  // Skill existence check
+  try {
+    const coldPoolDefault = join(homedir(), '.agents', 'skill-repos')
+    const coldPoolDir = resolveColdPoolDir(deckParsed?.deck?.cold_pool, homedir(), coldPoolDefault)
+    const skills = parseDeckSkills(deckParsed)
+    const checks = checkSkillExistence(skills, coldPoolDir, existsSync)
+    for (const warning of formatSkillWarnings(checks)) {
+      console.warn(`⚠️ ${warning}`)
+    }
+  } catch (e) {
+    console.warn('⚠️ Could not check skill existence:', e instanceof Error ? e.message : e)
+  }
+
+  console.log(`✅ Workdir ready → ${workDir}`)
+  console.log(` deck: ${deckPath}`)
+  if (opts.brief) console.log(` brief: ${opts.brief!.slice(0, 60)}...`)
+}
+
+// ═══════════════════════════════════════════════════════════════════════════
+// ── archive: copy agent outputs from workdir(s) to outDir ─────────────────
+// Intent: same copy behavior as CLI singleRun, reusable for agent-orchestrated
+
+async function archiveRun(args: string[]) {
+  const opts: Record<string, string | undefined> = {}
+  for (let i = 0; i < args.length; i++) {
+    if (args[i] === '--from' || args[i] === '-f') opts.from = args[++i]
+    else if (args[i] === '--to' || args[i] === '-o') opts.to = args[++i]
+    else if (args[i] === '--sides') opts.sides = args[++i]
+    else if (args[i] === '--report') opts.report = args[++i]
+  }
+
+  if (!opts.from || !opts.to) {
+    console.error(`❌ --from <workdir> and --to <outdir> are required.
+  lythoskill-arena archive --from /tmp/arena-20260517 --to playground/arena-20260517 --sides side-a,side-b --report ./report.md`)
+    process.exit(1)
+  }
+
+  const fromDir = resolve(opts.from)
+  const outDir = resolve(opts.to)
+  mkdirSync(outDir, { recursive: true })
+
+  // Copy report if provided
+  if (opts.report && existsSync(resolve(opts.report))) {
+    const { cpSync } = await import('node:fs')
+    cpSync(resolve(opts.report), join(outDir, 'report.md'))
+    console.log(`📄 report.md → ${outDir}/report.md`)
+  }
+
+  // Copy per-side outputs (same skipSet as CLI singleRun)
+  const { cpSync, readdirSync } = await import('node:fs')
+  const skipSet = new Set(['.claude', 'skill-deck.toml', 'skill-deck.lock', 'AGENTS.md'])
+
+  const sides = opts.sides ? opts.sides.split(',') : ['.']
+  for (const side of sides) {
+    const sideWorkDir = side === '.' ? fromDir : join(fromDir, side)
+    if (!existsSync(sideWorkDir)) {
+      console.warn(`⚠️ Side workdir not found: ${sideWorkDir}`)
+      continue
+    }
+
+    const sideOutDir = join(outDir, side)
+    mkdirSync(sideOutDir, { recursive: true })
+
+    const entries = readdirSync(sideWorkDir, { withFileTypes: true })
+    for (const entry of entries) {
+      if (skipSet.has(entry.name)) continue
+      const src = join(sideWorkDir, entry.name)
+      const dest = join(sideOutDir, entry.name)
+      try {
+        cpSync(src, dest, { recursive: entry.isDirectory() })
+        console.log(` ${side}/${entry.name} → ${dest}`)
+      } catch (e) {
+        console.warn(`⚠️ Failed to copy ${side}/${entry.name}: ${e instanceof Error ? e.message : e}`)
+      }
+    }
+  }
+
+  console.log(`✅ Archive complete → ${outDir}`)
+}
+
 // ── Entry point ────────────────────────────────────────────────────────────
 if (import.meta.main) {
   main().catch(err => {
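The `AGENTS.md` written by `prepareWorkdir` makes `decision-log.jsonl` mandatory, with one JSON object per line. A harness could check that contract along the lines below. This is a sketch, assuming `phase` must be exactly one of the four listed values and that `t` is a number; `parseDecisionLog` is a hypothetical helper, not part of the package.

```typescript
// Phases listed in the output contract; treating them as a closed set is an assumption.
const PHASES = ['setup', 'content', 'design', 'output'] as const
type Phase = (typeof PHASES)[number]

interface DecisionEntry {
  t: number
  phase: Phase
  decision: string
  reason: string
}

// Hypothetical validator: returns well-formed entries plus one error per bad line.
function parseDecisionLog(text: string): { entries: DecisionEntry[]; errors: string[] } {
  const entries: DecisionEntry[] = []
  const errors: string[] = []
  text.split('\n').forEach((line, idx) => {
    if (line.trim() === '') return // tolerate blank lines and a trailing newline
    let raw: unknown
    try {
      raw = JSON.parse(line)
    } catch {
      errors.push(`line ${idx + 1}: invalid JSON`)
      return
    }
    if (typeof raw !== 'object' || raw === null) {
      errors.push(`line ${idx + 1}: not a JSON object`)
      return
    }
    const r = raw as Record<string, unknown>
    const phase = r.phase
    const valid =
      typeof r.t === 'number' &&
      typeof phase === 'string' &&
      (PHASES as readonly string[]).indexOf(phase) !== -1 &&
      typeof r.decision === 'string' &&
      typeof r.reason === 'string'
    if (!valid) {
      errors.push(`line ${idx + 1}: does not match the decision-log contract`)
      return
    }
    entries.push({
      t: r.t as number,
      phase: phase as Phase,
      decision: r.decision as string,
      reason: r.reason as string,
    })
  })
  return { entries, errors }
}

const sample = [
  '{"t":1,"phase":"setup","decision":"use docx skill","reason":"brief asks for .docx"}',
  'not json',
  '{"t":2,"phase":"review","decision":"x","reason":"y"}',
].join('\n')
const result = parseDecisionLog(sample)
console.log(result.entries.length, result.errors)
```

Collecting errors per line instead of throwing on the first one lets a judge still read the valid decisions from a partially malformed log.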