@lythos/skill-arena 0.13.2 → 0.14.0

Files changed (3)
  1. package/README.md +62 -86
  2. package/package.json +5 -5
  3. package/src/cli.ts +163 -0
package/README.md CHANGED
@@ -1,140 +1,116 @@
1
1
  # @lythos/skill-arena
2
2
 
3
- ![CI](https://img.shields.io/badge/CI-41%20unit%20tests-brightgreen) ![Intent/Plan](https://img.shields.io/badge/arch-intent%2Fplan%2Fexecute-8A2BE2)
3
+ > Controlled-variable benchmark for AI agent skills. Test single decks or compare A/B — agent-orchestrated by default, cross-player when you need it.
4
4
 
5
- > Controlled-variable benchmark for AI agent skills. Compare skills, decks, or configurations on the same task — single-skill A/B or full-deck Pareto frontier analysis. Now with declarative `arena.toml` (k8s-manifest style) and deterministic Pareto frontier.
5
+ ## Modes at a Glance
6
6
 
7
- ## Why
7
+ | Mode | How | When |
8
+ |------|-----|------|
9
+ | **Agent-Orchestrated** (DEFAULT) | Agent tool spawns subagents, parallel dispatch, native judge | Single deck test, cross-deck A/B comparison |
10
+ | **Cross-Player** (OPT-IN) | CLI runner spawns different agent binaries via Bun.spawn | Comparing kimi vs codex vs claude |
8
11
 
9
- "Which skill is better?" is the wrong question. The right question is "which skill is better for what."
10
-
11
- `skill-arena` scaffolds isolated environments where subagents complete the same task under different decks. A judge agent scores outputs across multiple dimensions. Supports:
12
-
13
- - **Mode 1**: Single-skill comparison (controlled variable — same helper skills, different test skill).
14
- - **Mode 2**: Full-deck comparison (Pareto frontier — no single winner, only optimal trade-offs).
15
-
16
- ## Prerequisites
17
-
18
- Arena runs AI agents as subprocesses. You need at least one agent CLI installed:
19
-
20
- ### Kimi CLI (recommended default)
21
-
22
- Kimi Code CLI is the default player for arena — it has reliable headless execution with eager tool loading (no deferred tool deadlock).
23
-
24
- ```bash
25
- # Install via uv (recommended) — uv is Python's bunx equivalent
26
- uv tool install kimi-cli
27
- # Or run without installing:
28
- uvx kimi-cli --print -p "hello"
29
-
30
- # Authenticate
31
- kimi login
32
- # Or set API key:
33
- export KIMI_API_KEY=your_key
34
- ```
35
-
36
- Docs: [https://github.com/MoonshotAI/kimi-cli](https://github.com/MoonshotAI/kimi-cli)
37
-
38
- ### Claude CLI (secondary)
39
-
40
- ```bash
41
- npm install -g @anthropic-ai/claude-code
42
- claude --version # should be ≥ 2.1.128
43
- ```
44
-
45
- Note: Claude `-p` mode has known issues with web tools in Bun.spawn (deferred tool deadlock). Kimi is the default for reliability.
12
+ **95% of arena use is agent-orchestrated.** The Agent tool can spawn parallel subagents with isolated workdirs and different decks — zero CLI. Cross-player mode is ONLY needed when comparing different agent CLIs (the Agent tool can only spawn same-type agents).
46
13
 
47
14
  ## Install
48
15
 
49
16
  ```bash
50
17
  bun add -d @lythos/skill-arena
51
18
  # or use directly
52
- bunx @lythos/skill-arena@0.13.2 <command>
19
+ bunx @lythos/skill-arena@0.14.0 <command>
53
20
  ```
54
21
 
55
22
  ## Quick Start
56
23
 
57
24
  ```bash
58
- # Single: test a deck with one agent (most common)
25
+ # single: test one deck (most common)
59
26
  bunx @lythos/skill-arena@latest single \
60
27
  --deck ./examples/decks/scout.toml \
61
28
  --brief "Generate auth flow diagram" \
62
- --player kimi \
63
- --timeout 300000 \
64
- --out ./output
65
-
66
- # Single with remote deck (URL auto-fetched)
67
- bunx @lythos/skill-arena@latest single \
68
- --deck https://raw.githubusercontent.com/lythos-labs/lythoskill/main/examples/decks/scout.toml \
69
- --brief "Generate auth flow diagram" \
70
29
  --out ./output
71
30
 
72
- # Vs: compare multiple decks side by side
73
- curl -fsSL https://raw.githubusercontent.com/lythos-labs/lythoskill/main/examples/arena/research-compare/arena.toml > arena.toml
31
+ # cross-deck vs — compare two decks (agent-orchestrated)
32
+ # Create arena.toml declaring sides with different decks, then:
74
33
  bunx @lythos/skill-arena@latest vs --config ./arena.toml
34
+
35
+ # cross-player vs — compare kimi vs codex (CLI only)
36
+ bunx @lythos/skill-arena@latest vs --config ./arena.toml --player kimi
75
37
  ```
76
38
 
77
- **Default behavior:**
78
- - Agent runs in an isolated `/tmp` workdir (no workspace pollution)
79
- - All artifacts are copied to `--out` after completion
80
- - Prompt template injects fixed contract (decision-log, robustness, tool preference) + your brief as variable
39
+ **What happens**: Agent creates isolated `/tmp` workdir per side, `deck link` skills, spawns parallel subagents, collects artifacts, judge scores outputs. Parent deck restored after.
81
40
 
82
41
  ## Commands
83
42
 
84
- ### Declarative mode (k8s-style, recommended)
43
+ ### `single`: one deck, one task
85
44
 
86
45
  ```bash
87
- # Print execution plan without running
88
- bunx @lythos/skill-arena@0.13.2 vs --config arena.toml --dry-run
89
-
90
- # Execute with per-side runs_per_side and statistical aggregation
91
- bunx @lythos/skill-arena@0.13.2 vs --config arena.toml
46
+ bunx @lythos/skill-arena@latest single \
47
+ --deck ./deck.toml \
48
+ --brief "Produce a .docx report with radar chart" \
49
+ --timeout 600000 \
50
+ --out ./output
92
51
  ```
93
52
 
94
- ### Scaffold mode (legacy, manual execution)
53
+ ### `vs`: multi-deck comparison
95
54
 
55
+ ```bash
56
+ bunx @lythos/skill-arena@latest vs --config ./arena.toml
57
+ bunx @lythos/skill-arena@latest vs --config ./arena.toml --dry-run
96
58
  ```
97
- bunx @lythos/skill-arena@0.13.2 scaffold --task "Generate auth flow diagram" \
98
- --decks https://raw.githubusercontent.com/lythos-labs/lythoskill/main/examples/decks/scout.toml,https://raw.githubusercontent.com/lythos-labs/lythoskill/main/examples/decks/documents.toml
59
+
60
+ ### `scaffold` — legacy directory setup
61
+
62
+ ```bash
63
+ bunx @lythos/skill-arena@latest scaffold \
64
+ --task "Generate auth flow diagram" \
65
+ --decks "./decks/minimal.toml,./decks/rich.toml"
99
66
  ```
100
67
 
101
- ### Viz
68
+ ### `viz` — render results
102
69
 
103
70
  ```bash
104
- bunx @lythos/skill-arena@0.13.2 viz runs/arena-<id>/
71
+ bunx @lythos/skill-arena@latest viz runs/arena-<id>/
105
72
  ```
106
73
 
107
- ## Skill Documentation
74
+ ## Parameters
108
75
 
109
- This package is the **Starter** layer (CLI implementation).
110
- The agent-visible **Skill** layer documentation is here:
111
- [packages/lythoskill-arena/skill/SKILL.md](../../packages/lythoskill-arena/skill/SKILL.md)
76
+ | Flag | Command | Description |
77
+ |------|---------|-------------|
78
+ | `--brief "<text>"` | single | Inline task brief |
79
+ | `--deck <path\|url>` | single | Deck file (URL auto-fetched) |
80
+ | `--player <name>` | single, vs | Only for cross-player: kimi\|codex\|deepseek\|claude |
81
+ | `--timeout <ms>` | single | Subagent timeout (300000–600000 for complex tasks) |
82
+ | `--out <dir>` | single, vs | Output directory |
83
+ | `--config <path>` | vs | arena.toml |
84
+ | `--dry-run` | vs | Print plan without execution |
112
85
 
113
- ## Architecture
86
+ ## Prerequisites (cross-player only)
114
87
 
115
- Part of the [lythoskill](https://github.com/lythos-labs/lythoskill) ecosystem. The thin-skill pattern separates heavy logic (this npm package) from lightweight agent instructions (SKILL.md).
88
+ For cross-player mode, install at least one agent CLI:
116
89
 
117
- ```
118
- Starter (this package) npm publish → bunx @lythos/skill-arena@0.13.2 ...
119
- Skill (packages/<name>/skill/) → build SKILL.md + thin scripts
120
- Output (skills/<name>/) → git commit agent-visible skill
90
+ ```bash
91
+ uv tool install kimi-cli # kimi (recommended default)
92
+ npm i -g @openai/codex # codex
93
+ # deepseek: bundled with desktop app or pip install deepseek-cli
94
+ # claude: set ANTHROPIC_API_KEY (SDK, no CLI binary needed)
121
95
  ```
122
96
 
123
- ### Runtime architecture (intent/plan/execute)
97
+ ## Skill Documentation
98
+
99
+ The agent-visible skill layer: [skill/SKILL.md](./skill/SKILL.md)
100
+
101
+ ## Architecture
124
102
 
125
103
  ```
126
104
  arena.toml → ArenaToml (Zod) → ExecutionPlan (pure) → per-cell agent spawn (IO)
127
-
128
- aggregateAllStats (pure) ← verdicts[]
129
-
130
- runComparativeJudge (IO) → report.md + Pareto frontier
105
+
106
+ aggregateAllStats (pure) ← verdicts[]
107
+
108
+ runComparativeJudge (IO) → report.md + Pareto frontier
131
109
  ```
132
110
 
133
- - **Intent**: `arena.toml` declarative config (k8s-manifest style)
111
+ - **Intent**: `arena.toml` declarative config
134
112
  - **Plan**: `buildExecutionPlan()`, `aggregateSideStats()`, `computePareto()` — pure functions
135
- - **Execute**: `runAgentScenario` per cell, `runComparativeJudge` IO via `AgentAdapter`
136
-
137
- Built on `@lythos/test-utils` shared infrastructure.
113
+ - **Execute**: Agent tool spawn (agent-orchestrated) or `AgentAdapter` (cross-player)
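Review note: the README names `computePareto()` as a pure planning function, but the diff does not show its body or signature. The sketch below is one plausible shape for a deterministic Pareto frontier over per-side scores. The `SideScore` type, the `paretoFrontier` name, and the "higher is better" convention are assumptions made for illustration, not the package's API.

```typescript
// Hypothetical shape: the real verdict/score types are not visible in this diff.
type SideScore = { side: string; scores: Record<string, number> }

// a dominates b when a is at least as good on every dimension
// and strictly better on at least one (higher is better here).
function dominates(a: SideScore, b: SideScore, dims: string[]): boolean {
  let strictlyBetter = false
  for (const d of dims) {
    if (a.scores[d] < b.scores[d]) return false
    if (a.scores[d] > b.scores[d]) strictlyBetter = true
  }
  return strictlyBetter
}

// The frontier is every side that no other side dominates.
// Sorting the names makes the output independent of input order.
function paretoFrontier(sides: SideScore[], dims: string[]): string[] {
  return sides
    .filter(s => !sides.some(o => o !== s && dominates(o, s, dims)))
    .map(s => s.side)
    .sort()
}

const example: SideScore[] = [
  { side: 'side-a', scores: { quality: 8, speed: 5 } },
  { side: 'side-b', scores: { quality: 6, speed: 9 } },
  { side: 'side-c', scores: { quality: 5, speed: 4 } },
]

// side-c loses to side-a on both dimensions, so only side-a and side-b remain.
console.log(paretoFrontier(example, ['quality', 'speed']))
```

Because neither `side-a` nor `side-b` beats the other on both dimensions, both stay on the frontier, which matches the "no single winner, only optimal trade-offs" framing in the removed 0.13.2 README text.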
138
114
 
139
115
  ## License
140
116
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@lythos/skill-arena",
3
- "version": "0.13.2",
3
+ "version": "0.14.0",
4
4
  "description": "Skill Arena — benchmark skill effectiveness with controlled-variable comparison",
5
5
  "keywords": [
6
6
  "ai-agent",
@@ -42,13 +42,13 @@
42
42
  "bun": ">=1.0.0"
43
43
  },
44
44
  "dependencies": {
45
- "@lythos/cold-pool": "^0.13.2",
46
- "@lythos/infra": "^0.13.2",
47
- "@lythos/test-utils": "^0.13.2",
45
+ "@lythos/cold-pool": "^0.14.0",
46
+ "@lythos/infra": "^0.14.0",
47
+ "@lythos/test-utils": "^0.14.0",
48
48
  "zod": "^3.24.0",
49
49
  "zod-to-json-schema": "^3.25.2"
50
50
  },
51
51
  "optionalDependencies": {
52
- "@lythos/agent-adapter-claude-sdk": "^0.13.2"
52
+ "@lythos/agent-adapter-claude-sdk": "^0.14.0"
53
53
  }
54
54
  }
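Review note: the sibling `@lythos/*` ranges have to move together with the version bump because, under npm's semver rules, a caret range on a `0.x` version only admits patch-level updates: `^0.13.2` means `>=0.13.2 <0.14.0`. The helper below is a simplified sketch of that rule, not the `semver` package. It handles only plain `MAJOR.MINOR.PATCH` strings (no prerelease or build tags), and the function names are made up for this note.

```typescript
type Version = [number, number, number]

function parseVersion(v: string): Version {
  const parts = v.split('.').map(Number)
  if (parts.length !== 3 || parts.some(n => !Number.isInteger(n) || n < 0)) {
    throw new Error(`unsupported version: ${v}`)
  }
  return [parts[0], parts[1], parts[2]]
}

function compareVersions(a: Version, b: Version): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i]
  }
  return 0
}

// Caret keeps the left-most non-zero component fixed.
function caretUpperBound([major, minor, patch]: Version): Version {
  if (major > 0) return [major + 1, 0, 0]
  if (minor > 0) return [0, minor + 1, 0]
  return [0, 0, patch + 1]
}

function satisfiesCaret(version: string, range: string): boolean {
  if (range[0] !== '^') throw new Error(`not a caret range: ${range}`)
  const lower = parseVersion(range.slice(1))
  const v = parseVersion(version)
  return compareVersions(v, lower) >= 0 && compareVersions(v, caretUpperBound(lower)) < 0
}

console.log(satisfiesCaret('0.14.0', '^0.13.2')) // false: 0.14.0 is outside >=0.13.2 <0.14.0
console.log(satisfiesCaret('0.14.0', '^0.14.0')) // true
console.log(satisfiesCaret('3.25.2', '^3.24.0')) // true: for 1.x and above, minor bumps are allowed
```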
package/src/cli.ts CHANGED
@@ -100,6 +100,8 @@ Examples:
100
100
  lythoskill-arena vs --config arena.toml --dry-run
101
101
  lythoskill-arena vs --config arena.toml
102
102
  lythoskill-arena viz runs/arena-20260504
103
+ lythoskill-arena prepare-workdir --deck ./decks/scout.toml --out /tmp/arena-20260517-side-a
104
+ lythoskill-arena archive --from /tmp/arena-20260517 --to playground/arena-20260517 --sides side-a,side-b
103
105
  `)
104
106
  process.exit(0)
105
107
  }
@@ -113,6 +115,8 @@ function cli(args: string[]) {
113
115
  if (cmd === 'vs' || cmd === 'compare') return vsRun(rest)
114
116
  if (cmd === 'single' || cmd === 'run') return singleRun(rest)
115
117
  if (cmd === 'viz') return vizRun(rest)
118
+ if (cmd === 'prepare-workdir') return prepareWorkdir(rest)
119
+ if (cmd === 'archive') return archiveRun(rest)
116
120
 
117
121
  console.error(`Unknown command: ${cmd}`)
118
122
  process.exit(1)
@@ -431,6 +435,165 @@ async function vizRun(args: string[]) {
431
435
  console.log(`📈 Arena HTML report not yet implemented. See report.md in ${runsDir}/`)
432
436
  }
433
437
 
438
+ // ═══════════════════════════════════════════════════════════════════════════
439
+ // ── prepare-workdir: reusable workdir setup (used by both CLI and agent) ──
440
+ // Intent: create an isolated arena workdir with deck linked and ready to run
441
+
442
+ async function prepareWorkdir(args: string[]) {
443
+ const opts: Record<string, string | undefined> = {}
444
+ for (let i = 0; i < args.length; i++) {
445
+ if (args[i] === '--deck' || args[i] === '-d') opts.deck = args[++i]
446
+ else if (args[i] === '--out' || args[i] === '-o') opts.out = args[++i]
447
+ else if (args[i] === '--brief' || args[i] === '-b') opts.brief = args[++i]
448
+ }
449
+
450
+ if (!opts.deck) {
451
+ console.error(`❌ --deck <path> is required.
452
+ lythoskill-arena prepare-workdir --deck ./skill-deck.toml --out /tmp/arena-side-a`)
453
+ process.exit(1)
454
+ }
455
+
456
+ const deckPath = resolve(opts.deck)
457
+ if (!existsSync(deckPath)) {
458
+ console.error(`❌ Deck file not found: ${deckPath}`)
459
+ process.exit(1)
460
+ }
461
+
462
+ const workDir = opts.out
463
+ ? resolve(opts.out)
464
+ : join(tmpdir(), `arena-${Date.now()}`)
465
+ mkdirSync(workDir, { recursive: true })
466
+
467
+ // Copy deck into workdir
468
+ writeFileSync(join(workDir, 'skill-deck.toml'), readFileSync(deckPath, 'utf-8'))
469
+
470
+ // Write AGENTS.md (same contract as CLI singleRun)
471
+ writeFileSync(join(workDir, 'AGENTS.md'), [
472
+ '# Arena Test Environment',
473
+ '**Mode**: agent-orchestrated cell',
474
+ '',
475
+ '## Setup Order (why this sequence)',
476
+ '1. `skill-deck.toml` copied here → declares which skills you can use',
477
+ '2. `deck link` runs → cold pool skills become visible in `.claude/skills/`',
478
+ '3. Skill existence checked → warns if any declared skill is missing from cold pool',
479
+ '4. `AGENTS.md` written last → confirms setup succeeded before agent starts',
480
+ 'If setup fails mid-sequence, the workdir is incomplete and nothing runs.',
481
+ '',
482
+ '## How This Works',
483
+ '- Write ALL output files to this directory (CWD).',
484
+ '- Use available skills — check `ls .claude/skills/`.',
485
+ '',
486
+ '## Output Contract',
487
+ '- MANDATORY: `decision-log.jsonl` — one JSON line per decision:',
488
+ ' `{"t":<seconds>,"phase":"setup|content|design|output","decision":"...","reason":"..."}`',
489
+ ].join('\n'))
490
+
491
+ // Parse deck for link + checks
492
+ const deckRaw = readFileSync(join(workDir, 'skill-deck.toml'), 'utf-8')
493
+ let deckParsed: Record<string, any> = {}
494
+ try { deckParsed = Bun.TOML.parse(deckRaw) as Record<string, any> } catch {}
495
+ const hasSkills = parseDeckSkills(deckParsed).length > 0
496
+
497
+ if (hasSkills) {
498
+ const { existsSync: es2 } = await import('node:fs')
499
+ const localDeckCli = join(import.meta.dir, '..', '..', 'lythoskill-deck', 'src', 'cli.ts')
500
+ const linkCmd = es2(localDeckCli)
501
+ ? ['bun', localDeckCli, 'link']
502
+ : ['bunx', '@lythos/skill-deck', 'link']
503
+ const linkProc = Bun.spawn(linkCmd,
504
+ { cwd: workDir, env: { ...process.env, HOME: process.env.HOME! } },
505
+ )
506
+ await linkProc.exited
507
+ const linkStderr = await new Response(linkProc.stderr).text()
508
+ const linkResult = validateLinkResult(linkProc.exitCode, linkStderr)
509
+ if (!linkResult.ok) {
510
+ console.error(`❌ ${linkResult.error}`)
511
+ process.exit(1)
512
+ }
513
+ } else {
514
+ console.log('ℹ️ No skills declared in deck — skipping link')
515
+ }
516
+
517
+ // Skill existence check
518
+ try {
519
+ const coldPoolDefault = join(homedir(), '.agents', 'skill-repos')
520
+ const coldPoolDir = resolveColdPoolDir(deckParsed?.deck?.cold_pool, homedir(), coldPoolDefault)
521
+ const skills = parseDeckSkills(deckParsed)
522
+ const checks = checkSkillExistence(skills, coldPoolDir, existsSync)
523
+ for (const warning of formatSkillWarnings(checks)) {
524
+ console.warn(`⚠️ ${warning}`)
525
+ }
526
+ } catch (e) {
527
+ console.warn('⚠️ Could not check skill existence:', e instanceof Error ? e.message : e)
528
+ }
529
+
530
+ console.log(`✅ Workdir ready → ${workDir}`)
531
+ console.log(` deck: ${deckPath}`)
532
+ if (opts.brief) console.log(` brief: ${opts.brief!.slice(0, 60)}...`)
533
+ }
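Review note: `prepareWorkdir` writes an `AGENTS.md` that makes `decision-log.jsonl` mandatory, with one JSON object per line carrying `t`, `phase`, `decision`, and `reason`. Nothing in this diff shows that contract being checked afterwards, so the following is only a sketch of how a consumer might validate such a file. It assumes `phase` must be exactly one of the four values listed in the template and that `t` is a finite number of seconds; `parseDecisionLine` and `parseDecisionLog` are hypothetical names.

```typescript
const PHASES = ['setup', 'content', 'design', 'output'] as const
type Phase = (typeof PHASES)[number]
type Decision = { t: number; phase: Phase; decision: string; reason: string }

// Returns the typed entry, or null when the line breaks the contract.
function parseDecisionLine(line: string): Decision | null {
  let raw: unknown
  try {
    raw = JSON.parse(line)
  } catch {
    return null
  }
  if (typeof raw !== 'object' || raw === null) return null
  const r = raw as Record<string, unknown>
  if (typeof r.t !== 'number' || !Number.isFinite(r.t)) return null
  if (typeof r.phase !== 'string' || (PHASES as readonly string[]).indexOf(r.phase) === -1) return null
  if (typeof r.decision !== 'string' || typeof r.reason !== 'string') return null
  return { t: r.t, phase: r.phase as Phase, decision: r.decision, reason: r.reason }
}

// Splits a JSONL document, skipping blank lines and recording
// the 1-based line numbers of entries that fail validation.
function parseDecisionLog(text: string): { entries: Decision[]; invalidLines: number[] } {
  const entries: Decision[] = []
  const invalidLines: number[] = []
  text.split('\n').forEach((line, index) => {
    if (line.trim() === '') return
    const entry = parseDecisionLine(line)
    if (entry) entries.push(entry)
    else invalidLines.push(index + 1)
  })
  return { entries, invalidLines }
}

const sample = [
  '{"t":1,"phase":"setup","decision":"use scout deck","reason":"declared in skill-deck.toml"}',
  'not json',
  '{"t":2,"phase":"deploy","decision":"ship","reason":"unknown phase"}',
].join('\n')

console.log(parseDecisionLog(sample))
```

Reporting line numbers rather than throwing on the first bad line keeps one malformed entry from hiding the rest of an agent's log.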
534
+
535
+ // ═══════════════════════════════════════════════════════════════════════════
536
+ // ── archive: copy agent outputs from workdir(s) to outDir ─────────────────
537
+ // Intent: same copy behavior as CLI singleRun, reusable for agent-orchestrated
538
+
539
+ async function archiveRun(args: string[]) {
540
+ const opts: Record<string, string | undefined> = {}
541
+ for (let i = 0; i < args.length; i++) {
542
+ if (args[i] === '--from' || args[i] === '-f') opts.from = args[++i]
543
+ else if (args[i] === '--to' || args[i] === '-o') opts.to = args[++i]
544
+ else if (args[i] === '--sides') opts.sides = args[++i]
545
+ else if (args[i] === '--report') opts.report = args[++i]
546
+ }
547
+
548
+ if (!opts.from || !opts.to) {
549
+ console.error(`❌ --from <workdir> and --to <outdir> are required.
550
+ lythoskill-arena archive --from /tmp/arena-20260517 --to playground/arena-20260517 --sides side-a,side-b --report ./report.md`)
551
+ process.exit(1)
552
+ }
553
+
554
+ const fromDir = resolve(opts.from)
555
+ const outDir = resolve(opts.to)
556
+ mkdirSync(outDir, { recursive: true })
557
+
558
+ // Copy report if provided
559
+ if (opts.report && existsSync(resolve(opts.report))) {
560
+ const { cpSync } = await import('node:fs')
561
+ cpSync(resolve(opts.report), join(outDir, 'report.md'))
562
+ console.log(`📄 report.md → ${outDir}/report.md`)
563
+ }
564
+
565
+ // Copy per-side outputs (same skipSet as CLI singleRun)
566
+ const { cpSync, readdirSync } = await import('node:fs')
567
+ const skipSet = new Set(['.claude', 'skill-deck.toml', 'skill-deck.lock', 'AGENTS.md'])
568
+
569
+ const sides = opts.sides ? opts.sides.split(',') : ['.']
570
+ for (const side of sides) {
571
+ const sideWorkDir = side === '.' ? fromDir : join(fromDir, side)
572
+ if (!existsSync(sideWorkDir)) {
573
+ console.warn(`⚠️ Side workdir not found: ${sideWorkDir}`)
574
+ continue
575
+ }
576
+
577
+ const sideOutDir = join(outDir, side)
578
+ mkdirSync(sideOutDir, { recursive: true })
579
+
580
+ const entries = readdirSync(sideWorkDir, { withFileTypes: true })
581
+ for (const entry of entries) {
582
+ if (skipSet.has(entry.name)) continue
583
+ const src = join(sideWorkDir, entry.name)
584
+ const dest = join(sideOutDir, entry.name)
585
+ try {
586
+ cpSync(src, dest, { recursive: entry.isDirectory() })
587
+ console.log(` ${side}/${entry.name} → ${dest}`)
588
+ } catch (e) {
589
+ console.warn(`⚠️ Failed to copy ${side}/${entry.name}: ${e instanceof Error ? e.message : e}`)
590
+ }
591
+ }
592
+ }
593
+
594
+ console.log(`✅ Archive complete → ${outDir}`)
595
+ }
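Review note: `archiveRun` interleaves directory listing, skip-set filtering, and `cpSync` calls in one loop. One way to make that logic testable without touching the filesystem is to compute the copy plan first and execute it separately. The sketch below does only the planning step. `planArchive` and the injected `listDir` callback are hypothetical, paths are joined with `/` for brevity instead of `node:path`, and the skip set is copied from the diff.

```typescript
// Same entries as the skipSet in archiveRun.
const SKIP = new Set(['.claude', 'skill-deck.toml', 'skill-deck.lock', 'AGENTS.md'])

type CopyStep = { src: string; dest: string }

// Pure planning step: decides what would be copied where.
// listDir is injected so the function can be exercised with fake data.
function planArchive(
  fromDir: string,
  outDir: string,
  sides: string[],
  listDir: (dir: string) => string[],
): CopyStep[] {
  const plan: CopyStep[] = []
  for (const side of sides) {
    // '.' means "the workdir itself", matching the default when --sides is omitted.
    const sideDir = side === '.' ? fromDir : `${fromDir}/${side}`
    const destDir = side === '.' ? outDir : `${outDir}/${side}`
    for (const name of listDir(sideDir)) {
      if (SKIP.has(name)) continue
      plan.push({ src: `${sideDir}/${name}`, dest: `${destDir}/${name}` })
    }
  }
  return plan
}

// Fake directory listing standing in for readdirSync.
const fakeFs: Record<string, string[]> = {
  '/tmp/arena/side-a': ['AGENTS.md', 'report.docx', 'decision-log.jsonl', '.claude'],
  '/tmp/arena/side-b': ['skill-deck.toml', 'diagram.svg'],
}

const plan = planArchive('/tmp/arena', 'playground/arena', ['side-a', 'side-b'], dir => fakeFs[dir] || [])
console.log(plan)
```

A caller would then loop over the plan and hand each step to `cpSync`, keeping the per-entry try/catch from the diff around the IO call only.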
596
+
434
597
  // ── Entry point ────────────────────────────────────────────────────────────
435
598
  if (import.meta.main) {
436
599
  main().catch(err => {