@lythos/skill-arena 0.9.38 → 0.9.39

Files changed (3)
  1. package/README.md +13 -24
  2. package/package.json +1 -1
  3. package/src/cli.ts +25 -105
package/README.md CHANGED
@@ -49,26 +49,24 @@ Note: Claude `-p` mode has known issues with web tools in Bun.spawn (deferred to
  ```bash
  bun add -d @lythos/skill-arena
  # or use directly
- bunx @lythos/skill-arena@0.9.38 <command>
+ bunx @lythos/skill-arena@0.9.39 <command>
  ```
 
  ## Quick Start
 
  ```bash
- # Mode 1: Compare two skills on the same task
- bunx @lythos/skill-arena@0.9.38 \
- --task "Generate auth flow diagram" \
- --skills "design-doc-mermaid,mermaid-tools" \
- --criteria "syntax,context,token"
+ # Mode 1: Compare two decks on the same task (declarative)
+ bunx @lythos/skill-arena@0.9.39 run \
+ --config examples/arena/research-compare/arena.toml
 
- # Mode 2: Compare full deck configurations
- bunx @lythos/skill-arena@0.9.38 \
+ # Mode 2: Compare full deck configurations via CLI flags
+ bunx @lythos/skill-arena@0.9.39 run \
  --task "Generate auth flow diagram" \
  --decks "./decks/minimal.toml,./decks/rich.toml" \
  --criteria "quality,token,maintainability"
 
  # Visualize results
- bunx @lythos/skill-arena@0.9.38 viz tmp/arena-<id>/
+ bunx @lythos/skill-arena@0.9.39 viz tmp/arena-<id>/
  ```
 
  ## Commands
@@ -77,16 +75,16 @@ bunx @lythos/skill-arena@0.9.38 viz tmp/arena-<id>/
 
  ```bash
  # Print execution plan without running
- bunx @lythos/skill-arena@0.9.38 run --config arena.toml --dry-run
+ bunx @lythos/skill-arena@0.9.39 run --config arena.toml --dry-run
 
  # Execute with per-side runs_per_side and statistical aggregation
- bunx @lythos/skill-arena@0.9.38 run --config arena.toml
+ bunx @lythos/skill-arena@0.9.39 run --config arena.toml
  ```
 
  ### CLI-flag mode (backward compat)
 
  ```
- bunx @lythos/skill-arena@0.9.38 run \
+ bunx @lythos/skill-arena@0.9.39 run \
  --task ./TASK-arena.md \
  --players ./players/claude.toml \
  --decks ./decks/run-01.toml,./decks/run-02.toml \
@@ -96,13 +94,13 @@ bunx @lythos/skill-arena@0.9.38 run \
  ### Scaffold mode (legacy, manual execution)
 
  ```
- bunx @lythos/skill-arena@0.9.38 scaffold --task "..." --skills a,b
+ bunx @lythos/skill-arena@0.9.39 scaffold --task "..." --decks a.toml,b.toml
  ```
 
  ### Viz
 
  ```bash
- bunx @lythos/skill-arena@0.9.38 viz runs/arena-<id>/
+ bunx @lythos/skill-arena@0.9.39 viz runs/arena-<id>/
  ```
 
  ## Skill Documentation
@@ -116,7 +114,7 @@ The agent-visible **Skill** layer documentation is here:
  Part of the [lythoskill](https://github.com/lythos-labs/lythoskill) ecosystem — the thin-skill pattern separates heavy logic (this npm package) from lightweight agent instructions (SKILL.md).
 
  ```
- Starter (this package) → npm publish → bunx @lythos/skill-arena@0.9.38 ...
+ Starter (this package) → npm publish → bunx @lythos/skill-arena@0.9.39 ...
  Skill (packages/<name>/skill/) → build → SKILL.md + thin scripts
  Output (skills/<name>/) → git commit → agent-visible skill
  ```
@@ -137,15 +135,6 @@ arena.toml → ArenaToml (Zod) → ExecutionPlan (pure) → per-cell agent
 
  Built on `@lythos/test-utils` shared infrastructure.
 
- ## Test Coverage
-
- | Layer | Count | CI | Notes |
- |-------|-------|----|-------|
- | Unit tests | 41 | ✅ | TOML parser, player resolution, Pareto, stats |
- | Agent BDD | — | ❌ | Requires `claude` CLI; run locally |
-
- Pareto frontier is a **deterministic algorithm** — never delegated to LLM. 8 unit tests cover dominance, cross-dominance, transitive chains, partial criteria, and empty scores.
-
  ## License
 
  MIT
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@lythos/skill-arena",
-   "version": "0.9.38",
+   "version": "0.9.39",
    "description": "Skill Arena — benchmark skill effectiveness with controlled-variable comparison",
    "keywords": [
      "ai-agent",
package/src/cli.ts CHANGED
@@ -40,7 +40,6 @@ Usage:
  lythoskill-arena agent-run --task <path> --deck <path> [--player kimi] [--out <dir>] [--timeout <ms>]
  lythoskill-arena agent-run --brief "<prompt>" --deck <path> [--out <dir>] [--timeout <ms>]
  lythoskill-arena run --task <path> --players <A.toml,B.toml> --decks <A.toml,B.toml> --criteria <c1,c2,...> [--out <dir>]
- lythoskill-arena scaffold --task "<description>" --skills <skill1,skill2,...>
  lythoskill-arena scaffold --task "<description>" --decks <deck1,deck2,...>
  lythoskill-arena viz <arena-dir>
 
@@ -51,13 +50,11 @@ Commands:
 
  Options:
  -t, --task <path|desc> Task description or path to TASK-arena.md
- -s, --skills <list> Comma-separated skill names (scaffold only)
  --decks <list> Comma-separated deck paths
  -c, --criteria <list> Evaluation criteria (default: syntax,context,logic,token)
  --players <list> Comma-separated player.toml paths (CLI run only)
  --config <path> Path to arena.toml (declarative mode, k8s-style)
  --dry-run Print execution plan without running (with --config)
- --control <skill> Control skill for comparison (scaffold only)
  --out <dir> Output directory (run: defaults to runs/arena-<id>)
  -d, --dir <dir> Output directory (scaffold: defaults to tmp)
  -p, --project <dir> Project directory (default: .)
@@ -75,7 +72,7 @@ Examples:
  lythoskill-arena run --task ./TASK-arena.md --players ./players/claude.toml --decks ./decks/run-01.toml,./decks/run-02.toml --criteria coverage,relevance
 
  # Legacy scaffolding
- lythoskill-arena scaffold --task "Refactor auth module" --skills skill-a,skill-b
+ lythoskill-arena scaffold --task "Refactor auth module" --decks ./decks/a.toml,./decks/b.toml
  lythoskill-arena viz runs/arena-20260504
  `)
  }
@@ -268,10 +265,8 @@ function parseArgs(argv: string[]) {
 
    const options: Record<string, string | undefined> = {
      task: undefined,
-     skills: undefined,
      decks: undefined,
      criteria: 'syntax,context,logic,token',
-     control: 'lythoskill-project-scribe',
      dir: 'tmp',
      project: '.',
      config: undefined,
@@ -284,13 +279,10 @@ function parseArgs(argv: string[]) {
      const arg = argv[i]
      if (arg === '--task' || arg === '-t') {
        options.task = argv[++i]
-     } else if (arg === '--skills' || arg === '-s') {
-       options.skills = argv[++i]
      } else if (arg === '--decks') {
        options.decks = argv[++i]
      } else if (arg === '--criteria' || arg === '-c') {
        options.criteria = argv[++i]
-     } else if (arg === '--control') {
        options.control = argv[++i]
      } else if (arg === '--dir' || arg === '-d') {
        options.dir = argv[++i]
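
As rendered in the hunk above, the `else if (arg === '--control')` guard is removed while `options.control = argv[++i]` remains as a context line, so that statement now appears to sit inside the `--criteria` branch and would consume a second argument. The sketch below is one way to avoid that kind of orphaned statement with a table-driven loop. It is not the package's code: `FLAGS` and `parseFlags` are hypothetical names, and only the flag spellings and defaults are taken from the diff.

```typescript
// Hypothetical, simplified restatement of the flag loop; not the package's code.
type Options = Record<string, string | undefined>

// Map each accepted flag spelling to the option key it sets (spellings from the diff).
const FLAGS: Record<string, string> = {
  '--task': 'task', '-t': 'task',
  '--decks': 'decks',
  '--criteria': 'criteria', '-c': 'criteria',
  '--dir': 'dir', '-d': 'dir',
}

function parseFlags(argv: string[]): Options {
  const options: Options = {
    task: undefined,
    decks: undefined,
    criteria: 'syntax,context,logic,token',
    dir: 'tmp',
  }
  for (let i = 0; i < argv.length; i++) {
    const key = FLAGS[argv[i]]
    // Each recognized flag consumes exactly one following value.
    if (key !== undefined) options[key] = argv[++i]
  }
  return options
}

const parsed = parseFlags(['-t', 'TASK.md', '-c', 'quality,token', '--decks', 'a.toml,b.toml'])
console.log(parsed)
```

Because every flag goes through the same single statement, removing a flag means deleting one table entry and cannot leave a stray assignment behind.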
@@ -319,39 +311,13 @@ export function runArena(argv: string[]) {
      process.exit(1)
    }
 
-   const HAS_DECKS = !!options.decks
-   const HAS_SKILLS = !!options.skills
+   const DECK_PATHS = (options.decks || '').split(',').map(s => s.trim()).filter(Boolean)
 
-   if (!HAS_DECKS && !HAS_SKILLS) {
-     console.error('❌ Please provide --skills or --decks')
-     process.exit(1)
-   }
-   if (HAS_DECKS && HAS_SKILLS) {
-     console.error('❌ --skills and --decks cannot be used together')
-     process.exit(1)
-   }
-
-   const DECK_PATHS = HAS_DECKS
-     ? (options.decks || '').split(',').map(s => s.trim()).filter(Boolean)
-     : []
-
-   const SKILLS = HAS_SKILLS
-     ? (options.skills || '').split(',').map(s => s.trim()).filter(Boolean)
-     : []
-
-   if (HAS_SKILLS && SKILLS.length < 2) {
-     console.error('❌ At least 2 skills are required to run an arena')
-     process.exit(1)
-   }
-   if (HAS_SKILLS && SKILLS.length > 5) {
-     console.error('❌ An arena supports at most 5 skills per run')
-     process.exit(1)
-   }
-   if (HAS_DECKS && DECK_PATHS.length < 2) {
+   if (DECK_PATHS.length < 2) {
      console.error('❌ At least 2 decks are required to run an arena')
      process.exit(1)
    }
-   if (HAS_DECKS && DECK_PATHS.length > 5) {
+   if (DECK_PATHS.length > 5) {
      console.error('❌ An arena supports at most 5 decks per run')
      process.exit(1)
    }
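
The consolidated deck check above can be sketched as a pure helper. Only the split/trim/filter chain and the 2 to 5 bounds come from the hunk. The name `parseDeckPaths`, the `min`/`max` parameters, and throwing an `Error` instead of calling `console.error` and `process.exit(1)` are assumptions made so the behavior is easy to test.

```typescript
// Hypothetical helper; the name and the thrown errors are illustrative assumptions.
function parseDeckPaths(raw: string | undefined, min = 2, max = 5): string[] {
  // Same normalization as the diff: split on commas, trim, drop empty entries.
  const paths = (raw || '').split(',').map(s => s.trim()).filter(Boolean)
  if (paths.length < min) throw new Error(`at least ${min} decks are required`)
  if (paths.length > max) throw new Error(`at most ${max} decks per arena`)
  return paths
}

console.log(parseDeckPaths(' ./decks/minimal.toml, ./decks/rich.toml ,'))
```

With `--skills` gone, an omitted `--decks` no longer needs its own error branch: the empty string normalizes to an empty list and fails the lower bound.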
@@ -359,9 +325,6 @@ export function runArena(argv: string[]) {
    const CRITERIA = (options.criteria || 'syntax,context,logic,token')
      .split(',').map(s => s.trim()).filter(Boolean)
 
-   const CONTROL_SKILLS = (options.control || 'lythoskill-project-scribe')
-     .split(',').map(s => s.trim()).filter(Boolean)
-
    const PROJECT_DIR = resolve(options.project!)
    const ARENA_SLUG = slugify(TASK)
    const ARENA_ID = `arena-${timestamp()}-${ARENA_SLUG.slice(0, 30)}`
@@ -373,37 +336,20 @@ export function runArena(argv: string[]) {
    mkdirSync(join(ARENA_DIR, 'sides'), { recursive: true })
 
    // ── Generate participants and decks ─────────────────────────
-   let participants: { id: string; name: string; skill_name: string; deck_path: string }[]
-   let mode: 'single-skill' | 'full-deck'
-
-   if (HAS_DECKS) {
-     mode = 'full-deck'
-     participants = DECK_PATHS.map((deckPath, i) => {
-       const id = `run-${String(i + 1).padStart(2, '0')}`
-       const name = basename(deckPath).replace(/\.toml$/, '')
-       const destPath = join(ARENA_DIR, 'decks', `arena-${id}.toml`)
-       // Copy the provided deck to arena directory
-       if (existsSync(deckPath)) {
-         const content = readFileSync(deckPath, 'utf-8')
-         writeFileSync(destPath, content)
-       } else {
-         console.error(`❌ Deck file not found: ${deckPath}`)
-         process.exit(1)
-       }
-       return { id, name, skill_name: name, deck_path: destPath }
-     })
-   } else {
-     mode = 'single-skill'
-     participants = SKILLS.map((skill, i) => {
-       const id = `run-${String(i + 1).padStart(2, '0')}`
-       return {
-         id,
-         name: skill,
-         skill_name: skill,
-         deck_path: join(ARENA_DIR, 'decks', `arena-${id}.toml`),
-       }
-     })
-   }
+   const participants = DECK_PATHS.map((deckPath, i) => {
+     const id = `run-${String(i + 1).padStart(2, '0')}`
+     const name = basename(deckPath).replace(/\.toml$/, '')
+     const destPath = join(ARENA_DIR, 'decks', `arena-${id}.toml`)
+     // Copy the provided deck to arena directory
+     if (existsSync(deckPath)) {
+       const content = readFileSync(deckPath, 'utf-8')
+       writeFileSync(destPath, content)
+     } else {
+       console.error(`❌ Deck file not found: ${deckPath}`)
+       process.exit(1)
+     }
+     return { id, name, skill_name: name, deck_path: destPath }
+   })
 
    const criteria = CRITERIA.map((c) => ({
      name: c,
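
The participant mapping in this hunk derives a zero-padded run id and a display name from each deck path. The sketch below isolates that naming logic without touching the filesystem. `planParticipants` and the `Participant` interface are hypothetical names, and plain string operations stand in for `basename` and `join` from `node:path` so the block has no imports; the id format and the `.toml` stripping follow the diff.

```typescript
// Hypothetical pure version of the mapping; the diff's code also copies each deck file.
interface Participant { id: string; name: string; skill_name: string; deck_path: string }

function planParticipants(deckPaths: string[], arenaDir: string): Participant[] {
  return deckPaths.map((deckPath, i) => {
    // run-01, run-02, ... in the order the decks were listed.
    const id = `run-${String(i + 1).padStart(2, '0')}`
    // Stand-in for basename(): keep the last path segment, then drop ".toml".
    const file = deckPath.split(/[\\/]/).pop() ?? deckPath
    const name = file.replace(/\.toml$/, '')
    // Stand-in for join(): the per-run copy lives under <arenaDir>/decks/.
    const deck_path = `${arenaDir}/decks/arena-${id}.toml`
    return { id, name, skill_name: name, deck_path }
  })
}

console.log(planParticipants(['./decks/minimal.toml', './decks/rich.toml'], 'tmp/arena-demo'))
```

Keeping `skill_name` equal to `name` mirrors the diff, where the field survives from the removed single-skill mode.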
@@ -411,29 +357,6 @@ export function runArena(argv: string[]) {
      weight: 1,
    }))
 
-   if (mode === 'single-skill') {
-     for (const p of participants) {
-       const deckContent = `# ============================================================
- # Arena Deck: ${p.id} — ${p.name}
- # ============================================================
- # Variable: ${p.name}
- # Control variables: ${CONTROL_SKILLS.join(', ')}
- # ============================================================
-
- [deck]
- working_set = ".claude/skills"
- cold_pool = "~/.agents/skill-repos"
- max_cards = 10
-
- [tool]
- skills = [
- ${[...new Set([p.skill_name, ...CONTROL_SKILLS])].map(s => ` "${s}",`).join('\n')}
- ]
- `
-       writeFileSync(p.deck_path, deckContent)
-     }
-   }
-
    // ── Create an isolated workspace for each side ──────────────
    for (const p of participants) {
      const sideDir = join(ARENA_DIR, 'sides', p.id)
@@ -481,14 +404,11 @@ ${criteria.map(c => ` - ${c.label}`).join('\n')}
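  arena_decks:
  ${participants.map(p => ` - ${p.deck_path.replace(PROJECT_DIR, '.')}`).join('\n')}
  judge_persona: |
- ${mode === 'full-deck'
- ? `You are a multi-objective optimization analyst. Do not pick a Winner.
- For each deck configuration, output a score vector (1-5) according to evaluation_criteria.
- Identify the Pareto non-dominated set: there is no "strongest", only "the best trade-offs across different dimensions".
- For each dominated solution, state which solution dominates it and on which dimensions it is weaker.
- If you find any emergent combo (multiple skills combining for a 1+1>2 effect), flag it separately.`
- : `You are a neutral skill evaluator. Compare the outputs of all subagents,
- score each from 1 to 5 according to evaluation_criteria, then name a Winner and give a selection recommendation.`}
+ You are a multi-objective optimization analyst. Do not pick a Winner.
+ For each deck configuration, output a score vector (1-5) according to evaluation_criteria.
+ Identify the Pareto non-dominated set: there is no "strongest", only "the best trade-offs across different dimensions".
+ For each dominated solution, state which solution dominates it and on which dimensions it is weaker.
+ If you find any emergent combo (multiple skills combining for a 1+1>2 effect), flag it separately.
  acceptance:
  ${participants.map(p => ` - Subagent ${p.id} completes the task in the isolated sides/${p.id}/ environment and writes runs/${p.id}.md`).join('\n')}
  - Judge reads all run files and generates report.md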
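
The judge persona that remains after this change asks for score vectors and a Pareto non-dominated set rather than a single winner. A minimal sketch of that dominance check follows, assuming higher scores are better and that every side has a score for every criterion. `Scores`, `dominates`, and `paretoFront` are illustrative names, and this is not the package's implementation.

```typescript
// Hypothetical sketch of Pareto filtering over 1-5 score vectors; higher is assumed better.
type Scores = Record<string, number>

// a dominates b when a is at least as good on every criterion and strictly better on one.
function dominates(a: Scores, b: Scores, criteria: string[]): boolean {
  let strictlyBetter = false
  for (const c of criteria) {
    if (a[c] < b[c]) return false
    if (a[c] > b[c]) strictlyBetter = true
  }
  return strictlyBetter
}

// Keep every side that no other side dominates.
function paretoFront(sides: Record<string, Scores>, criteria: string[]): string[] {
  const ids = Object.keys(sides)
  return ids.filter(id =>
    !ids.some(other => other !== id && dominates(sides[other], sides[id], criteria)))
}

const front = paretoFront({
  'run-01': { quality: 4, token: 2 },
  'run-02': { quality: 3, token: 5 },
  'run-03': { quality: 3, token: 2 }, // dominated by run-01 (and by run-02)
}, ['quality', 'token'])
console.log(front)
```

Here `run-01` and `run-02` trade quality against token score, so both stay on the front, while `run-03` is no better than `run-01` on any criterion and is dropped.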
@@ -527,9 +447,9 @@ cd "${ARENA_DIR}"
  ID: ${ARENA_ID}
  Task: ${TASK}
  Directory: ${ARENA_DIR}
- Mode: ${mode === 'full-deck' ? 'Full deck configuration comparison' : 'Single-skill comparison'}
+ Mode: deck configuration comparison
  Participants: ${participants.map(p => p.name).join(', ')}
- ${mode === 'single-skill' ? `Control variables: ${CONTROL_SKILLS.join(', ')}\n` : ''}Evaluation criteria: ${CRITERIA.join(', ')}
+ Evaluation criteria: ${CRITERIA.join(', ')}
 
  Generated files:
  📋 ${join(ARENA_DIR, 'arena.json')}