npm - @lythos/skill-arena - Versions diffs - 0.13.3 → 0.14.0 - Mend

@lythos/skill-arena 0.13.3 → 0.14.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (3)

package/README.md CHANGED

@@ -1,140 +1,116 @@
 # @lythos/skill-arena
-![CI](https://img.shields.io/badge/CI-41%20unit%20tests-brightgreen) ![Intent/Plan](https://img.shields.io/badge/arch-intent%2Fplan%2Fexecute-8A2BE2)
+> Controlled-variable benchmark for AI agent skills. Test single decks or compare A/B — agent-orchestrated by default, cross-player when you need it.
-> Controlled-variable benchmark for AI agent skills. Compare skills, decks, or configurations on the same task — single-skill A/B or full-deck Pareto frontier analysis. Now with declarative `arena.toml` (k8s-manifest style) and deterministic Pareto frontier.
+## Modes at a Glance
-## Why
+| Mode | How | When |
+|------|-----|------|
+| **Agent-Orchestrated** (DEFAULT) | Agent tool spawns subagents, parallel dispatch, native judge | Single deck test, cross-deck A/B comparison |
+| **Cross-Player** (OPT-IN) | CLI runner spawns different agent binaries via Bun.spawn | Comparing kimi vs codex vs claude |
-"Which skill is better?" is the wrong question. The right question is "which skill is better for what."
-`skill-arena` scaffolds isolated environments where subagents complete the same task under different decks. A judge agent scores outputs across multiple dimensions. Supports:
-- **Mode 1**: Single-skill comparison (controlled variable — same helper skills, different test skill).
-- **Mode 2**: Full-deck comparison (Pareto frontier — no single winner, only optimal trade-offs).
-## Prerequisites
-Arena runs AI agents as subprocesses. You need at least one agent CLI installed:
-### Kimi CLI (recommended default)
-Kimi Code CLI is the default player for arena — it has reliable headless execution with eager tool loading (no deferred tool deadlock).
-```bash
-# Install via uv (recommended) — uv is Python's bunx equivalent
-uv tool install kimi-cli
-# Or run without installing:
-uvx kimi-cli --print -p "hello"
-# Authenticate
-kimi login
-# Or set API key:
-export KIMI_API_KEY=your_key
-```
-Docs: [https://github.com/MoonshotAI/kimi-cli](https://github.com/MoonshotAI/kimi-cli)
-### Claude CLI (secondary)
-```bash
-npm install -g @anthropic-ai/claude-code
-claude --version  # should be ≥ 2.1.128
-```
-Note: Claude `-p` mode has known issues with web tools in Bun.spawn (deferred tool deadlock). Kimi is the default for reliability.
+**95% of arena use is agent-orchestrated.** The Agent tool can spawn parallel subagents with isolated workdirs and different decks — zero CLI. Cross-player mode is ONLY needed when comparing different agent CLIs (the Agent tool can only spawn same-type agents).
 ## Install
 ```bash
 bun add -d @lythos/skill-arena
 # or use directly
-bunx @lythos/skill-arena@0.13.3 <command>
+bunx @lythos/skill-arena@0.14.0 <command>
 ```
 ## Quick Start
 ```bash
-# Single: test a deck with one agent (most common)
+# single — test one deck (most common)
 bunx @lythos/skill-arena@latest single \
   --deck ./examples/decks/scout.toml \
   --brief "Generate auth flow diagram" \
-  --player kimi \
-  --timeout 300000 \
-  --out ./output
-# Single with remote deck (URL auto-fetched)
-bunx @lythos/skill-arena@latest single \
-  --deck https://raw.githubusercontent.com/lythos-labs/lythoskill/main/examples/decks/scout.toml \
-  --brief "Generate auth flow diagram" \
   --out ./output
-# Vs: compare multiple decks side by side
-curl -fsSL https://raw.githubusercontent.com/lythos-labs/lythoskill/main/examples/arena/research-compare/arena.toml > arena.toml
+# cross-deck vs — compare two decks (agent-orchestrated)
+# Create arena.toml declaring sides with different decks, then:
 bunx @lythos/skill-arena@latest vs --config ./arena.toml
+# cross-player vs — compare kimi vs codex (CLI only)
+bunx @lythos/skill-arena@latest vs --config ./arena.toml --player kimi
 ```
-**Default behavior:**
-- Agent runs in an isolated `/tmp` workdir (no workspace pollution)
-- All artifacts are copied to `--out` after completion
-- Prompt template injects fixed contract (decision-log, robustness, tool preference) + your brief as variable
+**What happens**: Agent creates isolated `/tmp` workdir per side, `deck link` skills, spawns parallel subagents, collects artifacts, judge scores outputs. Parent deck restored after.
 ## Commands
-### Declarative mode (k8s-style, recommended)
+### `single` — one deck, one task
 ```bash
-# Print execution plan without running
-bunx @lythos/skill-arena@0.13.3 vs --config arena.toml --dry-run
-# Execute with per-side runs_per_side and statistical aggregation
-bunx @lythos/skill-arena@0.13.3 vs --config arena.toml
+bunx @lythos/skill-arena@latest single \
+  --deck ./deck.toml \
+  --brief "Produce a .docx report with radar chart" \
+  --timeout 600000 \
+  --out ./output
 ```
-### Scaffold mode (legacy, manual execution)
+### `vs` — multi-deck comparison
+```bash
+bunx @lythos/skill-arena@latest vs --config ./arena.toml
+bunx @lythos/skill-arena@latest vs --config ./arena.toml --dry-run
 ```
-bunx @lythos/skill-arena@0.13.3 scaffold --task "Generate auth flow diagram" \
-  --decks https://raw.githubusercontent.com/lythos-labs/lythoskill/main/examples/decks/scout.toml,https://raw.githubusercontent.com/lythos-labs/lythoskill/main/examples/decks/documents.toml
+### `scaffold` — legacy directory setup
+```bash
+bunx @lythos/skill-arena@latest scaffold \
+  --task "Generate auth flow diagram" \
+  --decks "./decks/minimal.toml,./decks/rich.toml"
 ```
-### Viz
+### `viz` — render results
 ```bash
-bunx @lythos/skill-arena@0.13.3 viz runs/arena-<id>/
+bunx @lythos/skill-arena@latest viz runs/arena-<id>/
 ```
-## Skill Documentation
+## Parameters
-This package is the **Starter** layer (CLI implementation).
-The agent-visible **Skill** layer documentation is here:
-[packages/lythoskill-arena/skill/SKILL.md](../../packages/lythoskill-arena/skill/SKILL.md)
+| Flag | Command | Description |
+|------|---------|-------------|
+| `--brief "<text>"` | single | Inline task brief |
+| `--deck <path\|url>` | single | Deck file (URL auto-fetched) |
+| `--player <name>` | single, vs | Only for cross-player: kimi\|codex\|deepseek\|claude |
+| `--timeout <ms>` | single | Subagent timeout (300000–600000 for complex tasks) |
+| `--out <dir>` | single, vs | Output directory |
+| `--config <path>` | vs | arena.toml |
+| `--dry-run` | vs | Print plan without execution |
-## Architecture
+## Prerequisites (cross-player only)
-Part of the [lythoskill](https://github.com/lythos-labs/lythoskill) ecosystem — the thin-skill pattern separates heavy logic (this npm package) from lightweight agent instructions (SKILL.md).
+For cross-player mode, install at least one agent CLI:
-```
-Starter (this package) → npm publish → bunx @lythos/skill-arena@0.13.3 ...
-Skill   (packages/<name>/skill/)     → build → SKILL.md + thin scripts
-Output  (skills/<name>/)             → git commit → agent-visible skill
+```bash
+uv tool install kimi-cli           # kimi (recommended default)
+npm i -g @openai/codex             # codex
+# deepseek: bundled with desktop app or pip install deepseek-cli
+# claude: set ANTHROPIC_API_KEY (SDK, no CLI binary needed)
 ```
-### Runtime architecture (intent/plan/execute)
+## Skill Documentation
+The agent-visible skill layer: [skill/SKILL.md](./skill/SKILL.md)
+## Architecture
 ```
 arena.toml  →  ArenaToml (Zod)  →  ExecutionPlan (pure)  →  per-cell agent spawn (IO)
-                                    ↓
-                aggregateAllStats (pure)  ←  verdicts[]
-                                    ↓
-                runComparativeJudge (IO)  →  report.md + Pareto frontier
+                                   ↓
+               aggregateAllStats (pure)  ←  verdicts[]
+                                   ↓
+               runComparativeJudge (IO)  →  report.md + Pareto frontier
 ```
-- **Intent**: `arena.toml` declarative config (k8s-manifest style)
+- **Intent**: `arena.toml` declarative config
 - **Plan**: `buildExecutionPlan()`, `aggregateSideStats()`, `computePareto()` — pure functions
-- **Execute**: `runAgentScenario` per cell, `runComparativeJudge` — IO via `AgentAdapter`
-Built on `@lythos/test-utils` shared infrastructure.
+- **Execute**: Agent tool spawn (agent-orchestrated) or `AgentAdapter` (cross-player)
 ## License

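The README's Architecture section names `computePareto()` as a pure planning step but the diff does not show its body. As a rough illustration of what a Pareto frontier over judged sides means, here is a minimal sketch. The `SideScores` shape, the `dominates` and `paretoFrontier` helpers, and the assumption that higher scores are better on every dimension are all hypothetical and are not taken from the package.

```typescript
// Hypothetical shape: one score vector per side, higher is better on every axis.
type SideScores = { side: string; scores: Record<string, number> };

// a dominates b when a is no worse on every dimension and strictly better on at least one.
// Assumes both sides were scored on the same set of dimensions.
function dominates(a: SideScores, b: SideScores): boolean {
  const dims = Object.keys(a.scores);
  const noWorse = dims.every((d) => a.scores[d] >= b.scores[d]);
  const strictlyBetter = dims.some((d) => a.scores[d] > b.scores[d]);
  return noWorse && strictlyBetter;
}

// The frontier keeps every side that no other side dominates.
function paretoFrontier(sides: SideScores[]): string[] {
  return sides
    .filter((candidate) => !sides.some((other) => dominates(other, candidate)))
    .map((s) => s.side);
}

const frontier = paretoFrontier([
  { side: "side-a", scores: { quality: 8, speed: 3 } },
  { side: "side-b", scores: { quality: 5, speed: 9 } },
  { side: "side-c", scores: { quality: 4, speed: 2 } },
]);
console.log(frontier); // side-a and side-b remain; side-c is dominated by both
```

Because neither `side-a` nor `side-b` beats the other on both dimensions, both stay on the frontier, which matches the idea of reporting trade-offs instead of a single winner.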
package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@lythos/skill-arena",
-  "version": "0.13.3",
+  "version": "0.14.0",
   "description": "Skill Arena — benchmark skill effectiveness with controlled-variable comparison",
   "keywords": [
     "ai-agent",
@@ -42,13 +42,13 @@
     "bun": ">=1.0.0"
   },
   "dependencies": {
-    "@lythos/cold-pool": "^0.13.3",
-    "@lythos/infra": "^0.13.3",
-    "@lythos/test-utils": "^0.13.3",
+    "@lythos/cold-pool": "^0.14.0",
+    "@lythos/infra": "^0.14.0",
+    "@lythos/test-utils": "^0.14.0",
     "zod": "^3.24.0",
     "zod-to-json-schema": "^3.25.2"
   },
   "optionalDependencies": {
-    "@lythos/agent-adapter-claude-sdk": "^0.13.3"
+    "@lythos/agent-adapter-claude-sdk": "^0.14.0"
   }
 }
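
For `0.x` versions a caret range only admits patch updates within the same minor, so `^0.13.3` would not accept a `0.14.0` sibling release. That is presumably why the `@lythos/*` ranges above move together with the package version. The sketch below is a simplified check for plain `x.y.z` versions only; it ignores prerelease tags and build metadata, and the helper names are hypothetical, not part of any library.

```typescript
type Version = [number, number, number];

// Parse a plain "x.y.z" string; prerelease and build suffixes are not handled here.
function parse(v: string): Version {
  const [major, minor, patch] = v.split(".").map(Number);
  return [major, minor, patch];
}

// Negative when a < b, zero when equal, positive when a > b.
function compare(a: Version, b: Version): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Exclusive upper bound of ^base: bump the left-most non-zero component.
function caretUpperBound([major, minor, patch]: Version): Version {
  if (major > 0) return [major + 1, 0, 0];
  if (minor > 0) return [0, minor + 1, 0];
  return [0, 0, patch + 1];
}

function satisfiesCaret(version: string, base: string): boolean {
  const v = parse(version);
  const lower = parse(base);
  return compare(v, lower) >= 0 && compare(v, caretUpperBound(lower)) < 0;
}

console.log(satisfiesCaret("0.13.9", "0.13.3")); // true
console.log(satisfiesCaret("0.14.0", "0.13.3")); // false
console.log(satisfiesCaret("0.14.2", "0.14.0")); // true
```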

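The `prepare-workdir` and `archive` commands added in `package/src/cli.ts` each parse their flags with the same index-advancing loop (`opts.x = args[++i]`). A minimal sketch of that pattern as one reusable function follows. The `parseFlags` helper and its `Map` of flag spellings are hypothetical and do not exist in the package; the flag names and example arguments are the ones that appear in the diff.

```typescript
// Hypothetical helper: maps each accepted flag spelling to an option key.
function parseFlags(
  args: string[],
  flags: Map<string, string>,
): Record<string, string | undefined> {
  const opts: Record<string, string | undefined> = {};
  for (let i = 0; i < args.length; i++) {
    const key = flags.get(args[i]);
    // Known flag: consume the next token as its value, as the loops in the diff do.
    if (key !== undefined) opts[key] = args[++i];
  }
  return opts;
}

// Flag spellings accepted by `archive` in the diff.
const archiveFlags = new Map<string, string>([
  ["--from", "from"],
  ["-f", "from"],
  ["--to", "to"],
  ["-o", "to"],
  ["--sides", "sides"],
  ["--report", "report"],
]);

const opts = parseFlags(
  ["--from", "/tmp/arena-20260517", "--to", "playground/arena-20260517", "--sides", "side-a,side-b"],
  archiveFlags,
);

// Same fallback as the diff: no --sides means the workdir itself is the only side.
const sides = opts.sides ? opts.sides.split(",") : ["."];
console.log(opts.from, opts.to, sides);
```

As in the original loops, a flag given as the last argument with no value leaves its option `undefined`, so the required-flag checks still apply afterwards.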
package/src/cli.ts CHANGED

@@ -100,6 +100,8 @@ Examples:
   lythoskill-arena vs --config arena.toml --dry-run
   lythoskill-arena vs --config arena.toml
   lythoskill-arena viz runs/arena-20260504
+  lythoskill-arena prepare-workdir --deck ./decks/scout.toml --out /tmp/arena-20260517-side-a
+  lythoskill-arena archive --from /tmp/arena-20260517 --to playground/arena-20260517 --sides side-a,side-b
 `)
     process.exit(0)
   }
@@ -113,6 +115,8 @@ function cli(args: string[]) {
   if (cmd === 'vs' || cmd === 'compare') return vsRun(rest)
   if (cmd === 'single' || cmd === 'run') return singleRun(rest)
   if (cmd === 'viz') return vizRun(rest)
+  if (cmd === 'prepare-workdir') return prepareWorkdir(rest)
+  if (cmd === 'archive') return archiveRun(rest)
   console.error(`Unknown command: ${cmd}`)
   process.exit(1)
@@ -431,6 +435,165 @@ async function vizRun(args: string[]) {
   console.log(`📈 Arena HTML report not yet implemented. See report.md in ${runsDir}/`)
 }
+// ═══════════════════════════════════════════════════════════════════════════
+// ── prepare-workdir: reusable workdir setup (used by both CLI and agent) ──
+// Intent: create an isolated arena workdir with deck linked and ready to run
+async function prepareWorkdir(args: string[]) {
+  const opts: Record<string, string | undefined> = {}
+  for (let i = 0; i < args.length; i++) {
+    if (args[i] === '--deck' || args[i] === '-d') opts.deck = args[++i]
+    else if (args[i] === '--out' || args[i] === '-o') opts.out = args[++i]
+    else if (args[i] === '--brief' || args[i] === '-b') opts.brief = args[++i]
+  }
+  if (!opts.deck) {
+    console.error(`❌ --deck <path> is required.
+   lythoskill-arena prepare-workdir --deck ./skill-deck.toml --out /tmp/arena-side-a`)
+    process.exit(1)
+  }
+  const deckPath = resolve(opts.deck)
+  if (!existsSync(deckPath)) {
+    console.error(`❌ Deck file not found: ${deckPath}`)
+    process.exit(1)
+  }
+  const workDir = opts.out
+    ? resolve(opts.out)
+    : join(tmpdir(), `arena-${Date.now()}`)
+  mkdirSync(workDir, { recursive: true })
+  // Copy deck into workdir
+  writeFileSync(join(workDir, 'skill-deck.toml'), readFileSync(deckPath, 'utf-8'))
+  // Write AGENTS.md (same contract as CLI singleRun)
+  writeFileSync(join(workDir, 'AGENTS.md'), [
+    '# Arena Test Environment',
+    '**Mode**: agent-orchestrated cell',
+    '',
+    '## Setup Order (why this sequence)',
+    '1. `skill-deck.toml` copied here → declares which skills you can use',
+    '2. `deck link` runs → cold pool skills become visible in `.claude/skills/`',
+    '3. Skill existence checked → warns if any declared skill is missing from cold pool',
+    '4. `AGENTS.md` written last → confirms setup succeeded before agent starts',
+    'If setup fails mid-sequence, the workdir is incomplete and nothing runs.',
+    '',
+    '## How This Works',
+    '- Write ALL output files to this directory (CWD).',
+    '- Use available skills — check `ls .claude/skills/`.',
+    '',
+    '## Output Contract',
+    '- MANDATORY: `decision-log.jsonl` — one JSON line per decision:',
+    '  `{"t":<seconds>,"phase":"setup|content|design|output","decision":"...","reason":"..."}`',
+  ].join('\n'))
+  // Parse deck for link + checks
+  const deckRaw = readFileSync(join(workDir, 'skill-deck.toml'), 'utf-8')
+  let deckParsed: Record<string, any> = {}
+  try { deckParsed = Bun.TOML.parse(deckRaw) as Record<string, any> } catch {}
+  const hasSkills = parseDeckSkills(deckParsed).length > 0
+  if (hasSkills) {
+    const { existsSync: es2 } = await import('node:fs')
+    const localDeckCli = join(import.meta.dir, '..', '..', 'lythoskill-deck', 'src', 'cli.ts')
+    const linkCmd = es2(localDeckCli)
+      ? ['bun', localDeckCli, 'link']
+      : ['bunx', '@lythos/skill-deck', 'link']
+    const linkProc = Bun.spawn(linkCmd,
+      { cwd: workDir, env: { ...process.env, HOME: process.env.HOME! } },
+    )
+    await linkProc.exited
+    const linkStderr = await new Response(linkProc.stderr).text()
+    const linkResult = validateLinkResult(linkProc.exitCode, linkStderr)
+    if (!linkResult.ok) {
+      console.error(`❌ ${linkResult.error}`)
+      process.exit(1)
+    }
+  } else {
+    console.log('ℹ️  No skills declared in deck — skipping link')
+  }
+  // Skill existence check
+  try {
+    const coldPoolDefault = join(homedir(), '.agents', 'skill-repos')
+    const coldPoolDir = resolveColdPoolDir(deckParsed?.deck?.cold_pool, homedir(), coldPoolDefault)
+    const skills = parseDeckSkills(deckParsed)
+    const checks = checkSkillExistence(skills, coldPoolDir, existsSync)
+    for (const warning of formatSkillWarnings(checks)) {
+      console.warn(`⚠️  ${warning}`)
+    }
+  } catch (e) {
+    console.warn('⚠️  Could not check skill existence:', e instanceof Error ? e.message : e)
+  }
+  console.log(`✅ Workdir ready → ${workDir}`)
+  console.log(`   deck: ${deckPath}`)
+  if (opts.brief) console.log(`   brief: ${opts.brief!.slice(0, 60)}...`)
+}
+// ═══════════════════════════════════════════════════════════════════════════
+// ── archive: copy agent outputs from workdir(s) to outDir ─────────────────
+// Intent: same copy behavior as CLI singleRun, reusable for agent-orchestrated
+async function archiveRun(args: string[]) {
+  const opts: Record<string, string | undefined> = {}
+  for (let i = 0; i < args.length; i++) {
+    if (args[i] === '--from' || args[i] === '-f') opts.from = args[++i]
+    else if (args[i] === '--to' || args[i] === '-o') opts.to = args[++i]
+    else if (args[i] === '--sides') opts.sides = args[++i]
+    else if (args[i] === '--report') opts.report = args[++i]
+  }
+  if (!opts.from || !opts.to) {
+    console.error(`❌ --from <workdir> and --to <outdir> are required.
+   lythoskill-arena archive --from /tmp/arena-20260517 --to playground/arena-20260517 --sides side-a,side-b --report ./report.md`)
+    process.exit(1)
+  }
+  const fromDir = resolve(opts.from)
+  const outDir = resolve(opts.to)
+  mkdirSync(outDir, { recursive: true })
+  // Copy report if provided
+  if (opts.report && existsSync(resolve(opts.report))) {
+    const { cpSync } = await import('node:fs')
+    cpSync(resolve(opts.report), join(outDir, 'report.md'))
+    console.log(`📄 report.md → ${outDir}/report.md`)
+  }
+  // Copy per-side outputs (same skipSet as CLI singleRun)
+  const { cpSync, readdirSync } = await import('node:fs')
+  const skipSet = new Set(['.claude', 'skill-deck.toml', 'skill-deck.lock', 'AGENTS.md'])
+  const sides = opts.sides ? opts.sides.split(',') : ['.']
+  for (const side of sides) {
+    const sideWorkDir = side === '.' ? fromDir : join(fromDir, side)
+    if (!existsSync(sideWorkDir)) {
+      console.warn(`⚠️  Side workdir not found: ${sideWorkDir}`)
+      continue
+    }
+    const sideOutDir = join(outDir, side)
+    mkdirSync(sideOutDir, { recursive: true })
+    const entries = readdirSync(sideWorkDir, { withFileTypes: true })
+    for (const entry of entries) {
+      if (skipSet.has(entry.name)) continue
+      const src = join(sideWorkDir, entry.name)
+      const dest = join(sideOutDir, entry.name)
+      try {
+        cpSync(src, dest, { recursive: entry.isDirectory() })
+        console.log(`   ${side}/${entry.name} → ${dest}`)
+      } catch (e) {
+        console.warn(`⚠️  Failed to copy ${side}/${entry.name}: ${e instanceof Error ? e.message : e}`)
+      }
+    }
+  }
+  console.log(`✅ Archive complete → ${outDir}`)
+}
 // ── Entry point ────────────────────────────────────────────────────────────
 if (import.meta.main) {
   main().catch(err => {