kairn-cli 2.10.0 → 2.14.0

Files changed (4)
  1. package/README.md +171 -483
  2. package/dist/cli.js +9740 -7107
  3. package/dist/cli.js.map +1 -1
  4. package/package.json +2 -1
package/README.md CHANGED
@@ -1,326 +1,144 @@
1
1
  # Kairn — The Agent Environment Compiler
2
2
 
3
- > Describe your workflow. Get an optimized Claude Code environment. Then evolve it automatically.
3
+ > Agent harnesses are programs. They should be compiled from intent and optimized through evolutionary search, not hand-written.
4
4
 
5
- Kairn is a CLI that compiles natural language descriptions into minimal, optimal [Claude Code](https://code.claude.com/) agent environments, complete with MCP servers, slash commands, skills, subagents, rules, and security. Then it uses **automated evolution** (inspired by [Meta-Harness](https://yoonholee.com/meta-harness/), Stanford IRIS Lab 2026) to improve them through real-world task execution.
5
+ Every Claude Code project ships with a `.claude/` directory: system prompts, slash commands, rules, agents, hooks, MCP configs, security policies. Today, teams hand-write these files, cargo-culting from templates and fixing problems by trial and error. The harness *is* the program that shapes agent behavior, but nobody treats it like one.
6
6
 
7
- **v2.5.0** adds **Intent-Aware Harnesses**: project-specific routing that intercepts natural language and activates the right command. Two-tier: fast regex (Tier 1) + semantic Haiku fallback (Tier 2). Self-learning: the harness learns your vocabulary over time.
7
+ Kairn treats it like one. You describe your workflow in natural language. Kairn compiles an optimized environment through a multi-agent pipeline: an @orchestrator plans the compilation, 6 specialist agents generate typed intermediate representation nodes in parallel, and a @linker validates cross-references before deterministic assembly. Then, optionally, Kairn *evolves* it: running real tasks against the harness, diagnosing failures via causal reasoning, proposing typed IR mutations, and repeating with population-based training, Thompson sampling for task selection, and KL regularization to prevent bloat.
8
8
 
9
- **No servers. No accounts. No telemetry. Runs locally with your own LLM key.**
9
+ The result is a harness that's been compiled from intent and stress-tested against real work, not guessed at by a human reading docs.
10
10
 
11
- ---
12
-
13
- ## Install
14
-
15
- ```bash
16
- npm install -g kairn-cli
17
- ```
18
-
19
- Requires Node.js 18+. The command is `kairn`.
20
-
21
- ## Quick Start
22
-
23
- ```bash
24
- # 1. Set up your LLM provider (Anthropic, OpenAI, Google, xAI, DeepSeek, Mistral, Groq, or custom)
25
- kairn init
26
-
27
- # 2. Describe your workflow (or scan an existing repo)
28
- kairn describe "Build a Next.js app with Supabase auth"
29
- # or
30
- kairn optimize # scans existing project at cwd
31
-
32
- # 3. Start coding
33
- claude
34
- ```
35
-
36
- Kairn generates the entire `.claude/` directory — CLAUDE.md, settings.json, commands, rules, agents, hooks, security policies — tailored to your specific workflow. Then, optionally, evolve it:
37
-
38
- ```bash
39
- # Set up evolution
40
- kairn evolve init # auto-generate 3-5 eval tasks
41
- kairn evolve baseline # snapshot current harness
42
-
43
- # Optimize
44
- kairn evolve run --iterations 5 # Run evolution loop
45
- kairn evolve apply # Accept best harness
46
- ```
11
+ **No servers. No accounts. No telemetry. Local-first, runs with your own LLM key.**
47
12
 
48
- ---
49
-
50
- ## What Gets Generated
51
-
52
- ```
53
- .claude/
54
- ├── CLAUDE.md # Workflow-specific system prompt (7 sections)
55
- ├── settings.json # Permissions, hooks, security rules, intent routing
56
- ├── commands/ # Slash commands (/project:help, /project:plan, etc.)
57
- ├── rules/ # Auto-loaded instructions (security, continuity, paths)
58
- ├── skills/ # Model-controlled capabilities (code, research, writing)
59
- ├── agents/ # Specialized subagents (@architect, @tester, etc.)
60
- ├── docs/ # Pre-initialized project memory
61
- ├── hooks/ # Intent router (Tier 1 regex + Tier 2 Haiku classifier)
62
- │ ├── intent-router.mjs # Project-specific regex patterns + fallthrough
63
- │ ├── intent-learner.mjs # Promotes recurring Tier 2 patterns to Tier 1
64
- │ └── intent-log.jsonl # Log of routed prompts (for learning)
65
- └── QUICKSTART.md # Interactive startup guide (Level 2-4)
66
- .mcp.json # Project-scoped MCP server config
67
- .env # API keys (gitignored, masked in output)
68
- ```
13
+ Kairn's own development environment was compiled and evolved by Kairn.
69
14
 
70
15
  ---
71
16
 
72
- ## Core Commands
73
-
74
- ### `kairn init`
75
-
76
- Interactive setup. Pick your LLM provider, enter credentials. API key stored locally at `~/.kairn/config.json`.
17
+ ## What's Under the Hood
77
18
 
78
- **Supported providers:**
79
- - **Anthropic** — Claude Sonnet 4.6, Opus 4.6, Haiku 4.5
80
- - **OpenAI** — GPT-4.1, GPT-4.1 mini, o4-mini, GPT-5 mini
81
- - **Google** — Gemini 2.5 Flash, Gemini 3 Flash, Gemini 2.5 Pro, Gemini 3.1 Pro
82
- - **xAI** — Grok 4.1 Fast, Grok 4.20 (2M context, $0.20/M)
83
- - **DeepSeek** — V3.2 Chat, V3.2 Reasoner (cheapest at $0.28/M)
84
- - **Mistral** — Large 3, Codestral, Small 4 (open-weight)
85
- - **Groq** — Llama 4, DeepSeek R1, Qwen 3 (free tier)
86
- - **Custom** — any OpenAI-compatible endpoint (local Ollama, LM Studio)
19
+ Most tools in this space either generate prompts or generate code. Kairn generates *full agent environments* — and then optimizes them as a system. Here's what that required building.
87
20
 
88
- ### `kairn describe [intent] [options]`
21
+ ### Multi-Agent Compilation Pipeline (v2.11)
89
22
 
90
- **The main command.** Describe what you want your agent to do. Kairn compiles an optimal environment.
23
+ The monolithic "ask an LLM to produce a giant JSON blob" approach hits a wall at ~16K tokens: truncation, incoherence, format corruption. Kairn decomposes compilation into a DAG of specialist agents, each producing typed output within its own token budget.
91
24
 
92
- ```bash
93
- kairn describe "Build a Next.js REST API with PostgreSQL"
94
- kairn describe "Research ML papers on GRPO training and summarize" --quick
95
25
  ```
96
-
97
- **Features:**
98
- - **Interactive clarification**: 3-5 yes/no questions to refine your workflow (skip with `--quick`)
99
- - **Multi-pass compilation**: Skeleton pass (tool selection) + Harness pass (content generation) + deterministic settings
100
- - **Autonomy levels**: Choose how autonomous (1-4, default 2):
101
- - **Level 1 (Guided):** Manual workflow with `/project:tour`, help, and guidance
102
- - **Level 2 (Assisted):** `/project:loop` for workflow automation, `@pm` agent for planning
103
- - **Level 3 (Autonomous):** `/project:auto` for self-directed execution with PR delivery
104
- - **Level 4 (Full Auto):** `/project:autopilot` for continuous execution with stop conditions
105
- - **Secrets collection** — Prompted for API keys after generation, written to `.env`
106
- - **Intent routing** — Auto-generated `/project:*` command routing (both regex and Haiku-based)
107
-
108
- ### `kairn optimize [options]`
109
-
110
- Scan an existing project and optimize its Claude Code environment. Detects language, framework, dependencies, and generates improvements.
111
-
112
- ```bash
113
- kairn optimize # Scan, audit, and overwrite .claude/
114
- kairn optimize --diff # Preview changes before writing
115
- kairn optimize --audit-only # Show issues without generating
26
+ Pass 1: Skeleton — LLM selects tools, outlines the project (max_tokens: 2048)
27
+ Pass 2: @orchestrator — reads skeleton + intent, emits a CompilationPlan
28
+ (phased tasks, dependency ordering, per-agent token budgets)
29
+ Pass 3: Specialist agents, parallel fan-out across phases:
30
+ Phase A: @sections-writer → Section[], @rule-writer → RuleNode[]
31
+ Phase B: @command-writer → CommandNode[], @agent-writer → AgentNode[],
32
+ @skill-writer → SkillNode[]
33
+ Phase C: @linker → cross-reference validation + auto-patching
34
+ Pass 4: Assembly, deterministic generation of settings.json, .mcp.json, hooks
116
35
  ```
117
36
 
118
- **Features:**
119
- - **Full project scan** — language, framework, dependencies, scripts, env keys, CI/CD, existing harness
120
- - **Harness audit** — checks CLAUDE.md quality, missing commands/rules, MCP bloat, security configurations
121
- - **Two modes:**
122
- - No `.claude/` → generate from scratch
123
- - Has `.claude/` → optimize + overwrite (shows audit issues first, asks for confirmation)
124
- - **Diff preview** — see what would change before applying (with `--diff`)
37
+ Each specialist produces typed HarnessIR nodes, not strings. The @linker detects broken `@agent` references in commands, missing `/project:command` mentions in agents, and injects mandatory help/security/continuity rules if absent. If an agent's output is truncated (`stop_reason === 'max_tokens'`), the batch engine retries with doubled budget — one agent failing doesn't crash the whole compilation.
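The truncation handling described above can be sketched roughly as follows. This is a minimal illustration, not Kairn's actual batch engine: `AgentResult`, `AgentCall`, `callWithBudgetRetry`, `runBatch`, and the cap of three attempts are all assumptions made for the example. Only the `stop_reason === 'max_tokens'` check and the budget doubling come from the README text.

```typescript
// Hypothetical shapes, for illustration only.
interface AgentResult {
  stop_reason: "end_turn" | "max_tokens";
  text: string;
}

// Stand-in for one specialist agent's LLM call at a given token budget.
type AgentCall = (maxTokens: number) => Promise<AgentResult>;

// Call an agent, doubling the budget each time the output was truncated.
async function callWithBudgetRetry(
  call: AgentCall,
  initialBudget: number,
  maxAttempts: number = 3, // assumed cap; the README does not state one
): Promise<AgentResult> {
  let budget = initialBudget;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await call(budget);
    if (result.stop_reason !== "max_tokens") return result;
    budget *= 2; // truncated: retry with a doubled budget
  }
  throw new Error(`output still truncated after ${maxAttempts} attempts`);
}

// Run every agent independently and record a per-agent outcome, so a single
// failure is reported instead of rejecting the whole batch.
async function runBatch(
  agents: Map<string, AgentCall>,
  budget: number,
): Promise<Map<string, AgentResult | Error>> {
  const outcomes = new Map<string, AgentResult | Error>();
  await Promise.all(
    Array.from(agents, async ([name, call]) => {
      try {
        outcomes.set(name, await callWithBudgetRetry(call, budget));
      } catch (err) {
        outcomes.set(name, err instanceof Error ? err : new Error(String(err)));
      }
    }),
  );
  return outcomes;
}
```

Catching errors per agent is what lets the rest of the batch finish; the caller can then decide whether a missing node type is fatal.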
125
38
 
126
- ### `kairn templates [options]`
39
+ ### Structured Harness IR (v2.7)
127
40
 
128
- Browse pre-built environment templates. Activate one to jumpstart a new project.
41
+ Raw Markdown mutation accumulates contradictions, corrupts formatting, and breaks as files grow. Kairn operates on a typed intermediate representation: 14 node types (Section, CommandNode, RuleNode, AgentNode, SkillNode, DocNode, HookNode, SettingsIR, McpServerNode, IntentNode, ...), 17 mutation operations, and a semantic diff engine.
129
42
 
130
- ```bash
131
- kairn templates # Browse gallery
132
- kairn templates --activate nextjs # Apply a template
133
- ```
43
+ The IR is round-trip tested: `parse → render → parse` preserves all content on real `.claude/` directories. The evolution loop mutates IR nodes directly — no regex replacement, no string surgery. The compilation pipeline produces IR, the evolution loop mutates IR, and the renderer writes files. One representation, end to end.
134
44
 
135
- **Available templates:**
136
- - Next.js Full-Stack (React + Node + PostgreSQL + Supabase)
137
- - API Service (Express/Fastify + database + testing)
138
- - Research Project (paper analysis, literature review, synthesis)
139
- - Content Writing (blog, documentation, marketing)
45
+ ### Population-Based Training with Thompson Sampling (v2.6)
140
46
 
141
- ### `kairn doctor`
47
+ A single sequential evolution trajectory wastes wall-clock time on dead ends and overfits to its task sample. `kairn evolve pbt` runs N independent trajectories concurrently (default: 3), each with its own workspace, RNG seed, and Thompson Sampling beliefs.
142
48
 
143
- Validate the current environment against Claude Code best practices. Checks:
144
- - CLAUDE.md structure and token count
145
- - MCP server configuration completeness
146
- - Security rules and hooks
147
- - Command and agent definitions
148
- - Environment variable references
49
+ **Thompson Sampling** maintains a Beta distribution per eval task. Tasks with volatile scores (high uncertainty) get sampled more often; stable tasks less. This is uncertainty-driven exploration — the system automatically focuses evaluation budget where signal is weakest, rather than uniform random sampling.
149
50
 
150
- ### `kairn keys [options]`
51
+ **KL Regularization** prevents harness bloat. Every mutation pays a complexity cost: `effective_score = raw_score - λ * complexityCost * 100`. The cost measures lines, files, sections, and character-level diff from baseline. The proposer must *earn* every addition. Default λ = 0.1.
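As a worked example of the penalty formula quoted above, here is a small sketch. The function name `effectiveScore` and the sample numbers are made up for illustration. How Kairn combines lines, files, sections, and diff size into `complexityCost` is not shown in this diff, so the cost is simply taken as an input.

```typescript
// Formula quoted from the README:
//   effective_score = raw_score - λ * complexityCost * 100
const DEFAULT_LAMBDA = 0.1; // default stated in the README

function effectiveScore(
  rawScore: number,
  complexityCost: number, // assumed to be precomputed elsewhere
  lambda: number = DEFAULT_LAMBDA,
): number {
  return rawScore - lambda * complexityCost * 100;
}

// Hypothetical comparison: the candidate gains 2 raw points but pays a
// penalty of about 3 points (0.1 * 0.3 * 100), so it loses to the baseline.
const baseline = effectiveScore(84, 0.0);
const candidate = effectiveScore(86, 0.3);
const candidateWins = candidate > baseline; // false
```

With λ = 0.1, each 0.1 of complexity cost has to buy at least one point of raw score before an addition pays for itself.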
151
52
 
152
- Manage API keys for MCP servers in the current environment.
53
+ After all branches complete, a **Meta-Principal** LLM agent reads all branch results — iteration logs, per-task score matrices, Thompson beliefs, complexity metrics — and synthesizes the optimal harness by cherry-picking the best mutations from each trajectory. The synthesis is evaluated against the full task suite and must beat the best individual branch.
153
54
 
154
- ```bash
155
- kairn keys # Prompt for missing keys
156
- kairn keys --show # Show which keys are set vs missing
157
- ```
55
+ ### Hybrid Scoring (v2.8)
158
56
 
159
- ### `kairn list` / `kairn activate <env_id>`
57
+ Eval quality is the bottleneck of any optimization loop. Kairn blends deterministic rubric criteria (shell command checks: does the harness include a test command? does security block `rm -rf`?) with LLM-as-judge scoring, in a configurable weighted combination. Anthropic prompt caching on system prompts saves ~85% of tokens on repeated proposer/scorer calls. After mutation, targeted re-evaluation re-runs only tasks whose harness files were touched, saving ~40% eval cost per iteration.
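A weighted blend of this kind might look like the sketch below. It is an assumption-laden illustration: the README says only that the combination is weighted and configurable, so the 0-100 scale, the `rubricWeight` parameter, its 0.5 default, and the names `RubricCheck`, `rubricScore`, and `hybridScore` are hypothetical. The rubric checks are shown as precomputed booleans rather than real shell commands.

```typescript
// One deterministic rubric criterion, already evaluated (hypothetical shape).
interface RubricCheck {
  name: string;
  passed: boolean;
}

// Share of rubric checks that passed, on a 0-100 scale.
function rubricScore(checks: RubricCheck[]): number {
  if (checks.length === 0) return 0;
  const passed = checks.filter((c) => c.passed).length;
  return (passed / checks.length) * 100;
}

// Blend the deterministic rubric with an LLM-as-judge score (0-100).
function hybridScore(
  checks: RubricCheck[],
  judgeScore: number,
  rubricWeight: number = 0.5, // assumed default
): number {
  if (rubricWeight < 0 || rubricWeight > 1) {
    throw new RangeError("rubricWeight must be between 0 and 1");
  }
  return rubricWeight * rubricScore(checks) + (1 - rubricWeight) * judgeScore;
}

const checks: RubricCheck[] = [
  { name: "harness includes a test command", passed: true },
  { name: "security blocks rm -rf", passed: false },
];
const blended = hybridScore(checks, 80); // 0.5 * 50 + 0.5 * 80 = 65
```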
160
58
 
161
- Show saved environments (stored in `~/.kairn/envs/`) and re-deploy them to any directory.
59
+ ### Persistent Execution Loops (v2.10)
162
60
 
163
- ```bash
164
- kairn list # List all saved environments
165
- kairn activate env_abc123 # Copy that environment to .claude/
166
- ```
61
+ Generated harnesses include `/project:persist` — a loop that reads acceptance criteria from `docs/SPRINT.md`, works criterion-by-criterion with structured progress tracking in `.claude/progress.json`, auto-retries on verification failure (max 3 per criterion), and delegates to a review gate before completion. Progress persists across sessions via `memory.json`.
167
62
 
168
- ### `kairn evolve`: Automated Harness Optimization
63
+ A `UserPromptSubmit` hook detects complex tasks (multi-step, feature-scope, refactoring, bug-with-repro) via 6 complexity signals and auto-routes them through the persistence loop. Simple tasks pass through normally. Configurable: `auto | manual | off`.
169
64
 
170
- The heart of v2.x. Run your agent on real tasks, capture execution traces, diagnose failures, and mutate the harness iteratively.
65
+ ### Anthropic Harness Patterns (v2.9)
171
66
 
172
- #### `kairn evolve init`
67
+ Comparative analysis against [Anthropic's harness design guidance](https://www.anthropic.com/engineering/harness-design-long-running-apps), [Everything Claude Code](https://github.com/affaan-m/everything-claude-code) (151 skills, 102 security rules), and [Oh-My-ClaudeCode](https://github.com/yeachan-heo/oh-my-claudecode) (model routing) identified 6 gaps. Kairn now generates:
173
68
 
174
- Set up evolution for the current project. Auto-generates 3-5 concrete eval tasks based on your CLAUDE.md and project structure.
69
+ - **Sprint contracts**: `@architect` outputs numbered acceptance criteria; `/project:develop` validates each one individually
70
+ - **Smart model routing** — agents include tiered routing guidance (Haiku for linting, Sonnet for implementation, Opus for architecture) with a `modelRouting` IR field
71
+ - **Expanded security** — PreToolUse patterns from 5 to 20+ across credential leaks, injection, destructive ops, and network exfiltration
72
+ - **Memory persistence** — SessionStart/End hooks save/load `.claude/memory.json` across sessions
73
+ - **Context reset protocol** — full PostCompact alternative for long sessions (>2 hours or >3 compactions)
175
74
 
176
- ```bash
177
- kairn evolve init
178
- ```
75
+ ### Self-Learning Intent Routing (v2.5)
179
76
 
180
- Creates `.kairn-evolve/tasks.yaml` with tasks like:
181
- - "Add a new feature X to the codebase"
182
- - "Fix this known bug Y"
183
- - "Refactor the API layer for clarity"
184
- - "Write comprehensive test coverage"
185
- - "Update documentation after feature launch"
77
+ Two-tier routing compiles project-specific intent patterns at generation time. Tier 1: regex patterns (<10ms, $0) match keywords and synonyms. Tier 2: Haiku-powered semantic classification (~$0.001) handles ambiguous prompts. A background learner promotes recurring Tier 2 patterns to Tier 1 regexes after 3+ matches. Over time, the harness learns the user's vocabulary: session 1 is 40% regex, session 10 is 90%.
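The two-tier flow with promotion can be sketched as below. This is a simplified, in-memory illustration rather than the generated hook code: `createRouter`, `Pattern`, `Route`, and `Classifier` are hypothetical names, the Haiku call is replaced by a synchronous stand-in, and promotion happens inline on exact repeated prompts instead of in a background learner.

```typescript
interface Pattern {
  regex: RegExp; // should not use the "g" flag, which makes test() stateful
  command: string;
}

interface Route {
  command: string | null;
  tier: 1 | 2;
}

// Stand-in for the Tier 2 semantic classifier (an LLM call in practice).
type Classifier = (prompt: string) => string | null;

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function createRouter(
  initial: Pattern[],
  classify: Classifier,
  promoteAfter: number = 3, // "3+ matches" per the README
): (prompt: string) => Route {
  const patterns = initial.slice();
  const tier2Hits = new Map<string, number>();

  return function route(prompt: string): Route {
    // Tier 1: cheap regex match.
    for (const p of patterns) {
      if (p.regex.test(prompt)) return { command: p.command, tier: 1 };
    }
    // Tier 2: fall back to the classifier and count the hit.
    const command = classify(prompt);
    if (command !== null) {
      const key = prompt.trim().toLowerCase();
      const hits = (tier2Hits.get(key) ?? 0) + 1;
      tier2Hits.set(key, hits);
      if (hits >= promoteAfter) {
        // Promote the recurring prompt to a Tier 1 pattern.
        patterns.push({ regex: new RegExp(escapeRegExp(key), "i"), command });
        tier2Hits.delete(key);
      }
    }
    return { command, tier: 2 };
  };
}
```

In this sketch, a prompt that the classifier has routed three times is answered by regex from the fourth occurrence onward, which is the cost shift the paragraph above describes.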
186
78
 
187
- Uses 6 built-in templates: add-feature, fix-bug, refactor, test-writing, config-change, documentation.
188
-
189
- #### `kairn evolve baseline`
190
-
191
- Snapshot your current `.claude/` directory as iteration 0 (the baseline to improve against).
192
-
193
- ```bash
194
- kairn evolve baseline
195
- ```
196
-
197
- #### `kairn evolve run`
198
-
199
- Run the full evolution loop. Evaluates all tasks, diagnoses failures, proposes mutations, re-evaluates.
200
-
201
- ```bash
202
- kairn evolve run # 5 iterations (default)
203
- kairn evolve run --iterations 3 # Custom iteration count
204
- kairn evolve run --task <task_id> # Run a single task
205
- kairn evolve run --parallel 4 # Parallel task evaluation (4 concurrent)
206
- kairn evolve run --runs 3 # Run each task 3 times, report mean ± stddev
207
- ```
208
-
209
- **How it works (the loop):**
210
-
211
- 1. **Evaluate** — Run each eval task by spawning Claude Code in an isolated workspace. Capture full traces:
212
- - stdout, stderr
213
- - MCP tool calls (which tools, inputs, outputs)
214
- - Files changed (diffs)
215
- - Execution time, pass/fail status
216
-
217
- 2. **Diagnose** — A proposer agent (Opus) reads the full trace filesystem and performs causal reasoning:
218
- - "Task A failed because CLAUDE.md doesn't mention the /api path"
219
- - "Task B passed on iteration 1 but regressed on iteration 3 — the new security rule broke it"
220
- - "Tasks A and C both needed /project:fix but there's no /project:fix command"
79
+ ---
221
80
 
222
- 3. **Mutate**: Propose minimal, targeted changes to the harness:
223
- - `replace`: Update a section in CLAUDE.md, a command, a rule
224
- - `add_section`: Insert new guidance into CLAUDE.md
225
- - `create_file`: Add a new command or rule
226
- - `delete_section`: Remove contradictory or bloat sections
227
- - `delete_file`: Remove unused commands/rules
228
- - `add_intent_pattern`: Add a new natural language pattern (v2.5.0)
229
- - `modify_intent_prompt`: Improve the Tier 2 Haiku classifier (v2.5.0)
81
+ ## What Makes Kairn Different
230
82
 
231
- 4. **Re-evaluate**: Run all tasks again with the mutated harness. If scores improve → accept. If scores regress → rollback to previous best.
83
+ **vs. DSPy** — DSPy optimizes *prompts*. Kairn optimizes *full environments*: system prompts, slash commands, rules, agents, hooks, MCP configs, security policies, intent routing — as a coherent system. DSPy's mutation space is string replacement on prompt templates. Kairn's is 17 typed IR operations on a 14-node-type intermediate representation with cross-reference validation.
232
84
 
233
- 5. **Repeat**: Iterate N times (default 5). Each iteration cycles through evaluate → diagnose → mutate → re-evaluate.
85
+ **vs. OpenEvolve** — OpenEvolve optimizes *code*. Kairn optimizes the *harness that shapes how agents write code*. Different layer of the stack, different mutation space, different eval methodology (real agent execution traces, not unit tests).
234
86
 
235
- **Scoring:**
236
- - **pass/fail** (default) — task passes or fails
237
- - **llm-judge** — LLM reads task output and scores (0-100)
238
- - **rubric** — custom weighted scoring function
87
+ **vs. Oh-My-ClaudeCode / static harness collections** — OMC ships a fixed set of 150 skills and 100+ rules. Kairn generates *project-specific* environments from intent, then evolves them against real tasks. Static harnesses can't adapt; Kairn's improve with use.
239
88
 
240
- **Adaptive pruning (v2.2.7):**
241
- On middle iterations, skip slow/expensive tasks above a confidence threshold. Re-run all tasks on the first and last iteration for rigor.
89
+ **vs. manual `.claude/` directories** — No memorizing command names (intent routing). No trial-and-error (evolution loop). No format corruption (typed IR). No cargo-culting (compiled from your actual workflow).
242
90
 
243
- **Anti-regression guards (v2.2.8):**
244
- - `maxMutationsPerIteration` (default: 3): cap mutations per step
245
- - `maxTaskDrop` (default: 20): if any single task drops >20 points, rollback
246
- - Loss-weighted proposer focus: proposer reads failures worst-first
91
+ **The specific technical gaps:**
92
+ - Full-environment optimization (not just prompts, not just code)
93
+ - Typed IR mutations with pre-condition validation (not string replacement)
94
+ - Population-based evolutionary search with uncertainty-driven sampling
95
+ - Cross-component validation via the @linker (commands reference real agents, agents reference real commands)
96
+ - Self-learning intent routing that promotes patterns from expensive LLM classification to free regex
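The cross-component validation item in the list above can be illustrated with a small sketch. `CommandNode` and `AgentNode` are node type names from the README, but the `name` and `body` fields, the function `findBrokenAgentRefs`, and the mention syntax it scans for are assumptions; the real @linker also auto-patches, which this sketch does not attempt.

```typescript
// Hypothetical minimal node shapes; the real IR fields are not shown in this diff.
interface CommandNode {
  name: string;
  body: string;
}

interface AgentNode {
  name: string;
  body: string;
}

// Report every `@agent` mention in a command body that has no matching agent.
function findBrokenAgentRefs(commands: CommandNode[], agents: AgentNode[]): string[] {
  const known = new Set(agents.map((a) => a.name));
  const problems: string[] = [];
  for (const cmd of commands) {
    // Look only at whitespace-separated tokens that start with "@".
    for (const token of cmd.body.split(/\s+/)) {
      const match = /^@([a-z][a-z0-9-]*)/.exec(token);
      if (match === null) continue;
      const agentName = match[1];
      if (agentName !== undefined && !known.has(agentName)) {
        problems.push(`${cmd.name} -> @${agentName}`);
      }
    }
  }
  return problems;
}
```

The same pattern runs in the other direction for `/project:command` mentions inside agent bodies.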
247
97
 
248
- #### `kairn evolve report`
98
+ ---
249
99
 
250
- Generate a human-readable summary of the evolution run.
100
+ ## Quick Start
251
101
 
252
102
  ```bash
253
- kairn evolve report # Markdown to stdout
254
- kairn evolve report --json # Machine-readable JSON
255
- ```
256
-
257
- Shows:
258
- - Evolution leaderboard (iterations × tasks × scores)
259
- - Per-task trace diffs (what changed between iterations for the same task)
260
- - Counterfactual diagnosis (which mutations helped/hurt which tasks)
261
- - Wall time, token cost, iterations completed
103
+ npm install -g kairn-cli # Node.js 18+
262
104
 
263
- #### `kairn evolve diff <iter1> <iter2>`
264
-
265
- Show the harness changes between two iterations.
266
-
267
- ```bash
268
- kairn evolve diff 0 3 # Show all mutations from baseline to iteration 3
105
+ kairn init # Set up your LLM provider
106
+ kairn describe "Build a Next.js app with Supabase auth"
107
+ claude # Start Claude Code with the compiled harness
269
108
  ```
270
109
 
271
- #### `kairn evolve apply [--iter N]`
272
-
273
- Copy the best (or specified) evolved harness back to `.claude/`.
110
+ To evolve the harness:
274
111
 
275
112
  ```bash
276
- kairn evolve apply # Copy best iteration to .claude/
277
- kairn evolve apply --iter 3 # Copy iteration 3 specifically
113
+ kairn evolve init # Auto-generate eval tasks from your project
114
+ kairn evolve baseline # Snapshot current harness
115
+ kairn evolve run # 5 iterations: evaluate → diagnose → mutate → re-evaluate
116
+ kairn evolve apply # Deploy the best harness
278
117
  ```
279
118
 
280
- ---
281
-
282
- ## Tool Registry
283
-
284
- Kairn ships with **28 curated MCP servers** across 8 categories. Tools are auto-selected based on your workflow — fewer tools = less context bloat = better agent performance.
119
+ Kairn generates the entire `.claude/` directory — CLAUDE.md, settings.json, commands, rules, agents, skills, hooks, docs, intent routing, security policies — plus `.mcp.json` and `.env`.
285
120
 
286
- | Category | Tools |
287
- |----------|-------|
288
- | **Reasoning** | Context7, Sequential Thinking |
289
- | **Code & DevTools** | GitHub MCP, Chrome DevTools |
290
- | **Search & Research** | Exa, Brave Search, Firecrawl, Perplexity |
291
- | **Browser Automation** | Playwright, Browserbase |
292
- | **Data & Infrastructure** | PostgreSQL, Supabase, SQLite, Docker, Vercel |
293
- | **Communication** | Slack, Notion, Linear, AgentMail, Gmail |
294
- | **Security** | Semgrep, security-guidance |
295
- | **Design** | Figma, Frontend Design |
121
+ Supports 8 LLM providers: Anthropic, OpenAI, Google, xAI, DeepSeek, Mistral, Groq, and any OpenAI-compatible endpoint.
296
122
 
297
123
  ---
298
124
 
299
- ## How the Pipeline Works
300
-
301
- ### Generation (kairn describe / kairn optimize)
302
-
303
- 1. **User input** — intent string or scanned project profile
304
- 2. **Clarification** (optional) — 3-5 yes/no questions to refine workflow
305
- 3. **Pass 1: Skeleton** — LLM selects minimal tool set and outlines the project
306
- 4. **Pass 2: Harness** — LLM generates all content (CLAUDE.md, commands, rules, agents, docs)
307
- 5. **Pass 3: Settings** — Deterministic generation of `settings.json` and `.mcp.json` from registry
308
- 6. **Intent patterns** — Compile project-specific regex patterns from command names + synonyms
309
- 7. **Hook templates** — Generate `intent-router.mjs` (Tier 1) and Tier 2 prompt template
310
- 8. **Write files** — `.claude/` directory + `.mcp.json` + `.env` (with masked keys)
125
+ ## The Evolution Engine
311
126
 
312
- ### Evolution (kairn evolve run)
127
+ The heart of Kairn. Run your agent on real tasks, capture full execution traces, diagnose failures via causal reasoning, and mutate the harness iteratively.
313
128
 
314
129
  ```
315
130
  Baseline (.claude/ snapshot)
316
131
 
317
132
 
318
133
  Iteration 1
319
- ├─ Evaluate: run all tasks, capture traces
320
- ├─ Diagnose: proposer reads traces, reasons about failures
321
- ├─ Mutate: generate 1-3 harness mutations
322
- ├─ Re-evaluate: run all tasks again
323
- └─ Accept/rollback based on score improvement
134
+ ├─ Evaluate: spawn Claude Code on each task, capture traces
135
+ │ (stdout, MCP tool calls, file diffs, execution time, pass/fail)
136
+ ├─ Diagnose: proposer (Sonnet) reads traces worst-first, performs causal reasoning
137
+ │ ("Task A failed because CLAUDE.md doesn't mention the /api path")
138
+ ├─ Mutate: propose 1-3 typed IR mutations
139
+ │ (17 operation types: update/add/remove sections, commands, rules, agents, MCP servers, ...)
140
+ ├─ Re-evaluate: run all tasks against the mutated harness
141
+ └─ Accept improvement / rollback regression
324
142
 
325
143
 
326
144
  Iteration 2, 3, 4, 5...
@@ -329,283 +147,153 @@ Baseline (.claude/ snapshot)
329
147
  Best harness (apply to .claude/)
330
148
  ```
331
149
 
332
- Each iteration is independent and can be retried. The proposer has memory of all prior iterations (v2.4.0 experience replay, coming soon).
150
+ **Safety controls:** max 3 mutations per iteration, per-task regression guard (>20 point drop = rollback), adaptive eval pruning on middle iterations, loss-weighted proposer focus.
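Two of these controls, the mutation cap and the per-task regression guard, could be sketched as follows. The defaults of 3 and 20 come from the README. The function names, the 0-100 score scale, treating a missing task as a score of 0, and requiring a strictly higher mean to accept are assumptions for illustration.

```typescript
type Scores = Map<string, number>; // task id -> score (assumed 0-100)

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Keep at most `maxMutations` proposed mutations per iteration.
function capMutations<T>(proposed: T[], maxMutations: number = 3): T[] {
  return proposed.slice(0, maxMutations);
}

// Accept the mutated harness only if no task dropped by more than
// `maxTaskDrop` points and the mean score improved.
function shouldAccept(prev: Scores, next: Scores, maxTaskDrop: number = 20): boolean {
  for (const [task, before] of Array.from(prev)) {
    const after = next.get(task) ?? 0; // assumption: a missing task counts as 0
    if (before - after > maxTaskDrop) return false; // regression guard: roll back
  }
  return mean(Array.from(next.values())) > mean(Array.from(prev.values()));
}
```

The guard is deliberately per-task: a mutation that raises the average while breaking one task by more than 20 points is still rolled back.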
333
151
 
334
- ### Self-Learning (v2.5.0)
152
+ **Population-based mode:** `kairn evolve pbt` runs N parallel trajectories with Thompson Sampling + KL regularization, then synthesizes the optimal harness via Meta-Principal.
335
153
 
336
- ```
337
- Tier 1: regex hook intercepts prompt
338
- ├─ Matches pattern? → route to command + inject context
339
- └─ No match? → fallthrough to Tier 2
340
-
341
- Tier 2: Haiku prompt hook
342
- ├─ Classify intent
343
- ├─ Route to command if confident
344
- └─ Log routing attempt (for learning)
345
-
346
- SessionStart: intent-learner.mjs
347
- ├─ Read intent-log.jsonl (recent tier 2 routings)
348
- ├─ Promote recurring patterns to regex
349
- ├─ Update intent-router.mjs
350
- └─ Write audit trail
351
- ```
352
-
353
- Over time, more patterns become regex (fast, free) instead of Haiku (slow, $0.001).
354
-
355
- ---
356
-
357
- ## Example Workflow
358
-
359
- ### Scenario: Build a Next.js API
154
+ ### Example: Evolution in Action
360
155
 
361
156
  ```bash
362
- cd /tmp/my-api
363
- git init
364
-
365
- kairn describe "Next.js REST API with Prisma ORM and PostgreSQL. OAuth login, JWT auth, rate limiting."
366
-
367
- # Output:
368
- # ✔ Pass 1: Selected 7 tools (GitHub, PostgreSQL, Vercel, Semgrep, Docker, Context7, Sequential Thinking)
369
- # ✔ Pass 2: Generated 73 lines in CLAUDE.md, 8 commands, 4 rules, 3 agents, 2 skills
370
- # ✔ Pass 3: Configured 2 MCP servers (PostgreSQL + GitHub)
371
- #
372
- # Commands:
373
- # /project:help Show available commands
374
- # /project:plan Draft the API spec
375
- # /project:develop Full development pipeline
376
- # /project:test Run test suite
377
- # /project:fix Issue-driven bug fixing
378
- # /project:deploy Deploy to Vercel
379
- # /project:security Audit for vulnerabilities
380
- # /project:batch Run batches of independent tasks
381
- #
382
- # Env keys needed:
383
- # POSTGRES_URL
384
- # JWT_SECRET
385
- # GITHUB_TOKEN
386
- # VERCEL_TOKEN
387
- #
388
- # Paste your secrets (or press enter to skip):
389
- # POSTGRES_URL: ***
390
- # JWT_SECRET: ***
391
- # GITHUB_TOKEN: (skipped)
392
- # VERCEL_TOKEN: (skipped)
393
- #
394
- # Ready! Run: $ claude
395
-
396
- claude # Start Claude Code with the generated harness
397
-
398
- # In Claude Code:
399
- # > /project:plan
400
- # Drafts the API specification with OAuth flow, database schema, endpoint design
401
- #
402
- # > /project:develop feature/auth
403
- # Full pipeline: specs feature in detail, plans implementation, TDD red→green→refactor,
404
- # writes tests, runs security audit, updates docs
405
- #
406
- # > /project:fix
407
- # Shows recent issues, user picks one, Claude researches the bug, fixes it, runs tests
408
- ```
409
-
410
- ### Scenario: Optimize an Existing Project
411
-
412
- ```bash
413
- cd /path/to/existing/next-app
414
- # It has a manual .claude/ directory
415
-
416
- kairn optimize
417
-
418
- # Output:
419
- # ✔ Scan: TypeScript, Next.js, 47 dependencies, 8 scripts
420
- #
421
- # Harness Audit:
422
- # CLAUDE.md: 187 lines ✓ (good)
423
- # MCP servers: 4
424
- # Commands: 5 (/help, /plan, /code, /test, /deploy)
425
- # Rules: 2 (security, continuity)
426
- #
427
- # Issues found:
428
- # ⚠ Missing /project:develop command (full development pipeline)
429
- # ⚠ No path-scoped rules (api.md, testing.md for different code domains)
430
- # ⚠ Hooks not configured (missing destructive command blocking)
431
- #
432
- # Generate optimized environment? This will overwrite existing .claude/ files.
433
- # > Yes
434
- #
435
- # ✔ Environment compiled in 12s
436
- # ✔ Files written: 4 new, 3 modified, 1 unchanged
437
- #
438
- # Ready! Run: $ claude
439
- ```
440
-
441
- ### Scenario: Evolve the Harness
442
-
443
- ```bash
444
- # Harness is generated and working. Set up evolution:
445
-
446
- kairn evolve init
447
-
448
- # Auto-generated 5 eval tasks based on CLAUDE.md + project structure:
449
- # task-1: "Implement user profile page"
450
- # task-2: "Add password reset flow"
451
- # task-3: "Refactor authentication middleware"
452
- # task-4: "Write E2E tests for checkout flow"
453
- # task-5: "Update API documentation after feature release"
454
-
455
- kairn evolve baseline # Snapshot current .claude/ as iteration 0
456
-
157
+ kairn evolve init && kairn evolve baseline
457
158
  kairn evolve run --iterations 5
458
159
 
459
160
  # Iteration 1/5
460
- # Evaluating... [task-1] pass [task-2] fail [task-3] pass [task-4] fail [task-5] pass
461
- # Score: 3/5 (60%)
462
- #
463
- # Diagnosing failures...
464
- # - Task 2 failed: "password reset" not mentioned in CLAUDE.md. Need /project:email command.
465
- # - Task 4 failed: E2E tests failed because missing /project:test. Added but not documented.
466
- #
467
- # Proposing mutations:
468
- # - Add /project:email command with SMTP integration guidance
469
- # - Update CLAUDE.md "Authentication" section with password reset flow
470
- # - Add e2e.md path-scoped rule with Playwright patterns
161
+ # [task-1] pass [task-2] fail [task-3] pass [task-4] fail [task-5] pass
162
+ # Score: 60%
163
+ # Diagnosis: "password reset" not in CLAUDE.md, E2E tests need Playwright rule
164
+ # Mutations: +/project:email command, +authentication section, +e2e.md rule
471
165
  #
472
166
  # Iteration 2/5
473
- # Evaluating with mutated harness...
474
- # [task-1] pass [task-2] pass [task-3] pass [task-4] pass [task-5] pass
475
- # Score: 5/5 (100%) ✔ improvement! Accepting mutations.
167
+ # [task-1] pass [task-2] pass [task-3] pass [task-4] pass [task-5] pass
168
+ # Score: 100%, accepting mutations
476
169
  #
477
170
  # Iteration 3/5
478
- # Evaluating...
479
- # [task-1] pass [task-2] pass [task-3] pass [task-4] pass [task-5] pass
480
- # Score: 5/5 (100%) — no regression, but no improvement. Proposing refinements...
481
- # - CLAUDE.md got bloated (142 lines). Moving detail to rules/.
482
- # Iteration 3 score: 5/5. Accepting.
171
+ # Score: 100% — CLAUDE.md bloated (142 lines), moving detail to rules/
483
172
  #
484
- # Iterations 4-5: Scores plateau at 5/5. No more mutations.
173
+ # Iterations 4-5: plateau at 100%. No regressions.
485
174
  #
486
- # Final leaderboard:
487
- # Iteration 0 (baseline): 60% (3/5)
488
- # Iteration 1: 60% (3/5)
489
- # Iteration 2: 100% (5/5) ← best
490
- # Iteration 3: 100% (5/5)
491
- # Iteration 4: 100% (5/5)
492
- # Iteration 5: 100% (5/5)
493
-
494
- kairn evolve report # Detailed markdown summary
495
- kairn evolve apply # Copy iteration 2 to .claude/
175
+ # Final: baseline 60% → evolved 100%
176
+
177
+ kairn evolve apply # Deploy the winning harness
496
178
  ```
497
179
 
180
+ See [docs/walkthroughs/](docs/walkthroughs/) for full examples including generation, optimization, and PBT runs.
181
+
498
182
  ---
499
183
 
500
- ## Architecture & Philosophy
184
+ ## Vision
501
185
 
502
- ### Design Principles
186
+ The architecture — typed IR, population-based training, multi-agent compilation with linker validation — was designed to extend from N=1 (one project, one harness) to N=500 (a fleet of agents with interdependent harnesses). Today Kairn compiles a single `.claude/` directory. The same pipeline generalizes to **swarm manifest compilation**: describe a fleet of agents with roles, contracts, and communication patterns; compile harnesses for each agent with inter-agent contract validation (agent A's output schema matches agent B's input expectations); evolve the fleet as a system, not individual harnesses in isolation.
503
187
 
504
- 1. **Minimal over complete.** 5 well-chosen tools beat 50 generic ones.
505
- 2. **Workflow-specific over generic.** Every file generated relates to your actual task.
506
- 3. **Self-improving.** Environments get better with use via the evolution loop and self-learning intent router.
507
- 4. **Local-first.** No accounts, no servers, no telemetry. Runs offline with your own LLM key.
508
- 5. **Transparent.** You can inspect every generated file. Nothing is hidden.
509
- 6. **Security by default.** Every environment includes deny rules, hooks, and guidance.
510
- 7. **Prove it.** Evolved harnesses must demonstrably outperform static ones. Claims require measurement.
188
+ The linker already validates cross-references within a single harness (commands ↔ agents ↔ rules). Extending it to validate cross-references *between* harnesses (inter-agent contracts, shared MCP server configurations, compatible security policies) is the path from project-scoped optimization to fleet-scale coordination.
511
189
 
512
- ### What Makes Kairn Unique
190
+ ---
513
191
 
514
- **vs. Manual `.claude/` directories:**
515
- - Auto-generated from codebase scan or workflow description
516
- - Intent routing (don't memorize command names)
517
- - Automated evolution (harness improves on real tasks)
192
+ ## Command Reference
193
+
194
+ | Command | Description |
195
+ |---------|-------------|
196
+ | `kairn init` | Interactive LLM provider setup (8 providers, API key stored locally) |
197
+ | `kairn describe <intent>` | Compile intent → optimized `.claude/` environment |
198
+ | `kairn optimize` | Scan existing project, audit + regenerate harness (`--diff` to preview) |
199
+ | `kairn templates` | Browse and activate pre-built environments (Next.js, API, Research, Content) |
200
+ | `kairn doctor` | Validate environment against Claude Code best practices |
201
+ | `kairn keys` | Manage API keys for MCP servers (`--show` to audit) |
202
+ | `kairn list` / `kairn activate <id>` | Save, browse, and re-deploy environments |
203
+ | `kairn evolve init` | Scaffold evolution workspace, auto-generate eval tasks |
204
+ | `kairn evolve baseline` | Snapshot current `.claude/` as iteration 0 |
205
+ | `kairn evolve run` | Full evolution loop (`--iterations N`, `--parallel N`, `--runs N`) |
206
+ | `kairn evolve pbt` | Population-based training (N parallel branches + Meta-Principal synthesis) |
207
+ | `kairn evolve report` | Markdown/JSON summary with leaderboard and counterfactual diagnosis |
208
+ | `kairn evolve diff <i1> <i2>` | Harness changes between two iterations |
209
+ | `kairn evolve apply` | Deploy best (or specified) harness to `.claude/` |
210
+
211
+ **Describe options:** `--quick` (skip clarification), `--autonomy 1-4` (guided → full auto), `--runtime hermes` (Hermes adapter)
212
+
213
+ **Evolve options:** `--sampling thompson|uniform`, `--kl-lambda 0.1`, `--pbt-branches 3`, `--task <id>` (single task)
518
214
 
519
- **vs. Other agents (OMC, AutoCoder, etc.):**
520
- - Kairn manages the *harness* (instructions, MCP, commands, rules, agents), not agents themselves
521
- - Kairn uses the evolution loop to improve the harness (not the agent capability)
522
- - Two-tier intent routing (regex + Haiku) is unique to Kairn v2.5.0+
215
+ ---
216
+
217
+ ## What Gets Generated
218
+
219
+ ```
220
+ .claude/
221
+ ├── CLAUDE.md # Workflow-specific system prompt (7 sections)
222
+ ├── settings.json # Permissions, hooks, security rules, intent routing
223
+ ├── commands/ # Slash commands (/project:help, /project:plan, etc.)
224
+ ├── rules/ # Auto-loaded instructions (security, continuity, paths)
225
+ ├── skills/ # Model-controlled capabilities (code, research, writing)
226
+ ├── agents/ # Specialized subagents (@architect, @tester, etc.)
227
+ ├── docs/ # Pre-initialized project memory
228
+ ├── hooks/ # Intent router (Tier 1 regex + Tier 2 Haiku classifier)
229
+ │ ├── intent-router.mjs # Project-specific regex patterns + fallthrough
230
+ │ ├── intent-learner.mjs # Promotes recurring Tier 2 patterns to Tier 1
231
+ │ └── intent-log.jsonl # Log of routed prompts (for learning)
232
+ └── QUICKSTART.md # Interactive startup guide (Level 2-4)
233
+ .mcp.json # Project-scoped MCP server config
234
+ .env # API keys (gitignored, masked in output)
235
+ ```
523
236
 
524
- **vs. DSPy, Meta-Harness, OpenEvolve:**
525
- - Kairn is CLI-first and project-scoped (not a framework library)
526
- - Integrated with Claude Code's native hooks API (not custom inference)
527
- - Generates MCP configurations alongside harness (full integration)
237
+ **Tool registry:** 28 curated MCP servers across reasoning, code, search, browser automation, data/infrastructure, communication, security, and design. Auto-selected based on workflow — fewer tools = less context bloat = better agent performance.
528
238
 
529
239
  ---
530
240
 
531
241
  ## Roadmap
532
242
 
533
- ### v1.x (Complete)
534
- Local CLI for generating and managing Claude Code environments. Includes advanced patterns (sprint contracts, multi-agent QA, autonomy levels), templates, secrets management, and Claude Code power patterns (TDD, verification, known gotchas).
243
+ ### v1.x (Complete)
244
+ Local CLI: intent compilation, project scanning, templates, secrets management, autonomy levels (1-4), interactive clarification, branded CLI, verification patterns, sprint contracts, multi-agent QA, 8 LLM providers.
535
245
 
536
- ### v2.x (In Progress)
246
+ ### v2.x (Current — v2.11.0)
537
247
  **Kairn Evolve** — automated harness optimization.
538
248
 
539
- - **v2.0.0** ✅ Task Definition & Trace Infrastructure
540
- - **v2.1.0** ✅ The Evolution Loop
541
- - **v2.2.0** ✅ Diagnosis & Reporting
542
- - **v2.2.1-2.2.8** ✅ Bug fixes & optimizations
543
- - **v2.3.0** Eval Quality & Auth (Claude Code subscription OAuth, prompt caching)
544
- - **v2.4.0** Intelligent Evolution (principal proposer, experience replay, exploration/exploitation)
545
- - **v2.5.0** 🔄 Intent-Aware Harnesses (in-progress Ralph loop)
546
- - **v2.6.0** Structured Harness IR (mutations on typed IR, not raw text)
547
- - **v2.7.0** Polish & Integration (dashboard, watch mode, CI/CD integration)
249
+ - **v2.0** ✅ Task definition, trace infrastructure, eval templates
250
+ - **v2.1** ✅ The evolution loop (evaluate → diagnose → mutate → re-evaluate → rollback)
251
+ - **v2.2** ✅ Diagnosis, reporting, parallel evaluation, anti-regression guards
252
+ - **v2.3** ✅ Eval quality, Claude Code subscription auth, prompt caching
253
+ - **v2.5** Intent-aware harnesses (two-tier routing, self-learning promotion)
254
+ - **v2.6** Population-based training (Thompson sampling, KL regularization, Meta-Principal synthesis)
255
+ - **v2.7** Structured Harness IR (14 node types, 17 mutations, semantic diff, round-trip renderer)
256
+ - **v2.8** Hybrid scoring, prompt caching (~85% savings), targeted re-evaluation (~40% cost reduction)
257
+ - **v2.9** Anthropic patterns (sprint contracts, model routing, 20+ security rules, memory persistence)
258
+ - **v2.10** ✅ Persistent execution loops (/project:persist, auto-routing, progress tracking)
259
+ - **v2.11** ✅ Multi-agent compilation (orchestrator → specialist agents → linker → HarnessIR)
260
+ - **v2.12** ⏳ Polish: live dashboard, describe→evolve integration, CI/CD, template evolution
548
261
 
549
262
  ### v3.x (Aspirational)
550
- Broader harness scope (plugins, external tools), paid tool connections, hosted platform, learning system.
263
+ Fleet-scale harness optimization. Swarm manifest compilation. Inter-agent contract validation. Runtime-agnostic harness IR (Claude Code, Hermes, OpenClaw). Tool marketplace with proposer-initiated discovery.
551
264
 
552
265
  ---
553
266
 
554
267
  ## Security
555
268
 
556
- - **API keys stay local.** Stored at `~/.kairn/config.json`, never transmitted.
557
- - **Every environment includes security rules.** Deny rules for `rm -rf`, `curl | sh`, reading `.env` and `secrets/`.
558
- - **Curated registry only.** Every MCP server is manually verified.
559
- - **Environment variable references.** MCP configs use `${ENV_VAR}` syntax — secrets never written to config files.
560
- - **Path traversal protection.** Evolution mutations are validated against `../` injection.
561
- - **Hooks in settings.json** — `PreToolUse` hooks block destructive commands, `PostCompact` hooks restore context.
269
+ - API keys stay local (`~/.kairn/config.json`, never transmitted)
270
+ - Every environment includes 20+ PreToolUse deny rules across credential leaks, injection, destructive ops, and network exfiltration
271
+ - Curated MCP registry only: every server manually verified
272
+ - Environment variables use `${ENV_VAR}` syntax — secrets never written to config files
273
+ - Path traversal protection on all evolution mutations
274
+ - Hooks block destructive commands; PostCompact restores context
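Two of the points above, `${ENV_VAR}` references and path traversal protection, can be illustrated with a minimal sketch. The helper names `expandEnvRefs` and `isSafeMutationPath`, the `EXAMPLE_API_TOKEN` variable, and the config shape are hypothetical and chosen for illustration; this is not Kairn's actual implementation, only one way such checks might look.

```typescript
// Hypothetical helper: resolve ${NAME} references at load time, so the
// committed config file only ever contains the reference, not the secret.
function expandEnvRefs(value: string, env: Record<string, string | undefined>): string {
  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      throw new Error("Missing environment variable: " + name);
    }
    return resolved;
  });
}

// Hypothetical helper: reject absolute paths and any ".." segment before a
// mutation is allowed to write a file inside the harness directory.
function isSafeMutationPath(relPath: string): boolean {
  if (relPath === "" || /^(\/|\\|[A-Za-z]:)/.test(relPath)) return false;
  return !relPath.split(/[\\/]+/).some((segment) => segment === "..");
}

// Assumed config shape, for illustration only.
const committed = { command: "npx", env: { API_TOKEN: "${EXAMPLE_API_TOKEN}" } };
const resolvedToken = expandEnvRefs(committed.env.API_TOKEN, { EXAMPLE_API_TOKEN: "s3cret" });

console.log(resolvedToken);                          // prints "s3cret"
console.log(isSafeMutationPath("rules/e2e.md"));     // prints true
console.log(isSafeMutationPath("../secrets/.env"));  // prints false
```

Failing loudly on a missing variable is a design choice in this sketch: silently substituting an empty string would hide a misconfigured key until a tool call fails.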
562
275
 
563
276
  ---
564
277
 
565
278
  ## FAQ
566
279
 
567
- **Q: Do I need a Kairn account?**
568
- A: No. Kairn is a local CLI. Your API key for Claude/GPT/Gemini is configured once and stored locally.
280
+ **Do I need an account?** No. Local CLI, your API key, no backend.
569
281
 
570
- **Q: Does Kairn send my code to external servers?**
571
- A: No. All LLM calls use your own API key. Kairn CLI has no backend.
282
+ **Does Kairn send my code anywhere?** No. All LLM calls use your key. Nothing leaves your machine except API requests.
572
283
 
573
- **Q: Can I use Kairn with Claude Code on a team?**
574
- A: Yes. Generate the harness locally, commit `.claude/` to git. Team members run `claude` and get the same environment. The evolve loop runs locally per person (results don't auto-merge).
284
+ **Team use?** Generate locally, commit `.claude/` to git. Everyone gets the same environment.
575
285
 
576
- **Q: What if I want to keep my manual `.claude/` customizations?**
577
- A: Use `kairn optimize --diff` to preview changes. You can selectively accept or reject them. For full control, don't use `optimize` — use `describe` once and then hand-edit the generated files.
286
+ **Keep manual customizations?** `kairn optimize --diff` previews changes. Accept or reject selectively.
578
287
 
579
- **Q: How much does evolution cost?**
580
- A: Depends on your model, iteration count, and task volume. A 5-iteration evolution run with 5 tasks on Anthropic:
581
- - Evaluation: ~100K tokens per iteration (traces logged)
582
- - Proposer: ~80K tokens per iteration (diagnosis + mutation)
583
- - Re-evaluation: ~100K tokens per iteration
584
- - **Total:** ~1.5M tokens = ~$15-50 (Opus/Claude 3) or ~$2-5 (Haiku)
288
+ **Evolution cost?** 5 iterations, 5 tasks on Anthropic: ~1.5M tokens (~$15-50 Opus, ~$2-5 Haiku). PBT multiplies by branch count but runs concurrently.
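As a rough check on the token figure, the per-iteration estimates from the earlier README breakdown (about 100K for evaluation, 80K for the proposer, 100K for re-evaluation) can be summed. The helper `estimateTokens` is hypothetical, and the linear scaling with branch count is an assumption based on the statement that PBT multiplies cost by the number of branches.

```typescript
// Per-iteration token estimates quoted in the earlier (2.10.0) README breakdown.
const TOKENS_PER_ITERATION = { evaluation: 100000, proposer: 80000, reEvaluation: 100000 };

// Hypothetical helper: total tokens for a run, assuming PBT scales linearly
// with the number of branches.
function estimateTokens(iterations: number, branches: number = 1): number {
  const perIteration =
    TOKENS_PER_ITERATION.evaluation +
    TOKENS_PER_ITERATION.proposer +
    TOKENS_PER_ITERATION.reEvaluation;
  return perIteration * iterations * branches;
}

console.log(estimateTokens(5));    // 1400000, in line with the rounded ~1.5M figure
console.log(estimateTokens(5, 3)); // 4200000 for a 3-branch PBT run
```

Dollar cost depends on the model's current pricing, so the sketch stops at token counts.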
585
289
 
586
- **Q: Can I evolve just one task?**
587
- A: Yes. `kairn evolve run --task <task_id>` runs a single task.
588
-
589
- **Q: What's the intent router doing on my prompt?**
590
- A: When you type a prompt like "deploy this", the intent router:
591
- 1. Checks Tier 1 regex patterns (fast, free)
592
- 2. If no match, sends to Tier 2 (Haiku, ~$0.001)
593
- 3. Injects `/project:deploy` into your message context
594
- 4. Claude reads that and executes the command
595
-
596
- You can disable it with `"enableTier2": false` in settings.json if you find it intrusive.
290
+ **What's the intent router doing?** Intercepts natural language prompts, matches to `/project:*` commands via regex (free) or Haiku (~$0.001). Disable Tier 2 with `"enableTier2": false`.
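The two-tier lookup described above can be sketched as follows. This is a simplified illustration, not the generated `intent-router.mjs`: the `RouterOptions` shape, the example patterns, and the synchronous `classify` callback (standing in for the Haiku call, which would be asynchronous in practice) are all assumptions.

```typescript
// Hypothetical types; the real hook's data structures may differ.
type Route = { command: string; tier: 1 | 2 };

interface RouterOptions {
  patterns: Array<[RegExp, string]>;            // Tier 1: regex -> /project:* command
  classify: (prompt: string) => string | null;  // Tier 2: stand-in for a Haiku classifier
  enableTier2: boolean;                         // mirrors the "enableTier2" setting
}

function routeIntent(prompt: string, opts: RouterOptions): Route | null {
  // Tier 1: free, local regex matching.
  for (const [pattern, command] of opts.patterns) {
    if (pattern.test(prompt)) return { command, tier: 1 };
  }
  // Tier 2: only consulted when Tier 1 misses and the fallback is enabled.
  if (!opts.enableTier2) return null;
  const command = opts.classify(prompt);
  return command ? { command, tier: 2 } : null;
}

const options: RouterOptions = {
  patterns: [
    [/\bdeploy\b/i, "/project:deploy"],
    [/\b(test|spec)s?\b/i, "/project:test"],
  ],
  classify: (p) => (/ship/i.test(p) ? "/project:deploy" : null),
  enableTier2: true,
};

console.log(routeIntent("deploy this", options));      // matched by Tier 1
console.log(routeIntent("ship it to prod", options));  // falls through to Tier 2
console.log(routeIntent("ship it to prod", { ...options, enableTier2: false })); // null
```

Checking the regex tier first keeps the common case free; the paid classifier is only reached for phrasing the patterns do not yet cover.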
597
291
 
598
292
  ---
599
293
 
600
294
  ## Contributing
601
295
 
602
- Kairn is open-source. Contributions welcome:
603
- - New MCP servers to the registry
604
- - Eval task templates for new project types
605
- - Improved proposer prompts
606
- - Bug reports and UX feedback
607
-
608
- ---
296
+ Kairn is open-source. Contributions welcome: MCP servers to the registry, eval task templates, proposer prompt improvements, bug reports.
609
297
 
610
298
  ## License
611
299