claude-overnight 1.51.2 → 1.53.0

package/README.md CHANGED
@@ -1,33 +1,36 @@
1
1
  # claude-overnight
2
2
 
3
- Parallel Claude agents in isolated git worktrees. Set a usage cap so your interactive Claude Code keeps its headroom. Rate-limited? It waits. Crash? It resumes with full context.
3
+ Overnight coding swarms in isolated git worktrees that plan, execute, review, and steer themselves until the objective is met. Hand it a goal and a budget, walk away, review the diff in the morning.
4
4
 
5
- Hand it an objective and a session budget, walk away, review the diff when the run ends. Every agent runs in its own worktree on its own branch a misbehaving agent can't trash your working tree. Unmerged branches are preserved for manual review, never discarded.
5
+ Every agent runs in its own worktree on its own branch, so a misbehaving session cannot trash your working tree. Unmerged branches are preserved for manual review, never discarded. Set a usage cap (say 90%) and your interactive Claude Code still has headroom to answer questions while the swarm runs.
6
6
 
7
- Built on the [Claude Agent SDK](https://www.npmjs.com/package/@anthropic-ai/claude-agent-sdk) every planner, worker, reviewer, and verifier session runs on the SDK's agent harness. `claude-overnight` is the orchestrator around that harness: it plans, routes, resumes, reviews, and persists many SDK sessions at once. Because the harness speaks Anthropic Messages, each role can run on Anthropic direct or any compatible endpoint.
7
+ Built on the [Claude Agent SDK](https://www.npmjs.com/package/@anthropic-ai/claude-agent-sdk): every planner, worker, reviewer, and verifier session runs on the SDK's agent harness with full session resume, streaming, and transcripts. `claude-overnight` is the orchestrator around that harness. It plans, routes, curates, resumes, and persists many SDK sessions at once. Because the harness speaks the Anthropic Messages API, any compatible endpoint plugs in as a role.
8
8
 
9
- ## Three execution layers
9
+ ## Three execution layers, mix per run
10
10
 
11
- Every run can mix and match three execution layers:
12
-
13
- | Layer | What it does | Typical choice |
11
+ | Layer | Runs on | What it does |
14
12
  |---|---|---|
15
- | Planner | Thinking wave, orchestration, steering, review, final gate | your strongest model |
16
- | Main worker | Bulk implementation | a reliable coding model |
17
- | Fast worker (optional) | Cheap, well-scoped tasks checked by later waves | a cheaper/faster model |
13
+ | Planner (harness) | Opus 4.6, Sonnet 4.6 | Thinking wave, orchestration, steering, post-wave review, final gate |
14
+ | Main worker | Sonnet, Gemini 2.5, Qwen 3.6 Plus, DeepSeek, any Anthropic-compatible endpoint | Bulk implementation |
15
+ | Fast worker (optional) | Kimi 2.6 Coding, Cursor composer-2, Haiku | Cheap well-scoped tasks, double-checked by later waves |
16
+
17
+ A common recipe: **Opus planner + Sonnet bulk worker + Cursor composer-2 fast worker**. Another: **Opus planner + Kimi 2.6 bulk worker + Haiku fast worker**. Providers are saved once to `~/.claude/claude-overnight/providers.json` and appear in every future run. The bundled `cursor-composer-in-claude` proxy makes Cursor-hosted models (`auto`, `composer`, `composer-2`) look like a normal provider.
18
+
19
+ ## What this recipe does that others do not
18
20
 
19
- The layers are configured independently. A common setup is Claude on the planner, Kimi or Qwen on the main worker, and Cursor or Haiku as the fast worker.
21
+ **Self-curating skill memory that improves mid-run.** Workers emit memory candidates when they discover something reusable: a repo-specific quirk, a recovery path, a command sequence that worked, a tool recipe worth saving. A scribe appends each candidate to disk without blocking the run. At the end of every wave, a **librarian** pass curates the queue. It promotes candidates into canon, patches existing skills via diff-style edits, or quarantines stale ones. **Wave N+1 of the same run starts with a better skill library than wave N.** Across runs, the library compounds. Inspired by Nous Research's Hermes Agent (Feb 2026), with progressive disclosure (L0 stub in every prompt, L1 body loaded on demand, L2 references on request), SQLite FTS5 retrieval, and per-skill win-rate tracking that auto-quarantines rot.
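The wave-end curation described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the `Skill` class, the `curate` function, the candidate dictionary shape, and the `min_uses` / `min_win_rate` thresholds are all hypothetical, and a whole-body replace stands in for the diff-style patch.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    # Hypothetical record for one canon skill; field names are assumptions.
    name: str
    body: str
    wins: int = 0
    uses: int = 0
    quarantined: bool = False

    @property
    def win_rate(self) -> float:
        # Skills that were never used count as neutral, so they are not
        # quarantined before they have had a chance to prove themselves.
        return self.wins / self.uses if self.uses else 1.0


def curate(canon: dict, queue: list, min_uses: int = 5,
           min_win_rate: float = 0.4) -> dict:
    """One hypothetical librarian pass over the queued memory candidates."""
    for cand in queue:
        existing = canon.get(cand["name"])
        if existing is None:
            # Promote a new candidate into canon.
            canon[cand["name"]] = Skill(cand["name"], cand["body"])
        else:
            # Patch an existing skill (simplified to a whole-body replace).
            existing.body = cand["body"]
    for skill in canon.values():
        # Quarantine skills whose tracked win rate has dropped too low.
        if skill.uses >= min_uses and skill.win_rate < min_win_rate:
            skill.quarantined = True
    queue.clear()  # the queue is drained once the wave has been curated
    return canon
```

Because the pass runs between waves, the next wave reads the updated `canon` rather than the raw queue.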
20
22
 
21
- ## First-class features
23
+ **Self-fixing, not just self-running.** Every task agent reviews its own `git diff` via SDK session resume (same session, full task context, no re-prompting) and runs a simplify pass before the commit lands. After each wave a dedicated review agent scans the consolidated diff for cross-agent issues the individual sessions could not see: missed reuse, copy-paste variations, leaky abstractions. When steering declares the objective done, a final gate reviews the full `git diff main` for architecture coherence before anything reaches your working tree.
22
24
 
23
- - **Harness-first orchestration.** This is not a replacement runtime. It is a multi-session control plane for the Claude Agent SDK harness, so you keep the same tool loop, session resume behavior, streaming model, and transcript format across the whole swarm.
24
- - **Dynamic repo memory.** Agents can propose reusable memory candidates during execution. A librarian curates them at the end of each wave, updates the skill index, and future waves see only a compact stub plus on-demand hydration instead of a giant static prompt.
25
- - **Run memory that compounds.** Long runs keep a live status snapshot, archived milestones, and an evolving goal file so steering can pick up exactly where it left off, even after rate limits, crashes, or an overnight stop.
26
- - **Embedded Cursor flexibility.** Cursor-hosted models are routed through a bundled `cursor-composer-in-claude` proxy, so Cursor becomes just another planner / worker / fast-worker option instead of a separate workflow.
25
+ **Multi-wave autonomous loop, not fire-and-forget.** After each wave a steering pass asks "how good is this?" and chooses between executing more tasks, spinning up a deeper reflection wave, or declaring done. The loop keeps going until steering is satisfied, the budget is exhausted, or the usage cap trips. Long runs keep a living status snapshot, archived milestones every five waves, and an evolving goal file, so steering picks up exactly where it left off after a rate limit or an overnight stop.
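The three stop conditions in that loop can be sketched as below. This is an assumption-laden illustration: `run_waves`, its parameters, and the string decisions are hypothetical names, and real steering also distinguishes execution waves from reflection waves, which this sketch treats alike.

```python
from typing import Callable, Tuple


def run_waves(steer: Callable[[int], str], budget: int, tasks_per_wave: int,
              usage: Callable[[], float], usage_cap: float) -> Tuple[str, int]:
    """Run waves until steering is done, the budget runs out, or the cap trips.

    Returns (stop_reason, waves_run). All names here are illustrative.
    """
    wave = 0
    remaining = budget
    while True:
        if usage() >= usage_cap:
            return "cap", wave      # usage cap reached: accept no new work
        if remaining <= 0:
            return "budget", wave   # session budget exhausted
        remaining -= min(tasks_per_wave, remaining)
        wave += 1
        # After each wave, steering answers "execute", "reflect", or "done".
        if steer(wave) == "done":
            return "done", wave
```

For example, `run_waves(lambda w: "execute", 10, 5, lambda: 0.1, 0.9)` stops after two waves with reason `"budget"`.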
26
+
27
+ **Headroom-aware usage cap.** Set the cap to 90% of your 5h window and the swarm stops accepting new work there. Your interactive Claude Code keeps the remaining 10% to answer questions or run its own sessions while the overnight run grinds on.
28
+
29
+ **Crash-safe by design.** Planner state, the task plan, design docs, per-query NDJSON transcripts, steering decisions, and wave milestones all land on disk as they are produced. If the process dies mid-plan, the next resume salvages `tasks.json` and skips the expensive thinking wave. Planner crashes do not lose the $2 to $4 of orchestration work that already happened.
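As a rough sketch of the salvage idea, assuming `tasks.json` holds a JSON list of tasks (the real schema is not shown here) and using hypothetical helper names `write_atomic` and `salvage_tasks`:

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_atomic(path: Path, data: object) -> None:
    """Write JSON to a temp file and rename it, so a crash never leaves a torn file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    os.replace(tmp, path)  # atomic rename within the same directory


def salvage_tasks(run_dir: Path) -> Optional[list]:
    """Return a persisted task plan, or None if planning has to run again."""
    try:
        tasks = json.loads((run_dir / "tasks.json").read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    # Assumption: a usable plan is a non-empty JSON list.
    return tasks if isinstance(tasks, list) and tasks else None
```

On resume, a non-`None` result would let the caller skip the thinking wave and go straight to execution.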
27
30
 
28
31
  ## Run on Kimi 2.6
29
32
 
30
- Want a cheap Anthropic-compatible worker with a simple shell setup? Kimi 2.6 via Kimi's coding endpoint is a drop-in worker that speaks the Anthropic Messages API -- same client, same flow, just a different base URL.
33
+ Want a cheap Anthropic-compatible worker with a simple shell setup? Kimi 2.6 via Kimi's coding endpoint is a drop-in worker that speaks the Anthropic Messages API: same client, same flow, just a different base URL.
31
34
 
32
35
  1. **Configure the provider.** Run `claude-overnight`, choose `Other…` on the worker step, and fill in:
33
36
 
@@ -51,9 +54,9 @@ claude-overnight
51
54
 
52
55
  ## Run on Qwen 3.6 Plus
53
56
 
54
- Hit your Claude Max plan limits? Running on a tight budget? Qwen 3.6 Plus via Alibaba Cloud's DashScope gateway is a drop-in worker that speaks the Anthropic Messages API -- same client, same flow, pennies per run.
57
+ Hit your Claude Max plan limits? Running on a tight budget? Qwen 3.6 Plus via Alibaba Cloud's DashScope gateway is a drop-in worker that speaks the Anthropic Messages API: same client, same flow, pennies per run.
55
58
 
56
- 1. **Get an API key.** Sign up at [Alibaba Cloud](https://account.alibabacloud.com/login/login.htm?oauth_callback=https%3A%2F%2Fmodelstudio.console.alibabacloud.com%2Fap-southeast-1%3Ftab%3Ddashboard%23%2Fapi-key&clearRedirectCookie=1) -- the link takes you straight to the API key dashboard.
59
+ 1. **Get an API key.** Sign up at [Alibaba Cloud](https://account.alibabacloud.com/login/login.htm?oauth_callback=https%3A%2F%2Fmodelstudio.console.alibabacloud.com%2Fap-southeast-1%3Ftab%3Ddashboard%23%2Fapi-key&clearRedirectCookie=1); the link takes you straight to the API key dashboard.
57
60
  2. **Configure the provider.** Run `claude-overnight`, choose `Other…` on the worker step, and fill in:
58
61
 
59
62
  | Field | Value |
@@ -124,19 +127,19 @@ If the bundled proxy cannot auto-start, the setup wizard prints the exact `node
124
127
 
125
128
  ### macOS: “Keychain Not Found” / `cursor-user`
126
129
 
127
- The Cursor **`agent`** binary stores an interactive login as **`cursor-user`** in your **login** keychain. For automation, use a **[User API key](https://cursor.com/docs/cli/headless)** (`export CURSOR_API_KEY=...` from [Integrations](https://cursor.com/dashboard/integrations)) the bundled proxy then does not need Keychain. `claude-overnight` forces `CURSOR_SKIP_KEYCHAIN=1` and `CI=true`; if System Settings still shows **“A keychain cannot be found to store …”**, the login keychain is often missing or damaged: open **Keychain Access → First Aid** on **login**, or use **Reset To Defaults** in the dialog. Some users fix a stuck keychain with:
130
+ The Cursor **`agent`** binary stores an interactive login as **`cursor-user`** in your **login** keychain. For automation, use a **[User API key](https://cursor.com/docs/cli/headless)** (`export CURSOR_API_KEY=...` from [Integrations](https://cursor.com/dashboard/integrations)): the bundled proxy then does not need Keychain. `claude-overnight` forces `CURSOR_SKIP_KEYCHAIN=1` and `CI=true`; if System Settings still shows **“A keychain cannot be found to store …”**, the login keychain is often missing or damaged: open **Keychain Access → First Aid** on **login**, or use **Reset To Defaults** in the dialog. Some users fix a stuck keychain with:
128
131
 
129
132
  ```bash
130
133
  security unlock-keychain ~/Library/Keychains/login.keychain-db
131
134
  ```
132
135
 
133
- **Automation:** Saving a key via **Cursor…** in `claude-overnight` is enough it is written to `providers.json` and injected into both the Claude SDK env and the bundled proxy (including `CURSOR_API_KEY` for the native `agent`). You do not need to `export` variables unless you want to override for one shell.
136
+ **Automation:** Saving a key via **Cursor…** in `claude-overnight` is enough. It is written to `providers.json` and injected into both the Claude SDK env and the bundled proxy (including `CURSOR_API_KEY` for the native `agent`). You do not need to `export` variables unless you want to override for one shell.
134
137
 
135
138
  **Advanced:** If something else must share port `8765` and you manage the proxy yourself, set `CURSOR_OVERNIGHT_NO_PROXY_RESTART=1` to skip the automatic “replace listener” step when a Cursor API token is present.
136
139
 
137
140
  **How headless Cursor + macOS Keychain actually works (discovery):** We documented the full investigation: why ACP was the wrong path for opus/sonnet `*-thinking-*` variants (model-name mismatch → silent `exit 1`), how **chat-only workspace** (default in cursor-composer) fakes `HOME` and triggers **Keychain timeouts** despite a User API key, and how a cloned **account pool** makes parallel cursor-agent spawns race-free. See **[docs/CURSOR_PROXY_MACOS_DISCOVERY.md](docs/CURSOR_PROXY_MACOS_DISCOVERY.md)**.
138
141
 
139
- **Quick reference bundled proxy env:** `CURSOR_BRIDGE_USE_ACP=0` (CLI streaming path accepts all friendly model names), `CURSOR_BRIDGE_CHAT_ONLY_WORKSPACE=false`, `CURSOR_CONFIG_DIRS=<5 cloned pool dirs>` (parallel-safe), plus `CURSOR_API_KEY` / `CURSOR_AUTH_TOKEN` / `CURSOR_BRIDGE_API_KEY` and `CURSOR_SKIP_KEYCHAIN=1` / `CI=true`. Details and tables are in the doc above.
142
+ **Quick reference, bundled proxy env:** `CURSOR_BRIDGE_USE_ACP=0` (CLI streaming path accepts all friendly model names), `CURSOR_BRIDGE_CHAT_ONLY_WORKSPACE=false`, `CURSOR_CONFIG_DIRS=<5 cloned pool dirs>` (parallel-safe), plus `CURSOR_API_KEY` / `CURSOR_AUTH_TOKEN` / `CURSOR_BRIDGE_API_KEY` and `CURSOR_SKIP_KEYCHAIN=1` / `CI=true`. Details and tables are in the doc above.
140
143
 
141
144
  **Regression / stress test:** `npm run matrix:cursor-proxy` (optional `--quick`, `--include-danger`). Use `MATRIX_MODELS=composer-2,claude-opus-4-7-thinking-high` to compare models; override `MATRIX_PORT_BASE`, `MATRIX_MODEL`, `MATRIX_MSG_TIMEOUT_MS` as needed.
142
145
 
@@ -146,7 +149,7 @@ security unlock-keychain ~/Library/Keychains/login.keychain-db
146
149
  npm install -g claude-overnight
147
150
  ```
148
151
 
149
- Requires Node.js ≥ 20. For Anthropic-direct roles, use `claude auth login` or `ANTHROPIC_API_KEY`. For provider-backed roles, save a Kimi / Qwen / Cursor / OpenRouter-compatible provider instead. No Anthropic plan or key? See **Run on Kimi 2.6** or **Run on Qwen 3.6 Plus** above -- cheap, drop-in alternatives.
152
+ Requires Node.js ≥ 20. For Anthropic-direct roles, use `claude auth login` or `ANTHROPIC_API_KEY`. For provider-backed roles, save a Kimi / Qwen / Cursor / OpenRouter-compatible provider instead. No Anthropic plan or key? See **Run on Kimi 2.6** or **Run on Qwen 3.6 Plus** above for cheap drop-in alternatives.
150
153
 
151
154
  ## Quick start
152
155
 
@@ -163,13 +166,13 @@ claude-overnight
163
166
 
164
167
  ② Budget [10]: 200
165
168
 
166
- ④ Planner model (thinking, steering -- use your strongest):
167
- ● Opus -- Opus 4.6 · Most capable
168
- ○ Sonnet -- Sonnet 4.6 · Best for everyday tasks
169
+ ④ Planner model (thinking, steering; use your strongest):
170
+ ● Opus · Opus 4.6 · Most capable
171
+ ○ Sonnet · Sonnet 4.6 · Best for everyday tasks
169
172
 
170
- ⑤ Worker model (what runs the tasks -- Kimi 2.6 / Qwen 3.6 Plus / OpenRouter / etc via Other…):
171
- ● Sonnet -- Sonnet 4.6 · Best for everyday tasks
172
- ○ Opus -- Opus 4.6 · Most capable
173
+ ⑤ Worker model (runs the tasks; Kimi 2.6 / Qwen 3.6 Plus / OpenRouter / etc via Other…):
174
+ ● Sonnet · Sonnet 4.6 · Best for everyday tasks
175
+ ○ Opus · Opus 4.6 · Most capable
173
176
  ○ Other… · custom OpenAI/Anthropic-compatible endpoint
174
177
 
175
178
  ⑥ Usage cap:
@@ -196,7 +199,7 @@ claude-overnight
196
199
  ◆ Assessing... ✓ Done
197
200
  ```
198
201
 
199
- You interact once (objective, budget, model, review themes), then the rest runs unattended -- thinking, planning, executing, curating memory, reflecting, steering. Rate-limited? It waits and retries. Crash? Resume where you left off. Capped at usage limit? Pick up next time with full context preserved.
202
+ You interact once (objective, budget, model, review themes), then the rest runs unattended: thinking, planning, executing, curating memory, reflecting, steering. Rate-limited? It waits and retries. Crash? Resume where you left off. Capped at usage limit? Pick up next time with full context preserved.
200
203
 
201
204
  ## Use cases
202
205
 
@@ -204,66 +207,49 @@ Overnight refactors, batch feature implementation, codebase-wide cleanups, test
204
207
 
205
208
  ## Typical flow
206
209
 
207
- ```mermaid
208
- flowchart TD
209
- subgraph Setup["Setup + planning"]
210
- A["Start or resume run"] --> B["Optional setup coach<br/>rewrite objective + suggest settings"]
211
- B --> C["Pick planner / worker / fast worker<br/>budget + concurrency + worktree mode"]
212
- C --> D["Optional provider preflight<br/>real auth / write probes"]
213
- D --> E["Theme discovery + user review/edit/chat"]
214
- E --> F["Thinking wave<br/>planner explores the codebase"]
215
- F --> G["Task orchestration<br/>planner writes concrete tasks"]
216
- end
217
-
218
- subgraph Wave["Per-wave loop"]
219
- G --> H["beforeWave hook<br/>optional shell commands"]
220
- H --> I["Execution wave<br/>main worker + optional fast worker<br/>isolated git worktrees"]
221
- I --> J["Per-agent simplify pass<br/>same SDK session resumes"]
222
- J --> K["Debrief + afterWave hook"]
223
- K --> L["Post-wave review<br/>flex mode"]
224
- L --> M["Wave-end librarian pass"]
225
- M --> N{"Flex mode?"}
226
- N -->|yes| O["Steering<br/>update status / milestones / goal"]
227
- N -->|no| P["Verifier<br/>fixed-plan gate between waves"]
228
- O -->|execute more| H
229
- O -->|reflect deeper| Q["Reflection wave<br/>extra review / audit"]
230
- Q --> O
231
- O -->|done| R["Final gate<br/>review full diff"]
232
- P -->|more work| H
233
- P -->|done| R
234
- end
235
-
236
- subgraph Memory["Dynamic repo memory"]
237
- S["Workers discover reusable patterns"] --> T["Scribe writes memory candidates"]
238
- T --> U["Librarian curates candidates"]
239
- U --> V["Canon markdown + SQLite index updated"]
240
- V --> W["Future waves get L0 stub<br/>hydrate L1/L2 on demand"]
241
- end
242
-
243
- J -. emits candidates .-> S
244
- M -. curates queue .-> U
245
- W -. informs later waves .-> I
246
- W -. informs planner decisions .-> O
247
- R --> X["afterRun hook<br/>optional shell commands"]
210
+ ```
211
+ ┌─ Setup + planning ──────────────────────────────────────────────┐
212
+ │ start/resume → coach rewrites objective → pick planner, │
213
+ │ worker, fast worker → provider preflight → theme review │
214
+ │ → thinking wave (parallel architects) → task orchestration │
215
+ └──────────────────────┬──────────────────────────────────────────┘
216
+
217
+ ┌─ Wave loop ──────────▼──────────────────────────────────────────┐
218
+ │ beforeWave hook → execution wave (workers in worktrees) │
219
+ │ → per-agent simplify pass (session resume on same context) │
220
+ │ → debrief + afterWave hook → post-wave review agent │
221
+ │ → librarian curates skill candidates into canon │
222
+ │ → steering decides: execute more │ reflect deeper │ done │
223
+ │ ↑ │
224
+ │ └── loop until done, budget out, or cap hit │
225
+ │ → final gate reviews full `git diff main` │
226
+ └─────────────────────────────────────────────────────────────────┘
227
+
228
+ ┌─ Skill memory (compounds within a run and across runs) ─────────┐
229
+ │ workers emit candidates → scribe writes to disk │
230
+ │ → librarian curates at wave end → canon markdown + │
231
+ │ SQLite FTS5 index updated → next wave gets an L0 stub, │
232
+ │ hydrates L1 body on demand, L2 references on request │
233
+ └─────────────────────────────────────────────────────────────────┘
248
234
  ```
249
235
 
250
- The chart above shows the main user-visible lifecycle. It intentionally omits some engine-internal branches such as health-check heal tasks, A/B skill assignment, zero-work retry, budget-extension prompts, and resume salvage after planning crashes.
236
+ This is the main user-visible lifecycle. Engine-internal branches (health-check heal tasks, A/B skill assignment across sibling branches, zero-work retry, budget-extension prompts, resume salvage after planning crashes) are omitted for clarity.
251
237
 
252
- ### 1. Thinking phase -- parallel architect sessions
238
+ ### 1. Thinking phase: parallel architect sessions
253
239
 
254
240
  For budgets > 15, the tool launches **architect agents** that explore your codebase before any code is written. Each one gets a different research angle (architecture, data models, APIs, testing, etc.) and writes a structured design document. The number scales with budget: 5 for budget=50, 10 for budget=2000.
255
241
 
256
242
  ### 2. Task orchestration
257
243
 
258
- An orchestrator session reads all design documents and synthesizes concrete execution tasks -- grounded in real files and patterns the architects found. The task plan is also written to a file for resilience -- if orchestration is interrupted, partial results survive.
244
+ An orchestrator session reads all design documents and synthesizes concrete execution tasks, grounded in real files and patterns the architects found. The task plan is also written to a file for resilience: if orchestration is interrupted, partial results survive.
259
245
 
260
246
  ### 3. Parallel execution waves
261
247
 
262
- Tasks run in parallel agent sessions (each in its own git worktree). After completing its task, each session automatically runs a **simplify pass** -- reviewing its own `git diff` for code reuse opportunities, quality issues, and inefficiencies, then fixing them before the framework commits. This is done via the SDK's **session resume** mechanism: the same agent session continues with a follow-up prompt, so the agent's full context from its task is still available -- no need to re-instruct or re-fill context. If a fast worker is configured, steering can route cheaper, well-scoped tasks there while the main worker handles heavier implementation.
248
+ Tasks run in parallel agent sessions (each in its own git worktree). After completing its task, each session automatically runs a **simplify pass**: it reviews its own `git diff` for code reuse opportunities, quality issues, and inefficiencies, then fixes them before the framework commits. This is done via the SDK's **session resume** mechanism: the same agent session continues with a follow-up prompt, so the agent's full context from its task is still available, with no need to re-instruct or re-fill context. If a fast worker is configured, steering can route cheaper, well-scoped tasks there while the main worker handles heavier implementation.
263
249
 
264
250
  ### 4. Post-wave review
265
251
 
266
- After each wave (flex mode, budget remaining), a dedicated **review agent** inspects the consolidated diff for issues the individual agents may have blind-spotted: missed reuse opportunities, copy-paste variations, leaky abstractions, efficiency regressions. Runs as a single-agent wave -- one session reviews what the swarm just produced.
252
+ After each wave (flex mode, budget remaining), a dedicated **review agent** inspects the consolidated diff for issues the individual agents may have missed: overlooked reuse opportunities, copy-paste variations, leaky abstractions, efficiency regressions. It runs as a single-agent wave: one session reviews what the swarm just produced.
267
253
 
268
254
  ### 5. Librarian and dynamic memory
269
255
 
@@ -273,7 +259,7 @@ At the end of each wave, a **librarian** pass curates that queue. It can promote
273
259
 
274
260
  ### 6. Steering
275
261
 
276
- After each wave, steering assesses: "how good is this?" -- not "what's missing?" It can:
262
+ After each wave, steering asks "how good is this?" rather than "what's missing?" It can:
277
263
 
278
264
  - **Execute** more tasks to build features, fix bugs, polish UX
279
265
  - **Reflect** by spinning up 1-2 review sessions for deep quality/architecture audits
@@ -287,17 +273,17 @@ When the run completes (steering declares done), a final **comprehensive review*
287
273
 
288
274
  Long runs stay sharp because steering maintains three run-memory layers:
289
275
 
290
- - **Status** -- a living project snapshot, updated every wave. Compressed, never truncated.
291
- - **Milestones** -- strategic snapshots archived every ~5 waves. Long-term memory.
292
- - **Goal** -- the evolving north star. What quality means for this codebase.
276
+ - **Status**: a living project snapshot, updated every wave. Compressed, never truncated.
277
+ - **Milestones**: strategic snapshots archived every ~5 waves. Long-term memory.
278
+ - **Goal**: the evolving north star. What quality means for this codebase.
293
279
 
294
280
  ### Progressive-disclosure repo memory
295
281
 
296
282
  The repo memory system is separate from the run folder and is designed around three disclosure layers so context stays small:
297
283
 
298
- - **L0** -- a tiny ranked stub injected into planner and worker prompts. It lists only the names and descriptions of the most relevant project-specific skills and tool recipes.
299
- - **L1** -- the full skill body, loaded on demand with `skill_read(name)` when an agent wants the actual recipe or guidance.
300
- - **L2** -- attached references for deeper context. The library is structured for them even though most runs only need the L0 stub plus occasional L1 hydration.
284
+ - **L0**: a tiny ranked stub injected into planner and worker prompts. It lists only the names and descriptions of the most relevant project-specific skills and tool recipes.
285
+ - **L1**: the full skill body, loaded on demand with `skill_read(name)` when an agent wants the actual recipe or guidance.
286
+ - **L2**: attached references for deeper context. The library is structured for them even though most runs only need the L0 stub plus occasional L1 hydration.
301
287
 
302
288
  That progressive disclosure matters: the planner and workers do not carry the full memory library in every prompt. They get a compact overview, call `skill_search(query)` if they need to narrow it, and hydrate only the bodies that matter for the task in front of them.
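The L0 / L1 flow can be illustrated with a small in-memory sketch. The names `skill_search` and `skill_read` mirror the tool names mentioned above, but these bodies are hypothetical stand-ins: the real library uses SQLite FTS5 retrieval, whereas this sketch uses plain keyword overlap, and the `score`, `description`, and `body` keys are assumptions.

```python
def l0_stub(skills: dict, limit: int = 3) -> str:
    """L0: names and descriptions of the top-ranked skills only, never bodies."""
    ranked = sorted(skills.items(), key=lambda kv: kv[1]["score"], reverse=True)
    return "\n".join(f"- {name}: {meta['description']}" for name, meta in ranked[:limit])


def skill_search(skills: dict, query: str) -> list:
    """Narrow the library by keyword overlap (a stand-in for FTS5 ranking)."""
    terms = set(query.lower().split())
    hits = []
    for name, meta in skills.items():
        words = set(f"{name} {meta['description']}".lower().replace("-", " ").split())
        overlap = len(terms & words)
        if overlap:
            hits.append((overlap, name))
    # Best overlap first; ties broken alphabetically for stable output.
    return [name for _, name in sorted(hits, key=lambda h: (-h[0], h[1]))]


def skill_read(skills: dict, name: str) -> str:
    """L1: hydrate the full body of one skill on demand."""
    return skills[name]["body"]
```

Only the output of `l0_stub` would go into every prompt; bodies enter the context one `skill_read` call at a time.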
303
289
 
@@ -320,7 +306,7 @@ Every run gets its own folder in `.claude-overnight/runs/`. Nothing is ever over
320
306
  run.json, transcripts/themes.ndjson ← see exactly what the planner was doing
321
307
  ```
322
308
 
323
- Any run that stops before the steering system declares the objective complete -- capped at usage limit, Ctrl+C, crash, rate limit timeout, steering failure -- is automatically resumable:
309
+ Any run that stops before the steering system declares the objective complete (usage cap reached, Ctrl+C, crash, rate-limit timeout, steering failure) is automatically resumable:
324
310
 
325
311
  ```
326
312
  ⚠ Unfinished run
@@ -335,7 +321,7 @@ Any run that stops before the steering system declares the objective complete -
335
321
 
336
322
  On resume: unmerged branches auto-merge, the wave loop continues, all context is preserved. Designs and reflections stay on disk until the objective is truly complete.
337
323
 
338
- If the thinking phase succeeds but orchestration crashes, the next run detects the orphaned design docs and reuses them -- no re-running $9 worth of architect sessions:
324
+ If the thinking phase succeeds but orchestration crashes, the next run detects the orphaned design docs and reuses them instead of re-running $9 worth of architect sessions:
339
325
 
340
326
  ```
341
327
  ✓ Reusing 5 design docs (from prior attempt)
@@ -345,25 +331,25 @@ If the thinking phase succeeds but orchestration crashes, the next run detects t
345
331
  ...
346
332
  ```
347
333
 
348
- **Knowledge carries forward** -- new runs inherit knowledge from completed previous runs. Thinking sessions and steering see what past runs built. Run 2 knows run 1 already built the auth system.
334
+ **Knowledge carries forward.** New runs inherit knowledge from completed previous runs. Thinking sessions and steering see what past runs built. Run 2 knows run 1 already built the auth system.
349
335
 
350
336
  ### Transcripts and streaming
351
337
 
352
- Every planner/steering query streams through the Agent SDK with `includePartialMessages: true`, so tool calls, thinking, and text deltas are captured as they happen. Each query also appends an NDJSON transcript under `runs/<ts>/transcripts/<name>.ndjson` so if the planner crashes mid-think you still have the forensic trail (prompt preview, every tool use, every text/thinking delta, rate-limit events, and the final result or error). `themes.md` is also written as a human-readable summary right after the thinking wave.
338
+ Every planner/steering query streams through the Agent SDK with `includePartialMessages: true`, so tool calls, thinking, and text deltas are captured as they happen. Each query also appends an NDJSON transcript under `runs/<ts>/transcripts/<name>.ndjson`, so if the planner crashes mid-think you still have the forensic trail (prompt preview, every tool use, every text/thinking delta, rate-limit events, and the final result or error). `themes.md` is also written as a human-readable summary right after the thinking wave.
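To illustrate why an append-only NDJSON file survives a mid-query crash, here is a minimal sketch. The helper names and the event fields (`type`, `name`, `text`) are hypothetical and not the package's actual transcript schema.

```python
import json
from pathlib import Path


def append_event(transcript: Path, event: dict) -> None:
    """Append one event as a single JSON line and flush it immediately."""
    transcript.parent.mkdir(parents=True, exist_ok=True)
    with transcript.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
        fh.flush()


def read_events(transcript: Path) -> list:
    """Read events back, stopping at the first line that does not parse."""
    events = []
    for line in transcript.read_text(encoding="utf-8").splitlines():
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            break  # a crash may leave a torn final line; keep everything before it
    return events
```

Each line is independent, so a process that dies mid-write loses at most the last partial event.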
353
339
 
354
340
  Not every provider delivers the same streaming granularity:
355
341
 
356
342
  | Provider | Tool-use events | Thinking deltas | Text deltas |
357
343
  | --- | --- | --- | --- |
358
344
  | Anthropic (direct) | ✓ | ✓ | ✓ |
359
- | Cursor proxy (`cursor-composer-in-claude`) | | | ✓ (final answer only) |
345
+ | Cursor proxy (`cursor-composer-in-claude`) | no | no | ✓ (final answer only) |
360
346
  | Kimi / Qwen / OpenRouter / custom Anthropic-compatible | depends on upstream | depends | usually ✓ |
361
347
 
362
- When a provider doesn't stream partials (or the model is a reasoning model on the Cursor proxy the proxy suppresses the thinking phase and only emits the final answer), the ticker shows elapsed time with no live text, then the completed result lands in one go. The UI, transcripts, and the resume flow all behave identically either way streaming is used when available, never required.
348
+ When a provider doesn't stream partials (or the model is a reasoning model on the Cursor proxy, where the proxy suppresses the thinking phase and only emits the final answer), the ticker shows elapsed time with no live text, then the completed result lands in one go. The UI, transcripts, and the resume flow all behave identically either way: streaming is used when available, never required.
363
349
 
364
- Add `.claude-overnight/` to your `.gitignore` (with the trailing slash -- see below).
350
+ Add `.claude-overnight/` to your `.gitignore` (with the trailing slash; see below).
365
351
 
366
- A separate, tiny `claude-overnight.log.md` is also written at the repo root on every run. It's human-readable, append-only, one block per run (objective, start/finish, cost, outcome, branch), and is designed to be **committed** -- so even after `.claude-overnight/` is cleaned up you can still recover which prompt produced which commits. Use `.claude-overnight/` (with trailing slash) in your gitignore so this file isn't matched by accident.
352
+ A separate, tiny `claude-overnight.log.md` is also written at the repo root on every run. It's human-readable, append-only, one block per run (objective, start/finish, cost, outcome, branch), and is designed to be **committed**, so even after `.claude-overnight/` is cleaned up you can still recover which prompt produced which commits. Use `.claude-overnight/` (with trailing slash) in your gitignore so this file isn't matched by accident.
367
353
 
368
354
  ## Task file and inline modes
369
355
 
@@ -407,20 +393,20 @@ claude-overnight "fix auth bug in src/auth.ts" "add tests for user model"
407
393
  |---|---|---|
408
394
  | `--budget=N` | `10` | Total agent sessions |
409
395
  | `--concurrency=N` | `5` | Parallel agents |
410
- | `--model=NAME` | prompted | Worker model -- interactive picks planner + worker separately; `Other…` adds Kimi / Qwen / OpenRouter / any Anthropic-compat endpoint. In non-interactive mode, a saved provider's model id is auto-resolved to the provider. |
396
+ | `--model=NAME` | prompted | Worker model. Interactive picks planner and worker separately; `Other…` adds Kimi / Qwen / OpenRouter / any Anthropic-compat endpoint. In non-interactive mode, a saved provider's model id is auto-resolved to the provider. |
411
397
  | `--usage-cap=N` | unlimited | Stop at N% utilization |
412
398
  | `--allow-extra-usage` | off | Allow extra/overage usage (billed separately) |
413
- | `--extra-usage-budget=N` | -- | Max $ for extra usage (implies --allow-extra-usage) |
399
+ | `--extra-usage-budget=N` | | Max $ for extra usage (implies --allow-extra-usage) |
414
400
  | `--timeout=SECONDS` | `900` | Inactivity timeout per agent (nudges at timeout, kills at 2×) |
415
- | `--no-flex` | -- | Disable multi-wave steering |
416
- | `--dry-run` | -- | Show planned tasks without running |
401
+ | `--no-flex` | | Disable multi-wave steering |
402
+ | `--dry-run` | | Show planned tasks without running |
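
The `--timeout` row above says an agent is nudged at the timeout and killed at twice the timeout. A minimal sketch of that rule follows. The function name `timeoutAction` and the inclusive boundaries are assumptions for illustration, not the package's actual code; only the 900-second default and the 2× rule come from the table.

```typescript
type TimeoutAction = "none" | "nudge" | "kill";

// Hypothetical helper: maps an agent's idle time to the action described
// in the flags table. Inclusive boundaries (>=) are an assumption.
function timeoutAction(idleSeconds: number, timeoutSeconds = 900): TimeoutAction {
  if (idleSeconds >= 2 * timeoutSeconds) return "kill"; // 2x the timeout
  if (idleSeconds >= timeoutSeconds) return "nudge"; // at the timeout
  return "none";
}

const idleTenMinutes = timeoutAction(600); // below the default 900s timeout
const idleTwentyMinutes = timeoutAction(1200); // between 900s and 1800s
const idleHalfHour = timeoutAction(1800); // at 2x the default timeout
```

With the default of 900 seconds, an agent that has been silent for 20 minutes would be nudged, and one silent for 30 minutes would be killed.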
417
403
 
418
404
  ## Task file fields
419
405
 
420
406
  | Field | Type | Default | Description |
421
407
  |---|---|---|---|
422
408
  | `tasks` | `(string \| {prompt, cwd?, model?})[]` | required | Tasks to run |
423
- | `objective` | `string` | -- | High-level goal for steering |
409
+ | `objective` | `string` | | High-level goal for steering |
424
410
  | `flexiblePlan` | `boolean` | `false` | Enable multi-wave planning |
425
411
  | `model` | `string` | prompted | Worker model |
426
412
  | `concurrency` | `number` | `5` | Parallel agents |
@@ -431,12 +417,12 @@ claude-overnight "fix auth bug in src/auth.ts" "add tests for user model"
431
417
 
432
418
  ## Custom providers (Kimi, Qwen, OpenRouter, any Anthropic-compatible endpoint)
433
419
 
434
- Planner, main worker, and optional fast worker are each picked separately -- pair Opus-on-Anthropic for the planner/thinker with a cheaper model on another provider for the bulk of work. The fast worker is a real worker (same tools, same env), just on a cheaper/faster model steering routes well-scoped tasks to it by default.
420
+ Planner, main worker, and optional fast worker are each picked separately. Pair Opus-on-Anthropic for the planner/thinker with a cheaper model on another provider for the bulk of work. The fast worker is a real worker (same tools, same env), just on a cheaper/faster model, and steering routes well-scoped tasks to it by default.
435
421
 
436
422
  From the interactive picker, choose `Other…` on the planner, worker, or fast step:
437
423
 
438
424
  ```
439
- ⑤ Worker model (what runs the tasks -- Kimi 2.6 / Qwen 3.6 Plus / OpenRouter / etc via Other…):
425
+ ⑤ Worker model (runs the tasks; Kimi 2.6 / Qwen 3.6 Plus / OpenRouter / etc via Other…):
440
426
  ○ Sonnet
441
427
  ○ Opus
442
428
  ● Other…
@@ -458,13 +444,13 @@ Common examples:
458
444
 
459
445
  Saved providers live user-level at `~/.claude/claude-overnight/providers.json` (mode 0600) and show up automatically in every repo. No per-project config.
460
446
 
461
- **How routing works.** Each `query()` gets its own env override (`ANTHROPIC_BASE_URL` + `ANTHROPIC_AUTH_TOKEN`) -- planner queries use the planner provider, main-worker queries use the worker provider, fast-worker queries use the fast provider. No global shell env, no proxy daemon, no `process.env` pollution between calls.
447
+ **How routing works.** Each `query()` gets its own env override (`ANTHROPIC_BASE_URL` + `ANTHROPIC_AUTH_TOKEN`): planner queries use the planner provider, main-worker queries use the worker provider, and fast-worker queries use the fast provider. No global shell env, no proxy daemon, no `process.env` pollution between calls.
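
The per-role routing described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the package's implementation: `Role`, `Provider`, and `envForRole` are hypothetical names, and the URL and token are placeholders. Only the two environment variable names come from the README.

```typescript
// Hypothetical shapes for illustration; only the env var names are from the README.
type Role = "planner" | "worker" | "fast";

interface Provider {
  id: string;
  baseUrl: string;
  authToken: string;
}

// Build a fresh env object for one query instead of mutating a shared one.
function envForRole(
  base: Readonly<Record<string, string>>,
  providers: Partial<Record<Role, Provider>>,
  role: Role,
): Record<string, string> {
  const provider = providers[role];
  // No custom provider for this role: pass the base env through as a copy.
  if (!provider) return { ...base };
  return {
    ...base,
    ANTHROPIC_BASE_URL: provider.baseUrl,
    ANTHROPIC_AUTH_TOKEN: provider.authToken,
  };
}

const baseEnv = { PATH: "/usr/bin" };
const providers: Partial<Record<Role, Provider>> = {
  worker: {
    id: "kimi",
    baseUrl: "https://example.invalid/anthropic", // placeholder URL
    authToken: "sk-placeholder", // placeholder token
  },
};

const plannerEnv = envForRole(baseEnv, providers, "planner");
const workerEnv = envForRole(baseEnv, providers, "worker");
```

Because each call returns a new object, the worker's override never leaks into the planner's env or into the shared base.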
462
448
 
463
449
  **Pre-flight.** Before the swarm starts, each custom provider is pinged with a 1-turn auth check. Bad keys fail fast with `✗ worker preflight failed: ...` instead of N scattered mid-run errors.
464
450
 
465
451
  **Resume.** Provider ids are persisted in `run.json` and rehydrated on resume. If you deleted a provider between runs, resume refuses to start and tells you exactly which id is missing.
466
452
 
467
- **Non-interactive / CI.** `claude-overnight --model=kimi-for-coding` (or `qwen3.6-plus`) auto-resolves the model id to a saved provider -- no separate `--provider` flag.
453
+ **Non-interactive / CI.** `claude-overnight --model=kimi-for-coding` (or `qwen3.6-plus`) auto-resolves the model id to a saved provider, so no separate `--provider` flag is needed.
468
454
 
469
455
  ## Parallel Playwright Testing
470
456
 
@@ -502,8 +488,8 @@ See `QUICKSHEET_PLAYWRIGHT.md` for full config examples.
502
488
 
503
489
  By default, extra/overage usage is **blocked**. When your plan's rate limits are exhausted, the run stops cleanly and is resumable. You control this in the interactive prompt (step ⑤) or via CLI flags:
504
490
 
505
- - `--allow-extra-usage` -- opt in to extra usage (billed separately)
506
- - `--extra-usage-budget=20` -- allow up to $20 of extra usage, then stop
491
+ - `--allow-extra-usage`: opt in to extra usage (billed separately)
492
+ - `--extra-usage-budget=20`: allow up to $20 of extra usage, then stop
507
493
 
508
494
  ### Live controls during execution
509
495
 
@@ -515,11 +501,11 @@ Press these keys while agents are running:
515
501
  | `t` | Change usage cap threshold (0-100%) |
516
502
  | `q` | Graceful stop (press twice to force quit) |
517
503
 
518
- Changes take effect between waves -- active agents finish their current task.
504
+ Changes take effect between waves; active agents finish their current task.
519
505
 
520
506
  ### Multi-window usage display
521
507
 
522
- The usage bar cycles through all rate limit windows (5h, 7d, etc.) every 3 seconds, showing utilization per window. Usage info is shown during all phases -- thinking, orchestration, steering, and execution.
508
+ The usage bar cycles through all rate limit windows (5h, 7d, etc.) every 3 seconds, showing utilization per window. Usage info is shown during all phases: thinking, orchestration, steering, and execution.
523
509
 
524
510
  When using extra usage with a budget, a dedicated progress bar shows spend vs limit with color-coded fill (magenta → yellow → red).
525
511
 
@@ -527,14 +513,14 @@ When using extra usage with a budget, a dedicated progress bar shows spend vs li
527
513
 
528
514
  Built for unattended runs lasting hours or days.
529
515
 
530
- - **Smooth overage transition**: when extra usage is allowed, plan limit rejection is seamless -- no dispatch blocking, agents continue into overage
531
- - **Interrupt + resume**: agents and planner queries that go silent are interrupted and resumed with full conversation context via SDK session resume -- not killed and restarted from scratch
516
+ - **Smooth overage transition**: when extra usage is allowed, a plan limit rejection does not block dispatch; agents continue into overage
517
+ - **Interrupt + resume**: agents and planner queries that go silent are interrupted and resumed with full conversation context via SDK session resume, not killed and restarted from scratch
532
518
  - **Hard block**: pauses until the rate limit window resets, then resumes
533
519
  - **Soft throttle**: slows dispatch at >75% utilization
534
520
  - **Extra usage guard**: detects overage billing and stops unless explicitly allowed
535
521
  - **Cooldown between phases**: waits for rate limit reset after thinking before starting orchestration
536
522
  - **Retry with backoff**: transient errors (429, overloaded) retry automatically
537
- - **Usage cap**: set a ceiling, active agents finish, no new ones start -- run is resumable
523
+ - **Usage cap**: set a ceiling; active agents finish, no new ones start, and the run is resumable
538
524
  - **Planner retries**: steering and orchestration retry on rate limits (30s/60s/120s backoff) with full context
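
The planner retry schedule in the last bullet can be sketched as below. Only the 30s/60s/120s delays come from the README. The names `backoffDelay` and `withRetries`, the injected `sleep` and `isRateLimit` callbacks, and giving up once the schedule is exhausted are all assumptions made for illustration.

```typescript
// Backoff schedule taken from the README; everything else is an assumption.
const BACKOFF_SECONDS = [30, 60, 120] as const;

// Returns the wait before retry number `attempt` (1-based), or null when the
// schedule is exhausted. Giving up after three retries is assumed here.
function backoffDelay(attempt: number): number | null {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return attempt <= BACKOFF_SECONDS.length ? BACKOFF_SECONDS[attempt - 1] : null;
}

// Runs `fn`, retrying only on errors that `isRateLimit` accepts.
// `sleep` is injected so callers (and tests) control how waiting happens.
async function withRetries<T>(
  fn: () => Promise<T>,
  isRateLimit: (err: unknown) => boolean,
  sleep: (seconds: number) => Promise<void>,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const delay = isRateLimit(err) ? backoffDelay(attempt) : null;
      if (delay === null) throw err; // not retryable, or schedule exhausted
      await sleep(delay);
    }
  }
}
```

Passing `sleep` as a parameter keeps the schedule testable without real waits.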
539
525
 
540
526
  ## Git worktrees and branch merging
@@ -548,13 +534,36 @@ Conflicts retry with `-X theirs`. Unresolved branches are preserved for manual m
548
534
 
549
535
  ## Claude Code plugin
550
536
 
551
- This repo also ships a Claude Code plugin so any Claude instance (inside this repo or any other) knows how to use, inspect, and resume `claude-overnight` runs:
537
+ This repo ships a Claude Code plugin so any Claude instance (inside this repo or any other) knows how to use, inspect, and resume `claude-overnight` runs:
552
538
 
553
539
  ```
554
540
  /plugin marketplace add Fornace/claude-overnight
555
541
  /plugin install claude-overnight
556
542
  ```
557
543
 
544
+ The plugin includes a skill for **authoring runs outside the CLI**. Claude can help you pick the run shape, critique the budget and decomposition, and write a `tasks.json` file before you ever invoke the CLI.
545
+
546
+ ### Writing `tasks.json` externally
547
+
548
+ When you pass a pre-written `tasks.json` to the CLI, it **skips the thinking wave and planning phase** and starts executing immediately:
549
+
550
+ ```bash
551
+ claude-overnight tasks.json
552
+ ```
553
+
554
+ This is useful when:
555
+ - You already have a concrete task list and don't need the planner to explore the codebase.
556
+ - You want to save the planner cost ($2–4) on a straightforward, mechanical job.
557
+ - You used the Claude skill to design the run and want to lock the plan before executing.
558
+
559
+ A fixed-plan `tasks.json` (without `flexiblePlan: true`) bypasses orchestration entirely. A flex-plan `tasks.json` (with `objective` + `flexiblePlan: true` + seed tasks) still uses steering across waves, but skips the initial thinking wave if the tasks are already concrete.
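
To make the fixed-plan versus flex-plan distinction concrete, here is a sketch that builds both shapes. The `TaskFile` type covers only a subset of the fields listed in the task file table earlier in this README; the `usesSteering` helper and the objective text are hypothetical and for illustration only.

```typescript
// Subset of the task file fields listed in the README table.
type Task = string | { prompt: string; cwd?: string; model?: string };

interface TaskFile {
  tasks: Task[];
  objective?: string;
  flexiblePlan?: boolean; // defaults to false per the table
  concurrency?: number;
}

// Fixed plan: runs exactly these tasks.
const fixedPlan: TaskFile = {
  tasks: ["fix auth bug in src/auth.ts", { prompt: "add tests for user model" }],
};

// Flex plan: seed tasks plus an objective for steering across waves.
const flexPlan: TaskFile = {
  objective: "harden the auth module", // illustrative objective
  flexiblePlan: true,
  tasks: ["fix auth bug in src/auth.ts"],
};

// Hypothetical helper mirroring the fixed vs flex distinction above.
function usesSteering(file: TaskFile): boolean {
  return file.flexiblePlan === true;
}

// Serialized form, suitable as the contents of a tasks.json file.
const fixedJson = JSON.stringify(fixedPlan, null, 2);
```

Writing `fixedJson` to `tasks.json` and passing that file to the CLI corresponds to the fixed-plan case.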
560
+
561
+ ### What happens when `tasks.json` exists
562
+
563
+ - **Crash resilience.** During normal planning, the orchestrator writes `tasks.json` to disk as soon as it generates the tasks. If the planner crashes or the process dies before the run state is persisted, the next resume salvages the tasks from `tasks.json` instead of re-running the expensive planning query.
564
+ - **Resume fallback.** If a run's state file is missing or incomplete, the resume flow falls back to `tasks.json` to reconstruct the task list. This also covers legacy runs from before v1.11.7 where the agent wrote the file but the orchestrator didn't save `run.json`.
565
+ - **Orphan recovery.** The state scanner backfills minimal run metadata for any run directory that contains a `tasks.json` but no `run.json`, so incomplete planning shells still show up in `claude-overnight --list`.
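
The fallback order described in the bullets above can be summarized in a small decision function. This is a sketch of the described behavior, not the orchestrator's code: `pickResumeSource`, its parameters, and the `"none"` outcome are hypothetical names chosen for illustration.

```typescript
type ResumeSource = "run.json" | "tasks.json" | "none";

// Hypothetical: decides where a resume would read its task list from,
// given the files present in a run directory and whether run.json is complete.
function pickResumeSource(
  files: ReadonlySet<string>,
  runJsonComplete: boolean,
): ResumeSource {
  // Prefer the persisted run state when it exists and is complete.
  if (files.has("run.json") && runJsonComplete) return "run.json";
  // Missing or incomplete state: fall back to the task list on disk.
  if (files.has("tasks.json")) return "tasks.json";
  return "none";
}

const healthy = pickResumeSource(new Set(["run.json", "tasks.json"]), true);
const crashedMidPlan = pickResumeSource(new Set(["tasks.json"]), false);
const emptyDir = pickResumeSource(new Set<string>(), false);
```

A run directory holding only `tasks.json` is the orphan case: it has no complete state, yet its tasks are still recoverable.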
566
+
558
567
  ## Exit codes
559
568
 
560
569
  | Code | Meaning |
@@ -563,6 +572,19 @@ This repo also ships a Claude Code plugin so any Claude instance (inside this re
563
572
  | `1` | Some tasks failed |
564
573
  | `2` | All failed or none completed |
565
574
 
575
+ ## Prompt evolution (server-side)
576
+
577
+ The `src/prompt-evolution/` engine and `claude-overnight-evolve` CLI power a self-evolution pipeline that optimises prompts (the planner prompt here, MCP-browser's supervisor prompts, or any prompt in a user's repo) via Pareto-frontier mutation with LLM-as-judge and heuristic scoring.
578
+
579
+ **Multi-hour runs aren't meant for your laptop.** Two ways to run it:
580
+
581
+ 1. **`npx claude-overnight-evolve …`**: the quickest option. Fine for smoke tests or short runs; needs `ANTHROPIC_API_KEY` in env and keeps running only as long as your shell is open. Output: `~/.claude-overnight/prompt-evolution/<runId>/`.
582
+ 2. **Self-hosted Docker**: [`self-host/`](self-host/README.md) ships a tiny runner image + optional HTTP server (enqueue + read-back) you can run on any VPS. Your laptop can be off.
583
+
584
+ Experiment credentials for any Anthropic-compatible provider (Anthropic direct, OpenRouter, Kimi, DashScope, a local proxy) are injected via env vars: `ANTHROPIC_BASE_URL`, `ANTHROPIC_API_KEY`, `EVAL_MODEL`, `MUTATE_MODEL`. Self-host reads them from `self-host/.env` (or per-run `env:` in the enqueue body).
585
+
586
+ Full design: [docs/prompt-evolution-research.md](docs/prompt-evolution-research.md).
587
+
566
588
  ## License
567
589
 
568
590
  MIT
@@ -0,0 +1,18 @@
1
+ #!/usr/bin/env node
2
+ /**
3
+ * `claude-overnight-evolve` — CLI for the prompt-evolution engine.
4
+ *
5
+ * Ships with the npm package (compiled to dist/bin/evolve.js). The MCP-browser
6
+ * platform runs this binary inside a per-project `raw`-mode container via
7
+ * `docker exec`. See docs/prompt-evolution-research.md.
8
+ *
9
+ * Examples:
10
+ * claude-overnight-evolve --prompt 10_planning/10-3_plan --eval-model claude-haiku-4-5 --generations 3
11
+ * claude-overnight-evolve --target mcp-browser --prompt-kind plan-supervision --eval-model kimi-k2-6
12
+ *
13
+ * Requires ANTHROPIC_API_KEY (or ANTHROPIC_AUTH_TOKEN) in env. When `--target
14
+ * mcp-browser` is used the cwd must be the MCP-browser repo root (so
15
+ * `platform/supervisor/gemini-client.ts` resolves), or pass the file via
16
+ * `MCP_BROWSER_GEMINI_CLIENT`.
17
+ */
18
+ export {};