npm - claude-overnight - Versions diffs - 1.25.48 → 1.25.49 - Mend

claude-overnight 1.25.48 → 1.25.49

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

package/dist/core/_version.d.ts CHANGED

@@ -1 +1 @@
-export declare const VERSION = "1.25.48";
+export declare const VERSION = "1.25.49";

package/dist/core/_version.js CHANGED

@@ -1,2 +1,2 @@
 // Auto-generated by build — do not edit manually.
-export const VERSION = "1.25.48";
+export const VERSION = "1.25.49";

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "claude-overnight",
-  "version": "1.25.48",
+  "version": "1.25.49",
   "description": "Parallel Claude agents in git worktrees with a usage cap that reserves headroom for your interactive Claude Code. Crash-safe resume. Provider-agnostic model catalog (Anthropic, Cursor, OpenAI, Gemini, DeepSeek, Llama, Qwen) with capability-based task scoping.",
   "type": "module",
   "bin": {

package/plugins/claude-overnight/.claude-plugin/plugin.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "claude-overnight",
-  "version": "1.25.48",
+  "version": "1.25.49",
   "description": "Claude Code skill for understanding, installing, and inspecting claude-overnight runs  -- parallel Claude agents in git worktrees with thinking waves, multi-wave steering, and crash-safe resume. Supports Cursor API Proxy, Qwen, OpenRouter.",
   "author": {
     "name": "Francesco Fornace"

package/plugins/claude-overnight/skills/claude-overnight/SKILL.md CHANGED

@@ -1,12 +1,13 @@
 ---
 name: claude-overnight
 description: >
-  Understand, install, and inspect claude-overnight runs  -- a CLI that
+  Understand, author, install, and inspect claude-overnight runs  -- a CLI that
   launches parallel Claude agents in git worktrees with thinking waves,
-  multi-wave steering, three-layer review, and crash-safe resume. Use when the user mentions
-  claude-overnight, a `.claude-overnight/` folder, an "overnight" or
-  "swarm" run, or asks to check status / resume / continue a
-  multi-phase plan. Not for Vercel Workflow DevKit.
+  multi-wave steering, three-layer review, and crash-safe resume. Use when the user
+  mentions claude-overnight, a `.claude-overnight/` folder, an "overnight" or
+  "swarm" run, asks to check status / resume / continue a multi-phase plan,
+  or asks to plan / design / write a `tasks.json` / objective / overnight workflow.
+  Not for Vercel Workflow DevKit.
 ---
 # What it is
@@ -49,6 +50,15 @@ Live keys while running: `b` change budget · `t` change usage cap · `q` gracef
 Exit codes: `0` all ok · `1` some failed · `2` all/none.
+# Authoring a run (tasks.json / objective)
+When the user asks you to *plan*, *design*, or *write* an overnight run (not inspect one), load the authoring knowledge **on demand** — don't carry it by default:
+- `recipes.md` (next to this file) — scenario → recipe matrix: objective shape, `flexiblePlan`, initial tasks, `concurrency`, budget range, planner/worker pairing, phases to skip. Read this when picking a run shape for a known scenario (refactor, feature batch, migration, test/docs sprint, bug hunt, research).
+- `authoring.md` (next to this file) — decision tree (fixed vs flex vs inline; when to `--no-flex`; when thinking wave is wasted), pre-flight critic checklist (no "do anything" prompts, language-agnostic phrasing, verify-before-done, budget ≥ per-wave cost × expected waves, decomposition sanity), and anti-patterns. Read this before finalizing any tasks.json or before pressing Run.
+Rule of thumb: if the user has a concrete list of tasks and a clear endpoint, prefer fixed-plan (`--no-flex`) and skip the thinking wave. If the user has a fuzzy objective ("modernize X", "audit Y"), prefer `objective + flexiblePlan: true` with a small seed task list and let steering drive. Never send a single "do anything" prompt to one agent — decompose first (see authoring.md).
 # On-disk layout (this is how you inspect status)
 Every run lives at `<repo>/.claude-overnight/runs/<ISO-timestamp>/`:

package/plugins/claude-overnight/skills/claude-overnight/authoring.md ADDED

@@ -0,0 +1,107 @@
+# Authoring a claude-overnight Run
+Read this before finalizing a `tasks.json` or telling the user to press Run. Pair with `recipes.md` for the scenario matrix.
+## Decision tree
+1. **Does the user have a concrete list of tasks with a clear endpoint?**
+   - Yes → **fixed plan**: `tasks.json` with explicit `tasks[]`, `--no-flex`, skip thinking wave (auto-skipped below budget 15; for higher budgets pass a pre-written `tasks.json` — the CLI will not re-plan).
+   - No → continue.
+2. **Is the objective fuzzy ("modernize", "audit", "clean up", "make it amazing")?**
+   - Yes → **flex plan**: `objective` + `flexiblePlan: true` + 2–5 seed tasks. Let the thinking wave explore and steering drive. Budget ≥ 30.
+   - No, but the task list is not concrete either → write 3–5 seed tasks you *know* are needed, enable flex, and let steering add the rest.
+3. **Is this a single-wave mechanical job (docs, formatting, coverage fill)?**
+   - Yes → `--no-flex`, skip thinking, low-cost worker (Qwen or Sonnet), high concurrency OK.
+4. **Is this a shared-surface problem (migration, bug hunt, one-file refactor)?**
+   - Yes → low concurrency (2–4). Merge conflicts dominate otherwise.
+5. **Does completion require running the app (not just reading code)?**
+   - Yes → task prompts must *explicitly* instruct run-and-test. Add `afterWave` hook to run tests. See *verify-before-done* below.
+## Pre-flight critic checklist
+Walk the proposed run against each item before Run. One fail = revise.
+### Task shape
+- [ ] **No "do anything" prompts.** Every task names a scope (files, module, feature) and a concrete outcome. If a task reads "improve X", decompose it first.
+- [ ] **Language-agnostic phrasing.** Don't bake `npm`, `jest`, `pnpm`, etc. into the objective unless the repo is pinned to them. Write meta-prompts ("run the project's test suite") rather than naming tools.
+- [ ] **Verify-before-done.** Each task that changes behavior must include "run and test the change" — not just "edit the code". For UI tasks, require browser verification (Playwright MCP). For backend, require the test suite or a repro script.
+- [ ] **Decomposition is real.** If a task is >1 day of human work, split. If two tasks touch the same file heavily, merge or serialize (low concurrency).
+- [ ] **One outcome per task.** No "refactor auth AND add tests AND update docs" — that's three tasks.
+### Budget & economics
+- [ ] **Budget ≥ per-wave cost × expected waves.** For flex runs, expect 3–6 waves. Per-wave cost = planner (~$1–3) + tasks × worker cost.
+- [ ] **Thinking wave justified.** Skip if tasks are already concrete or budget < 15. Thinking at budget=2000 costs $15–40 — worth it only for genuine exploration.
+- [ ] **Planner isn't cheaped out.** Planner quality = run quality ceiling. Opus for high-stakes, Sonnet for everyday, never Qwen for planner.
+- [ ] **Usage cap set.** Default 90% leaves headroom for interactive Claude. `--allow-extra-usage` off unless the user explicitly opts in, and only with `--extra-usage-budget=N`.
+### Environment & safety
+- [ ] **Clean git tree** (or user has explicitly OK'd uncommitted changes being swept into worktrees).
+- [ ] **`.claude-overnight/` in `.gitignore`** (with trailing slash — the `.md` log file at repo root stays committable).
+- [ ] **Required env / keys present.** API keys, DB URLs, auth tokens — if the worker needs them, the repo's `.env` must have them (worktrees inherit).
+- [ ] **MCP servers configured** for parallel Playwright (one `--isolated` entry per concurrency slot, or shared `--isolated --headless` if no login needed).
+- [ ] **Hooks don't abort the run.** `beforeWave`/`afterWave`/`afterRun` failures surface but never stop — make sure that's what the user wants.
+### Circuit-breaker awareness
+- [ ] **User knows to watch for 2 consecutive zero-file-change waves** — that's the halt signal. Silent try/catch in wave loops is a landmine; if the run looks "busy but unchanged", stop it.
+- [ ] **First-attempt failure mitigation.** First planner call is expensive ($2–4). If the objective can be expressed as concrete tasks, skip the planner entirely.
+## Common anti-patterns
+### The "overnight hail-mary"
+User dumps a vague wish + $1000 budget + flex + max concurrency and walks away. Output: 200 worktrees, 50 merge conflicts, no coherent diff, $400 of steering context-shuffling.
+**Fix:** decompose the wish into 5–10 seed tasks, start at budget=50, verify the first wave delivers before topping up (live `b` key).
+### The "single agent, do everything"
+One task prompt: "refactor the whole auth system". One agent touches 40 files, simplify pass can't review a sprawl, merge succeeds but the result is incoherent.
+**Fix:** decompose into per-surface tasks (middleware, session store, tokens, tests). Let steering add integration work.
+### The "verification theater"
+Tasks say "add tests" but don't say "run them". Agent writes plausible-looking tests that don't compile. Final gate catches it — but 10 waves in.
+**Fix:** every behavior-changing task ends with "run the test suite and ensure it passes". `afterWave: "pnpm test"` adds a safety net.
+### The "wrong tool for the job"
+Using flex mode for a mechanical docs sprint — planner burns budget steering a problem that needs no steering.
+**Fix:** `--no-flex` for mechanical work.
+### The "proxied model mystery"
+Worker is on Cursor proxy. User wonders why there are no thinking deltas in transcripts.
+**Fix:** expected behavior — Cursor proxy suppresses thinking phase (see README table). Don't chase it.
+## Writing a good objective (for flex runs)
+Structure: `<verb> <scope> so that <outcome / quality bar>`.
+Good:
+- "Modernize the auth system so that session tokens meet SOC2 storage requirements and existing flows continue to work."
+- "Raise test coverage in `packages/api` to >80% line coverage, prioritizing error paths and boundary cases."
+Bad:
+- "Make the code better." → no scope, no outcome.
+- "Do whatever needs doing on auth." → no quality bar.
+- "Refactor everything and add features." → two objectives.
+The `goal.md` file lets steering evolve the "north star" — but it can only evolve a seed that's already grounded. A vague seed stays vague.
+## Writing good seed tasks (flex mode)
+Each seed should:
+1. Name a scope (file, module, feature, package).
+2. Name an outcome (what "done" looks like).
+3. Be independently verifiable (a test, a build step, a visible UI change).
+4. Not overlap heavily with siblings (otherwise serialize or drop concurrency).
+Example seeds for "Modernize auth":
+- "Audit `packages/auth/middleware.ts` and document the session-token storage approach, flagging SOC2 gaps."
+- "Add a reproduction test for the current session-token storage that fails under the new SOC2 requirement."
+- "Design a migration path from cookie storage to encrypted-at-rest store; output as `designs/auth-migration.md`."
+Steering will add execution tasks (the actual migration code) in later waves, grounded in what the seeds found.
+## When to invoke the coach skill vs. this one
+- **`claude-overnight` skill** (this one): Claude helps the user plan an overnight run *outside* the CLI — picking shape, writing tasks.json, critiquing budget.
+- **`claude-overnight-coach` skill**: runs *inside* the CLI at startup, turns a raw objective into recommended settings + checklist. Different entry point, overlapping knowledge. Don't invoke coach from here.
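
Reviewer's sketch (not part of the package): the decision tree in `authoring.md` can be read as a small pure function. The version below is a minimal illustration under stated assumptions. The `RunTraits` and `RunShape` types and the `pickRunShape` function are hypothetical names invented here; only `--no-flex`, the seed-task ranges, the budget floor of 30, and the 2–4 concurrency range come from the added text.

```typescript
// Hypothetical types for illustration only; claude-overnight does not export these.
interface RunTraits {
  concreteTasks: boolean;        // step 1: concrete task list with a clear endpoint
  fuzzyObjective: boolean;       // step 2: "modernize", "audit", "clean up"
  mechanicalSingleWave: boolean; // step 3: docs, formatting, coverage fill
  sharedSurface: boolean;        // step 4: migration, bug hunt, one-file refactor
  needsRunningApp: boolean;      // step 5: completion requires running the app
}

interface RunShape {
  plan: "fixed" | "flex";
  cliFlags: string[];
  seedTasks: [number, number] | null;   // suggested seed-task range for flex runs
  minBudget: number | null;             // budget floor, when the text gives one
  concurrency: [number, number] | null; // suggested range, when the text gives one
  notes: string[];
}

function pickRunShape(t: RunTraits): RunShape {
  // Steps 1 and 3 both lead to a fixed plan with --no-flex.
  const fixed = t.concreteTasks || t.mechanicalSingleWave;
  const shape: RunShape = {
    plan: fixed ? "fixed" : "flex",
    cliFlags: fixed ? ["--no-flex"] : [],
    seedTasks: null,
    minBudget: null,
    concurrency: null,
    notes: [],
  };

  if (fixed) {
    shape.notes.push("skip the thinking wave");
  } else if (t.fuzzyObjective) {
    // Step 2, "yes" branch: 2-5 seeds, budget of at least 30.
    shape.seedTasks = [2, 5];
    shape.minBudget = 30;
  } else {
    // Step 2, "no but not concrete" branch: 3-5 known-needed seeds.
    shape.seedTasks = [3, 5];
  }

  if (t.mechanicalSingleWave) {
    shape.notes.push("low-cost worker; high concurrency is acceptable");
  }
  if (t.sharedSurface) {
    // Step 4 overrides any "high concurrency is OK" hint from step 3.
    shape.concurrency = [2, 4];
    shape.notes.push("keep concurrency low to limit merge conflicts");
  }
  if (t.needsRunningApp) {
    shape.notes.push("task prompts must instruct run-and-test; add an afterWave hook");
  }
  return shape;
}

// Usage: a fuzzy objective on a shared surface.
const suggestion = pickRunShape({
  concreteTasks: false,
  fuzzyObjective: true,
  mechanicalSingleWave: false,
  sharedSurface: true,
  needsRunningApp: true,
});
console.log(suggestion.plan, suggestion.seedTasks, suggestion.concurrency);
```

Steps 1 to 3 pick the plan shape, while steps 4 and 5 only add constraints on top of it, which is why they are applied last and can coexist with either shape.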

package/plugins/claude-overnight/skills/claude-overnight/recipes.md ADDED

@@ -0,0 +1,48 @@
+# Overnight Run Recipes
+Scenario → recommended run shape. These are defaults, not laws — adjust when the repo or user constraints say so. Always pair with `authoring.md` (decision tree + pre-flight).
+## Recipe matrix
+| Scenario | Shape | `flexiblePlan` | Budget | Concurrency | Planner / Worker | Skip phases | Notes |
+|---|---|---|---|---|---|---|---|
+| **Fixed refactor** (concrete file list, clear endpoint) | `tasks.json` with explicit tasks | `false` (`--no-flex`) | 1× tasks + ~20% headroom | 3–5 | Sonnet / Sonnet | thinking wave, post-wave review | Each task = one cohesive unit of work. Cheapest mode. |
+| **Feature batch** (N independent features) | `tasks.json`, one task per feature | `false` initially; `true` if features bleed into shared code | 2–3× feature count | 4–6 | Opus / Sonnet | thinking wave if features are well-scoped | Require verify-before-done per task. |
+| **Framework migration** (Next 14→16, React 18→19, etc.) | `objective` + seed tasks per package | `true` | 50–200 | 3–5 | Opus / Sonnet | none — keep thinking + review | Steering re-plans as breakage surfaces. `beforeWave`: install deps. |
+| **Test sprint** (raise coverage, fill gaps) | `objective` + seed per module | `true` | 30–100 | 5–8 | Sonnet / Sonnet (or Qwen for cost) | thinking if coverage report is attached | `afterWave`: run test suite, feed failures forward. |
+| **Docs sprint** (API docs, guides) | `tasks.json` per doc surface | `false` | 1× docs + 10% | 4–6 | Sonnet / Sonnet (or Qwen) | thinking wave, reflection | Pure output task — flex mode wastes planner. |
+| **Bug hunt** (unknown cause, repro unstable) | `objective` + the repro | `true` | 20–80 | 2–4 | Opus / Opus | none | Low concurrency — workers step on each other on shared bug surface. Verify fix via reproduction script. |
+| **Codebase audit / research** (no code changes) | `objective` + focus list | `true` | 30–80 | 5–10 | Opus / Sonnet | n/a — architects *are* the work | Output is `designs/*.md` + `milestones/`. Set `permissionMode: "default"` so workers can't write. |
+| **Framework-wide cleanup** (dead code, consistency) | `objective` + seed tasks | `true` | 100–300 | 5–8 | Opus / Sonnet + fast Qwen | thinking if scope is obvious | Use fast worker for well-scoped mechanical tasks. |
+| **Long research run** (exploration, prototypes) | `objective` + loose tasks | `true` | 200–1000 | 3–5 | Opus / Opus | none | `usageCap: 90`, `--allow-extra-usage` off unless explicitly requested. |
+## Budget heuristics
+- **Per-wave cost floor** ≈ $2–4 planner + N workers × average task cost (Sonnet ≈ $0.15–0.40, Opus ≈ $0.50–1.50, Qwen < $0.05). Budget must cover *expected waves × per-wave cost*, not just task count.
+- **Thinking wave cost** scales with budget: 5 architects at budget=50 (~$3–8), 10 at budget=2000 (~$15–40). Skip when you don't need exploration.
+- **Flex overhead**: each steering pass is one planner call (~$0.50–2 on Opus). For 10-wave flex runs, reserve ~$10 for steering alone.
+## Model pairing defaults
+| Run type | Planner | Main worker | Fast worker |
+|---|---|---|---|
+| High-stakes (production refactor, migration) | Opus | Sonnet | — |
+| Everyday (features, tests, cleanups) | Sonnet | Sonnet | Qwen 3.6 Plus |
+| Cost-sensitive (docs, mechanical batch) | Sonnet | Qwen 3.6 Plus | — |
+| Research / audit (read-heavy) | Opus | Opus | — |
+Rationale: planner quality is the ceiling for the whole run. Never cheap out on planner unless the run is purely mechanical.
+## Phase-skip cheatsheet
+- **Skip thinking wave** when: tasks are already concrete · user has already explored the code · scenario is "docs/tests/mechanical batch" · budget < 15 (auto-skipped).
+- **Skip flex / steering** (`--no-flex`) when: endpoint is crisp · tasks are independent · no assessment needed between waves.
+- **Skip post-wave review** when: single-wave run · budget is tight · tasks are trivially verifiable (docs, formatting).
+- **Always keep final gate** (post-run review) unless `--dry-run`. It's the last quality check before the diff lands.
+## Anti-recipes (don't do these)
+- "Do everything in `src/`" → one agent, no decomposition. See `authoring.md` → *The "single agent, do everything"* anti-pattern.
+- `budget=5` + `flexiblePlan: true` → planner eats most of the budget, workers starve.
+- High concurrency on shared-file scenarios (migrations, bug hunts) → merge conflicts dominate. Drop to 2–4.
+- `usageCap: 100` + `--allow-extra-usage` without `--extra-usage-budget` → silent overage. Always cap extra spend explicitly.
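
Reviewer's sketch (not part of the package): the budget heuristic in `recipes.md` is simple arithmetic, per-wave cost = planner cost + workers × average task cost, multiplied by the expected number of waves. The snippet below works one example in dollars. The `WaveCostInputs` type and `estimateRunCostUsd` function are hypothetical names, and the input figures are picked from the ranges quoted in the added text. The diff does not say how the numeric `budget` setting maps to dollars, so this estimates spend only and makes no claim about that setting.

```typescript
// Hypothetical helper for illustration only; not an API of claude-overnight.
interface WaveCostInputs {
  plannerCostUsd: number; // cost of one planner or steering call
  workersPerWave: number; // N workers (tasks) launched in each wave
  avgTaskCostUsd: number; // average cost of one worker task
  expectedWaves: number;  // flex runs: the text suggests expecting 3-6
}

function estimateRunCostUsd(i: WaveCostInputs): { perWave: number; total: number } {
  // Per-wave floor: one planner call plus every worker task in the wave.
  const perWave = i.plannerCostUsd + i.workersPerWave * i.avgTaskCostUsd;
  // The run must cover every expected wave, not just the task count.
  return { perWave, total: perWave * i.expectedWaves };
}

// Worked example: $2 planner, 5 workers at $0.50 each, 4 waves.
// perWave = 2 + 5 * 0.5 = 4.5; total = 4.5 * 4 = 18.
const estimate = estimateRunCostUsd({
  plannerCostUsd: 2,
  workersPerWave: 5,
  avgTaskCostUsd: 0.5,
  expectedWaves: 4,
});
console.log(estimate.perWave, estimate.total);
```

The planner term recurs every wave, which is why a run with many small waves can cost more than its task count alone suggests.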