@yemi33/squad 0.1.1 → 0.1.2
- package/README.md +29 -24
- package/docs/engine-restart.md +92 -0
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -172,7 +172,7 @@ The web dashboard at `http://localhost:7331` provides:
|
|
|
172
172
|
|
|
173
173
|
## Project Config
|
|
174
174
|
|
|
175
|
-
When you run `squad
|
|
175
|
+
When you run `squad add <dir>`, it prompts for project details and saves them to `config.json`. Each project entry looks like:
|
|
176
176
|
|
|
177
177
|
```json
|
|
178
178
|
{
|
|
@@ -205,7 +205,7 @@ The init script also creates `<project>/.squad/` with empty `work-items.json` an
|
|
|
205
205
|
|
|
206
206
|
### Auto-Discovery
|
|
207
207
|
|
|
208
|
-
When you run `squad
|
|
208
|
+
When you run `squad add`, the tool automatically detects what it can from the repo:
|
|
209
209
|
|
|
210
210
|
| What | How |
|
|
211
211
|
|------|-----|
|
|
@@ -227,7 +227,7 @@ Agents need MCP tools to interact with your repo host (create PRs, post review c
|
|
|
227
227
|
|
|
228
228
|
**Example:** If you use Azure DevOps, configure the `azure-ado` MCP server in your Claude Code settings. If you use GitHub, configure the `github` MCP server. Agents will discover and use whichever tools are available.
|
|
229
229
|
|
|
230
|
-
Manually refresh with `
|
|
230
|
+
Manually refresh with `squad mcp-sync`.
|
|
231
231
|
|
|
232
232
|
## Work Items
|
|
233
233
|
|
|
@@ -235,7 +235,7 @@ All work items use the shared `playbooks/work-item.md` template, which provides
|
|
|
235
235
|
|
|
236
236
|
**Per-project** — scoped to one repo. Select a project in the Command Center dropdown.
|
|
237
237
|
|
|
238
|
-
**Central (auto-route)** — agent gets all project descriptions and decides where to work. Use "Auto (agent decides)" in the dropdown, or `
|
|
238
|
+
**Central (auto-route)** — agent gets all project descriptions and decides where to work. Use "Auto (agent decides)" in the dropdown, or `squad work "title"`. Can span multiple repos.
|
|
239
239
|
|
|
240
240
|
### Fan-Out (Parallel Multi-Agent)
|
|
241
241
|
|
|
@@ -328,9 +328,10 @@ Routing rules in `routing.md`. Charters in `agents/{name}/charter.md`. Both are
|
|
|
328
328
|
| `implement.md` | Build a PRD item in a git worktree, create PR |
|
|
329
329
|
| `review.md` | Review a PR, post findings to repo host |
|
|
330
330
|
| `fix.md` | Fix review feedback on existing PR branch |
|
|
331
|
-
| `analyze.md` | Generate PRD gap analysis in a worktree |
|
|
332
331
|
| `explore.md` | Read-only codebase exploration |
|
|
333
332
|
| `test.md` | Run tests and report results |
|
|
333
|
+
| `build-and-test.md` | Build project and run test suite |
|
|
334
|
+
| `plan-to-prd.md` | Convert a plan into PRD gap items |
|
|
334
335
|
|
|
335
336
|
All playbooks use `{{template_variables}}` filled from project config. The `work-item.md` playbook uses `{{scope_section}}` to inject project-specific or multi-project context. Playbooks are fully customizable — edit them to match your workflow.
|
|
336
337
|
|
|
@@ -355,7 +356,7 @@ Agents can run for hours as long as they're producing output. The `heartbeatTime
|
|
|
355
356
|
| Orphaned worktrees | >24 hours old, no active dispatch references them |
|
|
356
357
|
| Zombie processes | In memory but no matching dispatch |
|
|
357
358
|
|
|
358
|
-
Manual cleanup: `
|
|
359
|
+
Manual cleanup: `squad cleanup`
|
|
359
360
|
|
|
360
361
|
## Self-Improvement Loop
|
|
361
362
|
|
|
@@ -374,7 +375,7 @@ When a reviewer flags issues, the engine creates `feedback-<author>-from-<review
|
|
|
374
375
|
`engine/metrics.json` tracks per agent: tasks completed, errors, PRs created/approved/rejected, reviews done. Visible in CLI (`status`) and dashboard with color-coded approval rates.
|
|
375
376
|
|
|
376
377
|
### 5. Skills
|
|
377
|
-
Agents save repeatable workflows to `skills/<name>.md` with Claude Code-compatible frontmatter. Engine builds an index injected into all prompts. Skills can also be stored per-project at `<project>/.claude/skills/<name>/SKILL.md` (requires a PR). Visible in dashboard
|
|
378
|
+
Agents save repeatable workflows to `skills/<name>.md` with Claude Code-compatible frontmatter. Engine builds an index injected into all prompts. Skills can also be stored per-project at `<project>/.claude/skills/<name>/SKILL.md` (requires a PR). Visible in the dashboard Skills section.
|
|
378
379
|
|
|
379
380
|
See `docs/self-improvement.md` for the full breakdown.
|
|
380
381
|
|
|
@@ -408,9 +409,9 @@ Engine behavior is controlled via `config.json`. Key settings:
|
|
|
408
409
|
|
|
409
410
|
The engine and all spawned agents use the Node binary that started the engine (`process.execPath`). After upgrading Node, restart the engine:
|
|
410
411
|
|
|
411
|
-
```
|
|
412
|
-
|
|
413
|
-
|
|
412
|
+
```bash
|
|
413
|
+
squad stop
|
|
414
|
+
squad start
|
|
414
415
|
```
|
|
415
416
|
|
|
416
417
|
## Portability
|
|
@@ -418,41 +419,45 @@ node ~/.squad/engine.js
|
|
|
418
419
|
**Portable (works on any machine):** Engine, dashboard, playbooks, charters, routing, notes, skills, docs, work items.
|
|
419
420
|
|
|
420
421
|
**Machine-specific (reconfigure per machine):**
|
|
421
|
-
- `config.json` — contains absolute paths to project directories. Re-link via `
|
|
422
|
+
- `config.json` — contains absolute paths to project directories. Re-link via `squad add <dir>`.
|
|
422
423
|
- `mcp-servers.json` — auto-synced from `~/.claude.json` on engine start.
|
|
423
424
|
|
|
424
|
-
To move to a new machine:
|
|
425
|
+
To move to a new machine: `npm install -g @yemi33/squad && squad init --force`, then re-run `squad add` for each project.
|
|
425
426
|
|
|
426
427
|
## File Layout
|
|
427
428
|
|
|
428
429
|
```
|
|
429
430
|
~/.squad/
|
|
430
|
-
|
|
431
|
+
bin/
|
|
432
|
+
squad.js <- Unified CLI entry point (npm package)
|
|
433
|
+
squad.js <- Project management: init, add, remove, list
|
|
431
434
|
engine.js <- Engine daemon
|
|
432
435
|
engine/
|
|
433
436
|
spawn-agent.js <- Agent spawn wrapper (resolves claude cli.js)
|
|
434
|
-
|
|
435
|
-
|
|
436
|
-
|
|
437
|
-
|
|
437
|
+
ado-mcp-wrapper.js <- ADO MCP authentication wrapper
|
|
438
|
+
control.json <- running/paused/stopped (runtime)
|
|
439
|
+
dispatch.json <- pending/active/completed queue (runtime)
|
|
440
|
+
log.json <- Audit trail, capped at 500 (runtime)
|
|
441
|
+
metrics.json <- Per-agent quality metrics (runtime)
|
|
438
442
|
dashboard.js <- Web dashboard server
|
|
439
|
-
dashboard.html <- Dashboard UI
|
|
443
|
+
dashboard.html <- Dashboard UI (single-file)
|
|
440
444
|
config.json <- projects[], agents, engine, claude settings
|
|
441
|
-
config.template.json <- Template for
|
|
445
|
+
config.template.json <- Template for new installs
|
|
446
|
+
package.json <- npm package definition
|
|
442
447
|
mcp-servers.json <- MCP servers (auto-synced, gitignored)
|
|
443
448
|
routing.md <- Dispatch rules table (editable)
|
|
444
449
|
team.md <- Team roster
|
|
445
|
-
notes.md
|
|
446
|
-
work-items.json <- Central work queue (
|
|
447
|
-
TODO.md <- Future improvements roadmap
|
|
450
|
+
notes.md <- Team rules + consolidated learnings (runtime)
|
|
451
|
+
work-items.json <- Central work queue (runtime)
|
|
448
452
|
playbooks/
|
|
449
453
|
work-item.md <- Shared work item template
|
|
450
454
|
implement.md <- Build a PRD item
|
|
451
455
|
review.md <- Review a PR
|
|
452
456
|
fix.md <- Fix review feedback
|
|
453
|
-
analyze.md <- Generate new PRD
|
|
454
457
|
explore.md <- Codebase exploration
|
|
455
458
|
test.md <- Run tests
|
|
459
|
+
build-and-test.md <- Build project and run test suite
|
|
460
|
+
plan-to-prd.md <- Convert plan to PRD gap items
|
|
456
461
|
skills/
|
|
457
462
|
README.md <- Skill format guide
|
|
458
463
|
<name>.md <- Agent-created reusable workflows
|
|
@@ -460,7 +465,7 @@ To move to a new machine: clone `~/.squad/`, delete `engine/control.json`, re-ru
|
|
|
460
465
|
{name}/
|
|
461
466
|
charter.md <- Agent identity and boundaries (editable)
|
|
462
467
|
status.json <- Current state (runtime)
|
|
463
|
-
history.md <- Task history
|
|
468
|
+
history.md <- Task history, last 20 (runtime)
|
|
464
469
|
live-output.log <- Streaming output while working (runtime)
|
|
465
470
|
output.log <- Final output after completion (runtime)
|
|
466
471
|
identity/
|
|
@@ -0,0 +1,92 @@
|
|
|
1
|
+
# Engine Restart & Agent Survival
|
|
2
|
+
|
|
3
|
+
## The Problem
|
|
4
|
+
|
|
5
|
+
When the engine restarts, it loses its in-memory process handles (`activeProcesses` Map). Claude CLI agents spawned before the restart are still running as OS processes, but the engine can't monitor their stdout, detect exit codes, or manage their lifecycle. Without protection, the heartbeat check (5-min default) would kill these agents as "orphans."
|
|
6
|
+
|
|
7
|
+
## What's Persisted vs Lost
|
|
8
|
+
|
|
9
|
+
| State | Storage | Survives Restart |
|
|
10
|
+
|-------|---------|-----------------|
|
|
11
|
+
| Dispatch queue (pending/active/completed) | `engine/dispatch.json` | Yes |
|
|
12
|
+
| Agent status (working/idle/error) | `agents/*/status.json` | Yes |
|
|
13
|
+
| Agent live output | `agents/*/live-output.log` | Yes (mtime used as heartbeat) |
|
|
14
|
+
| Process handles (`ChildProcess`) | In-memory Map | **No** |
|
|
15
|
+
| Cooldown timestamps | In-memory Map | **No** (repopulated from `engine/cooldowns.json`) |
|
|
16
|
+
|
|
17
|
+
## Protection Mechanisms
|
|
18
|
+
|
|
19
|
+
### 1. Grace Period on Startup (20 min default)
|
|
20
|
+
|
|
21
|
+
When the engine starts and finds active dispatches from a previous session, it sets `engineRestartGraceUntil` to `now + 20 minutes`. During this window, orphan detection is completely suppressed — agents won't be killed even if the engine has no process handle for them.
|
|
22
|
+
|
|
23
|
+
Configurable via `config.json`:
|
|
24
|
+
```json
|
|
25
|
+
{
|
|
26
|
+
"engine": {
|
|
27
|
+
"restartGracePeriod": 1200000
|
|
28
|
+
}
|
|
29
|
+
}
|
|
30
|
+
```
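The startup check described above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: only `restartGracePeriod` and `engineRestartGraceUntil` come from the document, while the function names and the `activeDispatchCount` parameter are assumptions.

```javascript
// Hypothetical sketch. Only restartGracePeriod and engineRestartGraceUntil
// appear in the doc; every other name here is illustrative.
const DEFAULT_RESTART_GRACE_MS = 20 * 60 * 1000; // 20 min default, per the doc

// At startup: if the previous session left active dispatches, compute the
// timestamp until which orphan detection should be suppressed.
function computeGraceUntil(config, activeDispatchCount, now) {
  if (activeDispatchCount === 0) return 0; // nothing to protect
  const engine = (config && config.engine) || {};
  const graceMs = engine.restartGracePeriod ?? DEFAULT_RESTART_GRACE_MS;
  return now + graceMs;
}

// On each tick: orphan detection is skipped while this returns true.
function inGracePeriod(engineRestartGraceUntil, now) {
  return now < engineRestartGraceUntil;
}
```

With the config shown above (`1200000` ms), an engine that starts with two active dispatches would suppress orphan detection for the next 20 minutes.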
|
|
31
|
+
|
|
32
|
+
### 2. Blocking Tool Detection
|
|
33
|
+
|
|
34
|
+
Even after the grace period expires, the engine scans each agent's `live-output.log` for the most recent `tool_use` call. If the agent is in a known blocking tool:
|
|
35
|
+
|
|
36
|
+
- **`TaskOutput` with `block: true`** — timeout extended to the task's own timeout + 1 min
|
|
37
|
+
- **`Bash` with long timeout (>5 min)** — timeout extended to the bash timeout + 1 min
|
|
38
|
+
|
|
39
|
+
This works for both tracked processes and orphans (no process handle).
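A rough sketch of the timeout extension rule, assuming the most recent `tool_use` entry has already been parsed out of `live-output.log` into an object of the shape `{ name, input: { block, timeout } }`. That shape, the function name, and the choice to never shorten the base timeout are assumptions made for illustration; the document does not specify the log format.

```javascript
// Hypothetical sketch. The lastToolUse shape and all names are assumptions.
const ONE_MIN_MS = 60 * 1000;
const FIVE_MIN_MS = 5 * ONE_MIN_MS;

// Returns the inactivity timeout to apply to an agent, given its most
// recent tool call and the engine's normal heartbeat timeout.
function effectiveTimeoutMs(lastToolUse, heartbeatTimeoutMs) {
  if (!lastToolUse || !lastToolUse.input) return heartbeatTimeoutMs;
  const { name, input } = lastToolUse;
  const toolTimeout = Number(input.timeout);
  if (!Number.isFinite(toolTimeout)) return heartbeatTimeoutMs;

  const blockingTaskOutput = name === 'TaskOutput' && input.block === true;
  const longBash = name === 'Bash' && toolTimeout > FIVE_MIN_MS;
  if (blockingTaskOutput || longBash) {
    // Tool's own timeout + 1 min, never shorter than the base timeout.
    return Math.max(heartbeatTimeoutMs, toolTimeout + ONE_MIN_MS);
  }
  return heartbeatTimeoutMs;
}
```

Because the function only needs the parsed tool call, it behaves the same whether or not the engine still holds a process handle for the agent.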
|
|
40
|
+
|
|
41
|
+
### 3. Stop Warning
|
|
42
|
+
|
|
43
|
+
`engine.js stop` checks for active dispatches and warns:
|
|
44
|
+
```
|
|
45
|
+
WARNING: 2 agent(s) are still working:
|
|
46
|
+
- Dallas: [office-bohemia] Build & test PR PR-4959092
|
|
47
|
+
- Rebecca: [office-bohemia] Review PR PR-4964594
|
|
48
|
+
|
|
49
|
+
These agents will continue running but the engine won't monitor them.
|
|
50
|
+
On next start, they'll get a 20-min grace period before being marked as orphans.
|
|
51
|
+
To kill them now, run: node engine.js kill
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
### 4. Exponential Backoff on Failures
|
|
55
|
+
|
|
56
|
+
If an agent is killed as an orphan and the work item retries, cooldowns use exponential backoff (2^failures, max 8x) to prevent spam-retrying broken tasks.
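The backoff rule (2^failures, capped at 8x) can be illustrated with a small helper. The function and parameter names are hypothetical, and the base cooldown value is whatever the engine is configured with.

```javascript
// Hypothetical sketch of the 2^failures multiplier with an 8x cap.
function cooldownWithBackoffMs(baseCooldownMs, failures) {
  const safeFailures = Math.max(0, failures); // treat negative counts as zero
  const multiplier = Math.min(2 ** safeFailures, 8);
  return baseCooldownMs * multiplier;
}
```

Under this sketch a work item's cooldown doubles on each consecutive failure and stops growing after the third one.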
|
|
57
|
+
|
|
58
|
+
## Safe Restart Pattern
|
|
59
|
+
|
|
60
|
+
```bash
|
|
61
|
+
node engine.js stop # Check the warning — are agents working?
|
|
62
|
+
# If yes, decide: wait for them to finish, or accept the grace period
|
|
63
|
+
# Make your code changes
|
|
64
|
+
node engine.js start # Grace period kicks in for surviving agents
|
|
65
|
+
```
|
|
66
|
+
|
|
67
|
+
## What the Engine Cannot Do
|
|
68
|
+
|
|
69
|
+
- **Reattach to processes** — Node.js `child_process` doesn't support adopting external PIDs. Once the process handle is lost, the engine can only observe the agent indirectly via file output.
|
|
70
|
+
- **Guarantee completion** — An agent that finishes during a restart will have its output saved to `live-output.log`, but the engine won't run post-completion hooks (PR sync, metrics update, learnings check). These are picked up on the next tick via output file scanning.
|
|
71
|
+
- **Resume mid-task** — If an agent is killed (by orphan detection or timeout), the work item is marked failed. It can be retried but starts from scratch.
|
|
72
|
+
|
|
73
|
+
## Timeline of a Restart
|
|
74
|
+
|
|
75
|
+
```
|
|
76
|
+
T+0s engine.js stop (warns about active agents)
|
|
77
|
+
Engine process exits. Agents keep running as OS processes.
|
|
78
|
+
|
|
79
|
+
T+30s Code changes made. engine.js start.
|
|
80
|
+
Engine reads dispatch.json — finds 2 active items.
|
|
81
|
+
Sets grace period: 20 min from now.
|
|
82
|
+
Logs: "2 active dispatch(es) from previous session"
|
|
83
|
+
|
|
84
|
+
T+0-20m Ticks run. Orphan detection skipped (grace period).
|
|
85
|
+
If an agent finishes, output is written to live-output.log.
|
|
86
|
+
Engine detects completed output on next tick via file scan.
|
|
87
|
+
|
|
88
|
+
T+20m Grace period expires.
|
|
89
|
+
Heartbeat check resumes. Blocking tool detection still active.
|
|
90
|
+
Agent in TaskOutput block:true gets extended timeout.
|
|
91
|
+
Agent with no output for 5min+ and no blocking tool → orphaned.
|
|
92
|
+
```
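The final decision in the timeline can be summarized as a pure function. This is a sketch under stated assumptions: the parameter names are illustrative, `lastOutputMtimeMs` stands for the modification time of the agent's `live-output.log`, and `timeoutMs` is the timeout that applies to that agent (the 5-min default, or a longer value when a blocking tool was detected).

```javascript
// Hypothetical sketch; names are illustrative, not taken from engine.js.
function shouldMarkOrphan({ now, graceUntil, lastOutputMtimeMs, timeoutMs }) {
  if (now < graceUntil) return false; // restart grace period still active
  const silentForMs = now - lastOutputMtimeMs; // time since last log write
  return silentForMs > timeoutMs;
}
```

For example, an agent that has been silent for 6 minutes after the grace period is flagged under a 5-minute timeout, but not if a blocking tool extended its timeout to 11 minutes.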
|
package/package.json
CHANGED