@yemi33/squad 0.1.0 → 0.1.2
- package/LICENSE +21 -21
- package/README.md +29 -24
- package/TODO.md +10 -10
- package/bin/squad.js +164 -164
- package/dashboard.js +901 -886
- package/docs/engine-restart.md +92 -0
- package/engine/ado-mcp-wrapper.js +49 -49
- package/engine.js +194 -14
- package/package.json +46 -46
package/docs/engine-restart.md
ADDED
|
@@ -0,0 +1,92 @@
|
|
|
1
|
+
# Engine Restart & Agent Survival
|
|
2
|
+
|
|
3
|
+
## The Problem
|
|
4
|
+
|
|
5
|
+
When the engine restarts, it loses its in-memory process handles (`activeProcesses` Map). Claude CLI agents spawned before the restart are still running as OS processes, but the engine can't monitor their stdout, detect exit codes, or manage their lifecycle. Without protection, the heartbeat check (5-min default) would kill these agents as "orphans."
|
|
6
|
+
|
|
7
|
+
## What's Persisted vs Lost
|
|
8
|
+
|
|
9
|
+
| State | Storage | Survives Restart |
|
|
10
|
+
|-------|---------|-----------------|
|
|
11
|
+
| Dispatch queue (pending/active/completed) | `engine/dispatch.json` | Yes |
|
|
12
|
+
| Agent status (working/idle/error) | `agents/*/status.json` | Yes |
|
|
13
|
+
| Agent live output | `agents/*/live-output.log` | Yes (mtime used as heartbeat) |
|
|
14
|
+
| Process handles (`ChildProcess`) | In-memory Map | **No** |
|
|
15
|
+
| Cooldown timestamps | In-memory Map | **No** (repopulated from `engine/cooldowns.json`) |
|
|
16
|
+
|
|
17
|
+
## Protection Mechanisms
|
|
18
|
+
|
|
19
|
+
### 1. Grace Period on Startup (20 min default)
|
|
20
|
+
|
|
21
|
+
When the engine starts and finds active dispatches from a previous session, it sets `engineRestartGraceUntil` to `now + 20 minutes`. During this window, orphan detection is completely suppressed — agents won't be killed even if the engine has no process handle for them.
|
|
22
|
+
|
|
23
|
+
Configurable via `config.json`:
|
|
24
|
+
```json
|
|
25
|
+
{
|
|
26
|
+
"engine": {
|
|
27
|
+
"restartGracePeriod": 1200000
|
|
28
|
+
}
|
|
29
|
+
}
|
|
30
|
+
```
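A minimal sketch of how the grace window could be computed and consulted, assuming only the `restartGracePeriod` key and the 20-minute default described above. The helper names `computeGraceUntil` and `isInGracePeriod` are hypothetical and are not taken from `engine.js`.

```javascript
// Hypothetical helpers illustrating the startup grace period.
const DEFAULT_GRACE_MS = 20 * 60 * 1000; // 20 min default, per the text above

// Returns the timestamp until which orphan detection is suppressed,
// or 0 when no dispatches were active at startup (no grace window needed).
function computeGraceUntil(activeCount, config, now = Date.now()) {
  if (activeCount <= 0) return 0;
  const graceMs = config.engine?.restartGracePeriod || DEFAULT_GRACE_MS;
  return now + graceMs;
}

// True while the grace window is still open.
function isInGracePeriod(graceUntil, now = Date.now()) {
  return now <= graceUntil;
}
```

With the `config.json` fragment above, `computeGraceUntil(2, config)` would return a timestamp 1,200,000 ms in the future. Note that with `||`, a configured value of `0` falls back to the default rather than disabling the window.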
|
|
31
|
+
|
|
32
|
+
### 2. Blocking Tool Detection
|
|
33
|
+
|
|
34
|
+
Even after the grace period expires, the engine scans each agent's `live-output.log` for the most recent `tool_use` call. If the agent is in a known blocking tool:
|
|
35
|
+
|
|
36
|
+
- **`TaskOutput` with `block: true`** — timeout extended to the task's own timeout + 1 min
|
|
37
|
+
- **`Bash` with long timeout (>5 min)** — timeout extended to the bash timeout + 1 min
|
|
38
|
+
|
|
39
|
+
This works for both tracked processes and orphans (no process handle).
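The two rules above can be sketched as a small pure function. This is an illustration only: it assumes the most recent tool call has already been parsed out of `live-output.log` into an object shaped like `{ name, input: { block, timeout } }` with `timeout` in milliseconds. That shape, and the name `effectiveSilenceTimeout`, are assumptions; how the engine actually parses the log is not shown here.

```javascript
// Hypothetical sketch: derive the silence timeout from the agent's most recent tool call.
// The shape { name, input: { block, timeout } } (timeout in ms) is an assumption.
const ONE_MINUTE_MS = 60 * 1000;
const LONG_BASH_MS = 5 * ONE_MINUTE_MS; // "long timeout (>5 min)" per the list above

function effectiveSilenceTimeout(lastToolUse, heartbeatTimeoutMs) {
  if (!lastToolUse) return heartbeatTimeoutMs;
  const input = lastToolUse.input || {};
  const toolTimeout = Number(input.timeout) || 0;

  // TaskOutput with block: true waits on a task and prints nothing meanwhile.
  if (lastToolUse.name === 'TaskOutput' && input.block === true && toolTimeout > 0) {
    // Never shrink below the normal heartbeat timeout (a choice made in this sketch).
    return Math.max(heartbeatTimeoutMs, toolTimeout + ONE_MINUTE_MS);
  }
  // Bash with a timeout above 5 minutes.
  if (lastToolUse.name === 'Bash' && toolTimeout > LONG_BASH_MS) {
    return Math.max(heartbeatTimeoutMs, toolTimeout + ONE_MINUTE_MS);
  }
  return heartbeatTimeoutMs;
}
```

Any other tool, or a blocking tool without a usable timeout, keeps the plain heartbeat timeout in this sketch.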
|
|
40
|
+
|
|
41
|
+
### 3. Stop Warning
|
|
42
|
+
|
|
43
|
+
`engine.js stop` checks for active dispatches and warns:
|
|
44
|
+
```
|
|
45
|
+
WARNING: 2 agent(s) are still working:
|
|
46
|
+
- Dallas: [office-bohemia] Build & test PR PR-4959092
|
|
47
|
+
- Rebecca: [office-bohemia] Review PR PR-4964594
|
|
48
|
+
|
|
49
|
+
These agents will continue running but the engine won't monitor them.
|
|
50
|
+
On next start, they'll get a 20-min grace period before being marked as orphans.
|
|
51
|
+
To kill them now, run: node engine.js kill
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
### 4. Exponential Backoff on Failures
|
|
55
|
+
|
|
56
|
+
If an agent is killed as an orphan and the work item retries, cooldowns use exponential backoff (2^failures, max 8x) to prevent spam-retrying broken tasks.
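As a rough illustration of the "2^failures, max 8x" rule, the cooldown could be computed as below. The function name `backoffCooldownMs` and the idea of a separate base cooldown are assumptions for the sketch, not the engine's actual code.

```javascript
// Hypothetical sketch of the backoff rule: base cooldown times 2^failures, capped at 8x.
const MAX_MULTIPLIER = 8;

function backoffCooldownMs(baseCooldownMs, failures) {
  const n = Math.max(0, Math.floor(failures)); // guard against negative or fractional counts
  const multiplier = Math.min(2 ** n, MAX_MULTIPLIER);
  return baseCooldownMs * multiplier;
}
```

With a one-minute base, successive failures would give 2, 4, and then 8 minutes, after which the cooldown stops growing.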
|
|
57
|
+
|
|
58
|
+
## Safe Restart Pattern
|
|
59
|
+
|
|
60
|
+
```bash
|
|
61
|
+
node engine.js stop # Check the warning — are agents working?
|
|
62
|
+
# If yes, decide: wait for them to finish, or accept the grace period
|
|
63
|
+
# Make your code changes
|
|
64
|
+
node engine.js start # Grace period kicks in for surviving agents
|
|
65
|
+
```
|
|
66
|
+
|
|
67
|
+
## What the Engine Cannot Do
|
|
68
|
+
|
|
69
|
+
- **Reattach to processes** — Node.js `child_process` doesn't support adopting external PIDs. Once the process handle is lost, the engine can only observe the agent indirectly via file output.
|
|
70
|
+
- **Guarantee completion** — An agent that finishes during a restart will have its output saved to `live-output.log`, but the engine won't run post-completion hooks (PR sync, metrics update, learnings check). These are picked up on the next tick via output file scanning.
|
|
71
|
+
- **Resume mid-task** — If an agent is killed (by orphan detection or timeout), the work item is marked failed. It can be retried but starts from scratch.
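Observing an agent "indirectly via file output" can be as simple as reading the modification time of its `live-output.log`, which the table earlier describes as the heartbeat. The helper below is a sketch under that assumption; `silentMsFor` is a hypothetical name.

```javascript
const fs = require('fs');

// Milliseconds since the log file was last modified.
// Returns Infinity when the file is missing or unreadable (no heartbeat at all).
function silentMsFor(liveLogPath, now = Date.now()) {
  try {
    return Math.max(0, now - fs.statSync(liveLogPath).mtimeMs);
  } catch {
    return Infinity;
  }
}
```

A caller would compare the result against the heartbeat timeout (5 minutes by default) to decide whether the agent still looks alive.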
|
|
72
|
+
|
|
73
|
+
## Timeline of a Restart
|
|
74
|
+
|
|
75
|
+
```
|
|
76
|
+
T+0s engine.js stop (warns about active agents)
|
|
77
|
+
Engine process exits. Agents keep running as OS processes.
|
|
78
|
+
|
|
79
|
+
T+30s Code changes made. engine.js start.
|
|
80
|
+
Engine reads dispatch.json — finds 2 active items.
|
|
81
|
+
Sets grace period: 20 min from now.
|
|
82
|
+
Logs: "2 active dispatch(es) from previous session"
|
|
83
|
+
|
|
84
|
+
T+0-20m Ticks run. Orphan detection skipped (grace period).
|
|
85
|
+
If an agent finishes, output is written to live-output.log.
|
|
86
|
+
Engine detects completed output on next tick via file scan.
|
|
87
|
+
|
|
88
|
+
T+20m Grace period expires.
|
|
89
|
+
Heartbeat check resumes. Blocking tool detection still active.
|
|
90
|
+
Agent in TaskOutput block:true gets extended timeout.
|
|
91
|
+
Agent with no output for 5min+ and no blocking tool → orphaned.
|
|
92
|
+
```
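The per-tick decision in the timeline can be condensed into one function. This is a sketch: `classifyDispatch` and its return labels are hypothetical, and the inputs (silence duration, effective timeout, grace deadline) are assumed to have been computed elsewhere.

```javascript
// Hypothetical sketch of the per-tick decision described in the timeline.
function classifyDispatch({ hasProcess, silentMs, effectiveTimeoutMs, graceUntil, now = Date.now() }) {
  const silentTooLong = silentMs > effectiveTimeoutMs;
  if (!silentTooLong) return 'alive';    // recent output, nothing to do
  if (hasProcess) return 'hung';         // tracked process, but no output past the timeout
  if (now <= graceUntil) return 'grace'; // untracked, but restart grace period still open
  return 'orphaned';                     // untracked, silent, and grace period expired
}
```

Only the `orphaned` and `hung` outcomes would lead to the work item being marked failed; `grace` simply defers the decision to a later tick.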
|
package/engine/ado-mcp-wrapper.js
CHANGED
|
@@ -1,49 +1,49 @@
|
|
|
1
|
-
#!/usr/bin/env node
|
|
2
|
-
/**
|
|
3
|
-
* Wrapper for @azure-devops/mcp that fetches an ADO token via azureauth
|
|
4
|
-
* broker (no browser popup) and sets AZURE_DEVOPS_EXT_PAT before launching
|
|
5
|
-
* the MCP server.
|
|
6
|
-
*/
|
|
7
|
-
const { execSync, spawn } = require('child_process');
|
|
8
|
-
const path = require('path');
|
|
9
|
-
|
|
10
|
-
// Fetch token via azureauth broker (corp tool, no browser)
|
|
11
|
-
let token;
|
|
12
|
-
try {
|
|
13
|
-
token = execSync('azureauth ado token --mode broker --output token --timeout 1', {
|
|
14
|
-
encoding: 'utf8',
|
|
15
|
-
timeout: 30000,
|
|
16
|
-
windowsHide: true,
|
|
17
|
-
}).trim();
|
|
18
|
-
} catch (e) {
|
|
19
|
-
// Fallback: try with web mode (may open browser as last resort)
|
|
20
|
-
try {
|
|
21
|
-
token = execSync('azureauth ado token --mode web --output token --timeout 5', {
|
|
22
|
-
encoding: 'utf8',
|
|
23
|
-
timeout: 120000,
|
|
24
|
-
windowsHide: true,
|
|
25
|
-
}).trim();
|
|
26
|
-
} catch (e2) {
|
|
27
|
-
process.stderr.write('ado-mcp-wrapper: Failed to get ADO token: ' + e2.message + '\n');
|
|
28
|
-
process.exit(1);
|
|
29
|
-
}
|
|
30
|
-
}
|
|
31
|
-
|
|
32
|
-
// Launch the actual MCP server with the token in env
|
|
33
|
-
const args = process.argv.slice(2);
|
|
34
|
-
const child = spawn(process.platform === 'win32' ? 'npx.cmd' : 'npx', [
|
|
35
|
-
'-y',
|
|
36
|
-
'--registry=https://registry.npmjs.org/',
|
|
37
|
-
'@azure-devops/mcp@latest',
|
|
38
|
-
...args
|
|
39
|
-
], {
|
|
40
|
-
stdio: 'inherit',
|
|
41
|
-
env: { ...process.env, AZURE_DEVOPS_EXT_PAT: token },
|
|
42
|
-
windowsHide: true,
|
|
43
|
-
});
|
|
44
|
-
|
|
45
|
-
child.on('exit', (code) => process.exit(code || 0));
|
|
46
|
-
child.on('error', (err) => {
|
|
47
|
-
process.stderr.write('ado-mcp-wrapper: ' + err.message + '\n');
|
|
48
|
-
process.exit(1);
|
|
49
|
-
});
|
|
1
|
+
#!/usr/bin/env node
|
|
2
|
+
/**
|
|
3
|
+
* Wrapper for @azure-devops/mcp that fetches an ADO token via azureauth
|
|
4
|
+
* broker (no browser popup) and sets AZURE_DEVOPS_EXT_PAT before launching
|
|
5
|
+
* the MCP server.
|
|
6
|
+
*/
|
|
7
|
+
const { execSync, spawn } = require('child_process');
|
|
8
|
+
const path = require('path');
|
|
9
|
+
|
|
10
|
+
// Fetch token via azureauth broker (corp tool, no browser)
|
|
11
|
+
let token;
|
|
12
|
+
try {
|
|
13
|
+
token = execSync('azureauth ado token --mode broker --output token --timeout 1', {
|
|
14
|
+
encoding: 'utf8',
|
|
15
|
+
timeout: 30000,
|
|
16
|
+
windowsHide: true,
|
|
17
|
+
}).trim();
|
|
18
|
+
} catch (e) {
|
|
19
|
+
// Fallback: try with web mode (may open browser as last resort)
|
|
20
|
+
try {
|
|
21
|
+
token = execSync('azureauth ado token --mode web --output token --timeout 5', {
|
|
22
|
+
encoding: 'utf8',
|
|
23
|
+
timeout: 120000,
|
|
24
|
+
windowsHide: true,
|
|
25
|
+
}).trim();
|
|
26
|
+
} catch (e2) {
|
|
27
|
+
process.stderr.write('ado-mcp-wrapper: Failed to get ADO token: ' + e2.message + '\n');
|
|
28
|
+
process.exit(1);
|
|
29
|
+
}
|
|
30
|
+
}
|
|
31
|
+
|
|
32
|
+
// Launch the actual MCP server with the token in env
|
|
33
|
+
const args = process.argv.slice(2);
|
|
34
|
+
const child = spawn(process.platform === 'win32' ? 'npx.cmd' : 'npx', [
|
|
35
|
+
'-y',
|
|
36
|
+
'--registry=https://registry.npmjs.org/',
|
|
37
|
+
'@azure-devops/mcp@latest',
|
|
38
|
+
...args
|
|
39
|
+
], {
|
|
40
|
+
stdio: 'inherit',
|
|
41
|
+
env: { ...process.env, AZURE_DEVOPS_EXT_PAT: token },
|
|
42
|
+
windowsHide: true,
|
|
43
|
+
});
|
|
44
|
+
|
|
45
|
+
child.on('exit', (code) => process.exit(code || 0));
|
|
46
|
+
child.on('error', (err) => {
|
|
47
|
+
process.stderr.write('ado-mcp-wrapper: ' + err.message + '\n');
|
|
48
|
+
process.exit(1);
|
|
49
|
+
});
|
package/engine.js
CHANGED
|
@@ -48,6 +48,44 @@ const IDENTITY_DIR = path.join(SQUAD_DIR, 'identity');
|
|
|
48
48
|
// "projects": [ { ... }, ... ] — multi-project (central .squad)
|
|
49
49
|
// Each project must have "localPath" pointing to the repo root.
|
|
50
50
|
|
|
51
|
+
function validateConfig(config) {
|
|
52
|
+
let errors = 0;
|
|
53
|
+
// Agents
|
|
54
|
+
if (!config.agents || Object.keys(config.agents).length === 0) {
|
|
55
|
+
console.error('FATAL: No agents defined in config.json');
|
|
56
|
+
errors++;
|
|
57
|
+
}
|
|
58
|
+
// Projects
|
|
59
|
+
const projects = config.projects || [];
|
|
60
|
+
if (projects.length === 0) {
|
|
61
|
+
console.error('FATAL: No projects configured');
|
|
62
|
+
errors++;
|
|
63
|
+
}
|
|
64
|
+
for (const p of projects) {
|
|
65
|
+
if (!p.localPath || !fs.existsSync(path.resolve(p.localPath))) {
|
|
66
|
+
console.error(`WARN: Project "${p.name}" path not found: ${p.localPath}`);
|
|
67
|
+
}
|
|
68
|
+
if (!p.repositoryId) {
|
|
69
|
+
console.warn(`WARN: Project "${p.name}" missing repositoryId — PR operations will fail`);
|
|
70
|
+
}
|
|
71
|
+
}
|
|
72
|
+
// Playbooks
|
|
73
|
+
const requiredPlaybooks = ['implement', 'review', 'fix', 'work-item'];
|
|
74
|
+
for (const pb of requiredPlaybooks) {
|
|
75
|
+
if (!fs.existsSync(path.join(PLAYBOOKS_DIR, `${pb}.md`))) {
|
|
76
|
+
console.error(`WARN: Missing playbook: playbooks/${pb}.md`);
|
|
77
|
+
}
|
|
78
|
+
}
|
|
79
|
+
// Routing
|
|
80
|
+
if (!fs.existsSync(ROUTING_PATH)) {
|
|
81
|
+
console.error('WARN: routing.md not found — agent routing will use fallbacks only');
|
|
82
|
+
}
|
|
83
|
+
if (errors > 0) {
|
|
84
|
+
console.error(`\n${errors} fatal config error(s) — exiting.`);
|
|
85
|
+
process.exit(1);
|
|
86
|
+
}
|
|
87
|
+
}
|
|
88
|
+
|
|
51
89
|
function getProjects(config) {
|
|
52
90
|
if (config.projects && Array.isArray(config.projects)) {
|
|
53
91
|
return config.projects;
|
|
@@ -525,6 +563,7 @@ function sanitizeBranch(name) {
|
|
|
525
563
|
// ─── Agent Spawner ──────────────────────────────────────────────────────────
|
|
526
564
|
|
|
527
565
|
const activeProcesses = new Map(); // dispatchId → { proc, agentId, startedAt }
|
|
566
|
+
let engineRestartGraceUntil = 0; // timestamp — suppress orphan detection until this time
|
|
528
567
|
|
|
529
568
|
function spawnAgent(dispatchItem, config) {
|
|
530
569
|
const { id, agent: agentId, prompt: taskPrompt, type, meta } = dispatchItem;
|
|
@@ -1268,6 +1307,77 @@ async function pollPrStatus(config) {
|
|
|
1268
1307
|
}
|
|
1269
1308
|
}
|
|
1270
1309
|
|
|
1310
|
+
// ─── Post-Merge / Post-Close Hooks ───────────────────────────────────────────
|
|
1311
|
+
|
|
1312
|
+
async function handlePostMerge(pr, project, config, newStatus) {
|
|
1313
|
+
const prNum = (pr.id || '').replace('PR-', '');
|
|
1314
|
+
|
|
1315
|
+
// 1. Worktree cleanup
|
|
1316
|
+
if (pr.branch) {
|
|
1317
|
+
const root = path.resolve(project.localPath);
|
|
1318
|
+
const wtRoot = path.resolve(root, config.engine?.worktreeRoot || '../worktrees');
|
|
1319
|
+
const wtPath = path.join(wtRoot, pr.branch);
|
|
1320
|
+
const btPath = path.join(wtRoot, `bt-${prNum}`); // build-and-test worktree
|
|
1321
|
+
for (const p of [wtPath, btPath]) {
|
|
1322
|
+
if (fs.existsSync(p)) {
|
|
1323
|
+
try {
|
|
1324
|
+
execSync(`git worktree remove "${p}" --force`, { cwd: root, stdio: 'pipe', timeout: 15000 });
|
|
1325
|
+
log('info', `Cleaned up worktree: ${p}`);
|
|
1326
|
+
} catch (e) { log('warn', `Failed to remove worktree ${p}: ${e.message}`); }
|
|
1327
|
+
}
|
|
1328
|
+
}
|
|
1329
|
+
}
|
|
1330
|
+
|
|
1331
|
+
// Only run remaining hooks for merged PRs (not abandoned)
|
|
1332
|
+
if (newStatus !== 'merged') return;
|
|
1333
|
+
|
|
1334
|
+
// 2. Update PRD item status to 'implemented'
|
|
1335
|
+
if (pr.prdItems?.length > 0) {
|
|
1336
|
+
const root = path.resolve(project.localPath);
|
|
1337
|
+
const prdSrc = project.workSources?.prd || {};
|
|
1338
|
+
const prdPath = path.resolve(root, prdSrc.path || 'docs/prd-gaps.json');
|
|
1339
|
+
const prd = safeJson(prdPath);
|
|
1340
|
+
if (prd?.missing_features) {
|
|
1341
|
+
let updated = 0;
|
|
1342
|
+
for (const itemId of pr.prdItems) {
|
|
1343
|
+
const feature = prd.missing_features.find(f => f.id === itemId);
|
|
1344
|
+
if (feature && feature.status !== 'implemented') {
|
|
1345
|
+
feature.status = 'implemented';
|
|
1346
|
+
updated++;
|
|
1347
|
+
}
|
|
1348
|
+
}
|
|
1349
|
+
if (updated > 0) {
|
|
1350
|
+
safeWrite(prdPath, prd);
|
|
1351
|
+
log('info', `Post-merge: marked ${updated} PRD item(s) as implemented for ${pr.id}`);
|
|
1352
|
+
}
|
|
1353
|
+
}
|
|
1354
|
+
}
|
|
1355
|
+
|
|
1356
|
+
// 3. Update agent metrics
|
|
1357
|
+
const agentId = (pr.agent || '').toLowerCase();
|
|
1358
|
+
if (agentId && config.agents?.[agentId]) {
|
|
1359
|
+
const metricsPath = path.join(ENGINE_DIR, 'metrics.json');
|
|
1360
|
+
const metrics = safeJson(metricsPath) || {};
|
|
1361
|
+
if (!metrics[agentId]) metrics[agentId] = { tasksCompleted:0, tasksErrored:0, prsCreated:0, prsApproved:0, prsRejected:0, prsMerged:0, reviewsDone:0, lastTask:null, lastCompleted:null };
|
|
1362
|
+
metrics[agentId].prsMerged = (metrics[agentId].prsMerged || 0) + 1;
|
|
1363
|
+
safeWrite(metricsPath, metrics);
|
|
1364
|
+
}
|
|
1365
|
+
|
|
1366
|
+
// 4. Teams notification
|
|
1367
|
+
const teamsUrl = process.env.TEAMS_PLAN_FLOW_URL;
|
|
1368
|
+
if (teamsUrl) {
|
|
1369
|
+
try {
|
|
1370
|
+
await fetch(teamsUrl, {
|
|
1371
|
+
method: 'POST',
|
|
1372
|
+
headers: { 'Content-Type': 'application/json' },
|
|
1373
|
+
body: JSON.stringify({ text: `PR ${pr.id} merged: ${pr.title} (${project.name}) by ${pr.agent || 'unknown'}` })
|
|
1374
|
+
});
|
|
1375
|
+
} catch (e) { log('warn', `Teams post-merge notify failed: ${e.message}`); }
|
|
1376
|
+
}
|
|
1377
|
+
|
|
1378
|
+
log('info', `Post-merge hooks completed for ${pr.id}`);
|
|
1379
|
+
}
|
|
1380
|
+
|
|
1271
1381
|
function checkForLearnings(agentId, agentInfo, taskDesc) {
|
|
1272
1382
|
const today = dateStamp();
|
|
1273
1383
|
const inboxFiles = getInboxFiles();
|
|
@@ -1621,16 +1731,43 @@ function updateSnapshot(config) {
|
|
|
1621
1731
|
safeWrite(path.join(IDENTITY_DIR, 'now.md'), snapshot);
|
|
1622
1732
|
}
|
|
1623
1733
|
|
|
1734
|
+
// ─── Idle Alert ─────────────────────────────────────────────────────────────
|
|
1735
|
+
|
|
1736
|
+
let _lastActivityTime = Date.now();
|
|
1737
|
+
let _idleAlertSent = false;
|
|
1738
|
+
|
|
1739
|
+
function checkIdleThreshold(config) {
|
|
1740
|
+
const thresholdMs = (config.engine?.idleAlertMinutes || 15) * 60 * 1000;
|
|
1741
|
+
const agents = Object.keys(config.agents || {});
|
|
1742
|
+
const allIdle = agents.every(id => isAgentIdle(id));
|
|
1743
|
+
const dispatch = getDispatch();
|
|
1744
|
+
const hasPending = (dispatch.pending || []).length > 0;
|
|
1745
|
+
|
|
1746
|
+
if (!allIdle || hasPending) {
|
|
1747
|
+
_lastActivityTime = Date.now();
|
|
1748
|
+
_idleAlertSent = false;
|
|
1749
|
+
return;
|
|
1750
|
+
}
|
|
1751
|
+
|
|
1752
|
+
const idleMs = Date.now() - _lastActivityTime;
|
|
1753
|
+
if (idleMs > thresholdMs && !_idleAlertSent) {
|
|
1754
|
+
const mins = Math.round(idleMs / 60000);
|
|
1755
|
+
log('warn', `All agents idle for ${mins} minutes — no work sources producing items`);
|
|
1756
|
+
_idleAlertSent = true;
|
|
1757
|
+
}
|
|
1758
|
+
}
|
|
1759
|
+
|
|
1624
1760
|
// ─── Timeout Checker ────────────────────────────────────────────────────────
|
|
1625
1761
|
|
|
1626
1762
|
function checkTimeouts(config) {
|
|
1627
1763
|
const timeout = config.engine?.agentTimeout || 18000000; // 5h default
|
|
1628
1764
|
const heartbeatTimeout = config.engine?.heartbeatTimeout || 300000; // 5min — no output = dead
|
|
1629
1765
|
|
|
1630
|
-
// 1. Check tracked processes for hard timeout
|
|
1766
|
+
// 1. Check tracked processes for hard timeout (supports per-item deadline from fan-out)
|
|
1631
1767
|
for (const [id, info] of activeProcesses.entries()) {
|
|
1768
|
+
const itemTimeout = info.meta?.deadline ? Math.max(0, info.meta.deadline - new Date(info.startedAt).getTime()) : timeout;
|
|
1632
1769
|
const elapsed = Date.now() - new Date(info.startedAt).getTime();
|
|
1633
|
-
if (elapsed >
|
|
1770
|
+
if (elapsed > itemTimeout) {
|
|
1634
1771
|
log('warn', `Agent ${info.agentId} (${id}) hit hard timeout after ${Math.round(elapsed / 1000)}s — killing`);
|
|
1635
1772
|
try { info.proc.kill('SIGTERM'); } catch {}
|
|
1636
1773
|
setTimeout(() => {
|
|
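The per-item hard timeout added in the hunk above can be read as a pure function. The sketch below mirrors the arithmetic in the diff; the names `itemTimeoutMs` and `hasHitHardTimeout` are hypothetical, and `info` is assumed to carry `startedAt` and an optional `meta.deadline` as in the diff.

```javascript
// Sketch of the per-item hard-timeout rule, pulled out as pure functions.
// Function names are hypothetical; the arithmetic mirrors the diff.
function itemTimeoutMs(info, defaultTimeoutMs) {
  const startedAtMs = new Date(info.startedAt).getTime();
  return info.meta?.deadline
    ? Math.max(0, info.meta.deadline - startedAtMs) // time budget left at spawn
    : defaultTimeoutMs;                             // no deadline: global agent timeout
}

function hasHitHardTimeout(info, defaultTimeoutMs, now = Date.now()) {
  const elapsed = now - new Date(info.startedAt).getTime();
  return elapsed > itemTimeoutMs(info, defaultTimeoutMs);
}
```

When a deadline is set and lies after `startedAt`, this is equivalent to checking `now > deadline`, so time an item spends queued before it is spawned counts against its budget.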
@@ -1707,9 +1844,10 @@ function checkTimeouts(config) {
|
|
|
1707
1844
|
|
|
1708
1845
|
// Check if agent is in a blocking tool call (TaskOutput block:true, Bash with long timeout, etc.)
|
|
1709
1846
|
// These tools produce no stdout for extended periods — don't kill them prematurely
|
|
1847
|
+
// Check for BOTH tracked and untracked processes (orphan case after engine restart)
|
|
1710
1848
|
let isBlocking = false;
|
|
1711
1849
|
let blockingTimeout = heartbeatTimeout;
|
|
1712
|
-
if (
|
|
1850
|
+
if (silentMs > heartbeatTimeout) {
|
|
1713
1851
|
try {
|
|
1714
1852
|
const liveLog = safeRead(liveLogPath);
|
|
1715
1853
|
if (liveLog) {
|
|
@@ -1747,9 +1885,9 @@ function checkTimeouts(config) {
|
|
|
1747
1885
|
|
|
1748
1886
|
const effectiveTimeout = isBlocking ? blockingTimeout : heartbeatTimeout;
|
|
1749
1887
|
|
|
1750
|
-
if (!hasProcess && silentMs >
|
|
1751
|
-
// No tracked process AND no recent output → orphaned
|
|
1752
|
-
log('warn', `Orphan detected: ${item.agent} (${item.id}) — no process tracked,
|
|
1888
|
+
if (!hasProcess && silentMs > effectiveTimeout && Date.now() > engineRestartGraceUntil) {
|
|
1889
|
+
// No tracked process AND no recent output past effective timeout AND grace period expired → orphaned
|
|
1890
|
+
log('warn', `Orphan detected: ${item.agent} (${item.id}) — no process tracked, silent for ${silentSec}s${isBlocking ? ' (blocking timeout exceeded)' : ''}`);
|
|
1753
1891
|
deadItems.push({ item, reason: `Orphaned — no process, silent for ${silentSec}s` });
|
|
1754
1892
|
} else if (hasProcess && silentMs > effectiveTimeout) {
|
|
1755
1893
|
// Has process but no output past effective timeout → hung
|
|
@@ -2041,15 +2179,16 @@ function discoverFromPrd(config, project) {
|
|
|
2041
2179
|
const statusFilter = src.itemFilter?.status || ['missing', 'planned'];
|
|
2042
2180
|
const items = (prd.missing_features || []).filter(f => statusFilter.includes(f.status));
|
|
2043
2181
|
const newWork = [];
|
|
2182
|
+
const skipped = { dispatched: 0, cooldown: 0, noAgent: 0 };
|
|
2044
2183
|
|
|
2045
2184
|
for (const item of items) {
|
|
2046
2185
|
const key = `prd-${project?.name || 'default'}-${item.id}`;
|
|
2047
|
-
if (isAlreadyDispatched(key)) continue;
|
|
2048
|
-
if (isOnCooldown(key, cooldownMs)) continue;
|
|
2186
|
+
if (isAlreadyDispatched(key)) { skipped.dispatched++; continue; }
|
|
2187
|
+
if (isOnCooldown(key, cooldownMs)) { skipped.cooldown++; continue; }
|
|
2049
2188
|
|
|
2050
2189
|
const workType = item.estimated_complexity === 'large' ? 'implement:large' : 'implement';
|
|
2051
2190
|
const agentId = resolveAgent(workType, config);
|
|
2052
|
-
if (!agentId) continue;
|
|
2191
|
+
if (!agentId) { skipped.noAgent++; continue; }
|
|
2053
2192
|
|
|
2054
2193
|
const branchName = `feature/${item.id.toLowerCase()}-${item.name.toLowerCase().replace(/[^a-z0-9]+/g, '-').slice(0, 40)}`;
|
|
2055
2194
|
const vars = {
|
|
@@ -2090,6 +2229,11 @@ function discoverFromPrd(config, project) {
|
|
|
2090
2229
|
setCooldown(key);
|
|
2091
2230
|
}
|
|
2092
2231
|
|
|
2232
|
+
const skipTotal = skipped.dispatched + skipped.cooldown + skipped.noAgent;
|
|
2233
|
+
if (skipTotal > 0) {
|
|
2234
|
+
log('debug', `PRD discovery (${project?.name}): skipped ${skipTotal} items (${skipped.dispatched} dispatched, ${skipped.cooldown} cooldown, ${skipped.noAgent} no agent)`);
|
|
2235
|
+
}
|
|
2236
|
+
|
|
2093
2237
|
return newWork;
|
|
2094
2238
|
}
|
|
2095
2239
|
|
|
@@ -2369,20 +2513,20 @@ function discoverFromWorkItems(config, project) {
|
|
|
2369
2513
|
const items = safeJson(path.resolve(root, src.path)) || [];
|
|
2370
2514
|
const cooldownMs = (src.cooldownMinutes || 0) * 60 * 1000;
|
|
2371
2515
|
const newWork = [];
|
|
2516
|
+
const skipped = { gated: 0, noAgent: 0 };
|
|
2372
2517
|
|
|
2373
2518
|
for (const item of items) {
|
|
2374
2519
|
if (item.status !== 'queued' && item.status !== 'pending') continue;
|
|
2375
2520
|
|
|
2376
2521
|
const key = `work-${project?.name || 'default'}-${item.id}`;
|
|
2377
|
-
if (isAlreadyDispatched(key) || isOnCooldown(key, cooldownMs)) continue;
|
|
2522
|
+
if (isAlreadyDispatched(key) || isOnCooldown(key, cooldownMs)) { skipped.gated++; continue; }
|
|
2378
2523
|
|
|
2379
2524
|
let workType = item.type || 'implement';
|
|
2380
|
-
// Route large items to architecture agents, matching PRD/plan behavior
|
|
2381
2525
|
if (workType === 'implement' && (item.complexity === 'large' || item.estimated_complexity === 'large')) {
|
|
2382
2526
|
workType = 'implement:large';
|
|
2383
2527
|
}
|
|
2384
2528
|
const agentId = item.agent || resolveAgent(workType, config);
|
|
2385
|
-
if (!agentId) continue;
|
|
2529
|
+
if (!agentId) { skipped.noAgent++; continue; }
|
|
2386
2530
|
|
|
2387
2531
|
const branchName = item.branch || `work/${item.id}`;
|
|
2388
2532
|
const vars = {
|
|
@@ -2443,6 +2587,11 @@ function discoverFromWorkItems(config, project) {
|
|
|
2443
2587
|
safeWrite(workItemsPath, items);
|
|
2444
2588
|
}
|
|
2445
2589
|
|
|
2590
|
+
const skipTotal = skipped.gated + skipped.noAgent;
|
|
2591
|
+
if (skipTotal > 0) {
|
|
2592
|
+
log('debug', `Work item discovery (${project?.name}): skipped ${skipTotal} items (${skipped.gated} gated, ${skipped.noAgent} no agent)`);
|
|
2593
|
+
}
|
|
2594
|
+
|
|
2446
2595
|
return newWork;
|
|
2447
2596
|
}
|
|
2448
2597
|
|
|
@@ -2695,7 +2844,10 @@ function discoverCentralWorkItems(config) {
|
|
|
2695
2844
|
agentRole: agent.role,
|
|
2696
2845
|
task: `[fan-out] ${item.title} → ${agent.name}${assignedProject ? ' → ' + assignedProject.name : ''}`,
|
|
2697
2846
|
prompt,
|
|
2698
|
-
meta: {
|
|
2847
|
+
meta: {
|
|
2848
|
+
dispatchKey: fanKey, source: 'central-work-item-fanout', item, parentKey: key,
|
|
2849
|
+
deadline: item.timeout ? Date.now() + item.timeout : Date.now() + (config.engine?.fanOutTimeout || config.engine?.agentTimeout || 18000000)
|
|
2850
|
+
}
|
|
2699
2851
|
});
|
|
2700
2852
|
}
|
|
2701
2853
|
|
|
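The deadline expression in the hunk above resolves a time budget through a chain of fallbacks. A sketch of the same logic as a standalone function follows; `fanOutDeadline` is a hypothetical name, and the config keys and 5h fallback are the ones visible in the diff.

```javascript
// Sketch of the fan-out deadline fallback chain shown in the diff.
const DEFAULT_AGENT_TIMEOUT_MS = 18000000; // 5h, the fallback used in the diff

function fanOutDeadline(item, config, now = Date.now()) {
  const budgetMs = item.timeout              // 1. per-item timeout, if set
    || config.engine?.fanOutTimeout          // 2. fan-out specific setting
    || config.engine?.agentTimeout           // 3. global agent timeout
    || DEFAULT_AGENT_TIMEOUT_MS;             // 4. hard-coded default
  return now + budgetMs;
}
```

For example, an item with `timeout: 5000` discovered at time `1000` would get a deadline of `6000`, regardless of the engine-level settings.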
@@ -2838,8 +2990,9 @@ async function tickInner() {
|
|
|
2838
2990
|
const config = getConfig();
|
|
2839
2991
|
tickCount++;
|
|
2840
2992
|
|
|
2841
|
-
// 1. Check for timed-out agents
|
|
2993
|
+
// 1. Check for timed-out agents and idle threshold
|
|
2842
2994
|
checkTimeouts(config);
|
|
2995
|
+
checkIdleThreshold(config);
|
|
2843
2996
|
|
|
2844
2997
|
// 2. Consolidate inbox
|
|
2845
2998
|
consolidateInbox(config);
|
|
@@ -2931,9 +3084,24 @@ const commands = {
|
|
|
2931
3084
|
}
|
|
2932
3085
|
}
|
|
2933
3086
|
|
|
3087
|
+
// Validate config before starting
|
|
3088
|
+
validateConfig(config);
|
|
3089
|
+
|
|
2934
3090
|
// Load persistent state
|
|
2935
3091
|
loadCooldowns();
|
|
2936
3092
|
|
|
3093
|
+
// Grace period for agents that survived a restart
|
|
3094
|
+
const dispatch = getDispatch();
|
|
3095
|
+
const activeOnStart = (dispatch.active || []);
|
|
3096
|
+
if (activeOnStart.length > 0) {
|
|
3097
|
+
const gracePeriod = config.engine?.restartGracePeriod || 1200000; // 20min default
|
|
3098
|
+
engineRestartGraceUntil = Date.now() + gracePeriod;
|
|
3099
|
+
console.log(` ${activeOnStart.length} active dispatch(es) from previous session — ${gracePeriod / 60000}min grace period before orphan detection`);
|
|
3100
|
+
for (const item of activeOnStart) {
|
|
3101
|
+
console.log(` - ${item.agentName || item.agent}: ${(item.task || '').slice(0, 70)}`);
|
|
3102
|
+
}
|
|
3103
|
+
}
|
|
3104
|
+
|
|
2937
3105
|
// Initial tick
|
|
2938
3106
|
tick();
|
|
2939
3107
|
|
|
@@ -2944,6 +3112,18 @@ const commands = {
|
|
|
2944
3112
|
},
|
|
2945
3113
|
|
|
2946
3114
|
stop() {
|
|
3115
|
+
// Warn if agents are actively working
|
|
3116
|
+
const dispatch = getDispatch();
|
|
3117
|
+
const active = (dispatch.active || []);
|
|
3118
|
+
if (active.length > 0) {
|
|
3119
|
+
console.log(`\n WARNING: ${active.length} agent(s) are still working:`);
|
|
3120
|
+
for (const item of active) {
|
|
3121
|
+
console.log(` - ${item.agentName || item.agent}: ${(item.task || '').slice(0, 80)}`);
|
|
3122
|
+
}
|
|
3123
|
+
console.log('\n These agents will continue running but the engine won\'t monitor them.');
|
|
3124
|
+
console.log(' On next start, they\'ll get a 20-min grace period before being marked as orphans.');
|
|
3125
|
+
console.log(' To kill them now, run: node engine.js kill\n');
|
|
3126
|
+
}
|
|
2947
3127
|
safeWrite(CONTROL_PATH, { state: 'stopped', stopped_at: ts() });
|
|
2948
3128
|
log('info', 'Engine stopped');
|
|
2949
3129
|
console.log('Engine stopped.');
|
package/package.json
CHANGED
|
@@ -1,46 +1,46 @@
|
|
|
1
|
-
{
|
|
2
|
-
"name": "@yemi33/squad",
|
|
3
|
-
"version": "0.1.
|
|
4
|
-
"description": "Multi-agent AI dev team that runs from ~/.squad/ — five autonomous agents share a single engine, dashboard, and knowledge base",
|
|
5
|
-
"bin": {
|
|
6
|
-
"squad": "bin/squad.js"
|
|
7
|
-
},
|
|
8
|
-
"keywords": [
|
|
9
|
-
"ai",
|
|
10
|
-
"agents",
|
|
11
|
-
"claude",
|
|
12
|
-
"dev-team",
|
|
13
|
-
"automation",
|
|
14
|
-
"multi-agent",
|
|
15
|
-
"cli"
|
|
16
|
-
],
|
|
17
|
-
"author": "yemi33",
|
|
18
|
-
"license": "MIT",
|
|
19
|
-
"repository": {
|
|
20
|
-
"type": "git",
|
|
21
|
-
"url": "https://github.com/yemi33/squad.git"
|
|
22
|
-
},
|
|
23
|
-
"homepage": "https://github.com/yemi33/squad#readme",
|
|
24
|
-
"engines": {
|
|
25
|
-
"node": ">=18"
|
|
26
|
-
},
|
|
27
|
-
"files": [
|
|
28
|
-
"bin/",
|
|
29
|
-
"agents/*/charter.md",
|
|
30
|
-
"config.template.json",
|
|
31
|
-
"dashboard.html",
|
|
32
|
-
"dashboard.js",
|
|
33
|
-
"docs/",
|
|
34
|
-
"engine.js",
|
|
35
|
-
"engine/spawn-agent.js",
|
|
36
|
-
"engine/ado-mcp-wrapper.js",
|
|
37
|
-
"playbooks/",
|
|
38
|
-
"routing.md",
|
|
39
|
-
"skills/README.md",
|
|
40
|
-
"skills/ado-pr-status-fetch.md",
|
|
41
|
-
"squad.js",
|
|
42
|
-
"team.md",
|
|
43
|
-
"README.md",
|
|
44
|
-
"TODO.md"
|
|
45
|
-
]
|
|
46
|
-
}
|
|
1
|
+
{
|
|
2
|
+
"name": "@yemi33/squad",
|
|
3
|
+
"version": "0.1.2",
|
|
4
|
+
"description": "Multi-agent AI dev team that runs from ~/.squad/ — five autonomous agents share a single engine, dashboard, and knowledge base",
|
|
5
|
+
"bin": {
|
|
6
|
+
"squad": "bin/squad.js"
|
|
7
|
+
},
|
|
8
|
+
"keywords": [
|
|
9
|
+
"ai",
|
|
10
|
+
"agents",
|
|
11
|
+
"claude",
|
|
12
|
+
"dev-team",
|
|
13
|
+
"automation",
|
|
14
|
+
"multi-agent",
|
|
15
|
+
"cli"
|
|
16
|
+
],
|
|
17
|
+
"author": "yemi33",
|
|
18
|
+
"license": "MIT",
|
|
19
|
+
"repository": {
|
|
20
|
+
"type": "git",
|
|
21
|
+
"url": "https://github.com/yemi33/squad.git"
|
|
22
|
+
},
|
|
23
|
+
"homepage": "https://github.com/yemi33/squad#readme",
|
|
24
|
+
"engines": {
|
|
25
|
+
"node": ">=18"
|
|
26
|
+
},
|
|
27
|
+
"files": [
|
|
28
|
+
"bin/",
|
|
29
|
+
"agents/*/charter.md",
|
|
30
|
+
"config.template.json",
|
|
31
|
+
"dashboard.html",
|
|
32
|
+
"dashboard.js",
|
|
33
|
+
"docs/",
|
|
34
|
+
"engine.js",
|
|
35
|
+
"engine/spawn-agent.js",
|
|
36
|
+
"engine/ado-mcp-wrapper.js",
|
|
37
|
+
"playbooks/",
|
|
38
|
+
"routing.md",
|
|
39
|
+
"skills/README.md",
|
|
40
|
+
"skills/ado-pr-status-fetch.md",
|
|
41
|
+
"squad.js",
|
|
42
|
+
"team.md",
|
|
43
|
+
"README.md",
|
|
44
|
+
"TODO.md"
|
|
45
|
+
]
|
|
46
|
+
}
|