@swarmclawai/swarmclaw 1.5.42 → 1.5.43

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -389,6 +389,13 @@ Operational docs: https://swarmclaw.ai/docs/observability
 
 ## Releases
 
+ ### v1.5.43 Highlights
+
+ - **`/api/version` no longer 500s in Docker**: the route used to shell out to `git` at runtime, which fails in the production image because `.git/` is not copied. The route now returns 200 with `{ source: 'package', version, ... }` from `package.json` when git metadata is unavailable, and `{ source: 'git', version, localSha, ... }` when it is. `/api/version/update` short-circuits on Docker-style installs with a clear `no_git_metadata` reason instead of an opaque 500. ([#41](https://github.com/swarmclawai/swarmclaw/issues/41) Bug 1, reported by [@SteamedFish](https://github.com/SteamedFish).)
+ - **Daemon reclaims stale `daemon-primary` leases on container restart**: when the previous container died holding the SQLite-backed lease, the new container previously waited up to the full 120 s TTL before the daemon could start. The successor now parses the recorded owner pid, probes it with `process.kill(pid, 0)`, and reclaims the lease immediately when the prior owner is provably dead on this host. When the owner is genuinely alive (or when the recorded host is ambiguous, such as multi-pod Kubernetes), behavior is unchanged, but a single deferred retry is scheduled just past the lease expiry so the daemon comes up automatically rather than waiting for the next API call. ([#41](https://github.com/swarmclawai/swarmclaw/issues/41) Bug 2.)
+ - **Subprocess daemon fallback fails soft in Docker**: when `resolveDaemonRuntimeEntry()` cannot find `src/lib/server/daemon/daemon-runtime.ts` (the file is intentionally not in the standalone build), `ensureDaemonProcessRunning()` now logs a one-shot warning and returns `false` instead of throwing into the API handler. The in-process daemon path (with the Bug 2 fix) is the production path in Docker. ([#41](https://github.com/swarmclawai/swarmclaw/issues/41) Bug 3.)
+ - **`CONTRIBUTING.md`**: dropped the broken reference to `AGENTS.md`. That file is `.gitignore`'d and not visible to external contributors. The single canonical project-conventions document is `CLAUDE.md`.
+
 ### v1.5.42 Highlights
 
 - **New `opencode-web` provider — connect to remote OpenCode HTTP servers** ([#40](https://github.com/swarmclawai/swarmclaw/issues/40), requested by [@SteamedFish](https://github.com/SteamedFish)): point an agent at any host running `opencode serve` or `opencode web` (default port `4096`). Supports HTTPS endpoints, HTTP Basic Auth (encode credentials as `username:password` in the API key field; bare passwords default the username to `opencode`), automatic OpenCode session reuse across chat turns, and per-session workspace isolation via `?directory=...`. Models are entered as `providerID/modelID` (e.g. `anthropic/claude-sonnet-4-5`). The existing `opencode-cli` provider is unchanged.
@@ -417,17 +424,6 @@ Operational docs: https://swarmclaw.ai/docs/observability
 - **Perf ring buffer raised to 2 000 entries**: queue/task repository events fire ~20 Hz during task processing and were evicting chat-execution/prompt perf entries out of the 200-entry buffer before they could be read. The larger buffer lets the perf viewer actually show a full turn.
 - **Tests**: added regression tests for pre-1.5.38 stale-checkout orphan recovery and for the scoped-tool-access algorithm.
 
- ### v1.5.38 Highlights
-
- - **Task queue: reclaim stale checkouts**: `checkoutTask()` now reclaims a lingering `checkoutRunId` on a `queued` task instead of refusing it forever. An ungraceful server exit mid-turn (crash, SIGKILL, HMR reload) previously left tasks uncheckoutable, producing a dispatch → orphan-recovery → failed-checkout spin that logged "Recovering orphaned queued task" tens of thousands of times per session. `scheduleRetryOrDeadLetter()` also clears the prior checkout when scheduling a retry or dead-lettering.
- - **Chat: suppress duplicate parallel tool calls**: some OSS models on Ollama (notably `devstral`) emit the same tool call twice in a single turn. The LangGraph tool-event tracker now dedupes by `name + input` signature, swallowing the duplicate start and its result while allowing a genuinely later identical call once the first completes. Hardened against replayed-start events (HMR, graph retries) that previously could leak a `run_id` into both the accepted and suppressed sets and leave `pendingCount` stuck above zero.
- - **Chat: disable `parallel_tool_calls` for Ollama**: local Ollama sessions now pass `parallel_tool_calls: false` to prevent the upstream duplicate-call behavior at the source for models that honor it.
- - **Chat: no-progress guard for tool summary retries**: if the model produces essentially no new text on a `tool_summary` continuation, the loop stops retrying instead of streaming the same short sentence two or three times. The guard is snapshot-aware: a transient-error rollback no longer leaves a stale progress counter that silently skips a legitimate retry (`lastToolSummaryTextLen` is now round-tripped through `ChatTurnState.snapshot`/`restore`).
- - **Task UI: distinguish retry-pending from failure**: a retrying task now renders in amber with a "Retry Pending" label in the task card and sheet, instead of the same red treatment used for dead-lettered failures.
- - **Autonomy: dedupe reflection memories across kinds**: the supervisor reflection writer now drops notes whose normalized text has already been stored this run, eliminating near-identical memory rows classified under multiple kinds.
- - **OpenClaw gateway: fast-fail on dangling credentials**: when an agent's OpenClaw route references a deleted or missing credential, the gateway now refuses to dial the WebSocket up front instead of attempting an unauthenticated handshake and waiting the full 120 s for the agent-side timeout. The credential-missing log line is promoted from warn to error so it surfaces in routine monitoring.
- - **Prompt size profiler**: setting `SWARMCLAW_PROFILE_PROMPT=1` now logs a per-section size breakdown of the assembled system prompt (block index, first-line label, char count) on every turn, making it practical to diagnose why a specific agent is eating context budget. Off by default so production turns stay quiet.
-
 Older releases: https://swarmclaw.ai/docs/release-notes
 
 - GitHub releases: https://github.com/swarmclawai/swarmclaw/releases
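
For consumers, the practical change in `/api/version` is the new `source` discriminator. A minimal client-side sketch of handling both shapes (the `VersionInfo` type and `checkForUpdate` helper are illustrative, not part of the package; field names are taken from the route diff below):

```ts
// Response shape of /api/version as of 1.5.43 (see the route diff below).
interface VersionInfo {
  source: 'package' | 'git'
  version: string
  localSha: string | null
  localTag: string | null
  remoteSha: string | null
  remoteTag: string | null
  channel: 'stable' | 'main'
  updateAvailable: boolean
  behindBy: number
}

async function checkForUpdate(baseUrl: string): Promise<string> {
  const res = await fetch(`${baseUrl}/api/version`) // always 200 now, even without .git/
  const info = (await res.json()) as VersionInfo
  if (info.source === 'package') {
    // Docker / npm tarball install: only the static package.json version is known.
    return `v${info.version} (git metadata unavailable; use the npm/Docker upgrade path)`
  }
  return info.updateAvailable
    ? `v${info.version}: ${info.behindBy} commit(s) behind ${info.remoteTag ?? 'origin/main'}`
    : `v${info.version} (up to date)`
}
```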
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "@swarmclawai/swarmclaw",
- "version": "1.5.42",
+ "version": "1.5.43",
 "description": "Build and run autonomous AI agents with OpenClaw, Hermes, multiple model providers, orchestration, delegation, memory, skills, schedules, and chat connectors.",
 "main": "electron-dist/main.js",
 "license": "MIT",
@@ -1,7 +1,8 @@
 import { NextResponse } from 'next/server'
- import { execSync } from 'child_process'
- export const dynamic = 'force-dynamic'
+ import { gitAvailable, safeGit } from '@/lib/server/git-metadata'
+ import packageJson from '../../../../package.json'
 
+ export const dynamic = 'force-dynamic'
 
 let cachedRemote: {
 sha: string
@@ -10,74 +11,87 @@ let cachedRemote: {
 remoteTag: string | null
 checkedAt: number
 } | null = null
- const CACHE_TTL = 60_000 // 60s
+ const CACHE_TTL = 60_000
 const RELEASE_TAG_RE = /^v\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.-]+)?$/
 
- function run(cmd: string): string {
- return execSync(cmd, { encoding: 'utf-8', cwd: process.cwd(), timeout: 15_000 }).trim()
- }
-
 function getLatestStableTag(): string | null {
- const tags = run(`git tag --list 'v*' --sort=-v:refname`)
- .split('\n')
- .map((line) => line.trim())
- .filter(Boolean)
- return tags.find((tag) => RELEASE_TAG_RE.test(tag)) || null
+ const out = safeGit(['tag', '--list', 'v*', '--sort=-v:refname'])
+ if (!out) return null
+ return out.split('\n').map((l) => l.trim()).filter(Boolean).find((t) => RELEASE_TAG_RE.test(t)) || null
 }
 
 function getHeadStableTag(): string | null {
- const tags = run(`git tag --points-at HEAD --list 'v*' --sort=-v:refname`)
- .split('\n')
- .map((line) => line.trim())
- .filter(Boolean)
- return tags.find((tag) => RELEASE_TAG_RE.test(tag)) || null
+ const out = safeGit(['tag', '--points-at', 'HEAD', '--list', 'v*', '--sort=-v:refname'])
+ if (!out) return null
+ return out.split('\n').map((l) => l.trim()).filter(Boolean).find((t) => RELEASE_TAG_RE.test(t)) || null
 }
 
 export async function GET(_req: Request) {
- try {
- const localSha = run('git rev-parse --short HEAD')
- const localTag = getHeadStableTag()
+ // Always return 200. When git metadata is unavailable (Docker production
+ // image, npm tarball install) we fall back to the static package.json
+ // version. Issue #41 reported a 500 response when `.git/` was not present
+ // in the production container; this route now degrades gracefully.
+ const packageVersion = packageJson.version
 
- let remoteSha = cachedRemote?.sha ?? localSha
- let behindBy = cachedRemote?.behindBy ?? 0
- let channel: 'stable' | 'main' = cachedRemote?.channel ?? 'main'
- let remoteTag = cachedRemote?.remoteTag ?? null
+ if (!gitAvailable()) {
+ return NextResponse.json({
+ source: 'package',
+ version: packageVersion,
+ localSha: null,
+ localTag: `v${packageVersion}`,
+ remoteSha: null,
+ remoteTag: null,
+ channel: 'stable',
+ updateAvailable: false,
+ behindBy: 0,
+ })
+ }
 
- if (!cachedRemote || Date.now() - cachedRemote.checkedAt > CACHE_TTL) {
- try {
- run('git fetch --tags origin --quiet')
- const latestTag = getLatestStableTag()
- if (latestTag) {
- channel = 'stable'
- remoteTag = latestTag
- remoteSha = run(`git rev-parse --short ${latestTag}^{commit}`)
- behindBy = parseInt(run(`git rev-list HEAD..${latestTag}^{commit} --count`), 10) || 0
- } else {
- // Fallback for repos without release tags yet.
- channel = 'main'
- remoteTag = null
- run('git fetch origin main --quiet')
- behindBy = parseInt(run('git rev-list HEAD..origin/main --count'), 10) || 0
- remoteSha = behindBy > 0
- ? run('git rev-parse --short origin/main')
- : localSha
+ const localSha = safeGit(['rev-parse', '--short', 'HEAD'])
+ const localTag = getHeadStableTag()
+
+ let remoteSha = cachedRemote?.sha ?? localSha
+ let behindBy = cachedRemote?.behindBy ?? 0
+ let channel: 'stable' | 'main' = cachedRemote?.channel ?? 'main'
+ let remoteTag = cachedRemote?.remoteTag ?? null
+
+ if (!cachedRemote || Date.now() - cachedRemote.checkedAt > CACHE_TTL) {
+ const fetched = safeGit(['fetch', '--tags', 'origin', '--quiet'])
+ if (fetched !== null) {
+ const latestTag = getLatestStableTag()
+ if (latestTag) {
+ channel = 'stable'
+ remoteTag = latestTag
+ const sha = safeGit(['rev-parse', '--short', `${latestTag}^{commit}`])
+ if (sha) remoteSha = sha
+ const count = safeGit(['rev-list', `HEAD..${latestTag}^{commit}`, '--count'])
+ behindBy = count ? (parseInt(count, 10) || 0) : 0
+ } else {
+ channel = 'main'
+ remoteTag = null
+ safeGit(['fetch', 'origin', 'main', '--quiet'])
+ const count = safeGit(['rev-list', 'HEAD..origin/main', '--count'])
+ behindBy = count ? (parseInt(count, 10) || 0) : 0
+ if (behindBy > 0) {
+ const sha = safeGit(['rev-parse', '--short', 'origin/main'])
+ if (sha) remoteSha = sha
+ } else if (localSha) {
+ remoteSha = localSha
 }
- cachedRemote = { sha: remoteSha, behindBy, channel, remoteTag, checkedAt: Date.now() }
- } catch {
- // fetch failed (no network, no remote, etc.) — use stale cache or defaults
 }
+ cachedRemote = { sha: remoteSha || '', behindBy, channel, remoteTag, checkedAt: Date.now() }
 }
-
- return NextResponse.json({
- localSha,
- localTag,
- remoteSha,
- remoteTag,
- channel,
- updateAvailable: behindBy > 0,
- behindBy,
- })
- } catch {
- return NextResponse.json({ error: 'Not a git repository' }, { status: 500 })
 }
+
+ return NextResponse.json({
+ source: 'git',
+ version: packageVersion,
+ localSha,
+ localTag,
+ remoteSha,
+ remoteTag,
+ channel,
+ updateAvailable: behindBy > 0,
+ behindBy,
+ })
 }
@@ -1,6 +1,7 @@
 import { NextResponse } from 'next/server'
 import { execSync } from 'child_process'
 import { getDb } from '@/lib/server/storage'
+ import { gitAvailable } from '@/lib/server/git-metadata'
 
 const RELEASE_TAG_RE = /^v\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.-]+)?$/
 
@@ -37,6 +38,17 @@ function ensureCleanWorkingTree() {
 }
 
 export async function POST() {
+ // The git-pull update path only makes sense for source/git checkouts.
+ // Docker and packaged-app installs have their own update channels and
+ // calling this route on those installs would otherwise return a confusing
+ // 500. Surface the situation as a 200 with a clear reason instead.
+ if (!gitAvailable()) {
+ return NextResponse.json({
+ success: false,
+ reason: 'no_git_metadata',
+ error: 'Self-update is only supported for source / git checkouts. Use the npm or Docker upgrade path for this install.',
+ })
+ }
 try {
 const beforeSha = run('git rev-parse --short HEAD')
 const beforeRef = run('git rev-parse HEAD')
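
Callers can rely on `success`/`reason` to tell "not applicable" apart from a real failure. A short hedged sketch of the client side (only `success`, `reason`, and `error` appear in this hunk; the git-checkout path may return additional fields):

```ts
async function triggerSelfUpdate(baseUrl: string): Promise<void> {
  const res = await fetch(`${baseUrl}/api/version/update`, { method: 'POST' })
  const body = (await res.json()) as { success: boolean; reason?: string; error?: string }
  if (!body.success && body.reason === 'no_git_metadata') {
    // Docker / packaged install: self-update is intentionally unsupported here.
    console.info(body.error)
    return
  }
  // ...handle the git-checkout update result...
}
```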
@@ -31,7 +31,14 @@ import {
 releaseRuntimeLock,
 tryAcquireRuntimeLock,
 } from '@/lib/server/runtime/runtime-lock-repository'
- import { errorMessage } from '@/lib/shared-utils'
+ import { errorMessage, hmrSingleton } from '@/lib/shared-utils'
+
+ // HMR-safe single-shot guard so the "subprocess fallback unavailable"
+ // warning logs once per process lifetime, not per API call.
+ const subprocessFallbackUnavailableLogged = hmrSingleton<{ value: boolean }>(
+ '__swarmclaw_daemon_subprocess_fallback_warned__',
+ () => ({ value: false }),
+ )
 
 const TAG = 'daemon-controller'
 const LAUNCH_LOCK_NAME = 'daemon-launcher'
@@ -367,7 +374,22 @@ export async function ensureDaemonProcessRunning(
 const secondCheck = await getLiveDaemonSnapshot()
 if (secondCheck?.status.running) return false
 
- const { root, entry } = resolveDaemonRuntimeEntry()
+ let resolved: { root: string; entry: string }
+ try {
+ resolved = resolveDaemonRuntimeEntry()
+ } catch (err: unknown) {
+ // The standalone Docker image does not ship `src/` (Next.js standalone
+ // output excludes raw source files), so the subprocess fallback can
+ // never spawn there. Fail soft: log once and let callers fall back to
+ // whatever in-process daemon path is available rather than surfacing
+ // a 500 to API consumers. Reported as issue #41 (Bug 3).
+ if (!subprocessFallbackUnavailableLogged.value) {
+ subprocessFallbackUnavailableLogged.value = true
+ log.warn(TAG, `[daemon] Subprocess fallback unavailable in this build (${errorMessage(err)}). The in-process daemon will continue to be the primary path.`)
+ }
+ return false
+ }
+ const { root, entry } = resolved
 const adminPort = await reservePort()
 const adminToken = crypto.randomBytes(24).toString('hex')
 fs.mkdirSync(path.dirname(DAEMON_LOG_PATH), { recursive: true })
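
`hmrSingleton` itself is not shown in this diff. For readers unfamiliar with the pattern: Next.js dev-mode hot reloads re-evaluate modules, which would reset a plain module-level flag and re-trigger the warning. A typical `globalThis`-keyed helper looks roughly like the sketch below (a hypothetical reconstruction matching the call signature used above, not the package's actual `@/lib/shared-utils` code):

```ts
// Sketch: cache the factory's result on globalThis under a unique key so a
// hot-reloaded copy of the module reuses the same object instead of building
// a fresh one (which would reset the one-shot `value` flag).
export function hmrSingleton<T>(key: string, factory: () => T): T {
  const store = globalThis as unknown as Record<string, T | undefined>
  if (store[key] === undefined) store[key] = factory()
  return store[key] as T
}
```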
@@ -0,0 +1,72 @@
+ import { describe, it } from 'node:test'
+ import assert from 'node:assert/strict'
+ import { isOwnerProcessDead, parseOwnerPid } from '@/lib/server/daemon/lease-owner'
+
+ function probeThrowing(code: string) {
+ return {
+ kill: () => {
+ const err = new Error('mock probe failure') as NodeJS.ErrnoException
+ err.code = code
+ throw err
+ },
+ }
+ }
+
+ const probeAlive = { kill: () => true as const }
+
+ describe('parseOwnerPid', () => {
+ it('returns the pid for a well-formed owner string', () => {
+ assert.equal(parseOwnerPid('pid:12345:abc'), 12345)
+ assert.equal(parseOwnerPid('pid:1:xyz'), 1)
+ })
+
+ it('returns null for unrecognised owner strings', () => {
+ assert.equal(parseOwnerPid(null), null)
+ assert.equal(parseOwnerPid(undefined), null)
+ assert.equal(parseOwnerPid(''), null)
+ assert.equal(parseOwnerPid('another process'), null)
+ assert.equal(parseOwnerPid('pid::abc'), null)
+ assert.equal(parseOwnerPid('pid:abc:xyz'), null)
+ assert.equal(parseOwnerPid('host:hostname:pid:1:abc'), null)
+ })
+
+ it('rejects zero and negative pids', () => {
+ assert.equal(parseOwnerPid('pid:0:abc'), null)
+ assert.equal(parseOwnerPid('pid:-1:abc'), null)
+ })
+ })
+
+ describe('isOwnerProcessDead — bug #41 stale-lease recovery', () => {
+ it('returns true when the probe reports ESRCH (no such process)', () => {
+ assert.equal(isOwnerProcessDead('pid:99999:abc', probeThrowing('ESRCH')), true)
+ })
+
+ it('returns false when the probe reports EPERM (process owned by someone else)', () => {
+ // EPERM means the process exists but signal delivery is blocked. Assume alive
+ // and do not steal the lease — bias towards waiting for TTL.
+ assert.equal(isOwnerProcessDead('pid:99999:abc', probeThrowing('EPERM')), false)
+ })
+
+ it('returns false when the probe succeeds (process is alive)', () => {
+ assert.equal(isOwnerProcessDead('pid:99999:abc', probeAlive), false)
+ })
+
+ it('returns false for any unknown probe error code (do not guess)', () => {
+ assert.equal(isOwnerProcessDead('pid:99999:abc', probeThrowing('EAGAIN')), false)
+ assert.equal(isOwnerProcessDead('pid:99999:abc', probeThrowing('UNKNOWN')), false)
+ })
+
+ it('returns false for owner strings we cannot parse (different host, malformed, missing)', () => {
+ assert.equal(isOwnerProcessDead(null, probeThrowing('ESRCH')), false)
+ assert.equal(isOwnerProcessDead('another process', probeThrowing('ESRCH')), false)
+ assert.equal(isOwnerProcessDead('host:remote:pid:1:abc', probeThrowing('ESRCH')), false)
+ })
+
+ it('refuses to declare its own pid dead even if probe lies', () => {
+ // Defence in depth: the current process is obviously alive; if a
+ // pathological probe returned ESRCH for its own pid, we must not
+ // act on that.
+ const owner = `pid:${process.pid}:self`
+ assert.equal(isOwnerProcessDead(owner, probeThrowing('ESRCH')), false)
+ })
+ })
@@ -0,0 +1,68 @@
+ /**
+ * Helpers for reasoning about who owns a runtime lease.
+ *
+ * Owner strings have the shape `pid:${pid}:${suffix}` (see
+ * `runtime/daemon-state/core.ts` where the suffix is generated). When the
+ * holding process disappears without releasing the lease (container crash,
+ * SIGKILL), a successor instance has no way to know the lease is stale
+ * other than waiting out the TTL. These helpers let the successor detect
+ * that the recorded pid is no longer alive and reclaim the lease.
+ *
+ * The reclaim path is intentionally conservative: any uncertainty (owner
+ * string format we do not recognise, probe outcome we cannot interpret,
+ * etc.) returns `false` so the caller falls back to "wait for TTL".
+ *
+ * Single-host only. If a lease was acquired on a different host (Kubernetes
+ * multi-pod), the recorded pid means nothing here. Recognising "different
+ * host" requires the owner string itself to encode a host id, which we do
+ * not currently do; for now, mixed-host deployments will continue to wait
+ * out the TTL, which is the correct behavior in the absence of a way to
+ * verify the remote process status.
+ */
+
+ const OWNER_PATTERN = /^pid:(\d+):/
+
+ export interface ProcessProbe {
+ /** Sends signal 0 to the pid, throws on error like `process.kill`. */
+ kill: (pid: number, signal: 0) => true | void
+ }
+
+ const realProbe: ProcessProbe = {
+ kill: (pid, signal) => {
+ process.kill(pid, signal)
+ return true
+ },
+ }
+
+ export function parseOwnerPid(owner: string | null | undefined): number | null {
+ if (typeof owner !== 'string') return null
+ const match = owner.match(OWNER_PATTERN)
+ if (!match) return null
+ const pid = Number(match[1])
+ return Number.isInteger(pid) && pid > 0 ? pid : null
+ }
+
+ /**
+ * Returns true when the recorded owner pid is provably dead on this host.
+ * Returns false for any other outcome:
+ * - owner string we cannot parse
+ * - probe succeeded (the process is alive)
+ * - probe failed with EPERM (process exists but is owned by someone
+ * else; treat as "alive, do not steal")
+ * - any other unexpected failure (do not guess)
+ *
+ * `probe` is injectable for tests.
+ */
+ export function isOwnerProcessDead(owner: string | null | undefined, probe: ProcessProbe = realProbe): boolean {
+ const pid = parseOwnerPid(owner)
+ if (pid === null) return false
+ if (pid === process.pid) return false
+ try {
+ probe.kill(pid, 0)
+ return false
+ } catch (err: unknown) {
+ const code = (err as NodeJS.ErrnoException | undefined)?.code
+ if (code === 'ESRCH') return true
+ return false
+ }
+ }
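
The probe relies on standard POSIX `kill(pid, 0)` semantics: signal 0 performs the existence and permission checks without delivering anything. A self-contained demonstration of the technique (illustrative only, independent of the swarmclaw code):

```ts
import { spawn } from 'node:child_process'

// Mirror of the module's semantics: alive unless the probe reports ESRCH.
// (EPERM still counts as alive: the process exists, we just can't signal it.)
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0)
    return true
  } catch (err) {
    return (err as NodeJS.ErrnoException).code !== 'ESRCH'
  }
}

// Spawn a short-lived child and probe it before and after it exits.
const child = spawn(process.execPath, ['-e', 'setTimeout(() => {}, 50)'])
console.log('while running:', isAlive(child.pid!)) // true
child.on('exit', () => {
  // libuv has reaped the child by the time 'exit' fires, so the pid is gone.
  setTimeout(() => console.log('after exit:', isAlive(child.pid!)), 100) // false
})
```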
@@ -0,0 +1,45 @@
+ import { describe, it, beforeEach } from 'node:test'
+ import assert from 'node:assert/strict'
+ import { gitAvailable, resetGitAvailableCache, safeGit } from '@/lib/server/git-metadata'
+
+ describe('safeGit', () => {
+ it('returns null when git is invoked with arguments that produce no useful output', () => {
+ // Asking `git config` for a key that does not exist is one of the few
+ // invocations guaranteed to fail on every host, inside or outside a
+ // repository, while still exercising the real binary path. If git itself
+ // is not installed, `safeGit` still returns null (the catch path).
+ const out = safeGit(['config', 'this.key.does.not.exist'])
+ assert.equal(out, null)
+ })
+
+ it('returns a trimmed string for a successful invocation', () => {
+ const version = safeGit(['--version'])
+ if (version === null) return // git is not installed in this env; skip
+ assert.match(version, /^git version /)
+ })
+ })
+
+ describe('gitAvailable', () => {
+ beforeEach(() => {
+ resetGitAvailableCache()
+ })
+
+ it('caches its result', () => {
+ const first = gitAvailable()
+ // After the first call, subsequent calls return the same value without
+ // re-probing. We cannot directly observe "did it re-probe?" without
+ // mocking `node:child_process`, so we just assert stability.
+ const second = gitAvailable()
+ const third = gitAvailable()
+ assert.equal(first, second)
+ assert.equal(second, third)
+ })
+
+ it('reflects whether the cwd is in a git checkout', () => {
+ // This test runs from inside the swarmclaw repo, so git should be
+ // available. When run from inside the published Docker image (where
+ // `.git/` is absent), the same call returns false.
+ const present = gitAvailable()
+ assert.equal(typeof present, 'boolean')
+ })
+ })
@@ -0,0 +1,42 @@
+ import { execFileSync } from 'node:child_process'
+
+ /**
+ * Pure helpers for reading git metadata at runtime, with graceful degradation
+ * when the working directory is not a git checkout (Docker production image,
+ * npm tarball install, etc.).
+ *
+ * Always uses `execFileSync` with an arg array (no shell) so user input cannot
+ * influence the command line.
+ */
+ export function safeGit(args: string[], cwd: string = process.cwd()): string | null {
+ try {
+ const out = execFileSync('git', args, {
+ cwd,
+ encoding: 'utf-8',
+ timeout: 15_000,
+ stdio: ['ignore', 'pipe', 'ignore'],
+ })
+ return typeof out === 'string' ? out.trim() : null
+ } catch {
+ return null
+ }
+ }
+
+ let cachedAvailable: boolean | null = null
+
+ /**
+ * Returns true when the current working directory looks like a git checkout
+ * (i.e. `git rev-parse --git-dir` succeeds). Cached for the lifetime of the
+ * process, since the answer does not change while a server is running.
+ *
+ * Exported `resetGitAvailableCache` is for unit tests only.
+ */
+ export function gitAvailable(): boolean {
+ if (cachedAvailable !== null) return cachedAvailable
+ cachedAvailable = safeGit(['rev-parse', '--git-dir']) !== null
+ return cachedAvailable
+ }
+
+ export function resetGitAvailableCache(): void {
+ cachedAvailable = null
+ }
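
The move from `execSync(cmd)` to `execFileSync('git', args)` is what backs the "no shell" claim above: each array element becomes a single argv entry for `git`, so metacharacters in a value are never interpreted by `/bin/sh`. A quick illustration (the tag value is hypothetical):

```ts
import { safeGit } from '@/lib/server/git-metadata'

const tag = 'v1.0.0; rm -rf ~' // attacker-influenced value

// Shell form (the old `run()` helper): /bin/sh parses the string, so `;`
// would start a second command:
//   execSync(`git rev-parse --short ${tag}^{commit}`)   // dangerous

// execFile form (safeGit): the whole value is one argument; git just fails
// with "unknown revision" and safeGit returns null:
safeGit(['rev-parse', '--short', `${tag}^{commit}`]) // => null, nothing executed
```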
@@ -4,6 +4,7 @@ import { loadConnectors, saveConnectors } from '@/lib/server/connectors/connecto
 import { decryptKey, loadCredentials } from '@/lib/server/credentials/credential-repository'
 import { loadQueue } from '@/lib/server/runtime/queue-repository'
 import { pruneExpiredLocks, readRuntimeLock, releaseRuntimeLock, renewRuntimeLock, tryAcquireRuntimeLock } from '@/lib/server/runtime/runtime-lock-repository'
+ import { isOwnerProcessDead } from '@/lib/server/daemon/lease-owner'
 import { loadSchedules } from '@/lib/server/schedules/schedule-repository'
 import { loadSessions } from '@/lib/server/sessions/session-repository'
 import { loadSettings } from '@/lib/server/settings/settings-repository'
@@ -126,6 +127,7 @@ interface DaemonState {
 shuttingDown: boolean
 providerPingCircuitBreaker: Map<string, { consecutiveFailures: number; skipUntil: number }>
 lockRenewIntervalId: ReturnType<typeof setInterval> | null
+ leaseRetryTimeoutId: ReturnType<typeof setTimeout> | null
 primaryLeaseHeld: boolean
 }
 
@@ -151,6 +153,7 @@ const ds: DaemonState = hmrSingleton<DaemonState>('__swarmclaw_daemon__', () =>
 shuttingDown: false,
 providerPingCircuitBreaker: new Map<string, { consecutiveFailures: number; skipUntil: number }>(),
 lockRenewIntervalId: null,
+ leaseRetryTimeoutId: null,
 primaryLeaseHeld: false,
 }))
 
@@ -180,6 +183,7 @@ if (ds.connectorHealthCheckRunning === undefined) ds.connectorHealthCheckRunning
 if (ds.shuttingDown === undefined) ds.shuttingDown = false
 if (!ds.providerPingCircuitBreaker) ds.providerPingCircuitBreaker = new Map<string, { consecutiveFailures: number; skipUntil: number }>()
 if (ds.lockRenewIntervalId === undefined) ds.lockRenewIntervalId = null
+ if (ds.leaseRetryTimeoutId === undefined) ds.leaseRetryTimeoutId = null
 if (ds.primaryLeaseHeld === undefined) ds.primaryLeaseHeld = false
 
 function stopDaemonLeaseRenewal(opts?: { release?: boolean }) {
@@ -229,12 +233,60 @@ function acquireDaemonLease(source: string): boolean {
 }
 if (!acquired) {
 let owner = 'another process'
+ let expiresAt: number | null = null
 try {
- owner = readRuntimeLock(DAEMON_RUNTIME_LOCK_NAME)?.owner || owner
+ const lease = readRuntimeLock(DAEMON_RUNTIME_LOCK_NAME)
+ if (lease) {
+ owner = lease.owner || owner
+ expiresAt = lease.expiresAt
+ }
 } catch {
 // Best-effort diagnostics only.
 }
+
+ // Stale-lease recovery: when a previous container / process crashed
+ // without releasing the lease, the new instance would otherwise wait
+ // up to the full TTL (DAEMON_RUNTIME_LOCK_TTL_MS) before being able
+ // to start the daemon. If the recorded owner pid is local to this
+ // host AND is no longer alive, reclaim the lease immediately and
+ // retry. Conservative: any uncertainty (different host, malformed
+ // owner, kill probe failed for an unexpected reason) skips the
+ // reclaim path. Reported as issue #41 (Bug 2).
+ if (isOwnerProcessDead(owner)) {
+ try {
+ releaseRuntimeLock(DAEMON_RUNTIME_LOCK_NAME, owner)
+ log.info(TAG, `[daemon] Reclaimed stale daemon-primary lease from dead owner ${owner}`)
+ let retried = false
+ try {
+ retried = tryAcquireRuntimeLock(DAEMON_RUNTIME_LOCK_NAME, daemonLockOwner, DAEMON_RUNTIME_LOCK_TTL_MS)
+ } catch (err: unknown) {
+ log.warn(TAG, `[daemon] Reclaim retry failed (source=${source}): ${errorMessage(err)}`)
+ }
+ if (retried) {
+ ds.primaryLeaseHeld = true
+ startDaemonLeaseRenewal()
+ return true
+ }
+ } catch (err: unknown) {
+ log.warn(TAG, `[daemon] Failed to release stale lease (source=${source}): ${errorMessage(err)}`)
+ }
+ }
+
 log.info(TAG, `[daemon] Skipping start (source=${source}); lease held by ${owner}`)
+
+ // Schedule one deferred retry slightly past the lease's expiry so
+ // the daemon comes up automatically once the prior owner's TTL has
+ // elapsed, instead of waiting for the next API call to nudge it.
+ if (expiresAt !== null) {
+ const delayMs = Math.max(1_000, expiresAt - Date.now() + 1_000)
+ if (ds.leaseRetryTimeoutId) clearTimeout(ds.leaseRetryTimeoutId)
+ ds.leaseRetryTimeoutId = setTimeout(() => {
+ ds.leaseRetryTimeoutId = null
+ if (ds.running || ds.primaryLeaseHeld) return
+ ensureDaemonStarted(`${source}:lease-retry`)
+ }, delayMs)
+ ds.leaseRetryTimeoutId.unref?.()
+ }
 return false
 }
 ds.primaryLeaseHeld = true
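
The retry delay is clamped on both sides: it fires one second past the recorded expiry, but never sooner than one second out, even if the lease already looks expired by the time it is read. A worked example of the arithmetic (values illustrative):

```ts
// Same formula as the diff above: Math.max(1_000, expiresAt - Date.now() + 1_000)
function leaseRetryDelayMs(expiresAt: number, now: number): number {
  return Math.max(1_000, expiresAt - now + 1_000)
}

const now = Date.now()
leaseRetryDelayMs(now + 90_000, now) // 91_000 (one second past a lease expiring in 90 s)
leaseRetryDelayMs(now - 30_000, now) // 1_000 (floor applies for an already-expired lease)
```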