@swarmclawai/swarmclaw 1.5.42 → 1.5.43
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +7 -11
- package/package.json +1 -1
- package/src/app/api/version/route.ts +71 -57
- package/src/app/api/version/update/route.ts +12 -0
- package/src/lib/server/daemon/controller.ts +24 -2
- package/src/lib/server/daemon/lease-owner.test.ts +72 -0
- package/src/lib/server/daemon/lease-owner.ts +68 -0
- package/src/lib/server/git-metadata.test.ts +45 -0
- package/src/lib/server/git-metadata.ts +42 -0
- package/src/lib/server/runtime/daemon-state/core.ts +53 -1
package/README.md
CHANGED

@@ -389,6 +389,13 @@ Operational docs: https://swarmclaw.ai/docs/observability
 
 ## Releases
 
+### v1.5.43 Highlights
+
+- **`/api/version` no longer 500s in Docker**: the route used to shell out to `git` at runtime, which fails in the production image because `.git/` is not copied. The route now returns 200 with `{ source: 'package', version }` from `package.json` when git metadata is unavailable, and `{ source: 'git', version, commit, ... }` when it is. `/api/version/update` short-circuits on Docker-style installs with a clear `no_git_metadata` reason instead of an opaque 500. ([#41](https://github.com/swarmclawai/swarmclaw/issues/41) Bug 1, reported by [@SteamedFish](https://github.com/SteamedFish).)
+- **Daemon reclaims stale `daemon-primary` leases on container restart**: when the previous container died holding the SQLite-backed lease, the new container previously waited up to the full 120 s TTL before the daemon could start. The successor now parses the recorded owner pid, probes it with `process.kill(pid, 0)`, and reclaims the lease immediately when the prior owner is provably dead on this host. When the owner is genuinely alive (or when the recorded host is ambiguous, such as multi-pod Kubernetes), behaviour is unchanged but a single deferred retry is scheduled just past the TTL so the daemon comes up automatically rather than waiting for the next API call. ([#41](https://github.com/swarmclawai/swarmclaw/issues/41) Bug 2.)
+- **Subprocess daemon fallback fails soft in Docker**: when `resolveDaemonRuntimeEntry()` cannot find `src/lib/server/daemon/daemon-runtime.ts` (the file is intentionally not in the standalone build), `ensureDaemonProcessRunning()` now logs a one-shot warning and returns `false` instead of throwing into the API handler. The in-process daemon path (with the Bug 2 fix) is the production path in Docker. ([#41](https://github.com/swarmclawai/swarmclaw/issues/41) Bug 3.)
+- **`CONTRIBUTING.md`**: dropped the broken reference to `AGENTS.md`. That file is `.gitignore`'d and not visible to external contributors. The single canonical project-conventions document is `CLAUDE.md`.
+
 ### v1.5.42 Highlights
 
 - **New `opencode-web` provider — connect to remote OpenCode HTTP servers** ([#40](https://github.com/swarmclawai/swarmclaw/issues/40), requested by [@SteamedFish](https://github.com/SteamedFish)): point an agent at any host running `opencode serve` or `opencode web` (default port `4096`). Supports HTTPS endpoints, HTTP Basic Auth (encode credentials as `username:password` in the API key field; bare passwords default the username to `opencode`), automatic OpenCode session reuse across chat turns, and per-session workspace isolation via `?directory=...`. Models are entered as `providerID/modelID` (e.g. `anthropic/claude-sonnet-4-5`). The existing `opencode-cli` provider is unchanged.
@@ -417,17 +424,6 @@ Operational docs: https://swarmclaw.ai/docs/observability
 - **Perf ring buffer raised to 2 000 entries**: queue/task repository events fire ~20 Hz during task processing and were evicting chat-execution/prompt perf entries out of the 200-entry buffer before they could be read. The larger buffer lets the perf viewer actually show a full turn.
 - **Tests**: added regression tests for pre-1.5.38 stale-checkout orphan recovery and for the scoped-tool-access algorithm.
 
-### v1.5.38 Highlights
-
-- **Task queue: reclaim stale checkouts**: `checkoutTask()` now reclaims a lingering `checkoutRunId` on a `queued` task instead of refusing it forever. An ungraceful server exit mid-turn (crash, SIGKILL, HMR reload) previously left tasks uncheckoutable, producing a dispatch → orphan-recovery → failed-checkout spin that logged "Recovering orphaned queued task" tens of thousands of times per session. `scheduleRetryOrDeadLetter()` also clears the prior checkout when scheduling a retry or dead-lettering.
-- **Chat: suppress duplicate parallel tool calls**: some OSS models on Ollama (notably `devstral`) emit the same tool call twice in a single turn. The LangGraph tool-event tracker now dedupes by `name + input` signature, swallowing the duplicate start and its result while allowing a genuinely later identical call once the first completes. Hardened against replayed-start events (HMR, graph retries) that previously could leak a `run_id` into both the accepted and suppressed sets and leave `pendingCount` stuck above zero.
-- **Chat: disable `parallel_tool_calls` for Ollama**: local Ollama sessions now pass `parallel_tool_calls: false` to prevent the upstream duplicate-call behavior at the source for models that honor it.
-- **Chat: no-progress guard for tool summary retries**: if the model produces essentially no new text on a `tool_summary` continuation, the loop stops retrying instead of streaming the same short sentence two or three times. The guard is snapshot-aware: a transient-error rollback no longer leaves a stale progress counter that silently skips a legitimate retry (`lastToolSummaryTextLen` is now round-tripped through `ChatTurnState.snapshot`/`restore`).
-- **Task UI: distinguish retry-pending from failure**: a retrying task now renders in amber with a "Retry Pending" label in the task card and sheet, instead of the same red treatment used for dead-lettered failures.
-- **Autonomy: dedupe reflection memories across kinds**: the supervisor reflection writer now drops notes whose normalized text has already been stored this run, eliminating near-identical memory rows classified under multiple kinds.
-- **OpenClaw gateway: fast-fail on dangling credentials**: when an agent's OpenClaw route references a deleted or missing credential, the gateway now refuses to dial the WebSocket up front instead of attempting an unauthenticated handshake and waiting the full 120 s for the agent-side timeout. The credential-missing log line is promoted from warn to error so it surfaces in routine monitoring.
-- **Prompt size profiler**: setting `SWARMCLAW_PROFILE_PROMPT=1` now logs a per-section size breakdown of the assembled system prompt (block index, first-line label, char count) on every turn, making it practical to diagnose why a specific agent is eating context budget. Off by default so production turns stay quiet.
-
 Older releases: https://swarmclaw.ai/docs/release-notes
 
 - GitHub releases: https://github.com/swarmclawai/swarmclaw/releases
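The stale-lease probe in the second v1.5.43 bullet rests on standard signal-0 semantics. A minimal sketch of that classification follows (the function name here is hypothetical; the package's real implementation, with pid parsing and a self-pid guard, appears in the lease-owner.ts diff further down):

import process from 'node:process'

function ownerLooksDead(pid: number): boolean {
  try {
    process.kill(pid, 0) // signal 0 delivers nothing; it only tests existence
    return false         // no throw: the process exists, treat the lease as live
  } catch (err) {
    const code = (err as NodeJS.ErrnoException).code
    if (code === 'ESRCH') return true // no such process: provably dead
    // EPERM: exists but owned by another user; anything else: do not guess.
    return false
  }
}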
package/package.json
CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@swarmclawai/swarmclaw",
-  "version": "1.5.42",
+  "version": "1.5.43",
   "description": "Build and run autonomous AI agents with OpenClaw, Hermes, multiple model providers, orchestration, delegation, memory, skills, schedules, and chat connectors.",
   "main": "electron-dist/main.js",
   "license": "MIT",
package/src/app/api/version/route.ts
CHANGED

@@ -1,7 +1,8 @@
 import { NextResponse } from 'next/server'
-import { execSync } from 'child_process'
-
+import { gitAvailable, safeGit } from '@/lib/server/git-metadata'
+import packageJson from '../../../../package.json'
 
+export const dynamic = 'force-dynamic'
 
 let cachedRemote: {
   sha: string
@@ -10,74 +11,87 @@ let cachedRemote: {
   remoteTag: string | null
   checkedAt: number
 } | null = null
-const CACHE_TTL = 60_000
+const CACHE_TTL = 60_000
 const RELEASE_TAG_RE = /^v\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.-]+)?$/
 
-function run(cmd: string): string {
-  return execSync(cmd, { encoding: 'utf-8', cwd: process.cwd(), timeout: 15_000 }).trim()
-}
-
 function getLatestStableTag(): string | null {
-  const
-
-
-    .filter(Boolean)
-  return tags.find((tag) => RELEASE_TAG_RE.test(tag)) || null
+  const out = safeGit(['tag', '--list', 'v*', '--sort=-v:refname'])
+  if (!out) return null
+  return out.split('\n').map((l) => l.trim()).filter(Boolean).find((t) => RELEASE_TAG_RE.test(t)) || null
 }
 
 function getHeadStableTag(): string | null {
-  const
-
-
-    .filter(Boolean)
-  return tags.find((tag) => RELEASE_TAG_RE.test(tag)) || null
+  const out = safeGit(['tag', '--points-at', 'HEAD', '--list', 'v*', '--sort=-v:refname'])
+  if (!out) return null
+  return out.split('\n').map((l) => l.trim()).filter(Boolean).find((t) => RELEASE_TAG_RE.test(t)) || null
 }
 
 export async function GET(_req: Request) {
-
-
-
+  // Always return 200. When git metadata is unavailable (Docker production
+  // image, npm tarball install) we fall back to the static package.json
+  // version. Issue #41 reported a 500 response when `.git/` was not present
+  // in the production container; this route now degrades gracefully.
+  const packageVersion = packageJson.version
 
-
-
-
-
+  if (!gitAvailable()) {
+    return NextResponse.json({
+      source: 'package',
+      version: packageVersion,
+      localSha: null,
+      localTag: `v${packageVersion}`,
+      remoteSha: null,
+      remoteTag: null,
+      channel: 'stable',
+      updateAvailable: false,
+      behindBy: 0,
+    })
+  }
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+  const localSha = safeGit(['rev-parse', '--short', 'HEAD'])
+  const localTag = getHeadStableTag()
+
+  let remoteSha = cachedRemote?.sha ?? localSha
+  let behindBy = cachedRemote?.behindBy ?? 0
+  let channel: 'stable' | 'main' = cachedRemote?.channel ?? 'main'
+  let remoteTag = cachedRemote?.remoteTag ?? null
+
+  if (!cachedRemote || Date.now() - cachedRemote.checkedAt > CACHE_TTL) {
+    const fetched = safeGit(['fetch', '--tags', 'origin', '--quiet'])
+    if (fetched !== null) {
+      const latestTag = getLatestStableTag()
+      if (latestTag) {
+        channel = 'stable'
+        remoteTag = latestTag
+        const sha = safeGit(['rev-parse', '--short', `${latestTag}^{commit}`])
+        if (sha) remoteSha = sha
+        const count = safeGit(['rev-list', `HEAD..${latestTag}^{commit}`, '--count'])
+        behindBy = count ? (parseInt(count, 10) || 0) : 0
+      } else {
+        channel = 'main'
+        remoteTag = null
+        safeGit(['fetch', 'origin', 'main', '--quiet'])
+        const count = safeGit(['rev-list', 'HEAD..origin/main', '--count'])
+        behindBy = count ? (parseInt(count, 10) || 0) : 0
+        if (behindBy > 0) {
+          const sha = safeGit(['rev-parse', '--short', 'origin/main'])
+          if (sha) remoteSha = sha
+        } else if (localSha) {
+          remoteSha = localSha
        }
-      cachedRemote = { sha: remoteSha, behindBy, channel, remoteTag, checkedAt: Date.now() }
-    } catch {
-      // fetch failed (no network, no remote, etc.) — use stale cache or defaults
      }
+      cachedRemote = { sha: remoteSha || '', behindBy, channel, remoteTag, checkedAt: Date.now() }
    }
-
-  return NextResponse.json({
-    localSha,
-    localTag,
-    remoteSha,
-    remoteTag,
-    channel,
-    updateAvailable: behindBy > 0,
-    behindBy,
-  })
-  } catch {
-    return NextResponse.json({ error: 'Not a git repository' }, { status: 500 })
  }
+
+  return NextResponse.json({
+    source: 'git',
+    version: packageVersion,
+    localSha,
+    localTag,
+    remoteSha,
+    remoteTag,
+    channel,
+    updateAvailable: behindBy > 0,
+    behindBy,
+  })
 }
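For quick reference, the two response shapes the rewritten route produces, assembled from the diff above; the shas, the remote tag, and behindBy values here are illustrative:

const res = await fetch('/api/version') // always 200 now, never a 500
const info = await res.json()

// Docker / npm tarball install (no .git/ present):
//   { source: 'package', version: '1.5.43', localSha: null, localTag: 'v1.5.43',
//     remoteSha: null, remoteTag: null, channel: 'stable',
//     updateAvailable: false, behindBy: 0 }
//
// Source checkout on the stable channel, three commits behind the latest tag:
//   { source: 'git', version: '1.5.43', localSha: 'abc1234', localTag: 'v1.5.43',
//     remoteSha: 'def5678', remoteTag: 'v1.5.44', channel: 'stable',
//     updateAvailable: true, behindBy: 3 }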
package/src/app/api/version/update/route.ts
CHANGED

@@ -1,6 +1,7 @@
 import { NextResponse } from 'next/server'
 import { execSync } from 'child_process'
 import { getDb } from '@/lib/server/storage'
+import { gitAvailable } from '@/lib/server/git-metadata'
 
 const RELEASE_TAG_RE = /^v\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.-]+)?$/
 
@@ -37,6 +38,17 @@ function ensureCleanWorkingTree() {
 }
 
 export async function POST() {
+  // The git-pull update path only makes sense for source/git checkouts.
+  // Docker and packaged-app installs have their own update channels and
+  // calling this route on those installs would otherwise return a confusing
+  // 500. Surface the situation as a 200 with a clear reason instead.
+  if (!gitAvailable()) {
+    return NextResponse.json({
+      success: false,
+      reason: 'no_git_metadata',
+      error: 'Self-update is only supported for source / git checkouts. Use the npm or Docker upgrade path for this install.',
+    })
+  }
   try {
    const beforeSha = run('git rev-parse --short HEAD')
    const beforeRef = run('git rev-parse HEAD')
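A sketch of what a client sees on an install without git metadata, per the guard added above (the long error string is abbreviated here):

const res = await fetch('/api/version/update', { method: 'POST' }) // 200, not 500
const body = await res.json()
// => { success: false, reason: 'no_git_metadata',
//      error: 'Self-update is only supported for source / git checkouts. ...' }
if (!body.success && body.reason === 'no_git_metadata') {
  // Offer the npm/Docker upgrade path in the UI instead of surfacing an error.
}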
package/src/lib/server/daemon/controller.ts
CHANGED

@@ -31,7 +31,14 @@ import {
   releaseRuntimeLock,
   tryAcquireRuntimeLock,
 } from '@/lib/server/runtime/runtime-lock-repository'
-import { errorMessage } from '@/lib/shared-utils'
+import { errorMessage, hmrSingleton } from '@/lib/shared-utils'
+
+// HMR-safe single-shot guard so the "subprocess fallback unavailable"
+// warning logs once per process lifetime, not per API call.
+const subprocessFallbackUnavailableLogged = hmrSingleton<{ value: boolean }>(
+  '__swarmclaw_daemon_subprocess_fallback_warned__',
+  () => ({ value: false }),
+)
 
 const TAG = 'daemon-controller'
 const LAUNCH_LOCK_NAME = 'daemon-launcher'
@@ -367,7 +374,22 @@ export async function ensureDaemonProcessRunning(
   const secondCheck = await getLiveDaemonSnapshot()
   if (secondCheck?.status.running) return false
 
-  const { root, entry } = resolveDaemonRuntimeEntry()
+  let resolved: { root: string; entry: string }
+  try {
+    resolved = resolveDaemonRuntimeEntry()
+  } catch (err: unknown) {
+    // The standalone Docker image does not ship `src/` (Next.js standalone
+    // output excludes raw source files), so the subprocess fallback can
+    // never spawn there. Fail soft: log once and let callers fall back to
+    // whatever in-process daemon path is available rather than surfacing
+    // a 500 to API consumers. Reported as issue #41 (Bug 3).
+    if (!subprocessFallbackUnavailableLogged.value) {
+      subprocessFallbackUnavailableLogged.value = true
+      log.warn(TAG, `[daemon] Subprocess fallback unavailable in this build (${errorMessage(err)}). The in-process daemon will continue to be the primary path.`)
+    }
+    return false
+  }
+  const { root, entry } = resolved
   const adminPort = await reservePort()
   const adminToken = crypto.randomBytes(24).toString('hex')
   fs.mkdirSync(path.dirname(DAEMON_LOG_PATH), { recursive: true })
package/src/lib/server/daemon/lease-owner.test.ts
ADDED

@@ -0,0 +1,72 @@
+import { describe, it } from 'node:test'
+import assert from 'node:assert/strict'
+import { isOwnerProcessDead, parseOwnerPid } from '@/lib/server/daemon/lease-owner'
+
+function probeThrowing(code: string) {
+  return {
+    kill: () => {
+      const err = new Error('mock probe failure') as NodeJS.ErrnoException
+      err.code = code
+      throw err
+    },
+  }
+}
+
+const probeAlive = { kill: () => true as const }
+
+describe('parseOwnerPid', () => {
+  it('returns the pid for a well-formed owner string', () => {
+    assert.equal(parseOwnerPid('pid:12345:abc'), 12345)
+    assert.equal(parseOwnerPid('pid:1:xyz'), 1)
+  })
+
+  it('returns null for unrecognised owner strings', () => {
+    assert.equal(parseOwnerPid(null), null)
+    assert.equal(parseOwnerPid(undefined), null)
+    assert.equal(parseOwnerPid(''), null)
+    assert.equal(parseOwnerPid('another process'), null)
+    assert.equal(parseOwnerPid('pid::abc'), null)
+    assert.equal(parseOwnerPid('pid:abc:xyz'), null)
+    assert.equal(parseOwnerPid('host:hostname:pid:1:abc'), null)
+  })
+
+  it('rejects zero and negative pids', () => {
+    assert.equal(parseOwnerPid('pid:0:abc'), null)
+    assert.equal(parseOwnerPid('pid:-1:abc'), null)
+  })
+})
+
+describe('isOwnerProcessDead — bug #41 stale-lease recovery', () => {
+  it('returns true when the probe reports ESRCH (no such process)', () => {
+    assert.equal(isOwnerProcessDead('pid:99999:abc', probeThrowing('ESRCH')), true)
+  })
+
+  it('returns false when the probe reports EPERM (process owned by someone else)', () => {
+    // EPERM means the process exists but signal delivery is blocked. Assume alive
+    // and do not steal the lease — bias towards waiting for TTL.
+    assert.equal(isOwnerProcessDead('pid:99999:abc', probeThrowing('EPERM')), false)
+  })
+
+  it('returns false when the probe succeeds (process is alive)', () => {
+    assert.equal(isOwnerProcessDead('pid:99999:abc', probeAlive), false)
+  })
+
+  it('returns false for any unknown probe error code (do not guess)', () => {
+    assert.equal(isOwnerProcessDead('pid:99999:abc', probeThrowing('EAGAIN')), false)
+    assert.equal(isOwnerProcessDead('pid:99999:abc', probeThrowing('UNKNOWN')), false)
+  })
+
+  it('returns false for owner strings we cannot parse (different host, malformed, missing)', () => {
+    assert.equal(isOwnerProcessDead(null, probeThrowing('ESRCH')), false)
+    assert.equal(isOwnerProcessDead('another process', probeThrowing('ESRCH')), false)
+    assert.equal(isOwnerProcessDead('host:remote:pid:1:abc', probeThrowing('ESRCH')), false)
+  })
+
+  it('refuses to declare its own pid dead even if probe lies', () => {
+    // Defence in depth: the current process is obviously alive; if a
+    // pathological probe returned ESRCH for its own pid, we must not
+    // act on that.
+    const owner = `pid:${process.pid}:self`
+    assert.equal(isOwnerProcessDead(owner, probeThrowing('ESRCH')), false)
+  })
+})
package/src/lib/server/daemon/lease-owner.ts
ADDED

@@ -0,0 +1,68 @@
+/**
+ * Helpers for reasoning about who owns a runtime lease.
+ *
+ * Owner strings have the shape `pid:${pid}:${suffix}` (see
+ * `runtime/daemon-state/core.ts` where the suffix is generated). When the
+ * holding process disappears without releasing the lease (container crash,
+ * SIGKILL), a successor instance has no way to know the lease is stale
+ * other than waiting out the TTL. These helpers let the successor detect
+ * that the recorded pid is no longer alive and reclaim the lease.
+ *
+ * The reclaim path is intentionally conservative: any uncertainty (owner
+ * string format we do not recognise, probe outcome we cannot interpret,
+ * etc.) returns `false` so the caller falls back to "wait for TTL".
+ *
+ * Single-host only. If a lease was acquired on a different host (Kubernetes
+ * multi-pod), the recorded pid means nothing here. Recognising "different
+ * host" requires the owner string itself to encode a host id, which we do
+ * not currently do; for now, mixed-host deployments will continue to wait
+ * out the TTL, which is the correct behavior in the absence of a way to
+ * verify the remote process status.
+ */
+
+const OWNER_PATTERN = /^pid:(\d+):/
+
+export interface ProcessProbe {
+  /** Sends signal 0 to the pid, throws on error like `process.kill`. */
+  kill: (pid: number, signal: 0) => true | void
+}
+
+const realProbe: ProcessProbe = {
+  kill: (pid, signal) => {
+    process.kill(pid, signal)
+    return true
+  },
+}
+
+export function parseOwnerPid(owner: string | null | undefined): number | null {
+  if (typeof owner !== 'string') return null
+  const match = owner.match(OWNER_PATTERN)
+  if (!match) return null
+  const pid = Number(match[1])
+  return Number.isInteger(pid) && pid > 0 ? pid : null
+}
+
+/**
+ * Returns true when the recorded owner pid is provably dead on this host.
+ * Returns false for any other outcome:
+ * - owner string we cannot parse
+ * - probe succeeded (the process is alive)
+ * - probe failed with EPERM (process exists but is owned by someone
+ *   else; treat as "alive, do not steal")
+ * - any other unexpected failure (do not guess)
+ *
+ * `probe` is injectable for tests.
+ */
+export function isOwnerProcessDead(owner: string | null | undefined, probe: ProcessProbe = realProbe): boolean {
+  const pid = parseOwnerPid(owner)
+  if (pid === null) return false
+  if (pid === process.pid) return false
+  try {
+    probe.kill(pid, 0)
+    return false
+  } catch (err: unknown) {
+    const code = (err as NodeJS.ErrnoException | undefined)?.code
+    if (code === 'ESRCH') return true
+    return false
+  }
+}
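A usage sketch for the new helpers; the owner string and pid below are hypothetical:

import { isOwnerProcessDead, parseOwnerPid } from '@/lib/server/daemon/lease-owner'

// Hypothetical owner string recorded by a container that was later SIGKILLed.
const staleOwner = 'pid:4242:a1b2c3'

parseOwnerPid(staleOwner)        // => 4242
parseOwnerPid('another process') // => null: unparseable owners are never reclaimed

// The default probe is process.kill(pid, 0); true is returned only on ESRCH.
if (isOwnerProcessDead(staleOwner)) {
  // Provably dead on this host: release and re-acquire the lease immediately
  // instead of waiting out the TTL.
}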
package/src/lib/server/git-metadata.test.ts
ADDED

@@ -0,0 +1,45 @@
+import { describe, it, beforeEach } from 'node:test'
+import assert from 'node:assert/strict'
+import { gitAvailable, resetGitAvailableCache, safeGit } from '@/lib/server/git-metadata'
+
+describe('safeGit', () => {
+  it('returns null when git is invoked with arguments that produce no useful output', () => {
+    // `git` invoked outside of a repository and asked for a missing config key
+    // is one of the few invocations guaranteed to fail on every host, while
+    // still respecting the real binary path. If git itself is not installed,
+    // `safeGit` still returns null (the catch path).
+    const out = safeGit(['config', 'this.key.does.not.exist'])
+    assert.equal(out, null)
+  })
+
+  it('returns a trimmed string for a successful invocation', () => {
+    const version = safeGit(['--version'])
+    if (version === null) return // git is not installed in this env; skip
+    assert.match(version, /^git version /)
+  })
+})
+
+describe('gitAvailable', () => {
+  beforeEach(() => {
+    resetGitAvailableCache()
+  })
+
+  it('caches its result', () => {
+    const first = gitAvailable()
+    // After the first call, subsequent calls return the same value without
+    // re-probing. We cannot directly observe "did it re-probe?" without
+    // mocking `node:child_process`, so we just assert stability.
+    const second = gitAvailable()
+    const third = gitAvailable()
+    assert.equal(first, second)
+    assert.equal(second, third)
+  })
+
+  it('reflects whether the cwd is in a git checkout', () => {
+    // This test runs from inside the swarmclaw repo, so git should be
+    // available. When run from inside the published Docker image (where
+    // `.git/` is absent), the same call returns false.
+    const present = gitAvailable()
+    assert.equal(typeof present, 'boolean')
+  })
+})
package/src/lib/server/git-metadata.ts
ADDED

@@ -0,0 +1,42 @@
+import { execFileSync } from 'node:child_process'
+
+/**
+ * Pure helpers for reading git metadata at runtime, with graceful degradation
+ * when the working directory is not a git checkout (Docker production image,
+ * npm tarball install, etc.).
+ *
+ * Always uses `execFileSync` with an arg array (no shell) so user input cannot
+ * influence the command line.
+ */
+export function safeGit(args: string[], cwd: string = process.cwd()): string | null {
+  try {
+    const out = execFileSync('git', args, {
+      cwd,
+      encoding: 'utf-8',
+      timeout: 15_000,
+      stdio: ['ignore', 'pipe', 'ignore'],
+    })
+    return typeof out === 'string' ? out.trim() : null
+  } catch {
+    return null
+  }
+}
+
+let cachedAvailable: boolean | null = null
+
+/**
+ * Returns true when the current working directory looks like a git checkout
+ * (i.e. `git rev-parse --git-dir` succeeds). Cached for the lifetime of the
+ * process, since the answer does not change while a server is running.
+ *
+ * Exported `resetGitAvailableCache` is for unit tests only.
+ */
+export function gitAvailable(): boolean {
+  if (cachedAvailable !== null) return cachedAvailable
+  cachedAvailable = safeGit(['rev-parse', '--git-dir']) !== null
+  return cachedAvailable
+}
+
+export function resetGitAvailableCache(): void {
+  cachedAvailable = null
+}
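A usage sketch showing how callers such as the version route above consume these helpers:

import { gitAvailable, safeGit } from '@/lib/server/git-metadata'

// safeGit never throws: it returns trimmed stdout, or null when git is missing,
// the command fails, or it times out (15 s cap).
const sha = safeGit(['rev-parse', '--short', 'HEAD'])

if (gitAvailable()) {
  // Source/git checkout: sha is a short commit hash such as 'abc1234'.
} else {
  // Docker image or npm tarball install: sha is null; callers fall back to
  // package.json metadata instead of returning a 500.
}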
package/src/lib/server/runtime/daemon-state/core.ts
CHANGED

@@ -4,6 +4,7 @@ import { loadConnectors, saveConnectors } from '@/lib/server/connectors/connecto
 import { decryptKey, loadCredentials } from '@/lib/server/credentials/credential-repository'
 import { loadQueue } from '@/lib/server/runtime/queue-repository'
 import { pruneExpiredLocks, readRuntimeLock, releaseRuntimeLock, renewRuntimeLock, tryAcquireRuntimeLock } from '@/lib/server/runtime/runtime-lock-repository'
+import { isOwnerProcessDead } from '@/lib/server/daemon/lease-owner'
 import { loadSchedules } from '@/lib/server/schedules/schedule-repository'
 import { loadSessions } from '@/lib/server/sessions/session-repository'
 import { loadSettings } from '@/lib/server/settings/settings-repository'
@@ -126,6 +127,7 @@ interface DaemonState {
   shuttingDown: boolean
   providerPingCircuitBreaker: Map<string, { consecutiveFailures: number; skipUntil: number }>
   lockRenewIntervalId: ReturnType<typeof setInterval> | null
+  leaseRetryTimeoutId: ReturnType<typeof setTimeout> | null
   primaryLeaseHeld: boolean
 }
 
@@ -151,6 +153,7 @@ const ds: DaemonState = hmrSingleton<DaemonState>('__swarmclaw_daemon__', () =>
   shuttingDown: false,
   providerPingCircuitBreaker: new Map<string, { consecutiveFailures: number; skipUntil: number }>(),
   lockRenewIntervalId: null,
+  leaseRetryTimeoutId: null,
   primaryLeaseHeld: false,
 }))
 
@@ -180,6 +183,7 @@ if (ds.connectorHealthCheckRunning === undefined) ds.connectorHealthCheckRunning
 if (ds.shuttingDown === undefined) ds.shuttingDown = false
 if (!ds.providerPingCircuitBreaker) ds.providerPingCircuitBreaker = new Map<string, { consecutiveFailures: number; skipUntil: number }>()
 if (ds.lockRenewIntervalId === undefined) ds.lockRenewIntervalId = null
+if (ds.leaseRetryTimeoutId === undefined) ds.leaseRetryTimeoutId = null
 if (ds.primaryLeaseHeld === undefined) ds.primaryLeaseHeld = false
 
 function stopDaemonLeaseRenewal(opts?: { release?: boolean }) {
@@ -229,12 +233,60 @@ function acquireDaemonLease(source: string): boolean {
   }
   if (!acquired) {
     let owner = 'another process'
+    let expiresAt: number | null = null
     try {
-
+      const lease = readRuntimeLock(DAEMON_RUNTIME_LOCK_NAME)
+      if (lease) {
+        owner = lease.owner || owner
+        expiresAt = lease.expiresAt
+      }
     } catch {
       // Best-effort diagnostics only.
     }
+
+    // Stale-lease recovery: when a previous container / process crashed
+    // without releasing the lease, the new instance would otherwise wait
+    // up to the full TTL (DAEMON_RUNTIME_LOCK_TTL_MS) before being able
+    // to start the daemon. If the recorded owner pid is local to this
+    // host AND is no longer alive, reclaim the lease immediately and
+    // retry. Conservative: any uncertainty (different host, malformed
+    // owner, kill probe failed for an unexpected reason) skips the
+    // reclaim path. Reported as issue #41 (Bug 2).
+    if (isOwnerProcessDead(owner)) {
+      try {
+        releaseRuntimeLock(DAEMON_RUNTIME_LOCK_NAME, owner)
+        log.info(TAG, `[daemon] Reclaimed stale daemon-primary lease from dead owner ${owner}`)
+        let retried = false
+        try {
+          retried = tryAcquireRuntimeLock(DAEMON_RUNTIME_LOCK_NAME, daemonLockOwner, DAEMON_RUNTIME_LOCK_TTL_MS)
+        } catch (err: unknown) {
+          log.warn(TAG, `[daemon] Reclaim retry failed (source=${source}): ${errorMessage(err)}`)
+        }
+        if (retried) {
+          ds.primaryLeaseHeld = true
+          startDaemonLeaseRenewal()
+          return true
+        }
+      } catch (err: unknown) {
+        log.warn(TAG, `[daemon] Failed to release stale lease (source=${source}): ${errorMessage(err)}`)
+      }
+    }
+
     log.info(TAG, `[daemon] Skipping start (source=${source}); lease held by ${owner}`)
+
+    // Schedule one deferred retry slightly past the lease's expiry so
+    // the daemon comes up automatically once the prior owner's TTL has
+    // elapsed, instead of waiting for the next API call to nudge it.
+    if (expiresAt !== null) {
+      const delayMs = Math.max(1_000, expiresAt - Date.now() + 1_000)
+      if (ds.leaseRetryTimeoutId) clearTimeout(ds.leaseRetryTimeoutId)
+      ds.leaseRetryTimeoutId = setTimeout(() => {
+        ds.leaseRetryTimeoutId = null
+        if (ds.running || ds.primaryLeaseHeld) return
+        ensureDaemonStarted(`${source}:lease-retry`)
+      }, delayMs)
+      ds.leaseRetryTimeoutId.unref?.()
+    }
     return false
   }
   ds.primaryLeaseHeld = true