@htekdev/actions-debugger 1.0.116 → 1.0.118

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (22)

package/errors/runner-environment/broker-server-socket-exception-nat-timeout-linux.yml ADDED

@@ -0,0 +1,114 @@
+id: runner-environment-209
+title: 'Self-hosted runner BrokerServer TaskCanceledException / SocketException — runner stuck in Busy, jobs delayed'
+category: runner-environment
+severity: error
+tags:
+  - self-hosted
+  - broker
+  - TaskCanceledException
+  - SocketException
+  - NAT
+  - kubernetes
+  - ARC
+  - busy-state
+patterns:
+  - regex: 'BrokerServer.*TaskCanceledException'
+    flags: 'i'
+  - regex: 'SocketException \(125\): Operation canceled'
+    flags: 'i'
+  - regex: 'GET request to https://broker\.actions\.githubusercontent\.com.*has been cancelled'
+    flags: 'i'
+error_messages:
+  - '[ERR BrokerServer] System.Threading.Tasks.TaskCanceledException: The operation was canceled.'
+  - '[ERR BrokerServer] System.IO.IOException: Unable to read data from the transport connection: Operation canceled.'
+  - '[ERR BrokerServer] System.Net.Sockets.SocketException (125): Operation canceled'
+  - '[WARN GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=...&status=Busy&runnerVersion=... has been cancelled.'
+  - '[WARN BrokerServer] Back off 6.934 seconds before next retry. 4 attempt left.'
+root_cause: |
+  The GitHub Actions runner (on Linux, macOS, Kubernetes, and ARC) maintains a
+  persistent long-poll HTTPS connection to `broker.actions.githubusercontent.com`
+  to receive job dispatch messages. This connection is kept open by a blocking
+  GET request that the server holds for up to 90 seconds before responding.
+  When the runner operates behind a **NAT gateway, stateful firewall, or cloud
+  provider network** (common in Kubernetes/ARC deployments on EKS, GKE, AKS, or
+  on-premise k8s), the network layer's connection tracking table can expire the
+  idle TLS socket before the server responds. Most cloud NAT tables have a
+  default idle timeout of 30–60 seconds — shorter than the runner's 90-second
+  poll interval.
+  When the NAT table entry expires:
+  1. The next packet the runner sends receives an RST from the network (or is
+     silently dropped), causing the underlying `SslStream.ReadAsyncInternal`
+     to throw `SocketException (125): Operation canceled`
+  2. The exception propagates as `TaskCanceledException` through the
+     `BrokerHttpClient.GetRunnerMessageAsync` call chain
+  3. The runner logs `ERR BrokerServer` and backs off exponentially (6–60 s)
+  4. During the back-off, the runner remains in **Busy** status from the broker's
+     perspective, preventing new jobs from being dispatched
+  The back-off recovers automatically but delays job pickup by minutes. Under
+  high-frequency job dispatch (CI matrix builds), this causes jobs to queue
+  while the runner is technically idle.
+  **Distinct from re-199** (Windows V2 broker listener stops polling after the
+  first job — Windows-specific software bug): this issue affects Linux/macOS/K8s
+  and recovers automatically; re-199 causes a permanent stall requiring service
+  restart.
+fix: |
+  **1. Enable TCP keepalive on the runner host (most effective):**
+  Configure the OS to send TCP keepalive probes before the NAT table expires:
+  ```bash
+  # Linux — reduce keepalive idle time from default 7200s to 30s
+  sudo sysctl -w net.ipv4.tcp_keepalive_time=30
+  sudo sysctl -w net.ipv4.tcp_keepalive_intvl=10
+  sudo sysctl -w net.ipv4.tcp_keepalive_probes=3
+  # Make persistent:
+  printf '%s\n' net.ipv4.tcp_keepalive_time=30 net.ipv4.tcp_keepalive_intvl=10 net.ipv4.tcp_keepalive_probes=3 | sudo tee -a /etc/sysctl.conf
+  ```
+  **2. Increase NAT idle timeout (infrastructure change):**
+  - **AWS EKS:** Set `--conntrack-tcp-timeout-established=300` on kube-proxy,
+    or add a NAT gateway connection tracking timeout of 350 s
+  - **GKE:** Use Cloud NAT with `--tcp-established-idle-timeout=350s` (on `gcloud compute routers nats update`)
+  - **Azure AKS:** Set `--load-balancer-idle-timeout-in-minutes=10` (default is 4 min)
+  - **On-premise k8s:** Increase `conntrack` timeout or set up a keepalive proxy
+  **3. Use a runner proxy with keepalive support:**
+  Route runner outbound traffic through an application-level proxy that
+  maintains the connection, preventing the NAT table from expiring the socket.
+  **4. Upgrade runner version:**
+  Runner v2.326.0+ includes improved broker reconnect logic that reduces the
+  window where the runner stays in Busy state after a socket reset.
+fix_code:
+  - language: yaml
+    label: 'Runner DaemonSet init container — set TCP keepalive before runner starts'
+    code: |
+      # In your ARC runner DaemonSet / Pod spec
+      initContainers:
+        - name: set-sysctl
+          image: busybox
+          securityContext:
+            privileged: true
+          command:
+            - sh
+            - -c
+            - |
+              sysctl -w net.ipv4.tcp_keepalive_time=30
+              sysctl -w net.ipv4.tcp_keepalive_intvl=10
+              sysctl -w net.ipv4.tcp_keepalive_probes=3
+prevention:
+  - 'Set tcp_keepalive_time to 30 s on all Linux self-hosted runner hosts, especially those in Kubernetes'
+  - 'For ARC scale sets in EKS/GKE/AKS, explicitly configure NAT idle timeout to at least 350 seconds'
+  - 'Monitor runner diagnostic logs (Runner_<date>-utc.log) for repeated BrokerServer ERR lines — they indicate this issue'
+  - 'Upgrade runner to v2.326.0+ which has improved back-off and reconnect behavior'
+docs:
+  - url: 'https://github.com/actions/runner/issues/3904'
+    label: 'actions/runner#3904 — Runner fails to connect to broker, TaskCanceledException / SocketException (17 reactions)'
+  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#communication-between-self-hosted-runners-and-github'
+    label: 'GitHub Docs — Self-hosted runner communication requirements'
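
The prevention list above suggests watching `Runner_<date>-utc.log` for repeated `BrokerServer` ERR lines. A minimal sketch of that check follows. The function name `count_broker_errors` is a hypothetical helper, not part of the package, and the two patterns are copied from this entry's `patterns` block.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count lines in ONE runner diag log that match the
# broker failure patterns listed in this entry.
count_broker_errors() {
  local log_file=$1
  # -E enables alternation, -i mirrors the entry's 'i' flag, -c prints the match count.
  # grep exits non-zero when the count is 0, so "|| true" keeps callers using set -e alive.
  grep -Eic 'BrokerServer.*TaskCanceledException|SocketException \(125\): Operation canceled' "$log_file" || true
}
```

Called on the newest file under `_diag`, a count that keeps growing between runs would suggest the NAT idle timeout is still shorter than the long poll. Any alert threshold is left to the operator.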

package/errors/runner-environment/checkout-v603-hash-algorithm-api-rate-limiting.yml ADDED

@@ -0,0 +1,100 @@
+id: runner-environment-206
+title: 'actions/checkout v6.0.3 regression — new /hash-algorithm API call on every checkout exhausts rate limits in high-volume orgs'
+category: runner-environment
+severity: error
+tags:
+  - checkout
+  - rate-limit
+  - api-rate-limit
+  - v6
+  - PAT
+  - regression
+  - hash-algorithm
+patterns:
+  - regex: 'API rate limit exceeded'
+    flags: 'i'
+  - regex: 'You have exceeded a secondary rate limit'
+    flags: 'i'
+  - regex: 'HttpError.*API rate limit exceeded'
+    flags: 'i'
+  - regex: 'Rate limit.*exceeded.*403'
+    flags: 'i'
+error_messages:
+  - 'Error: HttpError: API rate limit exceeded for user ID'
+  - 'You have exceeded a secondary rate limit and have been temporarily blocked from content creation'
+  - 'remote: Repository not found.'
+  - 'fatal: repository ''https://github.com/owner/repo/'' not found'
+root_cause: |
+  actions/checkout v6.0.3 (released June 2, 2026, commit 1cce339) introduced a new
+  REST API call to GET /repos/{owner}/{repo}/hash-algorithm on every checkout operation
+  to determine the repository's object hashing algorithm.
+  In organizations with high-concurrency workflows, this additional API call multiplies
+  rate-limit consumption significantly. A matrix build with 30 parallel jobs each running
+  checkout makes 30 additional API calls per push event. At the per-user rate limit of
+  5,000 requests/hour, large orgs using PATs (Personal Access Tokens) for cross-repo
+  checkout quickly exhaust their quota across many concurrent pipelines.
+  When the /hash-algorithm endpoint returns HTTP 403 (rate limited), the checkout action
+  may interpret the 403 as a resource-not-found condition, producing misleading errors
+  such as "remote: Repository not found" that mask the true rate-limit cause.
+  GITHUB_TOKEN is not affected in the same way because it carries per-repository rate
+  limits (1,000 requests/hour, or 15,000 on GitHub Enterprise Cloud) rather than per-user limits.
+  Source: actions/checkout#2450.
+fix: |
+  Immediate fix — pin to actions/checkout@v6.0.2 until the upstream regression is
+  addressed:
+    uses: actions/checkout@v6.0.2
+  Use GITHUB_TOKEN instead of PAT where possible:
+  GITHUB_TOKEN rate limits (1,000 req/hr, or 15,000 on GitHub Enterprise Cloud) are scoped per
+  repository and do not aggregate across your organization's other workflows.
+  Reserve PATs for cross-repo or cross-org checkouts only.
+  Reduce parallel checkout volume:
+  fetch-depth: 1 is already the default, and sparse-checkout likely shrinks the git transfer rather
+  than the API call count, so the main lever is running fewer checkouts per workflow run.
+  Monitor actions/checkout#2450 for the upstream fix (caching the hash-algorithm
+  result within a workflow run, or making the call conditional).
+fix_code:
+  - language: yaml
+    label: 'Pin to v6.0.2 to avoid the regression until upstream fix is released'
+    code: |
+      - uses: actions/checkout@v6.0.2
+        with:
+          # Prefer GITHUB_TOKEN over a PAT to use per-repo rate limits
+          token: ${{ secrets.GITHUB_TOKEN }}
+          fetch-depth: 1         # Shallow fetch (the default), stated explicitly
+  - language: yaml
+    label: 'Cross-repo checkout — use dedicated PAT only where necessary, pin version'
+    code: |
+      - uses: actions/checkout@v6.0.2
+        with:
+          repository: org/other-repo
+          token: ${{ secrets.CROSS_REPO_PAT }}   # PAT required here; rate-limited per user
+          fetch-depth: 1
+  - language: yaml
+    label: 'Monitor rate limit headers in a preflight step (diagnostic aid)'
+    code: |
+      - name: Check remaining GitHub API rate limit
+        run: |
+          remaining=$(curl -s -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
+            https://api.github.com/rate_limit | jq '.rate.remaining')
+          echo "Remaining API calls: $remaining"
+          if (( remaining < 500 )); then
+            echo "::warning::Low API rate limit — $remaining calls remaining"
+          fi
+prevention:
+  - 'Pin actions/checkout to a specific patch version (e.g., @v6.0.2) in high-volume orgs — patch updates can introduce API regressions like this one'
+  - 'Prefer GITHUB_TOKEN over org-wide PATs for checkout; GITHUB_TOKEN rate limits are per-repository and isolated from other workflows'
+  - 'Monitor your org REST API usage under Settings > Insights > API Requests to detect unexpected call spikes from action updates before they hit production'
+  - 'Add a rate-limit check step to long-running or high-parallelism pipelines to catch exhaustion before it causes misleading errors'
+docs:
+  - url: 'https://github.com/actions/checkout/issues/2450'
+    label: 'actions/checkout#2450: New /hash-algorithm API call causing rate limiting failures in v6.0.3'
+  - url: 'https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api'
+    label: 'GitHub Docs: Rate limits for the REST API'
+  - url: 'https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#permissions-for-the-github_token'
+    label: 'GitHub Docs: GITHUB_TOKEN permissions and rate limits'
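
The root cause above reasons that each checkout adds one API call, so a 30-job matrix adds 30 calls per push. The arithmetic can be sketched as below. Both function names are hypothetical, and the push rate in the usage note is an assumed figure for illustration, not a measurement from the issue.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extra API calls per hour charged to one token,
# assuming exactly one additional call per checkout (as the entry describes).
extra_calls_per_hour() {
  local pushes_per_hour=$1 checkouts_per_push=$2
  echo $(( pushes_per_hour * checkouts_per_push ))
}

# Share of an hourly quota consumed by those calls, as a whole percentage (rounded down).
quota_percent() {
  local calls=$1 quota=$2
  echo $(( calls * 100 / quota ))
}
```

With an assumed 40 pushes per hour and 30 checkouts per push, `extra_calls_per_hour 40 30` gives 1200, and `quota_percent 1200 5000` gives 24, so roughly a quarter of a PAT's 5,000-request budget would go to the new endpoint alone.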

package/errors/runner-environment/macos-self-hosted-listener-aad-ghost-busy-stall.yml ADDED

@@ -0,0 +1,126 @@
+id: runner-environment-205
+title: 'macOS self-hosted Runner.Listener silently stalls after AAD credential-refresh — ghost-busy state blocks queue'
+category: runner-environment
+severity: silent-failure
+tags:
+  - self-hosted
+  - macos
+  - apple-silicon
+  - listener
+  - aad
+  - ghost-busy
+  - broker-reconnect
+  - credential-refresh
+patterns:
+  - regex: 'AAD Correlation ID for this token request:\s*Unknown'
+    flags: 'i'
+  - regex: 'RSAFileKeyManager.*Loading RSA key parameters from file.*credentials_rsaparams'
+    flags: 'i'
+  - regex: 'GitHubActionsService.*AAD Correlation ID.*Unknown'
+    flags: 'i'
+error_messages:
+  - '[INFO RSAFileKeyManager] Loading RSA key parameters from file .../.credentials_rsaparams'
+  - '[INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown'
+root_cause: |
+  On long-lived self-hosted macOS runners (v2.334.0+, Apple Silicon), the
+  Runner.Listener process can permanently stall after an AAD (Azure Active Directory)
+  credential-refresh event coincides with a broker session disconnect.
+  Normal broker long-poll timeouts produce "SocketException (89): Operation canceled"
+  entries and the listener successfully reconnects. However, when a broker disconnect
+  occurs at the same time as an AAD credential refresh, the listener logs its final
+  diagnostic sequence and then goes permanently silent:
+    [INFO RSAFileKeyManager] Loading RSA key parameters from file .../.credentials_rsaparams
+    [INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
+  After these lines, the main thread parks in pthread_cond_wait with no further diag
+  log output and no TCP ESTABLISHED connection to the broker. The OS process stays alive
+  (visible in ps/Activity Monitor/launchctl list), so launchd does not restart it. The
+  broker-side agent state continues to show the runner as "busy" from its last completed
+  job, stalling all subsequent queued jobs behind the phantom runner until an external
+  restart clears the state.
+  The trigger requires: runner lifetime longer than several hours (so a credential
+  refresh occurs), plus a broker disconnect at or immediately after the refresh boundary.
+  Observed simultaneously affecting all 4 of 4 macOS ARM64 runners on a single host
+  within a 32-minute window, causing a 4-hour queue stall. Source: actions/runner#4446.
+fix: |
+  No platform-side fix available as of June 2026 (open issue).
+  Workaround — implement an out-of-band watchdog script/cron that:
+  1. Confirms the Runner.Listener PID has an ESTABLISHED TCP socket to the broker
+     (check with `lsof -a -p <pid> -i TCP | grep ESTABLISHED`; `-a` ANDs the two filters)
+  2. Confirms the most recent entry in _diag/Runner_*.log is less than N minutes old
+     (e.g., 10 minutes)
+  3. If both checks fail, restarts the runner service:
+     - macOS (launchd):
+         launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/actions.runner.<owner-repo>.<name>.plist
+         sleep 5
+         launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/actions.runner.<owner-repo>.<name>.plist
+     - Linux (systemd):
+         sudo systemctl restart actions.runner.<owner-repo>.<name>.service
+  4. Optionally clear the broker-side ghost-busy state via REST API:
+         curl -X DELETE \
+           -H "Authorization: Bearer $GH_TOKEN" \
+           "https://api.github.com/repos/<owner>/<repo>/actions/runners/<runner_id>"
+     This removes the runner registration and clears the stale busy state; re-register the runner with config.sh afterwards.
+  Long-term: run macOS runners as ephemeral (config.sh --ephemeral) with a process supervisor
+  that re-registers and restarts after each completed job, eliminating the multi-hour lifetime
+  that triggers the credential-refresh race.
+fix_code:
+  - language: yaml
+    label: 'Watchdog workflow on separate runner — detect stalled listeners for follow-up'
+    code: |
+      # Separate monitoring workflow on a non-affected runner
+      # Runs every 15 minutes via cron
+      on:
+        schedule:
+          - cron: '*/15 * * * *'
+      jobs:
+        watchdog:
+          runs-on: ubuntu-latest   # Use a separate hosted runner for the watchdog
+          steps:
+            - name: Report stuck-busy macOS listeners
+              env:
+                GH_TOKEN: ${{ secrets.RUNNER_MGMT_PAT }}
+              run: |
+                # List all self-hosted runners and check for stuck-busy ones
+                gh api /repos/${{ github.repository }}/actions/runners \
+                  --jq '.runners[] | select(.status=="online" and .busy==true) | .id' \
+                  | while read -r runner_id; do
+                    echo "Runner $runner_id showing busy — may need investigation"
+                    # Add custom liveness check here (SSH to host, check log freshness)
+                  done
+  - language: bash
+    label: 'Shell watchdog — check listener log freshness and restart via launchctl'
+    code: |
+      #!/bin/bash
+      # Run on the macOS runner host via cron every 10 minutes
+      RUNNER_LABEL="owner-repo-runner-name"
+      PLIST="$HOME/Library/LaunchAgents/actions.runner.${RUNNER_LABEL}.plist"
+      DIAG_DIR="$HOME/actions-runner/_diag"
+      STALE_MINUTES=10
+      latest_log=$(ls -t "${DIAG_DIR}/Runner_"*.log 2>/dev/null | head -1)
+      if [[ -z "$latest_log" ]]; then exit 0; fi
+      age_minutes=$(( ($(date +%s) - $(stat -f %m "$latest_log")) / 60 ))
+      if (( age_minutes > STALE_MINUTES )); then
+        echo "Runner diag log stale for ${age_minutes}min — restarting..."
+        launchctl bootout "gui/$(id -u)" "$PLIST" 2>/dev/null
+        sleep 5
+        launchctl bootstrap "gui/$(id -u)" "$PLIST"
+      fi
+prevention:
+  - 'Run macOS self-hosted runners as ephemeral (--ephemeral) with a process supervisor; this eliminates the multi-hour lifetime needed to trigger the credential-refresh race condition'
+  - 'Implement a log-freshness watchdog that monitors _diag/Runner_*.log modification time and restarts the launchd service if no new entries appear for > 10 minutes'
+  - 'Monitor GET /repos/{owner}/{repo}/actions/runners and alert on runners with busy=true for longer than your longest expected job duration'
+  - 'Limit runner lifetime with a cron-triggered scheduled restart between jobs (e.g., nightly) to reduce the window where credential refresh coincides with a broker disconnect'
+docs:
+  - url: 'https://github.com/actions/runner/issues/4446'
+    label: 'actions/runner#4446: Listener silently exits broker-reconnect loop after AAD credential-refresh (ghost-busy)'
+  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/monitoring-and-troubleshooting-self-hosted-runners'
+    label: 'GitHub Docs: Monitoring and troubleshooting self-hosted runners'
+  - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners'
+    label: 'GitHub REST API: Self-hosted runners'
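
The workaround above restarts the listener only when both checks fail: no ESTABLISHED broker socket and a stale diag log. The shell watchdog in `fix_code` covers only the log check, so a sketch of the combined decision follows. The function names are hypothetical, and the socket probe assumes `lsof` is installed on the runner host.

```bash
#!/usr/bin/env bash
# Pure decision: succeed (restart) only when the listener has no ESTABLISHED
# TCP socket AND its newest diag log is older than the staleness threshold.
should_restart() {
  local established_count=$1 log_age_minutes=$2 stale_minutes=$3
  (( established_count == 0 && log_age_minutes > stale_minutes ))
}

# Probe: count ESTABLISHED TCP sockets owned by a PID.
# -a ANDs the -p and -i filters; without it lsof lists the union of both.
established_sockets() {
  local pid=$1
  lsof -a -p "$pid" -i TCP 2>/dev/null | grep -c ESTABLISHED || true
}
```

A cron job could feed `established_sockets "$listener_pid"` and the log age computed in the existing watchdog into `should_restart`, and run the `launchctl bootout` / `bootstrap` pair only when it succeeds. Keeping the decision separate from the probes makes it easy to test without a live runner.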

package/errors/runner-environment/runner-environment-210.yml ADDED

@@ -0,0 +1,105 @@
+id: runner-environment-210
+title: 'Runner step-log and summary uploads silently stall behind egress-only firewall — .NET BlobClient ignores HTTPS_PROXY'
+category: runner-environment
+severity: silent-failure
+tags:
+  - proxy
+  - https-proxy
+  - blob-storage
+  - self-hosted
+  - egress-firewall
+  - logs-missing
+  - azure-blob
+patterns:
+  - regex: 'productionresultssa\d+\.blob\.core\.windows\.net'
+    flags: 'i'
+  - regex: 'ua=azsdk-net-Storage\.Blobs.*latency=[0-9]{2,3}\.'
+    flags: 'i'
+  - regex: 'step summary.*not.*available|summary.*upload.*failed|diagnostic.*log.*missing'
+    flags: 'i'
+error_messages:
+  - 'Step logs not visible in Actions UI — log upload stalled silently'
+  - 'Job completed but step summary is blank or missing'
+  - 'latency=74.999634s  ua=azsdk-net-Storage.Blobs/12.27.0 (.NET 8.0)'
+  - 'host=productionresultssa6.blob.core.windows.net:443  latency=74.99s'
+root_cause: |
+  The GitHub Actions runner process (v2.333.1 and earlier) creates Azure SDK
+  `BlobClient` instances for uploading step logs, workflow summaries, and diagnostic
+  logs without configuring an `HttpClientTransport` that honours the `HTTPS_PROXY`
+  environment variable.
+  In `ResultsHttpClient.cs`, the `GetBlobClient()` and `GetAppendBlobClient()` methods
+  pass only retry/timeout options to `BlobClientOptions` — no `Transport`. The Azure SDK
+  therefore falls back to its internal default HTTP pipeline, which does **not** inherit
+  the runner's `RunnerWebProxy` configuration.
+  The impact differs depending on network topology:
+  - **Without an egress firewall**: The runner's BlobClient connects directly to
+    `*.blob.core.windows.net` and `*.actions.githubusercontent.com`, bypassing the
+    proxy entirely. Direct egress works, so no error is visible.
+  - **With a deny-all egress firewall (proxy-only)**: The BlobClient attempts a direct
+    connection that the firewall silently drops. Each attempt stalls for ~75 seconds
+    (the TCP connect timeout), then times out. Since log uploads are non-fatal, the job
+    eventually completes — but step logs are absent from the Actions UI and the job
+    takes 5–15 minutes longer than expected.
+  This is separate from the `upload-artifact@v6` proxy CONNECT-headers regression
+  (re-208), which affects the Node.js artifact upload path.
+fix: |
+  Add the Azure Blob Storage and Actions results endpoints to the `NO_PROXY` (or
+  `no_proxy`) environment variable so the runner bypasses the proxy for those hosts
+  and connects to them directly.
+  This requires that the runner's egress firewall allows direct connections to
+  `*.blob.core.windows.net` and `results-receiver.actions.githubusercontent.com`.
+  If only proxy egress is available, the workaround is to configure the proxy to
+  pass through those hosts without TLS inspection.
+  A proper fix (runner-side, not yet released): the `BlobClientOptions.Transport` in
+  `ResultsHttpClient.cs` should be configured with an `HttpClientTransport` wrapping
+  the runner's `RunnerWebProxy` — tracked in actions/runner#4351.
+fix_code:
+  - language: yaml
+    label: 'Self-hosted runner — set NO_PROXY to bypass proxy for Azure Blob endpoints'
+    code: |
+      # Set at the OS level or in the runner's .env file before starting the runner service.
+      # This allows the BlobClient to connect directly while other traffic goes through the proxy.
+      #
+      # On Linux/macOS (add to /etc/environment or runner startup script):
+      # NO_PROXY=.blob.core.windows.net,.actions.githubusercontent.com,results-receiver.actions.githubusercontent.com
+      #
+      # In a workflow (job-level env reaches step processes only; the runner's own uploads read the service environment above):
+      jobs:
+        build:
+          runs-on: self-hosted
+          env:
+            HTTPS_PROXY: http://proxy.corp.example.com:3128
+            NO_PROXY: '.blob.core.windows.net,.actions.githubusercontent.com,results-receiver.actions.githubusercontent.com'
+          steps:
+            - uses: actions/checkout@v4
+            - run: echo "Logs will upload correctly now"
+  - language: yaml
+    label: 'ARC / Kubernetes — set NO_PROXY in RunnerDeployment or RunnerScaleSet'
+    code: |
+      apiVersion: actions.summerwind.dev/v1alpha1
+      kind: RunnerDeployment
+      spec:
+        template:
+          spec:
+            env:
+              - name: HTTPS_PROXY
+                value: http://proxy.corp.example.com:3128
+              - name: NO_PROXY
+                value: '.blob.core.windows.net,.actions.githubusercontent.com,results-receiver.actions.githubusercontent.com'
+prevention:
+  - 'When deploying self-hosted runners behind a forward proxy with deny-all egress, always set NO_PROXY to include Azure Blob Storage endpoints — the runner BlobClient does not inherit HTTPS_PROXY.'
+  - 'Monitor Actions step log visibility alongside job exit codes — missing logs with a successful exit often indicate a proxy or network configuration issue, not a code failure.'
+  - 'Run a proxy connectivity diagnostic on a new runner host: confirm .blob.core.windows.net is reachable (directly or via proxy) before routing real workloads to it.'
+  - 'Track actions/runner#4351 for a first-party fix that configures the BlobClient transport to use the runner proxy settings.'
+docs:
+  - url: 'https://github.com/actions/runner/issues/4351'
+    label: 'actions/runner#4351 — BlobClient uploads stall through HTTPS proxy (Apr 2026)'
+  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/using-a-proxy-server-with-self-hosted-runners'
+    label: 'GitHub Docs: Using a proxy server with self-hosted runners'
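
The fix above depends on the blob and results hosts actually matching the `NO_PROXY` list. A rough way to sanity-check a list before deploying it is sketched below. `no_proxy_matches` is a hypothetical helper, and it assumes simple semantics (a leading-dot entry matches any subdomain, any other entry must match exactly); real HTTP clients differ in how they interpret `NO_PROXY`, so treat the result as a hint only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if HOST matches any entry of a comma-separated
# NO_PROXY-style list under the simple suffix/exact rules described above.
no_proxy_matches() {
  local host=$1 list=$2 entry
  local -a entries
  # Split on commas without triggering glob expansion of the entries.
  IFS=',' read -ra entries <<< "$list"
  for entry in "${entries[@]}"; do
    if [[ "$entry" == .* && "$host" == *"$entry" ]] || [[ "$host" == "$entry" ]]; then
      return 0
    fi
  done
  return 1
}
```

For example, `no_proxy_matches productionresultssa6.blob.core.windows.net "$NO_PROXY"` should succeed with the list shown in `fix_code`, since that host ends in `.blob.core.windows.net`.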

package/errors/runner-environment/runner-environment-213.yml ADDED

@@ -0,0 +1,142 @@
+id: runner-environment-213
+title: 'Self-Hosted Runner Stuck in Active State Indefinitely When Job Process Hangs Without Exiting'
+category: runner-environment
+severity: error
+tags:
+  - self-hosted
+  - runner
+  - active-state
+  - hung-process
+  - child-process
+  - stuck
+  - service-restart
+  - timeout-minutes
+patterns:
+  - regex: 'Waiting for a runner to pick up this job'
+    flags: 'i'
+  - regex: 'Runner\.Worker.*hung|Worker.*process.*running|active.*runner.*blocking'
+    flags: 'i'
+  - regex: 'sudo systemctl restart actions\.runner'
+    flags: 'i'
+error_messages:
+  - 'Waiting for a runner to pick up this job...'
+  - 'Runner shows as Active in GitHub UI; new jobs remain Queued indefinitely'
+root_cause: |
+  On self-hosted runners, the Runner.Worker process tracks the lifecycle of the running
+  job. When a job step spawns a child process (e.g., a test runner like `vitest --coverage`,
+  a long network operation, or a background daemon) that does not exit cleanly, the
+  Runner.Worker stays alive waiting for the child to terminate.
+  While Runner.Worker is alive, the parent Runner.Listener considers the runner slot
+  occupied (busy=true) and does not accept new job messages from the broker. The runner
+  appears as "Active" in the GitHub UI and all queued jobs for that runner remain
+  in "Waiting for a runner to pick up this job..." state indefinitely.
+  This differs from:
+  - GitHub-hosted runners: these have a hard 6-hour job timeout enforced by the platform;
+    the job is cancelled and the slot freed automatically
+  - Self-hosted runners WITH timeout-minutes set: once timeout-minutes elapses the
+    job is cancelled and the runner sends SIGTERM to the worker — but if the child
+    process ignores SIGTERM (common with some test runners), the worker still hangs
+  Common triggers:
+  - Test runners kept alive by open handles: vitest --coverage, jest without --forceExit, pytest with unclosed resources
+  - npm/yarn scripts that spawn background processes not tied to the shell session
+  - Docker containers (e.g. docker run -d) that keep running after the step exits
+  - Network calls blocked by firewall with no connection timeout
+  - Interactive prompts waiting for stdin input in a CI non-interactive context
+  Automatic recovery does NOT occur. The runner stays Active until either:
+  1. The hung child process eventually exits on its own
+  2. An operator manually restarts the runner service
+  3. A watchdog script kills the orphaned worker process
+fix: |
+  Immediate recovery: restart the runner service to free the stuck slot.
+    sudo systemctl restart actions.runner.<scope>.<name>.service   # Linux systemd
+    ./svc.sh stop && ./svc.sh start                                 # macOS (run from the runner directory)
+    Restart-Service "actions.runner.*"                              # Windows (PowerShell, elevated)
+  Prevention (preferred):
+  1. Add timeout-minutes to ALL jobs on self-hosted runners to cap maximum runtime.
+     Even if the worker hangs, the platform cancels the job and sends SIGTERM after
+     the timeout. Pair with process group kill to catch SIGTERM-resistant children.
+  2. Ensure test commands force-exit when done:
+     - vitest: use vitest run (not watch mode); add --reporter=hanging-process to find what keeps it alive
+     - jest: use jest --forceExit or --detectOpenHandles to identify hanging handles
+     - pytest: add timeout fixtures via pytest-timeout plugin
+  3. Use process groups (setsid / start new session) so SIGTERM can be forwarded to children:
+     run: |
+       setsid bash -c 'npm test' & CHILD_PID=$!
+       trap 'kill -TERM -- "-$CHILD_PID"' TERM INT   # forward to the whole process group
+       wait "$CHILD_PID"
+  4. Deploy a runner watchdog that monitors Worker processes with no active child
+     CPU activity for > N minutes and kills them:
+     - Check elapsed time + zero CPU descendants
+     - SIGKILL stale Worker processes
+     - Trigger runner service restart via systemd or equivalent
+fix_code:
+  - language: yaml
+    label: 'Add timeout-minutes to prevent indefinite runner lock'
+    code: |
+      jobs:
+        test:
+          runs-on: self-hosted
+          timeout-minutes: 30   # Always set on self-hosted runners
+          steps:
+            - uses: actions/checkout@v4
+            - name: Run tests
+              run: npm test
+  - language: yaml
+    label: 'Force-exit test runner so worker process completes cleanly'
+    code: |
+      jobs:
+        test:
+          runs-on: self-hosted
+          timeout-minutes: 30
+          steps:
+            - name: Run Vitest tests
+              run: npx vitest run --reporter=default --reporter=hanging-process
+            - name: Run Jest tests
+              run: npx jest --forceExit
+            - name: Run pytest with timeout
+              run: pytest --timeout=300
+  - language: yaml
+    label: 'Watchdog step — kill orphaned background processes after main step'
+    code: |
+      jobs:
+        test:
+          runs-on: self-hosted
+          timeout-minutes: 30
+          steps:
+            - name: Run tests
+              run: npm test
+              # No continue-on-error here: the cleanup step below still runs via if: always()
+            - name: Kill orphaned processes
+              if: always()
+              run: |
+                # Kill any leftover test-runner processes owned by this runner user
+                pkill -u "$(whoami)" -f "vitest|jest|mocha" || true
+prevention:
+  - 'Always set timeout-minutes on self-hosted runner jobs: the default is 360 minutes,
+    so without it a hung process can hold the runner for up to six hours before cancellation'
+  - 'Use --forceExit with Jest; use --timeout (pytest-timeout) with pytest; audit any test suite
+    that takes longer than expected for open handles (jest --detectOpenHandles)'
+  - 'Avoid spawning background daemons in run: steps without explicit cleanup in an
+    if: always() cleanup step'
+  - 'Consider running self-hosted runners as ephemeral (ephemeral: true with ARC or
+    JIT tokens) — an ephemeral runner terminates after one job, so a hung runner
+    does not affect other jobs (a new runner pod is provisioned for each job)'
+  - 'Monitor runner Active state duration via GitHub REST API (GET /repos/{owner}/{repo}/actions/runners)
+    and alert when busy: true persists beyond expected max job duration'
+docs:
+  - url: 'https://github.com/actions/runner/issues/4312'
+    label: 'actions/runner#4312 — Self-hosted runner gets stuck in active state, blocking queued jobs'
+  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/monitoring-and-troubleshooting-self-hosted-runners'
+    label: 'GitHub Docs — Monitoring and troubleshooting self-hosted runners'
+  - url: 'https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes'
+    label: 'GitHub Docs — timeout-minutes syntax reference'
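
The last prevention item above calls for an alert when `busy: true` outlasts the longest expected job. The runners endpoint reports only the current `busy` flag, so this sketch assumes the monitor itself records when it first saw a runner busy. `busy_too_long` is a hypothetical helper and all three inputs are supplied by the caller.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when a runner has been busy for more whole
# minutes than the longest job it is expected to run.
# busy_since_epoch is recorded by the monitor, not returned by the API.
busy_too_long() {
  local busy_since_epoch=$1 now_epoch=$2 max_job_minutes=$3
  (( (now_epoch - busy_since_epoch) / 60 > max_job_minutes ))
}
```

A monitor might call `busy_too_long "$first_seen_busy" "$(date +%s)" 30` for each busy runner and raise an alert, or trigger the service restart shown in `fix`, when it succeeds.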