@htekdev/actions-debugger 1.0.109 → 1.0.111
- package/errors/known-unsolved/ephemeral-jit-runner-lost-communication-false-positive.yml +93 -0
- package/errors/runner-environment/arc-runner-v2332-container-github-env-permission-denied.yml +117 -0
- package/errors/runner-environment/node24-16-toolcache-regression-browser-install-fails.yml +109 -0
- package/errors/runner-environment/runner-registration-token-502-burst.yml +102 -0
- package/errors/silent-failures/matrix-array-in-object-serialized-as-Array-string.yml +127 -0
- package/errors/silent-failures/setup-node-falls-through-wrong-version-on-download-failure.yml +112 -0
- package/package.json +1 -1
@@ -0,0 +1,93 @@
id: known-unsolved-058
title: 'Ephemeral JIT Runner Reports "Lost Communication" Despite Successful Job Completion'
category: known-unsolved
severity: error
tags:
  - self-hosted
  - ephemeral
  - jit-runner
  - lost-communication
  - false-positive
  - broker
patterns:
  - regex: 'The self-hosted runner lost communication with the server'
    flags: 'i'
  - regex: 'messageQueueLoopTokenSource|Stop message queue looping'
    flags: 'i'
error_messages:
  - 'The self-hosted runner lost communication with the server'
  - 'GET request to broker.actions.githubusercontent.com/message ... has been cancelled'
  - 'TaskCanceledException: The operation was canceled'
root_cause: |
  In Runner.cs line 576, after a one-time-use (ephemeral/JIT) job completes, the message queue
  is cancelled immediately with zero grace period:

      messageQueueLoopTokenSource.Cancel();

  This tears down the in-flight broker long-poll (GET broker.actions.githubusercontent.com/message)
  immediately after CompleteJobAsync returns. GitHub's broker health monitor detects the TCP
  disconnect and flags the runner as "lost communication" — racing against the pipeline service
  that just received the successful completion.

  Two independent GitHub backend systems race:
  1. Pipeline service — received CompleteJobAsync, knows the job succeeded
  2. Broker health monitor — sees TCP disconnect, flags runner as "lost communication"

  When the broker monitor wins (which happens on ~5-10% of short jobs), the UI shows
  "The self-hosted runner lost communication with the server" even though the worker exited
  with code 100 (success) and the runner itself exited with return code 0.

  The runner logs show all events at identical timestamps — zero delay between
  "Received job status event. JobState: Online" and "messageQueueLoopTokenSource.Cancel()".

  No user-side configuration can fix this. A code change to Runner.cs is required
  (add ~5s grace delay before cancel). Source: actions/runner#4309 (March 2026, open bug,
  affects runner v2.331.0+, ~5-10% failure rate on short ephemeral jobs).
fix: |
  There is no user-side configuration fix. The root cause is a zero-grace-period teardown
  in Runner.cs that must be patched by GitHub.

  Mitigations:
  1. Verify in the diagnostic logs that the job actually succeeded. Check _diag/Runner_*.log
     for "return code 0" and "result: Succeeded" before concluding that it failed.
  2. Retry the failed job. The "lost communication" result is a false positive; re-running
     produces a clean success.
  3. Do not gate required status checks on ephemeral JIT runners until actions/runner#4309
     is resolved, or implement a workflow-level retry wrapper.
  4. Alert on job result=="failure", not on "lost communication" text alone. False positives
     from this race should not page on-call engineers.
  5. Batch small jobs. The race is more likely on jobs shorter than 60 seconds, so combining
     work into longer-running jobs reduces false positive frequency.
fix_code:
  - language: yaml
    label: 'Confirm false positive by reading runner diagnostic log before retrying'
    code: |
      # After "lost communication" on an ephemeral JIT runner, check:
      # _diag/Runner_YYYYMMDD-hhmmss-utc.log
      #
      # Successful completion signs (all at identical timestamps):
      # [INFO] finish job request for job {id} with result: Succeeded
      # [INFO] Job X completed with result: Succeeded
      # [INFO] Received job status event. JobState: Online
      # [INFO] Runner execution has finished with return code 0
      #
      # If these appear immediately before TaskCanceledException, the job DID succeed.
      # Simply re-run the workflow — the retry will show a clean green result.
      jobs:
        build:
          runs-on: [self-hosted, ephemeral, linux]
          # Note: retry-on-failure logic can wrap this job in the caller:
          steps:
            - uses: actions/checkout@v4
            - run: make build
prevention:
  - 'Treat "lost communication" on ephemeral JIT runners as a potential false positive — check _diag/Runner_*.log for "return code 0" before escalating.'
  - 'Do not gate required status checks on ephemeral JIT runners until actions/runner#4309 is resolved upstream.'
  - 'Alert on job result=="failure", not on the "lost communication" message text alone.'
  - 'Batch work into longer jobs (>60s total) to reduce the broker teardown race frequency.'
  - 'Watch actions/runner#4309 for an upstream fix (proposed: 5-second grace period before messageQueueLoopTokenSource.Cancel()).'
docs:
  - url: 'https://github.com/actions/runner/issues/4309'
    label: 'actions/runner#4309: Ephemeral/JIT runner reports "lost communication" despite successful completion (March 2026)'
  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners#using-just-in-time-runners'
    label: 'GitHub Docs: Just-in-time (JIT) runners'
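The log check in mitigation 1 can be automated. Below is a minimal Python sketch, assuming the diagnostic log contains the marker strings quoted in the fix_code comments above; the function name `classify_lost_communication` and its return labels are hypothetical and not part of the runner.

```python
import re

# Marker strings are copied from the log excerpts quoted in this entry.
# Treat them as assumptions about the runner's diagnostic log wording.
SUCCESS_MARKERS = (
    re.compile(r"completed with result: Succeeded"),
    re.compile(r"Runner execution has finished with return code 0"),
)


def classify_lost_communication(log_text):
    """Classify a 'lost communication' report from the text of a
    _diag/Runner_*.log file.

    Returns 'likely-false-positive' when every success marker is present,
    otherwise 'needs-investigation'.
    """
    if all(marker.search(log_text) for marker in SUCCESS_MARKERS):
        return "likely-false-positive"
    return "needs-investigation"
```

A triage script could read the newest `_diag/Runner_*.log` on the runner host and pass its text to this function before deciding whether to re-run the job or escalate.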
@@ -0,0 +1,117 @@
id: runner-environment-178
title: 'ARC Runner v2.332.0 Regression — Container Job GITHUB_ENV and Workspace Permission Denied'
category: runner-environment
severity: error
tags:
  - arc
  - actions-runner-controller
  - container-job
  - permissions
  - GITHUB_ENV
  - non-root
  - kubernetes
  - regression
  - v2.332
patterns:
  - regex: 'cannot create /__w/_temp/_runner_file_commands/set_env_[0-9a-f]+: Permission denied'
    flags: 'i'
  - regex: 'cannot create /__w/_temp/_runner_file_commands/add_path_[0-9a-f]+: Permission denied'
    flags: 'i'
  - regex: 'fatal: detected dubious ownership in repository at .+/__w/'
    flags: 'i'
  - regex: '_runner_file_commands.*Permission denied'
    flags: 'i'
error_messages:
  - "/__w/_temp/36e38446.sh: 5: cannot create /__w/_temp/_runner_file_commands/set_env_7bb88aaa: Permission denied"
  - "fatal: detected dubious ownership in repository at '/__w/repo/repo'"
  - "/__w/_temp/_runner_file_commands/add_path_: Permission denied"
root_cause: |
  Upgrading Actions Runner Controller (ARC) from runner v2.330.0 to v2.332.0
  introduces a compound regression that breaks container jobs using non-root users.

  The regression spans two runner releases:

  1. v2.331.0 changed the runner container base image from Ubuntu 22.04 to
     Ubuntu 24.04. The newer base ships git 2.43+ which enforces stricter
     safe.directory checks. The mounted workspace volume is owned by the runner
     UID, so a container running as a different non-root UID receives
     "fatal: detected dubious ownership" on any git operation.

  2. v2.332.0 bumped container hooks to v0.8.1, which updated workspace
     ownership handling for the runner pod itself — but not for downstream job
     containers. The _runner_file_commands directory under /__w/_temp/ is
     still created with runner UID ownership. When a container step writes to
     $GITHUB_ENV, $GITHUB_OUTPUT, $GITHUB_PATH, or $GITHUB_STEP_SUMMARY via a
     shell redirect, the shell (running as the container's non-root user) cannot
     create the file and exits non-zero.

  This regression affects:
  - ARC-managed self-hosted runners on Kubernetes (EKS, GKE, AKS, on-prem)
  - Any workflow using `container: image: my-image` with `options: --user <uid>`
    or a non-root USER in the Dockerfile
  - Workflows that previously worked on runner v2.330.0 and below

  GitHub-hosted runners (ubuntu-latest, etc.) are NOT affected.
fix: |
  If you cannot immediately pin the runner version, add a workaround step at
  the top of the affected job to fix ownership of the runner file command
  directories. Alternatively, pin ARC runner images to v2.330.0 until an
  upstream fix for container hooks is released.

  The most robust long-term fix is to explicitly add the workspace to git's
  safe.directory list and pre-create the file command directories with the
  correct ownership.
fix_code:
  - language: yaml
    label: 'Option A — Pre-create _runner_file_commands with container user ownership'
    code: |
      jobs:
        build:
          runs-on: self-hosted
          container:
            image: my-app:latest
            options: --user 1000
          steps:
            - name: Fix runner file command directory ownership (v2.332.0 workaround)
              # Must run before any step that writes GITHUB_ENV/GITHUB_OUTPUT/GITHUB_PATH.
              # chown only succeeds with root privileges; under --user 1000 it fails
              # and `|| true` hides that, so confirm the ownership actually changed.
              run: |
                chown -R 1000:1000 /__w/_temp/_runner_file_commands/ || true
                # GITHUB_WORKSPACE is the checkout path (e.g. /__w/repo/repo) that git complains about
                git config --global --add safe.directory "$GITHUB_WORKSPACE"
              shell: bash
              # Note: requires container image to have chown available as root
  - language: yaml
    label: 'Option B — Pin ARC runner image to v2.330.0 to avoid the regression'
    code: |
      # In your ARC HelmRelease or RunnerDeployment spec:
      # spec:
      #   template:
      #     spec:
      #       containers:
      #         - name: runner
      #           image: ghcr.io/actions/actions-runner:2.330.0
  - language: yaml
    label: 'Option C — Run container job as root to avoid UID mismatch'
    code: |
      jobs:
        build:
          runs-on: self-hosted
          container:
            image: my-app:latest
            options: --user root # avoid UID mismatch until ARC fix ships
          steps:
            - uses: actions/checkout@v4
            - run: echo "FOO=bar" >> "$GITHUB_ENV"
prevention:
  - 'Test ARC runner version upgrades in a staging environment before rolling out to production — especially upgrades that skip a release (v2.330 → v2.332).'
  - 'Run container jobs as root as a stopgap if your workflow writes to GITHUB_ENV, GITHUB_OUTPUT, or GITHUB_PATH and the non-root container user cannot write to the file command directory.'
  - 'Subscribe to actions/runner releases and scan for changes to container-hooks between minor versions.'
  - 'Add a smoke-test workflow that writes to GITHUB_ENV in a non-root container job — run it against each ARC upgrade to catch regressions early.'
docs:
  - url: 'https://github.com/actions/runner/issues/4302'
    label: 'actions/runner #4302 — v2.332.0: Container jobs fail with permission denied on GITHUB_ENV and workspace'
  - url: 'https://github.com/actions/runner/issues/4131'
    label: 'actions/runner #4131 — Permissions issue on runners v2.330.0 (/home/runner ownership regression)'
  - url: 'https://github.com/actions/runner/issues/4251'
    label: 'actions/runner #4251 — TempDirectoryManager fails to clean temp directory (permission denied on v2.331.0)'
  - url: 'https://github.com/actions/runner-container-hooks/issues/282'
    label: 'runner-container-hooks #282 — Permissions denied on workingDir'
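The smoke test suggested in the prevention list reduces to one question: can the job's user append to the runner file command files? A minimal Python sketch follows, assuming the GITHUB_ENV, GITHUB_OUTPUT, GITHUB_PATH and GITHUB_STEP_SUMMARY environment variables named in the root cause are set inside the container step; `can_append` and `check_file_commands` are hypothetical helper names.

```python
import os

FILE_COMMAND_VARS = (
    "GITHUB_ENV",
    "GITHUB_OUTPUT",
    "GITHUB_PATH",
    "GITHUB_STEP_SUMMARY",
)


def can_append(path):
    """Return True if the current user can open `path` for appending.

    Opening in append mode creates the file when it is missing, which is
    the same operation a shell `>>` redirect performs.
    """
    try:
        with open(path, "a", encoding="utf-8"):
            return True
    except OSError:
        return False


def check_file_commands(env=None):
    """Map each file command variable that is set to its writability."""
    env = os.environ if env is None else env
    return {
        name: can_append(env[name])
        for name in FILE_COMMAND_VARS
        if env.get(name)
    }
```

Running `check_file_commands()` in a non-root container step and failing the job when any value is `False` would surface the permission problem before later steps depend on those files.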
@@ -0,0 +1,109 @@
id: runner-environment-177
title: 'Node.js 24.16.0 Toolcache Update Breaks Puppeteer, Playwright, and Cypress Browser Install'
category: runner-environment
severity: error
tags:
  - nodejs
  - node24
  - puppeteer
  - playwright
  - cypress
  - toolcache
  - browser-install
  - extract-zip
  - regression
patterns:
  - regex: 'Could not find Chrome \(ver\. \d+\.\d+\.\d+'
    flags: 'i'
  - regex: 'npx puppeteer browsers install .+ exited with code [^0]'
    flags: 'i'
  - regex: 'browserType\.launch: Executable doesn.t exist at .+/chromium'
    flags: 'i'
  - regex: 'Cannot find browser at path.*\.cache/puppeteer'
    flags: 'i'
  - regex: 'Failed to install browsers.*extract.*zip'
    flags: 'i'
error_messages:
  - "Error: Could not find Chrome (ver. 146.0.7680.153). This can occur if either 1. you did not perform an installation before running the script (e.g. `npx puppeteer browsers install chrome-headless-shell`) or 2. your cache path is incorrectly configured"
  - "browserType.launch: Executable doesn't exist at /home/runner/.cache/ms-playwright/chromium-1169/chrome-linux/chrome"
  - "Error: Failed to install browsers at /home/runner/.cache/ms-playwright"
root_cause: |
  The ubuntu-24.04 image update from 20260518.149.1 → 20260525.161.1 bumped the
  cached Node.js toolcache version from 24.15.0 to 24.16.0. Node.js 24.16.0
  contains an upstream regression in the readable-stream.destroy() lifecycle
  that breaks yauzl (a ZIP reading library) and extract-zip, which depends on it.

  The affected tools all use @puppeteer/browsers or equivalent ZIP-based browser
  downloaders internally:
  - Puppeteer: `npx puppeteer browsers install chrome-headless-shell`
  - Playwright: `npx playwright install chromium` / `npx playwright install --with-deps`
  - Cypress: `npx cypress install`

  The ZIP archive download completes successfully and extraction begins, but
  the stream destroy bug causes yauzl to stop before all entries are written,
  leaving a partial extraction. The browser binary never lands on disk. The
  browser installer exits 0 (or with a non-descriptive exit code) and the next
  step fails with "Could not find Chrome (ver. ...)" or a missing-executable error.

  Root upstream issues:
  - https://github.com/nodejs/node/issues/63487 (yauzl/extract-zip hang / partial extraction)
  - https://github.com/nodejs/node/issues/63638 (libuv regression on Windows)

  Because actions/setup-node resolves to the cached Node.js 24.16.0 when
  node-version: '24' or node-version: '24.x' is specified (or when using the
  default runner-baked Node 24 on ubuntu-24.04), every workflow that installs
  a browser via these tools is affected until Node.js 24.17.0 ships a fix.
fix: |
  Pin Node.js to 24.15.0 (the last known-good version) via actions/setup-node
  until Node.js 24.17.0 is published and rolled into the runner toolcache.

  If your workflow does not strictly require Node.js 24, fall back to Node.js 22
  (the runner image default), which is unaffected by this regression.

  Do NOT use node-version: '24' or node-version: 'latest' until the upstream
  fix lands in Node.js 24.17.0.
fix_code:
  - language: yaml
    label: 'Option A — Pin Node.js to 24.15.0 (last known-good release)'
    code: |
      steps:
        - uses: actions/setup-node@v6
          with:
            node-version: '24.15.0' # pin until Node 24.17.0 fixes readable-stream regression
            cache: 'npm'

        - name: Install Puppeteer Chrome
          run: npx puppeteer browsers install chrome-headless-shell
  - language: yaml
    label: 'Option B — Fall back to Node.js 22 (unaffected)'
    code: |
      steps:
        - uses: actions/setup-node@v6
          with:
            node-version: '22' # LTS, not affected by readable-stream regression
            cache: 'npm'

        - name: Install Playwright browsers
          run: npx playwright install --with-deps chromium
  - language: yaml
    label: 'Option C — Pin Playwright install to avoid extract-zip entirely (Playwright only)'
    code: |
      steps:
        - uses: actions/setup-node@v6
          with:
            node-version: '24.15.0'
        - uses: microsoft/playwright-github-action@v1 # uses pre-installed image browsers
prevention:
  - 'Pin node-version to a specific patch (e.g. 24.15.0) rather than a major/minor range in workflows that install browser binaries via npx commands.'
  - 'After bumping Node.js versions, verify browser install steps succeed by checking the binary path explicitly with `ls -la ~/.cache/puppeteer` or equivalent before running tests.'
  - 'Subscribe to actions/runner-images releases to catch toolcache updates that may include Node.js patch regressions.'
  - 'For Playwright, prefer `npx playwright install --with-deps` combined with an explicit Node.js pin rather than relying on runner-image cached Node versions.'
docs:
  - url: 'https://github.com/actions/runner-images/issues/14173'
    label: 'runner-images #14173 — Puppeteer broken in Ubuntu 24.04 version 20260525.161.1'
  - url: 'https://github.com/nodejs/node/issues/63487'
    label: 'nodejs/node #63487 — yauzl/extract-zip hang and partial extraction (readable-stream regression)'
  - url: 'https://github.com/nodejs/node/issues/63638'
    label: 'nodejs/node #63638 — libuv regression in Node.js 24.16.0'
  - url: 'https://github.com/actions/runner-images/releases/tag/ubuntu24%2F20260525.161'
    label: 'runner-images ubuntu24/20260525.161 release — Node.js toolcache bumped 24.15.0 → 24.16.0'
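The prevention advice to check the binary path explicitly can be scripted rather than read off `ls -la` output. A minimal Python sketch follows, assuming the cache directories that appear in the error messages above (`~/.cache/puppeteer`, `~/.cache/ms-playwright`); the list of executable names is a guess and should be adjusted for the browsers you install.

```python
import os

# Executable file names to look for. This is an assumption for illustration,
# not an official list from Puppeteer, Playwright, or Cypress.
DEFAULT_NAMES = ("chrome", "chrome-headless-shell")


def find_browser_binaries(cache_dir, names=DEFAULT_NAMES):
    """Return sorted paths of non-empty, executable files under `cache_dir`
    whose file name is in `names`.

    An empty result after an install step suggests the extraction did not
    produce a usable browser binary.
    """
    found = []
    for root, _dirs, files in os.walk(os.path.expanduser(cache_dir)):
        for file_name in files:
            path = os.path.join(root, file_name)
            if (
                file_name in names
                and os.access(path, os.X_OK)
                and os.path.getsize(path) > 0
            ):
                found.append(path)
    return sorted(found)
```

A workflow step could call `find_browser_binaries("~/.cache/ms-playwright")` right after the install command and exit non-zero on an empty list, so the failure is reported at the install step instead of at test launch.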
@@ -0,0 +1,102 @@
id: runner-environment-179
title: 'Runner Registration Token Endpoint Returns 502 Under Concurrent Spawn Bursts'
category: runner-environment
severity: error
tags:
  - self-hosted
  - ephemeral
  - registration
  - autoscaler
  - burst
  - 502
patterns:
  - regex: 'HTTP Error 502: Bad Gateway'
    flags: 'i'
  - regex: 'registration-token.*502|502.*Bad Gateway.*registration'
    flags: 'i'
  - regex: 'HTTPError.*502|urlopen error.*502'
    flags: 'i'
error_messages:
  - 'HTTP Error 502: Bad Gateway'
  - 'urllib.error.HTTPError: HTTP Error 502: Bad Gateway'
  - 'POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token — 502'
root_cause: |
  The GitHub REST endpoint
    POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token

  returns HTTP 502 Bad Gateway under burst load when many ephemeral runners attempt to
  register concurrently (e.g., triggered by webhook-driven autoscalers on workflow_job.queued
  events). The burst saturates the GitHub backend's registration token issuer.

  The reference runner setup scripts (run.sh / config.sh flow) do NOT retry 5xx responses —
  a single transient 502 causes the container to exit immediately before registration completes.
  For ephemeral autoscaler setups (Modal, Lambda, Kubernetes pod-per-job), the container slot
  is burned and no runner is available for the queued job.

  Phantom containers accumulate: the container is billed, GitHub never receives
  a registration, and the autoscaler's max_containers cap can be held by dead slots — stalling
  the merge queue for hours.

  Source: actions/runner#4399 (May 2026, open issue, observed during 20+ concurrent registrations
  — ~21 phantom containers during a webhook-driven burst at ~16:00-17:00 UTC on 2026-05-04).
fix: |
  1. Add exponential backoff to your registration token call (4 attempts, 2^n second delays).
  2. Stagger autoscaler spawns with per-container random jitter (0-10 seconds) to break up
     concurrent registration bursts before they hit the endpoint simultaneously.
  3. Switch to JIT runner tokens — they are pre-assigned per job and do not require a
     burst-sensitive registration token call:
       POST /repos/{owner}/{repo}/actions/runners/generate-jitconfig
  4. Reduce max_concurrent_spawns in your autoscaler to limit peak registration load.
  5. Implement phantom container detection: health-check containers shortly after spawn and
     kill any that failed to register within N seconds.
fix_code:
  - language: python
    label: 'Python — retry-with-backoff on 502 when minting registration token'
    code: |
      import time, json, urllib.request, urllib.error

      def mint_registration_token(owner, repo, github_token):
          url = f"https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token"
          headers = {
              "Authorization": f"token {github_token}",
              "Accept": "application/vnd.github+json",
              "X-GitHub-Api-Version": "2022-11-28",
          }
          for attempt in range(4):
              try:
                  req = urllib.request.Request(url, method="POST", headers=headers)
                  with urllib.request.urlopen(req, timeout=15) as resp:
                      return json.load(resp)["token"]
              except urllib.error.HTTPError as e:
                  # Retry only 5xx responses, and only while attempts remain
                  if 500 <= e.code < 600 and attempt < 3:
                      delay = 2 ** attempt  # 1s, 2s, 4s
                      print(f"Registration token {e.code}, retrying in {delay}s (attempt {attempt+1}/4)...")
                      time.sleep(delay)
                      continue
                  raise
  - language: python
    label: 'Autoscaler — add random jitter per container spawn to break up burst'
    code: |
      import asyncio, random

      async def handle_workflow_job_queued(event):
          # Stagger spawns: random delay 0-10s reduces simultaneous token requests
          jitter = random.uniform(0, 10)
          await asyncio.sleep(jitter)
          # mint_registration_token is synchronous, so run it in a worker thread.
          # owner, repo, github_token and spawn_ephemeral_container come from your autoscaler.
          token = await asyncio.to_thread(mint_registration_token, owner, repo, github_token)
          await spawn_ephemeral_container(token)
prevention:
  - 'Add exponential backoff (4 attempts, 2^n seconds) for 5xx responses in all registration token calls.'
  - 'Add random per-container spawn jitter (0-10s) in your autoscaler to break up concurrent registration bursts.'
  - 'Switch to JIT runner tokens (generate-jitconfig) for burst autoscaler workloads — they do not hit the burst-sensitive registration endpoint.'
  - 'Monitor for phantom containers: any container that does not show as registered within 60 seconds of spawn should be terminated.'
  - 'Cap max_concurrent_spawns conservatively (≤10) and tune based on observed 502 rates.'
docs:
  - url: 'https://github.com/actions/runner/issues/4399'
    label: 'actions/runner#4399: Registration token endpoint returns 502 in bursts (May 2026)'
  - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners#create-a-registration-token-for-a-repository'
    label: 'GitHub REST API: Create a registration token for a repository'
  - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners#create-configuration-for-a-just-in-time-runner-for-a-repository'
    label: 'GitHub REST API: Create JIT runner config (avoids burst-sensitive registration token)'
  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners'
    label: 'GitHub Docs: Autoscaling with self-hosted runners'
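Phantom container detection (fix item 5) comes down to comparing what the autoscaler spawned against what GitHub lists as registered. A minimal Python sketch follows, assuming container names match runner names and that the registered names were already fetched (for example from the REST endpoint that lists self-hosted runners for a repository); `Container`, `find_phantoms`, and the way names are matched are illustrative choices, while the 60-second default comes from the prevention list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    name: str          # assumed to equal the runner name it registers under
    spawned_at: float  # Unix timestamp of the spawn


def find_phantoms(containers, registered_names, now, grace_seconds=60.0):
    """Return containers that are older than `grace_seconds` and whose
    name is absent from `registered_names`.

    Containers still inside the grace window are skipped, since they may
    simply not have finished registering yet.
    """
    registered = set(registered_names)
    return [
        container
        for container in containers
        if container.name not in registered
        and now - container.spawned_at > grace_seconds
    ]
```

The autoscaler would terminate whatever this returns on each polling cycle, which frees slots held against `max_containers` by runners that never registered.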
@@ -0,0 +1,127 @@
id: silent-failures-098
title: 'Matrix Array Output Nested in Object Property Serializes as "Array" String'
category: silent-failures
severity: silent-failure
tags:
  - matrix
  - fromJSON
  - job-outputs
  - dynamic-matrix
  - strategy
  - serialization
patterns:
  - regex: '\bArray\b'
    flags: ''
  - regex: 'fromJSON\s*\(.*outputs\.[a-zA-Z_]+'
    flags: 'i'
error_messages:
  - 'rhel Array'
  - 'debian Array'
  - 'distro = Array'
  - 'component.distro: Array'
root_cause: |
  When a job output containing a JSON array string (e.g. '["el8","el9"]') is passed via
  fromJSON() as a VALUE inside a matrix OBJECT property, the GitHub Actions expression engine
  coerces the array to a string. That coercion yields the literal word "Array" rather than
  the array elements (JavaScript's Array.prototype.toString() would instead join them with commas).

  This happens because the matrix strategy parser evaluates object property values as strings
  at expansion time. When fromJSON() returns an array, the matrix serialization path
  stringifies it instead of expanding it into multiple jobs.

  The bug only triggers when the array is nested INSIDE an object:

    # BROKEN — array inside object property
    strategy:
      matrix:
        component:
          - name: rhel
            distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
            # ^ distro becomes the string "Array", not ["el8","el9"]

  The same fromJSON() call works correctly at the TOP LEVEL of the matrix:

    # WORKS — array at top level dimension
    strategy:
      matrix:
        distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}

  The job does NOT fail — it silently runs with matrix.component.distro set to "Array",
  producing wrong build targets with no error or warning.

  Source: actions/runner#3794 (April 2025, labeled bug, open as of April 2026 after stale
  cycle — author confirmed still broken).
fix: |
  Restructure the matrix to avoid placing fromJSON() array outputs inside object properties.

  Option 1 — Pre-compute the full cross-product as a JSON array of objects in the
  matrix-generating job and pass it as a single fromJSON() at the include level.

  Option 2 — Use separate top-level matrix dimensions instead of grouping into objects.

  If object grouping is required for semantic reasons, flatten the arrays at generation time
  using jq and emit a fully-expanded JSON array of objects from the setup step.
fix_code:
  - language: yaml
    label: 'Workaround — pre-compute full matrix as JSON array, expand via include'
    code: |
      jobs:
        define-matrix:
          runs-on: ubuntu-latest
          outputs:
            matrix: ${{ steps.build.outputs.matrix }}
          steps:
            - id: build
              run: |
                # Build the full cross-product as a JSON array of objects
                matrix=$(jq -nc '[
                  {"arch":"x86_64","runner":"ubuntu-24.04","component":"rhel","distro":"el8"},
                  {"arch":"x86_64","runner":"ubuntu-24.04","component":"rhel","distro":"el9"},
                  {"arch":"x86_64","runner":"ubuntu-24.04","component":"debian","distro":"focal"},
                  {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"rhel","distro":"el8"},
                  {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"rhel","distro":"el9"},
                  {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"debian","distro":"focal"}
                ]')
                echo "matrix=$matrix" >> "$GITHUB_OUTPUT"

        build:
          needs: define-matrix
          runs-on: ${{ matrix.runner }}
          strategy:
            matrix:
              include: ${{ fromJSON(needs.define-matrix.outputs.matrix) }}
          steps:
            - run: echo "${{ matrix.component }} ${{ matrix.distro }} on ${{ matrix.arch }}"
  - language: yaml
    label: 'Debug step — detect "Array" serialization early'
    code: |
      jobs:
        build:
          needs: define-matrix
          runs-on: ubuntu-latest
          strategy:
            matrix:
              # Use top-level dimensions — arrays work at this level
              distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
              arch: [x86_64, aarch64]
          steps:
            - name: Verify matrix values are not "Array"
              run: |
                DISTRO="${{ matrix.distro }}"
                if [ "$DISTRO" = "Array" ]; then
                  echo "::error::matrix.distro serialized as 'Array' — fromJSON() inside object bug"
                  exit 1
                fi
                echo "distro=$DISTRO"
prevention:
  - 'Never place fromJSON() array outputs as values inside matrix object properties — use top-level dimensions or pre-computed include arrays.'
  - 'Add a guard step that fails if any matrix value equals the string "Array" to catch this bug early.'
  - 'Generate complex multi-dimensional matrix shapes in the setup job as a single JSON array, then expand via include: ${{ fromJSON(...) }}.'
  - 'Track actions/runner#3794 for an upstream fix to the matrix object property serialization bug.'
docs:
  - url: 'https://github.com/actions/runner/issues/3794'
    label: 'actions/runner#3794: array outputs not understood by matrix when nested inside object (April 2025, open)'
  - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/running-variations-of-jobs-in-a-workflow'
    label: 'GitHub Docs: Using a matrix for your jobs'
  - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/evaluate-expressions-in-workflows-and-actions#fromjson'
    label: 'GitHub Docs: fromJSON expression function'
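Hand-writing every combination in a jq literal gets unwieldy as the matrix grows, so the cross-product can be generated instead. A minimal Python sketch follows, assuming the setup step is able to run Python; `build_matrix` and `to_output_line` are hypothetical helper names, and the sample values in the usage note mirror the jq example above.

```python
import json


def build_matrix(arches, components):
    """Expand every (arch, component, distro) combination into a flat list
    of objects suitable for `include: ${{ fromJSON(...) }}`.

    arches:     list of dicts, e.g. {"arch": "x86_64", "runner": "ubuntu-24.04"}
    components: mapping of component name to its list of distros
    """
    entries = []
    for arch in arches:
        for component, distros in components.items():
            for distro in distros:
                entries.append({**arch, "component": component, "distro": distro})
    return entries


def to_output_line(entries):
    """Format the matrix as a single `matrix=<json>` line for $GITHUB_OUTPUT."""
    return "matrix=" + json.dumps(entries, separators=(",", ":"))
```

Calling `build_matrix` with the two architectures and `{"rhel": ["el8", "el9"], "debian": ["focal"]}` yields the same six objects as the jq literal; appending the result of `to_output_line` to the file named by `$GITHUB_OUTPUT` exposes it as `needs.define-matrix.outputs.matrix`. Every value stays a plain string, so nothing is left for the expression engine to stringify.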
@@ -0,0 +1,112 @@
id: silent-failures-097
title: 'setup-node Silently Uses Runner-Baked Node Version When Download Fails — Wrong Version Active'
category: silent-failures
severity: silent-failure
tags:
  - setup-node
  - nodejs
  - download-failure
  - silent-failure
  - wrong-version
  - toolcache
  - hosted-runner
  - fallthrough
patterns:
  - regex: 'Attempting to download \d+\.\d+\.\d+\.\.\.'
    flags: 'i'
  - regex: 'Cannot find module.*engines.*node.*>=\s*24'
    flags: 'i'
  - regex: 'The engine .node. is incompatible with this module\. Expected version .+\. Got .2[0-2]\.'
    flags: 'i'
  - regex: 'ELIFECYCLE.*node --version.*v2[0-2]\.'
    flags: 'i'
error_messages:
  - "Attempting to download 24.15.0..."
  - "error: The engine 'node' is incompatible with this module. Expected version '>=24.0.0'. Got '22.14.0'"
  - "npm ERR! code ELIFECYCLE"
  - "Error: Cannot find module 'node:crypto' (Node.js version too old)"
root_cause: |
|
|
29
|
+
When actions/setup-node's download or extract path fails transiently —
|
|
30
|
+
network blip, manifest miss, partial extract from a concurrent toolcache
|
|
31
|
+
write, or a transient S3/CDN cache failure — the action does not surface the
|
|
32
|
+
error. Instead, it falls back to a secondary download path. If that secondary
|
|
33
|
+
path also fails or returns an unusable toolPath, setup-node adds an empty or
|
|
34
|
+
incorrect directory to PATH and exits 0 (success).
|
|
35
|
+
|
|
36
|
+
Because the setup-node step succeeds, the runner-baked Node.js version
|
|
37
|
+
(e.g. v22.x on ubuntu-latest after the Node 20 removal) remains on PATH.
|
|
38
|
+
Downstream steps execute against the wrong Node.js major version with no
|
|
39
|
+
indication that setup-node did not install the requested version.
|
|
40
|
+
|
|
41
|
+
The mechanism (in official_builds.ts, as of 2026-05-21):
|
|
42
|
+
- Download/extract errors are logged via core.info(), not core.warning()
|
|
43
|
+
or core.error(), so they are buried in normal output
|
|
44
|
+
- After the fallback download attempt, there is no post-condition check
|
|
45
|
+
that verifies node --version matches the requested version
|
|
46
|
+
- core.addPath() is called even if toolPath/bin is empty or stale
|
|
47
|
+
|
|
48
|
+
Reported failing run: https://github.com/n8n-io/n8n/actions/runs/26100630929
|
|
49
|
+
The run showed "Attempting to download 24.15.0..." → 33 seconds of silence →
|
|
50
|
+
next step ran against runner-baked v20.20.0 with no error from setup-node.
|
|
51
|
+
|
|
52
|
+
This is distinct from silent-failures-028 which covers self-hosted runners
|
|
53
|
+
where node is completely absent (node: not found). This entry covers hosted
|
|
54
|
+
runners where the wrong version is silently active and node IS found.
|
|
55
|
+
|
|
56
|
+
Root upstream issue: actions/toolkit#804 — concurrent toolcache writes create
|
|
57
|
+
partial extracts that pass path existence checks.
|
|
58
|
+
fix: |
|
|
59
|
+
Add an explicit node --version verification step immediately after setup-node
|
|
60
|
+
and fail the job if the version does not match. This is the external workaround
|
|
61
|
+
used by affected projects (e.g., n8n-io/n8n PR #30849).
|
|
62
|
+
|
|
63
|
+
Until actions/setup-node ships a built-in post-install assertion, this
|
|
64
|
+
workflow-level guard is the only reliable way to catch the silent fallthrough.
|
|
65
|
+
fix_code:
|
|
66
|
+
- language: yaml
|
|
67
|
+
label: 'Add explicit version verification after setup-node'
|
|
68
|
+
code: |
|
|
69
|
+
steps:
|
|
70
|
+
- uses: actions/setup-node@v6
|
|
71
|
+
with:
|
|
72
|
+
node-version: '24'
|
|
73
|
+
cache: 'npm'
|
|
74
|
+
|
|
75
|
+
- name: Verify Node.js version
|
|
76
|
+
shell: bash
|
|
77
|
+
run: |
|
|
78
|
+
ACTUAL=$(node --version)
|
|
79
|
+
EXPECTED_MAJOR="24"
|
|
80
|
+
if [[ "$ACTUAL" != v${EXPECTED_MAJOR}.* ]]; then
|
|
81
|
+
echo "::error::setup-node was asked for Node ${EXPECTED_MAJOR} but \`node --version\` reports $ACTUAL"
|
|
82
|
+
echo "::error::This usually indicates a transient download failure or partial toolcache extract."
|
|
83
|
+
exit 1
|
|
84
|
+
fi
|
|
85
|
+
echo "Node.js version confirmed: $ACTUAL"
|
|
86
|
+
|
|
87
|
+
- name: Install dependencies
|
|
88
|
+
run: npm ci
|
|
89
|
+
- language: yaml
|
|
90
|
+
label: 'Pin to exact patch version to reduce toolcache misses'
|
|
91
|
+
code: |
|
|
92
|
+
steps:
|
|
93
|
+
- uses: actions/setup-node@v6
|
|
94
|
+
with:
|
|
95
|
+
node-version: '24.15.0' # exact pin reduces manifest/toolcache lookup failures
|
|
96
|
+
cache: 'npm'
|
|
97
|
+
|
|
98
|
+
- name: Verify Node.js version (belt-and-suspenders)
|
|
99
|
+
run: |
|
|
100
|
+
node --version | grep -E '^v24\.15\.0$' || { echo "::error::Wrong Node version: expected v24.15.0"; exit 1; }
|
|
101
|
+
prevention:
|
|
102
|
+
- 'Always verify node --version matches the requested major after setup-node, especially in workflows that depend on Node.js 24+ features or native modules.'
|
|
103
|
+
- 'Pin to an exact patch version (e.g. 24.15.0) rather than a range (24.x) to avoid unexpected toolcache miss fallbacks.'
|
|
104
|
+
- 'If you see "Attempting to download X.Y.Z..." followed by an unusually long pause in setup-node output, the download may have stalled and the fallback path may be active.'
|
|
105
|
+
- 'Watch setup-node releases for a built-in post-install assertion fix (tracked in actions/setup-node#1556 and actions/toolkit#804).'
|
|
106
|
+
docs:
|
|
107
|
+
- url: 'https://github.com/actions/setup-node/issues/1556'
|
|
108
|
+
label: 'setup-node #1556 — setup-node silently falls through to runner-baked Node on download/extract failure'
|
|
109
|
+
- url: 'https://github.com/actions/toolkit/issues/804'
|
|
110
|
+
label: 'actions/toolkit #804 — Concurrent toolcache writes cause partial extracts on multi-tenant runners'
|
|
111
|
+
- url: 'https://github.com/n8n-io/n8n/pull/30849'
|
|
112
|
+
label: 'n8n-io/n8n PR #30849 — External Verify Node.js Version workaround'
|
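The mechanism described in root_cause (core.addPath() being called with an empty or stale directory) suggests a complementary check to the version comparison: confirm that the `node` found on PATH actually comes from the tool cache. The sketch below is a hypothetical helper. The name `node_from_toolcache` is invented for illustration, and it assumes that setup-node installs under the directory given by `RUNNER_TOOL_CACHE` while the runner-baked binary lives outside that directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of setup-node): succeeds only when the given
# node binary path is located under the given tool cache directory.
node_from_toolcache() {
  local node_path="${1:-}" cache_dir="${2:-}"
  cache_dir="${cache_dir%/}"   # tolerate a trailing slash
  # An empty path or an empty cache directory can never be a match.
  [ -n "$node_path" ] && [ -n "$cache_dir" ] || return 1
  case "$node_path" in
    "$cache_dir"/*) return 0 ;;  # binary resolved from inside the tool cache
    *) return 1 ;;               # binary resolved from somewhere else
  esac
}
```

A step placed after setup-node could call it as `node_from_toolcache "$(command -v node)" "$RUNNER_TOOL_CACHE" || { echo "::error::node on PATH is not from the tool cache"; exit 1; }`. This does not replace the version check; it only catches the case where PATH still resolves to a different installation.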
package/package.json
CHANGED