@htekdev/actions-debugger 1.0.109 → 1.0.111

@@ -0,0 +1,93 @@
1
+ id: known-unsolved-058
2
+ title: 'Ephemeral JIT Runner Reports "Lost Communication" Despite Successful Job Completion'
3
+ category: known-unsolved
4
+ severity: error
5
+ tags:
6
+ - self-hosted
7
+ - ephemeral
8
+ - jit-runner
9
+ - lost-communication
10
+ - false-positive
11
+ - broker
12
+ patterns:
13
+ - regex: 'The self-hosted runner lost communication with the server'
14
+ flags: 'i'
15
+ - regex: 'messageQueueLoopTokenSource|Stop message queue looping'
16
+ flags: 'i'
17
+ error_messages:
18
+ - 'The self-hosted runner lost communication with the server'
19
+ - 'GET request to broker.actions.githubusercontent.com/message ... has been cancelled'
20
+ - 'TaskCanceledException: The operation was canceled'
21
+ root_cause: |
22
+ In Runner.cs line 576, after a one-time-use (ephemeral/JIT) job completes, the message queue
23
+ is cancelled immediately with zero grace period:
24
+
25
+ messageQueueLoopTokenSource.Cancel();
26
+
27
+ This tears down the in-flight broker long-poll (GET broker.actions.githubusercontent.com/message)
28
+ immediately after CompleteJobAsync returns. GitHub's broker health monitor detects the TCP
29
+ disconnect and flags the runner as "lost communication" — racing against the pipeline service
30
+ that just received the successful completion.
31
+
32
+ Two independent GitHub backend systems race:
33
+ 1. Pipeline service — received CompleteJobAsync, knows the job succeeded
34
+ 2. Broker health monitor — sees TCP disconnect, flags runner as "lost communication"
35
+
36
+ When the broker monitor wins (which happens on ~5-10% of short jobs), the UI shows
37
+ "The self-hosted runner lost communication with the server" even though the worker exited
38
+ with code 100 (success) and the runner itself exited with return code 0.
39
+
40
+ The runner logs show all events at identical timestamps — zero delay between
41
+ "Received job status event. JobState: Online" and "messageQueueLoopTokenSource.Cancel()".
42
+
43
+ No user-side configuration can fix this. A code change to Runner.cs is required
44
+ (add ~5s grace delay before cancel). Source: actions/runner#4309 (March 2026, open bug,
45
+ affects runner v2.331.0+, ~5-10% failure rate on short ephemeral jobs).
46
+ fix: |
47
+ There is no user-side configuration fix. The root cause is a zero-grace-period teardown
48
+ in Runner.cs that must be patched by GitHub.
49
+
50
+ Mitigations:
51
+ 1. Verify the diagnostic logs — the job actually succeeded. Check _diag/Runner_*.log
52
+ for "return code 0" and "result: Succeeded" to confirm before concluding failure.
53
+ 2. Retry the failed job — the "lost communication" is a false positive; re-running
54
+ produces a clean success.
55
+ 3. Do not use ephemeral JIT runners as required status checks until actions/runner#4309
56
+ is resolved, or implement a workflow-level retry wrapper.
57
+ 4. Alert on job result=="failure" not on "lost communication" text alone — false positives
58
+ from this race should not page on-call engineers.
59
+ 5. Batch small jobs — the race is more likely on jobs shorter than 60 seconds. Combining
60
+ work into longer-running jobs reduces false positive frequency.
61
+ fix_code:
62
+ - language: yaml
63
+ label: 'Confirm false positive by reading runner diagnostic log before retrying'
64
+ code: |
65
+ # After "lost communication" on an ephemeral JIT runner, check:
66
+ # _diag/Runner_YYYYMMDD-hhmmss-utc.log
67
+ #
68
+ # Successful completion signs (all at identical timestamps):
69
+ # [INFO] finish job request for job {id} with result: Succeeded
70
+ # [INFO] Job X completed with result: Succeeded
71
+ # [INFO] Received job status event. JobState: Online
72
+ # [INFO] Runner execution has finished with return code 0
73
+ #
74
+ # If these appear immediately before TaskCanceledException, the job DID succeed.
75
+ # Simply re-run the workflow — the retry will show a clean green result.
76
+ jobs:
77
+ build:
78
+ runs-on: [self-hosted, ephemeral, linux]
79
+ # Note: any retry-on-failure logic lives in the caller that wraps this job.
80
+ steps:
81
+ - uses: actions/checkout@v4
82
+ - run: make build
83
+ prevention:
84
+ - 'Treat "lost communication" on ephemeral JIT runners as a potential false positive — check _diag/Runner_*.log for "return code 0" before escalating.'
85
+ - 'Do not gate required status checks on ephemeral JIT runners until actions/runner#4309 is resolved upstream.'
86
+ - 'Alert on job result=="failure", not on the "lost communication" message text alone.'
87
+ - 'Batch work into longer jobs (>60s total) to reduce the broker teardown race frequency.'
88
+ - 'Watch actions/runner#4309 for an upstream fix (proposed: 5-second grace period before messageQueueLoopTokenSource.Cancel()).'
89
+ docs:
90
+ - url: 'https://github.com/actions/runner/issues/4309'
91
+ label: 'actions/runner#4309: Ephemeral/JIT runner reports "lost communication" despite successful completion (March 2026)'
92
+ - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners#using-just-in-time-runners'
93
+ label: 'GitHub Docs: Just-in-time (JIT) runners'
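The log check described in mitigation 1 of this entry can be sketched in Python. This is a minimal sketch, assuming the diagnostic log contains the marker strings quoted in the entry ("with result: Succeeded" and "return code 0"); the function name `looks_like_false_positive` is hypothetical and is not part of the runner or of this package.

```python
import re

# Marker strings are taken from the log excerpts quoted in the entry.
# Assumption: the real _diag/Runner_*.log uses exactly this wording.
SUCCESS_MARKERS = (
    re.compile(r"with result: Succeeded"),
    re.compile(r"return code 0\b"),
)


def looks_like_false_positive(diag_log_text: str) -> bool:
    """Return True when every success marker appears in the runner diagnostic log."""
    return all(p.search(diag_log_text) is not None for p in SUCCESS_MARKERS)
```

A caller might read the newest `_diag/Runner_*.log` into a string and only trigger an automatic re-run when this returns True; anything else should be treated as a genuine failure.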
@@ -0,0 +1,117 @@
1
+ id: runner-environment-178
2
+ title: 'ARC Runner v2.332.0 Regression — Container Job GITHUB_ENV and Workspace Permission Denied'
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - arc
7
+ - actions-runner-controller
8
+ - container-job
9
+ - permissions
10
+ - GITHUB_ENV
11
+ - non-root
12
+ - kubernetes
13
+ - regression
14
+ - v2.332
15
+ patterns:
16
+ - regex: 'cannot create /__w/_temp/_runner_file_commands/set_env_[0-9a-f]+: Permission denied'
17
+ flags: 'i'
18
+ - regex: 'cannot create /__w/_temp/_runner_file_commands/add_path_[0-9a-f]+: Permission denied'
19
+ flags: 'i'
20
+ - regex: 'fatal: detected dubious ownership in repository at .+/__w/'
21
+ flags: 'i'
22
+ - regex: '_runner_file_commands.*Permission denied'
23
+ flags: 'i'
24
+ error_messages:
25
+ - "/__w/_temp/36e38446.sh: 5: cannot create /__w/_temp/_runner_file_commands/set_env_7bb88aaa: Permission denied"
26
+ - "fatal: detected dubious ownership in repository at '/__w/repo/repo'"
27
+ - "/__w/_temp/_runner_file_commands/add_path_: Permission denied"
28
+ root_cause: |
29
+ Upgrading Actions Runner Controller (ARC) from runner v2.330.0 to v2.332.0
30
+ introduces a compound regression that breaks container jobs using non-root users.
31
+
32
+ The regression spans two runner releases:
33
+
34
+ 1. v2.331.0 changed the runner container base image from Ubuntu 22.04 to
35
+ Ubuntu 24.04. The newer base ships git 2.43+ which enforces stricter
36
+ safe.directory checks. The mounted workspace volume is owned by the runner
37
+ UID, so a container running as a different non-root UID receives
38
+ "fatal: detected dubious ownership" on any git operation.
39
+
40
+ 2. v2.332.0 bumped container hooks to v0.8.1, which updated workspace
41
+ ownership handling for the runner pod itself — but not for downstream job
42
+ containers. The _runner_file_commands directory under /__w/_temp/ is
43
+ still created with runner UID ownership. When a container step writes to
44
+ $GITHUB_ENV, $GITHUB_OUTPUT, $GITHUB_PATH, or $GITHUB_STEP_SUMMARY via a
45
+ shell redirect, the shell (running as the container's non-root user) cannot
46
+ create the file and exits non-zero.
47
+
48
+ This regression affects:
49
+ - ARC-managed self-hosted runners on Kubernetes (EKS, GKE, AKS, on-prem)
50
+ - Any workflow using `container: image: my-image` with `options: --user <uid>`
51
+ or a non-root USER in the Dockerfile
52
+ - Workflows that previously worked on runner v2.330.0 and below
53
+
54
+ GitHub-hosted runners (ubuntu-latest, etc.) are NOT affected.
55
+ fix: |
56
+ If you cannot immediately pin the runner version, add a workaround step at
57
+ the top of the affected job to fix ownership of the runner file command
58
+ directories. Alternatively, pin ARC runner images to v2.330.0 until an
59
+ upstream fix for container hooks is released.
60
+
61
+ The most robust long-term fix is to explicitly add the workspace to git's
62
+ safe.directory list and pre-create the file command directories with the
63
+ correct ownership.
64
+ fix_code:
65
+ - language: yaml
66
+ label: 'Option A — Pre-create _runner_file_commands with container user ownership'
67
+ code: |
68
+ jobs:
69
+ build:
70
+ runs-on: self-hosted
71
+ container:
72
+ image: my-app:latest
73
+ options: --user 1000
74
+ steps:
75
+ - name: Fix runner file command directory ownership (v2.332.0 workaround)
76
+ # Must run before any step that writes GITHUB_ENV/GITHUB_OUTPUT. chown needs root: under --user 1000 it fails and || true hides the failure.
77
+ run: |
78
+ chown -R 1000:1000 /__w/_temp/_runner_file_commands/ || true
79
+ git config --global --add safe.directory "$GITHUB_WORKSPACE"
80
+ shell: bash
81
+ # Note: requires container image to have chown available as root
82
+ - language: yaml
83
+ label: 'Option B — Pin ARC runner image to v2.330.0 to avoid the regression'
84
+ code: |
85
+ # In your ARC HelmRelease or RunnerDeployment spec:
86
+ # spec:
87
+ # template:
88
+ # spec:
89
+ # containers:
90
+ # - name: runner
91
+ # image: ghcr.io/actions/actions-runner:2.330.0
92
+ - language: yaml
93
+ label: 'Option C — Run container job as root to avoid UID mismatch'
94
+ code: |
95
+ jobs:
96
+ build:
97
+ runs-on: self-hosted
98
+ container:
99
+ image: my-app:latest
100
+ options: --user root # avoid UID mismatch until ARC fix ships
101
+ steps:
102
+ - uses: actions/checkout@v4
103
+ - run: echo "FOO=bar" >> $GITHUB_ENV
104
+ prevention:
105
+ - 'Test ARC runner version upgrades in a staging environment before rolling out to production, including minor-version bumps such as v2.330 → v2.332.'
106
+ - 'As a temporary measure, run container jobs as root if the workflow writes to GITHUB_ENV, GITHUB_OUTPUT, or GITHUB_PATH and a non-root user hits the permission error.'
107
+ - 'Subscribe to actions/runner releases and scan for changes to container-hooks between minor versions.'
108
+ - 'Add a smoke-test workflow that writes to GITHUB_ENV in a non-root container job — run it against each ARC upgrade to catch regressions early.'
109
+ docs:
110
+ - url: 'https://github.com/actions/runner/issues/4302'
111
+ label: 'actions/runner #4302 — v2.332.0: Container jobs fail with permission denied on GITHUB_ENV and workspace'
112
+ - url: 'https://github.com/actions/runner/issues/4131'
113
+ label: 'actions/runner #4131 — Permissions issue on runners v2.330.0 (/home/runner ownership regression)'
114
+ - url: 'https://github.com/actions/runner/issues/4251'
115
+ label: 'actions/runner #4251 — TempDirectoryManager fails to clean temp directory (permission denied on v2.331.0)'
116
+ - url: 'https://github.com/actions/runner-container-hooks/issues/282'
117
+ label: 'runner-container-hooks #282 — Permissions denied on workingDir'
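The smoke test suggested under prevention boils down to one question: can the job's user create a file in the runner file command directory? Below is a minimal sketch of that probe in Python, assuming Python is available in the job container. The helper name `can_create_files` is hypothetical, and the directory to probe is whatever path the error message reports.

```python
import tempfile


def can_create_files(directory: str) -> bool:
    """Return True if the current user can create a file in `directory`."""
    try:
        # Creating (and automatically deleting) a real file exercises the same
        # permission that a shell redirect into the directory needs.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers PermissionError (UID mismatch) and a missing directory.
        return False
```

In a workflow step this could be called with `/__w/_temp/_runner_file_commands` and the job failed early with a clear message when it returns False, instead of failing later on the first write to `$GITHUB_ENV`.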
@@ -0,0 +1,109 @@
1
+ id: runner-environment-177
2
+ title: 'Node.js 24.16.0 Toolcache Update Breaks Puppeteer, Playwright, and Cypress Browser Install'
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - nodejs
7
+ - node24
8
+ - puppeteer
9
+ - playwright
10
+ - cypress
11
+ - toolcache
12
+ - browser-install
13
+ - extract-zip
14
+ - regression
15
+ patterns:
16
+ - regex: 'Could not find Chrome \(ver\. \d+\.\d+\.\d+'
17
+ flags: 'i'
18
+ - regex: 'npx puppeteer browsers install .+ exited with code [^0]'
19
+ flags: 'i'
20
+ - regex: 'browserType\.launch: Executable doesn.t exist at .+/chromium'
21
+ flags: 'i'
22
+ - regex: 'Cannot find browser at path.*\.cache/puppeteer'
23
+ flags: 'i'
24
+ - regex: 'Failed to install browsers.*extract.*zip'
25
+ flags: 'i'
26
+ error_messages:
27
+ - "Error: Could not find Chrome (ver. 146.0.7680.153). This can occur if either 1. you did not perform an installation before running the script (e.g. `npx puppeteer browsers install chrome-headless-shell`) or 2. your cache path is incorrectly configured"
28
+ - "browserType.launch: Executable doesn't exist at /home/runner/.cache/ms-playwright/chromium-1169/chrome-linux/chrome"
29
+ - "Error: Failed to install browsers at /home/runner/.cache/ms-playwright"
30
+ root_cause: |
31
+ The ubuntu-24.04 image update from 20260518.149.1 → 20260525.161.1 bumped the
32
+ cached Node.js toolcache version from 24.15.0 to 24.16.0. Node.js 24.16.0
33
+ contains an upstream regression in the readable-stream.destroy() lifecycle
34
+ that breaks yauzl (a ZIP reading library) and extract-zip which depends on it.
35
+
36
+ The affected tools all use @puppeteer/browsers or equivalent ZIP-based browser
37
+ downloaders internally:
38
+ - Puppeteer: `npx puppeteer browsers install chrome-headless-shell`
39
+ - Playwright: `npx playwright install chromium` / `npx playwright install --with-deps`
40
+ - Cypress: `npx cypress install`
41
+
42
+ The ZIP archive download completes successfully and extraction
43
+ begins, but the stream destroy bug causes yauzl to exit before all entries
44
+ are written. The browser binary never lands on disk. The browser installer
45
+ exits 0 (or a non-descriptive exit code) and the next step fails with
46
+ "Could not find Chrome at path..." or a missing executable error.
47
+
48
+ Root upstream issues:
49
+ - https://github.com/nodejs/node/issues/63487 (yauzl/extract-zip hang / partial extraction)
50
+ - https://github.com/nodejs/node/issues/63638 (libuv regression on Windows)
51
+
52
+ Because actions/setup-node resolves to the cached Node.js 24.16.0 when
53
+ node-version: '24' or node-version: '24.x' is specified (or when using the
54
+ default runner-baked Node 24 on ubuntu-24.04), every workflow that installs
55
+ a browser via these tools is affected until Node.js 24.17.0 ships a fix.
56
+ fix: |
57
+ Pin Node.js to 24.15.0 (the last known-good version) via actions/setup-node
58
+ until Node.js 24.17.0 is published and rolled into the runner toolcache.
59
+
60
+ If your workflow does not strictly require Node.js 24, fall back to Node.js 22
61
+ (the runner image default), which is unaffected by this regression.
62
+
63
+ Do NOT use node-version: '24' or node-version: 'latest' until the upstream
64
+ fix lands in Node.js 24.17.0.
65
+ fix_code:
66
+ - language: yaml
67
+ label: 'Option A — Pin Node.js to 24.15.0 (last known-good release)'
68
+ code: |
69
+ steps:
70
+ - uses: actions/setup-node@v6
71
+ with:
72
+ node-version: '24.15.0' # pin until Node 24.17.0 fixes readable-stream regression
73
+ cache: 'npm'
74
+
75
+ - name: Install Puppeteer Chrome
76
+ run: npx puppeteer browsers install chrome-headless-shell
77
+ - language: yaml
78
+ label: 'Option B — Fall back to Node.js 22 (unaffected)'
79
+ code: |
80
+ steps:
81
+ - uses: actions/setup-node@v6
82
+ with:
83
+ node-version: '22' # LTS, not affected by readable-stream regression
84
+ cache: 'npm'
85
+
86
+ - name: Install Playwright browsers
87
+ run: npx playwright install --with-deps chromium
88
+ - language: yaml
89
+ label: 'Option C — Pin Playwright install to avoid extract-zip entirely (Playwright only)'
90
+ code: |
91
+ steps:
92
+ - uses: actions/setup-node@v6
93
+ with:
94
+ node-version: '24.15.0'
95
+ - uses: microsoft/playwright-github-action@v1 # assumes browsers are pre-installed in the image; this action is deprecated upstream, so verify before relying on it
96
+ prevention:
97
+ - 'Pin node-version to a specific patch (e.g. 24.15.0) rather than a major/minor range in workflows that install browser binaries via npx commands.'
98
+ - 'After bumping Node.js versions, verify browser install steps succeed by checking the binary path explicitly with `ls -la ~/.cache/puppeteer` or equivalent before running tests.'
99
+ - 'Subscribe to actions/runner-images releases to catch toolcache updates that may include Node.js patch regressions.'
100
+ - 'For Playwright, prefer `npx playwright install --with-deps` combined with an explicit Node.js pin rather than relying on runner-image cached Node versions.'
101
+ docs:
102
+ - url: 'https://github.com/actions/runner-images/issues/14173'
103
+ label: 'runner-images #14173 — Puppeteer broken in Ubuntu 24.04 version 20260525.161.1'
104
+ - url: 'https://github.com/nodejs/node/issues/63487'
105
+ label: 'nodejs/node #63487 — yauzl/extract-zip hang and partial extraction (readable-stream regression)'
106
+ - url: 'https://github.com/nodejs/node/issues/63638'
107
+ label: 'nodejs/node #63638 — libuv regression in Node.js 24.16.0'
108
+ - url: 'https://github.com/actions/runner-images/releases/tag/ubuntu24%2F20260525.161'
109
+ label: 'runner-images ubuntu24/20260525.161 release — Node.js toolcache bumped 24.15.0 → 24.16.0'
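Because the installer can exit 0 without writing the browser binary, an explicit post-install check is more reliable than the exit code. The sketch below walks a cache directory looking for a non-empty executable file. The helper name `browser_binary_present` and the default file names are assumptions for illustration, so adjust them to the browser your tool installs.

```python
import os


def browser_binary_present(cache_dir, names=("chrome", "chrome-headless-shell")):
    """Return True if a non-empty, executable file named in `names` exists under cache_dir."""
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            if name in names:
                path = os.path.join(root, name)
                # A partially extracted archive may leave no file or an empty file.
                if os.path.getsize(path) > 0 and os.access(path, os.X_OK):
                    return True
    return False
```

Calling it with the expanded `~/.cache/puppeteer` or `~/.cache/ms-playwright` path right after the install step, and exiting non-zero on False, turns the silent partial extraction into an immediate failure.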
@@ -0,0 +1,102 @@
1
+ id: runner-environment-179
2
+ title: 'Runner Registration Token Endpoint Returns 502 Under Concurrent Spawn Bursts'
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - self-hosted
7
+ - ephemeral
8
+ - registration
9
+ - autoscaler
10
+ - burst
11
+ - 502
12
+ patterns:
13
+ - regex: 'HTTP Error 502: Bad Gateway'
14
+ flags: 'i'
15
+ - regex: 'registration-token.*502|502.*Bad Gateway.*registration'
16
+ flags: 'i'
17
+ - regex: 'HTTPError.*502|urlopen error.*502'
18
+ flags: 'i'
19
+ error_messages:
20
+ - 'HTTP Error 502: Bad Gateway'
21
+ - 'urllib.error.HTTPError: HTTP Error 502: Bad Gateway'
22
+ - 'POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token — 502'
23
+ root_cause: |
24
+ GitHub REST endpoint:
25
+ POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token
26
+
27
+ returns HTTP 502 Bad Gateway under burst load when many ephemeral runners attempt to
28
+ register concurrently (e.g., triggered by webhook-driven autoscalers on workflow_job.queued
29
+ events). The burst saturates the GitHub backend's registration token issuer.
30
+
31
+ The reference runner setup scripts (run.sh / config.sh flow) do NOT retry 5xx responses —
32
+ a single transient 502 causes the container to exit immediately before registration completes.
33
+ For ephemeral autoscaler setups (Modal, Lambda, Kubernetes pod-per-job), the container slot
34
+ is burned and no runner is available for the queued job.
35
+
36
+ Phantom containers accumulate: the autoscaler billed for the container, GitHub never received
37
+ a registration, and the autoscaler's max_containers cap can be held by dead slots — stalling
38
+ the merge queue for hours.
39
+
40
+ Source: actions/runner#4399 (May 2026, open issue, observed during 20+ concurrent registrations
41
+ — ~21 phantom containers during a webhook-driven burst at ~16:00-17:00 UTC on 2026-05-04).
42
+ fix: |
43
+ 1. Add exponential backoff to your registration token call (4 attempts, 2^n second delays).
44
+ 2. Stagger autoscaler spawns with per-container random jitter (0-10 seconds) to break up
45
+ concurrent registration bursts before they hit the endpoint simultaneously.
46
+ 3. Switch to JIT runner tokens — they are pre-assigned per job and do not require a burst-
47
+ sensitive registration token call:
48
+ POST /repos/{owner}/{repo}/actions/runners/generate-jitconfig
49
+ 4. Reduce max_concurrent_spawns in your autoscaler to limit peak registration load.
50
+ 5. Implement phantom container detection: health-check containers shortly after spawn and
51
+ kill any that failed to register within N seconds.
52
+ fix_code:
53
+ - language: yaml
54
+ label: 'Python — retry-with-backoff on 502 when minting registration token'
55
+ code: |
56
+ import time, json, urllib.request, urllib.error
57
+
58
+ def mint_registration_token(owner, repo, github_token):
59
+ url = f"https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token"
60
+ headers = {
61
+ "Authorization": f"token {github_token}",
62
+ "Accept": "application/vnd.github+json",
63
+ "X-GitHub-Api-Version": "2022-11-28",
64
+ }
65
+ for attempt in range(4):
66
+ try:
67
+ req = urllib.request.Request(url, method="POST", headers=headers)
68
+ with urllib.request.urlopen(req, timeout=15) as resp:
69
+ return json.load(resp)["token"]
70
+ except urllib.error.HTTPError as e:
71
+ if 500 <= e.code < 600 and attempt < 3:
72
+ delay = 2 ** attempt # 1s, 2s, 4s
73
+ print(f"Registration token {e.code}, retrying in {delay}s (attempt {attempt+1}/4)...")
74
+ time.sleep(delay)
75
+ continue
76
+ raise
77
+ - language: yaml
78
+ label: 'Autoscaler — add random jitter per container spawn to break up burst'
79
+ code: |
80
+ import asyncio, random
81
+
82
+ async def handle_workflow_job_queued(event):
83
+ # Stagger spawns: random delay 0-10s reduces simultaneous token requests
84
+ jitter = random.uniform(0, 10)
85
+ await asyncio.sleep(jitter)
86
+ token = await asyncio.to_thread(mint_registration_token, owner, repo, github_token)  # sync helper; run it off the event loop
87
+ await spawn_ephemeral_container(token)
88
+ prevention:
89
+ - 'Add exponential backoff (4 attempts, 2^n seconds) for 5xx responses in all registration token calls.'
90
+ - 'Add random per-container spawn jitter (0-10s) in your autoscaler to break up concurrent registration bursts.'
91
+ - 'Switch to JIT runner tokens (generate-jitconfig) for burst autoscaler workloads — they do not hit the burst-sensitive registration endpoint.'
92
+ - 'Monitor for phantom containers: any container that does not show as registered within 60 seconds of spawn should be terminated.'
93
+ - 'Cap max_concurrent_spawns conservatively (≤10) and tune based on observed 502 rates.'
94
+ docs:
95
+ - url: 'https://github.com/actions/runner/issues/4399'
96
+ label: 'actions/runner#4399: Registration token endpoint returns 502 in bursts (May 2026)'
97
+ - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners#create-a-registration-token-for-a-repository'
98
+ label: 'GitHub REST API: Create a registration token for a repository'
99
+ - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners#create-configuration-for-a-just-in-time-runner-for-a-repository'
100
+ label: 'GitHub REST API: Create JIT runner config (avoids burst-sensitive registration token)'
101
+ - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners'
102
+ label: 'GitHub Docs: Autoscaling with self-hosted runners'
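Fix item 5 (phantom container detection) has no example in the entry, so here is one possible shape for it. This is a sketch under stated assumptions: the autoscaler records a spawn timestamp per container ID and can obtain the set of IDs that have registered as runners. The function name and its parameters are hypothetical, and the 60-second default comes from the prevention list.

```python
from typing import Dict, List, Set


def find_phantom_containers(
    spawned_at: Dict[str, float],
    registered: Set[str],
    now: float,
    grace_seconds: float = 60.0,
) -> List[str]:
    """Return IDs of containers spawned more than grace_seconds ago that never registered."""
    return sorted(
        container_id
        for container_id, spawn_time in spawned_at.items()
        if container_id not in registered and now - spawn_time > grace_seconds
    )
```

The returned IDs are candidates for termination, which frees slots counted against the autoscaler's container cap. How the `registered` set is obtained is left to the caller.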
@@ -0,0 +1,127 @@
1
+ id: silent-failures-098
2
+ title: 'Matrix Array Output Nested in Object Property Serializes as "Array" String'
3
+ category: silent-failures
4
+ severity: silent-failure
5
+ tags:
6
+ - matrix
7
+ - fromJSON
8
+ - job-outputs
9
+ - dynamic-matrix
10
+ - strategy
11
+ - serialization
12
+ patterns:
13
+ - regex: '\bArray\b'
14
+ flags: ''
15
+ - regex: 'fromJSON\s*\(.*outputs\.[a-zA-Z_]+'
16
+ flags: 'i'
17
+ error_messages:
18
+ - 'rhel Array'
19
+ - 'debian Array'
20
+ - 'distro = Array'
21
+ - 'component.distro: Array'
22
+ root_cause: |
23
+ When a job output containing a JSON array string (e.g. '["el8","el9"]') is passed via
24
+ fromJSON() as a VALUE inside a matrix OBJECT property, the GitHub Actions expression engine
25
+ coerces the array to a string using its own type-conversion rules, which produces
26
+ the literal word "Array" instead of the array elements.
27
+
28
+ This happens because the matrix strategy parser evaluates object property values as strings
29
+ at expansion time. When fromJSON() returns an array, the matrix serialization path
30
+ stringifies it instead of expanding into multiple jobs.
31
+
32
+ The bug only triggers when the array is nested INSIDE an object:
33
+
34
+ # BROKEN — array inside object property
35
+ strategy:
36
+ matrix:
37
+ component:
38
+ - name: rhel
39
+ distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
40
+ # ^ distro becomes the string "Array", not ["el8","el9"]
41
+
42
+ The same fromJSON() call works correctly at the TOP LEVEL of the matrix:
43
+
44
+ # WORKS — array at top level dimension
45
+ strategy:
46
+ matrix:
47
+ distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
48
+
49
+ The job does NOT fail — it silently runs with matrix.component.distro set to "Array",
50
+ producing wrong build targets with no error or warning.
51
+
52
+ Source: actions/runner#3794 (April 2025, labeled bug, open as of April 2026 after stale
53
+ cycle — author confirmed still broken).
54
+ fix: |
55
+ Restructure the matrix to avoid placing fromJSON() array outputs inside object properties.
56
+
57
+ Option 1 — Pre-compute the full cross-product as a JSON array of objects in the matrix-
58
+ generating job and pass it as a single fromJSON() at the include level.
59
+
60
+ Option 2 — Use separate top-level matrix dimensions instead of grouping into objects.
61
+
62
+ If object grouping is required for semantic reasons, flatten the arrays at generation time
63
+ using jq and emit a fully-expanded JSON array of objects from the setup step.
64
+ fix_code:
65
+ - language: yaml
66
+ label: 'Workaround — pre-compute full matrix as JSON array, expand via include'
67
+ code: |
68
+ jobs:
69
+ define-matrix:
70
+ runs-on: ubuntu-latest
71
+ outputs:
72
+ matrix: ${{ steps.build.outputs.matrix }}
73
+ steps:
74
+ - id: build
75
+ run: |
76
+ # Build the full cross-product as a JSON array of objects
77
+ matrix=$(jq -nc '[
78
+ {"arch":"x86_64","runner":"ubuntu-24.04","component":"rhel","distro":"el8"},
79
+ {"arch":"x86_64","runner":"ubuntu-24.04","component":"rhel","distro":"el9"},
80
+ {"arch":"x86_64","runner":"ubuntu-24.04","component":"debian","distro":"focal"},
81
+ {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"rhel","distro":"el8"},
82
+ {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"rhel","distro":"el9"},
83
+ {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"debian","distro":"focal"}
84
+ ]')
85
+ echo "matrix=$matrix" >> "$GITHUB_OUTPUT"
86
+
87
+ build:
88
+ needs: define-matrix
89
+ runs-on: ${{ matrix.runner }}
90
+ strategy:
91
+ matrix:
92
+ include: ${{ fromJSON(needs.define-matrix.outputs.matrix) }}
93
+ steps:
94
+ - run: echo "${{ matrix.component }} ${{ matrix.distro }} on ${{ matrix.arch }}"
95
+ - language: yaml
96
+ label: 'Debug step — detect "Array" serialization early'
97
+ code: |
98
+ jobs:
99
+ build:
100
+ needs: define-matrix
101
+ runs-on: ubuntu-latest
102
+ strategy:
103
+ matrix:
104
+ # Use top-level dimensions — arrays work at this level
105
+ distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
106
+ arch: [x86_64, aarch64]
107
+ steps:
108
+ - name: Verify matrix values are not "Array"
109
+ run: |
110
+ DISTRO="${{ matrix.distro }}"
111
+ if [ "$DISTRO" = "Array" ]; then
112
+ echo "::error::matrix.distro serialized as 'Array' — fromJSON() inside object bug"
113
+ exit 1
114
+ fi
115
+ echo "distro=$DISTRO"
116
+ prevention:
117
+ - 'Never place fromJSON() array outputs as values inside matrix object properties — use top-level dimensions or pre-computed include arrays.'
118
+ - 'Add a guard step that fails if any matrix value equals the string "Array" to catch this bug early.'
119
+ - 'Generate complex multi-dimensional matrix shapes in the setup job as a single JSON array, then expand via include: ${{ fromJSON(...) }}.'
120
+ - 'Track actions/runner#3794 for an upstream fix to the matrix object property serialization bug.'
121
+ docs:
122
+ - url: 'https://github.com/actions/runner/issues/3794'
123
+ label: 'actions/runner#3794: array outputs not understood by matrix when nested inside object (April 2025, open)'
124
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/running-variations-of-jobs-in-a-workflow'
125
+ label: 'GitHub Docs: Using a matrix for your jobs'
126
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/evaluate-expressions-in-workflows-and-actions#fromjson'
127
+ label: 'GitHub Docs: fromJSON expression function'
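Option 1 in this entry (pre-computing the full cross-product) does not have to be written out by hand as in the jq example. The sketch below generates the same six objects in Python. The table names `RUNNERS` and `DISTROS` and the function `build_matrix` are illustrative; the values mirror the jq example in the entry.

```python
import json

# Values mirror the hand-written jq example in the entry; names are illustrative.
RUNNERS = {"x86_64": "ubuntu-24.04", "aarch64": "ubuntu-24.04-arm"}
DISTROS = {"rhel": ["el8", "el9"], "debian": ["focal"]}


def build_matrix(runners=RUNNERS, distros=DISTROS):
    """Expand arch x component x distro into a flat list of include objects."""
    return [
        {"arch": arch, "runner": runner, "component": component, "distro": distro}
        for arch, runner in runners.items()
        for component, names in distros.items()
        for distro in names
    ]


if __name__ == "__main__":
    # Compact single-line JSON, suitable for appending to the GITHUB_OUTPUT file.
    print("matrix=" + json.dumps(build_matrix(), separators=(",", ":")))
```

Every value in the emitted objects is a plain string, so nothing is left for the expression engine to stringify as "Array" when the result is expanded through `include`.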
@@ -0,0 +1,112 @@
1
+ id: silent-failures-097
2
+ title: 'setup-node Silently Uses Runner-Baked Node Version When Download Fails — Wrong Version Active'
3
+ category: silent-failures
4
+ severity: silent-failure
5
+ tags:
6
+ - setup-node
7
+ - nodejs
8
+ - download-failure
9
+ - silent-failure
10
+ - wrong-version
11
+ - toolcache
12
+ - hosted-runner
13
+ - fallthrough
14
+ patterns:
15
+ - regex: 'Attempting to download \d+\.\d+\.\d+\.\.\.'
16
+ flags: 'i'
17
+ - regex: 'Cannot find module.*engines.*node.*>=\s*24'
18
+ flags: 'i'
19
+ - regex: 'The engine .node. is incompatible with this module\. Expected version .+\. Got .2[0-2]\.'
20
+ flags: 'i'
21
+ - regex: 'ELIFECYCLE.*node --version.*v2[0-2]\.'
22
+ flags: 'i'
23
+ error_messages:
24
+ - "Attempting to download 24.15.0..."
25
+ - "error: The engine 'node' is incompatible with this module. Expected version '>=24.0.0'. Got '22.14.0'"
26
+ - "npm ERR! code ELIFECYCLE"
27
+ - "Error: Cannot find module 'node:crypto' (Node.js version too old)"
28
+ root_cause: |
29
+ When actions/setup-node's download or extract path fails transiently —
30
+ network blip, manifest miss, partial extract from a concurrent toolcache
31
+ write, or a transient S3/CDN cache failure — the action does not surface the
32
+ error. Instead, it falls back to a secondary download path. If that secondary
33
+ path also fails or returns an unusable toolPath, setup-node adds an empty or
34
+ incorrect directory to PATH and exits 0 (success).
35
+
36
+ Because the setup-node step succeeds, the runner-baked Node.js version
37
+ (e.g. v22.x on ubuntu-latest after the Node 20 removal) remains on PATH.
38
+ Downstream steps execute against the wrong Node.js major version with no
39
+ indication that setup-node did not install the requested version.
40
+
41
+ The mechanism (in official_builds.ts, as of 2026-05-21):
42
+ - Download/extract errors are logged via core.info(), not core.warning()
43
+ or core.error(), so they are buried in normal output
44
+ - After the fallback download attempt, there is no post-condition check
45
+ that verifies node --version matches the requested version
46
+ - core.addPath() is called even if toolPath/bin is empty or stale
47
+
48
+ Reported failing run: https://github.com/n8n-io/n8n/actions/runs/26100630929
49
+ The run showed "Attempting to download 24.15.0..." → 33 seconds of silence →
50
+ next step ran against runner-baked v20.20.0 with no error from setup-node.
51
+
52
+ This is distinct from silent-failures-028 which covers self-hosted runners
53
+ where node is completely absent (node: not found). This entry covers hosted
54
+ runners where the wrong version is silently active and node IS found.
55
+
56
+ Root upstream issue: actions/toolkit#804 — concurrent toolcache writes create
57
+ partial extracts that pass path existence checks.
58
+ fix: |
59
+ Add an explicit node --version verification step immediately after setup-node
60
+ and fail the job if the version does not match. This is the external workaround
61
+ used by affected projects (e.g., n8n/n8n PR #30849).
62
+
63
+ Until actions/setup-node ships a built-in post-install assertion, this
64
+ workflow-level guard is the only reliable way to catch the silent fallthrough.
65
+ fix_code:
66
+ - language: yaml
67
+ label: 'Add explicit version verification after setup-node'
68
+ code: |
69
+ steps:
70
+ - uses: actions/setup-node@v6
71
+ with:
72
+ node-version: '24'
73
+ cache: 'npm'
74
+
75
+ - name: Verify Node.js version
76
+ shell: bash
77
+ run: |
78
+ ACTUAL=$(node --version)
79
+ EXPECTED_MAJOR="24"
80
+ if [[ "$ACTUAL" != v${EXPECTED_MAJOR}.* ]]; then
81
+ echo "::error::Requested Node ${EXPECTED_MAJOR} via setup-node but \`node --version\` reports $ACTUAL"
82
+ echo "::error::This usually indicates a transient download failure or partial toolcache extract."
83
+ exit 1
84
+ fi
85
+ echo "Node.js version confirmed: $ACTUAL"
86
+
87
+ - name: Install dependencies
88
+ run: npm ci
89
+ - language: yaml
90
+ label: 'Pin to exact patch version to reduce toolcache misses'
91
+ code: |
92
+ steps:
93
+ - uses: actions/setup-node@v6
94
+ with:
95
+ node-version: '24.15.0' # exact pin reduces manifest/toolcache lookup failures
96
+ cache: 'npm'
97
+
98
+ - name: Verify Node.js version (belt-and-suspenders)
99
+ run: |
100
+ node --version | grep -E '^v24\.15\.' || (echo "Wrong Node version" && exit 1)
101
+ prevention:
102
+ - 'Always verify node --version matches the requested major after setup-node, especially in workflows that depend on Node.js 24+ features or native modules.'
103
+ - 'Pin to an exact patch version (e.g. 24.15.0) rather than a range (24.x) to avoid unexpected toolcache miss fallbacks.'
104
+ - 'If you see "Attempting to download X.Y.Z..." followed by an unusually long pause in setup-node output, the download may have stalled and the fallback path may be active.'
105
+ - 'Watch setup-node releases for a built-in post-install assertion fix (tracked in actions/setup-node#1556 and actions/toolkit#804).'
106
+ docs:
107
+ - url: 'https://github.com/actions/setup-node/issues/1556'
108
+ label: 'setup-node #1556 — setup-node silently falls through to runner-baked Node on download/extract failure'
109
+ - url: 'https://github.com/actions/toolkit/issues/804'
110
+ label: 'actions/toolkit #804 — Concurrent toolcache writes cause partial extracts on multi-tenant runners'
111
+ - url: 'https://github.com/n8n-io/n8n/pull/30849'
112
+ label: 'n8n/n8n PR #30849 — External Verify Node.js Version workaround'
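The bash guard in this entry compares the major version with a glob pattern. The same post-condition can be expressed as a small Python function, which may be easier to unit-test or reuse across workflows. This is a minimal sketch; the function name `node_major_matches` is hypothetical.

```python
import re


def node_major_matches(version_output: str, expected_major: int) -> bool:
    """Return True if `node --version` output (e.g. 'v24.15.0') has the expected major."""
    match = re.match(r"v(\d+)\.", version_output.strip())
    return match is not None and int(match.group(1)) == expected_major
```

A step could capture the output of `node --version`, pass it to this function with the major requested from setup-node, and exit non-zero when it returns False.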
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@htekdev/actions-debugger",
3
- "version": "1.0.109",
3
+ "version": "1.0.111",
4
4
  "description": "65+ real GitHub Actions errors, queryable by agents. CLI + MCP server + Copilot skills + error database.",
5
5
  "type": "module",
6
6
  "main": "./dist/index.js",