@htekdev/actions-debugger 1.0.110 → 1.0.111
@@ -0,0 +1,93 @@
+id: known-unsolved-058
+title: 'Ephemeral JIT Runner Reports "Lost Communication" Despite Successful Job Completion'
+category: known-unsolved
+severity: error
+tags:
+  - self-hosted
+  - ephemeral
+  - jit-runner
+  - lost-communication
+  - false-positive
+  - broker
+patterns:
+  - regex: 'The self-hosted runner lost communication with the server'
+    flags: 'i'
+  - regex: 'messageQueueLoopTokenSource|Stop message queue looping'
+    flags: 'i'
+error_messages:
+  - 'The self-hosted runner lost communication with the server'
+  - 'GET request to broker.actions.githubusercontent.com/message ... has been cancelled'
+  - 'TaskCanceledException: The operation was canceled'
+root_cause: |
+  In Runner.cs line 576, after a one-time-use (ephemeral/JIT) job completes, the message queue
+  is cancelled immediately with zero grace period:
+
+      messageQueueLoopTokenSource.Cancel();
+
+  This tears down the in-flight broker long-poll (GET broker.actions.githubusercontent.com/message)
+  immediately after CompleteJobAsync returns. GitHub's broker health monitor detects the TCP
+  disconnect and flags the runner as "lost communication" — racing against the pipeline service
+  that just received the successful completion.
+
+  Two independent GitHub backend systems race:
+  1. Pipeline service — received CompleteJobAsync, knows the job succeeded
+  2. Broker health monitor — sees TCP disconnect, flags runner as "lost communication"
+
+  When the broker monitor wins (which happens on ~5-10% of short jobs), the UI shows
+  "The self-hosted runner lost communication with the server" even though the worker exited
+  with code 100 (success) and the runner itself exited with return code 0.
+
+  The runner logs show all events at identical timestamps — zero delay between
+  "Received job status event. JobState: Online" and "messageQueueLoopTokenSource.Cancel()".
+
+  No user-side configuration can fix this. A code change to Runner.cs is required
+  (add ~5s grace delay before cancel). Source: actions/runner#4309 (March 2026, open bug,
+  affects runner v2.331.0+, ~5-10% failure rate on short ephemeral jobs).
+fix: |
+  There is no user-side configuration fix. The root cause is a zero-grace-period teardown
+  in Runner.cs that must be patched by GitHub.
+
+  Mitigations:
+  1. Verify the diagnostic logs — the job actually succeeded. Check _diag/Runner_*.log
+     for "return code 0" and "result: Succeeded" to confirm before concluding failure.
+  2. Retry the failed job — the "lost communication" is a false positive; re-running
+     produces a clean success.
+  3. Do not use ephemeral JIT runners as required status checks until actions/runner#4309
+     is resolved, or implement a workflow-level retry wrapper.
+  4. Alert on job result=="failure" not on "lost communication" text alone — false positives
+     from this race should not page on-call engineers.
+  5. Batch small jobs — the race is more likely on jobs shorter than 60 seconds. Combining
+     work into longer-running jobs reduces false positive frequency.
+fix_code:
+  - language: yaml
+    label: 'Confirm false positive by reading runner diagnostic log before retrying'
+    code: |
+      # After "lost communication" on an ephemeral JIT runner, check:
+      # _diag/Runner_YYYYMMDD-hhmmss-utc.log
+      #
+      # Successful completion signs (all at identical timestamps):
+      # [INFO] finish job request for job {id} with result: Succeeded
+      # [INFO] Job X completed with result: Succeeded
+      # [INFO] Received job status event. JobState: Online
+      # [INFO] Runner execution has finished with return code 0
+      #
+      # If these appear immediately before TaskCanceledException, the job DID succeed.
+      # Simply re-run the workflow — the retry will show a clean green result.
+      jobs:
+        build:
+          runs-on: [self-hosted, ephemeral, linux]
+          # Note: retry-on-failure logic can wrap this job in the caller:
+          steps:
+            - uses: actions/checkout@v4
+            - run: make build
+prevention:
+  - 'Treat "lost communication" on ephemeral JIT runners as a potential false positive — check _diag/Runner_*.log for "return code 0" before escalating.'
+  - 'Do not gate required status checks on ephemeral JIT runners until actions/runner#4309 is resolved upstream.'
+  - 'Alert on job result=="failure", not on the "lost communication" message text alone.'
+  - 'Batch work into longer jobs (>60s total) to reduce the broker teardown race frequency.'
+  - 'Watch actions/runner#4309 for an upstream fix (proposed: 5-second grace period before messageQueueLoopTokenSource.Cancel()).'
+docs:
+  - url: 'https://github.com/actions/runner/issues/4309'
+    label: 'actions/runner#4309: Ephemeral/JIT runner reports "lost communication" despite successful completion (March 2026)'
+  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners#using-just-in-time-runners'
+    label: 'GitHub Docs: Just-in-time (JIT) runners'
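The log check that known-unsolved-058 recommends (look for "result: Succeeded" and "return code 0" ahead of the cancellation) could be automated roughly as follows. This is a minimal sketch and not part of the package: the function name `looks_like_false_positive` is hypothetical, and the marker strings are assumptions copied from the log excerpts quoted in the entry, so a given runner version may word these lines differently.

```python
import re

# Marker strings are assumptions taken from the log excerpts quoted in the entry;
# real Runner_*.log wording may differ between runner versions.
SUCCESS_MARKERS = (
    re.compile(r"result: Succeeded"),
    re.compile(r"Runner execution has finished with return code 0"),
)
CANCEL_MARKER = re.compile(r"TaskCanceledException|messageQueueLoopTokenSource")


def looks_like_false_positive(log_text: str) -> bool:
    """Return True when the log shows a successful job together with the teardown cancel."""
    succeeded = all(marker.search(log_text) for marker in SUCCESS_MARKERS)
    cancelled = CANCEL_MARKER.search(log_text) is not None
    return succeeded and cancelled
```

A caller would read the contents of `_diag/Runner_*.log` into a string and only re-run the job automatically when the function returns True; anything else is treated as a genuine failure.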
@@ -0,0 +1,102 @@
+id: runner-environment-179
+title: 'Runner Registration Token Endpoint Returns 502 Under Concurrent Spawn Bursts'
+category: runner-environment
+severity: error
+tags:
+  - self-hosted
+  - ephemeral
+  - registration
+  - autoscaler
+  - burst
+  - 502
+patterns:
+  - regex: 'HTTP Error 502: Bad Gateway'
+    flags: 'i'
+  - regex: 'registration-token.*502|502.*Bad Gateway.*registration'
+    flags: 'i'
+  - regex: 'HTTPError.*502|urlopen error.*502'
+    flags: 'i'
+error_messages:
+  - 'HTTP Error 502: Bad Gateway'
+  - 'urllib.error.HTTPError: HTTP Error 502: Bad Gateway'
+  - 'POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token — 502'
+root_cause: |
+  GitHub REST endpoint:
+    POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token
+
+  returns HTTP 502 Bad Gateway under burst load when many ephemeral runners attempt to
+  register concurrently (e.g., triggered by webhook-driven autoscalers on workflow_job.queued
+  events). The burst saturates the GitHub backend's registration token issuer.
+
+  The reference runner setup scripts (run.sh / config.sh flow) do NOT retry 5xx responses —
+  a single transient 502 causes the container to exit immediately before registration completes.
+  For ephemeral autoscaler setups (Modal, Lambda, Kubernetes pod-per-job), the container slot
+  is burned and no runner is available for the queued job.
+
+  Phantom containers accumulate: the autoscaler billed for the container, GitHub never received
+  a registration, and the autoscaler's max_containers cap can be held by dead slots — stalling
+  the merge queue for hours.
+
+  Source: actions/runner#4399 (May 2026, open issue, observed during 20+ concurrent registrations
+  — ~21 phantom containers during a webhook-driven burst at ~16:00-17:00 UTC on 2026-05-04).
+fix: |
+  1. Add exponential backoff to your registration token call (4 attempts, 2^n second delays).
+  2. Stagger autoscaler spawns with per-container random jitter (0-10 seconds) to break up
+     concurrent registration bursts before they hit the endpoint simultaneously.
+  3. Switch to JIT runner tokens — they are pre-assigned per job and do not require a burst-
+     sensitive registration token call:
+       POST /repos/{owner}/{repo}/actions/runners/generate-jitconfig
+  4. Reduce max_concurrent_spawns in your autoscaler to limit peak registration load.
+  5. Implement phantom container detection: health-check containers shortly after spawn and
+     kill any that failed to register within N seconds.
+fix_code:
+  - language: python
+    label: 'Python — retry-with-backoff on 502 when minting registration token'
+    code: |
+      import time, json, urllib.request, urllib.error
+
+      def mint_registration_token(owner, repo, github_token):
+          url = f"https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token"
+          headers = {
+              "Authorization": f"token {github_token}",
+              "Accept": "application/vnd.github+json",
+              "X-GitHub-Api-Version": "2022-11-28",
+          }
+          for attempt in range(4):
+              try:
+                  req = urllib.request.Request(url, method="POST", headers=headers)
+                  with urllib.request.urlopen(req, timeout=15) as resp:
+                      return json.load(resp)["token"]
+              except urllib.error.HTTPError as e:
+                  if 500 <= e.code < 600 and attempt < 3:
+                      delay = 2 ** attempt  # 1s, 2s, 4s
+                      print(f"Registration token {e.code}, retrying in {delay}s (attempt {attempt+1}/4)...")
+                      time.sleep(delay)
+                      continue
+                  raise
+  - language: python
+    label: 'Autoscaler — add random jitter per container spawn to break up burst'
+    code: |
+      import asyncio, random
+
+      async def handle_workflow_job_queued(event):
+          # Stagger spawns: random delay 0-10s reduces simultaneous token requests
+          jitter = random.uniform(0, 10)
+          await asyncio.sleep(jitter)
+          token = await asyncio.to_thread(mint_registration_token, owner, repo, github_token)  # sync helper, run off the event loop
+          await spawn_ephemeral_container(token)
+prevention:
+  - 'Add exponential backoff (4 attempts, 2^n seconds) for 5xx responses in all registration token calls.'
+  - 'Add random per-container spawn jitter (0-10s) in your autoscaler to break up concurrent registration bursts.'
+  - 'Switch to JIT runner tokens (generate-jitconfig) for burst autoscaler workloads — they do not hit the burst-sensitive registration endpoint.'
+  - 'Monitor for phantom containers: any container that does not show as registered within 60 seconds of spawn should be terminated.'
+  - 'Cap max_concurrent_spawns conservatively (≤10) and tune based on observed 502 rates.'
+docs:
+  - url: 'https://github.com/actions/runner/issues/4399'
+    label: 'actions/runner#4399: Registration token endpoint returns 502 in bursts (May 2026)'
+  - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners#create-a-registration-token-for-a-repository'
+    label: 'GitHub REST API: Create a registration token for a repository'
+  - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners#create-configuration-for-a-just-in-time-runner-for-a-repository'
+    label: 'GitHub REST API: Create JIT runner config (avoids burst-sensitive registration token)'
+  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners'
+    label: 'GitHub Docs: Autoscaling with self-hosted runners'
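The phantom container detection that runner-environment-179 recommends (terminate any container that has not registered within 60 seconds of spawn) reduces to a small pure function. The sketch below is an illustration under stated assumptions, not code from the package: `find_phantom_containers`, its parameters, and the idea that the autoscaler keeps a map of spawn timestamps and a set of registered container IDs are all hypothetical.

```python
from typing import Dict, List, Set


def find_phantom_containers(
    spawned_at: Dict[str, float],
    registered: Set[str],
    now: float,
    grace_seconds: float = 60.0,
) -> List[str]:
    """Return IDs of containers older than grace_seconds that never registered.

    spawned_at maps a container ID to its spawn time (seconds, same clock as now).
    registered holds the IDs that GitHub reports as registered runners.
    """
    return sorted(
        container_id
        for container_id, started in spawned_at.items()
        if container_id not in registered and now - started > grace_seconds
    )
```

An autoscaler loop could call this periodically and terminate the returned containers, which frees slots that would otherwise count against max_containers.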
@@ -0,0 +1,127 @@
+id: silent-failures-098
+title: 'Matrix Array Output Nested in Object Property Serializes as "Array" String'
+category: silent-failures
+severity: silent-failure
+tags:
+  - matrix
+  - fromJSON
+  - job-outputs
+  - dynamic-matrix
+  - strategy
+  - serialization
+patterns:
+  - regex: '\bArray\b'
+    flags: ''
+  - regex: 'fromJSON\s*\(.*outputs\.[a-zA-Z_]+'
+    flags: 'i'
+error_messages:
+  - 'rhel Array'
+  - 'debian Array'
+  - 'distro = Array'
+  - 'component.distro: Array'
+root_cause: |
+  When a job output containing a JSON array string (e.g. '["el8","el9"]') is passed via
+  fromJSON() as a VALUE inside a matrix OBJECT property, the GitHub Actions expression engine
+  coerces the array to a string. The engine's string conversion renders any array as
+  the literal word "Array" instead of the array elements.
+
+  This happens because the matrix strategy parser evaluates object property values as strings
+  at expansion time. When fromJSON() returns an array, the matrix serialization path
+  converts it to a string instead of expanding into multiple jobs.
+
+  The bug only triggers when the array is nested INSIDE an object:
+
+      # BROKEN — array inside object property
+      strategy:
+        matrix:
+          component:
+            - name: rhel
+              distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
+              # ^ distro becomes the string "Array", not ["el8","el9"]
+
+  The same fromJSON() call works correctly at the TOP LEVEL of the matrix:
+
+      # WORKS — array at top level dimension
+      strategy:
+        matrix:
+          distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
+
+  The job does NOT fail — it silently runs with matrix.component.distro set to "Array",
+  producing wrong build targets with no error or warning.
+
+  Source: actions/runner#3794 (April 2025, labeled bug, open as of April 2026 after stale
+  cycle — author confirmed still broken).
+fix: |
+  Restructure the matrix to avoid placing fromJSON() array outputs inside object properties.
+
+  Option 1 — Pre-compute the full cross-product as a JSON array of objects in the matrix-
+  generating job and pass it as a single fromJSON() at the include level.
+
+  Option 2 — Use separate top-level matrix dimensions instead of grouping into objects.
+
+  If object grouping is required for semantic reasons, flatten the arrays at generation time
+  using jq and emit a fully-expanded JSON array of objects from the setup step.
+fix_code:
+  - language: yaml
+    label: 'Workaround — pre-compute full matrix as JSON array, expand via include'
+    code: |
+      jobs:
+        define-matrix:
+          runs-on: ubuntu-latest
+          outputs:
+            matrix: ${{ steps.build.outputs.matrix }}
+          steps:
+            - id: build
+              run: |
+                # Build the full cross-product as a JSON array of objects
+                matrix=$(jq -nc '[
+                  {"arch":"x86_64","runner":"ubuntu-24.04","component":"rhel","distro":"el8"},
+                  {"arch":"x86_64","runner":"ubuntu-24.04","component":"rhel","distro":"el9"},
+                  {"arch":"x86_64","runner":"ubuntu-24.04","component":"debian","distro":"focal"},
+                  {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"rhel","distro":"el8"},
+                  {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"rhel","distro":"el9"},
+                  {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"debian","distro":"focal"}
+                ]')
+                echo "matrix=$matrix" >> "$GITHUB_OUTPUT"
+
+        build:
+          needs: define-matrix
+          runs-on: ${{ matrix.runner }}
+          strategy:
+            matrix:
+              include: ${{ fromJSON(needs.define-matrix.outputs.matrix) }}
+          steps:
+            - run: echo "${{ matrix.component }} ${{ matrix.distro }} on ${{ matrix.arch }}"
+  - language: yaml
+    label: 'Debug step — detect "Array" serialization early'
+    code: |
+      jobs:
+        build:
+          needs: define-matrix
+          runs-on: ubuntu-latest
+          strategy:
+            matrix:
+              # Use top-level dimensions — arrays work at this level
+              distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
+              arch: [x86_64, aarch64]
+          steps:
+            - name: Verify matrix values are not "Array"
+              run: |
+                DISTRO="${{ matrix.distro }}"
+                if [ "$DISTRO" = "Array" ]; then
+                  echo "::error::matrix.distro serialized as 'Array' — fromJSON() inside object bug"
+                  exit 1
+                fi
+                echo "distro=$DISTRO"
+prevention:
+  - 'Never place fromJSON() array outputs as values inside matrix object properties — use top-level dimensions or pre-computed include arrays.'
+  - 'Add a guard step that fails if any matrix value equals the string "Array" to catch this bug early.'
+  - 'Generate complex multi-dimensional matrix shapes in the setup job as a single JSON array, then expand via include: ${{ fromJSON(...) }}.'
+  - 'Track actions/runner#3794 for an upstream fix to the matrix object property serialization bug.'
+docs:
+  - url: 'https://github.com/actions/runner/issues/3794'
+    label: 'actions/runner#3794: array outputs not understood by matrix when nested inside object (April 2025, open)'
+  - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/running-variations-of-jobs-in-a-workflow'
+    label: 'GitHub Docs: Using a matrix for your jobs'
+  - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/evaluate-expressions-in-workflows-and-actions#fromjson'
+    label: 'GitHub Docs: fromJSON expression function'
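Option 1 of silent-failures-098 (emit the fully expanded cross-product as one JSON array of objects) does not have to be written out by hand. The sketch below generates the same six rows as the jq example in that entry; it is an illustration only, and the names `PLATFORMS`, `TARGETS`, `build_include_matrix`, and `to_github_output` are hypothetical rather than part of the package.

```python
import json

# Hypothetical inputs mirroring the hand-written jq example in the entry.
PLATFORMS = [
    {"arch": "x86_64", "runner": "ubuntu-24.04"},
    {"arch": "aarch64", "runner": "ubuntu-24.04-arm"},
]
TARGETS = {"rhel": ["el8", "el9"], "debian": ["focal"]}


def build_include_matrix(platforms, targets):
    """Expand every platform x component x distro combination into a flat list of objects."""
    rows = []
    for platform in platforms:
        for component, distros in targets.items():
            for distro in distros:
                # Each row holds only scalar values, so no array is nested in an object.
                rows.append({**platform, "component": component, "distro": distro})
    return rows


def to_github_output(rows):
    """Format the rows as a single-line 'matrix=<json>' entry for $GITHUB_OUTPUT."""
    return "matrix=" + json.dumps(rows, separators=(",", ":"))
```

A setup step could run a script like this and append the result of `to_github_output(build_include_matrix(PLATFORMS, TARGETS))` to the file named by `$GITHUB_OUTPUT`, after which the consuming job expands it with `include: ${{ fromJSON(...) }}` as in the workaround.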
package/package.json
CHANGED