@htekdev/actions-debugger 1.0.110 → 1.0.111

@@ -0,0 +1,93 @@
1
+ id: known-unsolved-058
2
+ title: 'Ephemeral JIT Runner Reports "Lost Communication" Despite Successful Job Completion'
3
+ category: known-unsolved
4
+ severity: error
5
+ tags:
6
+ - self-hosted
7
+ - ephemeral
8
+ - jit-runner
9
+ - lost-communication
10
+ - false-positive
11
+ - broker
12
+ patterns:
13
+ - regex: 'The self-hosted runner lost communication with the server'
14
+ flags: 'i'
15
+ - regex: 'messageQueueLoopTokenSource|Stop message queue looping'
16
+ flags: 'i'
17
+ error_messages:
18
+ - 'The self-hosted runner lost communication with the server'
19
+ - 'GET request to broker.actions.githubusercontent.com/message ... has been cancelled'
20
+ - 'TaskCanceledException: The operation was canceled'
21
+ root_cause: |
22
+ In Runner.cs line 576, after a one-time-use (ephemeral/JIT) job completes, the message queue
23
+ is cancelled immediately with zero grace period:
24
+
25
+ messageQueueLoopTokenSource.Cancel();
26
+
27
+ This tears down the in-flight broker long-poll (GET broker.actions.githubusercontent.com/message)
28
+ immediately after CompleteJobAsync returns. GitHub's broker health monitor detects the TCP
29
+ disconnect and flags the runner as "lost communication" — racing against the pipeline service
30
+ that just received the successful completion.
31
+
32
+ Two independent GitHub backend systems race:
33
+ 1. Pipeline service — received CompleteJobAsync, knows the job succeeded
34
+ 2. Broker health monitor — sees TCP disconnect, flags runner as "lost communication"
35
+
36
+ When the broker monitor wins (which happens on ~5-10% of short jobs), the UI shows
37
+ "The self-hosted runner lost communication with the server" even though the worker exited
38
+ with code 100 (success) and the runner itself exited with return code 0.
39
+
40
+ The runner logs show all events at identical timestamps — zero delay between
41
+ "Received job status event. JobState: Online" and "messageQueueLoopTokenSource.Cancel()".
42
+
43
+ No user-side configuration can fix this. A code change to Runner.cs is required
44
+ (add ~5s grace delay before cancel). Source: actions/runner#4309 (March 2026, open bug,
45
+ affects runner v2.331.0+, ~5-10% failure rate on short ephemeral jobs).
46
+ fix: |
47
+ There is no user-side configuration fix. The root cause is a zero-grace-period teardown
48
+ in Runner.cs that must be patched by GitHub.
49
+
50
+ Mitigations:
51
+ 1. Verify the diagnostic logs — the job actually succeeded. Check _diag/Runner_*.log
52
+ for "return code 0" and "result: Succeeded" to confirm before concluding failure.
53
+ 2. Retry the failed job — the "lost communication" is a false positive; re-running
54
+ produces a clean success.
55
+ 3. Do not use ephemeral JIT runners as required status checks until actions/runner#4309
56
+ is resolved, or implement a workflow-level retry wrapper.
57
+ 4. Alert on job result=="failure" not on "lost communication" text alone — false positives
58
+ from this race should not page on-call engineers.
59
+ 5. Batch small jobs — the race is more likely on jobs shorter than 60 seconds. Combining
60
+ work into longer-running jobs reduces false positive frequency.
61
+ fix_code:
62
+   - language: yaml
63
+     label: 'Confirm false positive by reading runner diagnostic log before retrying'
64
+     code: |
65
+       # After "lost communication" on an ephemeral JIT runner, check:
66
+       #   _diag/Runner_YYYYMMDD-hhmmss-utc.log
67
+       #
68
+       # Successful completion signs (all at identical timestamps):
69
+       #   [INFO] finish job request for job {id} with result: Succeeded
70
+       #   [INFO] Job X completed with result: Succeeded
71
+       #   [INFO] Received job status event. JobState: Online
72
+       #   [INFO] Runner execution has finished with return code 0
73
+       #
74
+       # If these appear immediately before TaskCanceledException, the job DID succeed.
75
+       # Simply re-run the workflow; the retry will show a clean green result.
76
+       jobs:
77
+         build:
78
+           runs-on: [self-hosted, ephemeral, linux]
79
+           # Note: a retry-on-failure wrapper in the calling workflow can re-run this job.
80
+           steps:
81
+             - uses: actions/checkout@v4
82
+             - run: make build
83
+ prevention:
84
+ - 'Treat "lost communication" on ephemeral JIT runners as a potential false positive — check _diag/Runner_*.log for "return code 0" before escalating.'
85
+ - 'Do not gate required status checks on ephemeral JIT runners until actions/runner#4309 is resolved upstream.'
86
+ - 'Alert on job result=="failure", not on the "lost communication" message text alone.'
87
+ - 'Batch work into longer jobs (>60s total) to reduce the broker teardown race frequency.'
88
+ - 'Watch actions/runner#4309 for an upstream fix (proposed: 5-second grace period before messageQueueLoopTokenSource.Cancel()).'
89
+ docs:
90
+ - url: 'https://github.com/actions/runner/issues/4309'
91
+ label: 'actions/runner#4309: Ephemeral/JIT runner reports "lost communication" despite successful completion (March 2026)'
92
+ - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners#using-just-in-time-runners'
93
+ label: 'GitHub Docs: Just-in-time (JIT) runners'
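The log check described in known-unsolved-058 can be scripted. The sketch below is illustrative only: the function name `looks_like_false_positive` is hypothetical, and the marker strings are copied from the entry above, so the real log wording may differ between runner versions.

```python
import re

# Marker strings taken from the known-unsolved-058 entry. The exact log
# format is an assumption and may vary between runner versions.
SUCCESS_MARKERS = (
    re.compile(r"completed with result: Succeeded"),
    re.compile(r"Runner execution has finished with return code 0"),
)


def looks_like_false_positive(log_text: str) -> bool:
    """Return True when every success marker appears in the runner log text."""
    return all(pattern.search(log_text) for pattern in SUCCESS_MARKERS)
```

Pass it the contents of a `_diag/Runner_*.log` file. A True result suggests the job succeeded and the reported failure came from the teardown race, though it is a heuristic rather than proof.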
@@ -0,0 +1,102 @@
1
+ id: runner-environment-179
2
+ title: 'Runner Registration Token Endpoint Returns 502 Under Concurrent Spawn Bursts'
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - self-hosted
7
+ - ephemeral
8
+ - registration
9
+ - autoscaler
10
+ - burst
11
+ - 502
12
+ patterns:
13
+ - regex: 'HTTP Error 502: Bad Gateway'
14
+ flags: 'i'
15
+ - regex: 'registration-token.*502|502.*Bad Gateway.*registration'
16
+ flags: 'i'
17
+ - regex: 'HTTPError.*502|urlopen error.*502'
18
+ flags: 'i'
19
+ error_messages:
20
+ - 'HTTP Error 502: Bad Gateway'
21
+ - 'urllib.error.HTTPError: HTTP Error 502: Bad Gateway'
22
+ - 'POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token — 502'
23
+ root_cause: |
24
+ GitHub REST endpoint:
25
+ POST https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token
26
+
27
+ returns HTTP 502 Bad Gateway under burst load when many ephemeral runners attempt to
28
+ register concurrently (e.g., triggered by webhook-driven autoscalers on workflow_job.queued
29
+ events). The burst saturates the GitHub backend's registration token issuer.
30
+
31
+ The reference runner setup scripts (run.sh / config.sh flow) do NOT retry 5xx responses —
32
+ a single transient 502 causes the container to exit immediately before registration completes.
33
+ For ephemeral autoscaler setups (Modal, Lambda, Kubernetes pod-per-job), the container slot
34
+ is burned and no runner is available for the queued job.
35
+
36
+ Phantom containers accumulate: the autoscaler billed for the container, GitHub never received
37
+ a registration, and the autoscaler's max_containers cap can be held by dead slots — stalling
38
+ the merge queue for hours.
39
+
40
+ Source: actions/runner#4399 (May 2026, open issue, observed during 20+ concurrent registrations
41
+ — ~21 phantom containers during a webhook-driven burst at ~16:00-17:00 UTC on 2026-05-04).
42
+ fix: |
43
+ 1. Add exponential backoff to your registration token call (4 attempts, 2^n second delays).
44
+ 2. Stagger autoscaler spawns with per-container random jitter (0-10 seconds) to break up
45
+ concurrent registration bursts before they hit the endpoint simultaneously.
46
+ 3. Switch to JIT runner tokens — they are pre-assigned per job and do not require a burst-
47
+ sensitive registration token call:
48
+ POST /repos/{owner}/{repo}/actions/runners/generate-jitconfig
49
+ 4. Reduce max_concurrent_spawns in your autoscaler to limit peak registration load.
50
+ 5. Implement phantom container detection: health-check containers shortly after spawn and
51
+ kill any that failed to register within N seconds.
52
+ fix_code:
53
+   - language: python
54
+     label: 'Python: retry-with-backoff on 502 when minting registration token'
55
+     code: |
56
+       import time, json, urllib.request, urllib.error
57
+
58
+       def mint_registration_token(owner, repo, github_token):
59
+           url = f"https://api.github.com/repos/{owner}/{repo}/actions/runners/registration-token"
60
+           headers = {
61
+               "Authorization": f"token {github_token}",
62
+               "Accept": "application/vnd.github+json",
63
+               "X-GitHub-Api-Version": "2022-11-28",
64
+           }
65
+           for attempt in range(4):
66
+               try:
67
+                   req = urllib.request.Request(url, method="POST", headers=headers)
68
+                   with urllib.request.urlopen(req, timeout=15) as resp:
69
+                       return json.load(resp)["token"]
70
+               except urllib.error.HTTPError as e:
71
+                   if 500 <= e.code < 600 and attempt < 3:
72
+                       delay = 2 ** attempt  # 1s, 2s, 4s
73
+                       print(f"Registration token {e.code}, retrying in {delay}s (attempt {attempt+1}/4)...")
74
+                       time.sleep(delay)
75
+                       continue
76
+                   raise
77
+   - language: python
78
+     label: 'Autoscaler: add random jitter per container spawn to break up burst'
79
+     code: |
80
+       import asyncio, random
81
+
82
+       async def handle_workflow_job_queued(event, owner, repo, github_token):
83
+           # Stagger spawns: random delay 0-10s reduces simultaneous token requests
84
+           jitter = random.uniform(0, 10)
85
+           await asyncio.sleep(jitter)
86
+           token = await asyncio.to_thread(mint_registration_token, owner, repo, github_token)  # blocking call, run off the event loop
87
+           await spawn_ephemeral_container(token)  # coroutine supplied by your autoscaler
88
+ prevention:
89
+ - 'Add exponential backoff (4 attempts, 2^n seconds) for 5xx responses in all registration token calls.'
90
+ - 'Add random per-container spawn jitter (0-10s) in your autoscaler to break up concurrent registration bursts.'
91
+ - 'Switch to JIT runner tokens (generate-jitconfig) for burst autoscaler workloads — they do not hit the burst-sensitive registration endpoint.'
92
+ - 'Monitor for phantom containers: any container that does not show as registered within 60 seconds of spawn should be terminated.'
93
+ - 'Cap max_concurrent_spawns conservatively (≤10) and tune based on observed 502 rates.'
94
+ docs:
95
+ - url: 'https://github.com/actions/runner/issues/4399'
96
+ label: 'actions/runner#4399: Registration token endpoint returns 502 in bursts (May 2026)'
97
+ - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners#create-a-registration-token-for-a-repository'
98
+ label: 'GitHub REST API: Create a registration token for a repository'
99
+ - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners#create-configuration-for-a-just-in-time-runner-for-a-repository'
100
+ label: 'GitHub REST API: Create JIT runner config (avoids burst-sensitive registration token)'
101
+ - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners'
102
+ label: 'GitHub Docs: Autoscaling with self-hosted runners'
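The phantom-container check in runner-environment-179 (terminate anything that has not registered within N seconds of spawn) reduces to a small pure function. This is a minimal sketch, assuming the autoscaler can supply spawn timestamps and the set of container IDs that have registered; `find_phantom_containers` and its parameters are hypothetical names, not part of any autoscaler or GitHub API.

```python
from typing import Dict, List, Set


def find_phantom_containers(
    spawned_at: Dict[str, float],
    registered_ids: Set[str],
    now: float,
    grace_seconds: float = 60.0,
) -> List[str]:
    """Return IDs of containers that outlived the grace period without registering.

    spawned_at maps container ID to spawn time in seconds (same clock as `now`).
    registered_ids holds the IDs whose runner has shown up as registered.
    """
    return sorted(
        container_id
        for container_id, started in spawned_at.items()
        if container_id not in registered_ids and now - started > grace_seconds
    )
```

The caller would terminate each returned container and release its slot, so that dead slots stop counting against the autoscaler's container cap. The 60-second default mirrors the threshold suggested in the prevention list above.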
@@ -0,0 +1,127 @@
1
+ id: silent-failures-098
2
+ title: 'Matrix Array Output Nested in Object Property Serializes as "Array" String'
3
+ category: silent-failures
4
+ severity: silent-failure
5
+ tags:
6
+ - matrix
7
+ - fromJSON
8
+ - job-outputs
9
+ - dynamic-matrix
10
+ - strategy
11
+ - serialization
12
+ patterns:
13
+ - regex: '\bArray\b'
14
+ flags: ''
15
+ - regex: 'fromJSON\s*\(.*outputs\.[a-zA-Z_]+'
16
+ flags: 'i'
17
+ error_messages:
18
+ - 'rhel Array'
19
+ - 'debian Array'
20
+ - 'distro = Array'
21
+ - 'component.distro: Array'
22
+ root_cause: |
23
+ When a job output containing a JSON array string (e.g. '["el8","el9"]') is passed via
24
+ fromJSON() as a VALUE inside a matrix OBJECT property, the GitHub Actions expression engine
25
+ coerces the array to a string. That coercion does not join the elements; it produces
26
+ the literal word "Array" instead of the array elements.
27
+
28
+ This happens because the matrix strategy parser evaluates object property values as strings
29
+ at expansion time. When fromJSON() returns an array, the matrix serialization path
30
+ stringifies it instead of expanding into multiple jobs.
31
+
32
+ The bug only triggers when the array is nested INSIDE an object:
33
+
34
+   # BROKEN: array inside object property
35
+   strategy:
36
+     matrix:
37
+       component:
38
+         - name: rhel
39
+           distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
40
+           # ^ distro becomes the string "Array", not ["el8","el9"]
41
+
42
+ The same fromJSON() call works correctly at the TOP LEVEL of the matrix:
43
+
44
+   # WORKS: array at top level dimension
45
+   strategy:
46
+     matrix:
47
+       distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
48
+
49
+ The job does NOT fail — it silently runs with matrix.component.distro set to "Array",
50
+ producing wrong build targets with no error or warning.
51
+
52
+ Source: actions/runner#3794 (April 2025, labeled bug, open as of April 2026 after stale
53
+ cycle — author confirmed still broken).
54
+ fix: |
55
+ Restructure the matrix to avoid placing fromJSON() array outputs inside object properties.
56
+
57
+ Option 1 — Pre-compute the full cross-product as a JSON array of objects in the matrix-
58
+ generating job and pass it as a single fromJSON() at the include level.
59
+
60
+ Option 2 — Use separate top-level matrix dimensions instead of grouping into objects.
61
+
62
+ If object grouping is required for semantic reasons, flatten the arrays at generation time
63
+ using jq and emit a fully-expanded JSON array of objects from the setup step.
64
+ fix_code:
65
+   - language: yaml
66
+     label: 'Workaround: pre-compute full matrix as JSON array, expand via include'
67
+     code: |
68
+       jobs:
69
+         define-matrix:
70
+           runs-on: ubuntu-latest
71
+           outputs:
72
+             matrix: ${{ steps.build.outputs.matrix }}
73
+           steps:
74
+             - id: build
75
+               run: |
76
+                 # Build the full cross-product as a JSON array of objects
77
+                 matrix=$(jq -nc '[
78
+                   {"arch":"x86_64","runner":"ubuntu-24.04","component":"rhel","distro":"el8"},
79
+                   {"arch":"x86_64","runner":"ubuntu-24.04","component":"rhel","distro":"el9"},
80
+                   {"arch":"x86_64","runner":"ubuntu-24.04","component":"debian","distro":"focal"},
81
+                   {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"rhel","distro":"el8"},
82
+                   {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"rhel","distro":"el9"},
83
+                   {"arch":"aarch64","runner":"ubuntu-24.04-arm","component":"debian","distro":"focal"}
84
+                 ]')
85
+                 echo "matrix=$matrix" >> "$GITHUB_OUTPUT"
86
+
87
+         build:
88
+           needs: define-matrix
89
+           runs-on: ${{ matrix.runner }}
90
+           strategy:
91
+             matrix:
92
+               include: ${{ fromJSON(needs.define-matrix.outputs.matrix) }}
93
+           steps:
94
+             - run: echo "${{ matrix.component }} ${{ matrix.distro }} on ${{ matrix.arch }}"
95
+   - language: yaml
96
+     label: 'Debug step: detect "Array" serialization early'
97
+     code: |
98
+       jobs:
99
+         build:
100
+           needs: define-matrix
101
+           runs-on: ubuntu-latest
102
+           strategy:
103
+             matrix:
104
+               # Use top-level dimensions; arrays work at this level
105
+               distro: ${{ fromJSON(needs.define-matrix.outputs.rpms) }}
106
+               arch: [x86_64, aarch64]
107
+           steps:
108
+             - name: Verify matrix values are not "Array"
109
+               run: |
110
+                 DISTRO="${{ matrix.distro }}"
111
+                 if [ "$DISTRO" = "Array" ]; then
112
+                   echo "::error::matrix.distro serialized as 'Array': fromJSON() inside object bug"
113
+                   exit 1
114
+                 fi
115
+                 echo "distro=$DISTRO"
116
+ prevention:
117
+ - 'Never place fromJSON() array outputs as values inside matrix object properties — use top-level dimensions or pre-computed include arrays.'
118
+ - 'Add a guard step that fails if any matrix value equals the string "Array" to catch this bug early.'
119
+ - 'Generate complex multi-dimensional matrix shapes in the setup job as a single JSON array, then expand via include: ${{ fromJSON(...) }}.'
120
+ - 'Track actions/runner#3794 for an upstream fix to the matrix object property serialization bug.'
121
+ docs:
122
+ - url: 'https://github.com/actions/runner/issues/3794'
123
+ label: 'actions/runner#3794: array outputs not understood by matrix when nested inside object (April 2025, open)'
124
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/running-variations-of-jobs-in-a-workflow'
125
+ label: 'GitHub Docs: Using a matrix for your jobs'
126
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/evaluate-expressions-in-workflows-and-actions#fromjson'
127
+ label: 'GitHub Docs: fromJSON expression function'
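The hand-written JSON in the silent-failures-098 workaround could also be generated rather than typed out. This is a minimal sketch in Python, assuming the same arch/runner and component/distro values as the workaround example; `build_matrix` and `output_line` are hypothetical helper names, and the two lookup tables are illustrative inputs.

```python
import itertools
import json

# Illustrative inputs mirroring the workaround example above.
ARCH_RUNNERS = {"x86_64": "ubuntu-24.04", "aarch64": "ubuntu-24.04-arm"}
COMPONENT_DISTROS = {"rhel": ["el8", "el9"], "debian": ["focal"]}


def build_matrix(arch_runners, component_distros):
    """Expand every arch x component x distro combination into one flat object."""
    rows = []
    pairs = itertools.product(arch_runners.items(), component_distros.items())
    for (arch, runner), (component, distros) in pairs:
        for distro in distros:
            rows.append(
                {"arch": arch, "runner": runner, "component": component, "distro": distro}
            )
    return rows


def output_line(rows):
    """Format the rows as a single-line `matrix=<json>` entry for $GITHUB_OUTPUT."""
    return "matrix=" + json.dumps(rows, separators=(",", ":"))
```

In a setup step, the result of `output_line(build_matrix(ARCH_RUNNERS, COMPONENT_DISTROS))` would be appended to the file named by the `GITHUB_OUTPUT` environment variable. Because each row is a flat object of strings, no array ends up nested inside an object property.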
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@htekdev/actions-debugger",
3
- "version": "1.0.110",
3
+ "version": "1.0.111",
4
4
  "description": "65+ real GitHub Actions errors, queryable by agents. CLI + MCP server + Copilot skills + error database.",
5
5
  "type": "module",
6
6
  "main": "./dist/index.js",