@htekdev/actions-debugger 1.0.3 → 1.0.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/errors/concurrency-timing/job-stuck-waiting-for-runner.yml +105 -0
- package/errors/concurrency-timing/matrix-fail-fast-sibling-cancellation.yml +113 -0
- package/errors/concurrency-timing/runner-group-not-found-race.yml +97 -0
- package/errors/concurrency-timing/timeout-minutes-job-killed.yml +107 -0
- package/errors/known-unsolved/github-step-summary-size-limit.yml +112 -0
- package/errors/known-unsolved/job-maximum-execution-time.yml +127 -0
- package/errors/runner-environment/checkout-safe-directory-container.yml +84 -0
- package/errors/runner-environment/self-hosted-runner-version-deprecated.yml +99 -0
- package/errors/silent-failures/cache-hit-restore-keys-misleading.yml +82 -0
- package/package.json +1 -1
@@ -0,0 +1,105 @@
+id: concurrency-timing-006
+title: "Job Stuck: 'Waiting for a Runner to Pick Up This Job'"
+category: concurrency-timing
+severity: error
+tags:
+  - runner
+  - runs-on
+  - self-hosted
+  - queued
+  - stuck
+  - deprecated-runner
+patterns:
+  - regex: "Waiting for a runner to pick up this job"
+    flags: "i"
+  - regex: "No runner matching the specified labels"
+    flags: "i"
+  - regex: "Could not find any online and idle runners"
+    flags: "i"
+error_messages:
+  - "Waiting for a runner to pick up this job."
+  - "No runner matching the specified labels was found: [your-label]"
+  - "Could not find any online and idle runners matching the required labels."
+root_cause: |
+  A job remains stuck in the "queued" state — showing "Waiting for a runner to pick up
+  this job" — when GitHub Actions cannot find an available runner matching the `runs-on:`
+  labels. `timeout-minutes` does not bound this wait; it only counts once a runner starts the job.
+
+  The most common causes:
+
+  1. **Deprecated or retired runner label** — GitHub periodically retires old runner images.
+     `ubuntu-18.04` was retired in April 2023. `ubuntu-20.04` deprecation is in progress.
+     Jobs using these labels get stuck because no GitHub-hosted runners serve the label.
+
+  2. **Typo in `runs-on:` label** — `ubuntu-latets` and `ubuntu_latest` both fail silently:
+     no error is raised and the job queues. GitHub documents self-hosted labels as case-insensitive.
+
+  3. **Self-hosted runner offline or de-registered** — the runner was stopped, the service
+     was not restarted after a reboot, or the runner registration token expired. GitHub queues
+     the job and waits for a registered runner with matching labels to come online.
+
+  4. **Runner group restrictions** — organization admins restrict which repositories can use
+     which runner groups. A job referencing a group the repository is not authorized for will
+     queue indefinitely without an explicit permission error.
+
+  5. **All runners busy** — all matching runners are executing other jobs. The job correctly
+     queues but appears "stuck" during peak usage. It will eventually be picked up.
+
+  There is no notification when a job has been queued for an unusually long time — the only
+  signal is the job's wall-clock age and the static "Waiting for a runner" message.
+fix: |
+  Verify the `runs-on:` label against the current list of supported GitHub-hosted runner
+  images. For self-hosted runners, check runner registration and service health.
+fix_code:
+  - language: yaml
+    label: "Use current, non-deprecated GitHub-hosted runner labels"
+    code: |
+      jobs:
+        build:
+          # Use current supported labels only
+          runs-on: ubuntu-latest   # OR ubuntu-22.04, ubuntu-24.04
+          # NOT: ubuntu-18.04 (retired), ubuntu-20.04 (deprecated)
+
+        build-windows:
+          runs-on: windows-latest  # OR windows-2022, windows-2025
+
+        build-macos:
+          runs-on: macos-latest    # OR macos-13, macos-14, macos-15
+  - language: yaml
+    label: "Self-hosted runner — verify registration and labels match exactly"
+    code: |
+      jobs:
+        deploy:
+          # Labels must exactly match what the runner was registered with
+          # Check: GitHub Settings → Actions → Runners → click runner → Labels
+          runs-on: [self-hosted, linux, production]
+
+          steps:
+            - name: Verify runner is the expected host
+              run: echo "Running on $RUNNER_NAME at $(hostname)"
+  - language: yaml
+    label: "Matrix across hosted and self-hosted runners (both legs run; not an automatic fallback)"
+    code: |
+      jobs:
+        build:
+          strategy:
+            matrix:
+              runner: [ubuntu-latest, [self-hosted, linux]]
+          runs-on: ${{ matrix.runner }}
+prevention:
+  - "Audit `runs-on:` labels in all workflows when GitHub announces runner image deprecations."
+  - "Set a job-level `timeout-minutes` so jobs that hang after starting don't occupy a runner indefinitely; it does not shorten time spent queued."
+  - "For self-hosted runners, configure the runner service to auto-restart on reboot (e.g., on Linux, install it as a service with `sudo ./svc.sh install` and start it with `sudo ./svc.sh start`)."
+  - "Use GitHub's runner status page (Settings → Actions → Runners) to verify runners are Online before triggering long jobs."
+  - "Subscribe to GitHub Changelog and Actions deprecation notices to catch retiring runner labels early."
+docs:
+  - url: "https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources"
+    label: "Supported GitHub-hosted runner labels"
+  - url: "https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners"
+    label: "Adding self-hosted runners"
+  - url: "https://stackoverflow.com/questions/70959954/error-waiting-for-a-runner-to-pick-up-this-job-using-github-actions"
+    label: "Stack Overflow: Waiting for a runner to pick up this job"
+  - url: "https://github.com/actions/runner/issues/3609"
+    label: "actions/runner#3609 — Self-hosted runner stuck / deadlock"
+  - url: "https://github.com/orgs/community/discussions/147604"
+    label: "Community: Workflow stuck in queued state"
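Editorial sketch (not part of the package): the entry above notes that nothing alerts on a job that has been queued for an unusually long time. One possible mitigation is a small scheduled workflow that lists long-queued runs. The workflow name, the 15-minute schedule, and the 30-minute threshold are all assumptions chosen for illustration; the sketch relies on `gh run list` with its `--status`, `--json`, and `--jq` options, and it still needs an available GitHub-hosted runner to run at all.

```yaml
# Hypothetical watchdog workflow; names and thresholds are illustrative assumptions.
name: queued-run-watchdog
on:
  schedule:
    - cron: "*/15 * * * *"
permissions:
  actions: read
jobs:
  report-stale-queued-runs:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - name: List runs queued for more than 30 minutes
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # createdAt is an ISO 8601 timestamp; 1800 seconds = 30 minutes
          gh run list --repo "$GITHUB_REPOSITORY" --status queued \
            --json databaseId,workflowName,createdAt \
            --jq '.[] | select((now - (.createdAt | fromdateiso8601)) > 1800) | "\(.databaseId) \(.workflowName) queued since \(.createdAt)"'
```

The step only prints matching runs; forwarding the output to chat or email is left out because that part depends entirely on the team's tooling.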
@@ -0,0 +1,113 @@
+id: concurrency-timing-007
+title: "Matrix Sibling Jobs Silently Cancelled by fail-fast Default"
+category: concurrency-timing
+severity: silent-failure
+tags:
+  - matrix
+  - fail-fast
+  - cancellation
+  - silent-failure
+  - strategy
+  - job-cancelled
+patterns:
+  - regex: "Some jobs were not run because a sibling job failed"
+    flags: "i"
+  - regex: "Canceling since a higher priority waiting run was found"
+    flags: "i"
+  - regex: "The workflow run was canceled\\."
+    flags: "i"
+error_messages:
+  - "Some jobs were not run because a sibling job failed. To allow them to run anyway, add 'continue-on-error: true' to the matrix job."
+  - "Job was cancelled"
+root_cause: |
+  GitHub Actions matrix strategy defaults to `fail-fast: true`. When ANY matrix leg fails,
+  GitHub immediately cancels all other in-progress and pending legs in the same matrix.
+
+  This default is rarely what developers want during debugging or CI investigation, and
+  produces a confusing failure pattern:
+
+  1. **Cancelled legs appear as "Cancelled" not "Failed"** — matrix siblings killed by
+     `fail-fast` show as CANCELLED in the UI (grey icon) rather than red failures. Developers
+     scanning the run summary see one red failure and many grey cancellations, and may not
+     realize those sibling legs had reached significant progress (e.g., partway through a
+     test suite on a different OS or Node version) before being killed.
+
+  2. **Root cause is obscured** — the only failing leg that matters for diagnosis is the one
+     that triggered `fail-fast`, but with multiple cancellations in the UI, it can be hard to
+     identify which leg failed first.
+
+  3. **`fail-fast` is inherited silently** — there is no warning annotation that says
+     "fail-fast is enabled and cancelled 5 sibling legs." The default is documented but
+     easy to forget when adding a new matrix.
+
+  4. **Re-running failed jobs doesn't re-run cancelled siblings** — "Re-run failed jobs"
+     only re-runs the legs that explicitly FAILED, not the ones that were cancelled by
+     fail-fast. Developers re-running failed jobs think they'll see results from all legs,
+     but cancelled siblings stay cancelled. Only "Re-run all jobs" restarts everything.
+
+  Example: a 3-OS matrix (ubuntu, windows, macos) where ubuntu fails. With fail-fast,
+  windows and macos are immediately cancelled. The developer sees one failure and two
+  cancellations, re-runs the failed ubuntu job, and never discovers that windows also
+  had an independent failing test.
+fix: |
+  Set `fail-fast: false` explicitly on any matrix where you need full signal from all
+  legs — especially for cross-platform or multi-version compatibility matrices. Use
+  `fail-fast: true` intentionally only when running the full matrix after one failure is
+  wasteful (e.g., expensive build matrices during pre-merge CI).
+fix_code:
+  - language: yaml
+    label: "Disable fail-fast to see all matrix leg results"
+    code: |
+      jobs:
+        test:
+          strategy:
+            fail-fast: false   # All legs run regardless of siblings failing
+            matrix:
+              os: [ubuntu-latest, windows-latest, macos-latest]
+              node: [18, 20, 22]
+          runs-on: ${{ matrix.os }}
+          steps:
+            - uses: actions/checkout@v4
+            - uses: actions/setup-node@v4
+              with:
+                node-version: ${{ matrix.node }}
+            - run: npm ci
+            - run: npm test
+  - language: yaml
+    label: "Use fail-fast: true only for expensive pre-merge CI"
+    code: |
+      jobs:
+        # Pre-merge: fail fast to conserve minutes — just need to know if it passes
+        lint-and-typecheck:
+          strategy:
+            fail-fast: true   # OK: fast, cheap, fail early
+            matrix:
+              node: [20, 22]
+          runs-on: ubuntu-latest
+          steps:
+            - run: npm run lint && npm run typecheck
+
+        # Post-merge: always see all platform results
+        full-test-suite:
+          if: github.event_name == 'push'
+          strategy:
+            fail-fast: false   # Need full signal on all platforms
+            matrix:
+              os: [ubuntu-latest, windows-latest, macos-latest]
+          runs-on: ${{ matrix.os }}
+          steps:
+            - run: npm test
+prevention:
+  - "Always set `fail-fast: false` explicitly on cross-platform or multi-version matrices where you need full compatibility signal."
+  - "After a matrix failure, use 'Re-run all jobs' (not 'Re-run failed jobs') to get results from previously-cancelled siblings."
+  - "Add a workflow summary step with `if: always()` to collect and consolidate test results across all matrix legs even when some are cancelled."
+  - "Be aware that cancelled legs (grey) are NOT the same as passed legs (green) — visually scan for both red and grey when investigating failures."
+docs:
+  - url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idstrategyfail-fast"
+    label: "Workflow syntax: jobs.<job_id>.strategy.fail-fast"
+  - url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idstrategymatrix"
+    label: "Workflow syntax: jobs.<job_id>.strategy.matrix"
+  - url: "https://github.com/orgs/community/discussions/26822"
+    label: "Community: fail-fast cancels matrix siblings unexpectedly"
+  - url: "https://stackoverflow.com/questions/57850553/github-actions-check-steps-status"
+    label: "Stack Overflow: Matrix job cancellation behavior with fail-fast"
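Editorial sketch (not part of the package): the prevention list above suggests a summary step guarded by `if: always()`, but the entry shows no code for it. A minimal version, assuming a matrix job named `test` and a hypothetical follow-up job named `matrix-result`, could look like the following. It reads `needs.test.result`, which should report the aggregate outcome of the matrix as one of `success`, `failure`, `cancelled`, or `skipped`.

```yaml
jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - run: npm test

  # Hypothetical gate job: runs even when matrix legs fail or are cancelled
  matrix-result:
    needs: test
    if: always()
    runs-on: ubuntu-latest
    steps:
      - name: Report aggregate matrix result
        env:
          RESULT: ${{ needs.test.result }}
        run: |
          echo "Aggregate result of the test matrix: $RESULT" >> "$GITHUB_STEP_SUMMARY"
          # Fail this job unless every leg succeeded, so grey legs are not overlooked
          if [ "$RESULT" != "success" ]; then
            exit 1
          fi
```

Making `matrix-result` the required status check gives branch protection one stable job name to watch instead of one name per matrix leg.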
@@ -0,0 +1,97 @@
+id: "concurrency-timing-008"
+title: "Intermittent 'Required runner group not found' when ephemeral runner registers after job dispatch"
+category: "concurrency-timing"
+severity: "error"
+tags:
+  - "runner-group"
+  - "self-hosted"
+  - "ephemeral"
+  - "race-condition"
+  - "dispatch"
+  - "organization"
+patterns:
+  - regex: "Required runner group '.*' not found"
+    flags: "i"
+  - regex: "runner_group_id.*null"
+    flags: "i"
+error_messages:
+  - "Required runner group 'x' not found"
+root_cause: |
+  In autoscaling self-hosted runner setups (EC2, GKE ephemeral runners, ARC), the runner
+  must register with GitHub BEFORE the job dispatcher resolves runner group membership.
+
+  The race condition occurs when:
+  1. A workflow is triggered and GitHub's broker immediately tries to assign the job
+  2. The ephemeral runner is still initializing (EC2 bootstrap, container pull, ~2:30 min)
+  3. The broker resolves runner group membership before the new runner completes registration
+  4. The broker reports "Required runner group 'X' not found" and fails the job
+
+  This is intermittent: matrix jobs expose it clearly because some cells get
+  already-running (pre-registered) runners while others need a fresh runner, triggering
+  the race. Inspecting the failed job via the Jobs API shows `runner_group_id: null`
+  and `runner_name: null` throughout the queue duration even though the runner group
+  exists and has the correct repository access.
+
+  A separate but related pattern occurs with org-level runner group repository access
+  grants not propagating to the broker V2 protocol in time, causing identical symptoms
+  regardless of runner initialization speed.
+
+  Reported upstream: https://github.com/actions/runner/issues/4252
+  Related: https://github.com/actions/runner/issues/4429
+fix: |
+  For ephemeral autoscaling runners:
+    Implement a registration wait loop that polls the GitHub Runners API before signaling
+    the runner as available. The runner should only become eligible for jobs after the
+    broker has acknowledged its registration.
+
+  For org-level runner group access issues:
+    Verify that the target repository is in the runner group's allowed repositories list
+    via API. If misconfigured, re-registering the runner at repository level instead of
+    org level is a reliable workaround.
+
+  General mitigation: Add timeout-minutes to all jobs on self-hosted runners to bound run time
+  after pickup. It does not cover time spent queued, so monitor queued jobs separately.
+fix_code:
+  - language: yaml
+    label: "Add timeout-minutes to bound job run time on self-hosted runners"
+    code: |
+      jobs:
+        build:
+          runs-on:
+            group: my-runner-group
+            labels: [self-hosted, linux]
+          timeout-minutes: 10  # Counts from job start; does not cover queue wait
+          steps:
+            - uses: actions/checkout@v4
+            - run: echo "Runner assigned successfully"
+  - language: yaml
+    label: "Verify runner group repository access via API"
+    code: |
+      jobs:
+        debug-runner-group:
+          runs-on: ubuntu-latest
+          steps:
+            - name: Check runner group repository access
+              env:
+                GH_TOKEN: ${{ secrets.ORG_RUNNER_READ_TOKEN }}
+              run: |
+                echo "Runner groups and their visibility:"
+                gh api /orgs/${{ github.repository_owner }}/actions/runner-groups \
+                  --jq '.runner_groups[] | "\(.name) (id: \(.id)) — visibility: \(.visibility)"'
+
+                echo "Repositories allowed for runner group ID 1:"
+                gh api /orgs/${{ github.repository_owner }}/actions/runner-groups/1/repositories \
+                  --jq '.repositories[].full_name'
+prevention:
+  - "Add timeout-minutes to all jobs using self-hosted runners so hung jobs are cancelled early; it does not bound time spent queued, so monitor queued jobs separately"
+  - "For ephemeral runners (EC2/ARC), implement a registration health check that polls the Runners API before the runner accepts jobs"
+  - "For org-level runners, verify group repository access via API after any runner group configuration change"
+  - "For matrix jobs with ephemeral runners, keep N idle pre-registered runners to avoid cold-start races"
+  - "Monitor runner_group_id via the Jobs API to detect dispatch failures early in autoscaling pipelines"
+docs:
+  - url: "https://github.com/actions/runner/issues/4252"
+    label: "actions/runner #4252 — Intermittent runner group not found"
+  - url: "https://github.com/actions/runner/issues/4429"
+    label: "actions/runner #4429 — Org-level runner never dispatched"
+  - url: "https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/managing-access-to-self-hosted-runners-using-groups"
+    label: "GitHub Docs — Managing runner group access"
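Editorial sketch (not part of the package): the entry above recommends monitoring `runner_group_id` via the Jobs API but does not show how. A minimal version, assuming a manually triggered workflow with a hypothetical `run_id` input, could query the "list jobs for a workflow run" endpoint and print queued jobs that have no runner group assigned yet.

```yaml
# Hypothetical diagnostic workflow; the input name and job name are assumptions.
on:
  workflow_dispatch:
    inputs:
      run_id:
        description: "ID of the workflow run to inspect"
        required: true
jobs:
  inspect-dispatch:
    runs-on: ubuntu-latest
    permissions:
      actions: read
    steps:
      - name: Show queued jobs with no runner group assigned
        env:
          GH_TOKEN: ${{ github.token }}
          RUN_ID: ${{ inputs.run_id }}
        run: |
          gh api "/repos/$GITHUB_REPOSITORY/actions/runs/$RUN_ID/jobs" \
            --jq '.jobs[] | select(.status == "queued" and .runner_group_id == null) | "\(.name): no runner group assigned yet"'
```

An empty result means every queued job in that run already has a runner group; repeated hits for the same job over several minutes would match the race described above.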
@@ -0,0 +1,107 @@
+id: concurrency-timing-005
+title: "Job Silently Cancelled When timeout-minutes Is Exceeded"
+category: concurrency-timing
+severity: error
+tags:
+  - timeout
+  - timeout-minutes
+  - job-cancelled
+  - timing
+  - runner
+patterns:
+  - regex: "##\\[error\\]The operation was cancelled\\."
+    flags: "i"
+  - regex: "The job '.*' was cancelled because it exceeded the maximum execution time"
+    flags: "i"
+  - regex: "Error: The operation was canceled"
+    flags: "i"
+  - regex: "cancel is received"
+    flags: "i"
+error_messages:
+  - "##[error]The operation was cancelled."
+  - "Error: The operation was canceled"
+  - "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled."
+root_cause: |
+  When a job (or step) exceeds its configured `timeout-minutes`, GitHub Actions sends a
+  cancellation signal to the runner. The runner has 5 minutes to complete graceful shutdown,
+  after which it is forcibly terminated.
+
+  The failure mode has several layers of confusion:
+
+  1. **Status shows "Cancelled" not "Failed"** — a timed-out job is marked CANCELLED in the
+     UI. It does not appear as a red failure. Developers scanning the Actions tab may miss it
+     entirely, especially if another run succeeded after it.
+
+  2. **No step-level attribution** — the job log shows "The operation was cancelled" but does
+     not identify which specific step was still running or how far it had progressed. Long
+     builds, network-heavy steps, and interactive prompts are common culprits.
+
+  3. **Default timeout is 360 minutes (6 hours)** — if `timeout-minutes` is not explicitly
+     set, GitHub uses the platform default of 6 hours for GitHub-hosted runners. A job that
+     accidentally blocks (waiting for user input, infinite loop, hung network call) will silently
+     consume 6 hours of runner minutes before being cancelled with no diagnostic output.
+
+  4. **Step-level timeouts are independent** — `timeout-minutes` on a `steps[*]` entry stops
+     only that step and marks it failed; the job then fails unless the step sets `continue-on-error: true`. `timeout-minutes` on `jobs[*]` cancels the entire job.
+     Mixing both is valid but must be understood deliberately.
+fix: |
+  Always set explicit `timeout-minutes` at the job level to bound worst-case runner cost.
+  Tune based on your typical build time (e.g., 2-3× the median duration). Add step-level
+  timeouts on known slow steps (network downloads, test suites) to get better attribution.
+
+  To diagnose which step was running at cancellation: add a step near the end that dumps
+  elapsed time, or use `if: cancelled()` post-steps to capture diagnostics on timeout.
+fix_code:
+  - language: yaml
+    label: "Explicit job-level timeout with diagnostic post-step"
+    code: |
+      jobs:
+        build:
+          runs-on: ubuntu-latest
+          timeout-minutes: 30   # Set explicitly — don't rely on 6h default
+          steps:
+            - uses: actions/checkout@v4
+
+            - name: Build
+              run: make build
+
+            - name: Tests
+              timeout-minutes: 15   # Step-level timeout for attribution
+              run: make test
+
+            # Runs only on cancellation; points the reader at the step durations
+            - name: Dump elapsed time on cancellation
+              if: cancelled()
+              run: echo "Job was cancelled at $(date -u). Check step durations above."
+  - language: yaml
+    label: "Identify which step timed out with job summary annotation"
+    code: |
+      steps:
+        - name: Long network operation
+          timeout-minutes: 10
+          run: |
+            # Use --max-time with curl to avoid relying solely on timeout-minutes
+            curl --max-time 300 https://example.com/large-asset -o output.bin
+
+        - name: Report timeout if cancelled
+          if: cancelled()
+          run: |
+            echo "## ⏱️ Job Timed Out" >> "$GITHUB_STEP_SUMMARY"
+            echo "The job was cancelled. Review step durations in the log." >> "$GITHUB_STEP_SUMMARY"
+prevention:
+  - "Always set `timeout-minutes` at the job level — never rely on the 6-hour GitHub default."
+  - "Add step-level `timeout-minutes` on network-heavy or test steps so cancellation is attributed to a specific step."
+  - "Use `if: cancelled()` post-steps to write a job summary annotation explaining the timeout."
+  - "Run commands with their own timeout flags (e.g., `curl --max-time`, or `pytest --timeout` with the pytest-timeout plugin) in addition to runner timeouts."
+  - "Monitor job duration trends — a job approaching its timeout limit is a signal to investigate performance."
+docs:
+  - url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes"
+    label: "Workflow syntax: jobs.<job_id>.timeout-minutes"
+  - url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes"
+    label: "Workflow syntax: jobs.<job_id>.steps[*].timeout-minutes"
+  - url: "https://github.com/actions/runner/issues/1326"
+    label: "actions/runner#1326 — Steps hanging until timeout with no log output"
+  - url: "https://github.com/orgs/community/discussions/38004"
+    label: "Community: Job stops producing output and is later cancelled"
+  - url: "https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration#usage-limits"
+    label: "Usage limits: maximum job execution time"
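Editorial sketch (not part of the package): the prevention list above recommends giving commands their own time limits in addition to runner timeouts. For commands without a built-in flag, one option on Linux runners is the GNU coreutils `timeout` wrapper, sketched below. The job name and the 10-minute value are assumptions; `make test` is reused from the entry's own example.

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - name: Tests with a command-level limit
        timeout-minutes: 15
        run: |
          # GNU coreutils `timeout` exits with status 124 when the limit is hit,
          # so the step fails as a normal red failure inside the step-level budget
          timeout --signal=TERM 10m make test
```

Because the command-level limit (10 minutes) is shorter than the step-level one (15 minutes), a hang surfaces as a failed step with exit status 124 rather than as a cancelled job.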
@@ -0,0 +1,112 @@
+id: known-unsolved-008
+title: "GITHUB_STEP_SUMMARY Upload Aborted When Content Exceeds 1024k"
+category: known-unsolved
+severity: error
+tags:
+  - step-summary
+  - GITHUB_STEP_SUMMARY
+  - size-limit
+  - job-summary
+  - markdown
+  - limitation
+patterns:
+  - regex: "\\$GITHUB_STEP_SUMMARY upload aborted, supports content up to a size of 1024k, got \\d+k"
+    flags: "i"
+  - regex: "upload aborted.*supports content up to a size of 1024k"
+    flags: "i"
+  - regex: "Error: GITHUB_STEP_SUMMARY.*1024"
+    flags: "i"
+error_messages:
+  - "$GITHUB_STEP_SUMMARY upload aborted, supports content up to a size of 1024k, got 1387k"
+  - "$GITHUB_STEP_SUMMARY upload aborted, supports content up to a size of 1024k, got 2048k"
+root_cause: |
+  GitHub Actions imposes a hard 1 MiB (1024 KiB) size limit on the content each step writes to
+  `$GITHUB_STEP_SUMMARY`. When a step writes more than this limit, the runner aborts
+  the summary upload and logs an error.
+
+  This is a **platform limit that cannot be raised**; workflows can only write less. GitHub has not
+  announced plans to raise the limit.
+
+  Common triggers:
+  1. **Test reporters** — tools like `dorny/test-reporter`, `ctrf-io/github-actions-ctrf`,
+     or `EnricoMi/publish-unit-test-result-action` write per-test result tables. Large
+     test suites (thousands of test cases, especially with long failure messages) easily
+     exceed 1 MiB.
+  2. **Dependency review action** — `actions/dependency-review-action` writes full
+     dependency diff tables. Large projects with hundreds of transitive dependencies produce
+     summaries well above 1 MiB.
+  3. **Coverage reports** — HTML-style coverage tables written to `$GITHUB_STEP_SUMMARY`
+     with per-file rows can grow unboundedly on large monorepos.
+  4. **Log echo pipelines** — `cat large-file >> $GITHUB_STEP_SUMMARY` without size
+     checking is the most direct way to hit the limit.
+
+  The error aborts the summary upload but does **not** fail the step or job by default.
+  Depending on the action's error handling, the step may succeed (exit 0) even though the
+  summary was not written — making this a silent failure from a reporting perspective.
+fix: |
+  Truncate or paginate summary content before writing it. Most test reporters provide
+  options to limit which results are written (e.g., only failures, not all passed tests).
+  For custom summary generation, check the size before writing and truncate with a note.
+fix_code:
+  - language: yaml
+    label: "Truncate summary content with size check before writing"
+    code: |
+      - name: Generate test report
+        run: |
+          # Generate report to a temp file first
+          ./scripts/generate-report.sh > /tmp/report.md
+
+          # Check size before writing to summary (byte count, not disk blocks)
+          SIZE_KB=$(( $(wc -c < /tmp/report.md) / 1024 ))
+          MAX_KB=800  # Leave headroom below 1024k limit
+
+          if [ "$SIZE_KB" -gt "$MAX_KB" ]; then
+            echo "⚠️ Full report too large (${SIZE_KB}k). Showing failures only." >> "$GITHUB_STEP_SUMMARY"
+            ./scripts/generate-report.sh --failures-only >> "$GITHUB_STEP_SUMMARY"
+          else
+            cat /tmp/report.md >> "$GITHUB_STEP_SUMMARY"
+          fi
+  - language: yaml
+    label: "dorny/test-reporter — limit to failures only for large test suites"
+    code: |
+      - name: Test Report
+        uses: dorny/test-reporter@v1
+        if: always()
+        with:
+          name: Test Results
+          path: test-results/**/*.xml
+          reporter: jest-junit
+          # Limit output to avoid 1024k summary limit on large suites
+          only-summary: true   # Write only totals, not per-test rows
+          fail-on-error: false
+  - language: yaml
+    label: "Upload full report as artifact instead of writing to summary"
+    code: |
+      - name: Generate full coverage report
+        run: ./scripts/coverage.sh > /tmp/coverage-full.md
+
+      - name: Write summary (truncated)
+        run: |
+          head -100 /tmp/coverage-full.md >> "$GITHUB_STEP_SUMMARY"
+          echo "" >> "$GITHUB_STEP_SUMMARY"
+          echo "_Full report available as workflow artifact._" >> "$GITHUB_STEP_SUMMARY"
+
+      - name: Upload full report as artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-report
+          path: /tmp/coverage-full.md
+prevention:
+  - "Never pipe unbounded command output directly to `$GITHUB_STEP_SUMMARY` — always size-check or limit first."
+  - "Configure test reporter actions to write only failures (not all passing tests) when the test suite is large."
+  - "Upload large reports as workflow artifacts and link to them from a short summary, instead of embedding all content in the summary."
+  - "The undocumented historical limit of 65,535 characters cited in older docs/answers is no longer accurate — the current limit is 1024 KiB (1 MiB)."
+docs:
+  - url: "https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#adding-a-job-summary"
+    label: "Workflow commands: Adding a job summary"
+  - url: "https://github.com/actions/dependency-review-action/issues/786"
+    label: "dependency-review-action#786 — Job Summary Size Limitation aborts the job"
+  - url: "https://github.com/dorny/test-reporter/issues/379"
+    label: "dorny/test-reporter#379 — Is the step summary limit for 65535 characters still accurate?"
+  - url: "https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration#usage-limits"
+    label: "Usage limits — GitHub Actions"
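Editorial sketch (not part of the package): when no reduced "failures only" report exists, the summary can instead be capped by byte count. The step below is a minimal version under stated assumptions: `/tmp/report.md` was produced by an earlier step, and the 800 KiB budget is an arbitrary choice that leaves headroom below the limit. Note that `head -c` cuts at a byte boundary, so the last Markdown table row or multi-byte character may be truncated.

```yaml
- name: Append a size-capped report to the job summary
  run: |
    MAX_BYTES=$((800 * 1024))
    SIZE_BYTES=$(wc -c < /tmp/report.md)

    # Write at most MAX_BYTES of the report
    head -c "$MAX_BYTES" /tmp/report.md >> "$GITHUB_STEP_SUMMARY"

    # Tell the reader when content was dropped
    if [ "$SIZE_BYTES" -gt "$MAX_BYTES" ]; then
      printf '\n\n_Report truncated at %s of %s bytes._\n' "$MAX_BYTES" "$SIZE_BYTES" >> "$GITHUB_STEP_SUMMARY"
    fi
```

If other commands in the same step also write to the summary, their output counts toward the same budget, so the cap should be lowered accordingly.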
@@ -0,0 +1,127 @@
+id: known-unsolved-009
+title: "Job Killed After Maximum Execution Time (6h Hosted / 35-Day Workflow)"
+category: known-unsolved
+severity: limitation
+tags:
+  - timeout
+  - execution-time
+  - job-limits
+  - platform-limit
+  - self-hosted
+  - workflow-duration
+  - limitation
+patterns:
+  - regex: "The job running has exceeded the maximum execution time"
+    flags: "i"
+  - regex: "exceeded the maximum (?:time|execution time)"
+    flags: "i"
+  - regex: "job .* exceeded .* maximum"
+    flags: "i"
+error_messages:
+  - "The job running on runner GitHub Actions X has exceeded the maximum execution time of 360 minutes."
+  - "The job running has exceeded the maximum execution time"
|
|
23
|
+
root_cause: |
|
|
24
|
+
GitHub Actions enforces hard platform-level execution time limits that cannot be
|
|
25
|
+
overridden or extended by workflow configuration. These limits exist to protect
|
|
26
|
+
shared infrastructure and prevent runaway jobs from consuming unlimited resources.
|
|
27
|
+
|
|
28
|
+
**GitHub-hosted runner limits:**
|
|
29
|
+
- Maximum job execution time: **6 hours** (360 minutes)
|
|
30
|
+
- Maximum workflow run time: **35 days** (across all jobs, including queued time)
|
|
31
|
+
- Default `timeout-minutes` when not set: **360 minutes** (6 hours)
|
|
32
|
+
|
|
33
|
+
**Self-hosted runner limits:**
|
|
34
|
+
- Maximum job execution time: **5 days** (7,200 minutes) by default
|
|
35
|
+
- Maximum workflow run time: **35 days** (same as hosted)
|
|
36
|
+
- Self-hosted limits can be customized in enterprise plans via org/enterprise policies
|
|
37
|
+
|
|
38
|
+
**When limits are hit:**
|
|
39
|
+
- The runner process is sent a SIGTERM (graceful) then SIGKILL (forced) after a grace period
|
|
40
|
+
- The job is marked CANCELLED (not FAILED) in the UI
|
|
41
|
+
- The log message "The job running has exceeded the maximum execution time" appears in
|
|
42
|
+
the runner log (may be visible in the step logs depending on where the runner was killed)
|
|
43
|
+
- Any `post:` steps for active actions (e.g., cache save, artifact upload) are skipped
|
|
44
|
+
- No email notification is sent to the repo owner about the cancellation
|
|
45
|
+
|
|
46
|
+
**Why this is a limitation, not just misconfiguration:**
|
|
47
|
+
- Setting `timeout-minutes` above 360 (6 hours) does not extend the GitHub-hosted cap; the job is still killed at 6 hours
|
|
48
|
+
- `timeout-minutes` is set per job or per step; it can shorten the allowed time but cannot override the platform cap on GitHub-hosted runners
|
|
49
|
+
- Jobs requiring more than 6 hours on GitHub-hosted runners have NO supported path without
|
|
50
|
+
migrating to self-hosted or restructuring the job into multiple shorter sequential jobs
|
|
51
|
+
fix: |
|
|
52
|
+
There is no way to extend the GitHub-hosted runner 6-hour job cap. Options:
|
|
53
|
+
|
|
54
|
+
1. **Break the job into smaller sequential jobs** — split long-running work (e.g., build
|
|
55
|
+
artifacts first, test in separate parallel jobs, deploy last). Each job has its own
|
|
56
|
+
6-hour budget.
|
|
57
|
+
|
|
58
|
+
2. **Migrate to self-hosted runners** — self-hosted runners support up to 5-day jobs.
|
|
59
|
+
Use actions-runner-controller (ARC) or cloud auto-scaling for elastic capacity.
|
|
60
|
+
|
|
61
|
+
3. **Optimize the slow step** — profile build/test times; parallelize with matrix
|
|
62
|
+
strategy; use incremental builds or test sharding to reduce per-job duration.
|
|
63
|
+
|
|
64
|
+
4. **Use caching aggressively** — `actions/cache` reduces download/build time between
|
|
65
|
+
runs, but does not extend limits.
|
|
66
|
+
fix_code:
|
|
67
|
+
- language: yaml
|
|
68
|
+
label: "Split a long job into sequential jobs to stay within 6h per job"
|
|
69
|
+
code: |
|
|
70
|
+
jobs:
|
|
71
|
+
build:
|
|
72
|
+
runs-on: ubuntu-latest
|
|
73
|
+
timeout-minutes: 120 # 2h budget for build
|
|
74
|
+
outputs:
|
|
75
|
+
artifact-id: ${{ steps.upload.outputs.artifact-id }}
|
|
76
|
+
steps:
|
|
77
|
+
- uses: actions/checkout@v4
|
|
78
|
+
- name: Build
|
|
79
|
+
run: make build-release
|
|
80
|
+
- name: Upload build artifact
|
|
81
|
+
id: upload
|
|
82
|
+
uses: actions/upload-artifact@v4
|
|
83
|
+
with:
|
|
84
|
+
name: release-build
|
|
85
|
+
path: dist/
|
|
86
|
+
|
|
87
|
+
# Separate job — gets its own 6h budget
|
|
88
|
+
test:
|
|
89
|
+
needs: build
|
|
90
|
+
runs-on: ubuntu-latest
|
|
91
|
+
timeout-minutes: 180 # 3h budget for tests
|
|
92
|
+
steps:
|
|
93
|
+
- uses: actions/download-artifact@v4
|
|
94
|
+
with:
|
|
95
|
+
name: release-build # matches the name given to upload-artifact in the build job
|
|
96
|
+
- run: make test-full
|
|
97
|
+
- language: yaml
|
|
98
|
+
label: "Self-hosted runner for jobs requiring more than 6 hours"
|
|
99
|
+
code: |
|
|
100
|
+
jobs:
|
|
101
|
+
long-running-job:
|
|
102
|
+
# Self-hosted runners support up to 5-day job duration
|
|
103
|
+
runs-on: [self-hosted, linux, x64]
|
|
104
|
+
timeout-minutes: 2880 # 48h — only possible on self-hosted
|
|
105
|
+
steps:
|
|
106
|
+
- uses: actions/checkout@v4
|
|
107
|
+
- name: Long-running process
|
|
108
|
+
run: ./scripts/full-dataset-processing.sh
|
|
109
|
+
prevention:
|
|
110
|
+
- "Set explicit `timeout-minutes` on every job — don't rely on the implicit 6h GitHub-hosted cap as your only safeguard."
|
|
111
|
+
- "Profile job duration regularly and alert when a job's P99 duration approaches 80% of its timeout budget."
|
|
112
|
+
- "Parallelize test suites using matrix strategy or `actions/github-script` dynamic matrix generation to reduce per-job time."
|
|
113
|
+
- "Use self-hosted runners for any workflow that legitimately requires more than 2-3 hours per job (e.g., large model training, full database rebuild, exhaustive integration tests)."
|
|
114
|
+
- "Be aware that post-run actions (cache save, artifact upload) will NOT execute if the parent job is killed for exceeding the time limit."
|
|
115
|
+
docs:
|
|
116
|
+
- url: "https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration#usage-limits"
|
|
117
|
+
label: "Usage limits: job execution time and workflow run time"
|
|
118
|
+
- url: "https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#usage-limits"
|
|
119
|
+
label: "Self-hosted runner usage limits"
|
|
120
|
+
- url: "https://github.com/orgs/community/discussions/48790"
|
|
121
|
+
label: "Community: Workflow run time limit 35 days"
|
|
122
|
+
- url: "https://github.com/orgs/community/discussions/150900"
|
|
123
|
+
label: "Community: Job cancellation after 6 hours"
|
|
124
|
+
- url: "https://stackoverflow.com/questions/70187174/github-actions-self-hosted-runner-the-job-running-has-exceeded-the-maximum-exe"
|
|
125
|
+
label: "Stack Overflow: The job running has exceeded the maximum execution time"
|
|
126
|
+
- url: "https://github.com/actions/actions-runner-controller"
|
|
127
|
+
label: "Actions Runner Controller (ARC) — Kubernetes-based self-hosted runner auto-scaling"
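Fix option 3 above mentions test sharding. One way to sketch it in shell is a round-robin split of the test file list, assuming each matrix job receives a 0-based shard index and a shard count. The names `shard_files`, `SHARD_INDEX`, and `SHARD_TOTAL` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one test file path per line on stdin and print
# only the lines that belong to shard $1 (0-based) out of $2 shards.
shard_files() {
  local index="$1" total="$2"
  awk -v i="$index" -v n="$total" '(NR - 1) % n == i'
}

# Example: shard 1 of 2 receives every second entry, starting with the second.
printf 'a_test\nb_test\nc_test\nd_test\n' | shard_files 1 2
```

A matrix job could then run something like `find tests -name '*_test.sh' | sort | shard_files "$SHARD_INDEX" "$SHARD_TOTAL"` and execute only its slice. Round-robin balances file counts, not run time, so shards can still be uneven.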
|
|
@@ -0,0 +1,84 @@
|
|
|
1
|
+
id: "runner-environment-022"
|
|
2
|
+
title: "actions/checkout set-safe-directory only runs in post step — container jobs get dubious ownership errors"
|
|
3
|
+
category: "runner-environment"
|
|
4
|
+
severity: "error"
|
|
5
|
+
tags:
|
|
6
|
+
- "actions/checkout"
|
|
7
|
+
- "safe-directory"
|
|
8
|
+
- "container"
|
|
9
|
+
- "dubious-ownership"
|
|
10
|
+
- "CVE-2022-24765"
|
|
11
|
+
- "post-step"
|
|
12
|
+
patterns:
|
|
13
|
+
- regex: "fatal: detected dubious ownership in repository"
|
|
14
|
+
flags: "i"
|
|
15
|
+
- regex: "safe\\.directory.*not.*owned"
|
|
16
|
+
flags: "i"
|
|
17
|
+
error_messages:
|
|
18
|
+
- "fatal: detected dubious ownership in repository at '/github/workspace'"
|
|
19
|
+
- "hint: git config --global --add safe.directory /github/workspace"
|
|
20
|
+
root_cause: |
|
|
21
|
+
The `actions/checkout` action configures `safe.directory` to allow git operations in
|
|
22
|
+
the workspace. However, this configuration only runs in the **post step** (cleanup
|
|
23
|
+
phase), not during the main execution step.
|
|
24
|
+
|
|
25
|
+
In container jobs, the workspace is mounted from the host and may be owned by a
|
|
26
|
+
different UID than the user running inside the container. Git's safe.directory
|
|
27
|
+
protection (introduced in Git 2.35.2 for CVE-2022-24765) blocks access when the
|
|
28
|
+
directory owner differs from the running user.
|
|
29
|
+
|
|
30
|
+
Because safe.directory is only written during the post step — after all workflow
|
|
31
|
+
steps have already run — any subsequent git operations in the job's main steps
|
|
32
|
+
fail with "fatal: detected dubious ownership". This includes third-party actions
|
|
33
|
+
that internally invoke git (reviewdog, gitops tools, semantic-release, etc.).
|
|
34
|
+
|
|
35
|
+
Reported upstream: https://github.com/actions/checkout/issues/2031
|
|
36
|
+
fix: |
|
|
37
|
+
Add an explicit safe.directory configuration step immediately after `actions/checkout`
|
|
38
|
+
in any container job that performs git operations. This ensures the directory is
|
|
39
|
+
trusted before any subsequent steps run.
|
|
40
|
+
fix_code:
|
|
41
|
+
- language: yaml
|
|
42
|
+
label: "Add safe.directory config step after checkout in container jobs"
|
|
43
|
+
code: |
|
|
44
|
+
jobs:
|
|
45
|
+
build:
|
|
46
|
+
runs-on: ubuntu-latest
|
|
47
|
+
container: node:20-bookworm
|
|
48
|
+
steps:
|
|
49
|
+
- uses: actions/checkout@v4
|
|
50
|
+
|
|
51
|
+
# Workaround: post step safe.directory config doesn't help in container jobs
|
|
52
|
+
- name: Mark workspace as safe for git
|
|
53
|
+
run: git config --global --add safe.directory "$GITHUB_WORKSPACE"
|
|
54
|
+
|
|
55
|
+
- name: Run git-dependent steps
|
|
56
|
+
run: git log --oneline -5
|
|
57
|
+
- language: yaml
|
|
58
|
+
label: "Use wildcard to mark all directories safe in container workflows"
|
|
59
|
+
code: |
|
|
60
|
+
jobs:
|
|
61
|
+
build:
|
|
62
|
+
runs-on: ubuntu-latest
|
|
63
|
+
container: python:3.12-slim
|
|
64
|
+
steps:
|
|
65
|
+
- uses: actions/checkout@v4
|
|
66
|
+
|
|
67
|
+
- name: Configure git safe directories
|
|
68
|
+
run: |
|
|
69
|
+
git config --global --add safe.directory '*'
|
|
70
|
+
|
|
71
|
+
- name: Lint with pre-commit
|
|
72
|
+
run: pre-commit run --all-files
|
|
73
|
+
prevention:
|
|
74
|
+
- "Always add a safe.directory config step after checkout when using container jobs"
|
|
75
|
+
- "Audit third-party actions in container jobs — reviewdog, semantic-release, and gitops tools invoke git internally"
|
|
76
|
+
- "Consider running without a container and using docker run explicitly if git safety is complex to manage"
|
|
77
|
+
- "Track https://github.com/actions/checkout/issues/2031 for an official fix from the actions team"
|
|
78
|
+
docs:
|
|
79
|
+
- url: "https://github.com/actions/checkout/issues/2031"
|
|
80
|
+
label: "actions/checkout #2031 — safe.directory only set in post step"
|
|
81
|
+
- url: "https://github.blog/2022-04-12-git-security-vulnerability-announced/"
|
|
82
|
+
label: "Git CVE-2022-24765 — safe.directory background"
|
|
83
|
+
- url: "https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/running-jobs-in-a-container"
|
|
84
|
+
label: "GitHub Docs — Running jobs in a container"
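As an alternative to trusting directories unconditionally, the workaround step could first check whether the ownership mismatch described in the root cause is actually present. The following is a sketch only, assuming a Linux container with GNU or BusyBox `stat`. The function names `workspace_owner_status` and `mark_safe_if_needed` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "mismatch" when the directory owner's UID differs
# from the current user's UID (the condition under which Git reports
# "dubious ownership"), and "match" otherwise.
workspace_owner_status() {
  local dir="$1"
  # `stat -c` is the Linux form; BSD/macOS stat uses different flags.
  if [ "$(stat -c '%u' "$dir")" = "$(id -u)" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Add the safe.directory entry only when the owners differ.
mark_safe_if_needed() {
  local dir="$1"
  if [ "$(workspace_owner_status "$dir")" = "mismatch" ]; then
    git config --global --add safe.directory "$dir"
  fi
}
```

A step placed right after checkout could call `mark_safe_if_needed "$GITHUB_WORKSPACE"`. This keeps the trust scoped to one path, in contrast to the `'*'` wildcard variant, which turns the ownership check off for every repository the container can see.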
|
|
@@ -0,0 +1,99 @@
|
|
|
1
|
+
id: "runner-environment-023"
|
|
2
|
+
title: "Self-hosted runner on deprecated version stops receiving jobs"
|
|
3
|
+
category: "runner-environment"
|
|
4
|
+
severity: "error"
|
|
5
|
+
tags:
|
|
6
|
+
- "self-hosted"
|
|
7
|
+
- "runner"
|
|
8
|
+
- "deprecated"
|
|
9
|
+
- "version"
|
|
10
|
+
- "cannot-receive-messages"
|
|
11
|
+
- "maintenance"
|
|
12
|
+
patterns:
|
|
13
|
+
- regex: "Runner version v\\d+\\.\\d+\\.\\d+ is deprecated and cannot receive messages"
|
|
14
|
+
flags: "i"
|
|
15
|
+
- regex: "WRITE ERROR.*runner.*deprecated"
|
|
16
|
+
flags: "i"
|
|
17
|
+
error_messages:
|
|
18
|
+
- "Runner version v2.332.0 is deprecated and cannot receive messages."
|
|
19
|
+
- "WRITE ERROR: An error occured: Runner version v2.XXX.X is deprecated and cannot receive messages."
|
|
20
|
+
root_cause: |
|
|
21
|
+
GitHub periodically deprecates older self-hosted runner versions. Once a runner version
|
|
22
|
+
is past its deprecation deadline, the runner agent can no longer communicate with the
|
|
23
|
+
GitHub Actions broker service.
|
|
24
|
+
|
|
25
|
+
The runner process stays alive, appears online in the GitHub UI (Settings → Actions →
|
|
26
|
+
Runners), and is listed as "Active" — but it can no longer receive job assignments.
|
|
27
|
+
Jobs queued for a runner group containing only deprecated-version runners will either
|
|
28
|
+
stay "Queued" indefinitely or time out without a clear error in the workflow UI.
|
|
29
|
+
|
|
30
|
+
This is a silent failure mode: the runner shows as online, no workflow error is
|
|
31
|
+
surfaced, but jobs never start. The deprecation schedule is published in the GitHub
|
|
32
|
+
Changelog and actions/runner releases but teams often miss it without automated
|
|
33
|
+
update pipelines.
|
|
34
|
+
|
|
35
|
+
As of 2026, GitHub requires runners to be within approximately the last 6 months of
|
|
36
|
+
releases. Related issues: actions/runner #4305, actions/runner #4442
|
|
37
|
+
fix: |
|
|
38
|
+
Update the runner binary to a currently supported version.
|
|
39
|
+
|
|
40
|
+
For manually managed runners:
|
|
41
|
+
1. SSH to the runner host
|
|
42
|
+
2. Stop the runner service: ./svc.sh stop
|
|
43
|
+
3. Download the latest runner from https://github.com/actions/runner/releases/latest
|
|
44
|
+
4. Extract to the runner directory (configuration is preserved via .runner file)
|
|
45
|
+
5. Restart: ./svc.sh start
|
|
46
|
+
|
|
47
|
+
For ARC (Actions Runner Controller) or autoscaling solutions: bump the runner image
|
|
48
|
+
version tag in your HelmRelease or Deployment manifest and redeploy.
|
|
49
|
+
fix_code:
|
|
50
|
+
- language: yaml
|
|
51
|
+
label: "Scheduled workflow to alert on outdated runner versions"
|
|
52
|
+
code: |
|
|
53
|
+
name: Check self-hosted runner versions
|
|
54
|
+
on:
|
|
55
|
+
schedule:
|
|
56
|
+
- cron: '0 9 * * 1' # Weekly Monday 9 AM
|
|
57
|
+
|
|
58
|
+
jobs:
|
|
59
|
+
check-runners:
|
|
60
|
+
runs-on: ubuntu-latest # GitHub-hosted runner for this diagnostic
|
|
61
|
+
steps:
|
|
62
|
+
- name: List runner versions via API
|
|
63
|
+
env:
|
|
64
|
+
GH_TOKEN: ${{ secrets.RUNNER_READ_TOKEN }}
|
|
65
|
+
run: |
|
|
66
|
+
echo "=== Org runners ==="
|
|
67
|
+
gh api /orgs/${{ github.repository_owner }}/actions/runners \
|
|
68
|
+
--jq '.runners[] | "\(.name): v\(.version) (\(.status))"'
|
|
69
|
+
|
|
70
|
+
- name: Check latest available version
|
|
71
|
+
run: |
|
|
72
|
+
LATEST=$(curl -sf https://api.github.com/repos/actions/runner/releases/latest | jq -r .tag_name)
|
|
73
|
+
echo "Latest runner version: $LATEST"
|
|
74
|
+
echo "Compare against your registered runners above"
|
|
75
|
+
- language: yaml
|
|
76
|
+
label: "ARC — bump runner version in HelmRelease"
|
|
77
|
+
code: |
|
|
78
|
+
# In your ARC HelmRelease or values.yaml
|
|
79
|
+
githubConfigUrl: "https://github.com/myorg"
|
|
80
|
+
template:
|
|
81
|
+
spec:
|
|
82
|
+
containers:
|
|
83
|
+
- name: runner
|
|
84
|
+
image: ghcr.io/actions/actions-runner:2.335.0 # Bump this regularly
|
|
85
|
+
prevention:
|
|
86
|
+
- "Subscribe to the GitHub Changelog (https://github.blog/changelog/) or watch actions/runner releases for deprecation notices"
|
|
87
|
+
- "Use Actions Runner Controller (ARC) or an autoscaling solution to automate runner lifecycle management"
|
|
88
|
+
- "Schedule a weekly cron workflow that checks registered runner versions via the Runners API and alerts if any are outdated"
|
|
89
|
+
- "Pin runner version in IaC (Terraform, Ansible) and include a runner version bump in your monthly maintenance checklist"
|
|
90
|
+
- "Set up Dependabot or Renovate to auto-update runner image tags in Docker/ARC manifests"
|
|
91
|
+
docs:
|
|
92
|
+
- url: "https://github.com/actions/runner/releases"
|
|
93
|
+
label: "actions/runner releases — version history and changelogs"
|
|
94
|
+
- url: "https://github.com/actions/runner/issues/4305"
|
|
95
|
+
label: "actions/runner #4305 — runner deprecated cannot receive messages"
|
|
96
|
+
- url: "https://github.com/actions/runner/issues/4442"
|
|
97
|
+
label: "actions/runner #4442 — version deprecation notice"
|
|
98
|
+
- url: "https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners"
|
|
99
|
+
label: "GitHub Docs — About self-hosted runners"
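The scheduled workflow above leaves the comparison to the reader ("Compare against your registered runners above"). A possible way to automate that step is sketched below, assuming GNU `sort -V` is available (as on `ubuntu-latest`) and that runner names and versions have already been collected as `name version` pairs. The helpers `version_lt` and `outdated_runners` are hypothetical. Being older than the latest release is not the same as being deprecated, so treat the output as an early warning only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when version $1 is strictly older than $2.
# A leading "v" (as in release tags such as v2.335.0) is ignored.
version_lt() {
  local a="${1#v}" b="${2#v}"
  [ "$a" != "$b" ] &&
    [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n1)" = "$a" ]
}

# Read "name version" pairs on stdin and report runners older than $1.
outdated_runners() {
  local latest="$1" name version
  while read -r name version; do
    if version_lt "$version" "$latest"; then
      echo "$name is on $version (latest: $latest)"
    fi
  done
}
```

For example, `printf 'build-01 2.311.0\n' | outdated_runners "$LATEST"` would flag `build-01` when `$LATEST` holds a newer tag. Version sort is used instead of plain string comparison so that `2.40.0` is correctly ordered before `2.335.0`.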
|
|
@@ -0,0 +1,82 @@
|
|
|
1
|
+
id: "silent-failures-010"
|
|
2
|
+
title: "cache-hit output is 'true' on restore-keys partial match, not just exact key"
|
|
3
|
+
category: "silent-failures"
|
|
4
|
+
severity: "silent-failure"
|
|
5
|
+
tags:
|
|
6
|
+
- "actions/cache"
|
|
7
|
+
- "cache-hit"
|
|
8
|
+
- "restore-keys"
|
|
9
|
+
- "partial-match"
|
|
10
|
+
- "output"
|
|
11
|
+
- "exact-match"
|
|
12
|
+
patterns:
|
|
13
|
+
- regex: "cache-hit.*true"
|
|
14
|
+
flags: "i"
|
|
15
|
+
error_messages:
|
|
16
|
+
- "cache-hit: true"
|
|
17
|
+
root_cause: |
|
|
18
|
+
The `cache-hit` output of `actions/cache` is documented to return `true` for an exact
|
|
19
|
+
cache key match. In practice, `cache-hit` also returns `true` when the key matched via
|
|
20
|
+
`restore-keys` (a partial/fallback match), not only for exact key hits.
|
|
21
|
+
|
|
22
|
+
Workflows that gate post-build steps on `steps.cache.outputs.cache-hit == 'true'` to
|
|
23
|
+
skip dependency installs assume an exact match. When a restore-keys match occurs,
|
|
24
|
+
`cache-hit` is `true` even though the cache may be stale or from a different branch.
|
|
25
|
+
The correct exact-match indicator is the `exact-match` output (available in
|
|
26
|
+
actions/cache v4+) or comparing `cache-matched-key` against the computed key.
|
|
27
|
+
|
|
28
|
+
Reported upstream: https://github.com/actions/cache/issues/1675
|
|
29
|
+
fix: |
|
|
30
|
+
Use the `exact-match` output (actions/cache v4+) to determine if the restore was an
|
|
31
|
+
exact key match. `cache-hit` alone does not distinguish between exact and partial
|
|
32
|
+
(restore-keys) matches.
|
|
33
|
+
|
|
34
|
+
Alternatively, compare the `cache-matched-key` output against the expected key to
|
|
35
|
+
determine if the restore was exact.
|
|
36
|
+
fix_code:
|
|
37
|
+
- language: yaml
|
|
38
|
+
label: "Use exact-match output (actions/cache v4+)"
|
|
39
|
+
code: |
|
|
40
|
+
- name: Restore cache
|
|
41
|
+
id: cache
|
|
42
|
+
uses: actions/cache@v4
|
|
43
|
+
with:
|
|
44
|
+
path: ~/.npm
|
|
45
|
+
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
|
|
46
|
+
restore-keys: |
|
|
47
|
+
${{ runner.os }}-node-
|
|
48
|
+
|
|
49
|
+
# Only skip install on EXACT key match, not partial restore-keys hit
|
|
50
|
+
- name: Install dependencies
|
|
51
|
+
if: steps.cache.outputs.exact-match != 'true'
|
|
52
|
+
run: npm ci
|
|
53
|
+
- language: yaml
|
|
54
|
+
label: "Compare cache-matched-key to verify exact hit"
|
|
55
|
+
code: |
|
|
56
|
+
- name: Restore cache
|
|
57
|
+
id: cache
|
|
58
|
+
uses: actions/cache@v4
|
|
59
|
+
with:
|
|
60
|
+
path: ~/.npm
|
|
61
|
+
key: npm-${{ hashFiles('**/package-lock.json') }}
|
|
62
|
+
restore-keys: npm-
|
|
63
|
+
|
|
64
|
+
- name: Install or skip dependencies
|
|
65
|
+
run: |
|
|
66
|
+
EXPECTED_KEY="npm-${{ hashFiles('**/package-lock.json') }}"
|
|
67
|
+
if [ "${{ steps.cache.outputs.cache-matched-key }}" = "$EXPECTED_KEY" ]; then
|
|
68
|
+
echo "Exact cache hit — skipping npm ci"
|
|
69
|
+
else
|
|
70
|
+
echo "Partial/stale restore-keys hit — running npm ci"
|
|
71
|
+
npm ci
|
|
72
|
+
fi
|
|
73
|
+
prevention:
|
|
74
|
+
- "Never rely on cache-hit == 'true' alone to skip dependency installs; it fires on partial restore-keys matches too"
|
|
75
|
+
- "Use the exact-match output (actions/cache v4+) when you need to distinguish exact vs partial cache hits"
|
|
76
|
+
- "Use cache-matched-key output to log or compare the actual key that was restored"
|
|
77
|
+
- "If using restore-keys, always validate that your skip-install condition handles partial matches correctly"
|
|
78
|
+
docs:
|
|
79
|
+
- url: "https://github.com/actions/cache/issues/1675"
|
|
80
|
+
label: "actions/cache #1675 — cache-hit true on restore-keys match"
|
|
81
|
+
- url: "https://github.com/actions/cache#outputs"
|
|
82
|
+
label: "actions/cache outputs documentation"
|
package/package.json
CHANGED