@htekdev/actions-debugger 1.0.3 → 1.0.5

@@ -0,0 +1,105 @@
1
+ id: concurrency-timing-006
2
+ title: "Job Stuck: 'Waiting for a Runner to Pick Up This Job'"
3
+ category: concurrency-timing
4
+ severity: error
5
+ tags:
6
+ - runner
7
+ - runs-on
8
+ - self-hosted
9
+ - queued
10
+ - stuck
11
+ - deprecated-runner
12
+ patterns:
13
+ - regex: "Waiting for a runner to pick up this job"
14
+ flags: "i"
15
+ - regex: "No runner matching the specified labels"
16
+ flags: "i"
17
+ - regex: "Could not find any online and idle runners"
18
+ flags: "i"
19
+ error_messages:
20
+ - "Waiting for a runner to pick up this job."
21
+ - "No runner matching the specified labels was found: [your-label]"
22
+ - "Could not find any online and idle runners matching the required labels."
23
+ root_cause: |
24
+ A job remains stuck in the "queued" state — showing "Waiting for a runner to pick up
25
+ this job" — when GitHub Actions cannot find an available runner matching the `runs-on:`
26
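+ labels. The job stays queued until a matching runner comes online or the run is cancelled; `timeout-minutes` limits execution time only, and GitHub cancels self-hosted jobs that stay queued for 24 hours.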
27
+
28
+ The most common causes:
29
+
30
+ 1. **Deprecated or retired runner label** — GitHub periodically retires old runner images.
31
+ `ubuntu-18.04` was retired in April 2023. `ubuntu-20.04` deprecation is in progress.
32
+ Jobs using these labels get stuck because no GitHub-hosted runners serve the label.
33
+
34
+ 2. **Typo in `runs-on:` label**: `ubuntu-latets` and `ubuntu_latest` match no runner and
35
+ raise no validation error. Casing alone is an unlikely cause, since GitHub documents self-hosted runner labels as case-insensitive.
36
+
37
+ 3. **Self-hosted runner offline or de-registered** — the runner was stopped, the service
38
+ was not restarted after a reboot, or the runner registration token expired. GitHub queues
39
+ the job and waits for a registered runner with matching labels to come online.
40
+
41
+ 4. **Runner group restrictions** — organization admins restrict which repositories can use
42
+ which runner groups. A job referencing a group the repository is not authorized for will
43
+ queue indefinitely without an explicit permission error.
44
+
45
+ 5. **All runners busy** — all matching runners are executing other jobs. The job correctly
46
+ queues but appears "stuck" during peak usage. It will eventually be picked up.
47
+
48
+ There is no notification when a job has been queued for an unusually long time — the only
49
+ signal is the job's wall-clock age and the static "Waiting for a runner" message.
50
+ fix: |
51
+ Verify the `runs-on:` label against the current list of supported GitHub-hosted runner
52
+ images. For self-hosted runners, check runner registration and service health.
53
+ fix_code:
54
+ - language: yaml
55
+ label: "Use current, non-deprecated GitHub-hosted runner labels"
56
+ code: |
57
+ jobs:
58
+ build:
59
+ # Use current supported labels only
60
+ runs-on: ubuntu-latest # OR ubuntu-22.04, ubuntu-24.04
61
+ # NOT: ubuntu-18.04 (retired), ubuntu-20.04 (deprecated)
62
+
63
+ build-windows:
64
+ runs-on: windows-latest # OR windows-2022, windows-2025
65
+
66
+ build-macos:
67
+ runs-on: macos-latest # OR macos-13, macos-14, macos-15
68
+ - language: yaml
69
+ label: "Self-hosted runner — verify registration and labels match exactly"
70
+ code: |
71
+ jobs:
72
+ deploy:
73
+ # Labels must exactly match what the runner was registered with
74
+ # Check: GitHub Settings → Actions → Runners → click runner → Labels
75
+ runs-on: [self-hosted, linux, production]
76
+
77
+ steps:
78
+ - name: Verify runner is the expected host
79
+ run: echo "Running on $RUNNER_NAME at $(hostname)"
80
+ - language: yaml
81
+ label: "Matrix across hosted and self-hosted runners (both legs run; this is not automatic failover)"
82
+ code: |
83
+ jobs:
84
+ build:
85
+ strategy:
86
+ matrix:
87
+ runner: [ubuntu-latest, [self-hosted, linux]]
88
+ runs-on: ${{ matrix.runner }}
89
+ prevention:
90
+ - "Audit `runs-on:` labels in all workflows when GitHub announces runner image deprecations."
91
+ - "Set a job-level `timeout-minutes` to bound execution time once a runner picks up the job; it does not limit queue time, so track the age of queued jobs separately."
92
+ - "For self-hosted runners, install the runner as a service so it restarts after a reboot (on Linux: `sudo ./svc.sh install` followed by `sudo ./svc.sh start`)."
93
+ - "Use GitHub's runner status page (Settings → Actions → Runners) to verify runners are Online before triggering long jobs."
94
+ - "Subscribe to GitHub Changelog and Actions deprecation notices to catch retiring runner labels early."
95
+ docs:
96
+ - url: "https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources"
97
+ label: "Supported GitHub-hosted runner labels"
98
+ - url: "https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners"
99
+ label: "Adding self-hosted runners"
100
+ - url: "https://stackoverflow.com/questions/70959954/error-waiting-for-a-runner-to-pick-up-this-job-using-github-actions"
101
+ label: "Stack Overflow: Waiting for a runner to pick up this job"
102
+ - url: "https://github.com/actions/runner/issues/3609"
103
+ label: "actions/runner#3609 — Self-hosted runner stuck / deadlock"
104
+ - url: "https://github.com/orgs/community/discussions/147604"
105
+ label: "Community: Workflow stuck in queued state"
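Because no notification fires for long-queued jobs, a scheduled watchdog is one way to surface them. The workflow below is a minimal sketch and not part of the package: the workflow name, the 15-minute schedule, and the 30-minute threshold are assumptions, and it checks run-level `queued` status through `gh run list`, which only approximates job-level queue time.

```yaml
name: queued-run-watchdog            # hypothetical workflow name
on:
  schedule:
    - cron: "*/15 * * * *"           # assumed polling interval
permissions:
  actions: read
jobs:
  report-stale-queued-runs:
    runs-on: ubuntu-latest           # must be a runner that is itself available
    timeout-minutes: 5
    steps:
      - name: List runs queued for more than 30 minutes
        env:
          GH_TOKEN: ${{ github.token }}
          GH_REPO: ${{ github.repository }}
        run: |
          # 1800 seconds is an assumed threshold; tune it to your normal queue times.
          gh run list --status queued --limit 100 \
            --json databaseId,workflowName,createdAt \
            --jq '.[] | select((now - (.createdAt | fromdateiso8601)) > 1800)
                  | "run \(.databaseId) (\(.workflowName)) queued since \(.createdAt)"'
```

The step only prints matches; wiring the output to a chat webhook or an issue is left out because the right channel depends on the team.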
@@ -0,0 +1,113 @@
1
+ id: concurrency-timing-007
2
+ title: "Matrix Sibling Jobs Silently Cancelled by fail-fast Default"
3
+ category: concurrency-timing
4
+ severity: silent-failure
5
+ tags:
6
+ - matrix
7
+ - fail-fast
8
+ - cancellation
9
+ - silent-failure
10
+ - strategy
11
+ - job-cancelled
12
+ patterns:
13
+ - regex: "Some jobs were not run because a sibling job failed"
14
+ flags: "i"
15
+ - regex: "Canceling since a higher priority waiting run was found"
16
+ flags: "i"
17
+ - regex: "The workflow run was canceled\\."
18
+ flags: "i"
19
+ error_messages:
20
+ - "Some jobs were not run because a sibling job failed. To allow them to run anyway, add 'continue-on-error: true' to the matrix job."
21
+ - "Job was cancelled"
22
+ root_cause: |
23
+ GitHub Actions matrix strategy defaults to `fail-fast: true`. When ANY matrix leg fails,
24
+ GitHub immediately cancels all other in-progress and pending legs in the same matrix.
25
+
26
+ This default is rarely what developers want during debugging or CI investigation, and
27
+ produces a confusing failure pattern:
28
+
29
+ 1. **Cancelled legs appear as "Cancelled" not "Failed"** — matrix siblings killed by
30
+ `fail-fast` show as CANCELLED in the UI (grey icon) rather than red failures. Developers
31
+ scanning the run summary see one red failure and many grey cancellations, and may not
32
+ realize those sibling legs had reached significant progress (e.g., partway through a
33
+ test suite on a different OS or Node version) before being killed.
34
+
35
+ 2. **Root cause is obscured** — the only failing leg that matters for diagnosis is the one
36
+ that triggered `fail-fast`, but with multiple cancellations in the UI, it can be hard to
37
+ identify which leg failed first.
38
+
39
+ 3. **`fail-fast` is inherited silently** — there is no warning annotation that says
40
+ "fail-fast is enabled and cancelled 5 sibling legs." The default is documented but
41
+ easy to forget when adding a new matrix.
42
+
43
+ 4. **Re-running failed jobs doesn't re-run cancelled siblings** — "Re-run failed jobs"
44
+ only re-runs the legs that explicitly FAILED, not the ones that were cancelled by
45
+ fail-fast. Developers re-running failed jobs think they'll see results from all legs,
46
+ but cancelled siblings stay cancelled. Only "Re-run all jobs" restarts everything.
47
+
48
+ Example: a 3-OS matrix (ubuntu, windows, macos) where ubuntu fails. With fail-fast,
49
+ windows and macos are immediately cancelled. The developer sees one failure and two
50
+ cancellations, re-runs the failed ubuntu job, and never discovers that windows also
51
+ had an independent failing test.
52
+ fix: |
53
+ Set `fail-fast: false` explicitly on any matrix where you need full signal from all
54
+ legs — especially for cross-platform or multi-version compatibility matrices. Use
55
+ `fail-fast: true` intentionally only when running the full matrix after one failure is
56
+ wasteful (e.g., expensive build matrices during pre-merge CI).
57
+ fix_code:
58
+ - language: yaml
59
+ label: "Disable fail-fast to see all matrix leg results"
60
+ code: |
61
+ jobs:
62
+ test:
63
+ strategy:
64
+ fail-fast: false # All legs run regardless of siblings failing
65
+ matrix:
66
+ os: [ubuntu-latest, windows-latest, macos-latest]
67
+ node: [18, 20, 22]
68
+ runs-on: ${{ matrix.os }}
69
+ steps:
70
+ - uses: actions/checkout@v4
71
+ - uses: actions/setup-node@v4
72
+ with:
73
+ node-version: ${{ matrix.node }}
74
+ - run: npm ci
75
+ - run: npm test
76
+ - language: yaml
77
+ label: "Use fail-fast: true only for expensive pre-merge CI"
78
+ code: |
79
+ jobs:
80
+ # Pre-merge: fail fast to conserve minutes — just need to know if it passes
81
+ lint-and-typecheck:
82
+ strategy:
83
+ fail-fast: true # OK: fast, cheap, fail early
84
+ matrix:
85
+ node: [20, 22]
86
+ runs-on: ubuntu-latest
87
+ steps:
88
+ - run: npm run lint && npm run typecheck
89
+
90
+ # Post-merge: always see all platform results
91
+ full-test-suite:
92
+ if: github.event_name == 'push'
93
+ strategy:
94
+ fail-fast: false # Need full signal on all platforms
95
+ matrix:
96
+ os: [ubuntu-latest, windows-latest, macos-latest]
97
+ runs-on: ${{ matrix.os }}
98
+ steps:
99
+ - run: npm test
100
+ prevention:
101
+ - "Always set `fail-fast: false` explicitly on cross-platform or multi-version matrices where you need full compatibility signal."
102
+ - "After a matrix failure, use 'Re-run all jobs' (not 'Re-run failed jobs') to get results from previously-cancelled siblings."
103
+ - "Add a workflow summary step with `if: always()` to collect and consolidate test results across all matrix legs even when some are cancelled."
104
+ - "Be aware that cancelled legs (grey) are NOT the same as passed legs (green) — visually scan for both red and grey when investigating failures."
105
+ docs:
106
+ - url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idstrategyfail-fast"
107
+ label: "Workflow syntax: jobs.<job_id>.strategy.fail-fast"
108
+ - url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idstrategymatrix"
109
+ label: "Workflow syntax: jobs.<job_id>.strategy.matrix"
110
+ - url: "https://github.com/orgs/community/discussions/26822"
111
+ label: "Community: fail-fast cancels matrix siblings unexpectedly"
112
+ - url: "https://stackoverflow.com/questions/57850553/github-actions-check-steps-status"
113
+ label: "Stack Overflow: Matrix job cancellation behavior with fail-fast"
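The prevention list suggests a summary step guarded by `if: always()`. One way to sketch that idea is a follow-up job that depends on the matrix job and records its aggregate result. The job name `matrix-summary` is an assumption, and `needs.test.result` reports a single combined result for the whole matrix rather than one value per leg.

```yaml
jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test

  matrix-summary:                    # hypothetical job name
    needs: test
    if: always()                     # run even when legs failed or were cancelled
    runs-on: ubuntu-latest
    steps:
      - name: Record the aggregate matrix result
        run: |
          echo "## Matrix result: ${{ needs.test.result }}" >> "$GITHUB_STEP_SUMMARY"
          echo "Anything other than 'success' means at least one leg needs a closer look." >> "$GITHUB_STEP_SUMMARY"
```

Without `if: always()`, the summary job would be skipped whenever the matrix job does not succeed, which is exactly when the summary is most useful.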
@@ -0,0 +1,97 @@
1
+ id: "concurrency-timing-008"
2
+ title: "Intermittent 'Required runner group not found' when ephemeral runner registers after job dispatch"
3
+ category: "concurrency-timing"
4
+ severity: "error"
5
+ tags:
6
+ - "runner-group"
7
+ - "self-hosted"
8
+ - "ephemeral"
9
+ - "race-condition"
10
+ - "dispatch"
11
+ - "organization"
12
+ patterns:
13
+ - regex: "Required runner group '.*' not found"
14
+ flags: "i"
15
+ - regex: "runner_group_id.*null"
16
+ flags: "i"
17
+ error_messages:
18
+ - "Required runner group 'x' not found"
19
+ root_cause: |
20
+ In autoscaling self-hosted runner setups (EC2, GKE ephemeral runners, ARC), the runner
21
+ must register with GitHub BEFORE the job dispatcher resolves runner group membership.
22
+
23
+ The race condition occurs when:
24
+ 1. A workflow is triggered and GitHub's broker immediately tries to assign the job
25
+ 2. The ephemeral runner is still initializing (EC2 bootstrap, container pull, ~2:30 min)
26
+ 3. The broker resolves runner group membership before the new runner completes registration
27
+ 4. The broker reports "Required runner group 'X' not found" and fails the job
28
+
29
+ This is intermittent: matrix jobs expose it clearly because some cells get
30
+ already-running (pre-registered) runners while others need a fresh runner, triggering
31
+ the race. Inspecting the failed job via the Jobs API shows `runner_group_id: null`
32
+ and `runner_name: null` throughout the queue duration even though the runner group
33
+ exists and has the correct repository access.
34
+
35
+ A separate but related pattern occurs with org-level runner group repository access
36
+ grants not propagating to the broker V2 protocol in time, causing identical symptoms
37
+ regardless of runner initialization speed.
38
+
39
+ Reported upstream: https://github.com/actions/runner/issues/4252
40
+ Related: https://github.com/actions/runner/issues/4429
41
+ fix: |
42
+ For ephemeral autoscaling runners:
43
+ Implement a registration wait loop that polls the GitHub Runners API before signaling
44
+ the runner as available. The runner should only become eligible for jobs after the
45
+ broker has acknowledged its registration.
46
+
47
+ For org-level runner group access issues:
48
+ Verify that the target repository is in the runner group's allowed repositories list
49
+ via API. If misconfigured, re-registering the runner at repository level instead of
50
+ org level is a reliable workaround.
51
+
52
+ General mitigation: Add timeout-minutes to all jobs on self-hosted runners to bound
53
+ execution time. It does not limit queue time, so detect stuck queued jobs by their age; GitHub cancels self-hosted jobs that stay queued for 24 hours.
54
+ fix_code:
55
+ - language: yaml
56
+ label: "Add timeout-minutes to bound execution time on self-hosted runners"
57
+ code: |
58
+ jobs:
59
+ build:
60
+ runs-on:
61
+ group: my-runner-group
62
+ labels: [self-hosted, linux]
63
+ timeout-minutes: 10 # Bounds execution once a runner picks up the job; does not limit queue time
64
+ steps:
65
+ - uses: actions/checkout@v4
66
+ - run: echo "Runner assigned successfully"
67
+ - language: yaml
68
+ label: "Verify runner group repository access via API"
69
+ code: |
70
+ jobs:
71
+ debug-runner-group:
72
+ runs-on: ubuntu-latest
73
+ steps:
74
+ - name: Check runner group repository access
75
+ env:
76
+ GH_TOKEN: ${{ secrets.ORG_RUNNER_READ_TOKEN }}
77
+ run: |
78
+ echo "Runner groups and their visibility:"
79
+ gh api /orgs/${{ github.repository_owner }}/actions/runner-groups \
80
+ --jq '.runner_groups[] | "\(.name) (id: \(.id)) — visibility: \(.visibility)"'
81
+
82
+ echo "Repositories allowed for runner group ID 1:"
83
+ gh api /orgs/${{ github.repository_owner }}/actions/runner-groups/1/repositories \
84
+ --jq '.repositories[].full_name'
85
+ prevention:
86
+ - "Add timeout-minutes to all jobs using self-hosted runners to bound execution time; it does not cover queue time, so also alert on jobs that stay queued too long"
87
+ - "For ephemeral runners (EC2/ARC), implement a registration health check that polls the Runners API before the runner accepts jobs"
88
+ - "For org-level runners, verify group repository access via API after any runner group configuration change"
89
+ - "For matrix jobs with ephemeral runners, keep N idle pre-registered runners to avoid cold-start races"
90
+ - "Monitor runner_group_id via the Jobs API to detect dispatch failures early in autoscaling pipelines"
91
+ docs:
92
+ - url: "https://github.com/actions/runner/issues/4252"
93
+ label: "actions/runner #4252 — Intermittent runner group not found"
94
+ - url: "https://github.com/actions/runner/issues/4429"
95
+ label: "actions/runner #4429 — Org-level runner never dispatched"
96
+ - url: "https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/managing-access-to-self-hosted-runners-using-groups"
97
+ label: "GitHub Docs — Managing runner group access"
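The root cause above notes that a failed dispatch shows `runner_group_id: null` and `runner_name: null` in the Jobs API. A manually triggered workflow like the following sketch can print those fields for every job in a given run. The workflow name and the `run_id` input are assumptions introduced for illustration.

```yaml
name: inspect-job-dispatch           # hypothetical workflow name
on:
  workflow_dispatch:
    inputs:
      run_id:                        # hypothetical input: the run to inspect
        description: "ID of the workflow run to inspect"
        required: true
permissions:
  actions: read
jobs:
  inspect:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - name: Show runner assignment for each job in the run
        env:
          GH_TOKEN: ${{ github.token }}
          RUN_ID: ${{ inputs.run_id }}
        run: |
          # Unassigned jobs print "null" for runner_group_id and runner_name.
          gh api "/repos/${{ github.repository }}/actions/runs/${RUN_ID}/jobs" \
            --jq '.jobs[] | "\(.name): status=\(.status) runner_group_id=\(.runner_group_id) runner_name=\(.runner_name)"'
```

It runs on a GitHub-hosted runner on purpose, so the check still works while the self-hosted pool is the thing under investigation.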
@@ -0,0 +1,107 @@
1
+ id: concurrency-timing-005
2
+ title: "Job Silently Cancelled When timeout-minutes Is Exceeded"
3
+ category: concurrency-timing
4
+ severity: error
5
+ tags:
6
+ - timeout
7
+ - timeout-minutes
8
+ - job-cancelled
9
+ - timing
10
+ - runner
11
+ patterns:
12
+ - regex: "##\\[error\\]The operation was cancelled\\."
13
+ flags: "i"
14
+ - regex: "The job '.*' was cancelled because it exceeded the maximum execution time"
15
+ flags: "i"
16
+ - regex: "Error: The operation was canceled"
17
+ flags: "i"
18
+ - regex: "cancel is received"
19
+ flags: "i"
20
+ error_messages:
21
+ - "##[error]The operation was cancelled."
22
+ - "Error: The operation was canceled"
23
+ - "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled."
24
+ root_cause: |
25
+ When a job (or step) exceeds its configured `timeout-minutes`, GitHub Actions sends a
26
+ cancellation signal to the runner. The runner has 5 minutes to complete graceful shutdown,
27
+ after which it is forcibly terminated.
28
+
29
+ The failure mode has several layers of confusion:
30
+
31
+ 1. **Status shows "Cancelled" not "Failed"** — a timed-out job is marked CANCELLED in the
32
+ UI. It does not appear as a red failure. Developers scanning the Actions tab may miss it
33
+ entirely, especially if another run succeeded after it.
34
+
35
+ 2. **No step-level attribution** — the job log shows "The operation was cancelled" but does
36
+ not identify which specific step was still running or how far it had progressed. Long
37
+ builds, network-heavy steps, and interactive prompts are common culprits.
38
+
39
+ 3. **Default timeout is 360 minutes (6 hours)** — if `timeout-minutes` is not explicitly
40
+ set, GitHub uses the platform default of 6 hours for GitHub-hosted runners. A job that
41
+ accidentally blocks (waiting for user input, infinite loop, hung network call) will silently
42
+ consume 6 hours of runner minutes before being cancelled with no diagnostic output.
43
+
44
+ 4. **Step-level timeouts are independent**: `timeout-minutes` on a `steps[*]` entry stops
45
+ only that step, which is then marked failed; later steps are skipped unless `continue-on-error` or an `if:` condition applies. `timeout-minutes` on `jobs[*]` cancels the entire job.
46
+ Mixing both is valid, but do it deliberately.
47
+ fix: |
48
+ Always set explicit `timeout-minutes` at the job level to bound worst-case runner cost.
49
+ Tune based on your typical build time (e.g., 2-3× the median duration). Add step-level
50
+ timeouts on known slow steps (network downloads, test suites) to get better attribution.
51
+
52
+ To diagnose which step was running at cancellation: add a step near the end that dumps
53
+ elapsed time, or use `if: cancelled()` post-steps to capture diagnostics on timeout.
54
+ fix_code:
55
+ - language: yaml
56
+ label: "Explicit job-level timeout with diagnostic post-step"
57
+ code: |
58
+ jobs:
59
+ build:
60
+ runs-on: ubuntu-latest
61
+ timeout-minutes: 30 # Set explicitly — don't rely on 6h default
62
+ steps:
63
+ - uses: actions/checkout@v4
64
+
65
+ - name: Build
66
+ run: make build
67
+
68
+ - name: Tests
69
+ timeout-minutes: 15 # Step-level timeout for attribution
70
+ run: make test
71
+
72
+ # Runs only when the job is cancelled (for example, after a job-level timeout)
73
+ - name: Dump elapsed time on cancellation
74
+ if: cancelled()
75
+ run: echo "Job was cancelled at $(date -u). Check step durations above."
76
+ - language: yaml
77
+ label: "Identify which step timed out with job summary annotation"
78
+ code: |
79
+ steps:
80
+ - name: Long network operation
81
+ timeout-minutes: 10
82
+ run: |
83
+ # Use --max-time with curl to avoid relying solely on timeout-minutes
84
+ curl --max-time 300 https://example.com/large-asset -o output.bin
85
+
86
+ - name: Report timeout if cancelled
87
+ if: cancelled()
88
+ run: |
89
+ echo "## ⏱️ Job Timed Out" >> $GITHUB_STEP_SUMMARY
90
+ echo "The job was cancelled. Review step durations in the log." >> $GITHUB_STEP_SUMMARY
91
+ prevention:
92
+ - "Always set `timeout-minutes` at the job level — never rely on the 6-hour GitHub default."
93
+ - "Add step-level `timeout-minutes` on network-heavy or test steps so cancellation is attributed to a specific step."
94
+ - "Use `if: cancelled()` post-steps to write a job summary annotation explaining the timeout."
95
+ - "Run commands with their own timeout flags (e.g., `curl --max-time`, `pytest --timeout`) in addition to runner timeouts."
96
+ - "Monitor job duration trends — a job approaching its timeout limit is a signal to investigate performance."
97
+ docs:
98
+ - url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes"
99
+ label: "Workflow syntax: jobs.<job_id>.timeout-minutes"
100
+ - url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes"
101
+ label: "Workflow syntax: jobs.<job_id>.steps[*].timeout-minutes"
102
+ - url: "https://github.com/actions/runner/issues/1326"
103
+ label: "actions/runner#1326 — Steps hanging until timeout with no log output"
104
+ - url: "https://github.com/orgs/community/discussions/38004"
105
+ label: "Community: Job stops producing output and is later cancelled"
106
+ - url: "https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration#usage-limits"
107
+ label: "Usage limits: maximum job execution time"
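The fix text mentions dumping elapsed time near the end of the job. A minimal sketch of that idea follows, assuming a hypothetical `JOB_START` environment variable and reusing `make test` from the example above. It uses `if: always()` so the report is attempted after a failed step as well as a cancelled one.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - name: Record start time
        run: echo "JOB_START=$(date +%s)" >> "$GITHUB_ENV"   # JOB_START is a hypothetical name

      - name: Tests
        timeout-minutes: 15
        run: make test

      - name: Report elapsed time and final status
        if: always()                 # also runs after a failed or cancelled step
        run: |
          elapsed=$(( $(date +%s) - JOB_START ))
          echo "Elapsed: ${elapsed}s, job status: ${{ job.status }}" >> "$GITHUB_STEP_SUMMARY"
```

Comparing the elapsed value with the configured limits shows whether the step-level or the job-level timeout was the one that fired.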
@@ -0,0 +1,112 @@
1
+ id: known-unsolved-008
2
+ title: "GITHUB_STEP_SUMMARY Upload Aborted When Content Exceeds 1024k"
3
+ category: known-unsolved
4
+ severity: error
5
+ tags:
6
+ - step-summary
7
+ - GITHUB_STEP_SUMMARY
8
+ - size-limit
9
+ - job-summary
10
+ - markdown
11
+ - limitation
12
+ patterns:
13
+ - regex: "\\$GITHUB_STEP_SUMMARY upload aborted, supports content up to a size of 1024k, got \\d+k"
14
+ flags: "i"
15
+ - regex: "upload aborted.*supports content up to a size of 1024k"
16
+ flags: "i"
17
+ - regex: "Error: GITHUB_STEP_SUMMARY.*1024"
18
+ flags: "i"
19
+ error_messages:
20
+ - "$GITHUB_STEP_SUMMARY upload aborted, supports content up to a size of 1024k, got 1387k"
21
+ - "$GITHUB_STEP_SUMMARY upload aborted, supports content up to a size of 1024k, got 2048k"
22
+ root_cause: |
23
+ GitHub Actions imposes a hard 1 MiB (1024 KiB) size limit on the content written to
24
+ `$GITHUB_STEP_SUMMARY`. When a step writes more than this limit, the runner aborts
25
+ the summary upload and logs an error.
26
+
27
+ This is a **platform limit that cannot be raised**; the only mitigation is to write less
28
+ content. GitHub has not announced plans to raise the limit.
29
+
30
+ Common triggers:
31
+ 1. **Test reporters** — tools like `dorny/test-reporter`, `ctrf-io/github-actions-ctrf`,
32
+ or `EnricoMi/publish-unit-test-result-action` write per-test result tables. Large
33
+ test suites (thousands of test cases, especially with long failure messages) easily
34
+ exceed 1 MiB.
35
+ 2. **Dependency review action** — `actions/dependency-review-action` writes full
36
+ dependency diff tables. Large projects with hundreds of transitive dependencies produce
37
+ summaries well above 1 MiB.
38
+ 3. **Coverage reports** — HTML-style coverage tables written to `$GITHUB_STEP_SUMMARY`
39
+ with per-file rows can grow unboundedly on large monorepos.
40
+ 4. **Log echo pipelines** — `cat large-file >> $GITHUB_STEP_SUMMARY` without size
41
+ checking is the most direct way to hit the limit.
42
+
43
+ The error aborts the summary upload but does **not** fail the step or job by default.
44
+ Depending on the action's error handling, the step may succeed (exit 0) even though the
45
+ summary was not written — making this a silent failure from a reporting perspective.
46
+ fix: |
47
+ Truncate or paginate summary content before writing it. Most test reporters provide
48
+ options to limit which results are written (e.g., only failures, not all passed tests).
49
+ For custom summary generation, check the size before writing and truncate with a note.
50
+ fix_code:
51
+ - language: yaml
52
+ label: "Truncate summary content with size check before writing"
53
+ code: |
54
+ - name: Generate test report
55
+ run: |
56
+ # Generate report to a temp file first
57
+ ./scripts/generate-report.sh > /tmp/report.md
58
+
59
+ # Check size before writing to summary
60
+ SIZE_KB=$(du -k /tmp/report.md | cut -f1)
61
+ MAX_KB=800 # Leave headroom below 1024k limit
62
+
63
+ if [ "$SIZE_KB" -gt "$MAX_KB" ]; then
64
+ echo "⚠️ Full report too large (${SIZE_KB}k). Showing failures only." >> "$GITHUB_STEP_SUMMARY"
65
+ ./scripts/generate-report.sh --failures-only >> "$GITHUB_STEP_SUMMARY"
66
+ else
67
+ cat /tmp/report.md >> "$GITHUB_STEP_SUMMARY"
68
+ fi
69
+ - language: yaml
70
+ label: "dorny/test-reporter: write only the summary table for large test suites"
71
+ code: |
72
+ - name: Test Report
73
+ uses: dorny/test-reporter@v1
74
+ if: always()
75
+ with:
76
+ name: Test Results
77
+ path: test-results/**/*.xml
78
+ reporter: jest-junit
79
+ # Limit output to avoid 1024k summary limit on large suites
80
+ only-summary: true # Write only totals, not per-test rows
81
+ fail-on-error: false
82
+ - language: yaml
83
+ label: "Upload full report as artifact instead of writing to summary"
84
+ code: |
85
+ - name: Generate full coverage report
86
+ run: ./scripts/coverage.sh > /tmp/coverage-full.md
87
+
88
+ - name: Write summary (truncated)
89
+ run: |
90
+ head -100 /tmp/coverage-full.md >> "$GITHUB_STEP_SUMMARY"
91
+ echo "" >> "$GITHUB_STEP_SUMMARY"
92
+ echo "_Full report available as workflow artifact._" >> "$GITHUB_STEP_SUMMARY"
93
+
94
+ - name: Upload full report as artifact
95
+ uses: actions/upload-artifact@v4
96
+ with:
97
+ name: coverage-report
98
+ path: /tmp/coverage-full.md
99
+ prevention:
100
+ - "Never pipe unbounded command output directly to `$GITHUB_STEP_SUMMARY` — always size-check or limit first."
101
+ - "Configure test reporter actions to write only failures (not all passing tests) when the test suite is large."
102
+ - "Upload large reports as workflow artifacts and link to them from a short summary, instead of embedding all content in the summary."
103
+ - "The undocumented historical limit of 65,535 characters cited in older docs/answers is no longer accurate — the current limit is 1024 KiB (1 MiB)."
104
+ docs:
105
+ - url: "https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#adding-a-job-summary"
106
+ label: "Workflow commands: Adding a job summary"
107
+ - url: "https://github.com/actions/dependency-review-action/issues/786"
108
+ label: "dependency-review-action#786 — Job Summary Size Limitation aborts the job"
109
+ - url: "https://github.com/dorny/test-reporter/issues/379"
110
+ label: "dorny/test-reporter#379 — Is the step summary limit for 65535 characters still accurate?"
111
+ - url: "https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration#usage-limits"
112
+ label: "Usage limits — GitHub Actions"
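When no reduced report variant exists, a byte-based cap is another option. The step below is a sketch that assumes the report was already written to `/tmp/report.md`, as in the first fix example, and that 800 KiB is an acceptable budget. It counts bytes with `wc -c` instead of disk blocks, and `head -c` may cut a Markdown table row in half.

```yaml
- name: Write size-capped summary
  run: |
    MAX_BYTES=$((800 * 1024))        # assumed budget, below the 1024k limit
    SIZE=$(wc -c < /tmp/report.md)   # exact byte count of the report

    if [ "$SIZE" -gt "$MAX_BYTES" ]; then
      # Keep the first MAX_BYTES bytes and say that the rest was dropped.
      head -c "$MAX_BYTES" /tmp/report.md >> "$GITHUB_STEP_SUMMARY"
      printf '\n\n_Report truncated at %s of %s bytes._\n' "$MAX_BYTES" "$SIZE" >> "$GITHUB_STEP_SUMMARY"
    else
      cat /tmp/report.md >> "$GITHUB_STEP_SUMMARY"
    fi
```

The budget applies to what this step appends, so earlier writes to the same step summary count against the limit too.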
@@ -0,0 +1,127 @@
1
+ id: known-unsolved-009
2
+ title: "Job Killed After Maximum Execution Time (6h Hosted / 35-Day Workflow)"
3
+ category: known-unsolved
4
+ severity: limitation
5
+ tags:
6
+ - timeout
7
+ - execution-time
8
+ - job-limits
9
+ - platform-limit
10
+ - self-hosted
11
+ - workflow-duration
12
+ - limitation
13
+ patterns:
14
+ - regex: "The job running has exceeded the maximum execution time"
15
+ flags: "i"
16
+ - regex: "exceeded the maximum (?:time|execution time)"
17
+ flags: "i"
18
+ - regex: "job .* exceeded .* maximum"
19
+ flags: "i"
20
+ error_messages:
21
+ - "The job running on runner GitHub Actions X has exceeded the maximum execution time of 360 minutes."
22
+ - "The job running has exceeded the maximum execution time"
23
+ root_cause: |
24
+ GitHub Actions enforces hard platform-level execution time limits that cannot be
25
+ overridden or extended by workflow configuration. These limits exist to protect
26
+ shared infrastructure and prevent runaway jobs from consuming unlimited resources.
27
+
28
+ **GitHub-hosted runner limits:**
29
+ - Maximum job execution time: **6 hours** (360 minutes)
30
+ - Maximum workflow run time: **35 days** (across all jobs, including queued time)
31
+ - Default `timeout-minutes` when not set: **360 minutes** (6 hours)
32
+
33
+ **Self-hosted runner limits:**
34
+ - Maximum job execution time: **5 days** (7,200 minutes) by default
35
+ - Maximum workflow run time: **35 days** (same as hosted)
36
+ - Self-hosted limits can be customized in enterprise plans via org/enterprise policies
37
+
38
+ **When limits are hit:**
39
+ - The runner process is sent a SIGTERM (graceful) then SIGKILL (forced) after a grace period
40
+ - The job is marked CANCELLED (not FAILED) in the UI
41
+ - The log message "The job running has exceeded the maximum execution time" appears in
42
+ the runner log (may be visible in the step logs depending on where the runner was killed)
43
+ - Any `post:` steps for active actions (e.g., cache save, artifact upload) are skipped
44
+ - No email notification is sent to the repo owner about the cancellation
45
+
46
+ **Why this is a limitation, not just misconfiguration:**
47
+ - Setting `timeout-minutes` above 360 (6 hours) does not extend the GitHub-hosted 6h cap
48
+ - The job-level `timeout-minutes` field cannot override the platform cap on GitHub-hosted runners
49
+ - Jobs requiring more than 6 hours on GitHub-hosted runners have NO supported path without
50
+ migrating to self-hosted or restructuring the job into multiple shorter sequential jobs
51
+ fix: |
52
+ There is no way to extend the GitHub-hosted runner 6-hour job cap. Options:
53
+
54
+ 1. **Break the job into smaller sequential jobs** — split long-running work (e.g., build
55
+ artifacts first, test in separate parallel jobs, deploy last). Each job has its own
56
+ 6-hour budget.
57
+
58
+ 2. **Migrate to self-hosted runners** — self-hosted runners support up to 5-day jobs.
59
+ Use actions-runner-controller (ARC) or cloud auto-scaling for elastic capacity.
60
+
61
+ 3. **Optimize the slow step** — profile build/test times; parallelize with matrix
62
+ strategy; use incremental builds or test sharding to reduce per-job duration.
63
+
64
+ 4. **Use caching aggressively** — `actions/cache` reduces download/build time between
65
+ runs, but does not extend limits.
66
+ fix_code:
67
+ - language: yaml
68
+ label: "Split a long job into sequential jobs to stay within 6h per job"
69
+ code: |
70
+ jobs:
71
+ build:
72
+ runs-on: ubuntu-latest
73
+ timeout-minutes: 120 # 2h budget for build
74
+ outputs:
75
+ artifact-id: ${{ steps.upload.outputs.artifact-id }}
76
+ steps:
77
+ - uses: actions/checkout@v4
78
+ - name: Build
79
+ run: make build-release
80
+ - name: Upload build artifact
81
+ id: upload
82
+ uses: actions/upload-artifact@v4
83
+ with:
84
+ name: release-build
85
+ path: dist/
86
+
87
+ # Separate job — gets its own 6h budget
88
+ test:
89
+ needs: build
90
+ runs-on: ubuntu-latest
91
+ timeout-minutes: 180 # 3h budget for tests
92
+ steps:
93
+ - uses: actions/download-artifact@v4
94
+ with:
95
+ name: release-build # Same name the build job passed to upload-artifact
96
+ - run: make test-full
97
+ - language: yaml
98
+ label: "Self-hosted runner for jobs requiring more than 6 hours"
99
+ code: |
100
+ jobs:
101
+ long-running-job:
102
+ # Self-hosted runners support up to 5-day job duration
103
+ runs-on: [self-hosted, linux, x64]
104
+ timeout-minutes: 2880 # 48h — only possible on self-hosted
105
+ steps:
106
+ - uses: actions/checkout@v4
107
+ - name: Long-running process
108
+ run: ./scripts/full-dataset-processing.sh
109
+ prevention:
110
+ - "Set explicit `timeout-minutes` on every job — don't rely on the implicit 6h GitHub-hosted cap as your only safeguard."
111
+ - "Profile job duration regularly and alert when a job's P99 duration approaches 80% of its timeout budget."
112
+ - "Parallelize test suites using matrix strategy or `actions/github-script` dynamic matrix generation to reduce per-job time."
113
+ - "Use self-hosted runners for any workflow that legitimately requires more than 2-3 hours per job (e.g., large model training, full database rebuild, exhaustive integration tests)."
114
+ - "Be aware that post-run actions (cache save, artifact upload) will NOT execute if the parent job is killed for exceeding the time limit."
115
+ docs:
116
+ - url: "https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration#usage-limits"
117
+ label: "Usage limits: job execution time and workflow run time"
118
+ - url: "https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#usage-limits"
119
+ label: "Self-hosted runner usage limits"
120
+ - url: "https://github.com/orgs/community/discussions/48790"
121
+ label: "Community: Workflow run time limit 35 days"
122
+ - url: "https://github.com/orgs/community/discussions/150900"
123
+ label: "Community: Job cancellation after 6 hours"
124
+ - url: "https://stackoverflow.com/questions/70187174/github-actions-self-hosted-runner-the-job-running-has-exceeded-the-maximum-exe"
125
+ label: "Stack Overflow: The job running has exceeded the maximum execution time"
126
+ - url: "https://github.com/actions/actions-runner-controller"
127
+ label: "Actions Runner Controller (ARC) — Kubernetes-based self-hosted runner auto-scaling"
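The matrix and test-sharding advice in this entry can be sketched as follows. This is a minimal illustration and not part of the package: the `SHARD` and `TOTAL_SHARDS` variables are hypothetical and assume the project's Makefile can run one slice of the suite; `make test-full` is taken from the example above.

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 60        # each shard gets its own, smaller budget
    strategy:
      fail-fast: false         # let the remaining shards finish if one fails
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      # Assumption: the Makefile reads SHARD / TOTAL_SHARDS to select a subset of tests
      - run: make test-full SHARD=${{ matrix.shard }} TOTAL_SHARDS=4
```

Each matrix entry runs as a separate job, so the per-job time limit applies to one shard rather than to the whole suite.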
@@ -0,0 +1,84 @@
1
+ id: "runner-environment-022"
2
+ title: "actions/checkout set-safe-directory only runs in post step — container jobs get dubious ownership errors"
3
+ category: "runner-environment"
4
+ severity: "error"
5
+ tags:
6
+ - "actions/checkout"
7
+ - "safe-directory"
8
+ - "container"
9
+ - "dubious-ownership"
10
+ - "CVE-2022-24765"
11
+ - "post-step"
12
+ patterns:
13
+ - regex: "fatal: detected dubious ownership in repository"
14
+ flags: "i"
15
+ - regex: "safe\\.directory.*not.*owned"
16
+ flags: "i"
17
+ error_messages:
18
+ - "fatal: detected dubious ownership in repository at '/github/workspace'"
19
+ - "hint: git config --global --add safe.directory /github/workspace"
20
+ root_cause: |
21
+ The `actions/checkout` action configures `safe.directory` to allow git operations in
22
+ the workspace. However, this configuration only runs in the **post step** (cleanup
23
+ phase), not during the main execution step.
24
+
25
+ In container jobs, the workspace is mounted from the host and may be owned by a
26
+ different UID than the user running inside the container. Git's safe.directory
27
+ protection (introduced in Git 2.35.2 for CVE-2022-24765) blocks access when the
28
+ directory owner differs from the running user.
29
+
30
+ Because safe.directory is only written during the post step — after all workflow
31
+ steps have already run — any git operations in the job's main steps run too early and
32
+ fail with "fatal: detected dubious ownership". This includes third-party actions
33
+ that internally invoke git (reviewdog, gitops tools, semantic-release, etc.).
34
+
35
+ Reported upstream: https://github.com/actions/checkout/issues/2031
36
+ fix: |
37
+ Add an explicit safe.directory configuration step immediately after `actions/checkout`
38
+ in any container job that performs git operations. This ensures the directory is
39
+ trusted before any subsequent steps run.
40
+ fix_code:
41
+ - language: yaml
42
+ label: "Add safe.directory config step after checkout in container jobs"
43
+ code: |
44
+ jobs:
45
+ build:
46
+ runs-on: ubuntu-latest
47
+ container: node:20-bookworm
48
+ steps:
49
+ - uses: actions/checkout@v4
50
+
51
+ # Workaround: post step safe.directory config doesn't help in container jobs
52
+ - name: Mark workspace as safe for git
53
+ run: git config --global --add safe.directory "$GITHUB_WORKSPACE"
54
+
55
+ - name: Run git-dependent steps
56
+ run: git log --oneline -5
57
+ - language: yaml
58
+ label: "Use wildcard to mark all directories safe in container workflows"
59
+ code: |
60
+ jobs:
61
+ build:
62
+ runs-on: ubuntu-latest
63
+ container: python:3.12-slim
64
+ steps:
65
+ - uses: actions/checkout@v4
66
+
67
+ - name: Configure git safe directories
68
+ run: |
69
+ git config --global --add safe.directory '*'
70
+
71
+ - name: Lint with pre-commit
72
+ run: pre-commit run --all-files
73
+ prevention:
74
+ - "Always add a safe.directory config step after checkout when using container jobs"
75
+ - "Audit third-party actions in container jobs — reviewdog, semantic-release, and gitops tools invoke git internally"
76
+ - "Consider running without a container and using docker run explicitly if git safety is complex to manage"
77
+ - "Track https://github.com/actions/checkout/issues/2031 for an official fix from the actions team"
78
+ docs:
79
+ - url: "https://github.com/actions/checkout/issues/2031"
80
+ label: "actions/checkout #2031 — safe.directory only set in post step"
81
+ - url: "https://github.blog/2022-04-12-git-security-vulnerability-announced/"
82
+ label: "Git CVE-2022-24765 — safe.directory background"
83
+ - url: "https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/running-jobs-in-a-container"
84
+ label: "GitHub Docs — Running jobs in a container"
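To confirm that an ownership mismatch is what triggers the error, a diagnostic step along these lines may help. It is a sketch under assumptions: the container image ships `id`, `stat` (with the `-c` format flag), and `git`; the step name is illustrative.

```yaml
- name: Diagnose workspace ownership (illustrative)
  run: |
    # UID of the user executing steps inside the container
    echo "Running as UID: $(id -u)"
    # UID that owns the workspace mounted from the host
    echo "Workspace owner UID: $(stat -c '%u' "$GITHUB_WORKSPACE")"
    # List any safe.directory entries already present in the global git config
    git config --global --get-all safe.directory || echo "no safe.directory entries set"
```

If the two UIDs differ and no `safe.directory` entry is listed, the explicit `git config` step shown in the fix above is the relevant workaround.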
@@ -0,0 +1,99 @@
1
+ id: "runner-environment-023"
2
+ title: "Self-hosted runner on deprecated version stops receiving jobs"
3
+ category: "runner-environment"
4
+ severity: "error"
5
+ tags:
6
+ - "self-hosted"
7
+ - "runner"
8
+ - "deprecated"
9
+ - "version"
10
+ - "cannot-receive-messages"
11
+ - "maintenance"
12
+ patterns:
13
+ - regex: "Runner version v\\d+\\.\\d+\\.\\d+ is deprecated and cannot receive messages"
14
+ flags: "i"
15
+ - regex: "WRITE ERROR.*runner.*deprecated"
16
+ flags: "i"
17
+ error_messages:
18
+ - "Runner version v2.332.0 is deprecated and cannot receive messages."
19
+ - "WRITE ERROR: An error occured: Runner version v2.XXX.X is deprecated and cannot receive messages."
20
+ root_cause: |
21
+ GitHub periodically deprecates older self-hosted runner versions. Once a runner version
22
+ is past its deprecation deadline, the runner agent can no longer communicate with the
23
+ GitHub Actions broker service.
24
+
25
+ The runner process stays alive, appears online in the GitHub UI (Settings → Actions →
26
+ Runners), and is listed as "Active" — but it can no longer receive job assignments.
27
+ Jobs queued for a runner group containing only deprecated-version runners will either
28
+ stay "Queued" indefinitely or time out without a clear error in the workflow UI.
29
+
30
+ This is a silent failure mode: the runner shows as online, no workflow error is
31
+ surfaced, but jobs never start. The deprecation schedule is published in the GitHub
32
+ Changelog and actions/runner releases, but teams often miss it without automated
33
+ update pipelines.
34
+
35
+ As of 2026, GitHub requires runners to be within approximately the last 6 months of
36
+ releases. Related issues: actions/runner #4305, actions/runner #4442
37
+ fix: |
38
+ Update the runner binary to a currently supported version.
39
+
40
+ For manually managed runners:
41
+ 1. SSH to the runner host
42
+ 2. Stop the runner service: sudo ./svc.sh stop (svc.sh requires sudo on Linux)
43
+ 3. Download the latest runner from https://github.com/actions/runner/releases/latest
44
+ 4. Extract to the runner directory (configuration is preserved via .runner file)
45
+ 5. Restart: sudo ./svc.sh start
46
+
47
+ For ARC (Actions Runner Controller) or autoscaling solutions: bump the runner image
48
+ version tag in your HelmRelease or Deployment manifest and redeploy.
49
+ fix_code:
50
+ - language: yaml
51
+ label: "Scheduled workflow to alert on outdated runner versions"
52
+ code: |
53
+ name: Check self-hosted runner versions
54
+ on:
55
+ schedule:
56
+ - cron: '0 9 * * 1' # Weekly, Mondays at 09:00 UTC
57
+
58
+ jobs:
59
+ check-runners:
60
+ runs-on: ubuntu-latest # GitHub-hosted runner for this diagnostic
61
+ steps:
62
+ - name: List runner versions via API
63
+ env:
64
+ GH_TOKEN: ${{ secrets.RUNNER_READ_TOKEN }}
65
+ run: |
66
+ echo "=== Org runners ==="
67
+ gh api /orgs/${{ github.repository_owner }}/actions/runners \
68
+ --jq '.runners[] | "\(.name): v\(.version) (\(.status))"'
69
+
70
+ - name: Check latest available version
71
+ run: |
72
+ LATEST=$(curl -sf https://api.github.com/repos/actions/runner/releases/latest | jq -r .tag_name)
73
+ echo "Latest runner version: $LATEST"
74
+ echo "Compare against your registered runners above"
75
+ - language: yaml
76
+ label: "ARC — bump runner version in HelmRelease"
77
+ code: |
78
+ # In your ARC HelmRelease or values.yaml
79
+ githubConfigUrl: "https://github.com/myorg"
80
+ template:
81
+ spec:
82
+ containers:
83
+ - name: runner
84
+ image: ghcr.io/actions/actions-runner:2.335.0 # Bump this regularly
85
+ prevention:
86
+ - "Subscribe to the GitHub Changelog (https://github.blog/changelog/) or watch actions/runner releases for deprecation notices"
87
+ - "Use Actions Runner Controller (ARC) or an autoscaling solution to automate runner lifecycle management"
88
+ - "Schedule a weekly cron workflow that checks registered runner versions via the Runners API and alerts if any are outdated"
89
+ - "Pin runner version in IaC (Terraform, Ansible) and include a runner version bump in your monthly maintenance checklist"
90
+ - "Set up Dependabot or Renovate to auto-update runner image tags in Docker/ARC manifests"
91
+ docs:
92
+ - url: "https://github.com/actions/runner/releases"
93
+ label: "actions/runner releases — version history and changelogs"
94
+ - url: "https://github.com/actions/runner/issues/4305"
95
+ label: "actions/runner #4305 — runner deprecated cannot receive messages"
96
+ - url: "https://github.com/actions/runner/issues/4442"
97
+ label: "actions/runner #4442 — version deprecation notice"
98
+ - url: "https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners"
99
+ label: "GitHub Docs — About self-hosted runners"
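The prevention item about automating image-tag updates could look roughly like the Dependabot configuration below. This is a sketch, not the package's recommendation: it assumes the runner image is built from a Dockerfile whose `FROM` line references `ghcr.io/actions/actions-runner:<tag>`, and the `/runner` directory is hypothetical.

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "docker"
    directory: "/runner"      # hypothetical: folder containing the runner Dockerfile
    schedule:
      interval: "weekly"      # open a PR when a newer base image tag is published
```

Merging the resulting pull requests still has to trigger a rebuild and redeploy of the runner image for the update to take effect.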
@@ -0,0 +1,82 @@
1
+ id: "silent-failures-010"
2
+ title: "cache-hit output is 'true' on restore-keys partial match, not just exact key"
3
+ category: "silent-failures"
4
+ severity: "silent-failure"
5
+ tags:
6
+ - "actions/cache"
7
+ - "cache-hit"
8
+ - "restore-keys"
9
+ - "partial-match"
10
+ - "output"
11
+ - "exact-match"
12
+ patterns:
13
+ - regex: "cache-hit.*true"
14
+ flags: "i"
15
+ error_messages:
16
+ - "cache-hit: true"
17
+ root_cause: |
18
+ The `cache-hit` output of `actions/cache` is documented to return `true` for an exact
19
+ cache key match. In practice, `cache-hit` also returns `true` when the key matched via
20
+ `restore-keys` (a partial/fallback match), not only for exact key hits.
21
+
22
+ Workflows that gate post-build steps on `steps.cache.outputs.cache-hit == 'true'` to
23
+ skip dependency installs assume an exact match. When a restore-keys match occurs,
24
+ `cache-hit` is `true` even though the cache may be stale or from a different branch.
25
+ The correct exact-match indicator is the `exact-match` output (available in
26
+ actions/cache v4+) or comparing `cache-matched-key` (an output of `actions/cache/restore`) against the computed key.
27
+
28
+ Reported upstream: https://github.com/actions/cache/issues/1675
29
+ fix: |
30
+ Use the `exact-match` output (actions/cache v4+) to determine if the restore was an
31
+ exact key match, after confirming that the pinned version's `action.yml` declares it.
32
+ `cache-hit` alone does not distinguish between exact and partial (restore-keys) matches.
33
+
34
+ Alternatively, compare the `cache-matched-key` output against the expected key to
35
+ determine if the restore was exact.
36
+ fix_code:
37
+ - language: yaml
38
+ label: "Use exact-match output (actions/cache v4+)"
39
+ code: |
40
+ - name: Restore cache
41
+ id: cache
42
+ uses: actions/cache@v4
43
+ with:
44
+ path: ~/.npm
45
+ key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
46
+ restore-keys: |
47
+ ${{ runner.os }}-node-
48
+
49
+ # Only skip install on EXACT key match, not partial restore-keys hit
50
+ - name: Install dependencies
51
+ if: steps.cache.outputs.exact-match != 'true'
52
+ run: npm ci
53
+ - language: yaml
54
+ label: "Compare cache-matched-key to verify exact hit"
55
+ code: |
56
+ - name: Restore cache
57
+ id: cache
58
+ uses: actions/cache/restore@v4 # exposes `cache-matched-key`; pair with actions/cache/save to store the cache
59
+ with:
60
+ path: ~/.npm
61
+ key: npm-${{ hashFiles('**/package-lock.json') }}
62
+ restore-keys: npm-
63
+
64
+ - name: Install or skip dependencies
65
+ run: |
66
+ EXPECTED_KEY="npm-${{ hashFiles('**/package-lock.json') }}"
67
+ if [ "${{ steps.cache.outputs.cache-matched-key }}" = "$EXPECTED_KEY" ]; then
68
+ echo "Exact cache hit — skipping npm ci"
69
+ else
70
+ echo "Partial/stale restore-keys hit — running npm ci"
71
+ npm ci
72
+ fi
73
+ prevention:
74
+ - "Never rely on cache-hit == 'true' alone to skip dependency installs; it fires on partial restore-keys matches too"
75
+ - "Use the exact-match output (actions/cache v4+) when you need to distinguish exact vs partial cache hits"
76
+ - "Use cache-matched-key output to log or compare the actual key that was restored"
77
+ - "If using restore-keys, always validate that your skip-install condition handles partial matches correctly"
78
+ docs:
79
+ - url: "https://github.com/actions/cache/issues/1675"
80
+ label: "actions/cache #1675 — cache-hit true on restore-keys match"
81
+ - url: "https://github.com/actions/cache#outputs"
82
+ label: "actions/cache outputs documentation"
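For comparison, the split `actions/cache/restore` and `actions/cache/save` actions expose `cache-primary-key` and `cache-matched-key` outputs, which lets the exact-versus-partial check live in an `if:` expression. A minimal sketch, assuming that caching `node_modules` is acceptable for the project (the path and step names are illustrative):

```yaml
- name: Restore cache
  id: cache
  uses: actions/cache/restore@v4
  with:
    path: node_modules
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: npm-

# Runs on a cache miss (matched key is empty) or a partial restore-keys hit
- name: Install dependencies
  if: steps.cache.outputs.cache-matched-key != steps.cache.outputs.cache-primary-key
  run: npm ci

# Save under the primary key only when the restore was not exact
- name: Save cache
  if: steps.cache.outputs.cache-matched-key != steps.cache.outputs.cache-primary-key
  uses: actions/cache/save@v4
  with:
    path: node_modules
    key: ${{ steps.cache.outputs.cache-primary-key }}
```

Comparing the two outputs avoids recomputing the key in shell and keeps the install and save steps gated by the same condition.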
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@htekdev/actions-debugger",
3
- "version": "1.0.3",
3
+ "version": "1.0.5",
4
4
  "description": "65+ real GitHub Actions errors, queryable by agents. MCP server + Copilot skills + error database.",
5
5
  "type": "module",
6
6
  "main": "./dist/index.js",