@htekdev/actions-debugger 1.0.118 → 1.0.119

@@ -0,0 +1,94 @@
1
+ id: caching-artifacts-070
2
+ title: "setup-python Post step fails — pip cache directory doesn't exist on disk"
3
+ category: caching-artifacts
4
+ severity: error
5
+ tags:
6
+ - setup-python
7
+ - pip
8
+ - cache
9
+ - post-step
10
+ - cache-miss
11
+ - no-dependencies
12
+ - python-version-bump
13
+ patterns:
14
+ - regex: 'Cache folder path is retrieved for pip but doesn.t exist on disk'
15
+ flags: 'i'
16
+ - regex: 'likely indicates that there are no dependencies to cache'
17
+ flags: 'i'
18
+ - regex: 'Post Setup Python.*fail|Post.*setup-python.*error'
19
+ flags: 'i'
20
+ error_messages:
21
+ - "Cache folder path is retrieved for pip but doesn't exist on disk: /home/runner/.cache/pip. This likely indicates that there are no dependencies to cache. Consider removing the cache step if it is not needed."
22
+ root_cause: |
23
+ When actions/setup-python is configured with cache: 'pip', the action records the expected
24
+ pip cache directory path at setup time (/home/runner/.cache/pip on Linux, equivalent on
25
+ macOS/Windows). The Post Setup Python step runs at job end and attempts to save the cache
26
+ to that path. If the directory does not exist on disk at save time, the Post step fails
27
+ with this error.
28
+
29
+ Two common causes:
30
+
31
+ 1. No pip install ran in the job: The job uses cache: 'pip' but only runs linting or other
32
+ pre-installed tools without installing any Python packages. Pip never creates its cache
33
+ directory, so there is nothing to save. The job's actual steps appear green while the
34
+ Post step fails and turns the overall workflow run red.
35
+
36
+ 2. Python version bump causes cache key miss: When the Python patch version changes (e.g.,
37
+ from 3.13.5 to 3.13.6), the setup-python cache key changes and the first run after the
38
+ bump experiences a full cache miss, so nothing is restored to /home/runner/.cache/pip. If
39
+ the install step then does not populate that directory (for example pip runs with
40
+ --no-cache-dir, or an installer with its own cache location is used), the directory is
41
+ never created and the Post step fails. Runs that restore a warmed cache succeed.
42
+
43
+ The failure is deceptive because it surfaces in the Post Setup Python cleanup step — well
44
+ after the test or build steps have already succeeded — making it easy to overlook the root
45
+ cause.
46
+ fix: |
47
+ - Remove cache: 'pip' from any setup-python step in jobs that do not call pip install.
48
+ Linting-only jobs, type-check-only jobs, and jobs that rely entirely on pre-installed
49
+ system Python do not benefit from pip caching.
50
+ - If pip install does run, make sure it actually populates /home/runner/.cache/pip (no
51
+ --no-cache-dir flag, no PIP_NO_CACHE_DIR variable) so the Post step can find and save the cache.
52
+ - Upgrade to actions/setup-python@v5 or later: newer versions emit a warning annotation
53
+ instead of failing the step when the cache directory is missing.
54
+ - After a Python version bump, the first run is expected to cache-miss; monitor for Post
55
+ step failures on that first run and confirm subsequent runs succeed.
56
+ fix_code:
57
+ - language: yaml
58
+ label: 'Fix: Remove cache when no pip install runs'
59
+ code: |
60
+ # WRONG: cache: pip set but job only lints — no pip install → Post step fails
61
+ - uses: actions/setup-python@v4
62
+ with:
63
+ python-version: '3.12'
64
+ cache: 'pip' # ← REMOVE when no pip install follows
65
+ - name: Lint
66
+ run: flake8 . # no pip install; /home/runner/.cache/pip never created
67
+
68
+ # CORRECT: omit cache when the job does not install packages
69
+ - uses: actions/setup-python@v4
70
+ with:
71
+ python-version: '3.12'
72
+ # no cache input: the Post step skips the cache save attempt entirely
73
+ - name: Lint
74
+ run: flake8 .
75
+ - language: yaml
76
+ label: 'Fix: Upgrade to setup-python@v5 for graceful handling'
77
+ code: |
78
+ # CORRECT: v5+ emits a warning annotation instead of failing when cache path missing
79
+ - uses: actions/setup-python@v5
80
+ with:
81
+ python-version: '3.12'
82
+ cache: 'pip'
83
+ - name: Install dependencies
84
+ run: pip install -r requirements.txt
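+   - language: yaml
+     label: 'Sketch (assumption): create the pip cache directory when the install step may be skipped'
+     code: |
+       # Illustrative sketch, not taken from the setup-python documentation. It assumes the
+       # install step is skipped on some runs and that creating pip's cache directory is
+       # enough for the Post step's existence check. The pull_request condition is only a
+       # placeholder. This may save a near-empty cache; removing cache: 'pip' remains the
+       # cleaner fix when the job never installs packages.
+       - uses: actions/setup-python@v5
+         with:
+           python-version: '3.12'
+           cache: 'pip'
+       - name: Install dependencies (placeholder condition, may be skipped)
+         if: github.event_name == 'pull_request'
+         run: pip install -r requirements.txt
+       - name: Ensure the pip cache directory exists
+         # `pip cache dir` prints the cache location pip will use
+         run: mkdir -p "$(pip cache dir)"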
85
+ prevention:
86
+ - 'Only set cache: pip on jobs that actually run pip install — linting-only jobs should omit it.'
87
+ - 'Use actions/setup-python@v5 or later; it handles missing cache directories with a warning instead of a failure.'
88
+ - 'After bumping the Python patch version in python-version:, expect one cache-miss run and watch for Post step failures on that run only.'
89
+ - 'When installing with pipenv or poetry instead of pip, set cache: pipenv or cache: poetry so setup-python saves the directory that tool actually uses.'
90
+ docs:
91
+ - url: 'https://github.com/actions/setup-python/issues/1169'
92
+ label: 'setup-python#1169: Cache folder path doesn''t exist on disk (Aug 2025)'
93
+ - url: 'https://github.com/actions/setup-python'
94
+ label: 'actions/setup-python: Caching packages documentation'
@@ -0,0 +1,127 @@
1
+ id: concurrency-timing-056
2
+ title: 'Workflow-level and job-level concurrency share same group key — deadlock cancellation fires immediately'
3
+ category: concurrency-timing
4
+ severity: error
5
+ tags:
6
+ - concurrency
7
+ - deadlock
8
+ - workflow-level
9
+ - job-level
10
+ - reusable-workflow
11
+ - workflow-call
12
+ - github-workflow-context
13
+ patterns:
14
+ - regex: 'Canceling since a deadlock for concurrency group .* was detected between'
15
+ flags: 'i'
16
+ - regex: 'deadlock.*concurrency group|concurrency group.*deadlock'
17
+ flags: 'i'
18
+ error_messages:
19
+ - "Canceling since a deadlock for concurrency group 'ci-refs/heads/main' was detected between 'top level workflow' and 'deploy'"
20
+ - "Canceling since a deadlock for concurrency group 'release-refs/heads/main' was detected between 'top level workflow' and 'api'"
21
+ root_cause: |
22
+ GitHub Actions fires a deadlock error and immediately cancels the offending job when the
23
+ same concurrency group name is held simultaneously by two levels within a single workflow
24
+ execution. Two distinct scenarios trigger this:
25
+
26
+ Scenario 1 — Same-workflow self-deadlock: A workflow file defines concurrency: at the
27
+ workflow level AND one of its jobs also defines jobs.<id>.concurrency: using an expression
28
+ that evaluates to the same string:
29
+
30
+ concurrency:
31
+ group: ${{ github.workflow }}-${{ github.ref }} # workflow-level slot
32
+
33
+ jobs:
34
+ deploy:
35
+ concurrency:
36
+ group: ${{ github.workflow }}-${{ github.ref }} # same string → deadlock
37
+ runs-on: ubuntu-latest
38
+
39
+ Scenario 2 — Reusable callee inherits caller context: A calling workflow has workflow-level
40
+ concurrency using ${{ github.workflow }}-${{ github.ref }}. The called reusable workflow
41
+ also defines workflow-level concurrency with the same expression. Because github.workflow
42
+ inside a reusable workflow inherits the CALLER's workflow name (not the callee's filename),
43
+ both evaluate to the identical group key and GitHub detects a deadlock.
44
+
45
+ github.workflow_ref also inherits from the top-level caller and does NOT produce a unique
46
+ value in the callee context; it cannot be used to distinguish caller from callee.
47
+ fix: |
48
+ Scenario 1 — Remove the duplicate concurrency block. Keep either the workflow-level OR
49
+ the job-level declaration, not both with the same key. If per-job isolation is needed,
50
+ append a literal job suffix to the job-level group name (github.job is only set while a job's steps run, so it is empty here):
51
+
52
+ jobs:
53
+ deploy:
54
+ concurrency:
55
+ group: ${{ github.workflow }}-${{ github.ref }}-deploy
56
+
57
+ Scenario 2 — Remove the concurrency: block entirely from the reusable workflow. The
58
+ caller's workflow-level concurrency already governs the entire execution. If the callee
59
+ needs standalone concurrency when triggered via workflow_dispatch, use a hardcoded unique
60
+ prefix instead of ${{ github.workflow }}:
61
+
62
+ concurrency:
63
+ group: deploy-${{ github.ref }} # hardcoded prefix avoids collision with any caller
64
+ cancel-in-progress: true
65
+ fix_code:
66
+ - language: yaml
67
+ label: 'Scenario 1 fix: remove duplicate job-level concurrency'
68
+ code: |
69
+ # WRONG — identical group at workflow level and job level → deadlock
70
+ concurrency:
71
+ group: ${{ github.workflow }}-${{ github.ref }}
72
+ cancel-in-progress: true
73
+
74
+ jobs:
75
+ deploy:
76
+ concurrency:
77
+ group: ${{ github.workflow }}-${{ github.ref }} # ← DELETE THIS
78
+ runs-on: ubuntu-latest
79
+ steps:
80
+ - run: echo deploying
81
+
82
+ # CORRECT — concurrency only at workflow level
83
+ concurrency:
84
+ group: ${{ github.workflow }}-${{ github.ref }}
85
+ cancel-in-progress: true
86
+
87
+ jobs:
88
+ deploy:
89
+ runs-on: ubuntu-latest
90
+ steps:
91
+ - run: echo deploying
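+   - language: yaml
+     label: 'Sketch (assumption): keep both levels but give the job-level group a literal suffix'
+     code: |
+       # Illustrative sketch, assuming per-job isolation is wanted in addition to the
+       # workflow-level group. The "-deploy" suffix is a hypothetical literal; any string
+       # that makes the two group names differ should avoid the collision.
+       concurrency:
+         group: ${{ github.workflow }}-${{ github.ref }}
+         cancel-in-progress: true
+
+       jobs:
+         deploy:
+           concurrency:
+             group: ${{ github.workflow }}-${{ github.ref }}-deploy   # differs from the workflow-level group
+           runs-on: ubuntu-latest
+           steps:
+             - run: echo deploying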
92
+ - language: yaml
93
+ label: 'Scenario 2 fix: remove concurrency from reusable workflow'
94
+ code: |
95
+ # deploy.yml (reusable) — WRONG: workflow-level concurrency collides with caller
96
+ # because github.workflow returns the CALLER's name in reusable context
97
+ on:
98
+ workflow_call:
99
+ workflow_dispatch:
100
+ # concurrency: ← DELETE this entire block from the reusable workflow
101
+ # group: ${{ github.workflow }}-${{ github.ref }}
102
+ # cancel-in-progress: true
103
+
104
+ jobs:
105
+ deploy:
106
+ runs-on: ubuntu-latest
107
+ steps:
108
+ - run: echo deploying
109
+
110
+ # If standalone concurrency is needed for workflow_dispatch calls, use hardcoded prefix:
111
+ # concurrency:
112
+ # group: deploy-${{ github.ref }} # hardcoded "deploy-" avoids collision
113
+ # cancel-in-progress: true
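+   - language: yaml
+     label: 'Sketch (assumption): caller side of Scenario 2, the only place the group is declared'
+     code: |
+       # ci.yml (caller): hypothetical file name and trigger, shown to illustrate that the
+       # caller's workflow-level group already covers the jobs of the called deploy.yml.
+       name: CI
+       on:
+         push:
+           branches: [main]
+
+       concurrency:
+         group: ${{ github.workflow }}-${{ github.ref }}
+         cancel-in-progress: true
+
+       jobs:
+         deploy:
+           uses: ./.github/workflows/deploy.yml   # reusable workflow without its own concurrency block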
114
+ prevention:
115
+ - 'Before adding concurrency: to a reusable workflow, check if it will be called via workflow_call — if so, remove it or use a hardcoded prefix.'
116
+ - 'Never use the same concurrency group expression at both the workflow level and job level in the same file.'
117
+ - 'Note that ${{ github.workflow }} and ${{ github.workflow_ref }} both return the top-level caller''s values inside a reusable workflow; neither provides the callee''s filename.'
118
+ - 'actionlint may not yet flag identical concurrency groups (issue actionlint#538 tracks adding this check), so review workflow-level and job-level group expressions manually.'
119
+ docs:
120
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/control-the-concurrency-of-workflows-and-jobs'
121
+ label: 'GitHub Docs: Controlling the concurrency of workflows and jobs'
122
+ - url: 'https://stackoverflow.com/questions/78101326/github-actions-concurrency-deadlock'
123
+ label: 'SO: GitHub Actions concurrency deadlock (Score 6, 1.7K views)'
124
+ - url: 'https://stackoverflow.com/questions/79511940/using-workflow-filename-in-concurrency-group-for-workflows-started-by-workflow-c'
125
+ label: 'SO: Using workflow filename in concurrency group for workflow_call (Score 3)'
126
+ - url: 'https://github.com/github/vscode-github-actions/issues/135'
127
+ label: 'vscode-github-actions#135: Identical concurrency groups cause silent never-run (14 reactions)'
@@ -0,0 +1,115 @@
1
+ id: concurrency-timing-057
2
+ title: "Fork PRs with Identical Branch Names Share Concurrency Group and Cancel Each Other"
3
+ category: concurrency-timing
4
+ severity: silent-failure
5
+ tags:
6
+ - concurrency
7
+ - fork
8
+ - pull_request
9
+ - head_ref
10
+ - cancel-in-progress
11
+ - silent-cancel
12
+ patterns:
13
+ - regex: 'group.*github\.head_ref'
14
+ flags: 'i'
15
+ - regex: 'This run was cancelled'
16
+ flags: 'i'
17
+ error_messages:
18
+ - "This run was cancelled."
19
+ - "Run was cancelled."
20
+ root_cause: |
21
+ When a workflow uses 'github.head_ref' as the sole identifier in a
22
+ concurrency group key (a common pattern to cancel stale runs on the same
23
+ branch), all pull requests that share a branch name across different forks
24
+ map to the SAME concurrency group. With 'cancel-in-progress: true', the
25
+ latest queued run cancels all earlier runs in that group — including runs
26
+ from completely unrelated PRs in OTHER contributor forks.
27
+
28
+ Example scenario:
29
+ - Fork A (alice/myrepo) opens PR from branch 'fix/auth-bug'.
30
+ - Fork B (bob/myrepo) opens a PR from a branch also named 'fix/auth-bug'.
31
+ - Both PRs target the same upstream repo (org/myrepo).
32
+ - Concurrency group: 'ci-fix/auth-bug' (from github.head_ref).
33
+ - When Fork B's PR triggers a run, it cancels Fork A's in-progress run.
34
+
35
+ The cancellation appears as "This run was cancelled" with no explanation
36
+ that a different fork's PR caused it. Maintainers and contributors see
37
+ flaky-looking CI with no obvious cause.
38
+
39
+ This is especially common in:
40
+ - Large open-source projects where many contributors use the same
41
+ conventional branch names (fix/typo, docs/readme, feature/x).
42
+ - Dependabot/Renovate PRs across forks — all use the same structured
43
+ branch name pattern (dependabot/npm_and_yarn/lodash-4.0.0).
44
+ fix: |
45
+ Include 'github.event.pull_request.number' in the concurrency group key.
46
+ PR numbers are unique per repository, so two PRs from different forks
47
+ always have different numbers even if their branch names collide.
48
+
49
+ Alternative: use 'github.run_id' for maximum uniqueness (no cancellation
50
+ across runs at all), but this defeats the purpose of cancel-in-progress
51
+ for the same PR.
52
+
53
+ The recommended pattern from GitHub documentation:
54
+ group: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}'
55
+
56
+ The '|| github.ref' fallback handles non-PR events (push, schedule)
57
+ where 'github.event.pull_request.number' is empty.
58
+ fix_code:
59
+ - language: yaml
60
+ label: "Broken — github.head_ref alone causes cross-fork cancellation"
61
+ code: |
62
+ concurrency:
63
+ group: ci-${{ github.head_ref }} # ❌ Collides across forks with same branch name
64
+ cancel-in-progress: true
65
+
66
+ - language: yaml
67
+ label: "Fixed — include PR number to ensure per-PR uniqueness"
68
+ code: |
69
+ concurrency:
70
+ # PR number is unique per repo — different forks never collide
71
+ group: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}'
72
+ cancel-in-progress: true
73
+
74
+ - language: yaml
75
+ label: "Alternative — include repo owner to scope per fork"
76
+ code: |
77
+ concurrency:
78
+ # Include head repo full_name to distinguish forks explicitly
79
+ group: >-
80
+ ${{ github.workflow }}-
81
+ ${{ github.event.pull_request.head.repo.full_name || github.repository }}-
82
+ ${{ github.event.pull_request.number || github.ref }}
83
+ cancel-in-progress: true
84
+
85
+ - language: yaml
86
+ label: "Recommended pattern from GitHub docs — workflow + PR number or ref"
87
+ code: |
88
+ name: CI
89
+ on:
90
+ pull_request:
91
+ push:
92
+ branches: [main]
93
+
94
+ concurrency:
95
+ group: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}'
96
+ cancel-in-progress: true
97
+
98
+ jobs:
99
+ test:
100
+ runs-on: ubuntu-latest
101
+ steps:
102
+ - uses: actions/checkout@v4
103
+ - run: ./run-tests.sh
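+   - language: yaml
+     label: 'Sketch (assumption): log the values a group key is built from to spot collisions'
+     code: |
+       # Illustrative debugging step, not from the GitHub docs. It prints the context values
+       # used in the group keys above so runs from two forks can be compared in their logs.
+       # Values are passed through env rather than interpolated into the script because
+       # branch names are contributor-controlled.
+       - name: Show concurrency key inputs
+         env:
+           HEAD_REF: ${{ github.head_ref }}
+           PR_NUMBER: ${{ github.event.pull_request.number }}
+           HEAD_REPO: ${{ github.event.pull_request.head.repo.full_name }}
+         run: |
+           echo "head_ref=${HEAD_REF}"     # identical across forks that reuse a branch name
+           echo "pr_number=${PR_NUMBER}"   # unique per pull request in the base repository
+           echo "head_repo=${HEAD_REPO}"   # differs per fork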
104
+
105
+ prevention:
106
+ - "Never use github.head_ref alone as a concurrency group key for pull_request workflows."
107
+ - "Always pair github.head_ref with github.event.pull_request.number to scope to the specific PR."
108
+ - "Use the GitHub-recommended pattern: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}'."
109
+ - "Test concurrency behavior by opening two PRs from different forks with the same branch name before merging concurrency configuration."
110
+ - "On public repos with many external contributors, audit all concurrency group keys for cross-fork collision risk."
111
+ docs:
112
+ - url: "https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-concurrency"
113
+ label: "GitHub Docs — Using concurrency (recommended group key pattern)"
114
+ - url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#concurrency"
115
+ label: "GitHub Docs — Workflow syntax: concurrency"
@@ -0,0 +1,117 @@
1
+ id: known-unsolved-067
2
+ title: 'ubuntu-24.04 Runner df Reports 12-15 GB Ghost Disk Usage — Invisible to du/lsof'
3
+ category: known-unsolved
4
+ severity: silent-failure
5
+ tags:
6
+ - ubuntu-24
7
+ - disk-space
8
+ - enospc
9
+ - runner-agent
10
+ - diagnostics
11
+ - phantom-disk
12
+ - playwright
13
+ - hosted-runner
14
+ patterns:
15
+ - regex: 'ENOSPC:\s*no space left on device'
16
+ flags: 'i'
17
+ - regex: 'df\s+/\s+.*\d{4,5}M\s+.*\d+%'
18
+ flags: 'i'
19
+ - regex: 'No space left on device'
20
+ flags: 'i'
21
+ error_messages:
22
+ - 'ENOSPC: no space left on device, write'
23
+ - 'No space left on device'
24
+ - 'df: cannot read table of mounted file systems: No space left on device'
25
+ root_cause: |
26
+ On ubuntu-24.04 hosted runners, `df /` can report 12–15 GB of disk used
27
+ during heavy test runs (particularly those spawning many short-lived child
28
+ processes or producing large volumes of stdout, such as Playwright WebKit /
29
+ WPE test suites). This usage CANNOT be accounted for by:
30
+
31
+ - `du -shx /` (sum of all directories does not grow)
32
+ - `lsof +L1` (deleted-but-open files show only kernel /memfd:* entries)
33
+ - /proc/<PID>/maps (only kernel memfd entries)
34
+ - /proc/<PID>/io write_bytes (single-digit MB cumulative)
35
+
36
+ The ghost usage RECOVERS fully ~40 seconds after the job's main process
37
+ exits — gradually, over ~10 seconds — even though all child processes are
38
+ already reaped at recovery start. This rules out lingering processes holding
39
+ mmap'd files.
40
+
41
+ Best-guess root cause (unconfirmed by GitHub team as of June 2026): the
42
+ runner agent's diagnostic/log buffers are flushed periodically on the host
43
+ and the flushed bytes are counted in the container's `df` view but are not
44
+ visible from inside the runner's PID namespace. The ~40-second recovery delay
45
+ is consistent with a periodic flush cycle on the agent side.
46
+
47
+ This issue is non-deterministic and tied to the state of the underlying host
48
+ VM. The same workload run locally on ubuntu-24.04 does not reproduce.
49
+
50
+ Affected environments:
51
+ - Native `ubuntu-24.04` hosted runner
52
+ - Containers running on the `ubuntu-24.04` runner (which share the host's /)
53
+ - Not affected: the same workload on a local or self-hosted ubuntu-24.04 VM
54
+
55
+ Tracked upstream: https://github.com/actions/runner/issues/4448 (open, May 2026)
56
+ fix: |
57
+ There is NO user-side fix for the phantom disk usage itself — this is
58
+ infrastructure-level behaviour outside the workflow's control.
59
+
60
+ Mitigations to prevent ENOSPC failures:
61
+
62
+ 1. Use a larger runner (8-core or 16-core) — larger runner classes have
63
+ more disk allocated on different host hardware.
64
+
65
+ 2. Reduce stdout volume by adding --quiet / --silent flags to test runners
66
+ and package managers (npm ci --quiet, pytest -q, etc.).
67
+
68
+ 3. Pre-clean the runner's docker layer cache and tool downloads that are
69
+ not needed:
70
+ - name: Free disk space
71
+ run: |
72
+ sudo rm -rf /usr/share/dotnet
73
+ sudo rm -rf /opt/ghc
74
+ sudo rm -rf /usr/local/lib/android
75
+ docker system prune -af
76
+
77
+ 4. Split the job into smaller parallel matrix jobs to reduce per-job output.
78
+
79
+ 5. Monitor disk in a background step to detect the ghost spike early and
80
+ correlate it with failures.
81
+ fix_code:
82
+ - language: yaml
83
+ label: 'Pre-clean unused runner tools to reclaim disk headroom'
84
+ code: |
85
+ steps:
86
+ - name: Free runner disk space
87
+ run: |
88
+ sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android
89
+ sudo apt-get clean
90
+ docker system prune -af --volumes || true
91
+ df -h / # confirm headroom before heavy tests
92
+
93
+ - name: Run Playwright tests
94
+ run: npx playwright test
95
+ - language: yaml
96
+ label: 'Use a larger runner with more disk allocation'
97
+ code: |
98
+ jobs:
99
+ test:
100
+ runs-on: ubuntu-latest-8-cores # example label; use the name your organization assigned to its larger runner
101
+ steps:
102
+ - run: npx playwright test
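+   - language: yaml
+     label: 'Sketch (assumption): sample df in the background to correlate the spike with failures'
+     code: |
+       # Illustrative sketch of mitigation 5. The 5-second interval and the disk.log file
+       # name are arbitrary choices; the loop relies on background processes started in one
+       # step staying alive until the job ends.
+       steps:
+         - name: Start disk monitor
+           run: |
+             nohup bash -c 'while true; do
+               echo "$(date +%T) $(df -BM --output=used,avail / | tail -n 1)"
+               sleep 5
+             done' > "$RUNNER_TEMP/disk.log" 2>&1 &
+
+         - name: Run Playwright tests
+           run: npx playwright test
+
+         - name: Print disk samples
+           if: always()
+           run: cat "$RUNNER_TEMP/disk.log"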
103
+ prevention:
104
+ - 'Add `df -h /` before and after heavy test steps to measure actual disk
105
+ consumption and detect when the ghost spike occurs.'
106
+ - 'Reduce test output verbosity — the agent diagnostic buffer hypothesis
107
+ correlates large stdout volumes with larger phantom disk readings.'
108
+ - 'For Playwright/WebKit CI that regularly sees ENOSPC: switch to
109
+ `ubuntu-24.04` larger runners or use `--reporter=dot` to minimise output.'
110
+ - 'Do not rely on `du -shx /` for disk capacity planning on hosted runners —
111
+ `df /` may show significantly more usage than du can account for during
112
+ heavy-output jobs.'
113
+ docs:
114
+ - url: 'https://github.com/actions/runner/issues/4448'
115
+ label: 'runner #4448 — df reports 12-15 GB ghost disk usage on ubuntu-24.04 runner (open, May 2026)'
116
+ - url: 'https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources'
117
+ label: 'GitHub Docs — Hosted runner hardware resources (disk sizes per runner class)'
@@ -0,0 +1,124 @@
1
+ id: known-unsolved-068
2
+ title: "Step outcome cannot distinguish timeout from failure — both report as 'failure' in steps context"
3
+ category: known-unsolved
4
+ severity: limitation
5
+ tags:
6
+ - timeout-minutes
7
+ - outcome
8
+ - conclusion
9
+ - continue-on-error
10
+ - steps-context
11
+ - retry
12
+ - known-limitation
13
+ - no-fix
14
+ patterns:
15
+ - regex: 'steps\.\w+\.outcome\s*==\s*.failure.'
16
+ flags: 'i'
17
+ - regex: 'timeout-minutes.*continue-on-error|continue-on-error.*timeout-minutes'
18
+ flags: 'im'
19
+ - regex: 'The process.*timed out after \d+ minutes'
20
+ flags: 'i'
21
+ error_messages:
22
+ - "Error: The process '/usr/bin/bash' failed with exit code 1"
23
+ - 'Error: Process completed with exit code 1'
24
+ root_cause: |
25
+ GitHub Actions exposes two result fields for completed steps in the steps context:
26
+
27
+ - steps.<id>.outcome: the raw result before continue-on-error is applied.
28
+ Possible values: success, failure, cancelled, skipped.
29
+ - steps.<id>.conclusion: the final result after continue-on-error is applied.
30
+ When continue-on-error: true is set on a failed step, conclusion becomes 'success'
31
+ even if outcome is 'failure'.
32
+
33
+ Neither field distinguishes between a step that failed because the process exited with a
34
+ non-zero code and a step that failed because it hit its timeout-minutes limit. Both
35
+ scenarios set outcome to 'failure'. There is no 'timed_out' value, no
36
+ steps.<id>.timed_out boolean, and no built-in expression function to query the reason
37
+ for failure.
38
+
39
+ This means workflows cannot natively:
40
+ - Retry only on timeout while failing fast on real errors
41
+ - Alert with different severity for timeouts vs application failures
42
+ - Auto-escalate timeout-minutes only when a timeout (not a logic error) occurred
43
+
44
+ The limitation has been a known open request in the GitHub Actions community since at
45
+ least 2022 with no current implementation timeline from GitHub.
46
+ fix: |
47
+ No native fix exists within GitHub Actions expressions. Two manual workarounds are
48
+ available in bash-based steps:
49
+
50
+ 1. Record start time and compute elapsed duration at the next step to infer timeout:
51
+ Compare elapsed seconds against the timeout-minutes threshold. A step that used
52
+ approximately 100% of its time budget likely timed out.
53
+
54
+ 2. Write a sentinel file right after the long-running command; check for its absence afterward.
55
+ A timed-out step is killed before it reaches the sentinel-write line, while a normally-failing
56
+ step reaches it only if the script captures the exit code instead of aborting on the first error.
57
+
58
+ Neither workaround is exact — both have race conditions and edge cases. The most
59
+ reliable approach is to implement timeout detection inside the script itself using
60
+ shell signals or test-framework timeout flags.
61
+ fix_code:
62
+ - language: yaml
63
+ label: 'Workaround 1: Infer timeout via elapsed time'
64
+ code: |
65
+ - name: Start timer
66
+ id: timer
67
+ run: echo "start=$(date +%s)" >> "$GITHUB_OUTPUT"
68
+
69
+ - name: Run slow tests
70
+ id: tests
71
+ timeout-minutes: 10
72
+ continue-on-error: true
73
+ run: npm test
74
+
75
+ - name: Classify failure type
76
+ if: steps.tests.outcome == 'failure'
77
+ env:
78
+ START: ${{ steps.timer.outputs.start }}
79
+ run: |
80
+ elapsed=$(( $(date +%s) - START ))
81
+ timeout_secs=600 # 10 minutes in seconds
82
+ threshold=$(( timeout_secs - 30 )) # within 30s of limit → likely timeout
83
+ if [ "$elapsed" -ge "$threshold" ]; then
84
+ echo "::warning::Step likely timed out (elapsed ${elapsed}s, limit ${timeout_secs}s)"
85
+ # Handle timeout-specific logic here (e.g., don't fail, just warn)
86
+ else
87
+ echo "::error::Step failed (exit code, not timeout — elapsed ${elapsed}s)"
88
+ exit 1
89
+ fi
90
+ - language: yaml
91
+ label: 'Workaround 2: Sentinel file to detect timeout vs normal failure'
92
+ code: |
93
+ - name: Run tests with sentinel
94
+ id: tests
95
+ timeout-minutes: 10
96
+ continue-on-error: true
97
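+ run: |
98
+ # Capture the exit code so the sentinel is written on any normal exit, pass or fail:
99
+ rc=0; npm test || rc=$?
100
+ # Not reached on timeout, because the runner kills the step first:
101
+ echo "$rc" > /tmp/test-completed; exit "$rc"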
102
+
103
+ - name: Check failure reason
104
+ if: steps.tests.outcome == 'failure'
105
+ run: |
106
+ if [ ! -f /tmp/test-completed ]; then
107
+ echo "Step timed out or failed before completing"
108
+ # Inspect logs for timeout keyword:
109
+ # If the runner log shows "The process timed out after N minutes" → it was timeout
110
+ else
111
+ echo "Step completed but exited non-zero — application failure"
112
+ exit 1
113
+ fi
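+   - language: yaml
+     label: 'Sketch (assumption): shell-level timeout inside the step, detected via exit code 124'
+     code: |
+       # Illustrative sketch assuming GNU coreutils `timeout` is available on the runner (as
+       # on Ubuntu images). The 9m limit is an arbitrary value just under timeout-minutes,
+       # and the timed_out output name is hypothetical.
+       - name: Run tests with shell timeout
+         id: tests
+         timeout-minutes: 10
+         continue-on-error: true
+         run: |
+           rc=0
+           timeout 9m npm test || rc=$?
+           if [ "$rc" -eq 124 ]; then      # 124 is timeout's exit status when the limit is hit
+             echo "timed_out=true" >> "$GITHUB_OUTPUT"
+           else
+             echo "timed_out=false" >> "$GITHUB_OUTPUT"
+           fi
+           exit "$rc"
+
+       - name: Handle timeout only
+         if: steps.tests.outputs.timed_out == 'true'
+         run: echo "::warning::Tests exceeded the 9-minute shell timeout"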
114
+ prevention:
115
+ - 'Log test durations inside the script itself; test framework flags like --testTimeout (Jest) or --timeout (Mocha) provide per-test granularity inside logs.'
116
+ - 'Use separate jobs for steps with different timeout characteristics — a dedicated integration-test job with a high timeout-minutes and a unit-test job with a low one makes failures easier to categorize.'
117
+ - 'If the step runs a single long command, wrap it in a shell timeout with a slightly shorter duration than timeout-minutes; the shell timeout exit code (124) is detectable inside the same step.'
118
+ docs:
119
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/accessing-contextual-information-about-workflow-runs#steps-context'
120
+ label: 'GitHub Docs: steps context — outcome and conclusion fields'
121
+ - url: 'https://stackoverflow.com/questions/78233438/github-action-cannot-get-timeout-status-from-previous-step'
122
+ label: 'SO: Cannot get timeout status from previous step (Mar 2024)'
123
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-conditions-to-control-job-execution'
124
+ label: 'GitHub Docs: Status check functions (failure, success, cancelled, always)'