@htekdev/actions-debugger 1.0.118 → 1.0.119
- package/errors/caching-artifacts/caching-artifacts-070.yml +94 -0
- package/errors/concurrency-timing/concurrency-timing-056.yml +127 -0
- package/errors/concurrency-timing/concurrency-timing-057.yml +115 -0
- package/errors/known-unsolved/known-unsolved-067.yml +117 -0
- package/errors/known-unsolved/known-unsolved-068.yml +124 -0
- package/errors/runner-environment/runner-environment-214.yml +107 -0
- package/errors/runner-environment/runner-environment-215.yml +93 -0
- package/errors/runner-environment/runner-environment-216.yml +82 -0
- package/errors/runner-environment/runner-environment-217.yml +99 -0
- package/errors/runner-environment/runner-environment-218.yml +111 -0
- package/errors/silent-failures/silent-failures-109.yml +119 -0
- package/errors/silent-failures/silent-failures-110.yml +91 -0
- package/errors/silent-failures/silent-failures-111.yml +107 -0
- package/errors/yaml-syntax/yaml-syntax-072.yml +93 -0
- package/errors/yaml-syntax/yaml-syntax-073.yml +103 -0
- package/package.json +1 -1
|
@@ -0,0 +1,94 @@
|
|
|
1
|
+
id: caching-artifacts-070
|
|
2
|
+
title: "setup-python Post step fails — pip cache directory doesn't exist on disk"
|
|
3
|
+
category: caching-artifacts
|
|
4
|
+
severity: error
|
|
5
|
+
tags:
|
|
6
|
+
- setup-python
|
|
7
|
+
- pip
|
|
8
|
+
- cache
|
|
9
|
+
- post-step
|
|
10
|
+
- cache-miss
|
|
11
|
+
- no-dependencies
|
|
12
|
+
- python-version-bump
|
|
13
|
+
patterns:
|
|
14
|
+
- regex: 'Cache folder path is retrieved for pip but doesn.t exist on disk'
|
|
15
|
+
flags: 'i'
|
|
16
|
+
- regex: 'likely indicates that there are no dependencies to cache'
|
|
17
|
+
flags: 'i'
|
|
18
|
+
- regex: 'Post Setup Python.*fail|Post.*setup-python.*error'
|
|
19
|
+
flags: 'i'
|
|
20
|
+
error_messages:
|
|
21
|
+
- "Cache folder path is retrieved for pip but doesn't exist on disk: /home/runner/.cache/pip. This likely indicates that there are no dependencies to cache. Consider removing the cache step if it is not needed."
|
|
22
|
+
root_cause: |
|
|
23
|
+
When actions/setup-python is configured with cache: 'pip', the action records the expected
|
|
24
|
+
pip cache directory path at setup time (/home/runner/.cache/pip on Linux, equivalent on
|
|
25
|
+
macOS/Windows). The Post Setup Python step runs at job end and attempts to save the cache
|
|
26
|
+
to that path. If the directory does not exist on disk at save time, the Post step fails
|
|
27
|
+
with this error.
|
|
28
|
+
|
|
29
|
+
Two common causes:
|
|
30
|
+
|
|
31
|
+
1. No pip install ran in the job: The job uses cache: 'pip' but only runs linting or other
|
|
32
|
+
pre-installed tools without installing any Python packages. Pip never creates its cache
|
|
33
|
+
directory, so there is nothing to save. The job's actual steps appear green while the
|
|
34
|
+
Post step fails and turns the overall workflow run red.
|
|
35
|
+
|
|
36
|
+
2. Python version bump causes cache key miss: When the Python patch version changes (e.g.,
|
|
37
|
+
from 3.13.5 to 3.13.6), the setup-python cache key changes and the first run after the
|
|
38
|
+
bump experiences a full cache miss. If pip install runs but writes packages to a virtual
|
|
39
|
+
environment rather than the global pip cache (/home/runner/.cache/pip), the expected
|
|
40
|
+
directory is never created and the Post step fails. Subsequent runs after the new cache is
|
|
41
|
+
warmed succeed.
|
|
42
|
+
|
|
43
|
+
The failure is deceptive because it surfaces in the Post Setup Python cleanup step — well
|
|
44
|
+
after the test or build steps have already succeeded — making it easy to overlook the root
|
|
45
|
+
cause.
|
|
46
|
+
fix: |
|
|
47
|
+
- Remove cache: 'pip' from any setup-python step in jobs that do not call pip install.
|
|
48
|
+
Linting-only jobs, type-check-only jobs, and jobs that rely entirely on pre-installed
|
|
49
|
+
system Python do not benefit from pip caching.
|
|
50
|
+
- If pip install does run, ensure it runs against the global pip (not a virtualenv that
|
|
51
|
+
bypasses /home/runner/.cache/pip) so the post step can find and save the cache.
|
|
52
|
+
- Upgrade to actions/setup-python@v5 or later: newer versions emit a warning annotation
|
|
53
|
+
instead of failing the step when the cache directory is missing.
|
|
54
|
+
- After a Python version bump, the first run is expected to cache-miss; monitor for Post
|
|
55
|
+
step failures on that first run and confirm subsequent runs succeed.
|
|
56
|
+
fix_code:
|
|
57
|
+
- language: yaml
|
|
58
|
+
label: 'Fix: Remove cache when no pip install runs'
|
|
59
|
+
code: |
|
|
60
|
+
# WRONG: cache: pip set but job only lints — no pip install → Post step fails
|
|
61
|
+
- uses: actions/setup-python@v4
|
|
62
|
+
with:
|
|
63
|
+
python-version: '3.12'
|
|
64
|
+
cache: 'pip' # ← REMOVE when no pip install follows
|
|
65
|
+
- name: Lint
|
|
66
|
+
run: flake8 . # no pip install; /home/runner/.cache/pip never created
|
|
67
|
+
|
|
68
|
+
# CORRECT: omit cache when the job does not install packages
|
|
69
|
+
- uses: actions/setup-python@v4
|
|
70
|
+
with:
|
|
71
|
+
python-version: '3.12'
|
|
72
|
+
# no cache: input, so the Post step skips the cache save attempt entirely
|
|
73
|
+
- name: Lint
|
|
74
|
+
run: flake8 .
|
|
75
|
+
- language: yaml
|
|
76
|
+
label: 'Fix: Upgrade to setup-python@v5 for graceful handling'
|
|
77
|
+
code: |
|
|
78
|
+
# CORRECT: v5+ emits a warning annotation instead of failing when cache path missing
|
|
79
|
+
- uses: actions/setup-python@v5
|
|
80
|
+
with:
|
|
81
|
+
python-version: '3.12'
|
|
82
|
+
cache: 'pip'
|
|
83
|
+
- name: Install dependencies
|
|
84
|
+
run: pip install -r requirements.txt
|
|
85
|
+
prevention:
|
|
86
|
+
- 'Only set cache: pip on jobs that actually run pip install — linting-only jobs should omit it.'
|
|
87
|
+
- 'Use actions/setup-python@v5 or later; it handles missing cache directories with a warning instead of a failure.'
|
|
88
|
+
- 'After bumping the Python patch version in python-version:, expect one cache-miss run and watch for Post step failures on that run only.'
|
|
89
|
+
- 'When using virtual environments (venv/pipenv/poetry), ensure pip still writes to the global cache or configure cache-dependency-path appropriately.'
|
|
90
|
+
docs:
|
|
91
|
+
- url: 'https://github.com/actions/setup-python/issues/1169'
|
|
92
|
+
label: 'setup-python#1169: Cache folder path doesn''t exist on disk (Aug 2025)'
|
|
93
|
+
- url: 'https://github.com/actions/setup-python'
|
|
94
|
+
label: 'actions/setup-python: Caching packages documentation'
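A diagnostic sketch tied to the root cause described in this entry: a final job step could report whether the pip cache directory exists before the Post Setup Python step tries to save it. The function name `check_pip_cache_dir` is a hypothetical helper, and the default path is the Linux location quoted in the error message. The sketch only reports the condition; it does not change how setup-python behaves.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of actions/setup-python): report whether the
# pip cache directory that the Post step will try to save exists on disk.
check_pip_cache_dir() {
  # Default is the Linux path quoted in the error message; pass another path
  # as the first argument for macOS/Windows or a custom cache location.
  local dir="${1:-$HOME/.cache/pip}"
  if [ -d "$dir" ]; then
    echo "present: $dir"
  else
    # ::warning:: is the workflow command that creates a warning annotation.
    echo "::warning::pip cache directory missing: $dir (consider removing cache: 'pip')"
    return 1
  fi
}
```

Called as `check_pip_cache_dir || true` in the last step of a job, this surfaces the missing directory as an annotation without failing the step itself.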
|
|
@@ -0,0 +1,127 @@
|
|
|
1
|
+
id: concurrency-timing-056
|
|
2
|
+
title: 'Workflow-level and job-level concurrency share same group key — deadlock cancellation fires immediately'
|
|
3
|
+
category: concurrency-timing
|
|
4
|
+
severity: error
|
|
5
|
+
tags:
|
|
6
|
+
- concurrency
|
|
7
|
+
- deadlock
|
|
8
|
+
- workflow-level
|
|
9
|
+
- job-level
|
|
10
|
+
- reusable-workflow
|
|
11
|
+
- workflow-call
|
|
12
|
+
- github-workflow-context
|
|
13
|
+
patterns:
|
|
14
|
+
- regex: 'Canceling since a deadlock for concurrency group .* was detected between'
|
|
15
|
+
flags: 'i'
|
|
16
|
+
- regex: 'deadlock.*concurrency group|concurrency group.*deadlock'
|
|
17
|
+
flags: 'i'
|
|
18
|
+
error_messages:
|
|
19
|
+
- "Canceling since a deadlock for concurrency group 'ci-refs/heads/main' was detected between 'top level workflow' and 'deploy'"
|
|
20
|
+
- "Canceling since a deadlock for concurrency group 'release-refs/heads/main' was detected between 'top level workflow' and 'api'"
|
|
21
|
+
root_cause: |
|
|
22
|
+
GitHub Actions fires a deadlock error and immediately cancels the offending job when the
|
|
23
|
+
same concurrency group name is held simultaneously by two levels within a single workflow
|
|
24
|
+
execution. Two distinct scenarios trigger this:
|
|
25
|
+
|
|
26
|
+
Scenario 1 — Same-workflow self-deadlock: A workflow file defines concurrency: at the
|
|
27
|
+
workflow level AND one of its jobs also defines jobs.<id>.concurrency: using an expression
|
|
28
|
+
that evaluates to the same string:
|
|
29
|
+
|
|
30
|
+
concurrency:
|
|
31
|
+
group: ${{ github.workflow }}-${{ github.ref }} # workflow-level slot
|
|
32
|
+
|
|
33
|
+
jobs:
|
|
34
|
+
deploy:
|
|
35
|
+
concurrency:
|
|
36
|
+
group: ${{ github.workflow }}-${{ github.ref }} # same string → deadlock
|
|
37
|
+
runs-on: ubuntu-latest
|
|
38
|
+
|
|
39
|
+
Scenario 2 — Reusable callee inherits caller context: A calling workflow has workflow-level
|
|
40
|
+
concurrency using ${{ github.workflow }}-${{ github.ref }}. The called reusable workflow
|
|
41
|
+
also defines workflow-level concurrency with the same expression. Because github.workflow
|
|
42
|
+
inside a reusable workflow inherits the CALLER's workflow name (not the callee's filename),
|
|
43
|
+
both evaluate to the identical group key and GitHub detects a deadlock.
|
|
44
|
+
|
|
45
|
+
github.workflow_ref also inherits from the top-level caller and does NOT produce a unique
|
|
46
|
+
value in the callee context; it cannot be used to distinguish caller from callee.
|
|
47
|
+
fix: |
|
|
48
|
+
Scenario 1 — Remove the duplicate concurrency block. Keep either the workflow-level OR
|
|
49
|
+
the job-level declaration, not both with the same key. If per-job isolation is needed,
|
|
50
|
+
append ${{ github.job }} to the job-level group name:
|
|
51
|
+
|
|
52
|
+
jobs:
|
|
53
|
+
deploy:
|
|
54
|
+
concurrency:
|
|
55
|
+
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.job }}
|
|
56
|
+
|
|
57
|
+
Scenario 2 — Remove the concurrency: block entirely from the reusable workflow. The
|
|
58
|
+
caller's workflow-level concurrency already governs the entire execution. If the callee
|
|
59
|
+
needs standalone concurrency when triggered via workflow_dispatch, use a hardcoded unique
|
|
60
|
+
prefix instead of ${{ github.workflow }}:
|
|
61
|
+
|
|
62
|
+
concurrency:
|
|
63
|
+
group: deploy-${{ github.ref }} # hardcoded prefix avoids collision with any caller
|
|
64
|
+
cancel-in-progress: true
|
|
65
|
+
fix_code:
|
|
66
|
+
- language: yaml
|
|
67
|
+
label: 'Scenario 1 fix: remove duplicate job-level concurrency'
|
|
68
|
+
code: |
|
|
69
|
+
# WRONG — identical group at workflow level and job level → deadlock
|
|
70
|
+
concurrency:
|
|
71
|
+
group: ${{ github.workflow }}-${{ github.ref }}
|
|
72
|
+
cancel-in-progress: true
|
|
73
|
+
|
|
74
|
+
jobs:
|
|
75
|
+
deploy:
|
|
76
|
+
concurrency:
|
|
77
|
+
group: ${{ github.workflow }}-${{ github.ref }} # ← DELETE THIS
|
|
78
|
+
runs-on: ubuntu-latest
|
|
79
|
+
steps:
|
|
80
|
+
- run: echo deploying
|
|
81
|
+
|
|
82
|
+
# CORRECT — concurrency only at workflow level
|
|
83
|
+
concurrency:
|
|
84
|
+
group: ${{ github.workflow }}-${{ github.ref }}
|
|
85
|
+
cancel-in-progress: true
|
|
86
|
+
|
|
87
|
+
jobs:
|
|
88
|
+
deploy:
|
|
89
|
+
runs-on: ubuntu-latest
|
|
90
|
+
steps:
|
|
91
|
+
- run: echo deploying
|
|
92
|
+
- language: yaml
|
|
93
|
+
label: 'Scenario 2 fix: remove concurrency from reusable workflow'
|
|
94
|
+
code: |
|
|
95
|
+
# deploy.yml (reusable) — WRONG: workflow-level concurrency collides with caller
|
|
96
|
+
# because github.workflow returns the CALLER's name in reusable context
|
|
97
|
+
on:
|
|
98
|
+
workflow_call:
|
|
99
|
+
workflow_dispatch:
|
|
100
|
+
# concurrency: ← DELETE this entire block from the reusable workflow
|
|
101
|
+
# group: ${{ github.workflow }}-${{ github.ref }}
|
|
102
|
+
# cancel-in-progress: true
|
|
103
|
+
|
|
104
|
+
jobs:
|
|
105
|
+
deploy:
|
|
106
|
+
runs-on: ubuntu-latest
|
|
107
|
+
steps:
|
|
108
|
+
- run: echo deploying
|
|
109
|
+
|
|
110
|
+
# If standalone concurrency is needed for workflow_dispatch calls, use hardcoded prefix:
|
|
111
|
+
# concurrency:
|
|
112
|
+
# group: deploy-${{ github.ref }} # hardcoded "deploy-" avoids collision
|
|
113
|
+
# cancel-in-progress: true
|
|
114
|
+
prevention:
|
|
115
|
+
- 'Before adding concurrency: to a reusable workflow, check if it will be called via workflow_call — if so, remove it or use a hardcoded prefix.'
|
|
116
|
+
- 'Never use the same concurrency group expression at both the workflow level and job level in the same file.'
|
|
117
|
+
- 'Note that ${{ github.workflow }} and ${{ github.workflow_ref }} both return the top-level caller''s values inside a reusable workflow; neither provides the callee''s filename.'
|
|
118
|
+
- 'actionlint does not yet flag identical concurrency groups (actionlint#538 tracks adding the check), so compare workflow-level and job-level group expressions by hand during review.'
|
|
119
|
+
docs:
|
|
120
|
+
- url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/control-the-concurrency-of-workflows-and-jobs'
|
|
121
|
+
label: 'GitHub Docs: Controlling the concurrency of workflows and jobs'
|
|
122
|
+
- url: 'https://stackoverflow.com/questions/78101326/github-actions-concurrency-deadlock'
|
|
123
|
+
label: 'SO: GitHub Actions concurrency deadlock (Score 6, 1.7K views)'
|
|
124
|
+
- url: 'https://stackoverflow.com/questions/79511940/using-workflow-filename-in-concurrency-group-for-workflows-started-by-workflow-c'
|
|
125
|
+
label: 'SO: Using workflow filename in concurrency group for workflow_call (Score 3)'
|
|
126
|
+
- url: 'https://github.com/github/vscode-github-actions/issues/135'
|
|
127
|
+
label: 'vscode-github-actions#135: Identical concurrency groups cause silent never-run (14 reactions)'
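To make the Scenario 1 collision concrete, the sketch below builds group strings the way the expressions in this entry would render them. `group_key` is a hypothetical helper, and `CI`, `refs/heads/main`, and `deploy` are made-up values standing in for github.workflow, github.ref, and github.job.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join all arguments with "-", mirroring how
# ${{ github.workflow }}-${{ github.ref }} renders to a single string.
group_key() {
  local IFS='-'
  echo "$*"
}

# Made-up values for github.workflow, github.ref and github.job.
workflow_level="$(group_key CI refs/heads/main)"
job_level_same="$(group_key CI refs/heads/main)"
job_level_unique="$(group_key CI refs/heads/main deploy)"

if [ "$workflow_level" = "$job_level_same" ]; then
  echo "collision: $workflow_level"
fi
if [ "$workflow_level" != "$job_level_unique" ]; then
  echo "distinct: $job_level_unique"
fi
```

The first comparison is the deadlock case (both levels resolve to the same string); the second shows that appending the job name yields a different group.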
|
|
@@ -0,0 +1,115 @@
|
|
|
1
|
+
id: concurrency-timing-057
|
|
2
|
+
title: "Fork PRs with Identical Branch Names Share Concurrency Group and Cancel Each Other"
|
|
3
|
+
category: concurrency-timing
|
|
4
|
+
severity: silent-failure
|
|
5
|
+
tags:
|
|
6
|
+
- concurrency
|
|
7
|
+
- fork
|
|
8
|
+
- pull_request
|
|
9
|
+
- head_ref
|
|
10
|
+
- cancel-in-progress
|
|
11
|
+
- silent-cancel
|
|
12
|
+
patterns:
|
|
13
|
+
- regex: 'group.*github\.head_ref'
|
|
14
|
+
flags: 'i'
|
|
15
|
+
- regex: 'This run was cancelled'
|
|
16
|
+
flags: 'i'
|
|
17
|
+
error_messages:
|
|
18
|
+
- "This run was cancelled."
|
|
19
|
+
- "Run was cancelled."
|
|
20
|
+
root_cause: |
|
|
21
|
+
When a workflow uses 'github.head_ref' as the sole identifier in a
|
|
22
|
+
concurrency group key (a common pattern to cancel stale runs on the same
|
|
23
|
+
branch), all pull requests that share a branch name across different forks
|
|
24
|
+
map to the SAME concurrency group. With 'cancel-in-progress: true', the
|
|
25
|
+
latest queued run cancels all earlier runs in that group — including runs
|
|
26
|
+
from completely unrelated PRs in OTHER contributor forks.
|
|
27
|
+
|
|
28
|
+
Example scenario:
|
|
29
|
+
- Fork A (alice/myrepo) opens PR from branch 'fix/auth-bug'.
|
|
30
|
+
- Fork B (bob/myrepo) opens a PR from a branch also named 'fix/auth-bug'.
|
|
31
|
+
- Both PRs target the same upstream repo (org/myrepo).
|
|
32
|
+
- Concurrency group: 'ci-fix/auth-bug' (from github.head_ref).
|
|
33
|
+
- When Fork B's PR triggers a run, it cancels Fork A's in-progress run.
|
|
34
|
+
|
|
35
|
+
The cancellation appears as "This run was cancelled" with no explanation
|
|
36
|
+
that a different fork's PR caused it. Maintainers and contributors see
|
|
37
|
+
flaky-looking CI with no obvious cause.
|
|
38
|
+
|
|
39
|
+
This is especially common in:
|
|
40
|
+
- Large open-source projects where many contributors use the same
|
|
41
|
+
conventional branch names (fix/typo, docs/readme, feature/x).
|
|
42
|
+
- Dependabot/Renovate PRs across forks — all use the same structured
|
|
43
|
+
branch name pattern (dependabot/npm_and_yarn/lodash-4.0.0).
|
|
44
|
+
fix: |
|
|
45
|
+
Include 'github.event.pull_request.number' in the concurrency group key.
|
|
46
|
+
PR numbers are unique per repository, so two PRs from different forks
|
|
47
|
+
always have different numbers even if their branch names collide.
|
|
48
|
+
|
|
49
|
+
Alternative: use 'github.run_id' for maximum uniqueness (no cancellation
|
|
50
|
+
across runs at all), but this defeats the purpose of cancel-in-progress
|
|
51
|
+
for the same PR.
|
|
52
|
+
|
|
53
|
+
The recommended pattern from GitHub documentation:
|
|
54
|
+
group: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}'
|
|
55
|
+
|
|
56
|
+
The '|| github.ref' fallback handles non-PR events (push, schedule)
|
|
57
|
+
where 'github.event.pull_request.number' is empty.
|
|
58
|
+
fix_code:
|
|
59
|
+
- language: yaml
|
|
60
|
+
label: "Broken — github.head_ref alone causes cross-fork cancellation"
|
|
61
|
+
code: |
|
|
62
|
+
concurrency:
|
|
63
|
+
group: ci-${{ github.head_ref }} # ❌ Collides across forks with same branch name
|
|
64
|
+
cancel-in-progress: true
|
|
65
|
+
|
|
66
|
+
- language: yaml
|
|
67
|
+
label: "Fixed — include PR number to ensure per-PR uniqueness"
|
|
68
|
+
code: |
|
|
69
|
+
concurrency:
|
|
70
|
+
# PR number is unique per repo — different forks never collide
|
|
71
|
+
group: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}'
|
|
72
|
+
cancel-in-progress: true
|
|
73
|
+
|
|
74
|
+
- language: yaml
|
|
75
|
+
label: "Alternative: include the head repository full name to scope per fork"
|
|
76
|
+
code: |
|
|
77
|
+
concurrency:
|
|
78
|
+
# Include head repo full_name to distinguish forks explicitly
|
|
79
|
+
group: >-
|
|
80
|
+
${{ github.workflow }}-
|
|
81
|
+
${{ github.event.pull_request.head.repo.full_name || github.repository }}-
|
|
82
|
+
${{ github.event.pull_request.number || github.ref }}
|
|
83
|
+
cancel-in-progress: true
|
|
84
|
+
|
|
85
|
+
- language: yaml
|
|
86
|
+
label: "Recommended pattern from GitHub docs — workflow + PR number or ref"
|
|
87
|
+
code: |
|
|
88
|
+
name: CI
|
|
89
|
+
on:
|
|
90
|
+
pull_request:
|
|
91
|
+
push:
|
|
92
|
+
branches: [main]
|
|
93
|
+
|
|
94
|
+
concurrency:
|
|
95
|
+
group: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}'
|
|
96
|
+
cancel-in-progress: true
|
|
97
|
+
|
|
98
|
+
jobs:
|
|
99
|
+
test:
|
|
100
|
+
runs-on: ubuntu-latest
|
|
101
|
+
steps:
|
|
102
|
+
- uses: actions/checkout@v4
|
|
103
|
+
- run: ./run-tests.sh
|
|
104
|
+
|
|
105
|
+
prevention:
|
|
106
|
+
- "Never use github.head_ref alone as a concurrency group key for pull_request workflows."
|
|
107
|
+
- "Always pair github.head_ref with github.event.pull_request.number to scope to the specific PR."
|
|
108
|
+
- "Use the GitHub-recommended pattern: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}'."
|
|
109
|
+
- "Test concurrency behavior by opening two PRs from different forks with the same branch name before merging concurrency configuration."
|
|
110
|
+
- "On public repos with many external contributors, audit all concurrency group keys for cross-fork collision risk."
|
|
111
|
+
docs:
|
|
112
|
+
- url: "https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-concurrency"
|
|
113
|
+
label: "GitHub Docs — Using concurrency (recommended group key pattern)"
|
|
114
|
+
- url: "https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#concurrency"
|
|
115
|
+
label: "GitHub Docs — Workflow syntax: concurrency"
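The cross-fork collision and the `||` fallback can be sketched with plain string handling. The function names are hypothetical, the PR numbers 101 and 102 are made up, and `${pr_number:-$ref}` is only an approximation of the expression fallback (an empty PR number falls back to the ref).

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the two group expressions discussed in this entry.

# Mirrors: ci-${{ github.head_ref }}
head_ref_group() {
  echo "ci-$1"
}

# Mirrors: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
# ${pr_number:-$ref} approximates the || fallback for an empty PR number.
pr_number_group() {
  local workflow="$1" pr_number="$2" ref="$3"
  echo "${workflow}-${pr_number:-$ref}"
}

# Two PRs from different forks with the same branch name (made-up numbers).
alice_old="$(head_ref_group fix/auth-bug)"
bob_old="$(head_ref_group fix/auth-bug)"
alice_new="$(pr_number_group CI 101 refs/pull/101/merge)"
bob_new="$(pr_number_group CI 102 refs/pull/102/merge)"
# A push event has no PR number, so the ref is used instead.
push_new="$(pr_number_group CI "" refs/heads/main)"

echo "head_ref only: $alice_old vs $bob_old"
echo "with PR number: $alice_new vs $bob_new"
echo "push event: $push_new"
```

With head_ref alone both PRs land in the same group; with the PR number each PR gets its own group, and push events still resolve to a per-branch group.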
|
|
@@ -0,0 +1,117 @@
|
|
|
1
|
+
id: known-unsolved-067
|
|
2
|
+
title: 'ubuntu-24.04 Runner df Reports 12-15 GB Ghost Disk Usage — Invisible to du/lsof'
|
|
3
|
+
category: known-unsolved
|
|
4
|
+
severity: silent-failure
|
|
5
|
+
tags:
|
|
6
|
+
- ubuntu-24
|
|
7
|
+
- disk-space
|
|
8
|
+
- enospc
|
|
9
|
+
- runner-agent
|
|
10
|
+
- diagnostics
|
|
11
|
+
- phantom-disk
|
|
12
|
+
- playwright
|
|
13
|
+
- hosted-runner
|
|
14
|
+
patterns:
|
|
15
|
+
- regex: 'ENOSPC:\s*no space left on device'
|
|
16
|
+
flags: 'i'
|
|
17
|
+
- regex: 'df\s+/\s+.*\d{4,5}M\s+.*\d+%'
|
|
18
|
+
flags: 'i'
|
|
19
|
+
- regex: 'No space left on device'
|
|
20
|
+
flags: 'i'
|
|
21
|
+
error_messages:
|
|
22
|
+
- 'ENOSPC: no space left on device, write'
|
|
23
|
+
- 'No space left on device'
|
|
24
|
+
- 'df: cannot read table of mounted file systems: No space left on device'
|
|
25
|
+
root_cause: |
|
|
26
|
+
On ubuntu-24.04 hosted runners, `df /` can report 12–15 GB of disk used
|
|
27
|
+
during heavy test runs (particularly those spawning many short-lived child
|
|
28
|
+
processes or producing large volumes of stdout, such as Playwright WebKit /
|
|
29
|
+
WPE test suites). This usage CANNOT be accounted for by:
|
|
30
|
+
|
|
31
|
+
- `du -shx /` (sum of all directories does not grow)
|
|
32
|
+
- `lsof +L1` (deleted-but-open files show only kernel /memfd:* entries)
|
|
33
|
+
- /proc/<PID>/maps (only kernel memfd entries)
|
|
34
|
+
- /proc/<PID>/io write_bytes (single-digit MB cumulative)
|
|
35
|
+
|
|
36
|
+
The ghost usage RECOVERS fully ~40 seconds after the job's main process
|
|
37
|
+
exits — gradually, over ~10 seconds — even though all child processes are
|
|
38
|
+
already reaped at recovery start. This rules out lingering processes holding
|
|
39
|
+
mmap'd files.
|
|
40
|
+
|
|
41
|
+
Best-guess root cause (unconfirmed by GitHub team as of June 2026): the
|
|
42
|
+
runner agent's diagnostic/log buffers are flushed periodically on the host
|
|
43
|
+
and the flushed bytes are counted in the container's `df` view but are not
|
|
44
|
+
visible from inside the runner's PID namespace. The ~40-second recovery delay
|
|
45
|
+
is consistent with a periodic flush cycle on the agent side.
|
|
46
|
+
|
|
47
|
+
This issue is non-deterministic and tied to the state of the underlying host
|
|
48
|
+
VM. The same workload run locally on ubuntu-24.04 does not reproduce.
|
|
49
|
+
|
|
50
|
+
Affected environments:
|
|
51
|
+
- Native `ubuntu-24.04` hosted runner
|
|
52
|
+
- Containers running on the `ubuntu-24.04` runner (which share the host's /)
|
|
53
|
+
- Not affected: a local self-hosted ubuntu-24.04 VM (the issue does not reproduce there)
|
|
54
|
+
|
|
55
|
+
Tracked upstream: https://github.com/actions/runner/issues/4448 (open, May 2026)
|
|
56
|
+
fix: |
|
|
57
|
+
There is NO user-side fix for the phantom disk usage itself — this is
|
|
58
|
+
infrastructure-level behaviour outside the workflow's control.
|
|
59
|
+
|
|
60
|
+
Mitigations to prevent ENOSPC failures:
|
|
61
|
+
|
|
62
|
+
1. Use a larger runner (8-core or 16-core) — larger runner classes have
|
|
63
|
+
more disk allocated and run on different host hardware.
|
|
64
|
+
|
|
65
|
+
2. Reduce stdout volume by adding --quiet / --silent flags to test runners
|
|
66
|
+
and package managers (npm ci --quiet, pytest -q, etc.).
|
|
67
|
+
|
|
68
|
+
3. Pre-clean the runner's docker layer cache and tool downloads that are
|
|
69
|
+
not needed:
|
|
70
|
+
- name: Free disk space
|
|
71
|
+
run: |
|
|
72
|
+
sudo rm -rf /usr/share/dotnet
|
|
73
|
+
sudo rm -rf /opt/ghc
|
|
74
|
+
sudo rm -rf /usr/local/lib/android
|
|
75
|
+
docker system prune -af
|
|
76
|
+
|
|
77
|
+
4. Split the job into smaller parallel matrix jobs to reduce per-job output.
|
|
78
|
+
|
|
79
|
+
5. Monitor disk in a background step to detect the ghost spike early and
|
|
80
|
+
correlate it with failures.
|
|
81
|
+
fix_code:
|
|
82
|
+
- language: yaml
|
|
83
|
+
label: 'Pre-clean unused runner tools to reclaim disk headroom'
|
|
84
|
+
code: |
|
|
85
|
+
steps:
|
|
86
|
+
- name: Free runner disk space
|
|
87
|
+
run: |
|
|
88
|
+
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android
|
|
89
|
+
sudo apt-get clean
|
|
90
|
+
docker system prune -af --volumes || true
|
|
91
|
+
df -h / # confirm headroom before heavy tests
|
|
92
|
+
|
|
93
|
+
- name: Run Playwright tests
|
|
94
|
+
run: npx playwright test
|
|
95
|
+
- language: yaml
|
|
96
|
+
label: 'Use a larger runner with more disk allocation'
|
|
97
|
+
code: |
|
|
98
|
+
jobs:
|
|
99
|
+
test:
|
|
100
|
+
runs-on: ubuntu-latest-8-cores # or ubuntu-24.04-x64-8-cores; example labels, use the ones your organization defined
|
|
101
|
+
steps:
|
|
102
|
+
- run: npx playwright test
|
|
103
|
+
prevention:
|
|
104
|
+
- 'Add `df -h /` before and after heavy test steps to measure actual disk
|
|
105
|
+
consumption and detect when the ghost spike occurs.'
|
|
106
|
+
- 'Reduce test output verbosity — the agent diagnostic buffer hypothesis
|
|
107
|
+
correlates large stdout volumes with larger phantom disk readings.'
|
|
108
|
+
- 'For Playwright/WebKit CI that regularly sees ENOSPC: switch to
|
|
109
|
+
`ubuntu-24.04` larger runners or use `--reporter=dot` to minimise output.'
|
|
110
|
+
- 'Do not rely on `du -shx /` for disk capacity planning on hosted runners —
|
|
111
|
+
`df /` may show significantly more usage than du can account for during
|
|
112
|
+
heavy-output jobs.'
|
|
113
|
+
docs:
|
|
114
|
+
- url: 'https://github.com/actions/runner/issues/4448'
|
|
115
|
+
label: 'runner #4448 — df reports 12-15 GB ghost disk usage on ubuntu-24.04 runner (open, May 2026)'
|
|
116
|
+
- url: 'https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources'
|
|
117
|
+
label: 'GitHub Docs — Hosted runner hardware resources (disk sizes per runner class)'
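Mitigation 5 (monitoring disk in the background) could look roughly like the sketch below. `monitor_disk` is a hypothetical helper, and the interval and sample count are arbitrary choices. It samples the `df` view of `/`, which is the figure this entry describes as diverging from `du`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<epoch-seconds> <used-KiB>" for / every
# $1 seconds, $2 times. df -Pk gives POSIX-format output in 1024-byte
# blocks; field 3 of the second line is the "Used" column.
monitor_disk() {
  local interval="$1" count="$2" i
  for (( i = 0; i < count; i++ )); do
    echo "$(date +%s) $(df -Pk / | awk 'NR == 2 { print $3 }')"
    sleep "$interval"
  done
}
```

In a step placed before the heavy tests, something like `monitor_disk 5 720 > "$RUNNER_TEMP/disk.log" &` would collect samples for up to an hour; a later step can print the log so spikes can be lined up with ENOSPC failures.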
|
|
@@ -0,0 +1,124 @@
|
|
|
1
|
+
id: known-unsolved-068
|
|
2
|
+
title: "Step outcome cannot distinguish timeout from failure — both report as 'failure' in steps context"
|
|
3
|
+
category: known-unsolved
|
|
4
|
+
severity: limitation
|
|
5
|
+
tags:
|
|
6
|
+
- timeout-minutes
|
|
7
|
+
- outcome
|
|
8
|
+
- conclusion
|
|
9
|
+
- continue-on-error
|
|
10
|
+
- steps-context
|
|
11
|
+
- retry
|
|
12
|
+
- known-limitation
|
|
13
|
+
- no-fix
|
|
14
|
+
patterns:
|
|
15
|
+
- regex: 'steps\.\w+\.outcome\s*==\s*.failure.'
|
|
16
|
+
flags: 'i'
|
|
17
|
+
- regex: 'timeout-minutes.*continue-on-error|continue-on-error.*timeout-minutes'
|
|
18
|
+
flags: 'im'
|
|
19
|
+
- regex: 'The process.*timed out after \d+ minutes'
|
|
20
|
+
flags: 'i'
|
|
21
|
+
error_messages:
|
|
22
|
+
- "Error: The process '/usr/bin/bash' failed with exit code 1"
|
|
23
|
+
- 'Error: Process completed with exit code 1'
|
|
24
|
+
root_cause: |
|
|
25
|
+
GitHub Actions exposes two result fields for completed steps in the steps context:
|
|
26
|
+
|
|
27
|
+
- steps.<id>.outcome: the raw result before continue-on-error is applied.
|
|
28
|
+
Possible values: success, failure, cancelled, skipped.
|
|
29
|
+
- steps.<id>.conclusion: the final result after continue-on-error is applied.
|
|
30
|
+
When continue-on-error: true is set on a failed step, conclusion becomes 'success'
|
|
31
|
+
even if outcome is 'failure'.
|
|
32
|
+
|
|
33
|
+
Neither field distinguishes between a step that failed because the process exited with a
|
|
34
|
+
non-zero code and a step that failed because it hit its timeout-minutes limit. Both
|
|
35
|
+
scenarios set outcome to 'failure'. There is no 'timed_out' value, no
|
|
36
|
+
steps.<id>.timed_out boolean, and no built-in expression function to query the reason
|
|
37
|
+
for failure.
|
|
38
|
+
|
|
39
|
+
This means workflows cannot natively:
|
|
40
|
+
- Retry only on timeout while failing fast on real errors
|
|
41
|
+
- Alert with different severity for timeouts vs application failures
|
|
42
|
+
- Auto-escalate timeout-minutes only when a timeout (not a logic error) occurred
|
|
43
|
+
|
|
44
|
+
The limitation has been a known open request in the GitHub Actions community since at
|
|
45
|
+
least 2022 with no current implementation timeline from GitHub.
|
|
46
|
+
fix: |
|
|
47
|
+
No native fix exists within GitHub Actions expressions. Two manual workarounds are
|
|
48
|
+
available in bash-based steps:
|
|
49
|
+
|
|
50
|
+
1. Record start time and compute elapsed duration at the next step to infer timeout:
|
|
51
|
+
Compare elapsed seconds against the timeout-minutes threshold. A step that used
|
|
52
|
+
approximately 100% of its time budget likely timed out.
|
|
53
|
+
|
|
54
|
+
2. Write a sentinel file right after the critical work; check for its absence afterward.
|
|
55
|
+
A timed-out step never reaches the sentinel-write line after the long-running command,
|
|
56
|
+
while a normally-failing step reaches it only if the script captures the exit code instead of aborting on error.
|
|
57
|
+
|
|
58
|
+
Neither workaround is exact — both have race conditions and edge cases. The most
|
|
59
|
+
reliable approach is to implement timeout detection inside the script itself using
|
|
60
|
+
shell signals or test-framework timeout flags.
|
|
61
|
+
fix_code:
|
|
62
|
+
- language: yaml
|
|
63
|
+
label: 'Workaround 1: Infer timeout via elapsed time'
|
|
64
|
+
code: |
|
|
65
|
+
- name: Start timer
|
|
66
|
+
id: timer
|
|
67
|
+
run: echo "start=$(date +%s)" >> "$GITHUB_OUTPUT"
|
|
68
|
+
|
|
69
|
+
- name: Run slow tests
|
|
70
|
+
id: tests
|
|
71
|
+
timeout-minutes: 10
|
|
72
|
+
continue-on-error: true
|
|
73
|
+
run: npm test
|
|
74
|
+
|
|
75
|
+
- name: Classify failure type
|
|
76
|
+
if: steps.tests.outcome == 'failure'
|
|
77
|
+
env:
|
|
78
|
+
START: ${{ steps.timer.outputs.start }}
|
|
79
|
+
run: |
|
|
80
|
+
elapsed=$(( $(date +%s) - START ))
|
|
81
|
+
timeout_secs=600 # 10 minutes in seconds
|
|
82
|
+
threshold=$(( timeout_secs - 30 )) # within 30s of limit → likely timeout
|
|
83
|
+
if [ "$elapsed" -ge "$threshold" ]; then
|
|
84
|
+
echo "::warning::Step likely timed out (elapsed ${elapsed}s, limit ${timeout_secs}s)"
|
|
85
|
+
# Handle timeout-specific logic here (e.g., don't fail, just warn)
|
|
86
|
+
else
|
|
87
|
+
echo "::error::Step failed (exit code, not timeout — elapsed ${elapsed}s)"
|
|
88
|
+
exit 1
|
|
89
|
+
fi
|
|
90
|
+
- language: yaml
|
|
91
|
+
label: 'Workaround 2: Sentinel file to detect timeout vs normal failure'
|
|
92
|
+
code: |
|
|
93
|
+
- name: Run tests with sentinel
|
|
94
|
+
id: tests
|
|
95
|
+
timeout-minutes: 10
|
|
96
|
+
continue-on-error: true
|
|
97
|
+
run: |
|
|
98
|
+
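# The long-running command; capture its exit code so bash -e does not abort here:
|
|
99
|
+
rc=0; npm test || rc=$?
|
|
100
|
+
# Reached after any normal exit (pass or fail), but never after a timeout kill:
|
|
101
|
+
touch /tmp/test-completed; exit "$rc"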
|
|
102
|
+
|
|
103
|
+
- name: Check failure reason
|
|
104
|
+
if: steps.tests.outcome == 'failure'
|
|
105
|
+
run: |
|
|
106
|
+
if [ ! -f /tmp/test-completed ]; then
|
|
107
|
+
echo "Step timed out or failed before completing"
|
|
108
|
+
# Inspect logs for timeout keyword:
|
|
109
|
+
# If the runner log shows "The process timed out after N minutes" → it was timeout
|
|
110
|
+
else
|
|
111
|
+
echo "Step completed but exited non-zero — application failure"
|
|
112
|
+
exit 1
|
|
113
|
+
fi
|
|
114
|
+
prevention:
|
|
115
|
+
- 'Log test durations inside the script itself; test framework flags like --testTimeout (Jest) or --timeout (Mocha) provide per-test granularity inside logs.'
|
|
116
|
+
- 'Use separate jobs for steps with different timeout characteristics — a dedicated integration-test job with a high timeout-minutes and a unit-test job with a low one makes failures easier to categorize.'
|
|
117
|
+
- 'If the step runs a single long command, wrap it in a shell timeout with a slightly shorter duration than timeout-minutes; the shell timeout exit code (124) is detectable inside the same step.'
|
|
118
|
+
docs:
|
|
119
|
+
- url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/accessing-contextual-information-about-workflow-runs#steps-context'
|
|
120
|
+
label: 'GitHub Docs: steps context — outcome and conclusion fields'
|
|
121
|
+
- url: 'https://stackoverflow.com/questions/78233438/github-action-cannot-get-timeout-status-from-previous-step'
|
|
122
|
+
label: 'SO: Cannot get timeout status from previous step (Mar 2024)'
|
|
123
|
+
- url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-conditions-to-control-job-execution'
|
|
124
|
+
label: 'GitHub Docs: Status check functions (failure, success, cancelled, always)'
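The last prevention item (a shell timeout slightly shorter than timeout-minutes) can be sketched as follows. `run_with_deadline` is a hypothetical wrapper; it assumes GNU coreutils `timeout` is available, which exits with status 124 when it has to stop the command. A command that itself exits with 124 would be misclassified.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around GNU coreutils `timeout`: run a command with a
# deadline in seconds and print why it ended. Status 124 means `timeout`
# stopped the command because the deadline passed.
run_with_deadline() {
  local secs="$1" rc=0
  shift
  timeout "$secs" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "timeout"
  elif [ "$rc" -ne 0 ]; then
    echo "failure:$rc"
  else
    echo "success"
  fi
  # Propagate the original status so the step still fails when it should.
  return "$rc"
}
```

Inside a step with `timeout-minutes: 10`, a call such as `run_with_deadline 570 npm test` (570 seconds is an arbitrary margin under the 600-second limit) lets the same step tell a timeout from an ordinary non-zero exit before the runner enforces its own limit.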
|