@htekdev/actions-debugger 1.0.127 → 1.0.129

@@ -0,0 +1,141 @@
1
+ id: caching-artifacts-076
2
+ title: 'actions/cache 429 rate limit on restore not retried — workflow silently falls through to cache miss'
3
+ category: caching-artifacts
4
+ severity: warning
5
+ tags:
6
+ - actions-cache
7
+ - rate-limit
8
+ - 429
9
+ - restore
10
+ - cache-miss
11
+ - silent-fallthrough
12
+ patterns:
13
+ - regex: 'Failed to restore.*Rate limited.*429|rate limit exceeded.*GetCacheEntryDownloadURL'
14
+ flags: 'i'
15
+ - regex: 'You.ve hit a rate limit.*rate limit will reset'
16
+ flags: 'i'
17
+ - regex: 'Failed to GetCacheEntryDownloadURL.*Too Many Requests'
18
+ flags: 'i'
19
+ error_messages:
20
+ - "Warning: You've hit a rate limit, your rate limit will reset in 18 seconds"
21
+ - "Warning: Failed to restore: Failed to GetCacheEntryDownloadURL: Rate limited: Failed request: (429) Too Many Requests: rate limit exceeded"
22
+ - "Cache not found for input keys: <key>"
23
+ root_cause: |
24
+ When the GitHub Actions cache service returns HTTP 429 (Too Many Requests) on the
25
+ **restore/download path**, `actions/cache` does NOT implement retry logic for this
26
+ specific error. Instead, it:
27
+
28
+ 1. Logs `Warning: You've hit a rate limit, your rate limit will reset in X seconds`
29
+ 2. Logs `Warning: Failed to restore: Failed to GetCacheEntryDownloadURL: Rate limited`
30
+ 3. Falls through to `Cache not found for input keys: <key>` — treating it as a cache miss
31
+ 4. Continues workflow execution as if the cache doesn't exist
32
+
33
+ **Important distinction from upload 429:** The existing entry `cache-service-429-upload-ebadf-crash.yml`
34
+ covers HTTP 429 during cache UPLOAD causing an EBADF crash (a different, harder failure).
35
+ This entry covers 429 during cache RESTORE, which silently falls through without crashing
36
+ but causes unnecessary full rebuilds.
37
+
38
+ **Impact:** Workflows with many concurrent runs hitting the same cache key simultaneously
39
+ (e.g. a heavily-used monorepo's `node_modules` or `~/.cargo/registry` cache) can exceed
40
+ the per-key request rate limit. All concurrent runs that get rate-limited then execute a
41
+ full dependency install/build from scratch, multiplying CI time and cost. The 429 response
42
+ includes a reset time in the warning message but `actions/cache` ignores it.
43
+
44
+ **When it happens:** Most commonly seen in:
45
+ - Monorepos with many parallel jobs all restoring the same cache key
46
+ - High-frequency push workflows where many runs are active simultaneously
47
+ - Nightly builds on large teams where multiple jobs race for the same cache
48
+
49
+ Source: actions/cache#1758 (May 13, 2026), also reported via oxidecomputer/hubris#2535
50
+ fix: |
51
+ No retry mechanism exists in `actions/cache` for restore-path 429 errors. The
52
+ following strategies reduce exposure:
53
+
54
+ **Option 1: Add job-level cache-key uniqueness to reduce simultaneous key contention**
55
+
56
+ Use a cache key that includes the job name or runner index, so parallel runs don't all
57
+ hit the exact same cache entry at once.
58
+
59
+ **Option 2: Add a retry wrapper around the restore step**
60
+
61
+ Add a second restore step, gated on the first step's `cache-hit` output, after a short wait.
62
+
63
+ **Option 3: Use `actions/cache/restore` + `actions/cache/save` split and add `fail-on-cache-miss: false`**
64
+
65
+ Explicitly handle the miss and emit a warning rather than silently rebuilding.
66
+
67
+ **Option 4: File a +1 on actions/cache#1758**
68
+
69
+ The correct fix is for `actions/cache` to detect the 429 reset time and wait/retry
70
+ instead of falling through. Adding reactions encourages prioritization.
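+
+ A minimal sketch of Option 2, assuming a 30-second pause outlasts the reset window
+ quoted in the warning. The step ids `restore` and `restore-retry` are illustrative, and
+ the check cannot tell a 429 fallthrough from a genuine miss, so a genuine miss also
+ pays the wait:
+
+ ```yaml
+ steps:
+   - name: Restore cache (first attempt)
+     id: restore
+     uses: actions/cache/restore@v4
+     with:
+       path: ~/.npm
+       key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
+       restore-keys: |
+         ${{ runner.os }}-node-
+
+   # Assumption: 30 seconds is longer than the reset time printed in the warning.
+   - name: Wait out a possible rate limit
+     if: steps.restore.outputs.cache-matched-key == ''
+     run: sleep 30
+
+   - name: Restore cache (second attempt)
+     id: restore-retry
+     if: steps.restore.outputs.cache-matched-key == ''
+     uses: actions/cache/restore@v4
+     with:
+       path: ~/.npm
+       key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
+       restore-keys: |
+         ${{ runner.os }}-node-
+
+   # Surface the cold build instead of rebuilding silently.
+   - name: Flag a cold build
+     if: steps.restore.outputs.cache-matched-key == '' && steps.restore-retry.outputs.cache-matched-key == ''
+     run: echo "::warning::Cache restore missed twice; check the log for 429 warnings"
+ ```
+
+ `cache-matched-key` is empty only when nothing was restored, so a partial match through
+ `restore-keys` does not trigger the retry.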
71
+ fix_code:
72
+ - language: yaml
73
+ label: 'Recognize the 429 pattern and add explicit retry logic'
74
+ code: |
75
+ # The standard cache step falls through silently on 429.
76
+ # Add a retry wrapper to explicitly detect and retry:
77
+ steps:
78
+ - name: Restore cache (with 429 retry)
79
+ id: cache-restore
80
+ uses: actions/cache@v4
81
+ with:
82
+ path: ~/.npm
83
+ key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
84
+ restore-keys: |
85
+ ${{ runner.os }}-node-
86
+
87
+ # Retry once on anything short of an exact hit; a 429 fallthrough and a real miss look identical here
88
+ - name: Retry cache if rate-limited
89
+ if: steps.cache-restore.outputs.cache-hit != 'true'
90
+ uses: actions/cache/restore@v4
91
+ with:
92
+ path: ~/.npm
93
+ key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
94
+ restore-keys: |
95
+ ${{ runner.os }}-node-
96
+
97
+ - language: yaml
98
+ label: 'Reduce cache contention — add job index to key for parallel runs'
99
+ code: |
100
+ # When many parallel jobs restore the same cache key, add job uniqueness
101
+ # to distribute reads across different cache entries:
102
+ strategy:
103
+ matrix:
104
+ shard: [1, 2, 3, 4]
105
+
106
+ steps:
107
+ - uses: actions/cache@v4
108
+ with:
109
+ path: ~/.npm
110
+ # Use matrix shard to reduce simultaneous same-key requests
111
+ key: ${{ runner.os }}-node-shard${{ matrix.shard }}-${{ hashFiles('**/package-lock.json') }}
112
+ restore-keys: |
113
+ ${{ runner.os }}-node-shard${{ matrix.shard }}-
114
+ ${{ runner.os }}-node-
115
+ - language: yaml
116
+ label: 'Observe the rate limit in workflow logs'
117
+ code: |
118
+ # What the rate limit looks like in workflow output:
119
+ # (these are Warning lines, not Error — easy to miss)
120
+ #
121
+ # Run actions/cache@v4
122
+ # Received 0 cache(s) from cache.githubusercontent.com
123
+ # Warning: You've hit a rate limit, your rate limit will reset in 18 seconds
124
+ # Warning: Failed to restore: Failed to GetCacheEntryDownloadURL: Rate limited:
125
+ # Failed request: (429) Too Many Requests: rate limit exceeded
126
+ # Cache not found for input keys: linux-node-abc123, linux-node-
127
+ #
128
+ # The workflow then continues as if cache was never created.
129
+ # Silent "success" hides wasted rebuild time.
130
+ prevention:
131
+ - "When running many parallel jobs on a shared cache key, add per-job cache key suffixes (e.g. matrix index, job name) to distribute read load across different cache entries."
132
+ - "Monitor CI timing for unexplained slowdowns — a sudden increase in dependency install time across all parallel jobs is a signal that cache restoration is being rate-limited."
133
+ - "Cache keys with very high read frequency (multiple runs per minute across many jobs) are most susceptible. Use granular keys to avoid this."
134
+ - "Check that your cache key is stable (not changing per-run) — flapping keys force repeated cache service requests for new key registrations, increasing rate limit exposure."
135
+ docs:
136
+ - url: 'https://github.com/actions/cache/issues/1758'
137
+ label: 'actions/cache#1758 — Handle rate limit on restore (no retry implemented)'
138
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy'
139
+ label: 'GitHub Docs: Cache usage limits and eviction policy'
140
+ - url: 'https://github.com/actions/cache'
141
+ label: 'actions/cache repository'
@@ -0,0 +1,158 @@
1
+ id: caching-artifacts-077
2
+ title: 'upload-artifact fails with "Bad Request" on self-hosted runners behind HTTPS proxy — Azure BlobClient not configured with proxy transport'
3
+ category: caching-artifacts
4
+ severity: error
5
+ tags:
6
+ - upload-artifact
7
+ - proxy
8
+ - https-proxy
9
+ - azure-blob
10
+ - self-hosted
11
+ - bad-request
12
+ patterns:
13
+ - regex: 'Error: Bad Request.*upload.*artifact|upload.*artifact.*Error: Bad Request'
14
+ flags: 'i'
15
+ - regex: 'BlobClient.*proxy|blob.*core.*windows.*proxy.*Bad Request'
16
+ flags: 'i'
17
+ - regex: 'Beginning upload of artifact content to blob storage[\s\S]*?Error: Bad Request'
18
+ flags: 'i'
19
+ error_messages:
20
+ - "Error: Bad Request"
21
+ - "Beginning upload of artifact content to blob storage\nError: Bad Request"
22
+ - "Uploading artifact: <name>\nBeginning upload of artifact content to blob storage\nError: Bad Request"
23
+ root_cause: |
24
+ When `HTTPS_PROXY` (or `https_proxy`) is set in the self-hosted runner environment,
25
+ `actions/upload-artifact` (v4+) fails with `Error: Bad Request` during the blob upload
26
+ step.
27
+
28
+ **Root cause:** `@actions/artifact` creates an Azure `BlobClient` from `@azure/storage-blob`
29
+ using only the pre-signed upload URL — no proxy transport configuration is passed:
30
+
31
+ ```typescript
32
+ const blobClient = new BlobClient(authenticatedUploadURL)
33
+ const blockBlobClient = blobClient.getBlockBlobClient()
34
+ ```
35
+
36
+ The Azure SDK creates its own internal HTTP pipeline that does not honor the environment's
37
+ `HTTPS_PROXY` variable. The CONNECT tunnel may appear to succeed (proxy returns 200) but
38
+ the Azure SDK's TLS negotiation through the proxy tunnel fails or sends incorrect frames,
39
+ producing a `Bad Request` response after a ~75-second stall.
40
+
41
+ **Affected actions:** Any action using `@actions/artifact` for blob uploads:
42
+ - `actions/upload-artifact` v4+
43
+ - Any third-party action built on `@actions/artifact`
44
+
45
+ **Same root cause in the runner itself:** The runner's C# `BlobClient` also lacks proxy
46
+ transport configuration, causing step logs and diagnostic uploads to stall similarly
47
+ (see actions/runner#4351).
48
+
49
+ **Proxy logs show the pattern:**
50
+ ```
51
+ CONNECT productionresultssa6.blob.core.windows.net:443 → status=200 (ALLOWED)
52
+ requestSize=~17 KB (partial upload)
53
+ latency=~75 seconds (stalled, not completing)
54
+ ```
55
+
56
+ **curl and Python upload to the same blob endpoint in <1s through the same proxy**, so the
57
+ issue is specific to the Azure SDK's pipeline construction when created without proxy opts.
58
+
59
+ Source: actions/toolkit#2377 (Apr 15, 2026), actions/runner#4351
60
+ fix: |
61
+ **Workaround (recommended): Add `.blob.core.windows.net` to NO_PROXY**
62
+
63
+ Bypass the proxy for Azure Blob Storage traffic. Artifacts upload directly to Azure Blob
64
+ Storage endpoints (`*.blob.core.windows.net`), not through `api.github.com`.
65
+
66
+ Set `NO_PROXY` in the runner's environment (`.env` file, systemd unit, or runner config):
67
+ ```
68
+ NO_PROXY=.blob.core.windows.net,.actions.githubusercontent.com
69
+ ```
70
+
71
+ **Important:** This bypasses proxy inspection for Azure Blob traffic. If your security
72
+ policy requires proxy inspection for all egress, you must configure the proxy to pass
73
+ TLS-tunneled connections to `*.blob.core.windows.net` without re-signing.
74
+
75
+ **Permanent fix (pending upstream):**
76
+ The `@actions/artifact` package needs to pass `proxyOptions` when constructing `BlobClient`.
77
+ Monitor actions/toolkit#2377 for the fix. Once released, upgrading to the fixed version
78
+ of `actions/upload-artifact` will resolve the issue without needing `NO_PROXY`.
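+
+ A quick round-trip workflow can confirm that the `NO_PROXY` workaround took effect. This
+ is a sketch, assuming a Linux self-hosted runner; the workflow name, job id, artifact
+ name, and file paths are illustrative:
+
+ ```yaml
+ name: artifact-proxy-smoke-test
+ on: workflow_dispatch
+
+ jobs:
+   roundtrip:
+     runs-on: [self-hosted]
+     steps:
+       - name: Create a small test file
+         run: |
+           mkdir -p smoke
+           date > smoke/stamp.txt
+
+       # Fails with "Error: Bad Request" if blob traffic still goes through the proxy.
+       - uses: actions/upload-artifact@v4
+         with:
+           name: proxy-smoke-test
+           path: smoke/
+
+       - uses: actions/download-artifact@v4
+         with:
+           name: proxy-smoke-test
+           path: smoke-downloaded/
+
+       - name: Compare the round-tripped file
+         run: cmp smoke/stamp.txt smoke-downloaded/stamp.txt
+ ```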
79
+ fix_code:
80
+ - language: yaml
81
+ label: 'Workaround — set NO_PROXY to bypass proxy for Azure Blob Storage'
82
+ code: |
83
+ # In your runner's .env file (located at: <runner-dir>/.env)
84
+ # Add Azure Blob Storage and GitHub Actions results endpoints to NO_PROXY:
85
+ #
86
+ # HTTPS_PROXY=http://your-proxy:8080
87
+ # NO_PROXY=.blob.core.windows.net,.actions.githubusercontent.com,localhost,127.0.0.1
88
+ #
89
+ # Restart the runner service after editing .env:
90
+ # sudo ./svc.sh stop
91
+ # sudo ./svc.sh start
92
+
93
+ # If using ARC/Kubernetes, set as environment variables in the runner pod spec:
94
+ # spec:
95
+ # containers:
96
+ # - name: runner
97
+ # env:
98
+ # - name: HTTPS_PROXY
99
+ # value: "http://your-proxy:8080"
100
+ # - name: NO_PROXY
101
+ # value: ".blob.core.windows.net,.actions.githubusercontent.com"
102
+
103
+ - language: yaml
104
+ label: 'Set NO_PROXY in the workflow as a fallback for managed runners'
105
+ code: |
106
+ # For managed self-hosted runners where you cannot edit the .env file,
107
+ # set NO_PROXY as a job-level env override before the upload step.
108
+ # Note: This only applies to workflow steps, not the runner service itself.
109
+ jobs:
110
+ build:
111
+ runs-on: [self-hosted]
112
+ env:
113
+ # Ensure Azure Blob Storage bypasses proxy for artifact uploads
114
+ NO_PROXY: '.blob.core.windows.net,.actions.githubusercontent.com'
115
+ steps:
116
+ - uses: actions/checkout@v4
117
+ - run: make build
118
+ - uses: actions/upload-artifact@v4
119
+ with:
120
+ name: build-output
121
+ path: dist/
122
+ - language: yaml
123
+ label: 'Confirm the root cause from workflow logs'
124
+ code: |
125
+ # Affected workflow output looks like:
126
+ #
127
+ # Run actions/upload-artifact@v4
128
+ # With the provided path, there will be 1 file uploaded
129
+ # Artifact name is valid!
130
+ # Root directory input is valid!
131
+ # Uploading artifact: my-artifact
132
+ # Beginning upload of artifact content to blob storage
133
+ # Error: Bad Request
134
+ #
135
+ # The "Beginning upload of artifact content to blob storage" line appears,
136
+ # the step stalls for ~75 seconds, and then fails with "Error: Bad Request".
137
+ # No detailed error, no retry, no stack trace.
138
+ #
139
+ # Workaround confirmation — after setting NO_PROXY correctly:
140
+ #
141
+ # Uploading artifact: my-artifact
142
+ # Beginning upload of artifact content to blob storage
143
+ # Artifact upload has finished successfully.
144
+ # Artifact 'my-artifact' has been successfully uploaded!
145
+ prevention:
146
+ - "When deploying self-hosted runners behind a corporate HTTPS proxy, always configure `NO_PROXY` to include `.blob.core.windows.net` — this is where GitHub Actions stores artifacts and build outputs."
147
+ - "Test artifact upload and download on a new self-hosted runner deployment before declaring it production-ready. A quick test workflow with `actions/upload-artifact` and `actions/download-artifact` will reveal proxy issues immediately."
148
+ - "Include `.actions.githubusercontent.com` and `.blob.core.windows.net` in your proxy allow-list or NO_PROXY config — these are required for GitHub Actions to function correctly."
149
+ - "Monitor actions/toolkit#2377 for the upstream fix that adds proxy transport configuration to the Azure BlobClient. Once released, upgrade `actions/upload-artifact` to the fixed version."
150
+ docs:
151
+ - url: 'https://github.com/actions/toolkit/issues/2377'
152
+ label: 'actions/toolkit#2377 — BlobClient upload fails with "Bad Request" through HTTPS proxy'
153
+ - url: 'https://github.com/actions/runner/issues/4351'
154
+ label: 'actions/runner#4351 — Runner Azure Blob uploads stall through HTTPS proxy (same root cause)'
155
+ - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#internet-access-for-self-hosted-runners'
156
+ label: 'GitHub Docs: Internet access requirements for self-hosted runners'
157
+ - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/using-self-hosted-runners-in-a-workflow'
158
+ label: 'GitHub Docs: Using self-hosted runners in a workflow'
@@ -0,0 +1,147 @@
1
+ id: known-unsolved-075
2
+ title: '`matrix` context unavailable in job-level `if:` condition when matrix is dynamically generated from upstream job outputs'
3
+ category: known-unsolved
4
+ severity: limitation
5
+ tags:
6
+ - matrix
7
+ - dynamic-matrix
8
+ - if-condition
9
+ - job-level
10
+ - fromJSON
11
+ - needs-outputs
12
+ patterns:
13
+ - regex: 'Unrecognized named-value.*matrix.*Located at position'
14
+ flags: 'i'
15
+ - regex: 'matrix.*context.*not.*available.*if|if.*matrix.*unrecognized'
16
+ flags: 'i'
17
+ error_messages:
18
+ - "Unrecognized named-value: 'matrix'. Located at position 26 within expression: contains(inputs.SCHEMAS, matrix.customer.schema)"
19
+ - "The workflow is not valid. ... Unrecognized named-value: 'matrix'."
20
+ root_cause: |
21
+ When a job uses `strategy.matrix` with values derived from an upstream job's outputs
22
+ (e.g. `fromJson(needs.set-matrix.outputs.matrix)`), the `matrix` context is **not**
23
+ available in the job's own `if:` condition.
24
+
25
+ The root cause is evaluation ordering: GitHub Actions evaluates job-level `if:`
26
+ conditions **before** resolving the dynamic matrix values from upstream job outputs.
27
+ At the time `if:` is checked, the specific matrix combination (e.g. `matrix.schema`,
28
+ `matrix.os`) has not yet been bound.
29
+
30
+ This limitation does NOT affect:
31
+ - Step-level `if:` conditions inside the job (matrix IS available there)
32
+ - Job-level keys other than `if:`, such as `name`, `runs-on`, and `env` (matrix IS available there)
33
+ - Downstream jobs reading this job's outputs (normal needs chain)
34
+
35
+ **Workaround does not exist at job level**: There is no supported way to filter
36
+ individual matrix combinations at the job `if:` level when the matrix is dynamic.
37
+ The `include`/`exclude` keys cannot filter on it either: `strategy` has no `matrix` context.
38
+
39
+ The only workaround is to move the filtering logic inside a step using
40
+ `if: condition` at the step level, or to generate a pre-filtered matrix in the
41
+ upstream job so that no filtering is needed at the consumer job level.
42
+
43
+ Source: actions/runner#1985 (64 reactions, open since 2022)
44
+ Community discussion: https://github.community/t/matrix-cannot-be-used-in-jobs-level-if/17177
45
+ fix: |
46
+ **Option 1 (recommended): Filter inside the upstream matrix-generation job**
47
+
48
+ Produce a matrix JSON that only includes the combinations you want to run. This is
49
+ the cleanest approach — no filtering needed in the consumer job.
50
+
51
+ **Option 2: Use step-level `if:` instead of job-level `if:`**
52
+
53
+ Gate each step on a step-level `if:`. The job still starts for every matrix entry,
54
+ but filtered entries skip their steps. This wastes a job slot but works.
55
+
56
+ **Option 3: Use `continue-on-error: true` with an early-exit pattern**
57
+
58
+ Not recommended — harder to distinguish real failures from filtered runs.
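+
+ With Option 1 the filtered matrix can end up empty, and an empty matrix is typically
+ rejected when the consumer job starts. One way to guard against that, sketched here
+ under assumptions, is a second output that a job-level `if:` can read through `needs`.
+ The `count` output and the `SCHEMAS` environment variable are illustrative names:
+
+ ```yaml
+ jobs:
+   set-matrix:
+     runs-on: ubuntu-latest
+     outputs:
+       matrix: ${{ steps.gen.outputs.matrix }}
+       count: ${{ steps.gen.outputs.count }}
+     steps:
+       - id: gen
+         env:
+           # Passing the input through env keeps it out of the script text.
+           SCHEMAS: ${{ inputs.SCHEMAS }}
+         run: |
+           MATRIX=$(jq -cn --arg schemas "$SCHEMAS" '
+             {customer: (
+               [{"schema":"prod"},{"schema":"staging"}]
+               | map(select(.schema | IN($schemas | split(",")[])))
+             )}')
+           COUNT=$(echo "$MATRIX" | jq '.customer | length')
+           echo "matrix=$MATRIX" >> "$GITHUB_OUTPUT"
+           echo "count=$COUNT" >> "$GITHUB_OUTPUT"
+
+   deploy:
+     needs: set-matrix
+     # The needs context IS available in a job-level if; outputs are strings.
+     if: needs.set-matrix.outputs.count != '0'
+     runs-on: ubuntu-latest
+     strategy:
+       matrix: ${{ fromJson(needs.set-matrix.outputs.matrix) }}
+     steps:
+       - run: echo "Deploying ${{ matrix.customer.schema }}"
+ ```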
59
+ fix_code:
60
+ - language: yaml
61
+ label: "Broken — matrix context in job-level if with dynamic matrix"
62
+ code: |
63
+ jobs:
64
+ set-matrix:
65
+ runs-on: ubuntu-latest
66
+ outputs:
67
+ matrix: ${{ steps.gen.outputs.matrix }}
68
+ steps:
69
+ - id: gen
70
+ run: |
71
+ echo 'matrix={"customer":[{"schema":"prod"},{"schema":"staging"}]}' >> "$GITHUB_OUTPUT"
72
+
73
+ deploy:
74
+ needs: set-matrix
75
+ runs-on: ubuntu-latest
76
+ strategy:
77
+ matrix: ${{ fromJson(needs.set-matrix.outputs.matrix) }}
78
+ # ❌ FAILS: "Unrecognized named-value: 'matrix'"
79
+ if: contains(inputs.SCHEMAS, matrix.customer.schema)
80
+ steps:
81
+ - run: echo "Deploying ${{ matrix.customer.schema }}"
82
+
83
+ - language: yaml
84
+ label: "Fixed Option 1 — pre-filter in the matrix-generation job"
85
+ code: |
86
+ jobs:
87
+ set-matrix:
88
+ runs-on: ubuntu-latest
89
+ outputs:
90
+ matrix: ${{ steps.gen.outputs.matrix }}
91
+ steps:
92
+ - id: gen
93
+ # ✅ Generate only the matrix entries that should run
94
+ run: |
95
+ # Filter based on inputs.SCHEMAS inside the script
96
+ SCHEMAS="${{ inputs.SCHEMAS }}"
97
+ MATRIX=$(jq -n --arg schemas "$SCHEMAS" \
98
+ '[{"schema":"prod"},{"schema":"staging"}] |
99
+ map(select(.schema | IN($schemas | split(",")[])))' \
100
+ | jq -c '{customer:.}')
101
+ echo "matrix=$MATRIX" >> "$GITHUB_OUTPUT"
102
+
103
+ deploy:
104
+ needs: set-matrix
105
+ runs-on: ubuntu-latest
106
+ strategy:
107
+ matrix: ${{ fromJson(needs.set-matrix.outputs.matrix) }}
108
+ # ✅ No job-level if needed — matrix is already filtered
109
+ steps:
110
+ - run: echo "Deploying ${{ matrix.customer.schema }}"
111
+
112
+ - language: yaml
113
+ label: "Fixed Option 2 — move filtering to step-level if"
114
+ code: |
115
+ jobs:
116
+ deploy:
117
+ needs: set-matrix
118
+ runs-on: ubuntu-latest
119
+ strategy:
120
+ matrix: ${{ fromJson(needs.set-matrix.outputs.matrix) }}
121
+ # ✅ No job-level if — allow all matrix entries through
122
+ steps:
123
+ # Log a skip notice for entries that don't match (a step cannot end the job early)
124
+ - name: Check if this schema should deploy
125
+ # ✅ matrix context IS available in step-level if conditions
126
+ if: "!contains(inputs.SCHEMAS, matrix.customer.schema)"
127
+ run: |
128
+ echo "Skipping schema ${{ matrix.customer.schema }}"
129
+ exit 0
130
+
131
+ - name: Deploy
132
+ if: contains(inputs.SCHEMAS, matrix.customer.schema)
133
+ run: echo "Deploying ${{ matrix.customer.schema }}"
134
+
135
+ prevention:
136
+ - "When you need per-combination filtering, pre-filter the matrix JSON in the generation step rather than relying on job-level `if:` with matrix context."
137
+ - "Use step-level `if:` conditions (not job-level) when you must reference `matrix.*` context for filtering — step-level conditions evaluate after matrix expansion."
138
+ - "Check GitHub docs for 'Context availability' to confirm which contexts are available at each level before authoring complex conditional logic."
139
+ docs:
140
+ - url: 'https://github.com/actions/runner/issues/1985'
141
+ label: 'actions/runner#1985 — Unrecognized named-value: matrix in job if conditional (64 reactions)'
142
+ - url: 'https://github.community/t/matrix-cannot-be-used-in-jobs-level-if/17177'
143
+ label: 'GitHub Community: matrix cannot be used in jobs level if'
144
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-a-matrix-for-your-jobs'
145
+ label: 'GitHub Docs: Using a matrix for your jobs'
146
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/contexts#context-availability'
147
+ label: 'GitHub Docs: Context availability table'
@@ -0,0 +1,146 @@
1
+ id: runner-environment-237
2
+ title: 'Container job `$HOME` is hardcoded to `/github/home` — Docker images built with `/root` home break silently'
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - container
7
+ - HOME
8
+ - docker
9
+ - self-hosted
10
+ - environment-variable
11
+ - tool-cache
12
+ patterns:
13
+ - regex: '\$HOME.*github/home'
14
+ flags: 'i'
15
+ - regex: 'Could not find.*\$HOME.*plugin|command not found.*HOME.*github'
16
+ flags: 'i'
17
+ - regex: 'no such file.*github/home|Permission denied.*github/home'
18
+ flags: 'i'
19
+ error_messages:
20
+ - "The reason `bluebase plugins` doesn't work is because it depends on `$HOME` pointing to `/root` but now GitHub Actions has changed it to `/github/home`."
21
+ - "Error: Could not find plugin at /github/home/.cache/@bluebase"
22
+ - "rust: command not found"
23
+ - "cargo: command not found"
24
+ - "go: command not found"
25
+ root_cause: |
26
+ When using `jobs.<name>.container`, the GitHub Actions runner mounts a host volume at
27
+ `/github/home` and unconditionally overrides the `HOME` environment variable to point there,
28
+ regardless of:
29
+ - The Docker image's configured user or home directory
30
+ - Any `env: HOME:` value set in the container spec
31
+ - The container image's `ENV HOME` instruction
32
+
33
+ This is hard-coded in ContainerOperationProvider.cs:
34
+ `-v "/runner/work/_temp/_github_home":"/github/home"`
35
+ `HOME=/github/home`
36
+
37
+ The volume mount overwrites whatever was at `/github/home` inside the container, and
38
+ the forced HOME env var points to that empty/host-controlled directory.
39
+
40
+ **Impact on pre-installed tools**: Any CLI tool or plugin manager that stores
41
+ state relative to `$HOME` during the Docker image build (e.g. `npm global`, Rust's
42
+ `cargo`, Go binaries in `~/go/bin`, Homebrew cellar, Python user installs at
43
+ `~/.local`) will no longer find its data because HOME now points to `/github/home`
44
+ instead of `/root` or the image user's home.
45
+
46
+ Source: actions/runner#863 (124 reactions, open since 2021)
47
+ fix: |
48
+ **Option 1 (recommended): Prefix affected commands with `HOME=/root`**
49
+
50
+ Temporarily reset HOME to the original image home for each step that relies on
51
+ pre-installed tools. Do NOT set HOME permanently in the workflow — the runner
52
+ depends on `/github/home` for internal state.
53
+
54
+ **Option 2: Rebuild the Docker image so tools do not depend on `$HOME`**
55
+
56
+ Setting `ENV HOME /github/home` in the Dockerfile and installing tools there does
57
+ NOT work: the runtime volume mount hides everything the image placed under
58
+ `/github/home`. Install to a fixed path such as `/usr/local` instead.
59
+
60
+ **Option 3: Copy tool state in a setup step**
61
+
62
+ Add an entrypoint or a workflow step that copies `/root/.config`, `/root/.cache`,
63
+ `/root/go`, etc. to `/github/home` before the steps that need them run.
64
+
65
+ **Option 4: Avoid container jobs for images with pre-installed tools**
66
+
67
+ Use a Docker action (`uses: docker://image`) instead of `jobs.<name>.container`
68
+ — Docker actions do not override HOME.
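+
+ A minimal sketch of Option 3, assuming the image keeps its tool state under `/root`
+ and the job container runs as root (the default), so `/root` is readable. The directory
+ list is illustrative:
+
+ ```yaml
+ steps:
+   - name: Copy image-time tool state into the runtime HOME
+     run: |
+       for dir in .config .cache go; do
+         if [ -e "/root/$dir" ]; then
+           cp -a "/root/$dir" "$HOME/"
+         fi
+       done
+
+   - name: Inspect the populated HOME
+     run: ls -la "$HOME"
+ ```
+
+ The copy runs on every job, so it suits small state directories better than whole
+ toolchains.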
69
+ fix_code:
70
+ - language: yaml
71
+ label: "Broken — pre-installed cargo/rust not found in container job"
72
+ code: |
73
+ jobs:
74
+ build:
75
+ runs-on: ubuntu-latest
76
+ container:
77
+ image: my-rust-tools:latest # Built with cargo installed at /root/.cargo
78
+ steps:
79
+ - uses: actions/checkout@v4
80
+ - name: Build
81
+ run: cargo build --release # ❌ FAILS: cargo not found — HOME=/github/home
82
+
83
+ - language: yaml
84
+ label: "Fixed — reset HOME per-step for pre-built tool invocations"
85
+ code: |
86
+ jobs:
87
+ build:
88
+ runs-on: ubuntu-latest
89
+ container:
90
+ image: my-rust-tools:latest
91
+ steps:
92
+ - uses: actions/checkout@v4
93
+ - name: Build
94
+ run: HOME=/root cargo build --release # ✅ HOME temporarily reset to where cargo is
95
+
96
+ # Or use env: at the step level:
97
+ - name: Run tests
98
+ env:
99
+ HOME: /root
100
+ run: cargo test
101
+
102
+ - language: yaml
103
+ label: "Fixed (image rebuild): install Rust outside $HOME so the runtime mount cannot hide it"
104
+ code: |
105
+ # Dockerfile: install tools outside $HOME. The runner mounts a host volume over
106
+ # /github/home at runtime, which hides anything the image installed there.
107
+ FROM ubuntu:22.04
108
+
109
+ # The base image ships without curl and CA certificates; rustup needs both.
110
+ RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
111
+     && rm -rf /var/lib/apt/lists/*
112
+
113
+ # Pin rustup and cargo to fixed locations that do not depend on HOME.
114
+ ENV RUSTUP_HOME=/usr/local/rustup \
115
+     CARGO_HOME=/usr/local/cargo \
116
+     PATH=/usr/local/cargo/bin:$PATH
117
+
118
+ # rustup honors RUSTUP_HOME and CARGO_HOME, so the toolchain and the cargo
119
+ # binaries land under /usr/local and survive the runtime HOME override.
120
+ RUN curl https://sh.rustup.rs -sSf | sh -s -- -y --default-toolchain stable
121
+
122
+ - language: yaml
123
+ label: "Alternative — use Docker action instead of container job (HOME not overridden)"
124
+ code: |
125
+ jobs:
126
+ build:
127
+ runs-on: ubuntu-latest
128
+ steps:
129
+ - uses: actions/checkout@v4
130
+ - name: Build in custom container
131
+ uses: docker://my-rust-tools:latest # ✅ HOME not overridden by runner
132
+ with:
133
+ args: cargo build --release
134
+
135
+ prevention:
136
+ - "When building Docker images for use in `jobs.<name>.container`, install binaries to `/usr/local/bin` or another PATH location not dependent on `$HOME`."
137
+ - "Test container images locally by running `docker run -e HOME=/github/home <image> <command>` to reproduce the GHA HOME override before using them in workflows."
138
+ - "Prefer Docker actions (`uses: docker://image`) over container jobs if your image relies on a specific HOME directory — Docker actions do not override HOME."
139
+ - "Read actions/runner#863 before publishing a Docker image intended for use as a GHA container job."
140
+ docs:
141
+ - url: 'https://github.com/actions/runner/issues/863'
142
+ label: 'actions/runner#863 — HOME is overridden for containers (124 reactions, open since 2021)'
143
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/running-jobs-in-a-container'
144
+ label: 'GitHub Docs: Running jobs in a container'
145
+ - url: 'https://stackoverflow.com/questions/58516181/missing-installed-dependencies-when-docker-image-is-used'
146
+ label: 'Stack Overflow: Missing installed dependencies when Docker image is used (Score: 3)'
@@ -0,0 +1,151 @@
1
+ id: runner-environment-238
2
+ title: 'Self-hosted runner with Docker container steps creates root-owned files in workspace — `git clean` fails on next run'
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - self-hosted
7
+ - docker
8
+ - container
9
+ - workspace
10
+ - permissions
11
+ - checkout
12
+ - root-owned
13
+ patterns:
14
+ - regex: 'warning: could not open directory.*Permission denied'
15
+ flags: 'i'
16
+ - regex: 'failed to remove.*Directory not empty|rm.*cannot remove.*Permission denied'
17
+ flags: 'i'
18
+ - regex: 'Unable to clean or reset the repository.*recreated'
19
+ flags: 'i'
20
+ error_messages:
21
+ - "warning: could not open directory 'foo/': Permission denied"
22
+ - "warning: failed to remove foo/: Directory not empty"
23
+ - "##[warning]Unable to clean or reset the repository. The repository will be recreated instead."
24
+ - "##[error]Command failed: rm -rf \"/home/runner/_work/repo/repo/foo\""
25
+ - "rm: cannot remove '/home/runner/_work/repo/repo/foo': Permission denied"
26
+ - "error: failed to run 'git clean -ffdx': exit code 1"
27
+ root_cause: |
28
+ When a GitHub Actions workflow on a **self-hosted runner** uses a Docker-based step
29
+ (either `container:` jobs or `uses: docker://` actions), files written to the
30
+ workspace during that step are owned by `root:root` — because most Docker containers
31
+ run as root by default.
32
+
33
+ The self-hosted runner process runs as a non-root user (e.g., `github-runner`, `ubuntu`).
34
+ On the **next workflow run**, `actions/checkout` attempts to clean the workspace by
35
+ running `git clean -ffdx`. Git's clean-up calls `stat()` on root-owned directories.
36
+ Because stat succeeds (files are visible) but `rm` is blocked (not root), `git clean`
37
+ reports the directory but cannot remove it. Git then falls back to `rm -rf` via the
38
+ runner, which also fails with `Permission denied`.
39
+
40
+ The checkout action logs a warning "Unable to clean or reset the repository. The
41
+ repository will be recreated instead." — then `rm -rf` of the entire workspace also
42
+ fails because root-owned directories are nested inside. The job fails permanently
43
+ until a human manually removes the root-owned files on the runner host.
44
+
45
+ **Why GitHub-hosted runners don't hit this**: GitHub-hosted runners are ephemeral —
46
+ the workspace is destroyed after every job, so there's no cross-run persistence of
47
+ root-owned files. Self-hosted runners persist the workspace by default.
48
+
49
+ Source: actions/runner#434 (131 reactions, open since 2020)
50
+ fix: |
51
+ **Option 1 (recommended): Run the container as the same UID as the host runner user**
52
+
53
+ Pass `--user <uid>:<gid>` (the runner user's numeric IDs) in the container options so
54
+ files are written with the runner's UID/GID. `$(id -u)` only expands in `run:` steps.
55
+
56
+ **Option 2: Add a cleanup step before checkout**
57
+
58
+ Add a step at the start of the workflow that removes workspace contents using sudo.
59
+ Requires configuring passwordless sudo for the runner user.
60
+
61
+ **Option 3: Configure the container to run as root but add a post-step chown**
62
+
63
+ After container steps, add a step that runs `sudo chown -R $USER:$USER .` to
64
+ reclaim ownership. Still requires sudo access.
65
+
66
+ **Option 4: Use ephemeral (one-shot) self-hosted runners**
67
+
68
+ Ephemeral runners create a fresh workspace per job. No cross-run file persistence
69
+ means root-owned files from Docker steps never accumulate.
70
+
71
+ **Option 5: Set `clean: false` on checkout and handle cleanup yourself**
72
+
73
+ Disable automatic workspace cleaning in checkout and add your own robust cleanup
74
+ step that can handle permission errors.
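One possible shape for the custom cleanup step mentioned in Option 5 is sketched below. It is a hedged example, not the behavior of `actions/checkout`: the function name `clean_workspace` is hypothetical, and it assumes a `find` that supports `-mindepth` and `-delete` (GNU and BSD both do).

```shell
#!/usr/bin/env bash
# Hypothetical helper: empty a workspace directory, tolerate entries that
# cannot be deleted, and report how many were left behind.
clean_workspace() {
  local dir=$1 remaining
  # -delete processes children before their parent directories.
  find "$dir" -mindepth 1 -delete 2>/dev/null || true
  remaining=$(find "$dir" -mindepth 1 2>/dev/null | wc -l | tr -d ' ')
  if [ "$remaining" -gt 0 ]; then
    echo "$remaining entries could not be removed from $dir" >&2
    return 1
  fi
}

# Usage inside a workflow step (assumes GITHUB_WORKSPACE is set):
# clean_workspace "$GITHUB_WORKSPACE" || echo "manual cleanup needed"
```

Returning a non-zero status instead of aborting lets the calling step decide whether leftover root-owned entries should fail the job or only produce a warning.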
75
+ fix_code:
76
+ - language: yaml
77
+ label: "Fixed — run Docker container as host runner's UID"
78
+ code: |
79
+ jobs:
80
+ build:
81
+ runs-on: [self-hosted, linux]
82
+ container:
83
+ image: node:20
84
+ options: '--user 1001:1001' # ✅ Match runner UID — no root-owned files
85
+
86
+ steps:
87
+ - uses: actions/checkout@v4
88
+ - run: npm ci && npm run build
89
+
90
+ - language: yaml
91
+ label: "Fixed — run container as current user dynamically"
92
+ code: |
93
+ jobs:
94
+ build:
95
+ runs-on: [self-hosted, linux]
96
+ steps:
97
+ - uses: actions/checkout@v4
98
+
99
+ - name: Build in container (current user)
100
+ run: |
101
+ docker run --rm \
102
+ --user "$(id -u):$(id -g)" \
103
+ -v "${{ github.workspace }}:/work" \
104
+ -w /work \
105
+ node:20 \
106
+ sh -c "npm ci && npm run build" # sh -c keeps both commands inside the container
107
+ # ✅ Files written inside container owned by host runner user
108
+
109
+ - language: yaml
110
+ label: "Fixed — cleanup step before checkout to handle pre-existing root-owned files"
111
+ code: |
112
+ jobs:
113
+ build:
114
+ runs-on: [self-hosted, linux]
115
+ steps:
116
+ # ✅ Cleanup root-owned files before checkout
117
+ - name: Cleanup workspace
118
+ run: |
119
+ if [ -d "${{ github.workspace }}" ]; then
120
+ sudo chown -R "$USER:$USER" "${{ github.workspace }}" || true
121
+ sudo rm -rf "${{ github.workspace }}"
122
+ fi
123
+ # Requires: echo "runner ALL=(ALL) NOPASSWD: /bin/chown, /bin/rm" | sudo tee /etc/sudoers.d/runner
124
+
125
+ - uses: actions/checkout@v4
126
+
127
+ - language: yaml
128
+ label: "Fixed — register self-hosted runner as ephemeral (--ephemeral flag)"
129
+ code: |
130
+ # Register the runner with --ephemeral so each job gets a fresh workspace:
131
+ # ./config.sh --url https://github.com/org/repo --token TOKEN --ephemeral
132
+ #
133
+ # Or for ARC (Actions Runner Controller):
134
+ # spec:
135
+ # ephemeral: true # Each job pod is destroyed after completion
136
+
137
+ prevention:
138
+ - "Always run Docker container steps with `--user $(id -u):$(id -g)` on self-hosted runners to match the host runner's UID/GID."
139
+ - "Use ephemeral self-hosted runners to eliminate cross-run workspace persistence entirely."
140
+ - "Add a workspace cleanup step at the start of workflows that use Docker container steps to proactively clear root-owned files."
141
+ - "After setting up a self-hosted runner, test with a workflow that uses a `container:` job and verify subsequent runs can clean up without errors."
142
+ - "When using ARC (Actions Runner Controller), enable `ephemeral: true` on the RunnerDeployment spec."
143
+ docs:
144
+ - url: 'https://github.com/actions/runner/issues/434'
145
+ label: 'actions/runner#434 — Self-hosted runner with Docker step creates root-owned files (131 reactions)'
146
+ - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners'
147
+ label: 'GitHub Docs: About self-hosted runners'
148
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/running-jobs-in-a-container'
149
+ label: 'GitHub Docs: Running jobs in a container'
150
+ - url: 'https://stackoverflow.com/questions/76407664/files-owned-by-rootroot-when-using-actions-checkout-on-self-hosted-runner'
151
+ label: 'Stack Overflow: Files owned by root:root when using actions/checkout on self-hosted runner'
@@ -0,0 +1,121 @@
1
+ id: runner-environment-239
2
+ title: 'setup-python fails on Ubuntu 26.04 container in self-hosted runners — Python version not in manifest'
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - setup-python
7
+ - ubuntu-26
8
+ - manifest
9
+ - self-hosted
10
+ - container
11
+ - python-version
12
+ patterns:
13
+ - regex: 'The version.*with architecture.*was not found for this operating system'
14
+ flags: 'i'
15
+ - regex: 'setup-python.*ubuntu.*26|ubuntu.*26.*setup-python'
16
+ flags: 'i'
17
+ - regex: 'version.*not found.*operating system.*ubuntu'
18
+ flags: 'i'
19
+ error_messages:
20
+ - "Error: The version '3.13' with architecture 'x64' was not found for this operating system."
21
+ - "Error: The version '3.12' with architecture 'x64' was not found for this operating system."
22
+ - "Error: The version '3.11' with architecture 'x64' was not found for this operating system."
23
+ root_cause: |
24
+ When a new Ubuntu LTS version (e.g. 26.04) is released, pre-built Python binaries for
25
+ that OS are NOT immediately added to the versions manifest at
26
+ `https://raw.githubusercontent.com/actions/python-versions/main/versions-manifest.json`.
27
+
28
+ `actions/setup-python` reads this manifest to locate and download pre-compiled Python
29
+ binaries matching the requested version and architecture. When the running OS identifier
30
+ is not present in the manifest, all version requests fail — even for Python versions that
31
+ exist for other Ubuntu releases (22.04, 24.04).
32
+
33
+ **Why GitHub-hosted runners succeed:** On GitHub-hosted `ubuntu-latest` runners, Python
34
+ versions are pre-installed in the toolcache. `setup-python` resolves from cache without
35
+ downloading, so the missing manifest entry doesn't matter.
36
+
37
+ **Why self-hosted runners fail:** Self-hosted runners typically don't have pre-installed
38
+ toolcaches. When `setup-python` tries to download the version for the container OS, the
39
+ manifest lookup fails because `ubuntu-26.04` is absent.
40
+
41
+ This is a recurring pattern. The same issue occurred when Ubuntu 24.04 launched
42
+ (see setup-python#854). Pre-built binaries are added only after the runner image reaches
43
+ general availability in `actions/runner-images`, which lags the OS release.
44
+
45
+ **Affected configurations:**
46
+ - `runs-on: [self-hosted]` + `container: ubuntu:latest` (resolves to Ubuntu 26.04)
47
+ - `runs-on: [self-hosted]` + `container: ubuntu:26.04`
48
+ - Any self-hosted runner setup where Python must be downloaded on demand for ubuntu 26.04
49
+
50
+ Source: actions/setup-python#1309 (May 5, 2026)
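Whether the manifest already covers a given Ubuntu release can be checked before adopting it. The sketch below is an illustration under assumptions: it expects a locally saved copy of the manifest, it assumes the per-file entries carry a `platform_version` field, and the function name `manifest_has_platform_version` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a saved copy of versions-manifest.json
# mentions the given platform_version (for example "24.04").
# Assumption: file entries in the manifest use a "platform_version" key.
manifest_has_platform_version() {
  local manifest=$1 version=$2
  # Strip whitespace so the match does not depend on JSON formatting.
  tr -d '[:space:]' < "$manifest" \
    | grep -F "\"platform_version\":\"$version\"" > /dev/null
}

# Usage:
# curl -fsSL https://raw.githubusercontent.com/actions/python-versions/main/versions-manifest.json -o manifest.json
# manifest_has_platform_version manifest.json 26.04 || echo "26.04 not in manifest yet"
```

A plain string match is used instead of a JSON parser so the check has no dependencies beyond standard tools; it only confirms that some entry exists for the release, not that a specific Python version is available for it.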
51
+ fix: |
52
+ **Option 1 (recommended, temporary): Pin container to Ubuntu 24.04**
53
+
54
+ Until Ubuntu 26.04 Python binaries are added to the versions manifest, use an explicit
55
+ Ubuntu 24.04 container. This is the fastest fix with no other workflow changes.
56
+
57
+ **Option 2: Use a GitHub-hosted runner for Python setup**
58
+
59
+ GitHub-hosted `ubuntu-latest` runners have Python pre-installed in the toolcache and do
60
+ not require a manifest download. If you need a self-hosted runner for other reasons, you
61
+ can split: run setup-python on a hosted runner and transfer artifacts to the self-hosted
62
+ runner.
63
+
64
+ **Option 3: Pre-install Python in your container image**
65
+
66
+ Build a custom container image with Python pre-installed (via `apt`, `deadsnakes`, or
67
+ `pyenv`) and skip `actions/setup-python` in the workflow, so no manifest lookup or download is needed.
68
+
69
+ **Option 4: Wait for manifest update**
70
+
71
+ Ubuntu 26.04 Python binaries will be added to the manifest after `ubuntu-26.04` reaches
72
+ GA status in `actions/runner-images`. Monitor setup-python#1309 for the fix notification.
73
+ fix_code:
74
+ - language: yaml
75
+ label: 'Workaround — pin container to Ubuntu 24.04 until Ubuntu 26.04 is in manifest'
76
+ code: |
77
+ jobs:
78
+ build:
79
+ runs-on: [self-hosted]
80
+ # Pin to ubuntu:24.04 until ubuntu:26.04 is in the versions manifest
81
+ # Unpin once actions/setup-python#1309 is resolved
82
+ container: ubuntu:24.04 # was: ubuntu:latest or ubuntu:26.04
83
+ steps:
84
+ - uses: actions/checkout@v4
85
+ - uses: actions/setup-python@v6
86
+ with:
87
+ python-version: '3.13'
88
+ - run: python --version
89
+ - language: yaml
90
+ label: 'Alternative — pre-install Python in a custom container image'
91
+ code: |
92
+ # Dockerfile — base image with Python already installed
93
+ FROM ubuntu:26.04
94
+ RUN apt-get update && apt-get install -y python3.13 python3.13-venv python3-pip && \
95
+ ln -s /usr/bin/python3.13 /usr/bin/python
96
+ # In workflow: container: your-registry/ubuntu-python:26.04
97
+ # Then skip actions/setup-python entirely — Python is pre-installed
98
+
99
+ # workflow.yml
100
+ jobs:
101
+ build:
102
+ runs-on: [self-hosted]
103
+ container: your-registry/ubuntu-python:26.04
104
+ steps:
105
+ - uses: actions/checkout@v4
106
+ # No setup-python needed — Python already in container
107
+ - run: python --version
108
+ prevention:
109
+ - "When using `ubuntu:latest` as a container image, be aware that it will track the latest Ubuntu release. On new LTS releases, pin to an explicit version (e.g. `ubuntu:24.04`) until `actions/setup-python` adds manifest support."
110
+ - "Subscribe to the relevant `actions/setup-python` issues when Ubuntu LTS versions are released — Python manifest support lags by several weeks after the GA runner image is available."
111
+ - "If your self-hosted runners MUST use the newest Ubuntu release, pre-install Python in your base container image rather than relying on `setup-python` to download it."
112
+ - "Check the versions manifest directly to confirm OS support before adopting new Ubuntu versions in CI: https://raw.githubusercontent.com/actions/python-versions/main/versions-manifest.json"
113
+ docs:
114
+ - url: 'https://github.com/actions/setup-python/issues/1309'
115
+ label: 'setup-python#1309 — Failing to fetch version from manifest when using Ubuntu 26.04 container'
116
+ - url: 'https://github.com/actions/setup-python/issues/854'
117
+ label: 'setup-python#854 — Same issue when Ubuntu 24.04 launched (historical precedent)'
118
+ - url: 'https://github.com/actions/python-versions/blob/main/versions-manifest.json'
119
+ label: 'Python versions manifest — check OS support'
120
+ - url: 'https://docs.github.com/en/actions/using-containerized-services/about-service-containers'
121
+ label: 'GitHub Docs: About service containers'
@@ -0,0 +1,140 @@
1
+ id: runner-environment-240
2
+ title: 'ARC v0.14.1 listener pods enter infinite crash-loop after upgrade — scaleset v0.3.0 treats broker EOF as fatal'
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - arc
7
+ - actions-runner-controller
8
+ - kubernetes
9
+ - listener
10
+ - crash-loop
11
+ - scaleset
12
+ - ephemeral-runners
13
+ patterns:
14
+ - regex: 'Listener pod is terminated.*reason.*Error'
15
+ flags: 'i'
16
+ - regex: 'scaleset.*v0\.3\.[0-9].*EOF|EOF.*broker.*fatal|EOF.*scaleset.*crash'
17
+ flags: 'i'
18
+ - regex: 'AutoscalingListener.*recreate.*loop|listener.*restart.*loop'
19
+ flags: 'i'
20
+ error_messages:
21
+ - 'Listener pod is terminated {"reason": "Error", "message": ""}'
22
+ - 'level=FATAL source=github.com/actions/scaleset@v0.3.0/listener/listener.go msg="EOF from broker"'
23
+ - 'AutoscalingListener: listener terminated unexpectedly, recreating'
24
+ root_cause: |
25
+ ARC (Actions Runner Controller) v0.14.1 upgraded the internal `scaleset` library from
26
+ v0.2.0 to v0.3.0. The v0.3.0 library introduced a "Restore job acquisition flow" change
27
+ (scaleset PR #90) that treats **EOF responses from the GitHub Actions broker service**
28
+ (`broker.actions.githubusercontent.com`) as a **fatal error** instead of a transient
29
+ condition to retry.
30
+
31
+ **Before v0.3.0 (ARC v0.14.0):** EOF from the broker long-poll endpoint was treated as a
32
+ reconnect signal — the listener would gracefully retry the connection and continue.
33
+
34
+ **After v0.3.0 (ARC v0.14.1):** EOF causes the listener to exit with a non-zero status
35
+ code. The ARC controller detects the pod termination via `Listener pod is terminated`
36
+ with `reason: "Error"` and immediately recreates the listener pod. The new pod hits
37
+ another EOF and crashes again — producing an infinite restart loop.
38
+
39
+ **Note:** scaleset v0.4.0 does NOT fix this. It addresses a different (message ordering)
40
+ issue. The EOF-as-fatal regression is still present in v0.4.0.
41
+
42
+ **Impact:** All `AutoscalingRunnerSet` resources in the cluster enter crash-loop behavior.
43
+ Jobs queue but no runners are dispatched. Production ARC deployments can accumulate dozens
44
+ of error terminations within a 30-60 minute window.
45
+
46
+ **GitHub-hosted runners are unaffected** — this only impacts organizations using ARC
47
+ (self-hosted Kubernetes runners) via Helm chart `gha-runner-scale-set-controller`.
48
+
49
+ Source: actions/actions-runner-controller#4488 (May 6, 2026, 4 reactions)
50
+ fix: |
51
+ **Option 1 (recommended): Downgrade to ARC v0.14.0**
52
+
53
+ Roll back the `gha-runner-scale-set-controller` Helm chart to the previous stable version:
54
+
55
+ ```
56
+ helm upgrade gha-runner-scale-set-controller \
57
+ --namespace arc-systems \
58
+ actions-runner-controller/gha-runner-scale-set-controller \
59
+ --version 0.14.0
60
+ ```
61
+
62
+ **Option 2: Wait for ARC v0.14.2+ with the EOF fix**
63
+
64
+ Monitor the ARC changelog and release notes at:
65
+ https://github.com/actions/actions-runner-controller/blob/master/docs/gha-runner-scale-set-controller/README.md#changelog
66
+
67
+ **Option 3 (partial mitigation): Reduce listener restart pressure**
68
+
69
+ Set `minRunners: 0` on affected `AutoscalingRunnerSet` resources to reduce the frequency
70
+ of broker long-poll connections that can EOF. This reduces symptom frequency but does
71
+ not eliminate the crash-loop.
72
+
73
+ **How to confirm the root cause:**
74
+ Check controller logs in the `arc-systems` namespace:
75
+ ```
76
+ kubectl logs -n arc-systems -l app=gha-runner-scale-set-controller --since=1h | grep "terminated"
77
+ ```
78
+ If you see repeated `Listener pod is terminated {"reason": "Error"}` entries cycling
79
+ every 1-5 minutes, this is the scaleset v0.3.0 EOF regression.
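The confirmation step above can be turned into a quick count. This is a minimal sketch, assuming the controller log contains the `Listener pod is terminated` message quoted in this entry; the function name `count_listener_terminations` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "Listener pod is terminated" lines on stdin.
count_listener_terminations() {
  # grep -c prints 0 and exits 1 when nothing matches; keep the status at 0.
  grep -c 'Listener pod is terminated' || true
}

# Usage (namespace and deployment name as used elsewhere in this entry):
# kubectl logs -n arc-systems deployment/gha-runner-scale-set-controller --since=1h \
#   | count_listener_terminations
```

A count that keeps growing between successive checks is consistent with the restart loop described here; a single termination on its own is not.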
80
+ fix_code:
81
+ - language: yaml
82
+ label: 'Identify crash-loop in controller logs'
83
+ code: |
84
+ # Check controller logs for the crash-loop pattern
85
+ # kubectl logs -n arc-systems deployment/gha-runner-scale-set-controller
86
+ # Look for repeated:
87
+ # Listener pod is terminated {"reason": "Error", "message": ""}
88
+ # AutoscalingListener: recreating listener pod
89
+
90
+ # Also check the listener pod itself before it's deleted:
91
+ # kubectl logs -n arc-systems <listener-pod-name>
92
+ # Look for scaleset@v0.3.0 in the source path:
93
+ # time=... source=github.com/actions/scaleset@v0.3.0/listener/listener.go:179
94
+ # msg="Getting next message"
95
+ - language: yaml
96
+ label: 'Downgrade ARC controller Helm chart to v0.14.0'
97
+ code: |
98
+ # Downgrade via Helm — replace the controller chart version
99
+ # helm upgrade gha-runner-scale-set-controller \
100
+ # --namespace arc-systems \
101
+ # actions-runner-controller/gha-runner-scale-set-controller \
102
+ # --version 0.14.0
103
+
104
+ # Or in your values/ArgoCD/Flux manifest, pin the chart version:
105
+ # Chart.yaml / helmrelease / kustomize reference:
106
+ apiVersion: helm.toolkit.fluxcd.io/v2
107
+ kind: HelmRelease
108
+ metadata:
109
+ name: gha-runner-scale-set-controller
110
+ namespace: arc-systems
111
+ spec:
112
+ chart:
113
+ spec:
114
+ chart: gha-runner-scale-set-controller
115
+ version: '0.14.0' # Pin until EOF crash-loop is fixed in 0.14.2+
116
+ sourceRef:
117
+ kind: HelmRepository
118
+ name: actions-runner-controller
119
+ - language: yaml
120
+ label: 'Partial mitigation — set minRunners to 0 to reduce EOF pressure'
121
+ code: |
122
+ # gha-runner-scale-set values.yaml
123
+ # Reducing minRunners to 0 means the listener doesn't stay
124
+ # permanently connected to the broker, reducing EOF frequency.
125
+ # NOT a full fix — crash-loop still occurs but less frequently.
126
+ githubConfigUrl: 'https://github.com/<org>'
127
+ minRunners: 0 # Lower any non-zero value to limit constant broker connections
128
+ maxRunners: 10
129
+ prevention:
130
+ - "Before upgrading ARC to a new minor version, review the changelog at https://github.com/actions/actions-runner-controller/blob/master/docs/gha-runner-scale-set-controller/README.md#changelog for breaking changes in the scaleset library."
131
+ - "Deploy ARC upgrades to a staging cluster first. Monitor listener pod restarts for 30-60 minutes before rolling out to production."
132
+ - "Set up alerting on `Listener pod is terminated` log patterns in your ARC controller namespace — frequent terminations signal a crash-loop before it causes widespread job queuing failures."
133
+ - "Subscribe to release notifications for `actions/actions-runner-controller` to receive timely notice of regression fixes."
134
+ docs:
135
+ - url: 'https://github.com/actions/actions-runner-controller/issues/4488'
136
+ label: 'ARC#4488 — Listener pods crash-loop after upgrading to ARC v0.14.1 (4 reactions)'
137
+ - url: 'https://github.com/actions/actions-runner-controller/blob/master/docs/gha-runner-scale-set-controller/README.md#changelog'
138
+ label: 'ARC Changelog — gha-runner-scale-set-controller release notes'
139
+ - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors'
140
+ label: 'GitHub Docs: Troubleshooting Actions Runner Controller errors'
@@ -0,0 +1,162 @@
1
+ id: silent-failures-119
2
+ title: '`github.ref` and `GITHUB_REF` intermittently empty on `release` event — workflows that tag-check silently fail or exit 1'
3
+ category: silent-failures
4
+ severity: silent-failure
5
+ tags:
6
+ - github.ref
7
+ - GITHUB_REF
8
+ - release
9
+ - release-event
10
+ - intermittent
11
+ - tag
12
+ - empty
13
+ patterns:
14
+ - regex: 'github\.ref.*release|GITHUB_REF.*release.*empty'
15
+ flags: 'i'
16
+ - regex: 'refs/tags.*empty.*release|release.*github_ref.*blank'
17
+ flags: 'i'
18
+ - regex: 'startsWith\(github\.ref.*refs/tags.*false.*release'
19
+ flags: 'i'
20
+ error_messages:
21
+ - "if [[ \"\" == refs/tags/release/* ]] ; then"
22
+ - "::error ::Could not determine the version number using ref "
23
+ - "GITHUB_REF: ''"
24
+ - "github.ref evaluates to empty string on release event"
25
+ root_cause: |
26
+ When a workflow is triggered by an `on: release` event (e.g. `published`, `created`,
27
+ `prereleased`), the `github.ref` context value and `GITHUB_REF` environment variable
28
+ are **intermittently empty**: sometimes populated correctly, sometimes an empty string.
29
+
30
+ The bug is non-deterministic: re-running the same workflow sometimes succeeds (ref
31
+ populated) and sometimes fails (ref empty). It is more frequent when:
32
+ - The release is created immediately after tagging
33
+ - Multiple releases fire in quick succession
34
+ - The workflow is called as a reusable workflow with `github.ref` forwarded
35
+
36
+ **Root cause**: A race condition or internal state propagation issue in the GitHub
37
+ Actions platform where the `release` event payload is dispatched before the ref
38
+ metadata is fully resolved in all runtime components. The `github.event.release`
39
+ object IS populated correctly in the same runs that have an empty `github.ref`.
40
+
41
+ **Impact**: Workflows that use `github.ref` to extract tag names (e.g.
42
+ `startsWith(github.ref, 'refs/tags/')`, or `echo $GITHUB_REF | cut -c11-`) will
43
+ silently get an empty string, causing:
44
+ - Conditional steps to be skipped (release-only deployments don't run)
45
+ - Version parsing to produce empty strings (publishing incorrect artifacts)
46
+ - Shell scripts to `exit 1` on empty-ref checks
47
+
48
+ This issue was introduced/regressed in August 2023 (runner ~v2.307) and remains
49
+ intermittently reproducible as of April/August 2025.
50
+
51
+ Source: actions/runner#2788 (65 reactions, open since 2023)
52
+ fix: |
53
+ **Always use `github.event.release.tag_name` for release tag extraction** instead of
54
+ parsing `github.ref`. The `github.event.release` object is consistently populated
55
+ even when `github.ref` is empty.
56
+
57
+ For step-level `if:` conditions that check whether a workflow was triggered by a
58
+ release event, use `github.event_name == 'release'` instead of
59
+ `startsWith(github.ref, 'refs/tags/')`.
60
+
61
+ If you need the full `refs/tags/v1.2.3` ref format, construct it from the event:
62
+ `refs/tags/${{ github.event.release.tag_name }}`
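The preference order described in this fix (event tag first, ref as a fallback) can be expressed as a small shell function. This is a sketch under assumptions: the function name `resolve_release_tag` is hypothetical, and both values are expected to be passed in as arguments, for example from step-level environment variables rather than inline `${{ }}` interpolation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the release tag, preferring the event payload's
# tag name and falling back to stripping "refs/tags/" from the ref.
resolve_release_tag() {
  local event_tag=$1 ref=$2
  if [ -n "$event_tag" ]; then
    printf '%s\n' "$event_tag"
    return 0
  fi
  case $ref in
    refs/tags/?*)
      printf '%s\n' "${ref#refs/tags/}"
      return 0
      ;;
  esac
  echo "::error ::Could not determine release tag from either source" >&2
  return 1
}

# Usage in a step whose env maps RELEASE_TAG to github.event.release.tag_name
# (RELEASE_TAG is an assumed variable name; GITHUB_REF is set by the runner):
# TAG=$(resolve_release_tag "$RELEASE_TAG" "$GITHUB_REF") || exit 1
```

Passing the tag through an environment variable also keeps the tag text out of the script body, which avoids the shell interpreting unusual characters in a tag name.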
63
+ fix_code:
64
+ - language: yaml
65
+ label: "Broken — version extraction via github.ref (intermittently empty)"
66
+ code: |
67
+ on:
68
+ release:
69
+ types: [published]
70
+
71
+ jobs:
72
+ publish:
73
+ runs-on: ubuntu-latest
74
+ steps:
75
+ - name: Extract version
76
+ run: |
77
+ # ❌ BROKEN: github.ref is intermittently empty on release events
78
+ VERSION="${{ github.ref }}"
79
+ if [[ "$VERSION" == refs/tags/v* ]]; then
80
+ TAG="${VERSION#refs/tags/}"
81
+ else
82
+ echo "::error ::Could not determine version from ref: $VERSION"
83
+ exit 1
84
+ fi
85
+
86
+ # Also fragile:
87
+ - if: startsWith(github.ref, 'refs/tags/') # ❌ sometimes false when ref is empty
88
+ run: echo "Publishing release"
89
+
90
+ - language: yaml
91
+ label: "Fixed — use github.event.release.tag_name (always populated)"
92
+ code: |
93
+ on:
94
+ release:
95
+ types: [published]
96
+
97
+ jobs:
98
+ publish:
99
+ runs-on: ubuntu-latest
100
+ steps:
101
+ - name: Extract version
102
+ run: |
103
+ # ✅ FIXED: event.release.tag_name is always populated correctly
104
+ TAG="${{ github.event.release.tag_name }}"
105
+ echo "Publishing version: $TAG"
106
+ echo "VERSION=$TAG" >> "$GITHUB_ENV"
107
+
108
+ # For if: conditions, use event_name check instead of ref check:
109
+ - if: github.event_name == 'release' # ✅ Always correct
110
+ run: echo "Publishing release ${{ github.event.release.tag_name }}"
111
+
112
+ # If you need the refs/tags/... format, construct it:
113
+ - name: Construct full ref
114
+ run: |
115
+ FULL_REF="refs/tags/${{ github.event.release.tag_name }}"
116
+ echo "Full ref: $FULL_REF"
117
+
118
+ - language: yaml
119
+ label: "Fixed — defensive workaround using both sources"
120
+ code: |
121
+ on:
122
+ release:
123
+ types: [published]
124
+
125
+ jobs:
126
+ publish:
127
+ runs-on: ubuntu-latest
128
+ steps:
129
+ - name: Determine tag with fallback
130
+ id: tag
131
+ run: |
132
+ # Primary: event object (always populated)
133
+ TAG="${{ github.event.release.tag_name }}"
134
+
135
+ # Fallback: parse from github.ref if event is empty (unlikely)
136
+ if [ -z "$TAG" ] && [ -n "${{ github.ref }}" ]; then
137
+ TAG="${{ github.ref }}"
138
+ TAG="${TAG#refs/tags/}"
139
+ fi
140
+
141
+ if [ -z "$TAG" ]; then
142
+ echo "::error ::Could not determine release tag from either source"
143
+ exit 1
144
+ fi
145
+ echo "tag=$TAG" >> "$GITHUB_OUTPUT"
146
+
147
+ - run: echo "Publishing ${{ steps.tag.outputs.tag }}"
148
+
149
+ prevention:
150
+ - "Never parse release tags from `github.ref` — always use `github.event.release.tag_name` in release-triggered workflows."
151
+ - "Use `github.event_name == 'release'` in `if:` conditions rather than `startsWith(github.ref, 'refs/tags/')` to avoid empty-ref false negatives."
152
+ - "Add explicit error messages when tag extraction returns empty string so intermittent failures are visible rather than silent."
153
+ - "If a release workflow fails unexpectedly, check whether `github.ref` was empty by adding `run: echo \"ref=${{ github.ref }}\"; echo \"tag=${{ github.event.release.tag_name }}\"`."
154
+ docs:
155
+ - url: 'https://github.com/actions/runner/issues/2788'
156
+ label: 'actions/runner#2788 — github.ref is empty for workflows triggered by release (65 reactions)'
157
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/events-that-trigger-workflows#release'
158
+ label: 'GitHub Docs: Events that trigger workflows — release'
159
+ - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/contexts#github-context'
160
+ label: 'GitHub Docs: github context'
161
+ - url: 'https://github.com/orgs/community/discussions/64528'
162
+ label: 'GitHub Community: github.ref empty on release event (discussion)'
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@htekdev/actions-debugger",
3
- "version": "1.0.127",
3
+ "version": "1.0.129",
4
4
  "description": "65+ real GitHub Actions errors, queryable by agents. CLI + MCP server + Copilot skills + error database.",
5
5
  "type": "module",
6
6
  "main": "./dist/index.js",