@htekdev/actions-debugger 1.0.114 → 1.0.116
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/errors/concurrency-timing/concurrency-timing-053.yml +83 -0
- package/errors/known-unsolved/known-unsolved-062.yml +87 -0
- package/errors/known-unsolved/runner-rest-api-busy-false-broker-state-desync.yml +102 -0
- package/errors/permissions-auth/upload-code-coverage-missing-code-quality-write-permission.yml +94 -0
- package/errors/runner-environment/arc-ephemeral-runner-oom-kill-session-conflict.yml +129 -0
- package/errors/runner-environment/macos-26-homebrew-python313-removed-stdlib-modules.yml +113 -0
- package/errors/runner-environment/macos-26-openssl3-legacy-cipher-p12-import-failure.yml +102 -0
- package/errors/runner-environment/runner-environment-199.yml +93 -0
- package/errors/runner-environment/runner-v2334-action-download-repeated-case-sensitivity.yml +100 -0
- package/errors/runner-environment/setup-python-macos-self-hosted-symlink-permission-denied.yml +94 -0
- package/errors/runner-environment/setup-python-windows-self-hosted-no-admin-install-fails.yml +101 -0
- package/errors/silent-failures/paths-filter-before-field-missing-workflow-run.yml +105 -0
- package/errors/silent-failures/windows-11-arm-bash-shell-intermittent-zero-output.yml +99 -0
- package/errors/triggers/triggers-069.yml +100 -0
- package/errors/yaml-syntax/continue-on-error-inputs-composite-action-unexpected-value.yml +110 -0
- package/package.json +1 -1
@@ -0,0 +1,83 @@
+id: concurrency-timing-053
+title: 'Concurrency pending slot overflow: third concurrent run silently cancels the already-queued second run when cancel-in-progress is false'
+category: concurrency-timing
+severity: silent-failure
+tags:
+  - concurrency
+  - cancel-in-progress
+  - pending
+  - queue
+  - silent-cancellation
+  - overflow
+patterns:
+  - regex: 'Canceling since a higher priority waiting run was found'
+    flags: 'i'
+  - regex: 'This run was automatically cancelled'
+    flags: 'i'
+error_messages:
+  - 'Canceling since a higher priority waiting run was found'
+  - 'This run was automatically cancelled'
+root_cause: |
+  GitHub Actions concurrency groups with cancel-in-progress: false allow at most one
+  run to be in-progress and one run to be pending (queued) simultaneously per group.
+  This queue depth is exactly 1, not unlimited.
+
+  When a third run arrives while run 1 is in-progress and run 2 is pending:
+  - Run 2 (the pending run) is immediately and silently cancelled
+  - Run 3 takes run 2's pending slot
+
+  The user who triggered run 2 typically sees it flip from "Queued" to "Cancelled"
+  with no notification and no failure alert. From their perspective their commit's CI
+  simply disappeared.
+
+  This behavior is documented in GitHub docs but surprises teams that expect a FIFO
+  queue of unlimited depth. The concurrency feature is a mutex with a single
+  waiting-room slot, not a job scheduler queue.
+
+  Common scenarios where this silently drops CI runs:
+  - Rapid-fire merges to a protected branch with slow integration tests
+  - Multiple developers pushing within seconds of each other to the same branch
+  - Automated commits (dependency updates, release bots) arriving while CI runs
+fix: |
+  Option 1 — Accept the overflow: intended behavior for fast-merge scenarios where only
+  the LATEST commit needs CI. No change needed.
+
+  Option 2 — Make the group key more specific so each commit gets its own group:
+    group: ${{ github.workflow }}-${{ github.sha }}
+  This disables cancellation entirely, but also serialization: runs for different commits can execute in parallel.
+
+  Option 3 — Use cancel-in-progress: true explicitly if "latest wins" is the desired
+  semantics. In-progress runs cancel rather than queued runs disappearing silently.
+
+  Option 4 — Queue at the runner level: remove the concurrency group and route jobs to a
+  self-hosted runner group with a fixed number of runners, so extra jobs wait for a free runner.
+fix_code:
+  - language: yaml
+    label: 'Common mistake: expecting cancel-in-progress: false to queue all pending runs indefinitely'
+    code: |
+      concurrency:
+        group: ${{ github.workflow }}-${{ github.ref }}
+        cancel-in-progress: false
+      # Only 1 run can be pending; a 3rd arriving run silently cancels the queued 2nd
+  - language: yaml
+    label: 'Option 2: per-commit group key — every run completes, no cancellation at all'
+    code: |
+      concurrency:
+        group: ${{ github.workflow }}-${{ github.sha }}
+        cancel-in-progress: false
+  - language: yaml
+    label: 'Option 3: cancel-in-progress: true — explicit latest-wins; the in-progress run is cancelled instead of a pending run vanishing'
+    code: |
+      concurrency:
+        group: ${{ github.workflow }}-${{ github.ref }}
+        cancel-in-progress: true
+prevention:
+  - 'Understand that cancel-in-progress: false does not create an unlimited queue — it allows exactly one pending run per concurrency group key'
+  - 'For deployment workflows where no commit should be skipped, use per-commit group keys (${{ github.sha }}) to guarantee every run completes (runs are then no longer serialized)'
+  - 'Monitor the Actions tab during rapid-push periods to verify queued runs are completing, not silently disappearing'
+  - 'Prefer cancel-in-progress: true when only the latest result matters; the cancellation is explicit and visible rather than silent'
+docs:
+  - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-concurrency'
+    label: 'GitHub Docs: Using concurrency'
+  - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-concurrency#example-only-cancel-in-progress-jobs-or-runs-for-the-current-workflow'
+    label: 'GitHub Docs: Concurrency — one pending slot per group'
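The single pending slot described in the `root_cause` of concurrency-timing-053 can be illustrated with a toy model. This is a minimal sketch in plain Python, not GitHub's scheduler: the class `ConcurrencyGroup` and its methods are hypothetical names that only mirror the behavior stated in the entry (one in-progress run, one pending run, the newest arrival replaces the pending one).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConcurrencyGroup:
    """Toy model of one concurrency group with cancel-in-progress: false."""
    running: Optional[str] = None          # the single in-progress run
    pending: Optional[str] = None          # the single pending (queued) run
    cancelled: List[str] = field(default_factory=list)

    def submit(self, run: str) -> None:
        """A new run arrives for this group."""
        if self.running is None:
            self.running = run             # nothing in progress: start immediately
            return
        if self.pending is not None:
            # The pending slot is taken: the older pending run is cancelled.
            self.cancelled.append(self.pending)
        self.pending = run                 # the newest run takes the pending slot

    def finish_running(self) -> None:
        """The in-progress run completes; the pending run (if any) starts."""
        self.running, self.pending = self.pending, None


group = ConcurrencyGroup()
for run in ("run-1", "run-2", "run-3"):
    group.submit(run)
```

After the three submissions, `group.cancelled` holds `run-2`, which is the run whose CI "simply disappeared" in the scenario above.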
@@ -0,0 +1,87 @@
+id: known-unsolved-062
+title: 'workflow_run chains are limited to three levels — workflows beyond the third workflow_run hop are never triggered'
+category: known-unsolved
+severity: limitation
+tags:
+  - workflow-run
+  - chaining
+  - pipeline
+  - limitation
+  - no-fix
+  - event-trigger
+patterns:
+  - regex: 'on:\s*\n\s+workflow_run:'
+    flags: 'i'
+error_messages: []
+root_cause: |
+  GitHub Actions prevents workflow_run events from chaining more than three levels
+  deep. A workflow at the third workflow_run level CANNOT itself serve as the upstream
+  of a further workflow_run trigger; the fourth-level workflow never runs.
+
+  From GitHub documentation: "You can't use workflow_run to chain together more than
+  three levels of workflows."
+
+  This restriction prevents infinite trigger loops but also prevents building long linear
+  CI/CD pipelines using workflow_run chaining alone. Multi-stage pipelines of the form
+  A (push) → B → C → D → E (B through E each triggered by workflow_run)
+  fail silently at the fourth hop: workflow E never appears in the
+  Actions tab and no error is raised anywhere.
+
+  There is no runtime error, no annotation, and no warning. The on: workflow_run
+  trigger on the downstream workflow is simply never evaluated.
+fix: |
+  Replace the fourth-hop workflow_run trigger with an explicit dispatch from the
+  last workflow that still runs (workflow D in the example above):
+
+  Option 1 — repository_dispatch: use the GitHub REST API from a job step in workflow
+  D to POST to /repos/{owner}/{repo}/dispatches with a custom event_type. Workflow E
+  listens on on: repository_dispatch with a matching types: filter.
+
+  Option 2 — workflow_dispatch: use gh workflow run from a step in workflow D to
+  directly trigger workflow E by filename. Requires a GitHub token with actions:write.
+
+  Option 3 — Consolidate: merge adjacent workflows into a single workflow
+  with job dependencies (needs:) eliminating the cross-workflow hops entirely.
+fix_code:
+  - language: yaml
+    label: 'Does NOT work: workflow_run cannot chain more than three levels deep'
+    code: |
+      # Workflow E — this trigger is never evaluated when Workflow D is
+      # already the third workflow_run level (A → B → C → D)
+      on:
+        workflow_run:
+          workflows: ["D - Integration Tests"]
+          types: [completed]
+  - language: yaml
+    label: 'Fix: dispatch workflow E via repository_dispatch from workflow D'
+    code: |
+      # In workflow D (last workflow_run level that still runs):
+      jobs:
+        dispatch-downstream:
+          runs-on: ubuntu-latest
+          if: ${{ github.event.workflow_run.conclusion == 'success' }}
+          steps:
+            - name: Trigger workflow E via repository_dispatch
+              run: |
+                gh api repos/${{ github.repository }}/dispatches \
+                  --method POST \
+                  --raw-field event_type=run-deploy \
+                  --raw-field 'client_payload[run_id]=${{ github.event.workflow_run.id }}'
+              env:
+                GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # needs contents: write for repository_dispatch
+
+      # In workflow E:
+      on:
+        repository_dispatch:
+          types: [run-deploy]
+prevention:
+  - 'Design CI/CD pipelines assuming workflow_run allows at most three chained levels; use repository_dispatch or workflow_dispatch for any deeper chaining'
+  - 'Prefer consolidating multi-stage pipelines into a single workflow with job dependencies (needs:) when the stages always execute in sequence'
+  - 'When a downstream workflow never appears in the Actions tab after merging, verify whether its on: trigger is workflow_run and how many workflow_run levels precede it'
+docs:
+  - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#workflow_run'
+    label: 'GitHub Docs: workflow_run event — three-level chaining limit'
+  - url: 'https://docs.github.com/en/rest/repos/repos#create-a-repository-dispatch-event'
+    label: 'GitHub REST API: Create a repository dispatch event'
+  - url: 'https://cli.github.com/manual/gh_workflow_run'
+    label: 'GitHub CLI: gh workflow run (alternative dispatch method)'
@@ -0,0 +1,102 @@
+id: known-unsolved-063
+title: 'REST API reports runner busy:false while broker shows runner actively executing a job'
+category: known-unsolved
+severity: silent-failure
+tags:
+  - self-hosted
+  - runner
+  - autoscaler
+  - rest-api
+  - broker
+  - state-desync
+  - v2-flow
+  - job-killed
+patterns:
+  - regex: 'busy.*false.*runner.*executing|runner.*busy.*false.*job'
+    flags: 'i'
+  - regex: '"busy"\s*:\s*false'
+    flags: 'i'
+error_messages:
+  - '"busy": false'
+  - 'GET /repos/{owner}/{repo}/actions/runners/{id} → {"busy": false}'
+root_cause: |
+  On non-ephemeral self-hosted runners using the V2 broker flow
+  (`broker.actions.githubusercontent.com`), a state desynchronization exists between
+  the broker service and the GitHub REST API:
+
+  - The broker correctly tracks runner state in real-time: after picking up Job B, the
+    runner reports `JobState: Busy` to the broker and renews its job lease every 60s.
+  - However, `GET /repos/{owner}/{repo}/actions/runners/{runner_id}` (the public REST
+    API) continues to return `"busy": false` during the early phase of job execution.
+    The REST API state may only update after the runner's next periodic sync, which
+    can lag 30–120 seconds behind the broker state.
+
+  Auto-scaling tools that rely on the REST API to identify idle runners (e.g.,
+  `github-aws-runners/terraform-aws-github-runner`, KEDA GitHub Actions scaler,
+  custom Lambda/CloudFunction scalers) interpret `busy: false` as "runner is idle and
+  safe to terminate." This causes the autoscaler to terminate an EC2/GCE/Azure instance
+  mid-job, which kills the runner process; the job is then marked as failed with only a
+  generic runner disconnection error.
+
+  From the affected job's perspective, the log ends mid-step with "The runner has
+  received a shutdown signal" or the job times out. There is no annotation indicating
+  the root cause was an autoscaler decision based on stale REST API data.
+
+  No GitHub-side fix is available as of June 2026. The REST API does not expose a
+  real-time busy status consistent with the broker. Open at actions/runner#4422.
+fix: |
+  There is no complete fix — this is a known state inconsistency in the GitHub platform.
+
+  Workarounds (choose one based on your autoscaling setup):
+
+  1. **Switch to ephemeral JIT runners (recommended)**: Use JIT tokens and terminate
+     runners after exactly one job. There is no window for autoscalers to misidentify
+     a running job as idle because the runner is registered and deregistered atomically.
+
+  2. **Add a grace period before termination**: When your autoscaler sees `busy: false`,
+     wait 2–3 minutes and re-poll before actually terminating. This covers the lag
+     between broker state and REST API state.
+
+  3. **Poll job status instead of runner status**: Use
+     `GET /repos/{owner}/{repo}/actions/runs?status=in_progress` to check for in-progress workflow runs
+     before terminating any runner, rather than relying on per-runner `busy` status.
+
+  4. **Use runner labels + job assignment**: If your autoscaler assigns specific runners
+     to specific jobs via labels, you can cross-reference queued/in-progress job
+     assignments against runner IDs before terminating.
+fix_code:
+  - language: yaml
+    label: 'Example: Switch to ephemeral JIT runners (removes the desync window entirely)'
+    code: |
+      # Use JIT runner registration in your autoscaler
+      # Each runner handles exactly one job — busy/idle desync cannot occur
+      # See: https://docs.github.com/en/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions#using-just-in-time-runners
+
+      # In your autoscaler provisioning logic:
+      # POST /repos/{owner}/{repo}/actions/runners/generate-jit-config
+      # → Use the returned jit_config to start an ephemeral runner
+      # → Runner auto-deregisters after job completes — no stale REST state possible
+  - language: yaml
+    label: 'Example: Grace period before termination (Terraform-style pseudocode)'
+    code: |
+      # In your autoscaler Lambda/script, before terminating an instance:
+      # 1. GET /repos/{owner}/{repo}/actions/runners/{runner_id}
+      # 2. If busy == false, wait 2 minutes
+      # 3. Re-poll: GET /repos/{owner}/{repo}/actions/runners/{runner_id}
+      # 4. Only terminate if STILL busy == false after the grace period
+
+      # This covers the broker→REST lag window (~30-120s observed in practice)
+prevention:
+  - "Prefer ephemeral JIT runners for any workload where mid-job termination would be costly; the broker-REST desync window is zero for single-job-per-runner setups."
+  - "Never terminate a runner instance based solely on a single REST API `busy: false` reading — always double-check with a grace period or secondary signal."
+  - "Monitor for jobs that end with 'runner has received a shutdown signal' — this is a reliable indicator that a runner was terminated externally mid-job."
+  - "If using terraform-aws-github-runner or similar, check whether the tool version has built-in grace periods for the busy-state lag."
+docs:
+  - url: 'https://github.com/actions/runner/issues/4422'
+    label: 'actions/runner#4422 — /runners REST API reports busy:false for active runner'
+  - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners'
+    label: 'REST API: Self-hosted runners'
+  - url: 'https://docs.github.com/en/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions#using-just-in-time-runners'
+    label: 'Just-in-time runners documentation'
+  - url: 'https://github.com/github-aws-runners/terraform-aws-github-runner'
+    label: 'terraform-aws-github-runner (commonly affected autoscaler)'
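The grace-period workaround in known-unsolved-063 (poll, wait, re-poll, terminate only if still idle) can be sketched as a small decision function. This is a hedged illustration, not code from any autoscaler: `safe_to_terminate` and its parameters are hypothetical, and `is_busy` stands in for whatever call your tooling makes to `GET /repos/{owner}/{repo}/actions/runners/{runner_id}`. The 120-second default is an assumption taken from the lag range quoted in the entry.

```python
import time
from typing import Callable


def safe_to_terminate(
    is_busy: Callable[[], bool],
    grace_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if two readings, grace_seconds apart, both report idle."""
    if is_busy():
        return False          # the first reading already says busy
    sleep(grace_seconds)      # wait out the possible broker-to-REST lag
    return not is_busy()      # terminate only if the runner is still idle


# Example with canned readings instead of real REST calls:
# the first poll is stale (idle), the second reflects the running job (busy).
readings = iter([False, True])
decision = safe_to_terminate(lambda: next(readings), sleep=lambda _seconds: None)
```

Injecting `sleep` keeps the function testable without waiting; in production the default `time.sleep` applies. A single stale `busy: false` reading no longer leads to termination, which is the point of the workaround.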
package/errors/permissions-auth/upload-code-coverage-missing-code-quality-write-permission.yml
ADDED
@@ -0,0 +1,94 @@
+id: permissions-auth-068
+title: 'upload-code-coverage action fails with 403 — missing code-quality:write permission'
+category: permissions-auth
+severity: error
+tags:
+  - permissions
+  - code-quality
+  - upload-code-coverage
+  - github-token
+  - 403
+  - fine-grained
+patterns:
+  - regex: 'Resource not accessible by integration'
+    flags: 'i'
+  - regex: 'Upload failed.*403|HTTP 403.*code.coverage|code.coverage.*403'
+    flags: 'i'
+  - regex: 'code-quality.*write|code_quality.*write'
+    flags: 'i'
+error_messages:
+  - '{"message":"Resource not accessible by integration","documentation_url":"https://docs.github.com/rest"}'
+  - 'Error: Upload failed: HTTP 403 Forbidden'
+  - 'HTTP Status: 403'
+root_cause: |
+  GitHub's code coverage upload API (introduced May 2026 as part of Code Quality for
+  pull requests) requires the new fine-grained permission `code-quality:write` on the
+  calling token. The default GITHUB_TOKEN in GitHub Actions workflows has this permission
+  set to `none` unless explicitly granted.
+
+  When `actions/upload-code-coverage` calls the coverage upload API without this
+  permission, GitHub returns HTTP 403 "Resource not accessible by integration". Because
+  `code-quality:write` is a newly introduced permission class (not present in older
+  workflow permission schemas), developers familiar with the standard permissions
+  (contents, issues, pull-requests, etc.) don't know to add it.
+
+  This affects all workflows that do not specify `permissions:` at all (the token then gets the
+  repository's default permissions, in which `code-quality` is still `none`), as well as
+  workflows that explicitly set `permissions: {}` or use a restrictive block.
+fix: |
+  Add `code-quality: write` to the `permissions` block of the job that runs
+  `actions/upload-code-coverage`. This grants the GITHUB_TOKEN the required scope to
+  call the code coverage upload API.
+
+  Note: `code-quality:` is a job-level permission. It cannot be set as a global
+  `GITHUB_TOKEN` permission through repository settings — it must be declared in the
+  workflow YAML per-job.
+fix_code:
+  - language: yaml
+    label: 'Add code-quality:write to the coverage upload job'
+    code: |
+      jobs:
+        test:
+          runs-on: ubuntu-latest
+          permissions:
+            contents: read
+            code-quality: write   # Required for upload-code-coverage
+          steps:
+            - uses: actions/checkout@v4
+
+            - name: Run tests and generate coverage
+              run: pytest --cov=src --cov-report=xml
+
+            - name: Upload code coverage
+              uses: actions/upload-code-coverage@v1
+              with:
+                file: coverage.xml
+                language: Python
+  - language: yaml
+    label: 'Minimal permissions block if no others are needed'
+    code: |
+      jobs:
+        coverage:
+          runs-on: ubuntu-latest
+          permissions:
+            code-quality: write
+          steps:
+            - uses: actions/upload-code-coverage@v1
+              with:
+                file: cobertura.xml
+                language: Java
+                label: code-coverage/jacoco
+prevention:
+  - "Whenever you add actions/upload-code-coverage to a workflow, immediately add `code-quality: write` to the job's permissions block."
+  - "Use a linter or policy-as-code tool (e.g., Poutine, StepSecurity) that validates required permissions against known action requirements."
+  - "If your org uses required permissions: {} at the workflow level for security hardening, remember that code-quality: write must still be declared per-job."
+  - "Check the GitHub Changelog periodically — new actions introduce new permission classes that aren't reflected in older documentation or IDE auto-complete."
+docs:
+  - url: 'https://github.blog/changelog/2026-05-26-code-coverage-in-pull-requests-is-now-in-public-preview/'
+    label: 'GitHub Changelog: Code coverage in pull requests (May 26, 2026)'
+  - url: 'https://github.com/actions/upload-code-coverage'
+    label: 'actions/upload-code-coverage repository'
+  - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/controlling-permissions-for-github_token'
+    label: 'Controlling permissions for GITHUB_TOKEN'
+  - url: 'https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository'
+    label: 'GitHub Actions permissions documentation'
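The prevention advice in permissions-auth-068 (validate that a workflow using the coverage action also grants the permission) could be approximated with a coarse text check. This is only a sketch under stated assumptions: the function name `missing_code_quality_write` is hypothetical, the check is purely textual (it does not parse YAML or match the permission to the specific job), and the action and permission names are taken from the entry as written.

```python
import re

# Any step that references the coverage upload action named in the entry.
USES_ACTION = re.compile(r"uses:\s*actions/upload-code-coverage@")
# An uncommented "code-quality: write" line anywhere in the workflow text.
GRANTS_WRITE = re.compile(r"^\s*code-quality:\s*write\b", re.MULTILINE)


def missing_code_quality_write(workflow_text: str) -> bool:
    """True if the action is used but no code-quality: write line is present."""
    uses_action = USES_ACTION.search(workflow_text) is not None
    grants_write = GRANTS_WRITE.search(workflow_text) is not None
    return uses_action and not grants_write


workflow_without_permission = """
jobs:
  coverage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/upload-code-coverage@v1
"""

workflow_with_permission = workflow_without_permission.replace(
    "    steps:",
    "    permissions:\n      code-quality: write\n    steps:",
)

flagged = missing_code_quality_write(workflow_without_permission)
ok = missing_code_quality_write(workflow_with_permission)
```

A real linter would parse the YAML and check each job separately; the text match is deliberately simple so the idea stays visible.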
@@ -0,0 +1,129 @@
+id: runner-environment-203
+title: 'ARC EphemeralRunner Stuck in Running State After OOM Kill — Scale Set Blocked'
+category: runner-environment
+severity: error
+tags:
+  - arc
+  - actions-runner-controller
+  - ephemeral-runner
+  - oomkill
+  - kubernetes
+  - scale-set
+  - session-conflict
+  - stuck
+  - self-hosted
+patterns:
+  - regex: 'RunnerScaleSetSessionConflictException'
+    flags: 'i'
+  - regex: 'TaskAgentSessionConflictException'
+    flags: 'i'
+  - regex: 'A session for this runner already exists'
+    flags: 'i'
+  - regex: 'Runner connect error: Error: Conflict\. Retrying until reconnected'
+    flags: 'i'
+error_messages:
+  - 'RunnerScaleSetSessionConflictException: there is already an active session'
+  - 'TaskAgentSessionConflictException: Error: Conflict'
+  - 'A session for this runner already exists.'
+  - '2026-XX-XX HH:MM:SSZ: Runner connect error: Error: Conflict. Retrying until reconnected.'
+root_cause: |
+  When an ARC (Actions Runner Controller) ephemeral runner pod is OOM-killed by the
+  Kubernetes kubelet (memory limit exceeded), the pod terminates abruptly without going
+  through the runner's graceful shutdown path. The runner never sends a "job completed" or
+  "runner offline" signal to the GitHub broker.
+
+  As a result:
+  1. The EphemeralRunner custom resource stays in phase `Running` indefinitely.
+  2. The ARC scale-set controller still counts the dead runner as "in use", so it does not
+     spin up a replacement pod to service the next queued job.
+  3. New jobs remain stuck at "Waiting for a runner to pick up this job..." until the
+     stale EphemeralRunner CR is manually deleted.
+
+  When ARC tries to restart the runner (either via a controller health check or manually),
+  the new runner pod connects to the GitHub broker using the same JIT token/session, and
+  the broker responds HTTP 409 Conflict because the old session is still registered:
+
+    TaskAgentSessionConflictException: Error: Conflict
+    A session for this runner already exists.
+    Runner connect error: Error: Conflict. Retrying until reconnected.
+
+  The runner retries every 30 seconds. After the broker's session lease expires (~2-3 min
+  in most cases), the conflict resolves and the runner connects — but the session timeout
+  window varies and can leave the scale set blocked for longer periods.
+
+  Versions affected:
+  - Reproducible across ARC v0.9.x - v0.12.x; partial mitigation added in ARC v0.12.0
+    (stale EphemeralRunner detection), but OOM kills on active runners can still bypass it.
+  - Frequently triggered by Vitest --coverage, Jest with large test suites, or any
+    memory-intensive build tool running without memory limits in the runner container.
+fix: |
+  Immediate recovery: delete the stuck EphemeralRunner CR to release the scale-set slot:
+
+    kubectl delete ephemeralrunner -n <namespace> <runner-name>
+
+  After deletion, ARC will spin up a new runner pod and pick up the queued job.
+
+  Root fix (prevent recurrence):
+  1. Set memory limits on runner containers that match actual job requirements with headroom.
+  2. Add job-level `timeout-minutes:` to ensure jobs terminate and release the runner
+     if they run too long.
+  3. Upgrade ARC to v0.12.0+ for improved stale EphemeralRunner detection.
+  4. Configure `terminationGracePeriodSeconds: 90` (or longer) on runner pods to give the
+     runner time to deregister on evictions and node drains (an OOM kill itself is an immediate SIGKILL).
+fix_code:
+  - language: yaml
+    label: 'Set memory limits and timeout on runner pods to prevent OOM kills'
+    code: |
+      # In your HelmRelease / values.yaml for actions-runner-controller
+      githubConfigUrl: "https://github.com/your-org/your-repo"
+      maxRunners: 4
+      minRunners: 0
+      template:
+        spec:
+          terminationGracePeriodSeconds: 90
+          containers:
+            - name: runner
+              image: ghcr.io/actions/actions-runner:latest
+              resources:
+                requests:
+                  memory: "2Gi"
+                  cpu: "500m"
+                limits:
+                  memory: "4Gi"   # Set appropriate limit; OOM kill occurs when exceeded
+                  cpu: "2"
+  - language: yaml
+    label: 'Add a job timeout to guarantee runner release even on hang'
+    code: |
+      jobs:
+        test:
+          runs-on: arc-runner-set
+          timeout-minutes: 30   # Runner is released after 30 min even if job hangs
+          steps:
+            - uses: actions/checkout@v4
+            - run: npm test -- --coverage
+  - language: yaml
+    label: 'Manual recovery: delete stuck EphemeralRunner CR'
+    code: |
+      # List stuck EphemeralRunners
+      kubectl get ephemeralrunner -n arc-systems
+
+      # Delete the stuck one (ARC will create a new pod automatically)
+      kubectl delete ephemeralrunner -n arc-systems <stuck-runner-name>
+
+      # Alternatively, delete every EphemeralRunner in phase Running (requires jq; this also removes healthy, busy runners)
+      kubectl get ephemeralrunner -n arc-systems -o json | jq -r '.items[] | select(.status.phase == "Running") | .metadata.name' | xargs kubectl delete ephemeralrunner -n arc-systems
+prevention:
+  - 'Always set memory limits on ARC runner containers; without limits, a single job can consume all node memory and OOM-kill other runners.'
+  - 'Set timeout-minutes: at the job level for all ARC-backed workflows to guarantee the runner is eventually released.'
+  - 'Upgrade ARC to v0.12.0+ for automatic stale EphemeralRunner cleanup.'
+  - 'Monitor EphemeralRunner phase distribution; a growing count of Running CRs with no corresponding pods is a leading indicator of this issue.'
+  - 'Add terminationGracePeriodSeconds: 90+ to runner pod templates so graceful shutdown signals have time to deregister the runner.'
+docs:
+  - url: 'https://github.com/actions/actions-runner-controller/issues/4155'
+    label: 'EphemeralRunner and its pods left stuck Running after runner OOMKILL (15 reactions)'
+  - url: 'https://github.com/actions/actions-runner-controller/issues/3922'
+    label: 'Scaleset controllers stuck with RunnerScaleSetSessionConflictException (12 reactions)'
+  - url: 'https://github.com/actions/runner/issues/4312'
+    label: 'Self-hosted runner gets stuck in active state, blocking queued jobs across multiple repositories'
+  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller'
+    label: 'Managing self-hosted runners with Actions Runner Controller'
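The monitoring hint in runner-environment-203 (Running CRs with no corresponding pods) can be turned into a small check over data you already have from `kubectl get ... -o json`. This is a sketch with assumptions stated plainly: `find_stuck_runners` is a hypothetical helper, the inputs are plain dictionaries rather than live API objects, and it assumes a runner pod carries the same name as its EphemeralRunner, which you should verify for your ARC version.

```python
from typing import Any, Iterable, List, Mapping


def find_stuck_runners(
    ephemeral_runners: Iterable[Mapping[str, Any]],
    live_pod_names: Iterable[str],
) -> List[str]:
    """Names of EphemeralRunners in phase Running that have no matching pod."""
    pods = set(live_pod_names)
    stuck = []
    for runner in ephemeral_runners:
        name = runner["metadata"]["name"]
        phase = runner.get("status", {}).get("phase")
        # Assumption: the pod name equals the EphemeralRunner name.
        if phase == "Running" and name not in pods:
            stuck.append(name)
    return stuck


# Canned data shaped like the "items" of a kubectl JSON listing.
runners = [
    {"metadata": {"name": "runner-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "runner-b"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "runner-c"}, "status": {"phase": "Pending"}},
]
stuck = find_stuck_runners(runners, live_pod_names=["runner-a"])
```

Only runners reported here are candidates for `kubectl delete ephemeralrunner`, which avoids deleting healthy runners that merely share the `Running` phase.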
@@ -0,0 +1,113 @@
|
|
|
1
|
+
id: runner-environment-201
|
|
2
|
+
title: 'macOS 26 Homebrew python@3 ships Python 3.13 — removed stdlib modules (cgi, imghdr, aifc, telnetlib) cause ModuleNotFoundError'
|
|
3
|
+
category: runner-environment
|
|
4
|
+
severity: error
|
|
5
|
+
tags:
|
|
6
|
+
- macos
|
|
7
|
+
- macos-26
|
|
8
|
+
- python
|
|
9
|
+
- homebrew
|
|
10
|
+
- stdlib
|
|
11
|
+
- runner-image
|
|
12
|
+
- breaking-change
|
|
13
|
+
patterns:
|
|
14
|
+
- regex: 'ModuleNotFoundError: No module named ''(cgi|imghdr|aifc|chunk|nntplib|telnetlib|uu|xdrlib|sndhdr|sunau|mailcap|msilib|pipes|crypt|spwd|ossaudiodev)'''
|
|
15
|
+
flags: 'i'
|
|
16
|
+
- regex: 'ImportError.*No module named.*cgi|No module named.*imghdr|No module named.*telnetlib'
|
|
17
|
+
flags: 'i'
|
|
18
|
+
- regex: 'python.*3\.13.*deprecated.*module|removed.*python.*3\.13'
|
|
19
|
+
flags: 'i'
|
|
20
|
+
error_messages:
|
|
21
|
+
- 'ModuleNotFoundError: No module named ''cgi'''
|
|
22
|
+
- 'ModuleNotFoundError: No module named ''imghdr'''
|
|
23
|
+
- 'ModuleNotFoundError: No module named ''aifc'''
|
|
24
|
+
- 'ModuleNotFoundError: No module named ''telnetlib'''
|
|
25
|
+
- 'ModuleNotFoundError: No module named ''chunk'''
|
|
26
|
+
- 'ModuleNotFoundError: No module named ''nntplib'''
|
|
27
|
+
root_cause: |
|
|
28
|
+
macOS 26 runner images ship Homebrew python@3 pointing to Python 3.13.x. Python
|
|
29
|
+
3.13 removed the following stdlib modules that were deprecated since Python 3.11:
|
|
30
|
+
|
|
31
|
+
cgi, cgitb, aifc, chunk, crypt, imghdr, mailcap, msilib (Windows only),
|
|
32
|
+
nntplib, ossaudiodev, pipes, sndhdr, spwd, sunau, telnetlib, uu, xdrlib
|
|
33
|
+
|
|
34
|
+
Workflows that call bare python3 (resolved to Homebrew Python 3.13 on macOS 26)
|
|
35
|
+
and import any of these modules fail with ModuleNotFoundError at runtime.
|
|
36
|
+
|
|
37
|
+
This affects:
|
|
38
|
+
- Scripts using cgi or cgitb for HTTP form parsing
|
|
39
|
+
- Image-type detection using imghdr (commonly used with Pillow-based workflows)
|
|
40
|
+
- Legacy NNTP and Telnet clients using nntplib or telnetlib
|
|
41
|
+
- Audio file handling using aifc, sunau, or sndhdr
|
|
42
|
+
|
|
43
|
+
Workflows that previously ran on macos-14 or macos-15 (Homebrew Python 3.11/3.12)
|
|
44
|
+
are affected when the job label is macos-26 or when macos-latest migrates to
|
|
45
|
+
macOS 26. The failure is not immediately obvious because the error occurs at
|
|
46
|
+
import time inside Python, not at the runner level, and the runner step exits
|
|
47
|
+
with a non-zero code that may be mistaken for a test failure rather than an
|
|
48
|
+
environment regression.
|
|
49
|
+
|
|
50
|
+
Note: actions/setup-python@v5+ pinned to python-version 3.12 or earlier is unaffected.
|
|
51
|
+
The modules are absent from every Python 3.13 build; on macos-26 the bare system/Homebrew python3 binary is how jobs pick up 3.13 unintentionally.
|
|
52
|
+
fix: |
|
|
53
|
+
Option 1 (recommended) — Pin Python with actions/setup-python:
|
|
54
|
+
Always use actions/setup-python with an explicit version to get the exact
|
|
55
|
+
Python version your code requires. This bypasses the Homebrew python@3 symlink.
|
|
56
|
+
|
|
57
|
+
Option 2 — Replace removed modules with modern equivalents:
|
|
58
|
+
- cgi → urllib.parse + email.parser (or the 'legacy-cgi' fork on PyPI)
|
|
59
|
+
- imghdr → use filetype, puremagic, or python-magic for image detection,
|
|
60
|
+
or install the 'standard-imghdr' redistribution from PyPI
|
|
61
|
+
- telnetlib → use telnetlib3 (PyPI) or asyncio-based Telnet
|
|
62
|
+
- aifc/sunau → use soundfile for AIFF/AU audio I/O (the stdlib wave module covers WAV only)
|
|
63
|
+
|
|
64
|
+
Option 3 — Pin Homebrew Python to 3.12 on macos-26 (temporary):
|
|
65
|
+
brew install python@3.12
|
|
66
|
+
brew link python@3.12 --force
|
|
67
|
+
echo "$(brew --prefix python@3.12)/libexec/bin" >> "$GITHUB_PATH"
|
|
68
|
+
fix_code:
|
|
69
|
+
- language: yaml
|
|
70
|
+
label: 'Fix: pin Python version with actions/setup-python to avoid Homebrew python@3'
|
|
71
|
+
code: |
|
|
72
|
+
- uses: actions/setup-python@v5
|
|
73
|
+
with:
|
|
74
|
+
python-version: '3.12' # pins to 3.12; immune to Homebrew python@3 upgrade
|
|
75
|
+
|
|
76
|
+
- name: Install dependencies
|
|
77
|
+
run: pip install -r requirements.txt
|
|
78
|
+
|
|
79
|
+
- name: Run script
|
|
80
|
+
run: python script.py # uses setup-python's 3.12, not Homebrew 3.13
|
|
81
|
+
- language: yaml
|
|
82
|
+
label: 'Fix: install removed modules from PyPI backports'
|
|
83
|
+
code: |
|
|
84
|
+
- uses: actions/setup-python@v5
|
|
85
|
+
with:
|
|
86
|
+
python-version: '3.13'
|
|
87
|
+
- name: Install backported removed modules
|
|
88
|
+
run: |
|
|
89
|
+
pip install standard-imghdr # PyPI redistribution of imghdr for Python 3.13+
|
|
90
|
+
# pip install telnetlib3 # if using Telnet
|
|
91
|
+
- name: Run script
|
|
92
|
+
run: python script.py
|
|
93
|
+
- language: yaml
|
|
94
|
+
label: 'Temporary: install and use python@3.12 from Homebrew on macos-26'
|
|
95
|
+
code: |
|
|
96
|
+
- name: Pin Homebrew Python to 3.12
|
|
97
|
+
run: |
|
|
98
|
+
brew install python@3.12
|
|
99
|
+
echo "$(brew --prefix python@3.12)/libexec/bin" >> "$GITHUB_PATH" # prefix differs between Apple Silicon and Intel runners
|
|
100
|
+
- name: Verify Python version
|
|
101
|
+
run: python3 --version # should print Python 3.12.x
|
|
102
|
+
prevention:
|
|
103
|
+
- 'Always use actions/setup-python with an explicit version — never rely on bare python3 pointing to Homebrew python@3'
|
|
104
|
+
- 'Audit scripts for imports of modules removed in Python 3.13: cgi, imghdr, aifc, telnetlib, nntplib, chunk, uu'
|
|
105
|
+
- 'Run scripts on Python 3.11 or 3.12 with -W error::DeprecationWarning before the macos-26 migration; the PEP 594 modules emit a DeprecationWarning on import in those versions'
|
|
106
|
+
- 'Add python --version to diagnostic steps to catch unexpected Python version changes early'
|
|
107
|
+
docs:
|
|
108
|
+
- url: 'https://docs.python.org/3/whatsnew/3.13.html#removed-modules'
|
|
109
|
+
label: 'Python 3.13: Removed modules (official docs)'
|
|
110
|
+
- url: 'https://peps.python.org/pep-0594/'
|
|
111
|
+
label: 'PEP 594: Removing dead batteries from the standard library'
|
|
112
|
+
- url: 'https://github.com/actions/setup-python'
|
|
113
|
+
label: 'actions/setup-python: Pin a specific Python version'
|
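The audit step listed under prevention could be sketched roughly as follows. This is a minimal illustration, not part of the package: the function name `find_removed_imports` and the constant `REMOVED_IN_313` are hypothetical, and the module set is copied from the root_cause list in this entry. It only detects static `import` statements, so dynamic imports (for example via `importlib`) would be missed.

```python
import ast

# Top-level modules removed in Python 3.13, as listed in this entry's root_cause.
# (Hypothetical constant name, for illustration only.)
REMOVED_IN_313 = frozenset({
    "aifc", "cgi", "cgitb", "chunk", "crypt", "imghdr", "mailcap", "msilib",
    "nntplib", "ossaudiodev", "pipes", "sndhdr", "spwd", "sunau",
    "telnetlib", "uu", "xdrlib",
})


def find_removed_imports(source: str) -> list[tuple[int, str]]:
    """Return sorted (line number, module) pairs for imports of removed modules."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Absolute "from X import Y" only; relative imports refer to local packages.
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in REMOVED_IN_313:
                hits.append((node.lineno, top_level))
    return sorted(hits)


sample = "import os\nimport cgi, json\nfrom telnetlib import Telnet\n"
print(find_removed_imports(sample))  # [(2, 'cgi'), (3, 'telnetlib')]
```

Because the check parses source text rather than importing it, it can run on any Python 3.9+ interpreter, including one that still ships the removed modules.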