@htekdev/actions-debugger 1.0.114 → 1.0.116

@@ -0,0 +1,83 @@
+ id: concurrency-timing-053
+ title: 'Concurrency pending slot overflow: third concurrent run silently cancels the already-queued second run when cancel-in-progress is false'
+ category: concurrency-timing
+ severity: silent-failure
+ tags:
+   - concurrency
+   - cancel-in-progress
+   - pending
+   - queue
+   - silent-cancellation
+   - overflow
+ patterns:
+   - regex: 'Canceling since a higher priority waiting run was found'
+     flags: 'i'
+   - regex: 'This run was automatically cancelled'
+     flags: 'i'
+ error_messages:
+   - 'Canceling since a higher priority waiting run was found'
+   - 'This run was automatically cancelled'
+ root_cause: |
+   GitHub Actions concurrency groups with cancel-in-progress: false allow at most one
+   run to be in-progress and one run to be pending (queued) simultaneously per group.
+   This queue depth is exactly 1, not unlimited.
+
+   When a third run arrives while run 1 is in-progress and run 2 is pending:
+   - Run 2 (the pending run) is immediately and silently cancelled
+   - Run 3 takes run 2's pending slot
+
+   The user who triggered run 2 typically sees it flip from "Queued" to "Cancelled"
+   with no notification and no failure alert. From their perspective their commit's CI
+   simply disappeared.
+
+   This behavior is documented in GitHub docs but surprises teams that expect a FIFO
+   queue of unlimited depth. The concurrency feature is a mutex with a single
+   waiting-room slot, not a job scheduler queue.
+
+   Common scenarios where runs are silently dropped:
+   - Rapid-fire merges to a protected branch with slow integration tests
+   - Multiple developers pushing within seconds of each other to the same branch
+   - Automated commits (dependency updates, release bots) arriving while CI runs
+ fix: |
+   Option 1: Accept the overflow. This is the intended behavior for fast-merge scenarios
+   where only the LATEST commit needs CI. No change needed.
+
+   Option 2: Give each commit its own concurrency group so that nothing is ever queued:
+     group: ${{ github.workflow }}-${{ github.sha }}
+   Every run completes, but runs are no longer serialized; they execute in parallel.
+
+   Option 3: Use cancel-in-progress: true explicitly if "latest wins" is the desired
+   semantics. Superseded runs are cancelled visibly instead of queued runs disappearing silently.
+
+   Option 4: Queue at the runner level instead. A self-hosted runner group with a fixed
+   number of runners makes extra jobs wait for a free runner rather than cancelling them.
+ fix_code:
+   - language: yaml
+     label: 'Common mistake: expecting cancel-in-progress: false to queue all pending runs indefinitely'
+     code: |
+       concurrency:
+         group: ${{ github.workflow }}-${{ github.ref }}
+         cancel-in-progress: false
+       # Only 1 run can be pending; a 3rd arriving run silently cancels the queued 2nd
+   - language: yaml
+     label: 'Option A: per-commit group key, so every run completes with no cancellation at all'
+     code: |
+       concurrency:
+         group: ${{ github.workflow }}-${{ github.sha }}
+         cancel-in-progress: false
+   - language: yaml
+     label: 'Option B: cancel-in-progress: true for explicit latest-wins; superseded runs are cancelled visibly'
+     code: |
+       concurrency:
+         group: ${{ github.workflow }}-${{ github.ref }}
+         cancel-in-progress: true
+ prevention:
+   - 'Understand that cancel-in-progress: false does not create an unlimited queue; it allows exactly one pending run per concurrency group key'
+   - 'For workflows where no commit should be skipped, use per-commit group keys (${{ github.sha }}) to guarantee every run completes; those runs execute in parallel, so this does not suit deployments that must be serialized'
+   - 'Monitor the Actions tab during rapid-push periods to verify queued runs are completing, not silently disappearing'
+   - 'Prefer cancel-in-progress: true when only the latest result matters; the cancellation is explicit and visible rather than silent'
+ docs:
+   - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-concurrency'
+     label: 'GitHub Docs: Using concurrency'
+   - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/using-concurrency#example-only-cancel-in-progress-jobs-or-runs-for-the-current-workflow'
+     label: 'GitHub Docs: Concurrency (one pending slot per group)'
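The single-pending-slot behavior described in this entry can be sketched as a toy model. This is only an illustration of the semantics stated in root_cause, not GitHub's implementation; the `ConcurrencyGroup` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConcurrencyGroup:
    """Toy model of one concurrency group with cancel-in-progress: false."""

    running: Optional[str] = None
    pending: Optional[str] = None
    cancelled: List[str] = field(default_factory=list)

    def submit(self, run: str) -> None:
        if self.running is None:
            # Nothing in progress: the run starts immediately.
            self.running = run
            return
        if self.pending is not None:
            # The single pending slot is taken: the older queued run is dropped.
            self.cancelled.append(self.pending)
        self.pending = run

    def finish_running(self) -> None:
        # The pending run (if any) is promoted when the in-progress run ends.
        self.running, self.pending = self.pending, None


group = ConcurrencyGroup()
for run in ["run-1", "run-2", "run-3"]:
    group.submit(run)

print(group.running, group.pending, group.cancelled)
```

With three back-to-back submissions, run-1 stays in progress, run-3 holds the pending slot, and run-2 ends up in the cancelled list, which is the overflow this entry warns about.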
@@ -0,0 +1,90 @@
+ id: known-unsolved-062
+ title: 'workflow_run chains are limited to three levels: a workflow deeper in the chain is never triggered via workflow_run'
+ category: known-unsolved
+ severity: limitation
+ tags:
+   - workflow-run
+   - chaining
+   - pipeline
+   - limitation
+   - no-fix
+   - event-trigger
+ patterns:
+   - regex: 'on:\s*\n\s+workflow_run:'
+     flags: 'i'
+ error_messages: []
+ root_cause: |
+   GitHub Actions does not let workflow_run chain together more than three levels of
+   workflows. A workflow_run trigger that sits deeper in the chain is never evaluated,
+   so the workflow that declares it does not start.
+
+   GitHub's documentation illustrates this with the chain A → B → C → D → E → F: after
+   the initial workflow A, workflows B, C, and D run, but E and F do not.
+
+   This restriction prevents runaway trigger chains but also caps linear CI/CD pipelines
+   built from workflow_run chaining alone. A multi-stage pipeline of the form
+     Build (push) → Test (workflow_run) → Deploy (workflow_run) → Notify (workflow_run)
+   is the deepest chain that still works. Any further workflow_run stage fails silently:
+   it never appears in the Actions tab and no error is raised anywhere.
+
+   There is no runtime error, no annotation, and no warning. The on: workflow_run
+   trigger on the downstream workflow is simply never evaluated.
+ fix: |
+   Replace the over-deep workflow_run trigger with an explicit dispatch from the last
+   workflow that still runs (workflow D below; workflow E is the one that never starts):
+
+   Option 1 (repository_dispatch): use the GitHub REST API from a job step in workflow
+   D to POST to /repos/{owner}/{repo}/dispatches with a custom event_type. Workflow E
+   listens on on: repository_dispatch with a matching types: filter. The token needs contents: write.
+
+   Option 2 (workflow_dispatch): use gh workflow run from a step in workflow D to
+   directly trigger workflow E by filename. Requires a GitHub token with actions: write.
+
+   Option 3 (consolidate): merge the later workflows into a single workflow
+   with job dependencies (needs:), eliminating the cross-workflow hops entirely.
+ fix_code:
+   - language: yaml
+     label: 'Does NOT work: workflow_run cannot chain more than three levels deep'
+     code: |
+       # Workflow E: this trigger is never evaluated when it would be the fourth
+       # workflow_run hop in the chain (A → B → C → D → E)
+       on:
+         workflow_run:
+           workflows: ["D - Notify"]
+           types: [completed]
+   - language: yaml
+     label: 'Fix: dispatch workflow E via repository_dispatch from workflow D'
+     code: |
+       # In workflow D (the last workflow that workflow_run still triggers):
+       jobs:
+         dispatch-downstream:
+           runs-on: ubuntu-latest
+           if: ${{ github.event.workflow_run.conclusion == 'success' }}
+           permissions:
+             contents: write  # needed to create a repository_dispatch event
+           steps:
+             - name: Trigger workflow E via repository_dispatch
+               # key[subkey] nests run_id inside the client_payload object
+               run: |
+                 gh api repos/${{ github.repository }}/dispatches \
+                   --method POST \
+                   --raw-field event_type=run-downstream \
+                   --raw-field "client_payload[run_id]=${{ github.event.workflow_run.id }}"
+               env:
+                 GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+       # In workflow E:
+       on:
+         repository_dispatch:
+           types: [run-downstream]
+ prevention:
+   - 'Design CI/CD pipelines assuming workflow_run chains stop after three levels; use repository_dispatch or workflow_dispatch for anything deeper'
+   - 'Prefer consolidating multi-stage pipelines into a single workflow with job dependencies (needs:) when the stages always execute in sequence'
+   - 'When a downstream workflow never appears in the Actions tab after merging, verify whether its on: trigger is workflow_run and how many workflow_run hops precede it in the chain'
+ docs:
+   - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#workflow_run'
+     label: 'GitHub Docs: workflow_run event (chaining depth restriction)'
+   - url: 'https://docs.github.com/en/rest/repos/repos#create-a-repository-dispatch-event'
+     label: 'GitHub REST API: Create a repository dispatch event'
+   - url: 'https://cli.github.com/manual/gh_workflow_run'
+     label: 'GitHub CLI: gh workflow run (alternative dispatch method)'
@@ -0,0 +1,102 @@
+ id: known-unsolved-063
+ title: 'REST API reports runner busy:false while broker shows runner actively executing a job'
+ category: known-unsolved
+ severity: silent-failure
+ tags:
+   - self-hosted
+   - runner
+   - autoscaler
+   - rest-api
+   - broker
+   - state-desync
+   - v2-flow
+   - job-killed
+ patterns:
+   - regex: 'busy.*false.*runner.*executing|runner.*busy.*false.*job'
+     flags: 'i'
+   - regex: '"busy"\s*:\s*false'
+     flags: 'i'
+ error_messages:
+   - '"busy": false'
+   - 'GET /repos/{owner}/{repo}/actions/runners/{id} → {"busy": false}'
+ root_cause: |
+   On non-ephemeral self-hosted runners using the V2 broker flow
+   (`broker.actions.githubusercontent.com`), a state desynchronization exists between
+   the broker service and the GitHub REST API:
+
+   - The broker correctly tracks runner state in real-time: after picking up a job, the
+     runner reports `JobState: Busy` to the broker and renews its job lease every 60s.
+   - However, `GET /repos/{owner}/{repo}/actions/runners/{runner_id}` (the public REST
+     API) continues to return `"busy": false` during the early phase of job execution.
+     The REST API state may only update after the runner's next periodic sync, which
+     can lag 30–120 seconds behind the broker state.
+
+   Auto-scaling tools that rely on the REST API to identify idle runners (e.g.,
+   `github-aws-runners/terraform-aws-github-runner`, KEDA GitHub Actions scaler,
+   custom Lambda/CloudFunction scalers) interpret `busy: false` as "runner is idle and
+   safe to terminate." This causes the autoscaler to terminate an EC2/GCE/Azure instance
+   mid-job, killing the runner process. The job is then marked as failed with a generic
+   runner disconnection error that says nothing about the autoscaler.
+
+   From the affected job's perspective, the log ends mid-step with "The runner has
+   received a shutdown signal" or the job times out. There is no annotation indicating
+   the root cause was an autoscaler decision based on stale REST API data.
+
+   No GitHub-side fix is available as of June 2026. The REST API does not expose a
+   real-time busy status consistent with the broker. Open at actions/runner#4422.
+ fix: |
+   There is no complete fix; this is a known state inconsistency in the GitHub platform.
+
+   Workarounds (choose one based on your autoscaling setup):
+
+   1. **Switch to ephemeral JIT runners (recommended)**: Use JIT tokens and terminate
+      runners after exactly one job. The autoscaler never has to judge whether a
+      registered runner is idle, because each runner deregisters once its single job ends.
+
+   2. **Add a grace period before termination**: When your autoscaler sees `busy: false`,
+      wait 2–3 minutes and re-poll before actually terminating. This covers the lag
+      between broker state and REST API state.
+
+   3. **Poll job status instead of runner status**: Use
+      `GET /repos/{owner}/{repo}/actions/runs?status=in_progress` to check for running
+      workflow runs before terminating any runner, rather than relying on per-runner `busy` status.
+
+   4. **Use runner labels + job assignment**: If your autoscaler assigns specific runners
+      to specific jobs via labels, you can cross-reference queued/in-progress job
+      assignments against runner IDs before terminating.
+ fix_code:
+   - language: yaml
+     label: 'Example: Switch to ephemeral JIT runners (removes the desync window entirely)'
+     code: |
+       # Use JIT runner registration in your autoscaler
+       # Each runner handles exactly one job, so busy/idle desync cannot occur
+       # See: https://docs.github.com/en/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions#using-just-in-time-runners
+
+       # In your autoscaler provisioning logic:
+       #   POST /repos/{owner}/{repo}/actions/runners/generate-jit-config
+       #   → Use the returned encoded_jit_config to start an ephemeral runner
+       #   → Runner auto-deregisters after the job completes, so no stale REST state is possible
+   - language: yaml
+     label: 'Example: Grace period before termination (autoscaler pseudocode)'
+     code: |
+       # In your autoscaler Lambda/script, before terminating an instance:
+       # 1. GET /repos/{owner}/{repo}/actions/runners/{runner_id}
+       # 2. If busy == false, wait 2 minutes
+       # 3. Re-poll: GET /repos/{owner}/{repo}/actions/runners/{runner_id}
+       # 4. Only terminate if STILL busy == false after the grace period
+
+       # This covers the broker→REST lag window (~30-120s observed in practice)
+ prevention:
+   - "Prefer ephemeral JIT runners for any workload where mid-job termination would be costly; the broker-REST desync window is zero for single-job-per-runner setups."
+   - "Never terminate a runner instance based solely on a single REST API `busy: false` reading; always double-check with a grace period or secondary signal."
+   - "Monitor for jobs that end with 'runner has received a shutdown signal'; this is a reliable indicator that a runner was terminated externally mid-job."
+   - "If using terraform-aws-github-runner or similar, check whether the tool version has built-in grace periods for the busy-state lag."
+ docs:
+   - url: 'https://github.com/actions/runner/issues/4422'
+     label: 'actions/runner#4422: /runners REST API reports busy:false for active runner'
+   - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners'
+     label: 'REST API: Self-hosted runners'
+   - url: 'https://docs.github.com/en/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions#using-just-in-time-runners'
+     label: 'Just-in-time runners documentation'
+   - url: 'https://github.com/github-aws-runners/terraform-aws-github-runner'
+     label: 'terraform-aws-github-runner (commonly affected autoscaler)'
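The grace-period workaround from this entry can be sketched in a few lines. This is a minimal sketch under stated assumptions: `safe_to_terminate` is a hypothetical helper, and `is_busy` stands in for whatever call your autoscaler makes to `GET /repos/{owner}/{repo}/actions/runners/{runner_id}` to read the `busy` field. The HTTP call itself is deliberately left out.

```python
import time
from typing import Callable


def safe_to_terminate(
    is_busy: Callable[[], bool],
    grace_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the runner reads idle before AND after the grace period."""
    if is_busy():
        return False
    # A single busy == false reading may be stale, so wait and re-poll.
    sleep(grace_seconds)
    return not is_busy()


# Simulated readings: the first poll is stale (idle), the second shows the job.
stale_then_busy = iter([False, True])
decision = safe_to_terminate(
    lambda: next(stale_then_busy),
    sleep=lambda _seconds: None,  # skip the real wait in this demo
)
print(decision)
```

Injecting `sleep` keeps the function testable without waiting; in a real autoscaler the default `time.sleep` (or a scheduled re-check) would apply, with `grace_seconds` chosen to exceed the lag window described above.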
@@ -0,0 +1,94 @@
+ id: permissions-auth-068
+ title: 'upload-code-coverage action fails with 403: missing code-quality:write permission'
+ category: permissions-auth
+ severity: error
+ tags:
+   - permissions
+   - code-quality
+   - upload-code-coverage
+   - github-token
+   - 403
+   - fine-grained
+ patterns:
+   - regex: 'Resource not accessible by integration'
+     flags: 'i'
+   - regex: 'Upload failed.*403|HTTP 403.*code.coverage|code.coverage.*403'
+     flags: 'i'
+   - regex: 'code-quality.*write|code_quality.*write'
+     flags: 'i'
+ error_messages:
+   - '{"message":"Resource not accessible by integration","documentation_url":"https://docs.github.com/rest"}'
+   - 'Error: Upload failed: HTTP 403 Forbidden'
+   - 'HTTP Status: 403'
+ root_cause: |
+   GitHub's code coverage upload API (introduced May 2026 as part of Code Quality for
+   pull requests) requires the new fine-grained permission `code-quality:write` on the
+   calling token. The default GITHUB_TOKEN in GitHub Actions workflows has this permission
+   set to `none` unless explicitly granted.
+
+   When `actions/upload-code-coverage` calls the coverage upload API without this
+   permission, GitHub returns HTTP 403 "Resource not accessible by integration". Because
+   `code-quality:write` is a newly introduced permission class (not present in older
+   workflow permission schemas), developers familiar with the standard permissions
+   (contents, issues, pull-requests, etc.) often do not know to add it.
+
+   This affects workflows that do not specify `permissions:` at all (the token then gets the
+   repository's default permissions, which leave `code-quality` at `none`), as well as
+   workflows that explicitly set `permissions: {}` or use a restrictive block.
+ fix: |
+   Add `code-quality: write` to the `permissions` block of the job that runs
+   `actions/upload-code-coverage`. This grants the GITHUB_TOKEN the required scope to
+   call the code coverage upload API.
+
+   Note: `code-quality:` is a job-level permission. It cannot be set as a global
+   `GITHUB_TOKEN` permission through repository settings; it must be declared in the
+   workflow YAML per-job.
+ fix_code:
+   - language: yaml
+     label: 'Add code-quality:write to the coverage upload job'
+     code: |
+       jobs:
+         test:
+           runs-on: ubuntu-latest
+           permissions:
+             contents: read
+             code-quality: write  # Required for upload-code-coverage
+           steps:
+             - uses: actions/checkout@v4
+
+             - name: Run tests and generate coverage
+               run: pytest --cov=src --cov-report=xml
+
+             - name: Upload code coverage
+               uses: actions/upload-code-coverage@v1
+               with:
+                 file: coverage.xml
+                 language: Python
+   - language: yaml
+     label: 'Minimal permissions block if no others are needed'
+     code: |
+       jobs:
+         coverage:
+           runs-on: ubuntu-latest
+           permissions:
+             code-quality: write
+           steps:
+             - uses: actions/upload-code-coverage@v1
+               with:
+                 file: cobertura.xml
+                 language: Java
+                 label: code-coverage/jacoco
+ prevention:
+   - "Whenever you add actions/upload-code-coverage to a workflow, immediately add `code-quality: write` to the job's permissions block."
+   - "Use a linter or policy-as-code tool (e.g., Poutine, StepSecurity) that validates required permissions against known action requirements."
+   - "If your org requires `permissions: {}` at the workflow level for security hardening, remember that `code-quality: write` must still be declared per-job."
+   - "Check the GitHub Changelog periodically; new actions introduce new permission classes that aren't reflected in older documentation or IDE auto-complete."
+ docs:
+   - url: 'https://github.blog/changelog/2026-05-26-code-coverage-in-pull-requests-is-now-in-public-preview/'
+     label: 'GitHub Changelog: Code coverage in pull requests (May 26, 2026)'
+   - url: 'https://github.com/actions/upload-code-coverage'
+     label: 'actions/upload-code-coverage repository'
+   - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/controlling-permissions-for-github_token'
+     label: 'Controlling permissions for GITHUB_TOKEN'
+   - url: 'https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository'
+     label: 'GitHub Actions permissions documentation'
@@ -0,0 +1,129 @@
+ id: runner-environment-203
+ title: 'ARC EphemeralRunner Stuck in Running State After OOM Kill: Scale Set Blocked'
+ category: runner-environment
+ severity: error
+ tags:
+   - arc
+   - actions-runner-controller
+   - ephemeral-runner
+   - oomkill
+   - kubernetes
+   - scale-set
+   - session-conflict
+   - stuck
+   - self-hosted
+ patterns:
+   - regex: 'RunnerScaleSetSessionConflictException'
+     flags: 'i'
+   - regex: 'TaskAgentSessionConflictException'
+     flags: 'i'
+   - regex: 'A session for this runner already exists'
+     flags: 'i'
+   - regex: 'Runner connect error: Error: Conflict\. Retrying until reconnected'
+     flags: 'i'
+ error_messages:
+   - 'RunnerScaleSetSessionConflictException: there is already an active session'
+   - 'TaskAgentSessionConflictException: Error: Conflict'
+   - 'A session for this runner already exists.'
+   - '2026-XX-XX HH:MM:SSZ: Runner connect error: Error: Conflict. Retrying until reconnected.'
+ root_cause: |
+   When an ARC (Actions Runner Controller) ephemeral runner pod is OOM-killed (the kernel
+   kills the container for exceeding its memory limit), the pod terminates abruptly without going
+   through the runner's graceful shutdown path. The runner never sends a "job completed" or
+   "runner offline" signal to the GitHub broker.
+
+   As a result:
+   1. The EphemeralRunner custom resource stays in phase `Running` indefinitely.
+   2. The ARC scale-set controller still counts the dead runner as "in use", so it does not
+      spin up a replacement pod to service the next queued job.
+   3. New jobs remain stuck at "Waiting for a runner to pick up this job..." until the
+      stale EphemeralRunner CR is manually deleted.
+
+   When ARC tries to restart the runner (either via a controller health check or manually),
+   the new runner pod connects to the GitHub broker using the same JIT token/session, and
+   the broker responds HTTP 409 Conflict because the old session is still registered:
+
+     TaskAgentSessionConflictException: Error: Conflict
+     A session for this runner already exists.
+     Runner connect error: Error: Conflict. Retrying until reconnected.
+
+   The runner retries every 30 seconds. After the broker's session lease expires (~2-3 min
+   in most cases), the conflict resolves and the runner connects, but the session timeout
+   window varies and can leave the scale set blocked for longer periods.
+
+   Versions affected:
+   - Reproducible across ARC v0.9.x - v0.12.x; partial mitigation added in ARC v0.12.0
+     (stale EphemeralRunner detection), but OOM kills on active runners can still bypass it.
+   - Frequently triggered by Vitest --coverage, Jest with large test suites, or any
+     memory-intensive build tool that outgrows the runner container's memory limit.
+ fix: |
+   Immediate recovery: delete the stuck EphemeralRunner CR to release the scale-set slot:
+
+     kubectl delete ephemeralrunner -n <namespace> <runner-name>
+
+   After deletion, ARC will spin up a new runner pod and pick up the queued job.
+
+   Root fix (prevent recurrence):
+   1. Set memory limits on runner containers that match actual job requirements with headroom.
+   2. Add job-level `timeout-minutes:` to ensure jobs terminate and release the runner
+      if they run too long.
+   3. Upgrade ARC to v0.12.0+ for improved stale EphemeralRunner detection.
+   4. Configure `terminationGracePeriodSeconds: 90` (or longer) on runner pods so that evictions
+      and node drains let the runner deregister first. An OOM kill is an immediate SIGKILL and skips this grace period.
+ fix_code:
+   - language: yaml
+     label: 'Set memory requests/limits and a termination grace period on runner pods'
+     code: |
+       # In your HelmRelease / values.yaml for the ARC gha-runner-scale-set chart
+       githubConfigUrl: "https://github.com/your-org/your-repo"
+       maxRunners: 4
+       minRunners: 0
+       template:
+         spec:
+           terminationGracePeriodSeconds: 90
+           containers:
+             - name: runner
+               image: ghcr.io/actions/actions-runner:latest
+               resources:
+                 requests:
+                   memory: "2Gi"
+                   cpu: "500m"
+                 limits:
+                   memory: "4Gi"  # Set appropriate limit; OOM kill occurs when exceeded
+                   cpu: "2"
+   - language: yaml
+     label: 'Add a job timeout to guarantee runner release even on hang'
+     code: |
+       jobs:
+         test:
+           runs-on: arc-runner-set
+           timeout-minutes: 30  # Runner is released after 30 min even if job hangs
+           steps:
+             - uses: actions/checkout@v4
+             - run: npm test -- --coverage
+   - language: yaml
+     label: 'Manual recovery: delete stuck EphemeralRunner CR'
+     code: |
+       # List stuck EphemeralRunners
+       kubectl get ephemeralrunner -n arc-systems
+
+       # Delete the stuck one (ARC will create a new pod automatically)
+       kubectl delete ephemeralrunner -n arc-systems <stuck-runner-name>
+
+       # Alternatively, print only the Running ones (field selectors generally cannot filter custom resources on status.phase); review before deleting
+       kubectl get ephemeralrunner -n arc-systems -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}'
+ prevention:
+   - 'Always set memory limits on ARC runner containers; without limits, a single job can consume all node memory and OOM-kill other runners.'
+   - 'Set timeout-minutes: at the job level for all ARC-backed workflows to guarantee the runner is eventually released.'
+   - 'Upgrade ARC to v0.12.0+ for automatic stale EphemeralRunner cleanup.'
+   - 'Monitor EphemeralRunner phase distribution; a growing count of Running CRs with no corresponding pods is a leading indicator of this issue.'
+   - 'Add terminationGracePeriodSeconds: 90+ to runner pod templates so graceful shutdown signals have time to deregister the runner.'
+ docs:
+   - url: 'https://github.com/actions/actions-runner-controller/issues/4155'
+     label: 'EphemeralRunner and its pods left stuck Running after runner OOMKILL (15 reactions)'
+   - url: 'https://github.com/actions/actions-runner-controller/issues/3922'
+     label: 'Scaleset controllers stuck with RunnerScaleSetSessionConflictException (12 reactions)'
+   - url: 'https://github.com/actions/runner/issues/4312'
+     label: 'Self-hosted runner gets stuck in active state, blocking queued jobs across multiple repositories'
+   - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller'
+     label: 'Managing self-hosted runners with Actions Runner Controller'
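The monitoring signal named in the prevention list (Running EphemeralRunner CRs with no corresponding pod) could be checked with a small helper like the one below. It is a sketch only: `stale_runners` is a hypothetical function, it works on already-parsed `kubectl get ... -o json` items, and it assumes each runner pod shares its EphemeralRunner's name, which you should verify for your ARC version.

```python
from typing import Any, Dict, Iterable, List


def stale_runners(
    runners: Iterable[Dict[str, Any]],
    pod_names: Iterable[str],
) -> List[str]:
    """Names of EphemeralRunners in phase Running that have no matching pod."""
    live_pods = set(pod_names)
    return [
        runner["metadata"]["name"]
        for runner in runners
        if runner.get("status", {}).get("phase") == "Running"
        and runner["metadata"]["name"] not in live_pods  # assumption: pod name == CR name
    ]


# Example input shaped like the "items" of `kubectl get ephemeralrunner -o json`.
example_runners = [
    {"metadata": {"name": "runner-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "runner-b"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "runner-c"}, "status": {"phase": "Pending"}},
]
print(stale_runners(example_runners, pod_names=["runner-b"]))
```

Anything this returns is only a candidate for `kubectl delete ephemeralrunner`; confirm the pod is really gone before deleting the CR.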
@@ -0,0 +1,113 @@
+ id: runner-environment-201
+ title: 'macOS 26 Homebrew python@3 ships Python 3.13: removed stdlib modules (cgi, imghdr, aifc, telnetlib) cause ModuleNotFoundError'
+ category: runner-environment
+ severity: error
+ tags:
+   - macos
+   - macos-26
+   - python
+   - homebrew
+   - stdlib
+   - runner-image
+   - breaking-change
+ patterns:
+   - regex: 'ModuleNotFoundError: No module named ''(cgi|cgitb|imghdr|aifc|audioop|chunk|nis|nntplib|telnetlib|uu|xdrlib|sndhdr|sunau|mailcap|msilib|pipes|crypt|spwd|ossaudiodev)'''
+     flags: 'i'
+   - regex: 'ImportError.*No module named.*cgi|No module named.*imghdr|No module named.*telnetlib'
+     flags: 'i'
+   - regex: 'python.*3\.13.*deprecated.*module|removed.*python.*3\.13'
+     flags: 'i'
+ error_messages:
+   - 'ModuleNotFoundError: No module named ''cgi'''
+   - 'ModuleNotFoundError: No module named ''imghdr'''
+   - 'ModuleNotFoundError: No module named ''aifc'''
+   - 'ModuleNotFoundError: No module named ''telnetlib'''
+   - 'ModuleNotFoundError: No module named ''chunk'''
+   - 'ModuleNotFoundError: No module named ''nntplib'''
+ root_cause: |
+   macOS 26 runner images ship Homebrew python@3 pointing to Python 3.13.x. Python
+   3.13 removed the following stdlib modules, which had been deprecated since Python 3.11 (PEP 594):
+
+     aifc, audioop, cgi, cgitb, chunk, crypt, imghdr, mailcap, msilib (Windows only), nis,
+     nntplib, ossaudiodev, pipes, sndhdr, spwd, sunau, telnetlib, uu, xdrlib
+
+   Workflows that call bare python3 (resolved to Homebrew Python 3.13 on macOS 26)
+   and import any of these modules fail with ModuleNotFoundError at runtime.
+
+   This affects:
+   - Scripts using cgi or cgitb for CGI form parsing and tracebacks
+   - Image-type detection using imghdr (commonly used with Pillow-based workflows)
+   - Legacy NNTP/Telnet clients using nntplib or telnetlib
+   - Audio file handling using aifc, sunau, or sndhdr
+
+   Workflows that previously ran on macos-14 or macos-15 (Homebrew Python 3.11/3.12)
+   are affected when the job label is macos-26 or when macos-latest migrates to
+   macOS 26. The failure is not immediately obvious because the error occurs at
+   import time inside Python, not at the runner level, and the runner step exits
+   with a non-zero code that may be mistaken for a test failure rather than an
+   environment regression.
+
+   Note: actions/setup-python@v5+ with an explicit python-version is unaffected;
+   this issue only affects scripts that rely on the system/Homebrew python3 binary.
+ fix: |
+   Option 1 (recommended): Pin Python with actions/setup-python:
+     Always use actions/setup-python with an explicit version to get the exact
+     Python version your code requires. This bypasses the Homebrew python@3 symlink.
+
+   Option 2: Replace removed modules with modern equivalents:
+     - cgi → urllib.parse + email.parser (or the 3rd-party 'legacy-cgi' backport on PyPI)
+     - imghdr → the 3rd-party 'standard-imghdr' backport on PyPI,
+       or use python-magic / filetype for image detection
+     - telnetlib → use telnetlib3 (PyPI) or asyncio-based Telnet
+     - aifc/sunau → use soundfile for audio I/O (wave remains in the stdlib for WAV only)
+
+   Option 3: Pin Homebrew Python to 3.12 on macos-26 (temporary):
+     brew install python@3.12
+     brew link python@3.12 --force
+     echo "$(brew --prefix python@3.12)/libexec/bin" >> "$GITHUB_PATH"
+ fix_code:
+   - language: yaml
+     label: 'Fix: pin Python version with actions/setup-python to avoid Homebrew python@3'
+     code: |
+       - uses: actions/setup-python@v5
+         with:
+           python-version: '3.12'  # pins to 3.12; immune to Homebrew python@3 upgrade
+
+       - name: Install dependencies
+         run: pip install -r requirements.txt
+
+       - name: Run script
+         run: python script.py  # uses setup-python's 3.12, not Homebrew 3.13
+   - language: yaml
+     label: 'Fix: install removed modules from PyPI backports'
+     code: |
+       - uses: actions/setup-python@v5
+         with:
+           python-version: '3.13'
+       - name: Install backported removed modules
+         run: |
+           pip install standard-imghdr  # PyPI backport of imghdr for Python 3.13+
+           # pip install telnetlib3     # if using Telnet
+       - name: Run script
+         run: python script.py
+   - language: yaml
+     label: 'Temporary: install and use python@3.12 from Homebrew on macos-26'
+     code: |
+       - name: Pin Homebrew Python to 3.12
+         run: |
+           brew install python@3.12
+           echo "$(brew --prefix python@3.12)/libexec/bin" >> "$GITHUB_PATH"
+       - name: Verify Python version
+         run: python --version  # should print Python 3.12.x (libexec/bin provides unversioned python and pip)
+ prevention:
+   - 'Always use actions/setup-python with an explicit version; never rely on bare python3 pointing to Homebrew python@3'
+   - 'Audit scripts for imports of modules removed in Python 3.13: cgi, imghdr, aifc, telnetlib, nntplib, chunk, uu'
+   - 'Run your scripts under Python 3.11 or 3.12 with -W error::DeprecationWarning before the macos-26 migration; importing any of the deprecated modules then fails loudly'
+   - 'Add python --version to diagnostic steps to catch unexpected Python version changes early'
+ docs:
+   - url: 'https://docs.python.org/3/whatsnew/3.13.html#removed-modules'
+     label: 'Python 3.13: Removed modules (official docs)'
+   - url: 'https://peps.python.org/pep-0594/'
+     label: 'PEP 594: Removing dead batteries from the standard library'
+   - url: 'https://github.com/actions/setup-python'
+     label: 'actions/setup-python: Pin a specific Python version'
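The import audit suggested in the prevention list can be approximated with the standard `ast` module. This is a rough sketch, not an official tool: `removed_imports` and `REMOVED_IN_313` are hypothetical names, the set is the PEP 594 module list removed in Python 3.13, and only static top-level `import` / `from ... import` statements are detected (dynamic imports via `importlib` are not).

```python
import ast
from typing import Set

# Modules removed from the standard library in Python 3.13 (PEP 594).
REMOVED_IN_313 = frozenset({
    "aifc", "audioop", "cgi", "cgitb", "chunk", "crypt", "imghdr",
    "mailcap", "msilib", "nis", "nntplib", "ossaudiodev", "pipes",
    "sndhdr", "spwd", "sunau", "telnetlib", "uu", "xdrlib",
})


def removed_imports(source: str) -> Set[str]:
    """Return the removed stdlib modules that `source` imports statically."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in REMOVED_IN_313:
                found.add(top_level)
    return found


sample = "import cgi\nfrom imghdr import what\nimport json\n"
print(sorted(removed_imports(sample)))
```

Running this over each script in a repository (for example by reading files found with `pathlib.Path.rglob("*.py")`) gives a quick list of imports to replace before a job moves to macos-26.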