@htekdev/actions-debugger 1.0.2 → 1.0.4

@@ -0,0 +1,127 @@
1
+ id: known-unsolved-009
2
+ title: "Job Killed After Maximum Execution Time (6h Hosted / 35-Day Workflow)"
3
+ category: known-unsolved
4
+ severity: limitation
5
+ tags:
6
+ - timeout
7
+ - execution-time
8
+ - job-limits
9
+ - platform-limit
10
+ - self-hosted
11
+ - workflow-duration
12
+ - limitation
13
+ patterns:
14
+ - regex: "The job running has exceeded the maximum execution time"
15
+ flags: "i"
16
+ - regex: "exceeded the maximum (?:time|execution time)"
17
+ flags: "i"
18
+ - regex: "job .* exceeded .* maximum"
19
+ flags: "i"
20
+ error_messages:
21
+ - "The job running on runner GitHub Actions X has exceeded the maximum execution time of 360 minutes."
22
+ - "The job running has exceeded the maximum execution time"
23
+ root_cause: |
24
+ GitHub Actions enforces hard platform-level execution time limits that cannot be
25
+ overridden or extended by workflow configuration. These limits exist to protect
26
+ shared infrastructure and prevent runaway jobs from consuming unlimited resources.
27
+
28
+ **GitHub-hosted runner limits:**
29
+ - Maximum job execution time: **6 hours** (360 minutes)
30
+ - Maximum workflow run time: **35 days** (across all jobs, including queued time)
31
+ - Default `timeout-minutes` when not set: **360 minutes** (6 hours)
32
+
33
+ **Self-hosted runner limits:**
34
+ - Maximum job execution time: **5 days** (7,200 minutes) by default
35
+ - Maximum workflow run time: **35 days** (same as hosted)
36
+ - Self-hosted limits can be customized in enterprise plans via org/enterprise policies
37
+
38
+ **When limits are hit:**
39
+ - The runner process is sent a SIGTERM (graceful) then SIGKILL (forced) after a grace period
40
+ - The job is marked CANCELLED (not FAILED) in the UI
41
+ - The log message "The job running has exceeded the maximum execution time" appears in
42
+ the runner log (may be visible in the step logs depending on where the runner was killed)
43
+ - Any `post:` steps for active actions (e.g., cache save, artifact upload) are skipped
44
+ - No email notification is sent to the repo owner about the cancellation
45
+
46
+ **Why this is a limitation, not just misconfiguration:**
47
+ - Values of `timeout-minutes` above 360 are not honored on GitHub-hosted runners; the job is still terminated at the 6h cap
48
+ - `timeout-minutes` is a job- and step-level setting that can only shorten the limit; it cannot raise the platform cap on GitHub-hosted runners
49
+ - Jobs requiring more than 6 hours on GitHub-hosted runners have NO supported path without
50
+ migrating to self-hosted or restructuring the job into multiple shorter sequential jobs
51
+ fix: |
52
+ There is no way to extend the GitHub-hosted runner 6-hour job cap. Options:
53
+
54
+ 1. **Break the job into smaller sequential jobs** — split long-running work (e.g., build
55
+ artifacts first, test in separate parallel jobs, deploy last). Each job has its own
56
+ 6-hour budget.
57
+
58
+ 2. **Migrate to self-hosted runners** — self-hosted runners support up to 5-day jobs.
59
+ Use actions-runner-controller (ARC) or cloud auto-scaling for elastic capacity.
60
+
61
+ 3. **Optimize the slow step** — profile build/test times; parallelize with matrix
62
+ strategy; use incremental builds or test sharding to reduce per-job duration.
63
+
64
+ 4. **Use caching aggressively** — `actions/cache` reduces download/build time between
65
+ runs, but does not extend limits.
66
+ fix_code:
67
+ - language: yaml
68
+ label: "Split a long job into sequential jobs to stay within 6h per job"
69
+ code: |
70
+ jobs:
71
+ build:
72
+ runs-on: ubuntu-latest
73
+ timeout-minutes: 120 # 2h budget for build
74
+ outputs:
75
+ artifact-id: ${{ steps.upload.outputs.artifact-id }}
76
+ steps:
77
+ - uses: actions/checkout@v4
78
+ - name: Build
79
+ run: make build-release
80
+ - name: Upload build artifact
81
+ id: upload
82
+ uses: actions/upload-artifact@v4
83
+ with:
84
+ name: release-build
85
+ path: dist/
86
+
87
+ # Separate job — gets its own 6h budget
88
+ test:
89
+ needs: build
90
+ runs-on: ubuntu-latest
91
+ timeout-minutes: 180 # 3h budget for tests
92
+ steps:
93
+ - uses: actions/download-artifact@v4
94
+ with:
95
+ name: release-build # download by the artifact name used in the build job
96
+ - run: make test-full
97
+ - language: yaml
98
+ label: "Self-hosted runner for jobs requiring more than 6 hours"
99
+ code: |
100
+ jobs:
101
+ long-running-job:
102
+ # Self-hosted runners support up to 5-day job duration
103
+ runs-on: [self-hosted, linux, x64]
104
+ timeout-minutes: 2880 # 48h — only possible on self-hosted
105
+ steps:
106
+ - uses: actions/checkout@v4
107
+ - name: Long-running process
108
+ run: ./scripts/full-dataset-processing.sh
109
+ prevention:
110
+ - "Set explicit `timeout-minutes` on every job — don't rely on the implicit 6h GitHub-hosted cap as your only safeguard."
111
+ - "Profile job duration regularly and alert when a job's P99 duration approaches 80% of its timeout budget."
112
+ - "Parallelize test suites using matrix strategy or `actions/github-script` dynamic matrix generation to reduce per-job time."
113
+ - "Use self-hosted runners for any workflow that legitimately requires more than 2-3 hours per job (e.g., large model training, full database rebuild, exhaustive integration tests)."
114
+ - "Be aware that post-run actions (cache save, artifact upload) will NOT execute if the parent job is killed for exceeding the time limit."
115
+ docs:
116
+ - url: "https://docs.github.com/en/actions/administering-github-actions/usage-limits-billing-and-administration#usage-limits"
117
+ label: "Usage limits: job execution time and workflow run time"
118
+ - url: "https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners#usage-limits"
119
+ label: "Self-hosted runner usage limits"
120
+ - url: "https://github.com/orgs/community/discussions/48790"
121
+ label: "Community: Workflow run time limit 35 days"
122
+ - url: "https://github.com/orgs/community/discussions/150900"
123
+ label: "Community: Job cancellation after 6 hours"
124
+ - url: "https://stackoverflow.com/questions/70187174/github-actions-self-hosted-runner-the-job-running-has-exceeded-the-maximum-exe"
125
+ label: "Stack Overflow: The job running has exceeded the maximum execution time"
126
+ - url: "https://github.com/actions/actions-runner-controller"
127
+ label: "Actions Runner Controller (ARC) — Kubernetes-based self-hosted runner auto-scaling"
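As a rough illustration of the `prevention` advice in known-unsolved-009 (alert when a job's duration approaches 80% of its timeout budget), here is a minimal shell sketch. The function name `budget_status` and its arguments are hypothetical and not part of GitHub Actions; it only compares two minute counts that the caller supplies, for example from exported run-history data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a job's elapsed time has reached a
# given percentage of its timeout-minutes budget (80% by default).
# Arguments: elapsed minutes, timeout minutes, optional threshold percent.
budget_status() {
  local elapsed="$1" timeout="$2" threshold="${3:-80}"
  # Integer arithmetic only: compare elapsed*100 against timeout*threshold
  # so no floating point is needed.
  if [ $(( elapsed * 100 )) -ge $(( timeout * threshold )) ]; then
    echo "warn"
  else
    echo "ok"
  fi
}

budget_status 150 180   # 150 of 180 minutes is above 80% of the budget
budget_status 60 120    # 60 of 120 minutes is well below the threshold
```

A check like this could run in a scheduled job over recent run durations; where the duration data comes from is left open here.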
@@ -0,0 +1,89 @@
1
+ id: runner-environment-018
2
+ title: "macOS 14 Sonoma Runner Deprecation — EOL November 2026"
3
+ category: runner-environment
4
+ severity: warning
5
+ tags:
6
+ - macos
7
+ - deprecation
8
+ - runner-image
9
+ - eol
10
+ - migration
11
+ patterns:
12
+ - regex: "##\\[error\\]This request was rejected because.*macos-14"
13
+ flags: "i"
14
+ - regex: "Image.*macos-14.*deprecated|macos-14.*no longer supported"
15
+ flags: "i"
16
+ - regex: "The requested image.*macos-14.*not available"
17
+ flags: "i"
18
+ error_messages:
19
+ - "##[error]This request was rejected because the runner label 'macos-14' is no longer supported."
20
+ - "The macOS 14 image has been retired. Please update to macos-15 or macos-latest."
21
+ root_cause: |
22
+ GitHub announced the deprecation of `macOS 14 Sonoma` runner images on the following schedule:
23
+ - **Deprecation begins**: July 6, 2026 — longer queue times during peak hours, brownout periods
24
+ - **Full retirement**: November 2, 2026 — jobs using `macos-14` will fail permanently
25
+
26
+ GitHub maintains only the latest two stable macOS major versions. Since macOS 26 Tahoe is now
27
+ GA on GitHub Actions, macOS 14 is the oldest and must retire.
28
+
29
+ **Brownout schedule** (October 2026; jobs are deliberately failed during these windows to force migration):
30
+ - October 5, 14:00 UTC – October 6, 00:00 UTC
31
+ - October 12, 14:00 UTC – October 13, 00:00 UTC
32
+ - October 16–30, 14:00 UTC – next day 00:00 UTC (escalating weekly)
33
+
34
+ Affected labels: `macos-14`, `macos-14-large`, `macos-14-xlarge`.
35
+ fix: |
36
+ Update your workflow's `runs-on` label to a supported macOS version:
37
+
38
+ | Old label | Replace with |
39
+ |-----------|--------------|
40
+ | `macos-14` | `macos-latest` or `macos-15` or `macos-26` |
41
+ | `macos-14-large` | `macos-latest-large` or `macos-15-large` |
42
+ | `macos-14-xlarge` | `macos-latest-xlarge` or `macos-15-xlarge` or `macos-26-xlarge` |
43
+
44
+ If your workflow depends on macOS 14-specific software versions (e.g., Xcode 15, older
45
+ Python/Ruby), test carefully against macOS 15 before switching `macos-latest`.
46
+ See runner-environment-017 for macOS 15 → 26 migration notes if moving to `macos-latest`.
47
+ fix_code:
48
+ - language: yaml
49
+ label: "Migrate from macos-14 to macos-15 (conservative) or macos-latest"
50
+ code: |
51
+ jobs:
52
+ build:
53
+ # Before:
54
+ # runs-on: macos-14
55
+
56
+ # Conservative: macos-15 (similar software stack, no OpenSSL jump)
57
+ runs-on: macos-15
58
+
59
+ # OR accept latest (currently macos-26 after June 15 2026):
60
+ # runs-on: macos-latest
61
+
62
+ steps:
63
+ - uses: actions/checkout@v4
64
+ - language: yaml
65
+ label: "Strategy matrix: test across multiple macOS versions during migration"
66
+ code: |
67
+ jobs:
68
+ build:
69
+ strategy:
70
+ matrix:
71
+ os: [macos-15, macos-26]
72
+ runs-on: ${{ matrix.os }}
73
+ steps:
74
+ - uses: actions/checkout@v4
75
+ - name: Build and test
76
+ run: make test
77
+ prevention:
78
+ - "Subscribe to announcement issues in the actions/runner-images repository to receive deprecation notices well in advance."
79
+ - "Avoid pinning to specific macOS major versions (`macos-14`, `macos-15`) in long-lived workflows unless a dependency requires it; otherwise use `macos-latest` and test proactively against the next version."
80
+ - "Run a matrix strategy against `macos-latest` and your pinned version to detect incompatibilities before they cause production failures."
81
+ - "Search your organization's workflows periodically for deprecated runner labels using: `gh search code 'runs-on: macos-14' --owner YOUR_ORG`."
82
+ docs:
83
+ - url: "https://github.com/actions/runner-images/issues/13518"
84
+ label: "GitHub Announcement: macOS 14 Sonoma deprecation timeline"
85
+ - url: "https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources"
86
+ label: "Supported GitHub-hosted runner labels"
87
+ source:
88
+ article: "https://htek.dev/articles/github-actions-debugging-guide"
89
+ section: "Runner deprecation and EOL"
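The `gh search code` tip in runner-environment-018 covers a whole organization. For a single checked-out repository, a local scan can be sketched as below. The function name `find_macos14_labels` is hypothetical, and the pattern is a deliberate simplification: it only catches `macos-14` written on the same line as `runs-on:` (including commented-out lines), not labels supplied through a matrix variable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list workflow lines whose runs-on value mentions macos-14.
# Argument: directory to scan (defaults to .github/workflows).
find_macos14_labels() {
  local dir="${1:-.github/workflows}"
  # -r recurse, -n print line numbers; the pattern also matches
  # macos-14-large and macos-14-xlarge because they share the prefix.
  grep -rnE --include='*.yml' --include='*.yaml' 'runs-on:.*macos-14' "$dir"
}

# Usage (from a repository root):
#   find_macos14_labels .github/workflows
```

grep exits non-zero when nothing matches, so the function can double as a CI gate that passes only once no `macos-14` labels remain, provided the caller inverts the exit status.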
@@ -0,0 +1,127 @@
1
+ id: runner-environment-017
2
+ title: "macos-latest Label Now Points to macOS 26 Tahoe"
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - macos
7
+ - runner-image
8
+ - breaking-change
9
+ - openssl
10
+ - xcode
11
+ - migration
12
+ patterns:
13
+ - regex: "Error opening configuration file.*openssl"
14
+ flags: "i"
15
+ - regex: "SSL_CTX_new.*failed|SSL_connect.*SYSCALL error"
16
+ flags: "i"
17
+ - regex: "Could not find Xcode.*version.*16\\.\\d"
18
+ flags: "i"
19
+ - regex: "ruby.*requires.*Ruby (2|3\\.0|3\\.1|3\\.2|3\\.3)"
20
+ flags: "i"
21
+ - regex: "dyld.*Library not loaded.*libssl\\.1\\.1"
22
+ flags: "i"
23
+ error_messages:
24
+ - "dyld[]: Library not loaded: /usr/local/opt/openssl@1.1/lib/libssl.1.1.dylib"
25
+ - "Could not find Xcode version '16.4'"
26
+ - "SSL_connect returned=1 errno=0 state=error: certificate verify failed"
27
+ - "Your Ruby version is 3.4.x, but your Gemfile specified ~> 3.3.0"
28
+ - "npm warn old lockfile"
29
+ root_cause: |
30
+ The `macos-latest` label in GitHub Actions was migrated to point to macOS 26 Tahoe
31
+ beginning June 15, 2026 (completing by July 15, 2026). Previously it pointed to macOS 15
32
+ Sequoia. This migration includes several major software version changes that silently break
33
+ workflows:
34
+
35
+ - **OpenSSL**: 1.1.1w → 3.6.2 — The biggest breaking change. Many C/C++ projects,
36
+ Ruby gems (openssl), and Python packages that link against the system OpenSSL dynamically
37
+ will fail at runtime because libssl.1.1.dylib no longer ships with macOS 26.
38
+ - **Xcode**: Default changes from 16.4 to 26.4.1 (Xcode 26 series). Workflows pinning
39
+ `xcode-version: '16.4'` or relying on Clang 17 will break — Clang/LLVM jumps from 17 → 21.
40
+ - **Ruby**: 3.3.x → 3.4.x — Minor version bump can cause Gemfile constraint failures and
41
+ gem native extension compilation issues.
42
+ - **Node.js**: Default moves from 22 → 24 (though both images include Node 24 in cached tools).
43
+ - **npm**: 10.x → 11.x, a major version bump. Changes to the `package-lock.json` format are possible.
44
+ - **Homebrew LLVM**: 18 → 20 (major version jump for workflows using `llvm@18` explicitly).
45
+ fix: |
46
+ **Immediate mitigation:** Pin to `macos-15` to restore previous behavior while you migrate:
47
+
48
+ ```yaml
49
+ runs-on: macos-15
50
+ ```
51
+
52
+ **Proper fix** — address each breaking dependency:
53
+
54
+ 1. **OpenSSL**: Brew-install a pinned OpenSSL version and set library paths:
55
+ ```bash
56
+ brew install openssl@1.1
57
+ export LDFLAGS="-L$(brew --prefix openssl@1.1)/lib"
58
+ export CPPFLAGS="-I$(brew --prefix openssl@1.1)/include"
59
+ ```
60
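+ Or, preferably, migrate to OpenSSL 3.x-compatible code: OpenSSL 1.1.1 is end-of-life upstream, and the `openssl@1.1` Homebrew formula may no longer be installable.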
61
+
62
+ 2. **Xcode**: Pin the Xcode version explicitly using `maxim-lobanov/setup-xcode`:
63
+ ```yaml
64
+ - uses: maxim-lobanov/setup-xcode@v1
65
+ with:
66
+ xcode-version: '16.4'
67
+ ```
68
+ Or migrate your project to Xcode 26.
69
+
70
+ 3. **Ruby**: Update your Gemfile to accept 3.4.x (`~> 3.4`) or use `ruby/setup-ruby` to
71
+ pin a specific version:
72
+ ```yaml
73
+ - uses: ruby/setup-ruby@v1
74
+ with:
75
+ ruby-version: '3.3'
76
+ ```
77
+ fix_code:
78
+ - language: yaml
79
+ label: "Pin to macos-15 for immediate rollback"
80
+ code: |
81
+ jobs:
82
+ build:
83
+ runs-on: macos-15 # pinned until macos-26 migration complete
84
+ steps:
85
+ - uses: actions/checkout@v4
86
+ - language: yaml
87
+ label: "Full macos-26 compatible workflow — pin Xcode, Ruby, and OpenSSL"
88
+ code: |
89
+ jobs:
90
+ build:
91
+ runs-on: macos-latest # now macos-26
92
+ steps:
93
+ - uses: actions/checkout@v4
94
+
95
+ # Pin Xcode version explicitly
96
+ - uses: maxim-lobanov/setup-xcode@v1
97
+ with:
98
+ xcode-version: '26.4'
99
+
100
+ # Pin Ruby if needed
101
+ - uses: ruby/setup-ruby@v1
102
+ with:
103
+ ruby-version: '3.3'
104
+ bundler-cache: true
105
+
106
+ # If OpenSSL 1.1 is needed, install and export paths
107
+ - name: Install OpenSSL 1.1
108
+ run: |
109
+ brew install openssl@1.1
110
+ echo "LDFLAGS=-L$(brew --prefix openssl@1.1)/lib" >> "$GITHUB_ENV"
111
+ echo "CPPFLAGS=-I$(brew --prefix openssl@1.1)/include" >> "$GITHUB_ENV"
112
+ prevention:
113
+ - "Avoid relying on `macos-latest` for builds that depend on specific system library versions (OpenSSL, LLVM). Pin to a concrete label like `macos-15`."
114
+ - "Subscribe to the actions/runner-images repository announcements to get advance notice of `macos-latest` label migrations."
115
+ - "Use `ruby/setup-ruby`, `actions/setup-node`, and `actions/setup-python` to pin language runtimes instead of relying on runner image defaults."
116
+ - "Test your macOS workflows against the new image early by substituting `macos-26` before the `macos-latest` migration completes."
117
+ - "Audit all dylib/framework dependencies — anything linking to libssl.1.1 must be migrated to OpenSSL 3.x or brew-pinned."
118
+ docs:
119
+ - url: "https://github.com/actions/runner-images/issues/14167"
120
+ label: "GitHub Announcement: macos-latest will use macos-26 in June 2026"
121
+ - url: "https://github.com/actions/runner-images/blob/main/images/macos/macos-26-arm64-Readme.md"
122
+ label: "macOS 26 arm64 image README — full software list"
123
+ - url: "https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners"
124
+ label: "About GitHub-hosted runners — supported runner labels"
125
+ source:
126
+ article: "https://htek.dev/articles/github-actions-debugging-guide"
127
+ section: "Runner image migrations"
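Because the OpenSSL jump is the change runner-environment-017 flags as most disruptive, a workflow step can log which major version it actually sees before building. The sketch below is one way to do that in shell. The helper name `openssl_major` is hypothetical, and it assumes the usual `<name> <version> ...` shape of `openssl version` output. It does not distinguish OpenSSL from LibreSSL, which reports itself in the same shape.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the major version number from a line shaped
# like the output of `openssl version`, e.g. "OpenSSL 3.6.2 ...".
openssl_major() {
  local version_line="$1"
  local version="${version_line#* }"   # drop the leading name and space
  echo "${version%%.*}"                # keep the digits before the first dot
}

# Log the detected major version only if an openssl binary is on PATH.
if command -v openssl >/dev/null 2>&1; then
  echo "Detected OpenSSL major version: $(openssl_major "$(openssl version)")"
fi
```

Failing the step early when the major version is not the expected one would surface a `macos-latest` migration as a clear message rather than a later `dyld` error.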
@@ -0,0 +1,112 @@
1
+ id: runner-environment-019
2
+ title: "PowerShell 7.4 → 7.6 LTS Upgrade Breaks pwsh Scripts"
3
+ category: runner-environment
4
+ severity: warning
5
+ tags:
6
+ - powershell
7
+ - pwsh
8
+ - runner-image
9
+ - breaking-change
10
+ - windows
11
+ - linux
12
+ patterns:
13
+ - regex: "The term 'ThreadJob\\\\Start-ThreadJob' is not recognized"
14
+ flags: "i"
15
+ - regex: "Cannot bind parameter.*ChildPath.*expected.*String.*got.*Array"
16
+ flags: "i"
17
+ - regex: "WildcardPattern.*backtick|escape.*pattern.*unexpected"
18
+ flags: "i"
19
+ - regex: "New-EventLog.*source.*trailing|event source.*not found"
20
+ flags: "i"
21
+ error_messages:
22
+ - "The term 'ThreadJob\\Start-ThreadJob' is not recognized as a name of a cmdlet, function, script file, or executable program."
23
+ - "Cannot bind parameter 'ChildPath'. Cannot convert the 'System.Object[]' value of type 'System.Object[]' to type 'System.String'."
24
+ - "Start-ThreadJob : The term 'ThreadJob\\Start-ThreadJob' is not recognized"
25
+ root_cause: |
26
+ GitHub Actions upgraded PowerShell from 7.4.x to 7.6 LTS on all runner images beginning
27
+ June 8, 2026 (completing June 15, 2026). This affects every runner OS: ubuntu-22.04,
28
+ ubuntu-24.04, ubuntu-slim, macos-14/15/26, windows-2022/2025/2025-vs2026.
29
+
30
+ PowerShell 7.6 is built on .NET 10 (7.4 was on .NET 8). The documented breaking changes are:
31
+
32
+ 1. **ThreadJob module renamed**: `ThreadJob` module is now `Microsoft.PowerShell.ThreadJob`.
33
+ Scripts calling `ThreadJob\Start-ThreadJob` with the old module-qualified name will throw
34
+ a "not recognized" error. The cmdlet itself (`Start-ThreadJob`) still works without prefix.
35
+
36
+ 2. **`Join-Path -ChildPath` now accepts `string[]`**: Parameter binding changed from `string`
37
+ to `string[]`. Existing code that passes multiple -ChildPath args may see binding errors
38
+ depending on how arguments were constructed.
39
+
40
+ 3. **`WildcardPattern.Escape` now correctly escapes lone backticks**: Scripts that relied on
41
+ the previous (incorrect) behavior of backtick handling in wildcard patterns may produce
42
+ different results.
43
+
44
+ 4. **Event source name trailing space removed**: New-EventLog / Write-EventLog source names
45
+ no longer have a trailing space. Scripts that matched exact event source names including
46
+ the trailing space (e.g., `"MySource "`) will fail to match.
47
+
48
+ 5. **.NET 10 runtime**: Any `pwsh` script using .NET types or reflection that was tested
49
+ against .NET 8 behavior may see subtle differences.
50
+ fix: |
51
+ **ThreadJob module name** — the most common breaking change:
52
+ Replace `ThreadJob\Start-ThreadJob` with `Microsoft.PowerShell.ThreadJob\Start-ThreadJob`,
53
+ or just call `Start-ThreadJob` without the module qualifier.
54
+
55
+ **Event source name** — if you match exact source names:
56
+ Trim any trailing spaces from source name comparisons.
57
+
58
+ **Pin PowerShell version** if you need time to migrate (not recommended long-term):
59
+ Install a specific pwsh version in your workflow before running scripts.
60
+
61
+ **Test locally**: Install PowerShell 7.6 (`winget install Microsoft.PowerShell`) and run
62
+ your scripts to catch any remaining issues before they surface in CI.
63
+ fix_code:
64
+ - language: powershell
65
+ label: "Fix ThreadJob module-qualified name"
66
+ code: |
67
+ # In your PowerShell script — change this:
68
+ # ThreadJob\Start-ThreadJob -ScriptBlock { ... }
69
+ # To either:
70
+ # Start-ThreadJob -ScriptBlock { ... } # simplest fix
71
+ # Microsoft.PowerShell.ThreadJob\Start-ThreadJob -ScriptBlock { ... } # fully qualified
72
+ - language: yaml
73
+ label: "Pin PowerShell version as a temporary workaround"
74
+ code: |
75
+ jobs:
76
+ build:
77
+ runs-on: ubuntu-latest
78
+ steps:
79
+ - uses: actions/checkout@v4
80
+
81
+ # Pin pwsh to 7.4.x temporarily while migrating
82
+ - name: Install PowerShell 7.4
83
+ run: |
84
+ wget -q "https://github.com/PowerShell/PowerShell/releases/download/v7.4.10/powershell_7.4.10-1.deb_amd64.deb"
85
+ sudo dpkg -i powershell_7.4.10-1.deb_amd64.deb
86
+
87
+ - name: Run PowerShell script
88
+ shell: pwsh
89
+ run: ./scripts/build.ps1
90
+ - language: powershell
91
+ label: "Fix event source name trailing-space match"
92
+ code: |
93
+ # Before (broken on 7.6):
94
+ # if ($event.Source -eq "MyApp ") { ... }
95
+ #
96
+ # After (works on 7.4 and 7.6):
97
+ # if ($event.Source.Trim() -eq "MyApp") { ... }
98
+ prevention:
99
+ - "Subscribe to actions/runner-images announcements for PowerShell upgrade notices well before they ship."
100
+ - "Always use unqualified cmdlet names (`Start-ThreadJob`) rather than module-qualified names (`ThreadJob\\Start-ThreadJob`) to avoid module-rename breakage."
101
+ - "Run your PowerShell scripts through PSScriptAnalyzer with the latest rule set after any PS version upgrade."
102
+ - "Test pwsh workflows in a matrix with the previous and new PS version during runner image transition periods."
103
+ docs:
104
+ - url: "https://github.com/actions/runner-images/issues/14150"
105
+ label: "GitHub Announcement: PowerShell 7.4 → 7.6 upgrade on all runner images"
106
+ - url: "https://learn.microsoft.com/en-us/powershell/scripting/whats-new/what-s-new-in-powershell-76"
107
+ label: "PowerShell 7.6 release notes and breaking changes"
108
+ - url: "https://learn.microsoft.com/en-us/powershell/scripting/install/powershell-support-lifecycle"
109
+ label: "PowerShell support lifecycle"
110
+ source:
111
+ article: "https://htek.dev/articles/github-actions-debugging-guide"
112
+ section: "Runner image tool version changes"
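To find scripts affected by the ThreadJob rename described in runner-environment-019, a repository can be scanned for the old module-qualified call. This is a minimal shell sketch with a hypothetical function name, `find_legacy_threadjob_calls`. It assumes scripts use the `.ps1` or `.psm1` extension and drops any line that already uses the new `Microsoft.PowerShell.ThreadJob` qualifier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list lines that still call ThreadJob\Start-ThreadJob
# with the old module qualifier.
# Argument: directory to scan (defaults to the current directory).
find_legacy_threadjob_calls() {
  local dir="${1:-.}"
  # -F treats the pattern as a fixed string, so the backslash is literal;
  # -i because PowerShell command names are case-insensitive.
  grep -rniF --include='*.ps1' --include='*.psm1' \
    'ThreadJob\Start-ThreadJob' "$dir" |
    grep -viF 'Microsoft.PowerShell.ThreadJob\Start-ThreadJob'
}

# Usage:
#   find_legacy_threadjob_calls ./scripts
```

The second grep is needed because the new fully qualified name contains the old one as a substring.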
@@ -0,0 +1,126 @@
1
+ id: runner-environment-021
2
+ title: "Service Container Marked Unhealthy — Health Check Timeout"
3
+ category: runner-environment
4
+ severity: error
5
+ tags:
6
+ - service-container
7
+ - healthcheck
8
+ - docker
9
+ - timeout
10
+ - postgres
11
+ - redis
12
+ - rabbitmq
13
+ patterns:
14
+ - regex: "service is unhealthy"
15
+ flags: "i"
16
+ - regex: "Failed to initialize.*service is unhealthy"
17
+ flags: "i"
18
+ - regex: "##\\[error\\]Failed to initialize.*service"
19
+ flags: "i"
20
+ - regex: "container_id.*unhealthy"
21
+ flags: "i"
22
+ error_messages:
23
+ - "##[error]Failed to initialize, rabbitmq service is unhealthy."
24
+ - "##[error]Failed to initialize, postgres service is unhealthy."
25
+ - "##[error]Failed to initialize, redis service is unhealthy."
26
+ - "unhealthy"
27
+ - "service is starting, waiting 29 seconds before checking again."
28
+ root_cause: |
29
+ GitHub Actions checks the Docker HEALTHCHECK status of service containers
30
+ before allowing dependent job steps to run. If the container does not
31
+ transition to `healthy` within the runner's fixed retry window, the job
32
+ fails with "service is unhealthy".
33
+
34
+ This commonly occurs because:
35
+ 1. **No options specified** — Docker uses the image's built-in HEALTHCHECK,
36
+ which may be missing, too aggressive, or unsuitable for the CI environment.
37
+ 2. **Startup time** — services like PostgreSQL, RabbitMQ, or Elasticsearch
38
+ take longer to initialize on GitHub-hosted runners than on local machines,
39
+ and the default health-check interval/retries expire before they're ready.
40
+ 3. **Wrong health-check command** — a network ping health check may fail if
41
+ the service port isn't yet bound even though the process is running.
42
+ 4. **Missing `--health-start-period`** — without a start period, Docker counts
43
+ health-check failures from container start, before the service has had time
44
+ to initialize.
45
+
46
+ Documented in actions/example-services issue #3.
47
+ fix: |
48
+ Add `options:` to the service container definition with explicit health-check
49
+ parameters suited to the service and GitHub-hosted runner environment:
50
+
51
+ - `--health-cmd`: Use a service-native health check command, not a TCP probe.
52
+ Examples:
53
+ PostgreSQL: `pg_isready -U postgres`
54
+ Redis: `redis-cli ping`
55
+ MySQL: `mysqladmin ping -h localhost`
56
+ RabbitMQ: `rabbitmq-diagnostics -q ping`
57
+
58
+ - `--health-interval 10s`: Check every 10 seconds
59
+ - `--health-timeout 5s`: Allow up to 5 seconds per check
60
+ - `--health-retries 5`: Retry up to 5 times before marking unhealthy
61
+ - `--health-start-period 30s`: Give 30 seconds before counting failures
62
+
63
+ If the image lacks a health check and the options approach is insufficient,
64
+ add an explicit wait step after job start using the service label name to
65
+ poll readiness with a loop.
66
+ fix_code:
67
+ - language: yaml
68
+ label: "PostgreSQL service container with proper health check"
69
+ code: |
70
+ jobs:
71
+ test:
72
+ runs-on: ubuntu-latest
73
+ services:
74
+ postgres:
75
+ image: postgres:16
76
+ env:
77
+ POSTGRES_PASSWORD: postgres
78
+ POSTGRES_DB: testdb
79
+ ports:
80
+ - 5432:5432
81
+ options: >-
82
+ --health-cmd "pg_isready -U postgres"
83
+ --health-interval 10s
84
+ --health-timeout 5s
85
+ --health-retries 5
86
+ --health-start-period 30s
87
+ steps:
88
+ - run: psql postgresql://postgres:postgres@localhost:5432/testdb -c "SELECT 1"
89
+ - language: yaml
90
+ label: "Redis service container with proper health check"
91
+ code: |
92
+ services:
93
+ redis:
94
+ image: redis:7
95
+ ports:
96
+ - 6379:6379
97
+ options: >-
98
+ --health-cmd "redis-cli ping"
99
+ --health-interval 10s
100
+ --health-timeout 5s
101
+ --health-retries 5
102
+ - language: yaml
103
+ label: "Fallback — explicit wait step if health check unreliable"
104
+ code: |
105
+ steps:
106
+ - name: Wait for PostgreSQL to be ready
107
+ run: |
108
+ until pg_isready -h localhost -p 5432 -U postgres; do
109
+ echo "Waiting for postgres..."
110
+ sleep 2
111
+ done
112
+ timeout-minutes: 2
113
+ prevention:
114
+ - "Always specify `options:` with `--health-cmd`, `--health-interval`, `--health-retries`, and `--health-start-period` for every service container."
115
+ - "Use service-native health check commands (pg_isready, redis-cli ping) rather than generic TCP probes."
116
+ - "Add `--health-start-period` of 20-30 seconds for services with slow initialization (PostgreSQL, Elasticsearch, RabbitMQ)."
117
+ - "Test health check timing locally in Docker before relying on it in CI: `docker run --health-cmd '...' --health-interval 5s <image>`."
118
+ docs:
119
+ - url: "https://docs.github.com/en/actions/using-containerized-services/about-service-containers"
120
+ label: "About service containers"
121
+ - url: "https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idservicesservice_idoptions"
122
+ label: "Workflow syntax — services options"
123
+ - url: "https://github.com/actions/example-services/issues/3"
124
+ label: "actions/example-services #3 — Service container health check questions and fixes"
125
+ - url: "https://stackoverflow.com/questions/66763353/how-to-health-check-a-service-in-github"
126
+ label: "Stack Overflow — How to health check a service in GitHub Actions"
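The fallback wait step in runner-environment-021 loops until `pg_isready` succeeds and relies on `timeout-minutes` to stop it. A bounded variant that gives up after a fixed number of attempts can be sketched as follows. The function name `wait_until_ready` and its argument order are hypothetical; any readiness command can be passed in place of `pg_isready`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a readiness command until it succeeds or a
# maximum number of attempts is reached.
# Arguments: max attempts, delay in seconds, then the command to run.
wait_until_ready() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0   # the service answered; stop waiting
    fi
    echo "Attempt $attempt/$max_attempts failed; retrying in ${delay_seconds}s" >&2
    sleep "$delay_seconds"
    attempt=$(( attempt + 1 ))
  done
  return 1       # never became ready within the attempt budget
}

# Usage inside a workflow step:
#   wait_until_ready 30 2 pg_isready -h localhost -p 5432 -U postgres
```

Returning a non-zero status after the last attempt makes the step fail with a readable log line rather than being cut off by the step timeout.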