npm - @htekdev/actions-debugger - Versions diffs - 1.0.123 → 1.0.125 - Mend

@htekdev/actions-debugger 1.0.123 → 1.0.125

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (32)

package/errors/runner-environment/runner-environment-235.yml ADDED

@@ -0,0 +1,151 @@
+id: runner-environment-235
+title: 'Action download fails with HTTP 429 Too Many Requests during action resolution — "Failed to download action ... Response status code does not indicate success: 429"'
+category: runner-environment
+severity: error
+tags:
+  - action-download
+  - rate-limit
+  - 429
+  - action-resolution
+  - tarball
+  - github-api
+  - outage
+  - retry
+patterns:
+  - regex: 'Failed to download action.*429|Response status code does not indicate success: 429 \(Too Many Requests\)'
+    flags: 'i'
+  - regex: 'Error: Response status code does not indicate success: 429'
+    flags: 'i'
+  - regex: 'Warning: Failed to download action.*tarball'
+    flags: 'i'
+error_messages:
+  - "Warning: Failed to download action 'https://api.github.com/repos/actions/checkout/tarball/11bd71901bbe5b1630ceea73d27597364c9af683'. Error: Response status code does not indicate success: 429 (Too Many Requests)."
+  - "Error: Response status code does not indicate success: 429 (Too Many Requests)."
+  - "Download action repository 'actions/checkout@v4'"
+root_cause: |
+  When a GitHub Actions job starts, the runner downloads each action referenced in the
+  workflow by fetching the action's source tarball from the GitHub API endpoint:
+    GET https://api.github.com/repos/{owner}/{repo}/tarball/{sha}
+  This endpoint is subject to GitHub's API rate limits. When many runners start
+  simultaneously — during periods of high load, GitHub infrastructure incidents, or
+  large-scale rollouts — the tarball download requests can be rate-limited, returning
+  HTTP 429 Too Many Requests.
+  The runner logs a warning and retries with exponential backoff:
+    "Warning: Failed to download action '...'. Error: Response status code does not
+     indicate success: 429 (Too Many Requests). <correlationId>"
+  If the rate limit persists through all retry attempts, the job setup fails and the
+  entire job is marked as failed with "##[error]" on the "Set up job" step. The failure
+  is NOT a problem with the workflow configuration itself — it is a transient GitHub
+  infrastructure issue.
+  Contributing factors:
+  1. **GitHub outages or degraded performance** — action downloads share API quota with
+     all other GitHub API traffic from the runner's IP range.
+  2. **High parallel job counts** — organizations running hundreds of parallel jobs on
+     GitHub-hosted runners can exhaust rate limit buckets across a runner fleet.
+  3. **Composite actions with many nested uses** — deeply nested composite actions make
+     many download calls per job, multiplying the per-job API request count. PR #4296
+     in actions/runner (merged 2026-03) adds batching to reduce this.
+  4. **Self-hosted runners without caching** — runners that are freshly provisioned per
+     job always re-download all actions; runners with persistent `_work/_tool` directories
+     can cache downloads across jobs.
+  Note: this is distinct from runner-environment-202 (repeated downloads caused by
+  case-sensitivity mismatch in v2.334.0), which causes multiple SUCCESSFUL downloads
+  rather than 429 failures.
+  Source: actions/runner#4232 (Feb 2026, open), actions/checkout#2230.
+fix: |
+  **Immediate:** Retry the failed workflow run. 429 errors during action downloads
+  are almost always transient — a re-run a few minutes later usually succeeds.
+  **Short-term: Reduce per-job download requests.**
+  Pin actions to a specific commit SHA rather than a mutable tag. Runners maintain an
+  action download cache keyed by commit SHA; pinned SHAs hit the cache more reliably
+  than mutable tags, which may require a fresh lookup each run.
+  **For self-hosted runners: persist the action cache directory.**
+  The runner caches downloaded actions in `<runner-root>/_work/_tool`. If your runner
+  instances are ephemeral (e.g., provisioned fresh per job on EC2/GKE), mount a shared
+  volume or EFS/EBS on this path to share the cache across instances, reducing
+  download traffic dramatically.
+  **For high-parallelism organizations: stagger workflow start times.**
+  Spread out scheduled workflows or push triggers to avoid thousands of jobs starting
+  simultaneously. Use concurrency groups or offset cron minutes to distribute load.
+  **Self-hosted runners on GHES / GHAE:** Use GHES's internal action caching to serve
+  action downloads locally instead of hitting GitHub.com — this avoids rate limiting
+  entirely.
+  **Monitor GitHub status:** Check https://www.githubstatus.com/ when 429 errors appear
+  — they often correlate with incidents listed on the status page.
+fix_code:
+  - language: yaml
+    label: 'Pin actions to commit SHA to maximize cache hits and reduce download requests'
+    code: |
+      # Instead of mutable tags (re-downloaded each time if tag moves):
+      # - uses: actions/checkout@v4
+      # - uses: actions/setup-node@v4
+      # Pin to immutable commit SHAs for reliable caching:
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2
+      - uses: actions/setup-node@39370e3970a6d050c480ffad4ff0ed4d3fdee5af  # v4.1.0
+  - language: yaml
+    label: 'Self-hosted runner: configure RUNNER_TOOL_CACHE to a persistent shared path'
+    code: |
+      # In the runner environment, set the tool cache path before starting the runner:
+      # export RUNNER_TOOL_CACHE=/mnt/shared/runner-tool-cache
+      # ./run.sh
+      # Or in the runner .env file (config/.env relative to the runner root):
+      # RUNNER_TOOL_CACHE=/mnt/shared/runner-tool-cache
+      # Docker-based self-hosted runner: mount the cache directory:
+      # docker run -v /mnt/shared/runner-tool-cache:/opt/hostedtoolcache ...
+  - language: yaml
+    label: 'Workflow-level retry: re-run the job automatically on setup failure'
+    code: |
+      # Actions has no built-in retry for "Set up job" failures, and a retry step
+      # inside the job cannot help because the failure happens before any step runs.
+      # One workaround is a separate workflow that re-runs the failed jobs.
+      # Note: this re-runs on ANY failure of "CI", not only on 429 download errors.
+      on:
+        workflow_run:
+          workflows: ["CI"]
+          types: [completed]
+      permissions:
+        actions: write  # required to request a re-run
+      jobs:
+        retry-on-429:
+          # Only retry the first attempt so a persistent failure cannot loop forever.
+          if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.run_attempt < 2
+          runs-on: ubuntu-latest
+          steps:
+            - name: Re-run the failed jobs of the triggering run
+              uses: actions/github-script@v7
+              with:
+                script: |
+                  await github.rest.actions.reRunWorkflowFailedJobs({
+                    owner: context.repo.owner,
+                    repo: context.repo.repo,
+                    run_id: context.payload.workflow_run.id
+                  });
+prevention:
+  - 'Pin all `uses:` references to immutable commit SHAs — this maximises the chance that the runner serves the action from its local download cache rather than calling the GitHub API'
+  - 'For self-hosted runners provisioned per job (ephemeral), mount a persistent shared volume at the runner tool cache path so action downloads are reused across job invocations'
+  - 'Upgrade to actions/runner v2.335.0+ which includes PR#4296 action resolution batching — this reduces the number of download requests per job for workflows using composite actions'
+  - 'When you see 429 errors, check https://www.githubstatus.com/ — they often coincide with GitHub API rate limit incidents affecting the entire platform'
+  - 'Spread scheduled workflows across different minutes using cron expressions like `17 4 * * *` rather than `0 4 * * *` to avoid the stampede effect when many organizations schedule jobs at round-number times'
+docs:
+  - url: 'https://github.com/actions/runner/issues/4232'
+    label: 'actions/runner#4232: Warning: Failed to download action ... 429 Too Many Requests'
+  - url: 'https://github.com/actions/checkout/issues/2230'
+    label: 'actions/checkout#2230: checkout fails with 429 Too Many Requests'
+  - url: 'https://github.com/actions/runner/pull/4296'
+    label: 'actions/runner PR#4296: Batch and deduplicate action resolution across composite depths (reduces 429 risk)'
+  - url: 'https://docs.github.com/en/rest/overview/rate-limits-for-the-rest-api'
+    label: 'GitHub Docs: REST API rate limits'
+  - url: 'https://www.githubstatus.com/'
+    label: 'GitHub Status page — check for active incidents when 429 errors appear'
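
Reviewer sketch (not part of the package): a quick way to sanity-check the `patterns` of runner-environment-235 against its own `error_messages`. It assumes the entry's `flags: 'i'` corresponds to `re.IGNORECASE`; the helper name `matching_patterns` is hypothetical.

```python
import re

# Regexes copied from the `patterns` list of runner-environment-235.
# The entry's `flags: 'i'` is mapped to re.IGNORECASE here.
PATTERNS = [
    r"Failed to download action.*429|Response status code does not indicate success: 429 \(Too Many Requests\)",
    r"Error: Response status code does not indicate success: 429",
    r"Warning: Failed to download action.*tarball",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]


def matching_patterns(log_line: str) -> list[int]:
    """Return the indexes of the patterns that match a single log line."""
    return [i for i, rx in enumerate(COMPILED) if rx.search(log_line)]


# Samples copied from the entry's `error_messages`.
SAMPLES = [
    "Warning: Failed to download action "
    "'https://api.github.com/repos/actions/checkout/tarball/"
    "11bd71901bbe5b1630ceea73d27597364c9af683'. Error: Response status code "
    "does not indicate success: 429 (Too Many Requests).",
    "Error: Response status code does not indicate success: 429 (Too Many Requests).",
    "Download action repository 'actions/checkout@v4'",
]

for sample in SAMPLES:
    print(matching_patterns(sample), sample[:50])
```

Under these regexes the third sample matches no pattern, so it can only serve as surrounding context for the other two.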

package/errors/silent-failures/silent-failures-112.yml ADDED

@@ -0,0 +1,97 @@
+id: silent-failures-112
+title: 'Non-ephemeral self-hosted runner REST API busy flag desync causes auto-scaler to kill active job'
+category: silent-failures
+severity: silent-failure
+tags:
+  - self-hosted
+  - non-ephemeral
+  - rest-api
+  - autoscaling
+  - broker
+  - busy-flag
+  - desync
+  - aws
+patterns:
+  - regex: 'The runner has received a shutdown signal'
+    flags: 'i'
+  - regex: '"busy":\s*false'
+    flags: 'i'
+  - regex: 'broker\.actions\.githubusercontent\.com.*busy.*JobState'
+    flags: 'i'
+error_messages:
+  - '##[error]The runner has received a shutdown signal.'
+  - '"busy": false  ← REST API shows idle while broker actively renewing job lease'
+  - 'Successfully renew job, valid till ...'
+root_cause: |
+  On non-ephemeral self-hosted runners (runners that execute multiple sequential jobs on the same
+  instance), there is a state desynchronization between the broker layer
+  (broker.actions.githubusercontent.com) and the REST API endpoint
+  GET /repos/{owner}/{repo}/actions/runners/{runner_id}.
+  The failure sequence:
+  1. Runner completes Job A and briefly enters the Online/idle state.
+  2. Runner picks up Job B, transitions to Busy, and reports JobState: Busy to the broker.
+  3. The broker acknowledges the job and successfully renews the lease every 60 seconds
+     (logged as "Successfully renew job, valid till ...").
+  4. Despite (3), the REST API returns "busy": false — sometimes immediately when Job B starts,
+     sometimes after tracking correctly for several minutes before spontaneously flipping.
+  5. Auto-scaling infrastructure (AWS Lambda scaler, GKE controllers, terraform-aws-github-runner,
+     or custom polling services) queries the REST API, sees the runner as idle, and terminates the
+     EC2/VM instance mid-job.
+  6. Job B fails with "The runner has received a shutdown signal." No error is surfaced by GitHub's
+     UI indicating the runner was inappropriately killed.
+  The desync appears to occur at the boundary between jobs, when the runner goes from Busy to
+  Online and back to Busy. The broker and the REST API maintain separate state stores; in certain
+  timing windows the REST API does not pick up the Job B start event, leaving its state stale from
+  the inter-job idle period.
+fix: |
+  Option 1 — Use ephemeral runners (recommended). Each runner instance handles exactly one job
+  and then terminates. There is no second job pick-up, so the busy-flag desync window never exists.
+  Option 2 — Do not rely solely on "busy": false from the REST API as the termination signal.
+  Cross-reference with a job-completion event (CloudWatch, Pub/Sub, broker callback) or add a
+  grace period of 60-120 seconds after the REST API first reports idle before terminating.
+  Option 3 — Use the Webhook-based ARC (Actions Runner Controller) instead of REST-poll-based
+  autoscaling. ARC receives direct runner lifecycle events and does not depend on the REST API
+  busy state for scale-down decisions.
+fix_code:
+  - language: hcl
+    label: 'Enable ephemeral runners in terraform-aws-github-runner to prevent multi-job desync'
+    code: |
+      module "runners" {
+        source  = "github-aws-runners/github-runner/aws"
+        version = "~> 6.6.0"
+        # Each instance executes exactly one job then terminates.
+        # Eliminates the busy-flag desync window between sequential jobs.
+        enable_ephemeral_runners = true
+        # ... other config
+      }
+  - language: yaml
+    label: 'Declare ephemeral runner label in the workflow'
+    code: |
+      jobs:
+        build:
+          # Request a fresh ephemeral runner — one job per instance.
+          # Requires your runner pool to be configured for ephemeral mode.
+          runs-on: [self-hosted, ephemeral, linux, x64]
+          steps:
+            - uses: actions/checkout@v4
+            - run: ./build.sh
+prevention:
+  - 'Prefer ephemeral runners over reused (non-ephemeral) runners in autoscaling environments: one job per instance eliminates the busy-flag desync window entirely.'
+  - 'Never use "busy": false from the REST API as the sole signal for terminating a non-ephemeral runner instance.'
+  - 'Add a minimum 90-second grace period after the REST API first reports busy=false before terminating any non-ephemeral runner.'
+  - 'Monitor broker lease-renewal logs ("Successfully renew job") to track active job state independently of the REST API.'
+docs:
+  - url: 'https://github.com/actions/runner/issues/4422'
+    label: 'actions/runner#4422 — /runners REST API reports busy: false while broker says busy (open)'
+  - url: 'https://github.com/github-aws-runners/terraform-aws-github-runner'
+    label: 'terraform-aws-github-runner — ephemeral runner module'
+  - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners#list-self-hosted-runners-for-a-repository'
+    label: 'GitHub Docs — REST API for self-hosted runners'
+  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners'
+    label: 'GitHub Docs — About self-hosted runners'
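
Reviewer sketch (not part of the package): one possible shape for Option 2, where a scaler terminates an instance only after the REST API has reported idle for a full grace period and no independent signal shows a running job. `IdleTracker`, `job_in_progress`, and the 90-second constant are assumptions for illustration; the second signal could come from whatever job-completion source the scaler already has.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: 90 seconds, following the grace period suggested under `prevention`.
GRACE_SECONDS = 90.0


@dataclass
class IdleTracker:
    """Tracks how long a runner has continuously looked idle."""

    idle_since: Optional[float] = None

    def should_terminate(self, busy: bool, job_in_progress: bool, now: float) -> bool:
        """Decide on one polling tick whether the instance may be terminated.

        busy            -- the "busy" flag returned by the REST API
        job_in_progress -- a second, independent signal (hypothetical)
        now             -- current time in seconds
        """
        # Any sign of activity resets the idle clock.
        if busy or job_in_progress:
            self.idle_since = None
            return False
        # First idle observation starts the grace period.
        if self.idle_since is None:
            self.idle_since = now
        return now - self.idle_since >= GRACE_SECONDS


tracker = IdleTracker()
print(tracker.should_terminate(busy=False, job_in_progress=False, now=0.0))
print(tracker.should_terminate(busy=False, job_in_progress=False, now=95.0))
```

A grace period only narrows the window. If the REST API stays wrong for longer than the grace period, the second signal is what prevents the kill.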

package/errors/silent-failures/silent-failures-113.yml ADDED

@@ -0,0 +1,110 @@
+id: silent-failures-113
+title: 'Org-level self-hosted runner with correct group access never dispatched — job stays Queued indefinitely with no error'
+category: silent-failures
+severity: silent-failure
+tags:
+  - self-hosted
+  - runner-group
+  - org-runner
+  - dispatch
+  - queued
+  - v2-broker
+  - silent
+patterns:
+  - regex: 'runner_group_id.*null'
+    flags: 'i'
+  - regex: 'Waiting for a runner to pick up this job'
+    flags: 'i'
+error_messages:
+  - 'Waiting for a runner to pick up this job'
+  - '(job stays in Queued state indefinitely — no error message is displayed)'
+root_cause: |
+  When a self-hosted runner is registered at the **organization level** (not the
+  repository level) and added to a runner group with explicit per-repository access,
+  a bug in the V2 broker flow (`useV2Flow: true`,
+  `serverUrl: broker.actions.githubusercontent.com`) can cause the dispatcher to fail
+  to resolve the runner group membership against the target repository.
+  The result: the queued job never receives a `runner_group_id` assignment
+  (confirmed via `GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs` API, which
+  returns `runner_id: null`, `runner_name: null`, `runner_group_id: null` throughout the
+  entire queue duration). The runner is online and idle throughout; it simply is
+  never offered the job.
+  From the UI perspective, the job shows the standard "Waiting for a runner to pick
+  up this job" message with no indication that the runner group resolution failed.
+  No error annotation is ever produced — the workflow just hangs until cancelled.
+  This failure mode has been observed under these conditions:
+  - Runner registered at organization scope
+  - Runner group has explicit per-repository access (not "All repositories")
+  - Runner version 2.334.0, V2 broker protocol enabled
+  - GitHub Team plan
+  - GitHub Enterprise Cloud is NOT required to reproduce
+  Repository-level runners using identical labels and configuration dispatch
+  correctly, confirming the bug is specific to the org-level + runner-group + V2
+  broker path.
+  The precise broker-side failure point is unknown; the hypothesis is an inconsistency
+  in how V2 broker resolves org runner group → repository access grant at dispatch time.
+fix: |
+  Immediate workaround: re-register the runner at the repository level instead of
+  the organization level.
+  1. Stop and unregister the org-level runner:
+     cd <runner-dir> && ./config.sh remove --token <removal-token>
+  2. Re-register at repository level:
+     ./config.sh --url https://github.com/<org>/<repo> --token <repo-token>
+  3. Trigger the workflow again — dispatch should proceed immediately.
+  If you need the runner to serve multiple repositories, repeat the registration
+  for each repository, or use runner groups with "All repositories" scope as an
+  interim workaround until GitHub resolves the V2 broker dispatch bug.
+  Track actions/runner#4429 for an official fix.
+fix_code:
+  - language: bash
+    label: 'Re-register runner at repository level instead of org level'
+    code: |
+      cd /path/to/runner
+      # Remove org-level registration
+      ./config.sh remove --token <ORG_REMOVAL_TOKEN>
+      # Re-register at repository level
+      ./config.sh \
+        --url https://github.com/<org>/<repo> \
+        --token <REPO_RUNNER_TOKEN> \
+        --name my-runner \
+        --labels my-label \
+        --unattended
+      # Start the runner
+      ./run.sh
+  - language: yaml
+    label: 'Workflow — no workflow change needed; fix is in runner registration scope'
+    code: |
+      # No changes needed in the workflow YAML itself.
+      # The runs-on label works correctly once the runner is repo-scoped.
+      jobs:
+        build:
+          runs-on: [self-hosted, my-label]
+          steps:
+            - uses: actions/checkout@v4
+            - run: echo "Runner dispatched correctly"
+prevention:
+  - 'Prefer repository-level runner registration for single-repo pipelines; use org-level runners only when the runner must serve many repos, and test dispatch before rolling out.'
+  - 'After registering an org-level runner, verify dispatch with a simple echo workflow before building production pipelines on top of it.'
+  - 'Monitor actions/runner#4429 for a fix to the V2 broker org-runner-group dispatch resolution bug.'
+  - 'When investigating stuck jobs, use the REST API (GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs) to check whether `runner_group_id` is null; per the root cause above, this points to a dispatch resolution failure rather than a resource wait.'
+docs:
+  - url: 'https://github.com/actions/runner/issues/4429'
+    label: 'actions/runner#4429 — Org-level self-hosted runner never dispatched despite correct runner group repo access'
+  - url: 'https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners'
+    label: 'GitHub Docs — Adding self-hosted runners (org vs repo level)'
+  - url: 'https://docs.github.com/en/rest/actions/self-hosted-runners'
+    label: 'GitHub REST API — Self-hosted runners'
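
Reviewer sketch (not part of the package): a small filter over the JSON returned by `GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs`, listing queued jobs that have no runner and no runner group assigned. The function name `undispatched_jobs` and the sample payload are made up for illustration; only the field names follow the REST API.

```python
import json


def undispatched_jobs(jobs_response: dict) -> list[str]:
    """Names of queued jobs with neither a runner nor a runner group assigned."""
    return [
        job["name"]
        for job in jobs_response.get("jobs", [])
        if job.get("status") == "queued"
        and job.get("runner_id") is None
        and job.get("runner_group_id") is None
    ]


# Made-up payload in the shape of the jobs listing.
sample = json.loads("""
{"total_count": 2, "jobs": [
  {"name": "build", "status": "queued",
   "runner_id": null, "runner_name": null, "runner_group_id": null},
  {"name": "lint", "status": "in_progress",
   "runner_id": 7, "runner_name": "r1", "runner_group_id": 2}
]}
""")

print(undispatched_jobs(sample))
```

A job that has only just entered the queue looks the same, so the result is meaningful only for jobs that have been queued well beyond the usual pick-up time while a matching runner is online and idle.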

package/errors/silent-failures/silent-failures-114.yml ADDED

@@ -0,0 +1,116 @@
+id: silent-failures-114
+title: 'upload-artifact@v7 archive:false uploads artifact using filename instead of name input — causes name conflict across jobs'
+category: silent-failures
+severity: silent-failure
+tags:
+  - upload-artifact
+  - archive
+  - artifact-name
+  - v7
+  - matrix
+  - parallel-jobs
+  - name-conflict
+patterns:
+  - regex: 'An artifact with this name already exists on the run'
+    flags: 'i'
+  - regex: 'Artifact name conflict.*archive.*false'
+    flags: 'i'
+  - regex: 'Failed to CreateArtifact.*already exists'
+    flags: 'i'
+error_messages:
+  - 'An artifact with this name already exists on the run'
+  - 'Failed to CreateArtifact: artifact with name already exists'
+root_cause: |
+  In `actions/upload-artifact@v7`, when `archive: false` is set, the action uploads
+  individual files as separate artifact entries rather than bundling them into a zip
+  archive. In this mode, a bug causes the action to use the **uploaded filename** as
+  the artifact name instead of respecting the `name:` input parameter.
+  For example, if two parallel jobs both upload `report.html` with `name: report` and
+  `archive: false`, the first job creates an artifact named `report.html` (not `report`).
+  The second job then tries to create another artifact named `report.html` and hits a
+  name collision error:
+    "An artifact with this name already exists on the run"
+  In the `archive: true` (default) mode, the `name:` input works correctly and the
+  artifact is a zip named after the `name:` value. The bug is specific to the
+  unarchived upload path introduced in v7.
+  This failure is partially silent because:
+  - The first job's upload succeeds (just with the wrong name)
+  - The error only manifests on the second+ job where the collision is detected
+  - The artifact appears in the Actions UI under the filename, not the user-provided name
+  - Users downloading by the `name:` value will find no artifact with that name
+  This bug was reported against v7.0.0 and had not been fixed when this entry was written.
+fix: |
+  Option 1: Remove archive:false and use the default archive:true behavior.
+  The name: input works correctly when archiving is enabled. This is the recommended
+  fix until the v7 archive:false bug is resolved.
+  Option 2: Use a unique file name per job.
+  If archive:false is required (e.g., for browser preview of HTML artifacts), give
+  each job a unique output file name, for example by including the matrix value.
+  Because the artifact is named after the file in this mode, a unique name: value
+  alone does not avoid the collision:
+    path: report-${{ matrix.os }}.html
+  Option 3: Pin to upload-artifact@v6 or earlier.
+  The archive:false feature did not exist in v6; artifacts were always archived.
+  Track actions/upload-artifact#785 for a fix.
+fix_code:
+  - language: yaml
+    label: 'Remove archive:false — use default archive:true (name: input works correctly)'
+    code: |
+      jobs:
+        lint:
+          runs-on: ubuntu-latest
+          steps:
+            - uses: actions/checkout@v4
+            - run: ./lint.sh > report.html
+            - uses: actions/upload-artifact@v7
+              with:
+                name: lint-report   # works correctly with default archive: true
+                path: report.html
+                # archive: false  <-- remove this line
+        test:
+          runs-on: ubuntu-latest
+          steps:
+            - uses: actions/checkout@v4
+            - run: ./test.sh > report.html
+            - uses: actions/upload-artifact@v7
+              with:
+                name: test-report   # different name per job — no collision
+                path: report.html
+  - language: yaml
+    label: 'Workaround: unique file name per job if archive:false is required'
+    code: |
+      jobs:
+        check:
+          strategy:
+            matrix:
+              component: [lint, test, typecheck]
+          runs-on: ubuntu-latest
+          steps:
+            - uses: actions/checkout@v4
+            - run: ./${{ matrix.component }}.sh > report-${{ matrix.component }}.html
+            - uses: actions/upload-artifact@v7
+              with:
+                # With archive: false the artifact is named after the file, so the
+                # file name (not only name:) must be unique per job.
+                name: report-${{ matrix.component }}
+                path: report-${{ matrix.component }}.html
+                archive: false
+prevention:
+  - 'When uploading the same filename from multiple parallel jobs, always use a unique name: per job (e.g., name: report-${{ matrix.component }}); with archive: false, also make the file name itself unique, because the artifact is named after the file.'
+  - 'Test artifact upload in a single-job workflow first to verify the artifact appears under the expected name before rolling out to parallel matrix jobs.'
+  - 'Monitor actions/upload-artifact#785 and the v7 changelog for a fix to the archive:false + name: input handling bug.'
+  - 'Keep the default archive:true unless you specifically need individual file browsing in the UI; the default mode has fewer edge-case bugs.'
+docs:
+  - url: 'https://github.com/actions/upload-artifact/issues/785'
+    label: 'actions/upload-artifact#785 — archive:false does not respect artifact name'
+  - url: 'https://github.com/actions/upload-artifact/blob/main/README.md#inputs'
+    label: 'upload-artifact README — inputs (name, path, archive)'
+  - url: 'https://github.com/actions/upload-artifact/releases/tag/v7.0.0'
+    label: 'upload-artifact v7.0.0 release notes'
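
Reviewer sketch (not part of the package): a toy model of the naming behaviour described in `root_cause`, useful for reasoning about which job layouts collide. It assumes exactly what the entry states (the `name:` input is used when archiving, the file name when `archive: false`); the function names are hypothetical and nothing here calls the real action.

```python
from collections import Counter
from pathlib import PurePosixPath


def effective_name(name: str, path: str, archive: bool) -> str:
    """Artifact name per the entry's root cause: `name` if archived, else the file name."""
    return name if archive else PurePosixPath(path).name


def colliding_names(uploads: list[tuple[str, str, bool]]) -> list[str]:
    """Effective artifact names that more than one upload in the same run would claim."""
    counts = Counter(effective_name(*upload) for upload in uploads)
    return sorted(n for n, c in counts.items() if c > 1)


# (name, path, archive) for each job's upload step.
same_file_unarchived = [
    ("report-lint", "report.html", False),
    ("report-test", "report.html", False),
]
unique_files_unarchived = [
    ("report", "report-lint.html", False),
    ("report", "report-test.html", False),
]

print(colliding_names(same_file_unarchived))
print(colliding_names(unique_files_unarchived))
```

In this model, distinct `name:` values do not help when two unarchived uploads share a file name, while distinct file names avoid the collision even with identical `name:` values.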

package/errors/silent-failures/silent-failures-115.yml ADDED

@@ -0,0 +1,130 @@
+id: silent-failures-115
+title: 'actions/cache save exits 0 and logs "Cache saved" despite 503 backend upload failures'
+category: silent-failures
+severity: silent-failure
+tags:
+  - actions/cache
+  - cache-save
+  - 503
+  - backend-error
+  - silent-success
+  - false-positive
+  - upload
+patterns:
+  - regex: 'Cache service responded with 503'
+    flags: 'i'
+  - regex: 'uploadChunk .+ failed: Cache service responded with'
+    flags: 'i'
+  - regex: 'Warning: Failed to save: uploadChunk .+ failed'
+    flags: 'i'
+error_messages:
+  - 'Warning: Failed to save: uploadChunk (start: 67108864, end: 100663295) failed: Cache service responded with 503'
+  - 'Cache saved with key: Linux-<run_id>-build'
+root_cause: |
+  When `actions/cache` (or `actions/cache/save@v4`) uploads a cache, it
+  splits the archive into chunks and uploads them in parallel using the Azure
+  SDK `BlobClient`. If the backend returns HTTP 503 errors for individual
+  chunks, the action retries once but ultimately logs the failure as a
+  `Warning:` line and continues.
+  The critical flaw is that the action then calls `commitCache()` to finalize
+  the cache entry regardless of whether all chunks uploaded successfully.
+  The cache entry is committed to the Actions service database even though
+  the underlying blob storage is corrupt or incomplete.
+  This results in a silently false "cache saved" state:
+  - The action logs `Cache saved with key: <key>` (green success)
+  - The job exits with code 0 (no failure)
+  - BUT the stored cache entry is corrupt or truncated
+  On the next run, `actions/cache/restore` (or the inline `actions/cache`
+  restore phase) either fails with a decompression/extraction error or
+  silently falls back to "Cache not found" — depending on the extent of
+  corruption. The developer sees a cache miss on the restore side and
+  investigates there, never realizing the root cause was a silent save failure
+  several runs earlier.
+  This pattern is especially harmful when:
+  - The cache backend is temporarily degraded (503 storms during peak usage)
+  - The cache key includes the run_id, making the corrupt entry unique and
+    never overwritten by a subsequent run with the same key
+  - The project has long build times that the cache was supposed to skip
+fix: |
+  1. **Add a post-save validation step** that calls the GitHub Actions cache
+     REST API to confirm the cache entry was actually committed:
+     ```bash
+     gh api "/repos/${{ github.repository }}/actions/caches?key=${{ steps.cache.outputs.cache-primary-key }}" \
+       | jq -e '.actions_caches | length > 0' || { echo "Cache save failed — no entry in API"; exit 1; }
+     ```
+  2. **Keep `save-always: false` (the default)** so that `actions/cache` only
+     saves when the job succeeded. `save-always` is an input of the main
+     `actions/cache` action, not of the `save` sub-action; if you set it to
+     `true`, be aware that backend failures are still swallowed.
+  3. **Separate save from restore** using the `actions/cache/save` and
+     `actions/cache/restore` sub-actions so you can add explicit error
+     handling around the save step via `continue-on-error: false`.
+  4. **Monitor for the upstream fix** in actions/cache#1416. Once the action
+     propagates chunk upload failures to the overall exit code, the job will
+     correctly fail on corrupt saves.
+fix_code:
+  - language: yaml
+    label: 'Validate cache was committed after save using the REST API'
+    code: |
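+      - name: Restore npm cache
+        id: cache
+        uses: actions/cache/restore@v4
+        with:
+          path: ~/.npm
+          key: ${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
+      - run: npm ci
+      - name: Save npm cache
+        if: steps.cache.outputs.cache-hit != 'true'
+        uses: actions/cache/save@v4
+        with:
+          path: ~/.npm
+          key: ${{ steps.cache.outputs.cache-primary-key }}
+      # The check must run after the save. The combined actions/cache action saves
+      # in a post-job step, after all regular steps, so the save is split out here.
+      - name: Verify cache entry was committed
+        if: steps.cache.outputs.cache-hit != 'true'
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          KEY: ${{ steps.cache.outputs.cache-primary-key }}
+        run: |
+          COUNT=$(gh api \
+            "/repos/${{ github.repository }}/actions/caches?key=${KEY}" \
+            --jq '.actions_caches | length')
+          if [ "$COUNT" -eq 0 ]; then
+            echo "::error::Cache save reported success but no entry found in API. Backend may have returned 503."
+            exit 1
+          fi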
+  - language: yaml
+    label: 'Separate save from restore to add explicit error handling'
+    code: |
+      # In your build job:
+      - name: Restore cache
+        id: cache-restore
+        uses: actions/cache/restore@v4
+        with:
+          path: ~/.npm
+          key: ${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
+      - run: npm ci
+      - name: Save cache
+        if: steps.cache-restore.outputs.cache-hit != 'true'
+        uses: actions/cache/save@v4
+        with:
+          path: ~/.npm
+          key: ${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
+        # NOTE: If this step shows "Warning: Failed to save: uploadChunk..."
+        # but still exits 0, the cache is corrupt. Add the REST API validation
+        # step above to catch this scenario.
+prevention:
+  - 'Watch for "Warning: Failed to save: uploadChunk ... failed: Cache service responded with 503" in your workflow logs — this indicates a silent corrupt save even though the step shows green.'
+  - 'If you see intermittent cache misses that cannot be explained by key changes, check whether the previous save step logged any 503 chunk-upload warnings.'
+  - 'Use time-bounded cache keys (e.g. including the week number) so a corrupt entry from a bad save is overwritten on the next successful run rather than being cached indefinitely.'
+  - 'For critical caches that must succeed, add a post-save REST API validation step to fail fast when the backend silently corrupted the save.'
+docs:
+  - url: 'https://github.com/actions/cache/issues/1416'
+    label: 'actions/cache#1416 — actions/cache and actions/cache/save consider cache uploaded successfully even with backend errors'
+  - url: 'https://docs.github.com/en/rest/actions/cache?apiVersion=2022-11-28#list-github-actions-caches-for-a-repository'
+    label: 'GitHub REST API — List GitHub Actions caches for a repository'
+  - url: 'https://github.com/actions/cache/blob/main/tips-and-workarounds.md'
+    label: 'actions/cache — Tips and Workarounds'
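
Reviewer sketch (not part of the package): a log scan for the combination this entry warns about, a chunk-upload failure warning followed by a "Cache saved" line. The regexes are adapted from the entry's `patterns` and `error_messages`; the function name `suspicious_cache_keys` and the sample log are made up for illustration.

```python
import re

# Adapted from the entry's third pattern and its "Cache saved with key" message.
CHUNK_FAILURE = re.compile(r"Failed to save: uploadChunk .+ failed", re.IGNORECASE)
SAVED = re.compile(r"Cache saved with key: (?P<key>\S+)")


def suspicious_cache_keys(log_text: str) -> list[str]:
    """Keys reported as saved in a log that also contains a chunk-upload failure."""
    lines = log_text.splitlines()
    if not any(CHUNK_FAILURE.search(line) for line in lines):
        return []
    return [m.group("key") for line in lines if (m := SAVED.search(line))]


sample_log = (
    "Warning: Failed to save: uploadChunk (start: 67108864, end: 100663295) "
    "failed: Cache service responded with 503\n"
    "Cache saved with key: Linux-123-build\n"
)

print(suspicious_cache_keys(sample_log))
```

Keys flagged this way are candidates for deletion or for a bumped cache key, since a later restore may otherwise keep hitting the damaged entry.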