screwdriver-buildcluster-queue-worker 5.2.0 → 6.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -11,15 +11,14 @@ npm install screwdriver-buildcluster-queue-worker
11
11
 
12
12
  ## Build Start Workflow
13
13
 
14
- The queue worker processes build start messages from RabbitMQ and manages pod lifecycle in Kubernetes.
15
-
16
- > **See [WORKFLOW.md](WORKFLOW.md) for detailed workflow diagram with retry behavior**
14
+ The queue worker processes build start messages from RabbitMQ and manages pod lifecycle in Kubernetes with **smart retry logic** and **progressive backoff**.
17
15
 
16
+ > **See [WORKFLOW.md](WORKFLOW.md) for detailed workflow diagram with retry behavior and queue configuration**
18
17
  ### Configuration
19
18
 
20
19
  - `prefetchCount`: 20 messages per worker (default)
21
- - `buildInitTimeout`: 5 minutes (default)
22
- - `messageReprocessLimit`: 5 retries in retry queue (default)
20
+ - `initTimeout`: 5 minutes (default)
21
+ - `messageReprocessLimit`: 6 retries in retry queue (default)
23
22
 
24
23
  ## Testing
25
24
 
package/WORKFLOW.md CHANGED
@@ -1,166 +1,350 @@
1
- # Build Start Workflow - Detailed Flow
1
+ # Build Messages Processing Workflow
2
2
 
3
- ## Main Queue Processing
3
+ ## Overview
4
+
5
+ This document describes the **queue-based retry mechanism** for build pod initialization with **progressive backoff** and **smart status distinction**. The system uses RabbitMQ's native message TTL and dead-letter exchange features with per-message TTL for variable delays
6
+ to simulate delayed-queue behavior for message verification.
7
+
8
+ ### Key Features
9
+
10
+ - **Status Code Distinction**: Separates pod scheduling issues (`waiting`) from image pull delays (`initializing`)
11
+ - **Progressive Backoff**: Increasing retry delays for large image downloads (30s → 80s)
12
+ - **Timeout Tracking**: Only pod scheduling delays count against the 3-minute SLO
13
+ - **Per-Message TTL**: Allows different retry delays for different scenarios
14
+ - **Two-Queue Pattern**: Wait queue (`sdRetryQueue-wait`) with TTL → Ready queue (`sdRetryQueue`)
15
+
16
+ ## Architecture
4
17
 
5
18
  ```
6
19
  ┌─────────────────────────────────────────────────────────────────────────────┐
7
- MAIN QUEUE: Message Processing
20
+ │ QUEUE TOPOLOGY
21
+ └─────────────────────────────────────────────────────────────────────────────┘
22
+
23
+ queue-service (Redis/Resque)
24
+
25
+
26
+ ┌───────────────────────────────────────────────────────┐
27
+ │ RabbitMQ Exchange: "build" (topic) │
28
+ └───────────────────────────────────────────────────────┘
29
+
30
+ ├─────────────────┬──────────────────┬──────────────────┬──────────────────┐
31
+ │ │ │ │ │
32
+ ▼ ▼ ▼ ▼ ▼
33
+ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
34
+ │ sd │ │ sdRetry │ │sdRetry-wait │ │ sddlr │ │ default │
35
+ │ (main queue) │ │ (ready queue)│ │ (wait queue) │ │ (delay/retry)│ │ (catch-all) │
36
+ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
37
+ │ │ │ │
38
+ │ start/stop │ verify │ per-msg TTL │ delay 5s
39
+ │ TTL: 8hr │ NO queue TTL │ 30s-80s │ then → sd
40
+ │ DLX → sddlr  │ (consumers) │ DLX → sdretry │
41
+ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
42
+
43
+ │ (after per-msg TTL expires)
44
+ └────────► sdretry
45
+ ```
46
+
47
+ ## Main Queue Processing (sd)
48
+
49
+ ```
50
+ ┌─────────────────────────────────────────────────────────────────────────────┐
51
+ │ MAIN QUEUE: Start/Stop Job Processing │
8
52
  └─────────────────────────────────────────────────────────────────────────────┘
9
53
 
10
54
  ┌──────────────────┐
11
55
  │ Receive Message │
56
+ │ from sd │
12
57
  │ (prefetch=20) │
13
58
  └────────┬─────────┘
14
59
 
15
- ├─────────────────────────────────────────────────────────┐
16
- │ │
17
- ┌────────▼─────────┐ ┌─────────▼─────────┐
18
- Start Timeout │ Spawn Thread │
19
- (5 min timer) │ │ Call _start()
20
- └──────────────────┘ └─────────┬─────────┘
21
- │ │
22
- │ ┌──────────▼──────────┐
23
- │ │ Try Create K8s Pod │
24
- │ │ (POST to K8s API) │
25
- │ └──────────┬──────────┘
26
- │ │
27
- │ ┌────────────────────┼───────────────┐
28
- │ │ │ │
29
- │ ┌──────────▼─────────┐ ┌──────▼─────────────────────┐
30
- │ │ Success (201) │ │ API Error (500/503/etc) │
31
- │ │ Pod Created! │ │ Network error, K8s down │
32
- │ └──────────┬─────────┘ └──────┬─────────────────────┘
33
- │ │ │
34
- │ ┌──────────▼──────────────┐ ┌─▼──────────────────────┐
35
- │ │ Check Pod Status │ │ THROW EXCEPTION │
36
- │ │ (GET pod/status) │ │ "Failed to create pod" │
37
- │ └──────────┬──────────────┘ └─┬──────────────────────┘
38
- │ │ │
39
- │ ┌─────────────┼─────────────┐ │ .on('error')
40
- │ │ │ │ │
41
- │ ┌──────────▼─────┐ ┌───▼───┐ ┌─────▼──────┐▼──────────────────┐
42
- │ │ Pod Status: │ │ Pod: │ │ Pod Status:││ Retry < 5? │
43
- │ │ pending/running│ │failed │ │ unknown ││ YES: NACK (retry)│
44
- │ └──────────┬─────┘ └───┬───┘ └─────┬──────┘│ NO: FAILURE+ACK │
45
- │ │ │ │ └──────────────────┘
46
- │ ┌──────────▼─────┐ ┌───▼─────────────▼───┐
47
- │ │ Return TRUE │ │ Return FALSE │
48
- │ │ "Pod OK" │ │ "Status check failed"│
49
- │ └──────────┬─────┘ └───┬──────────────────┘
50
- │ │ │
51
- │ ┌──────────▼─────┐ ┌───▼──────────────┐
52
- │ │ ACK message │ │ Clear timeout │
53
- │ │ (free prefetch)│ │ ACK message │
54
- │ └──────────┬─────┘ │ Push to RETRY │
55
- │ │ │ QUEUE (verify) │
56
- │ ┌──────────▼─────┐ └───┬──────────────┘
57
- │ │ DON'T clear │ │
58
- │ │ timeout! │ │
59
- │ │ (keep monitor) │ │
60
- │ └──────────┬─────┘ │
61
- │ │ │
62
- │◄─────────────────────┘ │
63
- │ │
64
- ┌────────▼─────────┐ │
65
- │ Wait 5 minutes │ │
66
- └────────┬─────────┘ │
67
- │ │
68
- ┌────────▼───────────────────────┐ │
69
- │ Timeout Fires! │ │
70
- │ Update build statusmessage: │ │
71
- │ "Build initialization delayed" │ │
72
- └────────┬───────────────────────┘ │
73
- │ │
74
- ┌────────▼─────────┐ │
75
- │ Push to │◄───────────────────────┘
76
- │ RETRY QUEUE │
60
+
61
+ ┌──────────────────┐
62
+ │ Parse Message │
63
+ jobType: start
64
+ stop
65
+ │ clear │
77
66
  └────────┬─────────┘
78
67
 
68
+ ├──────────────────────────────────────┐
69
+ │ │
70
+ ┌────────▼─────────┐ ┌────────▼─────────┐
71
+ │ jobType=start │ │ jobType=stop │
72
+ │ │ │ jobType=clear │
73
+ └────────┬─────────┘ └────────┬─────────┘
74
+ │ │
75
+ ┌────────▼─────────┐ ┌────────▼─────────┐
76
+ │ Spawn Thread │ │ Spawn Thread │
77
+ │ Call _start() │ │ Execute job │
78
+ └────────┬─────────┘ └────────┬─────────┘
79
+ │ │
80
+ ┌────────▼────────────────┐ │
81
+ │ Create K8s Pod │ │
82
+ │ (POST to K8s API) │ │
83
+ └────────┬────────────────┘ │
84
+ │ │
85
+ ┌────────┼──────────────────┐ │
86
+ │ │ │ │
87
+ ▼ ▼ ▼ │
88
+ ┌─────────────┐ ┌──────────────────┐ │
89
+ │ Success │ │ K8s API Error │ │
90
+ │ (201) │ │ Network timeout │ │
91
+ └─────┬───────┘ └──────────┬───────┘ │
92
+ │ │ │
93
+ │ ▼ │
94
+ │ ┌────────────────────┐ │
95
+ │ │ .on('error') │ │
96
+ │ │ retryCount < 3? │ │
97
+ │ │ YES: NACK (retry) │ │
98
+ │ │ NO: FAILURE + ACK │ │
99
+ │ └────────────────────┘ │
100
+ │ │
101
+ ▼ │
102
+ ┌──────────────────────────────┐ │
103
+ │ Pod created successfully │ │
104
+ │ .on('message') │ │
105
+ └──────────┬───────────────────┘ │
106
+ │ │
107
+ ▼ │
108
+ ┌──────────────────────────────┐ │
109
+ │ ACK message immediately │◄──────────────────┘
110
+ │ (free up prefetch slot) │
111
+ └──────────┬───────────────────┘
112
+
113
+
114
+ ┌─────────────────────────────────────────────────────────────┐
115
+ │ Push to sdretry-wait for verification │
116
+ │ - Add header: x-build-start-time = Date.now() │
117
+ │ - Add header: x-retry-count = 0 │
118
+ │ - Set per-message TTL: 30 seconds (expiration property) │
119
+ │ - Publishes to: sdretry-wait (not sdretry directly!) │
120
+ └──────────┬──────────────────────────────────────────────────┘
121
+
122
+
123
+
124
+ ┌─────────────────────────────────────────────────────────────┐
125
+ │ WAIT QUEUE: sdretry-wait (waits for TTL to expire) │
126
+ │ - Message sits here for TTL duration (30s default) │
127
+ │ - When TTL expires → Dead-letter to sdretry │
128
+ └─────────────────────────────────────────────────────────────┘
129
+
130
+ │ (after TTL expires)
131
+
132
+ ┌─────────────────────────────────────────────────────────────┐
133
+ │ RETRY QUEUE: sdretry (ready for consumption) │
134
+ │ - Consumer picks up message for pod verification │
135
+ └─────────────────────────────────────────────────────────────┘
136
+ ```
137
+
138
+ ## Retry Queue Processing (sdretry)
139
+
140
+ ```
141
+ ┌─────────────────────────────────────────────────────────────────────────────┐
142
+ │ RETRY QUEUE: Pod Verification & Status Check │
143
+ └─────────────────────────────────────────────────────────────────────────────┘
144
+
145
+ ┌──────────────────────────────────┐
146
+ │ Consumer picks up message │
147
+ │ from sdretry │
148
+ │ Headers: x-build-start-time │
149
+ │ x-retry-count │
150
+ └──────────┬───────────────────────┘
151
+
152
+
153
+ ┌──────────────────────────────────┐
154
+ │ Check retry count │
155
+ │ retryCount = x-retry-count || 0 │
156
+ │ if retryCount >= 6: FAIL │
157
+ └──────────┬───────────────────────┘
158
+
159
+
160
+ ┌──────────────────────────────────┐
161
+ │ Spawn Thread │
162
+ │ Call _verify() │
163
+ └──────────┬───────────────────────┘
164
+
165
+ ┌──────────▼──────────────┐
166
+ │ Get Pod Status │
167
+ │ (GET pods?labelSelector)│
168
+ └──────────┬──────────────┘
169
+
170
+ ┌──────────┼────────────────────────────────┐
171
+ │ │ │
172
+ ▼ ▼ ▼
173
+ ┌─────────────────────┐ ┌──────────────────────────┐
174
+ │ Status: 'waiting' │ │ Status: 'initializing' │
175
+ │ (pod not scheduled) │ │ (pod pulling image) │
176
+ └──────────┬──────────┘ └──────────┬───────────────┘
177
+ │ │
178
+ ▼ ▼
179
+ ┌────────────────────────────────────────────────────┐
180
+ │ Check Init Timeout │
181
+ │ ONLY for 'waiting' │
182
+ │ elapsed = now - x-build-start-time │
183
+ │ if elapsed >= 3min: TIMEOUT │
184
+ └────────────┬───────────────────────────────────────┘
79
185
 
80
- ┌────────────▼─────────────────────────────────────────────────────────────────┐
81
- RETRY QUEUE: Pod Verification
82
- └──────────────────────────────────────────────────────────────────────────────┘
83
-
84
- ┌────────────────────┐
85
- Receive Message
86
- │ from Retry Queue │
87
- └─────────┬──────────┘
88
-
89
- ┌─────────▼──────────┐
90
- Spawn Thread
91
- Call _verify()
92
- └─────────┬──────────┘
93
-
94
- ┌─────────▼────────────────┐
95
- │ Try Get Pod Status │
96
- (GET pods?labelSelector) │
97
- └─────────┬────────────────┘
98
-
99
- ┌─────────┼────────────────────────────┐
100
-
101
- ┌───▼─────────────┐ ┌─────────▼────────────────┐
102
- Success │ │ API Error (K8s API down)
103
- │ Got pod status │ │ Network issue │
104
- └───┬─────────────┘ └─────────┬────────────────┘
105
- │ │
106
- │ ┌─────────▼────────────────┐
107
- THROW EXCEPTION
108
- .on('error')
109
- └─────────┬────────────────┘
110
- │ │
111
- │ ┌─────────▼────────────────┐
112
- │ │ Retry < 5? │
113
- │ │ YES: NACK (retry verify)
114
- │ │ NO: FAILURE + ACK │
115
- │ └──────────────────────────┘
116
-
117
-
118
- ┌─────────────────────────────────────────────────────────────────┐
119
- │ Check Pod Status & Container Waiting Reason │
120
- └─────────┬────────────────────────────────────────────────────────┘
121
-
122
- ┌─────┴──────────┬────────────────┬───────────────┬─────────────────┐
123
- │ │ │ │ │
124
- ┌───▼────────────┐ ┌▼──────────┐ ┌─▼────────────┐ ┌▼───────────────┐ ┌▼──────────────┐
125
- │ Pod Status: │ │ Pod: │ │ Pod: │ Pod: │ │ Pod: │
126
- │ running/ │ │ failed/ │ │ pending + │ │ pending + │ │ pending + │
127
- succeeded │ unknown │ │ ErrImagePull │ │ CrashLoopBack │ │ PodInitializing│
128
- └───┬────────────┘ └┬──────────┘ └─┬────────────┘ └┬───────────────┘ └┬──────────────┘
129
- │ │ │ │ │
130
- ┌───▼────────────┐ ┌▼────────────────────────────────▼──────────────────▼──────────────┐
131
- Return EMPTY │ Return ERROR MESSAGE │
132
- (success) │ │ "Build failed to start..."
133
- └───┬────────────┘ └┬───────────────────────────────────────────────────────────────────┘
134
-
135
- ┌───▼────────────┐ ┌▼────────────────┐ ┌─────────▼──────────┐
136
- │ ACK message │ │ Update build to │ Return EMPTY │
137
- (build OK) │ │ FAILURE │ (allow more time │
138
- └────────────────┘ ACK message │ for image pull) │
139
- └─────────────────┘ └─────────┬──────────┘
140
-
141
- ┌─────────▼──────────┐
142
- ACK message │
143
- (pod still healthy
144
- may take 10+ min)
145
- └────────────────────┘
186
+ ┌────────┼─────────┐
187
+
188
+ ▼ ▼ ▼
189
+ ┌─────────────┐ ┌──────────────┐
190
+ │ Timeout! │ │ Within time │
191
+ elapsed>=3m │ elapsed<3m │
192
+ └─────┬───────┘ └──────┬───────┘
193
+ │ │
194
+ ▼ ▼
195
+ ┌──────────────────┐ ┌────────────────────────────────┐
196
+ FAIL BUILD │ Retry with appropriate delay │
197
+ "Pod scheduling │ │
198
+ │ timeout exceeded"│ │ 'waiting': Fixed 30s delay │
199
+ ACK + Stop │ │ 'initializing': Progressive │
200
+ └──────────────────┘ │ 30s + (retryCount × 10s) │
201
+ └─────────┬──────────────────────┘
202
+
203
+
204
+ ┌───────────────────────────────┐
205
+ │ ACK current message │
206
+ Publish to sdretry-wait
207
+ │ with new TTL (expiration) │
208
+ and x-retry-count += 1
209
+ └───────────┬───────────────────┘
210
+
211
+
212
+ ┌───────────────────────────────┐
213
+ Message waits in sdretry-wait
214
+ for TTL duration
215
+ Then dead-letter → sdretry │
216
+ └───────────────────────────────┘
217
+
218
+ Other status codes:
219
+ '' (empty string) → ACK (success, pod running)
220
+ Error message → ACK + Update build → FAILURE
221
+ ```
222
+
223
+ ## Pod Status Decision Tree
224
+
225
+ ```
226
+ ┌─────────────────────────────────────────────────────────────────────────────┐
227
+ POD VERIFICATION LOGIC (_verify in executor-k8s/index.js) │
228
+ └─────────────────────────────────────────────────────────────────────────────┘
229
+
230
+ Check Pod Status
231
+
232
+ ┌────┴──────────────────────────────────────────────────┐
233
+
234
+ ▼ ▼
235
+ Container Waiting Reason? Pod Phase?
236
+ │ │
237
+ ├─ ErrImagePull ──────────┐
238
+ ├─ ImagePullBackOff ───────┼────► FAIL FAST
239
+ ├─ InvalidImageName ────────┘ "Check your image" │
240
+
241
+ ├─ CrashLoopBackOff ───────┐ │
242
+ ├─ CreateContainerError ────┼────► FAIL FAST
243
+ ├─ StartError ──────────────┘ "Contact admin"
244
+
245
+ └─ (none/other) ────────────────────────────────────────┼──► Check phase
246
+
247
+ ├─ Running ──────► SUCCESS ('')
248
+ ├─ Succeeded ────► SUCCESS ('')
249
+ ├─ Failed ───────► FAILURE (error msg)
250
+ ├─ Unknown ──────► FAILURE (error msg)
251
+
252
+ └─ Pending ──┐
253
+
254
+ ┌───────────────▼──────────────┐
255
+ │ Has nodeName assigned? │
256
+ └───────────────┬──────────────┘
257
+
258
+ ┌────────────────────┼────────────────────┐
259
+ │ │ │
260
+ ▼ ▼ ▼
261
+ ┌───────────────┐ ┌─────────────────┐ ┌──────────────┐
262
+ │ nodeName: NO │ │ nodeName: YES │ │ Other cases │
263
+ │ (not sched) │ │ (initializing) │ │ │
264
+ └───────┬───────┘ └────────┬────────┘ └──────┬───────┘
265
+ │ │ │
266
+ ▼ ▼ ▼
267
+ ┌───────────────┐ ┌─────────────────┐ ┌──────────────┐
268
+ │ Return │ │ Return │ │ Fail or │
269
+ │ 'waiting' │ │ 'initializing' │ │ other status │
270
+ │ │ │ │ │ │
271
+ │ (pod waiting │ │ (pod pulling │ └──────────────┘
272
+ │ to schedule) │ │ image) │
273
+ └───────────────┘ └─────────────────┘
274
+
275
+ Status Code Meanings:
276
+ - '' (empty string) → Pod is running successfully
277
+ - 'waiting' → Pod not scheduled (counts against 3min timeout)
278
+ - 'initializing' → Pod pulling image (progressive backoff, no timeout)
279
+ - Error message string → Immediate failure (ImagePullBackOff, CrashLoopBackOff, etc.)
280
+ ```
281
+
282
+ ## Queue Configuration
283
+
284
+ ### RabbitMQ Queue Definitions
285
+
286
+ **sdQueue** (main queue for consumers):
287
+ ```json
288
+ {
289
+ "name": "sdQueue",
290
+ "vhost": "screwdriver",
291
+ "durable": true,
292
+ "auto_delete": false,
293
+ "arguments": {
294
+ "x-dead-letter-exchange": "build",
295
+ "x-dead-letter-routing-key": "sdQueuedlr",
296
+ "x-max-priority": 3,
297
+ "x-message-ttl": 28800000
298
+ }
299
+ }
300
+ ```
301
+ **sdQueuedlr** (DLR queue for consumers, for messages that fail to be ACK'd):
302
+ ```json
303
+ {
304
+ "name": "sdQueuedlr",
305
+ "vhost": "screwdriver",
306
+ "durable": true,
307
+ "auto_delete": false,
308
+ "arguments": {
309
+ "x-dead-letter-exchange": "build",
310
+ "x-dead-letter-routing-key": "sdQueue",
311
+ "x-max-priority": 3,
312
+ "x-message-ttl": 5000,
313
+ "x-queue-mode": "lazy"
314
+ }
315
+ }
316
+ ```
317
+
318
+ **sdRetryQueue** (ready queue for consumers):
319
+ ```json
320
+ {
321
+ "name": "sdRetryQueue",
322
+ "vhost": "screwdriver",
323
+ "durable": true,
324
+ "auto_delete": false,
325
+ "arguments": {
326
+ "x-max-priority": 3,
327
+ "x-queue-type": "classic"
328
+ }
329
+ }
330
+ ```
331
+
332
+ **IMPORTANT**: `sdRetryQueue` must NOT have `x-message-ttl` to allow per-message TTL!
333
+
334
+ **sdRetryQueue-wait** (wait queue with dead-letter routing):
335
+ ```json
336
+ {
337
+ "name": "sdretry-wait",
338
+ "vhost": "screwdriver",
339
+ "durable": true,
340
+ "auto_delete": false,
341
+ "arguments": {
342
+ "x-dead-letter-exchange": "build",
343
+ "x-dead-letter-routing-key": "sdretry",
344
+ "x-max-priority": 3,
345
+ "x-queue-type": "classic"
346
+ }
347
+ }
146
348
  ```
147
349
 
148
- ## Key Points
149
-
150
- ### Main Queue Retries (NACK):
151
- - **When**: Pod creation throws exception (K8s API error, network issue)
152
- - **Why**: Pod was never created, safe to retry
153
- - **How many**: Up to 5 times via RabbitMQ requeue
154
- - **After max retries**: Update build to FAILURE and ACK
155
-
156
- ### Retry Queue Retries (NACK):
157
- - **When**: _verify() throws exception (can't get pod status from K8s)
158
- - **Why**: Transient API issue, pod might be fine
159
- - **How many**: Up to 5 times via RabbitMQ requeue
160
- - **After max retries**: Update build to FAILURE and ACK
161
-
162
- ### No Retries (ACK immediately):
163
- - Pod created successfully (pending/running status) → main queue
164
- - Pod status check failed (pod exists but failed/unknown) → main queue → retry queue
165
- - Verify detects failed pod (returns error message) → retry queue
166
- - Verify detects healthy pod (returns empty) → retry queue
350
+ # TODO: Use Delayed queue plugin https://github.com/rabbitmq/rabbitmq-delayed-message-exchange
@@ -359,12 +359,16 @@ rabbitmq:
359
359
  messageReprocessLimit: RABBITMQ_MSG_REPROCESS_LIMIT
360
360
  # Queue name of the retry queue
361
361
  retryQueue: RABBITMQ_RETRYQUEUE
362
+ # Queue name of the delayed retry queue
363
+ retryDelayedQueue: RABBITMQ_RETRYDELAYEDQUEUE
362
364
  # retry queue enable/disable flag
363
365
  retryQueueEnabled: RABBITMQ_RETRYQUEUE_ENABLED
364
366
  # Exchange / router name for rabbitmq
365
367
  exchange: RABBITMQ_EXCHANGE
366
368
  # build pod initialization timeout
367
369
  initTimeout: RABBITMQ_BUILD_INIT_TIMEOUT
370
+ # delay between retries in seconds
371
+ retryDelay: RABBITMQ_RETRY_DELAY
368
372
  httpd:
369
373
  # Port to listen on
370
374
  port: PORT
@@ -240,15 +240,19 @@ rabbitmq:
240
240
  # Prefetch count
241
241
  prefetchCount: "20"
242
242
  # Message reprocess limit - max retry for a message
243
- messageReprocessLimit: "3"
243
+ messageReprocessLimit: "6" # short wait but more retries
244
244
  # Queue name of the retry queue
245
245
  retryQueue: sdRetryQueue
246
+ # Queue name of the delayed retry queue
247
+ retryDelayedQueue: sdRetryQueue-wait
246
248
  # retry queue enable/disable flag
247
249
  retryQueueEnabled: false
248
250
  # Exchange / router name for rabbitmq
249
251
  exchange: build
250
252
  # build pod initialization timeout in minutes
251
253
  initTimeout: "5"
254
+ # delay between retries in seconds
255
+ retryDelay: "30"
252
256
  httpd:
253
257
  # Port to listen on
254
258
  port: 80
package/lib/config.js CHANGED
@@ -18,9 +18,11 @@ const {
18
18
  prefetchCount,
19
19
  messageReprocessLimit,
20
20
  retryQueue,
21
+ retryDelayedQueue,
21
22
  retryQueueEnabled,
22
23
  exchange,
23
- initTimeout
24
+ initTimeout,
25
+ retryDelay
24
26
  } = rabbitmqConfig;
25
27
  const amqpURI = `${protocol}://${username}:${password}@${host}:${port}${vhost}`;
26
28
 
@@ -60,7 +62,9 @@ function getConfig() {
60
62
  retryQueue,
61
63
  retryQueueEnabled: convertToBool(retryQueueEnabled),
62
64
  exchange,
63
- initTimeout: Number(initTimeout) || 5
65
+ initTimeout: Number(initTimeout) || 5,
66
+ retryDelay: Number(retryDelay) || 30,
67
+ retryDelayedQueue
64
68
  };
65
69
  }
66
70
 
@@ -3,7 +3,7 @@
3
3
  const amqp = require('amqp-connection-manager');
4
4
  const logger = require('screwdriver-logger');
5
5
  const config = require('./config');
6
- const { amqpURI, connectOptions, retryQueue, exchange, retryQueueEnabled } = config.getConfig();
6
+ const { amqpURI, connectOptions, exchange, retryQueueEnabled, retryDelayedQueue } = config.getConfig();
7
7
 
8
8
  let retryQueueConn;
9
9
 
@@ -29,12 +29,16 @@ function getRetryQueueConn() {
29
29
  }
30
30
 
31
31
  /**
32
- * Pushes a message to the retry queue
32
+ * Pushes a message to the retry wait queue (delay queue)
33
+ * Messages will sit in the wait queue for the specified delay before being routed to the ready queue
33
34
  * @param {message} buildConfig build config
34
35
  * @param {messageId} messageId id of the message queue
36
+ * @param {number} delayMs delay in milliseconds (supports dynamic delays for progressive backoff)
37
+ * @param {number} retryCount current retry count (optional, defaults to 0)
38
+ * @param {number} buildStartTime timestamp when build verification started (optional)
35
39
  * @returns {Promise} resolves to null or error
36
40
  */
37
- async function push(buildConfig, messageId) {
41
+ async function push(buildConfig, messageId, delayMs = 30000, retryCount = 0, buildStartTime = null) {
38
42
  if (!retryQueueEnabled) {
39
43
  return Promise.resolve();
40
44
  }
@@ -49,20 +53,37 @@ async function push(buildConfig, messageId) {
49
53
  setup: channel => channel.checkExchange(exchange)
50
54
  });
51
55
 
52
- logger.info('publishing msg to retry queue: %s', messageId);
56
+ // Publish to the WAIT queue, not the ready queue with per-message TTL
57
+ const waitQueue = retryDelayedQueue;
58
+ const delaySec = (delayMs / 1000).toFixed(0);
59
+
60
+ logger.info('publishing msg to retry wait queue: %s (will delay %ss)', messageId, delaySec);
61
+
62
+ // Add headers for timeout tracking and retry count
63
+ const headers = {
64
+ 'x-build-start-time': buildStartTime || Date.now(),
65
+ 'x-retry-count': retryCount
66
+ };
53
67
 
54
68
  return channelWrapper
55
- .publish(exchange, retryQueue, message, {
69
+ .publish(exchange, waitQueue, message, {
56
70
  contentType: 'application/json',
57
- persistent: true
71
+ persistent: true,
72
+ headers,
73
+ expiration: String(delayMs)
58
74
  })
59
75
  .then(() => {
60
- logger.info('successfully publishing msg id %s -> queue %s', messageId, retryQueue);
76
+ logger.info(
77
+ 'successfully published msg id %s -> wait queue %s (delay: %ss)',
78
+ messageId,
79
+ waitQueue,
80
+ delaySec
81
+ );
61
82
 
62
83
  return channelWrapper.close();
63
84
  })
64
85
  .catch(err => {
65
- logger.error('publishing failed to retry queue: %s', err.message);
86
+ logger.error('publishing failed to retry wait queue: %s', err.message);
66
87
  channelWrapper.close();
67
88
 
68
89
  throw err;
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "screwdriver-buildcluster-queue-worker",
3
- "version": "5.2.0",
3
+ "version": "6.0.0",
4
4
  "description": "An amqp connection manager implementation that consumes jobs from Rabbitmq queue.",
5
5
  "main": "index.js",
6
6
  "scripts": {
package/receiver.js CHANGED
@@ -17,12 +17,13 @@ const {
17
17
  cachePath,
18
18
  retryQueue,
19
19
  retryQueueEnabled,
20
- initTimeout
20
+ exchange,
21
+ initTimeout,
22
+ retryDelay
21
23
  } = config.getConfig();
22
24
  const { spawn } = threads;
23
25
  const CACHE_STRATEGY_DISK = 'disk';
24
26
  let channelWrapper;
25
- const INIT_TIMEOUT = initTimeout * 60 * 1000; // milliseconds
26
27
 
27
28
  /**
28
29
  * onMessage consume messages in batches, once its available in the queue. channelWrapper has in-built back pressure
@@ -105,68 +106,22 @@ const onMessage = data => {
105
106
  }
106
107
  }
107
108
 
108
- let timeoutWarningLogged = false;
109
- let timeoutTimer = null;
110
-
111
- if (jobType === 'start') {
112
- timeoutTimer = setTimeout(async () => {
113
- if (!timeoutWarningLogged) {
114
- timeoutWarningLogged = true;
115
- const timeoutMessage = `Build initialization timeout exceeded (${initTimeout}min) for ${job}`;
116
-
117
- logger.error(timeoutMessage);
118
-
119
- // Update build statusmessage only to show delayed initialization
120
- try {
121
- await helper.updateBuildStatusAsync(
122
- buildConfig,
123
- undefined,
124
- 'Build initialization delayed - pod creation taking longer than expected'
125
- );
126
- logger.info(`Build status updated with delay warning for build ${buildId}`);
127
- } catch (err) {
128
- logger.error(
129
- `Failed to update build status with delay warning for build:${buildId}:${err}`
130
- );
131
- }
132
-
133
- // Push to retry queue for verification and potential failure
134
- // This allows verify to check pod status and fail if still pending
135
- logger.info(`Pushing ${job} to retry queue for verification after timeout`);
136
- retryQueueLib.push(buildConfig, buildId);
137
- }
138
- }, INIT_TIMEOUT);
139
- }
140
-
141
109
  thread
142
110
  .send([jobType, buildConfig, job])
143
111
  .on('message', successful => {
144
112
  logger.info(`acknowledge, job completed for ${job}, result: ${successful}`);
145
113
 
146
- if (!successful && jobType === 'start') {
147
- // Pod failed immediately (status check returned false)
148
- // Clear timeout and push to retry queue for immediate verification
149
- if (timeoutTimer) {
150
- clearTimeout(timeoutTimer);
151
- }
152
- retryQueueLib.push(buildConfig, buildId);
153
- } else if (successful && jobType === 'start') {
154
- // Pod created successfully - DON'T clear timeout
155
- // Let the timeout fire to verify pod eventually started
156
- // This handles pods that get stuck in pending after creation
157
- logger.info(`Timeout remains active for ${job}, will verify after ${initTimeout}min`);
158
- } else if (timeoutTimer) {
159
- // For non-start jobs (stop, verify), or other cases, clear timeout normally
160
- clearTimeout(timeoutTimer);
114
+ if (jobType === 'start') {
115
+ logger.info(`Pushing ${job} to retry queue for verification`);
116
+ retryQueueLib.push(buildConfig, buildId).catch(err => {
117
+ logger.error(`Failed to push to retry queue for ${job}: ${err.message}`);
118
+ });
161
119
  }
162
120
 
163
121
  channelWrapper.ack(data);
164
122
  thread.kill();
165
123
  })
166
124
  .on('error', async error => {
167
- if (timeoutTimer) {
168
- clearTimeout(timeoutTimer);
169
- }
170
125
  thread.kill();
171
126
  if (['403', '404'].includes(error.message.substring(0, 3))) {
172
127
  channelWrapper.ack(data);
@@ -220,25 +175,155 @@ const onRetryMessage = async data => {
220
175
 
221
176
  logger.info(`processing ${job}`);
222
177
 
178
+ const buildStartTime =
179
+ data.properties.headers && data.properties.headers['x-build-start-time']
180
+ ? data.properties.headers['x-build-start-time']
181
+ : null;
182
+ const initTimeoutMs = initTimeout * 60 * 1000;
183
+
223
184
  if (typeof data.properties.headers !== 'undefined') {
224
185
  if (Object.keys(data.properties.headers).length > 0) {
225
- retryCount = data.properties.headers['x-death'][0].count;
226
- logger.info(`retrying ${retryCount}(${messageReprocessLimit}) for ${job}`);
186
+ if (data.properties.headers['x-retry-count']) {
187
+ retryCount = data.properties.headers['x-retry-count'];
188
+ logger.info(`retrying ${retryCount}(${messageReprocessLimit}) for ${job}`);
189
+ } else if (data.properties.headers['x-death']) {
190
+ retryCount = data.properties.headers['x-death'][0].count;
191
+ logger.info(`retrying ${retryCount}(${messageReprocessLimit}) for ${job}`);
192
+ }
227
193
  }
228
194
  }
195
+
229
196
  thread
230
197
  .send([jobType, buildConfig, job])
231
198
  .on('message', async message => {
232
199
  logger.info(`acknowledge, job completed for ${job}, result: ${message}`);
233
- if (message) {
200
+
201
+ if (message === 'waiting') {
202
+ // Pod not scheduled - check timeout
203
+ if (buildStartTime) {
204
+ const elapsedMs = Date.now() - buildStartTime;
205
+ const elapsedMinutes = (elapsedMs / 1000 / 60).toFixed(2);
206
+
207
+ logger.info(
208
+ `Build ${buildId} pod not scheduled yet, elapsed: ${elapsedMinutes}min, timeout: ${initTimeout}min`
209
+ );
210
+
211
+ if (elapsedMs >= initTimeoutMs) {
212
+ // Timeout exceeded - fail immediately
213
+ logger.error(
214
+ `Build ${buildId} pod scheduling timeout exceeded: ${elapsedMinutes}min > ${initTimeout}min`
215
+ );
216
+
217
+ // metric for alerting
218
+ logger.error(
219
+ `[BUILD_SCHEDULING_FAILURE] buildId=${buildId} elapsed_minutes=${elapsedMinutes} ` +
220
+ `timeout_minutes=${initTimeout} retry_count=${retryCount}`
221
+ );
222
+
223
+ try {
224
+ await helper.updateBuildStatusAsync(
225
+ buildConfig,
226
+ 'FAILURE',
227
+ `Build failed to start within ${initTimeout} minutes (elapsed: ${elapsedMinutes} minutes). Pod was not scheduled - cluster may be out of capacity.`
228
+ );
229
+ logger.info(`Build ${buildId} marked as FAILURE due to pod scheduling timeout`);
230
+ } catch (err) {
231
+ logger.error(`Failed to update build status to FAILURE for build:${buildId}:${err}`);
232
+ }
233
+ channelWrapper.ack(data);
234
+ thread.kill();
235
+
236
+ return;
237
+ }
238
+ }
239
+
240
+ // Timeout not exceeded - retry with delay
241
+ if (retryCount >= messageReprocessLimit) {
242
+ logger.error(
243
+ `Build ${buildId} max retries (${messageReprocessLimit}) exceeded while waiting for pod scheduling`
244
+ );
245
+
246
+ // metric for alerting
247
+ logger.error(
248
+ `[BUILD_SCHEDULING_FAILURE] buildId=${buildId} elapsed_minutes=` +
249
+ `${((Date.now() - buildStartTime) / 1000 / 60).toFixed(2)} max_retries=${retryCount}`
250
+ );
251
+
252
+ try {
253
+ await helper.updateBuildStatusAsync(
254
+ buildConfig,
255
+ 'FAILURE',
256
+ 'Build failed to start. Pod was not scheduled after maximum retries - cluster may be out of capacity.'
257
+ );
258
+ logger.info(`Build ${buildId} marked as FAILURE due to max retries`);
259
+ } catch (err) {
260
+ logger.error(`Failed to update build status to FAILURE for build:${buildId}:${err}`);
261
+ }
262
+ channelWrapper.ack(data);
263
+ } else {
264
+ const nextRetryCount = retryCount + 1;
265
+
266
+ logger.info(
267
+ `Build ${buildId} pod not scheduled, retrying ${nextRetryCount}/${messageReprocessLimit} in ${retryDelay}s`
268
+ );
269
+ channelWrapper.ack(data);
270
+
271
+ // Re-publish to retry queue with incremented retry count
272
+ retryQueueLib
273
+ .push(buildConfig, buildId, retryDelay * 1000, nextRetryCount, buildStartTime)
274
+ .catch(err => {
275
+ logger.error(`Failed to re-publish to retry queue for ${job}: ${err.message}`);
276
+ });
277
+ }
278
+ } else if (message === 'initializing') {
279
+ // Pod is initializing (pulling image) - use progressive backoff for large images
280
+ if (retryCount >= messageReprocessLimit) {
281
+ logger.error(
282
+ `Build ${buildId} max retries (${messageReprocessLimit}) exceeded while pod initializing/pulling image`
283
+ );
284
+ try {
285
+ await helper.updateBuildStatusAsync(
286
+ buildConfig,
287
+ 'FAILURE',
288
+ 'Build failed to start. Pod initialization timeout - pod may be stuck pulling a large image or container startup is slow.'
289
+ );
290
+ logger.info(`Build ${buildId} marked as FAILURE due to max retries during initialization`);
291
+ } catch (err) {
292
+ logger.error(`Failed to update build status to FAILURE for build:${buildId}:${err}`);
293
+ }
294
+ channelWrapper.ack(data);
295
+ } else {
296
+ const nextRetryCount = retryCount + 1;
297
+
298
+ const baseDelayMs = retryDelay * 1000;
299
+ const incrementMs = 10000 * retryCount;
300
+ const delayMs = baseDelayMs + incrementMs;
301
+ const delaySec = (delayMs / 1000).toFixed(0);
302
+
303
+ logger.info(
304
+ `Build ${buildId} pod still initializing/pulling image, retrying ${nextRetryCount}/${messageReprocessLimit} in ${delaySec}s (progressive backoff)`
305
+ );
306
+ channelWrapper.ack(data);
307
+
308
+ // Re-publish to retry queue with incremented retry count and progressive delay
309
+ retryQueueLib.push(buildConfig, buildId, delayMs, nextRetryCount, buildStartTime).catch(err => {
310
+ logger.error(`Failed to re-publish to retry queue for ${job}: ${err.message}`);
311
+ });
312
+ }
313
+ } else if (message && message !== '') {
314
+ // Pod has failed - update build status and ack
234
315
  try {
235
316
  await helper.updateBuildStatusAsync(buildConfig, 'FAILURE', message);
236
317
  logger.info(`build status successfully updated for build ${buildId}`);
237
318
  } catch (err) {
238
319
  logger.error(`Failed to update build status to FAILURE for build:${buildId}:${err}`);
239
320
  }
321
+ channelWrapper.ack(data);
322
+ } else {
323
+ // Empty string means pod is running successfully - ack
324
+ logger.info(`pod started successfully for ${job}, acknowledging`);
325
+ channelWrapper.ack(data);
240
326
  }
241
- channelWrapper.ack(data);
242
327
  thread.kill();
243
328
  })
244
329
  .on('error', async error => {
@@ -288,7 +373,11 @@ const listen = async () => {
288
373
  const queueFn = [channel.checkQueue(queue), channel.prefetch(prefetchCount), channel.consume(queue, onMessage)];
289
374
 
290
375
  if (retryQueueEnabled) {
291
- queueFn.push(channel.checkQueue(retryQueue), channel.consume(retryQueue, onRetryMessage));
376
+ queueFn.push(
377
+ channel.checkQueue(retryQueue),
378
+ channel.bindQueue(retryQueue, exchange, retryQueue),
379
+ channel.consume(retryQueue, onRetryMessage)
380
+ );
292
381
  }
293
382
 
294
383
  return Promise.all(queueFn);