
dispatch_policy 0.4.0 → 0.4.1


Files changed (7)

checksums.yaml +4 -4
data/CHANGELOG.md +26 -0
data/README.md +31 -0
data/lib/dispatch_policy/repository.rb +25 -2
data/lib/dispatch_policy/tick.rb +35 -0
data/lib/dispatch_policy/version.rb +1 -1
metadata +1 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 24ab8c2fe85abc57507f84edc955c8263f59a96505522ecd9ceb6ce60e14bcba
-  data.tar.gz: 152dc560f5b1169d5ef6f4a27065ae629da426ceefb0725fa6bc7a8d13c62a3f
+  metadata.gz: 55753a0af85b649115d306ab790978668958d8a4ebe43b44c4b955a00b525b3b
+  data.tar.gz: 1f12837a2f561ff28f1fa00aedcd5cf92092766b75043a1937658964a0583d90
 SHA512:
-  metadata.gz: 88f5adb73f3e7bb1893eab32e098a5dda1d7b147c0cd000ee35eb6771b2d2c21f0fe1541c16725b537a1e3a1121166b2681331e7fcea79de26e3d3926137bea0
-  data.tar.gz: c2b17c50ccd765c95bb6d316a68f3e7dece5134ae9ca26fcb918356e0baf611ebbfba19f695e41b1b7402f0021f0b9d3d08111c7938448efb9fb5accda0a4131
+  metadata.gz: 98ad50661323d62f22593bf323d61a18716c059a0571fcf8057e0f8b18bdd39dd58214cc399e38b4c96297539cfeb1f841675879733a0ef5d7913261a0c0d42d
+  data.tar.gz: e54665de99a9cb7c63f522bbefeb406958382831bf0a16e41daaa25ecd977fbdf215c12fd0b91411bed8a9981f454238007ea615767914d7046401c78ff9dc81

data/CHANGELOG.md CHANGED

@@ -1,5 +1,31 @@
 # Changelog
+## 0.4.1
+### Fixed
+- Admission now regenerates `active_job_id` for each row before
+  pre-inserting `dispatch_policy_inflight_jobs` and handing the job
+  to the adapter. Adapters that use `active_job_id` as the PK of
+  their jobs table (`good_job`, `solid_queue`) would otherwise raise
+  `ActiveRecord::RecordNotUnique` on `good_jobs_pkey` /
+  `solid_queue_jobs_pkey` when a residual row from a previous
+  admission of the same staged job still existed — most commonly a
+  retry-restage (default `retry_strategy: :restage`) whose original
+  adapter row had not been finalized yet. The collision rolled back
+  the entire admission TX, the staged row returned, and the next
+  tick re-collided in a loop. The staged-side identity is
+  `staged_jobs.id`; the active_job_id only needs to be unique at
+  adapter-insert time.
+- `record_partition_admit!` clamps the EWMA decay exponent at -700
+  so `exp()` no longer raises `value out of range: underflow` when a
+  partition has been idle for many half-lives. Postgres throws this
+  error around `exp(-746)` on double precision, and a partition that
+  sat idle long enough (e.g. a few weeks with `half_life = 60s`)
+  produced a Δt/τ ratio past that threshold; the broken UPDATE rolled
+  back the whole admission TX every tick, so the partition could
+  never drain again. -700 still yields a finite ~9.86e-305, which is
+  effectively zero for the EWMA.
 ## 0.3.0
 ### Added
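The underflow entry above rests on a small piece of arithmetic: with τ = half_life / ln(2), the exponent Δt/τ passes roughly 746 once the partition has been idle for about 746 · τ seconds. The sketch below works that out for `half_life = 60s`. The helper name `idle_seconds_to_reach` is hypothetical and not part of the gem. It suggests the threshold is reached after roughly 18 hours, so the "few weeks" in the entry is comfortably past it.

```ruby
# Hypothetical helper for illustration; not part of dispatch_policy.
LN2 = Math.log(2)

# Idle time (seconds) after which dt / tau reaches the given magnitude,
# where tau = half_life / ln(2) as in the changelog entry.
def idle_seconds_to_reach(exponent_magnitude, half_life_seconds)
  tau = half_life_seconds / LN2
  exponent_magnitude * tau
end

threshold = idle_seconds_to_reach(746, 60.0)
puts format("idle time for dt/tau = 746: %.0f s (%.1f h)", threshold, threshold / 3600)

# The clamp value from the fix still gives a finite, tiny decay factor.
puts format("exp(-700) = %.3e", Math.exp(-700))
```

Note that Ruby's `Math.exp` returns `0.0` on underflow instead of raising, so the error itself is specific to the Postgres side.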

data/README.md CHANGED

@@ -449,6 +449,37 @@ end
 `connected_to(role:)` when set. Staging tables and the adapter's
 table must live in the same DB for atomicity to hold.
+### Job identity across staging and adapter
+`Tick.admit_partition` regenerates the ActiveJob `job_id` for every
+claimed row immediately before pre-inserting `inflight_jobs` and
+handing the job to the adapter. So a job has two identities through
+its lifecycle:
+- **Pre-admission** — `staged_jobs.id` (the staged-side identity) and
+  `staged_jobs.job_data->>'job_id'` (the UUID `perform_later` returned
+  to the caller).
+- **Post-admission** — `inflight_jobs.active_job_id` and the adapter's
+  row id (`good_jobs.id` / `solid_queue_jobs.id`), both equal to the
+  newly generated UUID. This is also the `job_id` the worker observes
+  during perform.
+The two UUIDs are intentionally different. Adapters that use
+`active_job_id` as their PK (`good_job`, `solid_queue`) would
+otherwise collide on the adapter row when a previous admission of
+the same staged job left a residual row behind — most commonly a
+retry-restage whose original adapter row had not been finalized yet.
+The mapping is logged at debug level on every admission:
+```
+[dispatch_policy] admit staged_id=… policy=… partition=… active_job_id: <old> -> <new>
+```
+If you correlate jobs across the staging boundary from outside Rails,
+use `staged_jobs.id` as the stable handle pre-admission and the
+adapter row id (= `inflight_jobs.active_job_id`) post-admission.
 ## Running the tick
 `DispatchPolicy::TickLoop.run(policy_name:, shard:, stop_when:)` is
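For correlating the two identities from outside Rails, one option is to parse the debug line shown in the README hunk above. This is a minimal sketch, assuming the line format is exactly as documented and that none of the fields contain whitespace. The constant `ADMIT_LINE`, the helper `parse_admit_line`, and every value in the sample line are made up for illustration.

```ruby
# Regex for the documented debug line. Assumes no whitespace inside fields.
ADMIT_LINE = /\[dispatch_policy\] admit staged_id=(?<staged_id>\S+) policy=(?<policy>\S+) partition=(?<partition>\S+) active_job_id: (?<old_id>\S+) -> (?<new_id>\S+)/

# Returns a Hash of the captured fields, or nil when the line does not match.
def parse_admit_line(line)
  match = ADMIT_LINE.match(line)
  match&.named_captures
end

# Hypothetical sample line; the ids and names are invented.
sample = "[dispatch_policy] admit staged_id=42 policy=emails partition=tenant-7 " \
         "active_job_id: 11111111-1111-4111-8111-111111111111 -> 22222222-2222-4222-8222-222222222222"

mapping = parse_admit_line(sample)
puts "staged #{mapping['staged_id']}: #{mapping['old_id']} -> #{mapping['new_id']}"
```

Here `old_id` is the UUID `perform_later` returned to the caller and `new_id` is the one the adapter row and the worker see.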

data/lib/dispatch_policy/repository.rb CHANGED

@@ -245,14 +245,27 @@ module DispatchPolicy
       if half_life_seconds && half_life_seconds.to_f.positive?
         # decay constant τ such that exp(-Δt/τ) halves every half_life:
         # τ = half_life / ln(2). NULLIF guards a degenerate τ=0.
+        #
+        # The GREATEST(..., -700) clamp keeps `exp()` from raising
+        # `value out of range: underflow` when a partition has been
+        # idle for many half-lives. Postgres throws around
+        # `exp(-746)` on double precision; -700 still yields a finite
+        # ~9.86e-305, which is effectively zero for the EWMA. Without
+        # the clamp, a partition idle long enough for Δt/τ to exceed
+        # ~746 breaks every subsequent admission UPDATE on it: Tick
+        # rolls back the whole TX, the staged rows return, and the
+        # partition never drains.
         decay_idx        = params.size + 1
         admitted_idx_for_ewma = 3
         decay_tau        = half_life_seconds.to_f / Math.log(2)
         params << decay_tau
         decay_sql = <<~SQL.squish
           decayed_admits     = decayed_admits *
-                                exp(- COALESCE(EXTRACT(EPOCH FROM (now() - decayed_admits_at)), 0)
-                                     / NULLIF($#{decay_idx}::double precision, 0))
+                                exp(GREATEST(
+                                  - COALESCE(EXTRACT(EPOCH FROM (now() - decayed_admits_at)), 0)
+                                    / NULLIF($#{decay_idx}::double precision, 0),
+                                  -700
+                                ))
                               + $#{admitted_idx_for_ewma},
           decayed_admits_at  = now(),
         SQL
@@ -336,6 +349,16 @@ module DispatchPolicy
         values_sql << "($#{base + 1}, $#{base + 2}, $#{base + 3}, now(), now())"
         params.push(row[:policy_name], row[:partition_key], row[:active_job_id])
       end
+      # ON CONFLICT (active_job_id) DO NOTHING covers two paths that
+      # the around_perform tracker exercises on its own:
+      #   1) the around_perform inflight insert runs even when the row
+      #      was already pre-inserted by Tick (concurrency-gated policies);
+      #   2) a stale row that survived a crash gets re-inserted by the
+      #      around_perform without colliding while the sweeper is still
+      #      catching up.
+      # Admission proper can no longer collide here: Tick regenerates
+      # active_job_id before this insert, so each admission contributes a
+      # fresh UUID.
       connection.exec_query(
         <<~SQL.squish,
           INSERT INTO #{INFLIGHT_TABLE}
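The clamped EWMA update in the first hunk of this file can be modeled in plain Ruby to see what the `GREATEST(..., -700)` floor does to the stored value. This is only a sketch of the SQL expression, not code from the gem. `EXPONENT_FLOOR` and `decayed_admits_after` are illustrative names, and the NULL handling that `COALESCE` and `NULLIF` provide in SQL is left out.

```ruby
# Illustrative model of: decayed_admits * exp(GREATEST(-dt / tau, -700)) + admitted
EXPONENT_FLOOR = -700.0

def decayed_admits_after(previous, idle_seconds, half_life_seconds, admitted)
  tau = half_life_seconds.to_f / Math.log(2)           # tau = half_life / ln(2)
  exponent = [-idle_seconds.to_f / tau, EXPONENT_FLOOR].max
  previous * Math.exp(exponent) + admitted
end

# One half-life of idleness halves the previous value.
puts decayed_admits_after(10.0, 60, 60, 0)

# Thirty days idle with a 60s half-life: the exponent is clamped and the old
# value contributes effectively nothing, leaving only the new admits.
puts decayed_admits_after(10.0, 30 * 86_400, 60, 3)
```

With the floor in place, a long-idle partition simply restarts its EWMA from the newly admitted count.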

data/lib/dispatch_policy/tick.rb CHANGED

@@ -1,5 +1,7 @@
 # frozen_string_literal: true
+require "securerandom"
 module DispatchPolicy
   # One pass of admission for a single policy.
   #
@@ -219,6 +221,39 @@ module DispatchPolicy
           # scheduled in the future, or another tick raced us to them).
           next if rows.empty?
+          # Decouple the active_job_id we hand to the adapter from the
+          # staged payload's job_id. Adapters that use active_job_id as
+          # the PK of their jobs table (good_job, solid_queue) would
+          # otherwise collide when a residual row from a previous
+          # admission of the same job still exists — most commonly a
+          # retry-restage whose original adapter row has not been
+          # finalized yet. The collision raises RecordNotUnique inside
+          # the admission TX, rolls everything back, and the staged
+          # row keeps re-colliding on every subsequent tick.
+          #
+          # The staged-side identity is staged_jobs.id; active_job_id
+          # only needs to be unique at adapter-insert time. We mutate
+          # the row's job_data in place so both the inflight pre-insert
+          # below and Forwarder.dispatch (via Serializer.deserialize)
+          # observe the new id.
+          #
+          # Logs the (staged_job_id, original_active_job_id, new_active_job_id)
+          # mapping at debug level so operators can grep-bridge the two
+          # identities when troubleshooting — `perform_later` returns the
+          # original; the adapter row and the worker logs use the new one.
+          logger = DispatchPolicy.config.logger
+          rows.each do |row|
+            old_aj_id = row["job_data"]["job_id"]
+            new_aj_id = SecureRandom.uuid
+            row["job_data"]["job_id"] = new_aj_id
+            logger&.debug(
+              "[dispatch_policy] admit staged_id=#{row['id']} " \
+              "policy=#{@policy_name} partition=#{partition['partition_key']} " \
+              "active_job_id: #{old_aj_id} -> #{new_aj_id}"
+            )
+          end
           # Pre-insert an inflight row per admitted job so the concurrency
           # gate sees them immediately. With a concurrency gate, use its
           # (coarser) partition key so the gate's COUNT(*) keeps aggregating
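The collision that the comment in this hunk describes can be reproduced in miniature. The sketch below uses `FakeAdapterTable`, a hypothetical in-memory stand-in for a jobs table keyed by `active_job_id`. It does not use the good_job or solid_queue APIs, and `admit` is an illustrative helper, not the gem's `Tick.admit_partition`.

```ruby
require "securerandom"
require "set"

# Hypothetical stand-in for an adapter jobs table whose PK is active_job_id.
class FakeAdapterTable
  class RecordNotUnique < StandardError; end

  def initialize
    @ids = Set.new
  end

  # Set#add? returns nil when the id is already present.
  def insert!(id)
    raise RecordNotUnique, "duplicate key: #{id}" unless @ids.add?(id)
    id
  end
end

# Mirrors the in-place mutation of job_data["job_id"] from the hunk above.
def admit(row, table, regenerate:)
  row["job_data"]["job_id"] = SecureRandom.uuid if regenerate
  table.insert!(row["job_data"]["job_id"])
end

original_id = SecureRandom.uuid
staged_row  = { "id" => 1, "job_data" => { "job_id" => original_id } }
table       = FakeAdapterTable.new

admit(staged_row, table, regenerate: false)   # first admission succeeds
begin
  admit(staged_row, table, regenerate: false) # re-admission hits the residual row
rescue FakeAdapterTable::RecordNotUnique => e
  puts "without regeneration: #{e.message}"
end

first  = admit(staged_row, table, regenerate: true)
second = admit(staged_row, table, regenerate: true)
puts "with regeneration: #{first} then #{second}"
```

The staged row's `id` stays the same throughout, which is why it remains the stable handle on the staging side.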

data/lib/dispatch_policy/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module DispatchPolicy
-  VERSION = "0.4.0"
+  VERSION = "0.4.1"
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: dispatch_policy
 version: !ruby/object:Gem::Version
-  version: 0.4.0
+  version: 0.4.1
 platform: ruby
 authors:
 - José Galisteo