dispatch_policy 0.4.0 → 0.4.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 24ab8c2fe85abc57507f84edc955c8263f59a96505522ecd9ceb6ce60e14bcba
- data.tar.gz: 152dc560f5b1169d5ef6f4a27065ae629da426ceefb0725fa6bc7a8d13c62a3f
+ metadata.gz: 55753a0af85b649115d306ab790978668958d8a4ebe43b44c4b955a00b525b3b
+ data.tar.gz: 1f12837a2f561ff28f1fa00aedcd5cf92092766b75043a1937658964a0583d90
  SHA512:
- metadata.gz: 88f5adb73f3e7bb1893eab32e098a5dda1d7b147c0cd000ee35eb6771b2d2c21f0fe1541c16725b537a1e3a1121166b2681331e7fcea79de26e3d3926137bea0
- data.tar.gz: c2b17c50ccd765c95bb6d316a68f3e7dece5134ae9ca26fcb918356e0baf611ebbfba19f695e41b1b7402f0021f0b9d3d08111c7938448efb9fb5accda0a4131
+ metadata.gz: 98ad50661323d62f22593bf323d61a18716c059a0571fcf8057e0f8b18bdd39dd58214cc399e38b4c96297539cfeb1f841675879733a0ef5d7913261a0c0d42d
+ data.tar.gz: e54665de99a9cb7c63f522bbefeb406958382831bf0a16e41daaa25ecd977fbdf215c12fd0b91411bed8a9981f454238007ea615767914d7046401c78ff9dc81
data/CHANGELOG.md CHANGED
@@ -1,5 +1,31 @@
  # Changelog
 
+ ## 0.4.1
+
+ ### Fixed
+ - Admission now regenerates `active_job_id` for each row before
+ pre-inserting `dispatch_policy_inflight_jobs` and handing the job
+ to the adapter. Adapters that use `active_job_id` as the PK of
+ their jobs table (`good_job`, `solid_queue`) would otherwise raise
+ `ActiveRecord::RecordNotUnique` on `good_jobs_pkey` /
+ `solid_queue_jobs_pkey` when a residual row from a previous
+ admission of the same staged job still existed — most commonly a
+ retry-restage (default `retry_strategy: :restage`) whose original
+ adapter row had not been finalized yet. The collision rolled back
+ the entire admission TX, the staged row returned, and the next
+ tick re-collided in a loop. The staged-side identity is
+ `staged_jobs.id`; the active_job_id only needs to be unique at
+ adapter-insert time.
+ - `record_partition_admit!` clamps the EWMA decay exponent at -700
+ so `exp()` no longer raises `value out of range: underflow` when a
+ partition has been idle for many half-lives. Postgres throws this
+ error around `exp(-746)` on double precision, and a partition that
+ sat idle long enough (e.g. a few weeks with `half_life = 60s`)
+ produced a Δt/τ ratio past that threshold; the broken UPDATE rolled
+ back the whole admission TX every tick, so the partition could
+ never drain again. -700 still yields a finite ~9.86e-305, which is
+ effectively zero for the EWMA.
+
  ## 0.3.0
 
  ### Added
data/README.md CHANGED
@@ -449,6 +449,37 @@ end
  `connected_to(role:)` when set. Staging tables and the adapter's
  table must live in the same DB for atomicity to hold.
 
+ ### Job identity across staging and adapter
+
+ `Tick.admit_partition` regenerates the ActiveJob `job_id` for every
+ claimed row immediately before pre-inserting `inflight_jobs` and
+ handing the job to the adapter. So a job has two identities through
+ its lifecycle:
+
+ - **Pre-admission** — `staged_jobs.id` (the staged-side identity) and
+ `staged_jobs.job_data->>'job_id'` (the UUID `perform_later` returned
+ to the caller).
+ - **Post-admission** — `inflight_jobs.active_job_id` and the adapter's
+ row id (`good_jobs.id` / `solid_queue_jobs.id`), both equal to the
+ newly generated UUID. This is also the `job_id` the worker observes
+ during perform.
+
+ The two UUIDs are intentionally different. Adapters that use
+ `active_job_id` as their PK (`good_job`, `solid_queue`) would
+ otherwise collide on the adapter row when a previous admission of
+ the same staged job left a residual row behind — most commonly a
+ retry-restage whose original adapter row had not been finalized yet.
+
+ The mapping is logged at debug level on every admission:
+
+ ```
+ [dispatch_policy] admit staged_id=… policy=… partition=… active_job_id: <old> -> <new>
+ ```
+
+ If you correlate jobs across the staging boundary from outside Rails,
+ use `staged_jobs.id` as the stable handle pre-admission and the
+ adapter row id (= `inflight_jobs.active_job_id`) post-admission.
+
  ## Running the tick
 
  `DispatchPolicy::TickLoop.run(policy_name:, shard:, stop_when:)` is
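The identity split described in the README hunk above can be illustrated in isolation. The following is a minimal sketch, not the gem's implementation: `admit_rows` is a hypothetical helper, and the row is a plain hash standing in for a claimed `staged_jobs` record. It shows that admitting the same staged row twice yields two distinct adapter-side ids while the staged id stays stable.

```ruby
require "securerandom"

# Hypothetical helper (not part of dispatch_policy): swaps each row's
# job_id for a fresh UUID and returns the old -> new mapping per row.
def admit_rows(rows)
  rows.map do |row|
    old_id = row["job_data"]["job_id"]
    new_id = SecureRandom.uuid
    row["job_data"]["job_id"] = new_id # mutate in place so later readers see the new id
    { staged_id: row["id"], old_job_id: old_id, new_job_id: new_id }
  end
end

# Stand-in for one claimed staged row (illustrative values only).
staged_row = { "id" => 42, "job_data" => { "job_id" => SecureRandom.uuid } }

first_admission  = admit_rows([staged_row]).first
second_admission = admit_rows([staged_row]).first # e.g. a later re-admission

puts "staged id stable:    #{first_admission[:staged_id] == second_admission[:staged_id]}"
puts "adapter ids differ:  #{first_admission[:new_job_id] != second_admission[:new_job_id]}"
```

Because each admission draws a fresh UUID, a leftover adapter row keyed by an earlier id cannot collide with the new insert, which is the property the changelog entry relies on.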
@@ -245,14 +245,27 @@ module DispatchPolicy
  if half_life_seconds && half_life_seconds.to_f.positive?
  # decay constant τ such that exp(-Δt/τ) halves every half_life:
  # τ = half_life / ln(2). NULLIF guards a degenerate τ=0.
+ #
+ # The GREATEST(..., -700) clamp keeps `exp()` from raising
+ # `value out of range: underflow` when a partition has been
+ # idle for many half-lives. Postgres throws around
+ # `exp(-746)` on double precision; -700 still yields a finite
+ # ~9.86e-305, which is effectively zero for the EWMA. Without
+ # the clamp, a partition idle long enough for Δt/τ to exceed
+ # ~746 breaks every subsequent admission UPDATE on it: Tick
+ # rolls back the whole TX, the staged rows return, and the
+ # partition never drains.
  decay_idx = params.size + 1
  admitted_idx_for_ewma = 3
  decay_tau = half_life_seconds.to_f / Math.log(2)
  params << decay_tau
  decay_sql = <<~SQL.squish
    decayed_admits = decayed_admits *
-     exp(- COALESCE(EXTRACT(EPOCH FROM (now() - decayed_admits_at)), 0)
-       / NULLIF($#{decay_idx}::double precision, 0))
+     exp(GREATEST(
+       - COALESCE(EXTRACT(EPOCH FROM (now() - decayed_admits_at)), 0)
+         / NULLIF($#{decay_idx}::double precision, 0),
+       -700
+     ))
      + $#{admitted_idx_for_ewma},
    decayed_admits_at = now(),
  SQL
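The arithmetic behind the clamp can be checked outside the database. This is a rough Ruby sketch under the same decay model as the SQL above; the helper names (`decay_tau`, `clamped_decay_factor`, `ewma_after_admit`, `idle_seconds_to_underflow`) are hypothetical and not part of the gem, and 746 is the approximate threshold quoted in the comment rather than an exact limit. Ruby's `Math.exp` returns `0.0` on underflow instead of raising, so the sketch only mirrors the math, not the Postgres error.

```ruby
# Hypothetical helpers mirroring the SQL expression; not part of dispatch_policy.
EXPONENT_FLOOR     = -700.0 # same floor the SQL passes to GREATEST
UNDERFLOW_EXPONENT = 746.0  # approximate |exponent| at which Postgres raises

# tau = half_life / ln(2), so exp(-dt / tau) halves every half_life.
def decay_tau(half_life_seconds)
  half_life_seconds.to_f / Math.log(2)
end

# Decay factor with the exponent clamped from below, like GREATEST(..., -700).
def clamped_decay_factor(idle_seconds, half_life_seconds)
  exponent = -idle_seconds.to_f / decay_tau(half_life_seconds)
  Math.exp([exponent, EXPONENT_FLOOR].max)
end

# One EWMA update: decay the previous value, then add the newly admitted count.
def ewma_after_admit(decayed_admits, idle_seconds, half_life_seconds, admitted)
  decayed_admits * clamped_decay_factor(idle_seconds, half_life_seconds) + admitted
end

# Idle time after which the unclamped exponent would pass the threshold.
def idle_seconds_to_underflow(half_life_seconds)
  UNDERFLOW_EXPONENT * decay_tau(half_life_seconds)
end

puts idle_seconds_to_underflow(60).round
puts clamped_decay_factor(10_000_000, 60)
```

Under these assumptions, `half_life = 60s` gives τ ≈ 86.6 s, so the unclamped exponent passes 746 after roughly 64,575 seconds of idleness (about 18 hours). With the clamp, even a far longer idle period yields a tiny but finite factor, and the update reduces to the newly admitted count.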
@@ -336,6 +349,16 @@ module DispatchPolicy
    values_sql << "($#{base + 1}, $#{base + 2}, $#{base + 3}, now(), now())"
    params.push(row[:policy_name], row[:partition_key], row[:active_job_id])
  end
+ # ON CONFLICT (active_job_id) DO NOTHING covers two paths that
+ # the around_perform tracker exercises on its own:
+ #   1) the around_perform inflight insert runs even when the row
+ #      was already pre-inserted by Tick (concurrency-gated policies);
+ #   2) a stale row that survived a crash gets re-inserted by the
+ #      around_perform without colliding while the sweeper is still
+ #      catching up.
+ # Admission proper can no longer collide here: Tick regenerates
+ # active_job_id before this insert, so each admission contributes a
+ # fresh UUID.
  connection.exec_query(
    <<~SQL.squish,
      INSERT INTO #{INFLIGHT_TABLE}
@@ -1,5 +1,7 @@
  # frozen_string_literal: true
 
+ require "securerandom"
+
  module DispatchPolicy
    # One pass of admission for a single policy.
    #
@@ -219,6 +221,39 @@ module DispatchPolicy
  # scheduled in the future, or another tick raced us to them).
  next if rows.empty?
 
+ # Decouple the active_job_id we hand to the adapter from the
+ # staged payload's job_id. Adapters that use active_job_id as
+ # the PK of their jobs table (good_job, solid_queue) would
+ # otherwise collide when a residual row from a previous
+ # admission of the same job still exists — most commonly a
+ # retry-restage whose original adapter row has not been
+ # finalized yet. The collision raises RecordNotUnique inside
+ # the admission TX, rolls everything back, and the staged
+ # row keeps re-colliding on every subsequent tick.
+ #
+ # The staged-side identity is staged_jobs.id; active_job_id
+ # only needs to be unique at adapter-insert time. We mutate
+ # the row's job_data in place so both the inflight pre-insert
+ # below and Forwarder.dispatch (via Serializer.deserialize)
+ # observe the new id.
+ #
+ # Logs the (staged_job_id, original_active_job_id, new_active_job_id)
+ # mapping at debug level so operators can grep-bridge the two
+ # identities when troubleshooting — `perform_later` returns the
+ # original; the adapter row and the worker logs use the new one.
+ logger = DispatchPolicy.config.logger
+ rows.each do |row|
+   old_aj_id = row["job_data"]["job_id"]
+   new_aj_id = SecureRandom.uuid
+   row["job_data"]["job_id"] = new_aj_id
+
+   logger&.debug(
+     "[dispatch_policy] admit staged_id=#{row['id']} " \
+     "policy=#{@policy_name} partition=#{partition['partition_key']} " \
+     "active_job_id: #{old_aj_id} -> #{new_aj_id}"
+   )
+ end
+
  # Pre-insert an inflight row per admitted job so the concurrency
  # gate sees them immediately. With a concurrency gate, use its
  # (coarser) partition key so the gate's COUNT(*) keeps aggregating
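The debug line emitted in the hunk above is the bridge between the two identities, so it may help to see how it could be parsed back into a mapping. This is a sketch only: `ADMIT_LINE` and `parse_admit_line` are hypothetical names, the sample line uses made-up values, and the pattern assumes the staged id, policy name, and job ids contain no whitespace.

```ruby
# Hypothetical parser for the admit debug line; not part of dispatch_policy.
# The partition group is lazy so a key containing spaces still parses.
ADMIT_LINE = /
  \[dispatch_policy\]\ admit
  \ staged_id=(?<staged_id>\S+)
  \ policy=(?<policy>\S+)
  \ partition=(?<partition>.*?)
  \ active_job_id:\ (?<old_id>\S+)\ ->\ (?<new_id>\S+)
/x

# Returns a hash of the captured fields, or nil for unrelated log lines.
def parse_admit_line(line)
  match = ADMIT_LINE.match(line)
  match && match.named_captures.transform_keys(&:to_sym)
end

# Illustrative log line; every value here is invented for the example.
sample = "[dispatch_policy] admit staged_id=42 policy=emails " \
         "partition=tenant:7 active_job_id: old-uuid -> new-uuid"

p parse_admit_line(sample)
p parse_admit_line("unrelated log line")
```

Collecting the parsed hashes by `old_id` gives a lookup from the id that `perform_later` returned to the id the adapter row and worker logs carry.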
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module DispatchPolicy
-   VERSION = "0.4.0"
+   VERSION = "0.4.1"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: dispatch_policy
  version: !ruby/object:Gem::Version
-   version: 0.4.0
+   version: 0.4.1
  platform: ruby
  authors:
  - José Galisteo