dispatch_policy 0.4.0 → 0.4.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +26 -0
- data/README.md +31 -0
- data/lib/dispatch_policy/repository.rb +25 -2
- data/lib/dispatch_policy/tick.rb +35 -0
- data/lib/dispatch_policy/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 55753a0af85b649115d306ab790978668958d8a4ebe43b44c4b955a00b525b3b
+  data.tar.gz: 1f12837a2f561ff28f1fa00aedcd5cf92092766b75043a1937658964a0583d90
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 98ad50661323d62f22593bf323d61a18716c059a0571fcf8057e0f8b18bdd39dd58214cc399e38b4c96297539cfeb1f841675879733a0ef5d7913261a0c0d42d
+  data.tar.gz: e54665de99a9cb7c63f522bbefeb406958382831bf0a16e41daaa25ecd977fbdf215c12fd0b91411bed8a9981f454238007ea615767914d7046401c78ff9dc81
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,31 @@
 # Changelog
 
+## 0.4.1
+
+### Fixed
+- Admission now regenerates `active_job_id` for each row before
+  pre-inserting `dispatch_policy_inflight_jobs` and handing the job
+  to the adapter. Adapters that use `active_job_id` as the PK of
+  their jobs table (`good_job`, `solid_queue`) would otherwise raise
+  `ActiveRecord::RecordNotUnique` on `good_jobs_pkey` /
+  `solid_queue_jobs_pkey` when a residual row from a previous
+  admission of the same staged job still existed — most commonly a
+  retry-restage (default `retry_strategy: :restage`) whose original
+  adapter row had not been finalized yet. The collision rolled back
+  the entire admission TX, the staged row returned, and the next
+  tick re-collided in a loop. The staged-side identity is
+  `staged_jobs.id`; the active_job_id only needs to be unique at
+  adapter-insert time.
+- `record_partition_admit!` clamps the EWMA decay exponent at -700
+  so `exp()` no longer raises `value out of range: underflow` when a
+  partition has been idle for many half-lives. Postgres throws this
+  error around `exp(-746)` on double precision, and a partition that
+  sat idle long enough (e.g. a few weeks with `half_life = 60s`)
+  produced a Δt/τ ratio past that threshold; the broken UPDATE rolled
+  back the whole admission TX every tick, so the partition could
+  never drain again. -700 still yields a finite ~9.86e-305, which is
+  effectively zero for the EWMA.
+
 ## 0.3.0
 
 ### Added
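The clamp arithmetic in the second changelog entry can be checked with a small Ruby sketch. This is illustrative only and is not the gem's code: the constant and method names are hypothetical, and the -746 edge is taken from the changelog rather than measured. Under those assumptions, a 60-second half-life gives τ ≈ 86.56 s, so the Δt/τ ratio passes 746 after roughly 18 hours of idleness, and the "few weeks" example in the changelog is far beyond it.

```ruby
# Illustrative sketch of the EWMA decay exponent and its clamp.
# Names are hypothetical; the gem performs this computation in SQL.

EXPONENT_FLOOR = -700.0 # clamp value named in the changelog
UNDERFLOW_EDGE = 746.0  # approximate |exponent| where Postgres exp() raises (per the changelog)

# tau = half_life / ln(2), so exp(-dt / tau) halves every half_life.
def decay_tau(half_life_seconds)
  half_life_seconds.to_f / Math.log(2)
end

# Exponent handed to exp(): -dt / tau, floored at EXPONENT_FLOOR.
# This plays the role of GREATEST(..., -700) in the SQL.
def clamped_exponent(idle_seconds, half_life_seconds)
  [-idle_seconds.to_f / decay_tau(half_life_seconds), EXPONENT_FLOOR].max
end

# Idle time after which the unclamped exponent would pass the underflow edge.
def idle_seconds_to_underflow(half_life_seconds)
  UNDERFLOW_EDGE * decay_tau(half_life_seconds)
end

half_life   = 60
three_weeks = 21 * 24 * 3600

puts format("tau                = %.2f s", decay_tau(half_life))
puts format("underflow after    = %.1f h idle", idle_seconds_to_underflow(half_life) / 3600.0)
puts format("raw exponent (3w)  = %.1f", -three_weeks / decay_tau(half_life))
puts format("clamped exponent   = %.1f", clamped_exponent(three_weeks, half_life))
puts format("decay factor       = %.3e", Math.exp(clamped_exponent(three_weeks, half_life)))
```

Short idle periods are unaffected by the floor: 30 seconds idle with a 60-second half-life gives an exponent of about -0.347 and a decay factor of about 0.707.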
data/README.md
CHANGED
@@ -449,6 +449,37 @@ end
 `connected_to(role:)` when set. Staging tables and the adapter's
 table must live in the same DB for atomicity to hold.
 
+### Job identity across staging and adapter
+
+`Tick.admit_partition` regenerates the ActiveJob `job_id` for every
+claimed row immediately before pre-inserting `inflight_jobs` and
+handing the job to the adapter. So a job has two identities through
+its lifecycle:
+
+- **Pre-admission** — `staged_jobs.id` (the staged-side identity) and
+  `staged_jobs.job_data->>'job_id'` (the UUID `perform_later` returned
+  to the caller).
+- **Post-admission** — `inflight_jobs.active_job_id` and the adapter's
+  row id (`good_jobs.id` / `solid_queue_jobs.id`), both equal to the
+  newly generated UUID. This is also the `job_id` the worker observes
+  during perform.
+
+The two UUIDs are intentionally different. Adapters that use
+`active_job_id` as their PK (`good_job`, `solid_queue`) would
+otherwise collide on the adapter row when a previous admission of
+the same staged job left a residual row behind — most commonly a
+retry-restage whose original adapter row had not been finalized yet.
+
+The mapping is logged at debug level on every admission:
+
+```
+[dispatch_policy] admit staged_id=… policy=… partition=… active_job_id: <old> -> <new>
+```
+
+If you correlate jobs across the staging boundary from outside Rails,
+use `staged_jobs.id` as the stable handle pre-admission and the
+adapter row id (= `inflight_jobs.active_job_id`) post-admission.
+
 ## Running the tick
 
 `DispatchPolicy::TickLoop.run(policy_name:, shard:, stop_when:)` is
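For correlating the two identities from log output, the debug line documented in the README hunk above can be parsed with a small script. The following is a minimal sketch and not part of dispatch_policy: `ADMIT_LINE`, `parse_admit_line`, and `build_id_bridge` are hypothetical names, and the regex assumes the line layout exactly as the README shows it, with a partition key that may contain spaces.

```ruby
# Hypothetical log-bridging helper; not part of dispatch_policy.
# Assumes the debug line format shown in the README:
#   [dispatch_policy] admit staged_id=... policy=... partition=... active_job_id: <old> -> <new>

ADMIT_LINE = /
  \[dispatch_policy\]\sadmit\s
  staged_id=(?<staged_id>\S+)\s
  policy=(?<policy>\S+)\s
  partition=(?<partition>.*?)\s
  active_job_id:\s(?<old_id>\S+)\s->\s(?<new_id>\S+)
/x

# Returns a hash with symbol keys for one log line, or nil if the line
# is not an admit line.
def parse_admit_line(line)
  match = ADMIT_LINE.match(line)
  return nil unless match

  match.named_captures.transform_keys(&:to_sym)
end

# Builds { original_job_id => new_job_id } from an enumerable of log lines.
# If the same original id appears more than once, the last line wins.
def build_id_bridge(lines)
  lines.each_with_object({}) do |line, bridge|
    parsed = parse_admit_line(line)
    bridge[parsed[:old_id]] = parsed[:new_id] if parsed
  end
end
```

A caller holding the UUID that `perform_later` returned could then look up `build_id_bridge(File.foreach("log/production.log"))[uuid]` to find the id the worker logs use; the log path here is only an example.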
data/lib/dispatch_policy/repository.rb
CHANGED
@@ -245,14 +245,27 @@ module DispatchPolicy
 if half_life_seconds && half_life_seconds.to_f.positive?
   # decay constant τ such that exp(-Δt/τ) halves every half_life:
   # τ = half_life / ln(2). NULLIF guards a degenerate τ=0.
+  #
+  # The GREATEST(..., -700) clamp keeps `exp()` from raising
+  # `value out of range: underflow` when a partition has been
+  # idle for many half-lives. Postgres throws around
+  # `exp(-746)` on double precision; -700 still yields a finite
+  # ~9.86e-305, which is effectively zero for the EWMA. Without
+  # the clamp, a partition idle long enough for Δt/τ to exceed
+  # ~746 breaks every subsequent admission UPDATE on it: Tick
+  # rolls back the whole TX, the staged rows return, and the
+  # partition never drains.
   decay_idx = params.size + 1
   admitted_idx_for_ewma = 3
   decay_tau = half_life_seconds.to_f / Math.log(2)
   params << decay_tau
   decay_sql = <<~SQL.squish
     decayed_admits = decayed_admits *
-    exp(
-
+    exp(GREATEST(
+      - COALESCE(EXTRACT(EPOCH FROM (now() - decayed_admits_at)), 0)
+      / NULLIF($#{decay_idx}::double precision, 0),
+      -700
+    ))
     + $#{admitted_idx_for_ewma},
     decayed_admits_at = now(),
   SQL
@@ -336,6 +349,16 @@ module DispatchPolicy
   values_sql << "($#{base + 1}, $#{base + 2}, $#{base + 3}, now(), now())"
   params.push(row[:policy_name], row[:partition_key], row[:active_job_id])
 end
+# ON CONFLICT (active_job_id) DO NOTHING covers two paths that
+# the around_perform tracker exercises on its own:
+# 1) the around_perform inflight insert runs even when the row
+# was already pre-inserted by Tick (concurrency-gated policies);
+# 2) a stale row that survived a crash gets re-inserted by the
+# around_perform without colliding while the sweeper is still
+# catching up.
+# Admission proper can no longer collide here: Tick regenerates
+# active_job_id before this insert, so each admission contributes a
+# fresh UUID.
 connection.exec_query(
   <<~SQL.squish,
     INSERT INTO #{INFLIGHT_TABLE}
data/lib/dispatch_policy/tick.rb
CHANGED
@@ -1,5 +1,7 @@
 # frozen_string_literal: true
 
+require "securerandom"
+
 module DispatchPolicy
   # One pass of admission for a single policy.
   #
@@ -219,6 +221,39 @@ module DispatchPolicy
 # scheduled in the future, or another tick raced us to them).
 next if rows.empty?
 
+# Decouple the active_job_id we hand to the adapter from the
+# staged payload's job_id. Adapters that use active_job_id as
+# the PK of their jobs table (good_job, solid_queue) would
+# otherwise collide when a residual row from a previous
+# admission of the same job still exists — most commonly a
+# retry-restage whose original adapter row has not been
+# finalized yet. The collision raises RecordNotUnique inside
+# the admission TX, rolls everything back, and the staged
+# row keeps re-colliding on every subsequent tick.
+#
+# The staged-side identity is staged_jobs.id; active_job_id
+# only needs to be unique at adapter-insert time. We mutate
+# the row's job_data in place so both the inflight pre-insert
+# below and Forwarder.dispatch (via Serializer.deserialize)
+# observe the new id.
+#
+# Logs the (staged_job_id, original_active_job_id, new_active_job_id)
+# mapping at debug level so operators can grep-bridge the two
+# identities when troubleshooting — `perform_later` returns the
+# original; the adapter row and the worker logs use the new one.
+logger = DispatchPolicy.config.logger
+rows.each do |row|
+  old_aj_id = row["job_data"]["job_id"]
+  new_aj_id = SecureRandom.uuid
+  row["job_data"]["job_id"] = new_aj_id
+
+  logger&.debug(
+    "[dispatch_policy] admit staged_id=#{row['id']} " \
+    "policy=#{@policy_name} partition=#{partition['partition_key']} " \
+    "active_job_id: #{old_aj_id} -> #{new_aj_id}"
+  )
+end
+
 # Pre-insert an inflight row per admitted job so the concurrency
 # gate sees them immediately. With a concurrency gate, use its
 # (coarser) partition key so the gate's COUNT(*) keeps aggregating