dispatch_policy 0.4.3 → 0.5.0

Files changed (37)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +185 -0
  3. data/README.md +30 -7
  4. data/app/controllers/dispatch_policy/application_controller.rb +21 -2
  5. data/app/controllers/dispatch_policy/dashboard_controller.rb +3 -0
  6. data/app/controllers/dispatch_policy/partitions_controller.rb +51 -15
  7. data/app/controllers/dispatch_policy/policies_controller.rb +26 -4
  8. data/app/models/dispatch_policy/policy_setting.rb +14 -0
  9. data/app/views/dispatch_policy/dashboard/index.html.erb +6 -1
  10. data/app/views/dispatch_policy/partitions/index.html.erb +1 -1
  11. data/app/views/dispatch_policy/partitions/show.html.erb +1 -1
  12. data/app/views/dispatch_policy/policies/index.html.erb +11 -3
  13. data/app/views/dispatch_policy/policies/show.html.erb +13 -4
  14. data/app/views/dispatch_policy/shared/_partition_row.html.erb +9 -2
  15. data/app/views/layouts/dispatch_policy/application.html.erb +21 -25
  16. data/db/migrate/20260501000001_create_dispatch_policy_tables.rb +13 -0
  17. data/lib/dispatch_policy/config.rb +5 -0
  18. data/lib/dispatch_policy/context.rb +12 -2
  19. data/lib/dispatch_policy/cursor_pagination.rb +24 -7
  20. data/lib/dispatch_policy/gates/adaptive_concurrency.rb +14 -0
  21. data/lib/dispatch_policy/gates/concurrency.rb +4 -0
  22. data/lib/dispatch_policy/gates/throttle.rb +36 -9
  23. data/lib/dispatch_policy/inflight_tracker.rb +72 -26
  24. data/lib/dispatch_policy/job_extension.rb +33 -9
  25. data/lib/dispatch_policy/manual_admission.rb +18 -0
  26. data/lib/dispatch_policy/operator_hints.rb +14 -0
  27. data/lib/dispatch_policy/policy.rb +12 -0
  28. data/lib/dispatch_policy/policy_dsl.rb +10 -2
  29. data/lib/dispatch_policy/railtie.rb +10 -0
  30. data/lib/dispatch_policy/registry.rb +8 -4
  31. data/lib/dispatch_policy/repository.rb +102 -30
  32. data/lib/dispatch_policy/tick.rb +18 -2
  33. data/lib/dispatch_policy/tick_loop.rb +15 -7
  34. data/lib/dispatch_policy/version.rb +1 -1
  35. data/lib/generators/dispatch_policy/install/templates/create_dispatch_policy_tables.rb.tt +9 -0
  36. data/lib/generators/dispatch_policy/install/templates/dispatch_tick_loop_job.rb.tt +30 -2
  37. metadata +2 -1
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 23433a64c963b0e0908c185ad8dc8e6f97edbd8d476ee712d15023f74ba0e338
4
- data.tar.gz: 64ff19e04a6d02b0f1eedb4fb6d74b0e073e3773efb9a3afc92ae1a3e9002aeb
3
+ metadata.gz: aa4e5f3f353ac2b0de80f0e1c584aa7d53395f75ad4644355cf14070bf36cb71
4
+ data.tar.gz: 01607ab2a98c331f791b46f6c11badbddeec17532d85d78c24d3987020945cae
5
5
  SHA512:
6
- metadata.gz: e168e049dbb0d399dddc6e84427b7b557474d9ce10cb1983d3f4f24c6fde43ffda9c03c179b4590d9f09d225a8adeb5d7295a10788898ebc4ab0bc47a765163c
7
- data.tar.gz: d8ef9debaebdf89de7cce28e5fa669484acafd27e4b5e65ff959f6f177c1aaa124cb1215da191c3a6759094aac467980489241062f907574860b7226dd9dbc9a
6
+ metadata.gz: c7ead479e4a623510eee9a4cfcf3a58e1cec0fb387239a8149743990b4bc560f9cab4883ffa2fa93bf9eb81a580cbc458425fdaa3f356c708f037e04341ebd52
7
+ data.tar.gz: b6c18523ea1f59184631bc7f0b471fdead3ce48166f65217e46c48de40214fc76a6364c5faf6d9c2b236f3b725cdde3cee7197d84fa2110c8fc7bda4aa891e7a
data/CHANGELOG.md CHANGED
@@ -1,5 +1,190 @@
1
1
  # Changelog
2
2
 
3
+ ## Unreleased
4
+
5
+ ## 0.5.0
6
+
7
+ ### Upgrade notes
8
+ - **New table `dispatch_policy_policy_settings`.** Required by the
9
+ policy-level pause fix below. New installs get it from the updated
10
+ install generator. **Existing installs must add it** — the gem ships a
11
+ single migration, so either re-copy the migration via
12
+ `rails dispatch_policy:install:migrations` (or hand-apply) or run:
13
+
14
+ ```ruby
15
+ create_table :dispatch_policy_policy_settings do |t|
16
+ t.string :policy_name, null: false
17
+ t.boolean :paused, null: false, default: false
18
+ t.timestamps
19
+ end
20
+ add_index :dispatch_policy_policy_settings, :policy_name,
21
+ unique: true, name: "idx_dp_policy_settings_lookup"
22
+ ```
23
+
24
+ Until the table exists, the tick's `claim_partitions` raises
25
+ `PG::UndefinedTable`. One row per policy holds its pause flag; it's the
26
+ policy-wide source of truth `claim_partitions` consults.
27
+
28
+ ### Added
29
+ - The `:throttle` gate's `per` now accepts a lambda (like `rate`), so the
30
+ rate-limit window can depend on per-job context. A resolved `per <= 0`
31
+ raises.
32
+ - Policy-level **pause** now actually holds the whole policy. The pause
33
+ flag lives in the new `dispatch_policy_policy_settings` table and is
34
+ honored by `claim_partitions`, so it also stops partitions that first
35
+ appear *after* the pause — previously `pause` only flipped the `status`
36
+ of partition rows that existed at click time, and a tenant's first
37
+ enqueue afterwards created an `active` partition the next tick admitted.
38
+ The per-partition `status` update is kept for the partitions-index
39
+ display; `resume` clears the flag.
40
+ - The admin UI now reflects the policy-level pause flag everywhere
41
+ (policies index + show, dashboard policy rows, partitions index + show):
42
+ partitions created after a pause render as effectively paused even
43
+ though their own `status` is still `active`, the pause/resume button
44
+ toggles to a single relevant action, and `policies#show` shows a PAUSED
45
+ badge. The per-policy operator hints also short-circuit to a single
46
+ "policy is paused" note instead of falsely warning about never-checked
47
+ partitions / growing backlog while admission is intentionally stopped.
48
+
49
+ ### Fixed
50
+ - **The admin UI honors `config.database_role`.** The engine controllers
51
+ query the gem tables through the AR models directly (`Partition`,
52
+ `StagedJob`, `InflightJob`, `PolicySetting`, `TickSample`), which the
53
+ `Repository` role wrapper doesn't cover — under multi-DB every dashboard
54
+ page queried the default writing role (`PG::UndefinedTable` → 500), and
55
+ `pause`/`resume` updated the partition `status` in the wrong DB while
56
+ the policy flag went to the right one. An `around_action` in the
57
+ engine's `ApplicationController` now wraps every action — including view
58
+ rendering, so lazily-evaluated relations stay routed — in
59
+ `Repository.with_connection`. No-op without `database_role`.
60
+ - **`pause`/`resume` write the policy flag and the partition statuses in
61
+ one transaction.** They were two autocommitted statements; a crash
62
+ between them left the partition list contradicting what admission
63
+ actually does until the next toggle.
64
+ - **The generated `DispatchTickLoopJob` no longer dies after its first run
65
+ under good_job.** It re-enqueues itself at the end of `perform`, but
66
+ `good_job_control_concurrency_with(total_limit: 1)` counts the
67
+ still-running job in its enqueue check (`unfinished`), so the successor
68
+ was silently aborted and admission stopped after `tick_max_duration`.
69
+ Switched to `enqueue_limit: 1` + `perform_limit: 1` (the enqueue check
70
+ excludes the running job) and the job now logs an error if a re-enqueue
71
+ is ever refused. solid_queue was unaffected.
72
+ - **`Tick#record_sample!` routes its two AR-model reads through
73
+ `config.database_role`.** They bypassed the `Repository` role wrapper, so
74
+ under a separate queue DB they queried the wrong role and the swallowed
75
+ error meant no `tick_sample` was ever written (empty dashboard/metrics).
76
+ - **Multi-DB (`config.database_role`) is now honored everywhere.** It was
77
+ only applied at the three admission-TX boundaries (`Tick`,
78
+ `ManualAdmission`), leaving staging, partition claim, inflight
79
+ counts/tracking, sweeps and dashboard reads on the default writing role.
80
+ Under a separate queue DB (e.g. `solid_queue`) with the gem tables
81
+ there, staging wrote one DB while the tick read another — silent job
82
+ loss — and the concurrency gate counted inflight rows in a different DB
83
+ than the tracker wrote them to. Every public `Repository` method now
84
+ opens inside `connected_to(role:)`; `InflightTracker`'s direct access
85
+ (lookup + heartbeat thread) is routed too.
86
+ - **A policy may declare each gate type at most once.** Two gates of the
87
+ same type shared a single `gate_state` key (both throttles wrote
88
+ `gate_state["throttle"]`), so the merged patch kept only the last gate's
89
+ bucket and the other then saw a permanently full bucket — silently
90
+ defeating the stricter limit (the classic 10/min + 600/hour idiom).
91
+ `Policy#validate!` now raises `InvalidPolicy`; use separate policies for
92
+ multi-window limits.
93
+ - **Bulk `perform_all_later` correctness.** A job whose declared policy
94
+ wasn't registered was silently dropped (neither staged nor sent to the
95
+ adapter); jobs were marked `successfully_enqueued` before the INSERT
96
+ committed; and the bulk path ignored `bypass_retries`. It now mirrors
97
+ the single path: unstageable jobs fall through to the adapter, the
98
+ enqueued flag is set only after `stage_many!` returns, and retries on a
99
+ `:bypass` policy skip staging.
100
+ - **`ManualAdmission.force!` pre-inserts inflight rows** in the same
101
+ transaction as the claim, like the Tick. Without it the concurrency
102
+ gate under-counted force-admitted jobs (UI admit/drain) until each one
103
+ started performing — an over-admission window proportional to the
104
+ backlog drained.
105
+ - **Inflight rows are reaped when a job is discarded before performing.**
106
+ `discard_on ActiveJob::DeserializationError` (and any discard) fires
107
+ during argument deserialization, before `around_perform`, so
108
+ `InflightTracker.track`'s `ensure` never ran and the Tick's pre-inserted
109
+ row sat until the `inflight_queued_stale_after` sweeper (1h), holding a
110
+ concurrency slot. The railtie now subscribes to `discard.active_job` and
111
+ deletes the row by `active_job_id`.
112
+ - **`throttle` no longer busy-loops on a zero/nil rate.** A `rate` of `0`
113
+ or `nil` (e.g. a paused tenant) denied with a NULL `retry_after`, which
114
+ left the partition immediately eligible — re-claimed and re-evaluated
115
+ every tick — and clobbered any existing backoff. It now backs off one
116
+ `per` window, and `bulk_record_partition_denies!` preserves the existing
117
+ `next_eligible_at` when `retry_after` is NULL instead of nulling it.
118
+ - **`throttle` rate is read as `Float`.** A fractional rate (e.g. `2.5`)
119
+ kept its fractional part instead of truncating every refill (systematic
120
+ under-admission), and a sub-unit rate (`rate: 0.5`) accumulates a whole
121
+ token and admits instead of truncating to `0` and denying forever.
122
+ - **`adaptive_concurrency` validates its tuning knobs.** Out-of-range
123
+ values silently inverted the AIMD loop: `ewma_alpha: 0` froze the EWMA
124
+ at its seed so the cap grew unbounded, and a decrease factor `>= 1`
125
+ turned the multiplicative *decrease* into a positive-feedback *increase*
126
+ under failure/overload. The constructor now requires
127
+ `0 < ewma_alpha <= 1` and `0 < failure/overload_decrease_factor < 1`.
128
+ - **`partitions#admit` bounds its count.** An unbounded `count` forced a
129
+ single `DELETE…RETURNING` + dispatch of the whole backlog in one
130
+ transaction (bypassing the batching/cap that `drain` uses), and a
131
+ non-numeric value 500'd. It's now clamped to `[1, 10_000]` with a
132
+ fallback to `1`.
133
+ - **Forged timestamp pagination cursors no longer 500.** A non-parseable
134
+ string on a `stale`/`recent` sort bound into a timestamp column and
135
+ raised `invalid input syntax for type timestamp`. `CursorPagination`
136
+ now requires a parseable ISO8601 value for timestamp sorts, falling back
137
+ to the first page otherwise.
138
+ - `stage_many!` chunks its INSERT into batches of 1,000 rows so a bulk
139
+ `perform_all_later` larger than ~8,191 jobs no longer blows Postgres's
140
+ 65,535 bind-param limit and fails the whole batch.
141
+ - `InflightTracker.track` now inserts the inflight row and spawns the
142
+ heartbeat inside its `begin/ensure`, so a failure spawning the heartbeat
143
+ thread can't leave a ghost inflight row behind until the sweeper.
144
+ - `Registry` reads (`fetch`/`names`/`each`/`size`) take the same mutex as
145
+ `register`/`clear` (snapshotting before iterating in `each`), removing a
146
+ data race on non-GVL runtimes (JRuby/TruffleRuby).
147
+ - The DSL rejects `tick_admission_budget`/`admission_batch_size` of `0` or
148
+ negative (a silent full stop of the policy) and the `concurrency` /
149
+ `adaptive_concurrency` gates reject a negative `full_backoff` (which
150
+ would put `next_eligible_at` in the past and re-evaluate every tick).
151
+ `nil` still defers to config.
152
+ - The policy-wide drain passes its remaining budget to each partition so
153
+ the total can't overshoot the 10,000 cap by nearly 2×, and a drain that
154
+ only leaves future-scheduled jobs now says "N scheduled for later
155
+ remain" instead of looping "click drain again" forever.
156
+ - `partitions#show` lists recent staged jobs in the real admission order
157
+ (`priority DESC, scheduled_at NULLS FIRST, id`) and drops a dead,
158
+ mis-scoped `@inflight` query.
159
+ - `Context` now exposes indifferent (symbol/string) access at every depth,
160
+ not just the top level — `ctx[:limits][:max]` no longer silently returns
161
+ nil when the host wrote a nested hash with symbol keys. `to_jsonb`/`to_h`
162
+ still return the plain string-keyed hash for storage.
163
+ - The tick loop survives misconfigured pacing: `sweep_every_ticks <= 0`
164
+ now means "never sweep" instead of raising `ZeroDivisionError`, and a
165
+ negative `idle_pause`/`busy_pause` is treated as no pause instead of
166
+ raising in `sleep`. Both previously escaped the loop's rescues and
167
+ stopped admission.
168
+ - Pass-2 budget redistribution denies (e.g. a throttle emptied after
169
+ pass-1) now feed the tick sample's denied-reason breakdown, so the
170
+ dashboard reflects why redistribution stopped.
171
+ - Admin UI: `format_count` keeps the sign of negative values; durations
172
+ clamp at 0 so app↔DB clock skew can't render "-340ms"; the partition
173
+ search escapes `%`/`_` so a literal key containing them matches
174
+ literally; and the refresh/theme controls bind via a single delegated
175
+ document listener instead of per-button (Turbo's morph refresh dropped
176
+ the `data-bound` guard, leaking a new listener per refresh).
177
+ - Dummy app: the throttle demos (`slow_api`, `mixed`) honor the form's
178
+ `per` field via the new callable `per` instead of a hardcoded window
179
+ (`slow_api` was stuck at 60000s), and the enqueue forms tolerate blank
180
+ numeric fields / unknown job names instead of 500ing.
181
+
182
+ ### Internal
183
+ - Corrected the `bulk_record_partition_denies!` comment: `claim_partitions`
184
+ runs autocommitted, so its `FOR UPDATE SKIP LOCKED` locks don't guard the
185
+ end-of-tick deny flush — the one-tick-loop-per-(policy,shard) invariant
186
+ and the `last_checked_at` bump do.
187
+
3
188
  ## 0.4.3
4
189
 
5
190
  ### Fixed
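The `stage_many!` entry in the 0.5.0 notes rests on a small piece of arithmetic: a multi-row INSERT spends one bind parameter per column per row, so the row ceiling is the parameter limit divided by the column count. The sketch below reproduces the "~8,191" figure and the 1,000-row batching. It assumes 8 bind parameters per staged row (inferred from that figure, not stated in the changelog), and the constant and method names are hypothetical rather than the gem's.

```ruby
# Illustrative only: constants and method names are hypothetical, not the gem's.
PG_MAX_BIND_PARAMS = 65_535     # Postgres bind-parameter ceiling cited in the changelog
ASSUMED_PARAMS_PER_ROW = 8      # assumption: inferred from the "~8,191 jobs" figure
INSERT_BATCH_SIZE = 1_000       # batch size named in the changelog

# Largest single multi-row INSERT that still fits under the ceiling.
def max_rows_per_insert(params_per_row = ASSUMED_PARAMS_PER_ROW)
  PG_MAX_BIND_PARAMS / params_per_row
end

# Split rows into INSERT-sized batches; each batch would become one statement.
def insert_batches(rows, batch_size: INSERT_BATCH_SIZE)
  rows.each_slice(batch_size).to_a
end

rows = Array.new(20_500) { |i| { job_id: i } }
batches = insert_batches(rows)

puts max_rows_per_insert # 8191
puts batches.size        # 21
puts batches.last.size   # 500
```

At 1,000 rows per statement each batch would use 8,000 parameters under this assumption, which leaves a wide margin below the limit even if the real column count is somewhat higher.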
data/README.md CHANGED
@@ -210,6 +210,12 @@ Gates run in declared order; each narrows the survivor count. Every
210
210
  option that takes a value can alternatively take a lambda receiving
211
211
  the `ctx` hash, so parameters can depend on per-job data.
212
212
 
213
+ A policy may declare each gate type **at most once** — two gates of the
214
+ same type would share a `gate_state` key and corrupt each other's
215
+ persisted state, so the policy raises `InvalidPolicy` at definition
216
+ time. For multi-window rate limiting (e.g. 10/min *and* 600/hour), use
217
+ separate policies.
218
+
213
219
  ### `:throttle` — token-bucket rate limit per partition
214
220
 
215
221
  Refills `rate` tokens every `per` seconds, capped at `rate` (no
@@ -223,9 +229,20 @@ gate :throttle,
223
229
  per: 1.minute
224
230
  ```
225
231
 
232
+ Both `rate` and `per` accept a lambda receiving the `ctx`, so the rate
233
+ limit and its window can depend on per-job data (e.g. a per-tenant plan
234
+ that sets both). A `per` that resolves to `<= 0` raises.
235
+
226
236
  Throttle does **not** release tokens on completion — tokens refill
227
237
  only with elapsed time.
228
238
 
239
+ `rate` may be fractional (e.g. `2.5`): the bucket keeps the fractional
240
+ part so the long-run rate is exact rather than truncated. A sub-unit
241
+ rate works too — the bucket holds at least one whole token, so e.g.
242
+ `rate: 1, per: 2.seconds` admits one job every two seconds. A `rate`
243
+ of `0` (or `nil`) denies and backs the partition off for one `per`
244
+ window. Prefer expressing low rates via a longer `per`.
245
+
229
246
  ### `:concurrency` — in-flight cap per partition
230
247
 
231
248
  Caps the number of admitted-but-not-yet-completed jobs per partition.
@@ -445,9 +462,12 @@ DispatchPolicy.configure do |c|
445
462
  end
446
463
  ```
447
464
 
448
- `Repository.with_connection` wraps the admission TX in
449
- `connected_to(role:)` when set. Staging tables and the adapter's
450
- table must live in the same DB for atomicity to hold.
465
+ When set, **every** DB access the gem makes runs inside
466
+ `connected_to(role:)`: staging on `perform_later`, the admission TX,
467
+ inflight tracking and its heartbeat thread, sweeps, and the admin UI
468
+ (an `around_action` routes each dashboard request, so its reads and
469
+ operator actions hit the same DB the tick writes). Staging tables and
470
+ the adapter's table must live in the same DB for atomicity to hold.
451
471
 
452
472
  ### Job identity across staging and adapter
453
473
 
@@ -509,7 +529,10 @@ Mount the engine and visit `/dispatch_policy`:
509
529
  ("avg tick at 88% of tick_max_duration — shard or lower
510
530
  admission_batch_size").
511
531
  - **Policies** — per-policy throughput, denial reasons breakdown,
512
- top partitions by lifetime/pending, pause/resume/drain.
532
+ top partitions by lifetime/pending, pause/resume/drain. Pause is a
533
+ policy-level flag (stored in `dispatch_policy_policy_settings`) the
534
+ tick honors, so it also holds partitions that first appear *after*
535
+ the pause; resume clears it.
513
536
  - **Partitions** — searchable list, detail view with gate state,
514
537
  decayed_admits + admits/min estimate, recent staged jobs,
515
538
  force-admit, drain.
@@ -535,13 +558,13 @@ DispatchPolicy.configure do |c|
535
558
  c.partition_inactive_after = 86_400 # GC partitions idle this long
536
559
  c.inflight_stale_after = 300 # GC inflight rows whose worker stopped heartbeating
537
560
  c.inflight_queued_stale_after = 3_600 # GC inflight rows admitted but never started (queued)
538
- c.inflight_heartbeat_interval = 30 # how often the worker bumps heartbeat_at
539
- c.sweep_every_ticks = 50 # sweeper cadence (in tick iterations)
561
+ c.inflight_heartbeat_interval = 30 # how often the worker bumps heartbeat_at; 0 disables the thread
562
+ c.sweep_every_ticks = 50 # sweeper cadence (in tick iterations); <= 0 never sweeps
540
563
  c.metrics_retention = 86_400 # tick_samples kept this long
541
564
  c.fairness_half_life_seconds = 60 # EWMA half-life for in-tick reorder; nil disables
542
565
  c.tick_admission_budget = nil # global cap on admissions per tick; nil = none
543
566
  c.adapter_throughput_target = nil # jobs/sec; UI shows admit rate as % of this
544
- c.database_role = nil # AR role for the admission TX (multi-DB)
567
+ c.database_role = nil # AR role ALL gem DB access runs against (multi-DB)
545
568
  end
546
569
  ```
547
570
 
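The README's `:throttle` notes (fractional rates, sub-unit rates, a zero or nil rate backing off one `per` window, and `per <= 0` raising) can be illustrated with a small token bucket. This is a minimal sketch under stated assumptions, not the gem's gate: the class name `SketchThrottle`, the `try_admit` method, the explicit `now` argument and the choice to start with one window's worth of tokens are all hypothetical.

```ruby
# Illustrative token bucket; names are hypothetical and this is not the
# gem's throttle implementation.
class SketchThrottle
  attr_reader :tokens

  def initialize(rate:, per:, now: 0.0)
    raise ArgumentError, "per must be > 0" unless per.to_f.positive?

    @rate = rate.to_f            # nil.to_f is 0.0, so a nil rate behaves like 0
    @per = per.to_f
    @capacity = [@rate, 1.0].max # assumption: room for at least one whole token
    @tokens = @rate              # assumption: start with one window's worth
    @updated_at = now.to_f
  end

  # Returns [:admit, nil] or [:deny, seconds_until_retry].
  def try_admit(now)
    return [:deny, @per] unless @rate.positive? # zero/nil rate: back off one window

    refill(now.to_f)
    if @tokens >= 1.0
      @tokens -= 1.0
      [:admit, nil]
    else
      [:deny, (1.0 - @tokens) * @per / @rate]
    end
  end

  private

  # Tokens are kept as Floats so the fractional part survives each refill.
  def refill(now)
    elapsed = [now - @updated_at, 0.0].max
    @tokens = [@tokens + (elapsed * @rate) / @per, @capacity].min
    @updated_at = now
  end
end

fractional = SketchThrottle.new(rate: 2.5, per: 60)
burst = Array.new(3) { fractional.try_admit(0) }
later = fractional.try_admit(12)

sub_unit = SketchThrottle.new(rate: 0.5, per: 1)
sub_first = sub_unit.try_admit(0)
sub_second = sub_unit.try_admit(1)

zero = SketchThrottle.new(rate: 0, per: 30).try_admit(0)
```

With `rate: 2.5, per: 60` the first two calls admit and the third is denied with half a token left; twelve seconds later the half token has grown to a whole one. Truncating the rate to an integer would discard that half token on every refill, which is the under-admission the changelog describes.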
data/app/controllers/dispatch_policy/application_controller.rb CHANGED
@@ -4,11 +4,24 @@ module DispatchPolicy
4
4
  class ApplicationController < ActionController::Base
5
5
  protect_from_forgery with: :exception
6
6
 
7
+ # The dashboard reads and writes the gem tables through the AR models
8
+ # directly (Partition, StagedJob, InflightJob, PolicySetting,
9
+ # TickSample), which — unlike Repository — have no role wrapper of
10
+ # their own. Under multi-DB (config.database_role) those queries would
11
+ # hit the default writing role, where the gem tables don't live.
12
+ # Wrapping the whole action keeps view rendering inside the role too,
13
+ # so lazily-evaluated relations (@partitions etc.) stay routed.
14
+ around_action :route_database_role
15
+
7
16
  helper_method :format_time, :format_count, :format_duration_seconds,
8
17
  :format_duration_ms, :sparkline, :registered_policies
9
18
 
10
19
  private
11
20
 
21
+ def route_database_role(&action)
22
+ Repository.with_connection(&action)
23
+ end
24
+
12
25
  def registered_policies
13
26
  DispatchPolicy.registry.each.to_a
14
27
  end
@@ -20,12 +33,18 @@ module DispatchPolicy
20
33
 
21
34
  def format_count(value)
22
35
  return "0" if value.nil?
23
- value.to_i.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
36
+ n = value.to_i
37
+ sign = n.negative? ? "-" : ""
38
+ digits = n.abs.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
39
+ "#{sign}#{digits}"
24
40
  end
25
41
 
26
42
  def format_duration_seconds(seconds)
27
43
  return "—" if seconds.nil?
28
- s = seconds.to_f
44
+ # A duration is never meaningfully negative; clock skew between the
45
+ # app and Postgres (timestamps written by now(), subtracted in Ruby)
46
+ # can yield a small negative — clamp so the UI shows 0ms, not "-340ms".
47
+ s = [seconds.to_f, 0.0].max
29
48
  return "%.0fms" % (s * 1000) if s < 1
30
49
  return "%.1fs" % s if s < 60
31
50
  return "%.1fm" % (s / 60) if s < 3600
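The two formatting fixes in this controller are easy to check outside Rails. The sketch below copies the new `format_count` body from the hunk, keeps the old body under the hypothetical name `old_format_count` for comparison, and isolates the sub-second branch of `format_duration_seconds` as a hypothetical `format_subsecond` helper.

```ruby
# New body, copied from the diff: the sign is split off before grouping digits.
def format_count(value)
  return "0" if value.nil?

  n = value.to_i
  sign = n.negative? ? "-" : ""
  digits = n.abs.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
  "#{sign}#{digits}"
end

# Old body (hypothetical name): the regexp matches digits only, so "-" is dropped.
def old_format_count(value)
  return "0" if value.nil?

  value.to_i.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
end

# Sub-second branch only (hypothetical helper), with the new clamp at zero.
def format_subsecond(seconds)
  s = [seconds.to_f, 0.0].max
  "%.0fms" % (s * 1000)
end

puts format_count(-1_234_567)     # -1,234,567
puts old_format_count(-1_234_567) # 1,234,567
puts format_subsecond(-0.34)      # 0ms
```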
data/app/controllers/dispatch_policy/dashboard_controller.rb CHANGED
@@ -69,6 +69,8 @@ module DispatchPolicy
69
69
  denied_by = Repository.top_denied_reason_by_policy(since: one_min_ago)
70
70
  rt_by = Repository.partition_round_trip_stats_by_policy
71
71
 
72
+ paused_policies = PolicySetting.paused.pluck(:policy_name).to_set
73
+
72
74
  names = (pending_by_policy.keys + in_flight_by_policy.keys).uniq.sort
73
75
  @policies = names.map do |name|
74
76
  info = pending_by_policy[name] || {}
@@ -79,6 +81,7 @@ module DispatchPolicy
79
81
 
80
82
  {
81
83
  name: name,
84
+ paused: paused_policies.include?(name),
82
85
  pending: info[:pending] || 0,
83
86
  in_flight: in_flight_by_policy[name] || 0,
84
87
  last_admit_at: info[:last_admit_at],
data/app/controllers/dispatch_policy/partitions_controller.rb CHANGED
@@ -13,7 +13,11 @@ module DispatchPolicy
13
13
  base = Partition.all
14
14
  base = base.for_policy(params[:policy]) if params[:policy].present?
15
15
  base = base.for_shard(params[:shard]) if params[:shard].present?
16
- base = base.where("partition_key ILIKE ?", "%#{params[:q]}%") if params[:q].present?
16
+ if params[:q].present?
17
+ # Escape %/_ so a literal key containing them (e.g. "discount_50%")
18
+ # matches literally instead of as ILIKE wildcards.
19
+ base = base.where("partition_key ILIKE ?", "%#{Partition.sanitize_sql_like(params[:q])}%")
20
+ end
17
21
  base = base.where("pending_count > 0") if params[:only_pending] == "1"
18
22
 
19
23
  @sort = DispatchPolicy::CursorPagination::SORTS.key?(params[:sort]) ? params[:sort] : DispatchPolicy::CursorPagination::DEFAULT_SORT
@@ -40,6 +44,11 @@ module DispatchPolicy
40
44
  @query = params[:q]
41
45
  @only_pending = params[:only_pending] == "1"
42
46
 
47
+ # Policy-level pause flags so rows show their EFFECTIVE state: a
48
+ # partition created after a pause has status 'active' but is not
49
+ # being admitted (claim_partitions skips the whole policy).
50
+ @paused_policies = PolicySetting.paused.pluck(:policy_name).to_set
51
+
43
52
  shards_scope = Partition.all
44
53
  shards_scope = shards_scope.for_policy(params[:policy]) if params[:policy].present?
45
54
  @shards = shards_scope.distinct.pluck(:shard).sort
@@ -59,15 +68,25 @@ module DispatchPolicy
59
68
  helper_method :pagination_params
60
69
 
61
70
  def show
71
+ # Order matches the tick's claim order (claim_staged_jobs!) so the list
72
+ # reflects what would actually be admitted first, not the reverse.
62
73
  @recent_jobs = StagedJob
63
74
  .for_partition(@partition.policy_name, @partition.partition_key)
64
- .order(:scheduled_at, :id)
75
+ .order(Arel.sql("priority DESC, scheduled_at ASC NULLS FIRST, id ASC"))
65
76
  .limit(50)
66
- @inflight = InflightJob.where(policy_name: @partition.policy_name).limit(50)
77
+ # The whole policy may be paused even if this partition's own status
78
+ # is 'active' (it was created after the pause). claim_partitions skips
79
+ # the policy regardless, so surface the effective state.
80
+ @policy_paused = PolicySetting.for_policy(@partition.policy_name).pick(:paused) || false
67
81
  end
68
82
 
69
83
  def admit
70
- count = Integer(params[:count] || 1)
84
+ # Bound the count: an unbounded value would force a single
85
+ # DELETE…RETURNING + dispatch of the whole backlog in one transaction,
86
+ # bypassing the batching/cap that #drain uses precisely to avoid
87
+ # request timeouts and giant transactions. A non-numeric value falls
88
+ # back to 1 instead of raising (ArgumentError → 500).
89
+ count = (Integer(params[:count], exception: false) || 1).clamp(1, DRAIN_MAX_PER_REQUEST)
71
90
  forwarded = ManualAdmission.force!(
72
91
  policy_name: @partition.policy_name,
73
92
  partition_key: @partition.partition_key,
@@ -81,19 +100,33 @@ module DispatchPolicy
81
100
  # huge backlog can't time the controller out — the operator clicks again
82
101
  # for the next batch.
83
102
  def drain
84
- drained, remaining = self.class.drain_partition!(@partition)
85
- notice = if remaining.positive?
86
- "Drained #{drained} job(s); #{remaining} still pending — click drain again to continue."
87
- else
88
- "Drained #{drained} job(s); partition empty."
89
- end
103
+ drained, due_remaining, scheduled_remaining =
104
+ self.class.drain_partition!(@partition)
105
+
106
+ notice =
107
+ if due_remaining.positive?
108
+ "Drained #{drained} job(s); #{due_remaining} still pending — click drain again to continue."
109
+ elsif scheduled_remaining.positive?
110
+ # The claim only picks up rows whose scheduled_at has arrived, so
111
+ # future-scheduled jobs can't be drained now. Saying "click again"
112
+ # would just loop forwarding zero.
113
+ "Drained #{drained} job(s); #{scheduled_remaining} scheduled for later remain."
114
+ else
115
+ "Drained #{drained} job(s); partition empty."
116
+ end
90
117
  redirect_to partition_path(@partition), notice: notice
91
118
  end
92
119
 
93
- def self.drain_partition!(partition)
120
+ # Force-admits up to DRAIN_MAX_PER_REQUEST due jobs in DRAIN_BATCH_SIZE
121
+ # batches. Optional `cap` lets the policy-wide drain bound the TOTAL
122
+ # across partitions. Returns [drained, due_remaining, scheduled_remaining]
123
+ # — due_remaining is claimable-now work the cap left behind;
124
+ # scheduled_remaining is future-scheduled rows the claim can't touch yet.
125
+ def self.drain_partition!(partition, cap: DRAIN_MAX_PER_REQUEST)
126
+ cap = [cap, DRAIN_MAX_PER_REQUEST].min
94
127
  drained = 0
95
- while drained < DRAIN_MAX_PER_REQUEST
96
- batch_limit = [DRAIN_BATCH_SIZE, DRAIN_MAX_PER_REQUEST - drained].min
128
+ while drained < cap
129
+ batch_limit = [DRAIN_BATCH_SIZE, cap - drained].min
97
130
  forwarded = ManualAdmission.force!(
98
131
  policy_name: partition.policy_name,
99
132
  partition_key: partition.partition_key,
@@ -103,8 +136,11 @@ module DispatchPolicy
103
136
 
104
137
  drained += forwarded
105
138
  end
106
- remaining = partition.class.where(id: partition.id).pick(:pending_count) || 0
107
- [drained, remaining]
139
+
140
+ scope = StagedJob.for_partition(partition.policy_name, partition.partition_key)
141
+ due_remaining = scope.due.count
142
+ scheduled_remaining = scope.count - due_remaining
143
+ [drained, due_remaining, scheduled_remaining]
108
144
  end
109
145
 
110
146
  private
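The bounded `count` in `admit` relies on two core Ruby calls: `Integer(..., exception: false)` returns `nil` instead of raising on unparseable input, and `Comparable#clamp` pins the result to a range. A minimal sketch follows. The helper name `bounded_admit_count` is hypothetical (the controller inlines the expression), and the constant value is taken from the changelog's `[1, 10_000]` range.

```ruby
DRAIN_MAX_PER_REQUEST = 10_000 # value from the changelog's [1, 10_000] range

# Hypothetical helper wrapping the expression used in the admit action.
def bounded_admit_count(raw)
  (Integer(raw, exception: false) || 1).clamp(1, DRAIN_MAX_PER_REQUEST)
end

puts bounded_admit_count("25")     # 25
puts bounded_admit_count("abc")    # 1
puts bounded_admit_count(nil)      # 1
puts bounded_admit_count("-3")     # 1
puts bounded_admit_count("500000") # 10000
```

The fallback to `1` covers both a missing parameter and a malformed one, so a bad request degrades to the smallest possible admission instead of a 500.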
data/app/controllers/dispatch_policy/policies_controller.rb CHANGED
@@ -15,12 +15,16 @@ module DispatchPolicy
15
15
  # One grouped query for pending / partition count / paused count
16
16
  # across every policy instead of three per policy.
17
17
  counts_by_policy = Repository.partition_counts_by_policy
18
+ # Policy-level pause flags — the source of truth the tick honors
19
+ # (partitions.status alone misses partitions created after the pause).
20
+ paused_policies = PolicySetting.paused.pluck(:policy_name).to_set
18
21
 
19
22
  @rows = names.map do |name|
20
23
  counts = counts_by_policy[name] || {}
21
24
  {
22
25
  name: name,
23
26
  registered: registry_names.include?(name),
27
+ paused: paused_policies.include?(name),
24
28
  pending: counts[:pending] || 0,
25
29
  in_flight: in_flight_by_policy[name] || 0,
26
30
  partitions: counts[:partitions] || 0,
@@ -31,6 +35,7 @@ module DispatchPolicy
31
35
 
32
36
  def show
33
37
  @policy_object = DispatchPolicy.registry.fetch(@policy_name)
38
+ @paused = PolicySetting.for_policy(@policy_name).pick(:paused) || false
34
39
  @partitions = Partition.for_policy(@policy_name)
35
40
  .order(Arel.sql("pending_count DESC, last_admit_at DESC NULLS LAST"))
36
41
  .limit(100)
@@ -77,17 +82,31 @@ module DispatchPolicy
77
82
  in_backoff: @round_trip[:in_backoff],
78
83
  total_partitions: @totals[:partitions],
79
84
  adapter_target_jps: @capacity[:adapter_target_jps],
80
- pending_trend: @pending_trend
85
+ pending_trend: @pending_trend,
86
+ paused: @paused
81
87
  )
82
88
  end
83
89
 
84
90
  def pause
85
- Partition.for_policy(@policy_name).update_all(status: "paused", updated_at: Time.current)
91
+ # Policy-level flag is the source of truth the tick honors (so a key
92
+ # that first appears AFTER the pause is held too). The per-partition
93
+ # status update is kept for the partitions index display. One TX so
94
+ # both writes commit or neither: a flag without the statuses (or vice
95
+ # versa) leaves the partition list contradicting what admission
96
+ # actually does until the next toggle. set_policy_paused! shares the
97
+ # connection (same role via around_action), so it joins this TX.
98
+ Partition.transaction do
99
+ Repository.set_policy_paused!(policy_name: @policy_name, paused: true)
100
+ Partition.for_policy(@policy_name).update_all(status: "paused", updated_at: Time.current)
101
+ end
86
102
  redirect_to policy_path(@policy_name), notice: "Policy paused."
87
103
  end
88
104
 
89
105
  def resume
90
- Partition.for_policy(@policy_name).update_all(status: "active", updated_at: Time.current)
106
+ Partition.transaction do
107
+ Repository.set_policy_paused!(policy_name: @policy_name, paused: false)
108
+ Partition.for_policy(@policy_name).update_all(status: "active", updated_at: Time.current)
109
+ end
91
110
  redirect_to policy_path(@policy_name), notice: "Policy resumed."
92
111
  end
93
112
 
@@ -103,7 +122,10 @@ module DispatchPolicy
103
122
  .each do |partition|
104
123
  break if drained >= DRAIN_MAX_PER_REQUEST
105
124
 
106
- batch, _ = PartitionsController.drain_partition!(partition)
125
+ # Pass the REMAINING budget so a single partition can't push the
126
+ # total past the cap (a fixed per-partition cap could overshoot by
127
+ # nearly 2× when the first partition nearly fills it).
128
+ batch, = PartitionsController.drain_partition!(partition, cap: DRAIN_MAX_PER_REQUEST - drained)
107
129
  drained += batch
108
130
  end
109
131
 
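The "nearly 2×" overshoot mentioned in the drain comment can be reproduced with plain integers. In this sketch each partition is reduced to its backlog size, and `sketch_drain_partition` / `sketch_drain_policy` are hypothetical stand-ins for `drain_partition!` and the policy-wide loop. Batching and the database are left out.

```ruby
DRAIN_MAX_PER_REQUEST = 10_000

# Stand-in for drain_partition!: drains min(backlog, cap) and returns the count.
def sketch_drain_partition(backlog, cap: DRAIN_MAX_PER_REQUEST)
  [backlog, [cap, DRAIN_MAX_PER_REQUEST].min].min
end

# Walks the partitions the way the policy-wide drain loop does.
def sketch_drain_policy(backlogs, pass_remaining:)
  drained = 0
  backlogs.each do |backlog|
    break if drained >= DRAIN_MAX_PER_REQUEST

    # Old behaviour: every partition gets the full cap.
    # New behaviour: every partition gets only what is left of the budget.
    cap = pass_remaining ? DRAIN_MAX_PER_REQUEST - drained : DRAIN_MAX_PER_REQUEST
    drained += sketch_drain_partition(backlog, cap: cap)
  end
  drained
end

backlogs = [9_999, 10_000]
old_total = sketch_drain_policy(backlogs, pass_remaining: false)
new_total = sketch_drain_policy(backlogs, pass_remaining: true)

puts old_total # 19999
puts new_total # 10000
```

The first partition stops one job short of the cap, so the loop's `break` does not fire and the old code hands the second partition a fresh 10,000-job allowance.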
data/app/models/dispatch_policy/policy_setting.rb ADDED
@@ -0,0 +1,14 @@
1
+ # frozen_string_literal: true
2
+
3
+ module DispatchPolicy
4
+ # Policy-level settings (currently just the pause flag). One row per
5
+ # policy_name. The tick's claim_partitions consults this so a pause takes
6
+ # effect for partitions created after the pause too — not only the ones
7
+ # that existed when the operator clicked.
8
+ class PolicySetting < ApplicationRecord
9
+ self.table_name = "dispatch_policy_policy_settings"
10
+
11
+ scope :for_policy, ->(name) { where(policy_name: name) }
12
+ scope :paused, -> { where(paused: true) }
13
+ end
14
+ end
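Why a policy-level flag is needed in addition to per-partition status can be illustrated with an in-memory model. This is a sketch under stated assumptions: `DemoPartition` and `DemoStore` are hypothetical names, the real gem stores this in database tables, and the actual `claim_partitions` query is not shown in this diff.

```ruby
require "set"

# Hypothetical in-memory partition record (the gem uses an ActiveRecord model).
DemoPartition = Struct.new(:policy_name, :key, :status)

# Hypothetical store that models both the per-partition status and the
# policy-level pause flag introduced by PolicySetting.
class DemoStore
  def initialize
    @partitions = []
    @paused_policies = Set.new
  end

  # New partitions always start out active, as they would when a job is staged.
  def add_partition(policy_name, key)
    @partitions << DemoPartition.new(policy_name, key, "active")
  end

  # Mirrors the pause action: set the policy flag and mark existing partitions.
  def pause_policy(policy_name)
    @paused_policies << policy_name
    @partitions.each { |p| p.status = "paused" if p.policy_name == policy_name }
  end

  # Claiming on per-partition status alone (the pre-0.5.0 model).
  def claimable_by_status
    @partitions.select { |p| p.status == "active" }
  end

  # Claiming that also consults the policy-level flag.
  def claimable
    claimable_by_status.reject { |p| @paused_policies.include?(p.policy_name) }
  end
end

store = DemoStore.new
store.add_partition("emails", "tenant-1")
store.pause_policy("emails")
store.add_partition("emails", "tenant-2") # created after the pause

by_status = store.claimable_by_status.map(&:key)
with_flag = store.claimable.map(&:key)
puts "status only: #{by_status.inspect}, with policy flag: #{with_flag.inspect}"
```

In this model the partition created after the pause slips through a status-only check, while the check that consults the policy flag admits nothing for the paused policy.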
@@ -84,7 +84,12 @@
84
84
  <tbody>
85
85
  <% @policies.each do |p| %>
86
86
  <tr>
87
- <td><%= link_to p[:name], policy_path(p[:name]), class: "dp-link" %></td>
87
+ <td>
88
+ <%= link_to p[:name], policy_path(p[:name]), class: "dp-link" %>
89
+ <% if p[:paused] %>
90
+ <span class="dp-warn" style="font-size:11px; border:1px solid currentColor; border-radius:4px; padding:1px 5px; margin-left:4px;">paused</span>
91
+ <% end %>
92
+ </td>
88
93
  <td class="dp-num"><%= format_count(p[:pending]) %></td>
89
94
  <td class="dp-num"><%= format_count(p[:in_flight]) %></td>
90
95
  <td class="dp-num"><%= format_count(p[:admitted_1m]) %></td>
@@ -35,7 +35,7 @@
35
35
  </thead>
36
36
  <tbody>
37
37
  <% @partitions.each do |p| %>
38
- <%= render "dispatch_policy/shared/partition_row", partition: p %>
38
+ <%= render "dispatch_policy/shared/partition_row", partition: p, policy_paused: @paused_policies.include?(p.policy_name) %>
39
39
  <% end %>
40
40
  </tbody>
41
41
  </table>
@@ -23,7 +23,7 @@
23
23
  <div class="dp-stat"><span class="dp-stat-label">Policy</span><span class="dp-stat-value"><%= @partition.policy_name %></span></div>
24
24
  <div class="dp-stat"><span class="dp-stat-label">Shard</span><span class="dp-stat-value"><code><%= @partition.shard %></code></span></div>
25
25
  <div class="dp-stat"><span class="dp-stat-label">Queue</span><span class="dp-stat-value"><%= @partition.queue_name || "—" %></span></div>
26
- <div class="dp-stat"><span class="dp-stat-label">Status</span><span class="dp-stat-value <%= "dp-warn" if @partition.paused? %>"><%= @partition.status %></span></div>
26
+ <div class="dp-stat"><span class="dp-stat-label">Status</span><span class="dp-stat-value <%= "dp-warn" if @partition.paused? || @policy_paused %>"><%= @policy_paused && !@partition.paused? ? "#{@partition.status} (policy paused)" : @partition.status %></span></div>
27
27
  <div class="dp-stat"><span class="dp-stat-label">Pending</span><span class="dp-stat-value"><%= format_count(@partition.pending_count) %></span></div>
28
28
  <div class="dp-stat"><span class="dp-stat-label">Lifetime admitted</span><span class="dp-stat-value"><%= format_count(@partition.total_admitted) %></span></div>
29
29
  <div class="dp-stat"><span class="dp-stat-label">Round-trip age</span><span class="dp-stat-value"><%= age_seconds ? format_duration_seconds(age_seconds) : "never" %></span></div>
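The status cell in the hunk above combines two flags inline. The same logic can be read more easily as two small functions; the sketch below is illustrative only, and `status_label` and `status_css` are hypothetical names that do not appear in the gem.

```ruby
# Hypothetical helper: the text shown in the Status cell.
# A partition that is itself active under a paused policy gets a suffix;
# a partition that is already paused shows its own status unchanged.
def status_label(status:, partition_paused:, policy_paused:)
  policy_paused && !partition_paused ? "#{status} (policy paused)" : status
end

# Hypothetical helper: the warning class is applied when either level is paused.
def status_css(partition_paused:, policy_paused:)
  partition_paused || policy_paused ? "dp-warn" : nil
end

cases = [
  { status: "active", partition_paused: false, policy_paused: false },
  { status: "active", partition_paused: false, policy_paused: true },
  { status: "paused", partition_paused: true,  policy_paused: true }
]

labels = cases.map { |c| status_label(**c) }
styles = cases.map { |c| status_css(partition_paused: c[:partition_paused], policy_paused: c[:policy_paused]) }
puts labels.zip(styles).inspect
```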
@@ -13,15 +13,23 @@
13
13
  <tbody>
14
14
  <% @rows.each do |row| %>
15
15
  <tr>
16
- <td><%= link_to row[:name], policy_path(row[:name]), class: "dp-link" %></td>
16
+ <td>
17
+ <%= link_to row[:name], policy_path(row[:name]), class: "dp-link" %>
18
+ <% if row[:paused] %>
19
+ <span class="dp-warn" style="font-size:11px; border:1px solid currentColor; border-radius:4px; padding:1px 5px; margin-left:4px;">paused</span>
20
+ <% end %>
21
+ </td>
17
22
  <td class="dp-num"><%= format_count(row[:pending]) %></td>
18
23
  <td class="dp-num"><%= format_count(row[:in_flight]) %></td>
19
24
  <td class="dp-num"><%= format_count(row[:partitions]) %></td>
20
25
  <td class="dp-num"><%= row[:paused_count].positive? ? content_tag(:span, format_count(row[:paused_count]), class: "dp-warn") : 0 %></td>
21
26
  <td><%= row[:registered] ? "yes" : content_tag(:span, "no (orphan)", class: "dp-warn") %></td>
22
27
  <td>
23
- <%= button_to "Pause", pause_policy_path(row[:name]), class: "dp-btn", method: :post, form: { class: "dp-form-inline" } %>
24
- <%= button_to "Resume", resume_policy_path(row[:name]), class: "dp-btn dp-btn-ok", method: :post, form: { class: "dp-form-inline" } %>
28
+ <% if row[:paused] %>
29
+ <%= button_to "Resume", resume_policy_path(row[:name]), class: "dp-btn dp-btn-ok", method: :post, form: { class: "dp-form-inline" } %>
30
+ <% else %>
31
+ <%= button_to "Pause", pause_policy_path(row[:name]), class: "dp-btn", method: :post, form: { class: "dp-form-inline" } %>
32
+ <% end %>
25
33
  </td>
26
34
  </tr>
27
35
  <% end %>
@@ -1,4 +1,9 @@
1
- <h1>Policy <code><%= @policy_name %></code></h1>
1
+ <h1>
2
+ Policy <code><%= @policy_name %></code>
3
+ <% if @paused %>
4
+ <span class="dp-warn" style="font-size:14px; vertical-align:middle; border:1px solid currentColor; border-radius:4px; padding:2px 8px; margin-left:8px;">PAUSED</span>
5
+ <% end %>
6
+ </h1>
2
7
 
3
8
  <section class="dp-stats">
4
9
  <div class="dp-stat"><span class="dp-stat-label">Partitions</span><span class="dp-stat-value"><%= format_count(@totals[:partitions]) %></span></div>
@@ -150,15 +155,19 @@
150
155
 
151
156
  <section class="dp-section">
152
157
  <h2>Actions</h2>
153
- <%= button_to "Pause all partitions", pause_policy_path(@policy_name), class: "dp-btn", method: :post, form: { class: "dp-form-inline" } %>
154
- <%= button_to "Resume all partitions", resume_policy_path(@policy_name), class: "dp-btn dp-btn-ok", method: :post, form: { class: "dp-form-inline" } %>
158
+ <% if @paused %>
159
+ <%= button_to "Resume policy", resume_policy_path(@policy_name), class: "dp-btn dp-btn-ok", method: :post, form: { class: "dp-form-inline" } %>
160
+ <% else %>
161
+ <%= button_to "Pause policy", pause_policy_path(@policy_name), class: "dp-btn", method: :post, form: { class: "dp-form-inline" } %>
162
+ <% end %>
155
163
  <%= button_to "Drain policy", drain_policy_path(@policy_name),
156
164
  class: "dp-btn dp-btn-warn",
157
165
  method: :post,
158
166
  form: { class: "dp-form-inline",
159
167
  onsubmit: "return confirm('Force-admit every staged job across every partition of this policy, bypassing all gates?');" } %>
160
168
  <p class="dp-hint">
161
- <strong>Pause</strong> stops admission but keeps staging — the queue keeps filling, in-flight jobs finish.
169
+ <strong>Pause</strong> stops admission for the whole policy including partitions created
170
+ after the pause — but keeps staging: the queue keeps filling, in-flight jobs finish.
162
171
  <strong>Drain</strong> empties the staging table by force-admitting every job (bypassing gates).
163
172
  Capped at 10,000 jobs per click — click again for more.
164
173
  </p>