dispatch_policy 0.4.3 → 0.5.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +185 -0
- data/README.md +30 -7
- data/app/controllers/dispatch_policy/application_controller.rb +21 -2
- data/app/controllers/dispatch_policy/dashboard_controller.rb +3 -0
- data/app/controllers/dispatch_policy/partitions_controller.rb +51 -15
- data/app/controllers/dispatch_policy/policies_controller.rb +26 -4
- data/app/models/dispatch_policy/policy_setting.rb +14 -0
- data/app/views/dispatch_policy/dashboard/index.html.erb +6 -1
- data/app/views/dispatch_policy/partitions/index.html.erb +1 -1
- data/app/views/dispatch_policy/partitions/show.html.erb +1 -1
- data/app/views/dispatch_policy/policies/index.html.erb +11 -3
- data/app/views/dispatch_policy/policies/show.html.erb +13 -4
- data/app/views/dispatch_policy/shared/_partition_row.html.erb +9 -2
- data/app/views/layouts/dispatch_policy/application.html.erb +21 -25
- data/db/migrate/20260501000001_create_dispatch_policy_tables.rb +13 -0
- data/lib/dispatch_policy/config.rb +5 -0
- data/lib/dispatch_policy/context.rb +12 -2
- data/lib/dispatch_policy/cursor_pagination.rb +24 -7
- data/lib/dispatch_policy/gates/adaptive_concurrency.rb +14 -0
- data/lib/dispatch_policy/gates/concurrency.rb +4 -0
- data/lib/dispatch_policy/gates/throttle.rb +36 -9
- data/lib/dispatch_policy/inflight_tracker.rb +72 -26
- data/lib/dispatch_policy/job_extension.rb +33 -9
- data/lib/dispatch_policy/manual_admission.rb +18 -0
- data/lib/dispatch_policy/operator_hints.rb +14 -0
- data/lib/dispatch_policy/policy.rb +12 -0
- data/lib/dispatch_policy/policy_dsl.rb +10 -2
- data/lib/dispatch_policy/railtie.rb +10 -0
- data/lib/dispatch_policy/registry.rb +8 -4
- data/lib/dispatch_policy/repository.rb +102 -30
- data/lib/dispatch_policy/tick.rb +18 -2
- data/lib/dispatch_policy/tick_loop.rb +15 -7
- data/lib/dispatch_policy/version.rb +1 -1
- data/lib/generators/dispatch_policy/install/templates/create_dispatch_policy_tables.rb.tt +9 -0
- data/lib/generators/dispatch_policy/install/templates/dispatch_tick_loop_job.rb.tt +30 -2
- metadata +2 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: aa4e5f3f353ac2b0de80f0e1c584aa7d53395f75ad4644355cf14070bf36cb71
+  data.tar.gz: 01607ab2a98c331f791b46f6c11badbddeec17532d85d78c24d3987020945cae
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c7ead479e4a623510eee9a4cfcf3a58e1cec0fb387239a8149743990b4bc560f9cab4883ffa2fa93bf9eb81a580cbc458425fdaa3f356c708f037e04341ebd52
+  data.tar.gz: b6c18523ea1f59184631bc7f0b471fdead3ce48166f65217e46c48de40214fc76a6364c5faf6d9c2b236f3b725cdde3cee7197d84fa2110c8fc7bda4aa891e7a
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,190 @@
 # Changelog
 
+## Unreleased
+
+## 0.5.0
+
+### Upgrade notes
+- **New table `dispatch_policy_policy_settings`.** Required by the
+  policy-level pause fix below. New installs get it from the updated
+  install generator. **Existing installs must add it** — the gem ships a
+  single migration, so either re-copy the migration via
+  `rails dispatch_policy:install:migrations` (or hand-apply) or run:
+
+  ```ruby
+  create_table :dispatch_policy_policy_settings do |t|
+    t.string :policy_name, null: false
+    t.boolean :paused, null: false, default: false
+    t.timestamps
+  end
+  add_index :dispatch_policy_policy_settings, :policy_name,
+            unique: true, name: "idx_dp_policy_settings_lookup"
+  ```
+
+  Until the table exists, the tick's `claim_partitions` raises
+  `PG::UndefinedTable`. One row per policy holds its pause flag; it's the
+  policy-wide source of truth `claim_partitions` consults.
+
+### Added
+- The `:throttle` gate's `per` now accepts a lambda (like `rate`), so the
+  rate-limit window can depend on per-job context. A resolved `per <= 0`
+  raises.
+- Policy-level **pause** now actually holds the whole policy. The pause
+  flag lives in the new `dispatch_policy_policy_settings` table and is
+  honored by `claim_partitions`, so it also stops partitions that first
+  appear *after* the pause — previously `pause` only flipped the `status`
+  of partition rows that existed at click time, and a tenant's first
+  enqueue afterwards created an `active` partition the next tick admitted.
+  The per-partition `status` update is kept for the partitions-index
+  display; `resume` clears the flag.
+- The admin UI now reflects the policy-level pause flag everywhere
+  (policies index + show, dashboard policy rows, partitions index + show):
+  partitions created after a pause render as effectively paused even
+  though their own `status` is still `active`, the pause/resume button
+  toggles to a single relevant action, and `policies#show` shows a PAUSED
+  badge. The per-policy operator hints also short-circuit to a single
+  "policy is paused" note instead of falsely warning about never-checked
+  partitions / growing backlog while admission is intentionally stopped.
+
+### Fixed
+- **The admin UI honors `config.database_role`.** The engine controllers
+  query the gem tables through the AR models directly (`Partition`,
+  `StagedJob`, `InflightJob`, `PolicySetting`, `TickSample`), which the
+  `Repository` role wrapper doesn't cover — under multi-DB every dashboard
+  page queried the default writing role (`PG::UndefinedTable` → 500), and
+  `pause`/`resume` updated the partition `status` in the wrong DB while
+  the policy flag went to the right one. An `around_action` in the
+  engine's `ApplicationController` now wraps every action — including view
+  rendering, so lazily-evaluated relations stay routed — in
+  `Repository.with_connection`. No-op without `database_role`.
+- **`pause`/`resume` write the policy flag and the partition statuses in
+  one transaction.** They were two autocommitted statements; a crash
+  between them left the partition list contradicting what admission
+  actually does until the next toggle.
+- **The generated `DispatchTickLoopJob` no longer dies after its first run
+  under good_job.** It re-enqueues itself at the end of `perform`, but
+  `good_job_control_concurrency_with(total_limit: 1)` counts the
+  still-running job in its enqueue check (`unfinished`), so the successor
+  was silently aborted and admission stopped after `tick_max_duration`.
+  Switched to `enqueue_limit: 1` + `perform_limit: 1` (the enqueue check
+  excludes the running job) and the job now logs an error if a re-enqueue
+  is ever refused. solid_queue was unaffected.
+- **`Tick#record_sample!` routes its two AR-model reads through
+  `config.database_role`.** They bypassed the `Repository` role wrapper, so
+  under a separate queue DB they queried the wrong role and the swallowed
+  error meant no `tick_sample` was ever written (empty dashboard/metrics).
+- **Multi-DB (`config.database_role`) is now honored everywhere.** It was
+  only applied at the three admission-TX boundaries (`Tick`,
+  `ManualAdmission`), leaving staging, partition claim, inflight
+  counts/tracking, sweeps and dashboard reads on the default writing role.
+  Under a separate queue DB (e.g. `solid_queue`) with the gem tables
+  there, staging wrote one DB while the tick read another — silent job
+  loss — and the concurrency gate counted inflight rows in a different DB
+  than the tracker wrote them to. Every public `Repository` method now
+  opens inside `connected_to(role:)`; `InflightTracker`'s direct access
+  (lookup + heartbeat thread) is routed too.
+- **A policy may declare each gate type at most once.** Two gates of the
+  same type shared a single `gate_state` key (both throttles wrote
+  `gate_state["throttle"]`), so the merged patch kept only the last gate's
+  bucket and the other then saw a permanently full bucket — silently
+  defeating the stricter limit (the classic 10/min + 600/hour idiom).
+  `Policy#validate!` now raises `InvalidPolicy`; use separate policies for
+  multi-window limits.
+- **Bulk `perform_all_later` correctness.** A job whose declared policy
+  wasn't registered was silently dropped (neither staged nor sent to the
+  adapter); jobs were marked `successfully_enqueued` before the INSERT
+  committed; and the bulk path ignored `bypass_retries`. It now mirrors
+  the single path: unstageable jobs fall through to the adapter, the
+  enqueued flag is set only after `stage_many!` returns, and retries on a
+  `:bypass` policy skip staging.
+- **`ManualAdmission.force!` pre-inserts inflight rows** in the same
+  transaction as the claim, like the Tick. Without it the concurrency
+  gate under-counted force-admitted jobs (UI admit/drain) until each one
+  started performing — an over-admission window proportional to the
+  backlog drained.
+- **Inflight rows are reaped when a job is discarded before performing.**
+  `discard_on ActiveJob::DeserializationError` (and any discard) fires
+  during argument deserialization, before `around_perform`, so
+  `InflightTracker.track`'s `ensure` never ran and the Tick's pre-inserted
+  row sat until the `inflight_queued_stale_after` sweeper (1h), holding a
+  concurrency slot. The railtie now subscribes to `discard.active_job` and
+  deletes the row by `active_job_id`.
+- **`throttle` no longer busy-loops on a zero/nil rate.** A `rate` of `0`
+  or `nil` (e.g. a paused tenant) denied with a NULL `retry_after`, which
+  left the partition immediately eligible — re-claimed and re-evaluated
+  every tick — and clobbered any existing backoff. It now backs off one
+  `per` window, and `bulk_record_partition_denies!` preserves the existing
+  `next_eligible_at` when `retry_after` is NULL instead of nulling it.
+- **`throttle` rate is read as `Float`.** A fractional rate (e.g. `2.5`)
+  kept its fractional part instead of truncating every refill (systematic
+  under-admission), and a sub-unit rate (`rate: 0.5`) accumulates a whole
+  token and admits instead of truncating to `0` and denying forever.
+- **`adaptive_concurrency` validates its tuning knobs.** Out-of-range
+  values silently inverted the AIMD loop: `ewma_alpha: 0` froze the EWMA
+  at its seed so the cap grew unbounded, and a decrease factor `>= 1`
+  turned the multiplicative *decrease* into a positive-feedback *increase*
+  under failure/overload. The constructor now requires
+  `0 < ewma_alpha <= 1` and `0 < failure/overload_decrease_factor < 1`.
+- **`partitions#admit` bounds its count.** An unbounded `count` forced a
+  single `DELETE…RETURNING` + dispatch of the whole backlog in one
+  transaction (bypassing the batching/cap that `drain` uses), and a
+  non-numeric value 500'd. It's now clamped to `[1, 10_000]` with a
+  fallback to `1`.
+- **Forged timestamp pagination cursors no longer 500.** A non-parseable
+  string on a `stale`/`recent` sort bound into a timestamp column and
+  raised `invalid input syntax for type timestamp`. `CursorPagination`
+  now requires a parseable ISO8601 value for timestamp sorts, falling back
+  to the first page otherwise.
+- `stage_many!` chunks its INSERT into batches of 1,000 rows so a bulk
+  `perform_all_later` larger than ~8,191 jobs no longer blows Postgres's
+  65,535 bind-param limit and fails the whole batch.
+- `InflightTracker.track` now inserts the inflight row and spawns the
+  heartbeat inside its `begin/ensure`, so a failure spawning the heartbeat
+  thread can't leave a ghost inflight row behind until the sweeper.
+- `Registry` reads (`fetch`/`names`/`each`/`size`) take the same mutex as
+  `register`/`clear` (snapshotting before iterating in `each`), removing a
+  data race on non-GVL runtimes (JRuby/TruffleRuby).
+- The DSL rejects `tick_admission_budget`/`admission_batch_size` of `0` or
+  negative (a silent full stop of the policy) and the `concurrency` /
+  `adaptive_concurrency` gates reject a negative `full_backoff` (which
+  would put `next_eligible_at` in the past and re-evaluate every tick).
+  `nil` still defers to config.
+- The policy-wide drain passes its remaining budget to each partition so
+  the total can't overshoot the 10,000 cap by nearly 2×, and a drain that
+  only leaves future-scheduled jobs now says "N scheduled for later
+  remain" instead of looping "click drain again" forever.
+- `partitions#show` lists recent staged jobs in the real admission order
+  (`priority DESC, scheduled_at NULLS FIRST, id`) and drops a dead,
+  mis-scoped `@inflight` query.
+- `Context` now exposes indifferent (symbol/string) access at every depth,
+  not just the top level — `ctx[:limits][:max]` no longer silently returns
+  nil when the host wrote a nested hash with symbol keys. `to_jsonb`/`to_h`
+  still return the plain string-keyed hash for storage.
+- The tick loop survives misconfigured pacing: `sweep_every_ticks <= 0`
+  now means "never sweep" instead of raising `ZeroDivisionError`, and a
+  negative `idle_pause`/`busy_pause` is treated as no pause instead of
+  raising in `sleep`. Both previously escaped the loop's rescues and
+  stopped admission.
+- Pass-2 budget redistribution denies (e.g. a throttle emptied after
+  pass-1) now feed the tick sample's denied-reason breakdown, so the
+  dashboard reflects why redistribution stopped.
+- Admin UI: `format_count` keeps the sign of negative values; durations
+  clamp at 0 so app↔DB clock skew can't render "-340ms"; the partition
+  search escapes `%`/`_` so a literal key containing them matches
+  literally; and the refresh/theme controls bind via a single delegated
+  document listener instead of per-button (Turbo's morph refresh dropped
+  the `data-bound` guard, leaking a new listener per refresh).
+- Dummy app: the throttle demos (`slow_api`, `mixed`) honor the form's
+  `per` field via the new callable `per` instead of a hardcoded window
+  (`slow_api` was stuck at 60000s), and the enqueue forms tolerate blank
+  numeric fields / unknown job names instead of 500ing.
+
+### Internal
+- Corrected the `bulk_record_partition_denies!` comment: `claim_partitions`
+  runs autocommitted, so its `FOR UPDATE SKIP LOCKED` locks don't guard the
+  end-of-tick deny flush — the one-tick-loop-per-(policy,shard) invariant
+  and the `last_checked_at` bump do.
+
 ## 0.4.3
 
 ### Fixed
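Review note on the `stage_many!` entry: the two figures it quotes (about 8,191 jobs against a 65,535 bind-parameter limit) are consistent with roughly eight bind parameters per staged row. That per-row figure is inferred from the numbers, not stated in the changelog. The sketch below only illustrates the arithmetic and the 1,000-row chunking; `BINDS_PER_ROW`, `max_rows_per_insert`, and `insert_batches` are hypothetical names, not the gem's API.

```ruby
# Illustrative arithmetic only. BINDS_PER_ROW is an assumption inferred from
# the changelog's "~8,191 jobs" figure; the helper names are hypothetical.
PG_BIND_LIMIT = 65_535 # the limit quoted in the changelog
BINDS_PER_ROW = 8
CHUNK_SIZE    = 1_000  # batch size quoted in the changelog

# Largest number of rows a single multi-row INSERT could carry.
def max_rows_per_insert(binds_per_row, limit: PG_BIND_LIMIT)
  limit / binds_per_row
end

# Splits rows into chunks and returns the bind count each chunk would use.
def insert_batches(rows, chunk_size: CHUNK_SIZE)
  rows.each_slice(chunk_size).map { |chunk| chunk.size * BINDS_PER_ROW }
end

puts max_rows_per_insert(BINDS_PER_ROW)            # rows that fit in one statement
puts insert_batches(Array.new(20_000) { {} }).max  # binds used by the largest chunk
```

Under these assumptions each 1,000-row chunk uses 8,000 binds, well below the limit, so a bulk enqueue of any size is split into statements that each fit.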
data/README.md
CHANGED
@@ -210,6 +210,12 @@ Gates run in declared order; each narrows the survivor count. Every
 option that takes a value can alternatively take a lambda receiving
 the `ctx` hash, so parameters can depend on per-job data.
 
+A policy may declare each gate type **at most once** — two gates of the
+same type would share a `gate_state` key and corrupt each other's
+persisted state, so the policy raises `InvalidPolicy` at definition
+time. For multi-window rate limiting (e.g. 10/min *and* 600/hour), use
+separate policies.
+
 ### `:throttle` — token-bucket rate limit per partition
 
 Refills `rate` tokens every `per` seconds, capped at `rate` (no
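Review note on the at-most-once rule: a check of this kind can be sketched as below. This is not the gem's `Policy#validate!`; `validate_gate_types!` is a hypothetical helper, gates are modeled as bare symbols, and `InvalidPolicy` is redefined locally only so the snippet runs on its own.

```ruby
# Stand-in for the gem's error class, defined here so the sketch is runnable.
class InvalidPolicy < StandardError; end

# Hypothetical helper: raises if any gate type appears more than once,
# otherwise returns the list unchanged.
def validate_gate_types!(gate_types)
  duplicates = gate_types.tally.select { |_type, count| count > 1 }.keys
  return gate_types if duplicates.empty?

  raise InvalidPolicy, "gate type(s) declared more than once: #{duplicates.join(', ')}"
end

validate_gate_types!(%i[throttle concurrency]) # passes

begin
  validate_gate_types!(%i[throttle throttle])  # the 10/min + 600/hour idiom
rescue InvalidPolicy => e
  puts e.message
end
```

Failing at definition time turns the silent state collision described above into an immediate, visible error.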
@@ -223,9 +229,20 @@ gate :throttle,
   per: 1.minute
 ```
 
+Both `rate` and `per` accept a lambda receiving the `ctx`, so the rate
+limit and its window can depend on per-job data (e.g. a per-tenant plan
+that sets both). A `per` that resolves to `<= 0` raises.
+
 Throttle does **not** release tokens on completion — tokens refill
 only with elapsed time.
 
+`rate` may be fractional (e.g. `2.5`): the bucket keeps the fractional
+part so the long-run rate is exact rather than truncated. A sub-unit
+rate works too — the bucket holds at least one whole token, so e.g.
+`rate: 1, per: 2.seconds` admits one job every two seconds. A `rate`
+of `0` (or `nil`) denies and backs the partition off for one `per`
+window. Prefer expressing low rates via a longer `per`.
+
 ### `:concurrency` — in-flight cap per partition
 
 Caps the number of admitted-but-not-yet-completed jobs per partition.
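Review note on the throttle semantics: the behavior described in this hunk (callable `rate` and `per`, `Float` tokens, a capacity of at least one token, denial on a zero or `nil` rate) can be sketched as a small token bucket. This is an illustration under stated assumptions, not the gem's `Gates::Throttle`: the class name, the state hash shape, and the `[rate, 1.0].max` capacity rule are guesses at how the README's statements fit together, and the one-`per`-window backoff is left out.

```ruby
# Minimal token-bucket sketch. All names here are hypothetical.
class TokenBucketSketch
  def initialize(rate:, per:)
    @rate = rate
    @per = per
  end

  # state is { tokens: Float, refilled_at: Float }, `now` is seconds as a Float.
  # Returns [admitted, new_state].
  def admit(ctx, state, now)
    rate = resolve(@rate, ctx).to_f # nil.to_f is 0.0, so nil behaves like 0
    per = resolve(@per, ctx).to_f
    raise ArgumentError, "per must be > 0" unless per.positive?
    return [false, state] unless rate.positive?

    capacity = [rate, 1.0].max # assumption: always room for one whole token
    elapsed = now - state[:refilled_at]
    tokens = [state[:tokens] + (elapsed * rate / per), capacity].min

    if tokens >= 1.0
      [true, { tokens: tokens - 1.0, refilled_at: now }]
    else
      [false, { tokens: tokens, refilled_at: now }]
    end
  end

  private

  # Values may be plain numbers or lambdas receiving the ctx.
  def resolve(value, ctx)
    value.respond_to?(:call) ? value.call(ctx) : value
  end
end

bucket = TokenBucketSketch.new(rate: 0.5, per: ->(ctx) { ctx[:window] })
state = { tokens: 0.0, refilled_at: 0.0 }
admitted, state = bucket.admit({ window: 1 }, state, 1.0)
puts admitted # false: only half a token has accumulated
admitted, state = bucket.admit({ window: 1 }, state, 2.0)
puts admitted # true: the fractional refills add up to one whole token
```

Keeping tokens as a `Float` is what lets the sub-unit rate admit at all; truncating each refill to an integer would leave the bucket at zero forever.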
@@ -445,9 +462,12 @@ DispatchPolicy.configure do |c|
 end
 ```
 
-
-`connected_to(role:)`
-
+When set, **every** DB access the gem makes runs inside
+`connected_to(role:)` — staging on `perform_later`, the admission TX,
+inflight tracking and its heartbeat thread, sweeps, and the admin UI
+(an `around_action` routes each dashboard request, so its reads and
+operator actions hit the same DB the tick writes). Staging tables and
+the adapter's table must live in the same DB for atomicity to hold.
 
 ### Job identity across staging and adapter
 
@@ -509,7 +529,10 @@ Mount the engine and visit `/dispatch_policy`:
   ("avg tick at 88% of tick_max_duration — shard or lower
   admission_batch_size").
 - **Policies** — per-policy throughput, denial reasons breakdown,
-  top partitions by lifetime/pending, pause/resume/drain.
+  top partitions by lifetime/pending, pause/resume/drain. Pause is a
+  policy-level flag (stored in `dispatch_policy_policy_settings`) the
+  tick honors, so it also holds partitions that first appear *after*
+  the pause; resume clears it.
 - **Partitions** — searchable list, detail view with gate state,
   decayed_admits + admits/min estimate, recent staged jobs,
   force-admit, drain.
@@ -535,13 +558,13 @@ DispatchPolicy.configure do |c|
   c.partition_inactive_after = 86_400 # GC partitions idle this long
   c.inflight_stale_after = 300 # GC inflight rows whose worker stopped heartbeating
   c.inflight_queued_stale_after = 3_600 # GC inflight rows admitted but never started (queued)
-  c.inflight_heartbeat_interval = 30 # how often the worker bumps heartbeat_at
-  c.sweep_every_ticks = 50 # sweeper cadence (in tick iterations)
+  c.inflight_heartbeat_interval = 30 # how often the worker bumps heartbeat_at; 0 disables the thread
+  c.sweep_every_ticks = 50 # sweeper cadence (in tick iterations); <= 0 never sweeps
   c.metrics_retention = 86_400 # tick_samples kept this long
   c.fairness_half_life_seconds = 60 # EWMA half-life for in-tick reorder; nil disables
   c.tick_admission_budget = nil # global cap on admissions per tick; nil = none
   c.adapter_throughput_target = nil # jobs/sec; UI shows admit rate as % of this
-  c.database_role = nil # AR role
+  c.database_role = nil # AR role ALL gem DB access runs against (multi-DB)
 end
 ```
 
data/app/controllers/dispatch_policy/application_controller.rb
CHANGED
@@ -4,11 +4,24 @@ module DispatchPolicy
   class ApplicationController < ActionController::Base
     protect_from_forgery with: :exception
 
+    # The dashboard reads and writes the gem tables through the AR models
+    # directly (Partition, StagedJob, InflightJob, PolicySetting,
+    # TickSample), which — unlike Repository — have no role wrapper of
+    # their own. Under multi-DB (config.database_role) those queries would
+    # hit the default writing role, where the gem tables don't live.
+    # Wrapping the whole action keeps view rendering inside the role too,
+    # so lazily-evaluated relations (@partitions etc.) stay routed.
+    around_action :route_database_role
+
     helper_method :format_time, :format_count, :format_duration_seconds,
                   :format_duration_ms, :sparkline, :registered_policies
 
     private
 
+    def route_database_role(&action)
+      Repository.with_connection(&action)
+    end
+
     def registered_policies
       DispatchPolicy.registry.each.to_a
     end
@@ -20,12 +33,18 @@ module DispatchPolicy
 
     def format_count(value)
       return "0" if value.nil?
-      value.to_i
+      n = value.to_i
+      sign = n.negative? ? "-" : ""
+      digits = n.abs.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
+      "#{sign}#{digits}"
     end
 
     def format_duration_seconds(seconds)
       return "—" if seconds.nil?
-
+      # A duration is never meaningfully negative; clock skew between the
+      # app and Postgres (timestamps written by now(), subtracted in Ruby)
+      # can yield a small negative — clamp so the UI shows 0ms, not "-340ms".
+      s = [seconds.to_f, 0.0].max
       return "%.0fms" % (s * 1000) if s < 1
       return "%.1fs" % s if s < 60
       return "%.1fm" % (s / 60) if s < 3600
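Review note on `format_count`: the new body is short enough to try on its own. The snippet below copies the added lines into a top-level method (running it outside the controller is the only change) to show how the reverse, scan, join, reverse chain groups digits while the sign is handled separately.

```ruby
# The new format_count body from the hunk above, lifted out of the controller.
def format_count(value)
  return "0" if value.nil?

  n = value.to_i
  sign = n.negative? ? "-" : ""
  # Reverse so the groups of three are counted from the least significant
  # digit, join with commas, then reverse back.
  digits = n.abs.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
  "#{sign}#{digits}"
end

puts format_count(1_234_567)  # 1,234,567
puts format_count(-1_234_567) # -1,234,567
puts format_count(999)        # 999
puts format_count(nil)        # 0
```

Taking `n.abs` before grouping is what keeps the minus sign out of the digit scan, so it can be prepended afterwards.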
data/app/controllers/dispatch_policy/dashboard_controller.rb
CHANGED
@@ -69,6 +69,8 @@ module DispatchPolicy
       denied_by = Repository.top_denied_reason_by_policy(since: one_min_ago)
       rt_by = Repository.partition_round_trip_stats_by_policy
 
+      paused_policies = PolicySetting.paused.pluck(:policy_name).to_set
+
       names = (pending_by_policy.keys + in_flight_by_policy.keys).uniq.sort
       @policies = names.map do |name|
         info = pending_by_policy[name] || {}
@@ -79,6 +81,7 @@ module DispatchPolicy
 
         {
           name: name,
+          paused: paused_policies.include?(name),
           pending: info[:pending] || 0,
           in_flight: in_flight_by_policy[name] || 0,
           last_admit_at: info[:last_admit_at],
data/app/controllers/dispatch_policy/partitions_controller.rb
CHANGED
@@ -13,7 +13,11 @@ module DispatchPolicy
       base = Partition.all
       base = base.for_policy(params[:policy]) if params[:policy].present?
       base = base.for_shard(params[:shard]) if params[:shard].present?
-
+      if params[:q].present?
+        # Escape %/_ so a literal key containing them (e.g. "discount_50%")
+        # matches literally instead of as ILIKE wildcards.
+        base = base.where("partition_key ILIKE ?", "%#{Partition.sanitize_sql_like(params[:q])}%")
+      end
       base = base.where("pending_count > 0") if params[:only_pending] == "1"
 
       @sort = DispatchPolicy::CursorPagination::SORTS.key?(params[:sort]) ? params[:sort] : DispatchPolicy::CursorPagination::DEFAULT_SORT
@@ -40,6 +44,11 @@ module DispatchPolicy
       @query = params[:q]
       @only_pending = params[:only_pending] == "1"
 
+      # Policy-level pause flags so rows show their EFFECTIVE state: a
+      # partition created after a pause has status 'active' but is not
+      # being admitted (claim_partitions skips the whole policy).
+      @paused_policies = PolicySetting.paused.pluck(:policy_name).to_set
+
       shards_scope = Partition.all
       shards_scope = shards_scope.for_policy(params[:policy]) if params[:policy].present?
       @shards = shards_scope.distinct.pluck(:shard).sort
@@ -59,15 +68,25 @@ module DispatchPolicy
     helper_method :pagination_params
 
     def show
+      # Order matches the tick's claim order (claim_staged_jobs!) so the list
+      # reflects what would actually be admitted first, not the reverse.
       @recent_jobs = StagedJob
         .for_partition(@partition.policy_name, @partition.partition_key)
-        .order(
+        .order(Arel.sql("priority DESC, scheduled_at ASC NULLS FIRST, id ASC"))
         .limit(50)
-
+      # The whole policy may be paused even if this partition's own status
+      # is 'active' (it was created after the pause). claim_partitions skips
+      # the policy regardless, so surface the effective state.
+      @policy_paused = PolicySetting.for_policy(@partition.policy_name).pick(:paused) || false
     end
 
     def admit
-      count
+      # Bound the count: an unbounded value would force a single
+      # DELETE…RETURNING + dispatch of the whole backlog in one transaction,
+      # bypassing the batching/cap that #drain uses precisely to avoid
+      # request timeouts and giant transactions. A non-numeric value falls
+      # back to 1 instead of raising (ArgumentError → 500).
+      count = (Integer(params[:count], exception: false) || 1).clamp(1, DRAIN_MAX_PER_REQUEST)
       forwarded = ManualAdmission.force!(
         policy_name: @partition.policy_name,
         partition_key: @partition.partition_key,
@@ -81,19 +100,33 @@ module DispatchPolicy
     # huge backlog can't time the controller out — the operator clicks again
     # for the next batch.
     def drain
-      drained,
-
-
-
-
-
+      drained, due_remaining, scheduled_remaining =
+        self.class.drain_partition!(@partition)
+
+      notice =
+        if due_remaining.positive?
+          "Drained #{drained} job(s); #{due_remaining} still pending — click drain again to continue."
+        elsif scheduled_remaining.positive?
+          # The claim only picks up rows whose scheduled_at has arrived, so
+          # future-scheduled jobs can't be drained now. Saying "click again"
+          # would just loop forwarding zero.
+          "Drained #{drained} job(s); #{scheduled_remaining} scheduled for later remain."
+        else
+          "Drained #{drained} job(s); partition empty."
+        end
       redirect_to partition_path(@partition), notice: notice
     end
 
-
+    # Force-admits up to DRAIN_MAX_PER_REQUEST due jobs in DRAIN_BATCH_SIZE
+    # batches. Optional `cap` lets the policy-wide drain bound the TOTAL
+    # across partitions. Returns [drained, due_remaining, scheduled_remaining]
+    # — due_remaining is claimable-now work the cap left behind;
+    # scheduled_remaining is future-scheduled rows the claim can't touch yet.
+    def self.drain_partition!(partition, cap: DRAIN_MAX_PER_REQUEST)
+      cap = [cap, DRAIN_MAX_PER_REQUEST].min
       drained = 0
-      while drained <
-        batch_limit = [DRAIN_BATCH_SIZE,
+      while drained < cap
+        batch_limit = [DRAIN_BATCH_SIZE, cap - drained].min
         forwarded = ManualAdmission.force!(
           policy_name: partition.policy_name,
           partition_key: partition.partition_key,
@@ -103,8 +136,11 @@ module DispatchPolicy
 
         drained += forwarded
       end
-
-
+
+      scope = StagedJob.for_partition(partition.policy_name, partition.partition_key)
+      due_remaining = scope.due.count
+      scheduled_remaining = scope.count - due_remaining
+      [drained, due_remaining, scheduled_remaining]
     end
 
     private
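Review note on the `admit` bound: the parse-then-clamp expression can be checked in isolation. The sketch below reuses the exact expression from the hunk; `bounded_count` is a hypothetical wrapper, and `DRAIN_MAX_PER_REQUEST = 10_000` is taken from the changelog's stated `[1, 10_000]` range rather than from the constant's definition, which this diff does not show.

```ruby
# Value assumed from the changelog's "[1, 10_000]" bound.
DRAIN_MAX_PER_REQUEST = 10_000

# Hypothetical wrapper around the expression used in partitions#admit.
# Integer(..., exception: false) returns nil instead of raising on bad input.
def bounded_count(raw)
  (Integer(raw, exception: false) || 1).clamp(1, DRAIN_MAX_PER_REQUEST)
end

puts bounded_count("25")      # 25
puts bounded_count("abc")     # 1 (non-numeric falls back)
puts bounded_count(nil)       # 1 (missing param falls back)
puts bounded_count("0")       # 1 (clamped up)
puts bounded_count("999999")  # 10000 (clamped down)
```

The fallback and the clamp are separate steps: unparseable input becomes `1` first, and any parsed value outside the range is then pulled to the nearest bound.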
@@ -15,12 +15,16 @@ module DispatchPolicy
|
|
|
15
15
|
# One grouped query for pending / partition count / paused count
|
|
16
16
|
# across every policy instead of three per policy.
|
|
17
17
|
counts_by_policy = Repository.partition_counts_by_policy
|
|
18
|
+
# Policy-level pause flags — the source of truth the tick honors
|
|
19
|
+
# (partitions.status alone misses partitions created after the pause).
|
|
20
|
+
paused_policies = PolicySetting.paused.pluck(:policy_name).to_set
|
|
18
21
|
|
|
19
22
|
@rows = names.map do |name|
|
|
20
23
|
counts = counts_by_policy[name] || {}
|
|
21
24
|
{
|
|
22
25
|
name: name,
|
|
23
26
|
registered: registry_names.include?(name),
|
|
27
|
+
paused: paused_policies.include?(name),
|
|
24
28
|
pending: counts[:pending] || 0,
|
|
25
29
|
in_flight: in_flight_by_policy[name] || 0,
|
|
26
30
|
partitions: counts[:partitions] || 0,
|
|
@@ -31,6 +35,7 @@ module DispatchPolicy
|
|
|
31
35
|
|
|
32
36
|
def show
|
|
33
37
|
@policy_object = DispatchPolicy.registry.fetch(@policy_name)
|
|
38
|
+
@paused = PolicySetting.for_policy(@policy_name).pick(:paused) || false
|
|
34
39
|
@partitions = Partition.for_policy(@policy_name)
|
|
35
40
|
.order(Arel.sql("pending_count DESC, last_admit_at DESC NULLS LAST"))
|
|
36
41
|
.limit(100)
|
|
@@ -77,17 +82,31 @@ module DispatchPolicy
         in_backoff: @round_trip[:in_backoff],
         total_partitions: @totals[:partitions],
         adapter_target_jps: @capacity[:adapter_target_jps],
-        pending_trend: @pending_trend
+        pending_trend: @pending_trend,
+        paused: @paused
       )
     end

     def pause
-
+      # Policy-level flag is the source of truth the tick honors (so a key
+      # that first appears AFTER the pause is held too). The per-partition
+      # status update is kept for the partitions index display. One TX so
+      # both writes commit or neither: a flag without the statuses (or vice
+      # versa) leaves the partition list contradicting what admission
+      # actually does until the next toggle. set_policy_paused! shares the
+      # connection (same role via around_action), so it joins this TX.
+      Partition.transaction do
+        Repository.set_policy_paused!(policy_name: @policy_name, paused: true)
+        Partition.for_policy(@policy_name).update_all(status: "paused", updated_at: Time.current)
+      end
       redirect_to policy_path(@policy_name), notice: "Policy paused."
     end

     def resume
-      Partition.
+      Partition.transaction do
+        Repository.set_policy_paused!(policy_name: @policy_name, paused: false)
+        Partition.for_policy(@policy_name).update_all(status: "active", updated_at: Time.current)
+      end
       redirect_to policy_path(@policy_name), notice: "Policy resumed."
     end

@@ -103,7 +122,10 @@ module DispatchPolicy
         .each do |partition|
           break if drained >= DRAIN_MAX_PER_REQUEST

-
+          # Pass the REMAINING budget so a single partition can't push the
+          # total past the cap (a fixed per-partition cap could overshoot by
+          # nearly 2× when the first partition nearly fills it).
+          batch, = PartitionsController.drain_partition!(partition, cap: DRAIN_MAX_PER_REQUEST - drained)
           drained += batch
         end

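The overshoot described in the comment above can be reproduced with a small stand-alone sketch. This is not the gem's code: `DRAIN_CAP`, `drain_partition!` and `drain_all` are hypothetical stand-ins, partitions are modelled as plain arrays of jobs, and the cap is shrunk to 10 so the arithmetic is easy to follow.

```ruby
# Hypothetical stand-in for DRAIN_MAX_PER_REQUEST (value chosen for illustration only).
DRAIN_CAP = 10

# Toy drain: removes up to `cap` jobs from the partition (an Array) and
# returns how many were removed. The real drain_partition! is not shown here.
def drain_partition!(jobs, cap:)
  jobs.shift(cap).size
end

# Drains partitions in order. With per_partition_cap set, every partition gets
# the same fixed cap (the old behaviour as the comment describes it); without
# it, each partition only gets the budget that is still left.
def drain_all(partitions, per_partition_cap: nil)
  drained = 0
  partitions.each do |jobs|
    break if drained >= DRAIN_CAP

    cap = per_partition_cap || (DRAIN_CAP - drained)
    drained += drain_partition!(jobs, cap: cap)
  end
  drained
end

# First partition nearly fills the cap, second one is large.
fixed     = drain_all([Array.new(9, :job), Array.new(20, :job)], per_partition_cap: DRAIN_CAP)
remaining = drain_all([Array.new(9, :job), Array.new(20, :job)])

puts "fixed per-partition cap drained #{fixed}"     # 9 + 10 = 19, nearly 2x the cap
puts "remaining-budget cap drained #{remaining}"    # 9 + 1 = 10, exactly the cap
```

With a fixed cap, the loop guard only fires after the second partition has already taken a full batch, so the total lands at 19. Passing the remaining budget keeps the total at the cap regardless of how the jobs are spread across partitions.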
@@ -0,0 +1,14 @@
+# frozen_string_literal: true
+
+module DispatchPolicy
+  # Policy-level settings (currently just the pause flag). One row per
+  # policy_name. The tick's claim_partitions consults this so a pause takes
+  # effect for partitions created after the pause too — not only the ones
+  # that existed when the operator clicked.
+  class PolicySetting < ApplicationRecord
+    self.table_name = "dispatch_policy_policy_settings"
+
+    scope :for_policy, ->(name) { where(policy_name: name) }
+    scope :paused, -> { where(paused: true) }
+  end
+end
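Why a policy-level flag is needed on top of per-partition statuses can be shown with a toy in-memory model. Everything below is an assumption made for illustration: `PartitionRow`, the policy name `"mailers"`, the tenant keys, and both claim filters are hypothetical, and the real `claim_partitions` query is not part of this diff excerpt.

```ruby
require "set"

# Toy partition record; the real rows live in the database.
PartitionRow = Struct.new(:policy_name, :key, :status)

paused_policies = Set.new
partitions = [PartitionRow.new("mailers", "tenant-1", "active")]

# Operator pauses the policy: set the policy-level flag and mark the
# partitions that exist right now.
paused_policies << "mailers"
partitions.each { |p| p.status = "paused" if p.policy_name == "mailers" }

# A key that first appears AFTER the pause gets a fresh partition row,
# which starts out "active".
partitions << PartitionRow.new("mailers", "tenant-2", "active")

# Claiming by partition status alone would admit the new partition.
claimable_by_status = partitions.select { |p| p.status == "active" }

# Consulting the policy-level flag as well holds the new partition too.
claimable_with_flag = partitions.reject do |p|
  p.status == "paused" || paused_policies.include?(p.policy_name)
end

puts "status only:    #{claimable_by_status.map(&:key).inspect}"
puts "with flag:      #{claimable_with_flag.map(&:key).inspect}"
```

In this model the status-only filter still returns `tenant-2`, while the filter that also checks the flag returns nothing, which matches the behaviour the model comment describes for partitions created after the pause.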
@@ -84,7 +84,12 @@
   <tbody>
     <% @policies.each do |p| %>
       <tr>
-        <td
+        <td>
+          <%= link_to p[:name], policy_path(p[:name]), class: "dp-link" %>
+          <% if p[:paused] %>
+            <span class="dp-warn" style="font-size:11px; border:1px solid currentColor; border-radius:4px; padding:1px 5px; margin-left:4px;">paused</span>
+          <% end %>
+        </td>
         <td class="dp-num"><%= format_count(p[:pending]) %></td>
         <td class="dp-num"><%= format_count(p[:in_flight]) %></td>
         <td class="dp-num"><%= format_count(p[:admitted_1m]) %></td>
@@ -35,7 +35,7 @@
   </thead>
   <tbody>
     <% @partitions.each do |p| %>
-      <%= render "dispatch_policy/shared/partition_row", partition: p %>
+      <%= render "dispatch_policy/shared/partition_row", partition: p, policy_paused: @paused_policies.include?(p.policy_name) %>
     <% end %>
   </tbody>
 </table>
@@ -23,7 +23,7 @@
   <div class="dp-stat"><span class="dp-stat-label">Policy</span><span class="dp-stat-value"><%= @partition.policy_name %></span></div>
   <div class="dp-stat"><span class="dp-stat-label">Shard</span><span class="dp-stat-value"><code><%= @partition.shard %></code></span></div>
   <div class="dp-stat"><span class="dp-stat-label">Queue</span><span class="dp-stat-value"><%= @partition.queue_name || "—" %></span></div>
-  <div class="dp-stat"><span class="dp-stat-label">Status</span><span class="dp-stat-value <%= "dp-warn" if @partition.paused? %>"><%= @partition.status %></span></div>
+  <div class="dp-stat"><span class="dp-stat-label">Status</span><span class="dp-stat-value <%= "dp-warn" if @partition.paused? || @policy_paused %>"><%= @policy_paused && !@partition.paused? ? "#{@partition.status} (policy paused)" : @partition.status %></span></div>
   <div class="dp-stat"><span class="dp-stat-label">Pending</span><span class="dp-stat-value"><%= format_count(@partition.pending_count) %></span></div>
   <div class="dp-stat"><span class="dp-stat-label">Lifetime admitted</span><span class="dp-stat-value"><%= format_count(@partition.total_admitted) %></span></div>
   <div class="dp-stat"><span class="dp-stat-label">Round-trip age</span><span class="dp-stat-value"><%= age_seconds ? format_duration_seconds(age_seconds) : "never" %></span></div>
@@ -13,15 +13,23 @@
   <tbody>
     <% @rows.each do |row| %>
       <tr>
-        <td
+        <td>
+          <%= link_to row[:name], policy_path(row[:name]), class: "dp-link" %>
+          <% if row[:paused] %>
+            <span class="dp-warn" style="font-size:11px; border:1px solid currentColor; border-radius:4px; padding:1px 5px; margin-left:4px;">paused</span>
+          <% end %>
+        </td>
         <td class="dp-num"><%= format_count(row[:pending]) %></td>
         <td class="dp-num"><%= format_count(row[:in_flight]) %></td>
         <td class="dp-num"><%= format_count(row[:partitions]) %></td>
         <td class="dp-num"><%= row[:paused_count].positive? ? content_tag(:span, format_count(row[:paused_count]), class: "dp-warn") : 0 %></td>
         <td><%= row[:registered] ? "yes" : content_tag(:span, "no (orphan)", class: "dp-warn") %></td>
         <td>
-
-
+          <% if row[:paused] %>
+            <%= button_to "Resume", resume_policy_path(row[:name]), class: "dp-btn dp-btn-ok", method: :post, form: { class: "dp-form-inline" } %>
+          <% else %>
+            <%= button_to "Pause", pause_policy_path(row[:name]), class: "dp-btn", method: :post, form: { class: "dp-form-inline" } %>
+          <% end %>
         </td>
       </tr>
     <% end %>
@@ -1,4 +1,9 @@
-<h1>
+<h1>
+  Policy <code><%= @policy_name %></code>
+  <% if @paused %>
+    <span class="dp-warn" style="font-size:14px; vertical-align:middle; border:1px solid currentColor; border-radius:4px; padding:2px 8px; margin-left:8px;">PAUSED</span>
+  <% end %>
+</h1>

 <section class="dp-stats">
   <div class="dp-stat"><span class="dp-stat-label">Partitions</span><span class="dp-stat-value"><%= format_count(@totals[:partitions]) %></span></div>
@@ -150,15 +155,19 @@

 <section class="dp-section">
   <h2>Actions</h2>
-
-
+  <% if @paused %>
+    <%= button_to "Resume policy", resume_policy_path(@policy_name), class: "dp-btn dp-btn-ok", method: :post, form: { class: "dp-form-inline" } %>
+  <% else %>
+    <%= button_to "Pause policy", pause_policy_path(@policy_name), class: "dp-btn", method: :post, form: { class: "dp-form-inline" } %>
+  <% end %>
   <%= button_to "Drain policy", drain_policy_path(@policy_name),
         class: "dp-btn dp-btn-warn",
         method: :post,
         form: { class: "dp-form-inline",
                 onsubmit: "return confirm('Force-admit every staged job across every partition of this policy, bypassing all gates?');" } %>
   <p class="dp-hint">
-    <strong>Pause</strong> stops admission
+    <strong>Pause</strong> stops admission for the whole policy — including partitions created
+    after the pause — but keeps staging: the queue keeps filling, in-flight jobs finish.
     <strong>Drain</strong> empties the staging table by force-admitting every job (bypassing gates).
     Capped at 10,000 jobs per click — click again for more.
   </p>