pgbus 0.7.8 → 0.8.0
- checksums.yaml +4 -4
- data/README.md +42 -0
- data/app/helpers/pgbus/streams_helper.rb +3 -1
- data/lib/pgbus/active_job/executor.rb +31 -2
- data/lib/pgbus/client/notify_stream.rb +37 -0
- data/lib/pgbus/client.rb +2 -0
- data/lib/pgbus/configuration.rb +36 -1
- data/lib/pgbus/engine.rb +15 -0
- data/lib/pgbus/event_bus/handler.rb +22 -2
- data/lib/pgbus/instrumentation.rb +15 -6
- data/lib/pgbus/integrations/appsignal/dashboards/pgbus_health.json +87 -0
- data/lib/pgbus/integrations/appsignal/dashboards/pgbus_streams.json +65 -0
- data/lib/pgbus/integrations/appsignal/dashboards/pgbus_throughput.json +81 -0
- data/lib/pgbus/integrations/appsignal/probe.rb +128 -0
- data/lib/pgbus/integrations/appsignal/subscriber.rb +303 -0
- data/lib/pgbus/integrations/appsignal.rb +52 -0
- data/lib/pgbus/outbox.rb +17 -13
- data/lib/pgbus/process/dispatcher.rb +38 -0
- data/lib/pgbus/process/worker.rb +20 -2
- data/lib/pgbus/recurring/scheduler.rb +10 -2
- data/lib/pgbus/streams/turbo_broadcastable.rb +2 -1
- data/lib/pgbus/streams.rb +28 -7
- data/lib/pgbus/version.rb +1 -1
- data/lib/pgbus/web/data_source.rb +43 -4
- data/lib/pgbus/web/streamer/listener.rb +9 -5
- data/lib/pgbus/web/streamer/stream_event_dispatcher.rb +45 -21
- data/lib/pgbus.rb +7 -2
- metadata +8 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 60cc7178f84e5d28919085f5c5a9d824aca958180d433daf13cb04a369ed25c0
|
|
4
|
+
data.tar.gz: 5a7fdb569f90cf3e60ef5951e86d6ed0c3c31c7703b0f88be91c2df3054b8817
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 8cc8aa7893bdb605379f4cea9542766062b5a759a89d6149e49a39729484f56790bc131b1063b16a855357a0ab4979aeb52f4bd9f572568e793482fcabaf9939
|
|
7
|
+
data.tar.gz: '052525538d45540220e66da0d43f007727848543d8f1184f7598f481e3abe3fe6be961eb6c634f414c158cd884b856305a35ded77e39d2f0f4d7c24bc10cf1a4'
|
data/README.md
CHANGED
|
@@ -728,6 +728,48 @@ Reporters are wired into all critical rescue paths: job execution failures, work
|
|
|
728
728
|
|
|
729
729
|
`ErrorReporter.report` is guaranteed to never raise — if a reporter or the logger itself throws, the error is swallowed silently. This preserves fault-tolerance invariants at every rescue site.
|
|
730
730
|
|
|
731
|
+
### AppSignal integration
|
|
732
|
+
|
|
733
|
+
When the `appsignal` gem is loaded in your app, Pgbus auto-installs a subscriber and a minutely probe that report into AppSignal:
|
|
734
|
+
|
|
735
|
+
- **Background-job transactions** for every ActiveJob run and every event-bus handler invocation. Action names follow the AppSignal convention: `MyJob#perform`, `MyHandler#handle`. Tags include `queue`, `job_class`/`handler`, `routing_key`, `attempts`, and the `active_job_id` / `provider_job_id`. `enqueued_at` becomes the AppSignal `queue_start` timestamp so "time on queue" shows up correctly in the timeline.
|
|
736
|
+
- **Custom counters and distributions** for sends, reads, broadcasts, outbox publishes, recurring scheduling, and worker recycles. All metric names are prefixed `pgbus_`.
|
|
737
|
+
- **A minutely probe** that gauges queue depth (visible vs total), oldest message age per queue, DLQ depth, failed events count, dead-tuple totals, MVCC horizon age, active processes, and stream connection estimates.
|
|
738
|
+
|
|
739
|
+
There is nothing to wire up — load the appsignal gem and the integration installs itself in a Rails initializer. To opt out:
|
|
740
|
+
|
|
741
|
+
```ruby
|
|
742
|
+
Pgbus.configure do |c|
|
|
743
|
+
c.appsignal_enabled = false # disable subscriber + probe entirely
|
|
744
|
+
c.appsignal_probe_enabled = false # keep transactions, drop the gauge probe
|
|
745
|
+
end
|
|
746
|
+
```
|
|
747
|
+
|
|
748
|
+
#### Dashboards
|
|
749
|
+
|
|
750
|
+
Three importable AppSignal dashboards ship with the gem:
|
|
751
|
+
|
|
752
|
+
| File | Purpose |
|
|
753
|
+
|------|---------|
|
|
754
|
+
| `lib/pgbus/integrations/appsignal/dashboards/pgbus_throughput.json` | Jobs/sec, perform-duration percentiles, send/read counts |
|
|
755
|
+
| `lib/pgbus/integrations/appsignal/dashboards/pgbus_health.json` | Queue depth, oldest message age, DLQ, dead tuples, MVCC horizon, worker recycles |
|
|
756
|
+
| `lib/pgbus/integrations/appsignal/dashboards/pgbus_streams.json` | Broadcasts, fanout, active SSE connections, outbox, recurring tasks |
|
|
757
|
+
|
|
758
|
+
Import via the AppSignal dashboard UI ("New dashboard" → "Import JSON") or the AppSignal API.
|
|
759
|
+
|
|
760
|
+
#### Custom subscriptions
|
|
761
|
+
|
|
762
|
+
The integration is built on `ActiveSupport::Notifications`. If you want to push pgbus telemetry into a different APM (Datadog, New Relic, OpenTelemetry), subscribe directly:
|
|
763
|
+
|
|
764
|
+
```ruby
|
|
765
|
+
ActiveSupport::Notifications.subscribe(/^pgbus\./) do |name, start, finish, _id, payload|
|
|
766
|
+
duration_ms = (finish - start) * 1_000
|
|
767
|
+
YourApm.record(name, duration_ms, payload)
|
|
768
|
+
end
|
|
769
|
+
```
|
|
770
|
+
|
|
771
|
+
Events emitted: `pgbus.executor.execute`, `pgbus.job_completed`, `pgbus.job_failed`, `pgbus.job_dead_lettered`, `pgbus.event_processed`, `pgbus.event_failed`, `pgbus.client.send_message`, `pgbus.client.send_batch`, `pgbus.client.read_batch`, `pgbus.stream.broadcast`, `pgbus.outbox.publish`, `pgbus.recurring.enqueue`, `pgbus.worker.recycle`. Payload keys are documented in `lib/pgbus/instrumentation.rb`.
|
|
772
|
+
|
|
731
773
|
### Structured logging
|
|
732
774
|
|
|
733
775
|
Pgbus ships two log formatters inspired by Sidekiq's `Logger::Formatters`:
|
|
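The custom-subscription snippet in the README hunk above derives a duration from the `start` and `finish` arguments. A minimal sketch of that arithmetic as a plain method, so it can be exercised without ActiveSupport; `RECORDED_EVENTS` and `record_pgbus_event` are hypothetical names for illustration and are not part of pgbus:

```ruby
# Hypothetical in-memory sink standing in for an APM client.
RECORDED_EVENTS = []

# Takes the same five arguments the README's subscriber block receives:
# event name, start time, finish time, unique id, payload hash.
def record_pgbus_event(name, start, finish, _id, payload)
  # Ignore anything outside the pgbus.* namespace.
  return unless name.start_with?("pgbus.")

  # Subtracting two Time objects yields seconds as a Float.
  duration_ms = (finish - start) * 1_000
  RECORDED_EVENTS << { name: name, duration_ms: duration_ms, queue: payload[:queue] }
end

started = Time.at(100.0)
record_pgbus_event("pgbus.client.send_message", started, started + 0.25, "abc", { queue: "default" })
record_pgbus_event("sql.active_record", started, started + 1, "def", {}) # filtered out
```

Only the first call is recorded; the second is dropped by the name check, which plays the role of the `/^pgbus\./` pattern in the README example.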
data/app/helpers/pgbus/streams_helper.rb
CHANGED
|
@@ -131,7 +131,9 @@ module Pgbus
|
|
|
131
131
|
return nil if cache[:script_emitted]
|
|
132
132
|
|
|
133
133
|
cache[:script_emitted] = true
|
|
134
|
-
|
|
134
|
+
nonce = content_security_policy_nonce if respond_to?(:content_security_policy_nonce)
|
|
135
|
+
nonce_attr = nonce ? %( nonce="#{CGI.escape_html(nonce)}") : ""
|
|
136
|
+
script = %(<script type="module"#{nonce_attr}>import "pgbus/stream_source_element"</script>)
|
|
135
137
|
script.respond_to?(:html_safe) ? script.html_safe : script
|
|
136
138
|
end
|
|
137
139
|
|
|
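The hunk above adds a CSP nonce to the emitted module script when the view exposes one. A minimal standalone sketch of the same string-building step, assuming a hypothetical method name `stream_source_script_tag` and using `CGI.escapeHTML` (the diff calls the `escape_html` alias):

```ruby
require "cgi"

# Builds the module-script tag. The nonce attribute is added only when a
# nonce is given, and its value is HTML-escaped before interpolation.
def stream_source_script_tag(nonce = nil)
  nonce_attr = nonce ? %( nonce="#{CGI.escapeHTML(nonce)}") : ""
  %(<script type="module"#{nonce_attr}>import "pgbus/stream_source_element"</script>)
end

puts stream_source_script_tag
puts stream_source_script_tag("r4nd0m")
```

Escaping the nonce keeps a value containing a double quote from closing the attribute early.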
data/lib/pgbus/active_job/executor.rb
CHANGED
|
@@ -33,6 +33,15 @@ module Pgbus
|
|
|
33
33
|
signal_batch_discarded(payload)
|
|
34
34
|
Uniqueness.release_lock(Uniqueness.extract_key(payload))
|
|
35
35
|
record_stat(payload, queue_name, "dead_lettered", execution_start, message: message)
|
|
36
|
+
instrument(
|
|
37
|
+
"pgbus.job_dead_lettered",
|
|
38
|
+
queue: queue_name,
|
|
39
|
+
job_class: job_class,
|
|
40
|
+
job_id: payload["job_id"],
|
|
41
|
+
provider_job_id: payload["provider_job_id"],
|
|
42
|
+
read_ct: read_count,
|
|
43
|
+
msg_id: message.msg_id.to_i
|
|
44
|
+
)
|
|
36
45
|
Pgbus.logger.debug { "[Pgbus::Executor] dead_lettered #{tag} job_class=#{job_class}" }
|
|
37
46
|
return :dead_lettered
|
|
38
47
|
end
|
|
@@ -60,7 +69,17 @@ module Pgbus
|
|
|
60
69
|
job_succeeded = false
|
|
61
70
|
|
|
62
71
|
msg_id = message.msg_id.to_i
|
|
63
|
-
|
|
72
|
+
instrument_payload = {
|
|
73
|
+
queue: queue_name,
|
|
74
|
+
job_class: job_class,
|
|
75
|
+
job_id: payload["job_id"],
|
|
76
|
+
provider_job_id: payload["provider_job_id"],
|
|
77
|
+
arguments: payload["arguments"],
|
|
78
|
+
enqueued_at: payload["enqueued_at"],
|
|
79
|
+
read_ct: read_count,
|
|
80
|
+
msg_id: msg_id
|
|
81
|
+
}
|
|
82
|
+
Instrumentation.instrument("pgbus.executor.execute", instrument_payload) do
|
|
64
83
|
job = ::ActiveJob::Base.deserialize(payload)
|
|
65
84
|
Pgbus.logger.debug { "[Pgbus::Executor] running #{tag} job_class=#{job_class}" }
|
|
66
85
|
execute_job(job)
|
|
@@ -85,7 +104,17 @@ module Pgbus
|
|
|
85
104
|
# silently lost control flow — no failed event row, no job_failed
|
|
86
105
|
# notification, uniqueness lock held until VT expired. See issue #126.
|
|
87
106
|
handle_failure(message, queue_name, e, payload: payload)
|
|
88
|
-
instrument(
|
|
107
|
+
instrument(
|
|
108
|
+
"pgbus.job_failed",
|
|
109
|
+
queue: queue_name,
|
|
110
|
+
job_class: payload&.dig("job_class"),
|
|
111
|
+
job_id: payload&.dig("job_id"),
|
|
112
|
+
provider_job_id: payload&.dig("provider_job_id"),
|
|
113
|
+
read_ct: message.read_ct.to_i,
|
|
114
|
+
msg_id: message.msg_id.to_i,
|
|
115
|
+
error: e.class.name,
|
|
116
|
+
exception_object: e
|
|
117
|
+
)
|
|
89
118
|
record_stat(payload, queue_name, "failed", execution_start, message: message)
|
|
90
119
|
Pgbus.logger.debug { "[Pgbus::Executor] failed #{tag} job_class=#{payload&.dig("job_class")} error=#{e.class}" }
|
|
91
120
|
# Don't signal concurrency on transient failure — the job will be retried.
|
|
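The `pgbus.job_failed` payload in the last hunk reads its fields with `payload&.dig(...)`, which tolerates a `nil` payload in the rescue path. A minimal sketch of that payload shape, assuming a hypothetical helper name `job_failed_payload`; when the payload can actually be `nil` in the gem is not shown in this diff:

```ruby
# Hypothetical helper mirroring the keys of the pgbus.job_failed event.
# `payload` may be nil; the safe-navigation operator then yields nil fields.
def job_failed_payload(queue_name, payload, error, read_ct:, msg_id:)
  {
    queue: queue_name,
    job_class: payload&.dig("job_class"),
    job_id: payload&.dig("job_id"),
    provider_job_id: payload&.dig("provider_job_id"),
    read_ct: read_ct,
    msg_id: msg_id,
    error: error.class.name,   # class name as a string, handy for tagging
    exception_object: error    # the exception itself, for error reporters
  }
end

p job_failed_payload("default", nil, RuntimeError.new("boom"), read_ct: 2, msg_id: 41)
```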
data/lib/pgbus/client/notify_stream.rb
ADDED
|
@@ -0,0 +1,37 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module Pgbus
|
|
4
|
+
class Client
|
|
5
|
+
# Fire-and-forget PG NOTIFY for ephemeral stream broadcasts. No PGMQ
|
|
6
|
+
# queue is created — the payload travels via the Postgres NOTIFY channel
|
|
7
|
+
# only, matching the channel naming convention that PGMQ's trigger uses:
|
|
8
|
+
# pgmq.q_<full_queue_name>.INSERT
|
|
9
|
+
#
|
|
10
|
+
# Subscribers already LISTEN on this channel via the Streamer's Listener.
|
|
11
|
+
# When a subscriber is connected, the StreamEventDispatcher receives the
|
|
12
|
+
# NOTIFY and fans out the payload. When no subscriber is connected,
|
|
13
|
+
# the NOTIFY is silently discarded by Postgres — no queue, no storage,
|
|
14
|
+
# no orphan tables.
|
|
15
|
+
#
|
|
16
|
+
# The payload is JSON-serialized into the NOTIFY's optional payload
|
|
17
|
+
# parameter (max 8000 bytes in Postgres). Broadcasts exceeding this
|
|
18
|
+
# limit will raise a PG::ProgramLimitExceeded error — callers needing
|
|
19
|
+
# large payloads should use durable mode (which inserts into PGMQ).
|
|
20
|
+
module NotifyStream
|
|
21
|
+
def notify_stream(stream_name, payload)
|
|
22
|
+
full_name = config.queue_name(stream_name)
|
|
23
|
+
sanitized = QueueNameValidator.sanitize!(full_name)
|
|
24
|
+
channel = "pgmq.q_#{sanitized}.INSERT"
|
|
25
|
+
json = payload.is_a?(String) ? payload : JSON.generate(payload)
|
|
26
|
+
|
|
27
|
+
Instrumentation.instrument("pgbus.stream.notify", stream: stream_name, bytes: json.bytesize) do
|
|
28
|
+
synchronized do
|
|
29
|
+
@pgmq.with_connection do |conn|
|
|
30
|
+
conn.exec_params("SELECT pg_notify($1, $2)", [channel, json])
|
|
31
|
+
end
|
|
32
|
+
end
|
|
33
|
+
end
|
|
34
|
+
end
|
|
35
|
+
end
|
|
36
|
+
end
|
|
37
|
+
end
|
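The comments in `notify_stream.rb` describe the channel naming convention and the NOTIFY payload limit. A minimal sketch of a pre-flight check a caller might run before choosing ephemeral mode; the helper names and the idea of checking up front are assumptions for illustration, since the module itself simply lets Postgres raise:

```ruby
require "json"

# Limit quoted in the source comment. Postgres requires the payload to be
# shorter than this in its default configuration.
NOTIFY_PAYLOAD_LIMIT = 8000

# Channel name following the convention shown in the diff.
# The queue name is assumed to be already prefixed and sanitized.
def notify_channel_for(sanitized_queue_name)
  "pgmq.q_#{sanitized_queue_name}.INSERT"
end

# Strings pass through unchanged; everything else is JSON-encoded.
def notify_json(payload)
  payload.is_a?(String) ? payload : JSON.generate(payload)
end

# True when the serialized payload is small enough for pg_notify.
def fits_in_notify?(payload)
  notify_json(payload).bytesize < NOTIFY_PAYLOAD_LIMIT
end

puts notify_channel_for("pgbus_chat_room_1")
puts fits_in_notify?({ "html" => "<turbo-stream></turbo-stream>" })
```

Note that the check uses `bytesize`, not `length`, because the limit applies to bytes and multi-byte characters count more than once.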
data/lib/pgbus/client.rb
CHANGED
|
@@ -3,11 +3,13 @@
|
|
|
3
3
|
require "json"
|
|
4
4
|
require_relative "client/read_after"
|
|
5
5
|
require_relative "client/ensure_stream_queue"
|
|
6
|
+
require_relative "client/notify_stream"
|
|
6
7
|
|
|
7
8
|
module Pgbus
|
|
8
9
|
class Client
|
|
9
10
|
include ReadAfter
|
|
10
11
|
include EnsureStreamQueue
|
|
12
|
+
include NotifyStream
|
|
11
13
|
|
|
12
14
|
attr_reader :pgmq, :config
|
|
13
15
|
|
data/lib/pgbus/configuration.rb
CHANGED
|
@@ -104,7 +104,13 @@ module Pgbus
|
|
|
104
104
|
:streams_default_retention, :streams_retention, :streams_heartbeat_interval,
|
|
105
105
|
:streams_max_connections, :streams_idle_timeout, :streams_listen_health_check_ms,
|
|
106
106
|
:streams_write_deadline_ms, :streams_falcon_streaming_body,
|
|
107
|
-
:streams_stats_enabled, :streams_test_mode
|
|
107
|
+
:streams_stats_enabled, :streams_test_mode,
|
|
108
|
+
:streams_orphan_sweep_interval, :streams_orphan_threshold
|
|
109
|
+
attr_reader :streams_default_broadcast_mode # rubocop:disable Style/AccessorGrouping
|
|
110
|
+
|
|
111
|
+
# AppSignal integration (auto-loaded when ::Appsignal is defined and this is true).
|
|
112
|
+
# Set to false to opt out without uninstalling the appsignal gem.
|
|
113
|
+
attr_accessor :appsignal_enabled, :appsignal_probe_enabled
|
|
108
114
|
|
|
109
115
|
def initialize
|
|
110
116
|
@database_url = nil
|
|
@@ -212,6 +218,14 @@ module Pgbus
|
|
|
212
218
|
# usually want job stats on and stream stats off, or vice versa.
|
|
213
219
|
@streams_stats_enabled = false
|
|
214
220
|
@streams_test_mode = false
|
|
221
|
+
@streams_default_broadcast_mode = :ephemeral
|
|
222
|
+
@streams_orphan_sweep_interval = 3600 # 1 hour
|
|
223
|
+
@streams_orphan_threshold = 86_400 # 24 hours
|
|
224
|
+
|
|
225
|
+
# AppSignal: auto-on when the appsignal gem is loaded; probe runs in
|
|
226
|
+
# the same process, so the operator can disable it independently.
|
|
227
|
+
@appsignal_enabled = true
|
|
228
|
+
@appsignal_probe_enabled = true
|
|
215
229
|
end
|
|
216
230
|
|
|
217
231
|
def queue_name(name)
|
|
@@ -255,6 +269,18 @@ module Pgbus
|
|
|
255
269
|
end
|
|
256
270
|
end
|
|
257
271
|
|
|
272
|
+
VALID_BROADCAST_MODES = %i[ephemeral durable].freeze
|
|
273
|
+
|
|
274
|
+
def streams_default_broadcast_mode=(mode)
|
|
275
|
+
mode = mode.to_sym
|
|
276
|
+
unless VALID_BROADCAST_MODES.include?(mode)
|
|
277
|
+
raise ArgumentError,
|
|
278
|
+
"Invalid streams_default_broadcast_mode: #{mode}. Must be one of: #{VALID_BROADCAST_MODES.join(", ")}"
|
|
279
|
+
end
|
|
280
|
+
|
|
281
|
+
@streams_default_broadcast_mode = mode
|
|
282
|
+
end
|
|
283
|
+
|
|
258
284
|
VALID_PGMQ_SCHEMA_MODES = %i[auto extension embedded].freeze
|
|
259
285
|
|
|
260
286
|
def pgmq_schema_mode=(mode)
|
|
@@ -334,6 +360,15 @@ module Pgbus
|
|
|
334
360
|
end
|
|
335
361
|
|
|
336
362
|
raise ArgumentError, "streams_retention must be a Hash" unless streams_retention.is_a?(Hash)
|
|
363
|
+
|
|
364
|
+
if streams_orphan_sweep_interval && !(streams_orphan_sweep_interval.is_a?(Numeric) && streams_orphan_sweep_interval.positive?)
|
|
365
|
+
raise ArgumentError, "streams_orphan_sweep_interval must be a positive number or nil to disable"
|
|
366
|
+
end
|
|
367
|
+
|
|
368
|
+
return if streams_orphan_threshold.nil?
|
|
369
|
+
return if streams_orphan_threshold.is_a?(Numeric) && streams_orphan_threshold.positive?
|
|
370
|
+
|
|
371
|
+
raise ArgumentError, "streams_orphan_threshold must be a positive number or nil to disable"
|
|
337
372
|
end
|
|
338
373
|
|
|
339
374
|
# Set the worker capsule list. Accepts:
|
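The `streams_default_broadcast_mode=` setter above coerces its argument to a symbol and rejects anything outside a fixed list. A minimal standalone sketch of that validated-setter pattern, using a hypothetical `StreamSettings` class in place of `Pgbus::Configuration`:

```ruby
# Hypothetical stand-in for the configuration object in the diff.
class StreamSettings
  VALID_BROADCAST_MODES = %i[ephemeral durable].freeze

  attr_reader :broadcast_mode

  def initialize
    @broadcast_mode = :ephemeral # same default as the diff
  end

  # Accepts a String or Symbol; raises ArgumentError for unknown modes
  # and leaves the previous value untouched in that case.
  def broadcast_mode=(mode)
    mode = mode.to_sym
    unless VALID_BROADCAST_MODES.include?(mode)
      raise ArgumentError,
            "Invalid broadcast mode: #{mode}. Must be one of: #{VALID_BROADCAST_MODES.join(", ")}"
    end

    @broadcast_mode = mode
  end
end

settings = StreamSettings.new
settings.broadcast_mode = "durable"
p settings.broadcast_mode
```

Validating in the setter surfaces a typo at configuration time instead of at the first broadcast.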
data/lib/pgbus/engine.rb
CHANGED
|
@@ -71,6 +71,21 @@ module Pgbus
|
|
|
71
71
|
require "pgbus/web/data_source"
|
|
72
72
|
end
|
|
73
73
|
|
|
74
|
+
# AppSignal is third-party and entirely optional. We require the
|
|
75
|
+
# integration only when the host app has the appsignal gem loaded
|
|
76
|
+
# AND hasn't disabled it via config.appsignal_enabled. AppSignal
|
|
77
|
+
# itself loads early (it's typically required from config/environment.rb
|
|
78
|
+
# before Rails finishes booting), so by the time `after_initialize`
|
|
79
|
+
# fires the constant check is reliable.
|
|
80
|
+
initializer "pgbus.integrations.appsignal", after: :load_config_initializers do
|
|
81
|
+
ActiveSupport.on_load(:after_initialize) do
|
|
82
|
+
next unless defined?(::Appsignal) && Pgbus.configuration.appsignal_enabled
|
|
83
|
+
|
|
84
|
+
require "pgbus/integrations/appsignal"
|
|
85
|
+
Pgbus::Integrations::Appsignal.install!
|
|
86
|
+
end
|
|
87
|
+
end
|
|
88
|
+
|
|
74
89
|
# Install the watermark cache middleware ahead of the app's own
|
|
75
90
|
# middleware so the thread-local cache is cleared between every
|
|
76
91
|
# Rack request. Without this, repeated page renders served by the
|
|
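The initializer above installs the integration only when the `::Appsignal` constant exists and the config flag is on. A minimal sketch of that guard as a predicate, assuming a hypothetical method name `install_appsignal_integration?`:

```ruby
# `defined?` is evaluated each time the method runs, so a constant that is
# loaded later in boot is still picked up when the check happens late.
def install_appsignal_integration?(enabled)
  !!(defined?(::Appsignal) && enabled)
end

puts install_appsignal_integration?(true)
```

Deferring the check is what makes the constant test meaningful: the result depends on what has been loaded by the time the method is called, not by the time it was defined.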
data/lib/pgbus/event_bus/handler.rb
CHANGED
|
@@ -30,12 +30,32 @@ module Pgbus
|
|
|
30
30
|
def process!(message)
|
|
31
31
|
raw = JSON.parse(message.message)
|
|
32
32
|
event = build_event(raw)
|
|
33
|
+
routing_key = raw.dig("headers", "routing_key") || raw["routing_key"]
|
|
33
34
|
|
|
34
35
|
return :skipped if self.class.idempotent? && !claim_idempotency?(event.event_id)
|
|
35
36
|
|
|
36
|
-
|
|
37
|
-
|
|
37
|
+
instrument_payload = {
|
|
38
|
+
event_id: event.event_id,
|
|
39
|
+
handler: self.class.name,
|
|
40
|
+
routing_key: routing_key,
|
|
41
|
+
published_at: event.published_at,
|
|
42
|
+
read_ct: message.read_ct.to_i,
|
|
43
|
+
msg_id: message.msg_id.to_i
|
|
44
|
+
}
|
|
45
|
+
Instrumentation.instrument("pgbus.event_processed", instrument_payload) do
|
|
46
|
+
handle(event)
|
|
47
|
+
end
|
|
38
48
|
:handled
|
|
49
|
+
rescue StandardError => e
|
|
50
|
+
instrument(
|
|
51
|
+
"pgbus.event_failed",
|
|
52
|
+
event_id: event&.event_id,
|
|
53
|
+
handler: self.class.name,
|
|
54
|
+
routing_key: routing_key,
|
|
55
|
+
error: e.class.name,
|
|
56
|
+
exception_object: e
|
|
57
|
+
)
|
|
58
|
+
raise
|
|
39
59
|
end
|
|
40
60
|
|
|
41
61
|
# Mirrors Pgbus::ActiveJob::Executor#execute_job: wrap the handler
|
|
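The handler hunk above resolves the routing key from `headers` first, then emits a failure event and re-raises when the handler errors. A minimal sketch of both steps in plain Ruby; `EMITTED_EVENTS`, `routing_key_for`, and `handle_with_failure_event` are hypothetical names, and the array stands in for the real instrumentation backend:

```ruby
# Hypothetical sink standing in for the instrumentation backend.
EMITTED_EVENTS = []

# Prefer headers.routing_key, fall back to the top-level routing_key.
def routing_key_for(raw)
  raw.dig("headers", "routing_key") || raw["routing_key"]
end

# Runs the block; on failure records an event and re-raises so the caller's
# retry handling still sees the original exception.
def handle_with_failure_event(raw)
  routing_key = routing_key_for(raw)
  yield
rescue StandardError => e
  EMITTED_EVENTS << ["pgbus.event_failed", routing_key, e.class.name]
  raise
end

handle_with_failure_event({ "routing_key" => "orders.created" }) { :handled }
```

The bare `raise` inside `rescue` re-raises the same exception object, so the backtrace is preserved.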
data/lib/pgbus/instrumentation.rb
CHANGED
|
@@ -7,12 +7,21 @@ module Pgbus
|
|
|
7
7
|
# automatically when used with the block form of AS::Notifications.instrument.
|
|
8
8
|
#
|
|
9
9
|
# Events emitted:
|
|
10
|
-
# pgbus.client.send_message
|
|
11
|
-
# pgbus.client.send_batch
|
|
12
|
-
# pgbus.client.read_batch
|
|
13
|
-
# pgbus.client.read_message
|
|
14
|
-
# pgbus.executor.execute
|
|
15
|
-
# pgbus.
|
|
10
|
+
# pgbus.client.send_message — single message enqueue
|
|
11
|
+
# pgbus.client.send_batch — batch enqueue
|
|
12
|
+
# pgbus.client.read_batch — batch dequeue
|
|
13
|
+
# pgbus.client.read_message — single message dequeue
|
|
14
|
+
# pgbus.executor.execute — full job execution (deserialize + perform + archive)
|
|
15
|
+
# pgbus.job_completed — job archived successfully
|
|
16
|
+
# pgbus.job_failed — job raised; carries :exception_object
|
|
17
|
+
# pgbus.job_dead_lettered — job exceeded max_retries and was DLQ-routed
|
|
18
|
+
# pgbus.event_processed — event handler succeeded
|
|
19
|
+
# pgbus.event_failed — event handler raised; carries :exception_object
|
|
20
|
+
# pgbus.stream.broadcast — stream broadcast (sync or deferred)
|
|
21
|
+
# pgbus.outbox.publish — outbox row created
|
|
22
|
+
# pgbus.recurring.enqueue — scheduler enqueued a due recurring task
|
|
23
|
+
# pgbus.worker.recycle — worker hit a recycle threshold
|
|
24
|
+
# pgbus.serializer.serialize — job/event serialization
|
|
16
25
|
# pgbus.serializer.deserialize — job/event deserialization
|
|
17
26
|
#
|
|
18
27
|
module Instrumentation
|
|
data/lib/pgbus/integrations/appsignal/dashboards/pgbus_health.json
ADDED
|
@@ -0,0 +1,87 @@
|
|
|
1
|
+
{
|
|
2
|
+
"title": "Pgbus — Health",
|
|
3
|
+
"description": "Backlog, dead-letter activity, dead-tuple growth, and MVCC horizon. The 'should I page someone?' dashboard.",
|
|
4
|
+
"graphs": [
|
|
5
|
+
{
|
|
6
|
+
"title": "Queue depth (visible vs total)",
|
|
7
|
+
"description": "Visible depth excludes messages whose VT hasn't expired. A divergence between the two means workers are slow but the queue isn't growing.",
|
|
8
|
+
"line_label": "%queue",
|
|
9
|
+
"format": "number",
|
|
10
|
+
"kind": "timeseries",
|
|
11
|
+
"metrics": [
|
|
12
|
+
{ "name": "pgbus_queue_depth", "fields": ["GAUGE"], "tags": [] },
|
|
13
|
+
{ "name": "pgbus_queue_visible_depth", "fields": ["GAUGE"], "tags": [] }
|
|
14
|
+
]
|
|
15
|
+
},
|
|
16
|
+
{
|
|
17
|
+
"title": "Oldest message age (seconds)",
|
|
18
|
+
"description": "Per-queue head-of-line waiting time. If this climbs while queue depth stays flat, a single poison message is stuck in the VT loop.",
|
|
19
|
+
"line_label": "%queue",
|
|
20
|
+
"format": "duration",
|
|
21
|
+
"format_input": "second",
|
|
22
|
+
"kind": "timeseries",
|
|
23
|
+
"metrics": [
|
|
24
|
+
{ "name": "pgbus_queue_oldest_message_age_seconds", "fields": ["GAUGE"], "tags": [] }
|
|
25
|
+
]
|
|
26
|
+
},
|
|
27
|
+
{
|
|
28
|
+
"title": "DLQ depth + failed events",
|
|
29
|
+
"description": "Messages that exceeded max_retries plus the failed-events table. Spikes after a deploy point at a regression.",
|
|
30
|
+
"format": "number",
|
|
31
|
+
"kind": "timeseries",
|
|
32
|
+
"metrics": [
|
|
33
|
+
{ "name": "pgbus_dlq_depth", "fields": ["GAUGE"], "tags": [] },
|
|
34
|
+
{ "name": "pgbus_failed_events_total", "fields": ["GAUGE"], "tags": [] }
|
|
35
|
+
]
|
|
36
|
+
},
|
|
37
|
+
{
|
|
38
|
+
"title": "Dead-lettered jobs per minute",
|
|
39
|
+
"line_label": "%queue %job_class",
|
|
40
|
+
"format": "number",
|
|
41
|
+
"kind": "timeseries",
|
|
42
|
+
"draw_null_as_zero": true,
|
|
43
|
+
"metrics": [
|
|
44
|
+
{ "name": "pgbus_queue_job_count", "fields": ["COUNTER"], "tags": [{ "key": "status", "value": "dead_lettered" }] }
|
|
45
|
+
]
|
|
46
|
+
},
|
|
47
|
+
{
|
|
48
|
+
"title": "Active processes",
|
|
49
|
+
"description": "Workers + dispatcher + scheduler currently heartbeating into pgbus_processes.",
|
|
50
|
+
"format": "number",
|
|
51
|
+
"kind": "timeseries",
|
|
52
|
+
"metrics": [
|
|
53
|
+
{ "name": "pgbus_active_processes", "fields": ["GAUGE"], "tags": [] }
|
|
54
|
+
]
|
|
55
|
+
},
|
|
56
|
+
{
|
|
57
|
+
"title": "Dead tuples in queue/archive tables",
|
|
58
|
+
"description": "If autovacuum can't keep up the index gets bloated and lock acquisition slows. Tune autovacuum_vacuum_scale_factor on the offending tables when this climbs.",
|
|
59
|
+
"format": "number",
|
|
60
|
+
"kind": "timeseries",
|
|
61
|
+
"metrics": [
|
|
62
|
+
{ "name": "pgbus_total_dead_tuples", "fields": ["GAUGE"], "tags": [] }
|
|
63
|
+
]
|
|
64
|
+
},
|
|
65
|
+
{
|
|
66
|
+
"title": "Oldest open transaction (seconds)",
|
|
67
|
+
"description": "MVCC horizon pin. Long-running transactions prevent VACUUM from cleaning the dead tuples above. Anything over 60s is a smell.",
|
|
68
|
+
"format": "duration",
|
|
69
|
+
"format_input": "second",
|
|
70
|
+
"kind": "timeseries",
|
|
71
|
+
"metrics": [
|
|
72
|
+
{ "name": "pgbus_oldest_transaction_age_seconds", "fields": ["GAUGE"], "tags": [] }
|
|
73
|
+
]
|
|
74
|
+
},
|
|
75
|
+
{
|
|
76
|
+
"title": "Worker recycles per minute",
|
|
77
|
+
"description": "Recycles by reason. Steady max_jobs is healthy; spiking max_memory means you have a leak.",
|
|
78
|
+
"line_label": "%reason",
|
|
79
|
+
"format": "number",
|
|
80
|
+
"kind": "timeseries",
|
|
81
|
+
"draw_null_as_zero": true,
|
|
82
|
+
"metrics": [
|
|
83
|
+
{ "name": "pgbus_worker_recycled", "fields": ["COUNTER"], "tags": [] }
|
|
84
|
+
]
|
|
85
|
+
}
|
|
86
|
+
]
|
|
87
|
+
}
|
|
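Each dashboard file above is a JSON document with a `graphs` array whose entries list the `metrics` they plot. A minimal sketch for listing the metric names a dashboard references, for example to compare against what an app actually emits; the method name `dashboard_metric_names` is hypothetical and the sample is a cut-down excerpt of the health dashboard:

```ruby
require "json"

# Returns the sorted, de-duplicated metric names referenced by a dashboard.
def dashboard_metric_names(json_text)
  dashboard = JSON.parse(json_text)
  dashboard.fetch("graphs")
           .flat_map { |graph| graph.fetch("metrics").map { |metric| metric.fetch("name") } }
           .uniq
           .sort
end

# Excerpt with two graphs taken from the health dashboard in this diff.
SAMPLE_DASHBOARD = <<~JSON
  {
    "title": "excerpt",
    "graphs": [
      { "title": "Queue depth (visible vs total)",
        "metrics": [
          { "name": "pgbus_queue_depth", "fields": ["GAUGE"], "tags": [] },
          { "name": "pgbus_queue_visible_depth", "fields": ["GAUGE"], "tags": [] }
        ] },
      { "title": "Active processes",
        "metrics": [
          { "name": "pgbus_active_processes", "fields": ["GAUGE"], "tags": [] }
        ] }
    ]
  }
JSON

puts dashboard_metric_names(SAMPLE_DASHBOARD)
```

Using `fetch` rather than `[]` makes a malformed dashboard fail loudly with a `KeyError` instead of returning an empty list.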
data/lib/pgbus/integrations/appsignal/dashboards/pgbus_streams.json
ADDED
|
@@ -0,0 +1,65 @@
|
|
|
1
|
+
{
|
|
2
|
+
"title": "Pgbus — Streams",
|
|
3
|
+
"description": "Real-time SSE pub/sub. Broadcasts, fanout, active connections, and the outbox/recurring scheduler.",
|
|
4
|
+
"graphs": [
|
|
5
|
+
{
|
|
6
|
+
"title": "Stream broadcasts per minute",
|
|
7
|
+
"line_label": "%stream %deferred",
|
|
8
|
+
"format": "number",
|
|
9
|
+
"kind": "timeseries",
|
|
10
|
+
"draw_null_as_zero": true,
|
|
11
|
+
"metrics": [
|
|
12
|
+
{ "name": "pgbus_stream_broadcast_count", "fields": ["COUNTER"], "tags": [] }
|
|
13
|
+
]
|
|
14
|
+
},
|
|
15
|
+
{
|
|
16
|
+
"title": "Active SSE connections",
|
|
17
|
+
"description": "Estimated from connect/disconnect events in the last 60 minutes. Use as a rough capacity gauge — exact count requires the SSE process telemetry.",
|
|
18
|
+
"format": "number",
|
|
19
|
+
"kind": "timeseries",
|
|
20
|
+
"metrics": [
|
|
21
|
+
{ "name": "pgbus_stream_active_connections", "fields": ["GAUGE"], "tags": [] }
|
|
22
|
+
]
|
|
23
|
+
},
|
|
24
|
+
{
|
|
25
|
+
"title": "Average fanout per broadcast",
|
|
26
|
+
"description": "Mean number of connections that received each broadcast over the last hour.",
|
|
27
|
+
"format": "number",
|
|
28
|
+
"kind": "timeseries",
|
|
29
|
+
"metrics": [
|
|
30
|
+
{ "name": "pgbus_stream_avg_fanout", "fields": ["GAUGE"], "tags": [] }
|
|
31
|
+
]
|
|
32
|
+
},
|
|
33
|
+
{
|
|
34
|
+
"title": "Broadcast payload size (bytes)",
|
|
35
|
+
"description": "Distribution of payload bytes. Use to spot accidentally-streaming-an-entire-page bugs.",
|
|
36
|
+
"line_label": "%stream",
|
|
37
|
+
"format": "size",
|
|
38
|
+
"format_input": "byte",
|
|
39
|
+
"kind": "timeseries",
|
|
40
|
+
"metrics": [
|
|
41
|
+
{ "name": "pgbus_stream_broadcast_bytes", "fields": ["MEAN", "P95"], "tags": [] }
|
|
42
|
+
]
|
|
43
|
+
},
|
|
44
|
+
{
|
|
45
|
+
"title": "Outbox publishes per minute",
|
|
46
|
+
"line_label": "%kind",
|
|
47
|
+
"format": "number",
|
|
48
|
+
"kind": "timeseries",
|
|
49
|
+
"draw_null_as_zero": true,
|
|
50
|
+
"metrics": [
|
|
51
|
+
{ "name": "pgbus_outbox_published", "fields": ["COUNTER"], "tags": [] }
|
|
52
|
+
]
|
|
53
|
+
},
|
|
54
|
+
{
|
|
55
|
+
"title": "Recurring tasks enqueued per minute",
|
|
56
|
+
"line_label": "%task",
|
|
57
|
+
"format": "number",
|
|
58
|
+
"kind": "timeseries",
|
|
59
|
+
"draw_null_as_zero": true,
|
|
60
|
+
"metrics": [
|
|
61
|
+
{ "name": "pgbus_recurring_enqueued", "fields": ["COUNTER"], "tags": [] }
|
|
62
|
+
]
|
|
63
|
+
}
|
|
64
|
+
]
|
|
65
|
+
}
|
|
data/lib/pgbus/integrations/appsignal/dashboards/pgbus_throughput.json
ADDED
|
@@ -0,0 +1,81 @@
|
|
|
1
|
+
{
|
|
2
|
+
"title": "Pgbus — Throughput & Latency",
|
|
3
|
+
"description": "Job and event throughput, perform-duration percentiles, and PGMQ send/read counts. Drives the most common 'is the worker keeping up?' question.",
|
|
4
|
+
"graphs": [
|
|
5
|
+
{
|
|
6
|
+
"title": "Jobs processed per minute",
|
|
7
|
+
"description": "Successful, failed, and dead-lettered jobs. A spike in failed without a spike in processed usually means a deploy regression.",
|
|
8
|
+
"line_label": "%queue %job_class %status",
|
|
9
|
+
"format": "number",
|
|
10
|
+
"format_input": null,
|
|
11
|
+
"kind": "timeseries",
|
|
12
|
+
"draw_null_as_zero": true,
|
|
13
|
+
"metrics": [
|
|
14
|
+
{ "name": "pgbus_queue_job_count", "fields": ["COUNTER"], "tags": [] }
|
|
15
|
+
]
|
|
16
|
+
},
|
|
17
|
+
{
|
|
18
|
+
"title": "Job perform duration (ms)",
|
|
19
|
+
"description": "Distribution of how long perform_now takes per job class. P95 and P99 are the lines to watch.",
|
|
20
|
+
"line_label": "%job_class",
|
|
21
|
+
"format": "duration",
|
|
22
|
+
"format_input": "millisecond",
|
|
23
|
+
"kind": "timeseries",
|
|
24
|
+
"metrics": [
|
|
25
|
+
{ "name": "pgbus_job_duration_ms", "fields": ["MEAN", "P95", "P99"], "tags": [] }
|
|
26
|
+
]
|
|
27
|
+
},
|
|
28
|
+
{
|
|
29
|
+
"title": "Events processed per minute",
|
|
30
|
+
"description": "Event-bus handler invocations grouped by routing key.",
|
|
31
|
+
"line_label": "%routing_key %handler %status",
|
|
32
|
+
"format": "number",
|
|
33
|
+
"kind": "timeseries",
|
|
34
|
+
"draw_null_as_zero": true,
|
|
35
|
+
"metrics": [
|
|
36
|
+
{ "name": "pgbus_event_count", "fields": ["COUNTER"], "tags": [] }
|
|
37
|
+
]
|
|
38
|
+
},
|
|
39
|
+
{
|
|
40
|
+
"title": "Event handler duration (ms)",
|
|
41
|
+
"line_label": "%handler",
|
|
42
|
+
"format": "duration",
|
|
43
|
+
"format_input": "millisecond",
|
|
44
|
+
"kind": "timeseries",
|
|
45
|
+
"metrics": [
|
|
46
|
+
{ "name": "pgbus_event_duration_ms", "fields": ["MEAN", "P95", "P99"], "tags": [] }
|
|
47
|
+
]
|
|
48
|
+
},
|
|
49
|
+
{
|
|
50
|
+
"title": "PGMQ messages sent (per minute)",
|
|
51
|
+
"line_label": "%queue",
|
|
52
|
+
"format": "number",
|
|
53
|
+
"kind": "timeseries",
|
|
54
|
+
"draw_null_as_zero": true,
|
|
55
|
+
"metrics": [
|
|
56
|
+
{ "name": "pgbus_messages_sent", "fields": ["COUNTER"], "tags": [] }
|
|
57
|
+
]
|
|
58
|
+
},
|
|
59
|
+
{
|
|
60
|
+
"title": "Send duration (ms)",
|
|
61
|
+
"line_label": "%queue",
|
|
62
|
+
"format": "duration",
|
|
63
|
+
"format_input": "millisecond",
|
|
64
|
+
"kind": "timeseries",
|
|
65
|
+
"metrics": [
|
|
66
|
+
{ "name": "pgbus_send_duration_ms", "fields": ["MEAN", "P95", "P99"], "tags": [] }
|
|
67
|
+
]
|
|
68
|
+
},
|
|
69
|
+
{
|
|
70
|
+
"title": "PGMQ messages read (per minute)",
|
|
71
|
+
"description": "Messages fetched from queues by workers. Compare against 'sent' to spot backlog growth.",
|
|
72
|
+
"line_label": "%queue",
|
|
73
|
+
"format": "number",
|
|
74
|
+
"kind": "timeseries",
|
|
75
|
+
"draw_null_as_zero": true,
|
|
76
|
+
"metrics": [
|
|
77
|
+
{ "name": "pgbus_messages_read", "fields": ["COUNTER"], "tags": [] }
|
|
78
|
+
]
|
|
79
|
+
}
|
|
80
|
+
]
|
|
81
|
+
}
|