pgbus 0.7.8 → 0.7.9
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +42 -0
- data/lib/pgbus/active_job/executor.rb +31 -2
- data/lib/pgbus/configuration.rb +9 -0
- data/lib/pgbus/engine.rb +15 -0
- data/lib/pgbus/event_bus/handler.rb +22 -2
- data/lib/pgbus/instrumentation.rb +15 -6
- data/lib/pgbus/integrations/appsignal/dashboards/pgbus_health.json +87 -0
- data/lib/pgbus/integrations/appsignal/dashboards/pgbus_streams.json +65 -0
- data/lib/pgbus/integrations/appsignal/dashboards/pgbus_throughput.json +81 -0
- data/lib/pgbus/integrations/appsignal/probe.rb +128 -0
- data/lib/pgbus/integrations/appsignal/subscriber.rb +303 -0
- data/lib/pgbus/integrations/appsignal.rb +52 -0
- data/lib/pgbus/outbox.rb +17 -13
- data/lib/pgbus/process/worker.rb +20 -2
- data/lib/pgbus/recurring/scheduler.rb +10 -2
- data/lib/pgbus/streams.rb +13 -5
- data/lib/pgbus/version.rb +1 -1
- data/lib/pgbus.rb +4 -0
- metadata +7 -1
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d7661d7d684ac911e36b15267b4b6135081fa3bd9ca0a795cfceae7e9977304a
+  data.tar.gz: 7cdb802918724dafa634925a99c48c9a0ee154356ec807e69ad6ec86cb48f9be
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 917889d47343e7c8775f6be1c9bc9d72b7ca8a26de356a9434ba1470f0f4f345215e9f016f9bba74b78007c72668254b92715944574fc056b20ceb1af92ebaac
+  data.tar.gz: 011533444eec2c09e29de3938d84e6fc06efd87c82621efcccf92008d88b3a3ac76464619c959617d022e076da4bed5ca03b50f3b5997e7d81ec9b52ccce90c3
data/README.md CHANGED

@@ -728,6 +728,48 @@ Reporters are wired into all critical rescue paths: job execution failures, work
 
 `ErrorReporter.report` is guaranteed to never raise — if a reporter or the logger itself throws, the error is swallowed silently. This preserves fault-tolerance invariants at every rescue site.
 
+### AppSignal integration
+
+When the `appsignal` gem is loaded in your app, Pgbus auto-installs a subscriber and a minutely probe that report into AppSignal:
+
+- **Background-job transactions** for every ActiveJob run and every event-bus handler invocation. Action names follow the AppSignal convention: `MyJob#perform`, `MyHandler#handle`. Tags include `queue`, `job_class`/`handler`, `routing_key`, `attempts`, and the `active_job_id` / `provider_job_id`. `enqueued_at` becomes the AppSignal `queue_start` timestamp so "time on queue" shows up correctly in the timeline.
+- **Custom counters and distributions** for sends, reads, broadcasts, outbox publishes, recurring scheduling, and worker recycles. All metric names are prefixed `pgbus_`.
+- **A minutely probe** that gauges queue depth (visible vs total), oldest message age per queue, DLQ depth, failed events count, dead-tuple totals, MVCC horizon age, active processes, and stream connection estimates.
+
+There is nothing to wire up — load the appsignal gem and the integration installs itself in a Rails initializer. To opt out:
+
+```ruby
+Pgbus.configure do |c|
+  c.appsignal_enabled = false       # disable subscriber + probe entirely
+  c.appsignal_probe_enabled = false # keep transactions, drop the gauge probe
+end
+```
+
+#### Dashboards
+
+Three importable AppSignal dashboards ship with the gem:
+
+| File | Purpose |
+|------|---------|
+| `lib/pgbus/integrations/appsignal/dashboards/pgbus_throughput.json` | Jobs/sec, perform-duration percentiles, send/read counts |
+| `lib/pgbus/integrations/appsignal/dashboards/pgbus_health.json` | Queue depth, oldest message age, DLQ, dead tuples, MVCC horizon, worker recycles |
+| `lib/pgbus/integrations/appsignal/dashboards/pgbus_streams.json` | Broadcasts, fanout, active SSE connections, outbox, recurring tasks |
+
+Import via the AppSignal dashboard UI ("New dashboard" → "Import JSON") or the AppSignal API.
+
+#### Custom subscriptions
+
+The integration is built on `ActiveSupport::Notifications`. If you want to push pgbus telemetry into a different APM (Datadog, New Relic, OpenTelemetry), subscribe directly:
+
+```ruby
+ActiveSupport::Notifications.subscribe(/^pgbus\./) do |name, start, finish, _id, payload|
+  duration_ms = (finish - start) * 1_000
+  YourApm.record(name, duration_ms, payload)
+end
+```
+
+Events emitted: `pgbus.executor.execute`, `pgbus.job_completed`, `pgbus.job_failed`, `pgbus.job_dead_lettered`, `pgbus.event_processed`, `pgbus.event_failed`, `pgbus.client.send_message`, `pgbus.client.send_batch`, `pgbus.client.read_batch`, `pgbus.stream.broadcast`, `pgbus.outbox.publish`, `pgbus.recurring.enqueue`, `pgbus.worker.recycle`. Payload keys are documented in `lib/pgbus/instrumentation.rb`.
+
 ### Structured logging
 
 Pgbus ships two log formatters inspired by Sidekiq's `Logger::Formatters`:
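As an aside on the subscription pattern the README documents: the duration math in that snippet can be exercised without Rails. `TinyBus` below is a hypothetical, dependency-free stand-in for `ActiveSupport::Notifications`, included only to illustrate the subscribe/instrument flow; it is not part of pgbus or ActiveSupport.

```ruby
# Hypothetical minimal notifier illustrating the subscribe/instrument
# pattern the pgbus README builds on. Not part of pgbus or ActiveSupport.
class TinyBus
  def initialize
    @subscribers = [] # [pattern, block] pairs
  end

  def subscribe(pattern, &block)
    @subscribers << [pattern, block]
  end

  # Times the block, then notifies every subscriber whose pattern matches.
  def instrument(name, payload = {})
    start = Time.now
    result = yield if block_given?
    finish = Time.now
    @subscribers.each do |pattern, block|
      block.call(name, start, finish, payload) if pattern.match?(name)
    end
    result
  end
end

bus = TinyBus.new
seen = []
bus.subscribe(/^pgbus\./) do |name, start, finish, payload|
  duration_ms = (finish - start) * 1_000 # same math as the README snippet
  seen << [name, duration_ms, payload[:queue]]
end

bus.instrument("pgbus.client.send_message", queue: "default") { :sent }
bus.instrument("other.event") { :ignored } # does not match /^pgbus\./

puts seen.length # 1
```

The real `ActiveSupport::Notifications.subscribe` yields the same `(name, start, finish, id, payload)` shape, which is why the README's block computes milliseconds from the two timestamps.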
data/lib/pgbus/active_job/executor.rb CHANGED

@@ -33,6 +33,15 @@ module Pgbus
       signal_batch_discarded(payload)
       Uniqueness.release_lock(Uniqueness.extract_key(payload))
       record_stat(payload, queue_name, "dead_lettered", execution_start, message: message)
+      instrument(
+        "pgbus.job_dead_lettered",
+        queue: queue_name,
+        job_class: job_class,
+        job_id: payload["job_id"],
+        provider_job_id: payload["provider_job_id"],
+        read_ct: read_count,
+        msg_id: message.msg_id.to_i
+      )
       Pgbus.logger.debug { "[Pgbus::Executor] dead_lettered #{tag} job_class=#{job_class}" }
       return :dead_lettered
     end
@@ -60,7 +69,17 @@ module Pgbus
     job_succeeded = false
 
     msg_id = message.msg_id.to_i
-
+    instrument_payload = {
+      queue: queue_name,
+      job_class: job_class,
+      job_id: payload["job_id"],
+      provider_job_id: payload["provider_job_id"],
+      arguments: payload["arguments"],
+      enqueued_at: payload["enqueued_at"],
+      read_ct: read_count,
+      msg_id: msg_id
+    }
+    Instrumentation.instrument("pgbus.executor.execute", instrument_payload) do
       job = ::ActiveJob::Base.deserialize(payload)
       Pgbus.logger.debug { "[Pgbus::Executor] running #{tag} job_class=#{job_class}" }
       execute_job(job)
@@ -85,7 +104,17 @@ module Pgbus
     # silently lost control flow — no failed event row, no job_failed
     # notification, uniqueness lock held until VT expired. See issue #126.
     handle_failure(message, queue_name, e, payload: payload)
-    instrument(
+    instrument(
+      "pgbus.job_failed",
+      queue: queue_name,
+      job_class: payload&.dig("job_class"),
+      job_id: payload&.dig("job_id"),
+      provider_job_id: payload&.dig("provider_job_id"),
+      read_ct: message.read_ct.to_i,
+      msg_id: message.msg_id.to_i,
+      error: e.class.name,
+      exception_object: e
+    )
     record_stat(payload, queue_name, "failed", execution_start, message: message)
     Pgbus.logger.debug { "[Pgbus::Executor] failed #{tag} job_class=#{payload&.dig("job_class")} error=#{e.class}" }
     # Don't signal concurrency on transient failure — the job will be retried.
data/lib/pgbus/configuration.rb CHANGED

@@ -106,6 +106,10 @@ module Pgbus
       :streams_write_deadline_ms, :streams_falcon_streaming_body,
       :streams_stats_enabled, :streams_test_mode
 
+    # AppSignal integration (auto-loaded when ::Appsignal is defined and this is true).
+    # Set to false to opt out without uninstalling the appsignal gem.
+    attr_accessor :appsignal_enabled, :appsignal_probe_enabled
+
     def initialize
       @database_url = nil
       @connection_params = nil
@@ -212,6 +216,11 @@ module Pgbus
       # usually want job stats on and stream stats off, or vice versa.
       @streams_stats_enabled = false
       @streams_test_mode = false
+
+      # AppSignal: auto-on when the appsignal gem is loaded; probe runs in
+      # the same process, so the operator can disable it independently.
+      @appsignal_enabled = true
+      @appsignal_probe_enabled = true
     end
 
     def queue_name(name)
data/lib/pgbus/engine.rb CHANGED

@@ -71,6 +71,21 @@ module Pgbus
       require "pgbus/web/data_source"
     end
 
+    # AppSignal is third-party and entirely optional. We require the
+    # integration only when the host app has the appsignal gem loaded
+    # AND hasn't disabled it via config.appsignal_enabled. AppSignal
+    # itself loads early (it's typically required from config/environment.rb
+    # before Rails finishes booting), so by the time `after_initialize`
+    # fires the constant check is reliable.
+    initializer "pgbus.integrations.appsignal", after: :load_config_initializers do
+      ActiveSupport.on_load(:after_initialize) do
+        next unless defined?(::Appsignal) && Pgbus.configuration.appsignal_enabled
+
+        require "pgbus/integrations/appsignal"
+        Pgbus::Integrations::Appsignal.install!
+      end
+    end
+
     # Install the watermark cache middleware ahead of the app's own
     # middleware so the thread-local cache is cleared between every
     # Rack request. Without this, repeated page renders served by the
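The initializer above combines two guards: a `defined?` check for the optional dependency and a config flag, plus an install-once latch in the integration itself. A self-contained sketch of that guarded, idempotent install pattern, with hypothetical names (`FakeIntegration` is illustrative only, not part of pgbus):

```ruby
# Hypothetical sketch of the guarded, idempotent install pattern used by
# the engine initializer above: install at most once, and only when the
# optional dependency is present and the config flag allows it.
module FakeIntegration
  class << self
    def install!(dependency_loaded:, enabled:)
      return false if @installed                      # already installed: no-op
      return false unless dependency_loaded && enabled # guards not satisfied

      @installed = true
    end

    def installed?
      @installed == true
    end

    def reset!
      @installed = false
    end
  end
end

FakeIntegration.install!(dependency_loaded: false, enabled: true)
puts FakeIntegration.installed? # false — the dependency is missing
FakeIntegration.install!(dependency_loaded: true, enabled: true)
puts FakeIntegration.installed? # true
FakeIntegration.install!(dependency_loaded: true, enabled: true) # second call is a no-op
```

The same latch-plus-`reset!` shape appears in the probe and subscriber modules later in this diff, where `reset!` exists so specs can detach and reinstall.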
data/lib/pgbus/event_bus/handler.rb CHANGED

@@ -30,12 +30,32 @@ module Pgbus
     def process!(message)
       raw = JSON.parse(message.message)
       event = build_event(raw)
+      routing_key = raw.dig("headers", "routing_key") || raw["routing_key"]
 
       return :skipped if self.class.idempotent? && !claim_idempotency?(event.event_id)
 
-
-
+      instrument_payload = {
+        event_id: event.event_id,
+        handler: self.class.name,
+        routing_key: routing_key,
+        published_at: event.published_at,
+        read_ct: message.read_ct.to_i,
+        msg_id: message.msg_id.to_i
+      }
+      Instrumentation.instrument("pgbus.event_processed", instrument_payload) do
+        handle(event)
+      end
       :handled
+    rescue StandardError => e
+      instrument(
+        "pgbus.event_failed",
+        event_id: event&.event_id,
+        handler: self.class.name,
+        routing_key: routing_key,
+        error: e.class.name,
+        exception_object: e
+      )
+      raise
     end
 
     # Mirrors Pgbus::ActiveJob::Executor#execute_job: wrap the handler
data/lib/pgbus/instrumentation.rb CHANGED

@@ -7,12 +7,21 @@ module Pgbus
   # automatically when used with the block form of AS::Notifications.instrument.
   #
   # Events emitted:
-  #   pgbus.client.send_message
-  #   pgbus.client.send_batch
-  #   pgbus.client.read_batch
-  #   pgbus.client.read_message
-  #   pgbus.executor.execute
-  #   pgbus.
+  #   pgbus.client.send_message  — single message enqueue
+  #   pgbus.client.send_batch    — batch enqueue
+  #   pgbus.client.read_batch    — batch dequeue
+  #   pgbus.client.read_message  — single message dequeue
+  #   pgbus.executor.execute     — full job execution (deserialize + perform + archive)
+  #   pgbus.job_completed        — job archived successfully
+  #   pgbus.job_failed           — job raised; carries :exception_object
+  #   pgbus.job_dead_lettered    — job exceeded max_retries and was DLQ-routed
+  #   pgbus.event_processed      — event handler succeeded
+  #   pgbus.event_failed         — event handler raised; carries :exception_object
+  #   pgbus.stream.broadcast     — stream broadcast (sync or deferred)
+  #   pgbus.outbox.publish       — outbox row created
+  #   pgbus.recurring.enqueue    — scheduler enqueued a due recurring task
+  #   pgbus.worker.recycle       — worker hit a recycle threshold
+  #   pgbus.serializer.serialize — job/event serialization
   #   pgbus.serializer.deserialize — job/event deserialization
   #
   module Instrumentation
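Every event name in the list above shares the `pgbus.` prefix, which is what makes the README's catch-all `/^pgbus\./` subscription work. A quick sanity check of that claim (names taken from the comment above, regex from the README):

```ruby
# Sanity check: the /^pgbus\./ subscription pattern matches each pgbus
# event name and excludes unrelated ActiveSupport::Notifications events.
names = %w[
  pgbus.client.send_message pgbus.executor.execute pgbus.job_failed
  pgbus.event_processed pgbus.stream.broadcast pgbus.worker.recycle
]
unrelated = %w[perform.active_job sql.active_record]

matched = (names + unrelated).grep(/^pgbus\./)
puts matched.size # 6 — only the pgbus.* names survive
```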
data/lib/pgbus/integrations/appsignal/dashboards/pgbus_health.json ADDED

@@ -0,0 +1,87 @@
+{
+  "title": "Pgbus — Health",
+  "description": "Backlog, dead-letter activity, dead-tuple growth, and MVCC horizon. The 'should I page someone?' dashboard.",
+  "graphs": [
+    {
+      "title": "Queue depth (visible vs total)",
+      "description": "Visible depth excludes messages whose VT hasn't expired. A divergence between the two means workers are slow but the queue isn't growing.",
+      "line_label": "%queue",
+      "format": "number",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_queue_depth", "fields": ["GAUGE"], "tags": [] },
+        { "name": "pgbus_queue_visible_depth", "fields": ["GAUGE"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Oldest message age (seconds)",
+      "description": "Per-queue head-of-line waiting time. If this climbs while queue depth stays flat, a single poison message is stuck in the VT loop.",
+      "line_label": "%queue",
+      "format": "duration",
+      "format_input": "second",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_queue_oldest_message_age_seconds", "fields": ["GAUGE"], "tags": [] }
+      ]
+    },
+    {
+      "title": "DLQ depth + failed events",
+      "description": "Messages that exceeded max_retries plus the failed-events table. Spikes after a deploy point at a regression.",
+      "format": "number",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_dlq_depth", "fields": ["GAUGE"], "tags": [] },
+        { "name": "pgbus_failed_events_total", "fields": ["GAUGE"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Dead-lettered jobs per minute",
+      "line_label": "%queue %job_class",
+      "format": "number",
+      "kind": "timeseries",
+      "draw_null_as_zero": true,
+      "metrics": [
+        { "name": "pgbus_queue_job_count", "fields": ["COUNTER"], "tags": [{ "key": "status", "value": "dead_lettered" }] }
+      ]
+    },
+    {
+      "title": "Active processes",
+      "description": "Workers + dispatcher + scheduler currently heartbeating into pgbus_processes.",
+      "format": "number",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_active_processes", "fields": ["GAUGE"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Dead tuples in queue/archive tables",
+      "description": "If autovacuum can't keep up the index gets bloated and lock acquisition slows. Tune autovacuum_vacuum_scale_factor on the offending tables when this climbs.",
+      "format": "number",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_total_dead_tuples", "fields": ["GAUGE"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Oldest open transaction (seconds)",
+      "description": "MVCC horizon pin. Long-running transactions prevent VACUUM from cleaning the dead tuples above. Anything over 60s is a smell.",
+      "format": "duration",
+      "format_input": "second",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_oldest_transaction_age_seconds", "fields": ["GAUGE"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Worker recycles per minute",
+      "description": "Recycles by reason. Steady max_jobs is healthy; spiking max_memory means you have a leak.",
+      "line_label": "%reason",
+      "format": "number",
+      "kind": "timeseries",
+      "draw_null_as_zero": true,
+      "metrics": [
+        { "name": "pgbus_worker_recycled", "fields": ["COUNTER"], "tags": [] }
+      ]
+    }
+  ]
+}
data/lib/pgbus/integrations/appsignal/dashboards/pgbus_streams.json ADDED

@@ -0,0 +1,65 @@
+{
+  "title": "Pgbus — Streams",
+  "description": "Real-time SSE pub/sub. Broadcasts, fanout, active connections, and the outbox/recurring scheduler.",
+  "graphs": [
+    {
+      "title": "Stream broadcasts per minute",
+      "line_label": "%stream %deferred",
+      "format": "number",
+      "kind": "timeseries",
+      "draw_null_as_zero": true,
+      "metrics": [
+        { "name": "pgbus_stream_broadcast_count", "fields": ["COUNTER"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Active SSE connections",
+      "description": "Estimated from connect/disconnect events in the last 60 minutes. Use as a rough capacity gauge — exact count requires the SSE process telemetry.",
+      "format": "number",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_stream_active_connections", "fields": ["GAUGE"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Average fanout per broadcast",
+      "description": "Mean number of connections that received each broadcast over the last hour.",
+      "format": "number",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_stream_avg_fanout", "fields": ["GAUGE"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Broadcast payload size (bytes)",
+      "description": "Distribution of payload bytes. Use to spot accidentally-streaming-an-entire-page bugs.",
+      "line_label": "%stream",
+      "format": "size",
+      "format_input": "byte",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_stream_broadcast_bytes", "fields": ["MEAN", "P95"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Outbox publishes per minute",
+      "line_label": "%kind",
+      "format": "number",
+      "kind": "timeseries",
+      "draw_null_as_zero": true,
+      "metrics": [
+        { "name": "pgbus_outbox_published", "fields": ["COUNTER"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Recurring tasks enqueued per minute",
+      "line_label": "%task",
+      "format": "number",
+      "kind": "timeseries",
+      "draw_null_as_zero": true,
+      "metrics": [
+        { "name": "pgbus_recurring_enqueued", "fields": ["COUNTER"], "tags": [] }
+      ]
+    }
+  ]
+}
data/lib/pgbus/integrations/appsignal/dashboards/pgbus_throughput.json ADDED

@@ -0,0 +1,81 @@
+{
+  "title": "Pgbus — Throughput & Latency",
+  "description": "Job and event throughput, perform-duration percentiles, and PGMQ send/read counts. Drives the most common 'is the worker keeping up?' question.",
+  "graphs": [
+    {
+      "title": "Jobs processed per minute",
+      "description": "Successful, failed, and dead-lettered jobs. A spike in failed without a spike in processed usually means a deploy regression.",
+      "line_label": "%queue %job_class %status",
+      "format": "number",
+      "format_input": null,
+      "kind": "timeseries",
+      "draw_null_as_zero": true,
+      "metrics": [
+        { "name": "pgbus_queue_job_count", "fields": ["COUNTER"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Job perform duration (ms)",
+      "description": "Distribution of how long perform_now takes per job class. P95 and P99 are the lines to watch.",
+      "line_label": "%job_class",
+      "format": "duration",
+      "format_input": "millisecond",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_job_duration_ms", "fields": ["MEAN", "P95", "P99"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Events processed per minute",
+      "description": "Event-bus handler invocations grouped by routing key.",
+      "line_label": "%routing_key %handler %status",
+      "format": "number",
+      "kind": "timeseries",
+      "draw_null_as_zero": true,
+      "metrics": [
+        { "name": "pgbus_event_count", "fields": ["COUNTER"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Event handler duration (ms)",
+      "line_label": "%handler",
+      "format": "duration",
+      "format_input": "millisecond",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_event_duration_ms", "fields": ["MEAN", "P95", "P99"], "tags": [] }
+      ]
+    },
+    {
+      "title": "PGMQ messages sent (per minute)",
+      "line_label": "%queue",
+      "format": "number",
+      "kind": "timeseries",
+      "draw_null_as_zero": true,
+      "metrics": [
+        { "name": "pgbus_messages_sent", "fields": ["COUNTER"], "tags": [] }
+      ]
+    },
+    {
+      "title": "Send duration (ms)",
+      "line_label": "%queue",
+      "format": "duration",
+      "format_input": "millisecond",
+      "kind": "timeseries",
+      "metrics": [
+        { "name": "pgbus_send_duration_ms", "fields": ["MEAN", "P95", "P99"], "tags": [] }
+      ]
+    },
+    {
+      "title": "PGMQ messages read (per minute)",
+      "description": "Messages fetched from queues by workers. Compare against 'sent' to spot backlog growth.",
+      "line_label": "%queue",
+      "format": "number",
+      "kind": "timeseries",
+      "draw_null_as_zero": true,
+      "metrics": [
+        { "name": "pgbus_messages_read", "fields": ["COUNTER"], "tags": [] }
+      ]
+    }
+  ]
+}
data/lib/pgbus/integrations/appsignal/probe.rb ADDED

@@ -0,0 +1,128 @@
+# frozen_string_literal: true
+
+module Pgbus
+  module Integrations
+    module Appsignal
+      # Minutely probe that pushes pgbus-wide gauges into AppSignal.
+      #
+      # All readings come from Pgbus::Web::DataSource so the probe doesn't
+      # duplicate query logic. DataSource is built to be resilient — every
+      # method rescues StandardError and returns a safe default — but we
+      # still wrap each section in our own rescue so a probe iteration
+      # never raises out into the AppSignal probe runner.
+      module Probe
+        METRIC_PREFIX = "pgbus_"
+        private_constant :METRIC_PREFIX
+
+        class << self
+          def install! # rubocop:disable Naming/PredicateMethod
+            return false if @installed
+
+            ::Appsignal::Probes.register :pgbus, new_probe_instance
+            @installed = true
+            true
+          end
+
+          def installed?
+            @installed == true
+          end
+
+          def reset!
+            ::Appsignal::Probes.unregister(:pgbus) if defined?(::Appsignal::Probes) &&
+                                                      ::Appsignal::Probes.respond_to?(:unregister)
+            @installed = false
+          end
+
+          # Visible for testing — returns a fresh runnable probe.
+          def new_probe_instance
+            Runner.new
+          end
+        end
+
+        # The actual probe object; AppSignal calls #call once per minute.
+        class Runner
+          def initialize(data_source: nil)
+            @data_source = data_source
+          end
+
+          def call
+            return unless data_source
+
+            track_queues
+            track_processes
+            track_summary
+            track_streams
+          end
+
+          private
+
+          def data_source
+            @data_source ||=
+              (::Pgbus::Web::DataSource.new if defined?(::Pgbus::Web::DataSource))
+          end
+
+          def track_queues
+            data_source.queues_with_metrics.each do |q|
+              tags = { queue: q[:name] }
+              gauge "queue_depth", q[:queue_length], tags
+              gauge "queue_visible_depth", q[:queue_visible_length], tags
+              gauge "queue_paused", q[:paused] ? 1 : 0, tags
+              age = q[:oldest_msg_age_sec]
+              gauge "queue_oldest_message_age_seconds", age, tags if age
+            end
+          rescue StandardError => e
+            log_failure("queue metrics", e)
+          end
+
+          def track_processes
+            gauge "active_processes", data_source.processes.count
+          rescue StandardError => e
+            log_failure("process metrics", e)
+          end
+
+          def track_summary
+            stats = data_source.summary_stats
+            gauge "total_queues", stats[:total_queues]
+            gauge "total_depth", stats[:total_depth]
+            gauge "total_visible", stats[:total_visible]
+            gauge "dlq_depth", stats[:dlq_depth]
+            gauge "failed_events_total", stats[:failed_count]
+            gauge "throughput_rate", stats[:throughput_rate]
+            gauge "total_dead_tuples", stats[:total_dead_tuples]
+            gauge "tables_needing_vacuum", stats[:tables_needing_vacuum]
+            gauge "oldest_transaction_age_seconds", stats[:oldest_transaction_age_sec]
+          rescue StandardError => e
+            log_failure("summary metrics", e)
+          end
+
+          def track_streams
+            return unless data_source.respond_to?(:stream_stats_available?) &&
+                          data_source.stream_stats_available?
+
+            summary = data_source.stream_stats_summary
+            gauge "stream_broadcasts_60m", summary[:broadcasts]
+            gauge "stream_connects_60m", summary[:connects]
+            gauge "stream_disconnects_60m", summary[:disconnects]
+            gauge "stream_active_connections", summary[:active_estimate]
+            gauge "stream_avg_fanout", summary[:avg_fanout]
+            gauge "stream_avg_broadcast_ms", summary[:avg_broadcast_ms]
+          rescue StandardError => e
+            log_failure("stream metrics", e)
+          end
+
+          def gauge(key, value, tags = {})
+            return if value.nil?
+
+            ::Appsignal.set_gauge("#{METRIC_PREFIX}#{key}", value, tags)
+          end
+
+          def log_failure(label, error)
+            Pgbus.logger.debug do
+              "[Pgbus::AppSignal::Probe] #{label} failed: #{error.class}: #{error.message}"
+            end
+          end
+        end
+      end
+    end
+  end
+end
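The probe's `gauge` helper above drops nil readings rather than reporting them, so a missing statistic never shows up as a bogus zero on a dashboard. A hypothetical stand-alone version of that guard (`recorded` stands in for the AppSignal client):

```ruby
# Hypothetical sketch of the probe's nil-guarded gauge helper: nil values
# are skipped rather than reported, because a missing reading is not the
# same as a zero reading. `recorded` stands in for the AppSignal client.
recorded = []
gauge = lambda do |key, value, tags = {}|
  next if value.nil? # skip missing readings entirely

  recorded << ["pgbus_#{key}", value, tags]
end

gauge.call("queue_depth", 42, queue: "default")
gauge.call("queue_oldest_message_age_seconds", nil) # skipped, not zeroed

puts recorded.length # 1
```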
data/lib/pgbus/integrations/appsignal/subscriber.rb ADDED

@@ -0,0 +1,303 @@
+# frozen_string_literal: true
+
+require "time"
+
+module Pgbus
+  module Integrations
+    module Appsignal
+      # Translates Pgbus::Instrumentation events into AppSignal transactions
+      # and custom metrics.
+      #
+      # Job and event-handler events open a BACKGROUND_JOB transaction so they
+      # appear under AppSignal's "Performance > Background jobs" view, with
+      # action `<JobClass>#perform` or `<HandlerClass>#handle`. All other
+      # events are reported as counters or distributions only.
+      #
+      # All metric names are prefixed `pgbus_`. Tag keys avoid high-cardinality
+      # values (no msg_id, no event_id) so AppSignal's metric storage stays
+      # efficient.
+      module Subscriber
+        BACKGROUND_JOB = "background_job"
+        METRIC_PREFIX = "pgbus_"
+        private_constant :BACKGROUND_JOB, :METRIC_PREFIX
+
+        # Tracked so we can detach in reset! (used by specs).
+        @subscriptions = []
+
+        class << self
+          def install!
+            return false if @installed
+
+            @subscriptions = [
+              subscribe("pgbus.executor.execute") { |event| on_executor_execute(event) },
+              subscribe("pgbus.job_completed") { |event| on_job_completed(event) },
+              subscribe("pgbus.job_failed") { |event| on_job_failed(event) },
+              subscribe("pgbus.job_dead_lettered") { |event| on_job_dead_lettered(event) },
+              subscribe("pgbus.event_processed") { |event| on_event_processed(event) },
+              subscribe("pgbus.event_failed") { |event| on_event_failed(event) },
+              subscribe("pgbus.client.send_message") { |event| on_send_message(event) },
+              subscribe("pgbus.client.send_batch") { |event| on_send_batch(event) },
+              subscribe("pgbus.client.read_batch") { |event| on_read_batch(event) },
+              subscribe("pgbus.stream.broadcast") { |event| on_stream_broadcast(event) },
+              subscribe("pgbus.outbox.publish") { |event| on_outbox_publish(event) },
+              subscribe("pgbus.recurring.enqueue") { |event| on_recurring_enqueue(event) },
+              subscribe("pgbus.worker.recycle") { |event| on_worker_recycle(event) }
+            ]
+            @installed = true
+          end
+
+          def installed?
+            @installed == true
+          end
+
+          def reset!
+            @subscriptions&.each { |s| ActiveSupport::Notifications.unsubscribe(s) }
+            @subscriptions = []
+            @installed = false
+          end
+
+          private
+
+          def subscribe(name, &block)
+            # silence rubocop unused
+            ActiveSupport::Notifications.subscribe(name) do |*args|
+              event = ActiveSupport::Notifications::Event.new(*args)
+              safely { block.call(event) }
+            end
+          end
+
+          # Errors in the subscriber must never affect the producer thread.
+          # AppSignal can be misconfigured, the agent can be down, etc. — log
+          # and move on.
+          def safely
+            yield
+          rescue StandardError => e
+            Pgbus.logger.warn do
+              "[Pgbus::AppSignal] subscriber error: #{e.class}: #{e.message}"
+            end
+          end
+
+          # ── Job execution ───────────────────────────────────────────────
+
+          def on_executor_execute(event)
+            payload = event.payload
+            transaction = ::Appsignal::Transaction.create(BACKGROUND_JOB)
+            transaction.set_action_if_nil("#{payload[:job_class] || "UnknownJob"}#perform")
+            apply_queue_start(transaction, payload[:enqueued_at])
+            transaction.add_tags(job_tags(payload))
+            transaction.add_params_if_nil { { arguments: payload[:arguments] } }
+            ::Appsignal.add_distribution_value(
+              "#{METRIC_PREFIX}job_duration_ms",
+              event.duration,
+              { queue: payload[:queue], job_class: payload[:job_class] }
+            )
+          ensure
+            ::Appsignal::Transaction.complete_current!
+          end
+
+          def on_job_completed(event)
+            payload = event.payload
+            ::Appsignal.increment_counter(
+              "#{METRIC_PREFIX}queue_job_count",
+              1,
+              { queue: payload[:queue], job_class: payload[:job_class], status: "processed" }
+            )
+          end
+
+          def on_job_failed(event)
+            payload = event.payload
+            ::Appsignal.increment_counter(
+              "#{METRIC_PREFIX}queue_job_count",
+              1,
+              { queue: payload[:queue], job_class: payload[:job_class], status: "failed" }
+            )
+            err = payload[:exception_object]
+            ::Appsignal.set_error(err) if err && ::Appsignal.respond_to?(:set_error)
+          end
+
+          def on_job_dead_lettered(event)
+            payload = event.payload
+            ::Appsignal.increment_counter(
+              "#{METRIC_PREFIX}queue_job_count",
+              1,
+              { queue: payload[:queue], job_class: payload[:job_class], status: "dead_lettered" }
+            )
+          end
+
+          # ── Event handler ───────────────────────────────────────────────
+
+          def on_event_processed(event)
+            payload = event.payload
+            transaction = ::Appsignal::Transaction.create(BACKGROUND_JOB)
+            transaction.set_action_if_nil("#{payload[:handler] || "UnknownHandler"}#handle")
+            apply_queue_start(transaction, payload[:published_at])
+            transaction.add_tags(handler_tags(payload))
+            ::Appsignal.add_distribution_value(
+              "#{METRIC_PREFIX}event_duration_ms",
+              event.duration,
+              { handler: payload[:handler], routing_key: payload[:routing_key] }
+            )
+            ::Appsignal.increment_counter(
+              "#{METRIC_PREFIX}event_count",
+              1,
+              { handler: payload[:handler], routing_key: payload[:routing_key], status: "processed" }
+            )
+          ensure
+            ::Appsignal::Transaction.complete_current!
+          end
+
+          def on_event_failed(event)
+            payload = event.payload
+            ::Appsignal.increment_counter(
+              "#{METRIC_PREFIX}event_count",
+              1,
+              { handler: payload[:handler], routing_key: payload[:routing_key], status: "failed" }
|
|
155
|
+
)
|
|
156
|
+
err = payload[:exception_object]
|
|
157
|
+
::Appsignal.set_error(err) if err && ::Appsignal.respond_to?(:set_error)
|
|
158
|
+
end
|
|
159
|
+
|
|
160
|
+
# ── Client (PGMQ wrapper) ───────────────────────────────────────
|
|
161
|
+
|
|
162
|
+
def on_send_message(event)
|
|
163
|
+
payload = event.payload
|
|
164
|
+
::Appsignal.increment_counter(
|
|
165
|
+
"#{METRIC_PREFIX}messages_sent",
|
|
166
|
+
1,
|
|
167
|
+
{ queue: payload[:queue] }
|
|
168
|
+
)
|
|
169
|
+
::Appsignal.add_distribution_value(
|
|
170
|
+
"#{METRIC_PREFIX}send_duration_ms",
|
|
171
|
+
event.duration,
|
|
172
|
+
{ queue: payload[:queue] }
|
|
173
|
+
)
|
|
174
|
+
end
|
|
175
|
+
|
|
176
|
+
def on_send_batch(event)
|
|
177
|
+
payload = event.payload
|
|
178
|
+
count = payload[:count] || payload[:batch_size] || 1
|
|
179
|
+
::Appsignal.increment_counter(
|
|
180
|
+
"#{METRIC_PREFIX}messages_sent",
|
|
181
|
+
count,
|
|
182
|
+
{ queue: payload[:queue] }
|
|
183
|
+
)
|
|
184
|
+
::Appsignal.add_distribution_value(
|
|
185
|
+
"#{METRIC_PREFIX}send_batch_duration_ms",
|
|
186
|
+
event.duration,
|
|
187
|
+
{ queue: payload[:queue] }
|
|
188
|
+
)
|
|
189
|
+
end
|
|
190
|
+
|
|
191
|
+
def on_read_batch(event)
|
|
192
|
+
payload = event.payload
|
|
193
|
+
count = payload[:count] || payload[:fetched] || 0
|
|
194
|
+
::Appsignal.increment_counter(
|
|
195
|
+
"#{METRIC_PREFIX}messages_read",
|
|
196
|
+
count,
|
|
197
|
+
{ queue: payload[:queue] }
|
|
198
|
+
)
|
|
199
|
+
end
|
|
200
|
+
|
|
201
|
+
# ── Streams ─────────────────────────────────────────────────────
|
|
202
|
+
|
|
203
|
+
def on_stream_broadcast(event)
|
|
204
|
+
payload = event.payload
|
|
205
|
+
::Appsignal.increment_counter(
|
|
206
|
+
"#{METRIC_PREFIX}stream_broadcast_count",
|
|
207
|
+
1,
|
|
208
|
+
{ stream: payload[:stream], deferred: payload[:deferred] ? "true" : "false" }
|
|
209
|
+
)
|
|
210
|
+
return unless payload[:bytes]
|
|
211
|
+
|
|
212
|
+
::Appsignal.add_distribution_value(
|
|
213
|
+
"#{METRIC_PREFIX}stream_broadcast_bytes",
|
|
214
|
+
payload[:bytes],
|
|
215
|
+
{ stream: payload[:stream] }
|
|
216
|
+
)
|
|
217
|
+
end
|
|
218
|
+
|
|
219
|
+
# ── Outbox ──────────────────────────────────────────────────────
|
|
220
|
+
|
|
221
|
+
def on_outbox_publish(event)
|
|
222
|
+
payload = event.payload
|
|
223
|
+
::Appsignal.increment_counter(
|
|
224
|
+
"#{METRIC_PREFIX}outbox_published",
|
|
225
|
+
1,
|
|
226
|
+
{ kind: payload[:kind] || "job" }
|
|
227
|
+
)
|
|
228
|
+
::Appsignal.add_distribution_value(
|
|
229
|
+
"#{METRIC_PREFIX}outbox_publish_duration_ms",
|
|
230
|
+
event.duration,
|
|
231
|
+
{ kind: payload[:kind] || "job" }
|
|
232
|
+
)
|
|
233
|
+
end
|
|
234
|
+
|
|
235
|
+
# ── Recurring scheduler ─────────────────────────────────────────
|
|
236
|
+
|
|
237
|
+
def on_recurring_enqueue(event)
|
|
238
|
+
payload = event.payload
|
|
239
|
+
::Appsignal.increment_counter(
|
|
240
|
+
"#{METRIC_PREFIX}recurring_enqueued",
|
|
241
|
+
1,
|
|
242
|
+
{ task: payload[:task], class_name: payload[:class_name] }
|
|
243
|
+
)
|
|
244
|
+
end
|
|
245
|
+
|
|
246
|
+
# ── Worker lifecycle ────────────────────────────────────────────
|
|
247
|
+
|
|
248
|
+
def on_worker_recycle(event)
|
|
249
|
+
payload = event.payload
|
|
250
|
+
::Appsignal.increment_counter(
|
|
251
|
+
"#{METRIC_PREFIX}worker_recycled",
|
|
252
|
+
1,
|
|
253
|
+
{ reason: payload[:reason] }
|
|
254
|
+
)
|
|
255
|
+
end
|
|
256
|
+
|
|
257
|
+
# ── Helpers ─────────────────────────────────────────────────────
|
|
258
|
+
|
|
259
|
+
# AppSignal expects queue-start as Unix epoch milliseconds. Pgbus
|
|
260
|
+
# carries it as either an ISO-8601 String or a Time — both happen
|
|
261
|
+
# in practice (executor passes the JSON string, handler passes a
|
|
262
|
+
# parsed Time).
|
|
263
|
+
def apply_queue_start(transaction, value)
|
|
264
|
+
return unless value
|
|
265
|
+
|
|
266
|
+
millis =
|
|
267
|
+
case value
|
|
268
|
+
when Time
|
|
269
|
+
(value.to_f * 1_000).to_i
|
|
270
|
+
when String
|
|
271
|
+
(Time.parse(value).to_f * 1_000).to_i
|
|
272
|
+
when Numeric
|
|
273
|
+
value.to_i
|
|
274
|
+
end
|
|
275
|
+
transaction.set_queue_start(millis) if millis
|
|
276
|
+
rescue ArgumentError
|
|
277
|
+
# Unparseable timestamp — skip rather than blow up.
|
|
278
|
+
end
|
|
279
|
+
|
|
280
|
+
def job_tags(payload)
|
|
281
|
+
tags = {
|
|
282
|
+
"queue" => payload[:queue],
|
|
283
|
+
"job_class" => payload[:job_class],
|
|
284
|
+
"attempts" => payload[:read_ct]
|
|
285
|
+
}
|
|
286
|
+
tags["active_job_id"] = payload[:job_id] if payload[:job_id]
|
|
287
|
+
tags["provider_job_id"] = payload[:provider_job_id] if payload[:provider_job_id]
|
|
288
|
+
tags["request_id"] = payload[:provider_job_id] || payload[:job_id]
|
|
289
|
+
tags.compact
|
|
290
|
+
end
|
|
291
|
+
|
|
292
|
+
def handler_tags(payload)
|
|
293
|
+
{
|
|
294
|
+
"handler" => payload[:handler],
|
|
295
|
+
"routing_key" => payload[:routing_key],
|
|
296
|
+
"attempts" => payload[:read_ct]
|
|
297
|
+
}.compact
|
|
298
|
+
end
|
|
299
|
+
end
|
|
300
|
+
end
|
|
301
|
+
end
|
|
302
|
+
end
|
|
303
|
+
end
|
|
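The `apply_queue_start` helper above normalizes three timestamp shapes (a `Time`, an ISO-8601 `String`, or epoch milliseconds) into the single representation AppSignal wants. A standalone sketch of that conversion using only Ruby's stdlib — the method name here is illustrative, not pgbus API:

```ruby
require "time"

# Normalize a queue-start value (Time, ISO-8601 String, or epoch millis)
# into Unix epoch milliseconds; returns nil when absent or unparseable.
def queue_start_millis(value)
  case value
  when Time    then (value.to_f * 1_000).to_i
  when String  then (Time.parse(value).to_f * 1_000).to_i
  when Numeric then value.to_i
  end
rescue ArgumentError
  nil
end

queue_start_millis("1970-01-01T00:00:01Z") # => 1000
```

Returning `nil` for unknown shapes (and rescuing `ArgumentError` from `Time.parse`) matches the subscriber's policy of skipping a bad timestamp rather than failing the transaction.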
data/lib/pgbus/integrations/appsignal.rb
ADDED
@@ -0,0 +1,52 @@
+# frozen_string_literal: true
+
+require "pgbus/integrations/appsignal/subscriber"
+require "pgbus/integrations/appsignal/probe"
+
+module Pgbus
+  module Integrations
+    # AppSignal integration for pgbus.
+    #
+    # Loaded automatically by Pgbus::Engine when the appsignal gem is present
+    # and config.appsignal_enabled is true (default). To opt out:
+    #
+    #   Pgbus.configure do |c|
+    #     c.appsignal_enabled = false
+    #   end
+    #
+    # The integration:
+    #   * Subscribes to pgbus.* ActiveSupport::Notifications and translates
+    #     them into AppSignal background-job transactions and metrics.
+    #   * Registers a minutely probe that reports queue depth, DLQ size,
+    #     dead-tuple counts, MVCC horizon age, and stream stats from
+    #     Pgbus::Web::DataSource.
+    #
+    # All metric names are prefixed `pgbus_` so they group cleanly in
+    # AppSignal's custom-metrics view.
+    module Appsignal
+      module_function
+
+      def install! # rubocop:disable Naming/PredicateMethod
+        return false unless defined?(::Appsignal)
+        return false if @installed
+
+        Subscriber.install!
+        Probe.install! if Pgbus.configuration.appsignal_probe_enabled
+        @installed = true
+        Pgbus.logger.info { "[Pgbus] AppSignal integration installed" }
+        true
+      end
+
+      def installed?
+        @installed == true
+      end
+
+      # Test hook: tear everything down so a fresh install! can run.
+      def reset!
+        Subscriber.reset! if defined?(Subscriber)
+        Probe.reset! if defined?(Probe)
+        @installed = false
+      end
+    end
+  end
+end
data/lib/pgbus/outbox.rb
CHANGED
@@ -5,22 +5,26 @@ module Pgbus
   module_function
 
   def publish(queue_name, payload, headers: nil, priority: nil, delay: 0)
-
-
-
-
-
-
-
+    Instrumentation.instrument("pgbus.outbox.publish", queue: queue_name, kind: :job) do
+      OutboxEntry.create!(
+        queue_name: queue_name,
+        payload: payload,
+        headers: headers,
+        priority: priority || Pgbus.configuration.default_priority,
+        delay: delay
+      )
+    end
   end
 
   def publish_event(routing_key, payload, headers: nil)
-
-
-
-
-
-
+    Instrumentation.instrument("pgbus.outbox.publish", routing_key: routing_key, kind: :event) do
+      event_data = EventBus::Publisher.build_event_data(payload)
+      OutboxEntry.create!(
+        routing_key: routing_key,
+        payload: event_data,
+        headers: headers
+      )
+    end
  end
 
  def flush!
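Both outbox paths now wrap the `OutboxEntry.create!` call in `Instrumentation.instrument`, which times the block and feeds the AppSignal subscriber's `on_outbox_publish` (where `event.duration` becomes `outbox_publish_duration_ms`). This is a hypothetical sketch of that wrapper's contract, not pgbus's implementation — the `:duration_ms` key and notification step are assumptions:

```ruby
# Time a block with a monotonic clock and return its value unchanged,
# so wrapping a call site in instrument(...) is behavior-transparent.
def instrument(name, payload = {})
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  payload[:duration_ms] = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1_000
  # a real implementation would publish `name` plus payload to subscribers here
  result
end
```

Because the wrapper returns the block's value, callers of `publish`/`publish_event` see the same return value (the created `OutboxEntry`) whether or not anything is subscribed.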
data/lib/pgbus/process/worker.rb
CHANGED
@@ -302,15 +302,33 @@ module Pgbus
       end
 
       def check_recycle
-        return unless @lifecycle.running?
+        return unless @lifecycle.running?
+
+        reason = recycle_reason
+        return unless reason
 
         Pgbus.stopping = true
         @lifecycle.transition_to(:draining)
+        Pgbus::Instrumentation.instrument(
+          "pgbus.worker.recycle",
+          reason: reason,
+          jobs_processed: @jobs_processed.value,
+          memory_mb: current_memory_mb,
+          lifetime_seconds: monotonic_now - @started_at_monotonic
+        )
         @wake_signal.notify!
       end
 
+      def recycle_reason
+        return :max_jobs if exceeded_max_jobs?
+        return :max_memory if exceeded_max_memory?
+        return :max_lifetime if exceeded_max_lifetime?
+
+        nil
+      end
+
      def recycle_needed?
-
+        !recycle_reason.nil?
      end
 
      def exceeded_max_jobs?
data/lib/pgbus/recurring/scheduler.rb
CHANGED
@@ -41,8 +41,16 @@ module Pgbus
 
      def tick(now)
        schedule.due_tasks(now).each do |task, run_at|
-
-
+          Pgbus::Instrumentation.instrument(
+            "pgbus.recurring.enqueue",
+            task: task.key,
+            class_name: task.class_name,
+            queue: task.queue_name,
+            run_at: run_at
+          ) do
+            schedule.enqueue_task(task, run_at: run_at)
+            @last_runs[task.key] = now
+          end
        rescue StandardError => e
          Pgbus.logger.error do
            "[Pgbus] Error scheduling recurring task #{task.key}: #{e.class}: #{e.message}"
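The new `recycle_reason` returns the first exceeded limit, so the reason tag reported to `pgbus.worker.recycle` follows a fixed precedence: `:max_jobs` masks `:max_memory`, which masks `:max_lifetime`. A self-contained sketch of that ordering — the threshold values here are illustrative defaults, not pgbus's configuration:

```ruby
# First matching limit wins; nil means no recycle is needed yet.
# Threshold defaults are made up for the example.
def recycle_reason(jobs:, memory_mb:, lifetime_s:,
                   max_jobs: 10_000, max_memory_mb: 512, max_lifetime_s: 86_400)
  return :max_jobs if jobs >= max_jobs
  return :max_memory if memory_mb >= max_memory_mb
  return :max_lifetime if lifetime_s >= max_lifetime_s

  nil
end
```

Deriving `recycle_needed?` as `!recycle_reason.nil?` keeps the two methods trivially consistent: there is exactly one place that knows the limits and their order.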
data/lib/pgbus/streams.rb
CHANGED
@@ -73,11 +73,19 @@ module Pgbus
       wrapped = { "html" => payload.to_s }
       wrapped["visible_to"] = visible_to.to_s if visible_to
       transaction = current_open_transaction
-
-
-
-
+      instrument_payload = {
+        stream: @name,
+        visible_to: visible_to,
+        deferred: !transaction.nil?,
+        bytes: wrapped["html"].bytesize
+      }
+      Instrumentation.instrument("pgbus.stream.broadcast", instrument_payload) do
+        if transaction
+          transaction.after_commit { @client.send_message(@name, wrapped) }
+          nil
+        else
+          @client.send_message(@name, wrapped)
+        end
       end
     end
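The broadcast path above is transaction-aware: with an open transaction the send is queued as an after-commit callback (and the `deferred: true` tag ends up on the `stream_broadcast_count` metric); otherwise it fires immediately. A pure-Ruby stand-in for that dispatch — `FakeTransaction` and the `sink` array are test doubles, not pgbus classes:

```ruby
# Minimal transaction double: collects after_commit callbacks and
# runs them on commit!, mimicking deferred stream broadcasts.
class FakeTransaction
  def initialize
    @callbacks = []
  end

  def after_commit(&block)
    @callbacks << block
  end

  def commit!
    @callbacks.each(&:call)
  end
end

# Inside a transaction the message is queued; outside it is sent now.
def broadcast(message, transaction, sink)
  if transaction
    transaction.after_commit { sink << message }
    nil
  else
    sink << message
  end
end
```

Deferring to after-commit prevents a subscriber from observing a broadcast for data whose transaction later rolls back.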
data/lib/pgbus/version.rb
CHANGED
data/lib/pgbus.rb
CHANGED
@@ -40,6 +40,10 @@ module Pgbus
   loader.ignore("#{__dir__}/generators")
   loader.ignore("#{__dir__}/active_job")
   loader.ignore("#{__dir__}/pgbus/testing")
+  # Vendor integrations are loaded conditionally (when the vendor gem
+  # is present) by lib/pgbus/engine.rb. Keeping them out of Zeitwerk
+  # means we don't reference vendor constants at autoload time.
+  loader.ignore("#{__dir__}/pgbus/integrations")
   # lib/puma/plugin/pgbus_streams.rb is a Puma plugin — it's required
   # explicitly by the user from config/puma.rb via `plugin :pgbus_streams`.
   # Without this ignore, Zeitwerk scans lib/puma/ under the pgbus loader
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pgbus
 version: !ruby/object:Gem::Version
-  version: 0.7.8
+  version: 0.7.9
 platform: ruby
 authors:
 - Mikael Henriksson
@@ -273,6 +273,12 @@ files:
 - lib/pgbus/generators/database_target_detector.rb
 - lib/pgbus/generators/migration_detector.rb
 - lib/pgbus/instrumentation.rb
+- lib/pgbus/integrations/appsignal.rb
+- lib/pgbus/integrations/appsignal/dashboards/pgbus_health.json
+- lib/pgbus/integrations/appsignal/dashboards/pgbus_streams.json
+- lib/pgbus/integrations/appsignal/dashboards/pgbus_throughput.json
+- lib/pgbus/integrations/appsignal/probe.rb
+- lib/pgbus/integrations/appsignal/subscriber.rb
 - lib/pgbus/log_formatter.rb
 - lib/pgbus/outbox.rb
 - lib/pgbus/outbox/poller.rb