eventhub-processor2 1.27.2 → 1.28.0
- checksums.yaml +4 -4
- data/.gitignore +2 -1
- data/.tool-versions +1 -1
- data/CHANGELOG.md +32 -0
- data/Makefile +115 -0
- data/README.md +30 -1
- data/Rakefile +9 -0
- data/example/README.md +15 -35
- data/lib/eventhub/actor_heartbeat.rb +39 -9
- data/lib/eventhub/actor_listener_amqp.rb +57 -3
- data/lib/eventhub/actor_publisher.rb +46 -12
- data/lib/eventhub/actor_watchdog.rb +23 -5
- data/lib/eventhub/base.rb +27 -1
- data/lib/eventhub/docs_renderer.rb +12 -7
- data/lib/eventhub/helper.rb +21 -4
- data/lib/eventhub/patches/celluloid_logger.rb +51 -0
- data/lib/eventhub/processor2.rb +5 -0
- data/lib/eventhub/version.rb +1 -1
- data/soak/README.md +72 -0
- data/soak/check_orphans.rb +37 -0
- data/{example → soak}/publisher.rb +44 -7
- metadata +13 -9
- /data/{example → soak}/CHANGELOG.md +0 -0
- /data/{example → soak}/config/receiver.json +0 -0
- /data/{example → soak}/config/router.json +0 -0
- /data/{example → soak}/crasher.rb +0 -0
- /data/{example → soak}/receiver.rb +0 -0
- /data/{example → soak}/router.rb +0 -0
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 712c5d8a0d59c49a1efe698824cca79e0d229894db4673cfc6eaed9d8080bc57
+data.tar.gz: c3107bdb2f39ab2e79faa2bcd6cd2230c0c4fa07b31a067efd9cba07fe51540b
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 8ff4cb3ad79b36661685556048fd9006f899c1deb7dcc80e00c83f15a253e2d5bd6e60d5697a83827ad1adcefe22dd2c54d3de7ee4eeeb0e9ce3126e7f3c345d
+data.tar.gz: e486e53d7a81179645b3e2a0be64bca1e6604215a16bdb5e1c5a08114ed2b7893339bf89ae48c9c34241bf516c49fa0bded503673b2fa15934b08c9bf20f2ee4
|
data/.gitignore
CHANGED
data/.tool-versions
CHANGED
|
@@ -1 +1 @@
-ruby 4.0.
+ruby 4.0.4
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,37 @@
 # Changelog of EventHub::Processor2
 
+# 1.28.0 / 2026-05-18
+
+**Reliability**
+
+* Survive broker restarts. `automatically_recover: true` + recovery callbacks, so transient disconnects no longer crash the Celluloid actor or exhaust the supervisor's restart budget. Fixes the dispatcher's crash-loop into permanent death after a broker bounce.
+* Survive silent consumer-thread death. New `channel#on_uncaught_exception` / `consumer#on_cancellation` hooks on `ActorListenerAmqp` escalate to an actor restart, but only for non-recoverable errors - transient `Bunny::NetworkFailure` / `ConnectionClosedError` / `TCPConnectionFailed` / `Timeout::Error` / `IOError` and mid-disconnect cancellations are left to Bunny's recovery (no spurious 15s restart sleep).
+* `ActorWatchdog` tolerates transient broker errors; only escalates on 3 consecutive cycles of persistent queue absence.
+* Tolerant `cleanup` in `ActorListenerAmqp`, `ActorPublisher`, `ActorHeartbeat` - bunny-3 raises on `close` of a torn-down session no longer crash the finalizer.
+* **Patch Celluloid 0.18 / Ruby 3.x incompatibility.** Celluloid's `Internals::Logger.crash` mutates a frozen string literal, which raises `FrozenError` before our exception handler runs - the actor thread then dies silently, the supervisor never sees the exit, and the listener becomes a zombie. Symptom: SIGHUP-triggered restart looks like it works but the listener stops consuming. Patched via `Module#prepend` in `lib/eventhub/patches/celluloid_logger.rb`.
+
+**Throughput**
+
+* `ActorPublisher` and `ActorHeartbeat` now reuse a single channel per actor instead of opening one per message. Eliminates the RabbitMQ `Channel is stopping with N pending publisher confirms` warning and lifts publisher throughput from ~hundreds to ~7k msg/s in local tests.
+* New `rake test:performance` task: regression gate at 5k msg/s on a reused channel (opt-in, tagged `:performance`).
+
+**Tracing**
+
+* `correlation_id` now survives the cross-actor hop on publish. `Processor2#publish` captures `CorrelationId.current` in the caller's thread before handing off to the publisher actor, so the AMQP header is preserved end-to-end.
+
+**Security**
+
+* Sensitive-value redaction in the rendered config page now walks nested hashes and arrays. Previously `server.credentials.password` and `connections[].token` leaked through; now redacted at any depth.
+
+**Observability**
+
+* `Celluloid.exception_handler` log line includes the dying actor's class name.
+* Log broker `connection.blocked` / `connection.unblocked` events.
+
+**Test harness**
+
+* New `soak/` reliability harness (publisher / router / receiver / crasher) with `make soak` Makefile target. Adaptive drain, real-vs-in-flight orphan classification, configurable chaos length. Validated with 2h chaos run: 0 real orphans.
+
 # 1.27.2 / 2026-04-08
 
 * Fix publish return value leaking Bunny::Exchange object back to callers, causing unintended re-publishing of garbage messages via `handle_payload`
|
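The changelog entry about patching a frozen-string mutation via `Module#prepend` can be illustrated with a small stand-alone sketch. The contents of `lib/eventhub/patches/celluloid_logger.rb` are not part of this diff view, so everything below is an assumption: `LegacyLogger` is a hypothetical stand-in for a third-party logger, and `UnfrozenCrashMessage` is a hypothetical patch module, not the gem's actual code.

```ruby
# Hypothetical stand-in for a third-party logger whose `crash` method
# appends to the string it receives (raises FrozenError on frozen input).
module LegacyLogger
  def self.crash(message, exception)
    message << "\n" << "#{exception.class}: #{exception.message}"
    message
  end
end

# Before the patch: a frozen message makes the logger itself raise.
begin
  LegacyLogger.crash("Actor crashed!".freeze, RuntimeError.new("boom"))
rescue FrozenError => e
  puts "before patch: #{e.class}"
end

# Hypothetical patch module. Prepending it to the singleton class puts it
# ahead of the original method in lookup order; `super` then receives an
# unfrozen copy of the message.
module UnfrozenCrashMessage
  def crash(message, exception)
    super(message.dup, exception)
  end
end

LegacyLogger.singleton_class.prepend(UnfrozenCrashMessage)

puts LegacyLogger.crash("Actor crashed!".freeze, RuntimeError.new("boom"))
```

Prepending keeps the original method body untouched, so the patch only has to deal with the one argument that causes trouble.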
data/Makefile
ADDED
|
@@ -0,0 +1,115 @@
|
|
|
1
|
+
# Convenience targets for running and soak-testing the chaos harness in soak/.
|
|
2
|
+
# The gem itself is built via `rake` (see Rakefile).
|
|
3
|
+
|
|
4
|
+
SHELL := /bin/bash
|
|
5
|
+
SOAK_DIR := soak
|
|
6
|
+
DATA_DIR := $(SOAK_DIR)/data
|
|
7
|
+
LOG_DIR := $(SOAK_DIR)/logs
|
|
8
|
+
SOAK_MINUTES ?= 10
|
|
9
|
+
SOAK_DRAIN_POLL_S ?= 5
|
|
10
|
+
SOAK_DRAIN_MAX_S ?= 600
|
|
11
|
+
|
|
12
|
+
.PHONY: help
|
|
13
|
+
help:
|
|
14
|
+
@echo "Targets:"
|
|
15
|
+
@echo " make soak-start start publisher, router, receiver, crasher in background"
|
|
16
|
+
@echo " make soak-stop stop everything (SIGINT, clean shutdown)"
|
|
17
|
+
@echo " make soak-clean stop and wipe $(DATA_DIR)/ + $(LOG_DIR)/"
|
|
18
|
+
@echo " make soak run reliability soak for SOAK_MINUTES (default 10) min"
|
|
19
|
+
@echo ""
|
|
20
|
+
@echo "Soak env overrides:"
|
|
21
|
+
@echo " SOAK_MINUTES=N length of the chaos phase (default 10)"
|
|
22
|
+
@echo " SOAK_DRAIN_POLL_S=N drain poll interval in seconds (default 5)"
|
|
23
|
+
@echo " SOAK_DRAIN_MAX_S=N drain hard cap in seconds (default 600 = 10 min)"
|
|
24
|
+
@echo ""
|
|
25
|
+
@echo "Publisher env overrides (forwarded to publisher.rb):"
|
|
26
|
+
@echo " PAUSE_BETWEEN_WORK=F seconds between publishes (default 0.05)"
|
|
27
|
+
@echo " PUBLISH_MAX_ATTEMPTS=N publish retry attempts on transient errors (default 8)"
|
|
28
|
+
@echo " PUBLISH_RETRY_DELAY_S=F seconds between publish retries (default 1)"
|
|
29
|
+
|
|
30
|
+
.PHONY: soak-start
|
|
31
|
+
soak-start: soak-stop
|
|
32
|
+
@mkdir -p $(DATA_DIR)
|
|
33
|
+
@echo "==> starting receiver"
|
|
34
|
+
@cd $(SOAK_DIR) && nohup ruby receiver.rb > /dev/null 2>&1 &
|
|
35
|
+
@echo "==> starting router"
|
|
36
|
+
@cd $(SOAK_DIR) && nohup ruby router.rb > /dev/null 2>&1 &
|
|
37
|
+
@sleep 3
|
|
38
|
+
@echo "==> starting publisher"
|
|
39
|
+
@cd $(SOAK_DIR) && nohup ruby publisher.rb > /dev/null 2>&1 &
|
|
40
|
+
@sleep 2
|
|
41
|
+
@echo "==> starting crasher"
|
|
42
|
+
@cd $(SOAK_DIR) && nohup ruby crasher.rb > /dev/null 2>&1 &
|
|
43
|
+
@echo "==> all processes started; tail logs in $(SOAK_DIR)/logs/ruby/"
|
|
44
|
+
|
|
45
|
+
.PHONY: soak-stop
|
|
46
|
+
soak-stop:
|
|
47
|
+
@pkill -INT -f "ruby (publisher|router|receiver|crasher)\.rb" 2>/dev/null || true
|
|
48
|
+
@sleep 2
|
|
49
|
+
|
|
50
|
+
.PHONY: soak-clean
|
|
51
|
+
soak-clean: soak-stop
|
|
52
|
+
@rm -rf $(DATA_DIR) $(LOG_DIR)
|
|
53
|
+
@mkdir -p $(DATA_DIR) $(LOG_DIR)/ruby
|
|
54
|
+
@echo "==> cleaned $(DATA_DIR) and $(LOG_DIR)"
|
|
55
|
+
|
|
56
|
+
# Soak: run the chaos loop for SOAK_MINUTES, then SIGKILL the publisher
|
|
57
|
+
# (skipping its cleanup so we keep an honest snapshot of any in-flight
|
|
58
|
+
# files), give router+receiver SOAK_DRAIN_S seconds to drain the queue,
|
|
59
|
+
# then count remaining files. Anything left is an orphan.
|
|
60
|
+
.PHONY: soak
|
|
61
|
+
soak:
|
|
62
|
+
@started=$$(date +%s); started_human=$$(date '+%Y-%m-%d %H:%M:%S %Z'); \
|
|
63
|
+
chaos_ends_at=$$(date -r $$(($$started + $(SOAK_MINUTES) * 60)) '+%H:%M:%S' 2>/dev/null \
|
|
64
|
+
|| date -d @"$$(($$started + $(SOAK_MINUTES) * 60))" '+%H:%M:%S' 2>/dev/null); \
|
|
65
|
+
echo "==> soak: $(SOAK_MINUTES) min chaos + adaptive drain (cap $(SOAK_DRAIN_MAX_S)s)"; \
|
|
66
|
+
echo " started: $$started_human"; \
|
|
67
|
+
echo " chaos ends ~ $$chaos_ends_at"; \
|
|
68
|
+
$(MAKE) --no-print-directory soak-clean; \
|
|
69
|
+
mkdir -p $(DATA_DIR); \
|
|
70
|
+
( cd $(SOAK_DIR) && nohup ruby receiver.rb > /dev/null 2>&1 & ); \
|
|
71
|
+
( cd $(SOAK_DIR) && nohup ruby router.rb > /dev/null 2>&1 & ); \
|
|
72
|
+
sleep 3; \
|
|
73
|
+
( cd $(SOAK_DIR) && nohup ruby publisher.rb > /dev/null 2>&1 & ); \
|
|
74
|
+
sleep 2; \
|
|
75
|
+
( cd $(SOAK_DIR) && nohup ruby crasher.rb > /dev/null 2>&1 & ); \
|
|
76
|
+
echo "==> chaos phase running for $$(($(SOAK_MINUTES) * 60))s..."; \
|
|
77
|
+
sleep $$(($(SOAK_MINUTES) * 60)); \
|
|
78
|
+
echo "==> stopping crasher (SIGINT)"; \
|
|
79
|
+
pkill -INT -f "ruby crasher\.rb" 2>/dev/null || true; \
|
|
80
|
+
echo "==> SIGKILL publisher to skip its cleanup and freeze the snapshot"; \
|
|
81
|
+
pkill -KILL -f "ruby publisher\.rb" 2>/dev/null || true; \
|
|
82
|
+
sleep 1; \
|
|
83
|
+
drain_started=$$(date +%s); prev=-1; deadline=$$(($$drain_started + $(SOAK_DRAIN_MAX_S))); \
|
|
84
|
+
echo "==> draining (poll every $(SOAK_DRAIN_POLL_S)s, cap $(SOAK_DRAIN_MAX_S)s)"; \
|
|
85
|
+
echo " (orphans = files with no matching store.json entry; in-flight = SIGKILL race, not a failure)"; \
|
|
86
|
+
while :; do \
|
|
87
|
+
now=$$(date +%s); \
|
|
88
|
+
read real in_flight <<< $$($(SOAK_DIR)/check_orphans.rb $(DATA_DIR)); \
|
|
89
|
+
elapsed_drain=$$(($$now - $$drain_started)); \
|
|
90
|
+
printf " [%4ds] real=%d in_flight=%d\n" $$elapsed_drain $$real $$in_flight; \
|
|
91
|
+
if [ "$$real" = "0" ]; then break; fi; \
|
|
92
|
+
if [ $$now -ge $$deadline ]; then echo " drain cap reached, giving up"; break; fi; \
|
|
93
|
+
if [ "$$real" = "$$prev" ]; then echo " no progress in last interval, giving up"; break; fi; \
|
|
94
|
+
prev=$$real; \
|
|
95
|
+
sleep $(SOAK_DRAIN_POLL_S); \
|
|
96
|
+
done; \
|
|
97
|
+
finished=$$(date +%s); finished_human=$$(date '+%Y-%m-%d %H:%M:%S %Z'); \
|
|
98
|
+
elapsed=$$(($$finished - $$started)); \
|
|
99
|
+
read real in_flight <<< $$($(SOAK_DIR)/check_orphans.rb $(DATA_DIR)); \
|
|
100
|
+
echo ""; \
|
|
101
|
+
echo "==> soak result"; \
|
|
102
|
+
echo " started: $$started_human"; \
|
|
103
|
+
echo " finished: $$finished_human"; \
|
|
104
|
+
echo " elapsed total: $${elapsed}s"; \
|
|
105
|
+
echo " drain time: $$(($$finished - $$drain_started))s"; \
|
|
106
|
+
echo " real orphans: $$real (pipeline loss)"; \
|
|
107
|
+
echo " in-flight at SIGKILL: $$in_flight (expected residual; cleaned on next publisher start)"; \
|
|
108
|
+
$(MAKE) --no-print-directory soak-stop; \
|
|
109
|
+
if [ "$$real" = "0" ]; then \
|
|
110
|
+
echo "==> PASS"; \
|
|
111
|
+
else \
|
|
112
|
+
echo "==> FAIL: $$real real orphan(s) remain after drain"; \
|
|
113
|
+
$(SOAK_DIR)/check_orphans.rb $(DATA_DIR) --list-orphans >/dev/null; \
|
|
114
|
+
exit 1; \
|
|
115
|
+
fi
|
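The drain loop in the `soak` target has three exit conditions: no real orphans left, the hard cap reached, or no progress since the previous poll. A minimal Ruby sketch of that control flow, under stated assumptions: `drain`, `probe`, `clock`, and `pause` are hypothetical names introduced here so the loop can run without files or real sleeping. The Makefile does the same with `date +%s`, `sleep`, and `check_orphans.rb`.

```ruby
# Poll `probe` until it reports zero, the cap elapses, or two consecutive
# polls return the same count. Returns [reason, last_count].
def drain(probe:, poll_s:, max_s:,
          clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
          pause: ->(seconds) { sleep(seconds) })
  deadline = clock.call + max_s
  prev = nil
  loop do
    real = probe.call
    return [:drained, real] if real.zero?
    return [:cap_reached, real] if clock.call >= deadline
    return [:stalled, real] if real == prev
    prev = real
    pause.call(poll_s)
  end
end

# Usage with a fake clock: the count stops shrinking, so the loop gives up.
counts = [5, 3, 3]
now = 0
p drain(probe: -> { counts.shift }, poll_s: 5, max_s: 600,
        clock: -> { now }, pause: ->(s) { now += s })
```

Giving up on a stalled count keeps a failing run short instead of always waiting out the full cap.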
data/README.md
CHANGED
|
@@ -463,7 +463,7 @@ end
|
|
|
463
463
|
|
|
464
464
|
### Configuration
|
|
465
465
|
|
|
466
|
-
Displays the active configuration as an HTML table. Sensitive values (passwords, tokens, keys) are automatically redacted.
|
|
466
|
+
Displays the active configuration as an HTML table. Sensitive values (passwords, tokens, keys) are automatically redacted at any depth — keys matching the sensitive list are masked whether they appear at the top level, inside nested hashes, or inside hashes nested in arrays.
|
|
467
467
|
|
|
468
468
|
```
|
|
469
469
|
GET {base_path}/docs/configuration
|
|
@@ -547,6 +547,35 @@ end
|
|
|
547
547
|
|
|
548
548
|
To install this gem onto your local machine, run `bundle exec rake install`.
|
|
549
549
|
|
|
550
|
+
### Reliability soak harness
|
|
551
|
+
|
|
552
|
+
A multi-process chaos harness lives in `soak/` (publisher, router, receiver, crasher). It validates end-to-end reliability under sustained broker restarts and `SIGHUP`-triggered listener restarts. A run is considered passing when zero "real orphan" files remain (files whose UUID is not in the publisher's transaction store — i.e. messages whose delivery was claimed by the publisher but never made it through the pipeline).
|
|
553
|
+
|
|
554
|
+
```
|
|
555
|
+
# 10-minute chaos + adaptive drain (default), prints PASS/FAIL
|
|
556
|
+
make soak
|
|
557
|
+
|
|
558
|
+
# Longer runs
|
|
559
|
+
SOAK_MINUTES=60 make soak
|
|
560
|
+
SOAK_MINUTES=120 SOAK_DRAIN_MAX_S=1200 make soak
|
|
561
|
+
|
|
562
|
+
# See all targets and env knobs
|
|
563
|
+
make help
|
|
564
|
+
```
|
|
565
|
+
|
|
566
|
+
See `soak/README.md` for the full description.
|
|
567
|
+
|
|
568
|
+
### Throughput baseline
|
|
569
|
+
|
|
570
|
+
`rake test:performance` runs a publisher-throughput regression spec against a real local RabbitMQ. It is excluded from `rake spec` (tagged `:performance`) so it does not run on every CI build.
|
|
571
|
+
|
|
572
|
+
```
|
|
573
|
+
bundle exec rake test:performance
|
|
574
|
+
PERF_FLOOR=3000 bundle exec rake test:performance # lower the floor for slower hosts
|
|
575
|
+
```
|
|
576
|
+
|
|
577
|
+
Default floor is 5,000 msg/s. Typical local result is ~7,000 msg/s.
|
|
578
|
+
|
|
550
579
|
## Publishing
|
|
551
580
|
|
|
552
581
|
This project uses [Trusted Publishing](https://guides.rubygems.org/trusted-publishing/) to securely publish gems to RubyGems.org via GitHub Actions. To release a new version:
|
data/Rakefile
CHANGED
|
@@ -4,6 +4,15 @@ require "standard/rake"
 
 RSpec::Core::RakeTask.new(:spec) do |t|
   t.verbose = false
+  t.rspec_opts = "--tag ~performance"
+end
+
+namespace :test do
+  desc "Run throughput baseline against real RabbitMQ (skipped by `rake spec`)"
+  RSpec::Core::RakeTask.new(:performance) do |t|
+    t.verbose = false
+    t.rspec_opts = "--tag performance --format documentation"
+  end
 end
 
 desc "Initialize or reset rabbitmq docker container (run before rspec)"
|
data/example/README.md
CHANGED
|
@@ -1,40 +1,20 @@
|
|
|
1
|
-
## Example
|
|
1
|
+
## Example processor
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
`example.rb` is a minimal `EventHub::Processor2` subclass used to demo and
|
|
4
|
+
visually verify the gem's built-in HTTP endpoints (`/heartbeat`, `/version`,
|
|
5
|
+
`/docs`, `/changelog`, `/configuration`). It's intentionally tiny - just enough
|
|
6
|
+
to start a processor against the local RabbitMQ container.
|
|
4
7
|
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
|
|
8
|
-
* receiver.rb - receives message and does final processing
|
|
9
|
-
* crasher.rb - restarts message broker or sends signals to other processes
|
|
8
|
+
For the chaos / reliability test harness (publisher, router, receiver,
|
|
9
|
+
crasher), see [`../soak/`](../soak/) and the `make soak` target in the
|
|
10
|
+
project root.
|
|
10
11
|
|
|
11
|
-
###
|
|
12
|
+
### Run
|
|
12
13
|
|
|
13
|
-
|
|
14
|
-
|
|
14
|
+
```bash
|
|
15
|
+
cd example
|
|
16
|
+
bundle exec ruby example.rb
|
|
17
|
+
```
|
|
15
18
|
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
2. router.rb receives the message and passes it to exmaple.outbound queue
|
|
19
|
-
|
|
20
|
-
3. receiver.rb gets the message and deletes the file with the given ID
|
|
21
|
-
|
|
22
|
-
### Goal
|
|
23
|
-
What ever happens to these components (restarted, killed and restarted, stopped and started, message broker killed, stopped and started) if you do a graceful shutdown at the end there should be no message in the /data folder (except store.json).
|
|
24
|
-
|
|
25
|
-
Graceful shutdown with CTRL-C or TERM signal to pid
|
|
26
|
-
* Stop producer.rb. Leave the other components running until all messages in example.* queues are gone.
|
|
27
|
-
* Stop remaining components
|
|
28
|
-
* Check ./example/data folder
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
### How to use?
|
|
32
|
-
* Make sure docker container (process-rabbitmq) is running (see [readme](../docker/README.md))
|
|
33
|
-
* Start one or more router with: bundle exec ruby router.rb
|
|
34
|
-
* Start one or more receiver with: bundle exec ruby receier.rb
|
|
35
|
-
* Start one publisher with: bundle exec ruby publisher.rb
|
|
36
|
-
* Start one crasher with: bundle exec ruby crasher.rb (or do this manually)
|
|
37
|
-
|
|
38
|
-
### Note
|
|
39
|
-
* Publisher has a simple transaction store implemented to deal with issues between file creation and file publishing. At the end of the publisher process in the cleanup method pending transaction get processed and coresponding files get deleted.
|
|
40
|
-
* Watch for huge log files!
|
|
19
|
+
Then visit `http://localhost:8083/svc/example/docs` (or whatever `http.port`
|
|
20
|
+
is configured in `config/example.json`).
|
data/lib/eventhub/actor_heartbeat.rb
CHANGED
|
@@ -9,6 +9,8 @@ module EventHub
|
|
|
9
9
|
|
|
10
10
|
def initialize(processor_instance)
|
|
11
11
|
@processor_instance = processor_instance
|
|
12
|
+
@connection = nil
|
|
13
|
+
@channel = nil
|
|
12
14
|
async.start
|
|
13
15
|
end
|
|
14
16
|
|
|
@@ -28,21 +30,49 @@ module EventHub
|
|
|
28
30
|
|
|
29
31
|
def cleanup
|
|
30
32
|
EventHub.logger.info("Heartbeat is cleaning up...")
|
|
31
|
-
|
|
32
|
-
|
|
33
|
+
begin
|
|
34
|
+
publish(heartbeat(action: "stopped"))
|
|
35
|
+
EventHub.logger.info("Heartbeat has sent a [stopped] beat")
|
|
36
|
+
rescue => ex
|
|
37
|
+
EventHub.logger.warn("Heartbeat cleanup publish: ignoring #{ex.class}: #{ex.message}")
|
|
38
|
+
end
|
|
39
|
+
begin
|
|
40
|
+
@channel&.close
|
|
41
|
+
rescue => ex
|
|
42
|
+
EventHub.logger.warn("Heartbeat cleanup channel: ignoring #{ex.class}: #{ex.message}")
|
|
43
|
+
end
|
|
44
|
+
begin
|
|
45
|
+
@connection&.close
|
|
46
|
+
rescue => ex
|
|
47
|
+
EventHub.logger.warn("Heartbeat cleanup connection: ignoring #{ex.class}: #{ex.message}")
|
|
48
|
+
end
|
|
33
49
|
end
|
|
34
50
|
|
|
35
51
|
private
|
|
36
52
|
|
|
37
53
|
def publish(message)
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
channel = connection.create_channel
|
|
41
|
-
channel.confirm_select(tracking: true)
|
|
42
|
-
exchange = channel.direct(EventHub::EH_X_INBOUND, durable: true)
|
|
54
|
+
ensure_channel
|
|
55
|
+
exchange = @channel.direct(EventHub::EH_X_INBOUND, durable: true)
|
|
43
56
|
exchange.publish(message, persistent: true)
|
|
44
|
-
|
|
45
|
-
|
|
57
|
+
rescue Bunny::NetworkFailure, Bunny::ChannelAlreadyClosed => e
|
|
58
|
+
EventHub.logger.warn("Heartbeat channel dropped: #{e.class}: #{e.message}")
|
|
59
|
+
begin
|
|
60
|
+
@channel&.close
|
|
61
|
+
rescue
|
|
62
|
+
nil
|
|
63
|
+
end
|
|
64
|
+
@channel = nil
|
|
65
|
+
raise
|
|
66
|
+
end
|
|
67
|
+
|
|
68
|
+
def ensure_channel
|
|
69
|
+
unless @connection
|
|
70
|
+
@connection = create_bunny_connection
|
|
71
|
+
@connection.start
|
|
72
|
+
end
|
|
73
|
+
return if @channel&.open?
|
|
74
|
+
@channel = @connection.create_channel
|
|
75
|
+
@channel.confirm_select(tracking: true)
|
|
46
76
|
end
|
|
47
77
|
|
|
48
78
|
def heartbeat(args = {action: "running"})
|
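The reworked heartbeat `cleanup` wraps each teardown step in its own `begin`/`rescue`, so a failure in one step does not skip the steps after it. A minimal sketch of that pattern, assuming nothing about Bunny: `tolerant_cleanup` and the lambdas are hypothetical, and warnings are collected in an array rather than sent to `EventHub.logger`.

```ruby
# Run each teardown step independently. A step that raises is recorded as a
# warning and the remaining steps still run. Returns the collected warnings.
def tolerant_cleanup(steps)
  steps.each_with_object([]) do |(label, step), warnings|
    step.call
  rescue => ex
    warnings << "cleanup #{label}: ignoring #{ex.class}: #{ex.message}"
  end
end

closed = []
warnings = tolerant_cleanup({
  "channel" => -> { raise IOError, "closed stream" },  # simulated torn-down session
  "connection" => -> { closed << :connection }
})

puts warnings
puts "closed: #{closed.inspect}"
```

With a single shared `rescue`, the connection close in this example would never have been attempted.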
data/lib/eventhub/actor_listener_amqp.rb
CHANGED
|
@@ -39,6 +39,37 @@ module EventHub
|
|
|
39
39
|
def listen(args = {})
|
|
40
40
|
with_listen(args) do |connection, channel, consumer, queue, queue_name|
|
|
41
41
|
EventHub.logger.info("Listening to queue [#{queue_name}]")
|
|
42
|
+
|
|
43
|
+
# log broker-initiated connection state changes
|
|
44
|
+
connection.on_blocked { |reason| EventHub.logger.warn("Broker blocked connection: #{reason}") }
|
|
45
|
+
connection.on_unblocked { EventHub.logger.info("Broker unblocked connection") }
|
|
46
|
+
|
|
47
|
+
# Only escalate to an actor restart for exceptions Bunny will NOT
|
|
48
|
+
# recover from. Transient network exceptions are handled by Bunny's
|
|
49
|
+
# automatic recovery; an actor restart in that window races recovery
|
|
50
|
+
# and incurs an avoidable 15s before_restart sleep without consumption.
|
|
51
|
+
channel.on_uncaught_exception do |ex, _consumer|
|
|
52
|
+
if recoverable_bunny_error?(ex)
|
|
53
|
+
EventHub.logger.warn("Consumer thread raised recoverable #{ex.class}: #{ex.message} - leaving recovery to Bunny")
|
|
54
|
+
else
|
|
55
|
+
EventHub.logger.error("Consumer thread raised non-recoverable #{ex.class}: #{ex.message} - restarting listener")
|
|
56
|
+
Celluloid::Actor[:actor_listener_amqp]&.async&.restart
|
|
57
|
+
end
|
|
58
|
+
end
|
|
59
|
+
|
|
60
|
+
# Broker may cancel a consumer (queue deleted, HA failover, policy change).
|
|
61
|
+
# If the connection is still open, this is a real broker-side cancel and
|
|
62
|
+
# we must restart. If the connection is closed/recovering, Bunny will
|
|
63
|
+
# re-register the consumer itself on reconnect; do not race it.
|
|
64
|
+
consumer.on_cancellation do
|
|
65
|
+
if connection.open?
|
|
66
|
+
EventHub.logger.error("Consumer for [#{queue_name}] cancelled by broker - restarting listener")
|
|
67
|
+
Celluloid::Actor[:actor_listener_amqp]&.async&.restart
|
|
68
|
+
else
|
|
69
|
+
EventHub.logger.warn("Consumer for [#{queue_name}] cancelled during disconnect - leaving recovery to Bunny")
|
|
70
|
+
end
|
|
71
|
+
end
|
|
72
|
+
|
|
42
73
|
consumer.on_delivery do |delivery_info, metadata, payload|
|
|
43
74
|
CorrelationId.with(metadata[:correlation_id]) do
|
|
44
75
|
EventHub.logger.info("#{queue_name}: [#{delivery_info.delivery_tag}]" \
|
|
@@ -70,9 +101,10 @@ module EventHub
|
|
|
70
101
|
|
|
71
102
|
def with_listen(args = {}, &block)
|
|
72
103
|
connection = create_bunny_connection
|
|
73
|
-
connection.start
|
|
74
104
|
queue_name = args[:queue_name]
|
|
105
|
+
# store FIRST so cleanup can find a partially-started session
|
|
75
106
|
@connections[queue_name] = connection
|
|
107
|
+
connection.start
|
|
76
108
|
channel = connection.create_channel
|
|
77
109
|
channel.prefetch(1)
|
|
78
110
|
queue = channel.queue(queue_name, durable: true)
|
|
@@ -88,6 +120,7 @@ module EventHub
|
|
|
88
120
|
def handle_payload(args = {})
|
|
89
121
|
response_messages = []
|
|
90
122
|
connection = args[:connection]
|
|
123
|
+
correlation_id = args[:correlation_id] || CorrelationId.current
|
|
91
124
|
|
|
92
125
|
# convert to EventHub message
|
|
93
126
|
message = EventHub::Message.from_json(args[:payload])
|
|
@@ -123,9 +156,12 @@ module EventHub
|
|
|
123
156
|
end
|
|
124
157
|
end
|
|
125
158
|
|
|
159
|
+
# use possibly-updated execution_id fallback from above
|
|
160
|
+
correlation_id ||= CorrelationId.current
|
|
161
|
+
|
|
126
162
|
Array(response_messages).each do |message|
|
|
127
163
|
next unless message.is_a?(EventHub::Message)
|
|
128
|
-
publish(message: message.to_json, connection: connection)
|
|
164
|
+
publish(message: message.to_json, connection: connection, correlation_id: correlation_id)
|
|
129
165
|
end
|
|
130
166
|
end
|
|
131
167
|
|
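The `correlation_id` handling above exists because a thread-local value is not visible once work moves to another thread, which is what happens when a message is handed to an actor. A minimal sketch of the capture-before-handoff idea, with assumptions labeled: `RequestContext` is a hypothetical stand-in for the gem's `CorrelationId`, and `publish_via_worker` uses a plain `Thread` in place of a Celluloid actor.

```ruby
# Hypothetical thread-local holder, standing in for the gem's CorrelationId.
module RequestContext
  def self.current
    Thread.current[:correlation_id]
  end

  def self.with(id)
    previous = Thread.current[:correlation_id]
    Thread.current[:correlation_id] = id
    yield
  ensure
    Thread.current[:correlation_id] = previous
  end
end

# The default argument is evaluated in the caller's thread, so the id is
# captured before the hop. The worker thread only sees what it is handed.
def publish_via_worker(message, correlation_id: RequestContext.current)
  Thread.new { {message: message, correlation_id: correlation_id} }.value
end

RequestContext.with("abc-123") do
  p publish_via_worker("hello")                  # id captured by the caller
  p Thread.new { RequestContext.current }.value  # id not visible in the new thread
end
```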
|
@@ -136,15 +172,33 @@ module EventHub
|
|
|
136
172
|
|
|
137
173
|
def cleanup
|
|
138
174
|
EventHub.logger.info("Listener amqp is cleaning up...")
|
|
139
|
-
# close all open connections
|
|
175
|
+
# close all open connections; bunny-3 can raise on a torn-down session
|
|
140
176
|
return unless @connections
|
|
141
177
|
@connections.values.each do |connection|
|
|
142
178
|
connection&.close
|
|
179
|
+
rescue => ex
|
|
180
|
+
EventHub.logger.warn("Listener cleanup: ignoring #{ex.class}: #{ex.message}")
|
|
143
181
|
end
|
|
144
182
|
end
|
|
145
183
|
|
|
146
184
|
def publish(args)
|
|
147
185
|
@actor_publisher.publish(args)
|
|
148
186
|
end
|
|
187
|
+
|
|
188
|
+
# Exceptions that Bunny's network recovery handles transparently. If one of
|
|
189
|
+
# these bubbles into `on_uncaught_exception`, the right move is to let the
|
|
190
|
+
# in-flight recovery complete rather than racing it with an actor restart.
|
|
191
|
+
RECOVERABLE_BUNNY_ERRORS = [
|
|
192
|
+
Bunny::NetworkFailure,
|
|
193
|
+
Bunny::ConnectionClosedError,
|
|
194
|
+
Bunny::TCPConnectionFailed,
|
|
195
|
+
Bunny::TCPConnectionFailedForAllHosts,
|
|
196
|
+
Timeout::Error,
|
|
197
|
+
IOError
|
|
198
|
+
].freeze
|
|
199
|
+
|
|
200
|
+
def recoverable_bunny_error?(ex)
|
|
201
|
+
RECOVERABLE_BUNNY_ERRORS.any? { |klass| ex.is_a?(klass) }
|
|
202
|
+
end
|
|
149
203
|
end
|
|
150
204
|
end
|
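The `recoverable_bunny_error?` check decides between leaving recovery to the client library and restarting the listener. A minimal sketch of that decision without Bunny loaded: `FakeNetworkFailure`, `RECOVERABLE_ERRORS`, and `on_uncaught` are hypothetical names, and the return symbols stand in for the log-and-restart side effects in the diff.

```ruby
require "timeout"

# Hypothetical stand-in for a client-library network error. The list in the
# diff names Bunny classes, which are not loaded in this sketch.
class FakeNetworkFailure < StandardError; end

RECOVERABLE_ERRORS = [FakeNetworkFailure, Timeout::Error, IOError].freeze

def recoverable_error?(ex)
  # is_a? also matches subclasses, e.g. EOFError is an IOError.
  RECOVERABLE_ERRORS.any? { |klass| ex.is_a?(klass) }
end

# Decide what the uncaught-exception hook should do with an exception.
def on_uncaught(ex)
  recoverable_error?(ex) ? :leave_to_client_recovery : :restart_listener
end

p on_uncaught(EOFError.new("end of file reached"))
p on_uncaught(ArgumentError.new("bug in handler"))
```

Matching with `is_a?` means listing a base class such as `IOError` also covers its subclasses.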
data/lib/eventhub/actor_publisher.rb
CHANGED
|
@@ -10,23 +10,17 @@ module EventHub
|
|
|
10
10
|
def initialize
|
|
11
11
|
EventHub.logger.info("Publisher is starting...")
|
|
12
12
|
@connection = nil
|
|
13
|
+
@channel = nil
|
|
13
14
|
end
|
|
14
15
|
|
|
15
16
|
def publish(args = {})
|
|
16
|
-
|
|
17
|
-
unless @connection
|
|
18
|
-
@connection = create_bunny_connection
|
|
19
|
-
@connection.start
|
|
20
|
-
end
|
|
17
|
+
ensure_channel
|
|
21
18
|
|
|
22
19
|
message = args[:message]
|
|
23
20
|
return if message.nil?
|
|
24
21
|
|
|
25
22
|
exchange_name = args[:exchange_name] || EH_X_INBOUND
|
|
26
|
-
|
|
27
|
-
channel = @connection.create_channel
|
|
28
|
-
channel.confirm_select(tracking: true)
|
|
29
|
-
exchange = channel.direct(exchange_name, durable: true)
|
|
23
|
+
exchange = @channel.direct(exchange_name, durable: true)
|
|
30
24
|
|
|
31
25
|
publish_options = {persistent: true}
|
|
32
26
|
correlation_id = args[:correlation_id] || CorrelationId.current
|
|
@@ -34,13 +28,53 @@ module EventHub
|
|
|
34
28
|
|
|
35
29
|
exchange.publish(message, publish_options)
|
|
36
30
|
nil
|
|
37
|
-
|
|
38
|
-
channel
|
|
31
|
+
rescue Bunny::NetworkFailure, Bunny::ChannelAlreadyClosed => e
|
|
32
|
+
# broker-side close - drop the channel so next publish reopens it
|
|
33
|
+
EventHub.logger.warn("Publisher channel dropped: #{e.class}: #{e.message}")
|
|
34
|
+
begin
|
|
35
|
+
@channel&.close
|
|
36
|
+
rescue
|
|
37
|
+
nil
|
|
38
|
+
end
|
|
39
|
+
@channel = nil
|
|
40
|
+
raise
|
|
39
41
|
end
|
|
40
42
|
|
|
41
43
|
def cleanup
|
|
42
44
|
EventHub.logger.info("Publisher is cleaning up...")
|
|
43
|
-
|
|
45
|
+
begin
|
|
46
|
+
@channel&.close
|
|
47
|
+
rescue => ex
|
|
48
|
+
EventHub.logger.warn("Publisher cleanup channel: ignoring #{ex.class}: #{ex.message}")
|
|
49
|
+
end
|
|
50
|
+
begin
|
|
51
|
+
@connection&.close
|
|
52
|
+
rescue => ex
|
|
53
|
+
EventHub.logger.warn("Publisher cleanup connection: ignoring #{ex.class}: #{ex.message}")
|
|
54
|
+
end
|
|
55
|
+
end
|
|
56
|
+
|
|
57
|
+
private
|
|
58
|
+
|
|
59
|
+
def ensure_channel
|
|
60
|
+
unless @connection
|
|
61
|
+
@connection = create_bunny_connection
|
|
62
|
+
@connection.start
|
|
63
|
+
end
|
|
64
|
+
return if @channel&.open?
|
|
65
|
+
|
|
66
|
+
attempts = 0
|
|
67
|
+
begin
|
|
68
|
+
@channel = @connection.create_channel
|
|
69
|
+
@channel.confirm_select(tracking: true)
|
|
70
|
+
rescue Bunny::NetworkFailure, Bunny::ChannelAlreadyClosed
|
|
71
|
+
attempts += 1
|
|
72
|
+
if attempts < 3
|
|
73
|
+
sleep 1
|
|
74
|
+
retry
|
|
75
|
+
end
|
|
76
|
+
raise
|
|
77
|
+
end
|
|
44
78
|
end
|
|
45
79
|
end
|
|
46
80
|
end
|
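The publisher change replaces "one channel per message" with a lazily created channel that is reused until it is no longer open. A minimal in-memory sketch of that idea: `FakeChannel`, `FakeConnection`, and `ReusingPublisher` are hypothetical classes, not Bunny's API, and the retry and rescue paths from the diff are left out.

```ruby
# Hypothetical in-memory stand-in for a broker channel.
class FakeChannel
  attr_reader :published

  def initialize
    @open = true
    @published = []
  end

  def open?
    @open
  end

  def close
    @open = false
  end

  def publish(message)
    raise IOError, "channel closed" unless @open
    @published << message
  end
end

# Hypothetical connection that remembers every channel it created.
class FakeConnection
  attr_reader :channels

  def initialize
    @channels = []
  end

  def create_channel
    FakeChannel.new.tap { |channel| @channels << channel }
  end
end

class ReusingPublisher
  def initialize(connection)
    @connection = connection
    @channel = nil
  end

  def publish(message)
    ensure_channel
    @channel.publish(message)
    nil
  end

  private

  # Open a channel only when none exists or the current one is closed.
  def ensure_channel
    return if @channel&.open?
    @channel = @connection.create_channel
  end
end

connection = FakeConnection.new
publisher = ReusingPublisher.new(connection)
1_000.times { |i| publisher.publish("message #{i}") }
puts "channels opened: #{connection.channels.size}"
```

If the channel is closed between publishes, the next call opens a replacement instead of failing.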
data/lib/eventhub/actor_watchdog.rb
CHANGED
|
@@ -7,9 +7,13 @@ module EventHub
|
|
|
7
7
|
|
|
8
8
|
finalizer :cleanup
|
|
9
9
|
|
|
10
|
+
# number of consecutive failed cycles before we raise to force a restart
|
|
11
|
+
MISSING_QUEUE_THRESHOLD = 3
|
|
12
|
+
|
|
10
13
|
def initialize
|
|
11
14
|
cycle = Configuration.processor[:watchdog_cycle_in_s]
|
|
12
15
|
EventHub.logger.info("Watchdog is starting [cycle: #{cycle}s]...")
|
|
16
|
+
@consecutive_failures = 0
|
|
13
17
|
async.start
|
|
14
18
|
end
|
|
15
19
|
|
|
@@ -30,14 +34,28 @@ module EventHub
|
|
|
30
34
|
connection = create_bunny_connection
|
|
31
35
|
connection.start
|
|
32
36
|
|
|
33
|
-
EventHub::Configuration.processor[:listener_queues].
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
+
missing = EventHub::Configuration.processor[:listener_queues].reject do |queue_name|
|
|
38
|
+
connection.queue_exists?(queue_name)
|
|
39
|
+
end
|
|
40
|
+
|
|
41
|
+
if missing.empty?
|
|
42
|
+
@consecutive_failures = 0
|
|
43
|
+
else
|
|
44
|
+
@consecutive_failures += 1
|
|
45
|
+
EventHub.logger.warn("Watchdog: queue(s) missing #{missing.inspect} (#{@consecutive_failures}/#{MISSING_QUEUE_THRESHOLD})")
|
|
46
|
+
if @consecutive_failures >= MISSING_QUEUE_THRESHOLD
|
|
47
|
+
raise "Queue(s) missing for #{@consecutive_failures} consecutive cycles: #{missing.inspect}"
|
|
37
48
|
end
|
|
38
49
|
end
|
|
50
|
+
rescue Bunny::NetworkFailure, Bunny::TCPConnectionFailed, Timeout::Error => ex
|
|
51
|
+
# transient broker problems are auto-recovered by Bunny; don't fight it
|
|
52
|
+
EventHub.logger.warn("Watchdog: transient broker error #{ex.class}: #{ex.message} - skipping cycle")
|
|
39
53
|
ensure
|
|
40
|
-
|
|
54
|
+
begin
|
|
55
|
+
connection&.close
|
|
56
|
+
rescue
|
|
57
|
+
nil
|
|
58
|
+
end
|
|
41
59
|
end
|
|
42
60
|
end
|
|
43
61
|
end
|
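The watchdog now escalates only after `MISSING_QUEUE_THRESHOLD` consecutive cycles with missing queues, and a healthy cycle resets the count. As the hunk reads, a cycle skipped because of a transient broker error leaves the counter unchanged. A minimal sketch of the counter logic alone: `MissingQueueTracker` and its return symbols are hypothetical, and the broker check is replaced by a list of missing queue names passed in.

```ruby
class MissingQueueTracker
  THRESHOLD = 3

  attr_reader :consecutive_failures

  def initialize
    @consecutive_failures = 0
  end

  # Returns :ok for a healthy cycle and :warn for a failed one below the
  # threshold. Raises once THRESHOLD consecutive cycles have failed.
  def record(missing)
    if missing.empty?
      @consecutive_failures = 0
      return :ok
    end

    @consecutive_failures += 1
    if @consecutive_failures >= THRESHOLD
      raise "Queue(s) missing for #{@consecutive_failures} consecutive cycles: #{missing.inspect}"
    end
    :warn
  end
end

tracker = MissingQueueTracker.new
p tracker.record(["example.inbound"])
p tracker.record([])
p tracker.consecutive_failures
```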
data/lib/eventhub/base.rb
CHANGED
|
@@ -28,6 +28,32 @@ require_relative "actor_listener_amqp"
|
|
|
28
28
|
require_relative "actor_listener_http"
|
|
29
29
|
require_relative "docs_renderer"
|
|
30
30
|
require_relative "processor2"
|
|
31
|
+
require_relative "patches/celluloid_logger"
|
|
32
|
+
|
|
33
|
+
module EventHub
|
|
34
|
+
# Format a Celluloid actor exception with the dying actor's class name
|
|
35
|
+
# so post-mortem analysis can identify which actor died.
|
|
36
|
+
#
|
|
37
|
+
# Important: inside an actor's crash flow, `Celluloid.current_actor`
|
|
38
|
+
# returns the Proxy::Cell, whose `.class` goes through method_missing /
|
|
39
|
+
# the mailbox - which can hang on a dying actor. Read the raw actor
|
|
40
|
+
# object out of the thread-local instead and walk to the subject class
|
|
41
|
+
# via instance variables (no proxy round-trips).
|
|
42
|
+
def self.format_celluloid_exception(ex)
|
|
43
|
+
actor_name = begin
|
|
44
|
+
actor = Thread.current[:celluloid_actor]
|
|
45
|
+
if actor
|
|
46
|
+
behavior = actor.instance_variable_get(:@behavior)
|
|
47
|
+
subject = behavior&.instance_variable_get(:@subject)
|
|
48
|
+
subject&.class&.name
|
|
49
|
+
end
|
|
50
|
+
rescue
|
|
51
|
+
nil
|
|
52
|
+
end
|
|
53
|
+
prefix = actor_name ? "[#{actor_name}] " : ""
|
|
54
|
+
"#{prefix}Exception occured: #{ex.class}: #{ex.message}"
|
|
55
|
+
end
|
|
56
|
+
end
|
|
31
57
|
|
|
32
58
|
Celluloid.logger = nil
|
|
33
|
-
Celluloid.exception_handler { |ex| EventHub.logger.error
|
|
59
|
+
Celluloid.exception_handler { |ex| EventHub.logger.error(EventHub.format_celluloid_exception(ex)) }
|
data/lib/eventhub/docs_renderer.rb
CHANGED
|
@@ -162,7 +162,9 @@ module EventHub
|
|
|
162
162
|
def config_to_html_table(hash, depth = 0, prefix = "")
|
|
163
163
|
rows = hash.map do |key, value|
|
|
164
164
|
full_key = prefix.empty? ? key.to_s : "#{prefix}.#{key}"
|
|
165
|
-
if
|
|
165
|
+
if sensitive_key?(key) && !value.nil? && !(value.respond_to?(:empty?) && value.empty?)
|
|
166
|
+
"<tr><td class=\"config-key\">#{ERB::Util.html_escape(full_key)}</td><td>#{redacted_html}</td></tr>"
|
|
167
|
+
elsif depth == 0 && value.is_a?(Hash) && !value.empty?
|
|
166
168
|
"<tr class=\"is-section is-section-top\"><td colspan=\"2\"><strong>#{ERB::Util.html_escape(full_key)}</strong></td></tr>\n" \
|
|
167
169
|
"#{config_to_html_table(value, 1, full_key)}"
|
|
168
170
|
elsif value.is_a?(Hash) && value.empty?
|
|
@@ -178,9 +180,7 @@ module EventHub
|
|
|
178
180
|
elsif value.is_a?(Array)
|
|
179
181
|
format_array_rows(full_key, key, value, depth)
|
|
180
182
|
else
|
|
181
|
-
display_value = if
|
|
182
|
-
"<span class=\"redacted\">***</span>"
|
|
183
|
-
elsif value.nil? || value.to_s.strip.empty?
|
|
183
|
+
display_value = if value.nil? || value.to_s.strip.empty?
|
|
184
184
|
"<span class=\"not-set\">(not set)</span>"
|
|
185
185
|
else
|
|
186
186
|
ERB::Util.html_escape(value.to_s)
|
|
@@ -200,7 +200,7 @@ module EventHub
|
|
|
200
200
|
return "<tr><td class=\"config-key\">#{ERB::Util.html_escape(full_key)}</td><td><span class=\"not-set\">(empty)</span></td></tr>" if array.empty?
|
|
201
201
|
|
|
202
202
|
if sensitive_key?(key)
|
|
203
|
-
return "<tr><td class=\"config-key\">#{ERB::Util.html_escape(full_key)}</td><td
|
|
203
|
+
return "<tr><td class=\"config-key\">#{ERB::Util.html_escape(full_key)}</td><td>#{redacted_html}</td></tr>"
|
|
204
204
|
end
|
|
205
205
|
|
|
206
206
|
inner = array.map { |item| format_array_item(item) }.join("\n")
|
|
@@ -210,7 +210,7 @@ module EventHub
|
|
|
210
210
|
def format_array_item(item)
|
|
211
211
|
if item.is_a?(Hash)
|
|
212
212
|
rows = item.map do |k, v|
|
|
213
|
-
value = format_nested_value(v)
|
|
213
|
+
value = sensitive_key?(k) ? redacted_html : format_nested_value(v)
|
|
214
214
|
"<tr><td>#{ERB::Util.html_escape(k)}</td><td>#{value}</td></tr>"
|
|
215
215
|
end.join
|
|
216
216
|
"<li><table class=\"table is-bordered is-narrow config-subtable\">#{rows}</table></li>"
|
|
@@ -225,7 +225,8 @@ module EventHub
|
|
|
225
225
|
def format_nested_value(value)
|
|
226
226
|
if value.is_a?(Hash)
|
|
227
227
|
rows = value.map do |k, v|
|
|
228
|
-
|
|
228
|
+
inner = sensitive_key?(k) ? redacted_html : format_nested_value(v)
|
|
229
|
+
"<tr><td>#{ERB::Util.html_escape(k)}</td><td>#{inner}</td></tr>"
|
|
229
230
|
end.join
|
|
230
231
|
"<table class=\"table is-bordered is-narrow config-subtable\">#{rows}</table>"
|
|
231
232
|
elsif value.is_a?(Array)
|
|
@@ -238,6 +239,10 @@ module EventHub
|
|
|
238
239
|
end
|
|
239
240
|
end
|
|
240
241
|
|
|
242
|
+
def redacted_html
|
|
243
|
+
"<span class=\"redacted\">***</span>"
|
|
244
|
+
end
|
|
245
|
+
|
|
241
246
|
def compact_hash?(hash)
|
|
242
247
|
hash.values.all? do |v|
|
|
243
248
|
if v.is_a?(Hash)
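Note on the hunks above: redaction now happens at every nesting level, and a sensitive key with a non-empty value is masked before its value is inspected. A simplified sketch of the idea, working on plain data rather than HTML. The `SENSITIVE` pattern is an assumption; the gem's actual `sensitive_key?` is not shown in this diff.

```ruby
# Hypothetical pattern; the real sensitive_key? implementation is not
# part of this diff.
SENSITIVE = /password|secret|token/i

def sensitive_key?(key)
  key.to_s.match?(SENSITIVE)
end

# Recursively mask values stored under sensitive keys. Nil and empty
# values are left alone so "not set" stays distinguishable from "hidden".
def redact(value, key = nil)
  if key && sensitive_key?(key) && !value.nil? && !(value.respond_to?(:empty?) && value.empty?)
    return "***"
  end

  case value
  when Hash then value.to_h { |k, v| [k, redact(v, k)] }
  when Array then value.map { |item| redact(item) }
  else value
  end
end

config = {
  server: {user: "guest", password: "s3cret", tls: {token: {a: 1}}},
  hosts: [{name: "a", api_token: "x"}],
  password: ""
}
p redact(config)
```

Masking a sensitive key before descending into it means a whole nested hash under such a key is hidden, not just its leaves.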
|
data/lib/eventhub/helper.rb
CHANGED
|
@@ -18,6 +18,11 @@ module EventHub
|
|
|
18
18
|
end
|
|
19
19
|
|
|
20
20
|
def create_bunny_connection
|
|
21
|
+
connection_string, connection_properties = bunny_connection_options
|
|
22
|
+
Bunny.new(connection_string, connection_properties)
|
|
23
|
+
end
|
|
24
|
+
|
|
25
|
+
def bunny_connection_options
|
|
21
26
|
server = EventHub::Configuration.server
|
|
22
27
|
|
|
23
28
|
protocol = "amqp"
|
|
@@ -31,8 +36,21 @@ module EventHub
|
|
|
31
36
|
connection_properties[:logger] = Logger.new(File::NULL)
|
|
32
37
|
end
|
|
33
38
|
|
|
34
|
-
#
|
|
35
|
-
|
|
39
|
+
# Bunny's network recovery: re-establish connection, re-open channels,
|
|
40
|
+
# re-declare queues, and re-register consumers transparently after a
|
|
41
|
+
# broker disconnect (heartbeat miss, broker restart, LB drop).
|
|
42
|
+
connection_properties[:automatically_recover] = true
|
|
43
|
+
connection_properties[:network_recovery_interval] = 5
|
|
44
|
+
connection_properties[:recovery_attempts] = nil
|
|
45
|
+
connection_properties[:continuation_timeout] = 15_000
|
|
46
|
+
|
|
47
|
+
# Belt-and-suspenders: if recovery_attempts is ever capped, escalate to
|
|
48
|
+
# a Celluloid actor restart instead of going silent.
|
|
49
|
+
connection_properties[:recovery_attempts_exhausted] = lambda do
|
|
50
|
+
EventHub.logger.error("Bunny recovery attempts exhausted - actor will restart")
|
|
51
|
+
actor = Celluloid::Actor[:actor_listener_amqp]
|
|
52
|
+
actor&.async&.restart
|
|
53
|
+
end
|
|
36
54
|
|
|
37
55
|
# do we do tls?
|
|
38
56
|
if server[:tls]
|
|
@@ -45,8 +63,7 @@ module EventHub
|
|
|
45
63
|
end
|
|
46
64
|
|
|
47
65
|
connection_string = "#{protocol}://#{server[:host]}:#{server[:port]}"
|
|
48
|
-
|
|
49
|
-
Bunny.new(connection_string, connection_properties)
|
|
66
|
+
[connection_string, connection_properties]
|
|
50
67
|
end
|
|
51
68
|
|
|
52
69
|
# Formats stamp into UTC format
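Note on the hunks above: splitting `create_bunny_connection` into an options builder plus a thin constructor makes the options inspectable without opening a connection. A rough sketch of that shape, assuming a hypothetical `connection_options` helper and a plain hash in place of `EventHub::Configuration.server`; TLS handling and the exhaustion callback are omitted.

```ruby
# Hypothetical helper: returns the pieces Bunny.new would receive,
# without touching the network. Values mirror those added in the diff.
def connection_options(server)
  properties = {
    automatically_recover: true,
    network_recovery_interval: 5,
    recovery_attempts: nil, # left uncapped, as in the diff
    continuation_timeout: 15_000
  }
  ["amqp://#{server[:host]}:#{server[:port]}", properties]
end

url, properties = connection_options(host: "localhost", port: 5672)
puts url
puts properties[:network_recovery_interval]
```

A caller that needs a live connection would pass both values on to `Bunny.new`, as the new `create_bunny_connection` does.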
|
data/lib/eventhub/patches/celluloid_logger.rb
ADDED
|
@@ -0,0 +1,51 @@
|
|
|
1
|
+
# EventHub patches for upstream gems.
|
|
2
|
+
#
|
|
3
|
+
# Celluloid 0.18 (the last released version, ~2016) is incompatible with
|
|
4
|
+
# Ruby 3.x frozen-string-literal defaults. Its `Internals::Logger.crash`
|
|
5
|
+
# mutates string literals like:
|
|
6
|
+
#
|
|
7
|
+
# def crash(string, exception)
|
|
8
|
+
# if Celluloid.log_actor_crashes
|
|
9
|
+
# string << "\n" << format_exception(exception) # FrozenError under Ruby 3.x
|
|
10
|
+
# error string
|
|
11
|
+
# end
|
|
12
|
+
# @exception_handlers.each { |h| h.call(exception) }
|
|
13
|
+
# end
|
|
14
|
+
#
|
|
15
|
+
# The `string << ...` raises FrozenError BEFORE the registered exception
|
|
16
|
+
# handlers fire. The actor thread then dies silently:
|
|
17
|
+
# * no exit event is sent to the supervisor (no restart),
|
|
18
|
+
# * no exit event is sent to linked sub-actors (they stay alive as zombies),
|
|
19
|
+
# * no error is logged anywhere.
|
|
20
|
+
#
|
|
21
|
+
# Externally the symptom is: an actor whose method raises (e.g. our
|
|
22
|
+
# `ActorListenerAmqp#restart` raising "Listener amqp is restarting...") appears
|
|
23
|
+
# to be entering the raise but never actually dies, and the listener never
|
|
24
|
+
# gets restarted. We hit this in 1.28.0 testing: SIGHUP looked like it
|
|
25
|
+
# worked (Configuration reloaded, async.restart enqueued, restart entered)
|
|
26
|
+
# but the listener silently became a zombie.
|
|
27
|
+
#
|
|
28
|
+
# Upstream fix is unlikely - Celluloid is unmaintained. We prepend a corrected
|
|
29
|
+
# `crash` that defrosts the input string before mutating it. Behavior is
|
|
30
|
+
# otherwise identical to the original.
|
|
31
|
+
module EventHub
|
|
32
|
+
module Patches
|
|
33
|
+
module CelluloidLoggerCrash
|
|
34
|
+
def crash(string, exception)
|
|
35
|
+
message = +String(string)
|
|
36
|
+
if Celluloid.log_actor_crashes
|
|
37
|
+
message << "\n" << format_exception(exception)
|
|
38
|
+
error message
|
|
39
|
+
end
|
|
40
|
+
|
|
41
|
+
@exception_handlers.each do |handler|
|
|
42
|
+
handler.call(exception)
|
|
43
|
+
rescue => ex
|
|
44
|
+
error(+"EXCEPTION HANDLER CRASHED:\n" << format_exception(ex))
|
|
45
|
+
end
|
|
46
|
+
end
|
|
47
|
+
end
|
|
48
|
+
end
|
|
49
|
+
end
|
|
50
|
+
|
|
51
|
+
Celluloid::Internals::Logger.singleton_class.prepend(EventHub::Patches::CelluloidLoggerCrash)
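Note on the patch above: the failure mode and the fix can be reproduced without Celluloid. The sketch below uses a hypothetical `LegacyLogger` module in place of `Celluloid::Internals::Logger`, and, unlike the gem's patch, delegates to the original method via `super` instead of reimplementing it.

```ruby
# Hypothetical stand-in for a logger that mutates its argument in place.
module LegacyLogger
  def self.crash(string, detail)
    string << "\n" << detail # raises FrozenError when string is frozen
  end
end

frozen = "Actor crashed!".freeze

begin
  LegacyLogger.crash(frozen, "RuntimeError: boom")
rescue FrozenError => e
  puts "before patch: #{e.class}"
end

# Prepending to the singleton class puts the patch in front of the
# module-level method; unary plus returns an unfrozen copy when needed.
module CrashPatch
  def crash(string, detail)
    super(+String(string), detail)
  end
end

LegacyLogger.singleton_class.prepend(CrashPatch)

puts LegacyLogger.crash(frozen, "RuntimeError: boom")
```

`+string` returns the receiver itself when it is already unfrozen, so callers passing a mutable string still see it mutated, as they did before the patch.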
|
data/lib/eventhub/processor2.rb
CHANGED
|
@@ -61,6 +61,11 @@ module EventHub
|
|
|
61
61
|
# pass message as string like: '{ "header": ... , "body": { .. }}'
|
|
62
62
|
# and optionally exchange_name: 'your exchange name'
|
|
63
63
|
def publish(args = {})
|
|
64
|
+
# capture caller-thread thread-local before the cross-actor hop;
|
|
65
|
+
# CorrelationId.current is nil inside the publisher actor's thread
|
|
66
|
+
if CorrelationId.current && !args[:correlation_id]
|
|
67
|
+
args = args.merge(correlation_id: CorrelationId.current)
|
|
68
|
+
end
|
|
64
69
|
Celluloid::Actor[:actor_listener_amqp].publish(args)
|
|
65
70
|
rescue => error
|
|
66
71
|
EventHub.logger.error("Unexpected exeption while publish: #{error}")
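Note on the hunk above: a value stored in `Thread.current` is not visible from another thread, so it has to be copied into the arguments before the call crosses over. A small sketch, assuming a hypothetical `CorrelationId` module backed by `Thread.current` and a plain `Thread` standing in for the publisher actor.

```ruby
# Hypothetical stand-in for the gem's CorrelationId (assumption: it is
# backed by a thread-local, as the diff comment suggests).
module CorrelationId
  def self.current
    Thread.current[:correlation_id]
  end

  def self.current=(value)
    Thread.current[:correlation_id] = value
  end
end

def publish_via_worker(args = {})
  # Capture on the caller's thread; an explicit id in args wins.
  if CorrelationId.current && !args[:correlation_id]
    args = args.merge(correlation_id: CorrelationId.current)
  end

  # The worker thread plays the role of the publisher actor.
  Thread.new do
    {seen_in_worker: CorrelationId.current, args: args}
  end.value
end

CorrelationId.current = "abc-123"
result = publish_via_worker(message: "{}")
p result[:seen_in_worker]
p result[:args][:correlation_id]
```

Using `merge` rather than mutating `args` leaves the caller's hash untouched.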
|
data/lib/eventhub/version.rb
CHANGED
data/soak/README.md
ADDED
|
@@ -0,0 +1,72 @@
|
|
|
1
|
+
## Soak / reliability harness
|
|
2
|
+
|
|
3
|
+
This folder contains a multi-process chaos test for the `eventhub-processor2`
|
|
4
|
+
gem. Four programs cooperate so that any reliability gap (lost messages,
|
|
5
|
+
zombie consumers, stuck publisher, ...) shows up as an orphan file in
|
|
6
|
+
`soak/data/`.
|
|
7
|
+
|
|
8
|
+
* `publisher.rb` - creates a unique file and a message, publishes to `example.outbound`
|
|
9
|
+
* `router.rb` - listens on `example.outbound`, re-publishes to `example.inbound`
|
|
10
|
+
* `receiver.rb` - listens on `example.inbound`, deletes the file with the given id
|
|
11
|
+
* `crasher.rb` - randomly restarts RabbitMQ or sends SIGHUP to router/receiver
|
|
12
|
+
|
|
13
|
+
```
|
|
14
|
+
publisher => [example.outbound] => router => [example.inbound] => receiver
|
|
15
|
+
```
|
|
16
|
+
|
|
17
|
+
### Goal
|
|
18
|
+
|
|
19
|
+
No matter what the crasher does, after a graceful shutdown and a drain
|
|
20
|
+
period `soak/data/` should contain only `store.json` (and that should
|
|
21
|
+
be `{}`).
|
|
22
|
+
|
|
23
|
+
### Quick start (via Makefile)
|
|
24
|
+
|
|
25
|
+
The project root has a `Makefile` that wraps everything:
|
|
26
|
+
|
|
27
|
+
```bash
|
|
28
|
+
make soak # 10-minute default soak, prints PASS/FAIL
|
|
29
|
+
SOAK_MINUTES=30 make soak # longer
|
|
30
|
+
make soak-start # start all four manually
|
|
31
|
+
make soak-stop # SIGINT them all
|
|
32
|
+
make soak-clean # stop + wipe data/
|
|
33
|
+
make help # list targets and env knobs
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
The `soak` target runs the chaos loop, then `SIGKILL`s the publisher
|
|
37
|
+
(skipping its cleanup) so any in-flight file stays on disk as an honest
|
|
38
|
+
snapshot. After draining, it counts files in `data/` excluding `store.json`
|
|
39
|
+
and exits 0 if empty, 1 with the first 5 orphan ids if not.
|
|
40
|
+
|
|
41
|
+
### Manual start
|
|
42
|
+
|
|
43
|
+
Make sure the RabbitMQ container is running (see [docker readme](../docker/README.md)):
|
|
44
|
+
|
|
45
|
+
```bash
|
|
46
|
+
cd soak
|
|
47
|
+
bundle exec ruby receiver.rb # in its own terminal
|
|
48
|
+
bundle exec ruby router.rb # in its own terminal
|
|
49
|
+
bundle exec ruby publisher.rb # in its own terminal
|
|
50
|
+
bundle exec ruby crasher.rb # optional, in its own terminal
|
|
51
|
+
```
|
|
52
|
+
|
|
53
|
+
Graceful shutdown order: stop `publisher.rb` first, let `router`/`receiver`
|
|
54
|
+
drain the queues, then stop the rest. Check `soak/data/` afterwards.
|
|
55
|
+
|
|
56
|
+
### Publisher knobs (env overridable)
|
|
57
|
+
|
|
58
|
+
* `PAUSE_BETWEEN_WORK=F` - seconds between publishes (default `0.05` ~ 20 msg/s)
|
|
59
|
+
* `PUBLISH_MAX_ATTEMPTS=N` - publish retry attempts on transient Bunny errors (default `8`)
|
|
60
|
+
* `PUBLISH_RETRY_DELAY_S=F` - seconds between publish retries (default `1`)
|
|
61
|
+
|
|
62
|
+
The publisher uses `wait_for_confirms` (synchronous publisher confirms) and
|
|
63
|
+
retries `Bunny::NetworkFailure` / `Bunny::ChannelAlreadyClosed` / `Timeout::Error`
|
|
64
|
+
to bridge Bunny's channel-recovery window after a broker restart.
|
|
65
|
+
|
|
66
|
+
### Notes
|
|
67
|
+
|
|
68
|
+
* The publisher's `TransactionStore.cleanup` runs on graceful shutdown and
|
|
69
|
+
deletes any pending file in `store.json`. That's why the `make soak`
|
|
70
|
+
target uses `SIGKILL` for the publisher at the end - you want to see what
|
|
71
|
+
was actually in flight at the moment the chaos phase ended.
|
|
72
|
+
* Watch for huge log files in `logs/ruby/` during long runs.
|
data/soak/check_orphans.rb
ADDED
|
@@ -0,0 +1,37 @@
|
|
|
1
|
+
#!/usr/bin/env ruby
|
|
2
|
+
# Distinguish real orphans from in-flight-at-SIGKILL artifacts.
|
|
3
|
+
#
|
|
4
|
+
# After the publisher is SIGKILLed, three states can exist in data/:
|
|
5
|
+
# * file + UUID in store.json - publisher was mid-transaction; expected residual.
|
|
6
|
+
# A real restart would cleanup via TransactionStore.
|
|
7
|
+
# * file + UUID NOT in store - publisher confirmed delivery but receiver never
|
|
8
|
+
# deleted the file. This is a real pipeline loss.
|
|
9
|
+
# * no file + UUID in store - delivery completed despite SIGKILL; harmless.
|
|
10
|
+
#
|
|
11
|
+
# We treat only the middle case as a soak FAILURE.
|
|
12
|
+
#
|
|
13
|
+
# Usage: soak/check_orphans.rb <data_dir>
|
|
14
|
+
# Output (single line): "<real_orphan_count> <in_flight_count>"
|
|
15
|
+
|
|
16
|
+
require "json"
|
|
17
|
+
|
|
18
|
+
data_dir = ARGV[0] || "soak/data"
|
|
19
|
+
store_path = File.join(data_dir, "store.json")
|
|
20
|
+
in_flight = begin
|
|
21
|
+
JSON.parse(File.read(store_path))
|
|
22
|
+
rescue
|
|
23
|
+
{}
|
|
24
|
+
end.keys
|
|
25
|
+
|
|
26
|
+
files = Dir.glob(File.join(data_dir, "*.json"))
|
|
27
|
+
.map { |f| File.basename(f, ".json") }
|
|
28
|
+
.reject { |id| id == "store" }
|
|
29
|
+
|
|
30
|
+
real_orphans = files - in_flight
|
|
31
|
+
|
|
32
|
+
puts "#{real_orphans.size} #{in_flight.size}"
|
|
33
|
+
|
|
34
|
+
if ARGV.include?("--list-orphans") && !real_orphans.empty?
|
|
35
|
+
warn "first 5 real orphan ids:"
|
|
36
|
+
real_orphans.first(5).each { |id| warn " #{id}" }
|
|
37
|
+
end
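Note on the script above: the three states from its header comment map onto array set operations. The sketch below uses a hypothetical `classify` helper and made-up ids; unlike the script, which reports every id in `store.json` as in flight, it separates ids that still have a file from ids that do not.

```ruby
# files:     ids that still have a data file on disk
# store_ids: ids still recorded in store.json
def classify(files, store_ids)
  {
    real_orphans: files - store_ids, # confirmed by publisher, never deleted by receiver
    in_flight: files & store_ids,    # publisher was mid-transaction when killed
    delivered: store_ids - files     # receiver finished before the store was updated
  }
end

result = classify(%w[a b c], %w[b d])
p result
```

Only `real_orphans` counts as a soak failure under the rule stated in the script's header.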
|
|
@@ -9,10 +9,16 @@ require_relative "../lib/eventhub/sleeper"
|
|
|
9
9
|
SIGNALS_FOR_TERMINATION = [:INT, :TERM, :QUIT]
|
|
10
10
|
SIGNALS_FOR_RELOAD_CONFIG = [:HUP]
|
|
11
11
|
ALL_SIGNALS = SIGNALS_FOR_TERMINATION + SIGNALS_FOR_RELOAD_CONFIG
|
|
12
|
-
|
|
12
|
+
|
|
13
|
+
# Tunables (override via env). Defaults are sized to bridge a typical
|
|
14
|
+
# Bunny channel-recovery window (~network_recovery_interval = 5s) plus
|
|
15
|
+
# reconnect, so a broker restart doesn't leak in-flight publishes.
|
|
16
|
+
PAUSE_BETWEEN_WORK = Float(ENV.fetch("PAUSE_BETWEEN_WORK", "0.05"))
|
|
17
|
+
PUBLISH_MAX_ATTEMPTS = Integer(ENV.fetch("PUBLISH_MAX_ATTEMPTS", "8"))
|
|
18
|
+
PUBLISH_RETRY_DELAY_S = Float(ENV.fetch("PUBLISH_RETRY_DELAY_S", "1"))
|
|
13
19
|
|
|
14
20
|
Celluloid.logger = nil
|
|
15
|
-
Celluloid.exception_handler { |ex| Publisher.logger.error "Exception occured: #{ex}}" }
|
|
21
|
+
Celluloid.exception_handler { |ex| Publisher.logger.error "Exception occured: #{ex.class}: #{ex.message}" }
|
|
16
22
|
Celluloid.boot
|
|
17
23
|
|
|
18
24
|
# Publisher module
|
|
@@ -110,14 +116,24 @@ module Publisher
|
|
|
110
116
|
sleep PAUSE_BETWEEN_WORK
|
|
111
117
|
end
|
|
112
118
|
ensure
|
|
113
|
-
|
|
119
|
+
begin
|
|
120
|
+
@connection&.close
|
|
121
|
+
rescue => ex
|
|
122
|
+
Publisher.logger.warn("Worker connection close: ignoring #{ex.class}: #{ex.message}")
|
|
123
|
+
end
|
|
114
124
|
end
|
|
115
125
|
|
|
116
126
|
private
|
|
117
127
|
|
|
118
128
|
def connect
|
|
119
129
|
@connection = Bunny.new(vhost: "event_hub",
|
|
120
|
-
|
|
130
|
+
# match the gem's recovery defaults so Bunny re-opens the channel
|
|
131
|
+
# and re-declares the exchange after a broker restart, without
|
|
132
|
+
# crashing the Worker actor.
|
|
133
|
+
automatically_recover: true,
|
|
134
|
+
network_recovery_interval: 5,
|
|
135
|
+
recovery_attempts: nil,
|
|
136
|
+
continuation_timeout: 15_000,
|
|
121
137
|
logger: Logger.new(File::NULL))
|
|
122
138
|
@connection.start
|
|
123
139
|
@channel = @connection.create_channel
|
|
@@ -131,14 +147,35 @@ module Publisher
|
|
|
131
147
|
file_name = "data/#{id}.json"
|
|
132
148
|
data = {body: {id: id}}.to_json
|
|
133
149
|
|
|
134
|
-
# start transaction
|
|
150
|
+
# start transaction (durable on disk via store.json)
|
|
135
151
|
Celluloid::Actor[:transaction_store].start(id)
|
|
136
152
|
File.write(file_name, data)
|
|
137
153
|
Publisher.logger.info("[#{id}] - Message/File created")
|
|
138
154
|
|
|
139
|
-
|
|
155
|
+
# Bridge Bunny's channel-recovery window: when the broker is bouncing,
|
|
156
|
+
# the channel needs ~network_recovery_interval seconds to reopen and
|
|
157
|
+
# publishes during that window raise ChannelAlreadyClosed. Retry a few
|
|
158
|
+
# times with backoff so the message survives a broker hiccup. After
|
|
159
|
+
# PUBLISH_MAX_ATTEMPTS, leave it pending on disk for an external sweep.
|
|
160
|
+
attempts = 0
|
|
161
|
+
begin
|
|
162
|
+
attempts += 1
|
|
163
|
+
@exchange.publish(data, persistent: true)
|
|
164
|
+
unless @channel.wait_for_confirms
|
|
165
|
+
Publisher.logger.warn("[#{id}] - broker nacked; leaving pending")
|
|
166
|
+
return
|
|
167
|
+
end
|
|
168
|
+
rescue Bunny::NetworkFailure, Bunny::ChannelAlreadyClosed, Timeout::Error => ex
|
|
169
|
+
if attempts < PUBLISH_MAX_ATTEMPTS
|
|
170
|
+
sleep PUBLISH_RETRY_DELAY_S
|
|
171
|
+
retry
|
|
172
|
+
end
|
|
173
|
+
Publisher.logger.warn("[#{id}] - publish failed after #{attempts} attempts (#{ex.class}: #{ex.message}); leaving pending")
|
|
174
|
+
return
|
|
175
|
+
end
|
|
176
|
+
|
|
140
177
|
Celluloid::Actor[:transaction_store]&.stop(id)
|
|
141
|
-
Publisher
|
|
178
|
+
Publisher.logger.info("[#{id}] - Message sent")
|
|
142
179
|
end
|
|
143
180
|
end
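Note on the retry hunk above: the `begin` / `rescue` / `retry` pattern with a bounded attempt counter can be isolated as below. `TransientError` and `with_retries` are hypothetical names for this sketch; the soak publisher rescues Bunny and timeout errors inline instead.

```ruby
# Hypothetical error class standing in for transient broker errors.
class TransientError < StandardError; end

# Run the block up to max_attempts times, sleeping a fixed delay between
# tries. Returns the block's value, or nil once attempts are exhausted.
def with_retries(max_attempts:, delay: 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue TransientError
    if attempts < max_attempts
      sleep delay
      retry
    end
    nil
  end
end

outcome = with_retries(max_attempts: 5) do |n|
  raise TransientError, "channel closed" if n < 3
  "sent on attempt #{n}"
end
puts outcome
```

The counter lives outside the `begin` block, so `retry` re-enters the block without resetting it. The delay here is fixed, like `PUBLISH_RETRY_DELAY_S` in the diff.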
|
|
144
181
|
|
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: eventhub-processor2
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 1.
|
|
4
|
+
version: 1.28.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Steiner, Thomas
|
|
@@ -179,6 +179,7 @@ files:
|
|
|
179
179
|
- CHANGELOG.md
|
|
180
180
|
- Gemfile
|
|
181
181
|
- LICENSE.txt
|
|
182
|
+
- Makefile
|
|
182
183
|
- README.md
|
|
183
184
|
- Rakefile
|
|
184
185
|
- bin/console
|
|
@@ -190,16 +191,9 @@ files:
|
|
|
190
191
|
- docker/rabbitmq.config
|
|
191
192
|
- docker/reset
|
|
192
193
|
- eventhub-processor2.gemspec
|
|
193
|
-
- example/CHANGELOG.md
|
|
194
194
|
- example/README.md
|
|
195
195
|
- example/config/example.json
|
|
196
|
-
- example/config/receiver.json
|
|
197
|
-
- example/config/router.json
|
|
198
|
-
- example/crasher.rb
|
|
199
196
|
- example/example.rb
|
|
200
|
-
- example/publisher.rb
|
|
201
|
-
- example/receiver.rb
|
|
202
|
-
- example/router.rb
|
|
203
197
|
- lib/eventhub/actor_heartbeat.rb
|
|
204
198
|
- lib/eventhub/actor_listener_amqp.rb
|
|
205
199
|
- lib/eventhub/actor_listener_http.rb
|
|
@@ -220,11 +214,21 @@ files:
|
|
|
220
214
|
- lib/eventhub/helper.rb
|
|
221
215
|
- lib/eventhub/logger.rb
|
|
222
216
|
- lib/eventhub/message.rb
|
|
217
|
+
- lib/eventhub/patches/celluloid_logger.rb
|
|
223
218
|
- lib/eventhub/processor2.rb
|
|
224
219
|
- lib/eventhub/sleeper.rb
|
|
225
220
|
- lib/eventhub/statistics.rb
|
|
226
221
|
- lib/eventhub/templates/layout.erb
|
|
227
222
|
- lib/eventhub/version.rb
|
|
223
|
+
- soak/CHANGELOG.md
|
|
224
|
+
- soak/README.md
|
|
225
|
+
- soak/check_orphans.rb
|
|
226
|
+
- soak/config/receiver.json
|
|
227
|
+
- soak/config/router.json
|
|
228
|
+
- soak/crasher.rb
|
|
229
|
+
- soak/publisher.rb
|
|
230
|
+
- soak/receiver.rb
|
|
231
|
+
- soak/router.rb
|
|
228
232
|
homepage: https://github.com/thomis/eventhub-processor2
|
|
229
233
|
licenses:
|
|
230
234
|
- MIT
|
|
@@ -243,7 +247,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
|
243
247
|
- !ruby/object:Gem::Version
|
|
244
248
|
version: '0'
|
|
245
249
|
requirements: []
|
|
246
|
-
rubygems_version: 4.0.
|
|
250
|
+
rubygems_version: 4.0.10
|
|
247
251
|
specification_version: 4
|
|
248
252
|
summary: Next generation gem to build ruby based eventhub processor
|
|
249
253
|
test_files: []
|
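/data/{example → soak}/CHANGELOG.md
RENAMED
|
File without changes
|
/data/{example → soak}/config/receiver.json
RENAMED
|
File without changes
|
/data/{example → soak}/config/router.json
RENAMED
|
File without changes
|
/data/{example → soak}/crasher.rb
RENAMED
|
File without changes
|
/data/{example → soak}/receiver.rb
RENAMED
|
File without changes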
|
/data/{example → soak}/router.rb
RENAMED
|
File without changes
|