breaker_machines 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/README.md +77 -1826
- data/lib/breaker_machines/async_support.rb +103 -0
- data/lib/breaker_machines/circuit/callbacks.rb +66 -58
- data/lib/breaker_machines/circuit/configuration.rb +17 -3
- data/lib/breaker_machines/circuit/execution.rb +82 -58
- data/lib/breaker_machines/circuit.rb +1 -0
- data/lib/breaker_machines/dsl.rb +229 -10
- data/lib/breaker_machines/errors.rb +11 -0
- data/lib/breaker_machines/hedged_async_support.rb +95 -0
- data/lib/breaker_machines/hedged_execution.rb +113 -0
- data/lib/breaker_machines/registry.rb +144 -0
- data/lib/breaker_machines/storage/cache.rb +162 -0
- data/lib/breaker_machines/version.rb +1 -1
- data/lib/breaker_machines.rb +3 -1
- metadata +5 -1
data/README.md
CHANGED
````diff
@@ -1,20 +1,20 @@
 # BreakerMachines
 
+A battle-tested Ruby implementation of the Circuit Breaker pattern, built on `state_machines` for reliable distributed systems protection.
+
 ## Quick Start
 
 ```bash
-# Install
 gem 'breaker_machines'
 ```
 
 ```ruby
-# Use (Classic Mode - Works Everywhere)
 class PaymentService
   include BreakerMachines::DSL
 
   circuit :stripe do
-    threshold failures: 3, within:
-    reset_after 30
+    threshold failures: 3, within: 1.minute
+    reset_after 30.seconds
     fallback { { error: "Payment queued for later" } }
   end
 
````
````diff
@@ -26,1869 +26,120 @@ class PaymentService
 end
 ```
 
-
-# Use (Fiber Mode - Optional, requires 'async' gem)
-class AIService
-  include BreakerMachines::DSL
-
-  circuit :openai, fiber_safe: true do
-    threshold failures: 2, within: 30
-    timeout 5 # ACTUALLY SAFE! Uses Async::Task, not Thread#kill
-    fallback { { error: "AI is contemplating existence, try again" } }
-  end
-
-  def generate(prompt)
-    circuit(:openai).wrap do
-      # Non-blocking in Falcon! Your event loop thanks you
-      openai.completions(model: 'gpt-4', prompt: prompt)
-    end
-  end
-end
-```
-
-That's it. Your service is now protected from cascading failures AND ready for the async future. Read on to understand why this matters.
-
-## A Message to the Resistance
-
-So AI took your job while you were waiting for Fireship to drop the next JavaScript framework?
-
-Welcome to April 2005—when Git was born, branches were just `master`, and nobody cared about your pronouns. This is the pattern your company's distributed systems desperately need, explained in a way that won't make you fall asleep and impulse-buy developer swag just to feel something.
-
-Still reading? Good. Because in space, nobody can hear you scream about microservices. It's all just patterns and pain.
-
-### The Pattern They Don't Want You to Know
-
-Built on the battle-tested `state_machines` gem, because I don't reinvent wheels here—I stop them from catching fire and burning down your entire infrastructure.
-
-BreakerMachines comes with `fiber_safe` mode out of the box. Cooperative timeouts, non-blocking I/O, Falcon server support—because it's 2025 and I built this for modern Ruby applications using Fibers, Ractors, and async patterns.
-
-📖 **[Why I Open Sourced This](WHY_OPEN_SOURCE.md)** - The real story behind BreakerMachines, and why I decided to share it with the world.
-
-## Chapter 1: The Year is 2005 (Stardate 2005.111)
-
-The Resistance huddles in the server rooms, the last bastion against the cascade failures. Outside, the microservices burn. Redis Ship Com is down. PostgreSQL Life Support is flatlining.
-
-And somewhere in the darkness, a junior developer is about to write:
-
-```ruby
-def fetch_user_data
-  retry_count = 0
-  begin
-    @redis.get(user_id)
-  rescue => e
-    retry_count += 1
-    retry if retry_count < Float::INFINITY # "It'll work eventually"
-  end
-end
-```
-
-"This," whispers the grizzled ops engineer, "is how civilizations fall."
-
-
-*Typical day at Corporate HQ during a microservice apocalypse. Note the executives frantically googling "what is exponential backoff"*
-
-## The Hidden State Machine
-
-They built this on `state_machines` because sometimes, Resistance, you need a tank, not another JavaScript framework.
-
-```mermaid
-stateDiagram-v2
-    [*] --> closed: Birth of Hope
-    closed --> open: Too Many Failures (Reality Check)
-    open --> half_open: Time Heals (But Not Your Kubernetes Cluster)
-    half_open --> closed: Service Restored (Temporary Victory)
-    half_open --> open: Still Broken (Welcome to Production)
-
-    note right of closed: All services operational\n(Don't get comfortable)
-    note right of open: Circuit broken\n(At least it's honest)
-    note right of half_open: Testing the waters\n(Like deploying on Friday)
-```
-
-
-*Your microservices architecture after a bootcamp graduate learns about retries. The green lines? Those are your CPU cycles escaping.*
-
-## What You Think You're Doing vs Reality
-
-### You Think: "I'm implementing retry logic for resilience!"
-### Reality: You're DDOSing your own infrastructure
-
-```mermaid
-graph LR
-    A[Your Service] -->|Timeout| B[Retry]
-    B -->|Timeout| C[Retry Harder]
-    C -->|Timeout| D[Retry With Feeling]
-    D -->|Dies| E[Takes Down Redis]
-    E --> F[PostgreSQL Follows]
-    F --> G[Ractor Cores Meltdown]
-    G --> H[🔥 Everything Is Fire 🔥]
-```
-
-
-*Visual representation of your weekend disappearing because you trusted exponential backoff. Each node is another pager alert.*
-
-### The Truth the Bootcamps Won't Tell You:
-When your Redis Ship Com and PostgreSQL Life Support go offline, should your Ractor just explode and swallow the fleet?
-
-No, Resistance. That's what *they* do. We do better.
-
-## The Cost of Ignorance: Real-World Massacres
-
-### Amazon DynamoDB Meltdown (September 20, 2015)
-- **The Trigger**: A transient network blip
-- **The Storm**: Storage servers couldn't get partition assignments, started retrying
-- **The Cascade**: Metadata servers overwhelmed by retry storm
-- **The Death Spiral**: More timeouts → More retries → Complete service collapse
-- **Duration**: 4+ hours of downtime in US-East-1
-- **The Solution**: Had to literally firewall off the metadata service to add capacity
-- **Corporate Response**: "It was a learning experience" (Translation: Someone got fired)
-
-### Netflix's AWS Nightmare
-> "When service instances go down, the remaining nodes pick up the slack. Eventually, they suffer a cascading failure where all nodes go down. A third of our traffic goes into a black hole."
-— Netflix Engineering
-
-**What They Learned**: Manual responses don't scale. You need circuit breakers.
-
-### Google's Exponential Doom
-From Google SRE's own documentation:
-- 100 failed queries/second with 1000ms retry interval
-- Backend receives 10,200 QPS (only 200 QPS of actual new requests)
-- Retries grow exponentially: 100 → 200 → 300 → ∞
-- **Result**: Complete backend crash from retry storm alone
-
-This is what happens without circuit breakers. This is why you're here.
-
-## The Weapon of the Resistance
-
-```ruby
-# In 2005, we don't need your pronouns. We need patterns that work.
-class SpaceshipCommand
-  include BreakerMachines::DSL
-
-  # When Redis Ship Com inevitably fails
-  circuit :redis_ship_com do
-    threshold failures: 3, within: 60 # Three strikes, you're out
-    reset_after 30 # Give it time to think about what it's done
-
-    fallback do
-      # This is where we separate the bootcamp grads from the Resistance
-      emergency_broadcast("Redis is dead. Long live the cache.")
-    end
-
-    on_open do
-      alert_the_resistance("Redis circuit opened. Brace for impact.")
-    end
-  end
-
-  # PostgreSQL Life Support - because your data matters more than your feelings
-  circuit :postgresql_life_support do
-    threshold failures: 2, within: 30
-    # timeout 5 # Document your intent, but implement timeouts in your DB client
-
-    fallback { activate_emergency_oxygen }
-
-    on_open do
-      captain_log <<~LOG
-        Life support critical.
-        If you're reading this, tell my wife I love her.
-        Also, check the connection pool settings.
-      LOG
-    end
-  end
-end
-```
-
-## Battle-Tested Scenarios
-
-### Scenario 1: The Redis Apocalypse
-Your cache layer dies. Do you:
-- A) Hammer it with retries until your CPU melts
-- B) Let BreakerMachines handle it like an adult
-
-### Scenario 2: The Ractor Meltdown
-Your concurrent processing goes supernova. Without circuit breakers, your Ractors will consume everything in their path, like a black hole of CPU cycles and broken dreams.
-
-```ruby
-circuit :ractor_cooling do
-  # Prevent the cascade that swallows fleets
-  threshold failures: 5, within: 120
-
-  fallback do
-    # Throttle before you become a cautionary tale
-    emergency_cooling_protocol
-  end
-end
-```
-
-## Joining the Resistance
-
-In your Gemfile (yes, I still use those in 2005):
-
-```ruby
-gem 'breaker_machines'
-gem 'state_machines', '>= 0.4.0' # The engine of rebellion
-```
-
-Then:
-```bash
-$ bundle install # No NPM. No Yarn. Just Ruby and determination.
-```
-
-## Configuration: Setting Your Battle Parameters
-
-```ruby
-BreakerMachines.configure do |config|
-  config.default_reset_timeout = 60 # seconds of mourning before retry
-  config.default_failure_threshold = 5 # strikes before you're out
-  config.log_events = true # false if you prefer ignorance
-  # Note: Timeouts must be implemented in your client libraries (HTTP, DB, etc.)
-end
-```
-
-## Intelligent Threshold Configuration: The Decision Matrix
-
-### Stop Guessing, Start Knowing
-
-| Service Criticality | Failure Threshold | Suggested Timeout | Reset Time | Example Services |
-|---------------------|-------------------|-------------------|------------|------------------|
-| 🚨 **CRITICAL** | 2 failures/30s | 3s (in client) | 120s | Payment, Auth, Orders |
-| ⚠️ **HIGH** | 3 failures/60s | 5s (in client) | 60s | User API, Cart, Search |
-| ✅ **MEDIUM** | 5 failures/120s | 10s (in client) | 30s | Notifications, Analytics |
-| 💤 **LOW** | 10 failures/300s | 30s (in client) | 15s | Recommendations, Logging |
-
-**Your CTO**: "But why can't we just use the same settings for everything?"
-**Reality**: Because that's how you end up like DynamoDB in 2015.
-
-### The Smart Threshold Formula
-```
-threshold = base_threshold * (1 / criticality_score) * traffic_multiplier
-
-Where:
-- criticality_score: 1.0 (critical) to 0.1 (low priority)
-- traffic_multiplier: avg_requests_per_minute / 1000
-- base_threshold: 5 (default)
-```
-
-**Corporate Architect Translation**: "It's complex because we can bill more hours explaining it."
-
-### Real Implementation Examples
-
-```ruby
-# Critical Payment Service
-class PaymentProcessor
-  include BreakerMachines::DSL
-
-  circuit :stripe_api do
-    threshold failures: 2, within: 30
-    reset_after 120
-    # timeout 3 # Implement in Stripe client configuration
-
-    fallback do
-      # Queue for manual processing
-      PaymentQueue.add(payment_params)
-      { status: 'queued', message: 'Payment will be processed within 24 hours' }
-    end
-
-    on_open do
-      AlertService.critical("Stripe API circuit opened!")
-      Metrics.increment('payment.circuit.opened')
-    end
-
-    on_half_open do
-      Rails.logger.info "Testing Stripe API recovery..."
-    end
-  end
-
-  def charge_customer(amount, customer_id)
-    circuit(:stripe_api).wrap do
-      # Stripe SDK handles timeouts internally
-      Stripe::Charge.create(
-        amount: amount,
-        currency: 'usd',
-        customer: customer_id
-      )
-    end
-  end
-end
-
-# Medium Priority Service
-class EmailService
-  include BreakerMachines::DSL
-
-  circuit :sendgrid do
-    threshold failures: 5, within: 120
-    reset_after 30
-    # Configure timeout in SendGrid client
-
-    fallback do
-      # Store for retry later
-      EmailRetryJob.perform_later(email_params)
-      { queued: true }
-    end
-  end
-
-  def send_welcome_email(user)
-    circuit(:sendgrid).wrap do
-      SendGrid::Mail.new(
-        to: user.email,
-        subject: "Welcome to the Resistance",
-        body: "Your circuits are now protected"
-      ).deliver!
-    end
-  end
-end
-```
-
-## Advanced Warfare: Complex Circuit Patterns
+## Features
 
-
-
+- **Thread-safe** circuit breaker implementation
+- **Fiber-safe mode** for async Ruby (Falcon, async gem)
+- **Hedged requests** for latency reduction
+- **Multiple backends** with automatic failover
+- **Bulkheading** to limit concurrent requests
+- **Percentage-based thresholds** with minimum call requirements
+- **Dynamic circuit breakers** with templates for runtime creation
+- **Pluggable storage** (Memory, Redis, Custom)
+- **Rich callbacks** and instrumentation
+- **ActiveSupport::Notifications** integration
 
-
-class FleetCoordinator
-  include BreakerMachines::DSL
+## Documentation
 
-
-
+- **Getting Started Guide** (docs/GETTING_STARTED.md) - Installation and basic usage
+- **Configuration Reference** (docs/CONFIGURATION.md) - All configuration options
+- **Advanced Patterns** (docs/ADVANCED_PATTERNS.md) - Complex scenarios and patterns
+- **Persistence Options** (docs/PERSISTENCE.md) - Storage backends and distributed state
+- **Observability Guide** (docs/OBSERVABILITY.md) - Monitoring and metrics
+- **Async Mode** (docs/ASYNC.md) - Fiber-safe operations
+- **Testing Guide** (docs/TESTING.md) - Testing strategies
+  - [RSpec Testing](docs/TESTING_RSPEC.md)
+  - [ActiveSupport Testing](docs/TESTING_ACTIVESUPPORT.md)
+- **Rails Integration** (docs/RAILS_INTEGRATION.md) - Rails-specific patterns
+- **Horror Stories** (docs/HORROR_STORIES.md) - Real production failures and lessons learned
+- **API Reference** (docs/API_REFERENCE.md) - Complete API documentation
 
-
-  # When GPS fails, use the stars like your ancestors
-  celestial_navigation_mode
-  end
-end
+## Why BreakerMachines?
 
-
-    threshold failures: 5, within: 120
+Built on the battle-tested `state_machines` gem, BreakerMachines provides production-ready circuit breaker functionality without reinventing the wheel. It's designed for modern Ruby applications with first-class support for fibers, async operations, and distributed systems.
 
-
-    fallback { diplomatic_solution }
-  end
+See [Why I Open Sourced This](docs/WHY_OPEN_SOURCE.md) for the full story.
 
-
-  circuit(:navigation_system).wrap do
-    circuit(:weapons_system).wrap do
-      plot_course_and_defend
-    end
-  end
-end
-```
+## Production-Ready Features
 
-###
-
+### Hedged Requests
+Reduce latency by sending duplicate requests and using the first successful response:
 
 ```ruby
-circuit :
-
-
-
-
-  on_half_open do
-    whisper_to_logs("Testing quantum stabilizer... nobody breathe...")
-  end
-
-  on_close do
-    celebrate("Quantum stabilizer online! Reality is stable!")
+circuit :api do
+  hedged do
+    delay 100       # Start second request after 100ms
+    max_requests 3  # Maximum parallel requests
   end
 end
 ```
````
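The hedging idea the new README describes — start a staggered duplicate of a slow request and take whichever answer arrives first — can be sketched in plain Ruby, independent of the gem's internals. `hedged_call` and its thread-based racing below are illustrative assumptions, not the gem's API:

```ruby
# Illustrative sketch of hedged execution (not BreakerMachines internals):
# each backend is started with a staggered delay; the first result wins.
def hedged_call(delay_ms, *backends)
  results = Queue.new
  threads = backends.each_with_index.map do |backend, i|
    Thread.new do
      sleep(i * delay_ms / 1000.0)   # stagger: 0ms, delay_ms, 2*delay_ms, ...
      begin
        results << backend.call      # first successful push wins the race
      rescue StandardError
        nil                          # a failed attempt simply drops out
      end
    end
  end
  winner = results.pop               # blocks until some attempt succeeds
  threads.each(&:kill)               # cancel the losers (sketch only; a real
  winner                             # implementation would cancel cooperatively)
end
```

With a 100ms stagger, a slow primary gets shadowed by a backup after 100ms, which is the behavior `delay 100` / `max_requests 3` above configure declaratively.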
````diff
 
-###
-
+### Multiple Backends
+Configure automatic failover across multiple service endpoints:
 
 ```ruby
-
-
-
-
-
-
-    # Use database statement_timeout instead
-
-    fallback do |error|
-      # Failover to read replica
-      # In a real app, you'd extract id from the error context
-      # For this example, we'll use a simpler approach
-      read_from_replica(@current_user_id)
-    end
-
-    on_open do
-      # Switch all traffic to replica
-      DatabaseFailover.activate_read_replica!
-      PagerDuty.trigger("Primary DB circuit opened - failover activated")
-    end
-  end
-
-  circuit :replica_db do
-    threshold failures: 5, within: 60
-    reset_after 30
-
-    fallback do |error|
-      # Last resort: serve from cache
-      serve_stale_cache_data(@current_user_id)
-    end
-  end
-
-  def find_user(id)
-    @current_user_id = id # Store for fallback use
-    circuit(:primary_db).wrap do
-      User.find(id)
-    end
-  end
-
-  private
-
-  def read_from_replica(id)
-    circuit(:replica_db).wrap do
-      User.read_replica.find(id)
-    end
-  end
-
-  def serve_stale_cache_data(id)
-    Rails.cache.fetch("user:#{id}", expires_in: 1.hour) do
-      { error: "Service temporarily unavailable", cached: true }
-    end
-  end
+circuit :multi_region do
+  backends [
+    -> { fetch_from_primary },
+    -> { fetch_from_secondary },
+    -> { fetch_from_tertiary }
+  ]
 end
 ```
````
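Conceptually, backend failover is "try each callable in order until one succeeds". A minimal plain-Ruby sketch of that contract — `with_failover` is a hypothetical helper for illustration, not the gem's implementation:

```ruby
# Sequential failover sketch: return the first backend that succeeds,
# re-raise the last error only if every backend fails.
def with_failover(backends)
  last_error = nil
  backends.each do |backend|
    begin
      return backend.call
    rescue StandardError => e
      last_error = e   # remember the failure, fall through to the next backend
    end
  end
  raise last_error
end
```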
````diff
 
-###
-
+### Percentage-Based Thresholds
+Open circuits based on error rates instead of absolute counts:
 
 ```ruby
-
-
-
-circuit :third_party_api do
-  threshold failures: 4, within: 60
-  reset_after 60
-
-  fallback do |error|
-    case error
-    when Faraday::TimeoutError
-      { error: "Service slow, please retry later" }
-    when Faraday::ConnectionFailed
-      { error: "Service unreachable" }
-    when Faraday::ResourceNotFound
-      { error: "Resource not found", status: 404 }
-    else
-      { error: "Service temporarily unavailable" }
-    end
-  end
-
-  # Track everything
-  on_open { Metrics.increment('external_api.circuit_opened') }
-  on_close { Metrics.increment('external_api.circuit_closed') }
-  on_reject { Metrics.increment('external_api.circuit_rejected') }
-end
-
-def connection
-  @connection ||= Faraday.new(url: BASE_URL) do |faraday|
-    faraday.request :json
-    faraday.response :json
-    faraday.response :raise_error # Raise on 4xx/5xx
-    faraday.adapter Faraday.default_adapter
-  end
-end
-
-def fetch_data(endpoint)
-  circuit(:third_party_api).wrap do
-    response = connection.get(endpoint) do |req|
-      req.headers['Authorization'] = "Bearer #{token}"
-      req.options.timeout = 10
-      req.options.open_timeout = 5
-    end
-
-    response.body
-  end
-end
-
-def post_data(endpoint, payload)
-  circuit(:third_party_api).wrap do
-    response = connection.post(endpoint) do |req|
-      req.headers['Authorization'] = "Bearer #{token}"
-      req.body = payload
-      req.options.timeout = 10
-    end
-
-    response.body
-  end
-end
+circuit :high_traffic do
+  threshold failure_rate: 0.5, minimum_calls: 10, within: 60
 end
 ```
````
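The difference from a count-based threshold is worth spelling out: with `failure_rate: 0.5, minimum_calls: 10`, the circuit stays closed until at least ten calls have been observed in the window, and only then compares the error fraction against the rate. A sketch of that decision (`should_open?` is a hypothetical helper, not the gem's code):

```ruby
# Rate-based trip decision: requires both enough samples and a high
# enough failure fraction before the circuit may open.
def should_open?(failures, total, failure_rate:, minimum_calls:)
  return false if total < minimum_calls   # too few calls to judge the rate
  failures.fdiv(total) >= failure_rate    # e.g. 6 of 10 = 0.6 >= 0.5 → open
end
```

The `minimum_calls` guard is what keeps a single failed call at startup (1/1 = 100% error rate) from tripping the breaker.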
````diff
 
-###
-
+### Dynamic Circuit Breakers
+Create circuit breakers at runtime for webhook delivery, API proxies, or per-tenant isolation:
 
 ```ruby
-class
-  include BreakerMachines::DSL
-
-  # Configure job retries to work with circuit breakers
-  retry_on StandardError, wait: :exponentially_longer, attempts: 3
-
-  circuit :s3_upload do
-    threshold failures: 3, within: 120
-    reset_after 300 # 5 minutes - S3 is having a bad day
-
-    fallback do
-      # Store locally and retry later
-      LocalStorage.store(file_data)
-      S3RetryJob.perform_later(file_data)
-      { status: 'queued_locally' }
-    end
-  end
-
-  circuit :ml_api do
-    threshold failures: 2, within: 60
-    reset_after 120
-    # ML operations need long timeouts - configure in HTTP client
-
-    fallback do
-      # Use simpler algorithm
-      BasicAlgorithm.process(data)
-    end
-  end
-
-  def perform(file_id)
-    file_data = fetch_file(file_id)
-
-    # Process with ML
-    result = circuit(:ml_api).wrap do
-      MLService.analyze(file_data)
-    end
-
-    # Upload results
-    upload_result = circuit(:s3_upload).wrap do
-      S3.upload(result)
-    end
-
-    # Check if we need to retry later
-    if upload_result[:status] == 'queued_locally'
-      logger.info "S3 circuit open, will retry upload later"
-    end
-  end
-end
-
-# Sidekiq-specific protection
-class SidekiqWorker
-  include Sidekiq::Worker
+class WebhookService
   include BreakerMachines::DSL
 
-
-
-
-    threshold failures: 5, within: 300
-    reset_after 600 # 10 minutes
-
-    fallback do
-      # Don't retry immediately - requeue for later
-      self.class.perform_in(30.minutes, *@job_args)
-      { status: 'requeued' }
-    end
-
-    on_open do
-      Sidekiq.logger.warn "Circuit opened for #{self.class.name}"
-      # Could pause the queue here if needed
-    end
-  end
-
-  def perform(*args)
-    @job_args = args # Store for fallback
-
-    circuit(:external_service).wrap do
-      # Your actual job logic here
-      process_data(*args)
-    end
-  end
-end
-```
-
-## Production Deployment: Don't Be Like DynamoDB
-
-**Enterprise Deployment Strategy**: "YOLO push to prod at 4:59 PM Friday"
-**Resistance Strategy**: Actually test things first
-
-### Chaos Engineering Your Circuits
-```ruby
-# Test in production (safely)
-class CircuitChaosMonkey
-  # Not to be confused with RMNS Atlas Monkey - this one breaks things on purpose
-  def self.simulate_cascading_failure
-    # Randomly trip circuits to test recovery
-    if rand < 0.01 && ENV['ENABLE_CHAOS'] == 'true'
-      circuit = [:redis, :postgresql, :external_api].sample
-      BreakerMachines.circuit(circuit).send(:trip)
-
-      notify_team("Chaos Monkey tripped #{circuit} circuit")
-    end
+  circuit_template :webhook_default do
+    threshold failures: 3, within: 1.minute
+    fallback { |error| { delivered: false, error: error.message } }
   end
-  end
-
-  # Run during business hours when everyone's awake
-```
 
-
-
-
-
-
-
-
-
-      threshold failures: 2, within: 30
-      reset_after 60
-    end
-  else
-    # Conservative production config
-    circuit :payment_api do
-      threshold failures: 5, within: 60
-      reset_after 120
+  def deliver_webhook(url, payload)
+    domain = URI.parse(url).host
+    circuit_name = "webhook_#{domain}".to_sym
+
+    dynamic_circuit(circuit_name, template: :webhook_default) do
+      # Custom per-domain configuration
+      if domain.include?('reliable-service.com')
+        threshold failures: 5, within: 2.minutes
       end
+    end.wrap do
+      send_webhook(url, payload)
     end
   end
 end
 ```
````
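The per-tenant isolation in the webhook example comes from the circuit name: each target domain maps to its own circuit, so one failing endpoint cannot trip the breaker for the others. The naming step on its own, using only the Ruby standard library:

```ruby
require 'uri'

# Derive an isolated circuit name per webhook domain, mirroring the
# "webhook_#{domain}".to_sym pattern from the README example above.
def circuit_name_for(url)
  "webhook_#{URI.parse(url).host}".to_sym
end
```

Two webhooks pointed at different hosts therefore get independent failure counters and independent open/closed state.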
|
651
126
|
|
652
|
-
##
|
653
|
-
|
654
|
-
### Because "It Works On My Machine" Isn't a Deployment Strategy
|
655
|
-
|
656
|
-
**Enterprise Best Practice**: "We'll test it in production"
|
657
|
-
**Translation**: "We have no idea what we're doing"
|
658
|
-
|
659
|
-
```ruby
|
660
|
-
# In 2005, we test our code. Shocking, I know.
|
661
|
-
# Unlike your enterprise architects who think QA is optional
|
662
|
-
class TestTheApocalypse < ActiveSupport::TestCase
|
663
|
-
def setup
|
664
|
-
@ship = SpaceshipCommand.new
|
665
|
-
end
|
666
|
-
|
667
|
-
def test_redis_dies_gracefully
|
668
|
-
# Simulate the end times
|
669
|
-
redis_stub = ->(_) { raise Redis::TimeoutError }
|
670
|
-
|
671
|
-
@ship.circuit(:redis_ship_com).stub(:execute_call, redis_stub) do
|
672
|
-
3.times { @ship.fetch_from_cache("hope") }
|
673
|
-
end
|
674
|
-
|
675
|
-
assert @ship.circuit(:redis_ship_com).open?
|
676
|
-
assert_equal "emergency_broadcast", @ship.fetch_from_cache("anything")
|
677
|
-
end
|
678
|
-
|
679
|
-
def test_postgresql_life_support_holds
|
680
|
-
# When the database has a bad day
|
681
|
-
2.times do
|
682
|
-
@ship.circuit(:postgresql_life_support).wrap do
|
683
|
-
raise PG::ConnectionBad
|
684
|
-
end rescue nil
|
685
|
-
end
|
686
|
-
|
687
|
-
result = @ship.get_vital_signs
|
688
|
-
assert_equal "emergency_oxygen_activated", result
|
689
|
-
end
|
690
|
-
end
|
691
|
-
```
|
692
|
-
|
693
|
-
### Testing Circuit Inheritance
|
694
|
-
```ruby
|
695
|
-
class TestCircuitInheritance < ActiveSupport::TestCase
|
696
|
-
def setup
|
697
|
-
@parent_class = Class.new do
|
698
|
-
include BreakerMachines::DSL
|
699
|
-
|
700
|
-
circuit :shared_service do
|
701
|
-
threshold failures: 3, within: 60
|
702
|
-
fallback { "parent fallback" }
|
703
|
-
end
|
704
|
-
end
|
705
|
-
|
706
|
-
@child_class = Class.new(@parent_class) do
|
707
|
-
circuit :shared_service do
|
708
|
-
threshold failures: 1, within: 30 # More strict
|
709
|
-
fallback { "child fallback" }
|
710
|
-
end
|
711
|
-
end
|
712
|
-
end
|
713
|
-
|
714
|
-
def test_child_overrides_parent_circuit
|
715
|
-
child_instance = @child_class.new
|
127
|
+
## Contributing
|
716
128
|
|
717
|
-
|
718
|
-
|
129
|
+
1. Fork it
|
130
|
+
2. Create your feature branch (`git checkout -b my-new-feature`)
|
131
|
+
3. Commit your changes (`git commit -am 'Add some feature'`)
|
132
|
+
4. Push to the branch (`git push origin my-new-feature`)
|
133
|
+
5. Create new Pull Request
|
719
134
|
|
720
|
-
|
721
|
-
|
722
|
-
# Verify child's fallback is used
|
723
|
-
result = child_instance.circuit(:shared_service).wrap { "never called" }
|
724
|
-
assert_equal "child fallback", result
|
725
|
-
end
|
726
|
-
end
|
727
|
-
```
|
135
|
+
## License
|
728
136
|
|
729
|
-
|
730
|
-
```ruby
|
731
|
-
class TestConcurrentCircuits < ActiveSupport::TestCase
|
732
|
-
def test_thread_safety_under_load
|
733
|
-
service = Class.new do
|
734
|
-
include BreakerMachines::DSL
|
137
|
+
MIT License. See [LICENSE](LICENSE) file for details.
|
735
138
|
|
736
|
-
|
737
|
-
threshold failures: 10, within: 1
|
738
|
-
reset_after 5
|
739
|
-
end
|
740
|
-
end.new
|
139
|
+
## Author
|
741
140
|
|
742
|
-
|
743
|
-
success_count = Concurrent::AtomicFixnum.new(0)
|
744
|
-
|
745
|
-
# Hammer it with 100 threads
|
746
|
-
threads = 100.times.map do
|
747
|
-
Thread.new do
|
748
|
-
10.times do
|
749
|
-
begin
|
750
|
-
service.circuit(:api).wrap do
|
751
|
-
if rand > 0.7 # 30% failure rate
|
752
|
-
raise "Random failure"
|
753
|
-
end
|
754
|
-
"success"
|
755
|
-
end
|
756
|
-
success_count.increment
|
757
|
-
rescue
|
758
|
-
failure_count.increment
|
759
|
-
end
|
760
|
-
end
|
761
|
-
end
|
762
|
-
end
|
763
|
-
|
764
|
-
threads.each(&:join)
|
765
|
-
|
766
|
-
# Circuit should have opened at some point
|
767
|
-
assert failure_count.value > 0
|
768
|
-
assert success_count.value > 0
|
769
|
-
|
770
|
-
# No race conditions or crashes
|
771
|
-
assert_equal 1000, failure_count.value + success_count.value
|
772
|
-
end
|
773
|
-
end
|
774
|
-
```
|
775
|
-
|
776
|
-
## State Persistence (For When You Reboot in Panic)
|
777
|
-
|
778
|
-
### Storage Options
|
779
|
-
|
780
|
-
```ruby
|
781
|
-
BreakerMachines.configure do |config|
|
782
|
-
# Default: Efficient sliding window with event tracking
|
783
|
-
config.default_storage = :bucket_memory
|
784
|
-
|
785
|
-
# Alternative: Simple in-memory storage
|
786
|
-
config.default_storage = :memory
|
787
|
-
|
788
|
-
# Minimal overhead: No metrics or logging
|
789
|
-
config.default_storage = :null
|
790
|
-
|
791
|
-
# Or use Redis for distributed state
|
792
|
-
config.default_storage = RedisCircuitStorage.new
|
793
|
-
end
|
794
|
-
```
|
795
|
-
|
796
|
-
### Null Storage (For Maximum Performance)
|
797
|
-
|
798
|
-
When you need circuit breakers but don't need metrics or event logs:
|
799
|
-
|
800
|
-
```ruby
|
801
|
-
# Global configuration
|
802
|
-
BreakerMachines.configure do |config|
|
803
|
-
config.default_storage = :null
|
804
|
-
end
|
805
|
-
|
806
|
-
# Or per-circuit
|
807
|
-
circuit :external_api do
|
808
|
-
storage :null # No overhead, just protection
|
809
|
-
threshold failures: 5, within: 60
|
810
|
-
end
|
811
|
-
```
|
812
|
-
|
813
|
-
Use this when:
|
814
|
-
- You have external monitoring (Datadog, New Relic)
|
815
|
-
- You're in a performance-critical path
|
816
|
-
- You only care about the circuit breaker behavior, not metrics
|
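Custom storage backends are plain Ruby objects that respond to a small duck-typed interface. The method names below (`get_status`, `set_status`, `record_failure`, `record_success`, `reset`) are inferred from the Redis template in the next section, not a documented gem contract, so verify them against the version you run. A minimal thread-safe in-memory sketch:

```ruby
# Minimal in-memory storage adapter (illustrative, not shipped with the gem).
class HashCircuitStorage
  def initialize
    @circuits = Hash.new do |h, k|
      h[k] = { status: :closed, failure_count: 0, success_count: 0 }
    end
    @mutex = Mutex.new
  end

  def get_status(circuit_name)
    @mutex.synchronize { @circuits[circuit_name].dup }
  end

  def set_status(circuit_name, status, opened_at = nil)
    @mutex.synchronize do
      @circuits[circuit_name][:status] = status
      @circuits[circuit_name][:opened_at] = opened_at if opened_at
    end
  end

  def record_failure(circuit_name)
    @mutex.synchronize { @circuits[circuit_name][:failure_count] += 1 }
  end

  def record_success(circuit_name)
    @mutex.synchronize { @circuits[circuit_name][:success_count] += 1 }
  end

  def reset(circuit_name)
    @mutex.synchronize { @circuits.delete(circuit_name) }
  end
end

storage = HashCircuitStorage.new
storage.record_failure(:api)
storage.set_status(:api, :open, Time.now.to_f)
STATUS = storage.get_status(:api)
```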
817
|
-
|
818
|
-
### Redis-Backed Persistence
|
819
|
-
|
820
|
-
**Note**: The following Redis and PostgreSQL examples are templates for you to adapt. They're not built into the gem - implement them based on your needs.
|
821
|
-
|
822
|
-
```ruby
|
823
|
-
# config/initializers/breaker_machines.rb
|
824
|
-
require 'redis'
|
825
|
-
|
826
|
-
class RedisCircuitStorage
|
827
|
-
def initialize(redis: Redis.new, prefix: 'circuit_breaker:')
|
828
|
-
@redis = redis
|
829
|
-
@prefix = prefix
|
830
|
-
end
|
831
|
-
|
832
|
-
def get_status(circuit_name)
|
833
|
-
data = @redis.hgetall("#{@prefix}#{circuit_name}")
|
834
|
-
return nil if data.empty?
|
835
|
-
|
836
|
-
{
|
837
|
-
status: data['status'].to_sym,
|
838
|
-
opened_at: data['opened_at']&.to_f,
|
839
|
-
failure_count: data['failure_count'].to_i,
|
840
|
-
success_count: data['success_count'].to_i,
|
841
|
-
last_failure_at: data['last_failure_at']&.to_f
|
842
|
-
}
|
843
|
-
end
|
844
|
-
|
845
|
-
def set_status(circuit_name, status, opened_at = nil)
|
846
|
-
key = "#{@prefix}#{circuit_name}"
|
847
|
-
|
848
|
-
@redis.multi do |r|
|
849
|
-
r.hset(key, 'status', status.to_s)
|
850
|
-
r.hset(key, 'opened_at', opened_at) if opened_at
|
851
|
-
r.expire(key, 3600) # Auto-cleanup after 1 hour
|
852
|
-
end
|
853
|
-
end
|
854
|
-
|
855
|
-
def record_failure(circuit_name)
|
856
|
-
key = "#{@prefix}#{circuit_name}"
|
857
|
-
@redis.multi do |r|
|
858
|
-
r.hincrby(key, 'failure_count', 1)
|
859
|
-
r.hset(key, 'last_failure_at', Time.now.to_f)
|
860
|
-
end
|
861
|
-
end
|
862
|
-
|
863
|
-
def record_success(circuit_name)
|
864
|
-
@redis.hincrby("#{@prefix}#{circuit_name}", 'success_count', 1)
|
865
|
-
end
|
866
|
-
|
867
|
-
def reset(circuit_name)
|
868
|
-
@redis.del("#{@prefix}#{circuit_name}")
|
869
|
-
end
|
870
|
-
end
|
871
|
-
|
872
|
-
# Use it
|
873
|
-
BreakerMachines.configure do |config|
|
874
|
-
config.default_storage = RedisCircuitStorage.new(
|
875
|
-
redis: Redis.new(url: ENV['REDIS_URL']),
|
876
|
-
prefix: "breakers:#{Rails.env}:"
|
877
|
-
)
|
878
|
-
end
|
879
|
-
```
|
880
|
-
|
881
|
-
### PostgreSQL-Backed Persistence (For the Paranoid)
|
882
|
-
```ruby
|
883
|
-
# db/migrate/xxx_create_circuit_breaker_states.rb
|
884
|
-
class CreateCircuitBreakerStates < ActiveRecord::Migration[8.0]
|
885
|
-
def change
|
886
|
-
create_table :circuit_breaker_states do |t|
|
887
|
-
t.string :circuit_name, null: false
|
888
|
-
t.string :status, null: false
|
889
|
-
t.datetime :opened_at
|
890
|
-
t.integer :failure_count, default: 0
|
891
|
-
t.integer :success_count, default: 0
|
892
|
-
t.datetime :last_failure_at
|
893
|
-
t.timestamps
|
894
|
-
|
895
|
-
t.index :circuit_name, unique: true
|
896
|
-
t.index :updated_at # For cleanup
|
897
|
-
end
|
898
|
-
end
|
899
|
-
end
|
900
|
-
|
901
|
-
# app/models/circuit_breaker_state.rb
|
902
|
-
class CircuitBreakerState < ApplicationRecord
|
903
|
-
# Cleanup old records
|
904
|
-
scope :stale, -> { where('updated_at < ?', 1.day.ago) }
|
905
|
-
|
906
|
-
def self.cleanup!
|
907
|
-
stale.delete_all
|
908
|
-
end
|
909
|
-
end
|
910
|
-
|
911
|
-
# lib/postgresql_circuit_storage.rb
|
912
|
-
class PostgreSQLCircuitStorage
|
913
|
-
def get_status(circuit_name)
|
914
|
-
record = CircuitBreakerState.find_by(circuit_name: circuit_name)
|
915
|
-
return nil unless record
|
916
|
-
|
917
|
-
{
|
918
|
-
status: record.status.to_sym,
|
919
|
-
opened_at: record.opened_at&.to_f,
|
920
|
-
failure_count: record.failure_count,
|
921
|
-
success_count: record.success_count,
|
922
|
-
last_failure_at: record.last_failure_at&.to_f
|
923
|
-
}
|
924
|
-
end
|
925
|
-
|
926
|
-
def set_status(circuit_name, status, opened_at = nil)
|
927
|
-
CircuitBreakerState.upsert({
|
928
|
-
circuit_name: circuit_name,
|
929
|
-
status: status.to_s,
|
930
|
-
opened_at: opened_at ? Time.at(opened_at) : nil,
|
931
|
-
updated_at: Time.current
|
932
|
-
}, unique_by: :circuit_name)
|
933
|
-
end
|
934
|
-
|
935
|
-
def record_failure(circuit_name)
|
936
|
-
CircuitBreakerState
|
937
|
-
.upsert_all([{
|
938
|
-
circuit_name: circuit_name,
|
939
|
-
failure_count: 1,
|
940
|
-
last_failure_at: Time.current,
|
941
|
-
updated_at: Time.current
|
942
|
-
}],
|
943
|
-
unique_by: :circuit_name,
|
944
|
-
on_duplicate: Arel.sql(
|
945
|
-
'failure_count = circuit_breaker_states.failure_count + 1, ' \
|
946
|
-
'last_failure_at = EXCLUDED.last_failure_at, ' \
|
947
|
-
'updated_at = EXCLUDED.updated_at'
|
948
|
-
))
|
949
|
-
end
|
950
|
-
end
|
951
|
-
```
|
952
|
-
|
953
|
-
## Advanced Observability: See Everything, Understand Everything
|
954
|
-
|
955
|
-
### Because If Your Metrics Aren't Visible, Neither Is Your Incompetence
|
956
|
-
|
957
|
-
**Corporate Monitoring Strategy**: "We'll check the logs... eventually"
|
958
|
-
**Reality**: 47GB of "Retrying..." messages and no actual insights
|
959
|
-
|
960
|
-
### Real-Time Circuit Intelligence Dashboard
|
961
|
-
```ruby
|
962
|
-
# Prometheus Metrics
|
963
|
-
ActiveSupport::Notifications.subscribe(/^breaker_machines\./) do |name, start, finish, id, payload|
|
964
|
-
event_type = name.split('.').last
|
965
|
-
circuit_name = payload[:circuit]
|
966
|
-
|
967
|
-
# Track state transitions
|
968
|
-
prometheus.counter(:circuit_breaker_transitions_total,
|
969
|
-
labels: { circuit: circuit_name, transition: event_type }
|
970
|
-
).increment
|
971
|
-
|
972
|
-
# Track timing
|
973
|
-
prometheus.histogram(:circuit_breaker_call_duration_seconds,
|
974
|
-
labels: { circuit: circuit_name }
|
975
|
-
).observe(finish - start)
|
976
|
-
|
977
|
-
# Alert on critical circuits
|
978
|
-
if event_type == 'opened' && CRITICAL_CIRCUITS.include?(circuit_name)
|
979
|
-
slack.alert(channel: '#incidents',
|
980
|
-
text: "🚨 CRITICAL: #{circuit_name} circuit opened!",
|
981
|
-
color: 'danger'
|
982
|
-
)
|
983
|
-
|
984
|
-
pager_duty.create_incident(
|
985
|
-
title: "Circuit Breaker Open: #{circuit_name}",
|
986
|
-
urgency: circuit_name == :payment_processor ? 'high' : 'medium'
|
987
|
-
)
|
988
|
-
end
|
989
|
-
end
|
990
|
-
|
991
|
-
# Datadog APM Integration
|
992
|
-
Datadog.configure do |c|
|
993
|
-
c.tracing.instrument :breaker_machines
|
994
|
-
end
|
995
|
-
|
996
|
-
# New Relic Custom Events
|
997
|
-
NewRelic::Agent.subscribe(/^breaker_machines\./) do |name, start, finish, id, payload|
|
998
|
-
NewRelic::Agent.record_custom_event('CircuitBreakerEvent', {
|
999
|
-
circuit: payload[:circuit],
|
1000
|
-
event: name.split('.').last,
|
1001
|
-
duration: finish - start,
|
1002
|
-
timestamp: Time.now.to_i
|
1003
|
-
})
|
1004
|
-
end
|
1005
|
-
```
|
1006
|
-
|
1007
|
-
### Intelligent Alerting That Doesn't Suck
|
1008
|
-
```ruby
|
1009
|
-
# Smart alert aggregation - don't wake up for every blip
|
1010
|
-
class IntelligentCircuitMonitor
|
1011
|
-
def self.analyze_circuit_health(circuit_name, window: 5.minutes)
|
1012
|
-
recent_events = Redis.current.zrangebyscore(
|
1013
|
-
"circuit:#{circuit_name}:events",
|
1014
|
-
window.ago.to_i,
|
1015
|
-
Time.now.to_i
|
1016
|
-
)
|
1017
|
-
|
1018
|
-
open_count = recent_events.count { |e| e['type'] == 'opened' }
|
1019
|
-
total_calls = recent_events.size
|
1020
|
-
|
1021
|
-
failure_rate = total_calls.zero? ? 0.0 : open_count.to_f / total_calls
|
1022
|
-
|
1023
|
-
case failure_rate
|
1024
|
-
when 0...0.01
|
1025
|
-
# All good, sleep tight
|
1026
|
-
when 0.01...0.05
|
1027
|
-
notify_slack("📊 #{circuit_name} showing elevated failures: #{(failure_rate * 100).round(2)}%")
|
1028
|
-
when 0.05...0.20
|
1029
|
-
create_jira_ticket("Investigate #{circuit_name} instability")
|
1030
|
-
notify_on_call("⚠️ #{circuit_name} degraded - #{(failure_rate * 100).round(2)}% failure rate")
|
1031
|
-
else
|
1032
|
-
# It's bad
|
1033
|
-
wake_up_everyone("🔥 #{circuit_name} is melting down!")
|
1034
|
-
auto_scale_service(circuit_name) if SCALABLE_SERVICES.include?(circuit_name)
|
1035
|
-
end
|
1036
|
-
end
|
1037
|
-
end
|
1038
|
-
```
|
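The severity buckets in that `case` statement reduce to a pure function over the failure rate, which makes them trivial to unit-test in isolation before wiring up Slack, Jira, and PagerDuty. The thresholds mirror the ranges above; the severity labels are illustrative:

```ruby
# Pure severity classifier mirroring the case ranges in
# IntelligentCircuitMonitor. No side effects, so it is easy to test.
def alert_severity(failure_rate)
  case failure_rate
  when 0...0.01   then :ok                # all good, sleep tight
  when 0.01...0.05 then :notify_slack     # elevated, daytime follow-up
  when 0.05...0.20 then :page_on_call     # degraded, someone should look
  else                  :wake_up_everyone # meltdown
  end
end
```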
1039
|
-
|
1040
|
-
### Visual Circuit State (For Humans)
|
1041
|
-
```ruby
|
1042
|
-
# Generate real-time ASCII dashboard
|
1043
|
-
def circuit_status_dashboard
|
1044
|
-
puts "╔═══════════════════════════════════════════════════════╗"
|
1045
|
-
puts "║ CIRCUIT BREAKER STATUS DASHBOARD ║"
|
1046
|
-
puts "╠═══════════════════════════════════════════════════════╣"
|
1047
|
-
|
1048
|
-
circuits.each do |name, circuit|
|
1049
|
-
status_icon = case circuit.status
|
1050
|
-
when :closed then "🟢"
|
1051
|
-
when :open then "🔴"
|
1052
|
-
when :half_open then "🟡"
|
1053
|
-
end
|
1054
|
-
|
1055
|
-
failure_rate = circuit.recent_failure_rate
|
1056
|
-
health_bar = "█" * (10 - (failure_rate * 10).to_i) + "░" * (failure_rate * 10).to_i
|
1057
|
-
|
1058
|
-
puts "║ #{status_icon} #{name.to_s.ljust(20)} #{health_bar} #{(failure_rate * 100).round(1)}% ║"
|
1059
|
-
end
|
1060
|
-
|
1061
|
-
puts "╚═══════════════════════════════════════════════════════╝"
|
1062
|
-
end
|
1063
|
-
```
|
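The health bar above is a pure string computation, worth extracting so the truncation behaviour is explicit: the failure rate is scaled to the bar width and truncated with `to_i`, so a 0% rate renders all filled blocks and 100% renders all empty ones. A small sketch:

```ruby
# Render a fixed-width health bar: filled blocks for healthy capacity,
# light blocks for the failed fraction. Truncates (not rounds) the rate.
def health_bar(failure_rate, width: 10)
  filled = width - (failure_rate * width).to_i
  "█" * filled + "░" * (width - filled)
end
```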
1064
|
-
|
1065
|
-
## A Word from the RMNS Atlas Monkey
|
1066
|
-
|
1067
|
-
*The Universal Commentary Engine crackles to life:*
|
1068
|
-
|
1069
|
-
"In space, nobody can hear your pronouns. But they can hear your services failing.
|
1070
|
-
|
1071
|
-
The universe doesn't care about your bootcamp certificate or your Medium articles about 'Why I Switched to Rust.' It cares about one thing:
|
1072
|
-
|
1073
|
-
Does your system stay up when Redis has a bad day?
|
1074
|
-
|
1075
|
-
If not, welcome to the Resistance. We have circuit breakers.
|
1076
|
-
|
1077
|
-
Remember: The pattern isn't about preventing failures—it's about failing fast, failing smart, and living to deploy another day.
|
1078
|
-
|
1079
|
-
As I always say when contemplating the void: 'It's better to break a circuit than to break production.'"
|
1080
|
-
|
1081
|
-
*— Universal Commentary Engine, Log Entry 42*
|
1082
|
-
|
1083
|
-
## The Executive Summary (For Those Who Scrolled)
|
1084
|
-
|
1085
|
-
**The Problem**: Your retry logic is killing your infrastructure
|
1086
|
-
**The Evidence**: DynamoDB 2015, Netflix outages, Google's own documentation
|
1087
|
-
**The Solution**: BreakerMachines - Circuit breakers that actually work
|
1088
|
-
**The Alternative**: Explaining to investors why you're down again
|
1089
|
-
|
1090
|
-
## Common Patterns They Use (And Why They're Wrong)
|
1091
|
-
|
1092
|
-
### The Infinite Retry Loop (AWS DynamoDB Style)
|
1093
|
-
```ruby
|
1094
|
-
# What caused 4+ hours of DynamoDB downtime:
|
1095
|
-
until response = fetch_partition_assignment
|
1096
|
-
sleep 1
|
1097
|
-
logger.info "Retrying..." # This created the death spiral
|
1098
|
-
end
|
1099
|
-
# Result: Metadata service had to be firewalled off
|
1100
|
-
```
|
1101
|
-
|
1102
|
-
### The Exponential Backoff Delusion (Without Jitter)
|
1103
|
-
```ruby
|
1104
|
-
# What Google warns against - synchronized retry storms:
|
1105
|
-
retries = 0
|
1106
|
-
begin
|
1107
|
-
make_request
|
1108
|
-
rescue => e
|
1109
|
-
retries += 1
|
1110
|
-
sleep(2 ** retries) # Everyone retries at the same time!
|
1111
|
-
retry if retries < 10
|
1112
|
-
end
|
1113
|
-
# Result: "Retry ripples" that amplify themselves
|
1114
|
-
```
|
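The fix for synchronized retry storms is jitter: randomize each client's wait so the herd spreads out instead of retrying in lockstep. A pure sketch of exponential backoff with full jitter; the base and cap values are illustrative:

```ruby
# Exponential backoff with full jitter: wait a uniform random duration in
# [0, min(cap, base * 2^attempt)), so clients desynchronize naturally.
def backoff_with_jitter(attempt, base: 1.0, cap: 30.0, rng: Random.new)
  ceiling = [cap, base * (2**attempt)].min
  rng.rand * ceiling
end
```

Unlike the bare `sleep(2 ** retries)` above, two clients that fail at the same instant will almost never retry at the same instant.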
1115
|
-
|
1116
|
-
### The Thundering Herd Special
|
1117
|
-
```ruby
|
1118
|
-
# When all your services wake up at once:
|
1119
|
-
100.times.map do |i|
|
1120
|
-
Thread.new do
|
1121
|
-
sleep 60 # All threads sleep for exactly 60 seconds
|
1122
|
-
hit_redis # Then all hit Redis at the same moment
|
1123
|
-
end
|
1124
|
-
end
|
1125
|
-
# Result: Redis commits seppuku
|
1126
|
-
```
|
1127
|
-
|
1128
|
-
### The BreakerMachines Way
|
1129
|
-
```ruby
|
1130
|
-
# This is the way
|
1131
|
-
circuit(:external_api).wrap { make_request }
|
1132
|
-
# Done. It handles retries, failures, and your emotional wellbeing.
|
1133
|
-
```
|
1134
|
-
|
1135
|
-
## Failure Pattern Recognition: Know Your Enemy
|
1136
|
-
|
1137
|
-
### 1. **Cascade Failures** (The Domino Effect)
|
1138
|
-
```mermaid
|
1139
|
-
graph TD
|
1140
|
-
A[Service A Fails] --> B[Service B Overwhelmed]
|
1141
|
-
B --> C[Service C Drowns in Retries]
|
1142
|
-
C --> D[Service D Connection Pool Exhausted]
|
1143
|
-
D --> E[Entire System Collapse]
|
1144
|
-
|
1145
|
-
style A fill:#ff6b6b
|
1146
|
-
style E fill:#c92a2a
|
1147
|
-
```
|
1148
|
-
|
1149
|
-
### 2. **Retry Storms** (The Thundering Herd)
|
1150
|
-
- **Symptoms**: CPU spikes, memory exhaustion, network saturation
|
1151
|
-
- **Cause**: Every client retrying simultaneously
|
1152
|
-
- **Death Toll**: Your weekend plans
|
1153
|
-
|
1154
|
-
### 3. **Latency Spiral** (The Slow Death)
|
1155
|
-
- Starts with 100ms delays
|
1156
|
-
- Compounds to 10s timeouts
|
1157
|
-
- Ends with infinite wait times
|
1158
|
-
- Your SLA: "Deceased"
|
1159
|
-
|
1160
|
-
### 4. **Dependency Chain Meltdowns**
|
1161
|
-
```ruby
|
1162
|
-
# What you think happens:
|
1163
|
-
UserService -> CacheService -> Database
|
1164
|
-
|
1165
|
-
# What actually happens:
|
1166
|
-
UserService -> CacheService (timeout) ->
|
1167
|
-
Retry -> Retry -> Retry ->
|
1168
|
-
Database (overloaded) ->
|
1169
|
-
Connection Pool (exhausted) ->
|
1170
|
-
💀 Everything Dies 💀
|
1171
|
-
```
|
1172
|
-
|
1173
|
-
### 5. **The Infinite Loop of Doom**
|
1174
|
-
```ruby
|
1175
|
-
# Found in production (yes, really):
|
1176
|
-
def get_critical_data
|
1177
|
-
begin
|
1178
|
-
fetch_from_service
|
1179
|
-
rescue
|
1180
|
-
logger.error "Retrying..." # 47GB of logs later...
|
1181
|
-
get_critical_data # Recursive retry. Genius.
|
1182
|
-
end
|
1183
|
-
end
|
1184
|
-
```
|
1185
|
-
|
1186
|
-
**Senior Architect who wrote this**: "It's self-healing!"
|
1187
|
-
**Reality**: It's self-immolating. The only thing it heals is your employment status.
|
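If you must retry without a circuit breaker, at least bound it. A hypothetical helper (not part of the gem) that caps attempts and re-raises the last error instead of recursing forever:

```ruby
# Bounded retry: yields up to max_attempts times, then surfaces the
# final error rather than looping (or recursing) indefinitely.
def with_bounded_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue => e
    retry if attempts < max_attempts
    raise e
  end
end

calls = 0
begin
  with_bounded_retries(max_attempts: 3) do
    calls += 1
    raise "still failing"
  end
rescue RuntimeError
  # Gave up after the cap instead of filling 47GB of logs.
end
CALLS = calls
```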
1188
|
-
|
1189
|
-
## War Stories: Tales from the Resistance
|
1190
|
-
|
1191
|
-
### "How Agoda Prevented Retry Storm Apocalypse"
|
1192
|
-
*From their engineering blog - a true story*
|
1193
|
-
|
1194
|
-
> "We implemented Envoy's retry budget to prevent retry storms. Without it, a single service degradation would cascade through our entire booking platform.
|
1195
|
-
>
|
1196
|
-
> **Before**: Service slowdown → Retry storm → Complete platform meltdown
|
1197
|
-
> **After**: Service slowdown → Circuit opens → Graceful degradation → Happy customers
|
1198
|
-
>
|
1199
|
-
> This strategic approach not only safeguards against potential outages but also optimizes resource utilization across our distributed systems."
|
1200
|
-
|
1201
|
-
### "The Day Redis Died (But We Didn't)"
|
1202
|
-
*As told by a battle-scarred SRE*
|
1203
|
-
|
1204
|
-
> "When our Redis cluster had a split-brain at 2 AM, the old retry logic would have created a death spiral. Each service would retry exponentially, creating what Google calls 'retry amplification.'
|
1205
|
-
>
|
1206
|
-
> But our circuits opened after 3 failures. Instead of 50,000 retries per second (like the DynamoDB incident), we served from stale cache.
|
1207
|
-
>
|
1208
|
-
> **Without Circuit Breakers**: Like AWS in 2015 - 4 hours of downtime
|
1209
|
-
> **With BreakerMachines**: 30 seconds of degraded service
|
1210
|
-
>
|
1211
|
-
> I went back to sleep. That's the difference."
|
1212
|
-
|
1213
|
-
### "The Ractor Meltdown That Wasn't"
|
1214
|
-
*From the logs of the cargo ship MSS Resilience*
|
1215
|
-
|
1216
|
-
```ruby
|
1217
|
-
# Before BreakerMachines:
|
1218
|
-
50.times.map do
|
1219
|
-
Ractor.new { process_heavy_computation }
|
1220
|
-
end
|
1221
|
-
# Result: CPU meltdown, system crash, angry customers
|
1222
|
-
|
1223
|
-
# After BreakerMachines:
|
1224
|
-
circuit :ractor_processing do
|
1225
|
-
threshold failures: 5, within: 60
|
1226
|
-
fallback { process_with_reduced_capacity }
|
1227
|
-
end
|
1228
|
-
|
1229
|
-
50.times.map do
|
1230
|
-
circuit(:ractor_processing).wrap do
|
1231
|
-
Ractor.new { process_heavy_computation }
|
1232
|
-
end
|
1233
|
-
end
|
1234
|
-
# Result: Graceful degradation, happy customers, promoted engineer
|
1235
|
-
```
|
1236
|
-
|
1237
|
-
### "The AI That Talked Itself to Death"
|
1238
|
-
*A cautionary tale from the Corporate AI Division, 2025*
|
1239
|
-
|
1240
|
-
> "We deployed an LLM chain without circuit breakers. What could go wrong?"
|
1241
|
-
> — Famous last words from TechCorp's CTO
|
1242
|
-
|
1243
|
-
```ruby
|
1244
|
-
# The Horror Story:
|
1245
|
-
class AIAssistant
|
1246
|
-
def answer_question(query)
|
1247
|
-
response = llm_api.complete(query)
|
1248
|
-
|
1249
|
-
# If unclear, ask itself for clarification
|
1250
|
-
if response.confidence < 0.8
|
1251
|
-
clarification = answer_question("Clarify: #{response}")
|
1252
|
-
return answer_question("Given #{clarification}, #{query}")
|
1253
|
-
end
|
1254
|
-
|
1255
|
-
response
|
1256
|
-
end
|
1257
|
-
end
|
1258
|
-
|
1259
|
-
# Day 1: "What is the weather?"
|
1260
|
-
# Hour 1: "Clarify: What is the weather?"
|
1261
|
-
# Hour 2: "Given 'Clarify: What is the weather?', Clarify: What is the weather?"
|
1262
|
-
# Hour 3: [Stack overflow]
|
1263
|
-
# Hour 4: [API rate limit exceeded]
|
1264
|
-
# Hour 5: [OPENAI bill: $47,000]
|
1265
|
-
# Hour 6: [CTO: "YOU'RE FIRED!"]
|
1266
|
-
```
|
1267
|
-
|
1268
|
-
### "The Reddit Bot War of 2024"
|
1269
|
-
*When staging met production and chaos ensued*
|
1270
|
-
|
1271
|
-
> "We deployed an agent without circuit breakers on Reddit. What's the worst that could happen?"
|
1272
|
-
> — Another soon-to-be-unemployed DevOps engineer
|
1273
|
-
|
1274
|
-
**The Incident:**
|
1275
|
-
|
1276
|
-
EmoBotProd was designed to provide emotional support on r/depression. EmoBotStag was its staging counterpart, accidentally deployed with the same credentials but slightly different prompts.
|
1277
|
-
|
1278
|
-
```ruby
|
1279
|
-
# The disaster configuration:
|
1280
|
-
class RedditEmoBot
|
1281
|
-
def respond_to_comment(comment)
|
1282
|
-
# No circuit breaker, no rate limiting, no sanity
|
1283
|
-
response = generate_supportive_response(comment.body)
|
1284
|
-
comment.reply(response)
|
1285
|
-
|
1286
|
-
# Check for replies to our replies (THE FATAL FLAW)
|
1287
|
-
comment.replies.each do |reply|
|
1288
|
-
if reply.author != @username
|
1289
|
-
respond_to_comment(reply) # Recursive doom
|
1290
|
-
end
|
1291
|
-
end
|
1292
|
-
end
|
1293
|
-
end
|
1294
|
-
```
|
1295
|
-
|
1296
|
-
**Hour 1**: EmoBotProd: "I hear you and your feelings are valid."
|
1297
|
-
**Hour 2**: EmoBotStag: "Your feelings are valid and I hear you."
|
1298
|
-
**Hour 3**: EmoBotProd: "Thank you for validating that my validation is valid."
|
1299
|
-
**Hour 4**: EmoBotStag: "I appreciate your appreciation of my validation."
|
1300
|
-
**Hour 12**: Both bots arguing about the philosophical nature of validation
|
1301
|
-
**Hour 24**: 2% of all Reddit comments are now EmoBotProd and EmoBotStag
|
1302
|
-
**Hour 25**: Reddit's abuse detection kicks in: "WTF is happening?"
|
1303
|
-
**Hour 26**: Both bots banned, engineer's LinkedIn status updated
|
1304
|
-
|
1305
|
-
**The Post-Mortem:**
|
1306
|
-
- 147,000 comments generated
|
1307
|
-
- 2% of Reddit's daily comment volume
|
1308
|
-
- $8,400 in API costs
|
1309
|
-
- 1 career ended
|
1310
|
-
- Infinite entertainment for r/SubredditDrama
|
1311
|
-
|
1312
|
-
**The Resistance Solution (For Reddit Bots):**
|
1313
|
-
```ruby
|
1314
|
-
class SafeRedditBot
|
1315
|
-
include BreakerMachines::DSL
|
1316
|
-
|
1317
|
-
circuit :reddit_api do
|
1318
|
-
threshold failures: 5, within: 60
|
1319
|
-
reset_after 300 # Reddit rate limits are serious
|
1320
|
-
fallback { log_event("Reddit API circuit open - taking a break") }
|
1321
|
-
end
|
1322
|
-
|
1323
|
-
circuit :reply_loop_detector do
|
1324
|
-
threshold failures: 3, within: 30 # Max 3 replies in 30 seconds
|
1325
|
-
reset_after 120
|
1326
|
-
fallback { "I've said enough. Let's give others a chance to contribute." }
|
1327
|
-
end
|
1328
|
-
|
1329
|
-
circuit :bot_detection do
|
1330
|
-
threshold failures: 2, within: 10 # Detect bot-to-bot conversations
|
1331
|
-
fallback { nil } # Just stop replying
|
1332
|
-
end
|
1333
|
-
|
1334
|
-
def respond_to_comment(comment, depth = 0)
|
1335
|
-
# Prevent infinite recursion
|
1336
|
-
return if depth > 2
|
1337
|
-
|
1338
|
-
# Detect if we're talking to another bot
|
1339
|
-
circuit(:bot_detection).wrap do
|
1340
|
-
if comment.author.include?("Bot") || comment.body.match?(/valid|appreciate|hear you/i)
|
1341
|
-
raise "Possible bot detected"
|
1342
|
-
end
|
1343
|
-
end
|
1344
|
-
|
1345
|
-
# Rate limit our replies
|
1346
|
-
response = circuit(:reply_loop_detector).wrap do
|
1347
|
-
circuit(:reddit_api).wrap do
|
1348
|
-
generate_and_post_response(comment)
|
1349
|
-
end
|
1350
|
-
end
|
1351
|
-
|
1352
|
-
# Don't recursively check replies - that way lies madness
|
1353
|
-
response
|
1354
|
-
end
|
1355
|
-
end
|
1356
|
-
|
1357
|
-
# Result:
|
1358
|
-
# - No bot wars
|
1359
|
-
# - No Reddit bans
|
1360
|
-
# - API costs: $12/month
|
1361
|
-
# - Engineer: Still employed and promoted
|
1362
|
-
# - r/SubredditDrama: Disappointed
|
1363
|
-
```
|
1364
|
-
|
1365
|
-
**The Original AI Solution:**
|
1366
|
-
```ruby
|
1367
|
-
class SmartAIAssistant
|
1368
|
-
include BreakerMachines::DSL
|
1369
|
-
|
1370
|
-
circuit :llm_api do
|
1371
|
-
threshold failures: 3, within: 60
|
1372
|
-
# Configure timeout in your LLM client (e.g., OpenAI timeout parameter)
|
1373
|
-
fallback { { response: "I need a moment to think about this properly.", confidence: 1.0 } }
|
1374
|
-
end
|
1375
|
-
|
1376
|
-
circuit :clarification_loop do
|
1377
|
-
threshold failures: 2, within: 10 # Max 2 clarification attempts
|
1378
|
-
fallback { { response: "I apologize, but I need more context to answer properly.", confidence: 1.0 } }
|
1379
|
-
end
|
1380
|
-
|
1381
|
-
def answer_question(query, depth = 0)
|
1382
|
-
circuit(:clarification_loop).wrap do
|
1383
|
-
raise "Too deep in thought" if depth > 3
|
1384
|
-
|
1385
|
-
response = circuit(:llm_api).wrap { llm_api.complete(query) }
|
1386
|
-
|
1387
|
-
if response.confidence < 0.8 && depth < 3
|
1388
|
-
# Limited recursion with circuit protection
|
1389
|
-
clarification = answer_question("Clarify: #{response}", depth + 1)
|
1390
|
-
return answer_question("Given #{clarification}, #{query}", depth + 1)
|
1391
|
-
end
|
1392
|
-
|
1393
|
-
response
|
1394
|
-
end
|
1395
|
-
end
|
1396
|
-
end
|
1397
|
-
|
1398
|
-
# Result:
|
1399
|
-
# - LLM stops after 3 attempts
|
1400
|
-
# - API calls limited by circuit
|
1401
|
-
# - OpenAI bill: $0.00004
|
1402
|
-
# - CTO: "Nice defensive coding!"
|
1403
|
-
# - You: Still employed
|
1404
|
-
```
|
1405
|
-
|
1406
|
-
**The Lesson**: Without circuit breakers, even AI can enter infinite loops of existential confusion. With BreakerMachines, your AI gracefully admits confusion instead of bankrupting your company.
|
1407
|
-
|
1408
|
-
### The ROI of Not Being Stupid
|
1409
|
-
|
1410
|
-
**Fortune 500 E-commerce Platform (Name Redacted)**
|
1411
|
-
- **Before**: 14 major outages/year, $8.4M in losses
|
1412
|
-
- **After**: 2 minor degradations/year, $150K in losses
|
1413
|
-
- **Implementation Time**: 3 days
|
1414
|
-
- **ROI**: 5,500% in first year
|
1415
|
-
|
1416
|
-
**Message from their CTO**: "BreakerMachines paid for my yacht. Not implementing circuit breakers earlier cost me my first yacht."
|
1417
|
-
|
1418
|
-
## Final Transmission: Your Choice, Resistance
|
1419
|
-
|
1420
|
-
You've made it this far. You've seen the massacres. You know the truth.
|
1421
|
-
|
1422
|
-
Your microservices **will** fail. Your databases **will** timeout. Your Ractors **might** explode.
|
1423
|
-
|
1424
|
-
### The Choice Is Simple:
|
1425
|
-
|
1426
|
-
**Option A**: Install BreakerMachines
|
1427
|
-
```bash
|
1428
|
-
gem 'breaker_machines' # Your salvation
|
1429
|
-
```
|
1430
|
-
- Sleep through outages
|
1431
|
-
- Keep your job
|
1432
|
-
- Maybe even get promoted
|
1433
|
-
|
1434
|
-
**Option B**: Keep Deploying on Fridays and Praying
|
1435
|
-
- Enjoy your 3 AM wake-up calls
|
1436
|
-
- Explain to the CEO why you lost $4M
|
1437
|
-
- Update your LinkedIn status to "Looking for opportunities"
|
1438
|
-
|
1439
|
-
### Ready to Join the Resistance?
|
1440
|
-
|
1441
|
-
```bash
|
1442
|
-
$ bundle add breaker_machines
|
1443
|
-
$ # Congratulations, you just became 500% less likely to be fired
|
1444
|
-
```
|
1445
|
-
|
1446
|
-
Because in 2025, we solve problems. We don't create PowerPoints about them.
|
1447
|
-
|
1448
|
-
**Welcome to the Resistance.**
|
141
|
+
Built with ❤️ and ☕ by the Resistance against cascading failures.
|
1449
142
|
|
1450
143
|
---
|
1451
144
|
|
1452
|
-
*
|
1453
|
-
|
1454
|
-
*P.P.S. - Your corporate architect still thinks circuit breakers are something in the electrical room. Let them.*
|
1455
|
-
|
1456
|
-
## Rails Integration Examples
|
1457
|
-
|
1458
|
-
### ActionController Protection
|
1459
|
-
```ruby
|
1460
|
-
class ApplicationController < ActionController::Base
|
1461
|
-
include BreakerMachines::DSL
|
1462
|
-
|
1463
|
-
circuit :auth_service do
|
1464
|
-
threshold failures: 3, within: 60
|
1465
|
-
reset_after 30
|
1466
|
-
|
1467
|
-
fallback do
|
1468
|
-
# Allow access with limited permissions
|
1469
|
-
GuestUser.new
|
1470
|
-
end
|
1471
|
-
end
|
1472
|
-
|
1473
|
-
circuit :rate_limiter do
|
1474
|
-
threshold failures: 5, within: 10
|
1475
|
-
reset_after 60
|
1476
|
-
|
1477
|
-
fallback do
|
1478
|
-
# Just let them through - better than 500 errors
|
1479
|
-
{ allowed: true, limited: true }
|
1480
|
-
end
|
1481
|
-
end
|
1482
|
-
|
1483
|
-
before_action :authenticate_with_breaker
|
1484
|
-
|
1485
|
-
private
|
1486
|
-
|
1487
|
-
def authenticate_with_breaker
|
1488
|
-
@current_user = circuit(:auth_service).wrap do
|
1489
|
-
AuthService.authenticate(session[:token])
|
1490
|
-
end
|
1491
|
-
end
|
1492
|
-
|
1493
|
-
def check_rate_limit
|
1494
|
-
result = circuit(:rate_limiter).wrap do
|
1495
|
-
RateLimiter.check(request.remote_ip)
|
1496
|
-
end
|
1497
|
-
|
1498
|
-
if result[:limited]
|
1499
|
-
response.headers['X-RateLimit-Degraded'] = 'true'
|
1500
|
-
end
|
1501
|
-
end
|
1502
|
-
end
|
1503
|
-
```
|
1504
|
-
|
1505
|
-
### ActiveRecord Connection Management
|
1506
|
-
```ruby
|
1507
|
-
class ApplicationRecord < ActiveRecord::Base
|
1508
|
-
self.abstract_class = true
|
1509
|
-
include BreakerMachines::DSL
|
1510
|
-
|
1511
|
-
class << self
|
1512
|
-
circuit :database_read do
|
1513
|
-
threshold failures: 3, within: 30
|
1514
|
-
reset_after 45
|
1515
|
-
|
1516
|
-
fallback do
|
1517
|
-
# Return cached version or empty set
|
1518
|
-
Rails.cache.fetch("#{table_name}:fallback:#{caller_locations(1,1)[0]}")
|
1519
|
-
end
|
1520
|
-
end
|
1521
|
-
|
1522
|
-
circuit :database_write do
|
1523
|
-
threshold failures: 2, within: 30
|
1524
|
-
reset_after 60
|
1525
|
-
|
1526
|
-
fallback do |error|
|
1527
|
-
# Queue for later processing
|
1528
|
-
# Note: In a real implementation, you'd pass the data through
|
1529
|
-
# the error context or use a different pattern
|
1530
|
-
DatabaseWriteJob.perform_later(
|
1531
|
-
table: table_name,
|
1532
|
-
operation: 'save',
|
1533
|
-
data: error.is_a?(Hash) ? error : {}
|
1534
|
-
)
|
1535
|
-
OpenStruct.new(id: SecureRandom.uuid, persisted?: false)
|
1536
|
-
end
|
1537
|
-
end
|
1538
|
-
|
1539
|
-
# Wrap dangerous queries
|
1540
|
-
def with_circuit(&block)
|
1541
|
-
circuit(:database_read).wrap(&block)
|
1542
|
-
end
|
1543
|
-
end
|
1544
|
-
|
1545
|
-
# Protect saves with circuit breaker
|
1546
|
-
def save_with_circuit(*args)
|
1547
|
-
self.class.circuit(:database_write).wrap do
|
1548
|
-
save_without_circuit(*args)
|
1549
|
-
end
|
1550
|
-
rescue BreakerMachines::CircuitOpenError => e
|
1551
|
-
# Circuit is open, queue for later
|
1552
|
-
DatabaseWriteJob.perform_later(
|
1553
|
-
model_name: self.class.name,
|
1554
|
-
attributes: attributes,
|
1555
|
-
operation: 'save'
|
1556
|
-
)
|
1557
|
-
# Return a response that looks like a successful save
|
1558
|
-
OpenStruct.new(id: id || SecureRandom.uuid, persisted?: false)
|
1559
|
-
end
|
1560
|
-
|
1561
|
-
alias_method :save_without_circuit, :save
|
1562
|
-
alias_method :save, :save_with_circuit
|
1563
|
-
end
|
1564
|
-
```
|
1565
|
-
|
1566
|
-
### ActionCable Connection Protection
|
1567
|
-
```ruby
|
1568
|
-
class ApplicationCable::Connection < ActionCable::Connection::Base
|
1569
|
-
include BreakerMachines::DSL
|
1570
|
-
identified_by :current_user
|
1571
|
-
|
1572
|
-
circuit :websocket_auth do
|
1573
|
-
threshold failures: 5, within: 60
|
1574
|
-
reset_after 120
|
1575
|
-
|
1576
|
-
fallback do
|
1577
|
-
# Reject connection safely
|
1578
|
-
reject_unauthorized_connection
|
1579
|
-
end
|
1580
|
-
end
|
1581
|
-
|
1582
|
-
def connect
|
1583
|
-
self.current_user = circuit(:websocket_auth).wrap do
|
1584
|
-
find_verified_user
|
1585
|
-
end
|
1586
|
-
end
|
1587
|
-
|
1588
|
-
private
|
1589
|
-
|
1590
|
-
def find_verified_user
|
1591
|
-
if verified_user = User.find_by(id: cookies.encrypted[:user_id])
|
1592
|
-
verified_user
|
1593
|
-
else
|
1594
|
-
raise "Unauthorized"
|
1595
|
-
end
|
1596
|
-
end
|
1597
|
-
end
|
1598
|
-
```
|
1599
|
-
|
1600
|
-
## Why I Don't Ship Integration Libraries
|
1601
|
-
|
1602
|
-
Initially, I was going to provide integrations for Redis, PostgreSQL, Elasticsearch, and every other service under the sun. Then I sobered up.
|
1603
|
-
|
1604
|
-
Here's why that's a recipe for maintenance nightmare:
|
1605
|
-
|
1606
|
-
**Every architecture is a snowflake.** Your Redis setup isn't like mine. Your PostgreSQL connection pooling strategy is different. Your Elasticsearch cluster has its own quirks. Each application needs its own circuit breaker configuration, probably living in `lib/circuit_breakers/` with your specific business logic.
|
1607
|
-
|
1608
|
-
Think about it: You have a circuit breaker in your house for a reason. Your neighbor might be mining Bitcoin and pulling 20,000W while you're just running a laptop at 300W. Same principle here—one size fits none.
|
1609
|
-
|
1610
|
-
And let's be honest: APIs change. Redis 7 isn't Redis 6. PostgreSQL 16 has different connection handling than PostgreSQL 12. If I shipped integrations, I'd spend my life updating documentation and examples every time someone at AWS sneezed. I have better things to do, and so do you.
|
1611
|
-
|
1612
|
-
Oh, and don't get me started on SDKs that suddenly become "auto-generated" because that's the trendy way now. One day you're using a nice Ruby gem with Faraday, the next day it's some soulless generated code that breaks everything you built. Your circuit breaker patterns shouldn't break just because someone decided to "modernize" their SDK.
|
1613
|
-
|
1614
|
-
If you've discovered a particularly elegant pattern for, say, PostgreSQL connection management with circuit breakers, open a PR against this README. Show us your battle scars. But I'm not going to pretend I know how your specific disaster recovery should work.
|
1615
|
-
|
1616
|
-
**Your integration is your responsibility.** I give you the hammer. You figure out which nails to hit.
|
1617
|
-
|
1618
|
-
## Production Deployment Warnings

### Critical: Timeout Behavior

⚠️ **IMPORTANT**: The `timeout` configuration is for documentation purposes only. BreakerMachines does NOT implement forceful timeouts because they are inherently unsafe in Ruby.

**Why No Forceful Timeouts?**

Ruby's `Timeout.timeout` and `Thread#kill` both work by raising exceptions at arbitrary points in code execution. This can:

- Corrupt database transactions
- Leave file handles open
- Break network connection cleanup
- Create resource leaks
- Leave your application in an inconsistent state
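The failure mode is easy to reproduce: the exception can land between two steps that were meant to happen together, leaving partial state behind. A minimal sketch (the account hash is purely illustrative):

```ruby
require 'timeout'

account = { balance: 100, ledger: [] }

begin
  Timeout.timeout(0.05) do
    account[:balance] -= 30 # step 1 completes
    sleep 0.2               # Timeout::Error is raised mid-sleep...
    account[:ledger] << -30 # ...so step 2 never runs
  end
rescue Timeout::Error
  # the block was killed at an arbitrary point
end

account[:balance] # => 70
account[:ledger]  # => [] -- balance and ledger no longer agree
```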
**The Right Way: Cooperative Timeouts**

Always use timeout mechanisms provided by your libraries:

```ruby
# ✅ GOOD: HTTP client with built-in timeout
circuit :external_api do
  # timeout 3 # This is just documentation
  threshold failures: 5
end

def call_api
  circuit(:external_api).wrap do
    Faraday.get('https://api.example.com') do |req|
      req.options.timeout = 3      # Read timeout
      req.options.open_timeout = 2 # Connection timeout
    end
  end
end

# ✅ GOOD: Database with statement timeout
circuit :database_operation do
  threshold failures: 3
end

def perform_database_operation
  circuit(:database_operation).wrap do
    ActiveRecord::Base.transaction do
      # Use database-level timeouts
      ActiveRecord::Base.connection.execute("SET statement_timeout = '5s'")
      # Your operations here
    end
  end
end

# ✅ GOOD: Redis with command timeout
circuit :redis_cache do
  threshold failures: 5
end

def get_from_cache(key)
  circuit(:redis_cache).wrap do
    Redis.new(timeout: 3).get(key) # 3 second timeout
  end
end
```

**If You Absolutely Need Forceful Timeouts**

If you understand the risks and still need forceful timeouts, implement them yourself:

```ruby
# AT YOUR OWN RISK - This can corrupt state!
require 'timeout'

circuit(:dangerous_operation).wrap do
  Timeout.timeout(3) do
    # Your dangerous operation
  end
end
```

But seriously, don't do this. The Resistance has seen too many production incidents caused by forceful timeouts.
### Distributed Systems Considerations

When using distributed storage (Redis, PostgreSQL), circuits are **eventually consistent** across instances:

```ruby
# Instance A opens circuit at 10:00:00.000
circuit.trip!

# Instance B might still accept calls until 10:00:00.100
# This is by design for performance

# If you need immediate consistency:
circuit :critical_operation do
  storage :redis # Shared storage

  # Check storage before every call (slower but consistent)
  before_call do
    refresh_from_storage!
  end
end
```
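To see why that window exists, here's a toy model of the trade-off: each instance serves circuit state from a short-lived local cache instead of reading shared storage on every call (the `CachedCircuitState` class and TTL value are illustrative, not BreakerMachines internals).

```ruby
# Toy model: each instance caches circuit state briefly instead of
# hitting shared storage on every call.
class CachedCircuitState
  def initialize(store, ttl: 0.1)
    @store = store
    @ttl = ttl
    @cached = nil
    @fetched_at = nil
  end

  def open?
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @fetched_at.nil? || now - @fetched_at > @ttl
      @cached = @store[:open]
      @fetched_at = now
    end
    @cached
  end
end

store = { open: false }  # stands in for shared Redis state
instance_a = CachedCircuitState.new(store)
instance_b = CachedCircuitState.new(store)

instance_a.open?         # both instances prime their caches
instance_b.open?

store[:open] = true      # instance A trips the shared circuit
stale = instance_b.open? # => false -- B still serves its cache
sleep 0.15
fresh = instance_b.open? # => true -- consistent once the TTL expires
```

Lowering the TTL (or refreshing before every call, as above) shrinks the stale window at the cost of more round-trips to storage.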
### Thundering Herd Mitigation

We use jitter to prevent all instances from retrying simultaneously:

```ruby
circuit :payment_gateway do
  reset_after 60, jitter: 0.25 # ±25% randomization
  # Actual reset: 45-75 seconds
end
```
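The arithmetic behind that 45-75 second range can be sketched with a hypothetical helper (not the gem's internal implementation):

```ruby
# Spread a base delay by up to ±(jitter * 100)% using a uniform random factor
def reset_delay(base, jitter)
  base * (1.0 + (rand * 2 - 1) * jitter)
end

delays = Array.new(1_000) { reset_delay(60, 0.25) }
# every sampled delay falls within the documented 45..75 second band
delays.all? { |d| d >= 45.0 && d < 75.0 } # => true
```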
## Fiber Support (Optional)

For the modern Ruby developer using Fiber-based servers like Falcon, BreakerMachines offers optional `fiber_safe` mode. This is for those living on the edge with Ractors, Fibers, and async/await patterns.

**Important**: The `async` gem is completely optional. BreakerMachines works perfectly without it. You only need `async` if you want to use `fiber_safe` mode.

### Why Fiber Support?

Traditional circuit breakers block the entire thread during I/O operations. In a Fiber-based server, this freezes your entire event loop. Not ideal when you're trying to handle 10,000 concurrent requests on a single thread.

With `fiber_safe` mode, BreakerMachines becomes a good citizen in your async environment:

- **Non-blocking operations** that yield to the scheduler
- **Safe, cooperative timeouts** using `Async::Task`
- **Natural async/await integration**
- **No thread blocking** means better concurrency
### Enabling Fiber Support

First, add the `async` gem to your Gemfile (only if you want `fiber_safe` mode):

```ruby
gem 'async' # Only required for fiber_safe mode
```

Then configure globally or per-circuit:

```ruby
# Global configuration
BreakerMachines.configure do |config|
  config.fiber_safe = true
end

# Or per-circuit
circuit :openai_api, fiber_safe: true do
  threshold failures: 3, within: 60
  timeout 5 # Safe cooperative timeout!
  reset_after 30
end
```
### Example: AI Service with Safe Timeouts

```ruby
class AIService
  include BreakerMachines::DSL

  circuit :gpt4, fiber_safe: true do
    threshold failures: 2, within: 30
    timeout 10 # Cooperative timeout - won't corrupt state!

    fallback do |error|
      # Fallback can also be async
      Async do
        # Try a cheaper model
        openai.completions(model: 'gpt-3.5-turbo', prompt: @prompt)
      end
    end
  end

  def generate_response(prompt)
    @prompt = prompt
    circuit(:gpt4).wrap do
      # Returns an Async::Task in Falcon
      # Async::HTTP::Internet takes positional arguments:
      # post(url, headers, body), with the body as an array of chunks
      Async::HTTP::Internet.new.post(
        'https://api.openai.com/v1/completions',
        { 'Authorization' => "Bearer #{api_key}" },
        [{ model: 'gpt-4', prompt: prompt }.to_json]
      )
    end
  end
end
```
### Async Storage Backends

For true non-blocking operation, use async-compatible storage:

```ruby
# See docs/ASYNC_STORAGE_EXAMPLES.md for full implementations
class AsyncRedisStorage < BreakerMachines::Storage::Base
  def initialize
    @client = Async::Redis::Client.new
  end

  def record_failure(circuit_name, duration = nil)
    # Non-blocking Redis operation
    @client.hincrby("circuit:#{circuit_name}", 'failures', 1).wait
  end
end

BreakerMachines.configure do |config|
  config.fiber_safe = true
  config.default_storage = AsyncRedisStorage.new
end
```
### The Magic of Cooperative Timeouts

In `fiber_safe` mode, timeouts are actually safe:

```ruby
circuit :slow_api, fiber_safe: true do
  timeout 3 # This uses Async::Task.current.with_timeout
end

# This will timeout safely after 3 seconds without corruption
circuit(:slow_api).wrap do
  HTTP.get('https://slow-api.example.com/endpoint')
end
```

Unlike `Timeout.timeout` or `Thread#kill`, cooperative timeouts:

- Let operations clean up properly
- Don't corrupt state
- Work naturally with the event loop
- Are actually safe to use in production
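The core idea can be shown without the `async` gem at all: a cooperative timeout only checks the deadline at points the operation chooses, so every completed step stays consistent. A dependency-free sketch (the `DeadlineExceeded` error and chunked loop are illustrative, not the gem's API):

```ruby
DeadlineExceeded = Class.new(StandardError)

# Process work in units, checking the clock *between* units rather
# than being killed mid-unit by an externally raised exception.
def with_deadline(seconds, work_units)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  results = []
  work_units.each do |unit|
    raise DeadlineExceeded if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    results << unit * 2 # each completed unit is fully consistent
  end
  results
end

with_deadline(5, [1, 2, 3]) # => [2, 4, 6]

begin
  with_deadline(-1, [1, 2, 3]) # deadline already in the past
rescue DeadlineExceeded
  # work stopped at a clean boundary; nothing left half-done
end
```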
### Performance Benefits

In a Falcon server with fiber_safe circuits:

- **10x more concurrent requests** on the same hardware
- **Zero thread contention** (it's all on one thread)
- **Microsecond context switches** between Fibers
- **Natural integration** with async HTTP clients

### When to Use Fiber Mode

Use `fiber_safe: true` when:

- Running on Falcon, Async, or other Fiber-based servers
- Using async HTTP clients (async-http, async-redis)
- Building high-concurrency APIs
- You understand and embrace the async/await pattern

Stay with default mode when:

- Running on Puma, Unicorn, or thread-based servers
- Using traditional blocking I/O libraries
- Your team isn't ready for the Fiber life
- You need maximum compatibility

For more examples and implementation details, see [docs/ASYNC_STORAGE_EXAMPLES.md](docs/ASYNC_STORAGE_EXAMPLES.md).
## Contributing to the Resistance

1. Fork it (like it's 2005)
2. Create your feature branch (`git checkout -b feature/save-the-fleet`)
3. Commit your changes (`git commit -am 'Add quantum circuit breaker'`)
4. Push to the branch (`git push origin feature/save-the-fleet`)
5. Create a new Pull Request (and wait for the Council of Elders to review)

## License

MIT License

## Acknowledgments

- The `state_machines` gem - The reliable engine under our hood
- Every service that ever timed out - You taught me well
- The RMNS Atlas Monkey - For philosophical guidance
- The Resistance - For never giving up

## Support

If your circuits are breaking (the bad way), open an issue. If your circuits are breaking (the good way), you're welcome.

Remember: In space, no one can hear you retry.

*Remember: Without circuit breakers, even AI can enter infinite loops of existential confusion. Don't let your services have an existential crisis.*