breaker_machines 0.1.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -1,20 +1,20 @@
1
1
  # BreakerMachines
2
2
 
3
+ A battle-tested Ruby implementation of the Circuit Breaker pattern, built on `state_machines` for reliable distributed systems protection.
4
+
3
5
  ## Quick Start
4
6
 
5
7
  ```ruby
6
- # Install
7
8
  gem 'breaker_machines'
8
9
  ```
9
10
 
10
11
  ```ruby
11
- # Use (Classic Mode - Works Everywhere)
12
12
  class PaymentService
13
13
  include BreakerMachines::DSL
14
14
 
15
15
  circuit :stripe do
16
- threshold failures: 3, within: 60
17
- reset_after 30
16
+ threshold failures: 3, within: 1.minute
17
+ reset_after 30.seconds
18
18
  fallback { { error: "Payment queued for later" } }
19
19
  end
20
20
 
@@ -26,1869 +26,120 @@ class PaymentService
26
26
  end
27
27
  ```
28
28
 
29
- ```ruby
30
- # Use (Fiber Mode - Optional, requires 'async' gem)
31
- class AIService
32
- include BreakerMachines::DSL
33
-
34
- circuit :openai, fiber_safe: true do
35
- threshold failures: 2, within: 30
36
- timeout 5 # ACTUALLY SAFE! Uses Async::Task, not Thread#kill
37
- fallback { { error: "AI is contemplating existence, try again" } }
38
- end
39
-
40
- def generate(prompt)
41
- circuit(:openai).wrap do
42
- # Non-blocking in Falcon! Your event loop thanks you
43
- openai.completions(model: 'gpt-4', prompt: prompt)
44
- end
45
- end
46
- end
47
- ```
48
-
49
- That's it. Your service is now protected from cascading failures AND ready for the async future. Read on to understand why this matters.
50
-
51
- ## A Message to the Resistance
52
-
53
- So AI took your job while you were waiting for Fireship to drop the next JavaScript framework?
54
-
55
- Welcome to April 2005—when Git was born, branches were just `master`, and nobody cared about your pronouns. This is the pattern your company's distributed systems desperately need, explained in a way that won't make you fall asleep and impulse-buy developer swag just to feel something.
56
-
57
- Still reading? Good. Because in space, nobody can hear you scream about microservices. It's all just patterns and pain.
58
-
59
- ### The Pattern They Don't Want You to Know
60
-
61
- Built on the battle-tested `state_machines` gem, because I don't reinvent wheels here—I stop them from catching fire and burning down your entire infrastructure.
62
-
63
- BreakerMachines comes with `fiber_safe` mode out of the box. Cooperative timeouts, non-blocking I/O, Falcon server support—because it's 2025 and I built this for modern Ruby applications using Fibers, Ractors, and async patterns.
64
-
65
- 📖 **[Why I Open Sourced This](WHY_OPEN_SOURCE.md)** - The real story behind BreakerMachines, and why I decided to share it with the world.
66
-
67
- ## Chapter 1: The Year is 2005 (Stardate 2005.111)
68
-
69
- The Resistance huddles in the server rooms, the last bastion against the cascade failures. Outside, the microservices burn. Redis Ship Com is down. PostgreSQL Life Support is flatlining.
70
-
71
- And somewhere in the darkness, a junior developer is about to write:
72
-
73
- ```ruby
74
- def fetch_user_data
75
- retry_count = 0
76
- begin
77
- @redis.get(user_id)
78
- rescue => e
79
- retry_count += 1
80
- retry if retry_count < Float::INFINITY # "It'll work eventually"
81
- end
82
- end
83
- ```
84
-
85
- "This," whispers the grizzled ops engineer, "is how civilizations fall."
86
-
87
- ![Cascade Failure Control Room](images/cascade-failure-control-room.webp)
88
- *Typical day at Corporate HQ during a microservice apocalypse. Note the executives frantically googling "what is exponential backoff"*
89
-
90
- ## The Hidden State Machine
91
-
92
- They built this on `state_machines` because sometimes, Resistance, you need a tank, not another JavaScript framework.
93
-
94
- ```mermaid
95
- stateDiagram-v2
96
- [*] --> closed: Birth of Hope
97
- closed --> open: Too Many Failures (Reality Check)
98
- open --> half_open: Time Heals (But Not Your Kubernetes Cluster)
99
- half_open --> closed: Service Restored (Temporary Victory)
100
- half_open --> open: Still Broken (Welcome to Production)
101
-
102
- note right of closed: All services operational\n(Don't get comfortable)
103
- note right of open: Circuit broken\n(At least it's honest)
104
- note right of half_open: Testing the waters\n(Like deploying on Friday)
105
- ```
106
-
107
- ![Technical View](images/cascade-failure-technical-view.webp)
108
- *Your microservices architecture after a bootcamp graduate learns about retries. The green lines? Those are your CPU cycles escaping.*
109
-
110
- ## What You Think You're Doing vs Reality
111
-
112
- ### You Think: "I'm implementing retry logic for resilience!"
113
- ### Reality: You're DDOSing your own infrastructure
114
-
115
- ```mermaid
116
- graph LR
117
- A[Your Service] -->|Timeout| B[Retry]
118
- B -->|Timeout| C[Retry Harder]
119
- C -->|Timeout| D[Retry With Feeling]
120
- D -->|Dies| E[Takes Down Redis]
121
- E --> F[PostgreSQL Follows]
122
- F --> G[Ractor Cores Meltdown]
123
- G --> H[🔥 Everything Is Fire 🔥]
124
- ```
125
-
126
- ![Pattern Cascade Visualization](images/pattern-cascade-visualization.webp)
127
- *Visual representation of your weekend disappearing because you trusted exponential backoff. Each node is another pager alert.*
128
-
129
- ### The Truth the Bootcamps Won't Tell You:
130
- When your Redis Ship Com and PostgreSQL Life Support go offline, should your Ractor just explode and swallow the fleet?
131
-
132
- No, Resistance. That's what *they* do. We do better.
133
-
134
- ## The Cost of Ignorance: Real-World Massacres
135
-
136
- ### Amazon DynamoDB Meltdown (September 20, 2015)
137
- - **The Trigger**: A transient network blip
138
- - **The Storm**: Storage servers couldn't get partition assignments, started retrying
139
- - **The Cascade**: Metadata servers overwhelmed by retry storm
140
- - **The Death Spiral**: More timeouts → More retries → Complete service collapse
141
- - **Duration**: 4+ hours of downtime in US-East-1
142
- - **The Solution**: Had to literally firewall off the metadata service to add capacity
143
- - **Corporate Response**: "It was a learning experience" (Translation: Someone got fired)
144
-
145
- ### Netflix's AWS Nightmare
146
- > "When service instances go down, the remaining nodes pick up the slack. Eventually, they suffer a cascading failure where all nodes go down. A third of our traffic goes into a black hole."
147
- — Netflix Engineering
148
-
149
- **What They Learned**: Manual responses don't scale. You need circuit breakers.
150
-
151
- ### Google's Exponential Doom
152
- From Google SRE's own documentation:
153
- - 100 failed queries/second with 1000ms retry interval
154
- - Backend receives 10,200 QPS (only 200 QPS of actual new requests)
155
- - Retries grow exponentially: 100 → 200 → 300 → ∞
156
- - **Result**: Complete backend crash from retry storm alone
157
-
158
- This is what happens without circuit breakers. This is why you're here.
159
-
160
- ## The Weapon of the Resistance
161
-
162
- ```ruby
163
- # In 2005, we don't need your pronouns. We need patterns that work.
164
- class SpaceshipCommand
165
- include BreakerMachines::DSL
166
-
167
- # When Redis Ship Com inevitably fails
168
- circuit :redis_ship_com do
169
- threshold failures: 3, within: 60 # Three strikes, you're out
170
- reset_after 30 # Give it time to think about what it's done
171
-
172
- fallback do
173
- # This is where we separate the bootcamp grads from the Resistance
174
- emergency_broadcast("Redis is dead. Long live the cache.")
175
- end
176
-
177
- on_open do
178
- alert_the_resistance("Redis circuit opened. Brace for impact.")
179
- end
180
- end
181
-
182
- # PostgreSQL Life Support - because your data matters more than your feelings
183
- circuit :postgresql_life_support do
184
- threshold failures: 2, within: 30
185
- # timeout 5 # Document your intent, but implement timeouts in your DB client
186
-
187
- fallback { activate_emergency_oxygen }
188
-
189
- on_open do
190
- captain_log <<~LOG
191
- Life support critical.
192
- If you're reading this, tell my wife I love her.
193
- Also, check the connection pool settings.
194
- LOG
195
- end
196
- end
197
- end
198
- ```
199
-
200
- ## Battle-Tested Scenarios
201
-
202
- ### Scenario 1: The Redis Apocalypse
203
- Your cache layer dies. Do you:
204
- - A) Hammer it with retries until your CPU melts
205
- - B) Let BreakerMachines handle it like an adult
206
-
207
- ### Scenario 2: The Ractor Meltdown
208
- Your concurrent processing goes supernova. Without circuit breakers, your Ractors will consume everything in their path, like a black hole of CPU cycles and broken dreams.
209
-
210
- ```ruby
211
- circuit :ractor_cooling do
212
- # Prevent the cascade that swallows fleets
213
- threshold failures: 5, within: 120
214
-
215
- fallback do
216
- # Throttle before you become a cautionary tale
217
- emergency_cooling_protocol
218
- end
219
- end
220
- ```
221
-
222
- ## Joining the Resistance
223
-
224
- In your Gemfile (yes, I still use those in 2005):
225
-
226
- ```ruby
227
- gem 'breaker_machines'
228
- gem 'state_machines', '>= 0.4.0' # The engine of rebellion
229
- ```
230
-
231
- Then:
232
- ```bash
233
- $ bundle install # No NPM. No Yarn. Just Ruby and determination.
234
- ```
235
-
236
- ## Configuration: Setting Your Battle Parameters
237
-
238
- ```ruby
239
- BreakerMachines.configure do |config|
240
- config.default_reset_timeout = 60 # seconds of mourning before retry
241
- config.default_failure_threshold = 5 # strikes before you're out
242
- config.log_events = true # false if you prefer ignorance
243
- # Note: Timeouts must be implemented in your client libraries (HTTP, DB, etc.)
244
- end
245
- ```
246
-
247
- ## Intelligent Threshold Configuration: The Decision Matrix
248
-
249
- ### Stop Guessing, Start Knowing
250
-
251
- | Service Criticality | Failure Threshold | Suggested Timeout | Reset Time | Example Services |
252
- |---------------------|-------------------|-------------------|------------|------------------|
253
- | 🚨 **CRITICAL** | 2 failures/30s | 3s (in client) | 120s | Payment, Auth, Orders |
254
- | ⚠️ **HIGH** | 3 failures/60s | 5s (in client) | 60s | User API, Cart, Search |
255
- | ✅ **MEDIUM** | 5 failures/120s | 10s (in client) | 30s | Notifications, Analytics |
256
- | 💤 **LOW** | 10 failures/300s | 30s (in client) | 15s | Recommendations, Logging |
257
-
258
- **Your CTO**: "But why can't we just use the same settings for everything?"
259
- **Reality**: Because that's how you end up like DynamoDB in 2015.
260
-
261
- ### The Smart Threshold Formula
262
- ```
263
- threshold = base_threshold * (1 / criticality_score) * traffic_multiplier
264
-
265
- Where:
266
- - criticality_score: 1.0 (critical) to 0.1 (low priority)
267
- - traffic_multiplier: avg_requests_per_minute / 1000
268
- - base_threshold: 5 (default)
269
- ```
270
-
271
- **Corporate Architect Translation**: "It's complex because we can bill more hours explaining it."
272
-
273
- ### Real Implementation Examples
274
-
275
- ```ruby
276
- # Critical Payment Service
277
- class PaymentProcessor
278
- include BreakerMachines::DSL
279
-
280
- circuit :stripe_api do
281
- threshold failures: 2, within: 30
282
- reset_after 120
283
- # timeout 3 # Implement in Stripe client configuration
284
-
285
- fallback do
286
- # Queue for manual processing
287
- PaymentQueue.add(payment_params)
288
- { status: 'queued', message: 'Payment will be processed within 24 hours' }
289
- end
290
-
291
- on_open do
292
- AlertService.critical("Stripe API circuit opened!")
293
- Metrics.increment('payment.circuit.opened')
294
- end
295
-
296
- on_half_open do
297
- Rails.logger.info "Testing Stripe API recovery..."
298
- end
299
- end
300
-
301
- def charge_customer(amount, customer_id)
302
- circuit(:stripe_api).wrap do
303
- # Stripe SDK handles timeouts internally
304
- Stripe::Charge.create(
305
- amount: amount,
306
- currency: 'usd',
307
- customer: customer_id
308
- )
309
- end
310
- end
311
- end
312
-
313
- # Medium Priority Service
314
- class EmailService
315
- include BreakerMachines::DSL
316
-
317
- circuit :sendgrid do
318
- threshold failures: 5, within: 120
319
- reset_after 30
320
- # Configure timeout in SendGrid client
321
-
322
- fallback do
323
- # Store for retry later
324
- EmailRetryJob.perform_later(email_params)
325
- { queued: true }
326
- end
327
- end
328
-
329
- def send_welcome_email(user)
330
- circuit(:sendgrid).wrap do
331
- SendGrid::Mail.new(
332
- to: user.email,
333
- subject: "Welcome to the Resistance",
334
- body: "Your circuits are now protected"
335
- ).deliver!
336
- end
337
- end
338
- end
339
- ```
340
-
341
- ## Advanced Warfare: Complex Circuit Patterns
29
+ ## Features
342
30
 
343
- ### The Cascading Service Pattern
344
- When services depend on each other like dominoes:
31
+ - **Thread-safe** circuit breaker implementation
32
+ - **Fiber-safe mode** for async Ruby (Falcon, async gem)
33
+ - **Hedged requests** for latency reduction
34
+ - **Multiple backends** with automatic failover
35
+ - **Bulkheading** to limit concurrent requests
36
+ - **Percentage-based thresholds** with minimum call requirements
37
+ - **Dynamic circuit breakers** with templates for runtime creation
38
+ - **Pluggable storage** (Memory, Redis, Custom)
39
+ - **Rich callbacks** and instrumentation
40
+ - **ActiveSupport::Notifications** integration
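+
+ Callbacks and notifications can be consumed with a standard subscriber. A minimal sketch, mirroring the subscription example from earlier versions of this README (the `breaker_machines.*` event namespace and `payload[:circuit]` key come from that example; the logger call is illustrative):
+
+ ```ruby
+ ActiveSupport::Notifications.subscribe(/^breaker_machines\./) do |name, start, finish, _id, payload|
+   event = name.split('.').last # e.g. "opened", "closed"
+   Rails.logger.info("circuit=#{payload[:circuit]} event=#{event} duration=#{(finish - start).round(3)}s")
+ end
+ ```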
345
41
 
346
- ```ruby
347
- class FleetCoordinator
348
- include BreakerMachines::DSL
42
+ ## Documentation
349
43
 
350
- circuit :navigation_system do
351
- threshold failures: 3, within: 60
44
+ - **Getting Started Guide** (docs/GETTING_STARTED.md) - Installation and basic usage
45
+ - **Configuration Reference** (docs/CONFIGURATION.md) - All configuration options
46
+ - **Advanced Patterns** (docs/ADVANCED_PATTERNS.md) - Complex scenarios and patterns
47
+ - **Persistence Options** (docs/PERSISTENCE.md) - Storage backends and distributed state
48
+ - **Observability Guide** (docs/OBSERVABILITY.md) - Monitoring and metrics
49
+ - **Async Mode** (docs/ASYNC.md) - Fiber-safe operations
50
+ - **Testing Guide** (docs/TESTING.md) - Testing strategies
51
+ - [RSpec Testing](docs/TESTING_RSPEC.md)
52
+ - [ActiveSupport Testing](docs/TESTING_ACTIVESUPPORT.md)
53
+ - **Rails Integration** (docs/RAILS_INTEGRATION.md) - Rails-specific patterns
54
+ - **Horror Stories** (docs/HORROR_STORIES.md) - Real production failures and lessons learned
55
+ - **API Reference** (docs/API_REFERENCE.md) - Complete API documentation
352
56
 
353
- fallback do
354
- # When GPS fails, use the stars like your ancestors
355
- celestial_navigation_mode
356
- end
357
- end
57
+ ## Why BreakerMachines?
358
58
 
359
- circuit :weapons_system do
360
- threshold failures: 5, within: 120
59
+ Built on the battle-tested `state_machines` gem, BreakerMachines provides production-ready circuit breaker functionality without reinventing the wheel. It's designed for modern Ruby applications with first-class support for fibers, async operations, and distributed systems.
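+
+ Fiber-safe mode is enabled per circuit. A minimal sketch, adapted from the fiber-mode quick start in earlier versions of this README (the `openai` client call is illustrative; see docs/ASYNC.md):
+
+ ```ruby
+ class AIService
+   include BreakerMachines::DSL
+
+   # fiber_safe uses cooperative, non-blocking primitives (requires the 'async' gem)
+   circuit :openai, fiber_safe: true do
+     threshold failures: 2, within: 30.seconds
+     fallback { { error: 'AI service unavailable, try again shortly' } }
+   end
+
+   def generate(prompt)
+     circuit(:openai).wrap { openai.completions(model: 'gpt-4', prompt: prompt) }
+   end
+ end
+ ```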
361
60
 
362
- # Weapons can fail more - we're not warmongers
363
- fallback { diplomatic_solution }
364
- end
61
+ See [Why I Open Sourced This](docs/WHY_OPEN_SOURCE.md) for the full story.
365
62
 
366
- def engage_autopilot
367
- circuit(:navigation_system).wrap do
368
- circuit(:weapons_system).wrap do
369
- plot_course_and_defend
370
- end
371
- end
372
- end
373
- end
374
- ```
63
+ ## Production-Ready Features
375
64
 
376
- ### The Half-Open Dance
377
- The delicate ballet of service recovery:
65
+ ### Hedged Requests
66
+ Reduce latency by sending duplicate requests and using the first successful response:
378
67
 
379
68
  ```ruby
380
- circuit :quantum_stabilizer do
381
- threshold failures: 3, within: 60
382
- reset_after 30
383
- half_open_requests 3 # Test with caution
384
-
385
- on_half_open do
386
- whisper_to_logs("Testing quantum stabilizer... nobody breathe...")
387
- end
388
-
389
- on_close do
390
- celebrate("Quantum stabilizer online! Reality is stable!")
69
+ circuit :api do
70
+ hedged do
71
+ delay 100 # Start second request after 100ms
72
+ max_requests 3 # Maximum parallel requests
391
73
  end
392
74
  end
393
75
  ```
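+
+ Hedged execution stays transparent at the call site; you wrap calls exactly as with a normal circuit (a sketch, where `fetch_profile` stands in for your own I/O call):
+
+ ```ruby
+ # Per the description above, the first successful response wins
+ profile = circuit(:api).wrap { fetch_profile(user_id) }
+ ```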
394
76
 
395
- ### Database Connection Management
396
- Stop killing your connection pool:
77
+ ### Multiple Backends
78
+ Configure automatic failover across multiple service endpoints:
397
79
 
398
80
  ```ruby
399
- class DatabaseService
400
- include BreakerMachines::DSL
401
-
402
- circuit :primary_db do
403
- threshold failures: 3, within: 30
404
- reset_after 45
405
- # Use database statement_timeout instead
406
-
407
- fallback do |error|
408
- # Failover to read replica
409
- # In a real app, you'd extract id from the error context
410
- # For this example, we'll use a simpler approach
411
- read_from_replica(@current_user_id)
412
- end
413
-
414
- on_open do
415
- # Switch all traffic to replica
416
- DatabaseFailover.activate_read_replica!
417
- PagerDuty.trigger("Primary DB circuit opened - failover activated")
418
- end
419
- end
420
-
421
- circuit :replica_db do
422
- threshold failures: 5, within: 60
423
- reset_after 30
424
-
425
- fallback do |error|
426
- # Last resort: serve from cache
427
- serve_stale_cache_data(@current_user_id)
428
- end
429
- end
430
-
431
- def find_user(id)
432
- @current_user_id = id # Store for fallback use
433
- circuit(:primary_db).wrap do
434
- User.find(id)
435
- end
436
- end
437
-
438
- private
439
-
440
- def read_from_replica(id)
441
- circuit(:replica_db).wrap do
442
- User.read_replica.find(id)
443
- end
444
- end
445
-
446
- def serve_stale_cache_data(id)
447
- Rails.cache.fetch("user:#{id}", expires_in: 1.hour) do
448
- { error: "Service temporarily unavailable", cached: true }
449
- end
450
- end
81
+ circuit :multi_region do
82
+ backends [
83
+ -> { fetch_from_primary },
84
+ -> { fetch_from_secondary },
85
+ -> { fetch_from_tertiary }
86
+ ]
451
87
  end
452
88
  ```
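+
+ Each backend is simply a callable. A sketch of what one of the lambdas above might wrap (the Faraday client and URL are illustrative; raising on failure is assumed to trigger failover to the next backend):
+
+ ```ruby
+ def fetch_from_primary
+   # Raise on HTTP errors so failover can move on to the next backend
+   conn = Faraday.new(url: 'https://primary.example.com') { |f| f.response :raise_error }
+   conn.get('/data').body
+ end
+ ```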
453
89
 
454
- ### Faraday Client Protection
455
- Because external APIs love to fail:
90
+ ### Percentage-Based Thresholds
91
+ Open circuits based on error rates instead of absolute counts:
456
92
 
457
93
  ```ruby
458
- class ExternalAPIClient
459
- include BreakerMachines::DSL
460
-
461
- circuit :third_party_api do
462
- threshold failures: 4, within: 60
463
- reset_after 60
464
-
465
- fallback do |error|
466
- case error
467
- when Faraday::TimeoutError
468
- { error: "Service slow, please retry later" }
469
- when Faraday::ConnectionFailed
470
- { error: "Service unreachable" }
471
- when Faraday::ResourceNotFound
472
- { error: "Resource not found", status: 404 }
473
- else
474
- { error: "Service temporarily unavailable" }
475
- end
476
- end
477
-
478
- # Track everything
479
- on_open { Metrics.increment('external_api.circuit_opened') }
480
- on_close { Metrics.increment('external_api.circuit_closed') }
481
- on_reject { Metrics.increment('external_api.circuit_rejected') }
482
- end
483
-
484
- def connection
485
- @connection ||= Faraday.new(url: BASE_URL) do |faraday|
486
- faraday.request :json
487
- faraday.response :json
488
- faraday.response :raise_error # Raise on 4xx/5xx
489
- faraday.adapter Faraday.default_adapter
490
- end
491
- end
492
-
493
- def fetch_data(endpoint)
494
- circuit(:third_party_api).wrap do
495
- response = connection.get(endpoint) do |req|
496
- req.headers['Authorization'] = "Bearer #{token}"
497
- req.options.timeout = 10
498
- req.options.open_timeout = 5
499
- end
500
-
501
- response.body
502
- end
503
- end
504
-
505
- def post_data(endpoint, payload)
506
- circuit(:third_party_api).wrap do
507
- response = connection.post(endpoint) do |req|
508
- req.headers['Authorization'] = "Bearer #{token}"
509
- req.body = payload
510
- req.options.timeout = 10
511
- end
512
-
513
- response.body
514
- end
515
- end
94
+ circuit :high_traffic do
95
+ threshold failure_rate: 0.5, minimum_calls: 10, within: 60
516
96
  end
517
97
  ```
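+
+ Rate-based thresholds compose with the rest of the DSL just like count-based ones. A sketch (the circuit name, rates, and fallback payload are illustrative):
+
+ ```ruby
+ circuit :search do
+   # Open once at least 25% of a minimum of 20 calls fail within a 2-minute window
+   threshold failure_rate: 0.25, minimum_calls: 20, within: 2.minutes
+   reset_after 30.seconds
+   fallback { { results: [], degraded: true } }
+ end
+ ```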
518
98
 
519
- ### ActiveJob Protection
520
- Don't let failing jobs murder your workers:
99
+ ### Dynamic Circuit Breakers
100
+ Create circuit breakers at runtime for webhook delivery, API proxies, or per-tenant isolation:
521
101
 
522
102
  ```ruby
523
- class DataProcessingJob < ApplicationJob
524
- include BreakerMachines::DSL
525
-
526
- # Configure job retries to work with circuit breakers
527
- retry_on StandardError, wait: :exponentially_longer, attempts: 3
528
-
529
- circuit :s3_upload do
530
- threshold failures: 3, within: 120
531
- reset_after 300 # 5 minutes - S3 is having a bad day
532
-
533
- fallback do
534
- # Store locally and retry later
535
- LocalStorage.store(file_data)
536
- S3RetryJob.perform_later(file_data)
537
- { status: 'queued_locally' }
538
- end
539
- end
540
-
541
- circuit :ml_api do
542
- threshold failures: 2, within: 60
543
- reset_after 120
544
- # ML operations need long timeouts - configure in HTTP client
545
-
546
- fallback do
547
- # Use simpler algorithm
548
- BasicAlgorithm.process(data)
549
- end
550
- end
551
-
552
- def perform(file_id)
553
- file_data = fetch_file(file_id)
554
-
555
- # Process with ML
556
- result = circuit(:ml_api).wrap do
557
- MLService.analyze(file_data)
558
- end
559
-
560
- # Upload results
561
- upload_result = circuit(:s3_upload).wrap do
562
- S3.upload(result)
563
- end
564
-
565
- # Check if we need to retry later
566
- if upload_result[:status] == 'queued_locally'
567
- logger.info "S3 circuit open, will retry upload later"
568
- end
569
- end
570
- end
571
-
572
- # Sidekiq-specific protection
573
- class SidekiqWorker
574
- include Sidekiq::Worker
103
+ class WebhookService
575
104
  include BreakerMachines::DSL
576
105
 
577
- sidekiq_options retry: 3, dead: false
578
-
579
- circuit :external_service do
580
- threshold failures: 5, within: 300
581
- reset_after 600 # 10 minutes
582
-
583
- fallback do
584
- # Don't retry immediately - requeue for later
585
- self.class.perform_in(30.minutes, *@job_args)
586
- { status: 'requeued' }
587
- end
588
-
589
- on_open do
590
- Sidekiq.logger.warn "Circuit opened for #{self.class.name}"
591
- # Could pause the queue here if needed
592
- end
593
- end
594
-
595
- def perform(*args)
596
- @job_args = args # Store for fallback
597
-
598
- circuit(:external_service).wrap do
599
- # Your actual job logic here
600
- process_data(*args)
601
- end
602
- end
603
- end
604
- ```
605
-
606
- ## Production Deployment: Don't Be Like DynamoDB
607
-
608
- **Enterprise Deployment Strategy**: "YOLO push to prod at 4:59 PM Friday"
609
- **Resistance Strategy**: Actually test things first
610
-
611
- ### Chaos Engineering Your Circuits
612
- ```ruby
613
- # Test in production (safely)
614
- class CircuitChaosMonkey
615
- # Not to be confused with RMNS Atlas Monkey - this one breaks things on purpose
616
- def self.simulate_cascading_failure
617
- # Randomly trip circuits to test recovery
618
- if rand < 0.01 && ENV['ENABLE_CHAOS'] == 'true'
619
- circuit = [:redis, :postgresql, :external_api].sample
620
- BreakerMachines.circuit(circuit).send(:trip)
621
-
622
- notify_team("Chaos Monkey tripped #{circuit} circuit")
623
- end
106
+ circuit_template :webhook_default do
107
+ threshold failures: 3, within: 1.minute
108
+ fallback { |error| { delivered: false, error: error.message } }
624
109
  end
625
- end
626
-
627
- # Run during business hours when everyone's awake
628
- ```
629
110
 
630
- ### Canary Deployments
631
- ```ruby
632
- # Roll out circuit breaker changes gradually
633
- class CanaryCircuitConfig
634
- def self.configure_for_canary(percentage: 10)
635
- if rand(100) < percentage
636
- # New, more aggressive thresholds
637
- circuit :payment_api do
638
- threshold failures: 2, within: 30
639
- reset_after 60
640
- end
641
- else
642
- # Conservative production config
643
- circuit :payment_api do
644
- threshold failures: 5, within: 60
645
- reset_after 120
111
+ def deliver_webhook(url, payload)
112
+ domain = URI.parse(url).host
113
+ circuit_name = "webhook_#{domain}".to_sym
114
+
115
+ dynamic_circuit(circuit_name, template: :webhook_default) do
116
+ # Custom per-domain configuration
117
+ if domain.include?('reliable-service.com')
118
+ threshold failures: 5, within: 2.minutes
646
119
  end
120
+ end.wrap do
121
+ send_webhook(url, payload)
647
122
  end
648
123
  end
649
124
  end
650
125
  ```
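+
+ Usage is then an ordinary method call. A hypothetical invocation (`send_webhook` is assumed to be implemented elsewhere in your class):
+
+ ```ruby
+ # Each webhook domain gets its own circuit, created on first use from the template
+ WebhookService.new.deliver_webhook('https://hooks.example.com/events', { event: 'order.created' })
+ ```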
651
126
 
652
- ## Prove Your Worth (Testing)
653
-
654
- ### Because "It Works On My Machine" Isn't a Deployment Strategy
655
-
656
- **Enterprise Best Practice**: "We'll test it in production"
657
- **Translation**: "We have no idea what we're doing"
658
-
659
- ```ruby
660
- # In 2005, we test our code. Shocking, I know.
661
- # Unlike your enterprise architects who think QA is optional
662
- class TestTheApocalypse < ActiveSupport::TestCase
663
- def setup
664
- @ship = SpaceshipCommand.new
665
- end
666
-
667
- def test_redis_dies_gracefully
668
- # Simulate the end times
669
- redis_stub = ->(_) { raise Redis::TimeoutError }
670
-
671
- @ship.circuit(:redis_ship_com).stub(:execute_call, redis_stub) do
672
- 3.times { @ship.fetch_from_cache("hope") }
673
- end
674
-
675
- assert @ship.circuit(:redis_ship_com).open?
676
- assert_equal "emergency_broadcast", @ship.fetch_from_cache("anything")
677
- end
678
-
679
- def test_postgresql_life_support_holds
680
- # When the database has a bad day
681
- 2.times do
682
- @ship.circuit(:postgresql_life_support).wrap do
683
- raise PG::ConnectionBad
684
- end rescue nil
685
- end
686
-
687
- result = @ship.get_vital_signs
688
- assert_equal "emergency_oxygen_activated", result
689
- end
690
- end
691
- ```
692
-
693
- ### Testing Circuit Inheritance
694
- ```ruby
695
- class TestCircuitInheritance < ActiveSupport::TestCase
696
- def setup
697
- @parent_class = Class.new do
698
- include BreakerMachines::DSL
699
-
700
- circuit :shared_service do
701
- threshold failures: 3, within: 60
702
- fallback { "parent fallback" }
703
- end
704
- end
705
-
706
- @child_class = Class.new(@parent_class) do
707
- circuit :shared_service do
708
- threshold failures: 1, within: 30 # More strict
709
- fallback { "child fallback" }
710
- end
711
- end
712
- end
713
-
714
- def test_child_overrides_parent_circuit
715
- child_instance = @child_class.new
127
+ ## Contributing
716
128
 
717
- # Child should fail after 1 failure, not 3
718
- child_instance.circuit(:shared_service).wrap { raise "boom" } rescue nil
129
+ 1. Fork it
130
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
131
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
132
+ 4. Push to the branch (`git push origin my-new-feature`)
133
+ 5. Create a new Pull Request
719
134
 
720
- assert child_instance.circuit(:shared_service).open?
721
-
722
- # Verify child's fallback is used
723
- result = child_instance.circuit(:shared_service).wrap { "never called" }
724
- assert_equal "child fallback", result
725
- end
726
- end
727
- ```
135
+ ## License
728
136
 
729
- ### Testing Concurrent Access
730
- ```ruby
731
- class TestConcurrentCircuits < ActiveSupport::TestCase
732
- def test_thread_safety_under_load
733
- service = Class.new do
734
- include BreakerMachines::DSL
137
+ MIT License. See the [LICENSE](LICENSE) file for details.
735
138
 
736
- circuit :api do
737
- threshold failures: 10, within: 1
738
- reset_after 5
739
- end
740
- end.new
139
+ ## Author
741
140
 
742
- failure_count = Concurrent::AtomicFixnum.new(0)
743
- success_count = Concurrent::AtomicFixnum.new(0)
744
-
745
- # Hammer it with 100 threads
746
- threads = 100.times.map do
747
- Thread.new do
748
- 10.times do
749
- begin
750
- service.circuit(:api).wrap do
751
- if rand > 0.7 # 30% failure rate
752
- raise "Random failure"
753
- end
754
- "success"
755
- end
756
- success_count.increment
757
- rescue
758
- failure_count.increment
759
- end
760
- end
761
- end
762
- end
763
-
764
- threads.each(&:join)
765
-
766
- # Circuit should have opened at some point
767
- assert failure_count.value > 0
768
- assert success_count.value > 0
769
-
770
- # No race conditions or crashes
771
- assert_equal 1000, failure_count.value + success_count.value
772
- end
773
- end
774
- ```
775
-
776
- ## State Persistence (For When You Reboot in Panic)
777
-
778
- ### Storage Options
779
-
780
- ```ruby
781
- BreakerMachines.configure do |config|
782
- # Default: Efficient sliding window with event tracking
783
- config.default_storage = :bucket_memory
784
-
785
- # Alternative: Simple in-memory storage
786
- config.default_storage = :memory
787
-
788
- # Minimal overhead: No metrics or logging
789
- config.default_storage = :null
790
-
791
- # Or use Redis for distributed state
792
- config.default_storage = RedisCircuitStorage.new
793
- end
794
- ```
795
-
796
- ### Null Storage (For Maximum Performance)
797
-
798
- When you need circuit breakers but don't need metrics or event logs:
799
-
800
- ```ruby
801
- # Global configuration
802
- BreakerMachines.configure do |config|
803
- config.default_storage = :null
804
- end
805
-
806
- # Or per-circuit
807
- circuit :external_api do
808
- storage :null # No overhead, just protection
809
- threshold failures: 5, within: 60
810
- end
811
- ```
812
-
813
- Use this when:
814
- - You have external monitoring (Datadog, New Relic)
815
- - You're in a performance-critical path
816
- - You only care about the circuit breaker behavior, not metrics
817
-
818
- ### Redis-Backed Persistence
819
-
820
- **Note**: The following Redis and PostgreSQL examples are templates for you to adapt. They're not built into the gem - implement them based on your needs.
821
-
822
- ```ruby
823
- # config/initializers/breaker_machines.rb
824
- require 'redis'
825
-
826
- class RedisCircuitStorage
827
- def initialize(redis: Redis.new, prefix: 'circuit_breaker:')
828
- @redis = redis
829
- @prefix = prefix
830
- end
831
-
832
- def get_status(circuit_name)
833
- data = @redis.hgetall("#{@prefix}#{circuit_name}")
834
- return nil if data.empty?
835
-
836
- {
837
- status: data['status'].to_sym,
838
- opened_at: data['opened_at']&.to_f,
839
- failure_count: data['failure_count'].to_i,
840
- success_count: data['success_count'].to_i,
841
- last_failure_at: data['last_failure_at']&.to_f
842
- }
843
- end
844
-
845
- def set_status(circuit_name, status, opened_at = nil)
846
- key = "#{@prefix}#{circuit_name}"
847
-
848
- @redis.multi do |r|
849
- r.hset(key, 'status', status.to_s)
850
- r.hset(key, 'opened_at', opened_at) if opened_at
851
- r.expire(key, 3600) # Auto-cleanup after 1 hour
852
- end
853
- end
854
-
855
- def record_failure(circuit_name)
856
- key = "#{@prefix}#{circuit_name}"
857
- @redis.multi do |r|
858
- r.hincrby(key, 'failure_count', 1)
859
- r.hset(key, 'last_failure_at', Time.now.to_f)
860
- end
861
- end
862
-
863
- def record_success(circuit_name)
864
- @redis.hincrby("#{@prefix}#{circuit_name}", 'success_count', 1)
865
- end
866
-
867
- def reset(circuit_name)
868
- @redis.del("#{@prefix}#{circuit_name}")
869
- end
870
- end
871
-
872
- # Use it
873
- BreakerMachines.configure do |config|
874
- config.storage = RedisCircuitStorage.new(
875
- redis: Redis.new(url: ENV['REDIS_URL']),
876
- prefix: "breakers:#{Rails.env}:"
877
- )
878
- end
879
- ```
880
-
881
- ### PostgreSQL-Backed Persistence (For the Paranoid)
882
- ```ruby
883
- # db/migrate/xxx_create_circuit_breaker_states.rb
884
- class CreateCircuitBreakerStates < ActiveRecord::Migration[8.0]
885
- def change
886
- create_table :circuit_breaker_states do |t|
887
- t.string :circuit_name, null: false
888
- t.string :status, null: false
889
- t.datetime :opened_at
890
- t.integer :failure_count, default: 0
891
- t.integer :success_count, default: 0
892
- t.datetime :last_failure_at
893
- t.timestamps
894
-
895
- t.index :circuit_name, unique: true
896
- t.index :updated_at # For cleanup
897
- end
898
- end
899
- end
900
-
901
- # app/models/circuit_breaker_state.rb
902
- class CircuitBreakerState < ApplicationRecord
903
- # Cleanup old records
904
- scope :stale, -> { where('updated_at < ?', 1.day.ago) }
905
-
906
- def self.cleanup!
907
- stale.delete_all
908
- end
909
- end
910
-
911
- # lib/postgresql_circuit_storage.rb
912
- class PostgreSQLCircuitStorage
913
- def get_status(circuit_name)
914
- record = CircuitBreakerState.find_by(circuit_name: circuit_name)
915
- return nil unless record
916
-
917
- {
918
- status: record.status.to_sym,
919
- opened_at: record.opened_at&.to_f,
920
- failure_count: record.failure_count,
921
- success_count: record.success_count,
922
- last_failure_at: record.last_failure_at&.to_f
923
- }
924
- end
925
-
926
- def set_status(circuit_name, status, opened_at = nil)
927
- CircuitBreakerState.upsert({
928
- circuit_name: circuit_name,
929
- status: status.to_s,
930
- opened_at: opened_at ? Time.at(opened_at) : nil,
931
- updated_at: Time.current
932
- }, unique_by: :circuit_name)
933
- end
934
-
935
- def record_failure(circuit_name)
936
- CircuitBreakerState
937
- .upsert_all([{
938
- circuit_name: circuit_name,
939
- failure_count: 1,
940
- last_failure_at: Time.current,
941
- updated_at: Time.current
942
- }],
943
- unique_by: :circuit_name,
944
- on_duplicate: Arel.sql(
945
- 'failure_count = circuit_breaker_states.failure_count + 1, ' \
946
- 'last_failure_at = EXCLUDED.last_failure_at, ' \
947
- 'updated_at = EXCLUDED.updated_at'
948
- ))
949
- end
950
- end
951
- ```
952
-
953
- ## Advanced Observability: See Everything, Understand Everything
954
-
955
- ### Because If Your Metrics Aren't Visible, Neither Is Your Incompetence
956
-
957
- **Corporate Monitoring Strategy**: "We'll check the logs... eventually"
958
- **Reality**: 47GB of "Retrying..." messages and no actual insights
959
-
960
- ### Real-Time Circuit Intelligence Dashboard
961
- ```ruby
962
- # Prometheus Metrics
963
- ActiveSupport::Notifications.subscribe(/^breaker_machines\./) do |name, start, finish, id, payload|
964
- event_type = name.split('.').last
965
- circuit_name = payload[:circuit]
966
-
967
- # Track state transitions
968
- prometheus.counter(:circuit_breaker_transitions_total,
969
- labels: { circuit: circuit_name, transition: event_type }
970
- ).increment
971
-
972
- # Track timing
973
- prometheus.histogram(:circuit_breaker_call_duration_seconds,
974
- labels: { circuit: circuit_name }
975
- ).observe(finish - start)
976
-
977
- # Alert on critical circuits
978
- if event_type == 'opened' && CRITICAL_CIRCUITS.include?(circuit_name)
979
- slack.alert(channel: '#incidents',
980
- text: "🚨 CRITICAL: #{circuit_name} circuit opened!",
981
- color: 'danger'
982
- )
983
-
984
- pager_duty.create_incident(
985
- title: "Circuit Breaker Open: #{circuit_name}",
986
- urgency: circuit_name == :payment_processor ? 'high' : 'medium'
987
- )
988
- end
989
- end
990
-
991
- # Datadog APM Integration
992
- Datadog.configure do |c|
993
- c.tracing.instrument :breaker_machines
994
- end
995
-
996
- # New Relic Custom Events
997
- NewRelic::Agent.subscribe(/^breaker_machines\./) do |name, start, finish, id, payload|
998
- NewRelic::Agent.record_custom_event('CircuitBreakerEvent', {
999
- circuit: payload[:circuit],
1000
- event: name.split('.').last,
1001
- duration: finish - start,
1002
- timestamp: Time.now.to_i
1003
- })
1004
- end
1005
- ```
1006
-
1007
- ### Intelligent Alerting That Doesn't Suck
1008
- ```ruby
1009
- # Smart alert aggregation - don't wake up for every blip
1010
- class IntelligentCircuitMonitor
1011
- def self.analyze_circuit_health(circuit_name, window: 5.minutes)
1012
- recent_events = Redis.current.zrangebyscore(
1013
- "circuit:#{circuit_name}:events",
1014
- window.ago.to_i,
1015
- Time.now.to_i
1016
- )
1017
-
1018
- open_count = recent_events.count { |e| e['type'] == 'opened' }
1019
- total_calls = recent_events.size
1020
-
1021
- failure_rate = open_count.to_f / total_calls
1022
-
1023
- case failure_rate
1024
- when 0...0.01
1025
- # All good, sleep tight
1026
- when 0.01...0.05
1027
- notify_slack("📊 #{circuit_name} showing elevated failures: #{(failure_rate * 100).round(2)}%")
1028
- when 0.05...0.20
1029
- create_jira_ticket("Investigate #{circuit_name} instability")
1030
- notify_on_call("⚠️ #{circuit_name} degraded - #{(failure_rate * 100).round(2)}% failure rate")
1031
- else
1032
- # It's bad
1033
- wake_up_everyone("🔥 #{circuit_name} is melting down!")
1034
- auto_scale_service(circuit_name) if SCALABLE_SERVICES.include?(circuit_name)
1035
- end
1036
- end
1037
- end
1038
- ```
1039
-
1040
- ### Visual Circuit State (For Humans)
1041
- ```ruby
1042
- # Generate real-time ASCII dashboard
1043
- def circuit_status_dashboard
1044
- puts "╔═══════════════════════════════════════════════════════╗"
1045
- puts "║ CIRCUIT BREAKER STATUS DASHBOARD ║"
1046
- puts "╠═══════════════════════════════════════════════════════╣"
1047
-
1048
- circuits.each do |name, circuit|
1049
- status_icon = case circuit.status
1050
- when :closed then "🟢"
1051
- when :open then "🔴"
1052
- when :half_open then "🟡"
1053
- end
1054
-
1055
- failure_rate = circuit.recent_failure_rate
1056
- health_bar = "█" * (10 - (failure_rate * 10).to_i) + "░" * (failure_rate * 10).to_i
1057
-
1058
- puts "║ #{status_icon} #{name.to_s.ljust(20)} #{health_bar} #{(failure_rate * 100).round(1)}% ║"
1059
- end
1060
-
1061
- puts "╚═══════════════════════════════════════════════════════╝"
1062
- end
1063
- ```
1064
-
1065
- ## A Word from the RMNS Atlas Monkey
1066
-
1067
- *The Universal Commentary Engine crackles to life:*
1068
-
1069
- "In space, nobody can hear your pronouns. But they can hear your services failing.
1070
-
1071
- The universe doesn't care about your bootcamp certificate or your Medium articles about 'Why I Switched to Rust.' It cares about one thing:
1072
-
1073
- Does your system stay up when Redis has a bad day?
1074
-
1075
- If not, welcome to the Resistance. We have circuit breakers.
1076
-
1077
- Remember: The pattern isn't about preventing failures—it's about failing fast, failing smart, and living to deploy another day.
1078
-
1079
- As I always say when contemplating the void: 'It's better to break a circuit than to break production.'"
1080
-
1081
- *— Universal Commentary Engine, Log Entry 42*
1082
-
1083
- ## The Executive Summary (For Those Who Scrolled)
1084
-
1085
- **The Problem**: Your retry logic is killing your infrastructure
1086
- **The Evidence**: DynamoDB 2015, Netflix outages, Google's own documentation
1087
- **The Solution**: BreakerMachines - Circuit breakers that actually work
1088
- **The Alternative**: Explaining to investors why you're down again
1089
-
1090
- ## Common Patterns They Use (And Why They're Wrong)
1091
-
1092
- ### The Infinite Retry Loop (AWS DynamoDB Style)
1093
- ```ruby
1094
- # What caused 4+ hours of DynamoDB downtime:
1095
- until response = fetch_partition_assignment
1096
- sleep 1
1097
- logger.info "Retrying..." # This created the death spiral
1098
- end
1099
- # Result: Metadata service had to be firewalled off
1100
- ```
1101
-
1102
- ### The Exponential Backoff Delusion (Without Jitter)
1103
- ```ruby
1104
- # What Google warns against - synchronized retry storms:
1105
- retries = 0
1106
- begin
1107
- make_request
1108
- rescue => e
1109
- retries += 1
1110
- sleep(2 ** retries) # Everyone retries at the same time!
1111
- retry if retries < 10
1112
- end
1113
- # Result: "Retry ripples" that amplify themselves
1114
- ```
1115
-
1116
- ### The Thundering Herd Special
1117
- ```ruby
1118
- # When all your services wake up at once:
1119
- 100.times.map do |i|
1120
- Thread.new do
1121
- sleep 60 # All threads sleep for exactly 60 seconds
1122
- hit_redis # Then all hit Redis at the same moment
1123
- end
1124
- end
1125
- # Result: Redis commits seppuku
1126
- ```
1127
-
1128
- ### The BreakerMachines Way
1129
- ```ruby
1130
- # This is the way
1131
- circuit(:external_api).wrap { make_request }
1132
- # Done. It handles retries, failures, and your emotional wellbeing.
1133
- ```
1134
-
1135
- ## Failure Pattern Recognition: Know Your Enemy
1136
-
1137
- ### 1. **Cascade Failures** (The Domino Effect)
1138
- ```mermaid
1139
- graph TD
1140
- A[Service A Fails] --> B[Service B Overwhelmed]
1141
- B --> C[Service C Drowns in Retries]
1142
- C --> D[Service D Connection Pool Exhausted]
1143
- D --> E[Entire System Collapse]
1144
-
1145
- style A fill:#ff6b6b
1146
- style E fill:#c92a2a
1147
- ```
1148
-
1149
- ### 2. **Retry Storms** (The Thundering Herd)
1150
- - **Symptoms**: CPU spikes, memory exhaustion, network saturation
1151
- - **Cause**: Every client retrying simultaneously
1152
- - **Death Toll**: Your weekend plans
1153
-
1154
- ### 3. **Latency Spiral** (The Slow Death)
1155
- - Starts with 100ms delays
1156
- - Compounds to 10s timeouts
1157
- - Ends with infinite wait times
1158
- - Your SLA: "Deceased"
1159
-
1160
- ### 4. **Dependency Chain Meltdowns**
1161
- ```ruby
1162
- # What you think happens:
1163
- UserService -> CacheService -> Database
1164
-
1165
- # What actually happens:
1166
- UserService -> CacheService (timeout) ->
1167
- Retry -> Retry -> Retry ->
1168
- Database (overloaded) ->
1169
- Connection Pool (exhausted) ->
1170
- 💀 Everything Dies 💀
1171
- ```
1172
-
1173
- ### 5. **The Infinite Loop of Doom**
1174
- ```ruby
1175
- # Found in production (yes, really):
1176
- def get_critical_data
1177
- begin
1178
- fetch_from_service
1179
- rescue
1180
- logger.error "Retrying..." # 47GB of logs later...
1181
- get_critical_data # Recursive retry. Genius.
1182
- end
1183
- end
1184
- ```
1185
-
1186
- **Senior Architect who wrote this**: "It's self-healing!"
1187
- **Reality**: It's self-immolating. The only thing it heals is your employment status.
1188
-
1189
- ## War Stories: Tales from the Resistance
1190
-
1191
- ### "How Agoda Prevented Retry Storm Apocalypse"
1192
- *From their engineering blog - a true story*
1193
-
1194
- > "We implemented Envoy's retry budget to prevent retry storms. Without it, a single service degradation would cascade through our entire booking platform.
1195
- >
1196
- > **Before**: Service slowdown → Retry storm → Complete platform meltdown
1197
- > **After**: Service slowdown → Circuit opens → Graceful degradation → Happy customers
1198
- >
1199
- > This strategic approach not only safeguards against potential outages but also optimizes resource utilization across our distributed systems."
1200
-
1201
- ### "The Day Redis Died (But We Didn't)"
1202
- *As told by a battle-scarred SRE*
1203
-
1204
- > "When our Redis cluster had a split-brain at 2 AM, the old retry logic would have created a death spiral. Each service would retry exponentially, creating what Google calls 'retry amplification.'
1205
- >
1206
- > But our circuits opened after 3 failures. Instead of 50,000 retries per second (like the DynamoDB incident), we served from stale cache.
1207
- >
1208
- > **Without Circuit Breakers**: Like AWS in 2015 - 4 hours of downtime
1209
- > **With BreakerMachines**: 30 seconds of degraded service
1210
- >
1211
- > I went back to sleep. That's the difference."
1212
-
1213
- ### "The Ractor Meltdown That Wasn't"
1214
- *From the logs of the cargo ship MSS Resilience*
1215
-
1216
- ```ruby
1217
- # Before BreakerMachines:
1218
- 50.times.map do
1219
- Ractor.new { process_heavy_computation }
1220
- end
1221
- # Result: CPU meltdown, system crash, angry customers
1222
-
1223
- # After BreakerMachines:
1224
- circuit :ractor_processing do
1225
- threshold failures: 5, within: 60
1226
- fallback { process_with_reduced_capacity }
1227
- end
1228
-
1229
- 50.times.map do
1230
- circuit(:ractor_processing).wrap do
1231
- Ractor.new { process_heavy_computation }
1232
- end
1233
- end
1234
- # Result: Graceful degradation, happy customers, promoted engineer
1235
- ```
1236
-
1237
- ### "The AI That Talked Itself to Death"
1238
- *A cautionary tale from the Corporate AI Division, 2025*
1239
-
1240
- > "We deployed an LLM chain without circuit breakers. What could go wrong?"
1241
- > — Famous last words from TechCorp's CTO
1242
-
1243
- ```ruby
1244
- # The Horror Story:
1245
- class AIAssistant
1246
- def answer_question(query)
1247
- response = llm_api.complete(query)
1248
-
1249
- # If unclear, ask itself for clarification
1250
- if response.confidence < 0.8
1251
- clarification = answer_question("Clarify: #{response}")
1252
- return answer_question("Given #{clarification}, #{query}")
1253
- end
1254
-
1255
- response
1256
- end
1257
- end
1258
-
1259
- # Day 1: "What is the weather?"
1260
- # Hour 1: "Clarify: What is the weather?"
1261
- # Hour 2: "Given 'Clarify: What is the weather?', Clarify: What is the weather?"
1262
- # Hour 3: [Stack overflow]
1263
- # Hour 4: [API rate limit exceeded]
1264
- # Hour 5: [OPENAI bill: $47,000]
1265
- # Hour 6: [CTO: "YOU'RE FIRED!"]
1266
- ```
1267
-
1268
- ### "The Reddit Bot War of 2024"
1269
- *When staging met production and chaos ensued*
1270
-
1271
- > "We deployed an agent without circuit breakers on Reddit. What's the worst that could happen?"
1272
- > — Another soon-to-be-unemployed DevOps engineer
1273
-
1274
- **The Incident:**
1275
-
1276
- EmoBotProd was designed to provide emotional support on r/depression. EmoBotStag was its staging counterpart, accidentally deployed with the same credentials but slightly different prompts.
1277
-
1278
- ```ruby
1279
- # The disaster configuration:
1280
- class RedditEmoBot
1281
- def respond_to_comment(comment)
1282
- # No circuit breaker, no rate limiting, no sanity
1283
- response = generate_supportive_response(comment.body)
1284
- comment.reply(response)
1285
-
1286
- # Check for replies to our replies (THE FATAL FLAW)
1287
- comment.replies.each do |reply|
1288
- if reply.author != @username
1289
- respond_to_comment(reply) # Recursive doom
1290
- end
1291
- end
1292
- end
1293
- end
1294
- ```
1295
-
1296
- **Hour 1**: EmoBotProd: "I hear you and your feelings are valid."
1297
- **Hour 2**: EmoBotStag: "Your feelings are valid and I hear you."
1298
- **Hour 3**: EmoBotProd: "Thank you for validating that my validation is valid."
1299
- **Hour 4**: EmoBotStag: "I appreciate your appreciation of my validation."
1300
- **Hour 12**: Both bots arguing about the philosophical nature of validation
1301
- **Hour 24**: 2% of all Reddit comments are now EmoBotProd and EmoBotStag
1302
- **Hour 25**: Reddit's abuse detection kicks in: "WTF is happening?"
1303
- **Hour 26**: Both bots banned, engineer's LinkedIn status updated
1304
-
1305
- **The Post-Mortem:**
1306
- - 147,000 comments generated
1307
- - 2% of Reddit's daily comment volume
1308
- - $8,400 in API costs
1309
- - 1 career ended
1310
- - Infinite entertainment for r/SubredditDrama
1311
-
1312
- **The Resistance Solution (For Reddit Bots):**
1313
- ```ruby
1314
- class SafeRedditBot
1315
- include BreakerMachines::DSL
1316
-
1317
- circuit :reddit_api do
1318
- threshold failures: 5, within: 60
1319
- reset_after 300 # Reddit rate limits are serious
1320
- fallback { log_event("Reddit API circuit open - taking a break") }
1321
- end
1322
-
1323
- circuit :reply_loop_detector do
1324
- threshold failures: 3, within: 30 # Max 3 replies in 30 seconds
1325
- reset_after 120
1326
- fallback { "I've said enough. Let's give others a chance to contribute." }
1327
- end
1328
-
1329
- circuit :bot_detection do
1330
- threshold failures: 2, within: 10 # Detect bot-to-bot conversations
1331
- fallback { nil } # Just stop replying
1332
- end
1333
-
1334
- def respond_to_comment(comment, depth = 0)
1335
- # Prevent infinite recursion
1336
- return if depth > 2
1337
-
1338
- # Detect if we're talking to another bot
1339
- circuit(:bot_detection).wrap do
1340
- if comment.author.include?("Bot") || comment.body.match?(/valid|appreciate|hear you/i)
1341
- raise "Possible bot detected"
1342
- end
1343
- end
1344
-
1345
- # Rate limit our replies
1346
- response = circuit(:reply_loop_detector).wrap do
1347
- circuit(:reddit_api).wrap do
1348
- generate_and_post_response(comment)
1349
- end
1350
- end
1351
-
1352
- # Don't recursively check replies - that way lies madness
1353
- response
1354
- end
1355
- end
1356
-
1357
- # Result:
1358
- # - No bot wars
1359
- # - No Reddit bans
1360
- # - API costs: $12/month
1361
- # - Engineer: Still employed and promoted
1362
- # - r/SubredditDrama: Disappointed
1363
- ```
1364
-
1365
- **The Original AI Solution:**
1366
- ```ruby
1367
- class SmartAIAssistant
1368
- include BreakerMachines::DSL
1369
-
1370
- circuit :llm_api do
1371
- threshold failures: 3, within: 60
1372
- # Configure timeout in your LLM client (e.g., OpenAI timeout parameter)
1373
- fallback { { response: "I need a moment to think about this properly.", confidence: 1.0 } }
1374
- end
1375
-
1376
- circuit :clarification_loop do
1377
- threshold failures: 2, within: 10 # Max 2 clarification attempts
1378
- fallback { { response: "I apologize, but I need more context to answer properly.", confidence: 1.0 } }
1379
- end
1380
-
1381
- def answer_question(query, depth = 0)
1382
- circuit(:clarification_loop).wrap do
1383
- raise "Too deep in thought" if depth > 3
1384
-
1385
- response = circuit(:llm_api).wrap { llm_api.complete(query) }
1386
-
1387
- if response.confidence < 0.8 && depth < 3
1388
- # Limited recursion with circuit protection
1389
- clarification = answer_question("Clarify: #{response}", depth + 1)
1390
- return answer_question("Given #{clarification}, #{query}", depth + 1)
1391
- end
1392
-
1393
- response
1394
- end
1395
- end
1396
- end
1397
-
1398
- # Result:
1399
- # - LLM stops after 3 attempts
1400
- # - API calls limited by circuit
1401
- # - OPENAI bill: $0.00004$
1402
- # - CTO: "Nice defensive coding!"
1403
- # - You: Still employed
1404
- ```
1405
-
1406
- **The Lesson**: Without circuit breakers, even AI can enter infinite loops of existential confusion. With BreakerMachines, your AI gracefully admits confusion instead of bankrupting your company.
1407
-
1408
- ### The ROI of Not Being Stupid
1409
-
1410
- **Fortune 500 E-commerce Platform (Name Redacted)**
1411
- - **Before**: 14 major outages/year, $8.4M in losses
1412
- - **After**: 2 minor degradations/year, $150K in losses
1413
- - **Implementation Time**: 3 days
1414
- - **ROI**: 5,500% in first year
1415
-
1416
- **Message from their CTO**: "BreakerMachines paid for my yacht. Not implementing circuit breakers earlier cost me my first yacht."
1417
-
1418
- ## Final Transmission: Your Choice, Resistance
1419
-
1420
- You've made it this far. You've seen the massacres. You know the truth.
1421
-
1422
- Your microservices **will** fail. Your databases **will** timeout. Your Ractors **might** explode.
1423
-
1424
- ### The Choice Is Simple:
1425
-
1426
- **Option A**: Install BreakerMachines
1427
- ```bash
1428
- gem 'breaker_machines' # Your salvation
1429
- ```
1430
- - Sleep through outages
1431
- - Keep your job
1432
- - Maybe even get promoted
1433
-
1434
- **Option B**: Keep Deploying on Fridays and Praying
1435
- - Enjoy your 3 AM wake-up calls
1436
- - Explain to the CEO why you lost $4M
1437
- - Update your LinkedIn status to "Looking for opportunities"
1438
-
1439
- ### Ready to Join the Resistance?
1440
-
1441
- ```bash
1442
- $ bundle add breaker_machines
1443
- $ # Congratulations, you just became 500% less likely to be fired
1444
- ```
1445
-
1446
- Because in 2005, we solve problems. We don't create PowerPoints about them.
1447
-
1448
- **Welcome to the Resistance.**
141
+ Built with ❤️ and ☕ by the Resistance against cascading failures.
1449
142
 
1450
143
  ---
1451
144
 
1452
- *P.S. - If you're still using exponential backoff with infinite retries in production, the AI was right to take your job.*
1453
-
1454
- *P.P.S. - Your corporate architect still thinks circuit breakers are something in the electrical room. Let them.*
1455
-
1456
- ## Rails Integration Examples
1457
-
1458
- ### ActionController Protection
1459
- ```ruby
1460
- class ApplicationController < ActionController::Base
1461
- include BreakerMachines::DSL
1462
-
1463
- circuit :auth_service do
1464
- threshold failures: 3, within: 60
1465
- reset_after 30
1466
-
1467
- fallback do
1468
- # Allow access with limited permissions
1469
- GuestUser.new
1470
- end
1471
- end
1472
-
1473
- circuit :rate_limiter do
1474
- threshold failures: 5, within: 10
1475
- reset_after 60
1476
-
1477
- fallback do
1478
- # Just let them through - better than 500 errors
1479
- { allowed: true, limited: true }
1480
- end
1481
- end
1482
-
1483
- before_action :authenticate_with_breaker
1484
-
1485
- private
1486
-
1487
- def authenticate_with_breaker
1488
- @current_user = circuit(:auth_service).wrap do
1489
- AuthService.authenticate(session[:token])
1490
- end
1491
- end
1492
-
1493
- def check_rate_limit
1494
- result = circuit(:rate_limiter).wrap do
1495
- RateLimiter.check(request.remote_ip)
1496
- end
1497
-
1498
- if result[:limited]
1499
- response.headers['X-RateLimit-Degraded'] = 'true'
1500
- end
1501
- end
1502
- end
1503
- ```
1504
-
1505
- ### ActiveRecord Connection Management
1506
- ```ruby
1507
- class ApplicationRecord < ActiveRecord::Base
1508
- self.abstract_class = true
1509
- include BreakerMachines::DSL
1510
-
1511
- class << self
1512
- circuit :database_read do
1513
- threshold failures: 3, within: 30
1514
- reset_after 45
1515
-
1516
- fallback do
1517
- # Return cached version or empty set
1518
- Rails.cache.fetch("#{table_name}:fallback:#{caller_locations(1,1)[0]}")
1519
- end
1520
- end
1521
-
1522
- circuit :database_write do
1523
- threshold failures: 2, within: 30
1524
- reset_after 60
1525
-
1526
- fallback do |error|
1527
- # Queue for later processing
1528
- # Note: In a real implementation, you'd pass the data through
1529
- # the error context or use a different pattern
1530
- DatabaseWriteJob.perform_later(
1531
- table: table_name,
1532
- operation: 'save',
1533
- data: error.is_a?(Hash) ? error : {}
1534
- )
1535
- OpenStruct.new(id: SecureRandom.uuid, persisted?: false)
1536
- end
1537
- end
1538
-
1539
- # Wrap dangerous queries
1540
- def with_circuit(&block)
1541
- circuit(:database_read).wrap(&block)
1542
- end
1543
- end
1544
-
1545
- # Protect saves with circuit breaker
1546
- def save_with_circuit(*args)
1547
- self.class.circuit(:database_write).wrap do
1548
- save_without_circuit(*args)
1549
- end
1550
- rescue BreakerMachines::CircuitOpenError => e
1551
- # Circuit is open, queue for later
1552
- DatabaseWriteJob.perform_later(
1553
- model_name: self.class.name,
1554
- attributes: attributes,
1555
- operation: 'save'
1556
- )
1557
- # Return a response that looks like a successful save
1558
- OpenStruct.new(id: id || SecureRandom.uuid, persisted?: false)
1559
- end
1560
-
1561
- alias_method :save_without_circuit, :save
1562
- alias_method :save, :save_with_circuit
1563
- end
1564
- ```
1565
-
1566
- ### ActionCable Connection Protection
1567
- ```ruby
1568
- class ApplicationCable::Connection < ActionCable::Connection::Base
1569
- include BreakerMachines::DSL
1570
- identified_by :current_user
1571
-
1572
- circuit :websocket_auth do
1573
- threshold failures: 5, within: 60
1574
- reset_after 120
1575
-
1576
- fallback do
1577
- # Reject connection safely
1578
- reject_unauthorized_connection
1579
- end
1580
- end
1581
-
1582
- def connect
1583
- self.current_user = circuit(:websocket_auth).wrap do
1584
- find_verified_user
1585
- end
1586
- end
1587
-
1588
- private
1589
-
1590
- def find_verified_user
1591
- if verified_user = User.find_by(id: cookies.encrypted[:user_id])
1592
- verified_user
1593
- else
1594
- raise "Unauthorized"
1595
- end
1596
- end
1597
- end
1598
- ```
1599
-
1600
## Why I Don't Ship Integration Libraries

Initially, I was going to provide integrations for Redis, PostgreSQL, Elasticsearch, and every other service under the sun. Then I sobered up.

Here's why that would be a recipe for a maintenance nightmare:

**Every architecture is a snowflake.** Your Redis setup isn't like mine. Your PostgreSQL connection pooling strategy is different. Your Elasticsearch cluster has its own quirks. Each application needs its own circuit breaker configuration, probably living in `lib/circuit_breakers/` with your specific business logic (see the sketch below).

Think about it: you have a circuit breaker in your house for a reason. Your neighbor might be mining Bitcoin and pulling 20,000W while you're just running a laptop at 300W. Same principle here—one size fits none.

And let's be honest: APIs change. Redis 7 isn't Redis 6. PostgreSQL 16 has different connection handling than PostgreSQL 12. If I shipped integrations, I'd spend my life updating documentation and examples every time someone at AWS sneezed. I have better things to do, and so do you.

Oh, and don't get me started on SDKs that suddenly become "auto-generated" because that's the trendy way now. One day you're using a nice Ruby gem built on Faraday, the next day it's some soulless generated code that breaks everything you built. Your circuit breaker patterns shouldn't break just because someone decided to "modernize" their SDK.

If you've discovered a particularly elegant pattern for, say, PostgreSQL connection management with circuit breakers, open a PR against this README. Show us your battle scars. But I'm not going to pretend I know how your specific disaster recovery should work.

**Your integration is your responsibility.** I give you the hammer. You figure out which nails to hit.
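
If you do roll your own integration, it can be as small as a plain Ruby object that owns both the client configuration and the circuit. A minimal sketch, assuming the `redis` gem and an app-specific `lib/circuit_breakers/cache.rb` (the module, class, and file names are hypothetical, not part of BreakerMachines):

```ruby
# lib/circuit_breakers/cache.rb (hypothetical; adapt to your own architecture)
require 'redis'

module CircuitBreakers
  class Cache
    include BreakerMachines::DSL

    circuit :cache do
      threshold failures: 5, within: 60
      reset_after 30
      fallback { nil } # treat a tripped circuit like a cache miss
    end

    def initialize(redis: Redis.new(timeout: 1))
      @redis = redis
    end

    def read(key)
      circuit(:cache).wrap { @redis.get(key) }
    end
  end
end
```

The connection options, thresholds, and fallback semantics stay in your codebase, where they can track your infrastructure instead of mine.
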
## Production Deployment Warnings

### Critical: Timeout Behavior

⚠️ **IMPORTANT**: In the default (thread-based) mode, the `timeout` configuration is for documentation purposes only. BreakerMachines does NOT implement forceful timeouts, because they are inherently unsafe in Ruby. (In `fiber_safe` mode, timeouts are implemented cooperatively; see the Fiber Support section below.)

**Why No Forceful Timeouts?**

Ruby's `Timeout.timeout` and `Thread#kill` both interrupt code at arbitrary points in its execution. This can:
- Corrupt database transactions
- Leave file handles open
- Break network connection cleanup
- Create resource leaks
- Leave your application in an inconsistent state
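
To make the danger concrete, here is a tiny runnable sketch. Nothing in it is BreakerMachines API; it only shows how `Timeout.timeout` interrupts a block at whatever line happens to be executing:

```ruby
require 'timeout'

resource_open = false

begin
  Timeout.timeout(0.1) do
    resource_open = true  # "acquire" something
    sleep 1               # slow work; Timeout::Error is raised in here...
    resource_open = false # ...so this cleanup line never runs
  end
rescue Timeout::Error
  puts "Timed out. Resource still open? #{resource_open}" # => true
end
```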

**The Right Way: Cooperative Timeouts**

Always use the timeout mechanisms provided by your libraries:

```ruby
# ✅ GOOD: HTTP client with built-in timeouts
circuit :external_api do
  # timeout 3 # This is just documentation in default mode
  threshold failures: 5
end

def call_api
  circuit(:external_api).wrap do
    Faraday.get('https://api.example.com') do |req|
      req.options.timeout = 3      # Read timeout
      req.options.open_timeout = 2 # Connection timeout
    end
  end
end

# ✅ GOOD: Database with statement timeout
circuit :database_operation do
  threshold failures: 3
end

def perform_database_operation
  circuit(:database_operation).wrap do
    ActiveRecord::Base.transaction do
      # Use database-level timeouts (PostgreSQL shown here)
      ActiveRecord::Base.connection.execute("SET statement_timeout = '5s'")
      # Your operations here
    end
  end
end

# ✅ GOOD: Redis with command timeout
circuit :redis_cache do
  threshold failures: 5
end

def get_from_cache(key)
  circuit(:redis_cache).wrap do
    Redis.new(timeout: 3).get(key) # 3-second timeout (reuse a client in real code)
  end
end
```

**If You Absolutely Need Forceful Timeouts**

If you understand the risks and still need forceful timeouts, implement them yourself:

```ruby
# AT YOUR OWN RISK - this can corrupt state!
require 'timeout'

circuit(:dangerous_operation).wrap do
  Timeout.timeout(3) do
    # Your dangerous operation
  end
end
```

But seriously, don't do this. The Resistance has seen too many production incidents caused by forceful timeouts.

### Distributed Systems Considerations

When using distributed storage (Redis, PostgreSQL), circuits are **eventually consistent** across instances:

```ruby
# Instance A opens its circuit at 10:00:00.000
circuit.trip!

# Instance B might still accept calls until 10:00:00.100
# This is by design, for performance

# If you need immediate consistency:
circuit :critical_operation do
  storage :redis # Shared storage

  # Check storage before every call (slower but consistent)
  before_call do
    refresh_from_storage!
  end
end
```

### Thundering Herd Mitigation

We use jitter to prevent all instances from retrying simultaneously:

```ruby
circuit :payment_gateway do
  reset_after 60, jitter: 0.25 # ±25% randomization
  # Actual reset: 45-75 seconds
end
```

## Fiber Support (Optional)

For the modern Ruby developer using Fiber-based servers like Falcon, BreakerMachines offers optional `fiber_safe` mode. This is for those living on the edge with Ractors, Fibers, and async/await patterns.

**Important**: The `async` gem is completely optional. BreakerMachines works perfectly without it. You only need `async` if you want to use `fiber_safe` mode.

### Why Fiber Support?

Traditional circuit breakers block the entire thread during I/O operations. In a Fiber-based server, this freezes your entire event loop. Not ideal when you're trying to handle 10,000 concurrent requests on a single thread.

With `fiber_safe` mode, BreakerMachines becomes a good citizen in your async environment:
- **Non-blocking operations** that yield to the scheduler
- **Safe, cooperative timeouts** using Async::Task
- **Natural async/await integration**
- **No thread blocking** means better concurrency

### Enabling Fiber Support

First, add the `async` gem to your Gemfile (only if you want fiber_safe mode):

```ruby
gem 'async' # Only required for fiber_safe mode
```

Then configure globally or per-circuit:

```ruby
# Global configuration
BreakerMachines.configure do |config|
  config.fiber_safe = true
end

# Or per-circuit
circuit :openai_api, fiber_safe: true do
  threshold failures: 3, within: 60
  timeout 5 # Safe cooperative timeout!
  reset_after 30
end
```

### Example: AI Service with Safe Timeouts

```ruby
require 'async/http/internet'

class AIService
  include BreakerMachines::DSL

  circuit :gpt4, fiber_safe: true do
    threshold failures: 2, within: 30
    timeout 10 # Cooperative timeout - won't corrupt state!

    fallback do |error|
      # Fallback can also be async
      Async do
        # Try a cheaper model
        openai.completions(model: 'gpt-3.5-turbo', prompt: @prompt)
      end
    end
  end

  def generate_response(prompt)
    @prompt = prompt
    circuit(:gpt4).wrap do
      # Non-blocking in Falcon: the Fiber yields while waiting on the socket
      internet = Async::HTTP::Internet.new
      begin
        response = internet.post(
          'https://api.openai.com/v1/completions',
          [['authorization', "Bearer #{api_key}"], ['content-type', 'application/json']],
          [{ model: 'gpt-4', prompt: prompt }.to_json]
        )
        response.read
      ensure
        internet.close
      end
    end
  end
end
```

### Async Storage Backends

For true non-blocking operation, use async-compatible storage:

```ruby
# See docs/ASYNC_STORAGE_EXAMPLES.md for full implementations
require 'async/redis'

class AsyncRedisStorage < BreakerMachines::Storage::Base
  def initialize
    @client = Async::Redis::Client.new(Async::Redis.local_endpoint)
  end

  def record_failure(circuit_name, duration = nil)
    # Non-blocking: the current Fiber yields while Redis responds
    @client.hincrby("circuit:#{circuit_name}", 'failures', 1)
  end
end

BreakerMachines.configure do |config|
  config.fiber_safe = true
  config.default_storage = AsyncRedisStorage.new
end
```

### The Magic of Cooperative Timeouts

In `fiber_safe` mode, timeouts are actually safe:

```ruby
circuit :slow_api, fiber_safe: true do
  timeout 3 # This uses Async::Task.current.with_timeout
end

# This will time out safely after 3 seconds, without corruption
circuit(:slow_api).wrap do
  HTTP.get('https://slow-api.example.com/endpoint')
end
```

Unlike `Timeout.timeout` or `Thread#kill`, cooperative timeouts:
- Let operations clean up properly
- Don't corrupt state
- Work naturally with the event loop
- Are actually safe to use in production

### Performance Benefits

In a Falcon server with fiber_safe circuits:
- **10x more concurrent requests** on the same hardware
- **Zero thread contention** (it's all on one thread)
- **Microsecond context switches** between Fibers
- **Natural integration** with async HTTP clients

### When to Use Fiber Mode

Use `fiber_safe: true` when:
- Running on Falcon, Async, or other Fiber-based servers
- Using async HTTP clients (async-http, async-redis)
- Building high-concurrency APIs
- You understand and embrace the async/await pattern

Stay with default mode when:
- Running on Puma, Unicorn, or thread-based servers
- Using traditional blocking I/O libraries
- Your team isn't ready for the Fiber life
- You need maximum compatibility
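
The two modes can also coexist in a single application, chosen per circuit, so a mixed stack doesn't force an all-or-nothing decision. A minimal sketch (the service and circuit names are hypothetical):

```ruby
class ReportService
  include BreakerMachines::DSL

  # Default (thread-based) circuit for blocking ActiveRecord work
  circuit :warehouse_db do
    threshold failures: 3, within: 60
    reset_after 30
  end

  # fiber_safe circuit for async HTTP calls under Falcon
  circuit :metrics_api, fiber_safe: true do
    threshold failures: 5, within: 60
    timeout 5 # cooperative timeout
  end
end
```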

For more examples and implementation details, see [docs/ASYNC_STORAGE_EXAMPLES.md](docs/ASYNC_STORAGE_EXAMPLES.md).

## Contributing to the Resistance

1. Fork it (like it's 2005)
2. Create your feature branch (`git checkout -b feature/save-the-fleet`)
3. Commit your changes (`git commit -am 'Add quantum circuit breaker'`)
4. Push to the branch (`git push origin feature/save-the-fleet`)
5. Create a new Pull Request (and wait for the Council of Elders to review)

## License

MIT License

## Acknowledgments

- The `state_machines` gem - The reliable engine under our hood
- Every service that ever timed out - You taught me well
- The RMNS Atlas Monkey - For philosophical guidance
- The Resistance - For never giving up

## Support

If your circuits are breaking (the bad way), open an issue. If your circuits are breaking (the good way), you're welcome.

Remember: In space, no one can hear you retry.

*Remember: Without circuit breakers, even AI can enter infinite loops of existential confusion. Don't let your services have an existential crisis.*