rails_error_dashboard 0.8.1 → 0.8.2

Files changed (27)
  1. checksums.yaml +4 -4
  2. data/README.md +22 -0
  3. data/app/controllers/rails_error_dashboard/application_controller.rb +5 -0
  4. data/app/controllers/rails_error_dashboard/errors_controller.rb +12 -0
  5. data/app/jobs/rails_error_dashboard/storm_flush_job.rb +19 -0
  6. data/app/jobs/rails_error_dashboard/storm_notification_job.rb +74 -0
  7. data/app/models/rails_error_dashboard/storm_event.rb +34 -0
  8. data/app/views/layouts/rails_error_dashboard.html.erb +21 -0
  9. data/app/views/rails_error_dashboard/errors/storms.html.erb +91 -0
  10. data/config/routes.rb +1 -0
  11. data/db/migrate/20260306000002_add_instance_variables_to_error_logs.rb +7 -1
  12. data/db/migrate/20260306000003_create_rails_error_dashboard_swallowed_exceptions.rb +4 -0
  13. data/db/migrate/20260307000001_create_rails_error_dashboard_diagnostic_dumps.rb +4 -0
  14. data/db/migrate/20260613000001_create_storm_events.rb +28 -0
  15. data/lib/generators/rails_error_dashboard/install/templates/initializer.rb +36 -0
  16. data/lib/rails_error_dashboard/commands/flush_storm_counts.rb +188 -0
  17. data/lib/rails_error_dashboard/commands/log_error.rb +70 -12
  18. data/lib/rails_error_dashboard/configuration.rb +60 -0
  19. data/lib/rails_error_dashboard/queries/storm_history.rb +39 -0
  20. data/lib/rails_error_dashboard/services/storm_protection/circuit_breaker.rb +195 -0
  21. data/lib/rails_error_dashboard/services/storm_protection/count_buffer.rb +100 -0
  22. data/lib/rails_error_dashboard/services/storm_protection/fingerprint_buckets.rb +123 -0
  23. data/lib/rails_error_dashboard/services/storm_protection/gate.rb +258 -0
  24. data/lib/rails_error_dashboard/subscribers/issue_tracker_subscriber.rb +12 -0
  25. data/lib/rails_error_dashboard/version.rb +1 -1
  26. data/lib/rails_error_dashboard.rb +6 -0
  27. metadata +13 -2
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 532921b0ba2ccb40a532a66e5e3c7b1fcf17fc56e53cb5b0ed3772648644f0c0
-   data.tar.gz: 282169f161d90326aaebce5933521cd54c8bb373611e9002b447f114b6654204
+   metadata.gz: c51d8221b7ca00ce0f9242f98f3594fd86028a2c98c464fcac28aac34f6d2a3b
+   data.tar.gz: 597373253abc68befcf70edab809093c320740fca1a575bcc8cae7fd6ae7e3e5
  SHA512:
-   metadata.gz: 63c2ba5ef4319eaff8450837fec526135e37865ea0d2c72c30b7010ff6e86c03e49f728172d032bec6ad626f8dba39d00841300839fee9b8a493caa912669559
-   data.tar.gz: c84238b00e2f6959fd1e4ac5a1c9ff2d696a0d28379a102e99d761585caf4c24a4f6446f4d303d69c933e5135b31e7c3ef1d0a6b7464376e88ad25b14068c75a
+   metadata.gz: 2bcae54eb97ce19345f5ca8a92e515c7b44d9957afc985885bda168afddadc24b3640b7fc3dcdfbe5e80637564975be851d7aeb2d2be14fbdfb8b38bf769300c
+   data.tar.gz: f269598ef3e5c0b4a7761be4bb4ddeeef32f662dd18f6e75dca292f108804b8d73f4d95e0f30e86a550e51478ec66d43431bb722740caf7d0149f98addb9c042
data/README.md CHANGED
@@ -79,6 +79,28 @@ Error capture from controllers, jobs, and middleware. Custom-designed dashboard
 
  ### Optional Features
 
+ <details>
+ <summary><strong>Storm Protection — Circuit Breaker + Adaptive Sampling</strong></summary>
+
+ When the error rate spikes (a bad deploy throwing thousands of errors a minute), the nightmare scenario for any in-process tracker is amplifying the outage with its own database writes. Storm protection makes the gem **provably degrade itself first** — ON by default.
+
+ - **Per-fingerprint caps:** past N occurrences/minute per error, context is shed, then rows are sampled deterministically (a fresh exemplar is always kept each minute)
+ - **Global circuit breaker:** sustained floods flip the gem to count-only mode — zero per-event I/O, exact in-memory counts reconciled onto error records every 30s. Async mode is gated too (a SolidQueue enqueue is itself a DB write)
+ - **One storm notification** replaces hundreds of per-error pings; auto-issue creation is capped (default 5 per 10 min) so a storm of new errors can't open 500 GitHub/Linear issues
+ - **Honest accounting:** a dashboard banner during/after the storm, a Storm History page with exact counts of everything shed, and `reached_open`/peak-rate per episode. Counts are never extrapolated
+ - **Calm-weather economy:** after 25 full-context captures of the same error per day, context is sampled (occurrence counting unaffected)
+ - **Fails open:** any internal storm-protection error means full capture. Protection can never be the thing that loses an error
+
+ ```ruby
+ config.enable_storm_protection = true # default
+ config.storm_open_threshold_per_second = 50 # per process
+ ```
+
+ All thresholds are per process and individually configurable. Disable with one flag.
+
+ **Measured overhead** (Apple Silicon, Ruby 4.0): 2.4µs/error with protection active and calm, 2.95µs in count-only mode, 0.2µs when disabled — against a 5µs budget. The check is a digest plus an atomic increment; there is no I/O on the hot path.
+ </details>
+
  <details>
  <summary><strong>Breadcrumbs — Request Activity Trail</strong></summary>
 
data/app/controllers/rails_error_dashboard/application_controller.rb CHANGED
@@ -77,6 +77,11 @@ module RailsErrorDashboard
      def set_common_view_variables
        @applications = Application.ordered_by_name.pluck(:name, :id) rescue []
        @default_credentials_warning = RailsErrorDashboard.configuration.default_credentials? rescue false
+       # Only query when a before_action (e.g. ErrorsController#load_storm_banner)
+       # hasn't already loaded it this request — the error renderer runs after
+       # those callbacks, so reuse their result instead of querying a second time.
+       # Tracked by a flag rather than the value, since the common case is nil.
+       @storm_banner_event = Queries::StormHistory.banner_event unless @storm_banner_loaded
      end
    end
  end
data/app/controllers/rails_error_dashboard/errors_controller.rb CHANGED
@@ -5,6 +5,7 @@ module RailsErrorDashboard
      before_action :authenticate_dashboard_user!
      before_action :set_application_context
      before_action :check_default_credentials
+     before_action :load_storm_banner
 
      FILTERABLE_PARAMS = %i[
        error_type
@@ -329,6 +330,12 @@ module RailsErrorDashboard
        @pagy, @releases = pagy(:offset, all_releases, limit: params[:per_page] || 25)
      end
 
+     def storms
+       result = Queries::StormHistory.call
+       @active_storm = result[:active]
+       @storm_events = result[:events]
+     end
+
      def user_impact
        days = days_param(default: 30)
        @days = days
@@ -729,6 +736,11 @@ module RailsErrorDashboard
        @default_credentials_warning = RailsErrorDashboard.configuration.default_credentials?
      end
 
+     def load_storm_banner
+       @storm_banner_loaded = true
+       @storm_banner_event = Queries::StormHistory.banner_event
+     end
+
      def authenticate_dashboard_user!
        auth_lambda = RailsErrorDashboard.configuration.authenticate_with
 
data/app/jobs/rails_error_dashboard/storm_flush_job.rb ADDED
@@ -0,0 +1,19 @@
+ # frozen_string_literal: true
+
+ module RailsErrorDashboard
+   # Persists storm-protection count snapshots (mirrors SwallowedExceptionFlushJob):
+   # the gate accumulates counts in memory with zero I/O, snapshots are handed
+   # to this job at most once per flush interval, and ALL DB writes happen here.
+   class StormFlushJob < ApplicationJob
+     queue_as :default
+
+     def perform(entries: [], overflow: 0, episode: nil)
+       entries = entries.map { |e| e.respond_to?(:stringify_keys) ? e.stringify_keys : e }
+       episode = episode.stringify_keys if episode.respond_to?(:stringify_keys)
+
+       Commands::FlushStormCounts.call(entries: entries, overflow: overflow, episode: episode)
+     rescue => e
+       Rails.logger.error("[RailsErrorDashboard] StormFlushJob failed: #{e.class} - #{e.message}")
+     end
+   end
+ end
data/app/jobs/rails_error_dashboard/storm_notification_job.rb ADDED
@@ -0,0 +1,74 @@
+ # frozen_string_literal: true
+
+ module RailsErrorDashboard
+   # Sends the SINGLE "error storm in progress" notification per storm episode.
+   #
+   # During a storm, per-error notifications are suppressed (500 Slack pings
+   # help nobody) — this one message replaces them. The gate guarantees at
+   # most one enqueue per episode; this job just delivers.
+   class StormNotificationJob < ApplicationJob
+     queue_as :default
+
+     # @param started_at [String] ISO8601 episode start
+     # @param state [String] breaker state at notification time ("shedding"/"open")
+     def perform(started_at:, state: "shedding")
+       config = RailsErrorDashboard.configuration
+       message = build_message(started_at, state, config)
+
+       if config.enable_slack_notifications && config.slack_webhook_url.present?
+         post_json(config.slack_webhook_url, { text: message })
+       end
+
+       if config.enable_discord_notifications && config.discord_webhook_url.present?
+         post_json(config.discord_webhook_url, { content: message })
+       end
+
+       if config.enable_webhook_notifications && config.webhook_urls.any?
+         payload = {
+           event: "error_storm_detected",
+           started_at: started_at,
+           state: state,
+           application: app_name(config)
+         }
+         config.webhook_urls.each { |url| post_json(url, payload) }
+       end
+     rescue => e
+       Rails.logger.error("[RailsErrorDashboard] StormNotificationJob failed: #{e.class} - #{e.message}")
+     end
+
+     private
+
+     def build_message(started_at, state, config)
+       mode = state == "open" ? "count-only mode (occurrences tallied, detail paused)" : "shedding mode (context sampling active)"
+       dashboard = (config.dashboard_base_url || "").chomp("/")
+       link = dashboard.present? ? " Dashboard: #{dashboard}/errors/storms" : ""
+
+       ":warning: Error storm detected in #{app_name(config)} at #{started_at}. " \
+         "Storm protection engaged — #{mode}. Per-error notifications are " \
+         "suppressed until the storm subsides; exact counts are preserved.#{link}"
+     end
+
+     def app_name(config)
+       config.application_name || ENV["APPLICATION_NAME"] ||
+         (defined?(Rails) && Rails.application.class.module_parent_name) || "Rails Application"
+     end
+
+     def post_json(url, payload)
+       if defined?(HTTParty)
+         HTTParty.post(url, body: payload.to_json,
+           headers: { "Content-Type" => "application/json" }, timeout: 10)
+       else
+         uri = URI(url)
+         http = Net::HTTP.new(uri.host, uri.port)
+         http.use_ssl = uri.scheme == "https"
+         http.open_timeout = 5
+         http.read_timeout = 10
+         request = Net::HTTP::Post.new(uri.path, { "Content-Type" => "application/json" })
+         request.body = payload.to_json
+         http.request(request)
+       end
+     rescue => e
+       Rails.logger.error("[RailsErrorDashboard] Storm notification post failed: #{e.message}")
+     end
+   end
+ end
data/app/models/rails_error_dashboard/storm_event.rb ADDED
@@ -0,0 +1,34 @@
+ # frozen_string_literal: true
+
+ module RailsErrorDashboard
+   # One row per storm-protection episode (per process). Powers the dashboard
+   # banner and the storm history page. Counts are exact, not extrapolated.
+   # Inherits ErrorLogsRecord so separate-database routing applies.
+   class StormEvent < ErrorLogsRecord
+     self.table_name = "rails_error_dashboard_storm_events"
+
+     scope :active, -> { where(ended_at: nil) }
+     scope :recent_first, -> { order(started_at: :desc) }
+     scope :ended_within, ->(duration) { where.not(ended_at: nil).where(ended_at: duration.ago..) }
+
+     def active?
+       ended_at.nil?
+     end
+
+     def duration_seconds
+       return nil unless ended_at
+
+       (ended_at - started_at).round
+     end
+
+     # @return [Array<Hash>] top fingerprints by count, [] when absent/corrupt
+     def top_fingerprints_list
+       return [] if top_fingerprints.blank?
+
+       parsed = JSON.parse(top_fingerprints)
+       parsed.is_a?(Array) ? parsed : []
+     rescue JSON::ParserError
+       []
+     end
+   end
+ end
data/app/views/layouts/rails_error_dashboard.html.erb CHANGED
@@ -1036,6 +1036,9 @@ tr[data-red-row-href]:hover .sev-bar { opacity: 1 !important; }
  <% if RailsErrorDashboard.configuration.enable_diagnostic_dump %>
    <% diag_items << { path: diagnostic_dumps_errors_path(nav_params), icon: 'bi-clipboard-pulse', label: 'Diagnostics' } %>
  <% end %>
+ <% if RailsErrorDashboard.configuration.enable_storm_protection %>
+   <% diag_items << { path: storms_errors_path(nav_params), icon: 'bi-cloud-lightning-rain', label: 'Storms' } %>
+ <% end %>
 
  <% if diag_items.any? %>
    <div style="margin-bottom: var(--space-2);" id="navDiagSection">
@@ -1175,6 +1178,24 @@ tr[data-red-row-href]:hover .sev-bar { opacity: 1 !important; }
  </div>
  <script<%= " nonce=\"#{red_csp_nonce}\"".html_safe if red_csp_nonce %>>try { if (sessionStorage.getItem('red_dismiss_creds_warning') === '1') { document.getElementById('security-warning').style.display = 'none'; } } catch(e) {}</script>
  <% end %>
+ <% if defined?(@storm_banner_event) && @storm_banner_event %>
+   <% storm_active = @storm_banner_event.active? %>
+   <div id="storm-banner" class="alert <%= storm_active ? 'alert-danger' : 'alert-info' %>" style="display: flex; align-items: center; gap: 10px; margin-top: var(--space-2); margin-bottom: var(--space-4); border-left: 4px solid <%= storm_active ? 'var(--status-critical)' : 'var(--status-info)' %>;">
+     <i class="bi <%= storm_active ? 'bi-cloud-lightning-rain-fill' : 'bi-cloud-check' %>" style="font-size: 18px; flex-shrink: 0;"></i>
+     <div style="flex: 1;">
+       <% if storm_active %>
+         <strong>Error storm in progress</strong> (since <%= @storm_banner_event.started_at.strftime("%H:%M %Z") %>) —
+         storm protection engaged. Occurrences are being counted exactly;
+         <%= @storm_banner_event.reached_open ? "detail capture is paused (count-only mode)" : "context capture is sampled" %>.
+       <% else %>
+         <strong>Storm protection engaged recently</strong> (ended <%= @storm_banner_event.ended_at.strftime("%H:%M %Z") %>) —
+         <%= number_with_delimiter(@storm_banner_event.events_counted_only.to_i + @storm_banner_event.events_overflow.to_i) %> occurrences
+         were recorded as exact counts with sampled detail.
+       <% end %>
+       <a href="<%= storms_errors_path %>" style="margin-left: 4px;">View storm history</a>
+     </div>
+   </div>
+ <% end %>
  <%= yield %>
  </main>
 
data/app/views/rails_error_dashboard/errors/storms.html.erb ADDED
@@ -0,0 +1,91 @@
+ <% content_for :page_title, "Storm History" %>
+
+ <div>
+   <div class="d-flex justify-content-between align-items-center mb-4">
+     <h1 style="font-size: 20px; font-weight: 700; margin: 0;">
+       <i class="bi bi-cloud-lightning-rain me-2"></i>
+       Storm History
+     </h1>
+   </div>
+
+   <p class="text-muted" style="font-size: 13px; margin-bottom: var(--space-4);">
+     When the error rate spikes (a bad deploy, a dependency outage), storm protection
+     limits the gem's own database writes so it never amplifies the incident.
+     Occurrences are always counted exactly — only per-event detail is sampled.
+     Thresholds are configurable per process via <code>storm_*</code> options.
+   </p>
+
+   <% if @active_storm %>
+     <div class="alert alert-danger" style="border-left: 4px solid var(--status-critical);">
+       <i class="bi bi-cloud-lightning-rain-fill me-1"></i>
+       <strong>Storm in progress</strong> since <%= @active_storm.started_at.strftime("%Y-%m-%d %H:%M %Z") %> —
+       peak rate <%= number_with_delimiter(@active_storm.peak_rate_per_minute) %>/min.
+     </div>
+   <% end %>
+
+   <% if @storm_events.empty? %>
+     <div class="red-empty-state">
+       <i class="bi bi-cloud-sun display-1 text-muted mb-3"></i>
+       <div class="red-empty-state-title">No Storms Recorded</div>
+       <p class="text-muted">
+         Storm protection is standing by. If an error flood ever hits this app,
+         episodes will appear here with exact counts of what was shed.
+       </p>
+     </div>
+   <% else %>
+     <div class="card">
+       <div class="card-body" style="padding: 0;">
+         <table class="table" style="margin: 0;">
+           <thead>
+             <tr>
+               <th>Started</th>
+               <th>Duration</th>
+               <th>Peak Rate</th>
+               <th>Mode</th>
+               <th class="text-end">Counted (exact)</th>
+               <th class="text-end">Overflow</th>
+               <th>Top Errors</th>
+             </tr>
+           </thead>
+           <tbody>
+             <% @storm_events.each do |event| %>
+               <tr>
+                 <td style="white-space: nowrap;">
+                   <%= event.started_at.strftime("%Y-%m-%d %H:%M") %>
+                   <% if event.active? %>
+                     <span class="badge bg-danger ms-1">active</span>
+                   <% end %>
+                 </td>
+                 <td>
+                   <% if event.duration_seconds %>
+                     <%= ActiveSupport::Duration.build(event.duration_seconds).inspect %>
+                   <% else %>
+                     &mdash;
+                   <% end %>
+                 </td>
+                 <td><%= number_with_delimiter(event.peak_rate_per_minute) %>/min</td>
+                 <td>
+                   <% if event.reached_open %>
+                     <span class="badge bg-danger">count-only</span>
+                   <% else %>
+                     <span class="badge bg-warning text-dark">shedding</span>
+                   <% end %>
+                 </td>
+                 <td class="text-end"><%= number_with_delimiter(event.events_counted_only) %></td>
+                 <td class="text-end"><%= number_with_delimiter(event.events_overflow) %></td>
+                 <td style="max-width: 320px;">
+                   <% event.top_fingerprints_list.first(3).each do |fp| %>
+                     <div style="font-size: 12px; font-family: var(--font-mono); overflow: hidden; text-overflow: ellipsis; white-space: nowrap;" title="<%= fp["message"] %>">
+                       <strong><%= fp["class"] %></strong>
+                       <span class="text-muted">&times;<%= number_with_delimiter(fp["count"]) %></span>
+                     </div>
+                   <% end %>
+                 </td>
+               </tr>
+             <% end %>
+           </tbody>
+         </table>
+       </div>
+     </div>
+   <% end %>
+ </div>
data/config/routes.rb CHANGED
@@ -40,6 +40,7 @@ RailsErrorDashboard::Engine.routes.draw do
        get :activestorage_health_summary
        get :llm_health_summary
        get :releases
+       get :storms
        get :user_impact
        get :diagnostic_dumps
        post :create_diagnostic_dump
data/db/migrate/20260306000002_add_instance_variables_to_error_logs.rb CHANGED
@@ -1,7 +1,13 @@
  # frozen_string_literal: true
 
  class AddInstanceVariablesToErrorLogs < ActiveRecord::Migration[7.0]
-   def change
+   def up
+     return if column_exists?(:rails_error_dashboard_error_logs, :instance_variables)
+
      add_column :rails_error_dashboard_error_logs, :instance_variables, :text
    end
+
+   def down
+     remove_column :rails_error_dashboard_error_logs, :instance_variables if column_exists?(:rails_error_dashboard_error_logs, :instance_variables)
+   end
  end
data/db/migrate/20260306000003_create_rails_error_dashboard_swallowed_exceptions.rb CHANGED
@@ -2,6 +2,10 @@
 
  class CreateRailsErrorDashboardSwallowedExceptions < ActiveRecord::Migration[7.0]
    def change
+     # Guard against the squashed schema migration having already created this
+     # table — without it, every later migration is silently cancelled.
+     return if table_exists?(:rails_error_dashboard_swallowed_exceptions)
+
      create_table :rails_error_dashboard_swallowed_exceptions do |t|
        t.string :exception_class, null: false, limit: 250
        t.string :raise_location, null: false, limit: 250
data/db/migrate/20260307000001_create_rails_error_dashboard_diagnostic_dumps.rb CHANGED
@@ -2,6 +2,10 @@
 
  class CreateRailsErrorDashboardDiagnosticDumps < ActiveRecord::Migration[7.0]
    def change
+     # Guard against the squashed schema migration having already created this
+     # table — without it, every later migration is silently cancelled.
+     return if table_exists?(:rails_error_dashboard_diagnostic_dumps)
+
      create_table :rails_error_dashboard_diagnostic_dumps do |t|
        t.references :application, null: false,
          foreign_key: { to_table: :rails_error_dashboard_applications }
data/db/migrate/20260613000001_create_storm_events.rb ADDED
@@ -0,0 +1,28 @@
+ # frozen_string_literal: true
+
+ # Storm protection honesty layer: one row per storm episode, powering the
+ # dashboard banner ("storm detected, counts recorded, detail sampled") and
+ # the storm history page. Small table — a few rows per incident.
+ class CreateStormEvents < ActiveRecord::Migration[7.0]
+   def change
+     return if table_exists?(:rails_error_dashboard_storm_events)
+
+     create_table :rails_error_dashboard_storm_events do |t|
+       t.datetime :started_at, null: false
+       t.datetime :ended_at # NULL while the storm is active
+       t.integer :peak_rate_per_minute, default: 0
+       t.boolean :reached_open, default: false # true if count-only mode engaged
+       t.bigint :events_total, default: 0 # count-only total = counted_only + overflow (excludes :lite/:full rows)
+       t.bigint :events_counted_only, default: 0 # counted in memory, no rows
+       t.bigint :events_overflow, default: 0 # beyond the bounded map — exact total, anonymous identity
+       t.integer :fingerprints_affected, default: 0
+       t.text :top_fingerprints # JSON: top 5 by count [{class, message, count}]
+       t.timestamps
+     end
+
+     add_index :rails_error_dashboard_storm_events, :ended_at,
+       name: "index_red_storm_events_on_ended_at"
+     add_index :rails_error_dashboard_storm_events, :started_at,
+       name: "index_red_storm_events_on_started_at"
+   end
+ end
data/lib/generators/rails_error_dashboard/install/templates/initializer.rb CHANGED
@@ -518,6 +518,42 @@ RailsErrorDashboard.configure do |config|
    # No PII or request bodies in span attributes — just metadata + timing.
    # Safe to enable on production OTel pipelines.
 
+   # ============================================================================
+   # STORM PROTECTION (circuit breaker + adaptive sampling) — ON by default
+   # ============================================================================
+   #
+   # When the error rate spikes (bad deploy, dependency outage), storm
+   # protection limits the gem's own database writes so it never amplifies
+   # the incident. Occurrences are ALWAYS counted exactly — only per-event
+   # detail (context payloads, occurrence rows) is sampled under load.
+   #
+   # How it degrades, in order:
+   # 1. Per-fingerprint cap: past N/min, context is shed, then rows sampled
+   # 2. Global breaker: shedding (context off) → open (count-only mode)
+   # 3. Per-error notifications replaced by ONE "storm in progress" message
+   # 4. Counts reconciled onto error records every flush interval
+   #
+   # All thresholds are PER PROCESS (each Puma worker runs its own breaker).
+   #
+   # config.enable_storm_protection = true
+   # config.storm_fingerprint_full_per_minute = 30 # full-fidelity captures per fingerprint/min
+   # config.storm_occurrence_sample_keep_every = 10 # past the cap, keep every Nth occurrence
+   # config.storm_shedding_threshold_per_second = 10 # global rate entering shedding state
+   # config.storm_open_threshold_per_second = 50 # global rate opening the breaker (count-only)
+   # config.storm_cooldown_seconds = 60 # open → half-open probe delay
+   # config.storm_notification = true # one notification per storm episode
+   #
+   # Always-on issue cap (a storm of NEW critical errors must not open
+   # hundreds of GitHub/Linear issues):
+   # config.auto_issue_rate_limit_count = 5
+   # config.auto_issue_rate_limit_window_minutes = 10
+   #
+   # Calm-weather context economy: an error seen 1000x/day doesn't need 1000
+   # breadcrumb trails. After N full-context captures per fingerprint per day,
+   # context is kept every Mth time (occurrence rows are unaffected):
+   # config.context_sampling_threshold_per_day = 25
+   # config.context_sampling_keep_every = 10
+
    # ============================================================================
    # ISSUE TRACKING (GitHub / GitLab / Codeberg / Linear)
    # ============================================================================
data/lib/rails_error_dashboard/commands/flush_storm_counts.rb ADDED
@@ -0,0 +1,188 @@
+ # frozen_string_literal: true
+
+ module RailsErrorDashboard
+   module Commands
+     # Command: Reconcile counted-not-stored storm events onto ErrorLog rows
+     # and maintain the storm_events episode record.
+     #
+     # Runs in a background job (DB allowed). For each counted fingerprint:
+     # 1. Recompute the canonical error_hash from the stored identity parts
+     # (the gate's key deliberately omits application_id — resolved here)
+     # 2. Unresolved match → single UPDATE: occurrence_count += N
+     # 3. Resolved match → reopen (mirrors FindOrIncrementError semantics)
+     # 4. No match → create a minimal ErrorLog from the exemplar
+     #
+     # Counts are exact. Notifications are NOT dispatched from here — during a
+     # storm they're suppressed by design; the storm notification covers it.
+     class FlushStormCounts
+       def self.call(entries:, overflow: 0, episode: nil)
+         new(entries: entries, overflow: overflow, episode: episode).call
+       end
+
+       def initialize(entries:, overflow: 0, episode: nil)
+         @entries = Array(entries)
+         @overflow = overflow.to_i
+         @episode = episode
+       end
+
+       def call
+         application = resolve_application
+         counted = 0
+
+         @entries.each do |entry|
+           entry = entry.with_indifferent_access if entry.respond_to?(:with_indifferent_access)
+           counted += reconcile_entry(entry, application)
+         rescue => e
+           # A corrupt (non-Hash) entry must not abort the whole batch — and the
+           # log line itself must not assume `entry` is subscriptable (an Integer
+           # from a broken serializer would raise again here, escaping this rescue).
+           error_class = entry.is_a?(Hash) ? entry["error_class"] : entry.class
+           RailsErrorDashboard::Logger.error(
+             "[RailsErrorDashboard] Storm count reconcile failed for #{error_class}: #{e.class} - #{e.message}"
+           )
+         end
+
+         upsert_storm_event(counted)
+         { success: true, reconciled: counted, overflow: @overflow }
+       rescue => e
+         RailsErrorDashboard::Logger.error(
+           "[RailsErrorDashboard] FlushStormCounts failed: #{e.class} - #{e.message}"
+         )
+         { success: false, error: "#{e.class}: #{e.message}" }
+       end
+
+       private
+
+       def reconcile_entry(entry, application)
+         count = entry["count"].to_i
+         return 0 if count <= 0
+
+         error_hash = canonical_hash(entry, application)
+         last_seen = parse_time(entry["last_seen_at"]) || Time.current
+
+         # Priority 1: unresolved match — one UPDATE, no row instantiation
+         updated = ErrorLog.unresolved
+           .where(error_hash: error_hash, application_id: application.id)
+           .update_all([ "occurrence_count = occurrence_count + ?, last_seen_at = ?", count, last_seen ])
+         return count if updated.positive?
+
+         # Priority 2: resolved/wont_fix match — reopen, mirroring
+         # FindOrIncrementError so storm recurrences don't stay buried
+         resolved = ErrorLog
+           .where(error_hash: error_hash, application_id: application.id)
+           .where(status: %w[resolved wont_fix])
+           .order(last_seen_at: :desc)
+           .first
+         if resolved
+           attrs = {
+             resolved: false,
+             status: "new",
+             resolved_at: nil,
+             occurrence_count: resolved.occurrence_count + count,
+             last_seen_at: last_seen
+           }
+           attrs[:reopened_at] = Time.current if ErrorLog.column_names.include?("reopened_at")
+           resolved.update!(attrs)
+           return count
+         end
+
+         # Priority 3: first seen during count-only mode — minimal ErrorLog
+         # from the exemplar (no backtrace/context was captured; the next
+         # occurrence after the storm fills in detail via the normal path)
+         ErrorLog.create!(
+           application_id: application.id,
+           error_type: entry["error_class"],
+           message: entry["message"],
+           backtrace: entry["first_app_frame"],
+           controller_name: entry["controller_name"],
+           action_name: entry["action_name"],
+           occurred_at: parse_time(entry["first_seen_at"]) || Time.current,
+           last_seen_at: last_seen,
+           occurrence_count: count,
+           error_hash: error_hash,
+           resolved: false
+         )
+         count
+       end
+
+       # Mirrors ErrorHashGenerator.call exactly: same fields, same order,
+       # same normalization — so counts land on the same ErrorLog the full
+       # capture path would have used.
+       def canonical_hash(entry, application)
+         return entry["custom_hash"] if entry["custom_hash"].present?
+
+         digest_input = [
+           entry["error_class"],
+           Services::ErrorHashGenerator.normalize_message(entry["message"]),
+           entry["first_app_frame"],
+           entry["controller_name"],
+           entry["action_name"],
+           application.id.to_s
+         ].compact.join("|")
+
+         Digest::SHA256.hexdigest(digest_input)[0..15]
+       end
+
+       def resolve_application
+         # Same chain LogError uses — app name is process-global
+         app_name = RailsErrorDashboard.configuration.application_name ||
+           ENV["APPLICATION_NAME"] ||
+           (defined?(Rails) && Rails.application.class.module_parent_name) ||
+           "Rails Application"
+         Application.find_or_create_by_name(app_name)
+       end
+
+       def upsert_storm_event(counted)
+         return unless @episode.is_a?(Hash)
+         return unless StormEvent.table_exists?
+
+         started_at = parse_time(@episode["started_at"])
+         return unless started_at
+
+         event = StormEvent.active.recent_first.first || StormEvent.create!(started_at: started_at)
+
+         event.events_counted_only = event.events_counted_only.to_i + counted
+         event.events_overflow = event.events_overflow.to_i + @overflow
+         # events_total is the count-only total: in-map reconciled + overflow.
+         # It deliberately excludes :lite/:full admissions (those became real
+         # ErrorLog rows on the hot path and are never counted here), so it is
+         # always events_counted_only + events_overflow. Derive it rather than
+         # accumulate so it can't drift from its two components.
+         event.events_total = event.events_counted_only.to_i + event.events_overflow.to_i
+         event.fingerprints_affected = [ event.fingerprints_affected.to_i, @entries.size ].max
+         event.peak_rate_per_minute = [ event.peak_rate_per_minute.to_i, @episode["peak_rate_per_minute"].to_i ].max
+         event.reached_open ||= @episode["reached_open"] == true
+         event.top_fingerprints = top_fingerprints_json(event)
+         event.ended_at = parse_time(@episode["ended_at"]) if @episode["ended_at"]
+         event.save!
+       rescue => e
+         RailsErrorDashboard::Logger.error(
+           "[RailsErrorDashboard] Storm event upsert failed: #{e.class} - #{e.message}"
+         )
+       end
+
+       def top_fingerprints_json(event)
+         existing = event.top_fingerprints_list
+         fresh = @entries.map { |e|
+           e = e.with_indifferent_access if e.respond_to?(:with_indifferent_access)
+           { "class" => e["error_class"], "message" => e["message"].to_s[0, 120], "count" => e["count"].to_i }
+         }
+
+         merged = (existing + fresh)
+           .group_by { |f| [ f["class"], f["message"] ] }
+           .map { |_k, group| group.first.merge("count" => group.sum { |f| f["count"].to_i }) }
+
+         merged.sort_by { |f| -f["count"].to_i }.first(5).to_json
+       end
+
+       def parse_time(value)
+         return value if value.is_a?(Time) || value.is_a?(ActiveSupport::TimeWithZone)
+         return nil if value.blank?
+
+         Time.zone.parse(value.to_s)
+       rescue ArgumentError
+         nil
+       end
+     end
+   end
+ end
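Note on `top_fingerprints_json`: each flush merges the stored top list with the new batch, summing counts per `[class, message]` pair before re-ranking and keeping five. The following standalone sketch reproduces that merge outside Rails so the behavior across two flushes is easy to see. The helper name `merge_top_fingerprints` and the sample error data are made up for illustration.

```ruby
require "json"

# Standalone version of the merge: combine the stored top list with the
# entries from a new flush, summing counts per [class, message] pair.
def merge_top_fingerprints(existing, fresh, limit: 5)
  (existing + fresh)
    .group_by { |f| [f["class"], f["message"]] }
    .map { |_key, group| group.first.merge("count" => group.sum { |f| f["count"].to_i }) }
    .sort_by { |f| -f["count"] }
    .first(limit)
end

# Illustrative data: what a previous flush stored on the episode row.
existing = [
  { "class" => "Timeout::Error", "message" => "execution expired", "count" => 40 },
  { "class" => "NoMethodError", "message" => "undefined method `id' for nil", "count" => 12 }
]

# Illustrative data: the batch arriving with the next flush.
fresh = [
  { "class" => "Timeout::Error", "message" => "execution expired", "count" => 25 },
  { "class" => "KeyError", "message" => "key not found: :user", "count" => 30 }
]

merged = merge_top_fingerprints(existing, fresh)
puts JSON.generate(merged)
```

Because the grouping key includes the message, the command's truncation of fresh messages to 120 characters matters: the stored entries were truncated the same way on earlier flushes, so the same error keeps matching its stored entry.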