rails_error_dashboard 0.8.1 → 0.8.2
- checksums.yaml +4 -4
- data/README.md +22 -0
- data/app/controllers/rails_error_dashboard/application_controller.rb +5 -0
- data/app/controllers/rails_error_dashboard/errors_controller.rb +12 -0
- data/app/jobs/rails_error_dashboard/storm_flush_job.rb +19 -0
- data/app/jobs/rails_error_dashboard/storm_notification_job.rb +74 -0
- data/app/models/rails_error_dashboard/storm_event.rb +34 -0
- data/app/views/layouts/rails_error_dashboard.html.erb +21 -0
- data/app/views/rails_error_dashboard/errors/storms.html.erb +91 -0
- data/config/routes.rb +1 -0
- data/db/migrate/20260306000002_add_instance_variables_to_error_logs.rb +7 -1
- data/db/migrate/20260306000003_create_rails_error_dashboard_swallowed_exceptions.rb +4 -0
- data/db/migrate/20260307000001_create_rails_error_dashboard_diagnostic_dumps.rb +4 -0
- data/db/migrate/20260613000001_create_storm_events.rb +28 -0
- data/lib/generators/rails_error_dashboard/install/templates/initializer.rb +36 -0
- data/lib/rails_error_dashboard/commands/flush_storm_counts.rb +188 -0
- data/lib/rails_error_dashboard/commands/log_error.rb +70 -12
- data/lib/rails_error_dashboard/configuration.rb +60 -0
- data/lib/rails_error_dashboard/queries/storm_history.rb +39 -0
- data/lib/rails_error_dashboard/services/storm_protection/circuit_breaker.rb +195 -0
- data/lib/rails_error_dashboard/services/storm_protection/count_buffer.rb +100 -0
- data/lib/rails_error_dashboard/services/storm_protection/fingerprint_buckets.rb +123 -0
- data/lib/rails_error_dashboard/services/storm_protection/gate.rb +258 -0
- data/lib/rails_error_dashboard/subscribers/issue_tracker_subscriber.rb +12 -0
- data/lib/rails_error_dashboard/version.rb +1 -1
- data/lib/rails_error_dashboard.rb +6 -0
- metadata +13 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c51d8221b7ca00ce0f9242f98f3594fd86028a2c98c464fcac28aac34f6d2a3b
+  data.tar.gz: 597373253abc68befcf70edab809093c320740fca1a575bcc8cae7fd6ae7e3e5
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2bcae54eb97ce19345f5ca8a92e515c7b44d9957afc985885bda168afddadc24b3640b7fc3dcdfbe5e80637564975be851d7aeb2d2be14fbdfb8b38bf769300c
+  data.tar.gz: f269598ef3e5c0b4a7761be4bb4ddeeef32f662dd18f6e75dca292f108804b8d73f4d95e0f30e86a550e51478ec66d43431bb722740caf7d0149f98addb9c042
|
data/README.md
CHANGED
|
@@ -79,6 +79,28 @@ Error capture from controllers, jobs, and middleware. Custom-designed dashboard
 
 ### Optional Features
 
+<details>
+<summary><strong>Storm Protection — Circuit Breaker + Adaptive Sampling</strong></summary>
+
+When the error rate spikes (a bad deploy throwing thousands of errors a minute), the nightmare scenario for any in-process tracker is amplifying the outage with its own database writes. Storm protection makes the gem **provably degrade itself first** — ON by default.
+
+- **Per-fingerprint caps:** past N occurrences/minute per error, context is shed, then rows are sampled deterministically (a fresh exemplar is always kept each minute)
+- **Global circuit breaker:** sustained floods flip the gem to count-only mode — zero per-event I/O, exact in-memory counts reconciled onto error records every 30s. Async mode is gated too (a SolidQueue enqueue is itself a DB write)
+- **One storm notification** replaces hundreds of per-error pings; auto-issue creation is capped (default 5 per 10 min) so a storm of new errors can't open 500 GitHub/Linear issues
+- **Honest accounting:** a dashboard banner during/after the storm, a Storm History page with exact counts of everything shed, and `reached_open`/peak-rate per episode. Counts are never extrapolated
+- **Calm-weather economy:** after 25 full-context captures of the same error per day, context is sampled (occurrence counting unaffected)
+- **Fails open:** any internal storm-protection error means full capture. Protection can never be the thing that loses an error
+
+```ruby
+config.enable_storm_protection = true # default
+config.storm_open_threshold_per_second = 50 # per process
+```
+
+All thresholds are per process and individually configurable. Disable with one flag.
+
+**Measured overhead** (Apple Silicon, Ruby 4.0): 2.4µs/error with protection active and calm, 2.95µs in count-only mode, 0.2µs when disabled — against a 5µs budget. The check is a digest plus an atomic increment; there is no I/O on the hot path.
+</details>
+
 <details>
 <summary><strong>Breadcrumbs — Request Activity Trail</strong></summary>
 
|
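As a reading aid for the README entry above, the two global thresholds can be pictured as a ladder from full capture to shedding to count-only mode. The sketch below is only an illustration under stated assumptions: `breaker_state` is a hypothetical helper, the 10/s shedding default is the one listed in the initializer template in this diff, the `>=` comparisons are a guess, and the gem's actual breaker also involves sustained-rate detection and a cooldown that are not modelled here.

```ruby
# Hypothetical helper, not part of rails_error_dashboard.
# Maps an observed per-process error rate to the degradation level
# the README describes. The >= comparisons are an assumption.
def breaker_state(rate_per_second, shedding_at: 10, open_at: 50)
  return :open if rate_per_second >= open_at         # count-only mode
  return :shedding if rate_per_second >= shedding_at # context is shed
  :closed                                            # full capture
end

[3, 12, 80].each do |rate|
  puts "#{rate}/s -> #{breaker_state(rate)}"
end
```

Because every threshold is per process, a fleet of N workers would tolerate roughly N times the configured rate before all of them open.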
data/app/controllers/rails_error_dashboard/application_controller.rb
CHANGED
|
@@ -77,6 +77,11 @@ module RailsErrorDashboard
     def set_common_view_variables
       @applications = Application.ordered_by_name.pluck(:name, :id) rescue []
       @default_credentials_warning = RailsErrorDashboard.configuration.default_credentials? rescue false
+      # Only query when a before_action (e.g. ErrorsController#load_storm_banner)
+      # hasn't already loaded it this request — the error renderer runs after
+      # those callbacks, so reuse their result instead of querying a second time.
+      # Tracked by a flag rather than the value, since the common case is nil.
+      @storm_banner_event = Queries::StormHistory.banner_event unless @storm_banner_loaded
     end
   end
 end
|
data/app/controllers/rails_error_dashboard/errors_controller.rb
CHANGED
|
@@ -5,6 +5,7 @@ module RailsErrorDashboard
     before_action :authenticate_dashboard_user!
     before_action :set_application_context
     before_action :check_default_credentials
+    before_action :load_storm_banner
 
     FILTERABLE_PARAMS = %i[
       error_type
@@ -329,6 +330,12 @@ module RailsErrorDashboard
       @pagy, @releases = pagy(:offset, all_releases, limit: params[:per_page] || 25)
     end
 
+    def storms
+      result = Queries::StormHistory.call
+      @active_storm = result[:active]
+      @storm_events = result[:events]
+    end
+
     def user_impact
       days = days_param(default: 30)
       @days = days
@@ -729,6 +736,11 @@ module RailsErrorDashboard
       @default_credentials_warning = RailsErrorDashboard.configuration.default_credentials?
     end
 
+    def load_storm_banner
+      @storm_banner_loaded = true
+      @storm_banner_event = Queries::StormHistory.banner_event
+    end
+
     def authenticate_dashboard_user!
       auth_lambda = RailsErrorDashboard.configuration.authenticate_with
 
|
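The flag-versus-value choice in `load_storm_banner` can be illustrated outside Rails. In this minimal sketch, `BannerLookup` and its private `query` method are hypothetical stand-ins for the controller pair and for `Queries::StormHistory.banner_event`; it only tries to show why a separate "loaded" flag avoids a second lookup when the result is usually nil.

```ruby
# Hypothetical stand-in for the two controllers; not part of the gem.
class BannerLookup
  attr_reader :queries

  def initialize
    @queries = 0
    @loaded = false
    @event = nil
  end

  # Plays the role of the before_action: always queries, then marks as loaded.
  def load_banner
    @loaded = true
    @event = query
  end

  # Plays the role of set_common_view_variables: queries only when
  # nothing has been loaded yet in this request.
  def common_view_variables
    @event = query unless @loaded
    @event
  end

  private

  # Counts lookups and returns nil, the common "no storm" result.
  def query
    @queries += 1
    nil
  end
end

lookup = BannerLookup.new
lookup.load_banner
lookup.common_view_variables
puts lookup.queries
```

A plain `@event ||= query` would run the lookup again whenever the first result was nil, which the comment in the diff describes as the common case.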
data/app/jobs/rails_error_dashboard/storm_flush_job.rb
ADDED
|
@@ -0,0 +1,19 @@
+# frozen_string_literal: true
+
+module RailsErrorDashboard
+  # Persists storm-protection count snapshots (mirrors SwallowedExceptionFlushJob):
+  # the gate accumulates counts in memory with zero I/O, snapshots are handed
+  # to this job at most once per flush interval, and ALL DB writes happen here.
+  class StormFlushJob < ApplicationJob
+    queue_as :default
+
+    def perform(entries: [], overflow: 0, episode: nil)
+      entries = entries.map { |e| e.respond_to?(:stringify_keys) ? e.stringify_keys : e }
+      episode = episode.stringify_keys if episode.respond_to?(:stringify_keys)
+
+      Commands::FlushStormCounts.call(entries: entries, overflow: overflow, episode: episode)
+    rescue => e
+      Rails.logger.error("[RailsErrorDashboard] StormFlushJob failed: #{e.class} - #{e.message}")
+    end
+  end
+end
|
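The key normalisation in `perform` means the flush command can index entries with string keys whichever key type arrives, while anything that is not a Hash passes through untouched. A rough plain-Ruby sketch of that step follows; `normalize_entries` is a hypothetical name, and core `Hash#transform_keys` is used here as a stand-in for ActiveSupport's `stringify_keys`.

```ruby
# Hypothetical helper illustrating the normalisation step only.
# Hash entries get string keys; anything else is returned unchanged
# so that a later stage can decide how to handle it.
def normalize_entries(entries)
  entries.map do |entry|
    entry.respond_to?(:transform_keys) ? entry.transform_keys(&:to_s) : entry
  end
end

snapshot = [{ error_class: "Timeout::Error", count: 42 }, 7]
p normalize_entries(snapshot)
```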
data/app/jobs/rails_error_dashboard/storm_notification_job.rb
ADDED
|
@@ -0,0 +1,74 @@
+# frozen_string_literal: true
+
+module RailsErrorDashboard
+  # Sends the SINGLE "error storm in progress" notification per storm episode.
+  #
+  # During a storm, per-error notifications are suppressed (500 Slack pings
+  # help nobody) — this one message replaces them. The gate guarantees at
+  # most one enqueue per episode; this job just delivers.
+  class StormNotificationJob < ApplicationJob
+    queue_as :default
+
+    # @param started_at [String] ISO8601 episode start
+    # @param state [String] breaker state at notification time ("shedding"/"open")
+    def perform(started_at:, state: "shedding")
+      config = RailsErrorDashboard.configuration
+      message = build_message(started_at, state, config)
+
+      if config.enable_slack_notifications && config.slack_webhook_url.present?
+        post_json(config.slack_webhook_url, { text: message })
+      end
+
+      if config.enable_discord_notifications && config.discord_webhook_url.present?
+        post_json(config.discord_webhook_url, { content: message })
+      end
+
+      if config.enable_webhook_notifications && config.webhook_urls.any?
+        payload = {
+          event: "error_storm_detected",
+          started_at: started_at,
+          state: state,
+          application: app_name(config)
+        }
+        config.webhook_urls.each { |url| post_json(url, payload) }
+      end
+    rescue => e
+      Rails.logger.error("[RailsErrorDashboard] StormNotificationJob failed: #{e.class} - #{e.message}")
+    end
+
+    private
+
+    def build_message(started_at, state, config)
+      mode = state == "open" ? "count-only mode (occurrences tallied, detail paused)" : "shedding mode (context sampling active)"
+      dashboard = (config.dashboard_base_url || "").chomp("/")
+      link = dashboard.present? ? " Dashboard: #{dashboard}/errors/storms" : ""
+
+      ":warning: Error storm detected in #{app_name(config)} at #{started_at}. " \
+        "Storm protection engaged — #{mode}. Per-error notifications are " \
+        "suppressed until the storm subsides; exact counts are preserved.#{link}"
+    end
+
+    def app_name(config)
+      config.application_name || ENV["APPLICATION_NAME"] ||
+        (defined?(Rails) && Rails.application.class.module_parent_name) || "Rails Application"
+    end
+
+    def post_json(url, payload)
+      if defined?(HTTParty)
+        HTTParty.post(url, body: payload.to_json,
+          headers: { "Content-Type" => "application/json" }, timeout: 10)
+      else
+        uri = URI(url)
+        http = Net::HTTP.new(uri.host, uri.port)
+        http.use_ssl = uri.scheme == "https"
+        http.open_timeout = 5
+        http.read_timeout = 10
+        request = Net::HTTP::Post.new(uri.path, { "Content-Type" => "application/json" })
+        request.body = payload.to_json
+        http.request(request)
+      end
+    rescue => e
+      Rails.logger.error("[RailsErrorDashboard] Storm notification post failed: #{e.message}")
+    end
+  end
+end
|
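One detail of the `Net::HTTP` fallback may be worth a second look: the request is built from `uri.path`, which leaves out any query string and is empty for a bare host URL. The sketch below is not the gem's code; `build_json_post` is a hypothetical helper and the example URLs are made up. It shows a variant using `URI::HTTP#request_uri`, which keeps the query and falls back to `/`, and it only builds the request without sending anything.

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical helper: builds (but does not send) a JSON POST request.
# request_uri returns the path plus "?query", or "/" when the path is empty.
def build_json_post(url, payload)
  uri = URI(url)
  request = Net::HTTP::Post.new(uri.request_uri, { "Content-Type" => "application/json" })
  request.body = JSON.generate(payload)
  request
end

req = build_json_post("https://example.com/hook?token=abc", { text: "storm" })
puts req.path
puts req.body
```

Whether this matters in practice depends on the webhook URLs in use; endpoints that carry a token in the query string would be the ones affected.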
data/app/models/rails_error_dashboard/storm_event.rb
ADDED
|
@@ -0,0 +1,34 @@
+# frozen_string_literal: true
+
+module RailsErrorDashboard
+  # One row per storm-protection episode (per process). Powers the dashboard
+  # banner and the storm history page. Counts are exact, not extrapolated.
+  # Inherits ErrorLogsRecord so separate-database routing applies.
+  class StormEvent < ErrorLogsRecord
+    self.table_name = "rails_error_dashboard_storm_events"
+
+    scope :active, -> { where(ended_at: nil) }
+    scope :recent_first, -> { order(started_at: :desc) }
+    scope :ended_within, ->(duration) { where.not(ended_at: nil).where(ended_at: duration.ago..) }
+
+    def active?
+      ended_at.nil?
+    end
+
+    def duration_seconds
+      return nil unless ended_at
+
+      (ended_at - started_at).round
+    end
+
+    # @return [Array<Hash>] top fingerprints by count, [] when absent/corrupt
+    def top_fingerprints_list
+      return [] if top_fingerprints.blank?
+
+      parsed = JSON.parse(top_fingerprints)
+      parsed.is_a?(Array) ? parsed : []
+    rescue JSON::ParserError
+      []
+    end
+  end
+end
|
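The defensive parsing in `top_fingerprints_list` can be tried in isolation. The following is a plain-Ruby approximation rather than the model method itself: `parse_top_fingerprints` is a hypothetical free function, and the `nil`/empty check stands in for ActiveSupport's `blank?`.

```ruby
require "json"

# Hypothetical free-function version of the model method.
# Returns an Array for valid JSON arrays and [] for anything else.
def parse_top_fingerprints(raw)
  return [] if raw.nil? || raw.strip.empty?

  parsed = JSON.parse(raw)
  parsed.is_a?(Array) ? parsed : []
rescue JSON::ParserError
  []
end

p parse_top_fingerprints('[{"class":"Timeout::Error","count":3}]')
p parse_top_fingerprints("not json")
```

Returning an empty array for every bad input lets the storm history view call `first(3).each` on the result without further guards.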
data/app/views/layouts/rails_error_dashboard.html.erb
CHANGED
|
@@ -1036,6 +1036,9 @@ tr[data-red-row-href]:hover .sev-bar { opacity: 1 !important; }
 <% if RailsErrorDashboard.configuration.enable_diagnostic_dump %>
   <% diag_items << { path: diagnostic_dumps_errors_path(nav_params), icon: 'bi-clipboard-pulse', label: 'Diagnostics' } %>
 <% end %>
+<% if RailsErrorDashboard.configuration.enable_storm_protection %>
+  <% diag_items << { path: storms_errors_path(nav_params), icon: 'bi-cloud-lightning-rain', label: 'Storms' } %>
+<% end %>
 
 <% if diag_items.any? %>
   <div style="margin-bottom: var(--space-2);" id="navDiagSection">
@@ -1175,6 +1178,24 @@ tr[data-red-row-href]:hover .sev-bar { opacity: 1 !important; }
   </div>
   <script<%= " nonce=\"#{red_csp_nonce}\"".html_safe if red_csp_nonce %>>try { if (sessionStorage.getItem('red_dismiss_creds_warning') === '1') { document.getElementById('security-warning').style.display = 'none'; } } catch(e) {}</script>
 <% end %>
+<% if defined?(@storm_banner_event) && @storm_banner_event %>
+  <% storm_active = @storm_banner_event.active? %>
+  <div id="storm-banner" class="alert <%= storm_active ? 'alert-danger' : 'alert-info' %>" style="display: flex; align-items: center; gap: 10px; margin-top: var(--space-2); margin-bottom: var(--space-4); border-left: 4px solid <%= storm_active ? 'var(--status-critical)' : 'var(--status-info)' %>;">
+    <i class="bi <%= storm_active ? 'bi-cloud-lightning-rain-fill' : 'bi-cloud-check' %>" style="font-size: 18px; flex-shrink: 0;"></i>
+    <div style="flex: 1;">
+      <% if storm_active %>
+        <strong>Error storm in progress</strong> (since <%= @storm_banner_event.started_at.strftime("%H:%M %Z") %>) —
+        storm protection engaged. Occurrences are being counted exactly;
+        <%= @storm_banner_event.reached_open ? "detail capture is paused (count-only mode)" : "context capture is sampled" %>.
+      <% else %>
+        <strong>Storm protection engaged recently</strong> (ended <%= @storm_banner_event.ended_at.strftime("%H:%M %Z") %>) —
+        <%= number_with_delimiter(@storm_banner_event.events_counted_only.to_i + @storm_banner_event.events_overflow.to_i) %> occurrences
+        were recorded as exact counts with sampled detail.
+      <% end %>
+      <a href="<%= storms_errors_path %>" style="margin-left: 4px;">View storm history</a>
+    </div>
+  </div>
+<% end %>
 <%= yield %>
 </main>
 
|
data/app/views/rails_error_dashboard/errors/storms.html.erb
ADDED
|
@@ -0,0 +1,91 @@
+<% content_for :page_title, "Storm History" %>
+
+<div>
+  <div class="d-flex justify-content-between align-items-center mb-4">
+    <h1 style="font-size: 20px; font-weight: 700; margin: 0;">
+      <i class="bi bi-cloud-lightning-rain me-2"></i>
+      Storm History
+    </h1>
+  </div>
+
+  <p class="text-muted" style="font-size: 13px; margin-bottom: var(--space-4);">
+    When the error rate spikes (a bad deploy, a dependency outage), storm protection
+    limits the gem's own database writes so it never amplifies the incident.
+    Occurrences are always counted exactly — only per-event detail is sampled.
+    Thresholds are configurable per process via <code>storm_*</code> options.
+  </p>
+
+  <% if @active_storm %>
+    <div class="alert alert-danger" style="border-left: 4px solid var(--status-critical);">
+      <i class="bi bi-cloud-lightning-rain-fill me-1"></i>
+      <strong>Storm in progress</strong> since <%= @active_storm.started_at.strftime("%Y-%m-%d %H:%M %Z") %> —
+      peak rate <%= number_with_delimiter(@active_storm.peak_rate_per_minute) %>/min.
+    </div>
+  <% end %>
+
+  <% if @storm_events.empty? %>
+    <div class="red-empty-state">
+      <i class="bi bi-cloud-sun display-1 text-muted mb-3"></i>
+      <div class="red-empty-state-title">No Storms Recorded</div>
+      <p class="text-muted">
+        Storm protection is standing by. If an error flood ever hits this app,
+        episodes will appear here with exact counts of what was shed.
+      </p>
+    </div>
+  <% else %>
+    <div class="card">
+      <div class="card-body" style="padding: 0;">
+        <table class="table" style="margin: 0;">
+          <thead>
+            <tr>
+              <th>Started</th>
+              <th>Duration</th>
+              <th>Peak Rate</th>
+              <th>Mode</th>
+              <th class="text-end">Counted (exact)</th>
+              <th class="text-end">Overflow</th>
+              <th>Top Errors</th>
+            </tr>
+          </thead>
+          <tbody>
+            <% @storm_events.each do |event| %>
+              <tr>
+                <td style="white-space: nowrap;">
+                  <%= event.started_at.strftime("%Y-%m-%d %H:%M") %>
+                  <% if event.active? %>
+                    <span class="badge bg-danger ms-1">active</span>
+                  <% end %>
+                </td>
+                <td>
+                  <% if event.duration_seconds %>
+                    <%= ActiveSupport::Duration.build(event.duration_seconds).inspect %>
+                  <% else %>
+                    —
+                  <% end %>
+                </td>
+                <td><%= number_with_delimiter(event.peak_rate_per_minute) %>/min</td>
+                <td>
+                  <% if event.reached_open %>
+                    <span class="badge bg-danger">count-only</span>
+                  <% else %>
+                    <span class="badge bg-warning text-dark">shedding</span>
+                  <% end %>
+                </td>
+                <td class="text-end"><%= number_with_delimiter(event.events_counted_only) %></td>
+                <td class="text-end"><%= number_with_delimiter(event.events_overflow) %></td>
+                <td style="max-width: 320px;">
+                  <% event.top_fingerprints_list.first(3).each do |fp| %>
+                    <div style="font-size: 12px; font-family: var(--font-mono); overflow: hidden; text-overflow: ellipsis; white-space: nowrap;" title="<%= fp["message"] %>">
+                      <strong><%= fp["class"] %></strong>
+                      <span class="text-muted">×<%= number_with_delimiter(fp["count"]) %></span>
+                    </div>
+                  <% end %>
+                </td>
+              </tr>
+            <% end %>
+          </tbody>
+        </table>
+      </div>
+    </div>
+  <% end %>
+</div>
|
data/config/routes.rb
CHANGED
|
data/db/migrate/20260306000002_add_instance_variables_to_error_logs.rb
CHANGED
|
@@ -1,7 +1,13 @@
 # frozen_string_literal: true
 
 class AddInstanceVariablesToErrorLogs < ActiveRecord::Migration[7.0]
-  def
+  def up
+    return if column_exists?(:rails_error_dashboard_error_logs, :instance_variables)
+
     add_column :rails_error_dashboard_error_logs, :instance_variables, :text
   end
+
+  def down
+    remove_column :rails_error_dashboard_error_logs, :instance_variables if column_exists?(:rails_error_dashboard_error_logs, :instance_variables)
+  end
 end
|
data/db/migrate/20260306000003_create_rails_error_dashboard_swallowed_exceptions.rb
CHANGED
|
@@ -2,6 +2,10 @@
 
 class CreateRailsErrorDashboardSwallowedExceptions < ActiveRecord::Migration[7.0]
   def change
+    # Guard against the squashed schema migration having already created this
+    # table — without it, every later migration is silently cancelled.
+    return if table_exists?(:rails_error_dashboard_swallowed_exceptions)
+
     create_table :rails_error_dashboard_swallowed_exceptions do |t|
       t.string :exception_class, null: false, limit: 250
       t.string :raise_location, null: false, limit: 250
|
data/db/migrate/20260307000001_create_rails_error_dashboard_diagnostic_dumps.rb
CHANGED
|
@@ -2,6 +2,10 @@
 
 class CreateRailsErrorDashboardDiagnosticDumps < ActiveRecord::Migration[7.0]
   def change
+    # Guard against the squashed schema migration having already created this
+    # table — without it, every later migration is silently cancelled.
+    return if table_exists?(:rails_error_dashboard_diagnostic_dumps)
+
     create_table :rails_error_dashboard_diagnostic_dumps do |t|
       t.references :application, null: false,
         foreign_key: { to_table: :rails_error_dashboard_applications }
|
data/db/migrate/20260613000001_create_storm_events.rb
ADDED
|
@@ -0,0 +1,28 @@
+# frozen_string_literal: true
+
+# Storm protection honesty layer: one row per storm episode, powering the
+# dashboard banner ("storm detected, counts recorded, detail sampled") and
+# the storm history page. Small table — a few rows per incident.
+class CreateStormEvents < ActiveRecord::Migration[7.0]
+  def change
+    return if table_exists?(:rails_error_dashboard_storm_events)
+
+    create_table :rails_error_dashboard_storm_events do |t|
+      t.datetime :started_at, null: false
+      t.datetime :ended_at # NULL while the storm is active
+      t.integer :peak_rate_per_minute, default: 0
+      t.boolean :reached_open, default: false # true if count-only mode engaged
+      t.bigint :events_total, default: 0 # count-only total = counted_only + overflow (excludes :lite/:full rows)
+      t.bigint :events_counted_only, default: 0 # counted in memory, no rows
+      t.bigint :events_overflow, default: 0 # beyond the bounded map — exact total, anonymous identity
+      t.integer :fingerprints_affected, default: 0
+      t.text :top_fingerprints # JSON: top 5 by count [{class, message, count}]
+      t.timestamps
+    end
+
+    add_index :rails_error_dashboard_storm_events, :ended_at,
+      name: "index_red_storm_events_on_ended_at"
+    add_index :rails_error_dashboard_storm_events, :started_at,
+      name: "index_red_storm_events_on_started_at"
+  end
+end
|
data/lib/generators/rails_error_dashboard/install/templates/initializer.rb
CHANGED
|
@@ -518,6 +518,42 @@ RailsErrorDashboard.configure do |config|
   # No PII or request bodies in span attributes — just metadata + timing.
   # Safe to enable on production OTel pipelines.
 
+  # ============================================================================
+  # STORM PROTECTION (circuit breaker + adaptive sampling) — ON by default
+  # ============================================================================
+  #
+  # When the error rate spikes (bad deploy, dependency outage), storm
+  # protection limits the gem's own database writes so it never amplifies
+  # the incident. Occurrences are ALWAYS counted exactly — only per-event
+  # detail (context payloads, occurrence rows) is sampled under load.
+  #
+  # How it degrades, in order:
+  # 1. Per-fingerprint cap: past N/min, context is shed, then rows sampled
+  # 2. Global breaker: shedding (context off) → open (count-only mode)
+  # 3. Per-error notifications replaced by ONE "storm in progress" message
+  # 4. Counts reconciled onto error records every flush interval
+  #
+  # All thresholds are PER PROCESS (each Puma worker runs its own breaker).
+  #
+  # config.enable_storm_protection = true
+  # config.storm_fingerprint_full_per_minute = 30 # full-fidelity captures per fingerprint/min
+  # config.storm_occurrence_sample_keep_every = 10 # past the cap, keep every Nth occurrence
+  # config.storm_shedding_threshold_per_second = 10 # global rate entering shedding state
+  # config.storm_open_threshold_per_second = 50 # global rate opening the breaker (count-only)
+  # config.storm_cooldown_seconds = 60 # open → half-open probe delay
+  # config.storm_notification = true # one notification per storm episode
+  #
+  # Always-on issue cap (a storm of NEW critical errors must not open
+  # hundreds of GitHub/Linear issues):
+  # config.auto_issue_rate_limit_count = 5
+  # config.auto_issue_rate_limit_window_minutes = 10
+  #
+  # Calm-weather context economy: an error seen 1000x/day doesn't need 1000
+  # breadcrumb trails. After N full-context captures per fingerprint per day,
+  # context is kept every Mth time (occurrence rows are unaffected):
+  # config.context_sampling_threshold_per_day = 25
+  # config.context_sampling_keep_every = 10
+
   # ============================================================================
   # ISSUE TRACKING (GitHub / GitLab / Codeberg / Linear)
   # ============================================================================
|
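To make the per-fingerprint options above more concrete, here is a rough sketch of how a per-minute cap combined with "keep every Nth" sampling could behave. It is an assumption-based illustration, not the gem's `FingerprintBuckets` implementation (which lives in a file whose contents are not shown here): the class name `FingerprintBucket`, the return symbols, and the two-tier outcome are all hypothetical, and the intermediate "context shed" tier mentioned in the README is left out.

```ruby
# Illustrative only: names and tiering are assumptions, not the gem's internals.
class FingerprintBucket
  def initialize(full_per_minute: 30, keep_every: 10)
    @full_per_minute = full_per_minute
    @keep_every = keep_every
    @minute = nil
    @count = 0
  end

  # Returns :full while under the per-minute cap, :sampled for every Nth
  # occurrence past the cap, and :count_only for everything else.
  def admit(now = Time.now)
    minute = now.to_i / 60
    if minute != @minute
      # A new minute resets the counter, so a fresh exemplar is kept.
      @minute = minute
      @count = 0
    end
    @count += 1
    return :full if @count <= @full_per_minute

    ((@count - @full_per_minute) % @keep_every).zero? ? :sampled : :count_only
  end
end

bucket = FingerprintBucket.new(full_per_minute: 2, keep_every: 3)
start = Time.at(0)
p 8.times.map { bucket.admit(start) }
```

The decision depends only on a counter, so it is deterministic and needs no random number generator on the hot path.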
data/lib/rails_error_dashboard/commands/flush_storm_counts.rb
ADDED
|
@@ -0,0 +1,188 @@
+# frozen_string_literal: true
+
+module RailsErrorDashboard
+  module Commands
+    # Command: Reconcile counted-not-stored storm events onto ErrorLog rows
+    # and maintain the storm_events episode record.
+    #
+    # Runs in a background job (DB allowed). For each counted fingerprint:
+    # 1. Recompute the canonical error_hash from the stored identity parts
+    # (the gate's key deliberately omits application_id — resolved here)
+    # 2. Unresolved match → single UPDATE: occurrence_count += N
+    # 3. Resolved match → reopen (mirrors FindOrIncrementError semantics)
+    # 4. No match → create a minimal ErrorLog from the exemplar
+    #
+    # Counts are exact. Notifications are NOT dispatched from here — during a
+    # storm they're suppressed by design; the storm notification covers it.
+    class FlushStormCounts
+      def self.call(entries:, overflow: 0, episode: nil)
+        new(entries: entries, overflow: overflow, episode: episode).call
+      end
+
+      def initialize(entries:, overflow: 0, episode: nil)
+        @entries = Array(entries)
+        @overflow = overflow.to_i
+        @episode = episode
+      end
+
+      def call
+        application = resolve_application
+        counted = 0
+
+        @entries.each do |entry|
+          entry = entry.with_indifferent_access if entry.respond_to?(:with_indifferent_access)
+          counted += reconcile_entry(entry, application)
+        rescue => e
+          # A corrupt (non-Hash) entry must not abort the whole batch — and the
+          # log line itself must not assume `entry` is subscriptable (an Integer
+          # from a broken serializer would raise again here, escaping this rescue).
+          error_class = entry.is_a?(Hash) ? entry["error_class"] : entry.class
+          RailsErrorDashboard::Logger.error(
+            "[RailsErrorDashboard] Storm count reconcile failed for #{error_class}: #{e.class} - #{e.message}"
+          )
+        end
+
+        upsert_storm_event(counted)
+        { success: true, reconciled: counted, overflow: @overflow }
+      rescue => e
+        RailsErrorDashboard::Logger.error(
+          "[RailsErrorDashboard] FlushStormCounts failed: #{e.class} - #{e.message}"
+        )
+        { success: false, error: "#{e.class}: #{e.message}" }
+      end
+
+      private
+
+      def reconcile_entry(entry, application)
+        count = entry["count"].to_i
+        return 0 if count <= 0
+
+        error_hash = canonical_hash(entry, application)
+        last_seen = parse_time(entry["last_seen_at"]) || Time.current
+
+        # Priority 1: unresolved match — one UPDATE, no row instantiation
+        updated = ErrorLog.unresolved
+          .where(error_hash: error_hash, application_id: application.id)
+          .update_all([ "occurrence_count = occurrence_count + ?, last_seen_at = ?", count, last_seen ])
+        return count if updated.positive?
+
+        # Priority 2: resolved/wont_fix match — reopen, mirroring
+        # FindOrIncrementError so storm recurrences don't stay buried
+        resolved = ErrorLog
+          .where(error_hash: error_hash, application_id: application.id)
+          .where(status: %w[resolved wont_fix])
+          .order(last_seen_at: :desc)
+          .first
+        if resolved
+          attrs = {
+            resolved: false,
+            status: "new",
+            resolved_at: nil,
+            occurrence_count: resolved.occurrence_count + count,
+            last_seen_at: last_seen
+          }
+          attrs[:reopened_at] = Time.current if ErrorLog.column_names.include?("reopened_at")
+          resolved.update!(attrs)
+          return count
+        end
+
+        # Priority 3: first seen during count-only mode — minimal ErrorLog
+        # from the exemplar (no backtrace/context was captured; the next
+        # occurrence after the storm fills in detail via the normal path)
+        ErrorLog.create!(
+          application_id: application.id,
+          error_type: entry["error_class"],
+          message: entry["message"],
+          backtrace: entry["first_app_frame"],
+          controller_name: entry["controller_name"],
+          action_name: entry["action_name"],
+          occurred_at: parse_time(entry["first_seen_at"]) || Time.current,
+          last_seen_at: last_seen,
+          occurrence_count: count,
+          error_hash: error_hash,
+          resolved: false
+        )
+        count
+      end
+
+      # Mirrors ErrorHashGenerator.call exactly: same fields, same order,
+      # same normalization — so counts land on the same ErrorLog the full
+      # capture path would have used.
+      def canonical_hash(entry, application)
+        return entry["custom_hash"] if entry["custom_hash"].present?
+
+        digest_input = [
+          entry["error_class"],
+          Services::ErrorHashGenerator.normalize_message(entry["message"]),
+          entry["first_app_frame"],
+          entry["controller_name"],
+          entry["action_name"],
+          application.id.to_s
+        ].compact.join("|")
+
+        Digest::SHA256.hexdigest(digest_input)[0..15]
+      end
+
+      def resolve_application
+        # Same chain LogError uses — app name is process-global
+        app_name = RailsErrorDashboard.configuration.application_name ||
+          ENV["APPLICATION_NAME"] ||
+          (defined?(Rails) && Rails.application.class.module_parent_name) ||
+          "Rails Application"
+        Application.find_or_create_by_name(app_name)
+      end
+
+      def upsert_storm_event(counted)
+        return unless @episode.is_a?(Hash)
+        return unless StormEvent.table_exists?
+
+        started_at = parse_time(@episode["started_at"])
+        return unless started_at
+
+        event = StormEvent.active.recent_first.first || StormEvent.create!(started_at: started_at)
+
+        event.events_counted_only = event.events_counted_only.to_i + counted
+        event.events_overflow = event.events_overflow.to_i + @overflow
+        # events_total is the count-only total: in-map reconciled + overflow.
+        # It deliberately excludes :lite/:full admissions (those became real
+        # ErrorLog rows on the hot path and are never counted here), so it is
+        # always events_counted_only + events_overflow. Derive it rather than
+        # accumulate so it can't drift from its two components.
+        event.events_total = event.events_counted_only.to_i + event.events_overflow.to_i
+        event.fingerprints_affected = [ event.fingerprints_affected.to_i, @entries.size ].max
+        event.peak_rate_per_minute = [ event.peak_rate_per_minute.to_i, @episode["peak_rate_per_minute"].to_i ].max
+        event.reached_open ||= @episode["reached_open"] == true
+        event.top_fingerprints = top_fingerprints_json(event)
+        event.ended_at = parse_time(@episode["ended_at"]) if @episode["ended_at"]
+        event.save!
+      rescue => e
+        RailsErrorDashboard::Logger.error(
+          "[RailsErrorDashboard] Storm event upsert failed: #{e.class} - #{e.message}"
+        )
+      end
+
+      def top_fingerprints_json(event)
+        existing = event.top_fingerprints_list
+        fresh = @entries.map { |e|
+          e = e.with_indifferent_access if e.respond_to?(:with_indifferent_access)
+          { "class" => e["error_class"], "message" => e["message"].to_s[0, 120], "count" => e["count"].to_i }
+        }
+
+        merged = (existing + fresh)
+          .group_by { |f| [ f["class"], f["message"] ] }
+          .map { |_k, group| group.first.merge("count" => group.sum { |f| f["count"].to_i }) }
+
+        merged.sort_by { |f| -f["count"].to_i }.first(5).to_json
+      end
+
+      def parse_time(value)
+        return value if value.is_a?(Time) || value.is_a?(ActiveSupport::TimeWithZone)
+        return nil if value.blank?
+
+        Time.zone.parse(value.to_s)
+      rescue ArgumentError
+        nil
+      end
+    end
+  end
+end
|
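The merge performed by `top_fingerprints_json` (combine the stored top list with the fresh batch, sum counts per class and message, keep the largest few) can be exercised on its own. This is a plain-Ruby sketch with a hypothetical name, `merge_top_fingerprints`, and made-up sample data; it leaves out the 120-character message truncation and the JSON encoding from the original.

```ruby
# Hypothetical standalone version of the merge step.
# Entries are hashes with "class", "message" and "count" keys.
def merge_top_fingerprints(existing, fresh, limit: 5)
  (existing + fresh)
    .group_by { |f| [f["class"], f["message"]] }
    .map { |_key, group| group.first.merge("count" => group.sum { |f| f["count"].to_i }) }
    .sort_by { |f| -f["count"] }
    .first(limit)
end

stored = [
  { "class" => "Timeout::Error", "message" => "execution expired", "count" => 5 },
  { "class" => "PG::ConnectionBad", "message" => "could not connect", "count" => 4 }
]
batch = [
  { "class" => "PG::ConnectionBad", "message" => "could not connect", "count" => 3 },
  { "class" => "KeyError", "message" => "key not found", "count" => 1 }
]
p merge_top_fingerprints(stored, batch, limit: 2)
```

One consequence of keeping only the top few after each flush is that a fingerprint which falls out of the stored list restarts from zero if it reappears, so the per-fingerprint figures in the list should be read as a ranking aid while the episode totals remain the exact numbers.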