@checkstack/notification-backend 1.1.0 → 1.3.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,207 @@
  # @checkstack/notification-backend
 
+ ## 1.3.0
+
+ ### Minor Changes
+
+ - ba07ae2: Quiet down notification spam on flapping systems, auto-open incidents when a check goes critical, and let operators land directly on the broken checks.
+
+ Notification policy lives **per healthcheck assignment** (one row per `system × configuration`). Different checks on the same system are fully independent: disabling a setting on one check does not affect the others. Defaults preserve existing behaviour for `suppressDeEscalations`; **auto-incident defaults to on** for new and existing assignments.
+
+ - **`suppressDeEscalations`** (off by default). When on, transitions from a worse state to a better-but-still-failing state (e.g. `unhealthy → degraded`) no longer fire a notification. Escalations and full recoveries to `healthy` are unaffected. The setting is resolved per assignment (the check that just ran is the one driving any aggregate transition).
+ - **`autoOpenIncidentOnUnhealthy`** (on by default). Either of two independent triggers can open the auto-incident:
+ - **`sustainedUnhealthyTrigger`** (default 30 min): opens when the check stays continuously unhealthy for the configured duration. Catches real outages.
+ - **`flappingTrigger`** (default 3 transitions in 60 min): opens when the check flips to unhealthy that many times in the window. Catches persistent flapping where each unhealthy phase is too brief for the sustained trigger.
+ Each trigger can be individually disabled. One incident per system: triggering checks attach to an existing active auto-incident.
+ - **`useNotificationSuppression`** (on by default, only meaningful when auto-open is on). Controls whether the auto-opened incident is created with `suppressNotifications: true`; leaving this off opens the incident but still pings operators on each transition.
+ - **`skipDuringMaintenance`** (on by default). No auto-incident is opened while the system has an active maintenance window with suppression. The system is intentionally down and shouldn't trip the on-call.
+ - **`autoCloseAfterMinutes`** (default 30). Auto-close cooldown is now per-assignment and snapshotted per-incident at open time, so later policy edits don't alter in-flight incidents. Setting `null` ("Never auto-close") leaves the incident for manual resolution.
+ - **Require-recovery rule.** After any auto-incident closes (manual or auto), no new auto-incident can open until the check has logged at least one healthy run. Prevents an "operator dismissed but it's still broken" loop.
+ - **Auto-close worker** ticks every 60s and resolves auto-opened incidents whose systems have been healthy for their per-row `cooldownMinutes`. Rows with `null` cooldown are skipped entirely. Failures are isolated per incident: a failed close attempt is logged but never aborts the sweep.
+ - **`incidentResolved` hook subscriber** syncs the auto-incident mapping when an operator manually resolves the incident, so the require-recovery rule sees the close immediately.
+ - **Platform-wide defaults.** New admin RPCs `getPlatformNotificationDefaults` / `setPlatformNotificationDefaults` (under the existing `healthcheck.configuration.{read,manage}` access rules) let operators set notification policy once for the whole instance. Per-assignment rows with `notificationPolicy: null` inherit the platform defaults at read time. UI: a "Notification defaults" button in the Assignment IDE opens a modal editor. The per-assignment Notifications panel shows an inheritance banner: either "Using platform defaults" (read-only) with an "Override" button, or "Custom override" with a "Use platform defaults" button to revert. This all-or-nothing approach keeps the mental model simple: each assignment is either fully inherited or fully overridden.
+ - **New service-level RPCs** on the incident plugin (`createAutoIncident`, `resolveAutoIncident`) let other plugins open/close incidents without a user context. Reused by the healthcheck auto-incident flow.
+ - **Health-state notification CTA** now deep-links to `?filter=failing` on the system detail page for non-recovery transitions (label changes to "View failing checks"). The system overview gains an `All / Failing / Healthy` segmented filter wired to the same `?filter=…` param.
+ - **Notification bell badge** now counts collapse groups instead of raw rows, so the number matches what the user sees in the notifications list. Built on `COUNT(DISTINCT COALESCE(collapse_key, id))`, so notifications without a collapse key still each count as one.
+ - **`statusFilter` on `getHistory` / `getDetailedHistory`** lets the run-history page and the drawer's Recent Runs panel filter to `All / Healthy / Failing` via shared pills, with the page resetting to the first page on filter change.
+ - **Pagination defaults aligned with selector options.** Several pages defaulted to a page size (5 or 20) that wasn't in the dropdown's options (`[10, 25, 50, 100]`), so the page-size `<Select>` rendered empty. The drawer's Recent Runs now defaults to 10; the Run History, History List, and Delivery Logs pages now default to 25.
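The `suppressDeEscalations` rule described in the list above can be pictured as a small predicate over state transitions. The sketch below is illustrative only: the `HealthState` type, the severity ranking, the `shouldNotify` function, and the choice to stay silent when the state does not change are all assumptions made for this example, not names or behaviour taken from the package.

```typescript
// Hypothetical sketch of the `suppressDeEscalations` rule. Every name here is
// an assumption for illustration, not part of the package's API.
type HealthState = "healthy" | "degraded" | "unhealthy";

// Higher number means a worse state.
const SEVERITY: Record<HealthState, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
};

function shouldNotify(
  previous: HealthState,
  next: HealthState,
  suppressDeEscalations: boolean,
): boolean {
  // Assumption: an unchanged state is not a transition and never notifies.
  if (previous === next) return false;

  // A de-escalation is a move to a better state that is still failing.
  const isDeEscalation =
    SEVERITY[next] < SEVERITY[previous] && next !== "healthy";

  // Escalations and full recoveries always notify; only de-escalations are
  // affected by the flag.
  return !(suppressDeEscalations && isDeEscalation);
}
```

With the flag on, `unhealthy → degraded` stays quiet, while `degraded → unhealthy` and `unhealthy → healthy` still notify.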
+
+ Includes Drizzle migrations adding the `notification_policy` jsonb column to `system_health_checks`, plus two new tables: `health_check_unhealthy_transitions` (for threshold counting) and `health_check_auto_incidents` (for mapping back to incident ids during auto-close).
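To make the threshold counting concrete, here is a minimal in-memory sketch of the two auto-incident triggers using the documented defaults (30 minutes sustained, 3 transitions in 60 minutes). The function names, the timestamp arrays, and the inclusive window boundaries are assumptions for illustration; the changelog does not show how the package queries `health_check_unhealthy_transitions`.

```typescript
// Hypothetical in-memory sketch of the two auto-incident triggers.
// Timestamps are milliseconds since the epoch.
const MINUTE_MS = 60_000;

interface FlappingTrigger {
  transitions: number; // default per the changelog: 3
  windowMinutes: number; // default per the changelog: 60
}

// True when at least `transitions` flips to unhealthy fall inside the window
// that ends at `now` (boundaries treated as inclusive, by assumption).
function flappingTriggered(
  unhealthyTransitionsAt: number[],
  now: number,
  trigger: FlappingTrigger,
): boolean {
  const windowStart = now - trigger.windowMinutes * MINUTE_MS;
  const recent = unhealthyTransitionsAt.filter(
    (at) => at >= windowStart && at <= now,
  );
  return recent.length >= trigger.transitions;
}

// True when the check has been continuously unhealthy for the full duration.
// `unhealthySince` is null when the check is not currently unhealthy.
function sustainedTriggered(
  unhealthySince: number | null,
  now: number,
  sustainedMinutes: number,
): boolean {
  return (
    unhealthySince !== null &&
    now - unhealthySince >= sustainedMinutes * MINUTE_MS
  );
}

// Either trigger on its own is enough to open the auto-incident.
function shouldOpenAutoIncident(
  unhealthySince: number | null,
  unhealthyTransitionsAt: number[],
  now: number,
): boolean {
  return (
    sustainedTriggered(unhealthySince, now, 30) ||
    flappingTriggered(unhealthyTransitionsAt, now, {
      transitions: 3,
      windowMinutes: 60,
    })
  );
}
```

The sketch leaves out the other gates the changelog describes (maintenance windows, the require-recovery rule, and per-trigger disabling), which would sit in front of this check.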
+
+ ### Patch Changes
+
+ - @checkstack/backend-api@0.17.1
+ - @checkstack/auth-backend@0.4.30
+ - @checkstack/cache-api@0.3.5
+ - @checkstack/queue-api@0.3.5
+ - @checkstack/cache-utils@0.2.10
+
+ ## 1.2.0
+
+ ### Minor Changes
+
+ - f23f3c9: Add per-channel notification delivery-attempt tracking
+ (Phase 8 of the v1 polishing plan). The external dispatch loop now
+ persists one row per `strategy.send(...)` call into a new
+ `notification_delivery_attempts` table - both successes and
+ failures - so silent delivery breakage (misconfigured webhooks, dead
+ channels) becomes queryable instead of buried in logs.
+
+ - `@checkstack/notification-backend` adds the
+ `notification_delivery_attempts` table, the matching Drizzle
+ migration, and a new `dispatchWithAttempt` helper that wraps every
+ external `strategy.send(...)` with duration measurement and
+ best-effort row persistence. The insert is intentionally
+ fire-and-forget: if writing the attempt row itself errors, the
+ dispatch loop logs and continues so visibility tracking can never
+ introduce a _new_ silent failure.
+ - `@checkstack/notification-common` exports a new
+ `DeliveryAttemptSchema` zod schema, the
+ `ListDeliveryAttemptsInputSchema =
+ PaginationInput.extend({ notificationId })` input, and a new
+ `getDeliveryAttempts` procedure on the contract. The procedure is
+ gated by the existing `notificationAccess.admin`
+ (`notification:manage`) access rule - no new permission was
+ introduced.
+ - `@checkstack/notification-frontend` adds a minimal admin-only
+ `DeliveryAttemptsPage` (route id `notification.deliveryAttempts`,
+ path `/notifications/delivery-attempts`) and an "Open inspector"
+ link from the Notification Settings page for users with
+ `notification:manage`. No client-side `isAdmin` gate - the FORBIDDEN
+ case is rendered via the standard error-state branch on the page,
+ enforced by the contract.
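The wrapping behaviour described for `dispatchWithAttempt` (time the send, record success or failure, never let the bookkeeping insert break dispatch) might look roughly like the sketch below. The real helper's signature is not shown in this changelog, so the `AttemptRow` shape, the `SketchDeps` callbacks, and the `toErrorMessage` stand-in for `extractErrorMessage` are all assumptions.

```typescript
// Hypothetical sketch only: the real `dispatchWithAttempt` signature is not
// shown in the changelog, so every type and parameter here is an assumption.
type DeliveryStatus = "success" | "failure";

interface AttemptRow {
  notificationId: string;
  strategyQualifiedId: string;
  status: DeliveryStatus;
  errorMessage: string | null;
  durationMs: number;
}

interface SketchDeps {
  send: () => Promise<void>; // stands in for strategy.send(...)
  persistAttempt: (row: AttemptRow) => Promise<void>; // stands in for the insert
  logError: (message: string) => void;
}

// Reduce an unknown thrown value to a plain message string so the raw error
// object is never stored. Stand-in for the package's `extractErrorMessage`.
function toErrorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}

async function dispatchWithAttemptSketch(
  notificationId: string,
  strategyQualifiedId: string,
  deps: SketchDeps,
): Promise<DeliveryStatus> {
  const startedAt = Date.now();
  let status: DeliveryStatus = "success";
  let errorMessage: string | null = null;

  try {
    await deps.send();
  } catch (error) {
    status = "failure";
    errorMessage = toErrorMessage(error);
  }

  const row: AttemptRow = {
    notificationId,
    strategyQualifiedId,
    status,
    errorMessage,
    durationMs: Date.now() - startedAt,
  };

  // Fire-and-forget: the insert is not awaited, and a failed insert is only
  // logged so it can never turn into a new dispatch failure.
  void Promise.resolve()
    .then(() => deps.persistAttempt(row))
    .catch((error) =>
      deps.logError(
        `failed to persist delivery attempt: ${toErrorMessage(error)}`,
      ),
    );

  return status;
}
```

Returning the status rather than rethrowing lets a caller's loop move on to the next strategy, which matches the "logs and continues" behaviour described above.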
+
+ Visibility only: there is no retry mechanism in this phase. A
+ `failure` row is a final outcome that an admin must act on manually
+ (re-trigger the source event, fix the misconfigured channel).
+ Automated retries are deferred to v1.1.
+
+ Strategy errors thrown during `send(...)` are persisted via
+ `extractErrorMessage(error)` so secrets potentially embedded in raw
+ error objects (webhook URLs, OAuth tokens reachable from the strategy
+ send context) are not stored verbatim.
+
+ See the new
+ `docs/src/content/docs/backend/notification-delivery.md` page for the
+ full surface description.
+
+ ### Patch Changes
+
+ - f23f3c9: Add `correlationMiddleware` to `@checkstack/backend-api` and apply it
+ to every plugin/core router so each request carries a stable
+ `x-correlation-id` (read from the inbound header, or freshly minted
+ via `crypto.randomUUID()` when absent) and an auto-injected child
+ logger bound with `{ correlationId, pluginId, userId? }`. The ID is
+ echoed back on the response header so the caller can correlate their
+ client-side trace to the server logs.
+
+ The `Logger` interface in `@checkstack/backend-api` now formally
+ documents the structured-metadata convention (`logger.info("msg",
+ { ...meta })`) alongside the long-standing varargs shape. Winston's
+ splat handling already routes both shapes through the same vararg
+ slot, so existing call sites are unaffected. A new optional
+ `Logger.child(meta)` method captures the metadata-binding contract the
+ new middleware relies on; production loggers always implement it, while
+ minimal test mocks may omit it (the middleware falls back gracefully).
+
+ `RpcContext` grew two optional `Headers` bags, `requestHeaders` and
+ `responseHeaders`, populated by the outer Hono `/api/*` and `/rest/*`
+ handlers in `@checkstack/backend`. They are write-through observation
+ points for middleware; an `RpcContext` constructed without them (S2S
+ clients, tests) keeps working: the echo is a silent no-op and the ID
+ is still bound onto the child logger for server-side correlation.
+
+ The scaffolding template in `@checkstack/scripts` was updated so any
+ new plugin generated via `bun run create` wires the middleware in the
+ expected `.use(correlationMiddleware).use(autoAuthMiddleware)` order
+ out of the box.
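The read-or-mint behaviour of the correlation ID, including the silent no-op when no header bags are present, can be sketched with the standard `Headers` class. The `resolveCorrelationId` helper below is hypothetical and is not the middleware itself; it also treats an empty inbound header as absent, which is an assumption.

```typescript
// Hypothetical helper, not the actual `correlationMiddleware`.
// `randomUUID` from node:crypto is the same generator that
// `crypto.randomUUID()` exposes globally in recent runtimes.
import { randomUUID } from "node:crypto";

const CORRELATION_HEADER = "x-correlation-id";

function resolveCorrelationId(
  requestHeaders?: Headers,
  responseHeaders?: Headers,
): string {
  // Prefer the inbound ID so a caller-supplied trace survives end to end.
  const inbound = requestHeaders?.get(CORRELATION_HEADER);
  const correlationId =
    inbound !== null && inbound !== undefined && inbound.length > 0
      ? inbound
      : randomUUID();

  // Echo only when a response bag exists; for contexts built without one
  // (S2S clients, tests) this line does nothing.
  responseHeaders?.set(CORRELATION_HEADER, correlationId);

  return correlationId;
}
```

A middleware built on this would then bind the returned ID onto a child logger, for example through the optional `Logger.child(meta)` method mentioned above.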
+
+ - f23f3c9: Sweep every paginated `*-common` contract onto the canonical
+ `PaginationInput` / `PaginatedResult` from `@checkstack/common` and
+ remove the now-unused legacy exports.
+
+ **BREAKING CHANGE** - `@checkstack/common` drops the deprecated
+ `PaginationInputSchema`, `paginatedOutput`, and `PaginatedResponse`
+ symbols. Callers must consume `PaginationInput` (input) and
+ `PaginatedResult(itemSchema)` (output) instead. The canonical input is
+ `{ limit (1-100, default 20), offset (>= 0, default 0) }`; the
+ canonical output envelope is
+ `{ items, total, limit, offset }`.
+
+ **BREAKING CHANGE** - `@checkstack/notification-common` migrates
+ `getNotifications` off the legacy `PaginationInputSchema`
+ (`{ limit, offset, unreadOnly }` with output `{ notifications, total }`)
+ onto `ListNotificationsInputSchema =
+ PaginationInput.extend({ unreadOnly })` and
+ `PaginatedResult(NotificationSchema)`. The output key changes from
+ `notifications` to `items`, and `limit` / `offset` are now echoed on
+ the response. The `PaginationInput` type alias previously exported
+ from `notification-common` is removed - use `ListNotificationsInput`
+ or the canonical `PaginationInput` from `@checkstack/common`.
+
+ **BREAKING CHANGE** - `@checkstack/integration-common` migrates
+ `listSubscriptions` (inline `{ page, pageSize, ... }` -> output
+ `{ subscriptions, total }`) and `getDeliveryLogs` (via
+ `DeliveryLogQueryInputSchema` `{ subscriptionId?, eventType?, status?,
+ page, pageSize }` -> output `{ logs, total }`) onto the canonical
+ `PaginationInput.extend({...})` input and
+ `PaginatedResult(itemSchema)` output. External callers must switch
+ from `{ page, pageSize }` to `{ limit, offset }` and read response
+ items from `data.items` (no more `data.subscriptions` / `data.logs`).
+
+ The matching `*-backend` handlers were updated to consume the new
+ input shape (`offset` arithmetic in lieu of `(page - 1) * pageSize`)
+ and to echo `limit` / `offset` on the response. The `*-frontend` call
+ sites in `NotificationsPage`, `NotificationBell`, `IntegrationsPage`,
+ and `DeliveryLogsPage` were updated to send the new input shape and
+ read `data.items`.
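For external callers still thinking in `{ page, pageSize }`, the switch to `{ limit, offset }` is the arithmetic quoted above run in the caller. The sketch below uses plain TypeScript interfaces as stand-ins for the zod schemas; the `*Shape` names, `toPaginationInput`, and the in-memory `paginate` function are assumptions for illustration.

```typescript
// Plain-interface stand-ins for the canonical schemas; names are hypothetical.
interface LegacyPageInput {
  page: number; // 1-based page index
  pageSize: number;
}

interface PaginationInputShape {
  limit: number;
  offset: number;
}

interface PaginatedResultShape<T> {
  items: T[];
  total: number;
  limit: number;
  offset: number;
}

// Convert a legacy page request into the canonical limit/offset pair.
function toPaginationInput(legacy: LegacyPageInput): PaginationInputShape {
  return {
    limit: legacy.pageSize,
    offset: (legacy.page - 1) * legacy.pageSize,
  };
}

// In-memory stand-in for a handler: slice the data and echo limit/offset back
// in the `{ items, total, limit, offset }` envelope.
function paginate<T>(
  all: T[],
  input: PaginationInputShape,
): PaginatedResultShape<T> {
  return {
    items: all.slice(input.offset, input.offset + input.limit),
    total: all.length,
    limit: input.limit,
    offset: input.offset,
  };
}

// Page 3 with 25 items per page starts at offset 50.
const converted = toPaginationInput({ page: 3, pageSize: 25 });
console.log(converted); // { limit: 25, offset: 50 }
```

Because the canonical `limit` is documented as 1-100, a legacy `pageSize` above 100 would presumably need to be clamped or split before conversion.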
+
+ - f23f3c9: Phase 9 of the v1 polishing plan: tighten the plugin loader's boot-time
+ hook policy and backfill notification-router test coverage.
+
+ `@checkstack/backend` adopts an explicit per-hook policy for the two
+ boot-time hooks the plugin loader emits. `pluginInitialized` now
+ **halts the boot** if a subscriber throws: a failing subscriber here
+ means a downstream plugin never wired itself against the freshly
+ initialised plugin, and continuing past that would leave the platform
+ serving traffic in a half-wired state. `accessRulesRegistered` keeps its
+ log-and-continue behaviour but escalates to `error` level and emits a
+ summary count if any subscriber failed; boot-blocking this hook would
+ let one misbehaving plugin DoS every other plugin on the same
+ instance. The policy is documented inline at each emit site and in a
+ new `docs/src/content/docs/backend/plugin-hook-policy.md` page.
+ **BREAKING CHANGE**: subscribers to `pluginInitialized` whose thrown
+ errors were previously logged and swallowed now halt platform
+ boot. Audit subscribers and ensure they handle their own internal
+ errors instead of letting them propagate.
+
+ `@checkstack/notification-backend` ships a real
+ `core/notification-backend/src/router.test.ts` covering the dispatch
+ fan-out (`notifyForSubscription`: zero subscribers, multi-recipient
+ insert, `excludeUserIds`, plus NOT_FOUND/FORBIDDEN guard rails), the
+ canonical paginated read on `getNotifications` (envelope shape,
+ `unreadOnly` filter propagation, null→undefined column mapping), the
+ service-only `createGroup` upsert behaviour (happy path + idempotent
+ re-create), and the multi-strategy `sendTransactional` path with a
+ focused fallback-style assertion: when one strategy throws, the
+ dispatch loop continues to the next and surfaces the failure as a
+ per-strategy `success: false` row instead of short-circuiting. No
+ runtime changes to the notification router.
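The two boot-time hook policies described above differ only in what happens when a subscriber throws. A minimal sketch of that difference follows; `emitWithPolicy`, `HookPolicy`, and the log message wording are hypothetical and do not reflect the plugin loader's actual code.

```typescript
// Hypothetical sketch of a per-hook failure policy; not the plugin loader.
type HookPolicy = "halt" | "log-and-continue";
type Subscriber = () => void | Promise<void>;

async function emitWithPolicy(
  subscribers: Subscriber[],
  policy: HookPolicy,
  logError: (message: string) => void,
): Promise<number> {
  let failures = 0;

  for (const subscriber of subscribers) {
    try {
      await subscriber();
    } catch (error) {
      // "halt" mirrors the `pluginInitialized` policy: rethrow so boot stops.
      if (policy === "halt") throw error;

      // "log-and-continue" mirrors `accessRulesRegistered`: record the
      // failure at error level and keep going.
      failures += 1;
      const message = error instanceof Error ? error.message : String(error);
      logError(`hook subscriber failed: ${message}`);
    }
  }

  // Emit a summary count only when something actually failed.
  if (failures > 0) logError(`${failures} hook subscriber(s) failed`);
  return failures;
}
```

Under `"halt"`, subscribers after the failing one never run, which is why the changelog asks subscriber authors to audit what they let propagate.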
+
+ - Updated dependencies [f23f3c9]
+ - Updated dependencies [f23f3c9]
+ - Updated dependencies [f23f3c9]
+ - Updated dependencies [f23f3c9]
+ - @checkstack/common@0.11.0
+ - @checkstack/backend-api@0.17.0
+ - @checkstack/auth-backend@0.4.29
+ - @checkstack/notification-common@1.2.0
+ - @checkstack/auth-common@0.7.1
+ - @checkstack/signal-common@0.2.4
+ - @checkstack/cache-api@0.3.4
+ - @checkstack/queue-api@0.3.4
+ - @checkstack/cache-utils@0.2.9
+
  ## 1.1.0
 
  ### Minor Changes
@@ -0,0 +1,14 @@
+ CREATE TYPE "notification_delivery_status" AS ENUM('success', 'failure');--> statement-breakpoint
+ CREATE TABLE "notification_delivery_attempts" (
+     "id" uuid PRIMARY KEY DEFAULT gen_random_uuid() NOT NULL,
+     "notification_id" uuid NOT NULL,
+     "strategy_qualified_id" text NOT NULL,
+     "attempted_at" timestamp DEFAULT now() NOT NULL,
+     "status" "notification_delivery_status" NOT NULL,
+     "error_message" text,
+     "duration_ms" integer NOT NULL
+ );
+ --> statement-breakpoint
+ ALTER TABLE "notification_delivery_attempts" ADD CONSTRAINT "notification_delivery_attempts_notification_id_notifications_id_fk" FOREIGN KEY ("notification_id") REFERENCES "notifications"("id") ON DELETE cascade ON UPDATE no action;--> statement-breakpoint
+ CREATE INDEX "notification_delivery_attempts_notification_idx" ON "notification_delivery_attempts" USING btree ("notification_id");--> statement-breakpoint
+ CREATE INDEX "notification_delivery_attempts_attempted_at_idx" ON "notification_delivery_attempts" USING btree ("attempted_at");