@checkstack/incident-backend 1.1.4 → 1.2.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,137 @@
  # @checkstack/incident-backend

+ ## 1.2.0
+
+ ### Minor Changes
+
+ - ba07ae2: Quiet down notification spam on flapping systems, auto-open incidents when a check goes critical, and let operators land directly on the broken checks.
+
+   Notification policy lives **per healthcheck assignment** (one row per `system × configuration`). Different checks on the same system are fully independent — disabling a setting on one check does not affect the others. Defaults preserve existing behaviour for `suppressDeEscalations`; **auto-incident defaults to on** for new and existing assignments.
+
+   - **`suppressDeEscalations`** (off by default). When on, transitions from a worse state to a better-but-still-failing state (e.g. `unhealthy → degraded`) no longer fire a notification. Escalations and full recoveries to `healthy` are unaffected. Resolved per assignment (the just-ran check is the one driving any aggregate transition).
+   - **`autoOpenIncidentOnUnhealthy`** (on by default). Either of two independent triggers can open the auto-incident:
+     - **`sustainedUnhealthyTrigger`** (default 30 min) — opens when the check stays continuously unhealthy for the configured duration. Catches real outages.
+     - **`flappingTrigger`** (default 3 transitions in 60 min) — opens when the check flips to unhealthy that many times in the window. Catches persistent flapping where each unhealthy phase is too brief for the sustained trigger.
+     Each trigger can be individually disabled. One incident per system: triggering checks attach to an existing active auto-incident.
+   - **`useNotificationSuppression`** (on by default, only meaningful when auto-open is on). Controls whether the auto-opened incident is created with `suppressNotifications: true` — leaving this off opens the incident but still pings operators on each transition.
+   - **`skipDuringMaintenance`** (on by default). No auto-incident is opened while the system has an active maintenance window with suppression. The system is intentionally down and shouldn't trip the on-call.
+   - **`autoCloseAfterMinutes`** (default 30). Auto-close cooldown is now per-assignment and snapshotted per-incident at open time — later policy edits don't alter in-flight incidents. Setting `null` ("Never auto-close") leaves the incident for manual resolution.
+   - **Require-recovery rule.** After any auto-incident closes (manual or auto), no new auto-incident can open until the check has logged at least one healthy run. Prevents an "operator dismissed but it's still broken" loop.
+   - **Auto-close worker** ticks every 60s and resolves auto-opened incidents whose systems have been healthy for their per-row `cooldownMinutes`. Rows with `null` cooldown are skipped entirely. Per-incident: failed close attempts are logged but never abort the sweep.
+   - **`incidentResolved` hook subscriber** syncs the auto-incident mapping when an operator manually resolves the incident, so the require-recovery rule sees the close immediately.
+   - **Platform-wide defaults.** New admin RPCs `getPlatformNotificationDefaults` / `setPlatformNotificationDefaults` (under the existing `healthcheck.configuration.{read,manage}` access rules) let operators set notification policy once for the whole instance. Per-assignment rows with `notificationPolicy: null` inherit the platform defaults at read time. UI: a "Notification defaults" button in the Assignment IDE opens a modal editor. The per-assignment Notifications panel shows an inheritance banner — "Using platform defaults" (read-only) with an "Override" button, or "Custom override" with a "Use platform defaults" button to revert. The all-or-nothing model keeps the mental model simple: each assignment is either fully inherited or fully overridden.
+   - **New service-level RPCs** on the incident plugin (`createAutoIncident`, `resolveAutoIncident`) let other plugins open/close incidents without a user context. Reused by the healthcheck auto-incident flow.
+   - **Health-state notification CTA** now deep-links to `?filter=failing` on the system detail page for non-recovery transitions (label changes to "View failing checks"). The system overview gains an `All / Failing / Healthy` segmented filter wired to the same `?filter=…` param.
+   - **Notification bell badge** now counts collapse groups instead of raw rows, so the number matches what the user sees in the notifications list. Built on `COUNT(DISTINCT COALESCE(collapse_key, id))` — notifications without a collapse key still each count as one.
+   - **`statusFilter` on `getHistory` / `getDetailedHistory`** lets the run-history page and the drawer's Recent Runs panel filter to `All / Healthy / Failing` via shared pills, with the page resetting to the first page on filter change.
+   - **Pagination defaults aligned with selector options.** Several pages defaulted to a page size (5 or 20) that wasn't in the dropdown's options (`[10, 25, 50, 100]`), so the page-size `<Select>` rendered empty. The drawer's Recent Runs now defaults to 10; the Run History, History List, and Delivery Logs pages now default to 25.
+
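To make the `suppressDeEscalations` rule in the list above concrete, here is a minimal sketch of the decision it describes. The `HealthState` type, the `severity` ranking, and the `shouldNotify` function are hypothetical names chosen for this illustration; the package's real implementation is not part of this diff and may differ, including in how it treats a run that leaves the state unchanged.

```typescript
// Hypothetical sketch: names and the numeric ranking are assumptions for illustration.
type HealthState = "healthy" | "degraded" | "unhealthy";

const severity: Record<HealthState, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
};

function shouldNotify(
  previous: HealthState,
  next: HealthState,
  suppressDeEscalations: boolean,
): boolean {
  // Assumption: a run that does not change the state is not a transition.
  if (previous === next) return false;

  // A de-escalation is an improvement that still leaves the check failing.
  const isDeEscalation =
    severity[next] < severity[previous] && next !== "healthy";

  // Escalations and full recoveries always notify; partial improvements
  // notify only when suppression is off.
  return !(suppressDeEscalations && isDeEscalation);
}
```

With suppression on, `shouldNotify("unhealthy", "degraded", true)` is `false`, while the escalation `degraded → unhealthy` and the recovery `unhealthy → healthy` both still return `true`.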
+   Includes Drizzle migrations adding the `notification_policy` jsonb column to `system_health_checks`, plus two new tables: `health_check_unhealthy_transitions` (for threshold counting) and `health_check_auto_incidents` (for mapping back to incident ids during auto-close).
+
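The threshold counting that `health_check_unhealthy_transitions` supports can be pictured with the sketch below, which evaluates the two triggers from the changelog against in-memory timestamps instead of a table. All type and function names here are hypothetical, and the inclusive comparisons at the window edges are an assumption; the changelog does not say how boundary cases are handled.

```typescript
// Hypothetical sketch: names and boundary handling are assumptions, not the package's API.
interface FlappingTrigger {
  transitions: number; // changelog default: 3
  windowMinutes: number; // changelog default: 60
}

const MINUTE_MS = 60_000;

// True when at least `transitions` flips to unhealthy fall inside the
// window that ends at `now`.
function flappingTriggered(
  unhealthyTransitionTimes: Date[],
  trigger: FlappingTrigger,
  now: Date,
): boolean {
  const windowStart = now.getTime() - trigger.windowMinutes * MINUTE_MS;
  const recent = unhealthyTransitionTimes.filter(
    (t) => t.getTime() >= windowStart && t.getTime() <= now.getTime(),
  );
  return recent.length >= trigger.transitions;
}

// True when the check has been continuously unhealthy for at least `minutes`.
function sustainedTriggered(
  unhealthySince: Date | null,
  minutes: number,
  now: Date,
): boolean {
  if (unhealthySince === null) return false;
  return now.getTime() - unhealthySince.getTime() >= minutes * MINUTE_MS;
}

// Either trigger on its own is enough to open the auto-incident.
function shouldOpenAutoIncident(
  unhealthySince: Date | null,
  unhealthyTransitionTimes: Date[],
  now: Date,
  sustainedMinutes = 30,
  flapping: FlappingTrigger = { transitions: 3, windowMinutes: 60 },
): boolean {
  return (
    sustainedTriggered(unhealthySince, sustainedMinutes, now) ||
    flappingTriggered(unhealthyTransitionTimes, flapping, now)
  );
}
```

Keeping the two predicates separate mirrors the changelog's statement that each trigger can be disabled individually: a caller would simply skip the predicate for a disabled trigger.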
+ ### Patch Changes
+
+ - Updated dependencies [ba07ae2]
+   - @checkstack/incident-common@1.3.0
+   - @checkstack/backend-api@0.17.1
+   - @checkstack/cache-api@0.3.5
+   - @checkstack/catalog-backend@1.1.6
+   - @checkstack/command-backend@0.1.30
+   - @checkstack/integration-backend@0.1.30
+   - @checkstack/cache-utils@0.2.10
+
+ ## 1.1.5
+
+ ### Patch Changes
+
+ - f23f3c9: Phase 12 of the v1 polishing plan: three coordinated cleanup items that
+   close out half-finished features ahead of v1.0.
+
+   `@checkstack/incident-backend` adds focused unit-test coverage for
+   `IncidentService.hasActiveIncidentWithSuppression` in
+   `core/incident-backend/src/service.test.ts`. The new tests exercise the
+   real query-builder logic against a programmable mock data source and
+   pin down the active-only silencing contract: returns `true` only when
+   an unresolved incident with `suppressNotifications=true` is associated
+   with the queried `systemId`; returns `false` for resolved incidents,
+   incidents with `suppressNotifications=false`, systems with no incident
+   associations, and other systems' silenced incidents. No runtime
+   changes; the service code was already correct end-to-end (write path
+   through `IncidentEditor`, read path through the healthcheck queue
+   executor and dependency notifications). A companion docs page,
+   `docs/src/content/docs/architecture/alert-silencing.md`, documents the
+   contract, the two read sites, and the dispatch paths silencing does
+   NOT cover so users aren't surprised when an unaware channel keeps
+   firing.
+
+   `@checkstack/auth-frontend` surfaces inline role assignment inside the
+   user-creation dialog so admins can pick role(s) atomically with the
+   create call. `CreateUserDialog` now renders a checkbox list of
+   assignable roles (those with `isAssignable !== false`); on submit,
+   `UsersTab` awaits `createCredentialUser`, then immediately calls
+   `updateUserRoles` with the selected role IDs. On partial failure
+   (user created, role assignment failed) the UI surfaces a warning toast
+   naming the recovery path rather than silently misreporting success. No
+   new endpoints — reuses the existing `createCredentialUser` +
+   `updateUserRoles` contract pair. A companion docs page,
+   `docs/src/content/docs/architecture/users-and-teams.md`, documents the
+   identity / role / team model, the two S2S endpoints
+   (`checkResourceTeamAccess`, `getAccessibleResourceIds`) other plugins
+   should call to honour team grants, and explicitly defers audit
+   logging, CSV export, team-scoped resource-management UI, and deletion
+   side-effect handling to v1.1.
+
+   The third item — deleting the empty `core/status-frontend/` and
+   `core/status-page-backend/` shells — is tooling-only and intentionally
+   ships without a changeset; neither shell had a `package.json`, source
+   file, or downstream importer.
+
+ - f23f3c9: Add `correlationMiddleware` to `@checkstack/backend-api` and apply it
+   to every plugin/core router so each request carries a stable
+   `x-correlation-id` (read from the inbound header, or freshly minted
+   via `crypto.randomUUID()` when absent) and an auto-injected child
+   logger bound with `{ correlationId, pluginId, userId? }`. The ID is
+   echoed back on the response header so the caller can correlate their
+   client-side trace to the server logs.
+
+   The `Logger` interface in `@checkstack/backend-api` now formally
+   documents the structured-metadata convention (`logger.info("msg",
+   { ...meta })`) alongside the long-standing varargs shape. Winston's
+   splat handling already routes both shapes through the same vararg
+   slot, so existing call sites are unaffected. A new optional
+   `Logger.child(meta)` method captures the metadata-binding contract the
+   new middleware relies on; production loggers always implement it,
+   minimal test mocks may omit it (the middleware falls back gracefully).
+
+   `RpcContext` grew two optional `Headers` bags, `requestHeaders` and
+   `responseHeaders`, populated by the outer Hono `/api/*` and `/rest/*`
+   handlers in `@checkstack/backend`. They are write-through observation
+   points for middleware; an `RpcContext` constructed without them (S2S
+   clients, tests) keeps working — the echo is a silent no-op and the ID
+   is still bound onto the child logger for server-side correlation.
+
+   The scaffolding template in `@checkstack/scripts` was updated so any
+   new plugin generated via `bun run create` wires the middleware in the
+   expected `.use(correlationMiddleware).use(autoAuthMiddleware)` order
+   out of the box.
+
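The header handling described in the `correlationMiddleware` entry can be sketched as a small standalone helper. This is only an illustration under stated assumptions: `resolveCorrelationId` and `CORRELATION_HEADER` are hypothetical names, the real middleware's signature is not shown in this diff, and the sketch leaves out the child-logger binding entirely.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper: the real `correlationMiddleware` signature is not shown in this diff.
const CORRELATION_HEADER = "x-correlation-id";

function resolveCorrelationId(
  requestHeaders?: Headers,
  responseHeaders?: Headers,
): string {
  // Reuse the inbound ID when the caller sent one, otherwise mint a fresh UUID.
  const inbound = requestHeaders?.get(CORRELATION_HEADER);
  const correlationId =
    inbound && inbound.length > 0 ? inbound : randomUUID();

  // Echo the ID on the response when a header bag exists. For contexts built
  // without one (S2S clients, tests) this line is a silent no-op.
  responseHeaders?.set(CORRELATION_HEADER, correlationId);

  return correlationId;
}
```

Making both header bags optional follows the changelog's point that an `RpcContext` constructed without them keeps working: the caller still gets an ID to bind onto a logger even when nothing can be echoed back.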
+ - Updated dependencies [f23f3c9]
+ - Updated dependencies [f23f3c9]
+ - Updated dependencies [f23f3c9]
+ - Updated dependencies [f23f3c9]
+   - @checkstack/common@0.11.0
+   - @checkstack/backend-api@0.17.0
+   - @checkstack/catalog-backend@1.1.5
+   - @checkstack/command-backend@0.1.29
+   - @checkstack/integration-backend@0.1.29
+   - @checkstack/notification-common@1.2.0
+   - @checkstack/integration-common@0.5.0
+   - @checkstack/auth-common@0.7.1
+   - @checkstack/catalog-common@2.2.2
+   - @checkstack/incident-common@1.2.2
+   - @checkstack/signal-common@0.2.4
+   - @checkstack/cache-api@0.3.4
+   - @checkstack/cache-utils@0.2.9
+
  ## 1.1.4

  ### Patch Changes
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@checkstack/incident-backend",
-   "version": "1.1.4",
+   "version": "1.2.0",
    "license": "Elastic-2.0",
    "type": "module",
    "main": "src/index.ts",
@@ -14,27 +14,27 @@
      "lint:code": "eslint . --max-warnings 0"
    },
    "dependencies": {
-     "@checkstack/backend-api": "0.15.3",
-     "@checkstack/cache-api": "0.3.2",
-     "@checkstack/cache-utils": "0.2.7",
-     "@checkstack/incident-common": "1.2.0",
-     "@checkstack/catalog-common": "2.2.0",
-     "@checkstack/catalog-backend": "1.1.3",
-     "@checkstack/notification-common": "1.1.0",
-     "@checkstack/auth-common": "0.7.0",
-     "@checkstack/command-backend": "0.1.27",
-     "@checkstack/signal-common": "0.2.3",
-     "@checkstack/integration-backend": "0.1.27",
-     "@checkstack/integration-common": "0.4.0",
-     "@checkstack/common": "0.10.0",
+     "@checkstack/backend-api": "0.17.0",
+     "@checkstack/cache-api": "0.3.4",
+     "@checkstack/cache-utils": "0.2.9",
+     "@checkstack/incident-common": "1.2.2",
+     "@checkstack/catalog-common": "2.2.2",
+     "@checkstack/catalog-backend": "1.1.5",
+     "@checkstack/notification-common": "1.2.0",
+     "@checkstack/auth-common": "0.7.1",
+     "@checkstack/command-backend": "0.1.29",
+     "@checkstack/signal-common": "0.2.4",
+     "@checkstack/integration-backend": "0.1.29",
+     "@checkstack/integration-common": "0.5.0",
+     "@checkstack/common": "0.11.0",
      "drizzle-orm": "^0.45.0",
      "zod": "^4.2.1",
      "@orpc/server": "^1.13.2"
    },
    "devDependencies": {
      "@checkstack/drizzle-helper": "0.0.5",
-     "@checkstack/scripts": "0.3.2",
-     "@checkstack/test-utils-backend": "0.1.27",
+     "@checkstack/scripts": "0.3.3",
+     "@checkstack/test-utils-backend": "0.1.29",
      "@checkstack/tsconfig": "0.0.7",
      "@types/bun": "^1.0.0",
      "drizzle-kit": "^0.31.10",
package/src/router.ts CHANGED
@@ -5,6 +5,7 @@ import {
  } from "@checkstack/incident-common";
  import {
    autoAuthMiddleware,
+   correlationMiddleware,
    Logger,
    type RpcContext,
  } from "@checkstack/backend-api";
@@ -89,6 +90,7 @@ export function createRouter(

    const os = implement(incidentContract)
      .$context<RpcContext>()
+     .use(correlationMiddleware)
      .use(autoAuthMiddleware);

    return os.router({
@@ -393,6 +395,96 @@ export function createRouter(
        return { suppressed };
      }),

+     createAutoIncident: os.createAutoIncident.handler(
+       async ({ input, context }) => {
+         // No user context for service-initiated incidents; createdBy
+         // stays null and the timeline shows the originating plugin via
+         // the hook payload.
+         const result = await service.createIncident(input);
+
+         await cache.invalidateForMutation({
+           incidentId: result.id,
+           systemIds: result.systemIds,
+         });
+
+         await signalService.broadcast(INCIDENT_UPDATED, {
+           incidentId: result.id,
+           systemIds: result.systemIds,
+           action: "created",
+         });
+
+         await context.emitHook(incidentHooks.incidentCreated, {
+           incidentId: result.id,
+           systemIds: result.systemIds,
+           title: result.title,
+           description: result.description,
+           severity: result.severity,
+           status: result.status,
+           createdAt: result.createdAt.toISOString(),
+         });
+
+         const systemNames = await resolveSystemNames(result.systemIds);
+         await notifyAffectedSystems({
+           catalogClient,
+           notificationClient,
+           logger,
+           incidentId: result.id,
+           incidentTitle: result.title,
+           systemIds: result.systemIds,
+           systemNames,
+           action: "created",
+           severity: result.severity,
+         });
+
+         return { id: result.id };
+       },
+     ),
+
+     resolveAutoIncident: os.resolveAutoIncident.handler(
+       async ({ input, context }) => {
+         const result = await service.resolveIncident(input.id, input.message);
+         // Idempotent: a missing or already-resolved incident is treated
+         // as success so the auto-close worker can be re-run safely.
+         if (!result) {
+           return { success: true };
+         }
+
+         await cache.invalidateForMutation({
+           incidentId: result.id,
+           systemIds: result.systemIds,
+         });
+
+         await signalService.broadcast(INCIDENT_UPDATED, {
+           incidentId: result.id,
+           systemIds: result.systemIds,
+           action: "resolved",
+         });
+
+         await context.emitHook(incidentHooks.incidentResolved, {
+           incidentId: result.id,
+           systemIds: result.systemIds,
+           title: result.title,
+           severity: result.severity,
+           resolvedAt: new Date().toISOString(),
+         });
+
+         const systemNames = await resolveSystemNames(result.systemIds);
+         await notifyAffectedSystems({
+           catalogClient,
+           notificationClient,
+           logger,
+           incidentId: result.id,
+           incidentTitle: result.title,
+           systemIds: result.systemIds,
+           systemNames,
+           action: "resolved",
+           severity: result.severity,
+         });
+
+         return { success: true };
+       },
+     ),
+
      addLink: os.addLink.handler(async ({ input }) => {
        // Verify incident exists so the FK violation surfaces as NOT_FOUND.
        const incident = await service.getIncident(input.incidentId);
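The idempotency comment in `resolveAutoIncident` is easier to reason about in isolation. The sketch below reproduces the same shape against a hypothetical in-memory store: a missing or already-resolved incident returns success without running any side effects, so repeating the call is harmless. `Incident`, the `Map`-based store, and `onResolved` are assumptions for this illustration and stand in for the service, cache, signal, hook, and notification calls in the handler above. The sketch is synchronous for brevity, whereas the real handler is async.

```typescript
// Hypothetical in-memory stand-in for the incident store; not the package's service.
type Incident = { id: string; status: "open" | "resolved" };

function resolveIncident(
  store: Map<string, Incident>,
  id: string,
): Incident | undefined {
  const incident = store.get(id);
  // Missing or already resolved: report "nothing to do" to the caller.
  if (!incident || incident.status === "resolved") return undefined;
  const resolved: Incident = { ...incident, status: "resolved" };
  store.set(id, resolved);
  return resolved;
}

function resolveAutoIncident(
  store: Map<string, Incident>,
  id: string,
  onResolved: (incident: Incident) => void,
): { success: true } {
  const result = resolveIncident(store, id);
  // Idempotent: no state change means no side effects, but still success.
  if (!result) return { success: true };
  // Cache invalidation, signals, hooks, and notifications would run here.
  onResolved(result);
  return { success: true };
}
```

Calling `resolveAutoIncident` twice for the same incident invokes `onResolved` only once, which is the property a periodically re-run worker depends on.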
@@ -0,0 +1,126 @@
+ import { describe, it, expect, mock, beforeEach } from "bun:test";
+ import { IncidentService } from "./service";
+
+ /**
+  * Programmable mock DB that records each `select(...).from(...).where(...)`
+  * (and optional `.limit(...)`) chain and returns a configurable row array
+  * per invocation. Tests exercise the real query-builder calls inside
+  * `IncidentService`, only swapping out the terminal data source.
+  */
+ function createProgrammableSelectDb(resultsByCall: unknown[][]) {
+   let callIndex = 0;
+
+   const nextResult = (): unknown[] => {
+     const result = resultsByCall[callIndex] ?? [];
+     callIndex += 1;
+     return result;
+   };
+
+   const select = mock((projection?: Record<string, unknown>) => {
+     void projection;
+     const rows = nextResult();
+
+     const limit = mock(() => Promise.resolve(rows));
+     const whereResult = Object.assign(Promise.resolve(rows), { limit });
+     const where = mock(() => whereResult);
+     const fromResult = Object.assign(Promise.resolve(rows), { where });
+     const from = mock(() => fromResult);
+
+     return { from };
+   });
+
+   return {
+     db: { select } as unknown,
+     select,
+     getCallCount: () => callIndex,
+   };
+ }
+
+ describe("IncidentService.hasActiveIncidentWithSuppression", () => {
+   let dbHelper: ReturnType<typeof createProgrammableSelectDb>;
+   let service: IncidentService;
+
+   const setup = (resultsByCall: unknown[][]) => {
+     dbHelper = createProgrammableSelectDb(resultsByCall);
+     service = new IncidentService(dbHelper.db as never);
+   };
+
+   beforeEach(() => {
+     dbHelper = createProgrammableSelectDb([]);
+   });
+
+   it("returns true when an active incident with suppressNotifications=true exists for the system", async () => {
+     setup([
+       // 1st query: incidentSystems lookup for systemId="sys-1"
+       [{ incidentId: "inc-1" }],
+       // 2nd query: incidents lookup with .where(active AND suppression).limit(1)
+       [{ id: "inc-1" }],
+     ]);
+
+     const result = await service.hasActiveIncidentWithSuppression("sys-1");
+
+     expect(result).toBe(true);
+     expect(dbHelper.getCallCount()).toBe(2);
+   });
+
+   it("returns false when no incidents are associated with the system", async () => {
+     setup([
+       // 1st query: empty -> short-circuits before the 2nd query
+       [],
+     ]);
+
+     const result = await service.hasActiveIncidentWithSuppression("sys-1");
+
+     expect(result).toBe(false);
+     // Only one query should have run; the incidents lookup is skipped.
+     expect(dbHelper.getCallCount()).toBe(1);
+   });
+
+   it("returns false when the matching incident is resolved (silencing is scoped to active incidents)", async () => {
+     setup([
+       // 1st query: the system has an incident association.
+       [{ incidentId: "inc-resolved" }],
+       // 2nd query: the WHERE clause filters out resolved incidents, so the
+       // limit(1) projection finds nothing. The real query builder enforces
+       // this via `ne(incidents.status, "resolved")`.
+       [],
+     ]);
+
+     const result = await service.hasActiveIncidentWithSuppression("sys-1");
+
+     expect(result).toBe(false);
+     expect(dbHelper.getCallCount()).toBe(2);
+   });
+
+   it("returns false when the matching incident has suppressNotifications=false", async () => {
+     setup([
+       // 1st query: the system has an incident association.
+       [{ incidentId: "inc-no-suppress" }],
+       // 2nd query: the WHERE clause filters by suppressNotifications=true,
+       // so a row with suppressNotifications=false is excluded — the result
+       // set is empty.
+       [],
+     ]);
+
+     const result = await service.hasActiveIncidentWithSuppression("sys-1");
+
+     expect(result).toBe(false);
+     expect(dbHelper.getCallCount()).toBe(2);
+   });
+
+   it("filters by systemId — does not return true for another system's silenced incident", async () => {
+     // The systemId filter is enforced by the WHERE clause on the
+     // incidentSystems lookup. Querying "sys-other" returns an empty
+     // association set even though "sys-1" has a silenced incident, so the
+     // method short-circuits to false.
+     setup([
+       // 1st query for systemId="sys-other": no associations.
+       [],
+     ]);
+
+     const result = await service.hasActiveIncidentWithSuppression("sys-other");
+
+     expect(result).toBe(false);
+     expect(dbHelper.getCallCount()).toBe(1);
+   });
+ });
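The test helper relies on one trick that is easy to miss: `Object.assign(Promise.resolve(rows), { limit })` produces a real promise that also carries a chainable method, so the same object can be awaited directly or chained further. The sketch below isolates that pattern without `bun:test`. `createChain` and its `calls` log are hypothetical names for this illustration, and unlike the helper in the diff, this `limit` slices the rows.

```typescript
// Hypothetical, dependency-free version of the thenable-chain idea used in the test helper.
function createChain<T>(rows: T[]) {
  const calls: string[] = [];

  const limit = (n: number) => {
    calls.push(`limit(${n})`);
    return Promise.resolve(rows.slice(0, n));
  };

  const where = () => {
    calls.push("where");
    // A real Promise with an extra `limit` method attached.
    return Object.assign(Promise.resolve(rows), { limit });
  };

  const from = () => {
    calls.push("from");
    // Awaitable as-is, or chainable through `where`.
    return Object.assign(Promise.resolve(rows), { where });
  };

  return { select: () => ({ from }), calls };
}

async function demo(): Promise<void> {
  const { select, calls } = createChain([{ id: "inc-1" }, { id: "inc-2" }]);
  const all = await select().from().where(); // awaited directly after where()
  const first = await select().from().where().limit(1); // awaited after limit()
  console.log(all.length, first.length, calls);
}

void demo();
```

The design choice worth noting is that no custom `then` method is needed: because the base object is a genuine promise, `await` behaves normally and TypeScript infers the intersection type on its own.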