@checkstack/healthcheck-backend 1.1.4 → 1.3.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,261 @@
  # @checkstack/healthcheck-backend
 
+ ## 1.3.0
+
+ ### Minor Changes
+
+ - 41c77f4: feat(automation): type enum-able trigger/artifact fields as enums for editor value autocompletion
+
+ The automation editor's staged completion offers concrete values after a
+ comparator (`{{ trigger.payload.severity == "high" }}`) only when the
+ field's JSON Schema carries an `enum`. Several trigger payload + artifact
+ schemas declared closed-set fields as loose `z.string()`, so no values
+ were suggested. Tightened them to the canonical enums that already
+ existed in each plugin's `-common` package (and matched the hook payload
+ types in lockstep so the trigger's `payloadSchema` and `hook` keep the
+ same `TPayload`):
+
+ - **incident** — trigger payloads: `severity` → `IncidentSeverityEnum`,
+ `status` / `statusChange` → `IncidentStatusEnum`.
+ - **healthcheck** — trigger payloads: `previousStatus` / `newStatus` /
+ `status` → `HealthCheckStatusSchema` (across systemDegraded,
+ systemHealthy, systemHealthChanged, checkFailed; plus checkCompleted's
+ hook type).
+ - **dependency** — trigger + artifact: `impactType` → `ImpactTypeSchema`;
+ impactPropagated `previousState` / `newState` → `DerivedStateSchema`.
+ Also deduped the inline `impactTypeSchema` action-config enum to reuse
+ the canonical `ImpactTypeSchema`.
+ - **maintenance** — trigger + artifact: `status` →
+ `MaintenanceStatusEnum`; deduped the inline `maintenanceStatusEnum`
+ (used by `add_update.statusChange`) to the canonical one.
+ - **slo** — `achievement.unlocked` trigger + hook: `achievement` →
+ `AchievementTypeSchema`.
+
+ Runtime behaviour is unchanged — these fields always carried valid enum
+ values (the underlying records are enum-constrained); only the schema
+ types were loose. The hook payload generics are now precise too, which
+ caught one stale test fixture asserting an invalid `impactType: "soft"`.
+
+ Fields that look enum-ish but are genuinely free-form were intentionally
+ left as `z.string()`: satellite `region` (user-entered), Jira issue
+ `status` (per-instance workflow name), notification `strategyQualifiedId`
+ / `errorMessage`, healthcheck collector `result`, and script
+ `stdout` / `stderr`.
+
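Reviewer note: the entry above hinges on the `enum` keyword being present in a field's JSON Schema. The sketch below illustrates that dependency in isolation. `FieldSchema` and `suggestValues` are hypothetical names made up for this note and are not the editor's API. Only `"high"` appears in the changelog; the other severity values are assumed.

```typescript
// Hypothetical, simplified view of a single field's JSON Schema.
interface FieldSchema {
  type: "string";
  enum?: readonly string[];
}

// Returns the values a completion popup could offer after a comparator.
// A loose string schema carries no `enum`, so there is nothing to suggest.
function suggestValues(schema: FieldSchema, typed: string): string[] {
  const candidates = schema.enum ?? [];
  return candidates.filter((value) => value.startsWith(typed));
}

// Before: a loose string field, with no closed set of values.
const looseSeverity: FieldSchema = { type: "string" };

// After: a closed set. Values other than "high" are assumed for illustration.
const enumSeverity: FieldSchema = {
  type: "string",
  enum: ["low", "medium", "high"],
};

console.log(suggestValues(looseSeverity, "h")); // nothing to suggest
console.log(suggestValues(enumSeverity, "h")); // suggests "high"
```

The runtime values do not change between the two schemas; only what tooling can infer from them does, which matches the "runtime behaviour is unchanged" statement in the entry.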
+ - 41c77f4: feat(healthcheck): Phase 9 — run_now / enable / disable actions + umbrella health-changed trigger
+
+ - New hook `healthCheckHooks.systemHealthChanged`, an umbrella variant
+ of `systemDegraded` + `systemHealthy` that fires on **every**
+ aggregated-health transition (with both `previousStatus` and
+ `newStatus`). Emitted alongside the directional hooks at both
+ emission sites in `queue-executor.ts`, so existing subscribers keep
+ working unchanged.
+ - New hook `healthCheckHooks.checkFailed` — fires alongside the
+ existing `checkCompleted` whenever an individual run's status
+ isn't `healthy`. Exists as a narrow alternative so an automation
+ doesn't need "trigger on completion → filter by status" — useful
+ for incident-style flows.
+ - New hook `healthCheckHooks.flappingDetected` — fires from inside
+ the auto-incident evaluator whenever the unhealthy-transition count
+ crosses `policy.flappingTrigger.transitions` within
+ `policy.flappingTrigger.windowMinutes`, regardless of whether
+ `autoOpenIncidentOnUnhealthy` is enabled. Carries the observed
+ count + window so subscribers can reason about both. Re-fires on
+ every additional transition past the threshold while the check
+ stays flapping — debounce on `(systemId, configurationId)` if
+ "page once and only once" is wanted.
+ - Triggers `healthcheck.system_degraded`,
+ `healthcheck.system_healthy`, the umbrella
+ `healthcheck.system_health_changed`, plus the new
+ `healthcheck.check_failed` and `healthcheck.flapping_detected`.
+ Inline trigger registrations moved out of `register()` into
+ `automations.ts`.
+ - Actions `healthcheck.run_now` (enqueues a one-off job on the
+ shared `HEALTH_CHECK_QUEUE`), `healthcheck.enable_assignment`, and
+ `healthcheck.disable_assignment`. The enable/disable actions use a
+ new service method `setAssignmentEnabled(systemId, configurationId,
+ enabled)` that flips just the `enabled` flag without touching
+ thresholds / satellite assignment / notification policy. Both fire
+ the existing `assignmentChanged` hook so the satellite config relay
+ picks up the change.
+ - Artifact type `healthcheck.assignment` for downstream steps to
+ consume.
+
+ `HEALTH_CHECK_QUEUE` is exported so the `run_now` action can enqueue
+ without re-importing the recurring-job factory.
+
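Reviewer note: the `flappingDetected` bullet above describes a sliding-window count plus a subscriber-side debounce. A minimal sketch of both follows, under stated assumptions: "crosses" is read as "reaches the threshold", and `FlappingTrigger`, `isFlapping`, and `FlappingDebouncer` are hypothetical names. Only `transitions`, `windowMinutes`, `systemId`, and `configurationId` come from the changelog.

```typescript
// Field names follow the changelog; the type itself is hypothetical.
interface FlappingTrigger {
  transitions: number;
  windowMinutes: number;
}

// Counts unhealthy transitions inside the trailing window that ends at `now`.
function countInWindow(transitionedAt: Date[], now: Date, windowMinutes: number): number {
  const cutoff = now.getTime() - windowMinutes * 60_000;
  return transitionedAt.filter(
    (t) => t.getTime() >= cutoff && t.getTime() <= now.getTime(),
  ).length;
}

// Assumption: the hook fires once the count reaches the configured threshold.
function isFlapping(transitionedAt: Date[], now: Date, trigger: FlappingTrigger): boolean {
  return countInWindow(transitionedAt, now, trigger.windowMinutes) >= trigger.transitions;
}

// Subscriber-side debounce keyed on (systemId, configurationId), so repeated
// firings while a check keeps flapping page at most once per quiet period.
class FlappingDebouncer {
  private lastPaged = new Map<string, number>();
  private quietMs: number;

  constructor(quietMinutes: number) {
    this.quietMs = quietMinutes * 60_000;
  }

  shouldPage(systemId: string, configurationId: string, now: Date): boolean {
    const key = `${systemId}:${configurationId}`;
    const last = this.lastPaged.get(key);
    if (last !== undefined && now.getTime() - last < this.quietMs) {
      return false;
    }
    this.lastPaged.set(key, now.getTime());
    return true;
  }
}
```

An in-memory map is enough to show the idea; a subscriber that runs in several processes would need shared state for the debounce key.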
+ - 35bc682: feat(healthcheck): expose check + system run-context to script collectors
+
+ Script health checks can now read which check and system a run is for.
+ Previously shell scripts got only a curated env whitelist and inline
+ scripts only `context.config`, so a script had no built-in way to know
+ its own check name or the system it was checking.
+
+ - `@checkstack/backend-api`: new `CollectorRunContext` type
+ (`{ check: { id, name, intervalSeconds }, system: { id, name } }`) and
+ an optional `runContext` param on `CollectorStrategy.execute`. Optional,
+ so existing collector implementations are unaffected.
+ - Shell-script collector: injects reserved `CHECKSTACK_CHECK_ID`,
+ `CHECKSTACK_CHECK_NAME`, `CHECKSTACK_CHECK_INTERVAL_SECONDS`,
+ `CHECKSTACK_SYSTEM_ID`, `CHECKSTACK_SYSTEM_NAME` env vars (user-supplied
+ `env` still wins on collision).
+ - Inline-script collector: exposes `context.check` and `context.system`
+ alongside `context.config`; the inline-script editor now types them for
+ autocomplete.
+ - Shell editors (health-check collectors and automation shell actions) now
+ also suggest the user's own `env` (JSON) keys as `$NAME` completions, via
+ the new exported `customShellEnvVars` helper. Keys that aren't valid shell
+ identifiers are omitted.
+ - Fix: the Typefox `CodeEditor` captured a stale `onChange` at editor start,
+ so editing one `DynamicForm` field reverted sibling fields changed since
+ mount (e.g. typing in a shell `script` field wiped an unsaved `env` value,
+ or deleted a sibling automation action added after mount). The change
+ handler now routes through a ref to the current `onChange`.
+ - Fix: focusing a JSON editor threw "LanguageStatusService.addStatus is not
+ supported" because the standalone service set omitted `ILanguageStatusService`.
+ That one service is now registered via `serviceOverrides`.
+ - Fix: the automation trigger card nested a `<Badge>` (a `<div>`) inside a
+ `<p>`, producing a `validateDOMNesting` warning. Switched the wrapper to a
+ `<div>`.
+ - Local runs (`queue-executor`) and satellite runs both populate the
+ context. `SatelliteAssignment` (and the `getAssignmentsForSatellite`
+ RPC output) gained optional `configName` / `systemName` so the metadata
+ reaches satellite-side execution; `HealthCheckService` resolves the
+ system name via the catalog client.
+
+ BREAKING CHANGE: `createHealthCheckRouter` now requires a `catalogClient`
+ option (used to resolve system names for satellite assignments). Update
+ call sites to pass the catalog RPC client.
+
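Reviewer note: two rules in the entry above are easy to get backwards, namely that user-supplied `env` wins over the reserved `CHECKSTACK_*` variables and that non-identifier keys are dropped from `$NAME` completions. The sketch below shows one way both could work. `buildScriptEnv` and `completableEnvNames` are hypothetical stand-ins, not the package's functions, and the identifier regex is an assumed definition of "valid shell identifier".

```typescript
// Shape taken from the changelog's description of CollectorRunContext.
interface CollectorRunContext {
  check: { id: string; name: string; intervalSeconds: number };
  system: { id: string; name: string };
}

// Builds the script environment: reserved variables first, then the
// user-supplied `env` spread last so it wins on a name collision.
function buildScriptEnv(
  runContext: CollectorRunContext,
  userEnv: Record<string, string>,
): Record<string, string> {
  const reserved: Record<string, string> = {
    CHECKSTACK_CHECK_ID: runContext.check.id,
    CHECKSTACK_CHECK_NAME: runContext.check.name,
    CHECKSTACK_CHECK_INTERVAL_SECONDS: String(runContext.check.intervalSeconds),
    CHECKSTACK_SYSTEM_ID: runContext.system.id,
    CHECKSTACK_SYSTEM_NAME: runContext.system.name,
  };
  return { ...reserved, ...userEnv };
}

// Assumed rule: a letter or underscore, then letters, digits, or underscores.
const SHELL_IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Keeps only the keys that can be referenced as $NAME in a shell script.
function completableEnvNames(userEnv: Record<string, string>): string[] {
  return Object.keys(userEnv)
    .filter((key) => SHELL_IDENTIFIER.test(key))
    .map((key) => "$" + key);
}
```

Spreading `userEnv` last is what makes the collision rule hold; reversing the two spreads would silently make the reserved values win instead.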
+ ### Patch Changes
+
+ - 41c77f4: feat(automation): one-time migration of webhook subscriptions + remove legacy integration backend
+
+ **BREAKING CHANGES** (platform is in BETA — no major bump):
+
+ - `IntegrationProvider` no longer carries `config` (subscription
+ config) or `deliver`. The interface now models a connection provider
+ only: connection schema + `getConnectionOptions` + `testConnection`.
+ - The legacy subscription / delivery-log / event endpoints
+ (`listSubscriptions`, `createSubscription`, `getDeliveryLogs`,
+ `listEventTypes`, …) are removed from `integrationContract`.
+ - `delivery-coordinator`, `hook-subscriber`, `event-registry`, and the
+ `integrationEventExtensionPoint` are deleted. Plugins that
+ previously called `integrationEvents.registerEvent(...)` now
+ register their hooks as automation triggers via
+ `automationTriggerExtensionPoint.registerTrigger(...)`.
+ - Frontend pages `IntegrationsPage` and `DeliveryLogsPage` are gone;
+ the integration plugin's only remaining UI is connection
+ management. Subscription management lives under `/automation/...`.
+ - `webhook_subscriptions` and `delivery_logs` tables stay in the
+ database for one release as a safety net (no code reads or writes
+ them), and will be dropped in a follow-up migration.
+
+ **New**:
+
+ - `jira.create_issue`, `teams.post_message`, `webex.post_message`,
+ `webhook.send`, `integration-script.run_shell`, and
+ `integration-script.run_script` actions registered against the
+ Automation Platform with matching `*.message`, `*.delivery`,
+ `shell.result`, and `script.result` artifact types. The script
+ plugin exposes **two** actions — `run_shell` runs bash via the
+ shared `ShellScriptRunner` (Monaco `shell` editor), `run_script`
+ runs an ESM module in a Bun subprocess via `EsmScriptRunner`
+ (Monaco `typescript` editor + `defineIntegration` helper) — to
+ preserve the legacy provider split. `jira.create_issue` keeps the
+ dynamic field-mapping dropdown (driven by
+ `JIRA_RESOLVERS.FIELD_OPTIONS`).
+ - One-time data migration runs on boot in
+ `automation-backend.afterPluginsReady`. It reads
+ `webhook_subscriptions` via a new service RPC
+ `IntegrationApi.listLegacySubscriptions`, translates each row into
+ a single-trigger / single-action automation (marked with
+ `managed_by = "migrated-subscription:<id>"`), and is idempotent
+ across restarts.
+ - Failed translations are recorded in a new
+ `automation_migration_failures` table and surfaced via
+ `AutomationApi.listMigrationFailures` /
+ `acknowledgeMigrationFailure` so admins can review and re-create
+ failed entries by hand.
+
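Reviewer note: the changelog states that the boot-time migration is idempotent and marks each result with `managed_by = "migrated-subscription:<id>"`, but does not say how idempotency is achieved. The sketch below assumes the marker itself is the idempotency key. All types and the `migrateSubscriptions` function are hypothetical and only illustrate the pattern.

```typescript
// Hypothetical row and result shapes; only the marker format is from the changelog.
interface LegacySubscription {
  id: string;
  eventType: string;
}

interface Automation {
  managedBy: string;
  trigger: string;
  action: string;
}

interface MigrationFailure {
  subscriptionId: string;
  reason: string;
}

// Translates each legacy row at most once. Rows whose marker already exists are
// skipped, and a throwing translation is recorded instead of aborting the run.
function migrateSubscriptions(
  rows: LegacySubscription[],
  existing: Automation[],
  translate: (row: LegacySubscription) => Omit<Automation, "managedBy">,
): { created: Automation[]; failures: MigrationFailure[] } {
  const alreadyMigrated = new Set(existing.map((a) => a.managedBy));
  const created: Automation[] = [];
  const failures: MigrationFailure[] = [];

  for (const row of rows) {
    const marker = `migrated-subscription:${row.id}`;
    if (alreadyMigrated.has(marker)) continue;
    try {
      created.push({ ...translate(row), managedBy: marker });
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      failures.push({ subscriptionId: row.id, reason });
    }
  }
  return { created, failures };
}
```

In this sketch a failed row would be retried on the next boot, since it never receives a marker; whether the real migration retries or relies only on the failures table is not stated in the changelog.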
+ - Updated dependencies [e2d6f25]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [e1a2077]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [41c77f4]
+ - Updated dependencies [6d52276]
+ - Updated dependencies [6d52276]
+ - Updated dependencies [35bc682]
+ - @checkstack/automation-backend@0.2.0
+ - @checkstack/incident-backend@1.3.0
+ - @checkstack/catalog-backend@1.2.0
+ - @checkstack/satellite-backend@0.4.0
+ - @checkstack/common@0.12.0
+ - @checkstack/backend-api@0.18.0
+ - @checkstack/healthcheck-common@1.3.0
+ - @checkstack/catalog-common@2.2.3
+ - @checkstack/incident-common@1.3.1
+ - @checkstack/maintenance-common@1.2.3
+ - @checkstack/command-backend@0.1.31
+ - @checkstack/gitops-backend@0.3.7
+ - @checkstack/gitops-common@0.4.2
+ - @checkstack/notification-common@1.2.1
+ - @checkstack/signal-common@0.2.5
+ - @checkstack/cache-api@0.3.6
+ - @checkstack/queue-api@0.3.6
+ - @checkstack/cache-utils@0.2.11
+
+ ## 1.2.0
+
+ ### Minor Changes
+
+ - ba07ae2: Quiet down notification spam on flapping systems, auto-open incidents when a check goes critical, and let operators land directly on the broken checks.
+
+ Notification policy lives **per healthcheck assignment** (one row per `system × configuration`). Different checks on the same system are fully independent — disabling a setting on one check does not affect the others. Defaults preserve existing behaviour for `suppressDeEscalations`; **auto-incident defaults to on** for new and existing assignments.
+
+ - **`suppressDeEscalations`** (off by default). When on, transitions from a worse state to a better-but-still-failing state (e.g. `unhealthy → degraded`) no longer fire a notification. Escalations and full recoveries to `healthy` are unaffected. Resolved per assignment (the just-ran check is the one driving any aggregate transition).
+ - **`autoOpenIncidentOnUnhealthy`** (on by default). Either of two independent triggers can open the auto-incident:
+ - **`sustainedUnhealthyTrigger`** (default 30 min) — opens when the check stays continuously unhealthy for the configured duration. Catches real outages.
+ - **`flappingTrigger`** (default 3 transitions in 60 min) — opens when the check flips to unhealthy that many times in the window. Catches persistent flapping where each unhealthy phase is too brief for the sustained trigger.
+ Each trigger can be individually disabled. One incident per system: triggering checks attach to an existing active auto-incident.
+ - **`useNotificationSuppression`** (on by default, only meaningful when auto-open is on). Controls whether the auto-opened incident is created with `suppressNotifications: true` — leaving this off opens the incident but still pings operators on each transition.
+ - **`skipDuringMaintenance`** (on by default). No auto-incident is opened while the system has an active maintenance window with suppression. The system is intentionally down and shouldn't trip the on-call.
+ - **`autoCloseAfterMinutes`** (default 30). Auto-close cooldown is now per-assignment and snapshotted per-incident at open time — later policy edits don't alter in-flight incidents. Setting `null` ("Never auto-close") leaves the incident for manual resolution.
+ - **Require-recovery rule.** After any auto-incident closes (manual or auto), no new auto-incident can open until the check has logged at least one healthy run. Prevents a "operator dismissed but it's still broken" loop.
+ - **Auto-close worker** ticks every 60s and resolves auto-opened incidents whose systems have been healthy for their per-row `cooldownMinutes`. Rows with `null` cooldown are skipped entirely. Per-incident: failed close attempts are logged but never abort the sweep.
+ - **`incidentResolved` hook subscriber** syncs the auto-incident mapping when an operator manually resolves the incident, so the require-recovery rule sees the close immediately.
+ - **Platform-wide defaults.** New admin RPCs `getPlatformNotificationDefaults` / `setPlatformNotificationDefaults` (under the existing `healthcheck.configuration.{read,manage}` access rules) let operators set notification policy once for the whole instance. Per-assignment rows with `notificationPolicy: null` inherit the platform defaults at read time. UI: a "Notification defaults" button in the Assignment IDE opens a modal editor. The per-assignment Notifications panel shows an inheritance banner — "Using platform defaults" (read-only) with an "Override" button, or "Custom override" with a "Use platform defaults" button to revert. The all-or-nothing model keeps the mental model simple: each assignment is either fully inherited or fully overridden.
+ - **New service-level RPCs** on the incident plugin (`createAutoIncident`, `resolveAutoIncident`) let other plugins open/close incidents without a user context. Reused by the healthcheck auto-incident flow.
+ - **Health-state notification CTA** now deep-links to `?filter=failing` on the system detail page for non-recovery transitions (label changes to "View failing checks"). The system overview gains an `All / Failing / Healthy` segmented filter wired to the same `?filter=…` param.
+ - **Notification bell badge** now counts collapse groups instead of raw rows, so the number matches what the user sees in the notifications list. Built on `COUNT(DISTINCT COALESCE(collapse_key, id))` — notifications without a collapse key still each count as one.
+ - **`statusFilter` on `getHistory` / `getDetailedHistory`** lets the run-history page and the drawer's Recent Runs panel filter to `All / Healthy / Failing` via shared pills, with the page resetting to the first page on filter change.
+ - **Pagination defaults aligned with selector options.** Several pages defaulted to a page size (5 or 20) that wasn't in the dropdown's options (`[10, 25, 50, 100]`), so the page-size `<Select>` rendered empty. The drawer's Recent Runs now defaults to 10; the Run History, History List, and Delivery Logs pages now default to 25.
+
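Reviewer note: the `suppressDeEscalations` rule in the list above can be stated as a small predicate. The sketch below assumes three statuses ranked `healthy < degraded < unhealthy`, which matches the `unhealthy → degraded` example but is not spelled out in the changelog. `shouldNotify` is a hypothetical name, and treating a non-transition as "no notification" is also an assumption.

```typescript
type Status = "healthy" | "degraded" | "unhealthy";

// Assumed severity ranking; higher means worse.
const SEVERITY: Record<Status, number> = { healthy: 0, degraded: 1, unhealthy: 2 };

// Decides whether a status transition should fire a notification.
function shouldNotify(previous: Status, next: Status, suppressDeEscalations: boolean): boolean {
  // Assumption: no transition means nothing to notify about.
  if (previous === next) return false;

  // A de-escalation is an improvement that still ends in a failing state.
  const isDeEscalation = SEVERITY[next] < SEVERITY[previous] && next !== "healthy";

  // Escalations and full recoveries always notify; de-escalations only when
  // the setting is off.
  return !(suppressDeEscalations && isDeEscalation);
}
```

With the setting on, `unhealthy → degraded` is silenced while `degraded → unhealthy` and `unhealthy → healthy` still notify, which is the behaviour the bullet describes.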
+ Includes Drizzle migrations adding the `notification_policy` jsonb column to `system_health_checks`, plus two new tables: `health_check_unhealthy_transitions` (for threshold counting) and `health_check_auto_incidents` (for mapping back to incident ids during auto-close).
+
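Reviewer note: the bell-badge bullet in the 1.2.0 entry quotes `COUNT(DISTINCT COALESCE(collapse_key, id))`. An in-memory equivalent is sketched below to show why uncollapsed notifications still count once each. `NotificationRow` and `badgeCount` are hypothetical names; the real count is computed in SQL.

```typescript
// Hypothetical row shape mirroring the two columns named in the SQL expression.
interface NotificationRow {
  id: string;
  collapseKey: string | null;
}

// Counts collapse groups: rows sharing a collapse key count once, and rows
// without a key fall back to their own id, so each counts individually.
function badgeCount(rows: NotificationRow[]): number {
  return new Set(rows.map((row) => row.collapseKey ?? row.id)).size;
}

const sampleRows: NotificationRow[] = [
  { id: "n1", collapseKey: "system-a" },
  { id: "n2", collapseKey: "system-a" },
  { id: "n3", collapseKey: null },
  { id: "n4", collapseKey: null },
];

console.log(badgeCount(sampleRows)); // 3: one group plus two uncollapsed rows
```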
243
+ ### Patch Changes
244
+
245
+ - Updated dependencies [ba07ae2]
246
+ - @checkstack/healthcheck-common@1.2.0
247
+ - @checkstack/incident-common@1.3.0
248
+ - @checkstack/incident-backend@1.2.0
249
+ - @checkstack/backend-api@0.17.1
250
+ - @checkstack/satellite-backend@0.3.6
251
+ - @checkstack/cache-api@0.3.5
252
+ - @checkstack/catalog-backend@1.1.6
253
+ - @checkstack/command-backend@0.1.30
254
+ - @checkstack/gitops-backend@0.3.6
255
+ - @checkstack/integration-backend@0.1.30
256
+ - @checkstack/queue-api@0.3.5
257
+ - @checkstack/cache-utils@0.2.10
258
+
3
259
  ## 1.1.4
4
260
 
5
261
  ### Patch Changes
@@ -0,0 +1 @@
+ ALTER TABLE "system_health_checks" ADD COLUMN "notification_policy" jsonb;
@@ -0,0 +1,20 @@
+ CREATE TABLE "health_check_auto_incidents" (
+ "id" uuid PRIMARY KEY DEFAULT gen_random_uuid() NOT NULL,
+ "incident_id" uuid NOT NULL,
+ "system_id" text NOT NULL,
+ "configuration_id" uuid NOT NULL,
+ "opened_at" timestamp DEFAULT now() NOT NULL,
+ "closed_at" timestamp
+ );
+ --> statement-breakpoint
+ CREATE TABLE "health_check_unhealthy_transitions" (
+ "id" uuid PRIMARY KEY DEFAULT gen_random_uuid() NOT NULL,
+ "configuration_id" uuid NOT NULL,
+ "system_id" text NOT NULL,
+ "transitioned_at" timestamp DEFAULT now() NOT NULL
+ );
+ --> statement-breakpoint
+ ALTER TABLE "health_check_auto_incidents" ADD CONSTRAINT "health_check_auto_incidents_configuration_id_health_check_configurations_id_fk" FOREIGN KEY ("configuration_id") REFERENCES "health_check_configurations"("id") ON DELETE cascade ON UPDATE no action;--> statement-breakpoint
+ ALTER TABLE "health_check_unhealthy_transitions" ADD CONSTRAINT "health_check_unhealthy_transitions_configuration_id_health_check_configurations_id_fk" FOREIGN KEY ("configuration_id") REFERENCES "health_check_configurations"("id") ON DELETE cascade ON UPDATE no action;--> statement-breakpoint
+ CREATE INDEX "health_check_auto_incidents_active_by_system_idx" ON "health_check_auto_incidents" USING btree ("system_id","closed_at");--> statement-breakpoint
+ CREATE INDEX "health_check_unhealthy_transitions_lookup_idx" ON "health_check_unhealthy_transitions" USING btree ("configuration_id","system_id","transitioned_at");
@@ -0,0 +1,2 @@
+ ALTER TABLE "health_check_auto_incidents" ADD COLUMN "cooldown_minutes" integer;--> statement-breakpoint
+ CREATE INDEX "health_check_auto_incidents_last_close_idx" ON "health_check_auto_incidents" USING btree ("configuration_id","system_id","closed_at");
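Reviewer note: the nullable `cooldown_minutes` column added above appears to back the per-incident cooldown snapshot and the auto-close worker described in the 1.2.0 changelog entry. The sketch below shows how one sweep over such rows could behave, under stated assumptions. The row type mirrors the migration's columns in camelCase, while `healthySince`, `resolveIncident`, and `sweepAutoIncidents` are hypothetical names; the package's actual worker is not shown in this diff.

```typescript
// Mirrors columns from the health_check_auto_incidents migrations.
interface AutoIncidentRow {
  incidentId: string;
  systemId: string;
  cooldownMinutes: number | null;
  closedAt: Date | null;
}

// One sweep: resolves open auto-incidents whose system has been healthy for at
// least the row's cooldown. `healthySince` is a hypothetical lookup of when each
// system last became healthy; systems absent from it are treated as not healthy.
function sweepAutoIncidents(
  rows: AutoIncidentRow[],
  healthySince: Map<string, Date>,
  now: Date,
  resolveIncident: (incidentId: string) => void,
): { closed: string[]; failed: string[] } {
  const closed: string[] = [];
  const failed: string[] = [];

  for (const row of rows) {
    if (row.closedAt !== null) continue; // already closed
    if (row.cooldownMinutes === null) continue; // "Never auto-close"
    const since = healthySince.get(row.systemId);
    if (since === undefined) continue; // system is not currently healthy
    const healthyMinutes = (now.getTime() - since.getTime()) / 60_000;
    if (healthyMinutes < row.cooldownMinutes) continue; // cooldown not elapsed
    try {
      resolveIncident(row.incidentId);
      closed.push(row.incidentId);
    } catch {
      // A failed close is recorded but must not abort the rest of the sweep.
      failed.push(row.incidentId);
    }
  }
  return { closed, failed };
}
```

Reading the cooldown from the row rather than from the current policy is what keeps later policy edits from altering in-flight incidents, as the changelog describes.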