@checkstack/healthcheck-common 1.2.0 → 1.4.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,171 @@
  # @checkstack/healthcheck-common
 
+ ## 1.4.0
+
+ ### Minor Changes
+
+ - 270ef29: Add the health-state provider data contract (automation sensing layer, Wave 2 Phase 13).
+
+   - New `health_check_state_transitions` table records every aggregate health-status transition for a system (all statuses, not just unhealthy), giving a reliable "in current status since" timestamp. Written wherever an aggregate transition is detected. Pruned with raw-run retention, but the single most-recent row per system is always kept so an active streak never blanks.
+   - New service-typed RPCs on `HealthCheckApi`: `getHealthState({ systemId, configurationId? })` returns `{ status, inStatusSince, inStatusForMs, latencyMs?, avgLatencyMs?, p95LatencyMs?, successRate?, lastRunAt?, inMaintenance, evaluatedAt }`, and `getBulkHealthState({ systemIds })` (POST) resolves many systems against one shared timestamp.
+   - New service-typed RPC on `MaintenanceApi`: `hasActiveMaintenance({ systemId })` reports whether a system is in an active maintenance window regardless of notification-suppression (suppression-agnostic), folded into `getHealthState` as `inMaintenance`.
+
+   All reads are fail-safe: a missing transition row yields `inStatusSince: null`, and a maintenance-plugin error fails open to `inMaintenance: false`.
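
The fail-safe rules above can be illustrated with a minimal sketch. It assumes two hypothetical synchronous lookups (`latestTransitionAt`, `hasActiveMaintenance`) standing in for the real storage and maintenance clients, which this diff does not show; only the field names come from the changelog.

```typescript
// Hypothetical shape: a subset of the fields listed for getHealthState.
interface HealthStateCore {
  status: string;
  inStatusSince: Date | null;
  inStatusForMs: number;
  inMaintenance: boolean;
  evaluatedAt: Date;
}

// Assumed lookups; the real implementations are not part of this diff.
type TransitionLookup = (systemId: string) => Date | undefined;
type MaintenanceLookup = (systemId: string) => boolean;

function buildHealthState(
  systemId: string,
  status: string,
  latestTransitionAt: TransitionLookup,
  hasActiveMaintenance: MaintenanceLookup,
  now: Date = new Date(),
): HealthStateCore {
  // Missing transition row -> null timestamp and a zero duration.
  const since = latestTransitionAt(systemId) ?? null;
  const inStatusForMs =
    since === null ? 0 : Math.max(0, now.getTime() - since.getTime());

  // A maintenance lookup error fails open: report "not in maintenance".
  let inMaintenance = false;
  try {
    inMaintenance = hasActiveMaintenance(systemId);
  } catch {
    inMaintenance = false;
  }

  return { status, inStatusSince: since, inStatusForMs, inMaintenance, evaluatedAt: now };
}
```

Passing `now` explicitly mirrors the "one shared timestamp" idea of the bulk variant: a caller could evaluate many systems against the same `Date`.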
+
+ - 270ef29: Add a windowed transition count to the health provider - the building block for custom flapping rules (Wave 2 Phase 18).
+
+   Flapping is already buildable today via the built-in `healthcheck.flapping_detected` trigger; this phase ships the GENERALIZATION for arbitrary "N status changes in M minutes" rules.
+
+   - `countStateTransitionsInWindow` counts aggregate status transitions for a system over a trailing window (from the Phase 13 `health_check_state_transitions` table - all statuses, generalizing the unhealthy-only flapping detector). Fail-safe to 0.
+   - `getHealthState` / `getBulkHealthState` now return `transitionsInWindow` + `transitionWindowMinutes`, and accept an optional `transitionWindowMinutes` input (default 60).
+   - The automation definition gains an optional top-level `state_window_minutes` (default 60), threaded through `enrichScopeWithState` so `health.system.transitions_in_window` / `health.system.transition_window_minutes` are folded into scope per evaluation.
+   - Operators author custom flapping as a `numeric_state` condition over `health.system.transitions_in_window` - no new condition variant, no editor change. The variable-scope resolver surfaces the new fields for autocomplete.
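
As a rough illustration of an "N status changes in M minutes" rule, the sketch below counts transition timestamps in a trailing window and falls back to 0 when loading fails. The function names, the in-memory timestamp list, and the inclusive window boundary are assumptions for illustration; the real `countStateTransitionsInWindow` queries the `health_check_state_transitions` table and its signature is not shown in this diff.

```typescript
// Hypothetical loader: stands in for the database query, and may throw.
type TransitionLoader = (systemId: string) => readonly Date[];

function countTransitionsInWindow(
  systemId: string,
  loadTransitions: TransitionLoader,
  windowMinutes: number = 60,
  now: Date = new Date(),
): number {
  try {
    const end = now.getTime();
    const start = end - windowMinutes * 60_000;
    // Assumption: both window edges are inclusive.
    return loadTransitions(systemId).filter((t) => {
      const ms = t.getTime();
      return ms >= start && ms <= end;
    }).length;
  } catch {
    // Fail-safe: a failed read counts as "no transitions".
    return 0;
  }
}

// A custom flapping rule is then a plain numeric comparison on the count.
function exceedsFlappingThreshold(count: number, minTransitions: number): boolean {
  return count >= minTransitions;
}
```

This matches how the changelog describes authoring: the count is surfaced as a number in scope, and the rule itself is an ordinary numeric condition rather than a dedicated condition type.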
23
+
24
+ - 270ef29: Replace the hardcoded auto-incident path with default automations (Wave 2 Phase 20).
25
+
26
+ BREAKING CHANGES: Auto-incident is now automation-driven. The hardcoded background path that opened incidents on sustained-unhealthy / flapping and closed them after a cooldown (`auto-incident.ts`, `auto-incident-close-job.ts`) is removed. On upgrade, an idempotent, threshold-preserving migration seeds equivalent default automations from each assignment's existing `NotificationPolicy`, so alerting behaviour is preserved 1:1:
27
+
28
+ - `sustainedUnhealthyTrigger.durationMinutes` -> the `for:` dwell on a `healthcheck.system_degraded` trigger -> `incident.create`.
29
+ - auto-close `autoCloseAfterMinutes` -> a `wait_until` (healthy continuously for the cooldown) -> `incident.resolve`.
30
+ - `useNotificationSuppression` -> the incident's `suppressNotifications`.
31
+ - `skipDuringMaintenance` -> a `{{ !health.system.in_maintenance }}` pre-run condition.
32
+ - `flappingTrigger.{transitions,windowMinutes}` -> a second automation on the `healthcheck.flapping_detected` trigger -> `incident.create`.
33
+
34
+ Auto-incidents remain ONE OPEN INCIDENT PER SYSTEM, faithful to the old behaviour. `incident.create` gains an opt-in `dedupe_open_for_system` config flag (default false, so existing/custom automations are unaffected): when true, it reuses an existing open incident on the target system instead of opening a duplicate (the old `findActiveAutoIncident(systemId)` semantic), returning the reused incident as the produced `incident` artifact. The seeded default automations set this flag, so a system with several failing checks - sustained and/or flapping - still gets a single open incident; whichever check crosses its threshold first opens it, and the rest dedupe to it. Both sustained and flapping default automations open at `critical` severity (parity with the old path). Per-system run dedup within an automation uses `concurrency_scope: "context_key"` + `mode: "single"`.
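
The "one open incident per system" behaviour of `dedupe_open_for_system` can be pictured with a small in-memory sketch. The `InMemoryIncidents` class, its method names, and the `reused` flag are hypothetical stand-ins; the real `incident.create` action and its storage are not part of this diff.

```typescript
interface Incident {
  id: string;
  systemId: string;
  severity: string;
  status: "open" | "resolved";
}

interface CreateInput {
  systemId: string;
  severity: string;
  // Mirrors the opt-in flag described above; defaults to false.
  dedupeOpenForSystem?: boolean;
}

// Hypothetical store used only to illustrate the dedupe semantic.
class InMemoryIncidents {
  private incidents: Incident[] = [];
  private nextId = 1;

  create(input: CreateInput): { incident: Incident; reused: boolean } {
    if (input.dedupeOpenForSystem === true) {
      const existing = this.incidents.find(
        (i) => i.systemId === input.systemId && i.status === "open",
      );
      // Reuse the open incident instead of opening a duplicate.
      if (existing !== undefined) return { incident: existing, reused: true };
    }
    const incident: Incident = {
      id: `inc-${this.nextId++}`,
      systemId: input.systemId,
      severity: input.severity,
      status: "open",
    };
    this.incidents.push(incident);
    return { incident, reused: false };
  }

  resolve(id: string): void {
    const incident = this.incidents.find((i) => i.id === id);
    if (incident !== undefined) incident.status = "resolved";
  }
}
```

With the flag set, a sustained-unhealthy automation and a flapping automation targeting the same system would both end up holding the same incident, whichever fires first.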
+
+   Operators can read, edit, disable, and extend these automations (see the "Customise auto-incident" guide). Seeded automations are tagged via `managedBy` (`auto-incident:<systemId>:<configurationId>:<kind>`) so the migration is a no-op on re-runs; anything unmappable is recorded as a migration-failure row.
+
+   Flapping DETECTION (transition recording + the `healthcheck.flapping_detected` emit) is relocated into `flapping-detector.ts` and survives; the emit now fires unconditionally on a threshold cross (no longer gated on `autoOpenIncidentOnUnhealthy`), matching the hook's documented intent and required for the flapping default automation. The legacy `health_check_auto_incidents` mapping table is no longer written or read (it will be dropped in a follow-up migration); `health_check_unhealthy_transitions` is retained for the flapping detector.
+
+   New service-typed `HealthCheckApi.listAutoIncidentPolicies` RPC exposes each assignment's effective notification policy for the migration. `incident.create` adds the `dedupe_open_for_system` flag (additive, defaults off).
+
+ - b995afb: Move health-check flapping configuration from the per-assignment notification policy onto the `healthcheck.flapping_detected` automation trigger.
+
+   Flapping thresholds (`transitions`, `windowMinutes`) are now configured on the trigger itself, next to the automation that reacts to them, instead of on each check assignment. The health-check executor still owns the windowed transition counting (it writes `health_check_unhealthy_transitions` and runs the window query), but it now SOURCES the thresholds from the subscribed automations' trigger config:
+
+   - On a transition-to-unhealthy it records the transition unconditionally (keeping history warm), then looks up the enabled automations subscribed to `healthcheck.flapping_detected`, collects the distinct set of configured windows, counts transitions once per distinct window, and emits one `healthcheck.flapping_detected` per window. The trigger's exact-window `evaluateConfig` gate then fires each automation only for its own window and transition threshold.
+   - A missing or partial flapping trigger config defaults to `{ transitions: 3, windowMinutes: 60 }`, so automations created before the trigger carried config keep working unchanged.
+   - `automation-backend` exposes a new backend-only, read-only `automationSubscriptionsRef` service ref (`findEnabledByTriggerEvent`) so a plugin that owns a trigger's underlying event can discover its subscribers' trigger config. It is never browser-exposed.
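
The per-window flow described in the list above might look roughly like the following sketch. All names here (`resolveFlappingConfig`, `distinctWindows`, `emitPerWindow`, `gateFires`) and the event shape are hypothetical; only the default `{ transitions: 3, windowMinutes: 60 }` and the "count once per distinct window, gate on exact window and threshold" behaviour come from the changelog, and treating the threshold as "at least N" is an assumption.

```typescript
interface FlappingTriggerConfig {
  transitions: number;
  windowMinutes: number;
}

const DEFAULT_FLAPPING_CONFIG: FlappingTriggerConfig = {
  transitions: 3,
  windowMinutes: 60,
};

// A missing or partial trigger config falls back to the defaults field by field.
function resolveFlappingConfig(
  partial?: Partial<FlappingTriggerConfig>,
): FlappingTriggerConfig {
  return {
    transitions: partial?.transitions ?? DEFAULT_FLAPPING_CONFIG.transitions,
    windowMinutes: partial?.windowMinutes ?? DEFAULT_FLAPPING_CONFIG.windowMinutes,
  };
}

// Collect each configured window once so the count query runs once per window.
function distinctWindows(configs: readonly FlappingTriggerConfig[]): number[] {
  return Array.from(new Set(configs.map((c) => c.windowMinutes)));
}

// Hypothetical event payload: one event per distinct window.
interface FlappingEvent {
  windowMinutes: number;
  transitions: number;
}

function emitPerWindow(
  configs: readonly FlappingTriggerConfig[],
  countInWindow: (windowMinutes: number) => number,
): FlappingEvent[] {
  return distinctWindows(configs).map((windowMinutes) => ({
    windowMinutes,
    transitions: countInWindow(windowMinutes),
  }));
}

// Exact-window gate: an automation reacts only to its own window,
// and only once its transition threshold is reached.
function gateFires(config: FlappingTriggerConfig, event: FlappingEvent): boolean {
  return (
    event.windowMinutes === config.windowMinutes &&
    event.transitions >= config.transitions
  );
}
```

With no subscribers, `emitPerWindow` returns an empty list, which lines up with the note that nothing is counted or emitted when no enabled automation subscribes.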
+
+   **BREAKING CHANGES**
+
+   - The per-assignment `notificationPolicy.flappingTrigger` field is removed. `NotificationPolicy` is now `{ suppressDeEscalations }` only. Stored rows that still carry a `flappingTrigger` key parse cleanly - the key is stripped on read - so no data migration is required, but the per-check flapping toggle/threshold in the assignment Notifications tab is gone; configure flapping on the trigger instead.
+   - The GitOps `System.healthcheck[].notificationPolicy.flappingTrigger` field is removed. A `flappingTrigger` block in a manifest is ignored. Move the thresholds to the `transitions` / `windowMinutes` config of your `healthcheck.flapping_detected` automation trigger.
+   - The standalone `enabled` flag for flapping is gone: flapping is "enabled" precisely when at least one enabled automation subscribes to `healthcheck.flapping_detected`. With no subscriber, the transition is still recorded but nothing is counted or emitted.
+
+ - b995afb: Remove the legacy per-assignment auto-incident system. Auto-incidents are now built entirely by user-authored automations; nothing is seeded or hardcoded.
+
+   What was removed:
+
+   - The one-time migration that auto-seeded "sustained unhealthy" and "flapping" default automations from each assignment's notification policy, plus the `listAutoIncidentPolicies` RPC it consumed.
+   - The seeder-only notification-policy settings and their UI: `autoOpenIncidentOnUnhealthy`, `useNotificationSuppression`, `skipDuringMaintenance`, `sustainedUnhealthyTrigger`, and `autoCloseAfterMinutes`. The assignment **Notifications** tab now exposes only the two live settings: **Suppress de-escalation notifications** and the **flapping-detection** thresholds.
+   - The dead `health_check_auto_incidents` table (no longer written or read; dropped via migration).
+
+   What is preserved: flapping detection (`healthcheck.flapping_detected`) and de-escalation suppression are unchanged. The `flappingTrigger` and `suppressDeEscalations` policy fields stay exactly as before.
+
+   > [!NOTE]
+   > One-time cleanup: an automation-backend migration deletes the historically auto-seeded incident automations (`managed_by LIKE 'auto-incident:%'`) from existing databases. This is intentional and destructive - those automations were no longer managed by anything. If you had edited a seeded automation and want to keep it, re-create it as a normal automation before upgrading. See the "Build auto-incident automations" guide for templates.
+
+   > [!IMPORTANT]
+   > NARROWING: `NotificationPolicySchema` is narrowed to `{ suppressDeEscalations, flappingTrigger }`. Stored rows that still carry the removed legacy keys parse cleanly - zod strips the unknown keys on read - so no data migration is required for the `system_health_checks.notification_policy` column. GitOps `notificationPolicy` specs that set the removed fields are no longer accepted for those keys.
+
+ - 270ef29: Extend in-UI script testing to health-check collectors, and add
+   load-from-run replay for automation script tests.
+
+   - Health-check collectors: a new `testCollectorScript` RPC runs the
+     inline-script (TypeScript) collector and the shell `script` collector
+     against an editable, auto-seeded sample context using the same
+     sandboxed runner the real collector uses. Surfaces beneath the
+     collector script fields in the collector editor (both marked
+     `x-script-testable`). Gated by `healthcheck.configuration.manage`.
+   - Automation replay: a new `getRunScopeForReplay` RPC reconstructs an
+     editable test context from a real run (trigger + persisted artifacts,
+     plus the durable scope snapshot when the run is still in-flight), and
+     the script-test panel gains a "Load from run" picker that seeds the
+     sample context from a past run.
+
+   Note: health-check executions do not persist the script / config /
+   check / system that produced a result, so there is no health-check
+   replay - auto-seed is the only context source for collector tests. This
+   is by design; see the feature plan.
+
+ - 270ef29: Secrets platform Phase 2: secret -> env-var mapping with central resolve, inject, and mask.
+
+   - Script consumers declare a least-privilege `secretEnv` allowlist
+     (`{ ENV_NAME: "${{ secrets.NAME }}" }`). The automation `run_script` /
+     `run_shell` actions resolve ONLY the declared secrets via
+     `secretResolverRef.resolveForRun`, inject them into the runner env for
+     that run (memory-only; the ESM runner gained a per-run `env` option), and
+     mask their values out of stdout/stderr/result/error via the run-scoped
+     masking context. A missing required secret fails the run clearly. No
+     ambient secret access.
+   - Test panel: `testScript` / `testCollectorScript` inject named
+     `__SECRET_<NAME>__` placeholders by default, or user-supplied per-secret
+     overrides; real production values are never resolved in the test path,
+     and overrides are masked out of the result.
+   - Healthcheck collectors carry the `secretEnv` field for authoring +
+     the test panel; runtime injection on satellites lands in Phase 3.
+   - Editor UX: a new `@checkstack/ui` `SecretEnvEditor` renders `x-secret-env`
+     record fields with `${{ secrets.* }}` name autocomplete (from
+     `listSecretNames`), wired into the automation action editor and the
+     healthcheck collector editor. New `withConfigMeta` helper +
+     `x-secret-env` config-meta key in `@checkstack/backend-api`.
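
To make the resolve / inject / mask sequence above concrete, here is a minimal sketch under stated assumptions: the secret store is a plain `Map`, the helper names (`resolveSecretEnv`, `placeholderSecretEnv`, `maskSecrets`) and the `***` mask string are invented for illustration, and none of this is the actual `secretResolverRef.resolveForRun` implementation.

```typescript
// Matches a full "${{ secrets.NAME }}" reference and captures NAME.
const SECRET_REF = /^\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}$/;

function secretNameOf(envName: string, ref: string): string {
  const match = SECRET_REF.exec(ref);
  if (match === null) {
    throw new Error("secretEnv." + envName + " is not a secrets reference");
  }
  return match[1] ?? "";
}

// Resolve ONLY the declared secrets; anything else in the store stays invisible.
function resolveSecretEnv(
  secretEnv: Record<string, string>,
  store: ReadonlyMap<string, string>,
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [envName, ref] of Object.entries(secretEnv)) {
    const name = secretNameOf(envName, ref);
    const value = store.get(name);
    // A missing required secret fails the run clearly.
    if (value === undefined) throw new Error("Missing required secret: " + name);
    env[envName] = value;
  }
  return env;
}

// Test-path variant: never touches real values, only named placeholders.
function placeholderSecretEnv(secretEnv: Record<string, string>): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [envName, ref] of Object.entries(secretEnv)) {
    env[envName] = "__SECRET_" + secretNameOf(envName, ref) + "__";
  }
  return env;
}

// Replace every occurrence of each secret value; longest first so a value
// that contains another is masked as a whole.
function maskSecrets(text: string, values: readonly string[]): string {
  const ordered = [...values].filter((v) => v.length > 0).sort((a, b) => b.length - a.length);
  return ordered.reduce((masked, v) => masked.split(v).join("***"), text);
}
```

A run would then pass the resolved env to the script runner and apply `maskSecrets` to stdout, stderr, result, and error text before persisting or returning them.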
+
+ ## 1.3.0
+
+ ### Minor Changes
+
+ - 35bc682: feat(healthcheck): expose check + system run-context to script collectors
+
+   Script health checks can now read which check and system a run is for.
+   Previously shell scripts got only a curated env whitelist and inline
+   scripts only `context.config`, so a script had no built-in way to know
+   its own check name or the system it was checking.
+
+   - `@checkstack/backend-api`: new `CollectorRunContext` type
+     (`{ check: { id, name, intervalSeconds }, system: { id, name } }`) and
+     an optional `runContext` param on `CollectorStrategy.execute`. Optional,
+     so existing collector implementations are unaffected.
+   - Shell-script collector: injects reserved `CHECKSTACK_CHECK_ID`,
+     `CHECKSTACK_CHECK_NAME`, `CHECKSTACK_CHECK_INTERVAL_SECONDS`,
+     `CHECKSTACK_SYSTEM_ID`, `CHECKSTACK_SYSTEM_NAME` env vars (user-supplied
+     `env` still wins on collision).
+   - Inline-script collector: exposes `context.check` and `context.system`
+     alongside `context.config`; the inline-script editor now types them for
+     autocomplete.
+   - Shell editors (health-check collectors and automation shell actions) now
+     also suggest the user's own `env` (JSON) keys as `$NAME` completions, via
+     the new exported `customShellEnvVars` helper. Keys that aren't valid shell
+     identifiers are omitted.
+   - Fix: the Typefox `CodeEditor` captured a stale `onChange` at editor start,
+     so editing one `DynamicForm` field reverted sibling fields changed since
+     mount (e.g. typing in a shell `script` field wiped an unsaved `env` value,
+     or deleted a sibling automation action added after mount). The change
+     handler now routes through a ref to the current `onChange`.
+   - Fix: focusing a JSON editor threw "LanguageStatusService.addStatus is not
+     supported" because the standalone service set omitted `ILanguageStatusService`.
+     That one service is now registered via `serviceOverrides`.
+   - Fix: the automation trigger card nested a `<Badge>` (a `<div>`) inside a
+     `<p>`, producing a `validateDOMNesting` warning. Switched the wrapper to a
+     `<div>`.
+   - Local runs (`queue-executor`) and satellite runs both populate the
+     context. `SatelliteAssignment` (and the `getAssignmentsForSatellite`
+     RPC output) gained optional `configName` / `systemName` so the metadata
+     reaches satellite-side execution; `HealthCheckService` resolves the
+     system name via the catalog client.
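
The reserved-variable injection for shell collectors, with user `env` winning on collision, can be sketched as a simple object merge. `buildShellEnv` is a hypothetical helper name; the `CollectorRunContext` shape and the `CHECKSTACK_*` variable names are taken from the entry above.

```typescript
// Shape as described in the changelog entry.
interface CollectorRunContext {
  check: { id: string; name: string; intervalSeconds: number };
  system: { id: string; name: string };
}

// Hypothetical helper: merges reserved variables with the user's own env.
function buildShellEnv(
  runContext: CollectorRunContext | undefined,
  userEnv: Record<string, string> = {},
): Record<string, string> {
  // runContext is optional, so older call sites simply get no reserved vars.
  const reserved: Record<string, string> =
    runContext === undefined
      ? {}
      : {
          CHECKSTACK_CHECK_ID: runContext.check.id,
          CHECKSTACK_CHECK_NAME: runContext.check.name,
          CHECKSTACK_CHECK_INTERVAL_SECONDS: String(runContext.check.intervalSeconds),
          CHECKSTACK_SYSTEM_ID: runContext.system.id,
          CHECKSTACK_SYSTEM_NAME: runContext.system.name,
        };
  // Spread order matters: user-supplied env is applied last, so it wins.
  return { ...reserved, ...userEnv };
}
```

Environment values are strings, which is why the numeric interval is converted with `String(...)` before injection.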
156
+
157
+ BREAKING CHANGE: `createHealthCheckRouter` now requires a `catalogClient`
158
+ option (used to resolve system names for satellite assignments). Update
159
+ call sites to pass the catalog RPC client.
160
+
161
+ ### Patch Changes
162
+
163
+ - Updated dependencies [6d52276]
164
+ - @checkstack/common@0.12.0
165
+ - @checkstack/catalog-common@2.2.3
166
+ - @checkstack/notification-common@1.2.1
167
+ - @checkstack/signal-common@0.2.5
168
+
3
169
  ## 1.2.0
4
170
 
5
171
  ### Minor Changes
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@checkstack/healthcheck-common",
-   "version": "1.2.0",
+   "version": "1.4.0",
    "license": "Elastic-2.0",
    "type": "module",
    "exports": {
@@ -9,16 +9,16 @@
      }
    },
    "dependencies": {
-     "@checkstack/common": "0.11.0",
-     "@checkstack/catalog-common": "2.2.2",
-     "@checkstack/notification-common": "1.2.0",
-     "@checkstack/signal-common": "0.2.4",
+     "@checkstack/common": "0.12.0",
+     "@checkstack/catalog-common": "2.2.3",
+     "@checkstack/notification-common": "1.2.1",
+     "@checkstack/signal-common": "0.2.5",
      "zod": "^4.2.1"
    },
    "devDependencies": {
      "typescript": "^5.7.2",
      "@checkstack/tsconfig": "0.0.7",
-     "@checkstack/scripts": "0.3.3"
+     "@checkstack/scripts": "0.3.4"
    },
    "scripts": {
      "typecheck": "tsgo -b",
@@ -0,0 +1,38 @@
+ import { describe, it, expect } from "bun:test";
+ import {
+   NotificationPolicySchema,
+   DEFAULT_NOTIFICATION_POLICY,
+ } from "./schemas";
+
+ describe("NotificationPolicySchema", () => {
+   it("accepts the slim policy shape", () => {
+     const parsed = NotificationPolicySchema.parse({
+       suppressDeEscalations: true,
+     });
+     expect(parsed).toEqual({ suppressDeEscalations: true });
+   });
+
+   it("defaults to the compile-time default when parsing an empty object", () => {
+     const parsed = NotificationPolicySchema.parse({});
+     expect(parsed).toEqual(DEFAULT_NOTIFICATION_POLICY);
+     expect(parsed).toEqual({ suppressDeEscalations: false });
+   });
+
+   it("strips removed legacy keys (auto-incident AND flapping) without throwing", () => {
+     // A row persisted before the legacy auto-incident fields and the
+     // flapping thresholds were removed still carries the larger object.
+     // zod's default object behaviour drops the unknown keys rather than
+     // rejecting them — flapping now lives on the automation trigger config.
+     const parsed = NotificationPolicySchema.parse({
+       suppressDeEscalations: true,
+       flappingTrigger: { enabled: false, transitions: 9, windowMinutes: 10 },
+       autoOpenIncidentOnUnhealthy: true,
+       useNotificationSuppression: true,
+       skipDuringMaintenance: false,
+       sustainedUnhealthyTrigger: { enabled: true, durationMinutes: 5 },
+       autoCloseAfterMinutes: 99,
+     });
+     expect(parsed).toEqual({ suppressDeEscalations: true });
+     expect(Object.keys(parsed)).toEqual(["suppressDeEscalations"]);
+   });
+ });
@@ -40,6 +40,82 @@ export type SystemHealthStatusResponse = z.infer<
    typeof SystemHealthStatusResponseSchema
  >;
 
+ /**
+  * Live health-state snapshot used by the automation sensing layer.
+  * Service-typed (backend-to-backend). `inStatusSince` is null when no
+  * transition has been recorded; `inStatusForMs` is 0 in that case.
+  */
+ const HealthStateSchema = z.object({
+   status: HealthCheckStatusSchema,
+   inStatusSince: z.date().nullable(),
+   inStatusForMs: z.number(),
+   latencyMs: z.number().optional(),
+   avgLatencyMs: z.number().optional(),
+   p95LatencyMs: z.number().optional(),
+   successRate: z.number().optional(),
+   lastRunAt: z.date().optional(),
+   inMaintenance: z.boolean(),
+   /** Count of aggregate status transitions in the trailing window (flapping). */
+   transitionsInWindow: z.number(),
+   /** The window (minutes) `transitionsInWindow` was counted over. */
+   transitionWindowMinutes: z.number(),
+   evaluatedAt: z.date(),
+ });
+
+ export type HealthStateResponse = z.infer<typeof HealthStateSchema>;
+
+ // --- Collector script testing (in-UI) ---
+
+ /**
+  * Curated check/system metadata a collector script reads. Every part is
+  * optional so a partial sample still runs.
+  */
+ const CollectorTestRunContextSchema = z.object({
+   check: z
+     .object({
+       id: z.string(),
+       name: z.string(),
+       intervalSeconds: z.number().int(),
+     })
+     .optional(),
+   system: z.object({ id: z.string(), name: z.string() }).optional(),
+ });
+
+ export const CollectorScriptTestInputSchema = z.object({
+   kind: z.enum(["typescript", "shell"]),
+   script: z.string(),
+   config: z.record(z.string(), z.unknown()).optional(),
+   env: z.record(z.string(), z.string()).optional(),
+   /**
+    * The collector's declared secret -> env mapping. The test panel NEVER
+    * resolves real secret values: each declared env var gets a
+    * `__SECRET_<NAME>__` placeholder by default, or the user override below
+    * (decision 4).
+    */
+   secretEnv: z.record(z.string(), z.string()).optional(),
+   /** User-supplied per-secret-NAME override values, masked out of the result. */
+   secretOverrides: z.record(z.string(), z.string()).optional(),
+   workingDirectory: z.string().optional(),
+   runContext: CollectorTestRunContextSchema.optional(),
+   timeoutMs: z.number().int().min(100).max(300_000).default(30_000),
+ });
+ export type CollectorScriptTestInputDto = z.infer<
+   typeof CollectorScriptTestInputSchema
+ >;
+
+ export const CollectorScriptTestResultSchema = z.object({
+   result: z.unknown().optional(),
+   stdout: z.string(),
+   stderr: z.string(),
+   exitCode: z.number().int().optional(),
+   durationMs: z.number().int().nonnegative(),
+   timedOut: z.boolean(),
+   error: z.string().optional(),
+ });
+ export type CollectorScriptTestResultDto = z.infer<
+   typeof CollectorScriptTestResultSchema
+ >;
+
  // Health Check RPC Contract using oRPC's contract-first pattern
  export const healthCheckContract = {
    // ==========================================================================
@@ -60,6 +136,24 @@ export const healthCheckContract = {
      .input(z.object({ strategyId: z.string() }))
      .output(z.array(CollectorDtoSchema)),
 
+   /**
+    * Run a collector script (inline-script TS or the shell `script`
+    * collector) against an editable sample context, using the same
+    * sandboxed runner the real collector uses. Lets operators test a
+    * collector script in the editor without scheduling a real execution.
+    *
+    * Gated by `configuration.manage` because authoring a collector script
+    * already executes code on the central backend - same privilege. The
+    * run is central-only and time-bounded.
+    */
+   testCollectorScript: proc({
+     operationType: "mutation",
+     userType: "authenticated",
+     access: [healthCheckAccess.configuration.manage],
+   })
+     .input(CollectorScriptTestInputSchema)
+     .output(CollectorScriptTestResultSchema),
+
    // ==========================================================================
    // CONFIGURATION MANAGEMENT (userType: "authenticated")
    // ==========================================================================
@@ -457,6 +551,8 @@ export const healthCheckContract = {
            )
            .optional(),
          intervalSeconds: z.number(),
+         configName: z.string().optional(),
+         systemName: z.string().optional(),
        }),
      ),
    ),
@@ -480,6 +576,46 @@ export const healthCheckContract = {
      )
      .output(z.void()),
 
+   /**
+    * Live health-state snapshot for a single system (Wave-2 sensing
+    * contract). Returns status, in-status duration, latency, windowed
+    * metrics, and suppression-agnostic maintenance state.
+    */
+   getHealthState: proc({
+     operationType: "query",
+     userType: "service",
+     access: [],
+   })
+     .input(
+       z.object({
+         systemId: z.string(),
+         configurationId: z.string().optional(),
+         /** Trailing window (minutes) for `transitionsInWindow`. Default 60. */
+         transitionWindowMinutes: z.number().int().min(1).optional(),
+       }),
+     )
+     .output(HealthStateSchema),
+
+   /**
+    * Bulk variant of {@link getHealthState}. POST to avoid N+1 from
+    * dashboards and multi-system automation rules; all systems share one
+    * evaluation timestamp.
+    */
+   getBulkHealthState: proc({
+     operationType: "query",
+     userType: "service",
+     access: [],
+   })
+     .route({ method: "POST" })
+     .input(
+       z.object({
+         systemIds: z.array(z.string()),
+         /** Trailing window (minutes) for `transitionsInWindow`. Default 60. */
+         transitionWindowMinutes: z.number().int().min(1).optional(),
+       }),
+     )
+     .output(z.object({ states: z.record(z.string(), HealthStateSchema) })),
+
    getRunsForAnalysis: proc({
      operationType: "query",
      userType: "service",
package/src/schemas.ts CHANGED
@@ -209,54 +209,20 @@ export const DEFAULT_STATE_THRESHOLDS: StateThresholds = {
 
  // --- Notification Policy ---
 
- /**
-  * Trigger that opens an auto-incident after a check has been
-  * continuously `unhealthy` for at least `durationMinutes`. Resets if
-  * the check recovers to non-unhealthy in between.
-  */
- export const SustainedUnhealthyTriggerSchema = z.object({
-   /** When false, this trigger is fully disabled. */
-   enabled: z.boolean().default(true),
-   /** Minimum continuous-unhealthy time before opening. */
-   durationMinutes: z.number().int().min(1).default(30),
- });
-
- export type SustainedUnhealthyTrigger = z.infer<
-   typeof SustainedUnhealthyTriggerSchema
- >;
-
- /**
-  * Trigger that opens an auto-incident when a check has transitioned
-  * to `unhealthy` at least `transitions` times within `windowMinutes`.
-  * Catches flapping that the sustained-duration trigger would miss
-  * because each unhealthy phase is too short.
-  */
- export const FlappingTriggerSchema = z.object({
-   /** When false, this trigger is fully disabled. */
-   enabled: z.boolean().default(true),
-   /** Minimum number of transitions-to-unhealthy needed in the window. */
-   transitions: z.number().int().min(1).default(3),
-   /** Sliding window in minutes the transitions are counted over. */
-   windowMinutes: z.number().int().min(1).default(60),
- });
-
- export type FlappingTrigger = z.infer<typeof FlappingTriggerSchema>;
-
- export const DEFAULT_SUSTAINED_TRIGGER: SustainedUnhealthyTrigger = {
-   enabled: true,
-   durationMinutes: 30,
- };
-
- export const DEFAULT_FLAPPING_TRIGGER: FlappingTrigger = {
-   enabled: true,
-   transitions: 3,
-   windowMinutes: 60,
- };
-
  /**
   * Per-association notification preferences. All fields are evaluated
   * per (system, configuration) — different checks on the same system
   * are fully independent.
+  *
+  * The schema strips unknown keys (zod's default object behaviour), so
+  * rows persisted before the legacy auto-incident fields were removed
+  * (`autoOpenIncidentOnUnhealthy`, `useNotificationSuppression`,
+  * `skipDuringMaintenance`, `sustainedUnhealthyTrigger`,
+  * `autoCloseAfterMinutes`) parse cleanly — the dead keys are dropped.
+  * Flapping thresholds also moved OUT of this policy onto the automation
+  * engine's windowed-count gate (the `system_health_changed` trigger's
+  * `window` block), so a persisted `flappingTrigger` key is likewise stripped
+  * on read.
   */
  export const NotificationPolicySchema = z.object({
    /**
@@ -265,61 +231,12 @@ export const NotificationPolicySchema = z.object({
     * still notify.
     */
    suppressDeEscalations: z.boolean().default(false),
-   /**
-    * When true, the configured triggers can open auto-managed incidents
-    * on the system. Setting this to false disables both triggers
-    * regardless of their individual `enabled` flags.
-    */
-   autoOpenIncidentOnUnhealthy: z.boolean().default(true),
-   /**
-    * When true, the auto-opened incident is created with
-    * `suppressNotifications` enabled so further health-state
-    * notifications for the system are silenced until the incident is
-    * resolved. Only meaningful when `autoOpenIncidentOnUnhealthy` is on.
-    */
-   useNotificationSuppression: z.boolean().default(true),
-   /**
-    * When true, no auto-incident is opened while the system has an
-    * active maintenance window with notification suppression. The
-    * system is intentionally down and shouldn't trip the on-call.
-    */
-   skipDuringMaintenance: z.boolean().default(true),
-   /**
-    * Trigger A: "this check has been unhealthy for X minutes
-    * continuously." Catches real outages.
-    */
-   sustainedUnhealthyTrigger: SustainedUnhealthyTriggerSchema.default(
-     DEFAULT_SUSTAINED_TRIGGER,
-   ),
-   /**
-    * Trigger B: "this check transitioned to unhealthy N times in M
-    * minutes." Catches persistent flapping where no individual
-    * unhealthy phase is long enough for the sustained trigger.
-    */
-   flappingTrigger: FlappingTriggerSchema.default(DEFAULT_FLAPPING_TRIGGER),
-   /**
-    * Minutes of sustained healthy state required before an auto-opened
-    * incident is auto-closed. `null` disables auto-close — the
-    * incident stays open until an operator resolves it manually.
-    */
-   autoCloseAfterMinutes: z
-     .number()
-     .int()
-     .min(1)
-     .nullable()
-     .default(30),
  });
 
  export type NotificationPolicy = z.infer<typeof NotificationPolicySchema>;
 
  export const DEFAULT_NOTIFICATION_POLICY: NotificationPolicy = {
    suppressDeEscalations: false,
-   autoOpenIncidentOnUnhealthy: true,
-   useNotificationSuppression: true,
-   skipDuringMaintenance: true,
-   sustainedUnhealthyTrigger: DEFAULT_SUSTAINED_TRIGGER,
-   flappingTrigger: DEFAULT_FLAPPING_TRIGGER,
-   autoCloseAfterMinutes: 30,
  };
 
  export const AssociateHealthCheckSchema = z.object({