npm - @checkstack/healthcheck-common - Versions diffs - 1.2.0 → 1.4.0 - Mend

@checkstack/healthcheck-common 1.2.0 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5)

package/CHANGELOG.md +166 -0
package/package.json +6 -6
package/src/notification-policy-schema.test.ts +38 -0
package/src/rpc-contract.ts +136 -0
package/src/schemas.ts +10 -93

package/CHANGELOG.md CHANGED

@@ -1,5 +1,171 @@
 # @checkstack/healthcheck-common
+## 1.4.0
+### Minor Changes
+- 270ef29: Add the health-state provider data contract (automation sensing layer, Wave 2 Phase 13).
+  - New `health_check_state_transitions` table records every aggregate health-status transition for a system (all statuses, not just unhealthy), giving a reliable "in current status since" timestamp. Written wherever an aggregate transition is detected. Pruned with raw-run retention, but the single most-recent row per system is always kept so an active streak never blanks.
+  - New service-typed RPCs on `HealthCheckApi`: `getHealthState({ systemId, configurationId? })` returns `{ status, inStatusSince, inStatusForMs, latencyMs?, avgLatencyMs?, p95LatencyMs?, successRate?, lastRunAt?, inMaintenance, evaluatedAt }`, and `getBulkHealthState({ systemIds })` (POST) resolves many systems against one shared timestamp.
+  - New service-typed RPC on `MaintenanceApi`: `hasActiveMaintenance({ systemId })` reports whether a system is in an active maintenance window regardless of notification-suppression (suppression-agnostic), folded into `getHealthState` as `inMaintenance`.
+  All reads are fail-safe: a missing transition row yields `inStatusSince: null`, and a maintenance-plugin error fails open to `inMaintenance: false`.
+- 270ef29: Add a windowed transition count to the health provider - the building block for custom flapping rules (Wave 2 Phase 18).
+  Flapping is already buildable today via the built-in `healthcheck.flapping_detected` trigger; this phase ships the GENERALIZATION for arbitrary "N status changes in M minutes" rules.
+  - `countStateTransitionsInWindow` counts aggregate status transitions for a system over a trailing window (from the Phase 13 `health_check_state_transitions` table - all statuses, generalizing the unhealthy-only flapping detector). Fail-safe to 0.
+  - `getHealthState` / `getBulkHealthState` now return `transitionsInWindow` + `transitionWindowMinutes`, and accept an optional `transitionWindowMinutes` input (default 60).
+  - The automation definition gains an optional top-level `state_window_minutes` (default 60), threaded through `enrichScopeWithState` so `health.system.transitions_in_window` / `health.system.transition_window_minutes` are folded into scope per evaluation.
+  - Operators author custom flapping as a `numeric_state` condition over `health.system.transitions_in_window` - no new condition variant, no editor change. The variable-scope resolver surfaces the new fields for autocomplete.
+- 270ef29: Replace the hardcoded auto-incident path with default automations (Wave 2 Phase 20).
+  BREAKING CHANGES: Auto-incident is now automation-driven. The hardcoded background path that opened incidents on sustained-unhealthy / flapping and closed them after a cooldown (`auto-incident.ts`, `auto-incident-close-job.ts`) is removed. On upgrade, an idempotent, threshold-preserving migration seeds equivalent default automations from each assignment's existing `NotificationPolicy`, so alerting behaviour is preserved 1:1:
+  - `sustainedUnhealthyTrigger.durationMinutes` -> the `for:` dwell on a `healthcheck.system_degraded` trigger -> `incident.create`.
+  - auto-close `autoCloseAfterMinutes` -> a `wait_until` (healthy continuously for the cooldown) -> `incident.resolve`.
+  - `useNotificationSuppression` -> the incident's `suppressNotifications`.
+  - `skipDuringMaintenance` -> a `{{ !health.system.in_maintenance }}` pre-run condition.
+  - `flappingTrigger.{transitions,windowMinutes}` -> a second automation on the `healthcheck.flapping_detected` trigger -> `incident.create`.
+  Auto-incidents remain ONE OPEN INCIDENT PER SYSTEM, faithful to the old behaviour. `incident.create` gains an opt-in `dedupe_open_for_system` config flag (default false, so existing/custom automations are unaffected): when true, it reuses an existing open incident on the target system instead of opening a duplicate (the old `findActiveAutoIncident(systemId)` semantic), returning the reused incident as the produced `incident` artifact. The seeded default automations set this flag, so a system with several failing checks - sustained and/or flapping - still gets a single open incident; whichever check crosses its threshold first opens it, and the rest dedupe to it. Both sustained and flapping default automations open at `critical` severity (parity with the old path). Per-system run dedup within an automation uses `concurrency_scope: "context_key"` + `mode: "single"`.
+  Operators can read, edit, disable, and extend these automations (see the "Customise auto-incident" guide). Seeded automations are tagged via `managedBy` (`auto-incident:<systemId>:<configurationId>:<kind>`) so the migration is a no-op on re-runs; anything unmappable is recorded as a migration-failure row.
+  Flapping DETECTION (transition recording + the `healthcheck.flapping_detected` emit) is relocated into `flapping-detector.ts` and survives; the emit now fires unconditionally on a threshold cross (no longer gated on `autoOpenIncidentOnUnhealthy`), matching the hook's documented intent and required for the flapping default automation. The legacy `health_check_auto_incidents` mapping table is no longer written or read (it will be dropped in a follow-up migration); `health_check_unhealthy_transitions` is retained for the flapping detector.
+  New service-typed `HealthCheckApi.listAutoIncidentPolicies` RPC exposes each assignment's effective notification policy for the migration. `incident.create` adds the `dedupe_open_for_system` flag (additive, defaults off).
+- b995afb: Move health-check flapping configuration from the per-assignment notification policy onto the `healthcheck.flapping_detected` automation trigger.
+  Flapping thresholds (`transitions`, `windowMinutes`) are now configured on the trigger itself, next to the automation that reacts to them, instead of on each check assignment. The health-check executor still owns the windowed transition counting (it writes `health_check_unhealthy_transitions` and runs the window query), but it now SOURCES the thresholds from the subscribed automations' trigger config:
+  - On a transition-to-unhealthy it records the transition unconditionally (keeping history warm), then looks up the enabled automations subscribed to `healthcheck.flapping_detected`, collects the distinct set of configured windows, counts transitions once per distinct window, and emits one `healthcheck.flapping_detected` per window. The trigger's exact-window `evaluateConfig` gate then fires each automation only for its own window and transition threshold.
+  - A missing or partial flapping trigger config defaults to `{ transitions: 3, windowMinutes: 60 }`, so automations created before the trigger carried config keep working unchanged.
+  - `automation-backend` exposes a new backend-only, read-only `automationSubscriptionsRef` service ref (`findEnabledByTriggerEvent`) so a plugin that owns a trigger's underlying event can discover its subscribers' trigger config. It is never browser-exposed.
+  **BREAKING CHANGES**
+  - The per-assignment `notificationPolicy.flappingTrigger` field is removed. `NotificationPolicy` is now `{ suppressDeEscalations }` only. Stored rows that still carry a `flappingTrigger` key parse cleanly - the key is stripped on read - so no data migration is required, but the per-check flapping toggle/threshold in the assignment Notifications tab is gone; configure flapping on the trigger instead.
+  - The GitOps `System.healthcheck[].notificationPolicy.flappingTrigger` field is removed. A `flappingTrigger` block in a manifest is ignored. Move the thresholds to the `transitions` / `windowMinutes` config of your `healthcheck.flapping_detected` automation trigger.
+  - The standalone `enabled` flag for flapping is gone: flapping is "enabled" precisely when at least one enabled automation subscribes to `healthcheck.flapping_detected`. With no subscriber, the transition is still recorded but nothing is counted or emitted.
+- b995afb: Remove the legacy per-assignment auto-incident system. Auto-incidents are now built entirely by user-authored automations; nothing is seeded or hardcoded.
+  What was removed:
+  - The one-time migration that auto-seeded "sustained unhealthy" and "flapping" default automations from each assignment's notification policy, plus the `listAutoIncidentPolicies` RPC it consumed.
+  - The seeder-only notification-policy settings and their UI: `autoOpenIncidentOnUnhealthy`, `useNotificationSuppression`, `skipDuringMaintenance`, `sustainedUnhealthyTrigger`, and `autoCloseAfterMinutes`. The assignment **Notifications** tab now exposes only the two live settings: **Suppress de-escalation notifications** and the **flapping-detection** thresholds.
+  - The dead `health_check_auto_incidents` table (no longer written or read; dropped via migration).
+  What is preserved: flapping detection (`healthcheck.flapping_detected`) and de-escalation suppression are unchanged. The `flappingTrigger` and `suppressDeEscalations` policy fields stay exactly as before.
+  > [!NOTE]
+  > One-time cleanup: an automation-backend migration deletes the historically auto-seeded incident automations (`managed_by LIKE 'auto-incident:%'`) from existing databases. This is intentional and destructive - those automations were no longer managed by anything. If you had edited a seeded automation and want to keep it, re-create it as a normal automation before upgrading. See the "Build auto-incident automations" guide for templates.
+  > [!IMPORTANT]
+  > NARROWING: `NotificationPolicySchema` is narrowed to `{ suppressDeEscalations, flappingTrigger }`. Stored rows that still carry the removed legacy keys parse cleanly - zod strips the unknown keys on read - so no data migration is required for the `system_health_checks.notification_policy` column. GitOps `notificationPolicy` specs that set the removed fields are no longer accepted for those keys.
+- 270ef29: Extend in-UI script testing to health-check collectors, and add
+  load-from-run replay for automation script tests.
+  - Health-check collectors: a new `testCollectorScript` RPC runs the
+    inline-script (TypeScript) collector and the shell `script` collector
+    against an editable, auto-seeded sample context using the same
+    sandboxed runner the real collector uses. Surfaces beneath the
+    collector script fields in the collector editor (both marked
+    `x-script-testable`). Gated by `healthcheck.configuration.manage`.
+  - Automation replay: a new `getRunScopeForReplay` RPC reconstructs an
+    editable test context from a real run (trigger + persisted artifacts,
+    plus the durable scope snapshot when the run is still in-flight), and
+    the script-test panel gains a "Load from run" picker that seeds the
+    sample context from a past run.
+  Note: health-check executions do not persist the script / config /
+  check / system that produced a result, so there is no health-check
+  replay - auto-seed is the only context source for collector tests. This
+  is by design; see the feature plan.
+- 270ef29: Secrets platform Phase 2: secret -> env-var mapping with central resolve, inject, and mask.
+  - Script consumers declare a least-privilege `secretEnv` allowlist
+    (`{ ENV_NAME: "${{ secrets.NAME }}" }`). The automation `run_script` /
+    `run_shell` actions resolve ONLY the declared secrets via
+    `secretResolverRef.resolveForRun`, inject them into the runner env for
+    that run (memory-only; the ESM runner gained a per-run `env` option), and
+    mask their values out of stdout/stderr/result/error via the run-scoped
+    masking context. A missing required secret fails the run clearly. No
+    ambient secret access.
+  - Test panel: `testScript` / `testCollectorScript` inject named
+    `__SECRET_<NAME>__` placeholders by default, or user-supplied per-secret
+    overrides; real production values are never resolved in the test path,
+    and overrides are masked out of the result.
+  - Healthcheck collectors carry the `secretEnv` field for authoring +
+    the test panel; runtime injection on satellites lands in Phase 3.
+  - Editor UX: a new `@checkstack/ui` `SecretEnvEditor` renders `x-secret-env`
+    record fields with `${{ secrets.* }}` name autocomplete (from
+    `listSecretNames`), wired into the automation action editor and the
+    healthcheck collector editor. New `withConfigMeta` helper +
+    `x-secret-env` config-meta key in `@checkstack/backend-api`.
+## 1.3.0
+### Minor Changes
+- 35bc682: feat(healthcheck): expose check + system run-context to script collectors
+  Script health checks can now read which check and system a run is for.
+  Previously shell scripts got only a curated env whitelist and inline
+  scripts only `context.config`, so a script had no built-in way to know
+  its own check name or the system it was checking.
+  - `@checkstack/backend-api`: new `CollectorRunContext` type
+    (`{ check: { id, name, intervalSeconds }, system: { id, name } }`) and
+    an optional `runContext` param on `CollectorStrategy.execute`. Optional,
+    so existing collector implementations are unaffected.
+  - Shell-script collector: injects reserved `CHECKSTACK_CHECK_ID`,
+    `CHECKSTACK_CHECK_NAME`, `CHECKSTACK_CHECK_INTERVAL_SECONDS`,
+    `CHECKSTACK_SYSTEM_ID`, `CHECKSTACK_SYSTEM_NAME` env vars (user-supplied
+    `env` still wins on collision).
+  - Inline-script collector: exposes `context.check` and `context.system`
+    alongside `context.config`; the inline-script editor now types them for
+    autocomplete.
+  - Shell editors (health-check collectors and automation shell actions) now
+    also suggest the user's own `env` (JSON) keys as `$NAME` completions, via
+    the new exported `customShellEnvVars` helper. Keys that aren't valid shell
+    identifiers are omitted.
+  - Fix: the Typefox `CodeEditor` captured a stale `onChange` at editor start,
+    so editing one `DynamicForm` field reverted sibling fields changed since
+    mount (e.g. typing in a shell `script` field wiped an unsaved `env` value,
+    or deleted a sibling automation action added after mount). The change
+    handler now routes through a ref to the current `onChange`.
+  - Fix: focusing a JSON editor threw "LanguageStatusService.addStatus is not
+    supported" because the standalone service set omitted `ILanguageStatusService`.
+    That one service is now registered via `serviceOverrides`.
+  - Fix: the automation trigger card nested a `<Badge>` (a `<div>`) inside a
+    `<p>`, producing a `validateDOMNesting` warning. Switched the wrapper to a
+    `<div>`.
+  - Local runs (`queue-executor`) and satellite runs both populate the
+    context. `SatelliteAssignment` (and the `getAssignmentsForSatellite`
+    RPC output) gained optional `configName` / `systemName` so the metadata
+    reaches satellite-side execution; `HealthCheckService` resolves the
+    system name via the catalog client.
+  BREAKING CHANGE: `createHealthCheckRouter` now requires a `catalogClient`
+  option (used to resolve system names for satellite assignments). Update
+  call sites to pass the catalog RPC client.
+### Patch Changes
+- Updated dependencies [6d52276]
+  - @checkstack/common@0.12.0
+  - @checkstack/catalog-common@2.2.3
+  - @checkstack/notification-common@1.2.1
+  - @checkstack/signal-common@0.2.5
 ## 1.2.0
 ### Minor Changes
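
The "N status changes in M minutes" rule described in the 1.4.0 entry comes down to counting recorded transitions whose timestamp falls inside a trailing window. The sketch below is a minimal in-memory illustration, not the package's implementation. The `Transition` shape, the function name `countTransitionsInWindow`, and the inclusive window bounds are assumptions; only the 60-minute default comes from the changelog. Because the sketch does no I/O, the changelog's fail-safe-to-0 path is not modelled.

```typescript
// Hypothetical row shape: the real `health_check_state_transitions`
// columns are not shown in this diff.
interface Transition {
  systemId: string;
  status: string;
  at: Date;
}

// Count the transitions recorded for one system inside the trailing
// window [now - windowMinutes, now]. Both bounds are inclusive here,
// which is an assumption of this sketch.
function countTransitionsInWindow(
  rows: readonly Transition[],
  systemId: string,
  now: Date,
  windowMinutes: number = 60,
): number {
  const end = now.getTime();
  const start = end - windowMinutes * 60_000;
  return rows.filter((row) => {
    const t = row.at.getTime();
    return row.systemId === systemId && t >= start && t <= end;
  }).length;
}
```

Under these assumptions, a custom rule such as "3 changes in 60 minutes" becomes the comparison `countTransitionsInWindow(rows, systemId, now, 60) >= 3`. That is the same kind of check the changelog expresses as a `numeric_state` condition over `health.system.transitions_in_window`.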

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@checkstack/healthcheck-common",
-  "version": "1.2.0",
+  "version": "1.4.0",
   "license": "Elastic-2.0",
   "type": "module",
   "exports": {
@@ -9,16 +9,16 @@
     }
   },
   "dependencies": {
-    "@checkstack/common": "0.11.0",
-    "@checkstack/catalog-common": "2.2.2",
-    "@checkstack/notification-common": "1.2.0",
-    "@checkstack/signal-common": "0.2.4",
+    "@checkstack/common": "0.12.0",
+    "@checkstack/catalog-common": "2.2.3",
+    "@checkstack/notification-common": "1.2.1",
+    "@checkstack/signal-common": "0.2.5",
     "zod": "^4.2.1"
   },
   "devDependencies": {
     "typescript": "^5.7.2",
     "@checkstack/tsconfig": "0.0.7",
-    "@checkstack/scripts": "0.3.3"
+    "@checkstack/scripts": "0.3.4"
   },
   "scripts": {
     "typecheck": "tsgo -b",

package/src/notification-policy-schema.test.ts ADDED

@@ -0,0 +1,38 @@
+import { describe, it, expect } from "bun:test";
+import {
+  NotificationPolicySchema,
+  DEFAULT_NOTIFICATION_POLICY,
+} from "./schemas";
+describe("NotificationPolicySchema", () => {
+  it("accepts the slim policy shape", () => {
+    const parsed = NotificationPolicySchema.parse({
+      suppressDeEscalations: true,
+    });
+    expect(parsed).toEqual({ suppressDeEscalations: true });
+  });
+  it("defaults to the compile-time default when parsing an empty object", () => {
+    const parsed = NotificationPolicySchema.parse({});
+    expect(parsed).toEqual(DEFAULT_NOTIFICATION_POLICY);
+    expect(parsed).toEqual({ suppressDeEscalations: false });
+  });
+  it("strips removed legacy keys (auto-incident AND flapping) without throwing", () => {
+    // A row persisted before the legacy auto-incident fields and the
+    // flapping thresholds were removed still carries the larger object.
+    // zod's default object behaviour drops the unknown keys rather than
+    // rejecting them — flapping now lives on the automation trigger config.
+    const parsed = NotificationPolicySchema.parse({
+      suppressDeEscalations: true,
+      flappingTrigger: { enabled: false, transitions: 9, windowMinutes: 10 },
+      autoOpenIncidentOnUnhealthy: true,
+      useNotificationSuppression: true,
+      skipDuringMaintenance: false,
+      sustainedUnhealthyTrigger: { enabled: true, durationMinutes: 5 },
+      autoCloseAfterMinutes: 99,
+    });
+    expect(parsed).toEqual({ suppressDeEscalations: true });
+    expect(Object.keys(parsed)).toEqual(["suppressDeEscalations"]);
+  });
+});
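
The test file above relies on zod dropping unknown object keys by default. For readers who want to see the resulting read behaviour without pulling in zod, the dependency-free sketch below mimics it for the slim policy. `readNotificationPolicy` is a hypothetical helper written for illustration; it is not exported by the package and is not how the schema is implemented.

```typescript
interface NotificationPolicy {
  suppressDeEscalations: boolean;
}

// Hypothetical stand-in for `NotificationPolicySchema.parse`: keeps the one
// known key, applies its default of `false`, and ignores every other key.
function readNotificationPolicy(stored: unknown): NotificationPolicy {
  if (typeof stored !== "object" || stored === null || Array.isArray(stored)) {
    throw new TypeError("notification policy must be an object");
  }
  const raw = (stored as Record<string, unknown>).suppressDeEscalations;
  if (raw === undefined) {
    return { suppressDeEscalations: false };
  }
  if (typeof raw !== "boolean") {
    throw new TypeError("suppressDeEscalations must be a boolean");
  }
  return { suppressDeEscalations: raw };
}
```

A row persisted under 1.2.0 that still carries `flappingTrigger` or `autoCloseAfterMinutes` comes back from this helper with only `suppressDeEscalations`, which is the outcome the third test case asserts for the real schema.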

package/src/rpc-contract.ts CHANGED

@@ -40,6 +40,82 @@ export type SystemHealthStatusResponse = z.infer<
   typeof SystemHealthStatusResponseSchema
 >;
+/**
+ * Live health-state snapshot used by the automation sensing layer.
+ * Service-typed (backend-to-backend). `inStatusSince` is null when no
+ * transition has been recorded; `inStatusForMs` is 0 in that case.
+ */
+const HealthStateSchema = z.object({
+  status: HealthCheckStatusSchema,
+  inStatusSince: z.date().nullable(),
+  inStatusForMs: z.number(),
+  latencyMs: z.number().optional(),
+  avgLatencyMs: z.number().optional(),
+  p95LatencyMs: z.number().optional(),
+  successRate: z.number().optional(),
+  lastRunAt: z.date().optional(),
+  inMaintenance: z.boolean(),
+  /** Count of aggregate status transitions in the trailing window (flapping). */
+  transitionsInWindow: z.number(),
+  /** The window (minutes) `transitionsInWindow` was counted over. */
+  transitionWindowMinutes: z.number(),
+  evaluatedAt: z.date(),
+});
+export type HealthStateResponse = z.infer<typeof HealthStateSchema>;
+// --- Collector script testing (in-UI) ---
+/**
+ * Curated check/system metadata a collector script reads. Every part is
+ * optional so a partial sample still runs.
+ */
+const CollectorTestRunContextSchema = z.object({
+  check: z
+    .object({
+      id: z.string(),
+      name: z.string(),
+      intervalSeconds: z.number().int(),
+    })
+    .optional(),
+  system: z.object({ id: z.string(), name: z.string() }).optional(),
+});
+export const CollectorScriptTestInputSchema = z.object({
+  kind: z.enum(["typescript", "shell"]),
+  script: z.string(),
+  config: z.record(z.string(), z.unknown()).optional(),
+  env: z.record(z.string(), z.string()).optional(),
+  /**
+   * The collector's declared secret -> env mapping. The test panel NEVER
+   * resolves real secret values: each declared env var gets a
+   * `__SECRET_<NAME>__` placeholder by default, or the user override below
+   * (decision 4).
+   */
+  secretEnv: z.record(z.string(), z.string()).optional(),
+  /** User-supplied per-secret-NAME override values, masked out of the result. */
+  secretOverrides: z.record(z.string(), z.string()).optional(),
+  workingDirectory: z.string().optional(),
+  runContext: CollectorTestRunContextSchema.optional(),
+  timeoutMs: z.number().int().min(100).max(300_000).default(30_000),
+});
+export type CollectorScriptTestInputDto = z.infer<
+  typeof CollectorScriptTestInputSchema
+>;
+export const CollectorScriptTestResultSchema = z.object({
+  result: z.unknown().optional(),
+  stdout: z.string(),
+  stderr: z.string(),
+  exitCode: z.number().int().optional(),
+  durationMs: z.number().int().nonnegative(),
+  timedOut: z.boolean(),
+  error: z.string().optional(),
+});
+export type CollectorScriptTestResultDto = z.infer<
+  typeof CollectorScriptTestResultSchema
+>;
 // Health Check RPC Contract using oRPC's contract-first pattern
 export const healthCheckContract = {
   // ==========================================================================
@@ -60,6 +136,24 @@ export const healthCheckContract = {
     .input(z.object({ strategyId: z.string() }))
     .output(z.array(CollectorDtoSchema)),
+  /**
+   * Run a collector script (inline-script TS or the shell `script`
+   * collector) against an editable sample context, using the same
+   * sandboxed runner the real collector uses. Lets operators test a
+   * collector script in the editor without scheduling a real execution.
+   *
+   * Gated by `configuration.manage` because authoring a collector script
+   * already executes code on the central backend - same privilege. The
+   * run is central-only and time-bounded.
+   */
+  testCollectorScript: proc({
+    operationType: "mutation",
+    userType: "authenticated",
+    access: [healthCheckAccess.configuration.manage],
+  })
+    .input(CollectorScriptTestInputSchema)
+    .output(CollectorScriptTestResultSchema),
   // ==========================================================================
   // CONFIGURATION MANAGEMENT (userType: "authenticated")
   // ==========================================================================
@@ -457,6 +551,8 @@ export const healthCheckContract = {
             )
             .optional(),
           intervalSeconds: z.number(),
+          configName: z.string().optional(),
+          systemName: z.string().optional(),
         }),
       ),
     ),
@@ -480,6 +576,46 @@ export const healthCheckContract = {
     )
     .output(z.void()),
+  /**
+   * Live health-state snapshot for a single system (Wave-2 sensing
+   * contract). Returns status, in-status duration, latency, windowed
+   * metrics, and suppression-agnostic maintenance state.
+   */
+  getHealthState: proc({
+    operationType: "query",
+    userType: "service",
+    access: [],
+  })
+    .input(
+      z.object({
+        systemId: z.string(),
+        configurationId: z.string().optional(),
+        /** Trailing window (minutes) for `transitionsInWindow`. Default 60. */
+        transitionWindowMinutes: z.number().int().min(1).optional(),
+      }),
+    )
+    .output(HealthStateSchema),
+  /**
+   * Bulk variant of {@link getHealthState}. POST to avoid N+1 from
+   * dashboards and multi-system automation rules; all systems share one
+   * evaluation timestamp.
+   */
+  getBulkHealthState: proc({
+    operationType: "query",
+    userType: "service",
+    access: [],
+  })
+    .route({ method: "POST" })
+    .input(
+      z.object({
+        systemIds: z.array(z.string()),
+        /** Trailing window (minutes) for `transitionsInWindow`. Default 60. */
+        transitionWindowMinutes: z.number().int().min(1).optional(),
+      }),
+    )
+    .output(z.object({ states: z.record(z.string(), HealthStateSchema) })),
   getRunsForAnalysis: proc({
     operationType: "query",
     userType: "service",
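
The `secretEnv` and `secretOverrides` comments in `CollectorScriptTestInputSchema` describe how the test panel fills declared env vars without resolving real secret values. The sketch below illustrates that substitution, assuming the `${{ secrets.NAME }}` reference syntax and the `__SECRET_<NAME>__` placeholder format quoted in the changelog. The helper names `buildTestSecretEnv` and `maskValues`, the accepted name pattern, the handling of non-reference values, and the `***` mask string are all illustrative assumptions, not the package's API.

```typescript
// Assumed reference syntax: "${{ secrets.NAME }}". The identifier pattern
// for NAME is a guess made for this sketch.
const SECRET_REF = /^\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}$/;

// Build the env for a test run: each declared env var gets the user's
// override for that secret name, or a named placeholder when none is given.
function buildTestSecretEnv(
  secretEnv: Record<string, string>,
  overrides: Record<string, string> = {},
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [envName, ref] of Object.entries(secretEnv)) {
    const match = SECRET_REF.exec(ref);
    if (!match) continue; // not a secret reference; skipped in this sketch
    const secretName = String(match[1]);
    env[envName] = overrides[secretName] ?? `__SECRET_${secretName}__`;
  }
  return env;
}

// Replace every occurrence of the given values in captured output.
function maskValues(text: string, values: readonly string[]): string {
  let masked = text;
  for (const value of values) {
    if (value.length > 0) {
      masked = masked.split(value).join("***");
    }
  }
  return masked;
}
```

In this sketch a caller would pass `Object.values(overrides)` to `maskValues` for `stdout`, `stderr`, and `error` before returning the test result, in line with the schema comment that overrides are "masked out of the result".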

package/src/schemas.ts CHANGED

@@ -209,54 +209,20 @@ export const DEFAULT_STATE_THRESHOLDS: StateThresholds = {
 // --- Notification Policy ---
-/**
- * Trigger that opens an auto-incident after a check has been
- * continuously `unhealthy` for at least `durationMinutes`. Resets if
- * the check recovers to non-unhealthy in between.
- */
-export const SustainedUnhealthyTriggerSchema = z.object({
-  /** When false, this trigger is fully disabled. */
-  enabled: z.boolean().default(true),
-  /** Minimum continuous-unhealthy time before opening. */
-  durationMinutes: z.number().int().min(1).default(30),
-});
-export type SustainedUnhealthyTrigger = z.infer<
-  typeof SustainedUnhealthyTriggerSchema
->;
-/**
- * Trigger that opens an auto-incident when a check has transitioned
- * to `unhealthy` at least `transitions` times within `windowMinutes`.
- * Catches flapping that the sustained-duration trigger would miss
- * because each unhealthy phase is too short.
- */
-export const FlappingTriggerSchema = z.object({
-  /** When false, this trigger is fully disabled. */
-  enabled: z.boolean().default(true),
-  /** Minimum number of transitions-to-unhealthy needed in the window. */
-  transitions: z.number().int().min(1).default(3),
-  /** Sliding window in minutes the transitions are counted over. */
-  windowMinutes: z.number().int().min(1).default(60),
-});
-export type FlappingTrigger = z.infer<typeof FlappingTriggerSchema>;
-export const DEFAULT_SUSTAINED_TRIGGER: SustainedUnhealthyTrigger = {
-  enabled: true,
-  durationMinutes: 30,
-};
-export const DEFAULT_FLAPPING_TRIGGER: FlappingTrigger = {
-  enabled: true,
-  transitions: 3,
-  windowMinutes: 60,
-};
 /**
  * Per-association notification preferences. All fields are evaluated
  * per (system, configuration) — different checks on the same system
  * are fully independent.
+ *
+ * The schema strips unknown keys (zod's default object behaviour), so
+ * rows persisted before the legacy auto-incident fields were removed
+ * (`autoOpenIncidentOnUnhealthy`, `useNotificationSuppression`,
+ * `skipDuringMaintenance`, `sustainedUnhealthyTrigger`,
+ * `autoCloseAfterMinutes`) parse cleanly — the dead keys are dropped.
+ * Flapping thresholds also moved OUT of this policy onto the automation
+ * engine's windowed-count gate (the `system_health_changed` trigger's
+ * `window` block), so a persisted `flappingTrigger` key is likewise stripped
+ * on read.
  */
 export const NotificationPolicySchema = z.object({
   /**
@@ -265,61 +231,12 @@ export const NotificationPolicySchema = z.object({
    * still notify.
    */
   suppressDeEscalations: z.boolean().default(false),
-  /**
-   * When true, the configured triggers can open auto-managed incidents
-   * on the system. Setting this to false disables both triggers
-   * regardless of their individual `enabled` flags.
-   */
-  autoOpenIncidentOnUnhealthy: z.boolean().default(true),
-  /**
-   * When true, the auto-opened incident is created with
-   * `suppressNotifications` enabled so further health-state
-   * notifications for the system are silenced until the incident is
-   * resolved. Only meaningful when `autoOpenIncidentOnUnhealthy` is on.
-   */
-  useNotificationSuppression: z.boolean().default(true),
-  /**
-   * When true, no auto-incident is opened while the system has an
-   * active maintenance window with notification suppression. The
-   * system is intentionally down and shouldn't trip the on-call.
-   */
-  skipDuringMaintenance: z.boolean().default(true),
-  /**
-   * Trigger A: "this check has been unhealthy for X minutes
-   * continuously." Catches real outages.
-   */
-  sustainedUnhealthyTrigger: SustainedUnhealthyTriggerSchema.default(
-    DEFAULT_SUSTAINED_TRIGGER,
-  ),
-  /**
-   * Trigger B: "this check transitioned to unhealthy N times in M
-   * minutes." Catches persistent flapping where no individual
-   * unhealthy phase is long enough for the sustained trigger.
-   */
-  flappingTrigger: FlappingTriggerSchema.default(DEFAULT_FLAPPING_TRIGGER),
-  /**
-   * Minutes of sustained healthy state required before an auto-opened
-   * incident is auto-closed. `null` disables auto-close — the
-   * incident stays open until an operator resolves it manually.
-   */
-  autoCloseAfterMinutes: z
-    .number()
-    .int()
-    .min(1)
-    .nullable()
-    .default(30),
 });
 export type NotificationPolicy = z.infer<typeof NotificationPolicySchema>;
 export const DEFAULT_NOTIFICATION_POLICY: NotificationPolicy = {
   suppressDeEscalations: false,
-  autoOpenIncidentOnUnhealthy: true,
-  useNotificationSuppression: true,
-  skipDuringMaintenance: true,
-  sustainedUnhealthyTrigger: DEFAULT_SUSTAINED_TRIGGER,
-  flappingTrigger: DEFAULT_FLAPPING_TRIGGER,
-  autoCloseAfterMinutes: 30,
 };
 export const AssociateHealthCheckSchema = z.object({
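
The `FlappingTriggerSchema` defaults removed in this hunk (`transitions: 3`, `windowMinutes: 60`) match the fallback the changelog describes for a missing or partial `healthcheck.flapping_detected` trigger config. The sketch below shows that fallback together with the "collect the distinct set of configured windows" step. `resolveFlappingConfig` and `distinctWindows` are hypothetical names, and the real trigger config type is not part of this diff.

```typescript
interface FlappingConfig {
  transitions: number;
  windowMinutes: number;
}

// Defaults quoted in the changelog for a missing or partial trigger config.
const FLAPPING_DEFAULTS: FlappingConfig = { transitions: 3, windowMinutes: 60 };

// Fill any missing field from the defaults.
function resolveFlappingConfig(
  partial?: Partial<FlappingConfig> | null,
): FlappingConfig {
  return {
    transitions: partial?.transitions ?? FLAPPING_DEFAULTS.transitions,
    windowMinutes: partial?.windowMinutes ?? FLAPPING_DEFAULTS.windowMinutes,
  };
}

// Collect the distinct windows across all subscribed automations so that
// transitions are counted once per window rather than once per automation.
function distinctWindows(
  configs: ReadonlyArray<Partial<FlappingConfig> | undefined>,
): number[] {
  const windows = new Set<number>();
  for (const config of configs) {
    windows.add(resolveFlappingConfig(config).windowMinutes);
  }
  return Array.from(windows).sort((a, b) => a - b);
}
```

When migrating a manifest, the thresholds formerly under `notificationPolicy.flappingTrigger` would be carried over as the trigger's `transitions` / `windowMinutes`. An automation that sets neither keeps the behaviour of the old schema defaults.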