@checkstack/healthcheck-common 1.3.0 → 1.5.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,199 @@
  # @checkstack/healthcheck-common

+ ## 1.5.0
+
+ ### Minor Changes
+
+ - 9dcc848: Plugin-owned AI tools: every domain plugin contributes its own AI tools (chat assistant + automation AI action), and `ai-backend` is platform-only.
+
+ Every plugin-specific AI tool is owned by the plugin whose domain it acts on, registered via that plugin's own `aiToolExtensionPoint` / `aiToolProjectionExtensionPoint` from its init - the same path an external plugin author uses. `ai-backend` no longer imports or depends on any capability plugin's `*-common`; the dependency direction is strictly plugin -> ai-platform. Pure helpers (`computeFieldDiff`, capability-summary, `ScriptContextKind`) live in `@checkstack/ai-common`.
+
+ Tools shipped:
+
+ - Health checks and automations: full CRUD - `healthcheck.propose` / `automation.propose` and `*.update` (`mutate`, deep-validated) and `*.delete` (`destructive`, always confirm-gated). `healthcheck.propose`'s dry-run calls the new deep `validateConfiguration` so propose-time validation matches apply-time. Assertions are validated against the collector's result schema and the canonical operator vocabulary. Capability-catalog tools (`ai.listCapabilities`, `ai.getCapabilitySchema`), script context tools (`ai.getScriptContext`, `ai.testScript`), and notify-subscriber tools (`healthcheck.notifySystemSubscribers` / `...GroupSubscribers`).
+ - Catalog: `catalog.createSystem` / `updateSystem` / `createGroup` / `updateGroup` (`mutate`), `catalog.deleteSystem` / `deleteGroup` (`destructive`), membership tools (`mutate`), plus `catalog.listSystems` / `listGroups` read projections.
+ - Incident: `incident.create` / `update` / `addUpdate` / `resolve` / `addLink` (`mutate`), `incident.delete` / `removeLink` (`destructive`), and `incident.get` / `incident.list` read projections.
+ - Maintenance: `maintenance.create` / `update` / `addUpdate` / `close` / `addLink` (`mutate`), `maintenance.delete` / `removeLink` (`destructive`), and `maintenance.list` / `get` read projections.
+ - Read projections for SLO (`slo.listObjectives`), dependency (`dependency.list`), incident (`incident.list`), healthcheck (`healthcheck.status`), and anomaly (`anomaly.explain`), each gated by the source procedure's own access rule and routed as the principal.
+ - Documentation grounding: `ai.searchDocs` / `ai.getDoc` over a build-time bundled docs index (BM25-ish ranking), so the assistant grounds how-to answers in Checkstack's own docs offline.
+ - URL introspection: `ai.probeUrl`, an SSRF-guarded read tool the assistant uses to inspect a real endpoint before drafting a health check. Update tools compute a before -> after field diff rendered on the confirm card (approve mode) or an "Applied" card (auto mode), so a change is never silent.
+
+ `ai_analyze` automation action (automation-backend, with an editor connection picker + audited tool calls): runs a bounded AI agent on the run context as the automation's `runAs` service account, so it can never exceed that identity's permissions; destructive tools are never offered; mutating tools auto-apply through the service account's client. Produces an `automation.analysis` artifact downstream actions can branch on. The agent loop is exposed as a headless `aiAgentRunnerRef` service so automation-backend can drive it without depending on ai-backend.
+
+ `notification.notifyForSubscription` is now callable by user / application principals holding `notification.send` (previously service-only). Every tool routes through the user-scoped client, so handler-side authorization is enforced exactly as a direct UI/RPC action; the resolver gate plus the propose/apply re-check at propose AND apply are the additional authority. A systemic authz regression test asserts every registered tool falls into exactly one safe authorization category.
+
+ A new `ai_transport` enum value `automation` records the AI action's tool calls in the `ai_tool_calls` audit log. No new durable state beyond that; each tool is a thin, deterministic wrapper over an existing RPC, so every pod behaves identically.
+
+ This is a beta minor.
+
+ - 9dcc848: Add environments as a first-class catalog primitive, with per-environment health-check fan-out, config templating, per-environment reactive health, and script run-context exposure.
+
+ - Catalog primitive: an environment is a sibling of groups - a named, instance-global record carrying free-form custom fields (baseUrl, region, tier, ...) that any system can belong to many-to-many. New `environments` + `systems_environments` tables, `EnvironmentSchema` + create/update schemas, `EntityService` environment CRUD and membership joins, RPC endpoints gated by a new `catalogAccess.environment` access rule, a GitOps `Environment` kind + `System.environments` extension, and frontend management (an `EnvironmentEditor`, an Environments management panel, and a per-system environment picker). The Environments card's Add/Edit/Delete affordances are gated on `catalogAccess.environment.manage`.
+ - Per-environment fan-out: run identity becomes `(systemId, configurationId, environmentId)`. Runs, aggregates, and state transitions gain a nullable `environmentId`. The health-check assignment gains an `environmentIds` selector with three modes (All / Specific / None; `null` and `[]` are distinct). The queue executor resolves the effective environment set via the catalog `resolveSystemEnvironments` read and executes one isolated run per environment.
+ - Config templating: a new `x-templatable` config-field marker renders a string field through the template engine at execute time, against `{ environment, check, system }`. A shared `renderTemplatableConfig` and a `renderTemplatePreview` helper (re-exported from `@checkstack/template-engine`) keep editor previews identical to the run-time render. The HTTP collector's `url`, `headers[].value`, and `body` are templatable, rendered per environment (the strategy client build moves inside the per-env loop); the `url`'s `.url()` validation moves post-render. Secrets resolve before templating; a field marked both secret and `x-templatable` is rejected at plugin load. `DynamicForm` shows a live "Preview" line, and the catalog `EnvironmentPreviewPicker` ("Preview as: <environment>") drives it in the collector editor (only when the schema has a templatable field).
+ - Script run-context: `CollectorRunContext` gains an optional `environment` field (`{ id, name, fields }`, metadata only). Shell collectors receive `CHECKSTACK_ENV_ID` / `_NAME` / `CHECKSTACK_ENV_<FIELD>` vars; inline TS collectors read `globalThis.context.environment`; the editor test panel mirrors both. The env-less path is unchanged.
+ - Per-environment reactive health (see BREAKING below), env-keyed read/write paths, env-qualified serialization locks, an optional `trigger.payload.environmentId`, per-environment isolation, and an `ENVIRONMENT_RESOLUTION_FAILED` signal when catalog resolution degrades to a single env-less run.
+
+ BREAKING CHANGES: the reactive `health` entity's id-shape and cardinality change. It now encodes two views: per-environment (id `"<systemId>::<environmentId>"`) and a system rollup (id `"<systemId>"`, the worst status across environments + env-less runs). The rollup PRESERVES the pre-existing system-level contract - dashboards, status badges, and automations referencing health by `systemId` keep working without re-authoring - but the entity's contract surface changed (new id-shape, higher cardinality, new payload field), so it is flagged breaking. `getBulkHealthState` parses env-qualified ids and keys results by the original id.
+
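Reader's note (not part of the package): the two `health` id shapes described in the BREAKING CHANGES paragraph can be illustrated with a minimal sketch. The names `parseHealthId` and `rollupStatus` are hypothetical, and the sketch assumes that a `systemId` never contains `::`, that severity orders as healthy < degraded < unhealthy, and that an empty input rolls up to `healthy`. None of these assumptions is confirmed by the changelog.

```typescript
// Illustrative only: helper names and the severity ordering are assumptions.
type HealthStatus = "healthy" | "degraded" | "unhealthy";

interface ParsedHealthId {
  systemId: string;
  /** null means the id is the system rollup ("<systemId>"). */
  environmentId: string | null;
}

const ID_SEPARATOR = "::";

// Split "<systemId>::<environmentId>"; an id without the separator is the rollup.
function parseHealthId(id: string): ParsedHealthId {
  const at = id.indexOf(ID_SEPARATOR);
  if (at === -1) return { systemId: id, environmentId: null };
  return {
    systemId: id.slice(0, at),
    environmentId: id.slice(at + ID_SEPARATOR.length),
  };
}

// Assumed ordering: a higher number is a worse status.
const SEVERITY: Record<HealthStatus, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
};

// The rollup is the worst status across per-environment and env-less runs.
function rollupStatus(statuses: HealthStatus[]): HealthStatus {
  let worst: HealthStatus = "healthy";
  for (const status of statuses) {
    if (SEVERITY[status] > SEVERITY[worst]) worst = status;
  }
  return worst;
}
```

Under these assumptions, a consumer that still looks up health by plain `systemId` reads the rollup view, which is how the pre-existing contract is described as being preserved.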
+ State and scale: membership and custom fields live only in catalog Postgres and are re-read every tick via the cross-plugin RPC; env-keyed health reads from shared `health_check_runs` / aggregates / transitions (compute-on-read). Every pod resolves the same effective set and the same per-environment health. No pod-local environment state.
+
+ Also: `unwrapSchema` in `zod-config.ts` loops instead of single-pass-stripping so multi-layer wrappers (`.optional().default()`) still resolve `x-templatable` meta. The env-less `{{ environment.* }}` run notice logs at `debug` (a legitimate recurring configuration), while the post-render HTTP `.url()` check still fails a genuinely-broken empty render with a clear "Rendered URL is invalid" error.
+
+ This is a beta minor.
+
+ - 9dcc848: Add a deep `validateConfiguration` RPC to the health-check plugin so propose-time validation matches apply-time validation.
+
+ - `validateConfiguration` (`@checkstack/healthcheck-common`): a new mutation procedure gated by `healthcheck.healthcheck.manage`, taking a proposed configuration (reusing the create skeleton) and returning `{ valid, errors: [{ path, message }] }`, mirroring automation's `validateDefinition`. It persists nothing.
+ - Shared deep validation (`@checkstack/healthcheck-backend`): `collectConfigurationIssues` resolves strategy + collectors by fully-qualified id then migrate-then-validate-strict each config via `parseStrictAssumingV1`. The GitOps reconcile path is refactored to call the same `validateVersionedConfigStrict`, so create / gitops-apply / the new RPC share one implementation.
+ - `healthcheck.propose`'s dry-run (`@checkstack/ai-backend`) now calls `validateConfiguration` as its validation authority, so a wrong config type or a typo'd key surfaces at propose time, bringing it to the same deep-validate level `automation.propose` already has.
+
+ State and scale: no durable state; `validateConfiguration` is a pure read against the in-process registries plus zod validation, identical on every pod.
+
+ This is a beta minor.
+
+ ### Patch Changes
+
+ - 9dcc848: Input-validation and error-mapping hardening found by a fuzzing pass against the built container.
+
+ - backend: a Postgres driver error caused by bad client input no longer surfaces as a `500`. The `/api` and `/rest` dispatchers now map the relevant SQLSTATE classes to the correct status - `22P02`/`22003`/`22001`/`22007` (malformed/out-of-range/over-long/bad-date value), `23502`/`23503`/`23514` (missing/dangling/check-failed) to `400`, and `23505` (unique violation) to `409` - and log them at `warn` (client mistake), not `error`. The client-facing message is generic so column/constraint names are never leaked; genuine unknown faults still log at `error` and 500. Previously a `where id = $1` with a non-uuid `$1` (or an over-long string, or a foreign-key miss in `addSystemToGroup`) reached the driver and 500'd, making routine probing look like a server outage and burying real 500s.
+ - slo-common: **fixes a stored cluster-wide DoS.** `windowDays` was accepted up to `2^53`, but the SLO engine derives window boundaries with `Date(now - windowDays * 86_400_000)` - a large value overflows past the max representable `Date` and yields `Invalid Date`. That objective committed fine, then every subsequent read of the system's objectives threw `RangeError: Invalid time value` during serialization (a 500 readable by anyone with SLO read access, on any pod). `windowDays` is now bounded to 1..3650 days at the contract, the GitOps `kind: SLO` spec, and the update path via a single shared `SloWindowDaysSchema`, so the poison row can never be created.
+ - slo-common + healthcheck-common: SLO `getDailySnapshots` and the healthcheck history endpoints (`getHistory`, `getDetailedHistory`, `getAggregatedHistory`, `getDetailedAggregatedHistory`, `getRunsForAnalysis`) declared their `startDate`/`endDate` params as `z.date()`, which a `/rest/...` string param can never satisfy - so those endpoints 400'd on the entire REST surface. They now use `z.coerce.date()`, accepting both the REST string shape and the native RPC `Date`.
+ - healthcheck-common: `intervalSeconds` was `z.number().min(1)` with no `.int()` and no upper bound, so a fractional or out-of-range value reached the DB and failed at insert (the column is a 32-bit int). It is now `.int().min(1).max(2_592_000)` (1 second .. 30 days), applied to both create and update (the update schema is the create partial).
+ - catalog-common: system/group/environment names were bare `z.string()` (environment was `.min(1)` only), so empty, whitespace-only, and 100KB+ names reached the DB - the huge ones surfaced as 500s when parameter binding blew up. Names are now `trim().min(1).max(200)` via a shared schema.
+
+ **BREAKING:** `getSystemContacts` is now `userType: "authenticated"` (was `"public"`). System contacts carry PII (user id, name, email); the public read leaked them to anonymous status-page visitors. Anonymous callers now receive `401` for this one endpoint; the system detail page already renders "No contacts assigned" for anonymous viewers, so the UI degrades gracefully. All other catalog reads remain public.
+
+ - catalog-frontend: the system detail page skips the `getSystemContacts` request entirely for anonymous viewers (it would now `401`) and falls back to the empty state.
+
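Reader's note (not part of the package): the SQLSTATE mapping described in the backend bullet above can be sketched as a small lookup. The function name `mapDriverError`, the result shape, and the generic message strings are assumptions for illustration. Only the SQLSTATE codes, the resulting status codes, and the log levels come from the changelog.

```typescript
// Illustrative only: names and message text are assumptions, not the package's API.
interface MappedError {
  status: 400 | 409 | 500;
  logLevel: "warn" | "error";
  /** Generic on purpose, so column and constraint names never reach the client. */
  message: string;
}

// Bad client input: malformed, out-of-range, over-long, or bad-date values,
// plus not-null, foreign-key, and check-constraint violations.
const BAD_REQUEST_STATES = new Set([
  "22P02",
  "22003",
  "22001",
  "22007",
  "23502",
  "23503",
  "23514",
]);

// Unique violation.
const CONFLICT_STATES = new Set(["23505"]);

function mapDriverError(sqlState: string | undefined): MappedError {
  if (sqlState !== undefined && BAD_REQUEST_STATES.has(sqlState)) {
    return { status: 400, logLevel: "warn", message: "Invalid request" };
  }
  if (sqlState !== undefined && CONFLICT_STATES.has(sqlState)) {
    return { status: 409, logLevel: "warn", message: "Conflict" };
  }
  // Anything unrecognised is still treated as a genuine server fault.
  return { status: 500, logLevel: "error", message: "Internal server error" };
}
```

Keeping the unknown case at `error` and 500 matches the stated goal of not burying real server faults under routine client mistakes.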
+ This is a beta release: the breaking contact-visibility change ships as a minor bump per the beta versioning policy, not a major.
+
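Reader's note (not part of the package): the `windowDays` overflow described in the slo-common bullet of this changeset can be reproduced with plain `Date` arithmetic. The helper names below are hypothetical, and the integer requirement in `isValidWindowDays` is an assumption. The changelog states only the 1..3650 bound.

```typescript
// Illustrative only: helper names are assumptions; the bound 1..3650 is from the changelog.
const MS_PER_DAY = 86_400_000;

// Window start as the changelog describes it: now minus windowDays in milliseconds.
function windowStart(now: Date, windowDays: number): Date {
  return new Date(now.getTime() - windowDays * MS_PER_DAY);
}

// True when the Date holds a representable time value.
function isValidDate(date: Date): boolean {
  return !Number.isNaN(date.getTime());
}

// Assumed validator mirroring the 1..3650 bound (integer check is an assumption).
function isValidWindowDays(value: number): boolean {
  return Number.isInteger(value) && value >= 1 && value <= 3650;
}

const referenceNow = new Date("2026-01-01T00:00:00.000Z");

// A huge windowDays pushes the time value far outside the representable range,
// so the resulting Date is invalid and serializing it would throw a RangeError.
const poisoned = windowStart(referenceNow, 2 ** 53);

// The largest accepted value still produces a valid Date.
const bounded = windowStart(referenceNow, 3650);
```

Rejecting the value at the schema boundary means the invalid `Date` is never derived from stored data, which is why the fix targets the contract, the GitOps spec, and the update path together.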
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - @checkstack/notification-common@1.3.0
+ - @checkstack/catalog-common@2.3.0
+ - @checkstack/common@0.13.0
+ - @checkstack/signal-common@0.2.6
+
+ ## 1.4.0
+
+ ### Minor Changes
+
+ - 270ef29: Add the health-state provider data contract (automation sensing layer, Wave 2 Phase 13).
+
+ - New `health_check_state_transitions` table records every aggregate health-status transition for a system (all statuses, not just unhealthy), giving a reliable "in current status since" timestamp. Written wherever an aggregate transition is detected. Pruned with raw-run retention, but the single most-recent row per system is always kept so an active streak never blanks.
+ - New service-typed RPCs on `HealthCheckApi`: `getHealthState({ systemId, configurationId? })` returns `{ status, inStatusSince, inStatusForMs, latencyMs?, avgLatencyMs?, p95LatencyMs?, successRate?, lastRunAt?, inMaintenance, evaluatedAt }`, and `getBulkHealthState({ systemIds })` (POST) resolves many systems against one shared timestamp.
+ - New service-typed RPC on `MaintenanceApi`: `hasActiveMaintenance({ systemId })` reports whether a system is in an active maintenance window regardless of notification-suppression (suppression-agnostic), folded into `getHealthState` as `inMaintenance`.
+
+ All reads are fail-safe: a missing transition row yields `inStatusSince: null`, and a maintenance-plugin error fails open to `inMaintenance: false`.
+
+ - 270ef29: Add a windowed transition count to the health provider - the building block for custom flapping rules (Wave 2 Phase 18).
+
+ Flapping is already buildable today via the built-in `healthcheck.flapping_detected` trigger; this phase ships the GENERALIZATION for arbitrary "N status changes in M minutes" rules.
+
+ - `countStateTransitionsInWindow` counts aggregate status transitions for a system over a trailing window (from the Phase 13 `health_check_state_transitions` table - all statuses, generalizing the unhealthy-only flapping detector). Fail-safe to 0.
+ - `getHealthState` / `getBulkHealthState` now return `transitionsInWindow` + `transitionWindowMinutes`, and accept an optional `transitionWindowMinutes` input (default 60).
+ - The automation definition gains an optional top-level `state_window_minutes` (default 60), threaded through `enrichScopeWithState` so `health.system.transitions_in_window` / `health.system.transition_window_minutes` are folded into scope per evaluation.
+ - Operators author custom flapping as a `numeric_state` condition over `health.system.transitions_in_window` - no new condition variant, no editor change. The variable-scope resolver surfaces the new fields for autocomplete.
+
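Reader's note (not part of the package): the "N status changes in M minutes" rule described above reduces to counting transition timestamps inside a trailing window. The sketch below works on an in-memory array rather than the `health_check_state_transitions` table, and the names, the row shape, and the inclusive window boundaries are all assumptions.

```typescript
// Illustrative only: an in-memory stand-in for the windowed transition count.
interface StateTransition {
  systemId: string;
  /** When the aggregate status changed. */
  at: Date;
}

// Count one system's transitions in the trailing window ending at `now`.
// Default of 60 minutes mirrors the documented `transitionWindowMinutes` default.
function countTransitionsInWindow(
  transitions: StateTransition[],
  systemId: string,
  now: Date,
  windowMinutes: number = 60,
): number {
  const end = now.getTime();
  const start = end - windowMinutes * 60_000;
  let count = 0;
  for (const transition of transitions) {
    const time = transition.at.getTime();
    if (transition.systemId === systemId && time >= start && time <= end) {
      count += 1;
    }
  }
  return count;
}

// A custom flapping rule is then a numeric comparison on that count.
// Using ">=" here is an assumption about how an operator would phrase the rule.
function exceedsThreshold(transitionsInWindow: number, threshold: number): boolean {
  return transitionsInWindow >= threshold;
}
```

Because the count covers all statuses rather than only unhealthy transitions, the same primitive serves rules the built-in flapping trigger does not express.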
+ - 270ef29: Replace the hardcoded auto-incident path with default automations (Wave 2 Phase 20).
+
+ BREAKING CHANGES: Auto-incident is now automation-driven. The hardcoded background path that opened incidents on sustained-unhealthy / flapping and closed them after a cooldown (`auto-incident.ts`, `auto-incident-close-job.ts`) is removed. On upgrade, an idempotent, threshold-preserving migration seeds equivalent default automations from each assignment's existing `NotificationPolicy`, so alerting behaviour is preserved 1:1:
+
+ - `sustainedUnhealthyTrigger.durationMinutes` -> the `for:` dwell on a `healthcheck.system_degraded` trigger -> `incident.create`.
+ - auto-close `autoCloseAfterMinutes` -> a `wait_until` (healthy continuously for the cooldown) -> `incident.resolve`.
+ - `useNotificationSuppression` -> the incident's `suppressNotifications`.
+ - `skipDuringMaintenance` -> a `{{ !health.system.in_maintenance }}` pre-run condition.
+ - `flappingTrigger.{transitions,windowMinutes}` -> a second automation on the `healthcheck.flapping_detected` trigger -> `incident.create`.
+
+ Auto-incidents remain ONE OPEN INCIDENT PER SYSTEM, faithful to the old behaviour. `incident.create` gains an opt-in `dedupe_open_for_system` config flag (default false, so existing/custom automations are unaffected): when true, it reuses an existing open incident on the target system instead of opening a duplicate (the old `findActiveAutoIncident(systemId)` semantic), returning the reused incident as the produced `incident` artifact. The seeded default automations set this flag, so a system with several failing checks - sustained and/or flapping - still gets a single open incident; whichever check crosses its threshold first opens it, and the rest dedupe to it. Both sustained and flapping default automations open at `critical` severity (parity with the old path). Per-system run dedup within an automation uses `concurrency_scope: "context_key"` + `mode: "single"`.
+
+ Operators can read, edit, disable, and extend these automations (see the "Customise auto-incident" guide). Seeded automations are tagged via `managedBy` (`auto-incident:<systemId>:<configurationId>:<kind>`) so the migration is a no-op on re-runs; anything unmappable is recorded as a migration-failure row.
+
+ Flapping DETECTION (transition recording + the `healthcheck.flapping_detected` emit) is relocated into `flapping-detector.ts` and survives; the emit now fires unconditionally on a threshold cross (no longer gated on `autoOpenIncidentOnUnhealthy`), matching the hook's documented intent and required for the flapping default automation. The legacy `health_check_auto_incidents` mapping table is no longer written or read (it will be dropped in a follow-up migration); `health_check_unhealthy_transitions` is retained for the flapping detector.
+
+ New service-typed `HealthCheckApi.listAutoIncidentPolicies` RPC exposes each assignment's effective notification policy for the migration. `incident.create` adds the `dedupe_open_for_system` flag (additive, defaults off).
+
+ - b995afb: Move health-check flapping configuration from the per-assignment notification policy onto the `healthcheck.flapping_detected` automation trigger.
+
+ Flapping thresholds (`transitions`, `windowMinutes`) are now configured on the trigger itself, next to the automation that reacts to them, instead of on each check assignment. The health-check executor still owns the windowed transition counting (it writes `health_check_unhealthy_transitions` and runs the window query), but it now SOURCES the thresholds from the subscribed automations' trigger config:
+
+ - On a transition-to-unhealthy it records the transition unconditionally (keeping history warm), then looks up the enabled automations subscribed to `healthcheck.flapping_detected`, collects the distinct set of configured windows, counts transitions once per distinct window, and emits one `healthcheck.flapping_detected` per window. The trigger's exact-window `evaluateConfig` gate then fires each automation only for its own window and transition threshold.
+ - A missing or partial flapping trigger config defaults to `{ transitions: 3, windowMinutes: 60 }`, so automations created before the trigger carried config keep working unchanged.
+ - `automation-backend` exposes a new backend-only, read-only `automationSubscriptionsRef` service ref (`findEnabledByTriggerEvent`) so a plugin that owns a trigger's underlying event can discover its subscribers' trigger config. It is never browser-exposed.
+
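Reader's note (not part of the package): the default-filling and distinct-window steps described in the list above can be sketched as follows. All function names and the event shape are hypothetical. Only the default `{ transitions: 3, windowMinutes: 60 }` and the "one count per distinct window, exact-window gate" behaviour come from the changelog, and the `>=` comparison in the gate is an assumption.

```typescript
// Illustrative only: names and the event shape are assumptions.
interface FlappingConfig {
  transitions: number;
  windowMinutes: number;
}

const DEFAULT_FLAPPING: FlappingConfig = { transitions: 3, windowMinutes: 60 };

// A missing or partial trigger config falls back to the documented defaults.
function resolveFlappingConfig(
  raw?: Partial<FlappingConfig> | null,
): FlappingConfig {
  return {
    transitions: raw?.transitions ?? DEFAULT_FLAPPING.transitions,
    windowMinutes: raw?.windowMinutes ?? DEFAULT_FLAPPING.windowMinutes,
  };
}

// Collect the distinct windows across all subscribed automations, so the
// executor counts transitions once per window rather than once per automation.
function distinctWindows(
  configs: Array<Partial<FlappingConfig> | null | undefined>,
): number[] {
  const windows = new Set<number>();
  for (const config of configs) {
    windows.add(resolveFlappingConfig(config).windowMinutes);
  }
  return Array.from(windows).sort((a, b) => a - b);
}

// Exact-window gate: an automation reacts only to the event for its own window,
// and only when the counted transitions reach its threshold (">=" is assumed).
function shouldFire(
  config: FlappingConfig,
  event: { windowMinutes: number; transitions: number },
): boolean {
  return (
    event.windowMinutes === config.windowMinutes &&
    event.transitions >= config.transitions
  );
}
```

Deduplicating windows before counting keeps the number of window queries bounded by the number of distinct windows, not by the number of subscribed automations.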
+ **BREAKING CHANGES**
+
+ - The per-assignment `notificationPolicy.flappingTrigger` field is removed. `NotificationPolicy` is now `{ suppressDeEscalations }` only. Stored rows that still carry a `flappingTrigger` key parse cleanly - the key is stripped on read - so no data migration is required, but the per-check flapping toggle/threshold in the assignment Notifications tab is gone; configure flapping on the trigger instead.
+ - The GitOps `System.healthcheck[].notificationPolicy.flappingTrigger` field is removed. A `flappingTrigger` block in a manifest is ignored. Move the thresholds to the `transitions` / `windowMinutes` config of your `healthcheck.flapping_detected` automation trigger.
+ - The standalone `enabled` flag for flapping is gone: flapping is "enabled" precisely when at least one enabled automation subscribes to `healthcheck.flapping_detected`. With no subscriber, the transition is still recorded but nothing is counted or emitted.
+
+ - b995afb: Remove the legacy per-assignment auto-incident system. Auto-incidents are now built entirely by user-authored automations; nothing is seeded or hardcoded.
+
+ What was removed:
+
+ - The one-time migration that auto-seeded "sustained unhealthy" and "flapping" default automations from each assignment's notification policy, plus the `listAutoIncidentPolicies` RPC it consumed.
+ - The seeder-only notification-policy settings and their UI: `autoOpenIncidentOnUnhealthy`, `useNotificationSuppression`, `skipDuringMaintenance`, `sustainedUnhealthyTrigger`, and `autoCloseAfterMinutes`. The assignment **Notifications** tab now exposes only the two live settings: **Suppress de-escalation notifications** and the **flapping-detection** thresholds.
+ - The dead `health_check_auto_incidents` table (no longer written or read; dropped via migration).
+
+ What is preserved: flapping detection (`healthcheck.flapping_detected`) and de-escalation suppression are unchanged. The `flappingTrigger` and `suppressDeEscalations` policy fields stay exactly as before.
+
+ > [!NOTE]
+ > One-time cleanup: an automation-backend migration deletes the historically auto-seeded incident automations (`managed_by LIKE 'auto-incident:%'`) from existing databases. This is intentional and destructive - those automations were no longer managed by anything. If you had edited a seeded automation and want to keep it, re-create it as a normal automation before upgrading. See the "Build auto-incident automations" guide for templates.
+
+ > [!IMPORTANT]
+ > NARROWING: `NotificationPolicySchema` is narrowed to `{ suppressDeEscalations, flappingTrigger }`. Stored rows that still carry the removed legacy keys parse cleanly - zod strips the unknown keys on read - so no data migration is required for the `system_health_checks.notification_policy` column. GitOps `notificationPolicy` specs that set the removed fields are no longer accepted for those keys.
+
+ - 270ef29: Extend in-UI script testing to health-check collectors, and add
+ load-from-run replay for automation script tests.
+
+ - Health-check collectors: a new `testCollectorScript` RPC runs the
+ inline-script (TypeScript) collector and the shell `script` collector
+ against an editable, auto-seeded sample context using the same
+ sandboxed runner the real collector uses. Surfaces beneath the
+ collector script fields in the collector editor (both marked
+ `x-script-testable`). Gated by `healthcheck.configuration.manage`.
+ - Automation replay: a new `getRunScopeForReplay` RPC reconstructs an
+ editable test context from a real run (trigger + persisted artifacts,
+ plus the durable scope snapshot when the run is still in-flight), and
+ the script-test panel gains a "Load from run" picker that seeds the
+ sample context from a past run.
+
+ Note: health-check executions do not persist the script / config /
+ check / system that produced a result, so there is no health-check
+ replay - auto-seed is the only context source for collector tests. This
+ is by design; see the feature plan.
+
+ - 270ef29: Secrets platform Phase 2: secret -> env-var mapping with central resolve, inject, and mask.
+
+ - Script consumers declare a least-privilege `secretEnv` allowlist
+ (`{ ENV_NAME: "${{ secrets.NAME }}" }`). The automation `run_script` /
+ `run_shell` actions resolve ONLY the declared secrets via
+ `secretResolverRef.resolveForRun`, inject them into the runner env for
+ that run (memory-only; the ESM runner gained a per-run `env` option), and
+ mask their values out of stdout/stderr/result/error via the run-scoped
+ masking context. A missing required secret fails the run clearly. No
+ ambient secret access.
+ - Test panel: `testScript` / `testCollectorScript` inject named
+ `__SECRET_<NAME>__` placeholders by default, or user-supplied per-secret
+ overrides; real production values are never resolved in the test path,
+ and overrides are masked out of the result.
+ - Healthcheck collectors carry the `secretEnv` field for authoring +
+ the test panel; runtime injection on satellites lands in Phase 3.
+ - Editor UX: a new `@checkstack/ui` `SecretEnvEditor` renders `x-secret-env`
+ record fields with `${{ secrets.* }}` name autocomplete (from
+ `listSecretNames`), wired into the automation action editor and the
+ healthcheck collector editor. New `withConfigMeta` helper +
+ `x-secret-env` config-meta key in `@checkstack/backend-api`.
+
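Reader's note (not part of the package): the resolve, inject, and mask flow for a `secretEnv` allowlist can be sketched in a few lines. This is not the `secretResolverRef.resolveForRun` implementation. The function names, the in-memory secret store, the accepted secret-name grammar, and the `***` mask token are all assumptions made for illustration.

```typescript
// Illustrative only: a stand-in for the central resolve + mask steps.
// Matches a whole-string reference such as "${{ secrets.API_TOKEN }}".
const SECRET_REFERENCE = /^\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}$/;

// Resolve ONLY the secrets named in the allowlist; anything else in the
// store stays invisible to the run. A missing secret fails loudly.
function resolveSecretEnv(
  secretEnv: Record<string, string>,
  store: Record<string, string>,
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [envName, reference] of Object.entries(secretEnv)) {
    const name = SECRET_REFERENCE.exec(reference)?.[1];
    if (name === undefined) {
      throw new Error("secretEnv." + envName + " is not a secrets reference");
    }
    const value = store[name];
    if (value === undefined) {
      throw new Error("Missing required secret: " + name);
    }
    env[envName] = value;
  }
  return env;
}

// Replace every occurrence of a resolved secret value in run output.
function maskSecrets(text: string, env: Record<string, string>): string {
  let masked = text;
  for (const value of Object.values(env)) {
    if (value.length > 0) masked = masked.split(value).join("***");
  }
  return masked;
}
```

Resolving from an explicit allowlist rather than exposing the whole store is what gives the "no ambient secret access" property the changelog describes.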
  ## 1.3.0

  ### Minor Changes
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@checkstack/healthcheck-common",
-   "version": "1.3.0",
+   "version": "1.5.0",
    "license": "Elastic-2.0",
    "type": "module",
    "exports": {
@@ -9,16 +9,16 @@
      }
    },
    "dependencies": {
-     "@checkstack/common": "0.11.0",
-     "@checkstack/catalog-common": "2.2.2",
-     "@checkstack/notification-common": "1.2.0",
-     "@checkstack/signal-common": "0.2.4",
+     "@checkstack/common": "0.12.0",
+     "@checkstack/catalog-common": "2.2.3",
+     "@checkstack/notification-common": "1.2.1",
+     "@checkstack/signal-common": "0.2.5",
      "zod": "^4.2.1"
    },
    "devDependencies": {
      "typescript": "^5.7.2",
      "@checkstack/tsconfig": "0.0.7",
-     "@checkstack/scripts": "0.3.3"
+     "@checkstack/scripts": "0.3.4"
    },
    "scripts": {
      "typecheck": "tsgo -b",
package/src/index.ts CHANGED
@@ -84,3 +84,24 @@ export const SYSTEM_STATUS_CHANGED = createSignal({
      newStatus: z.enum(["healthy", "degraded", "unhealthy"]),
    }),
  });
+
+ /**
+  * Broadcast when the executor FAILED to resolve a system's environments from
+  * the catalog at run time and DEGRADED to a single env-less run (fail-open).
+  *
+  * This is the durable-misconfig / catalog-outage observability signal: a
+  * `logger.warn` alone is easy to miss, so this counter-style signal makes the
+  * degradation observable (dashboards / alerts can count it). The check still
+  * runs (env-less) — this signals that per-environment fan-out was skipped for
+  * this tick, NOT that the check failed.
+  */
+ export const ENVIRONMENT_RESOLUTION_FAILED = createSignal({
+   pluginMetadata,
+   event: "environment.resolution_failed",
+   payloadSchema: z.object({
+     systemId: z.string(),
+     configurationId: z.string(),
+     /** The error message that caused the fall-back to an env-less run. */
+     error: z.string(),
+   }),
+ });
@@ -0,0 +1,38 @@
+ import { describe, it, expect } from "bun:test";
+ import {
+   NotificationPolicySchema,
+   DEFAULT_NOTIFICATION_POLICY,
+ } from "./schemas";
+
+ describe("NotificationPolicySchema", () => {
+   it("accepts the slim policy shape", () => {
+     const parsed = NotificationPolicySchema.parse({
+       suppressDeEscalations: true,
+     });
+     expect(parsed).toEqual({ suppressDeEscalations: true });
+   });
+
+   it("defaults to the compile-time default when parsing an empty object", () => {
+     const parsed = NotificationPolicySchema.parse({});
+     expect(parsed).toEqual(DEFAULT_NOTIFICATION_POLICY);
+     expect(parsed).toEqual({ suppressDeEscalations: false });
+   });
+
+   it("strips removed legacy keys (auto-incident AND flapping) without throwing", () => {
+     // A row persisted before the legacy auto-incident fields and the
+     // flapping thresholds were removed still carries the larger object.
+     // zod's default object behaviour drops the unknown keys rather than
+     // rejecting them — flapping now lives on the automation trigger config.
+     const parsed = NotificationPolicySchema.parse({
+       suppressDeEscalations: true,
+       flappingTrigger: { enabled: false, transitions: 9, windowMinutes: 10 },
+       autoOpenIncidentOnUnhealthy: true,
+       useNotificationSuppression: true,
+       skipDuringMaintenance: false,
+       sustainedUnhealthyTrigger: { enabled: true, durationMinutes: 5 },
+       autoCloseAfterMinutes: 99,
+     });
+     expect(parsed).toEqual({ suppressDeEscalations: true });
+     expect(Object.keys(parsed)).toEqual(["suppressDeEscalations"]);
+   });
+ });
@@ -0,0 +1,44 @@
+ import { describe, expect, test } from "bun:test";
+ import type { ZodType } from "zod";
+ import { healthCheckContract } from "./rpc-contract";
+
+ /**
+ * Guards the REST-compatibility fix: history date params were `z.date()`, which
+ * a `/rest/...` string param can never satisfy, so every REST history call
+ * 400'd. `z.coerce.date()` accepts both the REST string shape and the native RPC
+ * Date shape.
+ */
+ function inputSchemaFor(procName: keyof typeof healthCheckContract): ZodType {
+ const proc = healthCheckContract[procName] as unknown as Record<
+ string,
+ unknown
+ >;
+ const orpc = proc["~orpc"] as { inputSchema?: ZodType };
+ if (!orpc.inputSchema) throw new Error(`${String(procName)} has no input`);
+ return orpc.inputSchema;
+ }
+
+ describe("history endpoints coerce string date params (REST compatibility)", () => {
+ test("getAggregatedHistory accepts ISO date strings", () => {
+ const parsed = inputSchemaFor("getAggregatedHistory").safeParse({
+ systemId: "sys-1",
+ configurationId: "cfg-1",
+ startDate: "2026-01-01T00:00:00.000Z",
+ endDate: "2026-02-01T00:00:00.000Z",
+ });
+ expect(parsed.success).toBe(true);
+ if (parsed.success) {
+ const data = parsed.data as { startDate: Date };
+ expect(data.startDate).toBeInstanceOf(Date);
+ }
+ });
+
+ test("getHistory accepts ISO date strings on its optional date params", () => {
+ const parsed = inputSchemaFor("getHistory").safeParse({
+ systemId: "sys-1",
+ startDate: "2026-01-01T00:00:00.000Z",
+ sortOrder: "desc",
+ });
+ expect(parsed.success).toBe(true);
+ });
+ });
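Why the coercing schema accepts both shapes can be shown without zod. Coercion amounts to passing the input through `new Date(...)` and then rejecting an invalid result. The sketch below imitates that; `coerceDate` is a hypothetical helper, not part of the package.

```typescript
// Hypothetical helper imitating date coercion: wrap the input in
// `new Date(...)`, then reject it if the result is an invalid date.
function coerceDate(input: string | number | Date): Date {
  const date = new Date(input);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Invalid date: ${String(input)}`);
  }
  return date;
}

// REST shape: the param arrives as an ISO string.
const fromRest = coerceDate("2026-01-01T00:00:00.000Z");
// Native RPC shape: the param is already a Date.
const fromRpc = coerceDate(new Date(Date.UTC(2026, 0, 1)));
```

Both calls yield the same instant, which is what lets one schema serve the REST and RPC transports.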
@@ -8,6 +8,8 @@ import {
  HealthCheckConfigurationSchema,
  CreateHealthCheckConfigurationSchema,
  UpdateHealthCheckConfigurationSchema,
+ ValidateConfigurationInputSchema,
+ ValidateConfigurationResultSchema,
  AssociateHealthCheckSchema,
  HealthCheckRunSchema,
  HealthCheckRunPublicSchema,
@@ -40,6 +42,82 @@ export type SystemHealthStatusResponse = z.infer<
  typeof SystemHealthStatusResponseSchema
  >;
  
+ /**
+ * Live health-state snapshot used by the automation sensing layer.
+ * Service-typed (backend-to-backend). `inStatusSince` is null when no
+ * transition has been recorded; `inStatusForMs` is 0 in that case.
+ */
+ const HealthStateSchema = z.object({
+ status: HealthCheckStatusSchema,
+ inStatusSince: z.date().nullable(),
+ inStatusForMs: z.number(),
+ latencyMs: z.number().optional(),
+ avgLatencyMs: z.number().optional(),
+ p95LatencyMs: z.number().optional(),
+ successRate: z.number().optional(),
+ lastRunAt: z.date().optional(),
+ inMaintenance: z.boolean(),
+ /** Count of aggregate status transitions in the trailing window (flapping). */
+ transitionsInWindow: z.number(),
+ /** The window (minutes) `transitionsInWindow` was counted over. */
+ transitionWindowMinutes: z.number(),
+ evaluatedAt: z.date(),
+ });
+
+ export type HealthStateResponse = z.infer<typeof HealthStateSchema>;
+
+ // --- Collector script testing (in-UI) ---
+
+ /**
+ * Curated check/system metadata a collector script reads. Every part is
+ * optional so a partial sample still runs.
+ */
+ const CollectorTestRunContextSchema = z.object({
+ check: z
+ .object({
+ id: z.string(),
+ name: z.string(),
+ intervalSeconds: z.number().int(),
+ })
+ .optional(),
+ system: z.object({ id: z.string(), name: z.string() }).optional(),
+ });
+
+ export const CollectorScriptTestInputSchema = z.object({
+ kind: z.enum(["typescript", "shell"]),
+ script: z.string(),
+ config: z.record(z.string(), z.unknown()).optional(),
+ env: z.record(z.string(), z.string()).optional(),
+ /**
+ * The collector's declared secret -> env mapping. The test panel NEVER
+ * resolves real secret values: each declared env var gets a
+ * `__SECRET_<NAME>__` placeholder by default, or the user override below
+ * (decision 4).
+ */
+ secretEnv: z.record(z.string(), z.string()).optional(),
+ /** User-supplied per-secret-NAME override values, masked out of the result. */
+ secretOverrides: z.record(z.string(), z.string()).optional(),
+ workingDirectory: z.string().optional(),
+ runContext: CollectorTestRunContextSchema.optional(),
+ timeoutMs: z.number().int().min(100).max(300_000).default(30_000),
+ });
+ export type CollectorScriptTestInputDto = z.infer<
+ typeof CollectorScriptTestInputSchema
+ >;
+
+ export const CollectorScriptTestResultSchema = z.object({
+ result: z.unknown().optional(),
+ stdout: z.string(),
+ stderr: z.string(),
+ exitCode: z.number().int().optional(),
+ durationMs: z.number().int().nonnegative(),
+ timedOut: z.boolean(),
+ error: z.string().optional(),
+ });
+ export type CollectorScriptTestResultDto = z.infer<
+ typeof CollectorScriptTestResultSchema
+ >;
+
  // Health Check RPC Contract using oRPC's contract-first pattern
  export const healthCheckContract = {
  // ==========================================================================
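The placeholder rule described on `secretEnv` can be sketched as a small pure function. This is an illustration under stated assumptions, not the backend's implementation: `buildTestSecretEnv` is a hypothetical name, and the sketch assumes `secretEnv` maps secret name to env var name and that `<NAME>` in the placeholder is the secret name. The diff does not spell out either detail.

```typescript
// Assumption: keys of `secretEnv` are secret NAMES, values are env var names.
// Assumption: `<NAME>` in `__SECRET_<NAME>__` is the secret name.
function buildTestSecretEnv(
  secretEnv: Record<string, string>,
  secretOverrides: Record<string, string> = {},
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [secretName, envVar] of Object.entries(secretEnv)) {
    const hasOverride = Object.prototype.hasOwnProperty.call(
      secretOverrides,
      secretName,
    );
    // Real secret values are never looked up: use the override or a placeholder.
    env[envVar] = hasOverride
      ? secretOverrides[secretName]
      : `__SECRET_${secretName}__`;
  }
  return env;
}

const testEnv = buildTestSecretEnv(
  { API_TOKEN: "TOKEN", DB_PASSWORD: "PGPASSWORD" },
  { DB_PASSWORD: "local-dev-password" },
);
```

Under these assumptions a script under test sees a recognisable placeholder for every secret the operator did not override, so a test run cannot leak a stored secret.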
@@ -60,6 +138,24 @@ export const healthCheckContract = {
  .input(z.object({ strategyId: z.string() }))
  .output(z.array(CollectorDtoSchema)),
  
+ /**
+ * Run a collector script (inline-script TS or the shell `script`
+ * collector) against an editable sample context, using the same
+ * sandboxed runner the real collector uses. Lets operators test a
+ * collector script in the editor without scheduling a real execution.
+ *
+ * Gated by `configuration.manage` because authoring a collector script
+ * already executes code on the central backend - same privilege. The
+ * run is central-only and time-bounded.
+ */
+ testCollectorScript: proc({
+ operationType: "mutation",
+ userType: "authenticated",
+ access: [healthCheckAccess.configuration.manage],
+ })
+ .input(CollectorScriptTestInputSchema)
+ .output(CollectorScriptTestResultSchema),
+
  // ==========================================================================
  // CONFIGURATION MANAGEMENT (userType: "authenticated")
  // ==========================================================================
@@ -88,6 +184,21 @@ export const healthCheckContract = {
  .input(CreateHealthCheckConfigurationSchema)
  .output(HealthCheckConfigurationSchema),
  
+ /**
+ * Deep-validate a proposed health-check configuration WITHOUT persisting it.
+ * Runs the SAME strategy/collector resolution + migrate-then-validate-strict
+ * logic the create / gitops-apply path uses, so propose-time errors match
+ * apply-time errors. Gated by `configuration.manage` (the privilege the
+ * create form requires); the mirror of automation's `validateDefinition`.
+ */
+ validateConfiguration: proc({
+ operationType: "mutation",
+ userType: "authenticated",
+ access: [healthCheckAccess.configuration.manage],
+ })
+ .input(ValidateConfigurationInputSchema)
+ .output(ValidateConfigurationResultSchema),
+
  updateConfiguration: proc({
  operationType: "mutation",
  userType: "authenticated",
@@ -154,6 +265,11 @@ export const healthCheckContract = {
  stateThresholds: StateThresholdsSchema.optional(),
  /** IDs of satellites assigned to execute this health check */
  satelliteIds: z.array(z.string()).optional(),
+ /**
+ * Per-assignment environment selector. null = all current
+ * environments; [] = opt out (env-less); non-empty = those ids.
+ */
+ environmentIds: z.array(z.string()).nullable().optional(),
  /** Whether to also run this check locally on the core (default: true) */
  includeLocal: z.boolean(),
  /** Per-association notification policy (omitted = platform defaults) */
@@ -262,8 +378,8 @@ export const healthCheckContract = {
  z.object({
  systemId: z.string().optional(),
  configurationId: z.string().optional(),
- startDate: z.date().optional(),
- endDate: z.date().optional(),
+ startDate: z.coerce.date().optional(),
+ endDate: z.coerce.date().optional(),
  /** Filter by source: "local" = core only, satellite UUID = specific satellite, undefined = all */
  sourceFilter: z.string().optional(),
  /** Restrict runs to the listed statuses. Omitted/empty = no filter. */
@@ -289,8 +405,8 @@ export const healthCheckContract = {
  z.object({
  systemId: z.string().optional(),
  configurationId: z.string().optional(),
- startDate: z.date().optional(),
- endDate: z.date().optional(),
+ startDate: z.coerce.date().optional(),
+ endDate: z.coerce.date().optional(),
  /** Filter by source: "local" = core only, satellite UUID = specific satellite, undefined = all */
  sourceFilter: z.string().optional(),
  /** Restrict runs to the listed statuses. Omitted/empty = no filter. */
@@ -328,8 +444,8 @@ export const healthCheckContract = {
  z.object({
  systemId: z.string(),
  configurationId: z.string(),
- startDate: z.date(),
- endDate: z.date(),
+ startDate: z.coerce.date(),
+ endDate: z.coerce.date(),
  /** Target number of data points (default: 500). Bucket interval is calculated as (endDate - startDate) / targetPoints */
  targetPoints: z.number().min(10).max(2000).default(500),
  }),
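The bucket-interval formula in the `targetPoints` comment can be worked through with the same dates the new test uses. The sketch is illustrative only: `bucketIntervalMs` is a hypothetical helper, and returning milliseconds is an assumption, since the diff does not state the unit the backend uses.

```typescript
// Hypothetical helper: (endDate - startDate) / targetPoints, in milliseconds.
// The 10..2000 bounds and the default of 500 come from the schema in the diff.
function bucketIntervalMs(
  startDate: Date,
  endDate: Date,
  targetPoints = 500,
): number {
  if (targetPoints < 10 || targetPoints > 2000) {
    throw new RangeError("targetPoints must be between 10 and 2000");
  }
  return (endDate.getTime() - startDate.getTime()) / targetPoints;
}

// January 2026 is 31 days, i.e. 2_678_400_000 ms, split into 500 buckets.
const januaryBucketMs = bucketIntervalMs(
  new Date("2026-01-01T00:00:00.000Z"),
  new Date("2026-02-01T00:00:00.000Z"),
);
const januaryBucketMinutes = januaryBucketMs / 60_000;
```

For a one-month range at the default resolution, each point therefore covers roughly an hour and a half of runs.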
@@ -351,8 +467,8 @@ export const healthCheckContract = {
  z.object({
  systemId: z.string(),
  configurationId: z.string(),
- startDate: z.date(),
- endDate: z.date(),
+ startDate: z.coerce.date(),
+ endDate: z.coerce.date(),
  /** Filter by source: "local" = core only, satellite UUID = specific satellite, undefined = all */
  sourceFilter: z.string().optional(),
  /** Target number of data points (default: 500). Bucket interval is calculated as (endDate - startDate) / targetPoints */
@@ -482,6 +598,46 @@ export const healthCheckContract = {
  )
  .output(z.void()),
  
+ /**
+ * Live health-state snapshot for a single system (Wave-2 sensing
+ * contract). Returns status, in-status duration, latency, windowed
+ * metrics, and suppression-agnostic maintenance state.
+ */
+ getHealthState: proc({
+ operationType: "query",
+ userType: "service",
+ access: [],
+ })
+ .input(
+ z.object({
+ systemId: z.string(),
+ configurationId: z.string().optional(),
+ /** Trailing window (minutes) for `transitionsInWindow`. Default 60. */
+ transitionWindowMinutes: z.number().int().min(1).optional(),
+ }),
+ )
+ .output(HealthStateSchema),
+
+ /**
+ * Bulk variant of {@link getHealthState}. POST to avoid N+1 from
+ * dashboards and multi-system automation rules; all systems share one
+ * evaluation timestamp.
+ */
+ getBulkHealthState: proc({
+ operationType: "query",
+ userType: "service",
+ access: [],
+ })
+ .route({ method: "POST" })
+ .input(
+ z.object({
+ systemIds: z.array(z.string()),
+ /** Trailing window (minutes) for `transitionsInWindow`. Default 60. */
+ transitionWindowMinutes: z.number().int().min(1).optional(),
+ }),
+ )
+ .output(z.object({ states: z.record(z.string(), HealthStateSchema) })),
+
  getRunsForAnalysis: proc({
  operationType: "query",
  userType: "service",
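Two rules from the health-state contract lend themselves to a short sketch: `inStatusForMs` is 0 when `inStatusSince` is null, and the bulk variant evaluates every system against one shared timestamp. The code below is an assumption-laden illustration, not the backend's logic. `statusClock` and `bulkStatusClocks` are hypothetical names, and deriving the duration as `evaluatedAt - inStatusSince` is an assumption the diff does not confirm.

```typescript
interface StatusClock {
  inStatusSince: Date | null;
  inStatusForMs: number;
  evaluatedAt: Date;
}

// Assumption: duration in the current status is evaluatedAt - inStatusSince.
function statusClock(inStatusSince: Date | null, evaluatedAt: Date): StatusClock {
  const inStatusForMs =
    inStatusSince === null
      ? 0 // no transition recorded yet
      : Math.max(0, evaluatedAt.getTime() - inStatusSince.getTime());
  return { inStatusSince, inStatusForMs, evaluatedAt };
}

// Bulk shape: one record keyed by system id, one evaluation timestamp for all.
function bulkStatusClocks(
  lastTransitions: Record<string, Date | null>,
  evaluatedAt: Date,
): Record<string, StatusClock> {
  const states: Record<string, StatusClock> = {};
  for (const [systemId, since] of Object.entries(lastTransitions)) {
    states[systemId] = statusClock(since, evaluatedAt);
  }
  return states;
}

const now = new Date("2026-01-01T00:05:00.000Z");
const clocks = bulkStatusClocks(
  { "sys-1": new Date("2026-01-01T00:00:00.000Z"), "sys-2": null },
  now,
);
```

Sharing the timestamp keeps durations comparable across systems in a single automation rule evaluation.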
@@ -489,7 +645,7 @@ export const healthCheckContract = {
  })
  .input(
  z.object({
- startDate: z.date(),
+ startDate: z.coerce.date(),
  limitPerAssignment: z.number().optional().default(200),
  }),
  )
package/src/schemas.ts CHANGED
@@ -97,7 +97,9 @@ export const CreateHealthCheckConfigurationSchema = z.object({
  name: z.string().min(1),
  strategyId: z.string().min(1),
  config: z.record(z.string(), z.unknown()),
- intervalSeconds: z.number().min(1),
+ // Bounded: a non-integer or out-of-range value previously reached the DB and
+ // failed at insert (the column is a 32-bit int). 1 second .. 30 days.
+ intervalSeconds: z.number().int().min(1).max(2_592_000),
  /** Optional collector configurations */
  collectors: z.array(CollectorConfigEntrySchema).optional(),
  });
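The new upper bound is easy to verify by hand: 30 days is 30 × 24 × 60 × 60 = 2,592,000 seconds, comfortably below the signed 32-bit maximum of 2,147,483,647. The sketch below restates the rule as a plain predicate; `isValidIntervalSeconds` and the constant names are hypothetical and stand in for the zod chain, they are not part of the package.

```typescript
// 30 days expressed in seconds; matches the `.max(2_592_000)` in the diff.
const MAX_INTERVAL_SECONDS = 30 * 24 * 60 * 60;
// Largest value a signed 32-bit integer column can hold.
const INT32_MAX = 2 ** 31 - 1;

// Hypothetical predicate equivalent to `.int().min(1).max(2_592_000)`.
function isValidIntervalSeconds(value: number): boolean {
  return (
    Number.isInteger(value) && value >= 1 && value <= MAX_INTERVAL_SECONDS
  );
}

const accepted = [1, 60, MAX_INTERVAL_SECONDS].every(isValidIntervalSeconds);
const rejected = [0, 1.5, MAX_INTERVAL_SECONDS + 1, INT32_MAX + 1].some(
  isValidIntervalSeconds,
);
```

Rejecting these values at the schema boundary turns what used to be a database insert failure into an ordinary validation error.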
@@ -106,6 +108,39 @@ export type CreateHealthCheckConfiguration = z.infer<
  typeof CreateHealthCheckConfigurationSchema
  >;
  
+ /**
+ * Input for the `validateConfiguration` RPC: a proposed (not-yet-persisted)
+ * health-check configuration. Reuses the create skeleton so the same
+ * name/strategyId/config/intervalSeconds/collectors shape is validated at
+ * propose time as is at apply time.
+ */
+ export const ValidateConfigurationInputSchema =
+ CreateHealthCheckConfigurationSchema;
+
+ export type ValidateConfigurationInput = z.infer<
+ typeof ValidateConfigurationInputSchema
+ >;
+
+ /**
+ * Result of `validateConfiguration`: `valid` plus a flat list of structured
+ * issues. `path`s are dot-joinable for display, e.g. `config.url` or
+ * `collectors.0.config.path`. Mirrors automation's `validateDefinition`
+ * result so consumers (the AI propose tool, the UI) handle both identically.
+ */
+ export const ValidateConfigurationResultSchema = z.object({
+ valid: z.boolean(),
+ errors: z.array(
+ z.object({
+ path: z.array(z.union([z.string(), z.number()])),
+ message: z.string(),
+ }),
+ ),
+ });
+
+ export type ValidateConfigurationResult = z.infer<
+ typeof ValidateConfigurationResultSchema
+ >;
+
  export const UpdateHealthCheckConfigurationSchema =
  CreateHealthCheckConfigurationSchema.partial();
  
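A consumer of the validation result might render each issue by dot-joining its `path`, as the doc comment suggests. The sketch below shows one way to do that. `formatIssue`, the `ValidationIssue` interface, the `(root)` label for an empty path, and the sample messages are all hypothetical; only the `path`/`message` shape comes from the diff.

```typescript
// Mirrors the element shape of `errors` in ValidateConfigurationResultSchema.
interface ValidationIssue {
  path: Array<string | number>;
  message: string;
}

// Hypothetical display helper: "collectors.0.config.path: <message>".
function formatIssue(issue: ValidationIssue): string {
  const where = issue.path.length > 0 ? issue.path.join(".") : "(root)";
  return `${where}: ${issue.message}`;
}

// Sample issues invented for illustration.
const rendered = [
  { path: ["config", "url"], message: "Invalid URL" },
  { path: ["collectors", 0, "config", "path"], message: "Required" },
].map(formatIssue);
```

Keeping `path` as an array in the contract leaves the choice of separator and styling to each consumer.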
@@ -209,54 +244,20 @@ export const DEFAULT_STATE_THRESHOLDS: StateThresholds = {
  
  // --- Notification Policy ---
  
- /**
- * Trigger that opens an auto-incident after a check has been
- * continuously `unhealthy` for at least `durationMinutes`. Resets if
- * the check recovers to non-unhealthy in between.
- */
- export const SustainedUnhealthyTriggerSchema = z.object({
- /** When false, this trigger is fully disabled. */
- enabled: z.boolean().default(true),
- /** Minimum continuous-unhealthy time before opening. */
- durationMinutes: z.number().int().min(1).default(30),
- });
-
- export type SustainedUnhealthyTrigger = z.infer<
- typeof SustainedUnhealthyTriggerSchema
- >;
-
- /**
- * Trigger that opens an auto-incident when a check has transitioned
- * to `unhealthy` at least `transitions` times within `windowMinutes`.
- * Catches flapping that the sustained-duration trigger would miss
- * because each unhealthy phase is too short.
- */
- export const FlappingTriggerSchema = z.object({
- /** When false, this trigger is fully disabled. */
- enabled: z.boolean().default(true),
- /** Minimum number of transitions-to-unhealthy needed in the window. */
- transitions: z.number().int().min(1).default(3),
- /** Sliding window in minutes the transitions are counted over. */
- windowMinutes: z.number().int().min(1).default(60),
- });
-
- export type FlappingTrigger = z.infer<typeof FlappingTriggerSchema>;
-
- export const DEFAULT_SUSTAINED_TRIGGER: SustainedUnhealthyTrigger = {
- enabled: true,
- durationMinutes: 30,
- };
-
- export const DEFAULT_FLAPPING_TRIGGER: FlappingTrigger = {
- enabled: true,
- transitions: 3,
- windowMinutes: 60,
- };
-
  /**
  * Per-association notification preferences. All fields are evaluated
  * per (system, configuration) — different checks on the same system
  * are fully independent.
+ *
+ * The schema strips unknown keys (zod's default object behaviour), so
+ * rows persisted before the legacy auto-incident fields were removed
+ * (`autoOpenIncidentOnUnhealthy`, `useNotificationSuppression`,
+ * `skipDuringMaintenance`, `sustainedUnhealthyTrigger`,
+ * `autoCloseAfterMinutes`) parse cleanly — the dead keys are dropped.
+ * Flapping thresholds also moved OUT of this policy onto the automation
+ * engine's windowed-count gate (the `system_health_changed` trigger's
+ * `window` block), so a persisted `flappingTrigger` key is likewise stripped
+ * on read.
  */
  export const NotificationPolicySchema = z.object({
  /**
@@ -265,61 +266,12 @@ export const NotificationPolicySchema = z.object({
  * still notify.
  */
  suppressDeEscalations: z.boolean().default(false),
- /**
- * When true, the configured triggers can open auto-managed incidents
- * on the system. Setting this to false disables both triggers
- * regardless of their individual `enabled` flags.
- */
- autoOpenIncidentOnUnhealthy: z.boolean().default(true),
- /**
- * When true, the auto-opened incident is created with
- * `suppressNotifications` enabled so further health-state
- * notifications for the system are silenced until the incident is
- * resolved. Only meaningful when `autoOpenIncidentOnUnhealthy` is on.
- */
- useNotificationSuppression: z.boolean().default(true),
- /**
- * When true, no auto-incident is opened while the system has an
- * active maintenance window with notification suppression. The
- * system is intentionally down and shouldn't trip the on-call.
- */
- skipDuringMaintenance: z.boolean().default(true),
- /**
- * Trigger A: "this check has been unhealthy for X minutes
- * continuously." Catches real outages.
- */
- sustainedUnhealthyTrigger: SustainedUnhealthyTriggerSchema.default(
- DEFAULT_SUSTAINED_TRIGGER,
- ),
- /**
- * Trigger B: "this check transitioned to unhealthy N times in M
- * minutes." Catches persistent flapping where no individual
- * unhealthy phase is long enough for the sustained trigger.
- */
- flappingTrigger: FlappingTriggerSchema.default(DEFAULT_FLAPPING_TRIGGER),
- /**
- * Minutes of sustained healthy state required before an auto-opened
- * incident is auto-closed. `null` disables auto-close — the
- * incident stays open until an operator resolves it manually.
- */
- autoCloseAfterMinutes: z
- .number()
- .int()
- .min(1)
- .nullable()
- .default(30),
  });
  
  export type NotificationPolicy = z.infer<typeof NotificationPolicySchema>;
  
  export const DEFAULT_NOTIFICATION_POLICY: NotificationPolicy = {
  suppressDeEscalations: false,
- autoOpenIncidentOnUnhealthy: true,
- useNotificationSuppression: true,
- skipDuringMaintenance: true,
- sustainedUnhealthyTrigger: DEFAULT_SUSTAINED_TRIGGER,
- flappingTrigger: DEFAULT_FLAPPING_TRIGGER,
- autoCloseAfterMinutes: 30,
  };
  
  export const AssociateHealthCheckSchema = z.object({
@@ -328,6 +280,14 @@ export const AssociateHealthCheckSchema = z.object({
  stateThresholds: StateThresholdsSchema.optional(),
  /** IDs of satellites assigned to execute this health check */
  satelliteIds: z.array(z.string()).optional(),
+ /**
+ * Per-assignment environment selector for per-environment fan-out.
+ * `null`/omitted = all environments the system currently belongs to;
+ * non-empty array = exactly those (intersected with current membership);
+ * empty array `[]` = opt out (run once with no environment). `null` and
+ * `[]` are semantically distinct.
+ */
+ environmentIds: z.array(z.string()).nullable().optional(),
  /** Whether to also run this check locally on the core instance (default: true) */
  includeLocal: z.boolean().default(true),
  /** Per-association notification policy. Defaults applied when omitted. */
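The three-way meaning of `environmentIds` is easy to get wrong, so a small resolver helps pin it down. This is a sketch of the documented semantics, not the scheduler's code: `resolveEnvironments` is a hypothetical name, `undefined` in the result stands for an env-less run, and the handling of an empty intersection is an assumption because the diff does not describe that case.

```typescript
// Returns one entry per run to schedule; `undefined` means an env-less run.
function resolveEnvironments(
  selector: string[] | null | undefined,
  membership: string[],
): Array<string | undefined> {
  if (selector === null || selector === undefined) {
    // null/omitted: every environment the system currently belongs to.
    // With no membership at all, fall back to a single env-less run.
    return membership.length > 0 ? [...membership] : [undefined];
  }
  if (selector.length === 0) {
    // []: explicit opt-out, run once with no environment.
    return [undefined];
  }
  // Non-empty: exactly those ids, intersected with current membership.
  // Assumption: an empty intersection schedules nothing.
  return membership.filter((id) => selector.includes(id));
}

const membership = ["prod", "staging"];
const allEnvs = resolveEnvironments(null, membership);
const optOut = resolveEnvironments([], membership);
const selected = resolveEnvironments(["prod", "dev"], membership);
```

Treating `null` and `[]` as distinct is what lets an assignment say "follow membership as it changes" versus "never fan out".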
@@ -350,6 +310,11 @@ export const HealthCheckRunSchema = z.object({
  result: z.record(z.string(), z.unknown()),
  timestamp: z.date(),
  latencyMs: z.number().optional(),
+ /**
+ * Environment this run executed for (per-environment fan-out). undefined =
+ * env-less run (the opt-out / no-membership case).
+ */
+ environmentId: z.string().optional(),
  /** Source ID for result attribution (null = local core, UUID = satellite) */
  sourceId: z.string().optional(),
  /** Human-readable source label (e.g. "Local" or "EU West (eu-west-1)") */
@@ -390,6 +355,11 @@ export const HealthCheckRunPublicSchema = z.object({
  status: HealthCheckStatusSchema,
  timestamp: z.date(),
  latencyMs: z.number().optional(),
+ /**
+ * Environment this run executed for (per-environment fan-out). undefined =
+ * env-less run (the opt-out / no-membership case).
+ */
+ environmentId: z.string().optional(),
  /** Source ID for result attribution (null = local core, UUID = satellite) */
  sourceId: z.string().optional(),
  /** Human-readable source label (e.g. "Local" or "EU West (eu-west-1)") */