npm - @checkstack/healthcheck-common - Versions diffs - 1.3.0 → 1.5.0 - Mend

@checkstack/healthcheck-common 1.3.0 → 1.5.0

Files changed (7)

package/CHANGELOG.md +194 -0
package/package.json +6 -6
package/src/index.ts +21 -0
package/src/notification-policy-schema.test.ts +38 -0
package/src/rpc-contract.test.ts +44 -0
package/src/rpc-contract.ts +165 -9
package/src/schemas.ts +64 -94

package/CHANGELOG.md CHANGED

@@ -1,5 +1,199 @@
 # @checkstack/healthcheck-common
+## 1.5.0
+### Minor Changes
+- 9dcc848: Plugin-owned AI tools: every domain plugin contributes its own AI tools (chat assistant + automation AI action), and `ai-backend` is platform-only.
+  Every plugin-specific AI tool is owned by the plugin whose domain it acts on, registered via that plugin's own `aiToolExtensionPoint` / `aiToolProjectionExtensionPoint` from its init - the same path an external plugin author uses. `ai-backend` no longer imports or depends on any capability plugin's `*-common`; the dependency direction is strictly plugin -> ai-platform. Pure helpers (`computeFieldDiff`, capability-summary, `ScriptContextKind`) live in `@checkstack/ai-common`.
+  Tools shipped:
+  - Health checks and automations: full CRUD - `healthcheck.propose` / `automation.propose` and `*.update` (`mutate`, deep-validated) and `*.delete` (`destructive`, always confirm-gated). `healthcheck.propose`'s dry-run calls the new deep `validateConfiguration` so propose-time validation matches apply-time. Assertions are validated against the collector's result schema and the canonical operator vocabulary. Capability-catalog tools (`ai.listCapabilities`, `ai.getCapabilitySchema`), script context tools (`ai.getScriptContext`, `ai.testScript`), and notify-subscriber tools (`healthcheck.notifySystemSubscribers` / `...GroupSubscribers`).
+  - Catalog: `catalog.createSystem` / `updateSystem` / `createGroup` / `updateGroup` (`mutate`), `catalog.deleteSystem` / `deleteGroup` (`destructive`), membership tools (`mutate`), plus `catalog.listSystems` / `listGroups` read projections.
+  - Incident: `incident.create` / `update` / `addUpdate` / `resolve` / `addLink` (`mutate`), `incident.delete` / `removeLink` (`destructive`), and `incident.get` / `incident.list` read projections.
+  - Maintenance: `maintenance.create` / `update` / `addUpdate` / `close` / `addLink` (`mutate`), `maintenance.delete` / `removeLink` (`destructive`), and `maintenance.list` / `get` read projections.
+  - Read projections for SLO (`slo.listObjectives`), dependency (`dependency.list`), incident (`incident.list`), healthcheck (`healthcheck.status`), and anomaly (`anomaly.explain`), each gated by the source procedure's own access rule and routed as the principal.
+  - Documentation grounding: `ai.searchDocs` / `ai.getDoc` over a build-time bundled docs index (BM25-ish ranking), so the assistant grounds how-to answers in Checkstack's own docs offline.
+  - URL introspection: `ai.probeUrl`, an SSRF-guarded read tool the assistant uses to inspect a real endpoint before drafting a health check. Update tools compute a before -> after field diff rendered on the confirm card (approve mode) or an "Applied" card (auto mode), so a change is never silent.
+  `ai_analyze` automation action (automation-backend, with an editor connection picker + audited tool calls): runs a bounded AI agent on the run context as the automation's `runAs` service account, so it can never exceed that identity's permissions; destructive tools are never offered; mutating tools auto-apply through the service account's client. Produces an `automation.analysis` artifact downstream actions can branch on. The agent loop is exposed as a headless `aiAgentRunnerRef` service so automation-backend can drive it without depending on ai-backend.
+  `notification.notifyForSubscription` is now callable by user / application principals holding `notification.send` (previously service-only). Every tool routes through the user-scoped client, so handler-side authorization is enforced exactly as a direct UI/RPC action; the resolver gate plus the propose/apply re-check at propose AND apply are the additional authority. A systemic authz regression test asserts every registered tool falls into exactly one safe authorization category.
+  A new `ai_transport` enum value `automation` records the AI action's tool calls in the `ai_tool_calls` audit log. No new durable state beyond that; each tool is a thin, deterministic wrapper over an existing RPC, so every pod behaves identically.
+  This is a beta minor.
+- 9dcc848: Add environments as a first-class catalog primitive, with per-environment health-check fan-out, config templating, per-environment reactive health, and script run-context exposure.
+  - Catalog primitive: an environment is a sibling of groups - a named, instance-global record carrying free-form custom fields (baseUrl, region, tier, ...) that any system can belong to many-to-many. New `environments` + `systems_environments` tables, `EnvironmentSchema` + create/update schemas, `EntityService` environment CRUD and membership joins, RPC endpoints gated by a new `catalogAccess.environment` access rule, a GitOps `Environment` kind + `System.environments` extension, and frontend management (an `EnvironmentEditor`, an Environments management panel, and a per-system environment picker). The Environments card's Add/Edit/Delete affordances are gated on `catalogAccess.environment.manage`.
+  - Per-environment fan-out: run identity becomes `(systemId, configurationId, environmentId)`. Runs, aggregates, and state transitions gain a nullable `environmentId`. The health-check assignment gains an `environmentIds` selector with three modes (All / Specific / None; `null` and `[]` are distinct). The queue executor resolves the effective environment set via the catalog `resolveSystemEnvironments` read and executes one isolated run per environment.
+  - Config templating: a new `x-templatable` config-field marker renders a string field through the template engine at execute time, against `{ environment, check, system }`. A shared `renderTemplatableConfig` and a `renderTemplatePreview` helper (re-exported from `@checkstack/template-engine`) keep editor previews identical to the run-time render. The HTTP collector's `url`, `headers[].value`, and `body` are templatable, rendered per environment (the strategy client build moves inside the per-env loop); the `url`'s `.url()` validation moves post-render. Secrets resolve before templating; a field marked both secret and `x-templatable` is rejected at plugin load. `DynamicForm` shows a live "Preview" line, and the catalog `EnvironmentPreviewPicker` ("Preview as: <environment>") drives it in the collector editor (only when the schema has a templatable field).
+  - Script run-context: `CollectorRunContext` gains an optional `environment` field (`{ id, name, fields }`, metadata only). Shell collectors receive `CHECKSTACK_ENV_ID` / `_NAME` / `CHECKSTACK_ENV_<FIELD>` vars; inline TS collectors read `globalThis.context.environment`; the editor test panel mirrors both. The env-less path is unchanged.
+  - Per-environment reactive health (see BREAKING below), env-keyed read/write paths, env-qualified serialization locks, an optional `trigger.payload.environmentId`, per-environment isolation, and an `ENVIRONMENT_RESOLUTION_FAILED` signal when catalog resolution degrades to a single env-less run.
+  BREAKING CHANGES: the reactive `health` entity's id-shape and cardinality change. It now encodes two views: per-environment (id `"<systemId>::<environmentId>"`) and a system rollup (id `"<systemId>"`, the worst status across environments + env-less runs). The rollup PRESERVES the pre-existing system-level contract - dashboards, status badges, and automations referencing health by `systemId` keep working without re-authoring - but the entity's contract surface changed (new id-shape, higher cardinality, new payload field), so it is flagged breaking. `getBulkHealthState` parses env-qualified ids and keys results by the original id.
+  State and scale: membership and custom fields live only in catalog Postgres and are re-read every tick via the cross-plugin RPC; env-keyed health reads from shared `health_check_runs` / aggregates / transitions (compute-on-read). Every pod resolves the same effective set and the same per-environment health. No pod-local environment state.
+  Also: `unwrapSchema` in `zod-config.ts` loops instead of single-pass-stripping so multi-layer wrappers (`.optional().default()`) still resolve `x-templatable` meta. The env-less `{{ environment.* }}` run notice logs at `debug` (a legitimate recurring configuration), while the post-render HTTP `.url()` check still fails a genuinely-broken empty render with a clear "Rendered URL is invalid" error.
+  This is a beta minor.
+- 9dcc848: Add a deep `validateConfiguration` RPC to the health-check plugin so propose-time validation matches apply-time validation.
+  - `validateConfiguration` (`@checkstack/healthcheck-common`): a new mutation procedure gated by `healthcheck.healthcheck.manage`, taking a proposed configuration (reusing the create skeleton) and returning `{ valid, errors: [{ path, message }] }`, mirroring automation's `validateDefinition`. It persists nothing.
+  - Shared deep validation (`@checkstack/healthcheck-backend`): `collectConfigurationIssues` resolves strategy + collectors by fully-qualified id then migrate-then-validate-strict each config via `parseStrictAssumingV1`. The GitOps reconcile path is refactored to call the same `validateVersionedConfigStrict`, so create / gitops-apply / the new RPC share one implementation.
+  - `healthcheck.propose`'s dry-run (`@checkstack/ai-backend`) now calls `validateConfiguration` as its validation authority, so a wrong config type or a typo'd key surfaces at propose time, bringing it to the same deep-validate level `automation.propose` already has.
+  State and scale: no durable state; `validateConfiguration` is a pure read against the in-process registries plus zod validation, identical on every pod.
+  This is a beta minor.
+### Patch Changes
+- 9dcc848: Input-validation and error-mapping hardening found by a fuzzing pass against the built container.
+  - backend: a Postgres driver error caused by bad client input no longer surfaces as a `500`. The `/api` and `/rest` dispatchers now map the relevant SQLSTATE classes to the correct status - `22P02`/`22003`/`22001`/`22007` (malformed/out-of-range/over-long/bad-date value), `23502`/`23503`/`23514` (missing/dangling/check-failed) to `400`, and `23505` (unique violation) to `409` - and log them at `warn` (client mistake), not `error`. The client-facing message is generic so column/constraint names are never leaked; genuine unknown faults still log at `error` and 500. Previously a `where id = $1` with a non-uuid `$1` (or an over-long string, or a foreign-key miss in `addSystemToGroup`) reached the driver and 500'd, making routine probing look like a server outage and burying real 500s.
+  - slo-common: **fixes a stored cluster-wide DoS.** `windowDays` was accepted up to `2^53`, but the SLO engine derives window boundaries with `Date(now - windowDays * 86_400_000)` - a large value overflows past the max representable `Date` and yields `Invalid Date`. That objective committed fine, then every subsequent read of the system's objectives threw `RangeError: Invalid time value` during serialization (a 500 readable by anyone with SLO read access, on any pod). `windowDays` is now bounded to 1..3650 days at the contract, the GitOps `kind: SLO` spec, and the update path via a single shared `SloWindowDaysSchema`, so the poison row can never be created.
+  - slo-common + healthcheck-common: SLO `getDailySnapshots` and the healthcheck history endpoints (`getHistory`, `getDetailedHistory`, `getAggregatedHistory`, `getDetailedAggregatedHistory`, `getRunsForAnalysis`) declared their `startDate`/`endDate` params as `z.date()`, which a `/rest/...` string param can never satisfy - so those endpoints 400'd on the entire REST surface. They now use `z.coerce.date()`, accepting both the REST string shape and the native RPC `Date`.
+  - healthcheck-common: `intervalSeconds` was `z.number().min(1)` with no `.int()` and no upper bound, so a fractional or out-of-range value reached the DB and failed at insert (the column is a 32-bit int). It is now `.int().min(1).max(2_592_000)` (1 second .. 30 days), applied to both create and update (the update schema is the create partial).
+  - catalog-common: system/group/environment names were bare `z.string()` (environment was `.min(1)` only), so empty, whitespace-only, and 100KB+ names reached the DB - the huge ones surfaced as 500s when parameter binding blew up. Names are now `trim().min(1).max(200)` via a shared schema.
+    **BREAKING:** `getSystemContacts` is now `userType: "authenticated"` (was `"public"`). System contacts carry PII (user id, name, email); the public read leaked them to anonymous status-page visitors. Anonymous callers now receive `401` for this one endpoint; the system detail page already renders "No contacts assigned" for anonymous viewers, so the UI degrades gracefully. All other catalog reads remain public.
+  - catalog-frontend: the system detail page skips the `getSystemContacts` request entirely for anonymous viewers (it would now `401`) and falls back to the empty state.
+  This is a beta release: the breaking contact-visibility change ships as a minor bump per the beta versioning policy, not a major.
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+  - @checkstack/notification-common@1.3.0
+  - @checkstack/catalog-common@2.3.0
+  - @checkstack/common@0.13.0
+  - @checkstack/signal-common@0.2.6
+## 1.4.0
+### Minor Changes
+- 270ef29: Add the health-state provider data contract (automation sensing layer, Wave 2 Phase 13).
+  - New `health_check_state_transitions` table records every aggregate health-status transition for a system (all statuses, not just unhealthy), giving a reliable "in current status since" timestamp. Written wherever an aggregate transition is detected. Pruned with raw-run retention, but the single most-recent row per system is always kept so an active streak never blanks.
+  - New service-typed RPCs on `HealthCheckApi`: `getHealthState({ systemId, configurationId? })` returns `{ status, inStatusSince, inStatusForMs, latencyMs?, avgLatencyMs?, p95LatencyMs?, successRate?, lastRunAt?, inMaintenance, evaluatedAt }`, and `getBulkHealthState({ systemIds })` (POST) resolves many systems against one shared timestamp.
+  - New service-typed RPC on `MaintenanceApi`: `hasActiveMaintenance({ systemId })` reports whether a system is in an active maintenance window regardless of notification-suppression (suppression-agnostic), folded into `getHealthState` as `inMaintenance`.
+  All reads are fail-safe: a missing transition row yields `inStatusSince: null`, and a maintenance-plugin error fails open to `inMaintenance: false`.
+- 270ef29: Add a windowed transition count to the health provider - the building block for custom flapping rules (Wave 2 Phase 18).
+  Flapping is already buildable today via the built-in `healthcheck.flapping_detected` trigger; this phase ships the GENERALIZATION for arbitrary "N status changes in M minutes" rules.
+  - `countStateTransitionsInWindow` counts aggregate status transitions for a system over a trailing window (from the Phase 13 `health_check_state_transitions` table - all statuses, generalizing the unhealthy-only flapping detector). Fail-safe to 0.
+  - `getHealthState` / `getBulkHealthState` now return `transitionsInWindow` + `transitionWindowMinutes`, and accept an optional `transitionWindowMinutes` input (default 60).
+  - The automation definition gains an optional top-level `state_window_minutes` (default 60), threaded through `enrichScopeWithState` so `health.system.transitions_in_window` / `health.system.transition_window_minutes` are folded into scope per evaluation.
+  - Operators author custom flapping as a `numeric_state` condition over `health.system.transitions_in_window` - no new condition variant, no editor change. The variable-scope resolver surfaces the new fields for autocomplete.
+- 270ef29: Replace the hardcoded auto-incident path with default automations (Wave 2 Phase 20).
+  BREAKING CHANGES: Auto-incident is now automation-driven. The hardcoded background path that opened incidents on sustained-unhealthy / flapping and closed them after a cooldown (`auto-incident.ts`, `auto-incident-close-job.ts`) is removed. On upgrade, an idempotent, threshold-preserving migration seeds equivalent default automations from each assignment's existing `NotificationPolicy`, so alerting behaviour is preserved 1:1:
+  - `sustainedUnhealthyTrigger.durationMinutes` -> the `for:` dwell on a `healthcheck.system_degraded` trigger -> `incident.create`.
+  - auto-close `autoCloseAfterMinutes` -> a `wait_until` (healthy continuously for the cooldown) -> `incident.resolve`.
+  - `useNotificationSuppression` -> the incident's `suppressNotifications`.
+  - `skipDuringMaintenance` -> a `{{ !health.system.in_maintenance }}` pre-run condition.
+  - `flappingTrigger.{transitions,windowMinutes}` -> a second automation on the `healthcheck.flapping_detected` trigger -> `incident.create`.
+  Auto-incidents remain ONE OPEN INCIDENT PER SYSTEM, faithful to the old behaviour. `incident.create` gains an opt-in `dedupe_open_for_system` config flag (default false, so existing/custom automations are unaffected): when true, it reuses an existing open incident on the target system instead of opening a duplicate (the old `findActiveAutoIncident(systemId)` semantic), returning the reused incident as the produced `incident` artifact. The seeded default automations set this flag, so a system with several failing checks - sustained and/or flapping - still gets a single open incident; whichever check crosses its threshold first opens it, and the rest dedupe to it. Both sustained and flapping default automations open at `critical` severity (parity with the old path). Per-system run dedup within an automation uses `concurrency_scope: "context_key"` + `mode: "single"`.
+  Operators can read, edit, disable, and extend these automations (see the "Customise auto-incident" guide). Seeded automations are tagged via `managedBy` (`auto-incident:<systemId>:<configurationId>:<kind>`) so the migration is a no-op on re-runs; anything unmappable is recorded as a migration-failure row.
+  Flapping DETECTION (transition recording + the `healthcheck.flapping_detected` emit) is relocated into `flapping-detector.ts` and survives; the emit now fires unconditionally on a threshold cross (no longer gated on `autoOpenIncidentOnUnhealthy`), matching the hook's documented intent and required for the flapping default automation. The legacy `health_check_auto_incidents` mapping table is no longer written or read (it will be dropped in a follow-up migration); `health_check_unhealthy_transitions` is retained for the flapping detector.
+  New service-typed `HealthCheckApi.listAutoIncidentPolicies` RPC exposes each assignment's effective notification policy for the migration. `incident.create` adds the `dedupe_open_for_system` flag (additive, defaults off).
+- b995afb: Move health-check flapping configuration from the per-assignment notification policy onto the `healthcheck.flapping_detected` automation trigger.
+  Flapping thresholds (`transitions`, `windowMinutes`) are now configured on the trigger itself, next to the automation that reacts to them, instead of on each check assignment. The health-check executor still owns the windowed transition counting (it writes `health_check_unhealthy_transitions` and runs the window query), but it now SOURCES the thresholds from the subscribed automations' trigger config:
+  - On a transition-to-unhealthy it records the transition unconditionally (keeping history warm), then looks up the enabled automations subscribed to `healthcheck.flapping_detected`, collects the distinct set of configured windows, counts transitions once per distinct window, and emits one `healthcheck.flapping_detected` per window. The trigger's exact-window `evaluateConfig` gate then fires each automation only for its own window and transition threshold.
+  - A missing or partial flapping trigger config defaults to `{ transitions: 3, windowMinutes: 60 }`, so automations created before the trigger carried config keep working unchanged.
+  - `automation-backend` exposes a new backend-only, read-only `automationSubscriptionsRef` service ref (`findEnabledByTriggerEvent`) so a plugin that owns a trigger's underlying event can discover its subscribers' trigger config. It is never browser-exposed.
+  **BREAKING CHANGES**
+  - The per-assignment `notificationPolicy.flappingTrigger` field is removed. `NotificationPolicy` is now `{ suppressDeEscalations }` only. Stored rows that still carry a `flappingTrigger` key parse cleanly - the key is stripped on read - so no data migration is required, but the per-check flapping toggle/threshold in the assignment Notifications tab is gone; configure flapping on the trigger instead.
+  - The GitOps `System.healthcheck[].notificationPolicy.flappingTrigger` field is removed. A `flappingTrigger` block in a manifest is ignored. Move the thresholds to the `transitions` / `windowMinutes` config of your `healthcheck.flapping_detected` automation trigger.
+  - The standalone `enabled` flag for flapping is gone: flapping is "enabled" precisely when at least one enabled automation subscribes to `healthcheck.flapping_detected`. With no subscriber, the transition is still recorded but nothing is counted or emitted.
+- b995afb: Remove the legacy per-assignment auto-incident system. Auto-incidents are now built entirely by user-authored automations; nothing is seeded or hardcoded.
+  What was removed:
+  - The one-time migration that auto-seeded "sustained unhealthy" and "flapping" default automations from each assignment's notification policy, plus the `listAutoIncidentPolicies` RPC it consumed.
+  - The seeder-only notification-policy settings and their UI: `autoOpenIncidentOnUnhealthy`, `useNotificationSuppression`, `skipDuringMaintenance`, `sustainedUnhealthyTrigger`, and `autoCloseAfterMinutes`. The assignment **Notifications** tab now exposes only the two live settings: **Suppress de-escalation notifications** and the **flapping-detection** thresholds.
+  - The dead `health_check_auto_incidents` table (no longer written or read; dropped via migration).
+  What is preserved: flapping detection (`healthcheck.flapping_detected`) and de-escalation suppression are unchanged. The `flappingTrigger` and `suppressDeEscalations` policy fields stay exactly as before.
+  > [!NOTE]
+  > One-time cleanup: an automation-backend migration deletes the historically auto-seeded incident automations (`managed_by LIKE 'auto-incident:%'`) from existing databases. This is intentional and destructive - those automations were no longer managed by anything. If you had edited a seeded automation and want to keep it, re-create it as a normal automation before upgrading. See the "Build auto-incident automations" guide for templates.
+  > [!IMPORTANT]
+  > NARROWING: `NotificationPolicySchema` is narrowed to `{ suppressDeEscalations, flappingTrigger }`. Stored rows that still carry the removed legacy keys parse cleanly - zod strips the unknown keys on read - so no data migration is required for the `system_health_checks.notification_policy` column. GitOps `notificationPolicy` specs that set the removed fields are no longer accepted for those keys.
+- 270ef29: Extend in-UI script testing to health-check collectors, and add
+  load-from-run replay for automation script tests.
+  - Health-check collectors: a new `testCollectorScript` RPC runs the
+    inline-script (TypeScript) collector and the shell `script` collector
+    against an editable, auto-seeded sample context using the same
+    sandboxed runner the real collector uses. Surfaces beneath the
+    collector script fields in the collector editor (both marked
+    `x-script-testable`). Gated by `healthcheck.configuration.manage`.
+  - Automation replay: a new `getRunScopeForReplay` RPC reconstructs an
+    editable test context from a real run (trigger + persisted artifacts,
+    plus the durable scope snapshot when the run is still in-flight), and
+    the script-test panel gains a "Load from run" picker that seeds the
+    sample context from a past run.
+  Note: health-check executions do not persist the script / config /
+  check / system that produced a result, so there is no health-check
+  replay - auto-seed is the only context source for collector tests. This
+  is by design; see the feature plan.
+- 270ef29: Secrets platform Phase 2: secret -> env-var mapping with central resolve, inject, and mask.
+  - Script consumers declare a least-privilege `secretEnv` allowlist
+    (`{ ENV_NAME: "${{ secrets.NAME }}" }`). The automation `run_script` /
+    `run_shell` actions resolve ONLY the declared secrets via
+    `secretResolverRef.resolveForRun`, inject them into the runner env for
+    that run (memory-only; the ESM runner gained a per-run `env` option), and
+    mask their values out of stdout/stderr/result/error via the run-scoped
+    masking context. A missing required secret fails the run clearly. No
+    ambient secret access.
+  - Test panel: `testScript` / `testCollectorScript` inject named
+    `__SECRET_<NAME>__` placeholders by default, or user-supplied per-secret
+    overrides; real production values are never resolved in the test path,
+    and overrides are masked out of the result.
+  - Healthcheck collectors carry the `secretEnv` field for authoring +
+    the test panel; runtime injection on satellites lands in Phase 3.
+  - Editor UX: a new `@checkstack/ui` `SecretEnvEditor` renders `x-secret-env`
+    record fields with `${{ secrets.* }}` name autocomplete (from
+    `listSecretNames`), wired into the automation action editor and the
+    healthcheck collector editor. New `withConfigMeta` helper +
+    `x-secret-env` config-meta key in `@checkstack/backend-api`.
 ## 1.3.0
 ### Minor Changes
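
The 1.5.0 entry above describes two id shapes for the reactive `health` entity: `"<systemId>::<environmentId>"` for a single environment and a bare `"<systemId>"` for the system rollup, which carries the worst status across environments. The following is a minimal sketch of how a consumer might parse those ids and compute such a rollup. The names `parseHealthEntityId` and `rollupStatus` are hypothetical and not part of the package; the severity order (healthy < degraded < unhealthy) and the "empty list is healthy" default are assumptions, and the sketch assumes a system id never contains `::`.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

interface ParsedHealthId {
  systemId: string;
  // undefined means the id refers to the system-level rollup.
  environmentId?: string;
}

// Assumed severity ranking; the changelog only says "worst status".
const SEVERITY: Record<HealthStatus, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
};

// Hypothetical helper: split an id on its first "::" separator.
function parseHealthEntityId(id: string): ParsedHealthId {
  const sep = id.indexOf("::");
  if (sep === -1) return { systemId: id };
  return { systemId: id.slice(0, sep), environmentId: id.slice(sep + 2) };
}

// Hypothetical helper: worst status across per-environment and env-less runs.
// An empty list falls back to "healthy" (an assumption of this sketch).
function rollupStatus(statuses: HealthStatus[]): HealthStatus {
  let worst: HealthStatus = "healthy";
  for (const s of statuses) {
    if (SEVERITY[s] > SEVERITY[worst]) worst = s;
  }
  return worst;
}
```

Under these assumptions, an automation that references health by plain `systemId` reads the rollup, while an env-qualified id narrows the view to one environment.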

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@checkstack/healthcheck-common",
-  "version": "1.3.0",
+  "version": "1.5.0",
   "license": "Elastic-2.0",
   "type": "module",
   "exports": {
@@ -9,16 +9,16 @@
     }
   },
   "dependencies": {
-    "@checkstack/common": "0.11.0",
-    "@checkstack/catalog-common": "2.2.2",
-    "@checkstack/notification-common": "1.2.0",
-    "@checkstack/signal-common": "0.2.4",
+    "@checkstack/common": "0.12.0",
+    "@checkstack/catalog-common": "2.2.3",
+    "@checkstack/notification-common": "1.2.1",
+    "@checkstack/signal-common": "0.2.5",
     "zod": "^4.2.1"
   },
   "devDependencies": {
     "typescript": "^5.7.2",
     "@checkstack/tsconfig": "0.0.7",
-    "@checkstack/scripts": "0.3.3"
+    "@checkstack/scripts": "0.3.4"
   },
   "scripts": {
     "typecheck": "tsgo -b",

package/src/index.ts CHANGED

@@ -84,3 +84,24 @@ export const SYSTEM_STATUS_CHANGED = createSignal({
     newStatus: z.enum(["healthy", "degraded", "unhealthy"]),
   }),
 });
+/**
+ * Broadcast when the executor FAILED to resolve a system's environments from
+ * the catalog at run time and DEGRADED to a single env-less run (fail-open).
+ *
+ * This is the durable-misconfig / catalog-outage observability signal: a
+ * `logger.warn` alone is easy to miss, so this counter-style signal makes the
+ * degradation observable (dashboards / alerts can count it). The check still
+ * runs (env-less) — this signals that per-environment fan-out was skipped for
+ * this tick, NOT that the check failed.
+ */
+export const ENVIRONMENT_RESOLUTION_FAILED = createSignal({
+  pluginMetadata,
+  event: "environment.resolution_failed",
+  payloadSchema: z.object({
+    systemId: z.string(),
+    configurationId: z.string(),
+    /** The error message that caused the fall-back to an env-less run. */
+    error: z.string(),
+  }),
+});
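
The doc comment on `ENVIRONMENT_RESOLUTION_FAILED` describes a fail-open path: when environment resolution throws, the check still runs once without an environment and a signal records the degradation. A rough sketch of that control flow follows. `planRuns`, `resolve`, and `emit` are hypothetical stand-ins rather than Checkstack APIs, the resolver is synchronous only for brevity (the changelog describes it as a cross-plugin RPC), and treating an empty environment set as a single env-less run is an assumption. The payload fields mirror the `payloadSchema` in the hunk above.

```typescript
interface ResolutionFailedPayload {
  systemId: string;
  configurationId: string;
  error: string;
}

// One entry per run to execute; environmentId is null for the env-less run.
interface PlannedRun {
  systemId: string;
  configurationId: string;
  environmentId: string | null;
}

// Hypothetical planner illustrating fail-open fan-out.
function planRuns(
  systemId: string,
  configurationId: string,
  resolve: (systemId: string) => string[],
  emit: (payload: ResolutionFailedPayload) => void,
): PlannedRun[] {
  const envLess: PlannedRun = { systemId, configurationId, environmentId: null };
  try {
    const environmentIds = resolve(systemId);
    // Assumption: a system without environments keeps the env-less path.
    if (environmentIds.length === 0) return [envLess];
    // One isolated run per resolved environment.
    return environmentIds.map((environmentId) => ({
      systemId,
      configurationId,
      environmentId,
    }));
  } catch (err) {
    // Fail open: report the degradation, then still run once, env-less.
    const error = err instanceof Error ? err.message : String(err);
    emit({ systemId, configurationId, error });
    return [envLess];
  }
}
```

The point of emitting before returning is that the run result alone cannot distinguish "no environments configured" from "resolution failed"; only the signal carries that difference.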

package/src/notification-policy-schema.test.ts ADDED

@@ -0,0 +1,38 @@
+import { describe, it, expect } from "bun:test";
+import {
+  NotificationPolicySchema,
+  DEFAULT_NOTIFICATION_POLICY,
+} from "./schemas";
+describe("NotificationPolicySchema", () => {
+  it("accepts the slim policy shape", () => {
+    const parsed = NotificationPolicySchema.parse({
+      suppressDeEscalations: true,
+    });
+    expect(parsed).toEqual({ suppressDeEscalations: true });
+  });
+  it("defaults to the compile-time default when parsing an empty object", () => {
+    const parsed = NotificationPolicySchema.parse({});
+    expect(parsed).toEqual(DEFAULT_NOTIFICATION_POLICY);
+    expect(parsed).toEqual({ suppressDeEscalations: false });
+  });
+  it("strips removed legacy keys (auto-incident AND flapping) without throwing", () => {
+    // A row persisted before the legacy auto-incident fields and the
+    // flapping thresholds were removed still carries the larger object.
+    // zod's default object behaviour drops the unknown keys rather than
+    // rejecting them — flapping now lives on the automation trigger config.
+    const parsed = NotificationPolicySchema.parse({
+      suppressDeEscalations: true,
+      flappingTrigger: { enabled: false, transitions: 9, windowMinutes: 10 },
+      autoOpenIncidentOnUnhealthy: true,
+      useNotificationSuppression: true,
+      skipDuringMaintenance: false,
+      sustainedUnhealthyTrigger: { enabled: true, durationMinutes: 5 },
+      autoCloseAfterMinutes: 99,
+    });
+    expect(parsed).toEqual({ suppressDeEscalations: true });
+    expect(Object.keys(parsed)).toEqual(["suppressDeEscalations"]);
+  });
+});

package/src/rpc-contract.test.ts ADDED

@@ -0,0 +1,44 @@
+import { describe, expect, test } from "bun:test";
+import type { ZodType } from "zod";
+import { healthCheckContract } from "./rpc-contract";
+/**
+ * Guards the REST-compatibility fix: history date params were `z.date()`, which
+ * a `/rest/...` string param can never satisfy, so every REST history call
+ * 400'd. `z.coerce.date()` accepts both the REST string shape and the native RPC
+ * Date shape.
+ */
+function inputSchemaFor(procName: keyof typeof healthCheckContract): ZodType {
+  const proc = healthCheckContract[procName] as unknown as Record<
+    string,
+    unknown
+  >;
+  const orpc = proc["~orpc"] as { inputSchema?: ZodType };
+  if (!orpc.inputSchema) throw new Error(`${String(procName)} has no input`);
+  return orpc.inputSchema;
+}
+describe("history endpoints coerce string date params (REST compatibility)", () => {
+  test("getAggregatedHistory accepts ISO date strings", () => {
+    const parsed = inputSchemaFor("getAggregatedHistory").safeParse({
+      systemId: "sys-1",
+      configurationId: "cfg-1",
+      startDate: "2026-01-01T00:00:00.000Z",
+      endDate: "2026-02-01T00:00:00.000Z",
+    });
+    expect(parsed.success).toBe(true);
+    if (parsed.success) {
+      const data = parsed.data as { startDate: Date };
+      expect(data.startDate).toBeInstanceOf(Date);
+    }
+  });
+  test("getHistory accepts ISO date strings on its optional date params", () => {
+    const parsed = inputSchemaFor("getHistory").safeParse({
+      systemId: "sys-1",
+      startDate: "2026-01-01T00:00:00.000Z",
+      sortOrder: "desc",
+    });
+    expect(parsed.success).toBe(true);
+  });
+});
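
The test file above guards the switch from `z.date()` to `z.coerce.date()`. A rough, dependency-free sketch of the difference follows. It is an approximation of the idea only, not zod's implementation and not code from this package: `strictDate` and `coercedDate` are hypothetical helper names, and the sketch is deliberately narrower than zod's coercion (it only accepts `Date`, string, and number inputs).

```typescript
// Hypothetical helpers illustrating strict vs. coerced date parsing.
// Both return undefined when the input is rejected.

// Strict: only a real, valid Date instance passes (what a REST string can never be).
function strictDate(input: unknown): Date | undefined {
  return input instanceof Date && !Number.isNaN(input.getTime())
    ? input
    : undefined;
}

// Coerced: strings and numbers are first run through `new Date(...)`,
// then the result is checked for validity.
function coercedDate(input: unknown): Date | undefined {
  let candidate: Date;
  if (input instanceof Date) {
    candidate = input;
  } else if (typeof input === "string" || typeof input === "number") {
    candidate = new Date(input);
  } else {
    return undefined;
  }
  return Number.isNaN(candidate.getTime()) ? undefined : candidate;
}

const restParam = "2026-01-01T00:00:00.000Z"; // shape of a REST query param
const rpcParam = new Date(restParam); // shape of a native RPC param

console.log(strictDate(restParam)); // undefined: the string is rejected
console.log(coercedDate(restParam)?.toISOString()); // 2026-01-01T00:00:00.000Z
console.log(coercedDate(rpcParam)?.toISOString()); // 2026-01-01T00:00:00.000Z
console.log(coercedDate("not a date")); // undefined: invalid dates still fail
```

Under these assumptions the coerced variant accepts both call shapes while still rejecting unparseable strings, which is the behaviour the two tests above assert against the real contract.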

package/src/rpc-contract.ts CHANGED

@@ -8,6 +8,8 @@ import {
   HealthCheckConfigurationSchema,
   CreateHealthCheckConfigurationSchema,
   UpdateHealthCheckConfigurationSchema,
+  ValidateConfigurationInputSchema,
+  ValidateConfigurationResultSchema,
   AssociateHealthCheckSchema,
   HealthCheckRunSchema,
   HealthCheckRunPublicSchema,
@@ -40,6 +42,82 @@ export type SystemHealthStatusResponse = z.infer<
   typeof SystemHealthStatusResponseSchema
 >;
+/**
+ * Live health-state snapshot used by the automation sensing layer.
+ * Service-typed (backend-to-backend). `inStatusSince` is null when no
+ * transition has been recorded; `inStatusForMs` is 0 in that case.
+ */
+const HealthStateSchema = z.object({
+  status: HealthCheckStatusSchema,
+  inStatusSince: z.date().nullable(),
+  inStatusForMs: z.number(),
+  latencyMs: z.number().optional(),
+  avgLatencyMs: z.number().optional(),
+  p95LatencyMs: z.number().optional(),
+  successRate: z.number().optional(),
+  lastRunAt: z.date().optional(),
+  inMaintenance: z.boolean(),
+  /** Count of aggregate status transitions in the trailing window (flapping). */
+  transitionsInWindow: z.number(),
+  /** The window (minutes) `transitionsInWindow` was counted over. */
+  transitionWindowMinutes: z.number(),
+  evaluatedAt: z.date(),
+});
+export type HealthStateResponse = z.infer<typeof HealthStateSchema>;
+// --- Collector script testing (in-UI) ---
+/**
+ * Curated check/system metadata a collector script reads. Every part is
+ * optional so a partial sample still runs.
+ */
+const CollectorTestRunContextSchema = z.object({
+  check: z
+    .object({
+      id: z.string(),
+      name: z.string(),
+      intervalSeconds: z.number().int(),
+    })
+    .optional(),
+  system: z.object({ id: z.string(), name: z.string() }).optional(),
+});
+export const CollectorScriptTestInputSchema = z.object({
+  kind: z.enum(["typescript", "shell"]),
+  script: z.string(),
+  config: z.record(z.string(), z.unknown()).optional(),
+  env: z.record(z.string(), z.string()).optional(),
+  /**
+   * The collector's declared secret -> env mapping. The test panel NEVER
+   * resolves real secret values: each declared env var gets a
+   * `__SECRET_<NAME>__` placeholder by default, or the user override below
+   * (decision 4).
+   */
+  secretEnv: z.record(z.string(), z.string()).optional(),
+  /** User-supplied per-secret-NAME override values, masked out of the result. */
+  secretOverrides: z.record(z.string(), z.string()).optional(),
+  workingDirectory: z.string().optional(),
+  runContext: CollectorTestRunContextSchema.optional(),
+  timeoutMs: z.number().int().min(100).max(300_000).default(30_000),
+});
+export type CollectorScriptTestInputDto = z.infer<
+  typeof CollectorScriptTestInputSchema
+>;
+export const CollectorScriptTestResultSchema = z.object({
+  result: z.unknown().optional(),
+  stdout: z.string(),
+  stderr: z.string(),
+  exitCode: z.number().int().optional(),
+  durationMs: z.number().int().nonnegative(),
+  timedOut: z.boolean(),
+  error: z.string().optional(),
+});
+export type CollectorScriptTestResultDto = z.infer<
+  typeof CollectorScriptTestResultSchema
+>;
 // Health Check RPC Contract using oRPC's contract-first pattern
 export const healthCheckContract = {
   // ==========================================================================
@@ -60,6 +138,24 @@ export const healthCheckContract = {
     .input(z.object({ strategyId: z.string() }))
     .output(z.array(CollectorDtoSchema)),
+  /**
+   * Run a collector script (inline-script TS or the shell `script`
+   * collector) against an editable sample context, using the same
+   * sandboxed runner the real collector uses. Lets operators test a
+   * collector script in the editor without scheduling a real execution.
+   *
+   * Gated by `configuration.manage` because authoring a collector script
+   * already executes code on the central backend - same privilege. The
+   * run is central-only and time-bounded.
+   */
+  testCollectorScript: proc({
+    operationType: "mutation",
+    userType: "authenticated",
+    access: [healthCheckAccess.configuration.manage],
+  })
+    .input(CollectorScriptTestInputSchema)
+    .output(CollectorScriptTestResultSchema),
   // ==========================================================================
   // CONFIGURATION MANAGEMENT (userType: "authenticated")
   // ==========================================================================
@@ -88,6 +184,21 @@ export const healthCheckContract = {
     .input(CreateHealthCheckConfigurationSchema)
     .output(HealthCheckConfigurationSchema),
+  /**
+   * Deep-validate a proposed health-check configuration WITHOUT persisting it.
+   * Runs the SAME strategy/collector resolution + migrate-then-validate-strict
+   * logic the create / gitops-apply path uses, so propose-time errors match
+   * apply-time errors. Gated by `configuration.manage` (the privilege the
+   * create form requires); the mirror of automation's `validateDefinition`.
+   */
+  validateConfiguration: proc({
+    operationType: "mutation",
+    userType: "authenticated",
+    access: [healthCheckAccess.configuration.manage],
+  })
+    .input(ValidateConfigurationInputSchema)
+    .output(ValidateConfigurationResultSchema),
   updateConfiguration: proc({
     operationType: "mutation",
     userType: "authenticated",
@@ -154,6 +265,11 @@ export const healthCheckContract = {
           stateThresholds: StateThresholdsSchema.optional(),
           /** IDs of satellites assigned to execute this health check */
           satelliteIds: z.array(z.string()).optional(),
+          /**
+           * Per-assignment environment selector. null = all current
+           * environments; [] = opt out (env-less); non-empty = those ids.
+           */
+          environmentIds: z.array(z.string()).nullable().optional(),
           /** Whether to also run this check locally on the core (default: true) */
           includeLocal: z.boolean(),
           /** Per-association notification policy (omitted = platform defaults) */
@@ -262,8 +378,8 @@ export const healthCheckContract = {
       z.object({
         systemId: z.string().optional(),
         configurationId: z.string().optional(),
-        startDate: z.date().optional(),
-        endDate: z.date().optional(),
+        startDate: z.coerce.date().optional(),
+        endDate: z.coerce.date().optional(),
         /** Filter by source: "local" = core only, satellite UUID = specific satellite, undefined = all */
         sourceFilter: z.string().optional(),
         /** Restrict runs to the listed statuses. Omitted/empty = no filter. */
@@ -289,8 +405,8 @@ export const healthCheckContract = {
       z.object({
         systemId: z.string().optional(),
         configurationId: z.string().optional(),
-        startDate: z.date().optional(),
-        endDate: z.date().optional(),
+        startDate: z.coerce.date().optional(),
+        endDate: z.coerce.date().optional(),
         /** Filter by source: "local" = core only, satellite UUID = specific satellite, undefined = all */
         sourceFilter: z.string().optional(),
         /** Restrict runs to the listed statuses. Omitted/empty = no filter. */
@@ -328,8 +444,8 @@ export const healthCheckContract = {
       z.object({
         systemId: z.string(),
         configurationId: z.string(),
-        startDate: z.date(),
-        endDate: z.date(),
+        startDate: z.coerce.date(),
+        endDate: z.coerce.date(),
         /** Target number of data points (default: 500). Bucket interval is calculated as (endDate - startDate) / targetPoints */
         targetPoints: z.number().min(10).max(2000).default(500),
       }),
@@ -351,8 +467,8 @@ export const healthCheckContract = {
       z.object({
         systemId: z.string(),
         configurationId: z.string(),
-        startDate: z.date(),
-        endDate: z.date(),
+        startDate: z.coerce.date(),
+        endDate: z.coerce.date(),
         /** Filter by source: "local" = core only, satellite UUID = specific satellite, undefined = all */
         sourceFilter: z.string().optional(),
         /** Target number of data points (default: 500). Bucket interval is calculated as (endDate - startDate) / targetPoints */
@@ -482,6 +598,46 @@ export const healthCheckContract = {
     )
     .output(z.void()),
+  /**
+   * Live health-state snapshot for a single system (Wave-2 sensing
+   * contract). Returns status, in-status duration, latency, windowed
+   * metrics, and suppression-agnostic maintenance state.
+   */
+  getHealthState: proc({
+    operationType: "query",
+    userType: "service",
+    access: [],
+  })
+    .input(
+      z.object({
+        systemId: z.string(),
+        configurationId: z.string().optional(),
+        /** Trailing window (minutes) for `transitionsInWindow`. Default 60. */
+        transitionWindowMinutes: z.number().int().min(1).optional(),
+      }),
+    )
+    .output(HealthStateSchema),
+  /**
+   * Bulk variant of {@link getHealthState}. POST to avoid N+1 from
+   * dashboards and multi-system automation rules; all systems share one
+   * evaluation timestamp.
+   */
+  getBulkHealthState: proc({
+    operationType: "query",
+    userType: "service",
+    access: [],
+  })
+    .route({ method: "POST" })
+    .input(
+      z.object({
+        systemIds: z.array(z.string()),
+        /** Trailing window (minutes) for `transitionsInWindow`. Default 60. */
+        transitionWindowMinutes: z.number().int().min(1).optional(),
+      }),
+    )
+    .output(z.object({ states: z.record(z.string(), HealthStateSchema) })),
   getRunsForAnalysis: proc({
     operationType: "query",
     userType: "service",
@@ -489,7 +645,7 @@ export const healthCheckContract = {
   })
     .input(
       z.object({
-        startDate: z.date(),
+        startDate: z.coerce.date(),
         limitPerAssignment: z.number().optional().default(200),
       }),
     )
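
The `environmentIds` selector added to the association input in this file distinguishes `null`, `[]`, and a non-empty array. The `schemas.ts` doc comment for the same field adds that a non-empty list is intersected with the system's current membership. A minimal sketch of how a consumer might interpret the selector follows. `resolveEnvironmentIds` is a hypothetical helper, not part of the package, and how the backend treats an empty intersection is not stated in the diff, so the sketch simply returns an empty list in that case.

```typescript
// Hypothetical helper: maps the three selector shapes to target environment ids.
// An empty result stands for "no environment" (an env-less run).
type EnvironmentSelector = readonly string[] | null | undefined;

function resolveEnvironmentIds(
  selector: EnvironmentSelector,
  currentMembership: readonly string[],
): string[] {
  // null / omitted: every environment the system currently belongs to.
  if (selector === null || selector === undefined) {
    return [...currentMembership];
  }
  // Empty array: explicit opt-out, distinct from null.
  if (selector.length === 0) {
    return [];
  }
  // Non-empty: exactly the listed ids, limited to current membership.
  const current = new Set(currentMembership);
  return selector.filter((id) => current.has(id));
}

const membership = ["prod", "staging"];

console.log(resolveEnvironmentIds(null, membership)); // [ 'prod', 'staging' ]
console.log(resolveEnvironmentIds([], membership)); // []
console.log(resolveEnvironmentIds(["prod", "dev"], membership)); // [ 'prod' ]
```

Keeping `null` and `[]` as separate branches mirrors the doc comment's note that the two are semantically distinct, so a client should not normalise one into the other before sending.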

package/src/schemas.ts CHANGED

@@ -97,7 +97,9 @@ export const CreateHealthCheckConfigurationSchema = z.object({
   name: z.string().min(1),
   strategyId: z.string().min(1),
   config: z.record(z.string(), z.unknown()),
-  intervalSeconds: z.number().min(1),
+  // Bounded: a non-integer or out-of-range value previously reached the DB and
+  // failed at insert (the column is a 32-bit int). 1 second .. 30 days.
+  intervalSeconds: z.number().int().min(1).max(2_592_000),
   /** Optional collector configurations */
   collectors: z.array(CollectorConfigEntrySchema).optional(),
 });
@@ -106,6 +108,39 @@ export type CreateHealthCheckConfiguration = z.infer<
   typeof CreateHealthCheckConfigurationSchema
 >;
+/**
+ * Input for the `validateConfiguration` RPC: a proposed (not-yet-persisted)
+ * health-check configuration. Reuses the create skeleton so the same
+ * name/strategyId/config/intervalSeconds/collectors shape is validated at
+ * propose time as is at apply time.
+ */
+export const ValidateConfigurationInputSchema =
+  CreateHealthCheckConfigurationSchema;
+export type ValidateConfigurationInput = z.infer<
+  typeof ValidateConfigurationInputSchema
+>;
+/**
+ * Result of `validateConfiguration`: `valid` plus a flat list of structured
+ * issues. `path`s are dot-joinable for display, e.g. `config.url` or
+ * `collectors.0.config.path`. Mirrors automation's `validateDefinition`
+ * result so consumers (the AI propose tool, the UI) handle both identically.
+ */
+export const ValidateConfigurationResultSchema = z.object({
+  valid: z.boolean(),
+  errors: z.array(
+    z.object({
+      path: z.array(z.union([z.string(), z.number()])),
+      message: z.string(),
+    }),
+  ),
+});
+export type ValidateConfigurationResult = z.infer<
+  typeof ValidateConfigurationResultSchema
+>;
 export const UpdateHealthCheckConfigurationSchema =
   CreateHealthCheckConfigurationSchema.partial();
@@ -209,54 +244,20 @@ export const DEFAULT_STATE_THRESHOLDS: StateThresholds = {
 // --- Notification Policy ---
-/**
- * Trigger that opens an auto-incident after a check has been
- * continuously `unhealthy` for at least `durationMinutes`. Resets if
- * the check recovers to non-unhealthy in between.
- */
-export const SustainedUnhealthyTriggerSchema = z.object({
-  /** When false, this trigger is fully disabled. */
-  enabled: z.boolean().default(true),
-  /** Minimum continuous-unhealthy time before opening. */
-  durationMinutes: z.number().int().min(1).default(30),
-});
-export type SustainedUnhealthyTrigger = z.infer<
-  typeof SustainedUnhealthyTriggerSchema
->;
-/**
- * Trigger that opens an auto-incident when a check has transitioned
- * to `unhealthy` at least `transitions` times within `windowMinutes`.
- * Catches flapping that the sustained-duration trigger would miss
- * because each unhealthy phase is too short.
- */
-export const FlappingTriggerSchema = z.object({
-  /** When false, this trigger is fully disabled. */
-  enabled: z.boolean().default(true),
-  /** Minimum number of transitions-to-unhealthy needed in the window. */
-  transitions: z.number().int().min(1).default(3),
-  /** Sliding window in minutes the transitions are counted over. */
-  windowMinutes: z.number().int().min(1).default(60),
-});
-export type FlappingTrigger = z.infer<typeof FlappingTriggerSchema>;
-export const DEFAULT_SUSTAINED_TRIGGER: SustainedUnhealthyTrigger = {
-  enabled: true,
-  durationMinutes: 30,
-};
-export const DEFAULT_FLAPPING_TRIGGER: FlappingTrigger = {
-  enabled: true,
-  transitions: 3,
-  windowMinutes: 60,
-};
 /**
  * Per-association notification preferences. All fields are evaluated
  * per (system, configuration) — different checks on the same system
  * are fully independent.
+ *
+ * The schema strips unknown keys (zod's default object behaviour), so
+ * rows persisted before the legacy auto-incident fields were removed
+ * (`autoOpenIncidentOnUnhealthy`, `useNotificationSuppression`,
+ * `skipDuringMaintenance`, `sustainedUnhealthyTrigger`,
+ * `autoCloseAfterMinutes`) parse cleanly — the dead keys are dropped.
+ * Flapping thresholds also moved OUT of this policy onto the automation
+ * engine's windowed-count gate (the `system_health_changed` trigger's
+ * `window` block), so a persisted `flappingTrigger` key is likewise stripped
+ * on read.
  */
 export const NotificationPolicySchema = z.object({
   /**
@@ -265,61 +266,12 @@ export const NotificationPolicySchema = z.object({
    * still notify.
    */
   suppressDeEscalations: z.boolean().default(false),
-  /**
-   * When true, the configured triggers can open auto-managed incidents
-   * on the system. Setting this to false disables both triggers
-   * regardless of their individual `enabled` flags.
-   */
-  autoOpenIncidentOnUnhealthy: z.boolean().default(true),
-  /**
-   * When true, the auto-opened incident is created with
-   * `suppressNotifications` enabled so further health-state
-   * notifications for the system are silenced until the incident is
-   * resolved. Only meaningful when `autoOpenIncidentOnUnhealthy` is on.
-   */
-  useNotificationSuppression: z.boolean().default(true),
-  /**
-   * When true, no auto-incident is opened while the system has an
-   * active maintenance window with notification suppression. The
-   * system is intentionally down and shouldn't trip the on-call.
-   */
-  skipDuringMaintenance: z.boolean().default(true),
-  /**
-   * Trigger A: "this check has been unhealthy for X minutes
-   * continuously." Catches real outages.
-   */
-  sustainedUnhealthyTrigger: SustainedUnhealthyTriggerSchema.default(
-    DEFAULT_SUSTAINED_TRIGGER,
-  ),
-  /**
-   * Trigger B: "this check transitioned to unhealthy N times in M
-   * minutes." Catches persistent flapping where no individual
-   * unhealthy phase is long enough for the sustained trigger.
-   */
-  flappingTrigger: FlappingTriggerSchema.default(DEFAULT_FLAPPING_TRIGGER),
-  /**
-   * Minutes of sustained healthy state required before an auto-opened
-   * incident is auto-closed. `null` disables auto-close — the
-   * incident stays open until an operator resolves it manually.
-   */
-  autoCloseAfterMinutes: z
-    .number()
-    .int()
-    .min(1)
-    .nullable()
-    .default(30),
 });
 export type NotificationPolicy = z.infer<typeof NotificationPolicySchema>;
 export const DEFAULT_NOTIFICATION_POLICY: NotificationPolicy = {
   suppressDeEscalations: false,
-  autoOpenIncidentOnUnhealthy: true,
-  useNotificationSuppression: true,
-  skipDuringMaintenance: true,
-  sustainedUnhealthyTrigger: DEFAULT_SUSTAINED_TRIGGER,
-  flappingTrigger: DEFAULT_FLAPPING_TRIGGER,
-  autoCloseAfterMinutes: 30,
 };
 export const AssociateHealthCheckSchema = z.object({
@@ -328,6 +280,14 @@ export const AssociateHealthCheckSchema = z.object({
   stateThresholds: StateThresholdsSchema.optional(),
   /** IDs of satellites assigned to execute this health check */
   satelliteIds: z.array(z.string()).optional(),
+  /**
+   * Per-assignment environment selector for per-environment fan-out.
+   * `null`/omitted = all environments the system currently belongs to;
+   * non-empty array = exactly those (intersected with current membership);
+   * empty array `[]` = opt out (run once with no environment). `null` and
+   * `[]` are semantically distinct.
+   */
+  environmentIds: z.array(z.string()).nullable().optional(),
   /** Whether to also run this check locally on the core instance (default: true) */
   includeLocal: z.boolean().default(true),
   /** Per-association notification policy. Defaults applied when omitted. */
@@ -350,6 +310,11 @@ export const HealthCheckRunSchema = z.object({
   result: z.record(z.string(), z.unknown()),
   timestamp: z.date(),
   latencyMs: z.number().optional(),
+  /**
+   * Environment this run executed for (per-environment fan-out). undefined =
+   * env-less run (the opt-out / no-membership case).
+   */
+  environmentId: z.string().optional(),
   /** Source ID for result attribution (null = local core, UUID = satellite) */
   sourceId: z.string().optional(),
   /** Human-readable source label (e.g. "Local" or "EU West (eu-west-1)") */
@@ -390,6 +355,11 @@ export const HealthCheckRunPublicSchema = z.object({
   status: HealthCheckStatusSchema,
   timestamp: z.date(),
   latencyMs: z.number().optional(),
+  /**
+   * Environment this run executed for (per-environment fan-out). undefined =
+   * env-less run (the opt-out / no-membership case).
+   */
+  environmentId: z.string().optional(),
   /** Source ID for result attribution (null = local core, UUID = satellite) */
   sourceId: z.string().optional(),
   /** Human-readable source label (e.g. "Local" or "EU West (eu-west-1)") */