@checkstack/healthcheck-backend 1.3.0 → 1.4.0
- package/CHANGELOG.md +329 -0
- package/drizzle/0015_quiet_meggan.sql +12 -0
- package/drizzle/0016_complex_maginty.sql +1 -0
- package/drizzle/0017_pretty_caretaker.sql +1 -0
- package/drizzle/meta/0015_snapshot.json +764 -0
- package/drizzle/meta/0016_snapshot.json +644 -0
- package/drizzle/meta/0017_snapshot.json +563 -0
- package/drizzle/meta/_journal.json +21 -0
- package/package.json +24 -21
- package/src/automations.test.ts +6 -27
- package/src/automations.ts +32 -30
- package/src/collector-script-test.test.ts +236 -0
- package/src/collector-script-test.ts +221 -0
- package/src/health-entity.test.ts +698 -0
- package/src/health-entity.ts +369 -0
- package/src/health-state.test.ts +115 -0
- package/src/health-state.ts +333 -0
- package/src/healthcheck-gitops-kinds.test.ts +6 -32
- package/src/healthcheck-gitops-kinds.ts +4 -19
- package/src/hooks.test.ts +19 -6
- package/src/hooks.ts +13 -68
- package/src/index.ts +115 -48
- package/src/queue-executor.ts +243 -444
- package/src/retention-job.ts +65 -1
- package/src/retention-state-transitions.test.ts +49 -0
- package/src/router.test.ts +13 -0
- package/src/router.ts +44 -0
- package/src/schema.ts +34 -54
- package/src/service-notification-policy.test.ts +28 -71
- package/src/service.ts +89 -0
- package/src/state-transitions.test.ts +126 -0
- package/src/state-transitions.ts +112 -0
- package/tsconfig.json +9 -0
- package/src/auto-incident-close-job.ts +0 -164
- package/src/auto-incident.test.ts +0 -196
- package/src/auto-incident.ts +0 -332
package/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,334 @@
|
|
|
1
1
|
# @checkstack/healthcheck-backend
|
|
2
2
|
|
|
3
|
+
## 1.4.0
|
|
4
|
+
|
|
5
|
+
### Minor Changes
|
|
6
|
+
|
|
7
|
+
- 270ef29: Add the health-state provider data contract (automation sensing layer, Wave 2 Phase 13).
|
|
8
|
+
|
|
9
|
+
- New `health_check_state_transitions` table records every aggregate health-status transition for a system (all statuses, not just unhealthy), giving a reliable "in current status since" timestamp. Written wherever an aggregate transition is detected. Pruned with raw-run retention, but the single most-recent row per system is always kept so an active streak never blanks.
|
|
10
|
+
- New service-typed RPCs on `HealthCheckApi`: `getHealthState({ systemId, configurationId? })` returns `{ status, inStatusSince, inStatusForMs, latencyMs?, avgLatencyMs?, p95LatencyMs?, successRate?, lastRunAt?, inMaintenance, evaluatedAt }`, and `getBulkHealthState({ systemIds })` (POST) resolves many systems against one shared timestamp.
|
|
11
|
+
- New service-typed RPC on `MaintenanceApi`: `hasActiveMaintenance({ systemId })` reports whether a system is in an active maintenance window regardless of notification-suppression (suppression-agnostic), folded into `getHealthState` as `inMaintenance`.
|
|
12
|
+
|
|
13
|
+
All reads are fail-safe: a missing transition row yields `inStatusSince: null`, and a maintenance-plugin error fails open to `inMaintenance: false`.
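The fail-safe read behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: `HealthStateView`, `HealthStateDeps`, and `resolveHealthState` are hypothetical names, the dependencies are synchronous for brevity, and returning `inStatusForMs: null` when there is no transition row is an assumption. Only the `inStatusSince: null` and fail-open `inMaintenance: false` rules come from the entry itself.

```typescript
// Hypothetical view shape: a subset of the fields listed for getHealthState.
interface HealthStateView {
  status: string;
  inStatusSince: Date | null;
  inStatusForMs: number | null; // assumption: null when no transition row exists
  inMaintenance: boolean;
  evaluatedAt: Date;
}

// Hypothetical injected dependencies (synchronous here for brevity).
interface HealthStateDeps {
  currentStatus: (systemId: string) => string;
  latestTransitionAt: (systemId: string) => Date | undefined;
  hasActiveMaintenance: (systemId: string) => boolean; // may throw
}

function resolveHealthState(
  deps: HealthStateDeps,
  systemId: string,
  evaluatedAt: Date,
): HealthStateView {
  // A missing transition row yields inStatusSince: null instead of an error.
  const since = deps.latestTransitionAt(systemId) ?? null;

  // A maintenance-plugin error fails open to inMaintenance: false.
  let inMaintenance = false;
  try {
    inMaintenance = deps.hasActiveMaintenance(systemId);
  } catch {
    inMaintenance = false;
  }

  return {
    status: deps.currentStatus(systemId),
    inStatusSince: since,
    inStatusForMs: since === null ? null : evaluatedAt.getTime() - since.getTime(),
    inMaintenance,
    evaluatedAt,
  };
}
```

Passing `evaluatedAt` in from the caller mirrors the "one shared timestamp" idea of the bulk RPC: every system in a batch is measured against the same instant.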
|
|
14
|
+
|
|
15
|
+
- 270ef29: Add a windowed transition count to the health provider - the building block for custom flapping rules (Wave 2 Phase 18).
|
|
16
|
+
|
|
17
|
+
Flapping is already buildable today via the built-in `healthcheck.flapping_detected` trigger; this phase ships the GENERALIZATION for arbitrary "N status changes in M minutes" rules.
|
|
18
|
+
|
|
19
|
+
- `countStateTransitionsInWindow` counts aggregate status transitions for a system over a trailing window (from the Phase 13 `health_check_state_transitions` table - all statuses, generalizing the unhealthy-only flapping detector). Fail-safe to 0.
|
|
20
|
+
- `getHealthState` / `getBulkHealthState` now return `transitionsInWindow` + `transitionWindowMinutes`, and accept an optional `transitionWindowMinutes` input (default 60).
|
|
21
|
+
- The automation definition gains an optional top-level `state_window_minutes` (default 60), threaded through `enrichScopeWithState` so `health.system.transitions_in_window` / `health.system.transition_window_minutes` are folded into scope per evaluation.
|
|
22
|
+
- Operators author custom flapping as a `numeric_state` condition over `health.system.transitions_in_window` - no new condition variant, no editor change. The variable-scope resolver surfaces the new fields for autocomplete.
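As a rough illustration of the trailing-window count, the sketch below counts transitions for one system and fails safe to 0 when the lookup throws. The names `StateTransitionRow` and `countTransitionsInWindowSketch` and the in-memory loader are assumptions made for this example; the real `countStateTransitionsInWindow` reads the `health_check_state_transitions` table and its signature is not shown in this changelog.

```typescript
// Hypothetical row shape for an aggregate status transition.
interface StateTransitionRow {
  systemId: string;
  at: Date;
}

function countTransitionsInWindowSketch(
  loadRows: () => StateTransitionRow[],
  systemId: string,
  now: Date,
  windowMinutes: number = 60, // default window taken from the changelog entry
): number {
  try {
    const nowMs = now.getTime();
    const cutoffMs = nowMs - windowMinutes * 60000;
    // Count only this system's transitions inside the trailing window.
    return loadRows().filter(
      (row) =>
        row.systemId === systemId &&
        row.at.getTime() >= cutoffMs &&
        row.at.getTime() <= nowMs,
    ).length;
  } catch {
    // Fail-safe: a failed read counts as "no transitions".
    return 0;
  }
}
```

A custom flapping rule then reduces to a numeric comparison over the result, for example "more than 4 transitions in 30 minutes".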
|
|
23
|
+
|
|
24
|
+
- 270ef29: Replace the hardcoded auto-incident path with default automations (Wave 2 Phase 20).
|
|
25
|
+
|
|
26
|
+
BREAKING CHANGES: Auto-incident is now automation-driven. The hardcoded background path that opened incidents on sustained-unhealthy / flapping and closed them after a cooldown (`auto-incident.ts`, `auto-incident-close-job.ts`) is removed. On upgrade, an idempotent, threshold-preserving migration seeds equivalent default automations from each assignment's existing `NotificationPolicy`, so alerting behaviour is preserved 1:1:
|
|
27
|
+
|
|
28
|
+
- `sustainedUnhealthyTrigger.durationMinutes` -> the `for:` dwell on a `healthcheck.system_degraded` trigger -> `incident.create`.
|
|
29
|
+
- auto-close `autoCloseAfterMinutes` -> a `wait_until` (healthy continuously for the cooldown) -> `incident.resolve`.
|
|
30
|
+
- `useNotificationSuppression` -> the incident's `suppressNotifications`.
|
|
31
|
+
- `skipDuringMaintenance` -> a `{{ !health.system.in_maintenance }}` pre-run condition.
|
|
32
|
+
- `flappingTrigger.{transitions,windowMinutes}` -> a second automation on the `healthcheck.flapping_detected` trigger -> `incident.create`.
|
|
33
|
+
|
|
34
|
+
Auto-incidents remain ONE OPEN INCIDENT PER SYSTEM, faithful to the old behaviour. `incident.create` gains an opt-in `dedupe_open_for_system` config flag (default false, so existing/custom automations are unaffected): when true, it reuses an existing open incident on the target system instead of opening a duplicate (the old `findActiveAutoIncident(systemId)` semantic), returning the reused incident as the produced `incident` artifact. The seeded default automations set this flag, so a system with several failing checks - sustained and/or flapping - still gets a single open incident; whichever check crosses its threshold first opens it, and the rest dedupe to it. Both sustained and flapping default automations open at `critical` severity (parity with the old path). Per-system run dedup within an automation uses `concurrency_scope: "context_key"` + `mode: "single"`.
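The dedupe semantic can be pictured with a small in-memory sketch. Everything here is hypothetical (`IncidentRecord`, `createIncidentSketch`, the array-backed store, the generated ids); only the rule itself is taken from the entry: with the flag on, an existing open incident on the target system is reused and returned instead of a duplicate being opened.

```typescript
// Hypothetical incident shape for illustration only.
interface IncidentRecord {
  id: string;
  systemId: string;
  status: "open" | "resolved";
}

function createIncidentSketch(
  store: IncidentRecord[],
  systemId: string,
  dedupeOpenForSystem: boolean = false, // opt-in, defaults off
): IncidentRecord {
  if (dedupeOpenForSystem) {
    // Reuse the open incident on this system, if any.
    const existing = store.find(
      (incident) => incident.systemId === systemId && incident.status === "open",
    );
    if (existing) return existing;
  }
  const created: IncidentRecord = {
    id: `inc-${store.length + 1}`,
    systemId,
    status: "open",
  };
  store.push(created);
  return created;
}
```

With the flag off the function always opens a new incident, which is why existing or custom automations are unaffected by the addition.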
|
|
35
|
+
|
|
36
|
+
Operators can read, edit, disable, and extend these automations (see the "Customise auto-incident" guide). Seeded automations are tagged via `managedBy` (`auto-incident:<systemId>:<configurationId>:<kind>`) so the migration is a no-op on re-runs; anything unmappable is recorded as a migration-failure row.
|
|
37
|
+
|
|
38
|
+
Flapping DETECTION (transition recording + the `healthcheck.flapping_detected` emit) is relocated into `flapping-detector.ts` and survives; the emit now fires unconditionally on a threshold cross (no longer gated on `autoOpenIncidentOnUnhealthy`), matching the hook's documented intent and required for the flapping default automation. The legacy `health_check_auto_incidents` mapping table is no longer written or read (it will be dropped in a follow-up migration); `health_check_unhealthy_transitions` is retained for the flapping detector.
|
|
39
|
+
|
|
40
|
+
New service-typed `HealthCheckApi.listAutoIncidentPolicies` RPC exposes each assignment's effective notification policy for the migration. `incident.create` adds the `dedupe_open_for_system` flag (additive, defaults off).
|
|
41
|
+
|
|
42
|
+
- b995afb: Make the per-system aggregated `health` a PLUGIN-BACKED, COMPUTE-ON-READ reactive entity via the Model-B entity state machine.
|
|
43
|
+
|
|
44
|
+
Healthcheck defines a `health` entity `{ status, healthyChecks, totalChecks }` keyed by `systemId`. There is NO framework storage and NO domain table of its own: the `read` accessor DERIVES the view on demand from the same durable health data the rest of the plugin reads (`health_check_runs` via `getSystemHealthStatus`), gated on the system having at least one enabled check association (see the first-run-degradation fix changeset). Storing a second materialized copy would duplicate the engine's source of truth and risk drift, so the aggregate is computed, not mirrored.
|
|
45
|
+
|
|
46
|
+
Each evaluation-site write drives `handle.mutate({ id: systemId, apply })`, where `apply` performs the REAL durable write (insert run + increment the hourly aggregate) and returns the freshly-computed view. The framework snapshots `prev` via `read` BEFORE the run is persisted, so a real status change still produces exactly one correct `ENTITY_CHANGED` with an accurate `prev` -> `next`. The write is fail-soft (a framework reactivity error after the durable write commits never breaks check execution) and diff-suppressed (an unchanged aggregate is a no-op). Raw `health_check_runs` stay intentionally non-reactive (`declareNonReactiveState`, raw-sample).
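The snapshot, apply, diff, emit sequence might look roughly like the sketch below. It is a simplified stand-in under stated assumptions: `mutateHealthSketch`, `HealthViewSketch`, and the callback parameters are hypothetical, and the real `handle.mutate` is a framework API whose exact signature is not shown here.

```typescript
interface HealthViewSketch {
  status: string;
  healthyChecks: number;
  totalChecks: number;
}

interface HealthChangeSketch {
  id: string;
  prev: HealthViewSketch | null;
  next: HealthViewSketch;
}

function isSameView(a: HealthViewSketch | null, b: HealthViewSketch): boolean {
  return (
    a !== null &&
    a.status === b.status &&
    a.healthyChecks === b.healthyChecks &&
    a.totalChecks === b.totalChecks
  );
}

function mutateHealthSketch(
  id: string,
  read: (id: string) => HealthViewSketch | null,
  apply: () => HealthViewSketch,
  emit: (change: HealthChangeSketch) => void,
): HealthViewSketch {
  // 1. Snapshot prev BEFORE the durable write happens.
  const prev = read(id);
  // 2. Perform the real write; errors here propagate to the caller.
  const next = apply();
  // 3. Diff-suppress: an unchanged aggregate emits nothing.
  if (isSameView(prev, next)) return next;
  // 4. Fail-soft: an emit failure after the write never breaks the caller.
  try {
    emit({ id, prev, next });
  } catch {
    // swallowed on purpose in this sketch; real code would log it
  }
  return next;
}
```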
|
|
47
|
+
|
|
48
|
+
A behavior-preserving change-to-trigger-event deriver maps each status transition to the existing qualified TRIGGER events (the underscore trigger ids that automations match on, not the dotted hook ids):
|
|
49
|
+
|
|
50
|
+
- recovery (`prev !== healthy` to `next === healthy`) to `healthcheck.system_healthy` + `healthcheck.system_health_changed`
|
|
51
|
+
- degradation (`prev === healthy` to `next !== healthy`) to `healthcheck.system_degraded` + `healthcheck.system_health_changed`
|
|
52
|
+
- any other transition to `healthcheck.system_health_changed`
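The three mapping rules above can be written as a small pure function. This is a sketch, assuming statuses are plain strings with `"healthy"` as the only healthy value; `deriveTriggerEventsSketch` is a hypothetical name. Treating a null side as "no transition" follows the first-run-degradation entry later in this changelog.

```typescript
function deriveTriggerEventsSketch(
  prev: string | null,
  next: string | null,
): string[] {
  // A null side (create or delete) and an unchanged status are not transitions.
  if (prev === null || next === null || prev === next) return [];

  const changed = "healthcheck.system_health_changed";
  if (prev !== "healthy" && next === "healthy") {
    // Recovery.
    return ["healthcheck.system_healthy", changed];
  }
  if (prev === "healthy" && next !== "healthy") {
    // Degradation.
    return ["healthcheck.system_degraded", changed];
  }
  // Any other transition, e.g. between two non-healthy statuses.
  return [changed];
}
```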
|
|
53
|
+
|
|
54
|
+
`classifyHealthChange` lets cross-plugin consumers (slo, dependency) reproduce the old directional `systemDegraded` / `systemHealthy` predicates from a `health` change read via `onEntityChanged({ kind: "health" })`. The transition history in `entity_transitions` is recorded for every change.
|
|
55
|
+
|
|
56
|
+
BREAKING CHANGES:
|
|
57
|
+
|
|
58
|
+
- The `health` entity's current state is computed on read from the durable `health_check_*` tables; there is no stored current-state row (no framework `entity_state`, no domain mirror). Any code reading current aggregated health must read through the entity `read` accessor / `handle.get` / `getMany`, scope enrichment, or `onEntityChanged`. Durable history in `entity_transitions` is unaffected. (The cross-plugin `healthcheck.system.degraded` / `.healthy` / `.health_changed` hooks are removed in the healthcheck/catalog hook-removal changeset; the reactive entity drives the matching trigger events so existing automations keep firing.)
|
|
59
|
+
|
|
60
|
+
- b995afb: Fix two correctness defects in the reactive `health` entity: suppressed first-run degradation, and duplicate `ENTITY_CHANGED` under concurrent N-pod evaluation.
|
|
61
|
+
|
|
62
|
+
**First-run degradation was silently dropped (data-loss).** The compute-on-read `health` entity gated on the system having at least one persisted `health_check_runs` row, so a system's very first evaluation snapshotted `prev = null` (a create). The deriver and `classifyHealthChange` both treat a null side as "no transition", so a first-ever run that came up unhealthy fired NO `system_degraded` / `health_changed` trigger and NO `degraded` `onEntityChanged` - meaning SLO / dependency consumers never opened downtime. If the system stayed unhealthy, `prev === next` forever and the event never fired. The executor's own pre-run baseline (`getSystemHealthStatus`, no run gate) DID see the transition, so the entity and the executor disagreed.
|
|
63
|
+
|
|
64
|
+
Fix: the existence gate is now on ENABLED check ASSOCIATIONS, not on persisted runs. A system with at least one enabled check resolves to the SAME default-`healthy` baseline `getSystemHealthStatus` returns for an empty run window (`{ status: "healthy", healthyChecks: N, totalChecks: N }`); a system with no enabled checks still has no entity. So a first-ever unhealthy run is now a real `healthy -> degraded` diff that fires `system_degraded` + `health_changed` and opens SLO / dependency downtime. The entity and the executor now agree on the pre-run baseline.
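The new existence gate can be sketched as below. The function name, its parameters, and the idea of passing a precomputed `derived` view are assumptions for illustration; the default-`healthy` baseline shape is the one quoted in the fix description.

```typescript
interface BaselineView {
  status: string;
  healthyChecks: number;
  totalChecks: number;
}

function readHealthEntitySketch(
  enabledCheckCount: number,
  derived: BaselineView | undefined, // view computed from runs, if any exist
): BaselineView | null {
  // No enabled check associations: the system has no health entity.
  if (enabledCheckCount === 0) return null;
  // Enabled checks but no runs yet: default-healthy baseline.
  return (
    derived ?? {
      status: "healthy",
      healthyChecks: enabledCheckCount,
      totalChecks: enabledCheckCount,
    }
  );
}
```

Because the pre-run snapshot is now a `healthy` view rather than `null`, a first unhealthy run diffs as `healthy -> degraded` instead of as a create.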
|
|
65
|
+
|
|
66
|
+
**Concurrent evaluations of one system double-emitted (race / data-loss).** `writeHealthEntity -> handle.mutate` snapshotted `prev`, applied, and diffed with NO advisory lock. Two concurrent evaluations of one system (multiple per-config jobs across pods, or at-least-once redelivery) could both snapshot `prev = healthy`, both insert a failing run, both diff `healthy -> degraded`, and both emit - yielding two `ENTITY_CHANGED` + two `entity_transitions` rows for one logical transition (inflating `transitionCount` / flapping and re-running dependency notify).
|
|
67
|
+
|
|
68
|
+
Fix: each system's snapshot-`prev` + `apply` + diff + emit is now serialized through a transaction-scoped advisory lock keyed `health:<systemId>` (`withXactLock` from `@checkstack/backend-api`), wired into `writeHealthEntity` via an injected `serialize` and applied at all three evaluation-write sites. Two concurrent evals of one system now collapse to exactly one emit and one transition row. The durable run/aggregate write is unchanged; only the snapshot/diff/emit window is protected.
|
|
69
|
+
|
|
70
|
+
BREAKING CHANGES:
|
|
71
|
+
|
|
72
|
+
- A system with an enabled health check now has a resolvable `health` entity BEFORE its first run (default-`healthy` baseline), where previously it had none until the first run persisted. Code that relied on the entity being absent for run-less-but-configured systems (e.g. treating a missing entity as "not yet monitored") should instead treat a `healthy` baseline as "configured, no failing signal yet". Systems with no enabled checks still have no entity.
|
|
73
|
+
|
|
74
|
+
- b995afb: Remove the now-unused healthcheck + catalog entity hooks; rely on the reactive entities + change derivers (reactive automation engine Phase 4, final step of §10.3 / §10.4).
|
|
75
|
+
|
|
76
|
+
Now that every cross-plugin consumer (slo, dependency, incident, and healthcheck's own catalog-cleanup) reads these domains via `onEntityChanged`, the producers stop emitting the entity-change hooks and the trigger registrations become entity-driven (fired by the entity change deriver via Stage-1 routing, with a no-op `setup` so they stay in the editor's trigger catalog).
|
|
77
|
+
|
|
78
|
+
- **healthcheck**: stops emitting `healthcheck.system.degraded` / `.healthy` / `.health_changed` from the queue executor (the `health` entity mirror is the single source of truth). Its own `catalog.system.deleted` consumer switched to `onEntityChanged({ kind: "catalog-system" })` on tombstones (work-queue delivery preserved). The directional/umbrella triggers are now entity-driven.
|
|
79
|
+
- **catalog**: stops emitting `catalog.system.created` / `.updated` / `.deleted` and `catalog.group.created` / `.deleted` from the router + the `system.update_metadata` action (the `catalog-system` / `catalog-group` mirrors are authoritative). The system triggers are now entity-driven.
|
|
80
|
+
|
|
81
|
+
CORRECTNESS FIX (also affects the earlier healthcheck/catalog Phase-4 steps in this branch): the change derivers now emit the TRIGGER qualifiedIds that automations actually store in `trigger.event` and that Stage-1 routing matches on (`findEnabledByTriggerEvent`), NOT the dotted hook ids. Healthcheck triggers use underscore ids, so the deriver emits `healthcheck.system_degraded` / `system_healthy` / `system_health_changed` (not `healthcheck.system.degraded`). Catalog system triggers use ids `created`/`updated`/`deleted`, so the deriver emits `catalog.created` / `catalog.updated` / `catalog.deleted` (not `catalog.system.created`). Without this fix the migrated automations would never fire.
|
|
82
|
+
|
|
83
|
+
BREAKING CHANGES:
|
|
84
|
+
|
|
85
|
+
- `healthcheck.system.degraded` / `healthcheck.system.healthy` / `healthcheck.system.health_changed` cross-plugin hooks are removed. The reactive `health` entity drives the matching trigger events (`healthcheck.system_degraded` / `_healthy` / `_health_changed`), so existing automations keep firing. Kept healthcheck hooks: `assignment.changed`, `check.completed`, `check.failed`, `flapping_detected`.
|
|
86
|
+
- `catalog.system.created` / `.updated` / `.deleted` and `catalog.group.created` / `.deleted` cross-plugin hooks are removed. The reactive `catalog-system` / `catalog-group` entities drive the matching trigger events (`catalog.created` / `.updated` / `.deleted`); cross-plugin cleanup reactors subscribe to the `catalog-system` tombstone via `onEntityChanged`. `catalogHooks` / `healthCheckHooks` remain exported (the removed members are gone) for a stable import surface.
|
|
87
|
+
|
|
88
|
+
- b995afb: Move health-check flapping configuration from the per-assignment notification policy onto the `healthcheck.flapping_detected` automation trigger.
|
|
89
|
+
|
|
90
|
+
Flapping thresholds (`transitions`, `windowMinutes`) are now configured on the trigger itself, next to the automation that reacts to them, instead of on each check assignment. The health-check executor still owns the windowed transition counting (it writes `health_check_unhealthy_transitions` and runs the window query), but it now SOURCES the thresholds from the subscribed automations' trigger config:
|
|
91
|
+
|
|
92
|
+
- On a transition-to-unhealthy it records the transition unconditionally (keeping history warm), then looks up the enabled automations subscribed to `healthcheck.flapping_detected`, collects the distinct set of configured windows, counts transitions once per distinct window, and emits one `healthcheck.flapping_detected` per window. The trigger's exact-window `evaluateConfig` gate then fires each automation only for its own window and transition threshold.
|
|
93
|
+
- A missing or partial flapping trigger config defaults to `{ transitions: 3, windowMinutes: 60 }`, so automations created before the trigger carried config keep working unchanged.
|
|
94
|
+
- `automation-backend` exposes a new backend-only, read-only `automationSubscriptionsRef` service ref (`findEnabledByTriggerEvent`) so a plugin that owns a trigger's underlying event can discover its subscribers' trigger config. It is never browser-exposed.
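A hedged sketch of how thresholds might be sourced from subscriber configs: default a missing or partial config, then collect the distinct windows so each window is counted once. `FlappingConfigSketch`, `normalizeFlappingConfig`, and `distinctWindows` are hypothetical names; the `{ transitions: 3, windowMinutes: 60 }` default comes from the entry above.

```typescript
interface FlappingConfigSketch {
  transitions: number;
  windowMinutes: number;
}

const DEFAULT_FLAPPING: FlappingConfigSketch = { transitions: 3, windowMinutes: 60 };

// Fill in whichever fields a missing or partial trigger config leaves out.
function normalizeFlappingConfig(
  raw?: Partial<FlappingConfigSketch> | null,
): FlappingConfigSketch {
  return {
    transitions: raw?.transitions ?? DEFAULT_FLAPPING.transitions,
    windowMinutes: raw?.windowMinutes ?? DEFAULT_FLAPPING.windowMinutes,
  };
}

// One window query per distinct window, however many automations share it.
function distinctWindows(
  configs: Array<Partial<FlappingConfigSketch> | null | undefined>,
): number[] {
  const windows = configs.map((c) => normalizeFlappingConfig(c).windowMinutes);
  return Array.from(new Set(windows)).sort((a, b) => a - b);
}
```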
|
|
95
|
+
|
|
96
|
+
**BREAKING CHANGES**
|
|
97
|
+
|
|
98
|
+
- The per-assignment `notificationPolicy.flappingTrigger` field is removed. `NotificationPolicy` is now `{ suppressDeEscalations }` only. Stored rows that still carry a `flappingTrigger` key parse cleanly - the key is stripped on read - so no data migration is required, but the per-check flapping toggle/threshold in the assignment Notifications tab is gone; configure flapping on the trigger instead.
|
|
99
|
+
- The GitOps `System.healthcheck[].notificationPolicy.flappingTrigger` field is removed. A `flappingTrigger` block in a manifest is ignored. Move the thresholds to the `transitions` / `windowMinutes` config of your `healthcheck.flapping_detected` automation trigger.
|
|
100
|
+
- The standalone `enabled` flag for flapping is gone: flapping is "enabled" precisely when at least one enabled automation subscribes to `healthcheck.flapping_detected`. With no subscriber, the transition is still recorded but nothing is counted or emitted.
|
|
101
|
+
|
|
102
|
+
- b995afb: Restore the documented domain payload fields on entity-driven automation triggers.
|
|
103
|
+
|
|
104
|
+
Migrated triggers declare domain-named `payloadSchema`s (incident `incidentId`; health `systemId` / `previousStatus`; catalog `systemId` / `changedFields`; dependency `dependencyId`), but Stage-2 dispatch built `trigger.payload` from the generic entity-change shape (`{ kind, id, prev, next, delta, ...next }`). Operator filters and templates reading `trigger.payload.incidentId` / `.systemId` / `.previousStatus` silently resolved to `undefined` — a regression vs the legacy hook payloads.
|
|
105
|
+
|
|
106
|
+
Changes:
|
|
107
|
+
|
|
108
|
+
- `@checkstack/automation-backend`: `registerChangeDeriver` now accepts an optional per-kind `toPayload(changed) => Record<string, unknown>` mapper (at most one per kind; a second distinct mapper throws). Stage-2's `changedToPayload` uses the registered mapper to build `trigger.payload` so it matches the kind's declared `payloadSchema`, falling back to the generic change shape for kinds without a mapper. New exported type `EntityChangePayloadMapper`.
|
|
109
|
+
- `@checkstack/incident-backend`, `@checkstack/healthcheck-backend`, `@checkstack/catalog-backend`, `@checkstack/dependency-backend`: implement and register a `toPayload` for each entity-driven kind so `trigger.payload` carries the legacy domain keys again.
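The per-kind mapper rule can be illustrated with a small registry. This is not the `registerChangeDeriver` API: `PayloadMapperRegistry`, `EntityChangedSketch`, and the method names are hypothetical, and the health mapper's `newStatus` key is borrowed from a filter example elsewhere in this changelog. The sketch only shows the three stated behaviours: one mapper per kind, a second distinct mapper throws, and kinds without a mapper fall back to the generic change shape.

```typescript
interface EntityChangedSketch {
  kind: string;
  id: string;
  prev: Record<string, unknown> | null;
  next: Record<string, unknown> | null;
  delta: Record<string, unknown>;
}

type PayloadMapperSketch = (changed: EntityChangedSketch) => Record<string, unknown>;

class PayloadMapperRegistry {
  private mappers = new Map<string, PayloadMapperSketch>();

  register(kind: string, toPayload?: PayloadMapperSketch): void {
    if (!toPayload) return; // the mapper is optional
    const existing = this.mappers.get(kind);
    if (existing && existing !== toPayload) {
      throw new Error(`a different payload mapper is already registered for "${kind}"`);
    }
    this.mappers.set(kind, toPayload);
  }

  changedToPayload(changed: EntityChangedSketch): Record<string, unknown> {
    const mapper = this.mappers.get(changed.kind);
    if (mapper) return mapper(changed);
    // Generic fallback: { kind, id, prev, next, delta, ...next }.
    return {
      kind: changed.kind,
      id: changed.id,
      prev: changed.prev,
      next: changed.next,
      delta: changed.delta,
      ...(changed.next ?? {}),
    };
  }
}

// Hypothetical health mapper restoring the domain-named keys.
const healthToPayload: PayloadMapperSketch = (changed) => ({
  systemId: changed.id,
  previousStatus: changed.prev?.status,
  newStatus: changed.next?.status,
});
```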
|
|
110
|
+
|
|
111
|
+
Descriptive incident payload fields not derivable from the reactive entity state (`title`, `description`, `createdAt`, `resolvedAt`) are now OPTIONAL on the incident trigger `payloadSchema`s — they were always absent from an entity-driven payload.
|
|
112
|
+
|
|
113
|
+
- b995afb: Remove the legacy per-assignment auto-incident system. Auto-incidents are now built entirely by user-authored automations; nothing is seeded or hardcoded.
|
|
114
|
+
|
|
115
|
+
What was removed:
|
|
116
|
+
|
|
117
|
+
- The one-time migration that auto-seeded "sustained unhealthy" and "flapping" default automations from each assignment's notification policy, plus the `listAutoIncidentPolicies` RPC it consumed.
|
|
118
|
+
- The seeder-only notification-policy settings and their UI: `autoOpenIncidentOnUnhealthy`, `useNotificationSuppression`, `skipDuringMaintenance`, `sustainedUnhealthyTrigger`, and `autoCloseAfterMinutes`. The assignment **Notifications** tab now exposes only the two live settings: **Suppress de-escalation notifications** and the **flapping-detection** thresholds.
|
|
119
|
+
- The dead `health_check_auto_incidents` table (no longer written or read; dropped via migration).
|
|
120
|
+
|
|
121
|
+
What is preserved: flapping detection (`healthcheck.flapping_detected`) and de-escalation suppression are unchanged. The `flappingTrigger` and `suppressDeEscalations` policy fields stay exactly as before.
|
|
122
|
+
|
|
123
|
+
> [!NOTE]
|
|
124
|
+
> One-time cleanup: an automation-backend migration deletes the historically auto-seeded incident automations (`managed_by LIKE 'auto-incident:%'`) from existing databases. This is intentional and destructive - those automations were no longer managed by anything. If you had edited a seeded automation and want to keep it, re-create it as a normal automation before upgrading. See the "Build auto-incident automations" guide for templates.
|
|
125
|
+
|
|
126
|
+
> [!IMPORTANT]
|
|
127
|
+
> NARROWING: `NotificationPolicySchema` is narrowed to `{ suppressDeEscalations, flappingTrigger }`. Stored rows that still carry the removed legacy keys parse cleanly - zod strips the unknown keys on read - so no data migration is required for the `system_health_checks.notification_policy` column. GitOps `notificationPolicy` specs that set the removed fields are no longer accepted for those keys.
|
|
128
|
+
|
|
129
|
+
- 270ef29: Extend in-UI script testing to health-check collectors, and add
|
|
130
|
+
load-from-run replay for automation script tests.
|
|
131
|
+
|
|
132
|
+
- Health-check collectors: a new `testCollectorScript` RPC runs the
|
|
133
|
+
inline-script (TypeScript) collector and the shell `script` collector
|
|
134
|
+
against an editable, auto-seeded sample context using the same
|
|
135
|
+
sandboxed runner the real collector uses. Surfaces beneath the
|
|
136
|
+
collector script fields in the collector editor (both marked
|
|
137
|
+
`x-script-testable`). Gated by `healthcheck.configuration.manage`.
|
|
138
|
+
- Automation replay: a new `getRunScopeForReplay` RPC reconstructs an
|
|
139
|
+
editable test context from a real run (trigger + persisted artifacts,
|
|
140
|
+
plus the durable scope snapshot when the run is still in-flight), and
|
|
141
|
+
the script-test panel gains a "Load from run" picker that seeds the
|
|
142
|
+
sample context from a past run.
|
|
143
|
+
|
|
144
|
+
Note: health-check executions do not persist the script / config /
|
|
145
|
+
check / system that produced a result, so there is no health-check
|
|
146
|
+
replay - auto-seed is the only context source for collector tests. This
|
|
147
|
+
is by design; see the feature plan.
|
|
148
|
+
|
|
149
|
+
- 270ef29: Activate npm packages in script execution: thread the managed
|
|
150
|
+
`resolutionRoot` into every user-script call site so an allowlisted package
|
|
151
|
+
can actually be `import`ed.
|
|
152
|
+
|
|
153
|
+
- `@checkstack/backend-api`: the ESM runner now always writes a per-run
|
|
154
|
+
`bunfig.toml` with `[install] auto = "disable"` and runs with that dir as
|
|
155
|
+
CWD. Without this Bun silently auto-installs any imported package from the
|
|
156
|
+
registry (verified), defeating the allowlist; with it, imports resolve
|
|
157
|
+
only against the reconciled `current/node_modules` (when a `resolutionRoot`
|
|
158
|
+
is set) and otherwise fail fast.
|
|
159
|
+
- `@checkstack/script-packages-backend`: `resolveResolutionRoot` /
|
|
160
|
+
`resolveResolutionRootFromStore` / `resolveResolutionRootForHost` decide a
|
|
161
|
+
host's resolution-root status (`none` / `ready` / `notReady`) from the
|
|
162
|
+
local `<store>/current`.
|
|
163
|
+
- `run_script` (integration-script-backend), the inline-script collector
|
|
164
|
+
(healthcheck-script-backend, core + satellite), and the in-UI `testScript`
|
|
165
|
+
/ `testCollectorScript` endpoints all resolve the root per run and pass it
|
|
166
|
+
to the runner; `run_script` surfaces a clear "npm packages not ready"
|
|
167
|
+
error when configured-but-unsynced. Shell paths are unaffected (no module
|
|
168
|
+
resolution).
|
|
169
|
+
|
|
170
|
+
An opt-in end-to-end test (`CHECKSTACK_E2E_NETWORK=1`) proves an allowlisted
|
|
171
|
+
package imports successfully through the real `run_script` action execute
|
|
172
|
+
path, with non-network degradation tests running always.
|
|
173
|
+
|
|
174
|
+
BREAKING CHANGES: `@checkstack/backend-api`'s `defaultEsmScriptRunner` now
|
|
175
|
+
always disables Bun auto-install for the user subprocess. A script that
|
|
176
|
+
previously relied on Bun silently fetching an un-vendored package from the
|
|
177
|
+
registry at import time will now fail to resolve it. This is intentional -
|
|
178
|
+
package availability is governed by the admin allowlist - but any caller
|
|
179
|
+
depending on the old implicit auto-install behavior must add the package to
|
|
180
|
+
the allowlist instead. The new `EsmScriptRunOptions.resolutionRoot` field is
|
|
181
|
+
optional and additive (defaults to today's `os.tmpdir()` behavior when
|
|
182
|
+
unset), so the runner API itself is source-compatible.
|
|
183
|
+
|
|
184
|
+
- 270ef29: Secrets platform Phase 2: secret -> env-var mapping with central resolve, inject, and mask.
|
|
185
|
+
|
|
186
|
+
- Script consumers declare a least-privilege `secretEnv` allowlist
|
|
187
|
+
(`{ ENV_NAME: "${{ secrets.NAME }}" }`). The automation `run_script` /
|
|
188
|
+
`run_shell` actions resolve ONLY the declared secrets via
|
|
189
|
+
`secretResolverRef.resolveForRun`, inject them into the runner env for
|
|
190
|
+
that run (memory-only; the ESM runner gained a per-run `env` option), and
|
|
191
|
+
mask their values out of stdout/stderr/result/error via the run-scoped
|
|
192
|
+
masking context. A missing required secret fails the run clearly. No
|
|
193
|
+
ambient secret access.
|
|
194
|
+
- Test panel: `testScript` / `testCollectorScript` inject named
|
|
195
|
+
`__SECRET_<NAME>__` placeholders by default, or user-supplied per-secret
|
|
196
|
+
overrides; real production values are never resolved in the test path,
|
|
197
|
+
and overrides are masked out of the result.
|
|
198
|
+
- Healthcheck collectors carry the `secretEnv` field for authoring +
|
|
199
|
+
the test panel; runtime injection on satellites lands in Phase 3.
|
|
200
|
+
- Editor UX: a new `@checkstack/ui` `SecretEnvEditor` renders `x-secret-env`
|
|
201
|
+
record fields with `${{ secrets.* }}` name autocomplete (from
|
|
202
|
+
`listSecretNames`), wired into the automation action editor and the
|
|
203
|
+
healthcheck collector editor. New `withConfigMeta` helper +
|
|
204
|
+
`x-secret-env` config-meta key in `@checkstack/backend-api`.
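The resolve, inject, mask flow might be sketched as follows. All names here (`resolveSecretEnvSketch`, `maskValuesSketch`, the `lookup` callback, the `***` replacement) are assumptions for illustration; the real path goes through `secretResolverRef.resolveForRun` and a run-scoped masking context whose signatures are not shown in this changelog. The sketch keeps the two stated rules: only declared secrets are resolved, and a missing required secret fails the run.

```typescript
// Matches a declared reference of the form ${{ secrets.NAME }}.
const SECRET_REF = /^\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}$/;

function resolveSecretEnvSketch(
  secretEnv: Record<string, string>,
  lookup: (secretName: string) => string | undefined,
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const envName of Object.keys(secretEnv)) {
    const ref = secretEnv[envName] ?? "";
    const match = SECRET_REF.exec(ref);
    const secretName = match ? match[1] ?? "" : "";
    if (secretName === "") {
      throw new Error(`secretEnv.${envName} is not a secrets reference`);
    }
    // Only secrets named in the allowlist are ever looked up.
    const value = lookup(secretName);
    if (value === undefined) {
      throw new Error(`missing required secret: ${secretName}`);
    }
    env[envName] = value;
  }
  return env;
}

// Replace every occurrence of each resolved value before output is stored.
function maskValuesSketch(text: string, values: string[]): string {
  return values
    .filter((v) => v.length > 0)
    .sort((a, b) => b.length - a.length) // longest first, so substrings do not leak
    .reduce((out, v) => out.split(v).join("***"), text);
}
```

Masking longest values first is a design choice in this sketch: if one secret is a prefix of another, replacing the shorter one first would leave the tail of the longer one visible.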
|
|
205
|
+
|
|
206
|
+
- 270ef29: Secrets platform Phase 3: just-in-time secret delivery to satellites + source-side masking, and central-execution injection for healthcheck collectors.
|
|
207
|
+
|
|
208
|
+
- New satellite WS messages `request_run_secrets` / `run_secrets`: just
|
|
209
|
+
before a satellite runs a collector that declares a `secretEnv`, it asks
|
|
210
|
+
core for that collector's resolved env; core resolves ONLY the secrets the
|
|
211
|
+
collector's OWN persisted assignment declares (least-privilege — the
|
|
212
|
+
satellite cannot choose) and replies with the env map (or a clear error).
|
|
213
|
+
The satellite injects it memory-only for the run and drops it on
|
|
214
|
+
completion. Secrets never ride the persisted assignment and never touch
|
|
215
|
+
disk.
|
|
216
|
+
- Source-side masking: the satellite runs `maskSecrets` over the collector's
|
|
217
|
+
stdout/stderr/result/error using the run's delivered values BEFORE the
|
|
218
|
+
result leaves the satellite (defense in depth).
|
|
219
|
+
- `CollectorStrategy.execute` gains an optional `secretEnv`. The
|
|
220
|
+
inline-script and shell collectors inject it into the runner
|
|
221
|
+
(`process.env` / `$VAR`) and mask the values out of their output.
|
|
222
|
+
- Healthcheck collectors running centrally (the queue executor) also resolve
|
|
223
|
+
and inject `secretEnv` via `secretResolverRef`, closing the gap where a
|
|
224
|
+
centrally-run secretEnv collector got no secrets. A missing required
|
|
225
|
+
secret fails the run clearly in all paths.
|
|
226
|
+
|
|
227
|
+
- b995afb: Add an optional `partitionBy` override to the windowed-count trigger gate.
|
|
228
|
+
|
|
229
|
+
A trigger's `window` block now accepts `partitionBy`, a bare expression (same flavour as `filter`, no `{{ }}`) that controls the key the occurrence count is bucketed by. When omitted, the gate keys by the trigger's built-in context key exactly as before (per system for health triggers), so existing automations are unchanged. When set, the expression is evaluated against the same trigger scope `filter` uses and coerced to a string - e.g. `trigger.payload.severity` for a per-severity rate, or `trigger.payload.systemId + ":" + trigger.payload.checkId` for a composite key. If the expression evaluates to null/undefined/empty or fails to evaluate, the gate falls back to the built-in context key (never global counting); eval errors are logged, matching the gate's fail-open posture.
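The fallback rules for `partitionBy` can be sketched as a small function. `resolvePartitionKeySketch` and its parameters are hypothetical, and the expression is modelled as an already-compiled callback rather than the engine's real evaluator.

```typescript
function resolvePartitionKeySketch(
  evaluate: (() => unknown) | undefined, // compiled partitionBy expression, if set
  contextKey: string, // the trigger's built-in context key
  logError: (err: unknown) => void = () => {},
): string {
  // partitionBy omitted: key by the built-in context key, as before.
  if (!evaluate) return contextKey;
  try {
    const value = evaluate();
    if (value === null || value === undefined) return contextKey;
    const key = String(value); // coerced to a string
    return key === "" ? contextKey : key;
  } catch (err) {
    // Eval errors are logged, then the gate falls back (never global counting).
    logError(err);
    return contextKey;
  }
}
```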
|
|
230
|
+
|
|
231
|
+
Triggers can now declare `contextKeyLabel` (a UI hint, e.g. `"system"`) describing their built-in context dimension. It is surfaced through `TriggerInfo` so the editor's window "Partition by" field shows the default partition ("Leave blank to count per system" / "per automation" when a trigger has no context key). The healthcheck system triggers (`system_health_changed`, `system_degraded`, `system_healthy`, `check_failed`) and the built-in `numeric_state` trigger set it to `"system"`. This is a pure UI hint with no runtime behaviour.
|
|
232
|
+
|
|
233
|
+
The automation editor's window block gains a "Partition by" expression input (reusing the trigger filter's `trigger.payload.*` autocomplete), and the collapsed trigger card summary shows the partition when set.
|
|
234
|
+
|
|
235
|
+
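The `partitionBy` fallback rules described above can be sketched roughly as follows. This is an illustration only, not the package's code: `resolvePartitionKey`, `Scope`, `Evaluate`, and the injected `evaluate` callback are hypothetical names, and the real expression evaluator is not shown in this diff.

```typescript
// Hypothetical types for illustration; none of these names come from the package.
type Scope = { trigger: { payload: Record<string, unknown> } };
type Evaluate = (expression: string, scope: Scope) => unknown;

function resolvePartitionKey(
  partitionBy: string | undefined,
  builtInContextKey: string,
  scope: Scope,
  evaluate: Evaluate,
  logError: (message: string) => void = console.error,
): string {
  // No override: key by the trigger's built-in context key, exactly as before.
  if (partitionBy === undefined || partitionBy.trim() === "") {
    return builtInContextKey;
  }
  try {
    const value = evaluate(partitionBy, scope);
    // null / undefined / empty results fall back instead of counting globally.
    if (value === null || value === undefined) {
      return builtInContextKey;
    }
    const key = String(value);
    return key === "" ? builtInContextKey : key;
  } catch (error) {
    // Evaluation errors are logged and the gate falls back (fail-open).
    logError(`partitionBy evaluation failed: ${String(error)}`);
    return builtInContextKey;
  }
}
```

Passing the evaluator in as a parameter keeps the sketch independent of whichever expression engine the automation backend actually uses.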
- b995afb: Add a generic windowed-count / rate trigger gate, and express flapping detection on it.

  Any trigger can now carry a `window: { count, minutes, refire }` block: the automation engine records each qualifying occurrence (after the structured config gate and the operator's `filter`) in a durable append log and counts rows within the trailing sliding window, scoped per context key (e.g. per system). `refire: "every"` (default) fires on every occurrence at/over the threshold; `refire: "once"` fires only on the crossing edge and re-arms as old occurrences age out. The gate runs in `maybeStartRun` after `filter` and before the `for:` dwell, so it composes with both.

  Flapping is now an instance of this mechanism rather than a bespoke detector. The healthcheck `system_health_changed` raw change event plus a `filter` (`trigger.payload.newStatus != "healthy"`) plus `window: { count: 3, minutes: 60, refire: "once" }` reproduces flapping in the engine.

  State-and-scale: window state lives in the new `automation_window_events` Postgres table (FK-cascade on the automation, the same delete-lifecycle as `automation_dwell_timers`). The count is read with pure SQL so every pod computes the same answer; the work-queue claim gives exactly one INSERT per emission, so there is no double-count. Rows older than the 24h schema cap are pruned by the existing stalled-sweeper. The `once` policy is best-effort under at-least-once redelivery (a redelivered emission can skip the exact crossing edge; `every` is redelivery-tolerant).

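As a rough illustration of the gate semantics described above, the following in-memory sketch counts occurrences in a trailing window per context key. It is not the engine's implementation: the real state lives in the `automation_window_events` Postgres table and is counted with SQL, `createWindowGate` is a hypothetical name, and treating the "crossing edge" as "the count after this occurrence equals the threshold exactly" is an assumed interpretation of `refire: "once"`.

```typescript
type Refire = "every" | "once";

interface WindowConfig {
  count: number;
  minutes: number;
  refire?: Refire;
}

// Returns a function that records one occurrence and reports whether the gate fires.
function createWindowGate(config: WindowConfig) {
  // In-memory stand-in for the durable append log, keyed by context key.
  const occurrences = new Map<string, number[]>();

  return function recordOccurrence(contextKey: string, nowMs: number): boolean {
    const cutoff = nowMs - config.minutes * 60_000;
    // Keep only occurrences inside the trailing window, then append this one.
    const recent = (occurrences.get(contextKey) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    occurrences.set(contextKey, recent);

    if ((config.refire ?? "every") === "every") {
      // Fire on every occurrence at or over the threshold.
      return recent.length >= config.count;
    }
    // "once": fire only when this occurrence brings the count up to the threshold;
    // the gate re-arms once older occurrences age out of the window.
    return recent.length === config.count;
  };
}
```

With `{ count: 3, minutes: 60, refire: "once" }`, the third occurrence for a key within an hour fires, a fourth does not, and the gate can fire again after enough earlier occurrences have left the window.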
  **BREAKING CHANGES:**

  - The `healthcheck.flapping_detected` automation trigger and the `healthcheck.flapping_detected` hook are REMOVED. Flapping is now detected by the windowed-count gate on the `healthcheck.system_health_changed` trigger (`window` block, `refire: "once"`).
  - Flapping is now PER-SYSTEM (the aggregated `health` entity), not per-`(system, configuration)`. Subscribe to `check_failed` with a `window` instead if you need per-check rate detection.
  - The healthcheck `health_check_unhealthy_transitions` table is DROPPED (the per-check flapping audit log is no longer kept; counting moved into the engine).
  - The backend-only `automation.subscriptions` service ref (`automationSubscriptionsRef` / `AutomationSubscriptions`) is REMOVED. The engine enumerates subscribers internally and the window gate runs per-automation inside `maybeStartRun`, so the external read-ref is no longer needed.
  - Existing user-created flapping automations are AUTO-MIGRATED on boot: any trigger on `healthcheck.flapping_detected` is rewritten to `healthcheck.system_health_changed` + the canonical unhealthy-transition filter + `window: { count: transitions ?? 3, minutes: windowMinutes ?? 60, refire: "once" }`, dropping the old `config`. A pre-existing trigger filter is replaced with the canonical one (logged per row). An enabled automation that still references the removed event after migration logs a warning.

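The auto-migration rewrite described in the last breaking-change bullet might look roughly like the sketch below. The trigger shape (`event`, `filter`, `config`) and the names `LegacyTrigger`, `MigratedTrigger`, and `migrateFlappingTrigger` are assumptions made for illustration, and the sketch assumes the "canonical unhealthy-transition filter" is the `trigger.payload.newStatus != "healthy"` expression quoted earlier in this entry.

```typescript
// Hypothetical shapes; the real automation schema is not shown in this diff.
interface LegacyTrigger {
  event: string;
  filter?: string;
  config?: { transitions?: number; windowMinutes?: number };
}

interface MigratedTrigger {
  event: string;
  filter: string;
  window: { count: number; minutes: number; refire: "once" };
}

const REMOVED_EVENT = "healthcheck.flapping_detected";
// Assumed to be the canonical filter, based on the flapping example in the changelog.
const CANONICAL_FILTER = 'trigger.payload.newStatus != "healthy"';

function migrateFlappingTrigger(
  trigger: LegacyTrigger,
): LegacyTrigger | MigratedTrigger {
  // Triggers on other events are left untouched.
  if (trigger.event !== REMOVED_EVENT) {
    return trigger;
  }
  // Any pre-existing filter is replaced and the old `config` is dropped.
  return {
    event: "healthcheck.system_health_changed",
    filter: CANONICAL_FILTER,
    window: {
      count: trigger.config?.transitions ?? 3,
      minutes: trigger.config?.windowMinutes ?? 60,
      refire: "once",
    },
  };
}
```

A caller would presumably also log each replaced filter and warn about enabled automations that still reference the removed event, as the bullet describes; that bookkeeping is left out here.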
### Patch Changes

- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [270ef29]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
- Updated dependencies [b995afb]
  - @checkstack/backend-api@0.19.0
  - @checkstack/automation-backend@0.3.0
  - @checkstack/gitops-common@0.5.0
  - @checkstack/gitops-backend@0.4.0
  - @checkstack/healthcheck-common@1.4.0
  - @checkstack/maintenance-common@1.3.0
  - @checkstack/incident-backend@1.4.0
  - @checkstack/catalog-backend@1.3.0
  - @checkstack/secrets-backend@0.1.0
  - @checkstack/satellite-backend@0.5.0
  - @checkstack/script-packages-backend@0.2.0
  - @checkstack/secrets-common@0.1.0
  - @checkstack/cache-api@0.3.7
  - @checkstack/command-backend@0.1.32
  - @checkstack/queue-api@0.3.7
  - @checkstack/cache-utils@0.2.12

## 1.3.0

### Minor Changes

package/drizzle/0015_quiet_meggan.sql
@@ -0,0 +1,12 @@
CREATE TABLE "health_check_state_transitions" (
	"id" uuid PRIMARY KEY DEFAULT gen_random_uuid() NOT NULL,
	"system_id" text NOT NULL,
	"configuration_id" uuid NOT NULL,
	"from_status" "health_check_status",
	"to_status" "health_check_status" NOT NULL,
	"transitioned_at" timestamp DEFAULT now() NOT NULL
);
--> statement-breakpoint
ALTER TABLE "health_check_state_transitions" ADD CONSTRAINT "health_check_state_transitions_configuration_id_health_check_configurations_id_fk" FOREIGN KEY ("configuration_id") REFERENCES "health_check_configurations"("id") ON DELETE cascade ON UPDATE no action;--> statement-breakpoint
CREATE INDEX "health_check_state_transitions_lookup_idx" ON "health_check_state_transitions" USING btree ("system_id","to_status","transitioned_at");--> statement-breakpoint
CREATE INDEX "health_check_state_transitions_system_recent_idx" ON "health_check_state_transitions" USING btree ("system_id","transitioned_at");

package/drizzle/0016_complex_maginty.sql
@@ -0,0 +1 @@
DROP TABLE "health_check_auto_incidents" CASCADE;

package/drizzle/0017_pretty_caretaker.sql
@@ -0,0 +1 @@
DROP TABLE "health_check_unhealthy_transitions" CASCADE;