@checkstack/automation-common 0.2.0 → 0.3.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,220 @@
1
1
  # @checkstack/automation-common
2
2
 
3
+ ## 0.3.0
4
+
5
+ ### Minor Changes
6
+
7
+ - b995afb: Add grouping to automations so they are easier to find.
8
+
9
+ Each automation now carries an optional single free-text `group` label (HA-style "category"), stored as its own column on the `automations` row alongside `name` / `description` / `status` - it is NOT part of the definition / YAML. The automations list renders one collapsible section per group (sorted alphabetically, with an implicit "Ungrouped" bucket last), and the edit page gains a type-new-or-pick-existing group picker fed by the new `listAutomationGroups` query. `listAutomations` accepts an optional `group` filter.
10
+
11
+ Declaratively managed automations express their group via GitOps `metadata.labels.group`; the reconciler threads it onto the row (blank clears it).
12
+
13
+ A Drizzle migration adds the nullable `"group"` column and an index. Existing automations default to no group (Ungrouped) and behave exactly as before.
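The list behaviour described in this entry (one section per group, sorted alphabetically, with an implicit "Ungrouped" bucket last) can be sketched roughly as below. This is an illustration only, not the package's implementation: `AutomationRow`, `UNGROUPED` and `groupAutomations` are hypothetical names, and treating a blank label as ungrouped is an assumption drawn from the "blank clears it" remark.

```typescript
// Hypothetical row shape: only the fields this sketch needs.
interface AutomationRow {
  name: string;
  group: string | null; // null (or blank, by assumption) means "Ungrouped"
}

const UNGROUPED = "Ungrouped";

// Bucket automations by group label. Named groups come back sorted
// alphabetically; the implicit "Ungrouped" bucket is always last.
function groupAutomations(
  rows: AutomationRow[],
): Array<[string, AutomationRow[]]> {
  const buckets = new Map<string, AutomationRow[]>();
  for (const row of rows) {
    const trimmed = row.group === null ? "" : row.group.trim();
    const label = trimmed === "" ? UNGROUPED : trimmed;
    const bucket = buckets.get(label) ?? [];
    bucket.push(row);
    buckets.set(label, bucket);
  }
  const named = Array.from(buckets.keys())
    .filter((label) => label !== UNGROUPED)
    .sort((a, b) => a.localeCompare(b));
  const order = buckets.has(UNGROUPED) ? [...named, UNGROUPED] : named;
  return order.map((label): [string, AutomationRow[]] => [
    label,
    buckets.get(label) ?? [],
  ]);
}
```

Note that this sketch would merge a real group literally named "Ungrouped" into the implicit bucket; a production version would need a sentinel that cannot collide with user input.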
14
+
15
+ - 270ef29: Add live state in scope plus duration helpers to the automation sensing layer (Wave 2 Phase 14).
16
+
17
+ - `@checkstack/template-engine` ships four pure, synchronous duration filters: `minutes` and `hours` (number to milliseconds), `duration_since` (ms elapsed since an ISO timestamp), and `older_than(thresholdMs)` (boolean dwell check). They compute against real time at call time, so "now" is fresh per evaluation. Fail-safe on null/unparseable input.
18
+ - The dispatch engine pre-resolves live health state into scope before any condition or template evaluation (the engine is synchronous, so inline state queries are impossible). State is folded under a `health` namespace - `health.system.*` for the trigger's context system and `health.systems[<id>]` for ids listed in the automation's new `uses_state` field. One batched `getBulkHealthState` query per evaluation, wired at the fresh-run, resume, and trigger-gate sites. Fail-open: a missing client or provider error yields an empty namespace and a warning, never wedging unrelated automations.
19
+ - New `automationFilterExtensionPoint` lets plugins contribute pure template filters without forking the engine's default registry. Name collisions with built-ins are skipped with a warning.
20
+ - The editor variable-scope resolver and autocomplete catalogue now surface the `health.*` namespace and the new duration filters.
21
+
22
+ With this phase alone, an operator can build "notify me when a system has been unhealthy for 30 minutes" using an interval trigger plus a single `health.*` condition - no dwell timer required (the precise event-driven path lands in Phase 15).
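As a rough illustration of the four duration filters named in this entry, the sketch below models them as plain functions. It is not the `@checkstack/template-engine` source. Two points are assumptions: "fail-safe" is modelled as `null` for `duration_since` and `false` for `older_than`, and the dwell check uses `>=`. The `now` parameter exists only to make the example deterministic; by default it reads the clock at call time, as the entry describes.

```typescript
// number -> milliseconds
const minutes = (n: number): number => n * 60_000;
const hours = (n: number): number => n * 3_600_000;

// Milliseconds elapsed since an ISO timestamp, or null when the input is
// missing or unparseable (assumed fail-safe behaviour).
function durationSince(iso: unknown, now: number = Date.now()): number | null {
  if (typeof iso !== "string") return null;
  const parsed = Date.parse(iso);
  return Number.isNaN(parsed) ? null : now - parsed;
}

// Boolean dwell check: has at least `thresholdMs` elapsed since `iso`?
// Unparseable input yields false (assumed fail-safe behaviour).
function olderThan(
  iso: unknown,
  thresholdMs: number,
  now: number = Date.now(),
): boolean {
  const elapsed = durationSince(iso, now);
  return elapsed !== null && elapsed >= thresholdMs;
}
```

Under these assumptions, the "unhealthy for 30 minutes" condition amounts to `olderThan(statusChangedAt, minutes(30))`, where `statusChangedAt` stands for whichever `health.*` timestamp field the scope actually exposes.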
23
+
24
+ - 270ef29: Add the `for:` dwell on triggers (Wave 2 Phase 15) - precise, event-driven, restart-safe "fire only if the matched state still holds after Y".
25
+
26
+ - New first-class `TriggerSchema.for` (decision D1): a single-unit duration (`{ seconds | minutes | hours }`) or `{ template }` rendering to seconds. A `durationToMs` helper resolves it. Not buried in `config`.
27
+ - New pre-run `automation_dwell_timers` table (decision D5): a dwell arms before any run exists, so it cannot reuse the run-scoped wait locks. Unique on `(automationId, triggerId, contextKey)` so a re-fire re-arms (pushes `fireAt`) rather than stacking timers.
28
+ - Arm / re-arm / fire / cancel wired into the trigger fan-in. When a `for:` trigger fires and its filter passes, the engine snapshots the current status, upserts the dwell row, and enqueues an `automation-dwell` wake job with the matching `startDelay` - no run starts yet.
29
+ - At expiry the dwell re-confirms (via the Phase 13 health-state provider) that the system is still in the armed status, then re-checks the automation's pre-run conditions, then starts the run honouring the concurrency mode. A recovery within the window cancels the pending fire even without an explicit inverse event.
30
+ - Cancellation is DB-side (delete the row; the queue job no-ops when it pops, since queue jobs are not cancellable). A contradicting state-change event eagerly deletes a stale dwell. Deleted automations drop their dwells via FK cascade; disabled automations drop them at fire time.
31
+ - Durability: the dwell row is the source of truth. A new `automation-dwell` queue consumer fires dwells, and the stalled sweeper catches expired rows whose job was lost. Both paths are idempotent via delete-on-fire, so a dwell fires at most once and survives restart.
32
+
33
+ Example:
34
+
35
+ ```yaml
+ triggers:
+   - event: healthcheck.system.degraded
+     for: { minutes: 30 }
+ actions:
+   - action: incident.create
+     config:
+       title: "{{ trigger.payload.systemName }} is critical"
+       severity: critical
+       systemIds: ["{{ trigger.payload.systemId }}"]
+ ```
46
+
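To make the `for:` mechanics more concrete, here is a minimal sketch of resolving a single-unit duration and arming a dwell keyed on `(automationId, triggerId, contextKey)`. It is illustrative only. `ForDuration`, `durationToMsSketch`, `DwellKey` and `armDwell` are hypothetical names, a `Map` stands in for the `automation_dwell_timers` table, and the `{ template }` form is omitted. The arm step follows the insert-if-absent semantics that the Phase 16 entry in this changelog describes as the corrected behaviour.

```typescript
// Hypothetical shape of the non-template form of `for:`; exactly one unit is set.
type ForDuration = { seconds: number } | { minutes: number } | { hours: number };

function durationToMsSketch(d: ForDuration): number {
  if ("seconds" in d) return d.seconds * 1_000;
  if ("minutes" in d) return d.minutes * 60_000;
  return d.hours * 3_600_000;
}

interface DwellKey {
  automationId: string;
  triggerId: string;
  contextKey: string;
}

// In-memory stand-in for the dwell table: composite key -> fireAt (epoch ms).
const dwells = new Map<string, number>();

const keyOf = (k: DwellKey): string =>
  [k.automationId, k.triggerId, k.contextKey].join("|");

// Insert-if-absent: a re-fire while a dwell is armed keeps the original fireAt.
function armDwell(k: DwellKey, d: ForDuration, now: number): number {
  const key = keyOf(k);
  const existing = dwells.get(key);
  if (existing !== undefined) return existing;
  const fireAt = now + durationToMsSketch(d);
  dwells.set(key, fireAt);
  return fireAt;
}

// Cancellation is modelled as deleting the row, so a later arm starts a fresh window.
function cancelDwell(k: DwellKey): boolean {
  return dwells.delete(keyOf(k));
}
```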
47
+ - 270ef29: Add the `numeric_state` trigger and three structured condition variants (Wave 2 Phase 16, backend-only).
48
+
49
+ - New built-in `numeric_state` trigger: hook-backed on `healthcheck.check.completed`, fires when a numeric field (`latencyMs` top-level, or a `collectors.<id>.<field>` path) crosses an `above` / `below` threshold. The per-automation threshold is enforced by a new structured config gate (`TriggerDefinition.evaluateConfig`) that runs before the operator's template filter. Pairs with a trigger-level `for:` (Phase 15) for sustained thresholds. v1 is level-triggered; edge de-duplication is deferred. (Per-check `p95LatencyMs` is not in the hook payload; read windowed p95 via a `numeric_state` _condition_ against `health.system.p95_latency_ms` instead.)
50
+ - Corrected the Phase 15 dwell `arm` semantics to be insert-if-absent: a re-fire while a dwell is still armed PRESERVES the original `fireAt` instead of pushing it. Required for the level-triggered `numeric_state` trigger above - otherwise a trigger firing on every check completion (e.g. every 60s) with `for: 10m` would re-arm and push the deadline forward indefinitely, never elapsing. A genuine recover-then-recur still deletes the row (re-confirm / inverse-cancel) so a fresh window starts.
51
+ - Extended the condition grammar (`ConditionInput`) beyond `string | and | or | not` with three typed variants evaluated over the pre-resolved `health.*` scope plus a FRESH `now` per evaluation:
52
+ - `numeric_state`: `{ value, above?, below? }` (value is a literal number or a template/path string).
53
+ - `time`: `{ after?, before?, weekday?[], timezone? }` for on-call / quiet-hours gating, including overnight windows wrapping midnight, weekday filtering, and IANA timezone resolution via `Intl`.
54
+ - `state`: `{ entity, status, for? }` - a condition-side dwell read from `health.systems[entity].in_status_for_ms` (no new timer; it reads, it doesn't time).
55
+ - The raw template string stays the escape hatch. Everything round-trips through zod and YAML.
56
+
57
+ Editor widgets (ConditionEditor branches, duration/time-of-day inputs, operator selects) are intentionally deferred to Phase 19; the YAML editor already round-trips the new schema, so the feature is fully usable and testable via YAML today.
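The following sketch shows one way the `numeric_state` comparison and the overnight-window part of the `time` variant could be evaluated. It is a simplification under stated assumptions, not the engine's code: `value` is taken as an already-resolved number, bounds are treated as strict, `after` is inclusive and `before` exclusive, and weekday and timezone handling are left out. All function names are hypothetical.

```typescript
interface NumericStateCondition {
  value: number; // assumed already resolved from a literal or template/path
  above?: number;
  below?: number;
}

// Passes when the value is strictly above `above` and strictly below `below`,
// checking only the bounds that are present.
function evalNumericState(c: NumericStateCondition): boolean {
  if (c.above !== undefined && !(c.value > c.above)) return false;
  if (c.below !== undefined && !(c.value < c.below)) return false;
  return true;
}

// "HH:MM" -> minutes since midnight.
function toMinutes(hhmm: string): number {
  const parts = hhmm.split(":").map(Number);
  return (parts[0] ?? 0) * 60 + (parts[1] ?? 0);
}

// A window whose `after` is later than its `before` wraps midnight,
// e.g. 22:00 to 06:00 for quiet hours.
function inTimeWindow(nowMinutes: number, after: string, before: string): boolean {
  const start = toMinutes(after);
  const end = toMinutes(before);
  return start <= end
    ? nowMinutes >= start && nowMinutes < end
    : nowMinutes >= start || nowMinutes < end;
}
```

A real implementation would first convert "now" into the configured IANA timezone (for instance with `Intl.DateTimeFormat`) before computing `nowMinutes`.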
58
+
59
+ - 270ef29: Add the `wait_until` action primitive (Wave 2 Phase 17) - suspend a running automation until a condition becomes true, with an optional timeout (HA's `wait_template`).
60
+
61
+ - New `wait_until: { condition, timeout_seconds?, continue_on_timeout? }` primitive. `continue_on_timeout` defaults to true (HA semantics). Added to the schema, the action union, and `detectActionKind`. (The wait is fully reactive - see the reactive-dispatch-pipeline changeset; there is no `poll_seconds`.)
62
+ - `condition` accepts any condition shape - a template string or the Phase 16 structured `numeric_state` / `time` / `state` variants.
63
+ - Reactive resume: if the condition is already true it continues inline; otherwise it persists a `kind: "until"` wait lock (carrying the condition + timeout policy in a new `wait_config` jsonb column). The reactive-dispatch-pipeline changeset replaces the original poll-based re-check with a wake-index + a single timeout timer, so the wait is woken by a relevant entity change rather than ticked on an interval. Resumes take the per-run advisory lock so a wake and a sweep can't double-resume.
64
+ - Survives restart: the wait lock is the source of truth, and the stalled sweeper applies the timeout policy as a backstop if the wake/timer signal is lost.
65
+ - Works nested inside `choose` / `parallel` / `repeat` via the existing resume-remainder mechanism.
66
+ - Editor: a `wait_until` action card (frontend) mirroring `wait_for_trigger` - a `ConditionEditor` plus timeout and continue-on-timeout inputs. The structured numeric/time/state ConditionEditor branches land with the rest of the sensing-layer editor work; the card uses the expression-based editor for now.
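The decision a `wait_until` step has to make each time it is woken (or swept) can be summarised as a small pure function. This is a sketch for orientation only: `resolveWait` and the `WaitOutcome` labels are hypothetical, and mapping `continue_on_timeout: false` to an "abort" outcome is an assumption about what happens when the run does not continue.

```typescript
interface WaitUntilConfig {
  timeout_seconds?: number;
  continue_on_timeout?: boolean; // defaults to true per the changelog
}

type WaitOutcome = "continue" | "keep_waiting" | "abort";

function resolveWait(
  cfg: WaitUntilConfig,
  conditionHolds: boolean,
  waitedMs: number,
): WaitOutcome {
  // Condition already true: continue inline, no wait lock needed.
  if (conditionHolds) return "continue";
  const timedOut =
    cfg.timeout_seconds !== undefined && waitedMs >= cfg.timeout_seconds * 1_000;
  // No timeout configured, or not yet elapsed: stay suspended.
  if (!timedOut) return "keep_waiting";
  // Timed out: apply the timeout policy.
  return (cfg.continue_on_timeout ?? true) ? "continue" : "abort";
}
```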
67
+
68
+ - 270ef29: Add a windowed transition count to the health provider - the building block for custom flapping rules (Wave 2 Phase 18).
69
+
70
+ Flapping is already buildable today via the built-in `healthcheck.flapping_detected` trigger; this phase ships the GENERALIZATION for arbitrary "N status changes in M minutes" rules.
71
+
72
+ - `countStateTransitionsInWindow` counts aggregate status transitions for a system over a trailing window (from the Phase 13 `health_check_state_transitions` table - all statuses, generalizing the unhealthy-only flapping detector). Fail-safe to 0.
73
+ - `getHealthState` / `getBulkHealthState` now return `transitionsInWindow` + `transitionWindowMinutes`, and accept an optional `transitionWindowMinutes` input (default 60).
74
+ - The automation definition gains an optional top-level `state_window_minutes` (default 60), threaded through `enrichScopeWithState` so `health.system.transitions_in_window` / `health.system.transition_window_minutes` are folded into scope per evaluation.
75
+ - Operators author custom flapping as a `numeric_state` condition over `health.system.transitions_in_window` - no new condition variant, no editor change. The variable-scope resolver surfaces the new fields for autocomplete.
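A trailing-window transition count is simple to express. The sketch below works over an in-memory array instead of the `health_check_state_transitions` table; `Transition`, `countTransitionsInWindow` and `isFlapping` are hypothetical names, and the threshold of three is only an example.

```typescript
interface Transition {
  at: number; // epoch ms of the status change
  status: string;
}

// Count transitions (of any status) inside the trailing window ending at `now`.
function countTransitionsInWindow(
  transitions: Transition[],
  windowMinutes: number = 60,
  now: number = Date.now(),
): number {
  const cutoff = now - windowMinutes * 60_000;
  return transitions.filter((t) => t.at >= cutoff && t.at <= now).length;
}

// A custom "N status changes in M minutes" rule built on the count.
function isFlapping(
  transitions: Transition[],
  threshold: number,
  windowMinutes: number,
  now: number,
): boolean {
  return countTransitionsInWindow(transitions, windowMinutes, now) >= threshold;
}
```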
76
+
77
+ - 270ef29: Add per-context-key concurrency scope to automations (Phase 20 prerequisite).
78
+
79
+ A new optional `concurrency_scope: "automation" | "context_key"` field on the automation definition controls the bucket the concurrency `mode` is evaluated over:
80
+
81
+ - `automation` (default, backward-compatible): one bucket for the whole automation - `single` allows one in-flight run total, `restart` cancels every active run. Existing automations are unchanged.
82
+ - `context_key`: an independent bucket per `contextKey` (typically per system / incident) - `single` allows one in-flight run _per context key_ (system A and system B run concurrently, but a second run for system A is deduped), and `restart` cancels only the active runs sharing the incoming context key.
83
+
84
+ `RunStore.hasActiveRun` / `countActiveRuns` / `cancelActiveRuns` gain an optional `contextKey` filter (the `automation_runs.context_key` column already exists, so no migration). `respectConcurrencyMode` threads the scope through. This is the primitive the default auto-incident automations need for faithful per-system deduplication.
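The difference between the two scopes comes down to which runs count as being in the same bucket. A minimal in-memory sketch follows; the `Run` shape and the `hasActiveRunSketch` function are hypothetical and only loosely mirror the `RunStore` methods mentioned above.

```typescript
type ConcurrencyScope = "automation" | "context_key";

interface Run {
  automationId: string;
  contextKey: string | null;
  active: boolean;
}

// Is there an in-flight run in the bucket the incoming run would join?
function hasActiveRunSketch(
  runs: Run[],
  automationId: string,
  scope: ConcurrencyScope,
  contextKey: string | null,
): boolean {
  return runs.some(
    (r) =>
      r.active &&
      r.automationId === automationId &&
      // "automation": one bucket for everything; "context_key": match the key too.
      (scope === "automation" || r.contextKey === contextKey),
  );
}
```

With `mode: single`, a `true` result means the incoming run is deduplicated; with `restart`, the same predicate selects the runs to cancel.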
85
+
86
+ - b995afb: Add the entity state machine core (`defineEntity`) - the foundational primitive of the reactive automation engine - as a Model-B plugin-backed reactive WRAPPER with NO framework-owned current-state storage.
87
+
88
+ `defineEntity` owns NO current-state storage of its own. Each kind declares a required plugin `read` accessor pointing at wherever its state lives (its own durable table, or a value computed on read from its own durable tables), and `defineEntity` makes that state reactive. There is no framework current-state store and no "homeless" fallback: every kind is plugin-backed. This makes a non-reactive write structurally impossible and guarantees every transition is durably logged without duplicating the plugin's state.
89
+
90
+ - `@checkstack/automation-backend`:
91
+
92
+ - New `automation.entity` extension point exposing `defineEntity(input)`, `declareNonReactiveState(input)`, `onEntityChanged(...)`, and `registerChangeDeriver(...)`. automation-backend registers the impl in `register`, so other plugins can resolve it and declare entities during their own `register`/`init` (Proxy-buffered until the impl registers).
93
+ - **Driven single mutation entry point.** All reactive-state writes go through `handle.mutate({ id, opts?, apply: () => Promise<TState> })`. The handle snapshots `prev` via `read` BEFORE the write, runs the plugin's `apply` (the actual write, committed in the PLUGIN's own transaction, returning the resulting state), validates `next` (zod), masks run-originated writes through the run-secret registry, diffs prev to next, and on a real diff appends the field-level transition rows to `entity_transitions` and emits `ENTITY_CHANGED` - both AFTER the plugin write commits (never on a rolled-back / throwing write). A structurally-unchanged write is a no-op. `handle.remove({ id, opts?, apply: () => Promise<void> })` is the tombstone counterpart (records the tombstone transition, emits next = null).
94
+ - **Cross-plugin transaction boundary.** `apply` takes NO framework tx: a plugin-backed kind lives behind a DIFFERENT drizzle client than `entity_transitions`, and two clients cannot share one transaction. The plugin write is authoritative; the transition-log append runs in the framework's own transaction AFTER the plugin write commits. A failure between them leaves correct plugin state with a missing history row (a gap, never a corruption).
95
+ - **`get` / `getMany`** route to the kind's `read`; **`inStateSince` / `inStateForMs` / `transitionCount`** read the per-field `entity_transitions` log (generalizing Phase-13 health transitions to any entity).
96
+ - **No framework keyed store.** There is no generic `entity_state` table, no `createKeyedStore`, and no `entityKeyedStoreServiceRef`: kinds whose state has no domain table of their own (the `health` aggregate, the `slo` budget/streak view) compute their `read` on demand from their own durable data instead of materializing a framework copy. `entity_transitions` (the change-history log) is the framework's ONLY persistent table and is written for EVERY kind regardless of where current state lives.
97
+ - **`entityResolverFor(kind)`** routes scope enrichment + the reactive `wait_until` wake re-eval to each kind's `read` accessor. Generalized scope enrichment (`enrichScopeWithEntities`) folds any `state.<kind>.<id>` ref into `scope.state.<kind>.<id>.<field>`. The rich `scope.health.*` condition snapshot (status, latency, success rate, in-maintenance, transitions-in-window, ...) is resolved EXCLUSIVELY through the healthcheck RPC path (the health aggregate is computed on read, not stored as a framework row) and the generic entity pass never writes `scope.health`; `state.health.*` remains the minimal reactive entity view. These are two complementary projections by design, not a migration shim.
98
+ - **Horizontal-scale read-consistency guard.** A reactive entity's current state MUST be globally readable from shared/durable storage, never process-local memory (`.agent/rules/state-and-scale.md`). Enforced by the `checkstack/no-pod-local-entity-state` ESLint tripwire at the `defineEntity({ read })` boundary (wired at `warn`) and the deterministic `cross-pod-read-consistency.it.test.ts` integration test.
99
+ - Load-time validation hard-fails a malformed registration (non-`z.object` state, missing/duplicate `kind`, or a missing / non-function `read`).
100
+ - The `ENTITY_CHANGED` hook is internal (not exported); the change emitter buffers events produced during the init window and flushes them in order once the hook wiring is available in `afterPluginsReady`.
101
+
102
+ - `@checkstack/automation-common`:
103
+
104
+ - New `EntityChangedSchema` (the `ENTITY_CHANGED` payload - `kind`, `id`, `prev`, `next`, `delta`, `changedFields`, `actor`, `occurredAt`) and `DispatchJobSchema` (the Stage-2 `trigger` / `wake` dispatch job).
105
+
106
+ - `@checkstack/automation-frontend`: the `wait_until` editor no longer offers the inert `poll_seconds` field (reactive waits don't poll).
107
+
108
+ This phase adds the primitive only: domains are migrated in their own changesets. No external behavior changes for existing automations.
109
+
110
+ BREAKING CHANGES: There is no framework current-state store. Any out-of-tree plugin must own its entity state in its own durable storage (its own table, or a compute-on-read over it) and pass a `read` accessor to `defineEntity`. `createKeyedStore` / `KeyedStore` / `entityKeyedStoreServiceRef` / `EntityKeyedStoreService` do not exist, and there is no `entity_state` table. `handle.set` / `handle.patch` and the `indexes` option do not exist; all writes go through `handle.mutate` / `handle.remove`.
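The ordering guarantees of `handle.mutate` (snapshot `prev`, run the plugin write, diff, then log and emit only on a real change) can be illustrated with a heavily simplified sketch. It is not the `defineEntity` implementation: it is synchronous for brevity whereas the real `apply` returns a Promise, it omits zod validation and run-secret masking, and every name in it (`mutateSketch`, `diffFields`, `TransitionRow`) is hypothetical.

```typescript
type State = Record<string, unknown>;

interface TransitionRow {
  id: string;
  field: string;
  from: unknown;
  to: unknown;
}

// Field names whose serialized value differs between prev and next.
function diffFields(prev: State | null, next: State): string[] {
  const keys = new Set<string>([...Object.keys(prev ?? {}), ...Object.keys(next)]);
  return Array.from(keys).filter(
    (k) => JSON.stringify(prev?.[k]) !== JSON.stringify(next[k]),
  );
}

function mutateSketch(
  id: string,
  read: (id: string) => State | null, // the kind's plugin-owned accessor
  apply: () => State, // the plugin's own write, returning the new state
  log: TransitionRow[], // stand-in for `entity_transitions`
  emit: (changedFields: string[]) => void, // stand-in for ENTITY_CHANGED
): State {
  const prev = read(id); // snapshot BEFORE the write
  const next = apply(); // if this throws, nothing below runs
  const fields = diffFields(prev, next);
  if (fields.length === 0) return next; // structurally unchanged: no-op
  for (const field of fields) {
    log.push({ id, field, from: prev?.[field], to: next[field] });
  }
  emit(fields);
  return next;
}
```

Because the log append happens after `apply` returns, a failure between the two steps leaves the plugin's state correct with a missing history row, which matches the "gap, never a corruption" trade-off described above.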
111
+
112
+ - b995afb: Fix four reactive-automation-engine defects in the `wait_until` / entity-change dispatch path.
113
+
114
+ - **Lost-wakeup re-evaluate-on-registration guard (HIGH, data-loss race).** `executeWaitUntil` evaluated its condition, then committed the wait lock + wake-index rows with NO re-evaluation after arming. An `ENTITY_CHANGED` for a relevant ref landing in that arm window was routed by Stage-1 against a not-yet-visible lock, enqueued no wake job, and — for a no-timeout wait (`timeoutAt` null, skipped by the sweeper) — the run stalled permanently (silent run leak). After arming the lock the engine now re-evaluates ONCE against freshly re-enriched scope; if the condition already holds it deletes the lock (its wake-index rows cascade) and continues the run inline. Idempotent via the lock delete + the per-run advisory lock.
115
+
116
+ - **Wildcard health wake drops the changed system (MEDIUM, correctness).** `reEnrichWaitScope` resolved health only for the trigger `contextKey` + `uses_state` ids and excluded the changed ref from health resolution. A wildcard health wait (`health:*`) woken by `health:sysX` — where `sysX` was neither the contextKey nor in `uses_state` — never had `scope.health.systems[sysX]` populated, so the condition read stale/empty state and failed to resume. The changed system's concrete id is now injected into health resolution during a wildcard wake.
117
+
118
+ - **`changeId` for dispatch dedup (LOW, correctness).** The Stage-2 trigger `jobId` embedded `changed.occurredAt` (millisecond granularity), so two DISTINCT changes to the same entity within one millisecond collapsed onto one job (the second run silently dropped). `EntityChangedSchema` gains an additive, back-compatible `changeId` (generated ONCE at emit time so it travels with redeliveries of the same change); the Stage-2 jobId now uses `changed.changeId` (falling back to `occurredAt` for legacy payloads). Redeliveries of one change still dedup; two real changes stay distinct.
119
+
120
+ - **Run-originated `mutate` returns the unmasked next state (LOW, correctness).** `handle.mutate` returned the `maskForRun`-masked next state, contradicting its "returns the resulting state" contract. Masking is now confined to the emitted `ENTITY_CHANGED` payload and the `entity_transitions` rows only; `mutate` returns the unmasked, zod-validated resulting state.
121
+
122
+ BREAKING CHANGES: none. The `changeId` field is additive and optional; all changes are behavior-preserving except where they fix the defects above.
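The `changeId` fix is easiest to see in how a dedup key might be built. In the sketch below the job-id layout is invented for illustration (the changelog does not specify it); only the fallback rule, `changeId` when present and `occurredAt` otherwise, comes from the entry above.

```typescript
interface ChangedLike {
  kind: string;
  id: string;
  occurredAt: string; // ISO timestamp, millisecond granularity
  changeId?: string; // generated once at emit time; absent on legacy payloads
}

// Hypothetical Stage-2 job id: stable across redeliveries of one change,
// distinct for two real changes even within the same millisecond.
function stage2JobId(
  automationId: string,
  triggerId: string,
  changed: ChangedLike,
): string {
  const discriminator = changed.changeId ?? changed.occurredAt;
  return ["trigger", automationId, triggerId, changed.kind, changed.id, discriminator].join(":");
}
```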
123
+
124
+ - 270ef29: Add in-UI script testing for automation `run_script` / `run_shell` actions.
125
+
+ A new `testScript` RPC runs a TypeScript or shell script against an
+ editable, auto-seeded sample context using the same sandboxed runner the
+ real action uses, so operators can test scripts directly in the editor
+ without dispatching a whole automation. Surfaces beneath any script field
+ flagged `x-script-testable` via the new `ScriptTestPanel` /
+ `ContextSampleEditor` components in `@checkstack/ui` and the
+ `scriptTestRenderer` prop threaded through `DynamicForm`.
+
+ - `@checkstack/automation-common`: adds the `testScript` contract +
+ `ScriptTest*` schemas (gated by `automation.manage`).
+ - `@checkstack/automation-backend`: implements `testScript` reusing the
+ shared ESM / shell runners; central-only, time-bounded.
+ - `@checkstack/backend-api`: new `x-script-testable` config-schema
+ metadata propagated to the frontend JSON Schema.
+ - `@checkstack/ui`: new `ScriptTestPanel` + `ContextSampleEditor`
+ components and a `scriptTestRenderer` prop on `DynamicForm`.
+ - `@checkstack/automation-frontend`: wires the test panel into the action
+ editor.
+ - `@checkstack/integration-script-backend`: marks the `run_script` /
+ `run_shell` script fields as testable.
146
+
+ - 270ef29: Extend in-UI script testing to health-check collectors, and add
+ load-from-run replay for automation script tests.
+
+ - Health-check collectors: a new `testCollectorScript` RPC runs the
+ inline-script (TypeScript) collector and the shell `script` collector
+ against an editable, auto-seeded sample context using the same
+ sandboxed runner the real collector uses. Surfaces beneath the
+ collector script fields in the collector editor (both marked
+ `x-script-testable`). Gated by `healthcheck.configuration.manage`.
+ - Automation replay: a new `getRunScopeForReplay` RPC reconstructs an
+ editable test context from a real run (trigger + persisted artifacts,
+ plus the durable scope snapshot when the run is still in-flight), and
+ the script-test panel gains a "Load from run" picker that seeds the
+ sample context from a past run.
+
+ Note: health-check executions do not persist the script / config /
+ check / system that produced a result, so there is no health-check
+ replay - auto-seed is the only context source for collector tests. This
+ is by design; see the feature plan.
166
+
167
+ - 270ef29: Secrets platform Phase 2: secret -> env-var mapping with central resolve, inject, and mask.
168
+
+ - Script consumers declare a least-privilege `secretEnv` allowlist
+ (`{ ENV_NAME: "${{ secrets.NAME }}" }`). The automation `run_script` /
+ `run_shell` actions resolve ONLY the declared secrets via
+ `secretResolverRef.resolveForRun`, inject them into the runner env for
+ that run (memory-only; the ESM runner gained a per-run `env` option), and
+ mask their values out of stdout/stderr/result/error via the run-scoped
+ masking context. A missing required secret fails the run clearly. No
+ ambient secret access.
+ - Test panel: `testScript` / `testCollectorScript` inject named
+ `__SECRET_<NAME>__` placeholders by default, or user-supplied per-secret
+ overrides; real production values are never resolved in the test path,
+ and overrides are masked out of the result.
+ - Healthcheck collectors carry the `secretEnv` field for authoring +
+ the test panel; runtime injection on satellites lands in Phase 3.
+ - Editor UX: a new `@checkstack/ui` `SecretEnvEditor` renders `x-secret-env`
+ record fields with `${{ secrets.* }}` name autocomplete (from
+ `listSecretNames`), wired into the automation action editor and the
+ healthcheck collector editor. New `withConfigMeta` helper +
+ `x-secret-env` config-meta key in `@checkstack/backend-api`.
188
+
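The resolve, inject and mask flow for `secretEnv` can be sketched as two small functions. This is an illustration of the idea, not `secretResolverRef.resolveForRun`: the secret store is a plain object, the accepted name characters in the regular expression and the `***` mask token are assumptions, and `resolveSecretEnv` / `maskSecrets` are hypothetical names.

```typescript
// Matches the documented reference form, e.g. "${{ secrets.NAME }}".
const SECRET_REF = /^\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}$/;

// Resolve ONLY the secrets named in the allowlist; anything else in the
// store is never touched. A missing secret fails loudly.
function resolveSecretEnv(
  secretEnv: Record<string, string>,
  store: Record<string, string>,
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [envName, ref] of Object.entries(secretEnv)) {
    const match = SECRET_REF.exec(ref);
    if (!match) throw new Error(`Invalid secret reference for ${envName}`);
    const secretName = match[1] ?? "";
    const value = store[secretName];
    if (value === undefined) throw new Error(`Missing required secret: ${secretName}`);
    env[envName] = value;
  }
  return env;
}

// Replace every occurrence of each injected value in runner output.
function maskSecrets(text: string, env: Record<string, string>): string {
  let masked = text;
  for (const value of Object.values(env)) {
    if (value !== "") masked = masked.split(value).join("***");
  }
  return masked;
}
```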
189
+ - b995afb: Add an optional `partitionBy` override to the windowed-count trigger gate.
190
+
191
+ A trigger's `window` block now accepts `partitionBy`, a bare expression (same flavour as `filter`, no `{{ }}`) that controls the key the occurrence count is bucketed by. When omitted, the gate keys by the trigger's built-in context key exactly as before (per system for health triggers), so existing automations are unchanged. When set, the expression is evaluated against the same trigger scope `filter` uses and coerced to a string - e.g. `trigger.payload.severity` for a per-severity rate, or `trigger.payload.systemId + ":" + trigger.payload.checkId` for a composite key. If the expression evaluates to null/undefined/empty or fails to evaluate, the gate falls back to the built-in context key (never global counting); eval errors are logged, matching the gate's fail-open posture.
192
+
193
+ Triggers can now declare `contextKeyLabel` (a UI hint, e.g. `"system"`) describing their built-in context dimension. It is surfaced through `TriggerInfo` so the editor's window "Partition by" field shows the default partition ("Leave blank to count per system" / "per automation" when a trigger has no context key). The healthcheck system triggers (`system_health_changed`, `system_degraded`, `system_healthy`, `check_failed`) and the built-in `numeric_state` trigger set it to `"system"`. This is a pure UI hint with no runtime behaviour.
194
+
195
+ The automation editor's window block gains a "Partition by" expression input (reusing the trigger filter's `trigger.payload.*` autocomplete), and the collapsed trigger card summary shows the partition when set.
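The fallback rules for `partitionBy` can be captured in a few lines. In this sketch the expression evaluator is abstracted to a thunk, since the real expression engine is not shown in this diff; `resolvePartitionKey` and its parameters are hypothetical names.

```typescript
// Returns the key the windowed count is bucketed by.
// `evaluate` stands in for evaluating the `partitionBy` expression against
// the trigger scope; undefined means no override was configured.
function resolvePartitionKey(
  evaluate: (() => unknown) | undefined,
  builtInContextKey: string,
  warn: (message: string) => void = () => {},
): string {
  if (evaluate === undefined) return builtInContextKey;
  try {
    const result = evaluate();
    if (result === null || result === undefined) return builtInContextKey;
    const key = String(result); // coerce to string, as the entry describes
    return key === "" ? builtInContextKey : key;
  } catch (err) {
    // Fail-open: log and fall back to the built-in key, never a global count.
    warn(`partitionBy evaluation failed: ${String(err)}`);
    return builtInContextKey;
  }
}
```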
196
+
197
+ - b995afb: Add a generic windowed-count / rate trigger gate, and express flapping detection on it.
198
+
199
+ Any trigger can now carry a `window: { count, minutes, refire }` block: the automation engine records each qualifying occurrence (after the structured config gate and the operator's `filter`) in a durable append log and counts rows within the trailing sliding window, scoped per context key (e.g. per system). `refire: "every"` (default) fires on every occurrence at/over the threshold; `refire: "once"` fires only on the crossing edge and re-arms as old occurrences age out. The gate runs in `maybeStartRun` after `filter` and before the `for:` dwell, so it composes with both.
200
+
201
+ Flapping is now an instance of this mechanism rather than a bespoke detector. The healthcheck `system_health_changed` raw change event plus a `filter` (`trigger.payload.newStatus != "healthy"`) plus `window: { count: 3, minutes: 60, refire: "once" }` reproduces flapping in the engine.
202
+
203
+ State-and-scale: window state lives in the new `automation_window_events` Postgres table (FK-cascade on the automation, the same delete-lifecycle as `automation_dwell_timers`). The count is read with pure SQL so every pod computes the same answer; the work-queue claim gives exactly one INSERT per emission, so there is no double-count. Rows older than the 24h schema cap are pruned by the existing stalled-sweeper. The `once` policy is best-effort under at-least-once redelivery (a redelivered emission can skip the exact crossing edge; `every` is redelivery-tolerant).
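A compact way to reason about the `refire` policies is to evaluate the gate over the timestamps already recorded for one context key. The sketch below is one plausible reading, not the engine's SQL: it models `once` as "fire only when this occurrence brings the in-window count to exactly the threshold", which is an assumption about how the crossing edge is detected. `windowGateFires` and `WindowGate` are hypothetical names.

```typescript
type Refire = "every" | "once";

interface WindowGate {
  count: number;
  minutes: number;
  refire?: Refire; // defaults to "every"
}

// `occurrences` are epoch-ms timestamps recorded for one context key,
// INCLUDING the occurrence being evaluated at `now`.
function windowGateFires(
  occurrences: number[],
  gate: WindowGate,
  now: number,
): boolean {
  const cutoff = now - gate.minutes * 60_000;
  const inWindow = occurrences.filter((t) => t > cutoff && t <= now).length;
  if (inWindow < gate.count) return false;
  if ((gate.refire ?? "every") === "every") return true;
  // "once": only the occurrence that crosses the threshold fires. Later ones
  // stay silent until old rows age out and the count crosses again.
  return inWindow === gate.count;
}
```

With `{ count: 3, minutes: 60, refire: "once" }`, the third qualifying change in an hour fires and the fourth does not, which is the flapping behaviour the entry describes.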
204
+
205
+ **BREAKING CHANGES:**
206
+
207
+ - The `healthcheck.flapping_detected` automation trigger and the `healthcheck.flapping_detected` hook are REMOVED. Flapping is now detected by the windowed-count gate on the `healthcheck.system_health_changed` trigger (`window` block, `refire: "once"`).
208
+ - Flapping is now PER-SYSTEM (the aggregated `health` entity), not per-`(system, configuration)`. Subscribe to `check_failed` with a `window` instead if you need per-check rate detection.
209
+ - The healthcheck `health_check_unhealthy_transitions` table is DROPPED (the per-check flapping audit log is no longer kept; counting moved into the engine).
210
+ - The backend-only `automation.subscriptions` service ref (`automationSubscriptionsRef` / `AutomationSubscriptions`) is REMOVED. The engine enumerates subscribers internally and the window gate runs per-automation inside `maybeStartRun`, so the external read-ref is no longer needed.
211
+ - Existing user-created flapping automations are AUTO-MIGRATED on boot: any trigger on `healthcheck.flapping_detected` is rewritten to `healthcheck.system_health_changed` + the canonical unhealthy-transition filter + `window: { count: transitions ?? 3, minutes: windowMinutes ?? 60, refire: "once" }`, dropping the old `config`. A pre-existing trigger filter is replaced with the canonical one (logged per row). An enabled automation that still references the removed event after migration logs a warning.
212
+
213
+ ### Patch Changes
214
+
215
+ - Updated dependencies [270ef29]
216
+ - @checkstack/template-engine@0.3.0
217
+
3
218
  ## 0.2.0
4
219
 
5
220
  ### Minor Changes
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@checkstack/automation-common",
-   "version": "0.2.0",
+   "version": "0.3.0",
    "license": "Elastic-2.0",
    "type": "module",
    "exports": {
@@ -10,14 +10,14 @@
      }
    },
    "dependencies": {
-     "@checkstack/common": "0.11.0",
-     "@checkstack/signal-common": "0.2.4",
-     "@checkstack/template-engine": "0.1.0",
+     "@checkstack/common": "0.12.0",
+     "@checkstack/signal-common": "0.2.5",
+     "@checkstack/template-engine": "0.2.0",
      "@orpc/contract": "^1.13.14",
      "zod": "^4.0.0"
    },
    "devDependencies": {
-     "@checkstack/scripts": "0.3.3",
+     "@checkstack/scripts": "0.3.4",
      "@checkstack/tsconfig": "0.0.7",
      "typescript": "^5.7.2"
    },
@@ -0,0 +1,71 @@
+ import { describe, it, expect } from "bun:test";
+ import { SYSTEM_ACTOR } from "@checkstack/common";
+ import {
+   DispatchJobSchema,
+   EntityChangedSchema,
+ } from "./schemas";
+
+ const baseChange = {
+   kind: "incident",
+   id: "inc-1",
+   prev: null,
+   next: { status: "open" },
+   delta: { status: "open" },
+   changedFields: ["status"],
+   actor: SYSTEM_ACTOR,
+   occurredAt: new Date().toISOString(),
+ };
+
+ describe("EntityChangedSchema", () => {
+   it("accepts a create (prev null)", () => {
+     expect(() => EntityChangedSchema.parse(baseChange)).not.toThrow();
+   });
+
+   it("accepts a tombstone (next null)", () => {
+     expect(() =>
+       EntityChangedSchema.parse({ ...baseChange, next: null, delta: {} }),
+     ).not.toThrow();
+   });
+
+   it("requires an actor", () => {
+     const { actor, ...rest } = baseChange;
+     void actor;
+     expect(() => EntityChangedSchema.parse(rest)).toThrow();
+   });
+
+   it("rejects a non-array changedFields", () => {
+     expect(() =>
+       EntityChangedSchema.parse({ ...baseChange, changedFields: "status" }),
+     ).toThrow();
+   });
+ });
+
+ describe("DispatchJobSchema", () => {
+   it("accepts a trigger job", () => {
+     const job = {
+       reason: "trigger" as const,
+       automationId: "a-1",
+       triggerId: "t-1",
+       ref: "incident:inc-1",
+       changed: baseChange,
+     };
+     expect(() => DispatchJobSchema.parse(job)).not.toThrow();
+   });
+
+   it("accepts a wake job", () => {
+     const job = {
+       reason: "wake" as const,
+       runId: "r-1",
+       waitLockId: "w-1",
+       ref: "incident:inc-1",
+       changed: baseChange,
+     };
+     expect(() => DispatchJobSchema.parse(job)).not.toThrow();
+   });
+
+   it("rejects an unknown reason", () => {
+     expect(() =>
+       DispatchJobSchema.parse({ reason: "nope", ref: "x", changed: baseChange }),
+     ).toThrow();
+   });
+ });
package/src/index.ts CHANGED
@@ -6,3 +6,4 @@ export * from "./rpc-contract";
  export * from "./signals";
  export * from "./variable-scope";
  export * from "./shell-env";
+ export * from "./script-test-schemas";
@@ -7,12 +7,19 @@ import {
  } from "@checkstack/common";
  import { automationAccess } from "./access";
  import { pluginMetadata } from "./plugin-metadata";
+ import {
+   ReplayScopeInputSchema,
+   ReplayScopeResultSchema,
+   ScriptTestInputSchema,
+   ScriptTestResultSchema,
+ } from "./script-test-schemas";
  import {
    ActionInfoSchema,
    ArtifactTypeInfoSchema,
    AutomationArtifactSchema,
    AutomationRunSchema,
    AutomationRunStepSchema,
+   AutomationGroupSchema,
    AutomationSchema,
    AutomationStatusSchema,
    CreateAutomationInputSchema,
@@ -45,10 +52,22 @@ export const automationContract = {
45
52
  .input(
46
53
  PaginationInput.extend({
47
54
  status: AutomationStatusSchema.optional(),
55
+ group: AutomationGroupSchema.optional(),
48
56
  }),
49
57
  )
50
58
  .output(PaginatedResult(AutomationSchema)),
51
59
 
60
+ /**
61
+ * Distinct, non-null group values across all automations, sorted
62
+ * alphabetically. Powers the edit-page group picker's "pick existing"
63
+ * suggestions.
64
+ */
65
+ listAutomationGroups: proc({
66
+ operationType: "query",
67
+ userType: "authenticated",
68
+ access: [automationAccess.read],
69
+ }).output(z.object({ groups: z.array(z.string()) })),
70
+
52
71
  getAutomation: proc({
53
72
  operationType: "query",
54
73
  userType: "authenticated",
@@ -200,6 +219,41 @@ export const automationContract = {
200
219
  .input(z.object({ id: z.string() }))
201
220
  .output(z.object({ success: z.boolean() })),
202
221
 
222
+ // ─── Inline script testing ─────────────────────────────────────────────
223
+
224
+ /**
225
+ * Run a `run_script` (TypeScript) or `run_shell` (shell) script against
226
+ * an editable sample context, using the same sandboxed runner the real
227
+ * action uses. Lets operators test a script directly in the editor
228
+ * without dispatching a whole automation.
229
+ *
230
+ * Gated by `manage` because authoring + running a script already
231
+ * executes code on the central backend — this is the same privilege.
232
+ * The run is time-bounded and always central (real satellite runs may
233
+ * differ; the UI notes this).
234
+ */
235
+ testScript: proc({
236
+ operationType: "mutation",
237
+ userType: "authenticated",
238
+ access: [automationAccess.manage],
239
+ })
240
+ .input(ScriptTestInputSchema)
241
+ .output(ScriptTestResultSchema),
242
+
243
+ /**
244
+ * Reconstruct an editable test context from a real automation run, so an
245
+ * operator can replay a script against the data that run saw. Reads run
246
+ * data only (trigger + persisted artifacts, plus the durable scope
247
+ * snapshot when the run is still in-flight), so it's gated by `read`.
248
+ */
249
+ getRunScopeForReplay: proc({
250
+ operationType: "query",
251
+ userType: "authenticated",
252
+ access: [automationAccess.read],
253
+ })
254
+ .input(ReplayScopeInputSchema)
255
+ .output(ReplayScopeResultSchema),
256
+
203
257
  // ─── Template playground ───────────────────────────────────────────────
204
258
 
205
259
  /**
package/src/schemas.ts CHANGED
@@ -1,4 +1,5 @@
1
1
  import { z } from "zod";
2
+ import { ActorSchema } from "@checkstack/common";
2
3
 
3
4
  /**
4
5
  * Schemas for the Automation platform.
@@ -19,6 +20,45 @@ import { z } from "zod";
19
20
  * `automations.definition` column, also round-tripped from YAML in the UI.
20
21
  */
21
22
 
23
+ // ─── Duration ──────────────────────────────────────────────────────────────
24
+
25
+ /**
26
+ * A duration expressed in a single unit, or a template that renders to a
27
+ * number of seconds. Used by a trigger's `for:` dwell (decision D1). The
28
+ * object form is preferred for the editor's duration widget; the template
29
+ * form is the escape hatch for computed durations.
30
+ */
31
+ export const DurationSchema = z.union([
32
+ z.object({ seconds: z.number().int().min(1).max(60 * 60 * 24 * 30) }),
33
+ z.object({ minutes: z.number().int().min(1).max(60 * 24 * 30) }),
34
+ z.object({ hours: z.number().int().min(1).max(24 * 30) }),
35
+ z
36
+ .object({ template: z.string().min(1) })
37
+ .describe("Template rendering to a number of seconds."),
38
+ ]);
39
+
40
+ export type Duration = z.infer<typeof DurationSchema>;
41
+
42
+ /**
43
+ * Resolve a {@link Duration} to milliseconds. The template branch needs a
44
+ * render context, so callers that may receive a template pass a
45
+ * `renderSeconds` resolver. Returns null for an unrenderable / invalid
46
+ * duration.
47
+ */
48
+ export function durationToMs(
49
+ duration: Duration,
50
+ renderSeconds?: (template: string) => number | undefined,
51
+ ): number | null {
52
+ if ("seconds" in duration) return duration.seconds * 1000;
53
+ if ("minutes" in duration) return duration.minutes * 60_000;
54
+ if ("hours" in duration) return duration.hours * 3_600_000;
55
+ const seconds = renderSeconds?.(duration.template);
56
+ if (seconds === undefined || !Number.isFinite(seconds) || seconds < 0) {
57
+ return null;
58
+ }
59
+ return Math.floor(seconds) * 1000;
60
+ }
61
+
22
62
  // ─── Trigger ─────────────────────────────────────────────────────────────
23
63
 
24
64
  /**
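Reviewer note: the following is a minimal sketch of how a caller might use the `durationToMs` helper added in this hunk. The `Duration` type and the function body are restated in plain TypeScript so the snippet runs without zod; `renderNumber` is a hypothetical stand-in for a real template renderer and is not part of the package.

```typescript
// Plain-TypeScript restatement of the Duration union from this hunk
// (the package derives the type from DurationSchema via z.infer).
type Duration =
  | { seconds: number }
  | { minutes: number }
  | { hours: number }
  | { template: string };

// Same logic as the durationToMs added in this release.
function durationToMs(
  duration: Duration,
  renderSeconds?: (template: string) => number | undefined,
): number | null {
  if ("seconds" in duration) return duration.seconds * 1000;
  if ("minutes" in duration) return duration.minutes * 60_000;
  if ("hours" in duration) return duration.hours * 3_600_000;
  const seconds = renderSeconds?.(duration.template);
  if (seconds === undefined || !Number.isFinite(seconds) || seconds < 0) {
    return null;
  }
  return Math.floor(seconds) * 1000;
}

// Hypothetical renderer: treats the template text as an already-rendered number.
const renderNumber = (template: string): number | undefined => {
  const parsed = Number(template);
  return Number.isNaN(parsed) ? undefined : parsed;
};

const fiveMinutes = durationToMs({ minutes: 5 }); // 300000
const fractional = durationToMs({ template: "90.7" }, renderNumber); // 90000, floored to whole seconds
const noRenderer = durationToMs({ template: "90" }); // null, nothing can render the template
const negative = durationToMs({ template: "-5" }, renderNumber); // null, negative durations are rejected
```

The template branch is the only one that can return `null`, so callers that accept templates probably need to decide what an unrenderable dwell means for them (skip the trigger, or treat it as zero).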
@@ -32,7 +72,61 @@ import { z } from "zod";
32
72
  * trigger itself before any action runs.
33
73
  * - `config` is the trigger's own configuration — only used by triggers
34
74
  * that need extra setup (cron pattern for `time.cron`, etc.).
75
+ * - `for` is an optional dwell: fire only if the matched state still holds
76
+ * after the duration (re-confirmed at expiry, restart-safe).
77
+ * - `window` is an optional rate/windowed-count gate: fire only when this
78
+ * trigger has fired (post-filter) at least `count` times within the
79
+ * trailing `minutes`, scoped per `contextKey`. `refire` decides whether to
80
+ * fire on every qualifying occurrence past the threshold (`"every"`) or
81
+ * only on the crossing edge (`"once"`).
82
+ */
83
+
84
+ /**
85
+ * Windowed-count / rate gate (decision: generic `window` block on any
86
+ * trigger). The engine records each qualifying occurrence (after the
87
+ * structured config gate + the operator's `filter`) in a durable append log
88
+ * keyed `(automationId, triggerId, contextKey)` and counts rows within the
89
+ * trailing `minutes`:
90
+ *
91
+ * - `"every"` fires iff `newCount >= count` — re-fires on every occurrence
92
+ * while the window stays over threshold (debounce in the automation if
93
+ * needed).
94
+ * - `"once"` fires iff `newCount === count` — only on the crossing edge;
95
+ * re-arms naturally as old rows age out and the count re-crosses.
96
+ *
97
+ * where `newCount` includes the just-recorded occurrence.
35
98
  */
99
+ export const WindowSchema = z.object({
100
+ count: z
101
+ .number()
102
+ .int()
103
+ .min(1)
104
+ .max(1000)
105
+ .describe("Occurrences within the window that arm the trigger."),
106
+ minutes: z
107
+ .number()
108
+ .int()
109
+ .min(1)
110
+ .max(1440)
111
+ .describe("Trailing sliding window, in minutes, the occurrences count over."),
112
+ refire: z
113
+ .enum(["every", "once"])
114
+ .default("every")
115
+ .describe(
116
+ "`every`: fire on every occurrence at/over the threshold. `once`: fire only on the crossing edge (re-arms as the window drains).",
117
+ ),
118
+ partitionBy: z
119
+ .string()
120
+ .trim()
121
+ .min(1)
122
+ .optional()
123
+ .describe(
124
+ "Optional bare expression (like `filter`, no `{{ }}`) yielding the partition key the count is bucketed by. Omitted ⇒ the trigger's built-in context key (e.g. systemId). An empty/undefined result or an eval error falls back to that built-in key.",
125
+ ),
126
+ });
127
+
128
+ export type Window = z.infer<typeof WindowSchema>;
129
+
36
130
  export const TriggerSchema = z.object({
37
131
  id: z
38
132
  .string()
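Reviewer note: the `refire` rules in the `WindowSchema` comment can be sketched as below. This is an in-memory illustration only. The real engine uses a durable append log keyed by `(automationId, triggerId, contextKey)`; the names `WindowGate`, `countInWindow`, `recordAndDecide` and `replay` are hypothetical, and the exact window-boundary inclusivity is an assumption.

```typescript
type Refire = "every" | "once";

interface WindowGate {
  count: number;
  minutes: number;
  refire: Refire;
}

// Count occurrences inside the trailing window (cutoff, now].
// Boundary inclusivity is an assumption of this sketch.
function countInWindow(timestampsMs: number[], nowMs: number, minutes: number): number {
  const cutoff = nowMs - minutes * 60_000;
  return timestampsMs.filter((t) => t > cutoff && t <= nowMs).length;
}

// Record the occurrence first, then decide, so newCount includes it.
function recordAndDecide(occurrenceLog: number[], nowMs: number, gate: WindowGate): boolean {
  occurrenceLog.push(nowMs);
  const newCount = countInWindow(occurrenceLog, nowMs, gate.minutes);
  return gate.refire === "every" ? newCount >= gate.count : newCount === gate.count;
}

// Replay a series of occurrence times (given in minutes) through a fresh log.
function replay(minutesSeries: number[], gate: WindowGate): boolean[] {
  const occurrenceLog: number[] = [];
  return minutesSeries.map((m) => recordAndDecide(occurrenceLog, m * 60_000, gate));
}

// "every": fires on the third occurrence and on each one after it.
const everyDecisions = replay([0, 1, 2, 3], { count: 3, minutes: 10, refire: "every" });

// "once": fires on the crossing edge only, then again after the window
// drains and the count reaches the threshold a second time.
const onceDecisions = replay([0, 1, 2, 3, 20, 21, 22], { count: 3, minutes: 10, refire: "once" });
```

With these inputs `everyDecisions` is `[false, false, true, true]` and `onceDecisions` is `[false, false, true, false, false, false, true]`.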
@@ -59,24 +153,131 @@ export const TriggerSchema = z.object({
59
153
  .describe(
60
154
  "Per-trigger configuration (e.g. cron pattern for time.cron, interval seconds for time.interval).",
61
155
  ),
156
+ for: DurationSchema.optional().describe(
157
+ "Dwell: fire only if the matched state still holds after this duration. Re-confirmed at expiry, restart-safe.",
158
+ ),
159
+ window: WindowSchema.optional().describe(
160
+ "Windowed-count / rate gate: fire only after this trigger has fired N times within M minutes (scoped per context key).",
161
+ ),
62
162
  });
63
163
 
64
164
  export type Trigger = z.infer<typeof TriggerSchema>;
65
165
 
66
166
  // ─── Condition (recursive) ────────────────────────────────────────────────
67
167
 
168
+ /**
169
+ * Structured condition variants (Wave 2 Phase 16). Each evaluates over the
170
+ * pre-resolved scope (Phase 14 `health.*`) plus a FRESH `now` computed per
171
+ * evaluation (constraint 7 — never the frozen scope `now`).
172
+ */
173
+
174
+ /**
175
+ * `numeric_state` — compare a numeric `value` (a template/path string or a
176
+ * literal number) against optional `above` / `below` bounds.
177
+ */
178
+ export const NumericStateConditionSchema = z
179
+ .object({
180
+ numeric_state: z.object({
181
+ value: z
182
+ .union([z.string().min(1), z.number()])
183
+ .describe("Template/path string or literal number to compare."),
184
+ above: z.number().optional(),
185
+ below: z.number().optional(),
186
+ }),
187
+ })
188
+ .refine(
189
+ (c) =>
190
+ c.numeric_state.above !== undefined ||
191
+ c.numeric_state.below !== undefined,
192
+ { message: "numeric_state requires at least one of `above` / `below`" },
193
+ );
194
+
195
+ export type NumericStateCondition = z.infer<
196
+ typeof NumericStateConditionSchema
197
+ >;
198
+
199
+ /**
200
+ * `time` — on-call / quiet-hours gating. `after` / `before` are `HH:mm`
201
+ * (24h). `weekday` is a list of 0-6 (Sunday = 0). `timezone` is an IANA
202
+ * zone (defaults to UTC).
203
+ */
204
+ const HHMM = /^([01]\d|2[0-3]):[0-5]\d$/;
205
+
206
+ export const TimeConditionSchema = z
207
+ .object({
208
+ time: z.object({
209
+ after: z
210
+ .string()
211
+ .regex(HHMM, "Expected HH:mm (24h)")
212
+ .optional()
213
+ .describe("Inclusive lower bound, local to `timezone`."),
214
+ before: z
215
+ .string()
216
+ .regex(HHMM, "Expected HH:mm (24h)")
217
+ .optional()
218
+ .describe("Exclusive upper bound, local to `timezone`."),
219
+ weekday: z
220
+ .array(z.number().int().min(0).max(6))
221
+ .min(1)
222
+ .optional()
223
+ .describe("Allowed weekdays (0 = Sunday … 6 = Saturday)."),
224
+ timezone: z
225
+ .string()
226
+ .min(1)
227
+ .optional()
228
+ .describe("IANA timezone (e.g. `Europe/Berlin`). Defaults to UTC."),
229
+ }),
230
+ })
231
+ .refine(
232
+ (c) =>
233
+ c.time.after !== undefined ||
234
+ c.time.before !== undefined ||
235
+ c.time.weekday !== undefined,
236
+ { message: "time requires at least one of `after` / `before` / `weekday`" },
237
+ );
238
+
239
+ export type TimeCondition = z.infer<typeof TimeConditionSchema>;
240
+
241
+ /**
242
+ * `state` — condition-side dwell. True when `entity` (a system id) has been
243
+ * in `status` for at least `for` (read from the pre-resolved
244
+ * `health.*.in_status_for_ms`; NO new timer — it reads, it doesn't time).
245
+ */
246
+ export const StateConditionSchema = z.object({
247
+ state: z.object({
248
+ entity: z
249
+ .string()
250
+ .min(1)
251
+ .describe("System id whose live health state to read."),
252
+ status: z
253
+ .enum(["healthy", "degraded", "unhealthy"])
254
+ .describe("Status the entity must currently be in."),
255
+ for: DurationSchema.optional().describe(
256
+ "Optional minimum dwell — the entity must have held `status` at least this long.",
257
+ ),
258
+ }),
259
+ });
260
+
261
+ export type StateCondition = z.infer<typeof StateConditionSchema>;
262
+
68
263
  /**
69
264
  * A condition is either:
70
- * - A template string that evaluates truthy/falsy, OR
71
- * - A `{ and | or | not }` combinator wrapping nested conditions.
265
+ * - A template string that evaluates truthy/falsy,
266
+ * - A `{ and | or | not }` combinator wrapping nested conditions, OR
267
+ * - A structured variant (`numeric_state`, `time`, `state`).
72
268
  *
73
- * Recursive — uses `z.lazy` for the combinator branches.
269
+ * Recursive — uses `z.lazy` for the combinator branches. The raw template
270
+ * string stays the escape hatch for anything the structured variants don't
271
+ * cover.
74
272
  */
75
273
  export type ConditionInput =
76
274
  | string
77
275
  | { and: ConditionInput[] }
78
276
  | { or: ConditionInput[] }
79
- | { not: ConditionInput };
277
+ | { not: ConditionInput }
278
+ | NumericStateCondition
279
+ | TimeCondition
280
+ | StateCondition;
80
281
 
81
282
  export const ConditionSchema: z.ZodType<ConditionInput> = z.lazy(() =>
82
283
  z.union([
@@ -93,6 +294,9 @@ export const ConditionSchema: z.ZodType<ConditionInput> = z.lazy(() =>
93
294
  z.object({
94
295
  not: ConditionSchema,
95
296
  }),
297
+ NumericStateConditionSchema,
298
+ TimeConditionSchema,
299
+ StateConditionSchema,
96
300
  ]),
97
301
  );
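Reviewer note: to illustrate how the three structured variants compose with the existing combinators and the raw template escape hatch, here is a small sketch. The types are restated in plain TypeScript so it runs without zod. The system id `payments-api`, the form of the `numeric_state` `value` string, and the `leafKinds` helper are assumptions made for illustration; the template string is the one quoted in this release's `health` scope description.

```typescript
type Duration =
  | { seconds: number }
  | { minutes: number }
  | { hours: number }
  | { template: string };

// Mirrors the ConditionInput union from this diff.
type Condition =
  | string
  | { and: Condition[] }
  | { or: Condition[] }
  | { not: Condition }
  | { numeric_state: { value: string | number; above?: number; below?: number } }
  | { time: { after?: string; before?: string; weekday?: number[]; timezone?: string } }
  | { state: { entity: string; status: "healthy" | "degraded" | "unhealthy"; for?: Duration } };

// Page only when the system has been unhealthy for 10 minutes, its success
// rate is low, it is outside weekday office hours, and the dwell template holds.
const pageOnCall: Condition = {
  and: [
    { state: { entity: "payments-api", status: "unhealthy", for: { minutes: 10 } } },
    // Assumption: a bare path is accepted here; a template string may be needed instead.
    { numeric_state: { value: "health.system.success_rate", below: 0.9 } },
    { not: { time: { after: "09:00", before: "17:00", weekday: [1, 2, 3, 4, 5], timezone: "Europe/Berlin" } } },
    "{{ health.system.in_status_since | older_than(30 | minutes) }}",
  ],
};

// Hypothetical helper: list the leaf condition kinds in document order.
function leafKinds(condition: Condition, out: string[] = []): string[] {
  if (typeof condition === "string") out.push("template");
  else if ("and" in condition) condition.and.forEach((c) => leafKinds(c, out));
  else if ("or" in condition) condition.or.forEach((c) => leafKinds(c, out));
  else if ("not" in condition) leafKinds(condition.not, out);
  else if ("numeric_state" in condition) out.push("numeric_state");
  else if ("time" in condition) out.push("time");
  else out.push("state");
  return out;
}

const kinds = leafKinds(pageOnCall);
```

Note that the real schemas also enforce rules this type cannot express, such as `numeric_state` requiring at least one of `above` / `below`.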
98
302
 
@@ -152,6 +356,7 @@ export type ActionInput =
152
356
  | ConditionGuardInput
153
357
  | StopInput
154
358
  | WaitForTriggerInput
359
+ | WaitUntilInput
155
360
  | SequenceInput;
156
361
 
157
362
  export interface ChooseInput {
@@ -230,6 +435,19 @@ export interface WaitForTriggerInput {
230
435
  };
231
436
  }
232
437
 
438
+ export interface WaitUntilInput {
439
+ id?: string;
440
+ description?: string;
441
+ enabled?: boolean;
442
+ continue_on_error?: boolean;
443
+ wait_until: {
444
+ condition: ConditionInput;
445
+ timeout_seconds?: number;
446
+ /** Default true (HA semantics): on timeout, continue rather than fail. */
447
+ continue_on_timeout?: boolean;
448
+ };
449
+ }
450
+
233
451
  export interface SequenceInput {
234
452
  id?: string;
235
453
  description?: string;
@@ -392,6 +610,30 @@ export const WaitForTriggerActionSchema = z.object({
392
610
  }),
393
611
  });
394
612
 
613
+ /**
614
+ * 11. Wait until — suspend the run until a CONDITION becomes true, with an
615
+ * optional timeout. Unlike `wait_for_trigger` (wait for an *event*), this
616
+ * polls the condition on an interval, re-resolving live state each tick.
617
+ *
618
+ * `continue_on_timeout` defaults to true (HA's `wait_template` semantics):
619
+ * on timeout the run continues rather than failing.
620
+ */
621
+ export const WaitUntilActionSchema: z.ZodType<WaitUntilInput> = z.lazy(() =>
622
+ z.object({
623
+ ...ActionBase,
624
+ wait_until: z.object({
625
+ condition: ConditionSchema,
626
+ timeout_seconds: z
627
+ .number()
628
+ .int()
629
+ .min(1)
630
+ .max(60 * 60 * 24 * 30) // 30 days
631
+ .optional(),
632
+ continue_on_timeout: z.boolean().default(true),
633
+ }),
634
+ }),
635
+ );
636
+
395
637
  /**
396
638
  * The discriminated union of all 9 action primitives. Discrimination is by
397
639
  * presence of a key (action / choose / parallel / etc.), matching the
@@ -411,6 +653,7 @@ export const ActionSchema: z.ZodType<ActionInput> = z.lazy(() =>
411
653
  ConditionGuardActionSchema,
412
654
  StopActionSchema,
413
655
  WaitForTriggerActionSchema,
656
+ WaitUntilActionSchema,
414
657
  SequenceActionSchema,
415
658
  ]),
416
659
  );
@@ -434,6 +677,27 @@ export const AutomationModeSchema = z
434
677
 
435
678
  export type AutomationMode = z.infer<typeof AutomationModeSchema>;
436
679
 
680
+ /**
681
+ * Scope the concurrency `mode` is evaluated over.
682
+ *
683
+ * - `automation` (default): one concurrency bucket for the whole
684
+ * automation. `single` allows one in-flight run total; `restart`
685
+ * cancels every active run.
686
+ * - `context_key`: an independent bucket per `contextKey` (typically
687
+ * per system / incident). `single` allows one in-flight run *per
688
+ * context key* (system A and system B run concurrently, but a second
689
+ * run for system A is deduped); `restart` cancels only the active
690
+ * runs sharing the incoming context key.
691
+ *
692
+ * Backward-compatible: omitted defaults to `automation`, so existing
693
+ * automations behave exactly as before.
694
+ */
695
+ export const ConcurrencyScopeSchema = z
696
+ .enum(["automation", "context_key"])
697
+ .default("automation");
698
+
699
+ export type ConcurrencyScope = z.infer<typeof ConcurrencyScopeSchema>;
700
+
437
701
  // ─── Automation definition (top-level) ────────────────────────────────────
438
702
 
439
703
  /**
@@ -453,8 +717,40 @@ export const AutomationDefinitionSchema = z.object({
453
717
  actions: z.array(ActionSchema).default([]),
454
718
  /** Concurrency mode. */
455
719
  mode: AutomationModeSchema,
720
+ /** Scope the concurrency mode is evaluated over (per-automation vs per-context-key). */
721
+ concurrency_scope: ConcurrencyScopeSchema,
456
722
  /** Max parallel runs (only meaningful in parallel/queued modes). */
457
723
  max_runs: z.number().int().min(1).max(1000).default(10),
724
+ /**
725
+ * Explicit live-state resolution list (sensing layer). By default the
726
+ * engine resolves the state of the system named by the trigger's
727
+ * `contextKey` (the common single-system case). Listing system ids here
728
+ * resolves their state too, surfaced in templates under
729
+ * `health.systems[<id>]` for cross-system rules. Bounded to keep the
730
+ * pre-evaluation batch query cheap.
731
+ */
732
+ uses_state: z
733
+ .array(z.string().min(1))
734
+ .max(50)
735
+ .optional()
736
+ .describe(
737
+ "Extra system ids whose live health state is resolved into scope under health.systems[id].",
738
+ ),
739
+ /**
740
+ * Trailing window (minutes) for the `health.*.transitions_in_window`
741
+ * count folded into scope. Lets an operator author custom flapping
742
+ * rules ("N status changes in M minutes") via a numeric_state condition
743
+ * over that field. Defaults to 60 when omitted.
744
+ */
745
+ state_window_minutes: z
746
+ .number()
747
+ .int()
748
+ .min(1)
749
+ .max(60 * 24 * 7) // up to a week
750
+ .optional()
751
+ .describe(
752
+ "Window (minutes) for health.*.transitions_in_window. Default 60.",
753
+ ),
458
754
  });
459
755
 
460
756
  export type AutomationDefinition = z.infer<typeof AutomationDefinitionSchema>;
@@ -464,10 +760,19 @@ export type AutomationDefinition = z.infer<typeof AutomationDefinitionSchema>;
464
760
  export const AutomationStatusSchema = z.enum(["enabled", "disabled"]);
465
761
  export type AutomationStatus = z.infer<typeof AutomationStatusSchema>;
466
762
 
763
+ /**
764
+ * Optional grouping label, stored as its own row column (not part of the
765
+ * definition / YAML). A single free-text "category" (HA-style) used purely
766
+ * to organise the list into collapsible sections. Empty / absent means the
767
+ * automation lives in the implicit "Ungrouped" bucket.
768
+ */
769
+ export const AutomationGroupSchema = z.string().trim().min(1).max(120);
770
+
467
771
  export const AutomationSchema = z.object({
468
772
  id: z.string(),
469
773
  name: z.string(),
470
774
  description: z.string().optional(),
775
+ group: AutomationGroupSchema.optional(),
471
776
  status: AutomationStatusSchema,
472
777
  definition: AutomationDefinitionSchema,
473
778
  managedBy: z.string().optional().describe("GitOps provider id when managed declaratively"),
@@ -580,6 +885,7 @@ export type AutomationArtifact = z.infer<typeof AutomationArtifactSchema>;
580
885
  export const CreateAutomationInputSchema = z.object({
581
886
  name: z.string().min(1).max(200),
582
887
  description: z.string().optional(),
888
+ group: AutomationGroupSchema.optional(),
583
889
  status: AutomationStatusSchema.default("enabled"),
584
890
  definition: AutomationDefinitionSchema,
585
891
  });
@@ -590,6 +896,12 @@ export const UpdateAutomationInputSchema = z.object({
590
896
  id: z.string(),
591
897
  name: z.string().min(1).max(200).optional(),
592
898
  description: z.string().optional(),
899
+ /**
900
+ * `null` clears the group (back to Ungrouped); `undefined` leaves it
901
+ * unchanged; a string sets it. Modelled with `.nullish()` so the set-builder
902
+ * can distinguish "clear" from "no change".
903
+ */
904
+ group: AutomationGroupSchema.nullish(),
593
905
  status: AutomationStatusSchema.optional(),
594
906
  definition: AutomationDefinitionSchema.optional(),
595
907
  });
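Reviewer note: the tri-state `group` field on `UpdateAutomationInputSchema` (string sets, `null` clears, `undefined` leaves unchanged) could be consumed roughly as follows. `buildGroupPatch` is a hypothetical helper for illustration, not the package's actual set-builder.

```typescript
type GroupInput = string | null | undefined;

// Turn the tri-state input into a partial column update.
function buildGroupPatch(group: GroupInput): { group?: string | null } {
  if (group === undefined) return {}; // no change: leave the column alone
  if (group === null) return { group: null }; // clear: back to "Ungrouped"
  return { group: group.trim() }; // set: store the trimmed label
}

const unchangedPatch = buildGroupPatch(undefined);
const clearedPatch = buildGroupPatch(null);
const setPatch = buildGroupPatch("  Billing ");
```

Keeping "absent" and `null` distinct is what lets a partial update omit `group` without accidentally moving the automation back to Ungrouped.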
@@ -645,6 +957,12 @@ export const TriggerInfoSchema = z.object({
645
957
  .string()
646
958
  .optional()
647
959
  .describe("Default context key path inside the payload (e.g. 'incidentId')"),
960
+ contextKeyLabel: z
961
+ .string()
962
+ .optional()
963
+ .describe(
964
+ "Human label for the trigger's built-in context dimension (e.g. 'system'). UI hint only - shown as the default partition in the window editor. Absent ⇒ no built-in context (per automation).",
965
+ ),
648
966
  });
649
967
 
650
968
  export type TriggerInfo = z.infer<typeof TriggerInfoSchema>;
@@ -680,3 +998,64 @@ export const ArtifactTypeInfoSchema = z.object({
680
998
  });
681
999
 
682
1000
  export type ArtifactTypeInfo = z.infer<typeof ArtifactTypeInfoSchema>;
1001
+
1002
+ // ─── Reactive entity engine — two-stage queue payloads ─────────────────────
1003
+ //
1004
+ // The entity state machine (`defineEntity`, automation-backend) emits an
1005
+ // internal `ENTITY_CHANGED` hook on every real diff. These schemas are the
1006
+ // wire shapes for the two-stage dispatch pipeline (reactive automation
1007
+ // engine §13.2). Defined here so later phases (Stage-1 routing, Stage-2
1008
+ // dispatch) consume one canonical zod source of truth.
1009
+
1010
+ /**
1011
+ * Stage-1 input — the `ENTITY_CHANGED` hook payload. `prev` is null on
1012
+ * create; `next` is null on remove (a tombstone). `delta` carries only the
1013
+ * changed fields; `actor` is the mutating actor (from `HookEventMeta`).
1014
+ */
1015
+ export const EntityChangedSchema = z.object({
1016
+ kind: z.string(),
1017
+ id: z.string(),
1018
+ prev: z.record(z.string(), z.unknown()).nullable(), // null on create
1019
+ next: z.record(z.string(), z.unknown()).nullable(), // null on remove (tombstone)
1020
+ delta: z.record(z.string(), z.unknown()), // changed fields only
1021
+ changedFields: z.array(z.string()),
1022
+ actor: ActorSchema, // from HookEventMeta
1023
+ occurredAt: z.string(), // ISO
1024
+ /**
1025
+ * Stable, per-change identity generated ONCE at emit time and carried
1026
+ * through every at-least-once redelivery of THIS change. Distinguishes two
1027
+ * DISTINCT changes to the same entity that share an `occurredAt`
1028
+ * (millisecond granularity) so the Stage-2 trigger jobId dedupes
1029
+ * redeliveries of one change without collapsing two real changes (reactive
1030
+ * automation engine §13.2). Optional for back-compat with payloads emitted
1031
+ * before this field existed; the router falls back to `occurredAt`.
1032
+ */
1033
+ changeId: z.string().optional(),
1034
+ });
1035
+
1036
+ export type EntityChanged = z.infer<typeof EntityChangedSchema>;
1037
+
1038
+ /**
1039
+ * Stage-2 per-run dispatch job. Either a fresh run from a trigger match,
1040
+ * or a resume of a suspended `wait_until`.
1041
+ */
1042
+ export const DispatchJobSchema = z.discriminatedUnion("reason", [
1043
+ z.object({
1044
+ // fresh run from a trigger match
1045
+ reason: z.literal("trigger"),
1046
+ automationId: z.string(),
1047
+ triggerId: z.string(),
1048
+ ref: z.string(), // `${kind}:${id}`
1049
+ changed: EntityChangedSchema,
1050
+ }),
1051
+ z.object({
1052
+ // resume a suspended wait_until
1053
+ reason: z.literal("wake"),
1054
+ runId: z.string(),
1055
+ waitLockId: z.string(),
1056
+ ref: z.string(),
1057
+ changed: EntityChangedSchema,
1058
+ }),
1059
+ ]);
1060
+
1061
+ export type DispatchJob = z.infer<typeof DispatchJobSchema>;
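Reviewer note: a consumer of `DispatchJob` would typically narrow on the `reason` discriminator. The sketch below restates the two payload shapes in plain TypeScript (no zod) and shows an exhaustive switch, plus the `changeId` fallback described in the `EntityChangedSchema` comment. `describeJob`, `changeIdentity` and the sample `actor` value are hypothetical; the real `ActorSchema` lives in `@checkstack/common` and is left opaque here.

```typescript
interface EntityChanged {
  kind: string;
  id: string;
  prev: Record<string, unknown> | null; // null on create
  next: Record<string, unknown> | null; // null on remove (tombstone)
  delta: Record<string, unknown>;
  changedFields: string[];
  actor: unknown; // opaque in this sketch
  occurredAt: string;
  changeId?: string;
}

type DispatchJob =
  | { reason: "trigger"; automationId: string; triggerId: string; ref: string; changed: EntityChanged }
  | { reason: "wake"; runId: string; waitLockId: string; ref: string; changed: EntityChanged };

// Per the changeId doc comment: older payloads fall back to occurredAt.
function changeIdentity(changed: EntityChanged): string {
  return changed.changeId ?? changed.occurredAt;
}

function describeJob(job: DispatchJob): string {
  switch (job.reason) {
    case "trigger":
      return `start ${job.automationId}/${job.triggerId} for ${job.ref}`;
    case "wake":
      return `resume run ${job.runId} (lock ${job.waitLockId}) for ${job.ref}`;
    default: {
      // Compile-time exhaustiveness check: a new reason fails to type-check here.
      const unreachable: never = job;
      return unreachable;
    }
  }
}

const sampleChange: EntityChanged = {
  kind: "incident",
  id: "inc-1",
  prev: null,
  next: { status: "open" },
  delta: { status: "open" },
  changedFields: ["status"],
  actor: { type: "system" }, // hypothetical shape
  occurredAt: "2024-01-01T00:00:00.000Z",
};

const triggerSummary = describeJob({
  reason: "trigger",
  automationId: "a-1",
  triggerId: "t-1",
  ref: "incident:inc-1",
  changed: sampleChange,
});
```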
@@ -0,0 +1,91 @@
1
+ import { z } from "zod";
2
+
3
+ /**
4
+ * Wire schemas for the in-UI script test endpoint (`testScript`).
5
+ *
6
+ * The editable sample context mirrors what a `run_script` action sees as
7
+ * `globalThis.context` and what `run_shell` flattens into `$CHECKSTACK_*`
8
+ * env vars. Every field is optional so a partial sample still runs.
9
+ */
10
+
11
+ export const ScriptTestKindSchema = z.enum(["typescript", "shell"]);
12
+ export type ScriptTestKind = z.infer<typeof ScriptTestKindSchema>;
13
+
14
+ export const ScriptTestContextSchema = z.object({
15
+ trigger: z
16
+ .object({
17
+ event: z.string().optional(),
18
+ payload: z.record(z.string(), z.unknown()).optional(),
19
+ })
20
+ .optional(),
21
+ artifacts: z.record(z.string(), z.unknown()).optional(),
22
+ var: z.record(z.string(), z.unknown()).optional(),
23
+ repeat: z
24
+ .object({
25
+ index: z.number().int(),
26
+ item: z.unknown().optional(),
27
+ })
28
+ .optional(),
29
+ });
30
+ export type ScriptTestContext = z.infer<typeof ScriptTestContextSchema>;
31
+
32
+ export const ScriptTestInputSchema = z.object({
33
+ kind: ScriptTestKindSchema,
34
+ script: z.string(),
35
+ context: ScriptTestContextSchema.optional(),
36
+ env: z.record(z.string(), z.string()).optional(),
37
+ /**
38
+ * The script's declared secret -> env mapping
39
+ * (`{ ENV_NAME: "${{ secrets.NAME }}" }`). The test panel NEVER resolves
40
+ * real secret values: for each entry it injects a named placeholder
41
+ * (`__SECRET_<NAME>__`) by default, or the user's override value (see
42
+ * `secretOverrides`) for a realistic run. Real production values never
43
+ * reach the test surface (decision 4).
44
+ */
45
+ secretEnv: z.record(z.string(), z.string()).optional(),
46
+ /**
47
+ * User-supplied per-secret-NAME override values for a realistic test
48
+ * (keyed by the `${{ secrets.NAME }}` name, not the env var). These stay
49
+ * client-side until sent here as an explicit test input, and are masked
50
+ * out of the test result so even an override can't round-trip unmasked.
51
+ */
52
+ secretOverrides: z.record(z.string(), z.string()).optional(),
53
+ workingDirectory: z.string().optional(),
54
+ timeoutMs: z.number().int().min(100).max(300_000).default(30_000),
55
+ });
56
+ export type ScriptTestInputDto = z.infer<typeof ScriptTestInputSchema>;
57
+
58
+ export const ScriptTestResultSchema = z.object({
59
+ result: z.unknown().optional(),
60
+ stdout: z.string(),
61
+ stderr: z.string(),
62
+ exitCode: z.number().int().optional(),
63
+ durationMs: z.number().int().nonnegative(),
64
+ timedOut: z.boolean(),
65
+ error: z.string().optional(),
66
+ });
67
+ export type ScriptTestResultDto = z.infer<typeof ScriptTestResultSchema>;
68
+
69
+ /**
70
+ * Input for `getRunScopeForReplay`: reconstruct a test context from a real
71
+ * automation run so an operator can debug an action against the data that
72
+ * run saw. `actionPath` is accepted for forward-compatibility (scoping
73
+ * artifacts to those available before that action); v1 returns the full
74
+ * run scope.
75
+ */
76
+ export const ReplayScopeInputSchema = z.object({
77
+ runId: z.string(),
78
+ actionPath: z.string().optional(),
79
+ });
80
+ export type ReplayScopeInputDto = z.infer<typeof ReplayScopeInputSchema>;
81
+
82
+ export const ReplayScopeResultSchema = z.object({
83
+ context: ScriptTestContextSchema,
84
+ /**
85
+ * False when `var` / `repeat` could not be recovered because the run's
86
+ * durable scope snapshot was already cleared (terminal status). The
87
+ * trigger + artifacts are still reconstructed; the UI can note the gap.
88
+ */
89
+ scopeSnapshotAvailable: z.boolean(),
90
+ });
91
+ export type ReplayScopeResultDto = z.infer<typeof ReplayScopeResultSchema>;
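Reviewer note: the `secretEnv` / `secretOverrides` comments describe a placeholder scheme that might look roughly like the sketch below. The backend's actual parsing and masking are not part of this diff, so `SECRET_REF`, `buildTestSecretEnv` and the choice to skip non-matching entries are assumptions for illustration only.

```typescript
// Matches a reference of the form ${{ secrets.NAME }} and captures NAME.
const SECRET_REF = /^\$\{\{\s*secrets\.([A-Za-z0-9_]+)\s*\}\}$/;

// Build the env a test run would see: an override when the user supplied one
// (keyed by secret NAME, not env var), otherwise a named placeholder.
function buildTestSecretEnv(
  secretEnv: Record<string, string>,
  secretOverrides: Record<string, string> = {},
): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const envName of Object.keys(secretEnv)) {
    const ref = secretEnv[envName];
    const secretName = ref ? SECRET_REF.exec(ref)?.[1] : undefined;
    if (!secretName) continue; // not a secrets reference: skipped in this sketch
    resolved[envName] = secretOverrides[secretName] ?? `__SECRET_${secretName}__`;
  }
  return resolved;
}

const testEnv = buildTestSecretEnv(
  {
    API_TOKEN: "${{ secrets.API_TOKEN }}",
    DB_PASSWORD: "${{ secrets.DB_PASS }}",
  },
  { DB_PASS: "local-only" },
);
```

Here `API_TOKEN` would receive the placeholder `__SECRET_API_TOKEN__`, while `DB_PASSWORD` would receive the user's override. In neither case is a real stored secret read.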
@@ -164,6 +164,7 @@ function basicDefinition(
164
164
  conditions: [],
165
165
  actions: [],
166
166
  mode: "single",
167
+ concurrency_scope: "automation",
167
168
  max_runs: 1,
168
169
  ...overrides,
169
170
  };
@@ -237,6 +238,33 @@ describe("resolveVariableScope", () => {
237
238
  expect(unmatched).toContain("trigger.actor.type");
238
239
  });
239
240
 
241
+ it("exposes the health.* sensing namespace on every automation", () => {
242
+ const flat = flattenScope(
243
+ resolveVariableScope({
244
+ definition: basicDefinition(),
245
+ triggers: [triggerInfo],
246
+ actions: [],
247
+ artifactTypes: [],
248
+ path: [{ slot: "root", index: 0 }],
249
+ }),
250
+ );
251
+ const byPath = new Map(flat.map((e) => [e.path, e]));
252
+ expect(byPath.has("health")).toBe(true);
253
+ expect(byPath.has("health.system")).toBe(true);
254
+ expect(byPath.has("health.systems")).toBe(true);
255
+ expect(byPath.has("health.system.status")).toBe(true);
256
+ expect(byPath.has("health.system.in_status_since")).toBe(true);
257
+ expect(byPath.has("health.system.in_maintenance")).toBe(true);
258
+ // status leaf is typed as the status literal union
259
+ expect(byPath.get("health.system.status")?.type).toBe(
260
+ '"healthy" | "degraded" | "unhealthy"',
261
+ );
262
+ // never gated on a trigger subset — the engine always folds it in.
263
+ expect(
264
+ byPath.get("health.system.status")?.conditionalOnTriggers,
265
+ ).toBeUndefined();
266
+ });
267
+
240
268
  it("exposes trigger.id typed as the literal union of the automation's trigger ids", () => {
241
269
  const definition = basicDefinition({
242
270
  // Two triggers on the SAME event, distinguished by explicit ids.
@@ -519,6 +519,107 @@ function buildActorEntry(): VariableEntry {
519
519
  };
520
520
  }
521
521
 
522
+ /**
523
+ * Static `health.*` namespace (sensing layer). The dispatch engine
524
+ * pre-resolves live health state into scope before evaluating any
525
+ * condition/template, so `health.system` (the trigger's context system)
526
+ * and `health.systems[<id>]` (any `uses_state` ids) are always readable
527
+ * as plain data. Offered unconditionally - the engine always folds the
528
+ * namespace in (empty when nothing resolves), so referencing it never
529
+ * throws.
530
+ */
531
+ const HEALTH_STATE_PROPERTIES: Record<string, Record<string, unknown>> = {
532
+ status: { type: "string", enum: ["healthy", "degraded", "unhealthy"] },
533
+ in_status_since: { type: ["string", "null"] },
534
+ in_status_for_ms: { type: "number" },
535
+ latency_ms: { type: "number" },
536
+ avg_latency_ms: { type: "number" },
537
+ p95_latency_ms: { type: "number" },
538
+ success_rate: { type: "number" },
539
+ last_run_at: { type: "string" },
540
+ in_maintenance: { type: "boolean" },
541
+ transitions_in_window: { type: "number" },
542
+ transition_window_minutes: { type: "number" },
543
+ evaluated_at: { type: "string" },
544
+ };
545
+
546
+ const HEALTH_STATE_DESCRIPTIONS: Record<string, string> = {
547
+ status: "Aggregate health status of the system.",
548
+ in_status_since: "ISO timestamp the system entered its current status (null if unknown).",
549
+ in_status_for_ms: "Milliseconds the system has held its current status.",
550
+ latency_ms: "Latency of the newest run.",
551
+ avg_latency_ms: "Windowed average latency.",
552
+ p95_latency_ms: "Windowed p95 latency.",
553
+ success_rate: "Windowed success rate in [0, 1].",
554
+ last_run_at: "ISO timestamp of the newest run.",
555
+ in_maintenance: "Whether the system is in an active maintenance window.",
556
+ transitions_in_window:
557
+ "Status changes in the trailing window (generalized flapping count).",
558
+ transition_window_minutes:
559
+ "The window (minutes) transitions_in_window was counted over.",
560
+ evaluated_at: "ISO timestamp this snapshot was computed.",
561
+ };
562
+
563
+ function buildHealthStateChildren(prefix: string): VariableEntry[] {
564
+ return Object.entries(HEALTH_STATE_PROPERTIES).map(([key, jsonSchema]) => ({
565
+ path: `${prefix}.${key}`,
566
+ templateRef: `${prefix}.${key}`,
567
+ type:
568
+ key === "status"
569
+ ? '"healthy" | "degraded" | "unhealthy"'
570
+ : key === "in_maintenance"
571
+ ? "boolean"
572
+ : jsonSchema.type === "number"
573
+ ? "number"
574
+ : "string",
575
+ description: HEALTH_STATE_DESCRIPTIONS[key],
576
+ jsonSchema,
577
+ }));
578
+ }
579
+
580
+ function buildHealthEntry(): VariableEntry {
581
+ const stateSchema = {
582
+ type: "object",
583
+ properties: HEALTH_STATE_PROPERTIES,
584
+ };
585
+ return {
586
+ path: "health",
587
+ templateRef: "health",
588
+ type: "object",
589
+ description:
590
+ "Live health state, pre-resolved before evaluation. Use with duration filters, e.g. {{ health.system.in_status_since | older_than(30 | minutes) }}.",
591
+ jsonSchema: {
592
+ type: "object",
593
+ properties: {
594
+ system: stateSchema,
595
+ systems: { type: "object", additionalProperties: stateSchema },
596
+ },
597
+ },
598
+ children: [
599
+ {
600
+ path: "health.system",
601
+ templateRef: "health.system",
602
+ type: "object",
603
+ description:
604
+ "Live state of the system named by the trigger's context key.",
605
+ jsonSchema: stateSchema,
606
+ children: buildHealthStateChildren("health.system"),
607
+ },
608
+ {
609
+ path: "health.systems",
610
+ templateRef: "health.systems",
611
+ type: "Record<string, HealthState>",
612
+ description:
613
+ "Live state of every resolved system keyed by id (context system + uses_state).",
614
+ jsonSchema: {
615
+ type: "object",
616
+ additionalProperties: stateSchema,
617
+ },
618
+ },
619
+ ],
620
+ };
621
+ }
622
+
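Note on the `health` entry added above: its description pairs the pre-resolved state with the duration filters, as in `{{ health.system.in_status_since | older_than(30 | minutes) }}`. The following is a minimal sketch of how such a dwell check could be computed. It is not the `@checkstack/template-engine` implementation: the function names `durationSince` and `olderThan`, the injectable `now` parameter, the strict `>` comparison, and returning `null` / `false` on unparseable input are all assumptions made for illustration.

```typescript
// Hypothetical stand-ins for the duration filters named in the changelog.
// The released filters may differ in signature and in how they fail safe.
const minutes = (n: number): number => n * 60_000;
const hours = (n: number): number => n * 3_600_000;

// Milliseconds elapsed since an ISO timestamp, or null when the input is
// missing or cannot be parsed (assumed fail-safe behaviour).
function durationSince(
  iso: string | null | undefined,
  now: number = Date.now(),
): number | null {
  if (typeof iso !== "string") return null;
  const then = Date.parse(iso);
  return Number.isNaN(then) ? null : now - then;
}

// Dwell check: true only when the timestamp is strictly older than
// thresholdMs. Bad input yields false (assumption), so a condition built on
// it does not fire by accident.
function olderThan(
  iso: string | null | undefined,
  thresholdMs: number,
  now: number = Date.now(),
): boolean {
  const elapsed = durationSince(iso, now);
  return elapsed !== null && elapsed > thresholdMs;
}

// Partial, illustrative scope shape: only two of the health-state keys.
const scope = {
  health: {
    system: {
      status: "unhealthy",
      in_status_since: "2024-01-01T00:00:00.000Z",
    },
  },
};

// Fixed "now" so the example is deterministic: 45 minutes after the
// status change.
const fixedNow = Date.parse("2024-01-01T00:45:00.000Z");
const dwellExceeded = olderThan(
  scope.health.system.in_status_since,
  minutes(30),
  fixedNow,
);
```

Passing `now` explicitly keeps the helpers pure and testable; per the changelog, the real filters read the clock at call time so that "now" is fresh on every evaluation.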
522
623
  /**
523
624
  * Derive a stable, identifier-safe id for a trigger from its event id, used
524
625
  * when the operator hasn't assigned an explicit `id`. Mirrors the backend
@@ -976,7 +1077,7 @@ export function resolveVariableScope(
976
1077
  registeredTriggers: triggers,
977
1078
  });
978
1079
 
979
- const entries: VariableEntry[] = [...triggerEntries];
1080
+ const entries: VariableEntry[] = [...triggerEntries, buildHealthEntry()];
980
1081
 
981
1082
  if (accumulated.vars.length > 0) {
982
1083
  entries.push({
@@ -0,0 +1,89 @@
1
+ import { describe, expect, it } from "bun:test";
2
+
3
+ import { TriggerSchema, WindowSchema } from "./schemas";
4
+
5
+ describe("WindowSchema", () => {
6
+ it("parses a full window block", () => {
7
+ const parsed = WindowSchema.parse({
8
+ count: 3,
9
+ minutes: 60,
10
+ refire: "once",
11
+ });
12
+ expect(parsed).toEqual({ count: 3, minutes: 60, refire: "once" });
13
+ });
14
+
15
+ it("defaults refire to `every` when omitted", () => {
16
+ const parsed = WindowSchema.parse({ count: 5, minutes: 10 });
17
+ expect(parsed.refire).toBe("every");
18
+ });
19
+
20
+ it("rejects count below 1", () => {
21
+ expect(() => WindowSchema.parse({ count: 0, minutes: 10 })).toThrow();
22
+ });
23
+
24
+ it("rejects count above 1000", () => {
25
+ expect(() => WindowSchema.parse({ count: 1001, minutes: 10 })).toThrow();
26
+ });
27
+
28
+ it("rejects minutes above 1440 (the 24h cap)", () => {
29
+ expect(() => WindowSchema.parse({ count: 3, minutes: 1441 })).toThrow();
30
+ });
31
+
32
+ it("rejects a non-integer count", () => {
33
+ expect(() => WindowSchema.parse({ count: 2.5, minutes: 10 })).toThrow();
34
+ });
35
+
36
+ it("rejects an unknown refire mode", () => {
37
+ expect(() =>
38
+ WindowSchema.parse({ count: 3, minutes: 10, refire: "cooldown" }),
39
+ ).toThrow();
40
+ });
41
+
42
+ it("leaves partitionBy undefined when omitted", () => {
43
+ const parsed = WindowSchema.parse({ count: 3, minutes: 60 });
44
+ expect(parsed.partitionBy).toBeUndefined();
45
+ });
46
+
47
+ it("parses a partitionBy bare expression and trims surrounding whitespace", () => {
48
+ const parsed = WindowSchema.parse({
49
+ count: 3,
50
+ minutes: 60,
51
+ partitionBy: " trigger.payload.severity ",
52
+ });
53
+ expect(parsed.partitionBy).toBe("trigger.payload.severity");
54
+ });
55
+
56
+ it("rejects an empty / whitespace-only partitionBy", () => {
57
+ expect(() =>
58
+ WindowSchema.parse({ count: 3, minutes: 60, partitionBy: " " }),
59
+ ).toThrow();
60
+ });
61
+ });
62
+
63
+ describe("TriggerSchema window field", () => {
64
+ it("accepts a trigger without a window (window optional)", () => {
65
+ const parsed = TriggerSchema.parse({
66
+ event: "healthcheck.system_health_changed",
67
+ });
68
+ expect(parsed.window).toBeUndefined();
69
+ });
70
+
71
+ it("accepts a trigger carrying a window + filter (flapping shape)", () => {
72
+ const parsed = TriggerSchema.parse({
73
+ id: "flapping",
74
+ event: "healthcheck.system_health_changed",
75
+ filter: '{{ trigger.payload.newStatus != "healthy" }}',
76
+ window: { count: 3, minutes: 60, refire: "once" },
77
+ });
78
+ expect(parsed.window).toEqual({ count: 3, minutes: 60, refire: "once" });
79
+ expect(parsed.filter).toBe('{{ trigger.payload.newStatus != "healthy" }}');
80
+ });
81
+
82
+ it("applies the refire default inside a trigger window", () => {
83
+ const parsed = TriggerSchema.parse({
84
+ event: "healthcheck.check_failed",
85
+ window: { count: 5, minutes: 10 },
86
+ });
87
+ expect(parsed.window?.refire).toBe("every");
88
+ });
89
+ });
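Note on the tests above: taken together they pin down the observable contract of `WindowSchema`. The following is a minimal, dependency-free sketch of a parser that satisfies the same cases. It is not the schema from `./schemas` (whose definition is not shown in this diff); `parseWindow`, `WindowConfig`, and the helper guards are hypothetical names, and the lower bound of 1 and the integer requirement on `minutes` are assumptions, since the tests only exercise the 1440 upper cap.

```typescript
type Refire = "once" | "every";

interface WindowConfig {
  count: number;
  minutes: number;
  refire: Refire;
  partitionBy?: string;
}

// Type guard: an integer within [min, max].
function isIntInRange(v: unknown, min: number, max: number): v is number {
  return typeof v === "number" && Number.isInteger(v) && v >= min && v <= max;
}

// Type guard: one of the two refire modes the tests accept.
function isRefire(v: unknown): v is Refire {
  return v === "once" || v === "every";
}

// Hypothetical parser mirroring the tested behaviour; throws on invalid input.
function parseWindow(input: Record<string, unknown>): WindowConfig {
  const { count, minutes, partitionBy } = input;
  // refire defaults to "every" when omitted, as the tests expect.
  const refire = input.refire === undefined ? "every" : input.refire;

  if (!isIntInRange(count, 1, 1000)) {
    throw new Error("count must be an integer in [1, 1000]");
  }
  // Assumption: minutes is an integer >= 1; only the 1440 cap is tested.
  if (!isIntInRange(minutes, 1, 1440)) {
    throw new Error("minutes must be an integer in [1, 1440]");
  }
  if (!isRefire(refire)) {
    throw new Error('refire must be "once" or "every"');
  }

  const result: WindowConfig = { count, minutes, refire };

  // partitionBy stays undefined when omitted; otherwise it is trimmed and
  // must be non-empty after trimming.
  if (partitionBy !== undefined) {
    if (typeof partitionBy !== "string" || partitionBy.trim() === "") {
      throw new Error("partitionBy must be a non-empty expression");
    }
    result.partitionBy = partitionBy.trim();
  }
  return result;
}

// Usage: the "flapping" window from the trigger test.
const flapping = parseWindow({ count: 3, minutes: 60, refire: "once" });
```

Leaving `partitionBy` off the result when it is omitted, rather than setting it to `undefined` explicitly, keeps a deep-equality check such as `toEqual({ count: 3, minutes: 60, refire: "once" })` straightforward.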