npm - @checkstack/automation-common - Versions diffs - 0.2.0 → 0.3.0 - Mend

@checkstack/automation-common 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

package/CHANGELOG.md +215 -0
package/package.json +5 -5
package/src/entity-schemas.test.ts +71 -0
package/src/index.ts +1 -0
package/src/rpc-contract.ts +54 -0
package/src/schemas.ts +383 -4
package/src/script-test-schemas.ts +91 -0
package/src/variable-scope.test.ts +28 -0
package/src/variable-scope.ts +102 -1
package/src/window-schema.test.ts +89 -0

package/CHANGELOG.md CHANGED

@@ -1,5 +1,220 @@
 # @checkstack/automation-common
+## 0.3.0
+### Minor Changes
+- b995afb: Add grouping to automations so they are easier to find.
+  Each automation now carries an optional single free-text `group` label (HA-style "category"), stored as its own column on the `automations` row alongside `name` / `description` / `status` - it is NOT part of the definition / YAML. The automations list renders one collapsible section per group (sorted alphabetically, with an implicit "Ungrouped" bucket last), and the edit page gains a type-new-or-pick-existing group picker fed by the new `listAutomationGroups` query. `listAutomations` accepts an optional `group` filter.
+  Declaratively managed automations express their group via GitOps `metadata.labels.group`; the reconciler threads it onto the row (blank clears it).
+  A Drizzle migration adds the nullable `"group"` column and an index. Existing automations default to no group (Ungrouped) and behave exactly as before.
+- 270ef29: Add live state in scope plus duration helpers to the automation sensing layer (Wave 2 Phase 14).
+  - `@checkstack/template-engine` ships four pure, synchronous duration filters: `minutes` and `hours` (number to milliseconds), `duration_since` (ms elapsed since an ISO timestamp), and `older_than(thresholdMs)` (boolean dwell check). They compute against real time at call time, so "now" is fresh per evaluation. Fail-safe on null/unparseable input.
+  - The dispatch engine pre-resolves live health state into scope before any condition or template evaluation (the engine is synchronous, so inline state queries are impossible). State is folded under a `health` namespace - `health.system.*` for the trigger's context system and `health.systems[<id>]` for ids listed in the automation's new `uses_state` field. One batched `getBulkHealthState` query per evaluation, wired at the fresh-run, resume, and trigger-gate sites. Fail-open: a missing client or provider error yields an empty namespace and a warning, never wedging unrelated automations.
+  - New `automationFilterExtensionPoint` lets plugins contribute pure template filters without forking the engine's default registry. Name collisions with built-ins are skipped with a warning.
+  - The editor variable-scope resolver and autocomplete catalogue now surface the `health.*` namespace and the new duration filters.
+  With this phase alone, an operator can build "notify me when a system has been unhealthy for 30 minutes" using an interval trigger plus a single `health.*` condition - no dwell timer required (the precise event-driven path lands in Phase 15).
+- 270ef29: Add the `for:` dwell on triggers (Wave 2 Phase 15) - precise, event-driven, restart-safe "fire only if the matched state still holds after Y".
+  - New first-class `TriggerSchema.for` (decision D1): a single-unit duration (`{ seconds | minutes | hours }`) or `{ template }` rendering to seconds. A `durationToMs` helper resolves it. Not buried in `config`.
+  - New pre-run `automation_dwell_timers` table (decision D5): a dwell arms before any run exists, so it cannot reuse the run-scoped wait locks. Unique on `(automationId, triggerId, contextKey)` so a re-fire re-arms (pushes `fireAt`) rather than stacking timers.
+  - Arm / re-arm / fire / cancel wired into the trigger fan-in. When a `for:` trigger fires and its filter passes, the engine snapshots the current status, upserts the dwell row, and enqueues an `automation-dwell` wake job with the matching `startDelay` - no run starts yet.
+  - At expiry the dwell re-confirms (via the Phase 13 health-state provider) that the system is still in the armed status, then re-checks the automation's pre-run conditions, then starts the run honouring the concurrency mode. A recovery within the window cancels the pending fire even without an explicit inverse event.
+  - Cancellation is DB-side (delete the row; the queue job no-ops when it pops, since queue jobs are not cancellable). A contradicting state-change event eagerly deletes a stale dwell. Deleted automations drop their dwells via FK cascade; disabled automations drop them at fire time.
+  - Durability: the dwell row is the source of truth. A new `automation-dwell` queue consumer fires dwells, and the stalled sweeper catches expired rows whose job was lost. Both paths are idempotent via delete-on-fire, so a dwell fires at most once and survives restart.
+  Example:
+  ```yaml
+  triggers:
+    - event: healthcheck.system.degraded
+      for: { minutes: 30 }
+  actions:
+    - action: incident.create
+      config:
+        title: "{{ trigger.payload.systemName }} is critical"
+        severity: critical
+        systemIds: ["{{ trigger.payload.systemId }}"]
+  ```
+- 270ef29: Add the `numeric_state` trigger and three structured condition variants (Wave 2 Phase 16, backend-only).
+  - New built-in `numeric_state` trigger: hook-backed on `healthcheck.check.completed`, fires when a numeric field (`latencyMs` top-level, or a `collectors.<id>.<field>` path) crosses an `above` / `below` threshold. The per-automation threshold is enforced by a new structured config gate (`TriggerDefinition.evaluateConfig`) that runs before the operator's template filter. Pairs with a trigger-level `for:` (Phase 15) for sustained thresholds. v1 is level-triggered; edge de-duplication is deferred. (Per-check `p95LatencyMs` is not in the hook payload; read windowed p95 via a `numeric_state` _condition_ against `health.system.p95_latency_ms` instead.)
+  - Corrected the Phase 15 dwell `arm` semantics to be insert-if-absent: a re-fire while a dwell is still armed PRESERVES the original `fireAt` instead of pushing it. Required for the level-triggered `numeric_state` trigger above - otherwise a trigger firing on every check completion (e.g. every 60s) with `for: 10m` would re-arm and push the deadline forward indefinitely, never elapsing. A genuine recover-then-recur still deletes the row (re-confirm / inverse-cancel) so a fresh window starts.
+  - Extended the condition grammar (`ConditionInput`) beyond `string | and | or | not` with three typed variants evaluated over the pre-resolved `health.*` scope plus a FRESH `now` per evaluation:
+    - `numeric_state`: `{ value, above?, below? }` (value is a literal number or a template/path string).
+    - `time`: `{ after?, before?, weekday?[], timezone? }` for on-call / quiet-hours gating, including overnight windows wrapping midnight, weekday filtering, and IANA timezone resolution via `Intl`.
+    - `state`: `{ entity, status, for? }` - a condition-side dwell read from `health.systems[entity].in_status_for_ms` (no new timer; it reads, it doesn't time).
+  - The raw template string stays the escape hatch. Everything round-trips through zod and YAML.
+  Editor widgets (ConditionEditor branches, duration/time-of-day inputs, operator selects) are intentionally deferred to Phase 19; the YAML editor already round-trips the new schema, so the feature is fully usable and testable via YAML today.
+- 270ef29: Add the `wait_until` action primitive (Wave 2 Phase 17) - suspend a running automation until a condition becomes true, with an optional timeout (HA's `wait_template`).
+  - New `wait_until: { condition, timeout_seconds?, continue_on_timeout? }` primitive. `continue_on_timeout` defaults to true (HA semantics). Added to the schema, the action union, and `detectActionKind`. (The wait is fully reactive - see the reactive-dispatch-pipeline changeset; there is no `poll_seconds`.)
+  - `condition` accepts any condition shape - a template string or the Phase 16 structured `numeric_state` / `time` / `state` variants.
+  - Reactive resume: if the condition is already true it continues inline; otherwise it persists a `kind: "until"` wait lock (carrying the condition + timeout policy in a new `wait_config` jsonb column). The reactive-dispatch-pipeline changeset replaces the original poll-based re-check with a wake-index + a single timeout timer, so the wait is woken by a relevant entity change rather than ticked on an interval. Resumes take the per-run advisory lock so a wake and a sweep can't double-resume.
+  - Survives restart: the wait lock is the source of truth, and the stalled sweeper applies the timeout policy as a backstop if the wake/timer signal is lost.
+  - Works nested inside `choose` / `parallel` / `repeat` via the existing resume-remainder mechanism.
+  - Editor: a `wait_until` action card (frontend) mirroring `wait_for_trigger` - a `ConditionEditor` plus timeout and continue-on-timeout inputs. The structured numeric/time/state ConditionEditor branches land with the rest of the sensing-layer editor work; the card uses the expression-based editor for now.
+- 270ef29: Add a windowed transition count to the health provider - the building block for custom flapping rules (Wave 2 Phase 18).
+  Flapping is already buildable today via the built-in `healthcheck.flapping_detected` trigger; this phase ships the GENERALIZATION for arbitrary "N status changes in M minutes" rules.
+  - `countStateTransitionsInWindow` counts aggregate status transitions for a system over a trailing window (from the Phase 13 `health_check_state_transitions` table - all statuses, generalizing the unhealthy-only flapping detector). Fail-safe to 0.
+  - `getHealthState` / `getBulkHealthState` now return `transitionsInWindow` + `transitionWindowMinutes`, and accept an optional `transitionWindowMinutes` input (default 60).
+  - The automation definition gains an optional top-level `state_window_minutes` (default 60), threaded through `enrichScopeWithState` so `health.system.transitions_in_window` / `health.system.transition_window_minutes` are folded into scope per evaluation.
+  - Operators author custom flapping as a `numeric_state` condition over `health.system.transitions_in_window` - no new condition variant, no editor change. The variable-scope resolver surfaces the new fields for autocomplete.
+- 270ef29: Add per-context-key concurrency scope to automations (Phase 20 prerequisite).
+  A new optional `concurrency_scope: "automation" | "context_key"` field on the automation definition controls the bucket the concurrency `mode` is evaluated over:
+  - `automation` (default, backward-compatible): one bucket for the whole automation - `single` allows one in-flight run total, `restart` cancels every active run. Existing automations are unchanged.
+  - `context_key`: an independent bucket per `contextKey` (typically per system / incident) - `single` allows one in-flight run _per context key_ (system A and system B run concurrently, but a second run for system A is deduped), and `restart` cancels only the active runs sharing the incoming context key.
+  `RunStore.hasActiveRun` / `countActiveRuns` / `cancelActiveRuns` gain an optional `contextKey` filter (the `automation_runs.context_key` column already exists, so no migration). `respectConcurrencyMode` threads the scope through. This is the primitive the default auto-incident automations need for faithful per-system deduplication.
+- b995afb: Add the entity state machine core (`defineEntity`) - the foundational primitive of the reactive automation engine - as a Model-B plugin-backed reactive WRAPPER with NO framework-owned current-state storage.
+  `defineEntity` owns NO current-state storage of its own. Each kind declares a required plugin `read` accessor pointing at wherever its state lives (its own durable table, or a value computed on read from its own durable tables), and `defineEntity` makes that state reactive. There is no framework current-state store and no "homeless" fallback: every kind is plugin-backed. This makes a non-reactive write structurally impossible and guarantees every transition is durably logged without duplicating the plugin's state.
+  - `@checkstack/automation-backend`:
+    - New `automation.entity` extension point exposing `defineEntity(input)`, `declareNonReactiveState(input)`, `onEntityChanged(...)`, and `registerChangeDeriver(...)`. automation-backend registers the impl in `register`, so other plugins can resolve it and declare entities during their own `register`/`init` (Proxy-buffered until the impl registers).
+    - **Driven single mutation entry point.** All reactive-state writes go through `handle.mutate({ id, opts?, apply: () => Promise<TState> })`. The handle snapshots `prev` via `read` BEFORE the write, runs the plugin's `apply` (the actual write, committed in the PLUGIN's own transaction, returning the resulting state), validates `next` (zod), masks run-originated writes through the run-secret registry, diffs prev to next, and on a real diff appends the field-level transition rows to `entity_transitions` and emits `ENTITY_CHANGED` - both AFTER the plugin write commits (never on a rolled-back / throwing write). A structurally-unchanged write is a no-op. `handle.remove({ id, opts?, apply: () => Promise<void> })` is the tombstone counterpart (records the tombstone transition, emits next = null).
+    - **Cross-plugin transaction boundary.** `apply` takes NO framework tx: a plugin-backed kind lives behind a DIFFERENT drizzle client than `entity_transitions`, and two clients cannot share one transaction. The plugin write is authoritative; the transition-log append runs in the framework's own transaction AFTER the plugin write commits. A failure between them leaves correct plugin state with a missing history row (a gap, never a corruption).
+    - **`get` / `getMany`** route to the kind's `read`; **`inStateSince` / `inStateForMs` / `transitionCount`** read the per-field `entity_transitions` log (generalizing Phase-13 health transitions to any entity).
+    - **No framework keyed store.** There is no generic `entity_state` table, no `createKeyedStore`, and no `entityKeyedStoreServiceRef`: kinds whose state has no domain table of their own (the `health` aggregate, the `slo` budget/streak view) compute their `read` on demand from their own durable data instead of materializing a framework copy. `entity_transitions` (the change-history log) is the framework's ONLY persistent table and is written for EVERY kind regardless of where current state lives.
+    - **`entityResolverFor(kind)`** routes scope enrichment + the reactive `wait_until` wake re-eval to each kind's `read` accessor. Generalized scope enrichment (`enrichScopeWithEntities`) folds any `state.<kind>.<id>` ref into `scope.state.<kind>.<id>.<field>`. The rich `scope.health.*` condition snapshot (status, latency, success rate, in-maintenance, transitions-in-window, ...) is resolved EXCLUSIVELY through the healthcheck RPC path (the health aggregate is computed on read, not stored as a framework row) and the generic entity pass never writes `scope.health`; `state.health.*` remains the minimal reactive entity view. These are two complementary projections by design, not a migration shim.
+    - **Horizontal-scale read-consistency guard.** A reactive entity's current state MUST be globally readable from shared/durable storage, never process-local memory (`.agent/rules/state-and-scale.md`). Enforced by the `checkstack/no-pod-local-entity-state` ESLint tripwire at the `defineEntity({ read })` boundary (wired at `warn`) and the deterministic `cross-pod-read-consistency.it.test.ts` integration test.
+    - Load-time validation hard-fails a malformed registration (non-`z.object` state, missing/duplicate `kind`, or a missing / non-function `read`).
+    - The `ENTITY_CHANGED` hook is internal (not exported); the change emitter buffers events produced during the init window and flushes them in order once the hook wiring is available in `afterPluginsReady`.
+  - `@checkstack/automation-common`:
+    - New `EntityChangedSchema` (the `ENTITY_CHANGED` payload - `kind`, `id`, `prev`, `next`, `delta`, `changedFields`, `actor`, `occurredAt`) and `DispatchJobSchema` (the Stage-2 `trigger` / `wake` dispatch job).
+  - `@checkstack/automation-frontend`: the `wait_until` editor no longer offers the inert `poll_seconds` field (reactive waits don't poll).
+  This phase adds the primitive only: domains are migrated in their own changesets. No external behavior changes for existing automations.
+  BREAKING CHANGES: There is no framework current-state store. Any out-of-tree plugin must own its entity state in its own durable storage (its own table, or a compute-on-read over it) and pass a `read` accessor to `defineEntity`. `createKeyedStore` / `KeyedStore` / `entityKeyedStoreServiceRef` / `EntityKeyedStoreService` do not exist, and there is no `entity_state` table. `handle.set` / `handle.patch` and the `indexes` option do not exist; all writes go through `handle.mutate` / `handle.remove`.
+- b995afb: Fix four reactive-automation-engine defects in the `wait_until` / entity-change dispatch path.
+  - **Lost-wakeup re-evaluate-on-registration guard (HIGH, data-loss race).** `executeWaitUntil` evaluated its condition, then committed the wait lock + wake-index rows with NO re-evaluation after arming. An `ENTITY_CHANGED` for a relevant ref landing in that arm window was routed by Stage-1 against a not-yet-visible lock, enqueued no wake job, and — for a no-timeout wait (`timeoutAt` null, skipped by the sweeper) — the run stalled permanently (silent run leak). After arming the lock the engine now re-evaluates ONCE against freshly re-enriched scope; if the condition already holds it deletes the lock (its wake-index rows cascade) and continues the run inline. Idempotent via the lock delete + the per-run advisory lock.
+  - **Wildcard health wake drops the changed system (MEDIUM, correctness).** `reEnrichWaitScope` resolved health only for the trigger `contextKey` + `uses_state` ids and excluded the changed ref from health resolution. A wildcard health wait (`health:*`) woken by `health:sysX` — where `sysX` was neither the contextKey nor in `uses_state` — never had `scope.health.systems[sysX]` populated, so the condition read stale/empty state and failed to resume. The changed system's concrete id is now injected into health resolution during a wildcard wake.
+  - **`changeId` for dispatch dedup (LOW, correctness).** The Stage-2 trigger `jobId` embedded `changed.occurredAt` (millisecond granularity), so two DISTINCT changes to the same entity within one millisecond collapsed onto one job (the second run silently dropped). `EntityChangedSchema` gains an additive, back-compatible `changeId` (generated ONCE at emit time so it travels with redeliveries of the same change); the Stage-2 jobId now uses `changed.changeId` (falling back to `occurredAt` for legacy payloads). Redeliveries of one change still dedup; two real changes stay distinct.
+  - **Run-originated `mutate` returns the unmasked next state (LOW, correctness).** `handle.mutate` returned the `maskForRun`-masked next state, contradicting its "returns the resulting state" contract. Masking is now confined to the emitted `ENTITY_CHANGED` payload and the `entity_transitions` rows only; `mutate` returns the unmasked, zod-validated resulting state.
+  BREAKING CHANGES: none. The `changeId` field is additive and optional; all changes are behavior-preserving except where they fix the defects above.
+- 270ef29: Add in-UI script testing for automation `run_script` / `run_shell` actions.
+  A new `testScript` RPC runs a TypeScript or shell script against an
+  editable, auto-seeded sample context using the same sandboxed runner the
+  real action uses, so operators can test scripts directly in the editor
+  without dispatching a whole automation. Surfaces beneath any script field
+  flagged `x-script-testable` via the new `ScriptTestPanel` /
+  `ContextSampleEditor` components in `@checkstack/ui` and the
+  `scriptTestRenderer` prop threaded through `DynamicForm`.
+  - `@checkstack/automation-common`: adds the `testScript` contract +
+    `ScriptTest*` schemas (gated by `automation.manage`).
+  - `@checkstack/automation-backend`: implements `testScript` reusing the
+    shared ESM / shell runners; central-only, time-bounded.
+  - `@checkstack/backend-api`: new `x-script-testable` config-schema
+    metadata propagated to the frontend JSON Schema.
+  - `@checkstack/ui`: new `ScriptTestPanel` + `ContextSampleEditor`
+    components and a `scriptTestRenderer` prop on `DynamicForm`.
+  - `@checkstack/automation-frontend`: wires the test panel into the action
+    editor.
+  - `@checkstack/integration-script-backend`: marks the `run_script` /
+    `run_shell` script fields as testable.
+- 270ef29: Extend in-UI script testing to health-check collectors, and add
+  load-from-run replay for automation script tests.
+  - Health-check collectors: a new `testCollectorScript` RPC runs the
+    inline-script (TypeScript) collector and the shell `script` collector
+    against an editable, auto-seeded sample context using the same
+    sandboxed runner the real collector uses. Surfaces beneath the
+    collector script fields in the collector editor (both marked
+    `x-script-testable`). Gated by `healthcheck.configuration.manage`.
+  - Automation replay: a new `getRunScopeForReplay` RPC reconstructs an
+    editable test context from a real run (trigger + persisted artifacts,
+    plus the durable scope snapshot when the run is still in-flight), and
+    the script-test panel gains a "Load from run" picker that seeds the
+    sample context from a past run.
+  Note: health-check executions do not persist the script / config /
+  check / system that produced a result, so there is no health-check
+  replay - auto-seed is the only context source for collector tests. This
+  is by design; see the feature plan.
+- 270ef29: Secrets platform Phase 2: secret -> env-var mapping with central resolve, inject, and mask.
+  - Script consumers declare a least-privilege `secretEnv` allowlist
+    (`{ ENV_NAME: "${{ secrets.NAME }}" }`). The automation `run_script` /
+    `run_shell` actions resolve ONLY the declared secrets via
+    `secretResolverRef.resolveForRun`, inject them into the runner env for
+    that run (memory-only; the ESM runner gained a per-run `env` option), and
+    mask their values out of stdout/stderr/result/error via the run-scoped
+    masking context. A missing required secret fails the run clearly. No
+    ambient secret access.
+  - Test panel: `testScript` / `testCollectorScript` inject named
+    `__SECRET_<NAME>__` placeholders by default, or user-supplied per-secret
+    overrides; real production values are never resolved in the test path,
+    and overrides are masked out of the result.
+  - Healthcheck collectors carry the `secretEnv` field for authoring +
+    the test panel; runtime injection on satellites lands in Phase 3.
+  - Editor UX: a new `@checkstack/ui` `SecretEnvEditor` renders `x-secret-env`
+    record fields with `${{ secrets.* }}` name autocomplete (from
+    `listSecretNames`), wired into the automation action editor and the
+    healthcheck collector editor. New `withConfigMeta` helper +
+    `x-secret-env` config-meta key in `@checkstack/backend-api`.
+- b995afb: Add an optional `partitionBy` override to the windowed-count trigger gate.
+  A trigger's `window` block now accepts `partitionBy`, a bare expression (same flavour as `filter`, no `{{ }}`) that controls the key the occurrence count is bucketed by. When omitted, the gate keys by the trigger's built-in context key exactly as before (per system for health triggers), so existing automations are unchanged. When set, the expression is evaluated against the same trigger scope `filter` uses and coerced to a string - e.g. `trigger.payload.severity` for a per-severity rate, or `trigger.payload.systemId + ":" + trigger.payload.checkId` for a composite key. If the expression evaluates to null/undefined/empty or fails to evaluate, the gate falls back to the built-in context key (never global counting); eval errors are logged, matching the gate's fail-open posture.
+  Triggers can now declare `contextKeyLabel` (a UI hint, e.g. `"system"`) describing their built-in context dimension. It is surfaced through `TriggerInfo` so the editor's window "Partition by" field shows the default partition ("Leave blank to count per system" / "per automation" when a trigger has no context key). The healthcheck system triggers (`system_health_changed`, `system_degraded`, `system_healthy`, `check_failed`) and the built-in `numeric_state` trigger set it to `"system"`. This is a pure UI hint with no runtime behaviour.
+  The automation editor's window block gains a "Partition by" expression input (reusing the trigger filter's `trigger.payload.*` autocomplete), and the collapsed trigger card summary shows the partition when set.
+- b995afb: Add a generic windowed-count / rate trigger gate, and express flapping detection on it.
+  Any trigger can now carry a `window: { count, minutes, refire }` block: the automation engine records each qualifying occurrence (after the structured config gate and the operator's `filter`) in a durable append log and counts rows within the trailing sliding window, scoped per context key (e.g. per system). `refire: "every"` (default) fires on every occurrence at/over the threshold; `refire: "once"` fires only on the crossing edge and re-arms as old occurrences age out. The gate runs in `maybeStartRun` after `filter` and before the `for:` dwell, so it composes with both.
+  Flapping is now an instance of this mechanism rather than a bespoke detector. The healthcheck `system_health_changed` raw change event plus a `filter` (`trigger.payload.newStatus != "healthy"`) plus `window: { count: 3, minutes: 60, refire: "once" }` reproduces flapping in the engine.
+  State-and-scale: window state lives in the new `automation_window_events` Postgres table (FK-cascade on the automation, the same delete-lifecycle as `automation_dwell_timers`). The count is read with pure SQL so every pod computes the same answer; the work-queue claim gives exactly one INSERT per emission, so there is no double-count. Rows older than the 24h schema cap are pruned by the existing stalled-sweeper. The `once` policy is best-effort under at-least-once redelivery (a redelivered emission can skip the exact crossing edge; `every` is redelivery-tolerant).
+  **BREAKING CHANGES:**
+  - The `healthcheck.flapping_detected` automation trigger and the `healthcheck.flapping_detected` hook are REMOVED. Flapping is now detected by the windowed-count gate on the `healthcheck.system_health_changed` trigger (`window` block, `refire: "once"`).
+  - Flapping is now PER-SYSTEM (the aggregated `health` entity), not per-`(system, configuration)`. Subscribe to `check_failed` with a `window` instead if you need per-check rate detection.
+  - The healthcheck `health_check_unhealthy_transitions` table is DROPPED (the per-check flapping audit log is no longer kept; counting moved into the engine).
+  - The backend-only `automation.subscriptions` service ref (`automationSubscriptionsRef` / `AutomationSubscriptions`) is REMOVED. The engine enumerates subscribers internally and the window gate runs per-automation inside `maybeStartRun`, so the external read-ref is no longer needed.
+  - Existing user-created flapping automations are AUTO-MIGRATED on boot: any trigger on `healthcheck.flapping_detected` is rewritten to `healthcheck.system_health_changed` + the canonical unhealthy-transition filter + `window: { count: transitions ?? 3, minutes: windowMinutes ?? 60, refire: "once" }`, dropping the old `config`. A pre-existing trigger filter is replaced with the canonical one (logged per row). An enabled automation that still references the removed event after migration logs a warning.
+### Patch Changes
+- Updated dependencies [270ef29]
+  - @checkstack/template-engine@0.3.0
 ## 0.2.0
 ### Minor Changes
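
Reviewer note on the `for:` dwell entries in the changelog above: the insert-if-absent `arm` semantics and the delete-on-fire idempotency can be sketched as follows. This is a minimal in-memory sketch, not the package's implementation. Only the duration field names (`seconds`, `minutes`, `hours`), the helper name `durationToMs`, and the `(automationId, triggerId, contextKey)` key come from the changelog; the `Map`-based store and the `dwellKey`, `armDwell`, and `fireIfDue` names are assumptions made for illustration, and the `{ template }` duration variant is omitted.

```typescript
// Single-unit duration as described for `TriggerSchema.for`.
// Assumption: the `{ template }` variant is left out of this sketch.
type SingleUnitDuration =
  | { seconds: number }
  | { minutes: number }
  | { hours: number };

// Resolve a single-unit duration to milliseconds.
function durationToMs(d: SingleUnitDuration): number {
  if ("seconds" in d) return d.seconds * 1_000;
  if ("minutes" in d) return d.minutes * 60_000;
  return d.hours * 3_600_000;
}

// Hypothetical in-memory stand-in for the `automation_dwell_timers` table.
// The real store is a database table, so this only models the semantics.
const dwellTimers = new Map<string, { fireAt: number }>();

// Composite key mirroring the unique constraint named in the changelog.
function dwellKey(automationId: string, triggerId: string, contextKey: string): string {
  return JSON.stringify([automationId, triggerId, contextKey]);
}

// Insert-if-absent arm: a re-fire while a dwell is still armed keeps the
// original fireAt instead of pushing the deadline forward.
function armDwell(key: string, now: number, forMs: number): number {
  const existing = dwellTimers.get(key);
  if (existing) return existing.fireAt;
  const fireAt = now + forMs;
  dwellTimers.set(key, { fireAt });
  return fireAt;
}

// Delete-on-fire: returns true at most once per armed dwell, so a queue
// consumer and a sweeper calling this for the same row cannot both fire.
function fireIfDue(key: string, now: number): boolean {
  const row = dwellTimers.get(key);
  if (!row || now < row.fireAt) return false;
  dwellTimers.delete(key);
  return true;
}
```

With a level-triggered source firing every 60 seconds and `for: { minutes: 10 }`, every `armDwell` call after the first returns the original deadline, which is why the window can elapse at all.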

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@checkstack/automation-common",
-  "version": "0.2.0",
+  "version": "0.3.0",
   "license": "Elastic-2.0",
   "type": "module",
   "exports": {
@@ -10,14 +10,14 @@
     }
   },
   "dependencies": {
-    "@checkstack/common": "0.11.0",
-    "@checkstack/signal-common": "0.2.4",
-    "@checkstack/template-engine": "0.1.0",
+    "@checkstack/common": "0.12.0",
+    "@checkstack/signal-common": "0.2.5",
+    "@checkstack/template-engine": "0.2.0",
     "@orpc/contract": "^1.13.14",
     "zod": "^4.0.0"
   },
   "devDependencies": {
-    "@checkstack/scripts": "0.3.3",
+    "@checkstack/scripts": "0.3.4",
     "@checkstack/tsconfig": "0.0.7",
     "typescript": "^5.7.2"
   },

package/src/entity-schemas.test.ts ADDED

@@ -0,0 +1,71 @@
+import { describe, it, expect } from "bun:test";
+import { SYSTEM_ACTOR } from "@checkstack/common";
+import {
+  DispatchJobSchema,
+  EntityChangedSchema,
+} from "./schemas";
+const baseChange = {
+  kind: "incident",
+  id: "inc-1",
+  prev: null,
+  next: { status: "open" },
+  delta: { status: "open" },
+  changedFields: ["status"],
+  actor: SYSTEM_ACTOR,
+  occurredAt: new Date().toISOString(),
+};
+describe("EntityChangedSchema", () => {
+  it("accepts a create (prev null)", () => {
+    expect(() => EntityChangedSchema.parse(baseChange)).not.toThrow();
+  });
+  it("accepts a tombstone (next null)", () => {
+    expect(() =>
+      EntityChangedSchema.parse({ ...baseChange, next: null, delta: {} }),
+    ).not.toThrow();
+  });
+  it("requires an actor", () => {
+    const { actor, ...rest } = baseChange;
+    void actor;
+    expect(() => EntityChangedSchema.parse(rest)).toThrow();
+  });
+  it("rejects a non-array changedFields", () => {
+    expect(() =>
+      EntityChangedSchema.parse({ ...baseChange, changedFields: "status" }),
+    ).toThrow();
+  });
+});
+describe("DispatchJobSchema", () => {
+  it("accepts a trigger job", () => {
+    const job = {
+      reason: "trigger" as const,
+      automationId: "a-1",
+      triggerId: "t-1",
+      ref: "incident:inc-1",
+      changed: baseChange,
+    };
+    expect(() => DispatchJobSchema.parse(job)).not.toThrow();
+  });
+  it("accepts a wake job", () => {
+    const job = {
+      reason: "wake" as const,
+      runId: "r-1",
+      waitLockId: "w-1",
+      ref: "incident:inc-1",
+      changed: baseChange,
+    };
+    expect(() => DispatchJobSchema.parse(job)).not.toThrow();
+  });
+  it("rejects an unknown reason", () => {
+    expect(() =>
+      DispatchJobSchema.parse({ reason: "nope", ref: "x", changed: baseChange }),
+    ).toThrow();
+  });
+});
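
Reviewer note: the fixtures in this test file do not exercise the optional `changeId` that the changelog describes for Stage-2 dispatch dedup. The sketch below shows the described fallback, where the job id uses `changeId` when present and `occurredAt` for legacy payloads. The `ChangedRef` interface, the `triggerJobId` function, and the `|`-joined id layout are hypothetical; the diff does not show the real job id format.

```typescript
// Minimal subset of the ENTITY_CHANGED payload. Field names follow the
// test fixtures, plus the optional `changeId` described in the changelog.
interface ChangedRef {
  kind: string;
  id: string;
  occurredAt: string;
  changeId?: string;
}

// Hypothetical job id layout; only the changeId-then-occurredAt fallback
// is taken from the changelog.
function triggerJobId(
  automationId: string,
  triggerId: string,
  changed: ChangedRef,
): string {
  // Prefer the per-change id so two distinct changes in the same
  // millisecond stay distinct; fall back for legacy payloads.
  const discriminator = changed.changeId ?? changed.occurredAt;
  return [
    automationId,
    triggerId,
    `${changed.kind}:${changed.id}`,
    discriminator,
  ].join("|");
}
```

Because `changeId` is generated once at emit time, a redelivery of the same change maps to the same id and still dedups, while two real changes sharing a timestamp do not collapse.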

package/src/index.ts CHANGED

@@ -6,3 +6,4 @@ export * from "./rpc-contract";
 export * from "./signals";
 export * from "./variable-scope";
 export * from "./shell-env";
+export * from "./script-test-schemas";

package/src/rpc-contract.ts CHANGED

@@ -7,12 +7,19 @@ import {
 } from "@checkstack/common";
 import { automationAccess } from "./access";
 import { pluginMetadata } from "./plugin-metadata";
+import {
+  ReplayScopeInputSchema,
+  ReplayScopeResultSchema,
+  ScriptTestInputSchema,
+  ScriptTestResultSchema,
+} from "./script-test-schemas";
 import {
   ActionInfoSchema,
   ArtifactTypeInfoSchema,
   AutomationArtifactSchema,
   AutomationRunSchema,
   AutomationRunStepSchema,
+  AutomationGroupSchema,
   AutomationSchema,
   AutomationStatusSchema,
   CreateAutomationInputSchema,
@@ -45,10 +52,22 @@ export const automationContract = {
     .input(
       PaginationInput.extend({
         status: AutomationStatusSchema.optional(),
+        group: AutomationGroupSchema.optional(),
       }),
     )
     .output(PaginatedResult(AutomationSchema)),
+  /**
+   * Distinct, non-null group values across all automations, sorted
+   * alphabetically. Powers the edit-page group picker's "pick existing"
+   * suggestions.
+   */
+  listAutomationGroups: proc({
+    operationType: "query",
+    userType: "authenticated",
+    access: [automationAccess.read],
+  }).output(z.object({ groups: z.array(z.string()) })),
   getAutomation: proc({
     operationType: "query",
     userType: "authenticated",
@@ -200,6 +219,41 @@ export const automationContract = {
     .input(z.object({ id: z.string() }))
     .output(z.object({ success: z.boolean() })),
+  // ─── Inline script testing ─────────────────────────────────────────────
+  /**
+   * Run a `run_script` (TypeScript) or `run_shell` (shell) script against
+   * an editable sample context, using the same sandboxed runner the real
+   * action uses. Lets operators test a script directly in the editor
+   * without dispatching a whole automation.
+   *
+   * Gated by `manage` because authoring + running a script already
+   * executes code on the central backend — this is the same privilege.
+   * The run is time-bounded and always central (real satellite runs may
+   * differ; the UI notes this).
+   */
+  testScript: proc({
+    operationType: "mutation",
+    userType: "authenticated",
+    access: [automationAccess.manage],
+  })
+    .input(ScriptTestInputSchema)
+    .output(ScriptTestResultSchema),
+  /**
+   * Reconstruct an editable test context from a real automation run, so an
+   * operator can replay a script against the data that run saw. Reads run
+   * data only (trigger + persisted artifacts, plus the durable scope
+   * snapshot when the run is still in-flight), so it's gated by `read`.
+   */
+  getRunScopeForReplay: proc({
+    operationType: "query",
+    userType: "authenticated",
+    access: [automationAccess.read],
+  })
+    .input(ReplayScopeInputSchema)
+    .output(ReplayScopeResultSchema),
   // ─── Template playground ───────────────────────────────────────────────
   /**
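
Note: `listAutomationGroups` is documented in this file as returning distinct, non-null group values sorted alphabetically, and the changelog describes an implicit "Ungrouped" bucket rendered last. The sketch below shows one way that contract could be satisfied in plain TypeScript. `AutomationRow`, `listGroups` and `toSections` are illustrative names, and omitting the "Ungrouped" section when it is empty is an assumption.

```typescript
// Illustrative only; the real query runs against the `automations` table.
interface AutomationRow {
  id: string;
  group?: string | null; // null or absent means "Ungrouped"
}

interface Section {
  title: string;
  ids: string[];
}

const UNGROUPED = "Ungrouped";

// Distinct, non-empty group labels, sorted alphabetically.
function listGroups(rows: AutomationRow[]): string[] {
  const groups = new Set<string>();
  for (const row of rows) {
    const label = row.group?.trim();
    if (label) groups.add(label);
  }
  return Array.from(groups).sort((a, b) => a.localeCompare(b));
}

// One section per group, with the implicit "Ungrouped" bucket last.
function toSections(rows: AutomationRow[]): Section[] {
  const sections: Section[] = listGroups(rows).map((title) => ({
    title,
    ids: rows.filter((r) => r.group?.trim() === title).map((r) => r.id),
  }));
  const ungrouped = rows.filter((r) => !r.group?.trim()).map((r) => r.id);
  // Assumption: the bucket is omitted when every automation has a group.
  if (ungrouped.length > 0) sections.push({ title: UNGROUPED, ids: ungrouped });
  return sections;
}

const rows: AutomationRow[] = [
  { id: "a1", group: "paging" },
  { id: "a2" },
  { id: "a3", group: "backups" },
  { id: "a4", group: "paging" },
];
console.log(listGroups(rows)); // [ "backups", "paging" ]
console.log(toSections(rows).map((s) => s.title)); // [ "backups", "paging", "Ungrouped" ]
```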

package/src/schemas.ts CHANGED

@@ -1,4 +1,5 @@
 import { z } from "zod";
+import { ActorSchema } from "@checkstack/common";
 /**
  * Schemas for the Automation platform.
@@ -19,6 +20,45 @@ import { z } from "zod";
  * `automations.definition` column, also round-tripped from YAML in the UI.
  */
+// ─── Duration ──────────────────────────────────────────────────────────────
+/**
+ * A duration expressed in a single unit, or a template that renders to a
+ * number of seconds. Used by a trigger's `for:` dwell (decision D1). The
+ * object form is preferred for the editor's duration widget; the template
+ * form is the escape hatch for computed durations.
+ */
+export const DurationSchema = z.union([
+  z.object({ seconds: z.number().int().min(1).max(60 * 60 * 24 * 30) }),
+  z.object({ minutes: z.number().int().min(1).max(60 * 24 * 30) }),
+  z.object({ hours: z.number().int().min(1).max(24 * 30) }),
+  z
+    .object({ template: z.string().min(1) })
+    .describe("Template rendering to a number of seconds."),
+]);
+export type Duration = z.infer<typeof DurationSchema>;
+/**
+ * Resolve a {@link Duration} to milliseconds. The template branch needs a
+ * render context, so callers that may receive a template pass a
+ * `renderSeconds` resolver. Returns null for an unrenderable / invalid
+ * duration.
+ */
+export function durationToMs(
+  duration: Duration,
+  renderSeconds?: (template: string) => number | undefined,
+): number | null {
+  if ("seconds" in duration) return duration.seconds * 1000;
+  if ("minutes" in duration) return duration.minutes * 60_000;
+  if ("hours" in duration) return duration.hours * 3_600_000;
+  const seconds = renderSeconds?.(duration.template);
+  if (seconds === undefined || !Number.isFinite(seconds) || seconds < 0) {
+    return null;
+  }
+  return Math.floor(seconds) * 1000;
+}
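
Note: a usage sketch for `durationToMs`, with the `Duration` union restated as a plain TypeScript type so the example runs without zod. The function body is copied from the diff above; `renderLiteral` is a hypothetical resolver standing in for the real template engine.

```typescript
// Plain-TypeScript mirror of the Duration union (the package derives it from zod).
type Duration =
  | { seconds: number }
  | { minutes: number }
  | { hours: number }
  | { template: string };

function durationToMs(
  duration: Duration,
  renderSeconds?: (template: string) => number | undefined,
): number | null {
  if ("seconds" in duration) return duration.seconds * 1000;
  if ("minutes" in duration) return duration.minutes * 60_000;
  if ("hours" in duration) return duration.hours * 3_600_000;
  const seconds = renderSeconds?.(duration.template);
  if (seconds === undefined || !Number.isFinite(seconds) || seconds < 0) {
    return null;
  }
  return Math.floor(seconds) * 1000;
}

// Hypothetical resolver: treats the template text as a literal number of seconds.
const renderLiteral = (template: string): number | undefined => {
  const n = Number(template);
  return Number.isNaN(n) ? undefined : n;
};

console.log(durationToMs({ minutes: 5 })); // 300000
console.log(durationToMs({ template: "90.7" }, renderLiteral)); // 90000
console.log(durationToMs({ template: "{{ x }}" })); // null (no resolver passed)
```

As written in the diff, only negative, non-finite or missing results map to null, so a template that renders to 0 resolves to 0 ms rather than null.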
 // ─── Trigger ─────────────────────────────────────────────────────────────
 /**
@@ -32,7 +72,61 @@ import { z } from "zod";
  *   trigger itself before any action runs.
  * - `config` is the trigger's own configuration — only used by triggers
  *   that need extra setup (cron pattern for `time.cron`, etc.).
+ * - `for` is an optional dwell: fire only if the matched state still holds
+ *   after the duration (re-confirmed at expiry, restart-safe).
+ * - `window` is an optional rate/windowed-count gate: fire only when this
+ *   trigger has fired (post-filter) at least `count` times within the
+ *   trailing `minutes`, scoped per `contextKey`. `refire` decides whether to
+ *   fire on every qualifying occurrence past the threshold (`"every"`) or
+ *   only on the crossing edge (`"once"`).
+ */
+/**
+ * Windowed-count / rate gate (decision: generic `window` block on any
+ * trigger). The engine records each qualifying occurrence (after the
+ * structured config gate + the operator's `filter`) in a durable append log
+ * keyed `(automationId, triggerId, contextKey)` and counts rows within the
+ * trailing `minutes`:
+ *
+ *  - `"every"` fires iff `newCount >= count` — re-fires on every occurrence
+ *    while the window stays over threshold (debounce in the automation if
+ *    needed).
+ *  - `"once"` fires iff `newCount === count` — only on the crossing edge;
+ *    re-arms naturally as old rows age out and the count re-crosses.
+ *
+ * where `newCount` includes the just-recorded occurrence.
  */
+export const WindowSchema = z.object({
+  count: z
+    .number()
+    .int()
+    .min(1)
+    .max(1000)
+    .describe("Occurrences within the window that arm the trigger."),
+  minutes: z
+    .number()
+    .int()
+    .min(1)
+    .max(1440)
+    .describe("Trailing sliding window, in minutes, the occurrences count over."),
+  refire: z
+    .enum(["every", "once"])
+    .default("every")
+    .describe(
+      "`every`: fire on every occurrence at/over the threshold. `once`: fire only on the crossing edge (re-arms as the window drains).",
+    ),
+  partitionBy: z
+    .string()
+    .trim()
+    .min(1)
+    .optional()
+    .describe(
+      "Optional bare expression (like `filter`, no `{{ }}`) yielding the partition key the count is bucketed by. Omitted ⇒ the trigger's built-in context key (e.g. systemId). An empty/undefined result or an eval error falls back to that built-in key.",
+    ),
+});
+export type Window = z.infer<typeof WindowSchema>;
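
Note: the `every` / `once` rules in the comment above (`newCount >= count` versus `newCount === count`, where `newCount` includes the just-recorded occurrence) can be sketched as follows. `OccurrenceLog` is a hypothetical in-memory stand-in for the durable append log, and the strict `>` cutoff at the window edge is an assumption, since the diff does not show the storage query.

```typescript
type Refire = "every" | "once";

interface WindowConfig {
  count: number;
  minutes: number;
  refire: Refire;
}

// Hypothetical in-memory stand-in for the durable append log keyed by
// (automationId, triggerId, contextKey).
class OccurrenceLog {
  private readonly rows = new Map<string, number[]>();

  /** Record one occurrence and return how many fall inside the trailing window. */
  record(key: string, atMs: number, windowMinutes: number): number {
    const cutoff = atMs - windowMinutes * 60_000;
    // Assumption: rows exactly at the cutoff have aged out.
    const kept = (this.rows.get(key) ?? []).filter((t) => t > cutoff);
    kept.push(atMs);
    this.rows.set(key, kept);
    return kept.length;
  }
}

// `every` fires at or over the threshold; `once` fires only on the crossing edge.
function shouldFire(newCount: number, w: WindowConfig): boolean {
  return w.refire === "every" ? newCount >= w.count : newCount === w.count;
}

const log = new OccurrenceLog();
const gate: WindowConfig = { count: 3, minutes: 60, refire: "once" };
const key = "auto-1|flapping|sys-1";
const fired = [0, 10, 20, 30, 75].map((minute) =>
  shouldFire(log.record(key, minute * 60_000, gate.minutes), gate),
);
console.log(fired); // [ false, false, true, false, true ]
```

The last occurrence fires again because the rows at minutes 0 and 10 have aged out of the 60-minute window, so the count re-crosses 3. This is the "re-arms naturally" behaviour the comment describes.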
 export const TriggerSchema = z.object({
   id: z
     .string()
@@ -59,24 +153,131 @@ export const TriggerSchema = z.object({
     .describe(
       "Per-trigger configuration (e.g. cron pattern for time.cron, interval seconds for time.interval).",
     ),
+  for: DurationSchema.optional().describe(
+    "Dwell: fire only if the matched state still holds after this duration. Re-confirmed at expiry, restart-safe.",
+  ),
+  window: WindowSchema.optional().describe(
+    "Windowed-count / rate gate: fire only after this trigger has fired N times within M minutes (scoped per context key).",
+  ),
 });
 export type Trigger = z.infer<typeof TriggerSchema>;
 // ─── Condition (recursive) ────────────────────────────────────────────────
+/**
+ * Structured condition variants (Wave 2 Phase 16). Each evaluates over the
+ * pre-resolved scope (Phase 14 `health.*`) plus a FRESH `now` computed per
+ * evaluation (constraint 7 — never the frozen scope `now`).
+ */
+/**
+ * `numeric_state` — compare a numeric `value` (a template/path string or a
+ * literal number) against optional `above` / `below` bounds.
+ */
+export const NumericStateConditionSchema = z
+  .object({
+    numeric_state: z.object({
+      value: z
+        .union([z.string().min(1), z.number()])
+        .describe("Template/path string or literal number to compare."),
+      above: z.number().optional(),
+      below: z.number().optional(),
+    }),
+  })
+  .refine(
+    (c) =>
+      c.numeric_state.above !== undefined ||
+      c.numeric_state.below !== undefined,
+    { message: "numeric_state requires at least one of `above` / `below`" },
+  );
+export type NumericStateCondition = z.infer<
+  typeof NumericStateConditionSchema
+>;
+/**
+ * `time` — on-call / quiet-hours gating. `after` / `before` are `HH:mm`
+ * (24h). `weekday` is a list of 0-6 (Sunday = 0). `timezone` is an IANA
+ * zone (defaults to UTC).
+ */
+const HHMM = /^([01]\d|2[0-3]):[0-5]\d$/;
+export const TimeConditionSchema = z
+  .object({
+    time: z.object({
+      after: z
+        .string()
+        .regex(HHMM, "Expected HH:mm (24h)")
+        .optional()
+        .describe("Inclusive lower bound, local to `timezone`."),
+      before: z
+        .string()
+        .regex(HHMM, "Expected HH:mm (24h)")
+        .optional()
+        .describe("Exclusive upper bound, local to `timezone`."),
+      weekday: z
+        .array(z.number().int().min(0).max(6))
+        .min(1)
+        .optional()
+        .describe("Allowed weekdays (0 = Sunday … 6 = Saturday)."),
+      timezone: z
+        .string()
+        .min(1)
+        .optional()
+        .describe("IANA timezone (e.g. `Europe/Berlin`). Defaults to UTC."),
+    }),
+  })
+  .refine(
+    (c) =>
+      c.time.after !== undefined ||
+      c.time.before !== undefined ||
+      c.time.weekday !== undefined,
+    { message: "time requires at least one of `after` / `before` / `weekday`" },
+  );
+export type TimeCondition = z.infer<typeof TimeConditionSchema>;
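
Note: a sketch of how a `time` condition could be evaluated against a fresh `now`, following the documented bounds (`after` inclusive, `before` exclusive, weekday 0 = Sunday, timezone defaulting to UTC). `timeConditionHolds` and `localParts` are hypothetical names. Treating `after > before` as a range that wraps past midnight is an assumption; the diff does not say how the engine handles that case.

```typescript
interface TimeWindow {
  after?: string; // "HH:mm", inclusive lower bound
  before?: string; // "HH:mm", exclusive upper bound
  weekday?: number[]; // 0 = Sunday to 6 = Saturday
  timezone?: string; // IANA zone, defaults to UTC
}

const WEEKDAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];

function toMinutes(hhmm: string): number {
  return Number(hhmm.slice(0, 2)) * 60 + Number(hhmm.slice(3, 5));
}

// Wall-clock weekday and minutes-since-midnight in the given zone.
function localParts(now: Date, timeZone: string): { weekday: number; minutes: number } {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    weekday: "short",
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  }).formatToParts(now);
  const get = (type: string): string =>
    parts.find((p) => p.type === type)?.value ?? "";
  return {
    weekday: WEEKDAYS.indexOf(get("weekday")),
    // Some engines render midnight as "24" when hour12 is false.
    minutes: (Number(get("hour")) % 24) * 60 + Number(get("minute")),
  };
}

function timeConditionHolds(c: TimeWindow, now: Date): boolean {
  const { weekday, minutes } = localParts(now, c.timezone ?? "UTC");
  if (c.weekday !== undefined && !c.weekday.includes(weekday)) return false;
  const after = c.after === undefined ? undefined : toMinutes(c.after);
  const before = c.before === undefined ? undefined : toMinutes(c.before);
  if (after !== undefined && before !== undefined) {
    // Assumption: after > before means the range wraps past midnight.
    return after <= before
      ? minutes >= after && minutes < before
      : minutes >= after || minutes < before;
  }
  if (after !== undefined) return minutes >= after;
  if (before !== undefined) return minutes < before;
  return true;
}

const office: TimeWindow = {
  after: "09:00",
  before: "17:00",
  weekday: [1, 2, 3, 4, 5],
  timezone: "Europe/Berlin",
};
// Monday 08:30 UTC is 09:30 in Berlin during winter time.
console.log(timeConditionHolds(office, new Date("2024-01-15T08:30:00Z"))); // true
```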
+/**
+ * `state` — condition-side dwell. True when `entity` (a system id) has been
+ * in `status` for at least `for` (read from the pre-resolved
+ * `health.*.in_status_for_ms`; NO new timer — it reads, it doesn't time).
+ */
+export const StateConditionSchema = z.object({
+  state: z.object({
+    entity: z
+      .string()
+      .min(1)
+      .describe("System id whose live health state to read."),
+    status: z
+      .enum(["healthy", "degraded", "unhealthy"])
+      .describe("Status the entity must currently be in."),
+    for: DurationSchema.optional().describe(
+      "Optional minimum dwell — the entity must have held `status` at least this long.",
+    ),
+  }),
+});
+export type StateCondition = z.infer<typeof StateConditionSchema>;
 /**
  * A condition is either:
- *   - A template string that evaluates truthy/falsy, OR
- *   - A `{ and | or | not }` combinator wrapping nested conditions.
+ *   - A template string that evaluates truthy/falsy,
+ *   - A `{ and | or | not }` combinator wrapping nested conditions, OR
+ *   - A structured variant (`numeric_state`, `time`, `state`).
  *
- * Recursive — uses `z.lazy` for the combinator branches.
+ * Recursive — uses `z.lazy` for the combinator branches. The raw template
+ * string stays the escape hatch for anything the structured variants don't
+ * cover.
  */
 export type ConditionInput =
   | string
   | { and: ConditionInput[] }
   | { or: ConditionInput[] }
-  | { not: ConditionInput };
+  | { not: ConditionInput }
+  | NumericStateCondition
+  | TimeCondition
+  | StateCondition;
 export const ConditionSchema: z.ZodType<ConditionInput> = z.lazy(() =>
   z.union([
@@ -93,6 +294,9 @@ export const ConditionSchema: z.ZodType<ConditionInput> = z.lazy(() =>
     z.object({
       not: ConditionSchema,
     }),
+    NumericStateConditionSchema,
+    TimeConditionSchema,
+    StateConditionSchema,
   ]),
 );
@@ -152,6 +356,7 @@ export type ActionInput =
   | ConditionGuardInput
   | StopInput
   | WaitForTriggerInput
+  | WaitUntilInput
   | SequenceInput;
 export interface ChooseInput {
@@ -230,6 +435,19 @@ export interface WaitForTriggerInput {
   };
 }
+export interface WaitUntilInput {
+  id?: string;
+  description?: string;
+  enabled?: boolean;
+  continue_on_error?: boolean;
+  wait_until: {
+    condition: ConditionInput;
+    timeout_seconds?: number;
+    /** Default true (HA semantics): on timeout, continue rather than fail. */
+    continue_on_timeout?: boolean;
+  };
+}
 export interface SequenceInput {
   id?: string;
   description?: string;
@@ -392,6 +610,30 @@ export const WaitForTriggerActionSchema = z.object({
   }),
 });
+/**
+ * 11. Wait until — suspend the run until a CONDITION becomes true, with an
+ * optional timeout. Unlike `wait_for_trigger` (wait for an *event*), this
+ * polls the condition on an interval, re-resolving live state each tick.
+ *
+ * `continue_on_timeout` defaults to true (HA's `wait_template` semantics):
+ * on timeout the run continues rather than failing.
+ */
+export const WaitUntilActionSchema: z.ZodType<WaitUntilInput> = z.lazy(() =>
+  z.object({
+    ...ActionBase,
+    wait_until: z.object({
+      condition: ConditionSchema,
+      timeout_seconds: z
+        .number()
+        .int()
+        .min(1)
+        .max(60 * 60 * 24 * 30) // 30 days
+        .optional(),
+      continue_on_timeout: z.boolean().default(true),
+    }),
+  }),
+);
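
Note: the `wait_until` semantics above (poll a condition, optional timeout, `continue_on_timeout` defaulting to true) reduce to a small per-tick decision. The sketch below is a pure function under stated assumptions: `waitUntilTick` and the outcome names are hypothetical, and checking the condition before the timeout on the same tick is a choice the diff does not confirm.

```typescript
interface WaitUntilConfig {
  timeoutSeconds?: number;
  continueOnTimeout?: boolean; // defaults to true (HA wait_template semantics)
}

type WaitOutcome =
  | "resume" // condition became true
  | "keep_waiting" // poll again on the next tick
  | "timeout_continue" // timed out, run continues
  | "timeout_fail"; // timed out, run fails

// Decide what one poll tick should do. The caller evaluates the condition
// against freshly resolved state and passes the result in.
function waitUntilTick(
  config: WaitUntilConfig,
  startedAtMs: number,
  nowMs: number,
  conditionHolds: boolean,
): WaitOutcome {
  // Assumption: a true condition wins even if the timeout also elapsed.
  if (conditionHolds) return "resume";
  if (
    config.timeoutSeconds !== undefined &&
    nowMs - startedAtMs >= config.timeoutSeconds * 1000
  ) {
    return (config.continueOnTimeout ?? true) ? "timeout_continue" : "timeout_fail";
  }
  return "keep_waiting";
}

console.log(waitUntilTick({ timeoutSeconds: 60 }, 0, 61_000, false)); // "timeout_continue"
```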
 /**
  * The discriminated union of all 9 action primitives. Discrimination is by
  * presence of a key (action / choose / parallel / etc.), matching the
@@ -411,6 +653,7 @@ export const ActionSchema: z.ZodType<ActionInput> = z.lazy(() =>
     ConditionGuardActionSchema,
     StopActionSchema,
     WaitForTriggerActionSchema,
+    WaitUntilActionSchema,
     SequenceActionSchema,
   ]),
 );
@@ -434,6 +677,27 @@ export const AutomationModeSchema = z
 export type AutomationMode = z.infer<typeof AutomationModeSchema>;
+/**
+ * Scope the concurrency `mode` is evaluated over.
+ *
+ *   - `automation` (default): one concurrency bucket for the whole
+ *     automation. `single` allows one in-flight run total; `restart`
+ *     cancels every active run.
+ *   - `context_key`: an independent bucket per `contextKey` (typically
+ *     per system / incident). `single` allows one in-flight run *per
+ *     context key* (system A and system B run concurrently, but a second
+ *     run for system A is deduped); `restart` cancels only the active
+ *     runs sharing the incoming context key.
+ *
+ * Backward-compatible: omitted defaults to `automation`, so existing
+ * automations behave exactly as before.
+ */
+export const ConcurrencyScopeSchema = z
+  .enum(["automation", "context_key"])
+  .default("automation");
+export type ConcurrencyScope = z.infer<typeof ConcurrencyScopeSchema>;
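
Note: the two scopes differ only in which bucket a run is counted against. A minimal sketch for `single` mode follows. `bucketKey` and `SingleModeGate` are hypothetical names, and falling back to the automation-wide bucket when no context key is present is an assumption.

```typescript
type ConcurrencyScope = "automation" | "context_key";

function bucketKey(
  automationId: string,
  scope: ConcurrencyScope,
  contextKey?: string,
): string {
  // Assumption: a run without a context key uses the automation-wide bucket.
  if (scope === "context_key" && contextKey !== undefined) {
    return `${automationId}::${contextKey}`;
  }
  return automationId;
}

// `single` mode: admit a run only when its bucket has no in-flight run.
class SingleModeGate {
  private readonly inFlight = new Set<string>();

  tryStart(key: string): boolean {
    if (this.inFlight.has(key)) return false;
    this.inFlight.add(key);
    return true;
  }

  finish(key: string): void {
    this.inFlight.delete(key);
  }
}

const gateByKey = new SingleModeGate();
const startedA = gateByKey.tryStart(bucketKey("auto-1", "context_key", "sys-a"));
const startedB = gateByKey.tryStart(bucketKey("auto-1", "context_key", "sys-b"));
const secondA = gateByKey.tryStart(bucketKey("auto-1", "context_key", "sys-a"));
console.log(startedA, startedB, secondA); // true true false
```

With `scope` set to `"automation"` all three calls would share one bucket, so only the first would start.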
 // ─── Automation definition (top-level) ────────────────────────────────────
 /**
@@ -453,8 +717,40 @@ export const AutomationDefinitionSchema = z.object({
   actions: z.array(ActionSchema).default([]),
   /** Concurrency mode. */
   mode: AutomationModeSchema,
+  /** Scope the concurrency mode is evaluated over (per-automation vs per-context-key). */
+  concurrency_scope: ConcurrencyScopeSchema,
   /** Max parallel runs (only meaningful in parallel/queued modes). */
   max_runs: z.number().int().min(1).max(1000).default(10),
+  /**
+   * Explicit live-state resolution list (sensing layer). By default the
+   * engine resolves the state of the system named by the trigger's
+   * `contextKey` (the common single-system case). Listing system ids here
+   * resolves their state too, surfaced in templates under
+   * `health.systems[<id>]` for cross-system rules. Bounded to keep the
+   * pre-evaluation batch query cheap.
+   */
+  uses_state: z
+    .array(z.string().min(1))
+    .max(50)
+    .optional()
+    .describe(
+      "Extra system ids whose live health state is resolved into scope under health.systems[id].",
+    ),
+  /**
+   * Trailing window (minutes) for the `health.*.transitions_in_window`
+   * count folded into scope. Lets an operator author custom flapping
+   * rules ("N status changes in M minutes") via a numeric_state condition
+   * over that field. Defaults to 60 when omitted.
+   */
+  state_window_minutes: z
+    .number()
+    .int()
+    .min(1)
+    .max(60 * 24 * 7) // up to a week
+    .optional()
+    .describe(
+      "Window (minutes) for health.*.transitions_in_window. Default 60.",
+    ),
 });
 export type AutomationDefinition = z.infer<typeof AutomationDefinitionSchema>;
@@ -464,10 +760,19 @@ export type AutomationDefinition = z.infer<typeof AutomationDefinitionSchema>;
 export const AutomationStatusSchema = z.enum(["enabled", "disabled"]);
 export type AutomationStatus = z.infer<typeof AutomationStatusSchema>;
+/**
+ * Optional grouping label, stored as its own row column (not part of the
+ * definition / YAML). A single free-text "category" (HA-style) used purely
+ * to organise the list into collapsible sections. Empty / absent means the
+ * automation lives in the implicit "Ungrouped" bucket.
+ */
+export const AutomationGroupSchema = z.string().trim().min(1).max(120);
 export const AutomationSchema = z.object({
   id: z.string(),
   name: z.string(),
   description: z.string().optional(),
+  group: AutomationGroupSchema.optional(),
   status: AutomationStatusSchema,
   definition: AutomationDefinitionSchema,
   managedBy: z.string().optional().describe("GitOps provider id when managed declaratively"),
@@ -580,6 +885,7 @@ export type AutomationArtifact = z.infer<typeof AutomationArtifactSchema>;
 export const CreateAutomationInputSchema = z.object({
   name: z.string().min(1).max(200),
   description: z.string().optional(),
+  group: AutomationGroupSchema.optional(),
   status: AutomationStatusSchema.default("enabled"),
   definition: AutomationDefinitionSchema,
 });
@@ -590,6 +896,12 @@ export const UpdateAutomationInputSchema = z.object({
   id: z.string(),
   name: z.string().min(1).max(200).optional(),
   description: z.string().optional(),
+  /**
+   * `null` clears the group (back to Ungrouped); `undefined` leaves it
+   * unchanged; a string sets it. Modelled with `.nullish()` so the set-builder
+   * can distinguish "clear" from "no change".
+   */
+  group: AutomationGroupSchema.nullish(),
   status: AutomationStatusSchema.optional(),
   definition: AutomationDefinitionSchema.optional(),
 });
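
Note: the three-way `group` semantics (`undefined` leaves it unchanged, `null` clears it, a string sets it) are easy to get wrong in an update path. A sketch of a set-builder that preserves the distinction follows; `buildSet` and `UpdateInput` are hypothetical and only illustrate the idea behind `.nullish()`.

```typescript
interface UpdateInput {
  id: string;
  name?: string;
  group?: string | null;
}

interface SetClause {
  name?: string;
  group?: string | null;
}

// Only keys present in the result are written to the row.
function buildSet(input: UpdateInput): SetClause {
  const set: SetClause = {};
  if (input.name !== undefined) set.name = input.name;
  // `null` passes this check on purpose: it clears the column.
  if (input.group !== undefined) set.group = input.group;
  return set;
}

console.log(buildSet({ id: "a-1" })); // {}
console.log(buildSet({ id: "a-1", group: null })); // { group: null }
console.log(buildSet({ id: "a-1", group: "paging" })); // { group: "paging" }
```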
@@ -645,6 +957,12 @@ export const TriggerInfoSchema = z.object({
     .string()
     .optional()
     .describe("Default context key path inside the payload (e.g. 'incidentId')"),
+  contextKeyLabel: z
+    .string()
+    .optional()
+    .describe(
+      "Human label for the trigger's built-in context dimension (e.g. 'system'). UI hint only - shown as the default partition in the window editor. Absent ⇒ no built-in context (per automation).",
+    ),
 });
 export type TriggerInfo = z.infer<typeof TriggerInfoSchema>;
@@ -680,3 +998,64 @@ export const ArtifactTypeInfoSchema = z.object({
 });
 export type ArtifactTypeInfo = z.infer<typeof ArtifactTypeInfoSchema>;
+// ─── Reactive entity engine — two-stage queue payloads ─────────────────────
+//
+// The entity state machine (`defineEntity`, automation-backend) emits an
+// internal `ENTITY_CHANGED` hook on every real diff. These schemas are the
+// wire shapes for the two-stage dispatch pipeline (reactive automation
+// engine §13.2). Defined here so later phases (Stage-1 routing, Stage-2
+// dispatch) consume one canonical zod source of truth.
+/**
+ * Stage-1 input — the `ENTITY_CHANGED` hook payload. `prev` is null on
+ * create; `next` is null on remove (a tombstone). `delta` carries only the
+ * changed fields; `actor` is the mutating actor (from `HookEventMeta`).
+ */
+export const EntityChangedSchema = z.object({
+  kind: z.string(),
+  id: z.string(),
+  prev: z.record(z.string(), z.unknown()).nullable(), // null on create
+  next: z.record(z.string(), z.unknown()).nullable(), // null on remove (tombstone)
+  delta: z.record(z.string(), z.unknown()), // changed fields only
+  changedFields: z.array(z.string()),
+  actor: ActorSchema, // from HookEventMeta
+  occurredAt: z.string(), // ISO
+  /**
+   * Stable, per-change identity generated ONCE at emit time and carried
+   * through every at-least-once redelivery of THIS change. Distinguishes two
+   * DISTINCT changes to the same entity that share an `occurredAt`
+   * (millisecond granularity) so the Stage-2 trigger jobId dedupes
+   * redeliveries of one change without collapsing two real changes (reactive
+   * automation engine §13.2). Optional for back-compat with payloads emitted
+   * before this field existed; the router falls back to `occurredAt`.
+   */
+  changeId: z.string().optional(),
+});
+export type EntityChanged = z.infer<typeof EntityChangedSchema>;
+/**
+ * Stage-2 per-run dispatch job. Either a fresh run from a trigger match,
+ * or a resume of a suspended `wait_until`.
+ */
+export const DispatchJobSchema = z.discriminatedUnion("reason", [
+  z.object({
+    // fresh run from a trigger match
+    reason: z.literal("trigger"),
+    automationId: z.string(),
+    triggerId: z.string(),
+    ref: z.string(), // `${kind}:${id}`
+    changed: EntityChangedSchema,
+  }),
+  z.object({
+    // resume a suspended wait_until
+    reason: z.literal("wake"),
+    runId: z.string(),
+    waitLockId: z.string(),
+    ref: z.string(),
+    changed: EntityChangedSchema,
+  }),
+]);
+export type DispatchJob = z.infer<typeof DispatchJobSchema>;
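
Note: the `changeId` comment above says the Stage-2 trigger jobId dedupes redeliveries of one change without collapsing two distinct changes, with a fallback to `occurredAt` for older payloads. The sketch below illustrates that idea only. `triggerJobId` and its `|`-joined layout are hypothetical; the real jobId format is not shown in this diff.

```typescript
interface ChangeIdentity {
  kind: string;
  id: string;
  occurredAt: string; // ISO timestamp
  changeId?: string; // absent on payloads emitted before the field existed
}

// Hypothetical jobId layout, for illustration only.
function triggerJobId(
  automationId: string,
  triggerId: string,
  change: ChangeIdentity,
): string {
  const identity = change.changeId ?? change.occurredAt;
  return [automationId, triggerId, `${change.kind}:${change.id}`, identity].join("|");
}

const base = { kind: "incident", id: "inc-1", occurredAt: "2024-01-15T12:00:00.000Z" };
const first = triggerJobId("a-1", "t-1", { ...base, changeId: "c-1" });
const redelivered = triggerJobId("a-1", "t-1", { ...base, changeId: "c-1" });
const second = triggerJobId("a-1", "t-1", { ...base, changeId: "c-2" });

console.log(first === redelivered); // true: same change, same job
console.log(first === second); // false: same millisecond, distinct changes
```

Without `changeId`, two changes sharing an `occurredAt` would map to the same key, which is the collision the new field is meant to avoid.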

package/src/script-test-schemas.ts ADDED

@@ -0,0 +1,91 @@
+import { z } from "zod";
+/**
+ * Wire schemas for the in-UI script test endpoint (`testScript`).
+ *
+ * The editable sample context mirrors what a `run_script` action sees as
+ * `globalThis.context` and what `run_shell` flattens into `$CHECKSTACK_*`
+ * env vars. Every field is optional so a partial sample still runs.
+ */
+export const ScriptTestKindSchema = z.enum(["typescript", "shell"]);
+export type ScriptTestKind = z.infer<typeof ScriptTestKindSchema>;
+export const ScriptTestContextSchema = z.object({
+  trigger: z
+    .object({
+      event: z.string().optional(),
+      payload: z.record(z.string(), z.unknown()).optional(),
+    })
+    .optional(),
+  artifacts: z.record(z.string(), z.unknown()).optional(),
+  var: z.record(z.string(), z.unknown()).optional(),
+  repeat: z
+    .object({
+      index: z.number().int(),
+      item: z.unknown().optional(),
+    })
+    .optional(),
+});
+export type ScriptTestContext = z.infer<typeof ScriptTestContextSchema>;
+export const ScriptTestInputSchema = z.object({
+  kind: ScriptTestKindSchema,
+  script: z.string(),
+  context: ScriptTestContextSchema.optional(),
+  env: z.record(z.string(), z.string()).optional(),
+  /**
+   * The script's declared secret -> env mapping
+   * (`{ ENV_NAME: "${{ secrets.NAME }}" }`). The test panel NEVER resolves
+   * real secret values: for each entry it injects a named placeholder
+   * (`__SECRET_<NAME>__`) by default, or the user's override value (see
+   * `secretOverrides`) for a realistic run. Real production values never
+   * reach the test surface (decision 4).
+   */
+  secretEnv: z.record(z.string(), z.string()).optional(),
+  /**
+   * User-supplied per-secret-NAME override values for a realistic test
+   * (keyed by the `${{ secrets.NAME }}` name, not the env var). These stay
+   * client-side until sent here as an explicit test input, and are masked
+   * out of the test result so even an override can't round-trip unmasked.
+   */
+  secretOverrides: z.record(z.string(), z.string()).optional(),
+  workingDirectory: z.string().optional(),
+  timeoutMs: z.number().int().min(100).max(300_000).default(30_000),
+});
+export type ScriptTestInputDto = z.infer<typeof ScriptTestInputSchema>;
+export const ScriptTestResultSchema = z.object({
+  result: z.unknown().optional(),
+  stdout: z.string(),
+  stderr: z.string(),
+  exitCode: z.number().int().optional(),
+  durationMs: z.number().int().nonnegative(),
+  timedOut: z.boolean(),
+  error: z.string().optional(),
+});
+export type ScriptTestResultDto = z.infer<typeof ScriptTestResultSchema>;
+/**
+ * Input for `getRunScopeForReplay`: reconstruct a test context from a real
+ * automation run so an operator can debug an action against the data that
+ * run saw. `actionPath` is accepted for forward-compatibility (scoping
+ * artifacts to those available before that action); v1 returns the full
+ * run scope.
+ */
+export const ReplayScopeInputSchema = z.object({
+  runId: z.string(),
+  actionPath: z.string().optional(),
+});
+export type ReplayScopeInputDto = z.infer<typeof ReplayScopeInputSchema>;
+export const ReplayScopeResultSchema = z.object({
+  context: ScriptTestContextSchema,
+  /**
+   * False when `var` / `repeat` could not be recovered because the run's
+   * durable scope snapshot was already cleared (terminal status). The
+   * trigger + artifacts are still reconstructed; the UI can note the gap.
+   */
+  scopeSnapshotAvailable: z.boolean(),
+});
+export type ReplayScopeResultDto = z.infer<typeof ReplayScopeResultSchema>;
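
Note: the `secretEnv` / `secretOverrides` comments describe injecting a `__SECRET_<NAME>__` placeholder per declared secret unless the user supplies an override, and masking overrides out of the result. A sketch of that behaviour follows. `buildTestSecretEnv`, `maskOverrides`, the reference regex and the `***` mask token are all assumptions; the diff shows only the wire schemas, not the backend logic.

```typescript
// Matches a plain reference such as "${{ secrets.API_TOKEN }}".
const SECRET_REF = /^\$\{\{\s*secrets\.([A-Za-z0-9_]+)\s*\}\}$/;

// Build the env injected into a test run. Real secret values are never read.
function buildTestSecretEnv(
  secretEnv: Record<string, string>,
  overrides: Record<string, string> = {},
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const [envName, ref] of Object.entries(secretEnv)) {
    const match = SECRET_REF.exec(ref.trim());
    // Assumption: entries that are not a plain secret reference are skipped.
    if (match === null) continue;
    const name = match[1] ?? "";
    // Overrides are keyed by the secret NAME, not by the env var.
    env[envName] = overrides[name] ?? `__SECRET_${name}__`;
  }
  return env;
}

// Replace every override value in captured output with a mask token.
function maskOverrides(output: string, overrides: Record<string, string>): string {
  let masked = output;
  for (const value of Object.values(overrides)) {
    if (value.length > 0) masked = masked.split(value).join("***");
  }
  return masked;
}

const declared = { TOKEN: "${{ secrets.API_TOKEN }}", DB_PASS: "${{ secrets.DB_PASSWORD }}" };
const userOverrides = { API_TOKEN: "test-token-123" };

console.log(buildTestSecretEnv(declared, userOverrides));
// { TOKEN: "test-token-123", DB_PASS: "__SECRET_DB_PASSWORD__" }
console.log(maskOverrides("auth with test-token-123 ok", userOverrides));
// "auth with *** ok"
```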

package/src/variable-scope.test.ts CHANGED

@@ -164,6 +164,7 @@ function basicDefinition(
     conditions: [],
     actions: [],
     mode: "single",
+    concurrency_scope: "automation",
     max_runs: 1,
     ...overrides,
   };
@@ -237,6 +238,33 @@ describe("resolveVariableScope", () => {
     expect(unmatched).toContain("trigger.actor.type");
   });
+  it("exposes the health.* sensing namespace on every automation", () => {
+    const flat = flattenScope(
+      resolveVariableScope({
+        definition: basicDefinition(),
+        triggers: [triggerInfo],
+        actions: [],
+        artifactTypes: [],
+        path: [{ slot: "root", index: 0 }],
+      }),
+    );
+    const byPath = new Map(flat.map((e) => [e.path, e]));
+    expect(byPath.has("health")).toBe(true);
+    expect(byPath.has("health.system")).toBe(true);
+    expect(byPath.has("health.systems")).toBe(true);
+    expect(byPath.has("health.system.status")).toBe(true);
+    expect(byPath.has("health.system.in_status_since")).toBe(true);
+    expect(byPath.has("health.system.in_maintenance")).toBe(true);
+    // status leaf is typed as the status literal union
+    expect(byPath.get("health.system.status")?.type).toBe(
+      '"healthy" | "degraded" | "unhealthy"',
+    );
+    // never gated on a trigger subset — the engine always folds it in.
+    expect(
+      byPath.get("health.system.status")?.conditionalOnTriggers,
+    ).toBeUndefined();
+  });
   it("exposes trigger.id typed as the literal union of the automation's trigger ids", () => {
     const definition = basicDefinition({
       // Two triggers on the SAME event, distinguished by explicit ids.

package/src/variable-scope.ts CHANGED

@@ -519,6 +519,107 @@ function buildActorEntry(): VariableEntry {
   };
 }
+/**
+ * Static `health.*` namespace (sensing layer). The dispatch engine
+ * pre-resolves live health state into scope before evaluating any
+ * condition/template, so `health.system` (the trigger's context system)
+ * and `health.systems[<id>]` (any `uses_state` ids) are always readable
+ * as plain data. Offered unconditionally - the engine always folds the
+ * namespace in (empty when nothing resolves), so referencing it never
+ * throws.
+ */
+const HEALTH_STATE_PROPERTIES: Record<string, Record<string, unknown>> = {
+  status: { type: "string", enum: ["healthy", "degraded", "unhealthy"] },
+  in_status_since: { type: ["string", "null"] },
+  in_status_for_ms: { type: "number" },
+  latency_ms: { type: "number" },
+  avg_latency_ms: { type: "number" },
+  p95_latency_ms: { type: "number" },
+  success_rate: { type: "number" },
+  last_run_at: { type: "string" },
+  in_maintenance: { type: "boolean" },
+  transitions_in_window: { type: "number" },
+  transition_window_minutes: { type: "number" },
+  evaluated_at: { type: "string" },
+};
+const HEALTH_STATE_DESCRIPTIONS: Record<string, string> = {
+  status: "Aggregate health status of the system.",
+  in_status_since: "ISO timestamp the system entered its current status (null if unknown).",
+  in_status_for_ms: "Milliseconds the system has held its current status.",
+  latency_ms: "Latency of the newest run.",
+  avg_latency_ms: "Windowed average latency.",
+  p95_latency_ms: "Windowed p95 latency.",
+  success_rate: "Windowed success rate in [0, 1].",
+  last_run_at: "ISO timestamp of the newest run.",
+  in_maintenance: "Whether the system is in an active maintenance window.",
+  transitions_in_window:
+    "Status changes in the trailing window (generalized flapping count).",
+  transition_window_minutes:
+    "The window (minutes) transitions_in_window was counted over.",
+  evaluated_at: "ISO timestamp this snapshot was computed.",
+};
+function buildHealthStateChildren(prefix: string): VariableEntry[] {
+  return Object.entries(HEALTH_STATE_PROPERTIES).map(([key, jsonSchema]) => ({
+    path: `${prefix}.${key}`,
+    templateRef: `${prefix}.${key}`,
+    type:
+      key === "status"
+        ? '"healthy" | "degraded" | "unhealthy"'
+        : key === "in_maintenance"
+          ? "boolean"
+          : jsonSchema.type === "number"
+            ? "number"
+            : "string",
+    description: HEALTH_STATE_DESCRIPTIONS[key],
+    jsonSchema,
+  }));
+}
+function buildHealthEntry(): VariableEntry {
+  const stateSchema = {
+    type: "object",
+    properties: HEALTH_STATE_PROPERTIES,
+  };
+  return {
+    path: "health",
+    templateRef: "health",
+    type: "object",
+    description:
+      "Live health state, pre-resolved before evaluation. Use with duration filters, e.g. {{ health.system.in_status_since | older_than(30 | minutes) }}.",
+    jsonSchema: {
+      type: "object",
+      properties: {
+        system: stateSchema,
+        systems: { type: "object", additionalProperties: stateSchema },
+      },
+    },
+    children: [
+      {
+        path: "health.system",
+        templateRef: "health.system",
+        type: "object",
+        description:
+          "Live state of the system named by the trigger's context key.",
+        jsonSchema: stateSchema,
+        children: buildHealthStateChildren("health.system"),
+      },
+      {
+        path: "health.systems",
+        templateRef: "health.systems",
+        type: "Record<string, HealthState>",
+        description:
+          "Live state of every resolved system keyed by id (context system + uses_state).",
+        jsonSchema: {
+          type: "object",
+          additionalProperties: stateSchema,
+        },
+      },
+    ],
+  };
+}
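
Note: the `health` description above pairs the namespace with duration filters, e.g. `{{ health.system.in_status_since | older_than(30 | minutes) }}`. The changelog describes four filters: `minutes` and `hours` (number to milliseconds), `duration_since` (ms elapsed since an ISO timestamp) and `older_than(thresholdMs)`, all fail-safe on null or unparseable input. The sketch below mirrors that description in plain functions. It is not the `@checkstack/template-engine` implementation; the exact fail-safe values (`null` and `false`), the `>=` comparison and the injectable `nowMs` parameter are assumptions.

```typescript
const minutes = (n: number): number => n * 60_000;
const hours = (n: number): number => n * 3_600_000;

// Milliseconds elapsed since an ISO timestamp, or null when it cannot be parsed.
function durationSince(
  iso: string | null | undefined,
  nowMs: number = Date.now(),
): number | null {
  if (iso === null || iso === undefined) return null;
  const then = Date.parse(iso);
  return Number.isNaN(then) ? null : nowMs - then;
}

// Boolean dwell check: has at least `thresholdMs` elapsed since `iso`?
function olderThan(
  iso: string | null | undefined,
  thresholdMs: number,
  nowMs: number = Date.now(),
): boolean {
  const elapsed = durationSince(iso, nowMs);
  // Assumption: unknown timestamps fail safe to false.
  return elapsed !== null && elapsed >= thresholdMs;
}

const nowMs = Date.parse("2024-01-15T12:00:00Z");
console.log(olderThan("2024-01-15T11:15:00Z", minutes(30), nowMs)); // true (45 minutes elapsed)
console.log(olderThan("2024-01-15T11:15:00Z", hours(1), nowMs)); // false
console.log(olderThan(null, minutes(30), nowMs)); // false
```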
 /**
  * Derive a stable, identifier-safe id for a trigger from its event id, used
  * when the operator hasn't assigned an explicit `id`. Mirrors the backend
@@ -976,7 +1077,7 @@ export function resolveVariableScope(
     registeredTriggers: triggers,
   });
-  const entries: VariableEntry[] = [...triggerEntries];
+  const entries: VariableEntry[] = [...triggerEntries, buildHealthEntry()];
   if (accumulated.vars.length > 0) {
     entries.push({

package/src/window-schema.test.ts ADDED

@@ -0,0 +1,89 @@
+import { describe, expect, it } from "bun:test";
+import { TriggerSchema, WindowSchema } from "./schemas";
+describe("WindowSchema", () => {
+  it("parses a full window block", () => {
+    const parsed = WindowSchema.parse({
+      count: 3,
+      minutes: 60,
+      refire: "once",
+    });
+    expect(parsed).toEqual({ count: 3, minutes: 60, refire: "once" });
+  });
+  it("defaults refire to `every` when omitted", () => {
+    const parsed = WindowSchema.parse({ count: 5, minutes: 10 });
+    expect(parsed.refire).toBe("every");
+  });
+  it("rejects count below 1", () => {
+    expect(() => WindowSchema.parse({ count: 0, minutes: 10 })).toThrow();
+  });
+  it("rejects count above 1000", () => {
+    expect(() => WindowSchema.parse({ count: 1001, minutes: 10 })).toThrow();
+  });
+  it("rejects minutes above 1440 (the 24h cap)", () => {
+    expect(() => WindowSchema.parse({ count: 3, minutes: 1441 })).toThrow();
+  });
+  it("rejects a non-integer count", () => {
+    expect(() => WindowSchema.parse({ count: 2.5, minutes: 10 })).toThrow();
+  });
+  it("rejects an unknown refire mode", () => {
+    expect(() =>
+      WindowSchema.parse({ count: 3, minutes: 10, refire: "cooldown" }),
+    ).toThrow();
+  });
+  it("leaves partitionBy undefined when omitted", () => {
+    const parsed = WindowSchema.parse({ count: 3, minutes: 60 });
+    expect(parsed.partitionBy).toBeUndefined();
+  });
+  it("parses a partitionBy bare expression and trims surrounding whitespace", () => {
+    const parsed = WindowSchema.parse({
+      count: 3,
+      minutes: 60,
+      partitionBy: "  trigger.payload.severity  ",
+    });
+    expect(parsed.partitionBy).toBe("trigger.payload.severity");
+  });
+  it("rejects an empty / whitespace-only partitionBy", () => {
+    expect(() =>
+      WindowSchema.parse({ count: 3, minutes: 60, partitionBy: "   " }),
+    ).toThrow();
+  });
+});
+describe("TriggerSchema window field", () => {
+  it("accepts a trigger without a window (window optional)", () => {
+    const parsed = TriggerSchema.parse({
+      event: "healthcheck.system_health_changed",
+    });
+    expect(parsed.window).toBeUndefined();
+  });
+  it("accepts a trigger carrying a window + filter (flapping shape)", () => {
+    const parsed = TriggerSchema.parse({
+      id: "flapping",
+      event: "healthcheck.system_health_changed",
+      filter: '{{ trigger.payload.newStatus != "healthy" }}',
+      window: { count: 3, minutes: 60, refire: "once" },
+    });
+    expect(parsed.window).toEqual({ count: 3, minutes: 60, refire: "once" });
+    expect(parsed.filter).toBe('{{ trigger.payload.newStatus != "healthy" }}');
+  });
+  it("applies the refire default inside a trigger window", () => {
+    const parsed = TriggerSchema.parse({
+      event: "healthcheck.check_failed",
+      window: { count: 5, minutes: 10 },
+    });
+    expect(parsed.window?.refire).toBe("every");
+  });
+});
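
Note: the `partitionBy` description in `WindowSchema` says an omitted value, an empty or undefined result, or an evaluation error all fall back to the trigger's built-in context key. A sketch of that resolution order follows. `resolvePartitionKey` and the dotted-path `evaluate` function are hypothetical stand-ins for the engine's expression evaluator, and treating `null` like `undefined` is an assumption.

```typescript
function resolvePartitionKey(
  partitionBy: string | undefined,
  builtInKey: string,
  evaluate: (expression: string) => unknown, // hypothetical evaluator
): string {
  if (partitionBy === undefined) return builtInKey;
  try {
    const value = evaluate(partitionBy);
    // Assumption: null is treated like undefined.
    if (value === undefined || value === null) return builtInKey;
    const text = String(value).trim();
    return text.length > 0 ? text : builtInKey;
  } catch {
    // An evaluation error falls back to the built-in key.
    return builtInKey;
  }
}

// Toy evaluator: resolves a dotted path and throws when it walks off an object.
const scope: unknown = { trigger: { payload: { severity: "critical", note: "" } } };
const evaluate = (expression: string): unknown =>
  expression.split(".").reduce<unknown>((acc, part) => {
    if (typeof acc !== "object" || acc === null) {
      throw new Error(`cannot read ${part}`);
    }
    return (acc as Record<string, unknown>)[part];
  }, scope);

console.log(resolvePartitionKey("trigger.payload.severity", "sys-1", evaluate)); // "critical"
console.log(resolvePartitionKey("trigger.payload.missing", "sys-1", evaluate)); // "sys-1"
console.log(resolvePartitionKey("trigger.nope.deep", "sys-1", evaluate)); // "sys-1"
```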