@checkstack/incident-backend 1.3.0 → 1.4.0
- package/CHANGELOG.md +157 -0
- package/package.json +18 -18
- package/src/automations.test.ts +356 -5
- package/src/automations.ts +322 -34
- package/src/hooks.ts +8 -53
- package/src/incident-entity.test.ts +266 -0
- package/src/incident-entity.ts +192 -0
- package/src/index.ts +96 -16
- package/src/router.ts +162 -98
- package/src/service.test.ts +199 -0
- package/src/service.ts +147 -3
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,162 @@
 # @checkstack/incident-backend
 
+## 1.4.0
+
+### Minor Changes
+
+- 270ef29: Replace the hardcoded auto-incident path with default automations (Wave 2 Phase 20).
+
+  BREAKING CHANGES: Auto-incident is now automation-driven. The hardcoded background path that opened incidents on sustained-unhealthy / flapping and closed them after a cooldown (`auto-incident.ts`, `auto-incident-close-job.ts`) is removed. On upgrade, an idempotent, threshold-preserving migration seeds equivalent default automations from each assignment's existing `NotificationPolicy`, so alerting behaviour is preserved 1:1:
+
+  - `sustainedUnhealthyTrigger.durationMinutes` -> the `for:` dwell on a `healthcheck.system_degraded` trigger -> `incident.create`.
+  - auto-close `autoCloseAfterMinutes` -> a `wait_until` (healthy continuously for the cooldown) -> `incident.resolve`.
+  - `useNotificationSuppression` -> the incident's `suppressNotifications`.
+  - `skipDuringMaintenance` -> a `{{ !health.system.in_maintenance }}` pre-run condition.
+  - `flappingTrigger.{transitions,windowMinutes}` -> a second automation on the `healthcheck.flapping_detected` trigger -> `incident.create`.
+
+  Auto-incidents remain ONE OPEN INCIDENT PER SYSTEM, faithful to the old behaviour. `incident.create` gains an opt-in `dedupe_open_for_system` config flag (default false, so existing/custom automations are unaffected): when true, it reuses an existing open incident on the target system instead of opening a duplicate (the old `findActiveAutoIncident(systemId)` semantic), returning the reused incident as the produced `incident` artifact. The seeded default automations set this flag, so a system with several failing checks - sustained and/or flapping - still gets a single open incident; whichever check crosses its threshold first opens it, and the rest dedupe to it. Both sustained and flapping default automations open at `critical` severity (parity with the old path). Per-system run dedup within an automation uses `concurrency_scope: "context_key"` + `mode: "single"`.
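Editor's sketch (not part of the package): a minimal in-memory illustration of the `dedupe_open_for_system` semantic described above. `InMemoryIncidentStore`, its method names, and the `INC-<n>` id format are hypothetical; only the `{ incident, reused }` result shape mirrors the stub used in `automations.test.ts`. It assumes that any non-`resolved` incident counts as open. The real service also serializes the check-then-create per system, which a single-threaded sketch does not need.

```typescript
// Hypothetical in-memory stand-in for the incident tables.
interface Incident {
  id: string;
  status: string; // e.g. "investigating" or "resolved"
  severity: string;
  systemIds: string[];
}

interface CreateInput {
  title: string;
  severity: string;
  systemIds: string[];
  dedupeOpenForSystem?: boolean; // mirrors `dedupe_open_for_system` (default false)
}

class InMemoryIncidentStore {
  private incidents: Incident[] = [];
  private nextId = 1;

  // Returns the first non-resolved incident touching the system, if any.
  findOpenForSystem(systemId: string): Incident | undefined {
    return this.incidents.find(
      (i) => i.status !== "resolved" && i.systemIds.indexOf(systemId) !== -1,
    );
  }

  create(input: CreateInput): { incident: Incident; reused: boolean } {
    if (input.dedupeOpenForSystem) {
      for (const systemId of input.systemIds) {
        const open = this.findOpenForSystem(systemId);
        // Reuse instead of opening a duplicate; nothing is written.
        if (open) return { incident: open, reused: true };
      }
    }
    const incident: Incident = {
      id: `INC-${this.nextId++}`,
      status: "investigating",
      severity: input.severity,
      systemIds: [...input.systemIds],
    };
    this.incidents.push(incident);
    return { incident, reused: false };
  }
}
```

With the flag set, a second `create` for the same system returns the first incident with `reused: true`. Without it, every call opens a new incident.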
+
+  Operators can read, edit, disable, and extend these automations (see the "Customise auto-incident" guide). Seeded automations are tagged via `managedBy` (`auto-incident:<systemId>:<configurationId>:<kind>`) so the migration is a no-op on re-runs; anything unmappable is recorded as a migration-failure row.
+
+  Flapping DETECTION (transition recording + the `healthcheck.flapping_detected` emit) is relocated into `flapping-detector.ts` and survives; the emit now fires unconditionally on a threshold cross (no longer gated on `autoOpenIncidentOnUnhealthy`), matching the hook's documented intent and required for the flapping default automation. The legacy `health_check_auto_incidents` mapping table is no longer written or read (it will be dropped in a follow-up migration); `health_check_unhealthy_transitions` is retained for the flapping detector.
+
+  New service-typed `HealthCheckApi.listAutoIncidentPolicies` RPC exposes each assignment's effective notification policy for the migration. `incident.create` adds the `dedupe_open_for_system` flag (additive, defaults off).
+
+- 270ef29: Add an `incident` artifact type to the incident automation actions (Phase 20 prerequisite).
+
+  Closes GAP 2 from the Phase 20 analysis - a single automation can now open an incident and reference it downstream (open then wait then resolve) without the operator repeating the id.
+
+  - New `incident` artifact type registered in incident-backend (`{ incidentId, status, severity, systemIds }`).
+  - `incident.create` now declares `produces: "incident"`, so the created incident is queryable in run scope (mirrors the Jira `produces: "jira.issue"` pattern).
+  - `incident.resolve` / `incident.add_update` / `incident.update_status` now declare `consumes: ["incident"]` and make their `incidentId` config optional, falling back to the upstream `incident` artifact (config takes priority, else artifact - the `resolveIssueKey` pattern). They fail with a clear error when neither is present.
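Editor's sketch (not part of the package): the "config takes priority, else artifact" fallback from the list above. `resolveIncidentIdSketch` and its result type are hypothetical names; only the artifact shape `{ incidentId, status, severity, systemIds }` comes from the changelog.

```typescript
// Artifact shape as listed in the changelog entry.
interface IncidentArtifact {
  incidentId: string;
  status: string;
  severity: string;
  systemIds: string[];
}

type ResolveResult =
  | { ok: true; incidentId: string }
  | { ok: false; error: string };

// Hypothetical helper: explicit config wins, then the upstream artifact,
// otherwise a clear error instead of a silent `undefined`.
function resolveIncidentIdSketch(
  config: { incidentId?: string },
  artifact?: IncidentArtifact,
): ResolveResult {
  const fromConfig = config.incidentId?.trim();
  if (fromConfig) return { ok: true, incidentId: fromConfig };
  if (artifact?.incidentId) return { ok: true, incidentId: artifact.incidentId };
  return {
    ok: false,
    error: "No incidentId in config and no upstream `incident` artifact",
  };
}
```

This is what lets an open, wait, resolve automation omit the id on the downstream steps.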
+
+- 270ef29: Fix several correctness defects around distributed coordination and stored-data handling.
+
+  - Dwell `for:` timers now fire via an atomic `DELETE ... RETURNING` claim, so two pods (or the stalled sweeper vs the queue consumer) can no longer both fire the same dwell.
+  - Postgres session-level advisory locks now keep connection affinity. A shared `AdvisoryLockService` (backed by a dedicated pooled client) replaces the previous acquire/release-on-different-connection pattern that leaked locks. Used by the script-packages installer election, the automation run resume + stalled sweeper, and (via a new transaction-scoped `withXactLock`) incident dedup.
+  - A storage migration that crashed mid-flight is now resumed on startup under the installer-election lock, instead of permanently wedging installs.
+  - Distributed script-package blobs carry a `blobSha256` and are verified before extraction (the SRI `integrity` hashes the npm tarball, not the transported archive). Backward-safe: entries without the field skip verification until a re-install regenerates the manifest.
+  - Archive extraction rejects zip-slip paths (absolute or `..` entries) before writing anything.
+  - `incident.create` with `dedupe_open_for_system` serializes its check-then-create per system, so concurrent triggers for the same system can't both open a duplicate incident.
+  - Seeded auto-incident filter expressions JSON-encode interpolated ids so a quote/backslash can't corrupt the expression.
+  - Stored jsonb snapshots (dwell `actorSnapshot`, wait-lock `waitConfig`) are validated with zod on load and degrade safely instead of flowing through as the wrong type.
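Editor's sketch (not part of the package): one way to express the zip-slip rule from the list above ("absolute or `..` entries" are rejected before anything is written). `isSafeArchiveEntry` and `assertArchiveIsSafe` are hypothetical names, and the real check may normalise paths differently.

```typescript
// Returns true when an archive entry path is safe to extract under a target root.
// Assumption: entries use "/" or "\" as separators.
function isSafeArchiveEntry(entryPath: string): boolean {
  if (entryPath.length === 0) return false;
  // Absolute POSIX path, backslash-rooted path, or Windows drive letter.
  if (entryPath.startsWith("/") || entryPath.startsWith("\\")) return false;
  if (/^[A-Za-z]:/.test(entryPath)) return false;
  // Any `..` segment could climb out of the extraction root.
  const segments = entryPath.replace(/\\/g, "/").split("/");
  return !segments.some((segment) => segment === "..");
}

// Validate every entry up front, so a bad archive leaves no partial output.
function assertArchiveIsSafe(entryPaths: string[]): void {
  const bad = entryPaths.filter((p) => !isSafeArchiveEntry(p));
  if (bad.length > 0) {
    throw new Error(`Refusing to extract unsafe entries: ${bad.join(", ")}`);
  }
}
```

Checking the whole entry list before the first write is what makes the rejection happen "before writing anything".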
+
+- b995afb: Make incident automation actions fully reactive.
+
+  Only the `incident.create` action routed through the reactive `incident` entity; the `resolve`, `add_update`, and `update_status` actions called the incident service directly. Action-driven status flips therefore appended NO `entity_transitions` row, emitted NO `ENTITY_CHANGED` (so no `wait_until` woke), and fired NO `incident.resolved` / `.updated` derived trigger events — unlike the RPC router, which routes the same mutations through the entity handle.
+
+  The three actions now drive their writes through `writeIncidentEntity({ handle, incidentId, opts: { runId }, apply })` (re-reading the post-write state inside `apply` for the status-flipping actions), matching the router. As a result an action-driven resolve/status change now appends a transition, wakes suspended `wait_until` runs, and fires `incident.resolved` / `incident.updated`. The dispatch `runId` is passed so run-resolved secrets in the reactive state are masked.
+
+- b995afb: Make `incident` a plugin-backed reactive entity via the Model-B entity state machine.
+
+  The `incidents` + `incident_systems` tables are BOTH authoritative AND the `incident` entity's current-state storage - there is no framework `entity_state` row for an incident. `defineEntity` is given a plugin `read` accessor (`IncidentService.getManyEntityStates`) that projects the reactive subset `{ status, severity, systemIds }` straight off those tables, and every reactive-state write goes through `handle.mutate` / `handle.remove`: `apply` performs the REAL `incidents` / junction write (the plugin's own db/tx) and returns the new state; the framework snapshots `prev` via `read` BEFORE the write, appends the transition log (its own db), and emits `ENTITY_CHANGED` AFTER the write commits. Covered sites: create, update, add-update, resolve, auto-create, auto-resolve, and delete (tombstone), plus the `incident.create` / `incident.resolve` automation actions.
+
+  A change -> trigger-event deriver reproduces the existing qualified events so automations keep firing:
+
+  - create (`prev === null`) -> `incident.created`
+  - transition to `resolved` -> `incident.resolved`
+  - any other field change -> `incident.updated`
+  - delete (tombstone) -> no event (there is no `incident.deleted` trigger)
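Editor's sketch (not part of the package): the four deriver rules above as a pure function. The package exports a `deriveIncidentTriggerEvents` from `incident-entity.ts`, but its signature is not shown in this diff, so `deriveEventsSketch` below is a hypothetical stand-in. It assumes the reactive state is exactly `{ status, severity, systemIds }`, that `systemIds` order is irrelevant, and that an unchanged state derives no event.

```typescript
// Reactive subset named in the changelog.
interface IncidentState {
  status: string;
  severity: string;
  systemIds: string[];
}

type IncidentTriggerEvent =
  | "incident.created"
  | "incident.resolved"
  | "incident.updated";

// Order-insensitive comparison of affected systems (an assumption of this sketch).
function sameSystems(a: string[], b: string[]): boolean {
  if (a.length !== b.length) return false;
  const sortedA = [...a].sort();
  const sortedB = [...b].sort();
  return sortedA.every((id, i) => id === sortedB[i]);
}

function deriveEventsSketch(
  prev: IncidentState | null,
  next: IncidentState | null,
): IncidentTriggerEvent[] {
  // Delete (tombstone): there is no `incident.deleted` trigger.
  if (next === null) return [];
  // Create: no previous state.
  if (prev === null) return ["incident.created"];
  // A transition INTO resolved is classified as a resolution only.
  if (prev.status !== "resolved" && next.status === "resolved") {
    return ["incident.resolved"];
  }
  const changed =
    prev.status !== next.status ||
    prev.severity !== next.severity ||
    !sameSystems(prev.systemIds, next.systemIds);
  // No reactive change (e.g. a comment-only update) derives nothing.
  return changed ? ["incident.updated"] : [];
}
```

Classifying a transition to `resolved` as a single `incident.resolved` matches the breaking-change note that the `addUpdate`-with-status=resolved path no longer also fires `incident.updated`.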
+
+  The old `incident.created` / `incident.updated` / `incident.resolved` change hooks are removed in favor of these reactive change events; the catalog `system.deleted` consumer switched from `onHook(catalogHooks.systemDeleted)` to `onEntityChanged({ kind: "catalog-system" })` filtered to tombstones, keeping `work-queue` delivery (association cleanup must run once per cluster).
+
+  BREAKING CHANGES:
+
+  - The `incident.created` / `incident.updated` / `incident.resolved` cross-plugin hooks (the `createHook` descriptors) are removed. Incident lifecycle is now the reactive `incident` entity; the matching trigger events still fire (via the entity change deriver), so existing automations on `incident.created/.updated/.resolved` and external event-routing (e.g. the Jira integration's `incident.created` event type) keep working. No in-repo plugin subscribed to the removed hooks via `onHook`.
+  - The `addUpdate`-with-status=resolved path previously emitted BOTH `incident.updated` and `incident.resolved`; it now fires only `incident.resolved` (the deriver classifies a transition-to-resolved as a resolution). Automations meant to react to a resolution should use the `incident.resolved` trigger, not `incident.updated`.
+  - NARROWING: `incident.updated` now fires only on a change to the REACTIVE state (`status`, `severity`, or affected `systemIds`). A comment-only `addUpdate` (no status change) no longer fires `incident.updated` (the posted message is not reactive entity state). Re-author any automation that needed to react to a comment-only update against a different signal.
+  - The `incident.create` automation ACTION path now drives its write through `handle.mutate`, so an action-created incident is now reactive - it emits `incident.created` and other automations can trigger on it. Previously the action path created incidents silently (no lifecycle event). A dedupe REUSE still emits nothing (the open incident is unchanged).
+
+- b995afb: Restore the documented domain payload fields on entity-driven automation triggers.
+
+  Migrated triggers declare domain-named `payloadSchema`s (incident `incidentId`; health `systemId` / `previousStatus`; catalog `systemId` / `changedFields`; dependency `dependencyId`), but Stage-2 dispatch built `trigger.payload` from the generic entity-change shape (`{ kind, id, prev, next, delta, ...next }`). Operator filters and templates reading `trigger.payload.incidentId` / `.systemId` / `.previousStatus` silently resolved to `undefined` — a regression vs the legacy hook payloads.
+
+  Changes:
+
+  - `@checkstack/automation-backend`: `registerChangeDeriver` now accepts an optional per-kind `toPayload(changed) => Record<string, unknown>` mapper (at most one per kind; a second distinct mapper throws). Stage-2's `changedToPayload` uses the registered mapper to build `trigger.payload` so it matches the kind's declared `payloadSchema`, falling back to the generic change shape for kinds without a mapper. New exported type `EntityChangePayloadMapper`.
+  - `@checkstack/incident-backend`, `@checkstack/healthcheck-backend`, `@checkstack/catalog-backend`, `@checkstack/dependency-backend`: implement and register a `toPayload` for each entity-driven kind so `trigger.payload` carries the legacy domain keys again.
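Editor's sketch (not part of the package): the per-kind `toPayload` rule from the list above (at most one mapper per kind, a second distinct mapper throws, generic fallback otherwise). `PayloadMapperRegistry` and the `EntityChanged` shape are hypothetical; the real `registerChangeDeriver` signature is not shown in this diff, and the generic shape's `delta` field is left out for brevity.

```typescript
// Simplified stand-in for the generic entity-change shape.
interface EntityChanged {
  kind: string;
  id: string;
  prev: Record<string, unknown> | null;
  next: Record<string, unknown> | null;
}

type PayloadMapper = (changed: EntityChanged) => Record<string, unknown>;

class PayloadMapperRegistry {
  private mappers = new Map<string, PayloadMapper>();

  register(kind: string, toPayload: PayloadMapper): void {
    const existing = this.mappers.get(kind);
    // Re-registering the same function is a no-op; a second DISTINCT mapper throws.
    if (existing && existing !== toPayload) {
      throw new Error(`A different payload mapper is already registered for "${kind}"`);
    }
    this.mappers.set(kind, toPayload);
  }

  changedToPayload(changed: EntityChanged): Record<string, unknown> {
    const mapper = this.mappers.get(changed.kind);
    if (mapper) return mapper(changed);
    // Fallback: the generic change shape with `next` spread on top.
    return {
      kind: changed.kind,
      id: changed.id,
      prev: changed.prev,
      next: changed.next,
      ...(changed.next ?? {}),
    };
  }
}

// Usage: restore the legacy domain key `incidentId` for the incident kind.
const registry = new PayloadMapperRegistry();
registry.register("incident", (changed) => ({
  incidentId: changed.id,
  ...(changed.next ?? {}),
}));
```

With the mapper registered, a filter reading `trigger.payload.incidentId` resolves again instead of evaluating to `undefined`.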
+
+  Descriptive incident payload fields not derivable from the reactive entity state (`title`, `description`, `createdAt`, `resolvedAt`) are now OPTIONAL on the incident trigger `payloadSchema`s — they were always absent from an entity-driven payload.
+
+### Patch Changes
+
+- 270ef29: Fix suspend/resume durability + complete the run-wide secret-masking guarantee.
+
+  A panel review confirmed several defects in the automation dispatch engine's suspend/resume durability and in the run-wide masking choke point. These survived because the unit suite stubbed the seam under test; the fixes ship with tests that exercise the real suspend / sweep / resume paths.
+
+  Suspend/resume durability:
+
+  - **Stalled sweeper no longer re-runs intentional waits.** `findStalledRunIds` now joins `automation_runs` and returns only `status = 'running'` runs, and suspend-finalisation no longer clobbers the run's `lastActionPath` checkpoint to `null`. Previously any wait longer than the stale window (>60s) was re-walked from the top every sweep cycle, re-firing pre-wait side effects and leaking wait locks. The wait-aware sweeps now also run before the stalled-run sweep.
+  - **Stalled recovery refuses a run holding a live wait lock.** `recoverStalledRun` now only recovers a genuinely-`running` run with no wait lock; a crash-mid-wait recovery is left to the wait/resume paths instead of re-walking from the top and creating a duplicate lock + duplicate delay job.
+  - **Cancelled runs can no longer resurrect.** `resumeRun` guards on `status === 'waiting'` (mirroring `checkWaitUntil`) and drops any stale lock for a non-waiting run, so `wakeWaitingRuns` / delay-expiry / a racing queue job can't wake a cancelled or terminal run. `cancelActiveRuns` (restart mode) now deletes the cancelled runs' wait locks + run-state in the same operation.
+  - **Concurrency check-then-create is serialized.** The `mode` check + `createRun` now run under a transaction-scoped advisory lock keyed on `(automationId, scope)`, so two concurrent fires can't both pass a `single`-mode "no active run" check and double-run.
+
+  Masking guarantee (now genuinely covers scope + artifacts):
+
+  - **The run-wide masking choke point now also masks the durable scope snapshot and produced artifacts.** The `RunSecretRegistry` is threaded into `RunStateStore.upsert` (masks `scopeSnapshot`) and `ArtifactStore.record` (masks `data`) so a resolved connection credential threaded into `scope.variables` or surfaced into an artifact is redacted before persist - and therefore cannot reach a read-only user via `getRunScopeForReplay`. **GUARANTEE CHANGE**: run-wide masking now covers step output, run error, scope snapshot, and artifact data for every action.
+  - **`testConnection` / `testProviderConnection` mask provider errors.** These RPCs run outside a dispatch run, so they build a per-call mask set from the resolved/submitted connection config and run any provider error through it before returning, so a provider error echoing a token can't cross back to the browser.
+  - **Short secrets surface a warning.** `setSecret` now warns when a value is shorter than `MIN_MASKABLE_LENGTH` (4) that it cannot be auto-redacted (the threshold is intentionally not lowered).
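Editor's sketch (not part of the package): a toy version of the masking choke point described above. `RunSecretRegistrySketch`, the `***` placeholder, and the warning text are hypothetical; only the name `setSecret` and the `MIN_MASKABLE_LENGTH` value of 4 come from the changelog. The real registry's API may differ.

```typescript
const MIN_MASKABLE_LENGTH = 4; // threshold named in the changelog

class RunSecretRegistrySketch {
  private secrets: string[] = [];
  readonly warnings: string[] = [];

  setSecret(name: string, value: string): void {
    if (value.length < MIN_MASKABLE_LENGTH) {
      // Too short to redact safely: masking it would mangle unrelated text.
      this.warnings.push(
        `Secret "${name}" is shorter than ${MIN_MASKABLE_LENGTH} characters and cannot be auto-redacted`,
      );
      return;
    }
    if (this.secrets.indexOf(value) === -1) this.secrets.push(value);
  }

  // Masks registered secrets inside strings, arrays, and plain objects.
  mask<T>(value: T): T {
    return this.maskUnknown(value) as T;
  }

  private maskUnknown(value: unknown): unknown {
    if (typeof value === "string") {
      // Longest first, so a secret containing another is redacted whole.
      const ordered = [...this.secrets].sort((a, b) => b.length - a.length);
      let out = value;
      for (const secret of ordered) out = out.split(secret).join("***");
      return out;
    }
    if (Array.isArray(value)) return value.map((v) => this.maskUnknown(v));
    if (value !== null && typeof value === "object") {
      const source = value as Record<string, unknown>;
      const out: Record<string, unknown> = {};
      for (const key of Object.keys(source)) out[key] = this.maskUnknown(source[key]);
      return out;
    }
    return value;
  }
}
```

Running the same `mask` over step output, run errors, scope snapshots, and artifact data before persisting is the "choke point" idea: one function, applied at every write.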
+
+  Internal:
+
+  - `@checkstack/backend-api`: `withXactLock`'s `fn` now receives the transaction handle `tx` so a critical section can run on the locked connection; the doc clarifies why running on the pool inside the lock window is still safe. The incident dedup caller's comment is corrected accordingly. `RunStore` gains `findWaitLocksByRun`.
+
+- b995afb: Extract a shared `withEntityWrite` / `withEntityRemove` guard for PLUGIN-BACKED (Model B) reactive entities and refactor the per-domain copies onto it.
+
+  Every plugin-backed domain (incident, catalog, dependency, maintenance, slo, satellite) reimplemented the same "no handle wired → run the plugin write directly; handle wired → route through `handle.mutate` / `handle.remove`" guard, varying only in the id-key name. `@checkstack/automation-backend` now exports `withEntityWrite` / `withEntityRemove` (from the entity barrel) and each domain's thin, well-named wrappers (`writeIncidentEntity`, `writeMaintenanceEntity`, satellite's `mirror`, …) delegate to it, so the branch lives in exactly one place. Behavior is unchanged.
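Editor's sketch (not part of the package): the shape of the shared guard described above. `withEntityWriteSketch` and `EntityHandleSketch` are hypothetical; the real `withEntityWrite` signature is not shown in this diff. The `{ id, apply }` argument to `mutate` follows the mock in `automations.test.ts`, and the `opts: { runId }` field follows the `writeIncidentEntity` call quoted earlier in the changelog.

```typescript
// Minimal stand-in for a reactive entity handle.
interface EntityHandleSketch<S> {
  mutate(input: {
    id: string;
    opts: { runId?: string };
    apply: () => Promise<S>;
  }): Promise<S>;
}

// One guard for every plugin-backed domain:
// - no handle wired: run the plugin write directly;
// - handle wired: route the same write through `handle.mutate`.
async function withEntityWriteSketch<S>(args: {
  handle: EntityHandleSketch<S> | undefined;
  id: string;
  opts?: { runId?: string };
  apply: () => Promise<S>;
}): Promise<S> {
  const { handle, id, opts = {}, apply } = args;
  if (!handle) return apply();
  return handle.mutate({ id, opts, apply });
}

// A domain wrapper then only renames the id key (hypothetical example).
function writeIncidentEntitySketch<S>(args: {
  handle: EntityHandleSketch<S> | undefined;
  incidentId: string;
  opts?: { runId?: string };
  apply: () => Promise<S>;
}): Promise<S> {
  return withEntityWriteSketch({
    handle: args.handle,
    id: args.incidentId,
    opts: args.opts ?? {},
    apply: args.apply,
  });
}
```

Keeping the branch in one helper means a domain cannot route writes through the handle while forgetting the unwired fallback.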
+
+  `writeHealthEntity` (healthcheck-backend) is intentionally NOT migrated onto the helper — it is genuinely bespoke (closure-captured durable state, distinct rethrow-vs-fail-soft branches, a per-system serializer, and it returns the computed state). SLO keeps its fail-soft `onError` wrapper around the shared guard.
+
+- Updated dependencies [270ef29]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [270ef29]
+- Updated dependencies [b995afb]
+- Updated dependencies [270ef29]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [270ef29]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [270ef29]
+- Updated dependencies [b995afb]
+- Updated dependencies [b995afb]
+  - @checkstack/backend-api@0.19.0
+  - @checkstack/automation-backend@0.3.0
+  - @checkstack/automation-common@0.3.0
+  - @checkstack/catalog-backend@1.3.0
+  - @checkstack/integration-backend@0.3.0
+  - @checkstack/cache-api@0.3.7
+  - @checkstack/command-backend@0.1.32
+  - @checkstack/cache-utils@0.2.12
+
 ## 1.3.0
 
 ### Minor Changes
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@checkstack/incident-backend",
-  "version": "1.
+  "version": "1.4.0",
   "license": "Elastic-2.0",
   "type": "module",
   "main": "src/index.ts",
@@ -14,29 +14,29 @@
     "lint:code": "eslint . --max-warnings 0"
   },
   "dependencies": {
-    "@checkstack/backend-api": "0.
-    "@checkstack/cache-api": "0.3.
-    "@checkstack/cache-utils": "0.2.
-    "@checkstack/incident-common": "1.3.
-    "@checkstack/catalog-common": "2.2.
-    "@checkstack/catalog-backend": "1.
-    "@checkstack/notification-common": "1.2.
-    "@checkstack/auth-common": "0.7.
-    "@checkstack/command-backend": "0.1.
-    "@checkstack/signal-common": "0.2.
-    "@checkstack/integration-backend": "0.
-    "@checkstack/integration-common": "0.
-    "@checkstack/automation-backend": "0.
-    "@checkstack/automation-common": "0.
-    "@checkstack/common": "0.
+    "@checkstack/backend-api": "0.18.0",
+    "@checkstack/cache-api": "0.3.6",
+    "@checkstack/cache-utils": "0.2.11",
+    "@checkstack/incident-common": "1.3.1",
+    "@checkstack/catalog-common": "2.2.3",
+    "@checkstack/catalog-backend": "1.2.0",
+    "@checkstack/notification-common": "1.2.1",
+    "@checkstack/auth-common": "0.7.2",
+    "@checkstack/command-backend": "0.1.31",
+    "@checkstack/signal-common": "0.2.5",
+    "@checkstack/integration-backend": "0.2.0",
+    "@checkstack/integration-common": "0.6.0",
+    "@checkstack/automation-backend": "0.2.0",
+    "@checkstack/automation-common": "0.2.0",
+    "@checkstack/common": "0.12.0",
     "drizzle-orm": "^0.45.0",
     "zod": "^4.2.1",
     "@orpc/server": "^1.13.2"
   },
   "devDependencies": {
     "@checkstack/drizzle-helper": "0.0.5",
-    "@checkstack/scripts": "0.3.
-    "@checkstack/test-utils-backend": "0.1.
+    "@checkstack/scripts": "0.3.4",
+    "@checkstack/test-utils-backend": "0.1.31",
     "@checkstack/tsconfig": "0.0.7",
     "@types/bun": "^1.0.0",
     "drizzle-kit": "^0.31.10",
package/src/automations.test.ts
CHANGED
@@ -5,16 +5,35 @@
  * `core/automation-backend` cover registration validity.
  */
 import { describe, it, expect, mock } from "bun:test";
+import { SYSTEM_ACTOR } from "@checkstack/common";
 import { createMockLogger } from "@checkstack/test-utils-backend";
 
 import { createIncidentActions } from "./automations";
+import {
+  deriveIncidentTriggerEvents,
+  INCIDENT_TRIGGER_EVENTS,
+} from "./incident-entity";
 import type { IncidentService } from "./service";
 
+/**
+ * A default existing incident the stubbed `getIncident` returns. The
+ * status-flipping actions (resolve / add_update / update_status) now route
+ * through the reactive entity, which re-reads post-write state via
+ * `getIncident`, so the stub must answer it.
+ */
+const DEFAULT_INCIDENT = {
+  id: "INC-1",
+  status: "investigating" as const,
+  severity: "critical" as const,
+  systemIds: ["sys-1"],
+};
+
 const makeServiceStub = (overrides: Partial<IncidentService> = {}) =>
   ({
     createIncident: mock(),
     resolveIncident: mock(),
     addUpdate: mock(),
+    getIncident: mock(async () => DEFAULT_INCIDENT),
     ...overrides,
   }) as unknown as IncidentService;
 
@@ -59,20 +78,212 @@ describe("incident automation actions", () => {
       expect((result.artifact as { incidentId: string }).incidentId).toBe(
         "INC-1",
       );
-
-
-
+      // The action now reserves an id up front and passes it (with no user)
+      // so the reactive `incident` entity can key on it and snapshot a null
+      // `prev` before the insert (§10.1).
+      expect(service.createIncident).toHaveBeenCalledWith(
+        {
+          title: "DB down",
+          description: undefined,
+          severity: "critical",
+          systemIds: ["sys-1"],
+          initialMessage: undefined,
+          suppressNotifications: false,
+        },
+        undefined,
+        expect.any(String),
+      );
+    });
+
+    it("dedupe_open_for_system reuses an existing open incident on the system", async () => {
+      const existing = {
+        id: "INC-OPEN",
+        status: "investigating",
         severity: "critical",
         systemIds: ["sys-1"],
-
-
+      };
+      const service = makeServiceStub({
+        // The action now delegates the dedup-serialized find-then-create to
+        // the service (which wraps it in an advisory lock).
+        createIncidentDedupedForSystem: mock(async () => ({
+          incident: existing,
+          reused: true,
+        })) as unknown as IncidentService["createIncidentDedupedForSystem"],
+        createIncident: mock() as unknown as IncidentService["createIncident"],
+      });
+      const [createAction] = createIncidentActions({ service });
+      const result = await createAction.execute({
+        ...actionContext,
+        config: {
+          title: "DB down",
+          severity: "critical",
+          systemIds: ["sys-1"],
+          suppressNotifications: false,
+          dedupe_open_for_system: true,
+        } as never,
+      });
+      expect(result.success).toBe(true);
+      expect((result.artifact as { incidentId: string }).incidentId).toBe(
+        "INC-OPEN",
+      );
+      // Reused via the dedup method — no direct createIncident call.
+      expect(service.createIncident).not.toHaveBeenCalled();
+      expect(service.createIncidentDedupedForSystem).toHaveBeenCalledTimes(1);
+    });
+
+    it("dedupe_open_for_system creates when no open incident exists", async () => {
+      const created = {
+        id: "INC-NEW",
+        status: "investigating",
+        severity: "critical",
+        systemIds: ["sys-1"],
+      };
+      const service = makeServiceStub({
+        createIncidentDedupedForSystem: mock(async () => ({
+          incident: created,
+          reused: false,
+        })) as unknown as IncidentService["createIncidentDedupedForSystem"],
+      });
+      const [createAction] = createIncidentActions({ service });
+      const result = await createAction.execute({
+        ...actionContext,
+        config: {
+          title: "DB down",
+          severity: "critical",
+          systemIds: ["sys-1"],
+          suppressNotifications: false,
+          dedupe_open_for_system: true,
+        } as never,
+      });
+      expect((result.artifact as { incidentId: string }).incidentId).toBe(
+        "INC-NEW",
+      );
+      expect(service.createIncidentDedupedForSystem).toHaveBeenCalledTimes(1);
+    });
+
+    it("without the flag always creates (no dedup lookup)", async () => {
+      const service = makeServiceStub({
+        findActiveIncidentForSystem: mock(
+          async () => ({
+            id: "INC-OPEN",
+            status: "investigating",
+            severity: "critical",
+            systemIds: ["sys-1"],
+          }),
+        ) as unknown as IncidentService["findActiveIncidentForSystem"],
+        createIncident: mock(
+          async () => ({
+            id: "INC-NEW",
+            status: "investigating",
+            severity: "critical",
+            systemIds: ["sys-1"],
+          }),
+        ) as unknown as IncidentService["createIncident"],
+      });
+      const [createAction] = createIncidentActions({ service });
+      const result = await createAction.execute({
+        ...actionContext,
+        config: {
+          title: "DB down",
+          severity: "critical",
+          systemIds: ["sys-1"],
+          suppressNotifications: false,
+          // dedupe_open_for_system omitted (defaults false)
+        } as never,
+      });
+      expect((result.artifact as { incidentId: string }).incidentId).toBe(
+        "INC-NEW",
+      );
+      expect(service.findActiveIncidentForSystem).not.toHaveBeenCalled();
+    });
+
+    // 6(a): an action-created incident is now reactive — the create runs
+    // through `handle.mutate`, so the deriver fires `incident.created`.
+    it("drives the create through handle.mutate (action-created incident is reactive)", async () => {
+      const created = {
+        id: "INC-NEW",
+        status: "investigating" as const,
+        severity: "critical" as const,
+        systemIds: ["sys-1"],
+      };
+      const service = makeServiceStub({
+        createIncident: mock(
+          async () => created,
+        ) as unknown as IncidentService["createIncident"],
+      });
+      const mutate = mock(
+        async (input: { id: string; apply: () => Promise<unknown> }) =>
+          input.apply(),
+      );
+      const handle = { kind: "incident", mutate } as never;
+      const [createAction] = createIncidentActions({
+        service,
+        getIncidentEntity: () => handle,
+      });
+      await createAction.execute({
+        ...actionContext,
+        config: {
+          title: "DB down",
+          severity: "critical",
+          systemIds: ["sys-1"],
+          suppressNotifications: false,
+        } as never,
+      });
+      // The create was routed through the entity handle (reactive), keyed on
+      // the reserved id passed to the service create.
+      expect(mutate).toHaveBeenCalledTimes(1);
+      const mutateArg = mutate.mock.calls[0]![0] as { id: string };
+      expect(service.createIncident).toHaveBeenCalledWith(
+        expect.anything(),
+        undefined,
+        mutateArg.id,
+      );
+    });
+
+    it("dedupe reuse drives NO handle.mutate (no duplicate incident.created)", async () => {
+      const existing = {
+        id: "INC-OPEN",
+        status: "investigating" as const,
+        severity: "critical" as const,
+        systemIds: ["sys-1"],
+      };
+      const service = makeServiceStub({
+        createIncidentDedupedForSystem: mock(async () => ({
+          incident: existing,
+          reused: true,
+        })) as unknown as IncidentService["createIncidentDedupedForSystem"],
+      });
+      const mutate = mock(
+        async (input: { id: string; apply: () => Promise<unknown> }) =>
+          input.apply(),
+      );
+      const handle = { kind: "incident", mutate } as never;
+      const [createAction] = createIncidentActions({
+        service,
+        getIncidentEntity: () => handle,
+      });
+      await createAction.execute({
+        ...actionContext,
+        config: {
+          title: "DB down",
+          severity: "critical",
+          systemIds: ["sys-1"],
+          suppressNotifications: false,
+          dedupe_open_for_system: true,
+        } as never,
       });
+      // A reused incident is unchanged → no entity write at all.
+      expect(mutate).not.toHaveBeenCalled();
     });
   });
 
   describe("incident.resolve", () => {
     it("returns failure when the incident doesn't exist", async () => {
       const service = makeServiceStub({
+        // The existence guard (re-read before the driven write) sees no row.
+        getIncident: mock(
+          async () => undefined,
+        ) as unknown as IncidentService["getIncident"],
         resolveIncident: mock(
           async () => undefined,
         ) as unknown as IncidentService["resolveIncident"],
@@ -85,6 +296,8 @@ describe("incident automation actions", () => {
       });
       expect(result.success).toBe(false);
       expect(result.error).toMatch(/not found/i);
+      // Guard short-circuits before attempting the resolve.
+      expect(service.resolveIncident).not.toHaveBeenCalled();
     });
 
     it("calls service.resolveIncident on the happy path", async () => {
@@ -108,6 +321,73 @@ describe("incident automation actions", () => {
       expect(result.success).toBe(true);
       expect(service.resolveIncident).toHaveBeenCalledWith("INC-1", "Fixed");
     });
+
+    // 6(b) regression: an action-driven resolve must route through the reactive
+    // entity (like the RPC router) so it appends an `entity_transitions` row,
+    // emits `ENTITY_CHANGED` (waking `wait_until`), and fires the
+    // `incident.resolved` deriver — not call the service directly.
+    it("routes the resolve through handle.mutate (transition + incident.resolved deriver)", async () => {
+      const resolved = {
+        id: "INC-1",
+        status: "resolved" as const,
+        severity: "critical" as const,
+        systemIds: ["sys-1"],
+      };
+      const service = makeServiceStub({
+        // `prev` (before resolve) for the deriver assertion below.
+        getIncident: mock(async () => ({
+          id: "INC-1",
+          status: "investigating" as const,
+          severity: "critical" as const,
+          systemIds: ["sys-1"],
+        })) as unknown as IncidentService["getIncident"],
+        resolveIncident: mock(
+          async () => resolved,
+        ) as unknown as IncidentService["resolveIncident"],
+      });
+      const mutate = mock(
+        async (input: {
+          id: string;
+          opts?: { runId?: string };
+          apply: () => Promise<unknown>;
+        }) => input.apply(),
+      );
+      const handle = { kind: "incident", mutate } as never;
+      const resolveAction = createIncidentActions({
+        service,
+        getIncidentEntity: () => handle,
+      })[1];
+
+      const result = await resolveAction.execute({
+        ...actionContext,
+        config: { incidentId: "INC-1", message: "Fixed" } as never,
+      });
+
+      expect(result.success).toBe(true);
+      // The write was driven through the entity handle, keyed on the incident
+      // id, with the dispatch `runId` for secret masking.
+      expect(mutate).toHaveBeenCalledTimes(1);
+      const mutateArg = mutate.mock.calls[0]![0] as {
+        id: string;
+        opts?: { runId?: string };
+      };
+      expect(mutateArg.id).toBe("INC-1");
+      expect(mutateArg.opts?.runId).toBe("run-1");
+
+      // The post-write reactive state is `resolved` — feeding the prev→next
+      // change through the deriver fires `incident.resolved` (the wake/route).
+      const events = deriveIncidentTriggerEvents({
+        kind: "incident",
+        id: "INC-1",
+        prev: { status: "investigating", severity: "critical", systemIds: ["sys-1"] },
+        next: { status: "resolved", severity: "critical", systemIds: ["sys-1"] },
+        delta: { status: "resolved" },
+        changedFields: ["status"],
+        actor: SYSTEM_ACTOR,
+        occurredAt: new Date().toISOString(),
+      });
+      expect(events).toEqual([INCIDENT_TRIGGER_EVENTS.resolved]);
+    });
   });
 
   describe("incident.add_update", () => {
@@ -169,4 +449,75 @@ describe("incident automation actions", () => {
       });
     });
   });
+
+  describe("incident artifact (produces / consumes)", () => {
+    it("incident.create declares produces: incident", () => {
+      const [createAction] = createIncidentActions({
+        service: makeServiceStub(),
+      });
+      expect(createAction.produces).toBe("incident");
+    });
+
+    it("incident.resolve consumes the upstream incident artifact when incidentId is omitted", async () => {
+      const resolved = {
+        id: "INC-9",
+        status: "resolved",
+        severity: "critical",
+        systemIds: ["sys-1"],
+      };
+      const service = makeServiceStub({
+        resolveIncident: mock(
+          async () => resolved,
+        ) as unknown as IncidentService["resolveIncident"],
+      });
+      const resolveAction = createIncidentActions({ service })[1];
+      expect(resolveAction.consumes).toEqual(["incident"]);
+
+      const result = await resolveAction.execute({
+        ...actionContext,
+        // No incidentId in config — falls back to the consumed artifact.
+        config: { message: "recovered" } as never,
+        consumedArtifacts: {
+          incident: { incidentId: "INC-9", status: "investigating" },
+        },
+      });
+      expect(result.success).toBe(true);
+      expect(service.resolveIncident).toHaveBeenCalledWith("INC-9", "recovered");
+    });
+
+    it("incident.resolve config incidentId takes priority over the artifact", async () => {
+      const service = makeServiceStub({
+        resolveIncident: mock(
+          async () => ({
+            id: "INC-CONFIG",
+            status: "resolved",
+            severity: "high",
+            systemIds: [],
+          }),
+        ) as unknown as IncidentService["resolveIncident"],
+      });
+      const resolveAction = createIncidentActions({ service })[1];
+      await resolveAction.execute({
+        ...actionContext,
+        config: { incidentId: "INC-CONFIG" } as never,
+        consumedArtifacts: { incident: { incidentId: "INC-ARTIFACT" } },
+      });
+      expect(service.resolveIncident).toHaveBeenCalledWith(
+        "INC-CONFIG",
+        undefined,
+      );
+    });
+
+    it("incident.resolve fails clearly when neither config nor artifact has an id", async () => {
+      const service = makeServiceStub();
+      const resolveAction = createIncidentActions({ service })[1];
+      const result = await resolveAction.execute({
+        ...actionContext,
+        config: {} as never,
+        consumedArtifacts: {},
+      });
+      expect(result.success).toBe(false);
+      expect(service.resolveIncident).not.toHaveBeenCalled();
+    });
+  });
 });