@checkstack/slo-backend 0.6.1 → 0.7.1

package/CHANGELOG.md CHANGED
@@ -1,5 +1,151 @@
  # @checkstack/slo-backend
 
+ ## 0.7.1
+
+ ### Patch Changes
+
+ - Updated dependencies [13373ce]
+ - @checkstack/common@0.14.0
+ - @checkstack/backend-api@0.21.1
+ - @checkstack/cache-api@0.3.10
+ - @checkstack/queue-api@0.3.10
+ - @checkstack/ai-backend@0.1.1
+ - @checkstack/automation-backend@0.5.1
+ - @checkstack/catalog-backend@1.4.1
+ - @checkstack/catalog-common@2.3.1
+ - @checkstack/command-backend@0.2.1
+ - @checkstack/dependency-common@1.2.1
+ - @checkstack/gitops-backend@0.5.1
+ - @checkstack/gitops-common@0.6.1
+ - @checkstack/healthcheck-backend@1.6.1
+ - @checkstack/healthcheck-common@1.5.1
+ - @checkstack/signal-common@0.2.7
+ - @checkstack/slo-common@0.5.1
+ - @checkstack/cache-utils@0.2.15
+
+ ## 0.7.0
+
+ ### Minor Changes
+
+ - 9dcc848: Plugin-owned AI tools: every domain plugin contributes its own AI tools (chat assistant + automation AI action), and `ai-backend` is platform-only.
+
+ Every plugin-specific AI tool is owned by the plugin whose domain it acts on, registered via that plugin's own `aiToolExtensionPoint` / `aiToolProjectionExtensionPoint` from its init - the same path an external plugin author uses. `ai-backend` no longer imports or depends on any capability plugin's `*-common`; the dependency direction is strictly plugin -> ai-platform. Pure helpers (`computeFieldDiff`, capability-summary, `ScriptContextKind`) live in `@checkstack/ai-common`.
+
+ Tools shipped:
+
+ - Health checks and automations: full CRUD - `healthcheck.propose` / `automation.propose` and `*.update` (`mutate`, deep-validated) and `*.delete` (`destructive`, always confirm-gated). `healthcheck.propose`'s dry-run calls the new deep `validateConfiguration` so propose-time validation matches apply-time. Assertions are validated against the collector's result schema and the canonical operator vocabulary. Capability-catalog tools (`ai.listCapabilities`, `ai.getCapabilitySchema`), script context tools (`ai.getScriptContext`, `ai.testScript`), and notify-subscriber tools (`healthcheck.notifySystemSubscribers` / `...GroupSubscribers`).
+ - Catalog: `catalog.createSystem` / `updateSystem` / `createGroup` / `updateGroup` (`mutate`), `catalog.deleteSystem` / `deleteGroup` (`destructive`), membership tools (`mutate`), plus `catalog.listSystems` / `listGroups` read projections.
+ - Incident: `incident.create` / `update` / `addUpdate` / `resolve` / `addLink` (`mutate`), `incident.delete` / `removeLink` (`destructive`), and `incident.get` / `incident.list` read projections.
+ - Maintenance: `maintenance.create` / `update` / `addUpdate` / `close` / `addLink` (`mutate`), `maintenance.delete` / `removeLink` (`destructive`), and `maintenance.list` / `get` read projections.
+ - Read projections for SLO (`slo.listObjectives`), dependency (`dependency.list`), incident (`incident.list`), healthcheck (`healthcheck.status`), and anomaly (`anomaly.explain`), each gated by the source procedure's own access rule and routed as the principal.
+ - Documentation grounding: `ai.searchDocs` / `ai.getDoc` over a build-time bundled docs index (BM25-ish ranking), so the assistant grounds how-to answers in Checkstack's own docs offline.
+ - URL introspection: `ai.probeUrl`, an SSRF-guarded read tool the assistant uses to inspect a real endpoint before drafting a health check. Update tools compute a before -> after field diff rendered on the confirm card (approve mode) or an "Applied" card (auto mode), so a change is never silent.
+
+ `ai_analyze` automation action (automation-backend, with an editor connection picker + audited tool calls): runs a bounded AI agent on the run context as the automation's `runAs` service account, so it can never exceed that identity's permissions; destructive tools are never offered; mutating tools auto-apply through the service account's client. Produces an `automation.analysis` artifact downstream actions can branch on. The agent loop is exposed as a headless `aiAgentRunnerRef` service so automation-backend can drive it without depending on ai-backend.
+
+ `notification.notifyForSubscription` is now callable by user / application principals holding `notification.send` (previously service-only). Every tool routes through the user-scoped client, so handler-side authorization is enforced exactly as a direct UI/RPC action; the resolver gate plus the propose/apply re-check at propose AND apply are the additional authority. A systemic authz regression test asserts every registered tool falls into exactly one safe authorization category.
+
+ A new `ai_transport` enum value `automation` records the AI action's tool calls in the `ai_tool_calls` audit log. No new durable state beyond that; each tool is a thin, deterministic wrapper over an existing RPC, so every pod behaves identically.
+
+ This is a beta minor.
+
+ - 9dcc848: Make SLO downtime robust against a drifted event log (fixes "100% available yet degraded" and "ongoing downtime while every check is healthy").
+
+ SLO downtime was stored as edge-triggered open/close interval rows, so a single missed/out-of-order transition left an event open forever and read as ongoing downtime even when healthy. The fix makes live health authoritative:
+
+ - `computeStatus` is now live-health-authoritative and side-effect-free: a stored open event counts toward availability/error-budget and sets `hasOpenDowntime` only when the system is actually down right now (verified via the health callback, checked only when open events exist). A healthy system can no longer read breaching/degraded from a stale row, and this stays pure so the reactive `slo` entity can keep reading through it.
+ - Window accounting is fixed: `getDowntimeForWindow` counts the in-window portion of every overlapping interval (clamped to the window; open events run to "now" only when included), via a pure `downtime-window` helper, so an outage that began before the window is no longer dropped.
+ - Missed-recovery orphans are voided: the daily job deletes open events on currently-healthy systems (their true recovery time was never recorded). The edge-triggered close still records real downtime on normal recoveries.
+
+ Regression tests cover the window-overlap math, the live-health authority, the no-open-event fast path, and orphan voiding.
+
+ This is a beta minor.
+
+ - 9dcc848: Align workspace dependency versions and migrate React Router to v7.
+
+ BREAKING CHANGES (React Router v7): All frontend packages now depend on `react-router-dom@^7.16.0`. Previously the workspace declared four divergent ranges (`^6.20.0`, `^6.22.0`, `^7.1.1`, `^7.14.2`), which resolved both `react-router@6` and `react-router@7` into a single bundle. Everything is now unified on v7. The public imports the app uses (`BrowserRouter`, `Routes`, `Route`, `Link`, `NavLink`, `MemoryRouter`, `useNavigate`, `useParams`, `useSearchParams`, `useLocation`) are unchanged between v6 and v7, so no source rewrites were required - but any out-of-tree plugin still on react-router v6 should upgrade to v7 (see the React Router v6 -> v7 upgrade guide) to share the host's single router instance via the import map.
+
+ Other unified ranges (no API change): `react` -> `^18.3.1`, the `@orpc/*` family (`contract`, `server`, `client`, `tanstack-query`, `openapi`, `zod`) -> `^1.14.4`, and `better-auth` -> `^1.6.13`.
+
+ Removed the pre-rename `@orpc/react-query` leftover from `@checkstack/frontend-api`; its `createRouterUtils` / `RouterUtils` / `ProcedureUtils` now come from `@orpc/tanstack-query` (the package already in use).
+
+ Stale in-range runtime deps pulled up to current published versions: `hono` `^4.12.23`, `@tanstack/react-query` (+devtools) `^5.100.14`, `date-fns` `^4.4.0`, `jose` `^6.2.3`, `tar` `^7.5.16`, `semver` `^7.8.1`, `@xyflow/react` `^12.11.0`.
+
+ ### Patch Changes
+
+ - 9dcc848: Write-path hardening: post-commit side effects can no longer fail a committed write, multi-row mutations are now atomic, and retry-duplication is blocked at the database.
+
+ **Platform-level (automatic for all current and future plugins):**
+
+ - signal-backend: `SignalService` (broadcast / sendToUser / sendToUsers / sendToAuthorizedUsers) is now resilient by construction - a transient event-bus/queue failure is caught and logged instead of thrown. Real-time signals are best-effort UI nudges; the authoritative data is already committed by the time a mutation broadcasts, so a signal-transport blip must never turn a successful write into a client-visible error. Every plugin's broadcasts inherit this without per-call-site `try/catch` (which would inevitably be forgotten and regress). This mirrors `createCachedScope`, which already makes cache invalidation non-throwing - so the cache + signal halves of the "post-commit side effect fails the response" class are both closed at the platform seam. Durable side effects (events/hooks that drive automations, queue jobs) intentionally still surface failures. Documented in `developer-guide/backend/signals.md`.
+
+ **Atomic multi-write mutations (each previously committed row-by-row in autocommit, so a mid-sequence failure left partial/orphaned state):**
+
+ - slo-backend: `createObjective` now inserts the objective and its 1:1 streak row in one transaction; the post-create reconcile/status/notify steps are best-effort and can no longer fail the (committed) create.
+ - incident-backend: `createIncident`, `updateIncident`, `addUpdate`, and `resolveIncident` wrap their row + system-link + timeline writes in a transaction (no more wiped system associations on a failed re-insert, or status flips with no matching timeline entry).
+ - maintenance-backend: same for `createMaintenance`, `updateMaintenance`, `addUpdate`, `closeMaintenance`.
+ - automation-backend: `cancelRun` marks the run cancelled and tears down its wait locks + durable state in one transaction - previously a failure after the status update could leave a wait lock behind, letting a later trigger event resume an already-cancelled run.
+ - healthcheck-backend: `ingestSatelliteResult` commits the run row and its hourly-aggregate increment together (no orphaned run, no aggregate without a backing run). NOTE: this guarantees run/aggregate consistency but does not yet make a _duplicate satellite delivery_ idempotent - that needs a dedupe key on the high-volume runs table and is tracked as a follow-up.
+
+ **Retry-duplication blocked at the DB (paired with the SQLSTATE 23505 -> 409 mapping shipped separately):**
+
+ - catalog-backend: new unique indexes on `groups.name`, `environments.name` (consistent with `systems.name`), on `system_links (system_id, url)`, and on `system_contacts (system_id, user_id)` + `(system_id, email)` (NULLs are distinct, so user vs mailbox contacts don't interfere). Name uniqueness is CASE-INSENSITIVE: the three name indexes are functional `lower(name)` indexes (the existing `systems.name` index is rebuilt this way too), so "Api" and "api" collide while the stored value keeps its original casing. The systems pre-write name check (`getSystemByName`) is case-folded to match. Migration `0005` de-dupes any pre-existing rows first - names are preserved by suffixing later case-insensitive duplicates (" (2)", " (3)", ...), redundant contact/link rows are removed keeping the earliest. (Link URLs stay case-sensitive - URL paths are; contact emails are deduped exact-match.)
+ - incident-backend / maintenance-backend: unique index on `incident_links (incident_id, url)` / `maintenance_links (maintenance_id, url)`, with a de-dupe step in the migration.
+
+ **Behavior change:** creating a group/environment with a duplicate name, or attaching a duplicate contact/link, now returns `409 Conflict` instead of silently creating a duplicate. The migrations resolve existing duplicates on upgrade.
+
+ This is a beta patch.
+
+ - 9dcc848: Input-validation and error-mapping hardening found by a fuzzing pass against the built container.
+
+ - backend: a Postgres driver error caused by bad client input no longer surfaces as a `500`. The `/api` and `/rest` dispatchers now map the relevant SQLSTATE classes to the correct status - `22P02`/`22003`/`22001`/`22007` (malformed/out-of-range/over-long/bad-date value), `23502`/`23503`/`23514` (missing/dangling/check-failed) to `400`, and `23505` (unique violation) to `409` - and log them at `warn` (client mistake), not `error`. The client-facing message is generic so column/constraint names are never leaked; genuine unknown faults still log at `error` and 500. Previously a `where id = $1` with a non-uuid `$1` (or an over-long string, or a foreign-key miss in `addSystemToGroup`) reached the driver and 500'd, making routine probing look like a server outage and burying real 500s.
+ - slo-common: **fixes a stored cluster-wide DoS.** `windowDays` was accepted up to `2^53`, but the SLO engine derives window boundaries with `Date(now - windowDays * 86_400_000)` - a large value overflows past the max representable `Date` and yields `Invalid Date`. That objective committed fine, then every subsequent read of the system's objectives threw `RangeError: Invalid time value` during serialization (a 500 readable by anyone with SLO read access, on any pod). `windowDays` is now bounded to 1..3650 days at the contract, the GitOps `kind: SLO` spec, and the update path via a single shared `SloWindowDaysSchema`, so the poison row can never be created.
+ - slo-common + healthcheck-common: SLO `getDailySnapshots` and the healthcheck history endpoints (`getHistory`, `getDetailedHistory`, `getAggregatedHistory`, `getDetailedAggregatedHistory`, `getRunsForAnalysis`) declared their `startDate`/`endDate` params as `z.date()`, which a `/rest/...` string param can never satisfy - so those endpoints 400'd on the entire REST surface. They now use `z.coerce.date()`, accepting both the REST string shape and the native RPC `Date`.
+ - healthcheck-common: `intervalSeconds` was `z.number().min(1)` with no `.int()` and no upper bound, so a fractional or out-of-range value reached the DB and failed at insert (the column is a 32-bit int). It is now `.int().min(1).max(2_592_000)` (1 second .. 30 days), applied to both create and update (the update schema is the create partial).
+ - catalog-common: system/group/environment names were bare `z.string()` (environment was `.min(1)` only), so empty, whitespace-only, and 100KB+ names reached the DB - the huge ones surfaced as 500s when parameter binding blew up. Names are now `trim().min(1).max(200)` via a shared schema.
+
+ **BREAKING:** `getSystemContacts` is now `userType: "authenticated"` (was `"public"`). System contacts carry PII (user id, name, email); the public read leaked them to anonymous status-page visitors. Anonymous callers now receive `401` for this one endpoint; the system detail page already renders "No contacts assigned" for anonymous viewers, so the UI degrades gracefully. All other catalog reads remain public.
+
+ - catalog-frontend: the system detail page skips the `getSystemContacts` request entirely for anonymous viewers (it would now `401`) and falls back to the empty state.
+
+ This is a beta release: the breaking contact-visibility change ships as a minor bump per the beta versioning policy, not a major.
+
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - Updated dependencies [9dcc848]
+ - @checkstack/ai-backend@0.1.0
+ - @checkstack/backend-api@0.21.0
+ - @checkstack/healthcheck-backend@1.6.0
+ - @checkstack/healthcheck-common@1.5.0
+ - @checkstack/automation-backend@0.5.0
+ - @checkstack/catalog-backend@1.4.0
+ - @checkstack/catalog-common@2.3.0
+ - @checkstack/common@0.13.0
+ - @checkstack/slo-common@0.5.0
+ - @checkstack/command-backend@0.2.0
+ - @checkstack/dependency-common@1.2.0
+ - @checkstack/gitops-backend@0.5.0
+ - @checkstack/gitops-common@0.6.0
+ - @checkstack/cache-api@0.3.9
+ - @checkstack/queue-api@0.3.9
+ - @checkstack/signal-common@0.2.6
+ - @checkstack/cache-utils@0.2.14
+
  ## 0.6.1
 
  ### Patch Changes
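The `windowDays` entry in the 0.7.0 notes above describes an oversized window turning into `Invalid Date` and then a `RangeError` on serialization. The following is a minimal sketch of that failure and of a 1..3650 bound, assuming the boundary math quoted in the changelog. `windowStart` and `isValidWindowDays` are hypothetical helper names used for illustration; they are not the package's `SloWindowDaysSchema`.

```typescript
const MS_PER_DAY = 86_400_000;

// Hypothetical helper mirroring the boundary math quoted in the changelog.
function windowStart(now: Date, windowDays: number): Date {
  return new Date(now.getTime() - windowDays * MS_PER_DAY);
}

// Hypothetical stand-in for the 1..3650 integer bound described above.
function isValidWindowDays(value: number): boolean {
  return Number.isInteger(value) && value >= 1 && value <= 3650;
}

const now = new Date("2026-06-04T00:00:00.000Z");

// An oversized window pushes the timestamp outside the representable Date range.
const poisoned = windowStart(now, 2 ** 53);
const poisonedIsInvalid = Number.isNaN(poisoned.getTime());

// Formatting an Invalid Date as ISO text throws a RangeError.
let serializationError: unknown = null;
try {
  poisoned.toISOString();
} catch (error) {
  serializationError = error;
}

// A value at the upper bound stays well inside the representable range.
const bounded = windowStart(now, 3650);
```

Rejecting the value at the contract keeps the bad row from ever being stored, which matters here because the failure only shows up later, on every read.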
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@checkstack/slo-backend",
-   "version": "0.6.1",
+   "version": "0.7.1",
    "license": "Elastic-2.0",
    "type": "module",
    "main": "src/index.ts",
@@ -11,33 +11,36 @@
      "typecheck": "tsgo -b",
      "generate": "drizzle-kit generate",
      "lint": "bun run lint:code",
-     "lint:code": "eslint . --max-warnings 0"
+     "lint:code": "eslint . --max-warnings 0",
+     "test": "bun test"
    },
    "dependencies": {
-     "@checkstack/backend-api": "0.18.0",
-     "@checkstack/cache-api": "0.3.6",
-     "@checkstack/cache-utils": "0.2.11",
-     "@checkstack/slo-common": "0.4.2",
-     "@checkstack/healthcheck-common": "1.3.0",
-     "@checkstack/healthcheck-backend": "1.3.0",
-     "@checkstack/dependency-common": "1.1.3",
-     "@checkstack/catalog-common": "2.2.3",
-     "@checkstack/catalog-backend": "1.2.0",
-     "@checkstack/command-backend": "0.1.31",
-     "@checkstack/signal-common": "0.2.5",
-     "@checkstack/automation-backend": "0.2.0",
-     "@checkstack/gitops-backend": "0.3.7",
-     "@checkstack/gitops-common": "0.4.2",
-     "@checkstack/common": "0.12.0",
-     "@checkstack/queue-api": "0.3.6",
+     "@checkstack/backend-api": "0.21.0",
+     "@checkstack/ai-backend": "0.1.0",
+     "@checkstack/cache-api": "0.3.9",
+     "@checkstack/cache-utils": "0.2.14",
+     "@checkstack/slo-common": "0.5.0",
+     "@checkstack/healthcheck-common": "1.5.0",
+     "@checkstack/healthcheck-backend": "1.6.0",
+     "@checkstack/dependency-common": "1.2.0",
+     "@checkstack/catalog-common": "2.3.0",
+     "@checkstack/catalog-backend": "1.4.0",
+     "@checkstack/command-backend": "0.2.0",
+     "@checkstack/signal-common": "0.2.6",
+     "@checkstack/automation-backend": "0.5.0",
+     "@checkstack/gitops-backend": "0.5.0",
+     "@checkstack/gitops-common": "0.6.0",
+     "@checkstack/common": "0.13.0",
+     "@checkstack/queue-api": "0.3.9",
      "drizzle-orm": "^0.45.0",
      "zod": "^4.2.1",
-     "@orpc/server": "^1.13.2"
+     "@orpc/contract": "^1.14.4",
+     "@orpc/server": "^1.14.4"
    },
    "devDependencies": {
      "@checkstack/drizzle-helper": "0.0.5",
-     "@checkstack/scripts": "0.3.4",
-     "@checkstack/test-utils-backend": "0.1.31",
+     "@checkstack/scripts": "0.4.0",
+     "@checkstack/test-utils-backend": "0.1.34",
      "@checkstack/tsconfig": "0.0.7",
      "@types/bun": "^1.0.0",
      "drizzle-kit": "^0.31.10",
@@ -0,0 +1,36 @@
+ import { describe, expect, test } from "bun:test";
+ import {
+   buildProjectedTool,
+   deferredProjectionExecute,
+ } from "@checkstack/ai-backend";
+ import { sloContract, pluginMetadata } from "@checkstack/slo-common";
+
+ // Build the projected tool with the SAME inputs the plugin exposes via
+ // aiToolProjectionExtensionPoint in `index.ts`, and assert the resulting tool
+ // carries the source procedure's contract access rules - NOT the chat
+ // transport's `ai.chat.read` gate.
+ describe("slo.listObjectives projection", () => {
+   const tool = buildProjectedTool({
+     procedure: sloContract.listObjectives,
+     sourcePluginMetadata: pluginMetadata,
+     procedureKey: "listObjectives",
+     name: "slo.listObjectives",
+     description:
+       "List service-level objectives with their current status and error budget. Read-only.",
+     effect: "read",
+     execute: deferredProjectionExecute,
+   });
+
+   test("uses the overridden tool name", () => {
+     expect(tool.name).toBe("slo.listObjectives");
+   });
+
+   test("is classified as a read-only effect", () => {
+     expect(tool.effect).toBe("read");
+   });
+
+   test("inherits the source procedure's qualified read access rule", () => {
+     // qualifyAccessRuleId: `${pluginId}.${rule.id}` where rule.id = `slo.read`.
+     expect(tool.requiredAccessRules).toEqual(["slo.slo.read"]);
+   });
+ });
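The last test's comment explains why the expected rule reads `slo.slo.read`: the plugin ID is prefixed onto a rule ID that already starts with `slo.`. A minimal sketch of that composition follows, assuming only the template given in the comment. The `AccessRule` shape and this `qualifyAccessRuleId` body are hypothetical stand-ins, not the platform's implementation.

```typescript
// Hypothetical minimal rule shape; the real access-rule type may carry more fields.
interface AccessRule {
  id: string;
}

// Hypothetical stand-in for the qualification step named in the test comment:
// prefix the owning plugin's ID onto the rule's own ID.
function qualifyAccessRuleId(pluginId: string, rule: AccessRule): string {
  return `${pluginId}.${rule.id}`;
}

// The SLO read rule's ID is already `slo.read`, so the qualified form repeats `slo`.
const qualified = qualifyAccessRuleId("slo", { id: "slo.read" });
```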
@@ -0,0 +1,143 @@
+ import { describe, it, expect } from "bun:test";
+ import {
+   eventWindowSeconds,
+   aggregateWindowedDowntime,
+   type WindowedEventInput,
+ } from "./downtime-window";
+
+ const HOUR = 60 * 60 * 1000;
+ const DAY = 24 * HOUR;
+
+ // A fixed "now" and a 30-day window ending at now.
+ const now = new Date("2026-06-04T00:00:00.000Z");
+ const windowEnd = now;
+ const windowStart = new Date(now.getTime() - 30 * DAY);
+
+ const selfEvent = (
+   overrides: Partial<WindowedEventInput> = {},
+ ): WindowedEventInput => ({
+   startTime: new Date(now.getTime() - DAY),
+   endTime: null,
+   attributionType: "self",
+   upstreamSystemId: null,
+   upstreamSystemName: null,
+   ...overrides,
+ });
+
+ describe("eventWindowSeconds", () => {
+   it("counts the full window for an OPEN event that started before the window (the bug)", () => {
+     // Started 60 days ago, still ongoing. Its in-window portion is the WHOLE
+     // 30-day window - it must not be dropped just because it started earlier.
+     const seconds = eventWindowSeconds({
+       startTime: new Date(now.getTime() - 60 * DAY),
+       endTime: null,
+       windowStart,
+       windowEnd,
+       now,
+     });
+     expect(seconds).toBe((30 * DAY) / 1000);
+   });
+
+   it("counts an open event from its start when it started inside the window", () => {
+     const seconds = eventWindowSeconds({
+       startTime: new Date(now.getTime() - 2 * HOUR),
+       endTime: null,
+       windowStart,
+       windowEnd,
+       now,
+     });
+     expect(seconds).toBe((2 * HOUR) / 1000);
+   });
+
+   it("returns 0 for a closed event entirely before the window", () => {
+     const seconds = eventWindowSeconds({
+       startTime: new Date(now.getTime() - 40 * DAY),
+       endTime: new Date(now.getTime() - 35 * DAY),
+       windowStart,
+       windowEnd,
+       now,
+     });
+     expect(seconds).toBe(0);
+   });
+
+   it("clamps an event that started before the window but ended inside it", () => {
+     // Started 31 days ago, ended 29 days ago → only 1 day falls in the window.
+     const seconds = eventWindowSeconds({
+       startTime: new Date(now.getTime() - 31 * DAY),
+       endTime: new Date(now.getTime() - 29 * DAY),
+       windowStart,
+       windowEnd,
+       now,
+     });
+     expect(seconds).toBe(DAY / 1000);
+   });
+
+   it("counts the full duration of an event fully inside the window", () => {
+     const seconds = eventWindowSeconds({
+       startTime: new Date(now.getTime() - 5 * DAY),
+       endTime: new Date(now.getTime() - 5 * DAY + 3 * HOUR),
+       windowStart,
+       windowEnd,
+       now,
+     });
+     expect(seconds).toBe((3 * HOUR) / 1000);
+   });
+ });
+
+ describe("aggregateWindowedDowntime", () => {
+   it("regression: an open self outage from before the window consumes the whole window", () => {
+     // This is the exact dashboard bug: a 'Self/Ongoing' event from ~2 months
+     // ago must NOT yield ~0 consumed minutes (which rendered as 100% available
+     // while still flagged degraded).
+     const result = aggregateWindowedDowntime({
+       events: [
+         selfEvent({ startTime: new Date(now.getTime() - 60 * DAY) }),
+       ],
+       windowStart,
+       windowEnd,
+       now,
+     });
+     expect(result.selfMinutes).toBe((30 * DAY) / 1000 / 60);
+     expect(result.totalMinutes).toBe((30 * DAY) / 1000 / 60);
+   });
+
+   it("splits self vs upstream and clamps each to the window", () => {
+     const result = aggregateWindowedDowntime({
+       events: [
+         selfEvent({
+           startTime: new Date(now.getTime() - 2 * HOUR),
+           endTime: new Date(now.getTime() - 1 * HOUR),
+         }),
+         {
+           startTime: new Date(now.getTime() - 3 * HOUR),
+           endTime: new Date(now.getTime() - 2 * HOUR),
+           attributionType: "upstream",
+           upstreamSystemId: "up-1",
+           upstreamSystemName: "Upstream",
+         },
+       ],
+       windowStart,
+       windowEnd,
+       now,
+     });
+     expect(result.selfMinutes).toBe(60);
+     expect(result.upstreamMinutes).toBe(60);
+     expect(result.totalMinutes).toBe(120);
+     expect(result.entries).toHaveLength(2);
+   });
+
+   it("ignores events fully outside the window", () => {
+     const result = aggregateWindowedDowntime({
+       events: [
+         selfEvent({
+           startTime: new Date(now.getTime() - 40 * DAY),
+           endTime: new Date(now.getTime() - 35 * DAY),
+         }),
+       ],
+       windowStart,
+       windowEnd,
+       now,
+     });
+     expect(result.totalMinutes).toBe(0);
+   });
+ });
@@ -0,0 +1,130 @@
+ /**
+  * Pure window-overlap math for SLO downtime accounting.
+  *
+  * A downtime event contributes to an SLO window only for the portion of its
+  * duration that falls inside `[windowStart, windowEnd]`. Open (ongoing) events
+  * have no `endTime` and run until `now`. This is deliberately separate from any
+  * stored `durationSeconds` cache, which is not window-aware: a long outage that
+  * began before the window (the dashboard "100% available + degraded" bug) must
+  * still consume its in-window portion, and an event that straddles a window edge
+  * must be clamped, not counted in full.
+  */
+
+ export interface WindowedEventInput {
+   startTime: Date;
+   endTime: Date | null;
+   attributionType: string;
+   upstreamSystemId: string | null;
+   upstreamSystemName: string | null;
+ }
+
+ export interface WindowedDowntime {
+   totalMinutes: number;
+   selfMinutes: number;
+   upstreamMinutes: number;
+   entries: Array<{
+     attributionType: string;
+     upstreamSystemId: string | null;
+     upstreamSystemName: string | null;
+     totalMinutes: number;
+   }>;
+ }
+
+ /**
+  * Seconds of a single event that fall inside the window. Open events (no
+  * `endTime`) run to `now`. Returns 0 when the event does not overlap the window.
+  */
+ export function eventWindowSeconds({
+   startTime,
+   endTime,
+   windowStart,
+   windowEnd,
+   now,
+ }: {
+   startTime: Date;
+   endTime: Date | null;
+   windowStart: Date;
+   windowEnd: Date;
+   now: Date;
+ }): number {
+   const end = endTime ?? now;
+   const effectiveStart = Math.max(startTime.getTime(), windowStart.getTime());
+   const effectiveEnd = Math.min(end.getTime(), windowEnd.getTime());
+   const seconds = (effectiveEnd - effectiveStart) / 1000;
+   return Math.max(0, seconds);
+ }
+
+ /**
+  * Aggregate the in-window downtime across many events, split by attribution and
+  * grouped per source (self, or one bucket per upstream system).
+  */
+ export function aggregateWindowedDowntime({
+   events,
+   windowStart,
+   windowEnd,
+   now,
+ }: {
+   events: WindowedEventInput[];
+   windowStart: Date;
+   windowEnd: Date;
+   now: Date;
+ }): WindowedDowntime {
+   let totalSeconds = 0;
+   let selfSeconds = 0;
+   let upstreamSeconds = 0;
+   const bySource = new Map<
+     string,
+     {
+       attributionType: string;
+       upstreamSystemId: string | null;
+       upstreamSystemName: string | null;
+       totalSeconds: number;
+     }
+   >();
+
+   for (const event of events) {
+     const duration = eventWindowSeconds({
+       startTime: event.startTime,
+       endTime: event.endTime,
+       windowStart,
+       windowEnd,
+       now,
+     });
+     if (duration <= 0) continue;
+
+     totalSeconds += duration;
+     if (event.attributionType === "self") {
+       selfSeconds += duration;
+     } else {
+       upstreamSeconds += duration;
+     }
+
+     const key =
+       event.attributionType === "self"
+         ? "self"
+         : `upstream:${event.upstreamSystemId}`;
+     const existing = bySource.get(key);
+     if (existing) {
+       existing.totalSeconds += duration;
+     } else {
+       bySource.set(key, {
+         attributionType: event.attributionType,
+         upstreamSystemId: event.upstreamSystemId,
+         upstreamSystemName: event.upstreamSystemName,
+         totalSeconds: duration,
+       });
+     }
+   }
+
+   return {
+     totalMinutes: totalSeconds / 60,
+     selfMinutes: selfSeconds / 60,
+     upstreamMinutes: upstreamSeconds / 60,
+     entries: [...bySource.values()].map((e) => ({
+       attributionType: e.attributionType,
+       upstreamSystemId: e.upstreamSystemId,
+       upstreamSystemName: e.upstreamSystemName,
+       totalMinutes: e.totalSeconds / 60,
+     })),
+   };
+ }
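To make the clamp above concrete, here is a small worked sketch. `overlapSeconds` re-declares the same max/min clamp as `eventWindowSeconds` so the snippet runs on its own. The availability formula at the end is an assumption for illustration (downtime divided by window length); the diff does not show how the engine turns downtime into a percentage.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Same clamp as `eventWindowSeconds` in the diff, re-declared locally.
function overlapSeconds(
  start: Date,
  end: Date | null,
  windowStart: Date,
  windowEnd: Date,
  now: Date,
): number {
  const effectiveStart = Math.max(start.getTime(), windowStart.getTime());
  const effectiveEnd = Math.min((end ?? now).getTime(), windowEnd.getTime());
  return Math.max(0, (effectiveEnd - effectiveStart) / 1000);
}

const now = new Date("2026-06-04T00:00:00.000Z");
const windowStart = new Date(now.getTime() - 30 * DAY_MS);

// Closed outage that began 31 days ago and ended 29 days ago:
// only the final day falls inside the 30-day window.
const straddling = overlapSeconds(
  new Date(now.getTime() - 31 * DAY_MS),
  new Date(now.getTime() - 29 * DAY_MS),
  windowStart,
  now,
  now,
);

// Open outage that began 60 days ago: it covers the whole window.
const ongoing = overlapSeconds(
  new Date(now.getTime() - 60 * DAY_MS),
  null,
  windowStart,
  now,
  now,
);

// Assumed formula (not shown in the diff): availability = 1 - downtime / window.
const windowSeconds = (30 * DAY_MS) / 1000;
const availabilityPercent = (1 - straddling / windowSeconds) * 100;
```

With these inputs the straddling outage contributes one day (86,400 seconds) and the ongoing outage contributes the full 2,592,000 seconds, which is the case the regression test guards.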
package/src/index.ts CHANGED
@@ -1,6 +1,10 @@
  import * as schema from "./schema";
  import type { SafeDatabase } from "@checkstack/backend-api";
  import { z } from "zod";
+ import {
+   aiToolProjectionExtensionPoint,
+   deferredProjectionExecute,
+ } from "@checkstack/ai-backend";
  import {
    sloAccessRules,
    sloAccess,
@@ -200,6 +204,20 @@ export default createBackendPlugin({
        },
      });
 
+     // Expose this plugin's read-only AI projection (`slo.listObjectives`) via
+     // the AI projection extension point. ai-backend collects its routing in
+     // afterPluginsReady and never imports slo-common.
+     env.getExtensionPoint(aiToolProjectionExtensionPoint).expose({
+       procedure: sloContract.listObjectives,
+       sourcePluginMetadata: pluginMetadata,
+       procedureKey: "listObjectives",
+       name: "slo.listObjectives",
+       description:
+         "List service-level objectives with their current status and error budget. Read-only.",
+       effect: "read",
+       execute: deferredProjectionExecute,
+     });
+
      env.registerInit({
        schema,
        deps: {
package/src/router.ts CHANGED
@@ -110,25 +110,38 @@ export function createRouter({
110
110
  ),
111
111
 
112
112
  createObjective: os.createObjective.handler(
113
- async ({ input }) => {
113
+ async ({ input, context }) => {
114
+ // The objective (+ its streak row) is committed atomically here. Once it
115
+ // returns, the write is durable and the create has SUCCEEDED - the
116
+ // post-commit steps below are best-effort enrichment/notification and
117
+ // must never turn a committed create into a client-visible error.
114
118
  const objective = await service.createObjective({ input });
115
119
 
116
- // Reconcile initial state: if system is already down,
117
- // open an initial downtime event immediately
118
- await engine.reconcileObjective({ objective });
119
-
120
- const status = await engine.computeStatus({ objective });
121
120
  // Mutation invariant: db.write → cache.invalidate (await) → signals.emit.
122
- await cache.invalidateForMutation({
123
- objectiveId: objective.id,
124
- systemId: objective.systemId,
125
- });
126
- await signalService.broadcast(SLO_STATUS_CHANGED, {
127
- systemId: objective.systemId,
128
- objectiveId: objective.id,
129
- budgetRemainingPercent: status.errorBudgetRemainingPercent,
130
- isBreaching: status.isBreaching,
131
- });
121
+ // cache.invalidate* and signalService.* are already non-throwing by
122
+ // platform contract; reconcile/computeStatus are guarded here so a
123
+ // transient health-read failure can't fail the (already committed) create.
124
+ try {
125
+ // Reconcile initial state: if system is already down, open an initial
126
+ // downtime event immediately.
127
+ await engine.reconcileObjective({ objective });
128
+ const status = await engine.computeStatus({ objective });
129
+ await cache.invalidateForMutation({
130
+ objectiveId: objective.id,
131
+ systemId: objective.systemId,
132
+ });
133
+ await signalService.broadcast(SLO_STATUS_CHANGED, {
134
+ systemId: objective.systemId,
135
+ objectiveId: objective.id,
136
+ budgetRemainingPercent: status.errorBudgetRemainingPercent,
137
+ isBreaching: status.isBreaching,
138
+ });
139
+ } catch (error) {
140
+ context.logger.warn(
141
+ `createObjective: objective ${objective.id} committed, but post-create reconcile/notify failed`,
142
+ { error },
143
+ );
144
+ }
132
145
 
133
146
  return objective;
134
147
  },
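The `createObjective` change above separates the durable write from the best-effort follow-up work. A minimal sketch of that shape follows. The helper name `commitThenEnrich` and the simplified `Logger` interface are assumptions made for illustration and are not part of the package.

```typescript
// Hypothetical, simplified logger; the real handler logs via context.logger.
interface Logger {
  warn(message: string, meta?: Record<string, unknown>): void;
}

// Illustrative helper (not part of @checkstack/slo-backend): run the durable
// write first, then treat all follow-up work as best-effort.
async function commitThenEnrich<T>(
  commit: () => Promise<T>,
  postCommit: (value: T) => Promise<void>,
  logger: Logger,
): Promise<T> {
  // A failure here propagates: nothing was committed, so the caller must see it.
  const value = await commit();

  try {
    await postCommit(value);
  } catch (error) {
    // The write is already durable; log and continue instead of failing the call.
    logger.warn("post-commit step failed after a successful write", { error });
  }

  return value;
}
```

In terms of the diff, `commit` corresponds to `service.createObjective` and `postCommit` to the reconcile, status, cache-invalidation, and broadcast steps inside the `try` block.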
package/src/service.ts CHANGED
@@ -1,6 +1,7 @@
1
- import { eq, and, isNull, desc, gte, lte } from "drizzle-orm";
1
+ import { eq, and, isNull, desc, gte, lte, or } from "drizzle-orm";
2
2
  import type { SafeDatabase } from "@checkstack/backend-api";
3
3
  import * as schema from "./schema";
4
+ import { aggregateWindowedDowntime } from "./downtime-window";
4
5
  import {
5
6
  sloObjectives,
6
7
  sloDowntimeEvents,
@@ -69,29 +70,35 @@ export class SloService {
69
70
  const id = generateId();
70
71
  const now = new Date();
71
72
 
72
- await this.db.insert(sloObjectives).values({
73
- id,
74
- systemId: input.systemId,
75
-
76
- healthCheckConfigurationId: input.healthCheckConfigurationId ?? null,
77
- target: input.target,
78
- windowDays: input.windowDays,
79
- dependencyExclusion: input.dependencyExclusion ?? "strict",
80
- excludedDependencyIds: input.excludedDependencyIds ?? [],
81
- burnRateWarningPercent: input.burnRateThresholds?.warningPercent ?? 50,
82
- burnRateCriticalPercent: input.burnRateThresholds?.criticalPercent ?? 80,
83
- burnRateFastBurnMultiplier:
84
- input.burnRateThresholds?.fastBurnMultiplier ?? 5,
85
- createdAt: now,
86
- updatedAt: now,
87
- });
88
-
89
- // Create initial streak record
90
- await this.db.insert(sloStreaks).values({
91
- objectiveId: id,
92
- systemId: input.systemId,
93
- currentStreak: 0,
94
- bestStreak: 0,
73
+ // Atomic: the objective row and its 1:1 streak row must commit together.
74
+ // Without the transaction a failure on the streak insert left a committed
75
+ // objective with no streak (and the client saw an error for a write that
76
+ // partially succeeded).
77
+ await this.db.transaction(async (tx) => {
78
+ await tx.insert(sloObjectives).values({
79
+ id,
80
+ systemId: input.systemId,
81
+
82
+ healthCheckConfigurationId: input.healthCheckConfigurationId ?? null,
83
+ target: input.target,
84
+ windowDays: input.windowDays,
85
+ dependencyExclusion: input.dependencyExclusion ?? "strict",
86
+ excludedDependencyIds: input.excludedDependencyIds ?? [],
87
+ burnRateWarningPercent: input.burnRateThresholds?.warningPercent ?? 50,
88
+ burnRateCriticalPercent: input.burnRateThresholds?.criticalPercent ?? 80,
89
+ burnRateFastBurnMultiplier:
90
+ input.burnRateThresholds?.fastBurnMultiplier ?? 5,
91
+ createdAt: now,
92
+ updatedAt: now,
93
+ });
94
+
95
+ // Create initial streak record
96
+ await tx.insert(sloStreaks).values({
97
+ objectiveId: id,
98
+ systemId: input.systemId,
99
+ currentStreak: 0,
100
+ bestStreak: 0,
101
+ });
95
102
  });
96
103
 
97
104
  return (await this.getObjective({ id }))!;
@@ -306,10 +313,19 @@ export class SloService {
306
313
  objectiveId,
307
314
  windowStart,
308
315
  windowEnd,
316
+ includeOpen,
309
317
  }: {
310
318
  objectiveId: string;
311
319
  windowStart: Date;
312
320
  windowEnd: Date;
321
+ /**
322
+ * Whether to count still-open events as ongoing downtime (clamped to `now`).
323
+ * The caller decides this from the system's LIVE health: an open event is
324
+ * only real ongoing downtime if the system is currently down. When false,
325
+ * only closed intervals are counted, so a stale/orphaned open event (a
326
+ * missed-recovery row) has zero effect on the budget.
327
+ */
328
+ includeOpen: boolean;
313
329
  }): Promise<{
314
330
  totalMinutes: number;
315
331
  selfMinutes: number;
@@ -321,71 +337,53 @@ export class SloService {
321
337
  totalMinutes: number;
322
338
  }>;
323
339
  }> {
324
- // Get closed events within the window
325
- const closedEvents = await this.db
326
- .select()
327
- .from(sloDowntimeEvents)
328
- .where(
329
- and(
330
- eq(sloDowntimeEvents.objectiveId, objectiveId),
331
- gte(sloDowntimeEvents.startTime, windowStart),
332
- lte(sloDowntimeEvents.startTime, windowEnd),
333
- ),
334
- );
335
-
336
- // Also include open events (use current time as endTime for running duration)
340
+ // Closed intervals that OVERLAP the window: started on/before the window end
341
+ // AND ended on/after the window start. (`endTime >= windowStart` excludes
342
+ // NULL end times, i.e. open events, in SQL.) An ongoing outage that began
343
+ // before `windowStart` still consumes its in-window portion - it is not
344
+ // dropped just because it started earlier.
337
345
  const now = new Date();
338
- let totalSeconds = 0;
339
- let selfSeconds = 0;
340
- let upstreamSeconds = 0;
341
- const bySource = new Map<
342
- string,
343
- {
344
- attributionType: string;
345
- upstreamSystemId: string | null;
346
- upstreamSystemName: string | null;
347
- totalSeconds: number;
348
- }
349
- >();
350
-
351
- for (const event of closedEvents) {
352
- const duration =
353
- event.durationSeconds ??
354
- (((event.endTime ?? now).getTime() - event.startTime.getTime()) / 1000);
355
-
356
- totalSeconds += duration;
357
- if (event.attributionType === "self") {
358
- selfSeconds += duration;
359
- } else {
360
- upstreamSeconds += duration;
361
- }
362
-
363
- const key =
364
- event.attributionType === "self"
365
- ? "self"
366
- : `upstream:${event.upstreamSystemId}`;
367
- const existing = bySource.get(key);
368
- if (existing) {
369
- existing.totalSeconds += duration;
370
- } else {
371
- bySource.set(key, {
372
- attributionType: event.attributionType,
373
- upstreamSystemId: event.upstreamSystemId,
374
- upstreamSystemName: event.upstreamSystemName,
375
- totalSeconds: duration,
376
- });
377
- }
378
- }
379
-
380
- return {
381
- totalMinutes: totalSeconds / 60,
382
- selfMinutes: selfSeconds / 60,
383
- upstreamMinutes: upstreamSeconds / 60,
384
- entries: [...bySource.values()].map((e) => ({
385
- ...e,
386
- totalMinutes: e.totalSeconds / 60,
346
+ const startBound = lte(sloDowntimeEvents.startTime, windowEnd);
347
+ const closedOverlap = gte(sloDowntimeEvents.endTime, windowStart);
348
+ const where = includeOpen
349
+ ? and(
350
+ eq(sloDowntimeEvents.objectiveId, objectiveId),
351
+ startBound,
352
+ or(closedOverlap, isNull(sloDowntimeEvents.endTime)),
353
+ )
354
+ : and(
355
+ eq(sloDowntimeEvents.objectiveId, objectiveId),
356
+ startBound,
357
+ closedOverlap,
358
+ );
359
+
360
+ const events = await this.db.select().from(sloDowntimeEvents).where(where);
361
+
362
+ // Clamp each event to the window (open events run to `now`) and aggregate.
363
+ return aggregateWindowedDowntime({
364
+ events: events.map((event) => ({
365
+ startTime: event.startTime,
366
+ endTime: event.endTime,
367
+ attributionType: event.attributionType,
368
+ upstreamSystemId: event.upstreamSystemId,
369
+ upstreamSystemName: event.upstreamSystemName,
387
370
  })),
388
- };
371
+ windowStart,
372
+ windowEnd,
373
+ now,
374
+ });
375
+ }
376
+
377
+ /**
378
+ * Hard-delete a downtime event. Used to VOID a missed-recovery orphan: an
379
+ * open event on a system that is currently healthy, whose true recovery time
380
+ * was never recorded. We do not know the real downtime, the system is healthy,
381
+ * so the unprovable row is removed rather than counted.
382
+ */
383
+ async deleteDowntimeEvent({ id }: { id: string }): Promise<void> {
384
+ await this.db
385
+ .delete(sloDowntimeEvents)
386
+ .where(eq(sloDowntimeEvents.id, id));
389
387
  }
390
388
 
391
389
  async getRecentDowntimeEvents({
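The new query selects events that overlap the window and hands clamping to `aggregateWindowedDowntime`, whose source is not part of this diff. The sketch below shows one way such clamping could work. The names `DowntimeInterval`, `clampedSeconds`, and `sumWindowedDowntime` are hypothetical, and the real helper also builds per-source `entries`, which this sketch omits.

```typescript
// Simplified event shape for illustration; the real rows carry more fields.
interface DowntimeInterval {
  startTime: Date;
  endTime: Date | null; // null means the event is still open
  attributionType: "self" | "upstream";
}

// Seconds of one event that fall inside [windowStart, windowEnd].
// Open events are treated as running until `now`.
function clampedSeconds(
  event: DowntimeInterval,
  windowStart: Date,
  windowEnd: Date,
  now: Date,
): number {
  const effectiveEnd = event.endTime ?? now;
  const start = Math.max(event.startTime.getTime(), windowStart.getTime());
  const end = Math.min(effectiveEnd.getTime(), windowEnd.getTime());
  // An event entirely outside the window yields end < start; count it as zero.
  return Math.max(0, (end - start) / 1000);
}

// Sum the in-window portion of every event, split by attribution.
function sumWindowedDowntime(
  events: DowntimeInterval[],
  windowStart: Date,
  windowEnd: Date,
  now: Date,
): { totalMinutes: number; selfMinutes: number; upstreamMinutes: number } {
  let selfSeconds = 0;
  let upstreamSeconds = 0;
  for (const event of events) {
    const seconds = clampedSeconds(event, windowStart, windowEnd, now);
    if (event.attributionType === "self") {
      selfSeconds += seconds;
    } else {
      upstreamSeconds += seconds;
    }
  }
  return {
    totalMinutes: (selfSeconds + upstreamSeconds) / 60,
    selfMinutes: selfSeconds / 60,
    upstreamMinutes: upstreamSeconds / 60,
  };
}
```

Compared with the removed code, which selected events by `startTime` alone and added each event's full duration, this approach counts only the portion of an event that lies inside the window, so an outage that began before `windowStart` contributes just its in-window part.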
@@ -99,6 +99,7 @@ function createMockService(
99
99
  closeDowntimeEvent: mock(() =>
100
100
  Promise.resolve(createDowntimeEvent({ endTime: new Date(), durationSeconds: 60 })),
101
101
  ),
102
+ deleteDowntimeEvent: mock(() => Promise.resolve()),
102
103
  getDowntimeForWindow: mock(() =>
103
104
  Promise.resolve({
104
105
  totalMinutes: 0,
@@ -565,6 +566,230 @@ describe("SloEngine", () => {
565
566
 
566
567
  expect(status.isBreaching).toBe(true);
567
568
  });
569
+
570
+ it("should NOT flag hasOpenDowntime for an open upstream event in self-only mode", async () => {
571
+ // Regression: an open upstream event must not flip a self-only objective
572
+ // to "degraded" when no self downtime is counted — otherwise the SLO
573
+ // reads 100% available + degraded at the same time, which must not happen.
574
+ const objective = createObjective({
575
+ target: 99.9,
576
+ windowDays: 30,
577
+ dependencyExclusion: "self-only",
578
+ });
579
+ const openUpstream = createDowntimeEvent({
580
+ attributionType: "upstream",
581
+ upstreamSystemId: "up-1",
582
+ });
583
+ mockService = createMockService({
584
+ objectives: [objective],
585
+ openEvents: [openUpstream],
586
+ });
587
+ mockSignalService = createMockSignalService();
588
+ mockLogger = createMockLogger();
589
+
590
+ engine = new SloEngine({
591
+ service: mockService,
592
+ signalService: mockSignalService as never,
593
+ logger: mockLogger as never,
594
+ });
595
+
596
+ const status = await engine.computeStatus({ objective });
597
+
598
+ expect(status.currentAvailability).toBe(100);
599
+ expect(status.errorBudgetRemainingPercent).toBe(100);
600
+ // Self-only: an upstream-attributed open event is excluded from budget,
601
+ // so it must not report open (budget-relevant) downtime.
602
+ expect(status.hasOpenDowntime).toBe(false);
603
+ });
604
+
605
+ it("should flag hasOpenDowntime for an open self event in self-only mode", async () => {
606
+ const objective = createObjective({ dependencyExclusion: "self-only" });
607
+ const openSelf = createDowntimeEvent({ attributionType: "self" });
608
+ mockService = createMockService({
609
+ objectives: [objective],
610
+ openEvents: [openSelf],
611
+ });
612
+ mockSignalService = createMockSignalService();
613
+ mockLogger = createMockLogger();
614
+
615
+ engine = new SloEngine({
616
+ service: mockService,
617
+ signalService: mockSignalService as never,
618
+ logger: mockLogger as never,
619
+ });
620
+
621
+ const status = await engine.computeStatus({ objective });
622
+
623
+ expect(status.hasOpenDowntime).toBe(true);
624
+ });
625
+
626
+ it("should flag hasOpenDowntime for any open event in strict mode", async () => {
627
+ const objective = createObjective({ dependencyExclusion: "strict" });
628
+ const openUpstream = createDowntimeEvent({
629
+ attributionType: "upstream",
630
+ upstreamSystemId: "up-1",
631
+ });
632
+ mockService = createMockService({
633
+ objectives: [objective],
634
+ openEvents: [openUpstream],
635
+ });
636
+ mockSignalService = createMockSignalService();
637
+ mockLogger = createMockLogger();
638
+
639
+ engine = new SloEngine({
640
+ service: mockService,
641
+ signalService: mockSignalService as never,
642
+ logger: mockLogger as never,
643
+ });
644
+
645
+ const status = await engine.computeStatus({ objective });
646
+
647
+ expect(status.hasOpenDowntime).toBe(true);
648
+ });
649
+
650
+ it("live health is authoritative: a HEALTHY system with an open event is not degraded and excludes it from the budget", async () => {
651
+ // The dashboard "ongoing while healthy" regression: an orphaned open
652
+ // event (missed recovery) must not flip a currently-healthy SLO to
653
+ // degraded/breaching. computeStatus must ask the open path NOT to count
654
+ // open downtime when the system is healthy.
655
+ const objective = createObjective({ dependencyExclusion: "self-only" });
656
+ const openSelf = createDowntimeEvent({ attributionType: "self" });
657
+ mockService = createMockService({
658
+ objectives: [objective],
659
+ openEvents: [openSelf],
660
+ });
661
+ mockSignalService = createMockSignalService();
662
+ mockLogger = createMockLogger();
663
+
664
+ engine = new SloEngine({
665
+ service: mockService,
666
+ signalService: mockSignalService as never,
667
+ logger: mockLogger as never,
668
+ });
669
+ engine.setHealthStatusCallback(async () => ({ isHealthy: true }));
670
+
671
+ const status = await engine.computeStatus({ objective });
672
+
673
+ expect(status.hasOpenDowntime).toBe(false);
674
+ expect(mockService.getDowntimeForWindow).toHaveBeenCalledWith(
675
+ expect.objectContaining({ includeOpen: false }),
676
+ );
677
+ });
678
+
679
+ it("live health is authoritative: a DOWN system with an open event counts it and is degraded", async () => {
680
+ const objective = createObjective({ dependencyExclusion: "self-only" });
681
+ const openSelf = createDowntimeEvent({ attributionType: "self" });
682
+ mockService = createMockService({
683
+ objectives: [objective],
684
+ openEvents: [openSelf],
685
+ });
686
+ mockSignalService = createMockSignalService();
687
+ mockLogger = createMockLogger();
688
+
689
+ engine = new SloEngine({
690
+ service: mockService,
691
+ signalService: mockSignalService as never,
692
+ logger: mockLogger as never,
693
+ });
694
+ engine.setHealthStatusCallback(async () => ({ isHealthy: false }));
695
+
696
+ const status = await engine.computeStatus({ objective });
697
+
698
+ expect(status.hasOpenDowntime).toBe(true);
699
+ expect(mockService.getDowntimeForWindow).toHaveBeenCalledWith(
700
+ expect.objectContaining({ includeOpen: true }),
701
+ );
702
+ });
703
+
704
+ it("skips the health check entirely when there are no open events", async () => {
705
+ const objective = createObjective();
706
+ mockService = createMockService({ objectives: [objective] });
707
+ mockSignalService = createMockSignalService();
708
+ mockLogger = createMockLogger();
709
+
710
+ const healthCallback = mock(async () => ({ isHealthy: true }));
711
+ engine = new SloEngine({
712
+ service: mockService,
713
+ signalService: mockSignalService as never,
714
+ logger: mockLogger as never,
715
+ });
716
+ engine.setHealthStatusCallback(healthCallback);
717
+
718
+ await engine.computeStatus({ objective });
719
+
720
+ expect(healthCallback).not.toHaveBeenCalled();
721
+ expect(mockService.getDowntimeForWindow).toHaveBeenCalledWith(
722
+ expect.objectContaining({ includeOpen: false }),
723
+ );
724
+ });
725
+ });
726
+
727
+ describe("voidOrphanedDowntime", () => {
728
+ it("voids open events when the system is currently healthy (missed-recovery orphan)", async () => {
729
+ const objective = createObjective();
730
+ const orphan = createDowntimeEvent({ id: "orphan-1" });
731
+ mockService = createMockService({
732
+ objectives: [objective],
733
+ openEvents: [orphan],
734
+ });
735
+ mockSignalService = createMockSignalService();
736
+ mockLogger = createMockLogger();
737
+
738
+ engine = new SloEngine({
739
+ service: mockService,
740
+ signalService: mockSignalService as never,
741
+ logger: mockLogger as never,
742
+ });
743
+ engine.setHealthStatusCallback(async () => ({ isHealthy: true }));
744
+
745
+ await engine.voidOrphanedDowntime({ objective });
746
+
747
+ expect(mockService.deleteDowntimeEvent).toHaveBeenCalledWith({
748
+ id: "orphan-1",
749
+ });
750
+ });
751
+
752
+ it("keeps open events when the system is genuinely down", async () => {
753
+ const objective = createObjective();
754
+ const openEvent = createDowntimeEvent({ id: "evt-1" });
755
+ mockService = createMockService({
756
+ objectives: [objective],
757
+ openEvents: [openEvent],
758
+ });
759
+ mockSignalService = createMockSignalService();
760
+ mockLogger = createMockLogger();
761
+
762
+ engine = new SloEngine({
763
+ service: mockService,
764
+ signalService: mockSignalService as never,
765
+ logger: mockLogger as never,
766
+ });
767
+ engine.setHealthStatusCallback(async () => ({ isHealthy: false }));
768
+
769
+ await engine.voidOrphanedDowntime({ objective });
770
+
771
+ expect(mockService.deleteDowntimeEvent).not.toHaveBeenCalled();
772
+ });
773
+
774
+ it("does nothing when there are no open events", async () => {
775
+ const objective = createObjective();
776
+ mockService = createMockService({ objectives: [objective] });
777
+ mockSignalService = createMockSignalService();
778
+ mockLogger = createMockLogger();
779
+
780
+ const healthCallback = mock(async () => ({ isHealthy: true }));
781
+ engine = new SloEngine({
782
+ service: mockService,
783
+ signalService: mockSignalService as never,
784
+ logger: mockLogger as never,
785
+ });
786
+ engine.setHealthStatusCallback(healthCallback);
787
+
788
+ await engine.voidOrphanedDowntime({ objective });
789
+
790
+ expect(healthCallback).not.toHaveBeenCalled();
791
+ expect(mockService.deleteDowntimeEvent).not.toHaveBeenCalled();
792
+ });
568
793
  });
569
794
 
570
795
  describe("reconcileObjective", () => {
package/src/slo-engine.ts CHANGED
@@ -83,6 +83,42 @@ export class SloEngine {
83
83
  );
84
84
  }
85
85
 
86
+ /**
87
+ * Void missed-recovery orphans: open downtime events on a system that is
88
+ * currently healthy. The edge-triggered close (on the health recovery
89
+ * transition) records real downtime accurately; this is the safety net for
90
+ * when that transition was never delivered (restart, dropped change, recovery
91
+ * before the SLO close path existed), which would otherwise leave an event
92
+ * open forever. `computeStatus` already ignores such rows for the budget
93
+ * (live health is authoritative), so this is row hygiene: it clears the stale
94
+ * "ongoing" event so it stops showing in history. We delete rather than close
95
+ * because the true recovery time is unknown and the system is healthy, so the
96
+ * unprovable downtime must not be counted.
97
+ *
98
+ * Runs in a write context (the daily job), never from a read accessor.
99
+ */
100
+ async voidOrphanedDowntime({
101
+ objective,
102
+ }: {
103
+ objective: { id: string; systemId: string };
104
+ }): Promise<void> {
105
+ const openEvents = await this.service.getOpenDowntimeEventsForObjective({
106
+ objectiveId: objective.id,
107
+ });
108
+ if (openEvents.length === 0) return;
109
+ if (!this._getSystemHealthStatus) return;
110
+
111
+ const health = await this._getSystemHealthStatus(objective.systemId);
112
+ if (!health.isHealthy) return; // genuinely down — the open event is real
113
+
114
+ for (const event of openEvents) {
115
+ await this.service.deleteDowntimeEvent({ id: event.id });
116
+ }
117
+ this.logger.info(
118
+ `SLO ${objective.id}: voided ${openEvents.length} orphaned open downtime event(s) — system is healthy but a recovery transition was missed`,
119
+ );
120
+ }
121
+
86
122
  // ===========================================================================
87
123
  // PERSPECTIVE 1: This system's own SLOs
88
124
  // ===========================================================================
@@ -300,10 +336,34 @@ export class SloEngine {
300
336
  now.getTime() - objective.windowDays * 24 * 60 * 60 * 1000,
301
337
  );
302
338
 
339
+ // LIVE HEALTH IS AUTHORITATIVE for "currently down". A stored open downtime
340
+ // event is only real ongoing downtime if the system is actually down right
341
+ // now - never trusted on its own. This makes the SLO numbers immune to a
342
+ // drifted/orphaned event log: a healthy system can never read breaching or
343
+ // degraded from a stale open row, by construction. The health check is
344
+ // gated on there being open events at all, so the common (no-open-event)
345
+ // path does no extra work. This method stays side-effect-free (the reactive
346
+ // `slo` entity reads through it); orphan rows are voided by the daily job.
347
+ const openEvents = await this.service.getOpenDowntimeEventsForObjective({
348
+ objectiveId: objective.id,
349
+ });
350
+ let currentlyDown: boolean;
351
+ if (openEvents.length === 0) {
352
+ currentlyDown = false;
353
+ } else if (this._getSystemHealthStatus) {
354
+ const health = await this._getSystemHealthStatus(objective.systemId);
355
+ currentlyDown = !health.isHealthy;
356
+ } else {
357
+ // Before afterPluginsReady wires the health callback, fall back to
358
+ // trusting the stored open state (best effort).
359
+ currentlyDown = true;
360
+ }
361
+
303
362
  const downtime = await this.service.getDowntimeForWindow({
304
363
  objectiveId: objective.id,
305
364
  windowStart,
306
365
  windowEnd: now,
366
+ includeOpen: currentlyDown,
307
367
  });
308
368
 
309
369
  const totalWindowMinutes = objective.windowDays * 24 * 60;
@@ -345,10 +405,16 @@ export class SloEngine {
345
405
 
346
406
  expectedConsumption > 0 ? consumedMinutes / expectedConsumption : null;
347
407
 
348
- // Check for open downtime events
349
- const openEvents = await this.service.getOpenDowntimeEventsForObjective({
350
- objectiveId: objective.id,
351
- });
408
+ // "Degraded" (open downtime) requires BOTH that the system is currently
409
+ // down AND that an open event counts toward this objective's budget. In
410
+ // self-only mode an open upstream outage is excluded; and a stale open event
411
+ // on a now-healthy system (currentlyDown === false) never counts - so the
412
+ // SLO can never read available-and-degraded at once.
413
+ const budgetRelevantOpenEvents = currentlyDown
414
+ ? objective.dependencyExclusion === "strict"
415
+ ? openEvents
416
+ : openEvents.filter((event) => event.attributionType === "self")
417
+ : [];
352
418
 
353
419
  // Build attribution breakdown
354
420
  const attribution = downtime.entries.map((entry) => ({
@@ -376,7 +442,7 @@ export class SloEngine {
376
442
  burnRate,
377
443
  dependencyExclusion: objective.dependencyExclusion,
378
444
  isBreaching: effectiveAvailability !== null && effectiveAvailability < objective.target,
379
- hasOpenDowntime: openEvents.length > 0,
445
+ hasOpenDowntime: budgetRelevantOpenEvents.length > 0,
380
446
  attribution,
381
447
  };
382
448
  }
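The `computeStatus` changes above reduce to two small decisions: whether the system counts as currently down, and which open events are relevant to the budget. The sketch below restates them as pure functions. The function names and the `OpenEvent` shape are assumptions for illustration. Unlike the real method, which only calls the health callback when open events exist, the sketch takes the health result as a plain value.

```typescript
// Simplified open-event shape; the real rows carry more fields.
interface OpenEvent {
  id: string;
  attributionType: "self" | "upstream";
}

// No open events means "not down"; otherwise live health decides; with no
// health source available yet, fall back to trusting the stored open row.
function resolveCurrentlyDown(
  openEventCount: number,
  liveHealth: { isHealthy: boolean } | undefined,
): boolean {
  if (openEventCount === 0) return false;
  if (liveHealth) return !liveHealth.isHealthy;
  return true;
}

// Open events that may mark the objective as degraded. A healthy system has
// none; outside strict mode only self-attributed events count.
function budgetRelevantOpenEvents(
  openEvents: OpenEvent[],
  currentlyDown: boolean,
  dependencyExclusion: string,
): OpenEvent[] {
  if (!currentlyDown) return [];
  return dependencyExclusion === "strict"
    ? openEvents
    : openEvents.filter((event) => event.attributionType === "self");
}
```

Under these assumptions, `hasOpenDowntime` is simply whether `budgetRelevantOpenEvents(...)` returns a non-empty array, which matches the cases exercised by the new `SloEngine` tests.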
@@ -9,6 +9,7 @@ import {
9
9
  import {
10
10
  DependencyExclusionModeSchema,
11
11
  BurnRateThresholdsSchema,
12
+ SloWindowDaysSchema,
12
13
  } from "@checkstack/slo-common";
13
14
  import type { SloService } from "./service";
14
15
 
@@ -30,7 +31,7 @@ const sloSpecSchema = z.object({
30
31
  systemRef: entityRefSchema,
31
32
  healthcheckRef: entityRefSchema.optional(),
32
33
  target: z.number().min(0).max(100),
33
- windowDays: z.number().int().positive(),
34
+ windowDays: SloWindowDaysSchema,
34
35
  dependencyExclusion: DependencyExclusionModeSchema.optional(),
35
36
  excludedDependencyRefs: z.array(entityRefSchema).optional(),
36
37
  burnRateThresholds: BurnRateThresholdsSchema.optional(),
@@ -75,6 +75,11 @@ export async function runDailySnapshotJob(deps: {
75
75
 
76
76
  for (const objective of objectives) {
77
77
  try {
78
+ // Hygiene: clear missed-recovery orphans (open events on healthy systems)
79
+ // before snapshotting, so the trend reflects reality. computeStatus is
80
+ // already immune to such rows, but this stops them lingering in history.
81
+ await engine.voidOrphanedDowntime({ objective });
82
+
78
83
  const status = await engine.computeStatus({ objective });
79
84
 
80
85
  // 1. Persist daily snapshot
package/tsconfig.json CHANGED
@@ -4,6 +4,9 @@
4
4
  "src"
5
5
  ],
6
6
  "references": [
7
+ {
8
+ "path": "../ai-backend"
9
+ },
7
10
  {
8
11
  "path": "../automation-backend"
9
12
  },