@checkstack/slo-backend 0.6.1 → 0.7.0
- package/CHANGELOG.md +123 -0
- package/package.json +18 -15
- package/src/ai/slo.projection.test.ts +36 -0
- package/src/downtime-window.test.ts +143 -0
- package/src/downtime-window.ts +130 -0
- package/src/index.ts +18 -0
- package/src/router.ts +29 -16
- package/src/service.ts +85 -87
- package/src/slo-engine.test.ts +225 -0
- package/src/slo-engine.ts +71 -5
- package/src/slo-gitops-kinds.ts +2 -1
- package/src/streak-calculator.ts +5 -0
- package/tsconfig.json +3 -0
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,128 @@
 # @checkstack/slo-backend
 
+## 0.7.0
+
+### Minor Changes
+
+- 9dcc848: Plugin-owned AI tools: every domain plugin contributes its own AI tools (chat assistant + automation AI action), and `ai-backend` is platform-only.
+
+  Every plugin-specific AI tool is owned by the plugin whose domain it acts on, registered via that plugin's own `aiToolExtensionPoint` / `aiToolProjectionExtensionPoint` from its init - the same path an external plugin author uses. `ai-backend` no longer imports or depends on any capability plugin's `*-common`; the dependency direction is strictly plugin -> ai-platform. Pure helpers (`computeFieldDiff`, capability-summary, `ScriptContextKind`) live in `@checkstack/ai-common`.
+
+  Tools shipped:
+
+  - Health checks and automations: full CRUD - `healthcheck.propose` / `automation.propose` and `*.update` (`mutate`, deep-validated) and `*.delete` (`destructive`, always confirm-gated). `healthcheck.propose`'s dry-run calls the new deep `validateConfiguration` so propose-time validation matches apply-time. Assertions are validated against the collector's result schema and the canonical operator vocabulary. Capability-catalog tools (`ai.listCapabilities`, `ai.getCapabilitySchema`), script context tools (`ai.getScriptContext`, `ai.testScript`), and notify-subscriber tools (`healthcheck.notifySystemSubscribers` / `...GroupSubscribers`).
+  - Catalog: `catalog.createSystem` / `updateSystem` / `createGroup` / `updateGroup` (`mutate`), `catalog.deleteSystem` / `deleteGroup` (`destructive`), membership tools (`mutate`), plus `catalog.listSystems` / `listGroups` read projections.
+  - Incident: `incident.create` / `update` / `addUpdate` / `resolve` / `addLink` (`mutate`), `incident.delete` / `removeLink` (`destructive`), and `incident.get` / `incident.list` read projections.
+  - Maintenance: `maintenance.create` / `update` / `addUpdate` / `close` / `addLink` (`mutate`), `maintenance.delete` / `removeLink` (`destructive`), and `maintenance.list` / `get` read projections.
+  - Read projections for SLO (`slo.listObjectives`), dependency (`dependency.list`), incident (`incident.list`), healthcheck (`healthcheck.status`), and anomaly (`anomaly.explain`), each gated by the source procedure's own access rule and routed as the principal.
+  - Documentation grounding: `ai.searchDocs` / `ai.getDoc` over a build-time bundled docs index (BM25-ish ranking), so the assistant grounds how-to answers in Checkstack's own docs offline.
+  - URL introspection: `ai.probeUrl`, an SSRF-guarded read tool the assistant uses to inspect a real endpoint before drafting a health check. Update tools compute a before -> after field diff rendered on the confirm card (approve mode) or an "Applied" card (auto mode), so a change is never silent.
+
+  `ai_analyze` automation action (automation-backend, with an editor connection picker + audited tool calls): runs a bounded AI agent on the run context as the automation's `runAs` service account, so it can never exceed that identity's permissions; destructive tools are never offered; mutating tools auto-apply through the service account's client. Produces an `automation.analysis` artifact downstream actions can branch on. The agent loop is exposed as a headless `aiAgentRunnerRef` service so automation-backend can drive it without depending on ai-backend.
+
+  `notification.notifyForSubscription` is now callable by user / application principals holding `notification.send` (previously service-only). Every tool routes through the user-scoped client, so handler-side authorization is enforced exactly as a direct UI/RPC action; the resolver gate plus the propose/apply re-check at propose AND apply are the additional authority. A systemic authz regression test asserts every registered tool falls into exactly one safe authorization category.
+
+  A new `ai_transport` enum value `automation` records the AI action's tool calls in the `ai_tool_calls` audit log. No new durable state beyond that; each tool is a thin, deterministic wrapper over an existing RPC, so every pod behaves identically.
+
+  This is a beta minor.
+
+- 9dcc848: Make SLO downtime robust against a drifted event log (fixes "100% available yet degraded" and "ongoing downtime while every check is healthy").
+
+  SLO downtime was stored as edge-triggered open/close interval rows, so a single missed/out-of-order transition left an event open forever and read as ongoing downtime even when healthy. The fix makes live health authoritative:
+
+  - `computeStatus` is now live-health-authoritative and side-effect-free: a stored open event counts toward availability/error-budget and sets `hasOpenDowntime` only when the system is actually down right now (verified via the health callback, checked only when open events exist). A healthy system can no longer read breaching/degraded from a stale row, and this stays pure so the reactive `slo` entity can keep reading through it.
+  - Window accounting is fixed: `getDowntimeForWindow` counts the in-window portion of every overlapping interval (clamped to the window; open events run to "now" only when included), via a pure `downtime-window` helper, so an outage that began before the window is no longer dropped.
+  - Missed-recovery orphans are voided: the daily job deletes open events on currently-healthy systems (their true recovery time was never recorded). The edge-triggered close still records real downtime on normal recoveries.
+
+  Regression tests cover the window-overlap math, the live-health authority, the no-open-event fast path, and orphan voiding.
+
+  This is a beta minor.
+
+- 9dcc848: Align workspace dependency versions and migrate React Router to v7.
+
+  BREAKING CHANGES (React Router v7): All frontend packages now depend on `react-router-dom@^7.16.0`. Previously the workspace declared four divergent ranges (`^6.20.0`, `^6.22.0`, `^7.1.1`, `^7.14.2`), which resolved both `react-router@6` and `react-router@7` into a single bundle. Everything is now unified on v7. The public imports the app uses (`BrowserRouter`, `Routes`, `Route`, `Link`, `NavLink`, `MemoryRouter`, `useNavigate`, `useParams`, `useSearchParams`, `useLocation`) are unchanged between v6 and v7, so no source rewrites were required - but any out-of-tree plugin still on react-router v6 should upgrade to v7 (see the React Router v6 -> v7 upgrade guide) to share the host's single router instance via the import map.
+
+  Other unified ranges (no API change): `react` -> `^18.3.1`, the `@orpc/*` family (`contract`, `server`, `client`, `tanstack-query`, `openapi`, `zod`) -> `^1.14.4`, and `better-auth` -> `^1.6.13`.
+
+  Removed the pre-rename `@orpc/react-query` leftover from `@checkstack/frontend-api`; its `createRouterUtils` / `RouterUtils` / `ProcedureUtils` now come from `@orpc/tanstack-query` (the package already in use).
+
+  Stale in-range runtime deps pulled up to current published versions: `hono` `^4.12.23`, `@tanstack/react-query` (+devtools) `^5.100.14`, `date-fns` `^4.4.0`, `jose` `^6.2.3`, `tar` `^7.5.16`, `semver` `^7.8.1`, `@xyflow/react` `^12.11.0`.
+
+### Patch Changes
+
+- 9dcc848: Write-path hardening: post-commit side effects can no longer fail a committed write, multi-row mutations are now atomic, and retry-duplication is blocked at the database.
+
+  **Platform-level (automatic for all current and future plugins):**
+
+  - signal-backend: `SignalService` (broadcast / sendToUser / sendToUsers / sendToAuthorizedUsers) is now resilient by construction - a transient event-bus/queue failure is caught and logged instead of thrown. Real-time signals are best-effort UI nudges; the authoritative data is already committed by the time a mutation broadcasts, so a signal-transport blip must never turn a successful write into a client-visible error. Every plugin's broadcasts inherit this without per-call-site `try/catch` (which would inevitably be forgotten and regress). This mirrors `createCachedScope`, which already makes cache invalidation non-throwing - so the cache + signal halves of the "post-commit side effect fails the response" class are both closed at the platform seam. Durable side effects (events/hooks that drive automations, queue jobs) intentionally still surface failures. Documented in `developer-guide/backend/signals.md`.
+
+  **Atomic multi-write mutations (each previously committed row-by-row in autocommit, so a mid-sequence failure left partial/orphaned state):**
+
+  - slo-backend: `createObjective` now inserts the objective and its 1:1 streak row in one transaction; the post-create reconcile/status/notify steps are best-effort and can no longer fail the (committed) create.
+  - incident-backend: `createIncident`, `updateIncident`, `addUpdate`, and `resolveIncident` wrap their row + system-link + timeline writes in a transaction (no more wiped system associations on a failed re-insert, or status flips with no matching timeline entry).
+  - maintenance-backend: same for `createMaintenance`, `updateMaintenance`, `addUpdate`, `closeMaintenance`.
+  - automation-backend: `cancelRun` marks the run cancelled and tears down its wait locks + durable state in one transaction - previously a failure after the status update could leave a wait lock behind, letting a later trigger event resume an already-cancelled run.
+  - healthcheck-backend: `ingestSatelliteResult` commits the run row and its hourly-aggregate increment together (no orphaned run, no aggregate without a backing run). NOTE: this guarantees run/aggregate consistency but does not yet make a _duplicate satellite delivery_ idempotent - that needs a dedupe key on the high-volume runs table and is tracked as a follow-up.
+
+  **Retry-duplication blocked at the DB (paired with the SQLSTATE 23505 -> 409 mapping shipped separately):**
+
+  - catalog-backend: new unique indexes on `groups.name`, `environments.name` (consistent with `systems.name`), on `system_links (system_id, url)`, and on `system_contacts (system_id, user_id)` + `(system_id, email)` (NULLs are distinct, so user vs mailbox contacts don't interfere). Name uniqueness is CASE-INSENSITIVE: the three name indexes are functional `lower(name)` indexes (the existing `systems.name` index is rebuilt this way too), so "Api" and "api" collide while the stored value keeps its original casing. The systems pre-write name check (`getSystemByName`) is case-folded to match. Migration `0005` de-dupes any pre-existing rows first - names are preserved by suffixing later case-insensitive duplicates (" (2)", " (3)", ...), redundant contact/link rows are removed keeping the earliest. (Link URLs stay case-sensitive - URL paths are; contact emails are deduped exact-match.)
+  - incident-backend / maintenance-backend: unique index on `incident_links (incident_id, url)` / `maintenance_links (maintenance_id, url)`, with a de-dupe step in the migration.
+
+  **Behavior change:** creating a group/environment with a duplicate name, or attaching a duplicate contact/link, now returns `409 Conflict` instead of silently creating a duplicate. The migrations resolve existing duplicates on upgrade.
+
+  This is a beta patch.
+
+- 9dcc848: Input-validation and error-mapping hardening found by a fuzzing pass against the built container.
+
+  - backend: a Postgres driver error caused by bad client input no longer surfaces as a `500`. The `/api` and `/rest` dispatchers now map the relevant SQLSTATE classes to the correct status - `22P02`/`22003`/`22001`/`22007` (malformed/out-of-range/over-long/bad-date value), `23502`/`23503`/`23514` (missing/dangling/check-failed) to `400`, and `23505` (unique violation) to `409` - and log them at `warn` (client mistake), not `error`. The client-facing message is generic so column/constraint names are never leaked; genuine unknown faults still log at `error` and 500. Previously a `where id = $1` with a non-uuid `$1` (or an over-long string, or a foreign-key miss in `addSystemToGroup`) reached the driver and 500'd, making routine probing look like a server outage and burying real 500s.
+  - slo-common: **fixes a stored cluster-wide DoS.** `windowDays` was accepted up to `2^53`, but the SLO engine derives window boundaries with `Date(now - windowDays * 86_400_000)` - a large value overflows past the max representable `Date` and yields `Invalid Date`. That objective committed fine, then every subsequent read of the system's objectives threw `RangeError: Invalid time value` during serialization (a 500 readable by anyone with SLO read access, on any pod). `windowDays` is now bounded to 1..3650 days at the contract, the GitOps `kind: SLO` spec, and the update path via a single shared `SloWindowDaysSchema`, so the poison row can never be created.
+  - slo-common + healthcheck-common: SLO `getDailySnapshots` and the healthcheck history endpoints (`getHistory`, `getDetailedHistory`, `getAggregatedHistory`, `getDetailedAggregatedHistory`, `getRunsForAnalysis`) declared their `startDate`/`endDate` params as `z.date()`, which a `/rest/...` string param can never satisfy - so those endpoints 400'd on the entire REST surface. They now use `z.coerce.date()`, accepting both the REST string shape and the native RPC `Date`.
+  - healthcheck-common: `intervalSeconds` was `z.number().min(1)` with no `.int()` and no upper bound, so a fractional or out-of-range value reached the DB and failed at insert (the column is a 32-bit int). It is now `.int().min(1).max(2_592_000)` (1 second .. 30 days), applied to both create and update (the update schema is the create partial).
+  - catalog-common: system/group/environment names were bare `z.string()` (environment was `.min(1)` only), so empty, whitespace-only, and 100KB+ names reached the DB - the huge ones surfaced as 500s when parameter binding blew up. Names are now `trim().min(1).max(200)` via a shared schema.
+
+  **BREAKING:** `getSystemContacts` is now `userType: "authenticated"` (was `"public"`). System contacts carry PII (user id, name, email); the public read leaked them to anonymous status-page visitors. Anonymous callers now receive `401` for this one endpoint; the system detail page already renders "No contacts assigned" for anonymous viewers, so the UI degrades gracefully. All other catalog reads remain public.
+
+  - catalog-frontend: the system detail page skips the `getSystemContacts` request entirely for anonymous viewers (it would now `401`) and falls back to the empty state.
+
+  This is a beta release: the breaking contact-visibility change ships as a minor bump per the beta versioning policy, not a major.
+
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+- Updated dependencies [9dcc848]
+  - @checkstack/ai-backend@0.1.0
+  - @checkstack/backend-api@0.21.0
+  - @checkstack/healthcheck-backend@1.6.0
+  - @checkstack/healthcheck-common@1.5.0
+  - @checkstack/automation-backend@0.5.0
+  - @checkstack/catalog-backend@1.4.0
+  - @checkstack/catalog-common@2.3.0
+  - @checkstack/common@0.13.0
+  - @checkstack/slo-common@0.5.0
+  - @checkstack/command-backend@0.2.0
+  - @checkstack/dependency-common@1.2.0
+  - @checkstack/gitops-backend@0.5.0
+  - @checkstack/gitops-common@0.6.0
+  - @checkstack/cache-api@0.3.9
+  - @checkstack/queue-api@0.3.9
+  - @checkstack/signal-common@0.2.6
+  - @checkstack/cache-utils@0.2.14
+
 ## 0.6.1
 
 ### Patch Changes
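The `windowDays` entry in the changelog above describes a window start derived as `now - windowDays * 86_400_000` that leaves the representable `Date` range. The following is a minimal sketch of that failure mode and of a 1..3650 bound, in plain TypeScript. `windowStartFor`, `isValidWindowDays`, and `serializes` are hypothetical names chosen for illustration; the package's actual guard is the `SloWindowDaysSchema` mentioned in the changelog, whose definition is not shown in this diff.

```typescript
const MS_PER_DAY = 86_400_000;

// Hypothetical helper mirroring the window-start arithmetic the changelog describes.
function windowStartFor(now: Date, windowDays: number): Date {
  return new Date(now.getTime() - windowDays * MS_PER_DAY);
}

// Hypothetical stand-in for the shared bound (1..3650 days) named in the changelog.
function isValidWindowDays(windowDays: number): boolean {
  return windowDays >= 1 && windowDays <= 3650;
}

// Returns false when the Date cannot be serialized (toISOString throws RangeError).
function serializes(d: Date): boolean {
  try {
    d.toISOString();
    return true;
  } catch {
    return false;
  }
}

const fixedNow = new Date("2026-06-04T00:00:00.000Z");

// A 30-day window yields an ordinary, serializable Date.
const inRange = windowStartFor(fixedNow, 30);

// A huge windowDays pushes the time value past the +/-8.64e15 ms limit,
// so the constructor produces an Invalid Date.
const poisoned = windowStartFor(fixedNow, 2 ** 53);

console.log(serializes(inRange), serializes(poisoned));
console.log(isValidWindowDays(3650), isValidWindowDays(2 ** 53));
```

Rejecting the value at the input boundary, as the changelog describes, keeps the invalid row from ever being stored, which is cheaper than guarding every read path.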
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@checkstack/slo-backend",
-  "version": "0.
+  "version": "0.7.0",
   "license": "Elastic-2.0",
   "type": "module",
   "main": "src/index.ts",
@@ -11,33 +11,36 @@
     "typecheck": "tsgo -b",
     "generate": "drizzle-kit generate",
     "lint": "bun run lint:code",
-    "lint:code": "eslint . --max-warnings 0"
+    "lint:code": "eslint . --max-warnings 0",
+    "test": "bun test"
   },
   "dependencies": {
-    "@checkstack/backend-api": "0.
-    "@checkstack/
-    "@checkstack/cache-
+    "@checkstack/backend-api": "0.20.0",
+    "@checkstack/ai-backend": "0.0.0",
+    "@checkstack/cache-api": "0.3.8",
+    "@checkstack/cache-utils": "0.2.13",
     "@checkstack/slo-common": "0.4.2",
-    "@checkstack/healthcheck-common": "1.
-    "@checkstack/healthcheck-backend": "1.
+    "@checkstack/healthcheck-common": "1.4.0",
+    "@checkstack/healthcheck-backend": "1.5.0",
     "@checkstack/dependency-common": "1.1.3",
     "@checkstack/catalog-common": "2.2.3",
-    "@checkstack/catalog-backend": "1.
-    "@checkstack/command-backend": "0.1.
+    "@checkstack/catalog-backend": "1.3.1",
+    "@checkstack/command-backend": "0.1.33",
     "@checkstack/signal-common": "0.2.5",
-    "@checkstack/automation-backend": "0.
-    "@checkstack/gitops-backend": "0.
-    "@checkstack/gitops-common": "0.
+    "@checkstack/automation-backend": "0.4.0",
+    "@checkstack/gitops-backend": "0.4.1",
+    "@checkstack/gitops-common": "0.5.0",
     "@checkstack/common": "0.12.0",
-    "@checkstack/queue-api": "0.3.
+    "@checkstack/queue-api": "0.3.8",
     "drizzle-orm": "^0.45.0",
     "zod": "^4.2.1",
-    "@orpc/
+    "@orpc/contract": "^1.14.4",
+    "@orpc/server": "^1.14.4"
   },
   "devDependencies": {
     "@checkstack/drizzle-helper": "0.0.5",
     "@checkstack/scripts": "0.3.4",
-    "@checkstack/test-utils-backend": "0.1.
+    "@checkstack/test-utils-backend": "0.1.33",
     "@checkstack/tsconfig": "0.0.7",
     "@types/bun": "^1.0.0",
     "drizzle-kit": "^0.31.10",
package/src/ai/slo.projection.test.ts
@@ -0,0 +1,36 @@
+import { describe, expect, test } from "bun:test";
+import {
+  buildProjectedTool,
+  deferredProjectionExecute,
+} from "@checkstack/ai-backend";
+import { sloContract, pluginMetadata } from "@checkstack/slo-common";
+
+// Build the projected tool with the SAME inputs the plugin exposes via
+// aiToolProjectionExtensionPoint in `index.ts`, and assert the resulting tool
+// carries the source procedure's contract access rules - NOT the chat
+// transport's `ai.chat.read` gate.
+describe("slo.listObjectives projection", () => {
+  const tool = buildProjectedTool({
+    procedure: sloContract.listObjectives,
+    sourcePluginMetadata: pluginMetadata,
+    procedureKey: "listObjectives",
+    name: "slo.listObjectives",
+    description:
+      "List service-level objectives with their current status and error budget. Read-only.",
+    effect: "read",
+    execute: deferredProjectionExecute,
+  });
+
+  test("uses the overridden tool name", () => {
+    expect(tool.name).toBe("slo.listObjectives");
+  });
+
+  test("is classified as a read-only effect", () => {
+    expect(tool.effect).toBe("read");
+  });
+
+  test("inherits the source procedure's qualified read access rule", () => {
+    // qualifyAccessRuleId: `${pluginId}.${rule.id}` where rule.id = `slo.read`.
+    expect(tool.requiredAccessRules).toEqual(["slo.slo.read"]);
+  });
+});
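The last assertion in the projection test relies on how a rule id is qualified with its plugin id. A small sketch of that string composition follows, assuming only the `${pluginId}.${rule.id}` shape quoted in the test comment. `qualifyRuleId` and the `AccessRuleLike` type are hypothetical stand-ins; the real helper's name, signature, and location are not shown in this diff.

```typescript
// Hypothetical minimal shape of an access rule: only the id matters here.
interface AccessRuleLike {
  id: string;
}

// Hypothetical stand-in for the qualification step described in the test comment.
function qualifyRuleId(pluginId: string, rule: AccessRuleLike): string {
  return `${pluginId}.${rule.id}`;
}

// The SLO plugin id is "slo" and the rule id is "slo.read", per the test comment,
// which is why the prefix appears twice in the qualified id.
const qualified = qualifyRuleId("slo", { id: "slo.read" });
console.log(qualified);
```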
package/src/downtime-window.test.ts
@@ -0,0 +1,143 @@
+import { describe, it, expect } from "bun:test";
+import {
+  eventWindowSeconds,
+  aggregateWindowedDowntime,
+  type WindowedEventInput,
+} from "./downtime-window";
+
+const HOUR = 60 * 60 * 1000;
+const DAY = 24 * HOUR;
+
+// A fixed "now" and a 30-day window ending at now.
+const now = new Date("2026-06-04T00:00:00.000Z");
+const windowEnd = now;
+const windowStart = new Date(now.getTime() - 30 * DAY);
+
+const selfEvent = (
+  overrides: Partial<WindowedEventInput> = {},
+): WindowedEventInput => ({
+  startTime: new Date(now.getTime() - DAY),
+  endTime: null,
+  attributionType: "self",
+  upstreamSystemId: null,
+  upstreamSystemName: null,
+  ...overrides,
+});
+
+describe("eventWindowSeconds", () => {
+  it("counts the full window for an OPEN event that started before the window (the bug)", () => {
+    // Started 60 days ago, still ongoing. Its in-window portion is the WHOLE
+    // 30-day window - it must not be dropped just because it started earlier.
+    const seconds = eventWindowSeconds({
+      startTime: new Date(now.getTime() - 60 * DAY),
+      endTime: null,
+      windowStart,
+      windowEnd,
+      now,
+    });
+    expect(seconds).toBe((30 * DAY) / 1000);
+  });
+
+  it("counts an open event from its start when it started inside the window", () => {
+    const seconds = eventWindowSeconds({
+      startTime: new Date(now.getTime() - 2 * HOUR),
+      endTime: null,
+      windowStart,
+      windowEnd,
+      now,
+    });
+    expect(seconds).toBe((2 * HOUR) / 1000);
+  });
+
+  it("returns 0 for a closed event entirely before the window", () => {
+    const seconds = eventWindowSeconds({
+      startTime: new Date(now.getTime() - 40 * DAY),
+      endTime: new Date(now.getTime() - 35 * DAY),
+      windowStart,
+      windowEnd,
+      now,
+    });
+    expect(seconds).toBe(0);
+  });
+
+  it("clamps an event that started before the window but ended inside it", () => {
+    // Started 31 days ago, ended 29 days ago → only 1 day falls in the window.
+    const seconds = eventWindowSeconds({
+      startTime: new Date(now.getTime() - 31 * DAY),
+      endTime: new Date(now.getTime() - 29 * DAY),
+      windowStart,
+      windowEnd,
+      now,
+    });
+    expect(seconds).toBe(DAY / 1000);
+  });
+
+  it("counts the full duration of an event fully inside the window", () => {
+    const seconds = eventWindowSeconds({
+      startTime: new Date(now.getTime() - 5 * DAY),
+      endTime: new Date(now.getTime() - 5 * DAY + 3 * HOUR),
+      windowStart,
+      windowEnd,
+      now,
+    });
+    expect(seconds).toBe((3 * HOUR) / 1000);
+  });
+});
+
+describe("aggregateWindowedDowntime", () => {
+  it("regression: an open self outage from before the window consumes the whole window", () => {
+    // This is the exact dashboard bug: a 'Self/Ongoing' event from ~2 months
+    // ago must NOT yield ~0 consumed minutes (which rendered as 100% available
+    // while still flagged degraded).
+    const result = aggregateWindowedDowntime({
+      events: [
+        selfEvent({ startTime: new Date(now.getTime() - 60 * DAY) }),
+      ],
+      windowStart,
+      windowEnd,
+      now,
+    });
+    expect(result.selfMinutes).toBe((30 * DAY) / 1000 / 60);
+    expect(result.totalMinutes).toBe((30 * DAY) / 1000 / 60);
+  });
+
+  it("splits self vs upstream and clamps each to the window", () => {
+    const result = aggregateWindowedDowntime({
+      events: [
+        selfEvent({
+          startTime: new Date(now.getTime() - 2 * HOUR),
+          endTime: new Date(now.getTime() - 1 * HOUR),
+        }),
+        {
+          startTime: new Date(now.getTime() - 3 * HOUR),
+          endTime: new Date(now.getTime() - 2 * HOUR),
+          attributionType: "upstream",
+          upstreamSystemId: "up-1",
+          upstreamSystemName: "Upstream",
+        },
+      ],
+      windowStart,
+      windowEnd,
+      now,
+    });
+    expect(result.selfMinutes).toBe(60);
+    expect(result.upstreamMinutes).toBe(60);
+    expect(result.totalMinutes).toBe(120);
+    expect(result.entries).toHaveLength(2);
+  });
+
+  it("ignores events fully outside the window", () => {
+    const result = aggregateWindowedDowntime({
+      events: [
+        selfEvent({
+          startTime: new Date(now.getTime() - 40 * DAY),
+          endTime: new Date(now.getTime() - 35 * DAY),
+        }),
+      ],
+      windowStart,
+      windowEnd,
+      now,
+    });
+    expect(result.totalMinutes).toBe(0);
+  });
+});
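The aggregation tests above check that downtime is split into a `self` bucket and one bucket per upstream system. The grouping step on its own can be sketched as follows. This is an illustrative reduction, not the package's `aggregateWindowedDowntime`: `ClampedEvent` and `minutesBySource` are hypothetical names, and the input is assumed to carry seconds that were already clamped to the window.

```typescript
// Hypothetical input: one event with its in-window seconds already computed.
interface ClampedEvent {
  attributionType: string;
  upstreamSystemId: string | null;
  seconds: number;
}

// Sums minutes per source bucket: "self", or "upstream:<id>" for each upstream system.
function minutesBySource(events: ClampedEvent[]): Map<string, number> {
  const buckets = new Map<string, number>();
  for (const e of events) {
    if (e.seconds <= 0) continue; // events with no in-window overlap contribute nothing
    const key =
      e.attributionType === "self" ? "self" : `upstream:${e.upstreamSystemId}`;
    buckets.set(key, (buckets.get(key) ?? 0) + e.seconds / 60);
  }
  return buckets;
}

const buckets = minutesBySource([
  { attributionType: "self", upstreamSystemId: null, seconds: 3600 },
  { attributionType: "upstream", upstreamSystemId: "up-1", seconds: 3600 },
  { attributionType: "upstream", upstreamSystemId: "up-1", seconds: 1800 },
  { attributionType: "self", upstreamSystemId: null, seconds: 0 },
]);

console.log([...buckets.entries()]);
```

Keying upstream buckets by system id keeps two outages caused by the same upstream in one entry while still separating different upstreams.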
package/src/downtime-window.ts
@@ -0,0 +1,130 @@
+/**
+ * Pure window-overlap math for SLO downtime accounting.
+ *
+ * A downtime event contributes to an SLO window only for the portion of its
+ * duration that falls inside `[windowStart, windowEnd]`. Open (ongoing) events
+ * have no `endTime` and run until `now`. This is deliberately separate from any
+ * stored `durationSeconds` cache, which is not window-aware: a long outage that
+ * began before the window (the dashboard "100% available + degraded" bug) must
+ * still consume its in-window portion, and an event that straddles a window edge
+ * must be clamped, not counted in full.
+ */
+
+export interface WindowedEventInput {
+  startTime: Date;
+  endTime: Date | null;
+  attributionType: string;
+  upstreamSystemId: string | null;
+  upstreamSystemName: string | null;
+}
+
+export interface WindowedDowntime {
+  totalMinutes: number;
+  selfMinutes: number;
+  upstreamMinutes: number;
+  entries: Array<{
+    attributionType: string;
+    upstreamSystemId: string | null;
+    upstreamSystemName: string | null;
+    totalMinutes: number;
+  }>;
+}
+
+/**
+ * Seconds of a single event that fall inside the window. Open events (no
+ * `endTime`) run to `now`. Returns 0 when the event does not overlap the window.
+ */
+export function eventWindowSeconds({
+  startTime,
+  endTime,
+  windowStart,
+  windowEnd,
+  now,
+}: {
+  startTime: Date;
+  endTime: Date | null;
+  windowStart: Date;
+  windowEnd: Date;
+  now: Date;
+}): number {
+  const end = endTime ?? now;
+  const effectiveStart = Math.max(startTime.getTime(), windowStart.getTime());
+  const effectiveEnd = Math.min(end.getTime(), windowEnd.getTime());
+  const seconds = (effectiveEnd - effectiveStart) / 1000;
+  return Math.max(0, seconds);
+}
+
+/**
+ * Aggregate the in-window downtime across many events, split by attribution and
+ * grouped per source (self, or one bucket per upstream system).
+ */
+export function aggregateWindowedDowntime({
+  events,
+  windowStart,
+  windowEnd,
+  now,
+}: {
+  events: WindowedEventInput[];
+  windowStart: Date;
+  windowEnd: Date;
+  now: Date;
+}): WindowedDowntime {
+  let totalSeconds = 0;
+  let selfSeconds = 0;
+  let upstreamSeconds = 0;
+  const bySource = new Map<
+    string,
+    {
+      attributionType: string;
+      upstreamSystemId: string | null;
+      upstreamSystemName: string | null;
+      totalSeconds: number;
+    }
+  >();
+
+  for (const event of events) {
+    const duration = eventWindowSeconds({
+      startTime: event.startTime,
+      endTime: event.endTime,
+      windowStart,
+      windowEnd,
+      now,
+    });
+    if (duration <= 0) continue;
+
+    totalSeconds += duration;
+    if (event.attributionType === "self") {
+      selfSeconds += duration;
+    } else {
+      upstreamSeconds += duration;
+    }
+
+    const key =
+      event.attributionType === "self"
+        ? "self"
+        : `upstream:${event.upstreamSystemId}`;
+    const existing = bySource.get(key);
+    if (existing) {
+      existing.totalSeconds += duration;
+    } else {
+      bySource.set(key, {
+        attributionType: event.attributionType,
+        upstreamSystemId: event.upstreamSystemId,
+        upstreamSystemName: event.upstreamSystemName,
+        totalSeconds: duration,
+      });
+    }
+  }
+
+  return {
+    totalMinutes: totalSeconds / 60,
+    selfMinutes: selfSeconds / 60,
+    upstreamMinutes: upstreamSeconds / 60,
+    entries: [...bySource.values()].map((e) => ({
+      attributionType: e.attributionType,
+      upstreamSystemId: e.upstreamSystemId,
+      upstreamSystemName: e.upstreamSystemName,
+      totalMinutes: e.totalSeconds / 60,
+    })),
+  };
+}
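The clamping rule in `eventWindowSeconds` can be checked by hand with a few of the cases from the tests. The sketch below restates the same max/min arithmetic as a standalone function so it runs without the package. `overlapSeconds` is a hypothetical name and takes epoch milliseconds rather than `Date` objects, which is an assumption made here for brevity.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Hypothetical standalone restatement of the overlap math, in epoch milliseconds.
function overlapSeconds(
  startMs: number,
  endMs: number | null,
  windowStartMs: number,
  windowEndMs: number,
  nowMs: number,
): number {
  const end = endMs ?? nowMs; // open events run until "now"
  const from = Math.max(startMs, windowStartMs); // clamp to the window start
  const to = Math.min(end, windowEndMs); // clamp to the window end
  return Math.max(0, (to - from) / 1000); // no overlap yields 0, never negative
}

const nowMs = Date.parse("2026-06-04T00:00:00.000Z");
const windowStartMs = nowMs - 30 * DAY_MS;

// Open outage that began 60 days ago: clamped to the full 30-day window.
const openOld = overlapSeconds(nowMs - 60 * DAY_MS, null, windowStartMs, nowMs, nowMs);

// Closed outage from 31 to 29 days ago: only the final day is inside the window.
const straddling = overlapSeconds(
  nowMs - 31 * DAY_MS,
  nowMs - 29 * DAY_MS,
  windowStartMs,
  nowMs,
  nowMs,
);

// Closed outage entirely before the window: no overlap.
const outside = overlapSeconds(
  nowMs - 40 * DAY_MS,
  nowMs - 35 * DAY_MS,
  windowStartMs,
  nowMs,
  nowMs,
);

console.log(openOld, straddling, outside); // 2592000 86400 0
```

The first case is the one the changelog calls out: before the fix, an outage that started ahead of the window was dropped, so the window read as fully available.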
package/src/index.ts
CHANGED
@@ -1,6 +1,10 @@
 import * as schema from "./schema";
 import type { SafeDatabase } from "@checkstack/backend-api";
 import { z } from "zod";
+import {
+  aiToolProjectionExtensionPoint,
+  deferredProjectionExecute,
+} from "@checkstack/ai-backend";
 import {
   sloAccessRules,
   sloAccess,
@@ -200,6 +204,20 @@ export default createBackendPlugin({
       },
     });
 
+    // Expose this plugin's read-only AI projection (`slo.listObjectives`) via
+    // the AI projection extension point. ai-backend collects its routing in
+    // afterPluginsReady and never imports slo-common.
+    env.getExtensionPoint(aiToolProjectionExtensionPoint).expose({
+      procedure: sloContract.listObjectives,
+      sourcePluginMetadata: pluginMetadata,
+      procedureKey: "listObjectives",
+      name: "slo.listObjectives",
+      description:
+        "List service-level objectives with their current status and error budget. Read-only.",
+      effect: "read",
+      execute: deferredProjectionExecute,
+    });
+
     env.registerInit({
       schema,
       deps: {
package/src/router.ts
CHANGED
@@ -110,25 +110,38 @@ export function createRouter({
     ),
 
     createObjective: os.createObjective.handler(
-      async ({ input }) => {
+      async ({ input, context }) => {
+        // The objective (+ its streak row) is committed atomically here. Once it
+        // returns, the write is durable and the create has SUCCEEDED - the
+        // post-commit steps below are best-effort enrichment/notification and
+        // must never turn a committed create into a client-visible error.
         const objective = await service.createObjective({ input });
 
-        // Reconcile initial state: if system is already down,
-        // open an initial downtime event immediately
-        await engine.reconcileObjective({ objective });
-
-        const status = await engine.computeStatus({ objective });
         // Mutation invariant: db.write → cache.invalidate (await) → signals.emit.
-
-
-
-
-
-
-
-
-
-
+        // cache.invalidate* and signalService.* are already non-throwing by
+        // platform contract; reconcile/computeStatus are guarded here so a
+        // transient health-read failure can't fail the (already committed) create.
+        try {
+          // Reconcile initial state: if system is already down, open an initial
+          // downtime event immediately.
+          await engine.reconcileObjective({ objective });
+          const status = await engine.computeStatus({ objective });
+          await cache.invalidateForMutation({
+            objectiveId: objective.id,
+            systemId: objective.systemId,
+          });
+          await signalService.broadcast(SLO_STATUS_CHANGED, {
+            systemId: objective.systemId,
+            objectiveId: objective.id,
+            budgetRemainingPercent: status.errorBudgetRemainingPercent,
+            isBreaching: status.isBreaching,
+          });
+        } catch (error) {
+          context.logger.warn(
+            `createObjective: objective ${objective.id} committed, but post-create reconcile/notify failed`,
+            { error },
+          );
+        }
 
         return objective;
       },
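The handler change above follows a "commit first, then best-effort follow-up" shape. The sketch below isolates that pattern. `commitThenBestEffort` and `SketchLogger` are hypothetical names introduced here for illustration and are not part of the package; the real handler inlines this logic with its own `service`, `engine`, `cache`, and `signalService`.

```typescript
// Hypothetical logger shape, assumed for this sketch only.
type SketchLogger = { warn: (message: string, meta?: unknown) => void };

// Run a durable write, then a follow-up whose failure is logged, not rethrown.
async function commitThenBestEffort<T>(args: {
  commit: () => Promise<T>;
  followUp: (committed: T) => Promise<void>;
  logger: SketchLogger;
}): Promise<T> {
  // A failure here propagates: nothing was committed, so the caller should see it.
  const committed = await args.commit();
  try {
    await args.followUp(committed);
  } catch (error) {
    // The write already succeeded, so a follow-up failure is downgraded to a warning.
    args.logger.warn("post-commit follow-up failed", { error });
  }
  return committed;
}

const sketchWarnings: string[] = [];
const sketchResult = commitThenBestEffort({
  commit: async () => ({ id: "objective-1" }),
  followUp: async () => {
    throw new Error("health read timed out");
  },
  logger: {
    warn: (message) => {
      sketchWarnings.push(message);
    },
  },
});
```

The caller still receives the committed value even though the follow-up threw, which is the behavior the new comments in the handler describe.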
package/src/service.ts
CHANGED
|
@@ -1,6 +1,7 @@
|
|
|
1
|
-
import { eq, and, isNull, desc, gte, lte } from "drizzle-orm";
|
|
1
|
+
import { eq, and, isNull, desc, gte, lte, or } from "drizzle-orm";
|
|
2
2
|
import type { SafeDatabase } from "@checkstack/backend-api";
|
|
3
3
|
import * as schema from "./schema";
|
|
4
|
+
import { aggregateWindowedDowntime } from "./downtime-window";
|
|
4
5
|
import {
|
|
5
6
|
sloObjectives,
|
|
6
7
|
sloDowntimeEvents,
|
|
@@ -69,29 +70,35 @@ export class SloService {
|
|
|
69
70
|
const id = generateId();
|
|
70
71
|
const now = new Date();
|
|
71
72
|
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
81
|
-
|
|
82
|
-
|
|
83
|
-
|
|
84
|
-
input.
|
|
85
|
-
|
|
86
|
-
|
|
87
|
-
|
|
88
|
-
|
|
89
|
-
|
|
90
|
-
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
|
|
94
|
-
|
|
73
|
+
// Atomic: the objective row and its 1:1 streak row must commit together.
|
|
74
|
+
// Without the transaction a failure on the streak insert left a committed
|
|
75
|
+
// objective with no streak (and the client saw an error for a write that
|
|
76
|
+
// partially succeeded).
|
|
77
|
+
await this.db.transaction(async (tx) => {
|
|
78
|
+
await tx.insert(sloObjectives).values({
|
|
79
|
+
id,
|
|
80
|
+
systemId: input.systemId,
|
|
81
|
+
|
|
82
|
+
healthCheckConfigurationId: input.healthCheckConfigurationId ?? null,
|
|
83
|
+
target: input.target,
|
|
84
|
+
windowDays: input.windowDays,
|
|
85
|
+
dependencyExclusion: input.dependencyExclusion ?? "strict",
|
|
86
|
+
excludedDependencyIds: input.excludedDependencyIds ?? [],
|
|
87
|
+
burnRateWarningPercent: input.burnRateThresholds?.warningPercent ?? 50,
|
|
88
|
+
burnRateCriticalPercent: input.burnRateThresholds?.criticalPercent ?? 80,
|
|
89
|
+
burnRateFastBurnMultiplier:
|
|
90
|
+
input.burnRateThresholds?.fastBurnMultiplier ?? 5,
|
|
91
|
+
createdAt: now,
|
|
92
|
+
updatedAt: now,
|
|
93
|
+
});
|
|
94
|
+
|
|
95
|
+
// Create initial streak record
|
|
96
|
+
await tx.insert(sloStreaks).values({
|
|
97
|
+
objectiveId: id,
|
|
98
|
+
systemId: input.systemId,
|
|
99
|
+
currentStreak: 0,
|
|
100
|
+
bestStreak: 0,
|
|
101
|
+
});
|
|
95
102
|
});
|
|
96
103
|
|
|
97
104
|
return (await this.getObjective({ id }))!;
|
|
@@ -306,10 +313,19 @@ export class SloService {
|
|
|
306
313
|
objectiveId,
|
|
307
314
|
windowStart,
|
|
308
315
|
windowEnd,
|
|
316
|
+
includeOpen,
|
|
309
317
|
}: {
|
|
310
318
|
objectiveId: string;
|
|
311
319
|
windowStart: Date;
|
|
312
320
|
windowEnd: Date;
|
|
321
|
+
/**
|
|
322
|
+
* Whether to count still-open events as ongoing downtime (clamped to `now`).
|
|
323
|
+
* The caller decides this from the system's LIVE health: an open event is
|
|
324
|
+
* only real ongoing downtime if the system is currently down. When false,
|
|
325
|
+
* only closed intervals are counted, so a stale/orphaned open event (a
|
|
326
|
+
* missed-recovery row) has zero effect on the budget.
|
|
327
|
+
*/
|
|
328
|
+
includeOpen: boolean;
|
|
313
329
|
}): Promise<{
|
|
314
330
|
totalMinutes: number;
|
|
315
331
|
selfMinutes: number;
|
|
@@ -321,71 +337,53 @@ export class SloService {
|
|
|
321
337
|
totalMinutes: number;
|
|
322
338
|
}>;
|
|
323
339
|
}> {
|
|
324
|
-
//
|
|
325
|
-
|
|
326
|
-
|
|
327
|
-
|
|
328
|
-
|
|
329
|
-
and(
|
|
330
|
-
eq(sloDowntimeEvents.objectiveId, objectiveId),
|
|
331
|
-
gte(sloDowntimeEvents.startTime, windowStart),
|
|
332
|
-
lte(sloDowntimeEvents.startTime, windowEnd),
|
|
333
|
-
),
|
|
334
|
-
);
|
|
335
|
-
|
|
336
|
-
// Also include open events (use current time as endTime for running duration)
|
|
340
|
+
// Closed intervals that OVERLAP the window: started on/before the window end
|
|
341
|
+
// AND ended on/after the window start. (`endTime >= windowStart` excludes
|
|
342
|
+
// NULL end times, i.e. open events, in SQL.) An ongoing outage that began
|
|
343
|
+
// before `windowStart` still consumes its in-window portion - it is not
|
|
344
|
+
// dropped just because it started earlier.
|
|
337
345
|
const now = new Date();
|
|
338
|
-
|
|
339
|
-
|
|
340
|
-
|
|
341
|
-
|
|
342
|
-
|
|
343
|
-
|
|
344
|
-
|
|
345
|
-
|
|
346
|
-
|
|
347
|
-
|
|
348
|
-
|
|
349
|
-
|
|
350
|
-
|
|
351
|
-
|
|
352
|
-
|
|
353
|
-
|
|
354
|
-
|
|
355
|
-
|
|
356
|
-
|
|
357
|
-
|
|
358
|
-
|
|
359
|
-
|
|
360
|
-
|
|
361
|
-
|
|
362
|
-
|
|
363
|
-
const key =
|
|
364
|
-
event.attributionType === "self"
|
|
365
|
-
? "self"
|
|
366
|
-
: `upstream:${event.upstreamSystemId}`;
|
|
367
|
-
const existing = bySource.get(key);
|
|
368
|
-
if (existing) {
|
|
369
|
-
existing.totalSeconds += duration;
|
|
370
|
-
} else {
|
|
371
|
-
bySource.set(key, {
|
|
372
|
-
attributionType: event.attributionType,
|
|
373
|
-
upstreamSystemId: event.upstreamSystemId,
|
|
374
|
-
upstreamSystemName: event.upstreamSystemName,
|
|
375
|
-
totalSeconds: duration,
|
|
376
|
-
});
|
|
377
|
-
}
|
|
378
|
-
}
|
|
379
|
-
|
|
380
|
-
return {
|
|
381
|
-
totalMinutes: totalSeconds / 60,
|
|
382
|
-
selfMinutes: selfSeconds / 60,
|
|
383
|
-
upstreamMinutes: upstreamSeconds / 60,
|
|
384
|
-
entries: [...bySource.values()].map((e) => ({
|
|
385
|
-
...e,
|
|
386
|
-
totalMinutes: e.totalSeconds / 60,
|
|
346
|
+
const startBound = lte(sloDowntimeEvents.startTime, windowEnd);
|
|
347
|
+
const closedOverlap = gte(sloDowntimeEvents.endTime, windowStart);
|
|
348
|
+
const where = includeOpen
|
|
349
|
+
? and(
|
|
350
|
+
eq(sloDowntimeEvents.objectiveId, objectiveId),
|
|
351
|
+
startBound,
|
|
352
|
+
or(closedOverlap, isNull(sloDowntimeEvents.endTime)),
|
|
353
|
+
)
|
|
354
|
+
: and(
|
|
355
|
+
eq(sloDowntimeEvents.objectiveId, objectiveId),
|
|
356
|
+
startBound,
|
|
357
|
+
closedOverlap,
|
|
358
|
+
);
|
|
359
|
+
|
|
360
|
+
const events = await this.db.select().from(sloDowntimeEvents).where(where);
|
|
361
|
+
|
|
362
|
+
// Clamp each event to the window (open events run to `now`) and aggregate.
|
|
363
|
+
return aggregateWindowedDowntime({
|
|
364
|
+
events: events.map((event) => ({
|
|
365
|
+
startTime: event.startTime,
|
|
366
|
+
endTime: event.endTime,
|
|
367
|
+
attributionType: event.attributionType,
|
|
368
|
+
upstreamSystemId: event.upstreamSystemId,
|
|
369
|
+
upstreamSystemName: event.upstreamSystemName,
|
|
387
370
|
})),
|
|
388
|
-
|
|
371
|
+
windowStart,
|
|
372
|
+
windowEnd,
|
|
373
|
+
now,
|
|
374
|
+
});
|
|
375
|
+
}
|
|
376
|
+
|
|
377
|
+
/**
|
|
378
|
+
* Hard-delete a downtime event. Used to VOID a missed-recovery orphan: an
|
|
379
|
+
* open event on a system that is currently healthy, whose true recovery time
|
|
380
|
+
* was never recorded. We do not know the real downtime, the system is healthy,
|
|
381
|
+
* so the unprovable row is removed rather than counted.
|
|
382
|
+
*/
|
|
383
|
+
async deleteDowntimeEvent({ id }: { id: string }): Promise<void> {
|
|
384
|
+
await this.db
|
|
385
|
+
.delete(sloDowntimeEvents)
|
|
386
|
+
.where(eq(sloDowntimeEvents.id, id));
|
|
389
387
|
}
|
|
390
388
|
|
|
391
389
|
async getRecentDowntimeEvents({
|
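The new `where` clause in `getDowntimeForWindow` selects events that overlap the window instead of events that merely start inside it. The sketch below restates that predicate in plain TypeScript so the cases are easy to check by hand. `overlapsWindow` and `SketchStoredEvent` are hypothetical names for illustration; the package evaluates this condition in SQL through drizzle-orm, not in application code.

```typescript
// Hypothetical in-memory shape; the real rows come from sloDowntimeEvents.
interface SketchStoredEvent {
  startTime: Date;
  endTime: Date | null; // null means the event is still open
}

// In-memory restatement of the SQL filter: startBound AND (closedOverlap [OR open]).
function overlapsWindow(
  event: SketchStoredEvent,
  windowStart: Date,
  windowEnd: Date,
  includeOpen: boolean,
): boolean {
  // startBound: the event began on or before the window end.
  if (event.startTime.getTime() > windowEnd.getTime()) return false;
  // An open event matches only when the caller asked for open events.
  // (In SQL, `endTime >= windowStart` is not true for a NULL endTime.)
  if (event.endTime === null) return includeOpen;
  // closedOverlap: the event ended on or after the window start.
  return event.endTime.getTime() >= windowStart.getTime();
}

const sketchStart = new Date("2024-01-01T00:00:00Z");
const sketchEnd = new Date("2024-01-31T00:00:00Z");

// Began before the window and ended inside it: the old start-time filter dropped this.
const earlyStarter: SketchStoredEvent = {
  startTime: new Date("2023-12-31T12:00:00Z"),
  endTime: new Date("2024-01-01T06:00:00Z"),
};

// Still open, began inside the window.
const stillOpen: SketchStoredEvent = {
  startTime: new Date("2024-01-20T00:00:00Z"),
  endTime: null,
};
```

With `includeOpen` set to `false`, `stillOpen` is filtered out entirely, which is how a stale open row ends up with no effect on the budget.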
package/src/slo-engine.test.ts
CHANGED
|
@@ -99,6 +99,7 @@ function createMockService(
|
|
|
99
99
|
closeDowntimeEvent: mock(() =>
|
|
100
100
|
Promise.resolve(createDowntimeEvent({ endTime: new Date(), durationSeconds: 60 })),
|
|
101
101
|
),
|
|
102
|
+
deleteDowntimeEvent: mock(() => Promise.resolve()),
|
|
102
103
|
getDowntimeForWindow: mock(() =>
|
|
103
104
|
Promise.resolve({
|
|
104
105
|
totalMinutes: 0,
|
|
@@ -565,6 +566,230 @@ describe("SloEngine", () => {
|
|
|
565
566
|
|
|
566
567
|
expect(status.isBreaching).toBe(true);
|
|
567
568
|
});
|
|
569
|
+
|
|
570
|
+
it("should NOT flag hasOpenDowntime for an open upstream event in self-only mode", async () => {
|
|
571
|
+
// Regression: an open upstream event must not flip a self-only objective
|
|
572
|
+
// to "degraded" when no self downtime is counted — otherwise the SLO
|
|
573
|
+
// reads 100% available + degraded at the same time, which must not happen.
|
|
574
|
+
const objective = createObjective({
|
|
575
|
+
target: 99.9,
|
|
576
|
+
windowDays: 30,
|
|
577
|
+
dependencyExclusion: "self-only",
|
|
578
|
+
});
|
|
579
|
+
const openUpstream = createDowntimeEvent({
|
|
580
|
+
attributionType: "upstream",
|
|
581
|
+
upstreamSystemId: "up-1",
|
|
582
|
+
});
|
|
583
|
+
mockService = createMockService({
|
|
584
|
+
objectives: [objective],
|
|
585
|
+
openEvents: [openUpstream],
|
|
586
|
+
});
|
|
587
|
+
mockSignalService = createMockSignalService();
|
|
588
|
+
mockLogger = createMockLogger();
|
|
589
|
+
|
|
590
|
+
engine = new SloEngine({
|
|
591
|
+
service: mockService,
|
|
592
|
+
signalService: mockSignalService as never,
|
|
593
|
+
logger: mockLogger as never,
|
|
594
|
+
});
|
|
595
|
+
|
|
596
|
+
const status = await engine.computeStatus({ objective });
|
|
597
|
+
|
|
598
|
+
expect(status.currentAvailability).toBe(100);
|
|
599
|
+
expect(status.errorBudgetRemainingPercent).toBe(100);
|
|
600
|
+
// Self-only: an upstream-attributed open event is excluded from budget,
|
|
601
|
+
// so it must not report open (budget-relevant) downtime.
|
|
602
|
+
expect(status.hasOpenDowntime).toBe(false);
|
|
603
|
+
});
|
|
604
|
+
|
|
605
|
+
it("should flag hasOpenDowntime for an open self event in self-only mode", async () => {
|
|
606
|
+
const objective = createObjective({ dependencyExclusion: "self-only" });
|
|
607
|
+
const openSelf = createDowntimeEvent({ attributionType: "self" });
|
|
608
|
+
mockService = createMockService({
|
|
609
|
+
objectives: [objective],
|
|
610
|
+
openEvents: [openSelf],
|
|
611
|
+
});
|
|
612
|
+
mockSignalService = createMockSignalService();
|
|
613
|
+
mockLogger = createMockLogger();
|
|
614
|
+
|
|
615
|
+
engine = new SloEngine({
|
|
616
|
+
service: mockService,
|
|
617
|
+
signalService: mockSignalService as never,
|
|
618
|
+
logger: mockLogger as never,
|
|
619
|
+
});
|
|
620
|
+
|
|
621
|
+
const status = await engine.computeStatus({ objective });
|
|
622
|
+
|
|
623
|
+
expect(status.hasOpenDowntime).toBe(true);
|
|
624
|
+
});
|
|
625
|
+
|
|
626
|
+
it("should flag hasOpenDowntime for any open event in strict mode", async () => {
|
|
627
|
+
const objective = createObjective({ dependencyExclusion: "strict" });
|
|
628
|
+
const openUpstream = createDowntimeEvent({
|
|
629
|
+
attributionType: "upstream",
|
|
630
|
+
upstreamSystemId: "up-1",
|
|
631
|
+
});
|
|
632
|
+
mockService = createMockService({
|
|
633
|
+
objectives: [objective],
|
|
634
|
+
openEvents: [openUpstream],
|
|
635
|
+
});
|
|
636
|
+
mockSignalService = createMockSignalService();
|
|
637
|
+
mockLogger = createMockLogger();
|
|
638
|
+
|
|
639
|
+
engine = new SloEngine({
|
|
640
|
+
service: mockService,
|
|
641
|
+
signalService: mockSignalService as never,
|
|
642
|
+
logger: mockLogger as never,
|
|
643
|
+
});
|
|
644
|
+
|
|
645
|
+
const status = await engine.computeStatus({ objective });
|
|
646
|
+
|
|
647
|
+
expect(status.hasOpenDowntime).toBe(true);
|
|
648
|
+
});
|
|
649
|
+
|
|
650
|
+
it("live health is authoritative: a HEALTHY system with an open event is not degraded and excludes it from the budget", async () => {
|
|
651
|
+
// The dashboard "ongoing while healthy" regression: an orphaned open
|
|
652
|
+
// event (missed recovery) must not flip a currently-healthy SLO to
|
|
653
|
+
// degraded/breaching. computeStatus must ask the open path NOT to count
|
|
654
|
+
// open downtime when the system is healthy.
|
|
655
|
+
const objective = createObjective({ dependencyExclusion: "self-only" });
|
|
656
|
+
const openSelf = createDowntimeEvent({ attributionType: "self" });
|
|
657
|
+
mockService = createMockService({
|
|
658
|
+
objectives: [objective],
|
|
659
|
+
openEvents: [openSelf],
|
|
660
|
+
});
|
|
661
|
+
mockSignalService = createMockSignalService();
|
|
662
|
+
mockLogger = createMockLogger();
|
|
663
|
+
|
|
664
|
+
engine = new SloEngine({
|
|
665
|
+
service: mockService,
|
|
666
|
+
signalService: mockSignalService as never,
|
|
667
|
+
logger: mockLogger as never,
|
|
668
|
+
});
|
|
669
|
+
engine.setHealthStatusCallback(async () => ({ isHealthy: true }));
|
|
670
|
+
|
|
671
|
+
const status = await engine.computeStatus({ objective });
|
|
672
|
+
|
|
673
|
+
expect(status.hasOpenDowntime).toBe(false);
|
|
674
|
+
expect(mockService.getDowntimeForWindow).toHaveBeenCalledWith(
|
|
675
|
+
expect.objectContaining({ includeOpen: false }),
|
|
676
|
+
);
|
|
677
|
+
});
|
|
678
|
+
|
|
679
|
+
it("live health is authoritative: a DOWN system with an open event counts it and is degraded", async () => {
|
|
680
|
+
const objective = createObjective({ dependencyExclusion: "self-only" });
|
|
681
|
+
const openSelf = createDowntimeEvent({ attributionType: "self" });
|
|
682
|
+
mockService = createMockService({
|
|
683
|
+
objectives: [objective],
|
|
684
|
+
openEvents: [openSelf],
|
|
685
|
+
});
|
|
686
|
+
mockSignalService = createMockSignalService();
|
|
687
|
+
mockLogger = createMockLogger();
|
|
688
|
+
|
|
689
|
+
engine = new SloEngine({
|
|
690
|
+
service: mockService,
|
|
691
|
+
signalService: mockSignalService as never,
|
|
692
|
+
logger: mockLogger as never,
|
|
693
|
+
});
|
|
694
|
+
engine.setHealthStatusCallback(async () => ({ isHealthy: false }));
|
|
695
|
+
|
|
696
|
+
const status = await engine.computeStatus({ objective });
|
|
697
|
+
|
|
698
|
+
expect(status.hasOpenDowntime).toBe(true);
|
|
699
|
+
expect(mockService.getDowntimeForWindow).toHaveBeenCalledWith(
|
|
700
|
+
expect.objectContaining({ includeOpen: true }),
|
|
701
|
+
);
|
|
702
|
+
});
|
|
703
|
+
|
|
704
|
+
it("skips the health check entirely when there are no open events", async () => {
|
|
705
|
+
const objective = createObjective();
|
|
706
|
+
mockService = createMockService({ objectives: [objective] });
|
|
707
|
+
mockSignalService = createMockSignalService();
|
|
708
|
+
mockLogger = createMockLogger();
|
|
709
|
+
|
|
710
|
+
const healthCallback = mock(async () => ({ isHealthy: true }));
|
|
711
|
+
engine = new SloEngine({
|
|
712
|
+
service: mockService,
|
|
713
|
+
signalService: mockSignalService as never,
|
|
714
|
+
logger: mockLogger as never,
|
|
715
|
+
});
|
|
716
|
+
engine.setHealthStatusCallback(healthCallback);
|
|
717
|
+
|
|
718
|
+
await engine.computeStatus({ objective });
|
|
719
|
+
|
|
720
|
+
expect(healthCallback).not.toHaveBeenCalled();
|
|
721
|
+
expect(mockService.getDowntimeForWindow).toHaveBeenCalledWith(
|
|
722
|
+
expect.objectContaining({ includeOpen: false }),
|
|
723
|
+
);
|
|
724
|
+
});
|
|
725
|
+
});
|
|
726
|
+
|
|
727
|
+
describe("voidOrphanedDowntime", () => {
|
|
728
|
+
it("voids open events when the system is currently healthy (missed-recovery orphan)", async () => {
|
|
729
|
+
const objective = createObjective();
|
|
730
|
+
const orphan = createDowntimeEvent({ id: "orphan-1" });
|
|
731
|
+
mockService = createMockService({
|
|
732
|
+
objectives: [objective],
|
|
733
|
+
openEvents: [orphan],
|
|
734
|
+
});
|
|
735
|
+
mockSignalService = createMockSignalService();
|
|
736
|
+
mockLogger = createMockLogger();
|
|
737
|
+
|
|
738
|
+
engine = new SloEngine({
|
|
739
|
+
service: mockService,
|
|
740
|
+
signalService: mockSignalService as never,
|
|
741
|
+
logger: mockLogger as never,
|
|
742
|
+
});
|
|
743
|
+
engine.setHealthStatusCallback(async () => ({ isHealthy: true }));
|
|
744
|
+
|
|
745
|
+
await engine.voidOrphanedDowntime({ objective });
|
|
746
|
+
|
|
747
|
+
expect(mockService.deleteDowntimeEvent).toHaveBeenCalledWith({
|
|
748
|
+
id: "orphan-1",
|
|
749
|
+
});
|
|
750
|
+
});
|
|
751
|
+
|
|
752
|
+
it("keeps open events when the system is genuinely down", async () => {
|
|
753
|
+
const objective = createObjective();
|
|
754
|
+
const openEvent = createDowntimeEvent({ id: "evt-1" });
|
|
755
|
+
mockService = createMockService({
|
|
756
|
+
objectives: [objective],
|
|
757
|
+
openEvents: [openEvent],
|
|
758
|
+
});
|
|
759
|
+
mockSignalService = createMockSignalService();
|
|
760
|
+
mockLogger = createMockLogger();
|
|
761
|
+
|
|
762
|
+
engine = new SloEngine({
|
|
763
|
+
service: mockService,
|
|
764
|
+
signalService: mockSignalService as never,
|
|
765
|
+
logger: mockLogger as never,
|
|
766
|
+
});
|
|
767
|
+
engine.setHealthStatusCallback(async () => ({ isHealthy: false }));
|
|
768
|
+
|
|
769
|
+
await engine.voidOrphanedDowntime({ objective });
|
|
770
|
+
|
|
771
|
+
expect(mockService.deleteDowntimeEvent).not.toHaveBeenCalled();
|
|
772
|
+
});
|
|
773
|
+
|
|
774
|
+
it("does nothing when there are no open events", async () => {
|
|
775
|
+
const objective = createObjective();
|
|
776
|
+
mockService = createMockService({ objectives: [objective] });
|
|
777
|
+
mockSignalService = createMockSignalService();
|
|
778
|
+
mockLogger = createMockLogger();
|
|
779
|
+
|
|
780
|
+
const healthCallback = mock(async () => ({ isHealthy: true }));
|
|
781
|
+
engine = new SloEngine({
|
|
782
|
+
service: mockService,
|
|
783
|
+
signalService: mockSignalService as never,
|
|
784
|
+
logger: mockLogger as never,
|
|
785
|
+
});
|
|
786
|
+
engine.setHealthStatusCallback(healthCallback);
|
|
787
|
+
|
|
788
|
+
await engine.voidOrphanedDowntime({ objective });
|
|
789
|
+
|
|
790
|
+
expect(healthCallback).not.toHaveBeenCalled();
|
|
791
|
+
expect(mockService.deleteDowntimeEvent).not.toHaveBeenCalled();
|
|
792
|
+
});
|
|
568
793
|
});
|
|
569
794
|
|
|
570
795
|
describe("reconcileObjective", () => {
|
package/src/slo-engine.ts
CHANGED
@@ -83,6 +83,42 @@ export class SloEngine {
     );
   }
 
+  /**
+   * Void missed-recovery orphans: open downtime events on a system that is
+   * currently healthy. The edge-triggered close (on the health recovery
+   * transition) records real downtime accurately; this is the safety net for
+   * when that transition was never delivered (restart, dropped change, recovery
+   * before the SLO close path existed), which would otherwise leave an event
+   * open forever. `computeStatus` already ignores such rows for the budget
+   * (live health is authoritative), so this is row hygiene: it clears the stale
+   * "ongoing" event so it stops showing in history. We delete rather than close
+   * because the true recovery time is unknown and the system is healthy, so the
+   * unprovable downtime must not be counted.
+   *
+   * Runs in a write context (the daily job), never from a read accessor.
+   */
+  async voidOrphanedDowntime({
+    objective,
+  }: {
+    objective: { id: string; systemId: string };
+  }): Promise<void> {
+    const openEvents = await this.service.getOpenDowntimeEventsForObjective({
+      objectiveId: objective.id,
+    });
+    if (openEvents.length === 0) return;
+    if (!this._getSystemHealthStatus) return;
+
+    const health = await this._getSystemHealthStatus(objective.systemId);
+    if (!health.isHealthy) return; // genuinely down — the open event is real
+
+    for (const event of openEvents) {
+      await this.service.deleteDowntimeEvent({ id: event.id });
+    }
+    this.logger.info(
+      `SLO ${objective.id}: voided ${openEvents.length} orphaned open downtime event(s) — system is healthy but a recovery transition was missed`,
+    );
+  }
+
   // ===========================================================================
   // PERSPECTIVE 1: This system's own SLOs
   // ===========================================================================
@@ -300,10 +336,34 @@ export class SloEngine {
       now.getTime() - objective.windowDays * 24 * 60 * 60 * 1000,
     );
 
+    // LIVE HEALTH IS AUTHORITATIVE for "currently down". A stored open downtime
+    // event is only real ongoing downtime if the system is actually down right
+    // now - never trusted on its own. This makes the SLO numbers immune to a
+    // drifted/orphaned event log: a healthy system can never read breaching or
+    // degraded from a stale open row, by construction. The health check is
+    // gated on there being open events at all, so the common (no-open-event)
+    // path does no extra work. This method stays side-effect-free (the reactive
+    // `slo` entity reads through it); orphan rows are voided by the daily job.
+    const openEvents = await this.service.getOpenDowntimeEventsForObjective({
+      objectiveId: objective.id,
+    });
+    let currentlyDown: boolean;
+    if (openEvents.length === 0) {
+      currentlyDown = false;
+    } else if (this._getSystemHealthStatus) {
+      const health = await this._getSystemHealthStatus(objective.systemId);
+      currentlyDown = !health.isHealthy;
+    } else {
+      // Before afterPluginsReady wires the health callback, fall back to
+      // trusting the stored open state (best effort).
+      currentlyDown = true;
+    }
+
     const downtime = await this.service.getDowntimeForWindow({
       objectiveId: objective.id,
       windowStart,
       windowEnd: now,
+      includeOpen: currentlyDown,
     });
 
     const totalWindowMinutes = objective.windowDays * 24 * 60;
@@ -345,10 +405,16 @@ export class SloEngine {
 
       expectedConsumption > 0 ? consumedMinutes / expectedConsumption : null;
 
-    //
-
-
-
+    // "Degraded" (open downtime) requires BOTH that the system is currently
+    // down AND that an open event counts toward this objective's budget. In
+    // self-only mode an open upstream outage is excluded; and a stale open event
+    // on a now-healthy system (currentlyDown === false) never counts - so the
+    // SLO can never read available-and-degraded at once.
+    const budgetRelevantOpenEvents = currentlyDown
+      ? objective.dependencyExclusion === "strict"
+        ? openEvents
+        : openEvents.filter((event) => event.attributionType === "self")
+      : [];
 
     // Build attribution breakdown
     const attribution = downtime.entries.map((entry) => ({
@@ -376,7 +442,7 @@ export class SloEngine {
       burnRate,
       dependencyExclusion: objective.dependencyExclusion,
       isBreaching: effectiveAvailability !== null && effectiveAvailability < objective.target,
-      hasOpenDowntime:
+      hasOpenDowntime: budgetRelevantOpenEvents.length > 0,
       attribution,
     };
   }
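The `computeStatus` changes above combine two decisions: whether the system counts as currently down, and whether any open event is relevant to this objective's budget. The sketch below pulls both out as pure functions so the combinations covered by the new tests can be read off directly. `resolveCurrentlyDown` and `hasBudgetRelevantOpenDowntime` are hypothetical names for illustration; in the package this logic is inline in `SloEngine.computeStatus`.

```typescript
// liveIsHealthy is undefined when no health callback has been wired yet.
function resolveCurrentlyDown(
  openEventCount: number,
  liveIsHealthy: boolean | undefined,
): boolean {
  // No open events: nothing can be ongoing, and no health lookup is needed.
  if (openEventCount === 0) return false;
  // No callback yet: fall back to trusting the stored open state.
  if (liveIsHealthy === undefined) return true;
  // Otherwise live health decides.
  return !liveIsHealthy;
}

function hasBudgetRelevantOpenDowntime(args: {
  currentlyDown: boolean;
  dependencyExclusion: string;
  openAttributionTypes: string[];
}): boolean {
  // A healthy system never reports open downtime, whatever the event log says.
  if (!args.currentlyDown) return false;
  // Strict mode counts every open event.
  if (args.dependencyExclusion === "strict") {
    return args.openAttributionTypes.length > 0;
  }
  // Any other mode counts only self-attributed open events.
  return args.openAttributionTypes.some((type) => type === "self");
}

// Orphaned open event on a healthy system: not down, so not degraded.
const orphanDown = resolveCurrentlyDown(1, true);
const orphanDegraded = hasBudgetRelevantOpenDowntime({
  currentlyDown: orphanDown,
  dependencyExclusion: "self-only",
  openAttributionTypes: ["self"],
});

// Open upstream event in self-only mode before the callback is wired.
const upstreamOnlyDegraded = hasBudgetRelevantOpenDowntime({
  currentlyDown: resolveCurrentlyDown(1, undefined),
  dependencyExclusion: "self-only",
  openAttributionTypes: ["upstream"],
});
```

Keeping the health lookup behind the "any open events" check is what lets the common path skip the callback entirely, as the last `computeStatus` test in this diff asserts.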
package/src/slo-gitops-kinds.ts
CHANGED
@@ -9,6 +9,7 @@ import {
 import {
   DependencyExclusionModeSchema,
   BurnRateThresholdsSchema,
+  SloWindowDaysSchema,
 } from "@checkstack/slo-common";
 import type { SloService } from "./service";
 
@@ -30,7 +31,7 @@ const sloSpecSchema = z.object({
   systemRef: entityRefSchema,
   healthcheckRef: entityRefSchema.optional(),
   target: z.number().min(0).max(100),
-  windowDays:
+  windowDays: SloWindowDaysSchema,
   dependencyExclusion: DependencyExclusionModeSchema.optional(),
   excludedDependencyRefs: z.array(entityRefSchema).optional(),
   burnRateThresholds: BurnRateThresholdsSchema.optional(),
package/src/streak-calculator.ts
CHANGED
@@ -75,6 +75,11 @@ export async function runDailySnapshotJob(deps: {
 
   for (const objective of objectives) {
     try {
+      // Hygiene: clear missed-recovery orphans (open events on healthy systems)
+      // before snapshotting, so the trend reflects reality. computeStatus is
+      // already immune to such rows, but this stops them lingering in history.
+      await engine.voidOrphanedDowntime({ objective });
+
       const status = await engine.computeStatus({ objective });
 
       // 1. Persist daily snapshot