toga-ai 1.0.39 → 1.0.41
|
@@ -6,3 +6,4 @@
|
|
|
6
6
|
| [ClickUp Project & Opportunity Multi-List Routing](features/clickup-project-routing.md) | Routes ClickUp tasks into the correct **secondary multi-list memberships** based on their custom-field values, via the `clickup` webhook. | worker2/Worker/Clickup/Project.php, worker2/Worker/Clickup.php |
|
|
7
7
|
| [Creating Worker Actions](features/creating-worker-actions.md) | How to add a new callable Worker action — a PHP class whose `public static` methods are invoked as background jobs (via webhook, cron, or `_Worker::runTask()`). | worker2/Worker/, worker2/Controller/Index.php, _underscore/Worker.php |
|
|
8
8
|
| [Elite Freshservice Sync (worker2)](features/elite-freshservice-sync.md) | `_Worker_Elite` processes Freshservice webhook events and syncs them into TOGA 2. | worker2/Worker/Elite.php, worker2/Config/dev-kmaramreddy-laptop.ini |
|
|
9
|
+
| [Monitoring Framework (Orchestrator + Child Monitors)](features/monitoring-framework.md) | A unified, DB-driven monitoring framework for business-critical data flows (Compass POs, Prudential asset imports, AIG closed claims, …). | worker2/Worker/Monitor.php, worker2/Worker/Monitors/, worker2/Worker/Notification/Email.php, dbchanges2/Core/2026-05-21 - Monitors.sql |
|
|
@@ -0,0 +1,183 @@
|
|
|
1
|
+
---
|
|
2
|
+
title: Monitoring Framework (Orchestrator + Child Monitors)
|
|
3
|
+
framework: "2.0"
|
|
4
|
+
repo: worker2
|
|
5
|
+
project: Worker
|
|
6
|
+
client: shared
|
|
7
|
+
type: feature
|
|
8
|
+
status: active
|
|
9
|
+
updated: 2026-06-10
|
|
10
|
+
owners: [mhammontree]
|
|
11
|
+
files:
|
|
12
|
+
- worker2/Worker/Monitor.php
|
|
13
|
+
- worker2/Worker/Monitors/
|
|
14
|
+
- worker2/Worker/Notification/Email.php
|
|
15
|
+
- dbchanges2/Core/2026-05-21 - Monitors.sql
|
|
16
|
+
related:
|
|
17
|
+
- ../architecture.md
|
|
18
|
+
- ./creating-worker-actions.md
|
|
19
|
+
- ../../dbchanges2/architecture.md
|
|
20
|
+
---
|
|
21
|
+
|
|
22
|
+
## Summary
|
|
23
|
+
|
|
24
|
+
A unified, DB-driven monitoring framework for business-critical data flows (Compass POs,
|
|
25
|
+
Prudential asset imports, AIG closed claims, …). One orchestrator applies consistent
|
|
26
|
+
anti-flap logic and notification rules to any registered "is X healthy?" check. Replaces
|
|
27
|
+
scattered/ad-hoc monitoring where failures surfaced only when a client complained.
|
|
28
|
+
|
|
29
|
+
**Status (2026-06-10):** v1.0 framework + orchestrator landed; **first child monitor
|
|
30
|
+
pending**. Migration applied to local Core_2 only — **not yet on staging/production Core**
|
|
31
|
+
(coordinate before merge). Dashboard/acknowledgment layer under discussion (see Pending
|
|
32
|
+
scope).
|
|
33
|
+
|
|
34
|
+
Design principles: async isolation (one cron per monitor — a slow monitor can't block
|
|
35
|
+
others) · anti-flap on the recovery side only (N consecutive OKs before declaring
|
|
36
|
+
recovery) · immediate alert on first failure, throttled reminders, one recovery email per
|
|
37
|
+
incident · dumb-and-light children (all state/escalation/email logic lives once in the
|
|
38
|
+
orchestrator) · runtime config in the `Core.Monitors` table (no redeploy).
|
|
39
|
+
|
|
40
|
+
## Key files / entry points
|
|
41
|
+
|
|
42
|
+
| File | Purpose |
|
|
43
|
+
|---|---|
|
|
44
|
+
| `worker2/Worker/Monitor.php` | Orchestrator — `abstract class _Worker_Monitor`, action route `Monitor/Run` |
|
|
45
|
+
| `worker2/Worker/Monitors/<Name>.php` | Child classes `_Worker_Monitors_<Name>`, one per monitor |
|
|
46
|
+
| `worker2/Worker/Notification/Email.php` | Reused `_Worker_Notification_Email::Send` for all notification mail |
|
|
47
|
+
| `dbchanges2/Core/2026-05-21 - Monitors.sql` | `Core.Monitors` table definition + INSERT templates |
|
|
48
|
+
|
|
49
|
+
Reused with **no changes**: `Core.CronJobs`, `Core.WorkerJobs`, the `WorkerCronScheduler`
|
|
50
|
+
Lambda, the EB worker tier + SQS delivery (see [worker2 architecture](../architecture.md)).
|
|
51
|
+
|
|
52
|
+
## How it works
|
|
53
|
+
|
|
54
|
+
One `Core.CronJobs` row per monitor (`action = 'Monitor/Run'`, `parameters =
|
|
55
|
+
{"monitorId": N}`, schedule evaluated in **Central** time). Each tick, the orchestrator:
|
|
56
|
+
|
|
57
|
+
1. Fetches the `Core.Monitors` row by `monitorId` (not found → return message, no writes).
|
|
58
|
+
2. Guards `isActive` — if 0, returns early; no child invocation, no mail.
|
|
59
|
+
3. Invokes the child: `class_exists` / `method_exists` checks, then `$phpClass::Run()`
|
|
60
|
+
inside try/catch. **Any `Throwable` becomes `{isOk: false}`** with the exception
|
|
61
|
+
message — a child can never crash the orchestrator. The return must be an object with
|
|
62
|
+
`isOk` and `message`; anything else is treated as failure.
|
|
63
|
+
4. Runs the state machine (below) to compute new state, counter, and notification.
|
|
64
|
+
5. Sends mail if needed via `_Worker_Notification_Email::Send` with
|
|
65
|
+
`clientIdentifier: 'Core'` (infrastructure mail, never client-scoped).
|
|
66
|
+
6. Persists one UPDATE: `state`, `consecutiveOkCount`, `lastRunDt`, `lastResultMessage`,
|
|
67
|
+
plus `lastNotificationDt`/`lastNotificationType` when a notification was sent.
|
|
68
|
+
7. Returns a one-line summary that lands in `WorkerJobs` for diagnostics.
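Step 3 above (guarded child invocation plus result-shape validation) can be sketched as follows. This is a minimal Python illustration of the described behaviour, not the orchestrator's PHP code: `invoke_child`, the `SimpleNamespace` result type, and the "invalid result" message text are all assumptions made for the example.

```python
from types import SimpleNamespace


def invoke_child(run):
    """Call a child check and normalise any outcome into an isOk/message result.

    `run` is a hypothetical stand-in for the child's static Run() method.
    """
    try:
        result = run()
    except Exception as exc:  # stands in for PHP's catch (Throwable)
        # Any exception becomes a failed check carrying the exception message.
        return SimpleNamespace(isOk=False, message=str(exc))

    # The result must be an object exposing both properties; a plain dict
    # (the analogue of a bare PHP array) has neither attribute and fails here.
    if not (hasattr(result, "isOk") and hasattr(result, "message")):
        # Message wording is illustrative only.
        return SimpleNamespace(isOk=False, message="Child returned an invalid result")

    return SimpleNamespace(isOk=bool(result.isOk), message=str(result.message))
```

Because every path returns a well-formed result, the state machine downstream never has to handle a crash or a malformed value separately from an ordinary failed check.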
|
|
69
|
+
|
|
70
|
+
The orchestrator queries via `_Query($sql, _underscore::DB_CORE)` and escapes interpolated values with `_Database::escape()`.
|
|
71
|
+
|
|
72
|
+
### State machine (the agreed contract)
|
|
73
|
+
|
|
74
|
+
| Previous | Current | New state | Counter | Notification |
|
|
75
|
+
|---|---|---|---|---|
|
|
76
|
+
| ok | ok | ok | reset to 0 | none |
|
|
77
|
+
| ok | alert | **alert** | reset to 0 | **alert** — always, immediate |
|
|
78
|
+
| alert | alert | alert | unchanged (0) | **reminder** — only if `reminderFrequencyMinutes` elapsed since `lastNotificationDt` |
|
|
79
|
+
| alert | ok | alert until counter hits threshold | +1 | **recovery** — only when `consecutiveOkCount >= requiredConsecutiveOks`; on send, flip to ok and reset counter |
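The transition table above can be read as a pure function of the previous state, the current check result, and the counter. The Python sketch below is an illustration under stated assumptions, not the PHP implementation: the names `Transition` and `step` are hypothetical, `reminder_due` is assumed to be computed elsewhere from `reminderFrequencyMinutes` and `lastNotificationDt`, and the counter is assumed to reset to 0 on any failed run while in alert (one reading of "unchanged (0)" that keeps the OKs truly consecutive).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transition:
    state: str                   # new confirmed state: "ok" or "alert"
    counter: int                 # new consecutiveOkCount
    notification: Optional[str]  # "alert", "reminder", "recovery", or None


def step(prev_state, is_ok, counter, required_oks, reminder_due):
    """Compute the next state, counter, and notification for one tick."""
    if prev_state == "ok":
        if is_ok:
            return Transition("ok", 0, None)
        # First failure alerts immediately; no anti-flap on the alarm side.
        return Transition("alert", 0, "alert")

    # prev_state == "alert"
    if not is_ok:
        # Assumption: a failed run while in alert leaves the counter at 0.
        return Transition("alert", 0, "reminder" if reminder_due else None)

    counter += 1
    if counter >= required_oks:
        # Recovery is not gated on the reminder window.
        return Transition("ok", 0, "recovery")
    return Transition("alert", counter, None)
```

With `required_oks = 2`, a single OK run during an incident only advances the counter; the second consecutive OK sends the recovery and flips the state back to ok.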
|
|
80
|
+
|
|
81
|
+
Three **intentional product decisions** (deviations from the original meeting summary —
|
|
82
|
+
if you change any, update the plan doc and announce in #engineering):
|
|
83
|
+
1. No anti-flap on the alarm side — first failure alerts immediately.
|
|
84
|
+
2. Recovery fires only on an actual alert → ok transition (the meeting's "ok→ok sends
|
|
85
|
+
recovery" branch was folded out).
|
|
86
|
+
3. Recovery is **not** gated on the reminder window — gating could swallow the
|
|
87
|
+
"all clear" email entirely.
|
|
88
|
+
|
|
89
|
+
Reminder-window math uses PHP server time vs `lastNotificationDt` written via DB `NOW()`
|
|
90
|
+
— consistent on a single DB server, **not explicitly UTC**; revisit if a cross-region DB
|
|
91
|
+
is ever introduced.
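The reminder-window check described above amounts to a single elapsed-time comparison. The Python sketch below is illustrative only: `reminder_due` is a hypothetical helper, the inclusive `>=` boundary is an assumption, and so is the choice to treat a missing `lastNotificationDt` as "due". Both timestamps are naive datetimes assumed to come from clocks in the same time zone, mirroring the single-server caveat.

```python
from datetime import datetime, timedelta


def reminder_due(last_notification_dt, reminder_frequency_minutes, now):
    """Return True when enough time has passed since the last notification."""
    if last_notification_dt is None:
        # Assumption: with no recorded notification, a reminder is allowed.
        return True
    window = timedelta(minutes=reminder_frequency_minutes)
    return now - last_notification_dt >= window
```

If the application server and the database ever disagree on the clock or time zone, this comparison silently shifts the reminder cadence, which is why the note above flags cross-region setups.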
|
|
92
|
+
|
|
93
|
+
### Notifications
|
|
94
|
+
|
|
95
|
+
Subjects: `[Monitor ALERT|REMINDER|RECOVERED] <monitor name>`. Plain-text body with
|
|
96
|
+
monitor name, state, notification type, timestamp, and the child's message verbatim.
|
|
97
|
+
From `donotreply@goagilant.com`; recipients parsed from `Monitors.peopleToNotify` (JSON
|
|
98
|
+
array of emails — empty array means no mail sent, no error raised).
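The subject format and recipient handling above can be sketched in a few lines. This Python illustration is an assumption-laden sketch rather than the worker's code: `build_subject` and `parse_recipients` are hypothetical names, the mapping from notification type to subject tag is inferred from the subject pattern, and returning an empty list for a non-array value is an assumption.

```python
import json

# Inferred mapping from lastNotificationType values to subject tags.
SUBJECT_TAGS = {"alert": "ALERT", "reminder": "REMINDER", "recovery": "RECOVERED"}


def build_subject(notification_type, monitor_name):
    """Format a subject such as "[Monitor ALERT] <monitor name>"."""
    return f"[Monitor {SUBJECT_TAGS[notification_type]}] {monitor_name}"


def parse_recipients(people_to_notify_json):
    """Decode the peopleToNotify JSON array into a list of addresses.

    An empty array yields an empty list, which the caller treats as
    "send nothing" rather than as an error.
    """
    recipients = json.loads(people_to_notify_json) if people_to_notify_json else []
    if not isinstance(recipients, list):
        return []  # assumption: anything other than an array means no recipients
    return [r for r in recipients if isinstance(r, str) and r.strip()]
```

A caller would skip the send entirely when `parse_recipients` returns an empty list, matching the "no mail sent, no error raised" behaviour.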
|
|
99
|
+
|
|
100
|
+
## Data model
|
|
101
|
+
|
|
102
|
+
### `Core.Monitors` (new)
|
|
103
|
+
|
|
104
|
+
| Column | Type | Purpose |
|
|
105
|
+
|---|---|---|
|
|
106
|
+
| `id` / `uuid` | INT UNSIGNED PK / char(36) UNIQUE | Standard identifiers |
|
|
107
|
+
| `isActive` | TINYINT default 1 | 0 = orchestrator returns early |
|
|
108
|
+
| `dtCreated` | DATETIME | Record create time |
|
|
109
|
+
| `name` | VARCHAR(255) | Human-readable; appears in email subjects |
|
|
110
|
+
| `phpClass` | VARCHAR(255) | Child class, e.g. `_Worker_Monitors_Compass` |
|
|
111
|
+
| `state` | ENUM('ok','alert') default 'ok' | **Confirmed** current state |
|
|
112
|
+
| `consecutiveOkCount` | INT UNSIGNED default 0 | Anti-flap counter (alert→ok only) |
|
|
113
|
+
| `requiredConsecutiveOks` | TINYINT UNSIGNED default 2 | Threshold to flip alert→ok |
|
|
114
|
+
| `reminderFrequencyMinutes` | SMALLINT UNSIGNED default 60 | Reminder cadence while in alert |
|
|
115
|
+
| `peopleToNotify` | JSON | Array of email addresses |
|
|
116
|
+
| `lastRunDt` / `lastResultMessage` | DATETIME / TEXT | Last run + last child message (or exception text) |
|
|
117
|
+
| `lastNotificationDt` / `lastNotificationType` | DATETIME / ENUM('alert','reminder','recovery') | Most recent notification |
|
|
118
|
+
|
|
119
|
+
Index `Monitors_isActive_IDX (isActive)`. `Core.CronJobs` reused as-is — one row per monitor.
|
|
120
|
+
|
|
121
|
+
## Child monitor contract
|
|
122
|
+
|
|
123
|
+
- File `worker2/Worker/Monitors/<Name>.php`, declared as `abstract class _Worker_Monitors_<Name>`
|
|
124
|
+
with `public static function Run(): object` returning
|
|
125
|
+
`(object)['isOk' => bool, 'message' => string]` — matches the worker2 action convention.
|
|
126
|
+
- `Run()` takes **no parameters**; all per-monitor config lives in the `Monitors` row.
|
|
127
|
+
- **Don't**: send notifications, write to `Monitors`, implement escalation, or wrap
|
|
128
|
+
everything in try/catch (the orchestrator catches all `Throwable`s — bubbling up is the
|
|
129
|
+
simplest way to fail loudly).
|
|
130
|
+
- **Do**: check exactly one signal per monitor; put actionable context in `message`
|
|
131
|
+
(recipients see it verbatim); register any non-Core DB connection inline at the top of
|
|
132
|
+
`Run()` — the framework does **not** call `initialize()` on child monitors.
|
|
133
|
+
|
|
134
|
+
### Adding a new monitor (runbook)
|
|
135
|
+
|
|
136
|
+
1. Write the child class (one check, returns `{isOk, message}`).
|
|
137
|
+
2. INSERT the `Core.Monitors` row (uuid, name, phpClass, thresholds, peopleToNotify).
|
|
138
|
+
3. INSERT the `Core.CronJobs` row (`action = 'Monitor/Run'`,
|
|
139
|
+
`parameters = JSON_OBJECT('monitorId', <id>)`).
|
|
140
|
+
4. Verify end-to-end in dev (8-step checklist: migration sanity, cron registration,
|
|
141
|
+
first-failure alert, reminder cadence, anti-flap recovery, exception path, inactive
|
|
142
|
+
monitor, missing class — all must pass before production).
|
|
143
|
+
5. Deploy class to staging+production worker2; apply the two INSERTs to each Core.
|
|
144
|
+
|
|
145
|
+
No Lambda, SQS, or orchestrator changes needed per monitor.
|
|
146
|
+
|
|
147
|
+
### Operational notes
|
|
148
|
+
|
|
149
|
+
- Force a recheck: set `lastRunDt = NULL`, or POST `{"action":"Monitor/Run","parameters":{"monitorId":N}}` to the worker2 test endpoint (debug path, bypasses WorkerJobs).
|
|
150
|
+
- Pause a monitor: `isActive = 0` (cron row can stay active). Pause email only: `peopleToNotify = JSON_ARRAY()`.
|
|
151
|
+
- Manual state reset: clear `state`/`consecutiveOkCount`/`lastNotificationDt`/`lastNotificationType` — sparingly; the state machine self-corrects.
|
|
152
|
+
|
|
153
|
+
## Client variations
|
|
154
|
+
|
|
155
|
+
None — the framework is shared Core infrastructure. Individual monitors target specific
|
|
156
|
+
clients' data flows (Compass, Prudential, AIG, …) but live as separate child classes.
|
|
157
|
+
|
|
158
|
+
## Gotchas / known issues
|
|
159
|
+
|
|
160
|
+
- **First child monitor not yet built** — `worker2/Worker/Monitors/` does not exist yet
|
|
161
|
+
(as of 2026-06-10).
|
|
162
|
+
- **Migration not applied to staging/production Core** — only local Core_2. Coordinate
|
|
163
|
+
before merge.
|
|
164
|
+
- `worker2/MONITORING_PLAN.md` is referenced by the design doc but **missing on disk** —
|
|
165
|
+
stale reference to resolve.
|
|
166
|
+
- Child return value must be an **object** with both `isOk` and `message`; a bare array
|
|
167
|
+
or missing property is treated as failure by design.
|
|
168
|
+
- Reminder math is server-time based and not explicitly UTC; see the state-machine section.
|
|
169
|
+
|
|
170
|
+
## Pending scope (NOT implemented)
|
|
171
|
+
|
|
172
|
+
- **Dashboard + acknowledgments** — ack columns on `Monitors` (or a history table),
|
|
173
|
+
reminder gate becomes "elapsed AND not acknowledged", recovery clears the ack. Open
|
|
174
|
+
questions: ack survival across flap-back, auto-expiry, which auth system owns dashboard
|
|
175
|
+
identity. Owner: mhammontree, pending team discussion.
|
|
176
|
+
- Slack / SMS / PagerDuty (email-only today) · `MonitorRuns` history table ·
|
|
177
|
+
HTML email + dashboard deep-links · anti-flap on the alarm side.
|
|
178
|
+
|
|
179
|
+
## Related docs
|
|
180
|
+
|
|
181
|
+
- [worker2 architecture](../architecture.md) — cron/WorkerJobs pipeline the orchestrator rides on
|
|
182
|
+
- [Creating Worker Actions](./creating-worker-actions.md) — the action class/method conventions child monitors follow
|
|
183
|
+
- [dbchanges2 architecture](../../dbchanges2/architecture.md) — migration naming/ordering rules for the Monitors table change
|
package/knowledge/INDEX.md
CHANGED
|
@@ -10,7 +10,7 @@ _Auto-generated by `knowledge.js index`. Do not hand-edit._
|
|
|
10
10
|
## 2.0 framework
|
|
11
11
|
|
|
12
12
|
- **_underscore** (_Underscore) _(framework core)_ — 3 doc(s) → [2.0/apps/_underscore/INDEX.md](2.0/apps/_underscore/INDEX.md)
|
|
13
|
-
- **worker2** (Worker) —
|
|
13
|
+
- **worker2** (Worker) — 5 doc(s) → [2.0/apps/worker2/INDEX.md](2.0/apps/worker2/INDEX.md)
|
|
14
14
|
- **api2** (API) — 1 doc(s) → [2.0/apps/api2/INDEX.md](2.0/apps/api2/INDEX.md)
|
|
15
15
|
- **dbchanges2** (Database Changes) _(framework core)_ — 1 doc(s) → [2.0/apps/dbchanges2/INDEX.md](2.0/apps/dbchanges2/INDEX.md)
|
|
16
16
|
- **toga2-supply** (TOGa Supply) — 2 doc(s) → [2.0/apps/toga2-supply/INDEX.md](2.0/apps/toga2-supply/INDEX.md)
|
package/package.json
CHANGED
package/skills/capture/SKILL.md
CHANGED
|
@@ -86,10 +86,16 @@ every doc in this capture — call the result `AUTHOR_USERNAME`:
|
|
|
86
86
|
3. **Persist it** to Claude memory as a `user` memory named `author-username` (and add its
|
|
87
87
|
`MEMORY.md` pointer) so future captures never ask again.
|
|
88
88
|
|
|
89
|
-
|
|
90
|
-
|
|
91
|
-
|
|
92
|
-
|
|
89
|
+
`owners` is a **list of everyone who has worked on the doc — it may hold several names**,
|
|
90
|
+
not just one. Always include `AUTHOR_USERNAME`, and treat the field as cumulative:
|
|
91
|
+
|
|
92
|
+
- **New doc:** start `owners` with `["<AUTHOR_USERNAME>"]`.
|
|
93
|
+
- **Existing doc:** **union** — keep all existing owners and add `AUTHOR_USERNAME` if not
|
|
94
|
+
already present. Never replace or drop an existing owner.
|
|
95
|
+
- **Co-authors:** if more than one person worked on this feature (e.g. pairing, or a handoff
|
|
96
|
+
this session), ask "Anyone else who should be listed as an owner? (usernames, comma-separated,
|
|
97
|
+
same first-initial+last-name convention)" and add each — deduplicated, lowercase — to the list.
|
|
98
|
+
- Never leave `owners` empty.
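The cumulative `owners` rules above behave like an order-preserving set union. The Python sketch below is one way to express them under stated assumptions: `merge_owners` is a hypothetical helper, and the usernames in the usage example are made up.

```python
def merge_owners(existing, author, co_authors=()):
    """Return the updated owners list for a doc.

    Existing owners are kept in their original order; the author and any
    co-authors are appended lowercase and deduplicated.
    """
    merged = list(existing)  # never replace or drop an existing owner
    for name in (author, *co_authors):
        name = name.strip().lower()
        if name and name not in merged:
            merged.append(name)
    if not merged:
        raise ValueError("owners must not be empty")
    return merged


# Hypothetical usage: a new doc, then an existing doc with a co-author.
print(merge_owners([], "mhammontree"))
print(merge_owners(["jdoe"], "mhammontree", ["JDoe ", "asmith"]))
```

Appending rather than sorting keeps the original author first, which makes the history of the list easier to read.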
|
|
93
99
|
|
|
94
100
|
## Step 2 — Determine framework / repo / project (ask if unknown)
|
|
95
101
|
|