@cronicorn/mcp-server 1.18.3 → 1.19.1

@@ -8,384 +8,188 @@ mcp:
  uri: file:///docs/technical/how-ai-adaptation-works.md
  mimeType: text/markdown
  priority: 0.85
- lastModified: 2025-11-02T00:00:00Z
+ lastModified: 2026-02-03T00:00:00Z
  ---

  # How AI Adaptation Works

- This document explains how the AI Planner analyzes endpoint execution patterns and suggests schedule adjustments. If you haven't read [System Architecture](./system-architecture.md), start there for context on the dual-worker design.
+ **TL;DR:** The AI Planner analyzes endpoint execution patterns and writes time-bounded hints to adjust schedules. It has four action tools (propose_interval, propose_next_time, pause_until, clear_hints) and three query tools for response data. Structure your response bodies with metrics the AI should monitor.
+
+ This document explains how the AI Planner suggests schedule adjustments. See [System Architecture](./system-architecture.md) for context on the dual-worker design.

  ## The AI Planner's Job

- The AI Planner worker has one responsibility: analyze endpoint execution patterns and suggest scheduling adjustments by writing hints to the database. It doesn't execute jobs, manage locks, or worry about reliability. It observes and recommends.
+ The AI Planner worker analyzes endpoint execution patterns and suggests scheduling adjustments by writing hints to the database. It doesn't execute jobs or manage locks—it observes and recommends.

- The AI Planner runs independently from the Scheduler. Typically it wakes up every 5 minutes (configurable) and analyzes recently active endpoints. The analysis happens asynchronously—the Scheduler doesn't wait for it.
+ The AI Planner runs independently from the Scheduler, typically waking up every 5 minutes to analyze recently active endpoints.

  ## Discovery: Finding Endpoints to Analyze

- The AI Planner doesn't analyze every endpoint on every cycle. That would be expensive (AI API costs) and unnecessary (most endpoints are stable).
-
- Instead, it uses **smart scheduling** where the AI controls when it needs to analyze again:
+ The AI Planner uses **smart scheduling** where it controls when to analyze again:

- 1. Query the database for endpoints that ran recently
- 2. Check if the endpoint is due for analysis based on:
- - **First analysis**: New endpoints that have never been analyzed
+ 1. Query the database for recently active endpoints
+ 2. Check if the endpoint is due for analysis:
+ - **First analysis**: New endpoints never analyzed
  - **Scheduled time**: AI-requested re-analysis time has passed
  - **State change**: New failures since last analysis (triggers immediate re-analysis)
  3. Skip endpoints where none of these conditions are met

- This approach lets the AI decide its own analysis frequency:
- - Stable endpoints: "Check again in 4 hours"
- - Incidents: "Check again in 5 minutes"
- - Very stable daily jobs: "Check again in 24 hours"
-
- The AI communicates this via the `next_analysis_in_ms` parameter in `submit_analysis` (see Tools section).
-
- ## What the AI Sees: Building Context
-
- For each endpoint, the AI Planner builds a comprehensive analysis prompt containing:
-
- ### Current Configuration
+ The AI communicates its preferred re-analysis time via `next_analysis_in_ms` in `submit_analysis`.
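
The three discovery conditions can be sketched as a single predicate. This is an illustrative sketch; the field names (`lastAnalyzedAt`, `nextAnalysisAt`, `failureCountAtLastAnalysis`) are assumptions, not the actual schema:

```typescript
// Sketch of the "due for analysis" check described above.
// Field names are illustrative; the real schema may differ.
interface EndpointAnalysisState {
  lastAnalyzedAt: Date | null;        // null => never analyzed
  nextAnalysisAt: Date | null;        // AI-requested re-analysis time
  failureCount: number;               // current consecutive failures
  failureCountAtLastAnalysis: number; // snapshot from the last session
}

function isDueForAnalysis(s: EndpointAnalysisState, now: Date): boolean {
  if (s.lastAnalyzedAt === null) return true;                 // first analysis
  if (s.nextAnalysisAt !== null &&
      s.nextAnalysisAt.getTime() <= now.getTime()) return true; // scheduled time passed
  if (s.failureCount > s.failureCountAtLastAnalysis) return true; // state change
  return false;                                               // skip: stable
}
```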
 
- - **Baseline schedule**: Cron expression or interval
- - **Constraints**: Min/max intervals
- - **Pause state**: Whether paused and until when
- - **Active hints**: Current AI interval/one-shot hints and their expiration times
- - **Failure count**: Number of consecutive failures (affects exponential backoff)
+ ## What the AI Sees

- This tells the AI what scheduling behavior is currently in effect.
+ For each endpoint, the AI receives:

- ### Recent Performance (Multi-Window Health)
+ ### Configuration
+ - Baseline schedule (cron or interval)
+ - Constraints (min/max intervals)
+ - Pause state and active hints
+ - Failure count

- The AI sees health metrics across **three time windows** to accurately detect recovery patterns:
+ ### Multi-Window Health Metrics

  | Window | Metrics |
- |--------|--------|
+ |--------|---------|
  | **Last 1 hour** | Success rate, run count |
  | **Last 4 hours** | Success rate, run count |
  | **Last 24 hours** | Success rate, run count |

- Plus:
- - **Average duration**: Mean execution time
- - **Failure streak**: Consecutive failures (signals degradation)
+ Plus average duration and failure streak.

- **Why multiple windows matter**: A single 24-hour window can be misleading during recovery. If an endpoint failed at high frequency (every 5 seconds for 2 hours = 1,440 failures) and then recovered at normal frequency (every 5 minutes for 6 hours = 72 successes), the 24-hour rate shows 4.8% success even though recent performance is 100%.
-
- With multi-window health, the AI sees:
- - Last 1h: 100% success (12 runs)
- - Last 4h: 85% success (40 runs)
- - Last 24h: 32% success (500 runs) ← skewed by old failures
-
- This tells the AI "endpoint has recovered" rather than "endpoint is still failing."
+ Multiple windows matter for recovery detection. Skewed by old failures, a 24-hour window can show 5% success even when recent performance is 100%.
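
The recovery-skew effect is easy to reproduce by computing the success rate over each window from the same run log. A sketch (the `Run` shape is illustrative, not the stored record format):

```typescript
// Illustrative per-window success rates from a run log, showing how a
// skewed 24h rate can coexist with a fully healthy last hour.
interface Run { at: Date; ok: boolean; }

function successRate(runs: Run[], windowMs: number, now: Date): number | null {
  const inWindow = runs.filter(r => now.getTime() - r.at.getTime() <= windowMs);
  if (inWindow.length === 0) return null; // no data in this window
  return inWindow.filter(r => r.ok).length / inWindow.length;
}
```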
 
  ### Response Body Data

- This is the key to intelligent adaptation. Every execution records the **response body**—the JSON returned by your endpoint.
-
- The AI can query response data in three ways:
-
- 1. **Latest response**: Check current state (one data point)
- 2. **Response history**: Identify trends over time (multiple data points)
+ The AI can query response data three ways:
+ 1. **Latest response**: Current state snapshot
+ 2. **Response history**: Trends over time
  3. **Sibling responses**: Coordinate across endpoints in the same job

- The response body can contain any JSON structure you design. The AI looks for signals indicating:
-
- - **Load indicators**: queue_depth, pending_count, backlog_size
- - **Performance metrics**: latency, p95, processing_time
- - **Error rates**: error_count, failure_rate, error_pct
- - **Health flags**: healthy, available, status, state
- - **Coordination signals**: ready_for_processing, dependency_status
- - **Timestamps**: last_success_at, completed_at (detect staleness)
-
- Structure your response bodies to include the metrics that matter for your use case.
-
  ### Job Context
-
- If the endpoint belongs to a job, the AI sees:
-
- - **Job description**: High-level intent (e.g., "Monitors payment queue and triggers processing when depth exceeds threshold")
- - **Sibling endpoint names**: Other endpoints in the same job (e.g., "3 endpoints [API Monitor, Data Fetcher, Notifier]")
-
- Knowing sibling names helps the AI:
- - Understand the endpoint is part of a larger workflow
- - Decide when to check sibling responses for coordination
- - Make informed decisions about the `get_sibling_latest_responses` tool
-
- The AI uses job context to interpret what "good" vs "bad" looks like for specific metrics. A growing queue_depth might be normal for a collector endpoint but alarming for a processor endpoint.
+ - Job description (high-level intent)
+ - Sibling endpoint names

  ## Session Constraints

- Each AI analysis session has resource limits to prevent runaway costs:
-
- - **Maximum 15 tool calls** per session (hard limit)
- - **10 history records** is usually sufficient for trend analysis
- - Sessions that hit the limit are terminated
+ Each analysis session is limited to prevent runaway costs:
+ - **Maximum 15 tool calls** per session
+ - **10 history records** is usually sufficient
+ - Sessions hitting the limit are terminated

- These constraints prevent the worst-case scenario: an AI session that paginates through hundreds of identical failure records, consuming 42K+ tokens for a decision reachable in 5 tool calls.
-
- The AI is informed of these limits and prioritizes the most valuable queries.
-
- ## The Four Tools: How AI Takes Action
-
- The AI Planner doesn't write to the database directly. Instead, it has access to **four action tools** that write hints:
+ ## The Four Action Tools

  ### 1. propose_interval

- **Purpose**: Adjust how frequently the endpoint runs
+ Adjust how frequently the endpoint runs.

- **Parameters**:
- - `intervalMs`: New interval in milliseconds
- - `ttlMinutes`: How long the hint is valid (default: 60 minutes)
- - `reason`: Optional explanation
+ **Parameters**: `intervalMs`, `ttlMinutes` (default: 60), `reason`

  **When to use**:
  - Tighten monitoring during load spikes (5min → 30sec)
  - Relax during stability (1min → 10min to save resources)
- - Override exponential backoff after recovery (restore normal cadence)
-
- **Example**:
- ```
- AI sees queue_depth growing: 50 → 100 → 200 over last 10 runs
- Action: propose_interval(30000, ttl=15, reason="Growing queue requires tighter monitoring")
- Effect: Runs every 30 seconds for 15 minutes, then reverts to baseline
- ```
+ - Override exponential backoff after recovery

- **How it works**:
- 1. AI calls the tool with parameters
- 2. Tool writes `aiHintIntervalMs` and `aiHintExpiresAt` to database
- 3. Tool calls `setNextRunAtIfEarlier(now + intervalMs)` to apply immediately (nudging)
- 4. Scheduler's next tick reads the hint, Governor uses it to calculate next run
- 5. After TTL expires, hint is ignored and baseline resumes
+ **How it works**: Writes `aiHintIntervalMs` and `aiHintExpiresAt` to database, then nudges `nextRunAt` to apply immediately.
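
That write-then-nudge sequence can be sketched against an in-memory record. The real tool persists to the database; only the field names `aiHintIntervalMs`, `aiHintExpiresAt`, and `nextRunAt` come from the doc, the rest is illustrative:

```typescript
// Illustrative propose_interval: write hint fields, then nudge nextRunAt.
// In the real system these are database writes; this is an in-memory sketch.
interface EndpointRecord {
  aiHintIntervalMs: number | null;
  aiHintExpiresAt: Date | null;
  nextRunAt: Date;
}

function proposeInterval(ep: EndpointRecord, intervalMs: number, ttlMinutes: number, now: Date): void {
  ep.aiHintIntervalMs = intervalMs;
  ep.aiHintExpiresAt = new Date(now.getTime() + ttlMinutes * 60_000);
  // Nudge: move nextRunAt earlier if the hinted time beats the current one.
  const hinted = new Date(now.getTime() + intervalMs);
  if (hinted.getTime() < ep.nextRunAt.getTime()) {
    ep.nextRunAt = hinted;
  }
}
```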
 
  ### 2. propose_next_time

- **Purpose**: Schedule a specific one-time execution
+ Schedule a specific one-time execution.

- **Parameters**:
- - `nextRunAtIso`: ISO 8601 timestamp for next run
- - `ttlMinutes`: How long the hint is valid (default: 30 minutes)
- - `reason`: Optional explanation
+ **Parameters**: `nextRunAtIso`, `ttlMinutes` (default: 30), `reason`

  **When to use**:
- - Run immediately to investigate a failure (now)
- - Defer to off-peak hours (specific timestamp)
- - Coordinate with external events (batch completion time)
-
- **Example**:
- ```
- AI sees failure spike (success rate drops to 45%)
- Action: propose_next_time(now, ttl=5, reason="Investigate failure spike")
- Effect: Runs immediately once, then resumes baseline schedule
- ```
-
- **How it works**:
- 1. AI calls the tool with a timestamp
- 2. Tool writes `aiHintNextRunAt` and `aiHintExpiresAt` to database
- 3. Tool calls `setNextRunAtIfEarlier(timestamp)` to apply immediately
- 4. Scheduler claims endpoint when `nextRunAt` arrives
- 5. After execution or TTL expiry, hint is cleared and baseline resumes
+ - Run immediately to investigate a failure
+ - Defer to off-peak hours
+ - Coordinate with external events

  ### 3. pause_until

- **Purpose**: Stop execution temporarily or resume
+ Stop execution temporarily or resume.

- **Parameters**:
- - `untilIso`: ISO 8601 timestamp to pause until, or `null` to resume
- - `reason`: Optional explanation
+ **Parameters**: `untilIso` (or `null` to resume), `reason`

  **When to use**:
- - Dependency is down (pause until it recovers)
- - Rate limit detected (pause for cooldown period)
- - Maintenance window (pause until completion)
- - Resume after manual pause (pass `null`)
-
- **Example**:
- ```
- AI sees responseBody: { dependency_status: "unavailable" }
- Action: pause_until("2025-11-02T15:30:00Z", reason="Dependency unavailable")
- Effect: No executions until 3:30 PM, then resumes baseline
- ```
-
- **How it works**:
- 1. AI calls the tool with a timestamp (or `null`)
- 2. Tool writes `pausedUntil` to database
- 3. Scheduler's Governor checks pause state—if `pausedUntil > now`, returns that timestamp with source `"paused"`
- 4. When pause time passes, Governor resumes normal scheduling
+ - Dependency is down
+ - Rate limit detected
+ - Maintenance window

  ### 4. clear_hints

- **Purpose**: Reset endpoint to baseline schedule by clearing all AI hints
+ Reset endpoint to baseline schedule.

- **Parameters**:
- - `reason`: Explanation for clearing hints
+ **Parameters**: `reason`

  **When to use**:
- - AI hints are no longer relevant (situation changed)
+ - AI hints no longer relevant
  - Manual intervention resolved the issue
- - False positive detection (AI over-reacted)
- - Want to revert to baseline without waiting for TTL expiry
-
- **Example**:
- ```
- AI sees endpoint recovered but has aggressive 30s interval hint active
- Action: clear_hints(reason="Endpoint recovered, reverting to baseline")
- Effect: AI hints cleared immediately, baseline schedule resumes
- ```
-
- **How it works**:
- 1. AI calls the tool with a reason
- 2. Tool clears `aiHintIntervalMs`, `aiHintNextRunAt`, `aiHintExpiresAt`
- 3. Next execution uses baseline schedule
-
- ## Query Tools: Informing Decisions
-
- Before taking action, the AI can query response data using **three query tools**:
-
- ### 1. get_latest_response
-
- **Returns**: Most recent response body, timestamp, status
-
- **Use case**: Check current state snapshot
-
- Example: "What's the current queue depth?"
-
- ### 2. get_response_history
-
- **Parameters**:
- - `limit`: Number of responses (1-10, default 10)
- - `offset`: Skip N newest responses for pagination
-
- **Returns**: Array of response bodies with timestamps, newest first
-
- **Use case**: Identify trends over time
-
- Example: "Is queue_depth increasing monotonically?"
+ - Revert to baseline without waiting for TTL

- **Note**: 10 records is usually sufficient for trend analysis. The AI is encouraged not to paginate endlessly—if patterns are unclear after 10-20 records, more data rarely helps.
+ ## Query Tools

- Response bodies are truncated at 1000 characters to prevent token overflow.
+ ### get_latest_response
+ Returns most recent response body, timestamp, status.

- ### 3. get_sibling_latest_responses
+ ### get_response_history
+ **Parameters**: `limit` (1-10), `offset`

- **Returns**: For each sibling endpoint in the same job:
- - **Response data**: Latest response body, timestamp, status
- - **Schedule info**: Baseline, next run, last run, pause status, failure count
- - **AI hints**: Active interval/one-shot hints with expiry and reason
+ Returns array of response bodies with timestamps. 10 records is usually sufficient.

- **Use case**: Coordinate across endpoints with full context
+ ### get_sibling_latest_responses
+ Returns for each sibling endpoint:
+ - Response data (body, timestamp, status)
+ - Schedule info (baseline, next run, pause status)
+ - Active AI hints with expiry and reason

- Example: "Is the upstream endpoint healthy and running normally, or is it paused/failing?"
+ ## Hint Mechanics

- The enriched response allows the AI to understand sibling state at a glance:
- ```json
- {
- "endpointId": "ep_456",
- "endpointName": "Data Fetcher",
- "responseBody": { "batch_ready": true },
- "timestamp": "2025-11-02T14:25:00Z",
- "status": "success",
- "schedule": {
- "baseline": "every 5 minutes",
- "nextRunAt": "2025-11-02T14:30:00Z",
- "lastRunAt": "2025-11-02T14:25:00Z",
- "isPaused": false,
- "failureCount": 0
- },
- "aiHints": null
- }
- ```
-
- Only useful for workflow endpoints (multiple endpoints in the same job that coordinate).
-
- ## Hint Mechanics: TTLs and Expiration
-
- All AI hints have **time-to-live (TTL)**—they expire automatically. This is a safety mechanism.
-
- **Why TTLs matter**:
- - If the AI makes a bad decision (too aggressive, too relaxed), it auto-corrects
- - Temporary conditions (spikes, failures) don't permanently alter schedules
- - You can experiment with aggressive hints knowing they'll revert
-
- **TTL strategy**:
- - **Short (5-15 minutes)**: Transient spikes, immediate investigations
- - **Medium (30-60 minutes)**: Operational shifts, business hours patterns
- - **Long (2-4 hours)**: Sustained degradation, maintenance windows
-
- When a hint expires (`aiHintExpiresAt <= now`):
- 1. Scheduler's Governor ignores it during next run calculation
- 2. Baseline schedule resumes
- 3. Hint fields remain in database (for debugging) until next execution clears them
-
- ## Override vs. Compete: Hint Semantics
-
- Understanding how hints interact with baseline is critical:
+ ### TTLs and Expiration

- **AI interval hints OVERRIDE baseline**:
- - If hint is active, Governor ignores baseline completely
- - Enables tightening (5min → 30sec) and relaxing (1min → 10min)
- - Baseline only applies when hint expires
+ All hints expire automatically—this is a safety mechanism:
+ - **Short (5-15 min)**: Transient spikes, immediate investigations
+ - **Medium (30-60 min)**: Operational shifts
+ - **Long (2-4 hours)**: Sustained degradation, maintenance

- **AI one-shot hints COMPETE with baseline**:
- - Governor chooses earliest between hint timestamp and baseline next run
- - Enables immediate runs (one-shot sooner than baseline)
- - Baseline still influences scheduling
+ When hints expire, the system falls back to baseline.

- **When both hints exist**:
- - They compete with each other (earliest wins)
- - Baseline is ignored entirely
+ ### Override vs. Compete

- This design allows the AI to both tighten/relax ongoing cadence (interval) and trigger specific executions (one-shot) without the hints fighting each other.
+ **AI interval hints OVERRIDE baseline**: If active, Governor ignores baseline completely. Enables both tightening and relaxing.

- ## Nudging: Immediate Effect
+ **AI one-shot hints COMPETE with baseline**: Governor chooses earliest. Enables immediate runs.

- When the AI writes a hint, it doesn't just sit in the database waiting for the next baseline execution. That could take minutes or hours.
+ For detailed constraint interaction, see [How Scheduling Works](./how-scheduling-works.md).
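
The two semantics can be sketched as a next-run resolution. This is a simplified illustration of the stated rules only: it ignores min/max clamping and pause state, and the function and parameter names are assumptions:

```typescript
// Illustrative next-run resolution: interval hints OVERRIDE the baseline
// cadence, one-shot hints COMPETE (earliest wins). Clamping and pause omitted.
interface ActiveHints {
  intervalMs: number | null; // aiHintIntervalMs
  oneShotAt: Date | null;    // aiHintNextRunAt
  expiresAt: Date | null;    // aiHintExpiresAt
}

function resolveNextRun(baselineNext: Date, lastRun: Date, hints: ActiveHints, now: Date): Date {
  const active = hints.expiresAt !== null && hints.expiresAt.getTime() > now.getTime();
  // Interval hint overrides the baseline cadence entirely.
  const cadenceNext = active && hints.intervalMs !== null
    ? new Date(lastRun.getTime() + hints.intervalMs)
    : baselineNext;
  // One-shot hint competes: the earliest candidate wins.
  if (active && hints.oneShotAt !== null && hints.oneShotAt.getTime() < cadenceNext.getTime()) {
    return hints.oneShotAt;
  }
  return cadenceNext;
}
```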
 
- Instead, the AI **nudges** the `nextRunAt` field using `setNextRunAtIfEarlier(timestamp)`:
+ ### Nudging

- 1. Calculate when the hint should take effect (`now + intervalMs` or specific timestamp)
- 2. Compare with current `nextRunAt`
- 3. If hint time is earlier, update `nextRunAt` immediately
- 4. Scheduler claims endpoint on next tick (within 5 seconds)
-
- **Example timeline**:
+ When the AI writes a hint, it nudges `nextRunAt` to apply immediately using `setNextRunAtIfEarlier(timestamp)`:

  ```
- T=0: Endpoint scheduled to run at T=300 (5 minutes from now)
- T=60: AI analyzes, sees spike, proposes 30-second interval
- T=60: AI writes hint and nudges nextRunAt to T=90 (30 seconds from T=60)
- T=65: Scheduler wakes up, claims endpoint (nextRunAt=T=90 is due soon)
+ T=0: Endpoint scheduled for T=300 (5 min baseline)
+ T=60: AI proposes 30-second interval
+ T=60: AI nudges nextRunAt to T=90
  T=90: Scheduler executes endpoint
  ```

- Without nudging, the endpoint would wait until T=300 to apply the new interval. With nudging, it applies at T=90—within seconds of the AI's decision.
-
- **Important**: Nudging respects constraints. If the nudged time violates `minIntervalMs`, the Governor clamps it during scheduling.
-
- ## Structuring Response Bodies for AI
+ Without nudging, changes wait until T=300. With nudging, they apply at T=90.
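
The nudge primitive itself is tiny: it only ever moves `nextRunAt` earlier, never later. A sketch replaying the timeline above (the standalone-function shape is an assumption; in the real system this is a guarded database update):

```typescript
// Illustrative setNextRunAtIfEarlier: a hint can pull a run earlier,
// but can never delay one that is already scheduled sooner.
function setNextRunAtIfEarlier(current: Date, proposed: Date): Date {
  return proposed.getTime() < current.getTime() ? proposed : current;
}

// Replaying the timeline above: baseline run at T=300s, hint lands at T=60s
// proposing a 30-second interval, so the proposed next run is T=90s.
const t0 = Date.parse("2026-02-03T00:00:00Z");
const scheduled = new Date(t0 + 300_000); // T=300
const proposed = new Date(t0 + 90_000);   // T=90
const next = setNextRunAtIfEarlier(scheduled, proposed);
// next is T=90: the change applies on the next scheduler tick, not at T=300
```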
 
- The AI can only work with the data you provide. Here's how to structure response bodies:
+ ## Structuring Response Bodies

- ### Include Relevant Metrics
+ The AI works with the data you provide. Structure response bodies to include:

+ ### Relevant Metrics
  ```json
  {
  "queue_depth": 250,
  "processing_rate_per_min": 80,
  "error_rate_pct": 2.5,
  "avg_latency_ms": 145,
- "healthy": true,
- "last_success_at": "2025-11-02T14:30:00Z"
+ "healthy": true
  }
  ```

- The AI looks for field names like `queue`, `latency`, `error`, `rate`, `count`, `healthy`, `status`.
-
- ### Use Consistent Naming
-
- If queue depth is `queue_depth` in one response, don't call it `pending_items` in another. Consistency helps the AI spot trends.
-
- ### Include Thresholds
+ The AI looks for: `queue`, `latency`, `error`, `rate`, `count`, `healthy`, `status`.

+ ### Thresholds
  ```json
  {
  "queue_depth": 250,
@@ -394,142 +198,75 @@ If queue depth is `queue_depth` in one response, don't call it `pending_items` i
  }
  ```

- This tells the AI when to intervene (crossing thresholds).
-
- ### Add Coordination Signals
-
- For workflow endpoints:
-
+ ### Coordination Signals
  ```json
  {
  "ready_for_processing": true,
- "upstream_completed_at": "2025-11-02T14:45:00Z",
- "data_available": true
+ "upstream_completed_at": "2025-11-02T14:45:00Z"
  }
  ```

- The AI can use `get_sibling_latest_responses` to read these flags and coordinate execution.
+ ### Tips
+ - Use consistent naming across responses
+ - Include timestamps for cooldown patterns
+ - Keep it simple (1000 char truncation)

- ### Include Timestamps
+ ## Decision Framework

- ```json
- {
- "last_alert_sent_at": "2025-11-02T12:00:00Z",
- "last_cache_warm_at": "2025-11-02T13:30:00Z"
- }
- ```
-
- This enables **cooldown patterns**—the AI can check if enough time has passed before triggering duplicate actions.
-
- ### Keep It Simple
-
- The AI receives truncated response bodies (1000 chars). Don't include massive arrays or deeply nested objects. Focus on summary metrics.
-
- ## Decision Framework: When AI Acts
-
- The AI follows a conservative decision framework:
+ The AI follows a conservative approach:

  **Default to stability**: Most of the time, doing nothing is optimal.

  **Intervene when**:
  - Clear trend (5+ data points showing monotonic change)
- - Threshold crossing (metric jumps significantly)
+ - Threshold crossing
  - State transition (dependency status changes)
- - Coordination signal (sibling endpoint signals readiness)
+ - Coordination signal from sibling

  **Don't intervene when**:
  - Single anomaly (might be transient)
- - Insufficient data (<10 total runs)
+ - Insufficient data (fewer than 10 total runs)
  - Metrics within normal ranges
- - No clear cause-effect logic
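
The "clear trend" rule can be sketched as a strict monotonicity check over recent data points. The 5-point minimum mirrors the guidance above; the helper itself is illustrative:

```typescript
// Illustrative "clear trend" check: require at least 5 data points and a
// strictly monotonic rise or fall before treating it as an intervention signal.
function isClearTrend(values: number[], minPoints = 5): boolean {
  if (values.length < minPoints) return false; // insufficient data
  const rising = values.every((v, i) => i === 0 || v > values[i - 1]);
  const falling = values.every((v, i) => i === 0 || v < values[i - 1]);
  return rising || falling;
}
```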
 
- The AI's reasoning is logged in `ai_analysis_sessions` table. You can review what it considered and why it acted (or didn't).
+ ## Analysis Sessions

- ## Analysis Sessions: Debugging AI Decisions
+ Every analysis creates a session record with:
+ - Endpoint analyzed and timestamp
+ - Tools called (queries and actions)
+ - Reasoning (AI's explanation)
+ - Token usage and duration
+ - Next analysis time
+ - Endpoint failure count snapshot

- Every AI analysis creates a session record:
-
- - **Endpoint analyzed**
- - **Timestamp**
- - **Tools called** (which queries and actions)
- - **Reasoning** (AI's explanation of its decision)
- - **Token usage** (cost tracking)
- - **Duration**
- - **Next analysis at** (when AI scheduled its next analysis)
- - **Endpoint failure count** (snapshot for detecting state changes)
-
- This audit trail helps you:
- - Debug unexpected scheduling behavior
- - Understand why AI tightened/relaxed intervals
- - Review cost (token usage per analysis)
- - Tune prompts or constraints based on AI reasoning
- - See when AI expects to analyze again
-
- Check the sessions table when an endpoint's schedule changes unexpectedly. The reasoning field shows the AI's thought process.
+ Check sessions when an endpoint's schedule changes unexpectedly.

  ## Quota and Cost Control

- AI analysis costs money (API calls). The system includes quota controls:
-
- - Per-tenant quota limits (prevent runaway costs)
- - Before analyzing an endpoint, check `quota.canProceed(tenantId)`
+ AI analysis has quota controls:
+ - Per-tenant quota limits
+ - Before analyzing, check `quota.canProceed(tenantId)`
  - If quota exceeded, skip analysis (Scheduler continues on baseline)

- This ensures that even if AI becomes unavailable or too expensive, your jobs keep running.
-
- ## Putting It Together: Example Analysis
-
- **Scenario**: Payment queue monitoring endpoint
-
- **Baseline**: 5-minute interval
-
- **T=0**: Scheduler runs endpoint
- - Response: `{ "queue_depth": 50, "processing_rate": 100, "healthy": true }`
- - Records to database
-
- **T=5min**: AI Planner discovers endpoint (ran recently)
- - Queries `get_latest_response`: queue_depth=50
- - Queries `get_response_history(limit=5)`: [50, 48, 52, 49, 51]
- - Analysis: "Stable around 50, no trend, 100% success rate"
- - Decision: "No action—stability optimal"
- - Session logged with reasoning
-
- **T=10min**: Scheduler runs again
- - Response: `{ "queue_depth": 150, "processing_rate": 100, "healthy": true }`
- - Records to database
-
- **T=12min**: AI Planner analyzes again
- - Queries `get_response_history(limit=5)`: [150, 50, 48, 52, 49]
- - Analysis: "Queue jumped from 50 to 150. Single spike or trend?"
- - Queries `get_response_history(limit=2, offset=5)`: [51, 50]
- - Analysis: "Stable before, then spike. Monitor closely."
- - Decision: `propose_interval(60000, ttl=15, reason="Queue spike detected")`
- - Nudges `nextRunAt` to T=13min (1 minute from now)
-
- **T=13min**: Scheduler claims endpoint (nudged)
- - Executes, gets response: `{ "queue_depth": 200, ... }`
- - Governor sees active interval hint (60000ms)
- - Schedules next run at T=14min (1 minute interval)
-
- **T=14min, T=15min, ..., T=27min**: Runs every 1 minute
- - AI hint remains active (TTL=15 minutes from T=12min = expires T=27min)
-
- **T=27min**: Hint expires
- - Scheduler's Governor sees `aiHintExpiresAt < now`
- - Ignores hint, uses baseline (5-minute interval)
- - Schedules next run at T=32min
+ This ensures jobs keep running even if AI becomes unavailable or too expensive.
252
  ## Key Takeaways
523
253
 
524
- 1. **AI controls its own analysis frequency**: Schedules re-analysis based on endpoint state
525
- 2. **AI sees multi-window health data**: 1h, 4h, 24h windows for accurate recovery detection
526
- 3. **Sessions have constraints**: Max 15 tool calls to prevent runaway costs
254
+ 1. **AI controls its own analysis frequency**
255
+ 2. **Multi-window health data** enables accurate recovery detection
256
+ 3. **Sessions have constraints** (max 15 tool calls)
527
257
  4. **Four action tools**: propose_interval, propose_next_time, pause_until, clear_hints
528
- 5. **Hints have TTLs**: Auto-revert on expiration (safety)
258
+ 5. **Hints have TTLs**: Auto-revert on expiration
529
259
  6. **Interval hints override baseline**: Enables adaptation
530
260
  7. **Nudging provides immediacy**: Changes apply within seconds
531
- 8. **Structure response bodies intentionally**: Include metrics AI should monitor
532
- 9. **Sessions provide audit trail**: Debug AI reasoning
261
+ 8. **Structure response bodies intentionally**
262
+ 9. **Sessions provide audit trail**
533
263
  10. **Quota controls costs**: AI unavailable ≠ jobs stop running
534
264
 
535
- Understanding how AI adaptation works helps you design endpoints that benefit from intelligent scheduling, structure response bodies effectively, and debug unexpected schedule changes.
265
+ ---
266
+
267
+ ## See Also
268
+
269
+ - **[System Architecture](./system-architecture.md)** - High-level dual-worker design
270
+ - **[How Scheduling Works](./how-scheduling-works.md)** - Detailed Governor logic and constraint behavior
271
+ - **[Configuration and Constraints](./configuration-and-constraints.md)** - Setting up endpoints effectively
272
+ - **[Coordinating Multiple Endpoints](./coordinating-multiple-endpoints.md)** - Workflow patterns