@cronicorn/mcp-server 1.18.3 → 1.19.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +87 -436
- package/dist/docs/api-reference.md +558 -0
- package/dist/docs/core-concepts.md +9 -1
- package/dist/docs/introduction.md +15 -5
- package/dist/docs/mcp-server.md +49 -69
- package/dist/docs/quick-start.md +11 -2
- package/dist/docs/self-hosting.md +10 -1
- package/dist/docs/technical/configuration-and-constraints.md +11 -2
- package/dist/docs/technical/coordinating-multiple-endpoints.md +11 -2
- package/dist/docs/technical/how-ai-adaptation-works.md +122 -385
- package/dist/docs/technical/how-scheduling-works.md +76 -2
- package/dist/docs/technical/reference.md +11 -2
- package/dist/docs/technical/system-architecture.md +57 -189
- package/dist/docs/troubleshooting.md +392 -0
- package/dist/docs/use-cases.md +10 -1
- package/dist/index.js +20 -12
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
- package/dist/docs/competitive-analysis.md +0 -324
- package/dist/docs/developers/README.md +0 -29
- package/dist/docs/developers/authentication.md +0 -121
- package/dist/docs/developers/environment-configuration.md +0 -103
- package/dist/docs/developers/quality-checks.md +0 -68
- package/dist/docs/developers/quick-start.md +0 -87
- package/dist/docs/developers/workspace-structure.md +0 -174
@@ -8,384 +8,188 @@ mcp:
 uri: file:///docs/technical/how-ai-adaptation-works.md
 mimeType: text/markdown
 priority: 0.85
-lastModified:
+lastModified: 2026-02-03T00:00:00Z
 ---
 
 # How AI Adaptation Works
 
-
+**TL;DR:** The AI Planner analyzes endpoint execution patterns and writes time-bounded hints to adjust schedules. It has four action tools (propose_interval, propose_next_time, pause_until, clear_hints) and three query tools for response data. Structure your response bodies with metrics the AI should monitor.
+
+This document explains how the AI Planner suggests schedule adjustments. See [System Architecture](./system-architecture.md) for context on the dual-worker design.
 
 ## The AI Planner's Job
 
-The AI Planner worker
+The AI Planner worker analyzes endpoint execution patterns and suggests scheduling adjustments by writing hints to the database. It doesn't execute jobs or manage locks—it observes and recommends.
 
-The AI Planner runs independently from the Scheduler
+The AI Planner runs independently from the Scheduler, typically waking up every 5 minutes to analyze recently active endpoints.
 
 ## Discovery: Finding Endpoints to Analyze
 
-The AI Planner
-
-Instead, it uses **smart scheduling** where the AI controls when it needs to analyze again:
+The AI Planner uses **smart scheduling** where it controls when to analyze again:
 
-1. Query the database for
-2. Check if the endpoint is due for analysis
-   - **First analysis**: New endpoints
+1. Query the database for recently active endpoints
+2. Check if the endpoint is due for analysis:
+   - **First analysis**: New endpoints never analyzed
    - **Scheduled time**: AI-requested re-analysis time has passed
   - **State change**: New failures since last analysis (triggers immediate re-analysis)
 3. Skip endpoints where none of these conditions are met
 
-
-- Stable endpoints: "Check again in 4 hours"
-- Incidents: "Check again in 5 minutes"
-- Very stable daily jobs: "Check again in 24 hours"
-
-The AI communicates this via the `next_analysis_in_ms` parameter in `submit_analysis` (see Tools section).
-
-## What the AI Sees: Building Context
-
-For each endpoint, the AI Planner builds a comprehensive analysis prompt containing:
-
-### Current Configuration
+The AI communicates its preferred re-analysis time via `next_analysis_in_ms` in `submit_analysis`.
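Reviewer note: the three "due for analysis" conditions added above can be sketched as a predicate. This is an illustrative sketch only; the field names are assumptions, not the package's actual schema.

```typescript
// Sketch of the "due for analysis" check described above.
// Field names are illustrative, not the package's actual schema.
interface EndpointAnalysisState {
  lastAnalyzedAt: number | null;   // epoch ms; null = never analyzed
  nextAnalysisAt: number | null;   // AI-requested re-analysis time
  failuresSinceLastAnalysis: number;
}

function isDueForAnalysis(state: EndpointAnalysisState, now: number): boolean {
  if (state.lastAnalyzedAt === null) return true;       // first analysis
  if (state.nextAnalysisAt !== null && state.nextAnalysisAt <= now) {
    return true;                                        // scheduled time has passed
  }
  if (state.failuresSinceLastAnalysis > 0) return true; // state change: new failures
  return false;                                         // otherwise skip
}
```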
 
-
-- **Constraints**: Min/max intervals
-- **Pause state**: Whether paused and until when
-- **Active hints**: Current AI interval/one-shot hints and their expiration times
-- **Failure count**: Number of consecutive failures (affects exponential backoff)
+## What the AI Sees
 
-
+For each endpoint, the AI receives:
 
-###
+### Configuration
+- Baseline schedule (cron or interval)
+- Constraints (min/max intervals)
+- Pause state and active hints
+- Failure count
 
-
+### Multi-Window Health Metrics
 
 | Window | Metrics |
-
+|--------|---------|
 | **Last 1 hour** | Success rate, run count |
 | **Last 4 hours** | Success rate, run count |
 | **Last 24 hours** | Success rate, run count |
 
-Plus
-- **Average duration**: Mean execution time
-- **Failure streak**: Consecutive failures (signals degradation)
+Plus average duration and failure streak.
 
-
-With multi-window health, the AI sees:
-- Last 1h: 100% success (12 runs)
-- Last 4h: 85% success (40 runs)
-- Last 24h: 32% success (500 runs) ← skewed by old failures
-
-This tells the AI "endpoint has recovered" rather than "endpoint is still failing."
+Multiple windows matter for recovery detection. A 24-hour window can show 5% success even when recent performance is 100% due to old failures.
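Reviewer note: the multi-window metrics above can be sketched as a simple windowed aggregation. This is illustrative only; the run-record shape is an assumption.

```typescript
// Illustrative sketch of computing the windowed success rates the
// planner receives. Record shape is an assumption for this example.
interface RunRecord { at: number; ok: boolean } // at = epoch ms

function windowStats(runs: RunRecord[], now: number, windowMs: number) {
  const inWindow = runs.filter(r => now - r.at <= windowMs);
  const okCount = inWindow.filter(r => r.ok).length;
  return {
    runCount: inWindow.length,
    successRate: inWindow.length === 0 ? null : okCount / inWindow.length,
  };
}
```

Old failures fall out of the 1-hour window long before they fall out of the 24-hour window, which is why looking at several windows at once lets the planner tell "recovered" apart from "still failing".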
 
 ### Response Body Data
 
-
-1. **Latest response**: Check current state (one data point)
-2. **Response history**: Identify trends over time (multiple data points)
+The AI can query response data three ways:
+1. **Latest response**: Current state snapshot
+2. **Response history**: Trends over time
 3. **Sibling responses**: Coordinate across endpoints in the same job
 
-The response body can contain any JSON structure you design. The AI looks for signals indicating:
-
-- **Load indicators**: queue_depth, pending_count, backlog_size
-- **Performance metrics**: latency, p95, processing_time
-- **Error rates**: error_count, failure_rate, error_pct
-- **Health flags**: healthy, available, status, state
-- **Coordination signals**: ready_for_processing, dependency_status
-- **Timestamps**: last_success_at, completed_at (detect staleness)
-
-Structure your response bodies to include the metrics that matter for your use case.
-
 ### Job Context
-
-- **Job description**: High-level intent (e.g., "Monitors payment queue and triggers processing when depth exceeds threshold")
-- **Sibling endpoint names**: Other endpoints in the same job (e.g., "3 endpoints [API Monitor, Data Fetcher, Notifier]")
-
-Knowing sibling names helps the AI:
-- Understand the endpoint is part of a larger workflow
-- Decide when to check sibling responses for coordination
-- Make informed decisions about the `get_sibling_latest_responses` tool
-
-The AI uses job context to interpret what "good" vs "bad" looks like for specific metrics. A growing queue_depth might be normal for a collector endpoint but alarming for a processor endpoint.
+- Job description (high-level intent)
+- Sibling endpoint names
 
 ## Session Constraints
 
-Each
-
-- **
--
-- Sessions that hit the limit are terminated
+Each analysis session is limited to prevent runaway costs:
+- **Maximum 15 tool calls** per session
+- **10 history records** is usually sufficient
+- Sessions hitting the limit are terminated
 
-
-The AI is informed of these limits and prioritizes the most valuable queries.
-
-## The Four Tools: How AI Takes Action
-
-The AI Planner doesn't write to the database directly. Instead, it has access to **four action tools** that write hints:
+## The Four Action Tools
 
 ### 1. propose_interval
 
-
+Adjust how frequently the endpoint runs.
 
-**Parameters**:
-- `intervalMs`: New interval in milliseconds
-- `ttlMinutes`: How long the hint is valid (default: 60 minutes)
-- `reason`: Optional explanation
+**Parameters**: `intervalMs`, `ttlMinutes` (default: 60), `reason`
 
 **When to use**:
 - Tighten monitoring during load spikes (5min → 30sec)
 - Relax during stability (1min → 10min to save resources)
-- Override exponential backoff after recovery
-
-**Example**:
-```
-AI sees queue_depth growing: 50 → 100 → 200 over last 10 runs
-Action: propose_interval(30000, ttl=15, reason="Growing queue requires tighter monitoring")
-Effect: Runs every 30 seconds for 15 minutes, then reverts to baseline
-```
+- Override exponential backoff after recovery
 
-**How it works**:
-1. AI calls the tool with parameters
-2. Tool writes `aiHintIntervalMs` and `aiHintExpiresAt` to database
-3. Tool calls `setNextRunAtIfEarlier(now + intervalMs)` to apply immediately (nudging)
-4. Scheduler's next tick reads the hint, Governor uses it to calculate next run
-5. After TTL expires, hint is ignored and baseline resumes
+**How it works**: Writes `aiHintIntervalMs` and `aiHintExpiresAt` to database, then nudges `nextRunAt` to apply immediately.
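Reviewer note: the write-hint-then-nudge flow above can be sketched as follows. Only `setNextRunAtIfEarlier` and the two hint columns are named in the docs; the row shape here is an assumption for illustration.

```typescript
// Sketch of the propose_interval flow described above: write the hint
// with a TTL, then nudge nextRunAt so the change applies immediately.
// Row shape is an assumption; only the named fields come from the docs.
interface EndpointRow {
  nextRunAt: number;               // epoch ms
  aiHintIntervalMs: number | null;
  aiHintExpiresAt: number | null;
}

function setNextRunAtIfEarlier(row: EndpointRow, candidate: number): void {
  if (candidate < row.nextRunAt) row.nextRunAt = candidate; // nudge only moves runs earlier
}

function proposeInterval(row: EndpointRow, now: number, intervalMs: number, ttlMinutes = 60): void {
  row.aiHintIntervalMs = intervalMs;               // the hint itself
  row.aiHintExpiresAt = now + ttlMinutes * 60_000; // TTL: auto-reverts to baseline
  setNextRunAtIfEarlier(row, now + intervalMs);    // apply immediately (nudging)
}
```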
 
 ### 2. propose_next_time
 
-
+Schedule a specific one-time execution.
 
-**Parameters**:
-- `nextRunAtIso`: ISO 8601 timestamp for next run
-- `ttlMinutes`: How long the hint is valid (default: 30 minutes)
-- `reason`: Optional explanation
+**Parameters**: `nextRunAtIso`, `ttlMinutes` (default: 30), `reason`
 
 **When to use**:
-- Run immediately to investigate a failure
-- Defer to off-peak hours
-- Coordinate with external events
-
-**Example**:
-```
-AI sees failure spike (success rate drops to 45%)
-Action: propose_next_time(now, ttl=5, reason="Investigate failure spike")
-Effect: Runs immediately once, then resumes baseline schedule
-```
-
-**How it works**:
-1. AI calls the tool with a timestamp
-2. Tool writes `aiHintNextRunAt` and `aiHintExpiresAt` to database
-3. Tool calls `setNextRunAtIfEarlier(timestamp)` to apply immediately
-4. Scheduler claims endpoint when `nextRunAt` arrives
-5. After execution or TTL expiry, hint is cleared and baseline resumes
+- Run immediately to investigate a failure
+- Defer to off-peak hours
+- Coordinate with external events
 
 ### 3. pause_until
 
-
+Stop execution temporarily or resume.
 
-**Parameters**:
-- `untilIso`: ISO 8601 timestamp to pause until, or `null` to resume
-- `reason`: Optional explanation
+**Parameters**: `untilIso` (or `null` to resume), `reason`
 
 **When to use**:
-- Dependency is down
-- Rate limit detected
-- Maintenance window
-- Resume after manual pause (pass `null`)
-
-**Example**:
-```
-AI sees responseBody: { dependency_status: "unavailable" }
-Action: pause_until("2025-11-02T15:30:00Z", reason="Dependency unavailable")
-Effect: No executions until 3:30 PM, then resumes baseline
-```
-
-**How it works**:
-1. AI calls the tool with a timestamp (or `null`)
-2. Tool writes `pausedUntil` to database
-3. Scheduler's Governor checks pause state—if `pausedUntil > now`, returns that timestamp with source `"paused"`
-4. When pause time passes, Governor resumes normal scheduling
+- Dependency is down
+- Rate limit detected
+- Maintenance window
 
 ### 4. clear_hints
 
-
+Reset endpoint to baseline schedule.
 
-**Parameters**:
-- `reason`: Explanation for clearing hints
+**Parameters**: `reason`
 
 **When to use**:
-- AI hints
+- AI hints no longer relevant
 - Manual intervention resolved the issue
--
-- Want to revert to baseline without waiting for TTL expiry
-
-**Example**:
-```
-AI sees endpoint recovered but has aggressive 30s interval hint active
-Action: clear_hints(reason="Endpoint recovered, reverting to baseline")
-Effect: AI hints cleared immediately, baseline schedule resumes
-```
-
-**How it works**:
-1. AI calls the tool with a reason
-2. Tool clears `aiHintIntervalMs`, `aiHintNextRunAt`, `aiHintExpiresAt`
-3. Next execution uses baseline schedule
-
-## Query Tools: Informing Decisions
-
-Before taking action, the AI can query response data using **three query tools**:
-
-### 1. get_latest_response
-
-**Returns**: Most recent response body, timestamp, status
-
-**Use case**: Check current state snapshot
-
-Example: "What's the current queue depth?"
-
-### 2. get_response_history
-
-**Parameters**:
-- `limit`: Number of responses (1-10, default 10)
-- `offset`: Skip N newest responses for pagination
-
-**Returns**: Array of response bodies with timestamps, newest first
-
-**Use case**: Identify trends over time
-
-Example: "Is queue_depth increasing monotonically?"
+- Revert to baseline without waiting for TTL
 
-
+## Query Tools
 
-###
+### get_latest_response
+Returns most recent response body, timestamp, status.
 
-
-- **Response data**: Latest response body, timestamp, status
-- **Schedule info**: Baseline, next run, last run, pause status, failure count
-- **AI hints**: Active interval/one-shot hints with expiry and reason
+### get_response_history
+**Parameters**: `limit` (1-10), `offset`
 
-
+Returns array of response bodies with timestamps. 10 records is usually sufficient.
 
-
+### get_sibling_latest_responses
+Returns for each sibling endpoint:
+- Response data (body, timestamp, status)
+- Schedule info (baseline, next run, pause status)
+- Active AI hints with expiry and reason
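Reviewer note: a typical use of sibling responses is a gate like the one below — a processor endpoint only asks to run sooner when its upstream sibling reports a ready batch. The response shapes are assumptions; only the tool name comes from the docs.

```typescript
// Illustrative coordination check over get_sibling_latest_responses
// output, as described above. Shapes are assumptions for this sketch.
interface SiblingResponse {
  endpointName: string;
  responseBody: { batch_ready?: boolean };
}

function upstreamReady(siblings: SiblingResponse[], upstreamName: string): boolean {
  const upstream = siblings.find(s => s.endpointName === upstreamName);
  return upstream?.responseBody.batch_ready === true; // missing sibling counts as not ready
}
```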
 
-
+## Hint Mechanics
 
-```json
-{
-  "endpointId": "ep_456",
-  "endpointName": "Data Fetcher",
-  "responseBody": { "batch_ready": true },
-  "timestamp": "2025-11-02T14:25:00Z",
-  "status": "success",
-  "schedule": {
-    "baseline": "every 5 minutes",
-    "nextRunAt": "2025-11-02T14:30:00Z",
-    "lastRunAt": "2025-11-02T14:25:00Z",
-    "isPaused": false,
-    "failureCount": 0
-  },
-  "aiHints": null
-}
-```
-
-Only useful for workflow endpoints (multiple endpoints in the same job that coordinate).
-
-## Hint Mechanics: TTLs and Expiration
-
-All AI hints have **time-to-live (TTL)**—they expire automatically. This is a safety mechanism.
-
-**Why TTLs matter**:
-- If the AI makes a bad decision (too aggressive, too relaxed), it auto-corrects
-- Temporary conditions (spikes, failures) don't permanently alter schedules
-- You can experiment with aggressive hints knowing they'll revert
-
-**TTL strategy**:
-- **Short (5-15 minutes)**: Transient spikes, immediate investigations
-- **Medium (30-60 minutes)**: Operational shifts, business hours patterns
-- **Long (2-4 hours)**: Sustained degradation, maintenance windows
-
-When a hint expires (`aiHintExpiresAt <= now`):
-1. Scheduler's Governor ignores it during next run calculation
-2. Baseline schedule resumes
-3. Hint fields remain in database (for debugging) until next execution clears them
-
-## Override vs. Compete: Hint Semantics
-
-Understanding how hints interact with baseline is critical:
+### TTLs and Expiration
 
-
--
--
--
+All hints expire automatically—this is a safety mechanism:
+- **Short (5-15 min)**: Transient spikes, immediate investigations
+- **Medium (30-60 min)**: Operational shifts
+- **Long (2-4 hours)**: Sustained degradation, maintenance
 
-
-- Governor chooses earliest between hint timestamp and baseline next run
-- Enables immediate runs (one-shot sooner than baseline)
-- Baseline still influences scheduling
+When hints expire, the system falls back to baseline.
 
-
-- They compete with each other (earliest wins)
-- Baseline is ignored entirely
+### Override vs. Compete
 
-
+**AI interval hints OVERRIDE baseline**: If active, Governor ignores baseline completely. Enables both tightening and relaxing.
 
-
+**AI one-shot hints COMPETE with baseline**: Governor chooses earliest. Enables immediate runs.
 
-
+For detailed constraint interaction, see [How Scheduling Works](./how-scheduling-works.md).
 
-
+### Nudging
 
-
-2. Compare with current `nextRunAt`
-3. If hint time is earlier, update `nextRunAt` immediately
-4. Scheduler claims endpoint on next tick (within 5 seconds)
-
-**Example timeline**:
+When AI writes a hint, it nudges `nextRunAt` to apply immediately using `setNextRunAtIfEarlier(timestamp)`:
 
 ```
-T=0: Endpoint scheduled
-T=60: AI
-T=60: AI
-T=65: Scheduler wakes up, claims endpoint (nextRunAt=T=90 is due soon)
+T=0: Endpoint scheduled for T=300 (5 min baseline)
+T=60: AI proposes 30-second interval
+T=60: AI nudges nextRunAt to T=90
 T=90: Scheduler executes endpoint
 ```
 
-Without nudging,
-
-**Important**: Nudging respects constraints. If the nudged time violates `minIntervalMs`, the Governor clamps it during scheduling.
-
-## Structuring Response Bodies for AI
+Without nudging, changes wait until T=300. With nudging, they apply at T=90.
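Reviewer note: the override/compete/TTL semantics added above can be condensed into one function. This is a simplified sketch — the real Governor also applies min/max interval clamps and pause state, which are out of scope here.

```typescript
// Simplified sketch of the Governor's next-run choice described above.
interface Hints {
  intervalMs?: number; // interval hint: OVERRIDES baseline
  nextRunAt?: number;  // one-shot hint: COMPETES with baseline
  expiresAt?: number;  // shared TTL; expired hints are ignored
}

function nextRun(now: number, baselineIntervalMs: number, hints: Hints): number {
  const expired = hints.expiresAt !== undefined && hints.expiresAt <= now;
  if (!expired && hints.intervalMs !== undefined) {
    return now + hints.intervalMs;              // override: baseline ignored entirely
  }
  const baseline = now + baselineIntervalMs;
  if (!expired && hints.nextRunAt !== undefined) {
    return Math.min(hints.nextRunAt, baseline); // compete: earliest wins
  }
  return baseline;                              // no hints or expired: baseline
}
```

Because the interval hint overrides rather than competes, it can push runs out beyond the baseline (relaxing) as well as pull them in (tightening), which a pure earliest-wins rule could not do.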
 
-
+## Structuring Response Bodies
 
-
+The AI works with the data you provide. Structure response bodies to include:
 
+### Relevant Metrics
 ```json
 {
   "queue_depth": 250,
   "processing_rate_per_min": 80,
   "error_rate_pct": 2.5,
   "avg_latency_ms": 145,
-  "healthy": true
-  "last_success_at": "2025-11-02T14:30:00Z"
+  "healthy": true
 }
 ```
 
-The AI looks for
-
-### Use Consistent Naming
-
-If queue depth is `queue_depth` in one response, don't call it `pending_items` in another. Consistency helps the AI spot trends.
-
-### Include Thresholds
+The AI looks for: `queue`, `latency`, `error`, `rate`, `count`, `healthy`, `status`.
 
+### Thresholds
 ```json
 {
   "queue_depth": 250,
@@ -394,142 +198,75 @@ If queue depth is `queue_depth` in one response, don't call it `pending_items` i
 }
 ```
 
-
-### Add Coordination Signals
-
-For workflow endpoints:
-
+### Coordination Signals
 ```json
 {
   "ready_for_processing": true,
-  "upstream_completed_at": "2025-11-02T14:45:00Z"
-  "data_available": true
+  "upstream_completed_at": "2025-11-02T14:45:00Z"
 }
 ```
 
-
+### Tips
+- Use consistent naming across responses
+- Include timestamps for cooldown patterns
+- Keep it simple (1000 char truncation)
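Reviewer note: the guidance above (flat metrics, a threshold, a coordination flag, a timestamp, under the truncation limit) can be illustrated with a small body builder. Everything here is a hypothetical example, not part of the package.

```typescript
// Illustrative response body shaped for the planner: flat summary
// metrics plus pre-computed signals, small enough to survive the
// 1000-character truncation. All names are hypothetical.
function buildStatusBody(queueDepth: number, threshold: number, lastSuccessAt: string) {
  return {
    queue_depth: queueDepth,
    queue_threshold: threshold,             // lets the AI judge "how bad"
    over_threshold: queueDepth > threshold, // pre-computed signal
    healthy: queueDepth <= threshold * 2,   // simple health flag
    last_success_at: lastSuccessAt,         // enables cooldown patterns
  };
}
```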
 
-
+## Decision Framework
 
-{
-  "last_alert_sent_at": "2025-11-02T12:00:00Z",
-  "last_cache_warm_at": "2025-11-02T13:30:00Z"
-}
-```
-
-This enables **cooldown patterns**—the AI can check if enough time has passed before triggering duplicate actions.
-
-### Keep It Simple
-
-The AI receives truncated response bodies (1000 chars). Don't include massive arrays or deeply nested objects. Focus on summary metrics.
-
-## Decision Framework: When AI Acts
-
-The AI follows a conservative decision framework:
+The AI follows a conservative approach:
 
 **Default to stability**: Most of the time, doing nothing is optimal.
 
 **Intervene when**:
 - Clear trend (5+ data points showing monotonic change)
-- Threshold crossing
+- Threshold crossing
 - State transition (dependency status changes)
-- Coordination signal
+- Coordination signal from sibling
 
 **Don't intervene when**:
 - Single anomaly (might be transient)
-- Insufficient data (
+- Insufficient data (fewer than 10 total runs)
 - Metrics within normal ranges
-- No clear cause-effect logic
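Reviewer note: the "clear trend, not a single anomaly" rule above amounts to a monotonicity check over recent data points. A minimal sketch:

```typescript
// Sketch of the "clear trend" rule: intervene only on 5+ points of
// monotonic change, never on a single anomaly or insufficient data.
function isMonotonicTrend(values: number[], minPoints = 5): boolean {
  if (values.length < minPoints) return false; // insufficient data
  const rising = values.every((v, i) => i === 0 || v > values[i - 1]);
  const falling = values.every((v, i) => i === 0 || v < values[i - 1]);
  return rising || falling;
}
```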
 
-
+## Analysis Sessions
 
-
+Every analysis creates a session record with:
+- Endpoint analyzed and timestamp
+- Tools called (queries and actions)
+- Reasoning (AI's explanation)
+- Token usage and duration
+- Next analysis time
+- Endpoint failure count snapshot
 
-
-- **Endpoint analyzed**
-- **Timestamp**
-- **Tools called** (which queries and actions)
-- **Reasoning** (AI's explanation of its decision)
-- **Token usage** (cost tracking)
-- **Duration**
-- **Next analysis at** (when AI scheduled its next analysis)
-- **Endpoint failure count** (snapshot for detecting state changes)
-
-This audit trail helps you:
-- Debug unexpected scheduling behavior
-- Understand why AI tightened/relaxed intervals
-- Review cost (token usage per analysis)
-- Tune prompts or constraints based on AI reasoning
-- See when AI expects to analyze again
-
-Check the sessions table when an endpoint's schedule changes unexpectedly. The reasoning field shows the AI's thought process.
+Check sessions when an endpoint's schedule changes unexpectedly.
 
 ## Quota and Cost Control
 
-AI analysis
-
--
-- Before analyzing an endpoint, check `quota.canProceed(tenantId)`
+AI analysis has quota controls:
+- Per-tenant quota limits
+- Before analyzing, check `quota.canProceed(tenantId)`
 - If quota exceeded, skip analysis (Scheduler continues on baseline)
 
-This ensures
-
-## Putting It Together: Example Analysis
-
-**Scenario**: Payment queue monitoring endpoint
-
-**Baseline**: 5-minute interval
-
-**T=0**: Scheduler runs endpoint
-- Response: `{ "queue_depth": 50, "processing_rate": 100, "healthy": true }`
-- Records to database
-
-**T=5min**: AI Planner discovers endpoint (ran recently)
-- Queries `get_latest_response`: queue_depth=50
-- Queries `get_response_history(limit=5)`: [50, 48, 52, 49, 51]
-- Analysis: "Stable around 50, no trend, 100% success rate"
-- Decision: "No action—stability optimal"
-- Session logged with reasoning
-
-**T=10min**: Scheduler runs again
-- Response: `{ "queue_depth": 150, "processing_rate": 100, "healthy": true }`
-- Records to database
-
-**T=12min**: AI Planner analyzes again
-- Queries `get_response_history(limit=5)`: [150, 50, 48, 52, 49]
-- Analysis: "Queue jumped from 50 to 150. Single spike or trend?"
-- Queries `get_response_history(limit=2, offset=5)`: [51, 50]
-- Analysis: "Stable before, then spike. Monitor closely."
-- Decision: `propose_interval(60000, ttl=15, reason="Queue spike detected")`
-- Nudges `nextRunAt` to T=13min (1 minute from now)
-
-**T=13min**: Scheduler claims endpoint (nudged)
-- Executes, gets response: `{ "queue_depth": 200, ... }`
-- Governor sees active interval hint (60000ms)
-- Schedules next run at T=14min (1 minute interval)
-
-**T=14min, T=15min, ..., T=27min**: Runs every 1 minute
-- AI hint remains active (TTL=15 minutes from T=12min = expires T=27min)
-
-**T=27min**: Hint expires
-- Scheduler's Governor sees `aiHintExpiresAt < now`
-- Ignores hint, uses baseline (5-minute interval)
-- Schedules next run at T=32min
+This ensures jobs keep running even if AI becomes unavailable or too expensive.
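Reviewer note: the graceful-degradation rule above — skip analysis when over quota, but never stop scheduling — can be sketched like this. The quota record shape is an assumption; only `quota.canProceed(tenantId)` is named in the docs.

```typescript
// Sketch of the quota gate described above: over-quota tenants lose AI
// analysis but keep their baseline schedule. Shapes are assumptions.
interface TenantQuota { used: number; limit: number }

function planTick(quota: TenantQuota, baselineIntervalMs: number, now: number) {
  const analyze = quota.used < quota.limit; // what quota.canProceed(tenantId) decides
  return {
    analyze,                                // skip AI analysis when over quota
    nextRunAt: now + baselineIntervalMs,    // scheduling continues regardless
  };
}
```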
 
 ## Key Takeaways
 
-1. **AI controls its own analysis frequency
-2. **
-3. **Sessions have constraints
+1. **AI controls its own analysis frequency**
+2. **Multi-window health data** enables accurate recovery detection
+3. **Sessions have constraints** (max 15 tool calls)
 4. **Four action tools**: propose_interval, propose_next_time, pause_until, clear_hints
-5. **Hints have TTLs**: Auto-revert on expiration
+5. **Hints have TTLs**: Auto-revert on expiration
 6. **Interval hints override baseline**: Enables adaptation
 7. **Nudging provides immediacy**: Changes apply within seconds
-8. **Structure response bodies intentionally
-9. **Sessions provide audit trail
+8. **Structure response bodies intentionally**
+9. **Sessions provide audit trail**
 10. **Quota controls costs**: AI unavailable ≠ jobs stop running
 
-
+---
+
+## See Also
+
+- **[System Architecture](./system-architecture.md)** - High-level dual-worker design
+- **[How Scheduling Works](./how-scheduling-works.md)** - Detailed Governor logic and constraint behavior
+- **[Configuration and Constraints](./configuration-and-constraints.md)** - Setting up endpoints effectively
+- **[Coordinating Multiple Endpoints](./coordinating-multiple-endpoints.md)** - Workflow patterns