@cronicorn/mcp-server 1.4.5 → 1.5.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/docs/README.md +133 -0
- package/dist/docs/core-concepts.md +233 -0
- package/dist/docs/introduction.md +152 -0
- package/dist/docs/quick-start.md +232 -0
- package/dist/docs/technical/_category.yml +8 -0
- package/dist/docs/technical/configuration-and-constraints.md +415 -0
- package/dist/docs/technical/coordinating-multiple-endpoints.md +457 -0
- package/dist/docs/technical/how-ai-adaptation-works.md +453 -0
- package/dist/docs/technical/how-scheduling-works.md +268 -0
- package/dist/docs/technical/reference.md +404 -0
- package/dist/docs/technical/system-architecture.md +306 -0
- package/dist/index.js +123 -8
- package/dist/index.js.map +1 -1
- package/package.json +3 -2

package/dist/docs/technical/how-scheduling-works.md
@@ -0,0 +1,268 @@
---
id: how-scheduling-works
title: How Scheduling Works
description: Scheduler loop, Governor safety, and nextRunAt calculation
tags: [assistant, technical, scheduler]
sidebar_position: 2
mcp:
  uri: file:///docs/technical/how-scheduling-works.md
  mimeType: text/markdown
  priority: 0.85
  lastModified: 2025-11-02T00:00:00Z
---

# How Scheduling Works

This document explains how the Scheduler worker executes jobs and calculates next run times. If you haven't read [System Architecture](./system-architecture.md), start there for context on the dual-worker design.

## The Scheduler's Job

The Scheduler worker has one responsibility: execute endpoints on time, record results, and schedule the next run. It doesn't analyze patterns, make AI decisions, or try to be clever. It's a reliable execution engine.

Every 5 seconds (configurable), the Scheduler wakes up and runs a **tick**:

1. **Claim** due endpoints from the database
2. **Execute** each endpoint's HTTP request
3. **Record** results (status, duration, response body)
4. **Calculate** when to run next using the Governor
5. **Update** the database with the new schedule

Then it goes back to sleep. Simple, predictable, fast.
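
To make the loop concrete, here is a minimal TypeScript sketch of one tick. The helper names (`claimDueEndpoints`, `executeEndpoint`, `recordRun`, `governor`, `updateSchedule`) and signatures are illustrative, not the package's actual API:

```ts
// Sketch of one Scheduler tick. Helper names and types are hypothetical.
interface Endpoint { id: string /* plus baseline schedule, hints, constraints, ... */ }
interface RunResult { ok: boolean; durationMs: number; statusCode?: number; body?: unknown }
interface Decision { nextRunAt: Date; source: string }

declare function claimDueEndpoints(now: Date, opts: { batchSize: number; lockTtlMs: number }): Promise<Endpoint[]>;
declare function executeEndpoint(ep: Endpoint): Promise<RunResult>;
declare function recordRun(id: string, result: RunResult): Promise<void>;
declare function governor(now: Date, ep: Endpoint): Decision;
declare function updateSchedule(id: string, result: RunResult, decision: Decision): Promise<void>;

async function tick(): Promise<void> {
  const now = new Date();
  const claimed = await claimDueEndpoints(now, { batchSize: 10, lockTtlMs: 30_000 }); // 1. Claim
  for (const endpoint of claimed) {
    const result = await executeEndpoint(endpoint);       // 2. Execute
    await recordRun(endpoint.id, result);                  // 3. Record
    const decision = governor(new Date(), endpoint);       // 4. Calculate via the Governor
    await updateSchedule(endpoint.id, result, decision);   // 5. Update the schedule
  }
}
```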

## The Tick Loop in Detail

### Step 1: Claiming Due Endpoints

The Scheduler asks the database: "Which endpoints have `nextRunAt <= now`?"

But there's a catch—in production, you might run multiple Scheduler instances for redundancy. To prevent two Schedulers from executing the same endpoint simultaneously, the claim operation uses a **lock**.

Here's how claiming works:

- Query for endpoints where `nextRunAt <= now` and `_lockedUntil <= now` (not currently locked)
- Atomically update those endpoints to set `_lockedUntil = now + lockTtlMs` (typically 30 seconds)
- Return the list of endpoint IDs that were successfully locked

The lock TTL serves as a safety mechanism. If a Scheduler crashes while executing an endpoint, the lock expires and another Scheduler can claim it. This prevents endpoints from getting stuck if a worker dies mid-execution.

The Scheduler claims up to `batchSize` endpoints per tick (default: 10). This provides flow control—if the system is backlogged, it processes endpoints in manageable batches rather than trying to execute hundreds at once.
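
A sketch of the claim rule in TypeScript, modeled in memory for clarity; the real Scheduler performs the equivalent check-and-lock as a single atomic database update, and the function shown here is illustrative:

```ts
// In-memory model of claiming: due + unlocked endpoints get locked for lockTtlMs.
interface Claimable { id: string; nextRunAt: Date; _lockedUntil: Date | null }

function claimDue(endpoints: Claimable[], now: Date, batchSize: number, lockTtlMs: number): string[] {
  const claimed: string[] = [];
  for (const ep of endpoints) {
    if (claimed.length >= batchSize) break;                   // flow control: cap per tick
    const due = ep.nextRunAt <= now;
    const unlocked = ep._lockedUntil === null || ep._lockedUntil <= now;
    if (due && unlocked) {
      ep._lockedUntil = new Date(now.getTime() + lockTtlMs);  // lock so other Schedulers skip it
      claimed.push(ep.id);
    }
  }
  return claimed;
}
```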

### Step 2: Executing the Endpoint

For each claimed endpoint, the Scheduler:

1. Reads the full endpoint configuration from the database
2. Creates a run record with status `"running"` and the current attempt number
3. Makes the HTTP request (GET, POST, etc.) to the endpoint's URL
4. Waits for the response (up to the configured timeout)
5. Records the outcome: success/failure, duration, status code, response body

The response body is stored in the database. This is important—the AI Planner will read these response bodies later to make scheduling decisions. Structure your endpoint responses to include metrics the AI should monitor (queue depth, error rate, load indicators).
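
A self-contained sketch of a single execution, assuming a runtime with `fetch` and `AbortSignal.timeout`; the outcome shape is illustrative rather than the package's actual record format:

```ts
// One HTTP execution: capture everything the run record needs.
interface ExecutionOutcome {
  status: "success" | "failure";
  durationMs: number;
  statusCode?: number;
  responseBody?: string;
  errorMessage?: string;
}

async function execute(url: string, method: string, timeoutMs: number): Promise<ExecutionOutcome> {
  const startedAt = Date.now();
  try {
    const res = await fetch(url, { method, signal: AbortSignal.timeout(timeoutMs) });
    const body = await res.text(); // stored so the AI Planner can read it later
    return {
      status: res.ok ? "success" : "failure",
      durationMs: Date.now() - startedAt,
      statusCode: res.status,
      responseBody: body,
    };
  } catch (err) {
    // Timeouts and network errors land here and are recorded as failures.
    return {
      status: "failure",
      durationMs: Date.now() - startedAt,
      errorMessage: err instanceof Error ? err.message : String(err),
    };
  }
}
```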

### Step 3: Recording Results

After execution completes (or times out), the Scheduler writes a complete run record:

- Status: `"success"` or `"failure"`
- Duration in milliseconds
- HTTP status code
- Response body (the JSON returned by your endpoint)
- Error message (if failed)

This creates a historical record. You can query past runs to see execution patterns, investigate failures, or debug scheduling behavior.

### Step 4: The Governor Decides Next Run Time

After recording results, the Scheduler needs to answer: "When should this endpoint run next?"

This is where the **Governor** comes in. The Governor is a pure function that takes three inputs:

- **Current time** (`now`)
- **Endpoint state** (baseline schedule, AI hints, constraints, failure count)
- **Cron parser** (for evaluating cron expressions)

It returns a single output:

- **Next run time** and **source** (why this time was chosen)

The Governor is deterministic—same inputs always produce the same output. It makes no database calls, has no side effects, and contains no I/O. This makes it easy to test, audit, and understand.
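
Sketched as TypeScript types, the Governor looks roughly like this; field names follow the schema in [Reference](./reference.md), but the exact interfaces are illustrative:

```ts
// Pure function: (now, endpoint state, cron parser) -> (nextRunAt, source). No I/O, no side effects.
type Source =
  | "baseline-cron" | "baseline-interval"
  | "ai-interval" | "ai-oneshot"
  | "clamped-min" | "clamped-max"
  | "paused";

interface EndpointState {
  baselineCron?: string;
  baselineIntervalMs?: number;
  minIntervalMs?: number;
  maxIntervalMs?: number;
  aiHintIntervalMs?: number;
  aiHintNextRunAt?: Date;
  aiHintExpiresAt?: Date;
  pausedUntil?: Date;
  failureCount: number;
}

interface CronParser {
  next(expression: string, after: Date): Date; // next occurrence strictly after `after`
}

interface GovernorDecision { nextRunAt: Date; source: Source }

type Governor = (now: Date, state: EndpointState, cron: CronParser) => GovernorDecision;
```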

Let's walk through how the Governor makes its decision.

### Building Scheduling Candidates

The Governor starts by building a list of **candidates**—possible next run times based on different scheduling sources:

**Baseline Candidate**

Every endpoint has a baseline schedule. This is either:
- A cron expression: `"0 */5 * * *"` → next cron occurrence after `now`
- An interval: `300000ms` → `now + 300000ms` (with exponential backoff on failures)

The baseline represents your intent. It never expires.

If the endpoint uses interval-based scheduling and has failures, the Governor applies **exponential backoff**: `interval * 2^failureCount` (capped at 5 failures = 32x multiplier). This prevents hammering a failing endpoint while still retrying.

**AI Interval Candidate**

If there's an active AI interval hint (not expired), the Governor creates a candidate:
- `now + aiHintIntervalMs`
- Source: `"ai-interval"`

This candidate only exists if `aiHintExpiresAt > now`. Expired hints are ignored.

**AI One-Shot Candidate**

If there's an active AI one-shot hint (not expired), the Governor creates a candidate:
- `aiHintNextRunAt` (specific timestamp)
- Source: `"ai-oneshot"`

Again, only if `aiHintExpiresAt > now`.
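
Candidate construction, sketched with the illustrative types from the Governor sketch above:

```ts
interface Candidate { at: Date; source: Source }

function buildCandidates(now: Date, s: EndpointState, cron: CronParser): Candidate[] {
  const candidates: Candidate[] = [];

  // Baseline: next cron occurrence, or now + interval with exponential backoff on failures.
  if (s.baselineCron) {
    candidates.push({ at: cron.next(s.baselineCron, now), source: "baseline-cron" });
  } else if (s.baselineIntervalMs !== undefined) {
    const backoff = 2 ** Math.min(s.failureCount, 5);
    candidates.push({
      at: new Date(now.getTime() + s.baselineIntervalMs * backoff),
      source: "baseline-interval",
    });
  }

  // AI hints only produce candidates while unexpired.
  const hintsActive = s.aiHintExpiresAt !== undefined && s.aiHintExpiresAt > now;
  if (hintsActive && s.aiHintIntervalMs !== undefined) {
    candidates.push({ at: new Date(now.getTime() + s.aiHintIntervalMs), source: "ai-interval" });
  }
  if (hintsActive && s.aiHintNextRunAt !== undefined) {
    candidates.push({ at: s.aiHintNextRunAt, source: "ai-oneshot" });
  }

  return candidates;
}
```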

### Choosing the Winning Candidate

Now the Governor has 1-3 candidates. Which one wins?

**The rules:**

1. **If both AI hints exist**: Choose the earliest between interval and one-shot (baseline ignored)
2. **If only AI interval exists**: AI interval wins (baseline ignored)
3. **If only AI one-shot exists**: Choose earliest between one-shot and baseline
4. **If no AI hints exist**: Baseline wins

This logic ensures that:
- AI interval hints **override** baseline (enables tightening/relaxing schedule)
- AI one-shot hints **compete** with other candidates (enables immediate runs)
- Baseline is the fallback when AI has no opinion
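
The same rules as a small selection function, continuing the illustrative types above:

```ts
// Selection rules 1-4 from above; "earliest" means the smallest timestamp.
function chooseWinner(candidates: Candidate[]): Candidate {
  const baseline = candidates.find(c => c.source === "baseline-cron" || c.source === "baseline-interval");
  const aiInterval = candidates.find(c => c.source === "ai-interval");
  const aiOneShot = candidates.find(c => c.source === "ai-oneshot");
  const earliest = (a: Candidate, b: Candidate) => (a.at <= b.at ? a : b);

  if (aiInterval && aiOneShot) return earliest(aiInterval, aiOneShot);         // rule 1: baseline ignored
  if (aiInterval) return aiInterval;                                           // rule 2: overrides baseline
  if (aiOneShot) return baseline ? earliest(aiOneShot, baseline) : aiOneShot;  // rule 3: competes with baseline
  return baseline!;                                                            // rule 4: baseline always exists
}
```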

### Safety: Handling Past Candidates

Sometimes a candidate time is in the past. This happens when:
- Execution took longer than the interval
- System was backlogged
- One-shot hint targeted a time that already passed

The Governor handles this by rescheduling from `now`:

- **Interval-based** (baseline or AI interval): Add the interval to `now` to preserve cadence
- **Cron-based**: Use `cron.next()`, which handles past times correctly
- **One-shot**: Floor to `now` (run immediately)

This prevents the endpoint from being immediately reclaimed in the next tick, which would cause a tight execution loop.
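
A sketch of that safeguard, following the three bullets above (backoff is omitted from the interval branch for brevity):

```ts
// Never return a time that is already behind `now`.
function rescheduleIfPast(chosen: Candidate, now: Date, s: EndpointState, cron: CronParser): Candidate {
  if (chosen.at > now) return chosen; // still in the future, nothing to do
  switch (chosen.source) {
    case "baseline-interval": // interval-based: add the interval to now to preserve cadence
      return { ...chosen, at: new Date(now.getTime() + (s.baselineIntervalMs ?? 0)) };
    case "ai-interval":
      return { ...chosen, at: new Date(now.getTime() + (s.aiHintIntervalMs ?? 0)) };
    case "baseline-cron":     // cron.next() already returns an occurrence after `now`
      return { ...chosen, at: cron.next(s.baselineCron!, now) };
    default:                  // one-shot: floor to now, run immediately
      return { ...chosen, at: now };
  }
}
```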

### Applying Constraints (Clamping)

After choosing a candidate, the Governor applies **min/max interval constraints** (if configured):

- **Min interval**: If `chosen < (now + minIntervalMs)`, move it forward to `now + minIntervalMs` (source becomes `"clamped-min"`)
- **Max interval**: If `chosen > (now + maxIntervalMs)`, move it back to `now + maxIntervalMs` (source becomes `"clamped-max"`)

These constraints are hard limits. They override even AI hints. Use them to enforce invariants like rate limits or maximum staleness.
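
Clamping is a small step; a sketch with the same illustrative types:

```ts
// Hard min/max limits applied after a candidate is chosen.
function clamp(chosen: Candidate, now: Date, s: EndpointState): Candidate {
  if (s.minIntervalMs !== undefined) {
    const floor = new Date(now.getTime() + s.minIntervalMs);
    if (chosen.at < floor) return { at: floor, source: "clamped-min" };
  }
  if (s.maxIntervalMs !== undefined) {
    const ceiling = new Date(now.getTime() + s.maxIntervalMs);
    if (chosen.at > ceiling) return { at: ceiling, source: "clamped-max" };
  }
  return chosen;
}
```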

### The Pause Override

After all candidate evaluation and clamping, the Governor checks one final thing: **Is the endpoint paused?**

If `pausedUntil` is set and `pausedUntil > now`, the Governor returns:
- `nextRunAt = pausedUntil`
- Source: `"paused"`

Pause **always wins**. It overrides baseline, AI hints, and constraints. This gives you an emergency brake to stop an endpoint immediately.

### Step 5: Database Update

The Scheduler writes back to the database:

- `lastRunAt = now` (when this execution started)
- `nextRunAt = governor result` (when to run next)
- `failureCount`: reset to 0 on success, increment on failure
- Clear expired AI hints (if `aiHintExpiresAt <= now`)

The database update is atomic. If two Schedulers somehow claimed the same endpoint (shouldn't happen due to locks, but defensive programming), only one update succeeds.

After the update, the endpoint's lock expires naturally (when `_lockedUntil` passes), and it becomes claimable again when `nextRunAt` arrives.

## Safety Mechanisms

The Scheduler includes several safety mechanisms to handle edge cases:

### Zombie Run Cleanup

A **zombie run** is a run record stuck in `"running"` status because the Scheduler crashed mid-execution.

The Scheduler periodically (every few minutes) scans for runs where:
- Status is `"running"`
- Created more than `zombieThresholdMs` ago (default: 5 minutes)

These runs are marked as `"timeout"` or `"cancelled"` to clean up the database and prevent confusion in the UI.

### Lock Expiration

The `_lockedUntil` field prevents double execution. Locks are short-lived (30 seconds by default). If a Scheduler crashes, the lock expires and another Scheduler can pick up the work.

This means endpoints can't get permanently stuck. At worst, there's a delay equal to the lock TTL before another Scheduler claims the endpoint.

### Past Candidate Protection

As mentioned earlier, the Governor reschedules past candidates from `now` rather than allowing them to remain in the past. This prevents immediate reclaiming, which would create a tight loop.

The Scheduler also has a second layer of protection: after calling the Governor, it checks if the returned `nextRunAt` is still in the past (due to slow execution). If so, it recalculates from the current time using the intended interval.

This double-check ensures that even if execution is very slow, the system doesn't spiral into a reclaim loop.

### Failure Count and Backoff

When an endpoint fails repeatedly, the Governor applies exponential backoff to interval-based schedules:

- 0 failures: Normal interval
- 1 failure: 2x interval
- 2 failures: 4x interval
- 3 failures: 8x interval
- 4 failures: 16x interval
- 5+ failures: 32x interval (capped)

This prevents hammering a failing service while still retrying. The backoff resets to 0 on the first success.
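
The backoff calculation fits in one line of TypeScript (the function name is illustrative):

```ts
// interval * 2^failureCount, capped at 2^5 = 32x
function backoffIntervalMs(baselineIntervalMs: number, failureCount: number): number {
  return baselineIntervalMs * 2 ** Math.min(failureCount, 5);
}

// Example: a 60_000 ms baseline with 3 consecutive failures waits 480_000 ms (8x).
```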

## Sources: Tracing Scheduling Decisions

Every scheduling decision records its **source**. This tells you why an endpoint ran at a particular time. Possible sources:

- `"baseline-cron"`: Ran on cron schedule
- `"baseline-interval"`: Ran on fixed interval
- `"ai-interval"`: Ran using AI interval hint
- `"ai-oneshot"`: Ran using AI one-shot hint
- `"clamped-min"`: Chosen time was too soon, moved to min interval
- `"clamped-max"`: Chosen time was too late, moved to max interval
- `"paused"`: Endpoint is paused

These sources are stored in run records and logs. When debugging "why did this endpoint run at 3:47 AM?", check the source. It shows the decision trail.

## What Happens Between Ticks

While the Scheduler sleeps (the 5 seconds between ticks), other things can happen:

- The AI Planner might write new hints
- You might update the endpoint configuration via the API
- You might pause an endpoint
- External systems might change state

None of this disrupts the Scheduler. When it wakes up for the next tick, it reads fresh state from the database and makes decisions based on current reality.

This is the power of database-mediated communication: the Scheduler and AI Planner stay in sync without ever talking directly.

## Key Takeaways

1. **The Scheduler is simple**: Claim, execute, record, schedule, repeat
2. **The Governor is deterministic**: Pure function, same inputs = same output
3. **AI hints override baseline**: This enables adaptation
4. **Constraints are hard limits**: Min/max and pause override everything
5. **Safety mechanisms prevent failures**: Locks, zombie cleanup, backoff, past protection
6. **Sources provide auditability**: Every decision is traceable

Understanding how scheduling works gives you the foundation to configure endpoints effectively, debug unexpected behavior, and reason about how AI adaptation affects execution timing.

## Next Steps

- **[How AI Adaptation Works](./how-ai-adaptation-works.md)** - Learn how the AI Planner writes hints
- **[Configuration and Constraints](./configuration-and-constraints.md)** - Guide to setting up endpoints safely
- **[Reference](./reference.md)** - Quick lookup for fields, defaults, and sources

package/dist/docs/technical/reference.md
@@ -0,0 +1,404 @@
---
id: technical-reference
title: Technical Reference
description: Quick lookup for schemas, defaults, tools, and state transitions
tags: [assistant, technical, reference]
sidebar_position: 6
mcp:
  uri: file:///docs/technical/reference.md
  mimeType: text/markdown
  priority: 0.75
  lastModified: 2025-11-02T00:00:00Z
---

# Reference

Quick lookup for Cronicorn terminology, schema, defaults, and tools.

## Glossary

**Baseline Schedule**
: The permanent schedule configured for an endpoint (cron expression or interval). Used when no AI hints are active. Never expires unless manually updated.

**Governor**
: Pure function that calculates the next run time for an endpoint. Takes current time, endpoint state, and cron parser as inputs. Returns timestamp and source.

**Hint**
: Temporary AI suggestion for scheduling (interval or one-shot). Has a TTL and expires automatically. Stored in `aiHint*` fields.

**Nudging**
: Updating `nextRunAt` immediately when AI writes a hint, so the change takes effect within seconds instead of waiting for the next baseline execution. Uses `setNextRunAtIfEarlier()`.

**Claiming**
: The Scheduler's process of acquiring locks on due endpoints to prevent double execution in multi-worker setups. Uses `_lockedUntil` field.

**Tick**
: One iteration of the Scheduler's loop (every 5 seconds). Claims due endpoints, executes them, records results, schedules next run.

**Source**
: The reason a particular next run time was chosen. Values: `baseline-cron`, `baseline-interval`, `ai-interval`, `ai-oneshot`, `clamped-min`, `clamped-max`, `paused`.

**TTL (Time To Live)**
: How long an AI hint remains valid before expiring. After expiration, the system reverts to baseline. Default: 60 minutes for intervals, 30 minutes for one-shots.

**Exponential Backoff**
: Automatic interval increase after failures: `baselineMs × 2^min(failureCount, 5)`. Applies only to baseline interval schedules, not AI hints or cron.

**Zombie Run**
: A run record stuck in `"running"` status because the Scheduler crashed mid-execution. Cleaned up after 5 minutes by default.

**Sibling Endpoints**
: Endpoints in the same job that can coordinate via `get_sibling_latest_responses()`.

**Analysis Session**
: Record of AI analysis for an endpoint, including tools called, reasoning, and token usage. Stored in `ai_analysis_sessions` table.

## Schema: job_endpoints Table

| Field | Type | Description |
|-------|------|-------------|
| `id` | UUID | Unique endpoint identifier |
| `jobId` | UUID | Parent job (for grouping and sibling queries) |
| `tenantId` | UUID | Tenant for multi-tenancy |
| `name` | String | Display name |
| `description` | String | Optional context for AI |
| `url` | String | HTTP endpoint to call |
| `method` | Enum | HTTP method (GET, POST, PUT, PATCH, DELETE) |
| `headersJson` | JSON | Request headers |
| `bodyJson` | JSON | Request body (for POST/PUT/PATCH) |
| `baselineCron` | String | Cron expression (mutually exclusive with interval) |
| `baselineIntervalMs` | Integer | Fixed interval in milliseconds |
| `minIntervalMs` | Integer | Minimum interval constraint (hard limit) |
| `maxIntervalMs` | Integer | Maximum interval constraint (hard limit) |
| `timeoutMs` | Integer | Request timeout |
| `maxResponseSizeKb` | Integer | Response body size limit |
| `maxExecutionTimeMs` | Integer | Max execution time for lock duration |
| `aiHintIntervalMs` | Integer | AI-suggested interval (temporary) |
| `aiHintNextRunAt` | Timestamp | AI-suggested one-shot time (temporary) |
| `aiHintExpiresAt` | Timestamp | When AI hints expire |
| `aiHintReason` | String | AI's explanation for hint |
| `pausedUntil` | Timestamp | Pause until this time (null = not paused) |
| `lastRunAt` | Timestamp | When endpoint last executed |
| `nextRunAt` | Timestamp | When to run next (updated after each execution) |
| `failureCount` | Integer | Consecutive failures (reset on success) |
| `_lockedUntil` | Timestamp | Lock expiration for claiming |
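
The same row expressed as an illustrative TypeScript type, useful for orientation when reading run records or API payloads; nullability and exact column types may differ in the actual schema:

```ts
type HttpMethod = "GET" | "POST" | "PUT" | "PATCH" | "DELETE";

interface JobEndpoint {
  id: string;                        // UUID
  jobId: string;                     // parent job, used for sibling queries
  tenantId: string;
  name: string;
  description?: string;              // optional context for the AI
  url: string;
  method: HttpMethod;
  headersJson?: Record<string, string>;
  bodyJson?: unknown;                // request body for POST/PUT/PATCH
  baselineCron?: string;             // mutually exclusive with baselineIntervalMs
  baselineIntervalMs?: number;
  minIntervalMs?: number;            // hard limits
  maxIntervalMs?: number;
  timeoutMs?: number;
  maxResponseSizeKb?: number;
  maxExecutionTimeMs?: number;
  aiHintIntervalMs?: number;         // temporary AI hints
  aiHintNextRunAt?: string;          // ISO timestamp
  aiHintExpiresAt?: string;
  aiHintReason?: string;
  pausedUntil?: string | null;       // null = not paused
  lastRunAt?: string;
  nextRunAt?: string;
  failureCount: number;              // consecutive failures, reset on success
  _lockedUntil?: string;             // lock expiration for claiming
}
```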

## Default Values

| Setting | Default | Notes |
|---------|---------|-------|
| **Scheduler tick interval** | 5 seconds | How often Scheduler wakes up to claim endpoints |
| **AI analysis interval** | 5 minutes | How often AI Planner discovers and analyzes endpoints |
| **Lock TTL** | 30 seconds | How long claimed endpoints stay locked |
| **Batch size** | 10 endpoints | Max endpoints claimed per tick |
| **Zombie threshold** | 5 minutes | Runs older than this marked as timeout |
| **AI hint TTL** | 60 minutes | Default for `propose_interval` |
| **One-shot hint TTL** | 30 minutes | Default for `propose_next_time` |
| **Timeout** | 30 seconds | Default request timeout |
| **Max response size** | 100 KB | Default response body limit |
| **Failure count cap** | 5 | Backoff capped at 2^5 = 32x |
| **Min interval** | None | No minimum unless configured |
| **Max interval** | None | No maximum unless configured |

## AI Tool Catalog

### Action Tools (Write Hints)

#### propose_interval

Adjust endpoint execution frequency.

**Parameters**:
- `intervalMs` (number, required): New interval in milliseconds
- `ttlMinutes` (number, default: 60): How long hint is valid
- `reason` (string, optional): Explanation for adjustment

**Effect**: Writes `aiHintIntervalMs` and `aiHintExpiresAt`, nudges `nextRunAt`

**Override behavior**: Overrides baseline (not one-shot)

**Example**:
```json
{
  "intervalMs": 30000,
  "ttlMinutes": 15,
  "reason": "Queue depth increasing - tightening monitoring"
}
```

#### propose_next_time

Schedule a one-shot execution at a specific time.

**Parameters**:
- `nextRunAtIso` (ISO 8601 string, required): When to run next
- `ttlMinutes` (number, default: 30): How long hint is valid
- `reason` (string, optional): Explanation

**Effect**: Writes `aiHintNextRunAt` and `aiHintExpiresAt`, nudges `nextRunAt`

**Override behavior**: Competes with baseline (earliest wins)

**Example**:
```json
{
  "nextRunAtIso": "2025-11-02T14:30:00Z",
  "ttlMinutes": 5,
  "reason": "Immediate investigation of failure spike"
}
```

#### pause_until

Pause execution temporarily or resume.

**Parameters**:
- `untilIso` (ISO 8601 string or null, required): Pause until time, or null to resume
- `reason` (string, optional): Explanation

**Effect**: Writes `pausedUntil`

**Override behavior**: Overrides everything (highest priority)

**Example**:
```json
{
  "untilIso": "2025-11-02T15:00:00Z",
  "reason": "Dependency unavailable until maintenance completes"
}
```

### Query Tools (Read Data)

#### get_latest_response

Get the most recent response body from this endpoint.

**Parameters**: None

**Returns**:
```json
{
  "found": true,
  "responseBody": { "queue_depth": 45, "status": "healthy" },
  "timestamp": "2025-11-02T14:30:00Z",
  "status": "success"
}
```

#### get_response_history

Get recent response bodies to identify trends.

**Parameters**:
- `limit` (number, default: 2, max: 10): Number of responses
- `offset` (number, default: 0): Skip N newest for pagination

**Returns**:
```json
{
  "count": 2,
  "hasMore": true,
  "pagination": { "offset": 0, "limit": 2, "nextOffset": 2 },
  "responses": [
    {
      "responseBody": { "queue_depth": 200 },
      "timestamp": "2025-11-02T14:30:00Z",
      "status": "success",
      "durationMs": 120
    }
  ],
  "hint": "More history available - call again with offset: 2"
}
```

**Note**: Response bodies truncated at 1000 chars

#### get_sibling_latest_responses

Get the latest responses from all endpoints in the same job.

**Parameters**: None

**Returns**:
```json
{
  "count": 3,
  "siblings": [
    {
      "endpointId": "ep_456",
      "endpointName": "Data Processor",
      "responseBody": { "batch_id": "2025-11-02", "ready": true },
      "timestamp": "2025-11-02T14:25:00Z",
      "status": "success"
    }
  ]
}
```

**Note**: Only returns endpoints with the same `jobId`

### Final Answer Tool

#### submit_analysis

Signal analysis completion and provide reasoning.

**Parameters**:
- `reasoning` (string, required): Analysis explanation
- `actions_taken` (string[], optional): List of tools called
- `confidence` (enum, optional): 'high' | 'medium' | 'low'

**Returns**:
```json
{
  "status": "analysis_complete",
  "reasoning": "...",
  "actions_taken": ["propose_interval"],
  "confidence": "high"
}
```

**Note**: Must be called last to complete analysis

## Scheduling Sources

Sources explain why a particular next run time was chosen:

| Source | Meaning | Priority |
|--------|---------|----------|
| `paused` | Endpoint is paused, runs at `pausedUntil` | Highest (overrides all) |
| `clamped-min` | Chosen time was too soon, moved to `now + minIntervalMs` | High |
| `clamped-max` | Chosen time was too late, moved to `now + maxIntervalMs` | High |
| `ai-interval` | AI interval hint overrode baseline | Medium |
| `ai-oneshot` | AI one-shot hint won competition | Medium |
| `baseline-cron` | Cron expression determined time | Low (default) |
| `baseline-interval` | Fixed interval determined time | Low (default) |

**Reading sources**: Check run records or logs to see why an endpoint ran at a particular time.
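
The same source values as a TypeScript union, handy when reading run records programmatically (the type name is illustrative):

```ts
// Listed in priority order: paused > clamped > AI hints > baseline.
type SchedulingSource =
  | "paused"
  | "clamped-min"
  | "clamped-max"
  | "ai-interval"
  | "ai-oneshot"
  | "baseline-cron"
  | "baseline-interval";
```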

## Constraint Interaction Matrix

How different settings interact:

| If you set... | AI interval hints... | AI one-shot hints... | Baseline... |
|---------------|---------------------|---------------------|-------------|
| `pausedUntil` | Ignored (pause wins) | Ignored (pause wins) | Ignored |
| `minIntervalMs` | Clamped to minimum | Clamped to minimum | Clamped to minimum |
| `maxIntervalMs` | Clamped to maximum | Clamped to maximum | Clamped to maximum |
| Both AI hints active | Compete (earliest wins) | Compete (earliest wins) | Ignored |
| Cron baseline | Ignored by AI | Competes with cron | Used when no hints |

**Key insight**: Pause > Constraints > AI hints > Baseline

## Field Constraints

Limits enforced by the system:

| Field | Min | Max | Units |
|-------|-----|-----|-------|
| `baselineIntervalMs` | 1,000 | None | Milliseconds (1 second minimum) |
| `minIntervalMs` | 0 | None | Milliseconds |
| `maxIntervalMs` | 0 | None | Milliseconds |
| `timeoutMs` | 1,000 | 1,800,000 | Milliseconds (30 minutes max) |
| `maxResponseSizeKb` | 1 | 10,000 | Kilobytes |
| `maxExecutionTimeMs` | 1,000 | 1,800,000 | Milliseconds (30 minutes max) |
| `failureCount` | 0 | None | Integer (capped at 5 for backoff) |
| Hint TTL | 1 | None | Minutes (recommended: 5-240) |

## Common Response Body Patterns

Reusable structures for endpoint responses:

### Health Status
```json
{
  "status": "healthy" | "degraded" | "critical",
  "timestamp": "2025-11-02T14:30:00Z"
}
```

### Metrics
```json
{
  "queue_depth": 45,
  "processing_rate_per_min": 100,
  "error_rate_pct": 1.2,
  "avg_latency_ms": 150
}
```

### Coordination Signals
```json
{
  "ready_for_processing": true,
  "batch_id": "2025-11-02",
  "dependency_status": "healthy"
}
```

### Cooldown Tracking
```json
{
  "last_action_at": "2025-11-02T12:00:00Z",
  "action_type": "cache_warm",
  "cooldown_minutes": 60
}
```

### Thresholds
```json
{
  "current_value": 250,
  "warning_threshold": 300,
  "critical_threshold": 500
}
```

## Quick Troubleshooting

**Endpoint not running**:
- Check `pausedUntil` (might be paused)
- Check `nextRunAt` (might be scheduled far in future)
- Check `_lockedUntil` (might be locked by crashed worker)

**AI not adapting**:
- Check `aiHintExpiresAt` (hints might be expired)
- Check analysis sessions (AI might not see patterns)
- Verify response bodies include metrics (AI needs data)
- Check quota limits (might be exceeded)

**Running too frequently**:
- Check active AI interval hint
- Verify `minIntervalMs` isn't set too low
- Check for `propose_next_time` loops

**Running too slowly**:
- Check `maxIntervalMs` constraint
- Check exponential backoff (high `failureCount`)
- Verify baseline isn't too long

## Useful Queries

Check if endpoint is paused:
```
pausedUntil > now ? "Paused until {pausedUntil}" : "Active"
```

Check if AI hints are active:
```
aiHintExpiresAt > now ? "Active hints" : "No active hints"
```

Calculate current backoff multiplier:
```
failureCount > 0 ? 2^min(failureCount, 5) : 1
```
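
The same checks as small TypeScript helpers; field names follow the schema table, helper names are illustrative:

```ts
const isPaused = (pausedUntil: Date | null, now = new Date()): boolean =>
  pausedUntil !== null && pausedUntil > now;

const hasActiveHints = (aiHintExpiresAt: Date | null, now = new Date()): boolean =>
  aiHintExpiresAt !== null && aiHintExpiresAt > now;

const backoffMultiplier = (failureCount: number): number =>
  failureCount > 0 ? 2 ** Math.min(failureCount, 5) : 1;
```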

## Next Steps

For detailed explanations, see:
- **[System Architecture](./system-architecture.md)** - Understand the big picture
- **[How Scheduling Works](./how-scheduling-works.md)** - Deep-dive into Governor and Scheduler
- **[How AI Adaptation Works](./how-ai-adaptation-works.md)** - Learn about hints and tools
- **[Configuration Guide](./configuration-and-constraints.md)** - Set up endpoints correctly