@skill-map/spec 0.53.0 → 0.55.0
- package/CHANGELOG.md +32 -0
- package/README.md +12 -10
- package/architecture.md +154 -150
- package/cli-contract.md +139 -143
- package/conformance/README.md +9 -9
- package/conformance/coverage.md +5 -5
- package/db-schema.md +72 -72
- package/index.json +19 -18
- package/interfaces/security-scanner.md +25 -25
- package/job-events.md +43 -43
- package/job-lifecycle.md +32 -36
- package/package.json +2 -1
- package/plugin-author-guide.md +97 -125
- package/plugin-kv-api.md +22 -23
- package/plugin-quickstart.md +96 -0
- package/prompt-preamble.md +6 -6
- package/schemas/extensions/action.schema.json +6 -0
- package/schemas/project-config.schema.json +4 -0
- package/telemetry.md +120 -136
- package/versioning.md +12 -12
package/job-lifecycle.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# Job lifecycle
|
|
2
2
|
|
|
3
|
-
Normative state machine for jobs. A `Job` (see [`schemas/job.schema.json`](./schemas/job.schema.json)) is the runtime instance of an `Action` applied to one or more `Node`s
|
|
3
|
+
Normative state machine for jobs. A `Job` (see [`schemas/job.schema.json`](./schemas/job.schema.json)) is the runtime instance of an `Action` applied to one or more `Node`s, moving through this lifecycle exactly once.
|
|
4
4
|
|
|
5
5
|
---
|
|
6
6
|
|
|
@@ -39,9 +39,9 @@ Terminal states: `completed`, `failed`. Once terminal, a job MUST NOT transition
|
|
|
39
39
|
| `queued` | `running` | Atomic claim by a runner. |
|
|
40
40
|
| `queued` | `failed` | `sm job cancel <id>` (reason `user-cancelled`). |
|
|
41
41
|
| `running` | `completed` | `sm record --status completed` with valid nonce. |
|
|
42
|
-
| `running` | `failed` | `sm record --status failed`, OR TTL expired (reason `abandoned`), OR runner subprocess returned non-zero (reason `runner-error`), OR report failed schema validation (reason `report-invalid`), OR rendered content row missing at runtime (reason `job-file-missing`, historically named for the on-disk artifact; now
|
|
42
|
+
| `running` | `failed` | `sm record --status failed`, OR TTL expired (reason `abandoned`), OR runner subprocess returned non-zero (reason `runner-error`), OR report failed schema validation (reason `report-invalid`), OR rendered content row missing at runtime (reason `job-file-missing`, historically named for the on-disk artifact; now a missing `state_job_contents` row, a DB-corruption-only state since the runtime invariant is that `state_jobs.content_hash` always resolves). |
|
|
43
43
|
|
|
44
|
-
Any other transition attempt MUST be rejected and MUST NOT mutate state. Implementations SHOULD log
|
|
44
|
+
Any other transition attempt MUST be rejected and MUST NOT mutate state. Implementations SHOULD log it.
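The transition rule above can be read as a lookup over allowed `(from, to)` pairs. The following is a minimal sketch covering only the four rows visible in this hunk; the names `ALLOWED`, `TransitionRejected`, and `transition` are hypothetical and do not come from the spec.

```python
# Hypothetical sketch: only the (from, to) pairs visible in this hunk.
ALLOWED = {
    ("queued", "running"),
    ("queued", "failed"),
    ("running", "completed"),
    ("running", "failed"),
}


class TransitionRejected(Exception):
    """Raised when a transition is not in the allowed set."""


def transition(job: dict, to_state: str) -> dict:
    """Return a copy of `job` in `to_state`; never mutate the input."""
    if (job["status"], to_state) not in ALLOWED:
        # Rejected attempts leave the stored state untouched.
        raise TransitionRejected(f"{job['status']} -> {to_state}")
    return {**job, "status": to_state}
```

Because terminal states never appear as a source in `ALLOWED`, a `completed` or `failed` job cannot move again.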
|
|
45
45
|
|
|
46
46
|
---
|
|
47
47
|
|
|
@@ -54,13 +54,13 @@ Any other transition attempt MUST be rejected and MUST NOT mutate state. Impleme
|
|
|
54
54
|
3. Compute `contentHash = sha256(actionId + actionVersion + bodyHash + frontmatterHash + promptTemplateHash)`.
|
|
55
55
|
4. **Duplicate check**: query `state_jobs` for any row with `(actionId, actionVersion, nodeId, contentHash)` AND `status IN ('queued', 'running')`. If found, refuse with exit 3 and print the existing job id (unless `--force`).
|
|
56
56
|
5. Compute `ttlSeconds` per §TTL resolution below. Frozen on `state_jobs.ttlSeconds` for the life of this job.
|
|
57
|
-
6. Resolve `priority` (integer, default `0`). Precedence (lowest → highest): action manifest `defaultPriority` → user config `jobs.perActionPriority.<actionId>` → flag `--priority <n>`. Higher runs first; ties broken by `createdAt ASC`. Negative values are permitted and run after the default bucket.
|
|
57
|
+
6. Resolve `priority` (integer, default `0`). Precedence (lowest → highest): action manifest `defaultPriority` → user config `jobs.perActionPriority.<actionId>` → flag `--priority <n>`. Higher runs first; ties broken by `createdAt ASC`. Negative values are permitted and run after the default bucket. Frozen on `state_jobs.priority` at submit time, immutable for the life of the job.
|
|
58
58
|
7. Generate `nonce` (implementation-chosen; MUST be cryptographically random, ≥ 128 bits of entropy).
|
|
59
|
-
8. Render the
|
|
60
|
-
9. Insert a row in `state_jobs` with `status = 'queued'`, `createdAt = now`.
|
|
59
|
+
8. Render the job content (canonical preamble + action template + interpolated user content per [`prompt-preamble.md`](./prompt-preamble.md)) and write it to `state_job_contents` via `INSERT OR IGNORE` keyed by `content_hash`. Multiple `state_jobs` rows MAY share one `content_hash` row: stored once, refcounted by reference. Implementations MUST NOT persist the rendered content to a filesystem path; the DB row is the canonical artifact.
|
|
60
|
+
9. Insert a row in `state_jobs` with `status = 'queued'`, `createdAt = now`. Its `content_hash` references the just-stored `state_job_contents.content_hash`. Steps 8 and 9 MUST run inside one transaction.
|
|
61
61
|
10. Return the job id.
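Steps 3 and 7 to 9 can be sketched with Python's standard library. This is an illustration under stated assumptions, not the reference implementation: the table layout is reduced to a few columns, `+` in the hash formula is assumed to mean plain string concatenation, and the helper names `content_hash` and `submit` are hypothetical.

```python
import hashlib
import secrets
import sqlite3
import time


def content_hash(action_id, action_version, body_hash, frontmatter_hash, template_hash):
    # Assumption: "+" in the spec formula is plain string concatenation (UTF-8).
    joined = action_id + action_version + body_hash + frontmatter_hash + template_hash
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def submit(db, job_id, chash, rendered):
    nonce = secrets.token_hex(16)  # 16 random bytes = 128 bits of entropy
    with db:  # steps 8 and 9 commit together or roll back together
        db.execute(
            "INSERT OR IGNORE INTO state_job_contents (content_hash, content) VALUES (?, ?)",
            (chash, rendered),
        )
        db.execute(
            "INSERT INTO state_jobs (id, status, content_hash, nonce, createdAt) "
            "VALUES (?, 'queued', ?, ?, ?)",
            (job_id, chash, nonce, int(time.time() * 1000)),
        )
    return job_id


# Simplified, assumed schema for the sketch only.
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE state_job_contents (content_hash TEXT PRIMARY KEY, content TEXT NOT NULL);
    CREATE TABLE state_jobs (
        id TEXT PRIMARY KEY, status TEXT NOT NULL, content_hash TEXT NOT NULL,
        nonce TEXT NOT NULL, createdAt INTEGER NOT NULL
    );
    """
)
h = content_hash("summarize", "1.0.0", "b" * 64, "f" * 64, "t" * 64)
submit(db, "job-1", h, "rendered prompt")
submit(db, "job-2", h, "rendered prompt")  # same content hash: content row is reused
```

The second call adds a `state_jobs` row but leaves `state_job_contents` at a single row, which is the sharing behaviour step 8 describes.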
|
|
62
62
|
|
|
63
|
-
`--all` fans out one job per node matching the action's `preconditions`. Each fan-out job is independent: some may be
|
|
63
|
+
`--all` fans out one job per node matching the action's `preconditions`. Each fan-out job is independent: some may be refused as duplicates, others succeed. The CLI reports a summary.
|
|
64
64
|
|
|
65
65
|
---
|
|
66
66
|
|
|
@@ -87,21 +87,21 @@ UPDATE state_jobs
|
|
|
87
87
|
|
|
88
88
|
The second `AND status = 'queued'` guards against a race where two runners select the same id at the same instant; only one succeeds.
|
|
89
89
|
|
|
90
|
-
**Non-SQLite implementations**: MUST provide an equivalent single-statement atomic transition. A two-step `SELECT then UPDATE` is NOT acceptable,
|
|
90
|
+
**Non-SQLite implementations**: MUST provide an equivalent single-statement atomic transition. A two-step `SELECT then UPDATE` is NOT acceptable: it is observable as a double-claim bug.
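A single-statement claim might look like the sketch below. The full SQL is not visible in this hunk, so the statement is a reconstruction from the surrounding prose (priority descending, `createdAt` ascending, the repeated `status = 'queued'` guard). It assumes SQLite 3.35 or later for `RETURNING`, and the column set and the `claim` helper are illustrative.

```python
import sqlite3

# Reconstructed from the prose above; not copied from the spec.
CLAIM_SQL = """
UPDATE state_jobs
   SET status = 'running',
       claimedAt = :now,
       expiresAt = :now + ttlSeconds * 1000
 WHERE id = (SELECT id FROM state_jobs
              WHERE status = 'queued'
              ORDER BY priority DESC, createdAt ASC
              LIMIT 1)
   AND status = 'queued'
RETURNING id
"""


def claim(db, now_ms):
    """Claim the next queued job; return its id, or None if the queue is empty."""
    with db:
        row = db.execute(CLAIM_SQL, {"now": now_ms}).fetchone()
    return row[0] if row else None


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE state_jobs (id TEXT PRIMARY KEY, status TEXT, priority INTEGER, "
    "createdAt INTEGER, ttlSeconds INTEGER, claimedAt INTEGER, expiresAt INTEGER)"
)
db.executemany(
    "INSERT INTO state_jobs (id, status, priority, createdAt, ttlSeconds) "
    "VALUES (?, 'queued', ?, ?, 60)",
    [("a", 0, 100), ("b", 5, 200), ("c", 0, 50)],
)
db.commit()
```

Selection and update happen in one statement, so a second runner racing on the same row finds `status` already `running` and updates nothing.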
|
|
91
91
|
|
|
92
|
-
`sm job claim` exposes this primitive to Skill agents (and any driving adapter
|
|
92
|
+
`sm job claim` exposes this primitive to Skill agents (and any driving adapter draining from outside a CLI-runner loop): returns the id on stdout (exit 0) or exits 1 if the queue is empty.
|
|
93
93
|
|
|
94
|
-
In `--json` mode, `sm job claim` instead returns
|
|
94
|
+
In `--json` mode, `sm job claim` instead returns `{ "id": "<id>", "nonce": "<nonce>", "content": "<rendered MD content>" }`. Drivers MUST use the `--json` form when they intend to call `sm record` afterwards: the nonce is the sole credential the callback verb checks, and embedding it in the response is the contracted handover. The plain stdout form (id only) is kept for legacy scripts that just want the claimed id.
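A driver consuming the `--json` form could validate the payload before doing any work. This is a minimal sketch: the three keys come from the text above, while the `parse_claim` helper and its error handling are assumptions.

```python
import json

REQUIRED_KEYS = {"id", "nonce", "content"}


def parse_claim(stdout_text):
    """Parse the JSON printed by the claim verb into (id, nonce, content)."""
    payload = json.loads(stdout_text)
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"claim response missing keys: {sorted(missing)}")
    # The nonce must be kept for the later record call and should not be logged.
    return payload["id"], payload["nonce"], payload["content"]
```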
|
|
95
95
|
|
|
96
96
|
---
|
|
97
97
|
|
|
98
98
|
## TTL and auto-reap
|
|
99
99
|
|
|
100
|
-
Every `running` job has
|
|
100
|
+
Every `running` job has `expiresAt = claimedAt + ttlSeconds × 1000`. Once real time passes `expiresAt`, the job is abandoned.
|
|
101
101
|
|
|
102
102
|
### Reap procedure
|
|
103
103
|
|
|
104
|
-
Run at the **start of every `sm job run
|
|
104
|
+
Run at the **start of every `sm job run`**, before the first claim:
|
|
105
105
|
|
|
106
106
|
```sql
|
|
107
107
|
UPDATE state_jobs
|
|
@@ -112,22 +112,22 @@ UPDATE state_jobs
|
|
|
112
112
|
AND expiresAt < <now>;
|
|
113
113
|
```
|
|
114
114
|
|
|
115
|
-
|
|
115
|
+
The number of rows affected is reported as `run.reap.completed.reapedCount` in the event stream.
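The reap step and its row count can be sketched as follows. Only part of the SQL is visible in this hunk, so the `SET` clause is inferred from the transition table (`failed` with reason `abandoned`). The reduced schema and the `reap` helper are assumptions.

```python
import sqlite3


def reap(db, now_ms):
    """Fail every running job whose deadline has passed; return the reaped count."""
    with db:
        cur = db.execute(
            "UPDATE state_jobs SET status = 'failed', failureReason = 'abandoned' "
            "WHERE status = 'running' AND expiresAt < ?",
            (now_ms,),
        )
        return cur.rowcount  # the value reported as reapedCount


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE state_jobs (id TEXT PRIMARY KEY, status TEXT, "
    "expiresAt INTEGER, failureReason TEXT)"
)
db.executemany(
    "INSERT INTO state_jobs (id, status, expiresAt) VALUES (?, ?, ?)",
    [("j1", "running", 1000), ("j2", "running", 5000), ("j3", "queued", None)],
)
db.commit()
```

Calling `reap(db, 2000)` on this data touches only `j1`: `j2` has not expired and `j3` is not running.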
|
|
116
116
|
|
|
117
|
-
Implementations MAY expose `sm job reap` as an explicit verb
|
|
117
|
+
Implementations MAY expose `sm job reap` as an explicit diagnostics verb, but MUST perform reaping automatically inside `sm job run`.
|
|
118
118
|
|
|
119
119
|
### TTL resolution
|
|
120
120
|
|
|
121
|
-
The kernel resolves the effective TTL for a new job in three
|
|
121
|
+
The kernel resolves the effective TTL for a new job in three steps. The resolved value is written to `state_jobs.ttlSeconds` at submit time, immutable thereafter.
|
|
122
122
|
|
|
123
123
|
#### Step 1, Base duration
|
|
124
124
|
|
|
125
|
-
A seconds integer
|
|
125
|
+
A seconds integer for how long the action is expected to run before the grace multiplier applies:
|
|
126
126
|
|
|
127
127
|
1. Action manifest `expectedDurationSeconds`, if declared.
|
|
128
128
|
2. Otherwise, config `jobs.ttlSeconds` (default: `3600`).
|
|
129
129
|
|
|
130
|
-
The base duration exists even for actions that cannot estimate their own runtime (typically `mode: local`); the global config value
|
|
130
|
+
The base duration exists even for actions that cannot estimate their own runtime (typically `mode: local`); the global config value keeps the formula below well-defined.
|
|
131
131
|
|
|
132
132
|
#### Step 2, Computed TTL
|
|
133
133
|
|
|
@@ -137,14 +137,14 @@ computed = max(base × jobs.graceMultiplier, jobs.minimumTtlSeconds)
|
|
|
137
137
|
|
|
138
138
|
Config defaults: `jobs.graceMultiplier = 3`, `jobs.minimumTtlSeconds = 60`.
|
|
139
139
|
|
|
140
|
-
`minimumTtlSeconds` is a **floor**, not a default
|
|
140
|
+
`minimumTtlSeconds` is a **floor**, not a default: it guarantees no job is claimed with a sub-minute deadline however small the base duration, and never acts as an initial value.
|
|
141
141
|
|
|
142
142
|
#### Step 3, User overrides
|
|
143
143
|
|
|
144
144
|
Two optional overrides, evaluated in order; the later one wins and replaces everything above it:
|
|
145
145
|
|
|
146
146
|
1. Config `jobs.perActionTtl.<actionId>`, integer seconds. Replaces the computed TTL entirely; the formula is skipped for that action id.
|
|
147
|
-
2. Flag `sm job submit --ttl <seconds>`, integer seconds. Highest precedence
|
|
147
|
+
2. Flag `sm job submit --ttl <seconds>`, integer seconds. Highest precedence; replaces anything.
|
|
148
148
|
|
|
149
149
|
Negative or zero values MUST be rejected with exit 2 at submit time.
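The three steps combine into one small function. The sketch below uses the config keys and defaults named above; the function name `resolve_ttl`, its parameters, and the use of `ValueError` to stand in for the CLI's exit 2 are assumptions.

```python
def resolve_ttl(expected_duration=None, config=None, action_id=None, flag_ttl=None):
    """Resolve the effective TTL in seconds: base -> computed -> overrides."""
    config = config or {}

    # Step 1: base duration (manifest value, else the global config value).
    base = expected_duration if expected_duration is not None else config.get("ttlSeconds", 3600)

    # Step 2: computed TTL with the grace multiplier and the floor.
    computed = max(
        base * config.get("graceMultiplier", 3),
        config.get("minimumTtlSeconds", 60),
    )

    # Step 3: overrides; the later source wins and replaces everything above.
    ttl = computed
    per_action = config.get("perActionTtl", {}).get(action_id)
    if per_action is not None:
        ttl = per_action
    if flag_ttl is not None:
        ttl = flag_ttl

    if ttl <= 0:
        # Stands in for "rejected with exit 2 at submit time".
        raise ValueError("TTL must be a positive number of seconds")
    return ttl
```

With the defaults, a 10-second action resolves to the 60-second floor (`max(10 × 3, 60)`), and an action with no declared duration resolves to `3600 × 3 = 10800` seconds.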
|
|
150
150
|
|
|
@@ -167,24 +167,20 @@ Negative or zero values MUST be rejected with exit 2 at submit time.
|
|
|
167
167
|
1. Load the job by id. If not found → exit 5.
|
|
168
168
|
2. Compare the supplied nonce against `state_jobs.nonce`. Mismatch → exit 4 without mutation.
|
|
169
169
|
3. If `state_jobs.status != 'running'` → exit 2 with message "job not in running state". This catches late callbacks after a reap.
|
|
170
|
-
4. If `--status completed`: read the report payload from the path passed to `--report` (
|
|
171
|
-
5. Write the execution record (see [`schemas/execution-record.schema.json`](./schemas/execution-record.schema.json)) with
|
|
170
|
+
4. If `--status completed`: read the report payload from the path passed to `--report` (implementation input only; there is no canonical on-disk report artifact) and validate the parsed JSON against the action's declared report schema. On validation failure → transition to `failed` with reason `report-invalid`; DO NOT stay `running`.
|
|
171
|
+
5. Write the execution record (see [`schemas/execution-record.schema.json`](./schemas/execution-record.schema.json)) with full metrics. The report payload (if any) is stored inline in `state_executions.report_json` as the parsed JSON; the input path is NOT retained.
|
|
172
172
|
6. Transition the job to the terminal state.
|
|
173
173
|
7. Emit `job.callback.received` followed by `job.completed` or `job.failed` (see [`job-events.md`](./job-events.md)).
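Steps 1 to 3 are guard checks with fixed exit codes and a fixed order. The sketch below models them over an in-memory dict; the `check_callback` helper is hypothetical, and the constant-time comparison via `hmac.compare_digest` is a precaution added for this example, not something the spec text requires.

```python
import hmac

EXIT_NOT_RUNNING = 2
EXIT_NONCE_MISMATCH = 4
EXIT_NOT_FOUND = 5


def check_callback(jobs, job_id, nonce):
    """Return an exit code if the callback must be rejected, else None."""
    job = jobs.get(job_id)
    if job is None:
        return EXIT_NOT_FOUND  # step 1
    if not hmac.compare_digest(job["nonce"].encode(), nonce.encode()):
        return EXIT_NONCE_MISMATCH  # step 2, no mutation
    if job["status"] != "running":
        return EXIT_NOT_RUNNING  # step 3, e.g. a late callback after a reap
    return None  # guards passed; steps 4 to 7 may proceed
```

The nonce is checked before the status, so a caller without the right nonce learns nothing about whether the job is still running.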
|
|
174
174
|
|
|
175
|
-
The nonce is the sole authentication factor
|
|
175
|
+
The nonce is the sole authentication factor; a compromised nonce allows forged callbacks for that single job. Nonces MUST be generated per-job, never reused, never logged at info level or above.
|
|
176
176
|
|
|
177
|
-
`--report` accepts
|
|
177
|
+
`--report` accepts a file path or `-` (stdin); the kernel ingests both into `report_json` identically. The on-disk file the runner authored is ephemeral; implementations SHOULD remove it after the kernel acknowledges the callback (courtesy GC, not normative).
|
|
178
178
|
|
|
179
179
|
---
|
|
180
180
|
|
|
181
181
|
## Duplicate prevention rationale
|
|
182
182
|
|
|
183
|
-
The deduplication key `(actionId, actionVersion, nodeId, contentHash)`
|
|
184
|
-
|
|
185
|
-
- Accidental double-submit when a user re-runs a command.
|
|
186
|
-
- Race conditions where two processes both try to submit the same action over the same node at the same content hash.
|
|
187
|
-
- Waste of LLM tokens re-computing an unchanged result.
|
|
183
|
+
The deduplication key `(actionId, actionVersion, nodeId, contentHash)` prevents accidental double-submit on re-run, race conditions where two processes submit the same action over the same node at the same content hash, and wasted LLM tokens re-computing an unchanged result.
|
|
188
184
|
|
|
189
185
|
Post-completion, the check is NOT performed: resubmitting a completed job is always allowed (the previous result is kept in history).
|
|
190
186
|
|
|
@@ -194,9 +190,9 @@ Post-completion, the check is NOT performed: resubmitting a completed job is alw
|
|
|
194
190
|
|
|
195
191
|
## Concurrency
|
|
196
192
|
|
|
197
|
-
Through `v1.0` (spec `v0.x`): **one job at a time**. `sm job run --all` drains sequentially
|
|
193
|
+
Through `v1.0` (spec `v0.x`): **one job at a time**. `sm job run --all` drains sequentially, enforced by the claim semantics above; no pool or scheduler.
|
|
198
194
|
|
|
199
|
-
The event schema carries a `jobId` on every event
|
|
195
|
+
The event schema carries a `jobId` on every event so parallel execution becomes a non-breaking extension. A future implementation MAY spawn multiple claim/run loops concurrently and interleave events; consumers identify an event's job by `jobId`.
|
|
200
196
|
|
|
201
197
|
Parallelism is NOT a v1.0 commitment. Implementations that offer it MUST still emit the canonical event stream correctly.
|
|
202
198
|
|
|
@@ -208,7 +204,7 @@ Implementations MUST handle each of the following:
|
|
|
208
204
|
|
|
209
205
|
| Scenario | Required handling |
|
|
210
206
|
|---|---|
|
|
211
|
-
| `state_jobs` row exists but its `content_hash` is missing from `state_job_contents` (DB corruption, the content row
|
|
207
|
+
| `state_jobs` row exists but its `content_hash` is missing from `state_job_contents` (DB corruption, the content row deleted by external means). | Mark `failed` with `failureReason = job-file-missing`. `sm doctor` MUST report these proactively. The kernel does NOT produce this state under normal operation; submit and prune both keep the two tables consistent. The legacy enum name `job-file-missing` is preserved across the disk-to-DB shift for backward-compatibility; it now refers to a missing content row rather than a missing on-disk file. |
|
|
212
208
|
| `state_job_contents` row references no live `state_jobs` row (GC straggler). | `sm doctor` MUST list them. `sm job prune` MUST collect them in the same transaction that prunes terminal jobs. |
|
|
213
209
|
| Runner crashes between `claim` and reading the content. | Covered by TTL/reap: when `expiresAt` passes, the next reap marks the job `failed` with `abandoned`. |
|
|
214
210
|
| Callback arrives after reap already failed the job. | Reject with exit 2 (see Record step 3). The runner should treat this as an error and log it. |
|
|
@@ -222,7 +218,7 @@ Implementations MUST handle each of the following:
|
|
|
222
218
|
| From | Effect |
|
|
223
219
|
|---|---|
|
|
224
220
|
| `queued` | Transition to `failed` with `failureReason = user-cancelled`. |
|
|
225
|
-
| `running` | Transition to `failed` with `failureReason = user-cancelled`. DOES NOT interrupt a subprocess runner; the runner
|
|
221
|
+
| `running` | Transition to `failed` with `failureReason = user-cancelled`. DOES NOT interrupt a subprocess runner; the runner discovers the failed state on its next callback and exits cleanly. Implementations MAY additionally signal the subprocess; this is not normative. |
|
|
226
222
|
| Terminal | Reject with exit 2 ("already terminal"). |
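The cancel table maps onto a short function. In this sketch the `cancel` helper is hypothetical, and the `0` returned on success is an assumption, since the table only specifies exit 2 for terminal jobs.

```python
TERMINAL = {"completed", "failed"}


def cancel(job: dict):
    """Return (exit_code, job) after applying the cancel rules to a job dict."""
    if job["status"] in TERMINAL:
        return 2, job  # "already terminal": rejected, state unchanged
    # queued and running both become failed with the same reason.
    cancelled = {**job, "status": "failed", "failureReason": "user-cancelled"}
    return 0, cancelled  # assumption: 0 signals success
```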
|
|
227
223
|
|
|
228
224
|
---
|
|
@@ -232,11 +228,11 @@ Implementations MUST handle each of the following:
|
|
|
232
228
|
Config controls (`jobs.retention.completed`, `jobs.retention.failed`):
|
|
233
229
|
|
|
234
230
|
- `completed` default 30 days (2592000 seconds).
|
|
235
|
-
- `failed` default `null` = never auto-purge (preserves history
|
|
231
|
+
- `failed` default `null` = never auto-purge (preserves failure history for analysis).
|
|
236
232
|
|
|
237
233
|
`sm job prune` applies retention. Implementations MAY run this on a schedule (e.g., on `sm doctor`, or in a cron adapter) but MUST NOT prune implicitly during normal verb execution.
|
|
238
234
|
|
|
239
|
-
`sm job prune` MUST also collect orphaned `state_job_contents` rows (no live `state_jobs` references) in the same transaction that prunes terminal jobs.
|
|
235
|
+
`sm job prune` MUST also collect orphaned `state_job_contents` rows (no live `state_jobs` references) in the same transaction that prunes terminal jobs. Ordering: delete terminal `state_jobs` rows in the retention window, then delete `state_job_contents` rows whose `content_hash` no longer appears in any `state_jobs` row.
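The ordering can be sketched as two deletes inside one transaction. The `finishedAt` column, the millisecond timestamps, and the `prune` helper are assumptions made for the example; the retention values follow the defaults listed above.

```python
import sqlite3


def prune(db, now_ms, retention_s):
    """retention_s maps a terminal status to seconds, or None for never."""
    with db:  # both deletes commit together
        # 1. Delete terminal jobs that have aged out of their retention window.
        for status, seconds in retention_s.items():
            if seconds is None:
                continue  # e.g. failed jobs are kept by default
            db.execute(
                "DELETE FROM state_jobs WHERE status = ? AND finishedAt < ?",
                (status, now_ms - seconds * 1000),
            )
        # 2. Delete content rows that no remaining job references.
        db.execute(
            "DELETE FROM state_job_contents WHERE NOT EXISTS ("
            "SELECT 1 FROM state_jobs "
            "WHERE state_jobs.content_hash = state_job_contents.content_hash)"
        )
```

Running the content delete second means a content row shared by a pruned job and a still-queued job survives, because the queued job still references it.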
|
|
240
236
|
|
|
241
237
|
---
|
|
242
238
|
|
|
@@ -252,10 +248,10 @@ Config controls (`jobs.retention.completed`, `jobs.retention.failed`):
|
|
|
252
248
|
|
|
253
249
|
## Stability
|
|
254
250
|
|
|
255
|
-
The state machine diagram above is **stable** as of spec v1.0.0. Adding a new state is a major bump
|
|
251
|
+
The state machine diagram above is **stable** as of spec v1.0.0. Adding a new state is a major bump; adding a new terminal reason (`failureReason` enum value) a minor bump.
|
|
256
252
|
|
|
257
253
|
The `contentHash` formula is **stable**. Changing what goes into the hash breaks duplicate detection across versions and is a major bump.
|
|
258
254
|
|
|
259
255
|
The atomic-claim semantics are **stable**. A double-claim would be a silent correctness bug observable through event-stream anomalies.
|
|
260
256
|
|
|
261
|
-
The TTL resolution procedure (§TTL resolution) is **stable** as of the next spec release. The three-step structure (base → computed → overrides) and the four config keys (`jobs.ttlSeconds`, `jobs.graceMultiplier`, `jobs.minimumTtlSeconds`, `jobs.perActionTtl`) are locked; adding a new override source is a minor bump, changing the formula shape
|
|
257
|
+
The TTL resolution procedure (§TTL resolution) is **stable** as of the next spec release. The three-step structure (base → computed → overrides) and the four config keys (`jobs.ttlSeconds`, `jobs.graceMultiplier`, `jobs.minimumTtlSeconds`, `jobs.perActionTtl`) are locked; adding a new override source is a minor bump, changing the formula shape a major bump.
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@skill-map/spec",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.55.0",
|
|
4
4
|
"description": "JSON Schemas, prose contracts, and conformance suite for the skill-map specification.",
|
|
5
5
|
"license": "MIT",
|
|
6
6
|
"type": "module",
|
|
@@ -39,6 +39,7 @@
|
|
|
39
39
|
"db-schema.md",
|
|
40
40
|
"plugin-kv-api.md",
|
|
41
41
|
"plugin-author-guide.md",
|
|
42
|
+
"plugin-quickstart.md",
|
|
42
43
|
"telemetry.md",
|
|
43
44
|
"interfaces/",
|
|
44
45
|
"schemas/",
|