@skill-map/spec 0.53.0 → 0.54.0

package/job-lifecycle.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # Job lifecycle
2
2
 
3
- Normative state machine for jobs. A `Job` (see [`schemas/job.schema.json`](./schemas/job.schema.json)) is the runtime instance of an `Action` applied to one or more `Node`s. Every job moves through this lifecycle exactly once.
3
+ Normative state machine for jobs. A `Job` (see [`schemas/job.schema.json`](./schemas/job.schema.json)) is the runtime instance of an `Action` applied to one or more `Node`s, moving through this lifecycle exactly once.
4
4
 
5
5
  ---
6
6
 
@@ -39,9 +39,9 @@ Terminal states: `completed`, `failed`. Once terminal, a job MUST NOT transition
39
39
  | `queued` | `running` | Atomic claim by a runner. |
40
40
  | `queued` | `failed` | `sm job cancel <id>` (reason `user-cancelled`). |
41
41
  | `running` | `completed` | `sm record --status completed` with valid nonce. |
42
- | `running` | `failed` | `sm record --status failed`, OR TTL expired (reason `abandoned`), OR runner subprocess returned non-zero (reason `runner-error`), OR report failed schema validation (reason `report-invalid`), OR rendered content row missing at runtime (reason `job-file-missing`, historically named for the on-disk artifact; now refers to a missing `state_job_contents` row, a DB-corruption-only state since the runtime invariant is that `state_jobs.content_hash` always resolves). |
42
+ | `running` | `failed` | `sm record --status failed`, OR TTL expired (reason `abandoned`), OR runner subprocess returned non-zero (reason `runner-error`), OR report failed schema validation (reason `report-invalid`), OR rendered content row missing at runtime (reason `job-file-missing`, historically named for the on-disk artifact; now a missing `state_job_contents` row, a DB-corruption-only state since the runtime invariant is that `state_jobs.content_hash` always resolves). |
43
43
 
44
- Any other transition attempt MUST be rejected and MUST NOT mutate state. Implementations SHOULD log the attempt.
44
+ Any other transition attempt MUST be rejected and MUST NOT mutate state. Implementations SHOULD log it.
45
45
 
46
46
  ---
47
47
 
@@ -54,13 +54,13 @@ Any other transition attempt MUST be rejected and MUST NOT mutate state. Impleme
54
54
  3. Compute `contentHash = sha256(actionId + actionVersion + bodyHash + frontmatterHash + promptTemplateHash)`.
55
55
  4. **Duplicate check**: query `state_jobs` for any row with `(actionId, actionVersion, nodeId, contentHash)` AND `status IN ('queued', 'running')`. If found, refuse with exit 3 and print the existing job id (unless `--force`).
56
56
  5. Compute `ttlSeconds` per §TTL resolution below. Frozen on `state_jobs.ttlSeconds` for the life of this job.
57
- 6. Resolve `priority` (integer, default `0`). Precedence (lowest → highest): action manifest `defaultPriority` → user config `jobs.perActionPriority.<actionId>` → flag `--priority <n>`. Higher runs first; ties broken by `createdAt ASC`. Negative values are permitted and run after the default bucket. The resolved value is frozen on `state_jobs.priority` at submit time and is immutable for the life of the job.
57
+ 6. Resolve `priority` (integer, default `0`). Precedence (lowest → highest): action manifest `defaultPriority` → user config `jobs.perActionPriority.<actionId>` → flag `--priority <n>`. Higher runs first; ties broken by `createdAt ASC`. Negative values are permitted and run after the default bucket. Frozen on `state_jobs.priority` at submit time, immutable for the life of the job.
58
58
  7. Generate `nonce` (implementation-chosen; MUST be cryptographically random, ≥ 128 bits of entropy).
59
- 8. Render the rendered job content (canonical preamble + action template + interpolated user content per [`prompt-preamble.md`](./prompt-preamble.md)) and write it to `state_job_contents` via `INSERT OR IGNORE` keyed by `content_hash`. Multiple `state_jobs` rows MAY share the same `content_hash` row: the content is stored exactly once and refcounted by reference. Implementations MUST NOT persist the rendered content to a filesystem path, the DB row is the canonical artifact.
60
- 9. Insert a row in `state_jobs` with `status = 'queued'`, `createdAt = now`. The row's `content_hash` references the just-stored `state_job_contents.content_hash`. Steps 8 and 9 MUST run inside one transaction.
59
+ 8. Render the job content (canonical preamble + action template + interpolated user content per [`prompt-preamble.md`](./prompt-preamble.md)) and write it to `state_job_contents` via `INSERT OR IGNORE` keyed by `content_hash`. Multiple `state_jobs` rows MAY share one `content_hash` row: stored once, refcounted by reference. Implementations MUST NOT persist the rendered content to a filesystem path, the DB row is the canonical artifact.
60
+ 9. Insert a row in `state_jobs` with `status = 'queued'`, `createdAt = now`. Its `content_hash` references the just-stored `state_job_contents.content_hash`. Steps 8 and 9 MUST run inside one transaction.
61
61
  10. Return the job id.
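Steps 7 to 9 above can be sketched as follows. This is a minimal illustration rather than the reference implementation: the table layout, the integer job id, and the helper names `open_db` and `submit` are assumptions made for the example, and the content hash is taken as an input instead of being computed.

```python
import secrets
import sqlite3
import time

# Hypothetical minimal schema: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE state_job_contents (
    content_hash TEXT PRIMARY KEY,
    content      TEXT NOT NULL
);
CREATE TABLE state_jobs (
    id           INTEGER PRIMARY KEY,
    status       TEXT NOT NULL,
    nonce        TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    createdAt    INTEGER NOT NULL
);
"""


def open_db():
    """Create an in-memory database with the assumed schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def submit(db, content_hash, rendered):
    """Store the rendered content once and enqueue a job that references it."""
    # Step 7: 16 random bytes from the OS CSPRNG = 128 bits of entropy.
    nonce = secrets.token_hex(16)
    # Steps 8 and 9 share one transaction: both rows commit or neither does.
    with db:
        db.execute(
            "INSERT OR IGNORE INTO state_job_contents (content_hash, content) "
            "VALUES (?, ?)",
            (content_hash, rendered),
        )
        cur = db.execute(
            "INSERT INTO state_jobs (status, nonce, content_hash, createdAt) "
            "VALUES ('queued', ?, ?, ?)",
            (nonce, content_hash, int(time.time() * 1000)),
        )
    return cur.lastrowid
```

Submitting twice with the same `content_hash` yields two `state_jobs` rows that share a single `state_job_contents` row, because `INSERT OR IGNORE` skips the second content write.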
62
62
 
63
- `--all` fans out one job per node matching the action's `preconditions`. Each fan-out job is independent: some may be duplicates and be refused, others succeed. The CLI reports a summary.
63
+ `--all` fans out one job per node matching the action's `preconditions`. Each fan-out job is independent: some may be refused as duplicates, others succeed. The CLI reports a summary.
64
64
 
65
65
  ---
66
66
 
@@ -87,21 +87,21 @@ UPDATE state_jobs
87
87
 
88
88
  The second `AND status = 'queued'` guards against a race where two runners select the same id at the same instant; only one succeeds.
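The full claim statement is not visible in this hunk, so the following is only a sketch of a single-statement claim in the shape the text describes. The column set, the `claim` helper, and the use of `RETURNING` (which needs SQLite 3.35 or later) are assumptions made for the example.

```python
import sqlite3

# Hypothetical minimal schema: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE state_jobs (
    id         INTEGER PRIMARY KEY,
    status     TEXT NOT NULL,
    priority   INTEGER NOT NULL DEFAULT 0,
    createdAt  INTEGER NOT NULL,
    ttlSeconds INTEGER NOT NULL,
    claimedAt  INTEGER,
    expiresAt  INTEGER
);
"""

# One statement selects the next job and flips it to running.
# The outer "AND status = 'queued'" is the guard described above.
CLAIM_SQL = """
UPDATE state_jobs
   SET status = 'running',
       claimedAt = :now,
       expiresAt = :now + ttlSeconds * 1000
 WHERE id = (SELECT id FROM state_jobs
              WHERE status = 'queued'
              ORDER BY priority DESC, createdAt ASC
              LIMIT 1)
   AND status = 'queued'
RETURNING id
"""


def open_db():
    """Create an in-memory database with the assumed schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def claim(db, now_ms):
    """Return the claimed job id, or None when nothing is queued."""
    with db:
        # fetchall() runs the statement to completion before the commit.
        rows = db.execute(CLAIM_SQL, {"now": now_ms}).fetchall()
    return rows[0][0] if rows else None
```

With an empty queue the subquery yields `NULL`, the `UPDATE` touches no rows, and `claim` returns `None`, which a CLI wrapper could map to exit 1.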
89
89
 
90
- **Non-SQLite implementations**: MUST provide an equivalent single-statement atomic transition. A two-step `SELECT then UPDATE` is NOT acceptable, it is observable as a double-claim bug.
90
+ **Non-SQLite implementations**: MUST provide an equivalent single-statement atomic transition. A two-step `SELECT then UPDATE` is NOT acceptable, observable as a double-claim bug.
91
91
 
92
- `sm job claim` exposes this primitive to Skill agents (and any driving adapter that wants to drain from outside a CLI-runner loop): returns the id on stdout (exit 0) or exits 1 if the queue is empty.
92
+ `sm job claim` exposes this primitive to Skill agents (and any driving adapter draining from outside a CLI-runner loop): returns the id on stdout (exit 0) or exits 1 if the queue is empty.
93
93
 
94
- In `--json` mode, `sm job claim` instead returns the document `{ "id": "<id>", "nonce": "<nonce>", "content": "<rendered MD content>" }`. Drivers MUST use the `--json` form when they intend to call `sm record` afterwards: the nonce is the sole credential the callback verb checks, and embedding it in the claim's structured response is the contracted handover. The plain stdout form (id only) is preserved for legacy scripts that just want to know what id was claimed.
94
+ In `--json` mode, `sm job claim` instead returns `{ "id": "<id>", "nonce": "<nonce>", "content": "<rendered MD content>" }`. Drivers MUST use the `--json` form when they intend to call `sm record` afterwards: the nonce is the sole credential the callback verb checks, and embedding it in the response is the contracted handover. The plain stdout form (id only) is kept for legacy scripts that just want the claimed id.
95
95
 
96
96
  ---
97
97
 
98
98
  ## TTL and auto-reap
99
99
 
100
- Every `running` job has an `expiresAt = claimedAt + ttlSeconds × 1000`. Once real time passes `expiresAt`, the job is considered abandoned.
100
+ Every `running` job has `expiresAt = claimedAt + ttlSeconds × 1000`. Once real time passes `expiresAt`, the job is abandoned.
101
101
 
102
102
  ### Reap procedure
103
103
 
104
- Run at the **start of every `sm job run`** invocation, before the first claim:
104
+ Run at the **start of every `sm job run`**, before the first claim:
105
105
 
106
106
  ```sql
107
107
  UPDATE state_jobs
@@ -112,22 +112,22 @@ UPDATE state_jobs
112
112
  AND expiresAt < <now>;
113
113
  ```
114
114
 
115
- Number of rows affected is reported as `run.reap.completed.reapedCount` in the event stream.
115
+ Rows affected is reported as `run.reap.completed.reapedCount` in the event stream.
116
116
 
117
- Implementations MAY expose `sm job reap` as an explicit verb for diagnostics, but MUST perform reaping automatically inside `sm job run`.
117
+ Implementations MAY expose `sm job reap` as an explicit diagnostics verb, but MUST perform reaping automatically inside `sm job run`.
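A minimal sketch of the reap step, assuming millisecond timestamps (as the `× 1000` in the `expiresAt` formula suggests) and a reduced, hypothetical column set. The helper names `expires_at` and `reap` are illustrative, not part of the spec.

```python
import sqlite3

# Hypothetical minimal schema: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE state_jobs (
    id            INTEGER PRIMARY KEY,
    status        TEXT NOT NULL,
    failureReason TEXT,
    expiresAt     INTEGER
);
"""

REAP_SQL = """
UPDATE state_jobs
   SET status = 'failed',
       failureReason = 'abandoned'
 WHERE status = 'running'
   AND expiresAt < :now
"""


def open_db():
    """Create an in-memory database with the assumed schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def expires_at(claimed_at_ms, ttl_seconds):
    """expiresAt = claimedAt + ttlSeconds x 1000, in milliseconds."""
    return claimed_at_ms + ttl_seconds * 1000


def reap(db, now_ms):
    """Fail every running job whose deadline has passed; return the count."""
    with db:
        return db.execute(REAP_SQL, {"now": now_ms}).rowcount
```

The returned count is the value an implementation would report as `run.reap.completed.reapedCount`. The comparison is strict, so a job whose `expiresAt` equals the current time survives until the next reap.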
118
118
 
119
119
  ### TTL resolution
120
120
 
121
- The kernel resolves the effective TTL for a new job in three conceptual steps. The resolved value is written to `state_jobs.ttlSeconds` at submit time and is immutable for the life of the job.
121
+ The kernel resolves the effective TTL for a new job in three steps. The resolved value is written to `state_jobs.ttlSeconds` at submit time, immutable thereafter.
122
122
 
123
123
  #### Step 1, Base duration
124
124
 
125
- A seconds integer that represents how long the action is expected to run before the grace multiplier kicks in:
125
+ A seconds integer for how long the action is expected to run before the grace multiplier applies:
126
126
 
127
127
  1. Action manifest `expectedDurationSeconds`, if declared.
128
128
  2. Otherwise, config `jobs.ttlSeconds` (default: `3600`).
129
129
 
130
- The base duration exists even for actions that cannot estimate their own runtime (typically `mode: local`); the global config value ensures the formula below is always well-defined.
130
+ The base duration exists even for actions that cannot estimate their own runtime (typically `mode: local`); the global config value keeps the formula below well-defined.
131
131
 
132
132
  #### Step 2, Computed TTL
133
133
 
@@ -137,14 +137,14 @@ computed = max(base × jobs.graceMultiplier, jobs.minimumTtlSeconds)
137
137
 
138
138
  Config defaults: `jobs.graceMultiplier = 3`, `jobs.minimumTtlSeconds = 60`.
139
139
 
140
- `minimumTtlSeconds` is a **floor**, not a default. It guarantees no job is claimed with a sub-minute deadline regardless of how small the base duration is. It never participates as an initial value.
140
+ `minimumTtlSeconds` is a **floor**, not a default: it guarantees no job is claimed with a sub-minute deadline however small the base duration, and never acts as an initial value.
141
141
 
142
142
  #### Step 3, User overrides
143
143
 
144
144
  Two optional overrides, evaluated in order; the later one wins and replaces everything above it:
145
145
 
146
146
  1. Config `jobs.perActionTtl.<actionId>`, integer seconds. Replaces the computed TTL entirely; the formula is skipped for that action id.
147
- 2. Flag `sm job submit --ttl <seconds>`, integer seconds. Highest precedence. Replaces anything.
147
+ 2. Flag `sm job submit --ttl <seconds>`, integer seconds. Highest precedence; replaces anything.
148
148
 
149
149
  Negative or zero values MUST be rejected with exit 2 at submit time.
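The three steps can be condensed into one function. This is a sketch under stated assumptions: `resolve_ttl` and its parameter names are hypothetical, the defaults mirror the config defaults quoted above, inputs are taken to be integers (rounding for a fractional multiplier is not covered here), and the positivity check is applied to the two overrides.

```python
from typing import Optional


def resolve_ttl(
    expected_duration_seconds: Optional[int] = None,  # action manifest value
    config_ttl_seconds: int = 3600,                   # jobs.ttlSeconds
    grace_multiplier: int = 3,                        # jobs.graceMultiplier
    minimum_ttl_seconds: int = 60,                    # jobs.minimumTtlSeconds
    per_action_ttl: Optional[int] = None,             # jobs.perActionTtl.<actionId>
    flag_ttl: Optional[int] = None,                   # sm job submit --ttl
) -> int:
    """Return the effective TTL in seconds for a new job."""
    # Overrides must be positive; a CLI would map this error to exit 2.
    for override in (per_action_ttl, flag_ttl):
        if override is not None and override <= 0:
            raise ValueError("TTL override must be a positive integer")

    # Step 1: base duration, manifest first, global config otherwise.
    if expected_duration_seconds is not None:
        base = expected_duration_seconds
    else:
        base = config_ttl_seconds

    # Step 2: apply the grace multiplier, then the floor.
    ttl = max(base * grace_multiplier, minimum_ttl_seconds)

    # Step 3: each override replaces everything before it; the flag wins.
    if per_action_ttl is not None:
        ttl = per_action_ttl
    if flag_ttl is not None:
        ttl = flag_ttl
    return ttl
```

With all defaults, `resolve_ttl()` returns `10800` (3600 × 3). An action declaring `expected_duration_seconds=5` resolves to `60`, because the floor applies to the computed value, while `per_action_ttl=30` resolves to `30`, since an override bypasses the formula and its floor.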
150
150
 
@@ -167,24 +167,20 @@ Negative or zero values MUST be rejected with exit 2 at submit time.
167
167
  1. Load the job by id. If not found → exit 5.
168
168
  2. Compare the supplied nonce against `state_jobs.nonce`. Mismatch → exit 4 without mutation.
169
169
  3. If `state_jobs.status != 'running'` → exit 2 with message "job not in running state". This catches late callbacks after a reap.
170
- 4. If `--status completed`: read the report payload from the path passed to `--report` (the path is implementation-input only; the kernel reads its contents and stores them inline, there is no canonical on-disk report artifact), validate the parsed JSON against the action's declared report schema. On validation failure → transition to `failed` with reason `report-invalid`; DO NOT stay `running`.
171
- 5. Write the execution record (see [`schemas/execution-record.schema.json`](./schemas/execution-record.schema.json)) with the full metrics. The report payload (if any) is stored inline in `state_executions.report_json` as the parsed JSON; the input path is NOT retained.
170
+ 4. If `--status completed`: read the report payload from the path passed to `--report` (implementation-input only, no canonical on-disk report artifact), validate the parsed JSON against the action's declared report schema. On validation failure → transition to `failed` with reason `report-invalid`; DO NOT stay `running`.
171
+ 5. Write the execution record (see [`schemas/execution-record.schema.json`](./schemas/execution-record.schema.json)) with full metrics. The report payload (if any) is stored inline in `state_executions.report_json` as the parsed JSON; the input path is NOT retained.
172
172
  6. Transition the job to the terminal state.
173
173
  7. Emit `job.callback.received` followed by `job.completed` or `job.failed` (see [`job-events.md`](./job-events.md)).
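Steps 1 to 6 of the callback procedure can be sketched as below; event emission (step 7) is left out. The schema, the `record` helper, the `CallbackRejected` exception, and the `is_valid_report` callable standing in for schema validation are all hypothetical. The constant-time nonce comparison and dropping an invalid report from the execution record are choices made for this sketch, not requirements stated above.

```python
import hmac
import json
import sqlite3

# Hypothetical minimal schema: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE state_jobs (
    id            INTEGER PRIMARY KEY,
    status        TEXT NOT NULL,
    nonce         TEXT NOT NULL,
    failureReason TEXT
);
CREATE TABLE state_executions (
    job_id      INTEGER NOT NULL,
    report_json TEXT
);
"""


class CallbackRejected(Exception):
    """Raised when a callback is refused; carries the exit code to use."""

    def __init__(self, exit_code, message):
        super().__init__(message)
        self.exit_code = exit_code


def record(db, job_id, nonce, status, report=None,
           is_valid_report=lambda report: True):
    """Apply a callback and return the terminal state that was written."""
    row = db.execute(
        "SELECT nonce, status FROM state_jobs WHERE id = ?", (job_id,)
    ).fetchone()
    if row is None:
        raise CallbackRejected(5, "job not found")           # step 1
    stored_nonce, current = row
    # compare_digest accepts ASCII strings, which hex nonces are.
    if not hmac.compare_digest(stored_nonce, nonce):
        raise CallbackRejected(4, "nonce mismatch")          # step 2
    if current != "running":
        raise CallbackRejected(2, "job not in running state")  # step 3

    final, reason = status, None
    if status == "completed" and not is_valid_report(report):
        # Step 4: an invalid report fails the job instead of leaving it running.
        final, reason, report = "failed", "report-invalid", None

    with db:  # steps 5 and 6 in one transaction
        db.execute(
            "INSERT INTO state_executions (job_id, report_json) VALUES (?, ?)",
            (job_id, None if report is None else json.dumps(report)),
        )
        db.execute(
            "UPDATE state_jobs SET status = ?, failureReason = ? WHERE id = ?",
            (final, reason, job_id),
        )
    return final
```

Rejections happen before any write, so a wrong nonce or a late callback after a reap leaves both tables untouched.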
174
174
 
175
- The nonce is the sole authentication factor. A compromised nonce allows forged callbacks for that single job. Nonces MUST be generated per-job; never reused; never logged at info level or above.
175
+ The nonce is the sole authentication factor; a compromised nonce allows forged callbacks for that single job. Nonces MUST be generated per-job, never reused, never logged at info level or above.
176
176
 
177
- `--report` accepts either a file path or `-` (stdin). Drivers MAY choose either form; the kernel ingests both into `report_json` identically. The on-disk file the runner authored is ephemeral, implementations SHOULD remove it after the kernel acknowledges the callback (this is a courtesy GC, not a normative requirement).
177
+ `--report` accepts a file path or `-` (stdin); the kernel ingests both into `report_json` identically. The on-disk file the runner authored is ephemeral, implementations SHOULD remove it after the kernel acknowledges the callback (courtesy GC, not normative).
178
178
 
179
179
  ---
180
180
 
181
181
  ## Duplicate prevention rationale
182
182
 
183
- The deduplication key `(actionId, actionVersion, nodeId, contentHash)` exists to prevent:
184
-
185
- - Accidental double-submit when a user re-runs a command.
186
- - Race conditions where two processes both try to submit the same action over the same node at the same content hash.
187
- - Waste of LLM tokens re-computing an unchanged result.
183
+ The deduplication key `(actionId, actionVersion, nodeId, contentHash)` prevents accidental double-submit on re-run, race conditions where two processes submit the same action over the same node at the same content hash, and wasted LLM tokens re-computing an unchanged result.
188
184
 
189
185
  Post-completion, the check is NOT performed: resubmitting a completed job is always allowed (the previous result is kept in history).
190
186
 
@@ -194,9 +190,9 @@ Post-completion, the check is NOT performed: resubmitting a completed job is alw
194
190
 
195
191
  ## Concurrency
196
192
 
197
- Through `v1.0` (spec `v0.x`): **one job at a time**. `sm job run --all` drains sequentially. Enforced by the claim semantics above, there is no pool or scheduler.
193
+ Through `v1.0` (spec `v0.x`): **one job at a time**. `sm job run --all` drains sequentially, enforced by the claim semantics above; no pool or scheduler.
198
194
 
199
- The event schema carries a `jobId` on every event specifically so that parallel execution becomes a non-breaking extension. A future implementation MAY spawn multiple claim/run loops concurrently and interleave events; consumers identify which job an event belongs to by `jobId`.
195
+ The event schema carries a `jobId` on every event so parallel execution becomes a non-breaking extension. A future implementation MAY spawn multiple claim/run loops concurrently and interleave events; consumers identify an event's job by `jobId`.
200
196
 
201
197
  Parallelism is NOT a v1.0 commitment. Implementations that offer it MUST still emit the canonical event stream correctly.
202
198
 
@@ -208,7 +204,7 @@ Implementations MUST handle each of the following:
208
204
 
209
205
  | Scenario | Required handling |
210
206
  |---|---|
211
- | `state_jobs` row exists but its `content_hash` is missing from `state_job_contents` (DB corruption, the content row was deleted by external means). | Mark `failed` with `failureReason = job-file-missing`. `sm doctor` MUST report these proactively. The kernel does NOT itself produce this state under normal operation; the contract is that submit and prune both keep the two tables consistent. The legacy enum name `job-file-missing` is preserved across the disk-to-DB shift to keep the failure-reason vocabulary backward-compatible, the semantic now refers to a missing content row rather than a missing on-disk file. |
207
+ | `state_jobs` row exists but its `content_hash` is missing from `state_job_contents` (DB corruption, the content row deleted by external means). | Mark `failed` with `failureReason = job-file-missing`. `sm doctor` MUST report these proactively. The kernel does NOT produce this state under normal operation; submit and prune both keep the two tables consistent. The legacy enum name `job-file-missing` is preserved across the disk-to-DB shift for backward-compatibility; it now refers to a missing content row rather than a missing on-disk file. |
212
208
  | `state_job_contents` row references no live `state_jobs` row (GC straggler). | `sm doctor` MUST list them. `sm job prune` MUST collect them in the same transaction that prunes terminal jobs. |
213
209
  | Runner crashes between `claim` and reading the content. | Covered by TTL/reap: when `expiresAt` passes, the next reap marks the job `failed` with `abandoned`. |
214
210
  | Callback arrives after reap already failed the job. | Reject with exit 2 (see Record step 3). The runner should treat this as an error and log it. |
@@ -222,7 +218,7 @@ Implementations MUST handle each of the following:
222
218
  | From | Effect |
223
219
  |---|---|
224
220
  | `queued` | Transition to `failed` with `failureReason = user-cancelled`. |
225
- | `running` | Transition to `failed` with `failureReason = user-cancelled`. DOES NOT interrupt a subprocess runner; the runner will discover the failed state on its next callback and exit cleanly. Implementations MAY additionally send a signal to the subprocess but this is not normative. |
221
+ | `running` | Transition to `failed` with `failureReason = user-cancelled`. DOES NOT interrupt a subprocess runner; the runner discovers the failed state on its next callback and exits cleanly. Implementations MAY additionally signal the subprocess, not normative. |
226
222
  | Terminal | Reject with exit 2 ("already terminal"). |
227
223
 
228
224
  ---
@@ -232,11 +228,11 @@ Implementations MUST handle each of the following:
232
228
  Config controls (`jobs.retention.completed`, `jobs.retention.failed`):
233
229
 
234
230
  - `completed` default 30 days (2592000 seconds).
235
- - `failed` default `null` = never auto-purge (preserves history of failures for analysis).
231
+ - `failed` default `null` = never auto-purge (preserves failure history for analysis).
236
232
 
237
233
  `sm job prune` applies retention. Implementations MAY run this on a schedule (e.g., on `sm doctor`, or in a cron adapter) but MUST NOT prune implicitly during normal verb execution.
238
234
 
239
- `sm job prune` MUST also collect orphaned `state_job_contents` rows (no live `state_jobs` references) in the same transaction that prunes terminal jobs. The natural ordering is: delete terminal `state_jobs` rows in the retention window, then delete `state_job_contents` rows whose `content_hash` no longer appears in any `state_jobs` row. This keeps the two tables consistent without separate verbs.
235
+ `sm job prune` MUST also collect orphaned `state_job_contents` rows (no live `state_jobs` references) in the same transaction that prunes terminal jobs. Ordering: delete terminal `state_jobs` rows in the retention window, then delete `state_job_contents` rows whose `content_hash` no longer appears in any `state_jobs` row.
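The two-phase prune can be sketched in a single transaction. The `finishedAt` column, millisecond timestamps, and the `prune` helper are assumptions for illustration; retention values are in seconds, with `None` meaning never auto-purge.

```python
import sqlite3

# Hypothetical minimal schema: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE state_job_contents (
    content_hash TEXT PRIMARY KEY,
    content      TEXT NOT NULL
);
CREATE TABLE state_jobs (
    id           INTEGER PRIMARY KEY,
    status       TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    finishedAt   INTEGER
);
"""


def open_db():
    """Create an in-memory database with the assumed schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def prune(db, now_ms, retention_completed=2592000, retention_failed=None):
    """Delete expired terminal jobs, then orphaned content rows.

    Returns the number of content rows collected.
    """
    policies = (("completed", retention_completed), ("failed", retention_failed))
    with db:  # both phases commit together
        for status, retention_s in policies:
            if retention_s is None:
                continue  # null retention: keep these jobs forever
            db.execute(
                "DELETE FROM state_jobs WHERE status = ? AND finishedAt < ?",
                (status, now_ms - retention_s * 1000),
            )
        # Phase two: content rows that no remaining job references.
        cur = db.execute(
            "DELETE FROM state_job_contents WHERE NOT EXISTS ("
            " SELECT 1 FROM state_jobs"
            " WHERE state_jobs.content_hash = state_job_contents.content_hash)"
        )
    return cur.rowcount
```

Running the orphan sweep after the job deletion, inside the same transaction, means a content row disappears in the same commit as the last job that referenced it.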
240
236
 
241
237
  ---
242
238
 
@@ -252,10 +248,10 @@ Config controls (`jobs.retention.completed`, `jobs.retention.failed`):
252
248
 
253
249
  ## Stability
254
250
 
255
- The state machine diagram above is **stable** as of spec v1.0.0. Adding a new state is a major bump. Adding a new terminal reason (`failureReason` enum value) is a minor bump.
251
+ The state machine diagram above is **stable** as of spec v1.0.0. Adding a new state is a major bump; adding a new terminal reason (`failureReason` enum value) a minor bump.
256
252
 
257
253
  The `contentHash` formula is **stable**. Changing what goes into the hash breaks duplicate detection across versions and is a major bump.
258
254
 
259
255
  The atomic-claim semantics are **stable**. A double-claim would be a silent correctness bug observable through event-stream anomalies.
260
256
 
261
- The TTL resolution procedure (§TTL resolution) is **stable** as of the next spec release. The three-step structure (base → computed → overrides) and the four config keys (`jobs.ttlSeconds`, `jobs.graceMultiplier`, `jobs.minimumTtlSeconds`, `jobs.perActionTtl`) are locked; adding a new override source is a minor bump, changing the formula shape is a major bump.
257
+ The TTL resolution procedure (§TTL resolution) is **stable** as of the next spec release. The three-step structure (base → computed → overrides) and the four config keys (`jobs.ttlSeconds`, `jobs.graceMultiplier`, `jobs.minimumTtlSeconds`, `jobs.perActionTtl`) are locked; adding a new override source is a minor bump, changing the formula shape a major bump.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@skill-map/spec",
3
- "version": "0.53.0",
3
+ "version": "0.54.0",
4
4
  "description": "JSON Schemas, prose contracts, and conformance suite for the skill-map specification.",
5
5
  "license": "MIT",
6
6
  "type": "module",
@@ -39,6 +39,7 @@
39
39
  "db-schema.md",
40
40
  "plugin-kv-api.md",
41
41
  "plugin-author-guide.md",
42
+ "plugin-quickstart.md",
42
43
  "telemetry.md",
43
44
  "interfaces/",
44
45
  "schemas/",