@skill-map/spec 0.43.0 → 0.44.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,23 @@
  # Spec changelog
 
+ ## 0.44.0
+
+ ### Minor Changes
+
+ - Wired the `tokenizer` project-config key to actually select the scan encoder. It is now a closed enum (`cl100k_base` default, `o200k_base`); the resolved name is recorded in `scan_meta.tokenizer` / `ScanResult.tokenizer` and an out-of-set value is dropped with a warning and falls back to the default. The orchestrator lazily loads only the chosen `js-tiktoken` rank table, and an incremental scan recomputes per-node token counts when the persisted encoder differs from the resolved one.
+
+ ## User-facing
+
+ **Pick your tokenizer.** `tokenizer` in settings.json now selects the encoder for token counts: `cl100k_base` (default, GPT-4) or `o200k_base` (GPT-4o). Any other value is ignored with a warning. Changing it recomputes counts on the next scan.
+
+ ### Patch Changes
+
+ - Detect database schema drift by fingerprint. A sha256 of the migration DDL is stored in `scan_meta.schema_fingerprint` per scan and checked at open, so a DB whose columns fell behind an inline schema edit is caught instead of failing later as a cryptic `no such column` error. Write paths (`sm scan`, `sm serve`) prompt to rebuild (or `--yes`); read verbs warn and point at `sm scan` / `sm db reset`.
+
+ ## User-facing
+
+ skill-map now notices when your local DB schema is out of date (not just an older version): `sm scan` and `sm serve` offer to rebuild the cache, and read commands warn instead of failing with a confusing database error.
+
  ## 0.43.0
 
  ### Minor Changes
package/cli-contract.md CHANGED
@@ -323,7 +323,7 @@ The watcher subscribes to the same roots that `sm scan` walks and respects `.ski
 
  **File-size skip** (`scan.maxFileSizeBytes`, default 1 MiB): the walker checks each candidate file's on-disk size before reading it and skips any file larger than the limit. The skip happens at the source (the file is never read, parsed, or indexed as a node), so an accidental binary drop or generated artefact cannot poison the graph. Every skipped file is reported in the `ScanResult` envelope as `oversizedFiles` (each entry the root-relative, forward-slash path plus the byte size) and counted in `stats.filesOversized`. When at least one file is skipped, `sm scan`, `sm watch` (per batch), and `sm serve` (initial scan and every batch) print a **WARN**-level terminal notice listing the skipped files with a human-readable size, plus a hint pointing at `scan.maxFileSizeBytes` and `.skillmapignore`; the UI raises a matching banner. Unlike the node cap, the limit is config-only (no per-invocation flag).
 
- **Schema-drift rebuild (pre-1.0)**: before persisting, `sm scan` and `sm watch` compare `scan_meta.scanned_by_version` against the running CLI. A minor or major difference means the local cache predates a schema change, so the DB is deleted and rebuilt from scratch by this run (`.sm` sidecars are untouched, they are the source of truth). On an interactive terminal the rebuild is confirmed first; `--yes` (and every non-interactive caller: piped stdin, CI, the BFF, the watcher) rebuilds without prompting. Declining aborts the scan (exit `2`) without deleting anything. Patch-level differences are compatible and never trigger a rebuild. Read-only verbs keep the version-skew advisory instead of rebuilding. See [`db-schema.md` §Schema drift (pre-1.0)](./db-schema.md#schema-drift-pre-10).
+ **Schema-drift rebuild (pre-1.0)**: before persisting, `sm scan`, `sm watch`, and `sm serve` (before it starts listening) detect schema drift on two axes: the recorded `scan_meta.scanned_by_version` against the running CLI (a minor or major difference is drift; patch-level is compatible), AND the recorded `scan_meta.schema_fingerprint` against the fingerprint recomputed from the bundled migration DDL (any mismatch, or a NULL stored value from a pre-fingerprint DB, is drift). The fingerprint axis catches an inline `001_initial.sql` column add within the same `major.minor` that the version axis cannot see. When either axis trips, the local cache predates a schema change, so the DB is deleted and rebuilt from scratch by this run (`.sm` sidecars are untouched, they are the source of truth). On an interactive terminal the rebuild is confirmed first (`sm scan` rebuilds on the next persist; `sm serve` prompts before booting and aborts with a nonzero exit if declined); `--yes` (and every non-interactive caller: piped stdin, CI, the BFF scan route, the watcher) rebuilds without prompting. Declining aborts (exit `2`) without deleting anything. A DB that was never scanned (no `scan_meta` row) is not drift. Read-only verbs keep the advisory (warn on an older DB or a fingerprint mismatch, refuse on a newer or different-major DB) instead of rebuilding. See [`db-schema.md` §Schema drift (pre-1.0)](./db-schema.md#schema-drift-pre-10).
 
  Exit: 0 on clean (or clean watcher shutdown), 1 if error-severity issues exist (one-shot scan only, the watcher does not flip exit code based on per-batch issues), 2 on operational error.
 
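Reviewer note: the two-axis check described in the added paragraph can be sketched as below. This is a minimal illustration, not the package's implementation; the `ScanMetaRow` shape and the `detectDrift` and `majorMinor` names are hypothetical, and only the write-side decision is modelled (the read-verb warn/refuse advisory is left out).

```typescript
// Hypothetical shapes: the spec names the columns, not these types.
type DriftReason = "version" | "fingerprint";

interface ScanMetaRow {
  scannedByVersion: string;         // scan_meta.scanned_by_version
  schemaFingerprint: string | null; // scan_meta.schema_fingerprint (NULL on a pre-fingerprint DB)
}

// Reduce a version such as "0.44.0" to its "major.minor" prefix.
function majorMinor(version: string): string {
  const [major = "", minor = ""] = version.split(".");
  return `${major}.${minor}`;
}

// Returns the axes that report drift; an empty array means the cache is compatible.
// `row` is null when the DB was never scanned (no scan_meta row): no signal, no drift.
function detectDrift(
  row: ScanMetaRow | null,
  runningVersion: string,
  bundledFingerprint: string,
): DriftReason[] {
  if (row === null) return [];
  const reasons: DriftReason[] = [];
  // Axis 1: a minor or major difference is drift; patch-level is compatible.
  if (majorMinor(row.scannedByVersion) !== majorMinor(runningVersion)) {
    reasons.push("version");
  }
  // Axis 2: a NULL or mismatching stored fingerprint is drift.
  if (row.schemaFingerprint === null || row.schemaFingerprint !== bundledFingerprint) {
    reasons.push("fingerprint");
  }
  return reasons;
}
```

Returning the list of tripped axes, rather than a boolean, lets a caller name the reason in the drift message.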
package/db-schema.md CHANGED
@@ -152,6 +152,8 @@ Single-row table holding the metadata of the last persisted scan. Lets `loadScan
  | `stats_files_walked` | INTEGER | NOT NULL |
  | `stats_files_skipped` | INTEGER | NOT NULL |
  | `stats_duration_ms` | INTEGER | NOT NULL |
+ | `tokenizer` | TEXT | NULL | Resolved offline encoder that produced this scan's per-node token counts (closed enum `cl100k_base` / `o200k_base`, see `project-config.md` / `project-config.schema.json` §tokenizer). Carried on the `ScanResult.tokenizer` wire field. NULL on a pre-feature DB or a scan run with tokenization disabled. On `sm scan --changed` the orchestrator compares this against the freshly-resolved encoder and, when they differ (or the stored value is NULL), bypasses the cached per-node token reuse so `buildNode` recomputes counts with the current encoder. Changing the tokenizer therefore invalidates prior counts on the next scan. |
+ | `schema_fingerprint` | TEXT | NULL | sha256 (hex) of the migration DDL the schema was built from, written at persist time. NULL on a DB created by a pre-fingerprint CLI; a NULL (or mismatching) value is read as schema drift (see §Schema drift). Internal DB metadata, NOT carried on the `ScanResult` wire shape. |
 
  The `scope` column was removed pre-1.0 along with the `-g/--global` flag (see `cli-contract.md` §Scope is always project-local); every persisted scan is project-scoped so the column never carried any information worth round-tripping. Older DBs are not migrated, the column drop is a greenfield change and a fresh `sm init` regenerates the schema.
 
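Reviewer note: the cache-bypass rule in the new `tokenizer` row comes down to a single comparison. A minimal sketch, assuming a hypothetical `canReuseTokenCounts` helper (the name and the `Encoder` type are not from the spec):

```typescript
// The closed enum from the tokenizer config key.
type Encoder = "cl100k_base" | "o200k_base";

// True when cached per-node token counts may be reused on an incremental scan.
// `stored` is scan_meta.tokenizer (NULL on a pre-feature DB or a scan run with
// tokenization disabled); `resolved` is the encoder chosen for the current run.
function canReuseTokenCounts(stored: Encoder | null, resolved: Encoder): boolean {
  return stored !== null && stored === resolved;
}
```

When this returns `false`, the row above says `buildNode` recomputes counts with the current encoder.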
@@ -465,16 +467,28 @@ The kernel ALSO maintains `PRAGMA user_version` (or the engine equivalent) as a
 
  ## Schema drift (pre-1.0)
 
- The project DB is a derived cache: every `scan_*` row is regenerable, and the operator's authored data lives in `.sm` sidecars, not in the DB. While the kernel stays in `0.Y.Z` (see [`versioning.md` §Pre-1.0](./versioning.md#pre-10)) it does NOT ship incremental migrations to carry an existing DB across a schema change. Instead, a write-side open (`sm scan`, `sm watch`, and the BFF watcher) compares `scan_meta.scanned_by_version` against the running CLI version:
+ The project DB is a derived cache: every `scan_*` row is regenerable, and the operator's authored data lives in `.sm` sidecars, not in the DB. While the kernel stays in `0.Y.Z` (see [`versioning.md` §Pre-1.0](./versioning.md#pre-10)) it does NOT ship incremental migrations to carry an existing DB across a schema change. Drift is detected on two independent axes; either one trips a rebuild.
 
- - **Same `major.minor`** (patch differences ignored): the cache is compatible. The open proceeds untouched.
- - **Any minor or major difference**: the on-disk schema is treated as drifted. The entire DB file (plus its `-wal` / `-shm` sidecars) is deleted and recreated from the current `001_initial.sql`; the scan then repopulates it. No backup is written (the cache is derived). `state_*` and `config_*` are wiped along with `scan_*`; pre-1.0 they are accepted as transient. `.sm` sidecars are never touched.
+ **Axis 1, version.** A write-side open compares `scan_meta.scanned_by_version` against the running CLI version:
 
- The rebuild is confirmed interactively on a TTY `sm scan` unless `--yes` is passed; non-interactive callers (piped stdin, CI, the BFF scan route, the watcher) rebuild without prompting. Declining the prompt aborts the scan (exit `2`) without deleting anything.
+ - **Same `major.minor`** (patch differences ignored): compatible.
+ - **Any minor or major difference**: drifted.
 
- Read-side verbs (`sm check`, `sm list`, `sm show`, ...) do NOT rebuild. They keep the version-skew advisory (warn on an older DB, refuse on a newer or different-major DB) so a read never silently discards the cache.
+ **Axis 2, schema fingerprint.** Pre-1.0 the greenfield posture adds columns INLINE to `001_initial.sql` WITHOUT bumping a version (see [`versioning.md` §Pre-1.0](./versioning.md#pre-10)). A DB created within the same `major.minor` but with an older inline schema would otherwise pass the version axis and then fail later as a runtime "no such column" query error. To close that gap, the implementation computes a **schema fingerprint** = sha256 over the concatenated migration DDL (the `NNN_*.sql` files, in sorted order) and persists it to `scan_meta.schema_fingerprint` at persist time. A write-side open recomputes the fingerprint from the bundled migrations and compares:
 
- This is a pre-1.0 affordance. The first `1.0.0` replaces it with real up-only migrations (see §Migrations): drift detection by version becomes drift repair by migration, and `state_*` / `config_*` stop being disposable.
+ - **Stored fingerprint equals the recomputed one**: compatible.
+ - **Stored fingerprint differs from the recomputed one**: drifted. Any inline edit to a migration file changes the fingerprint and trips this axis independently of the version axis.
+ - **Stored fingerprint is NULL** (a DB written by a pre-fingerprint CLI, or whose `schema_fingerprint` column does not exist): drifted. This forces a one-time rebuild on upgrade so the very column that detects drift gets provisioned.
+
+ When **either axis** reports drift, the entire DB file (plus its `-wal` / `-shm` sidecars) is deleted and recreated from the current migrations; the scan then repopulates it. No backup is written (the cache is derived). `state_*` and `config_*` are wiped along with `scan_*`; pre-1.0 they are accepted as transient. `.sm` sidecars are never touched. The drift message names the reason (version skew vs schema fingerprint) so the operator understands why the cache is being rebuilt.
+
+ A DB that was never scanned (no `scan_meta` row) is **not** drift: there is no recorded version and no recorded fingerprint, so there is no signal. The open proceeds untouched (the next scan writes both fields). Reading the stored fingerprint is defensive: a missing `scan_meta` table and a missing `schema_fingerprint` column are both tolerated (the column-absent case maps to NULL, i.e. drift; the row-absent case maps to no-signal).
+
+ The rebuild is confirmed interactively on a TTY (`sm scan`, and `sm serve` before it starts listening) unless `--yes` is passed; non-interactive callers (piped stdin, CI, the BFF scan route, the watcher) rebuild without prompting. Declining the prompt aborts (exit `2`) without deleting anything.
+
+ Read-side verbs (`sm check`, `sm list`, `sm show`, `GET /api/*`) do NOT rebuild. They surface a prominent advisory (warn on an older DB or a fingerprint mismatch, refuse on a newer or different-major DB) so a read never silently discards the cache and never crashes cryptically on a missing column. The advisory points the operator at `sm scan` (rebuild on the next write) or `sm db reset`.
+
+ This is a pre-1.0 affordance. The first `1.0.0` replaces it with real up-only migrations (see §Migrations): drift detection by version / fingerprint becomes drift repair by migration, and `state_*` / `config_*` stop being disposable.
 
  ---
 
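Reviewer note: the fingerprint from Axis 2 (sha256 over the concatenated `NNN_*.sql` files, in sorted order) could be computed roughly as follows. This is a sketch under stated assumptions: the `schemaFingerprint` function name, the in-memory `migrations` map, joining files with no separator, and hashing as UTF-8 are all illustrative choices that the spec text does not pin down.

```typescript
import { createHash } from "node:crypto";

// Matches migration file names of the form NNN_*.sql, e.g. 001_initial.sql.
const MIGRATION_NAME = /^\d{3}_.*\.sql$/;

// `migrations` maps file name -> DDL text (hypothetical input shape).
// Assumption: files are joined with no separator and hashed as UTF-8.
function schemaFingerprint(migrations: Record<string, string>): string {
  const ddl = Object.entries(migrations)
    .filter(([name]) => MIGRATION_NAME.test(name))
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)) // sorted by file name
    .map(([, sql]) => sql)
    .join("");
  return createHash("sha256").update(ddl, "utf8").digest("hex");
}
```

Because the hash covers the full DDL text, any inline edit to a migration file yields a different hex digest, which is the property the fingerprint axis relies on.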
package/index.json CHANGED
@@ -174,14 +174,14 @@
  }
  ]
  },
- "specPackageVersion": "0.43.0",
+ "specPackageVersion": "0.44.0",
  "integrity": {
  "algorithm": "sha256",
  "files": {
- "CHANGELOG.md": "f96ccf5116826cd2860c6149049b208b827cf9b925a4dcfa3fa9442497827c74",
+ "CHANGELOG.md": "357f6aa8652f437eb6400edec3f638fd0243a106cfd3fea062d97e05a94d3ddc",
  "README.md": "a7505a7b0672c39a8b011e3c5e7d41826306476ee63768249bba4bdb3c03d4d1",
  "architecture.md": "49644c727384f8e12061be834bc97e57d47459dae7bea096f94330b74f568a93",
- "cli-contract.md": "76ef201ca99b789914d0530c96744396e96b2938d661fb7ff88583596b193f12",
+ "cli-contract.md": "08b03016e89bd3ce48f32c1b31489f769c97fa1893fe1d13c99a70d8238783e1",
  "conformance/README.md": "0c69bd9becf511ada9175b1e428ba183e31d1c8a49ff09eedf4c950bb831ec4d",
  "conformance/cases/extractor-emits-signal.json": "0115c7bb62a7a705f72e9d8048b3f0396e5caaeb3d04dea204415e279e58479d",
  "conformance/cases/kernel-empty-boot.json": "9b51b85ff62479cd0eee37cad260245208d94f6d79644f7ee40945a934960913",
@@ -208,7 +208,7 @@
  "conformance/fixtures/signal-ir-collision/.claude/agents/architect.md": "acc46b5b2dff73d98a354e4d53b5041164595deae466a4e2ce41d7c5a72f28fb",
  "conformance/fixtures/signal-ir-single-signal/source.md": "1eda417b4c6eed372b66870e385c8d8cd631372b77cab7e996bb711e22218f89",
  "conformance/fixtures/signal-ir-single-signal/target.md": "527137f2b4f46c0034b0edc8932cf8613d2bf22ffaaf78f01085c82a3baaebe3",
- "db-schema.md": "21d0d789163ee38975b50f4d71f64e7328ffc19a0e8dd1402b86a35323f3b37b",
+ "db-schema.md": "f74ce6766bf7f2dcda187a49f82e1768bc1c091d9492846e718903a379610e2e",
  "interfaces/security-scanner.md": "e8049712b9cf7a07c786bf19f8f775f8ef9638f063f7fba5c7a8b1431b92f38e",
  "job-events.md": "9d5b35d4c451a7f8eef9915d85316d924ac52f1c026a316cdda5f1099d496854",
  "job-lifecycle.md": "9c429121f98a07c8795f8979ed1abc5e5334e3f89db51585a8da55c527ef855b",
@@ -238,11 +238,11 @@
  "schemas/node.schema.json": "14ed2e4c44d01e3f662e240219819895cca06dead374a5cadccfd423c520ed69",
  "schemas/plugins-doctor.schema.json": "2238266f31402a446b313af16f933e395a02eca70128e39ab99a11de90a4735f",
  "schemas/plugins-registry.schema.json": "6d850d06cdf70e233f20d0d7968bb0c34306f11f30ce2505cec173cd9fa784e5",
- "schemas/project-config.schema.json": "6bbec6252c215bcd7b86f5c7c267c8104d1c78633ede2e7fc1dabfd3fdcaa638",
+ "schemas/project-config.schema.json": "0a4a12a3409f900bd19b47c34588c77ac894b944d21a9beebb91ae1e9c0f3d01",
  "schemas/refresh-report.schema.json": "47184d4f6b15e9b7671dc178b3b3886a64422da198898508ecdb2cb27876db04",
  "schemas/report-base-deterministic.schema.json": "59785fe6f3ceb34814bbbd03d10fa7336a32835ce598946f2923d469b32aa32a",
  "schemas/report-base.schema.json": "e4d25f055e24f18ae0f77c24661c1bddc87ff2e43b001b6a827fcb14f9753f44",
- "schemas/scan-result.schema.json": "f3caf433a2db79ce44adf09993e0f8032b7230d53eb4472ba2913a39c524d045",
+ "schemas/scan-result.schema.json": "9fb81f496d6f8bdcb82131d0b2eb532da1addb801e7d27bd192a0c286a28c2c0",
  "schemas/sidecar.schema.json": "f23dfe3ba7f71a1af2cc5cd26b57d5e057e56438655f750c1895d35061efe80a",
  "schemas/signal.schema.json": "57baf52e55fc9a6f122fb9b33395b5a2790e7f5b7d461cf576099b68a8a17159",
  "schemas/summaries/agent.schema.json": "5b26b95fb082b73d302c8aa6489ab09488a155ccfbb8943dfc47079509d35122",
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@skill-map/spec",
- "version": "0.43.0",
+ "version": "0.44.0",
  "description": "JSON Schemas, prose contracts, and conformance suite for the skill-map specification.",
  "license": "MIT",
  "type": "module",
package/schemas/project-config.schema.json CHANGED
@@ -13,7 +13,8 @@
  },
  "tokenizer": {
  "type": "string",
- "description": "Name of the offline tokenizer used to compute per-node token counts during scan. Default `cl100k_base`. Stored alongside each token count in `scan_nodes` so consumers know which encoder produced the numbers. Changing this invalidates prior counts on next scan."
+ "enum": ["cl100k_base", "o200k_base"],
+ "description": "Closed allow-list of the offline tokenizer used to compute per-node token counts during scan. Exactly two encoders are supported, both shipped in `js-tiktoken`: `cl100k_base` (the default, the modern OpenAI tokenizer used by GPT-4 / GPT-3.5) and `o200k_base` (the GPT-4o family tokenizer). When absent the default `cl100k_base` applies. An out-of-set value is dropped with a warning by the config loader's AJV `enum` check and the merged value falls back to the default. The resolved encoder is persisted in `scan_meta.tokenizer` so consumers know which encoder produced the numbers. Changing this invalidates prior per-node counts on the next scan (the incremental cache path force-recomputes token counts when the persisted encoder differs from the resolved one)."
  },
  "activeProvider": {
  "type": "string",
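Reviewer note: the fallback behaviour described for the closed `tokenizer` enum can be sketched as below. The schema description says the real loader relies on an AJV `enum` check; this hand-rolled `resolveTokenizer` (a hypothetical name, as are `TOKENIZERS` and the `warn` callback) only mirrors the documented outcome, and the warning text is illustrative.

```typescript
// Closed allow-list from the schema's `enum`.
const TOKENIZERS = ["cl100k_base", "o200k_base"] as const;
type Tokenizer = (typeof TOKENIZERS)[number];
const DEFAULT_TOKENIZER: Tokenizer = "cl100k_base";

// Resolve the configured value to a supported encoder.
// Absent -> default, silently. Out-of-set -> dropped with a warning, then default.
function resolveTokenizer(configured: unknown, warn: (message: string) => void): Tokenizer {
  if (configured === undefined) return DEFAULT_TOKENIZER;
  const match = TOKENIZERS.find((name) => name === configured);
  if (match !== undefined) return match;
  // Illustrative wording only; the real warning message is not specified in this diff.
  warn(`tokenizer: ignoring unsupported value ${JSON.stringify(configured)}, using ${DEFAULT_TOKENIZER}`);
  return DEFAULT_TOKENIZER;
}
```

For example, `resolveTokenizer("p50k_base", console.warn)` would warn once and return `"cl100k_base"`, while `resolveTokenizer("o200k_base", console.warn)` returns `"o200k_base"` without warning.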
package/schemas/scan-result.schema.json CHANGED
@@ -37,6 +37,10 @@
  "description": "Provider ids that participated in classification. Empty if no Provider matched.",
  "items": { "type": "string" }
  },
+ "tokenizer": {
+ "type": "string",
+ "description": "Resolved offline tokenizer (encoder) that produced the per-node token counts in this scan, one of the closed allow-list in `project-config.schema.json#/properties/tokenizer` (`cl100k_base` default, `o200k_base`). Mirrors `scan_meta.tokenizer`. Reported so consumers know which encoder the counts came from; the incremental scan compares the persisted value against the resolved one and force-recomputes counts when they differ. Absent on synthetic fixtures and when tokenization was disabled."
+ },
  "recommendedNodeLimit": {
  "type": "integer",
  "minimum": 1,