@nusoft/nuos-build-catalogue 0.1.0

package/README.md ADDED
@@ -0,0 +1,91 @@
+ # nuos-build-catalogue
+
+ Indexes the NuOS build catalogue (`docs/build/`, `docs/contracts/`, `docs/philosophy/`, `docs/guides/`) into NuVector for semantic search. Implements [WU 110](../nuos/docs/build/work-units/110-index-catalogue-into-nuvector.md).
+
+ This is the first concrete step in NuOS taking over its own build, per [D040](../nuos/docs/build/decisions/D040-nuos-led-build-is-foundation-not-parallel-track.md). Before WU 110, finding things in the catalogue meant `grep`. After WU 110, it means semantic queries with metadata filters.
+
+ ## Setup
+
+ ```bash
+ npm install
+ ```
+
+ The embedder is selected via `NUOS_CATALOGUE_EMBEDDER`:
+
+ | Value | Provider | Default model | Dimensions | Notes |
+ |---|---|---|---|---|
+ | `ollama` (default) | Local Ollama | `qwen3-embedding:8b` | 4096 | **Sovereignty by default.** No network egress. Override the model with `NUOS_CATALOGUE_OLLAMA_MODEL=qwen3-embedding:4b` (2560 dims) or `qwen3-embedding:0.6b` (1024 dims) for smaller boxes. Needs `ollama serve` running and the model pulled (`ollama pull qwen3-embedding:8b`). |
+ | `vertex` | Google Vertex | `text-embedding-005` | 768 | Cloud Google. Needs `GOOGLE_CLOUD_PROJECT` plus a Vertex access token (set `GOOGLE_VERTEX_ACCESS_TOKEN`, or have `gcloud` on PATH and run `gcloud auth application-default login`). |
+ | `openai` | OpenAI | `text-embedding-3-small` | 1536 | Cloud OpenAI. Needs `OPENAI_API_KEY`. |
+ | `stub` | Hash-based, no API | — | 384 | Tests + dev only. Results are noisy. |
+
+ Switching the embedder (or model variant) requires a full reindex (`rm -rf .nuos-catalogue && npm run index`) because the vector dimensions differ.
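The reindex requirement can be made explicit with a small guard that compares the dimensions an index was built with against the dimensions the current environment would produce. The following is a minimal sketch, not code from this package: `resolveDimensions` and `needsReindex` are hypothetical names, and the numbers are copied from the table above.

```typescript
// Hypothetical helper: dimensions per embedder/model, copied from the README table.
type EmbedderName = 'ollama' | 'vertex' | 'openai' | 'stub';

const OLLAMA_MODEL_DIMENSIONS: Record<string, number> = {
  'qwen3-embedding:8b': 4096,
  'qwen3-embedding:4b': 2560,
  'qwen3-embedding:0.6b': 1024,
};

const FIXED_DIMENSIONS: Record<string, number> = {
  vertex: 768,
  openai: 1536,
  stub: 384,
};

// Resolve the vector size implied by the environment variables the README names.
function resolveDimensions(env: Record<string, string | undefined>): number {
  const embedder = (env.NUOS_CATALOGUE_EMBEDDER ?? 'ollama') as EmbedderName;
  if (embedder === 'ollama') {
    const model = env.NUOS_CATALOGUE_OLLAMA_MODEL ?? 'qwen3-embedding:8b';
    const dims = OLLAMA_MODEL_DIMENSIONS[model];
    if (dims === undefined) throw new Error(`unknown Ollama model: ${model}`);
    return dims;
  }
  const dims = FIXED_DIMENSIONS[embedder];
  if (dims === undefined) throw new Error(`unknown embedder: ${embedder}`);
  return dims;
}

// True when the existing index was built with a different vector size.
function needsReindex(
  indexedDimensions: number,
  env: Record<string, string | undefined>,
): boolean {
  return resolveDimensions(env) !== indexedDimensions;
}
```

A caller could pass `process.env` as `env` and abort with a hint to delete `.nuos-catalogue` when `needsReindex` returns `true`; where the indexed dimension is stored is left open here.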
+
+ ## Quick start
+
+ ```bash
+ # Pre-flight (one time):
+ ollama serve  # in another shell
+ ollama pull qwen3-embedding:8b  # ~4.7 GB download
+
+ # Index the catalogue (first time — takes ~20 min on 8b)
+ npm run index
+
+ # Search
+ npm run search -- "module boundary enforcement"
+ npm run search -- "epistemic discipline" --kind=decision --limit=5
+ npm run search -- "EHCP lifecycle" --json
+
+ # Re-run after editing a few files (only changed files re-embed)
+ npm run index
+ ```
+
+ ## Storage
+
+ The index lives at `.nuos-catalogue/index.nv` (file-backed NuVector store, REDB underneath) plus a sibling `hashes.json` mapping each file path to its content SHA-256 and the chunk IDs it produced. Both are committed to git so the index state is reproducible across machines (Topology A per [D041](../nuos/docs/build/decisions/D041-nuos-meta-package-integration-layer.md)).
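The role of `hashes.json` in incremental indexing can be illustrated with a short sketch. The entry shape (`sha256`, `chunkIds`) and the function names below are assumptions made for illustration; the actual file layout and orchestrator code are not shown in this diff.

```typescript
import { createHash } from 'node:crypto';

// Assumed shape of one hashes.json entry; the real file may differ.
interface HashEntry {
  sha256: string;
  chunkIds: string[];
}
type HashIndex = Record<string, HashEntry>;

// Hex-encoded SHA-256 of a file's text content.
function sha256Hex(content: string): string {
  return createHash('sha256').update(content, 'utf8').digest('hex');
}

// Paths that are new, or whose content hash differs from the stored one.
function changedFiles(previous: HashIndex, current: Record<string, string>): string[] {
  return Object.entries(current)
    .filter(([filePath, content]) => previous[filePath]?.sha256 !== sha256Hex(content))
    .map(([filePath]) => filePath)
    .sort();
}

// Paths recorded in the stored index that no longer exist on disk.
function removedFiles(previous: HashIndex, current: Record<string, string>): string[] {
  return Object.keys(previous)
    .filter((filePath) => !(filePath in current))
    .sort();
}
```

Only the paths returned by `changedFiles` would need re-embedding; the stored `chunkIds` of changed or removed files identify which old records are stale.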
+
+ If `.nuos-catalogue/index.nv` ever becomes corrupt or out of sync, delete the directory and re-run `npm run index`.
+
+ ## Verification gate
+
+ Before any other code lands, the verification gate proves that `@nusoft/nuvector@0.1.0` actually persists file-backed storage across process restarts. Re-run it any time you bump the NuVector dep or suspect storage is broken:
+
+ ```bash
+ npm run verify-storage
+ ```
+
+ Pass = file storage works in the published binary; the indexer can use it. Fail = something has regressed and the package needs a Postgres fallback or a NuVector fix.
+
+ ## Architecture in one paragraph
+
+ `crawl` walks the catalogue picking up `.md` files (skipping `_index.md`, `done/`, `archive/`, `superseded/`). Each file goes to `chunkMarkdown` which splits on H1/H2/H3 boundaries (preserving code fences) into ~600-token chunks with deterministic IDs. `extractMetadata` produces structured metadata per file (kind, idInKind, status, date, cross-refs). The `Embedder` then turns chunk texts into Float32Array vectors. The orchestrator (`runIndex`) only re-embeds files whose content hash has changed since the last run, then `upsert`s them into NuVector as `nuwiki_article_summary` records. `runSearch` embeds the query and calls `searchKnowledge` to retrieve the top-K hits, which the formatter renders as a human-readable list or JSON.
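The heading-based split with code-fence preservation can be sketched as follows. This is not the package's `chunkMarkdown` (its source is not part of this diff); `splitOnHeadings` and `Section` are hypothetical names, and the sketch omits the ~600-token size limit and ID generation.

```typescript
interface Section {
  heading: string; // heading text without the leading '#' markers; '' for a preamble
  body: string;    // everything up to the next H1/H2/H3 outside a code fence
}

// Matches an opening or closing fence line (three or more backticks or tildes).
const FENCE_LINE = /^\s*(`{3,}|~{3,})/;
// Matches H1, H2, or H3 only; deeper headings stay inside the current section.
const HEADING_LINE = /^#{1,3}\s+\S/;

function splitOnHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  let heading = '';
  let lines: string[] = [];
  let inFence = false;

  const flush = (): void => {
    const body = lines.join('\n').trim();
    if (heading !== '' || body !== '') sections.push({ heading, body });
    lines = [];
  };

  for (const line of markdown.split('\n')) {
    // Toggle fence state so '#' lines inside code blocks are never treated as headings.
    if (FENCE_LINE.test(line)) inFence = !inFence;
    if (!inFence && HEADING_LINE.test(line)) {
      flush();
      heading = line.replace(/^#{1,3}\s+/, '').trim();
      continue;
    }
    lines.push(line);
  }
  flush();
  return sections;
}
```

Tracking the fence state is what keeps a shell comment such as `# Search` inside a `bash` block from starting a new section. A simple toggle like this assumes fences are balanced and not nested.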
+
+ ## Out of scope (Phase 0)
+
+ - Auto-running on commit — that's WU 128.
+ - A GUI — CLI only.
+ - Writing to NuVector via NuFlow workflows — that's WU 111.
+ - Compiling NuWiki articles from indexed content — WU 113–115.
+ - Multi-user / concurrent indexing.
+ - Adopting `@nusoft/nuos` — uses NuVector directly until WU 130 ships.
+
+ ## Known API quirks (NuVector v0.1.0)
+
+ Discovered during the WU 110 implementation; documented here so future contributors don't burn time rediscovering them:
+
+ - `embedding` must be a `Float32Array`, not a plain `number[]`. Plain arrays fail with `Get TypedArray info failed on NvMemoryRecord.embedding`.
+ - Search results expose the upsert-time `id` as `ref` on each item (asymmetry with the input shape).
+ - `tenant` belongs on `MemoryRecord` (upsert) but not on `SearchKnowledgeRequest` — the store-level tenant from `NuVector.open` scopes search automatically.
+ - `searchKnowledge` operates on Layer 1 records (`nuwiki_article_summary`); to make a chunk retrievable directly, index it as `nuwiki_article_summary`. `nuwiki_section` is for sections within an enclosing article and requires `searchSectionsInArticles` with the article IDs.
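The first two quirks can be absorbed at the boundary with two small helpers. This is a sketch under assumptions: `toEmbedding`, `hitId`, and the `SearchItem` shape are illustrative names, not NuVector APIs; only the `Float32Array` requirement and the `ref`/`id` asymmetry come from the list above.

```typescript
// Convert a plain number[] (as many embedding APIs return) into the
// Float32Array that the quirk list says NuVector requires.
function toEmbedding(values: readonly number[], expectedDimensions: number): Float32Array {
  if (values.length !== expectedDimensions) {
    throw new Error(`expected ${expectedDimensions} dimensions, got ${values.length}`);
  }
  if (!values.every((v) => Number.isFinite(v))) {
    throw new Error('embedding contains a non-finite value');
  }
  return Float32Array.from(values);
}

// Assumed minimal shape of a search hit: the upsert-time id comes back as `ref`.
interface SearchItem {
  ref?: string;
  id?: string;
}

// Read the record identifier from a hit, preferring `ref` and falling back to `id`.
function hitId(item: SearchItem): string | undefined {
  return item.ref ?? item.id;
}
```

Validating length and finiteness before conversion surfaces a bad vector as a readable error rather than a native-binding failure at upsert time.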
+
+ ## Tests
+
+ ```bash
+ npx tsx --test tests/
+ ```
+
+ 13 tests across `chunk`, `metadata`, `crawl` cover the indexing primitives. End-to-end is exercised by running `npm run index` then `npm run search` against the real catalogue.
+
+ ## License
+
+ Private; not published to npm.
package/package.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "name": "@nusoft/nuos-build-catalogue",
+   "version": "0.1.0",
+   "description": "Indexes the NuOS build catalogue into NuVector for semantic search. WU 110.",
+   "type": "module",
+   "bin": {
+     "nuos-catalogue": "./dist/cli.js"
+   },
+   "files": [
+     "dist",
+     "scripts",
+     "README.md"
+   ],
+   "publishConfig": {
+     "access": "restricted"
+   },
+   "scripts": {
+     "build": "tsc",
+     "verify-storage": "tsx scripts/verify-persistence.ts",
+     "test": "tsx --test tests/chunk.test.ts tests/metadata.test.ts tests/crawl.test.ts",
+     "typecheck": "tsc --noEmit",
+     "index": "tsx src/cli.ts index",
+     "search": "tsx src/cli.ts search"
+   },
+   "dependencies": {
+     "@nusoft/nuvector": "^0.1.5"
+   },
+   "devDependencies": {
+     "@types/node": "^22.0.0",
+     "tsx": "^4.19.0",
+     "typescript": "^5.5.0"
+   }
+ }
package/scripts/verify-persistence.ts ADDED
@@ -0,0 +1,143 @@
+ /**
+  * WU 110 verification gate — Pattern J discipline.
+  *
+  * The @nusoft/nuvector README documents three storage backends:
+  *   - "memory:" — in-memory, ephemeral
+  *   - "./project.nv" — local file
+  *   - { kind: "postgres" } — Postgres
+  *
+  * Before WU 110 commits to file-backed storage, we must prove empirically
+  * that opening a file-backed store, upserting a record, closing it, and
+  * reopening in a fresh process actually returns the record. Hedge words
+  * ("the README says it works") are not acceptance — only this script
+  * passing is.
+  *
+  * This is run twice: once in --write mode, then again in --read mode.
+  * Run via: npm run verify-storage (does both phases in one process tree)
+  */
+
+ import { NuVector } from '@nusoft/nuvector';
+ import { spawn } from 'node:child_process';
+ import { existsSync, rmSync } from 'node:fs';
+ import { fileURLToPath } from 'node:url';
+ import path from 'node:path';
+
+ const __filename = fileURLToPath(import.meta.url);
+ const STORAGE_PATH = path.resolve(path.dirname(__filename), '../.verify-test.nv');
+ const DIMENSIONS = 8; // tiny for this smoke test
+ const TENANT = 'verify_test';
+ const RECORD_ID = 'verify_record_001';
+
+ // Deterministic non-zero embedding (NuVector requires Float32Array)
+ const fixedEmbedding = new Float32Array(
+   Array.from({ length: DIMENSIONS }, (_, i) => (i + 1) / 10),
+ );
+
+ async function writePhase(): Promise<void> {
+   console.log('[write phase] cleaning previous state at', STORAGE_PATH);
+   if (existsSync(STORAGE_PATH)) {
+     rmSync(STORAGE_PATH, { recursive: true, force: true });
+   }
+
+   console.log('[write phase] opening NuVector with file storage');
+   const memory = await NuVector.open({
+     storage: STORAGE_PATH,
+     dimensions: DIMENSIONS,
+     tenant: TENANT,
+   });
+
+   console.log('[write phase] upserting record', RECORD_ID);
+   await memory.upsert({
+     id: RECORD_ID,
+     kind: 'nuwiki_article_summary',
+     embedding: fixedEmbedding,
+     text: 'verification record — must survive process restart',
+     tenant: TENANT,
+     metadata: { test: 'verify-persistence', written_at: Date.now() },
+   });
+
+   console.log('[write phase] record written, closing store');
+   // NuVector may or may not expose a close(); rely on process exit to flush.
+   process.exit(0);
+ }
63
+
64
+ async function readPhase(): Promise<void> {
65
+ console.log('[read phase] reopening NuVector at', STORAGE_PATH);
66
+ if (!existsSync(STORAGE_PATH)) {
67
+ console.error('[read phase] FAIL — storage path does not exist after write phase');
68
+ process.exit(2);
69
+ }
70
+
71
+ const memory = await NuVector.open({
72
+ storage: STORAGE_PATH,
73
+ dimensions: DIMENSIONS,
74
+ tenant: TENANT,
75
+ });
76
+
77
+ console.log('[read phase] searching for the verification record');
78
+ const result = await memory.searchKnowledge({
79
+ query: 'verification record',
80
+ embedding: fixedEmbedding,
81
+ budget: { maxTokens: 1000, maxArticles: 5 },
82
+ });
83
+
84
+ const items = (result?.items ?? []) as Array<{
85
+ ref?: string;
86
+ id?: string;
87
+ metadata?: Record<string, unknown>;
88
+ }>;
89
+ const found = items.some((item) => item.ref === RECORD_ID || item.id === RECORD_ID);
90
+
91
+ if (found) {
92
+ console.log('[read phase] PASS — record retrieved across process restart');
93
+ console.log('[read phase] verdict: file-backed persistence WORKS in this NuVector build');
94
+ process.exit(0);
95
+ } else {
96
+ console.error('[read phase] FAIL — record not found after restart');
97
+ console.error('[read phase] retrieved items:', JSON.stringify(items, null, 2));
98
+ console.error('[read phase] verdict: file-backed persistence does NOT work in this NuVector build');
99
+ console.error('[read phase] WU 110 must fall back to Postgres');
100
+ process.exit(3);
101
+ }
102
+ }
103
+
104
+ async function main(): Promise<void> {
105
+ const mode = process.argv[2];
106
+
107
+ if (mode === '--write') {
108
+ await writePhase();
109
+ return;
110
+ }
111
+ if (mode === '--read') {
112
+ await readPhase();
113
+ return;
114
+ }
115
+
116
+ // Orchestrator: spawn write then read in fresh processes
117
+ console.log('=== WU 110 storage-backend verification gate ===');
118
+ await new Promise<void>((resolve, reject) => {
119
+ const child = spawn('npx', ['tsx', __filename, '--write'], { stdio: 'inherit' });
120
+ child.on('exit', (code) => {
121
+ if (code === 0) resolve();
122
+ else reject(new Error(`write phase exited with code ${code}`));
123
+ });
124
+ });
125
+
126
+ await new Promise<void>((resolve, reject) => {
127
+ const child = spawn('npx', ['tsx', __filename, '--read'], { stdio: 'inherit' });
128
+ child.on('exit', (code) => {
129
+ if (code === 0) {
130
+ console.log('\n=== VERIFICATION GATE: PASS ===');
131
+ resolve();
132
+ } else {
133
+ console.error('\n=== VERIFICATION GATE: FAIL ===');
134
+ process.exit(code ?? 1);
135
+ }
136
+ });
137
+ });
138
+ }
139
+
140
+ main().catch((err) => {
141
+ console.error('verification script error:', err);
142
+ process.exit(1);
143
+ });