neurain 0.1.0-alpha.5 → 0.1.0-alpha.7

package/CHANGELOG.md CHANGED
@@ -4,6 +4,20 @@
 
  - No unreleased changes recorded.
 
+ ## 0.1.0-alpha.7
+
+ - Hardening (recall perf, from an adversarial review): lock the "byte-identical results" claim and tighten the fast-path contracts, with no change to ranking/scores (golden-identical).
+
+ - Added `test/perf_recall_equivalence.test.mjs`: an oracle test that `countOccurrences` equals `split(term).length - 1` for every non-empty term (overlap/unicode/surrogate cases included), a proof that `scorePreparedSemantic` (prepare-once + per-doc trigram precompute) equals a naive per-doc reference scorer, a determinism check, and a guard that the shared corpus selector keeps private + secret-bearing files out of every branch.
+ - Extracted the lexical BM25 term-frequency count into an exported `countOccurrences(haystack, needle)` with an empty-needle guard (returns 0) so the index-loop form can never spin even if the term filter changes.
+ - `buildLexicalContext` now throws if a caller passes shared `markdownFiles` together with an `area` (the only safe share is whole-vault; an area context selects a strict subset, so this prevents a future caller from silently widening the corpus).
+ - `prepareSemanticQuery` now returns a frozen prepared query, and the provider fast-path contract (prepared query is immutable, no cross-call mutable state) is documented on the default provider.
+
+ ## 0.1.0-alpha.6
+
+ - Performance (hybrid recall): `hybrid-search` now walks the markdown corpus ONCE and shares it across its semantic and routed-lexical branches instead of each branch re-walking and re-reading the whole vault. The walk is shared only when no `--area` is set (the two branches then select the same whole-vault corpus); with an area they still walk independently. Results stay byte-identical (golden-verified) because the shared file list is exactly what each branch would have walked. Measured: `recall hybrid-search` ~970ms -> ~763ms (warm median); combined with alpha.5 that is ~1234ms -> ~763ms (-38%). npm test 153/153.
+
+
  ## 0.1.0-alpha.5
 
  - Performance (recall processing): cut recall/search processing time without changing results. The semantic scorer now prepares the query once and precomputes per-doc trigrams (instead of re-tokenizing the query and rebuilding `charTrigrams` per document), and the lexical BM25 counts term frequency with an index loop instead of `String.split`. Measured: `recall hybrid-search` ~1234ms -> ~970ms, `semantic-search` ~1031ms -> ~750ms (warm median), with byte-identical ranking/scores/matched_terms (golden-verified) and npm test 153/153.
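
The percentages quoted in the changelog entries above can be checked with a few lines of arithmetic. This is a minimal sketch: `percentChange` and the `timings` object are illustrative names that do not exist in the package, and the inputs are the approximate warm-median figures exactly as published, so the results are only as precise as those rounded numbers.

```javascript
// Relative change between two timings, as a signed whole-number percentage.
function percentChange(beforeMs, afterMs) {
  return Math.round(((afterMs - beforeMs) / beforeMs) * 100);
}

// Warm-median figures for `recall hybrid-search`, copied from the changelog.
// The key names are illustrative; only the numbers come from the source.
const timings = { beforeAlpha5: 1234, afterAlpha5: 970, afterAlpha6: 763 };

const alpha5Step = percentChange(timings.beforeAlpha5, timings.afterAlpha5);
const alpha6Step = percentChange(timings.afterAlpha5, timings.afterAlpha6);
const combined = percentChange(timings.beforeAlpha5, timings.afterAlpha6);

console.log({ alpha5Step, alpha6Step, combined });
```

Under these inputs each release contributes roughly a fifth, and the combined figure matches the -38% stated in the alpha.6 entry.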
package/README.md CHANGED
@@ -204,7 +204,7 @@ It exposes read/capture/scan/preview tools only. It does not silently compile, p
 
  ## Status
 
- This is `0.1.0-alpha.5`. It is not a public SaaS GA release. The alpha exists to prove installability, local-first onboarding, Codex, Claude, Gemini, and Runtime connectivity, plus safety receipts.
+ This is `0.1.0-alpha.7`. It is not a public SaaS GA release. The alpha exists to prove installability, local-first onboarding, Codex, Claude, Gemini, and Runtime connectivity, plus safety receipts.
 
  Alpha publish command:
 
@@ -1,9 +1,9 @@
  # Development Status
 
  Version: v0.1
- Last updated: 2026-06-19 KST
- Package: `neurain@0.1.0-alpha.5`
- Latest documented commit: `6305d3d perf(recall): cut recall processing time, results byte-identical`
+ Last updated: 2026-06-20 KST
+ Package: `neurain@0.1.0-alpha.7`
+ Latest documented commit: `18bbb9f perf(recall): lock byte-identical claim in CI + harden fast-path contracts`
 
  This document is the canonical product development snapshot for the public package. It tracks what is shipped, what has evidence, and what must not be claimed yet.
 
@@ -1,9 +1,9 @@
  # Development Progress Status
 
  Version: v0.1
- Last updated: 2026-06-19 KST
- Package: `neurain@0.1.0-alpha.5`
- Latest documented commit: `6305d3d perf(recall): cut recall processing time, results byte-identical`
+ Last updated: 2026-06-20 KST
+ Package: `neurain@0.1.0-alpha.7`
+ Latest documented commit: `18bbb9f perf(recall): lock byte-identical claim in CI + harden fast-path contracts`
 
  This document is the canonical development status snapshot for the public package. It records what is shipped, what evidence exists, and what must not be claimed yet.
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "neurain",
-   "version": "0.1.0-alpha.5",
+   "version": "0.1.0-alpha.7",
    "description": "Local-first Neurain Knowledge OS CLI and MCP connector.",
    "type": "module",
    "license": "Apache-2.0",
@@ -183,7 +183,7 @@ export async function searchRecall(root, query, { top = 10, host = '', fallback
  // corpus. No SQLite required (markdown stays canonical, the default provider
  // needs no generated index), no model calls, no external calls. Private and
  // unsafe docs are excluded exactly like the exact-token path.
- export async function semanticSearchRecall(root, query, { top = 10, host = '', provider = 'local-lexical', minScore = 0.34, scope = '' } = {}) {
+ export async function semanticSearchRecall(root, query, { top = 10, host = '', provider = 'local-lexical', minScore = 0.34, scope = '', markdownFiles } = {}) {
    const prov = getProvider(provider);
    const text = String(query || '');
    if (!text.trim()) throw new Error('Recall semantic search requires a query.');
@@ -191,7 +191,7 @@ export async function semanticSearchRecall(root, query, { top = 10, host = '', p
    const hostFilter = String(host || '');
    const scopeFilter = String(scope || '');
    const floor = Number.isFinite(Number(minScore)) ? Math.max(0, Math.min(Number(minScore), 1)) : 0.34;
-   const docs = collectRecallDocs(root)
+   const docs = collectRecallDocs(root, { markdownFiles })
      .filter((doc) => doc.sensitivity !== 'private')
      .filter((doc) => !hostFilter || doc.host === hostFilter)
      .filter((doc) => !scopeFilter || doc.scope === scopeFilter);
@@ -288,7 +288,14 @@ export async function hybridSearchRecall(root, query, { top = 10, host = '', pro
    const scope = scopeForArea(areaDir);
    const routedEnabled = decideRouting(routing, areaDir, root, recallCfg);
    const exact = await searchRecall(root, text, { top: limit, host, scope });
-   const semantic = await semanticSearchRecall(root, text, { top: limit, host, provider, minScore, scope });
+   // Walk the markdown corpus ONCE and share it across the semantic and (routed)
+   // lexical branches, which otherwise each re-walk+read the whole vault. Only when
+   // no area is set, because then both branches select the same whole-vault corpus;
+   // with an area, semantic stays whole-vault while lexical scopes to the area, so
+   // their selections differ and each must walk its own. The shared array is exactly
+   // what each branch would have walked, so results stay byte-identical.
+   const sharedFiles = areaDir ? null : listRecallMarkdownFiles(root, recallCfg);
+   const semantic = await semanticSearchRecall(root, text, { top: limit, host, provider, minScore, scope, markdownFiles: sharedFiles });
 
    if (!routedEnabled) {
      const merged = mergeHybridResults(exact.results, semantic.results);
@@ -316,7 +323,7 @@ export async function hybridSearchRecall(root, query, { top = 10, host = '', pro
      };
    }
 
-   const lexicalCtx = buildLexicalContext(root, { area: areaDir, recallCfg });
+   const lexicalCtx = buildLexicalContext(root, { area: areaDir, recallCfg, markdownFiles: sharedFiles });
    const lexical = lexicalSearchWithContext(lexicalCtx, text, { top: limit });
    const merged = mergeRoutedHybridResults(lexical.results, exact.results, semantic.results);
    return {
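
The walk-once sharing in the two `hybridSearchRecall` hunks above can be illustrated in isolation. This is a sketch under stated assumptions, not the package's code: `walkCorpus`, `semanticBranch`, `lexicalBranch`, and `hybrid` are hypothetical stand-ins, and the in-memory `vault` array replaces the real file walk so the number of walks can be counted.

```javascript
// Hypothetical stand-in for listRecallMarkdownFiles(): counts how often it runs.
let walkCount = 0;
function walkCorpus(vault, area = '') {
  walkCount += 1;
  return vault.filter((file) => !area || file.rel.startsWith(`${area}/`));
}

// Each branch accepts an optional pre-walked list and falls back to its own walk.
function semanticBranch(vault, { markdownFiles } = {}) {
  const files = markdownFiles || walkCorpus(vault); // semantic is always whole-vault
  return files.map((file) => file.rel);
}

function lexicalBranch(vault, { area = '', markdownFiles } = {}) {
  const files = markdownFiles || walkCorpus(vault, area); // lexical narrows to the area
  return files.map((file) => file.rel);
}

function hybrid(vault, area = '') {
  // Share one walk only when both branches would select the same whole-vault corpus.
  const sharedFiles = area ? null : walkCorpus(vault);
  return {
    semantic: semanticBranch(vault, { markdownFiles: sharedFiles }),
    lexical: lexicalBranch(vault, { area, markdownFiles: sharedFiles }),
  };
}

const vault = [{ rel: 'notes/a.md' }, { rel: 'notes/b.md' }, { rel: 'projects/c.md' }];

walkCount = 0;
const wholeVault = hybrid(vault);
const wholeVaultWalks = walkCount; // one shared walk

walkCount = 0;
const scoped = hybrid(vault, 'notes');
const scopedWalks = walkCount; // each branch walks its own selection
```

Passing `null` when an area is set keeps the fallback path in each branch unchanged, which is why the scoped case still performs two independent walks.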
@@ -1603,9 +1610,9 @@ function buildSqliteIndex(DatabaseSync, file, docs, manifestHash) {
    }
  }
 
- function collectRecallDocs(root, { recallCfg = recallConfig(root) } = {}) {
+ function collectRecallDocs(root, { recallCfg = recallConfig(root), markdownFiles } = {}) {
    const docs = [
-     ...collectMarkdownDocs(root, recallCfg),
+     ...collectMarkdownDocs(root, recallCfg, markdownFiles),
      ...collectEventDocs(root),
      ...collectReceiptDocs(root),
    ];
@@ -1620,8 +1627,12 @@ function collectRecallDocs(root, { recallCfg = recallConfig(root) } = {}) {
  // label resolver (per-file frontmatter + area baseline + boundary path markers),
  // which fixes the old substring gate that dropped `..._tokenomics/` because the
  // path contained `token`. config.recall.include/exclude extend the whitelist.
- function collectMarkdownDocs(root, recallCfg = recallConfig(root)) {
-   return listRecallMarkdownFiles(root, recallCfg).map(({ rel, text, sensitivity }) => docFromText({
+ // `markdownFiles`, when given, is a pre-walked listRecallMarkdownFiles() result
+ // for the SAME (root, recallCfg, whole-vault) selection, so a caller that already
+ // walked the corpus (e.g. hybrid sharing one walk across branches) can skip the
+ // redundant walk+read. The mapping is identical, so the docs are byte-identical.
+ function collectMarkdownDocs(root, recallCfg = recallConfig(root), markdownFiles) {
+   return (markdownFiles || listRecallMarkdownFiles(root, recallCfg)).map(({ rel, text, sensitivity }) => docFromText({
      path: rel,
      kind: kindForPath(rel),
      host: 'markdown',
@@ -30,6 +30,20 @@ import {
  import { factsFor, loadFactIntel } from './recall_facts.mjs';
 
  const sourceIdPattern = /\braw-\d{8}-(?:\d{3}|dryrun)\b/i;
+
+ // Count non-overlapping occurrences of `needle` in `haystack`. For a non-empty
+ // needle this is exactly `haystack.split(needle).length - 1` but without
+ // allocating the split array on every doc/term pair (the hottest BM25 loop). An
+ // empty needle returns 0: every term that reaches scoring is filter(Boolean)'d,
+ // so this only future-proofs against an infinite loop if that contract ever
+ // changes (the split form would never hit it, but the index form would).
+ export function countOccurrences(haystack, needle) {
+   if (!needle) return 0;
+   let n = 0;
+   for (let i = haystack.indexOf(needle); i !== -1; i = haystack.indexOf(needle, i + needle.length)) n += 1;
+   return n;
+ }
+
  // BM25 content weight relative to the additive structural boosts (vault parity).
  const BM25_WEIGHT = 4;
  const BM25_K1 = 1.5;
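
The equivalence claimed in the comment above (index loop equals `split(needle).length - 1` for non-empty needles) can be spot-checked with a small oracle. This is a sketch in the spirit of the changelog's oracle test, not a copy of `test/perf_recall_equivalence.test.mjs`, whose contents are not shown in this diff; `splitCount` and the `cases` list are illustrative.

```javascript
// Same index-loop shape as the countOccurrences added in this hunk.
function countOccurrences(haystack, needle) {
  if (!needle) return 0;
  let n = 0;
  for (let i = haystack.indexOf(needle); i !== -1; i = haystack.indexOf(needle, i + needle.length)) n += 1;
  return n;
}

// Oracle: the allocation-heavy form that the index loop replaced.
function splitCount(haystack, needle) {
  return haystack.split(needle).length - 1;
}

// Illustrative cases: overlap, no match, empty haystack, non-ASCII, surrogate pairs.
const cases = [
  ['aaaa', 'aa'],
  ['ababa', 'aba'],
  ['abcabcabc', 'abc'],
  ['no match here', 'xyz'],
  ['', 'a'],
  ['한글 한글 한글', '한글'],
  ['😀😀😀', '😀'],
];

const mismatches = cases.filter(([h, n]) => countOccurrences(h, n) !== splitCount(h, n));
console.log(mismatches.length === 0 ? 'all cases agree' : mismatches);

// The empty needle is the one deliberate divergence: the guard returns 0,
// while split('') yields one element per UTF-16 code unit.
const emptyNeedle = countOccurrences('abc', '');
```

Both forms advance past each match by the needle's length, so overlapping candidates such as the second `aba` in `ababa` are skipped by both.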
@@ -111,11 +125,20 @@ function slugish(value) {
  // intel/facts/alias snapshots + the held-aside queue doc), reused across many
  // queries. intel/facts/aliasMap can be injected (tests); otherwise loaded from
  // the registry, degrading to empty when files are absent.
- export function buildLexicalContext(root, { area = '', recallCfg, intel, facts, aliasMap } = {}) {
+ export function buildLexicalContext(root, { area = '', recallCfg, intel, facts, aliasMap, markdownFiles } = {}) {
    if (!recallCfg) throw new Error('buildLexicalContext requires recallCfg');
+   // `markdownFiles`, when given, must be a pre-walked listRecallMarkdownFiles()
+   // result for this exact (root, recallCfg, area) selection; a caller that already
+   // walked the corpus (hybrid sharing one walk) passes it to skip the redundant walk.
+   // The only safe sharing is whole-vault: an area-scoped context selects a strict
+   // subset, so accepting whole-vault files under an area would silently widen the
+   // corpus and change results. Reject that misuse loudly instead of ranking wrong.
+   if (markdownFiles && area) {
+     throw new Error('buildLexicalContext: markdownFiles can only be shared for a whole-vault context (no area)');
+   }
    const dirs = dirsFromConfig(recallCfg);
    const classify = makeLayerClassifier(dirs);
-   const files = listRecallMarkdownFiles(root, recallCfg, { area });
+   const files = markdownFiles || listRecallMarkdownFiles(root, recallCfg, { area });
    const baseDocs = files.map(({ rel, text }) => ({
      text,
      lower: text.toLowerCase(),
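
The guard added to `buildLexicalContext` in this hunk can be exercised on its own. A minimal sketch, assuming a hypothetical `walk` helper and an in-memory `vault` in place of the real corpus walk; `buildContext` mirrors only the guard and the fallback, not the rest of the real function.

```javascript
// Hypothetical walker standing in for listRecallMarkdownFiles(root, recallCfg, { area }).
function walk(vault, area = '') {
  return vault.filter((file) => !area || file.rel.startsWith(`${area}/`));
}

// Sketch of the guard: a pre-walked list is accepted only for a whole-vault context.
function buildContext(vault, { area = '', markdownFiles } = {}) {
  if (markdownFiles && area) {
    throw new Error('markdownFiles can only be shared for a whole-vault context (no area)');
  }
  const files = markdownFiles || walk(vault, area);
  return { area, files };
}

const vault = [{ rel: 'notes/a.md' }, { rel: 'projects/b.md' }];
const preWalked = walk(vault);

const shared = buildContext(vault, { markdownFiles: preWalked }); // reuses the walk
const scoped = buildContext(vault, { area: 'notes' });            // walks its own subset

let rejected = false;
try {
  // Whole-vault files under an area would widen the corpus, so this must throw.
  buildContext(vault, { area: 'notes', markdownFiles: preWalked });
} catch (err) {
  rejected = true;
}
```

Because the check is a plain truthiness test, an empty pre-walked array combined with an area is rejected as well, which errs on the strict side.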
@@ -192,10 +215,7 @@ export function lexicalSearchWithContext(ctx, query, { top = 10, maxPerLayer = 3
 
  let bm25 = 0;
  for (const term of searchTerms) {
-   // Non-overlapping occurrence count (identical to `lower.split(term).length - 1`)
-   // without allocating the split array on every doc/term pair.
-   let tf = 0;
-   for (let i = lower.indexOf(term); i !== -1; i = lower.indexOf(term, i + term.length)) tf += 1;
+   const tf = countOccurrences(lower, term);
    if (tf === 0) continue;
    const denom = tf + BM25_K1 * (1 - BM25_B + (BM25_B * length) / avgLength);
    bm25 += (idf[term] || 0) * ((tf * (BM25_K1 + 1)) / denom);
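
The per-term contribution in the loop above is the usual BM25 saturation formula. The sketch below isolates it so its behaviour is easier to see. `BM25_K1 = 1.5` is taken from the diff; `BM25_B = 0.75` is an assumption (the constant's value is not visible in these hunks), and `bm25Term` is an illustrative helper, not a function in the package.

```javascript
const BM25_K1 = 1.5;  // value shown in the diff
const BM25_B = 0.75;  // ASSUMPTION: a common default; the real value is not shown

// Contribution of one term to a document's BM25 score.
// tf: term frequency, idf: inverse document frequency weight,
// length / avgLength: document length relative to the corpus average.
function bm25Term(tf, idf, length, avgLength) {
  if (tf === 0) return 0;
  const denom = tf + BM25_K1 * (1 - BM25_B + (BM25_B * length) / avgLength);
  return idf * ((tf * (BM25_K1 + 1)) / denom);
}

// For an average-length document the normalisation factor is 1,
// so a single occurrence contributes exactly the idf weight.
const single = bm25Term(1, 2, 100, 100);

// More occurrences help, but with diminishing returns (bounded by idf * (K1 + 1)).
const triple = bm25Term(3, 2, 100, 100);

// The same tf in a document twice the average length scores lower.
const longDoc = bm25Term(1, 2, 200, 100);

console.log({ single, triple, longDoc });
```

Since only `tf` feeds this formula, swapping the inline loop for `countOccurrences` leaves the score unchanged as long as the two counts agree.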
@@ -135,7 +135,10 @@ function trigramJaccard(ga, gb) {
  // docs instead of re-tokenizing the query per document (the per-doc hot path).
  export function prepareSemanticQuery(query) {
    // Precompute each term's trigrams ONCE so the per-doc fuzzy loop never rebuilds them.
-   return tokenize(query).map(expandToken).map((q) => ({ ...q, trigrams: charTrigrams(q.stem) }));
+   // The prepared query is the SHARED, immutable input to scorePreparedSemantic across an
+   // entire corpus scan: freezing it makes the "no per-call mutable state" contract
+   // enforced, not just documented, so a long-lived process can reuse it safely.
+   return Object.freeze(tokenize(query).map(expandToken).map((q) => Object.freeze({ ...q, trigrams: charTrigrams(q.stem) })));
  }
 
  // Score a pre-prepared query against a document body. Behaviour is identical to
@@ -203,6 +206,11 @@ registerProvider('local-lexical', {
    expandQuery(query) {
      return tokenize(query).map(expandToken);
    },
+   // Provider fast-path contract: prepareQuery returns an IMMUTABLE prepared query
+   // (frozen) that scorePrepared treats as read-only. A provider must not mutate the
+   // prepared object nor keep cross-call mutable state keyed off it, so the same
+   // prepared query is safe to reuse across an entire corpus scan and across calls in
+   // a long-lived process. The default provider is fully stateless.
    prepareQuery(query) {
      return prepareSemanticQuery(query);
    },
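
What the two-level `Object.freeze` in `prepareSemanticQuery` does and does not lock can be seen in a small sketch. Everything here is a stand-in: the whitespace tokenizer, the array-returning `charTrigrams`, and `prepareQuery` are simplified assumptions, since the package's `tokenize`, `expandToken`, and the real type of `trigrams` are not shown in this diff.

```javascript
// Simplified trigram helper (ASSUMPTION: returns an array; the real type is not shown).
function charTrigrams(word) {
  const grams = [];
  for (let i = 0; i + 3 <= word.length; i += 1) grams.push(word.slice(i, i + 3));
  return grams;
}

// Same freeze pattern as the diff: freeze the outer array and each term object.
function prepareQuery(query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return Object.freeze(terms.map((stem) => Object.freeze({ stem, trigrams: charTrigrams(stem) })));
}

const prepared = prepareQuery('frozen query');

const outerFrozen = Object.isFrozen(prepared);             // the term list itself
const termFrozen = Object.isFrozen(prepared[0]);           // each term object
const innerFrozen = Object.isFrozen(prepared[0].trigrams); // nested value is NOT frozen

// Adding a term to the frozen list throws a TypeError in any mode.
let pushRejected = false;
try {
  prepared.push({ stem: 'extra', trigrams: [] });
} catch (err) {
  pushRejected = err instanceof TypeError;
}

console.log({ outerFrozen, termFrozen, innerFrozen, pushRejected });
```

`Object.freeze` is shallow, so in this sketch the nested `trigrams` value stays mutable; whether that matters for the real provider depends on how `scorePrepared` uses it, which the read-only contract in the comment above is meant to cover.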