albex 0.3.0 → 0.6.1

Files changed (53)
  1. package/CHANGELOG.md +466 -0
  2. package/README.md +32 -19
  3. package/dist/albex-worker.d.ts +65 -2
  4. package/dist/albex-worker.d.ts.map +1 -1
  5. package/dist/albex-worker.js +97 -20
  6. package/dist/albex-worker.js.map +1 -1
  7. package/dist/albex.d.ts +359 -55
  8. package/dist/albex.d.ts.map +1 -1
  9. package/dist/albex.js +766 -312
  10. package/dist/albex.js.map +1 -1
  11. package/dist/errors.d.ts +47 -2
  12. package/dist/errors.d.ts.map +1 -1
  13. package/dist/errors.js +41 -3
  14. package/dist/errors.js.map +1 -1
  15. package/dist/persistence.js +1 -1
  16. package/dist/pool/coordinator.d.ts +14 -6
  17. package/dist/pool/coordinator.d.ts.map +1 -1
  18. package/dist/pool/coordinator.js +65 -28
  19. package/dist/pool/coordinator.js.map +1 -1
  20. package/dist/profile.d.ts +11 -6
  21. package/dist/profile.d.ts.map +1 -1
  22. package/dist/profile.js +6 -13
  23. package/dist/profile.js.map +1 -1
  24. package/dist/resource-manager.js +1 -1
  25. package/dist/tiered-store.js +1 -1
  26. package/dist/wasm-bindings.d.ts +96 -6
  27. package/dist/wasm-bindings.d.ts.map +1 -1
  28. package/dist/wasm-bindings.js +110 -7
  29. package/dist/wasm-bindings.js.map +1 -1
  30. package/dist/worker-protocol.d.ts +23 -2
  31. package/dist/worker-protocol.d.ts.map +1 -1
  32. package/dist/worker-protocol.js +1 -1
  33. package/dist/worker-runtime.js +27 -3
  34. package/dist/worker-runtime.js.map +1 -1
  35. package/package.json +13 -9
  36. package/src/albex-worker.ts +103 -18
  37. package/src/albex.ts +2937 -2292
  38. package/src/errors.ts +63 -2
  39. package/src/pool/coordinator.ts +61 -34
  40. package/src/profile.ts +11 -10
  41. package/src/wasm-bindings.ts +225 -10
  42. package/src/worker-protocol.ts +12 -2
  43. package/src/worker-runtime.ts +28 -3
  44. package/wasm/pkg/albex_pdf.wasm +0 -0
  45. package/wasm/pkg/albex_wasm.wasm +0 -0
  46. package/wasm/pkg/albex_wasm_bg.wasm +0 -0
  47. package/wasm/pkg/albex_wasm_simd.wasm +0 -0
  48. package/wasm/pkg/albex_wasm_mini.wasm +0 -0
  49. package/wasm/pkg/albex_wasm_mini_simd.wasm +0 -0
  50. package/wasm/pkg/albex_wasm_pro.wasm +0 -0
  51. package/wasm/pkg/albex_wasm_pro_simd.wasm +0 -0
  52. package/wasm/pkg/albex_wasm_std.wasm +0 -0
  53. package/wasm/pkg/albex_wasm_std_simd.wasm +0 -0
package/CHANGELOG.md CHANGED
@@ -5,6 +5,472 @@ All notable changes to Albex are documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6
6
  and Albex follows [Semantic Versioning](https://semver.org/).
7
7
 
8
+ ## [Unreleased]
9
+
10
+ ### Changed
11
+
12
+ - **IDF-aware relevance scoring — exact matches no longer saturate.** The old
13
+ `rich_score` set `base = 1000 − avg_errors·200`, so every 0-error hit reached
+ the 1000 ceiling and `.min(1000)` discarded *all* the relevance bonuses (term
+ frequency, proximity, location, word-boundary and IDF). The practical effect:
16
+ on a real corpus every exact match tied at 1000 and the final ordering among
17
+ them was arbitrary — the chunk holding the rare, decisive query term was no
18
+ more likely to rank first than dozens that merely shared a common word.
19
+ Scores are now laid out as **error tiers with relevance ranking inside each
20
+ tier** (0 errors → `[750,1000]`, 1 → `[500,750]`, 2 → `[250,500]`, 3 →
21
+ `[0,250]`): a cleaner match still always beats a noisier one, but within a
22
+ tier the **IDF bonus (now the largest, up to 150)** lifts rare-term matches
23
+ decisively above common-term ones. Measured downstream (the Noesis hybrid-RAG
24
+ eval over a 293-doc / 12.4k-chunk corpus): lexical-leg MRR **0.19 → 0.40**,
25
+ R@5 **23% → 53%**, and end-to-end hybrid MRR **0.42 → 0.53** — with no change
26
+ on small corpora. `SearchResult.score` keeps its `0..=1000` range, so
27
+ `setThreshold` is unaffected for typical values; very weak 3-error fuzzy
28
+ matches (bonus-starved, score < the default threshold 250) are now filtered
29
+ where they previously slipped through.
30
+ - **Fixed a latent result-ordering bug** exposed by the scoring change. The
31
+ heap-sort already left `RESULTS` in descending order; a trailing reverse then
32
+ flipped it to *ascending*. It was invisible while all exact scores tied at
33
+ 1000 and was masked from API consumers by the TypeScript-side re-sort, but the
34
+ WASM ABI now returns correctly descending results on its own.
35
+
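The tier layout described above can be sketched in a few lines. This is only an illustration, not the engine's Rust code: the name `tieredScore` and the way the bonus is clamped are assumptions; only the tier ranges come from the entry.

```typescript
// Hypothetical sketch of "error tiers with relevance ranking inside each tier".
// Tier ranges are taken from the changelog; the function name and the bonus
// clamping are illustrative assumptions.
const TIER_WIDTH = 250;

function tieredScore(errors: number, relevanceBonus: number): number {
  // Clamp errors to the 0..3 range the tiers are defined for.
  const e = Math.min(3, Math.max(0, Math.trunc(errors)));
  // Tier floor: 0 errors -> 750, 1 -> 500, 2 -> 250, 3 -> 0.
  const floor = (3 - e) * TIER_WIDTH;
  // Bonuses move a score within its tier, never past the tier's upper bound.
  const bonus = Math.min(TIER_WIDTH, Math.max(0, relevanceBonus));
  return floor + bonus;
}
```

Because the bonus is clamped to the tier width, a noisier match can at most tie with the weakest match of the next cleaner tier, and the result stays inside `0..=1000`.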
36
+ ### Added
37
+
38
+ - **Dynamic capacity (ABI 7, decision A16 executed).** The compile-time
39
+ capacity tiers are gone for good: the engine pools (chunk table, trigram
40
+ signatures, text pool, doc table, content hashes, name pool, tombstone
41
+ bitset, GPU candidate mask) moved from BSS `static mut` arrays to heap
42
+ allocations sized at runtime by the new export
43
+ `initWithCapacity(maxDocs, maxChunks, textPoolBytes, namePoolBytes) → 1|0`.
44
+ - `init()` is now a wrapper over `initWithCapacity` with the historical
45
+ std defaults (128 docs · 100k chunks · 16 MB text · 32 KB names) —
46
+ default behaviour is identical to every previous release.
47
+ - Hard ceilings (validated, return 0 — never trap): 65 536 docs, 4 M
48
+ chunks, 1 GiB text, 16 MiB names; allocation failure (memory.grow
49
+ refused) also returns 0 cleanly with the engine in a safe empty state.
50
+ - Re-init with the same capacities is a plain reset (no realloc);
51
+ different capacities free + re-allocate (no leak; the linear-memory
52
+ high-water mark stays, as WASM memory never shrinks).
53
+ - **Snapshots are admitted by CONTENT, not by writer capacity**: the
54
+ restore gate compares the snapshot's counters against the live runtime
55
+ capacities, so a snapshot saved by a `'large'` engine loads into a
56
+ `'std'` engine whenever its contents fit — and fails CLEANLY (previous
57
+ index intact) when they don't. v4 headers additionally record the
58
+ writer's capacities in the previously-reserved bytes [40..56], for
59
+ diagnostics only.
60
+ - **TS option `capacity`**: `'std'` (default) · `'large'`
61
+ (1 024 docs / 800k chunks / 128 MB text / 256 KB names — the old "pro"
62
+ tier) · custom `{ maxDocs?, maxChunks?, textPoolBytes?, namePoolBytes? }`
63
+ (partials completed from std scaled by std's ratios: chunks = docs×782,
64
+ text = chunks×168 B, names = docs×256 B, with floors; explicit
65
+ maxChunks below maxDocs is clamped up). Forwarded to workers and pool
66
+ shards (structured-clone-safe). `reset()` re-inits with the CONFIGURED
67
+ capacity, not the std defaults.
68
+ - `AlbexCapacityError` gains `max` — the runtime numeric limit of the
69
+ pool named by `limit` (e.g. `4` for `capacity: { maxDocs: 4 }`); it
70
+ survives the worker boundary.
71
+ - New large-scale bench (`bench/large.bench.ts`): 1 000 docs / 200k
72
+ chunks / 15.8 MB text in a `'large'` engine — search 141-769 ms
73
+ (rare-token → fuzzy), snapshot save 24 ms / restore 175 ms for a
74
+ ~20 MB snapshot. Numbers in `bench/README.md`. Binary cost of heap
75
+ pools: baseline 43 396 → 47 446 B (gate 48 KB still green).
76
+
77
+ - **Batch frontier reads (ABI 6).** Three groups of exports collapse the
78
+ JS↔WASM call count on the hot read paths:
79
+ - `getResultsPtr()` + `getResultStride()`: `_collectResults` now reads every
80
+ numeric field of all results with ONE `DataView` pass over the
81
+ `#[repr(C)]` `RESULTS` array (layout documented in the export's
82
+ doc-comment and compile-time asserted with `offset_of!`) instead of
83
+ ~12-15 frontier calls per result. Doc names are resolved once per
84
+ distinct document via its table slot — the per-result
85
+ `getResultDocName` (O(doc_count) inside WASM) is no longer called.
86
+ - `getPatternBloomLo()/Hi()`: the query Bloom the GPU pre-filter needs is
87
+ now computed in `setPattern` through the exact pipeline `searchBegin`
88
+ uses (split → optional Spanish stemming → fold). The third TypeScript
89
+ copy of the fold (`computePatternBloom`) is deleted; with stemming on it
90
+ silently diverged from the CPU pattern.
91
+ - `listChunksBatch(slot, startOrd, maxChunks)`: packs consecutive chunk
92
+ records `[u32 text_len][u32 location][text]` into the scratchpad;
93
+ `listChunks` now makes ~1 frontier call per 64 KB batch instead of 2-3
94
+ per chunk (an embeddings pipeline over 100k chunks drops from ~300k
95
+ calls to ~1.3k).
96
+ `abiVersion()` bumped 5 → 6; host pin updated. Snapshot format unchanged.
97
+ - **UTF-16 spans: `SearchResult.snippetStart/snippetEnd`.** The existing
98
+ `matchStart/matchEnd`/`matches[]` are UTF-8 BYTE offsets (now documented as
99
+ such) and mis-highlight as soon as the snippet contains accents. The new
100
+ fields are the primary span as UTF-16 code-unit indices over the decoded
101
+ snippet — safe for `snippet.slice()`. Byte fields are unchanged for
102
+ backwards compatibility.
103
+ - **Worker parity: `replaceDocument` + `takeDiagnostics`** added to the
104
+ worker protocol and `AlbexEngineWorker`. The class JSDoc now documents that
105
+ `attachOcr` cannot exist in the worker (functions don't cross postMessage)
106
+ and that scanned PDFs index with 0 chunks there — with the diagnostic
107
+ explaining why readable through `takeDiagnostics`, and the recommendation
108
+ to OCR on the main-thread engine and `save()`/`load()` the snapshot.
109
+ - **CI with teeth:** `cargo clippy -D warnings` for every crate (host pass
110
+ for core/ingest, wasm32 pass for the wasm and pdf crates), an advisory
111
+ `cargo fmt --check`, a hard 48 KB size gate on the baseline `.wasm`, a
112
+ build of the SIMD variant, and a manual-only (`workflow_dispatch`) bench
113
+ job with no thresholds.
114
+
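To see why byte offsets mis-highlight accented text, the following sketch converts a UTF-8 byte offset into a UTF-16 code-unit index. The helper `byteOffsetToUtf16` is hypothetical (it is not part of the Albex API) and assumes the offset falls on a code-point boundary.

```typescript
// Hypothetical helper: map a UTF-8 byte offset to a UTF-16 code-unit index
// over the same decoded string. Assumes the offset is on a code-point boundary.
function byteOffsetToUtf16(text: string, byteOffset: number): number {
  let bytes = 0;
  let units = 0;
  for (const ch of text) {
    if (bytes >= byteOffset) break;
    const cp = ch.codePointAt(0)!;
    // UTF-8 length of this code point.
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    // UTF-16 length: 2 for astral code points (surrogate pair), else 1.
    units += ch.length;
  }
  return units;
}

// "legal" occupies bytes 11..16 because "ó" takes two bytes in UTF-8.
const snippet = "la acción legal";
const start = byteOffsetToUtf16(snippet, 11);
const end = byteOffsetToUtf16(snippet, 16);
const highlighted = snippet.slice(start, end);
const misHighlighted = snippet.slice(11, 16); // byte offsets used directly
```

Slicing with the raw byte offsets shifts the highlight one character to the right for every two-byte character before the match, which is the failure mode the new `snippetStart/snippetEnd` fields avoid.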
115
+ ### Removed
116
+
117
+ - **`tier` option and tier plumbing (breaking, pre-1.0).** The deprecated
118
+ `AlbexOptions.tier` (ignored since 0.5.0), `AlbexEngine.tier`,
119
+ `AlbexPool.tier`, `EngineStats.tier` and the WASM `getTier` export are
120
+ gone — superseded by the runtime `capacity` option (ABI 7, decision
121
+ A16). `EngineStats` now reports the real runtime capacities (`maxDocs`,
122
+ `maxChunks`, `textCapacity`, plus new `namePoolBytes`). The deprecated
123
+ `pickTier`/`Tier` re-exports from `profile.ts` remain for source
124
+ compatibility. The `tier-mini`/`tier-std`/`tier-pro` Cargo features are
125
+ deleted; the only build-time switch left is `simd`.
126
+
127
+ ### Changed
128
+
129
+ - **Trigram windows no longer cross word boundaries.** `build_sig` /
130
+ `token_trigram_bits` only extend the sliding window on code points that
131
+ fold to an ASCII letter or digit; whitespace/punctuation/symbols break it
132
+ on BOTH the index and query sides (previously the fold passed ASCII
133
+ whitespace through, so signatures contained cross-word trigrams that
134
+ diluted selectivity — the doc-comment claimed otherwise). Sound under
135
+ fuzzy matching (same ≤3-windows-per-edit bound); signatures are rebuilt
136
+ on every restore/compact, so existing snapshots are unaffected.
137
+ - **`wee_alloc` replaced (RUSTSEC-2022-0054).** The main no_std module now
138
+ uses `dlmalloc` (`global` feature) — baseline binary 36 732 → 42 654 bytes
139
+ (+5.8 KB, under the 48 KB gate). `pdf-wasm` drops its custom allocator and
140
+ uses std's default (also dlmalloc): 1 193 074 → 1 199 910 bytes (+6.8 KB on
141
+ 1.2 MB). In pdf-wasm the allocator serves all of lopdf, so the known
142
+ wee_alloc leak surface was real there.
143
+
144
+ ### Fixed (this pass)
145
+
146
+ - **`getSnippetWindow` can no longer open or close mid code point.** When the
147
+ word-boundary snap found no space within 20 bytes, the window edge could
148
+ land on a UTF-8 continuation byte → U+FFFD at the snippet border and every
149
+ span visually off. Both edges now back off continuation bytes to the
150
+ code-point lead (same pattern as the chunker's hard cut).
151
+
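The back-off described in this fix relies on the fact that UTF-8 continuation bytes always match the bit pattern `10xxxxxx`. A minimal TypeScript sketch of the idea follows; the function name is hypothetical and the engine's own implementation lives in Rust.

```typescript
// Hypothetical sketch: move a window edge back to the lead byte of the
// code point it falls in. Continuation bytes satisfy (byte & 0xC0) === 0x80.
function backOffToCodePointStart(buf: Uint8Array, index: number): number {
  let i = Math.min(Math.max(index, 0), buf.length);
  while (i > 0 && i < buf.length && (buf[i] & 0xc0) === 0x80) {
    i--;
  }
  return i;
}

// "añb" encodes as 61 C3 B1 62; index 2 lands inside "ñ".
const encoded = new TextEncoder().encode("añb");
const safeEdge = backOffToCodePointStart(encoded, 2);
const decodedTail = new TextDecoder().decode(encoded.subarray(safeEdge));
```

Applied to a window start, the edge moves back to include the whole code point; applied to a window end, it moves back to exclude the partial one. In both cases no U+FFFD appears at the border.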
152
+ ### Added (previous pass)
153
+
154
+ - **Authoritative chunk enumeration (ABI 4).** New `AlbexEngine.listChunks(docId)`
155
+ (and `AlbexEngineWorker.listChunks`) returns the exact chunks Albex indexed for
156
+ a document — `{ docId, location, ord, sub, text, byteLen, id }` — so a host can
157
+ mirror Albex's chunking for a parallel index (e.g. embeddings) keyed on the
158
+ `compact()`-stable id `"<docId>::<ord>"`. Backed by four read-only
159
+ WASM exports (`getDocChunkBase`, `getChunkLocationAt`, `getChunkByteLenAt`,
160
+ `getChunkTextAt`); `abiVersion()` bumped 3 → 4. No change to search, scoring or
161
+ snapshot formats; v3 snapshots still load.
162
+ - **`SearchResult` now carries `docId` and `chunkId`.** `chunkId` (`"<docId>::<ord>"`)
163
+ is identical to the matching `AuthoritativeChunk.id`, so a host can fuse search
164
+ hits with a parallel index on a single stable key — no `(name, location)`
165
+ heuristics, and unambiguous when a long paragraph splits into sub-chunks.
166
+ - **`maxFileBytes` option (default 256 MiB).** `indexFile` now rejects oversized
167
+ inputs with a typed `AlbexCapacityError` (`limit: 'file'`) by checking
168
+ `File.size` BEFORE reading — a 2 GB file no longer gets fully buffered and
169
+ hashed just to fail later. Enforced by the engine, the worker wrapper and the
170
+ pool coordinator.
171
+
172
+ ### Fixed
173
+
174
+ - **GPU pre-filter no longer corrupts the search pattern.** `setCandidateMask`
175
+ pushes the candidate bitset through the scratchpad, overwriting the pattern
176
+ staged by `selectQueryBranch`; `searchBegin` then compiled garbage tokens out
177
+ of the mask bytes, so every GPU-assisted cooperative search silently returned
178
+ wrong results. The active branch is now re-selected after the pre-filter.
179
+ - **GPU bloom upload invalidation is now mutation-driven.** The upload used to
180
+ be skipped when the chunk count matched the last upload — but `compact()` can
181
+ reorder chunks while keeping the count identical, leaving the GPU mask
182
+ filtering the wrong chunks (silent false negatives). A dirty flag set by every
183
+ index mutation (index/remove/compact/reset/load) now forces the re-upload.
184
+ - **`AlbexEngineWorker.init` forwards all engine options.** Previously only
185
+ `wasmUrl` and `pdfWasmUrl` crossed the worker boundary; `wasmBaseUrl`, `simd`,
186
+ `gpu`, `gpuThreshold` and `maxFileBytes` were silently dropped, making the
187
+ SIMD variant and the GPU policy unreachable in workers.
188
+ - **OR-branch dedup keyed on `chunkId`.** The previous
189
+ `(doc, location, matchStart)` key collided when two sub-chunks of the same
190
+ location hit at the same relative offset, dropping a legitimate result.
191
+ - **`AlbexPool.search` now actually caps and dedups after the merge.** The doc
192
+ promised "capped to setMaxResults AFTER merge" but the coordinator returned
193
+ the raw flattened buckets. Results are now deduped, sorted and capped to the
194
+ last `setMaxResults` value (default 50); per-shard search stats are captured
195
+ race-free in the same posting batch as the search itself.
196
+ - Corrected the `listChunks` docstring (and the entry above): the canonical
197
+ chunk id format is `"<docId>::<ord>"`, not `"<docId>::<location>.<sub>"`.
198
+
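The dedup, sort and cap steps after a pool merge can be sketched as below. This is an assumption-laden illustration: the `Hit` shape, the name `mergeShardResults`, deduping by `chunkId` while keeping the best score, and sorting by descending score are all guesses at the behaviour described, not the coordinator's actual code.

```typescript
// Hypothetical result shape; the real SearchResult carries more fields.
interface Hit {
  chunkId: string;
  score: number;
}

// Dedup by chunkId (keeping the highest score), sort descending, then cap.
function mergeShardResults(buckets: Hit[][], maxResults = 50): Hit[] {
  const best = new Map<string, Hit>();
  for (const bucket of buckets) {
    for (const hit of bucket) {
      const seen = best.get(hit.chunkId);
      if (seen === undefined || hit.score > seen.score) {
        best.set(hit.chunkId, hit);
      }
    }
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, maxResults);
}

const merged = mergeShardResults(
  [
    [{ chunkId: "a::0", score: 900 }, { chunkId: "b::3", score: 800 }],
    [{ chunkId: "a::0", score: 950 }, { chunkId: "c::1", score: 700 }],
  ],
  2,
);
```

Capping only after the merge matters: capping per shard first could drop a hit that would have ranked inside the global top N.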
199
+ ## [0.6.0] — 2026-05-31
200
+
201
+ Algorithmic audit and robustness release. Closes findings #1–#8 of the
+ in-depth external review: it sharply raises pre-filter selectivity,
+ eliminates two classes of silent data corruption/loss, and along the way
+ fixes a soundness gap in the Bloom filter under fuzzy matching that had
+ existed from the start. The binary grows ~2 KB (33 KB baseline · 37 KB SIMD)
+ for the trigram pre-filter code.
207
+
208
+ ### Breaking changes
209
+
210
+ - **WASM ABI v2 → v3** (finding #7). The main binary now exports
+ `getLastIndexOverflow` and the host pins the accepted ABI range to `[3, 3]`
+ (previously `[1, 2]`, with an unreachable minimum that the list of required
+ exports already made impossible). A cached 0.5.x `.wasm` now fails with a
+ clear `AlbexAbiMismatchError` in `init()` instead of crashing later.
215
+
216
+ ### Added
217
+
218
+ - **q-gram trigram pre-filter** (finding #1). The 64-bit Bloom (one bit per
+ `c & 0x3F`) saturates on 512-byte prose chunks: every chunk contains almost
+ the whole alphabet, so it prunes nothing and Bitap ended up running over the
+ entire corpus. A 256-bit signature of each chunk's **trigrams** (`CHUNK_SIG`,
+ BSS, does not inflate the `.wasm`) is added as a second, far more selective
+ pre-filter. It is **sound under fuzzy matching**: an occurrence with `e`
+ errors preserves ≥ `N − 3e` of the token's exact trigrams (q-gram lemma), and
+ the signature has no false negatives, so it never discards a real match; it
+ only prunes chunks that provably cannot contain one. Bitap confirms every
+ survivor: correctness is unchanged, only the candidate set shrinks.
+ - **`AlbexCapacityError` with a `limit` field** (`'chunks' | 'text' | 'docs'
+ | 'names'`). The new `getLastIndexOverflow()` export reports which pool
+ filled up during the last `begin..endDocument`.
233
+
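The q-gram bound used by the trigram pre-filter can be illustrated with a small sketch. Everything here is hypothetical: the hash in `trigramBit`, the helper names and the plain-ASCII handling are assumptions made for illustration, and the real `CHUNK_SIG` code is in Rust. Only the 256-bit signature size and the `N − 3e` bound come from the entry above.

```typescript
// Illustrative hash into 0..255; NOT the engine's actual trigram hash.
function trigramBit(a: number, b: number, c: number): number {
  return ((a * 31 + b) * 31 + c) & 0xff;
}

// Bit positions of every trigram of one word.
function trigramBits(word: string): number[] {
  const bits: number[] = [];
  for (let i = 0; i + 3 <= word.length; i++) {
    bits.push(trigramBit(word.charCodeAt(i), word.charCodeAt(i + 1), word.charCodeAt(i + 2)));
  }
  return bits;
}

// 256-bit signature stored as eight 32-bit words.
function buildSignature(words: string[]): Uint32Array {
  const sig = new Uint32Array(8);
  for (const w of words) {
    for (const bit of trigramBits(w)) sig[bit >>> 5] |= 1 << (bit & 31);
  }
  return sig;
}

// q-gram lemma: an occurrence with `errors` edits keeps at least
// N - 3*errors of the token's N trigrams, so seeing fewer proves no match.
function mayContain(sig: Uint32Array, token: string, errors: number): boolean {
  const bits = trigramBits(token);
  const required = bits.length - 3 * errors;
  if (required <= 0) return true; // token too short to prune safely
  let present = 0;
  for (const bit of bits) {
    if ((sig[bit >>> 5] & (1 << (bit & 31))) !== 0) present++;
  }
  return present >= required;
}

const sig = buildSignature(["contrato", "marco"]);
```

Hash collisions can only set extra bits, so they produce false positives (a chunk survives and Bitap rejects it later) but never false negatives, which is what keeps the filter sound.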
234
+ ### Fixed
235
+
236
+ - **Silent data loss on capacity exhaustion** (finding #3). The WASM pools
+ filled up silently and `indexFile` returned a half-indexed document
+ indistinguishable from a complete one (`getStats()` lied). `indexFile` now
+ reads the overflow flag and throws `AlbexCapacityError`. The overflowing
+ document is **rolled back atomically** inside the WASM (`endDocument`
+ restores `chunk_count`/`text_used`/`name_used` to their values at the start
+ of the doc): indexing is all-or-nothing per file, with no orphaned, unnamed
+ chunks.
+ - **Re-entrancy in the async path** (finding #2). With a single WASM instance
+ holding global state and async ops that yield to the scheduler between
+ slices, two overlapping operations corrupted each other (a new `searchBegin`
+ reset the cursor of an in-flight cooperative search). Async ops are now
+ serialized through an internal queue; synchronous mutators/searches refuse
+ to run mid-operation (`AlbexError` kind `'busy'`) instead of corrupting
+ state. The worker runtime processes messages strictly in order.
+ - **Bloom unsound under fuzzy matching** (corollary of #1). The character
+ Bloom filter rejected an approximate match when the substituted character
+ was not in the chunk (a latent bug predating 0.6.0). The Bloom now applies
+ only to exact tokens (`eff_errors == 0`); fuzzy tokens are filtered by the
+ trigram count alone, which is sound.
+ - **Phrase + `windowed`** (finding #7). The phrase post-filter ran against the
+ trimmed snippet, so `{ windowed: true }` could discard a valid match whose
+ second term fell outside the window. The adjacency check now runs against
+ the chunk's **full text**; windowing only affects display.
+ - **GPU + OR**: the GPU pre-filter hash always used the pattern of branch 0,
+ producing a wrong candidate mask for branches i≠0 of an OR query and
+ silently dropping their hits (finding #6).
+ - **Chunk cut mid UTF-8 code point** (finding #7): the hard cut at 512 bytes
+ could split a multibyte sequence; it now backs off to the code-point
+ boundary and the border snippet no longer renders `�`.
+ - **`replaceDocument` did not reclaim space** (finding #7): repeated replaces
+ left tombstones in the text pool; it now compacts opportunistically under
+ pressure.
270
+
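Serializing async operations over shared state is commonly done with a promise chain. The sketch below shows that general technique under stated assumptions: `SerialQueue` is a hypothetical name, and the changelog does not say how Albex's internal queue is actually built.

```typescript
// Hypothetical promise-chain queue: each task starts only after the previous
// one has settled, so async operations on shared state never interleave.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(() => task());
    // Swallow rejections on the chain so one failed task does not block the rest.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

const queue = new SerialQueue();
const order: string[] = [];

const slow = queue.run(async () => {
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  order.push("slow");
});
const fast = queue.run(async () => {
  order.push("fast");
});
const bothDone = Promise.all([slow, fast]);
```

Even though the second task has nothing to wait for, it runs after the first one finishes. The caller still receives each task's own result or rejection through the promise returned by `run`.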
271
+ ### Performance
272
+
273
+ - `prepareQuery` no longer does a 64 KB `memset` on the stack per query
+ (finding #4). The working query is bounded to `MAX_QUERY_BYTES` (1 KB).
275
+
276
+ ### Tooling
277
+
278
+ - `npm run relaunch`: cleans artifacts, rebuilds WASM (baseline + SIMD) +
+ PDF + TS + OCR, runs the tests, packs the library (`npm pack`) and serves
+ the demo at `http://localhost:5173/demo/`. Helper scripts: `clean` /
+ `clean:all` / `build:ocr`.
+ - Coverage: +16 tests (110 total) covering trigram selectivity, fuzzy
+ soundness, signature survival across compact/restore, the capacity error,
+ the concurrency guard, and phrase+windowed.
285
+
286
+ ## [0.5.0] — 2026-05-30
287
+
288
+ Hardening release: closes the five actionable bug classes from the external
+ code audit. No new features that would need real data to be correct. The
+ binary grows ~2 KB over 0.4.0 for the query-parsing logic in Rust plus the
+ ABI version.
292
+
293
+ ### Breaking changes
294
+
295
+ - **`@albex/ocr` now requires `engine.attachOcr()`** (audit 3.8). The previous
+ pattern of mutating `engine.ocrImage = ...` and `engine.ocrConfig = ...`
+ directly is removed. For integrators already using `@albex/ocr`, **the
+ change is transparent**: `enableOcr(engine)` is still the same call. Anyone
+ with a hand-written adapter should use
+ `engine.attachOcr({ recognize, options })`.
+ - **Tiers removed** (audit 4.1). Mini/std/pro × baseline/SIMD (6 binaries) are
+ consolidated into **baseline + SIMD** (2 binaries). The `tier` parameter of
+ `AlbexOptions` remains as a deprecated no-op. The `albex_wasm_bg.wasm` alias
+ is kept for compatibility with `0.4.x`.
+ - `pickTier(profile)` now always returns `'std'`. The function stays exported
+ for source compatibility, not behavioural compatibility.
307
+
308
+ ### Fixed
309
+
310
+ - **Runtime validation of the WASM ABI** (audit 3.2). `asAlbexExports` and
+ `asAlbexPdfExports` are no longer ceremonial `as unknown as` casts: they
+ verify that `memory` is a `WebAssembly.Memory`, that every required export
+ exists, and that `abiVersion()` is within the supported range. They throw
+ `AlbexAbiMismatchError` with the list of missing exports when something does
+ not fit. Previously an incompatible binary instantiated silently and crashed
+ at the first call site that touched the missing function.
+ - **`makePdfWasmImports` fails fast** on unknown imports (consistent with the
+ earlier 0.3.1 change; it was already in place but is now aligned with the
+ module's ABI version).
320
+
321
+ ### Added
322
+
323
+ - **Structured diagnostics channel** (audit 3.6). The `console.warn` calls
+ scattered across the indexing path are gone, replaced by an internal
+ `AlbexDiagnostic[]` buffer that can be read with `engine.takeDiagnostics()`.
+ Each entry is `{kind, stage, message, file?, page?}` with `kind` one of
+ `'recovered' | 'skipped' | 'fallback' | 'info'`. Covered: PDF→OCR fallback,
+ per-image OCR failure, GPU falling back to CPU, PDF WASM download on a
+ restricted network. Capped at 256 entries so very corrupt corpora cannot
+ blow it up. The "best-effort" API stays, but the caller can now inspect what
+ was lost.
+ - **`engine.attachOcr(adapter)`**: a formal extension point. Returns an
+ `OcrHandle` with `dispose()`. The engine validates the contract and rejects
+ a second `attachOcr` while one is active. The public `engine.ocrImage`
+ property remains as a feature-detect getter but is not assignable, to avoid
+ the anti-encapsulation pattern flagged by the audit.
+ - **`abiVersion()` exported by both WASM modules**. Main = v2 (includes the
+ new query parser); PDF = v3 (includes image extraction). The TS validator
+ rejects out-of-range binaries.
340
+
341
+ ### Architecture — query parsing moves to WASM (audit "two truths")
342
+
343
+ Before 0.5.0, TypeScript owned `parseQuery`, `tokenize` and
+ `tokensToWasmQuery`, while Rust tokenized at index time: two truths about
+ what a "token" was.
+
+ - New WASM exports: `prepareQuery`, `getQueryKind`,
+ `getQueryBranchCount`, `getQueryBranchPattern`, `selectQueryBranch`.
+ - Up to **8 OR branches** supported, **4 tokens per branch**, **256 bytes per
+ compiled pattern**, all in static BSS with no allocation.
+ - `containsPhrase` stays in TS because it operates on snippets (WASM output),
+ not on the query, so it is not tokenizer divergence.
+ - `parseQuery`, `tokenize`, `tokensToWasmQuery` removed from the TS side.
+ - A single definition of "what is a token" shared by indexing and querying.
355
+
356
+ ### Build & maintenance
357
+
358
+ - **`prepublishOnly` rebuilds WASM and runs the tests**: in place since 0.3.1,
+ still guaranteed in 0.5.0.
+ - Simplified build pipeline: `scripts/build-wasm.mjs` produces only two
+ binaries. `npm pack --dry-run` shows 4 `.wasm` files instead of 8.
+ - `wasm/Cargo.toml` adds `wee_alloc` (~1 KB) for the staging Vec of the
+ atomic restore introduced in 0.4.0.
365
+
366
+ ### Tests
367
+
368
+ - 94 vitest cases green (up from 88 in 0.4.0). Five tier-matrix tests removed
+ (mini/std/pro no longer exist). New tests:
+ - `tests/abi-validation.test.ts` (5): checks that `AlbexAbiMismatchError` is
+ thrown for missing exports, an out-of-range abiVersion, and invalid memory.
+ - `tests/diagnostics.test.ts` (4): checks that `takeDiagnostics()` drains the
+ buffer, caps at 256, and that reset clears it.
375
+
376
+ ### Postponed, with reasons
+
+ Audit items that are NOT closed in 0.5.0 because closing them without real
+ data would be guesswork:
+
+ - **3.5 OCR parallelization**: optimizing without profiling is not
+ engineering.
+ - **3.9 Adaptive runtime with real metrics**: needs real corpora and usage to
+ validate decisions.
+ - **4.3 GPU equivalence test**: needs a corpus of >20k chunks, which is not
+ checked in yet.
+ - **Moving the 7 lite parsers to WASM**: ~3 weeks of serious work. Separable.
+ It is not a bug fix but an architectural improvement, cleaner with dedicated
+ time.
388
+
389
+ ## [0.4.0] — 2026-05-30
390
+
391
+ Closes two entire classes of bugs identified by the external code audit. No
+ cosmetic changes: the binary grows ~4 KB for the atomicity logic and the
+ allocator needed for the staging buffer.
394
+
395
+ ### Fixed — atomic snapshot restore (audit 3.4)
396
+
397
+ - **Snapshot v3 with a field-by-field format**. Replaces the byte-for-byte
+ copy of the internal `Chunk`/`DocEntry` structs (`from_raw_parts`) with an
+ explicit little-endian encoding. The format no longer depends on Rust's
+ in-memory layout, the target, padding, or changes to the types. What goes to
+ disk is a contract.
+ - **`restoreCommit()`: an atomic 3-phase protocol**. The old `restoreBegin`
+ reset the state and wrote the counters before receiving a single byte of
+ payload. If `restoreFeed` failed midway, the previous corpus was destroyed.
+ v3 accumulates the whole payload in a staging buffer and applies it to the
+ live state only once `restoreCommit` has validated that the full size
+ matches the header. A failed commit leaves the engine with the previous
+ corpus intact.
+ - **Backwards compatibility**. v1 and v2 snapshots still load: for them
+ `restoreBegin` keeps the old (non-atomic) semantics and `restoreCommit` is a
+ no-op that returns 1. The first `save()` after loading an old snapshot
+ rewrites it as v3.
+ - Binaries grow ~4 KB for the new logic and for `wee_alloc` (the only source
+ of allocation in the module, used by the staging Vec).
415
+
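The begin, feed, commit protocol can be modelled in a few lines. This is a hypothetical host-side model, not the WASM implementation: the class `StagedRestore`, its method names and the use of a raw byte array as the "live state" are all assumptions chosen to show why a failed commit cannot damage the previous corpus.

```typescript
// Hypothetical model of a staged restore. The live state is swapped only
// when the staged payload size matches the size announced at begin().
class StagedRestore {
  private live: Uint8Array;
  private staged: Uint8Array[] = [];
  private received = 0;
  private expected = 0;

  constructor(initial: Uint8Array) {
    this.live = initial;
  }

  begin(expectedBytes: number): void {
    this.staged = [];
    this.received = 0;
    this.expected = expectedBytes;
  }

  feed(chunk: Uint8Array): void {
    // Nothing touches the live state here; bytes only accumulate.
    this.staged.push(chunk);
    this.received += chunk.length;
  }

  commit(): boolean {
    const complete = this.received === this.expected;
    if (complete) {
      const payload = new Uint8Array(this.received);
      let offset = 0;
      for (const chunk of this.staged) {
        payload.set(chunk, offset);
        offset += chunk.length;
      }
      this.live = payload;
    }
    // Staging is discarded either way.
    this.staged = [];
    this.received = 0;
    return complete;
  }

  get state(): Uint8Array {
    return this.live;
  }
}
```

The cost of this design is memory: the staging buffer holds a second copy of the payload until commit, which is why the entry above mentions an allocator being added for it.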
416
+ ### Fixed — single source of truth for content hash (audit "two truths")
417
+
418
+ - **FNV-1a 64-bit now lives in Rust**. The TypeScript implementation that
+ duplicated the algorithm is gone. Three new exports
+ (`hashBegin`/`hashFeed`/`hashFinish`) implement the hash in streaming form
+ for files larger than the scratchpad. The engine's private `_contentHash`
+ method produces exactly the same 16-character hex string the TS version
+ returned, so no caller changes.
424
+
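For reference, streaming FNV-1a 64-bit is small enough to show in full. The sketch below is a standalone TypeScript version using `BigInt`; the names `fnvBegin`/`fnvFeed`/`fnvFinish` are hypothetical and merely mirror the begin/feed/finish shape of the WASM exports, whose actual signatures are not given here.

```typescript
// Standard FNV-1a 64-bit parameters.
const FNV_OFFSET_BASIS = BigInt("0xcbf29ce484222325");
const FNV_PRIME = BigInt("0x100000001b3");
const MASK_64 = BigInt("0xffffffffffffffff");

function fnvBegin(): bigint {
  return FNV_OFFSET_BASIS;
}

// XOR each byte into the state, then multiply by the prime modulo 2^64.
function fnvFeed(state: bigint, chunk: Uint8Array): bigint {
  let h = state;
  for (let i = 0; i < chunk.length; i++) {
    h = ((h ^ BigInt(chunk[i])) * FNV_PRIME) & MASK_64;
  }
  return h;
}

// 16-character, zero-padded lowercase hex string.
function fnvFinish(state: bigint): string {
  return state.toString(16).padStart(16, "0");
}

const data = new TextEncoder().encode("albex streaming hash");
const oneShot = fnvFinish(fnvFeed(fnvBegin(), data));
const streamed = fnvFinish(
  fnvFeed(fnvFeed(fnvBegin(), data.subarray(0, 7)), data.subarray(7)),
);
```

Because the state is a single 64-bit value carried between calls, feeding the input in arbitrary slices gives the same digest as hashing it at once, which is what makes the scheme usable for inputs larger than a fixed scratch buffer.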
425
+ ### Added — tests
426
+
427
+ - `tests/load-restores-docs.test.ts`: new test "a v3 restore that never
+ commits leaves the previous index intact". It checks atomicity explicitly:
+ truncates a snapshot's payload to 75 %, tries to load it, and verifies that
+ `load()` returns `false` and that the previous corpus is still indexed and
+ searchable.
+ - `tests/hash.test.ts`: rewritten to validate the WASM hash against the real
+ engine (the old version was a standalone TS re-implementation compared
+ against itself). Covers shape, determinism, single-byte sensitivity, the
+ FNV offset basis, and streaming over 96 KB (> scratchpad).
+ - 88 tests green (up from 85 in 0.3.1).
437
+
438
+ ### Postponed
439
+
440
+ - Moving the tokenizer and query parser to WASM (audit "TS wrapper does too
+ much") is deferred to 0.5.0. It is an architectural improvement, not a bug
+ fix, and it has enough design trade-offs (OR semantics, phrase post-filter)
+ that shipping a half-baked API is not worth it.
444
+
445
+ ## [0.3.1] — 2026-05-30
446
+
447
+ Hardening pass after an external code audit. No new features; three
448
+ specific issues addressed.
449
+
450
+ ### Fixed
451
+
452
+ - **Debug logs removed from the indexing hot path.** Three `console.log`
453
+ statements added during the OCR-worker-abort diagnostic session were
454
+ firing on every PDF (hybrid OCR decision) and every embedded image
455
+ (kind / len / magic-byte trace). They are gone; the legitimate
456
+ `console.warn` messages for actual failures stay.
457
+
458
+ - **`makePdfWasmImports` now fails fast on unknown imports.** Previously
459
+ any unrecognised import was satisfied with a `console.warn` stub,
460
+ which let the module instantiate and defer the real failure to an
461
+ arbitrary call inside `extractPdf`. The loader now throws
462
+ `AlbexInitError` at boot with a clear "rebuild your binary" message.
463
+ An unknown import means the wasm-bindgen / lopdf / getrandom graph
464
+ drifted from what this loader was written for; better to surface that
465
+ immediately than to hang or crash mid-extraction.
466
+
467
+ - **`prepublishOnly` now rebuilds every WASM artifact and runs the
468
+ entire test suite.** It was running only `tsc + banner.mjs`, which
469
+ meant the WASM binaries published to npm could be out of sync with
470
+ the current Rust source. The script is now `npm run build:all && npm
471
+ test`. Publishing takes longer, but the package is guaranteed to
472
+ contain binaries reproducible from the source it ships.
473
+
8
474
  ## [0.3.0] — 2026-05-30
9
475
 
10
476
  ### Hybrid PDF OCR (opt-in)
package/README.md CHANGED
@@ -69,7 +69,7 @@ That's the entire onboarding. Read on for what else the engine can do.
  - **Bundler-friendly default** — `new AlbexEngine()` works without extra
  configuration in bundlers that recognise the `new URL(..., import.meta.url)`
  asset pattern (see the "Install" section for the tested matrix).
- - **Fuzzy matching** — finds `"contrato"` even if you type `"conttrato"` (Bitap with adaptive edit distance).
+ - **Fuzzy matching** — finds `"contrato"` even if you type `"conttrato"` (Bitap with adaptive edit distance). Sound under a two-stage pre-filter (character Bloom for exact tokens, a 256-bit **trigram q-gram signature** for everything) that prunes the candidate set ~10× on prose without ever dropping a real approximate match.
  - **Accent-insensitive** — `"accion"` matches `"acción"`, `"espana"` matches `"España"`, plus Latin Extended (Polish, Czech, Slovak, Turkish…).
  - **11 formats with varying depth** — DOCX · XLSX · PDF · HTML · MD · JSON · CSV · EML · RTF · TXT · XML. See the support table below; several formats are deliberately "lite" (CSV is RFC-4180-lite, EML is MIME-lite, RTF is regex-stripped, etc.).
  - **Phrase + OR queries** — `"contrato marco"` and `contrato | acuerdo` work out of the box.
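The claim that a trigram signature can prune candidates without dropping a real approximate match follows from the q-gram lemma: one edit can destroy at most three of the query's trigrams. The sketch below illustrates that idea in general terms and is not albex's implementation. The FNV-1a hash, the function names and the "at most 3 × edits missing bits" threshold are assumptions made for the example, and the input is assumed to be already case- and accent-normalised.

```ts
const SIGNATURE_WORDS = 8; // 8 x 32 bits = 256-bit signature

// FNV-1a over the three UTF-16 code units starting at `start`,
// reduced to a bit index in [0, 255]. The hash choice is illustrative.
function trigramBit(text: string, start: number): number {
  let h = 0x811c9dc5;
  for (let i = start; i < start + 3; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) & 255;
}

// Set one bit per trigram occurring in the text.
function trigramSignature(text: string): Uint32Array {
  const sig = new Uint32Array(SIGNATURE_WORDS);
  for (let i = 0; i + 3 <= text.length; i++) {
    const bit = trigramBit(text, i);
    const word = bit >>> 5;
    sig[word] = (sig[word] ?? 0) | (1 << (bit & 31));
  }
  return sig;
}

// Kernighan's bit count; works on the signed 32-bit values JS bit ops produce.
function popcount32(x: number): number {
  let count = 0;
  while (x !== 0) {
    x &= x - 1;
    count++;
  }
  return count;
}

// Keep a chunk unless more query bits are missing than `maxEdits` edits
// could explain. Each edit removes at most 3 trigrams, hence at most 3 bits.
function mayMatch(query: Uint32Array, chunk: Uint32Array, maxEdits: number): boolean {
  let missing = 0;
  for (let w = 0; w < SIGNATURE_WORDS; w++) {
    missing += popcount32((query[w] ?? 0) & ~(chunk[w] ?? 0));
  }
  return missing <= 3 * maxEdits;
}
```

With one allowed edit, `mayMatch(trigramSignature('contrato'), trigramSignature('el conttrato marco'), 1)` stays `true`, because the typo removes only the `ntr` trigram. Hash collisions can let a non-matching chunk through, so a verifier such as Bitap still has to run on the survivors.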
@@ -81,8 +81,11 @@ That's the entire onboarding. Read on for what else the engine can do.
  - **WebGPU pre-filter** — experimental, opt-in (`gpu: 'auto'`). Implemented for corpora over 20 k chunks; no reproducible speedup number yet — the bench in this repo runs on a 200-document synthetic corpus only.
  - **SIMD opportunistic** — picks a SIMD-accelerated variant when the host supports v128.
  - **Tiered storage** — `TieredStore` keeps recent docs hot, evicts cold ones to OPFS, promotes on demand.
+ - **Runtime capacity** — one binary, pools sized at init: `capacity: 'std'` (default, 128 docs / 100k chunks / 16 MB text), `'large'` (1 024 docs / 800k chunks / 128 MB text) or a custom `{ maxDocs, maxChunks, textPoolBytes, namePoolBytes }`.
+ - **Capacity-safe** — when a pool fills (`docs`/`chunks`/`text`/`names`), `indexFile` throws `AlbexCapacityError` with `limit` (which pool) and `max` (the runtime limit) instead of silently truncating the corpus.
+ - **Re-entrancy-safe** — async operations on one engine serialize; sync `search`/`compact`/`reset` refuse to run mid-operation (`AlbexError` kind `busy`) rather than corrupting the shared WASM state. Use `searchCooperative` for overlapping search-as-you-type.
  - **Typed errors** — `AlbexParseError`, `AlbexUnsupportedFormatError`, `AlbexCapacityError`, `AlbexInitError`. All extend `AlbexError`.
- - **Tiny core** — main WASM 24 KB (27 KB SIMD). PDF module (~1.2 MB) loads on demand. The OCR companion (`@albex/ocr`) is a separate package and pulls Tesseract.js (~3.5 MB) only when you call `enableOcr()`.
+ - **Tiny core** — main WASM ~47 KB (~19 KB gzipped); the SIMD build is ~54 KB (~21 KB gzipped). PDF module (~1.2 MB, ~510 KB gzipped) loads on demand. The OCR companion (`@albex/ocr`) is a separate package and pulls Tesseract.js (~3.5 MB) only when you call `enableOcr()`.
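The re-entrancy behaviour in the list above (async operations queue up, synchronous calls are refused while work is in flight) can be sketched with a promise chain and a pending counter. This is one common way to get that behaviour, not necessarily how albex does it. `SerializedGate`, `BusyError`, `runAsync` and `runSync` are hypothetical names.

```ts
// Hypothetical stand-in for an error of kind `busy`.
class BusyError extends Error {
  readonly kind = 'busy';
}

class SerializedGate {
  private tail: Promise<void> = Promise.resolve();
  private pending = 0;

  // Queue an async operation behind every earlier one.
  runAsync<T>(op: () => Promise<T>): Promise<T> {
    this.pending++;
    const result = this.tail.then(op).finally(() => {
      this.pending--;
    });
    // Keep the chain alive even if `op` rejects.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }

  // Refuse synchronous work while anything async is queued or running.
  runSync<T>(op: () => T): T {
    if (this.pending > 0) {
      throw new BusyError('engine is busy');
    }
    return op();
  }
}
```

Counting queued operations as busy is a deliberately conservative choice for the sketch: a synchronous call issued right after an async one is rejected even if the async work has not started yet.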
 
  ---
 
@@ -197,7 +200,7 @@ const results = await pool.search('contrato'); // map-reduce
 
  ## Big corpora — tiered storage
 
- For workloads that exceed the tier's RAM capacity:
+ For workloads that exceed the engine's RAM capacity:
 
  ```ts
  import { AlbexEngine, TieredStore } from 'albex';
@@ -221,29 +224,40 @@ Hot tier = engine. Warm tier = original files in OPFS. LRU eviction is automatic
  `new AlbexEngine()` covers the default case. The options below address
  specific deployment needs:
 
- ### Tier auto-selection (`mini` / `std` / `pro` based on `deviceMemory`)
+ ### Capacity (runtime, single binary)
 
- Albex ships **six** WASM variants of the main engine (3 tiers × baseline/SIMD).
- By default it loads the std-baseline binary that comes with the npm package.
- If you want runtime tier auto-selection, serve the variants yourself and
- pass `wasmBaseUrl`:
+ Capacity is a **runtime** parameter: there is one engine binary (plus its
+ SIMD variant) and the pools are heap-allocated at `init()` to the size you
+ ask for. The old compile-time tiers (and the `tier` option) are gone:
 
  ```ts
  const engine = new AlbexEngine({
- wasmBaseUrl: '/assets', // directory containing the 6 .wasm files
- tier: 'auto', // picks mini/std/pro by deviceMemory
+ capacity: 'large', // or 'std' (default), or a custom object
  simd: 'auto', // picks baseline/simd by WASM probe
  gpu: 'auto', // engages WebGPU when corpus > 20k chunks
  });
  ```
 
- Tier capacities:
+ Presets and cost (≈ `maxChunks × 64 B + textPool + namePool`; WASM memory
+ never shrinks, so the largest capacity initialised stays committed):
 
- | Tier | Max docs | Max chunks | Max text | Working set |
- |-------|---------:|-----------:|---------:|------------:|
- | mini | 32 | 25 000 | 4 MB | ~5 MB |
- | std | 128 | 100 000 | 16 MB | ~20 MB |
- | pro | 1 024 | 800 000 | 128 MB | ~160 MB |
+ | Capacity | Max docs | Max chunks | Max text | Working set |
+ |-----------|---------:|-----------:|---------:|------------:|
+ | `'std'` | 128 | 100 000 | 16 MB | ~22 MB |
+ | `'large'` | 1 024 | 800 000 | 128 MB | ~180 MB |
+ | custom | 65 536 | 4 M | 1 GiB | as configured |
+
+ Custom objects may be partial — missing fields are completed from the std
+ ratios (`maxChunks = maxDocs × 782`, `textPoolBytes = maxChunks × 168 B`,
+ `namePoolBytes = maxDocs × 256 B`, with sane floors):
+
+ ```ts
+ const tiny = new AlbexEngine({ capacity: { maxDocs: 16 } });
+ ```
+
+ Snapshots are admitted by **content**: a snapshot saved with `'large'`
+ loads into a `'std'` engine whenever its counters fit, and fails cleanly
+ (previous index intact) when they don't.
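The ratios and the cost formula quoted in this section can be worked through in a few lines. The sketch below is an illustration under stated assumptions, not the engine's code. `completeCapacity` and `estimateWorkingSetBytes` are hypothetical helpers, the fallback to 128 docs when `maxDocs` is omitted is assumed, and the unspecified "sane floors" are left out.

```ts
interface Capacity {
  maxDocs: number;
  maxChunks: number;
  textPoolBytes: number;
  namePoolBytes: number;
}

const STD_MAX_DOCS = 128; // assumed fallback when maxDocs is omitted

// Fill missing fields from the std ratios quoted in the README text.
// The "sane floors" are not specified there and are deliberately omitted.
function completeCapacity(partial: Partial<Capacity>): Capacity {
  const maxDocs = partial.maxDocs ?? STD_MAX_DOCS;
  const maxChunks = partial.maxChunks ?? maxDocs * 782;
  const textPoolBytes = partial.textPoolBytes ?? maxChunks * 168;
  const namePoolBytes = partial.namePoolBytes ?? maxDocs * 256;
  return { maxDocs, maxChunks, textPoolBytes, namePoolBytes };
}

// Approximate committed bytes: maxChunks x 64 B + textPool + namePool.
function estimateWorkingSetBytes(c: Capacity): number {
  return c.maxChunks * 64 + c.textPoolBytes + c.namePoolBytes;
}

// { maxDocs: 16 } expands to 12 512 chunks, 2 102 016 text bytes, 4 096 name bytes.
const tinyCapacity = completeCapacity({ maxDocs: 16 });
const tinyBytes = estimateWorkingSetBytes(tinyCapacity); // 2 906 880
```

Reading the `'std'` row as 100 000 chunks, a 16 MiB text pool and a 32 KiB name pool (the last derived from the ratio), the same formula gives 23 209 984 bytes, which is consistent with the ~22 MB in the table.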
 
  ### Custom CDN
 
@@ -297,14 +311,13 @@ PDF support requires `albex_pdf.wasm` to be served with MIME type `application/w
  rustup target add wasm32-unknown-unknown
 
  npm install
- npm run build:all # 6 main variants + PDF + TypeScript
+ npm run build:all # main (baseline + SIMD) + PDF + TypeScript
  ```
 
  Partial builds:
 
  ```bash
- npm run build:wasm # std baseline only
- npm run build:wasm:tiers # all 6 variants
+ npm run build:wasm # main module (baseline + SIMD)
  npm run build:pdf-wasm # PDF module
  npm run build # TypeScript only
  ```