albex 0.3.0 → 0.6.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +466 -0
- package/README.md +32 -19
- package/dist/albex-worker.d.ts +65 -2
- package/dist/albex-worker.d.ts.map +1 -1
- package/dist/albex-worker.js +97 -20
- package/dist/albex-worker.js.map +1 -1
- package/dist/albex.d.ts +359 -55
- package/dist/albex.d.ts.map +1 -1
- package/dist/albex.js +766 -312
- package/dist/albex.js.map +1 -1
- package/dist/errors.d.ts +47 -2
- package/dist/errors.d.ts.map +1 -1
- package/dist/errors.js +41 -3
- package/dist/errors.js.map +1 -1
- package/dist/persistence.js +1 -1
- package/dist/pool/coordinator.d.ts +14 -6
- package/dist/pool/coordinator.d.ts.map +1 -1
- package/dist/pool/coordinator.js +65 -28
- package/dist/pool/coordinator.js.map +1 -1
- package/dist/profile.d.ts +11 -6
- package/dist/profile.d.ts.map +1 -1
- package/dist/profile.js +6 -13
- package/dist/profile.js.map +1 -1
- package/dist/resource-manager.js +1 -1
- package/dist/tiered-store.js +1 -1
- package/dist/wasm-bindings.d.ts +96 -6
- package/dist/wasm-bindings.d.ts.map +1 -1
- package/dist/wasm-bindings.js +110 -7
- package/dist/wasm-bindings.js.map +1 -1
- package/dist/worker-protocol.d.ts +23 -2
- package/dist/worker-protocol.d.ts.map +1 -1
- package/dist/worker-protocol.js +1 -1
- package/dist/worker-runtime.js +27 -3
- package/dist/worker-runtime.js.map +1 -1
- package/package.json +13 -9
- package/src/albex-worker.ts +103 -18
- package/src/albex.ts +2937 -2292
- package/src/errors.ts +63 -2
- package/src/pool/coordinator.ts +61 -34
- package/src/profile.ts +11 -10
- package/src/wasm-bindings.ts +225 -10
- package/src/worker-protocol.ts +12 -2
- package/src/worker-runtime.ts +28 -3
- package/wasm/pkg/albex_pdf.wasm +0 -0
- package/wasm/pkg/albex_wasm.wasm +0 -0
- package/wasm/pkg/albex_wasm_bg.wasm +0 -0
- package/wasm/pkg/albex_wasm_simd.wasm +0 -0
- package/wasm/pkg/albex_wasm_mini.wasm +0 -0
- package/wasm/pkg/albex_wasm_mini_simd.wasm +0 -0
- package/wasm/pkg/albex_wasm_pro.wasm +0 -0
- package/wasm/pkg/albex_wasm_pro_simd.wasm +0 -0
- package/wasm/pkg/albex_wasm_std.wasm +0 -0
- package/wasm/pkg/albex_wasm_std_simd.wasm +0 -0
package/CHANGELOG.md
CHANGED
|
@@ -5,6 +5,472 @@ All notable changes to Albex are documented in this file.
|
|
|
5
5
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
|
|
6
6
|
and Albex follows [Semantic Versioning](https://semver.org/).
|
|
7
7
|
|
|
8
|
+
## [Unreleased]
|
|
9
|
+
|
|
10
|
+
### Changed
|
|
11
|
+
|
|
12
|
+
- **IDF-aware relevance scoring — exact matches no longer saturate.** The old
|
|
13
|
+
`rich_score` set `base = 1000 − avg_errors·200`, so every 0-error hit reached the
|
|
14
|
+
1000 ceiling and `.min(1000)` discarded *all* the relevance bonuses (term
|
|
15
|
+
frequency, proximity, location, word-boundary AND idf). The practical effect:
|
|
16
|
+
on a real corpus every exact match tied at 1000 and the final ordering among
|
|
17
|
+
them was arbitrary — the chunk holding the rare, decisive query term was no
|
|
18
|
+
more likely to rank first than dozens that merely shared a common word.
|
|
19
|
+
Scores are now laid out as **error tiers with relevance ranking inside each
|
|
20
|
+
tier** (0 errors → `[750,1000]`, 1 → `[500,750]`, 2 → `[250,500]`, 3 →
|
|
21
|
+
`[0,250]`): a cleaner match still always beats a noisier one, but within a
|
|
22
|
+
tier the **IDF bonus (now the largest, up to 150)** lifts rare-term matches
|
|
23
|
+
decisively above common-term ones. Measured downstream (the Noesis hybrid-RAG
|
|
24
|
+
eval over a 293-doc / 12.4k-chunk corpus): lexical-leg MRR **0.19 → 0.40**,
|
|
25
|
+
R@5 **23% → 53%**, and end-to-end hybrid MRR **0.42 → 0.53** — with no change
|
|
26
|
+
on small corpora. `SearchResult.score` keeps its `0..=1000` range, so
|
|
27
|
+
`setThreshold` is unaffected for typical values; very weak 3-error fuzzy
|
|
28
|
+
matches (bonus-starved, score < the default threshold 250) are now filtered
|
|
29
|
+
where they previously slipped through.
|
|
30
|
+
- **Fixed a latent result-ordering bug** exposed by the scoring change. The
|
|
31
|
+
heap-sort already left `RESULTS` in descending order; a trailing reverse then
|
|
32
|
+
flipped it to *ascending*. It was invisible while all exact scores tied at
|
|
33
|
+
1000 and was masked from API consumers by the TypeScript-side re-sort, but the
|
|
34
|
+
WASM ABI now returns correctly descending results on its own.
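The tier layout described in the scoring entry above can be sketched as follows. This is a minimal illustration, not Albex's actual `rich_score`: the helper name `tieredScore` and the way the bonus is clamped are assumptions; only the tier ranges come from the changelog.

```typescript
// Hypothetical helper: maps an error count and a relevance bonus into the
// tier ranges quoted in the changelog (0 errors -> [750,1000], ... 3 -> [0,250]).
const TIER_WIDTH = 250; // each error tier spans 250 points

function tieredScore(errors: number, bonus: number): number {
  // Clamp the error count to the four tiers 0..3.
  const e = Math.min(3, Math.max(0, Math.trunc(errors)));
  // Tier floor: 0 errors -> 750, 1 -> 500, 2 -> 250, 3 -> 0.
  const floor = (3 - e) * TIER_WIDTH;
  // Relevance bonuses can only move a result inside its own tier.
  const inTier = Math.min(TIER_WIDTH, Math.max(0, bonus));
  return floor + inTier;
}

const rareExact = tieredScore(0, 150 + 40); // exact match with a large IDF bonus
const commonExact = tieredScore(0, 10 + 40); // exact match with a small IDF bonus
const bestOneError = tieredScore(1, 250); // best possible 1-error match
console.log(rareExact, commonExact, bestOneError);
```

In this sketch two exact matches no longer tie: the rare-term hit lands above the common-term hit, and both stay at or above the best 1-error score.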
|
|
35
|
+
|
|
36
|
+
### Added
|
|
37
|
+
|
|
38
|
+
- **Dynamic capacity (ABI 7, decision A16 executed).** The compile-time
|
|
39
|
+
capacity tiers are gone for good: the engine pools (chunk table, trigram
|
|
40
|
+
signatures, text pool, doc table, content hashes, name pool, tombstone
|
|
41
|
+
bitset, GPU candidate mask) moved from BSS `static mut` arrays to heap
|
|
42
|
+
allocations sized at runtime by the new export
|
|
43
|
+
`initWithCapacity(maxDocs, maxChunks, textPoolBytes, namePoolBytes) → 1|0`.
|
|
44
|
+
- `init()` is now a wrapper over `initWithCapacity` with the historical
|
|
45
|
+
std defaults (128 docs · 100k chunks · 16 MB text · 32 KB names) —
|
|
46
|
+
default behaviour is identical to every previous release.
|
|
47
|
+
- Hard ceilings (validated, return 0 — never trap): 65 536 docs, 4 M
|
|
48
|
+
chunks, 1 GiB text, 16 MiB names; allocation failure (memory.grow
|
|
49
|
+
refused) also returns 0 cleanly with the engine in a safe empty state.
|
|
50
|
+
- Re-init with the same capacities is a plain reset (no realloc);
|
|
51
|
+
different capacities free + re-allocate (no leak; the linear-memory
|
|
52
|
+
high-water mark stays, as WASM memory never shrinks).
|
|
53
|
+
- **Snapshots are admitted by CONTENT, not by writer capacity**: the
|
|
54
|
+
restore gate compares the snapshot's counters against the live runtime
|
|
55
|
+
capacities, so a snapshot saved by a `'large'` engine loads into a
|
|
56
|
+
`'std'` engine whenever its contents fit — and fails CLEANLY (previous
|
|
57
|
+
index intact) when they don't. v4 headers additionally record the
|
|
58
|
+
writer's capacities in the previously-reserved bytes [40..56], for
|
|
59
|
+
diagnostics only.
|
|
60
|
+
- **TS option `capacity`**: `'std'` (default) · `'large'`
|
|
61
|
+
(1 024 docs / 800k chunks / 128 MB text / 256 KB names — the old "pro"
|
|
62
|
+
tier) · custom `{ maxDocs?, maxChunks?, textPoolBytes?, namePoolBytes? }`
|
|
63
|
+
(partials completed from std scaled by std's ratios: chunks = docs×782,
|
|
64
|
+
text = chunks×168 B, names = docs×256 B, with floors; explicit
|
|
65
|
+
maxChunks below maxDocs is clamped up). Forwarded to workers and pool
|
|
66
|
+
shards (structured-clone-safe). `reset()` re-inits with the CONFIGURED
|
|
67
|
+
capacity, not the std defaults.
|
|
68
|
+
- `AlbexCapacityError` gains `max` — the runtime numeric limit of the
|
|
69
|
+
pool named by `limit` (e.g. `4` for `capacity: { maxDocs: 4 }`); it
|
|
70
|
+
survives the worker boundary.
|
|
71
|
+
- New large-scale bench (`bench/large.bench.ts`): 1 000 docs / 200k
|
|
72
|
+
chunks / 15.8 MB text in a `'large'` engine — search 141-769 ms
|
|
73
|
+
(rare-token → fuzzy), snapshot save 24 ms / restore 175 ms for a
|
|
74
|
+
~20 MB snapshot. Numbers in `bench/README.md`. Binary cost of heap
|
|
75
|
+
pools: baseline 43 396 → 47 446 B (gate 48 KB still green).
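The completion of a partial `capacity` object from std's ratios might look roughly like this. It is a sketch under stated assumptions: `completeCapacity` is a hypothetical name, the floors mentioned above are omitted because their values are not given, and a missing `maxDocs` is assumed to fall back to the std value of 128.

```typescript
interface Capacity {
  maxDocs: number;
  maxChunks: number;
  textPoolBytes: number;
  namePoolBytes: number;
}

// Ratios quoted in the changelog entry.
const CHUNKS_PER_DOC = 782;
const TEXT_BYTES_PER_CHUNK = 168;
const NAME_BYTES_PER_DOC = 256;
const STD_MAX_DOCS = 128;

// Hypothetical helper: fills the missing fields of a partial capacity.
function completeCapacity(partial: Partial<Capacity>): Capacity {
  const maxDocs = partial.maxDocs ?? STD_MAX_DOCS; // assumption: std default
  let maxChunks = partial.maxChunks ?? maxDocs * CHUNKS_PER_DOC;
  // An explicit maxChunks below maxDocs is clamped up.
  if (maxChunks < maxDocs) maxChunks = maxDocs;
  const textPoolBytes = partial.textPoolBytes ?? maxChunks * TEXT_BYTES_PER_CHUNK;
  const namePoolBytes = partial.namePoolBytes ?? maxDocs * NAME_BYTES_PER_DOC;
  return { maxDocs, maxChunks, textPoolBytes, namePoolBytes };
}

console.log(completeCapacity({ maxDocs: 4 }));
```

The real engine additionally validates the result against the hard ceilings and returns 0 instead of trapping; that step is left out here.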
|
|
76
|
+
|
|
77
|
+
- **Batch frontier reads (ABI 6).** Three groups of exports collapse the
|
|
78
|
+
JS↔WASM call count on the hot read paths:
|
|
79
|
+
- `getResultsPtr()` + `getResultStride()`: `_collectResults` now reads every
|
|
80
|
+
numeric field of all results with ONE `DataView` pass over the
|
|
81
|
+
`#[repr(C)]` `RESULTS` array (layout documented in the export's
|
|
82
|
+
doc-comment and compile-time asserted with `offset_of!`) instead of
|
|
83
|
+
~12-15 frontier calls per result. Doc names are resolved once per
|
|
84
|
+
distinct document via its table slot — the per-result
|
|
85
|
+
`getResultDocName` (O(doc_count) inside WASM) is no longer called.
|
|
86
|
+
- `getPatternBloomLo()/Hi()`: the query Bloom the GPU pre-filter needs is
|
|
87
|
+
now computed in `setPattern` through the exact pipeline `searchBegin`
|
|
88
|
+
uses (split → optional Spanish stemming → fold). The third TypeScript
|
|
89
|
+
copy of the fold (`computePatternBloom`) is deleted; with stemming on it
|
|
90
|
+
silently diverged from the CPU pattern.
|
|
91
|
+
- `listChunksBatch(slot, startOrd, maxChunks)`: packs consecutive chunk
|
|
92
|
+
records `[u32 text_len][u32 location][text]` into the scratchpad;
|
|
93
|
+
`listChunks` now makes ~1 frontier call per 64 KB batch instead of 2-3
|
|
94
|
+
per chunk (an embeddings pipeline over 100k chunks drops from ~300k
|
|
95
|
+
calls to ~1.3k).
|
|
96
|
+
`abiVersion()` bumped 5 → 6; host pin updated. Snapshot format unchanged.
|
|
97
|
+
- **UTF-16 spans: `SearchResult.snippetStart/snippetEnd`.** The existing
|
|
98
|
+
`matchStart/matchEnd`/`matches[]` are UTF-8 BYTE offsets (now documented as
|
|
99
|
+
such) and mis-highlight as soon as the snippet contains accents. The new
|
|
100
|
+
fields are the primary span as UTF-16 code-unit indices over the decoded
|
|
101
|
+
snippet — safe for `snippet.slice()`. Byte fields are unchanged for
|
|
102
|
+
backwards compatibility.
|
|
103
|
+
- **Worker parity: `replaceDocument` + `takeDiagnostics`** added to the
|
|
104
|
+
worker protocol and `AlbexEngineWorker`. The class JSDoc now documents that
|
|
105
|
+
`attachOcr` cannot exist in the worker (functions don't cross postMessage)
|
|
106
|
+
and that scanned PDFs index with 0 chunks there — with the diagnostic
|
|
107
|
+
explaining why readable through `takeDiagnostics`, and the recommendation
|
|
108
|
+
to OCR on the main-thread engine and `save()`/`load()` the snapshot.
|
|
109
|
+
- **CI with teeth:** `cargo clippy -D warnings` for every crate (host pass
|
|
110
|
+
for core/ingest, wasm32 pass for the wasm and pdf crates), an advisory
|
|
111
|
+
`cargo fmt --check`, a hard 48 KB size gate on the baseline `.wasm`, a
|
|
112
|
+
build of the SIMD variant, and a manual-only (`workflow_dispatch`) bench
|
|
113
|
+
job with no thresholds.
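To illustrate why the UTF-16 span entry above matters, here is a small sketch of converting a UTF-8 byte offset into a UTF-16 index. The helper `byteOffsetToUtf16` is hypothetical (not part of the Albex API) and assumes the byte offset falls on a code-point boundary.

```typescript
// Hypothetical helper: the decoded length of the byte prefix is the
// UTF-16 code-unit index that corresponds to the byte offset.
function byteOffsetToUtf16(snippetBytes: Uint8Array, byteOffset: number): number {
  const prefix = snippetBytes.subarray(0, byteOffset);
  return new TextDecoder("utf-8").decode(prefix).length;
}

const snippet = "café con leche";
const snippetBytes = new TextEncoder().encode(snippet);

// "con" occupies bytes 6..9 because "é" takes two bytes in UTF-8.
const start = byteOffsetToUtf16(snippetBytes, 6);
const end = byteOffsetToUtf16(snippetBytes, 9);

console.log(snippet.slice(6, 9)); // slicing with byte offsets misses the word
console.log(snippet.slice(start, end)); // slicing with UTF-16 indices is safe
```

With `snippetStart`/`snippetEnd` a host should not need this conversion at all; the sketch only shows what goes wrong when byte offsets are passed to `slice()`.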
|
|
114
|
+
|
|
115
|
+
### Removed
|
|
116
|
+
|
|
117
|
+
- **`tier` option and tier plumbing (breaking, pre-1.0).** The deprecated
|
|
118
|
+
`AlbexOptions.tier` (ignored since 0.5.0), `AlbexEngine.tier`,
|
|
119
|
+
`AlbexPool.tier`, `EngineStats.tier` and the WASM `getTier` export are
|
|
120
|
+
gone — superseded by the runtime `capacity` option (ABI 7, decision
|
|
121
|
+
A16). `EngineStats` now reports the real runtime capacities (`maxDocs`,
|
|
122
|
+
`maxChunks`, `textCapacity`, plus new `namePoolBytes`). The deprecated
|
|
123
|
+
`pickTier`/`Tier` re-exports from `profile.ts` remain for source
|
|
124
|
+
compatibility. The `tier-mini`/`tier-std`/`tier-pro` Cargo features are
|
|
125
|
+
deleted; the only build-time switch left is `simd`.
|
|
126
|
+
|
|
127
|
+
### Changed
|
|
128
|
+
|
|
129
|
+
- **Trigram windows no longer cross word boundaries.** `build_sig` /
|
|
130
|
+
`token_trigram_bits` only extend the sliding window on code points that
|
|
131
|
+
fold to an ASCII letter or digit; whitespace/punctuation/symbols break it
|
|
132
|
+
on BOTH the index and query sides (previously the fold passed ASCII
|
|
133
|
+
whitespace through, so signatures contained cross-word trigrams that
|
|
134
|
+
diluted selectivity — the doc-comment claimed otherwise). Sound under
|
|
135
|
+
fuzzy matching (same ≤3-windows-per-edit bound); signatures are rebuilt
|
|
136
|
+
on every restore/compact, so existing snapshots are unaffected.
|
|
137
|
+
- **`wee_alloc` replaced (RUSTSEC-2022-0054).** The main no_std module now
|
|
138
|
+
uses `dlmalloc` (`global` feature) — baseline binary 36 732 → 42 654 bytes
|
|
139
|
+
(+5.8 KB, under the 48 KB gate). `pdf-wasm` drops its custom allocator and
|
|
140
|
+
uses std's default (also dlmalloc): 1 193 074 → 1 199 910 bytes (+6.8 KB on
|
|
141
|
+
1.2 MB). In pdf-wasm the allocator serves all of lopdf, so the known
|
|
142
|
+
wee_alloc leak surface was real there.
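The word-bounded trigram window from the first entry of this list can be sketched as below. This is a simplified illustration, not the Rust `build_sig`: the function name is hypothetical, and the fold here is plain lower-casing of ASCII, so accented letters break the window in this sketch although the real fold may map them to ASCII.

```typescript
// Hypothetical sketch: collect trigrams without crossing word boundaries.
function wordBoundedTrigrams(text: string): Set<string> {
  const out = new Set<string>();
  let run = ""; // current run of ASCII letters/digits
  for (const ch of text.toLowerCase()) {
    if (/^[a-z0-9]$/.test(ch)) {
      run += ch;
      if (run.length >= 3) out.add(run.slice(-3));
    } else {
      run = ""; // whitespace, punctuation and symbols break the window
    }
  }
  return out;
}

console.log([...wordBoundedTrigrams("red cat")]);
```

For "red cat" only the in-word trigrams survive; cross-word windows such as "d c" are never produced, which is what keeps the signature selective.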
|
|
143
|
+
|
|
144
|
+
### Fixed (this pass)
|
|
145
|
+
|
|
146
|
+
- **`getSnippetWindow` can no longer open or close mid code point.** When the
|
|
147
|
+
word-boundary snap found no space within 20 bytes, the window edge could
|
|
148
|
+
land on a UTF-8 continuation byte → U+FFFD at the snippet border and every
|
|
149
|
+
span visually off. Both edges now back off continuation bytes to the
|
|
150
|
+
code-point lead (same pattern as the chunker's hard cut).
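Backing off continuation bytes to the code-point lead can be sketched like this. The helper name is hypothetical and the code is a TypeScript illustration of the idea, not the Rust implementation inside `getSnippetWindow`.

```typescript
// Hypothetical helper: move a byte index left until it no longer points at a
// UTF-8 continuation byte (bit pattern 10xxxxxx).
function snapToCodePointStart(bytes: Uint8Array, index: number): number {
  let i = Math.min(Math.max(index, 0), bytes.length);
  while (i > 0 && i < bytes.length && (bytes[i] & 0xc0) === 0x80) {
    i--;
  }
  return i;
}

// "añb" encodes as 61 C3 B1 62; index 2 points into the middle of "ñ".
const sample = new Uint8Array([0x61, 0xc3, 0xb1, 0x62]);
console.log(snapToCodePointStart(sample, 2));
```

Applying the same snap to both window edges means a decoder never sees a truncated sequence at the border, so no U+FFFD appears.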
|
|
151
|
+
|
|
152
|
+
### Added (previous pass)
|
|
153
|
+
|
|
154
|
+
- **Authoritative chunk enumeration (ABI 4).** New `AlbexEngine.listChunks(docId)`
|
|
155
|
+
(and `AlbexEngineWorker.listChunks`) returns the exact chunks Albex indexed for
|
|
156
|
+
a document — `{ docId, location, ord, sub, text, byteLen, id }` — so a host can
|
|
157
|
+
mirror Albex's chunking for a parallel index (e.g. embeddings) keyed on the
|
|
158
|
+
`compact()`-stable id `"<docId>::<ord>"`. Backed by four read-only
|
|
159
|
+
WASM exports (`getDocChunkBase`, `getChunkLocationAt`, `getChunkByteLenAt`,
|
|
160
|
+
`getChunkTextAt`); `abiVersion()` bumped 3 → 4. No change to search, scoring or
|
|
161
|
+
snapshot formats; v3 snapshots still load.
|
|
162
|
+
- **`SearchResult` now carries `docId` and `chunkId`.** `chunkId` (`"<docId>::<ord>"`)
|
|
163
|
+
is identical to the matching `AuthoritativeChunk.id`, so a host can fuse search
|
|
164
|
+
hits with a parallel index on a single stable key — no `(name, location)`
|
|
165
|
+
heuristics, and unambiguous when a long paragraph splits into sub-chunks.
|
|
166
|
+
- **`maxFileBytes` option (default 256 MiB).** `indexFile` now rejects oversized
|
|
167
|
+
inputs with a typed `AlbexCapacityError` (`limit: 'file'`) by checking
|
|
168
|
+
`File.size` BEFORE reading — a 2 GB file no longer gets fully buffered and
|
|
169
|
+
hashed just to fail later. Enforced by the engine, the worker wrapper and the
|
|
170
|
+
pool coordinator.
|
|
171
|
+
|
|
172
|
+
### Fixed
|
|
173
|
+
|
|
174
|
+
- **GPU pre-filter no longer corrupts the search pattern.** `setCandidateMask`
|
|
175
|
+
pushes the candidate bitset through the scratchpad, overwriting the pattern
|
|
176
|
+
staged by `selectQueryBranch`; `searchBegin` then compiled garbage tokens out
|
|
177
|
+
of the mask bytes, so every GPU-assisted cooperative search silently returned
|
|
178
|
+
wrong results. The active branch is now re-selected after the pre-filter.
|
|
179
|
+
- **GPU bloom upload invalidation is now mutation-driven.** The upload used to
|
|
180
|
+
be skipped when the chunk count matched the last upload — but `compact()` can
|
|
181
|
+
reorder chunks while keeping the count identical, leaving the GPU mask
|
|
182
|
+
filtering the wrong chunks (silent false negatives). A dirty flag set by every
|
|
183
|
+
index mutation (index/remove/compact/reset/load) now forces the re-upload.
|
|
184
|
+
- **`AlbexEngineWorker.init` forwards all engine options.** Previously only
|
|
185
|
+
`wasmUrl` and `pdfWasmUrl` crossed the worker boundary; `wasmBaseUrl`, `simd`,
|
|
186
|
+
`gpu`, `gpuThreshold` and `maxFileBytes` were silently dropped, making the
|
|
187
|
+
SIMD variant and the GPU policy unreachable in workers.
|
|
188
|
+
- **OR-branch dedup keyed on `chunkId`.** The previous
|
|
189
|
+
`(doc, location, matchStart)` key collided when two sub-chunks of the same
|
|
190
|
+
location hit at the same relative offset, dropping a legitimate result.
|
|
191
|
+
- **`AlbexPool.search` now actually caps and dedups after the merge.** The doc
|
|
192
|
+
promised "capped to setMaxResults AFTER merge" but the coordinator returned
|
|
193
|
+
the raw flattened buckets. Results are now deduped, sorted and capped to the
|
|
194
|
+
last `setMaxResults` value (default 50); per-shard search stats are captured
|
|
195
|
+
race-free in the same posting batch as the search itself.
|
|
196
|
+
- Corrected the `listChunks` docstring (and the entry above): the canonical
|
|
197
|
+
chunk id format is `"<docId>::<ord>"`, not `"<docId>::<location>.<sub>"`.
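The dedup, sort and cap step described for `AlbexPool.search` could be sketched as follows. The `Hit` shape and `mergeShardResults` are hypothetical, and keying the dedup on `chunkId` is an assumption carried over from the OR-branch entry; the coordinator's real key is not stated here.

```typescript
interface Hit {
  chunkId: string;
  score: number;
}

// Hypothetical merge of per-shard result buckets.
function mergeShardResults(buckets: Hit[][], maxResults = 50): Hit[] {
  const best = new Map<string, Hit>();
  for (const bucket of buckets) {
    for (const hit of bucket) {
      const seen = best.get(hit.chunkId);
      // Keep the highest-scoring copy of each chunk.
      if (!seen || hit.score > seen.score) best.set(hit.chunkId, hit);
    }
  }
  // Sort descending by score, then cap AFTER the merge.
  return [...best.values()].sort((a, b) => b.score - a.score).slice(0, maxResults);
}

const merged = mergeShardResults(
  [
    [{ chunkId: "a::0", score: 900 }, { chunkId: "b::3", score: 800 }],
    [{ chunkId: "a::0", score: 950 }, { chunkId: "c::1", score: 100 }],
  ],
  2,
);
console.log(merged);
```

Capping per shard before merging would risk dropping a globally strong hit, which is why the cap is applied last.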
|
|
198
|
+
|
|
199
|
+
## [0.6.0] — 2026-05-31
|
|
200
|
+
|
|
201
|
+
Algorithmic-audit and robustness release. Closes findings #1–#8 of the
|
|
202
|
+
in-depth external review: it sharply raises pre-filter selectivity,
|
|
203
|
+
eliminates two classes of silent data corruption/loss, and along the
|
|
204
|
+
way fixes a _soundness_ gap in the Bloom filter under fuzzy matching
|
|
205
|
+
that had existed from the start. The binary grows ~2 KB
|
|
206
|
+
(33 KB baseline · 37 KB SIMD) because of the trigram pre-filter code.
|
|
207
|
+
|
|
208
|
+
### Breaking changes
|
|
209
|
+
|
|
210
|
+
- **WASM ABI v2 → v3** (finding #7). The main binary now exports
|
|
211
|
+
`getLastIndexOverflow` and the host pins the accepted ABI range to `[3, 3]`
|
|
212
|
+
(previously `[1, 2]`, with an unreachable minimum that the list of required
|
|
213
|
+
exports already made impossible). A cached 0.5.x `.wasm` fails with a clear
|
|
214
|
+
`AlbexAbiMismatchError` in `init()` instead of crashing later.
|
|
215
|
+
|
|
216
|
+
### Added
|
|
217
|
+
|
|
218
|
+
- **Q-gram trigram pre-filter** (finding #1). The 64-bit Bloom
|
|
219
|
+
(one bit per `c & 0x3F`) is saturated on 512-byte prose chunks:
|
|
220
|
+
every chunk contains almost the whole alphabet, so it prunes nothing and Bitap
|
|
221
|
+
ended up running over the entire corpus. A 256-bit signature
|
|
222
|
+
of each chunk's **trigrams** (`CHUNK_SIG`, BSS, does not inflate the
|
|
223
|
+
`.wasm`) is added as a second, far more selective pre-filter. It is **sound under
|
|
224
|
+
fuzzy matching**: an occurrence with `e` errors keeps ≥ `N − 3e`
|
|
225
|
+
of the token's exact trigrams (q-gram lemma), and the signature has no
|
|
226
|
+
false negatives, so it never discards a real match; it only prunes
|
|
227
|
+
chunks that provably cannot contain it. Bitap confirms
|
|
228
|
+
every survivor: correctness is unchanged, only the candidate set
|
|
229
|
+
shrinks.
|
|
230
|
+
- **`AlbexCapacityError` with a `limit` field** (`'chunks' | 'text' | 'docs'
|
|
231
|
+
| 'names'`). The new `getLastIndexOverflow()` export reports which pool
|
|
232
|
+
filled up during the last `begin..endDocument`.
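The q-gram bound used by the trigram pre-filter can be made concrete with a short sketch. The function names are hypothetical, and `N` is assumed here to be the token length minus 2 (the number of trigram windows in the token); the real signature works on hashed bits rather than exact counts.

```typescript
// Number of trigram windows in a token of the given length (assumption: N = len - 2).
function trigramCount(tokenLength: number): number {
  return Math.max(0, tokenLength - 2);
}

// Q-gram lemma: each edit destroys at most 3 trigrams, so at least N - 3e survive.
function requiredTrigramHits(tokenLength: number, errors: number): number {
  return Math.max(0, trigramCount(tokenLength) - 3 * errors);
}

// A chunk can be pruned only when it has fewer hits than the bound requires.
function chunkMayMatch(hits: number, tokenLength: number, errors: number): boolean {
  return hits >= requiredTrigramHits(tokenLength, errors);
}

// An 8-letter token has 6 trigrams; with 1 error at least 3 must be present.
console.log(requiredTrigramHits(8, 1), chunkMayMatch(2, 8, 1));
```

Note that with 2 errors on an 8-letter token the bound drops to 0, so the filter cannot prune anything and Bitap has to examine the chunk.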
|
|
233
|
+
|
|
234
|
+
### Fixed
|
|
235
|
+
|
|
236
|
+
- **Silent data loss on capacity exhaustion** (finding #3). The
|
|
237
|
+
WASM pools filled up silently and `indexFile` returned a
|
|
238
|
+
half-indexed document indistinguishable from a complete one (`getStats()`
|
|
239
|
+
lied). `indexFile` now reads the overflow and throws `AlbexCapacityError`.
|
|
240
|
+
The overflowing document is **rolled back atomically** inside the WASM
|
|
241
|
+
(`endDocument` restores `chunk_count`/`text_used`/`name_used` to their values at the
|
|
242
|
+
start of the doc): all-or-nothing indexing per file, with no orphaned,
|
|
243
|
+
nameless chunks.
|
|
244
|
+
- **Re-entrancy on the async path** (finding #2). A single WASM instance
|
|
245
|
+
with global state and async ops that yield to the scheduler between slices: two
|
|
246
|
+
overlapping operations corrupted each other (a new `searchBegin` reset the
|
|
247
|
+
cursor of an in-flight cooperative search). Async ops are now serialized
|
|
248
|
+
through an internal queue; synchronous mutators/searches refuse
|
|
249
|
+
to run mid-operation (`AlbexError` kind `'busy'`) instead of
|
|
250
|
+
corrupting state. The worker-runtime processes messages strictly in order.
|
|
251
|
+
- **Bloom unsound under fuzzy matching** (corollary of #1). The character
|
|
252
|
+
Bloom filter rejected an approximate match when the substituted character
|
|
253
|
+
was not in the chunk (latent pre-0.6.0 bug). The Bloom is now
|
|
254
|
+
applied only to exact tokens (`eff_errors == 0`); fuzzy tokens are
|
|
255
|
+
filtered by the trigram count alone, which is sound.
|
|
256
|
+
- **Phrase + `windowed`** (finding #7). The phrase post-filter ran
|
|
257
|
+
against the trimmed snippet, so `{ windowed: true }` could discard
|
|
258
|
+
a valid match whose second term fell outside the window. The
|
|
259
|
+
adjacency check now runs against the chunk's **full text**;
|
|
260
|
+
windowing only affects display.
|
|
261
|
+
- **GPU + OR**: the GPU pre-filter hash always used the pattern of
|
|
262
|
+
branch 0, producing a wrong candidate mask for branches i≠0 of
|
|
263
|
+
an OR query and silently dropping their hits (finding #6).
|
|
264
|
+
- **Chunk cut mid UTF-8 code point** (finding #7): the hard cut
|
|
265
|
+
at 512 bytes could split a multibyte sequence; it now backs off to the
|
|
266
|
+
code-point boundary and the edge snippet no longer renders `�`.
|
|
267
|
+
- **`replaceDocument` not reclaiming space** (finding #7): repeated
|
|
268
|
+
replaces left tombstones in the text pool; it now compacts
|
|
269
|
+
opportunistically under pressure.
|
|
270
|
+
|
|
271
|
+
### Performance
|
|
272
|
+
|
|
273
|
+
- `prepareQuery` no longer does a 64 KB stack `memset` per query (finding
|
|
274
|
+
#4). The working query is bounded to `MAX_QUERY_BYTES` (1 KB).
|
|
275
|
+
|
|
276
|
+
### Tooling
|
|
277
|
+
|
|
278
|
+
- `npm run relaunch`: cleans artifacts, rebuilds WASM (baseline + SIMD) +
|
|
279
|
+
PDF + TS + OCR, runs the tests, packs the library (`npm pack`) and
|
|
280
|
+
serves the demo at `http://localhost:5173/demo/`. Helper scripts:
|
|
281
|
+
`clean` / `clean:all` / `build:ocr`.
|
|
282
|
+
- Coverage: +16 tests (110 in total): trigram selectivity, fuzzy
|
|
283
|
+
soundness, signature survival across compact/restore, capacity error,
|
|
284
|
+
concurrency guard, phrase+windowed.
|
|
285
|
+
|
|
286
|
+
## [0.5.0] — 2026-05-30
|
|
287
|
+
|
|
288
|
+
Hardening release: closes the five classes of actionable bugs
|
|
289
|
+
from the external code audit. No new features that would need real data
|
|
290
|
+
to be correct. The binary grows ~2 KB over 0.4.0 because of
|
|
291
|
+
the Rust query-parsing logic + ABI version.
|
|
292
|
+
|
|
293
|
+
### Breaking changes
|
|
294
|
+
|
|
295
|
+
- **`@albex/ocr` now requires `engine.attachOcr()`** (audit 3.8). The
|
|
296
|
+
previous pattern of mutating `engine.ocrImage = ...` and
|
|
297
|
+
`engine.ocrConfig = ...` directly is removed. For
|
|
298
|
+
integrators who used `@albex/ocr`, **the change is transparent**:
|
|
299
|
+
`enableOcr(engine)` is still the same call. Anyone with
|
|
300
|
+
a hand-written adapter should use `engine.attachOcr({ recognize, options })`.
|
|
301
|
+
- **Tiers removed** (audit 4.1). Mini/std/pro × baseline/SIMD (6
|
|
302
|
+
binaries) consolidated into **baseline + SIMD** (2 binaries). The
|
|
303
|
+
`tier` parameter of `AlbexOptions` remains as a deprecated no-op. The
|
|
304
|
+
`albex_wasm_bg.wasm` alias is kept for compatibility with `0.4.x`.
|
|
305
|
+
- `pickTier(profile)` now always returns `'std'`. The function stays
|
|
306
|
+
exported for source compatibility, not behavioural compatibility.
|
|
307
|
+
|
|
308
|
+
### Fixed
|
|
309
|
+
|
|
310
|
+
- **Runtime validation of the WASM ABI** (audit 3.2). `asAlbexExports` and
|
|
311
|
+
`asAlbexPdfExports` are no longer ceremonial `as unknown as` casts: they verify
|
|
312
|
+
that `memory` is a `WebAssembly.Memory`, that every required export
|
|
313
|
+
exists, and that `abiVersion()` is within the supported range. They throw
|
|
314
|
+
`AlbexAbiMismatchError` with the list of missing exports when something
|
|
315
|
+
does not fit. Previously, an incompatible binary instantiated silently and
|
|
316
|
+
crashed at the first call site that touched the missing function.
|
|
317
|
+
- **`makePdfWasmImports` fails fast** on unknown imports
|
|
318
|
+
(consistent with the earlier 0.3.1 change; already in place but now
|
|
319
|
+
aligned with the module's ABI version).
|
|
320
|
+
|
|
321
|
+
### Added
|
|
322
|
+
|
|
323
|
+
- **Structured diagnostics channel** (audit 3.6). The `console.warn` calls
|
|
324
|
+
scattered along the indexing path are gone, replaced by an
|
|
325
|
+
internal `AlbexDiagnostic[]` buffer that can be read with
|
|
326
|
+
`engine.takeDiagnostics()`. Each entry is `{kind, stage, message, file?,
|
|
327
|
+
page?}` with `kind` one of `'recovered' | 'skipped' | 'fallback' | 'info'`.
|
|
328
|
+
Covers: PDF→OCR fallback, per-image OCR failure, GPU falling back to CPU, PDF WASM
|
|
329
|
+
download on a restricted network. Capped at 256 entries so it does not blow up on
|
|
330
|
+
heavily corrupted corpora. The "best-effort" API is kept, but the
|
|
331
|
+
caller can now inspect what was lost.
|
|
332
|
+
- **`engine.attachOcr(adapter)`**: formal extension point. Returns an
|
|
333
|
+
`OcrHandle` with `dispose()`. The engine validates the contract and rejects
|
|
334
|
+
a second `attachOcr` while one is active. The public
|
|
335
|
+
`engine.ocrImage` property remains as a feature-detection getter but is not
|
|
336
|
+
assignable, to avoid the anti-encapsulation pattern flagged by the audit.
|
|
337
|
+
- **`abiVersion()` exported by both WASM modules**. Main = v2 (includes the
|
|
338
|
+
new query parser); PDF = v3 (includes image extraction). The TS
|
|
339
|
+
validator rejects out-of-range binaries.
|
|
340
|
+
|
|
341
|
+
### Architecture — query parsing moves to WASM (audit "two truths")
|
|
342
|
+
|
|
343
|
+
Before 0.5.0, TypeScript owned `parseQuery`, `tokenize` and
|
|
344
|
+
`tokensToWasmQuery`, while Rust tokenized at index time. Two truths
|
|
345
|
+
about what a "token" was.
|
|
346
|
+
|
|
347
|
+
- New WASM exports: `prepareQuery`, `getQueryKind`,
|
|
348
|
+
`getQueryBranchCount`, `getQueryBranchPattern`, `selectQueryBranch`.
|
|
349
|
+
- Up to **8 OR branches** supported, **4 tokens per branch**, **256
|
|
350
|
+
bytes per compiled pattern**, all in static BSS with no allocation.
|
|
351
|
+
- `containsPhrase` stays in TS because it operates on snippets (WASM
|
|
352
|
+
output), not on the query, so it is not a tokenizer divergence.
|
|
353
|
+
- `parseQuery`, `tokenize`, `tokensToWasmQuery` removed from TS.
|
|
354
|
+
- A single "what is a token" algorithm shared by indexing and querying.
|
|
355
|
+
|
|
356
|
+
### Build & maintenance
|
|
357
|
+
|
|
358
|
+
- **`prepublishOnly` rebuilds WASM + runs tests** since 0.3.1, now
|
|
359
|
+
guaranteed in 0.5.0.
|
|
360
|
+
- Simplified build pipeline: `scripts/build-wasm.mjs` produces only
|
|
361
|
+
two binaries. `npm pack --dry-run` shows 4 `.wasm` files instead
|
|
362
|
+
of 8.
|
|
363
|
+
- `wasm/Cargo.toml` adds `wee_alloc` (~1 KB) for the staging Vec of the
|
|
364
|
+
atomic restore from 0.4.0.
|
|
365
|
+
|
|
366
|
+
### Tests
|
|
367
|
+
|
|
368
|
+
- 94 vitest cases green (was 88 in 0.4.0). Five tier-matrix tests
|
|
369
|
+
removed (mini/std/pro no longer exist). New tests:
|
|
370
|
+
- `tests/abi-validation.test.ts` (5): checks that `AlbexAbiMismatchError`
|
|
371
|
+
is thrown for missing exports, an out-of-range abiVersion, and an invalid
|
|
372
|
+
memory.
|
|
373
|
+
- `tests/diagnostics.test.ts` (4): checks that `takeDiagnostics()` drains,
|
|
374
|
+
the cap is 256, and reset clears.
|
|
375
|
+
|
|
376
|
+
### Postponed for a reason
|
|
377
|
+
|
|
378
|
+
Audit items that are NOT closed in 0.5.0 because closing them without real
|
|
379
|
+
data would be guesswork:
|
|
380
|
+
|
|
381
|
+
- **3.5 OCR parallelization**: optimization without profiling is not engineering.
|
|
382
|
+
- **3.9 Adaptive runtime with real metrics**: needs a real corpus and real
|
|
383
|
+
usage to validate decisions.
|
|
384
|
+
- **4.3 GPU equivalence test**: needs a corpus of >20k chunks that is not yet
|
|
385
|
+
checked in.
|
|
386
|
+
- **7 lite parsers to WASM**: roughly 3 solid weeks. Separable. Not a bug
|
|
387
|
+
fix; an architectural improvement done more cleanly with dedicated time.
|
|
388
|
+
|
|
389
|
+
## [0.4.0] — 2026-05-30
|
|
390
|
+
|
|
391
|
+
Closes two entire classes of bugs identified by the external code
|
|
392
|
+
audit. No cosmetic changes; the binary grows ~4 KB because of the atomicity
|
|
393
|
+
logic and the allocator needed for the staging buffer.
|
|
394
|
+
|
|
395
|
+
### Fixed — atomic snapshot restore (audit 3.4)
|
|
396
|
+
|
|
397
|
+
- **Snapshot v3 with a field-by-field format**. Replaces the byte-for-byte copy
|
|
398
|
+
of the internal `Chunk`/`DocEntry` structs (`from_raw_parts`) with an
|
|
399
|
+
explicit little-endian encoding. The format no longer depends on
|
|
400
|
+
Rust's in-memory layout, the target, padding, or changes to
|
|
401
|
+
the types. What goes to disk is a contract.
|
|
402
|
+
- **`restoreCommit()`: atomic 3-phase protocol**. The old
|
|
403
|
+
`restoreBegin` reset the state and wrote the counters before
|
|
404
|
+
receiving a single byte of the payload. If `restoreFeed` failed midway,
|
|
405
|
+
the previous corpus was destroyed. v3 accumulates the whole payload in a
|
|
406
|
+
staging buffer and only applies it to the live state once `restoreCommit`
|
|
407
|
+
validates that the full size matches the header. A failed
|
|
408
|
+
commit leaves the engine with the previous corpus intact.
|
|
409
|
+
- **Backwards compatibility**. v1 and v2 still load; for them
|
|
410
|
+
`restoreBegin` keeps the old (non-atomic) semantics and
|
|
411
|
+
`restoreCommit` is a no-op that returns 1. The first `save()` after
|
|
412
|
+
loading an old snapshot rewrites it as v3.
|
|
413
|
+
- Binaries grow ~4 KB because of the new logic and `wee_alloc` (the only
|
|
414
|
+
source of allocation in the module, used by the staging Vec).
|
|
415
|
+
|
|
416
|
+
### Fixed — single source of truth for content hash (audit "two truths")
|
|
417
|
+
|
|
418
|
+
- **FNV-1a 64-bit now lives in Rust**. The TypeScript implementation that
|
|
419
|
+
duplicated the algorithm is gone. Three new exports
|
|
420
|
+
(`hashBegin`/`hashFeed`/`hashFinish`) implement streaming hashing
|
|
421
|
+
for files larger than the scratchpad. The engine's private method
|
|
422
|
+
`_contentHash` produces exactly the same 16-character hex
|
|
423
|
+
string that the TS version returned, so no caller changes.
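For reference, a streaming FNV-1a 64-bit hash with a begin/feed/finish shape can be sketched in TypeScript as below. This is not the removed TS implementation nor the Rust export; the class name `Fnv1a64` is hypothetical, and only the standard FNV-1a constants are taken as given.

```typescript
// Standard FNV-1a 64-bit parameters.
const FNV_OFFSET_BASIS = BigInt("0xcbf29ce484222325");
const FNV_PRIME = BigInt("0x100000001b3");
const MASK_64 = BigInt("0xffffffffffffffff");

// Hypothetical streaming hasher mirroring a begin/feed/finish protocol.
class Fnv1a64 {
  private state = FNV_OFFSET_BASIS; // "begin": start from the offset basis

  feed(bytes: Uint8Array): void {
    for (let i = 0; i < bytes.length; i++) {
      this.state ^= BigInt(bytes[i]); // xor the byte in first (the "1a" order)
      this.state = (this.state * FNV_PRIME) & MASK_64; // multiply, wrap to 64 bits
    }
  }

  finish(): string {
    // 16-character, zero-padded lowercase hex string.
    return this.state.toString(16).padStart(16, "0");
  }
}

const hasher = new Fnv1a64();
hasher.feed(Uint8Array.of(0x61)); // the single byte "a"
console.log(hasher.finish());
```

Because the state is carried between `feed` calls, hashing a file in several slices gives the same result as hashing it in one piece, which is what makes the scheme usable for inputs larger than the scratchpad.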
|
|
424
|
+
|
|
425
|
+
### Added — tests
|
|
426
|
+
|
|
427
|
+
- `tests/load-restores-docs.test.ts`: new test "a v3 restore that
|
|
428
|
+
never commits leaves the previous index intact". It explicitly
|
|
429
|
+
verifies atomicity: it truncates a snapshot's payload to
|
|
430
|
+
75 %, tries to load it, and checks that `load()` returns `false` and that
|
|
431
|
+
the previous corpus is still indexed and searchable.
|
|
432
|
+
- `tests/hash.test.ts`: rewritten to validate the WASM hash against the
|
|
433
|
+
real engine (the old version was a standalone TS re-implementation
|
|
434
|
+
comparing itself with itself). Covers shape, determinism, single-byte
|
|
435
|
+
sensitivity, FNV offset basis, streaming over 96 KB (> scratchpad).
|
|
436
|
+
- 88 tests green (was 85 in 0.3.1).
|
|
437
|
+
|
|
438
|
+
### Postponed
|
|
439
|
+
|
|
440
|
+
- Moving the tokenizer and query parser to WASM (audit "TS wrapper does
|
|
441
|
+
too much") is deferred to 0.5.0. It is an architectural improvement, not a bug
|
|
442
|
+
fix, and it carries enough design trade-offs (OR semantics,
|
|
443
|
+
phrase post-filter) that a half-baked API should not be published.
|
|
444
|
+
|
|
445
|
+
## [0.3.1] — 2026-05-30
|
|
446
|
+
|
|
447
|
+
Hardening pass after an external code audit. No new features; three
|
|
448
|
+
specific issues addressed.
|
|
449
|
+
|
|
450
|
+
### Fixed
|
|
451
|
+
|
|
452
|
+
- **Debug logs removed from the indexing hot path.** Three `console.log`
|
|
453
|
+
statements added during the OCR-worker-abort diagnostic session were
|
|
454
|
+
firing on every PDF (hybrid OCR decision) and every embedded image
|
|
455
|
+
(kind / len / magic-byte trace). They are gone; the legitimate
|
|
456
|
+
`console.warn` messages for actual failures stay.
|
|
457
|
+
|
|
458
|
+
- **`makePdfWasmImports` now fails fast on unknown imports.** Previously
|
|
459
|
+
any unrecognised import was satisfied with a `console.warn` stub,
|
|
460
|
+
which let the module instantiate and defer the real failure to an
|
|
461
|
+
arbitrary call inside `extractPdf`. The loader now throws
|
|
462
|
+
`AlbexInitError` at boot with a clear "rebuild your binary" message.
|
|
463
|
+
An unknown import means the wasm-bindgen / lopdf / getrandom graph
|
|
464
|
+
drifted from what this loader was written for; better to surface that
|
|
465
|
+
immediately than to hang or crash mid-extraction.
|
|
466
|
+
|
|
467
|
+
- **`prepublishOnly` now rebuilds every WASM artifact and runs the
|
|
468
|
+
entire test suite.** It was running only `tsc + banner.mjs`, which
|
|
469
|
+
meant the WASM binaries published to npm could be out of sync with
|
|
470
|
+
the current Rust source. The script is now `npm run build:all && npm
|
|
471
|
+
test`. Publishing takes longer, but the package is guaranteed to
|
|
472
|
+
contain binaries reproducible from the source it ships.
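
The fail-fast import check described in the `makePdfWasmImports` entry above can be sketched with the standard `WebAssembly.Module.imports()` reflection call. This is a minimal illustration under assumptions, not the loader's actual code: `resolveImports` and the local `InitError` class are hypothetical stand-ins (the package's own `AlbexInitError` constructor is not shown in this changelog).

```ts
// Hypothetical stand-in for the package's AlbexInitError.
class InitError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'InitError';
  }
}

// Build the import object for a compiled module, refusing to stub anything
// the loader does not know about. `known` maps namespace -> name -> value.
function resolveImports(
  module: WebAssembly.Module,
  known: WebAssembly.Imports,
): WebAssembly.Imports {
  const resolved: WebAssembly.Imports = {};
  for (const descriptor of WebAssembly.Module.imports(module)) {
    const namespace = known[descriptor.module];
    const value = namespace === undefined ? undefined : namespace[descriptor.name];
    if (value === undefined) {
      // Fail at boot instead of deferring the failure to a later call.
      throw new InitError(
        `unknown WASM import ${descriptor.module}.${descriptor.name}; rebuild your binary`,
      );
    }
    if (resolved[descriptor.module] === undefined) resolved[descriptor.module] = {};
    resolved[descriptor.module][descriptor.name] = value;
  }
  return resolved;
}
```

Enumerating the module's declared imports before instantiation means a drifted binary is rejected with a named import in the message, rather than instantiating against a warning stub and failing somewhere inside extraction.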
|
|
473
|
+
|
|
8
474
|
## [0.3.0] — 2026-05-30
|
|
9
475
|
|
|
10
476
|
### Hybrid PDF OCR (opt-in)
|
package/README.md
CHANGED
|
@@ -69,7 +69,7 @@ That's the entire onboarding. Read on for what else the engine can do.
|
|
|
69
69
|
- **Bundler-friendly default** — `new AlbexEngine()` works without extra
|
|
70
70
|
configuration in bundlers that recognise the `new URL(..., import.meta.url)`
|
|
71
71
|
asset pattern (see the "Install" section for the tested matrix).
|
|
72
|
-
- **Fuzzy matching** — finds `"contrato"` even if you type `"conttrato"` (Bitap with adaptive edit distance).
|
|
72
|
+
- **Fuzzy matching** — finds `"contrato"` even if you type `"conttrato"` (Bitap with adaptive edit distance). Sound under a two-stage pre-filter (character Bloom for exact tokens, a 256-bit **trigram q-gram signature** for everything) that prunes the candidate set ~10× on prose without ever dropping a real approximate match.
|
|
73
73
|
- **Accent-insensitive** — `"accion"` matches `"acción"`, `"espana"` matches `"España"`, plus Latin Extended (Polish, Czech, Slovak, Turkish…).
|
|
74
74
|
- **11 formats with varying depth** — DOCX · XLSX · PDF · HTML · MD · JSON · CSV · EML · RTF · TXT · XML. See the support table below; several formats are deliberately "lite" (CSV is RFC-4180-lite, EML is MIME-lite, RTF is regex-stripped, etc.).
|
|
75
75
|
- **Phrase + OR queries** — `"contrato marco"` and `contrato | acuerdo` work out of the box.
|
|
@@ -81,8 +81,11 @@ That's the entire onboarding. Read on for what else the engine can do.
|
|
|
81
81
|
- **WebGPU pre-filter** — experimental, opt-in (`gpu: 'auto'`). Implemented for corpora over 20 k chunks; no reproducible speedup number yet — the bench in this repo runs on a 200-document synthetic corpus only.
|
|
82
82
|
- **SIMD opportunistic** — picks a SIMD-accelerated variant when the host supports v128.
|
|
83
83
|
- **Tiered storage** — `TieredStore` keeps recent docs hot, evicts cold ones to OPFS, promotes on demand.
|
|
84
|
+
- **Runtime capacity** — one binary, pools sized at init: `capacity: 'std'` (default, 128 docs / 100k chunks / 16 MB text), `'large'` (1 024 docs / 800k chunks / 128 MB text) or a custom `{ maxDocs, maxChunks, textPoolBytes, namePoolBytes }`.
|
|
85
|
+
- **Capacity-safe** — when a pool fills (`docs`/`chunks`/`text`/`names`), `indexFile` throws `AlbexCapacityError` with `limit` (which pool) and `max` (the runtime limit) instead of silently truncating the corpus.
|
|
86
|
+
- **Re-entrancy-safe** — async operations on one engine serialize; sync `search`/`compact`/`reset` refuse to run mid-operation (`AlbexError` kind `busy`) rather than corrupting the shared WASM state. Use `searchCooperative` for overlapping search-as-you-type.
|
|
84
87
|
- **Typed errors** — `AlbexParseError`, `AlbexUnsupportedFormatError`, `AlbexCapacityError`, `AlbexInitError`. All extend `AlbexError`.
|
|
85
|
-
- **Tiny core** — main WASM
|
|
88
|
+
- **Tiny core** — main WASM ~47 KB (~19 KB gzipped); the SIMD build is ~54 KB (~21 KB gzipped). PDF module (~1.2 MB, ~510 KB gzipped) loads on demand. The OCR companion (`@albex/ocr`) is a separate package and pulls Tesseract.js (~3.5 MB) only when you call `enableOcr()`.
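
The re-entrancy rule in the list above (async operations serialize, sync calls refuse to run mid-operation) can be pictured with a small guard like the one below. This is a sketch of the general pattern only; `SerialGuard` and `BusyError` are hypothetical names, and nothing here is taken from the engine's source.

```ts
// Hypothetical error type; the README only says the real one is an
// AlbexError with kind `busy`.
class BusyError extends Error {
  readonly kind = 'busy';
  constructor(operation: string) {
    super(`${operation} called while another operation is in flight`);
    this.name = 'BusyError';
  }
}

class SerialGuard {
  private tail: Promise<unknown> = Promise.resolve();
  private inFlight = 0;

  // Queue an async task behind every task submitted before it.
  runAsync<T>(task: () => Promise<T>): Promise<T> {
    this.inFlight++;
    const result = this.tail.then(task);
    const done = () => {
      this.inFlight--;
    };
    // The tail always resolves, so one failed task does not block the queue.
    this.tail = result.then(done, done);
    return result;
  }

  // Run a sync task only when nothing async is pending; otherwise refuse.
  runSync<T>(operation: string, task: () => T): T {
    if (this.inFlight > 0) throw new BusyError(operation);
    return task();
  }
}
```

Refusing the sync call, rather than queueing it, keeps the sync API synchronous: the caller gets an immediate, typed signal and can retry or switch to a cooperative async path.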
|
|
86
89
|
|
|
87
90
|
---
|
|
88
91
|
|
|
@@ -197,7 +200,7 @@ const results = await pool.search('contrato'); // map-reduce
|
|
|
197
200
|
|
|
198
201
|
## Big corpora — tiered storage
|
|
199
202
|
|
|
200
|
-
For workloads that exceed the
|
|
203
|
+
For workloads that exceed the engine's RAM capacity:
|
|
201
204
|
|
|
202
205
|
```ts
|
|
203
206
|
import { AlbexEngine, TieredStore } from 'albex';
|
|
@@ -221,29 +224,40 @@ Hot tier = engine. Warm tier = original files in OPFS. LRU eviction is automatic
|
|
|
221
224
|
`new AlbexEngine()` covers the default case. The options below address
|
|
222
225
|
specific deployment needs:
|
|
223
226
|
|
|
224
|
-
###
|
|
227
|
+
### Capacity (runtime, single binary)
|
|
225
228
|
|
|
226
|
-
|
|
227
|
-
|
|
228
|
-
|
|
229
|
-
pass `wasmBaseUrl`:
|
|
229
|
+
Capacity is a **runtime** parameter — there is one engine binary (plus its
|
|
230
|
+
SIMD variant) and the pools are heap-allocated at `init()` to the size you
|
|
231
|
+
ask for. The old compile-time tiers (and the `tier` option) are gone:
|
|
230
232
|
|
|
231
233
|
```ts
|
|
232
234
|
const engine = new AlbexEngine({
|
|
233
|
-
|
|
234
|
-
tier: 'auto', // picks mini/std/pro by deviceMemory
|
|
235
|
+
capacity: 'large', // or 'std' (default), or a custom object
|
|
235
236
|
simd: 'auto', // picks baseline/simd by WASM probe
|
|
236
237
|
gpu: 'auto', // engages WebGPU when corpus > 20k chunks
|
|
237
238
|
});
|
|
238
239
|
```
|
|
239
240
|
|
|
240
|
-
|
|
241
|
+
Presets and cost (≈ `maxChunks × 64 B + textPool + namePool`; WASM memory
|
|
242
|
+
never shrinks, so the largest capacity initialised stays committed):
|
|
241
243
|
|
|
242
|
-
|
|
|
243
|
-
|
|
244
|
-
|
|
|
245
|
-
|
|
|
246
|
-
|
|
|
244
|
+
| Capacity | Max docs | Max chunks | Max text | Working set |
|
|
245
|
+
|-----------|---------:|-----------:|---------:|------------:|
|
|
246
|
+
| `'std'` | 128 | 100 000 | 16 MB | ~22 MB |
|
|
247
|
+
| `'large'` | 1 024 | 800 000 | 128 MB | ~180 MB |
|
|
248
|
+
| custom | ≤ 65 536 | ≤ 4 M | ≤ 1 GiB | as configured |
|
|
249
|
+
|
|
250
|
+
Custom objects may be partial — missing fields are completed from the std
|
|
251
|
+
ratios (`maxChunks = maxDocs × 782`, `textPoolBytes = maxChunks × 168 B`,
|
|
252
|
+
`namePoolBytes = maxDocs × 256 B`, with sane floors):
|
|
253
|
+
|
|
254
|
+
```ts
|
|
255
|
+
const tiny = new AlbexEngine({ capacity: { maxDocs: 16 } });
|
|
256
|
+
```
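
The cost formula and the std ratios quoted above can be checked with a few lines of arithmetic. The helper names below (`completeCapacity`, `estimateWorkingSetBytes`) are hypothetical, and the "sane floors" are left out because their values are not stated here; only the constants 782, 168 B, 256 B and 64 B come from this section.

```ts
// Illustrative arithmetic only; not the engine's own capacity code.
interface Capacity {
  maxDocs: number;
  maxChunks: number;
  textPoolBytes: number;
  namePoolBytes: number;
}

const CHUNKS_PER_DOC = 782;        // maxChunks = maxDocs x 782
const TEXT_BYTES_PER_CHUNK = 168;  // textPoolBytes = maxChunks x 168 B
const NAME_BYTES_PER_DOC = 256;    // namePoolBytes = maxDocs x 256 B
const BYTES_PER_CHUNK_SLOT = 64;   // per-chunk cost in the working-set formula

// Fill in missing fields from the std ratios (floors deliberately omitted).
function completeCapacity(partial: Partial<Capacity> & { maxDocs: number }): Capacity {
  const maxChunks = partial.maxChunks ?? partial.maxDocs * CHUNKS_PER_DOC;
  return {
    maxDocs: partial.maxDocs,
    maxChunks,
    textPoolBytes: partial.textPoolBytes ?? maxChunks * TEXT_BYTES_PER_CHUNK,
    namePoolBytes: partial.namePoolBytes ?? partial.maxDocs * NAME_BYTES_PER_DOC,
  };
}

// Approximate working set: maxChunks x 64 B + textPool + namePool.
function estimateWorkingSetBytes(capacity: Capacity): number {
  return (
    capacity.maxChunks * BYTES_PER_CHUNK_SLOT +
    capacity.textPoolBytes +
    capacity.namePoolBytes
  );
}

// The 'std' preset from the table, reading "16 MB" as 16 MiB (an assumption).
const stdPreset: Capacity = {
  maxDocs: 128,
  maxChunks: 100_000,
  textPoolBytes: 16 * 1024 * 1024,
  namePoolBytes: 128 * NAME_BYTES_PER_DOC,
};
const stdWorkingSetMiB = estimateWorkingSetBytes(stdPreset) / (1024 * 1024);
```

With those inputs `stdWorkingSetMiB` comes out a little above 22, in line with the ~22 MB working set listed for `'std'`.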
|
|
257
|
+
|
|
258
|
+
Snapshots are admitted by **content**: a snapshot saved with `'large'`
|
|
259
|
+
loads into a `'std'` engine whenever its counters fit, and fails cleanly
|
|
260
|
+
(previous index intact) when they don't.
|
|
247
261
|
|
|
248
262
|
### Custom CDN
|
|
249
263
|
|
|
@@ -297,14 +311,13 @@ PDF support requires `albex_pdf.wasm` to be served with MIME type `application/w
|
|
|
297
311
|
rustup target add wasm32-unknown-unknown
|
|
298
312
|
|
|
299
313
|
npm install
|
|
300
|
-
npm run build:all #
|
|
314
|
+
npm run build:all # main (baseline + SIMD) + PDF + TypeScript
|
|
301
315
|
```
|
|
302
316
|
|
|
303
317
|
Partial builds:
|
|
304
318
|
|
|
305
319
|
```bash
|
|
306
|
-
npm run build:wasm #
|
|
307
|
-
npm run build:wasm:tiers # all 6 variants
|
|
320
|
+
npm run build:wasm # main module (baseline + SIMD)
|
|
308
321
|
npm run build:pdf-wasm # PDF module
|
|
309
322
|
npm run build # TypeScript only
|
|
310
323
|
```
|