lemma-is 0.2.3 → 0.4.0

package/README.md CHANGED
@@ -1,393 +1,154 @@
1
1
  # lemma-is
2
2
 
3
- Icelandic lemmatization for JavaScript. Maps inflected word forms to base forms (lemmas) for search indexing and text processing.
3
+ Fast Icelandic lemmatization for JavaScript. Built for search indexing.
4
4
 
5
- ## Why?
6
-
7
- Existing Icelandic NLP tools are Python/C++:
8
-
9
- | Tool | Runtime | Standalone? | Notes |
10
- |------|---------|-------------|-------|
11
- | **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger. |
12
- | **[Nefnir](https://github.com/lexis-project/Nefnir)** | Python | ✗ | Requires POS tags from IceNLP/IceStagger (Java, unmaintained). |
13
- | **lemma-is** | TypeScript | ✓ | Node.js servers. Grammar-based disambiguation, compound splitting. |
14
-
15
- lemma-is trades parsing accuracy for JS ecosystem integration—good enough for search indexing, runs in any Node.js environment.
16
-
17
- ## Quickstart
18
-
19
- ```bash
20
- npm install lemma-is
21
- ```
22
-
23
- **Node.js**:
24
5
  ```typescript
25
- import { readFileSync } from "fs";
26
6
  import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
27
7
 
28
- // Binary data is bundled with the package
29
- const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
30
- const lemmatizer = BinaryLemmatizer.loadFromBuffer(
31
- buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
32
- );
33
-
34
- lemmatizer.lemmatize("börnin"); // → ["barn"]
35
- lemmatizer.lemmatize("fóru"); // → ["fara", "fóra"]
8
+ lemmatizer.lemmatize("börnin"); // → ["barn"]
9
+ lemmatizer.lemmatize("keypti"); // → ["kaupa"]
10
+ lemmatizer.lemmatize("hestinum"); // → ["hestur"]
36
11
 
37
- // Full pipeline for search indexing
38
- const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
39
- // → ["barn", "fara", "fóra", "í", "bíó"]
12
+ // Full pipeline for search
13
+ extractIndexableLemmas("Börnin keypti hestinn", lemmatizer);
14
+ // → ["barn", "kaupa", "hestur"]
40
15
  ```
41
16
 
42
17
  ## The Problem
43
18
 
44
- Icelandic is highly inflected. A single word appears in dozens of forms:
19
+ Icelandic is heavily inflected. A single noun like "hestur" (horse) has 16 forms:
45
20
 
46
- | Search term | Forms in documents |
47
- |-------------|-------------------|
48
- | hestur (horse) | hestinn, hestinum, hestar, hestarnir, hesta... |
49
- | barn (child) | börnin, barnið, barna, börnum... |
50
- | fara (go) | fór, fer, förum, fóru, farið... |
51
- | kona (woman) | konuna, konunni, kvenna, konum... |
21
+ ```
22
+ hestur, hest, hesti, hests, hestar, hesta, hestum, hestanna...
23
+ ```
52
24
 
53
- If you index "Börnin fóru í bíó" by splitting on whitespace, a search for "barn" finds nothing. The word "barn" never appears—only "börnin".
25
+ If a user searches "hestur" but your document contains "hestinum", they won't find it—unless you normalize both to the lemma at index time.
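To make "normalize both to the lemma at index time" concrete, here is a minimal sketch of a lemma-keyed inverted index. The names `TOY_LEMMAS`, `toyLemmatize`, `buildIndex`, and `search` are hypothetical and exist only for this illustration. The hand-written table stands in for `lemmatizer.lemmatize`, and splitting on whitespace is a simplification rather than the library's tokenizer.

```typescript
// Toy stand-in for a real lemmatizer (assumption: in practice this role
// would be played by lemmatizer.lemmatize from lemma-is).
const TOY_LEMMAS = new Map<string, string[]>([
  ["hestinum", ["hestur"]],
  ["hestur", ["hestur"]],
  ["börnin", ["barn"]],
]);

function toyLemmatize(word: string): string[] {
  const key = word.toLowerCase();
  // Unknown words fall back to themselves so they remain searchable.
  return TOY_LEMMAS.get(key) ?? [key];
}

// Inverted index: lemma -> ids of documents containing any form of it.
function buildIndex(docs: Map<number, string>): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  docs.forEach((body, id) => {
    for (const word of body.split(/\s+/).filter(Boolean)) {
      // Every candidate lemma is indexed, which favors recall.
      for (const lemma of toyLemmatize(word)) {
        const postings = index.get(lemma) ?? new Set<number>();
        postings.add(id);
        index.set(lemma, postings);
      }
    }
  });
  return index;
}

// The query goes through the same normalization as the documents.
function search(index: Map<string, Set<number>>, query: string): number[] {
  const hits = new Set<number>();
  for (const word of query.split(/\s+/).filter(Boolean)) {
    for (const lemma of toyLemmatize(word)) {
      (index.get(lemma) ?? new Set<number>()).forEach((id) => hits.add(id));
    }
  }
  return Array.from(hits).sort((a, b) => a - b);
}

const sampleDocs = new Map<number, string>([
  [1, "Ég gaf hestinum hey"],
  [2, "Börnin léku sér"],
]);
const sampleIndex = buildIndex(sampleDocs);
search(sampleIndex, "hestur"); // → [1], although document 1 only contains "hestinum"
```

The design point is that documents and queries pass through the same function, so the index never needs to know which surface form the user typed.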
54
26
 
55
- ## Solution
27
+ ## Why lemma-is?
56
28
 
57
- ```typescript
58
- lemmatizer.lemmatize("börnin"); // → ["barn"]
59
- lemmatizer.lemmatize("fóru"); // → ["fara"]
60
- lemmatizer.lemmatize("kvenna"); // → ["kona"]
61
- lemmatizer.lemmatize("hestinum"); // → ["hestur"]
62
- ```
29
+ The gold standard for Icelandic NLP is [GreynirEngine](https://github.com/mideind/GreynirEngine)—a full grammatical parser with excellent accuracy. But it's Python-only, which means you can't run it in Node.js, browsers, or edge runtimes without FFI or a sidecar process.
63
30
 
64
- Now searches for "barn", "fara", or "hestur" match documents containing any of their forms.
31
+ lemma-is trades parsing accuracy for JavaScript portability. It's a lookup table with shallow grammar rules—good enough for search indexing, runs anywhere Node.js runs.
65
32
 
66
- ## Handling Ambiguity
33
+ | | lemma-is | GreynirEngine |
34
+ |---|---|---|
35
+ | **Runtime** | Node, Bun, Deno | Python |
36
+ | **Throughput** | ~250K words/sec | ~25K words/sec |
37
+ | **Cold start** | ~35 ms | ~500 ms |
38
+ | **Memory** | ~185 MB | ~200 MB |
39
+ | **Disambiguation** | Bigrams + grammar rules | Full sentence parsing |
40
+ | **Use case** | Search indexing | NLP analysis |
67
41
 
68
- Many Icelandic words map to multiple lemmas:
42
+ See [BENCHMARKS.md](./BENCHMARKS.md) for methodology and detailed results.
43
+
44
+ ### The Trade-off
45
+
46
+ lemma-is returns **all possible lemmas** for ambiguous words:
69
47
 
70
48
  ```typescript
71
49
  lemmatizer.lemmatize("á");
72
50
  // → ["á", "eiga"]
73
- // "á" = preposition "on" / noun "river"
74
- // "á" = verb "owns" (from "eiga")
75
-
76
- lemmatizer.lemmatize("við");
77
- // → ["við", "ég", "viður"]
78
- // "við" = preposition "by/at"
79
- // "við" = pronoun "we" (from "ég")
80
- // "við" = noun "wood" (from "viður")
51
+ // Could be preposition "on", noun "river", or verb "owns"
81
52
  ```
82
53
 
83
- ### Grammar Rules (Case Government)
54
+ GreynirEngine parses the sentence to return the single correct interpretation. For search, returning all candidates is often better—you'd rather show an extra result than miss a relevant document.
55
+
56
+ ## Installation
84
57
 
85
- The library uses shallow grammar rules based on Icelandic case government to disambiguate prepositions:
58
+ ```bash
59
+ npm install lemma-is
60
+ ```
61
+
62
+ ## Quick Start
86
63
 
87
64
  ```typescript
88
- import { Disambiguator } from "lemma-is";
65
+ import { readFileSync } from "fs";
66
+ import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
89
67
 
90
- // lemmatizer loaded as shown in Quickstart
91
- const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
68
+ // Load the core binary (~9-11 MB, low memory, best for browser/edge)
69
+ const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.core.bin");
70
+ const lemmatizer = BinaryLemmatizer.loadFromBuffer(
71
+ buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
72
+ );
92
73
 
93
- // "á borðinu" - borðinu is dative (þgf), á governs dative → preposition
94
- disambiguator.disambiguate("á", null, "borðinu");
95
- // → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
74
+ // Basic lemmatization
75
+ lemmatizer.lemmatize("börnin"); // → ["barn"]
76
+ lemmatizer.lemmatize("fóru"); // → ["fara", "fóra"]
96
77
 
97
- // "ég á" - pronoun before á → likely verb "eiga"
98
- disambiguator.disambiguate("á", "ég", null);
99
- // → { lemma: "eiga", pos: "so", resolvedBy: "preference_rules" }
78
+ // Full pipeline for search indexing
79
+ const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
80
+ // → ["barn", "fara", "fóra", "í", "bíó"]
100
81
  ```
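A note on the `buffer.buffer.slice(...)` call in the Quick Start: a Node.js `Buffer` is a `Uint8Array` view, and a view can cover only part of a larger underlying `ArrayBuffer`. Slicing by `byteOffset` and `byteLength` copies out exactly the bytes the view covers. The sketch below shows the same idea on a plain `Uint8Array`; `toExactArrayBuffer` is a hypothetical helper name, not part of lemma-is.

```typescript
// Copy exactly the bytes a typed-array view covers into a standalone buffer.
// Assumption: the consumer expects a buffer whose byte 0 is the first byte
// of the data, as in the Quick Start's loadFromBuffer call.
function toExactArrayBuffer(view: Uint8Array) {
  return view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
}

// A view over the middle of a larger buffer, similar to a pooled Buffer.
const backingBytes = new Uint8Array([0, 1, 2, 3, 4, 5, 6, 7]);
const middleView = backingBytes.subarray(2, 5);

const exactBytes = toExactArrayBuffer(middleView);
// middleView.buffer is still the full 8-byte backing store,
// while exactBytes holds only the 3 bytes the view covers.
```

Passing `middleView.buffer` directly would hand over all 8 bytes, including data that is not part of the view.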
101
82
 
83
+ ## Features
84
+
102
85
  ### Morphological Features
103
86
 
104
87
  The binary includes case, gender, and number for each word form:
105
88
 
106
89
  ```typescript
107
90
  lemmatizer.lemmatizeWithMorph("hestinum");
108
- // → [{
109
- // lemma: "hestur",
110
- // pos: "no",
111
- // morph: { case: "þgf", gender: "kk", number: "et" }
112
- // }]
113
- // "hestinum" = dative, masculine, singular
114
-
115
- lemmatizer.lemmatizeWithMorph("börnum");
116
- // → [{
117
- // lemma: "barn",
118
- // pos: "no",
119
- // morph: { case: "þgf", gender: "hk", number: "ft" }
120
- // }]
121
- // "börnum" = dative, neuter, plural
91
+ // → [{ lemma: "hestur", pos: "no", morph: { case: "þgf", gender: "kk", number: "et" } }]
92
+ // dative, masculine, singular
122
93
  ```
123
94
 
124
- | Code | Meaning |
125
- |------|---------|
126
- | `nf` | nominative (nefnifall) |
127
- | `þf` | accusative (þolfall) |
128
- | `þgf` | dative (þágufall) |
129
- | `ef` | genitive (eignarfall) |
130
- | `kk` | masculine (karlkyn) |
131
- | `kvk` | feminine (kvenkyn) |
132
- | `hk` | neuter (hvorugkyn) |
133
- | `et` | singular (eintala) |
134
- | `ft` | plural (fleirtala) |
135
-
136
- ### Bigram Disambiguation
95
+ ### Grammar-Based Disambiguation
137
96
 
138
- Use corpus frequencies to pick the most likely lemma based on context:
97
+ Shallow grammar rules use Icelandic case government to disambiguate prepositions:
139
98
 
140
99
  ```typescript
141
- import { processText } from "lemma-is";
100
+ import { Disambiguator } from "lemma-is";
142
101
 
143
- // BinaryLemmatizer has built-in bigram frequencies for disambiguation
144
- // "við erum" = "we are" → bigrams favor pronoun "ég" over preposition
145
- const processed = processText("Við erum hér", lemmatizer, { bigrams: lemmatizer });
146
- // → disambiguated: "ég" for "við" (high confidence)
102
+ const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
147
103
 
148
- // "á morgun" = "tomorrow" → bigrams favor preposition
149
- const processed2 = processText("Ég fer á morgun", lemmatizer, { bigrams: lemmatizer });
150
- // → disambiguated: "á" for "á" (not "eiga")
104
+ // "á borðinu" - borðinu is dative, á governs dative → preposition
105
+ disambiguator.disambiguate("á", null, "borðinu");
106
+ // → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
151
107
  ```
152
108
 
153
- For search indexing, ambiguity is often acceptable—indexing all candidate lemmas improves recall.
109
+ ### Compound Splitting
154
110
 
155
- ## Compound Word Splitting
156
-
157
- Icelandic forms long compounds. The library splits them for better search coverage:
111
+ Icelandic forms long compounds. Split them for better search coverage:
158
112
 
159
113
  ```typescript
160
114
  import { CompoundSplitter, createKnownLemmaSet } from "lemma-is";
161
115
 
116
+ const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
162
117
  const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
163
118
 
164
- splitter.split("bílstjóri");
165
- // → { isCompound: true, parts: ["bíll", "stjóri"] }
166
- // "car driver" = "car" + "driver"
167
-
168
119
  splitter.split("landbúnaðarráðherra");
169
120
  // → { isCompound: true, parts: ["landbúnaður", "ráðherra"] }
170
- // "agriculture minister" = "agriculture" + "minister"
171
-
172
- splitter.split("húsnæðislán");
173
- // → { isCompound: true, parts: ["húsnæði", "lán"] }
174
- // "housing loan" = "housing" + "loan"
175
- ```
176
-
177
- ### Indexing Compounds
178
-
179
- `getAllLemmas` returns the original word plus its parts—maximizing search recall:
180
-
181
- ```typescript
182
- splitter.getAllLemmas("bílstjóri");
183
- // → ["bílstjóri", "bíll", "stjóri"]
184
- ```
185
-
186
- A document mentioning "bílstjóri" is now findable by searching for "bíll" (car).
187
-
188
- ## Full Pipeline
189
-
190
- For production indexing, combine tokenization, lemmatization, disambiguation, and compound splitting.
191
-
192
- ### What Gets Indexed
193
-
194
- Here's a real example showing exactly what lemmas are extracted:
195
-
196
- ```typescript
197
- const text = "Ríkissjóður stendur í blóma ef 27 milljarða arðgreiðsla Íslandsbanka er talin með.";
198
-
199
- const lemmas = extractIndexableLemmas(text, lemmatizer, {
200
- bigrams: lemmatizer,
201
- compoundSplitter: splitter,
202
- removeStopwords: true,
203
- });
204
-
205
- // Indexed lemmas:
206
- // ✓ ríkissjóður, ríki, sjóður — compound + parts
207
- // ✓ standa — stendur → standa
208
- // ✓ blómi — í blóma → blómi
209
- // ✓ milljarður — milljarða → milljarður
210
- // ✓ arðgreiðsla, arður, greiðsla — compound + parts
211
- // ✓ íslandsbanki — proper noun (lowercased)
212
- // ✓ telja — talin → telja
213
- //
214
- // NOT indexed (stopwords removed):
215
- // ✗ í, ef, er, með
216
- ```
217
-
218
- A search for "sjóður" or "arður" now finds this document about the state treasury and bank dividends.
219
-
220
- ### Another Example: Job Posting
221
-
222
- ```typescript
223
- const posting = "Við leitum að reyndum kennurum til starfa í Reykjavík.";
224
-
225
- const lemmas = extractIndexableLemmas(posting, lemmatizer, {
226
- bigrams: lemmatizer,
227
- removeStopwords: true,
228
- });
229
-
230
- // Indexed:
231
- // ✓ leita, leit — leitum → leita (+ noun variant)
232
- // ✓ reyndur, reynd — reyndum → reyndur
233
- // ✓ kennari — kennurum → kennari
234
- // ✓ starf, starfa — starfa (noun + verb)
235
- // ✓ reykjavík — place name (lowercased)
236
- //
237
- // NOT indexed:
238
- // ✗ við, að, til, í — stopwords
121
+ // "agriculture minister"
239
122
  ```
240
123
 
241
- A search for "kennari" finds this job posting even though the word "kennari" never appears—only "kennurum" (dative plural).
124
+ ### Full Pipeline
242
125
 
243
- ### Complex Sentence
126
+ For production indexing, combine everything:
244
127
 
245
128
  ```typescript
246
- const text = "Löngu áður en Jón borðaði ísinn sem hafði bráðnað hratt " +
247
- "fór ég á veitingastaðinn og keypti mér rauðvín með hamborgaranum.";
248
-
249
- const lemmas = extractIndexableLemmas(text, lemmatizer, {
250
- bigrams: lemmatizer,
251
- compoundSplitter: splitter,
252
- removeStopwords: true,
253
- });
254
-
255
- // Verbs (various tenses/persons):
256
- // ✓ borða — borðaði (past)
257
- // ✓ bráðna — bráðnað (past participle)
258
- // ✓ fara — fór (past, different stem!)
259
- // ✓ kaupa — keypti (past)
260
- //
261
- // Nouns with articles:
262
- // ✓ ís — ísinn (NOT "Ísland"!)
263
- // ✓ veitingastaður, veiting, staður — compound
264
- // ✓ rauðvín
265
- // ✓ hamborgari — hamborgaranum (dative + article)
266
- ```
129
+ import { extractIndexableLemmas, CompoundSplitter, createKnownLemmaSet } from "lemma-is";
267
130
 
268
- ### Setup
269
-
270
- ```typescript
271
- import { readFileSync } from "fs";
272
- import {
273
- BinaryLemmatizer,
274
- extractIndexableLemmas,
275
- CompoundSplitter,
276
- createKnownLemmaSet
277
- } from "lemma-is";
278
-
279
- const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
280
- const lemmatizer = BinaryLemmatizer.loadFromBuffer(
281
- buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
282
- );
283
131
  const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
284
132
  const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
285
- ```
286
-
287
- ### Search-Optimized Defaults
288
-
289
- The defaults favor **recall over precision**—better for search where missing results is worse than extra results:
290
-
291
- ```typescript
292
- const lemmas = extractIndexableLemmas(text, lemmatizer, {
293
- bigrams: lemmatizer,
294
- compoundSplitter: splitter,
295
- // These are the defaults:
296
- // indexAllCandidates: true — indexes ALL lemma candidates
297
- // alwaysTryCompounds: true — splits compounds even if known in BÍN
298
- });
299
- ```
300
-
301
- With these defaults:
302
- - `"á"` → indexes both `"á"` (preposition) AND `"eiga"` (verb)
303
- - `"húsnæðislán"` → indexes `"húsnæðislán"`, `"húsnæði"`, AND `"lán"`
304
-
305
- ### Precision Mode
306
133
 
307
- If you need only the most likely lemma (chatbots, translation), disable the search optimizations:
134
+ const text = "Ríkissjóður stendur í blóma ef milljarða arðgreiðsla er talin með.";
308
135
 
309
- ```typescript
310
136
  const lemmas = extractIndexableLemmas(text, lemmatizer, {
311
137
  bigrams: lemmatizer,
312
138
  compoundSplitter: splitter,
313
- indexAllCandidates: false, // only disambiguated lemma
314
- alwaysTryCompounds: false, // only split unknown words
139
+ removeStopwords: true,
315
140
  });
316
- ```
317
-
318
- ## Word Classes
319
-
320
- Filter by part of speech when context is known:
321
-
322
- ```typescript
323
- lemmatizer.lemmatize("á", { wordClass: "so" }); // → ["eiga"] (verbs only)
324
- lemmatizer.lemmatize("á", { wordClass: "fs" }); // → ["á"] (prepositions only)
325
-
326
- lemmatizer.lemmatizeWithPOS("á");
327
- // → [
328
- // { lemma: "á", pos: "fs" }, // preposition
329
- // { lemma: "á", pos: "no" }, // noun (river)
330
- // { lemma: "eiga", pos: "so" } // verb
331
- // ]
332
- ```
333
-
334
- | Code | Icelandic | English |
335
- |------|-----------|---------|
336
- | `no` | nafnorð | noun |
337
- | `so` | sagnorð | verb |
338
- | `lo` | lýsingarorð | adjective |
339
- | `ao` | atviksorð | adverb |
340
- | `fs` | forsetning | preposition |
341
- | `fn` | fornafn | pronoun |
342
-
343
- ## Data
344
-
345
- Single binary file: `lemma-is.bin` (~91 MB)
346
-
347
- Contains:
348
- - 289K lemmas from BÍN
349
- - 3M word form mappings
350
- - 414K bigram frequencies
351
- - Morphological features (case, gender, number) per word form
352
-
353
- Uses ArrayBuffer with binary search for efficient memory usage. Format version 2 includes packed morphological data.
354
-
355
- ### Building Data
356
-
357
- ```bash
358
- # Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
359
- # Extract SHsnid.csv to data/
360
141
 
361
- uv run python scripts/build-binary.py # builds lemma-is.bin with morph features
142
+ // Indexed: ríkissjóður, ríki, sjóður, standa, blómi, milljarður,
143
+ // arðgreiðsla, arður, greiðsla, telja
144
+ // Stopwords removed: í, ef, er, með
362
145
  ```
363
146
 
364
- ## Node.js Usage
365
-
366
- ```typescript
367
- import { readFileSync } from "fs";
368
- import { BinaryLemmatizer } from "lemma-is";
369
-
370
- const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
371
- const lemmatizer = BinaryLemmatizer.loadFromBuffer(
372
- buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
373
- );
374
- ```
147
+ A search for "sjóður" or "arður" now finds this document.
375
148
 
376
149
  ## PostgreSQL Full-Text Search
377
150
 
378
- PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process text, then store lemmas in a `tsvector` column with the `simple` configuration.
379
-
380
- ```sql
381
- CREATE TABLE documents (
382
- id SERIAL PRIMARY KEY,
383
- title TEXT,
384
- body TEXT,
385
- search_vector TSVECTOR
386
- );
387
- CREATE INDEX documents_search_idx ON documents USING GIN (search_vector);
388
- ```
389
-
390
- Lemmatize in your app, store as space-separated string:
151
+ PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process:
391
152
 
392
153
  ```typescript
393
154
  const lemmas = extractIndexableLemmas(text, lemmatizer, { removeStopwords: true });
@@ -399,160 +160,102 @@ await db.query(
399
160
  );
400
161
  ```
401
162
 
402
- Query by lemmatizing search terms the same way:
403
-
404
- ```typescript
405
- const lemmas = extractIndexableLemmas(query, lemmatizer);
406
-
407
- const results = await db.query(
408
- `SELECT *, ts_rank(search_vector, q) AS rank
409
- FROM documents, plainto_tsquery('simple', $1) q
410
- WHERE search_vector @@ q
411
- ORDER BY rank DESC`,
412
- [Array.from(lemmas).join(" ")]
413
- );
414
-
415
- // User searches "börnum" → lemmatized to "barn" → matches all forms
416
- ```
417
-
418
- **Why `simple`?** It lowercases but doesn't stem—our lemmas are already normalized. Use `setweight()` to boost title matches over body.
163
+ Use the `simple` configuration—it lowercases but doesn't stem, since our lemmas are already normalized.
419
164
 
420
- **Diacritics:** PostgreSQL's `unaccent` extension strips accents, but **don't use it for Icelandic**. Characters like á, ö, þ, ð are distinct letters, not accented variants. "á" (river/on/owns) ≠ "a". Preserve diacritics for correct matching.
165
+ **Important:** Don't use PostgreSQL's `unaccent` extension for Icelandic. Characters like á, ö, þ, ð are distinct letters, not accented variants.
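The effect of accent stripping can be shown with a small JavaScript analogue. This is not PostgreSQL's `unaccent` itself, only an illustration of the same kind of folding, and `stripAccents` is a hypothetical helper. It decomposes characters (NFD) and drops the combining marks.

```typescript
// Remove combining diacritical marks after canonical decomposition.
// Illustration only: this mimics accent folding in general, not the
// exact rules PostgreSQL's unaccent extension applies.
function stripAccents(input: string): string {
  return input.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

stripAccents("vín"); // → "vin"
stripAccents("á");   // → "a"
```

After folding, "vín" (wine) becomes indistinguishable from "vin" (accusative of "vinur", friend), so a search for one would match documents about the other.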
421
166
 
422
167
  ## Limitations
423
168
 
424
- This library makes tradeoffs for portability. Know what you're getting.
169
+ This is an early effort with known limitations.
425
170
 
426
171
  ### File Size
427
172
 
428
- The binary is **~91 MB**. This library targets Node.js server environments where the data is loaded once at startup.
173
+ There are two binaries:
429
174
 
430
- Not recommended for:
431
- - **Serverless/edge** — cold start latency loading 91 MB
432
- - **Browser/Web Workers** — download size prohibitive for most users
433
- - **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
175
+ - **Core (~9-11 MB)**: default, optimized for browser/edge/cold start
176
+ - **Full (91 MB)**: maximum coverage and disambiguation
434
177
 
435
- For browser applications, run lemmatization server-side and expose an API endpoint.
178
+ The full binary targets Node.js servers where data loads once at startup. Not recommended for:
436
179
 
437
- ### No Query Expansion
180
+ - **Serverless/edge** — cold start loading 91 MB may be slow
181
+ - **Browser** — download size prohibitive
182
+ - **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
438
183
 
439
- You can go **word → lemma** but not **lemma → words**:
184
+ For browser apps, use the **core** binary.
440
185
 
441
- ```typescript
442
- lemmatizer.lemmatize("hestinum"); // → ["hestur"] ✓
186
+ To use the full binary, build it locally:
443
187
 
444
- // But you CANNOT do:
445
- lemmatizer.expand("hestur");
446
- // → ["hestur", "hest", "hesti", "hests", "hestinn", "hestinum", ...] ✗
188
+ ```bash
189
+ pnpm build:binary
447
190
  ```
448
191
 
449
- This matters for **search result highlighting**. If a user searches "hestur" and the document contains "hestinum", you can't easily highlight the match without the reverse mapping.
450
-
451
- **Workaround:** Store original text alongside lemmas, use regex patterns for common suffixes.
192
+ Then load it from `data-dist/lemma-is.bin`.
452
193
 
453
- ### Disambiguation Limits
194
+ ### Compact Builds (Browser/Edge)
454
195
 
455
- Bigram disambiguation only works when the word pair exists in the corpus data:
196
+ For cold-start runtimes and the browser, you can build a **compact core** binary that trades accuracy for size by:
197
+ - Keeping only the most frequent word forms
198
+ - Dropping bigram data and morphological features
456
199
 
457
- ```typescript
458
- // Common phrase: bigrams help
459
- processText("við erum", lemmatizer, { bigrams: lemmatizer });
460
- // → "við" disambiguated to "ég" (we) with high confidence
200
+ This reduces memory significantly at the cost of recall/precision on rare words.
461
201
 
462
- // Rare/unusual phrase: no bigram data
463
- processText("við flæktumst", lemmatizer, { bigrams: lemmatizer });
464
- // → "við" picks first candidate, low confidence
202
+ ```bash
203
+ pnpm build:core
465
204
  ```
466
205
 
467
- Without context, ambiguous words fall back to arbitrary ordering:
468
-
469
- ```typescript
470
- // Single word, no context
471
- lemmatizer.lemmatize("á");
472
- // → ["á", "eiga"] — but which is more likely? No way to know.
206
+ The output is written to `data-dist/lemma-is.core.bin`. Use it exactly like the full binary; it just covers fewer word forms.
473
207
 
474
- // The preposition "á" is ~100x more common than verb "eiga" in this form,
475
- // but we don't have unigram frequencies to use as tiebreaker.
476
- ```
477
-
478
- **For search indexing:** Use `indexAllCandidates: true` to index all lemmas and let ranking sort out relevance. For applications needing precision (chatbots, translation), use GreynirEngine instead.
208
+ ### Not a Parser
479
209
 
480
- ### Compound Splitting Heuristics
210
+ This is a lookup table with shallow grammar rules, not a grammatical parser. It doesn't understand sentence structure, named entities, or semantic meaning. The grammar rules help with common patterns but can't handle all disambiguation.
481
211
 
482
- The splitter uses simple rules that miss edge cases:
212
+ For applications needing full grammatical analysis, use [GreynirEngine](https://github.com/mideind/GreynirEngine).
483
213
 
484
- **Three-part compounds only split once:**
485
- ```typescript
486
- splitter.split("þjóðmálaráðherra");
487
- // → ["þjóðmál", "ráðherra"] — missing "þjóð" as separate part
488
- // Ideal: ["þjóð", "mál", "ráðherra"]
489
- ```
214
+ ### Disambiguation Limits
490
215
 
491
- **Inflected first parts may not match:**
492
- ```typescript
493
- splitter.split("húseignir");
494
- // → { isCompound: false } — "hús" appears as "hús" not "húsa"
495
- // The compound IS "hús" + "eignir" but heuristics miss it
496
- ```
216
+ Bigram disambiguation only works when the word pair exists in corpus data. Without context, ambiguous words return all candidates:
497
217
 
498
- **May over-split valid words:**
499
218
  ```typescript
500
- splitter.split("landsins");
501
- // This is NOT a compound: it's "land" + genitive suffix "-sins"
502
- // Correctly returns { isCompound: false }, but edge cases exist
219
+ lemmatizer.lemmatize("á");
220
+ // → ["á", "eiga"], with no way to know which is more likely
503
221
  ```
504
222
 
505
- **Mitigations:**
506
- - Use `alwaysTryCompounds: true` to split even known words
507
- - Use `minPartLength: 2` in CompoundSplitter for more aggressive splitting
508
- - Over-indexing is usually better than under-indexing for search
509
-
510
- ### Not a Parser
511
-
512
- This is a lookup table with shallow grammar rules, not a full grammatical parser. It doesn't understand:
223
+ For search indexing, use `indexAllCandidates: true` (the default) to index all lemmas.
513
224
 
514
- - Full sentence structure or syntax trees
515
- - Complex verb argument frames
516
- - Named entity recognition (people, places, companies)
517
- - Semantic meaning or word sense
225
+ ### No Query Expansion
518
226
 
519
- The grammar rules help with common patterns (preposition + case, pronoun + verb) but can't handle all disambiguation cases. For applications needing full grammatical analysis, use [GreynirEngine](https://github.com/mideind/GreynirEngine). lemma-is is for search indexing where "good enough" recall beats perfect precision.
227
+ You can go word → lemma but not lemma → words. This affects search result highlighting: if a user searches "hestur" and the document contains "hestinum", you can't easily highlight the match.
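One possible way around the missing reverse mapping is to record, at index time, where each token sits in the original text together with its lemmas, and to highlight by lemma overlap at query time. The sketch below assumes that approach. `TokenSpan`, `lookupLemmas`, `tokenSpans`, and `highlight` are hypothetical names, and the small table stands in for `lemmatizer.lemmatize`.

```typescript
interface TokenSpan {
  start: number;    // offset of the token in the original text
  end: number;      // offset just past the token
  lemmas: string[]; // all candidate lemmas for the token
}

// Toy lookup table (assumption: replaced by a real lemmatizer in practice).
const SPAN_LEMMAS = new Map<string, string[]>([
  ["hestinum", ["hestur"]],
  ["hestur", ["hestur"]],
]);

function lookupLemmas(word: string): string[] {
  const key = word.toLowerCase();
  return SPAN_LEMMAS.get(key) ?? [key];
}

// Index time: remember each token's position and lemmas.
function tokenSpans(source: string): TokenSpan[] {
  const spans: TokenSpan[] = [];
  const wordPattern = /\S+/g;
  let found: RegExpExecArray | null;
  while ((found = wordPattern.exec(source)) !== null) {
    spans.push({
      start: found.index,
      end: found.index + found[0].length,
      lemmas: lookupLemmas(found[0]),
    });
  }
  return spans;
}

// Query time: wrap every token that shares a lemma with the query word.
function highlight(source: string, spans: TokenSpan[], query: string): string {
  const wanted = new Set(lookupLemmas(query));
  let out = "";
  let cursor = 0;
  for (const span of spans) {
    if (span.lemmas.some((lemma) => wanted.has(lemma))) {
      out += source.slice(cursor, span.start);
      out += "<mark>" + source.slice(span.start, span.end) + "</mark>";
      cursor = span.end;
    }
  }
  return out + source.slice(cursor);
}

const sampleSentence = "Ég gaf hestinum hey";
highlight(sampleSentence, tokenSpans(sampleSentence), "hestur");
// → "Ég gaf <mark>hestinum</mark> hey"
```

The cost is extra storage per document for the spans, which may or may not be acceptable depending on corpus size.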
520
228
 
521
- ## Development
229
+ ## Data
522
230
 
523
- ### Testing
231
+ Single binary file containing:
232
+ - 289K lemmas from BÍN
233
+ - 3M word form mappings
234
+ - 414K bigram frequencies
235
+ - Morphological features per word form
524
236
 
525
- Tests use [Vitest](https://vitest.dev/):
237
+ ### Building Data
526
238
 
527
239
  ```bash
528
- pnpm test # run all tests
529
- pnpm test:watch # watch mode
530
- npx vitest run --update # update snapshots
531
- ```
240
+ # Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
241
+ # Extract SHsnid.csv to data/
532
242
 
533
- Test files:
534
- - `binary-lemmatizer.test.ts` — Core lemmatization and bigram lookup
535
- - `compounds.test.ts` — Compound word splitting
536
- - `integration.test.ts` — Full pipeline, search indexing options
537
- - `pipeline-greynir.test.ts` — Full pipeline with Greynir test sentences
538
- - `benchmark.test.ts` — Performance and metrics snapshots
539
- - `icelandic-tricky.test.ts` — Edge cases, morphology examples
540
- - `limitations.test.ts` — Documented limitations and research notes
541
- - `mini-grammar.test.ts` — Grammar rules and case government
243
+ uv run python scripts/build-binary.py
244
+ ```
542
245
 
543
- ### Building
246
+ ## Development
544
247
 
545
248
  ```bash
249
+ pnpm test # run tests
546
250
  pnpm build # build dist/
547
- pnpm typecheck # type check without emitting
548
- pnpm build:data # rebuild binary from BÍN source
251
+ pnpm typecheck # type check
549
252
  ```
550
253
 
551
254
  ## Acknowledgments
552
255
 
553
- - **[BÍN](https://bin.arnastofnun.is/)** Morphological database from the Árni Magnússon Institute
554
- - **[Miðeind](https://mideind.is/)** Greynir and foundational Icelandic NLP work
555
- - **[tokenize-is](https://github.com/axelharri/tokenize-is)** Icelandic tokenizer
256
+ - **[BÍN](https://bin.arnastofnun.is/)** Morphological database from the Árni Magnússon Institute
257
+ - **[Miðeind](https://mideind.is/)** GreynirEngine and foundational Icelandic NLP work
258
+ - **[tokenize-is](https://github.com/axelharri/tokenize-is)** Icelandic tokenizer
556
259
 
557
260
  ## License
558
261
 
@@ -560,10 +263,10 @@ MIT for the code.
560
263
 
561
264
  ### Data License (BÍN)
562
265
 
563
- The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) (Beygingarlýsing íslensks nútímamáls) © Árni Magnússon Institute for Icelandic Studies.
266
+ The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) © Árni Magnússon Institute for Icelandic Studies.
564
267
 
565
268
  **By using this package, you agree to BÍN's conditions:**
566
- - Credit the Árni Magnússon Institute in your product's UI
269
+ - Credit the Árni Magnússon Institute in your product
567
270
  - Do not redistribute the raw data separately
568
271
  - Do not publish inflection paradigms without permission
569
272