lemma-is 0.2.2 → 0.3.0

package/README.md CHANGED
@@ -1,392 +1,154 @@
1
1
  # lemma-is
2
2
 
3
- Icelandic lemmatization for JavaScript. Maps inflected word forms to base forms (lemmas) for search indexing and text processing.
3
+ Fast Icelandic lemmatization for JavaScript. Built for search indexing.
4
4
 
5
- ## Why?
6
-
7
- Existing Icelandic NLP tools are Python/C++:
8
-
9
- | Tool | Runtime | Standalone? | Notes |
10
- |------|---------|-------------|-------|
11
- | **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger. |
12
- | **[Nefnir](https://github.com/lexis-project/Nefnir)** | Python | ✗ | Requires POS tags from IceNLP/IceStagger (Java, unmaintained). |
13
- | **lemma-is** | TypeScript | ✓ | Node.js servers. Grammar-based disambiguation, compound splitting. |
14
-
15
- lemma-is trades parsing accuracy for JS ecosystem integration—good enough for search indexing, runs in any Node.js environment.
16
-
17
- ## Quickstart
18
-
19
- ```bash
20
- npm install lemma-is
21
- ```
22
-
23
- **Node.js**:
24
5
  ```typescript
25
- import { readFileSync } from "fs";
26
6
  import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
27
7
 
28
- const buffer = readFileSync("./lemma-is.bin");
29
- const lemmatizer = BinaryLemmatizer.loadFromBuffer(buffer.buffer.slice(
30
- buffer.byteOffset, buffer.byteOffset + buffer.byteLength
31
- ));
8
+ lemmatizer.lemmatize("börnin"); // → ["barn"]
9
+ lemmatizer.lemmatize("keypti"); // → ["kaupa"]
10
+ lemmatizer.lemmatize("hestinum"); // → ["hestur"]
32
11
 
33
- app.post("/lemmatize", (req, res) => {
34
- const lemmas = extractIndexableLemmas(req.body.text, lemmatizer);
35
- res.json({ lemmas: [...lemmas] });
36
- });
12
+ // Full pipeline for search
13
+ extractIndexableLemmas("Börnin keypti hestinn", lemmatizer);
14
+ // → ["barn", "kaupa", "hestur"]
37
15
  ```
38
16
 
39
17
  ## The Problem
40
18
 
41
- Icelandic is highly inflected. A single word appears in dozens of forms:
19
+ Icelandic is heavily inflected. A single noun like "hestur" (horse) has 16 forms:
42
20
 
43
- | Search term | Forms in documents |
44
- |-------------|-------------------|
45
- | hestur (horse) | hestinn, hestinum, hestar, hestarnir, hesta... |
46
- | barn (child) | börnin, barnið, barna, börnum... |
47
- | fara (go) | fór, fer, förum, fóru, farið... |
48
- | kona (woman) | konuna, konunni, kvenna, konum... |
21
+ ```
22
+ hestur, hest, hesti, hests, hestar, hesta, hestum, hestanna...
23
+ ```
49
24
 
50
- If you index "Börnin fóru í bíó" by splitting on whitespace, a search for "barn" finds nothing. The word "barn" never appears—only "börnin".
25
+ If a user searches "hestur" but your document contains "hestinum", they won't find it—unless you normalize both to the lemma at index time.
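To illustrate the point about normalizing at index time, here is a minimal sketch of an inverted index built twice: once on raw tokens and once on lemmas. `lemmatizeStub`, `buildIndex`, and `searchIndex` are hypothetical names, and the stub only stands in for `lemmatizer.lemmatize` so the example runs without the 91 MB binary.

```typescript
// Hypothetical stand-in for lemmatizer.lemmatize; unknown words map to themselves.
const exampleLemmas: Record<string, string[]> = {
  hestinum: ["hestur"],
  hestur: ["hestur"],
};

function lemmatizeStub(word: string): string[] {
  return exampleLemmas[word] ?? [word];
}

type Normalize = (word: string) => string[];

// Map each index key to the ids of the documents that contain it.
function buildIndex(docs: Map<number, string>, normalize: Normalize): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  docs.forEach((text, id) => {
    const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
    for (const token of tokens) {
      for (const key of normalize(token)) {
        const ids = index.get(key) ?? new Set<number>();
        ids.add(id);
        index.set(key, ids);
      }
    }
  });
  return index;
}

// Normalize the query the same way the documents were normalized.
function searchIndex(index: Map<string, Set<number>>, query: string, normalize: Normalize): number[] {
  const hits = new Set<number>();
  for (const key of normalize(query.toLowerCase())) {
    index.get(key)?.forEach((id) => hits.add(id));
  }
  return Array.from(hits);
}

const docs = new Map<number, string>([[1, "á hestinum"]]);
const identity: Normalize = (w) => [w];

const rawHits = searchIndex(buildIndex(docs, identity), "hestur", identity);
const lemmaHits = searchIndex(buildIndex(docs, lemmatizeStub), "hestur", lemmatizeStub);

console.log(rawHits);   // []  (raw tokens never contain "hestur")
console.log(lemmaHits); // [1] (both sides were reduced to the lemma)
```

Because every candidate lemma becomes an index key, an ambiguous word simply contributes more than one key, which matches the recall-first approach described in the README.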
51
26
 
52
- ## Solution
27
+ ## Why lemma-is?
53
28
 
54
- ```typescript
55
- import { BinaryLemmatizer } from "lemma-is";
29
+ The gold standard for Icelandic NLP is [GreynirEngine](https://github.com/mideind/GreynirEngine)—a full grammatical parser with excellent accuracy. But it's Python-only, which means you can't run it in Node.js, browsers, or edge runtimes without FFI or a sidecar process.
56
30
 
57
- const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
31
+ lemma-is trades parsing accuracy for JavaScript portability. It's a lookup table with shallow grammar rules—good enough for search indexing, runs anywhere Node.js runs.
58
32
 
59
- lemmatizer.lemmatize("börnin"); // → ["barn"]
60
- lemmatizer.lemmatize("fóru"); // → ["fara"]
61
- lemmatizer.lemmatize("kvenna"); // → ["kona"]
62
- lemmatizer.lemmatize("hestinum"); // → ["hestur"]
63
- ```
33
+ | | lemma-is | GreynirEngine |
34
+ |---|---|---|
35
+ | **Runtime** | Node, Bun, Deno | Python |
36
+ | **Throughput** | ~250K words/sec | ~25K words/sec |
37
+ | **Cold start** | ~35 ms | ~500 ms |
38
+ | **Memory** | ~185 MB | ~200 MB |
39
+ | **Disambiguation** | Bigrams + grammar rules | Full sentence parsing |
40
+ | **Use case** | Search indexing | NLP analysis |
64
41
 
65
- Now searches for "barn", "fara", or "hestur" match documents containing any of their forms.
42
+ See [BENCHMARKS.md](./BENCHMARKS.md) for methodology and detailed results.
66
43
 
67
- ## Handling Ambiguity
44
+ ### The Trade-off
68
45
 
69
- Many Icelandic words map to multiple lemmas:
46
+ lemma-is returns **all possible lemmas** for ambiguous words:
70
47
 
71
48
  ```typescript
72
49
  lemmatizer.lemmatize("á");
73
50
  // → ["á", "eiga"]
74
- // "á" = preposition "on" / noun "river"
75
- // "á" = verb "owns" (from "eiga")
76
-
77
- lemmatizer.lemmatize("við");
78
- // → ["við", "ég", "viður"]
79
- // "við" = preposition "by/at"
80
- // "við" = pronoun "we" (from "ég")
81
- // "við" = noun "wood" (from "viður")
51
+ // Could be preposition "on", noun "river", or verb "owns"
82
52
  ```
83
53
 
84
- ### Grammar Rules (Case Government)
54
+ GreynirEngine parses the sentence to return the single correct interpretation. For search, returning all candidates is often better—you'd rather show an extra result than miss a relevant document.
55
+
56
+ ## Installation
57
+
58
+ ```bash
59
+ npm install lemma-is
60
+ ```
85
61
 
86
- The library uses shallow grammar rules based on Icelandic case government to disambiguate prepositions:
62
+ ## Quick Start
87
63
 
88
64
  ```typescript
89
- import { BinaryLemmatizer, Disambiguator } from "lemma-is";
65
+ import { readFileSync } from "fs";
66
+ import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
90
67
 
91
- const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
92
- const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
68
+ // Load the binary (91 MB, ~35ms cold start)
69
+ const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
70
+ const lemmatizer = BinaryLemmatizer.loadFromBuffer(
71
+ buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
72
+ );
93
73
 
94
- // "á borðinu" - borðinu is dative (þgf), á governs dative → preposition
95
- disambiguator.disambiguate("á", null, "borðinu");
96
- // → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
74
+ // Basic lemmatization
75
+ lemmatizer.lemmatize("börnin"); // → ["barn"]
76
+ lemmatizer.lemmatize("fóru"); // → ["fara", "fóra"]
97
77
 
98
- // "ég á" - pronoun before á → likely verb "eiga"
99
- disambiguator.disambiguate("á", "ég", null);
100
- // → { lemma: "eiga", pos: "so", resolvedBy: "preference_rules" }
78
+ // Full pipeline for search indexing
79
+ const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
80
+ // → ["barn", "fara", "fóra", "í", "bíó"]
101
81
  ```
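The `buffer.buffer.slice(byteOffset, byteOffset + byteLength)` step in the Quick Start is worth a note: a Node.js `Buffer` is a `Uint8Array` view, and its underlying `ArrayBuffer` can be larger than the view itself. The sketch below shows the same pattern on a plain `Uint8Array`, so it runs without Node-specific types. `exactBytes` is a hypothetical helper name, not part of lemma-is.

```typescript
// Copy only the bytes the view covers into a fresh, exactly-sized buffer.
function exactBytes(view: Uint8Array) {
  return view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
}

// A 3-byte view that sits in the middle of a 16-byte backing buffer.
const pool = new ArrayBuffer(16);
const view = new Uint8Array(pool, 4, 3);
view.set([1, 2, 3]);

const exact = exactBytes(view);

console.log(view.buffer.byteLength);            // 16: the whole backing buffer
console.log(exact.byteLength);                  // 3: only the view's bytes
console.log(Array.from(new Uint8Array(exact))); // [1, 2, 3]
```

Passing `view.buffer` directly would hand over all 16 bytes, including data outside the view, which is why the Quick Start slices before calling `loadFromBuffer`.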
102
82
 
83
+ ## Features
84
+
103
85
  ### Morphological Features
104
86
 
105
87
  The binary includes case, gender, and number for each word form:
106
88
 
107
89
  ```typescript
108
90
  lemmatizer.lemmatizeWithMorph("hestinum");
109
- // → [{
110
- // lemma: "hestur",
111
- // pos: "no",
112
- // morph: { case: "þgf", gender: "kk", number: "et" }
113
- // }]
114
- // "hestinum" = dative, masculine, singular
115
-
116
- lemmatizer.lemmatizeWithMorph("börnum");
117
- // → [{
118
- // lemma: "barn",
119
- // pos: "no",
120
- // morph: { case: "þgf", gender: "hk", number: "ft" }
121
- // }]
122
- // "börnum" = dative, neuter, plural
91
+ // → [{ lemma: "hestur", pos: "no", morph: { case: "þgf", gender: "kk", number: "et" } }]
92
+ // dative, masculine, singular
123
93
  ```
124
94
 
125
- | Code | Meaning |
126
- |------|---------|
127
- | `nf` | nominative (nefnifall) |
128
- | `þf` | accusative (þolfall) |
129
- | `þgf` | dative (þágufall) |
130
- | `ef` | genitive (eignarfall) |
131
- | `kk` | masculine (karlkyn) |
132
- | `kvk` | feminine (kvenkyn) |
133
- | `hk` | neuter (hvorugkyn) |
134
- | `et` | singular (eintala) |
135
- | `ft` | plural (fleirtala) |
95
+ ### Grammar-Based Disambiguation
136
96
 
137
- ### Bigram Disambiguation
138
-
139
- Use corpus frequencies to pick the most likely lemma based on context:
97
+ Shallow grammar rules use Icelandic case government to disambiguate prepositions:
140
98
 
141
99
  ```typescript
142
- import { BinaryLemmatizer, processText } from "lemma-is";
143
-
144
- const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
100
+ import { Disambiguator } from "lemma-is";
145
101
 
146
- // BinaryLemmatizer has built-in bigram frequencies for disambiguation
147
- // "við erum" = "we are" → bigrams favor pronoun "ég" over preposition
148
- const processed = processText("Við erum hér", lemmatizer, { bigrams: lemmatizer });
149
- // → disambiguated: "ég" for "við" (high confidence)
102
+ const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
150
103
 
151
- // "á morgun" = "tomorrow" → bigrams favor preposition
152
- const processed2 = processText("Ég fer á morgun", lemmatizer, { bigrams: lemmatizer });
153
- // → disambiguated: "á" for "á" (not "eiga")
104
+ // "á borðinu" - borðinu is dative, á governs dative → preposition
105
+ disambiguator.disambiguate("á", null, "borðinu");
106
+ // → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
154
107
  ```
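As a rough illustration of how a case-government rule can resolve "á", the sketch below keeps the preposition reading only when the following word can be in a case that the preposition governs. This is a toy model, not the library's rule set: `governedCases` and `pickByCaseGovernment` are hypothetical, and the case codes (`þf`, `þgf`, `ef`) and POS codes (`fs`, `no`, `so`) follow the tables in the previous README.

```typescript
type Case = "nf" | "þf" | "þgf" | "ef";

interface Candidate {
  lemma: string;
  pos: string;
}

// Toy table of cases a preposition can govern (illustrative, far from complete).
const governedCases: Record<string, Case[]> = {
  "á": ["þf", "þgf"],
  "til": ["ef"],
};

// Return the preposition candidate if the next word's possible cases agree
// with what the preposition governs; otherwise leave the word unresolved.
function pickByCaseGovernment(
  word: string,
  candidates: Candidate[],
  nextWordCases: Case[],
): Candidate | null {
  const governed = governedCases[word];
  if (!governed) return null;
  const preposition = candidates.find((c) => c.pos === "fs");
  if (!preposition) return null;
  const agrees = nextWordCases.some((c) => governed.includes(c));
  return agrees ? preposition : null;
}

const candidates: Candidate[] = [
  { lemma: "á", pos: "fs" },
  { lemma: "á", pos: "no" },
  { lemma: "eiga", pos: "so" },
];

// "borðinu" is dative (þgf), so the preposition reading is kept.
const resolved = pickByCaseGovernment("á", candidates, ["þgf"]);
console.log(resolved); // { lemma: "á", pos: "fs" }

// A nominative-only next word gives no agreement, so nothing is resolved.
const unresolved = pickByCaseGovernment("á", candidates, ["nf"]);
console.log(unresolved); // null
```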
155
108
 
156
- For search indexing, ambiguity is often acceptable—indexing all candidate lemmas improves recall.
157
-
158
- ## Compound Word Splitting
109
+ ### Compound Splitting
159
110
 
160
- Icelandic forms long compounds. The library splits them for better search coverage:
111
+ Icelandic forms long compounds. Split them for better search coverage:
161
112
 
162
113
  ```typescript
163
114
  import { CompoundSplitter, createKnownLemmaSet } from "lemma-is";
164
115
 
116
+ const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
165
117
  const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
166
118
 
167
- splitter.split("bílstjóri");
168
- // → { isCompound: true, parts: ["bíll", "stjóri"] }
169
- // "car driver" = "car" + "driver"
170
-
171
119
  splitter.split("landbúnaðarráðherra");
172
120
  // → { isCompound: true, parts: ["landbúnaður", "ráðherra"] }
173
- // "agriculture minister" = "agriculture" + "minister"
174
-
175
- splitter.split("húsnæðislán");
176
- // → { isCompound: true, parts: ["húsnæði", "lán"] }
177
- // "housing loan" = "housing" + "loan"
178
- ```
179
-
180
- ### Indexing Compounds
181
-
182
- `getAllLemmas` returns the original word plus its parts—maximizing search recall:
183
-
184
- ```typescript
185
- splitter.getAllLemmas("bílstjóri");
186
- // → ["bílstjóri", "bíll", "stjóri"]
187
- ```
188
-
189
- A document mentioning "bílstjóri" is now findable by searching for "bíll" (car).
190
-
191
- ## Full Pipeline
192
-
193
- For production indexing, combine tokenization, lemmatization, disambiguation, and compound splitting.
194
-
195
- ### What Gets Indexed
196
-
197
- Here's a real example showing exactly what lemmas are extracted:
198
-
199
- ```typescript
200
- const text = "Ríkissjóður stendur í blóma ef 27 milljarða arðgreiðsla Íslandsbanka er talin með.";
201
-
202
- const lemmas = extractIndexableLemmas(text, lemmatizer, {
203
- bigrams: lemmatizer,
204
- compoundSplitter: splitter,
205
- removeStopwords: true,
206
- });
207
-
208
- // Indexed lemmas:
209
- // ✓ ríkissjóður, ríki, sjóður — compound + parts
210
- // ✓ standa — stendur → standa
211
- // ✓ blómi — í blóma → blómi
212
- // ✓ milljarður — milljarða → milljarður
213
- // ✓ arðgreiðsla, arður, greiðsla — compound + parts
214
- // ✓ íslandsbanki — proper noun (lowercased)
215
- // ✓ telja — talin → telja
216
- //
217
- // NOT indexed (stopwords removed):
218
- // ✗ í, ef, er, með
219
- ```
220
-
221
- A search for "sjóður" or "arður" now finds this document about the state treasury and bank dividends.
222
-
223
- ### Another Example: Job Posting
224
-
225
- ```typescript
226
- const posting = "Við leitum að reyndum kennurum til starfa í Reykjavík.";
227
-
228
- const lemmas = extractIndexableLemmas(posting, lemmatizer, {
229
- bigrams: lemmatizer,
230
- removeStopwords: true,
231
- });
232
-
233
- // Indexed:
234
- // ✓ leita, leit — leitum → leita (+ noun variant)
235
- // ✓ reyndur, reynd — reyndum → reyndur
236
- // ✓ kennari — kennurum → kennari
237
- // ✓ starf, starfa — starfa (noun + verb)
238
- // ✓ reykjavík — place name (lowercased)
239
- //
240
- // NOT indexed:
241
- // ✗ við, að, til, í — stopwords
121
+ // "agriculture minister"
242
122
  ```
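The idea behind compound splitting can be sketched as a search over split points: a split is accepted when both halves lemmatize to a known lemma. This is a simplified sketch under stated assumptions, not `CompoundSplitter`'s actual heuristics. `splitCompound` and `stubLemmatize` are hypothetical, and the stub replaces `lemmatizer.lemmatize` so the example is self-contained.

```typescript
type LemmatizeFn = (word: string) => string[];

// Try every split point; accept the first one where both halves
// lemmatize to something in the known-lemma set.
function splitCompound(
  word: string,
  lemmatize: LemmatizeFn,
  known: Set<string>,
  minPartLength = 3,
): string[] | null {
  for (let i = minPartLength; i <= word.length - minPartLength; i++) {
    const left = lemmatize(word.slice(0, i)).find((l) => known.has(l));
    const right = lemmatize(word.slice(i)).find((l) => known.has(l));
    if (left && right) return [left, right];
  }
  return null;
}

// Stub data: the first part of the compound is a genitive form,
// so it has to be lemmatized before it matches a known lemma.
const stubForms: Record<string, string[]> = {
  "landbúnaðar": ["landbúnaður"],
  "ráðherra": ["ráðherra"],
};
const stubLemmatize: LemmatizeFn = (w) => stubForms[w] ?? [w];
const knownLemmas = new Set<string>(["landbúnaður", "ráðherra"]);

const parts = splitCompound("landbúnaðarráðherra", stubLemmatize, knownLemmas);
console.log(parts); // ["landbúnaður", "ráðherra"]

const notCompound = splitCompound("ráðherra", stubLemmatize, knownLemmas);
console.log(notCompound); // null
```

Lemmatizing each half, rather than looking it up verbatim, is what lets the inflected linking form "landbúnaðar" resolve to "landbúnaður".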
243
123
 
244
- A search for "kennari" finds this job posting even though the word "kennari" never appears—only "kennurum" (dative plural).
124
+ ### Full Pipeline
245
125
 
246
- ### Complex Sentence
126
+ For production indexing, combine everything:
247
127
 
248
128
  ```typescript
249
- const text = "Löngu áður en Jón borðaði ísinn sem hafði bráðnað hratt " +
250
- "fór ég á veitingastaðinn og keypti mér rauðvín með hamborgaranum.";
251
-
252
- const lemmas = extractIndexableLemmas(text, lemmatizer, {
253
- bigrams: lemmatizer,
254
- compoundSplitter: splitter,
255
- removeStopwords: true,
256
- });
257
-
258
- // Verbs (various tenses/persons):
259
- // ✓ borða — borðaði (past)
260
- // ✓ bráðna — bráðnað (past participle)
261
- // ✓ fara — fór (past, different stem!)
262
- // ✓ kaupa — keypti (past)
263
- //
264
- // Nouns with articles:
265
- // ✓ ís — ísinn (NOT "Ísland"!)
266
- // ✓ veitingastaður, veiting, staður — compound
267
- // ✓ rauðvín
268
- // ✓ hamborgari — hamborgaranum (dative + article)
269
- ```
129
+ import { extractIndexableLemmas, CompoundSplitter, createKnownLemmaSet } from "lemma-is";
270
130
 
271
- ### Setup
272
-
273
- ```typescript
274
- import {
275
- BinaryLemmatizer,
276
- extractIndexableLemmas,
277
- CompoundSplitter,
278
- createKnownLemmaSet
279
- } from "lemma-is";
280
-
281
- const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
282
131
  const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
283
132
  const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
284
- ```
285
133
 
286
- ### Search-Optimized Defaults
134
+ const text = "Ríkissjóður stendur í blóma ef milljarða arðgreiðsla er talin með.";
287
135
 
288
- The defaults favor **recall over precision**—better for search where missing results is worse than extra results:
289
-
290
- ```typescript
291
136
  const lemmas = extractIndexableLemmas(text, lemmatizer, {
292
137
  bigrams: lemmatizer,
293
138
  compoundSplitter: splitter,
294
- // These are the defaults:
295
- // indexAllCandidates: true — indexes ALL lemma candidates
296
- // alwaysTryCompounds: true — splits compounds even if known in BÍN
297
- });
298
- ```
299
-
300
- With these defaults:
301
- - `"á"` → indexes both `"á"` (preposition) AND `"eiga"` (verb)
302
- - `"húsnæðislán"` → indexes `"húsnæðislán"`, `"húsnæði"`, AND `"lán"`
303
-
304
- ### Precision Mode
305
-
306
- If you need only the most likely lemma (chatbots, translation), disable the search optimizations:
307
-
308
- ```typescript
309
- const lemmas = extractIndexableLemmas(text, lemmatizer, {
310
- bigrams: lemmatizer,
311
- compoundSplitter: splitter,
312
- indexAllCandidates: false, // only disambiguated lemma
313
- alwaysTryCompounds: false, // only split unknown words
139
+ removeStopwords: true,
314
140
  });
315
- ```
316
-
317
- ## Word Classes
318
-
319
- Filter by part of speech when context is known:
320
-
321
- ```typescript
322
- lemmatizer.lemmatize("á", { wordClass: "so" }); // → ["eiga"] (verbs only)
323
- lemmatizer.lemmatize("á", { wordClass: "fs" }); // → ["á"] (prepositions only)
324
-
325
- lemmatizer.lemmatizeWithPOS("á");
326
- // → [
327
- // { lemma: "á", pos: "fs" }, // preposition
328
- // { lemma: "á", pos: "no" }, // noun (river)
329
- // { lemma: "eiga", pos: "so" } // verb
330
- // ]
331
- ```
332
-
333
- | Code | Icelandic | English |
334
- |------|-----------|---------|
335
- | `no` | nafnorð | noun |
336
- | `so` | sagnorð | verb |
337
- | `lo` | lýsingarorð | adjective |
338
- | `ao` | atviksorð | adverb |
339
- | `fs` | forsetning | preposition |
340
- | `fn` | fornafn | pronoun |
341
-
342
- ## Data
343
-
344
- Single binary file: `lemma-is.bin` (~91 MB)
345
-
346
- Contains:
347
- - 289K lemmas from BÍN
348
- - 3M word form mappings
349
- - 414K bigram frequencies
350
- - Morphological features (case, gender, number) per word form
351
141
 
352
- Uses ArrayBuffer with binary search for efficient memory usage. Format version 2 includes packed morphological data.
353
-
354
- ### Building Data
355
-
356
- ```bash
357
- # Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
358
- # Extract SHsnid.csv to data/
359
-
360
- uv run python scripts/build-binary.py # builds lemma-is.bin with morph features
142
+ // Indexed: ríkissjóður, ríki, sjóður, standa, blómi, milljarður,
143
+ // arðgreiðsla, arður, greiðsla, telja
144
+ // Stopwords removed: í, ef, er, með
361
145
  ```
362
146
 
363
- ## Node.js Usage
364
-
365
- ```typescript
366
- import { readFileSync } from "fs";
367
- import { BinaryLemmatizer } from "lemma-is";
368
-
369
- const buffer = readFileSync("data-dist/lemma-is.bin");
370
- const lemmatizer = BinaryLemmatizer.loadFromBuffer(
371
- buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
372
- );
373
- ```
147
+ A search for "sjóður" or "arður" now finds this document.
374
148
 
375
149
  ## PostgreSQL Full-Text Search
376
150
 
377
- PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process text, then store lemmas in a `tsvector` column with the `simple` configuration.
378
-
379
- ```sql
380
- CREATE TABLE documents (
381
- id SERIAL PRIMARY KEY,
382
- title TEXT,
383
- body TEXT,
384
- search_vector TSVECTOR
385
- );
386
- CREATE INDEX documents_search_idx ON documents USING GIN (search_vector);
387
- ```
388
-
389
- Lemmatize in your app, store as space-separated string:
151
+ PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process:
390
152
 
391
153
  ```typescript
392
154
  const lemmas = extractIndexableLemmas(text, lemmatizer, { removeStopwords: true });
@@ -398,160 +160,75 @@ await db.query(
398
160
  );
399
161
  ```
400
162
 
401
- Query by lemmatizing search terms the same way:
163
+ Use the `simple` configuration—it lowercases but doesn't stem, since our lemmas are already normalized.
402
164
 
403
- ```typescript
404
- const lemmas = extractIndexableLemmas(query, lemmatizer);
405
-
406
- const results = await db.query(
407
- `SELECT *, ts_rank(search_vector, q) AS rank
408
- FROM documents, plainto_tsquery('simple', $1) q
409
- WHERE search_vector @@ q
410
- ORDER BY rank DESC`,
411
- [Array.from(lemmas).join(" ")]
412
- );
413
-
414
- // User searches "börnum" → lemmatized to "barn" → matches all forms
415
- ```
416
-
417
- **Why `simple`?** It lowercases but doesn't stem—our lemmas are already normalized. Use `setweight()` to boost title matches over body.
418
-
419
- **Diacritics:** PostgreSQL's `unaccent` extension strips accents, but **don't use it for Icelandic**. Characters like á, ö, þ, ð are distinct letters, not accented variants. "á" (river/on/owns) ≠ "a". Preserve diacritics for correct matching.
165
+ **Important:** Don't use PostgreSQL's `unaccent` extension for Icelandic. Characters like á, ö, þ, ð are distinct letters, not accented variants.
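A small sketch of both points: preparing lemmas as a space-separated string for the `simple` configuration, and what accent stripping would do to Icelandic text. `toTsvectorInput` and `stripAccents` are hypothetical helpers, not part of lemma-is. `stripAccents` uses Unicode decomposition only to demonstrate the collision and is not a reimplementation of PostgreSQL's `unaccent`.

```typescript
// Build the text that would be passed to to_tsvector('simple', $1):
// lowercase, de-duplicate, and join with spaces. Diacritics are preserved.
function toTsvectorInput(lemmas: Iterable<string>): string {
  const unique = new Set<string>();
  for (const lemma of Array.from(lemmas)) {
    const normalized = lemma.trim().toLowerCase();
    if (normalized.length > 0) unique.add(normalized);
  }
  return Array.from(unique).join(" ");
}

// Demonstration only: strip combining marks after canonical decomposition.
function stripAccents(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

const tsvectorInput = toTsvectorInput(["barn", "fara", "Bíó", "barn"]);
console.log(tsvectorInput); // "barn fara bíó"

// Accent stripping collapses distinct letters into one another.
console.log(stripAccents("á")); // "a"
console.log(stripAccents("á") === stripAccents("a")); // true
```

Once "á" and "a" collapse into the same token, the index can no longer tell them apart, which is the reason to keep diacritics intact.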
420
166
 
421
167
  ## Limitations
422
168
 
423
- This library makes tradeoffs for portability. Know what you're getting.
169
+ This is an early effort with known limitations.
424
170
 
425
171
  ### File Size
426
172
 
427
- The binary is **~91 MB**. This library targets Node.js server environments where the data is loaded once at startup.
173
+ The binary is **91 MB**. This targets Node.js servers where data loads once at startup. Not recommended for:
428
174
 
429
- Not recommended for:
430
- - **Serverless/edge** — cold start latency loading 91 MB
431
- - **Browser/Web Workers** — download size prohibitive for most users
175
+ - **Serverless/edge** — cold start loading 91 MB may be slow
176
+ - **Browser** — download size prohibitive
432
177
  - **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
433
178
 
434
- For browser applications, run lemmatization server-side and expose an API endpoint.
435
-
436
- ### No Query Expansion
437
-
438
- You can go **word → lemma** but not **lemma → words**:
179
+ For browser apps, run lemmatization server-side.
439
180
 
440
- ```typescript
441
- lemmatizer.lemmatize("hestinum"); // → ["hestur"] ✓
442
-
443
- // But you CANNOT do:
444
- lemmatizer.expand("hestur");
445
- // → ["hestur", "hest", "hesti", "hests", "hestinn", "hestinum", ...] ✗
446
- ```
181
+ ### Not a Parser
447
182
 
448
- This matters for **search result highlighting**. If a user searches "hestur" and the document contains "hestinum", you can't easily highlight the match without the reverse mapping.
183
+ This is a lookup table with shallow grammar rules, not a grammatical parser. It doesn't understand sentence structure, named entities, or semantic meaning. The grammar rules help with common patterns but can't handle all disambiguation.
449
184
 
450
- **Workaround:** Store original text alongside lemmas, use regex patterns for common suffixes.
185
+ For applications needing full grammatical analysis, use [GreynirEngine](https://github.com/mideind/GreynirEngine).
451
186
 
452
187
  ### Disambiguation Limits
453
188
 
454
- Bigram disambiguation only works when the word pair exists in the corpus data:
455
-
456
- ```typescript
457
- // Common phrase: bigrams help
458
- processText("við erum", lemmatizer, { bigrams: lemmatizer });
459
- // → "við" disambiguated to "ég" (we) with high confidence
460
-
461
- // Rare/unusual phrase: no bigram data
462
- processText("við flæktumst", lemmatizer, { bigrams: lemmatizer });
463
- // → "við" picks first candidate, low confidence
464
- ```
465
-
466
- Without context, ambiguous words fall back to arbitrary ordering:
189
+ Bigram disambiguation only works when the word pair exists in corpus data. Without context, ambiguous words return all candidates:
467
190
 
468
191
  ```typescript
469
- // Single word, no context
470
192
  lemmatizer.lemmatize("á");
471
- // → ["á", "eiga"] — but which is more likely? No way to know.
472
-
473
- // The preposition "á" is ~100x more common than verb "eiga" in this form,
474
- // but we don't have unigram frequencies to use as tiebreaker.
193
+ // → ["á", "eiga"] — no way to know which is more likely
475
194
  ```
476
195
 
477
- **For search indexing:** Use `indexAllCandidates: true` to index all lemmas and let ranking sort out relevance. For applications needing precision (chatbots, translation), use GreynirEngine instead.
478
-
479
- ### Compound Splitting Heuristics
196
+ For search indexing, use `indexAllCandidates: true` (the default) to index all lemmas.
480
197
 
481
- The splitter uses simple rules that miss edge cases:
482
-
483
- **Three-part compounds only split once:**
484
- ```typescript
485
- splitter.split("þjóðmálaráðherra");
486
- // → ["þjóðmál", "ráðherra"] — missing "þjóð" as separate part
487
- // Ideal: ["þjóð", "mál", "ráðherra"]
488
- ```
489
-
490
- **Inflected first parts may not match:**
491
- ```typescript
492
- splitter.split("húseignir");
493
- // → { isCompound: false } — "hús" appears as "hús" not "húsa"
494
- // The compound IS "hús" + "eignir" but heuristics miss it
495
- ```
496
-
497
- **May over-split valid words:**
498
- ```typescript
499
- splitter.split("landsins");
500
- // This is NOT a compound — it's "land" + genitive suffix "-sins"
501
- // Correctly returns { isCompound: false }, but edge cases exist
502
- ```
503
-
504
- **Mitigations:**
505
- - Use `alwaysTryCompounds: true` to split even known words
506
- - Use `minPartLength: 2` in CompoundSplitter for more aggressive splitting
507
- - Over-indexing is usually better than under-indexing for search
508
-
509
- ### Not a Parser
510
-
511
- This is a lookup table with shallow grammar rules, not a full grammatical parser. It doesn't understand:
512
-
513
- - Full sentence structure or syntax trees
514
- - Complex verb argument frames
515
- - Named entity recognition (people, places, companies)
516
- - Semantic meaning or word sense
198
+ ### No Query Expansion
517
199
 
518
- The grammar rules help with common patterns (preposition + case, pronoun + verb) but can't handle all disambiguation cases. For applications needing full grammatical analysis, use [GreynirEngine](https://github.com/mideind/GreynirEngine). lemma-is is for search indexing where "good enough" recall beats perfect precision.
200
+ You can go word → lemma but not lemma → words. This affects search result highlighting—if a user searches "hestur" and the document contains "hestinum", you can't easily highlight the match.
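One possible workaround, sketched here under assumptions, is to record at index time which surface forms in a document produced each lemma, so a highlighter can look them up later. `buildSurfaceFormIndex` is a hypothetical helper, and `stubLemmatizeForms` stands in for `lemmatizer.lemmatize` with values taken from the README examples.

```typescript
type LemmatizeWord = (word: string) => string[];

// For one document, map each lemma to the surface forms that produced it.
function buildSurfaceFormIndex(
  tokens: string[],
  lemmatize: LemmatizeWord,
): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const token of tokens) {
    const surface = token.toLowerCase();
    for (const lemma of lemmatize(surface)) {
      let forms = index.get(lemma);
      if (!forms) {
        forms = new Set<string>();
        index.set(lemma, forms);
      }
      forms.add(surface);
    }
  }
  return index;
}

// Stub data mirroring the README's lemmatize() examples.
const stubSurfaceLemmas: Record<string, string[]> = {
  "börnin": ["barn"],
  "fóru": ["fara", "fóra"],
  "hestinum": ["hestur"],
};
const stubLemmatizeForms: LemmatizeWord = (w) => stubSurfaceLemmas[w] ?? [w];

const surfaceIndex = buildSurfaceFormIndex(
  ["Börnin", "fóru", "hestinum"],
  stubLemmatizeForms,
);

// A query for "hestur" can now find the form to highlight in this document.
const hesturForms = Array.from(surfaceIndex.get("hestur") ?? []);
console.log(hesturForms); // ["hestinum"]
```

This only covers forms that actually occur in the indexed documents. It is not a general lemma-to-forms expansion.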
519
201
 
520
- ## Development
202
+ ## Data
521
203
 
522
- ### Testing
204
+ Single binary file containing:
205
+ - 289K lemmas from BÍN
206
+ - 3M word form mappings
207
+ - 414K bigram frequencies
208
+ - Morphological features per word form
523
209
 
524
- Tests use [Vitest](https://vitest.dev/):
210
+ ### Building Data
525
211
 
526
212
  ```bash
527
- pnpm test # run all tests
528
- pnpm test:watch # watch mode
529
- npx vitest run --update # update snapshots
530
- ```
213
+ # Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
214
+ # Extract SHsnid.csv to data/
531
215
 
532
- Test files:
533
- - `binary-lemmatizer.test.ts` — Core lemmatization and bigram lookup
534
- - `compounds.test.ts` — Compound word splitting
535
- - `integration.test.ts` — Full pipeline, search indexing options
536
- - `pipeline-greynir.test.ts` — Full pipeline with Greynir test sentences
537
- - `benchmark.test.ts` — Performance and metrics snapshots
538
- - `icelandic-tricky.test.ts` — Edge cases, morphology examples
539
- - `limitations.test.ts` — Documented limitations and research notes
540
- - `mini-grammar.test.ts` — Grammar rules and case government
216
+ uv run python scripts/build-binary.py
217
+ ```
541
218
 
542
- ### Building
219
+ ## Development
543
220
 
544
221
  ```bash
222
+ pnpm test # run tests
545
223
  pnpm build # build dist/
546
- pnpm typecheck # type check without emitting
547
- pnpm build:data # rebuild binary from BÍN source
224
+ pnpm typecheck # type check
548
225
  ```
549
226
 
550
227
  ## Acknowledgments
551
228
 
552
- - **[BÍN](https://bin.arnastofnun.is/)** Morphological database from the Árni Magnússon Institute
553
- - **[Miðeind](https://mideind.is/)** Greynir and foundational Icelandic NLP work
554
- - **[tokenize-is](https://github.com/axelharri/tokenize-is)** Icelandic tokenizer
229
+ - **[BÍN](https://bin.arnastofnun.is/)** Morphological database from the Árni Magnússon Institute
230
+ - **[Miðeind](https://mideind.is/)** GreynirEngine and foundational Icelandic NLP work
231
+ - **[tokenize-is](https://github.com/axelharri/tokenize-is)** Icelandic tokenizer
555
232
 
556
233
  ## License
557
234
 
@@ -559,10 +236,10 @@ MIT for the code.
559
236
 
560
237
  ### Data License (BÍN)
561
238
 
562
- The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) (Beygingarlýsing íslensks nútímamáls) © Árni Magnússon Institute for Icelandic Studies.
239
+ The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) © Árni Magnússon Institute for Icelandic Studies.
563
240
 
564
241
  **By using this package, you agree to BÍN's conditions:**
565
- - Credit the Árni Magnússon Institute in your product's UI
242
+ - Credit the Árni Magnússon Institute in your product
566
243
  - Do not redistribute the raw data separately
567
244
  - Do not publish inflection paradigms without permission
568
245