lemma-is 0.2.3 → 0.3.0
- package/README.md +109 -433
- package/dist/index.d.mts +31 -3
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +1 -1
- package/dist/index.mjs.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -1,393 +1,154 @@
|
|
|
1
1
|
# lemma-is
|
|
2
2
|
|
|
3
|
-
Icelandic lemmatization for JavaScript.
|
|
3
|
+
Fast Icelandic lemmatization for JavaScript. Built for search indexing.
|
|
4
4
|
|
|
5
|
-
## Why?
|
|
6
|
-
|
|
7
|
-
Existing Icelandic NLP tools are Python/C++:
|
|
8
|
-
|
|
9
|
-
| Tool | Runtime | Standalone? | Notes |
|
|
10
|
-
|------|---------|-------------|-------|
|
|
11
|
-
| **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger. |
|
|
12
|
-
| **[Nefnir](https://github.com/lexis-project/Nefnir)** | Python | ✗ | Requires POS tags from IceNLP/IceStagger (Java, unmaintained). |
|
|
13
|
-
| **lemma-is** | TypeScript | ✓ | Node.js servers. Grammar-based disambiguation, compound splitting. |
|
|
14
|
-
|
|
15
|
-
lemma-is trades parsing accuracy for JS ecosystem integration—good enough for search indexing, runs in any Node.js environment.
|
|
16
|
-
|
|
17
|
-
## Quickstart
|
|
18
|
-
|
|
19
|
-
```bash
|
|
20
|
-
npm install lemma-is
|
|
21
|
-
```
|
|
22
|
-
|
|
23
|
-
**Node.js**:
|
|
24
5
|
```typescript
|
|
25
|
-
import { readFileSync } from "fs";
|
|
26
6
|
import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
|
|
27
7
|
|
|
28
|
-
//
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
32
|
-
);
|
|
33
|
-
|
|
34
|
-
lemmatizer.lemmatize("börnin"); // → ["barn"]
|
|
35
|
-
lemmatizer.lemmatize("fóru"); // → ["fara", "fóra"]
|
|
8
|
+
lemmatizer.lemmatize("börnin"); // → ["barn"]
|
|
9
|
+
lemmatizer.lemmatize("keypti"); // → ["kaupa"]
|
|
10
|
+
lemmatizer.lemmatize("hestinum"); // → ["hestur"]
|
|
36
11
|
|
|
37
|
-
// Full pipeline for search
|
|
38
|
-
|
|
39
|
-
// → ["barn", "
|
|
12
|
+
// Full pipeline for search
|
|
13
|
+
extractIndexableLemmas("Börnin keypti hestinn", lemmatizer);
|
|
14
|
+
// → ["barn", "kaupa", "hestur"]
|
|
40
15
|
```
|
|
41
16
|
|
|
42
17
|
## The Problem
|
|
43
18
|
|
|
44
|
-
Icelandic is
|
|
19
|
+
Icelandic is heavily inflected. A single noun like "hestur" (horse) has 16 forms:
|
|
45
20
|
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
| barn (child) | börnin, barnið, barna, börnum... |
|
|
50
|
-
| fara (go) | fór, fer, förum, fóru, farið... |
|
|
51
|
-
| kona (woman) | konuna, konunni, kvenna, konum... |
|
|
21
|
+
```
|
|
22
|
+
hestur, hest, hesti, hests, hestar, hesta, hestum, hestanna...
|
|
23
|
+
```
|
|
52
24
|
|
|
53
|
-
If
|
|
25
|
+
If a user searches "hestur" but your document contains "hestinum", they won't find it—unless you normalize both to the lemma at index time.
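To make the index-time normalization concrete, here is a minimal sketch. It assumes a hand-written `toyLemmas` table standing in for the real lemmatizer; `lemmasOf`, `buildIndex`, and `search` are illustrative names, not part of lemma-is.

```typescript
// Toy stand-in for a lemmatizer: maps inflected forms to lemmas.
// Assumption: a real setup would call lemmatizer.lemmatize() instead.
const toyLemmas = new Map<string, string[]>([
  ["hestinum", ["hestur"]],
  ["hestur", ["hestur"]],
  ["börnin", ["barn"]],
]);

function lemmasOf(word: string): string[] {
  const key = word.toLowerCase();
  // Unknown words fall back to their lowercased surface form.
  return toyLemmas.get(key) ?? [key];
}

// Inverted index: lemma -> set of document ids.
function buildIndex(docs: Map<number, string>): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  docs.forEach((text, id) => {
    for (const word of text.split(/\s+/).filter(Boolean)) {
      for (const lemma of lemmasOf(word)) {
        if (!index.has(lemma)) index.set(lemma, new Set<number>());
        index.get(lemma)!.add(id);
      }
    }
  });
  return index;
}

// Normalize the query the same way as the documents, then look up each lemma.
function search(index: Map<string, Set<number>>, query: string): number[] {
  const hits = new Set<number>();
  for (const word of query.split(/\s+/).filter(Boolean)) {
    for (const lemma of lemmasOf(word)) {
      index.get(lemma)?.forEach((id) => hits.add(id));
    }
  }
  return Array.from(hits).sort((a, b) => a - b);
}

const demoIndex = buildIndex(
  new Map<number, string>([
    [1, "Hún gaf hestinum hey"],
    [2, "Börnin sofa"],
  ])
);
console.log(search(demoIndex, "hestur")); // document 1 matches through the shared lemma
```

Because documents and queries pass through the same `lemmasOf` step, the surface form in the document never has to equal the surface form in the query.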
|
|
54
26
|
|
|
55
|
-
##
|
|
27
|
+
## Why lemma-is?
|
|
56
28
|
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
lemmatizer.lemmatize("kvenna"); // → ["kona"]
|
|
61
|
-
lemmatizer.lemmatize("hestinum"); // → ["hestur"]
|
|
62
|
-
```
|
|
29
|
+
The gold standard for Icelandic NLP is [GreynirEngine](https://github.com/mideind/GreynirEngine)—a full grammatical parser with excellent accuracy. But it's Python-only, which means you can't run it in Node.js, browsers, or edge runtimes without FFI or a sidecar process.
|
|
30
|
+
|
|
31
|
+
lemma-is trades parsing accuracy for JavaScript portability. It's a lookup table with shallow grammar rules—good enough for search indexing, runs anywhere Node.js runs.
|
|
63
32
|
|
|
64
|
-
|
|
33
|
+
| | lemma-is | GreynirEngine |
|
|
34
|
+
|---|---|---|
|
|
35
|
+
| **Runtime** | Node, Bun, Deno | Python |
|
|
36
|
+
| **Throughput** | ~250K words/sec | ~25K words/sec |
|
|
37
|
+
| **Cold start** | ~35 ms | ~500 ms |
|
|
38
|
+
| **Memory** | ~185 MB | ~200 MB |
|
|
39
|
+
| **Disambiguation** | Bigrams + grammar rules | Full sentence parsing |
|
|
40
|
+
| **Use case** | Search indexing | NLP analysis |
|
|
65
41
|
|
|
66
|
-
|
|
42
|
+
See [BENCHMARKS.md](./BENCHMARKS.md) for methodology and detailed results.
|
|
67
43
|
|
|
68
|
-
|
|
44
|
+
### The Trade-off
|
|
45
|
+
|
|
46
|
+
lemma-is returns **all possible lemmas** for ambiguous words:
|
|
69
47
|
|
|
70
48
|
```typescript
|
|
71
49
|
lemmatizer.lemmatize("á");
|
|
72
50
|
// → ["á", "eiga"]
|
|
73
|
-
//
|
|
74
|
-
// "á" = verb "owns" (from "eiga")
|
|
75
|
-
|
|
76
|
-
lemmatizer.lemmatize("við");
|
|
77
|
-
// → ["við", "ég", "viður"]
|
|
78
|
-
// "við" = preposition "by/at"
|
|
79
|
-
// "við" = pronoun "we" (from "ég")
|
|
80
|
-
// "við" = noun "wood" (from "viður")
|
|
51
|
+
// Could be preposition "on", noun "river", or verb "owns"
|
|
81
52
|
```
|
|
82
53
|
|
|
83
|
-
|
|
54
|
+
GreynirEngine parses the sentence to return the single correct interpretation. For search, returning all candidates is often better—you'd rather show an extra result than miss a relevant document.
|
|
84
55
|
|
|
85
|
-
|
|
56
|
+
## Installation
|
|
57
|
+
|
|
58
|
+
```bash
|
|
59
|
+
npm install lemma-is
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
## Quick Start
|
|
86
63
|
|
|
87
64
|
```typescript
|
|
88
|
-
import {
|
|
65
|
+
import { readFileSync } from "fs";
|
|
66
|
+
import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
|
|
89
67
|
|
|
90
|
-
//
|
|
91
|
-
const
|
|
68
|
+
// Load the binary (91 MB, ~35ms cold start)
|
|
69
|
+
const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
|
|
70
|
+
const lemmatizer = BinaryLemmatizer.loadFromBuffer(
|
|
71
|
+
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
72
|
+
);
|
|
92
73
|
|
|
93
|
-
//
|
|
94
|
-
|
|
95
|
-
// →
|
|
74
|
+
// Basic lemmatization
|
|
75
|
+
lemmatizer.lemmatize("börnin"); // → ["barn"]
|
|
76
|
+
lemmatizer.lemmatize("fóru"); // → ["fara", "fóra"]
|
|
96
77
|
|
|
97
|
-
//
|
|
98
|
-
|
|
99
|
-
// →
|
|
78
|
+
// Full pipeline for search indexing
|
|
79
|
+
const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
|
|
80
|
+
// → ["barn", "fara", "fóra", "í", "bíó"]
|
|
100
81
|
```
|
|
101
82
|
|
|
83
|
+
## Features
|
|
84
|
+
|
|
102
85
|
### Morphological Features
|
|
103
86
|
|
|
104
87
|
The binary includes case, gender, and number for each word form:
|
|
105
88
|
|
|
106
89
|
```typescript
|
|
107
90
|
lemmatizer.lemmatizeWithMorph("hestinum");
|
|
108
|
-
// → [{
|
|
109
|
-
//
|
|
110
|
-
// pos: "no",
|
|
111
|
-
// morph: { case: "þgf", gender: "kk", number: "et" }
|
|
112
|
-
// }]
|
|
113
|
-
// "hestinum" = dative, masculine, singular
|
|
114
|
-
|
|
115
|
-
lemmatizer.lemmatizeWithMorph("börnum");
|
|
116
|
-
// → [{
|
|
117
|
-
// lemma: "barn",
|
|
118
|
-
// pos: "no",
|
|
119
|
-
// morph: { case: "þgf", gender: "hk", number: "ft" }
|
|
120
|
-
// }]
|
|
121
|
-
// "börnum" = dative, neuter, plural
|
|
91
|
+
// → [{ lemma: "hestur", pos: "no", morph: { case: "þgf", gender: "kk", number: "et" } }]
|
|
92
|
+
// dative, masculine, singular
|
|
122
93
|
```
|
|
123
94
|
|
|
124
|
-
|
|
125
|
-
|------|---------|
|
|
126
|
-
| `nf` | nominative (nefnifall) |
|
|
127
|
-
| `þf` | accusative (þolfall) |
|
|
128
|
-
| `þgf` | dative (þágufall) |
|
|
129
|
-
| `ef` | genitive (eignarfall) |
|
|
130
|
-
| `kk` | masculine (karlkyn) |
|
|
131
|
-
| `kvk` | feminine (kvenkyn) |
|
|
132
|
-
| `hk` | neuter (hvorugkyn) |
|
|
133
|
-
| `et` | singular (eintala) |
|
|
134
|
-
| `ft` | plural (fleirtala) |
|
|
135
|
-
|
|
136
|
-
### Bigram Disambiguation
|
|
95
|
+
### Grammar-Based Disambiguation
|
|
137
96
|
|
|
138
|
-
|
|
97
|
+
Shallow grammar rules use Icelandic case government to disambiguate prepositions:
|
|
139
98
|
|
|
140
99
|
```typescript
|
|
141
|
-
import {
|
|
100
|
+
import { Disambiguator } from "lemma-is";
|
|
142
101
|
|
|
143
|
-
|
|
144
|
-
// "við erum" = "we are" → bigrams favor pronoun "ég" over preposition
|
|
145
|
-
const processed = processText("Við erum hér", lemmatizer, { bigrams: lemmatizer });
|
|
146
|
-
// → disambiguated: "ég" for "við" (high confidence)
|
|
102
|
+
const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
|
|
147
103
|
|
|
148
|
-
// "á
|
|
149
|
-
|
|
150
|
-
// →
|
|
104
|
+
// "á borðinu" - borðinu is dative, á governs dative → preposition
|
|
105
|
+
disambiguator.disambiguate("á", null, "borðinu");
|
|
106
|
+
// → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
|
|
151
107
|
```
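The idea behind case government can be sketched in a few lines. This is a toy model, not the library's implementation: the `governs` table, the `Candidate` shape, and `resolve` are assumptions made for illustration, using the README's tag codes (`þf` accusative, `þgf` dative, `fs` preposition).

```typescript
// Case codes follow the README's morph tags: "þf" = accusative, "þgf" = dative.
type Case = "nf" | "þf" | "þgf" | "ef";

interface Candidate {
  lemma: string;
  pos: string; // e.g. "fs" = preposition, "so" = verb, "no" = noun
}

// Hypothetical rule table: which cases each preposition can govern.
const governs = new Map<string, Case[]>([["á", ["þf", "þgf"]]]);

// Prefer the preposition reading when the next word's case is one the
// preposition governs; otherwise keep every candidate.
function resolve(
  word: string,
  candidates: Candidate[],
  nextWordCase: Case | null
): Candidate[] {
  const allowed = governs.get(word) ?? [];
  if (nextWordCase !== null && allowed.indexOf(nextWordCase) !== -1) {
    const prepositions = candidates.filter((c) => c.pos === "fs");
    if (prepositions.length > 0) return prepositions;
  }
  return candidates;
}

const candidatesForA: Candidate[] = [
  { lemma: "á", pos: "fs" },
  { lemma: "á", pos: "no" },
  { lemma: "eiga", pos: "so" },
];

// "á borðinu": the following noun is dative, so the preposition reading wins.
console.log(resolve("á", candidatesForA, "þgf"));
```

When the following word's case is unknown or not governed, the sketch returns every candidate unchanged, which matches the recall-first behaviour described elsewhere in this README.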
|
|
152
108
|
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
## Compound Word Splitting
|
|
109
|
+
### Compound Splitting
|
|
156
110
|
|
|
157
|
-
Icelandic forms long compounds.
|
|
111
|
+
Icelandic forms long compounds. Split them for better search coverage:
|
|
158
112
|
|
|
159
113
|
```typescript
|
|
160
114
|
import { CompoundSplitter, createKnownLemmaSet } from "lemma-is";
|
|
161
115
|
|
|
116
|
+
const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
|
|
162
117
|
const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
|
|
163
118
|
|
|
164
|
-
splitter.split("bílstjóri");
|
|
165
|
-
// → { isCompound: true, parts: ["bíll", "stjóri"] }
|
|
166
|
-
// "car driver" = "car" + "driver"
|
|
167
|
-
|
|
168
119
|
splitter.split("landbúnaðarráðherra");
|
|
169
120
|
// → { isCompound: true, parts: ["landbúnaður", "ráðherra"] }
|
|
170
|
-
// "agriculture minister"
|
|
171
|
-
|
|
172
|
-
splitter.split("húsnæðislán");
|
|
173
|
-
// → { isCompound: true, parts: ["húsnæði", "lán"] }
|
|
174
|
-
// "housing loan" = "housing" + "loan"
|
|
175
|
-
```
|
|
176
|
-
|
|
177
|
-
### Indexing Compounds
|
|
178
|
-
|
|
179
|
-
`getAllLemmas` returns the original word plus its parts—maximizing search recall:
|
|
180
|
-
|
|
181
|
-
```typescript
|
|
182
|
-
splitter.getAllLemmas("bílstjóri");
|
|
183
|
-
// → ["bílstjóri", "bíll", "stjóri"]
|
|
184
|
-
```
|
|
185
|
-
|
|
186
|
-
A document mentioning "bílstjóri" is now findable by searching for "bíll" (car).
|
|
187
|
-
|
|
188
|
-
## Full Pipeline
|
|
189
|
-
|
|
190
|
-
For production indexing, combine tokenization, lemmatization, disambiguation, and compound splitting.
|
|
191
|
-
|
|
192
|
-
### What Gets Indexed
|
|
193
|
-
|
|
194
|
-
Here's a real example showing exactly what lemmas are extracted:
|
|
195
|
-
|
|
196
|
-
```typescript
|
|
197
|
-
const text = "Ríkissjóður stendur í blóma ef 27 milljarða arðgreiðsla Íslandsbanka er talin með.";
|
|
198
|
-
|
|
199
|
-
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
200
|
-
bigrams: lemmatizer,
|
|
201
|
-
compoundSplitter: splitter,
|
|
202
|
-
removeStopwords: true,
|
|
203
|
-
});
|
|
204
|
-
|
|
205
|
-
// Indexed lemmas:
|
|
206
|
-
// ✓ ríkissjóður, ríki, sjóður — compound + parts
|
|
207
|
-
// ✓ standa — stendur → standa
|
|
208
|
-
// ✓ blómi — í blóma → blómi
|
|
209
|
-
// ✓ milljarður — milljarða → milljarður
|
|
210
|
-
// ✓ arðgreiðsla, arður, greiðsla — compound + parts
|
|
211
|
-
// ✓ íslandsbanki — proper noun (lowercased)
|
|
212
|
-
// ✓ telja — talin → telja
|
|
213
|
-
//
|
|
214
|
-
// NOT indexed (stopwords removed):
|
|
215
|
-
// ✗ í, ef, er, með
|
|
121
|
+
// "agriculture minister"
|
|
216
122
|
```
|
|
217
123
|
|
|
218
|
-
|
|
219
|
-
|
|
220
|
-
### Another Example: Job Posting
|
|
221
|
-
|
|
222
|
-
```typescript
|
|
223
|
-
const posting = "Við leitum að reyndum kennurum til starfa í Reykjavík.";
|
|
224
|
-
|
|
225
|
-
const lemmas = extractIndexableLemmas(posting, lemmatizer, {
|
|
226
|
-
bigrams: lemmatizer,
|
|
227
|
-
removeStopwords: true,
|
|
228
|
-
});
|
|
229
|
-
|
|
230
|
-
// Indexed:
|
|
231
|
-
// ✓ leita, leit — leitum → leita (+ noun variant)
|
|
232
|
-
// ✓ reyndur, reynd — reyndum → reyndur
|
|
233
|
-
// ✓ kennari — kennurum → kennari
|
|
234
|
-
// ✓ starf, starfa — starfa (noun + verb)
|
|
235
|
-
// ✓ reykjavík — place name (lowercased)
|
|
236
|
-
//
|
|
237
|
-
// NOT indexed:
|
|
238
|
-
// ✗ við, að, til, í — stopwords
|
|
239
|
-
```
|
|
240
|
-
|
|
241
|
-
A search for "kennari" finds this job posting even though the word "kennari" never appears—only "kennurum" (dative plural).
|
|
242
|
-
|
|
243
|
-
### Complex Sentence
|
|
244
|
-
|
|
245
|
-
```typescript
|
|
246
|
-
const text = "Löngu áður en Jón borðaði ísinn sem hafði bráðnað hratt " +
|
|
247
|
-
"fór ég á veitingastaðinn og keypti mér rauðvín með hamborgaranum.";
|
|
248
|
-
|
|
249
|
-
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
250
|
-
bigrams: lemmatizer,
|
|
251
|
-
compoundSplitter: splitter,
|
|
252
|
-
removeStopwords: true,
|
|
253
|
-
});
|
|
254
|
-
|
|
255
|
-
// Verbs (various tenses/persons):
|
|
256
|
-
// ✓ borða — borðaði (past)
|
|
257
|
-
// ✓ bráðna — bráðnað (past participle)
|
|
258
|
-
// ✓ fara — fór (past, different stem!)
|
|
259
|
-
// ✓ kaupa — keypti (past)
|
|
260
|
-
//
|
|
261
|
-
// Nouns with articles:
|
|
262
|
-
// ✓ ís — ísinn (NOT "Ísland"!)
|
|
263
|
-
// ✓ veitingastaður, veiting, staður — compound
|
|
264
|
-
// ✓ rauðvín
|
|
265
|
-
// ✓ hamborgari — hamborgaranum (dative + article)
|
|
266
|
-
```
|
|
124
|
+
### Full Pipeline
|
|
267
125
|
|
|
268
|
-
|
|
126
|
+
For production indexing, combine everything:
|
|
269
127
|
|
|
270
128
|
```typescript
|
|
271
|
-
import {
|
|
272
|
-
import {
|
|
273
|
-
BinaryLemmatizer,
|
|
274
|
-
extractIndexableLemmas,
|
|
275
|
-
CompoundSplitter,
|
|
276
|
-
createKnownLemmaSet
|
|
277
|
-
} from "lemma-is";
|
|
129
|
+
import { extractIndexableLemmas, CompoundSplitter, createKnownLemmaSet } from "lemma-is";
|
|
278
130
|
|
|
279
|
-
const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
|
|
280
|
-
const lemmatizer = BinaryLemmatizer.loadFromBuffer(
|
|
281
|
-
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
282
|
-
);
|
|
283
131
|
const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
|
|
284
132
|
const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
|
|
285
|
-
```
|
|
286
133
|
|
|
287
|
-
|
|
134
|
+
const text = "Ríkissjóður stendur í blóma ef milljarða arðgreiðsla er talin með.";
|
|
288
135
|
|
|
289
|
-
The defaults favor **recall over precision**—better for search where missing results is worse than extra results:
|
|
290
|
-
|
|
291
|
-
```typescript
|
|
292
136
|
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
293
137
|
bigrams: lemmatizer,
|
|
294
138
|
compoundSplitter: splitter,
|
|
295
|
-
|
|
296
|
-
// indexAllCandidates: true — indexes ALL lemma candidates
|
|
297
|
-
// alwaysTryCompounds: true — splits compounds even if known in BÍN
|
|
298
|
-
});
|
|
299
|
-
```
|
|
300
|
-
|
|
301
|
-
With these defaults:
|
|
302
|
-
- `"á"` → indexes both `"á"` (preposition) AND `"eiga"` (verb)
|
|
303
|
-
- `"húsnæðislán"` → indexes `"húsnæðislán"`, `"húsnæði"`, AND `"lán"`
|
|
304
|
-
|
|
305
|
-
### Precision Mode
|
|
306
|
-
|
|
307
|
-
If you need only the most likely lemma (chatbots, translation), disable the search optimizations:
|
|
308
|
-
|
|
309
|
-
```typescript
|
|
310
|
-
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
311
|
-
bigrams: lemmatizer,
|
|
312
|
-
compoundSplitter: splitter,
|
|
313
|
-
indexAllCandidates: false, // only disambiguated lemma
|
|
314
|
-
alwaysTryCompounds: false, // only split unknown words
|
|
139
|
+
removeStopwords: true,
|
|
315
140
|
});
|
|
316
|
-
```
|
|
317
|
-
|
|
318
|
-
## Word Classes
|
|
319
|
-
|
|
320
|
-
Filter by part of speech when context is known:
|
|
321
|
-
|
|
322
|
-
```typescript
|
|
323
|
-
lemmatizer.lemmatize("á", { wordClass: "so" }); // → ["eiga"] (verbs only)
|
|
324
|
-
lemmatizer.lemmatize("á", { wordClass: "fs" }); // → ["á"] (prepositions only)
|
|
325
|
-
|
|
326
|
-
lemmatizer.lemmatizeWithPOS("á");
|
|
327
|
-
// → [
|
|
328
|
-
// { lemma: "á", pos: "fs" }, // preposition
|
|
329
|
-
// { lemma: "á", pos: "no" }, // noun (river)
|
|
330
|
-
// { lemma: "eiga", pos: "so" } // verb
|
|
331
|
-
// ]
|
|
332
|
-
```
|
|
333
|
-
|
|
334
|
-
| Code | Icelandic | English |
|
|
335
|
-
|------|-----------|---------|
|
|
336
|
-
| `no` | nafnorð | noun |
|
|
337
|
-
| `so` | sagnorð | verb |
|
|
338
|
-
| `lo` | lýsingarorð | adjective |
|
|
339
|
-
| `ao` | atviksorð | adverb |
|
|
340
|
-
| `fs` | forsetning | preposition |
|
|
341
|
-
| `fn` | fornafn | pronoun |
|
|
342
|
-
|
|
343
|
-
## Data
|
|
344
141
|
|
|
345
|
-
|
|
346
|
-
|
|
347
|
-
|
|
348
|
-
- 289K lemmas from BÍN
|
|
349
|
-
- 3M word form mappings
|
|
350
|
-
- 414K bigram frequencies
|
|
351
|
-
- Morphological features (case, gender, number) per word form
|
|
352
|
-
|
|
353
|
-
Uses ArrayBuffer with binary search for efficient memory usage. Format version 2 includes packed morphological data.
|
|
354
|
-
|
|
355
|
-
### Building Data
|
|
356
|
-
|
|
357
|
-
```bash
|
|
358
|
-
# Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
|
|
359
|
-
# Extract SHsnid.csv to data/
|
|
360
|
-
|
|
361
|
-
uv run python scripts/build-binary.py # builds lemma-is.bin with morph features
|
|
142
|
+
// Indexed: ríkissjóður, ríki, sjóður, standa, blómi, milljarður,
|
|
143
|
+
// arðgreiðsla, arður, greiðsla, telja
|
|
144
|
+
// Stopwords removed: í, ef, er, með
|
|
362
145
|
```
|
|
363
146
|
|
|
364
|
-
|
|
365
|
-
|
|
366
|
-
```typescript
|
|
367
|
-
import { readFileSync } from "fs";
|
|
368
|
-
import { BinaryLemmatizer } from "lemma-is";
|
|
369
|
-
|
|
370
|
-
const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
|
|
371
|
-
const lemmatizer = BinaryLemmatizer.loadFromBuffer(
|
|
372
|
-
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
373
|
-
);
|
|
374
|
-
```
|
|
147
|
+
A search for "sjóður" or "arður" now finds this document.
|
|
375
148
|
|
|
376
149
|
## PostgreSQL Full-Text Search
|
|
377
150
|
|
|
378
|
-
PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process
|
|
379
|
-
|
|
380
|
-
```sql
|
|
381
|
-
CREATE TABLE documents (
|
|
382
|
-
id SERIAL PRIMARY KEY,
|
|
383
|
-
title TEXT,
|
|
384
|
-
body TEXT,
|
|
385
|
-
search_vector TSVECTOR
|
|
386
|
-
);
|
|
387
|
-
CREATE INDEX documents_search_idx ON documents USING GIN (search_vector);
|
|
388
|
-
```
|
|
389
|
-
|
|
390
|
-
Lemmatize in your app, store as space-separated string:
|
|
151
|
+
PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process:
|
|
391
152
|
|
|
392
153
|
```typescript
|
|
393
154
|
const lemmas = extractIndexableLemmas(text, lemmatizer, { removeStopwords: true });
|
|
@@ -399,160 +160,75 @@ await db.query(
|
|
|
399
160
|
);
|
|
400
161
|
```
|
|
401
162
|
|
|
402
|
-
|
|
163
|
+
Use the `simple` configuration—it lowercases but doesn't stem, since our lemmas are already normalized.
|
|
403
164
|
|
|
404
|
-
|
|
405
|
-
const lemmas = extractIndexableLemmas(query, lemmatizer);
|
|
406
|
-
|
|
407
|
-
const results = await db.query(
|
|
408
|
-
`SELECT *, ts_rank(search_vector, q) AS rank
|
|
409
|
-
FROM documents, plainto_tsquery('simple', $1) q
|
|
410
|
-
WHERE search_vector @@ q
|
|
411
|
-
ORDER BY rank DESC`,
|
|
412
|
-
[Array.from(lemmas).join(" ")]
|
|
413
|
-
);
|
|
414
|
-
|
|
415
|
-
// User searches "börnum" → lemmatized to "barn" → matches all forms
|
|
416
|
-
```
|
|
417
|
-
|
|
418
|
-
**Why `simple`?** It lowercases but doesn't stem—our lemmas are already normalized. Use `setweight()` to boost title matches over body.
|
|
419
|
-
|
|
420
|
-
**Diacritics:** PostgreSQL's `unaccent` extension strips accents, but **don't use it for Icelandic**. Characters like á, ö, þ, ð are distinct letters, not accented variants. "á" (river/on/owns) ≠ "a". Preserve diacritics for correct matching.
|
|
165
|
+
**Important:** Don't use PostgreSQL's `unaccent` extension for Icelandic. Characters like á, ö, þ, ð are distinct letters, not accented variants.
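A small helper can prepare the lemmas for `to_tsvector('simple', ...)`. This is a sketch under assumptions: `toTsvectorInput` is a hypothetical name, and the `documents` / `search_vector` names in the SQL string are placeholders for your own schema.

```typescript
// Turn extracted lemmas into the text passed to to_tsvector('simple', ...).
// Assumption: lemmas arrive as any iterable of strings (array or Set).
function toTsvectorInput(lemmas: Iterable<string>): string {
  const unique = new Set<string>();
  Array.from(lemmas).forEach((lemma) => {
    const trimmed = lemma.trim();
    // Keep á, ö, þ, ð untouched: they are distinct letters in Icelandic.
    if (trimmed.length > 0) unique.add(trimmed);
  });
  return Array.from(unique).join(" ");
}

// Hypothetical statement; table and column names are illustrative only.
const updateSql =
  "UPDATE documents SET search_vector = to_tsvector('simple', $1) WHERE id = $2";

console.log(toTsvectorInput(["barn", "fara", "barn", "á"])); // → "barn fara á"
```

The helper only deduplicates and joins. It deliberately does no accent stripping, so "á" stays distinct from "a" in the stored vector.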
|
|
421
166
|
|
|
422
167
|
## Limitations
|
|
423
168
|
|
|
424
|
-
This
|
|
169
|
+
This is an early effort with known limitations.
|
|
425
170
|
|
|
426
171
|
### File Size
|
|
427
172
|
|
|
428
|
-
The binary is
|
|
173
|
+
The binary is **91 MB**. This targets Node.js servers where data loads once at startup. Not recommended for:
|
|
429
174
|
|
|
430
|
-
|
|
431
|
-
- **
|
|
432
|
-
- **Browser/Web Workers** — download size prohibitive for most users
|
|
175
|
+
- **Serverless/edge** — cold start loading 91 MB may be slow
|
|
176
|
+
- **Browser** — download size prohibitive
|
|
433
177
|
- **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
|
|
434
178
|
|
|
435
|
-
For browser
|
|
436
|
-
|
|
437
|
-
### No Query Expansion
|
|
438
|
-
|
|
439
|
-
You can go **word → lemma** but not **lemma → words**:
|
|
179
|
+
For browser apps, run lemmatization server-side.
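On a server, the usual way to pay the load cost only once is to memoize the loader. The sketch below uses a stub in place of the real binary; `once` and `getLemmatizer` are illustrative names, and a real loader would read the file and call `BinaryLemmatizer.loadFromBuffer(...)` as shown in the Quick Start.

```typescript
// Wrap an expensive loader so it runs at most once per process.
function once<T>(load: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) cached = { value: load() };
    return cached.value;
  };
}

// Stand-in loader that counts how often it runs (stub, not the real binary).
let loads = 0;
const getLemmatizer = once(() => {
  loads += 1;
  return { name: "stub lemmatizer" };
});

getLemmatizer();
getLemmatizer();
console.log(loads); // 1: the loader ran only on the first call
```

Request handlers can then call `getLemmatizer()` freely; only the first call does the loading work.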
|
|
440
180
|
|
|
441
|
-
|
|
442
|
-
lemmatizer.lemmatize("hestinum"); // → ["hestur"] ✓
|
|
443
|
-
|
|
444
|
-
// But you CANNOT do:
|
|
445
|
-
lemmatizer.expand("hestur");
|
|
446
|
-
// → ["hestur", "hest", "hesti", "hests", "hestinn", "hestinum", ...] ✗
|
|
447
|
-
```
|
|
181
|
+
### Not a Parser
|
|
448
182
|
|
|
449
|
-
This
|
|
183
|
+
This is a lookup table with shallow grammar rules, not a grammatical parser. It doesn't understand sentence structure, named entities, or semantic meaning. The grammar rules help with common patterns but can't handle all disambiguation.
|
|
450
184
|
|
|
451
|
-
|
|
185
|
+
For applications needing full grammatical analysis, use [GreynirEngine](https://github.com/mideind/GreynirEngine).
|
|
452
186
|
|
|
453
187
|
### Disambiguation Limits
|
|
454
188
|
|
|
455
|
-
Bigram disambiguation only works when the word pair exists in
|
|
456
|
-
|
|
457
|
-
```typescript
|
|
458
|
-
// Common phrase: bigrams help
|
|
459
|
-
processText("við erum", lemmatizer, { bigrams: lemmatizer });
|
|
460
|
-
// → "við" disambiguated to "ég" (we) with high confidence
|
|
461
|
-
|
|
462
|
-
// Rare/unusual phrase: no bigram data
|
|
463
|
-
processText("við flæktumst", lemmatizer, { bigrams: lemmatizer });
|
|
464
|
-
// → "við" picks first candidate, low confidence
|
|
465
|
-
```
|
|
466
|
-
|
|
467
|
-
Without context, ambiguous words fall back to arbitrary ordering:
|
|
189
|
+
Bigram disambiguation only works when the word pair exists in corpus data. Without context, ambiguous words return all candidates:
|
|
468
190
|
|
|
469
191
|
```typescript
|
|
470
|
-
// Single word, no context
|
|
471
192
|
lemmatizer.lemmatize("á");
|
|
472
|
-
// → ["á", "eiga"] —
|
|
473
|
-
|
|
474
|
-
// The preposition "á" is ~100x more common than verb "eiga" in this form,
|
|
475
|
-
// but we don't have unigram frequencies to use as tiebreaker.
|
|
476
|
-
```
|
|
477
|
-
|
|
478
|
-
**For search indexing:** Use `indexAllCandidates: true` to index all lemmas and let ranking sort out relevance. For applications needing precision (chatbots, translation), use GreynirEngine instead.
|
|
479
|
-
|
|
480
|
-
### Compound Splitting Heuristics
|
|
481
|
-
|
|
482
|
-
The splitter uses simple rules that miss edge cases:
|
|
483
|
-
|
|
484
|
-
**Three-part compounds only split once:**
|
|
485
|
-
```typescript
|
|
486
|
-
splitter.split("þjóðmálaráðherra");
|
|
487
|
-
// → ["þjóðmál", "ráðherra"] — missing "þjóð" as separate part
|
|
488
|
-
// Ideal: ["þjóð", "mál", "ráðherra"]
|
|
489
|
-
```
|
|
490
|
-
|
|
491
|
-
**Inflected first parts may not match:**
|
|
492
|
-
```typescript
|
|
493
|
-
splitter.split("húseignir");
|
|
494
|
-
// → { isCompound: false } — "hús" appears as "hús" not "húsa"
|
|
495
|
-
// The compound IS "hús" + "eignir" but heuristics miss it
|
|
193
|
+
// → ["á", "eiga"] — no way to know which is more likely
|
|
496
194
|
```
|
|
497
195
|
|
|
498
|
-
|
|
499
|
-
```typescript
|
|
500
|
-
splitter.split("landsins");
|
|
501
|
-
// This is NOT a compound — it's "land" + genitive suffix "-sins"
|
|
502
|
-
// Correctly returns { isCompound: false }, but edge cases exist
|
|
503
|
-
```
|
|
504
|
-
|
|
505
|
-
**Mitigations:**
|
|
506
|
-
- Use `alwaysTryCompounds: true` to split even known words
|
|
507
|
-
- Use `minPartLength: 2` in CompoundSplitter for more aggressive splitting
|
|
508
|
-
- Over-indexing is usually better than under-indexing for search
|
|
509
|
-
|
|
510
|
-
### Not a Parser
|
|
511
|
-
|
|
512
|
-
This is a lookup table with shallow grammar rules, not a full grammatical parser. It doesn't understand:
|
|
196
|
+
For search indexing, use `indexAllCandidates: true` (the default) to index all lemmas.
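The fallback behaviour can be illustrated with a toy bigram table. Everything here is an assumption for illustration: the counts are invented, the `"lemma nextWord"` key format is hypothetical, and `pickByBigram` is not a lemma-is function.

```typescript
// Toy bigram counts keyed by "lemma nextWord"; the numbers are invented.
const bigramCounts = new Map<string, number>([
  ["ég erum", 120],
  ["við erum", 3],
]);

// Return the best-scoring candidate if any bigram was seen, else all candidates.
function pickByBigram(candidates: string[], nextWord: string | null): string[] {
  if (nextWord === null) return candidates;
  let best: string | null = null;
  let bestCount = 0;
  for (const lemma of candidates) {
    const count = bigramCounts.get(`${lemma} ${nextWord}`) ?? 0;
    if (count > bestCount) {
      best = lemma;
      bestCount = count;
    }
  }
  return best === null ? candidates : [best];
}

console.log(pickByBigram(["við", "ég", "viður"], "erum")); // ["ég"]
console.log(pickByBigram(["við", "ég", "viður"], "flæktumst")); // all three candidates
```

When no pair is found, returning every candidate keeps recall high at the cost of some extra index entries.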
|
|
513
197
|
|
|
514
|
-
|
|
515
|
-
- Complex verb argument frames
|
|
516
|
-
- Named entity recognition (people, places, companies)
|
|
517
|
-
- Semantic meaning or word sense
|
|
198
|
+
### No Query Expansion
|
|
518
199
|
|
|
519
|
-
|
|
200
|
+
You can go word → lemma but not lemma → words. This affects search result highlighting—if a user searches "hestur" and the document contains "hestinum", you can't easily highlight the match.
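One possible workaround for highlighting is to lemmatize the document's own tokens and compare lemma sets, rather than expanding the query. The sketch below assumes a `lemmatize` callback with the shape `(word) => string[]`; `highlight` and the toy lemmatizer are hypothetical, and the tokenizer is a simplistic regex rather than a real Icelandic tokenizer.

```typescript
type Lemmatize = (word: string) => string[];

// Simplistic tokenizer: runs of characters that are not whitespace or punctuation.
const TOKEN = /[^\s.,;:!?()"]+/g;

// Wrap every token whose lemmas overlap the query's lemmas in <mark> tags.
function highlight(text: string, query: string, lemmatize: Lemmatize): string {
  const queryLemmas = new Set<string>();
  const queryWords: string[] = query.match(TOKEN) ?? [];
  queryWords.forEach((w) =>
    lemmatize(w.toLowerCase()).forEach((l) => queryLemmas.add(l))
  );
  return text.replace(TOKEN, (token) => {
    const matches = lemmatize(token.toLowerCase()).some((l) => queryLemmas.has(l));
    return matches ? `<mark>${token}</mark>` : token;
  });
}

// Toy lemmatizer for the demo; a real one would come from lemma-is.
const toyForms = new Map<string, string[]>([
  ["hestinum", ["hestur"]],
  ["hestur", ["hestur"]],
]);
const toyLemmatize: Lemmatize = (w) => toyForms.get(w) ?? [w];

console.log(highlight("Hún gaf hestinum hey.", "hestur", toyLemmatize));
// → "Hún gaf <mark>hestinum</mark> hey."
```

This costs one lemmatization pass over each displayed snippet, which is usually acceptable because only the top results need highlighting.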
|
|
520
201
|
|
|
521
|
-
##
|
|
202
|
+
## Data
|
|
522
203
|
|
|
523
|
-
|
|
204
|
+
Single binary file containing:
|
|
205
|
+
- 289K lemmas from BÍN
|
|
206
|
+
- 3M word form mappings
|
|
207
|
+
- 414K bigram frequencies
|
|
208
|
+
- Morphological features per word form
|
|
524
209
|
|
|
525
|
-
|
|
210
|
+
### Building Data
|
|
526
211
|
|
|
527
212
|
```bash
|
|
528
|
-
|
|
529
|
-
|
|
530
|
-
npx vitest run --update # update snapshots
|
|
531
|
-
```
|
|
213
|
+
# Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
|
|
214
|
+
# Extract SHsnid.csv to data/
|
|
532
215
|
|
|
533
|
-
|
|
534
|
-
|
|
535
|
-
- `compounds.test.ts` — Compound word splitting
|
|
536
|
-
- `integration.test.ts` — Full pipeline, search indexing options
|
|
537
|
-
- `pipeline-greynir.test.ts` — Full pipeline with Greynir test sentences
|
|
538
|
-
- `benchmark.test.ts` — Performance and metrics snapshots
|
|
539
|
-
- `icelandic-tricky.test.ts` — Edge cases, morphology examples
|
|
540
|
-
- `limitations.test.ts` — Documented limitations and research notes
|
|
541
|
-
- `mini-grammar.test.ts` — Grammar rules and case government
|
|
216
|
+
uv run python scripts/build-binary.py
|
|
217
|
+
```
|
|
542
218
|
|
|
543
|
-
|
|
219
|
+
## Development
|
|
544
220
|
|
|
545
221
|
```bash
|
|
222
|
+
pnpm test # run tests
|
|
546
223
|
pnpm build # build dist/
|
|
547
|
-
pnpm typecheck # type check
|
|
548
|
-
pnpm build:data # rebuild binary from BÍN source
|
|
224
|
+
pnpm typecheck # type check
|
|
549
225
|
```
|
|
550
226
|
|
|
551
227
|
## Acknowledgments
|
|
552
228
|
|
|
553
|
-
- **[BÍN](https://bin.arnastofnun.is/)**
|
|
554
|
-
- **[Miðeind](https://mideind.is/)**
|
|
555
|
-
- **[tokenize-is](https://github.com/axelharri/tokenize-is)**
|
|
229
|
+
- **[BÍN](https://bin.arnastofnun.is/)** — Morphological database from the Árni Magnússon Institute
|
|
230
|
+
- **[Miðeind](https://mideind.is/)** — GreynirEngine and foundational Icelandic NLP work
|
|
231
|
+
- **[tokenize-is](https://github.com/axelharri/tokenize-is)** — Icelandic tokenizer
|
|
556
232
|
|
|
557
233
|
## License
|
|
558
234
|
|
|
@@ -560,10 +236,10 @@ MIT for the code.
|
|
|
560
236
|
|
|
561
237
|
### Data License (BÍN)
|
|
562
238
|
|
|
563
|
-
The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/)
|
|
239
|
+
The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) © Árni Magnússon Institute for Icelandic Studies.
|
|
564
240
|
|
|
565
241
|
**By using this package, you agree to BÍN's conditions:**
|
|
566
|
-
- Credit the Árni Magnússon Institute in your product
|
|
242
|
+
- Credit the Árni Magnússon Institute in your product
|
|
567
243
|
- Do not redistribute the raw data separately
|
|
568
244
|
- Do not publish inflection paradigms without permission
|
|
569
245
|
|