lemma-is 0.2.3 → 0.4.0
- package/README.md +127 -424
- package/data-dist/lemma-is.core.bin +0 -0
- package/dist/index.d.mts +38 -7
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +1 -1
- package/dist/index.mjs.map +1 -1
- package/package.json +5 -2
- package/data-dist/lemma-is.bin +0 -0
package/README.md
CHANGED
|
@@ -1,393 +1,154 @@
|
|
|
1
1
|
# lemma-is
|
|
2
2
|
|
|
3
|
-
Icelandic lemmatization for JavaScript.
|
|
3
|
+
Fast Icelandic lemmatization for JavaScript. Built for search indexing.
|
|
4
4
|
|
|
5
|
-
## Why?
|
|
6
|
-
|
|
7
|
-
Existing Icelandic NLP tools are Python/C++:
|
|
8
|
-
|
|
9
|
-
| Tool | Runtime | Standalone? | Notes |
|
|
10
|
-
|------|---------|-------------|-------|
|
|
11
|
-
| **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger. |
|
|
12
|
-
| **[Nefnir](https://github.com/lexis-project/Nefnir)** | Python | ✗ | Requires POS tags from IceNLP/IceStagger (Java, unmaintained). |
|
|
13
|
-
| **lemma-is** | TypeScript | ✓ | Node.js servers. Grammar-based disambiguation, compound splitting. |
|
|
14
|
-
|
|
15
|
-
lemma-is trades parsing accuracy for JS ecosystem integration—good enough for search indexing, runs in any Node.js environment.
|
|
16
|
-
|
|
17
|
-
## Quickstart
|
|
18
|
-
|
|
19
|
-
```bash
|
|
20
|
-
npm install lemma-is
|
|
21
|
-
```
|
|
22
|
-
|
|
23
|
-
**Node.js**:
|
|
24
5
|
```typescript
|
|
25
|
-
import { readFileSync } from "fs";
|
|
26
6
|
import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
|
|
27
7
|
|
|
28
|
-
//
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
32
|
-
);
|
|
33
|
-
|
|
34
|
-
lemmatizer.lemmatize("börnin"); // → ["barn"]
|
|
35
|
-
lemmatizer.lemmatize("fóru"); // → ["fara", "fóra"]
|
|
8
|
+
lemmatizer.lemmatize("börnin"); // → ["barn"]
|
|
9
|
+
lemmatizer.lemmatize("keypti"); // → ["kaupa"]
|
|
10
|
+
lemmatizer.lemmatize("hestinum"); // → ["hestur"]
|
|
36
11
|
|
|
37
|
-
// Full pipeline for search
|
|
38
|
-
|
|
39
|
-
// → ["barn", "
|
|
12
|
+
// Full pipeline for search
|
|
13
|
+
extractIndexableLemmas("Börnin keypti hestinn", lemmatizer);
|
|
14
|
+
// → ["barn", "kaupa", "hestur"]
|
|
40
15
|
```
|
|
41
16
|
|
|
42
17
|
## The Problem
|
|
43
18
|
|
|
44
|
-
Icelandic is
|
|
19
|
+
Icelandic is heavily inflected. A single noun like "hestur" (horse) has 16 forms:
|
|
45
20
|
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
| barn (child) | börnin, barnið, barna, börnum... |
|
|
50
|
-
| fara (go) | fór, fer, förum, fóru, farið... |
|
|
51
|
-
| kona (woman) | konuna, konunni, kvenna, konum... |
|
|
21
|
+
```
|
|
22
|
+
hestur, hest, hesti, hests, hestar, hesta, hestum, hestanna...
|
|
23
|
+
```
|
|
52
24
|
|
|
53
|
-
If
|
|
25
|
+
If a user searches "hestur" but your document contains "hestinum", they won't find it—unless you normalize both to the lemma at index time.
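The idea of normalizing both sides to the lemma can be sketched as a tiny inverted index. This is an illustration only: `toyLemmas`, `lemmatize`, `buildIndex`, and `search` are hypothetical stand-ins written for this example, not part of the lemma-is API, and the three-entry lookup table replaces the real binary.

```typescript
// Hypothetical stand-in for a lemmatizer: inflected form → lemma(s).
const toyLemmas: Record<string, string[]> = {
  hestur: ["hestur"],
  hestinum: ["hestur"],
  börnin: ["barn"],
};

function lemmatize(word: string): string[] {
  const key = word.toLowerCase();
  // Unknown words fall back to their lowercased form.
  return Object.prototype.hasOwnProperty.call(toyLemmas, key) ? toyLemmas[key] : [key];
}

// Inverted index: lemma → ids of documents containing any form of that lemma.
function buildIndex(docs: string[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  docs.forEach((doc, id) => {
    for (const token of doc.split(/\s+/).filter(Boolean)) {
      for (const lemma of lemmatize(token)) {
        if (!index.has(lemma)) index.set(lemma, new Set<number>());
        index.get(lemma)!.add(id);
      }
    }
  });
  return index;
}

// The query goes through the same normalization as the documents.
function search(index: Map<string, Set<number>>, query: string): number[] {
  const hits = new Set<number>();
  for (const token of query.split(/\s+/).filter(Boolean)) {
    for (const lemma of lemmatize(token)) {
      const ids = index.get(lemma);
      if (ids) ids.forEach((id) => hits.add(id));
    }
  }
  return Array.from(hits).sort((a, b) => a - b);
}

const index = buildIndex(["börnin riðu hestinum", "engin dýr hér"]);
```

With this toy data, `search(index, "hestur")` and `search(index, "hestinum")` both reach the first document, because each side is reduced to the lemma "hestur" before comparison.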
|
|
54
26
|
|
|
55
|
-
##
|
|
27
|
+
## Why lemma-is?
|
|
56
28
|
|
|
57
|
-
|
|
58
|
-
lemmatizer.lemmatize("börnin"); // → ["barn"]
|
|
59
|
-
lemmatizer.lemmatize("fóru"); // → ["fara"]
|
|
60
|
-
lemmatizer.lemmatize("kvenna"); // → ["kona"]
|
|
61
|
-
lemmatizer.lemmatize("hestinum"); // → ["hestur"]
|
|
62
|
-
```
|
|
29
|
+
The gold standard for Icelandic NLP is [GreynirEngine](https://github.com/mideind/GreynirEngine)—a full grammatical parser with excellent accuracy. But it's Python-only, which means you can't run it in Node.js, browsers, or edge runtimes without FFI or a sidecar process.
|
|
63
30
|
|
|
64
|
-
|
|
31
|
+
lemma-is trades parsing accuracy for JavaScript portability. It's a lookup table with shallow grammar rules—good enough for search indexing, runs anywhere Node.js runs.
|
|
65
32
|
|
|
66
|
-
|
|
33
|
+
| | lemma-is | GreynirEngine |
|
|
34
|
+
|---|---|---|
|
|
35
|
+
| **Runtime** | Node, Bun, Deno | Python |
|
|
36
|
+
| **Throughput** | ~250K words/sec | ~25K words/sec |
|
|
37
|
+
| **Cold start** | ~35 ms | ~500 ms |
|
|
38
|
+
| **Memory** | ~185 MB | ~200 MB |
|
|
39
|
+
| **Disambiguation** | Bigrams + grammar rules | Full sentence parsing |
|
|
40
|
+
| **Use case** | Search indexing | NLP analysis |
|
|
67
41
|
|
|
68
|
-
|
|
42
|
+
See [BENCHMARKS.md](./BENCHMARKS.md) for methodology and detailed results.
|
|
43
|
+
|
|
44
|
+
### The Trade-off
|
|
45
|
+
|
|
46
|
+
lemma-is returns **all possible lemmas** for ambiguous words:
|
|
69
47
|
|
|
70
48
|
```typescript
|
|
71
49
|
lemmatizer.lemmatize("á");
|
|
72
50
|
// → ["á", "eiga"]
|
|
73
|
-
//
|
|
74
|
-
// "á" = verb "owns" (from "eiga")
|
|
75
|
-
|
|
76
|
-
lemmatizer.lemmatize("við");
|
|
77
|
-
// → ["við", "ég", "viður"]
|
|
78
|
-
// "við" = preposition "by/at"
|
|
79
|
-
// "við" = pronoun "we" (from "ég")
|
|
80
|
-
// "við" = noun "wood" (from "viður")
|
|
51
|
+
// Could be preposition "on", noun "river", or verb "owns"
|
|
81
52
|
```
|
|
82
53
|
|
|
83
|
-
|
|
54
|
+
GreynirEngine parses the sentence to return the single correct interpretation. For search, returning all candidates is often better—you'd rather show an extra result than miss a relevant document.
|
|
55
|
+
|
|
56
|
+
## Installation
|
|
84
57
|
|
|
85
|
-
|
|
58
|
+
```bash
|
|
59
|
+
npm install lemma-is
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
## Quick Start
|
|
86
63
|
|
|
87
64
|
```typescript
|
|
88
|
-
import {
|
|
65
|
+
import { readFileSync } from "fs";
|
|
66
|
+
import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
|
|
89
67
|
|
|
90
|
-
//
|
|
91
|
-
const
|
|
68
|
+
// Load the core binary (~9-11 MB, low memory, best for browser/edge)
|
|
69
|
+
const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.core.bin");
|
|
70
|
+
const lemmatizer = BinaryLemmatizer.loadFromBuffer(
|
|
71
|
+
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
72
|
+
);
|
|
92
73
|
|
|
93
|
-
//
|
|
94
|
-
|
|
95
|
-
// →
|
|
74
|
+
// Basic lemmatization
|
|
75
|
+
lemmatizer.lemmatize("börnin"); // → ["barn"]
|
|
76
|
+
lemmatizer.lemmatize("fóru"); // → ["fara", "fóra"]
|
|
96
77
|
|
|
97
|
-
//
|
|
98
|
-
|
|
99
|
-
// →
|
|
78
|
+
// Full pipeline for search indexing
|
|
79
|
+
const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
|
|
80
|
+
// → ["barn", "fara", "fóra", "í", "bíó"]
|
|
100
81
|
```
|
|
101
82
|
|
|
83
|
+
## Features
|
|
84
|
+
|
|
102
85
|
### Morphological Features
|
|
103
86
|
|
|
104
87
|
The binary includes case, gender, and number for each word form:
|
|
105
88
|
|
|
106
89
|
```typescript
|
|
107
90
|
lemmatizer.lemmatizeWithMorph("hestinum");
|
|
108
|
-
// → [{
|
|
109
|
-
//
|
|
110
|
-
// pos: "no",
|
|
111
|
-
// morph: { case: "þgf", gender: "kk", number: "et" }
|
|
112
|
-
// }]
|
|
113
|
-
// "hestinum" = dative, masculine, singular
|
|
114
|
-
|
|
115
|
-
lemmatizer.lemmatizeWithMorph("börnum");
|
|
116
|
-
// → [{
|
|
117
|
-
// lemma: "barn",
|
|
118
|
-
// pos: "no",
|
|
119
|
-
// morph: { case: "þgf", gender: "hk", number: "ft" }
|
|
120
|
-
// }]
|
|
121
|
-
// "börnum" = dative, neuter, plural
|
|
91
|
+
// → [{ lemma: "hestur", pos: "no", morph: { case: "þgf", gender: "kk", number: "et" } }]
|
|
92
|
+
// dative, masculine, singular
|
|
122
93
|
```
|
|
123
94
|
|
|
124
|
-
|
|
125
|
-
|------|---------|
|
|
126
|
-
| `nf` | nominative (nefnifall) |
|
|
127
|
-
| `þf` | accusative (þolfall) |
|
|
128
|
-
| `þgf` | dative (þágufall) |
|
|
129
|
-
| `ef` | genitive (eignarfall) |
|
|
130
|
-
| `kk` | masculine (karlkyn) |
|
|
131
|
-
| `kvk` | feminine (kvenkyn) |
|
|
132
|
-
| `hk` | neuter (hvorugkyn) |
|
|
133
|
-
| `et` | singular (eintala) |
|
|
134
|
-
| `ft` | plural (fleirtala) |
|
|
135
|
-
|
|
136
|
-
### Bigram Disambiguation
|
|
95
|
+
### Grammar-Based Disambiguation
|
|
137
96
|
|
|
138
|
-
|
|
97
|
+
Shallow grammar rules use Icelandic case government to disambiguate prepositions:
|
|
139
98
|
|
|
140
99
|
```typescript
|
|
141
|
-
import {
|
|
100
|
+
import { Disambiguator } from "lemma-is";
|
|
142
101
|
|
|
143
|
-
|
|
144
|
-
// "við erum" = "we are" → bigrams favor pronoun "ég" over preposition
|
|
145
|
-
const processed = processText("Við erum hér", lemmatizer, { bigrams: lemmatizer });
|
|
146
|
-
// → disambiguated: "ég" for "við" (high confidence)
|
|
102
|
+
const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
|
|
147
103
|
|
|
148
|
-
// "á
|
|
149
|
-
|
|
150
|
-
// →
|
|
104
|
+
// "á borðinu" - borðinu is dative, á governs dative → preposition
|
|
105
|
+
disambiguator.disambiguate("á", null, "borðinu");
|
|
106
|
+
// → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
|
|
151
107
|
```
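To make the case-government idea concrete, here is a minimal sketch of one such rule. It is an assumption about how a rule of this kind could look, not the library's implementation: the `governs` table, the `Candidate` shape, and `preferPreposition` are all hypothetical names, and the case codes (`þf`, `þgf`, `ef`) and the `fs` tag follow the abbreviations used elsewhere in this README.

```typescript
type Case = "nf" | "þf" | "þgf" | "ef";

interface Candidate {
  lemma: string;
  pos: string;
}

// Hypothetical government table: the cases each preposition can govern.
const governs: Record<string, Case[]> = {
  "á": ["þf", "þgf"], // accusative or dative
  "til": ["ef"],      // genitive
};

// Pick the preposition reading when the following word can be in a case
// the preposition governs; return null when the rule does not apply.
function preferPreposition(
  word: string,
  candidates: Candidate[],
  nextWordCases: Case[],
): Candidate | null {
  const key = word.toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(governs, key)) return null;
  const allowed = governs[key];
  const agrees = nextWordCases.some((c) => allowed.indexOf(c) !== -1);
  if (!agrees) return null;
  const prepositions = candidates.filter((c) => c.pos === "fs");
  return prepositions.length > 0 ? prepositions[0] : null;
}

const candidatesForA: Candidate[] = [
  { lemma: "á", pos: "fs" },
  { lemma: "á", pos: "no" },
  { lemma: "eiga", pos: "so" },
];
```

Calling `preferPreposition("á", candidatesForA, ["þgf"])` selects the preposition reading, while a following nominative (`["nf"]`) leaves the word unresolved so another strategy can decide.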
|
|
152
108
|
|
|
153
|
-
|
|
109
|
+
### Compound Splitting
|
|
154
110
|
|
|
155
|
-
|
|
156
|
-
|
|
157
|
-
Icelandic forms long compounds. The library splits them for better search coverage:
|
|
111
|
+
Icelandic forms long compounds. Split them for better search coverage:
|
|
158
112
|
|
|
159
113
|
```typescript
|
|
160
114
|
import { CompoundSplitter, createKnownLemmaSet } from "lemma-is";
|
|
161
115
|
|
|
116
|
+
const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
|
|
162
117
|
const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
|
|
163
118
|
|
|
164
|
-
splitter.split("bílstjóri");
|
|
165
|
-
// → { isCompound: true, parts: ["bíll", "stjóri"] }
|
|
166
|
-
// "car driver" = "car" + "driver"
|
|
167
|
-
|
|
168
119
|
splitter.split("landbúnaðarráðherra");
|
|
169
120
|
// → { isCompound: true, parts: ["landbúnaður", "ráðherra"] }
|
|
170
|
-
// "agriculture minister"
|
|
171
|
-
|
|
172
|
-
splitter.split("húsnæðislán");
|
|
173
|
-
// → { isCompound: true, parts: ["húsnæði", "lán"] }
|
|
174
|
-
// "housing loan" = "housing" + "loan"
|
|
175
|
-
```
|
|
176
|
-
|
|
177
|
-
### Indexing Compounds
|
|
178
|
-
|
|
179
|
-
`getAllLemmas` returns the original word plus its parts—maximizing search recall:
|
|
180
|
-
|
|
181
|
-
```typescript
|
|
182
|
-
splitter.getAllLemmas("bílstjóri");
|
|
183
|
-
// → ["bílstjóri", "bíll", "stjóri"]
|
|
184
|
-
```
|
|
185
|
-
|
|
186
|
-
A document mentioning "bílstjóri" is now findable by searching for "bíll" (car).
|
|
187
|
-
|
|
188
|
-
## Full Pipeline
|
|
189
|
-
|
|
190
|
-
For production indexing, combine tokenization, lemmatization, disambiguation, and compound splitting.
|
|
191
|
-
|
|
192
|
-
### What Gets Indexed
|
|
193
|
-
|
|
194
|
-
Here's a real example showing exactly what lemmas are extracted:
|
|
195
|
-
|
|
196
|
-
```typescript
|
|
197
|
-
const text = "Ríkissjóður stendur í blóma ef 27 milljarða arðgreiðsla Íslandsbanka er talin með.";
|
|
198
|
-
|
|
199
|
-
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
200
|
-
bigrams: lemmatizer,
|
|
201
|
-
compoundSplitter: splitter,
|
|
202
|
-
removeStopwords: true,
|
|
203
|
-
});
|
|
204
|
-
|
|
205
|
-
// Indexed lemmas:
|
|
206
|
-
// ✓ ríkissjóður, ríki, sjóður — compound + parts
|
|
207
|
-
// ✓ standa — stendur → standa
|
|
208
|
-
// ✓ blómi — í blóma → blómi
|
|
209
|
-
// ✓ milljarður — milljarða → milljarður
|
|
210
|
-
// ✓ arðgreiðsla, arður, greiðsla — compound + parts
|
|
211
|
-
// ✓ íslandsbanki — proper noun (lowercased)
|
|
212
|
-
// ✓ telja — talin → telja
|
|
213
|
-
//
|
|
214
|
-
// NOT indexed (stopwords removed):
|
|
215
|
-
// ✗ í, ef, er, með
|
|
216
|
-
```
|
|
217
|
-
|
|
218
|
-
A search for "sjóður" or "arður" now finds this document about the state treasury and bank dividends.
|
|
219
|
-
|
|
220
|
-
### Another Example: Job Posting
|
|
221
|
-
|
|
222
|
-
```typescript
|
|
223
|
-
const posting = "Við leitum að reyndum kennurum til starfa í Reykjavík.";
|
|
224
|
-
|
|
225
|
-
const lemmas = extractIndexableLemmas(posting, lemmatizer, {
|
|
226
|
-
bigrams: lemmatizer,
|
|
227
|
-
removeStopwords: true,
|
|
228
|
-
});
|
|
229
|
-
|
|
230
|
-
// Indexed:
|
|
231
|
-
// ✓ leita, leit — leitum → leita (+ noun variant)
|
|
232
|
-
// ✓ reyndur, reynd — reyndum → reyndur
|
|
233
|
-
// ✓ kennari — kennurum → kennari
|
|
234
|
-
// ✓ starf, starfa — starfa (noun + verb)
|
|
235
|
-
// ✓ reykjavík — place name (lowercased)
|
|
236
|
-
//
|
|
237
|
-
// NOT indexed:
|
|
238
|
-
// ✗ við, að, til, í — stopwords
|
|
121
|
+
// "agriculture minister"
|
|
239
122
|
```
|
|
240
123
|
|
|
241
|
-
|
|
124
|
+
### Full Pipeline
|
|
242
125
|
|
|
243
|
-
|
|
126
|
+
For production indexing, combine everything:
|
|
244
127
|
|
|
245
128
|
```typescript
|
|
246
|
-
|
|
247
|
-
"fór ég á veitingastaðinn og keypti mér rauðvín með hamborgaranum.";
|
|
248
|
-
|
|
249
|
-
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
250
|
-
bigrams: lemmatizer,
|
|
251
|
-
compoundSplitter: splitter,
|
|
252
|
-
removeStopwords: true,
|
|
253
|
-
});
|
|
254
|
-
|
|
255
|
-
// Verbs (various tenses/persons):
|
|
256
|
-
// ✓ borða — borðaði (past)
|
|
257
|
-
// ✓ bráðna — bráðnað (past participle)
|
|
258
|
-
// ✓ fara — fór (past, different stem!)
|
|
259
|
-
// ✓ kaupa — keypti (past)
|
|
260
|
-
//
|
|
261
|
-
// Nouns with articles:
|
|
262
|
-
// ✓ ís — ísinn (NOT "Ísland"!)
|
|
263
|
-
// ✓ veitingastaður, veiting, staður — compound
|
|
264
|
-
// ✓ rauðvín
|
|
265
|
-
// ✓ hamborgari — hamborgaranum (dative + article)
|
|
266
|
-
```
|
|
129
|
+
import { extractIndexableLemmas, CompoundSplitter, createKnownLemmaSet } from "lemma-is";
|
|
267
130
|
|
|
268
|
-
### Setup
|
|
269
|
-
|
|
270
|
-
```typescript
|
|
271
|
-
import { readFileSync } from "fs";
|
|
272
|
-
import {
|
|
273
|
-
BinaryLemmatizer,
|
|
274
|
-
extractIndexableLemmas,
|
|
275
|
-
CompoundSplitter,
|
|
276
|
-
createKnownLemmaSet
|
|
277
|
-
} from "lemma-is";
|
|
278
|
-
|
|
279
|
-
const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
|
|
280
|
-
const lemmatizer = BinaryLemmatizer.loadFromBuffer(
|
|
281
|
-
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
282
|
-
);
|
|
283
131
|
const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
|
|
284
132
|
const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
|
|
285
|
-
```
|
|
286
|
-
|
|
287
|
-
### Search-Optimized Defaults
|
|
288
|
-
|
|
289
|
-
The defaults favor **recall over precision**—better for search where missing results is worse than extra results:
|
|
290
|
-
|
|
291
|
-
```typescript
|
|
292
|
-
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
293
|
-
bigrams: lemmatizer,
|
|
294
|
-
compoundSplitter: splitter,
|
|
295
|
-
// These are the defaults:
|
|
296
|
-
// indexAllCandidates: true — indexes ALL lemma candidates
|
|
297
|
-
// alwaysTryCompounds: true — splits compounds even if known in BÍN
|
|
298
|
-
});
|
|
299
|
-
```
|
|
300
|
-
|
|
301
|
-
With these defaults:
|
|
302
|
-
- `"á"` → indexes both `"á"` (preposition) AND `"eiga"` (verb)
|
|
303
|
-
- `"húsnæðislán"` → indexes `"húsnæðislán"`, `"húsnæði"`, AND `"lán"`
|
|
304
|
-
|
|
305
|
-
### Precision Mode
|
|
306
133
|
|
|
307
|
-
|
|
134
|
+
const text = "Ríkissjóður stendur í blóma ef milljarða arðgreiðsla er talin með.";
|
|
308
135
|
|
|
309
|
-
```typescript
|
|
310
136
|
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
311
137
|
bigrams: lemmatizer,
|
|
312
138
|
compoundSplitter: splitter,
|
|
313
|
-
|
|
314
|
-
alwaysTryCompounds: false, // only split unknown words
|
|
139
|
+
removeStopwords: true,
|
|
315
140
|
});
|
|
316
|
-
```
|
|
317
|
-
|
|
318
|
-
## Word Classes
|
|
319
|
-
|
|
320
|
-
Filter by part of speech when context is known:
|
|
321
|
-
|
|
322
|
-
```typescript
|
|
323
|
-
lemmatizer.lemmatize("á", { wordClass: "so" }); // → ["eiga"] (verbs only)
|
|
324
|
-
lemmatizer.lemmatize("á", { wordClass: "fs" }); // → ["á"] (prepositions only)
|
|
325
|
-
|
|
326
|
-
lemmatizer.lemmatizeWithPOS("á");
|
|
327
|
-
// → [
|
|
328
|
-
// { lemma: "á", pos: "fs" }, // preposition
|
|
329
|
-
// { lemma: "á", pos: "no" }, // noun (river)
|
|
330
|
-
// { lemma: "eiga", pos: "so" } // verb
|
|
331
|
-
// ]
|
|
332
|
-
```
|
|
333
|
-
|
|
334
|
-
| Code | Icelandic | English |
|
|
335
|
-
|------|-----------|---------|
|
|
336
|
-
| `no` | nafnorð | noun |
|
|
337
|
-
| `so` | sagnorð | verb |
|
|
338
|
-
| `lo` | lýsingarorð | adjective |
|
|
339
|
-
| `ao` | atviksorð | adverb |
|
|
340
|
-
| `fs` | forsetning | preposition |
|
|
341
|
-
| `fn` | fornafn | pronoun |
|
|
342
|
-
|
|
343
|
-
## Data
|
|
344
|
-
|
|
345
|
-
Single binary file: `lemma-is.bin` (~91 MB)
|
|
346
|
-
|
|
347
|
-
Contains:
|
|
348
|
-
- 289K lemmas from BÍN
|
|
349
|
-
- 3M word form mappings
|
|
350
|
-
- 414K bigram frequencies
|
|
351
|
-
- Morphological features (case, gender, number) per word form
|
|
352
|
-
|
|
353
|
-
Uses ArrayBuffer with binary search for efficient memory usage. Format version 2 includes packed morphological data.
|
|
354
|
-
|
|
355
|
-
### Building Data
|
|
356
|
-
|
|
357
|
-
```bash
|
|
358
|
-
# Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
|
|
359
|
-
# Extract SHsnid.csv to data/
|
|
360
141
|
|
|
361
|
-
|
|
142
|
+
// Indexed: ríkissjóður, ríki, sjóður, standa, blómi, milljarður,
|
|
143
|
+
// arðgreiðsla, arður, greiðsla, telja
|
|
144
|
+
// Stopwords removed: í, ef, er, með
|
|
362
145
|
```
|
|
363
146
|
|
|
364
|
-
|
|
365
|
-
|
|
366
|
-
```typescript
|
|
367
|
-
import { readFileSync } from "fs";
|
|
368
|
-
import { BinaryLemmatizer } from "lemma-is";
|
|
369
|
-
|
|
370
|
-
const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
|
|
371
|
-
const lemmatizer = BinaryLemmatizer.loadFromBuffer(
|
|
372
|
-
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
373
|
-
);
|
|
374
|
-
```
|
|
147
|
+
A search for "sjóður" or "arður" now finds this document.
|
|
375
148
|
|
|
376
149
|
## PostgreSQL Full-Text Search
|
|
377
150
|
|
|
378
|
-
PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process
|
|
379
|
-
|
|
380
|
-
```sql
|
|
381
|
-
CREATE TABLE documents (
|
|
382
|
-
id SERIAL PRIMARY KEY,
|
|
383
|
-
title TEXT,
|
|
384
|
-
body TEXT,
|
|
385
|
-
search_vector TSVECTOR
|
|
386
|
-
);
|
|
387
|
-
CREATE INDEX documents_search_idx ON documents USING GIN (search_vector);
|
|
388
|
-
```
|
|
389
|
-
|
|
390
|
-
Lemmatize in your app, store as space-separated string:
|
|
151
|
+
PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process:
|
|
391
152
|
|
|
392
153
|
```typescript
|
|
393
154
|
const lemmas = extractIndexableLemmas(text, lemmatizer, { removeStopwords: true });
|
|
@@ -399,160 +160,102 @@ await db.query(
|
|
|
399
160
|
);
|
|
400
161
|
```
|
|
401
162
|
|
|
402
|
-
|
|
403
|
-
|
|
404
|
-
```typescript
|
|
405
|
-
const lemmas = extractIndexableLemmas(query, lemmatizer);
|
|
406
|
-
|
|
407
|
-
const results = await db.query(
|
|
408
|
-
`SELECT *, ts_rank(search_vector, q) AS rank
|
|
409
|
-
FROM documents, plainto_tsquery('simple', $1) q
|
|
410
|
-
WHERE search_vector @@ q
|
|
411
|
-
ORDER BY rank DESC`,
|
|
412
|
-
[Array.from(lemmas).join(" ")]
|
|
413
|
-
);
|
|
414
|
-
|
|
415
|
-
// User searches "börnum" → lemmatized to "barn" → matches all forms
|
|
416
|
-
```
|
|
417
|
-
|
|
418
|
-
**Why `simple`?** It lowercases but doesn't stem—our lemmas are already normalized. Use `setweight()` to boost title matches over body.
|
|
163
|
+
Use the `simple` configuration—it lowercases but doesn't stem, since our lemmas are already normalized.
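As a rough sketch of how the `simple` configuration fits in, the helpers below only build parameterized SQL; they do not talk to a database. The `documents` table and `search_vector` column come from the earlier README version shown in this diff, while `SqlCommand`, `buildIndexCommand`, and `buildSearchCommand` are hypothetical names introduced for illustration.

```typescript
interface SqlCommand {
  text: string;
  values: Array<string | number>;
}

// Index time: store the lemmas as a tsvector using the 'simple' configuration.
function buildIndexCommand(id: number, lemmas: string[]): SqlCommand {
  return {
    text: "UPDATE documents SET search_vector = to_tsvector('simple', $1) WHERE id = $2",
    values: [lemmas.join(" "), id],
  };
}

// Query time: lemmatize the user's query the same way, then match.
// Note that plainto_tsquery requires every term to be present.
function buildSearchCommand(queryLemmas: string[]): SqlCommand {
  return {
    text: "SELECT id FROM documents WHERE search_vector @@ plainto_tsquery('simple', $1)",
    values: [queryLemmas.join(" ")],
  };
}
```

Each command can be passed to whatever client the application already uses. Both sides must use the same configuration, otherwise index terms and query terms are normalized differently.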
|
|
419
164
|
|
|
420
|
-
**
|
|
165
|
+
**Important:** Don't use PostgreSQL's `unaccent` extension for Icelandic. Characters like á, ö, þ, ð are distinct letters, not accented variants.
|
|
421
166
|
|
|
422
167
|
## Limitations
|
|
423
168
|
|
|
424
|
-
This
|
|
169
|
+
This is an early effort with known limitations.
|
|
425
170
|
|
|
426
171
|
### File Size
|
|
427
172
|
|
|
428
|
-
|
|
173
|
+
There are two binaries:
|
|
429
174
|
|
|
430
|
-
|
|
431
|
-
- **
|
|
432
|
-
- **Browser/Web Workers** — download size prohibitive for most users
|
|
433
|
-
- **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
|
|
175
|
+
- **Core (~9-11 MB)**: default, optimized for browser/edge/cold start
|
|
176
|
+
- **Full (91 MB)**: maximum coverage and disambiguation
|
|
434
177
|
|
|
435
|
-
|
|
178
|
+
The full binary targets Node.js servers where data loads once at startup. Not recommended for:
|
|
436
179
|
|
|
437
|
-
|
|
180
|
+
- **Serverless/edge** — cold start loading 91 MB may be slow
|
|
181
|
+
- **Browser** — download size prohibitive
|
|
182
|
+
- **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
|
|
438
183
|
|
|
439
|
-
|
|
184
|
+
For browser apps, use the **core** binary.
|
|
440
185
|
|
|
441
|
-
|
|
442
|
-
lemmatizer.lemmatize("hestinum"); // → ["hestur"] ✓
|
|
186
|
+
To use the full binary, build it locally:
|
|
443
187
|
|
|
444
|
-
|
|
445
|
-
|
|
446
|
-
// → ["hestur", "hest", "hesti", "hests", "hestinn", "hestinum", ...] ✗
|
|
188
|
+
```bash
|
|
189
|
+
pnpm build:binary
|
|
447
190
|
```
|
|
448
191
|
|
|
449
|
-
|
|
450
|
-
|
|
451
|
-
**Workaround:** Store original text alongside lemmas, use regex patterns for common suffixes.
|
|
192
|
+
Then load it from `data-dist/lemma-is.bin`.
|
|
452
193
|
|
|
453
|
-
###
|
|
194
|
+
### Compact Builds (Browser/Edge)
|
|
454
195
|
|
|
455
|
-
|
|
196
|
+
For cold-start runtimes and the browser, you can build a **compact core** binary that trades accuracy for size by:
|
|
197
|
+
- Keeping only the most frequent word forms
|
|
198
|
+
- Dropping bigram data and morphological features
|
|
456
199
|
|
|
457
|
-
|
|
458
|
-
// Common phrase: bigrams help
|
|
459
|
-
processText("við erum", lemmatizer, { bigrams: lemmatizer });
|
|
460
|
-
// → "við" disambiguated to "ég" (we) with high confidence
|
|
200
|
+
This reduces memory significantly at the cost of recall/precision on rare words.
|
|
461
201
|
|
|
462
|
-
|
|
463
|
-
|
|
464
|
-
// → "við" picks first candidate, low confidence
|
|
202
|
+
```bash
|
|
203
|
+
pnpm build:core
|
|
465
204
|
```
|
|
466
205
|
|
|
467
|
-
|
|
468
|
-
|
|
469
|
-
```typescript
|
|
470
|
-
// Single word, no context
|
|
471
|
-
lemmatizer.lemmatize("á");
|
|
472
|
-
// → ["á", "eiga"] — but which is more likely? No way to know.
|
|
206
|
+
The output is written to `data-dist/lemma-is.core.bin`. Use it exactly like the full binary; it just covers fewer word forms.
|
|
473
207
|
|
|
474
|
-
|
|
475
|
-
// but we don't have unigram frequencies to use as tiebreaker.
|
|
476
|
-
```
|
|
477
|
-
|
|
478
|
-
**For search indexing:** Use `indexAllCandidates: true` to index all lemmas and let ranking sort out relevance. For applications needing precision (chatbots, translation), use GreynirEngine instead.
|
|
208
|
+
### Not a Parser
|
|
479
209
|
|
|
480
|
-
|
|
210
|
+
This is a lookup table with shallow grammar rules, not a grammatical parser. It doesn't understand sentence structure, named entities, or semantic meaning. The grammar rules help with common patterns but can't handle all disambiguation.
|
|
481
211
|
|
|
482
|
-
|
|
212
|
+
For applications needing full grammatical analysis, use [GreynirEngine](https://github.com/mideind/GreynirEngine).
|
|
483
213
|
|
|
484
|
-
|
|
485
|
-
```typescript
|
|
486
|
-
splitter.split("þjóðmálaráðherra");
|
|
487
|
-
// → ["þjóðmál", "ráðherra"] — missing "þjóð" as separate part
|
|
488
|
-
// Ideal: ["þjóð", "mál", "ráðherra"]
|
|
489
|
-
```
|
|
214
|
+
### Disambiguation Limits
|
|
490
215
|
|
|
491
|
-
|
|
492
|
-
```typescript
|
|
493
|
-
splitter.split("húseignir");
|
|
494
|
-
// → { isCompound: false } — "hús" appears as "hús" not "húsa"
|
|
495
|
-
// The compound IS "hús" + "eignir" but heuristics miss it
|
|
496
|
-
```
|
|
216
|
+
Bigram disambiguation only works when the word pair exists in corpus data. Without context, ambiguous words return all candidates:
|
|
497
217
|
|
|
498
|
-
**May over-split valid words:**
|
|
499
218
|
```typescript
|
|
500
|
-
|
|
501
|
-
//
|
|
502
|
-
// Correctly returns { isCompound: false }, but edge cases exist
|
|
219
|
+
lemmatizer.lemmatize("á");
|
|
220
|
+
// → ["á", "eiga"] — no way to know which is more likely
|
|
503
221
|
```
|
|
504
222
|
|
|
505
|
-
|
|
506
|
-
- Use `alwaysTryCompounds: true` to split even known words
|
|
507
|
-
- Use `minPartLength: 2` in CompoundSplitter for more aggressive splitting
|
|
508
|
-
- Over-indexing is usually better than under-indexing for search
|
|
509
|
-
|
|
510
|
-
### Not a Parser
|
|
511
|
-
|
|
512
|
-
This is a lookup table with shallow grammar rules, not a full grammatical parser. It doesn't understand:
|
|
223
|
+
For search indexing, use `indexAllCandidates: true` (the default) to index all lemmas.
|
|
513
224
|
|
|
514
|
-
|
|
515
|
-
- Complex verb argument frames
|
|
516
|
-
- Named entity recognition (people, places, companies)
|
|
517
|
-
- Semantic meaning or word sense
|
|
225
|
+
### No Query Expansion
|
|
518
226
|
|
|
519
|
-
|
|
227
|
+
You can go word → lemma but not lemma → words. This affects search result highlighting—if a user searches "hestur" and the document contains "hestinum", you can't easily highlight the match.
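One possible workaround, sketched here under assumptions, is to avoid lemma → words expansion entirely: lemmatize each token of the displayed text and highlight tokens whose lemmas overlap with the query's lemmas. `toyLemmas`, `lemmatize`, and `highlight` are hypothetical stand-ins for this example, not lemma-is functions, and the sketch ignores punctuation attached to tokens.

```typescript
// Hypothetical stand-in for a lemmatizer: inflected form → lemma(s).
const toyLemmas: Record<string, string[]> = {
  hestur: ["hestur"],
  hestinum: ["hestur"],
};

function lemmatize(word: string): string[] {
  const key = word.toLowerCase();
  return Object.prototype.hasOwnProperty.call(toyLemmas, key) ? toyLemmas[key] : [key];
}

// Wrap every token whose lemma set intersects the query's lemma set.
function highlight(text: string, query: string): string {
  const queryLemmas: string[] = [];
  for (const q of query.split(/\s+/).filter(Boolean)) {
    for (const lemma of lemmatize(q)) queryLemmas.push(lemma);
  }
  return text
    .split(/(\s+)/) // the capture group keeps the whitespace separators
    .map((token) => {
      if (token.trim() === "") return token;
      const hit = lemmatize(token).some((l) => queryLemmas.indexOf(l) !== -1);
      return hit ? `**${token}**` : token;
    })
    .join("");
}
```

This costs one lemmatization pass over the text being displayed, which is usually only the snippet shown to the user rather than the whole document.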
|
|
520
228
|
|
|
521
|
-
##
|
|
229
|
+
## Data
|
|
522
230
|
|
|
523
|
-
|
|
231
|
+
Single binary file containing:
|
|
232
|
+
- 289K lemmas from BÍN
|
|
233
|
+
- 3M word form mappings
|
|
234
|
+
- 414K bigram frequencies
|
|
235
|
+
- Morphological features per word form
|
|
524
236
|
|
|
525
|
-
|
|
237
|
+
### Building Data
|
|
526
238
|
|
|
527
239
|
```bash
|
|
528
|
-
|
|
529
|
-
|
|
530
|
-
npx vitest run --update # update snapshots
|
|
531
|
-
```
|
|
240
|
+
# Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
|
|
241
|
+
# Extract SHsnid.csv to data/
|
|
532
242
|
|
|
533
|
-
|
|
534
|
-
|
|
535
|
-
- `compounds.test.ts` — Compound word splitting
|
|
536
|
-
- `integration.test.ts` — Full pipeline, search indexing options
|
|
537
|
-
- `pipeline-greynir.test.ts` — Full pipeline with Greynir test sentences
|
|
538
|
-
- `benchmark.test.ts` — Performance and metrics snapshots
|
|
539
|
-
- `icelandic-tricky.test.ts` — Edge cases, morphology examples
|
|
540
|
-
- `limitations.test.ts` — Documented limitations and research notes
|
|
541
|
-
- `mini-grammar.test.ts` — Grammar rules and case government
|
|
243
|
+
uv run python scripts/build-binary.py
|
|
244
|
+
```
|
|
542
245
|
|
|
543
|
-
|
|
246
|
+
## Development
|
|
544
247
|
|
|
545
248
|
```bash
|
|
249
|
+
pnpm test # run tests
|
|
546
250
|
pnpm build # build dist/
|
|
547
|
-
pnpm typecheck # type check
|
|
548
|
-
pnpm build:data # rebuild binary from BÍN source
|
|
251
|
+
pnpm typecheck # type check
|
|
549
252
|
```
|
|
550
253
|
|
|
551
254
|
## Acknowledgments
|
|
552
255
|
|
|
553
|
-
- **[BÍN](https://bin.arnastofnun.is/)**
|
|
554
|
-
- **[Miðeind](https://mideind.is/)**
|
|
555
|
-
- **[tokenize-is](https://github.com/axelharri/tokenize-is)**
|
|
256
|
+
- **[BÍN](https://bin.arnastofnun.is/)** — Morphological database from the Árni Magnússon Institute
|
|
257
|
+
- **[Miðeind](https://mideind.is/)** — GreynirEngine and foundational Icelandic NLP work
|
|
258
|
+
- **[tokenize-is](https://github.com/axelharri/tokenize-is)** — Icelandic tokenizer
|
|
556
259
|
|
|
557
260
|
## License
|
|
558
261
|
|
|
@@ -560,10 +263,10 @@ MIT for the code.
|
|
|
560
263
|
|
|
561
264
|
### Data License (BÍN)
|
|
562
265
|
|
|
563
|
-
The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/)
|
|
266
|
+
The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) © Árni Magnússon Institute for Icelandic Studies.
|
|
564
267
|
|
|
565
268
|
**By using this package, you agree to BÍN's conditions:**
|
|
566
|
-
- Credit the Árni Magnússon Institute in your product
|
|
269
|
+
- Credit the Árni Magnússon Institute in your product
|
|
567
270
|
- Do not redistribute the raw data separately
|
|
568
271
|
- Do not publish inflection paradigms without permission
|
|
569
272
|
|