lemma-is 0.2.2 → 0.3.0
- package/README.md +107 -430
- package/dist/index.d.mts +31 -3
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +1 -1
- package/dist/index.mjs.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -1,392 +1,154 @@
|
|
|
1
1
|
# lemma-is
|
|
2
2
|
|
|
3
|
-
Icelandic lemmatization for JavaScript.
|
|
3
|
+
Fast Icelandic lemmatization for JavaScript. Built for search indexing.
|
|
4
4
|
|
|
5
|
-
## Why?
|
|
6
|
-
|
|
7
|
-
Existing Icelandic NLP tools are Python/C++:
|
|
8
|
-
|
|
9
|
-
| Tool | Runtime | Standalone? | Notes |
|
|
10
|
-
|------|---------|-------------|-------|
|
|
11
|
-
| **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger. |
|
|
12
|
-
| **[Nefnir](https://github.com/lexis-project/Nefnir)** | Python | ✗ | Requires POS tags from IceNLP/IceStagger (Java, unmaintained). |
|
|
13
|
-
| **lemma-is** | TypeScript | ✓ | Node.js servers. Grammar-based disambiguation, compound splitting. |
|
|
14
|
-
|
|
15
|
-
lemma-is trades parsing accuracy for JS ecosystem integration—good enough for search indexing, runs in any Node.js environment.
|
|
16
|
-
|
|
17
|
-
## Quickstart
|
|
18
|
-
|
|
19
|
-
```bash
|
|
20
|
-
npm install lemma-is
|
|
21
|
-
```
|
|
22
|
-
|
|
23
|
-
**Node.js**:
|
|
24
5
|
```typescript
|
|
25
|
-
import { readFileSync } from "fs";
|
|
26
6
|
import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
|
|
27
7
|
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
));
|
|
8
|
+
lemmatizer.lemmatize("börnin"); // → ["barn"]
|
|
9
|
+
lemmatizer.lemmatize("keypti"); // → ["kaupa"]
|
|
10
|
+
lemmatizer.lemmatize("hestinum"); // → ["hestur"]
|
|
32
11
|
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
});
|
|
12
|
+
// Full pipeline for search
|
|
13
|
+
extractIndexableLemmas("Börnin keypti hestinn", lemmatizer);
|
|
14
|
+
// → ["barn", "kaupa", "hestur"]
|
|
37
15
|
```
|
|
38
16
|
|
|
39
17
|
## The Problem
|
|
40
18
|
|
|
41
|
-
Icelandic is
|
|
19
|
+
Icelandic is heavily inflected. A single noun like "hestur" (horse) has 16 forms:
|
|
42
20
|
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
|
|
46
|
-
| barn (child) | börnin, barnið, barna, börnum... |
|
|
47
|
-
| fara (go) | fór, fer, förum, fóru, farið... |
|
|
48
|
-
| kona (woman) | konuna, konunni, kvenna, konum... |
|
|
21
|
+
```
|
|
22
|
+
hestur, hest, hesti, hests, hestar, hesta, hestum, hestanna...
|
|
23
|
+
```
|
|
49
24
|
|
|
50
|
-
If
|
|
25
|
+
If a user searches "hestur" but your document contains "hestinum", they won't find it—unless you normalize both to the lemma at index time.
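To make the index-time normalization concrete, here is a minimal sketch of an inverted index keyed by lemma. It assumes a tiny hand-written lookup table (`toyLemmas`) standing in for the real lemmatizer; the table, the `addDocument` and `search` helpers, and the sample sentences are all hypothetical and not part of lemma-is.

```typescript
// Toy stand-in for a real lemmatizer: a few surface forms mapped to lemmas.
// (Hypothetical data for illustration; not the lemma-is binary.)
const toyLemmas = new Map<string, string[]>([
  ["hestur", ["hestur"]],
  ["hestinum", ["hestur"]],
  ["hesta", ["hestur"]],
  ["börnin", ["barn"]],
]);

// Unknown words fall back to their lowercased form so they stay searchable.
function lemmatize(word: string): string[] {
  const key = word.toLowerCase();
  return toyLemmas.get(key) ?? [key];
}

// Inverted index: lemma -> ids of documents containing any form of it.
const index = new Map<string, Set<number>>();

function addDocument(id: number, text: string): void {
  text
    .split(/\s+/)
    .filter(Boolean)
    .forEach((token) => {
      lemmatize(token).forEach((lemma) => {
        if (!index.has(lemma)) index.set(lemma, new Set<number>());
        index.get(lemma)!.add(id);
      });
    });
}

// The query goes through the same normalization as the documents.
function search(query: string): number[] {
  const hits = new Set<number>();
  lemmatize(query).forEach((lemma) => {
    index.get(lemma)?.forEach((id) => hits.add(id));
  });
  return Array.from(hits).sort((a, b) => a - b);
}

addDocument(1, "börnin riðu hestinum");
addDocument(2, "hún á marga hesta");

console.log(search("hestur")); // [1, 2]
```

Because both documents and queries are reduced to lemmas, a search for any inflected form reaches the same posting list.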
|
|
51
26
|
|
|
52
|
-
##
|
|
27
|
+
## Why lemma-is?
|
|
53
28
|
|
|
54
|
-
|
|
55
|
-
import { BinaryLemmatizer } from "lemma-is";
|
|
29
|
+
The gold standard for Icelandic NLP is [GreynirEngine](https://github.com/mideind/GreynirEngine)—a full grammatical parser with excellent accuracy. But it's Python-only, which means you can't run it in Node.js, browsers, or edge runtimes without FFI or a sidecar process.
|
|
56
30
|
|
|
57
|
-
|
|
31
|
+
lemma-is trades parsing accuracy for JavaScript portability. It's a lookup table with shallow grammar rules—good enough for search indexing, runs anywhere Node.js runs.
|
|
58
32
|
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
33
|
+
| | lemma-is | GreynirEngine |
|
|
34
|
+
|---|---|---|
|
|
35
|
+
| **Runtime** | Node, Bun, Deno | Python |
|
|
36
|
+
| **Throughput** | ~250K words/sec | ~25K words/sec |
|
|
37
|
+
| **Cold start** | ~35 ms | ~500 ms |
|
|
38
|
+
| **Memory** | ~185 MB | ~200 MB |
|
|
39
|
+
| **Disambiguation** | Bigrams + grammar rules | Full sentence parsing |
|
|
40
|
+
| **Use case** | Search indexing | NLP analysis |
|
|
64
41
|
|
|
65
|
-
|
|
42
|
+
See [BENCHMARKS.md](./BENCHMARKS.md) for methodology and detailed results.
|
|
66
43
|
|
|
67
|
-
|
|
44
|
+
### The Trade-off
|
|
68
45
|
|
|
69
|
-
|
|
46
|
+
lemma-is returns **all possible lemmas** for ambiguous words:
|
|
70
47
|
|
|
71
48
|
```typescript
|
|
72
49
|
lemmatizer.lemmatize("á");
|
|
73
50
|
// → ["á", "eiga"]
|
|
74
|
-
//
|
|
75
|
-
// "á" = verb "owns" (from "eiga")
|
|
76
|
-
|
|
77
|
-
lemmatizer.lemmatize("við");
|
|
78
|
-
// → ["við", "ég", "viður"]
|
|
79
|
-
// "við" = preposition "by/at"
|
|
80
|
-
// "við" = pronoun "we" (from "ég")
|
|
81
|
-
// "við" = noun "wood" (from "viður")
|
|
51
|
+
// Could be preposition "on", noun "river", or verb "owns"
|
|
82
52
|
```
|
|
83
53
|
|
|
84
|
-
|
|
54
|
+
GreynirEngine parses the sentence to return the single correct interpretation. For search, returning all candidates is often better—you'd rather show an extra result than miss a relevant document.
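The recall argument can be sketched with a toy example. The candidate list for "á" is taken from the example above; the `indexTerms` helper and the "take the first candidate" strategy are hypothetical stand-ins for a disambiguator that guessed wrong, not the behaviour of lemma-is or GreynirEngine.

```typescript
// Candidate lemmas for "á" as shown in the example above; the rest is a toy.
const candidates = new Map<string, string[]>([["á", ["á", "eiga"]]]);

// Collect the terms to index for a tokenized document.
// allCandidates = true keeps every lemma; false keeps only the first one,
// which stands in for a single (possibly wrong) disambiguation guess.
function indexTerms(tokens: string[], allCandidates: boolean): Set<string> {
  const terms = new Set<string>();
  tokens.forEach((token) => {
    const lemmas = candidates.get(token) ?? [token];
    const kept = allCandidates ? lemmas : lemmas.slice(0, 1);
    kept.forEach((lemma) => terms.add(lemma));
  });
  return terms;
}

const doc = ["hún", "á", "hest"];

const recallIndex = indexTerms(doc, true);
const singleGuessIndex = indexTerms(doc, false);

console.log(recallIndex.has("eiga")); // true
console.log(singleGuessIndex.has("eiga")); // false
```

With all candidates indexed, a query for "eiga" still reaches the document; the cost is that a query for the preposition "á" matches it as well.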
|
|
55
|
+
|
|
56
|
+
## Installation
|
|
57
|
+
|
|
58
|
+
```bash
|
|
59
|
+
npm install lemma-is
|
|
60
|
+
```
|
|
85
61
|
|
|
86
|
-
|
|
62
|
+
## Quick Start
|
|
87
63
|
|
|
88
64
|
```typescript
|
|
89
|
-
import {
|
|
65
|
+
import { readFileSync } from "fs";
|
|
66
|
+
import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
|
|
90
67
|
|
|
91
|
-
|
|
92
|
-
const
|
|
68
|
+
// Load the binary (91 MB, ~35ms cold start)
|
|
69
|
+
const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
|
|
70
|
+
const lemmatizer = BinaryLemmatizer.loadFromBuffer(
|
|
71
|
+
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
72
|
+
);
|
|
93
73
|
|
|
94
|
-
//
|
|
95
|
-
|
|
96
|
-
// →
|
|
74
|
+
// Basic lemmatization
|
|
75
|
+
lemmatizer.lemmatize("börnin"); // → ["barn"]
|
|
76
|
+
lemmatizer.lemmatize("fóru"); // → ["fara", "fóra"]
|
|
97
77
|
|
|
98
|
-
//
|
|
99
|
-
|
|
100
|
-
// →
|
|
78
|
+
// Full pipeline for search indexing
|
|
79
|
+
const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
|
|
80
|
+
// → ["barn", "fara", "fóra", "í", "bíó"]
|
|
101
81
|
```
|
|
102
82
|
|
|
83
|
+
## Features
|
|
84
|
+
|
|
103
85
|
### Morphological Features
|
|
104
86
|
|
|
105
87
|
The binary includes case, gender, and number for each word form:
|
|
106
88
|
|
|
107
89
|
```typescript
|
|
108
90
|
lemmatizer.lemmatizeWithMorph("hestinum");
|
|
109
|
-
// → [{
|
|
110
|
-
//
|
|
111
|
-
// pos: "no",
|
|
112
|
-
// morph: { case: "þgf", gender: "kk", number: "et" }
|
|
113
|
-
// }]
|
|
114
|
-
// "hestinum" = dative, masculine, singular
|
|
115
|
-
|
|
116
|
-
lemmatizer.lemmatizeWithMorph("börnum");
|
|
117
|
-
// → [{
|
|
118
|
-
// lemma: "barn",
|
|
119
|
-
// pos: "no",
|
|
120
|
-
// morph: { case: "þgf", gender: "hk", number: "ft" }
|
|
121
|
-
// }]
|
|
122
|
-
// "börnum" = dative, neuter, plural
|
|
91
|
+
// → [{ lemma: "hestur", pos: "no", morph: { case: "þgf", gender: "kk", number: "et" } }]
|
|
92
|
+
// dative, masculine, singular
|
|
123
93
|
```
|
|
124
94
|
|
|
125
|
-
|
|
126
|
-
|------|---------|
|
|
127
|
-
| `nf` | nominative (nefnifall) |
|
|
128
|
-
| `þf` | accusative (þolfall) |
|
|
129
|
-
| `þgf` | dative (þágufall) |
|
|
130
|
-
| `ef` | genitive (eignarfall) |
|
|
131
|
-
| `kk` | masculine (karlkyn) |
|
|
132
|
-
| `kvk` | feminine (kvenkyn) |
|
|
133
|
-
| `hk` | neuter (hvorugkyn) |
|
|
134
|
-
| `et` | singular (eintala) |
|
|
135
|
-
| `ft` | plural (fleirtala) |
|
|
95
|
+
### Grammar-Based Disambiguation
|
|
136
96
|
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
Use corpus frequencies to pick the most likely lemma based on context:
|
|
97
|
+
Shallow grammar rules use Icelandic case government to disambiguate prepositions:
|
|
140
98
|
|
|
141
99
|
```typescript
|
|
142
|
-
import {
|
|
143
|
-
|
|
144
|
-
const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
|
|
100
|
+
import { Disambiguator } from "lemma-is";
|
|
145
101
|
|
|
146
|
-
|
|
147
|
-
// "við erum" = "we are" → bigrams favor pronoun "ég" over preposition
|
|
148
|
-
const processed = processText("Við erum hér", lemmatizer, { bigrams: lemmatizer });
|
|
149
|
-
// → disambiguated: "ég" for "við" (high confidence)
|
|
102
|
+
const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
|
|
150
103
|
|
|
151
|
-
// "á
|
|
152
|
-
|
|
153
|
-
// →
|
|
104
|
+
// "á borðinu" - borðinu is dative, á governs dative → preposition
|
|
105
|
+
disambiguator.disambiguate("á", null, "borðinu");
|
|
106
|
+
// → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
|
|
154
107
|
```
|
|
155
108
|
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
## Compound Word Splitting
|
|
109
|
+
### Compound Splitting
|
|
159
110
|
|
|
160
|
-
Icelandic forms long compounds.
|
|
111
|
+
Icelandic forms long compounds. Split them for better search coverage:
|
|
161
112
|
|
|
162
113
|
```typescript
|
|
163
114
|
import { CompoundSplitter, createKnownLemmaSet } from "lemma-is";
|
|
164
115
|
|
|
116
|
+
const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
|
|
165
117
|
const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
|
|
166
118
|
|
|
167
|
-
splitter.split("bílstjóri");
|
|
168
|
-
// → { isCompound: true, parts: ["bíll", "stjóri"] }
|
|
169
|
-
// "car driver" = "car" + "driver"
|
|
170
|
-
|
|
171
119
|
splitter.split("landbúnaðarráðherra");
|
|
172
120
|
// → { isCompound: true, parts: ["landbúnaður", "ráðherra"] }
|
|
173
|
-
// "agriculture minister"
|
|
174
|
-
|
|
175
|
-
splitter.split("húsnæðislán");
|
|
176
|
-
// → { isCompound: true, parts: ["húsnæði", "lán"] }
|
|
177
|
-
// "housing loan" = "housing" + "loan"
|
|
178
|
-
```
|
|
179
|
-
|
|
180
|
-
### Indexing Compounds
|
|
181
|
-
|
|
182
|
-
`getAllLemmas` returns the original word plus its parts—maximizing search recall:
|
|
183
|
-
|
|
184
|
-
```typescript
|
|
185
|
-
splitter.getAllLemmas("bílstjóri");
|
|
186
|
-
// → ["bílstjóri", "bíll", "stjóri"]
|
|
187
|
-
```
|
|
188
|
-
|
|
189
|
-
A document mentioning "bílstjóri" is now findable by searching for "bíll" (car).
|
|
190
|
-
|
|
191
|
-
## Full Pipeline
|
|
192
|
-
|
|
193
|
-
For production indexing, combine tokenization, lemmatization, disambiguation, and compound splitting.
|
|
194
|
-
|
|
195
|
-
### What Gets Indexed
|
|
196
|
-
|
|
197
|
-
Here's a real example showing exactly what lemmas are extracted:
|
|
198
|
-
|
|
199
|
-
```typescript
|
|
200
|
-
const text = "Ríkissjóður stendur í blóma ef 27 milljarða arðgreiðsla Íslandsbanka er talin með.";
|
|
201
|
-
|
|
202
|
-
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
203
|
-
bigrams: lemmatizer,
|
|
204
|
-
compoundSplitter: splitter,
|
|
205
|
-
removeStopwords: true,
|
|
206
|
-
});
|
|
207
|
-
|
|
208
|
-
// Indexed lemmas:
|
|
209
|
-
// ✓ ríkissjóður, ríki, sjóður — compound + parts
|
|
210
|
-
// ✓ standa — stendur → standa
|
|
211
|
-
// ✓ blómi — í blóma → blómi
|
|
212
|
-
// ✓ milljarður — milljarða → milljarður
|
|
213
|
-
// ✓ arðgreiðsla, arður, greiðsla — compound + parts
|
|
214
|
-
// ✓ íslandsbanki — proper noun (lowercased)
|
|
215
|
-
// ✓ telja — talin → telja
|
|
216
|
-
//
|
|
217
|
-
// NOT indexed (stopwords removed):
|
|
218
|
-
// ✗ í, ef, er, með
|
|
219
|
-
```
|
|
220
|
-
|
|
221
|
-
A search for "sjóður" or "arður" now finds this document about the state treasury and bank dividends.
|
|
222
|
-
|
|
223
|
-
### Another Example: Job Posting
|
|
224
|
-
|
|
225
|
-
```typescript
|
|
226
|
-
const posting = "Við leitum að reyndum kennurum til starfa í Reykjavík.";
|
|
227
|
-
|
|
228
|
-
const lemmas = extractIndexableLemmas(posting, lemmatizer, {
|
|
229
|
-
bigrams: lemmatizer,
|
|
230
|
-
removeStopwords: true,
|
|
231
|
-
});
|
|
232
|
-
|
|
233
|
-
// Indexed:
|
|
234
|
-
// ✓ leita, leit — leitum → leita (+ noun variant)
|
|
235
|
-
// ✓ reyndur, reynd — reyndum → reyndur
|
|
236
|
-
// ✓ kennari — kennurum → kennari
|
|
237
|
-
// ✓ starf, starfa — starfa (noun + verb)
|
|
238
|
-
// ✓ reykjavík — place name (lowercased)
|
|
239
|
-
//
|
|
240
|
-
// NOT indexed:
|
|
241
|
-
// ✗ við, að, til, í — stopwords
|
|
121
|
+
// "agriculture minister"
|
|
242
122
|
```
|
|
243
123
|
|
|
244
|
-
|
|
124
|
+
### Full Pipeline
|
|
245
125
|
|
|
246
|
-
|
|
126
|
+
For production indexing, combine everything:
|
|
247
127
|
|
|
248
128
|
```typescript
|
|
249
|
-
|
|
250
|
-
"fór ég á veitingastaðinn og keypti mér rauðvín með hamborgaranum.";
|
|
251
|
-
|
|
252
|
-
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
253
|
-
bigrams: lemmatizer,
|
|
254
|
-
compoundSplitter: splitter,
|
|
255
|
-
removeStopwords: true,
|
|
256
|
-
});
|
|
257
|
-
|
|
258
|
-
// Verbs (various tenses/persons):
|
|
259
|
-
// ✓ borða — borðaði (past)
|
|
260
|
-
// ✓ bráðna — bráðnað (past participle)
|
|
261
|
-
// ✓ fara — fór (past, different stem!)
|
|
262
|
-
// ✓ kaupa — keypti (past)
|
|
263
|
-
//
|
|
264
|
-
// Nouns with articles:
|
|
265
|
-
// ✓ ís — ísinn (NOT "Ísland"!)
|
|
266
|
-
// ✓ veitingastaður, veiting, staður — compound
|
|
267
|
-
// ✓ rauðvín
|
|
268
|
-
// ✓ hamborgari — hamborgaranum (dative + article)
|
|
269
|
-
```
|
|
129
|
+
import { extractIndexableLemmas, CompoundSplitter, createKnownLemmaSet } from "lemma-is";
|
|
270
130
|
|
|
271
|
-
### Setup
|
|
272
|
-
|
|
273
|
-
```typescript
|
|
274
|
-
import {
|
|
275
|
-
BinaryLemmatizer,
|
|
276
|
-
extractIndexableLemmas,
|
|
277
|
-
CompoundSplitter,
|
|
278
|
-
createKnownLemmaSet
|
|
279
|
-
} from "lemma-is";
|
|
280
|
-
|
|
281
|
-
const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
|
|
282
131
|
const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
|
|
283
132
|
const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
|
|
284
|
-
```
|
|
285
133
|
|
|
286
|
-
|
|
134
|
+
const text = "Ríkissjóður stendur í blóma ef milljarða arðgreiðsla er talin með.";
|
|
287
135
|
|
|
288
|
-
The defaults favor **recall over precision**—better for search where missing results is worse than extra results:
|
|
289
|
-
|
|
290
|
-
```typescript
|
|
291
136
|
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
292
137
|
bigrams: lemmatizer,
|
|
293
138
|
compoundSplitter: splitter,
|
|
294
|
-
|
|
295
|
-
// indexAllCandidates: true — indexes ALL lemma candidates
|
|
296
|
-
// alwaysTryCompounds: true — splits compounds even if known in BÍN
|
|
297
|
-
});
|
|
298
|
-
```
|
|
299
|
-
|
|
300
|
-
With these defaults:
|
|
301
|
-
- `"á"` → indexes both `"á"` (preposition) AND `"eiga"` (verb)
|
|
302
|
-
- `"húsnæðislán"` → indexes `"húsnæðislán"`, `"húsnæði"`, AND `"lán"`
|
|
303
|
-
|
|
304
|
-
### Precision Mode
|
|
305
|
-
|
|
306
|
-
If you need only the most likely lemma (chatbots, translation), disable the search optimizations:
|
|
307
|
-
|
|
308
|
-
```typescript
|
|
309
|
-
const lemmas = extractIndexableLemmas(text, lemmatizer, {
|
|
310
|
-
bigrams: lemmatizer,
|
|
311
|
-
compoundSplitter: splitter,
|
|
312
|
-
indexAllCandidates: false, // only disambiguated lemma
|
|
313
|
-
alwaysTryCompounds: false, // only split unknown words
|
|
139
|
+
removeStopwords: true,
|
|
314
140
|
});
|
|
315
|
-
```
|
|
316
|
-
|
|
317
|
-
## Word Classes
|
|
318
|
-
|
|
319
|
-
Filter by part of speech when context is known:
|
|
320
|
-
|
|
321
|
-
```typescript
|
|
322
|
-
lemmatizer.lemmatize("á", { wordClass: "so" }); // → ["eiga"] (verbs only)
|
|
323
|
-
lemmatizer.lemmatize("á", { wordClass: "fs" }); // → ["á"] (prepositions only)
|
|
324
|
-
|
|
325
|
-
lemmatizer.lemmatizeWithPOS("á");
|
|
326
|
-
// → [
|
|
327
|
-
// { lemma: "á", pos: "fs" }, // preposition
|
|
328
|
-
// { lemma: "á", pos: "no" }, // noun (river)
|
|
329
|
-
// { lemma: "eiga", pos: "so" } // verb
|
|
330
|
-
// ]
|
|
331
|
-
```
|
|
332
|
-
|
|
333
|
-
| Code | Icelandic | English |
|
|
334
|
-
|------|-----------|---------|
|
|
335
|
-
| `no` | nafnorð | noun |
|
|
336
|
-
| `so` | sagnorð | verb |
|
|
337
|
-
| `lo` | lýsingarorð | adjective |
|
|
338
|
-
| `ao` | atviksorð | adverb |
|
|
339
|
-
| `fs` | forsetning | preposition |
|
|
340
|
-
| `fn` | fornafn | pronoun |
|
|
341
|
-
|
|
342
|
-
## Data
|
|
343
|
-
|
|
344
|
-
Single binary file: `lemma-is.bin` (~91 MB)
|
|
345
|
-
|
|
346
|
-
Contains:
|
|
347
|
-
- 289K lemmas from BÍN
|
|
348
|
-
- 3M word form mappings
|
|
349
|
-
- 414K bigram frequencies
|
|
350
|
-
- Morphological features (case, gender, number) per word form
|
|
351
141
|
|
|
352
|
-
|
|
353
|
-
|
|
354
|
-
|
|
355
|
-
|
|
356
|
-
```bash
|
|
357
|
-
# Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
|
|
358
|
-
# Extract SHsnid.csv to data/
|
|
359
|
-
|
|
360
|
-
uv run python scripts/build-binary.py # builds lemma-is.bin with morph features
|
|
142
|
+
// Indexed: ríkissjóður, ríki, sjóður, standa, blómi, milljarður,
|
|
143
|
+
// arðgreiðsla, arður, greiðsla, telja
|
|
144
|
+
// Stopwords removed: í, ef, er, með
|
|
361
145
|
```
|
|
362
146
|
|
|
363
|
-
|
|
364
|
-
|
|
365
|
-
```typescript
|
|
366
|
-
import { readFileSync } from "fs";
|
|
367
|
-
import { BinaryLemmatizer } from "lemma-is";
|
|
368
|
-
|
|
369
|
-
const buffer = readFileSync("data-dist/lemma-is.bin");
|
|
370
|
-
const lemmatizer = BinaryLemmatizer.loadFromBuffer(
|
|
371
|
-
buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
|
|
372
|
-
);
|
|
373
|
-
```
|
|
147
|
+
A search for "sjóður" or "arður" now finds this document.
|
|
374
148
|
|
|
375
149
|
## PostgreSQL Full-Text Search
|
|
376
150
|
|
|
377
|
-
PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process
|
|
378
|
-
|
|
379
|
-
```sql
|
|
380
|
-
CREATE TABLE documents (
|
|
381
|
-
id SERIAL PRIMARY KEY,
|
|
382
|
-
title TEXT,
|
|
383
|
-
body TEXT,
|
|
384
|
-
search_vector TSVECTOR
|
|
385
|
-
);
|
|
386
|
-
CREATE INDEX documents_search_idx ON documents USING GIN (search_vector);
|
|
387
|
-
```
|
|
388
|
-
|
|
389
|
-
Lemmatize in your app, store as space-separated string:
|
|
151
|
+
PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process:
|
|
390
152
|
|
|
391
153
|
```typescript
|
|
392
154
|
const lemmas = extractIndexableLemmas(text, lemmatizer, { removeStopwords: true });
|
|
@@ -398,160 +160,75 @@ await db.query(
|
|
|
398
160
|
);
|
|
399
161
|
```
|
|
400
162
|
|
|
401
|
-
|
|
163
|
+
Use the `simple` configuration—it lowercases but doesn't stem, since our lemmas are already normalized.
|
|
402
164
|
|
|
403
|
-
|
|
404
|
-
const lemmas = extractIndexableLemmas(query, lemmatizer);
|
|
405
|
-
|
|
406
|
-
const results = await db.query(
|
|
407
|
-
`SELECT *, ts_rank(search_vector, q) AS rank
|
|
408
|
-
FROM documents, plainto_tsquery('simple', $1) q
|
|
409
|
-
WHERE search_vector @@ q
|
|
410
|
-
ORDER BY rank DESC`,
|
|
411
|
-
[Array.from(lemmas).join(" ")]
|
|
412
|
-
);
|
|
413
|
-
|
|
414
|
-
// User searches "börnum" → lemmatized to "barn" → matches all forms
|
|
415
|
-
```
|
|
416
|
-
|
|
417
|
-
**Why `simple`?** It lowercases but doesn't stem—our lemmas are already normalized. Use `setweight()` to boost title matches over body.
|
|
418
|
-
|
|
419
|
-
**Diacritics:** PostgreSQL's `unaccent` extension strips accents, but **don't use it for Icelandic**. Characters like á, ö, þ, ð are distinct letters, not accented variants. "á" (river/on/owns) ≠ "a". Preserve diacritics for correct matching.
|
|
165
|
+
**Important:** Don't use PostgreSQL's `unaccent` extension for Icelandic. Characters like á, ö, þ, ð are distinct letters, not accented variants.
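A small sketch of why accent stripping is lossy for Icelandic. The `stripAccents` helper below is a hypothetical approximation (Unicode decomposition followed by removal of combining marks); it is not PostgreSQL's actual `unaccent` rule set, which may treat some characters differently.

```typescript
// Approximate accent stripping: decompose to NFD, then drop combining marks.
// This only illustrates the idea; it does not reproduce PostgreSQL's unaccent rules.
function stripAccents(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

console.log(stripAccents("á")); // "a"
console.log(stripAccents("börn")); // "born"
```

After stripping, "á" and "a" become the same index term, so the distinction between the two letters is no longer available to the search.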
|
|
420
166
|
|
|
421
167
|
## Limitations
|
|
422
168
|
|
|
423
|
-
This
|
|
169
|
+
This is an early effort with known limitations.
|
|
424
170
|
|
|
425
171
|
### File Size
|
|
426
172
|
|
|
427
|
-
The binary is
|
|
173
|
+
The binary is **91 MB**. This targets Node.js servers where data loads once at startup. Not recommended for:
|
|
428
174
|
|
|
429
|
-
|
|
430
|
-
- **
|
|
431
|
-
- **Browser/Web Workers** — download size prohibitive for most users
|
|
175
|
+
- **Serverless/edge** — cold start loading 91 MB may be slow
|
|
176
|
+
- **Browser** — download size prohibitive
|
|
432
177
|
- **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
|
|
433
178
|
|
|
434
|
-
For browser
|
|
435
|
-
|
|
436
|
-
### No Query Expansion
|
|
437
|
-
|
|
438
|
-
You can go **word → lemma** but not **lemma → words**:
|
|
179
|
+
For browser apps, run lemmatization server-side.
|
|
439
180
|
|
|
440
|
-
|
|
441
|
-
lemmatizer.lemmatize("hestinum"); // → ["hestur"] ✓
|
|
442
|
-
|
|
443
|
-
// But you CANNOT do:
|
|
444
|
-
lemmatizer.expand("hestur");
|
|
445
|
-
// → ["hestur", "hest", "hesti", "hests", "hestinn", "hestinum", ...] ✗
|
|
446
|
-
```
|
|
181
|
+
### Not a Parser
|
|
447
182
|
|
|
448
|
-
This
|
|
183
|
+
This is a lookup table with shallow grammar rules, not a grammatical parser. It doesn't understand sentence structure, named entities, or semantic meaning. The grammar rules help with common patterns but can't handle all disambiguation.
|
|
449
184
|
|
|
450
|
-
|
|
185
|
+
For applications needing full grammatical analysis, use [GreynirEngine](https://github.com/mideind/GreynirEngine).
|
|
451
186
|
|
|
452
187
|
### Disambiguation Limits
|
|
453
188
|
|
|
454
|
-
Bigram disambiguation only works when the word pair exists in
|
|
455
|
-
|
|
456
|
-
```typescript
|
|
457
|
-
// Common phrase: bigrams help
|
|
458
|
-
processText("við erum", lemmatizer, { bigrams: lemmatizer });
|
|
459
|
-
// → "við" disambiguated to "ég" (we) with high confidence
|
|
460
|
-
|
|
461
|
-
// Rare/unusual phrase: no bigram data
|
|
462
|
-
processText("við flæktumst", lemmatizer, { bigrams: lemmatizer });
|
|
463
|
-
// → "við" picks first candidate, low confidence
|
|
464
|
-
```
|
|
465
|
-
|
|
466
|
-
Without context, ambiguous words fall back to arbitrary ordering:
|
|
189
|
+
Bigram disambiguation only works when the word pair exists in corpus data. Without context, ambiguous words return all candidates:
|
|
467
190
|
|
|
468
191
|
```typescript
|
|
469
|
-
// Single word, no context
|
|
470
192
|
lemmatizer.lemmatize("á");
|
|
471
|
-
// → ["á", "eiga"] —
|
|
472
|
-
|
|
473
|
-
// The preposition "á" is ~100x more common than verb "eiga" in this form,
|
|
474
|
-
// but we don't have unigram frequencies to use as tiebreaker.
|
|
193
|
+
// → ["á", "eiga"] — no way to know which is more likely
|
|
475
194
|
```
|
|
476
195
|
|
|
477
|
-
|
|
478
|
-
|
|
479
|
-
### Compound Splitting Heuristics
|
|
196
|
+
For search indexing, use `indexAllCandidates: true` (the default) to index all lemmas.
|
|
480
197
|
|
|
481
|
-
|
|
482
|
-
|
|
483
|
-
**Three-part compounds only split once:**
|
|
484
|
-
```typescript
|
|
485
|
-
splitter.split("þjóðmálaráðherra");
|
|
486
|
-
// → ["þjóðmál", "ráðherra"] — missing "þjóð" as separate part
|
|
487
|
-
// Ideal: ["þjóð", "mál", "ráðherra"]
|
|
488
|
-
```
|
|
489
|
-
|
|
490
|
-
**Inflected first parts may not match:**
|
|
491
|
-
```typescript
|
|
492
|
-
splitter.split("húseignir");
|
|
493
|
-
// → { isCompound: false } — "hús" appears as "hús" not "húsa"
|
|
494
|
-
// The compound IS "hús" + "eignir" but heuristics miss it
|
|
495
|
-
```
|
|
496
|
-
|
|
497
|
-
**May over-split valid words:**
|
|
498
|
-
```typescript
|
|
499
|
-
splitter.split("landsins");
|
|
500
|
-
// This is NOT a compound — it's "land" + genitive suffix "-sins"
|
|
501
|
-
// Correctly returns { isCompound: false }, but edge cases exist
|
|
502
|
-
```
|
|
503
|
-
|
|
504
|
-
**Mitigations:**
|
|
505
|
-
- Use `alwaysTryCompounds: true` to split even known words
|
|
506
|
-
- Use `minPartLength: 2` in CompoundSplitter for more aggressive splitting
|
|
507
|
-
- Over-indexing is usually better than under-indexing for search
|
|
508
|
-
|
|
509
|
-
### Not a Parser
|
|
510
|
-
|
|
511
|
-
This is a lookup table with shallow grammar rules, not a full grammatical parser. It doesn't understand:
|
|
512
|
-
|
|
513
|
-
- Full sentence structure or syntax trees
|
|
514
|
-
- Complex verb argument frames
|
|
515
|
-
- Named entity recognition (people, places, companies)
|
|
516
|
-
- Semantic meaning or word sense
|
|
198
|
+
### No Query Expansion
|
|
517
199
|
|
|
518
|
-
|
|
200
|
+
You can go word → lemma but not lemma → words. This affects search result highlighting—if a user searches "hestur" and the document contains "hestinum", you can't easily highlight the match.
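One possible workaround, sketched here under assumptions, is to stay in the word → lemma direction: lemmatize each token of the stored text at display time and mark the tokens whose lemmas overlap with the query's lemmas. The lookup table and the `highlight` helper are hypothetical and not something the package provides; punctuation handling is left out.

```typescript
// Toy stand-in for a real lemmatizer (hypothetical data, for illustration only).
const toyLemmas = new Map<string, string[]>([
  ["hestur", ["hestur"]],
  ["hestinum", ["hestur"]],
]);

function lemmatize(word: string): string[] {
  const key = word.toLowerCase();
  return toyLemmas.get(key) ?? [key];
}

// Wrap every token whose lemma set overlaps the query's lemma set in ** **.
// Splitting on a captured whitespace group keeps the original spacing intact.
function highlight(text: string, query: string): string {
  const queryLemmas = new Set(lemmatize(query));
  return text
    .split(/(\s+)/)
    .map((token) =>
      lemmatize(token).some((lemma) => queryLemmas.has(lemma))
        ? `**${token}**`
        : token
    )
    .join("");
}

console.log(highlight("Hún reið hestinum heim", "hestur"));
// Hún reið **hestinum** heim
```

This costs one lemmatization pass per displayed snippet, which may be acceptable when only the top results are rendered.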
|
|
519
201
|
|
|
520
|
-
##
|
|
202
|
+
## Data
|
|
521
203
|
|
|
522
|
-
|
|
204
|
+
Single binary file containing:
|
|
205
|
+
- 289K lemmas from BÍN
|
|
206
|
+
- 3M word form mappings
|
|
207
|
+
- 414K bigram frequencies
|
|
208
|
+
- Morphological features per word form
|
|
523
209
|
|
|
524
|
-
|
|
210
|
+
### Building Data
|
|
525
211
|
|
|
526
212
|
```bash
|
|
527
|
-
|
|
528
|
-
|
|
529
|
-
npx vitest run --update # update snapshots
|
|
530
|
-
```
|
|
213
|
+
# Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
|
|
214
|
+
# Extract SHsnid.csv to data/
|
|
531
215
|
|
|
532
|
-
|
|
533
|
-
|
|
534
|
-
- `compounds.test.ts` — Compound word splitting
|
|
535
|
-
- `integration.test.ts` — Full pipeline, search indexing options
|
|
536
|
-
- `pipeline-greynir.test.ts` — Full pipeline with Greynir test sentences
|
|
537
|
-
- `benchmark.test.ts` — Performance and metrics snapshots
|
|
538
|
-
- `icelandic-tricky.test.ts` — Edge cases, morphology examples
|
|
539
|
-
- `limitations.test.ts` — Documented limitations and research notes
|
|
540
|
-
- `mini-grammar.test.ts` — Grammar rules and case government
|
|
216
|
+
uv run python scripts/build-binary.py
|
|
217
|
+
```
|
|
541
218
|
|
|
542
|
-
|
|
219
|
+
## Development
|
|
543
220
|
|
|
544
221
|
```bash
|
|
222
|
+
pnpm test # run tests
|
|
545
223
|
pnpm build # build dist/
|
|
546
|
-
pnpm typecheck # type check
|
|
547
|
-
pnpm build:data # rebuild binary from BÍN source
|
|
224
|
+
pnpm typecheck # type check
|
|
548
225
|
```
|
|
549
226
|
|
|
550
227
|
## Acknowledgments
|
|
551
228
|
|
|
552
|
-
- **[BÍN](https://bin.arnastofnun.is/)**
|
|
553
|
-
- **[Miðeind](https://mideind.is/)**
|
|
554
|
-
- **[tokenize-is](https://github.com/axelharri/tokenize-is)**
|
|
229
|
+
- **[BÍN](https://bin.arnastofnun.is/)** — Morphological database from the Árni Magnússon Institute
|
|
230
|
+
- **[Miðeind](https://mideind.is/)** — GreynirEngine and foundational Icelandic NLP work
|
|
231
|
+
- **[tokenize-is](https://github.com/axelharri/tokenize-is)** — Icelandic tokenizer
|
|
555
232
|
|
|
556
233
|
## License
|
|
557
234
|
|
|
@@ -559,10 +236,10 @@ MIT for the code.
|
|
|
559
236
|
|
|
560
237
|
### Data License (BÍN)
|
|
561
238
|
|
|
562
|
-
The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/)
|
|
239
|
+
The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) © Árni Magnússon Institute for Icelandic Studies.
|
|
563
240
|
|
|
564
241
|
**By using this package, you agree to BÍN's conditions:**
|
|
565
|
-
- Credit the Árni Magnússon Institute in your product
|
|
242
|
+
- Credit the Árni Magnússon Institute in your product
|
|
566
243
|
- Do not redistribute the raw data separately
|
|
567
244
|
- Do not publish inflection paradigms without permission
|
|
568
245
|
|