lemma-is 0.2.1 → 0.2.2
- package/README.md +23 -30
- package/package.json +1 -1
package/README.md
CHANGED
@@ -4,15 +4,15 @@ Icelandic lemmatization for JavaScript. Maps inflected word forms to base forms
 
 ## Why?
 
-Existing Icelandic NLP tools
+Existing Icelandic NLP tools are Python/C++:
 
 | Tool | Runtime | Standalone? | Notes |
 |------|---------|-------------|-------|
-| **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger
+| **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger. |
 | **[Nefnir](https://github.com/lexis-project/Nefnir)** | Python | ✗ | Requires POS tags from IceNLP/IceStagger (Java, unmaintained). |
-| **lemma-is** | TypeScript | ✓ |
+| **lemma-is** | TypeScript | ✓ | Node.js servers. Grammar-based disambiguation, compound splitting. |
 
-lemma-is trades parsing accuracy for
+lemma-is trades parsing accuracy for JS ecosystem integration—good enough for search indexing, runs in any Node.js environment.
 
 ## Quickstart
 
@@ -20,24 +20,7 @@ lemma-is trades parsing accuracy for portability—good enough for search, runs
 npm install lemma-is
 ```
 
-**
-```bash
-# Option 1: Download pre-built from npm
-cp node_modules/lemma-is/data-dist/lemma-is.bin ./public/
-
-# Option 2: Build from source (requires BÍN data + Python)
-# Download SHsnid.csv from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
-uv run python scripts/build-data.py && uv run python scripts/build-binary.py
-```
-
-**Browser (Web Worker)** — see [`test.html`](test.html) for a complete example:
-```typescript
-// Load in worker to avoid blocking main thread
-const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
-self.postMessage({ lemmas: lemmatizer.lemmatize("börnin") }); // → ["barn"]
-```
-
-**Node.js endpoint**:
+**Node.js**:
 ```typescript
 import { readFileSync } from "fs";
 import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
@@ -441,15 +424,14 @@ This library makes tradeoffs for portability. Know what you're getting.
 
 ### File Size
 
-The binary is **~91 MB**.
+The binary is **~91 MB**. This library targets Node.js server environments where the data is loaded once at startup.
 
-
-
-
-
-```
+Not recommended for:
+- **Serverless/edge** — cold start latency loading 91 MB
+- **Browser/Web Workers** — download size prohibitive for most users
+- **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
 
-
+For browser applications, run lemmatization server-side and expose an API endpoint.
 
 ### No Query Expansion
 
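The new README text in this hunk recommends running lemmatization server-side and exposing an API endpoint for browser clients. A minimal sketch of that setup, using only Node's built-in `http` module, might look like the following. The `Lemmatizer` interface, the `/lemmas` route, and the function names `handleLemmaQuery` and `createLemmaServer` are assumptions made for illustration. The only library call relied on is `lemmatize(word)` returning an array of lemmas, as shown in the removed Web Worker example.

```typescript
import { createServer, Server } from "node:http";

// Assumed minimal shape of the lemmatizer: only `lemmatize` appears in the diff.
interface Lemmatizer {
  lemmatize(word: string): string[];
}

interface LemmaResponse {
  status: number;
  body: { word?: string; lemmas?: string[]; error?: string };
}

// Pure request logic, kept apart from HTTP so it can be exercised with a stub.
function handleLemmaQuery(lemmatizer: Lemmatizer, word: string | null): LemmaResponse {
  const trimmed = (word ?? "").trim();
  if (trimmed === "") {
    return { status: 400, body: { error: "missing 'word' query parameter" } };
  }
  return { status: 200, body: { word: trimmed, lemmas: lemmatizer.lemmatize(trimmed) } };
}

// Hypothetical wiring for GET /lemmas?word=... . The caller constructs the
// lemmatizer once at startup and passes it in, so the large binary is not
// reloaded per request.
function createLemmaServer(lemmatizer: Lemmatizer): Server {
  return createServer((req, res) => {
    const url = new URL(req.url ?? "/", "http://localhost");
    if (req.method !== "GET" || url.pathname !== "/lemmas") {
      res.writeHead(404, { "Content-Type": "application/json; charset=utf-8" });
      res.end(JSON.stringify({ error: "not found" }));
      return;
    }
    const { status, body } = handleLemmaQuery(lemmatizer, url.searchParams.get("word"));
    res.writeHead(status, { "Content-Type": "application/json; charset=utf-8" });
    res.end(JSON.stringify(body));
  });
}
```

To use it, build the lemmatizer as in the README's Node.js quickstart and call `createLemmaServer(lemmatizer).listen(port)`. The loading call itself is not visible in this diff, so it is left to the caller here.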
@@ -573,4 +555,15 @@ pnpm build:data # rebuild binary from BÍN source
 
 ## License
 
-MIT
+MIT for the code.
+
+### Data License (BÍN)
+
+The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) (Beygingarlýsing íslensks nútímamáls) © Árni Magnússon Institute for Icelandic Studies.
+
+**By using this package, you agree to BÍN's conditions:**
+- Credit the Árni Magnússon Institute in your product's UI
+- Do not redistribute the raw data separately
+- Do not publish inflection paradigms without permission
+
+Full terms: [BÍN License Conditions](https://bin.arnastofnun.is/DMII/LTdata/conditions/)