lemma-is 0.2.1 → 0.2.2

Files changed (2)
  1. package/README.md +23 -30
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -4,15 +4,15 @@ Icelandic lemmatization for JavaScript. Maps inflected word forms to base forms
4
4
 
5
5
  ## Why?
6
6
 
7
- Existing Icelandic NLP tools don't run in browsers:
7
+ Existing Icelandic NLP tools are Python/C++:
8
8
 
9
9
  | Tool | Runtime | Standalone? | Notes |
10
10
  |------|---------|-------------|-------|
11
- | **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger, 100+ MB. |
11
+ | **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger. |
12
12
  | **[Nefnir](https://github.com/lexis-project/Nefnir)** | Python | ✗ | Requires POS tags from IceNLP/IceStagger (Java, unmaintained). |
13
- | **lemma-is** | TypeScript | ✓ | Browser/Node/edge. Bigram disambiguation, compound splitting. |
13
+ | **lemma-is** | TypeScript | ✓ | Node.js servers. Grammar-based disambiguation, compound splitting. |
14
14
 
15
- lemma-is trades parsing accuracy for portability—good enough for search, runs anywhere JavaScript runs.
15
+ lemma-is trades parsing accuracy for JS ecosystem integration—good enough for search indexing, runs in any Node.js environment.
16
16
 
17
17
  ## Quickstart
18
18
 
@@ -20,24 +20,7 @@ lemma-is trades parsing accuracy for portability—good enough for search, runs
20
20
  npm install lemma-is
21
21
  ```
22
22
 
23
- **Get the data** (~91 MB binary):
24
- ```bash
25
- # Option 1: Download pre-built from npm
26
- cp node_modules/lemma-is/data-dist/lemma-is.bin ./public/
27
-
28
- # Option 2: Build from source (requires BÍN data + Python)
29
- # Download SHsnid.csv from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
30
- uv run python scripts/build-data.py && uv run python scripts/build-binary.py
31
- ```
32
-
33
- **Browser (Web Worker)** — see [`test.html`](test.html) for a complete example:
34
- ```typescript
35
- // Load in worker to avoid blocking main thread
36
- const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
37
- self.postMessage({ lemmas: lemmatizer.lemmatize("börnin") }); // → ["barn"]
38
- ```
39
-
40
- **Node.js endpoint**:
23
+ **Node.js**:
41
24
  ```typescript
42
25
  import { readFileSync } from "fs";
43
26
  import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
@@ -441,15 +424,14 @@ This library makes tradeoffs for portability. Know what you're getting.
441
424
 
442
425
  ### File Size
443
426
 
444
- The binary is **~91 MB**. For serverless/edge with cold starts, that's significant. For browser apps, load in a Web Worker and cache aggressively.
427
+ The binary is **~91 MB**. This library targets Node.js server environments where the data is loaded once at startup.
445
428
 
446
- ```typescript
447
- // Cloudflare Workers: fits in 128MB memory limit, but cold starts are slow
448
- // Vercel Edge: works, but consider if you really need client-side lemmatization
449
- // Browser: use Service Worker caching, load once per session
450
- ```
429
+ Not recommended for:
430
+ - **Serverless/edge** cold start latency loading 91 MB
431
+ - **Browser/Web Workers** download size prohibitive for most users
432
+ - **Cloudflare Workers** fits 128 MB limit but cold starts are slow
451
433
 
452
- Consider server-side lemmatization if latency matters more than offline support.
434
+ For browser applications, run lemmatization server-side and expose an API endpoint.
453
435
 
454
436
  ### No Query Expansion
455
437
 
@@ -573,4 +555,15 @@ pnpm build:data # rebuild binary from BÍN source
573
555
 
574
556
  ## License
575
557
 
576
- MIT. Data derived from BÍN under the [BÍN license](https://bin.arnastofnun.is/DMII/LTdata/conditions/).
558
+ MIT for the code.
559
+
560
+ ### Data License (BÍN)
561
+
562
+ The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) (Beygingarlýsing íslensks nútímamáls) © Árni Magnússon Institute for Icelandic Studies.
563
+
564
+ **By using this package, you agree to BÍN's conditions:**
565
+ - Credit the Árni Magnússon Institute in your product's UI
566
+ - Do not redistribute the raw data separately
567
+ - Do not publish inflection paradigms without permission
568
+
569
+ Full terms: [BÍN License Conditions](https://bin.arnastofnun.is/DMII/LTdata/conditions/)
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "lemma-is",
3
- "version": "0.2.1",
3
+ "version": "0.2.2",
4
4
  "description": "Icelandic word form to lemma lookup for browser and Node.js",
5
5
  "keywords": [
6
6
  "icelandic",