npm - lemma-is - Versions diffs - 0.2.1 → 0.2.3 - Mend

lemma-is 0.2.1 → 0.2.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (2)

package/README.md +43 -49
package/package.json +1 -1

package/README.md CHANGED

@@ -4,15 +4,15 @@ Icelandic lemmatization for JavaScript. Maps inflected word forms to base forms
 ## Why?
-Existing Icelandic NLP tools don't run in browsers:
+Existing Icelandic NLP tools are Python/C++:
 | Tool | Runtime | Standalone? | Notes |
 |------|---------|-------------|-------|
-| **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger, 100+ MB. |
+| **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger. |
 | **[Nefnir](https://github.com/lexis-project/Nefnir)** | Python | ✗ | Requires POS tags from IceNLP/IceStagger (Java, unmaintained). |
-| **lemma-is** | TypeScript | ✓ | Browser/Node/edge. Bigram disambiguation, compound splitting. |
+| **lemma-is** | TypeScript | ✓ | Node.js servers. Grammar-based disambiguation, compound splitting. |
-lemma-is trades parsing accuracy for portability—good enough for search, runs anywhere JavaScript runs.
+lemma-is trades parsing accuracy for JS ecosystem integration—good enough for search indexing, runs in any Node.js environment.
 ## Quickstart
@@ -20,37 +20,23 @@ lemma-is trades parsing accuracy for portability—good enough for search, runs
 npm install lemma-is
 ```
-**Get the data** (~91 MB binary):
-```bash
-# Option 1: Download pre-built from npm
-cp node_modules/lemma-is/data-dist/lemma-is.bin ./public/
-# Option 2: Build from source (requires BÍN data + Python)
-# Download SHsnid.csv from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
-uv run python scripts/build-data.py && uv run python scripts/build-binary.py
-```
-**Browser (Web Worker)** — see [`test.html`](test.html) for a complete example:
-```typescript
-// Load in worker to avoid blocking main thread
-const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
-self.postMessage({ lemmas: lemmatizer.lemmatize("börnin") }); // → ["barn"]
-```
-**Node.js endpoint**:
+**Node.js**:
 ```typescript
 import { readFileSync } from "fs";
 import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
-const buffer = readFileSync("./lemma-is.bin");
-const lemmatizer = BinaryLemmatizer.loadFromBuffer(buffer.buffer.slice(
-  buffer.byteOffset, buffer.byteOffset + buffer.byteLength
-));
+// Binary data is bundled with the package
+const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
+const lemmatizer = BinaryLemmatizer.loadFromBuffer(
+  buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
+);
-app.post("/lemmatize", (req, res) => {
-  const lemmas = extractIndexableLemmas(req.body.text, lemmatizer);
-  res.json({ lemmas: [...lemmas] });
-});
+lemmatizer.lemmatize("börnin");  // → ["barn"]
+lemmatizer.lemmatize("fóru");    // → ["fara", "fóra"]
+// Full pipeline for search indexing
+const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
+// → ["barn", "fara", "fóra", "í", "bíó"]
 ```
 ## The Problem
@@ -69,10 +55,6 @@ If you index "Börnin fóru í bíó" by splitting on whitespace, a search for "
 ## Solution
 ```typescript
-import { BinaryLemmatizer } from "lemma-is";
-const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
 lemmatizer.lemmatize("börnin");   // → ["barn"]
 lemmatizer.lemmatize("fóru");     // → ["fara"]
 lemmatizer.lemmatize("kvenna");   // → ["kona"]
@@ -103,9 +85,9 @@ lemmatizer.lemmatize("við");
 The library uses shallow grammar rules based on Icelandic case government to disambiguate prepositions:
 ```typescript
-import { BinaryLemmatizer, Disambiguator } from "lemma-is";
+import { Disambiguator } from "lemma-is";
-const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
+// lemmatizer loaded as shown in Quickstart
 const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
 // "á borðinu" - borðinu is dative (þgf), á governs dative → preposition
@@ -156,9 +138,7 @@ lemmatizer.lemmatizeWithMorph("börnum");
 Use corpus frequencies to pick the most likely lemma based on context:
 ```typescript
-import { BinaryLemmatizer, processText } from "lemma-is";
-const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
+import { processText } from "lemma-is";
 // BinaryLemmatizer has built-in bigram frequencies for disambiguation
 // "við erum" = "we are" → bigrams favor pronoun "ég" over preposition
@@ -288,6 +268,7 @@ const lemmas = extractIndexableLemmas(text, lemmatizer, {
 ### Setup
 ```typescript
+import { readFileSync } from "fs";
 import {
   BinaryLemmatizer,
   extractIndexableLemmas,
@@ -295,7 +276,10 @@ import {
   createKnownLemmaSet
 } from "lemma-is";
-const lemmatizer = await BinaryLemmatizer.load("/data/lemma-is.bin");
+const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
+const lemmatizer = BinaryLemmatizer.loadFromBuffer(
+  buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
+);
 const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
 const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
 ```
@@ -383,7 +367,7 @@ uv run python scripts/build-binary.py    # builds lemma-is.bin with morph featur
 import { readFileSync } from "fs";
 import { BinaryLemmatizer } from "lemma-is";
-const buffer = readFileSync("data-dist/lemma-is.bin");
+const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
 const lemmatizer = BinaryLemmatizer.loadFromBuffer(
   buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
 );
@@ -441,15 +425,14 @@ This library makes tradeoffs for portability. Know what you're getting.
 ### File Size
-The binary is **~91 MB**. For serverless/edge with cold starts, that's significant. For browser apps, load in a Web Worker and cache aggressively.
+The binary is **~91 MB**. This library targets Node.js server environments where the data is loaded once at startup.
-```typescript
-// Cloudflare Workers: fits in 128MB memory limit, but cold starts are slow
-// Vercel Edge: works, but consider if you really need client-side lemmatization
-// Browser: use Service Worker caching, load once per session
-```
+Not recommended for:
+- **Serverless/edge** — cold start latency loading 91 MB
+- **Browser/Web Workers** — download size prohibitive for most users
+- **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
-Consider server-side lemmatization if latency matters more than offline support.
+For browser applications, run lemmatization server-side and expose an API endpoint.
 ### No Query Expansion
@@ -573,4 +556,15 @@ pnpm build:data     # rebuild binary from BÍN source
 ## License
-MIT. Data derived from BÍN under the [BÍN license](https://bin.arnastofnun.is/DMII/LTdata/conditions/).
+MIT for the code.
+### Data License (BÍN)
+The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) (Beygingarlýsing íslensks nútímamáls) © Árni Magnússon Institute for Icelandic Studies.
+**By using this package, you agree to BÍN's conditions:**
+- Credit the Árni Magnússon Institute in your product's UI
+- Do not redistribute the raw data separately
+- Do not publish inflection paradigms without permission
+Full terms: [BÍN License Conditions](https://bin.arnastofnun.is/DMII/LTdata/conditions/)
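
Note on the loading snippet that recurs in the README diff: each `loadFromBuffer` call passes `buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)` rather than `buffer.buffer` directly. A Node.js `Buffer` is a `Uint8Array` view, and its backing `ArrayBuffer` can be larger than the view itself, so the slice copies out exactly the bytes the view covers. The sketch below illustrates that idea with a plain `Uint8Array` so it runs without any data file. The helper name `toArrayBuffer` is hypothetical and is not part of lemma-is.

```typescript
// Hypothetical helper, not part of lemma-is: copy exactly the bytes covered
// by a typed-array view (a Node.js Buffer is one) into its own ArrayBuffer.
function toArrayBuffer(view: Uint8Array): ArrayBuffer {
  return view.buffer.slice(
    view.byteOffset,
    view.byteOffset + view.byteLength
  ) as ArrayBuffer;
}

// Simulate a view that sits inside a larger backing store:
// 3 bytes of payload starting 4 bytes into a 16-byte ArrayBuffer.
const backing = new ArrayBuffer(16);
new Uint8Array(backing).set([10, 20, 30], 4);
const view = new Uint8Array(backing, 4, 3);

// view.buffer is the whole 16-byte store; the slice is only the 3 payload bytes.
const exact = toArrayBuffer(view);
const exactBytes = new Uint8Array(exact);
```

Assuming the same pattern, the result of `readFileSync(...)` could be passed through such a helper before handing it to `BinaryLemmatizer.loadFromBuffer`, which is what the inline slice in the README does.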

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "lemma-is",
-  "version": "0.2.1",
+  "version": "0.2.3",
   "description": "Icelandic word form to lemma lookup for browser and Node.js",
   "keywords": [
     "icelandic",