npm - lemma-is - Versions diffs - 0.2.3 → 0.4.0 - Mend

lemma-is 0.2.3 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

package/README.md +127 -424
package/data-dist/lemma-is.core.bin +0 -0
package/dist/index.d.mts +38 -7
package/dist/index.d.mts.map +1 -1
package/dist/index.mjs +1 -1
package/dist/index.mjs.map +1 -1
package/package.json +5 -2
package/data-dist/lemma-is.bin +0 -0

package/README.md CHANGED

@@ -1,393 +1,154 @@
 # lemma-is
-Icelandic lemmatization for JavaScript. Maps inflected word forms to base forms (lemmas) for search indexing and text processing.
+Fast Icelandic lemmatization for JavaScript. Built for search indexing.
-## Why?
-Existing Icelandic NLP tools are Python/C++:
-| Tool | Runtime | Standalone? | Notes |
-|------|---------|-------------|-------|
-| **[GreynirEngine](https://github.com/mideind/GreynirEngine)** | Python + C++ | ✓ | Gold standard. Full parser, POS tagger. |
-| **[Nefnir](https://github.com/lexis-project/Nefnir)** | Python | ✗ | Requires POS tags from IceNLP/IceStagger (Java, unmaintained). |
-| **lemma-is** | TypeScript | ✓ | Node.js servers. Grammar-based disambiguation, compound splitting. |
-lemma-is trades parsing accuracy for JS ecosystem integration—good enough for search indexing, runs in any Node.js environment.
-## Quickstart
-```bash
-npm install lemma-is
-```
-**Node.js**:
 ```typescript
-import { readFileSync } from "fs";
 import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
-// Binary data is bundled with the package
-const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
-const lemmatizer = BinaryLemmatizer.loadFromBuffer(
-  buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
-);
-lemmatizer.lemmatize("börnin");  // → ["barn"]
-lemmatizer.lemmatize("fóru");    // → ["fara", "fóra"]
+lemmatizer.lemmatize("börnin");   // → ["barn"]
+lemmatizer.lemmatize("keypti");   // → ["kaupa"]
+lemmatizer.lemmatize("hestinum"); // → ["hestur"]
-// Full pipeline for search indexing
-const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
-// → ["barn", "fara", "fóra", "í", "bíó"]
+// Full pipeline for search
+extractIndexableLemmas("Börnin keypti hestinn", lemmatizer);
+// → ["barn", "kaupa", "hestur"]
 ```
 ## The Problem
-Icelandic is highly inflected. A single word appears in dozens of forms:
+Icelandic is heavily inflected. A single noun like "hestur" (horse) has 16 forms:
-| Search term | Forms in documents |
-|-------------|-------------------|
-| hestur (horse) | hestinn, hestinum, hestar, hestarnir, hesta... |
-| barn (child) | börnin, barnið, barna, börnum... |
-| fara (go) | fór, fer, förum, fóru, farið... |
-| kona (woman) | konuna, konunni, kvenna, konum... |
+```
+hestur, hest, hesti, hests, hestar, hesta, hestum, hestanna...
+```
-If you index "Börnin fóru í bíó" by splitting on whitespace, a search for "barn" finds nothing. The word "barn" never appears—only "börnin".
+If a user searches "hestur" but your document contains "hestinum", they won't find it—unless you normalize both to the lemma at index time.
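The effect of normalizing both sides to the lemma can be sketched with a tiny inverted index. This is an illustration only: `TOY_LEMMAS` is a hand-written stand-in for the real lemmatizer, and `toyLemmatize`, `buildLemmaIndex`, and `searchIndex` are hypothetical names that are not part of lemma-is.

```typescript
// Toy stand-in for a lemmatizer: a hand-written form → lemmas table.
// (Assumption for illustration; real data comes from the lemma-is binary.)
const TOY_LEMMAS = new Map<string, string[]>([
  ["hestur", ["hestur"]],
  ["hestinum", ["hestur"]],
  ["hesta", ["hestur"]],
]);

function toyLemmatize(word: string): string[] {
  const key = word.toLowerCase();
  // Unknown words fall back to themselves so they stay searchable.
  return TOY_LEMMAS.get(key) ?? [key];
}

// Inverted index: lemma → ids of documents containing any form of it.
function buildLemmaIndex(docs: string[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  docs.forEach((doc, id) => {
    for (const token of doc.split(/\s+/).filter(Boolean)) {
      for (const lemma of toyLemmatize(token)) {
        let ids = index.get(lemma);
        if (!ids) {
          ids = new Set<number>();
          index.set(lemma, ids);
        }
        ids.add(id);
      }
    }
  });
  return index;
}

// The query goes through the same normalization as the documents.
function searchIndex(index: Map<string, Set<number>>, query: string): number[] {
  const hits: number[] = [];
  for (const lemma of toyLemmatize(query)) {
    const ids = index.get(lemma);
    if (ids) {
      ids.forEach((id) => {
        if (hits.indexOf(id) === -1) hits.push(id);
      });
    }
  }
  return hits.sort((a, b) => a - b);
}

const sampleDocs = ["Ég gaf hestinum hey", "Kötturinn sefur"];
const lemmaIndex = buildLemmaIndex(sampleDocs);
const hestHits = searchIndex(lemmaIndex, "hestur"); // finds document 0 via the shared lemma
```

Searching for "hestur" matches the first document even though only "hestinum" appears in it, because both sides were reduced to the same lemma before comparison.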
-## Solution
+## Why lemma-is?
-```typescript
-lemmatizer.lemmatize("börnin");   // → ["barn"]
-lemmatizer.lemmatize("fóru");     // → ["fara"]
-lemmatizer.lemmatize("kvenna");   // → ["kona"]
-lemmatizer.lemmatize("hestinum"); // → ["hestur"]
-```
+The gold standard for Icelandic NLP is [GreynirEngine](https://github.com/mideind/GreynirEngine)—a full grammatical parser with excellent accuracy. But it's Python-only, which means you can't run it in Node.js, browsers, or edge runtimes without FFI or a sidecar process.
-Now searches for "barn", "fara", or "hestur" match documents containing any of their forms.
+lemma-is trades parsing accuracy for JavaScript portability. It's a lookup table with shallow grammar rules—good enough for search indexing, runs anywhere Node.js runs.
-## Handling Ambiguity
+| | lemma-is | GreynirEngine |
+|---|---|---|
+| **Runtime** | Node, Bun, Deno | Python |
+| **Throughput** | ~250K words/sec | ~25K words/sec |
+| **Cold start** | ~35 ms | ~500 ms |
+| **Memory** | ~185 MB | ~200 MB |
+| **Disambiguation** | Bigrams + grammar rules | Full sentence parsing |
+| **Use case** | Search indexing | NLP analysis |
-Many Icelandic words map to multiple lemmas:
+See [BENCHMARKS.md](./BENCHMARKS.md) for methodology and detailed results.
+### The Trade-off
+lemma-is returns **all possible lemmas** for ambiguous words:
 ```typescript
 lemmatizer.lemmatize("á");
 // → ["á", "eiga"]
-// "á" = preposition "on" / noun "river"
-// "á" = verb "owns" (from "eiga")
-lemmatizer.lemmatize("við");
-// → ["við", "ég", "viður"]
-// "við" = preposition "by/at"
-// "við" = pronoun "we" (from "ég")
-// "við" = noun "wood" (from "viður")
+// Could be preposition "on", noun "river", or verb "owns"
 ```
-### Grammar Rules (Case Government)
+GreynirEngine parses the sentence to return the single correct interpretation. For search, returning all candidates is often better—you'd rather show an extra result than miss a relevant document.
+## Installation
-The library uses shallow grammar rules based on Icelandic case government to disambiguate prepositions:
+```bash
+npm install lemma-is
+```
+## Quick Start
 ```typescript
-import { Disambiguator } from "lemma-is";
+import { readFileSync } from "fs";
+import { BinaryLemmatizer, extractIndexableLemmas } from "lemma-is";
-// lemmatizer loaded as shown in Quickstart
-const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
+// Load the core binary (~9-11 MB, low memory, best for browser/edge)
+const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.core.bin");
+const lemmatizer = BinaryLemmatizer.loadFromBuffer(
+  buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
+);
-// "á borðinu" - borðinu is dative (þgf), á governs dative → preposition
-disambiguator.disambiguate("á", null, "borðinu");
-// → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
+// Basic lemmatization
+lemmatizer.lemmatize("börnin");  // → ["barn"]
+lemmatizer.lemmatize("fóru");    // → ["fara", "fóra"]
-// "ég á" - pronoun before á → likely verb "eiga"
-disambiguator.disambiguate("á", "ég", null);
-// → { lemma: "eiga", pos: "so", resolvedBy: "preference_rules" }
+// Full pipeline for search indexing
+const lemmas = extractIndexableLemmas("Börnin fóru í bíó", lemmatizer);
+// → ["barn", "fara", "fóra", "í", "bíó"]
 ```
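The `buffer.buffer.slice(...)` expression in the Quick Start is there because a Node.js `Buffer` is a `Uint8Array` view, and a view can sit at a nonzero offset inside a larger underlying `ArrayBuffer`. The sketch below shows the difference using a plain `Uint8Array` so it runs without Node.js; `exactBytes` is a hypothetical helper name, not part of lemma-is.

```typescript
// Return a buffer holding exactly the bytes the view covers,
// using the same expression as the Quick Start.
function exactBytes(view: Uint8Array) {
  return view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
}

// Simulate a shared pool: 16 bytes, with our data in a 3-byte window at offset 4.
const pool = new ArrayBuffer(16);
new Uint8Array(pool).fill(0xff); // unrelated bytes elsewhere in the pool
const view = new Uint8Array(pool, 4, 3);
view.set([1, 2, 3]);

const whole = view.buffer;      // the entire 16-byte pool, not just our data
const exact = exactBytes(view); // a 3-byte copy containing only 1, 2, 3
const exactContents = Array.from(new Uint8Array(exact));
```

Passing `whole` to a binary parser would hand it 16 bytes starting at the wrong position, which is why the slice is taken before calling `loadFromBuffer`.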
+## Features
 ### Morphological Features
 The binary includes case, gender, and number for each word form:
 ```typescript
 lemmatizer.lemmatizeWithMorph("hestinum");
-// → [{
-//   lemma: "hestur",
-//   pos: "no",
-//   morph: { case: "þgf", gender: "kk", number: "et" }
-// }]
-// "hestinum" = dative, masculine, singular
-lemmatizer.lemmatizeWithMorph("börnum");
-// → [{
-//   lemma: "barn",
-//   pos: "no",
-//   morph: { case: "þgf", gender: "hk", number: "ft" }
-// }]
-// "börnum" = dative, neuter, plural
+// → [{ lemma: "hestur", pos: "no", morph: { case: "þgf", gender: "kk", number: "et" } }]
+// dative, masculine, singular
 ```
-| Code | Meaning |
-|------|---------|
-| `nf` | nominative (nefnifall) |
-| `þf` | accusative (þolfall) |
-| `þgf` | dative (þágufall) |
-| `ef` | genitive (eignarfall) |
-| `kk` | masculine (karlkyn) |
-| `kvk` | feminine (kvenkyn) |
-| `hk` | neuter (hvorugkyn) |
-| `et` | singular (eintala) |
-| `ft` | plural (fleirtala) |
-### Bigram Disambiguation
+### Grammar-Based Disambiguation
-Use corpus frequencies to pick the most likely lemma based on context:
+Shallow grammar rules use Icelandic case government to disambiguate prepositions:
 ```typescript
-import { processText } from "lemma-is";
+import { Disambiguator } from "lemma-is";
-// BinaryLemmatizer has built-in bigram frequencies for disambiguation
-// "við erum" = "we are" → bigrams favor pronoun "ég" over preposition
-const processed = processText("Við erum hér", lemmatizer, { bigrams: lemmatizer });
-// → disambiguated: "ég" for "við" (high confidence)
+const disambiguator = new Disambiguator(lemmatizer, lemmatizer, { useGrammarRules: true });
-// "á morgun" = "tomorrow" → bigrams favor preposition
-const processed2 = processText("Ég fer á morgun", lemmatizer, { bigrams: lemmatizer });
-// → disambiguated: "á" for "á" (not "eiga")
+// "á borðinu" - borðinu is dative, á governs dative → preposition
+disambiguator.disambiguate("á", null, "borðinu");
+// → { lemma: "á", pos: "fs", resolvedBy: "grammar_rules" }
 ```
-For search indexing, ambiguity is often acceptable—indexing all candidate lemmas improves recall.
+### Compound Splitting
-## Compound Word Splitting
-Icelandic forms long compounds. The library splits them for better search coverage:
+Icelandic forms long compounds. Split them for better search coverage:
 ```typescript
 import { CompoundSplitter, createKnownLemmaSet } from "lemma-is";
+const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
 const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
-splitter.split("bílstjóri");
-// → { isCompound: true, parts: ["bíll", "stjóri"] }
-// "car driver" = "car" + "driver"
 splitter.split("landbúnaðarráðherra");
 // → { isCompound: true, parts: ["landbúnaður", "ráðherra"] }
-// "agriculture minister" = "agriculture" + "minister"
-splitter.split("húsnæðislán");
-// → { isCompound: true, parts: ["húsnæði", "lán"] }
-// "housing loan" = "housing" + "loan"
-```
-### Indexing Compounds
-`getAllLemmas` returns the original word plus its parts—maximizing search recall:
-```typescript
-splitter.getAllLemmas("bílstjóri");
-// → ["bílstjóri", "bíll", "stjóri"]
-```
-A document mentioning "bílstjóri" is now findable by searching for "bíll" (car).
-## Full Pipeline
-For production indexing, combine tokenization, lemmatization, disambiguation, and compound splitting.
-### What Gets Indexed
-Here's a real example showing exactly what lemmas are extracted:
-```typescript
-const text = "Ríkissjóður stendur í blóma ef 27 milljarða arðgreiðsla Íslandsbanka er talin með.";
-const lemmas = extractIndexableLemmas(text, lemmatizer, {
-  bigrams: lemmatizer,
-  compoundSplitter: splitter,
-  removeStopwords: true,
-});
-// Indexed lemmas:
-// ✓ ríkissjóður, ríki, sjóður     — compound + parts
-// ✓ standa                        — stendur → standa
-// ✓ blómi                         — í blóma → blómi
-// ✓ milljarður                    — milljarða → milljarður
-// ✓ arðgreiðsla, arður, greiðsla  — compound + parts
-// ✓ íslandsbanki                  — proper noun (lowercased)
-// ✓ telja                         — talin → telja
-//
-// NOT indexed (stopwords removed):
-// ✗ í, ef, er, með
-```
-A search for "sjóður" or "arður" now finds this document about the state treasury and bank dividends.
-### Another Example: Job Posting
-```typescript
-const posting = "Við leitum að reyndum kennurum til starfa í Reykjavík.";
-const lemmas = extractIndexableLemmas(posting, lemmatizer, {
-  bigrams: lemmatizer,
-  removeStopwords: true,
-});
-// Indexed:
-// ✓ leita, leit               — leitum → leita (+ noun variant)
-// ✓ reyndur, reynd            — reyndum → reyndur
-// ✓ kennari                   — kennurum → kennari
-// ✓ starf, starfa             — starfa (noun + verb)
-// ✓ reykjavík                 — place name (lowercased)
-//
-// NOT indexed:
-// ✗ við, að, til, í           — stopwords
+// "agriculture minister"
 ```
-A search for "kennari" finds this job posting even though the word "kennari" never appears—only "kennurum" (dative plural).
+### Full Pipeline
-### Complex Sentence
+For production indexing, combine everything:
 ```typescript
-const text = "Löngu áður en Jón borðaði ísinn sem hafði bráðnað hratt " +
-             "fór ég á veitingastaðinn og keypti mér rauðvín með hamborgaranum.";
-const lemmas = extractIndexableLemmas(text, lemmatizer, {
-  bigrams: lemmatizer,
-  compoundSplitter: splitter,
-  removeStopwords: true,
-});
-// Verbs (various tenses/persons):
-// ✓ borða      — borðaði (past)
-// ✓ bráðna     — bráðnað (past participle)
-// ✓ fara       — fór (past, different stem!)
-// ✓ kaupa      — keypti (past)
-//
-// Nouns with articles:
-// ✓ ís         — ísinn (NOT "Ísland"!)
-// ✓ veitingastaður, veiting, staður  — compound
-// ✓ rauðvín
-// ✓ hamborgari — hamborgaranum (dative + article)
-```
+import { extractIndexableLemmas, CompoundSplitter, createKnownLemmaSet } from "lemma-is";
-### Setup
-```typescript
-import { readFileSync } from "fs";
-import {
-  BinaryLemmatizer,
-  extractIndexableLemmas,
-  CompoundSplitter,
-  createKnownLemmaSet
-} from "lemma-is";
-const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
-const lemmatizer = BinaryLemmatizer.loadFromBuffer(
-  buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
-);
 const knownLemmas = createKnownLemmaSet(lemmatizer.getAllLemmas());
 const splitter = new CompoundSplitter(lemmatizer, knownLemmas);
-```
-### Search-Optimized Defaults
-The defaults favor **recall over precision**—better for search where missing results is worse than extra results:
-```typescript
-const lemmas = extractIndexableLemmas(text, lemmatizer, {
-  bigrams: lemmatizer,
-  compoundSplitter: splitter,
-  // These are the defaults:
-  // indexAllCandidates: true  — indexes ALL lemma candidates
-  // alwaysTryCompounds: true  — splits compounds even if known in BÍN
-});
-```
-With these defaults:
-- `"á"` → indexes both `"á"` (preposition) AND `"eiga"` (verb)
-- `"húsnæðislán"` → indexes `"húsnæðislán"`, `"húsnæði"`, AND `"lán"`
-### Precision Mode
-If you need only the most likely lemma (chatbots, translation), disable the search optimizations:
+const text = "Ríkissjóður stendur í blóma ef milljarða arðgreiðsla er talin með.";
-```typescript
 const lemmas = extractIndexableLemmas(text, lemmatizer, {
   bigrams: lemmatizer,
   compoundSplitter: splitter,
-  indexAllCandidates: false,  // only disambiguated lemma
-  alwaysTryCompounds: false,  // only split unknown words
+  removeStopwords: true,
 });
-```
-## Word Classes
-Filter by part of speech when context is known:
-```typescript
-lemmatizer.lemmatize("á", { wordClass: "so" }); // → ["eiga"] (verbs only)
-lemmatizer.lemmatize("á", { wordClass: "fs" }); // → ["á"] (prepositions only)
-lemmatizer.lemmatizeWithPOS("á");
-// → [
-//   { lemma: "á", pos: "fs" },   // preposition
-//   { lemma: "á", pos: "no" },   // noun (river)
-//   { lemma: "eiga", pos: "so" } // verb
-// ]
-```
-| Code | Icelandic | English |
-|------|-----------|---------|
-| `no` | nafnorð | noun |
-| `so` | sagnorð | verb |
-| `lo` | lýsingarorð | adjective |
-| `ao` | atviksorð | adverb |
-| `fs` | forsetning | preposition |
-| `fn` | fornafn | pronoun |
-## Data
-Single binary file: `lemma-is.bin` (~91 MB)
-Contains:
-- 289K lemmas from BÍN
-- 3M word form mappings
-- 414K bigram frequencies
-- Morphological features (case, gender, number) per word form
-Uses ArrayBuffer with binary search for efficient memory usage. Format version 2 includes packed morphological data.
-### Building Data
-```bash
-# Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
-# Extract SHsnid.csv to data/
-uv run python scripts/build-binary.py    # builds lemma-is.bin with morph features
+// Indexed: ríkissjóður, ríki, sjóður, standa, blómi, milljarður,
+//          arðgreiðsla, arður, greiðsla, telja
+// Stopwords removed: í, ef, er, með
 ```
-## Node.js Usage
-```typescript
-import { readFileSync } from "fs";
-import { BinaryLemmatizer } from "lemma-is";
-const buffer = readFileSync("node_modules/lemma-is/data-dist/lemma-is.bin");
-const lemmatizer = BinaryLemmatizer.loadFromBuffer(
-  buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
-);
-```
+A search for "sjóður" or "arður" now finds this document.
 ## PostgreSQL Full-Text Search
-PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process text, then store lemmas in a `tsvector` column with the `simple` configuration.
-```sql
-CREATE TABLE documents (
-  id SERIAL PRIMARY KEY,
-  title TEXT,
-  body TEXT,
-  search_vector TSVECTOR
-);
-CREATE INDEX documents_search_idx ON documents USING GIN (search_vector);
-```
-Lemmatize in your app, store as space-separated string:
+PostgreSQL has no built-in Icelandic stemmer. Use lemma-is to pre-process:
 ```typescript
 const lemmas = extractIndexableLemmas(text, lemmatizer, { removeStopwords: true });
@@ -399,160 +160,102 @@ await db.query(
 );
 ```
-Query by lemmatizing search terms the same way:
-```typescript
-const lemmas = extractIndexableLemmas(query, lemmatizer);
-const results = await db.query(
-  `SELECT *, ts_rank(search_vector, q) AS rank
-   FROM documents, plainto_tsquery('simple', $1) q
-   WHERE search_vector @@ q
-   ORDER BY rank DESC`,
-  [Array.from(lemmas).join(" ")]
-);
-// User searches "börnum" → lemmatized to "barn" → matches all forms
-```
-**Why `simple`?** It lowercases but doesn't stem—our lemmas are already normalized. Use `setweight()` to boost title matches over body.
+Use the `simple` configuration—it lowercases but doesn't stem, since our lemmas are already normalized.
-**Diacritics:** PostgreSQL's `unaccent` extension strips accents, but **don't use it for Icelandic**. Characters like á, ö, þ, ð are distinct letters, not accented variants. "á" (river/on/owns) ≠ "a". Preserve diacritics for correct matching.
+**Important:** Don't use PostgreSQL's `unaccent` extension for Icelandic. Characters like á, ö, þ, ð are distinct letters, not accented variants.
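A rough sketch of what accent stripping does to Icelandic text. It imitates the idea with Unicode NFD decomposition followed by removal of combining marks; this is an assumption made for illustration and is not how PostgreSQL's `unaccent` is implemented (that extension uses its own rules file). `stripAccents` is a hypothetical helper.

```typescript
// Decompose characters, then drop combining diacritical marks (U+0300 to U+036F).
function stripAccents(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

const strippedA = stripAccents("á");       // becomes plain "a"
const strippedBorn = stripAccents("börn"); // becomes "born"
const strippedThorn = stripAccents("þ");   // unchanged: þ has no decomposition
const strippedEth = stripAccents("ð");     // unchanged: ð has no decomposition
```

After stripping, "á" and "a" become indistinguishable, so lemmas that differ only in these letters would collide in the index.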
 ## Limitations
-This library makes tradeoffs for portability. Know what you're getting.
+This is an early effort with known limitations.
 ### File Size
-The binary is **~91 MB**. This library targets Node.js server environments where the data is loaded once at startup.
+There are two binaries:
-Not recommended for:
-- **Serverless/edge** — cold start latency loading 91 MB
-- **Browser/Web Workers** — download size prohibitive for most users
-- **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
+- **Core (~9-11 MB)**: default, optimized for browser/edge/cold start
+- **Full (91 MB)**: maximum coverage and disambiguation
-For browser applications, run lemmatization server-side and expose an API endpoint.
+The full binary targets Node.js servers where data loads once at startup. Not recommended for:
-### No Query Expansion
+- **Serverless/edge** — cold start loading 91 MB may be slow
+- **Browser** — download size prohibitive
+- **Cloudflare Workers** — fits 128 MB limit but cold starts are slow
-You can go **word → lemma** but not **lemma → words**:
+For browser apps, use the **core** binary.
-```typescript
-lemmatizer.lemmatize("hestinum"); // → ["hestur"] ✓
+To use the full binary, build it locally:
-// But you CANNOT do:
-lemmatizer.expand("hestur");
-// → ["hestur", "hest", "hesti", "hests", "hestinn", "hestinum", ...] ✗
+```bash
+pnpm build:binary
 ```
-This matters for **search result highlighting**. If a user searches "hestur" and the document contains "hestinum", you can't easily highlight the match without the reverse mapping.
-**Workaround:** Store original text alongside lemmas, use regex patterns for common suffixes.
+Then load it from `data-dist/lemma-is.bin`.
-### Disambiguation Limits
+### Compact Builds (Browser/Edge)
-Bigram disambiguation only works when the word pair exists in the corpus data:
+For cold-start runtimes and the browser, you can build a **compact core** binary that trades accuracy for size by:
+- Keeping only the most frequent word forms
+- Dropping bigram data and morphological features
-```typescript
-// Common phrase: bigrams help
-processText("við erum", lemmatizer, { bigrams: lemmatizer });
-// → "við" disambiguated to "ég" (we) with high confidence
+This reduces memory significantly at the cost of recall/precision on rare words.
-// Rare/unusual phrase: no bigram data
-processText("við flæktumst", lemmatizer, { bigrams: lemmatizer });
-// → "við" picks first candidate, low confidence
+```bash
+pnpm build:core
 ```
-Without context, ambiguous words fall back to arbitrary ordering:
-```typescript
-// Single word, no context
-lemmatizer.lemmatize("á");
-// → ["á", "eiga"] — but which is more likely? No way to know.
+The output is written to `data-dist/lemma-is.core.bin`. Use it exactly like the full binary; it just covers fewer word forms.
-// The preposition "á" is ~100x more common than verb "eiga" in this form,
-// but we don't have unigram frequencies to use as tiebreaker.
-```
-**For search indexing:** Use `indexAllCandidates: true` to index all lemmas and let ranking sort out relevance. For applications needing precision (chatbots, translation), use GreynirEngine instead.
+### Not a Parser
-### Compound Splitting Heuristics
+This is a lookup table with shallow grammar rules, not a grammatical parser. It doesn't understand sentence structure, named entities, or semantic meaning. The grammar rules help with common patterns but can't handle all disambiguation.
-The splitter uses simple rules that miss edge cases:
+For applications needing full grammatical analysis, use [GreynirEngine](https://github.com/mideind/GreynirEngine).
-**Three-part compounds only split once:**
-```typescript
-splitter.split("þjóðmálaráðherra");
-// → ["þjóðmál", "ráðherra"] — missing "þjóð" as separate part
-// Ideal: ["þjóð", "mál", "ráðherra"]
-```
+### Disambiguation Limits
-**Inflected first parts may not match:**
-```typescript
-splitter.split("húseignir");
-// → { isCompound: false } — "hús" appears as "hús" not "húsa"
-// The compound IS "hús" + "eignir" but heuristics miss it
-```
+Bigram disambiguation only works when the word pair exists in corpus data. Without context, ambiguous words return all candidates:
-**May over-split valid words:**
 ```typescript
-splitter.split("landsins");
-// This is NOT a compound — it's "land" + genitive suffix "-sins"
-// Correctly returns { isCompound: false }, but edge cases exist
+lemmatizer.lemmatize("á");
+// → ["á", "eiga"] — no way to know which is more likely
 ```
-**Mitigations:**
-- Use `alwaysTryCompounds: true` to split even known words
-- Use `minPartLength: 2` in CompoundSplitter for more aggressive splitting
-- Over-indexing is usually better than under-indexing for search
-### Not a Parser
-This is a lookup table with shallow grammar rules, not a full grammatical parser. It doesn't understand:
+For search indexing, use `indexAllCandidates: true` (the default) to index all lemmas.
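The fallback behavior described above can be sketched as follows. This is a toy model, not the library's internal logic: `TOY_BIGRAMS` holds made-up counts, and `pickLemmas` is a hypothetical function. It narrows to one lemma only when a matching bigram exists, and otherwise returns every candidate.

```typescript
// Made-up counts keyed by "candidateLemma nextWord" (illustrative only).
const TOY_BIGRAMS = new Map<string, number>([
  ["ég erum", 120],
  ["við erum", 3],
]);

function pickLemmas(candidates: string[], nextWord: string | null): string[] {
  // No context at all: nothing to disambiguate with.
  if (nextWord === null) return candidates;

  let best: string | null = null;
  let bestCount = 0;
  for (const lemma of candidates) {
    const count = TOY_BIGRAMS.get(`${lemma} ${nextWord}`) ?? 0;
    if (count > bestCount) {
      best = lemma;
      bestCount = count;
    }
  }
  // No bigram found for any candidate: keep all of them.
  return best === null ? candidates : [best];
}

const knownPair = pickLemmas(["við", "ég", "viður"], "erum");        // narrowed to "ég"
const unknownPair = pickLemmas(["við", "ég", "viður"], "flæktumst"); // all three kept
const noContext = pickLemmas(["á", "eiga"], null);                   // both kept
```

Keeping every candidate when no evidence exists favors recall, which matches the search-indexing use case.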
-- Full sentence structure or syntax trees
-- Complex verb argument frames
-- Named entity recognition (people, places, companies)
-- Semantic meaning or word sense
+### No Query Expansion
-The grammar rules help with common patterns (preposition + case, pronoun + verb) but can't handle all disambiguation cases. For applications needing full grammatical analysis, use [GreynirEngine](https://github.com/mideind/GreynirEngine). lemma-is is for search indexing where "good enough" recall beats perfect precision.
+You can go word → lemma but not lemma → words. This affects search result highlighting—if a user searches "hestur" and the document contains "hestinum", you can't easily highlight the match.
-## Development
+## Data
-### Testing
+Single binary file containing:
+- 289K lemmas from BÍN
+- 3M word form mappings
+- 414K bigram frequencies
+- Morphological features per word form
-Tests use [Vitest](https://vitest.dev/):
+### Building Data
 ```bash
-pnpm test           # run all tests
-pnpm test:watch     # watch mode
-npx vitest run --update  # update snapshots
-```
+# Download BÍN data from https://bin.arnastofnun.is/DMII/LTdata/k-LTdata/
+# Extract SHsnid.csv to data/
-Test files:
-- `binary-lemmatizer.test.ts` — Core lemmatization and bigram lookup
-- `compounds.test.ts` — Compound word splitting
-- `integration.test.ts` — Full pipeline, search indexing options
-- `pipeline-greynir.test.ts` — Full pipeline with Greynir test sentences
-- `benchmark.test.ts` — Performance and metrics snapshots
-- `icelandic-tricky.test.ts` — Edge cases, morphology examples
-- `limitations.test.ts` — Documented limitations and research notes
-- `mini-grammar.test.ts` — Grammar rules and case government
+uv run python scripts/build-binary.py
+```
-### Building
+## Development
 ```bash
+pnpm test           # run tests
 pnpm build          # build dist/
-pnpm typecheck      # type check without emitting
-pnpm build:data     # rebuild binary from BÍN source
+pnpm typecheck      # type check
 ```
 ## Acknowledgments
-- **[BÍN](https://bin.arnastofnun.is/)** – Morphological database from the Árni Magnússon Institute
-- **[Miðeind](https://mideind.is/)** – Greynir and foundational Icelandic NLP work
-- **[tokenize-is](https://github.com/axelharri/tokenize-is)** – Icelandic tokenizer
+- **[BÍN](https://bin.arnastofnun.is/)** — Morphological database from the Árni Magnússon Institute
+- **[Miðeind](https://mideind.is/)** — GreynirEngine and foundational Icelandic NLP work
+- **[tokenize-is](https://github.com/axelharri/tokenize-is)** — Icelandic tokenizer
 ## License
@@ -560,10 +263,10 @@ MIT for the code.
 ### Data License (BÍN)
-The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) (Beygingarlýsing íslensks nútímamáls) © Árni Magnússon Institute for Icelandic Studies.
+The linguistic data is derived from [BÍN](https://bin.arnastofnun.is/) © Árni Magnússon Institute for Icelandic Studies.
 **By using this package, you agree to BÍN's conditions:**
-- Credit the Árni Magnússon Institute in your product's UI
+- Credit the Árni Magnússon Institute in your product
 - Do not redistribute the raw data separately
 - Do not publish inflection paradigms without permission