
tamil-romanizer 1.0.0 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

package/README.md CHANGED

@@ -1,16 +1,33 @@
-# Tamil Romanizer (v1.0)
+# tamil-romanizer
-A robust, context-aware rule-based Tamil-to-English romanization library.
+A completely context-aware, highly accurate Tamil-to-English romanization library for Node.js and the browser.
-Vastly outperforming naive character-replacement scripts, this library implements a **6-Layer pipeline** powered by grapheme cluster tokenization (`Intl.Segmenter`) and phonological context analysis to yield true, phonetic English mappings from dense Tamil text.
+Unlike naive character-replacement scripts that turn `"சிங்கம்"` into `cinkam`, `tamil-romanizer` understands Tamil phonology. It natively handles intervocalic softening, post-nasal voicing, and word-boundaries to produce natural, readable Tanglish (e.g., `"singam"`).
-## Features
+It is fast, rigorously tested (100% ISO compliance), and built for real-world text.
-- **Grapheme Accurate:** Handles zero-width joiners, complex modifier stacks, and canonical normalization natively.
-- **Context-Aware Phonology:** Detects word-initial constraints, intervocalic softening, post-nasal transformations, and geminate cluster conditions to output dynamically accurate English syntax (e.g. `ப` maps to `p` or `b` dynamically depending on cross-word sandhi context).
-- **Multiple Mapping Schemes:** Natively supports intelligent `practical` syntax (Tanglish/casual phonetic usage) and strict `iso15919` formalized transliteration.
-- **Exception Trie Routing:** Intercepts proper nouns and anglicized loan words prior to phonological transcription.
-- **Foreign-Script Safe:** Can safely ingest paragraphs containing English, Japanese, or other Unicode blocks, surgically romanizing only the Tamil tokens while passing non-Tamil text safely through.
+---
+## Live Demo
+Try the engine instantly in your browser: [**Tamil Romanizer Live Demo**](https://haroldalan.github.io/tamil-romanizer/)
+![Demo GIF](https://raw.githubusercontent.com/haroldalan/tamil-romanizer/main/assets/demo.gif)
+---
+## Why this library?
+Most Tamil transliteration tools fail because they treat the language as a 1-to-1 character map. Tamil doesn't work that way. `tamil-romanizer` analyzes the context of every letter:
+| Tamil Input | Naive approach | `tamil-romanizer` | Why? |
+|-------------|----------------|-------------------|------|
+| **ப**ம்பரம் | **p**am**p**aram | **p**am**b**aram | Identifies word-initial `p` vs post-nasal `b` |
+| ச**ட்ட**ம் | sa**t**am | sa**tt**am | Detects geminate (double) consonant clusters |
+| **ஞா**னம் | **ny**anam | **gn**anam | Uses practical Tanglish conventions for word-initials |
+| **ஃ**பேன் | **ak**paen | **f**an | Analyzes Aytham lookaheads and cross-references an internal proper-noun dictionary |
+---
 ## Installation
@@ -18,46 +35,89 @@ Vastly outperforming naive character-replacement scripts, this library implement
 npm install tamil-romanizer
 ```
-## Usage
+---
+## Quick Start
 ```javascript
 import { romanize } from 'tamil-romanizer';
-// Basic Transliteration
-console.log(romanize("தமிழ்")); // "thamizh"
+// 1. Basic usage maps to highly accurate practical phonetics
+const text = romanize("தமிழ்நாடு");
+console.log(text); // "tamilnadu" (detected via built-in dictionary)
+const text2 = romanize("பம்பரம்");
+console.log(text2); // "pambaram" (context-aware mapping)
+```
+## Advanced Options
+Provide an `options` object as the second argument to control the output format, scheme, or dictionary usage.
+### 1. Capitalization Formatting
+Romanize targets English letters (which have case), while Tamil does not. You can enforce casing rules natively:
+```javascript
+const sentence = "சென்னை ஒரு அழகான நகரம்";
+console.log(romanize(sentence));
+// "chennai oru azhagana nagaram" (Default: 'none' - strict lowercase)
+console.log(romanize(sentence, { capitalize: 'sentence' }));
+// "Chennai oru azhagana nagaram"
+console.log(romanize(sentence, { capitalize: 'words' }));
+// "Chennai Oru Azhagana Nagaram"
+```
+### 2. Scholarly Translating (ISO 15919)
+If you are building an academic tool or require strict, lossless character-level transliteration, use the `iso15919` scheme.
+```javascript
+// ISO 15919 enforces direct diacritic mapping without contextual softening
+const text = romanize("பம்பரம்", { scheme: 'iso15919', exceptions: false });
+console.log(text); // "pamparam"
+const strict = romanize("தமிழ்", { scheme: 'iso15919' });
+console.log(strict); // "tamiḻ"
+```
+*(Also supports `ala-lc` schema via `{ scheme: 'ala-lc' }`)*
+### 3. Turning off the Exception Dictionary
+The library ships with a fast exception trie that automatically corrects common loan words and proper nouns (e.g. `பஸ்` -> `bus`, `சென்னை` -> `Chennai`).
-// Practical Phonics (Contextual Mapping)
-console.log(romanize("பம்பரம்")); // "pambaram" (Initial P, Post-Nasal B)
-console.log(romanize("சிங்கம்")); // "singam"
+If you want the raw, algorithmic output of the underlying state machine, disable the `exceptions` flag:
-// Capitalization Support
-console.log(romanize("சென்னை பயணம்", { capitalize: 'sentence' })); // "Chennai payanam"
+```javascript
+// With dictionary (Default)
+romanize("பஸ்"); // "bus"
-// Strict ISO 15919 Syntax
-console.log(romanize("தமிழ்நாடு", { scheme: 'iso15919' })); // "tamiḻnāṭu"
+// Algorithmic output
+romanize("பஸ்", { exceptions: false }); // "bas"
 ```
-## API Options
-You can adjust parsing behavior globally via a config block on `romanize(text, options)`.
+## Mixed-language Safe
+Don't worry about sanitizing your inputs. If you pass a string containing English, numbers, emojis, or punctuation, `tamil-romanizer` surgically transliterates *only* the Tamil characters and leaves everything else perfectly intact.
-* **scheme**: Target rules (`'practical'` default, `'iso15919'`, or `'ala-lc'`)
-* **exceptions**: Boolean enabling exception lookups (defaults `true`)
-* **capitalize**: Output casing rule (`'none'`, `'words'`, `'sentence'`). *Note: `none` enforces strict lowercase even for proper noun dictionary inputs.*
+```javascript
+const mixed = "The ticket price is ௫௦௦ rupees (ரூபாய்) 🤯!";
+console.log(romanize(mixed));
+// "The ticket price is 500 rupees (roobaay) 🤯!"
+```
+*(Notice how it also safely converts native Tamil numerals natively!)*
-## Architecture
+## API Reference
-1. **Sanitizer:** NFC normalization & format-character stripping.
-2. **Cluster Tokenizer:** Uses `Intl.Segmenter` to split graphemes accurately.
-3. **Decomposer:** Maps bases and vowel modifiers distinctively.
-4. **Context Analyzer:** Positional tagging (Word Initial, Intervocalic, Geminate, Post-Nasal).
-5. **Scheme Resolver:** Base lookup to targeted transliteration schema (`iso15919`, `practical`, `ala-lc`).
-6. **Special Token Handler:** Cross-cluster constraints (Aytham lookaheads, Grantha sequence transformations).
-7. **Exception Trie:** Fast dictionary overrides.
+`romanize(text: string, options?: Object) => string`
-## Testing & Reliability
-This library was built with mathematical rigour, achieving > 98% test coverage via `vitest`.
-* **ISO 15919 Benchmark:** Achieves 100% Character Error Rate (CER) exact-match compliance against the official specification.
-* **Stress Testing:** A built-in CLI is available to test arbitrary text locally:
-  1. Paste text into `test/stress/input.txt`
-  2. Run `node test/stress/evaluate.js`
+| Option | Type | Default | Description |
+|---|---|---|---|
+| `scheme` | `'practical' \| 'iso15919' \| 'ala-lc'` | `'practical'` | Determines the transliteration ruleset. |
+| `exceptions` | `boolean` | `true` | Enables/disables the internal dictionary for loan words. |
+| `capitalize` | `'none' \| 'sentence' \| 'words'` | `'none'` | Controls the casing of the returned string. |
+---
+*Built for Tamil by Harold Alan.*

package/data/exceptions.json CHANGED

@@ -1,9 +1,47 @@
 {
     "சென்னை": "Chennai",
-    "தமிழ்நாடு": "Tamil Nadu",
+    "தமிழ்நாடு": "Tamilnadu",
     "கோயம்புத்தூர்": "Coimbatore",
     "மதுரை": "Madurai",
+    "தஞ்சாவூர்": "Thanjavur",
+    "திருச்சி": "Trichy",
+    "திருச்சிராப்பள்ளி": "Tiruchirappalli",
+    "சேலம்": "Salem",
+    "திருப்பூர்": "Tiruppur",
+    "நெல்லை": "Nellai",
+    "திருநெல்வேலி": "Tirunelveli",
+    "கன்னியாகுமரி": "Kanyakumari",
+    "பெங்களூர்": "Bangalore",
+    "கேரளா": "Kerala",
+    "மும்பை": "Mumbai",
+    "டெல்லி": "Delhi",
+    "கல்கத்தா": "Kolkata",
+    "இந்தியா": "India",
+    "அமெரிக்கா": "America",
+    "லண்டன்": "London",
     "பஸ்": "bus",
+    "கார்": "car",
+    "லாரி": "lorry",
+    "ரயில்": "rail",
+    "பிளைட்": "flight",
     "டீ": "tea",
-    "ஃபேன்": "fan"
+    "காபி": "coffee",
+    "டிபன்": "tiffin",
+    "ஹோட்டல்": "hotel",
+    "டாக்டர்": "doctor",
+    "போலீஸ்": "police",
+    "ஸ்டேஷன்": "station",
+    "கம்ப்யூட்டர்": "computer",
+    "இன்டர்நெட்": "internet",
+    "வாட்ஸ்அப்": "whatsapp",
+    "பேஸ்புக்": "facebook",
+    "யூடியூப்": "youtube",
+    "சினிமா": "cinema",
+    "டிக்கெட்": "ticket",
+    "ஹீரோ": "hero",
+    "டிவி": "TV",
+    "பேங்க்": "bank",
+    "போன்": "phone",
+    "ஃபேக்டரி": "factory",
+    "ஆபீஸ்": "office"
 }
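
Reviewer note: the README hunk above shows `romanize("சென்னை")` printing lowercase `chennai` by default, while this file stores `Chennai`. A minimal sketch of how a whole-word lookup could combine with the `capitalize` option follows. It is not the package's code: the package uses a trie, whereas this sketch uses a plain `Map`, and `lookupException` is a hypothetical helper name.

```javascript
// Three entries copied from exceptions.json; the Map stands in for the real trie.
const exceptions = new Map(Object.entries({
    'சென்னை': 'Chennai',
    'பஸ்': 'bus',
    'டிவி': 'TV'
}));

// Hypothetical helper: exact whole-word match, lowercased when capitalize is 'none'
// (the 1.0.0 README states that 'none' lowercases even dictionary proper nouns).
function lookupException(word, capitalize = 'none') {
    const hit = exceptions.get(word);
    if (hit === undefined) return null; // caller falls through to the rule pipeline
    return capitalize === 'none' ? hit.toLowerCase() : hit;
}

console.log(lookupException('சென்னை'));             // "chennai"
console.log(lookupException('சென்னை', 'sentence')); // "Chennai"
console.log(lookupException('மரம்'));               // null
```

How the real package cases a lowercase entry such as `bus` under `'sentence'` or `'words'` is not visible in this diff, so the sketch leaves those values unchanged.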

package/package.json CHANGED

@@ -1,7 +1,7 @@
 {
     "name": "tamil-romanizer",
-    "version": "1.0.0",
-    "description": "Tamil Romanization Engine v1.0",
+    "version": "1.0.2",
+    "description": "A robust, context-aware rule-based Tamil-to-English romanization library",
     "main": "index.js",
     "type": "module",
     "scripts": {
@@ -9,9 +9,20 @@
     },
     "keywords": [
         "tamil",
+        "tamil_extended",
+        "tamil_superscripted",
+        "phonetic",
         "romanizer",
-        "transliteration"
+        "romanization",
+        "transliteration",
+        "indic-scripts",
+        "nlp",
+        "unicode",
+        "iso15919",
+        "ala-lc",
+        "practical"
     ],
+    "repository": "github:haroldalan/tamil-romanizer",
     "author": "Harold Alan",
     "license": "ISC",
     "files": [

package/src/contextAnalyzer.js CHANGED

@@ -6,6 +6,7 @@ export const contextTags = {
     GEMINATE: 'GEMINATE',
     POST_NASAL: 'POST_NASAL',
     INTERVOCALIC: 'INTERVOCALIC',
+    FRICATIVE_MUTATED: 'FRICATIVE_MUTATED',
     WORD_FINAL: 'WORD_FINAL',
     DEFAULT: 'DEFAULT'
 };
@@ -48,10 +49,12 @@ export function analyzeContext(tokens) {
             // Determine word boundaries.
             // A token is word initial if it's the very first token in the string,
-            // OR if the previous token was a space/punctuation (OTHER).
-            const isWordInitial = index === 0 || (prevToken && prevToken.type === tokenTypes.OTHER);
+            // OR if the previous token was a space/punctuation
+            const isWordInitial = index === 0 ||
+                (prevToken && (prevToken.type === tokenTypes.WHITESPACE || prevToken.type === tokenTypes.PUNCTUATION || prevToken.type === tokenTypes.OTHER));
-            const isWordFinal = index === tokens.length - 1 || (nextToken && nextToken.type === tokenTypes.OTHER);
+            const isWordFinal = index === tokens.length - 1 ||
+                (nextToken && (nextToken.type === tokenTypes.WHITESPACE || nextToken.type === tokenTypes.PUNCTUATION || nextToken.type === tokenTypes.OTHER));
             if (isWordInitial) {
                 tag = contextTags.WORD_INITIAL;
@@ -69,11 +72,15 @@ export function analyzeContext(tokens) {
                 else if (prevToken && prevToken.modifierType === modifierTypes.VIRAMA && nasals.includes(prevToken.base)) {
                     tag = contextTags.POST_NASAL;
                 }
-                // 3. INTERVOCALIC: Preceding cluster holds a vowel, and current cluster holds a vowel.
-                else if (carriesVowel(prevToken) && carriesVowel(token)) {
+                // 3. FRICATIVE_MUTATED: Immediately preceded by an AYTHAM token (ஃ) AND current base is ப or ஜ
+                else if (prevToken && prevToken.type === tokenTypes.AYTHAM && (token.base === 'ப' || token.base === 'ஜ')) {
+                    tag = contextTags.FRICATIVE_MUTATED;
+                }
+                // 4. INTERVOCALIC: Preceding cluster holds a vowel AND current cluster's modifier is not VIRAMA
+                else if (carriesVowel(prevToken) && token.modifierType !== modifierTypes.VIRAMA) {
                     tag = contextTags.INTERVOCALIC;
                 }
-                // 4. WORD_FINAL: Last cluster in a word
+                // 5. WORD_FINAL: Last cluster in a word
                 else if (isWordFinal) {
                     tag = contextTags.WORD_FINAL;
                 }
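
Reviewer note: the order of checks in this hunk matters, because the first matching rule wins. The sketch below isolates that priority order so it can be run on its own. It is an illustration under assumptions, not the package's implementation: `tagFor`, the simplified token shape (`virama`, `hasVowel`), the `CLUSTER` type, and the contents of `NASALS` are all hypothetical, and the geminate rule (not shown in the hunk) is omitted.

```javascript
// Hypothetical stand-ins; only the tag names and separator type names come from the diff.
const tokenTypes = {
    AYTHAM: 'aytham', WHITESPACE: 'whitespace', PUNCTUATION: 'punctuation',
    OTHER: 'other', CLUSTER: 'cluster'
};
const NASALS = ['ங', 'ஞ', 'ண', 'ந', 'ம', 'ன']; // assumed contents of `nasals`

// A missing neighbour (string start/end) or a separator token counts as a word boundary.
const isBoundary = (t) =>
    !t || [tokenTypes.WHITESPACE, tokenTypes.PUNCTUATION, tokenTypes.OTHER].includes(t.type);

// Assumed token shape: { type, base, virama: boolean, hasVowel: boolean }
function tagFor(tokens, i) {
    const prev = tokens[i - 1];
    const next = tokens[i + 1];
    const tok = tokens[i];
    if (isBoundary(prev)) return 'WORD_INITIAL';
    if (prev.virama && NASALS.includes(prev.base)) return 'POST_NASAL';
    if (prev.type === tokenTypes.AYTHAM && (tok.base === 'ப' || tok.base === 'ஜ')) {
        return 'FRICATIVE_MUTATED';
    }
    if (prev.hasVowel && !tok.virama) return 'INTERVOCALIC';
    if (isBoundary(next)) return 'WORD_FINAL';
    return 'DEFAULT';
}

// Small builder for cluster tokens; a bare consonant carries the inherent vowel.
const cluster = (base, virama = false) =>
    ({ type: tokenTypes.CLUSTER, base, virama, hasVowel: !virama });

// பம்பரம் as clusters: ப, ம், ப, ர, ம்
const pambaram = [cluster('ப'), cluster('ம', true), cluster('ப'), cluster('ர'), cluster('ம', true)];
const pambaramTags = pambaram.map((_, i) => tagFor(pambaram, i));
console.log(pambaramTags);
// [ 'WORD_INITIAL', 'DEFAULT', 'POST_NASAL', 'INTERVOCALIC', 'WORD_FINAL' ]

// ஃபேன் as tokens: ஃ, பே, ன்
const fan = [{ type: tokenTypes.AYTHAM, base: 'ஃ' }, cluster('ப'), cluster('ன', true)];
console.log(tagFor(fan, 1)); // "FRICATIVE_MUTATED"
```

Because the Aytham token is not a separator, the `ப` that follows it is not treated as word-initial, which is what lets the new `FRICATIVE_MUTATED` branch fire.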

package/src/romanizer.js CHANGED

@@ -46,28 +46,51 @@ export function romanize(text, options = {}) {
     const cleanText = sanitize(text);
     if (!cleanText) return '';
+    // Tokenize the ENTIRE string first. This fixes punctuation and spaces breaking the Trie.
+    const allTokens = tokenize(cleanText);
+    // We group tokens into "words" bounded by whitespace and punctuation.
+    // E.g., "சென்னை," -> word chunk: "சென்னை", punctuation chunk: ","
+    const chunks = [];
+    let currentChunk = [];
+    for (const token of allTokens) {
+        if (token.type === 'whitespace' || token.type === 'punctuation' || token.type === 'other') {
+            if (currentChunk.length > 0) {
+                chunks.push({ type: 'word', tokens: currentChunk });
+                currentChunk = [];
+            }
+            chunks.push({ type: 'separator', tokens: [token] });
+        } else {
+            currentChunk.push(token);
+        }
+    }
+    if (currentChunk.length > 0) {
+        chunks.push({ type: 'word', tokens: currentChunk });
+    }
     let outputWords = [];
-    // Tokenize by spaces to apply whole-word Exception Trie natively
-    const words = cleanText.split(/(\s+)/);
-    for (const word of words) {
-        if (!word.trim()) {
-            outputWords.push({ text: word, isException: false });
+    for (const chunk of chunks) {
+        if (chunk.type === 'separator') {
+            outputWords.push({ text: chunk.tokens[0].text, isException: false });
             continue;
         }
-        // Step 2. Exception Trie Intercept
+        // Reconstruct the raw text of the Tamil word for Trie lookup
+        const wordText = chunk.tokens.map(t => t.text).join('');
+        // Step 2. Exception Trie Intercept (Right after Layer 1 chunking)
         if (exceptions) {
-            const hardMatch = exceptionDictionary.lookup(word);
+            const hardMatch = exceptionDictionary.lookup(wordText);
             if (hardMatch) {
                 outputWords.push({ text: hardMatch, isException: true });
-                continue;
+                continue; // Bypass Layers 2-5 for this chunk completely
             }
         }
-        // Pipeline Execution
-        const tokens = tokenize(word);
-        const decomposed = decompose(tokens);
+        // Pipeline Execution for non-exception clusters
+        const decomposed = decompose(chunk.tokens);
         const analyzed = analyzeContext(decomposed);
         const resolved = resolveScheme(analyzed, scheme, table);
         const finalizedWord = handleSpecialTokens(resolved, scheme);
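
Reviewer note: the chunking loop added in this hunk can be exercised in isolation. The sketch below mirrors its logic on hand-written tokens; `chunkTokens` is a hypothetical function name, and the token list is an assumed tokenizer output rather than something produced by the package.

```javascript
// Token types treated as separators, matching the string literals in the hunk.
const SEPARATORS = new Set(['whitespace', 'punctuation', 'other']);

// Groups a flat token list into alternating word and separator chunks.
function chunkTokens(tokens) {
    const chunks = [];
    let current = [];
    for (const token of tokens) {
        if (SEPARATORS.has(token.type)) {
            if (current.length > 0) {
                chunks.push({ type: 'word', tokens: current });
                current = [];
            }
            chunks.push({ type: 'separator', tokens: [token] });
        } else {
            current.push(token);
        }
    }
    if (current.length > 0) chunks.push({ type: 'word', tokens: current });
    return chunks;
}

// Assumed tokenizer output for "சென்னை, பஸ்"
const sampleTokens = [
    { type: 'consonant_vowel_sign', text: 'செ' },
    { type: 'consonant_virama', text: 'ன்' },
    { type: 'consonant_vowel_sign', text: 'னை' },
    { type: 'punctuation', text: ',' },
    { type: 'whitespace', text: ' ' },
    { type: 'consonant_bare', text: 'ப' },
    { type: 'consonant_virama', text: 'ஸ்' }
];

const sampleChunks = chunkTokens(sampleTokens);
console.log(sampleChunks.map((c) => c.type));
// [ 'word', 'separator', 'separator', 'word' ]

// Each word chunk is re-joined into plain text before the exception lookup.
const wordTexts = sampleChunks
    .filter((c) => c.type === 'word')
    .map((c) => c.tokens.map((t) => t.text).join(''));
console.log(wordTexts); // the two Tamil words, without the comma or the space
```

Compared with the removed `split(/(\s+)/)` approach, a trailing comma no longer becomes part of the lookup key, so `"சென்னை,"` can still match the dictionary entry for `"சென்னை"`.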

package/src/sanitizer.js CHANGED

@@ -10,8 +10,11 @@ export function sanitize(text) {
     if (typeof text !== 'string') return '';
     return text
-        // 1. ZWJ (U+200D) / ZWNJ (U+200C) removal
-        .replace(/[\u200C\u200D]/g, '')
+        // 1. ZWJ (U+200D) / ZWNJ (U+200C) removal ONLY when adjacent to Tamil text
+        // We match any Tamil character (U+0B80-U+0BFF) followed optionally by ZWJ/ZWNJ repeatedly
+        // to strictly scope the removal and prevent corrupting Malayalam text.
+        .replace(/([\u0B80-\u0BFF])[\u200C\u200D]+/g, '$1')
+        .replace(/[\u200C\u200D]+([\u0B80-\u0BFF])/g, '$1')
         // 2. ஸ்ரீ (Sri) canonicalization
         // Normalize variant `ஶ்ரீ` (U+0BB6) to canonical `ஸ்ரீ` (U+0BB8)
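
Reviewer note: the two scoped replacements can be checked on their own. The sketch below copies both regular expressions from the hunk into a hypothetical helper, `stripJoinersNearTamil`, and runs them on a Tamil and a Malayalam sequence written with Unicode escapes.

```javascript
// Same two replacements as in the hunk, isolated for experimentation.
function stripJoinersNearTamil(text) {
    return text
        .replace(/([\u0B80-\u0BFF])[\u200C\u200D]+/g, '$1')  // joiner(s) after a Tamil char
        .replace(/[\u200C\u200D]+([\u0B80-\u0BFF])/g, '$1'); // joiner(s) before a Tamil char
}

const tamilWithZwj = '\u0B95\u0BCD\u200D\u0BB7'; // க் + ZWJ + ஷ
const malayalamWithZwj = '\u0D28\u0D4D\u200D';   // Malayalam na + virama + ZWJ

console.log(stripJoinersNearTamil(tamilWithZwj) === '\u0B95\u0BCD\u0BB7'); // true
console.log(stripJoinersNearTamil(malayalamWithZwj) === malayalamWithZwj); // true

// Edge case: a joiner sitting directly between a Malayalam and a Tamil character
// is still removed, because the second pattern only inspects the character after it.
const mixedBoundary = '\u0D28\u0D4D\u200D\u0B95';
console.log(stripJoinersNearTamil(mixedBoundary) === '\u0D28\u0D4D\u0B95'); // true
```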

package/src/schemeResolver.js CHANGED

@@ -8,7 +8,8 @@ const schemes = {
     iso15919,
     practical,
     alaLc,
-    'ala-lc': alaLc
+    'ala-lc': alaLc,
+    'alalc': alaLc
 };
 // Maps vowel signs (modifier part of cluster) to their pure vowel equivalent

package/src/schemes/practical.js CHANGED

@@ -8,8 +8,8 @@ export default {
         'க': { DEFAULT: 'k', INTERVOCALIC: 'g', POST_NASAL: 'g', GEMINATE: 'kk' },
         'ச': { DEFAULT: 's', WORD_INITIAL: 's', INTERVOCALIC: 's', POST_NASAL: 'j', GEMINATE: 'chch' },
         'ட': { DEFAULT: 't', INTERVOCALIC: 'd', POST_NASAL: 'd', GEMINATE: 'tt' },
-        'த': { DEFAULT: 'th', INTERVOCALIC: 'd', POST_NASAL: 'd', GEMINATE: 'tth' },
-        'ப': { DEFAULT: 'p', INTERVOCALIC: 'b', POST_NASAL: 'b', GEMINATE: 'pp' },
+        'த': { DEFAULT: 'th', INTERVOCALIC: 'd', POST_NASAL: 'dh', GEMINATE: 'tth' },
+        'ப': { DEFAULT: 'p', INTERVOCALIC: 'b', POST_NASAL: 'b', GEMINATE: 'pp', FRICATIVE_MUTATED: 'f' },
         'ற': { DEFAULT: 'r', INTERVOCALIC: 'r', POST_NASAL: 'dr', GEMINATE: 'tr' },
         // Nasals and other consonants that change based on context or position
         'ங': { DEFAULT: 'n', WORD_INITIAL: 'ng' },
@@ -26,7 +26,7 @@ export default {
         'ர': { DEFAULT: 'r' },
         'வ': { DEFAULT: 'v' },
         // Grantha mappings standard for practical
-        'ஜ': { DEFAULT: 'j' },
+        'ஜ': { DEFAULT: 'j', FRICATIVE_MUTATED: 'z' },
         'ஷ': { DEFAULT: 'sh' },
         'ஸ': { DEFAULT: 's' },
         'ஹ': { DEFAULT: 'h' }
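
Reviewer note: the new `FRICATIVE_MUTATED` keys only take effect if the resolver looks up the context tag first. The sketch below shows one plausible lookup over an excerpt of this table. The fallback to `DEFAULT` when a tag has no entry is an assumption (the resolver itself is not part of this hunk), and `resolveConsonant` is a hypothetical name.

```javascript
// Excerpt of the practical table as it stands after this change.
const consonants = {
    'த': { DEFAULT: 'th', INTERVOCALIC: 'd', POST_NASAL: 'dh', GEMINATE: 'tth' },
    'ப': { DEFAULT: 'p', INTERVOCALIC: 'b', POST_NASAL: 'b', GEMINATE: 'pp', FRICATIVE_MUTATED: 'f' },
    'ஜ': { DEFAULT: 'j', FRICATIVE_MUTATED: 'z' }
};

// Assumed resolution rule: use the tag-specific value, else fall back to DEFAULT.
function resolveConsonant(base, tag) {
    const entry = consonants[base];
    if (!entry) return undefined; // base not covered by this excerpt
    return entry[tag] ?? entry.DEFAULT;
}

console.log(resolveConsonant('ப', 'FRICATIVE_MUTATED')); // "f"
console.log(resolveConsonant('ஜ', 'FRICATIVE_MUTATED')); // "z"
console.log(resolveConsonant('ப', 'WORD_INITIAL'));      // "p" (no entry, falls back)
console.log(resolveConsonant('த', 'POST_NASAL'));        // "dh"
```

Moving the `f`/`z` mapping into the table removes the need for the regex patch-up that `specialTokens.js` performed in 1.0.0.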

package/src/specialTokens.js CHANGED

@@ -22,14 +22,9 @@ export function handleSpecialTokens(resolvedTokens, schemeName = 'practical') {
             const nextToken = resolvedTokens[i + 1];
             if (isPractical) {
-                if (nextToken && nextToken.base === 'ப') {
-                    // Replace 'p' or 'b' with 'f' in the next token's romanization
-                    nextToken.romanized = nextToken.romanized.replace(/^[pb]/i, 'f');
-                } else if (nextToken && nextToken.base === 'ஜ') {
-                    // Replace 'j' with 'z' in the next token's romanization
-                    nextToken.romanized = nextToken.romanized.replace(/^j/i, 'z');
-                }
-                // For other cases or standalone 'ஃ', it's omitted in practical scheme, so nothing is added to outputString here.
+                // In practical scheme, the Āytham token itself is dropped silently.
+                // The subsequent base ('ப' or 'ஜ') has already been mutated by Layer 3 + Layer 4 to 'f' or 'z'.
+                // So we do nothing to outputString here.
             } else {
                 // ISO 15919
                 outputString += 'ḵ';

package/src/tokenizer.js CHANGED

@@ -5,7 +5,11 @@ export const tokenTypes = {
     CONSONANT_VIRAMA: 'consonant_virama',
     CONSONANT_VOWEL_SIGN: 'consonant_vowel_sign',
     CONSONANT_BARE: 'consonant_bare',
-    OTHER: 'other' // numerals, punctuation, spaces, non-tamil
+    AYTHAM: 'aytham',
+    WHITESPACE: 'whitespace',
+    NUMERAL: 'numeral',
+    PUNCTUATION: 'punctuation',
+    OTHER: 'other' // non-tamil
 };
 // Vowels (அ to ஔ) U+0B85 to U+0B94
@@ -29,6 +33,10 @@ const isVowelSign = (char) => {
     return code >= 0x0BBE && code <= 0x0BCD && code !== 0x0BCD; // Exclude virama explicitly
 };
+const isWhitespace = (str) => /^\s+$/.test(str);
+const isNumeral = (str) => /^\d+$/.test(str) || /^[\u0BE6-\u0BEF]+$/.test(str); // matches 0-9 and tamil numerals
+const isPunctuation = (str) => /^[.,/#!$%^&*;:{}=\-_`~()""'']+$/.test(str);
 /**
  * Tokenizes a sanitized Tamil string into grapheme clusters.
  *
@@ -43,7 +51,15 @@ export function tokenize(text) {
         let type = tokenTypes.OTHER;
         // Check classification based on first character and any modifiers
-        if (segment.length === 1) {
+        if (segment === 'ஃ') {
+            type = tokenTypes.AYTHAM;
+        } else if (isWhitespace(segment)) {
+            type = tokenTypes.WHITESPACE;
+        } else if (isNumeral(segment)) {
+            type = tokenTypes.NUMERAL;
+        } else if (isPunctuation(segment)) {
+            type = tokenTypes.PUNCTUATION;
+        } else if (segment.length === 1) {
             if (isVowel(segment)) {
                 type = tokenTypes.VOWEL;
             } else if (isConsonant(segment)) {
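
Reviewer note: the new branch order classifies a segment as Aytham, whitespace, numeral, or punctuation before the existing vowel and consonant checks run. The sketch below reproduces only those four early checks. `classifyEarly` is a hypothetical helper, the quote characters in the punctuation class are reduced to plain ASCII quotes, and the `'other'` result here stands for "falls through to the cluster checks that this hunk does not show in full".

```javascript
const isWhitespace = (str) => /^\s+$/.test(str);
// \d without the `u` flag matches ASCII 0-9 only; Tamil digits need the explicit range.
const isNumeral = (str) => /^\d+$/.test(str) || /^[\u0BE6-\u0BEF]+$/.test(str);
const isPunctuation = (str) => /^[.,\/#!$%^&*;:{}=\-_`~()"']+$/.test(str);

// Early classification only; real clusters (vowels, consonants) are not handled here.
function classifyEarly(segment) {
    if (segment === '\u0B83') return 'aytham'; // ஃ
    if (isWhitespace(segment)) return 'whitespace';
    if (isNumeral(segment)) return 'numeral';
    if (isPunctuation(segment)) return 'punctuation';
    return 'other';
}

console.log(classifyEarly('\u0B83')); // "aytham"
console.log(classifyEarly(' '));      // "whitespace"
console.log(classifyEarly('5'));      // "numeral"
console.log(classifyEarly('\u0BEB')); // "numeral" (Tamil digit five)
console.log(classifyEarly(','));      // "punctuation"
console.log(classifyEarly('?'));      // "other" ('?' is not in the punctuation class)
```

The last line points at a gap worth noting: characters such as `?` are absent from the punctuation class, so they are typed as `other`. Both types are treated as word boundaries in `contextAnalyzer.js` and `romanizer.js`, so the chunking result is the same either way.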