twl-generator 1.2.15 → 1.3.1

Files changed (4)
  1. package/README.md +282 -57
  2. package/package.json +5 -2
  3. package/src/cli.js +72 -74
  4. package/src/index.js +807 -27
package/README.md CHANGED
@@ -1,105 +1,330 @@
- # twl-generator
+ # TWL Generator
 
- Generate term-to-article lists from unfoldingWord en_tw archive for Bible books. Works in both Node.js (CLI) and React.js (browser) environments with intelligent caching.
+ A Node.js library and CLI tool for generating Translation Word Links (TWL) TSV files from Door43 USFM data and Translation Words (TW) metadata. This tool intelligently matches biblical terms with their corresponding Translation Words articles using Strong's numbers, morphological analysis, and contextual matching.
 
- ## Features
+ ## Installation
 
- - **Universal**: Works in Node.js and browser environments
- - ✅ **Smart Caching**: File system (Node.js) or localStorage/sessionStorage (browser)
- - **Performance**: Optimized matching with PrefixTrie algorithm
- - ✅ **Case Sensitivity**: Proper God/god distinction (God→kt/god, god→kt/falsegod)
- - ✅ **Morphological Variants**: Handles plurals, possessives, verb forms
- - ✅ **Parentheses Normalization**: "Joseph (OT)" → "Joseph" for better coverage
+ ### Global CLI Installation
+ ```bash
+ npm install -g twl-generator
+ ```
 
- ---
+ ### Library Installation
+ ```bash
+ npm install twl-generator
+ ```
 
  ## Usage
 
- ### CLI
+ ### Command Line Interface
 
- Install globally:
+ Generate TWL for a specific book:
+ ```bash
+ twl-generator --book rut
+ # Creates: rut.twl.tsv and rut.no-match.twl.tsv
+ ```
 
+ Generate TWL for all books:
  ```bash
- npm install -g twl-generator
+ twl-generator --all --out-dir ./output
+ # Creates TWL files for all 66 biblical books
  ```
 
- Generate a TWL TSV for a Bible book (downloads USFM from Door43):
+ Specify custom output location:
+ ```bash
+ twl-generator --book mat --out matthew.twl.tsv
+ ```
 
+ Enable advanced verb conjugation matching:
  ```bash
- twl-generator --book rut
+ twl-generator --book jhn --use-compromise
+ # Uses compromise.js for better verb form detection
  ```
 
- Generate a TWL TSV from a local USFM file:
+ #### CLI Options
+ - `--book <code>`: Book code (e.g., gen, exo, mat, mrk, jhn, etc.)
+ - `--all`: Generate TWL files for all biblical books
+ - `--out <file>`: Specify output file path
+ - `--out-dir <dir>`: Output directory (for --all option)
+ - `--use-compromise`: Enable advanced morphological analysis using compromise.js
 
- ```bash
- twl-generator --usfm ./myfile.usfm
+ ### Library Usage
+
+ #### Basic Usage
+ ```javascript
+ import { generateTwlByBook } from 'twl-generator';
+
+ // Generate TWL for Ruth
+ const result = await generateTwlByBook('rut');
+ console.log(result.matchedTsv); // Main TWL output
+ console.log(result.noMatchTsv); // Unmatched entries for analysis
  ```
 
- Specify output file:
+ #### With Advanced Options
+ ```javascript
+ import { generateTwlByBook } from 'twl-generator';
 
- ```bash
- twl-generator --usfm ./myfile.usfm --output ./output.tsv
+ // Use advanced morphological analysis
+ const result = await generateTwlByBook('jhn', {
+   useCompromise: true // Enable compromise.js for better verb matching
+ });
+
+ // Save to files
+ import fs from 'fs/promises';
+ await fs.writeFile('john.twl.tsv', result.matchedTsv);
+ await fs.writeFile('john.no-match.tsv', result.noMatchTsv);
+ ```
+
+ #### Integration Example
+ ```javascript
+ import { generateTwlByBook } from 'twl-generator';
+
+ async function processBibleBook(bookCode) {
+   try {
+     const { matchedTsv, noMatchTsv } = await generateTwlByBook(bookCode);
+
+     // Process the TSV data
+     const lines = matchedTsv.split('\n');
+     const header = lines[0];
+     const rows = lines.slice(1).filter(Boolean);
+
+     console.log(`Generated ${rows.length} TWL entries for ${bookCode.toUpperCase()}`);
+
+     // Further processing...
+     return { success: true, entries: rows.length };
+   } catch (error) {
+     console.error(`Failed to process ${bookCode}:`, error);
+     return { success: false, error: error.message };
+   }
+ }
  ```
 
- You can also combine `--book` and `--usfm` (book is used for output filename and context):
+ ## How It Works
+
+ The TWL Generator uses a sophisticated multi-stage process to create Translation Word Links:
+
+ ### 1. **Data Sources**
+ - **Original Language USFM**: Hebrew (hbo_uhb) and Greek (el-x-koine_ugnt) texts from Door43
+ - **English Bible**: unfoldingWord Literal Text (en_ult) for context matching
+ - **Translation Words**: Local `tw_strongs_list.json` containing Strong's mappings and term definitions
+ - **Strong's Numbers**: Links between original language words and semantic concepts
+
+ ### 2. **Processing Pipeline**
+
+ #### Stage 1: Extract Strong's Data
+ - Parses USFM `\w` tags to extract Strong's numbers from original language texts
+ - Builds initial TSV with Reference, Strong's ID, and surface words
+ - Handles multi-word phrases that share Strong's number sequences
+
+ #### Stage 2: Generate English Context
+ - Uses `tsv-quote-converters` to find corresponding English text (GLQuote) in ULT
+ - Adds GLQuote and GLOccurrence columns for contextual matching
+ - Converts to OrigWords/Occurrence format for processing
+
+ #### Stage 3: Intelligent Article Selection
+ For each Strong's number and its English context, the system:
+
+ 1. **Prioritizes candidate articles** based on:
+    - Articles whose slug appears in the GLQuote text
+    - Article type preference: kt/ (key terms) → names/ → other/
+    - Alphabetical sorting within each category
+
+ 2. **Performs 4-stage matching** (best match wins):
+    - **Stage 1**: Case-sensitive word boundary matching
+    - **Stage 2**: Case-insensitive word boundary matching
+    - **Stage 3**: Case-sensitive substring matching
+    - **Stage 4**: Case-insensitive morphological variants
+
+ 3. **Morphological analysis** includes:
+    - Pluralization (dog → dogs, man → men)
+    - Verb conjugation (-ing, -ed forms)
+    - Irregular verb forms (go → went, see → saw)
+    - Optional advanced analysis with compromise.js
+
+ #### Stage 4: Quality Assurance
+ - Generates disambiguation info when multiple articles could match
+ - Marks entries as "Variant of" when morphological variants are used
+ - Creates separate files for matched and unmatched entries
+ - Provides detailed statistics and sample unmatched entries
+
+ ### 3. **Output Format**
+
+ The generated TSV contains these columns:
+
+ | Column | Description |
+ |--------|-------------|
+ | Reference | Chapter:verse (e.g., "1:1") |
+ | ID | Random 4-character ID starting with letter |
+ | Tags | "keyterm", "name", or empty based on article type |
+ | OrigWords | The matched word(s) from the text |
+ | Occurrence | Which occurrence of this word in the verse |
+ | TWLink | Link to Translation Words article (rc://*/tw/dict/bible/...) |
+ | Strongs | Original Strong's number |
+ | GLQuote | English text context from ULT |
+ | GLOccurrence | Occurrence number in English context |
+ | Variant of | Original term if morphological variant was used |
+ | Disambiguation | List of other possible articles |
+
+ ### 4. **Matching Examples**
 
+ ```
+ Reference  OrigWords  GLQuote            TWLink                          Variant of
+ 1:17       grace      grace and truth    rc://*/tw/dict/bible/kt/grace
+ 1:17       gracious   gracious God       rc://*/tw/dict/bible/kt/grace   grace
+ 2:3        men        wise men came      rc://*/tw/dict/bible/other/man
+ 2:3        wisdom     with great wisdom  rc://*/tw/dict/bible/kt/wise    wise
+ ```
+
+ ## Development
+
+ ### Prerequisites
+ - Node.js 18+ (uses native fetch)
+ - Git access to Door43 repositories
+
+ ### Setup
  ```bash
- twl-generator --usfm ./myfile.usfm --book rut
+ git clone https://github.com/unfoldingWord/node-twl-generator.git
+ cd node-twl-generator
+ npm install
  ```
 
- ---
+ ### Testing
+ ```bash
+ # Test single book generation
+ npm test
 
- ### As a Library (Node.js/ESM/React)
+ # Test specific book
+ npm run cli -- --book rut
 
- Install as a dependency:
+ # Test with advanced morphology
+ npm run cli -- --book jhn --use-compromise
+ ```
 
+ ### Local Development
  ```bash
- npm install twl-generator
+ # Run CLI locally
+ node src/cli.js --book gen --out test-output.tsv
+
+ # Test library integration
+ node -e "import('./src/index.js').then(m => m.generateTwlByBook('rut').then(console.log))"
  ```
 
- #### Example: Generate TWL TSV from USFM string
+ ### Project Structure
+ ```
+ src/
+ ├── cli.js                          # Command line interface
+ ├── index.js                        # Main library exports
+ ├── common/
+ │   └── books.js                    # Bible book metadata
+ └── utils/
+     ├── twl-matcher.js              # Term matching algorithms (legacy)
+     ├── zipProcessor.js             # TW archive processing (legacy)
+     └── usfm-alignment-remover.js   # USFM parsing (legacy)
+ tw_strongs_list.json                # Translation Words database
+ ```
 
- ```js
- import { generateTWLWithUsfm } from 'twl-generator';
+ ## Data Files
+
+ ### `tw_strongs_list.json`
+ This file contains the core mapping between Strong's numbers and Translation Words articles:
+
+ ```json
+ {
+   "kt/god": {
+     "article": {
+       "terms": ["God", "god", "deity", "divine"]
+     },
+     "strongs": [
+       ["H430"], // Single Strong's number
+       ["H410"],
+       ["G2316", "G2318"] // Multiple Strong's for compound concepts
+     ]
+   }
+ }
+ ```
 
- // USFM string (can be loaded from file, API, etc.)
- const usfmContent = `
- \\id MAT
- \\c 1
- \\v 1 In the beginning...
- `;
+ ## Contributing
 
- const book = 'mat';
+ We welcome contributions! Here's how you can help:
 
- const tsv = await generateTWLWithUsfm(book, usfmContent);
- // tsv is a string in TSV format, ready to save or process
- console.log(tsv);
- ```
+ ### Reporting Issues
+ - **Missing matches**: If legitimate biblical terms aren't being matched
+ - **False positives**: If non-terms are being incorrectly matched
+ - **Performance issues**: Slow processing or memory problems
+ - **Data quality**: Incorrect Strong's mappings or term definitions
 
- #### Example: Generate TWL TSV by fetching USFM for a book
+ ### Enhancement Ideas
+ - **Better morphological analysis**: Improve verb conjugation and irregular forms
+ - **Multi-language support**: Extend beyond English GLQuotes
+ - **Contextual disambiguation**: Use surrounding words for better article selection
+ - **Performance optimization**: Faster processing for large corpora
 
- ```js
- import { generateTWLWithUsfm } from 'twl-generator';
+ ### Development Workflow
+ 1. Fork the repository
+ 2. Create a feature branch: `git checkout -b feature-name`
+ 3. Make your changes with tests
+ 4. Run the test suite: `npm test`
+ 5. Submit a pull request with detailed description
 
- const book = 'rut'; // Book code
+ ### Testing Your Changes
+ ```bash
+ # Test various scenarios
+ npm run cli -- --book psa --use-compromise # Large book with advanced features
+ npm run cli -- --book phm # Short book for quick testing
+ npm run cli -- --book rev # Symbolic language testing
+ ```
 
- const tsv = await generateTWLWithUsfm(book);
- // This will fetch the USFM for the book from Door43 and return the TSV string
- console.log(tsv);
+ ## Browser Compatibility
+
+ While primarily designed for Node.js, core functionality works in modern browsers:
+
+ ```javascript
+ // React/Browser usage example
+ import { generateTwlByBook } from 'twl-generator';
+
+ const MyComponent = () => {
+   const [tsvData, setTsvData] = useState(null);
+
+   const generateTWL = async () => {
+     try {
+       const result = await generateTwlByBook('mat');
+       setTsvData(result.matchedTsv);
+     } catch (error) {
+       console.error('TWL generation failed:', error);
+     }
+   };
+
+   return (
+     <div>
+       <button onClick={generateTWL}>Generate TWL for Matthew</button>
+       {tsvData && <pre>{tsvData}</pre>}
+     </div>
+   );
+ };
  ```
 
- ---
+ ## Performance
 
- ### API Reference
+ Typical processing times:
+ - **Short books** (Philemon, 2-3 John): < 5 seconds
+ - **Medium books** (Ruth, Ephesians): 5-15 seconds
+ - **Large books** (Psalms, Matthew): 30-60 seconds
+ - **All books**: 15-30 minutes depending on network speed
 
- #### `generateTWLWithUsfm(book, usfmContent?)`
+ Memory usage scales with book size, typically 50-200MB peak.
 
- - `book`: (string) Book code (e.g., 'mat', 'rut'). Required if `usfmContent` is not provided.
- - `usfmContent`: (string, optional) USFM file content. If provided, this is used instead of fetching from Door43.
- - **Returns:** `Promise<string>` — TSV string of TWL matches.
+ ## License
 
- ---
+ MIT License - see [LICENSE](LICENSE) file for details.
 
- ## License
+ ## Support
+
+ - **Issues**: https://github.com/unfoldingWord/node-twl-generator/issues
+ - **Discussions**: https://github.com/unfoldingWord/node-twl-generator/discussions
+ - **Documentation**: https://github.com/unfoldingWord/node-twl-generator/wiki
+
+ ## Related Projects
 
- MIT
+ - [tsv-quote-converters](https://www.npmjs.com/package/tsv-quote-converters) - GLQuote generation
+ - [compromise](https://www.npmjs.com/package/compromise) - Advanced morphological analysis
+ - [Door43 Content](https://git.door43.org/unfoldingWord) - Source biblical texts and resources
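
The four matching stages listed in the new README ("Performs 4-stage matching") can be sketched roughly as below. This is only an illustration under stated assumptions, not the code added to `src/index.js`, whose diff is not shown here: the names `escapeRegExp`, `simpleVariants`, and `matchStage` are hypothetical, and the variant generator covers just a few regular suffixes rather than the irregular forms or compromise.js analysis the README mentions.

```javascript
// Hypothetical helper: escape a term so it can be embedded in a RegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical and deliberately naive: a few regular suffix variants only.
function simpleVariants(term) {
  const t = term.toLowerCase();
  return [t + 's', t + 'es', t + 'ed', t + 'ing'];
}

// Returns the first stage (1-4) at which `term` matches `glQuote`, or 0 if none does.
function matchStage(term, glQuote) {
  const bounded = `\\b${escapeRegExp(term)}\\b`;
  if (new RegExp(bounded).test(glQuote)) return 1;      // case-sensitive word boundary
  if (new RegExp(bounded, 'i').test(glQuote)) return 2; // case-insensitive word boundary
  if (glQuote.includes(term)) return 3;                 // case-sensitive substring
  const lower = glQuote.toLowerCase();
  for (const v of simpleVariants(term)) {               // case-insensitive variants
    if (new RegExp(`\\b${escapeRegExp(v)}\\b`).test(lower)) return 4;
  }
  return 0;
}

console.log(matchStage('God', 'the God of Israel')); // 1
console.log(matchStage('god', 'the God of Israel')); // 2
console.log(matchStage('Bless', 'he blessed them')); // 4
```

Because the stages are tried in order, a plain substring hit (stage 3) wins over a suffix variant (stage 4) in this sketch; the README's "best match wins" rule in the real package may resolve such ties differently.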
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "twl-generator",
-   "version": "1.2.15",
+   "version": "1.3.1",
    "description": "Generate term-to-article lists from unfoldingWord en_tw archive for Bible books. Works in both Node.js (CLI) and React.js (browser) environments.",
    "main": "src/index.js",
    "bin": {
@@ -46,7 +46,10 @@
      "node": ">=18.0.0"
    },
    "dependencies": {
-     "jszip": "^3.10.1"
+     "csv-parse": "^5.5.6",
+     "csv-stringify": "^6.5.0",
+     "compromise": "^14.14.2",
+     "tsv-quote-converters": "^1.1.13"
    },
    "peerDependencies": {
      "react": ">=16.8.0"
package/src/cli.js CHANGED
@@ -1,86 +1,84 @@
  #!/usr/bin/env node
- import { generateTWLWithUsfm } from './index.js';
- import fs from 'fs';
- import path from 'path';
+ import fs from 'node:fs/promises';
+ import path from 'node:path';
+ import { generateTwlByBook } from '../src/index.js';
+ import { BibleBookData } from '../src/common/books.js';
 
- const args = process.argv.slice(2);
+ const THIS_DIR = path.dirname(new URL(import.meta.url).pathname);
 
- function printHelp() {
-   console.log(`Usage: generate-twls [options]
-
- Options:
- --book <book> Specify the Bible book (e.g., rut)
- --usfm <path> Path to USFM file to process
- --output <path> Path to output TSV file
- --help Show this help message
-
- Examples:
- generate-twls --book rut
- generate-twls --usfm ./41-MAT.usfm --output ./mat_twl.tsv
- generate-twls --usfm ./file.usfm --book rut`);
- }
-
- let book = null;
- let usfmPath = null;
- let outputPath = null;
-
- for (let i = 0; i < args.length; i++) {
-   if (args[i] === '--book' && args[i + 1]) {
-     book = args[i + 1].toLowerCase();
-     i++;
-   } else if (args[i] === '--usfm' && args[i + 1]) {
-     usfmPath = args[i + 1];
-     i++;
-   } else if (args[i] === '--output' && args[i + 1]) {
-     outputPath = args[i + 1];
-     i++;
-   } else if (args[i] === '--help') {
-     printHelp();
-     process.exit(0);
+ async function readBooksJs() {
+   const map = {};
+   for (const [code, meta] of Object.entries(BibleBookData)) {
+     map[code.toUpperCase()] = { usfm: meta.usfm, testament: meta.testament };
    }
+   return map;
  }
 
- // Validate arguments
- if (!book && !usfmPath) {
-   console.error('Error: Either --book or --usfm parameter is required');
-   printHelp();
-   process.exit(1);
- }
-
- if (usfmPath && !fs.existsSync(usfmPath)) {
-   console.error(`Error: USFM file not found: ${usfmPath}`);
-   process.exit(1);
+ function parseArgs(argv) {
+   const args = { book: '', out: '', outDir: '', all: false, useCompromise: false };
+   for (let i = 2; i < argv.length; i++) {
+     const a = argv[i];
+     if (a === '--book' || a === '-b') { args.book = argv[++i] || ''; }
+     else if (a === '--out' || a === '-o') { args.out = argv[++i] || ''; }
+     else if (a === '--out-dir' || a === '-O') { args.outDir = argv[++i] || ''; }
+     else if (a === '--all' || a === '-A') { args.all = true; }
+     else if (a === '--use-compromise') { args.useCompromise = true; }
+   }
+   return args;
  }
 
- (async () => {
-   try {
-     let usfmContent = null;
-     if (usfmPath) {
-       usfmContent = fs.readFileSync(usfmPath, 'utf8');
-       console.log(`Reading USFM from: ${usfmPath}`);
-     }
-
-     const tsv = await generateTWLWithUsfm(book, usfmContent);
-
-     // Determine output filename
-     let filename;
-     if (outputPath) {
-       filename = outputPath;
-     } else if (book) {
-       filename = `twl_${book.toUpperCase()}.tsv`;
-     } else if (usfmPath) {
-       const baseName = path.basename(usfmPath, path.extname(usfmPath));
-       filename = `${baseName}.tsv`;
-     } else {
-       filename = 'output.tsv';
+ async function main() {
+   const { book, out, outDir, all, useCompromise } = parseArgs(process.argv);
+   if (all || (book && book.toLowerCase() === 'all')) {
+     const books = await readBooksJs();
+     const codes = Object.keys(books);
+     const destDir = outDir ? path.resolve(outDir) : path.resolve(THIS_DIR, '..'); // default to twl-generator dir
+     await fs.mkdir(destDir, { recursive: true });
+     console.error(`Generating TWL for ${codes.length} books to ${destDir} (useCompromise=${useCompromise})`);
+     for (const code of codes) {
+       try {
+         const { matchedTsv, noMatchTsv } = await generateTwlByBook(code, { useCompromise });
+         const fname = `${code.toLowerCase()}.twl.tsv`;
+         const outPath = path.join(destDir, fname);
+         await fs.writeFile(outPath, matchedTsv, 'utf8');
+         const nmPath = path.join(destDir, `${code.toLowerCase()}.no-match.twl.tsv`);
+         await fs.writeFile(nmPath, noMatchTsv, 'utf8');
+         console.error(` ✓ ${code} -> ${fname}`);
+       } catch (err) {
+         console.error(` ✗ ${code} failed:`, err.message || err);
+       }
      }
+     return;
+   }
 
-     // Save TSV to file
-     fs.writeFileSync(filename, tsv, 'utf8');
-     console.log(`TSV file saved as ${filename}`);
-     console.log(`Found ${tsv.split('\n').length - 1} matches`);
-   } catch (error) {
-     console.error('Error:', error.message);
+   if (!book) {
+     console.error('Usage: generate-twl --book <code>|all [--out <file.tsv> | --out-dir <dir>] [--use-compromise]');
      process.exit(1);
    }
- })();
+
+   const { matchedTsv, noMatchTsv } = await generateTwlByBook(book, { useCompromise });
+   if (out) {
+     const outPath = path.resolve(out);
+     await fs.writeFile(outPath, matchedTsv, 'utf8');
+     console.log(`Wrote ${out}`);
+     const dir = path.dirname(outPath);
+     const base = path.basename(outPath);
+     const nmPath = path.join(dir, base.replace(/\.twl\.tsv$/i, '.no-match.twl.tsv'));
+     await fs.writeFile(nmPath, noMatchTsv, 'utf8');
+     console.log(`Wrote ${nmPath}`);
+   } else if (outDir) {
+     const destDir = path.resolve(outDir);
+     await fs.mkdir(destDir, { recursive: true });
+     const outPath = path.join(destDir, `${book.toLowerCase()}.twl.tsv`);
+     await fs.writeFile(outPath, matchedTsv, 'utf8');
+     const nmPath = path.join(destDir, `${book.toLowerCase()}.no-match.twl.tsv`);
+     await fs.writeFile(nmPath, noMatchTsv, 'utf8');
+     console.log(`Wrote ${outPath}`);
+     console.log(`Wrote ${nmPath}`);
+   } else {
+     // When writing to stdout, output only the matched TSV to avoid mixing tables
+     process.stdout.write(matchedTsv);
+   }
+ }
+
+ main().catch(err => { console.error(err); process.exit(1); });
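
One detail of the new `--out` branch is worth illustrating. The no-match path is derived by replacing a trailing `.twl.tsv` in the output file name, so when the name does not end in `.twl.tsv` (as in the README's `--out test-output.tsv` example) the replacement leaves the name unchanged and both writes appear to target the same file. The sketch below reproduces that derivation on plain strings; `noMatchName` mirrors the expression in the diff, while `safeNoMatchName` is a hypothetical guard that is not part of the package.

```javascript
// Mirrors the expression used in the diff's --out branch.
function noMatchName(base) {
  return base.replace(/\.twl\.tsv$/i, '.no-match.twl.tsv');
}

// Hypothetical guard (not in the package): always return a name distinct from `base`.
function safeNoMatchName(base) {
  const derived = noMatchName(base);
  if (derived !== base) return derived;
  // Fall back to inserting ".no-match" before the final extension, if there is one.
  const dot = base.lastIndexOf('.');
  return dot > 0
    ? `${base.slice(0, dot)}.no-match${base.slice(dot)}`
    : `${base}.no-match`;
}

console.log(noMatchName('matthew.twl.tsv'));     // matthew.no-match.twl.tsv
console.log(noMatchName('test-output.tsv'));     // test-output.tsv (same as the input)
console.log(safeNoMatchName('test-output.tsv')); // test-output.no-match.tsv
```

Whether the overwrite matters in practice depends on how callers name their output files; with the `.twl.tsv` suffix used in the README's first `--out` example, the two paths differ as intended.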