flappa-doormal 2.8.0 → 2.10.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +59 -14
- package/README.md +163 -47
- package/dist/index.d.mts +155 -39
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +338 -23
- package/dist/index.mjs.map +1 -1
- package/package.json +2 -2
package/AGENTS.md
CHANGED
@@ -26,8 +26,14 @@ Traditional Arabic text segmentation requires:
 ```text
 src/
 ├── index.ts                    # Main entry point and exports
-├──
-├──
+├── analysis/                   # Analysis helpers module
+│   ├── index.ts                # Barrel exports for analysis functions
+│   ├── shared.ts               # Shared utilities for analysis
+│   ├── line-starts.ts          # analyzeCommonLineStarts (line-based patterns)
+│   ├── repeating-sequences.ts  # analyzeRepeatingSequences (continuous text N-grams)
+│   └── *.test.ts               # Analysis tests
+├── pattern-detection.ts        # Token detection for auto-generating rules
+├── pattern-detection.test.ts   # Pattern detection tests
 ├── recovery.ts                 # Marker recovery utility (recover mistaken lineStartsAfter)
 ├── recovery.test.ts            # Marker recovery tests
 └── segmentation/
@@ -36,6 +42,7 @@ src/
 ├── breakpoint-processor.ts     # Breakpoint post-processing engine (applyBreakpoints)
 ├── breakpoint-utils.ts         # Breakpoint processing utilities (windowing, excludes, page joins)
 ├── rule-regex.ts               # SplitRule -> compiled regex builder (buildRuleRegex, processPattern)
+├── optimize-rules.ts           # Rule optimization logic (merge, dedupe, sort)
 ├── tokens.ts                   # Token definitions and expansion logic
 ├── fuzzy.ts                    # Diacritic-insensitive matching utilities
 ├── html.ts                     # HTML utilities (stripHtmlTags)
@@ -63,32 +70,38 @@ src/
    - Deterministic mode reruns segmentation with selected rules converted to `lineStartsWith` and merges recovered `content` back into the provided segments
    - Optional `mode: 'best_effort_then_rerun'` attempts a conservative anchor-based recovery first, then falls back to rerun for unresolved segments
 
-
+3. **`tokens.ts`** - Template system
    - `TOKEN_PATTERNS` - Map of token names to regex patterns
    - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
    - `shouldDefaultToFuzzy()` - Checks if patterns contain fuzzy-default tokens (bab, basmalah, fasl, kitab, naql)
+   - `applyTokenMappings()` - Applies named captures (`{{token:name}}`) to raw templates
+   - `stripTokenMappings()` - Strips named captures (reverts to `{{token}}`)
    - Supports fuzzy transform for diacritic-insensitive matching
    - **Fuzzy-default tokens**: `bab`, `basmalah`, `fasl`, `kitab`, `naql` - auto-enable fuzzy matching unless `fuzzy: false` is set
 
-
+4. **`match-utils.ts`** - Extracted utilities (for testability)
    - `extractNamedCaptures()` - Get named groups from regex match
    - `filterByConstraints()` - Apply min/max page filters
    - `anyRuleAllowsId()` - Check if page passes rule constraints
 
-
+5. **`rule-regex.ts`** - SplitRule → compiled regex builder
    - `buildRuleRegex()` - Compiles rule patterns (`lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, `template`, `regex`)
    - `processPattern()` - Token expansion + auto-escaping + optional fuzzy application
    - `extractNamedCaptureNames()` - Extract `(?<name>...)` groups from raw regex patterns
 
-
+6. **`optimize-rules.ts`** - Rule management logic
+   - `optimizeRules()` - Merges compatible rules, deduplicates patterns, and sorts by specificity (longest patterns first)
+
+7. **`pattern-validator.ts`** - Rule validation utilities
    - `validateRules()` - Detects typos in patterns (missing `{{}}`, unknown tokens, duplicates)
+   - `formatValidationReport()` - Formats validation issues into human-readable strings
    - Returns parallel array structure for easy error tracking
 
-
+8. **`breakpoint-processor.ts`** - Breakpoint post-processing engine
    - `applyBreakpoints()` - Splits oversized structural segments using breakpoint patterns + windowing
    - Applies `pageJoiner` normalization to breakpoint-created segments
 
-
+9. **`breakpoint-utils.ts`** - Breakpoint processing utilities
    - `normalizeBreakpoint()` - Convert string to BreakpointRule object
    - `isPageExcluded()` - Check if page is in exclude list
    - `isInBreakpointRange()` - Validate page against min/max/exclude constraints
@@ -105,16 +118,17 @@ src/
    - `findNextPagePosition()` - Find next page content position
    - `findPatternBreakPosition()` - Find pattern match by preference
 
-
+10. **`types.ts`** - Type definitions
    - `Logger` interface - Optional logging for debugging
    - `SegmentationOptions` - Options with `logger` property
    - `pageJoiner` - Controls how page boundaries are represented in output (`space` default)
+   - `PATTERN_TYPE_KEYS` - Runtime array of all pattern types (for UI building)
    - Verbosity levels: `trace`, `debug`, `info`, `warn`, `error`
 
-
+11. **`fuzzy.ts`** - Arabic text normalization
    - `makeDiacriticInsensitive()` - Generate regex that ignores diacritics
 
-
+12. **`pattern-detection.ts`** - Token auto-detection (NEW)
    - `detectTokenPatterns()` - Detect tokens in text with positions
    - `generateTemplateFromText()` - Convert text to template string
    - `suggestPatternConfig()` - Suggest rule configuration
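The `fuzzy.ts` module listed above generates diacritic-insensitive regexes via `makeDiacriticInsensitive()`. As a standalone sketch of the technique — the character ranges, helper names, and overall shape here are illustrative assumptions, not the library's code:

```typescript
// Sketch: build a regex source that matches `text` while ignoring Arabic
// diacritics (tashkeel). Each literal character may be followed by any
// number of combining marks in the U+064B–U+065F / U+0670 range.
const DIACRITICS = '[\\u064B-\\u065F\\u0670]*';

function escapeRegex(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function makeDiacriticInsensitiveSketch(text: string): string {
  return Array.from(text)
    // Drop diacritics already present in the input pattern...
    .filter((ch) => !/[\u064B-\u065F\u0670]/.test(ch))
    // ...then allow optional diacritics after every remaining character.
    .map((ch) => escapeRegex(ch) + DIACRITICS)
    .join('');
}

// Bare 'حدثنا' should match its fully vocalized form 'حَدَّثَنَا'.
const re = new RegExp(makeDiacriticInsensitiveSketch('حدثنا'), 'u');
console.log(re.test('حَدَّثَنَا')); // true
```

This is why fuzzy-default tokens like `naql` can match both vocalized and unvocalized text without the rule author writing diacritic classes by hand.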
@@ -202,7 +216,7 @@ Raw `regex` patterns now support named capture groups for metadata extraction:
 
 ### Breakpoints Post-Processing Algorithm
 
-The `breakpoints` option provides a post-processing mechanism for limiting segment size.
+The `breakpoints` option provides a post-processing mechanism for limiting segment size. Breakpoints runs AFTER all structural rules.
 
 **API Options:**
 ```typescript
@@ -244,7 +258,7 @@ segmentPages(pages, {
 - **`prefer: 'shorter'`**: Finds FIRST match (conservative)
 - **Recursive**: If split result still exceeds `maxPages`, breakpoints runs again
 
-> **Note**:
+> **Note**: Older per-rule span limiting approaches were removed in favor of post-processing `breakpoints`.
 
 ## Design Decisions
 
@@ -357,7 +371,7 @@ bunx biome lint .
 
 5. **Rule order matters for specificity**: When multiple rules can match the same position, put specific patterns BEFORE generic ones. Example: `## {{raqms:num}} {{dash}}` must come before `##` to capture the number.
 
-6. **Post-processing beats per-rule limits**:
+6. **Post-processing beats per-rule limits**: Per-rule span limiting caused premature splits. Moving to post-processing `breakpoints` preserves structural integrity while still limiting segment size.
 
 7. **Window padding matters**: When calculating approximate content windows, 50% padding is needed (not 20%) to ensure enough content is captured for `prefer: 'longer'` scenarios.
 
@@ -396,6 +410,8 @@ bunx biome lint .
    - `src/segmentation/replace.ts` (preprocessing parity)
    - `src/recovery.ts` (recovery implementation)
 
+10. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
+
 ### Process Template (Multi-agent design review, TDD-first)
 
 If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
@@ -494,3 +510,32 @@ const quoted = analyzeCommonLineStarts(pages, {
 });
 ```
 
+## Repeating Sequence Analysis (`analyzeRepeatingSequences`)
+
+For continuous text **without line breaks** (prose-like content), use `analyzeRepeatingSequences(pages)`. It scans for commonly repeating word/token sequences (N-grams) across pages.
+
+Key options:
+- `minElements` / `maxElements`: N-gram size range (default 1-3)
+- `minCount`: Minimum occurrences to include (default 3)
+- `topK`: Maximum patterns to return (default 20)
+- `requireToken`: Only patterns containing `{{tokens}}` (default true)
+- `normalizeArabicDiacritics`: Ignore diacritics when matching (default true)
+
+Example:
+```typescript
+import { analyzeRepeatingSequences } from 'flappa-doormal';
+
+const patterns = analyzeRepeatingSequences(pages, { minCount: 3, topK: 20 });
+// [{ pattern: '{{naql}}', count: 42, examples: [...] }, ...]
+```
+
+## Analysis → Segmentation Workflow
+
+Use analysis functions to discover patterns, then pass to `segmentPages()`:
+
+1. **Continuous text**: `analyzeRepeatingSequences()` → build rules → `segmentPages()`
+2. **Structured text**: `analyzeCommonLineStarts()` → build rules → `segmentPages()`
+
+See README.md for complete examples.
+
package/README.md
CHANGED
@@ -158,6 +158,23 @@ const rules = [{
 | `template` | Depends | Custom pattern with full control |
 | `regex` | Depends | Raw regex for complex cases |
 
+#### Building UIs with Pattern Type Keys
+
+The library exports `PATTERN_TYPE_KEYS` (a const array) and `PatternTypeKey` (a type) for building UIs that let users select pattern types:
+
+```typescript
+import { PATTERN_TYPE_KEYS, type PatternTypeKey } from 'flappa-doormal';
+
+// PATTERN_TYPE_KEYS = ['lineStartsWith', 'lineStartsAfter', 'lineEndsWith', 'template', 'regex']
+
+// Build a dropdown/select
+PATTERN_TYPE_KEYS.map(key => <option value={key}>{key}</option>)
+
+// Type-safe validation
+const isPatternKey = (k: string): k is PatternTypeKey =>
+    (PATTERN_TYPE_KEYS as readonly string[]).includes(k);
+```
+
 ### 4.1 Page-start Guard (avoid page-wrap false positives)
 
 When matching at line starts (e.g., `{{naql}}`), a new page can begin with a marker that is actually a **continuation** of the previous page (page wrap), not a true new segment.
@@ -228,7 +245,6 @@ Control which matches to use:
   lineEndsWith: ['\\.'],
   split: 'after',
   occurrence: 'last', // Only split at LAST period on page
-  maxSpan: 1, // Apply per-page
 }
 ```
 
@@ -406,6 +422,123 @@ Key options:
 - If you paste these signatures into `lineStartsWith` / `lineStartsAfter` / `template`, that’s fine: those template pattern types **auto-escape `()[]`** outside `{{tokens}}`.
 - If you paste them into a raw `regex` rule, you may need to escape literal brackets yourself.
 
+### Repeating Sequence Analysis (continuous text)
+
+For texts without line breaks (continuous prose), use `analyzeRepeatingSequences()`:
+
+```typescript
+import { analyzeRepeatingSequences } from 'flappa-doormal';
+
+const patterns = analyzeRepeatingSequences(pages, {
+  minElements: 2,
+  maxElements: 4,
+  minCount: 3,
+  topK: 20,
+});
+// [{ pattern: "{{naql}}\\s*{{harf}}", count: 42, examples: [...] }, ...]
+```
+
+Key options:
+- `minElements` / `maxElements`: N-gram size range (default 1-3)
+- `minCount`: Minimum occurrences to include (default 3)
+- `topK`: Maximum patterns to return (default 20)
+- `requireToken`: Only patterns containing `{{tokens}}` (default true)
+- `normalizeArabicDiacritics`: Ignore diacritics when matching (default true)
+
+## Analysis → Segmentation Workflow
+
+Use analysis functions to discover patterns, then pass to `segmentPages()`.
+
+### Example A: Continuous Text (No Punctuation)
+
+For prose-like text without structural line breaks:
+
+```typescript
+import { analyzeRepeatingSequences, segmentPages, type Page } from 'flappa-doormal';
+
+// Continuous Arabic text with narrator phrases
+const pages: Page[] = [
+  { id: 1, content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي حدثنا خالد بن زيد عن علي' },
+  { id: 2, content: 'حدثنا سعيد بن جبير عن ابن عباس أخبرنا يوسف عن أنس' },
+];
+
+// Step 1: Discover repeating patterns
+const patterns = analyzeRepeatingSequences(pages, { minCount: 2, topK: 10 });
+// [{ pattern: '{{naql}}', count: 5, examples: [...] }, ...]
+
+// Step 2: Build rules from discovered patterns
+const rules = patterns.filter(p => p.count >= 3).map(p => ({
+  lineStartsWith: [p.pattern],
+  split: 'at' as const,
+  fuzzy: true,
+}));
+
+// Step 3: Segment
+const segments = segmentPages(pages, { rules });
+// [{ content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي', from: 1 }, ...]
+```
+
+### Example B: Structured Text (With Numbering)
+
+For hadith-style numbered entries:
+
+```typescript
+import { analyzeCommonLineStarts, segmentPages, type Page } from 'flappa-doormal';
+
+// Numbered hadith text
+const pages: Page[] = [
+  { id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ\n٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ' },
+  { id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ عَنْ عَائِشَةَ' },
+];
+
+// Step 1: Discover common line-start patterns
+const patterns = analyzeCommonLineStarts(pages, { topK: 10, minCount: 2 });
+// [{ pattern: '{{raqms}}\\s*{{dash}}', count: 3, examples: [...] }, ...]
+
+// Step 2: Build rules (add named capture for hadith number)
+const topPattern = patterns[0]?.pattern ?? '{{raqms}} {{dash}} ';
+const rules = [{
+  lineStartsAfter: [topPattern.replace('{{raqms}}', '{{raqms:num}}')],
+  split: 'at' as const,
+  meta: { type: 'hadith' }
+}];
+
+// Step 3: Segment
+const segments = segmentPages(pages, { rules });
+// [
+//   { content: 'حَدَّثَنَا أَبُو بَكْرٍ...', from: 1, meta: { type: 'hadith', num: '٦٦٩٦' } },
+//   { content: 'أَخْبَرَنَا عُمَرُ قَالَ', from: 1, meta: { type: 'hadith', num: '٦٦٩٧' } },
+//   { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { type: 'hadith', num: '٦٦٩٨' } },
+// ]
+```
+
+## Rule Optimization
+
+Use `optimizeRules()` to automatically merge compatible rules, remove duplicate patterns, and sort rules by specificity (longest patterns first):
+
+```typescript
+import { optimizeRules } from 'flappa-doormal';
+
+const rules = [
+  // These will be merged because meta/fuzzy options match
+  { lineStartsWith: ['{{kitab}}'], fuzzy: true, meta: { type: 'header' } },
+  { lineStartsWith: ['{{bab}}'], fuzzy: true, meta: { type: 'header' } },
+
+  // This will be kept separate
+  { lineStartsAfter: ['{{numbered}}'], meta: { type: 'entry' } },
+];
+
+const { rules: optimized, mergedCount } = optimizeRules(rules);
+
+// Result:
+// optimized[0] = {
+//   lineStartsWith: ['{{kitab}}', '{{bab}}'],
+//   fuzzy: true,
+//   meta: { type: 'header' }
+// }
+// optimized[1] = { lineStartsAfter: ['{{numbered}}'], ... }
+```
+
 ## Rule Validation
 
 Use `validateRules()` to detect common mistakes in rule patterns before running segmentation:
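The merge/dedupe/sort behavior that the new Rule Optimization section documents can be approximated in a few lines. A simplified standalone sketch — the grouping key, type shape, and function name are assumptions, not the library's `optimizeRules` implementation:

```typescript
// Merge rules that share identical options apart from their pattern lists,
// dedupe patterns, and sort each list longest-first (specificity).
type MiniRule = { lineStartsWith: string[]; fuzzy?: boolean; meta?: Record<string, string> };

function optimizeRulesSketch(rules: MiniRule[]): { rules: MiniRule[]; mergedCount: number } {
  const groups = new Map<string, MiniRule>();
  for (const rule of rules) {
    // Rules are mergeable when everything except the pattern list matches.
    const key = JSON.stringify({ fuzzy: rule.fuzzy ?? false, meta: rule.meta ?? {} });
    const existing = groups.get(key);
    if (existing) {
      existing.lineStartsWith.push(...rule.lineStartsWith);
    } else {
      groups.set(key, { ...rule, lineStartsWith: [...rule.lineStartsWith] });
    }
  }
  const merged = [...groups.values()].map((r) => ({
    ...r,
    // Dedupe, then longest pattern first so specific patterns win.
    lineStartsWith: [...new Set(r.lineStartsWith)].sort((a, b) => b.length - a.length),
  }));
  return { rules: merged, mergedCount: rules.length - merged.length };
}

const { rules: optimized, mergedCount } = optimizeRulesSketch([
  { lineStartsWith: ['{{kitab}}'], fuzzy: true, meta: { type: 'header' } },
  { lineStartsWith: ['{{bab}}'], fuzzy: true, meta: { type: 'header' } },
  { lineStartsWith: ['{{fasl}}'] },
]);
// mergedCount === 1; the two header rules collapse into one
```

Note the AGENTS.md guidance above: prefer the library's own `optimizeRules` over sketches like this in client code.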
@@ -422,6 +555,16 @@ const issues = validateRules([
 // issues[0]?.lineStartsAfter?.[0]?.type === 'missing_braces'
 // issues[1]?.lineStartsWith?.[0]?.type === 'unknown_token'
 // issues[2]?.lineStartsAfter?.[0]?.type === 'missing_braces'
+
+// To get a simple list of error strings for UI display:
+import { formatValidationReport } from 'flappa-doormal';
+
+const errors = formatValidationReport(issues);
+// [
+//   'Rule 1, lineStartsAfter: Missing {{}} around token "raqms:num"',
+//   'Rule 2, lineStartsWith: Unknown token "{{unknown}}"',
+//   ...
+// ]
 ```
 
 **Checks performed:**
@@ -429,6 +572,25 @@ const issues = validateRules([
 - **Unknown tokens**: Flags tokens inside `{{}}` that don't exist (e.g., `{{nonexistent}}`)
 - **Duplicates**: Finds duplicate patterns within the same rule
 
+## Token Mapping Utilities
+
+When building UIs for rule editing, it's often useful to separate the *token pattern* (e.g., `{{raqms}}`) from the *capture name* (e.g., `{{raqms:hadithNum}}`).
+
+```typescript
+import { applyTokenMappings, stripTokenMappings } from 'flappa-doormal';
+
+// 1. Apply user-defined mappings to a raw template
+const template = '{{raqms}} {{dash}}';
+const mappings = [{ token: 'raqms', name: 'num' }];
+
+const result = applyTokenMappings(template, mappings);
+// result = '{{raqms:num}} {{dash}}'
+
+// 2. Strip captures to get back to the canonical pattern
+const raw = stripTokenMappings(result);
+// raw = '{{raqms}} {{dash}}'
+```
+
 ## Prompting LLMs / Agents to Generate Rules (Shamela books)
 
 ### Pre-analysis (no LLM required): generate “hints” from the book
@@ -619,32 +781,10 @@ const segments = segmentPages(pages, {
     lineEndsWith: ['\\.'],
     split: 'after',
     occurrence: 'last',
-    maxSpan: 1
-  }]
-});
-```
-
-### Page Fallback for Unmatched Content
-
-When using `maxSpan` to group matches per page, use `fallback: 'page'` to prevent unmatched pages from merging with adjacent segments:
-
-```typescript
-const segments = segmentPages(pages, {
-  rules: [{
-    template: '{{tarqim}}', // Match punctuation marks
-    split: 'after',
-    occurrence: 'last',
-    maxSpan: 1,
-    fallback: 'page' // If no punctuation found, segment the page anyway
   }]
 });
 ```
 
-**Without `fallback`**: Pages without matches merge into the next segment
-**With `fallback: 'page'`**: Each page becomes its own segment even without matches
-
-> **Future extensions**: The `fallback` option may support additional values like `'skip'` (omit unmatched content) or `'line'` (split at line breaks) in future versions.
-
 ### Multiple Rules with Priority
 
 ```typescript
@@ -912,9 +1052,7 @@ type SplitRule = {
   // Split behavior
   split?: 'at' | 'after'; // Default: 'at'
   occurrence?: 'first' | 'last' | 'all';
-  maxSpan?: number;
   fuzzy?: boolean;
-  fallback?: 'page'; // NEW: Page-boundary fallback
 
   // Constraints
   min?: number;
@@ -1042,28 +1180,6 @@ The library concatenates all pages into a single string for pattern matching acr
 
 For typical book processing (up to 6,000 pages), memory usage is well within Node.js defaults. For very large books (40,000+ pages), ensure adequate heap size.
 
-### `maxSpan` Sliding Window Behavior
-
-The `maxSpan` option uses a **sliding window algorithm** based on page ID difference:
-
-```typescript
-// maxSpan = maximum page ID difference when looking ahead for split points
-// Algorithm prefers LONGER segments by looking as far ahead as allowed
-
-// Pages [1, 2, 3, 4] with maxSpan: 1, occurrence: 'last'
-// Window from page 1: pages 1-2 (diff <= 1), splits at page 2's last match
-// Window from page 3: pages 3-4 (diff <= 1), splits at page 4's last match
-// Result: 2 segments spanning pages 1-2 and 3-4
-
-// Pages [1, 5, 10] with maxSpan: 1, occurrence: 'last'
-// Window from page 1: only page 1 (5-1=4 > 1), splits at page 1
-// Window from page 5: only page 5 (10-5=5 > 1), splits at page 5
-// Window from page 10: only page 10, splits at page 10
-// Result: 3 segments (pages too far apart to merge)
-```
-
-This is intentional for books where page IDs represent actual page numbers. With `occurrence: 'last'`, the algorithm finds the last match within the lookahead window, creating longer segments where possible.
-
 ## For AI Agents
 
 See [AGENTS.md](./AGENTS.md) for: