flappa-doormal 2.8.0 → 2.10.0

package/AGENTS.md CHANGED
@@ -26,8 +26,14 @@ Traditional Arabic text segmentation requires:
  ```text
  src/
  ├── index.ts # Main entry point and exports
- ├── pattern-detection.ts # Token detection for auto-generating rules (NEW)
- ├── pattern-detection.test.ts # Pattern detection tests (22 tests)
+ ├── analysis/ # Analysis helpers module
+ │ ├── index.ts # Barrel exports for analysis functions
+ │ ├── shared.ts # Shared utilities for analysis
+ │ ├── line-starts.ts # analyzeCommonLineStarts (line-based patterns)
+ │ ├── repeating-sequences.ts # analyzeRepeatingSequences (continuous text N-grams)
+ │ └── *.test.ts # Analysis tests
+ ├── pattern-detection.ts # Token detection for auto-generating rules
+ ├── pattern-detection.test.ts # Pattern detection tests
  ├── recovery.ts # Marker recovery utility (recover mistaken lineStartsAfter)
  ├── recovery.test.ts # Marker recovery tests
  └── segmentation/
@@ -36,6 +42,7 @@ src/
  ├── breakpoint-processor.ts # Breakpoint post-processing engine (applyBreakpoints)
  ├── breakpoint-utils.ts # Breakpoint processing utilities (windowing, excludes, page joins)
  ├── rule-regex.ts # SplitRule -> compiled regex builder (buildRuleRegex, processPattern)
+ ├── optimize-rules.ts # Rule optimization logic (merge, dedupe, sort)
  ├── tokens.ts # Token definitions and expansion logic
  ├── fuzzy.ts # Diacritic-insensitive matching utilities
  ├── html.ts # HTML utilities (stripHtmlTags)
@@ -63,32 +70,38 @@ src/
  - Deterministic mode reruns segmentation with selected rules converted to `lineStartsWith` and merges recovered `content` back into the provided segments
  - Optional `mode: 'best_effort_then_rerun'` attempts a conservative anchor-based recovery first, then falls back to rerun for unresolved segments

- 2. **`tokens.ts`** - Template system
+ 3. **`tokens.ts`** - Template system
  - `TOKEN_PATTERNS` - Map of token names to regex patterns
  - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
  - `shouldDefaultToFuzzy()` - Checks if patterns contain fuzzy-default tokens (bab, basmalah, fasl, kitab, naql)
+ - `applyTokenMappings()` - Applies named captures (`{{token:name}}`) to raw templates
+ - `stripTokenMappings()` - Strips named captures (reverts to `{{token}}`)
  - Supports fuzzy transform for diacritic-insensitive matching
  - **Fuzzy-default tokens**: `bab`, `basmalah`, `fasl`, `kitab`, `naql` - auto-enable fuzzy matching unless `fuzzy: false` is set
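The `applyTokenMappings()`/`stripTokenMappings()` pair added above can be pictured with a small self-contained sketch. This is a hypothetical re-implementation for illustration only, not the library's code; the real functions may differ in signature and edge-case handling:

```typescript
// Sketch: map {{token}} <-> {{token:name}} in a template string.
type TokenMapping = { token: string; name: string };

function applyTokenMappingsSketch(template: string, mappings: TokenMapping[]): string {
  // Replace every bare {{token}} with its named-capture form {{token:name}}
  return mappings.reduce(
    (t, m) => t.split(`{{${m.token}}}`).join(`{{${m.token}:${m.name}}}`),
    template,
  );
}

function stripTokenMappingsSketch(template: string): string {
  // {{raqms:num}} -> {{raqms}}
  return template.replace(/\{\{(\w+):\w+\}\}/g, '{{$1}}');
}
```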

- 3. **`match-utils.ts`** - Extracted utilities (for testability)
+ 4. **`match-utils.ts`** - Extracted utilities (for testability)
  - `extractNamedCaptures()` - Get named groups from regex match
  - `filterByConstraints()` - Apply min/max page filters
  - `anyRuleAllowsId()` - Check if page passes rule constraints

- 4. **`rule-regex.ts`** - SplitRule → compiled regex builder
+ 5. **`rule-regex.ts`** - SplitRule → compiled regex builder
  - `buildRuleRegex()` - Compiles rule patterns (`lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, `template`, `regex`)
  - `processPattern()` - Token expansion + auto-escaping + optional fuzzy application
  - `extractNamedCaptureNames()` - Extract `(?<name>...)` groups from raw regex patterns

- 5. **`pattern-validator.ts`** - Rule validation utilities
+ 6. **`optimize-rules.ts`** - Rule management logic
+ - `optimizeRules()` - Merges compatible rules, deduplicates patterns, and sorts by specificity (longest patterns first)
+
+ 7. **`pattern-validator.ts`** - Rule validation utilities
  - `validateRules()` - Detects typos in patterns (missing `{{}}`, unknown tokens, duplicates)
+ - `formatValidationReport()` - Formats validation issues into human-readable strings
  - Returns parallel array structure for easy error tracking
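The parallel-array result plus the new `formatValidationReport()` can be sketched as follows. The issue shape here is hypothetical; the library's actual `ValidationIssue` type may differ, and this is not the real implementation:

```typescript
// Sketch: flatten a per-rule, per-field issue structure into strings.
type IssueSketch = { type: string; pattern: string };
type RuleIssuesSketch = Record<string, IssueSketch[]>; // keyed by pattern field name

function formatReportSketch(issues: RuleIssuesSketch[]): string[] {
  const out: string[] = [];
  issues.forEach((rule, i) => {
    for (const [field, list] of Object.entries(rule)) {
      for (const issue of list) {
        // 1-based rule numbering, matching the human-readable report style
        out.push(`Rule ${i + 1}, ${field}: ${issue.type} in "${issue.pattern}"`);
      }
    }
  });
  return out;
}
```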

- 6. **`breakpoint-processor.ts`** - Breakpoint post-processing engine
+ 8. **`breakpoint-processor.ts`** - Breakpoint post-processing engine
  - `applyBreakpoints()` - Splits oversized structural segments using breakpoint patterns + windowing
  - Applies `pageJoiner` normalization to breakpoint-created segments

- 7. **`breakpoint-utils.ts`** - Breakpoint processing utilities
+ 9. **`breakpoint-utils.ts`** - Breakpoint processing utilities
  - `normalizeBreakpoint()` - Convert string to BreakpointRule object
  - `isPageExcluded()` - Check if page is in exclude list
  - `isInBreakpointRange()` - Validate page against min/max/exclude constraints
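The normalization and range checks above are simple enough to sketch. This is a hypothetical stand-in for illustration; the real `BreakpointRule` type and helper signatures may differ:

```typescript
// Sketch: string -> rule normalization and min/max/exclude range checking.
type BreakpointRuleSketch = { pattern: string; min?: number; max?: number; exclude?: number[] };

function normalizeBreakpointSketch(bp: string | BreakpointRuleSketch): BreakpointRuleSketch {
  // A bare string is shorthand for a rule with only a pattern
  return typeof bp === 'string' ? { pattern: bp } : bp;
}

function isInBreakpointRangeSketch(pageId: number, rule: BreakpointRuleSketch): boolean {
  if (rule.exclude?.includes(pageId)) return false;
  if (rule.min !== undefined && pageId < rule.min) return false;
  if (rule.max !== undefined && pageId > rule.max) return false;
  return true;
}
```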
@@ -105,16 +118,17 @@ src/
  - `findNextPagePosition()` - Find next page content position
  - `findPatternBreakPosition()` - Find pattern match by preference

- 7. **`types.ts`** - Type definitions
+ 10. **`types.ts`** - Type definitions
  - `Logger` interface - Optional logging for debugging
  - `SegmentationOptions` - Options with `logger` property
  - `pageJoiner` - Controls how page boundaries are represented in output (`space` default)
+ - `PATTERN_TYPE_KEYS` - Runtime array of all pattern types (for UI building)
  - Verbosity levels: `trace`, `debug`, `info`, `warn`, `error`

- 8. **`fuzzy.ts`** - Arabic text normalization
+ 11. **`fuzzy.ts`** - Arabic text normalization
  - `makeDiacriticInsensitive()` - Generate regex that ignores diacritics

- 9. **`pattern-detection.ts`** - Token auto-detection (NEW)
+ 12. **`pattern-detection.ts`** - Token auto-detection (NEW)
  - `detectTokenPatterns()` - Detect tokens in text with positions
  - `generateTemplateFromText()` - Convert text to template string
  - `suggestPatternConfig()` - Suggest rule configuration
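The core idea behind `makeDiacriticInsensitive()` (item 11 above) can be shown in a few lines: allow an optional run of Arabic diacritics after every base character. This is a simplified hypothetical sketch, not the library's implementation, which may normalize more (e.g. additional marks):

```typescript
// Sketch: build a regex that matches text with or without Arabic diacritics.
const DIACRITICS = '[\\u064B-\\u0652]*'; // tanwin, harakat, shadda, sukun (optional)

function diacriticInsensitiveSketch(text: string): RegExp {
  // Strip any diacritics already present, then allow optional diacritics
  // after every remaining character.
  const base = text.replace(/[\u064B-\u0652]/g, '');
  return new RegExp([...base].map((ch) => ch + DIACRITICS).join(''));
}
```

For example, the regex built from plain `حدثنا` also matches the fully vocalized `حَدَّثَنَا`.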
@@ -202,7 +216,7 @@ Raw `regex` patterns now support named capture groups for metadata extraction:

  ### Breakpoints Post-Processing Algorithm

- The `breakpoints` option provides a post-processing mechanism for limiting segment size. Unlike the deprecated `maxSpan` (which was per-rule), breakpoints runs AFTER all structural rules.
+ The `breakpoints` option provides a post-processing mechanism for limiting segment size. Breakpoints runs AFTER all structural rules.

  **API Options:**
  ```typescript
@@ -244,7 +258,7 @@ segmentPages(pages, {
  - **`prefer: 'shorter'`**: Finds FIRST match (conservative)
  - **Recursive**: If split result still exceeds `maxPages`, breakpoints runs again

- > **Note**: The old `maxSpan` and `fallback` properties on `SplitRule` are deprecated and removed.
+ > **Note**: Older per-rule span limiting approaches were removed in favor of post-processing `breakpoints`.

  ## Design Decisions

@@ -357,7 +371,7 @@ bunx biome lint .

  5. **Rule order matters for specificity**: When multiple rules can match the same position, put specific patterns BEFORE generic ones. Example: `## {{raqms:num}} {{dash}}` must come before `##` to capture the number.

- 6. **Post-processing beats per-rule limits**: The `maxSpan` approach (per-rule page limits) caused premature splits. Moving to post-processing `breakpoints` preserves structural integrity while still limiting segment size.
+ 6. **Post-processing beats per-rule limits**: Per-rule span limiting caused premature splits. Moving to post-processing `breakpoints` preserves structural integrity while still limiting segment size.

  7. **Window padding matters**: When calculating approximate content windows, 50% padding is needed (not 20%) to ensure enough content is captured for `prefer: 'longer'` scenarios.

@@ -396,6 +410,8 @@ bunx biome lint .
  - `src/segmentation/replace.ts` (preprocessing parity)
  - `src/recovery.ts` (recovery implementation)

+ 10. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
+
  ### Process Template (Multi-agent design review, TDD-first)

  If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
@@ -494,3 +510,32 @@ const quoted = analyzeCommonLineStarts(pages, {
  });
  ```

+ ## Repeating Sequence Analysis (`analyzeRepeatingSequences`)
+
+ For continuous text **without line breaks** (prose-like content), use `analyzeRepeatingSequences(pages)`. It scans for commonly repeating word/token sequences (N-grams) across pages.
+
+ Key options:
+ - `minElements` / `maxElements`: N-gram size range (default 1-3)
+ - `minCount`: Minimum occurrences to include (default 3)
+ - `topK`: Maximum patterns to return (default 20)
+ - `requireToken`: Only patterns containing `{{tokens}}` (default true)
+ - `normalizeArabicDiacritics`: Ignore diacritics when matching (default true)
+
+ Example:
+ ```typescript
+ import { analyzeRepeatingSequences } from 'flappa-doormal';
+
+ const patterns = analyzeRepeatingSequences(pages, { minCount: 3, topK: 20 });
+ // [{ pattern: '{{naql}}', count: 42, examples: [...] }, ...]
+ ```
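The N-gram scan underlying the analysis above can be illustrated with a tiny self-contained sketch. This is a conceptual simplification, not the library's code: the real analysis also maps words to `{{tokens}}` and normalizes diacritics.

```typescript
// Sketch: count word N-grams of size n, keeping those seen >= minCount times.
function countNgramsSketch(text: string, n: number, minCount: number): Map<string, number> {
  const words = text.split(/\s+/).filter(Boolean);
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= words.length; i++) {
    const gram = words.slice(i, i + n).join(' ');
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  // Drop rare N-grams (deleting while iterating a Map is safe in JS)
  for (const [gram, c] of counts) if (c < minCount) counts.delete(gram);
  return counts;
}
```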
+
+ ## Analysis → Segmentation Workflow
+
+ Use analysis functions to discover patterns, then pass to `segmentPages()`:
+
+ 1. **Continuous text**: `analyzeRepeatingSequences()` → build rules → `segmentPages()`
+ 2. **Structured text**: `analyzeCommonLineStarts()` → build rules → `segmentPages()`
+
+ See README.md for complete examples.
+
package/README.md CHANGED
@@ -158,6 +158,23 @@ const rules = [{
  | `template` | Depends | Custom pattern with full control |
  | `regex` | Depends | Raw regex for complex cases |

+ #### Building UIs with Pattern Type Keys
+
+ The library exports `PATTERN_TYPE_KEYS` (a const array) and `PatternTypeKey` (a type) for building UIs that let users select pattern types:
+
+ ```typescript
+ import { PATTERN_TYPE_KEYS, type PatternTypeKey } from 'flappa-doormal';
+
+ // PATTERN_TYPE_KEYS = ['lineStartsWith', 'lineStartsAfter', 'lineEndsWith', 'template', 'regex']
+
+ // Build a dropdown/select
+ PATTERN_TYPE_KEYS.map(key => <option value={key}>{key}</option>)
+
+ // Type-safe validation
+ const isPatternKey = (k: string): k is PatternTypeKey =>
+   (PATTERN_TYPE_KEYS as readonly string[]).includes(k);
+ ```
+
  ### 4.1 Page-start Guard (avoid page-wrap false positives)

  When matching at line starts (e.g., `{{naql}}`), a new page can begin with a marker that is actually a **continuation** of the previous page (page wrap), not a true new segment.
@@ -228,7 +245,6 @@ Control which matches to use:
    lineEndsWith: ['\\.'],
    split: 'after',
    occurrence: 'last', // Only split at LAST period on page
-   maxSpan: 1, // Apply per-page
  }
  ```

@@ -406,6 +422,123 @@ Key options:
  - If you paste these signatures into `lineStartsWith` / `lineStartsAfter` / `template`, that’s fine: those template pattern types **auto-escape `()[]`** outside `{{tokens}}`.
  - If you paste them into a raw `regex` rule, you may need to escape literal brackets yourself.

+ ### Repeating Sequence Analysis (continuous text)
+
+ For texts without line breaks (continuous prose), use `analyzeRepeatingSequences()`:
+
+ ```typescript
+ import { analyzeRepeatingSequences } from 'flappa-doormal';
+
+ const patterns = analyzeRepeatingSequences(pages, {
+   minElements: 2,
+   maxElements: 4,
+   minCount: 3,
+   topK: 20,
+ });
+ // [{ pattern: "{{naql}}\\s*{{harf}}", count: 42, examples: [...] }, ...]
+ ```
+
+ Key options:
+ - `minElements` / `maxElements`: N-gram size range (default 1-3)
+ - `minCount`: Minimum occurrences to include (default 3)
+ - `topK`: Maximum patterns to return (default 20)
+ - `requireToken`: Only patterns containing `{{tokens}}` (default true)
+ - `normalizeArabicDiacritics`: Ignore diacritics when matching (default true)
+
+ ## Analysis → Segmentation Workflow
+
+ Use analysis functions to discover patterns, then pass to `segmentPages()`.
+
+ ### Example A: Continuous Text (No Punctuation)
+
+ For prose-like text without structural line breaks:
+
+ ```typescript
+ import { analyzeRepeatingSequences, segmentPages, type Page } from 'flappa-doormal';
+
+ // Continuous Arabic text with narrator phrases
+ const pages: Page[] = [
+   { id: 1, content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي حدثنا خالد بن زيد عن علي' },
+   { id: 2, content: 'حدثنا سعيد بن جبير عن ابن عباس أخبرنا يوسف عن أنس' },
+ ];
+
+ // Step 1: Discover repeating patterns
+ const patterns = analyzeRepeatingSequences(pages, { minCount: 2, topK: 10 });
+ // [{ pattern: '{{naql}}', count: 5, examples: [...] }, ...]
+
+ // Step 2: Build rules from discovered patterns
+ const rules = patterns.filter(p => p.count >= 3).map(p => ({
+   lineStartsWith: [p.pattern],
+   split: 'at' as const,
+   fuzzy: true,
+ }));
+
+ // Step 3: Segment
+ const segments = segmentPages(pages, { rules });
+ // [{ content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي', from: 1 }, ...]
+ ```
+
+ ### Example B: Structured Text (With Numbering)
+
+ For hadith-style numbered entries:
+
+ ```typescript
+ import { analyzeCommonLineStarts, segmentPages, type Page } from 'flappa-doormal';
+
+ // Numbered hadith text
+ const pages: Page[] = [
+   { id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ\n٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ' },
+   { id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ عَنْ عَائِشَةَ' },
+ ];
+
+ // Step 1: Discover common line-start patterns
+ const patterns = analyzeCommonLineStarts(pages, { topK: 10, minCount: 2 });
+ // [{ pattern: '{{raqms}}\\s*{{dash}}', count: 3, examples: [...] }, ...]
+
+ // Step 2: Build rules (add named capture for hadith number)
+ const topPattern = patterns[0]?.pattern ?? '{{raqms}} {{dash}} ';
+ const rules = [{
+   lineStartsAfter: [topPattern.replace('{{raqms}}', '{{raqms:num}}')],
+   split: 'at' as const,
+   meta: { type: 'hadith' }
+ }];
+
+ // Step 3: Segment
+ const segments = segmentPages(pages, { rules });
+ // [
+ //   { content: 'حَدَّثَنَا أَبُو بَكْرٍ...', from: 1, meta: { type: 'hadith', num: '٦٦٩٦' } },
+ //   { content: 'أَخْبَرَنَا عُمَرُ قَالَ', from: 1, meta: { type: 'hadith', num: '٦٦٩٧' } },
+ //   { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { type: 'hadith', num: '٦٦٩٨' } },
+ // ]
+ ```
+
+ ## Rule Optimization
+
+ Use `optimizeRules()` to automatically merge compatible rules, remove duplicate patterns, and sort rules by specificity (longest patterns first):
+
+ ```typescript
+ import { optimizeRules } from 'flappa-doormal';
+
+ const rules = [
+   // These will be merged because meta/fuzzy options match
+   { lineStartsWith: ['{{kitab}}'], fuzzy: true, meta: { type: 'header' } },
+   { lineStartsWith: ['{{bab}}'], fuzzy: true, meta: { type: 'header' } },
+
+   // This will be kept separate
+   { lineStartsAfter: ['{{numbered}}'], meta: { type: 'entry' } },
+ ];
+
+ const { rules: optimized, mergedCount } = optimizeRules(rules);
+
+ // Result:
+ // optimized[0] = {
+ //   lineStartsWith: ['{{kitab}}', '{{bab}}'],
+ //   fuzzy: true,
+ //   meta: { type: 'header' }
+ // }
+ // optimized[1] = { lineStartsAfter: ['{{numbered}}'], ... }
+ ```
+

  ## Rule Validation

  Use `validateRules()` to detect common mistakes in rule patterns before running segmentation:
@@ -422,6 +555,16 @@ const issues = validateRules([
  // issues[0]?.lineStartsAfter?.[0]?.type === 'missing_braces'
  // issues[1]?.lineStartsWith?.[0]?.type === 'unknown_token'
  // issues[2]?.lineStartsAfter?.[0]?.type === 'missing_braces'
+
+ // To get a simple list of error strings for UI display:
+ import { formatValidationReport } from 'flappa-doormal';
+
+ const errors = formatValidationReport(issues);
+ // [
+ //   'Rule 1, lineStartsAfter: Missing {{}} around token "raqms:num"',
+ //   'Rule 2, lineStartsWith: Unknown token "{{unknown}}"',
+ //   ...
+ // ]
  ```

  **Checks performed:**
@@ -429,6 +572,25 @@ const issues = validateRules([
  - **Unknown tokens**: Flags tokens inside `{{}}` that don't exist (e.g., `{{nonexistent}}`)
  - **Duplicates**: Finds duplicate patterns within the same rule

+ ## Token Mapping Utilities
+
+ When building UIs for rule editing, it's often useful to separate the *token pattern* (e.g., `{{raqms}}`) from the *capture name* (e.g., `{{raqms:hadithNum}}`).
+
+ ```typescript
+ import { applyTokenMappings, stripTokenMappings } from 'flappa-doormal';
+
+ // 1. Apply user-defined mappings to a raw template
+ const template = '{{raqms}} {{dash}}';
+ const mappings = [{ token: 'raqms', name: 'num' }];
+
+ const result = applyTokenMappings(template, mappings);
+ // result = '{{raqms:num}} {{dash}}'
+
+ // 2. Strip captures to get back to the canonical pattern
+ const raw = stripTokenMappings(result);
+ // raw = '{{raqms}} {{dash}}'
+ ```
+
  ## Prompting LLMs / Agents to Generate Rules (Shamela books)

  ### Pre-analysis (no LLM required): generate “hints” from the book
@@ -619,32 +781,10 @@ const segments = segmentPages(pages, {
    lineEndsWith: ['\\.'],
    split: 'after',
    occurrence: 'last',
-   maxSpan: 1
- }]
- });
- ```
-
- ### Page Fallback for Unmatched Content
-
- When using `maxSpan` to group matches per page, use `fallback: 'page'` to prevent unmatched pages from merging with adjacent segments:
-
- ```typescript
- const segments = segmentPages(pages, {
-   rules: [{
-     template: '{{tarqim}}', // Match punctuation marks
-     split: 'after',
-     occurrence: 'last',
-     maxSpan: 1,
-     fallback: 'page' // If no punctuation found, segment the page anyway
  }]
  });
  ```

- **Without `fallback`**: Pages without matches merge into the next segment
- **With `fallback: 'page'`**: Each page becomes its own segment even without matches
-
- > **Future extensions**: The `fallback` option may support additional values like `'skip'` (omit unmatched content) or `'line'` (split at line breaks) in future versions.
-
  ### Multiple Rules with Priority

  ```typescript
@@ -912,9 +1052,7 @@ type SplitRule = {
  // Split behavior
  split?: 'at' | 'after'; // Default: 'at'
  occurrence?: 'first' | 'last' | 'all';
- maxSpan?: number;
  fuzzy?: boolean;
- fallback?: 'page'; // NEW: Page-boundary fallback

  // Constraints
  min?: number;
@@ -1042,28 +1180,6 @@ The library concatenates all pages into a single string for pattern matching acr

  For typical book processing (up to 6,000 pages), memory usage is well within Node.js defaults. For very large books (40,000+ pages), ensure adequate heap size.

- ### `maxSpan` Sliding Window Behavior
-
- The `maxSpan` option uses a **sliding window algorithm** based on page ID difference:
-
- ```typescript
- // maxSpan = maximum page ID difference when looking ahead for split points
- // Algorithm prefers LONGER segments by looking as far ahead as allowed
-
- // Pages [1, 2, 3, 4] with maxSpan: 1, occurrence: 'last'
- // Window from page 1: pages 1-2 (diff <= 1), splits at page 2's last match
- // Window from page 3: pages 3-4 (diff <= 1), splits at page 4's last match
- // Result: 2 segments spanning pages 1-2 and 3-4
-
- // Pages [1, 5, 10] with maxSpan: 1, occurrence: 'last'
- // Window from page 1: only page 1 (5-1=4 > 1), splits at page 1
- // Window from page 5: only page 5 (10-5=5 > 1), splits at page 5
- // Window from page 10: only page 10, splits at page 10
- // Result: 3 segments (pages too far apart to merge)
- ```
-
- This is intentional for books where page IDs represent actual page numbers. With `occurrence: 'last'`, the algorithm finds the last match within the lookahead window, creating longer segments where possible.
-
  ## For AI Agents

  See [AGENTS.md](./AGENTS.md) for: