flappa-doormal 2.11.5 → 2.12.0

package/AGENTS.md CHANGED
@@ -118,7 +118,7 @@ src/
  - `findNextPagePosition()` - Find next page content position
  - `findPatternBreakPosition()` - Find pattern match by preference
  - `findSafeBreakPosition()` - Search backward for a safe linguistic split point (whitespace/punctuation)
- - `adjustForSurrogate()` - Ensure split position doesn't corrupt Unicode surrogate pairs
+ - `adjustForUnicodeBoundary()` - Ensure split position doesn't corrupt surrogate pairs, combining marks, ZWJ/ZWNJ, or variation selectors
 
  10. **`types.ts`** - Type definitions
  - `Logger` interface - Optional logging for debugging
@@ -230,9 +230,19 @@ interface SegmentationOptions {
  // - pageIds: []: apply to no pages (skip)
  replace?: Array<{ regex: string; replacement: string; flags?: string; pageIds?: number[] }>;
  maxPages?: number; // Maximum pages a segment can span
- breakpoints?: string[]; // Ordered array of regex patterns (supports token expansion)
+ breakpoints?: Breakpoint[]; // Ordered array of patterns (supports token expansion)
  prefer?: 'longer' | 'shorter'; // Select last or first match within window
  }
+
+ // Breakpoint can be a string or object with split control
+ type Breakpoint = string | BreakpointRule;
+ interface BreakpointRule {
+ pattern: string;
+ split?: 'at' | 'after'; // Default: 'after'
+ min?: number; // Minimum page ID for this breakpoint
+ max?: number; // Maximum page ID for this breakpoint
+ exclude?: PageRange[]; // Pages to skip this breakpoint
+ }
  ```
 
  **How it works:**
@@ -241,16 +251,24 @@ interface SegmentationOptions {
  3. Patterns are tried in order until one matches
  4. Empty string `''` means "fall back to page boundary"
 
+ **Split behavior:**
+ - **`split: 'after'` (default)**: Previous segment ends WITH the matched text
+ - **`split: 'at'`**: Previous segment ends BEFORE the matched text (match starts next segment)
+
  **Example:**
  ```typescript
  segmentPages(pages, {
  rules: [
- { lineStartsWith: ['{{basmalah}}'] }, // split defaults to 'at'
+ { lineStartsWith: ['{{basmalah}}'] },
  { lineStartsWith: ['{{bab}}'], meta: { type: 'chapter' } },
  ],
  maxPages: 2,
- breakpoints: ['{{tarqim}}\\s*', '\\n', ''], // Try: punctuation → newline → page boundary
- prefer: 'longer', // Greedy: make segments as large as possible
+ breakpoints: [
+ { pattern: '{{tarqim}}\\s*', split: 'after' }, // Punctuation ends current segment
+ { pattern: 'ولهذا', split: 'at' }, // Word starts next segment
+ '', // Fall back to page boundary
+ ],
+ prefer: 'longer',
  });
  ```
 
@@ -259,6 +277,8 @@ segmentPages(pages, {
  - **`prefer: 'longer'`**: Finds LAST match in window (greedy)
  - **`prefer: 'shorter'`**: Finds FIRST match (conservative)
  - **Recursive**: If split result still exceeds `maxPages`, breakpoints runs again
+ - **Lookahead patterns unsupported**: Zero-length matches are skipped; use `split: 'at'` instead
+ - **Position 0 protection**: Matches at position 0 are skipped for `split: 'at'` to prevent empty segments
 
  > **Note**: Older per-rule span limiting approaches were removed in favor of post-processing `breakpoints`.
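
The notes above (last/first match for `prefer`, skipped zero-length matches, skipped position-0 matches for `split: 'at'`) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's `findPatternBreakPosition`: the name `pickBreakPosition` and its signature are hypothetical, and token expansion is left out.

```typescript
// Hypothetical sketch; names and signature are assumptions, not the library API.
type Split = 'at' | 'after';
type Prefer = 'longer' | 'shorter';

/** Returns the offset in `content` at which to break, or -1 if no usable match exists. */
function pickBreakPosition(content: string, pattern: RegExp, split: Split, prefer: Prefer): number {
  // Clone with the global flag so exec() can walk through every match.
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const regex = new RegExp(pattern.source, flags);

  let chosen = -1;
  let match: RegExpExecArray | null;
  while ((match = regex.exec(content)) !== null) {
    const idx = match.index;
    const len = match[0].length;
    if (len === 0) {
      regex.lastIndex++; // zero-length (lookahead) match: skip it and force progress
      continue;
    }
    if (split === 'at' && idx === 0) continue; // would leave an empty previous segment
    const position = split === 'at' ? idx : idx + len;
    if (prefer === 'shorter') return position; // first valid match wins
    chosen = position; // 'longer': keep overwriting so the last valid match wins
  }
  return chosen;
}

// Usage on a small window:
const sample = 'aaa. bbb. ccc';
console.log(pickBreakPosition(sample, /\.\s*/, 'after', 'longer'));  // 10
console.log(pickBreakPosition(sample, /\.\s*/, 'after', 'shorter')); // 5
console.log(pickBreakPosition(sample, /\.\s*/, 'at', 'longer'));     // 8
```

Returning `-1` when nothing usable matches lets a caller fall through to the next breakpoint in the ordered list.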
 
@@ -269,7 +289,7 @@ When using `maxContentLength`, the segmenter prevents text corruption through se
  **Algorithm:**
  1. **Windowed Pattern Match**: Attempt to find a user-provided `breakpoint` pattern within the character window.
  2. **Safe Fallback (Linguistic)**: If no pattern matches, use `findSafeBreakPosition()` to search backward (100 chars) for whitespace or punctuation `[\s\n.,;!?؛،۔]`.
- 3. **Safe Fallback (Technical)**: If still no safe break found, use `adjustForSurrogate()` to ensure the split doesn't fall between a High and Low Unicode surrogate pair.
+ 3. **Safe Fallback (Technical)**: If still no safe break found, use `adjustForUnicodeBoundary()` to ensure the split doesn't corrupt surrogate pairs, combining marks, ZWJ/ZWNJ, or variation selectors.
  4. **Hard Split**: Only as a final resort is a character-exact split performed.
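
As an illustration of step 3, a boundary adjustment of this kind might look like the sketch below. It is an assumption-laden approximation, not the library's `adjustForUnicodeBoundary()`: the function name `adjustSplitToBoundary` is hypothetical, and the character ranges cover only a small subset (Latin combining marks, common Arabic diacritics, ZWNJ/ZWJ, and variation selectors U+FE00 to U+FE0F).

```typescript
// Hypothetical sketch; the real helper may use different ranges and rules.
const isHighSurrogate = (code: number): boolean => code >= 0xd800 && code <= 0xdbff;
const isLowSurrogate = (code: number): boolean => code >= 0xdc00 && code <= 0xdfff;

// Code units that should stay attached to the character before them.
const ATTACHED = /[\u0300-\u036f\u064b-\u065f\u0670\u200c\u200d\ufe00-\ufe0f]/;

/** Moves `position` backward until a split there no longer orphans a code unit. */
function adjustSplitToBoundary(text: string, position: number): number {
  let pos = Math.min(Math.max(position, 0), text.length);
  while (pos > 0 && pos < text.length) {
    const prev = text.charCodeAt(pos - 1);
    const next = text.charCodeAt(pos);
    const splitsSurrogatePair = isHighSurrogate(prev) && isLowSurrogate(next);
    const orphansMark = ATTACHED.test(text.charAt(pos));
    const followsJoiner = prev === 0x200d; // ZWJ glues the next character to the previous one
    if (!splitsSurrogatePair && !orphansMark && !followsJoiner) break;
    pos--;
  }
  return pos; // may reach 0; the caller still has to guarantee progress
}

// Usage: an emoji occupies two UTF-16 code units (indices 2 and 3 here).
console.log(adjustSplitToBoundary('ab\ud83d\ude00cd', 3)); // 2
```

Moving backward rather than forward keeps the resulting piece within the length limit that triggered the hard split.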
 
  **Progress Guarantee**:
@@ -333,6 +353,7 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
  - **Unit tests**: Each utility function has dedicated tests
  - **Integration tests**: Full pipeline tests in `segmenter.test.ts`
  - **Real-world tests**: `segmenter.bukhari.test.ts` uses actual hadith data
+ - **Style convention**: Prefer `it('should ...', () => { ... })` (Bun) for consistency across the suite
  - Run: `bun test`
 
  ## Code Quality Standards
@@ -433,7 +454,7 @@ bunx biome lint .
 
  11. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
 
- 12. **Unicode Boundary Safety (Surrogates + Graphemes)**: Multi-byte characters (like emojis) can be corrupted if split in the middle of a surrogate pair. Similarly, Arabic diacritics (combining marks), ZWJ/ZWNJ, and variation selectors can be orphaned if a hard split lands in the middle of a grapheme cluster. Use `adjustForUnicodeBoundary` (built on `adjustForSurrogate`) when forced to hard-split near a limit.
+ 12. **Unicode Boundary Safety (Surrogates + Graphemes)**: Multi-byte characters (like emojis) can be corrupted if split in the middle of a surrogate pair. Similarly, Arabic diacritics (combining marks), ZWJ/ZWNJ, and variation selectors can be orphaned if a hard split lands in the middle of a grapheme cluster. Use `adjustForUnicodeBoundary` when forced to hard-split near a limit.
 
  13. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
 
@@ -483,8 +504,34 @@ bunx biome lint .
  - `advanceCursorAndIndex()` (progress)
  - `computeNextFromIdx()` (heuristic) **or** position-based override when `maxPages=0` (see #21)
 
- ### Process Template (Multi-agent design review, TDD-first)
+ 25. **Page attribution can drift in large-document breakpoint processing**: For ≥`FAST_PATH_THRESHOLD` segments, boundary positions may be derived from cumulative offsets (fast path). If upstream content is modified (e.g. marker stripping or accidental leading-trim), binary-search attribution can classify a piece as starting **before** `currentFromIdx`, inflating `(to - from)` and violating `maxPages`. **Fix**: clamp `actualStartIdx >= currentFromIdx` and re-apply the `maxPages` window using the same ID-span logic as `computeWindowEndIdx(...)` before creating the piece segment.
+
+ 26. **Offset fast path must respect page-ID span semantics**: `maxPages` in this library is enforced as an **ID span** invariant (`(to ?? from) - from <= maxPages`). For large segments, the offset-based fast path must choose `segEnd` using the same ID-window logic as `computeWindowEndIdx(...)` (not “N pages by index”), otherwise gaps (e.g. `2216 → 2218`) produce illegal spans.
+
+ 27. **Never `trimStart()` huge fallback content**: `ensureFallbackSegment()` constructs “all pages as one segment” when there are no structural split rules. If this giant content is `trimStart()`’d, cumulative offsets and derived boundary positions become inconsistent, which can lead to incorrect `from/to` attribution and `maxPages` violations that only appear on very large books.
+
+ 28. **Always test both sides of the fast-path threshold**: Several breakpoint bugs only reproduce at or above `FAST_PATH_THRESHOLD` (1000). Add regressions at `threshold-1` and `threshold` to avoid “works in small unit tests, fails on full books” surprises.
+
+ 29. **Breakpoint `split` behavior**: The `split: 'at' | 'after'` option for breakpoints controls where the split happens relative to the matched text:
+ - `'after'` (default): Match is included in the previous segment
+ - `'at'`: Match starts the next segment
+ Key implementation details in `findPatternBreakPosition`:
+ - Position is calculated as `splitAt ? idx : idx + len`
+ - Matches at position 0 are skipped for `split:'at'` to prevent empty first segments
+ - Zero-length matches (lookaheads) are always skipped to prevent infinite loops
+ - Empty pattern `''` forces `splitAt=false` since page boundaries have no matched text
 
+ 30. **Unicode safety is the user's responsibility for patterns**: Unlike `findSafeBreakPosition` (which adjusts for grapheme boundaries), pattern-based breaks use the exact position where the user's regex matched. If a pattern matches mid-grapheme, that's a pattern authoring error, not a library bug. The library should NOT silently adjust pattern match positions.
+
+ 31. **Fast path doesn't affect split behavior**: The offset-based fast path only applies to empty pattern `''` breakpoints (page boundary fallback), and empty patterns force `splitAt=false`. Pattern-based breakpoints with `split:'at'` never engage the fast path.
+
+ 32. **Whitespace trimming affects split:'at' output**: `createSegment()` trims segment content. With `split:'at'`, if the matched text is whitespace-only, it will be trimmed from the start of the next segment. This is usually desirable for delimiter patterns.
+
+ 33. **`prefer` semantics with `split:'at'`**: With `prefer:'longer'` + `split:'at'`, the algorithm selects the LAST valid match, maximizing content in the previous segment. This is correct but can be counterintuitive since the resulting previous segment might appear "shorter" than with `split:'after'`.
+
+ 34. **Multi-agent review synthesis**: Getting implementation reviews from multiple AI models (Claude, GPT, Grok, Gemini) and synthesizing their feedback helps catch issues a single reviewer might miss. Key insight: when reviewers disagree on "critical" issues, investigate the codebase to verify claims before implementing fixes. Some "critical" issues are based on incorrect assumptions about how fast paths or downstream functions work.
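
The ID-span invariant from items 25 and 26 can be illustrated with a short sketch. The helper names below (`windowEndIdxByIdSpan`, `clampStartIdx`) are hypothetical stand-ins rather than the library's `computeWindowEndIdx(...)`, and the sketch assumes page IDs are sorted in ascending order.

```typescript
// Hypothetical helpers illustrating the ID-span invariant; not the library's code.

/** Last page index whose ID still satisfies `id - fromId <= maxPages`. */
function windowEndIdxByIdSpan(pageIds: number[], fromIdx: number, maxPages: number): number {
  const maxId = pageIds[fromIdx] + maxPages;
  let endIdx = fromIdx;
  while (endIdx + 1 < pageIds.length && pageIds[endIdx + 1] <= maxId) {
    endIdx++;
  }
  return endIdx;
}

/** Attribution must never place a piece before the page the cursor is already on. */
function clampStartIdx(actualStartIdx: number, currentFromIdx: number): number {
  return Math.max(actualStartIdx, currentFromIdx);
}

// Usage: IDs with a gap (2216 → 2218). Counting "1 page by index" from 2216
// would reach 2218, a span of 2; the ID-span window stops at 2216 instead.
const ids = [2215, 2216, 2218, 2219];
console.log(windowEndIdxByIdSpan(ids, 1, 1)); // 1
console.log(windowEndIdxByIdSpan(ids, 1, 2)); // 2
```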
+
+ ### Process Template (Multi-agent design review, TDD-first)
 
  If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
 
@@ -503,27 +550,36 @@ If you want to repeat the “write a plan → get multiple AI critiques → synt
 
  ## Token Reference
 
- | Token | Pattern Description | Example Match |
- |-------|---------------------|---------------|
- | `{{tarqim}}` | Arabic punctuation (. , ; ? ! ( ) etc.) | `؛` `،` `.` |
- | `{{basmalah}}` | "بِسْمِ اللَّهِ" patterns | بِسْمِ اللَّهِ الرَّحْمَنِ |
- | `{{bab}}` | "باب" (chapter) | بَابُ الإيمان |
- | `{{fasl}}` | "فصل" (section) | فصل: في الطهارة |
- | `{{kitab}}` | "كتاب" (book) | كتاب الصلاة |
- | `{{raqm}}` | Single Arabic-Indic numeral | ٥ |
- | `{{raqms}}` | Multiple Arabic-Indic numerals | ٧٥٦٣ |
- | `{{num}}` | Single ASCII numeral | 5 |
- | `{{nums}}` | Multiple ASCII numerals | 123 |
- | `{{raqms:num}}` | Numerals with named capture | `meta.num = "٧٥٦٣"` |
- | `{{dash}}` | Various dash characters | - – — ـ |
- | `{{harfs}}` | Single-letter codes separated by spaces | ت س ي ق` |
- | `{{rumuz}}` | rijāl/takhrīj source abbreviations (matches blocks like `خت ٤`, `خ سي`, `دت عس ق`) | `خت ٤` |
- | `{{numbered}}` | Composite: `{{raqms}} {{dash}}` | ٧٥٦٣ - |
-
- **Named captures**: Add `:name` suffix to capture into `meta`:
+ | Token | Constant | Pattern | Example |
+ |-------|----------|---------|---------|
+ | `{{tarqim}}` | `Token.TARQIM` | Arabic punctuation | `؛` `.` |
+ | `{{basmalah}}` | `Token.BASMALAH` | "بسم الله" | بسم الله |
+ | `{{bab}}` | `Token.BAB` | "باب" (chapter) | باب الإيمان |
+ | `{{fasl}}` | `Token.FASL` | "فصل/مسألة" | فصل: |
+ | `{{kitab}}` | `Token.KITAB` | "كتاب" (book) | كتاب الصلاة |
+ | `{{naql}}` | `Token.NAQL` | Narrator phrases | حدثنا |
+ | `{{raqm}}` | `Token.RAQM` | Single Arabic digit | ٥ |
+ | `{{raqms}}` | `Token.RAQMS` | Multiple Arabic digits | ٧٥٦٣ |
+ | `{{num}}` | `Token.NUM` | Single ASCII digit | 5 |
+ | `{{nums}}` | `Token.NUMS` | Multiple ASCII digits | 123 |
+ | `{{dash}}` | `Token.DASH` | Dash variants | - – — ـ |
+ | `{{harf}}` | `Token.HARF` | Single Arabic letter | أ |
+ | `{{harfs}}` | `Token.HARFS` | Spaced letters | د ت س |
+ | `{{rumuz}}` | `Token.RUMUZ` | Source abbreviations | خت ٤ |
+ | `{{bullet}}` | `Token.BULLET` | Bullet points | • * ° |
+ | `{{numbered}}` | `Token.NUMBERED` | `{{raqms}} {{dash}} ` | ٧٥٦٣ - |
+
+ ### Token Constants (Better DX)
+
  ```typescript
- '{{raqms:hadithNum}} {{dash}}'
- // → segment.meta.hadithNum = "٧٥٦٣"
+ import { Token, withCapture } from 'flappa-doormal';
+
+ // Use constants instead of strings
+ { lineStartsWith: [Token.KITAB, Token.BAB] }
+
+ // Named captures with withCapture helper
+ const pattern = withCapture(Token.RAQMS, 'num') + ' ' + Token.DASH + ' ';
+ // → '{{raqms:num}} {{dash}} '
  ```
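
To make the `{{token}}` to `{{token:name}}` rewrite concrete, here is one way such a helper could be written. This is a hypothetical re-implementation for illustration only; `withCaptureSketch` is not the published `withCapture`, and returning non-token input unchanged is an assumption.

```typescript
// Hypothetical stand-in for illustration; the published helper may behave differently.
const withCaptureSketch = (token: string, name: string): string => {
  const match = /^\{\{(\w+)\}\}$/.exec(token);
  if (!match) return token; // assumption: input that is not a bare token is left as-is
  return `{{${match[1]}:${name}}}`;
};

// Usage mirroring the example above, with raw token strings instead of constants:
const numberedPattern = withCaptureSketch('{{raqms}}', 'num') + ' ' + '{{dash}}' + ' ';
console.log(numberedPattern); // '{{raqms:num}} {{dash}} '
```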
 
  ## Page-start Guard (`pageStartGuard`)
@@ -614,68 +670,20 @@ See README.md for complete examples.
 
  ---
 
- ## Debugging Page Boundary Detection (Added 2026-01-04)
+ ## Debugging Tips
 
- ### The Problem: False Positives in Prefix Matching
+ ### Page Boundary Detection Issues
 
- When using `maxPages=0` with empty breakpoint `['']` (page boundary breaks), the segmenter can fail when:
- 1. **Pages have identical prefixes** - All pages start with the same text
- 2. **Duplicated content within pages** - The same phrase appears multiple times in a single page
- 3. **Long content** - Pages are thousands of characters, putting false matches closer to expected boundaries
+ If `maxPages=0` produces merged segments when pages have identical prefixes or duplicated content:
+ - Check `buildCumulativeOffsets()` for correct positions
+ - Trace `findPageStartNearExpectedBoundary` candidates
+ - Verify matches are within `MAX_DEVIATION` (2000 chars) of expected boundary
 
- **Root cause**: The `findPageStartNearExpectedBoundary` function in `breakpoint-utils.ts` uses prefix matching to find page boundaries. When content is duplicated, it finds matches at incorrect positions within the current page instead of at the actual page boundary.
+ Key functions: `applyBreakpoints()` `processOversizedSegment()` `findBreakpointWindowEndPosition()` `handlePageBoundaryBreak()`
 
- ### Key Functions in the Breakpoint Chain
+ ### General Debugging
 
- 1. **`applyBreakpoints()`** - Entry point for breakpoint processing
- 2. **`processOversizedSegment()`** - Iteratively breaks segments exceeding `maxPages`
- 3. **`computeWindowEndIdx()`** - Calculates max page index for current window
- 4. **`findBreakpointWindowEndPosition()`** - Finds the byte position where the window ends
- 5. **`findPageStartNearExpectedBoundary()`** - Content-based search for page start position
- 6. **`handlePageBoundaryBreak()`** - Handles empty pattern `''` (page boundary)
- 7. **`buildCumulativeOffsets()`** - Pre-computes exact byte positions for each page
+ - Pass `logger` with `debug`/`trace` methods to `segmentPages()` for detailed logs
+ - Check `boundaryPositions built` log for page boundary byte offsets
+ - Check `iteration=N` logs for `currentFromIdx`, `cursorPos`, `windowEndPosition` per loop
 
- ### Debug Strategy
-
- 1. **Check cumulative offsets first** - `buildCumulativeOffsets()` returns correct positions from `pages.join('\n')`
- 2. **Trace `expectedBoundary`** - This is calculated correctly from cumulative offsets
- 3. **Check `findPageStartNearExpectedBoundary` candidates** - The bug is usually here; it finds false matches
- 4. **Verify the deviation check** - Matches must be within `MAX_DEVIATION` (2000 chars) of expected boundary
-
- ### The Fix Applied
-
- Two changes in `breakpoint-utils.ts`:
-
- 1. **`findPageStartNearExpectedBoundary`** - Added `MAX_DEVIATION` check to reject matches too far from expected boundary:
- ```typescript
- const MAX_DEVIATION = 2000;
- if (bestDistance <= MAX_DEVIATION) {
- return bestCandidate.pos;
- }
- // Continue trying shorter prefixes or return -1
- ```
-
- 2. **`findBreakpointWindowEndPosition`** - Changed fallback from `remainingContent.length` to `bestExpectedBoundary`:
- ```typescript
- // Before (bug): return remainingContent.length; // Merges all remaining pages!
- // After (fix): return Math.min(bestExpectedBoundary, remainingContent.length);
- ```
-
- ### Test Case Pattern for This Bug
-
- ```typescript
- it('should correctly split pages with identical prefixes and duplicated content', () => {
- const sharedPrefix = 'SHARED PREFIX ';
- const filler = 'Lorem ipsum. '.repeat(200); // ~6000 chars
- const pages: Page[] = [
- { content: sharedPrefix + 'start ' + filler + sharedPrefix + 'end', id: 0 },
- { content: sharedPrefix + 'page1', id: 1 },
- { content: sharedPrefix + 'page2', id: 2 },
- ];
- const result = segmentPages(pages, { breakpoints: [''], maxPages: 0 });
- expect(result).toHaveLength(3); // Without fix: 2 or 1
- });
- ```
-
- ---
- 15. **Use Synthesized AI Reviews**: For complex safety features, getting reviews from multiple models (Claude, GPT, etc.) and synthesizing them into a single action plan (see `docs/reviews/max-content-length-review-synthesis.md`) revealed critical edge cases like Arabic diacritic corruption and surrogate pair safety that a single model might miss.
package/README.md CHANGED
@@ -117,6 +117,29 @@ Replace regex with readable tokens:
  | `{{bab}}` | "باب" (chapter) | `باب` |
  | `{{basmalah}}` | "بسم الله" | `بسم الله` |
 
+ #### Token Constants (TypeScript)
+
+ For better IDE support, use the `Token` constants instead of raw strings:
+
+ ```typescript
+ import { Token, withCapture } from 'flappa-doormal';
+
+ // Instead of:
+ { lineStartsWith: ['{{kitab}}', '{{bab}}'] }
+
+ // Use:
+ { lineStartsWith: [Token.KITAB, Token.BAB] }
+
+ // With named captures:
+ const pattern = withCapture(Token.RAQMS, 'hadithNum') + ' ' + Token.DASH + ' ';
+ // Result: '{{raqms:hadithNum}} {{dash}} '
+
+ { lineStartsAfter: [pattern], split: 'at' }
+ // segment.meta.hadithNum will contain the matched number
+ ```
+
+ Available constants: `Token.BAB`, `Token.BASMALAH`, `Token.BULLET`, `Token.DASH`, `Token.FASL`, `Token.HARF`, `Token.HARFS`, `Token.KITAB`, `Token.NAQL`, `Token.NUM`, `Token.NUMS`, `Token.NUMBERED`, `Token.RAQM`, `Token.RAQMS`, `Token.RUMUZ`, `Token.TARQIM`
+
  ### 2. Named Capture Groups
 
  Extract metadata automatically with the `{{token:name}}` syntax:
@@ -324,6 +347,62 @@ When a segment exceeds `maxPages` or `maxContentLength`, breakpoints split it at
  }
  ```
 
+ #### Breakpoint Pattern Behavior
+
+ When a breakpoint pattern matches, the split position is controlled by the `split` option:
+
+ ```typescript
+ {
+ breakpoints: [
+ // Default: split AFTER the match (match included in previous segment)
+ { pattern: '{{tarqim}}' }, // or { pattern: '{{tarqim}}', split: 'after' }
+
+ // Alternative: split AT the match (match starts next segment)
+ { pattern: 'ولهذا', split: 'at' },
+ ],
+ }
+ ```
+
+ **`split: 'after'` (default)**
+ - Previous segment **ENDS WITH** the matched text
+ - New segment **STARTS AFTER** the matched text
+
+ ```typescript
+ // Pattern "ولهذا" with split: 'after' on "النص الأول ولهذا النص الثاني"
+ // - Segment 1: "النص الأول ولهذا" (ends WITH match)
+ // - Segment 2: "النص الثاني" (starts AFTER match)
+ ```
+
+ **`split: 'at'`**
+ - Previous segment **ENDS BEFORE** the matched text
+ - New segment **STARTS WITH** the matched text
+
+ ```typescript
+ // Pattern "ولهذا" with split: 'at' on "النص الأول ولهذا النص الثاني"
+ // - Segment 1: "النص الأول" (ends BEFORE match)
+ // - Segment 2: "ولهذا النص الثاني" (starts WITH match)
+ ```
+
+ > **Note**: For empty pattern `''` (page boundary fallback), `split` is ignored since there is no matched text to include/exclude.
+
+ **Pattern order matters** - the first matching pattern wins:
+
+ ```typescript
+ {
+ // Patterns are tried in order
+ breakpoints: [
+ '\\.\\s*', // Try punctuation first
+ 'ولهذا', // Then try specific word
+ '', // Finally, fall back to page boundary
+ ],
+ }
+ // If punctuation is found, "ولهذا" is never tried
+ ```
+
+ > **Note on lookahead patterns**: Zero-length patterns like `(?=X)` are not supported for breakpoints because they can cause non-progress scenarios. Use `{ pattern: 'X', split: 'at' }` instead to achieve "split before X" behavior.
+
+ > **Note on whitespace**: Segments are trimmed by default. With `split:'at'`, if the match consists only of whitespace, it will be trimmed from the start of the next segment. This is usually desirable for delimiter patterns.
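
The ordering, default, and trimming rules described in this section can be tied together in one small sketch. Everything here is an assumption made for illustration: `normalize`, `cut`, and `splitOnce` are hypothetical names, the patterns are plain regex sources (no `{{token}}` expansion), only the first match per pattern is considered, and `min`/`max`/`exclude` are left out.

```typescript
// Hypothetical types mirroring the documented Breakpoint shape (min/max/exclude omitted).
type BreakpointRule = { pattern: string; split?: 'at' | 'after' };
type Breakpoint = string | BreakpointRule;

// Strings become rules; anything other than 'at' falls back to 'after';
// the empty pattern never splits 'at' because it has no matched text.
function normalize(bp: Breakpoint): Required<BreakpointRule> {
  const rule: BreakpointRule = typeof bp === 'string' ? { pattern: bp } : bp;
  const split = rule.pattern !== '' && rule.split === 'at' ? 'at' : 'after';
  return { pattern: rule.pattern, split };
}

// Cut at an offset and trim both pieces, as segments are trimmed by default.
function cut(content: string, pos: number): [string, string] {
  return [content.slice(0, pos).trim(), content.slice(pos).trim()];
}

// Try patterns in order; the first one that yields a usable break wins.
function splitOnce(content: string, breakpoints: Breakpoint[], pageBoundary: number): [string, string] {
  for (const bp of breakpoints) {
    const { pattern, split } = normalize(bp);
    if (pattern === '') return cut(content, pageBoundary); // page boundary fallback
    const match = new RegExp(pattern).exec(content);
    if (!match || match[0].length === 0) continue; // no match, or zero-length match
    if (split === 'at' && match.index === 0) continue; // would leave an empty first piece
    return cut(content, split === 'at' ? match.index : match.index + match[0].length);
  }
  return [content.trim(), ''];
}

// Usage with the text from the examples above:
const text = 'النص الأول ولهذا النص الثاني';
console.log(splitOnce(text, [{ pattern: 'ولهذا', split: 'at' }], text.length));
console.log(splitOnce(text, ['ولهذا'], text.length));
```

The first call yields the `split: 'at'` pair shown earlier and the second the `split: 'after'` pair, since a bare string defaults to `'after'`.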
+
  **Security note (ReDoS)**: Breakpoints (and raw `regex` rules) compile user-provided regular expressions. **Do not accept untrusted patterns** (e.g. from end users) without validation/sandboxing; some regexes can trigger catastrophic backtracking and hang the process.
 
  ### 12. Occurrence Filtering
package/dist/index.d.mts CHANGED
@@ -409,6 +409,17 @@ type BreakpointRule = {
  * Empty string `''` means fall back to page boundary.
  */
  pattern: string;
+ /**
+ * Where to split relative to the match.
+ * - `'at'`: New segment starts AT the match (previous segment does NOT include match)
+ * - `'after'`: New segment starts AFTER the match (previous segment ENDS WITH match)
+ *
+ * **Note**: For empty pattern `''`, `split` is ignored (page boundary).
+ * Invalid values are treated as `'after'`.
+ *
+ * @default 'after'
+ */
+ split?: 'at' | 'after';
  /**
  * Minimum page ID for this breakpoint to apply.
  * Segments starting before this page skip this pattern.
@@ -945,6 +956,84 @@ declare const segmentPages: (pages: Page[], options: SegmentationOptions) => Seg
  * // → '{{harf}}' (unchanged - no brackets outside tokens)
  */
  declare const escapeTemplateBrackets: (pattern: string) => string;
+ /**
+ * Pre-defined token constants for use in patterns.
+ *
+ * Using these constants instead of raw `{{token}}` strings provides:
+ * - Autocompletion in IDEs
+ * - Compile-time typo detection
+ * - Easier refactoring
+ *
+ * @example
+ * // Instead of:
+ * { lineStartsWith: ['{{kitab}}', '{{bab}}'] }
+ *
+ * // Use:
+ * import { Token } from 'flappa-doormal';
+ * { lineStartsWith: [Token.KITAB, Token.BAB] }
+ *
+ * @example
+ * // With named captures - use the helper function:
+ * import { Token, withCapture } from 'flappa-doormal';
+ * { lineStartsAfter: [withCapture(Token.RAQMS, 'num') + ' ' + Token.DASH + ' '] }
+ */
+ declare const Token: {
+ /** Chapter marker - باب */
+ readonly BAB: "{{bab}}";
+ /** Basmala - بسم الله */
+ readonly BASMALAH: "{{basmalah}}";
+ /** Bullet point variants */
+ readonly BULLET: "{{bullet}}";
+ /** Dash variants (hyphen, en-dash, em-dash, tatweel) */
+ readonly DASH: "{{dash}}";
+ /** Section marker - فصل / مسألة */
+ readonly FASL: "{{fasl}}";
+ /** Single Arabic letter */
+ readonly HARF: "{{harf}}";
+ /** Multiple Arabic letters separated by spaces */
+ readonly HARFS: "{{harfs}}";
+ /** Book marker - كتاب */
+ readonly KITAB: "{{kitab}}";
+ /** Hadith transmission phrases */
+ readonly NAQL: "{{naql}}";
+ /** Single ASCII digit */
+ readonly NUM: "{{num}}";
+ /** Composite: {{raqms}} {{dash}} (space) */
+ readonly NUMBERED: "{{numbered}}";
+ /** One or more ASCII digits */
+ readonly NUMS: "{{nums}}";
+ /** Single Arabic-Indic digit */
+ readonly RAQM: "{{raqm}}";
+ /** One or more Arabic-Indic digits */
+ readonly RAQMS: "{{raqms}}";
+ /** Source abbreviations (rijāl/takhrīj) */
+ readonly RUMUZ: "{{rumuz}}";
+ /** Punctuation marks */
+ readonly TARQIM: "{{tarqim}}";
+ };
+ /**
+ * Type representing valid token constant keys.
+ */
+ type TokenKey = keyof typeof Token;
+ /**
+ * Wraps a token with a named capture.
+ *
+ * @param token - Token constant (e.g., Token.RAQMS)
+ * @param name - Capture name for metadata extraction
+ * @returns Token string with capture syntax (e.g., '{{raqms:num}}')
+ *
+ * @example
+ * import { Token, withCapture } from 'flappa-doormal';
+ *
+ * // Create a pattern that captures the hadith number
+ * const pattern = withCapture(Token.RAQMS, 'hadithNum') + ' ' + Token.DASH + ' ';
+ * // Result: '{{raqms:hadithNum}} {{dash}} '
+ *
+ * // Use in a rule
+ * { lineStartsAfter: [pattern], split: 'at' }
+ * // segment.meta.hadithNum will contain the matched number
+ */
+ declare const withCapture: (token: string, name: string) => string;
  /**
  * Expands any *composite* tokens (like `{{numbered}}`) into their underlying template form
  * (like `{{raqms}} {{dash}} `).
@@ -1394,5 +1483,5 @@ declare function recoverMistakenMarkersForRuns(runs: MarkerRecoveryRun[], opts?:
  segments: Segment[];
  };
  //#endregion
- export { type Breakpoint, type BreakpointRule, type CommonLineStartPattern, type DetectedPattern, type ExpandResult, type LineStartAnalysisOptions, type LineStartPatternExample, type Logger, type MarkerRecoveryReport, type MarkerRecoveryRun, type MarkerRecoverySelector, type OptimizeResult, PATTERN_TYPE_KEYS, type Page, type PageRange, type PatternTypeKey, type RepeatingSequenceExample, type RepeatingSequenceOptions, type RepeatingSequencePattern, type ReplaceRule, type RuleValidationResult, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, type TokenMapping, type ValidationIssue, type ValidationIssueType, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule, applyReplacements, applyTokenMappings, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandCompositeTokensInTemplate, expandTokens, expandTokensWithCaptures, formatValidationReport, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, optimizeRules, recoverMistakenLineStartsAfterMarkers, recoverMistakenMarkersForRuns, segmentPages, shouldDefaultToFuzzy, stripTokenMappings, suggestPatternConfig, templateToRegex, validateRules };
+ export { type Breakpoint, type BreakpointRule, type CommonLineStartPattern, type DetectedPattern, type ExpandResult, type LineStartAnalysisOptions, type LineStartPatternExample, type Logger, type MarkerRecoveryReport, type MarkerRecoveryRun, type MarkerRecoverySelector, type OptimizeResult, PATTERN_TYPE_KEYS, type Page, type PageRange, type PatternTypeKey, type RepeatingSequenceExample, type RepeatingSequenceOptions, type RepeatingSequencePattern, type ReplaceRule, type RuleValidationResult, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, Token, type TokenKey, type TokenMapping, type ValidationIssue, type ValidationIssueType, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule, applyReplacements, applyTokenMappings, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandCompositeTokensInTemplate, expandTokens, expandTokensWithCaptures, formatValidationReport, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, optimizeRules, recoverMistakenLineStartsAfterMarkers, recoverMistakenMarkersForRuns, segmentPages, shouldDefaultToFuzzy, stripTokenMappings, suggestPatternConfig, templateToRegex, validateRules, withCapture };
  //# sourceMappingURL=index.d.mts.map
;;;;ACzuBA;AAsCA;;;cOkGa,mDAAoD;AN9IjE;AAKA;AAcA;;;;AAIe,cM8IF,oBN9IE,EAAA,CAAA,QAAA,EM+ID,eN/IC,EAAA,EAAA,GAAA;EAAe,WAAA,EAAA,gBAAA,GAAA,iBAAA;EAuFjB,KAAA,EAAA,OAAA;EAsCA,QAAA,CAAA,EAAA,MAAA;;;;ACrJb;AAiCA;;;AAAsE,cKsKzD,kBLtKyD,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA;EAAA,QAAA,EAAA,MAAA;;;;ECwQzD,QAAA,EI3FC,eJqJb,EAAA;CA1DmC,GAAA,IAAA;;;ALnPvB,KU5DD,sBAAA,GV4DsE;EAcrE,IAAA,EAAA,cAAA;;;;EClDR,KAAA,CAAA,EAAA,OAAY,GAAA,YAAA;EA4BZ,QAAA,EAAA,MAAA,EAAe;AAAA,CAAA,GA8Bf;EAiCA,IAAA,EAAA,WAAA;EAwBA,SAAA,EAAA,CAAA,IAAA,ESxIwC,STwIrB,EAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;AAAA,CAAA;AAgBlB,KStJM,iBAAA,GTsJN;EACA,OAAA,EStJO,mBTsJP;EACA,KAAA,EStJK,ITsJL,EAAA;EACA,QAAA,EStJQ,OTsJR,EAAA;EACA,QAAA,EStJQ,sBTsJR;CAAmB;AAiBZ,KSpKD,oBAAA,GToKwG;EAOxG,OAAA,EAAA;IAYP,IAAA,EAAA,YAAa,GAAA,wBAAA;IAiDN,SAAS,EAAA,MAAA;IAYhB,aAAA,EAAe,MAAA;IA6GR,SAAS,EAAA,MAAA;IAAG,UAAA,EAAA,MAAA;EAAc,CAAA;EAAgB,KAAA,CAAA,ESzV1C,KTyV0C,CAAA;IAAe,SAAA,EAAA,MAAA;IAkBzD,QAAI,EAAA,MAAA;IAoCJ,aAAc,EAAA,MAAA;IA0Ed,UAAU,EAAA,MAAA;EA8BL,CAAA,CAAA;EAuBL,OAAA,ESxgBC,KTwgBU,CAAA;IA+CX,IAAA,EAAA,MAAA;IAME,oBAAA,EAAA,MAAA;IASF,sBAAA,CAAA,EAAA,MAAA;IAiBY,qBAAA,CAAA,EAAA,MAAA;IA4DN,YAAA,EAAA,MAAA;IAwDL,MAAA,EAAA,WAAA,GAAA,oBAAA,GAAA,WAAA,GAAA,sBAAA,GAAA,qBAAA;IAAM,QAAA,EAAA,OAAA,GAAA,QAAA,GAAA,MAAA;IAiBP,EAAA,CAAA,EAAA,MAAO;;;;ECzuBP,QAAA,EAAA,MAAA,EAAc;AAsC1B,CAAA;KQVK,oBAAA;iBAinBW,qCAAA,QACL,kBACG,oBACD,+BACC,4BPlpBd;;EALY,gBAAA,CAAA,EO0pBe,oBP1pBI;AAK/B,CAAA,CAAA,EAAY;EAcA,MAAA,EOyoBC,oBPzoBmB;EACV,QAAA,EOwoBuB,OPxoBvB,EAAA;CACC;AACH,iBOmrBJ,6BAAA,CPnrBI,IAAA,EOorBV,iBPprBU,EAAA,EAAA,IAwFpB,CAxFoB,EAAA;EACL,IAAA,CAAA,EAAA,YAAA,GAAA,wBAAA;EAAe,gBAAA,CAAA,EOorBkD,oBPprBlD;AAuF9B,CAAA,CAAA,EAAa;EAsCA,MAAA,EOwjBA,oBP/hBP;YO+hBuC"}
+ {"version":3,"file":"index.d.mts","names":[],"sources":["../src/segmentation/fuzzy.ts","../src/segmentation/types.ts","../src/segmentation/optimize-rules.ts","../src/segmentation/pattern-validator.ts","../src/segmentation/replace.ts","../src/segmentation/segmenter.ts","../src/segmentation/tokens.ts","../src/analysis/line-starts.ts","../src/analysis/repeating-sequences.ts","../src/detection.ts","../src/recovery.ts"],"sourcesContent":[],"mappings":";;AAkEA;AAcA;;;;;AClDiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBY,cDnUC,WCmUG,EAAA,CAAA,CAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAoCJ,cDzVC,wBC0YU,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;ADxZvB;AAcA;;;;;AClDiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;KArVK,YAAA,GAqViD;EAAe;EAkBzD,KAAA,EAAI,MAAA;AAoChB,CAAA;AAsFA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACrvBA;AAsCA;;;;AC5CA;AAKA;AAcA;;KF0BK,eAAA,GExBkB;EACH;EACL,QAAA,EAAA,MAAA;CAAe;AAuF9B;AAsCA;;;;ACrJA;AAiCA;;;;;;;;AC6QA;;;;;;;;ACtPA;AAoPA;AAsCA;AAoBA,KL1RK,qBAAA,GKkSJ;EAuDY;EAgDA,cAAA,EAAA,MAIZ,EAAA;AAuCD,CAAA;AAWA;AAmJA;AA8CA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;ACz1BA;AAcA;AAEA;AAwQA;;;;;;;;AClRA;AAaA;AAOA;AA2OA;KPjKK,sBAAA,GOkKM;EACG;EACX,eAAA,EAAA,MAAA,EAAA;CAAwB;;;;ACjR3B;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC7MA;AAKA;;;;;;AAOA;AA2BE,KToGG,mBAAA,GSlGoB;EAinBT;EACL,YAAA,EAAA,MAAA,EAAA;CACG;;;;;;;AAoDd;;;;KTtjBK,WAAA,GACC,YSwjBuC,GTvjBvC,eSujBuC,GTtjBvC,qBSsjBuC,GTrjBvC,sBSqjBuC,GTpjBvC,mBSojBuC;;;;;;;;;;;;;;;;cTniBhC;;;;;;KAOD,cAAA,WAAyB;;;;;;;KAYhC,aAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAiDO,SAAA;;;;;;;KAYP,eAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAyCS;;;;;;;;;;;;SAaH;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAuDC,SAAA,GAAY,cAAc,gBAAgB;;;;;;;;;;;;;KAkB1C,IAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAoCA,cAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAiDE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqCF,UAAA,YAAsB;;;;;;;;;;;;;;;;;;;;;;;;;UA8BjB,MAAA;;;;;;;;;;;;;;;;;;;;;;KAuBL,WAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;KA+CA,mBAAA;;;;;;YAME;;;;;;;;UASF;;;;;;;;;;;;;;cAiBY;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;gBA4DN;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;WAwDL;;;;;;;;;;;;;;;;KAiBD,OAAA;;;;;;;;;;;;;;;;;;;;;;;;;;SA6BD;;;;AAppBa;;;AAkBlB,KChJM,cAAA,GDgJN;EACA;EACA,KAAA,EChJK,SDgJL,EAAA;EAAmB;EAiBZ,WAAA,EAAA,MAAA;AAOb,CAAA;AAYK,cChJQ,aDgJK,EAAA,CAAA,KAAA,EChJmB,SDgJnB,EAAA,EAAA,GAAA;EAiDN,WAAA,EAAS,MAAA;EAYhB,KAAA,EC7MgC,SD6MhC,EAAe;AA6GpB,CAAA;;;AArViB;AA4BG;AA8BM;AAyDrB,KEpIO,mBAAA,GFoIY,gBAAA,GAAA,eAAA,GAAA,WAAA,GAAA,eAAA;AAAA;;;AAkBlB,KEjJM,eAAA,GFiJN;EACA,IAAA,EEjJI,mBFiJJ;EACA,OAAA,EAAA,MAAA;EAAmB,UAAA,CAAA,EAAA,MAAA;EAiBZ;EAOD,KAAA,CAAA,EAAA,MAAA;EAYP;EAiDO,OAAA,CAAA,EAAA,MAAS;AAA6B,CAAA;AAyHlD;;;;AAAqE,KEnVzD,oBAAA,GFmVyD;EAkBzD,cAAI,CAAA,EAAA,CEpWM,eFoWN,GAAA,SAAA,CAAA,EAAA;EAoCJ,eAAA,CAAA,EAAc,CEvYH,eFwbT,GAAA,SAAS,CAAA,EAAA;EAqCX,YAAA,CAAU,EAAA,CE5dF,eF4dc,GAAA,SAAc,CAAA,EAAA;EA8B/B,QAAA,CAAM,EEzfR,eFyfQ;AAuBvB,CAAA;AA+CA;;;;;;;AAqKA;;;;ACrvBA;AAsCA;;;;AC5CA;AAKA;AAcA;AACsB,cA0FT,aA1FS,EAAA,CAAA,KAAA,EA0Fe,SA1Ff,EAAA,EAAA,GAAA,CA0F0B,oBA1F1B,GAAA,SAAA,CAAA,EAAA;;;;;AA0FtB;AAsCA;;;;ACrJA;AAiCA;;;AAAsE,cDoHzD,sBCpHyD,EAAA,CAAA,OAAA,EAAA,CDoHrB,oBCpHqB,GAAA,SAAA,CAAA,EAAA,EAAA,GAAA,MAAA,EAAA;;;AJqBtE;AAcA;;;;;AClDiB;AA4BG;AA8BM;AAiCC;AAuCtB,KGpJO,WAAA,GAAc,WHoJV,CGpJsB,mBHoJtB,CAAA,SAAA,CAAA,CAAA,CAAA,MAAA,CAAA;AACV,cGpHO,iBHoHP,EAAA,CAAA,KAAA,EGpHmC,IHoHnC,EAAA,EAAA,KAAA,CAAA,EGpHmD,WHoHnD,EAAA,EAAA,GGpHgE,IHoHhE,EAAA;;;;;;;AAoON;AAoCA;AAsFA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACrvBA;AAsCA;;;;AC5CA;AAKA;AAcA;;;;;;AA2FA;AAsCA;;;;ACrJA;AAiCA;;;AAAsE,cC6QzD,YD7QyD,EAAA,CAAA,KAAA,EC6QlC,ID7QkC,EAAA,EAAA,OAAA,EC6QjB,mBD7QiB,EAAA,GC6QE,OD7QF,EAAA;;;;AJqBtE;AAcA;;;;;AClDiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AAsFA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACrvBA;AAsCA;;;;AC5CA;AAKA;AAcA;;;;;;AA2FA;AAsCA;;;;ACrJA;AAiCa,cEuBA,sBFLZ,EAAA,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;;;;;AC2PD;;;;;;;;ACtPA;AAoPA;AAsCA;AAoBA;
AA+DA;AAgDA;AA2Ca,cApNA,KAuNZ,EAAA;EAQW;EAmJC,SAAA,GAAA,EAAA,SAAA;EA8CA;EAuBA,SAAA,QAAA,EAOZ,cAP+C;EAqBnC;EAgBA,SAAA,MAAA,EAAA,YAAkE;EA8BlE;EAWD,SAAA,IAAA,EAAY,UAAA;EAoBX;EA6BA,SAAA,IAAA,EAAA,UAGZ;;;;EC51BW,SAAA,KAAA,EAAA,WAAwB;EAcxB;EAEA,SAAA,KAAA,EAAA,WAAsB;EAwQrB;EACF,SAAA,IAAA,EAAA,UAAA;EACE;EACV,SAAA,GAAA,EAAA,SAAA;EAAsB;;;;ECrRb;EAaA,SAAA,IAAA,EAAA,UAAA;EAOA;EA2OC,SAAA,KAAA,EAAA,WAAA;EACF;EACG,SAAA,KAAA,EAAA,WAAA;EACX;EAAwB,SAAA,MAAA,EAAA,YAAA;;;;ACjR3B;AA+Ea,KHmQD,QAAA,GGnQC,MAiDZ,OHkNmC,KGnQY;AAgEhD;AAuBA;AAiCA;;;;AC7MA;AAKA;;;;;;AAOA;AA2BE;AAmnBF;;;AAGa,cJjTA,WIiTA,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;;;AAmDb;;;;;;;;;cJrSa;;;;;;;;;;;;;;;;;;;;;;;;;;cAgDA,gBAAgB;;;;;;;;;;;;;;;;cA2ChB;;;;;;;KAWD,YAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;cAmJC;;;;;;;;;;;;;;;;;;;;;;;;cA8CA;;;;;;;;;;;;;;;;;;;;;;cAuBA,uCAAmC;;;;;;;;;;;;;cAqBnC;;;;;;;;;;;;;;;cAgBA;;;;;;;;;;;;;;;cA8BA;;;;KAWD,YAAA;;;;;;;;;;;;;;;;;;;;;;cAoBC,iDAAkD;;;;;;;;;;;;;;;;cA6BlD;;;AN5yBA,KO7CD,wBAAA,GP6CsE;EAcrE,IAAA,CAAA,EAAA,MAAA;;;;EClDR,WAAA,CAAA,EAAA,MAAY;EA4BZ,wBAAe,CAAA,EAAA,OAAA;EA8Bf,yBAAqB,CAAA,EAAA,OAAA;EAiCrB,MAAA,CAAA,EAAA,aAAA,GAAsB,OAAA;EAwBtB,UAAA,CAAA,EAAA,CAAA,IAAA,EAAA,MAAmB,EAAA,MAAA,EAAA,MAAA,EAAA,GAAA,OAAA;EAenB,cAAW,CAAA,EMjIK,MNiIL,EAAA;EACV,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;CACA;AACA,KMhIM,uBAAA,GNgIN;EACA,IAAA,EAAA,MAAA;EACA,MAAA,EAAA,MAAA;CAAmB;AAiBZ,KMjJD,sBAAA,GNiJwG;EAOxG,OAAA,EAAA,MAAA;EAYP,KAAA,EAAA,MAAA;EAiDO,QAAA,EMlNE,uBNkNO,EAAA;AAA6B,CAAA;AAyHlD;;;AAAsD,cMtEzC,uBNsEyC,EAAA,CAAA,KAAA,EMrE3C,INqE2C,EAAA,EAAA,OAAA,CAAA,EMpEzC,wBNoEyC,EAAA,GMnEnD,sBNmEmD,EAAA;;;AAlOjD,KOtHO,wBAAA,GPsHY;EAenB,WAAA,CAAA,EAAW,MAAA;EACV,WAAA,CAAA,EAAA,MAAA;EACA,QAAA,CAAA,EAAA,MAAA;EACA,IAAA,CAAA,EAAA,MAAA;EACA,yBAAA,CAAA,EAAA,OAAA;EACA,YAAA,CAAA,EAAA,OAAA;EAAmB,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;EAiBZ,WAAA,CAAA,EAAA,MAAuG;EAOxG,YAAA,CAAA,EAAA,MAAc;EAYrB,iBAAa,CAAA,EAAA,MAAA;AAiDlB,CAAA;AAYK,KO9NO,wBAAA,GPuQE;EAoEF,IAAA,EAAA,MAAS;EAAG,OAAA,EAAA,MAAA;EAAc,MAAA,EAAA
,MAAA;EAAgB,YAAA,EAAA,MAAA,EAAA;CAAe;AAkBzD,KOtVA,wBAAA,GPsVI;EAoCJ,OAAA,EAAA,MAAA;EAsFA,KAAA,EAAA,MAAU;EA8BL,QAAA,EO3eH,wBP2eS,EAAA;AAuBvB,CAAA;;;;;;AAoNA;cO9ea,mCACF,kBACG,6BACX;;;;AR3NH;AAcA;;;;;AClDiB;AA4BG;AA+Df,KQ7GO,eAAA,GR6Ge;EAwBtB;EAeA,KAAA,EAAA,MAAA;EACC;EACA,KAAA,EAAA,MAAA;EACA;EACA,KAAA,EAAA,MAAA;EACA;EAAmB,QAAA,EAAA,MAAA;AAiBzB,CAAA;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AAsFA;AA8BA;AAuBA;AA+CA;AAMc,cQ9gBD,mBR8gBC,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GQ9gBkC,eR8gBlC,EAAA;;;;;;AA+Jd;;;;ACrvBA;AAsCA;;;cOkGa,mDAAoD;AN9IjE;AAKA;AAcA;;;;AAIe,cM8IF,oBN9IE,EAAA,CAAA,QAAA,EM+ID,eN/IC,EAAA,EAAA,GAAA;EAAe,WAAA,EAAA,gBAAA,GAAA,iBAAA;EAuFjB,KAAA,EAAA,OAAA;EAsCA,QAAA,CAAA,EAAA,MAAA;;;;ACrJb;AAiCA;;;AAAsE,cKsKzD,kBLtKyD,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA;EAAA,QAAA,EAAA,MAAA;;;;EC6QzD,QAAA,EIhGC,eJ0Jb,EAAA;CA1DmC,GAAA,IAAA;;;ALxPvB,KU5DD,sBAAA,GV4DsE;EAcrE,IAAA,EAAA,cAAA;;;;EClDR,KAAA,CAAA,EAAA,OAAY,GAAA,YAAA;EA4BZ,QAAA,EAAA,MAAA,EAAe;AAAA,CAAA,GA8Bf;EAiCA,IAAA,EAAA,WAAA;EAwBA,SAAA,EAAA,CAAA,IAAA,ESxIwC,STwIrB,EAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;AAAA,CAAA;AAgBlB,KStJM,iBAAA,GTsJN;EACA,OAAA,EStJO,mBTsJP;EACA,KAAA,EStJK,ITsJL,EAAA;EACA,QAAA,EStJQ,OTsJR,EAAA;EACA,QAAA,EStJQ,sBTsJR;CAAmB;AAiBZ,KSpKD,oBAAA,GToKwG;EAOxG,OAAA,EAAA;IAYP,IAAA,EAAA,YAAa,GAAA,wBAAA;IAiDN,SAAS,EAAA,MAAA;IAYhB,aAAA,EAAe,MAAA;IA6GR,SAAS,EAAA,MAAA;IAAG,UAAA,EAAA,MAAA;EAAc,CAAA;EAAgB,KAAA,CAAA,ESzV1C,KTyV0C,CAAA;IAAe,SAAA,EAAA,MAAA;IAkBzD,QAAI,EAAA,MAAA;IAoCJ,aAAc,EAAA,MAAA;IAsFd,UAAU,EAAA,MAAA;EA8BL,CAAA,CAAA;EAuBL,OAAA,ESphBC,KTohBU,CAAA;IA+CX,IAAA,EAAA,MAAA;IAME,oBAAA,EAAA,MAAA;IASF,sBAAA,CAAA,EAAA,MAAA;IAiBY,qBAAA,CAAA,EAAA,MAAA;IA4DN,YAAA,EAAA,MAAA;IAwDL,MAAA,EAAA,WAAA,GAAA,oBAAA,GAAA,WAAA,GAAA,sBAAA,GAAA,qBAAA;IAAM,QAAA,EAAA,OAAA,GAAA,QAAA,GAAA,MAAA;IAiBP,EAAA,CAAA,EAAA,MAAO;;;;ECrvBP,QAAA,EAAA,MAAA,EAAc;AAsC1B,CAAA;KQVK,oBAAA;iBAinBW,qCAAA,QACL,kBACG,oBACD,+BACC,4BPlpBd;;EALY,gBAAA,CAAA,EO0pBe,oBP1pBI;AAK/B,CAAA,CAAA,EAAY;EAcA,MAAA,EOyoBC,oBPzoBmB;EACV,QAAA,EOwoBuB,OPxoBvB,EAAA;CACC;AACH,iBOm
rBJ,6BAAA,CPnrBI,IAAA,EOorBV,iBPprBU,EAAA,EAAA,IAwFpB,CAxFoB,EAAA;EACL,IAAA,CAAA,EAAA,YAAA,GAAA,wBAAA;EAAe,gBAAA,CAAA,EOorBkD,oBPprBlD;AAuF9B,CAAA,CAAA,EAAa;EAsCA,MAAA,EOwjBA,oBP/hBP;YO+hBuC"}