flappa-doormal 2.13.2 → 2.13.4

package/AGENTS.md CHANGED
@@ -453,46 +453,44 @@ bunx biome lint .
453
453
 
454
454
  1. **`lineStartsAfter` vs `lineStartsWith` is not “cosmetic”**: `lineStartsAfter` changes output by stripping the matched marker via an internal `contentStartOffset` during segment construction. If a client used it by accident, you cannot reconstruct the exact stripped prefix from output alone without referencing the original pages and re-matching the marker.
455
455
 
456
- 2. **Recovery must mirror segmentation’s preprocessing**: If `SegmentationOptions.replace` was used, recovery must apply the same replacements (see `src/preprocessing/replace.ts`) before attempting anchoring or rerun alignment, otherwise substring matching and page joins will drift.
457
-
458
- 3. **Page joining differs between matching and output**:
456
+ 2. **Page joining differs between matching and output**:
459
457
  - Matching always happens on pages concatenated with `\\n` separators.
460
458
  - Output segments may normalize page boundaries (`pageJoiner: 'space' | 'newline'`) and breakpoints post-processing uses its own join normalization utilities.
461
459
  Recovery code must be explicit about which representation it’s searching.
462
460
 
463
- 4. **Breakpoints can produce “pieces” that were never marker-stripped**: When `maxPages` + `breakpoints` are enabled, only the piece that starts at the original structural boundary could have lost a marker due to `lineStartsAfter`. Mid-segment breakpoint pieces should not be “recovered” unless you can anchor them confidently.
461
+ 3. **Breakpoints can produce “pieces” that were never marker-stripped**: When `maxPages` + `breakpoints` are enabled, only the piece that starts at the original structural boundary could have lost a marker due to `lineStartsAfter`. Mid-segment breakpoint pieces should not be “recovered” unless you can anchor them confidently.
464
462
 
465
- 5. **Fuzzy defaults are easy to miss**: Some tokens auto-enable fuzzy matching unless `fuzzy: false` is set (`bab`, `basmalah`, `fasl`, `kitab`, `naql`). If you are validating markers or re-matching prefixes, use the same compilation path as segmentation (`buildRuleRegex` / `processPattern`) so diacritics and token expansion behave identically.
463
+ 4. **Fuzzy defaults are easy to miss**: Some tokens auto-enable fuzzy matching unless `fuzzy: false` is set (`bab`, `basmalah`, `fasl`, `kitab`, `naql`). If you are validating markers or re-matching prefixes, use the same compilation path as segmentation (`buildRuleRegex` / `processPattern`) so diacritics and token expansion behave identically.
466
464
 
467
- 6. **Auto-escaping applies to template-like patterns**: `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` auto-escape `()[]` outside `{{tokens}}`. Raw `regex` does not. If you compare patterns by string equality, be careful about escaping and whitespace.
465
+ 5. **Auto-escaping applies to template-like patterns**: `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` auto-escape `()[]` outside `{{tokens}}`. Raw `regex` does not. If you compare patterns by string equality, be careful about escaping and whitespace.
468
466
 
469
- 7. **TypeScript union pitfalls with `SplitRule`**: `SplitRule` is a union where only one pattern type should exist. Avoid mutating rules in-place with `delete` on fields (TS often narrows unions and then complains). Prefer rebuilding converted rules via destructuring (e.g. `{ lineStartsAfter, ...rest }` then create `{...rest, lineStartsWith: lineStartsAfter}`).
467
+ 6. **TypeScript union pitfalls with `SplitRule`**: `SplitRule` is a union where only one pattern type should exist. Avoid mutating rules in-place with `delete` on fields (TS often narrows unions and then complains). Prefer rebuilding converted rules via destructuring (e.g. `{ lineStartsAfter, ...rest }` then create `{...rest, lineStartsWith: lineStartsAfter}`).
470
468
 
471
- 8. **Biome lint constraints shape implementation**: The repo enforces low function complexity. Expect to extract helpers (alignment, selector resolution, anchoring) to keep Biome happy. Also, Biome can flag regex character-class usage as misleading; prefer alternation (e.g. `(?:\\u200C|\\u200D|\\uFEFF)`) when removing specific codepoints.
469
+ 7. **Biome lint constraints shape implementation**: The repo enforces low function complexity. Expect to extract helpers (alignment, selector resolution, anchoring) to keep Biome happy. Also, Biome can flag regex character-class usage as misleading; prefer alternation (e.g. `(?:\\u200C|\\u200D|\\uFEFF)`) when removing specific codepoints.
472
470
 
473
- 9. **When debugging recovery, start here**:
471
+ 8. **When debugging recovery, start here**:
474
472
  - `src/segmentation/segmenter.ts` (how content is sliced/trimmed and how `from/to` are computed)
475
473
  - `src/segmentation/rule-regex.ts` + `src/segmentation/tokens.ts` (token expansion + fuzzy behavior)
476
474
  - `src/preprocessing/replace.ts` (preprocessing parity)
477
475
  - `src/recovery.ts` (recovery implementation)
478
476
 
479
- 10. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
477
+ 9. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
480
478
 
481
- 11. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
479
+ 10. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
482
480
 
483
- 12. **Unicode Boundary Safety (Surrogates + Graphemes)**: Multi-byte characters (like emojis) can be corrupted if split in the middle of a surrogate pair. Similarly, Arabic diacritics (combining marks), ZWJ/ZWNJ, and variation selectors can be orphaned if a hard split lands in the middle of a grapheme cluster. Use `adjustForUnicodeBoundary` when forced to hard-split near a limit.
481
+ 11. **Unicode Boundary Safety (Surrogates + Graphemes)**: Multi-byte characters (like emojis) can be corrupted if split in the middle of a surrogate pair. Similarly, Arabic diacritics (combining marks), ZWJ/ZWNJ, and variation selectors can be orphaned if a hard split lands in the middle of a grapheme cluster. Use `adjustForUnicodeBoundary` when forced to hard-split near a limit.
484
482
 
485
- 13. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
483
+ 12. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
486
484
 
487
- 14. **Accidental File Overwrites**: Be extremely careful when using tools like `replace_file_content` with large ranges. Verify file integrity frequently (e.g., `git diff`) to catch accidental deletions of existing code or tests. Merging new tests into existing files is a high-risk operation for AI agents.
485
+ 13. **Accidental File Overwrites**: Be extremely careful when using tools like `replace_file_content` with large ranges. Verify file integrity frequently (e.g., `git diff`) to catch accidental deletions of existing code or tests. Merging new tests into existing files is a high-risk operation for AI agents.
488
486
 
489
- 15. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible formatting marks like Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), Arabic Letter Mark (`U+061C`), Zero-Width Space (`U+200B`), Zero-Width Non-Joiner (`U+200C`), Zero-Width Joiner (`U+200D`), or BOM (`U+FEFF`). These can appear at line starts after `\n` but before visible characters, breaking `^` anchored patterns. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\u200C\u200D\uFEFF]*(?:pattern)`. The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
487
+ 14. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible formatting marks like Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), Arabic Letter Mark (`U+061C`), Zero-Width Space (`U+200B`), Zero-Width Non-Joiner (`U+200C`), Zero-Width Joiner (`U+200D`), or BOM (`U+FEFF`). These can appear at line starts after `\n` but before visible characters, breaking `^` anchored patterns. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\u200C\u200D\uFEFF]*(?:pattern)`. The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
490
488
 
491
- 16. **Large Segment Performance & Debugging Strategy**: When processing large books (1000+ pages), avoid O(n²) algorithms. The library uses a fast-path threshold (1000 pages) to switch from accurate string-search boundary detection to cumulative-offset-based slicing. Even on the iterative path (e.g. debug mode), we **slice only the active window (+padding)** per iteration (never `fullContent.slice(cursorPos)`), to avoid quadratic allocation/GC churn. To diagnose performance bottlenecks: (1) Look for logs with "Using iterative path" or "Using accurate string-search path" with large `pageCount` values, (2) Check `iterations` count in completion logs, (3) Strategic logs are placed at operation boundaries (start/end) NOT inside tight loops to avoid log-induced performance regression.
489
+ 15. **Large Segment Performance & Debugging Strategy**: When processing large books (1000+ pages), avoid O(n²) algorithms. The library uses a fast-path threshold (1000 pages) to switch from accurate string-search boundary detection to cumulative-offset-based slicing. Even on the iterative path (e.g. debug mode), we **slice only the active window (+padding)** per iteration (never `fullContent.slice(cursorPos)`), to avoid quadratic allocation/GC churn. To diagnose performance bottlenecks: (1) Look for logs with "Using iterative path" or "Using accurate string-search path" with large `pageCount` values, (2) Check `iterations` count in completion logs, (3) Strategic logs are placed at operation boundaries (start/end) NOT inside tight loops to avoid log-induced performance regression.
492
490
 
493
- 17. **`maxPages=0` is a hard invariant**: When `maxPages=0`, breakpoint windows must never scan beyond the current page boundary. Relying purely on boundary detection (string search) can fail near page ends for long Arabic text + space joiners, letting the window “see” into the next page and creating multi-page segments. The safe fix is to clamp the breakpoint window to the current page’s end using `boundaryPositions` in breakpoint processing.
491
+ 16. **`maxPages=0` is a hard invariant**: When `maxPages=0`, breakpoint windows must never scan beyond the current page boundary. Relying purely on boundary detection (string search) can fail near page ends for long Arabic text + space joiners, letting the window “see” into the next page and creating multi-page segments. The safe fix is to clamp the breakpoint window to the current page’s end using `boundaryPositions` in breakpoint processing.
494
492
 
495
- 18. **`''` breakpoint semantics depend on whether the window is page-bounded vs length-bounded**: `''` means “page boundary fallback”, but it’s intentionally **mode-dependent**:
493
+ 17. **`''` breakpoint semantics depend on whether the window is page-bounded vs length-bounded**: `''` means “page boundary fallback”, but it’s intentionally **mode-dependent**:
496
494
 
497
495
  - **Page-bounded window (maxPages-driven)**: `''` should “swallow the remainder of the current page” (i.e. break at the **next page boundary**, not at an arbitrary character limit). This prevents accidentally consuming part of the next page when no other breakpoint patterns match.
498
496
  - **Length-bounded window (maxContentLength-driven)**: `''` should **not** force an early page-boundary break. In this mode we want the best split *near the length limit* (safe-break fallback → Unicode-safe hard split) even if that means a piece can cross a page boundary.
@@ -508,17 +506,17 @@ bunx biome lint .
508
506
  }
509
507
  ```
510
508
 
511
- 19. **Beware `.only` in test files**: A single `it.only(...)` can mask unrelated failing fixtures for a long time. When debugging, remove `.only` as soon as you have a focused reproduction, and re-run the full suite to catch latent failures.
509
+ 18. **Beware `.only` in test files**: A single `it.only(...)` can mask unrelated failing fixtures for a long time. When debugging, remove `.only` as soon as you have a focused reproduction, and re-run the full suite to catch latent failures.
512
510
 
513
- 20. **Tooling gotcha: IDE diagnostics vs actual parser**: If the editor shows parse errors but `bun test` and `bunx biome check` pass, suspect unsaved local edits or stale diagnostics rather than codebase syntax. Always validate with a direct `bunx biome check <file>` before making sweeping “syntax fix” edits.
511
+ 19. **Tooling gotcha: IDE diagnostics vs actual parser**: If the editor shows parse errors but `bun test` and `bunx biome check` pass, suspect unsaved local edits or stale diagnostics rather than codebase syntax. Always validate with a direct `bunx biome check <file>` before making sweeping “syntax fix” edits.
514
512
 
515
- 21. **Content-based page detection fails with overlapping content**: The `computeNextFromIdx` function uses prefix matching to detect page transitions. When page 0 ends with text identical to page 1's prefix, it incorrectly advances `currentFromIdx`. **Fix**: When `maxPages=0`, override content-based detection with position-based detection via `findPageIndexForPosition(cursorPos, boundaryPositions, fromIdx)`. Always trust cumulative offsets over content heuristics for strict page isolation.
513
+ 20. **Content-based page detection fails with overlapping content**: The `computeNextFromIdx` function uses prefix matching to detect page transitions. When page 0 ends with text identical to page 1's prefix, it incorrectly advances `currentFromIdx`. **Fix**: When `maxPages=0`, override content-based detection with position-based detection via `findPageIndexForPosition(cursorPos, boundaryPositions, fromIdx)`. Always trust cumulative offsets over content heuristics for strict page isolation.
516
514
 
517
- 22. **Test edge cases with data that TRIGGERS the bug path**: Simple test data often bypasses problematic code paths. Ensure tests: (a) use `maxContentLength` to force sub-page splitting, (b) include enough content to exceed window sizes, (c) create overlapping/duplicate text at page boundaries, (d) verify that segments are actually split (not just checking no crashes).
515
+ 21. **Test edge cases with data that TRIGGERS the bug path**: Simple test data often bypasses problematic code paths. Ensure tests: (a) use `maxContentLength` to force sub-page splitting, (b) include enough content to exceed window sizes, (c) create overlapping/duplicate text at page boundaries, (d) verify that segments are actually split (not just checking no crashes).
518
516
 
519
- 23. **Debug breakpoint processing with the logger**: Pass a `logger` object with `debug` and `trace` methods to `segmentPages()`. Key logs: `boundaryPositions built` (page boundary byte offsets), `iteration=N` (shows `currentFromIdx`, `cursorPos`, `windowEndPosition` per loop), `Complete` (final segment count).
517
+ 22. **Debug breakpoint processing with the logger**: Pass a `logger` object with `debug` and `trace` methods to `segmentPages()`. Key logs: `boundaryPositions built` (page boundary byte offsets), `iteration=N` (shows `currentFromIdx`, `cursorPos`, `windowEndPosition` per loop), `Complete` (final segment count).
520
518
 
521
- 24. **Navigating `breakpoint-processor.ts`**: Key functions in (approximate) execution order:
519
+ 23. **Navigating `breakpoint-processor.ts`**: Key functions in (approximate) execution order:
522
520
 
523
521
  - `applyBreakpoints()` (entry point)
524
522
  - `processOversizedSegment()` (main loop)
@@ -530,15 +528,15 @@ bunx biome lint .
530
528
  - `advanceCursorAndIndex()` (progress)
531
529
  - `computeNextFromIdx()` (heuristic) **or** position-based override when `maxPages=0` (see #21)
532
530
 
533
- 25. **Page attribution can drift in large-document breakpoint processing**: For ≥`FAST_PATH_THRESHOLD` segments, boundary positions may be derived from cumulative offsets (fast path). If upstream content is modified (e.g. marker stripping or accidental leading-trim), binary-search attribution can classify a piece as starting **before** `currentFromIdx`, inflating `(to - from)` and violating `maxPages`. **Fix**: clamp `actualStartIdx >= currentFromIdx` and re-apply the `maxPages` window using the same ID-span logic as `computeWindowEndIdx(...)` before creating the piece segment.
531
+ 24. **Page attribution can drift in large-document breakpoint processing**: For ≥`FAST_PATH_THRESHOLD` segments, boundary positions may be derived from cumulative offsets (fast path). If upstream content is modified (e.g. marker stripping or accidental leading-trim), binary-search attribution can classify a piece as starting **before** `currentFromIdx`, inflating `(to - from)` and violating `maxPages`. **Fix**: clamp `actualStartIdx >= currentFromIdx` and re-apply the `maxPages` window using the same ID-span logic as `computeWindowEndIdx(...)` before creating the piece segment.
534
532
 
535
- 26. **Offset fast path must respect page-ID span semantics**: `maxPages` in this library is enforced as an **ID span** invariant (`(to ?? from) - from <= maxPages`). For large segments, the offset-based fast path must choose `segEnd` using the same ID-window logic as `computeWindowEndIdx(...)` (not “N pages by index”), otherwise gaps (e.g. `2216 → 2218`) produce illegal spans.
533
+ 25. **Offset fast path must respect page-ID span semantics**: `maxPages` in this library is enforced as an **ID span** invariant (`(to ?? from) - from <= maxPages`). For large segments, the offset-based fast path must choose `segEnd` using the same ID-window logic as `computeWindowEndIdx(...)` (not “N pages by index”), otherwise gaps (e.g. `2216 → 2218`) produce illegal spans.
536
534
 
537
- 27. **Never `trimStart()` huge fallback content**: `ensureFallbackSegment()` constructs “all pages as one segment” when there are no structural split rules. If this giant content is `trimStart()`’d, cumulative offsets and derived boundary positions become inconsistent, which can lead to incorrect `from/to` attribution and `maxPages` violations that only appear on very large books.
535
+ 26. **Never `trimStart()` huge fallback content**: `ensureFallbackSegment()` constructs “all pages as one segment” when there are no structural split rules. If this giant content is `trimStart()`’d, cumulative offsets and derived boundary positions become inconsistent, which can lead to incorrect `from/to` attribution and `maxPages` violations that only appear on very large books.
538
536
 
539
- 28. **Always test both sides of the fast-path threshold**: Several breakpoint bugs only reproduce at or above `FAST_PATH_THRESHOLD` (1000). Add regressions at `threshold-1` and `threshold` to avoid “works in small unit tests, fails on full books” surprises.
537
+ 27. **Always test both sides of the fast-path threshold**: Several breakpoint bugs only reproduce at or above `FAST_PATH_THRESHOLD` (1000). Add regressions at `threshold-1` and `threshold` to avoid “works in small unit tests, fails on full books” surprises.
540
538
 
541
- 29. **Breakpoint `split` behavior**: The `split: 'at' | 'after'` option for breakpoints controls where the split happens relative to the matched text:
539
+ 28. **Breakpoint `split` behavior**: The `split: 'at' | 'after'` option for breakpoints controls where the split happens relative to the matched text:
542
540
  - `'after'` (default): Match is included in the previous segment
543
541
  - `'at'`: Match starts the next segment
544
542
  Key implementation details in `findPatternBreakPosition`:
@@ -547,15 +545,15 @@ bunx biome lint .
547
545
  - Zero-length matches (lookaheads) are always skipped to prevent infinite loops
548
546
  - Empty pattern `''` forces `splitAt=false` since page boundaries have no matched text
549
547
 
550
- 30. **Unicode safety is the user's responsibility for patterns**: Unlike `findSafeBreakPosition` (which adjusts for grapheme boundaries), pattern-based breaks use the exact position where the user's regex matched. If a pattern matches mid-grapheme, that's a pattern authoring error, not a library bug. The library should NOT silently adjust pattern match positions.
548
+ 29. **Unicode safety is the user's responsibility for patterns**: Unlike `findSafeBreakPosition` (which adjusts for grapheme boundaries), pattern-based breaks use the exact position where the user's regex matched. If a pattern matches mid-grapheme, that's a pattern authoring error, not a library bug. The library should NOT silently adjust pattern match positions.
551
549
 
552
- 31. **Fast path doesn't affect split behavior**: The offset-based fast path only applies to empty pattern `''` breakpoints (page boundary fallback), and empty patterns force `splitAt=false`. Pattern-based breakpoints with `split:'at'` never engage the fast path.
550
+ 30. **Fast path doesn't affect split behavior**: The offset-based fast path only applies to empty pattern `''` breakpoints (page boundary fallback), and empty patterns force `splitAt=false`. Pattern-based breakpoints with `split:'at'` never engage the fast path.
553
551
 
554
- 32. **Whitespace trimming affects split:'at' output**: `createSegment()` trims segment content. With `split:'at'`, if the matched text is whitespace-only, it will be trimmed from the start of the next segment. This is usually desirable for delimiter patterns.
552
+ 31. **Whitespace trimming affects split:'at' output**: `createSegment()` trims segment content. With `split:'at'`, if the matched text is whitespace-only, it will be trimmed from the start of the next segment. This is usually desirable for delimiter patterns.
555
553
 
556
- 33. **`prefer` semantics with `split:'at'`**: With `prefer:'longer'` + `split:'at'`, the algorithm selects the LAST valid match, maximizing content in the previous segment. This is correct but can be counterintuitive since the resulting previous segment might appear "shorter" than with `split:'after'`.
554
+ 32. **`prefer` semantics with `split:'at'`**: With `prefer:'longer'` + `split:'at'`, the algorithm selects the LAST valid match, maximizing content in the previous segment. This is correct but can be counterintuitive since the resulting previous segment might appear "shorter" than with `split:'after'`.
557
555
 
558
- 34. **Multi-agent review synthesis**: Getting implementation reviews from multiple AI models (Claude, GPT, Grok, Gemini) and synthesizing their feedback helps catch issues a single reviewer might miss. Key insight: when reviewers disagree on "critical" issues, investigate the codebase to verify claims before implementing fixes. Some "critical" issues are based on incorrect assumptions about how fast paths or downstream functions work.
556
+ 33. **Multi-agent review synthesis**: Getting implementation reviews from multiple AI models (Claude, GPT, Grok, Gemini) and synthesizing their feedback helps catch issues a single reviewer might miss. Key insight: when reviewers disagree on "critical" issues, investigate the codebase to verify claims before implementing fixes. Some "critical" issues are based on incorrect assumptions about how fast paths or downstream functions work.
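Items 10 to 12 of this list (search-back fallback, surrogate safety, progress-based guard) can be sketched together in one small chunker. This is a minimal illustration, not the library's implementation: `findSplit`, `splitsSurrogatePair`, and `chunkText` are hypothetical names, the break set is copied from item 10, and only surrogate pairs are protected. Grapheme clusters (combining marks, ZWJ sequences) are out of scope here, which is what `adjustForUnicodeBoundary` is described as covering.

```typescript
// Hypothetical helpers for illustration; none of these are flappa-doormal exports.
const SAFE_BREAK = /[\s.,;!?؛،۔]/;

/** True when `pos` would fall between a high and a low surrogate. */
const splitsSurrogatePair = (text: string, pos: number): boolean => {
    if (pos <= 0 || pos >= text.length) {
        return false;
    }
    const before = text.charCodeAt(pos - 1);
    const after = text.charCodeAt(pos);
    return before >= 0xd800 && before <= 0xdbff && after >= 0xdc00 && after <= 0xdfff;
};

/** Pick an end position for a piece starting at `start`, at most `limit` code units long. */
const findSplit = (text: string, start: number, limit: number): number => {
    const hardEnd = Math.min(start + limit, text.length);
    if (hardEnd === text.length) {
        return hardEnd;
    }
    // Search backward so the break character stays with the previous piece.
    for (let i = hardEnd - 1; i > start; i--) {
        if (SAFE_BREAK.test(text.charAt(i))) {
            return i + 1;
        }
    }
    // No safe break found: hard split, but step back if it would cut a surrogate pair.
    return splitsSurrogatePair(text, hardEnd) && hardEnd - 1 > start ? hardEnd - 1 : hardEnd;
};

const chunkText = (text: string, limit: number): string[] => {
    const pieces: string[] = [];
    let cursorPos = 0;
    while (cursorPos < text.length) {
        const previous = cursorPos;
        const end = findSplit(text, cursorPos, limit);
        pieces.push(text.slice(cursorPos, end));
        cursorPos = end;
        // Progress-based guard: fail loudly instead of relying on a fixed iteration cap.
        if (cursorPos <= previous) {
            throw new Error(`No progress at position ${previous}`);
        }
    }
    return pieces;
};
```

The guard compares the cursor before and after each iteration, so arbitrarily long input is never truncated by an iteration limit, while a non-advancing split (for example `limit = 0`) throws instead of looping forever.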
559
557
 
560
558
  ### Process Template (Multi-agent design review, TDD-first)
561
559
 
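The invisible-marks item in the AGENTS.md list above (item 14 in the new numbering) describes an optional zero-width prefix for line-start patterns. A minimal sketch of that idea follows, for code that builds its own regexes outside the library. `lineStartRegex` and `INVISIBLE_PREFIX` are hypothetical names. The prefix uses alternation rather than the character class quoted in that item, because the Biome item in the same list suggests alternation for these codepoints.

```typescript
// The codepoints listed in the AGENTS.md item: LRM, RLM, ALM, ZWSP, ZWNJ, ZWJ, BOM.
const INVISIBLE_PREFIX = '(?:\\u200E|\\u200F|\\u061C|\\u200B|\\u200C|\\u200D|\\uFEFF)*';

/** Hypothetical helper: anchor `source` at a line start, tolerating leading invisible marks. */
const lineStartRegex = (source: string): RegExp =>
    new RegExp(`^${INVISIBLE_PREFIX}(?:${source})`, 'mu');

// A line that starts with a Right-to-Left Mark before the visible word.
const sampleText = 'intro\n\u200F\u0628\u0627\u0628 one';
const naive = new RegExp('^\\u0628\\u0627\\u0628', 'mu');
const tolerant = lineStartRegex('\\u0628\\u0627\\u0628');

console.log(naive.test(sampleText), tolerant.test(sampleText));
```

According to the same item, `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource` already add such a prefix, so this only matters for hand-written patterns.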
package/README.md CHANGED
@@ -1166,9 +1166,6 @@ const options: SegmentationOptions = {
1166
1166
  // - `flags` defaults to 'gu'. If provided, `g` and `u` are always enforced.
1167
1167
  // - `pageIds: []` means "apply to no pages" (skip that rule).
1168
1168
  // - Remember JSON escaping: to match a literal '.', use regex: "\\\\." in JSON.
1169
- replace: [
1170
- { regex: "([\\u0660-\\u0669]+)\\s*[-–—ـ]\\s*", replacement: "$1 - " }
1171
- ],
1172
1169
  rules: [
1173
1170
  { lineStartsWith: ['## '], split: 'at' }
1174
1171
  ],
package/dist/index.d.mts CHANGED
@@ -698,26 +698,6 @@ interface Logger {
698
698
  /** Log a warning message (potential issues) */
699
699
  warn?: (message: string, ...args: unknown[]) => void;
700
700
  }
701
- /**
702
- * - Default regex flags: `gu` (global + unicode)
703
- * - If `flags` is provided, it is validated and merged with required flags:
704
- * `g` and `u` are always enforced.
705
- *
706
- * `pageIds` controls which pages a rule applies to:
707
- * - `undefined`: apply to all pages
708
- * - `[]`: apply to no pages (rule is skipped)
709
- * - `[id1, id2, ...]`: apply only to those pages
710
- */
711
- type Replacement = {
712
- /** Raw regex source string (no token expansion). Compiled with `u` (and always `g`). */
713
- regex: string;
714
- /** Replacement string (passed to `String.prototype.replace`). */
715
- replacement: string;
716
- /** Optional regex flags; `g` and `u` are always enforced. */
717
- flags?: string;
718
- /** Optional list of page IDs to apply this replacement to. Empty array means skip. */
719
- pageIds?: number[];
720
- };
721
701
  /**
722
702
  * Segmentation options controlling how pages are split.
723
703
  *
@@ -751,12 +731,6 @@ type Replacement = {
751
731
  * };
752
732
  */
753
733
  type SegmentationOptions = {
754
- /**
755
- * Optional pre-processing replacements applied to page content BEFORE segmentation.
756
- *
757
- * Replacements are applied per-page (not on concatenated content), in array order.
758
- */
759
- replace?: Replacement[];
760
734
  /**
761
735
  * Rules applied in order to find split points.
762
736
  *
@@ -891,9 +865,6 @@ type SegmentationOptions = {
891
865
  logger?: Logger;
892
866
  };
893
867
  //#endregion
894
- //#region src/preprocessing/replace.d.ts
895
- declare const applyReplacements: (pages: Page[], rules?: Replacement[]) => Page[];
896
- //#endregion
897
868
  //#region src/recovery.d.ts
898
869
  type MarkerRecoverySelector = {
899
870
  type: 'rule_indices';
@@ -1389,5 +1360,5 @@ declare const escapeTemplateBrackets: (pattern: string) => string;
1389
1360
  declare const escapeRegex: (s: string) => string;
1390
1361
  declare const makeDiacriticInsensitive: (text: string) => string;
1391
1362
  //#endregion
1392
- export { type Breakpoint, type BreakpointRule, type CommonLineStartPattern, type DetectedPattern, type ExpandResult, type LineStartAnalysisOptions, type LineStartPatternExample, type Logger, type MarkerRecoveryReport, type MarkerRecoveryRun, type MarkerRecoverySelector, type OptimizeResult, PATTERN_TYPE_KEYS, type Page, type PageRange, type PatternProcessor, type PatternTypeKey, type RepeatingSequenceExample, type RepeatingSequenceOptions, type RepeatingSequencePattern, type RuleValidationResult, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, Token, type TokenKey, type TokenMapping, type ValidationIssue, type ValidationIssueType, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule, applyReplacements, applyTokenMappings, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandCompositeTokensInTemplate, expandTokens, expandTokensWithCaptures, formatValidationReport, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, optimizeRules, recoverMistakenLineStartsAfterMarkers, recoverMistakenMarkersForRuns, segmentPages, shouldDefaultToFuzzy, stripTokenMappings, suggestPatternConfig, templateToRegex, validateRules, withCapture };
1363
+ export { type Breakpoint, type BreakpointRule, type CommonLineStartPattern, type DetectedPattern, type ExpandResult, type LineStartAnalysisOptions, type LineStartPatternExample, type Logger, type MarkerRecoveryReport, type MarkerRecoveryRun, type MarkerRecoverySelector, type OptimizeResult, PATTERN_TYPE_KEYS, type Page, type PageRange, type PatternProcessor, type PatternTypeKey, type RepeatingSequenceExample, type RepeatingSequenceOptions, type RepeatingSequencePattern, type RuleValidationResult, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, Token, type TokenKey, type TokenMapping, type ValidationIssue, type ValidationIssueType, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule, applyTokenMappings, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandCompositeTokensInTemplate, expandTokens, expandTokensWithCaptures, formatValidationReport, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, optimizeRules, recoverMistakenLineStartsAfterMarkers, recoverMistakenMarkersForRuns, segmentPages, shouldDefaultToFuzzy, stripTokenMappings, suggestPatternConfig, templateToRegex, validateRules, withCapture };
1393
1364
  //# sourceMappingURL=index.d.mts.map
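With `replace` gone from `SegmentationOptions` and `applyReplacements` no longer exported, callers that relied on preprocessing would presumably need to transform pages themselves before calling `segmentPages`. The sketch below follows the behavior documented on the removed `Replacement` type (default flags `gu`, `g` and `u` always enforced, `pageIds` filtering, per-page application in array order). It is not the removed implementation: `PageLike`, `LocalReplacement`, `mergeFlags`, and `applyLocalReplacements` are hypothetical names, and the `{ id, content }` page shape is an assumption because the `Page` type body is not shown in this diff.

```typescript
// Assumed page shape; the real `Page` type is not visible in this diff.
type PageLike = { id: number; content: string };

// Mirrors the fields of the removed `Replacement` type.
type LocalReplacement = {
    regex: string;
    replacement: string;
    flags?: string;
    pageIds?: number[];
};

/** Merge user flags with the always-enforced `g` and `u`, skipping duplicates. */
const mergeFlags = (flags = ''): string => {
    let merged = 'gu';
    for (const flag of flags.split('')) {
        if (merged.indexOf(flag) === -1) {
            merged += flag;
        }
    }
    return merged;
};

const applyLocalReplacements = (pages: PageLike[], rules: LocalReplacement[] = []): PageLike[] => {
    // Compile once; an invalid flag or pattern throws here via the RegExp constructor.
    const compiled = rules.map((rule) => ({
        pageIds: rule.pageIds,
        re: new RegExp(rule.regex, mergeFlags(rule.flags)),
        replacement: rule.replacement,
    }));
    return pages.map((page) => {
        let content = page.content;
        for (const { re, replacement, pageIds } of compiled) {
            // undefined: all pages; []: no pages; otherwise only the listed IDs.
            if (pageIds && pageIds.indexOf(page.id) === -1) {
                continue;
            }
            content = content.replace(re, replacement);
        }
        return content === page.content ? page : { ...page, content };
    });
};
```

Usage would be along the lines of `segmentPages(applyLocalReplacements(pages, rules), options)`, assuming `segmentPages` takes the page array as its first argument.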
package/dist/index.d.mts.map CHANGED
@@ -1 +1 @@
1
- {"version":3,"file":"index.d.mts","names":[],"sources":["../src/types/index.ts","../src/analysis/line-starts.ts","../src/analysis/repeating-sequences.ts","../src/detection.ts","../src/types/rules.ts","../src/optimization/optimize-rules.ts","../src/types/breakpoints.ts","../src/types/options.ts","../src/preprocessing/replace.ts","../src/recovery.ts","../src/segmentation/breakpoint-utils.ts","../src/segmentation/pattern-validator.ts","../src/segmentation/segmenter.ts","../src/segmentation/tokens.ts","../src/utils/textUtils.ts"],"sourcesContent":[],"mappings":";;AAcA;AA4CA;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;AAGyB,KDpRb,OAAA,GCoRa;;;;ACzQzB;AAaA;AAOA;EAiOa,OAAA,EAAA,MAAA;EACF;;;EAEgB,IAAA,EAAA,MAAA;;;;AC5Q3B;AA+EA;AAgEA;EAuBa,EAAA,CAAA,EAAA,MAAA;EAiCA;;;;AC5M+B;AA4B3B;AA4BG;EA+Df,IAAA,CAAA,EJ5EM,MI4EN,CAAA,MAAA,EAAA,OAAsB,CAAA;AAAA,CAAA;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;AAAwB,KJ/RZ,IAAA,GI+RY;EAAc;;;;;;ECjV1B;AAsCZ;;;;AC7BA;EAyGY,OAAA,EAAA,MAAU;;;;AC/FtB;AAuBA;AA2CA;;;;;;AAoJmB,KP5JP,SAAA,GO4JO,MAAA,GAAA,CAAA,MAAA,EAAA,MAAA,CAAA;;;APnOP,KCKA,wBAAA,GDwBK;EAeL,IAAA,CAAA,EAAI,MAAA;EA2BJ,WAAA,CAAS,EAAA,MAAA;;;;EClET,wBAAA,CAAA,EAAwB,OAAA;EAcxB,yBAAA,CAAuB,EAAA,OAAA;EAEvB,MAAA,CAAA,EAAA,aAAA,GAAsB,OAAA;EA4PrB,UAAA,CAAA,EAAA,CAAA,IAAA,EAAA,MAmBZ,EAAA,MAAA,EAAA,MAAA,EAAA,GAAA,OAAA;EAlBU,cAAA,CAAA,EAnQU,MAmQV,EAAA;EACE,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;CACV;AAAsB,KAjQb,uBAAA,GAiQa;;;;ACzQb,KDUA,sBAAA,GCVwB;EAaxB,OAAA,EAAA,MAAA;EAOA,KAAA,EAAA,MAAA;EAiOC,QAAA,EDxOC,uBC6Pb,EAAA;CApBU;;;;cDgBE,iCACF,kBACE,6BACV;;;AAHU,KCtQD,wBAAA,GDyRX;EAlBU,WAAA,CAAA,EAAA,MAAA;EACE,WAAA,CAAA,EAAA,MAAA;EACV,QAAA,CAAA,EAAA,MAAA;EAAsB,IAAA,CAAA,EAAA,MAAA;;;;ECzQb,WAAA,CAAA,EAAA,MAAA;EAaA,YAAA,CAAA,EAAA,MAAA;EAOA,iBAAA,CAAA,EAAA,MAAA;AAiOZ,CAAA;AACW,KAzOC,wBAAA,GAyOD;EACG,IAAA,EAAA,MAAA;EACX,OAAA,EAAA,MAAA;EAAwB,MAAA,EAAA,MAAA;;;KApOf,wBAAA;ECxCA,OAAA,EAAA,MAAA;EA+EC,KAAA,EAAA,MAAA;EAgEA,QAAA,EDpGC,wBCmHb,EAAA;AAQD,CAAA;;;AC3K4C;AA4B3B;AA4BG;AA8BM;AAyDrB,cF+HQ,yBE/HW,EAAA,CAAA,KAAA,EFgIb,IEhIa,E
AAA,EAAA,OAAA,CAAA,EFiIV,wBEjIU,EAAA,GFkIrB,wBElIqB,EAAA;;;;AJjIxB;AA4CA;AA2BY,KGhFA,eAAA,GHgFS;;;;EClET,KAAA,EAAA,MAAA;EAcA;EAEA,KAAA,EAAA,MAAA;EA4PC;EACF,QAAA,EAAA,MAAA;CACE;;;;;;ACxQb;AAaA;AAOA;AAiOA;;;;;;;cC1La,uCAAmC;AA/EhD;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAuCtB,cDVQ,wBCUG,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,QAAA,EDViD,eCUjD,EAAA,EAAA,GAAA,MAAA;;;;;;;AAsBH,cDTA,oBCSuG,EAAA,CAAA,QAAA,EDRtG,eCQsG,EAAA,EAAA,GAAA;EAOxG,WAAA,EAAA,gBAAyB,GAAA,iBAAiB;EAUjD,KAAA,EAAA,OAAA;EAyCA,QAAA,CAAA,EAAA,MAAA;AA2GL,CAAA;;;;;;;cD7Ia;EEpMD,QAAA,EAAA,MAAA;EAsCC,WAAA,EAAA,gBAAwB,GAAA,iBAAA;;;YFqKvB;AGlMd,CAAA,GAAY,IAAA;;;ANHZ;AA4CA;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;;;;;ACtQA;AAaA;AAOA;AAiOA;;;;;;KElPK,YAAA;;EDvBO,KAAA,EAAA,MAAA;AA+EZ,CAAA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;KAjSK,eAAA,GAiSmB;EAAc;EAAgB,QAAA,EAAA,MAAA;CAAe;;;;ACjVrE;AAsCA;;;;AC7BA;AAyGA;;;;AC/FA;AAuBA;AA2CA;;;;;;;;;;AC3DA,KJoDK,qBAAA,GIlCJ;EAlBwC;EAAgB,cAAA,EAAA,MAAA,EAAA;CAAa;;;;;AC1BtE;AAKA;;;;;;AAOA;AA2BE;AAmnBF;;;;;;;;;AAsDA;;;;;;KLjmBK,sBAAA;;;AM0CL,CAAA;;;;ACpJA;AAKA;AAcA;;;;;;AA2FA;AAsCA;;;;ACiJA;;;KRnKK,mBAAA,GQmKmE;EAAA;;;;AC3MxE;AAsCA;AAGA;AAiBA;AAgDA;AA2CA;AAWA;AAkJA;AA8CA;AAuBA,KThUK,WAAA,GACC,YSsUL,GTrUK,eS8T0C,GT7T1C,qBS6T0C,GT5T1C,sBS4T0C,GT3T1C,mBS2T0C;AAqBhD;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;ACzjBA;AAqDA;AAcA;;;;cV6Ea;;;;;;KAOD,cAAA,WAAyB;;;;;;;KAUhC,aAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAyCA,eAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAyCS;;;;;;;;;;;;SAaH;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqDC,SAAA,GAAY,cAAc,gBAAgB;;;AJ3UtD;AA4CA;AA2BA;KK7EY,cAAA;;SAED;EJSC;EAcA,WAAA,EAAA,MAAA;AAEZ,CAAA;AA4Pa,cIjPA,aJoQZ,EAAA,CAAA,KAAA,EIpQoC,SJoQpC,EAAA,EAAA,GAAA;EAlBU,WAAA,EAAA,MAAA;EACE,KAAA,EInPwB,SJmPxB,EAAA;CACV;;;ADpRH;AA4CA;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;;;KK9QY,cAAA;;AJQZ;AAaA;AAOA;AAiOA;;;;;;;;ACzQA;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;EA0DZ,KAAA,CAAA,EAAA,MAAA;EAiCA;AAAsB;AAwBH;;;;;;;AAqCxB;EAOY,KAAA,CA
AA,EAAA,IAAA,GAAA,OAAc;EAUrB;AAAa;AAoJlB;;EAAsC,GAAA,CAAA,EAAA,MAAA;EAAgB;;;;;ECjV1C;AAsCZ;;;;AC7BA;AAyGA;;;;AC/FA;AAuBA;AA2CA;;;;;;EAoJmB,OAAA,CAAA,ED5JL,SC4JK,EAAA;;;;AC/MnB;;;;;;;;AC1BA;AAKA;;;;;;EAOY,QAAA,CAAA,EAAA,MAAA;AA2BV,CAAA;AAmnBF;;;;;;;;;AAsDA;;;;;AAGoD,KHjmBxC,UAAA,GGimBwC,MAAA,GHjmBlB,cGimBkB;;;ATjqBpD;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;;;;;ACtQA;AAaA;AAOA;AAiOA;;;;;UKnPiB,MAAA;;;EJtBL;EA+EC,KAAA,CAAA,EAAA,CAAA,OAAA,EAAA,MAiDZ,EAAA,GAAA,IAjD+C,EAAA,OAAA,EAAA,EAAA,GAAA,IAAA;EAgEnC;EAuBA,IAAA,CAAA,EAAA,CAAA,OAAA,EAAA,MAyBZ,EAAA,GAAA,IAAA,EAxBa,OAAA,EAAA,EAAA,GAAA,IAAe;EAgChB;;;;AC5M+B;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;AAoBC,KGjHb,WAAA,GHiHa;EAiBZ;EAOD,KAAA,EAAA,MAAA;EAUP;EAyCA,WAAA,EAAA,MAAe;EA2GR;EAAY,KAAA,CAAA,EAAA,MAAA;EAAc;EAAgB,OAAA,CAAA,EAAA,MAAA,EAAA;CAAe;;;;ACjVrE;AAsCA;;;;AC7BA;AAyGA;;;;AC/FA;AAuBA;AA2CA;;;;;;;;;;AC3DA;;;;;;;KD2DY,mBAAA;EErFA;AAKZ;;;;EAIc,OAAA,CAAA,EFkFA,WElFA,EAAA;EAAsB;AAGpC;AA2BE;AAmnBF;;;;EAIc,KAAA,CAAA,EF1jBF,SE0jBE,EAAA;EAGa;;;;AA+C3B;;;;;EAGoD,KAAA,CAAA,EAAA,OAAA,GAAA;;;;IC1jBxC,OAAA,CAAA,EHpCY,KGoCI,CAAA,MAAA,GAAA,YAAA,CAAA;;;;ACpJ5B;AAKA;AAcA;;;;;;AA2FA;AAsCA;;;;ECiJa;;;;;;;;AC3Mb;AAsCA;AAGA;EAiBa,gBAAA,CAAA,EAAA,MAAA;EAgDA;AA2Cb;AAWA;AAkJA;AA8CA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;ACzjBA;AAqDA;AAcA;;;;;;;;;;;gBPkFkB;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;WAwDL;;;;APvLD,cQxBC,iBRwBG,EAAA,CAAA,KAAA,EQxByB,IRwBzB,EAAA,EAAA,KAAA,CAAA,EQxByC,WRwBzC,EAAA,EAAA,GQxBsD,IRwBtD,EAAA;;;AA2BJ,KS7EA,sBAAA,GT6ES;;;;EClET,IAAA,EAAA,0BAAwB;EAcxB,KAAA,CAAA,EAAA,OAAA,GAAA,YAAuB;EAEvB,QAAA,EAAA,MAAA,EAAA;AA4PZ,CAAA,GAAa;EACF,IAAA,EAAA,WAAA;EACE,SAAA,EAAA,CAAA,IAAA,EQtRgC,SRsRhC,EAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;CACV;AAAsB,KQrRb,iBAAA,GRqRa;WQpRZ;SACF;YACG;EPSF,QAAA,EORE,sBPQsB;AAapC,CAAA;AAOY,KOzBA,oBAAA,GPyBwB;EAiOvB,OAAA,EAAA;IACF,IAAA,EAAA,YAAA,GAAA,wBAAA;IACG,SAAA,EAAA,MAAA;IACX,aAAA,EAAA,MAAA;IAAwB,SAAA,EAAA,MAAA;;;UOrPf;INvBA,SAAA,EAAA,MAAe;IA+Ed,QAAA,EAAA,MAAA;IAgEA,aAAA,EAAA,MAAA;IAuBA,UAAA,EAAA,MAAA;EAiCA,C
AAA,CAAA;WM1KA;;;ILNR,sBAAY,CAAA,EAAA,MAAA;IA4BZ,qBAAe,CAAA,EAAA,MAAA;IA8Bf,YAAA,EAAA,MAAqB;IAiCrB,MAAA,EAAA,WAAA,GAAsB,oBAAA,GAAA,WAAA,GAAA,sBAAA,GAAA,qBAAA;IAwBtB,QAAA,EAAA,OAAA,GAAmB,QAAA,GAAA,MAAA;IAenB,EAAA,CAAA,EAAA,MAAW;IACV,KAAA,CAAA,EAAA,MAAA,EAAA;EACA,CAAA,CAAA;EACA,MAAA,EAAA,MAAA,EAAA;EACA,QAAA,EAAA,MAAA,EAAA;CACA;KKlHD,oBAAA,GLkHoB,MAAA,GAAA,YAAA,GAAA,qBAAA;AAiBZ,iBK8eG,qCAAA,CL9eoG,KAAA,EK+ezG,IL/eyG,EAAA,EAAA,QAAA,EKgftG,OLhfsG,EAAA,EAAA,OAAA,EKifvG,mBLjfuG,EAAA,QAAA,EKkftG,sBLlfsG,EAAA,IAiBlG,CAjBkG,EAAA;EAOxG,IAAA,CAAA,EAAA,YAAc,GAAA,wBAAW;EAUhC,gBAAa,CAAA,EKoeS,oBLpeT;AAAA,CAAA,CAAA,EAyCb;EA2GO,MAAA,EKkVC,oBLlVQ;EAAG,QAAA,EKkVqB,OLlVrB,EAAA;CAAc;AAAgB,iBK+XtC,6BAAA,CL/XsC,IAAA,EKgY5C,iBLhY4C,EAAA,EAAA,KAAA,EAAA;EAAe,IAAA,CAAA,EAAA,YAAA,GAAA,wBAAA;qBKiYW;;UACnE;EJntBD,QAAA,EImtBiC,OJntBnB,EAAA;AAsC1B,CAAA;;;;AIwnBc,KCrgBF,gBAAA,GDqgBE,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;ARnpBd;AAcA;AAEA;AA4Pa,KUlRD,mBAAA,GVqSX,gBAAA,GAAA,eAAA,GAAA,WAAA,GAAA,eAAA;;;;AAhBwB,KUhRb,eAAA,GVgRa;QU/Qf;;;ETME;EAaA,KAAA,CAAA,EAAA,MAAA;EAOA;EAiOC,OAAA,CAAA,EAAA,MAAA;CACF;;;;;KS/OC,oBAAA;oBACU;ER5BV,eAAA,CAAA,EAAe,CQ6BJ,eR7BI,GAAA,SAAA,CAAA,EAAA;EA+Ed,YAAA,CAAA,EAAA,CQjDO,eRkGnB,GAjD+C,SAAA,CAAA,EAAA;EAgEnC,QAAA,CAAA,EQhHE,eRgHF;AAuBb,CAAA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAmD3D,cOnHQ,aP4JC,EAAA,CAAA,KAAA,EO5JuB,SPyKpB,EAAA,EAAA,GAAA,COzK+B,oBPyK/B,GAAA,SAAA,CAAA,EAAA;AAqDjB;;;;;;;;ACjVA;AAsCA;;;;AC7BY,cKgJC,sBL5EU,EAAA,CAAA,OAAA,EAAA,CK4E0B,oBL5E1B,GAAA,SAAA,CAAA,EAAA,EAAA,GAAA,MAAA,EAAA;;;;;AJ5DvB;AAaA;AAOA;AAiOA;;;;;;;;ACzQA;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;;;;;cQvCa,sBAAuB,iBAAiB,wBAAmB;;;;AZpS5D,cayFC,Kb5DF,EAAA;EAeC;EA2BA,SAAA,GAAS,EAAA,SAAA;;;;EClET,SAAA,MAAA,EAAA,YAAwB;EAcxB;EAEA,SAAA,IAAA,EAAA,UAAsB;EA4PrB;EACF,SAAA,IAAA,EAAA,UAAA;EACE;EACV,SAAA,IAAA,EAAA,UAAA;EAAsB;;;;ECzQb;EAaA,SAAA,IAAA,EAAA,UAAA;EAOA;EAiOC,SAAA,GAAA,EAAA,SAAA;EACF;EACG,SAAA,QAAA,EAAA,cAAA;EAC
X;EAAwB,SAAA,IAAA,EAAA,UAAA;;;;EC5Qf,SAAA,KAAA,EAAA,WAAe;EA+Ed;EAgEA,SAAA,KAAA,EAAA,WAeZ;EAQY;EAiCA,SAAA,MAAA,EAAA,YAmBZ;;;;AC/N2C;AAwDvC,KSqFO,QAAA,GTrFQ,MAAA,OSqFgB,KTrFhB;AAAA;AA+Df,cSyBQ,WTzBc,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAAA;AAuCtB,cSGQ,+BTHG,EAAA,CAAA,QAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;;;;AAsBhB;AAOA;AAAgE;AAU9C;AAoJlB;;;;;;;;ACjVA;AAsCA;;;;AC7BA;AAyGA;cOuFa,gBAAgB;;;ANtL7B;AAuBA;AA2CA;;;;;;;;;;AC3DA;AAAyC,cK0N5B,cL1N4B,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;;;;;;;AC1B7B,KI+PA,YAAA,GJ/PA;EAKA;;;;;EAIwB,OAAA,EAAA,MAAA;EAGxB;AA2BV;AAmnBF;;;EAGa,YAAA,EAAA,MAAA,EAAA;EACC;;;;;EAkDE,WAAA,EAAA,OAAA;CACN;;;;;;;;ACxjBV;;;;ACpJA;AAKA;AAcA;;;;;;AA2FA;AAsCA;;;;ACiJA;;;;;;;;AC3MA;AAsCA;AAGA;AAiBA;AAgDa,cAwMA,wBAxMsB,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,cAAA,CAAA,EAAA,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA,EAAA,aAAA,CAAA,EAAA,MAAA,EAAA,GAAA;EA2CtB,YAAA,EAAA,MAGZ,EAAA;EAQW,WAAA,EAAA,OAAY;EAkJX,OAAA,EAAA,MAAA;AA8Cb,CAAA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;ACzjBA;AAqDA;AAcA;;;;;;;cDgWa;;;;;;;;;;;;;;;;;;;;;;cAuBA,uCAAmC;;;;;;;;;;;;;cAqBnC;;;;;;;;;;;;;;;cAgBA;;;;;;;;;;;;;;;cA8BA;;;;KAWD,YAAA;;;;;;;;;;;;;;;;;;;;;;cAoBC,iDAAkD;;;;;;;;;;;;;;;;cA6BlD;;;AZ9Tb;;;;;;;;ACtQA;AAaA;AAOA;AAiOA;;;;;;;;ACzQA;AA+EA;AAgEA;AAuBA;AAiCa,cWxKA,sBX+KC,EAAA,CAAA,OAAe,EAAA,MAAA,EAAA,GAAA,MAAA;;;;ACnNe;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;AAoBlB,cU1EO,WV0EP,EAAA,CAAA,CAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAAmB,cU5DZ,wBV4DY,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA"}
1
+ {"version":3,"file":"index.d.mts","names":[],"sources":["../src/types/index.ts","../src/analysis/line-starts.ts","../src/analysis/repeating-sequences.ts","../src/detection.ts","../src/types/rules.ts","../src/optimization/optimize-rules.ts","../src/types/breakpoints.ts","../src/types/options.ts","../src/recovery.ts","../src/segmentation/breakpoint-utils.ts","../src/segmentation/pattern-validator.ts","../src/segmentation/segmenter.ts","../src/segmentation/tokens.ts","../src/utils/textUtils.ts"],"sourcesContent":[],"mappings":";;AAcA;AA4CA;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;AAGyB,KDpRb,OAAA,GCoRa;;;;ACzQzB;AAaA;AAOA;EAiOa,OAAA,EAAA,MAAA;EACF;;;EAEgB,IAAA,EAAA,MAAA;;;;AC5Q3B;AA+EA;AAgEA;EAuBa,EAAA,CAAA,EAAA,MAAA;EAiCA;;;;AC5M+B;AA4B3B;AA4BG;EA+Df,IAAA,CAAA,EJ5EM,MI4EN,CAAA,MAAA,EAAA,OAAsB,CAAA;AAAA,CAAA;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;AAAwB,KJ/RZ,IAAA,GI+RY;EAAc;;;;;;ECjV1B;AAsCZ;;;;AC7BA;EAyGY,OAAA,EAAA,MAAU;;;;AC/FtB;AA6CA;;;;;;;KPaY,SAAA;;;AAvEA,KCKA,wBAAA,GDwBK;EAeL,IAAA,CAAA,EAAI,MAAA;EA2BJ,WAAA,CAAS,EAAA,MAAA;;;;EClET,wBAAA,CAAA,EAAwB,OAAA;EAcxB,yBAAA,CAAuB,EAAA,OAAA;EAEvB,MAAA,CAAA,EAAA,aAAA,GAAsB,OAAA;EA4PrB,UAAA,CAAA,EAAA,CAAA,IAAA,EAAA,MAmBZ,EAAA,MAAA,EAAA,MAAA,EAAA,GAAA,OAAA;EAlBU,cAAA,CAAA,EAnQU,MAmQV,EAAA;EACE,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;CACV;AAAsB,KAjQb,uBAAA,GAiQa;;;;ACzQb,KDUA,sBAAA,GCVwB;EAaxB,OAAA,EAAA,MAAA;EAOA,KAAA,EAAA,MAAA;EAiOC,QAAA,EDxOC,uBC6Pb,EAAA;CApBU;;;;cDgBE,iCACF,kBACE,6BACV;;;AAHU,KCtQD,wBAAA,GDyRX;EAlBU,WAAA,CAAA,EAAA,MAAA;EACE,WAAA,CAAA,EAAA,MAAA;EACV,QAAA,CAAA,EAAA,MAAA;EAAsB,IAAA,CAAA,EAAA,MAAA;;;;ECzQb,WAAA,CAAA,EAAA,MAAA;EAaA,YAAA,CAAA,EAAA,MAAA;EAOA,iBAAA,CAAA,EAAA,MAAA;AAiOZ,CAAA;AACW,KAzOC,wBAAA,GAyOD;EACG,IAAA,EAAA,MAAA;EACX,OAAA,EAAA,MAAA;EAAwB,MAAA,EAAA,MAAA;;;KApOf,wBAAA;ECxCA,OAAA,EAAA,MAAA;EA+EC,KAAA,EAAA,MAAA;EAgEA,QAAA,EDpGC,wBCmHb,EAAA;AAQD,CAAA;;;AC3K4C;AA4B3B;AA4BG;AA8BM;AAyDrB,cF+HQ,yBE/HW,EAAA,CAAA,KAAA,EFgIb,IEhIa,EAAA,EAAA,OAAA,CAAA,EFiIV,wBEjIU,EAAA,GFkIrB,wBElIqB,EAAA;;;;AJjIxB;AA4CA;AA2BY,KGhFA,eAA
A,GHgFS;;;;EClET,KAAA,EAAA,MAAA;EAcA;EAEA,KAAA,EAAA,MAAA;EA4PC;EACF,QAAA,EAAA,MAAA;CACE;;;;;;ACxQb;AAaA;AAOA;AAiOA;;;;;;;cC1La,uCAAmC;AA/EhD;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAuCtB,cDVQ,wBCUG,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,QAAA,EDViD,eCUjD,EAAA,EAAA,GAAA,MAAA;;;;;;;AAsBH,cDTA,oBCSuG,EAAA,CAAA,QAAA,EDRtG,eCQsG,EAAA,EAAA,GAAA;EAOxG,WAAA,EAAA,gBAAyB,GAAA,iBAAiB;EAUjD,KAAA,EAAA,OAAA;EAyCA,QAAA,CAAA,EAAA,MAAA;AA2GL,CAAA;;;;;;;cD7Ia;EEpMD,QAAA,EAAA,MAAA;EAsCC,WAAA,EAAA,gBAAwB,GAAA,iBAAA;;;YFqKvB;AGlMd,CAAA,GAAY,IAAA;;;ANHZ;AA4CA;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;;;;;ACtQA;AAaA;AAOA;AAiOA;;;;;;KElPK,YAAA;;EDvBO,KAAA,EAAA,MAAA;AA+EZ,CAAA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;KAjSK,eAAA,GAiSmB;EAAc;EAAgB,QAAA,EAAA,MAAA;CAAe;;;;ACjVrE;AAsCA;;;;AC7BA;AAyGA;;;;AC/FA;AA6CA;;;;;;;;;ACjEA;AAKA;KJ0EK,qBAAA,GIzEQ;EACF;EACG,cAAA,EAAA,MAAA,EAAA;CACA;;AAGd;AA2BE;AAknBF;;;;;;;;;AAsDA;;;;;;;;;ACrjBA;;;;ACpJA;AAKA;AAcA;KNuFK,sBAAA,GMtFiB;EACC;EACH,eAAA,EAAA,MAAA,EAAA;CACL;;AAuFf;AAsCA;;;;ACgJA;;;;;;;;AC1MA;AAsCA;AAGA;AAiBA;AAgDA;AA2CA,KR7GK,mBAAA,GQgHJ;EAQW;EAkJC,YAAA,EAAA,MAAA,EAAA;AA8Cb,CAAA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;ACxjBA,KTyHK,WAAA,GACC,YSxHL,GTyHK,eSzHL,GT0HK,qBS1HL,GT2HK,sBS3HL,GT4HK,mBS5HL;AAoDD;AAeA;;;;;;;;;;;;;;cT0Ea;;;;;;KAOD,cAAA,WAAyB;;;;;;;KAUhC,aAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAyCA,eAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAyCS;;;;;;;;;;;;SAaH;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqDC,SAAA,GAAY,cAAc,gBAAgB;;;AJ3UtD;AA4CA;AA2BA;KK7EY,cAAA;;SAED;EJSC;EAcA,WAAA,EAAA,MAAA;AAEZ,CAAA;AA4Pa,cIjPA,aJoQZ,EAAA,CAAA,KAAA,EIpQoC,SJoQpC,EAAA,EAAA,GAAA;EAlBU,WAAA,EAAA,MAAA;EACE,KAAA,EInPwB,SJmPxB,EAAA;CACV;;;ADpRH;AA4CA;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;;;KK9QY,cAAA;;AJQZ;AAaA;AAOA;AAiOA;;;;;;;;ACzQA;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;EA0DZ,KAAA,CAAA,EAAA,MAAA;EAiCA;AAAsB;AAwBH;;;;;;;AAqCxB;EAOY,KAAA,CAAA,EAAA,IAAA,GAAA,OAAc;EAUrB;AAAa;AAoJlB;;EAAsC,GAAA,CAAA
,EAAA,MAAA;EAAgB;;;;;ECjV1C;AAsCZ;;;;AC7BA;AAyGA;;;;AC/FA;AA6CA;;;;;;;YDac;;AE9Ed;AAKA;;;;;;AAOA;AA2BE;AAknBF;;;;;;;;EASoD,QAAA,CAAA,EAAA,MAAA;AA6CpD,CAAA;;;;;;;;;ACrjBA;;;;ACpJA;AAKA;AAcY,KJ0FA,UAAA,GI1FA,MAAoB,GJ0FE,cI1FF;;;AV0BhC;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;;;;;ACtQA;AAaA;AAOA;AAiOA;;;;;UKnPiB,MAAA;;;EJtBL;EA+EC,KAAA,CAAA,EAAA,CAAA,OAAA,EAAA,MAiDZ,EAAA,GAAA,IAjD+C,EAAA,OAAA,EAAA,EAAA,GAAA,IAAA;EAgEnC;EAuBA,IAAA,CAAA,EAAA,CAAA,OAAA,EAAA,MAyBZ,EAAA,GAAA,IAAA,EAxBa,OAAA,EAAA,EAAA,GAAA,IAAe;EAgChB;;;;AC5M+B;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;;;;;;;;ACjVA;AAsCA;;;;AC7BA;AAyGA;;;KClDY,mBAAA;EA7CK;AA6CjB;;;;;;UAQY;;;ACzEZ;AAKA;;;;;;EAOY,KAAA,CAAA,EAAA,OAAA,GAAA;IA6BP;IAgnBW,OAAA,CAAA,EAAA,MAAA;IACL;IACG,OAAA,CAAA,EDjkBU,KCikBV,CAAA,MAAA,GAAA,YAAA,CAAA;EACD,CAAA;EACC;;;;;AAkDd;;;;;;;;;ECrjBY,QAAA,CAAA,EAAA,MAAA;;;;ACpJZ;AAKA;AAcA;;;;;;EA2Fa,gBAwBP,CAAA,EAAA,MAAA;EAcO;;;;ACgJb;;;;;;;;AC1MA;AAsCA;AAGA;AAiBA;AAgDA;AA2CA;AAWA;AAkJA;AA8CA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;gBLhckB;;;AMxHlB;AAsDA;AAeA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;WN2Ga;;;;APhID,KQ9EA,sBAAA,GR8ES;;;;EClET,IAAA,EAAA,0BAAwB;EAcxB,KAAA,CAAA,EAAA,OAAA,GAAA,YAAuB;EAEvB,QAAA,EAAA,MAAA,EAAA;AA4PZ,CAAA,GAAa;EACF,IAAA,EAAA,WAAA;EACE,SAAA,EAAA,CAAA,IAAA,EOvRgC,SPuRhC,EAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;CACV;AAAsB,KOtRb,iBAAA,GPsRa;WOrRZ;SACF;YACG;ENUF,QAAA,EMTE,sBNSsB;AAapC,CAAA;AAOY,KM1BA,oBAAA,GN0BwB;EAiOvB,OAAA,EAAA;IACF,IAAA,EAAA,YAAA,GAAA,wBAAA;IACG,SAAA,EAAA,MAAA;IACX,aAAA,EAAA,MAAA;IAAwB,SAAA,EAAA,MAAA;;;UMtPf;ILtBA,SAAA,EAAA,MAAe;IA+Ed,QAAA,EAAA,MAAA;IAgEA,aAAA,EAAA,MAAA;IAuBA,UAAA,EAAA,MAAA;EAiCA,CAAA,CAAA;WK3KA;;;IJLR,sBAAY,CAAA,EAAA,MAAA;IA4BZ,qBAAe,CAAA,EAAA,MAAA;IA8Bf,YAAA,EAAA,MAAqB;IAiCrB,MAAA,EAAA,WAAA,GAAsB,oBAAA,GAAA,WAAA,GAAA,sBAAA,GAAA,qBAAA;IAwBtB,QAAA,EAAA,OAAA,GAAmB,QAAA,GAAA,MAAA;IAenB,EAAA,CAAA,EAAA,MAAW;IACV,KAAA,CAAA,EAAA,MAAA,EAAA;EACA,CAAA,CAAA;EACA,MAAA,EAAA,MAAA,EAAA;EACA,QAAA,EAAA,MAAA,EAAA;CACA;KInHD,oBAAA,GJmHoB,MAAA,GAAA,YAAA
,GAAA,qBAAA;AAiBZ,iBI4eG,qCAAA,CJ5eoG,KAAA,EI6ezG,IJ7eyG,EAAA,EAAA,QAAA,EI8etG,OJ9esG,EAAA,EAAA,OAAA,EI+evG,mBJ/euG,EAAA,QAAA,EIgftG,sBJhfsG,EAAA,IAiBlG,CAjBkG,EAAA;EAOxG,IAAA,CAAA,EAAA,YAAc,GAAA,wBAAW;EAUhC,gBAAa,CAAA,EIkeS,oBJleT;AAAA,CAAA,CAAA,EAyCb;EA2GO,MAAA,EIgVC,oBJhVQ;EAAG,QAAA,EIgVqB,OJhVrB,EAAA;CAAc;AAAgB,iBI6XtC,6BAAA,CJ7XsC,IAAA,EI8X5C,iBJ9X4C,EAAA,EAAA,KAAA,EAAA;EAAe,IAAA,CAAA,EAAA,YAAA,GAAA,wBAAA;qBI+XW;;UACnE;EHjtBD,QAAA,EGitBiC,OHjtBnB,EAAA;AAsC1B,CAAA;;;;AG2qBoD,KCxjBxC,gBAAA,GDwjBwC,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;APtsBpD;AAcA;AAEA;AA4Pa,KSlRD,mBAAA,GTqSX,gBAAA,GAAA,eAAA,GAAA,WAAA,GAAA,eAAA;;;;AAhBwB,KShRb,eAAA,GTgRa;QS/Qf;;;ERME;EAaA,KAAA,CAAA,EAAA,MAAA;EAOA;EAiOC,OAAA,CAAA,EAAA,MAAA;CACF;;;;;KQ/OC,oBAAA;oBACU;EP5BV,eAAA,CAAA,EAAe,CO6BJ,eP7BI,GAAA,SAAA,CAAA,EAAA;EA+Ed,YAAA,CAAA,EAAA,COjDO,ePkGnB,GAjD+C,SAAA,CAAA,EAAA;EAgEnC,QAAA,CAAA,EOhHE,ePgHF;AAuBb,CAAA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAmD3D,cMnHQ,aN4JC,EAAA,CAAA,KAAA,EM5JuB,SNyKpB,EAAA,EAAA,GAAA,CMzK+B,oBNyK/B,GAAA,SAAA,CAAA,EAAA;AAqDjB;;;;;;;;ACjVA;AAsCA;;;;AC7BY,cIgJC,sBJ5EU,EAAA,CAAA,OAAA,EAAA,CI4E0B,oBJ5E1B,GAAA,SAAA,CAAA,EAAA,EAAA,GAAA,MAAA,EAAA;;;;;AJ5DvB;AAaA;AAOA;AAiOA;;;;;;;;ACzQA;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;;;;;cOxCa,sBAAuB,iBAAiB,wBAAmB;;;;AXnS5D,cYyFC,KZ5DF,EAAA;EAeC;EA2BA,SAAA,GAAS,EAAA,SAAA;;;;EClET,SAAA,MAAA,EAAA,YAAwB;EAcxB;EAEA,SAAA,IAAA,EAAA,UAAsB;EA4PrB;EACF,SAAA,IAAA,EAAA,UAAA;EACE;EACV,SAAA,IAAA,EAAA,UAAA;EAAsB;;;;ECzQb;EAaA,SAAA,IAAA,EAAA,UAAA;EAOA;EAiOC,SAAA,GAAA,EAAA,SAAA;EACF;EACG,SAAA,QAAA,EAAA,cAAA;EACX;EAAwB,SAAA,IAAA,EAAA,UAAA;;;;EC5Qf,SAAA,KAAA,EAAA,WAAe;EA+Ed;EAgEA,SAAA,KAAA,EAAA,WAeZ;EAQY;EAiCA,SAAA,MAAA,EAAA,YAmBZ;;;;AC/N2C;AAwDvC,KQqFO,QAAA,GRrFQ,MAAA,OQqFgB,KRrFhB;AAAA;AA+Df,cQyBQ,WRzBc,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAAA;AAuCtB,cQGQ,+BRHG,EAAA,CAAA,QAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;;;;AAsBhB;AAOA;AAAgE;AAU9C;AAoJlB
;;;;;;;;ACjVA;AAsCA;;;;AC7BA;AAyGA;cMuFa,gBAAgB;;;ALtL7B;AA6CA;;;;;;;;;ACjEA;AAKA;;AAEW,cI8OE,cJ9OF,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;;;;AAKX;AA2BE;AAknBF;AACW,KI1ZC,YAAA,GJ0ZD;EACG;;;;;EAO+B,OAAA,EAAA,MAAA;EAAO;AA6CpD;;;;EAG6C,YAAA,EAAA,MAAA,EAAA;EAAO;;;;ACxjBpD;;;;ACpJA;AAKA;AAcA;;;;;;AA2FA;AAsCA;;;;ACgJA;;;;;;;;AC1MA;AAsCA;AAGA;AAiBA;AAgDA;AA2CA;AAWA;AAkJA;AA8CA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBa,cAvKA,wBAuKkD,EAAA,CAAA,KAAY,EAAA,MAAA,EAAA,cAAA,CAAA,EAAA,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA,EAAA,aAAA,CAAA,EAAA,MAAA,EAAA,GAAA;EA6B9D,YAAA,EAAA,MAAA,EAGZ;;;;AC3jBD;AAsDA;AAeA;;;;;;;;;;;;;;;;;cD6Va;;;;;;;;;;;;;;;;;;;;;;cAuBA,uCAAmC;;;;;;;;;;;;;cAqBnC;;;;;;;;;;;;;;;cAgBA;;;;;;;;;;;;;;;cA8BA;;;;KAWD,YAAA;;;;;;;;;;;;;;;;;;;;;;cAoBC,iDAAkD;;;;;;;;;;;;;;;;cA6BlD;;;AX9Tb;;;;;;;;ACtQA;AAaA;AAOA;AAiOA;;;;;;;;ACzQA;AA+EA;AAgEA;AAuBA;AAiCa,cUvKA,sBV8KC,EAAA,CAAA,OAAe,EAAA,MAAA,EAAA,GAAA,MAAA;;;;ACnNe;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;AAoBlB,cSxEO,WTwEP,EAAA,CAAA,CAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAAmB,cSzDZ,wBTyDY,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA"}
package/dist/index.mjs CHANGED
@@ -34,7 +34,9 @@ const normalizeLineEndings = (content) => {
34
34
  * escapeTemplateBrackets('{{harf}}')
35
35
  * // → '{{harf}}' (unchanged - no brackets outside tokens)
36
36
  */
37
- const escapeTemplateBrackets = (pattern) => pattern.replace(/(\{\{[^}]*\}\})|([()[\]])/g, (_match, token, bracket) => token || `\\${bracket}`);
37
+ const escapeTemplateBrackets = (pattern) => {
38
+ return pattern.replace(/(\{\{[^}]*\}\})|([()[\]])/g, (_match, token, bracket) => token || `\\${bracket}`);
39
+ };
38
40
  /**
39
41
  * Character class matching all Arabic diacritics (Tashkeel/Harakat).
40
42
  *
@@ -93,7 +95,9 @@ const getEquivClass = (ch) => {
93
95
  const group = EQUIV_GROUPS.find((g) => g.includes(ch));
94
96
  return group ? `[${group.map(escapeRegex).join("")}]` : escapeRegex(ch);
95
97
  };
96
- const normalizeArabicLight = (str) => str.normalize("NFC").replace(/[\u200C\u200D]/g, "").replace(/\s+/g, " ").trim();
98
+ const normalizeArabicLight = (str) => {
99
+ return str.normalize("NFC").replace(/[\u200C\u200D]/g, "").replace(/\s+/g, " ").trim();
100
+ };
97
101
  const makeDiacriticInsensitive = (text) => {
98
102
  const diacriticsMatcher = `${DIACRITICS_CLASS}*`;
99
103
  return Array.from(normalizeArabicLight(text)).map((ch) => getEquivClass(ch) + diacriticsMatcher).join("");
@@ -1224,53 +1228,6 @@ const optimizeRules = (rules) => {
1224
1228
  };
1225
1229
  };
1226
1230
 
1227
- //#endregion
1228
- //#region src/preprocessing/replace.ts
1229
- const DEFAULT_REPLACE_FLAGS = "gu";
1230
- const normalizeReplaceFlags = (flags) => {
1231
- if (!flags) return DEFAULT_REPLACE_FLAGS;
1232
- const allowed = new Set([
1233
- "g",
1234
- "i",
1235
- "m",
1236
- "s",
1237
- "u",
1238
- "y"
1239
- ]);
1240
- const set = new Set(flags.split("").filter((ch) => {
1241
- if (!allowed.has(ch)) throw new Error(`Invalid replace regex flag: "${ch}" (allowed: gimsyu)`);
1242
- return true;
1243
- }));
1244
- set.add("g");
1245
- set.add("u");
1246
- return [
1247
- "g",
1248
- "i",
1249
- "m",
1250
- "s",
1251
- "y",
1252
- "u"
1253
- ].filter((c) => set.has(c)).join("");
1254
- };
1255
- const compileReplaceRules = (rules) => rules.filter((r) => !(r.pageIds && r.pageIds.length === 0)).map((r) => ({
1256
- pageIdSet: r.pageIds ? new Set(r.pageIds) : void 0,
1257
- re: new RegExp(r.regex, normalizeReplaceFlags(r.flags)),
1258
- replacement: r.replacement
1259
- }));
1260
- const applyReplacements = (pages, rules) => {
1261
- if (!rules?.length || !pages.length) return pages;
1262
- const compiled = compileReplaceRules(rules);
1263
- if (!compiled.length) return pages;
1264
- return pages.map((p) => {
1265
- let content = p.content;
1266
- for (const rule of compiled) if (!rule.pageIdSet || rule.pageIdSet.has(p.id)) content = content.replace(rule.re, rule.replacement);
1267
- return content === p.content ? p : {
1268
- ...p,
1269
- content
1270
- };
1271
- });
1272
- };
1273
-
1274
1231
  //#endregion
1275
1232
  //#region src/segmentation/rule-regex.ts
1276
1233
  /**
@@ -1928,24 +1885,63 @@ const findPatternBreakPosition = (windowContent, regex, prefer, splitAt = false)
1928
1885
  * Handles page boundary breakpoint (empty pattern).
1929
1886
  * Returns break position or -1 if no valid position found.
1930
1887
  */
1888
+ const findStartOfNextPageInWindow = (remainingContent, currentFromIdx, toIdx, pageIds, normalizedPages, targetPos) => {
1889
+ const targetNextPageIdx = currentFromIdx + 1;
1890
+ for (let nextIdx = targetNextPageIdx; nextIdx > currentFromIdx; nextIdx--) if (nextIdx <= toIdx) {
1891
+ const nextPageData = normalizedPages.get(pageIds[nextIdx]);
1892
+ if (nextPageData) {
1893
+ const boundaryPos = findNextPagePosition(remainingContent, nextPageData);
1894
+ if (boundaryPos > 0 && boundaryPos <= targetPos) return boundaryPos;
1895
+ }
1896
+ }
1897
+ return -1;
1898
+ };
1931
1899
  const handlePageBoundaryBreak = (remainingContent, currentFromIdx, windowEndPosition, maxContentLength, toIdx, pageIds, normalizedPages) => {
1932
1900
  const targetPos = Math.min(windowEndPosition, remainingContent.length);
1933
- if (!(maxContentLength !== void 0 && windowEndPosition === maxContentLength)) {
1934
- const targetNextPageIdx = currentFromIdx + 1;
1935
- for (let nextIdx = targetNextPageIdx; nextIdx > currentFromIdx; nextIdx--) if (nextIdx <= toIdx) {
1936
- const nextPageData = normalizedPages.get(pageIds[nextIdx]);
1937
- if (nextPageData) {
1938
- const boundaryPos = findNextPagePosition(remainingContent, nextPageData);
1939
- if (boundaryPos > 0 && boundaryPos <= targetPos) return boundaryPos;
1940
- }
1941
- }
1901
+ const isLengthBounded = maxContentLength !== void 0 && windowEndPosition === maxContentLength;
1902
+ if (!isLengthBounded) {
1903
+ const boundaryPos = findStartOfNextPageInWindow(remainingContent, currentFromIdx, toIdx, pageIds, normalizedPages, targetPos);
1904
+ if (boundaryPos > 0) return { pos: boundaryPos };
1942
1905
  }
1943
1906
  if (targetPos < remainingContent.length) {
1944
1907
  const safePos = findSafeBreakPosition(remainingContent, targetPos);
1945
- if (safePos !== -1) return safePos;
1946
- return adjustForUnicodeBoundary(remainingContent, targetPos);
1908
+ if (safePos !== -1) return {
1909
+ pos: safePos,
1910
+ splitReason: isLengthBounded ? "whitespace" : void 0
1911
+ };
1912
+ return {
1913
+ pos: adjustForUnicodeBoundary(remainingContent, targetPos),
1914
+ splitReason: isLengthBounded ? "unicode_boundary" : void 0
1915
+ };
1947
1916
  }
1948
- return targetPos;
1917
+ return { pos: targetPos };
1918
+ };
1919
+ const checkBreakpointMatch = (i, remainingContent, currentFromIdx, toIdx, windowEndIdx, windowEndPosition, ctx, maxContentLength) => {
1920
+ const { pageIds, normalizedPages, expandedBreakpoints, prefer } = ctx;
1921
+ const bpCtx = expandedBreakpoints[i];
1922
+ const { rule, regex, excludeSet, skipWhenRegex } = bpCtx;
1923
+ if (!isInBreakpointRange(pageIds[currentFromIdx], rule)) return null;
1924
+ if (hasExcludedPageInRange(excludeSet, pageIds, currentFromIdx, windowEndIdx)) return null;
1925
+ if (skipWhenRegex?.test(remainingContent)) return null;
1926
+ if (regex === null) {
1927
+ const result = handlePageBoundaryBreak(remainingContent, currentFromIdx, windowEndPosition, maxContentLength, toIdx, pageIds, normalizedPages);
1928
+ return {
1929
+ breakPos: result.pos,
1930
+ breakpointIndex: i,
1931
+ contentLengthSplit: result.splitReason && maxContentLength ? {
1932
+ maxContentLength,
1933
+ reason: result.splitReason
1934
+ } : void 0,
1935
+ rule
1936
+ };
1937
+ }
1938
+ const breakPos = findPatternBreakPosition(remainingContent.slice(0, Math.min(windowEndPosition, remainingContent.length)), regex, prefer, bpCtx.splitAt);
1939
+ if (breakPos > 0) return {
1940
+ breakPos,
1941
+ breakpointIndex: i,
1942
+ rule
1943
+ };
1944
+ return null;
1949
1945
  };
1950
1946
  /**
1951
1947
  * Tries to find a break position within the current window using breakpoint patterns.
@@ -1959,23 +1955,10 @@ const handlePageBoundaryBreak = (remainingContent, currentFromIdx, windowEndPosi
1959
1955
  * @returns Break position in the content, or -1 if no break found
1960
1956
  */
1961
1957
  const findBreakPosition = (remainingContent, currentFromIdx, toIdx, windowEndIdx, windowEndPosition, ctx, maxContentLength) => {
1962
- const { pageIds, normalizedPages, expandedBreakpoints, prefer } = ctx;
1958
+ const { expandedBreakpoints } = ctx;
1963
1959
  for (let i = 0; i < expandedBreakpoints.length; i++) {
1964
- const { rule, regex, excludeSet, skipWhenRegex } = expandedBreakpoints[i];
1965
- if (!isInBreakpointRange(pageIds[currentFromIdx], rule)) continue;
1966
- if (hasExcludedPageInRange(excludeSet, pageIds, currentFromIdx, windowEndIdx)) continue;
1967
- if (skipWhenRegex?.test(remainingContent)) continue;
1968
- if (regex === null) return {
1969
- breakPos: handlePageBoundaryBreak(remainingContent, currentFromIdx, windowEndPosition, maxContentLength, toIdx, pageIds, normalizedPages),
1970
- breakpointIndex: i,
1971
- rule
1972
- };
1973
- const breakPos = findPatternBreakPosition(remainingContent.slice(0, Math.min(windowEndPosition, remainingContent.length)), regex, prefer, expandedBreakpoints[i].splitAt);
1974
- if (breakPos > 0) return {
1975
- breakPos,
1976
- breakpointIndex: i,
1977
- rule
1978
- };
1960
+ const match = checkBreakpointMatch(i, remainingContent, currentFromIdx, toIdx, windowEndIdx, windowEndPosition, ctx, maxContentLength);
1961
+ if (match) return match;
1979
1962
  }
1980
1963
  return null;
1981
1964
  };
@@ -2124,7 +2107,8 @@ const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx
2124
2107
  if (patternMatch && patternMatch.breakPos > 0) return {
2125
2108
  breakOffset: patternMatch.breakPos,
2126
2109
  breakpointIndex: patternMatch.breakpointIndex,
2127
- breakpointRule: patternMatch.rule
2110
+ breakpointRule: patternMatch.rule,
2111
+ contentLengthSplit: patternMatch.contentLengthSplit
2128
2112
  };
2129
2113
  if (windowEndPosition < remainingContent.length) {
2130
2114
  const safeOffset = findSafeBreakPosition(remainingContent, windowEndPosition);
@@ -3128,8 +3112,7 @@ const segmentPages = (pages, options) => {
3128
3112
  prefer,
3129
3113
  ruleCount: rules.length
3130
3114
  });
3131
- const processedPages = options.replace ? applyReplacements(pages, options.replace) : pages;
3132
- const { content: matchContent, normalizedPages: normalizedContent, pageMap } = buildPageMap(processedPages);
3115
+ const { content: matchContent, normalizedPages: normalizedContent, pageMap } = buildPageMap(pages);
3133
3116
  logger?.debug?.("[segmenter] content built", {
3134
3117
  pageIds: pageMap.pageIds,
3135
3118
  totalContentLength: matchContent.length
@@ -3142,10 +3125,10 @@ const segmentPages = (pages, options) => {
3142
3125
  });
3143
3126
  let segments = buildSegments(unique, matchContent, pageMap, rules, pageJoiner);
3144
3127
  logger?.debug?.("[segmenter] structural segments built", { segmentCount: segments.length });
3145
- segments = ensureFallbackSegment(segments, processedPages, normalizedContent, pageJoiner);
3128
+ segments = ensureFallbackSegment(segments, pages, normalizedContent, pageJoiner);
3146
3129
  if (hasLimits) {
3147
3130
  logger?.debug?.("[segmenter] applying breakpoints to oversized segments");
3148
- const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, (p) => processPattern(p, false).pattern, logger, pageJoiner, debug?.includeBreakpoint ? debug.metaKey : void 0, maxContentLength, processBreakpointPattern);
3131
+ const result = applyBreakpoints(segments, pages, normalizedContent, maxPages, breakpoints, prefer, (p) => processPattern(p, false).pattern, logger, pageJoiner, debug?.includeBreakpoint ? debug.metaKey : void 0, maxContentLength, processBreakpointPattern);
3149
3132
  logger?.info?.("[segmenter] segmentation complete (with breakpoints)", { finalSegmentCount: result.length });
3150
3133
  return result;
3151
3134
  }
@@ -3497,13 +3480,12 @@ const runStage1IfEnabled = (pages, segments, options, selectedRuleIndices, mode)
3497
3480
  recoveredAtIndex,
3498
3481
  recoveredDetailAtIndex
3499
3482
  };
3500
- const processedPages = options.replace ? applyReplacements(pages, options.replace) : pages;
3501
- const pageIdToIndex = buildPageIdToIndex(processedPages);
3483
+ const pageIdToIndex = buildPageIdToIndex(pages);
3502
3484
  const pageJoiner = options.pageJoiner ?? "space";
3503
3485
  const compiledMistaken = compileMistakenRulesAsStartsWith(options, selectedRuleIndices);
3504
3486
  for (let i = 0; i < segments.length; i++) {
3505
3487
  const orig = segments[i];
3506
- const r = tryBestEffortRecoverOneSegment(orig, processedPages, pageIdToIndex, compiledMistaken, pageJoiner);
3488
+ const r = tryBestEffortRecoverOneSegment(orig, pages, pageIdToIndex, compiledMistaken, pageJoiner);
3507
3489
  if (r.kind !== "recovered") continue;
3508
3490
  const seg = {
3509
3491
  ...orig,
@@ -3858,5 +3840,5 @@ const formatValidationReport = (results) => results.flatMap((result, i) => {
3858
3840
  });
3859
3841
 
3860
3842
  //#endregion
3861
- export { PATTERN_TYPE_KEYS, TOKEN_PATTERNS, Token, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule, applyReplacements, applyTokenMappings, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandCompositeTokensInTemplate, expandTokens, expandTokensWithCaptures, formatValidationReport, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, optimizeRules, recoverMistakenLineStartsAfterMarkers, recoverMistakenMarkersForRuns, segmentPages, shouldDefaultToFuzzy, stripTokenMappings, suggestPatternConfig, templateToRegex, validateRules, withCapture };
3843
+ export { PATTERN_TYPE_KEYS, TOKEN_PATTERNS, Token, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule, applyTokenMappings, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandCompositeTokensInTemplate, expandTokens, expandTokensWithCaptures, formatValidationReport, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, optimizeRules, recoverMistakenLineStartsAfterMarkers, recoverMistakenMarkersForRuns, segmentPages, shouldDefaultToFuzzy, stripTokenMappings, suggestPatternConfig, templateToRegex, validateRules, withCapture };
3862
3844
  //# sourceMappingURL=index.mjs.map
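
Note on the hunks above: `applyReplacements` is gone from the export list, and `segmentPages` now passes `pages` straight to `buildPageMap` without consulting `options.replace`. Callers that relied on that preprocessing may need to apply it themselves before segmenting. The following is a minimal sketch modeled on the removed code, not the package's API. The name `applyPageReplacements` is hypothetical, and the page shape (`{ id, content }`) and rule shape (`{ regex, flags, replacement, pageIds }`) are assumptions read off the removed lines.

```javascript
// Hypothetical caller-side helper, modeled on the `applyReplacements` code removed in this diff.
// Assumed shapes: page = { id, content }, rule = { regex, flags?, replacement, pageIds? }.
const applyPageReplacements = (pages, rules) => {
  if (!rules?.length || !pages.length) return pages;

  const compiled = rules
    // A rule with an explicitly empty pageIds list applies to no page, so drop it.
    .filter((r) => !(r.pageIds && r.pageIds.length === 0))
    .map((r) => ({
      pageIdSet: r.pageIds ? new Set(r.pageIds) : undefined,
      // Force the global and unicode flags and de-duplicate the rest.
      // Unlike the removed code, invalid flags surface as the RegExp constructor's SyntaxError.
      re: new RegExp(r.regex, [...new Set(`${r.flags ?? ""}gu`)].join("")),
      replacement: r.replacement,
    }));

  return pages.map((p) => {
    let content = p.content;
    for (const rule of compiled) {
      if (!rule.pageIdSet || rule.pageIdSet.has(p.id)) {
        content = content.replace(rule.re, rule.replacement);
      }
    }
    // Keep the original object when nothing changed, as the removed code did.
    return content === p.content ? p : { ...p, content };
  });
};

// Usage: collapse whitespace runs on page 1 only.
const pages = [
  { id: 1, content: "a  b" },
  { id: 2, content: "c  d" },
];
const cleaned = applyPageReplacements(pages, [
  { regex: "\\s+", replacement: " ", pageIds: [1] },
]);
```

The cleaned pages could then be handed to `segmentPages(cleaned, options)`. Any recovery step should receive the same cleaned pages so that matching runs against identical text.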