flappa-doormal 2.10.1 → 2.11.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +26 -0
- package/LICENSE.md +1 -1
- package/README.md +110 -2
- package/dist/index.d.mts +12 -14
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +167 -81
- package/dist/index.mjs.map +1 -1
- package/package.json +1 -1
package/AGENTS.md
CHANGED
@@ -117,6 +117,8 @@ src/
 - `hasExcludedPageInRange()` - Check if range contains excluded pages
 - `findNextPagePosition()` - Find next page content position
 - `findPatternBreakPosition()` - Find pattern match by preference
+- `findSafeBreakPosition()` - Search backward for a safe linguistic split point (whitespace/punctuation)
+- `adjustForSurrogate()` - Ensure split position doesn't corrupt Unicode surrogate pairs
 
 10. **`types.ts`** - Type definitions
 - `Logger` interface - Optional logging for debugging
@@ -260,6 +262,19 @@ segmentPages(pages, {
 
 > **Note**: Older per-rule span limiting approaches were removed in favor of post-processing `breakpoints`.
 
+### 5. Safety-Hardened Content Splitting (NEW)
+
+When using `maxContentLength`, the segmenter prevents text corruption through several layers of fallback logic.
+
+**Algorithm:**
+1. **Windowed Pattern Match**: Attempt to find a user-provided `breakpoint` pattern within the character window.
+2. **Safe Fallback (Linguistic)**: If no pattern matches, use `findSafeBreakPosition()` to search backward (100 chars) for whitespace or punctuation `[\s\n.,;!?؛،۔]`.
+3. **Safe Fallback (Technical)**: If still no safe break is found, use `adjustForSurrogate()` to ensure the split doesn't fall between a High and Low Unicode surrogate pair.
+4. **Hard Split**: Only as a final resort is a character-exact split performed.
+
+**Progress Guarantee**:
+The loop in `processOversizedSegment` has been refactored so it no longer depends on a low fixed iteration limit (previously 10k). It instead relies on strict `cursorPos` progression and input validation (`maxContentLength >= 50`), retaining only a high safety backstop, so arbitrarily large content can be processed without risk of truncation.
+
 ## Design Decisions
 
 ### 1. Why `{{double-braces}}`?
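The fallback cascade described above can be sketched as follows. The two helpers mirror the implementations shipped in `dist/index.mjs` in this release; only the standalone framing is added here.

```typescript
// Search backward from a hard limit for a whitespace/punctuation delimiter.
// Returns the index just after the delimiter, or -1 if none found in the window.
const findSafeBreakPosition = (content: string, targetPosition: number, lookbackChars = 100): number => {
  const startSearch = Math.max(0, targetPosition - lookbackChars);
  for (let i = targetPosition - 1; i >= startSearch; i--) {
    // Whitespace plus Latin and Arabic punctuation count as safe delimiters
    if (/[\s\n.,;!?؛،۔]/.test(content[i])) return i + 1;
  }
  return -1;
};

// Step the split point back one UTF-16 unit if it would land between
// a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF).
const adjustForSurrogate = (content: string, position: number): number => {
  if (position <= 0 || position >= content.length) return position;
  const high = content.charCodeAt(position - 1);
  const low = content.charCodeAt(position);
  if (high >= 0xd800 && high <= 0xdbff && low >= 0xdc00 && low <= 0xdfff) return position - 1;
  return position;
};
```

For example, `findSafeBreakPosition('hello world foo', 13)` returns 12 (just after the space), and `adjustForSurrogate('a😀', 2)` returns 1 so the emoji's two code units stay together.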
@@ -414,6 +429,16 @@ bunx biome lint .
 
 10. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
 
+11. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
+
+12. **Unicode Surrogate Safety**: Multi-byte characters (like Emojis) can be corrupted if split in the middle of a surrogate pair. Always use a helper like `adjustForSurrogate` to ensure the split point falls on a valid character boundary.
+
+13. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
+
+14. **Accidental File Overwrites**: Be extremely careful when using tools like `replace_file_content` with large ranges. Verify file integrity frequently (e.g., `git diff`) to catch accidental deletions of existing code or tests. Merging new tests into existing files is a high-risk operation for AI agents.
+
+15. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible bidirectional formatting marks like Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), or Arabic Letter Mark (`U+061C`). These appear at line starts after `\n` but before visible characters, breaking `^` anchored patterns. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\uFEFF]*(?:pattern)`. The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
+
 ### Process Template (Multi-agent design review, TDD-first)
 
 If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
@@ -606,3 +631,4 @@ it('should correctly split pages with identical prefixes and duplicated content'
 ```
 
 ---
+15. **Use Synthesized AI Reviews**: For complex safety features, getting reviews from multiple models (Claude, GPT, etc.) and synthesizing them into a single action plan (see `docs/reviews/max-content-length-review-synthesis.md`) revealed critical edge cases like Arabic diacritic corruption and surrogate pair safety that a single model might miss.
package/LICENSE.md
CHANGED
@@ -1,4 +1,4 @@
-Copyright 2025 Ragaeeb Haq>
+Copyright 2025-2026 Ragaeeb Haq>
 
 Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
 
package/README.md
CHANGED
|
@@ -51,6 +51,7 @@ Working with Arabic hadith and Islamic text collections requires splitting conti
 ✅ **Readable templates**: `{{raqms}} {{dash}}` instead of cryptic regex
 ✅ **Named captures**: `{{raqms:hadithNum}}` auto-extracts to `meta.hadithNum`
 ✅ **Fuzzy matching**: Auto-enabled for `{{bab}}`, `{{kitab}}`, `{{basmalah}}`, `{{fasl}}`, `{{naql}}` (override with `fuzzy: false`)
+✅ **Content limits**: `maxPages` and `maxContentLength` (safety-hardened) control segment size
 ✅ **Page tracking**: Know which page each segment came from
 ✅ **Declarative rules**: Describe *what* to match, not *how*
 
@@ -236,7 +237,92 @@ Limit rules to specific page ranges:
 }
 ```
 
-### 7.
+### 7. Max Content Length (Safety Hardened)
+
+Split oversized segments based on character count:
+
+```typescript
+{
+  maxContentLength: 500, // Split after 500 characters
+  prefer: 'longer', // Try to fill the character bucket
+  breakpoints: ['\\.'], // Recommended: split on punctuation within window
+}
+```
+
+The library implements **safety hardening** for character-based splits:
+- **Safe Fallback**: If no breakpoint matches, it searches backward up to 100 characters for a delimiter (whitespace or punctuation) to avoid chopping words.
+- **Unicode Safety**: Automatically prevents splitting inside Unicode surrogate pairs (e.g., emojis), preventing text corruption.
+- **Validation**: `maxContentLength` must be at least **50**.
+
+### 8. Advanced Structural Filters
+
+Refine rule matching with page-specific constraints:
+
+```typescript
+{
+  lineStartsWith: ['### '],
+  split: 'at',
+  // Range constraints
+  min: 10, // Only match on pages 10 and above
+  max: 500, // Only match on pages 500 and below
+  exclude: [50, [100, 110]], // Skip page 50 and range 100-110
+
+  // Negative lookahead: skip rule if content matches this pattern
+  // (e.g. skip chapter marker if it appears inside a table/list)
+  skipWhen: '^\s*- ',
+}
+```
+
+### 9. Debugging & Logging
+
+Pass an optional `logger` to trace segmentation decisions or enable `debug` to attach match metadata to segments:
+
+```typescript
+const segments = segmentPages(pages, {
+  rules: [...],
+  debug: true, // Attaches .meta.debug with regex and match indices
+  logger: {
+    debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
+    info: (msg, data) => console.info(`[INFO] ${msg}`, data),
+    warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
+    error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
+  }
+});
+```
+
+### 10. Page Joiners
+
+Control how text from different pages is stitched together:
+
+```typescript
+// Default: space ' ' joiner
+// Result: "...end of page 1. Start of page 2..."
+segmentPages(pages, { pageJoiner: 'space' });
+
+// Result: "...end of page 1.\nStart of page 2..."
+segmentPages(pages, { pageJoiner: 'newline' });
+```
+
+### 11. Breakpoint Preferences
+
+When a segment exceeds `maxPages` or `maxContentLength`, breakpoints split it at the "best" available match:
+
+```typescript
+{
+  maxPages: 1, // Maximum segment size (page span)
+  breakpoints: ['{{tarqim}}'],
+
+  // 'longer' (default): Greedy. Finds the match furthest in the window.
+  // Result: Segments stay close to the max limit.
+  prefer: 'longer',
+
+  // 'shorter': Conservative. Finds the first available match.
+  // Result: Segments split as early as possible.
+  prefer: 'shorter',
+}
+```
+
 ### 12. Occurrence Filtering
 
 Control which matches to use:
 
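The Unicode-safety bullet in section 7 guards against a concrete JavaScript pitfall, shown here with plain string methods (no library API involved):

```typescript
const text = 'a😀b';      // the emoji occupies two UTF-16 code units
console.log(text.length); // 4

// Slicing at a raw index can land between the surrogate halves,
// leaving an unpaired high surrogate that renders as �.
const broken = text.slice(0, 2);
console.log(broken.charCodeAt(1)); // 55357 (0xD83D): lone high surrogate

// A split adjusted one unit back stays on a valid character boundary.
const safe = text.slice(0, 1);
console.log(safe); // 'a'
```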
@@ -831,7 +917,27 @@ const options: SegmentationOptions = {
   // How to join content across page boundaries in OUTPUT segments:
   // - 'space' (default): page boundaries become spaces
   // - 'newline': preserve page boundaries as newlines
-  pageJoiner: '
+  pageJoiner: 'newline',
+
+  // Breakpoint preferences for resizing oversized segments:
+  // - 'longer' (default): maximizes segment size within limits
+  // - 'shorter': minimizes segment size (splits at first match)
+  prefer: 'longer',
+
+  // Post-structural limit: split if segment spans more than 2 pages
+  maxPages: 2,
+
+  // Post-structural limit: split if segment exceeds 5000 characters
+  maxContentLength: 5000,
+
+  // Enable match metadata in segments (meta.debug)
+  debug: true,
+
+  // Custom logger for tracing
+  logger: {
+    info: (m) => console.log(m),
+    warn: (m) => console.warn(m),
+  }
 };
 
 const segments: Segment[] = segmentPages(pages, options);
@@ -1057,6 +1163,8 @@ type SplitRule = {
   // Constraints
   min?: number;
   max?: number;
+  exclude?: (number | [number, number])[]; // Single page or [start, end] range
+  skipWhen?: string; // Regex pattern (tokens supported)
   meta?: Record<string, unknown>;
 };
 ```
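For illustration, the `exclude` shape above (single pages or `[start, end]` ranges) could be checked with a helper like this; `isExcluded` is a hypothetical sketch, not part of the library's API:

```typescript
type Exclusion = number | [number, number];

// True when `page` equals a listed page or falls inside a listed inclusive range.
const isExcluded = (page: number, exclude: Exclusion[]): boolean =>
  exclude.some((e) => (typeof e === 'number' ? e === page : page >= e[0] && page <= e[1]));

isExcluded(50, [50, [100, 110]]);  // true  (exact page)
isExcluded(105, [50, [100, 110]]); // true  (inside range)
isExcluded(99, [50, [100, 110]]);  // false
```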
package/dist/index.d.mts
CHANGED
@@ -641,6 +641,18 @@ type SegmentationOptions = {
   * { maxPages: 2, breakpoints: ['{{tarqim}}', ''] }
   */
  maxPages?: number;
+ /**
+  * Maximum length (in characters) per segment.
+  *
+  * When a segment exceeds this length, breakpoints are applied to split it.
+  * This can typically be used in conjunction with `maxPages`, where the
+  * strictest constraint (intersection) determines the split window.
+  *
+  * @example
+  * // Break segments that exceed 2000 chars
+  * { maxContentLength: 2000, breakpoints: ['{{tarqim}}'] }
+  */
+ maxContentLength?: number;
  /**
   * Patterns tried in order to break oversized segments.
   *
@@ -894,20 +906,6 @@ type ReplaceRule = NonNullable<SegmentationOptions['replace']>[number];
 declare const applyReplacements: (pages: Page[], rules?: ReplaceRule[]) => Page[];
 //#endregion
 //#region src/segmentation/segmenter.d.ts
-/**
- * Applies breakpoints to oversized segments.
- *
- * For each segment that spans more than maxPages, tries the breakpoint patterns
- * in order to find a suitable split point. Structural markers (from rules) are
- * always respected - segments are only broken within their boundaries.
- *
- * @param segments - Initial segments from rule processing
- * @param pages - Original pages for page lookup
- * @param maxPages - Maximum pages before breakpoints apply
- * @param breakpoints - Patterns to try in order (tokens supported)
- * @param prefer - 'longer' for last match, 'shorter' for first match
- * @returns Processed segments with oversized ones broken up
- */
 /**
  * Segments pages of content based on pattern-matching rules.
  *
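The "strictest constraint (intersection)" behavior described in the `maxContentLength` docblock can be sketched as follows; `windowEnd` is a hypothetical helper for illustration, not the library's actual function:

```typescript
// The split window ends at whichever limit is hit first:
// the page-derived boundary or the character cap.
const windowEnd = (pageWindowEnd: number, maxContentLength?: number): number =>
  maxContentLength !== undefined ? Math.min(pageWindowEnd, maxContentLength) : pageWindowEnd;

windowEnd(5000, 2000); // 2000 – character cap is stricter
windowEnd(1500, 2000); // 1500 – page boundary is stricter
```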
package/dist/index.d.mts.map
CHANGED
package/dist/index.mjs
CHANGED
@@ -1319,12 +1319,12 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
 const MAX_DEVIATION = 2e3;
 if (bestDistance <= MAX_DEVIATION) return bestCandidate.pos;
 logger?.debug?.("[breakpoints] findPageStartNearExpectedBoundary: Rejected match exceeding deviation", {
-	targetPageIdx,
-	expectedBoundary,
 	bestDistance,
-
+	expectedBoundary,
 	matchPos: bestCandidate.pos,
-
+	maxDeviation: MAX_DEVIATION,
+	prefixLength: len,
+	targetPageIdx
 });
 }
 }
@@ -1533,6 +1533,34 @@ const findBreakPosition = (remainingContent, currentFromIdx, toIdx, windowEndIdx
 }
 return null;
 };
+/**
+ * Searches backward from a target position to find a "safe" split point.
+ * A safe split point is after whitespace or punctuation.
+ *
+ * @param content The text content
+ * @param targetPosition The desired split position (hard limit)
+ * @param lookbackChars How far back to search for a safe break
+ * @returns The new split position (index), or -1 if no safe break found
+ */
+const findSafeBreakPosition = (content, targetPosition, lookbackChars = 100) => {
+	const startSearch = Math.max(0, targetPosition - lookbackChars);
+	for (let i = targetPosition - 1; i >= startSearch; i--) {
+		const char = content[i];
+		if (/[\s\n.,;!?؛،۔]/.test(char)) return i + 1;
+	}
+	return -1;
+};
+/**
+ * Ensures the position does not split a surrogate pair.
+ * If position is between High and Low surrogate, returns position - 1.
+ */
+const adjustForSurrogate = (content, position) => {
+	if (position <= 0 || position >= content.length) return position;
+	const high = content.charCodeAt(position - 1);
+	const low = content.charCodeAt(position);
+	if (high >= 55296 && high <= 56319 && low >= 56320 && low <= 57343) return position - 1;
+	return position;
+};
 
 //#endregion
 //#region src/segmentation/debug-meta.ts
@@ -1659,7 +1687,7 @@ const createPieceSegment = (pieceContent, actualStartIdx, actualEndIdx, pageIds,
 *
 * @returns Break offset relative to remainingContent, or windowEndPosition as fallback
 */
-const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer) => {
+const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer, maxContentLength) => {
 	if (hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, windowEndIdx)) {
 		const exclusionBreak = findExclusionBreakPosition(currentFromIdx, windowEndIdx, toIdx, pageIds, expandedBreakpoints, cumulativeOffsets);
 		if (exclusionBreak > 0) return { breakOffset: exclusionBreak };
@@ -1675,6 +1703,11 @@ const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx
 		breakpointIndex: patternMatch.breakpointIndex,
 		breakpointRule: patternMatch.rule
 	};
+	if (maxContentLength && windowEndPosition === maxContentLength) {
+		const safeOffset = findSafeBreakPosition(remainingContent, windowEndPosition);
+		if (safeOffset !== -1) return { breakOffset: safeOffset };
+		return { breakOffset: adjustForSurrogate(remainingContent, windowEndPosition) };
+	}
 	return { breakOffset: windowEndPosition };
 };
 /**
@@ -1691,7 +1724,7 @@ const skipWhitespace$1 = (content, startPos) => {
 *
 * Uses precomputed boundary positions for O(log n) page attribution lookups.
 */
-const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey) => {
+const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey, maxContentLength) => {
 	const result = [];
 	const fullContent = segment.content;
 	let cursorPos = 0;
@@ -1705,57 +1738,104 @@ const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPag
|
|
|
1705
1738
|
fullContentLength: fullContent.length,
|
|
1706
1739
|
toIdx
|
|
1707
1740
|
});
|
|
1708
|
-
|
|
1709
|
-
|
|
1741
|
+
let i = 0;
|
|
1742
|
+
const MAX_SAFE_ITERATIONS = 1e5;
|
|
1743
|
+
while (cursorPos < fullContent.length && currentFromIdx <= toIdx && i < MAX_SAFE_ITERATIONS) {
|
|
1744
|
+
i++;
|
|
1710
1745
|
const remainingContent = fullContent.slice(cursorPos);
|
|
1711
1746
|
if (!remainingContent.trim()) break;
|
|
1712
|
-
|
|
1713
|
-
const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
|
|
1714
|
-
if (remainingSpan <= maxPages && !remainingHasExclusions) {
|
|
1715
|
-
const includeMeta = isFirstPiece || Boolean(debugMetaKey);
|
|
1716
|
-
const meta = debugMetaKey && lastBreakpoint ? mergeDebugIntoMeta(includeMeta ? segment.meta : void 0, debugMetaKey, buildBreakpointDebugPatch(lastBreakpoint.breakpointIndex, lastBreakpoint.rule)) : includeMeta ? segment.meta : void 0;
|
|
1717
|
-
const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, meta, includeMeta);
|
|
1718
|
-
if (finalSeg) result.push(finalSeg);
|
|
1719
|
-
break;
|
|
1720
|
-
}
|
|
1747
|
+
if (handleOversizedSegmentFit(remainingContent, currentFromIdx, toIdx, pageIds, expandedBreakpoints, maxPages, maxContentLength, isFirstPiece, debugMetaKey, segment.meta, lastBreakpoint, result)) break;
|
|
1721
1748
|
const windowEndIdx = computeWindowEndIdx(currentFromIdx, toIdx, pageIds, maxPages);
|
|
1722
|
-
const windowEndPosition =
|
|
1749
|
+
const windowEndPosition = getWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, maxContentLength, logger);
|
|
1723
1750
|
logger?.debug?.(`[breakpoints] iteration=${i}`, {
|
|
1724
1751
|
currentFromIdx,
|
|
1725
1752
|
cursorPos,
|
|
1726
|
-
windowEndIdx
|
|
1753
|
+
windowEndIdx,
|
|
1754
|
+
windowEndPosition
|
|
1727
1755
|
});
|
|
1728
|
-
-		const found = findBreakOffsetForWindow(remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer);
+		const found = findBreakOffsetForWindow(remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer, maxContentLength);
+		let breakOffset = found.breakOffset;
+		if (breakOffset <= 0) {
+			const fallbackPos = maxContentLength ? Math.min(maxContentLength, remainingContent.length) : 1;
+			breakOffset = Math.max(1, fallbackPos);
+			logger?.warn?.("[breakpoints] No progress from findBreakOffsetForWindow; forcing forward movement", {
+				breakOffset,
+				cursorPos
+			});
+		}
 		if (found.breakpointIndex !== void 0 && found.breakpointRule) lastBreakpoint = {
 			breakpointIndex: found.breakpointIndex,
 			rule: found.breakpointRule
 		};
-		const breakPos = cursorPos +
+		const breakPos = cursorPos + breakOffset;
 		const pieceContent = fullContent.slice(cursorPos, breakPos).trim();
-		const { actualEndIdx, actualStartIdx } = computePiecePages(cursorPos, breakPos, boundaryPositions, fromIdx, toIdx);
-		logger?.trace?.("[breakpoints] piece", {
-			actualEndIdx,
-			actualStartIdx,
-			pieceLength: pieceContent.length
-		});
 		if (pieceContent) {
-			const
-			const pieceSeg = createPieceSegment(pieceContent, actualStartIdx, actualEndIdx, pageIds,
+			const { actualEndIdx, actualStartIdx } = computePiecePages(cursorPos, breakPos, boundaryPositions, fromIdx, toIdx);
+			const pieceSeg = createPieceSegment(pieceContent, actualStartIdx, actualEndIdx, pageIds, getSegmentMetaWithDebug(isFirstPiece, debugMetaKey, segment.meta, lastBreakpoint), true);
 			if (pieceSeg) result.push(pieceSeg);
-
-
-
+			const next = advanceCursorAndIndex(fullContent, breakPos, actualEndIdx, toIdx, pageIds, normalizedPages);
+			cursorPos = next.cursorPos;
+			currentFromIdx = next.currentFromIdx;
+		} else cursorPos = breakPos;
 		isFirstPiece = false;
 	}
+	if (i >= MAX_SAFE_ITERATIONS) logger?.error?.("[breakpoints] Stopped processing oversized segment: reached MAX_SAFE_ITERATIONS", {
+		cursorPos,
+		fullContentLength: fullContent.length,
+		iterations: i
+	});
 	logger?.debug?.("[breakpoints] done", { resultCount: result.length });
 	return result;
 };
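The forced-progress fallback added above can be read in isolation: when the window search returns no usable offset, the cursor is still advanced so the loop cannot stall. The sketch below is a hypothetical standalone helper (`forcedBreakOffset` is not part of the package's API), mirroring only the offset arithmetic from the diff.

```javascript
// Hypothetical helper mirroring the fallback logic: if the search found no
// positive offset, advance by maxContentLength (capped to the remaining text)
// or by at least one character, so the loop always makes forward progress.
const forcedBreakOffset = (foundOffset, maxContentLength, remainingLength) => {
	if (foundOffset > 0) return foundOffset; // search succeeded; use it as-is
	const fallbackPos = maxContentLength
		? Math.min(maxContentLength, remainingLength)
		: 1;
	return Math.max(1, fallbackPos); // never advance by zero
};
```

This guarantee is what makes the `MAX_SAFE_ITERATIONS` guard a true last resort rather than a routine exit.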
 /**
+ * Checks if the remaining content fits within paged/length limits.
+ * If so, pushes the final segment and returns true.
+ */
+const handleOversizedSegmentFit = (remainingContent, currentFromIdx, toIdx, pageIds, expandedBreakpoints, maxPages, maxContentLength, isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint, result) => {
+	const remainingSpan = computeRemainingSpan(currentFromIdx, toIdx, pageIds);
+	const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
+	const fitsInPages = remainingSpan <= maxPages;
+	const fitsInLength = !maxContentLength || remainingContent.length <= maxContentLength;
+	if (fitsInPages && fitsInLength && !remainingHasExclusions) {
+		const includeMeta = isFirstPiece || Boolean(debugMetaKey);
+		const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, getSegmentMetaWithDebug(isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint), includeMeta);
+		if (finalSeg) result.push(finalSeg);
+		return true;
+	}
+	return false;
+};
+/**
+ * Builds metadata for a segment piece, optionally including debug info.
+ */
+const getSegmentMetaWithDebug = (isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint) => {
+	if (!(isFirstPiece || Boolean(debugMetaKey))) return;
+	if (debugMetaKey && lastBreakpoint) return mergeDebugIntoMeta(isFirstPiece ? originalMeta : void 0, debugMetaKey, buildBreakpointDebugPatch(lastBreakpoint.breakpointIndex, lastBreakpoint.rule));
+	return isFirstPiece ? originalMeta : void 0;
+};
+/**
+ * Calculates window end position, capped by maxContentLength if present.
+ */
+const getWindowEndPosition = (remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, maxContentLength, logger) => {
+	let windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger);
+	if (maxContentLength && maxContentLength < windowEndPosition) windowEndPosition = maxContentLength;
+	return windowEndPosition;
+};
+/**
+ * Advances cursorPos and currentFromIdx for the next iteration.
+ */
+const advanceCursorAndIndex = (fullContent, breakPos, actualEndIdx, toIdx, pageIds, normalizedPages) => {
+	const nextCursorPos = skipWhitespace$1(fullContent, breakPos);
+	return {
+		currentFromIdx: computeNextFromIdx(fullContent.slice(nextCursorPos), actualEndIdx, toIdx, pageIds, normalizedPages),
+		cursorPos: nextCursorPos
+	};
+};
+/**
 * Applies breakpoints to oversized segments.
 *
 * Note: This is an internal engine used by `segmentPages()`.
 */
-const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner = "space", debugMetaKey) => {
+const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner = "space", debugMetaKey, maxContentLength) => {
 	const pageIds = pages.map((p) => p.id);
 	const pageIdToIndex = buildPageIdToIndexMap(pageIds);
 	const normalizedPages = buildNormalizedPagesMap(pages, normalizedContent);
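The capping step inside the new `getWindowEndPosition` helper is the core of how `maxContentLength` constrains the search window. A simplified sketch (the real helper first derives the window end from page boundaries via `findBreakpointWindowEndPosition`; here it is just a parameter):

```javascript
// Simplified sketch of getWindowEndPosition's capping step: the breakpoint
// search window never extends past maxContentLength when that limit is set.
const capWindowEnd = (windowEndPosition, maxContentLength) => {
	if (maxContentLength && maxContentLength < windowEndPosition) return maxContentLength;
	return windowEndPosition;
};
```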
@@ -1779,11 +1859,13 @@ const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoi
 		const toIdx = segment.to !== void 0 ? pageIdToIndex.get(segment.to) ?? fromIdx : fromIdx;
 		const segmentSpan = (segment.to ?? segment.from) - segment.from;
 		const hasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, fromIdx, toIdx);
-
+		const fitsInPages = segmentSpan <= maxPages;
+		const fitsInLength = !maxContentLength || segment.content.length <= maxContentLength;
+		if (fitsInPages && fitsInLength && !hasExclusions) {
 			result.push(segment);
 			continue;
 		}
-		const broken = processOversizedSegment(segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey);
+		const broken = processOversizedSegment(segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey, maxContentLength);
 		result.push(...broken.map((s) => {
 			const segFromIdx = pageIdToIndex.get(s.from) ?? -1;
 			const segToIdx = s.to !== void 0 ? pageIdToIndex.get(s.to) ?? segFromIdx : segFromIdx;
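The hunk above shows that a segment is passed through untouched only when it fits both limits. The predicate can be sketched on its own (hypothetical helper name, extracted from the two `fitsIn*` conditions in the diff):

```javascript
// A segment skips breakpoint processing only when it fits in pages AND,
// if maxContentLength is set, also fits in characters.
const segmentFits = (segmentSpan, contentLength, maxPages, maxContentLength) => {
	const fitsInPages = segmentSpan <= maxPages;
	const fitsInLength = !maxContentLength || contentLength <= maxContentLength;
	return fitsInPages && fitsInLength;
};
```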
@@ -1995,7 +2077,7 @@ const buildLineStartsAfterRegexSource = (patterns, fuzzy, capturePrefix) => {
 	const union = processed.map((p) => p.pattern).join("|");
 	return {
 		captureNames: processed.flatMap((p) => p.captureNames),
-		regex: `^(?:${union})${capturePrefix ? `(?<${capturePrefix}__content>.*)` : "(.*)"}`
+		regex: `^[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*(?:${union})${capturePrefix ? `(?<${capturePrefix}__content>.*)` : "(.*)"}`
 	};
 };
 const buildLineStartsWithRegexSource = (patterns, fuzzy, capturePrefix) => {
@@ -2003,7 +2085,7 @@ const buildLineStartsWithRegexSource = (patterns, fuzzy, capturePrefix) => {
 	const union = processed.map((p) => p.pattern).join("|");
 	return {
 		captureNames: processed.flatMap((p) => p.captureNames),
-		regex: `^(?:${union})`
+		regex: `^[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*(?:${union})`
 	};
 };
 const buildLineEndsWithRegexSource = (patterns, fuzzy, capturePrefix) => {
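The new `^[\u200E\u200F\u061C\u200B\uFEFF]*` prefix lets line-start patterns match even when the line opens with invisible characters — left-to-right mark, right-to-left mark, Arabic letter mark, zero-width space, and BOM — which are common in RTL source text. A minimal demonstration with a literal marker in place of the library's compiled pattern union:

```javascript
// Before: a bare ^ anchor fails on lines beginning with invisible bidi marks.
// After: the optional prefix class consumes them before the real pattern.
const oldRe = new RegExp("^(?:Chapter)");
const newRe = new RegExp("^[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*(?:Chapter)");
const line = "\u200FChapter one"; // line starts with a Right-to-Left Mark
```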
@@ -2246,6 +2328,47 @@ const createPageStartGuardChecker = (matchContent, pageMap) => {
 		return prevReq.test(lastChar);
 	};
 };
+/**
+ * Checks if a pageId matches the min/max/exclude constraints of a rule.
+ */
+const passesRuleConstraints$1 = (rule, pageId) => {
+	return (rule.min === void 0 || pageId >= rule.min) && (rule.max === void 0 || pageId <= rule.max) && !isPageExcluded(pageId, rule.exclude);
+};
+/**
+ * Records a split point for a specific rule.
+ */
+const recordSplitPointAt = (splitPointsByRule, ruleIndex, sp) => {
+	const arr = splitPointsByRule.get(ruleIndex);
+	if (!arr) {
+		splitPointsByRule.set(ruleIndex, [sp]);
+		return;
+	}
+	arr.push(sp);
+};
+/**
+ * Processes matches for all fast-fuzzy rules at a specific line start.
+ */
+const processFastFuzzyMatchesAt = (matchContent, lineStart, pageId, fastFuzzyRules, passesPageStartGuard, isPageStart, splitPointsByRule) => {
+	for (const { compiled, kind, rule, ruleIndex } of fastFuzzyRules) {
+		if (!passesRuleConstraints$1(rule, pageId)) continue;
+		if (isPageStart && !passesPageStartGuard(rule, ruleIndex, lineStart)) continue;
+		const end = matchFastFuzzyTokenAt(matchContent, lineStart, compiled);
+		if (end === null) continue;
+		const splitIndex = (rule.split ?? "at") === "at" ? lineStart : end;
+		if (kind === "startsWith") recordSplitPointAt(splitPointsByRule, ruleIndex, {
+			index: splitIndex,
+			meta: rule.meta
+		});
+		else {
+			const markerLength = end - lineStart;
+			recordSplitPointAt(splitPointsByRule, ruleIndex, {
+				contentStartOffset: (rule.split ?? "at") === "at" ? markerLength : void 0,
+				index: splitIndex,
+				meta: rule.meta
+			});
+		}
+	}
+};
 const collectFastFuzzySplitPoints = (matchContent, pageMap, fastFuzzyRules, passesPageStartGuard) => {
 	const splitPointsByRule = /* @__PURE__ */ new Map();
 	if (fastFuzzyRules.length === 0 || pageMap.boundaries.length === 0) return splitPointsByRule;
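The extracted `passesRuleConstraints$1` helper is small enough to exercise directly. A runnable sketch, with a simplified `isPageExcluded` stand-in (the package's real implementation may also support exclusion ranges, so the stub here is an assumption):

```javascript
// Simplified stand-in for isPageExcluded: handles only an array of page ids.
const isPageExcluded = (pageId, exclude) =>
	Array.isArray(exclude) && exclude.includes(pageId);

// Mirrors passesRuleConstraints$1 from the diff: a page passes when it is
// within the optional [min, max] window and not explicitly excluded.
const passesRuleConstraints = (rule, pageId) =>
	(rule.min === void 0 || pageId >= rule.min) &&
	(rule.max === void 0 || pageId <= rule.max) &&
	!isPageExcluded(pageId, rule.exclude);
```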
@@ -2257,38 +2380,12 @@ const collectFastFuzzySplitPoints = (matchContent, pageMap, fastFuzzyRules, pass
 			currentBoundary = pageMap.boundaries[boundaryIdx];
 		}
 	};
-	const recordSplitPoint = (ruleIndex, sp) => {
-		const arr = splitPointsByRule.get(ruleIndex);
-		if (!arr) {
-			splitPointsByRule.set(ruleIndex, [sp]);
-			return;
-		}
-		arr.push(sp);
-	};
 	const isPageStart = (offset) => offset === currentBoundary?.start;
 	for (let lineStart = 0; lineStart <= matchContent.length;) {
 		advanceBoundaryTo(lineStart);
 		const pageId = currentBoundary?.id ?? 0;
 		if (lineStart >= matchContent.length) break;
-
-		if (!((rule.min === void 0 || pageId >= rule.min) && (rule.max === void 0 || pageId <= rule.max) && !isPageExcluded(pageId, rule.exclude))) continue;
-		if (isPageStart(lineStart) && !passesPageStartGuard(rule, ruleIndex, lineStart)) continue;
-		const end = matchFastFuzzyTokenAt(matchContent, lineStart, compiled);
-		if (end === null) continue;
-		const splitIndex = (rule.split ?? "at") === "at" ? lineStart : end;
-		if (kind === "startsWith") recordSplitPoint(ruleIndex, {
-			index: splitIndex,
-			meta: rule.meta
-		});
-		else {
-			const markerLength = end - lineStart;
-			recordSplitPoint(ruleIndex, {
-				contentStartOffset: (rule.split ?? "at") === "at" ? markerLength : void 0,
-				index: splitIndex,
-				meta: rule.meta
-			});
-		}
-	}
+		processFastFuzzyMatchesAt(matchContent, lineStart, pageId, fastFuzzyRules, passesPageStartGuard, isPageStart(lineStart), splitPointsByRule);
 		const nextNl = matchContent.indexOf("\n", lineStart);
 		if (nextNl === -1) break;
 		lineStart = nextNl + 1;
@@ -2613,20 +2710,6 @@ const convertPageBreaks = (content, startOffset, pageBreaks) => {
 	return content.replace(/\n/g, (match, offset) => breakSet.has(offset) ? " " : match);
 };
 /**
- * Applies breakpoints to oversized segments.
- *
- * For each segment that spans more than maxPages, tries the breakpoint patterns
- * in order to find a suitable split point. Structural markers (from rules) are
- * always respected - segments are only broken within their boundaries.
- *
- * @param segments - Initial segments from rule processing
- * @param pages - Original pages for page lookup
- * @param maxPages - Maximum pages before breakpoints apply
- * @param breakpoints - Patterns to try in order (tokens supported)
- * @param prefer - 'longer' for last match, 'shorter' for first match
- * @returns Processed segments with oversized ones broken up
- */
-/**
 * Segments pages of content based on pattern-matching rules.
 *
 * This is the main entry point for the segmentation engine. It takes an array
@@ -2669,11 +2752,14 @@ const convertPageBreaks = (content, startOffset, pageBreaks) => {
 * });
 */
 const segmentPages = (pages, options) => {
-	const { rules = [],
+	const { rules = [], breakpoints = [], prefer = "longer", pageJoiner = "space", logger, maxContentLength } = options;
+	if (maxContentLength && maxContentLength < 50) throw new Error(`maxContentLength must be at least 50 characters.`);
+	const maxPages = options.maxPages ?? (maxContentLength ? Number.MAX_SAFE_INTEGER : 0);
 	const debug = resolveDebugConfig(options.debug);
 	const debugMetaKey = debug?.includeRule ? debug.metaKey : void 0;
 	logger?.info?.("[segmenter] starting segmentation", {
 		breakpointCount: breakpoints.length,
+		maxContentLength,
 		maxPages,
 		pageCount: pages.length,
 		prefer,
@@ -2701,10 +2787,10 @@ const segmentPages = (pages, options) => {
 		}))
 	});
 	segments = ensureFallbackSegment(segments, processedPages, normalizedContent, pageJoiner);
-	if (maxPages >= 0 && breakpoints.length) {
+	if ((maxPages >= 0 || maxContentLength && maxContentLength > 0) && breakpoints.length) {
 		logger?.debug?.("[segmenter] applying breakpoints to oversized segments");
 		const patternProcessor = (p) => processPattern(p, false).pattern;
-		const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner, debug?.includeBreakpoint ? debug.metaKey : void 0);
+		const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner, debug?.includeBreakpoint ? debug.metaKey : void 0, maxContentLength);
 		logger?.info?.("[segmenter] segmentation complete (with breakpoints)", { finalSegmentCount: result.length });
 		return result;
 	}
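The option handling added at the top of `segmentPages` enforces two rules visible in the diff: `maxContentLength` must be at least 50, and when only a character limit is given, `maxPages` defaults to "unbounded" so page count alone never triggers a split. A standalone sketch of just that resolution step (`resolveLimits` is a hypothetical name; the real code inlines this):

```javascript
// Mirrors the validation and defaulting from segmentPages: reject tiny
// character limits, and let a character-only limit disable the page limit.
const resolveLimits = (options) => {
	const { maxContentLength } = options;
	if (maxContentLength && maxContentLength < 50)
		throw new Error("maxContentLength must be at least 50 characters.");
	const maxPages = options.maxPages ?? (maxContentLength ? Number.MAX_SAFE_INTEGER : 0);
	return { maxContentLength, maxPages };
};
```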