flappa-doormal 2.10.1 → 2.11.1

package/AGENTS.md CHANGED
@@ -117,6 +117,8 @@ src/
117
117
  - `hasExcludedPageInRange()` - Check if range contains excluded pages
118
118
  - `findNextPagePosition()` - Find next page content position
119
119
  - `findPatternBreakPosition()` - Find pattern match by preference
120
+ - `findSafeBreakPosition()` - Search backward for a safe linguistic split point (whitespace/punctuation)
121
+ - `adjustForSurrogate()` - Ensure split position doesn't corrupt Unicode surrogate pairs
120
122
 
121
123
  10. **`types.ts`** - Type definitions
122
124
  - `Logger` interface - Optional logging for debugging
@@ -260,6 +262,19 @@ segmentPages(pages, {
260
262
 
261
263
  > **Note**: Older per-rule span limiting approaches were removed in favor of post-processing `breakpoints`.
262
264
 
265
+ ### 5. Safety-Hardened Content Splitting (NEW)
266
+
267
+ When using `maxContentLength`, the segmenter prevents text corruption through several layers of fallback logic.
268
+
269
+ **Algorithm:**
270
+ 1. **Windowed Pattern Match**: Attempt to find a user-provided `breakpoint` pattern within the character window.
271
+ 2. **Safe Fallback (Linguistic)**: If no pattern matches, use `findSafeBreakPosition()` to search backward up to 100 characters for whitespace or punctuation (`[\s\n.,;!?؛،۔]`) and split just after it.
272
+ 3. **Safe Fallback (Technical)**: If no safe break is found, use `adjustForSurrogate()` to ensure the split doesn't fall between the high and low halves of a Unicode surrogate pair.
273
+ 4. **Hard Split**: Only as a final resort is a character-exact split performed.
274
+
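The fallback order above can be sketched roughly as follows. This is an illustration under assumptions, not the library's internals: `chooseSplit`, `findSafeBreak`, and `avoidSurrogateSplit` are hypothetical names, the breakpoint is taken as a plain regex source string, and the pattern step simply splits after the last match in the window.

```typescript
// Sketch only: mirrors the documented fallback order, not the library's code.
const SAFE_BREAK = /[\s.,;!?؛،۔]/;

// Search backward from `target` (at most `lookback` characters) for
// whitespace or punctuation. Returns the index just after the delimiter,
// or -1 when none is found.
function findSafeBreak(content: string, target: number, lookback = 100): number {
    const start = Math.max(0, target - lookback);
    for (let i = target - 1; i >= start; i--) {
        if (SAFE_BREAK.test(content.charAt(i))) return i + 1;
    }
    return -1;
}

// Step back one code unit if `pos` sits between a high and a low surrogate.
function avoidSurrogateSplit(content: string, pos: number): number {
    if (pos <= 0 || pos >= content.length) return pos;
    const prev = content.charCodeAt(pos - 1);
    const next = content.charCodeAt(pos);
    const prevIsHigh = prev >= 0xd800 && prev <= 0xdbff;
    const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
    return prevIsHigh && nextIsLow ? pos - 1 : pos;
}

// Hypothetical helper: choose a split offset at or before `limit`.
function chooseSplit(content: string, limit: number, breakpoint?: string): number {
    if (content.length <= limit) return content.length;
    if (breakpoint) {
        // 1. Windowed pattern match: keep the end of the last non-empty match.
        const re = new RegExp(breakpoint, 'g');
        const windowText = content.slice(0, limit);
        let last = -1;
        let m: RegExpExecArray | null;
        while ((m = re.exec(windowText)) !== null) {
            if (m[0].length === 0) {
                re.lastIndex++; // avoid looping forever on empty matches
                continue;
            }
            last = m.index + m[0].length;
        }
        if (last > 0) return last;
    }
    // 2. Linguistic fallback: whitespace or punctuation near the limit.
    const safe = findSafeBreak(content, limit);
    if (safe !== -1) return safe;
    // 3 and 4. Hard split at the limit, nudged off any surrogate pair.
    return avoidSurrogateSplit(content, limit);
}
```

Each step is attempted only when the previous one yields nothing, so the hard split is reached only for text with no delimiter within the lookback range.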
275
+ **Progress Guarantee**:
276
+ The loop in `processOversizedSegment` no longer uses the old fixed limit of 10,000 iterations. It relies on strict `cursorPos` progression and input validation (`maxContentLength >= 50`), with a much higher safety cap (`MAX_SAFE_ITERATIONS = 1e5`) as a backstop, so that large content is not cut off by the iteration limit in normal use.
277
+
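A progress-based guard of the kind described here might look like the sketch below. The function name, the `pickOffset` callback, and the error messages are assumptions for illustration; only the minimum of 50 and the idea of checking cursor progression come from the text above.

```typescript
// Sketch of a progress-guarded split loop (hypothetical helper, not the library's code).
// `pickOffset` decides how many characters of `remaining` to consume next.
function splitWithProgressGuard(
    content: string,
    maxLen: number,
    pickOffset: (remaining: string, maxLen: number) => number,
): string[] {
    // Input validation: a tiny limit makes forward progress hard to guarantee.
    if (maxLen < 50) throw new Error('maxLen must be at least 50');

    const pieces: string[] = [];
    let cursor = 0;
    while (cursor < content.length) {
        const remaining = content.slice(cursor);
        const offset =
            remaining.length <= maxLen
                ? remaining.length
                : Math.min(pickOffset(remaining, maxLen), maxLen);
        const next = cursor + offset;
        // Progress guard: every iteration must move the cursor forward.
        if (next <= cursor) throw new Error(`No progress at position ${cursor}`);
        pieces.push(content.slice(cursor, next));
        cursor = next;
    }
    return pieces;
}
```

Because the loop ends only when the cursor reaches the end of the content, the number of iterations scales with input size instead of being capped up front, and a stalled cursor fails loudly rather than silently dropping text.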
263
278
  ## Design Decisions
264
279
 
265
280
  ### 1. Why `{{double-braces}}`?
@@ -414,6 +429,16 @@ bunx biome lint .
414
429
 
415
430
  10. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
416
431
 
432
+ 11. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
433
+
434
+ 12. **Unicode Surrogate Safety**: Characters outside the Basic Multilingual Plane (such as most emoji) occupy two UTF-16 code units, a surrogate pair, and are corrupted if split between the two. Always use a helper like `adjustForSurrogate` to ensure the split point falls on a valid character boundary.
435
+
436
+ 13. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
437
+
438
+ 14. **Accidental File Overwrites**: Be extremely careful when using tools like `replace_file_content` with large ranges. Verify file integrity frequently (e.g., `git diff`) to catch accidental deletions of existing code or tests. Merging new tests into existing files is a high-risk operation for AI agents.
439
+
440
+ 15. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible formatting characters such as the Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), or Arabic Letter Mark (`U+061C`). When one of these sits at the start of a line, after `\n` but before the first visible character, a `^`-anchored pattern no longer matches. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\uFEFF]*(?:pattern)` (the class also covers the zero-width space `U+200B` and the byte order mark `U+FEFF`). The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
441
+
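The effect described in lesson 15 can be reproduced with a small sketch. `lineStartPattern` is a hypothetical helper written for this example; it is not the library's `buildLineStartsWithRegexSource`, whose exact output is not shown in this diff.

```typescript
// Optional run of invisible marks that may precede the first visible character.
const INVISIBLE_PREFIX = '[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*';

// Hypothetical helper: build a multiline, line-start-anchored pattern that
// tolerates leading invisible marks.
function lineStartPattern(body: string): RegExp {
    return new RegExp(`^${INVISIBLE_PREFIX}(?:${body})`, 'm');
}

// A line that begins with a Right-to-Left Mark before the visible word.
const markSample = 'intro\n\u200Fباب الصلاة';

// The plain anchor fails because U+200F sits between the line start and the word.
const naiveMatches = /^باب/m.test(markSample);

// The tolerant pattern skips the mark and matches.
const tolerantMatches = lineStartPattern('باب').test(markSample);
```

Here `naiveMatches` is `false` and `tolerantMatches` is `true`, which is the difference the optional prefix is meant to make.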
417
442
  ### Process Template (Multi-agent design review, TDD-first)
418
443
 
419
444
  If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
@@ -606,3 +631,4 @@ it('should correctly split pages with identical prefixes and duplicated content'
606
631
  ```
607
632
 
608
633
  ---
634
+ 15. **Use Synthesized AI Reviews**: For complex safety features, getting reviews from multiple models (Claude, GPT, etc.) and synthesizing them into a single action plan (see `docs/reviews/max-content-length-review-synthesis.md`) revealed critical edge cases like Arabic diacritic corruption and surrogate pair safety that a single model might miss.
package/LICENSE.md CHANGED
@@ -1,4 +1,4 @@
1
- Copyright 2025 Ragaeeb Haq>
1
+ Copyright 2025-2026 Ragaeeb Haq>
2
2
 
3
3
  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4
4
 
package/README.md CHANGED
@@ -51,6 +51,7 @@ Working with Arabic hadith and Islamic text collections requires splitting conti
51
51
  ✅ **Readable templates**: `{{raqms}} {{dash}}` instead of cryptic regex
52
52
  ✅ **Named captures**: `{{raqms:hadithNum}}` auto-extracts to `meta.hadithNum`
53
53
  ✅ **Fuzzy matching**: Auto-enabled for `{{bab}}`, `{{kitab}}`, `{{basmalah}}`, `{{fasl}}`, `{{naql}}` (override with `fuzzy: false`)
54
+ ✅ **Content limits**: `maxPages` and `maxContentLength` (safety-hardened) control segment size
54
55
  ✅ **Page tracking**: Know which page each segment came from
55
56
  ✅ **Declarative rules**: Describe *what* to match, not *how*
56
57
 
@@ -236,7 +237,92 @@ Limit rules to specific page ranges:
236
237
  }
237
238
  ```
238
239
 
239
- ### 7. Occurrence Filtering
240
+ ### 7. Max Content Length (Safety Hardened)
241
+
242
+ Split oversized segments based on character count:
243
+
244
+ ```typescript
245
+ {
246
+ maxContentLength: 500, // Split after 500 characters
247
+ prefer: 'longer', // Try to fill the character bucket
248
+ breakpoints: ['\\.'], // Recommended: split on punctuation within window
249
+ }
250
+ ```
251
+
252
+ The library implements **safety hardening** for character-based splits:
253
+ - **Safe Fallback**: If no breakpoint matches, it searches backward up to 100 characters for a delimiter (whitespace or punctuation) to avoid chopping words.
254
+ - **Unicode Safety**: Never splits inside a Unicode surrogate pair (e.g., an emoji), which would corrupt the text.
255
+ - **Validation**: `maxContentLength` must be at least **50**.
256
+
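To see why the Unicode safety bullet matters, the following sketch shows what a naive character-count split does to an emoji. It uses only standard string methods; `endsWithLoneHighSurrogate` is a helper name invented for this example.

```typescript
// True when the last UTF-16 code unit is a high surrogate with no partner after it.
function endsWithLoneHighSurrogate(s: string): boolean {
    if (s.length === 0) return false;
    const last = s.charCodeAt(s.length - 1);
    return last >= 0xd800 && last <= 0xdbff;
}

// The emoji takes two code units, so this string has length 6, not 5.
const emojiSample = 'ab😀cd';

// Cutting at offset 3 lands between the two halves of the emoji.
const brokenPiece = emojiSample.slice(0, 3);

// Cutting one code unit earlier keeps the emoji whole in the next piece.
const intactPiece = emojiSample.slice(0, 2);
```

`brokenPiece` ends in an unpaired high surrogate, which typically renders as a replacement character, while `intactPiece` ends on a valid character boundary.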
257
+ ### 8. Advanced Structural Filters
258
+
259
+ Refine rule matching with page-specific constraints:
260
+
261
+ ```typescript
262
+ {
263
+ lineStartsWith: ['### '],
264
+ split: 'at',
265
+ // Range constraints
266
+ min: 10, // Only match on pages 10 and above
267
+ max: 500, // Only match on pages 500 and below
268
+ exclude: [50, [100, 110]], // Skip page 50 and range 100-110
269
+
270
+ // Negative lookahead: skip rule if content matches this pattern
271
+ // (e.g. skip chapter marker if it appears inside a table/list)
272
+ skipWhen: '^\\s*- ',
273
+ }
274
+ ```
275
+
276
+ ### 9. Debugging & Logging
277
+
278
+ Pass an optional `logger` to trace segmentation decisions or enable `debug` to attach match metadata to segments:
279
+
280
+ ```typescript
281
+ const segments = segmentPages(pages, {
282
+ rules: [...],
283
+ debug: true, // Attaches .meta.debug with regex and match indices
284
+ logger: {
285
+ debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
286
+ info: (msg, data) => console.info(`[INFO] ${msg}`, data),
287
+ warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
288
+ error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
289
+ }
290
+ });
291
+ ```
292
+
293
+ ### 10. Page Joiners
294
+
295
+ Control how text from different pages is stitched together:
296
+
297
+ ```typescript
298
+ // Default: space ' ' joiner
299
+ // Result: "...end of page 1. Start of page 2..."
300
+ segmentPages(pages, { pageJoiner: 'space' });
301
+
302
+ // Result: "...end of page 1.\nStart of page 2..."
303
+ segmentPages(pages, { pageJoiner: 'newline' });
304
+ ```
305
+
306
+ ### 11. Breakpoint Preferences
307
+
308
+ When a segment exceeds `maxPages` or `maxContentLength`, breakpoints split it at the "best" available match:
309
+
310
+ ```typescript
311
+ {
312
+ maxPages: 1, // Maximum segment size (page span)
313
+ breakpoints: ['{{tarqim}}'],
314
+
315
+ // 'longer' (default): Greedy. Finds the match furthest in the window.
316
+ // Result: Segments stay close to the max limit.
317
+ prefer: 'longer',
318
+
319
+ // 'shorter': Conservative. Finds the first available match.
320
+ // Result: Segments split as early as possible.
321
+ // prefer: 'shorter',
322
+ }
323
+ ```
324
+
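The difference between the two preferences can be illustrated with a small sketch. `pickBreak` is a hypothetical helper, and returning the offset just after the match is an assumption made for this example; the library's own selection logic may differ in detail.

```typescript
type Prefer = 'longer' | 'shorter';

// Hypothetical helper: return the offset just after the chosen match inside
// `windowText`, or -1 when the pattern does not match at all.
function pickBreak(windowText: string, pattern: string, prefer: Prefer): number {
    const re = new RegExp(pattern, 'g');
    let chosen = -1;
    let m: RegExpExecArray | null;
    while ((m = re.exec(windowText)) !== null) {
        if (m[0].length === 0) {
            re.lastIndex++; // skip empty matches so the loop always advances
            continue;
        }
        chosen = m.index + m[0].length;
        // 'shorter' stops at the first match; 'longer' keeps scanning to the last.
        if (prefer === 'shorter') break;
    }
    return chosen;
}

const preferWindow = 'One. Two. Three';
const longerBreak = pickBreak(preferWindow, '\\.', 'longer');
const shorterBreak = pickBreak(preferWindow, '\\.', 'shorter');
```

With this window, `longerBreak` falls after the second period and `shorterBreak` after the first, so `'longer'` produces pieces close to the limit and `'shorter'` produces the earliest possible split.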
325
+ ### 12. Occurrence Filtering
240
326
 
241
327
  Control which matches to use:
242
328
 
@@ -831,7 +917,27 @@ const options: SegmentationOptions = {
831
917
  // How to join content across page boundaries in OUTPUT segments:
832
918
  // - 'space' (default): page boundaries become spaces
833
919
  // - 'newline': preserve page boundaries as newlines
834
- pageJoiner: 'space',
920
+ pageJoiner: 'newline',
921
+
922
+ // Breakpoint preferences for resizing oversized segments:
923
+ // - 'longer' (default): maximizes segment size within limits
924
+ // - 'shorter': minimizes segment size (splits at first match)
925
+ prefer: 'longer',
926
+
927
+ // Post-structural limit: split if segment spans more than 2 pages
928
+ maxPages: 2,
929
+
930
+ // Post-structural limit: split if segment exceeds 5000 characters
931
+ maxContentLength: 5000,
932
+
933
+ // Enable match metadata in segments (meta.debug)
934
+ debug: true,
935
+
936
+ // Custom logger for tracing
937
+ logger: {
938
+ info: (m) => console.log(m),
939
+ warn: (m) => console.warn(m),
940
+ }
835
941
  };
836
942
 
837
943
  const segments: Segment[] = segmentPages(pages, options);
@@ -1057,6 +1163,8 @@ type SplitRule = {
1057
1163
  // Constraints
1058
1164
  min?: number;
1059
1165
  max?: number;
1166
+ exclude?: (number | [number, number])[]; // Single page or [start, end] range
1167
+ skipWhen?: string; // Regex pattern (tokens supported)
1060
1168
  meta?: Record<string, unknown>;
1061
1169
  };
1062
1170
  ```
package/dist/index.d.mts CHANGED
@@ -641,6 +641,18 @@ type SegmentationOptions = {
641
641
  * { maxPages: 2, breakpoints: ['{{tarqim}}', ''] }
642
642
  */
643
643
  maxPages?: number;
644
+ /**
645
+ * Maximum length (in characters) per segment.
646
+ *
647
+ * When a segment exceeds this length, breakpoints are applied to split it.
648
+ * This can be combined with `maxPages`; the stricter of the two
649
+ * constraints determines the split window.
650
+ *
651
+ * @example
652
+ * // Break segments that exceed 2000 chars
653
+ * { maxContentLength: 2000, breakpoints: ['{{tarqim}}'] }
654
+ */
655
+ maxContentLength?: number;
644
656
  /**
645
657
  * Patterns tried in order to break oversized segments.
646
658
  *
@@ -894,20 +906,6 @@ type ReplaceRule = NonNullable<SegmentationOptions['replace']>[number];
894
906
  declare const applyReplacements: (pages: Page[], rules?: ReplaceRule[]) => Page[];
895
907
  //#endregion
896
908
  //#region src/segmentation/segmenter.d.ts
897
- /**
898
- * Applies breakpoints to oversized segments.
899
- *
900
- * For each segment that spans more than maxPages, tries the breakpoint patterns
901
- * in order to find a suitable split point. Structural markers (from rules) are
902
- * always respected - segments are only broken within their boundaries.
903
- *
904
- * @param segments - Initial segments from rule processing
905
- * @param pages - Original pages for page lookup
906
- * @param maxPages - Maximum pages before breakpoints apply
907
- * @param breakpoints - Patterns to try in order (tokens supported)
908
- * @param prefer - 'longer' for last match, 'shorter' for first match
909
- * @returns Processed segments with oversized ones broken up
910
- */
911
909
  /**
912
910
  * Segments pages of content based on pattern-matching rules.
913
911
  *
@@ -1 +1 @@
1
- {"version":3,"file":"index.d.mts","names":[],"sources":["../src/segmentation/fuzzy.ts","../src/segmentation/types.ts","../src/segmentation/optimize-rules.ts","../src/segmentation/pattern-validator.ts","../src/segmentation/replace.ts","../src/segmentation/segmenter.ts","../src/segmentation/tokens.ts","../src/analysis/line-starts.ts","../src/analysis/repeating-sequences.ts","../src/detection.ts","../src/recovery.ts"],"sourcesContent":[],"mappings":";;AAkEA;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBY,cDnUC,WCmUG,EAAA,CAAA,CAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAoChB;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAwJA;;;;AC5tBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2DA;AAAyC,cJ0F5B,wBI1F4B,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;AJLzC;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;KArVK,YAAA,GAqViD;EAAe;EAkBzD,KAAA,EAAI,MAAA;AAoChB,CAAA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAwJA;;;;AC5tBA;AAsGA;;;;AC5GA;AAKA;AAcA;;KF0BK,eAAA,GExBkB;EACH;EACL,QAAA,EAAA,MAAA;CAAe;AA0G9B;AAwDA;;;;AC1LA;AA2DA;;;;;;;;AC+SA;;;;;;;;AClTA;AA0QA;AAsDA;AA2CA,KLvVK,qBAAA,GK0VJ;EAQW;EAuKC,cAAA,EAAA,MAAA,EAAA;AA6Cb,CAAA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;AC/wBA;AAcA;AAEA;AAwQA;;;;;;;;AClRA;AAaA;AAOA;AA2OA;;;;KPjKK,sBAAA,GOoKsB;;;;ACjR3B;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC7MA;AAKA;;;;;;AAOA;AA2BE;AAmnBF;;KT/gBK,mBAAA,GSihBS;EACD;EACC,YAAA,EAAA,MAAA,EAAA;CAGa;;;;AA+C3B;;;;;;;KTtjBK,WAAA,GACC,eACA,kBACA,wBACA,yBACA;;;;;;;;;;;;;;;;cAiBO;;;;;;KAOD,cAAA,WAAyB;;;;;;;KAYhC,aAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAiDO,SAAA;;;;;;;KAYP,eAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAyCS;;;;;;;;;;;;SAaH;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAuDC,SAAA,GAAY,cAAc,gBAAgB;;;;;;;;;;;;;KAkB1C,IAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAoCA,cAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAqCE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqCF,UAAA,YAAsB;;;;;;;;;;;;;;;;;;;;;;;;;UA8BjB,MAAA;;;;;;;;;;;;;;;;;;;;;;KAuBL,WAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;KA+CA,mBAAA;;;;;;YAME;;;;;;;;UASF;;;;;;;;;;;;;;cAiBY;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;gBA+CN;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;WAwDL;;;;;;;;;;;;;;;;KAiBD,OAAA;;;;;;;;;;;;;;;;;;;;;;;;;;SA6BD;;;;AA3nBa;;;AAkBlB,KChJM,cAAA,GDgJN;EACA;EACA,KAAA,EChJK,SDgJL,EAAA;EAAmB;EAiBZ,WAAA,EAAA,MAAA;AAOb,CAAA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAwJA;;;;AC5tBA;AAsGA;;;cAAa,uBAAwB,gBAAc;;;AD3FlC;AA4BG;AA8BM;AAyDrB,KEpIO,mBAAA,GFoIY,gBAAA,GAAA,eAAA,GAAA,WAAA,GAAA,eAAA;AAAA;;;AAkBlB,KEjJM,eAAA,GFiJN;EACA,IAAA,EEjJI,mBFiJJ;EACA,OAAA,EAAA,MAAA;EAAmB,UAAA,CAAA,EAAA,MAAA;EAiBZ;EAOD,KAAA,CAAA,EAAA,MAAA;EAYP;EAiDO,OAAA,CAAA,EAAA,MAAS;AAA6B,CAAA;AAyHlD;;;;AAAqE,KEnVzD,oBAAA,GFmVyD;EAkBzD,cAAI,CAAA,EAAA,CEpWM,eFoWN,GAAA,SAAA,CAAA,EAAA;EAoCJ,eAAA,CAAA,EAAc,CEvYH,eF4aT,GAAA,SAAS,CAAA,EAAA;EAqCX,YAAA,CAAU,EAAA,CEhdF,eFgdc,GAAA,SAAc,CAAA,EAAA;EA8B/B,QAAA,CAAM,EE7eR,eF6eQ;AAuBvB,CAAA;AA+CA;;;;;;;AAwJA;;;;AC5tBA;AAsGA;;;;AC5GA;AAKA;AAcA;AACsB,cA6GT,aA7GS,EAAA,CAAA,KAAA,EA6Ge,SA7Gf,EAAA,EAAA,GAAA,CA6G8B,oBA7G9B,GAAA,SAAA,CAAA,EAAA;;;;;AA6GtB;AAwDA;;;;AC1LA;AA2DA;;;AAAyE,cD+H5D,sBC/H4D,EAAA,CAAA,OAAA,EAAA,CD+HxB,oBC/HwB,GAAA,SAAA,CAAA,EAAA,EAAA,GAAA,MAAA,EAAA;;;AJLzE;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAuCtB,KGpJO,WAAA,GAAc,WHoJV,CGpJsB,mBHoJtB,CAAA,SAAA,CAAA,CAAA,CAAA,MAAA,CAAA;;;;;;;AAsBhB;AAOA;AAAgE;AA6DhE;AAYK,cG/LQ,iBHwOC,EAAA,CAAA,KAaH,EGrP8B,IHqPxB,EAAA,EAAA,KAAA,CAAA,EGrPwC,WHqPxC,EAAA,EAAA,GGrPwD,IHqPxD,EAAA;;;;;;;AAyEjB;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAwJA;;;;AC5tBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2DA;;;;;;;;AC+SA;;;;;;;;AClTA;AA0Qa,cDwCA,YCxCA,EAAA,CAAA,KAaZ,ED2BmC,IC3BnC,EAAA,EAAA,OAAA,ED2BoD,mBC3BpD,EAAA,GD2BuE,OC3BvE,EAAA;;;;ANzRD;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAwJA;;;;AC5tBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2Da,cEHA,sBFyBZ,EAAA,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;;
;;;ACyRD;;;;;;;cCxCa;AA1Qb;AA0QA;AAsDA;AA2CA;AAWA;AAuKA;AA6CA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;AC/wBA;AAcA;AAEA;AAwQA;;;;;cDuFa,gBAAgB;;;AEzW7B;AAaA;AAOA;AA2OA;;;;;;;;AC9QA;AA+EA;AAgEa,cHoRA,cGrQZ,EAAA,CAAA,KAfgE,EAAA,MAAA,EAAA,GAAA,OAAe;AAuBhF;AAiCA;;;;AC7MA;AAKY,KJ+aA,YAAA,GI/aiB;EAChB;;;;;EAMD,OAAA,EAAA,MAAA;EA6BP;AAinBL;;;;EAIc,YAAA,EAAA,MAAA,EAAA;EAGa;;;;AA+C3B;EACU,WAAA,EAAA,OAAA;CACsE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;cJvHnE,mHAIV;;;;;;;;;;;;;;;;;;;;cAyCU;;;;;;;;;;;;;;;;;;;;;;cAuBA,uCAAmC;;;;;;;;;;;;;cAqBnC;;;;;;;;;;;;;;;cAgBA;;;;;;;;;;;;;;;cA8BA;;;;KAWD,YAAA;;;;;;;;;;;;;;;;;;;;;;cAoBC,iDAAkD;;;;;;;;;;;;;;;;cA6BlD;;;ANluBA,KO7CD,wBAAA,GP6C8E;EA+F7E,IAAA,CAAA,EAAA,MAAA;;;;ECnIR,WAAA,CAAA,EAAA,MAAY;EA4BZ,wBAAe,CAAA,EAAA,OAAA;EA8Bf,yBAAqB,CAAA,EAAA,OAAA;EAiCrB,MAAA,CAAA,EAAA,aAAA,GAAsB,OAAA;EAwBtB,UAAA,CAAA,EAAA,CAAA,IAAA,EAAA,MAAmB,EAAA,MAAA,EAAA,MAAA,EAAA,GAAA,OAAA;EAenB,cAAW,CAAA,EMjIK,MNiIL,EAAA;EACV,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;CACA;AACA,KMhIM,uBAAA,GNgIN;EACA,IAAA,EAAA,MAAA;EACA,MAAA,EAAA,MAAA;CAAmB;AAiBZ,KMjJD,sBAAA,GNiJwG;EAOxG,OAAA,EAAA,MAAA;EAYP,KAAA,EAAA,MAAA;EAiDO,QAAA,EMlNE,uBNkNO,EAAA;AAA6B,CAAA;AAyHlD;;;AAAsD,cMtEzC,uBNsEyC,EAAA,CAAA,KAAA,EMrE3C,INqE2C,EAAA,EAAA,OAAA,CAAA,EMpEzC,wBNoEyC,EAAA,GMnEnD,sBNmEmD,EAAA;;;AAlOjD,KOtHO,wBAAA,GPsHY;EAenB,WAAA,CAAA,EAAW,MAAA;EACV,WAAA,CAAA,EAAA,MAAA;EACA,QAAA,CAAA,EAAA,MAAA;EACA,IAAA,CAAA,EAAA,MAAA;EACA,yBAAA,CAAA,EAAA,OAAA;EACA,YAAA,CAAA,EAAA,OAAA;EAAmB,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;EAiBZ,WAAA,CAAA,EAAA,MAAuG;EAOxG,YAAA,CAAA,EAAA,MAAc;EAYrB,iBAAa,CAAA,EAAA,MAAA;AAiDlB,CAAA;AAYK,KO9NO,wBAAA,GPuQE;EAoEF,IAAA,EAAA,MAAS;EAAG,OAAA,EAAA,MAAA;EAAc,MAAA,EAAA,MAAA;EAAgB,YAAA,EAAA,MAAA,EAAA;CAAe;AAkBzD,KOtVA,wBAAA,GPsVI;EAoCJ,OAAA,EAAA,MAAA;EA0EA,KAAA,EAAA,MAAU;EA8BL,QAAA,EO/dH,wBP+dS,EAAA;AAuBvB,CAAA;;;;;;AAuMA;cOrda,mCACF,kBACG,6BACX;;;;AR3NH;AA+FA;;;;;ACnIiB;AA4BG;AA+Df,KQ7GO,eAAA,GR6Ge;EAwBtB;EAeA,KAAA,EAAA,MAAA;EACC;EACA,KAAA,EAAA,MAAA;EACA;EACA,KAAA,EAAA,MAAA;EACA;EAAmB,QAAA,EAAA,MAAA;AAiBzB,CAAA;A
AOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;AAMc,cQlgBD,mBRkgBC,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GQlgBkC,eRkgBlC,EAAA;;;;;;AAkJd;;;;AC5tBA;AAsGA;;;cOkCa,mDAAoD;AN9IjE;AAKA;AAcA;;;;AAIe,cM8IF,oBN9IE,EAAA,CAAA,QAAA,EM+ID,eN/IC,EAAA,EAAA,GAAA;EAAe,WAAA,EAAA,gBAAA,GAAA,iBAAA;EA0GjB,KAAA,EAAA,OAAA;EAwDA,QAAA,CAAA,EAAA,MAAA;;;;AC1Lb;AA2DA;;;AAAyE,cK4I5D,kBL5I4D,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA;EAAI,QAAA,EAAA,MAAA;;;;EC+ShE,QAAA,EI5JC,eJ6Nb,EAAA;CAjEmC,GAAA,IAAA;;;ALpTvB,KU5DD,sBAAA,GV4D8E;EA+F7E,IAAA,EAAA,cAAA;;;;ECnIR,KAAA,CAAA,EAAA,OAAY,GAAA,YAAA;EA4BZ,QAAA,EAAA,MAAA,EAAe;AAAA,CAAA,GA8Bf;EAiCA,IAAA,EAAA,WAAA;EAwBA,SAAA,EAAA,CAAA,IAAA,ESxIwC,STwIrB,EAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;AAAA,CAAA;AAgBlB,KStJM,iBAAA,GTsJN;EACA,OAAA,EStJO,mBTsJP;EACA,KAAA,EStJK,ITsJL,EAAA;EACA,QAAA,EStJQ,OTsJR,EAAA;EACA,QAAA,EStJQ,sBTsJR;CAAmB;AAiBZ,KSpKD,oBAAA,GToKwG;EAOxG,OAAA,EAAA;IAYP,IAAA,EAAA,YAAa,GAAA,wBAAA;IAiDN,SAAS,EAAA,MAAA;IAYhB,aAAA,EAAe,MAAA;IA6GR,SAAS,EAAA,MAAA;IAAG,UAAA,EAAA,MAAA;EAAc,CAAA;EAAgB,KAAA,CAAA,ESzV1C,KTyV0C,CAAA;IAAe,SAAA,EAAA,MAAA;IAkBzD,QAAI,EAAA,MAAA;IAoCJ,aAAc,EAAA,MAAA;IA0Ed,UAAU,EAAA,MAAA;EA8BL,CAAA,CAAA;EAuBL,OAAA,ESxgBC,KTwgBU,CAAA;IA+CX,IAAA,EAAA,MAAA;IAME,oBAAA,EAAA,MAAA;IASF,sBAAA,CAAA,EAAA,MAAA;IAiBY,qBAAA,CAAA,EAAA,MAAA;IA+CN,YAAA,EAAA,MAAA;IAwDL,MAAA,EAAA,WAAA,GAAA,oBAAA,GAAA,WAAA,GAAA,sBAAA,GAAA,qBAAA;IAAM,QAAA,EAAA,OAAA,GAAA,QAAA,GAAA,MAAA;IAiBP,EAAA,CAAA,EAAA,MAAO;;;;EC5tBP,QAAA,EAAA,MAAA,EAAc;AAsG1B,CAAA;KQ1EK,oBAAA;iBAinBW,qCAAA,QACL,kBACG,oBACD,+BACC,4BPlpBd;;EALY,gBAAA,CAAA,EO0pBe,oBP1pBI;AAK/B,CAAA,CAAA,EAAY;EAcA,MAAA,EOyoBC,oBPzoBmB;EACV,QAAA,EOwoBuB,OPxoBvB,EAAA;CACC;AACH,iBOmrBJ,6BAAA,CPnrBI,IAAA,EOorBV,iBPprBU,EAAA,EAAA,IA2GpB,CA3GoB,EAAA;EACL,IAAA,CAAA,EAAA,YAAA,GAAA,wBAAA;EAAe,gBAAA,CAAA,EOorBkD,oBPprBlD;AA0G9B,CAAA,CAAA,EAAa;EAwDA,MAAA,EOmhBA,oBP1eZ;YO0e4C"}
1
+ {"version":3,"file":"index.d.mts","names":[],"sources":["../src/segmentation/fuzzy.ts","../src/segmentation/types.ts","../src/segmentation/optimize-rules.ts","../src/segmentation/pattern-validator.ts","../src/segmentation/replace.ts","../src/segmentation/segmenter.ts","../src/segmentation/tokens.ts","../src/analysis/line-starts.ts","../src/analysis/repeating-sequences.ts","../src/detection.ts","../src/recovery.ts"],"sourcesContent":[],"mappings":";;AAkEA;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBY,cDnUC,WCmUG,EAAA,CAAA,CAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAoChB;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2DA;AAAyC,cJ0F5B,wBI1F4B,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;AJLzC;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;KArVK,YAAA,GAqViD;EAAe;EAkBzD,KAAA,EAAI,MAAA;AAoChB,CAAA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;KF0BK,eAAA,GExBkB;EACH;EACL,QAAA,EAAA,MAAA;CAAe;AA0G9B;AAwDA;;;;AC1LA;AA2DA;;;;;;;;ACyPA;;;;;;;;AC5PA;AA0QA;AAsDA;AA2CA,KLvVK,qBAAA,GK0VJ;EAQW;EAuKC,cAAA,EAAA,MAAA,EAAA;AA6Cb,CAAA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;AC/wBA;AAcA;AAEA;AAwQA;;;;;;;;AClRA;AAaA;AAOA;AA2OA;;;;KPjKK,sBAAA,GOoKsB;;;;ACjR3B;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC7MA;AAKA;;;;;;AAOA;AA2BE;AAmnBF;;KT/gBK,mBAAA,GSihBS;EACD;EACC,YAAA,EAAA,MAAA,EAAA;CAGa;;;;AA+C3B;;;;;;;KTtjBK,WAAA,GACC,eACA,kBACA,wBACA,yBACA;;;;;;;;;;;;;;;;cAiBO;;;;;;KAOD,cAAA,WAAyB;;;;;;;KAYhC,aAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAiDO,SAAA;;;;;;;KAYP,eAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAyCS;;;;;;;;;;;;SAaH;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAuDC,SAAA,GAAY,cAAc,gBAAgB;;;;;;;;;;;;;KAkB1C,IAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAoCA,cAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAqCE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqCF,UAAA,YAAsB;;;;;;;;;;;;;;;;;;;;;;;;;UA8BjB,MAAA;;;;;;;;;;;;;;;;;;;;;;KAuBL,WAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;KA+CA,mBAAA;;;;;;YAME;;;;;;;;UASF;;;;;;;;;;;;;;cAiBY;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;gBA4DN;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;WAwDL;;;;;;;;;;;;;;;;KAiBD,OAAA;;;;;;;;;;;;;;;;;;;;;;;;;;SA6BD;;;;AAxoBa;;;AAkBlB,KChJM,cAAA,GDgJN;EACA;EACA,KAAA,EChJK,SDgJL,EAAA;EAAmB;EAiBZ,WAAA,EAAA,MAAA;AAOb,CAAA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;cAAa,uBAAwB,gBAAc;;;AD3FlC;AA4BG;AA8BM;AAyDrB,KEpIO,mBAAA,GFoIY,gBAAA,GAAA,eAAA,GAAA,WAAA,GAAA,eAAA;AAAA;;;AAkBlB,KEjJM,eAAA,GFiJN;EACA,IAAA,EEjJI,mBFiJJ;EACA,OAAA,EAAA,MAAA;EAAmB,UAAA,CAAA,EAAA,MAAA;EAiBZ;EAOD,KAAA,CAAA,EAAA,MAAA;EAYP;EAiDO,OAAA,CAAA,EAAA,MAAS;AAA6B,CAAA;AAyHlD;;;;AAAqE,KEnVzD,oBAAA,GFmVyD;EAkBzD,cAAI,CAAA,EAAA,CEpWM,eFoWN,GAAA,SAAA,CAAA,EAAA;EAoCJ,eAAA,CAAA,EAAc,CEvYH,eF4aT,GAAA,SAAS,CAAA,EAAA;EAqCX,YAAA,CAAU,EAAA,CEhdF,eFgdc,GAAA,SAAc,CAAA,EAAA;EA8B/B,QAAA,CAAM,EE7eR,eF6eQ;AAuBvB,CAAA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;AACsB,cA6GT,aA7GS,EAAA,CAAA,KAAA,EA6Ge,SA7Gf,EAAA,EAAA,GAAA,CA6G8B,oBA7G9B,GAAA,SAAA,CAAA,EAAA;;;;;AA6GtB;AAwDA;;;;AC1LA;AA2DA;;;AAAyE,cD+H5D,sBC/H4D,EAAA,CAAA,OAAA,EAAA,CD+HxB,oBC/HwB,GAAA,SAAA,CAAA,EAAA,EAAA,GAAA,MAAA,EAAA;;;AJLzE;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAuCtB,KGpJO,WAAA,GAAc,WHoJV,CGpJsB,mBHoJtB,CAAA,SAAA,CAAA,CAAA,CAAA,MAAA,CAAA;;;;;;;AAsBhB;AAOA;AAAgE;AA6DhE;AAYK,cG/LQ,iBHwOC,EAAA,CAAA,KAaH,EGrP8B,IHqPxB,EAAA,EAAA,KAAA,CAAA,EGrPwC,WHqPxC,EAAA,EAAA,GGrPwD,IHqPxD,EAAA;;;;;;;AAyEjB;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2DA;;;AAAyE,cCyP5D,YDzP4D,EAAA,CAAA,KAAA,ECyPrC,IDzPqC,EAAA,EAAA,OAAA,ECyPpB,mBDzPoB,EAAA,GCyPD,ODzPC,EAAA;;;;AJLzE;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2Da,cEHA,sBFyBZ,EAAA,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;;;;;ACmOD;;;;
;;;cCca;AA1Qb;AA0QA;AAsDA;AA2CA;AAWA;AAuKA;AA6CA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;AC/wBA;AAcA;AAEA;AAwQA;;;;;cDuFa,gBAAgB;;;AEzW7B;AAaA;AAOA;AA2OA;;;;;;;;AC9QA;AA+EA;AAgEa,cHoRA,cGrQZ,EAAA,CAAA,KAfgE,EAAA,MAAA,EAAA,GAAA,OAAe;AAuBhF;AAiCA;;;;AC7MA;AAKY,KJ+aA,YAAA,GI/aiB;EAChB;;;;;EAMD,OAAA,EAAA,MAAA;EA6BP;AAinBL;;;;EAIc,YAAA,EAAA,MAAA,EAAA;EAGa;;;;AA+C3B;EACU,WAAA,EAAA,OAAA;CACsE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;cJvHnE,mHAIV;;;;;;;;;;;;;;;;;;;;cAyCU;;;;;;;;;;;;;;;;;;;;;;cAuBA,uCAAmC;;;;;;;;;;;;;cAqBnC;;;;;;;;;;;;;;;cAgBA;;;;;;;;;;;;;;;cA8BA;;;;KAWD,YAAA;;;;;;;;;;;;;;;;;;;;;;cAoBC,iDAAkD;;;;;;;;;;;;;;;;cA6BlD;;;ANluBA,KO7CD,wBAAA,GP6C8E;EA+F7E,IAAA,CAAA,EAAA,MAAA;;;;ECnIR,WAAA,CAAA,EAAA,MAAY;EA4BZ,wBAAe,CAAA,EAAA,OAAA;EA8Bf,yBAAqB,CAAA,EAAA,OAAA;EAiCrB,MAAA,CAAA,EAAA,aAAA,GAAsB,OAAA;EAwBtB,UAAA,CAAA,EAAA,CAAA,IAAA,EAAA,MAAmB,EAAA,MAAA,EAAA,MAAA,EAAA,GAAA,OAAA;EAenB,cAAW,CAAA,EMjIK,MNiIL,EAAA;EACV,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;CACA;AACA,KMhIM,uBAAA,GNgIN;EACA,IAAA,EAAA,MAAA;EACA,MAAA,EAAA,MAAA;CAAmB;AAiBZ,KMjJD,sBAAA,GNiJwG;EAOxG,OAAA,EAAA,MAAA;EAYP,KAAA,EAAA,MAAA;EAiDO,QAAA,EMlNE,uBNkNO,EAAA;AAA6B,CAAA;AAyHlD;;;AAAsD,cMtEzC,uBNsEyC,EAAA,CAAA,KAAA,EMrE3C,INqE2C,EAAA,EAAA,OAAA,CAAA,EMpEzC,wBNoEyC,EAAA,GMnEnD,sBNmEmD,EAAA;;;AAlOjD,KOtHO,wBAAA,GPsHY;EAenB,WAAA,CAAA,EAAW,MAAA;EACV,WAAA,CAAA,EAAA,MAAA;EACA,QAAA,CAAA,EAAA,MAAA;EACA,IAAA,CAAA,EAAA,MAAA;EACA,yBAAA,CAAA,EAAA,OAAA;EACA,YAAA,CAAA,EAAA,OAAA;EAAmB,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;EAiBZ,WAAA,CAAA,EAAA,MAAuG;EAOxG,YAAA,CAAA,EAAA,MAAc;EAYrB,iBAAa,CAAA,EAAA,MAAA;AAiDlB,CAAA;AAYK,KO9NO,wBAAA,GPuQE;EAoEF,IAAA,EAAA,MAAS;EAAG,OAAA,EAAA,MAAA;EAAc,MAAA,EAAA,MAAA;EAAgB,YAAA,EAAA,MAAA,EAAA;CAAe;AAkBzD,KOtVA,wBAAA,GPsVI;EAoCJ,OAAA,EAAA,MAAA;EA0EA,KAAA,EAAA,MAAU;EA8BL,QAAA,EO/dH,wBP+dS,EAAA;AAuBvB,CAAA;;;;;;AAoNA;cOlea,mCACF,kBACG,6BACX;;;;AR3NH;AA+FA;;;;;ACnIiB;AA4BG;AA+Df,KQ7GO,eAAA,GR6Ge;EAwBtB;EAeA,KAAA,EAAA,MAAA;EACC;EACA,KAAA,EAAA,MAAA;EACA;EACA,KAAA,EAAA,MAAA;EACA;EAAmB,QAAA,EAAA,MAAA;AAiBzB,CAAA;AAOA;AAAgE;AA6
DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;AAMc,cQlgBD,mBRkgBC,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GQlgBkC,eRkgBlC,EAAA;;;;;;AA+Jd;;;;ACzuBA;AAsGA;;;cOkCa,mDAAoD;AN9IjE;AAKA;AAcA;;;;AAIe,cM8IF,oBN9IE,EAAA,CAAA,QAAA,EM+ID,eN/IC,EAAA,EAAA,GAAA;EAAe,WAAA,EAAA,gBAAA,GAAA,iBAAA;EA0GjB,KAAA,EAAA,OAAA;EAwDA,QAAA,CAAA,EAAA,MAAA;;;;AC1Lb;AA2DA;;;AAAyE,cK4I5D,kBL5I4D,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA;EAAI,QAAA,EAAA,MAAA;;;;ECyPhE,QAAA,EItGC,eJgLb,EAAA;CA1EmC,GAAA,IAAA;;;AL9PvB,KU5DD,sBAAA,GV4D8E;EA+F7E,IAAA,EAAA,cAAA;;;;ECnIR,KAAA,CAAA,EAAA,OAAY,GAAA,YAAA;EA4BZ,QAAA,EAAA,MAAA,EAAe;AAAA,CAAA,GA8Bf;EAiCA,IAAA,EAAA,WAAA;EAwBA,SAAA,EAAA,CAAA,IAAA,ESxIwC,STwIrB,EAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;AAAA,CAAA;AAgBlB,KStJM,iBAAA,GTsJN;EACA,OAAA,EStJO,mBTsJP;EACA,KAAA,EStJK,ITsJL,EAAA;EACA,QAAA,EStJQ,OTsJR,EAAA;EACA,QAAA,EStJQ,sBTsJR;CAAmB;AAiBZ,KSpKD,oBAAA,GToKwG;EAOxG,OAAA,EAAA;IAYP,IAAA,EAAA,YAAa,GAAA,wBAAA;IAiDN,SAAS,EAAA,MAAA;IAYhB,aAAA,EAAe,MAAA;IA6GR,SAAS,EAAA,MAAA;IAAG,UAAA,EAAA,MAAA;EAAc,CAAA;EAAgB,KAAA,CAAA,ESzV1C,KTyV0C,CAAA;IAAe,SAAA,EAAA,MAAA;IAkBzD,QAAI,EAAA,MAAA;IAoCJ,aAAc,EAAA,MAAA;IA0Ed,UAAU,EAAA,MAAA;EA8BL,CAAA,CAAA;EAuBL,OAAA,ESxgBC,KTwgBU,CAAA;IA+CX,IAAA,EAAA,MAAA;IAME,oBAAA,EAAA,MAAA;IASF,sBAAA,CAAA,EAAA,MAAA;IAiBY,qBAAA,CAAA,EAAA,MAAA;IA4DN,YAAA,EAAA,MAAA;IAwDL,MAAA,EAAA,WAAA,GAAA,oBAAA,GAAA,WAAA,GAAA,sBAAA,GAAA,qBAAA;IAAM,QAAA,EAAA,OAAA,GAAA,QAAA,GAAA,MAAA;IAiBP,EAAA,CAAA,EAAA,MAAO;;;;ECzuBP,QAAA,EAAA,MAAA,EAAc;AAsG1B,CAAA;KQ1EK,oBAAA;iBAinBW,qCAAA,QACL,kBACG,oBACD,+BACC,4BPlpBd;;EALY,gBAAA,CAAA,EO0pBe,oBP1pBI;AAK/B,CAAA,CAAA,EAAY;EAcA,MAAA,EOyoBC,oBPzoBmB;EACV,QAAA,EOwoBuB,OPxoBvB,EAAA;CACC;AACH,iBOmrBJ,6BAAA,CPnrBI,IAAA,EOorBV,iBPprBU,EAAA,EAAA,IA2GpB,CA3GoB,EAAA;EACL,IAAA,CAAA,EAAA,YAAA,GAAA,wBAAA;EAAe,gBAAA,CAAA,EOorBkD,oBPprBlD;AA0G9B,CAAA,CAAA,EAAa;EAwDA,MAAA,EOmhBA,oBP1eZ;YO0e4C"}
package/dist/index.mjs CHANGED
@@ -1319,12 +1319,12 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
1319
1319
  const MAX_DEVIATION = 2e3;
1320
1320
  if (bestDistance <= MAX_DEVIATION) return bestCandidate.pos;
1321
1321
  logger?.debug?.("[breakpoints] findPageStartNearExpectedBoundary: Rejected match exceeding deviation", {
1322
- targetPageIdx,
1323
- expectedBoundary,
1324
1322
  bestDistance,
1325
- maxDeviation: MAX_DEVIATION,
1323
+ expectedBoundary,
1326
1324
  matchPos: bestCandidate.pos,
1327
- prefixLength: len
1325
+ maxDeviation: MAX_DEVIATION,
1326
+ prefixLength: len,
1327
+ targetPageIdx
1328
1328
  });
1329
1329
  }
1330
1330
  }
@@ -1533,6 +1533,34 @@ const findBreakPosition = (remainingContent, currentFromIdx, toIdx, windowEndIdx
1533
1533
  }
1534
1534
  return null;
1535
1535
  };
1536
+ /**
1537
+ * Searches backward from a target position to find a "safe" split point.
1538
+ * A safe split point is after whitespace or punctuation.
1539
+ *
1540
+ * @param content The text content
1541
+ * @param targetPosition The desired split position (hard limit)
1542
+ * @param lookbackChars How far back to search for a safe break
1543
+ * @returns The new split position (index), or -1 if no safe break found
1544
+ */
1545
+ const findSafeBreakPosition = (content, targetPosition, lookbackChars = 100) => {
1546
+ const startSearch = Math.max(0, targetPosition - lookbackChars);
1547
+ for (let i = targetPosition - 1; i >= startSearch; i--) {
1548
+ const char = content[i];
1549
+ if (/[\s\n.,;!?؛،۔]/.test(char)) return i + 1;
1550
+ }
1551
+ return -1;
1552
+ };
1553
+ /**
1554
+ * Ensures the position does not split a surrogate pair.
1555
+ * If position is between High and Low surrogate, returns position - 1.
1556
+ */
1557
+ const adjustForSurrogate = (content, position) => {
1558
+ if (position <= 0 || position >= content.length) return position;
1559
+ const high = content.charCodeAt(position - 1);
1560
+ const low = content.charCodeAt(position);
1561
+ if (high >= 55296 && high <= 56319 && low >= 56320 && low <= 57343) return position - 1;
1562
+ return position;
1563
+ };
1536
1564
 
1537
1565
  //#endregion
1538
1566
  //#region src/segmentation/debug-meta.ts
@@ -1659,7 +1687,7 @@ const createPieceSegment = (pieceContent, actualStartIdx, actualEndIdx, pageIds,
1659
1687
  *
1660
1688
  * @returns Break offset relative to remainingContent, or windowEndPosition as fallback
1661
1689
  */
1662
- const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer) => {
1690
+ const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer, maxContentLength) => {
1663
1691
  if (hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, windowEndIdx)) {
1664
1692
  const exclusionBreak = findExclusionBreakPosition(currentFromIdx, windowEndIdx, toIdx, pageIds, expandedBreakpoints, cumulativeOffsets);
1665
1693
  if (exclusionBreak > 0) return { breakOffset: exclusionBreak };
@@ -1675,6 +1703,11 @@ const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx
1675
1703
  breakpointIndex: patternMatch.breakpointIndex,
1676
1704
  breakpointRule: patternMatch.rule
1677
1705
  };
1706
+ if (maxContentLength && windowEndPosition === maxContentLength) {
1707
+ const safeOffset = findSafeBreakPosition(remainingContent, windowEndPosition);
1708
+ if (safeOffset !== -1) return { breakOffset: safeOffset };
1709
+ return { breakOffset: adjustForSurrogate(remainingContent, windowEndPosition) };
1710
+ }
1678
1711
  return { breakOffset: windowEndPosition };
1679
1712
  };
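The order of the fallbacks added in this hunk can be sketched in isolation. The `findSafeBreakPosition` below is an assumed reconstruction: only the tail of that function is visible in this diff, so the 100-character lookback (taken from the AGENTS.md description) and the loop bounds are guesses. `pickFallbackBreak` is a hypothetical name for the new `if` branch, which in the real code only runs when `windowEndPosition === maxContentLength`.

```javascript
// Assumed reconstruction: scan backward from the window end for whitespace
// or punctuation, giving up after `lookback` characters.
const findSafeBreakPosition = (content, targetPosition, lookback = 100) => {
    const start = Math.max(0, targetPosition - lookback);
    for (let i = targetPosition - 1; i >= start; i--) {
        if (/[\s\n.,;!?؛،۔]/.test(content[i])) return i + 1;
    }
    return -1;
};

// Same logic as adjustForSurrogate in the diff.
const adjustForSurrogate = (content, position) => {
    if (position <= 0 || position >= content.length) return position;
    const high = content.charCodeAt(position - 1);
    const low = content.charCodeAt(position);
    if (high >= 0xd800 && high <= 0xdbff && low >= 0xdc00 && low <= 0xdfff) return position - 1;
    return position;
};

// Hypothetical helper mirroring the branch taken when no breakpoint pattern matched.
const pickFallbackBreak = (remainingContent, windowEndPosition) => {
    const safeOffset = findSafeBreakPosition(remainingContent, windowEndPosition);
    if (safeOffset !== -1) return safeOffset; // linguistic fallback
    return adjustForSurrogate(remainingContent, windowEndPosition); // technical fallback
};

console.log(pickFallbackBreak("hello world again", 14)); // 12 (just after the second space)
console.log(pickFallbackBreak("abcdef😀gh", 7)); // 6 (moved back off the surrogate pair)
```

The linguistic search is tried first because a split at whitespace or punctuation is also guaranteed not to fall inside a surrogate pair, so the surrogate adjustment is only needed when that search fails.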
1680
1713
  /**
@@ -1691,7 +1724,7 @@ const skipWhitespace$1 = (content, startPos) => {
1691
1724
  *
1692
1725
  * Uses precomputed boundary positions for O(log n) page attribution lookups.
1693
1726
  */
1694
- const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey) => {
1727
+ const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey, maxContentLength) => {
1695
1728
  const result = [];
1696
1729
  const fullContent = segment.content;
1697
1730
  let cursorPos = 0;
@@ -1705,57 +1738,104 @@ const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPag
1705
1738
  fullContentLength: fullContent.length,
1706
1739
  toIdx
1707
1740
  });
1708
- const maxIterations = 1e4;
1709
- for (let i = 0; i < maxIterations && cursorPos < fullContent.length && currentFromIdx <= toIdx; i++) {
1741
+ let i = 0;
1742
+ const MAX_SAFE_ITERATIONS = 1e5;
1743
+ while (cursorPos < fullContent.length && currentFromIdx <= toIdx && i < MAX_SAFE_ITERATIONS) {
1744
+ i++;
1710
1745
  const remainingContent = fullContent.slice(cursorPos);
1711
1746
  if (!remainingContent.trim()) break;
1712
- const remainingSpan = computeRemainingSpan(currentFromIdx, toIdx, pageIds);
1713
- const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
1714
- if (remainingSpan <= maxPages && !remainingHasExclusions) {
1715
- const includeMeta = isFirstPiece || Boolean(debugMetaKey);
1716
- const meta = debugMetaKey && lastBreakpoint ? mergeDebugIntoMeta(includeMeta ? segment.meta : void 0, debugMetaKey, buildBreakpointDebugPatch(lastBreakpoint.breakpointIndex, lastBreakpoint.rule)) : includeMeta ? segment.meta : void 0;
1717
- const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, meta, includeMeta);
1718
- if (finalSeg) result.push(finalSeg);
1719
- break;
1720
- }
1747
+ if (handleOversizedSegmentFit(remainingContent, currentFromIdx, toIdx, pageIds, expandedBreakpoints, maxPages, maxContentLength, isFirstPiece, debugMetaKey, segment.meta, lastBreakpoint, result)) break;
1721
1748
  const windowEndIdx = computeWindowEndIdx(currentFromIdx, toIdx, pageIds, maxPages);
1722
- const windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger);
1749
+ const windowEndPosition = getWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, maxContentLength, logger);
1723
1750
  logger?.debug?.(`[breakpoints] iteration=${i}`, {
1724
1751
  currentFromIdx,
1725
1752
  cursorPos,
1726
- windowEndIdx
1753
+ windowEndIdx,
1754
+ windowEndPosition
1727
1755
  });
1728
- const found = findBreakOffsetForWindow(remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer);
1756
+ const found = findBreakOffsetForWindow(remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer, maxContentLength);
1757
+ let breakOffset = found.breakOffset;
1758
+ if (breakOffset <= 0) {
1759
+ const fallbackPos = maxContentLength ? Math.min(maxContentLength, remainingContent.length) : 1;
1760
+ breakOffset = Math.max(1, fallbackPos);
1761
+ logger?.warn?.("[breakpoints] No progress from findBreakOffsetForWindow; forcing forward movement", {
1762
+ breakOffset,
1763
+ cursorPos
1764
+ });
1765
+ }
1729
1766
  if (found.breakpointIndex !== void 0 && found.breakpointRule) lastBreakpoint = {
1730
1767
  breakpointIndex: found.breakpointIndex,
1731
1768
  rule: found.breakpointRule
1732
1769
  };
1733
- const breakPos = cursorPos + found.breakOffset;
1770
+ const breakPos = cursorPos + breakOffset;
1734
1771
  const pieceContent = fullContent.slice(cursorPos, breakPos).trim();
1735
- const { actualEndIdx, actualStartIdx } = computePiecePages(cursorPos, breakPos, boundaryPositions, fromIdx, toIdx);
1736
- logger?.trace?.("[breakpoints] piece", {
1737
- actualEndIdx,
1738
- actualStartIdx,
1739
- pieceLength: pieceContent.length
1740
- });
1741
1772
  if (pieceContent) {
1742
- const includeMeta = isFirstPiece || Boolean(debugMetaKey);
1743
- const pieceSeg = createPieceSegment(pieceContent, actualStartIdx, actualEndIdx, pageIds, debugMetaKey && lastBreakpoint ? mergeDebugIntoMeta(includeMeta ? segment.meta : void 0, debugMetaKey, buildBreakpointDebugPatch(lastBreakpoint.breakpointIndex, lastBreakpoint.rule)) : includeMeta ? segment.meta : void 0, includeMeta);
1773
+ const { actualEndIdx, actualStartIdx } = computePiecePages(cursorPos, breakPos, boundaryPositions, fromIdx, toIdx);
1774
+ const pieceSeg = createPieceSegment(pieceContent, actualStartIdx, actualEndIdx, pageIds, getSegmentMetaWithDebug(isFirstPiece, debugMetaKey, segment.meta, lastBreakpoint), true);
1744
1775
  if (pieceSeg) result.push(pieceSeg);
1745
- }
1746
- cursorPos = skipWhitespace$1(fullContent, breakPos);
1747
- currentFromIdx = computeNextFromIdx(fullContent.slice(cursorPos), actualEndIdx, toIdx, pageIds, normalizedPages);
1776
+ const next = advanceCursorAndIndex(fullContent, breakPos, actualEndIdx, toIdx, pageIds, normalizedPages);
1777
+ cursorPos = next.cursorPos;
1778
+ currentFromIdx = next.currentFromIdx;
1779
+ } else cursorPos = breakPos;
1748
1780
  isFirstPiece = false;
1749
1781
  }
1782
+ if (i >= MAX_SAFE_ITERATIONS) logger?.error?.("[breakpoints] Stopped processing oversized segment: reached MAX_SAFE_ITERATIONS", {
1783
+ cursorPos,
1784
+ fullContentLength: fullContent.length,
1785
+ iterations: i
1786
+ });
1750
1787
  logger?.debug?.("[breakpoints] done", { resultCount: result.length });
1751
1788
  return result;
1752
1789
  };
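The forward-progress logic in the loop above can be illustrated with a much smaller splitter. This is a simplified sketch under stated assumptions, not the package's implementation: `splitByLength`, `findBreak`, and `stuckFinder` are hypothetical names, pages and breakpoints are ignored, and only the two safeguards are kept (forcing `breakOffset` to at least 1, and capping the number of iterations).

```javascript
const MAX_SAFE_ITERATIONS = 1e5;

// findBreak(remaining, maxLen) returns an offset into `remaining`.
// It is allowed to misbehave and return a value <= 0.
const splitByLength = (text, maxLen, findBreak) => {
    const pieces = [];
    let cursorPos = 0;
    let i = 0;
    while (cursorPos < text.length && i < MAX_SAFE_ITERATIONS) {
        i++;
        const remaining = text.slice(cursorPos);
        if (!remaining.trim()) break;
        if (remaining.length <= maxLen) {
            // Everything left fits: emit the final piece and stop.
            pieces.push(remaining.trim());
            break;
        }
        let breakOffset = findBreak(remaining, maxLen);
        if (breakOffset <= 0) {
            // No progress reported: force forward movement, as the diff does.
            breakOffset = Math.max(1, Math.min(maxLen, remaining.length));
        }
        const piece = remaining.slice(0, breakOffset).trim();
        if (piece) pieces.push(piece);
        cursorPos += breakOffset; // strictly increases on every iteration
    }
    return pieces;
};

// A deliberately broken finder that never makes progress.
const stuckFinder = () => 0;
console.log(splitByLength("abcdefghij", 4, stuckFinder)); // [ 'abcd', 'efgh', 'ij' ]
```

Since `cursorPos` grows by at least one on each pass, the loop terminates even with a misbehaving finder; the iteration cap is a second line of defense rather than the primary guarantee.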
1753
1790
  /**
1791
+ * Checks if the remaining content fits within page/length limits.
1792
+ * If so, pushes the final segment and returns true.
1793
+ */
1794
+ const handleOversizedSegmentFit = (remainingContent, currentFromIdx, toIdx, pageIds, expandedBreakpoints, maxPages, maxContentLength, isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint, result) => {
1795
+ const remainingSpan = computeRemainingSpan(currentFromIdx, toIdx, pageIds);
1796
+ const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
1797
+ const fitsInPages = remainingSpan <= maxPages;
1798
+ const fitsInLength = !maxContentLength || remainingContent.length <= maxContentLength;
1799
+ if (fitsInPages && fitsInLength && !remainingHasExclusions) {
1800
+ const includeMeta = isFirstPiece || Boolean(debugMetaKey);
1801
+ const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, getSegmentMetaWithDebug(isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint), includeMeta);
1802
+ if (finalSeg) result.push(finalSeg);
1803
+ return true;
1804
+ }
1805
+ return false;
1806
+ };
1807
+ /**
1808
+ * Builds metadata for a segment piece, optionally including debug info.
1809
+ */
1810
+ const getSegmentMetaWithDebug = (isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint) => {
1811
+ if (!(isFirstPiece || Boolean(debugMetaKey))) return;
1812
+ if (debugMetaKey && lastBreakpoint) return mergeDebugIntoMeta(isFirstPiece ? originalMeta : void 0, debugMetaKey, buildBreakpointDebugPatch(lastBreakpoint.breakpointIndex, lastBreakpoint.rule));
1813
+ return isFirstPiece ? originalMeta : void 0;
1814
+ };
1815
+ /**
1816
+ * Calculates window end position, capped by maxContentLength if present.
1817
+ */
1818
+ const getWindowEndPosition = (remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, maxContentLength, logger) => {
1819
+ let windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger);
1820
+ if (maxContentLength && maxContentLength < windowEndPosition) windowEndPosition = maxContentLength;
1821
+ return windowEndPosition;
1822
+ };
1823
+ /**
1824
+ * Advances cursorPos and currentFromIdx for the next iteration.
1825
+ */
1826
+ const advanceCursorAndIndex = (fullContent, breakPos, actualEndIdx, toIdx, pageIds, normalizedPages) => {
1827
+ const nextCursorPos = skipWhitespace$1(fullContent, breakPos);
1828
+ return {
1829
+ currentFromIdx: computeNextFromIdx(fullContent.slice(nextCursorPos), actualEndIdx, toIdx, pageIds, normalizedPages),
1830
+ cursorPos: nextCursorPos
1831
+ };
1832
+ };
1833
+ /**
1754
1834
  * Applies breakpoints to oversized segments.
1755
1835
  *
1756
1836
  * Note: This is an internal engine used by `segmentPages()`.
1757
1837
  */
1758
- const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner = "space", debugMetaKey) => {
1838
+ const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner = "space", debugMetaKey, maxContentLength) => {
1759
1839
  const pageIds = pages.map((p) => p.id);
1760
1840
  const pageIdToIndex = buildPageIdToIndexMap(pageIds);
1761
1841
  const normalizedPages = buildNormalizedPagesMap(pages, normalizedContent);
@@ -1779,11 +1859,13 @@ const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoi
1779
1859
  const toIdx = segment.to !== void 0 ? pageIdToIndex.get(segment.to) ?? fromIdx : fromIdx;
1780
1860
  const segmentSpan = (segment.to ?? segment.from) - segment.from;
1781
1861
  const hasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, fromIdx, toIdx);
1782
- if (segmentSpan <= maxPages && !hasExclusions) {
1862
+ const fitsInPages = segmentSpan <= maxPages;
1863
+ const fitsInLength = !maxContentLength || segment.content.length <= maxContentLength;
1864
+ if (fitsInPages && fitsInLength && !hasExclusions) {
1783
1865
  result.push(segment);
1784
1866
  continue;
1785
1867
  }
1786
- const broken = processOversizedSegment(segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey);
1868
+ const broken = processOversizedSegment(segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey, maxContentLength);
1787
1869
  result.push(...broken.map((s) => {
1788
1870
  const segFromIdx = pageIdToIndex.get(s.from) ?? -1;
1789
1871
  const segToIdx = s.to !== void 0 ? pageIdToIndex.get(s.to) ?? segFromIdx : segFromIdx;
@@ -1995,7 +2077,7 @@ const buildLineStartsAfterRegexSource = (patterns, fuzzy, capturePrefix) => {
1995
2077
  const union = processed.map((p) => p.pattern).join("|");
1996
2078
  return {
1997
2079
  captureNames: processed.flatMap((p) => p.captureNames),
1998
- regex: `^(?:${union})${capturePrefix ? `(?<${capturePrefix}__content>.*)` : "(.*)"}`
2080
+ regex: `^[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*(?:${union})${capturePrefix ? `(?<${capturePrefix}__content>.*)` : "(.*)"}`
1999
2081
  };
2000
2082
  };
2001
2083
  const buildLineStartsWithRegexSource = (patterns, fuzzy, capturePrefix) => {
@@ -2003,7 +2085,7 @@ const buildLineStartsWithRegexSource = (patterns, fuzzy, capturePrefix) => {
2003
2085
  const union = processed.map((p) => p.pattern).join("|");
2004
2086
  return {
2005
2087
  captureNames: processed.flatMap((p) => p.captureNames),
2006
- regex: `^(?:${union})`
2088
+ regex: `^[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*(?:${union})`
2007
2089
  };
2008
2090
  };
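The effect of the new `[\u200E\u200F\u061C\u200B\uFEFF]*` prefix can be checked directly. The sketch below builds two regexes by hand, one in the old form and one in the new form. The pattern `باب`, the sample line, and the constant name `INVISIBLE_PREFIX` are illustrative assumptions, and the `m`/`u` flags are chosen for the demo rather than taken from the package.

```javascript
// Invisible characters tolerated at line start:
// U+200E LRM, U+200F RLM, U+061C ALM, U+200B ZWSP, U+FEFF BOM/ZWNBSP.
const INVISIBLE_PREFIX = "[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*";

const union = "باب"; // hypothetical processed pattern
const oldRegex = new RegExp(`^(?:${union})`, "mu");
const newRegex = new RegExp(`^${INVISIBLE_PREFIX}(?:${union})`, "mu");

// A line that starts with a right-to-left mark before the visible text.
const line = "\u200Fباب الصلاة";

console.log(oldRegex.test(line)); // false
console.log(newRegex.test(line)); // true
```

Lines without any leading marks still match the new form, because the character class is quantified with `*`.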
2009
2091
  const buildLineEndsWithRegexSource = (patterns, fuzzy, capturePrefix) => {
@@ -2246,6 +2328,47 @@ const createPageStartGuardChecker = (matchContent, pageMap) => {
2246
2328
  return prevReq.test(lastChar);
2247
2329
  };
2248
2330
  };
2331
+ /**
2332
+ * Checks if a pageId matches the min/max/exclude constraints of a rule.
2333
+ */
2334
+ const passesRuleConstraints$1 = (rule, pageId) => {
2335
+ return (rule.min === void 0 || pageId >= rule.min) && (rule.max === void 0 || pageId <= rule.max) && !isPageExcluded(pageId, rule.exclude);
2336
+ };
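A short example may help show what the extracted constraint check accepts and rejects. The real `isPageExcluded` is not part of this diff, so `isPageExcludedSketch` below is a hypothetical stand-in that assumes `exclude` is a flat list of page ids; the actual helper may support other shapes.

```javascript
// Hypothetical stand-in: treats `exclude` as a flat list of page ids.
const isPageExcludedSketch = (pageId, exclude = []) => exclude.includes(pageId);

// Same boolean structure as passesRuleConstraints$1 in the diff.
const passesRuleConstraints = (rule, pageId) =>
    (rule.min === undefined || pageId >= rule.min) &&
    (rule.max === undefined || pageId <= rule.max) &&
    !isPageExcludedSketch(pageId, rule.exclude);

const rule = { min: 10, max: 20, exclude: [15] };
console.log([9, 10, 15, 20, 21].map((id) => passesRuleConstraints(rule, id)));
// [ false, true, false, true, false ]
```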
2337
+ /**
2338
+ * Records a split point for a specific rule.
2339
+ */
2340
+ const recordSplitPointAt = (splitPointsByRule, ruleIndex, sp) => {
2341
+ const arr = splitPointsByRule.get(ruleIndex);
2342
+ if (!arr) {
2343
+ splitPointsByRule.set(ruleIndex, [sp]);
2344
+ return;
2345
+ }
2346
+ arr.push(sp);
2347
+ };
2348
+ /**
2349
+ * Processes matches for all fast-fuzzy rules at a specific line start.
2350
+ */
2351
+ const processFastFuzzyMatchesAt = (matchContent, lineStart, pageId, fastFuzzyRules, passesPageStartGuard, isPageStart, splitPointsByRule) => {
2352
+ for (const { compiled, kind, rule, ruleIndex } of fastFuzzyRules) {
2353
+ if (!passesRuleConstraints$1(rule, pageId)) continue;
2354
+ if (isPageStart && !passesPageStartGuard(rule, ruleIndex, lineStart)) continue;
2355
+ const end = matchFastFuzzyTokenAt(matchContent, lineStart, compiled);
2356
+ if (end === null) continue;
2357
+ const splitIndex = (rule.split ?? "at") === "at" ? lineStart : end;
2358
+ if (kind === "startsWith") recordSplitPointAt(splitPointsByRule, ruleIndex, {
2359
+ index: splitIndex,
2360
+ meta: rule.meta
2361
+ });
2362
+ else {
2363
+ const markerLength = end - lineStart;
2364
+ recordSplitPointAt(splitPointsByRule, ruleIndex, {
2365
+ contentStartOffset: (rule.split ?? "at") === "at" ? markerLength : void 0,
2366
+ index: splitIndex,
2367
+ meta: rule.meta
2368
+ });
2369
+ }
2370
+ }
2371
+ };
2249
2372
  const collectFastFuzzySplitPoints = (matchContent, pageMap, fastFuzzyRules, passesPageStartGuard) => {
2250
2373
  const splitPointsByRule = /* @__PURE__ */ new Map();
2251
2374
  if (fastFuzzyRules.length === 0 || pageMap.boundaries.length === 0) return splitPointsByRule;
@@ -2257,38 +2380,12 @@ const collectFastFuzzySplitPoints = (matchContent, pageMap, fastFuzzyRules, pass
2257
2380
  currentBoundary = pageMap.boundaries[boundaryIdx];
2258
2381
  }
2259
2382
  };
2260
- const recordSplitPoint = (ruleIndex, sp) => {
2261
- const arr = splitPointsByRule.get(ruleIndex);
2262
- if (!arr) {
2263
- splitPointsByRule.set(ruleIndex, [sp]);
2264
- return;
2265
- }
2266
- arr.push(sp);
2267
- };
2268
2383
  const isPageStart = (offset) => offset === currentBoundary?.start;
2269
2384
  for (let lineStart = 0; lineStart <= matchContent.length;) {
2270
2385
  advanceBoundaryTo(lineStart);
2271
2386
  const pageId = currentBoundary?.id ?? 0;
2272
2387
  if (lineStart >= matchContent.length) break;
2273
- for (const { compiled, kind, rule, ruleIndex } of fastFuzzyRules) {
2274
- if (!((rule.min === void 0 || pageId >= rule.min) && (rule.max === void 0 || pageId <= rule.max) && !isPageExcluded(pageId, rule.exclude))) continue;
2275
- if (isPageStart(lineStart) && !passesPageStartGuard(rule, ruleIndex, lineStart)) continue;
2276
- const end = matchFastFuzzyTokenAt(matchContent, lineStart, compiled);
2277
- if (end === null) continue;
2278
- const splitIndex = (rule.split ?? "at") === "at" ? lineStart : end;
2279
- if (kind === "startsWith") recordSplitPoint(ruleIndex, {
2280
- index: splitIndex,
2281
- meta: rule.meta
2282
- });
2283
- else {
2284
- const markerLength = end - lineStart;
2285
- recordSplitPoint(ruleIndex, {
2286
- contentStartOffset: (rule.split ?? "at") === "at" ? markerLength : void 0,
2287
- index: splitIndex,
2288
- meta: rule.meta
2289
- });
2290
- }
2291
- }
2388
+ processFastFuzzyMatchesAt(matchContent, lineStart, pageId, fastFuzzyRules, passesPageStartGuard, isPageStart(lineStart), splitPointsByRule);
2292
2389
  const nextNl = matchContent.indexOf("\n", lineStart);
2293
2390
  if (nextNl === -1) break;
2294
2391
  lineStart = nextNl + 1;
@@ -2613,20 +2710,6 @@ const convertPageBreaks = (content, startOffset, pageBreaks) => {
2613
2710
  return content.replace(/\n/g, (match, offset) => breakSet.has(offset) ? " " : match);
2614
2711
  };
2615
2712
  /**
2616
- * Applies breakpoints to oversized segments.
2617
- *
2618
- * For each segment that spans more than maxPages, tries the breakpoint patterns
2619
- * in order to find a suitable split point. Structural markers (from rules) are
2620
- * always respected - segments are only broken within their boundaries.
2621
- *
2622
- * @param segments - Initial segments from rule processing
2623
- * @param pages - Original pages for page lookup
2624
- * @param maxPages - Maximum pages before breakpoints apply
2625
- * @param breakpoints - Patterns to try in order (tokens supported)
2626
- * @param prefer - 'longer' for last match, 'shorter' for first match
2627
- * @returns Processed segments with oversized ones broken up
2628
- */
2629
- /**
2630
2713
  * Segments pages of content based on pattern-matching rules.
2631
2714
  *
2632
2715
  * This is the main entry point for the segmentation engine. It takes an array
@@ -2669,11 +2752,14 @@ const convertPageBreaks = (content, startOffset, pageBreaks) => {
2669
2752
  * });
2670
2753
  */
2671
2754
  const segmentPages = (pages, options) => {
2672
- const { rules = [], maxPages = 0, breakpoints = [], prefer = "longer", pageJoiner = "space", logger } = options;
2755
+ const { rules = [], breakpoints = [], prefer = "longer", pageJoiner = "space", logger, maxContentLength } = options;
2756
+ if (maxContentLength && maxContentLength < 50) throw new Error(`maxContentLength must be at least 50 characters.`);
2757
+ const maxPages = options.maxPages ?? (maxContentLength ? Number.MAX_SAFE_INTEGER : 0);
2673
2758
  const debug = resolveDebugConfig(options.debug);
2674
2759
  const debugMetaKey = debug?.includeRule ? debug.metaKey : void 0;
2675
2760
  logger?.info?.("[segmenter] starting segmentation", {
2676
2761
  breakpointCount: breakpoints.length,
2762
+ maxContentLength,
2677
2763
  maxPages,
2678
2764
  pageCount: pages.length,
2679
2765
  prefer,
@@ -2701,10 +2787,10 @@ const segmentPages = (pages, options) => {
2701
2787
  }))
2702
2788
  });
2703
2789
  segments = ensureFallbackSegment(segments, processedPages, normalizedContent, pageJoiner);
2704
- if (maxPages >= 0 && breakpoints.length) {
2790
+ if ((maxPages >= 0 || maxContentLength && maxContentLength > 0) && breakpoints.length) {
2705
2791
  logger?.debug?.("[segmenter] applying breakpoints to oversized segments");
2706
2792
  const patternProcessor = (p) => processPattern(p, false).pattern;
2707
- const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner, debug?.includeBreakpoint ? debug.metaKey : void 0);
2793
+ const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner, debug?.includeBreakpoint ? debug.metaKey : void 0, maxContentLength);
2708
2794
  logger?.info?.("[segmenter] segmentation complete (with breakpoints)", { finalSegmentCount: result.length });
2709
2795
  return result;
2710
2796
  }
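How `maxContentLength` interacts with `maxPages` in the `segmentPages` changes above can be summarized by a small helper. `resolveLimits` is a hypothetical name used only for this sketch; it mirrors the added validation, the new `maxPages` default, and the gate that decides whether breakpoints run, without calling the real `segmentPages`.

```javascript
// Hypothetical helper that mirrors the option handling shown in the diff.
const resolveLimits = (options) => {
    const { breakpoints = [], maxContentLength } = options;
    if (maxContentLength && maxContentLength < 50) {
        throw new Error("maxContentLength must be at least 50 characters.");
    }
    // With only maxContentLength set, page span is effectively unlimited.
    const maxPages = options.maxPages ?? (maxContentLength ? Number.MAX_SAFE_INTEGER : 0);
    const appliesBreakpoints =
        (maxPages >= 0 || (maxContentLength && maxContentLength > 0)) && breakpoints.length > 0;
    return { appliesBreakpoints: Boolean(appliesBreakpoints), maxContentLength, maxPages };
};

const lengthOnly = resolveLimits({ breakpoints: ["\\n"], maxContentLength: 500 });
console.log(lengthOnly.maxPages === Number.MAX_SAFE_INTEGER); // true
console.log(lengthOnly.appliesBreakpoints); // true

console.log(resolveLimits({ breakpoints: ["\\n"] }).maxPages); // 0
console.log(resolveLimits({ maxContentLength: 500 }).appliesBreakpoints); // false
```

As the last line shows, breakpoint processing is still skipped when `breakpoints` is empty, even if `maxContentLength` is set.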