flappa-doormal 2.10.1 → 2.11.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +26 -0
- package/LICENSE.md +1 -1
- package/README.md +110 -2
- package/dist/index.d.mts +12 -14
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +167 -81
- package/dist/index.mjs.map +1 -1
- package/package.json +1 -1
package/AGENTS.md
CHANGED
@@ -117,6 +117,8 @@ src/
 - `hasExcludedPageInRange()` - Check if range contains excluded pages
 - `findNextPagePosition()` - Find next page content position
 - `findPatternBreakPosition()` - Find pattern match by preference
+- `findSafeBreakPosition()` - Search backward for a safe linguistic split point (whitespace/punctuation)
+- `adjustForSurrogate()` - Ensure split position doesn't corrupt Unicode surrogate pairs
 
 10. **`types.ts`** - Type definitions
 - `Logger` interface - Optional logging for debugging
@@ -260,6 +262,19 @@ segmentPages(pages, {
 
 > **Note**: Older per-rule span limiting approaches were removed in favor of post-processing `breakpoints`.
 
+### 5. Safety-Hardened Content Splitting (NEW)
+
+When using `maxContentLength`, the segmenter prevents text corruption through several layers of fallback logic.
+
+**Algorithm:**
+1. **Windowed Pattern Match**: Attempt to find a user-provided `breakpoint` pattern within the character window.
+2. **Safe Fallback (Linguistic)**: If no pattern matches, use `findSafeBreakPosition()` to search backward (100 chars) for whitespace or punctuation `[\s\n.,;!?؛،۔]`.
+3. **Safe Fallback (Technical)**: If still no safe break is found, use `adjustForSurrogate()` to ensure the split doesn't fall between a High and Low Unicode surrogate pair.
+4. **Hard Split**: Only as a final resort is a character-exact split performed.
+
+**Progress Guarantee**:
+The loop in `processOversizedSegment` has been refactored so it no longer depends on a low fixed iteration limit (previously 10k). It instead relies on strict `cursorPos` progression and input validation (`maxContentLength >= 50`), retaining only a high safety backstop, so arbitrarily large content can be processed without risk of truncation.
+
 ## Design Decisions
 
 ### 1. Why `{{double-braces}}`?
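The fallback cascade described above can be sketched as follows. The two helpers mirror the implementations shipped in `dist/index.mjs` in this release; only the standalone framing is added here.

```typescript
// Search backward from a hard limit for a whitespace/punctuation delimiter.
// Returns the index just after the delimiter, or -1 if none found in the window.
const findSafeBreakPosition = (content: string, targetPosition: number, lookbackChars = 100): number => {
  const startSearch = Math.max(0, targetPosition - lookbackChars);
  for (let i = targetPosition - 1; i >= startSearch; i--) {
    // Whitespace plus Latin and Arabic punctuation count as safe delimiters
    if (/[\s\n.,;!?؛،۔]/.test(content[i])) return i + 1;
  }
  return -1;
};

// Step the split point back one UTF-16 unit if it would land between
// a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF).
const adjustForSurrogate = (content: string, position: number): number => {
  if (position <= 0 || position >= content.length) return position;
  const high = content.charCodeAt(position - 1);
  const low = content.charCodeAt(position);
  if (high >= 0xd800 && high <= 0xdbff && low >= 0xdc00 && low <= 0xdfff) return position - 1;
  return position;
};
```

For example, `findSafeBreakPosition('hello world foo', 13)` returns 12 (just after the space), and `adjustForSurrogate('a😀', 2)` returns 1 so the emoji's two code units stay together.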
@@ -414,6 +429,16 @@ bunx biome lint .
 
 10. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
 
+11. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
+
+12. **Unicode Surrogate Safety**: Multi-byte characters (like Emojis) can be corrupted if split in the middle of a surrogate pair. Always use a helper like `adjustForSurrogate` to ensure the split point falls on a valid character boundary.
+
+13. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
+
+14. **Accidental File Overwrites**: Be extremely careful when using tools like `replace_file_content` with large ranges. Verify file integrity frequently (e.g., `git diff`) to catch accidental deletions of existing code or tests. Merging new tests into existing files is a high-risk operation for AI agents.
+
+15. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible bidirectional formatting marks like Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), or Arabic Letter Mark (`U+061C`). These appear at line starts after `\n` but before visible characters, breaking `^` anchored patterns. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\uFEFF]*(?:pattern)`. The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
+
 ### Process Template (Multi-agent design review, TDD-first)
 
 If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
@@ -606,3 +631,4 @@ it('should correctly split pages with identical prefixes and duplicated content'
 ```
 
 ---
+15. **Use Synthesized AI Reviews**: For complex safety features, getting reviews from multiple models (Claude, GPT, etc.) and synthesizing them into a single action plan (see `docs/reviews/max-content-length-review-synthesis.md`) revealed critical edge cases like Arabic diacritic corruption and surrogate pair safety that a single model might miss.
package/LICENSE.md
CHANGED
@@ -1,4 +1,4 @@
-Copyright 2025 Ragaeeb Haq>
+Copyright 2025-2026 Ragaeeb Haq>
 
 Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
 
package/README.md
CHANGED
|
@@ -51,6 +51,7 @@ Working with Arabic hadith and Islamic text collections requires splitting conti
 ✅ **Readable templates**: `{{raqms}} {{dash}}` instead of cryptic regex
 ✅ **Named captures**: `{{raqms:hadithNum}}` auto-extracts to `meta.hadithNum`
 ✅ **Fuzzy matching**: Auto-enabled for `{{bab}}`, `{{kitab}}`, `{{basmalah}}`, `{{fasl}}`, `{{naql}}` (override with `fuzzy: false`)
+✅ **Content limits**: `maxPages` and `maxContentLength` (safety-hardened) control segment size
 ✅ **Page tracking**: Know which page each segment came from
 ✅ **Declarative rules**: Describe *what* to match, not *how*
 
@@ -236,7 +237,92 @@ Limit rules to specific page ranges:
 }
 ```
 
-### 7.
+### 7. Max Content Length (Safety Hardened)
+
+Split oversized segments based on character count:
+
+```typescript
+{
+  maxContentLength: 500, // Split after 500 characters
+  prefer: 'longer', // Try to fill the character bucket
+  breakpoints: ['\\.'], // Recommended: split on punctuation within window
+}
+```
+
+The library implements **safety hardening** for character-based splits:
+- **Safe Fallback**: If no breakpoint matches, it searches backward up to 100 characters for a delimiter (whitespace or punctuation) to avoid chopping words.
+- **Unicode Safety**: Automatically prevents splitting inside Unicode surrogate pairs (e.g., emojis), preventing text corruption.
+- **Validation**: `maxContentLength` must be at least **50**.
+
+### 8. Advanced Structural Filters
+
+Refine rule matching with page-specific constraints:
+
+```typescript
+{
+  lineStartsWith: ['### '],
+  split: 'at',
+  // Range constraints
+  min: 10, // Only match on pages 10 and above
+  max: 500, // Only match on pages 500 and below
+  exclude: [50, [100, 110]], // Skip page 50 and range 100-110
+
+  // Negative lookahead: skip rule if content matches this pattern
+  // (e.g. skip chapter marker if it appears inside a table/list)
+  skipWhen: '^\s*- ',
+}
+```
+
+### 9. Debugging & Logging
+
+Pass an optional `logger` to trace segmentation decisions or enable `debug` to attach match metadata to segments:
+
+```typescript
+const segments = segmentPages(pages, {
+  rules: [...],
+  debug: true, // Attaches .meta.debug with regex and match indices
+  logger: {
+    debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
+    info: (msg, data) => console.info(`[INFO] ${msg}`, data),
+    warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
+    error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
+  }
+});
+```
+
+### 10. Page Joiners
+
+Control how text from different pages is stitched together:
+
+```typescript
+// Default: space ' ' joiner
+// Result: "...end of page 1. Start of page 2..."
+segmentPages(pages, { pageJoiner: 'space' });
+
+// Result: "...end of page 1.\nStart of page 2..."
+segmentPages(pages, { pageJoiner: 'newline' });
+```
+
+### 11. Breakpoint Preferences
+
+When a segment exceeds `maxPages` or `maxContentLength`, breakpoints split it at the "best" available match:
+
+```typescript
+{
+  maxPages: 1, // Maximum segment size (page span)
+  breakpoints: ['{{tarqim}}'],
+
+  // 'longer' (default): Greedy. Finds the match furthest in the window.
+  // Result: Segments stay close to the max limit.
+  prefer: 'longer',
+
+  // 'shorter': Conservative. Finds the first available match.
+  // Result: Segments split as early as possible.
+  prefer: 'shorter',
+}
+```
+
 ### 12. Occurrence Filtering
 
 Control which matches to use:
 
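The Unicode-safety bullet in section 7 guards against a concrete JavaScript pitfall, shown here with plain string methods (no library API involved):

```typescript
const text = 'a😀b';      // the emoji occupies two UTF-16 code units
console.log(text.length); // 4

// Slicing at a raw index can land between the surrogate halves,
// leaving an unpaired high surrogate that renders as �.
const broken = text.slice(0, 2);
console.log(broken.charCodeAt(1)); // 55357 (0xD83D): lone high surrogate

// A split adjusted one unit back stays on a valid character boundary.
const safe = text.slice(0, 1);
console.log(safe); // 'a'
```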
@@ -831,7 +917,27 @@ const options: SegmentationOptions = {
   // How to join content across page boundaries in OUTPUT segments:
   // - 'space' (default): page boundaries become spaces
   // - 'newline': preserve page boundaries as newlines
-  pageJoiner: '
+  pageJoiner: 'newline',
+
+  // Breakpoint preferences for resizing oversized segments:
+  // - 'longer' (default): maximizes segment size within limits
+  // - 'shorter': minimizes segment size (splits at first match)
+  prefer: 'longer',
+
+  // Post-structural limit: split if segment spans more than 2 pages
+  maxPages: 2,
+
+  // Post-structural limit: split if segment exceeds 5000 characters
+  maxContentLength: 5000,
+
+  // Enable match metadata in segments (meta.debug)
+  debug: true,
+
+  // Custom logger for tracing
+  logger: {
+    info: (m) => console.log(m),
+    warn: (m) => console.warn(m),
+  }
 };
 
 const segments: Segment[] = segmentPages(pages, options);
@@ -1057,6 +1163,8 @@ type SplitRule = {
   // Constraints
   min?: number;
   max?: number;
+  exclude?: (number | [number, number])[]; // Single page or [start, end] range
+  skipWhen?: string; // Regex pattern (tokens supported)
   meta?: Record<string, unknown>;
 };
 ```
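For illustration, the `exclude` shape above (single pages or `[start, end]` ranges) could be checked with a helper like this; `isExcluded` is a hypothetical sketch, not part of the library's API:

```typescript
type Exclusion = number | [number, number];

// True when `page` equals a listed page or falls inside a listed inclusive range.
const isExcluded = (page: number, exclude: Exclusion[]): boolean =>
  exclude.some((e) => (typeof e === 'number' ? e === page : page >= e[0] && page <= e[1]));

isExcluded(50, [50, [100, 110]]);  // true  (exact page)
isExcluded(105, [50, [100, 110]]); // true  (inside range)
isExcluded(99, [50, [100, 110]]);  // false
```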
package/dist/index.d.mts
CHANGED
@@ -641,6 +641,18 @@ type SegmentationOptions = {
   * { maxPages: 2, breakpoints: ['{{tarqim}}', ''] }
   */
  maxPages?: number;
+ /**
+  * Maximum length (in characters) per segment.
+  *
+  * When a segment exceeds this length, breakpoints are applied to split it.
+  * This can typically be used in conjunction with `maxPages`, where the
+  * strictest constraint (intersection) determines the split window.
+  *
+  * @example
+  * // Break segments that exceed 2000 chars
+  * { maxContentLength: 2000, breakpoints: ['{{tarqim}}'] }
+  */
+ maxContentLength?: number;
  /**
   * Patterns tried in order to break oversized segments.
   *
@@ -894,20 +906,6 @@ type ReplaceRule = NonNullable<SegmentationOptions['replace']>[number];
 declare const applyReplacements: (pages: Page[], rules?: ReplaceRule[]) => Page[];
 //#endregion
 //#region src/segmentation/segmenter.d.ts
-/**
- * Applies breakpoints to oversized segments.
- *
- * For each segment that spans more than maxPages, tries the breakpoint patterns
- * in order to find a suitable split point. Structural markers (from rules) are
- * always respected - segments are only broken within their boundaries.
- *
- * @param segments - Initial segments from rule processing
- * @param pages - Original pages for page lookup
- * @param maxPages - Maximum pages before breakpoints apply
- * @param breakpoints - Patterns to try in order (tokens supported)
- * @param prefer - 'longer' for last match, 'shorter' for first match
- * @returns Processed segments with oversized ones broken up
- */
 /**
  * Segments pages of content based on pattern-matching rules.
  *
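The "strictest constraint (intersection)" behavior described in the `maxContentLength` docblock can be sketched as follows; `windowEnd` is a hypothetical helper for illustration, not the library's actual function:

```typescript
// The split window ends at whichever limit is hit first:
// the page-derived boundary or the character cap.
const windowEnd = (pageWindowEnd: number, maxContentLength?: number): number =>
  maxContentLength !== undefined ? Math.min(pageWindowEnd, maxContentLength) : pageWindowEnd;

windowEnd(5000, 2000); // 2000 – character cap is stricter
windowEnd(1500, 2000); // 1500 – page boundary is stricter
```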
package/dist/index.d.mts.map
CHANGED
package/dist/index.mjs
CHANGED
@@ -1319,12 +1319,12 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
 const MAX_DEVIATION = 2e3;
 if (bestDistance <= MAX_DEVIATION) return bestCandidate.pos;
 logger?.debug?.("[breakpoints] findPageStartNearExpectedBoundary: Rejected match exceeding deviation", {
-	targetPageIdx,
-	expectedBoundary,
 	bestDistance,
-
+	expectedBoundary,
 	matchPos: bestCandidate.pos,
-
+	maxDeviation: MAX_DEVIATION,
+	prefixLength: len,
+	targetPageIdx
 });
 }
 }
@@ -1533,6 +1533,34 @@ const findBreakPosition = (remainingContent, currentFromIdx, toIdx, windowEndIdx
 }
 return null;
 };
+/**
+ * Searches backward from a target position to find a "safe" split point.
+ * A safe split point is after whitespace or punctuation.
+ *
+ * @param content The text content
+ * @param targetPosition The desired split position (hard limit)
+ * @param lookbackChars How far back to search for a safe break
+ * @returns The new split position (index), or -1 if no safe break found
+ */
+const findSafeBreakPosition = (content, targetPosition, lookbackChars = 100) => {
+	const startSearch = Math.max(0, targetPosition - lookbackChars);
+	for (let i = targetPosition - 1; i >= startSearch; i--) {
+		const char = content[i];
+		if (/[\s\n.,;!?؛،۔]/.test(char)) return i + 1;
+	}
+	return -1;
+};
+/**
+ * Ensures the position does not split a surrogate pair.
+ * If position is between High and Low surrogate, returns position - 1.
+ */
+const adjustForSurrogate = (content, position) => {
+	if (position <= 0 || position >= content.length) return position;
+	const high = content.charCodeAt(position - 1);
+	const low = content.charCodeAt(position);
+	if (high >= 55296 && high <= 56319 && low >= 56320 && low <= 57343) return position - 1;
+	return position;
+};
 
 //#endregion
 //#region src/segmentation/debug-meta.ts
@@ -1659,7 +1687,7 @@ const createPieceSegment = (pieceContent, actualStartIdx, actualEndIdx, pageIds,
 *
 * @returns Break offset relative to remainingContent, or windowEndPosition as fallback
 */
-const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer) => {
+const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer, maxContentLength) => {
 	if (hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, windowEndIdx)) {
 		const exclusionBreak = findExclusionBreakPosition(currentFromIdx, windowEndIdx, toIdx, pageIds, expandedBreakpoints, cumulativeOffsets);
 		if (exclusionBreak > 0) return { breakOffset: exclusionBreak };
@@ -1675,6 +1703,11 @@ const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx
 		breakpointIndex: patternMatch.breakpointIndex,
 		breakpointRule: patternMatch.rule
 	};
+	if (maxContentLength && windowEndPosition === maxContentLength) {
+		const safeOffset = findSafeBreakPosition(remainingContent, windowEndPosition);
+		if (safeOffset !== -1) return { breakOffset: safeOffset };
+		return { breakOffset: adjustForSurrogate(remainingContent, windowEndPosition) };
+	}
 	return { breakOffset: windowEndPosition };
 };
 /**
@@ -1691,7 +1724,7 @@ const skipWhitespace$1 = (content, startPos) => {
 *
 * Uses precomputed boundary positions for O(log n) page attribution lookups.
 */
-const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey) => {
+const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey, maxContentLength) => {
 	const result = [];
 	const fullContent = segment.content;
 	let cursorPos = 0;
@@ -1705,57 +1738,104 @@ const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPag
|
|
|
1705
1738
|
fullContentLength: fullContent.length,
|
|
1706
1739
|
toIdx
|
|
1707
1740
|
});
|
|
1708
|
-
|
|
1709
|
-
|
|
1741
|
+
let i = 0;
|
|
1742
|
+
const MAX_SAFE_ITERATIONS = 1e5;
|
|
1743
|
+
while (cursorPos < fullContent.length && currentFromIdx <= toIdx && i < MAX_SAFE_ITERATIONS) {
|
|
1744
|
+
i++;
|
|
1710
1745
|
const remainingContent = fullContent.slice(cursorPos);
|
|
1711
1746
|
if (!remainingContent.trim()) break;
|
|
1712
|
-
|
|
1713
|
-
const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
|
|
1714
|
-
if (remainingSpan <= maxPages && !remainingHasExclusions) {
|
|
1715
|
-
const includeMeta = isFirstPiece || Boolean(debugMetaKey);
|
|
1716
|
-
const meta = debugMetaKey && lastBreakpoint ? mergeDebugIntoMeta(includeMeta ? segment.meta : void 0, debugMetaKey, buildBreakpointDebugPatch(lastBreakpoint.breakpointIndex, lastBreakpoint.rule)) : includeMeta ? segment.meta : void 0;
|
|
1717
|
-
const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, meta, includeMeta);
|
|
1718
|
-
if (finalSeg) result.push(finalSeg);
|
|
1719
|
-
break;
|
|
1720
|
-
}
|
|
1747
|
+
if (handleOversizedSegmentFit(remainingContent, currentFromIdx, toIdx, pageIds, expandedBreakpoints, maxPages, maxContentLength, isFirstPiece, debugMetaKey, segment.meta, lastBreakpoint, result)) break;
|
|
1721
1748
|
const windowEndIdx = computeWindowEndIdx(currentFromIdx, toIdx, pageIds, maxPages);
|
|
1722
|
-
const windowEndPosition =
|
|
1749
|
+
const windowEndPosition = getWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, maxContentLength, logger);
|
|
1723
1750
|
logger?.debug?.(`[breakpoints] iteration=${i}`, {
|
|
1724
1751
|
currentFromIdx,
|
|
1725
1752
|
cursorPos,
|
|
1726
|
-
windowEndIdx
|
|
1753
|
+
windowEndIdx,
|
|
1754
|
+
windowEndPosition
|
|
1727
1755
|
});
|
|
1728
|
-
-		const found = findBreakOffsetForWindow(remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer);
+		const found = findBreakOffsetForWindow(remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer, maxContentLength);
+		let breakOffset = found.breakOffset;
+		if (breakOffset <= 0) {
+			const fallbackPos = maxContentLength ? Math.min(maxContentLength, remainingContent.length) : 1;
+			breakOffset = Math.max(1, fallbackPos);
+			logger?.warn?.("[breakpoints] No progress from findBreakOffsetForWindow; forcing forward movement", {
+				breakOffset,
+				cursorPos
+			});
+		}
 		if (found.breakpointIndex !== void 0 && found.breakpointRule) lastBreakpoint = {
 			breakpointIndex: found.breakpointIndex,
 			rule: found.breakpointRule
 		};
-		const breakPos = cursorPos +
+		const breakPos = cursorPos + breakOffset;
 		const pieceContent = fullContent.slice(cursorPos, breakPos).trim();
-		const { actualEndIdx, actualStartIdx } = computePiecePages(cursorPos, breakPos, boundaryPositions, fromIdx, toIdx);
-		logger?.trace?.("[breakpoints] piece", {
-			actualEndIdx,
-			actualStartIdx,
-			pieceLength: pieceContent.length
-		});
 		if (pieceContent) {
-			const
-			const pieceSeg = createPieceSegment(pieceContent, actualStartIdx, actualEndIdx, pageIds,
+			const { actualEndIdx, actualStartIdx } = computePiecePages(cursorPos, breakPos, boundaryPositions, fromIdx, toIdx);
+			const pieceSeg = createPieceSegment(pieceContent, actualStartIdx, actualEndIdx, pageIds, getSegmentMetaWithDebug(isFirstPiece, debugMetaKey, segment.meta, lastBreakpoint), true);
 			if (pieceSeg) result.push(pieceSeg);
-
-
-
+			const next = advanceCursorAndIndex(fullContent, breakPos, actualEndIdx, toIdx, pageIds, normalizedPages);
+			cursorPos = next.cursorPos;
+			currentFromIdx = next.currentFromIdx;
+		} else cursorPos = breakPos;
 		isFirstPiece = false;
 	}
+	if (i >= MAX_SAFE_ITERATIONS) logger?.error?.("[breakpoints] Stopped processing oversized segment: reached MAX_SAFE_ITERATIONS", {
+		cursorPos,
+		fullContentLength: fullContent.length,
+		iterations: i
+	});
 	logger?.debug?.("[breakpoints] done", { resultCount: result.length });
 	return result;
 };
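The forced-progress fallback added above can be read in isolation: when the window search returns no usable offset, the cursor is still advanced so the loop cannot stall. The sketch below is a hypothetical standalone helper (`forcedBreakOffset` is not part of the package's API), mirroring only the offset arithmetic from the diff.

```javascript
// Hypothetical helper mirroring the fallback logic: if the search found no
// positive offset, advance by maxContentLength (capped to the remaining text)
// or by at least one character, so the loop always makes forward progress.
const forcedBreakOffset = (foundOffset, maxContentLength, remainingLength) => {
	if (foundOffset > 0) return foundOffset; // search succeeded; use it as-is
	const fallbackPos = maxContentLength
		? Math.min(maxContentLength, remainingLength)
		: 1;
	return Math.max(1, fallbackPos); // never advance by zero
};
```

This guarantee is what makes the `MAX_SAFE_ITERATIONS` guard a true last resort rather than a routine exit.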
 /**
+ * Checks if the remaining content fits within paged/length limits.
+ * If so, pushes the final segment and returns true.
+ */
+const handleOversizedSegmentFit = (remainingContent, currentFromIdx, toIdx, pageIds, expandedBreakpoints, maxPages, maxContentLength, isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint, result) => {
+	const remainingSpan = computeRemainingSpan(currentFromIdx, toIdx, pageIds);
+	const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
+	const fitsInPages = remainingSpan <= maxPages;
+	const fitsInLength = !maxContentLength || remainingContent.length <= maxContentLength;
+	if (fitsInPages && fitsInLength && !remainingHasExclusions) {
+		const includeMeta = isFirstPiece || Boolean(debugMetaKey);
+		const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, getSegmentMetaWithDebug(isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint), includeMeta);
+		if (finalSeg) result.push(finalSeg);
+		return true;
+	}
+	return false;
+};
+/**
+ * Builds metadata for a segment piece, optionally including debug info.
+ */
+const getSegmentMetaWithDebug = (isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint) => {
+	if (!(isFirstPiece || Boolean(debugMetaKey))) return;
+	if (debugMetaKey && lastBreakpoint) return mergeDebugIntoMeta(isFirstPiece ? originalMeta : void 0, debugMetaKey, buildBreakpointDebugPatch(lastBreakpoint.breakpointIndex, lastBreakpoint.rule));
+	return isFirstPiece ? originalMeta : void 0;
+};
+/**
+ * Calculates window end position, capped by maxContentLength if present.
+ */
+const getWindowEndPosition = (remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, maxContentLength, logger) => {
+	let windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger);
+	if (maxContentLength && maxContentLength < windowEndPosition) windowEndPosition = maxContentLength;
+	return windowEndPosition;
+};
+/**
+ * Advances cursorPos and currentFromIdx for the next iteration.
+ */
+const advanceCursorAndIndex = (fullContent, breakPos, actualEndIdx, toIdx, pageIds, normalizedPages) => {
+	const nextCursorPos = skipWhitespace$1(fullContent, breakPos);
+	return {
+		currentFromIdx: computeNextFromIdx(fullContent.slice(nextCursorPos), actualEndIdx, toIdx, pageIds, normalizedPages),
+		cursorPos: nextCursorPos
+	};
+};
+/**
 * Applies breakpoints to oversized segments.
 *
 * Note: This is an internal engine used by `segmentPages()`.
 */
-const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner = "space", debugMetaKey) => {
+const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner = "space", debugMetaKey, maxContentLength) => {
 	const pageIds = pages.map((p) => p.id);
 	const pageIdToIndex = buildPageIdToIndexMap(pageIds);
 	const normalizedPages = buildNormalizedPagesMap(pages, normalizedContent);
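The capping step inside the new `getWindowEndPosition` helper is the core of how `maxContentLength` constrains the search window. A simplified sketch (the real helper first derives the window end from page boundaries via `findBreakpointWindowEndPosition`; here it is just a parameter):

```javascript
// Simplified sketch of getWindowEndPosition's capping step: the breakpoint
// search window never extends past maxContentLength when that limit is set.
const capWindowEnd = (windowEndPosition, maxContentLength) => {
	if (maxContentLength && maxContentLength < windowEndPosition) return maxContentLength;
	return windowEndPosition;
};
```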
@@ -1779,11 +1859,13 @@ const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoi
 		const toIdx = segment.to !== void 0 ? pageIdToIndex.get(segment.to) ?? fromIdx : fromIdx;
 		const segmentSpan = (segment.to ?? segment.from) - segment.from;
 		const hasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, fromIdx, toIdx);
-
+		const fitsInPages = segmentSpan <= maxPages;
+		const fitsInLength = !maxContentLength || segment.content.length <= maxContentLength;
+		if (fitsInPages && fitsInLength && !hasExclusions) {
 			result.push(segment);
 			continue;
 		}
-		const broken = processOversizedSegment(segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey);
+		const broken = processOversizedSegment(segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey, maxContentLength);
 		result.push(...broken.map((s) => {
 			const segFromIdx = pageIdToIndex.get(s.from) ?? -1;
 			const segToIdx = s.to !== void 0 ? pageIdToIndex.get(s.to) ?? segFromIdx : segFromIdx;
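The hunk above shows that a segment is passed through untouched only when it fits both limits. The predicate can be sketched on its own (hypothetical helper name, extracted from the two `fitsIn*` conditions in the diff):

```javascript
// A segment skips breakpoint processing only when it fits in pages AND,
// if maxContentLength is set, also fits in characters.
const segmentFits = (segmentSpan, contentLength, maxPages, maxContentLength) => {
	const fitsInPages = segmentSpan <= maxPages;
	const fitsInLength = !maxContentLength || contentLength <= maxContentLength;
	return fitsInPages && fitsInLength;
};
```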
@@ -1995,7 +2077,7 @@ const buildLineStartsAfterRegexSource = (patterns, fuzzy, capturePrefix) => {
 	const union = processed.map((p) => p.pattern).join("|");
 	return {
 		captureNames: processed.flatMap((p) => p.captureNames),
-		regex: `^(?:${union})${capturePrefix ? `(?<${capturePrefix}__content>.*)` : "(.*)"}`
+		regex: `^[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*(?:${union})${capturePrefix ? `(?<${capturePrefix}__content>.*)` : "(.*)"}`
 	};
 };
 const buildLineStartsWithRegexSource = (patterns, fuzzy, capturePrefix) => {
@@ -2003,7 +2085,7 @@ const buildLineStartsWithRegexSource = (patterns, fuzzy, capturePrefix) => {
 	const union = processed.map((p) => p.pattern).join("|");
 	return {
 		captureNames: processed.flatMap((p) => p.captureNames),
-		regex: `^(?:${union})`
+		regex: `^[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*(?:${union})`
 	};
 };
 const buildLineEndsWithRegexSource = (patterns, fuzzy, capturePrefix) => {
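The new `^[\u200E\u200F\u061C\u200B\uFEFF]*` prefix lets line-start patterns match even when the line opens with invisible characters — left-to-right mark, right-to-left mark, Arabic letter mark, zero-width space, and BOM — which are common in RTL source text. A minimal demonstration with a literal marker in place of the library's compiled pattern union:

```javascript
// Before: a bare ^ anchor fails on lines beginning with invisible bidi marks.
// After: the optional prefix class consumes them before the real pattern.
const oldRe = new RegExp("^(?:Chapter)");
const newRe = new RegExp("^[\\u200E\\u200F\\u061C\\u200B\\uFEFF]*(?:Chapter)");
const line = "\u200FChapter one"; // line starts with a Right-to-Left Mark
```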
@@ -2246,6 +2328,47 @@ const createPageStartGuardChecker = (matchContent, pageMap) => {
 		return prevReq.test(lastChar);
 	};
 };
+/**
+ * Checks if a pageId matches the min/max/exclude constraints of a rule.
+ */
+const passesRuleConstraints$1 = (rule, pageId) => {
+	return (rule.min === void 0 || pageId >= rule.min) && (rule.max === void 0 || pageId <= rule.max) && !isPageExcluded(pageId, rule.exclude);
+};
+/**
+ * Records a split point for a specific rule.
+ */
+const recordSplitPointAt = (splitPointsByRule, ruleIndex, sp) => {
+	const arr = splitPointsByRule.get(ruleIndex);
+	if (!arr) {
+		splitPointsByRule.set(ruleIndex, [sp]);
+		return;
+	}
+	arr.push(sp);
+};
+/**
+ * Processes matches for all fast-fuzzy rules at a specific line start.
+ */
+const processFastFuzzyMatchesAt = (matchContent, lineStart, pageId, fastFuzzyRules, passesPageStartGuard, isPageStart, splitPointsByRule) => {
+	for (const { compiled, kind, rule, ruleIndex } of fastFuzzyRules) {
+		if (!passesRuleConstraints$1(rule, pageId)) continue;
+		if (isPageStart && !passesPageStartGuard(rule, ruleIndex, lineStart)) continue;
+		const end = matchFastFuzzyTokenAt(matchContent, lineStart, compiled);
+		if (end === null) continue;
+		const splitIndex = (rule.split ?? "at") === "at" ? lineStart : end;
+		if (kind === "startsWith") recordSplitPointAt(splitPointsByRule, ruleIndex, {
+			index: splitIndex,
+			meta: rule.meta
+		});
+		else {
+			const markerLength = end - lineStart;
+			recordSplitPointAt(splitPointsByRule, ruleIndex, {
+				contentStartOffset: (rule.split ?? "at") === "at" ? markerLength : void 0,
+				index: splitIndex,
+				meta: rule.meta
+			});
+		}
+	}
+};
 const collectFastFuzzySplitPoints = (matchContent, pageMap, fastFuzzyRules, passesPageStartGuard) => {
 	const splitPointsByRule = /* @__PURE__ */ new Map();
 	if (fastFuzzyRules.length === 0 || pageMap.boundaries.length === 0) return splitPointsByRule;
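The extracted `passesRuleConstraints$1` helper is small enough to exercise directly. A runnable sketch, with a simplified `isPageExcluded` stand-in (the package's real implementation may also support exclusion ranges, so the stub here is an assumption):

```javascript
// Simplified stand-in for isPageExcluded: handles only an array of page ids.
const isPageExcluded = (pageId, exclude) =>
	Array.isArray(exclude) && exclude.includes(pageId);

// Mirrors passesRuleConstraints$1 from the diff: a page passes when it is
// within the optional [min, max] window and not explicitly excluded.
const passesRuleConstraints = (rule, pageId) =>
	(rule.min === void 0 || pageId >= rule.min) &&
	(rule.max === void 0 || pageId <= rule.max) &&
	!isPageExcluded(pageId, rule.exclude);
```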
@@ -2257,38 +2380,12 @@ const collectFastFuzzySplitPoints = (matchContent, pageMap, fastFuzzyRules, pass
 			currentBoundary = pageMap.boundaries[boundaryIdx];
 		}
 	};
-	const recordSplitPoint = (ruleIndex, sp) => {
-		const arr = splitPointsByRule.get(ruleIndex);
-		if (!arr) {
-			splitPointsByRule.set(ruleIndex, [sp]);
-			return;
-		}
-		arr.push(sp);
-	};
 	const isPageStart = (offset) => offset === currentBoundary?.start;
 	for (let lineStart = 0; lineStart <= matchContent.length;) {
 		advanceBoundaryTo(lineStart);
 		const pageId = currentBoundary?.id ?? 0;
 		if (lineStart >= matchContent.length) break;
-
-		if (!((rule.min === void 0 || pageId >= rule.min) && (rule.max === void 0 || pageId <= rule.max) && !isPageExcluded(pageId, rule.exclude))) continue;
-		if (isPageStart(lineStart) && !passesPageStartGuard(rule, ruleIndex, lineStart)) continue;
-		const end = matchFastFuzzyTokenAt(matchContent, lineStart, compiled);
-		if (end === null) continue;
-		const splitIndex = (rule.split ?? "at") === "at" ? lineStart : end;
-		if (kind === "startsWith") recordSplitPoint(ruleIndex, {
-			index: splitIndex,
-			meta: rule.meta
-		});
-		else {
-			const markerLength = end - lineStart;
-			recordSplitPoint(ruleIndex, {
-				contentStartOffset: (rule.split ?? "at") === "at" ? markerLength : void 0,
-				index: splitIndex,
-				meta: rule.meta
-			});
-		}
-	}
+		processFastFuzzyMatchesAt(matchContent, lineStart, pageId, fastFuzzyRules, passesPageStartGuard, isPageStart(lineStart), splitPointsByRule);
 		const nextNl = matchContent.indexOf("\n", lineStart);
 		if (nextNl === -1) break;
 		lineStart = nextNl + 1;
@@ -2613,20 +2710,6 @@ const convertPageBreaks = (content, startOffset, pageBreaks) => {
 	return content.replace(/\n/g, (match, offset) => breakSet.has(offset) ? " " : match);
 };
 /**
- * Applies breakpoints to oversized segments.
- *
- * For each segment that spans more than maxPages, tries the breakpoint patterns
- * in order to find a suitable split point. Structural markers (from rules) are
- * always respected - segments are only broken within their boundaries.
- *
- * @param segments - Initial segments from rule processing
- * @param pages - Original pages for page lookup
- * @param maxPages - Maximum pages before breakpoints apply
- * @param breakpoints - Patterns to try in order (tokens supported)
- * @param prefer - 'longer' for last match, 'shorter' for first match
- * @returns Processed segments with oversized ones broken up
- */
-/**
 * Segments pages of content based on pattern-matching rules.
 *
 * This is the main entry point for the segmentation engine. It takes an array
@@ -2669,11 +2752,14 @@ const convertPageBreaks = (content, startOffset, pageBreaks) => {
 * });
 */
 const segmentPages = (pages, options) => {
-	const { rules = [],
+	const { rules = [], breakpoints = [], prefer = "longer", pageJoiner = "space", logger, maxContentLength } = options;
+	if (maxContentLength && maxContentLength < 50) throw new Error(`maxContentLength must be at least 50 characters.`);
+	const maxPages = options.maxPages ?? (maxContentLength ? Number.MAX_SAFE_INTEGER : 0);
 	const debug = resolveDebugConfig(options.debug);
 	const debugMetaKey = debug?.includeRule ? debug.metaKey : void 0;
 	logger?.info?.("[segmenter] starting segmentation", {
 		breakpointCount: breakpoints.length,
+		maxContentLength,
 		maxPages,
 		pageCount: pages.length,
 		prefer,
@@ -2701,10 +2787,10 @@ const segmentPages = (pages, options) => {
 		}))
 	});
 	segments = ensureFallbackSegment(segments, processedPages, normalizedContent, pageJoiner);
-	if (maxPages >= 0 && breakpoints.length) {
+	if ((maxPages >= 0 || maxContentLength && maxContentLength > 0) && breakpoints.length) {
 		logger?.debug?.("[segmenter] applying breakpoints to oversized segments");
 		const patternProcessor = (p) => processPattern(p, false).pattern;
-		const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner, debug?.includeBreakpoint ? debug.metaKey : void 0);
+		const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner, debug?.includeBreakpoint ? debug.metaKey : void 0, maxContentLength);
 		logger?.info?.("[segmenter] segmentation complete (with breakpoints)", { finalSegmentCount: result.length });
 		return result;
 	}
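The option handling added at the top of `segmentPages` enforces two rules visible in the diff: `maxContentLength` must be at least 50, and when only a character limit is given, `maxPages` defaults to "unbounded" so page count alone never triggers a split. A standalone sketch of just that resolution step (`resolveLimits` is a hypothetical name; the real code inlines this):

```javascript
// Mirrors the validation and defaulting from segmentPages: reject tiny
// character limits, and let a character-only limit disable the page limit.
const resolveLimits = (options) => {
	const { maxContentLength } = options;
	if (maxContentLength && maxContentLength < 50)
		throw new Error("maxContentLength must be at least 50 characters.");
	const maxPages = options.maxPages ?? (maxContentLength ? Number.MAX_SAFE_INTEGER : 0);
	return { maxContentLength, maxPages };
};
```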