flappa-doormal 2.6.1 → 2.6.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +9 -13
- package/README.md +40 -3
- package/dist/index.mjs +243 -179
- package/dist/index.mjs.map +1 -1
- package/package.json +3 -3
package/AGENTS.md
CHANGED
@@ -41,23 +41,13 @@ src/
 ├── match-utils.ts # Extracted match processing utilities
 ├── segmenter.test.ts # Core test suite (150+ tests including breakpoints)
 ├── segmenter.bukhari.test.ts # Real-world test cases
-├── breakpoint-utils.test.ts # Breakpoint utility tests (
+├── breakpoint-utils.test.ts # Breakpoint utility tests (55 tests)
 ├── rule-regex.test.ts # Rule regex builder tests
 ├── segmenter-utils.test.ts # Segmenter helper tests
 ├── tokens.test.ts # Token expansion tests
 ├── fuzzy.test.ts # Fuzzy matching tests
 ├── textUtils.test.ts # Text utility tests
 └── match-utils.test.ts # Utility function tests
-
-test/
-├── 2576.json # Test data for book 2576 (Sahih Bukhari)
-└── 2588.json # Test data for book 2588 (Al-Mughni)
-
-docs/
-├── checkpoints/ # AI agent handoff documentation
-│ └── 2025-12-09-handoff.md
-└── reviews/ # Performance analysis reports
-    └── 2025-12-10/
 ```
 
 ### Core Components
@@ -92,7 +82,9 @@ docs/
 - `buildExcludeSet()` - Create Set from PageRange[] for O(1) lookups
 - `createSegment()` - Create segment with optional to/meta fields
 - `expandBreakpoints()` - Expand patterns with pre-compiled regexes
-- `
+- `buildBoundaryPositions()` - Build position map of page boundaries for O(log n) lookups
+- `findPageIndexForPosition()` - Binary search to find page index for a character position
+- `estimateStartOffsetInCurrentPage()` - Estimate offset when segment starts mid-page
 - `findBreakpointWindowEndPosition()` - Compute window boundary in content-space (robust to marker stripping)
 - `applyPageJoinerBetweenPages()` - Normalize page-boundary join in output segments (`space` vs `newline`)
 - `findBreakPosition()` - Find break position using breakpoint patterns
@@ -306,7 +298,7 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
 1. **TypeScript strict mode** - No `any` types
 2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
 3. **JSDoc comments** - All exported functions documented
-4. **Test coverage** -
+4. **Test coverage** - 352 tests across 12 files
 
 ## Dependencies
 
@@ -360,6 +352,10 @@ bunx biome lint .
 
 9. **Auto-escaping improves DX significantly**: Users expect `(أ):` to match literal parentheses. Auto-escaping `()[]` in template patterns (but not `regex`) gives intuitive behavior while preserving power-user escape hatch.
 
+10. **Page boundary detection needs progressive prefixes**: When breakpoints split content mid-page, checking only the first N characters of a page to detect if the segment ends on that page can fail. Solution: try progressively shorter prefixes (`[80, 60, 40, 30, 20, 15, 12, 10, 8, 6]`) via `JOINER_PREFIX_LENGTHS`. The check uses `indexOf(...) > 0` (not `>= 0`) to avoid false positives when a page prefix appears at position 0 (which indicates the segment *starts* with that page, not *ends* on it).
+
+11. **Boundary-position algorithm improves page attribution**: Building a position map of page boundaries once per segment (O(n)) enables binary search for O(log n) lookups per piece. Key insight: when a segment starts mid-page (common after structural rules), expected boundary estimates must account for the offset into the starting page. Without this adjustment, position-based lookups can return the wrong page when pages have identical content prefixes.
+
 ### Architecture Insights
 
 - **Declarative > Imperative**: Users describe patterns, library handles regex
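Lessons 10 and 11 above describe the new page-attribution techniques in prose; the actual implementations appear in the `dist/index.mjs` diff below. A minimal sketch of the lesson-10 check, using only the `JOINER_PREFIX_LENGTHS` values quoted above; the helper name `segmentEndsOnPage` is hypothetical, written for illustration rather than taken from the package:

```ts
// Hypothetical illustration of lesson 10; not the library's actual helper.
// Try progressively shorter prefixes of a page's content until one is
// found inside the segment at a position strictly greater than 0.
const JOINER_PREFIX_LENGTHS = [80, 60, 40, 30, 20, 15, 12, 10, 8, 6];

const segmentEndsOnPage = (segmentContent: string, pageContent: string): boolean => {
    for (const length of JOINER_PREFIX_LENGTHS) {
        const prefix = pageContent.slice(0, length);
        if (!prefix) continue;
        // `> 0`, not `>= 0`: a hit at index 0 means the segment *starts*
        // with this page's text rather than ending on it.
        if (segmentContent.indexOf(prefix) > 0) return true;
    }
    return false;
};
```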
package/README.md
CHANGED
@@ -1,5 +1,20 @@
 # flappa-doormal
 
+<p align="center">
+<img src="icon.png" alt="flappa-doormal" width="128" height="128" />
+</p>
+
+<p align="center">
+<strong>Declarative Arabic text segmentation library</strong><br/>
+Split pages of content into logical segments using human-readable patterns.
+</p>
+
+<p align="center">
+<a href="https://flappa-doormal.surge.sh">🚀 <strong>Live Demo</strong></a> •
+<a href="https://www.npmjs.com/package/flappa-doormal">📦 npm</a> •
+<a href="https://github.com/ragaeeb/flappa-doormal">📚 GitHub</a>
+</p>
+
 [](https://wakatime.com/badge/user/a0b906ce-b8e7-4463-8bce-383238df6d4b/project/384fa29d-72e8-4078-980f-45d363f10507)
 [](https://github.com/ragaeeb/flappa-doormal/actions/workflows/build.yml)
 
@@ -12,8 +27,6 @@
 [](https://codecov.io/gh/ragaeeb/flappa-doormal)
 [](https://badge.fury.io/js/flappa-doormal)
 
-**Declarative Arabic text segmentation library** - Split pages of content into logical segments using human-readable patterns.
-
 ## Why This Library?
 
 ### The Problem
@@ -939,7 +952,7 @@ Complex logic is intentionally split into small, independently testable modules:
 
 - `src/segmentation/match-utils.ts`: match filtering + capture extraction
 - `src/segmentation/rule-regex.ts`: SplitRule → compiled regex builder (`buildRuleRegex`, `processPattern`)
-- `src/segmentation/breakpoint-utils.ts`: breakpoint windowing/exclusion helpers
+- `src/segmentation/breakpoint-utils.ts`: breakpoint windowing/exclusion helpers, page boundary join normalization, and progressive prefix page detection for accurate `from`/`to` attribution
 - `src/segmentation/breakpoint-processor.ts`: breakpoint post-processing engine (applies breakpoints after structural segmentation)
 
 ## Performance Notes
@@ -986,6 +999,30 @@ See [AGENTS.md](./AGENTS.md) for:
 - Algorithm explanations
 - Lessons learned during development
 
+## Demo
+
+An interactive demo is available at [flappa-doormal.surge.sh](https://flappa-doormal.surge.sh).
+
+The demo source code is located in the `demo/` directory and includes:
+- **Analysis**: Discover common line-start patterns in your text
+- **Pattern Detection**: Auto-detect tokens in text and get template suggestions
+- **Segmentation**: Apply rules and see segmented output with metadata
+
+To run the demo locally:
+
+```bash
+cd demo
+bun install
+bun run dev
+```
+
+To deploy updates:
+
+```bash
+cd demo
+bun run deploy
+```
+
 ## License
 
 MIT
package/dist/index.mjs
CHANGED
@@ -411,6 +411,72 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
     return -1;
 };
 /**
+ * Builds a boundary position map for pages within the given range.
+ *
+ * This function computes page boundaries once per segment and enables
+ * O(log n) page lookups via binary search with `findPageIndexForPosition`.
+ *
+ * Boundaries are derived from segmentContent (post-structural-rules).
+ * When the segment starts mid-page, an offset correction is applied to
+ * keep boundary estimates aligned with the segment's actual content space.
+ *
+ * @param segmentContent - Full segment content (already processed by structural rules)
+ * @param fromIdx - Starting page index
+ * @param toIdx - Ending page index
+ * @param pageIds - Array of all page IDs
+ * @param normalizedPages - Map of page ID to normalized content
+ * @param cumulativeOffsets - Cumulative character offsets (for estimates)
+ * @returns Array where boundaryPositions[i] = start position of page (fromIdx + i),
+ * with a sentinel boundary at segmentContent.length as the last element
+ *
+ * @example
+ * // For a 3-page segment:
+ * buildBoundaryPositions(content, 0, 2, pageIds, normalizedPages, offsets)
+ * // → [0, 23, 45, 67] where 67 is content.length (sentinel)
+ */
+const buildBoundaryPositions = (segmentContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets) => {
+    const boundaryPositions = [0];
+    const startOffsetInFromPage = estimateStartOffsetInCurrentPage(segmentContent, fromIdx, pageIds, normalizedPages);
+    for (let i = fromIdx + 1; i <= toIdx; i++) {
+        const expectedBoundary = cumulativeOffsets[i] !== void 0 && cumulativeOffsets[fromIdx] !== void 0 ? Math.max(0, cumulativeOffsets[i] - cumulativeOffsets[fromIdx] - startOffsetInFromPage) : segmentContent.length;
+        const pos = findPageStartNearExpectedBoundary(segmentContent, fromIdx, i, expectedBoundary, pageIds, normalizedPages);
+        const prevBoundary = boundaryPositions[boundaryPositions.length - 1];
+        if (pos > 0 && pos > prevBoundary && Math.abs(pos - expectedBoundary) < 2e3) boundaryPositions.push(pos);
+        else {
+            const estimate = Math.max(prevBoundary + 1, expectedBoundary);
+            boundaryPositions.push(Math.min(estimate, segmentContent.length));
+        }
+    }
+    boundaryPositions.push(segmentContent.length);
+    return boundaryPositions;
+};
+/**
+ * Binary search to find which page a position falls within.
+ * Uses "largest i where boundaryPositions[i] <= position" semantics.
+ *
+ * @param position - Character position in segmentContent
+ * @param boundaryPositions - Precomputed boundary positions (from buildBoundaryPositions)
+ * @param fromIdx - Base page index (boundaryPositions[0] corresponds to pageIds[fromIdx])
+ * @returns Page index in pageIds array
+ *
+ * @example
+ * // With boundaries [0, 20, 40, 60] and fromIdx=0:
+ * findPageIndexForPosition(15, boundaries, 0) // → 0 (first page)
+ * findPageIndexForPosition(25, boundaries, 0) // → 1 (second page)
+ * findPageIndexForPosition(40, boundaries, 0) // → 2 (exactly on boundary = that page)
+ */
+const findPageIndexForPosition = (position, boundaryPositions, fromIdx) => {
+    if (boundaryPositions.length <= 1) return fromIdx;
+    let left = 0;
+    let right = boundaryPositions.length - 2;
+    while (left < right) {
+        const mid = Math.ceil((left + right) / 2);
+        if (boundaryPositions[mid] <= position) left = mid;
+        else right = mid - 1;
+    }
+    return fromIdx + left;
+};
+/**
  * Finds the end position of a breakpoint window inside `remainingContent`.
  *
  * The window end is defined as the start of the page AFTER `windowEndIdx` (i.e. `windowEndIdx + 1`),
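To make the binary-search semantics concrete, here is a short walkthrough using only the boundary values from the `@example` blocks above; `findPageIndexForPosition` is the function added in this hunk:

```ts
// Boundaries [0, 20, 40, 60] describe a three-page segment where 60 is
// the sentinel at segmentContent.length, and fromIdx = 0.
const boundaries = [0, 20, 40, 60];

findPageIndexForPosition(15, boundaries, 0); // 0: largest i with boundaries[i] <= 15 is i = 0
findPageIndexForPosition(25, boundaries, 0); // 1: boundaries[1] = 20 <= 25 < 40
findPageIndexForPosition(40, boundaries, 0); // 2: a position exactly on a boundary belongs to the page that starts there
```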
@@ -446,57 +512,6 @@ const findExclusionBreakPosition = (currentFromIdx, windowEndIdx, toIdx, pageIds
     return -1;
 };
 /**
- * Finds the actual ending page index by searching backwards for page content prefix.
- * Used to determine which page a segment actually ends on based on content matching.
- *
- * @param pieceContent - Content of the segment piece
- * @param currentFromIdx - Current starting index in pageIds
- * @param toIdx - Maximum ending index to search
- * @param pageIds - Array of page IDs
- * @param normalizedPages - Map of page ID to normalized content
- * @returns The actual ending page index
- */
-const findActualEndPage = (pieceContent, currentFromIdx, toIdx, pageIds, normalizedPages) => {
-    for (let pi = toIdx; pi > currentFromIdx; pi--) {
-        const pageData = normalizedPages.get(pageIds[pi]);
-        if (pageData) {
-            const checkPortion = pageData.content.slice(0, Math.min(30, pageData.length));
-            if (checkPortion.length > 0 && pieceContent.indexOf(checkPortion) > 0) return pi;
-        }
-    }
-    return currentFromIdx;
-};
-/**
- * Finds the actual starting page index by searching forwards for page content prefix.
- * Used to determine which page content actually starts from based on content matching.
- *
- * This is the counterpart to findActualEndPage - it searches forward to find which
- * page the content starts on, rather than which page it ends on.
- *
- * @param pieceContent - Content of the segment piece
- * @param currentFromIdx - Current starting index in pageIds
- * @param toIdx - Maximum ending index to search
- * @param pageIds - Array of page IDs
- * @param normalizedPages - Map of page ID to normalized content
- * @returns The actual starting page index
- */
-const findActualStartPage = (pieceContent, currentFromIdx, toIdx, pageIds, normalizedPages) => {
-    const trimmedPiece = pieceContent.trimStart();
-    if (!trimmedPiece) return currentFromIdx;
-    for (let pi = currentFromIdx; pi <= toIdx; pi++) {
-        const pageData = normalizedPages.get(pageIds[pi]);
-        if (pageData) {
-            const pagePrefix = pageData.content.slice(0, Math.min(30, pageData.length)).trim();
-            const piecePrefix = trimmedPiece.slice(0, Math.min(30, trimmedPiece.length));
-            if (pagePrefix.length > 0) {
-                if (trimmedPiece.startsWith(pagePrefix)) return pi;
-                if (pageData.content.trimStart().startsWith(piecePrefix)) return pi;
-            }
-        }
-    }
-    return currentFromIdx;
-};
-/**
  * Checks if any page in a range is excluded by the given exclude set.
  *
  * @param excludeSet - Set of excluded page IDs
@@ -628,10 +643,22 @@ const computeWindowEndIdx = (currentFromIdx, toIdx, pageIds, maxPages) => {
 };
 const computeRemainingSpan = (currentFromIdx, toIdx, pageIds) => pageIds[toIdx] - pageIds[currentFromIdx];
 const createFinalSegment = (remainingContent, currentFromIdx, toIdx, pageIds, meta, includeMeta) => createSegment(remainingContent, pageIds[currentFromIdx], currentFromIdx !== toIdx ? pageIds[toIdx] : void 0, includeMeta ? meta : void 0);
-
-
+/**
+ * Computes the actual start and end page indices for a piece using
+ * precomputed boundary positions and binary search.
+ *
+ * @param pieceStartPos - Start position of the piece in the full segment content
+ * @param pieceEndPos - End position (exclusive) of the piece
+ * @param boundaryPositions - Precomputed boundary positions from buildBoundaryPositions
+ * @param baseFromIdx - Base page index (boundaryPositions[0] corresponds to pageIds[baseFromIdx])
+ * @param toIdx - Maximum page index
+ * @returns Object with actualStartIdx and actualEndIdx
+ */
+const computePiecePages = (pieceStartPos, pieceEndPos, boundaryPositions, baseFromIdx, toIdx) => {
+    const actualStartIdx = findPageIndexForPosition(pieceStartPos, boundaryPositions, baseFromIdx);
+    const endPos = Math.max(pieceStartPos, pieceEndPos - 1);
     return {
-        actualEndIdx:
+        actualEndIdx: Math.min(findPageIndexForPosition(endPos, boundaryPositions, baseFromIdx), toIdx),
         actualStartIdx
     };
 };
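Continuing the boundaries example from above, `computePiecePages` runs that lookup twice, once for the piece start and once for its last character (`pieceEndPos - 1`), so a piece whose exclusive end lands exactly on a boundary is not attributed to the following page:

```ts
// With boundaries [0, 20, 40, 60], baseFromIdx = 0, toIdx = 2:
// a piece spanning [15, 40) ends at character 39, which is still page 1.
const { actualEndIdx, actualStartIdx } = computePiecePages(15, 40, [0, 20, 40, 60], 0, 2);
// actualStartIdx === 0, actualEndIdx === 1
```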
@@ -648,79 +675,87 @@ const computeNextFromIdx = (remainingContent, actualEndIdx, toIdx, pageIds, norm
     return nextFromIdx;
 };
 const createPieceSegment = (pieceContent, actualStartIdx, actualEndIdx, pageIds, meta, includeMeta) => createSegment(pieceContent, pageIds[actualStartIdx], actualEndIdx > actualStartIdx ? pageIds[actualEndIdx] : void 0, includeMeta ? meta : void 0);
+/**
+ * Finds the break offset within a window, trying exclusions first, then patterns.
+ *
+ * @returns Break offset relative to remainingContent, or windowEndPosition as fallback
+ */
+const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer) => {
+    if (hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, windowEndIdx)) {
+        const exclusionBreak = findExclusionBreakPosition(currentFromIdx, windowEndIdx, toIdx, pageIds, expandedBreakpoints, cumulativeOffsets);
+        if (exclusionBreak > 0) return exclusionBreak;
+    }
+    const patternBreak = findBreakPosition(remainingContent, currentFromIdx, toIdx, windowEndIdx, windowEndPosition, {
+        expandedBreakpoints,
+        normalizedPages,
+        pageIds,
+        prefer
+    });
+    return patternBreak > 0 ? patternBreak : windowEndPosition;
+};
+/**
+ * Advances cursor position past any leading whitespace.
+ */
+const skipWhitespace = (content, startPos) => {
+    let pos = startPos;
+    while (pos < content.length && /\s/.test(content[pos])) pos++;
+    return pos;
+};
+/**
+ * Processes an oversized segment by iterating through the content and
+ * breaking it into smaller pieces that fit within maxPages constraints.
+ *
+ * Uses precomputed boundary positions for O(log n) page attribution lookups.
+ */
 const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger) => {
     const result = [];
-
+    const fullContent = segment.content;
+    let cursorPos = 0;
     let currentFromIdx = fromIdx;
     let isFirstPiece = true;
-
+    const boundaryPositions = buildBoundaryPositions(fullContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets);
+    logger?.debug?.("[breakpoints] boundaryPositions built", {
+        boundaryPositions,
+        fromIdx,
+        fullContentLength: fullContent.length,
+        toIdx
+    });
     const maxIterations = 1e4;
-
-
-        if (
-            logger?.error?.("INFINITE LOOP DETECTED! Breaking out, you should report this bug", { iterationCount: maxIterations });
-            break;
-        }
-        const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
+    for (let i = 0; i < maxIterations && cursorPos < fullContent.length && currentFromIdx <= toIdx; i++) {
+        const remainingContent = fullContent.slice(cursorPos);
+        if (!remainingContent.trim()) break;
         const remainingSpan = computeRemainingSpan(currentFromIdx, toIdx, pageIds);
+        const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
         if (remainingSpan <= maxPages && !remainingHasExclusions) {
            const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, segment.meta, isFirstPiece);
            if (finalSeg) result.push(finalSeg);
            break;
        }
        const windowEndIdx = computeWindowEndIdx(currentFromIdx, toIdx, pageIds, maxPages);
-        logger?.debug?.(`[breakpoints] iteration=${iterationCount}`, {
-            currentFromIdx,
-            currentFromPageId: pageIds[currentFromIdx],
-            remainingContentStart: remainingContent.slice(0, 50),
-            remainingContentLength: remainingContent.length,
-            remainingSpan,
-            toIdx,
-            toPageId: pageIds[toIdx],
-            windowEndIdx,
-            windowEndPageId: pageIds[windowEndIdx]
-        });
        const windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets);
-
-
-
-
-            expandedBreakpoints,
-            normalizedPages,
-            pageIds,
-            prefer
+        logger?.debug?.(`[breakpoints] iteration=${i}`, {
+            currentFromIdx,
+            cursorPos,
+            windowEndIdx
        });
-
-        const
-
-
-
-
+        const breakOffset = findBreakOffsetForWindow(remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer);
+        const breakPos = cursorPos + breakOffset;
+        const pieceContent = fullContent.slice(cursorPos, breakPos).trim();
+        const { actualEndIdx, actualStartIdx } = computePiecePages(cursorPos, breakPos, boundaryPositions, fromIdx, toIdx);
+        logger?.trace?.("[breakpoints] piece", {
+            actualEndIdx,
+            actualStartIdx,
+            pieceLength: pieceContent.length
        });
-        const { actualEndIdx, actualStartIdx } = computePiecePages(pieceContent, currentFromIdx, toIdx, windowEndIdx, pageIds, normalizedPages);
        if (pieceContent) {
            const pieceSeg = createPieceSegment(pieceContent, actualStartIdx, actualEndIdx, pageIds, segment.meta, isFirstPiece);
            if (pieceSeg) result.push(pieceSeg);
        }
-
-
-            actualEndIdx,
-            remainingContentLength: remainingContent.length,
-            remainingContentStart: remainingContent.slice(0, 60)
-        });
-        if (!remainingContent) {
-            logger?.debug?.("[breakpoints] done: no remaining content");
-            break;
-        }
-        currentFromIdx = computeNextFromIdx(remainingContent, actualEndIdx, toIdx, pageIds, normalizedPages);
-        logger?.debug?.("[breakpoints] nextIteration", {
-            currentFromIdx,
-            currentFromPageId: pageIds[currentFromIdx]
-        });
+        cursorPos = skipWhitespace(fullContent, breakPos);
+        currentFromIdx = computeNextFromIdx(fullContent.slice(cursorPos), actualEndIdx, toIdx, pageIds, normalizedPages);
        isFirstPiece = false;
    }
-    logger?.debug?.("[breakpoints]
+    logger?.debug?.("[breakpoints] done", { resultCount: result.length });
    return result;
 };
 /**
@@ -903,6 +938,77 @@ const anyRuleAllowsId = (rules, pageId) => {
     });
 };
 
+//#endregion
+//#region src/segmentation/replace.ts
+const DEFAULT_REPLACE_FLAGS = "gu";
+const normalizeReplaceFlags = (flags) => {
+    if (!flags) return DEFAULT_REPLACE_FLAGS;
+    const allowed = new Set([
+        "g",
+        "i",
+        "m",
+        "s",
+        "u",
+        "y"
+    ]);
+    const set = /* @__PURE__ */ new Set();
+    for (const ch of flags) {
+        if (!allowed.has(ch)) throw new Error(`Invalid replace regex flag: "${ch}" (allowed: gimsyu)`);
+        set.add(ch);
+    }
+    set.add("g");
+    set.add("u");
+    return [
+        "g",
+        "i",
+        "m",
+        "s",
+        "y",
+        "u"
+    ].filter((c) => set.has(c)).join("");
+};
+const compileReplaceRules = (rules) => {
+    const compiled = [];
+    for (const r of rules) {
+        if (r.pageIds && r.pageIds.length === 0) continue;
+        const flags = normalizeReplaceFlags(r.flags);
+        const re = new RegExp(r.regex, flags);
+        compiled.push({
+            pageIdSet: r.pageIds ? new Set(r.pageIds) : void 0,
+            re,
+            replacement: r.replacement
+        });
+    }
+    return compiled;
+};
+/**
+ * Applies ordered regex replacements to page content (per page).
+ *
+ * - Replacement rules are applied in array order.
+ * - Each rule is applied globally (flag `g` enforced) with unicode mode (flag `u` enforced).
+ * - `pageIds` can scope a rule to specific pages. `pageIds: []` skips the rule entirely.
+ *
+ * This function is intentionally **pure**:
+ * it returns a new pages array only when changes are needed, otherwise it returns the original pages.
+ */
+const applyReplacements = (pages, rules) => {
+    if (!rules || rules.length === 0 || pages.length === 0) return pages;
+    const compiled = compileReplaceRules(rules);
+    if (compiled.length === 0) return pages;
+    return pages.map((p) => {
+        let content = p.content;
+        for (const rule of compiled) {
+            if (rule.pageIdSet && !rule.pageIdSet.has(p.id)) continue;
+            content = content.replace(rule.re, rule.replacement);
+        }
+        if (content === p.content) return p;
+        return {
+            ...p,
+            content
+        };
+    });
+};
+
 //#endregion
 //#region src/segmentation/tokens.ts
 /**
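A short usage sketch of the relocated replace module, based on the rule shape visible in `compileReplaceRules` above (`regex`, `replacement`, optional `flags` and `pageIds`); the sample pages and rule values are invented for illustration:

```ts
// Hypothetical input pages.
const pages = [
    { content: "Chapter  1   intro", id: 1 },
    { content: "Chapter 2 body", id: 2 },
];

const result = applyReplacements(pages, [
    // Applied to every page, in array order; "g" and "u" are always enforced.
    { regex: "\\s+", replacement: " " },
    // pageIds scopes a rule to specific pages; pageIds: [] skips it entirely.
    { pageIds: [2], regex: "body", replacement: "main text" },
]);
// Pages no rule changed are returned by reference, so unchanged page
// objects keep their identity inside the new array.
```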
@@ -1480,77 +1586,6 @@ const buildRuleRegex = (rule, capturePrefix) => {
     };
 };
 
-//#endregion
-//#region src/segmentation/replace.ts
-const DEFAULT_REPLACE_FLAGS = "gu";
-const normalizeReplaceFlags = (flags) => {
-    if (!flags) return DEFAULT_REPLACE_FLAGS;
-    const allowed = new Set([
-        "g",
-        "i",
-        "m",
-        "s",
-        "u",
-        "y"
-    ]);
-    const set = /* @__PURE__ */ new Set();
-    for (const ch of flags) {
-        if (!allowed.has(ch)) throw new Error(`Invalid replace regex flag: "${ch}" (allowed: gimsyu)`);
-        set.add(ch);
-    }
-    set.add("g");
-    set.add("u");
-    return [
-        "g",
-        "i",
-        "m",
-        "s",
-        "y",
-        "u"
-    ].filter((c) => set.has(c)).join("");
-};
-const compileReplaceRules = (rules) => {
-    const compiled = [];
-    for (const r of rules) {
-        if (r.pageIds && r.pageIds.length === 0) continue;
-        const flags = normalizeReplaceFlags(r.flags);
-        const re = new RegExp(r.regex, flags);
-        compiled.push({
-            pageIdSet: r.pageIds ? new Set(r.pageIds) : void 0,
-            re,
-            replacement: r.replacement
-        });
-    }
-    return compiled;
-};
-/**
- * Applies ordered regex replacements to page content (per page).
- *
- * - Replacement rules are applied in array order.
- * - Each rule is applied globally (flag `g` enforced) with unicode mode (flag `u` enforced).
- * - `pageIds` can scope a rule to specific pages. `pageIds: []` skips the rule entirely.
- *
- * This function is intentionally **pure**:
- * it returns a new pages array only when changes are needed, otherwise it returns the original pages.
- */
-const applyReplacements = (pages, rules) => {
-    if (!rules || rules.length === 0 || pages.length === 0) return pages;
-    const compiled = compileReplaceRules(rules);
-    if (compiled.length === 0) return pages;
-    return pages.map((p) => {
-        let content = p.content;
-        for (const rule of compiled) {
-            if (rule.pageIdSet && !rule.pageIdSet.has(p.id)) continue;
-            content = content.replace(rule.re, rule.replacement);
-        }
-        if (content === p.content) return p;
-        return {
-            ...p,
-            content
-        };
-    });
-};
-
 //#endregion
 //#region src/segmentation/fast-fuzzy-prefix.ts
 /**
@@ -2120,14 +2155,43 @@ const convertPageBreaks = (content, startOffset, pageBreaks) => {
  */
 const segmentPages = (pages, options) => {
     const { rules = [], maxPages = 0, breakpoints = [], prefer = "longer", pageJoiner = "space", logger } = options;
+    logger?.info?.("[segmenter] starting segmentation", {
+        breakpointCount: breakpoints.length,
+        maxPages,
+        pageCount: pages.length,
+        prefer,
+        ruleCount: rules.length
+    });
     const processedPages = options.replace ? applyReplacements(pages, options.replace) : pages;
     const { content: matchContent, normalizedPages: normalizedContent, pageMap } = buildPageMap(processedPages);
-
+    logger?.debug?.("[segmenter] content built", {
+        pageIds: pageMap.pageIds,
+        totalContentLength: matchContent.length
+    });
+    const splitPoints = collectSplitPointsFromRules(rules, matchContent, pageMap);
+    const unique = dedupeSplitPoints(splitPoints);
+    logger?.debug?.("[segmenter] split points collected", {
+        rawSplitPoints: splitPoints.length,
+        uniqueSplitPoints: unique.length
+    });
+    let segments = buildSegments(unique, matchContent, pageMap, rules);
+    logger?.debug?.("[segmenter] structural segments built", {
+        segmentCount: segments.length,
+        segments: segments.map((s) => ({
+            contentLength: s.content.length,
+            from: s.from,
+            to: s.to
+        }))
+    });
     segments = ensureFallbackSegment(segments, processedPages, normalizedContent, pageJoiner);
     if (maxPages >= 0 && breakpoints.length) {
+        logger?.debug?.("[segmenter] applying breakpoints to oversized segments");
         const patternProcessor = (p) => processPattern(p, false).pattern;
-
+        const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner);
+        logger?.info?.("[segmenter] segmentation complete (with breakpoints)", { finalSegmentCount: result.length });
+        return result;
     }
+    logger?.info?.("[segmenter] segmentation complete (structural only)", { finalSegmentCount: segments.length });
     return segments;
 };
 /**
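Finally, the new logging hooks are all guarded with optional chaining (`logger?.info?.(...)`, `logger?.debug?.(...)`), so any subset of `info`/`debug`/`trace`/`error` can be supplied. A minimal sketch with a hypothetical two-page input:

```ts
const logger = {
    // Only the methods you provide are ever called.
    debug: (msg: string, data?: unknown) => console.debug(msg, data),
    info: (msg: string, data?: unknown) => console.info(msg, data),
};

// With no breakpoints configured, this logs "[segmenter] starting
// segmentation" and "[segmenter] segmentation complete (structural only)".
const segments = segmentPages(
    [{ content: "...", id: 1 }, { content: "...", id: 2 }],
    { logger, rules: [] },
);
```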
|