flappa-doormal 2.13.2 → 2.13.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +32 -34
- package/README.md +0 -3
- package/dist/index.d.mts +1 -30
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +68 -86
- package/dist/index.mjs.map +1 -1
- package/package.json +1 -1
package/AGENTS.md
CHANGED
@@ -453,46 +453,44 @@ bunx biome lint .
 1. **`lineStartsAfter` vs `lineStartsWith` is not “cosmetic”**: `lineStartsAfter` changes output by stripping the matched marker via an internal `contentStartOffset` during segment construction. If a client used it by accident, you cannot reconstruct the exact stripped prefix from output alone without referencing the original pages and re-matching the marker.
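As a sketch of the recovery constraint in point 1 — hypothetical helper name, not the library's API — the stripped prefix can only be re-derived from the original page line:

```typescript
// Hypothetical sketch (not the library's API): re-derive the prefix that
// `lineStartsAfter` stripped by re-matching the marker against the ORIGINAL
// page line; the segment output alone is not enough.
const recoverStrippedPrefix = (
  originalLine: string,
  markerRegex: RegExp, // must be compiled via the same path segmentation used
): string | undefined => {
  const m = markerRegex.exec(originalLine);
  // Everything up to and including the matched marker is what was stripped.
  return m ? originalLine.slice(0, m.index + m[0].length) : undefined;
};
```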
-2. **
-
-3. **Page joining differs between matching and output**:
+2. **Page joining differs between matching and output**:
 - Matching always happens on pages concatenated with `\\n` separators.
 - Output segments may normalize page boundaries (`pageJoiner: 'space' | 'newline'`) and breakpoints post-processing uses its own join normalization utilities.
 Recovery code must be explicit about which representation it’s searching.
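A minimal illustration of the two representations (assumed page shapes, not the library's internal types):

```typescript
// Matching always runs against '\n'-joined pages; output may be joined
// differently (pageJoiner: 'space'). A search valid in one view is not
// automatically valid in the other.
const pages = ['first page', 'second page'];
const matchingView = pages.join('\n'); // what rules are matched against
const outputView = pages.join(' ');    // what a 'space' pageJoiner emits
// A needle spanning the page boundary only exists in the matching view:
const spansBoundary = 'page\nsecond';
```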
|
|
462
460
|
|
|
463
|
-
|
|
461
|
+
3. **Breakpoints can produce “pieces” that were never marker-stripped**: When `maxPages` + `breakpoints` are enabled, only the piece that starts at the original structural boundary could have lost a marker due to `lineStartsAfter`. Mid-segment breakpoint pieces should not be “recovered” unless you can anchor them confidently.
|
|
464
462
|
|
|
465
|
-
|
|
463
|
+
4. **Fuzzy defaults are easy to miss**: Some tokens auto-enable fuzzy matching unless `fuzzy: false` is set (`bab`, `basmalah`, `fasl`, `kitab`, `naql`). If you are validating markers or re-matching prefixes, use the same compilation path as segmentation (`buildRuleRegex` / `processPattern`) so diacritics and token expansion behave identically.
|
|
466
464
|
|
|
467
|
-
|
|
465
|
+
5. **Auto-escaping applies to template-like patterns**: `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` auto-escape `()[]` outside `{{tokens}}`. Raw `regex` does not. If you compare patterns by string equality, be careful about escaping and whitespace.
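The dist bundle in this diff ships the escaping as a single replace; here is a standalone sketch of the same behavior (renamed to mark it as illustrative):

```typescript
// Escape ()[] only when they fall OUTSIDE {{token}} placeholders, mirroring
// the `escapeTemplateBrackets` implementation visible in dist/index.mjs.
const escapeTemplateBracketsSketch = (pattern: string): string =>
  pattern.replace(/(\{\{[^}]*\}\})|([()[\]])/g, (_match, token, bracket) =>
    token || `\\${bracket}`);
```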
-
+6. **TypeScript union pitfalls with `SplitRule`**: `SplitRule` is a union where only one pattern type should exist. Avoid mutating rules in-place with `delete` on fields (TS often narrows unions and then complains). Prefer rebuilding converted rules via destructuring (e.g. `{ lineStartsAfter, ...rest }` then create `{...rest, lineStartsWith: lineStartsAfter}`).
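The rebuild-instead-of-delete pattern from point 6, sketched against a simplified rule shape (the real `SplitRule` union has more members):

```typescript
// Simplified stand-in for the SplitRule union (assumption, not the real type).
type SplitRuleLike =
  | { lineStartsWith: string[]; split?: 'at' | 'after' }
  | { lineStartsAfter: string[]; split?: 'at' | 'after' };

// Rebuild the rule via destructuring instead of mutating with `delete`,
// so the union stays well-formed for the type checker.
const toLineStartsWith = (rule: SplitRuleLike): SplitRuleLike => {
  if ('lineStartsAfter' in rule) {
    const { lineStartsAfter, ...rest } = rule;
    return { ...rest, lineStartsWith: lineStartsAfter };
  }
  return rule;
};
```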
-
+7. **Biome lint constraints shape implementation**: The repo enforces low function complexity. Expect to extract helpers (alignment, selector resolution, anchoring) to keep Biome happy. Also, Biome can flag regex character-class usage as misleading; prefer alternation (e.g. `(?:\\u200C|\\u200D|\\uFEFF)`) when removing specific codepoints.
-
+8. **When debugging recovery, start here**:
 - `src/segmentation/segmenter.ts` (how content is sliced/trimmed and how `from/to` are computed)
 - `src/segmentation/rule-regex.ts` + `src/segmentation/tokens.ts` (token expansion + fuzzy behavior)
 - `src/preprocessing/replace.ts` (preprocessing parity)
 - `src/recovery.ts` (recovery implementation)
-
+9. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
-
+10. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
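A hedged sketch of the search-back fallback described in point 10 (the function name and lookback budget are assumptions, not the library's `findSafeBreakPosition`):

```typescript
// Walk backward from a hard limit to the nearest whitespace/punctuation so
// words are not chopped; fall back to the hard limit if nothing is found.
const SAFE_BREAK = /[\s\n.,;!?؛،۔]/; // includes Arabic punctuation, per the note above

const findSafeBreak = (text: string, hardLimit: number, maxLookback = 50): number => {
  const end = Math.min(hardLimit, text.length);
  for (let i = end; i > Math.max(0, end - maxLookback); i--) {
    if (SAFE_BREAK.test(text[i - 1])) return i; // break just after the separator
  }
  return end; // no safe position found: accept the hard limit
};
```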
-
+11. **Unicode Boundary Safety (Surrogates + Graphemes)**: Multi-byte characters (like emojis) can be corrupted if split in the middle of a surrogate pair. Similarly, Arabic diacritics (combining marks), ZWJ/ZWNJ, and variation selectors can be orphaned if a hard split lands in the middle of a grapheme cluster. Use `adjustForUnicodeBoundary` when forced to hard-split near a limit.
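A minimal surrogate-pair guard illustrating point 11; the library's `adjustForUnicodeBoundary` presumably also handles combining marks and joiners, which this sketch does not:

```typescript
// If the split position lands on a low surrogate (the second half of a
// surrogate pair), step back one UTF-16 unit so the pair stays intact.
const adjustForSurrogate = (text: string, pos: number): number => {
  const code = text.charCodeAt(pos);
  return code >= 0xdc00 && code <= 0xdfff ? pos - 1 : pos;
};
```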
-
+12. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
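The progress-based guard from point 12 as a sketch (the `step` callback is a hypothetical stand-in for one loop iteration of segment processing):

```typescript
// Stop when an iteration makes no forward progress, instead of relying on a
// fixed iteration cap that would truncate arbitrarily large content.
const consumeAll = (content: string, step: (pos: number) => number): number => {
  let cursorPos = 0;
  while (cursorPos < content.length) {
    const next = step(cursorPos);
    if (next <= cursorPos) break; // no progress: bail out instead of spinning
    cursorPos = next;
  }
  return cursorPos;
};
```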
-
+13. **Accidental File Overwrites**: Be extremely careful when using tools like `replace_file_content` with large ranges. Verify file integrity frequently (e.g., `git diff`) to catch accidental deletions of existing code or tests. Merging new tests into existing files is a high-risk operation for AI agents.
-
+14. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible formatting marks like Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), Arabic Letter Mark (`U+061C`), Zero-Width Space (`U+200B`), Zero-Width Non-Joiner (`U+200C`), Zero-Width Joiner (`U+200D`), or BOM (`U+FEFF`). These can appear at line starts after `\n` but before visible characters, breaking `^` anchored patterns. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\u200C\u200D\uFEFF]*(?:pattern)`. The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
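The zero-width-tolerant anchor described in point 14, built as a standalone regex (the library constructs this internally in `buildLineStartsWithRegexSource`; the `'## '` marker here is just an example):

```typescript
// Tolerate invisible directional/zero-width marks between line start and the
// first visible character, so '^'-anchored markers still match.
const ZW_PREFIX = '[\\u200E\\u200F\\u061C\\u200B\\u200C\\u200D\\uFEFF]*';
const lineStart = new RegExp(`^${ZW_PREFIX}(?:## )`, 'u');
```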
-
+15. **Large Segment Performance & Debugging Strategy**: When processing large books (1000+ pages), avoid O(n²) algorithms. The library uses a fast-path threshold (1000 pages) to switch from accurate string-search boundary detection to cumulative-offset-based slicing. Even on the iterative path (e.g. debug mode), we **slice only the active window (+padding)** per iteration (never `fullContent.slice(cursorPos)`), to avoid quadratic allocation/GC churn. To diagnose performance bottlenecks: (1) Look for logs with "Using iterative path" or "Using accurate string-search path" with large `pageCount` values, (2) Check `iterations` count in completion logs, (3) Strategic logs are placed at operation boundaries (start/end) NOT inside tight loops to avoid log-induced performance regression.
-
+16. **`maxPages=0` is a hard invariant**: When `maxPages=0`, breakpoint windows must never scan beyond the current page boundary. Relying purely on boundary detection (string search) can fail near page ends for long Arabic text + space joiners, letting the window “see” into the next page and creating multi-page segments. The safe fix is to clamp the breakpoint window to the current page’s end using `boundaryPositions` in breakpoint processing.
-
+17. **`''` breakpoint semantics depend on whether the window is page-bounded vs length-bounded**: `''` means “page boundary fallback”, but it’s intentionally **mode-dependent**:
 - **Page-bounded window (maxPages-driven)**: `''` should “swallow the remainder of the current page” (i.e. break at the **next page boundary**, not at an arbitrary character limit). This prevents accidentally consuming part of the next page when no other breakpoint patterns match.
 - **Length-bounded window (maxContentLength-driven)**: `''` should **not** force an early page-boundary break. In this mode we want the best split *near the length limit* (safe-break fallback → Unicode-safe hard split) even if that means a piece can cross a page boundary.
@@ -508,17 +506,17 @@ bunx biome lint .
 }
 ```
-
+18. **Beware `.only` in test files**: A single `it.only(...)` can mask unrelated failing fixtures for a long time. When debugging, remove `.only` as soon as you have a focused reproduction, and re-run the full suite to catch latent failures.
-
+19. **Tooling gotcha: IDE diagnostics vs actual parser**: If the editor shows parse errors but `bun test` and `bunx biome check` pass, suspect unsaved local edits or stale diagnostics rather than codebase syntax. Always validate with a direct `bunx biome check <file>` before making sweeping “syntax fix” edits.
-
+20. **Content-based page detection fails with overlapping content**: The `computeNextFromIdx` function uses prefix matching to detect page transitions. When page 0 ends with text identical to page 1's prefix, it incorrectly advances `currentFromIdx`. **Fix**: When `maxPages=0`, override content-based detection with position-based detection via `findPageIndexForPosition(cursorPos, boundaryPositions, fromIdx)`. Always trust cumulative offsets over content heuristics for strict page isolation.
-
+21. **Test edge cases with data that TRIGGERS the bug path**: Simple test data often bypasses problematic code paths. Ensure tests: (a) use `maxContentLength` to force sub-page splitting, (b) include enough content to exceed window sizes, (c) create overlapping/duplicate text at page boundaries, (d) verify that segments are actually split (not just checking no crashes).
-
+22. **Debug breakpoint processing with the logger**: Pass a `logger` object with `debug` and `trace` methods to `segmentPages()`. Key logs: `boundaryPositions built` (page boundary byte offsets), `iteration=N` (shows `currentFromIdx`, `cursorPos`, `windowEndPosition` per loop), `Complete` (final segment count).
-
+23. **Navigating `breakpoint-processor.ts`**: Key functions in (approximate) execution order:
 - `applyBreakpoints()` (entry point)
 - `processOversizedSegment()` (main loop)
@@ -530,15 +528,15 @@ bunx biome lint .
 - `advanceCursorAndIndex()` (progress)
 - `computeNextFromIdx()` (heuristic) **or** position-based override when `maxPages=0` (see #21)
-
+24. **Page attribution can drift in large-document breakpoint processing**: For ≥`FAST_PATH_THRESHOLD` segments, boundary positions may be derived from cumulative offsets (fast path). If upstream content is modified (e.g. marker stripping or accidental leading-trim), binary-search attribution can classify a piece as starting **before** `currentFromIdx`, inflating `(to - from)` and violating `maxPages`. **Fix**: clamp `actualStartIdx >= currentFromIdx` and re-apply the `maxPages` window using the same ID-span logic as `computeWindowEndIdx(...)` before creating the piece segment.
-
+25. **Offset fast path must respect page-ID span semantics**: `maxPages` in this library is enforced as an **ID span** invariant (`(to ?? from) - from <= maxPages`). For large segments, the offset-based fast path must choose `segEnd` using the same ID-window logic as `computeWindowEndIdx(...)` (not “N pages by index”), otherwise gaps (e.g. `2216 → 2218`) produce illegal spans.
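The ID-span invariant from point 25, expressed as a predicate over a simplified segment shape (the real `Segment` type has more fields):

```typescript
// maxPages is enforced as an ID-span invariant: (to ?? from) - from <= maxPages.
// Page-ID gaps (e.g. 2216 → 2218) count as a span of 2, not "two pages".
type SegmentSpan = { from: number; to?: number };

const violatesMaxPages = (seg: SegmentSpan, maxPages: number): boolean =>
  (seg.to ?? seg.from) - seg.from > maxPages;
```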
|
|
536
534
|
|
|
537
|
-
|
|
535
|
+
26. **Never `trimStart()` huge fallback content**: `ensureFallbackSegment()` constructs “all pages as one segment” when there are no structural split rules. If this giant content is `trimStart()`’d, cumulative offsets and derived boundary positions become inconsistent, which can lead to incorrect `from/to` attribution and `maxPages` violations that only appear on very large books.
|
|
538
536
|
|
|
539
|
-
|
|
537
|
+
27. **Always test both sides of the fast-path threshold**: Several breakpoint bugs only reproduce at or above `FAST_PATH_THRESHOLD` (1000). Add regressions at `threshold-1` and `threshold` to avoid “works in small unit tests, fails on full books” surprises.
|
|
540
538
|
|
|
541
|
-
|
|
539
|
+
28. **Breakpoint `split` behavior**: The `split: 'at' | 'after'` option for breakpoints controls where the split happens relative to the matched text:
|
|
542
540
|
- `'after'` (default): Match is included in the previous segment
|
|
543
541
|
- `'at'`: Match starts the next segment
|
|
544
542
|
Key implementation details in `findPatternBreakPosition`:
|
|
@@ -547,15 +545,15 @@ bunx biome lint .
|
|
|
547
545
|
- Zero-length matches (lookaheads) are always skipped to prevent infinite loops
|
|
548
546
|
- Empty pattern `''` forces `splitAt=false` since page boundaries have no matched text
|
|
549
547
|
|
|
550
|
-
|
|
548
|
+
29. **Unicode safety is the user's responsibility for patterns**: Unlike `findSafeBreakPosition` (which adjusts for grapheme boundaries), pattern-based breaks use the exact position where the user's regex matched. If a pattern matches mid-grapheme, that's a pattern authoring error, not a library bug. The library should NOT silently adjust pattern match positions.
|
|
551
549
|
|
|
552
|
-
|
|
550
|
+
30. **Fast path doesn't affect split behavior**: The offset-based fast path only applies to empty pattern `''` breakpoints (page boundary fallback), and empty patterns force `splitAt=false`. Pattern-based breakpoints with `split:'at'` never engage the fast path.
|
|
553
551
|
|
|
554
|
-
|
|
552
|
+
31. **Whitespace trimming affects split:'at' output**: `createSegment()` trims segment content. With `split:'at'`, if the matched text is whitespace-only, it will be trimmed from the start of the next segment. This is usually desirable for delimiter patterns.
|
|
555
553
|
|
|
556
|
-
|
|
554
|
+
32. **`prefer` semantics with `split:'at'`**: With `prefer:'longer'` + `split:'at'`, the algorithm selects the LAST valid match, maximizing content in the previous segment. This is correct but can be counterintuitive since the resulting previous segment might appear "shorter" than with `split:'after'`.
|
|
557
555
|
|
|
558
|
-
|
|
556
|
+
33. **Multi-agent review synthesis**: Getting implementation reviews from multiple AI models (Claude, GPT, Grok, Gemini) and synthesizing their feedback helps catch issues a single reviewer might miss. Key insight: when reviewers disagree on "critical" issues, investigate the codebase to verify claims before implementing fixes. Some "critical" issues are based on incorrect assumptions about how fast paths or downstream functions work.
|
|
559
557
|
|
|
560
558
|
### Process Template (Multi-agent design review, TDD-first)
|
|
561
559
|
|
package/README.md
CHANGED
@@ -1166,9 +1166,6 @@ const options: SegmentationOptions = {
 // - `flags` defaults to 'gu'. If provided, `g` and `u` are always enforced.
 // - `pageIds: []` means "apply to no pages" (skip that rule).
 // - Remember JSON escaping: to match a literal '.', use regex: "\\\\." in JSON.
-  replace: [
-    { regex: "([\\u0660-\\u0669]+)\\s*[-–—ـ]\\s*", replacement: "$1 - " }
-  ],
   rules: [
     { lineStartsWith: ['## '], split: 'at' }
   ],
package/dist/index.d.mts
CHANGED
@@ -698,26 +698,6 @@ interface Logger {
 	/** Log a warning message (potential issues) */
 	warn?: (message: string, ...args: unknown[]) => void;
 }
-/**
- * - Default regex flags: `gu` (global + unicode)
- * - If `flags` is provided, it is validated and merged with required flags:
- *   `g` and `u` are always enforced.
- *
- * `pageIds` controls which pages a rule applies to:
- * - `undefined`: apply to all pages
- * - `[]`: apply to no pages (rule is skipped)
- * - `[id1, id2, ...]`: apply only to those pages
- */
-type Replacement = {
-	/** Raw regex source string (no token expansion). Compiled with `u` (and always `g`). */
-	regex: string;
-	/** Replacement string (passed to `String.prototype.replace`). */
-	replacement: string;
-	/** Optional regex flags; `g` and `u` are always enforced. */
-	flags?: string;
-	/** Optional list of page IDs to apply this replacement to. Empty array means skip. */
-	pageIds?: number[];
-};
 /**
  * Segmentation options controlling how pages are split.
  *
@@ -751,12 +731,6 @@ type Replacement = {
  * };
  */
 type SegmentationOptions = {
-	/**
-	 * Optional pre-processing replacements applied to page content BEFORE segmentation.
-	 *
-	 * Replacements are applied per-page (not on concatenated content), in array order.
-	 */
-	replace?: Replacement[];
 	/**
 	 * Rules applied in order to find split points.
 	 *
@@ -891,9 +865,6 @@ type SegmentationOptions = {
 	logger?: Logger;
 };
 //#endregion
-//#region src/preprocessing/replace.d.ts
-declare const applyReplacements: (pages: Page[], rules?: Replacement[]) => Page[];
-//#endregion
 //#region src/recovery.d.ts
 type MarkerRecoverySelector = {
 	type: 'rule_indices';
@@ -1389,5 +1360,5 @@ declare const escapeTemplateBrackets: (pattern: string) => string;
 declare const escapeRegex: (s: string) => string;
 declare const makeDiacriticInsensitive: (text: string) => string;
 //#endregion
-export { type Breakpoint, type BreakpointRule, type CommonLineStartPattern, type DetectedPattern, type ExpandResult, type LineStartAnalysisOptions, type LineStartPatternExample, type Logger, type MarkerRecoveryReport, type MarkerRecoveryRun, type MarkerRecoverySelector, type OptimizeResult, PATTERN_TYPE_KEYS, type Page, type PageRange, type PatternProcessor, type PatternTypeKey, type RepeatingSequenceExample, type RepeatingSequenceOptions, type RepeatingSequencePattern, type RuleValidationResult, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, Token, type TokenKey, type TokenMapping, type ValidationIssue, type ValidationIssueType, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule,
+export { type Breakpoint, type BreakpointRule, type CommonLineStartPattern, type DetectedPattern, type ExpandResult, type LineStartAnalysisOptions, type LineStartPatternExample, type Logger, type MarkerRecoveryReport, type MarkerRecoveryRun, type MarkerRecoverySelector, type OptimizeResult, PATTERN_TYPE_KEYS, type Page, type PageRange, type PatternProcessor, type PatternTypeKey, type RepeatingSequenceExample, type RepeatingSequenceOptions, type RepeatingSequencePattern, type RuleValidationResult, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, Token, type TokenKey, type TokenMapping, type ValidationIssue, type ValidationIssueType, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule, applyTokenMappings, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandCompositeTokensInTemplate, expandTokens, expandTokensWithCaptures, formatValidationReport, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, optimizeRules, recoverMistakenLineStartsAfterMarkers, recoverMistakenMarkersForRuns, segmentPages, shouldDefaultToFuzzy, stripTokenMappings, suggestPatternConfig, templateToRegex, validateRules, withCapture };
 //# sourceMappingURL=index.d.mts.map
package/dist/index.d.mts.map
CHANGED
@@ -1 +1 @@
-{"version":3,"file":"index.d.mts","names":[],"sources":["../src/types/index.ts","../src/analysis/line-starts.ts","../src/analysis/repeating-sequences.ts","../src/detection.ts","../src/types/rules.ts","../src/optimization/optimize-rules.ts","../src/types/breakpoints.ts","../src/types/options.ts","../src/
+
{"version":3,"file":"index.d.mts","names":[],"sources":["../src/types/index.ts","../src/analysis/line-starts.ts","../src/analysis/repeating-sequences.ts","../src/detection.ts","../src/types/rules.ts","../src/optimization/optimize-rules.ts","../src/types/breakpoints.ts","../src/types/options.ts","../src/recovery.ts","../src/segmentation/breakpoint-utils.ts","../src/segmentation/pattern-validator.ts","../src/segmentation/segmenter.ts","../src/segmentation/tokens.ts","../src/utils/textUtils.ts"],"sourcesContent":[],"mappings":";;AAcA;AA4CA;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;AAGyB,KDpRb,OAAA,GCoRa;;;;ACzQzB;AAaA;AAOA;EAiOa,OAAA,EAAA,MAAA;EACF;;;EAEgB,IAAA,EAAA,MAAA;;;;AC5Q3B;AA+EA;AAgEA;EAuBa,EAAA,CAAA,EAAA,MAAA;EAiCA;;;;AC5M+B;AA4B3B;AA4BG;EA+Df,IAAA,CAAA,EJ5EM,MI4EN,CAAA,MAAA,EAAA,OAAsB,CAAA;AAAA,CAAA;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;AAAwB,KJ/RZ,IAAA,GI+RY;EAAc;;;;;;ECjV1B;AAsCZ;;;;AC7BA;EAyGY,OAAA,EAAA,MAAU;;;;AC/FtB;AA6CA;;;;;;;KPaY,SAAA;;;AAvEA,KCKA,wBAAA,GDwBK;EAeL,IAAA,CAAA,EAAI,MAAA;EA2BJ,WAAA,CAAS,EAAA,MAAA;;;;EClET,wBAAA,CAAA,EAAwB,OAAA;EAcxB,yBAAA,CAAuB,EAAA,OAAA;EAEvB,MAAA,CAAA,EAAA,aAAA,GAAsB,OAAA;EA4PrB,UAAA,CAAA,EAAA,CAAA,IAAA,EAAA,MAmBZ,EAAA,MAAA,EAAA,MAAA,EAAA,GAAA,OAAA;EAlBU,cAAA,CAAA,EAnQU,MAmQV,EAAA;EACE,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;CACV;AAAsB,KAjQb,uBAAA,GAiQa;;;;ACzQb,KDUA,sBAAA,GCVwB;EAaxB,OAAA,EAAA,MAAA;EAOA,KAAA,EAAA,MAAA;EAiOC,QAAA,EDxOC,uBC6Pb,EAAA;CApBU;;;;cDgBE,iCACF,kBACE,6BACV;;;AAHU,KCtQD,wBAAA,GDyRX;EAlBU,WAAA,CAAA,EAAA,MAAA;EACE,WAAA,CAAA,EAAA,MAAA;EACV,QAAA,CAAA,EAAA,MAAA;EAAsB,IAAA,CAAA,EAAA,MAAA;;;;ECzQb,WAAA,CAAA,EAAA,MAAA;EAaA,YAAA,CAAA,EAAA,MAAA;EAOA,iBAAA,CAAA,EAAA,MAAA;AAiOZ,CAAA;AACW,KAzOC,wBAAA,GAyOD;EACG,IAAA,EAAA,MAAA;EACX,OAAA,EAAA,MAAA;EAAwB,MAAA,EAAA,MAAA;;;KApOf,wBAAA;ECxCA,OAAA,EAAA,MAAA;EA+EC,KAAA,EAAA,MAAA;EAgEA,QAAA,EDpGC,wBCmHb,EAAA;AAQD,CAAA;;;AC3K4C;AA4B3B;AA4BG;AA8BM;AAyDrB,cF+HQ,yBE/HW,EAAA,CAAA,KAAA,EFgIb,IEhIa,EAAA,EAAA,OAAA,CAAA,EFiIV,wBEjIU,EAAA,GFkIrB,wBElIqB,EAAA;;;;AJjIxB;AA4CA;AA2BY,KGhFA,eAAA,
GHgFS;;;;EClET,KAAA,EAAA,MAAA;EAcA;EAEA,KAAA,EAAA,MAAA;EA4PC;EACF,QAAA,EAAA,MAAA;CACE;;;;;;ACxQb;AAaA;AAOA;AAiOA;;;;;;;cC1La,uCAAmC;AA/EhD;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAuCtB,cDVQ,wBCUG,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,QAAA,EDViD,eCUjD,EAAA,EAAA,GAAA,MAAA;;;;;;;AAsBH,cDTA,oBCSuG,EAAA,CAAA,QAAA,EDRtG,eCQsG,EAAA,EAAA,GAAA;EAOxG,WAAA,EAAA,gBAAyB,GAAA,iBAAiB;EAUjD,KAAA,EAAA,OAAA;EAyCA,QAAA,CAAA,EAAA,MAAA;AA2GL,CAAA;;;;;;;cD7Ia;EEpMD,QAAA,EAAA,MAAA;EAsCC,WAAA,EAAA,gBAAwB,GAAA,iBAAA;;;YFqKvB;AGlMd,CAAA,GAAY,IAAA;;;ANHZ;AA4CA;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;;;;;ACtQA;AAaA;AAOA;AAiOA;;;;;;KElPK,YAAA;;EDvBO,KAAA,EAAA,MAAA;AA+EZ,CAAA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;KAjSK,eAAA,GAiSmB;EAAc;EAAgB,QAAA,EAAA,MAAA;CAAe;;;;ACjVrE;AAsCA;;;;AC7BA;AAyGA;;;;AC/FA;AA6CA;;;;;;;;;ACjEA;AAKA;KJ0EK,qBAAA,GIzEQ;EACF;EACG,cAAA,EAAA,MAAA,EAAA;CACA;;AAGd;AA2BE;AAknBF;;;;;;;;;AAsDA;;;;;;;;;ACrjBA;;;;ACpJA;AAKA;AAcA;KNuFK,sBAAA,GMtFiB;EACC;EACH,eAAA,EAAA,MAAA,EAAA;CACL;;AAuFf;AAsCA;;;;ACgJA;;;;;;;;AC1MA;AAsCA;AAGA;AAiBA;AAgDA;AA2CA,KR7GK,mBAAA,GQgHJ;EAQW;EAkJC,YAAA,EAAA,MAAA,EAAA;AA8Cb,CAAA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;ACxjBA,KTyHK,WAAA,GACC,YSxHL,GTyHK,eSzHL,GT0HK,qBS1HL,GT2HK,sBS3HL,GT4HK,mBS5HL;AAoDD;AAeA;;;;;;;;;;;;;;cT0Ea;;;;;;KAOD,cAAA,WAAyB;;;;;;;KAUhC,aAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAyCA,eAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAyCS;;;;;;;;;;;;SAaH;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqDC,SAAA,GAAY,cAAc,gBAAgB;;;AJ3UtD;AA4CA;AA2BA;KK7EY,cAAA;;SAED;EJSC;EAcA,WAAA,EAAA,MAAA;AAEZ,CAAA;AA4Pa,cIjPA,aJoQZ,EAAA,CAAA,KAAA,EIpQoC,SJoQpC,EAAA,EAAA,GAAA;EAlBU,WAAA,EAAA,MAAA;EACE,KAAA,EInPwB,SJmPxB,EAAA;CACV;;;ADpRH;AA4CA;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;;;KK9QY,cAAA;;AJQZ;AAaA;AAOA;AAiOA;;;;;;;;ACzQA;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;EA0DZ,KAAA,CAAA,EAAA,MAAA;EAiCA;AAAsB;AAwBH;;;;;;;AAqCxB;EAOY,KAAA,CAAA,EAAA,IAAA,GAAA,OAAc;EAUrB;AAAa;AAoJlB;;EAAsC,GAAA,CAAA,E
AAA,MAAA;EAAgB;;;;;ECjV1C;AAsCZ;;;;AC7BA;AAyGA;;;;AC/FA;AA6CA;;;;;;;YDac;;AE9Ed;AAKA;;;;;;AAOA;AA2BE;AAknBF;;;;;;;;EASoD,QAAA,CAAA,EAAA,MAAA;AA6CpD,CAAA;;;;;;;;;ACrjBA;;;;ACpJA;AAKA;AAcY,KJ0FA,UAAA,GI1FA,MAAoB,GJ0FE,cI1FF;;;AV0BhC;AA2BA;;;;AClEA;AAcA;AAEA;AA4PA;;;;;;;;ACtQA;AAaA;AAOA;AAiOA;;;;;UKnPiB,MAAA;;;EJtBL;EA+EC,KAAA,CAAA,EAAA,CAAA,OAAA,EAAA,MAiDZ,EAAA,GAAA,IAjD+C,EAAA,OAAA,EAAA,EAAA,GAAA,IAAA;EAgEnC;EAuBA,IAAA,CAAA,EAAA,CAAA,OAAA,EAAA,MAyBZ,EAAA,GAAA,IAAA,EAxBa,OAAA,EAAA,EAAA,GAAA,IAAe;EAgChB;;;;AC5M+B;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;;;;;;;;ACjVA;AAsCA;;;;AC7BA;AAyGA;;;KClDY,mBAAA;EA7CK;AA6CjB;;;;;;UAQY;;;ACzEZ;AAKA;;;;;;EAOY,KAAA,CAAA,EAAA,OAAA,GAAA;IA6BP;IAgnBW,OAAA,CAAA,EAAA,MAAA;IACL;IACG,OAAA,CAAA,EDjkBU,KCikBV,CAAA,MAAA,GAAA,YAAA,CAAA;EACD,CAAA;EACC;;;;;AAkDd;;;;;;;;;ECrjBY,QAAA,CAAA,EAAA,MAAA;;;;ACpJZ;AAKA;AAcA;;;;;;EA2Fa,gBAwBP,CAAA,EAAA,MAAA;EAcO;;;;ACgJb;;;;;;;;AC1MA;AAsCA;AAGA;AAiBA;AAgDA;AA2CA;AAWA;AAkJA;AA8CA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;gBLhckB;;;AMxHlB;AAsDA;AAeA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;WN2Ga;;;;APhID,KQ9EA,sBAAA,GR8ES;;;;EClET,IAAA,EAAA,0BAAwB;EAcxB,KAAA,CAAA,EAAA,OAAA,GAAA,YAAuB;EAEvB,QAAA,EAAA,MAAA,EAAA;AA4PZ,CAAA,GAAa;EACF,IAAA,EAAA,WAAA;EACE,SAAA,EAAA,CAAA,IAAA,EOvRgC,SPuRhC,EAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;CACV;AAAsB,KOtRb,iBAAA,GPsRa;WOrRZ;SACF;YACG;ENUF,QAAA,EMTE,sBNSsB;AAapC,CAAA;AAOY,KM1BA,oBAAA,GN0BwB;EAiOvB,OAAA,EAAA;IACF,IAAA,EAAA,YAAA,GAAA,wBAAA;IACG,SAAA,EAAA,MAAA;IACX,aAAA,EAAA,MAAA;IAAwB,SAAA,EAAA,MAAA;;;UMtPf;ILtBA,SAAA,EAAA,MAAe;IA+Ed,QAAA,EAAA,MAAA;IAgEA,aAAA,EAAA,MAAA;IAuBA,UAAA,EAAA,MAAA;EAiCA,CAAA,CAAA;WK3KA;;;IJLR,sBAAY,CAAA,EAAA,MAAA;IA4BZ,qBAAe,CAAA,EAAA,MAAA;IA8Bf,YAAA,EAAA,MAAqB;IAiCrB,MAAA,EAAA,WAAA,GAAsB,oBAAA,GAAA,WAAA,GAAA,sBAAA,GAAA,qBAAA;IAwBtB,QAAA,EAAA,OAAA,GAAmB,QAAA,GAAA,MAAA;IAenB,EAAA,CAAA,EAAA,MAAW;IACV,KAAA,CAAA,EAAA,MAAA,EAAA;EACA,CAAA,CAAA;EACA,MAAA,EAAA,MAAA,EAAA;EACA,QAAA,EAAA,MAAA,EAAA;CACA;KInHD,oBAAA,GJmHoB,MAAA,GAAA,YAAA,G
AAA,qBAAA;AAiBZ,iBI4eG,qCAAA,CJ5eoG,KAAA,EI6ezG,IJ7eyG,EAAA,EAAA,QAAA,EI8etG,OJ9esG,EAAA,EAAA,OAAA,EI+evG,mBJ/euG,EAAA,QAAA,EIgftG,sBJhfsG,EAAA,IAiBlG,CAjBkG,EAAA;EAOxG,IAAA,CAAA,EAAA,YAAc,GAAA,wBAAW;EAUhC,gBAAa,CAAA,EIkeS,oBJleT;AAAA,CAAA,CAAA,EAyCb;EA2GO,MAAA,EIgVC,oBJhVQ;EAAG,QAAA,EIgVqB,OJhVrB,EAAA;CAAc;AAAgB,iBI6XtC,6BAAA,CJ7XsC,IAAA,EI8X5C,iBJ9X4C,EAAA,EAAA,KAAA,EAAA;EAAe,IAAA,CAAA,EAAA,YAAA,GAAA,wBAAA;qBI+XW;;UACnE;EHjtBD,QAAA,EGitBiC,OHjtBnB,EAAA;AAsC1B,CAAA;;;;AG2qBoD,KCxjBxC,gBAAA,GDwjBwC,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;APtsBpD;AAcA;AAEA;AA4Pa,KSlRD,mBAAA,GTqSX,gBAAA,GAAA,eAAA,GAAA,WAAA,GAAA,eAAA;;;;AAhBwB,KShRb,eAAA,GTgRa;QS/Qf;;;ERME;EAaA,KAAA,CAAA,EAAA,MAAA;EAOA;EAiOC,OAAA,CAAA,EAAA,MAAA;CACF;;;;;KQ/OC,oBAAA;oBACU;EP5BV,eAAA,CAAA,EAAe,CO6BJ,eP7BI,GAAA,SAAA,CAAA,EAAA;EA+Ed,YAAA,CAAA,EAAA,COjDO,ePkGnB,GAjD+C,SAAA,CAAA,EAAA;EAgEnC,QAAA,CAAA,EOhHE,ePgHF;AAuBb,CAAA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAmD3D,cMnHQ,aN4JC,EAAA,CAAA,KAAA,EM5JuB,SNyKpB,EAAA,EAAA,GAAA,CMzK+B,oBNyK/B,GAAA,SAAA,CAAA,EAAA;AAqDjB;;;;;;;;ACjVA;AAsCA;;;;AC7BY,cIgJC,sBJ5EU,EAAA,CAAA,OAAA,EAAA,CI4E0B,oBJ5E1B,GAAA,SAAA,CAAA,EAAA,EAAA,GAAA,MAAA,EAAA;;;;;AJ5DvB;AAaA;AAOA;AAiOA;;;;;;;;ACzQA;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC5M4C;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AAU9C;AAoJlB;;;;;cOxCa,sBAAuB,iBAAiB,wBAAmB;;;;AXnS5D,cYyFC,KZ5DF,EAAA;EAeC;EA2BA,SAAA,GAAS,EAAA,SAAA;;;;EClET,SAAA,MAAA,EAAA,YAAwB;EAcxB;EAEA,SAAA,IAAA,EAAA,UAAsB;EA4PrB;EACF,SAAA,IAAA,EAAA,UAAA;EACE;EACV,SAAA,IAAA,EAAA,UAAA;EAAsB;;;;ECzQb;EAaA,SAAA,IAAA,EAAA,UAAA;EAOA;EAiOC,SAAA,GAAA,EAAA,SAAA;EACF;EACG,SAAA,QAAA,EAAA,cAAA;EACX;EAAwB,SAAA,IAAA,EAAA,UAAA;;;;EC5Qf,SAAA,KAAA,EAAA,WAAe;EA+Ed;EAgEA,SAAA,KAAA,EAAA,WAeZ;EAQY;EAiCA,SAAA,MAAA,EAAA,YAmBZ;;;;AC/N2C;AAwDvC,KQqFO,QAAA,GRrFQ,MAAA,OQqFgB,KRrFhB;AAAA;AA+Df,cQyBQ,WRzBc,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAAA;AAuCtB,cQGQ,+BRHG,EAAA,CAAA,QAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;;;;AAsBhB;AAOA;AAAgE;AAU9C;AAoJlB;;
;;;;;;ACjVA;AAsCA;;;;AC7BA;AAyGA;cMuFa,gBAAgB;;;ALtL7B;AA6CA;;;;;;;;;ACjEA;AAKA;;AAEW,cI8OE,cJ9OF,EAAA,CAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;;;;AAKX;AA2BE;AAknBF;AACW,KI1ZC,YAAA,GJ0ZD;EACG;;;;;EAO+B,OAAA,EAAA,MAAA;EAAO;AA6CpD;;;;EAG6C,YAAA,EAAA,MAAA,EAAA;EAAO;;;;ACxjBpD;;;;ACpJA;AAKA;AAcA;;;;;;AA2FA;AAsCA;;;;ACgJA;;;;;;;;AC1MA;AAsCA;AAGA;AAiBA;AAgDA;AA2CA;AAWA;AAkJA;AA8CA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBa,cAvKA,wBAuKkD,EAAA,CAAA,KAAY,EAAA,MAAA,EAAA,cAAA,CAAA,EAAA,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA,EAAA,aAAA,CAAA,EAAA,MAAA,EAAA,GAAA;EA6B9D,YAAA,EAAA,MAAA,EAGZ;;;;AC3jBD;AAsDA;AAeA;;;;;;;;;;;;;;;;;cD6Va;;;;;;;;;;;;;;;;;;;;;;cAuBA,uCAAmC;;;;;;;;;;;;;cAqBnC;;;;;;;;;;;;;;;cAgBA;;;;;;;;;;;;;;;cA8BA;;;;KAWD,YAAA;;;;;;;;;;;;;;;;;;;;;;cAoBC,iDAAkD;;;;;;;;;;;;;;;;cA6BlD;;;AX9Tb;;;;;;;;ACtQA;AAaA;AAOA;AAiOA;;;;;;;;ACzQA;AA+EA;AAgEA;AAuBA;AAiCa,cUvKA,sBV8KC,EAAA,CAAA,OAAe,EAAA,MAAA,EAAA,GAAA,MAAA;;;;ACnNe;AA4B3B;AA4BG;AA8BM;AAiCC;AAwBH;;;;;AAoBlB,cSxEO,WTwEP,EAAA,CAAA,CAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAAmB,cSzDZ,wBTyDY,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA"}
package/dist/index.mjs
CHANGED
@@ -34,7 +34,9 @@ const normalizeLineEndings = (content) => {
 * escapeTemplateBrackets('{{harf}}')
 * // → '{{harf}}' (unchanged - no brackets outside tokens)
 */
-const escapeTemplateBrackets = (pattern) =>
+const escapeTemplateBrackets = (pattern) => {
+	return pattern.replace(/(\{\{[^}]*\}\})|([()[\]])/g, (_match, token, bracket) => token || `\\${bracket}`);
+};
 /**
 * Character class matching all Arabic diacritics (Tashkeel/Harakat).
 *
@@ -93,7 +95,9 @@ const getEquivClass = (ch) => {
 	const group = EQUIV_GROUPS.find((g) => g.includes(ch));
 	return group ? `[${group.map(escapeRegex).join("")}]` : escapeRegex(ch);
 };
-const normalizeArabicLight = (str) =>
+const normalizeArabicLight = (str) => {
+	return str.normalize("NFC").replace(/[\u200C\u200D]/g, "").replace(/\s+/g, " ").trim();
+};
 const makeDiacriticInsensitive = (text) => {
 	const diacriticsMatcher = `${DIACRITICS_CLASS}*`;
 	return Array.from(normalizeArabicLight(text)).map((ch) => getEquivClass(ch) + diacriticsMatcher).join("");
@@ -1224,53 +1228,6 @@ const optimizeRules = (rules) => {
 };
 };
 
-//#endregion
-//#region src/preprocessing/replace.ts
-const DEFAULT_REPLACE_FLAGS = "gu";
-const normalizeReplaceFlags = (flags) => {
-	if (!flags) return DEFAULT_REPLACE_FLAGS;
-	const allowed = new Set([
-		"g",
-		"i",
-		"m",
-		"s",
-		"u",
-		"y"
-	]);
-	const set = new Set(flags.split("").filter((ch) => {
-		if (!allowed.has(ch)) throw new Error(`Invalid replace regex flag: "${ch}" (allowed: gimsyu)`);
-		return true;
-	}));
-	set.add("g");
-	set.add("u");
-	return [
-		"g",
-		"i",
-		"m",
-		"s",
-		"y",
-		"u"
-	].filter((c) => set.has(c)).join("");
-};
-const compileReplaceRules = (rules) => rules.filter((r) => !(r.pageIds && r.pageIds.length === 0)).map((r) => ({
-	pageIdSet: r.pageIds ? new Set(r.pageIds) : void 0,
-	re: new RegExp(r.regex, normalizeReplaceFlags(r.flags)),
-	replacement: r.replacement
-}));
-const applyReplacements = (pages, rules) => {
-	if (!rules?.length || !pages.length) return pages;
-	const compiled = compileReplaceRules(rules);
-	if (!compiled.length) return pages;
-	return pages.map((p) => {
-		let content = p.content;
-		for (const rule of compiled) if (!rule.pageIdSet || rule.pageIdSet.has(p.id)) content = content.replace(rule.re, rule.replacement);
-		return content === p.content ? p : {
-			...p,
-			content
-		};
-	});
-};
-
 //#endregion
 //#region src/segmentation/rule-regex.ts
 /**
|
|
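For anyone who depended on the removed replace-rules preprocessing: the deleted `normalizeReplaceFlags` forced `g` and `u` onto whatever flags a rule supplied and rejected anything outside `gimsyu`. A condensed sketch of that deleted behavior, for reference only (it is not an API of 2.13.4):

```javascript
// Condensed restatement of the deleted normalizeReplaceFlags: validate
// against gimsyu, force-add g and u, and emit flags in a canonical order.
const normalizeReplaceFlags = (flags) => {
    if (!flags) return "gu";
    const allowed = new Set(["g", "i", "m", "s", "u", "y"]);
    for (const ch of flags) {
        if (!allowed.has(ch)) throw new Error(`Invalid replace regex flag: "${ch}" (allowed: gimsyu)`);
    }
    const set = new Set([...flags, "g", "u"]);
    return ["g", "i", "m", "s", "y", "u"].filter((c) => set.has(c)).join("");
};

console.log(normalizeReplaceFlags("i")); // "giu"
console.log(normalizeReplaceFlags());    // "gu"
```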
@@ -1928,24 +1885,63 @@ const findPatternBreakPosition = (windowContent, regex, prefer, splitAt = false)
  * Handles page boundary breakpoint (empty pattern).
  * Returns break position or -1 if no valid position found.
  */
+const findStartOfNextPageInWindow = (remainingContent, currentFromIdx, toIdx, pageIds, normalizedPages, targetPos) => {
+  const targetNextPageIdx = currentFromIdx + 1;
+  for (let nextIdx = targetNextPageIdx; nextIdx > currentFromIdx; nextIdx--) if (nextIdx <= toIdx) {
+    const nextPageData = normalizedPages.get(pageIds[nextIdx]);
+    if (nextPageData) {
+      const boundaryPos = findNextPagePosition(remainingContent, nextPageData);
+      if (boundaryPos > 0 && boundaryPos <= targetPos) return boundaryPos;
+    }
+  }
+  return -1;
+};
 const handlePageBoundaryBreak = (remainingContent, currentFromIdx, windowEndPosition, maxContentLength, toIdx, pageIds, normalizedPages) => {
   const targetPos = Math.min(windowEndPosition, remainingContent.length);
-
-
-
-
-  if (nextPageData) {
-    const boundaryPos = findNextPagePosition(remainingContent, nextPageData);
-    if (boundaryPos > 0 && boundaryPos <= targetPos) return boundaryPos;
-  }
-  }
+  const isLengthBounded = maxContentLength !== void 0 && windowEndPosition === maxContentLength;
+  if (!isLengthBounded) {
+    const boundaryPos = findStartOfNextPageInWindow(remainingContent, currentFromIdx, toIdx, pageIds, normalizedPages, targetPos);
+    if (boundaryPos > 0) return { pos: boundaryPos };
   }
   if (targetPos < remainingContent.length) {
     const safePos = findSafeBreakPosition(remainingContent, targetPos);
-    if (safePos !== -1) return
-
+    if (safePos !== -1) return {
+      pos: safePos,
+      splitReason: isLengthBounded ? "whitespace" : void 0
+    };
+    return {
+      pos: adjustForUnicodeBoundary(remainingContent, targetPos),
+      splitReason: isLengthBounded ? "unicode_boundary" : void 0
+    };
   }
-  return targetPos;
+  return { pos: targetPos };
+};
+const checkBreakpointMatch = (i, remainingContent, currentFromIdx, toIdx, windowEndIdx, windowEndPosition, ctx, maxContentLength) => {
+  const { pageIds, normalizedPages, expandedBreakpoints, prefer } = ctx;
+  const bpCtx = expandedBreakpoints[i];
+  const { rule, regex, excludeSet, skipWhenRegex } = bpCtx;
+  if (!isInBreakpointRange(pageIds[currentFromIdx], rule)) return null;
+  if (hasExcludedPageInRange(excludeSet, pageIds, currentFromIdx, windowEndIdx)) return null;
+  if (skipWhenRegex?.test(remainingContent)) return null;
+  if (regex === null) {
+    const result = handlePageBoundaryBreak(remainingContent, currentFromIdx, windowEndPosition, maxContentLength, toIdx, pageIds, normalizedPages);
+    return {
+      breakPos: result.pos,
+      breakpointIndex: i,
+      contentLengthSplit: result.splitReason && maxContentLength ? {
+        maxContentLength,
+        reason: result.splitReason
+      } : void 0,
+      rule
+    };
+  }
+  const breakPos = findPatternBreakPosition(remainingContent.slice(0, Math.min(windowEndPosition, remainingContent.length)), regex, prefer, bpCtx.splitAt);
+  if (breakPos > 0) return {
+    breakPos,
+    breakpointIndex: i,
+    rule
+  };
+  return null;
 };
 /**
  * Tries to find a break position within the current window using breakpoint patterns.
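`adjustForUnicodeBoundary` itself is not part of this diff, so its body is unknown here. The sketch below is an assumption of what such a guard typically does — back off one UTF-16 code unit when the cut would land between the halves of a surrogate pair — and is named `...Guess` to flag that it is hypothetical, not the package's implementation:

```javascript
// HYPOTHETICAL sketch: if pos falls between a high and low surrogate,
// move the cut back one code unit so no character is split in half.
const adjustForUnicodeBoundaryGuess = (content, pos) => {
    if (pos > 0 && pos < content.length) {
        const prev = content.charCodeAt(pos - 1);
        if (prev >= 0xd800 && prev <= 0xdbff) return pos - 1; // high surrogate before pos
    }
    return pos;
};

const s = "ab\u{1F600}cd"; // the emoji occupies indices 2-3 as a surrogate pair
console.log(adjustForUnicodeBoundaryGuess(s, 3)); // 2 — moved off the pair
console.log(adjustForUnicodeBoundaryGuess(s, 4)); // 4 — already a valid boundary
```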
@@ -1959,23 +1955,10 @@ const handlePageBoundaryBreak = (remainingContent, currentFromIdx, windowEndPosi
  * @returns Break position in the content, or -1 if no break found
  */
 const findBreakPosition = (remainingContent, currentFromIdx, toIdx, windowEndIdx, windowEndPosition, ctx, maxContentLength) => {
-  const {
+  const { expandedBreakpoints } = ctx;
   for (let i = 0; i < expandedBreakpoints.length; i++) {
-    const
-    if (
-    if (hasExcludedPageInRange(excludeSet, pageIds, currentFromIdx, windowEndIdx)) continue;
-    if (skipWhenRegex?.test(remainingContent)) continue;
-    if (regex === null) return {
-      breakPos: handlePageBoundaryBreak(remainingContent, currentFromIdx, windowEndPosition, maxContentLength, toIdx, pageIds, normalizedPages),
-      breakpointIndex: i,
-      rule
-    };
-    const breakPos = findPatternBreakPosition(remainingContent.slice(0, Math.min(windowEndPosition, remainingContent.length)), regex, prefer, expandedBreakpoints[i].splitAt);
-    if (breakPos > 0) return {
-      breakPos,
-      breakpointIndex: i,
-      rule
-    };
+    const match = checkBreakpointMatch(i, remainingContent, currentFromIdx, toIdx, windowEndIdx, windowEndPosition, ctx, maxContentLength);
+    if (match) return match;
   }
   return null;
 };
@@ -2124,7 +2107,8 @@ const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx
   if (patternMatch && patternMatch.breakPos > 0) return {
     breakOffset: patternMatch.breakPos,
     breakpointIndex: patternMatch.breakpointIndex,
-    breakpointRule: patternMatch.rule
+    breakpointRule: patternMatch.rule,
+    contentLengthSplit: patternMatch.contentLengthSplit
   };
   if (windowEndPosition < remainingContent.length) {
     const safeOffset = findSafeBreakPosition(remainingContent, windowEndPosition);
@@ -3128,8 +3112,7 @@ const segmentPages = (pages, options) => {
     prefer,
     ruleCount: rules.length
   });
-  const
-  const { content: matchContent, normalizedPages: normalizedContent, pageMap } = buildPageMap(processedPages);
+  const { content: matchContent, normalizedPages: normalizedContent, pageMap } = buildPageMap(pages);
   logger?.debug?.("[segmenter] content built", {
     pageIds: pageMap.pageIds,
     totalContentLength: matchContent.length
@@ -3142,10 +3125,10 @@ const segmentPages = (pages, options) => {
   });
   let segments = buildSegments(unique, matchContent, pageMap, rules, pageJoiner);
   logger?.debug?.("[segmenter] structural segments built", { segmentCount: segments.length });
-  segments = ensureFallbackSegment(segments,
+  segments = ensureFallbackSegment(segments, pages, normalizedContent, pageJoiner);
   if (hasLimits) {
     logger?.debug?.("[segmenter] applying breakpoints to oversized segments");
-    const result = applyBreakpoints(segments,
+    const result = applyBreakpoints(segments, pages, normalizedContent, maxPages, breakpoints, prefer, (p) => processPattern(p, false).pattern, logger, pageJoiner, debug?.includeBreakpoint ? debug.metaKey : void 0, maxContentLength, processBreakpointPattern);
     logger?.info?.("[segmenter] segmentation complete (with breakpoints)", { finalSegmentCount: result.length });
     return result;
   }
@@ -3497,13 +3480,12 @@ const runStage1IfEnabled = (pages, segments, options, selectedRuleIndices, mode)
     recoveredAtIndex,
     recoveredDetailAtIndex
   };
-  const
-  const pageIdToIndex = buildPageIdToIndex(processedPages);
+  const pageIdToIndex = buildPageIdToIndex(pages);
   const pageJoiner = options.pageJoiner ?? "space";
   const compiledMistaken = compileMistakenRulesAsStartsWith(options, selectedRuleIndices);
   for (let i = 0; i < segments.length; i++) {
     const orig = segments[i];
-    const r = tryBestEffortRecoverOneSegment(orig,
+    const r = tryBestEffortRecoverOneSegment(orig, pages, pageIdToIndex, compiledMistaken, pageJoiner);
     if (r.kind !== "recovered") continue;
     const seg = {
       ...orig,
@@ -3858,5 +3840,5 @@ const formatValidationReport = (results) => results.flatMap((result, i) => {
 });
 
 //#endregion
-export { PATTERN_TYPE_KEYS, TOKEN_PATTERNS, Token, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule,
+export { PATTERN_TYPE_KEYS, TOKEN_PATTERNS, Token, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule, applyTokenMappings, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandCompositeTokensInTemplate, expandTokens, expandTokensWithCaptures, formatValidationReport, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, optimizeRules, recoverMistakenLineStartsAfterMarkers, recoverMistakenMarkersForRuns, segmentPages, shouldDefaultToFuzzy, stripTokenMappings, suggestPatternConfig, templateToRegex, validateRules, withCapture };
 //# sourceMappingURL=index.mjs.map