flappa-doormal 2.17.0 → 2.18.0

package/AGENTS.md CHANGED
@@ -53,7 +53,6 @@ src/
  ├── utils/ # Low-level helpers
  │   └── textUtils.ts # Diacritics, Unicode, and bracket escaping
  ├── index.ts # Public barrel exports
- ├── recovery.ts # Mistaken stripping recovery logic
  ├── detection.ts # Pattern auto-detection (standalone)
  └── *.test.ts # Unit and integration tests (co-located)
 
@@ -64,19 +63,14 @@ src/
  - Returns array of `{content, from, to?, meta?}` segments
  - Orchestrates rule matching, optimization, and breakpoint processing
 
- 2. **`recoverMistakenLineStartsAfterMarkers(...)`** - Recovery helper (`src/recovery.ts`)
-    - Use when a client mistakenly used `lineStartsAfter` where `lineStartsWith` was intended
-    - Deterministic mode reruns segmentation with selected rules converted to `lineStartsWith` and merges recovered `content` back into the provided segments
-    - Optional `mode: 'best_effort_then_rerun'` attempts a conservative anchor-based recovery first, then falls back to rerun for unresolved segments
-
- 3. **`tokens.ts`** - Template system (`src/segmentation/tokens.ts`)
+ 2. **`tokens.ts`** - Template system (`src/segmentation/tokens.ts`)
     - `TOKEN_PATTERNS` - Map of token names to regex patterns
     - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
     - `shouldDefaultToFuzzy()` - Auto-enables fuzzy matching for `bab`, `basmalah`, `fasl`, `kitab`, `naql`
     - `applyTokenMappings()` - Applies named captures (`{{token:name}}`) to raw templates
     - `stripTokenMappings()` - Strips named captures (reverts to `{{token}}`)
 
- 4. **`rule-regex.ts`** - Rule compiler (`src/segmentation/rule-regex.ts`)
+ 3. **`rule-regex.ts`** - Rule compiler (`src/segmentation/rule-regex.ts`)
     - `buildRuleRegex()` - Compiles various rule types to executable RegExp
     - `processPattern()` - Token expansion, auto-escaping, and fuzzy application
     - `extractNamedCaptureNames()` - Extract metadata field names from raw regex
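The `{{token}}` expansion idea that `tokens.ts` implements can be sketched in miniature like this (a toy illustration only — the real `expandTokensWithCaptures()` in `src/segmentation/tokens.ts` additionally supports `{{token:name}}` captures and fuzzy defaults; the `TOKEN_PATTERNS` entries below are copied from the README's token table):

```typescript
// Toy sketch of {{token}} expansion; not the library's actual implementation.
const TOKEN_PATTERNS: Record<string, string> = {
    num: '\\d', // single ASCII digit
    dash: '[-–—ـ]', // dash variants incl. tatweel
};

// Replace each {{name}} placeholder with its regex pattern; unknown tokens
// are left untouched so callers can detect them.
function expandTokens(template: string): string {
    return template.replace(/\{\{(\w+)\}\}/g, (match, name: string) =>
        name in TOKEN_PATTERNS ? TOKEN_PATTERNS[name] : match,
    );
}

console.log(expandTokens('{{num}} {{dash}}')); // "\d [-–—ـ]"
```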
@@ -463,48 +457,33 @@ bunx biome lint .
 
  13. **ASCII vs Arabic-Indic Numerals**: While most classical Arabic texts use Arabic-Indic digits (`٠-٩`), modern digitizers often mix them with ASCII digits (`0-9`). Providing separate tokens (`{{raqms}}` for Arabic and `{{nums}}` for ASCII) allows better precision in rule definitions while keeping patterns readable. Always check which digit set is used in the source text before authoring rules.
 
- ### For Future AI Agents (Recovery + Repo gotchas)
-
- 1. **`lineStartsAfter` vs `lineStartsWith` is not “cosmetic”**: `lineStartsAfter` changes output by stripping the matched marker via an internal `contentStartOffset` during segment construction. If a client used it by accident, you cannot reconstruct the exact stripped prefix from output alone without referencing the original pages and re-matching the marker.
-
- 2. **Page joining differs between matching and output**:
-    - Matching always happens on pages concatenated with `\\n` separators.
-    - Output segments may normalize page boundaries (`pageJoiner: 'space' | 'newline'`) and breakpoints post-processing uses its own join normalization utilities.
-    Recovery code must be explicit about which representation it’s searching.
-
- 3. **Breakpoints can produce “pieces” that were never marker-stripped**: When `maxPages` + `breakpoints` are enabled, only the piece that starts at the original structural boundary could have lost a marker due to `lineStartsAfter`. Mid-segment breakpoint pieces should not be “recovered” unless you can anchor them confidently.
+ ### For Future AI Agents (Repo gotchas)
 
- 4. **Fuzzy defaults are easy to miss**: Some tokens auto-enable fuzzy matching unless `fuzzy: false` is set (`bab`, `basmalah`, `fasl`, `kitab`, `naql`). If you are validating markers or re-matching prefixes, use the same compilation path as segmentation (`buildRuleRegex` / `processPattern`) so diacritics and token expansion behave identically.
+ 1. **Fuzzy defaults are easy to miss**: Some tokens auto-enable fuzzy matching unless `fuzzy: false` is set (`bab`, `basmalah`, `fasl`, `kitab`, `naql`). If you are validating markers or re-matching prefixes, use the same compilation path as segmentation (`buildRuleRegex` / `processPattern`) so diacritics and token expansion behave identically.
 
- 5. **Auto-escaping applies to template-like patterns**: `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` auto-escape `()[]` outside `{{tokens}}`. Raw `regex` does not. If you compare patterns by string equality, be careful about escaping and whitespace.
+ 2. **Auto-escaping applies to template-like patterns**: `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` auto-escape `()[]` outside `{{tokens}}`. Raw `regex` does not. If you compare patterns by string equality, be careful about escaping and whitespace.
 
- 6. **TypeScript union pitfalls with `SplitRule`**: `SplitRule` is a union where only one pattern type should exist. Avoid mutating rules in-place with `delete` on fields (TS often narrows unions and then complains). Prefer rebuilding converted rules via destructuring (e.g. `{ lineStartsAfter, ...rest }` then create `{...rest, lineStartsWith: lineStartsAfter}`).
+ 3. **TypeScript union pitfalls with `SplitRule`**: `SplitRule` is a union where only one pattern type should exist. Avoid mutating rules in-place with `delete` on fields (TS often narrows unions and then complains). Prefer rebuilding converted rules via destructuring (e.g. `{ lineStartsAfter, ...rest }` then create `{...rest, lineStartsWith: lineStartsAfter}`).
 
- 7. **Biome lint constraints shape implementation**: The repo enforces low function complexity. Expect to extract helpers (alignment, selector resolution, anchoring) to keep Biome happy. Also, Biome can flag regex character-class usage as misleading; prefer alternation (e.g. `(?:\\u200C|\\u200D|\\uFEFF)`) when removing specific codepoints.
+ 4. **Biome lint constraints shape implementation**: The repo enforces low function complexity. Expect to extract helpers to keep Biome happy. Also, Biome can flag regex character-class usage as misleading; prefer alternation (e.g. `(?:\\u200C|\\u200D|\\uFEFF)`) when removing specific codepoints.
 
- 8. **When debugging recovery, start here**:
-    - `src/segmentation/segmenter.ts` (how content is sliced/trimmed and how `from/to` are computed)
-    - `src/segmentation/rule-regex.ts` + `src/segmentation/tokens.ts` (token expansion + fuzzy behavior)
-    - `src/preprocessing/transforms.ts` (preprocessing transforms: removeZeroWidth, condenseEllipsis, fixTrailingWaw)
-    - `src/recovery.ts` (recovery implementation)
+ 5. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
 
- 9. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
+ 6. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
 
- 10. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
+ 7. **Unicode Boundary Safety (Surrogates + Graphemes)**: Multi-byte characters (like emojis) can be corrupted if split in the middle of a surrogate pair. Similarly, Arabic diacritics (combining marks), ZWJ/ZWNJ, and variation selectors can be orphaned if a hard split lands in the middle of a grapheme cluster. Use `adjustForUnicodeBoundary` when forced to hard-split near a limit.
 
- 11. **Unicode Boundary Safety (Surrogates + Graphemes)**: Multi-byte characters (like emojis) can be corrupted if split in the middle of a surrogate pair. Similarly, Arabic diacritics (combining marks), ZWJ/ZWNJ, and variation selectors can be orphaned if a hard split lands in the middle of a grapheme cluster. Use `adjustForUnicodeBoundary` when forced to hard-split near a limit.
+ 8. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
 
- 12. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
+ 9. **Accidental File Overwrites**: Be extremely careful when using tools like `replace_file_content` with large ranges. Verify file integrity frequently (e.g., `git diff`) to catch accidental deletions of existing code or tests. Merging new tests into existing files is a high-risk operation for AI agents.
 
- 13. **Accidental File Overwrites**: Be extremely careful when using tools like `replace_file_content` with large ranges. Verify file integrity frequently (e.g., `git diff`) to catch accidental deletions of existing code or tests. Merging new tests into existing files is a high-risk operation for AI agents.
+ 10. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible formatting marks like Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), Arabic Letter Mark (`U+061C`), Zero-Width Space (`U+200B`), Zero-Width Non-Joiner (`U+200C`), Zero-Width Joiner (`U+200D`), or BOM (`U+FEFF`). These can appear at line starts after `\n` but before visible characters, breaking `^` anchored patterns. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\u200C\u200D\uFEFF]*(?:pattern)`. The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
 
- 14. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible formatting marks like Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), Arabic Letter Mark (`U+061C`), Zero-Width Space (`U+200B`), Zero-Width Non-Joiner (`U+200C`), Zero-Width Joiner (`U+200D`), or BOM (`U+FEFF`). These can appear at line starts after `\n` but before visible characters, breaking `^` anchored patterns. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\u200C\u200D\uFEFF]*(?:pattern)`. The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
+ 11. **Large Segment Performance & Debugging Strategy**: When processing large books (1000+ pages), avoid O(n²) algorithms. The library uses a fast-path threshold (1000 pages) to switch from accurate string-search boundary detection to cumulative-offset-based slicing. Even on the iterative path (e.g. debug mode), we **slice only the active window (+padding)** per iteration (never `fullContent.slice(cursorPos)`), to avoid quadratic allocation/GC churn. To diagnose performance bottlenecks: (1) Look for logs with "Using iterative path" or "Using accurate string-search path" with large `pageCount` values, (2) Check `iterations` count in completion logs, (3) Strategic logs are placed at operation boundaries (start/end) NOT inside tight loops to avoid log-induced performance regression.
 
- 15. **Large Segment Performance & Debugging Strategy**: When processing large books (1000+ pages), avoid O(n²) algorithms. The library uses a fast-path threshold (1000 pages) to switch from accurate string-search boundary detection to cumulative-offset-based slicing. Even on the iterative path (e.g. debug mode), we **slice only the active window (+padding)** per iteration (never `fullContent.slice(cursorPos)`), to avoid quadratic allocation/GC churn. To diagnose performance bottlenecks: (1) Look for logs with "Using iterative path" or "Using accurate string-search path" with large `pageCount` values, (2) Check `iterations` count in completion logs, (3) Strategic logs are placed at operation boundaries (start/end) NOT inside tight loops to avoid log-induced performance regression.
+ 12. **`maxPages=0` is a hard invariant**: When `maxPages=0`, breakpoint windows must never scan beyond the current page boundary. Relying purely on boundary detection (string search) can fail near page ends for long Arabic text + space joiners, letting the window “see” into the next page and creating multi-page segments. The safe fix is to clamp the breakpoint window to the current page’s end using `boundaryPositions` in breakpoint processing.
 
- 16. **`maxPages=0` is a hard invariant**: When `maxPages=0`, breakpoint windows must never scan beyond the current page boundary. Relying purely on boundary detection (string search) can fail near page ends for long Arabic text + space joiners, letting the window “see” into the next page and creating multi-page segments. The safe fix is to clamp the breakpoint window to the current page’s end using `boundaryPositions` in breakpoint processing.
-
- 17. **`''` breakpoint semantics depend on whether the window is page-bounded vs length-bounded**: `''` means “page boundary fallback”, but it’s intentionally **mode-dependent**:
+ 13. **`''` breakpoint semantics depend on whether the window is page-bounded vs length-bounded**: `''` means “page boundary fallback”, but it’s intentionally **mode-dependent**:
 
     - **Page-bounded window (maxPages-driven)**: `''` should “swallow the remainder of the current page” (i.e. break at the **next page boundary**, not at an arbitrary character limit). This prevents accidentally consuming part of the next page when no other breakpoint patterns match.
     - **Length-bounded window (maxContentLength-driven)**: `''` should **not** force an early page-boundary break. In this mode we want the best split *near the length limit* (safe-break fallback → Unicode-safe hard split) even if that means a piece can cross a page boundary.
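The search-back fallback described in the gotchas above (back up from a hard character limit to the nearest whitespace/punctuation) can be sketched as follows. This is an illustrative sketch, not the library's internal code; `findSafeSplit` is a name invented here, and only the character class comes from the AGENTS notes:

```typescript
// Break characters from the AGENTS notes: whitespace plus Latin/Arabic punctuation.
const SAFE_BREAK = /[\s\n.,;!?؛،۔]/;

// Walk backward from the hard limit to the nearest break character so the
// split never chops a word in half; fall back to the hard limit if none found.
function findSafeSplit(text: string, limit: number): number {
    for (let i = Math.min(limit, text.length - 1); i > 0; i--) {
        if (SAFE_BREAK.test(text[i])) {
            return i + 1; // split just after the break character
        }
    }
    return limit;
}

const sample = 'قال الشيخ رحمه الله';
const cut = findSafeSplit(sample, 12);
console.log(sample.slice(0, cut)); // ends at the space after "الشيخ", not mid-word
```

A real implementation would also apply the Unicode-boundary adjustment from the next gotcha before accepting a hard-limit fallback.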
@@ -805,5 +784,3 @@ Key functions: `applyBreakpoints()` → `processOversizedSegment()` → `findBre
  ## Known Issues
 
  - **Binary Search Gap (Theoretical)**: `findBoundaryIdForOffset` returns `undefined` if the search offset falls exactly on a joiner character (e.g., a space or newline) between two pages. This is mathematically correct (the gap belongs to neither page) but may cause validation errors if a segment consists _only_ of such a gap or matches content starting/ending strictly within the gap. We have marked this as "accept" behavior for now, with a documented skipped test case.
-
-
package/README.md CHANGED
@@ -135,7 +135,7 @@ Replace regex with readable tokens:
  | `{{num}}` | Single ASCII digit | `\\d` |
  | `{{dash}}` | Dash variants | `[-–—ـ]` |
  | `{{harf}}` | Arabic letter | `[أ-ي]` |
- | `{{harfs}}` | Single-letter codes separated by spaces | `[أ-ي](?:\s+[أ-ي])*` |
+ | `{{harfs}}` | Single-letter codes separated by spaces, with optional marks/tatweel on each isolated letter | e.g. `د ت س ي ق`, `هـ ث` |
  | `{{rumuz}}` | Source abbreviations (rijāl/takhrīj rumuz), incl. multi-code blocks | e.g. `خت ٤`, `خ سي`, `خ فق`, `د ت سي ق`, `دت عس ق` |
  | `{{numbered}}` | Hadith numbering `٢٢ - ` | `{{raqms}} {{dash}} ` |
  | `{{fasl}}` | Section markers | `فصل\|مسألة` |
@@ -323,6 +323,93 @@ const segments = segmentPages(pages, {
 
  This guard applies **only at page starts**. Mid-page line starts are unaffected.
 
+ #### Previous-Word Page-Start Stoplist
+
+ For dictionary-like content, page wraps can split a phrase across pages and create
+ false positives at the top of the next page. Example:
+
+ - Page N ends with `قال`
+ - Page N+1 starts with `العجاج:`
+
+ Use `pageStartPrevWordStoplist` to suppress page-start matches when the previous
+ page's last Arabic word is in a stoplist. Matching is Arabic-normalized and
+ diacritic-insensitive.
+
+ ```typescript
+ const segments = segmentPages(pages, {
+     rules: [{
+         regex: '^(?<lemma>[ء-غف-ي]+):',
+         split: 'at',
+         pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال']
+     }]
+ });
+ ```
+
+ If the previous page ends with strong sentence punctuation (`.`, `!`, `?`, `؟`, `؛`),
+ the stoplist guard is skipped and the page-start match is allowed.
+
+ #### Arabic Dictionary Helper
+
+ Use `createArabicDictionaryEntryRule()` to build a conservative rule for Arabic
+ dictionaries with lemma capture, stopword filtering, and page-wrap protection:
+
+ ```typescript
+ import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
+
+ const rule = createArabicDictionaryEntryRule({
+     stopWords: ['وقيل', 'ويقال', 'قال', 'العجاج', 'أخاك'],
+     pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
+     samePagePrevWordStoplist: ['جل'],
+     // Optional dictionary-specific shapes:
+     allowParenthesized: true, // e.g. (عنبر) :
+     allowWhitespaceBeforeColon: true, // e.g. عنبر :
+     allowCommaSeparated: true, // e.g. سبد، دبس:
+ });
+
+ const segments = segmentPages(pages, { rules: [rule] });
+ ```
+
+ Behavior:
+ - Keeps the lemma marker in `segment.content`
+ - Stores the matched lemma in `segment.meta.lemma`
+ - Matches root entries at true line/page starts like `عز:` and `لع:`
+ - Matches mid-line subentries conservatively when they begin with `و`
+ - Can match parenthesized headwords like `(عنبر) :` when enabled
+ - Can match comma-separated headword lists like `سبد، دبس:` when enabled
+ - Can suppress same-page false positives like `جلّ وعزّ:` with `samePagePrevWordStoplist`
+
+ #### Dictionary Letter-Code Lines
+
+ For dictionary-specific letter-code lines like `ك ش ن` or `(هـ ث)`, use
+ `{{harfs}}` and decide the metadata shape in client code:
+
+ ```typescript
+ import { getTokenPattern, segmentPages } from 'flappa-doormal';
+
+ const harfCodes = getTokenPattern('harfs').replaceAll('\\s+', '[ \\t]+');
+
+ const segments = segmentPages(pages, {
+     rules: [{
+         regex: `^(?:\\((?<huruf>${harfCodes})\\)|(?<huruf>${harfCodes}))$`,
+         split: 'at',
+         meta: { type: 'C' },
+     }],
+ });
+ ```
+
+ Here `huruf` is just a named capture group chosen by the client, not a built-in
+ regex primitive.
+
+ This client-side rule can be used for:
+ - chapter-adjacent code lines like `(هـ ث)`
+ - consecutive bare code lines like `س ط ب` then `س د ر`
+
+ The `replaceAll('\\s+', '[ \\t]+')` step is intentional:
+ - `{{harfs}}` itself uses `\s+`
+ - but when embedding it in a raw full-line regex, horizontal whitespace is usually
+   safer than unrestricted `\s+`, because it prevents accidental matching across
+   newlines
+
  ### 5. Auto-Escaping Brackets
 
  In `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` patterns, parentheses `()` and square brackets `[]` are **automatically escaped**. This means you can write intuitive patterns without manual escaping:
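The "Arabic-normalized and diacritic-insensitive" comparison that the previous-word stoplist relies on can be approximated like this. This is a simplified sketch under stated assumptions — the library's actual normalization is internal and may cover more cases; `prevWordIsStoplisted` and `normalizeArabic` are names invented here for illustration:

```typescript
// Strip harakat/tanwin/shadda/sukun and the dagger alif, and unify alif forms,
// so 'قَالَ' compares equal to 'قال'. (Simplified; real normalization may differ.)
const DIACRITICS = /[\u064B-\u0652\u0670]/g;

function normalizeArabic(word: string): string {
    return word.replace(DIACRITICS, '').replace(/[أإآ]/g, 'ا');
}

// True when the previous page's last word is in the stoplist.
function prevWordIsStoplisted(prevPage: string, stoplist: string[]): boolean {
    const words = prevPage.trim().split(/\s+/);
    const last = normalizeArabic(words[words.length - 1] ?? '');
    return stoplist.some((w) => normalizeArabic(w) === last);
}

console.log(prevWordIsStoplisted('ثم قَالَ', ['قال'])); // true
```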
@@ -452,6 +539,11 @@ Pass an optional `logger` to trace segmentation decisions or enable `debug` to a
  const segments = segmentPages(pages, {
      rules: [...],
      debug: true, // Enables detailed match metadata
+     logger: {
+         debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
+         info: (msg, data) => console.info(`[INFO] ${msg}`, data),
+         warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
+         error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
      logger: {
          debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
          info: (msg, data) => console.info(`[INFO] ${msg}`, data),
@@ -459,6 +551,10 @@ const segments = segmentPages(pages, {
          error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
      }
  });
+
+ // Helper to format debug reason
+ // import { getSegmentDebugReason } from 'flappa-doormal';
+ // console.log(getSegmentDebugReason(segments[0])); // "Rule #0 (lineStartsWith) [idx:2] (Matched: '{{naql}}')"
  ```
 
  #### Debug Metadata (`_flappa`)
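A logger with the `{ debug, info, warn, error }` shape shown in the example above can also collect entries instead of printing, which is handy when asserting on segmentation diagnostics in tests. A minimal sketch (the `Logger` type here is assumed from the example, not imported from the library):

```typescript
// Minimal collecting logger matching the { debug, info, warn, error } shape.
type LogFn = (msg: string, data?: unknown) => void;
type Logger = { debug: LogFn; info: LogFn; warn: LogFn; error: LogFn };

function createCollectingLogger(): { logger: Logger; entries: { level: string; msg: string }[] } {
    const entries: { level: string; msg: string }[] = [];
    const make = (level: string): LogFn => (msg) => {
        entries.push({ level, msg });
    };
    return {
        logger: { debug: make('debug'), info: make('info'), warn: make('warn'), error: make('error') },
        entries,
    };
}

const { logger, entries } = createCollectingLogger();
logger.info('segmentation started');
logger.warn('oversized segment');
console.log(entries.length); // 2
```

Pass `logger` to `segmentPages(...)` in place of the console-based one, then inspect `entries` afterwards.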
@@ -1373,66 +1469,6 @@ const options: SegmentationOptions = {
  const segments: Segment[] = segmentPages(pages, options);
  ```
 
- ### Marker recovery (when `lineStartsAfter` was used by accident)
-
- If you accidentally used `lineStartsAfter` for markers that should have been preserved (e.g. Arabic connective phrases like `وروى` / `وذكر`), you can recover those missing prefixes from existing segments.
-
- #### `recoverMistakenLineStartsAfterMarkers(pages, segments, options, selector, opts?)`
-
- This function returns new segments with recovered `content` plus a `report` describing what happened.
-
- **Recommended (deterministic) mode**: rerun segmentation with selected rules converted to `lineStartsWith`, then merge recovered content back.
-
- ```ts
- import { recoverMistakenLineStartsAfterMarkers, segmentPages } from 'flappa-doormal';
-
- const pages = [{ id: 1, content: 'وروى أحمد\nوذكر خالد' }];
- const options = { rules: [{ lineStartsAfter: ['وروى '] }, { lineStartsAfter: ['وذكر '] }] };
-
- const segments = segmentPages(pages, options);
- // segments[0].content === 'أحمد' (marker stripped)
-
- const { segments: recovered, report } = recoverMistakenLineStartsAfterMarkers(
-     pages,
-     segments,
-     options,
-     { type: 'rule_indices', indices: [0] }, // recover only the first rule
- );
-
- // recovered[0].content === 'وروى أحمد'
- // recovered[1].content === 'خالد' (unchanged)
- console.log(report.summary);
- ```
-
- **Optional**: best-effort anchoring mode attempts to recover without rerunning first, then falls back to rerun for unresolved segments:
-
- ```ts
- const { segments: recovered } = recoverMistakenLineStartsAfterMarkers(
-     pages,
-     segments,
-     options,
-     { type: 'rule_indices', indices: [0] },
-     { mode: 'best_effort_then_rerun' }
- );
- ```
-
- Notes:
- - Recovery is **explicitly scoped** by the `selector`; it will not “guess” which rules are mistaken.
- - If your segments were heavily post-processed (trimmed/normalized/reordered), recovery may return unresolved items; see the report for details.
-
- #### `recoverMistakenMarkersForRuns(runs, opts?)`
-
- Batch version of `recoverMistakenLineStartsAfterMarkers`. Processes multiple independent segmentation runs (e.g. from different books) and returns a consolidated report.
-
- ```typescript
- import { recoverMistakenMarkersForRuns } from 'flappa-doormal';
-
- const results = recoverMistakenMarkersForRuns([
-     { pages: pages1, segments: segments1, options: options1, selector: selector1 },
-     { pages: pages2, segments: segments2, options: options2, selector: selector2 },
- ]);
- ```
-
  ### `validateSegments(pages, options, segments, validationOptions?)`
 
  Validates that segments correctly map back to the source pages and adhere to constraints.
@@ -1626,7 +1662,9 @@ type SplitRule = {
      min?: number;
      max?: number;
      exclude?: (number | [number, number])[]; // Single page or [start, end] range
-     skipWhen?: string; // Regex pattern (tokens supported)
+     pageStartGuard?: string;
+     pageStartPrevWordStoplist?: string[];
+     samePagePrevWordStoplist?: string[];
      meta?: Record<string, unknown>;
      };
  ```
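When converting between the union's pattern variants (e.g. `lineStartsAfter` → `lineStartsWith`), the AGENTS notes above recommend rebuilding via destructuring rather than mutating with `delete`, which fights TypeScript's union narrowing. A sketch with a minimal assumed shape of the union (the real `SplitRule` has more fields, as the type excerpt shows):

```typescript
// Minimal assumed shape of the SplitRule union for illustration.
type SplitRule =
    | { lineStartsWith: string[]; split?: 'at' | 'after'; meta?: Record<string, unknown> }
    | { lineStartsAfter: string[]; split?: 'at' | 'after'; meta?: Record<string, unknown> };

// Rebuild instead of `delete rule.lineStartsAfter`: pull the pattern field out,
// keep the rest, and reattach it under the other variant's key.
function toLineStartsWith(rule: SplitRule): SplitRule {
    if ('lineStartsAfter' in rule) {
        const { lineStartsAfter, ...rest } = rule;
        return { ...rest, lineStartsWith: lineStartsAfter };
    }
    return rule;
}

console.log(toLineStartsWith({ lineStartsAfter: ['وروى '], split: 'at' }));
// → { split: 'at', lineStartsWith: [ 'وروى ' ] }
```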