flappa-doormal 2.7.0 → 2.9.0
This diff shows the content of publicly available package versions as released to one of the supported registries. It is provided for informational purposes only and reflects the changes between the two versions as they appear in their public registries.
- package/AGENTS.md +80 -5
- package/README.md +138 -47
- package/dist/index.d.mts +113 -95
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +1072 -285
- package/dist/index.mjs.map +1 -1
- package/package.json +1 -1
package/AGENTS.md
CHANGED
@@ -26,8 +26,16 @@ Traditional Arabic text segmentation requires:
 ```text
 src/
 ├── index.ts                     # Main entry point and exports
-├──
-├──
+├── analysis/                    # Analysis helpers module
+│   ├── index.ts                 # Barrel exports for analysis functions
+│   ├── shared.ts                # Shared utilities for analysis
+│   ├── line-starts.ts           # analyzeCommonLineStarts (line-based patterns)
+│   ├── repeating-sequences.ts   # analyzeRepeatingSequences (continuous text N-grams)
+│   └── *.test.ts                # Analysis tests
+├── pattern-detection.ts         # Token detection for auto-generating rules
+├── pattern-detection.test.ts    # Pattern detection tests
+├── recovery.ts                  # Marker recovery utility (recover mistaken lineStartsAfter)
+├── recovery.test.ts             # Marker recovery tests
 └── segmentation/
     ├── types.ts                 # TypeScript type definitions for rules/segments
     ├── segmenter.ts             # Core segmentation engine (segmentPages)
@@ -56,6 +64,11 @@ src/
    - Takes array of `{id, content}` pages and split rules
    - Returns array of `{content, from, to?, meta?}` segments

+1. **`recoverMistakenLineStartsAfterMarkers(pages, segments, options, selector)`** - Recovery helper
+   - Use when a client mistakenly used `lineStartsAfter` where `lineStartsWith` was intended
+   - Deterministic mode reruns segmentation with selected rules converted to `lineStartsWith` and merges recovered `content` back into the provided segments
+   - Optional `mode: 'best_effort_then_rerun'` attempts a conservative anchor-based recovery first, then falls back to rerun for unresolved segments
+
 2. **`tokens.ts`** - Template system
    - `TOKEN_PATTERNS` - Map of token names to regex patterns
    - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
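
A minimal sketch of the recovery helper added in this hunk, using only names that appear in this diff (the full worked example is in the README changes below):

```typescript
import { recoverMistakenLineStartsAfterMarkers, segmentPages } from 'flappa-doormal';

// A rule that strips its marker: the segment content loses 'وروى '.
const pages = [{ id: 1, content: 'وروى أحمد' }];
const options = { rules: [{ lineStartsAfter: ['وروى '] }] };
const segments = segmentPages(pages, options);

// Deterministic mode: rerun with rule 0 converted to `lineStartsWith`,
// then merge the recovered content back into the provided segments.
const { segments: recovered, report } = recoverMistakenLineStartsAfterMarkers(
    pages,
    segments,
    options,
    { type: 'rule_indices', indices: [0] },
);
```
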
@@ -195,7 +208,7 @@ Raw `regex` patterns now support named capture groups for metadata extraction:

 ### Breakpoints Post-Processing Algorithm

-The `breakpoints` option provides a post-processing mechanism for limiting segment size.
+The `breakpoints` option provides a post-processing mechanism for limiting segment size. Breakpoints run AFTER all structural rules.

 **API Options:**
 ```typescript
@@ -237,7 +250,7 @@ segmentPages(pages, {
 - **`prefer: 'shorter'`**: Finds FIRST match (conservative)
 - **Recursive**: If split result still exceeds `maxPages`, breakpoints runs again

-> **Note**:
+> **Note**: Older per-rule span limiting approaches were removed in favor of post-processing `breakpoints`.

 ## Design Decisions

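
A sketch of how these options combine, assuming `maxPages`, `breakpoints`, and `prefer` are sibling options on `segmentPages` and that `breakpoints` entries are pattern strings (the `BreakpointRule` shape is not shown in this diff):

```typescript
import { segmentPages, type Page } from 'flappa-doormal';

const pages: Page[] = [{ id: 1, content: '...' }, { id: 2, content: '...' }];

// Structural rules run first; the breakpoints pass then splits any
// segment that still spans more than `maxPages` pages, recursively.
const segments = segmentPages(pages, {
    rules: [{ lineStartsWith: ['{{bab}}'] }],
    maxPages: 3,
    breakpoints: ['{{tarqim}}'], // assumed shape: pattern strings, like rule patterns
    prefer: 'shorter',           // take the FIRST candidate match (conservative)
});
```
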
@@ -350,7 +363,7 @@ bunx biome lint .

 5. **Rule order matters for specificity**: When multiple rules can match the same position, put specific patterns BEFORE generic ones. Example: `## {{raqms:num}} {{dash}}` must come before `##` to capture the number.

-6. **Post-processing beats per-rule limits**:
+6. **Post-processing beats per-rule limits**: Per-rule span limiting caused premature splits. Moving to post-processing `breakpoints` preserves structural integrity while still limiting segment size.

 7. **Window padding matters**: When calculating approximate content windows, 50% padding is needed (not 20%) to ensure enough content is captured for `prefer: 'longer'` scenarios.

@@ -362,6 +375,39 @@ bunx biome lint .

 11. **Boundary-position algorithm improves page attribution**: Building a position map of page boundaries once per segment (O(n)) enables binary search for O(log n) lookups per piece. Key insight: when a segment starts mid-page (common after structural rules), expected boundary estimates must account for the offset into the starting page. Without this adjustment, position-based lookups can return the wrong page when pages have identical content prefixes.

+### For Future AI Agents (Recovery + Repo gotchas)
+
+1. **`lineStartsAfter` vs `lineStartsWith` is not “cosmetic”**: `lineStartsAfter` changes output by stripping the matched marker via an internal `contentStartOffset` during segment construction. If a client used it by accident, you cannot reconstruct the exact stripped prefix from output alone without referencing the original pages and re-matching the marker.
+
+2. **Recovery must mirror segmentation’s preprocessing**: If `SegmentationOptions.replace` was used, recovery must apply the same replacements (see `src/segmentation/replace.ts`) before attempting anchoring or rerun alignment, otherwise substring matching and page joins will drift.
+
+3. **Page joining differs between matching and output**:
+   - Matching always happens on pages concatenated with `\\n` separators.
+   - Output segments may normalize page boundaries (`pageJoiner: 'space' | 'newline'`) and breakpoints post-processing uses its own join normalization utilities.
+   Recovery code must be explicit about which representation it’s searching.
+
+4. **Breakpoints can produce “pieces” that were never marker-stripped**: When `maxPages` + `breakpoints` are enabled, only the piece that starts at the original structural boundary could have lost a marker due to `lineStartsAfter`. Mid-segment breakpoint pieces should not be “recovered” unless you can anchor them confidently.
+
+5. **Fuzzy defaults are easy to miss**: Some tokens auto-enable fuzzy matching unless `fuzzy: false` is set (`bab`, `basmalah`, `fasl`, `kitab`, `naql`). If you are validating markers or re-matching prefixes, use the same compilation path as segmentation (`buildRuleRegex` / `processPattern`) so diacritics and token expansion behave identically.
+
+6. **Auto-escaping applies to template-like patterns**: `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` auto-escape `()[]` outside `{{tokens}}`. Raw `regex` does not. If you compare patterns by string equality, be careful about escaping and whitespace.
+
+7. **TypeScript union pitfalls with `SplitRule`**: `SplitRule` is a union where only one pattern type should exist. Avoid mutating rules in-place with `delete` on fields (TS often narrows unions and then complains). Prefer rebuilding converted rules via destructuring (e.g. `{ lineStartsAfter, ...rest }` then create `{ ...rest, lineStartsWith: lineStartsAfter }`); see the sketch after this section.
+
+8. **Biome lint constraints shape implementation**: The repo enforces low function complexity. Expect to extract helpers (alignment, selector resolution, anchoring) to keep Biome happy. Also, Biome can flag regex character-class usage as misleading; prefer alternation (e.g. `(?:\\u200C|\\u200D|\\uFEFF)`) when removing specific codepoints.
+
+9. **When debugging recovery, start here**:
+   - `src/segmentation/segmenter.ts` (how content is sliced/trimmed and how `from/to` are computed)
+   - `src/segmentation/rule-regex.ts` + `src/segmentation/tokens.ts` (token expansion + fuzzy behavior)
+   - `src/segmentation/replace.ts` (preprocessing parity)
+   - `src/recovery.ts` (recovery implementation)
+
+### Process Template (Multi-agent design review, TDD-first)
+
+If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
+
+- `docs/ai-multi-agent-tdd-template.md`
+
 ### Architecture Insights

 - **Declarative > Imperative**: Users describe patterns, library handles regex
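
Item 7 above in code form: a minimal sketch of rebuilding a converted rule via destructuring rather than `delete`-based mutation (only `SplitRule` from this diff is assumed):

```typescript
import type { SplitRule } from 'flappa-doormal';

// Rebuild instead of mutating: pull the mistaken pattern field out, then
// construct a fresh rule carrying the remainder plus `lineStartsWith`.
const toLineStartsWith = (rule: SplitRule): SplitRule => {
    if ('lineStartsAfter' in rule) {
        const { lineStartsAfter, ...rest } = rule;
        return { ...rest, lineStartsWith: lineStartsAfter };
    }
    return rule;
};
```
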
@@ -454,3 +500,32 @@ const quoted = analyzeCommonLineStarts(pages, {
 });
 ```

+## Repeating Sequence Analysis (`analyzeRepeatingSequences`)
+
+For continuous text **without line breaks** (prose-like content), use `analyzeRepeatingSequences(pages)`. It scans for commonly repeating word/token sequences (N-grams) across pages.
+
+Key options:
+- `minElements` / `maxElements`: N-gram size range (default 1-3)
+- `minCount`: Minimum occurrences to include (default 3)
+- `topK`: Maximum patterns to return (default 20)
+- `requireToken`: Only patterns containing `{{tokens}}` (default true)
+- `normalizeArabicDiacritics`: Ignore diacritics when matching (default true)
+
+Example:
+```typescript
+import { analyzeRepeatingSequences } from 'flappa-doormal';
+
+const patterns = analyzeRepeatingSequences(pages, { minCount: 3, topK: 20 });
+// [{ pattern: '{{naql}}', count: 42, examples: [...] }, ...]
+```
+
+## Analysis → Segmentation Workflow
+
+Use analysis functions to discover patterns, then pass to `segmentPages()`:
+
+1. **Continuous text**: `analyzeRepeatingSequences()` → build rules → `segmentPages()`
+2. **Structured text**: `analyzeCommonLineStarts()` → build rules → `segmentPages()`
+
+See README.md for complete examples.
+
+
package/README.md
CHANGED
@@ -228,7 +228,6 @@ Control which matches to use:
   lineEndsWith: ['\\.'],
   split: 'after',
   occurrence: 'last', // Only split at LAST period on page
-  maxSpan: 1, // Apply per-page
 }
 ```

@@ -406,8 +405,99 @@ Key options:
 - If you paste these signatures into `lineStartsWith` / `lineStartsAfter` / `template`, that’s fine: those template pattern types **auto-escape `()[]`** outside `{{tokens}}`.
 - If you paste them into a raw `regex` rule, you may need to escape literal brackets yourself.

+### Repeating Sequence Analysis (continuous text)
+
+For texts without line breaks (continuous prose), use `analyzeRepeatingSequences()`:
+
+```typescript
+import { analyzeRepeatingSequences } from 'flappa-doormal';
+
+const patterns = analyzeRepeatingSequences(pages, {
+    minElements: 2,
+    maxElements: 4,
+    minCount: 3,
+    topK: 20,
+});
+// [{ pattern: "{{naql}}\\s*{{harf}}", count: 42, examples: [...] }, ...]
+```
+
+Key options:
+- `minElements` / `maxElements`: N-gram size range (default 1-3)
+- `minCount`: Minimum occurrences to include (default 3)
+- `topK`: Maximum patterns to return (default 20)
+- `requireToken`: Only patterns containing `{{tokens}}` (default true)
+- `normalizeArabicDiacritics`: Ignore diacritics when matching (default true)
+
+## Analysis → Segmentation Workflow
+
+Use analysis functions to discover patterns, then pass to `segmentPages()`.
+
+### Example A: Continuous Text (No Punctuation)
+
+For prose-like text without structural line breaks:
+
+```typescript
+import { analyzeRepeatingSequences, segmentPages, type Page } from 'flappa-doormal';
+
+// Continuous Arabic text with narrator phrases
+const pages: Page[] = [
+    { id: 1, content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي حدثنا خالد بن زيد عن علي' },
+    { id: 2, content: 'حدثنا سعيد بن جبير عن ابن عباس أخبرنا يوسف عن أنس' },
+];
+
+// Step 1: Discover repeating patterns
+const patterns = analyzeRepeatingSequences(pages, { minCount: 2, topK: 10 });
+// [{ pattern: '{{naql}}', count: 5, examples: [...] }, ...]
+
+// Step 2: Build rules from discovered patterns
+const rules = patterns.filter(p => p.count >= 3).map(p => ({
+    lineStartsWith: [p.pattern],
+    split: 'at' as const,
+    fuzzy: true,
+}));
+
+// Step 3: Segment
+const segments = segmentPages(pages, { rules });
+// [{ content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي', from: 1 }, ...]
+```
+
+### Example B: Structured Text (With Numbering)
+
+For hadith-style numbered entries:
+
+```typescript
+import { analyzeCommonLineStarts, segmentPages, type Page } from 'flappa-doormal';
+
+// Numbered hadith text
+const pages: Page[] = [
+    { id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ\n٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ' },
+    { id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ عَنْ عَائِشَةَ' },
+];
+
+// Step 1: Discover common line-start patterns
+const patterns = analyzeCommonLineStarts(pages, { topK: 10, minCount: 2 });
+// [{ pattern: '{{raqms}}\\s*{{dash}}', count: 3, examples: [...] }, ...]
+
+// Step 2: Build rules (add named capture for hadith number)
+const topPattern = patterns[0]?.pattern ?? '{{raqms}} {{dash}} ';
+const rules = [{
+    lineStartsAfter: [topPattern.replace('{{raqms}}', '{{raqms:num}}')],
+    split: 'at' as const,
+    meta: { type: 'hadith' }
+}];
+
+// Step 3: Segment
+const segments = segmentPages(pages, { rules });
+// [
+//   { content: 'حَدَّثَنَا أَبُو بَكْرٍ...', from: 1, meta: { type: 'hadith', num: '٦٦٩٦' } },
+//   { content: 'أَخْبَرَنَا عُمَرُ قَالَ', from: 1, meta: { type: 'hadith', num: '٦٦٩٧' } },
+//   { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { type: 'hadith', num: '٦٦٩٨' } },
+// ]
+```
+
 ## Rule Validation

+
 Use `validateRules()` to detect common mistakes in rule patterns before running segmentation:

 ```typescript
@@ -619,32 +709,10 @@ const segments = segmentPages(pages, {
     lineEndsWith: ['\\.'],
     split: 'after',
     occurrence: 'last',
-    maxSpan: 1
   }]
 });
 ```

-### Page Fallback for Unmatched Content
-
-When using `maxSpan` to group matches per page, use `fallback: 'page'` to prevent unmatched pages from merging with adjacent segments:
-
-```typescript
-const segments = segmentPages(pages, {
-  rules: [{
-    template: '{{tarqim}}', // Match punctuation marks
-    split: 'after',
-    occurrence: 'last',
-    maxSpan: 1,
-    fallback: 'page' // If no punctuation found, segment the page anyway
-  }]
-});
-```
-
-**Without `fallback`**: Pages without matches merge into the next segment
-**With `fallback: 'page'`**: Each page becomes its own segment even without matches
-
-> **Future extensions**: The `fallback` option may support additional values like `'skip'` (omit unmatched content) or `'line'` (split at line breaks) in future versions.
-
 ### Multiple Rules with Priority

 ```typescript
@@ -697,6 +765,53 @@ const options: SegmentationOptions = {
 const segments: Segment[] = segmentPages(pages, options);
 ```

+### Marker recovery (when `lineStartsAfter` was used by accident)
+
+If you accidentally used `lineStartsAfter` for markers that should have been preserved (e.g. Arabic connective phrases like `وروى` / `وذكر`), you can recover those missing prefixes from existing segments.
+
+#### `recoverMistakenLineStartsAfterMarkers(pages, segments, options, selector, opts?)`
+
+This function returns new segments with recovered `content` plus a `report` describing what happened.
+
+**Recommended (deterministic) mode**: rerun segmentation with selected rules converted to `lineStartsWith`, then merge recovered content back.
+
+```ts
+import { recoverMistakenLineStartsAfterMarkers, segmentPages } from 'flappa-doormal';
+
+const pages = [{ id: 1, content: 'وروى أحمد\nوذكر خالد' }];
+const options = { rules: [{ lineStartsAfter: ['وروى '] }, { lineStartsAfter: ['وذكر '] }] };
+
+const segments = segmentPages(pages, options);
+// segments[0].content === 'أحمد' (marker stripped)
+
+const { segments: recovered, report } = recoverMistakenLineStartsAfterMarkers(
+  pages,
+  segments,
+  options,
+  { type: 'rule_indices', indices: [0] }, // recover only the first rule
+);
+
+// recovered[0].content === 'وروى أحمد'
+// recovered[1].content === 'خالد' (unchanged)
+console.log(report.summary);
+```
+
+**Optional**: best-effort anchoring mode attempts to recover without rerunning first, then falls back to rerun for unresolved segments:
+
+```ts
+const { segments: recovered } = recoverMistakenLineStartsAfterMarkers(
+  pages,
+  segments,
+  options,
+  { type: 'rule_indices', indices: [0] },
+  { mode: 'best_effort_then_rerun' }
+);
+```
+
+Notes:
+- Recovery is **explicitly scoped** by the `selector`; it will not “guess” which rules are mistaken.
+- If your segments were heavily post-processed (trimmed/normalized/reordered), recovery may return unresolved items; see the report for details.
+
 ### `stripHtmlTags(html)`

 Remove all HTML tags from content, keeping only text.
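
Besides `rule_indices`, the declarations in `dist/index.d.mts` below allow two more selector forms. A sketch continuing the example above, assuming they scope recovery the same way:

```ts
// Select rules by their `lineStartsAfter` patterns (optionally normalized).
const byPattern = recoverMistakenLineStartsAfterMarkers(pages, segments, options, {
  type: 'lineStartsAfter_patterns',
  match: 'normalized',
  patterns: ['وروى '],
});

// Or select rules with an arbitrary predicate over (rule, index).
const byPredicate = recoverMistakenLineStartsAfterMarkers(pages, segments, options, {
  type: 'predicate',
  predicate: (rule) => 'lineStartsAfter' in rule,
});
```
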
@@ -865,9 +980,7 @@ type SplitRule = {
   // Split behavior
   split?: 'at' | 'after'; // Default: 'at'
   occurrence?: 'first' | 'last' | 'all';
-  maxSpan?: number;
   fuzzy?: boolean;
-  fallback?: 'page'; // NEW: Page-boundary fallback

   // Constraints
   min?: number;
@@ -995,28 +1108,6 @@ The library concatenates all pages into a single string for pattern matching acr

 For typical book processing (up to 6,000 pages), memory usage is well within Node.js defaults. For very large books (40,000+ pages), ensure adequate heap size.

-### `maxSpan` Sliding Window Behavior
-
-The `maxSpan` option uses a **sliding window algorithm** based on page ID difference:
-
-```typescript
-// maxSpan = maximum page ID difference when looking ahead for split points
-// Algorithm prefers LONGER segments by looking as far ahead as allowed
-
-// Pages [1, 2, 3, 4] with maxSpan: 1, occurrence: 'last'
-// Window from page 1: pages 1-2 (diff <= 1), splits at page 2's last match
-// Window from page 3: pages 3-4 (diff <= 1), splits at page 4's last match
-// Result: 2 segments spanning pages 1-2 and 3-4
-
-// Pages [1, 5, 10] with maxSpan: 1, occurrence: 'last'
-// Window from page 1: only page 1 (5-1=4 > 1), splits at page 1
-// Window from page 5: only page 5 (10-5=5 > 1), splits at page 5
-// Window from page 10: only page 10, splits at page 10
-// Result: 3 segments (pages too far apart to merge)
-```
-
-This is intentional for books where page IDs represent actual page numbers. With `occurrence: 'last'`, the algorithm finds the last match within the lookahead window, creating longer segments where possible.
-
 ## For AI Agents

 See [AGENTS.md](./AGENTS.md) for:
package/dist/index.d.mts
CHANGED
@@ -238,38 +238,9 @@ type SplitBehavior = {
    * - `'first'`: Only split at the first match
    * - `'last'`: Only split at the last match
    *
-   * When `maxSpan` is set, occurrence filtering is applied per sliding
-   * window rather than globally. With `'last'`, the algorithm prefers
-   * longer segments by looking as far ahead as allowed before selecting
-   * the last match in the window.
-   *
    * @default 'all'
    */
   occurrence?: 'first' | 'last' | 'all';
-  /**
-   * Maximum page ID difference allowed when looking ahead for split points.
-   *
-   * Uses a sliding window algorithm that prefers longer segments:
-   * 1. Start from the first page of the current segment
-   * 2. Look for matches within pages where `pageId - startPageId <= maxSpan`
-   * 3. Apply occurrence filter (e.g., 'last') to select a match
-   * 4. Next window starts from the page after the match
-   *
-   * Examples:
-   * - `maxSpan: 1` = look 1 page ahead (segments span at most 2 pages)
-   * - `maxSpan: 2` = look 2 pages ahead (segments span at most 3 pages)
-   * - `undefined` = no limit (entire content treated as one group)
-   *
-   * Note: With non-consecutive page IDs, the algorithm uses actual ID
-   * difference, not array index. Pages 1 and 5 have a difference of 4.
-   *
-   * @example
-   * // Split at last period, looking up to 1 page ahead
-   * // Pages 1,2: split at page 2's last period
-   * // Page 3: split at page 3's last period
-   * { lineEndsWith: ['.'], split: 'after', occurrence: 'last', maxSpan: 1 }
-   */
-  maxSpan?: number;
   /**
    * Enable diacritic-insensitive matching for Arabic text.
    *
@@ -354,12 +325,6 @@ type RuleConstraints = {
    * { lineStartsWith: ['{{bab}}'], split: 'before', meta: { type: 'chapter' } }
    */
   meta?: Record<string, unknown>;
-  /**
-   * Fallback behavior when no matches are found within a maxSpan boundary.
-   * - 'page': Create split points at page boundaries
-   * - undefined: No fallback (current behavior)
-   */
-  fallback?: 'page';
   /**
    * Page-start guard: only allow this rule to match at the START of a page if the
    * previous page's last non-whitespace character matches this pattern.
@@ -388,7 +353,7 @@ type RuleConstraints = {
  * Each rule must specify:
  * - **Pattern** (exactly one): `regex`, `template`, `lineStartsWith`,
  *   `lineStartsAfter`, or `lineEndsWith`
- * - **Split behavior**: `split` (optional, defaults to `'at'`), `occurrence`, `
+ * - **Split behavior**: `split` (optional, defaults to `'at'`), `occurrence`, `fuzzy`
  * - **Constraints** (optional): `min`, `max`, `meta`
  *
  * @example
@@ -424,7 +389,6 @@ type SplitRule = PatternType & SplitBehavior & RuleConstraints;
 type Page = {
   /**
    * Unique page/entry ID used for:
-   * - `maxSpan` grouping (segments spanning multiple pages)
    * - `min`/`max` constraint filtering
    * - `from`/`to` tracking in output segments
    */
@@ -625,6 +589,21 @@ type SegmentationOptions = {
    * rule's metadata is used for each segment.
    */
   rules?: SplitRule[];
+  /**
+   * Attach debugging provenance into `segment.meta` indicating which rule and/or breakpoint
+   * created the segment boundary.
+   *
+   * This is opt-in because it increases output size.
+   *
+   * When enabled (default metaKey: `_flappa`), segments may include:
+   * `meta._flappa.rule` and/or `meta._flappa.breakpoint`.
+   */
+  debug?: boolean | {
+    /** Where to store provenance in meta. @default '_flappa' */
+    metaKey?: string;
+    /** Which kinds of provenance to include. @default ['rule','breakpoint'] */
+    include?: Array<'rule' | 'breakpoint'>;
+  };
   /**
    * Maximum pages per segment before breakpoints are applied.
    *
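
A usage sketch for the `debug` option declared above; the exact contents of the recorded provenance objects are not shown in this diff:

```typescript
import { segmentPages, type Page } from 'flappa-doormal';

const pages: Page[] = [{ id: 1, content: '١ - حدثنا أحمد' }];

const segments = segmentPages(pages, {
    rules: [{ lineStartsWith: ['{{raqms}} {{dash}} '] }],
    debug: { metaKey: '_flappa', include: ['rule', 'breakpoint'] },
});

// Segments may now carry provenance under the configured key, e.g.
// segments[0].meta?._flappa.rule       → which rule created the boundary
// segments[0].meta?._flappa.breakpoint → which breakpoint split the piece
```
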
@@ -770,7 +749,7 @@ type Segment = {
 /**
  * Types of validation issues that can be detected.
  */
-type ValidationIssueType = 'missing_braces' | 'unknown_token' | 'duplicate';
+type ValidationIssueType = 'missing_braces' | 'unknown_token' | 'duplicate' | 'empty_pattern';
 /**
  * A validation issue found in a pattern.
  */
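
A sketch of how the new `empty_pattern` issue might surface; the `validateRules()` result shape beyond `ValidationIssue.type` is assumed here:

```typescript
import { validateRules } from 'flappa-doormal';

// Assumed: an empty pattern string is now flagged instead of silently matching everywhere.
const results = validateRules([{ lineStartsWith: [''] }]);
// e.g. results[0].issues[0].type === 'empty_pattern'
```
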
@@ -1149,70 +1128,18 @@ declare const getAvailableTokens: () => string[];
  */
 declare const getTokenPattern: (tokenName: string) => string | undefined;
 //#endregion
-//#region src/analysis.d.ts
+//#region src/analysis/line-starts.d.ts
 type LineStartAnalysisOptions = {
-  /** Return top K patterns (after filtering). Default: 20 */
   topK?: number;
-  /** Only consider the first N characters of each trimmed line. Default: 60 */
   prefixChars?: number;
-  /** Ignore lines shorter than this (after trimming). Default: 6 */
   minLineLength?: number;
-  /** Only include patterns that appear at least this many times. Default: 3 */
   minCount?: number;
-  /** Keep up to this many example lines per pattern. Default: 5 */
   maxExamples?: number;
-  /**
-   * If true, include a literal first word when no token match is found at the start.
-   * Default: true
-   */
   includeFirstWordFallback?: boolean;
-  /**
-   * If true, strip Arabic diacritics (harakat/tashkeel) for the purposes of matching tokens.
-   * This helps patterns like `وأَخْبَرَنَا` match the `{{naql}}` token (`وأخبرنا`).
-   *
-   * Note: examples are still stored in their original (unstripped) form.
-   *
-   * Default: true
-   */
   normalizeArabicDiacritics?: boolean;
-  /**
-   * How to sort patterns before applying `topK`.
-   *
-   * - `specificity` (default): prioritize more structured prefixes first (tokenCount, then literalLen), then count.
-   * - `count`: prioritize highest-frequency patterns first, then specificity.
-   */
   sortBy?: 'specificity' | 'count';
-  /**
-   * Optional filter to restrict which lines are analyzed.
-   *
-   * The `line` argument is the trimmed + whitespace-collapsed version of the line.
-   * Return `true` to include it, `false` to skip it.
-   *
-   * @example
-   * // Only analyze markdown H2 headings
-   * { lineFilter: (line) => line.startsWith('## ') }
-   */
   lineFilter?: (line: string, pageId: number) => boolean;
-  /**
-   * Optional list of prefix matchers to consume before tokenization.
-   *
-   * This is for "syntactic" prefixes that are common at line start but are not
-   * meaningful as tokens by themselves (e.g. markdown headings like `##`).
-   *
-   * Each matcher is applied at the current position. If it matches, the matched
-   * text is appended (escaped) to the signature and the scanner advances.
-   *
-   * @example
-   * // Support markdown blockquotes and headings
-   * { prefixMatchers: [/^>+/u, /^#+/u] }
-   */
   prefixMatchers?: RegExp[];
-  /**
-   * How to represent whitespace in returned `pattern` signatures.
-   *
-   * - `regex` (default): use `\\s*` placeholders between tokens (useful if you paste patterns into regex-ish templates).
-   * - `space`: use literal single spaces (`' '`) between tokens (safer if you don't want `\\s` to match newlines when reused as regex).
-   */
   whitespace?: 'regex' | 'space';
 };
 type LineStartPatternExample = {
@@ -1226,12 +1153,41 @@ type CommonLineStartPattern = {
 };
 /**
  * Analyze pages and return the most common line-start patterns (top K).
- *
- * This is a pure algorithmic heuristic: it tokenizes common prefixes into a stable
- * template-ish string using the library tokens (e.g., `{{bab}}`, `{{raqms}}`, `{{rumuz}}`).
  */
 declare const analyzeCommonLineStarts: (pages: Page[], options?: LineStartAnalysisOptions) => CommonLineStartPattern[];
 //#endregion
+//#region src/analysis/repeating-sequences.d.ts
+type RepeatingSequenceOptions = {
+  minElements?: number;
+  maxElements?: number;
+  minCount?: number;
+  topK?: number;
+  normalizeArabicDiacritics?: boolean;
+  requireToken?: boolean;
+  whitespace?: 'regex' | 'space';
+  maxExamples?: number;
+  contextChars?: number;
+  maxUniquePatterns?: number;
+};
+type RepeatingSequenceExample = {
+  text: string;
+  context: string;
+  pageId: number;
+  startIndices: number[];
+};
+type RepeatingSequencePattern = {
+  pattern: string;
+  count: number;
+  examples: RepeatingSequenceExample[];
+};
+/**
+ * Analyze pages for commonly repeating word sequences.
+ *
+ * Use for continuous text without line breaks. For line-based analysis,
+ * use `analyzeCommonLineStarts()` instead.
+ */
+declare const analyzeRepeatingSequences: (pages: Page[], options?: RepeatingSequenceOptions) => RepeatingSequencePattern[];
+//#endregion
 //#region src/detection.d.ts
 /**
  * Pattern detection utilities for recognizing template tokens in Arabic text.
@@ -1307,5 +1263,67 @@ declare const analyzeTextForRule: (text: string) => {
   detected: DetectedPattern[];
 } | null;
 //#endregion
-
+//#region src/recovery.d.ts
+type MarkerRecoverySelector = {
+  type: 'rule_indices';
+  indices: number[];
+} | {
+  type: 'lineStartsAfter_patterns';
+  match?: 'exact' | 'normalized';
+  patterns: string[];
+} | {
+  type: 'predicate';
+  predicate: (rule: SplitRule, index: number) => boolean;
+};
+type MarkerRecoveryRun = {
+  options: SegmentationOptions;
+  pages: Page[];
+  segments: Segment[];
+  selector: MarkerRecoverySelector;
+};
+type MarkerRecoveryReport = {
+  summary: {
+    mode: 'rerun_only' | 'best_effort_then_rerun';
+    recovered: number;
+    totalSegments: number;
+    unchanged: number;
+    unresolved: number;
+  };
+  byRun?: Array<{
+    recovered: number;
+    runIndex: number;
+    totalSegments: number;
+    unresolved: number;
+  }>;
+  details: Array<{
+    from: number;
+    originalStartPreview: string;
+    recoveredPrefixPreview?: string;
+    recoveredStartPreview?: string;
+    segmentIndex: number;
+    status: 'recovered' | 'skipped_idempotent' | 'unchanged' | 'unresolved_alignment' | 'unresolved_selector';
+    strategy: 'rerun' | 'stage1' | 'none';
+    to?: number;
+    notes?: string[];
+  }>;
+  errors: string[];
+  warnings: string[];
+};
+type NormalizeCompareMode = 'none' | 'whitespace' | 'whitespace_and_nfkc';
+declare function recoverMistakenLineStartsAfterMarkers(pages: Page[], segments: Segment[], options: SegmentationOptions, selector: MarkerRecoverySelector, opts?: {
+  mode?: 'rerun_only' | 'best_effort_then_rerun';
+  normalizeCompare?: NormalizeCompareMode;
+}): {
+  report: MarkerRecoveryReport;
+  segments: Segment[];
+};
+declare function recoverMistakenMarkersForRuns(runs: MarkerRecoveryRun[], opts?: {
+  mode?: 'rerun_only' | 'best_effort_then_rerun';
+  normalizeCompare?: NormalizeCompareMode;
+}): {
+  report: MarkerRecoveryReport;
+  segments: Segment[];
+};
+//#endregion
+export { type Breakpoint, type BreakpointRule, type CommonLineStartPattern, type DetectedPattern, type ExpandResult, type LineStartAnalysisOptions, type LineStartPatternExample, type Logger, type MarkerRecoveryReport, type MarkerRecoveryRun, type MarkerRecoverySelector, type Page, type PageRange, type RepeatingSequenceExample, type RepeatingSequenceOptions, type RepeatingSequencePattern, type ReplaceRule, type RuleValidationResult, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, type ValidationIssue, type ValidationIssueType, analyzeCommonLineStarts, analyzeRepeatingSequences, analyzeTextForRule, applyReplacements, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandCompositeTokensInTemplate, expandTokens, expandTokensWithCaptures, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, recoverMistakenLineStartsAfterMarkers, recoverMistakenMarkersForRuns, segmentPages, suggestPatternConfig, templateToRegex, validateRules };
 //# sourceMappingURL=index.d.mts.map
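
A sketch of the batch variant declared above, `recoverMistakenMarkersForRuns`, assuming each `MarkerRecoveryRun` is processed like a single-run call and `report.byRun` aggregates per-run counts:

```typescript
import { recoverMistakenMarkersForRuns, segmentPages, type Page } from 'flappa-doormal';

const pagesA: Page[] = [{ id: 1, content: 'وروى أحمد' }];
const optionsA = { rules: [{ lineStartsAfter: ['وروى '] }] };
const segmentsA = segmentPages(pagesA, optionsA);

// One entry per book/run; each selector scopes which rules are recovered.
const { segments, report } = recoverMistakenMarkersForRuns(
    [{ pages: pagesA, segments: segmentsA, options: optionsA, selector: { type: 'rule_indices', indices: [0] } }],
    { mode: 'rerun_only', normalizeCompare: 'whitespace' },
);

console.log(report.summary, report.byRun);
```
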