flappa-doormal 2.3.1 → 2.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/AGENTS.md CHANGED
@@ -379,6 +379,8 @@ bunx biome lint .
  | `{{raqms}}` | Multiple Arabic-Indic numerals | ٧٥٦٣ |
  | `{{raqms:num}}` | Numerals with named capture | `meta.num = "٧٥٦٣"` |
  | `{{dash}}` | Various dash characters | - – — ـ |
+ | `{{harfs}}` | Single-letter codes separated by spaces | `د ت س ي ق` |
+ | `{{rumuz}}` | rijāl/takhrīj source abbreviations (matches blocks like `خت ٤`, `خ سي`) | `خت ٤` |
  | `{{numbered}}` | Composite: `{{raqms}} {{dash}}` | ٧٥٦٣ - |

  **Named captures**: Add `:name` suffix to capture into `meta`:
@@ -387,3 +389,55 @@ bunx biome lint .
  // → segment.meta.hadithNum = "٧٥٦٣"
  ```

+ ## Page-start Guard (`pageStartGuard`)
+
+ Some books contain page-wrap continuations where a new page starts with a common line-start marker (e.g. `{{naql}}`) but it is not a true new segment.
+
+ Use `pageStartGuard` on a rule to allow matches at the start of a page **only if** the previous page’s last non-whitespace character matches a pattern (tokens supported):
+
+ ```typescript
+ {
+     fuzzy: true,
+     lineStartsWith: ['{{naql}}'],
+     split: 'at',
+     pageStartGuard: '{{tarqim}}'
+ }
+ ```
+
+ Notes:
+ - Applies only at page starts; mid-page line starts are unaffected.
+ - Implemented in `src/segmentation/segmenter.ts` match filtering.
+
+ ## Analysis Helper (`analyzeCommonLineStarts`)
+
+ `analyzeCommonLineStarts(pages)` scans lines across pages and returns common template-like line-start signatures (tokenized with `TOKEN_PATTERNS`). It’s intended to help you quickly discover rule candidates without using an LLM.
+
+ Useful options (recent additions):
+ - **`sortBy`**: `'specificity'` (default) or `'count'` (highest-frequency first). `topK` is applied **after** sorting.
+ - **`lineFilter`**: restrict which lines are analyzed (e.g. only Markdown headings).
+ - **`prefixMatchers`**: consume syntactic prefixes before tokenization (default includes headings via `/^#+/u`).
+   - This is how you see variations *after* prefixes like `##` instead of collapsing to just `"##"`.
+ - **`normalizeArabicDiacritics`**: `true` by default so tokens match diacritized forms (e.g. `وأَخْبَرَنَا` → `{{naql}}`).
+
+ Examples:
+
+ ```typescript
+ import { analyzeCommonLineStarts } from 'flappa-doormal';
+
+ // Top 20 by frequency
+ const top20 = analyzeCommonLineStarts(pages, { sortBy: 'count', topK: 20 });
+
+ // Only headings (## / ### / ...)
+ const headings = analyzeCommonLineStarts(pages, {
+     lineFilter: (line) => line.startsWith('#'),
+     sortBy: 'count',
+ });
+
+ // Custom prefixes (e.g. blockquotes + headings)
+ const quoted = analyzeCommonLineStarts(pages, {
+     lineFilter: (line) => line.startsWith('>') || line.startsWith('#'),
+     prefixMatchers: [/^>+/u, /^#+/u],
+     sortBy: 'count',
+ });
+ ```
+
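The `normalizeArabicDiacritics` option described above amounts to stripping harakat before token matching. A minimal standalone sketch (the Unicode range is an assumption; the library's exact normalization may differ):

```typescript
// Strip Arabic diacritics (fatha, damma, kasra, tanwin, shadda, sukun, dagger alif)
// so diacritized text can match undiacritized token patterns like {{naql}}.
const stripDiacritics = (text: string): string =>
    text.replace(/[\u064B-\u0652\u0670]/gu, '');

console.log(stripDiacritics('وأَخْبَرَنَا')); // 'وأخبرنا'
```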
package/README.md CHANGED
@@ -90,7 +90,8 @@ Replace regex with readable tokens:
  | `{{raqm}}` | Single Arabic digit | `[\\u0660-\\u0669]` |
  | `{{dash}}` | Dash variants | `[-–—ـ]` |
  | `{{harf}}` | Arabic letter | `[أ-ي]` |
- | `{{harfs}}` | Arabic letters with spaces | `[أ-ي](?:[أ-ي\s]*[أ-ي])?` |
+ | `{{harfs}}` | Single-letter codes separated by spaces | `[أ-ي](?:\s+[أ-ي])*` |
+ | `{{rumuz}}` | Source abbreviations (rijāl/takhrīj rumuz), incl. multi-code blocks | e.g. `خت ٤`, `خ سي`, `خ فق`, `د ت سي ق` |
  | `{{numbered}}` | Hadith numbering `٢٢ - ` | `{{raqms}} {{dash}} ` |
  | `{{fasl}}` | Section markers | `فصل\|مسألة` |
  | `{{tarqim}}` | Punctuation marks | `[.!?؟؛]` |
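The revised `{{harfs}}` expansion in the table above can be sanity-checked with plain regexes (no library import needed); this sketch contrasts it with the old expansion to show why the change matters:

```typescript
// New {{harfs}} expansion from the table, anchored for a full-string test:
// one letter, then zero or more (whitespace + letter) groups.
const harfsNew = /^[أ-ي](?:\s+[أ-ي])*$/u;
// Old expansion, for comparison.
const harfsOld = /^[أ-ي](?:[أ-ي\s]*[أ-ي])?$/u;

// Single-letter codes separated by spaces: matched by both.
console.log(harfsNew.test('د ت س ي ق')); // true
// A multi-letter word: the old pattern over-matched it; the new one rejects it.
console.log(harfsOld.test('حدثنا')); // true
console.log(harfsNew.test('حدثنا')); // false
```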
@@ -144,6 +145,26 @@ const rules = [{
  | `template` | Depends | Custom pattern with full control |
  | `regex` | Depends | Raw regex for complex cases |

+ ### 4.1 Page-start Guard (avoid page-wrap false positives)
+
+ When matching at line starts (e.g., `{{naql}}`), a new page can begin with a marker that is actually a **continuation** of the previous page (page wrap), not a true new segment.
+
+ Use `pageStartGuard` to allow a rule to match at the start of a page **only if** the previous page’s last non-whitespace character matches a pattern (tokens supported):
+
+ ```typescript
+ const segments = segmentPages(pages, {
+     rules: [{
+         fuzzy: true,
+         lineStartsWith: ['{{naql}}'],
+         split: 'at',
+         // Only allow a split at the start of a new page if the previous page ended with sentence punctuation:
+         pageStartGuard: '{{tarqim}}'
+     }]
+ });
+ ```
+
+ This guard applies **only at page starts**. Mid-page line starts are unaffected.
+
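The guard semantics can be pictured with plain string and regex operations. A simplified sketch that mirrors (but is not) the library's internal check, using the `{{tarqim}}` expansion `[.!?؟؛]` from the token table:

```typescript
// Simplified page-start guard: look at the previous page's last
// non-whitespace character and test it against the guard pattern.
const tarqim = /[.!?؟؛]/u;

const allowSplitAtPageStart = (previousPage: string): boolean => {
    const lastChar = previousPage.trimEnd().slice(-1);
    return tarqim.test(lastChar);
};

// Previous page ended a sentence: a page-start match may split here.
console.log(allowSplitAtPageStart('قال: سمعت رسول الله.')); // true
// Previous page was cut mid-sentence (page wrap): treat as continuation.
console.log(allowSplitAtPageStart('حدثنا أحمد بن')); // false
```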
  ### 5. Auto-Escaping Brackets

  In `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` patterns, parentheses `()` and square brackets `[]` are **automatically escaped**. This means you can write intuitive patterns without manual escaping:
@@ -296,19 +317,250 @@ const segments = segmentPages(pages, {

  ### Narrator Abbreviation Codes

- Use `{{harfs}}` for matching Arabic letter abbreviations with spaces (common in narrator biography books):
+ Use `{{rumuz}}` for matching rijāl/takhrīj source abbreviations (common in narrator biography books and takhrīj notes):

  ```typescript
  const segments = segmentPages(pages, {
      rules: [{
-         lineStartsAfter: ['{{raqms:num}} {{harfs}}:'],
+         lineStartsAfter: ['{{raqms:num}} {{rumuz}}:'],
          split: 'at'
      }]
  });

- // Matches: ١١١٨ د ت سي ق: حجاج بن دينار
+ // Matches: ١١١٨ ع: ... / ١١١٨ خ سي: ... / ١١١٨ خ فق: ...
  // meta: { num: '١١١٨' }
- // content: 'حجاج بن دينار' (abbreviations stripped)
+ // content: '...' (rumuz stripped)
+ ```
+
+ If your data uses *only single-letter codes separated by spaces* (e.g., `د ت س ي ق`), you can also use `{{harfs}}`.
+
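For intuition, the marker-stripping behavior of `lineStartsAfter` above can be approximated with a hand-rolled regex. This is a hypothetical simplification: the real `{{rumuz}}` expansion lives in the library and is not reproduced here.

```typescript
// Hypothetical simplification of a '{{raqms:num}} {{rumuz}}:' marker:
// Arabic-Indic digits, then short letter codes, then a colon.
const marker = /^([\u0660-\u0669]+)\s+[أ-ي]{1,3}(?:\s+[أ-ي]{1,3})*:\s*/u;

const line = '١١١٨ خ سي: حجاج بن دينار';
const m = line.match(marker);

const meta = { num: m?.[1] };                       // captured number
const content = m ? line.slice(m[0].length) : line; // marker stripped from content
```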
+ ## Analysis Helpers (no LLM required)
+
+ Use `analyzeCommonLineStarts(pages)` to discover common line-start signatures across a book, useful for rule authoring:
+
+ ```typescript
+ import { analyzeCommonLineStarts } from 'flappa-doormal';
+
+ const patterns = analyzeCommonLineStarts(pages);
+ // [{ pattern: "{{numbered}}", count: 1234, examples: [...] }, ...]
+ ```
+
+ You can control **what gets analyzed** and **how results are ranked**:
+
+ ```typescript
+ import { analyzeCommonLineStarts } from 'flappa-doormal';
+
+ // Top 20 most common line-start signatures (by frequency)
+ const topByCount = analyzeCommonLineStarts(pages, {
+     sortBy: 'count',
+     topK: 20,
+ });
+
+ // Only analyze markdown H2 headings (lines beginning with "##")
+ // This shows what comes AFTER the heading marker (e.g. "## {{bab}}", "## {{numbered}}\\[", etc.)
+ const headingVariants = analyzeCommonLineStarts(pages, {
+     lineFilter: (line) => line.startsWith('##'),
+     sortBy: 'count',
+     topK: 40,
+ });
+
+ // Support additional prefix styles without changing library code
+ // (e.g. markdown blockquotes ">> ..." + headings)
+ const quotedHeadings = analyzeCommonLineStarts(pages, {
+     lineFilter: (line) => line.startsWith('>') || line.startsWith('#'),
+     prefixMatchers: [/^>+/u, /^#+/u],
+     sortBy: 'count',
+     topK: 40,
+ });
+ ```
+
+ Key options:
+ - `sortBy`: `'specificity'` (default) or `'count'` (highest frequency first)
+ - `lineFilter`: restrict which lines are counted (e.g. only headings)
+ - `prefixMatchers`: consume syntactic prefixes (default includes headings via `/^#+/u`) so you can see variations *after* the prefix
+ - `normalizeArabicDiacritics`: `true` by default (helps token matching like `وأَخْبَرَنَا` → `{{naql}}`)
+
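The `sortBy`/`topK` interaction described above (sort first, then truncate) can be sketched without the library:

```typescript
type Candidate = { pattern: string; count: number };

// sortBy: 'count' — rank by frequency first, THEN apply topK.
const rankByCount = (candidates: Candidate[], topK: number): Candidate[] =>
    [...candidates].sort((a, b) => b.count - a.count).slice(0, topK);

const ranked = rankByCount(
    [
        { pattern: '{{bab}}', count: 180 },
        { pattern: '{{numbered}}', count: 1200 },
        { pattern: '{{fasl}}', count: 12 },
    ],
    2,
);
// ranked: [{ pattern: '{{numbered}}', count: 1200 }, { pattern: '{{bab}}', count: 180 }]
```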
+ ## Prompting LLMs / Agents to Generate Rules (Shamela books)
+
+ ### Pre-analysis (no LLM required): generate “hints” from the book
+
+ Before prompting an LLM, you can quickly extract **high-signal pattern hints** from the book using:
+ - `analyzeCommonLineStarts(pages, options)` (from `src/line-start-analysis.ts`): common **line-start signatures** (tokenized)
+ - `analyzeTextForRule(text)` / `detectTokenPatterns(text)` (from `src/pattern-detection.ts`): turn a **single representative line** into a token template suggestion
+
+ These help the LLM avoid guessing and focus on the patterns actually present.
+
+ #### Step 1: top line-start signatures (frequency-first)
+
+ ```typescript
+ import { analyzeCommonLineStarts } from 'flappa-doormal';
+
+ const top = analyzeCommonLineStarts(pages, {
+     sortBy: 'count',
+     topK: 40,
+     minCount: 10,
+ });
+
+ console.log(top.map((p) => ({ pattern: p.pattern, count: p.count, example: p.examples[0] })));
+ ```
+
+ Typical output (example):
+
+ ```text
+ [
+     { pattern: "{{numbered}}", count: 1200, example: { pageId: 50, line: "١ - حَدَّثَنَا ..." } },
+     { pattern: "{{bab}}", count: 180, example: { pageId: 66, line: "باب ..." } },
+     { pattern: "##\\s*{{bab}}", count: 140, example: { pageId: 69, line: "## باب ..." } }
+ ]
+ ```
+
+ If you only want to analyze headings (to see what comes *after* `##`):
+
+ ```typescript
+ const headingVariants = analyzeCommonLineStarts(pages, {
+     lineFilter: (line) => line.startsWith('##'),
+     sortBy: 'count',
+     topK: 40,
+ });
+ ```
+
+ #### Step 2: convert a few representative lines into token templates
+
+ Pick 3–10 representative line prefixes from the book (often from the examples returned above) and run:
+
+ ```typescript
+ import { analyzeTextForRule } from 'flappa-doormal';
+
+ console.log(analyzeTextForRule("٢٩- خ سي: أحمد بن حميد ..."));
+ // -> { template: "{{raqms}}- {{rumuz}}: أحمد...", patternType: "lineStartsAfter", fuzzy: false, ... }
+ ```
+
+ #### Step 3: paste the “hints” into your LLM prompt
+
+ When you prompt the LLM, include a short “Hints” section:
+ - Top 20–50 `analyzeCommonLineStarts` patterns (with counts + 1–2 examples)
+ - 3–10 `analyzeTextForRule(...)` results
+ - A small sample of pages (not the full book)
+
+ Then instruct the LLM to **prioritize rules that align with those hints**.
+
+ You can use an LLM to generate `SegmentationOptions` by pasting in a random subset of pages and asking it to infer robust segmentation rules. Here’s a ready-to-copy plain-text prompt:
+
+ ```text
+ You are helping me generate JSON configuration for a text-segmentation function called segmentPages(pages, options).
+ It segments Arabic book pages (e.g., Shamela) into logical segments (books/chapters/sections/entries/hadiths).
+
+ I will paste a random subset of pages so you can infer patterns. You must respond with ONLY JSON (no prose).
+
+ Each page has:
+ - id: page number (not necessarily consecutive)
+ - content: plain text; line breaks are \n
+
+ Output ONLY a JSON object compatible with SegmentationOptions (no prose, no code fences).
+
+ SegmentationOptions shape:
+ - rules: SplitRule[]
+ - optional: maxPages, breakpoints, prefer
+
+ SplitRule constraints:
+ - Each rule must use exactly ONE of: lineStartsWith, lineStartsAfter, lineEndsWith, template, regex
+ - Optional fields: split ("at" | "after"), meta, min, max, exclude, occurrence ("first" | "last"), fuzzy
+
+ Important behaviors:
+ - lineStartsAfter matches at line start but strips the marker from segment.content.
+ - Template patterns (lineStartsWith/After/EndsWith/template) auto-escape ()[] outside tokens.
+ - Raw regex patterns do NOT auto-escape and can include groups, named captures, etc.
+
+ Available tokens you may use in templates:
+ - {{basmalah}} (بسم الله / ﷽)
+ - {{kitab}} (كتاب)
+ - {{bab}} (باب)
+ - {{fasl}} (فصل | مسألة)
+ - {{naql}} (حدثنا/أخبرنا/... narration phrases)
+ - {{raqm}} (single Arabic-Indic digit)
+ - {{raqms}} (Arabic-Indic digits)
+ - {{dash}} (dash variants)
+ - {{tarqim}} (punctuation [. ! ? ؟ ؛])
+ - {{harf}} (Arabic letter)
+ - {{harfs}} (single-letter codes separated by spaces; e.g. "د ت س ي ق")
+ - {{rumuz}} (rijāl/takhrīj source abbreviations; matches blocks like "خت ٤", "خ سي", "خ فق")
+
+ Named captures:
+ - {{raqms:num}} captures to meta.num
+ - {{:name}} captures arbitrary text to meta.name
+
+ Your tasks:
+ 1) Identify document structure from the sample:
+    - book headers (كتاب), chapter headers (باب), sections (فصل/مسألة), hadith numbering, biography entries, etc.
+ 2) Propose a minimal but robust ordered ruleset:
+    - Put most-specific rules first.
+    - Use fuzzy:true for Arabic headings where diacritics vary.
+    - Use lineStartsAfter when you want to remove the marker (e.g., hadith numbers, rumuz prefixes).
+ 3) Use constraints:
+    - Use min/max/exclude when front matter differs or specific pages are noisy.
+ 4) If segments can span many pages:
+    - Set maxPages and breakpoints.
+    - Suggested breakpoints (in order): "{{tarqim}}\\s*", "\\n", "" (page boundary)
+    - Prefer "longer" unless there’s a reason to prefer shorter segments.
+ 5) Capture useful metadata:
+    - For numbering patterns, capture the number into meta.num (e.g., {{raqms:num}}).
+
+ Examples (what good answers look like):
+
+ Example A: hadith-style numbered segments
+ Input pages:
+ PAGE 10:
+ ٣٤ - حَدَّثَنَا ...\n... (rest of hadith)
+ PAGE 11:
+ ٣٥ - حَدَّثَنَا ...\n... (rest of hadith)
+
+ Good JSON answer:
+ {
+   "rules": [
+     {
+       "lineStartsAfter": ["{{raqms:num}} {{dash}}\\s*"],
+       "split": "at",
+       "meta": { "type": "hadith" }
+     }
+   ]
+ }
+
+ Example B: chapter markers + hadith numbers
+ Input pages:
+ PAGE 50:
+ كتاب الصلاة\nباب فضل الصلاة\n١ - حَدَّثَنَا ...\n...
+ PAGE 51:
+ ٢ - حَدَّثَنَا ...\n...
+
+ Good JSON answer:
+ {
+   "rules": [
+     { "fuzzy": true, "lineStartsWith": ["{{kitab}}"], "split": "at", "meta": { "type": "book" } },
+     { "fuzzy": true, "lineStartsWith": ["{{bab}}"], "split": "at", "meta": { "type": "chapter" } },
+     { "lineStartsAfter": ["{{raqms:num}}\\s*{{dash}}\\s*"], "split": "at", "meta": { "type": "hadith" } }
+   ]
+ }
+
+ Example C: narrator/rijāl entries with rumuz (codes) + colon
+ Input pages:
+ PAGE 257:
+ ٢٩- خ سي: أحمد بن حميد...\nوكان من حفاظ الكوفة.
+ PAGE 258:
+ ١٠٢- ق: تمييز ولهم شيخ آخر...\n...
+
+ Good JSON answer:
+ {
+   "rules": [
+     {
+       "lineStartsAfter": ["{{raqms:num}}\\s*{{dash}}\\s*{{rumuz}}:\\s*"],
+       "split": "at",
+       "meta": { "type": "entry" }
+     }
+   ]
+ }
+
+ Now wait for the pages.
  ```

  ### Sentence-Based Splitting (Last Period Per Page)
package/dist/index.d.mts CHANGED
@@ -360,6 +360,27 @@ type RuleConstraints = {
   * - undefined: No fallback (current behavior)
   */
  fallback?: 'page';
+ /**
+  * Page-start guard: only allow this rule to match at the START of a page if the
+  * previous page's last non-whitespace character matches this pattern.
+  *
+  * This is useful for avoiding false positives caused purely by page wrap.
+  *
+  * Example use-case:
+  * - Split on `{{naql}}` at line starts (e.g. "أخبرنا ...")
+  * - BUT if a new page starts with "أخبرنا ..." and the previous page did NOT
+  *   end with sentence-ending punctuation, treat it as a continuation and do not split.
+  *
+  * Notes:
+  * - This guard applies ONLY at page starts, not mid-page line starts.
+  * - This is a template pattern (tokens allowed). It is checked against the LAST
+  *   non-whitespace character of the previous page's content.
+  *
+  * @example
+  * // Allow split at page start only if previous page ends with sentence punctuation
+  * { lineStartsWith: ['{{naql}}'], fuzzy: true, pageStartGuard: '{{tarqim}}' }
+  */
+ pageStartGuard?: string;
  };
  /**
   * A complete split rule combining pattern, behavior, and constraints.
@@ -720,7 +741,6 @@ type Segment = {
  };
  //#endregion
  //#region src/segmentation/segmenter.d.ts
-
  /**
   * Applies breakpoints to oversized segments.
   *
@@ -779,25 +799,6 @@ type Segment = {
   */
  declare const segmentPages: (pages: Page[], options: SegmentationOptions) => Segment[];
  //#endregion
- //#region src/segmentation/textUtils.d.ts
- /**
-  * Strip all HTML tags from content, keeping only text.
-  *
-  * @param html - HTML content
-  * @returns Plain text content
-  */
- declare const stripHtmlTags: (html: string) => string;
- /**
-  * Normalizes line endings to Unix-style (`\n`).
-  *
-  * Converts Windows (`\r\n`) and old Mac (`\r`) line endings to Unix style
-  * for consistent pattern matching across platforms.
-  *
-  * @param content - Raw content with potentially mixed line endings
-  * @returns Content with all line endings normalized to `\n`
-  */
- declare const normalizeLineEndings: (content: string) => string;
- //#endregion
  //#region src/segmentation/tokens.d.ts
  /**
   * Token-based template system for Arabic text pattern matching.
@@ -1039,7 +1040,83 @@ declare const getAvailableTokens: () => string[];
   */
  declare const getTokenPattern: (tokenName: string) => string | undefined;
  //#endregion
- //#region src/pattern-detection.d.ts
+ //#region src/analysis.d.ts
+ type LineStartAnalysisOptions = {
+     /** Return top K patterns (after filtering). Default: 20 */
+     topK?: number;
+     /** Only consider the first N characters of each trimmed line. Default: 60 */
+     prefixChars?: number;
+     /** Ignore lines shorter than this (after trimming). Default: 6 */
+     minLineLength?: number;
+     /** Only include patterns that appear at least this many times. Default: 3 */
+     minCount?: number;
+     /** Keep up to this many example lines per pattern. Default: 5 */
+     maxExamples?: number;
+     /**
+      * If true, include a literal first word when no token match is found at the start.
+      * Default: true
+      */
+     includeFirstWordFallback?: boolean;
+     /**
+      * If true, strip Arabic diacritics (harakat/tashkeel) for the purposes of matching tokens.
+      * This helps patterns like `وأَخْبَرَنَا` match the `{{naql}}` token (`وأخبرنا`).
+      *
+      * Note: examples are still stored in their original (unstripped) form.
+      *
+      * Default: true
+      */
+     normalizeArabicDiacritics?: boolean;
+     /**
+      * How to sort patterns before applying `topK`.
+      *
+      * - `specificity` (default): prioritize more structured prefixes first (tokenCount, then literalLen), then count.
+      * - `count`: prioritize highest-frequency patterns first, then specificity.
+      */
+     sortBy?: 'specificity' | 'count';
+     /**
+      * Optional filter to restrict which lines are analyzed.
+      *
+      * The `line` argument is the trimmed + whitespace-collapsed version of the line.
+      * Return `true` to include it, `false` to skip it.
+      *
+      * @example
+      * // Only analyze markdown H2 headings
+      * { lineFilter: (line) => line.startsWith('## ') }
+      */
+     lineFilter?: (line: string, pageId: number) => boolean;
+     /**
+      * Optional list of prefix matchers to consume before tokenization.
+      *
+      * This is for "syntactic" prefixes that are common at line start but are not
+      * meaningful as tokens by themselves (e.g. markdown headings like `##`).
+      *
+      * Each matcher is applied at the current position. If it matches, the matched
+      * text is appended (escaped) to the signature and the scanner advances.
+      *
+      * @example
+      * // Support markdown blockquotes and headings
+      * { prefixMatchers: [/^>+/u, /^#+/u] }
+      */
+     prefixMatchers?: RegExp[];
+ };
+ type LineStartPatternExample = {
+     line: string;
+     pageId: number;
+ };
+ type CommonLineStartPattern = {
+     pattern: string;
+     count: number;
+     examples: LineStartPatternExample[];
+ };
+ /**
+  * Analyze pages and return the most common line-start patterns (top K).
+  *
+  * This is a pure algorithmic heuristic: it tokenizes common prefixes into a stable
+  * template-ish string using the library tokens (e.g., `{{bab}}`, `{{raqms}}`, `{{rumuz}}`).
+  */
+ declare const analyzeCommonLineStarts: (pages: Page[], options?: LineStartAnalysisOptions) => CommonLineStartPattern[];
+ //#endregion
+ //#region src/detection.d.ts
  /**
   * Pattern detection utilities for recognizing template tokens in Arabic text.
   * Used to auto-detect patterns from user-highlighted text in the segmentation dialog.
@@ -1114,5 +1191,5 @@ declare const analyzeTextForRule: (text: string) => {
  detected: DetectedPattern[];
  } | null;
  //#endregion
- export { type Breakpoint, type BreakpointRule, type DetectedPattern, type ExpandResult, type Logger, type Page, type PageRange, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, analyzeTextForRule, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandTokens, expandTokensWithCaptures, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, normalizeLineEndings, segmentPages, stripHtmlTags, suggestPatternConfig, templateToRegex };
+ export { type Breakpoint, type BreakpointRule, type CommonLineStartPattern, type DetectedPattern, type ExpandResult, type LineStartAnalysisOptions, type LineStartPatternExample, type Logger, type Page, type PageRange, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, analyzeCommonLineStarts, analyzeTextForRule, containsTokens, detectTokenPatterns, escapeRegex, escapeTemplateBrackets, expandTokens, expandTokensWithCaptures, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, segmentPages, suggestPatternConfig, templateToRegex };
  //# sourceMappingURL=index.d.mts.map