flappa-doormal 2.0.0 → 2.1.0

package/AGENTS.md ADDED
@@ -0,0 +1,325 @@
1
+ # AGENTS.md
2
+
3
+ ## Project Overview
4
+
5
+ **flappa-doormal** is a declarative text segmentation library for Arabic texts. It splits pages of content into logical segments (hadiths, chapters, verses) using pattern-based rules with a human-readable template syntax.
6
+
7
+ ### Why This Library Exists
8
+
9
+ Traditional Arabic text segmentation requires:
10
+ - Hand-crafted regular expressions with complex Unicode patterns
11
+ - Deep knowledge of Arabic diacritics and character equivalences
12
+ - Handling page boundaries and multi-page content spans
13
+ - Manual capture group management for metadata extraction
14
+
15
+ **flappa-doormal** solves these problems with:
16
+ - **Declarative rules**: Describe *what* to match, not *how* to match it
17
+ - **Template tokens**: `{{raqms}} {{dash}}` instead of `[\u0660-\u0669]+\s*[-–—ـ]`
18
+ - **Named captures**: `{{raqms:hadithNum}}` → automatic `meta.hadithNum` extraction
19
+ - **Fuzzy matching**: Diacritic-insensitive matching for harakat variations
20
+ - **Page-aware**: Tracks which page each segment comes from
21
+
22
+ ## Architecture
23
+
24
+ ### Repository Structure
25
+
26
+ ```text
27
+ src/
28
+ ├── index.ts # Main entry point and exports
29
+ ├── pattern-detection.ts # Token detection for auto-generating rules (NEW)
30
+ ├── pattern-detection.test.ts # Pattern detection tests (22 tests)
31
+ └── segmentation/
32
+ ├── types.ts # TypeScript type definitions for rules/segments
33
+ ├── segmenter.ts # Core segmentation engine (segmentPages, applyBreakpoints)
34
+ ├── breakpoint-utils.ts # Extracted breakpoint processing utilities
35
+ ├── tokens.ts # Token definitions and expansion logic
36
+ ├── fuzzy.ts # Diacritic-insensitive matching utilities
37
+ ├── html.ts # HTML utilities (stripHtmlTags)
38
+ ├── textUtils.ts # Text processing utilities
39
+ ├── match-utils.ts # Extracted match processing utilities
40
+ ├── segmenter.test.ts # Core test suite (150+ tests including breakpoints)
41
+ ├── segmenter.bukhari.test.ts # Real-world test cases
42
+ ├── breakpoint-utils.test.ts # Breakpoint utility tests (42 tests)
43
+ ├── tokens.test.ts # Token expansion tests
44
+ ├── fuzzy.test.ts # Fuzzy matching tests
45
+ ├── textUtils.test.ts # Text utility tests
46
+ └── match-utils.test.ts # Utility function tests
47
+
48
+ test/
49
+ ├── 2576.json # Test data for book 2576 (Sahih Bukhari)
50
+ └── 2588.json # Test data for book 2588 (Al-Mughni)
51
+
52
+ docs/
53
+ ├── checkpoints/ # AI agent handoff documentation
54
+ │ └── 2025-12-09-handoff.md
55
+ └── reviews/ # Performance analysis reports
56
+ └── 2025-12-10/
57
+ ```
58
+
59
+ ### Core Components
60
+
61
+ 1. **`segmentPages(pages, options)`** - Main entry point
62
+ - Takes array of `{id, content}` pages and split rules
63
+ - Returns array of `{content, from, to?, meta?}` segments
64
+
65
+ 2. **`tokens.ts`** - Template system
66
+ - `TOKEN_PATTERNS` - Map of token names to regex patterns
67
+ - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
68
+ - Supports fuzzy transform for diacritic-insensitive matching
69
+
70
+ 3. **`match-utils.ts`** - Extracted utilities (for testability)
71
+ - `extractNamedCaptures()` - Get named groups from regex match
72
+ - `filterByConstraints()` - Apply min/max page filters
73
+ - `anyRuleAllowsId()` - Check if page passes rule constraints
74
+
75
+ 4. **`breakpoint-utils.ts`** - Breakpoint processing utilities (NEW)
76
+ - `normalizeBreakpoint()` - Convert string to BreakpointRule object
77
+ - `isPageExcluded()` - Check if page is in exclude list
78
+ - `isInBreakpointRange()` - Validate page against min/max/exclude constraints
79
+ - `buildExcludeSet()` - Create Set from PageRange[] for O(1) lookups
80
+ - `createSegment()` - Create segment with optional to/meta fields
81
+ - `expandBreakpoints()` - Expand patterns with pre-compiled regexes
82
+ - `findActualEndPage()` - Search backwards for ending page by content
83
+ - `findBreakPosition()` - Find break position using breakpoint patterns
84
+ - `hasExcludedPageInRange()` - Check if range contains excluded pages
85
+ - `findNextPagePosition()` - Find next page content position
86
+ - `findPatternBreakPosition()` - Find pattern match by preference
87
+
88
+ 5. **`types.ts`** - Type definitions
89
+ - `Logger` interface - Optional logging for debugging
90
+ - `SegmentationOptions` - Options with `logger` property
91
+ - Verbosity levels: `trace`, `debug`, `info`, `warn`, `error`
92
+
93
+ 6. **`fuzzy.ts`** - Arabic text normalization
94
+ - `makeDiacriticInsensitive()` - Generate regex that ignores diacritics
95
+
96
+ 7. **`pattern-detection.ts`** - Token auto-detection (NEW)
97
+ - `detectTokenPatterns()` - Detect tokens in text with positions
98
+ - `generateTemplateFromText()` - Convert text to template string
99
+ - `suggestPatternConfig()` - Suggest rule configuration
100
+ - `analyzeTextForRule()` - Complete analysis returning template + config
101
+
102
+ ## Key Algorithms
103
+
104
+ ### Token Expansion
105
+
106
+ ```
107
+ Input: "{{raqms:num}} {{dash}} {{:text}}"
108
+ Output: "(?<num>[\u0660-\u0669]+) [-–—ـ] (?<text>.+)"
109
+ ```
110
+
111
+ The expansion algorithm:
112
+ 1. Splits query into token and text segments
113
+ 2. Looks up token patterns from `TOKEN_PATTERNS`
114
+ 3. Applies fuzzy transform if enabled (to the raw pattern, before any group wrapping)
115
+ 4. Wraps the result in a named capture group if a `:name` suffix is present
116
+
117
+ ### Fuzzy Application Order
118
+
119
+ **Critical design decision**: Fuzzy transforms are applied to raw token patterns and plain text *before* they're wrapped in regex groups.
120
+
121
+ ```
122
+ WRONG: makeDiacriticInsensitive("(?<name>حدثنا)") // Breaks ( ? < > )
123
+ RIGHT: "(?<name>" + makeDiacriticInsensitive("حدثنا") + ")"
124
+ ```
125
+
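The expansion order described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's implementation: `SKETCH_TOKENS`, `toyFuzzy`, `fuzzyAlternatives`, and `expandSketch` are hypothetical names, the token table is a small excerpt, and the toy fuzzy transform only tolerates harakat (U+064B to U+0652) after each Arabic letter.

```typescript
// Hypothetical excerpt of a token table; the real TOKEN_PATTERNS is larger.
const SKETCH_TOKENS = new Map<string, string>([
  ['raqms', '[\\u0660-\\u0669]+'],
  ['dash', '[-–—ـ]'],
  ['narrated', 'حدثنا|أخبرنا'],
]);

// Optional run of harakat (U+064B..U+0652) after a letter.
const HARAKAT = '[\\u064B-\\u0652]*';

// Toy fuzzy transform: tolerate diacritics after each Arabic letter.
// The letter range skips tatweel (U+0640) so the dash class stays intact.
const toyFuzzy = (text: string): string =>
  text.replace(/[\u0621-\u063A\u0641-\u064A]/g, (letter) => letter + HARAKAT);

// Alternations get the transform per alternative, so '|' is never rewritten.
const fuzzyAlternatives = (pattern: string): string =>
  pattern.split('|').map(toyFuzzy).join('|');

const expandSketch = (template: string, fuzzy: boolean): string =>
  template
    .split(/(\{\{\w*(?::\w+)?\}\})/) // the capture group keeps tokens in the result
    .map((part) => {
      const m = /^\{\{(\w*)(?::(\w+))?\}\}$/.exec(part);
      if (m === null) return fuzzy ? toyFuzzy(part) : part; // plain text segment
      const token = m[1] ?? '';
      const name = m[2];
      const raw = token === '' ? '.+' : SKETCH_TOKENS.get(token);
      if (raw === undefined) return part; // unknown token: left as-is
      const body = fuzzy ? fuzzyAlternatives(raw) : raw; // fuzzy first...
      if (name !== undefined) return `(?<${name}>${body})`; // ...then wrap
      return body.includes('|') ? `(?:${body})` : body;
    })
    .join('');

console.log(expandSketch('{{raqms:num}} {{dash}} {{:text}}', false));
```

With `fuzzy` off, the template from the Token Expansion example should expand to the same output string shown there. With `fuzzy` on, the named group is added only after each alternative of `narrated` has been transformed, so the group syntax is never passed through the fuzzy step.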
126
+ ### lineStartsAfter Content Capture
127
+
128
+ For patterns like `^٦٦٩٦ - (content)`, the content capture is the *last* positional group:
129
+
130
+ ```typescript
131
+ // Pattern: ^(?:(?<num>[\u0660-\u0669]+) [-–—ـ] )(.*)
132
+ // Match: m[1] = named group value, m[2] = content
133
+ // Solution: Iterate backward from m.length-1 to find last defined capture
134
+ ```
135
+
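A minimal sketch of the backward scan described in the comments above. `lastDefinedCapture` and `sampleLine` are hypothetical names used for illustration; the library's own helper may differ.

```typescript
// Hypothetical helper: return the last positional capture that participated
// in the match, scanning from the end so named groups placed earlier in the
// pattern do not shift the result.
const lastDefinedCapture = (m: RegExpExecArray): string | undefined => {
  for (let i = m.length - 1; i >= 1; i--) {
    const value = m[i];
    if (value !== undefined) return value;
  }
  return undefined;
};

// Same shape as the pattern shown above: named number, dash, then content.
const pattern = /^(?:(?<num>[\u0660-\u0669]+) [-–—ـ] )(.*)/;
const sampleLine = '٦٦٩٦ - حدثنا';

const match = pattern.exec(sampleLine);
const content = match === null ? undefined : lastDefinedCapture(match);
const hadithNum = match?.groups?.num;

console.log({ hadithNum, content });
```

Scanning from the end means the helper does not need to know how many named groups precede the content group.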
136
+ ### Breakpoints Post-Processing Algorithm
137
+
138
+ The `breakpoints` option provides a post-processing mechanism for limiting segment size. Unlike the deprecated `maxSpan` (which was per-rule), breakpoints runs AFTER all structural rules.
139
+
140
+ **API Options:**
141
+ ```typescript
142
+ interface SegmentationOptions {
143
+ rules: SplitRule[];
144
+ maxPages?: number; // Maximum pages a segment can span
145
+ breakpoints?: Breakpoint[]; // Ordered array of patterns: strings (token expansion supported) or BreakpointRule objects
146
+ prefer?: 'longer' | 'shorter'; // Select last or first match within window
147
+ }
148
+ ```
149
+
150
+ **How it works:**
151
+ 1. Structural rules run first, creating initial segments
152
+ 2. Breakpoints then processes any segment exceeding `maxPages`
153
+ 3. Patterns are tried in order until one matches
154
+ 4. Empty string `''` means "fall back to page boundary"
155
+
156
+ **Example:**
157
+ ```typescript
158
+ segmentPages(pages, {
159
+ rules: [
160
+ { lineStartsWith: ['{{basmalah}}'], split: 'at' },
161
+ { lineStartsWith: ['{{bab}}'], split: 'at', meta: { type: 'chapter' } },
162
+ ],
163
+ maxPages: 2,
164
+ breakpoints: ['{{tarqim}}\\s*', '\\n', ''], // Try: punctuation → newline → page boundary
165
+ prefer: 'longer', // Greedy: make segments as large as possible
166
+ });
167
+ ```
168
+
169
+ **Key behaviors:**
170
+ - **Pattern order matters**: First matching pattern wins
171
+ - **`prefer: 'longer'`**: Finds LAST match in window (greedy)
172
+ - **`prefer: 'shorter'`**: Finds FIRST match (conservative)
173
+ - **Recursive**: If split result still exceeds `maxPages`, breakpoints runs again
174
+
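The pattern-order and `prefer` behaviors listed above can be illustrated with a small sketch. It is an assumption-laden simplification: `matchEnds` and `findBreakSketch` are hypothetical names (not the library's `findBreakPosition`), patterns are plain regex strings with no token expansion, and the "window" is just a string.

```typescript
type Prefer = 'longer' | 'shorter';

// Collect the end offset of every non-empty match of `source` in `text`.
const matchEnds = (text: string, source: string): number[] => {
  const re = new RegExp(source, 'g');
  const ends: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      re.lastIndex += 1; // avoid looping forever on empty matches
      continue;
    }
    ends.push(m.index + m[0].length);
  }
  return ends;
};

// Hypothetical helper: try patterns in order; the first one that matches wins.
// Returns the offset to split at, or -1 when nothing applies.
const findBreakSketch = (windowText: string, patterns: string[], prefer: Prefer): number => {
  for (const source of patterns) {
    if (source === '') return windowText.length; // '' = fall back to the boundary
    const ends = matchEnds(windowText, source);
    const chosen = prefer === 'longer' ? ends[ends.length - 1] : ends[0];
    if (chosen !== undefined) return chosen;
  }
  return -1;
};

const windowText = 'aa. bb. cc';
console.log(findBreakSketch(windowText, ['\\.\\s*', '\\n', ''], 'longer'));
console.log(findBreakSketch(windowText, ['\\.\\s*', '\\n', ''], 'shorter'));
```

In this sketch `'longer'` picks the last match of the first pattern that matches at all; it never skips ahead to a later pattern to find a later split point.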
175
+ > **Note**: The old `maxSpan` and `fallback` properties on `SplitRule` have been removed in favor of `breakpoints`.
176
+
177
+ ## Design Decisions
178
+
179
+ ### 1. Why `{{double-braces}}`?
180
+
181
+ - Single braces `{}` conflict with regex quantifiers `{n,m}`
182
+ - Double braces are visually distinct and rarely appear in content
183
+ - Consistent with template systems (Handlebars, Mustache)
184
+
185
+ ### 2. Why `lineStartsAfter` vs `lineStartsWith`?
186
+
187
+ | Pattern | Marker in content? | Use case |
188
+ |---------|-------------------|----------|
189
+ | `lineStartsWith` | ✅ Yes | Keep marker, segment at boundary |
190
+ | `lineStartsAfter` | ❌ No | Strip marker, capture only content |
191
+
192
+ ### 3. Why fuzzy transform at token level?
193
+
194
+ Applying fuzzy globally would corrupt regex metacharacters. Instead:
195
+ - Fuzzy is passed to `expandTokensWithCaptures()`
196
+ - Applied only to Arabic text portions
197
+ - Preserves `(`, `)`, `|`, `?`, etc.
198
+
199
+ ### 4. Why extract match utilities?
200
+
201
+ The original `segmentPages` had complexity 37 (max: 15). Extraction:
202
+ - Creates independently testable units
203
+ - Reduces main function complexity
204
+ - Improves code readability
205
+
206
+ ## Working with the Codebase
207
+
208
+ ### Adding a New Token
209
+
210
+ 1. Add to `TOKEN_PATTERNS` in `tokens.ts`:
211
+ ```typescript
212
+ export const TOKEN_PATTERNS = {
213
+ // ...existing
214
+ verse: '﴿[^﴾]+﴾', // Quranic verse markers
215
+ };
216
+ ```
217
+ 2. Add test cases in `segmenter.test.ts`
218
+ 3. Document in README.md
219
+
220
+ ### Adding a New Pattern Type
221
+
222
+ 1. Add type to union in `types.ts`:
223
+ ```typescript
224
+ type NewPattern = { newPatternField: string[] };
225
+ type PatternType = ... | NewPattern;
226
+ ```
227
+ 2. Handle in `buildRuleRegex()` in `segmenter.ts`
228
+ 3. Add comprehensive tests
229
+
230
+ ### Testing Strategy
231
+
232
+ - **Unit tests**: Each utility function has dedicated tests
233
+ - **Integration tests**: Full pipeline tests in `segmenter.test.ts`
234
+ - **Real-world tests**: `segmenter.bukhari.test.ts` uses actual hadith data
235
+ - Run: `bun test`
236
+
237
+ ## Code Quality Standards
238
+
239
+ 1. **TypeScript strict mode** - No `any` types
240
+ 2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
241
+ 3. **JSDoc comments** - All exported functions documented
242
+ 4. **Test coverage** - 251 tests across 8 files
243
+
244
+ ## Dependencies
245
+
246
+ ### Development
247
+ - **@biomejs/biome** - Linting and formatting
248
+ - **tsdown** - Build tool (generates `.mjs` and `.d.mts`)
249
+ - **Bun** - Runtime and test runner
250
+
251
+ ## Build & Release
252
+
253
+ ```bash
254
+ # Install dependencies
255
+ bun install
256
+
257
+ # Run all tests
258
+ bun test
259
+
260
+ # Build distribution
261
+ bun run build
262
+ # Output: dist/index.mjs (~17 KB, ~5.7 KB gzipped)
263
+
264
+ # Run performance test (generates 50K pages, measures segmentation speed/memory)
265
+ bun run perf
266
+
267
+ # Format code
268
+ bunx biome format --write .
269
+
270
+ # Lint code
271
+ bunx biome lint .
272
+ ```
273
+
274
+ ## Lessons Learned
275
+
276
+ ### From Development
277
+
278
+ 1. **Named captures shift positional indices**: When `(?<name>…)` appears before `(.*)`, the content is at `m[2]` not `m[1]`. Solution: iterate backward to find last defined group.
279
+
280
+ 2. **Fuzzy + metacharacters don't mix**: `makeDiacriticInsensitive` expands Arabic to character classes. If applied to `(?<name>text)`, it corrupts the `(`, `?`, `<`, `>` characters.
281
+
282
+ 3. **Alternations need per-alternative fuzzy**: Token `narrated: 'حدثنا|أخبرنا'` requires splitting at `|`, applying fuzzy to each, then rejoining.
283
+
284
+ 4. **Complexity extraction works**: Pulling logic into `match-utils.ts` reduced main function complexity from 37 to 10 and made the code testable.
285
+
286
+ 5. **Rule order matters for specificity**: When multiple rules can match the same position, put specific patterns BEFORE generic ones. Example: `## {{raqms:num}} {{dash}}` must come before `##` to capture the number.
287
+
288
+ 6. **Post-processing beats per-rule limits**: The `maxSpan` approach (per-rule page limits) caused premature splits. Moving to post-processing `breakpoints` preserves structural integrity while still limiting segment size.
289
+
290
+ 7. **Window padding matters**: When calculating approximate content windows, 50% padding is needed (not 20%) to ensure enough content is captured for `prefer: 'longer'` scenarios.
291
+
292
+ 8. **Escaping in tests requires care**: TypeScript string `'\\.'` creates regex `\.`, but regex literal `/\./` is already escaped. Double-backslash in strings, single in literals.
293
+
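The escaping point in item 8 can be checked directly. This is a standalone illustration using only the built-in `RegExp`; the variable names are arbitrary.

```typescript
// A string literal needs a doubled backslash to produce the regex \.
const fromString = new RegExp('\\.');
// A regex literal already is the pattern, so one backslash is enough.
const fromLiteral = /\./;

// Both describe the same pattern: a literal dot.
console.log(fromString.source === fromLiteral.source);
console.log(fromString.test('a.b'), fromString.test('ab'));
```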
294
+ ### Architecture Insights
295
+
296
+ - **Declarative > Imperative**: Users describe patterns, library handles regex
297
+ - **Composability**: Tokens can be combined freely with `:name` captures
298
+ - **Fail gracefully**: Unknown tokens are left as-is, allowing partial templates
299
+ - **Post-process > Inline**: Breakpoints runs after rules, avoiding conflicts
300
+ - **Dependency injection for testability**: `breakpoint-utils.ts` accepts a `PatternProcessor` function instead of importing `processPattern` directly, enabling independent testing without mocking
301
+ - **Optional logging**: Use optional chaining (`logger?.debug?.()`) so that no logging overhead is incurred when no logger is provided. All log methods are optional, allowing clients to subscribe to only the verbosity levels they need.
302
+
303
+ ---
304
+
305
+ ## Token Reference
306
+
307
+ | Token | Pattern Description | Example Match |
308
+ |-------|---------------------|---------------|
309
+ | `{{tarqim}}` | Arabic punctuation (. , ; ? ! ( ) etc.) | `؛` `،` `.` |
310
+ | `{{basmalah}}` | "بِسْمِ اللَّهِ" patterns | بِسْمِ اللَّهِ الرَّحْمَنِ |
311
+ | `{{bab}}` | "باب" (chapter) | بَابُ الإيمان |
312
+ | `{{fasl}}` | "فصل" (section) | فصل: في الطهارة |
313
+ | `{{kitab}}` | "كتاب" (book) | كتاب الصلاة |
314
+ | `{{raqm}}` | Single Arabic-Indic numeral | ٥ |
315
+ | `{{raqms}}` | Multiple Arabic-Indic numerals | ٧٥٦٣ |
316
+ | `{{raqms:num}}` | Numerals with named capture | `meta.num = "٧٥٦٣"` |
317
+ | `{{dash}}` | Various dash characters | - – — ـ |
318
+ | `{{numbered}}` | Composite: `{{raqms}} {{dash}}` | ٧٥٦٣ - |
319
+
320
+ **Named captures**: Add `:name` suffix to capture into `meta`:
321
+ ```typescript
322
+ '{{raqms:hadithNum}} {{dash}}'
323
+ // → segment.meta.hadithNum = "٧٥٦٣"
324
+ ```
325
+
package/README.md CHANGED
@@ -395,6 +395,32 @@ console.log(TOKEN_PATTERNS.narrated);
395
395
  // 'حدثنا|أخبرنا|حدثني|وحدثنا|أنبأنا|سمعت'
396
396
  ```
397
397
 
398
+ ### Pattern Detection
399
+
400
+ Auto-detect tokens in Arabic text for building rules:
401
+
402
+ ```typescript
403
+ import { detectTokenPatterns, analyzeTextForRule } from 'flappa-doormal';
404
+
405
+ // Detect individual tokens
406
+ const tokens = detectTokenPatterns('٣٤ - حدثنا');
407
+ // [
408
+ // { token: 'raqms', match: '٣٤', index: 0, endIndex: 2 },
409
+ // { token: 'dash', match: '-', index: 3, endIndex: 4 },
410
+ // { token: 'naql', match: 'حدثنا', index: 5, endIndex: 10 }
411
+ // ]
412
+
413
+ // Get complete rule suggestion
414
+ const rule = analyzeTextForRule('٣٤ - ');
415
+ // {
416
+ // template: '{{raqms}} {{dash}} ',
417
+ // patternType: 'lineStartsAfter',
418
+ // fuzzy: false,
419
+ // metaType: 'hadith',
420
+ // detected: [...]
421
+ // }
422
+ ```
423
+
398
424
  ## Types
399
425
 
400
426
  ### `SplitRule`
@@ -433,6 +459,56 @@ type Segment = {
433
459
  };
434
460
  ```
435
461
 
462
+ ### `Logger`
463
+
464
+ Optional logging interface for debugging segmentation:
465
+
466
+ ```typescript
467
+ interface Logger {
468
+ trace?: (message: string, ...args: unknown[]) => void; // Per-iteration details
469
+ debug?: (message: string, ...args: unknown[]) => void; // Detailed operations
470
+ info?: (message: string, ...args: unknown[]) => void; // Key progress points
471
+ warn?: (message: string, ...args: unknown[]) => void; // Potential issues
472
+ error?: (message: string, ...args: unknown[]) => void; // Critical failures
473
+ }
474
+ ```
475
+
476
+ ## Debugging
477
+
478
+ ### Using the Logger
479
+
480
+ Pass a `logger` option to receive detailed information about the segmentation process:
481
+
482
+ ```typescript
483
+ // Console logger for development
484
+ const segments = segmentPages(pages, {
485
+ rules: [...],
486
+ logger: {
487
+ debug: console.debug,
488
+ info: console.info,
489
+ warn: console.warn,
490
+ }
491
+ });
492
+
493
+ // Production logger (only errors)
494
+ const segments = segmentPages(pages, {
495
+ rules: [...],
496
+ logger: {
497
+ error: (msg, ...args) => myLoggingService.error(msg, args),
498
+ }
499
+ });
500
+ ```
501
+
502
+ **Verbosity levels:**
503
+ - `trace` - Per-iteration loop details (very verbose)
504
+ - `debug` - Segment processing, pattern matching
505
+ - `info` - Start/completion of breakpoint processing
506
+ - `warn` - Safety checks triggered
507
+ - `error` - Infinite loop detection
508
+
509
+ When no logger is provided, no logging overhead is incurred.
510
+
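A sketch of how a caller-supplied logger with optional methods can be invoked safely. The `Logger` shape is copied from the interface documented above; `processItems`, `buildSummary`, and `summariesBuilt` are hypothetical and not part of flappa-doormal.

```typescript
interface Logger {
  trace?: (message: string, ...args: unknown[]) => void;
  debug?: (message: string, ...args: unknown[]) => void;
  info?: (message: string, ...args: unknown[]) => void;
  warn?: (message: string, ...args: unknown[]) => void;
  error?: (message: string, ...args: unknown[]) => void;
}

let summariesBuilt = 0;

// Stand-in for an expensive debug payload.
const buildSummary = (items: string[]): string => {
  summariesBuilt += 1;
  return items.join(',');
};

// Hypothetical worker showing the call pattern.
const processItems = (items: string[], logger?: Logger): number => {
  logger?.info?.('processing started', items.length);
  for (const item of items) {
    logger?.trace?.('visiting item', item);
  }
  // Optional chaining short-circuits the whole call, so buildSummary()
  // is only evaluated when a debug method actually exists.
  logger?.debug?.('summary', buildSummary(items));
  return items.length;
};

const seen: string[] = [];
processItems(['a', 'b'], {
  info: (message) => {
    seen.push(message);
  },
});
console.log(seen, summariesBuilt);
```

Because a short-circuited optional call does not evaluate its arguments, a logger that omits `debug` also skips the cost of building the debug payload.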
511
+
436
512
  ## Usage with Next.js / Node.js
437
513
 
438
514
  ```typescript
@@ -474,7 +550,7 @@ console.log(`Found ${segments.length} segments`);
474
550
  # Install dependencies
475
551
  bun install
476
552
 
477
- # Run tests (222 tests)
553
+ # Run tests (251 tests)
478
554
  bun test
479
555
 
480
556
  # Build
package/dist/index.d.mts CHANGED
@@ -460,6 +460,42 @@ type BreakpointRule = {
460
460
  * { pattern: '{{tarqim}}\\s*', min: 10 }
461
461
  */
462
462
  type Breakpoint = string | BreakpointRule;
463
+ /**
464
+ * Logger interface for custom logging implementations.
465
+ *
466
+ * All methods are optional - only implement the verbosity levels you need.
467
+ * When no logger is provided, no logging overhead is incurred.
468
+ *
469
+ * Compatible with the Logger interface from ffmpeg-simplified and similar libraries.
470
+ *
471
+ * @example
472
+ * // Simple console logger
473
+ * const logger: Logger = {
474
+ * debug: console.debug,
475
+ * info: console.info,
476
+ * warn: console.warn,
477
+ * error: console.error,
478
+ * };
479
+ *
480
+ * @example
481
+ * // Production logger (only warnings and errors)
482
+ * const prodLogger: Logger = {
483
+ * warn: (msg, ...args) => myLoggingService.warn(msg, args),
484
+ * error: (msg, ...args) => myLoggingService.error(msg, args),
485
+ * };
486
+ */
487
+ interface Logger {
488
+ /** Log a debug message (verbose debugging output) */
489
+ debug?: (message: string, ...args: unknown[]) => void;
490
+ /** Log an error message (critical failures) */
491
+ error?: (message: string, ...args: unknown[]) => void;
492
+ /** Log an informational message (key progress points) */
493
+ info?: (message: string, ...args: unknown[]) => void;
494
+ /** Log a trace message (extremely verbose, per-iteration details) */
495
+ trace?: (message: string, ...args: unknown[]) => void;
496
+ /** Log a warning message (potential issues) */
497
+ warn?: (message: string, ...args: unknown[]) => void;
498
+ }
463
499
  /**
464
500
  * Segmentation options controlling how pages are split.
465
501
  *
@@ -480,6 +516,17 @@ type Breakpoint = string | BreakpointRule;
480
516
  * breakpoints: ['{{tarqim}}\\s*', '\\n', ''],
481
517
  * prefer: 'longer'
482
518
  * };
519
+ *
520
+ * @example
521
+ * // With custom logger for debugging
522
+ * const options: SegmentationOptions = {
523
+ * rules: [...],
524
+ * logger: {
525
+ * debug: console.debug,
526
+ * info: console.info,
527
+ * warn: console.warn,
528
+ * }
529
+ * };
483
530
  */
484
531
  type SegmentationOptions = {
485
532
  /**
@@ -542,6 +589,38 @@ type SegmentationOptions = {
542
589
  * @default 'longer'
543
590
  */
544
591
  prefer?: 'longer' | 'shorter';
592
+ /**
593
+ * Optional logger for debugging segmentation.
594
+ *
595
+ * Provide a logger to receive detailed information about the segmentation
596
+ * process. Useful for debugging pattern matching, page tracking, and
597
+ * breakpoint processing issues.
598
+ *
599
+ * When not provided, no logging overhead is incurred (methods are not called).
600
+ *
601
+ * Verbosity levels:
602
+ * - `trace`: Per-iteration details (very verbose)
603
+ * - `debug`: Detailed operation information
604
+ * - `info`: Key progress points
605
+ * - `warn`: Potential issues
606
+ * - `error`: Critical failures
607
+ *
608
+ * @example
609
+ * // Console logger for development
610
+ * logger: {
611
+ * debug: console.debug,
612
+ * info: console.info,
613
+ * warn: console.warn,
614
+ * }
615
+ *
616
+ * @example
617
+ * // Custom logger integration
618
+ * logger: {
619
+ * debug: (msg, ...args) => winston.debug(msg, { meta: args }),
620
+ * error: (msg, ...args) => winston.error(msg, { meta: args }),
621
+ * }
622
+ */
623
+ logger?: Logger;
545
624
  };
546
625
  /**
547
626
  * Output segment produced by `segmentPages()`.
@@ -846,5 +925,80 @@ declare const getAvailableTokens: () => string[];
846
925
  */
847
926
  declare const getTokenPattern: (tokenName: string) => string | undefined;
848
927
  //#endregion
849
- export { type Breakpoint, type BreakpointRule, type ExpandResult, type Page, type PageRange, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, containsTokens, escapeRegex, expandTokens, expandTokensWithCaptures, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, normalizeLineEndings, segmentPages, stripHtmlTags, templateToRegex };
928
+ //#region src/pattern-detection.d.ts
929
+ /**
930
+ * Pattern detection utilities for recognizing template tokens in Arabic text.
931
+ * Used to auto-detect patterns from user-highlighted text in the segmentation dialog.
932
+ *
933
+ * @module pattern-detection
934
+ */
935
+ /**
936
+ * Result of detecting a token pattern in text
937
+ */
938
+ type DetectedPattern = {
939
+ /** Token name from TOKEN_PATTERNS (e.g., 'raqms', 'dash') */
940
+ token: string;
941
+ /** The matched text */
942
+ match: string;
943
+ /** Start index in the original text */
944
+ index: number;
945
+ /** End index (exclusive) */
946
+ endIndex: number;
947
+ };
948
+ /**
949
+ * Analyzes text and returns all detected token patterns with their positions.
950
+ * Patterns are detected in priority order to avoid partial matches.
951
+ *
952
+ * @param text - The text to analyze for token patterns
953
+ * @returns Array of detected patterns sorted by position
954
+ *
955
+ * @example
956
+ * detectTokenPatterns("٣٤ - حدثنا")
957
+ * // Returns: [
958
+ * // { token: 'raqms', match: '٣٤', index: 0, endIndex: 2 },
959
+ * // { token: 'dash', match: '-', index: 3, endIndex: 4 },
960
+ * // { token: 'naql', match: 'حدثنا', index: 5, endIndex: 10 }
961
+ * // ]
962
+ */
963
+ declare const detectTokenPatterns: (text: string) => DetectedPattern[];
964
+ /**
965
+ * Generates a template pattern from text using detected tokens.
966
+ * Replaces matched portions with {{token}} syntax.
967
+ *
968
+ * @param text - Original text
969
+ * @param detected - Array of detected patterns from detectTokenPatterns
970
+ * @returns Template string with tokens, e.g., "{{raqms}} {{dash}} "
971
+ *
972
+ * @example
973
+ * const detected = detectTokenPatterns("٣٤ - ");
974
+ * generateTemplateFromText("٣٤ - ", detected);
975
+ * // Returns: "{{raqms}} {{dash}} "
976
+ */
977
+ declare const generateTemplateFromText: (text: string, detected: DetectedPattern[]) => string;
978
+ /**
979
+ * Determines the best pattern type for auto-generated rules based on detected patterns.
980
+ *
981
+ * @param detected - Array of detected patterns
982
+ * @returns Suggested pattern type and whether to use fuzzy matching
983
+ */
984
+ declare const suggestPatternConfig: (detected: DetectedPattern[]) => {
985
+ patternType: "lineStartsWith" | "lineStartsAfter";
986
+ fuzzy: boolean;
987
+ metaType?: string;
988
+ };
989
+ /**
990
+ * Analyzes text and generates a complete suggested rule configuration.
991
+ *
992
+ * @param text - Highlighted text from the page
993
+ * @returns Suggested rule configuration or null if no patterns detected
994
+ */
995
+ declare const analyzeTextForRule: (text: string) => {
996
+ template: string;
997
+ patternType: "lineStartsWith" | "lineStartsAfter";
998
+ fuzzy: boolean;
999
+ metaType?: string;
1000
+ detected: DetectedPattern[];
1001
+ } | null;
1002
+ //#endregion
1003
+ export { type Breakpoint, type BreakpointRule, type DetectedPattern, type ExpandResult, type Logger, type Page, type PageRange, type Segment, type SegmentationOptions, type SplitRule, TOKEN_PATTERNS, analyzeTextForRule, containsTokens, detectTokenPatterns, escapeRegex, expandTokens, expandTokensWithCaptures, generateTemplateFromText, getAvailableTokens, getTokenPattern, makeDiacriticInsensitive, normalizeLineEndings, segmentPages, stripHtmlTags, suggestPatternConfig, templateToRegex };
850
1004
  //# sourceMappingURL=index.d.mts.map
@@ -1 +1 @@
1
- {"version":3,"file":"index.d.mts","names":[],"sources":["../src/segmentation/fuzzy.ts","../src/segmentation/types.ts","../src/segmentation/segmenter.ts","../src/segmentation/textUtils.ts","../src/segmentation/tokens.ts"],"sourcesContent":[],"mappings":";;AAkEA;AA+FA;;;;;AC9IiB;AAoBG;AAsBM;AAyBC;AAiBH;;;;;;;AAoBC;AA0FzB;AAAkD;AA4GlD;;;;;AAkBA;AAqCA;AA0EY,cDhYC,WCgYqB,EAAA,CAAA,CAAA,EAAA,MAAc,EAAA,GAAA,MAAA;AAuBhD;AAgFA;;;;ACoKA;;;;;;;;ACvsBA;AAaA;;;;AC6NA;AA2CA;AAWA;AA2DA;AAqGA;AAuBA;AAqBA;AAgBA;;;;;;;;;;cJjWa;;;;AA/Fb;AA+FA;;;;;AC9IiB;AAoBG;AAsBM;AAyBC;AAiBH;;;;KApFnB,YAAA,GAuGC;EACA;EAAmB,KAAA,EAAA,MAAA;AAAA,CAAA;AA0FzB;AAAkD;AA4GlD;;;;;AAkBA;AAqCA;AA0EA;AAuBA;AAgFA;;;;ACoKA,KDtqBK,eAAA,GC6vBJ;EAvFmC;EAAiB,QAAA,EAAA,MAAA;CAAsB;;;;;ACvsB3E;AAaA;;;;AC6NA;AA2CA;AAWA;AA2DA;AAqGA;AAuBA;AAqBA;AAgBA;KHrcK,qBAAA;;;;;;;;;;;;;;;;;;;;;;;;KAyBA,sBAAA;;;;;;;;;;;;;;;;KAiBA,mBAAA;;;;;;;;;;;;;;KAeA,WAAA,GACC,eACA,kBACA,wBACA,yBACA;;;;;;;KAYD,aAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KA8EO,SAAA;;;;;;;KAYP,eAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAyCS;;;;;;;;;;;;SAaH;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KA0CC,SAAA,GAAY,cAAc,gBAAgB;;;;;;;;;;;;;KAkB1C,IAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqCA,cAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAqCE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqCF,UAAA,YAAsB;;;;;;;;;;;;;;;;;;;;;;KAuBtB,mBAAA;;;;;;;;UAQA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;gBA8CM;;;;;;;;;;;;;;;;;;;;;;;;KA0BN,OAAA;;;;;;;;;;;;;;;;;;;;;;;;;;SA6BD;;;;AAhfgB;AAiBH;;;;;;;AAoBC;AA0FzB;AAAkD;AA4GlD;;;;;AAkBA;AAqCA;AA0EA;AAuBA;AAgFA;;;;ACoKA;;;;;;;;ACvsBA;AAaA;;;;AC6NA;AA2CA;AAWA;AA2DA;AAqGA;AAuBa,cFgPA,YEzOZ,EAPkD,CAAA,KAAM,EFgPrB,IEhPqB,EAAA,EAAA,OAAA,EFgPJ,mBEhPI,EAAA,GFgPkB,OEhPlB,EAAA;;;;AJ3ZzD;AA+FA;;;;AC9IK,cEbQ,aFaI,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAAA;AAoBG;AAsBM;AAyBC;AAiBH;;;;;AAoBlB,cExGO,oBFwGP,EAAA,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;ADzDN;AA+FA;;;;;AC9IiB;AAoBG;AAsBM;AAyBC;AAiBH;;;;;;;AAoBC;AA0FzB;AAAkD;AA4GlD;;;;;AAkBA;AAqCA;AA0EA;AAu
BA;AAgFA;;;;ACoKA;;;;;;;;ACvsBA;AAaA;cC6Na,gBAAgB;;;AAA7B;AA2CA;AAWA;AA2DA;AAqGA;AAuBA;AAqBA;AAgBA;;;;;;cAvOa;;;;;;;KAWD,YAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;cA2DC,2FAA0F;;;;;;;;;;;;;;;;;;;;cAqG1F;;;;;;;;;;;;;;;;;;;;;;cAuBA,uCAAsC;;;;;;;;;;;;;cAqBtC;;;;;;;;;;;;;;;cAgBA"}
1
+ {"version":3,"file":"index.d.mts","names":[],"sources":["../src/segmentation/fuzzy.ts","../src/segmentation/types.ts","../src/segmentation/segmenter.ts","../src/segmentation/textUtils.ts","../src/segmentation/tokens.ts","../src/pattern-detection.ts"],"sourcesContent":[],"mappings":";;AAkEA;AA+FA;;;;;AC9IiB;AAoBG;AAsBM;AAyBC;AAiBH;;;;;;;AAoBC;AA0FzB;AAAkD;AA4GlD;;;;;AAkBA;AAqCA;AA0EY,cDhYC,WCgYqB,EAAA,CAAA,CAAA,EAAA,MAAc,EAAA,GAAA,MAAA;AA8BhD;AAiDA;;;;;AAiHA;;;;AC2KA;;;;;;;;ACvyBA;AAaA;;;;AC6NA;AA2CA;AAWA;AA2DA;AAqGA;AAuBA;AAqBA;AAgBA;;;;ACtfA;AA0Da,cL2FA,wBK3FsC,EAAA,CAAA,IAAA,EAAe,MAAA,EAAA,GAAA,MAAA;;;;ALJlE;AA+FA;;;;;AC9IiB;AAoBG;AAsBM;AAyBC;AAiBH;;;;KApFnB,YAAA,GAuGC;EACA;EAAmB,KAAA,EAAA,MAAA;AAAA,CAAA;AA0FzB;AAAkD;AA4GlD;;;;;AAkBA;AAqCA;AA0EA;AA8BA;AAiDA;;;;KA1eK,eAAA,GA0kBc;EAiBP;;;;AC2KZ;;;;;;;;ACvyBA;AAaA;;;;AC6NA;AA2CA;AAWA;AA2DA,KHpSK,qBAAA,GGoSQ;EAqGA;EAuBA,cAAA,EAAA,MAOZ,EAAA;AAcD,CAAA;AAgBA;;;;ACtfA;AA0DA;AA4DA;AAuBA;AAiCA;;;;;;;;;;;;KJpGK,sBAAA;;;;;;;;;;;;;;;;KAiBA,mBAAA;;;;;;;;;;;;;;KAeA,WAAA,GACC,eACA,kBACA,wBACA,yBACA;;;;;;;KAYD,aAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KA8EO,SAAA;;;;;;;KAYP,eAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAyCS;;;;;;;;;;;;SAaH;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KA0CC,SAAA,GAAY,cAAc,gBAAgB;;;;;;;;;;;;;KAkB1C,IAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqCA,cAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAqCE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqCF,UAAA,YAAsB;;;;;;;;;;;;;;;;;;;;;;;;;UA8BjB,MAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAiDL,mBAAA;;;;;;;;UAQA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;gBA8CM;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;WA0CL;;;;;;;;;;;;;;;;KAiBD,OAAA;;;;;;;;;;;;;;;;;;;;;;;;;;SA6BD;;;;AAzkBgB;AAiBH;;;;;;;AAoBC;AA0FzB;AAAkD;AA4GlD;;;;;AAkBA;AAqCA;AA0EA;AA8BA;AAiDA;;;;;AAiHA;;;;AC2KA;;;;;;;;ACvyBA;AAaA;;;;AC6Na,cF6jBA,YE7jBgB,EAAA,CAAA,KAAM,EF6jBC,IE7jBD,EAAA,EAAA,OAAA,EF6jBkB,mBE7jBlB,EAAA,GF6jBwC,OE7jBxC,EAAA;;;;AJ9KnC;AA+FA;;;;AC9IK,cEbQ,aFaI,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,
MAAA;AAAA;AAoBG;AAsBM;AAyBC;AAiBH;;;;;AAoBlB,cExGO,oBFwGP,EAAA,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;ADzDN;AA+FA;;;;;AC9IiB;AAoBG;AAsBM;AAyBC;AAiBH;;;;;;;AAoBC;AA0FzB;AAAkD;AA4GlD;;;;;AAkBA;AAqCA;AA0EA;AA8BA;AAiDA;;;;;AAiHA;;;;AC2KA;;;;;cE7jBa,gBAAgB;;;AD1O7B;AAaA;;;;AC6NA;AA2CA;AAWA;AA2DA;AAqGA;AAuBA;AAqBA;AAgBA;cAvOa;;;AC/Qb;AA0DA;AA4DA;AAuBA;AAiCa,KD4GD,YAAA,GCzFX;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;cDoJY,2FAA0F;;;;;;;;;;;;;;;;;;;;cAqG1F;;;;;;;;;;;;;;;;;;;;;;cAuBA,uCAAsC;;;;;;;;;;;;;cAqBtC;;;;;;;;;;;;;;;cAgBA;;;;AJhcb;AA+FA;;;;;AC9IiB;AAoBG;AA+Cf,KI1EO,eAAA,GJ0Ee;EAiBtB;EAeA,KAAA,EAAA,MAAA;EACC;EACA,KAAA,EAAA,MAAA;EACA;EACA,KAAA,EAAA,MAAA;EACA;EAAmB,QAAA,EAAA,MAAA;AAAA,CAAA;AA0FzB;AAAkD;AA4GlD;;;;;AAkBA;AAqCA;AA0EA;AA8BA;AAiDA;;;;AAgGmB,cI3iBN,mBJ2iBM,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GI3iBgC,eJ2iBhC,EAAA;AAiBnB;;;;AC2KA;;;;;;;;ACvyBA;AAaa,cE+GA,wBF/G2E,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,QAAA,EE+GvB,eF/GuB,EAAA,EAAA,GAAA,MAAA;;;;AC6NxF;AA2CA;AAWA;AA2Da,cCxMA,oBDwRZ,EAhFsG,CAAA,QAAA,ECvMzF,eDuRb,EAAA,EAAA,GAAA;EAqBY,WAAA,EAAA,gBAAiF,GAAA,iBAAA;EAuBjF,KAAA,EAAA,OAAA;EAqBA,QAAA,CAAA,EAAA,MAAA;AAgBb,CAAA;;;;ACtfA;AA0DA;AA4DA;AAuBa,cAiCA,kBAhCC,EAAA,CAAA,IAAA,EAAA,MAAe,EAAA,GAAA;EAgChB,QAAA,EAAA,MAAA;;;;YAOC"}