flappa-doormal 2.0.0 → 2.2.0

package/AGENTS.md ADDED
@@ -0,0 +1,357 @@
1
+ # AGENTS.md
2
+
3
+ ## Project Overview
4
+
5
+ **flappa-doormal** is a declarative text segmentation library for Arabic texts. It splits pages of content into logical segments (hadiths, chapters, verses) using pattern-based rules with a human-readable template syntax.
6
+
7
+ ### Why This Library Exists
8
+
9
+ Traditional Arabic text segmentation requires:
10
+ - Hand-crafted regular expressions with complex Unicode patterns
11
+ - Deep knowledge of Arabic diacritics and character equivalences
12
+ - Handling page boundaries and multi-page content spans
13
+ - Manual capture group management for metadata extraction
14
+
15
+ **flappa-doormal** solves these problems with:
16
+ - **Declarative rules**: Describe *what* to match, not *how* to match it
17
+ - **Template tokens**: `{{raqms}} {{dash}}` instead of `[\u0660-\u0669]+\s*[-–—ـ]`
18
+ - **Named captures**: `{{raqms:hadithNum}}` → automatic `meta.hadithNum` extraction
19
+ - **Fuzzy matching**: Diacritic-insensitive matching for harakat variations
20
+ - **Page-aware**: Tracks which page each segment comes from
21
+
22
+ ## Architecture
23
+
24
+ ### Repository Structure
25
+
26
+ ```text
27
+ src/
28
+ ├── index.ts # Main entry point and exports
29
+ ├── pattern-detection.ts # Token detection for auto-generating rules (NEW)
30
+ ├── pattern-detection.test.ts # Pattern detection tests (22 tests)
31
+ └── segmentation/
32
+ ├── types.ts # TypeScript type definitions for rules/segments
33
+ ├── segmenter.ts # Core segmentation engine (segmentPages, applyBreakpoints)
34
+ ├── breakpoint-utils.ts # Extracted breakpoint processing utilities
35
+ ├── tokens.ts # Token definitions and expansion logic
36
+ ├── fuzzy.ts # Diacritic-insensitive matching utilities
37
+ ├── html.ts # HTML utilities (stripHtmlTags)
38
+ ├── textUtils.ts # Text processing utilities
39
+ ├── match-utils.ts # Extracted match processing utilities
40
+ ├── segmenter.test.ts # Core test suite (150+ tests including breakpoints)
41
+ ├── segmenter.bukhari.test.ts # Real-world test cases
42
+ ├── breakpoint-utils.test.ts # Breakpoint utility tests (42 tests)
43
+ ├── tokens.test.ts # Token expansion tests
44
+ ├── fuzzy.test.ts # Fuzzy matching tests
45
+ ├── textUtils.test.ts # Text utility tests
46
+ └── match-utils.test.ts # Utility function tests
47
+
48
+ test/
49
+ ├── 2576.json # Test data for book 2576 (Sahih Bukhari)
50
+ └── 2588.json # Test data for book 2588 (Al-Mughni)
51
+
52
+ docs/
53
+ ├── checkpoints/ # AI agent handoff documentation
54
+ │ └── 2025-12-09-handoff.md
55
+ └── reviews/ # Performance analysis reports
56
+ └── 2025-12-10/
57
+ ```
58
+
59
+ ### Core Components
60
+
61
+ 1. **`segmentPages(pages, options)`** - Main entry point
62
+ - Takes an array of `{id, content}` pages and split rules
63
+ - Returns an array of `{content, from, to?, meta?}` segments
64
+
65
+ 2. **`tokens.ts`** - Template system
66
+ - `TOKEN_PATTERNS` - Map of token names to regex patterns
67
+ - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
68
+ - Supports fuzzy transform for diacritic-insensitive matching
69
+
70
+ 3. **`match-utils.ts`** - Extracted utilities (for testability)
71
+ - `extractNamedCaptures()` - Get named groups from regex match
72
+ - `filterByConstraints()` - Apply min/max page filters
73
+ - `anyRuleAllowsId()` - Check if page passes rule constraints
74
+
75
+ 4. **`breakpoint-utils.ts`** - Breakpoint processing utilities (NEW)
76
+ - `normalizeBreakpoint()` - Convert string to BreakpointRule object
77
+ - `isPageExcluded()` - Check if page is in exclude list
78
+ - `isInBreakpointRange()` - Validate page against min/max/exclude constraints
79
+ - `buildExcludeSet()` - Create Set from PageRange[] for O(1) lookups
80
+ - `createSegment()` - Create segment with optional to/meta fields
81
+ - `expandBreakpoints()` - Expand patterns with pre-compiled regexes
82
+ - `findActualEndPage()` - Search backwards for ending page by content
83
+ - `findBreakPosition()` - Find break position using breakpoint patterns
84
+ - `hasExcludedPageInRange()` - Check if range contains excluded pages
85
+ - `findNextPagePosition()` - Find next page content position
86
+ - `findPatternBreakPosition()` - Find pattern match by preference
87
+
88
+ 5. **`types.ts`** - Type definitions
89
+ - `Logger` interface - Optional logging for debugging
90
+ - `SegmentationOptions` - Options with `logger` property
91
+ - Verbosity levels: `trace`, `debug`, `info`, `warn`, `error`
92
+
93
+ 6. **`fuzzy.ts`** - Arabic text normalization
94
+ - `makeDiacriticInsensitive()` - Generate regex that ignores diacritics
95
+
96
+ 7. **`pattern-detection.ts`** - Token auto-detection (NEW)
97
+ - `detectTokenPatterns()` - Detect tokens in text with positions
98
+ - `generateTemplateFromText()` - Convert text to template string
99
+ - `suggestPatternConfig()` - Suggest rule configuration
100
+ - `analyzeTextForRule()` - Complete analysis returning template + config
101
+
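A minimal sketch of the idea behind `buildExcludeSet()` (item 4 above): expand page ranges into a `Set` once, so later membership checks are O(1). The `PageRangeSketch` shape (a single page id or an inclusive `[from, to]` pair) and the function name are assumptions made for illustration, not the package's actual types.

```typescript
// Assumption: a range is either one page id or an inclusive [from, to] pair.
type PageRangeSketch = number | [number, number];

// Expand every range into individual page ids, stored in a Set.
const buildExcludeSetSketch = (ranges: PageRangeSketch[]): Set<number> => {
    const excluded = new Set<number>();
    for (const range of ranges) {
        if (typeof range === 'number') {
            excluded.add(range);
            continue;
        }
        const [from, to] = range;
        for (let page = from; page <= to; page++) {
            excluded.add(page);
        }
    }
    return excluded;
};

// Hypothetical usage: page 3 and pages 10 through 12 are excluded.
const excludedPages = buildExcludeSetSketch([3, [10, 12]]);
const isExcluded = excludedPages.has(11);
```

The trade-off is memory for speed: very wide ranges produce large sets, which is usually acceptable for book-sized page counts.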
102
+ ## Key Algorithms
103
+
104
+ ### Token Expansion
105
+
106
+ ```
107
+ Input: "{{raqms:num}} {{dash}} {{:text}}"
108
+ Output: "(?<num>[\u0660-\u0669]+) [-–—ـ] (?<text>.+)"
109
+ ```
110
+
111
+ The expansion algorithm:
112
+ 1. Splits query into token and text segments
113
+ 2. Looks up token patterns from `TOKEN_PATTERNS`
114
+ 3. Applies the fuzzy transform to the raw pattern, if enabled
115
+ 4. Wraps the result in a named capture group if a `:name` suffix is present
116
+
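The expansion steps above can be sketched roughly as follows. This is a simplified illustration, not the code in `tokens.ts`: `SKETCH_TOKENS` and `expandTemplate` are hypothetical names, the token table holds only two entries, and the fuzzy step is omitted.

```typescript
// Hypothetical token table; the package's TOKEN_PATTERNS has more entries.
const SKETCH_TOKENS = new Map<string, string>([
    ['raqms', '[\\u0660-\\u0669]+'],
    ['dash', '[-–—ـ]'],
]);

// Expand {{token}}, {{token:name}} and {{:name}} into regex source text.
const expandTemplate = (template: string): string =>
    template.replace(/\{\{(\w*)(?::(\w+))?\}\}/g, (whole: string, token: string, name?: string) => {
        if (token === '' && !name) return whole; // "{{}}" is not a token
        // {{:name}} has no token part, so it captures arbitrary text
        const pattern = token === '' ? '.+' : SKETCH_TOKENS.get(token);
        if (pattern === undefined) return whole; // unknown tokens are left as-is
        return name ? `(?<${name}>${pattern})` : pattern;
    });

const source = expandTemplate('{{raqms:num}} {{dash}} {{:text}}');
const m = new RegExp(`^${source}`, 'u').exec('٦٦٩٦ - حدثنا');
const capturedNumber = m?.groups?.num;
```

Literal text between tokens (here, the spaces) passes through unchanged, which is why the template stays readable.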
117
+ ### Fuzzy Application Order
118
+
119
+ **Critical design decision**: Fuzzy transforms are applied to raw token patterns and plain text *before* they're wrapped in regex groups.
120
+
121
+ ```
122
+ WRONG: makeDiacriticInsensitive("(?<name>حدثنا)") // Breaks ( ? < > )
123
+ RIGHT: "(?<name>" + makeDiacriticInsensitive("حدثنا") + ")"
124
+ ```
125
+
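A rough sketch of the "fuzzy first, group second" ordering. `fuzzySketch` is a simplified stand-in for `makeDiacriticInsensitive()`: it assumes harakat are the range U+064B to U+0652 and that the input contains only plain letters (no regex metacharacters). The real `fuzzy.ts` may normalize more than this.

```typescript
// Assumption: harakat are approximated by U+064B..U+0652.
const HARAKAT = '[\\u064B-\\u0652]*';

// Strip diacritics from the input, then allow optional diacritics after each letter.
const fuzzySketch = (text: string): string =>
    Array.from(text.replace(/[\u064B-\u0652]/g, ''))
        .map((ch) => ch + HARAKAT)
        .join('');

// Apply fuzzy to each alternative first, then add the group syntax around the result,
// so "(", "?", "<", ">" and "|" are never touched by the transform.
const fuzzyNamedGroup = (name: string, alternatives: string): string =>
    `(?<${name}>${alternatives.split('|').map(fuzzySketch).join('|')})`;

const narrated = new RegExp(fuzzyNamedGroup('phrase', 'حدثنا|أخبرنا'), 'u');
const hit = narrated.exec('حَدَّثَنَا عبد الله');
```

Splitting on `|` before applying the transform keeps each alternative intact, which is the same concern raised for alternation tokens such as `narrated`.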
126
+ ### lineStartsAfter Content Capture
127
+
128
+ For patterns like `^٦٦٩٦ - (content)`, the content capture is the *last* positional group:
129
+
130
+ ```typescript
131
+ // Pattern: ^(?:(?<num>[\u0660-\u0669]+) [-–—ـ] )(.*)
132
+ // Match: m[1] = named group value, m[2] = content
133
+ // Solution: Iterate backward from m.length-1 to find last defined capture
134
+ ```
135
+
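The backward iteration described above might look like this. `lastDefinedCapture` is a hypothetical helper name used only for this sketch; the pattern is the one from the comment above.

```typescript
// Return the last positional capture that actually took part in the match.
const lastDefinedCapture = (match: RegExpExecArray): string | undefined => {
    for (let i = match.length - 1; i >= 1; i--) {
        if (match[i] !== undefined) return match[i];
    }
    return undefined;
};

const linePattern = new RegExp('^(?:(?<num>[\\u0660-\\u0669]+) [-–—ـ] )(.*)', 'u');
const lineMatch = linePattern.exec('٦٦٩٦ - حدثنا أبو نعيم');
// lineMatch[1] is the named group (also available as groups.num); lineMatch[2] is the content
const content = lineMatch ? lastDefinedCapture(lineMatch) : undefined;
```

Searching from the end avoids hard-coding an index that shifts whenever a template adds or removes a named capture.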
136
+ ### Auto-Escaping Brackets in Templates
137
+
138
+ Template patterns (`lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, `template`) automatically escape `()[]` characters that appear **outside** of `{{token}}` delimiters. This allows intuitive patterns without manual escaping.
139
+
140
+ **Processing order:**
141
+ 1. `escapeTemplateBrackets()` escapes `()[]` outside `{{...}}`
142
+ 2. `expandTokensWithCaptures()` expands tokens to regex patterns
143
+ 3. Fuzzy transform applied (if enabled)
144
+
145
+ ```
146
+ Input: "({{harf}}): "
147
+ Step 1: "\({{harf}}\): " (brackets escaped)
148
+ Step 2: "\([أ-ي]\): " (token expanded - its [] preserved)
149
+ ```
150
+
151
+ **Implementation in `tokens.ts`:**
152
+ ```typescript
153
+ export const escapeTemplateBrackets = (pattern: string): string => {
154
+ return pattern.replace(/(\{\{[^}]*\}\})|([()[\]])/g, (match, token, bracket) => {
155
+ if (token) return token; // Preserve {{tokens}}
156
+ return `\\${bracket}`; // Escape brackets
157
+ });
158
+ };
159
+ ```
160
+
161
+ **Where escaping is applied:**
162
+ - `processPattern()` - handles `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`
163
+ - Direct `template` processing in `buildRuleRegex()`
164
+ - **NOT** applied to `regex` patterns (user has full control)
165
+
166
+ ### Breakpoints Post-Processing Algorithm
167
+
168
+ The `breakpoints` option provides a post-processing mechanism for limiting segment size. Unlike the removed `maxSpan` option (which applied per rule), `breakpoints` runs after all structural rules have been applied.
169
+
170
+ **API Options:**
171
+ ```typescript
172
+ interface SegmentationOptions {
173
+ rules: SplitRule[];
174
+ maxPages?: number; // Maximum pages a segment can span
175
+ breakpoints?: string[]; // Ordered array of regex patterns (supports token expansion)
176
+ prefer?: 'longer' | 'shorter'; // Select last or first match within window
177
+ }
178
+ ```
179
+
180
+ **How it works:**
181
+ 1. Structural rules run first, creating initial segments
182
+ 2. Breakpoints then processes any segment exceeding `maxPages`
183
+ 3. Patterns are tried in order until one matches
184
+ 4. Empty string `''` means "fall back to page boundary"
185
+
186
+ **Example:**
187
+ ```typescript
188
+ segmentPages(pages, {
189
+ rules: [
190
+ { lineStartsWith: ['{{basmalah}}'], split: 'at' },
191
+ { lineStartsWith: ['{{bab}}'], split: 'at', meta: { type: 'chapter' } },
192
+ ],
193
+ maxPages: 2,
194
+ breakpoints: ['{{tarqim}}\\s*', '\\n', ''], // Try: punctuation → newline → page boundary
195
+ prefer: 'longer', // Greedy: make segments as large as possible
196
+ });
197
+ ```
198
+
199
+ **Key behaviors:**
200
+ - **Pattern order matters**: First matching pattern wins
201
+ - **`prefer: 'longer'`**: Finds LAST match in window (greedy)
202
+ - **`prefer: 'shorter'`**: Finds FIRST match (conservative)
203
+ - **Recursive**: If split result still exceeds `maxPages`, breakpoints runs again
204
+
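The ordering and `prefer` behaviors above can be illustrated with a small sketch. The function names are hypothetical, and treating the break position as the index just past the chosen match is an assumption of this sketch, not something the package documents here.

```typescript
type Prefer = 'longer' | 'shorter';

// Return the index just past the chosen match, or -1 when the pattern never matches.
const findBreakInWindow = (windowText: string, pattern: string, prefer: Prefer): number => {
    const re = new RegExp(pattern, 'g');
    let chosen = -1;
    let match: RegExpExecArray | null;
    while ((match = re.exec(windowText)) !== null) {
        if (match[0].length === 0) {
            re.lastIndex++; // skip empty matches to avoid an infinite loop
            continue;
        }
        chosen = match.index + match[0].length;
        if (prefer === 'shorter') break; // first match wins
    }
    return chosen; // for 'longer' this is the last match in the window
};

// Try patterns in order; '' means "fall back to the page boundary".
const pickBreak = (windowText: string, patterns: string[], prefer: Prefer, pageBoundary: number): number => {
    for (const pattern of patterns) {
        if (pattern === '') return pageBoundary;
        const position = findBreakInWindow(windowText, pattern, prefer);
        if (position !== -1) return position;
    }
    return -1;
};

const sample = 'one. two. three\nfour';
const longerBreak = pickBreak(sample, ['\\.\\s*', '\\n', ''], 'longer', sample.length);
const shorterBreak = pickBreak(sample, ['\\.\\s*', '\\n', ''], 'shorter', sample.length);
```

Because the first pattern that matches anywhere in the window wins, a newline is only considered when no punctuation match exists, regardless of `prefer`.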
205
+ > **Note**: The old `maxSpan` and `fallback` properties on `SplitRule` have been removed; use `breakpoints` instead.
206
+
207
+ ## Design Decisions
208
+
209
+ ### 1. Why `{{double-braces}}`?
210
+
211
+ - Single braces `{}` conflict with regex quantifiers `{n,m}`
212
+ - Double braces are visually distinct and rarely appear in content
213
+ - Consistent with template systems (Handlebars, Mustache)
214
+
215
+ ### 2. Why `lineStartsAfter` vs `lineStartsWith`?
216
+
217
+ | Pattern | Marker in content? | Use case |
218
+ |---------|-------------------|----------|
219
+ | `lineStartsWith` | ✅ Yes | Keep marker, segment at boundary |
220
+ | `lineStartsAfter` | ❌ No | Strip marker, capture only content |
221
+
222
+ ### 3. Why fuzzy transform at token level?
223
+
224
+ Applying fuzzy globally would corrupt regex metacharacters. Instead:
225
+ - Fuzzy is passed to `expandTokensWithCaptures()`
226
+ - Applied only to Arabic text portions
227
+ - Preserves `(`, `)`, `|`, `?`, etc.
228
+
229
+ ### 4. Why extract match utilities?
230
+
231
+ The original `segmentPages` had complexity 37 (max: 15). Extraction:
232
+ - Creates independently testable units
233
+ - Reduces main function complexity
234
+ - Improves code readability
235
+
236
+ ## Working with the Codebase
237
+
238
+ ### Adding a New Token
239
+
240
+ 1. Add to `TOKEN_PATTERNS` in `tokens.ts`:
241
+ ```typescript
242
+ export const TOKEN_PATTERNS = {
243
+ // ...existing
244
+ verse: '﴿[^﴾]+﴾', // Quranic verse markers
245
+ };
246
+ ```
247
+ 2. Add test cases in `segmenter.test.ts`
248
+ 3. Document in README.md
249
+
250
+ ### Adding a New Pattern Type
251
+
252
+ 1. Add type to union in `types.ts`:
253
+ ```typescript
254
+ type NewPattern = { newPatternField: string[] };
255
+ type PatternType = ... | NewPattern;
256
+ ```
257
+ 2. Handle in `buildRuleRegex()` in `segmenter.ts`
258
+ 3. Add comprehensive tests
259
+
260
+ ### Testing Strategy
261
+
262
+ - **Unit tests**: Each utility function has dedicated tests
263
+ - **Integration tests**: Full pipeline tests in `segmenter.test.ts`
264
+ - **Real-world tests**: `segmenter.bukhari.test.ts` uses actual hadith data
265
+ - Run: `bun test`
266
+
267
+ ## Code Quality Standards
268
+
269
+ 1. **TypeScript strict mode** - No `any` types
270
+ 2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
271
+ 3. **JSDoc comments** - All exported functions documented
272
+ 4. **Test coverage** - 251 tests across 8 files
273
+
274
+ ## Dependencies
275
+
276
+ ### Development
277
+ - **@biomejs/biome** - Linting and formatting
278
+ - **tsdown** - Build tool (generates `.mjs` and `.d.mts`)
279
+ - **Bun** - Runtime and test runner
280
+
281
+ ## Build & Release
282
+
283
+ ```bash
284
+ # Install dependencies
285
+ bun install
286
+
287
+ # Run all tests
288
+ bun test
289
+
290
+ # Build distribution
291
+ bun run build
292
+ # Output: dist/index.mjs (~17 KB, ~5.7 KB gzipped)
293
+
294
+ # Run performance test (generates 50K pages, measures segmentation speed/memory)
295
+ bun run perf
296
+
297
+ # Format code
298
+ bunx biome format --write .
299
+
300
+ # Lint code
301
+ bunx biome lint .
302
+ ```
303
+
304
+ ## Lessons Learned
305
+
306
+ ### From Development
307
+
308
+ 1. **Named captures shift positional indices**: When `(?<name>…)` appears before `(.*)`, the content is at `m[2]` not `m[1]`. Solution: iterate backward to find last defined group.
309
+
310
+ 2. **Fuzzy + metacharacters don't mix**: `makeDiacriticInsensitive` expands Arabic to character classes. If applied to `(?<name>text)`, it corrupts the `(`, `?`, `<`, `>` characters.
311
+
312
+ 3. **Alternations need per-alternative fuzzy**: Token `narrated: 'حدثنا|أخبرنا'` requires splitting at `|`, applying fuzzy to each, then rejoining.
313
+
314
+ 4. **Complexity extraction works**: Pulling logic into `match-utils.ts` reduced main function complexity from 37 to 10 and made the code testable.
315
+
316
+ 5. **Rule order matters for specificity**: When multiple rules can match the same position, put specific patterns BEFORE generic ones. Example: `## {{raqms:num}} {{dash}}` must come before `##` to capture the number.
317
+
318
+ 6. **Post-processing beats per-rule limits**: The `maxSpan` approach (per-rule page limits) caused premature splits. Moving to post-processing `breakpoints` preserves structural integrity while still limiting segment size.
319
+
320
+ 7. **Window padding matters**: When calculating approximate content windows, 50% padding is needed (not 20%) to ensure enough content is captured for `prefer: 'longer'` scenarios.
321
+
322
+ 8. **Escaping in tests requires care**: TypeScript string `'\\.'` creates regex `\.`, but regex literal `/\./` is already escaped. Double-backslash in strings, single in literals.
323
+
324
+ 9. **Auto-escaping improves DX significantly**: Users expect `(أ):` to match literal parentheses. Auto-escaping `()[]` in template patterns (but not `regex`) gives intuitive behavior while preserving a power-user escape hatch.
325
+
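Lesson 8 above can be checked directly. The snippet below uses only standard `RegExp` behavior; the variable names are illustrative.

```typescript
// In a string, the backslash must itself be escaped to reach the regex engine.
const fromString = new RegExp('\\.');
// In a regex literal, a single backslash is enough.
const fromLiteral = /\./;
// A single backslash in a string is consumed by the string parser, leaving just "."
const tooLoose = new RegExp('\.');

const sameSource = fromString.source === fromLiteral.source;
const looseMatchesAnything = tooLoose.test('ab');
```

Here `tooLoose` matches any character, which is the kind of silent mistake the lesson warns about.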
326
+ ### Architecture Insights
327
+
328
+ - **Declarative > Imperative**: Users describe patterns, library handles regex
329
+ - **Composability**: Tokens can be combined freely with `:name` captures
330
+ - **Fail gracefully**: Unknown tokens are left as-is, allowing partial templates
331
+ - **Post-process > Inline**: Breakpoints runs after rules, avoiding conflicts
332
+ - **Dependency injection for testability**: `breakpoint-utils.ts` accepts a `PatternProcessor` function instead of importing `processPattern` directly, enabling independent testing without mocking
333
+ - **Optional logging**: Use optional chaining (`logger?.debug?.()`) for zero-overhead when no logger is provided. All log methods are optional, allowing clients to subscribe to only the verbosity levels they need.
334
+
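A short sketch of the optional-logging pattern from the last bullet. `LoggerSketch` and `countNonEmpty` are hypothetical; the method names mirror the verbosity levels listed for `types.ts`, but the actual `Logger` signature in the package may differ.

```typescript
type LogFn = (message: string, ...args: unknown[]) => void;

// Assumed shape: every method is optional, one per verbosity level.
interface LoggerSketch {
    trace?: LogFn;
    debug?: LogFn;
    info?: LogFn;
    warn?: LogFn;
    error?: LogFn;
}

// Count non-empty pages, logging only through methods the caller supplied.
const countNonEmpty = (pages: string[], logger?: LoggerSketch): number => {
    logger?.debug?.('counting pages', pages.length);
    let count = 0;
    for (const page of pages) {
        if (page.trim() === '') {
            logger?.warn?.('skipping empty page');
            continue;
        }
        count++;
    }
    logger?.info?.('done', count);
    return count;
};

// This caller subscribes to warnings only; debug and info calls are skipped.
const warnings: string[] = [];
const total = countNonEmpty(['نص', '  ', 'نص آخر'], { warn: (message) => warnings.push(message) });
```

When `logger` or a given method is undefined, the optional call short-circuits and its arguments are never evaluated.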
335
+ ---
336
+
337
+ ## Token Reference
338
+
339
+ | Token | Pattern Description | Example Match |
340
+ |-------|---------------------|---------------|
341
+ | `{{tarqim}}` | Arabic punctuation (. , ; ? ! ( ) etc.) | `؛` `،` `.` |
342
+ | `{{basmalah}}` | "بِسْمِ اللَّهِ" patterns | بِسْمِ اللَّهِ الرَّحْمَنِ |
343
+ | `{{bab}}` | "باب" (chapter) | بَابُ الإيمان |
344
+ | `{{fasl}}` | "فصل" (section) | فصل: في الطهارة |
345
+ | `{{kitab}}` | "كتاب" (book) | كتاب الصلاة |
346
+ | `{{raqm}}` | Single Arabic-Indic numeral | ٥ |
347
+ | `{{raqms}}` | Multiple Arabic-Indic numerals | ٧٥٦٣ |
348
+ | `{{raqms:num}}` | Numerals with named capture | `meta.num = "٧٥٦٣"` |
349
+ | `{{dash}}` | Various dash characters | - – — ـ |
350
+ | `{{numbered}}` | Composite: `{{raqms}} {{dash}}` | ٧٥٦٣ - |
351
+
352
+ **Named captures**: Add `:name` suffix to capture into `meta`:
353
+ ```typescript
354
+ '{{raqms:hadithNum}} {{dash}}'
355
+ // → segment.meta.hadithNum = "٧٥٦٣"
356
+ ```
357
+
package/README.md CHANGED
@@ -143,7 +143,31 @@ const rules = [{
143
143
  | `template` | Depends | Custom pattern with full control |
144
144
  | `regex` | Depends | Raw regex for complex cases |
145
145
 
146
- ### 5. Page Constraints
146
+ ### 5. Auto-Escaping Brackets
147
+
148
+ In `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` patterns, parentheses `()` and square brackets `[]` are **automatically escaped**. This means you can write intuitive patterns without manual escaping:
149
+
150
+ ```typescript
151
+ // Write this (clean and readable):
152
+ { lineStartsAfter: ['({{harf}}): '], split: 'at' }
153
+
154
+ // Instead of this (verbose escaping):
155
+ { lineStartsAfter: ['\\({{harf}}\\): '], split: 'at' }
156
+ ```
157
+
158
+ **Important**: Brackets inside `{{tokens}}` are NOT escaped, so token patterns such as `{{harf}}`, which expands to `[أ-ي]`, still work correctly.
159
+
160
+ For full regex control (character classes, capturing groups), use the `regex` pattern type which does NOT auto-escape:
161
+
162
+ ```typescript
163
+ // Character class [أب] matches أ or ب
164
+ { regex: '^[أب] ', split: 'at' }
165
+
166
+ // Capturing group (test|text) matches either
167
+ { regex: '^(test|text) ', split: 'at' }
168
+ ```
169
+
170
+ ### 6. Page Constraints
147
171
 
148
172
  Limit rules to specific page ranges:
149
173
 
@@ -156,7 +180,7 @@ Limit rules to specific page ranges:
156
180
  }
157
181
  ```
158
182
 
159
- ### 6. Occurrence Filtering
183
+ ### 7. Occurrence Filtering
160
184
 
161
185
  Control which matches to use:
162
186
 
@@ -395,6 +419,83 @@ console.log(TOKEN_PATTERNS.narrated);
395
419
  // 'حدثنا|أخبرنا|حدثني|وحدثنا|أنبأنا|سمعت'
396
420
  ```
397
421
 
422
+ ### Pattern Detection Utilities
423
+
424
+ These functions auto-detect tokens in text, which is useful for building UI tools that suggest rule configurations from user-highlighted text.
425
+
426
+ #### `detectTokenPatterns(text)`
427
+
428
+ Analyzes text and returns all detected token patterns with their positions.
429
+
430
+ ```typescript
431
+ import { detectTokenPatterns } from 'flappa-doormal';
432
+
433
+ const detected = detectTokenPatterns("٣٤ - حدثنا");
434
+ // Returns:
435
+ // [
436
+ // { token: 'raqms', match: '٣٤', index: 0, endIndex: 2 },
437
+ // { token: 'dash', match: '-', index: 3, endIndex: 4 },
438
+ // { token: 'naql', match: 'حدثنا', index: 5, endIndex: 10 }
439
+ // ]
440
+ ```
441
+
442
+ #### `generateTemplateFromText(text, detected)`
443
+
444
+ Converts text to a template string using detected patterns.
445
+
446
+ ```typescript
447
+ import { detectTokenPatterns, generateTemplateFromText } from 'flappa-doormal';
448
+
449
+ const text = "٣٤ - ";
450
+ const detected = detectTokenPatterns(text);
451
+ const template = generateTemplateFromText(text, detected);
452
+ // Returns: "{{raqms}} {{dash}} "
453
+ ```
454
+
455
+ #### `suggestPatternConfig(detected)`
456
+
457
+ Suggests the best pattern type and options based on detected patterns.
458
+
459
+ ```typescript
460
+ import { detectTokenPatterns, suggestPatternConfig } from 'flappa-doormal';
461
+
462
+ // For numbered patterns (hadith-style)
463
+ const hadithDetected = detectTokenPatterns("٣٤ - ");
464
+ suggestPatternConfig(hadithDetected);
465
+ // Returns: { patternType: 'lineStartsAfter', fuzzy: false, metaType: 'hadith' }
466
+
467
+ // For structural patterns (chapter markers)
468
+ const chapterDetected = detectTokenPatterns("باب الصلاة");
469
+ suggestPatternConfig(chapterDetected);
470
+ // Returns: { patternType: 'lineStartsWith', fuzzy: true, metaType: 'bab' }
471
+ ```
472
+
473
+ #### `analyzeTextForRule(text)`
474
+
475
+ Complete analysis that combines detection, template generation, and config suggestion.
476
+
477
+ ```typescript
478
+ import { analyzeTextForRule } from 'flappa-doormal';
479
+
480
+ const result = analyzeTextForRule("٣٤ - حدثنا");
481
+ // Returns:
482
+ // {
483
+ // template: "{{raqms}} {{dash}} {{naql}}",
484
+ // patternType: 'lineStartsAfter',
485
+ // fuzzy: false,
486
+ // metaType: 'hadith',
487
+ // detected: [...]
488
+ // }
489
+
490
+ // Use the result to build a rule:
491
+ const rule = {
492
+ [result.patternType]: [result.template],
493
+ split: 'at',
494
+ fuzzy: result.fuzzy,
495
+ meta: { type: result.metaType }
496
+ };
497
+ ```
498
+
398
499
  ## Types
399
500
 
400
501
  ### `SplitRule`
@@ -433,6 +534,19 @@ type Segment = {
433
534
  };
434
535
  ```
435
536
 
537
+ ### `DetectedPattern`
538
+
539
+ Result from pattern detection utilities.
540
+
541
+ ```typescript
542
+ type DetectedPattern = {
543
+ token: string; // Token name (e.g., 'raqms', 'dash')
544
+ match: string; // The matched text
545
+ index: number; // Start index in original text
546
+ endIndex: number; // End index (exclusive)
547
+ };
548
+ ```
549
+
436
550
  ## Usage with Next.js / Node.js
437
551
 
438
552
  ```typescript