flappa-doormal 1.0.0 → 2.1.0

package/AGENTS.md ADDED
@@ -0,0 +1,325 @@
1
+ # AGENTS.md
2
+
3
+ ## Project Overview
4
+
5
+ **flappa-doormal** is a declarative text segmentation library for Arabic texts. It splits pages of content into logical segments (hadiths, chapters, verses) using pattern-based rules with a human-readable template syntax.
6
+
7
+ ### Why This Library Exists
8
+
9
+ Traditional Arabic text segmentation requires:
10
+ - Hand-crafted regular expressions with complex Unicode patterns
11
+ - Deep knowledge of Arabic diacritics and character equivalences
12
+ - Handling page boundaries and multi-page content spans
13
+ - Manual capture group management for metadata extraction
14
+
15
+ **flappa-doormal** solves these problems with:
16
+ - **Declarative rules**: Describe *what* to match, not *how* to match it
17
+ - **Template tokens**: `{{raqms}} {{dash}}` instead of `[\u0660-\u0669]+\s*[-–—ـ]`
18
+ - **Named captures**: `{{raqms:hadithNum}}` → automatic `meta.hadithNum` extraction
19
+ - **Fuzzy matching**: Diacritic-insensitive matching for harakat variations
20
+ - **Page-aware**: Tracks which page each segment comes from
21
+
22
+ ## Architecture
23
+
24
+ ### Repository Structure
25
+
26
+ ```text
27
+ src/
28
+ ├── index.ts # Main entry point and exports
29
+ ├── pattern-detection.ts # Token detection for auto-generating rules (NEW)
30
+ ├── pattern-detection.test.ts # Pattern detection tests (22 tests)
31
+ └── segmentation/
32
+ ├── types.ts # TypeScript type definitions for rules/segments
33
+ ├── segmenter.ts # Core segmentation engine (segmentPages, applyBreakpoints)
34
+ ├── breakpoint-utils.ts # Extracted breakpoint processing utilities
35
+ ├── tokens.ts # Token definitions and expansion logic
36
+ ├── fuzzy.ts # Diacritic-insensitive matching utilities
37
+ ├── html.ts # HTML utilities (stripHtmlTags)
38
+ ├── textUtils.ts # Text processing utilities
39
+ ├── match-utils.ts # Extracted match processing utilities
40
+ ├── segmenter.test.ts # Core test suite (150+ tests including breakpoints)
41
+ ├── segmenter.bukhari.test.ts # Real-world test cases
42
+ ├── breakpoint-utils.test.ts # Breakpoint utility tests (42 tests)
43
+ ├── tokens.test.ts # Token expansion tests
44
+ ├── fuzzy.test.ts # Fuzzy matching tests
45
+ ├── textUtils.test.ts # Text utility tests
46
+ └── match-utils.test.ts # Utility function tests
47
+
48
+ test/
49
+ ├── 2576.json # Test data for book 2576 (Sahih Bukhari)
50
+ └── 2588.json # Test data for book 2588 (Al-Mughni)
51
+
52
+ docs/
53
+ ├── checkpoints/ # AI agent handoff documentation
54
+ │ └── 2025-12-09-handoff.md
55
+ └── reviews/ # Performance analysis reports
56
+ └── 2025-12-10/
57
+ ```
58
+
59
+ ### Core Components
60
+
61
+ 1. **`segmentPages(pages, options)`** - Main entry point
62
+ - Takes an array of `{id, content}` pages and an options object containing the split rules
63
+ - Returns an array of `{content, from, to?, meta?}` segments
64
+
65
+ 2. **`tokens.ts`** - Template system
66
+ - `TOKEN_PATTERNS` - Map of token names to regex patterns
67
+ - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
68
+ - Supports fuzzy transform for diacritic-insensitive matching
69
+
70
+ 3. **`match-utils.ts`** - Extracted utilities (for testability)
71
+ - `extractNamedCaptures()` - Get named groups from regex match
72
+ - `filterByConstraints()` - Apply min/max page filters
73
+ - `anyRuleAllowsId()` - Check if page passes rule constraints
74
+
75
+ 4. **`breakpoint-utils.ts`** - Breakpoint processing utilities (NEW)
76
+ - `normalizeBreakpoint()` - Convert string to BreakpointRule object
77
+ - `isPageExcluded()` - Check if page is in exclude list
78
+ - `isInBreakpointRange()` - Validate page against min/max/exclude constraints
79
+ - `buildExcludeSet()` - Create Set from PageRange[] for O(1) lookups
80
+ - `createSegment()` - Create segment with optional to/meta fields
81
+ - `expandBreakpoints()` - Expand patterns with pre-compiled regexes
82
+ - `findActualEndPage()` - Search backwards for ending page by content
83
+ - `findBreakPosition()` - Find break position using breakpoint patterns
84
+ - `hasExcludedPageInRange()` - Check if range contains excluded pages
85
+ - `findNextPagePosition()` - Find next page content position
86
+ - `findPatternBreakPosition()` - Find pattern match by preference
87
+
88
+ 5. **`types.ts`** - Type definitions
89
+ - `Logger` interface - Optional logging for debugging
90
+ - `SegmentationOptions` - Options with `logger` property
91
+ - Verbosity levels: `trace`, `debug`, `info`, `warn`, `error`
92
+
93
+ 6. **`fuzzy.ts`** - Arabic text normalization
94
+ - `makeDiacriticInsensitive()` - Generate regex that ignores diacritics
95
+
96
+ 7. **`pattern-detection.ts`** - Token auto-detection (NEW)
97
+ - `detectTokenPatterns()` - Detect tokens in text with positions
98
+ - `generateTemplateFromText()` - Convert text to template string
99
+ - `suggestPatternConfig()` - Suggest rule configuration
100
+ - `analyzeTextForRule()` - Complete analysis returning template + config
101
+
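The exclude-set idea behind `buildExcludeSet()` and `isInBreakpointRange()` can be illustrated with a small sketch. Everything here is an assumption made for illustration: the shape of `PageRange` is not shown in this document, so the sketch invents a `PageRangeSketch` type (a single page or an inclusive `[from, to]` pair), and both function names are hypothetical stand-ins rather than the library's exports.

```typescript
// Assumed shape: a single page id or an inclusive [from, to] pair.
// The library's real PageRange type may differ.
type PageRangeSketch = number | [number, number];

// Flatten the ranges into a Set so membership checks are O(1).
function buildExcludeSetSketch(ranges: PageRangeSketch[]): Set<number> {
    const excluded = new Set<number>();
    for (const range of ranges) {
        if (typeof range === 'number') {
            excluded.add(range);
            continue;
        }
        const [from, to] = range;
        for (let page = from; page <= to; page++) {
            excluded.add(page);
        }
    }
    return excluded;
}

// A page passes when it lies within the optional min/max bounds
// and is not in the exclude set.
function isInRangeSketch(
    pageId: number,
    min: number | undefined,
    max: number | undefined,
    excluded: Set<number>,
): boolean {
    if (min !== undefined && pageId < min) return false;
    if (max !== undefined && pageId > max) return false;
    return !excluded.has(pageId);
}
```

Building the set once and reusing it across pages is what makes the per-page check cheap; expanding very large ranges this way trades memory for lookup speed.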
102
+ ## Key Algorithms
103
+
104
+ ### Token Expansion
105
+
106
+ ```
107
+ Input: "{{raqms:num}} {{dash}} {{:text}}"
108
+ Output: "(?<num>[\u0660-\u0669]+) [-–—ـ] (?<text>.+)"
109
+ ```
110
+
111
+ The expansion algorithm:
112
+ 1. Splits query into token and text segments
113
+ 2. Looks up token patterns from `TOKEN_PATTERNS`
114
+ 3. Applies the fuzzy transform, if enabled, to the raw pattern (before any group wrapping)
114
+ 4. Wraps the result in a named capture group if a `:name` suffix is present
116
+
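The expansion steps above can be sketched in a few lines. This is a simplified illustration, not the library's `expandTokensWithCaptures()`: the `SKETCH_TOKENS` map and `expandTemplateSketch` function are hypothetical names, only two tokens are included, and the fuzzy step is left out.

```typescript
// Hypothetical two-token subset; the real TOKEN_PATTERNS map is larger.
const SKETCH_TOKENS = new Map<string, string>([
    ['raqms', '[\\u0660-\\u0669]+'],
    ['dash', '[-–—ـ]'],
]);

// Expand {{token}}, {{token:name}} and {{:name}} into regex source text.
function expandTemplateSketch(template: string): string {
    return template.replace(
        /\{\{(\w*)(?::(\w+))?\}\}/g,
        (whole: string, token: string, name: string | undefined) => {
            // {{:name}} has no token: treat it as "capture the rest" (.+).
            const body = token === '' ? (name ? '.+' : undefined) : SKETCH_TOKENS.get(token);
            if (body === undefined) {
                return whole; // unknown tokens are left as-is
            }
            return name ? `(?<${name}>${body})` : body;
        },
    );
}

const expandedSketch = expandTemplateSketch('{{raqms:num}} {{dash}} {{:text}}');
// → "(?<num>[\u0660-\u0669]+) [-–—ـ] (?<text>.+)"
```

Plain text between tokens passes through untouched, which is why the spaces in the template survive into the output.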
117
+ ### Fuzzy Application Order
118
+
119
+ **Critical design decision**: Fuzzy transforms are applied to raw token patterns and plain text *before* they're wrapped in regex groups.
120
+
121
+ ```
122
+ WRONG: makeDiacriticInsensitive("(?<name>حدثنا)") // Breaks ( ? < > )
123
+ RIGHT: "(?<name>" + makeDiacriticInsensitive("حدثنا") + ")"
124
+ ```
125
+
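A minimal sketch of this ordering follows. It is not the library's `makeDiacriticInsensitive()`; the function names are hypothetical, and it assumes that "diacritics" means the harakat range U+064B to U+0652 and that alternations can be split naively at `|` (no escaped pipes or nested groups).

```typescript
// Optional run of harakat (fathatan through sukun) after each letter.
const HARAKAT_CLASS = '[\\u064B-\\u0652]*';

// Make plain text match regardless of which harakat are present.
function fuzzySketch(text: string): string {
    const bare = text.replace(/[\u064B-\u0652]/g, ''); // drop existing harakat
    return Array.from(bare)
        .map((ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&') + HARAKAT_CLASS)
        .join('');
}

// Apply the transform to each alternative separately, then rejoin.
function fuzzyAlternationSketch(pattern: string): string {
    return pattern.split('|').map(fuzzySketch).join('|');
}

// Wrap in the named group only AFTER the fuzzy transform has run,
// so the group syntax itself is never transformed.
function namedFuzzyGroupSketch(name: string, pattern: string): string {
    return `(?<${name}>${fuzzyAlternationSketch(pattern)})`;
}

const narratedSource = namedFuzzyGroupSketch('verb', 'حدثنا|أخبرنا');
```

With this ordering, `narratedSource` matches both the bare and the fully vocalized spellings of either word, while the `(?<verb>…)` wrapper stays intact.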
126
+ ### lineStartsAfter Content Capture
127
+
128
+ For patterns like `^٦٦٩٦ - (content)`, the content capture is the *last* positional group:
129
+
130
+ ```typescript
131
+ // Pattern: ^(?:(?<num>[\u0660-\u0669]+) [-–—ـ] )(.*)
132
+ // Match: m[1] = named group value, m[2] = content
133
+ // Solution: Iterate backward from m.length-1 to find last defined capture
134
+ ```
135
+
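The backward search described in the comments above can be written as a small helper. This is a sketch under the assumption that the content group is always the last capture that participated in the match; `lastDefinedCaptureSketch` is a hypothetical name, not a library export.

```typescript
// Walk backward from the final capture and return the first one that is
// defined. Index 0 (the whole match) is deliberately skipped.
function lastDefinedCaptureSketch(
    match: ReadonlyArray<string | undefined>,
): string | undefined {
    for (let i = match.length - 1; i >= 1; i--) {
        if (match[i] !== undefined) {
            return match[i];
        }
    }
    return undefined;
}
```

Because named groups also occupy numbered slots, this avoids hard-coding `m[1]` or `m[2]`: however many named captures precede the content group, the content is still found.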
136
+ ### Breakpoints Post-Processing Algorithm
137
+
138
+ The `breakpoints` option provides a post-processing mechanism for limiting segment size. Unlike the deprecated `maxSpan` (which was per-rule), breakpoints runs AFTER all structural rules.
139
+
140
+ **API Options:**
141
+ ```typescript
142
+ interface SegmentationOptions {
143
+ rules: SplitRule[];
144
+ maxPages?: number; // Maximum pages a segment can span
145
+ breakpoints?: string[]; // Ordered array of regex patterns (supports token expansion)
146
+ prefer?: 'longer' | 'shorter'; // Select last or first match within window
147
+ }
148
+ ```
149
+
150
+ **How it works:**
151
+ 1. Structural rules run first, creating initial segments
152
+ 2. Breakpoints then processes any segment exceeding `maxPages`
153
+ 3. Patterns are tried in order until one matches
154
+ 4. Empty string `''` means "fall back to page boundary"
155
+
156
+ **Example:**
157
+ ```typescript
158
+ segmentPages(pages, {
159
+ rules: [
160
+ { lineStartsWith: ['{{basmalah}}'], split: 'at' },
161
+ { lineStartsWith: ['{{bab}}'], split: 'at', meta: { type: 'chapter' } },
162
+ ],
163
+ maxPages: 2,
164
+ breakpoints: ['{{tarqim}}\\s*', '\\n', ''], // Try: punctuation → newline → page boundary
165
+ prefer: 'longer', // Greedy: make segments as large as possible
166
+ });
167
+ ```
168
+
169
+ **Key behaviors:**
170
+ - **Pattern order matters**: First matching pattern wins
171
+ - **`prefer: 'longer'`**: Finds LAST match in window (greedy)
172
+ - **`prefer: 'shorter'`**: Finds FIRST match (conservative)
173
+ - **Recursive**: If split result still exceeds `maxPages`, breakpoints runs again
174
+
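The pattern-order and `prefer` behaviors can be sketched on a single window of text. This is an illustration only: `findBreakSketch` is a hypothetical function, it takes already-expanded regex sources rather than templates, it ignores page tracking, and it treats the empty pattern as "break at the end of the window" as a stand-in for the page-boundary fallback.

```typescript
type PreferSketch = 'longer' | 'shorter';

// Return the offset just past the chosen match, or -1 if nothing matches.
function findBreakSketch(
    windowText: string,
    patterns: string[],
    prefer: PreferSketch,
): number {
    for (const source of patterns) {
        if (source === '') {
            return windowText.length; // stand-in for "fall back to page boundary"
        }
        const re = new RegExp(source, 'g');
        let firstEnd = -1;
        let lastEnd = -1;
        let m: RegExpExecArray | null;
        while ((m = re.exec(windowText)) !== null) {
            const end = m.index + m[0].length;
            if (firstEnd === -1) firstEnd = end;
            lastEnd = end;
            if (m[0].length === 0) re.lastIndex++; // guard against zero-length loops
        }
        if (lastEnd !== -1) {
            // First pattern with any match wins; prefer picks which match.
            return prefer === 'longer' ? lastEnd : firstEnd;
        }
    }
    return -1;
}
```

For the window `'aa. bb. cc\ndd'` with patterns `['\\.\\s*', '\\n', '']`, the punctuation pattern matches first, so the newline pattern is never consulted; `'longer'` breaks after the second full stop and `'shorter'` after the first.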
175
+ > **Note**: The old `maxSpan` and `fallback` properties on `SplitRule` were deprecated and have since been removed.
176
+
177
+ ## Design Decisions
178
+
179
+ ### 1. Why `{{double-braces}}`?
180
+
181
+ - Single braces `{}` conflict with regex quantifiers `{n,m}`
182
+ - Double braces are visually distinct and rarely appear in content
183
+ - Consistent with template systems (Handlebars, Mustache)
184
+
185
+ ### 2. Why `lineStartsAfter` vs `lineStartsWith`?
186
+
187
+ | Pattern | Marker in content? | Use case |
188
+ |---------|-------------------|----------|
189
+ | `lineStartsWith` | ✅ Yes | Keep marker, segment at boundary |
190
+ | `lineStartsAfter` | ❌ No | Strip marker, capture only content |
191
+
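The difference in the table can be shown on a single line of text. This sketch assumes the marker is given as regex source and only models what ends up in the segment content; `segmentLineSketch` is a hypothetical helper, not part of the library.

```typescript
type LineModeSketch = 'lineStartsWith' | 'lineStartsAfter';

// Return the segment content for a line, or null if the marker is absent.
function segmentLineSketch(
    line: string,
    markerSource: string,
    mode: LineModeSketch,
): string | null {
    const match = new RegExp('^(?:' + markerSource + ')').exec(line);
    if (match === null) {
        return null;
    }
    // lineStartsWith keeps the marker; lineStartsAfter strips it.
    return mode === 'lineStartsWith' ? line : line.slice(match[0].length);
}

const markerSketch = '[\\u0660-\\u0669]+ [-–—ـ] ';
const keptSketch = segmentLineSketch('٦٦٩٦ - حدثنا', markerSketch, 'lineStartsWith');
const strippedSketch = segmentLineSketch('٦٦٩٦ - حدثنا', markerSketch, 'lineStartsAfter');
```

Here `keptSketch` is the whole line including the number and dash, while `strippedSketch` contains only the text after the marker.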
192
+ ### 3. Why fuzzy transform at token level?
193
+
194
+ Applying fuzzy globally would corrupt regex metacharacters. Instead:
195
+ - Fuzzy is passed to `expandTokensWithCaptures()`
196
+ - Applied only to Arabic text portions
197
+ - Preserves `(`, `)`, `|`, `?`, etc.
198
+
199
+ ### 4. Why extract match utilities?
200
+
201
+ The original `segmentPages` had a complexity score of 37, well above the lint limit of 15. Extracting the utilities:
202
+ - Creates independently testable units
203
+ - Reduces main function complexity
204
+ - Improves code readability
205
+
206
+ ## Working with the Codebase
207
+
208
+ ### Adding a New Token
209
+
210
+ 1. Add to `TOKEN_PATTERNS` in `tokens.ts`:
211
+ ```typescript
212
+ export const TOKEN_PATTERNS = {
213
+ // ...existing
214
+ verse: '﴿[^﴾]+﴾', // Quranic verse markers
215
+ };
216
+ ```
217
+ 2. Add test cases in `segmenter.test.ts`
218
+ 3. Document in README.md
219
+
220
+ ### Adding a New Pattern Type
221
+
222
+ 1. Add type to union in `types.ts`:
223
+ ```typescript
224
+ type NewPattern = { newPatternField: string[] };
225
+ type PatternType = ... | NewPattern;
226
+ ```
227
+ 2. Handle in `buildRuleRegex()` in `segmenter.ts`
228
+ 3. Add comprehensive tests
229
+
230
+ ### Testing Strategy
231
+
232
+ - **Unit tests**: Each utility function has dedicated tests
233
+ - **Integration tests**: Full pipeline tests in `segmenter.test.ts`
234
+ - **Real-world tests**: `segmenter.bukhari.test.ts` uses actual hadith data
235
+ - Run: `bun test`
236
+
237
+ ## Code Quality Standards
238
+
239
+ 1. **TypeScript strict mode** - No `any` types
240
+ 2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
241
+ 3. **JSDoc comments** - All exported functions documented
242
+ 4. **Test suite** - 251 tests across 8 files
243
+
244
+ ## Dependencies
245
+
246
+ ### Development
247
+ - **@biomejs/biome** - Linting and formatting
248
+ - **tsdown** - Build tool (generates `.mjs` and `.d.mts`)
249
+ - **Bun** - Runtime and test runner
250
+
251
+ ## Build & Release
252
+
253
+ ```bash
254
+ # Install dependencies
255
+ bun install
256
+
257
+ # Run all tests
258
+ bun test
259
+
260
+ # Build distribution
261
+ bun run build
262
+ # Output: dist/index.mjs (~17 KB, ~5.7 KB gzipped)
263
+
264
+ # Run performance test (generates 50K pages, measures segmentation speed/memory)
265
+ bun run perf
266
+
267
+ # Format code
268
+ bunx biome format --write .
269
+
270
+ # Lint code
271
+ bunx biome lint .
272
+ ```
273
+
274
+ ## Lessons Learned
275
+
276
+ ### From Development
277
+
278
+ 1. **Named captures shift positional indices**: When `(?<name>…)` appears before `(.*)`, the content is at `m[2]` not `m[1]`. Solution: iterate backward to find last defined group.
279
+
280
+ 2. **Fuzzy + metacharacters don't mix**: `makeDiacriticInsensitive` expands Arabic to character classes. If applied to `(?<name>text)`, it corrupts the `(`, `?`, `<`, `>` characters.
281
+
282
+ 3. **Alternations need per-alternative fuzzy**: Token `narrated: 'حدثنا|أخبرنا'` requires splitting at `|`, applying fuzzy to each, then rejoining.
283
+
284
+ 4. **Complexity extraction works**: Pulling logic into `match-utils.ts` reduced main function complexity from 37 to 10 and made the code testable.
285
+
286
+ 5. **Rule order matters for specificity**: When multiple rules can match the same position, put specific patterns BEFORE generic ones. Example: `## {{raqms:num}} {{dash}}` must come before `##` to capture the number.
287
+
288
+ 6. **Post-processing beats per-rule limits**: The `maxSpan` approach (per-rule page limits) caused premature splits. Moving to post-processing `breakpoints` preserves structural integrity while still limiting segment size.
289
+
290
+ 7. **Window padding matters**: When calculating approximate content windows, 50% padding is needed (not 20%) to ensure enough content is captured for `prefer: 'longer'` scenarios.
291
+
292
+ 8. **Escaping in tests requires care**: TypeScript string `'\\.'` creates regex `\.`, but regex literal `/\./` is already escaped. Double-backslash in strings, single in literals.
293
+
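The escaping point in lesson 8 is easy to verify directly. The snippet below uses only the standard `RegExp` constructor and regex literals; the variable names are illustrative.

```typescript
// A string needs a doubled backslash to deliver one backslash to RegExp.
const dotFromString = new RegExp('\\.');
// A regex literal needs only a single backslash.
const dotFromLiteral = /\./;

// Both produce the same pattern source: \.
const sameDotSource = dotFromString.source === dotFromLiteral.source;

// Without any backslash the dot is a wildcard, so the pattern
// matches any character instead of a literal full stop.
const wildcardDot = new RegExp('.');
```

A test that builds its expected pattern from a string with a single backslash can therefore pass for the wrong reason: the wildcard form matches `'.'` too, but it also matches `'a'`.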
294
+ ### Architecture Insights
295
+
296
+ - **Declarative > Imperative**: Users describe patterns, library handles regex
297
+ - **Composability**: Tokens can be combined freely with `:name` captures
298
+ - **Fail gracefully**: Unknown tokens are left as-is, allowing partial templates
299
+ - **Post-process > Inline**: Breakpoints runs after rules, avoiding conflicts
300
+ - **Dependency injection for testability**: `breakpoint-utils.ts` accepts a `PatternProcessor` function instead of importing `processPattern` directly, enabling independent testing without mocking
301
+ - **Optional logging**: Use optional chaining (`logger?.debug?.()`) so that logging costs almost nothing when no logger is provided. All log methods are optional, allowing clients to subscribe to only the verbosity levels they need.
302
+
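The optional-logging pattern can be sketched as follows. The `LoggerSketch` interface mirrors the five verbosity levels listed earlier, but its exact method signatures are an assumption, and `countPagesWithLoggingSketch` is a hypothetical function used only to show the call sites.

```typescript
// Every method is optional, so a client can supply only the levels it wants.
interface LoggerSketch {
    trace?: (message: string) => void;
    debug?: (message: string) => void;
    info?: (message: string) => void;
    warn?: (message: string) => void;
    error?: (message: string) => void;
}

function countPagesWithLoggingSketch(
    pages: string[],
    logger?: LoggerSketch,
): number {
    // If logger or logger.debug is missing, the call short-circuits and the
    // argument expression is not evaluated at all.
    logger?.debug?.(`segmenting ${pages.length} pages`);
    if (pages.length === 0) {
        logger?.warn?.('no pages supplied');
    }
    return pages.length;
}

// A client that only cares about warnings:
const warningsSketch: string[] = [];
countPagesWithLoggingSketch([], { warn: (message) => warningsSketch.push(message) });
```

Since the argument list is skipped when the chain short-circuits, even the template string in the `debug` call is never built for clients that pass no logger.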
303
+ ---
304
+
305
+ ## Token Reference
306
+
307
+ | Token | Pattern Description | Example Match |
308
+ |-------|---------------------|---------------|
309
+ | `{{tarqim}}` | Arabic punctuation (. , ; ? ! ( ) etc.) | `؛` `،` `.` |
310
+ | `{{basmalah}}` | "بِسْمِ اللَّهِ" patterns | بِسْمِ اللَّهِ الرَّحْمَنِ |
311
+ | `{{bab}}` | "باب" (chapter) | بَابُ الإيمان |
312
+ | `{{fasl}}` | "فصل" (section) | فصل: في الطهارة |
313
+ | `{{kitab}}` | "كتاب" (book) | كتاب الصلاة |
314
+ | `{{raqm}}` | Single Arabic-Indic digit | ٥ |
315
+ | `{{raqms}}` | One or more Arabic-Indic digits | ٧٥٦٣ |
316
+ | `{{raqms:num}}` | Numerals with named capture | `meta.num = "٧٥٦٣"` |
317
+ | `{{dash}}` | Various dash characters | - – — ـ |
318
+ | `{{numbered}}` | Composite: `{{raqms}} {{dash}}` | ٧٥٦٣ - |
319
+
320
+ **Named captures**: Add `:name` suffix to capture into `meta`:
321
+ ```typescript
322
+ '{{raqms:hadithNum}} {{dash}}'
323
+ // → segment.meta.hadithNum = "٧٥٦٣"
324
+ ```
325
+