npm - flappa-doormal - Versions diffs - 1.0.0 → 2.1.0 - Mend

Files changed (7)

package/README.md CHANGED

@@ -12,346 +12,624 @@
 [![codecov](https://codecov.io/gh/ragaeeb/flappa-doormal/graph/badge.svg?token=RQ2BV4M9IS)](https://codecov.io/gh/ragaeeb/flappa-doormal)
 [![npm version](https://badge.fury.io/js/flappa-doormal.svg)](https://badge.fury.io/js/flappa-doormal)
-**Arabic text marker pattern library** - Generate regex patterns from declarative marker configurations.
+**Declarative Arabic text segmentation library** - Split pages of content into logical segments using human-readable patterns.
-🎯 **Purpose:** Simplify Arabic text segmentation by replacing complex regex patterns with readable, composable templates.
+## Why This Library?
+### The Problem
+Working with Arabic hadith and Islamic text collections requires splitting continuous text into segments (individual hadiths, chapters, verses). This traditionally means:
+- Writing complex Unicode regex patterns: `^[\u0660-\u0669]+\s*[-–—ـ]\s*`
+- Handling diacritic variations: `حَدَّثَنَا` vs `حدثنا`
+- Managing multi-page spans and page boundary tracking
+- Manually extracting hadith numbers, volume/page references
+### What Exists
+- **General regex libraries**: Don't understand Arabic text nuances
+- **NLP tokenizers**: Overkill for pattern-based segmentation
+- **Manual regex**: Error-prone, hard to maintain, no metadata extraction
+### The Solution
+**flappa-doormal** provides:
+✅ **Readable templates**: `{{raqms}} {{dash}}` instead of cryptic regex
+✅ **Named captures**: `{{raqms:hadithNum}}` auto-extracts to `meta.hadithNum`
+✅ **Fuzzy matching**: Ignore diacritics with `fuzzy: true`
+✅ **Page tracking**: Know which page each segment came from
+✅ **Declarative rules**: Describe *what* to match, not *how*
 ## Installation
 ```bash
+npm install flappa-doormal
+# or
 bun add flappa-doormal
-# Peer dependencies
-bun add bitaboom baburchi shamela
+# or
+yarn add flappa-doormal
 ```
 ## Quick Start
 ```typescript
-import { generateRegexFromMarker } from 'flappa-doormal';
-// Simple numbered marker
-const regex = generateRegexFromMarker({
-  type: 'numbered'  // Defaults: Arabic-Indic numerals, dash separator
+import { segmentPages } from 'flappa-doormal';
+// Your pages from a hadith book
+const pages = [
+  { id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ...' },
+  { id: 1, content: '٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ...' },
+  { id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ...' },
+];
+const segments = segmentPages(pages, {
+  rules: [{
+    lineStartsAfter: ['{{raqms:num}} {{dash}} '],
+    split: 'at',
+  }]
 });
-regex.exec('٥ - نص الحديث');
-// Returns: ['٥ - نص الحديث', 'نص الحديث']
+// Result:
+// [
+//   { content: 'حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ...', from: 1, meta: { num: '٦٦٩٦' } },
+//   { content: 'أَخْبَرَنَا عُمَرُ قَالَ...', from: 1, meta: { num: '٦٦٩٧' } },
+//   { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { num: '٦٦٩٨' } }
+// ]
 ```
 ## Features
-✅ **13 Preset Types** - Common patterns like `bab`, `hadith-chain`, `basmala`
-✅ **Template System** - Use `{num}`, `{dash}`, `{bullet}` instead of regex
-✅ **Type-Safe** - Full TypeScript support
-✅ **Composable** - Mix and match tokens with quantifiers
-✅ **Diacritic-Insensitive** - Handles Arabic text variations
+### 1. Template Tokens
-## Marker Types
+Replace regex with readable tokens:
-### Basic Types
-```typescript
-{ type: 'numbered' }      // ٥ - text
-{ type: 'bullet' }        // • text
-{ type: 'bab' }           // باب chapter
-{ type: 'hadith-chain' }  // حَدَّثَنَا narrator
-{ type: 'basmala' }       // بسم الله
-{ type: 'square-bracket' } // [٦٥] reference
-```
+| Token | Matches | Regex Equivalent |
+|-------|---------|------------------|
+| `{{raqms}}` | Arabic-Indic digits | `[\\u0660-\\u0669]+` |
+| `{{raqm}}` | Single Arabic digit | `[\\u0660-\\u0669]` |
+| `{{dash}}` | Dash variants | `[-–—ـ]` |
+| `{{harf}}` | Arabic letter | `[أ-ي]` |
+| `{{numbered}}` | Hadith numbering `٢٢ - ` | `{{raqms}} {{dash}} ` |
+| `{{fasl}}` | Section markers | `فصل\|مسألة` |
+| `{{tarqim}}` | Punctuation marks | `[.!?؟؛]` |
+| `{{bullet}}` | Bullet points | `[•*°]` |
+| `{{naql}}` | Narrator phrases | `حدثنا\|أخبرنا\|...` |
+| `{{kitab}}` | "كتاب" (book) | `كتاب` |
+| `{{bab}}` | "باب" (chapter) | `باب` |
+| `{{basmalah}}` | "بسم الله" | `بسم الله` |
+### 2. Named Capture Groups
+Extract metadata automatically with the `{{token:name}}` syntax:
-### Numbered Variants
 ```typescript
-{ type: 'num-letter' }    // ٥ أ - (number + letter)
-{ type: 'num-paren' }     // ٥ (أ) - (number + paren)
-{ type: 'num-slash' }     // ٥/٦ - (number/number)
+// Capture hadith number
+{ template: '^{{raqms:hadithNum}} {{dash}} ' }
+// Result: meta.hadithNum = '٦٦٩٦'
+// Capture volume and page
+{ template: '^{{raqms:vol}}/{{raqms:page}} {{dash}} ' }
+// Result: meta.vol = '٣', meta.page = '٤٥٦'
+// Capture rest of content
+{ template: '^{{raqms:num}} {{dash}} {{:text}}' }
+// Result: meta.num = '٦٦٩٦', meta.text = 'حَدَّثَنَا أَبُو بَكْرٍ'
 ```
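The `{{token:name}}` syntax can be pictured as a template compiler that wraps named tokens in capture groups and records the names in order. The following is a minimal sketch under that assumption, not the library's implementation: `CAPTURE_TOKENS`, `compileTemplateSketch`, and `extractMetaSketch` are hypothetical names, and only two tokens from the table above are included.

```typescript
// Hypothetical token map: two entries copied from the README's token table.
const CAPTURE_TOKENS = new Map<string, string>([
  ['raqms', '[\\u0660-\\u0669]+'], // Arabic-Indic digits
  ['dash', '[-–—ـ]'],              // dash variants
]);

type CompiledTemplate = { regex: RegExp; names: string[] };

// '{{token}}' becomes its pattern, '{{token:name}}' becomes a capture group,
// and '{{:name}}' (no token) captures the rest of the line.
function compileTemplateSketch(template: string): CompiledTemplate {
  const names: string[] = [];
  const source = template.replace(
    /\{\{(\w*)(?::(\w+))?\}\}/g,
    (whole: string, token: string, captureName?: string): string => {
      if (token === '' && captureName === undefined) return whole;
      const pattern = token === '' ? '.+' : CAPTURE_TOKENS.get(token);
      if (pattern === undefined) return whole; // unknown token: leave untouched
      if (captureName === undefined) return pattern;
      names.push(captureName);
      return '(' + pattern + ')';
    },
  );
  return { regex: new RegExp(source), names };
}

// Run the compiled template against one line and collect captures by name.
function extractMetaSketch(template: string, line: string): Record<string, string> | null {
  const { regex, names } = compileTemplateSketch(template);
  const match = regex.exec(line);
  if (match === null) return null;
  const meta: Record<string, string> = {};
  names.forEach((captureName, i) => {
    meta[captureName] = match[i + 1] ?? '';
  });
  return meta;
}
```

Numbered groups plus a name list are used here instead of native named groups so the sketch does not depend on the compile target.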
-### Custom Patterns
+### 3. Fuzzy Matching (Diacritic-Insensitive)
+Match Arabic text regardless of harakat:
-**Using templates (recommended):**
 ```typescript
-{
-  type: 'pattern',
-  template: '{bullet}? {num}+ {dash}'
-}
+const rules = [{
+  fuzzy: true,
+  lineStartsAfter: ['{{kitab:book}} '],
+  split: 'at',
+}];
+// Matches both:
+// - 'كِتَابُ الصلاة' (with diacritics)
+// - 'كتاب الصيام' (without diacritics)
 ```
-**Using raw regex (for complex patterns):**
+### 4. Pattern Types
+| Type | Marker in content? | Use case |
+|------|-------------------|----------|
+| `lineStartsWith` | ✅ Included | Keep marker, segment at boundary |
+| `lineStartsAfter` | ❌ Excluded | Strip marker, capture only content |
+| `lineEndsWith` | ✅ Included | Match patterns at end of line |
+| `template` | Depends | Custom pattern with full control |
+| `regex` | Depends | Raw regex for complex cases |
+### 5. Page Constraints
+Limit rules to specific page ranges:
 ```typescript
 {
-  type: 'pattern',
-  pattern: '^CUSTOM: (.*)'  // When templates aren't sufficient
+  lineStartsWith: ['## '],
+  split: 'at',
+  min: 10,    // Only pages 10+
+  max: 100,   // Only pages up to 100
 }
 ```
-**Using format for numbered:**
+### 6. Occurrence Filtering
+Control which matches to use:
 ```typescript
 {
-  type: 'numbered',
-  format: '{bullet}+ {num} {letter} {dash}'
+  lineEndsWith: ['\\.'],
+  split: 'after',
+  occurrence: 'last',  // Only split at LAST period on page
+  maxSpan: 1,          // Apply per-page
 }
 ```
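One way to read `occurrence` for a `lineEndsWith` rule with `split: 'after'` is: collect every candidate split position on a page, then keep the first, the last, or all of them. The sketch below assumes that reading; `splitPointsSketch` is a hypothetical helper, not a library export.

```typescript
type Occurrence = 'first' | 'last' | 'all';

// Return the offsets just after each line-ending match, filtered by occurrence.
function splitPointsSketch(pageText: string, lineEndPattern: string, occurrence: Occurrence): number[] {
  // 'm' makes '$' match at every line end; 'g' lets exec() walk all matches.
  const regex = new RegExp('(?:' + lineEndPattern + ')$', 'gm');
  const ends: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = regex.exec(pageText)) !== null) {
    ends.push(match.index + match[0].length);
    if (match[0].length === 0) regex.lastIndex++; // guard against empty matches
  }
  if (occurrence === 'first') return ends.slice(0, 1);
  if (occurrence === 'last') return ends.slice(-1);
  return ends;
}
```

For the text `'One.\nTwo.\nThree'` and the pattern `'\\.'`, the candidates are offsets 4 and 9, so `'last'` keeps only 9.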
-## Complex Pattern Examples
+## Use Cases
-### Comma-Separated Numerals
-Match patterns like: `٩٩٣٦، ٩٩٣٧ - حَدَّثَنَا`
+### Simple Hadith Segmentation
+Use `{{numbered}}` for the common "number - content" format:
 ```typescript
-{
-  type: 'pattern',
-  template: '{num}(?:،{s}{num})*{s}{dash}'
-}
+const segments = segmentPages(pages, {
+  rules: [{
+    lineStartsAfter: ['{{numbered}}'],
+    split: 'at',
+    meta: { type: 'hadith' }
+  }]
+});
+// Matches: ٢٢ - حدثنا, ٦٦٩٦ – أخبرنا, etc.
+// Content starts AFTER the number and dash
 ```
-### Number / Letter
-Match patterns like: `١١٠٧٣/ أ - حَدَّثَنَا`
+### Hadith Segmentation with Number Extraction
+For capturing the hadith number, use explicit capture syntax:
 ```typescript
-{
-  type: 'pattern',
-  template: '{num}{s}/{s}{letter}{s}{dash}'
-}
+const segments = segmentPages(pages, {
+  rules: [{
+    lineStartsAfter: ['{{raqms:hadithNum}} {{dash}} '],
+    split: 'at',
+    meta: { type: 'hadith' }
+  }]
+});
+// Each segment has:
+// - content: The hadith text (without number prefix)
+// - from/to: Page range
+// - meta: { type: 'hadith', hadithNum: '٦٦٩٦' }
 ```
-### Number / Number (Built-in)
-Match patterns like: `١٠٢٦٦ / ١ - "وَإِذَا`
+### Volume/Page Reference Extraction
 ```typescript
-{
-  type: 'num-slash'  // Built-in preset
-}
+const segments = segmentPages(pages, {
+  rules: [{
+    lineStartsAfter: ['{{raqms:vol}}/{{raqms:page}} {{dash}} '],
+    split: 'at'
+  }]
+});
+// meta: { vol: '٣', page: '٤٥٦' }
 ```
-### Repeating Dots
-Match patterns like: `. . . . . . . . . .`
+### Chapter Detection with Fuzzy Matching
 ```typescript
-{
-  type: 'pattern',
-  template: '\\.(?:{s}\\.)+'
- }
+const segments = segmentPages(pages, {
+  rules: [{
+    fuzzy: true,
+    lineStartsAfter: ['{{kitab:book}} '],
+    split: 'at',
+    meta: { type: 'chapter' }
+  }]
+});
+// Matches "كِتَابُ" or "كتاب" regardless of diacritics
 ```
-### Asterisk + Dots + Number
-Match patterns like: `*. . . / ٨٦ - حَدَّثَنَا`
+### Naql (Transmission) Phrase Detection
-**Option 1: Capture from asterisk**
 ```typescript
-{
-  type: 'pattern',
-  template: '\\*\\.(?:{s}\\.)*{s}/{s}{num}{s}{dash}',
-  removeMarker: false  // Keep everything
-}
+const segments = segmentPages(pages, {
+  rules: [{
+    fuzzy: true,
+    lineStartsWith: ['{{naql:phrase}}'],
+    split: 'at'
+  }]
+});
+// meta.phrase captures which narrator phrase was matched:
+// 'حدثنا', 'أخبرنا', 'حدثني', etc.
 ```
-**Option 2: Detect from asterisk, capture from number**
+### Mixed Captured and Non-Captured Tokens
 ```typescript
-{
-  type: 'pattern',
-  pattern: '^\\*\\.(?:\\s?\\.)*\\s?/\\s?([\\u0660-\\u0669]+\\s?[-–—ـ].*)'
-}
+// Only capture the number, not the letter
+const segments = segmentPages(pages, {
+  rules: [{
+    lineStartsWith: ['{{raqms:num}} {{harf}} {{dash}} '],
+    split: 'at'
+  }]
+});
+// Input: '٥ أ - البند الأول'
+// meta: { num: '٥' }  // harf not captured (no :name suffix)
 ```
-## Template Tokens
+### Sentence-Based Splitting (Last Period Per Page)
-| Token | Matches | Example |
-|-------|---------|---------|
-| `{num}` | Arabic-Indic numerals | `[\\u0660-\\u0669]+` |
-| `{latin}` | Latin numerals | `\\d+` |
-| `{roman}` | Roman numerals | `[IVXLCDM]+` |
-| `{dash}` | Various dashes | `[-–—ـ]` |
-| `{dot}` | Period | `\\.` |
-| `{bullet}` | Bullet variants | `[•*°]` |
-| `{letter}` | Arabic letters | `[أ-ي]` |
-| `{s}` | Optional space | `\\s?` |
-| `{space}` | Required space | `\\s+` |
+```typescript
+const segments = segmentPages(pages, {
+  rules: [{
+    lineEndsWith: ['\\.'],
+    split: 'after',
+    occurrence: 'last',
+    maxSpan: 1
+  }]
+});
+```
-**Quantifiers:** Add `+`, `*`, `?` after tokens: `{num}+`, `{bullet}?`
+### Page Fallback for Unmatched Content
-## Examples
+When using `maxSpan` to group matches per page, use `fallback: 'page'` to prevent unmatched pages from merging with adjacent segments:
-### Before (Regex)
 ```typescript
-const pattern = '^[•*°]+ ([\\u0660-\\u0669]+\\s?[-–—ـ].*)';
+const segments = segmentPages(pages, {
+  rules: [{
+    template: '{{tarqim}}',  // Match punctuation marks
+    split: 'after',
+    occurrence: 'last',
+    maxSpan: 1,
+    fallback: 'page'  // If no punctuation found, segment the page anyway
+  }]
+});
 ```
-### After (Template)
+- **Without `fallback`**: Pages without matches merge into the next segment.
+- **With `fallback: 'page'`**: Each page becomes its own segment, even without matches.
+> **Future extensions**: The `fallback` option may support additional values like `'skip'` (omit unmatched content) or `'line'` (split at line breaks) in future versions.
+### Multiple Rules with Priority
 ```typescript
-{
-  type: 'numbered',
-  format: '{bullet}+ {num} {dash}'
-}
+const segments = segmentPages(pages, {
+  rules: [
+    // First: Chapter headers (highest priority)
+    { fuzzy: true, lineStartsAfter: ['{{kitab:book}} '], split: 'at', meta: { type: 'chapter' } },
+    // Second: Sub-chapters
+    { fuzzy: true, lineStartsAfter: ['{{bab:section}} '], split: 'at', meta: { type: 'section' } },
+    // Third: Individual hadiths
+    { lineStartsAfter: ['{{raqms:num}} {{dash}} '], split: 'at', meta: { type: 'hadith' } },
+  ]
+});
 ```
-**80% reduction in complexity!**
+## API Reference
-## API
+### `segmentPages(pages, options)`
-### `generateRegexFromMarker(config)`
+Main segmentation function.
 ```typescript
-import { generateRegexFromMarker, type MarkerConfig } from 'flappa-doormal';
+import { segmentPages, type Page, type SegmentationOptions, type Segment } from 'flappa-doormal';
+const pages: Page[] = [
+  { id: 1, content: 'First page content...' },
+  { id: 2, content: 'Second page content...' },
+];
-const config: MarkerConfig = {
-  type: 'numbered',
-  numbering: 'arabic-indic',  // or 'latin', 'roman'
-  separator: 'dash',           // or 'dot', 'colon', 'paren'
-  removeMarker: true,          // Remove marker from capture (default: true)
+const options: SegmentationOptions = {
+  rules: [
+    { lineStartsWith: ['## '], split: 'at' }
+  ]
 };
-const regex = generateRegexFromMarker(config);
+const segments: Segment[] = segmentPages(pages, options);
 ```
-### `expandTemplate(template, options)`
+### `stripHtmlTags(html)`
+Remove all HTML tags from content, keeping only text.
 ```typescript
-import { expandTemplate } from 'flappa-doormal';
+import { stripHtmlTags } from 'flappa-doormal';
+const text = stripHtmlTags('<p>Hello <b>World</b></p>');
+// Returns: 'Hello World'
+```
-const pattern = expandTemplate('{num} {dash}');
-// Returns: '^[\\u0660-\\u0669]+ [-–—ـ](.*)'
+For more sophisticated HTML to Markdown conversion (like converting `<span data-type="title">` to `## ` headers), you can implement your own function. Here's an example:
-const pattern2 = expandTemplate('{num} {dash}', { removeMarker: false });
-// Returns: '^([\\u0660-\\u0669]+ [-–—ـ].*)'
+```typescript
+const htmlToMarkdown = (html: string): string => {
+    return html
+        // Convert title spans to markdown headers
+        .replace(/<span[^>]*data-type=["']title["'][^>]*>(.*?)<\/span>/gi, '## $1')
+        // Strip narrator links but keep text
+        .replace(/<a[^>]*href=["']inr:\/\/[^"']*["'][^>]*>(.*?)<\/a>/gi, '$1')
+        // Strip all remaining HTML tags
+        .replace(/<[^>]*>/g, '');
+};
 ```
-### `validateTemplate(template)`
+### `expandTokens(template)`
+Expand template tokens to regex pattern.
 ```typescript
-import { validateTemplate } from 'flappa-doormal';
+import { expandTokens } from 'flappa-doormal';
-const result = validateTemplate('{num} {invalid}');
-// Returns: { valid: false, errors: ['Unknown tokens: {invalid}'] }
+const pattern = expandTokens('{{raqms}} {{dash}}');
+// Returns: '[\u0660-\u0669]+ [-–—ـ]'
 ```
-## Configuration Options
+### `makeDiacriticInsensitive(text)`
+Make Arabic text diacritic-insensitive for fuzzy matching.
 ```typescript
-type MarkerConfig = {
-  type: MarkerType;
-  numbering?: 'arabic-indic' | 'latin' | 'roman';
-  separator?: 'dash' | 'dot' | 'paren' | 'colon' | 'none' | string;
-  format?: string;           // Template for numbered markers
-  template?: string;         // Template for pattern markers
-  pattern?: string;          // Raw regex (when templates aren't enough)
-  tokens?: Record<string, string>;  // Custom token definitions
-  phrases?: string[];        // For 'phrase' and 'hadith-chain' types
-  removeMarker?: boolean;    // Default: true for numbered/bullet
-};
+import { makeDiacriticInsensitive } from 'flappa-doormal';
+const pattern = makeDiacriticInsensitive('حدثنا');
+// Returns regex pattern matching 'حَدَّثَنَا', 'حدثنا', etc.
 ```
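A common way to get this behavior is to allow any run of harakat after each base letter. The sketch below assumes that approach and covers only the combining marks U+064B to U+0652; the library's actual transform may handle more (for example letter variants). `fuzzyPatternSketch` is a hypothetical name.

```typescript
// Tanween, fatha, damma, kasra, shadda, sukun.
const HARAKAT_CLASS = '[\\u064B-\\u0652]';

// Build a regex source that matches `text` with or without diacritics.
function fuzzyPatternSketch(text: string): string {
  const isHaraka = new RegExp(HARAKAT_CLASS);
  return text
    .split('')
    .filter((ch) => !isHaraka.test(ch)) // drop diacritics already in the input
    .map((ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&') + HARAKAT_CLASS + '*')
    .join('');
}
```

Because input diacritics are dropped first, the vocalized and unvocalized spellings of a word produce the same pattern.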
-## Extensibility
+### `TOKEN_PATTERNS`
-### Extending Default Phrase Lists
+Access available token definitions.
 ```typescript
-import { DEFAULT_HADITH_PHRASES, generateRegexFromMarker } from 'flappa-doormal';
+import { TOKEN_PATTERNS } from 'flappa-doormal';
+console.log(TOKEN_PATTERNS.naql);
+// 'حدثنا|أخبرنا|حدثني|وحدثنا|أنبأنا|سمعت'
+```
-// Add to existing hadith phrases
-const myPhrases = [...DEFAULT_HADITH_PHRASES, 'أَخْبَرَنِي', 'سَمِعْتُ'];
+### Pattern Detection
-const regex = generateRegexFromMarker({
-  type: 'hadith-chain',
-  phrases: myPhrases,
-});
+Auto-detect tokens in Arabic text for building rules:
+```typescript
+import { detectTokenPatterns, analyzeTextForRule } from 'flappa-doormal';
+// Detect individual tokens
+const tokens = detectTokenPatterns('٣٤ - حدثنا');
+// [
+//   { token: 'raqms', match: '٣٤', index: 0, endIndex: 2 },
+//   { token: 'dash', match: '-', index: 3, endIndex: 4 },
+//   { token: 'naql', match: 'حدثنا', index: 5, endIndex: 10 }
+// ]
+// Get complete rule suggestion
+const rule = analyzeTextForRule('٣٤ - ');
+// {
+//   template: '{{raqms}} {{dash}} ',
+//   patternType: 'lineStartsAfter',
+//   fuzzy: false,
+//   metaType: 'hadith',
+//   detected: [...]
+// }
 ```
-### Using Type-Specific Generators
+## Types
+### `SplitRule`
 ```typescript
-import { generateHadithChainRegex, DEFAULT_HADITH_PHRASES } from 'flappa-doormal';
+type SplitRule = {
+  // Pattern (choose one)
+  lineStartsWith?: string[];
+  lineStartsAfter?: string[];
+  lineEndsWith?: string[];
+  template?: string;
+  regex?: string;
+  // Split behavior
+  split: 'at' | 'after';
+  occurrence?: 'first' | 'last' | 'all';
+  maxSpan?: number;
+  fuzzy?: boolean;
+  fallback?: 'page';  // NEW: Page-boundary fallback
+  // Constraints
+  min?: number;
+  max?: number;
+  meta?: Record<string, unknown>;
+};
+```
-// Direct access to type-specific generator
-const regex = generateHadithChainRegex(
-  { type: 'hadith-chain', phrases: [...DEFAULT_HADITH_PHRASES, 'extra'] },
-  true // removeMarker
-);
+### `Segment`
+```typescript
+type Segment = {
+  content: string;
+  from: number;
+  to?: number;
+  meta?: Record<string, unknown>;
+};
 ```
-### Custom Tokens
+### `Logger`
+Optional logging interface for debugging segmentation:
 ```typescript
-import { createTokenMap, expandTemplate } from 'flappa-doormal';
+interface Logger {
+  trace?: (message: string, ...args: unknown[]) => void;  // Per-iteration details
+  debug?: (message: string, ...args: unknown[]) => void;  // Detailed operations
+  info?: (message: string, ...args: unknown[]) => void;   // Key progress points
+  warn?: (message: string, ...args: unknown[]) => void;   // Potential issues
+  error?: (message: string, ...args: unknown[]) => void;  // Critical failures
+}
+```
-const customTokens = createTokenMap({
-  verse: '\\[[\\u0660-\\u0669]+\\]',
-  tafsir: 'تفسير',
+## Debugging
+### Using the Logger
+Pass a `logger` option to receive detailed information about the segmentation process:
+```typescript
+// Console logger for development
+const segments = segmentPages(pages, {
+  rules: [...],
+  logger: {
+    debug: console.debug,
+    info: console.info,
+    warn: console.warn,
+  }
 });
-const pattern = expandTemplate('{verse} {tafsir}', {
-  tokens: customTokens,
-  removeMarker: true
+// Production logger (only errors)
+const productionSegments = segmentPages(pages, {
+  rules: [...],
+  logger: {
+    error: (msg, ...args) => myLoggingService.error(msg, args),
+  }
 });
 ```
-## Available Exports
+**Verbosity levels:**
+- `trace` - Per-iteration loop details (very verbose)
+- `debug` - Segment processing, pattern matching
+- `info` - Start/completion of breakpoint processing
+- `warn` - Safety checks triggered
+- `error` - Infinite loop detection
-**Constants:**
-- `DEFAULT_HADITH_PHRASES` - Default narrator phrases
-- `DEFAULT_BASMALA_PATTERNS` - Default basmala patterns
-- `TOKENS` - Token definitions
+When no logger is provided, no logging overhead is incurred.
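An all-optional logger like this is typically called through optional chaining, which skips the call and the evaluation of its arguments when the logger or the method is absent. The sketch below illustrates that pattern with a hypothetical `SketchLogger` and `countLinesWithLogging`; it is not taken from the library's source.

```typescript
interface SketchLogger {
  debug?: (message: string, ...args: unknown[]) => void;
  info?: (message: string, ...args: unknown[]) => void;
}

// Count lines across pages, logging only through the methods that were supplied.
function countLinesWithLogging(pages: string[], logger?: SketchLogger): number {
  let total = 0;
  pages.forEach((content, index) => {
    const lines = content.split('\n').length;
    // If logger or logger.debug is undefined, this whole call is skipped.
    logger?.debug?.('page processed', index, lines);
    total += lines;
  });
  logger?.info?.('done', total);
  return total;
}
```

Passing `{ debug: console.debug }` turns on per-page output, while omitting the second argument leaves the function silent.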
-**Functions:**
-- `generateRegexFromMarker()` - Main function
-- `generate{Type}Regex()` - 12 type-specific generators
-- `expandTemplate()` - Template expansion
-- `validateTemplate()` - Template validation
-- `createTokenMap()` - Custom token maps
-## Testing
+## Usage with Next.js / Node.js
-This project has comprehensive unit test coverage for all marker type generators.
-```bash
-# Run all tests
-bun test
+```typescript
+// app/api/segment/route.ts (Next.js App Router)
+import { segmentPages } from 'flappa-doormal';
+import { NextResponse } from 'next/server';
+export async function POST(request: Request) {
+  const { pages, rules } = await request.json();
+  const segments = segmentPages(pages, { rules });
+  return NextResponse.json({ segments });
+}
+```
-# Run specific test file
-bun test src/markers/type-generators.test.ts
+```typescript
+// Node.js script
+import { segmentPages, stripHtmlTags } from 'flappa-doormal';
+const pages = rawPages.map((p, i) => ({
+  id: i + 1,
+  content: stripHtmlTags(p.html)
+}));
+const segments = segmentPages(pages, {
+  rules: [{
+    lineStartsAfter: ['{{raqms:num}} {{dash}} '],
+    split: 'at'
+  }]
+});
-# Run tests with coverage
-bun test --coverage
+console.log(`Found ${segments.length} segments`);
 ```
-**Test Coverage**: 100% coverage for `type-generators.ts` with 54+ test cases covering:
-- All 12 marker type generators
-- Edge cases (empty phrases, diacritic variations, custom separators)
-- Error handling (missing required fields)
-- Various numbering styles and separators
 ## Development
 ```bash
 # Install dependencies
 bun install
-# Run tests
+# Run tests (251 tests)
 bun test
-# Build (if needed)
+# Build
 bun run build
-# Format code
-bunx biome format --write .
+# Run performance test (generates 50K pages, measures segmentation speed/memory)
+bun run perf
-# Lint code
+# Lint
 bunx biome lint .
+# Format
+bunx biome format --write .
+```
+## Design Decisions
+### Double-Brace Syntax `{{token}}`
+Single braces conflict with regex quantifiers `{n,m}`. Double braces are visually distinct and match common template syntax (Handlebars, Mustache).
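To illustrate the point, here is a small sketch of a double-brace expander that leaves a regex quantifier such as `{2,3}` alone. `BRACE_PATTERNS` and `expandDoubleBraces` are hypothetical names, and the two token patterns are copied from the token table earlier in the README.

```typescript
const BRACE_PATTERNS = new Map<string, string>([
  ['raqm', '[\\u0660-\\u0669]'], // single Arabic-Indic digit
  ['dash', '[-–—ـ]'],            // dash variants
]);

// Replace only '{{name}}' sequences; single-brace text passes through unchanged.
function expandDoubleBraces(template: string): string {
  return template.replace(
    /\{\{(\w+)\}\}/g,
    (whole: string, tokenName: string): string => BRACE_PATTERNS.get(tokenName) ?? whole,
  );
}

// Two or three digits, a space, then a dash: the quantifier survives expansion.
const expandedSource = expandDoubleBraces('^{{raqm}}{2,3} {{dash}}');
```

With single-brace tokens, the expander would have to distinguish `{raqm}` from `{2,3}` by content; with double braces, a plain pattern match is enough.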
+### `lineStartsAfter` vs `lineStartsWith`
+- `lineStartsWith`: Keep marker in content (for detection only)
+- `lineStartsAfter`: Strip marker, capture only content (for clean extraction)
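The difference between the two can be sketched on a single line of text. This is an illustration only, assuming the marker is an already-expanded regex source; `segmentContentSketch` is a hypothetical helper.

```typescript
type LineStartMode = 'lineStartsWith' | 'lineStartsAfter';

// Return the segment content for a line, or null if the marker does not match.
function segmentContentSketch(line: string, markerPattern: string, mode: LineStartMode): string | null {
  const match = new RegExp('^(?:' + markerPattern + ')').exec(line);
  if (match === null) return null;
  // lineStartsWith keeps the marker; lineStartsAfter strips it.
  return mode === 'lineStartsWith' ? line : line.slice(match[0].length);
}
```

For the line `'## Title'` and the marker `'## '`, `lineStartsWith` yields `'## Title'` and `lineStartsAfter` yields `'Title'`.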
+### Fuzzy Applied at Token Level
+Fuzzy transforms are applied to raw Arabic text *before* wrapping in regex groups. This prevents corruption of regex metacharacters like `(`, `)`, `|`.
+### Extracted Utilities
+Complex logic was extracted into `match-utils.ts` so it can be tested independently, which reduced the main function's complexity score from 37 to 10.
+## Performance Notes
+### Memory Requirements
+The library concatenates all pages into a single string for pattern matching across page boundaries. Memory usage scales linearly with total content size:
+| Pages | Avg Page Size | Approximate Memory |
+|-------|---------------|-------------------|
+| 1,000 | 5 KB | ~5 MB |
+| 6,000 | 5 KB | ~30 MB |
+| 40,000 | 5 KB | ~200 MB |
+For typical book processing (up to 6,000 pages), memory usage is well within Node.js defaults. For very large books (40,000+ pages), ensure adequate heap size.
+### `maxSpan` Sliding Window Behavior
+The `maxSpan` option uses a **sliding window algorithm** based on page ID difference:
+```typescript
+// maxSpan = maximum page ID difference when looking ahead for split points
+// Algorithm prefers LONGER segments by looking as far ahead as allowed
+// Pages [1, 2, 3, 4] with maxSpan: 1, occurrence: 'last'
+// Window from page 1: pages 1-2 (diff <= 1), splits at page 2's last match
+// Window from page 3: pages 3-4 (diff <= 1), splits at page 4's last match
+// Result: 2 segments spanning pages 1-2 and 3-4
+// Pages [1, 5, 10] with maxSpan: 1, occurrence: 'last'
+// Window from page 1: only page 1 (5-1=4 > 1), splits at page 1
+// Window from page 5: only page 5 (10-5=5 > 1), splits at page 5
+// Window from page 10: only page 10, splits at page 10
+// Result: 3 segments (pages too far apart to merge)
 ```
+This is intentional for books where page IDs represent actual page numbers. With `occurrence: 'last'`, the algorithm finds the last match within the lookahead window, creating longer segments where possible.
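The windowing described above can be reproduced with a short grouping function. This is a simplified sketch: it assumes every window ends with a split on its last page (as in the two examples), and `groupPagesByMaxSpan` is a hypothetical name rather than part of the library.

```typescript
// Group sorted page IDs into windows where (id - firstIdInWindow) <= maxSpan.
function groupPagesByMaxSpan(pageIds: number[], maxSpan: number): number[][] {
  const windows: number[][] = [];
  let current: number[] = [];
  let windowStart = 0;
  pageIds.forEach((id) => {
    if (current.length > 0 && id - windowStart <= maxSpan) {
      current.push(id); // still within the lookahead window
    } else {
      if (current.length > 0) windows.push(current);
      current = [id]; // start a new window at this page
      windowStart = id;
    }
  });
  if (current.length > 0) windows.push(current);
  return windows;
}
```

With `maxSpan: 1`, the IDs `[1, 2, 3, 4]` group into `[[1, 2], [3, 4]]` and `[1, 5, 10]` into `[[1], [5], [10]]`, matching the segment counts in the comments above.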
 ## For AI Agents
-See [AGENTS.md](./AGENTS.md) for comprehensive guidance on:
-- Project architecture and design patterns
-- Adding new marker types
-- Testing strategies
-- Code quality standards
-- Extension points
+See [AGENTS.md](./AGENTS.md) for:
+- Architecture details and design patterns
+- Adding new tokens and pattern types
+- Algorithm explanations
+- Lessons learned during development
 ## License
 MIT
-## Related
-- [bitaboom](https://github.com/ragaeeb/bitaboom) - Arabic text utilities
-- [baburchi](https://github.com/ragaeeb/baburchi) - Text sanitization
-- [shamela](https://github.com/ragaeeb/shamela) - Shamela library utilities