flappa-doormal 2.8.0 → 2.10.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +59 -14
- package/README.md +163 -47
- package/dist/index.d.mts +155 -39
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +338 -23
- package/dist/index.mjs.map +1 -1
- package/package.json +2 -2
package/AGENTS.md
CHANGED
@@ -26,8 +26,14 @@ Traditional Arabic text segmentation requires:
 ```text
 src/
 ├── index.ts                    # Main entry point and exports
-├──
-├──
+├── analysis/                   # Analysis helpers module
+│   ├── index.ts                # Barrel exports for analysis functions
+│   ├── shared.ts               # Shared utilities for analysis
+│   ├── line-starts.ts          # analyzeCommonLineStarts (line-based patterns)
+│   ├── repeating-sequences.ts  # analyzeRepeatingSequences (continuous text N-grams)
+│   └── *.test.ts               # Analysis tests
+├── pattern-detection.ts        # Token detection for auto-generating rules
+├── pattern-detection.test.ts   # Pattern detection tests
 ├── recovery.ts                 # Marker recovery utility (recover mistaken lineStartsAfter)
 ├── recovery.test.ts            # Marker recovery tests
 └── segmentation/
@@ -36,6 +42,7 @@ src/
 ├── breakpoint-processor.ts     # Breakpoint post-processing engine (applyBreakpoints)
 ├── breakpoint-utils.ts         # Breakpoint processing utilities (windowing, excludes, page joins)
 ├── rule-regex.ts               # SplitRule -> compiled regex builder (buildRuleRegex, processPattern)
+├── optimize-rules.ts           # Rule optimization logic (merge, dedupe, sort)
 ├── tokens.ts                   # Token definitions and expansion logic
 ├── fuzzy.ts                    # Diacritic-insensitive matching utilities
 ├── html.ts                     # HTML utilities (stripHtmlTags)
@@ -63,32 +70,38 @@ src/
    - Deterministic mode reruns segmentation with selected rules converted to `lineStartsWith` and merges recovered `content` back into the provided segments
    - Optional `mode: 'best_effort_then_rerun'` attempts a conservative anchor-based recovery first, then falls back to rerun for unresolved segments
 
-
+3. **`tokens.ts`** - Template system
    - `TOKEN_PATTERNS` - Map of token names to regex patterns
    - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
    - `shouldDefaultToFuzzy()` - Checks if patterns contain fuzzy-default tokens (bab, basmalah, fasl, kitab, naql)
+   - `applyTokenMappings()` - Applies named captures (`{{token:name}}`) to raw templates
+   - `stripTokenMappings()` - Strips named captures (reverts to `{{token}}`)
    - Supports fuzzy transform for diacritic-insensitive matching
    - **Fuzzy-default tokens**: `bab`, `basmalah`, `fasl`, `kitab`, `naql` - auto-enable fuzzy matching unless `fuzzy: false` is set
 
-
+4. **`match-utils.ts`** - Extracted utilities (for testability)
    - `extractNamedCaptures()` - Get named groups from regex match
    - `filterByConstraints()` - Apply min/max page filters
    - `anyRuleAllowsId()` - Check if page passes rule constraints
 
-
+5. **`rule-regex.ts`** - SplitRule → compiled regex builder
    - `buildRuleRegex()` - Compiles rule patterns (`lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, `template`, `regex`)
    - `processPattern()` - Token expansion + auto-escaping + optional fuzzy application
    - `extractNamedCaptureNames()` - Extract `(?<name>...)` groups from raw regex patterns
 
-
+6. **`optimize-rules.ts`** - Rule management logic
+   - `optimizeRules()` - Merges compatible rules, deduplicates patterns, and sorts by specificity (longest patterns first)
+
+7. **`pattern-validator.ts`** - Rule validation utilities
    - `validateRules()` - Detects typos in patterns (missing `{{}}`, unknown tokens, duplicates)
+   - `formatValidationReport()` - Formats validation issues into human-readable strings
    - Returns parallel array structure for easy error tracking
 
-
+8. **`breakpoint-processor.ts`** - Breakpoint post-processing engine
    - `applyBreakpoints()` - Splits oversized structural segments using breakpoint patterns + windowing
    - Applies `pageJoiner` normalization to breakpoint-created segments
 
-
+9. **`breakpoint-utils.ts`** - Breakpoint processing utilities
    - `normalizeBreakpoint()` - Convert string to BreakpointRule object
    - `isPageExcluded()` - Check if page is in exclude list
    - `isInBreakpointRange()` - Validate page against min/max/exclude constraints
@@ -105,16 +118,17 @@ src/
    - `findNextPagePosition()` - Find next page content position
    - `findPatternBreakPosition()` - Find pattern match by preference
 
-
+10. **`types.ts`** - Type definitions
    - `Logger` interface - Optional logging for debugging
    - `SegmentationOptions` - Options with `logger` property
    - `pageJoiner` - Controls how page boundaries are represented in output (`space` default)
+   - `PATTERN_TYPE_KEYS` - Runtime array of all pattern types (for UI building)
    - Verbosity levels: `trace`, `debug`, `info`, `warn`, `error`
 
-
+11. **`fuzzy.ts`** - Arabic text normalization
    - `makeDiacriticInsensitive()` - Generate regex that ignores diacritics
 
-
+12. **`pattern-detection.ts`** - Token auto-detection (NEW)
    - `detectTokenPatterns()` - Detect tokens in text with positions
    - `generateTemplateFromText()` - Convert text to template string
    - `suggestPatternConfig()` - Suggest rule configuration
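The `fuzzy.ts` module listed above generates diacritic-insensitive regexes via `makeDiacriticInsensitive()`. As a standalone sketch of the technique — the character ranges, helper names, and overall shape here are illustrative assumptions, not the library's code:

```typescript
// Sketch: build a regex source that matches `text` while ignoring Arabic
// diacritics (tashkeel). Each literal character may be followed by any
// number of combining marks in the U+064B–U+065F / U+0670 range.
const DIACRITICS = '[\\u064B-\\u065F\\u0670]*';

function escapeRegex(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function makeDiacriticInsensitiveSketch(text: string): string {
  return Array.from(text)
    // Drop diacritics already present in the input pattern...
    .filter((ch) => !/[\u064B-\u065F\u0670]/.test(ch))
    // ...then allow optional diacritics after every remaining character.
    .map((ch) => escapeRegex(ch) + DIACRITICS)
    .join('');
}

// Bare 'حدثنا' should match its fully vocalized form 'حَدَّثَنَا'.
const re = new RegExp(makeDiacriticInsensitiveSketch('حدثنا'), 'u');
console.log(re.test('حَدَّثَنَا')); // true
```

This is why fuzzy-default tokens like `naql` can match both vocalized and unvocalized text without the rule author writing diacritic classes by hand.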
@@ -202,7 +216,7 @@ Raw `regex` patterns now support named capture groups for metadata extraction:
 
 ### Breakpoints Post-Processing Algorithm
 
-The `breakpoints` option provides a post-processing mechanism for limiting segment size.
+The `breakpoints` option provides a post-processing mechanism for limiting segment size. Breakpoints runs AFTER all structural rules.
 
 **API Options:**
 ```typescript
@@ -244,7 +258,7 @@ segmentPages(pages, {
 - **`prefer: 'shorter'`**: Finds FIRST match (conservative)
 - **Recursive**: If split result still exceeds `maxPages`, breakpoints runs again
 
-> **Note**:
+> **Note**: Older per-rule span limiting approaches were removed in favor of post-processing `breakpoints`.
 
 ## Design Decisions
 
@@ -357,7 +371,7 @@ bunx biome lint .
 
 5. **Rule order matters for specificity**: When multiple rules can match the same position, put specific patterns BEFORE generic ones. Example: `## {{raqms:num}} {{dash}}` must come before `##` to capture the number.
 
-6. **Post-processing beats per-rule limits**:
+6. **Post-processing beats per-rule limits**: Per-rule span limiting caused premature splits. Moving to post-processing `breakpoints` preserves structural integrity while still limiting segment size.
 
 7. **Window padding matters**: When calculating approximate content windows, 50% padding is needed (not 20%) to ensure enough content is captured for `prefer: 'longer'` scenarios.
 
@@ -396,6 +410,8 @@ bunx biome lint .
    - `src/segmentation/replace.ts` (preprocessing parity)
    - `src/recovery.ts` (recovery implementation)
 
+10. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
+
 ### Process Template (Multi-agent design review, TDD-first)
 
 If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
@@ -494,3 +510,32 @@ const quoted = analyzeCommonLineStarts(pages, {
 });
 ```
 
+## Repeating Sequence Analysis (`analyzeRepeatingSequences`)
+
+For continuous text **without line breaks** (prose-like content), use `analyzeRepeatingSequences(pages)`. It scans for commonly repeating word/token sequences (N-grams) across pages.
+
+Key options:
+- `minElements` / `maxElements`: N-gram size range (default 1-3)
+- `minCount`: Minimum occurrences to include (default 3)
+- `topK`: Maximum patterns to return (default 20)
+- `requireToken`: Only patterns containing `{{tokens}}` (default true)
+- `normalizeArabicDiacritics`: Ignore diacritics when matching (default true)
+
+Example:
+```typescript
+import { analyzeRepeatingSequences } from 'flappa-doormal';
+
+const patterns = analyzeRepeatingSequences(pages, { minCount: 3, topK: 20 });
+// [{ pattern: '{{naql}}', count: 42, examples: [...] }, ...]
+```
+
+## Analysis → Segmentation Workflow
+
+Use analysis functions to discover patterns, then pass to `segmentPages()`:
+
+1. **Continuous text**: `analyzeRepeatingSequences()` → build rules → `segmentPages()`
+2. **Structured text**: `analyzeCommonLineStarts()` → build rules → `segmentPages()`
+
+See README.md for complete examples.
+
package/README.md
CHANGED
@@ -158,6 +158,23 @@ const rules = [{
 | `template` | Depends | Custom pattern with full control |
 | `regex` | Depends | Raw regex for complex cases |
 
+#### Building UIs with Pattern Type Keys
+
+The library exports `PATTERN_TYPE_KEYS` (a const array) and `PatternTypeKey` (a type) for building UIs that let users select pattern types:
+
+```typescript
+import { PATTERN_TYPE_KEYS, type PatternTypeKey } from 'flappa-doormal';
+
+// PATTERN_TYPE_KEYS = ['lineStartsWith', 'lineStartsAfter', 'lineEndsWith', 'template', 'regex']
+
+// Build a dropdown/select
+PATTERN_TYPE_KEYS.map(key => <option value={key}>{key}</option>)
+
+// Type-safe validation
+const isPatternKey = (k: string): k is PatternTypeKey =>
+    (PATTERN_TYPE_KEYS as readonly string[]).includes(k);
+```
+
 ### 4.1 Page-start Guard (avoid page-wrap false positives)
 
 When matching at line starts (e.g., `{{naql}}`), a new page can begin with a marker that is actually a **continuation** of the previous page (page wrap), not a true new segment.
@@ -228,7 +245,6 @@ Control which matches to use:
   lineEndsWith: ['\\.'],
   split: 'after',
   occurrence: 'last', // Only split at LAST period on page
-  maxSpan: 1, // Apply per-page
 }
 ```
 
@@ -406,6 +422,123 @@ Key options:
 - If you paste these signatures into `lineStartsWith` / `lineStartsAfter` / `template`, that’s fine: those template pattern types **auto-escape `()[]`** outside `{{tokens}}`.
 - If you paste them into a raw `regex` rule, you may need to escape literal brackets yourself.
 
+### Repeating Sequence Analysis (continuous text)
+
+For texts without line breaks (continuous prose), use `analyzeRepeatingSequences()`:
+
+```typescript
+import { analyzeRepeatingSequences } from 'flappa-doormal';
+
+const patterns = analyzeRepeatingSequences(pages, {
+  minElements: 2,
+  maxElements: 4,
+  minCount: 3,
+  topK: 20,
+});
+// [{ pattern: "{{naql}}\\s*{{harf}}", count: 42, examples: [...] }, ...]
+```
+
+Key options:
+- `minElements` / `maxElements`: N-gram size range (default 1-3)
+- `minCount`: Minimum occurrences to include (default 3)
+- `topK`: Maximum patterns to return (default 20)
+- `requireToken`: Only patterns containing `{{tokens}}` (default true)
+- `normalizeArabicDiacritics`: Ignore diacritics when matching (default true)
+
+## Analysis → Segmentation Workflow
+
+Use analysis functions to discover patterns, then pass to `segmentPages()`.
+
+### Example A: Continuous Text (No Punctuation)
+
+For prose-like text without structural line breaks:
+
+```typescript
+import { analyzeRepeatingSequences, segmentPages, type Page } from 'flappa-doormal';
+
+// Continuous Arabic text with narrator phrases
+const pages: Page[] = [
+  { id: 1, content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي حدثنا خالد بن زيد عن علي' },
+  { id: 2, content: 'حدثنا سعيد بن جبير عن ابن عباس أخبرنا يوسف عن أنس' },
+];
+
+// Step 1: Discover repeating patterns
+const patterns = analyzeRepeatingSequences(pages, { minCount: 2, topK: 10 });
+// [{ pattern: '{{naql}}', count: 5, examples: [...] }, ...]
+
+// Step 2: Build rules from discovered patterns
+const rules = patterns.filter(p => p.count >= 3).map(p => ({
+  lineStartsWith: [p.pattern],
+  split: 'at' as const,
+  fuzzy: true,
+}));
+
+// Step 3: Segment
+const segments = segmentPages(pages, { rules });
+// [{ content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي', from: 1 }, ...]
+```
+
+### Example B: Structured Text (With Numbering)
+
+For hadith-style numbered entries:
+
+```typescript
+import { analyzeCommonLineStarts, segmentPages, type Page } from 'flappa-doormal';
+
+// Numbered hadith text
+const pages: Page[] = [
+  { id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ\n٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ' },
+  { id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ عَنْ عَائِشَةَ' },
+];
+
+// Step 1: Discover common line-start patterns
+const patterns = analyzeCommonLineStarts(pages, { topK: 10, minCount: 2 });
+// [{ pattern: '{{raqms}}\\s*{{dash}}', count: 3, examples: [...] }, ...]
+
+// Step 2: Build rules (add named capture for hadith number)
+const topPattern = patterns[0]?.pattern ?? '{{raqms}} {{dash}} ';
+const rules = [{
+  lineStartsAfter: [topPattern.replace('{{raqms}}', '{{raqms:num}}')],
+  split: 'at' as const,
+  meta: { type: 'hadith' }
+}];
+
+// Step 3: Segment
+const segments = segmentPages(pages, { rules });
+// [
+//   { content: 'حَدَّثَنَا أَبُو بَكْرٍ...', from: 1, meta: { type: 'hadith', num: '٦٦٩٦' } },
+//   { content: 'أَخْبَرَنَا عُمَرُ قَالَ', from: 1, meta: { type: 'hadith', num: '٦٦٩٧' } },
+//   { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { type: 'hadith', num: '٦٦٩٨' } },
+// ]
+```
+
+## Rule Optimization
+
+Use `optimizeRules()` to automatically merge compatible rules, remove duplicate patterns, and sort rules by specificity (longest patterns first):
+
+```typescript
+import { optimizeRules } from 'flappa-doormal';
+
+const rules = [
+  // These will be merged because meta/fuzzy options match
+  { lineStartsWith: ['{{kitab}}'], fuzzy: true, meta: { type: 'header' } },
+  { lineStartsWith: ['{{bab}}'], fuzzy: true, meta: { type: 'header' } },
+
+  // This will be kept separate
+  { lineStartsAfter: ['{{numbered}}'], meta: { type: 'entry' } },
+];
+
+const { rules: optimized, mergedCount } = optimizeRules(rules);
+
+// Result:
+// optimized[0] = {
+//   lineStartsWith: ['{{kitab}}', '{{bab}}'],
+//   fuzzy: true,
+//   meta: { type: 'header' }
+// }
+// optimized[1] = { lineStartsAfter: ['{{numbered}}'], ... }
+```
+
 ## Rule Validation
 
 Use `validateRules()` to detect common mistakes in rule patterns before running segmentation:
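The merge/dedupe/sort behavior that the new Rule Optimization section documents can be approximated in a few lines. A simplified standalone sketch — the grouping key, type shape, and function name are assumptions, not the library's `optimizeRules` implementation:

```typescript
// Merge rules that share identical options apart from their pattern lists,
// dedupe patterns, and sort each list longest-first (specificity).
type MiniRule = { lineStartsWith: string[]; fuzzy?: boolean; meta?: Record<string, string> };

function optimizeRulesSketch(rules: MiniRule[]): { rules: MiniRule[]; mergedCount: number } {
  const groups = new Map<string, MiniRule>();
  for (const rule of rules) {
    // Rules are mergeable when everything except the pattern list matches.
    const key = JSON.stringify({ fuzzy: rule.fuzzy ?? false, meta: rule.meta ?? {} });
    const existing = groups.get(key);
    if (existing) {
      existing.lineStartsWith.push(...rule.lineStartsWith);
    } else {
      groups.set(key, { ...rule, lineStartsWith: [...rule.lineStartsWith] });
    }
  }
  const merged = [...groups.values()].map((r) => ({
    ...r,
    // Dedupe, then longest pattern first so specific patterns win.
    lineStartsWith: [...new Set(r.lineStartsWith)].sort((a, b) => b.length - a.length),
  }));
  return { rules: merged, mergedCount: rules.length - merged.length };
}

const { rules: optimized, mergedCount } = optimizeRulesSketch([
  { lineStartsWith: ['{{kitab}}'], fuzzy: true, meta: { type: 'header' } },
  { lineStartsWith: ['{{bab}}'], fuzzy: true, meta: { type: 'header' } },
  { lineStartsWith: ['{{fasl}}'] },
]);
// mergedCount === 1; the two header rules collapse into one
```

Note the AGENTS.md guidance above: prefer the library's own `optimizeRules` over sketches like this in client code.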
@@ -422,6 +555,16 @@ const issues = validateRules([
 // issues[0]?.lineStartsAfter?.[0]?.type === 'missing_braces'
 // issues[1]?.lineStartsWith?.[0]?.type === 'unknown_token'
 // issues[2]?.lineStartsAfter?.[0]?.type === 'missing_braces'
+
+// To get a simple list of error strings for UI display:
+import { formatValidationReport } from 'flappa-doormal';
+
+const errors = formatValidationReport(issues);
+// [
+//   'Rule 1, lineStartsAfter: Missing {{}} around token "raqms:num"',
+//   'Rule 2, lineStartsWith: Unknown token "{{unknown}}"',
+//   ...
+// ]
 ```
 
 **Checks performed:**
@@ -429,6 +572,25 @@ const issues = validateRules([
 - **Unknown tokens**: Flags tokens inside `{{}}` that don't exist (e.g., `{{nonexistent}}`)
 - **Duplicates**: Finds duplicate patterns within the same rule
 
+## Token Mapping Utilities
+
+When building UIs for rule editing, it's often useful to separate the *token pattern* (e.g., `{{raqms}}`) from the *capture name* (e.g., `{{raqms:hadithNum}}`).
+
+```typescript
+import { applyTokenMappings, stripTokenMappings } from 'flappa-doormal';
+
+// 1. Apply user-defined mappings to a raw template
+const template = '{{raqms}} {{dash}}';
+const mappings = [{ token: 'raqms', name: 'num' }];
+
+const result = applyTokenMappings(template, mappings);
+// result = '{{raqms:num}} {{dash}}'
+
+// 2. Strip captures to get back to the canonical pattern
+const raw = stripTokenMappings(result);
+// raw = '{{raqms}} {{dash}}'
+```
+
 ## Prompting LLMs / Agents to Generate Rules (Shamela books)
 
 ### Pre-analysis (no LLM required): generate “hints” from the book
@@ -619,32 +781,10 @@ const segments = segmentPages(pages, {
     lineEndsWith: ['\\.'],
     split: 'after',
     occurrence: 'last',
-    maxSpan: 1
-  }]
-});
-```
-
-### Page Fallback for Unmatched Content
-
-When using `maxSpan` to group matches per page, use `fallback: 'page'` to prevent unmatched pages from merging with adjacent segments:
-
-```typescript
-const segments = segmentPages(pages, {
-  rules: [{
-    template: '{{tarqim}}', // Match punctuation marks
-    split: 'after',
-    occurrence: 'last',
-    maxSpan: 1,
-    fallback: 'page' // If no punctuation found, segment the page anyway
   }]
 });
 ```
 
-**Without `fallback`**: Pages without matches merge into the next segment
-**With `fallback: 'page'`**: Each page becomes its own segment even without matches
-
-> **Future extensions**: The `fallback` option may support additional values like `'skip'` (omit unmatched content) or `'line'` (split at line breaks) in future versions.
-
 ### Multiple Rules with Priority
 
 ```typescript
@@ -912,9 +1052,7 @@ type SplitRule = {
   // Split behavior
   split?: 'at' | 'after'; // Default: 'at'
   occurrence?: 'first' | 'last' | 'all';
-  maxSpan?: number;
   fuzzy?: boolean;
-  fallback?: 'page'; // NEW: Page-boundary fallback
 
   // Constraints
   min?: number;
@@ -1042,28 +1180,6 @@ The library concatenates all pages into a single string for pattern matching acr
 
 For typical book processing (up to 6,000 pages), memory usage is well within Node.js defaults. For very large books (40,000+ pages), ensure adequate heap size.
 
-### `maxSpan` Sliding Window Behavior
-
-The `maxSpan` option uses a **sliding window algorithm** based on page ID difference:
-
-```typescript
-// maxSpan = maximum page ID difference when looking ahead for split points
-// Algorithm prefers LONGER segments by looking as far ahead as allowed
-
-// Pages [1, 2, 3, 4] with maxSpan: 1, occurrence: 'last'
-// Window from page 1: pages 1-2 (diff <= 1), splits at page 2's last match
-// Window from page 3: pages 3-4 (diff <= 1), splits at page 4's last match
-// Result: 2 segments spanning pages 1-2 and 3-4
-
-// Pages [1, 5, 10] with maxSpan: 1, occurrence: 'last'
-// Window from page 1: only page 1 (5-1=4 > 1), splits at page 1
-// Window from page 5: only page 5 (10-5=5 > 1), splits at page 5
-// Window from page 10: only page 10, splits at page 10
-// Result: 3 segments (pages too far apart to merge)
-```
-
-This is intentional for books where page IDs represent actual page numbers. With `occurrence: 'last'`, the algorithm finds the last match within the lookahead window, creating longer segments where possible.
-
 ## For AI Agents
 
 See [AGENTS.md](./AGENTS.md) for: