flappa-doormal 2.17.1 → 2.18.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +16 -39
- package/README.md +91 -62
- package/dist/index.d.mts +196 -73
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +976 -1148
- package/dist/index.mjs.map +1 -1
- package/package.json +9 -9
package/AGENTS.md
CHANGED
@@ -53,7 +53,6 @@ src/
 ├── utils/            # Low-level helpers
 │   └── textUtils.ts  # Diacritics, Unicode, and bracket escaping
 ├── index.ts          # Public barrel exports
-├── recovery.ts       # Mistaken stripping recovery logic
 ├── detection.ts      # Pattern auto-detection (standalone)
 └── *.test.ts         # Unit and integration tests (co-located)
 
|
@@ -64,19 +63,14 @@ src/
    - Returns array of `{content, from, to?, meta?}` segments
    - Orchestrates rule matching, optimization, and breakpoint processing
 
-2. **`
-   - Use when a client mistakenly used `lineStartsAfter` where `lineStartsWith` was intended
-   - Deterministic mode reruns segmentation with selected rules converted to `lineStartsWith` and merges recovered `content` back into the provided segments
-   - Optional `mode: 'best_effort_then_rerun'` attempts a conservative anchor-based recovery first, then falls back to rerun for unresolved segments
-
-3. **`tokens.ts`** - Template system (`src/segmentation/tokens.ts`)
+2. **`tokens.ts`** - Template system (`src/segmentation/tokens.ts`)
    - `TOKEN_PATTERNS` - Map of token names to regex patterns
    - `expandTokensWithCaptures()` - Expands `{{token:name}}` syntax
    - `shouldDefaultToFuzzy()` - Auto-enables fuzzy matching for `bab`, `basmalah`, `fasl`, `kitab`, `naql`
    - `applyTokenMappings()` - Applies named captures (`{{token:name}}`) to raw templates
    - `stripTokenMappings()` - Strips named captures (reverts to `{{token}}`)
 
-
+3. **`rule-regex.ts`** - Rule compiler (`src/segmentation/rule-regex.ts`)
    - `buildRuleRegex()` - Compiles various rule types to executable RegExp
    - `processPattern()` - Token expansion, auto-escaping, and fuzzy application
    - `extractNamedCaptureNames()` - Extract metadata field names from raw regex
@@ -463,48 +457,33 @@ bunx biome lint .
 
 13. **ASCII vs Arabic-Indic Numerals**: While most classical Arabic texts use Arabic-Indic digits (`٠-٩`), modern digitizers often mix them with ASCII digits (`0-9`). Providing separate tokens (`{{raqms}}` for Arabic and `{{nums}}` for ASCII) allows better precision in rule definitions while keeping patterns readable. Always check which digit set is used in the source text before authoring rules.
 
-### For Future AI Agents (
-
-1. **`lineStartsAfter` vs `lineStartsWith` is not “cosmetic”**: `lineStartsAfter` changes output by stripping the matched marker via an internal `contentStartOffset` during segment construction. If a client used it by accident, you cannot reconstruct the exact stripped prefix from output alone without referencing the original pages and re-matching the marker.
-
-2. **Page joining differs between matching and output**:
-   - Matching always happens on pages concatenated with `\n` separators.
-   - Output segments may normalize page boundaries (`pageJoiner: 'space' | 'newline'`) and breakpoints post-processing uses its own join normalization utilities.
-   Recovery code must be explicit about which representation it’s searching.
-
-3. **Breakpoints can produce “pieces” that were never marker-stripped**: When `maxPages` + `breakpoints` are enabled, only the piece that starts at the original structural boundary could have lost a marker due to `lineStartsAfter`. Mid-segment breakpoint pieces should not be “recovered” unless you can anchor them confidently.
+### For Future AI Agents (Repo gotchas)
 
-
+1. **Fuzzy defaults are easy to miss**: Some tokens auto-enable fuzzy matching unless `fuzzy: false` is set (`bab`, `basmalah`, `fasl`, `kitab`, `naql`). If you are validating markers or re-matching prefixes, use the same compilation path as segmentation (`buildRuleRegex` / `processPattern`) so diacritics and token expansion behave identically.
 
-
+2. **Auto-escaping applies to template-like patterns**: `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` auto-escape `()[]` outside `{{tokens}}`. Raw `regex` does not. If you compare patterns by string equality, be careful about escaping and whitespace.
 
-
+3. **TypeScript union pitfalls with `SplitRule`**: `SplitRule` is a union where only one pattern type should exist. Avoid mutating rules in-place with `delete` on fields (TS often narrows unions and then complains). Prefer rebuilding converted rules via destructuring (e.g. `{ lineStartsAfter, ...rest }` then create `{...rest, lineStartsWith: lineStartsAfter}`).
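The destructuring rebuild recommended for `SplitRule` above can be sketched as follows. This is a minimal stand-in for the library's union type (the real `SplitRule` has many more members); only the rebuild pattern itself is the point:

```typescript
// Simplified stand-in for the library's SplitRule union (illustrative only;
// the real type carries split, min/max, exclude, and more).
type LineStartsWithRule = { lineStartsWith: string[]; meta?: Record<string, unknown> };
type LineStartsAfterRule = { lineStartsAfter: string[]; meta?: Record<string, unknown> };
type SplitRule = LineStartsWithRule | LineStartsAfterRule;

// Rebuild the rule instead of mutating it with `delete`, which TypeScript
// rejects once the union has been narrowed.
function toLineStartsWith(rule: SplitRule): LineStartsWithRule {
    if ('lineStartsWith' in rule) return rule;
    const { lineStartsAfter, ...rest } = rule;
    return { ...rest, lineStartsWith: lineStartsAfter };
}

const converted = toLineStartsWith({ lineStartsAfter: ['وروى '] });
// converted.lineStartsWith[0] === 'وروى '; no lineStartsAfter key remains
```

Because the converted object is built fresh, the compiler never sees a half-mutated rule that belongs to neither arm of the union.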
 
-
+4. **Biome lint constraints shape implementation**: The repo enforces low function complexity. Expect to extract helpers to keep Biome happy. Also, Biome can flag regex character-class usage as misleading; prefer alternation (e.g. `(?:\\u200C|\\u200D|\\uFEFF)`) when removing specific codepoints.
 
-
-   - `src/segmentation/segmenter.ts` (how content is sliced/trimmed and how `from/to` are computed)
-   - `src/segmentation/rule-regex.ts` + `src/segmentation/tokens.ts` (token expansion + fuzzy behavior)
-   - `src/preprocessing/transforms.ts` (preprocessing transforms: removeZeroWidth, condenseEllipsis, fixTrailingWaw)
-   - `src/recovery.ts` (recovery implementation)
+5. **Prefer library utilities for UI tasks**: Instead of re-implementing rule merging, validation, or token mapping in client code, use `optimizeRules`, `validateRules`/`formatValidationReport`, and `applyTokenMappings`. They handle edge cases (like duplicate patterns, regex safety, or diacritic handling) that ad-hoc implementations might miss.
 
-
+6. **Safety Fallback (Search-back)**: When forced to split at a hard character limit, searching backward for whitespace/punctuation (`[\s\n.,;!?؛،۔]`) prevents word-chopping and improves readability significantly.
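The search-back fallback in item 6 can be sketched like this. `findSafeSplit` is a hypothetical helper, not the library's actual function; it only demonstrates the backward scan over the character class quoted above:

```typescript
// Sketch of a search-back safe split: when forced to cut at a hard limit,
// walk backwards within a small window looking for whitespace/punctuation
// so the cut lands between words instead of inside one.
const SAFE_BREAK = /[\s\n.,;!?؛،۔]/;

function findSafeSplit(text: string, hardLimit: number, searchWindow = 30): number {
    if (text.length <= hardLimit) return text.length;
    const floor = Math.max(0, hardLimit - searchWindow);
    for (let i = hardLimit; i > floor; i--) {
        if (SAFE_BREAK.test(text[i])) return i + 1; // split just after the break char
    }
    return hardLimit; // no safe break nearby; fall back to the hard limit
}

const sample = 'alpha beta gamma delta';
const cut = findSafeSplit(sample, 13); // hard limit lands inside "gamma"
// sample.slice(0, cut) === 'alpha beta '  (the word stays intact)
```

The bounded `searchWindow` keeps the scan O(1) per split, so the fallback stays cheap even on very long segments.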
 
-
+7. **Unicode Boundary Safety (Surrogates + Graphemes)**: Multi-byte characters (like emojis) can be corrupted if split in the middle of a surrogate pair. Similarly, Arabic diacritics (combining marks), ZWJ/ZWNJ, and variation selectors can be orphaned if a hard split lands in the middle of a grapheme cluster. Use `adjustForUnicodeBoundary` when forced to hard-split near a limit.
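The surrogate-pair half of this gotcha can be illustrated with a simplified sketch. This is not the library's `adjustForUnicodeBoundary` (which, per the note above, also considers combining marks, joiners, and variation selectors, e.g. via grapheme segmentation); it handles only the UTF-16 surrogate case:

```typescript
// Simplified sketch: if a split index lands on a low surrogate, the preceding
// code unit must be its high surrogate, so back off by one to keep the pair
// (and thus the code point) intact.
function adjustForSurrogates(text: string, index: number): number {
    const code = text.charCodeAt(index);
    if (code >= 0xdc00 && code <= 0xdfff) return index - 1; // low surrogate
    return index;
}

const text = 'ab😀cd'; // '😀' occupies two UTF-16 code units (indices 2 and 3)
adjustForSurrogates(text, 3); // → 2, keeps the emoji whole
adjustForSurrogates(text, 2); // → 2, already a safe boundary
```

A full grapheme-aware version would additionally walk back over combining marks (`\p{M}`), ZWJ/ZWNJ, and variation selectors, or delegate to `Intl.Segmenter` with `granularity: 'grapheme'`.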
 
-
+8. **Recursion/Iteration Safety**: Using a progress-based guard (comparing `cursorPos` before and after loop iteration) is safer than fixed iteration limits for supporting arbitrary-sized content without truncation risks.
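The progress-based guard can be sketched as below. `chunkByLimit` is a hypothetical helper (the real code applies the same idea inside breakpoint processing); the point is the `before`/`cursorPos` comparison replacing a fixed iteration cap:

```typescript
// Sketch of a progress-based loop guard: instead of capping iterations at an
// arbitrary number (which silently truncates huge inputs), bail out only if
// an iteration fails to advance the cursor.
function chunkByLimit(content: string, limit: number): string[] {
    const pieces: string[] = [];
    let cursorPos = 0;
    while (cursorPos < content.length) {
        const before = cursorPos;
        const end = Math.min(cursorPos + limit, content.length);
        pieces.push(content.slice(cursorPos, end));
        cursorPos = end;
        if (cursorPos === before) {
            // A zero-progress iteration means a logic bug; fail loudly rather
            // than spin forever or drop content.
            throw new Error('No progress made; aborting to avoid an infinite loop');
        }
    }
    return pieces;
}
```

Any input size terminates in exactly `ceil(length / limit)` iterations, and a regression that stops advancing the cursor surfaces as an error instead of a hang.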
 
-
+9. **Accidental File Overwrites**: Be extremely careful when using tools like `replace_file_content` with large ranges. Verify file integrity frequently (e.g., `git diff`) to catch accidental deletions of existing code or tests. Merging new tests into existing files is a high-risk operation for AI agents.
 
-
+10. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible formatting marks like Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), Arabic Letter Mark (`U+061C`), Zero-Width Space (`U+200B`), Zero-Width Non-Joiner (`U+200C`), Zero-Width Joiner (`U+200D`), or BOM (`U+FEFF`). These can appear at line starts after `\n` but before visible characters, breaking `^` anchored patterns. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\u200C\u200D\uFEFF]*(?:pattern)`. The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
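The failure mode and the fix from item 10 can be reproduced with plain regexes (a standalone demonstration, not the library's `buildLineStartsWithRegexSource`):

```typescript
// An invisible mark after '\n' defeats a '^'-anchored pattern; an optional
// zero-width character class in front restores the match.
const ZW_PREFIX = '[\\u200E\\u200F\\u061C\\u200B\\u200C\\u200D\\uFEFF]*';

const naive = /^باب /m;
const guarded = new RegExp(`^${ZW_PREFIX}(?:باب )`, 'mu');

const page = 'سطر أول\n\u200Fباب الصلاة'; // an RLM sneaks in after the newline

naive.test(page);   // false — the RLM sits between '^' and 'باب'
guarded.test(page); // true  — the optional class consumes the RLM
```

Note the class is quantified with `*`, so lines without any invisible marks still match exactly as before.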
 
-
+11. **Large Segment Performance & Debugging Strategy**: When processing large books (1000+ pages), avoid O(n²) algorithms. The library uses a fast-path threshold (1000 pages) to switch from accurate string-search boundary detection to cumulative-offset-based slicing. Even on the iterative path (e.g. debug mode), we **slice only the active window (+padding)** per iteration (never `fullContent.slice(cursorPos)`), to avoid quadratic allocation/GC churn. To diagnose performance bottlenecks: (1) Look for logs with "Using iterative path" or "Using accurate string-search path" with large `pageCount` values, (2) Check `iterations` count in completion logs, (3) Strategic logs are placed at operation boundaries (start/end) NOT inside tight loops to avoid log-induced performance regression.
 
-
+12. **`maxPages=0` is a hard invariant**: When `maxPages=0`, breakpoint windows must never scan beyond the current page boundary. Relying purely on boundary detection (string search) can fail near page ends for long Arabic text + space joiners, letting the window “see” into the next page and creating multi-page segments. The safe fix is to clamp the breakpoint window to the current page’s end using `boundaryPositions` in breakpoint processing.
 
-
-
-17. **`''` breakpoint semantics depend on whether the window is page-bounded vs length-bounded**: `''` means “page boundary fallback”, but it’s intentionally **mode-dependent**:
+13. **`''` breakpoint semantics depend on whether the window is page-bounded vs length-bounded**: `''` means “page boundary fallback”, but it’s intentionally **mode-dependent**:
 
    - **Page-bounded window (maxPages-driven)**: `''` should “swallow the remainder of the current page” (i.e. break at the **next page boundary**, not at an arbitrary character limit). This prevents accidentally consuming part of the next page when no other breakpoint patterns match.
    - **Length-bounded window (maxContentLength-driven)**: `''` should **not** force an early page-boundary break. In this mode we want the best split *near the length limit* (safe-break fallback → Unicode-safe hard split) even if that means a piece can cross a page boundary.
@@ -805,5 +784,3 @@ Key functions: `applyBreakpoints()` → `processOversizedSegment()` → `findBre
 ## Known Issues
 
 - **Binary Search Gap (Theoretical)**: `findBoundaryIdForOffset` returns `undefined` if the search offset falls exactly on a joiner character (e.g., a space or newline) between two pages. This is mathematically correct (the gap belongs to neither page) but may cause validation errors if a segment consists _only_ of such a gap or matches content starting/ending strictly within the gap. We have marked this as "accept" behavior for now, with a documented skipped test case.
-
-
package/README.md
CHANGED

@@ -135,7 +135,7 @@ Replace regex with readable tokens:
 | `{{num}}` | Single ASCII digit | `\\d` |
 | `{{dash}}` | Dash variants | `[-–—ـ]` |
 | `{{harf}}` | Arabic letter | `[أ-ي]` |
-| `{{harfs}}` | Single-letter codes separated by spaces |
+| `{{harfs}}` | Single-letter codes separated by spaces, with optional marks/tatweel on each isolated letter | e.g. `د ت س ي ق`, `هـ ث` |
 | `{{rumuz}}` | Source abbreviations (rijāl/takhrīj rumuz), incl. multi-code blocks | e.g. `خت ٤`, `خ سي`, `خ فق`, `د ت سي ق`, `دت عس ق` |
 | `{{numbered}}` | Hadith numbering `٢٢ - ` | `{{raqms}} {{dash}} ` |
 | `{{fasl}}` | Section markers | `فصل\|مسألة` |
@@ -323,6 +323,93 @@ const segments = segmentPages(pages, {
 
 This guard applies **only at page starts**. Mid-page line starts are unaffected.
 
+#### Previous-Word Page-Start Stoplist
+
+For dictionary-like content, page wraps can split a phrase across pages and create
+false positives at the top of the next page. Example:
+
+- Page N ends with `قال`
+- Page N+1 starts with `العجاج:`
+
+Use `pageStartPrevWordStoplist` to suppress page-start matches when the previous
+page's last Arabic word is in a stoplist. Matching is Arabic-normalized and
+diacritic-insensitive.
+
+```typescript
+const segments = segmentPages(pages, {
+    rules: [{
+        regex: '^(?<lemma>[ء-غف-ي]+):',
+        split: 'at',
+        pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال']
+    }]
+});
+```
+
+If the previous page ends with strong sentence punctuation (`.`, `!`, `?`, `؟`, `؛`),
+the stoplist guard is skipped and the page-start match is allowed.
+
+#### Arabic Dictionary Helper
+
+Use `createArabicDictionaryEntryRule()` to build a conservative rule for Arabic
+dictionaries with lemma capture, stopword filtering, and page-wrap protection:
+
+```typescript
+import { createArabicDictionaryEntryRule, segmentPages } from 'flappa-doormal';
+
+const rule = createArabicDictionaryEntryRule({
+    stopWords: ['وقيل', 'ويقال', 'قال', 'العجاج', 'أخاك'],
+    pageStartPrevWordStoplist: ['قال', 'وقيل', 'ويقال'],
+    samePagePrevWordStoplist: ['جل'],
+    // Optional dictionary-specific shapes:
+    allowParenthesized: true,          // e.g. (عنبر) :
+    allowWhitespaceBeforeColon: true,  // e.g. عنبر :
+    allowCommaSeparated: true,         // e.g. سبد، دبس:
+});
+
+const segments = segmentPages(pages, { rules: [rule] });
+```
+
+Behavior:
+- Keeps the lemma marker in `segment.content`
+- Stores the matched lemma in `segment.meta.lemma`
+- Matches root entries at true line/page starts like `عز:` and `لع:`
+- Matches mid-line subentries conservatively when they begin with `و`
+- Can match parenthesized headwords like `(عنبر) :` when enabled
+- Can match comma-separated headword lists like `سبد، دبس:` when enabled
+- Can suppress same-page false positives like `جلّ وعزّ:` with `samePagePrevWordStoplist`
+
+#### Dictionary Letter-Code Lines
+
+For dictionary-specific letter-code lines like `ك ش ن` or `(هـ ث)`, use
+`{{harfs}}` and decide the metadata shape in client code:
+
+```typescript
+import { getTokenPattern, segmentPages } from 'flappa-doormal';
+
+const harfCodes = getTokenPattern('harfs').replaceAll('\\s+', '[ \\t]+');
+
+const segments = segmentPages(pages, {
+    rules: [{
+        regex: `^(?:\\((?<huruf>${harfCodes})\\)|(?<huruf>${harfCodes}))$`,
+        split: 'at',
+        meta: { type: 'C' },
+    }],
+});
+```
+
+Here `huruf` is just a named capture group chosen by the client, not a built-in
+regex primitive.
+
+This client-side rule can be used for:
+- chapter-adjacent code lines like `(هـ ث)`
+- consecutive bare code lines like `س ط ب` then `س د ر`
+
+The `replaceAll('\\s+', '[ \\t]+')` step is intentional:
+- `{{harfs}}` itself uses `\s+`
+- but when embedding it in a raw full-line regex, horizontal whitespace is usually
+  safer than unrestricted `\s+`, because it prevents accidental matching across
+  newlines.
+
 
 ### 5. Auto-Escaping Brackets
 
 In `lineStartsWith`, `lineStartsAfter`, `lineEndsWith`, and `template` patterns, parentheses `()` and square brackets `[]` are **automatically escaped**. This means you can write intuitive patterns without manual escaping:
@@ -1382,66 +1469,6 @@ const options: SegmentationOptions = {
 const segments: Segment[] = segmentPages(pages, options);
 ```
 
-### Marker recovery (when `lineStartsAfter` was used by accident)
-
-If you accidentally used `lineStartsAfter` for markers that should have been preserved (e.g. Arabic connective phrases like `وروى` / `وذكر`), you can recover those missing prefixes from existing segments.
-
-#### `recoverMistakenLineStartsAfterMarkers(pages, segments, options, selector, opts?)`
-
-This function returns new segments with recovered `content` plus a `report` describing what happened.
-
-**Recommended (deterministic) mode**: rerun segmentation with selected rules converted to `lineStartsWith`, then merge recovered content back.
-
-```ts
-import { recoverMistakenLineStartsAfterMarkers, segmentPages } from 'flappa-doormal';
-
-const pages = [{ id: 1, content: 'وروى أحمد\nوذكر خالد' }];
-const options = { rules: [{ lineStartsAfter: ['وروى '] }, { lineStartsAfter: ['وذكر '] }] };
-
-const segments = segmentPages(pages, options);
-// segments[0].content === 'أحمد' (marker stripped)
-
-const { segments: recovered, report } = recoverMistakenLineStartsAfterMarkers(
-    pages,
-    segments,
-    options,
-    { type: 'rule_indices', indices: [0] }, // recover only the first rule
-);
-
-// recovered[0].content === 'وروى أحمد'
-// recovered[1].content === 'خالد' (unchanged)
-console.log(report.summary);
-```
-
-**Optional**: best-effort anchoring mode attempts to recover without rerunning first, then falls back to rerun for unresolved segments:
-
-```ts
-const { segments: recovered } = recoverMistakenLineStartsAfterMarkers(
-    pages,
-    segments,
-    options,
-    { type: 'rule_indices', indices: [0] },
-    { mode: 'best_effort_then_rerun' }
-);
-```
-
-Notes:
-- Recovery is **explicitly scoped** by the `selector`; it will not “guess” which rules are mistaken.
-- If your segments were heavily post-processed (trimmed/normalized/reordered), recovery may return unresolved items; see the report for details.
-
-#### `recoverMistakenMarkersForRuns(runs, opts?)`
-
-Batch version of `recoverMistakenLineStartsAfterMarkers`. Processes multiple independent segmentation runs (e.g. from different books) and returns a consolidated report.
-
-```typescript
-import { recoverMistakenMarkersForRuns } from 'flappa-doormal';
-
-const results = recoverMistakenMarkersForRuns([
-    { pages: pages1, segments: segments1, options: options1, selector: selector1 },
-    { pages: pages2, segments: segments2, options: options2, selector: selector2 },
-]);
-```
-
 ### `validateSegments(pages, options, segments, validationOptions?)`
 
 Validates that segments correctly map back to the source pages and adhere to constraints.
@@ -1635,7 +1662,9 @@ type SplitRule = {
     min?: number;
     max?: number;
     exclude?: (number | [number, number])[]; // Single page or [start, end] range
-
+    pageStartGuard?: string;
+    pageStartPrevWordStoplist?: string[];
+    samePagePrevWordStoplist?: string[];
     meta?: Record<string, unknown>;
 };
 ```