flappa-doormal 2.11.1 → 2.11.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +6 -0
- package/README.md +7 -0
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +214 -51
- package/dist/index.mjs.map +1 -1
- package/package.json +3 -2
package/AGENTS.md
CHANGED
@@ -400,6 +400,8 @@ bunx biome lint .
 
 12. **Prefix matching fails with duplicated content**: When using `indexOf()` to find page boundaries by matching prefixes, false positives occur when pages have identical prefixes AND content is duplicated within pages. Solution: use cumulative byte offsets as the source of truth for expected boundaries, and only accept prefix matches within a strict deviation threshold (2000 chars). When content-based detection fails, fall back directly to the calculated offset rather than returning `remainingContent.length` (which merges all remaining pages).
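The guarded-fallback logic in that note can be sketched as follows (a minimal illustration; `findBoundary`, `MAX_DEVIATION`, and the inputs are hypothetical names, not the library's actual API):

```javascript
// Accept an indexOf() prefix match only when it lands near the boundary
// predicted by cumulative offsets; otherwise trust the offset arithmetic.
const MAX_DEVIATION = 2000;

const findBoundary = (remainingContent, pagePrefix, expectedBoundary) => {
  const matched = remainingContent.indexOf(pagePrefix);
  if (matched !== -1 && Math.abs(matched - expectedBoundary) <= MAX_DEVIATION) {
    return matched; // content-based detection agrees with the offsets
  }
  // Fall back to the calculated offset, never to remainingContent.length.
  return Math.min(expectedBoundary, remainingContent.length);
};
```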
 
+13. **ASCII vs Arabic-Indic Numerals**: While most classical Arabic texts use Arabic-Indic digits (`٠-٩`), modern digitizers often mix them with ASCII digits (`0-9`). Providing separate tokens (`{{raqms}}` for Arabic and `{{nums}}` for ASCII) allows better precision in rule definitions while keeping patterns readable. Always check which digit set is used in the source text before authoring rules.
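A quick illustration of why the two digit sets need separate tokens (the character classes mirror the token table; the snippet itself is illustrative):

```javascript
// `\d` in JavaScript only covers ASCII 0-9, so Arabic-Indic digits need
// their own explicit range.
const raqms = /[\u0660-\u0669]+/; // Arabic-Indic digits, e.g. ٧٥٦٣
const nums = /\d+/;               // ASCII digits, e.g. 123

raqms.test("٧٥٦٣"); // true
nums.test("٧٥٦٣");  // false: \d does not match Arabic-Indic digits
```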
+
 ### For Future AI Agents (Recovery + Repo gotchas)
 
 1. **`lineStartsAfter` vs `lineStartsWith` is not “cosmetic”**: `lineStartsAfter` changes output by stripping the matched marker via an internal `contentStartOffset` during segment construction. If a client used it by accident, you cannot reconstruct the exact stripped prefix from output alone without referencing the original pages and re-matching the marker.
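A hypothetical sketch of that behavioural difference (not the library's code path; the internal `contentStartOffset` is approximated here with a plain slice):

```javascript
// Two ways to start a segment at a heading marker.
const page = "### باب الإيمان\nنص الصفحة";
const marker = /^### /m;

// lineStartsWith: the segment keeps the matched marker text.
const startsWith = page.slice(page.search(marker));

// lineStartsAfter: the matched marker itself is stripped from the output,
// so the original prefix is unrecoverable from the segment alone.
const m = page.match(marker);
const startsAfter = page.slice(m.index + m[0].length);
```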
@@ -439,6 +441,8 @@ bunx biome lint .
 
 15. **Invisible Unicode Marks Break Regex Anchors**: Arabic text often contains invisible bidirectional formatting marks like Left-to-Right Mark (`U+200E`), Right-to-Left Mark (`U+200F`), or Arabic Letter Mark (`U+061C`). These appear at line starts after `\n` but before visible characters, breaking `^` anchored patterns. Solution: include an optional zero-width character class prefix in line-start patterns: `^[\u200E\u200F\u061C\u200B\uFEFF]*(?:pattern)`. The library now handles this automatically in `buildLineStartsWithRegexSource` and `buildLineStartsAfterRegexSource`.
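The gotcha can be demonstrated in a few lines (the character class is the one quoted in the note):

```javascript
// A Right-to-Left Mark (U+200F) sits between \n and the visible text,
// so a plain ^-anchored pattern silently fails to match.
const line = "\u200Fباب الصلاة";
const anchored = /^باب/.test(line);                                  // false
const tolerant = /^[\u200E\u200F\u061C\u200B\uFEFF]*باب/.test(line); // true
```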
 
+16. **Large Segment Performance & Debugging Strategy**: When processing large books (1000+ pages), avoid O(n²) algorithms. The library uses a fast-path threshold (1000 pages) to switch from accurate string-search boundary detection to cumulative-offset-based slicing. To diagnose performance bottlenecks: (1) Look for logs with "Using iterative path" or "Using accurate string-search path" with large `pageCount` values, (2) Check `iterations` count in completion logs, (3) Strategic logs are placed at operation boundaries (start/end) NOT inside tight loops to avoid log-induced performance regression.
+
 ### Process Template (Multi-agent design review, TDD-first)
 
 If you want to repeat the “write a plan → get multiple AI critiques → synthesize → update plan → implement TDD-first” workflow, use:
@@ -467,6 +471,8 @@ If you want to repeat the “write a plan → get multiple AI critiques → synt
 | `{{kitab}}` | "كتاب" (book) | كتاب الصلاة |
 | `{{raqm}}` | Single Arabic-Indic numeral | ٥ |
 | `{{raqms}}` | Multiple Arabic-Indic numerals | ٧٥٦٣ |
+| `{{num}}` | Single ASCII numeral | 5 |
+| `{{nums}}` | Multiple ASCII numerals | 123 |
 | `{{raqms:num}}` | Numerals with named capture | `meta.num = "٧٥٦٣"` |
 | `{{dash}}` | Various dash characters | - – — ـ |
 | `{{harfs}}` | Single-letter codes separated by spaces | `د ت س ي ق` |
package/README.md
CHANGED
@@ -102,6 +102,8 @@ Replace regex with readable tokens:
 |-------|---------|------------------|
 | `{{raqms}}` | Arabic-Indic digits | `[\\u0660-\\u0669]+` |
 | `{{raqm}}` | Single Arabic digit | `[\\u0660-\\u0669]` |
+| `{{nums}}` | ASCII digits | `\\d+` |
+| `{{num}}` | Single ASCII digit | `\\d` |
 | `{{dash}}` | Dash variants | `[-–—ـ]` |
 | `{{harf}}` | Arabic letter | `[أ-ي]` |
 | `{{harfs}}` | Single-letter codes separated by spaces | `[أ-ي](?:\s+[أ-ي])*` |
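A minimal sketch of how such tokens could expand into regex sources (the `TOKENS` values mirror this table; the replacement mechanics here are an assumption, not the library's implementation):

```javascript
// Illustrative token table and expander; unknown tokens pass through unchanged.
const TOKENS = {
  num: "\\d",
  nums: "\\d+",
  raqm: "[\\u0660-\\u0669]",
  raqms: "[\\u0660-\\u0669]+",
  dash: "[-–—ـ]",
};

const expandTemplate = (template) =>
  template.replace(/\{\{(\w+)\}\}/g, (match, name) => TOKENS[name] ?? match);
```

For example, `expandTemplate("{{raqms}} {{dash}}")` yields a regex source matching Arabic-Indic digits followed by any dash variant.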
@@ -776,6 +778,8 @@ Available tokens you may use in templates:
 - {{naql}} (حدثنا/أخبرنا/... narration phrases)
 - {{raqm}} (single Arabic-Indic digit)
 - {{raqms}} (Arabic-Indic digits)
+- {{num}} (single ASCII digit)
+- {{nums}} (ASCII digits)
 - {{dash}} (dash variants)
 - {{tarqim}} (punctuation [. ! ? ؟ ؛])
 - {{harf}} (Arabic letter)
@@ -1324,3 +1328,6 @@ bun run deploy
 
 MIT
 
+## Inspiration
+
+The name of the project comes from Asmāʾ; it appears to be some sort of gymnastic move.
package/dist/index.d.mts.map
CHANGED
@@ -1 +1 @@
-
{"version":3,"file":"index.d.mts","names":[],"sources":["../src/segmentation/fuzzy.ts","../src/segmentation/types.ts","../src/segmentation/optimize-rules.ts","../src/segmentation/pattern-validator.ts","../src/segmentation/replace.ts","../src/segmentation/segmenter.ts","../src/segmentation/tokens.ts","../src/analysis/line-starts.ts","../src/analysis/repeating-sequences.ts","../src/detection.ts","../src/recovery.ts"],"sourcesContent":[],"mappings":";;AAkEA;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBY,cDnUC,WCmUG,EAAA,CAAA,CAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAoChB;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2DA;AAAyC,cJ0F5B,wBI1F4B,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;AJLzC;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;KArVK,YAAA,GAqViD;EAAe;EAkBzD,KAAA,EAAI,MAAA;AAoChB,CAAA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;KF0BK,eAAA,GExBkB;EACH;EACL,QAAA,EAAA,MAAA;CAAe;AA0G9B;AAwDA;;;;AC1LA;AA2DA;;;;;;;;ACyPA;;;;;;;;AC5PA;
+
{"version":3,"file":"index.d.mts","names":[],"sources":["../src/segmentation/fuzzy.ts","../src/segmentation/types.ts","../src/segmentation/optimize-rules.ts","../src/segmentation/pattern-validator.ts","../src/segmentation/replace.ts","../src/segmentation/segmenter.ts","../src/segmentation/tokens.ts","../src/analysis/line-starts.ts","../src/analysis/repeating-sequences.ts","../src/detection.ts","../src/recovery.ts"],"sourcesContent":[],"mappings":";;AAkEA;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBY,cDnUC,WCmUG,EAAA,CAAA,CAAA,EAAA,MAAA,EAAA,GAAA,MAAA;AAoChB;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2DA;AAAyC,cJ0F5B,wBI1F4B,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;AJLzC;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;KArVK,YAAA,GAqViD;EAAe;EAkBzD,KAAA,EAAI,MAAA;AAoChB,CAAA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;KF0BK,eAAA,GExBkB;EACH;EACL,QAAA,EAAA,MAAA;CAAe;AA0G9B;AAwDA;;;;AC1LA;AA2DA;;;;;;;;ACyPA;;;;;;;;AC5PA;AAwRA;AAsDA;AA2CA,KLrWK,qBAAA,GKwWJ;EAQW;EAuKC,cAAA,EAAA,MAAA,EAAA;AA6Cb,CAAA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;AC7xBA;AAcA;AAEA;AAwQA;;;;;;;;AClRA;AAaA;AAOA;AA2OA;;;;KPjKK,sBAAA,GOoKsB;;;;ACjR3B;AA+EA;AAgEA;AAuBA;AAiCA;;;;AC7MA;AAKA;;;;;;AAOA;AA2BE;AAmnBF;;KT/gBK,mBAAA,GSihBS;EACD;EACC,YAAA,EAAA,MAAA,EAAA;CAGa;;;;AA+C3B;;;;;;;KTtjBK,WAAA,GACC,eACA,kBACA,wBACA,yBACA;;;;;;;;;;;;;;;;cAiBO;;;;;;KAOD,cAAA,WAAyB;;;;;;;KAYhC,aAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAiDO,SAAA;;;;;;;KAYP,eAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAyCS;;;;;;;;;;;;SAaH;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAuDC,SAAA,GAAY,cAAc,gBAAgB;;;;;;;;;;;;;KAkB1C,IAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAoCA,cAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;YAqCE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;KAqCF,UAAA,YAAsB;;;;;;;;;;;;;;;;;;;;;;;;;UA8BjB,MAAA;;;;;;;;;;;;;;;;;;;;;;KAuBL,WAAA;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;KA+CA,mBAAA;;;;;;YAME;;;;;;;;UASF;;;;;;;;;;;;;;cAiBY;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;gBA4DN;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;WAwDL;;;;;;;;;;;;;;;;KAiBD,OAAA;;;;;;;;;;;;;;;;;;;;;;;;;;SA6BD;;;;AAxoBa;;;AAkBlB,KChJM,cAAA,GDgJN;EACA;EACA,KAAA,EChJK,SDgJL,EAAA;EAAmB;EAiBZ,WAAA,EAAA,MAAA;AAOb,CAAA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;cAAa,uBAAwB,gBAAc;;;AD3FlC;AA4BG;AA8BM;AAyDrB,KEpIO,mBAAA,GFoIY,gBAAA,GAAA,eAAA,GAAA,WAAA,GAAA,eAAA;AAAA;;;AAkBlB,KEjJM,eAAA,GFiJN;EACA,IAAA,EEjJI,mBFiJJ;EACA,OAAA,EAAA,MAAA;EAAmB,UAAA,CAAA,EAAA,MAAA;EAiBZ;EAOD,KAAA,CAAA,EAAA,MAAA;EAYP;EAiDO,OAAA,CAAA,EAAA,MAAS;AAA6B,CAAA;AAyHlD;;;;AAAqE,KEnVzD,oBAAA,GFmVyD;EAkBzD,cAAI,CAAA,EAAA,CEpWM,eFoWN,GAAA,SAAA,CAAA,EAAA;EAoCJ,eAAA,CAAA,EAAc,CEvYH,eF4aT,GAAA,SAAS,CAAA,EAAA;EAqCX,YAAA,CAAU,EAAA,CEhdF,eFgdc,GAAA,SAAc,CAAA,EAAA;EA8B/B,QAAA,CAAM,EE7eR,eF6eQ;AAuBvB,CAAA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;AACsB,cA6GT,aA7GS,EAAA,CAAA,KAAA,EA6Ge,SA7Gf,EAAA,EAAA,GAAA,CA6G8B,oBA7G9B,GAAA,SAAA,CAAA,EAAA;;;;;AA6GtB;AAwDA;;;;AC1LA;AA2DA;;;AAAyE,cD+H5D,sBC/H4D,EAAA,CAAA,OAAA,EAAA,CD+HxB,oBC/HwB,GAAA,SAAA,CAAA,EAAA,EAAA,GAAA,MAAA,EAAA;;;AJLzE;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAuCtB,KGpJO,WAAA,GAAc,WHoJV,CGpJsB,mBHoJtB,CAAA,SAAA,CAAA,CAAA,CAAA,MAAA,CAAA;;;;;;;AAsBhB;AAOA;AAAgE;AA6DhE;AAYK,cG/LQ,iBHwOC,EAAA,CAAA,KAaH,EGrP8B,IHqPxB,EAAA,EAAA,KAAA,CAAA,EGrPwC,WHqPxC,EAAA,EAAA,GGrPwD,IHqPxD,EAAA;;;;;;;AAyEjB;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2DA;;;AAAyE,cCyP5D,YDzP4D,EAAA,CAAA,KAAA,ECyPrC,IDzPqC,EAAA,EAAA,OAAA,ECyPpB,mBDzPoB,EAAA,GCyPD,ODzPC,EAAA;;;;AJLzE;AA+FA;;;;;ACnIiB;AA4BG;AA8BM;AAiCC;AAwBH;;;;;;;AAqCxB;AAOA;AAAgE;AA6DhE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;;;;;;;AAqKA;;;;ACzuBA;AAsGA;;;;AC5GA;AAKA;AAcA;;;;;;AA8GA;AAwDA;;;;AC1LA;AA2Da,cEHA,sBFyBZ,EAAA,CAAA,OAAA,EAAA,MAAA,EAAA,GAAA,MAAA;;;;;;;;ACmOD;;;;;;
;cC4Ba;AAxRb;AAwRA;AAsDA;AA2CA;AAWA;AAuKA;AA6CA;AAuBA;AAqBA;AAgBA;AA8BA;AAWA;AAoBA;AA6BA;;;;AC7xBA;AAcA;AAEA;AAwQA;;;;;cDqGa,gBAAgB;;;AEvX7B;AAaA;AAOA;AA2OA;;;;;;;;AC9QA;AA+EA;AAgEa,cHkSA,cGnRZ,EAAA,CAAA,KAfgE,EAAA,MAAA,EAAA,GAAA,OAAe;AAuBhF;AAiCA;;;;AC7MA;AAKY,KJ6bA,YAAA,GI7biB;EAChB;;;;;EAMD,OAAA,EAAA,MAAA;EA6BP;AAinBL;;;;EAIc,YAAA,EAAA,MAAA,EAAA;EAGa;;;;AA+C3B;EACU,WAAA,EAAA,OAAA;CACsE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;cJzGnE,mHAIV;;;;;;;;;;;;;;;;;;;;cAyCU;;;;;;;;;;;;;;;;;;;;;;cAuBA,uCAAmC;;;;;;;;;;;;;cAqBnC;;;;;;;;;;;;;;;cAgBA;;;;;;;;;;;;;;;cA8BA;;;;KAWD,YAAA;;;;;;;;;;;;;;;;;;;;;;cAoBC,iDAAkD;;;;;;;;;;;;;;;;cA6BlD;;;ANhvBA,KO7CD,wBAAA,GP6C8E;EA+F7E,IAAA,CAAA,EAAA,MAAA;;;;ECnIR,WAAA,CAAA,EAAA,MAAY;EA4BZ,wBAAe,CAAA,EAAA,OAAA;EA8Bf,yBAAqB,CAAA,EAAA,OAAA;EAiCrB,MAAA,CAAA,EAAA,aAAA,GAAsB,OAAA;EAwBtB,UAAA,CAAA,EAAA,CAAA,IAAA,EAAA,MAAmB,EAAA,MAAA,EAAA,MAAA,EAAA,GAAA,OAAA;EAenB,cAAW,CAAA,EMjIK,MNiIL,EAAA;EACV,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;CACA;AACA,KMhIM,uBAAA,GNgIN;EACA,IAAA,EAAA,MAAA;EACA,MAAA,EAAA,MAAA;CAAmB;AAiBZ,KMjJD,sBAAA,GNiJwG;EAOxG,OAAA,EAAA,MAAA;EAYP,KAAA,EAAA,MAAA;EAiDO,QAAA,EMlNE,uBNkNO,EAAA;AAA6B,CAAA;AAyHlD;;;AAAsD,cMtEzC,uBNsEyC,EAAA,CAAA,KAAA,EMrE3C,INqE2C,EAAA,EAAA,OAAA,CAAA,EMpEzC,wBNoEyC,EAAA,GMnEnD,sBNmEmD,EAAA;;;AAlOjD,KOtHO,wBAAA,GPsHY;EAenB,WAAA,CAAA,EAAW,MAAA;EACV,WAAA,CAAA,EAAA,MAAA;EACA,QAAA,CAAA,EAAA,MAAA;EACA,IAAA,CAAA,EAAA,MAAA;EACA,yBAAA,CAAA,EAAA,OAAA;EACA,YAAA,CAAA,EAAA,OAAA;EAAmB,UAAA,CAAA,EAAA,OAAA,GAAA,OAAA;EAiBZ,WAAA,CAAA,EAAA,MAAuG;EAOxG,YAAA,CAAA,EAAA,MAAc;EAYrB,iBAAa,CAAA,EAAA,MAAA;AAiDlB,CAAA;AAYK,KO9NO,wBAAA,GPuQE;EAoEF,IAAA,EAAA,MAAS;EAAG,OAAA,EAAA,MAAA;EAAc,MAAA,EAAA,MAAA;EAAgB,YAAA,EAAA,MAAA,EAAA;CAAe;AAkBzD,KOtVA,wBAAA,GPsVI;EAoCJ,OAAA,EAAA,MAAA;EA0EA,KAAA,EAAA,MAAU;EA8BL,QAAA,EO/dH,wBP+dS,EAAA;AAuBvB,CAAA;;;;;;AAoNA;cOlea,mCACF,kBACG,6BACX;;;;AR3NH;AA+FA;;;;;ACnIiB;AA4BG;AA+Df,KQ7GO,eAAA,GR6Ge;EAwBtB;EAeA,KAAA,EAAA,MAAA;EACC;EACA,KAAA,EAAA,MAAA;EACA;EACA,KAAA,EAAA,MAAA;EACA;EAAmB,QAAA,EAAA,MAAA;AAiBzB,CAAA;AAOA;AAAgE;AA6D
hE;AAAkD;AAyHlD;;;;;AAkBA;AAoCA;AA0EA;AA8BA;AAuBA;AA+CA;AAMc,cQlgBD,mBRkgBC,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GQlgBkC,eRkgBlC,EAAA;;;;;;AA+Jd;;;;ACzuBA;AAsGA;;;cOkCa,mDAAoD;AN9IjE;AAKA;AAcA;;;;AAIe,cM8IF,oBN9IE,EAAA,CAAA,QAAA,EM+ID,eN/IC,EAAA,EAAA,GAAA;EAAe,WAAA,EAAA,gBAAA,GAAA,iBAAA;EA0GjB,KAAA,EAAA,OAAA;EAwDA,QAAA,CAAA,EAAA,MAAA;;;;AC1Lb;AA2DA;;;AAAyE,cK4I5D,kBL5I4D,EAAA,CAAA,IAAA,EAAA,MAAA,EAAA,GAAA;EAAI,QAAA,EAAA,MAAA;;;;ECyPhE,QAAA,EItGC,eJgLb,EAAA;CA1EmC,GAAA,IAAA;;;AL9PvB,KU5DD,sBAAA,GV4D8E;EA+F7E,IAAA,EAAA,cAAA;;;;ECnIR,KAAA,CAAA,EAAA,OAAY,GAAA,YAAA;EA4BZ,QAAA,EAAA,MAAA,EAAe;AAAA,CAAA,GA8Bf;EAiCA,IAAA,EAAA,WAAA;EAwBA,SAAA,EAAA,CAAA,IAAA,ESxIwC,STwIrB,EAAA,KAAA,EAAA,MAAA,EAAA,GAAA,OAAA;AAAA,CAAA;AAgBlB,KStJM,iBAAA,GTsJN;EACA,OAAA,EStJO,mBTsJP;EACA,KAAA,EStJK,ITsJL,EAAA;EACA,QAAA,EStJQ,OTsJR,EAAA;EACA,QAAA,EStJQ,sBTsJR;CAAmB;AAiBZ,KSpKD,oBAAA,GToKwG;EAOxG,OAAA,EAAA;IAYP,IAAA,EAAA,YAAa,GAAA,wBAAA;IAiDN,SAAS,EAAA,MAAA;IAYhB,aAAA,EAAe,MAAA;IA6GR,SAAS,EAAA,MAAA;IAAG,UAAA,EAAA,MAAA;EAAc,CAAA;EAAgB,KAAA,CAAA,ESzV1C,KTyV0C,CAAA;IAAe,SAAA,EAAA,MAAA;IAkBzD,QAAI,EAAA,MAAA;IAoCJ,aAAc,EAAA,MAAA;IA0Ed,UAAU,EAAA,MAAA;EA8BL,CAAA,CAAA;EAuBL,OAAA,ESxgBC,KTwgBU,CAAA;IA+CX,IAAA,EAAA,MAAA;IAME,oBAAA,EAAA,MAAA;IASF,sBAAA,CAAA,EAAA,MAAA;IAiBY,qBAAA,CAAA,EAAA,MAAA;IA4DN,YAAA,EAAA,MAAA;IAwDL,MAAA,EAAA,WAAA,GAAA,oBAAA,GAAA,WAAA,GAAA,sBAAA,GAAA,qBAAA;IAAM,QAAA,EAAA,OAAA,GAAA,QAAA,GAAA,MAAA;IAiBP,EAAA,CAAA,EAAA,MAAO;;;;ECzuBP,QAAA,EAAA,MAAA,EAAc;AAsG1B,CAAA;KQ1EK,oBAAA;iBAinBW,qCAAA,QACL,kBACG,oBACD,+BACC,4BPlpBd;;EALY,gBAAA,CAAA,EO0pBe,oBP1pBI;AAK/B,CAAA,CAAA,EAAY;EAcA,MAAA,EOyoBC,oBPzoBmB;EACV,QAAA,EOwoBuB,OPxoBvB,EAAA;CACC;AACH,iBOmrBJ,6BAAA,CPnrBI,IAAA,EOorBV,iBPprBU,EAAA,EAAA,IA2GpB,CA3GoB,EAAA;EACL,IAAA,CAAA,EAAA,YAAA,GAAA,wBAAA;EAAe,gBAAA,CAAA,EOorBkD,oBPprBlD;AA0G9B,CAAA,CAAA,EAAa;EAwDA,MAAA,EOmhBA,oBP1eZ;YO0e4C"}
package/dist/index.mjs
CHANGED
@@ -423,6 +423,8 @@ const BASE_TOKENS = {
 "وحدثني",
 "وحدثنيه"
 ].join("|"),
+num: "\\d",
+nums: "\\d+",
 raqm: "[\\u0660-\\u0669]",
 raqms: "[\\u0660-\\u0669]+",
 rumuz: RUMUZ_BLOCK,
@@ -1043,8 +1045,43 @@ const applyReplacements = (pages, rules) => {
 });
 };
 
+//#endregion
+//#region src/segmentation/breakpoint-constants.ts
+/**
+ * Shared constants for segmentation breakpoint processing.
+ */
+/**
+ * Threshold for using offset-based fast path in boundary processing.
+ *
+ * Below this: accurate string-search (handles offset drift from structural rules).
+ * At or above this: O(n) arithmetic (performance critical for large books).
+ *
+ * The value of 1000 is chosen based on typical Arabic book sizes:
+ * - Sahih al-Bukhari: ~1000-3000 pages
+ * - Standard hadith collections: 1000-7000 pages
+ * - Large aggregated corpora: 10k-50k pages
+ *
+ * For segments ≥1000 pages, the performance gain from offset-based slicing
+ * outweighs the minor accuracy loss from potential offset drift.
+ *
+ * @remarks
+ * Fast path is skipped when:
+ * - `maxContentLength` is set (requires character-accurate splitting)
+ * - `debugMetaKey` is set (requires proper provenance tracking)
+ * - Content was structurally modified by marker stripping (offsets may drift)
+ */
+const FAST_PATH_THRESHOLD = 1e3;
+
 //#endregion
 //#region src/segmentation/breakpoint-utils.ts
+/**
+ * Utility functions for breakpoint processing in the segmentation engine.
+ *
+ * These functions handle breakpoint normalization, page exclusion checking,
+ * and segment creation. Extracted for independent testing and reuse.
+ *
+ * @module breakpoint-utils
+ */
 const WINDOW_PREFIX_LENGTHS = [
 80,
 60,
@@ -1357,6 +1394,31 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
 */
 const buildBoundaryPositions = (segmentContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger) => {
 const boundaryPositions = [0];
+const pageCount = toIdx - fromIdx + 1;
+if (pageCount >= FAST_PATH_THRESHOLD) {
+logger?.debug?.("[breakpoints] Using fast-path for large segment in buildBoundaryPositions", {
+fromIdx,
+pageCount,
+toIdx
+});
+const baseOffset = cumulativeOffsets[fromIdx] ?? 0;
+for (let i = fromIdx + 1; i <= toIdx; i++) {
+const offset = cumulativeOffsets[i];
+if (offset !== void 0) {
+const boundary = Math.max(0, offset - baseOffset);
+const prevBoundary = boundaryPositions[boundaryPositions.length - 1];
+boundaryPositions.push(Math.max(prevBoundary + 1, Math.min(boundary, segmentContent.length)));
+}
+}
+boundaryPositions.push(segmentContent.length);
+return boundaryPositions;
+}
+logger?.debug?.("[breakpoints] buildBoundaryPositions: Using accurate string-search path", {
+contentLength: segmentContent.length,
+fromIdx,
+pageCount,
+toIdx
+});
 const startOffsetInFromPage = estimateStartOffsetInCurrentPage(segmentContent, fromIdx, pageIds, normalizedPages);
 for (let i = fromIdx + 1; i <= toIdx; i++) {
 const expectedBoundary = cumulativeOffsets[i] !== void 0 && cumulativeOffsets[fromIdx] !== void 0 ? Math.max(0, cumulativeOffsets[i] - cumulativeOffsets[fromIdx] - startOffsetInFromPage) : segmentContent.length;
@@ -1369,6 +1431,7 @@ const buildBoundaryPositions = (segmentContent, fromIdx, toIdx, pageIds, normali
 }
 }
 boundaryPositions.push(segmentContent.length);
+logger?.debug?.("[breakpoints] buildBoundaryPositions: Complete", { boundaryCount: boundaryPositions.length });
 return boundaryPositions;
 };
 /**
@@ -1719,6 +1782,127 @@ const skipWhitespace$1 = (content, startPos) => {
 return pos;
 };
 /**
+ * Validates that cumulative offsets match actual content length within a tolerance.
+ * Required to detect if structural rules (like `lineStartsAfter`) have stripped content
+ * which would make offset-based calculations inaccurate.
+ */
+const checkFastPathAlignment = (cumulativeOffsets, fullContent, fromIdx, toIdx, pageCount, logger) => {
+const expectedLength = (cumulativeOffsets[toIdx + 1] ?? fullContent.length) - (cumulativeOffsets[fromIdx] ?? 0);
+const actualLength = fullContent.length;
+const driftTolerance = Math.max(100, actualLength * .01);
+const isAligned = Math.abs(expectedLength - actualLength) <= driftTolerance;
+if (!isAligned && pageCount >= FAST_PATH_THRESHOLD) logger?.warn?.("[breakpoints] Offset drift detected in fast-path candidate, falling back to slow path", {
+actualLength,
+drift: Math.abs(expectedLength - actualLength),
+expectedLength,
+pageCount
+});
+return isAligned;
+};
+/**
+ * Handles the special optimized case for maxPages=0 (1 page per segment).
+ * This is O(n) and safer than offset arithmetic as it uses source pages directly.
+ */
+const processTrivialFastPath = (fromIdx, toIdx, pageIds, normalizedPages, pageCount, originalMeta, debugMetaKey, logger) => {
+logger?.debug?.("[breakpoints] Using trivial per-page fast-path (maxPages=0)", {
+fromIdx,
+pageCount,
+toIdx
+});
+const result = [];
+for (let i = fromIdx; i <= toIdx; i++) {
+const pageData = normalizedPages.get(pageIds[i]);
+if (pageData?.content.trim()) {
+const meta = getSegmentMetaWithDebug(i === fromIdx, debugMetaKey, originalMeta, null);
+const seg = createSegment(pageData.content.trim(), pageIds[i], void 0, meta);
+if (seg) result.push(seg);
+}
+}
+return result;
+};
+/**
+ * Handles fast-path segmentation for maxPages > 0 using cumulative offsets.
+ * Avoids O(n²) string searching but requires accurate offsets.
+ */
+const processOffsetFastPath = (fullContent, fromIdx, toIdx, pageIds, cumulativeOffsets, maxPages, originalMeta, debugMetaKey, logger) => {
+const result = [];
+const effectiveMaxPages = maxPages + 1;
+const pageCount = toIdx - fromIdx + 1;
+logger?.debug?.("[breakpoints] Using offset-based fast-path for large segment", {
+effectiveMaxPages,
+fromIdx,
+maxPages,
+pageCount,
+toIdx
+});
+const baseOffset = cumulativeOffsets[fromIdx] ?? 0;
+for (let segStart = fromIdx; segStart <= toIdx; segStart += effectiveMaxPages) {
+const segEnd = Math.min(segStart + effectiveMaxPages - 1, toIdx);
+const startOffset = Math.max(0, (cumulativeOffsets[segStart] ?? 0) - baseOffset);
+const endOffset = segEnd < toIdx ? Math.max(0, (cumulativeOffsets[segEnd + 1] ?? fullContent.length) - baseOffset) : fullContent.length;
+const rawContent = fullContent.slice(startOffset, endOffset).trim();
+if (rawContent) {
+const meta = getSegmentMetaWithDebug(segStart === fromIdx, debugMetaKey, originalMeta, null);
+const seg = {
+content: rawContent,
+from: pageIds[segStart]
+};
+if (segEnd > segStart) seg.to = pageIds[segEnd];
+if (meta) seg.meta = meta;
+result.push(seg);
+}
+}
+return result;
+};
+/**
+ * Checks if the remaining content fits within paged/length limits.
+ * If so, pushes the final segment and returns true.
+ */
+const handleOversizedSegmentFit = (remainingContent, currentFromIdx, toIdx, pageIds, expandedBreakpoints, maxPages, maxContentLength, isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint, result) => {
+const remainingSpan = computeRemainingSpan(currentFromIdx, toIdx, pageIds);
+const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
+const fitsInPages = remainingSpan <= maxPages;
+const fitsInLength = !maxContentLength || remainingContent.length <= maxContentLength;
+if (fitsInPages && fitsInLength && !remainingHasExclusions) {
+const includeMeta = isFirstPiece || Boolean(debugMetaKey);
+const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, getSegmentMetaWithDebug(isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint), includeMeta);
+if (finalSeg) result.push(finalSeg);
+return true;
+}
+return false;
+};
+/**
+ * Builds metadata for a segment piece, optionally including debug info.
+ */
+const getSegmentMetaWithDebug = (isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint) => {
+if (!(isFirstPiece || Boolean(debugMetaKey))) return;
+if (debugMetaKey && lastBreakpoint) return mergeDebugIntoMeta(isFirstPiece ? originalMeta : void 0, debugMetaKey, buildBreakpointDebugPatch(lastBreakpoint.breakpointIndex, lastBreakpoint.rule));
+return isFirstPiece ? originalMeta : void 0;
+};
+/**
+ * Calculates window end position, capped by maxContentLength if present.
+ */
+const getWindowEndPosition = (remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, maxContentLength, logger) => {
+let windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger);
+if (maxContentLength && maxContentLength < windowEndPosition) windowEndPosition = maxContentLength;
+return windowEndPosition;
+};
+/**
+ * Advances cursorPos and currentFromIdx for the next iteration.
+ */
+const advanceCursorAndIndex = (fullContent, breakPos, actualEndIdx, toIdx, pageIds, normalizedPages) => {
+const nextCursorPos = skipWhitespace$1(fullContent, breakPos);
+return {
+currentFromIdx: computeNextFromIdx(fullContent.slice(nextCursorPos, nextCursorPos + 500), actualEndIdx, toIdx, pageIds, normalizedPages),
+cursorPos: nextCursorPos
+};
+};
+/**
+ * Applies breakpoints to oversized segments.
+ *
+ * Note: This is an internal engine used by `segmentPages()`.
+ */
+/**
 * Processes an oversized segment by iterating through the content and
 * breaking it into smaller pieces that fit within maxPages constraints.
 *
@@ -1727,6 +1911,20 @@ const skipWhitespace$1 = (content, startPos) => {
 const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey, maxContentLength) => {
 const result = [];
 const fullContent = segment.content;
+const pageCount = toIdx - fromIdx + 1;
+const isAligned = checkFastPathAlignment(cumulativeOffsets, fullContent, fromIdx, toIdx, pageCount, logger);
+if (pageCount >= FAST_PATH_THRESHOLD && isAligned && !maxContentLength && !debugMetaKey) {
+if (maxPages === 0) return processTrivialFastPath(fromIdx, toIdx, pageIds, normalizedPages, pageCount, segment.meta, debugMetaKey, logger);
+return processOffsetFastPath(fullContent, fromIdx, toIdx, pageIds, cumulativeOffsets, maxPages, segment.meta, debugMetaKey, logger);
+}
+logger?.debug?.("[breakpoints] processOversizedSegment: Using iterative path", {
+contentLength: fullContent.length,
+fromIdx,
+maxContentLength,
+maxPages,
+pageCount,
+toIdx
+});
 let cursorPos = 0;
 let currentFromIdx = fromIdx;
 let isFirstPiece = true;
@@ -1742,12 +1940,13 @@ const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPag
 const MAX_SAFE_ITERATIONS = 1e5;
 while (cursorPos < fullContent.length && currentFromIdx <= toIdx && i < MAX_SAFE_ITERATIONS) {
 i++;
-const
+const safeSliceLen = maxContentLength ? maxContentLength + 4e3 : void 0;
+const remainingContent = safeSliceLen ? fullContent.slice(cursorPos, cursorPos + safeSliceLen) : fullContent.slice(cursorPos);
 if (!remainingContent.trim()) break;
 if (handleOversizedSegmentFit(remainingContent, currentFromIdx, toIdx, pageIds, expandedBreakpoints, maxPages, maxContentLength, isFirstPiece, debugMetaKey, segment.meta, lastBreakpoint, result)) break;
 const windowEndIdx = computeWindowEndIdx(currentFromIdx, toIdx, pageIds, maxPages);
 const windowEndPosition = getWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, maxContentLength, logger);
-logger?.
+logger?.trace?.(`[breakpoints] iteration=${i}`, {
 currentFromIdx,
 cursorPos,
 windowEndIdx,
@@ -1784,57 +1983,12 @@ const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPag
 fullContentLength: fullContent.length,
 iterations: i
 });
-logger?.debug?.("[breakpoints]
+logger?.debug?.("[breakpoints] processOversizedSegment: Complete", {
+iterations: i,
+resultCount: result.length
+});
 return result;
 };
-/**
- * Checks if the remaining content fits within paged/length limits.
- * If so, pushes the final segment and returns true.
- */
-const handleOversizedSegmentFit = (remainingContent, currentFromIdx, toIdx, pageIds, expandedBreakpoints, maxPages, maxContentLength, isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint, result) => {
-const remainingSpan = computeRemainingSpan(currentFromIdx, toIdx, pageIds);
-const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
-const fitsInPages = remainingSpan <= maxPages;
-const fitsInLength = !maxContentLength || remainingContent.length <= maxContentLength;
-if (fitsInPages && fitsInLength && !remainingHasExclusions) {
-const includeMeta = isFirstPiece || Boolean(debugMetaKey);
-const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, getSegmentMetaWithDebug(isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint), includeMeta);
-if (finalSeg) result.push(finalSeg);
-return true;
-}
-return false;
-};
-/**
- * Builds metadata for a segment piece, optionally including debug info.
- */
-const getSegmentMetaWithDebug = (isFirstPiece, debugMetaKey, originalMeta, lastBreakpoint) => {
-if (!(isFirstPiece || Boolean(debugMetaKey))) return;
-if (debugMetaKey && lastBreakpoint) return mergeDebugIntoMeta(isFirstPiece ? originalMeta : void 0, debugMetaKey, buildBreakpointDebugPatch(lastBreakpoint.breakpointIndex, lastBreakpoint.rule));
-return isFirstPiece ? originalMeta : void 0;
-};
-/**
- * Calculates window end position, capped by maxContentLength if present.
- */
-const getWindowEndPosition = (remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, maxContentLength, logger) => {
-let windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger);
-if (maxContentLength && maxContentLength < windowEndPosition) windowEndPosition = maxContentLength;
-return windowEndPosition;
-};
-/**
- * Advances cursorPos and currentFromIdx for the next iteration.
- */
-const advanceCursorAndIndex = (fullContent, breakPos, actualEndIdx, toIdx, pageIds, normalizedPages) => {
-const nextCursorPos = skipWhitespace$1(fullContent, breakPos);
-return {
-currentFromIdx: computeNextFromIdx(fullContent.slice(nextCursorPos), actualEndIdx, toIdx, pageIds, normalizedPages),
-cursorPos: nextCursorPos
-};
-};
-/**
- * Applies breakpoints to oversized segments.
- *
- * Note: This is an internal engine used by `segmentPages()`.
- */
 const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner = "space", debugMetaKey, maxContentLength) => {
 const pageIds = pages.map((p) => p.id);
 const pageIdToIndex = buildPageIdToIndexMap(pageIds);
@@ -1865,6 +2019,15 @@ const applyBreakpoints = (segments, pages, normalizedContent, maxPages, breakpoi
 result.push(segment);
 continue;
 }
+logger?.debug?.("[breakpoints] Processing oversized segment", {
+contentLength: segment.content.length,
+from: segment.from,
+hasExclusions,
+pageSpan: toIdx - fromIdx + 1,
+reasonFitsInLength: fitsInLength,
+reasonFitsInPages: fitsInPages,
+to: segment.to
+});
 const broken = processOversizedSegment(segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger, debugMetaKey, maxContentLength);
 result.push(...broken.map((s) => {
 const segFromIdx = pageIdToIndex.get(s.from) ?? -1;