flappa-doormal 2.6.1 → 2.6.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +9 -13
- package/README.md +40 -3
- package/dist/index.mjs +243 -179
- package/dist/index.mjs.map +1 -1
- package/package.json +3 -3
package/AGENTS.md
CHANGED
@@ -41,23 +41,13 @@ src/
 ├── match-utils.ts # Extracted match processing utilities
 ├── segmenter.test.ts # Core test suite (150+ tests including breakpoints)
 ├── segmenter.bukhari.test.ts # Real-world test cases
-├── breakpoint-utils.test.ts # Breakpoint utility tests (
+├── breakpoint-utils.test.ts # Breakpoint utility tests (55 tests)
 ├── rule-regex.test.ts # Rule regex builder tests
 ├── segmenter-utils.test.ts # Segmenter helper tests
 ├── tokens.test.ts # Token expansion tests
 ├── fuzzy.test.ts # Fuzzy matching tests
 ├── textUtils.test.ts # Text utility tests
 └── match-utils.test.ts # Utility function tests
-
-test/
-├── 2576.json # Test data for book 2576 (Sahih Bukhari)
-└── 2588.json # Test data for book 2588 (Al-Mughni)
-
-docs/
-├── checkpoints/ # AI agent handoff documentation
-│ └── 2025-12-09-handoff.md
-└── reviews/ # Performance analysis reports
-    └── 2025-12-10/
 ```
 
 ### Core Components
@@ -92,7 +82,9 @@ docs/
 - `buildExcludeSet()` - Create Set from PageRange[] for O(1) lookups
 - `createSegment()` - Create segment with optional to/meta fields
 - `expandBreakpoints()` - Expand patterns with pre-compiled regexes
-- `
+- `buildBoundaryPositions()` - Build position map of page boundaries for O(log n) lookups
+- `findPageIndexForPosition()` - Binary search to find page index for a character position
+- `estimateStartOffsetInCurrentPage()` - Estimate offset when segment starts mid-page
 - `findBreakpointWindowEndPosition()` - Compute window boundary in content-space (robust to marker stripping)
 - `applyPageJoinerBetweenPages()` - Normalize page-boundary join in output segments (`space` vs `newline`)
 - `findBreakPosition()` - Find break position using breakpoint patterns
@@ -306,7 +298,7 @@ The original `segmentPages` had complexity 37 (max: 15). Extraction:
 1. **TypeScript strict mode** - No `any` types
 2. **Biome linting** - Max complexity 15 per function (some exceptions exist)
 3. **JSDoc comments** - All exported functions documented
-4. **Test coverage** -
+4. **Test coverage** - 352 tests across 12 files
 
 ## Dependencies
 
@@ -360,6 +352,10 @@ bunx biome lint .
 
 9. **Auto-escaping improves DX significantly**: Users expect `(أ):` to match literal parentheses. Auto-escaping `()[]` in template patterns (but not `regex`) gives intuitive behavior while preserving power-user escape hatch.
 
+10. **Page boundary detection needs progressive prefixes**: When breakpoints split content mid-page, checking only the first N characters of a page to detect if the segment ends on that page can fail. Solution: try progressively shorter prefixes (`[80, 60, 40, 30, 20, 15, 12, 10, 8, 6]`) via `JOINER_PREFIX_LENGTHS`. The check uses `indexOf(...) > 0` (not `>= 0`) to avoid false positives when a page prefix appears at position 0 (which indicates the segment *starts* with that page, not *ends* on it).
+
+11. **Boundary-position algorithm improves page attribution**: Building a position map of page boundaries once per segment (O(n)) enables binary search for O(log n) lookups per piece. Key insight: when a segment starts mid-page (common after structural rules), expected boundary estimates must account for the offset into the starting page. Without this adjustment, position-based lookups can return the wrong page when pages have identical content prefixes.
+
 ### Architecture Insights
 
 - **Declarative > Imperative**: Users describe patterns, library handles regex
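Lessons 10 and 11 above describe the new page-attribution techniques in prose; the actual implementations appear in the `dist/index.mjs` diff below. A minimal sketch of the lesson-10 check, using only the `JOINER_PREFIX_LENGTHS` values quoted above; the helper name `segmentEndsOnPage` is hypothetical, written for illustration rather than taken from the package:

```ts
// Hypothetical illustration of lesson 10; not the library's actual helper.
// Try progressively shorter prefixes of a page's content until one is
// found inside the segment at a position strictly greater than 0.
const JOINER_PREFIX_LENGTHS = [80, 60, 40, 30, 20, 15, 12, 10, 8, 6];

const segmentEndsOnPage = (segmentContent: string, pageContent: string): boolean => {
    for (const length of JOINER_PREFIX_LENGTHS) {
        const prefix = pageContent.slice(0, length);
        if (!prefix) continue;
        // `> 0`, not `>= 0`: a hit at index 0 means the segment *starts*
        // with this page's text rather than ending on it.
        if (segmentContent.indexOf(prefix) > 0) return true;
    }
    return false;
};
```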
package/README.md
CHANGED
@@ -1,5 +1,20 @@
 # flappa-doormal
 
+<p align="center">
+<img src="icon.png" alt="flappa-doormal" width="128" height="128" />
+</p>
+
+<p align="center">
+<strong>Declarative Arabic text segmentation library</strong><br/>
+Split pages of content into logical segments using human-readable patterns.
+</p>
+
+<p align="center">
+<a href="https://flappa-doormal.surge.sh">🚀 <strong>Live Demo</strong></a> •
+<a href="https://www.npmjs.com/package/flappa-doormal">📦 npm</a> •
+<a href="https://github.com/ragaeeb/flappa-doormal">📚 GitHub</a>
+</p>
+
 [](https://wakatime.com/badge/user/a0b906ce-b8e7-4463-8bce-383238df6d4b/project/384fa29d-72e8-4078-980f-45d363f10507)
 [](https://github.com/ragaeeb/flappa-doormal/actions/workflows/build.yml)
 
@@ -12,8 +27,6 @@
 [](https://codecov.io/gh/ragaeeb/flappa-doormal)
 [](https://badge.fury.io/js/flappa-doormal)
 
-**Declarative Arabic text segmentation library** - Split pages of content into logical segments using human-readable patterns.
-
 ## Why This Library?
 
 ### The Problem
@@ -939,7 +952,7 @@ Complex logic is intentionally split into small, independently testable modules:
 
 - `src/segmentation/match-utils.ts`: match filtering + capture extraction
 - `src/segmentation/rule-regex.ts`: SplitRule → compiled regex builder (`buildRuleRegex`, `processPattern`)
-- `src/segmentation/breakpoint-utils.ts`: breakpoint windowing/exclusion helpers
+- `src/segmentation/breakpoint-utils.ts`: breakpoint windowing/exclusion helpers, page boundary join normalization, and progressive prefix page detection for accurate `from`/`to` attribution
 - `src/segmentation/breakpoint-processor.ts`: breakpoint post-processing engine (applies breakpoints after structural segmentation)
 
 ## Performance Notes
@@ -986,6 +999,30 @@ See [AGENTS.md](./AGENTS.md) for:
 - Algorithm explanations
 - Lessons learned during development
 
+## Demo
+
+An interactive demo is available at [flappa-doormal.surge.sh](https://flappa-doormal.surge.sh).
+
+The demo source code is located in the `demo/` directory and includes:
+- **Analysis**: Discover common line-start patterns in your text
+- **Pattern Detection**: Auto-detect tokens in text and get template suggestions
+- **Segmentation**: Apply rules and see segmented output with metadata
+
+To run the demo locally:
+
+```bash
+cd demo
+bun install
+bun run dev
+```
+
+To deploy updates:
+
+```bash
+cd demo
+bun run deploy
+```
+
 ## License
 
 MIT
package/dist/index.mjs
CHANGED
@@ -411,6 +411,72 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
     return -1;
 };
 /**
+ * Builds a boundary position map for pages within the given range.
+ *
+ * This function computes page boundaries once per segment and enables
+ * O(log n) page lookups via binary search with `findPageIndexForPosition`.
+ *
+ * Boundaries are derived from segmentContent (post-structural-rules).
+ * When the segment starts mid-page, an offset correction is applied to
+ * keep boundary estimates aligned with the segment's actual content space.
+ *
+ * @param segmentContent - Full segment content (already processed by structural rules)
+ * @param fromIdx - Starting page index
+ * @param toIdx - Ending page index
+ * @param pageIds - Array of all page IDs
+ * @param normalizedPages - Map of page ID to normalized content
+ * @param cumulativeOffsets - Cumulative character offsets (for estimates)
+ * @returns Array where boundaryPositions[i] = start position of page (fromIdx + i),
+ * with a sentinel boundary at segmentContent.length as the last element
+ *
+ * @example
+ * // For a 3-page segment:
+ * buildBoundaryPositions(content, 0, 2, pageIds, normalizedPages, offsets)
+ * // → [0, 23, 45, 67] where 67 is content.length (sentinel)
+ */
+const buildBoundaryPositions = (segmentContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets) => {
+    const boundaryPositions = [0];
+    const startOffsetInFromPage = estimateStartOffsetInCurrentPage(segmentContent, fromIdx, pageIds, normalizedPages);
+    for (let i = fromIdx + 1; i <= toIdx; i++) {
+        const expectedBoundary = cumulativeOffsets[i] !== void 0 && cumulativeOffsets[fromIdx] !== void 0 ? Math.max(0, cumulativeOffsets[i] - cumulativeOffsets[fromIdx] - startOffsetInFromPage) : segmentContent.length;
+        const pos = findPageStartNearExpectedBoundary(segmentContent, fromIdx, i, expectedBoundary, pageIds, normalizedPages);
+        const prevBoundary = boundaryPositions[boundaryPositions.length - 1];
+        if (pos > 0 && pos > prevBoundary && Math.abs(pos - expectedBoundary) < 2e3) boundaryPositions.push(pos);
+        else {
+            const estimate = Math.max(prevBoundary + 1, expectedBoundary);
+            boundaryPositions.push(Math.min(estimate, segmentContent.length));
+        }
+    }
+    boundaryPositions.push(segmentContent.length);
+    return boundaryPositions;
+};
+/**
+ * Binary search to find which page a position falls within.
+ * Uses "largest i where boundaryPositions[i] <= position" semantics.
+ *
+ * @param position - Character position in segmentContent
+ * @param boundaryPositions - Precomputed boundary positions (from buildBoundaryPositions)
+ * @param fromIdx - Base page index (boundaryPositions[0] corresponds to pageIds[fromIdx])
+ * @returns Page index in pageIds array
+ *
+ * @example
+ * // With boundaries [0, 20, 40, 60] and fromIdx=0:
+ * findPageIndexForPosition(15, boundaries, 0) // → 0 (first page)
+ * findPageIndexForPosition(25, boundaries, 0) // → 1 (second page)
+ * findPageIndexForPosition(40, boundaries, 0) // → 2 (exactly on boundary = that page)
+ */
+const findPageIndexForPosition = (position, boundaryPositions, fromIdx) => {
+    if (boundaryPositions.length <= 1) return fromIdx;
+    let left = 0;
+    let right = boundaryPositions.length - 2;
+    while (left < right) {
+        const mid = Math.ceil((left + right) / 2);
+        if (boundaryPositions[mid] <= position) left = mid;
+        else right = mid - 1;
+    }
+    return fromIdx + left;
+};
+/**
  * Finds the end position of a breakpoint window inside `remainingContent`.
  *
  * The window end is defined as the start of the page AFTER `windowEndIdx` (i.e. `windowEndIdx + 1`),
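To make the binary-search semantics concrete, here is a short walkthrough using only the boundary values from the `@example` blocks above; `findPageIndexForPosition` is the function added in this hunk:

```ts
// Boundaries [0, 20, 40, 60] describe a three-page segment where 60 is
// the sentinel at segmentContent.length, and fromIdx = 0.
const boundaries = [0, 20, 40, 60];

findPageIndexForPosition(15, boundaries, 0); // 0: largest i with boundaries[i] <= 15 is i = 0
findPageIndexForPosition(25, boundaries, 0); // 1: boundaries[1] = 20 <= 25 < 40
findPageIndexForPosition(40, boundaries, 0); // 2: a position exactly on a boundary belongs to the page that starts there
```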
@@ -446,57 +512,6 @@ const findExclusionBreakPosition = (currentFromIdx, windowEndIdx, toIdx, pageIds
     return -1;
 };
 /**
- * Finds the actual ending page index by searching backwards for page content prefix.
- * Used to determine which page a segment actually ends on based on content matching.
- *
- * @param pieceContent - Content of the segment piece
- * @param currentFromIdx - Current starting index in pageIds
- * @param toIdx - Maximum ending index to search
- * @param pageIds - Array of page IDs
- * @param normalizedPages - Map of page ID to normalized content
- * @returns The actual ending page index
- */
-const findActualEndPage = (pieceContent, currentFromIdx, toIdx, pageIds, normalizedPages) => {
-    for (let pi = toIdx; pi > currentFromIdx; pi--) {
-        const pageData = normalizedPages.get(pageIds[pi]);
-        if (pageData) {
-            const checkPortion = pageData.content.slice(0, Math.min(30, pageData.length));
-            if (checkPortion.length > 0 && pieceContent.indexOf(checkPortion) > 0) return pi;
-        }
-    }
-    return currentFromIdx;
-};
-/**
- * Finds the actual starting page index by searching forwards for page content prefix.
- * Used to determine which page content actually starts from based on content matching.
- *
- * This is the counterpart to findActualEndPage - it searches forward to find which
- * page the content starts on, rather than which page it ends on.
- *
- * @param pieceContent - Content of the segment piece
- * @param currentFromIdx - Current starting index in pageIds
- * @param toIdx - Maximum ending index to search
- * @param pageIds - Array of page IDs
- * @param normalizedPages - Map of page ID to normalized content
- * @returns The actual starting page index
- */
-const findActualStartPage = (pieceContent, currentFromIdx, toIdx, pageIds, normalizedPages) => {
-    const trimmedPiece = pieceContent.trimStart();
-    if (!trimmedPiece) return currentFromIdx;
-    for (let pi = currentFromIdx; pi <= toIdx; pi++) {
-        const pageData = normalizedPages.get(pageIds[pi]);
-        if (pageData) {
-            const pagePrefix = pageData.content.slice(0, Math.min(30, pageData.length)).trim();
-            const piecePrefix = trimmedPiece.slice(0, Math.min(30, trimmedPiece.length));
-            if (pagePrefix.length > 0) {
-                if (trimmedPiece.startsWith(pagePrefix)) return pi;
-                if (pageData.content.trimStart().startsWith(piecePrefix)) return pi;
-            }
-        }
-    }
-    return currentFromIdx;
-};
-/**
  * Checks if any page in a range is excluded by the given exclude set.
  *
  * @param excludeSet - Set of excluded page IDs
@@ -628,10 +643,22 @@ const computeWindowEndIdx = (currentFromIdx, toIdx, pageIds, maxPages) => {
 };
 const computeRemainingSpan = (currentFromIdx, toIdx, pageIds) => pageIds[toIdx] - pageIds[currentFromIdx];
 const createFinalSegment = (remainingContent, currentFromIdx, toIdx, pageIds, meta, includeMeta) => createSegment(remainingContent, pageIds[currentFromIdx], currentFromIdx !== toIdx ? pageIds[toIdx] : void 0, includeMeta ? meta : void 0);
-
-
+/**
+ * Computes the actual start and end page indices for a piece using
+ * precomputed boundary positions and binary search.
+ *
+ * @param pieceStartPos - Start position of the piece in the full segment content
+ * @param pieceEndPos - End position (exclusive) of the piece
+ * @param boundaryPositions - Precomputed boundary positions from buildBoundaryPositions
+ * @param baseFromIdx - Base page index (boundaryPositions[0] corresponds to pageIds[baseFromIdx])
+ * @param toIdx - Maximum page index
+ * @returns Object with actualStartIdx and actualEndIdx
+ */
+const computePiecePages = (pieceStartPos, pieceEndPos, boundaryPositions, baseFromIdx, toIdx) => {
+    const actualStartIdx = findPageIndexForPosition(pieceStartPos, boundaryPositions, baseFromIdx);
+    const endPos = Math.max(pieceStartPos, pieceEndPos - 1);
     return {
-        actualEndIdx:
+        actualEndIdx: Math.min(findPageIndexForPosition(endPos, boundaryPositions, baseFromIdx), toIdx),
         actualStartIdx
     };
 };
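Continuing the boundaries example from above, `computePiecePages` runs that lookup twice, once for the piece start and once for its last character (`pieceEndPos - 1`), so a piece whose exclusive end lands exactly on a boundary is not attributed to the following page:

```ts
// With boundaries [0, 20, 40, 60], baseFromIdx = 0, toIdx = 2:
// a piece spanning [15, 40) ends at character 39, which is still page 1.
const { actualEndIdx, actualStartIdx } = computePiecePages(15, 40, [0, 20, 40, 60], 0, 2);
// actualStartIdx === 0, actualEndIdx === 1
```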
@@ -648,79 +675,87 @@ const computeNextFromIdx = (remainingContent, actualEndIdx, toIdx, pageIds, norm
     return nextFromIdx;
 };
 const createPieceSegment = (pieceContent, actualStartIdx, actualEndIdx, pageIds, meta, includeMeta) => createSegment(pieceContent, pageIds[actualStartIdx], actualEndIdx > actualStartIdx ? pageIds[actualEndIdx] : void 0, includeMeta ? meta : void 0);
+/**
+ * Finds the break offset within a window, trying exclusions first, then patterns.
+ *
+ * @returns Break offset relative to remainingContent, or windowEndPosition as fallback
+ */
+const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer) => {
+    if (hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, windowEndIdx)) {
+        const exclusionBreak = findExclusionBreakPosition(currentFromIdx, windowEndIdx, toIdx, pageIds, expandedBreakpoints, cumulativeOffsets);
+        if (exclusionBreak > 0) return exclusionBreak;
+    }
+    const patternBreak = findBreakPosition(remainingContent, currentFromIdx, toIdx, windowEndIdx, windowEndPosition, {
+        expandedBreakpoints,
+        normalizedPages,
+        pageIds,
+        prefer
+    });
+    return patternBreak > 0 ? patternBreak : windowEndPosition;
+};
+/**
+ * Advances cursor position past any leading whitespace.
+ */
+const skipWhitespace = (content, startPos) => {
+    let pos = startPos;
+    while (pos < content.length && /\s/.test(content[pos])) pos++;
+    return pos;
+};
+/**
+ * Processes an oversized segment by iterating through the content and
+ * breaking it into smaller pieces that fit within maxPages constraints.
+ *
+ * Uses precomputed boundary positions for O(log n) page attribution lookups.
+ */
 const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger) => {
     const result = [];
-
+    const fullContent = segment.content;
+    let cursorPos = 0;
     let currentFromIdx = fromIdx;
     let isFirstPiece = true;
-
+    const boundaryPositions = buildBoundaryPositions(fullContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets);
+    logger?.debug?.("[breakpoints] boundaryPositions built", {
+        boundaryPositions,
+        fromIdx,
+        fullContentLength: fullContent.length,
+        toIdx
+    });
     const maxIterations = 1e4;
-
-
-        if (
-            logger?.error?.("INFINITE LOOP DETECTED! Breaking out, you should report this bug", { iterationCount: maxIterations });
-            break;
-        }
-        const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
+    for (let i = 0; i < maxIterations && cursorPos < fullContent.length && currentFromIdx <= toIdx; i++) {
+        const remainingContent = fullContent.slice(cursorPos);
+        if (!remainingContent.trim()) break;
         const remainingSpan = computeRemainingSpan(currentFromIdx, toIdx, pageIds);
+        const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
         if (remainingSpan <= maxPages && !remainingHasExclusions) {
            const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, segment.meta, isFirstPiece);
            if (finalSeg) result.push(finalSeg);
            break;
        }
        const windowEndIdx = computeWindowEndIdx(currentFromIdx, toIdx, pageIds, maxPages);
-        logger?.debug?.(`[breakpoints] iteration=${iterationCount}`, {
-            currentFromIdx,
-            currentFromPageId: pageIds[currentFromIdx],
-            remainingContentStart: remainingContent.slice(0, 50),
-            remainingContentLength: remainingContent.length,
-            remainingSpan,
-            toIdx,
-            toPageId: pageIds[toIdx],
-            windowEndIdx,
-            windowEndPageId: pageIds[windowEndIdx]
-        });
        const windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets);
-
-
-
-
-            expandedBreakpoints,
-            normalizedPages,
-            pageIds,
-            prefer
+        logger?.debug?.(`[breakpoints] iteration=${i}`, {
+            currentFromIdx,
+            cursorPos,
+            windowEndIdx
        });
-
-        const
-
-
-
-
+        const breakOffset = findBreakOffsetForWindow(remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer);
+        const breakPos = cursorPos + breakOffset;
+        const pieceContent = fullContent.slice(cursorPos, breakPos).trim();
+        const { actualEndIdx, actualStartIdx } = computePiecePages(cursorPos, breakPos, boundaryPositions, fromIdx, toIdx);
+        logger?.trace?.("[breakpoints] piece", {
+            actualEndIdx,
+            actualStartIdx,
+            pieceLength: pieceContent.length
        });
-        const { actualEndIdx, actualStartIdx } = computePiecePages(pieceContent, currentFromIdx, toIdx, windowEndIdx, pageIds, normalizedPages);
        if (pieceContent) {
            const pieceSeg = createPieceSegment(pieceContent, actualStartIdx, actualEndIdx, pageIds, segment.meta, isFirstPiece);
            if (pieceSeg) result.push(pieceSeg);
        }
-
-
-            actualEndIdx,
-            remainingContentLength: remainingContent.length,
-            remainingContentStart: remainingContent.slice(0, 60)
-        });
-        if (!remainingContent) {
-            logger?.debug?.("[breakpoints] done: no remaining content");
-            break;
-        }
-        currentFromIdx = computeNextFromIdx(remainingContent, actualEndIdx, toIdx, pageIds, normalizedPages);
-        logger?.debug?.("[breakpoints] nextIteration", {
-            currentFromIdx,
-            currentFromPageId: pageIds[currentFromIdx]
-        });
+        cursorPos = skipWhitespace(fullContent, breakPos);
+        currentFromIdx = computeNextFromIdx(fullContent.slice(cursorPos), actualEndIdx, toIdx, pageIds, normalizedPages);
        isFirstPiece = false;
    }
-    logger?.debug?.("[breakpoints]
+    logger?.debug?.("[breakpoints] done", { resultCount: result.length });
    return result;
 };
 /**
@@ -903,6 +938,77 @@ const anyRuleAllowsId = (rules, pageId) => {
     });
 };
 
+//#endregion
+//#region src/segmentation/replace.ts
+const DEFAULT_REPLACE_FLAGS = "gu";
+const normalizeReplaceFlags = (flags) => {
+    if (!flags) return DEFAULT_REPLACE_FLAGS;
+    const allowed = new Set([
+        "g",
+        "i",
+        "m",
+        "s",
+        "u",
+        "y"
+    ]);
+    const set = /* @__PURE__ */ new Set();
+    for (const ch of flags) {
+        if (!allowed.has(ch)) throw new Error(`Invalid replace regex flag: "${ch}" (allowed: gimsyu)`);
+        set.add(ch);
+    }
+    set.add("g");
+    set.add("u");
+    return [
+        "g",
+        "i",
+        "m",
+        "s",
+        "y",
+        "u"
+    ].filter((c) => set.has(c)).join("");
+};
+const compileReplaceRules = (rules) => {
+    const compiled = [];
+    for (const r of rules) {
+        if (r.pageIds && r.pageIds.length === 0) continue;
+        const flags = normalizeReplaceFlags(r.flags);
+        const re = new RegExp(r.regex, flags);
+        compiled.push({
+            pageIdSet: r.pageIds ? new Set(r.pageIds) : void 0,
+            re,
+            replacement: r.replacement
+        });
+    }
+    return compiled;
+};
+/**
+ * Applies ordered regex replacements to page content (per page).
+ *
+ * - Replacement rules are applied in array order.
+ * - Each rule is applied globally (flag `g` enforced) with unicode mode (flag `u` enforced).
+ * - `pageIds` can scope a rule to specific pages. `pageIds: []` skips the rule entirely.
+ *
+ * This function is intentionally **pure**:
+ * it returns a new pages array only when changes are needed, otherwise it returns the original pages.
+ */
+const applyReplacements = (pages, rules) => {
+    if (!rules || rules.length === 0 || pages.length === 0) return pages;
+    const compiled = compileReplaceRules(rules);
+    if (compiled.length === 0) return pages;
+    return pages.map((p) => {
+        let content = p.content;
+        for (const rule of compiled) {
+            if (rule.pageIdSet && !rule.pageIdSet.has(p.id)) continue;
+            content = content.replace(rule.re, rule.replacement);
+        }
+        if (content === p.content) return p;
+        return {
+            ...p,
+            content
+        };
+    });
+};
+
 //#endregion
 //#region src/segmentation/tokens.ts
 /**
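A short usage sketch of the relocated replace module, based on the rule shape visible in `compileReplaceRules` above (`regex`, `replacement`, optional `flags` and `pageIds`); the sample pages and rule values are invented for illustration:

```ts
// Hypothetical input pages.
const pages = [
    { content: "Chapter  1   intro", id: 1 },
    { content: "Chapter 2 body", id: 2 },
];

const result = applyReplacements(pages, [
    // Applied to every page, in array order; "g" and "u" are always enforced.
    { regex: "\\s+", replacement: " " },
    // pageIds scopes a rule to specific pages; pageIds: [] skips it entirely.
    { pageIds: [2], regex: "body", replacement: "main text" },
]);
// Pages no rule changed are returned by reference, so unchanged page
// objects keep their identity inside the new array.
```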
@@ -1480,77 +1586,6 @@ const buildRuleRegex = (rule, capturePrefix) => {
     };
 };
 
-//#endregion
-//#region src/segmentation/replace.ts
-const DEFAULT_REPLACE_FLAGS = "gu";
-const normalizeReplaceFlags = (flags) => {
-    if (!flags) return DEFAULT_REPLACE_FLAGS;
-    const allowed = new Set([
-        "g",
-        "i",
-        "m",
-        "s",
-        "u",
-        "y"
-    ]);
-    const set = /* @__PURE__ */ new Set();
-    for (const ch of flags) {
-        if (!allowed.has(ch)) throw new Error(`Invalid replace regex flag: "${ch}" (allowed: gimsyu)`);
-        set.add(ch);
-    }
-    set.add("g");
-    set.add("u");
-    return [
-        "g",
-        "i",
-        "m",
-        "s",
-        "y",
-        "u"
-    ].filter((c) => set.has(c)).join("");
-};
-const compileReplaceRules = (rules) => {
-    const compiled = [];
-    for (const r of rules) {
-        if (r.pageIds && r.pageIds.length === 0) continue;
-        const flags = normalizeReplaceFlags(r.flags);
-        const re = new RegExp(r.regex, flags);
-        compiled.push({
-            pageIdSet: r.pageIds ? new Set(r.pageIds) : void 0,
-            re,
-            replacement: r.replacement
-        });
-    }
-    return compiled;
-};
-/**
- * Applies ordered regex replacements to page content (per page).
- *
- * - Replacement rules are applied in array order.
- * - Each rule is applied globally (flag `g` enforced) with unicode mode (flag `u` enforced).
- * - `pageIds` can scope a rule to specific pages. `pageIds: []` skips the rule entirely.
- *
- * This function is intentionally **pure**:
- * it returns a new pages array only when changes are needed, otherwise it returns the original pages.
- */
-const applyReplacements = (pages, rules) => {
-    if (!rules || rules.length === 0 || pages.length === 0) return pages;
-    const compiled = compileReplaceRules(rules);
-    if (compiled.length === 0) return pages;
-    return pages.map((p) => {
-        let content = p.content;
-        for (const rule of compiled) {
-            if (rule.pageIdSet && !rule.pageIdSet.has(p.id)) continue;
-            content = content.replace(rule.re, rule.replacement);
-        }
-        if (content === p.content) return p;
-        return {
-            ...p,
-            content
-        };
-    });
-};
-
 //#endregion
 //#region src/segmentation/fast-fuzzy-prefix.ts
 /**
@@ -2120,14 +2155,43 @@ const convertPageBreaks = (content, startOffset, pageBreaks) => {
  */
 const segmentPages = (pages, options) => {
     const { rules = [], maxPages = 0, breakpoints = [], prefer = "longer", pageJoiner = "space", logger } = options;
+    logger?.info?.("[segmenter] starting segmentation", {
+        breakpointCount: breakpoints.length,
+        maxPages,
+        pageCount: pages.length,
+        prefer,
+        ruleCount: rules.length
+    });
     const processedPages = options.replace ? applyReplacements(pages, options.replace) : pages;
     const { content: matchContent, normalizedPages: normalizedContent, pageMap } = buildPageMap(processedPages);
-
+    logger?.debug?.("[segmenter] content built", {
+        pageIds: pageMap.pageIds,
+        totalContentLength: matchContent.length
+    });
+    const splitPoints = collectSplitPointsFromRules(rules, matchContent, pageMap);
+    const unique = dedupeSplitPoints(splitPoints);
+    logger?.debug?.("[segmenter] split points collected", {
+        rawSplitPoints: splitPoints.length,
+        uniqueSplitPoints: unique.length
+    });
+    let segments = buildSegments(unique, matchContent, pageMap, rules);
+    logger?.debug?.("[segmenter] structural segments built", {
+        segmentCount: segments.length,
+        segments: segments.map((s) => ({
+            contentLength: s.content.length,
+            from: s.from,
+            to: s.to
+        }))
+    });
     segments = ensureFallbackSegment(segments, processedPages, normalizedContent, pageJoiner);
     if (maxPages >= 0 && breakpoints.length) {
+        logger?.debug?.("[segmenter] applying breakpoints to oversized segments");
         const patternProcessor = (p) => processPattern(p, false).pattern;
-
+        const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner);
+        logger?.info?.("[segmenter] segmentation complete (with breakpoints)", { finalSegmentCount: result.length });
+        return result;
     }
+    logger?.info?.("[segmenter] segmentation complete (structural only)", { finalSegmentCount: segments.length });
     return segments;
 };
 /**
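Finally, the new logging hooks are all guarded with optional chaining (`logger?.info?.(...)`, `logger?.debug?.(...)`), so any subset of `info`/`debug`/`trace`/`error` can be supplied. A minimal sketch with a hypothetical two-page input:

```ts
const logger = {
    // Only the methods you provide are ever called.
    debug: (msg: string, data?: unknown) => console.debug(msg, data),
    info: (msg: string, data?: unknown) => console.info(msg, data),
};

// With no breakpoints configured, this logs "[segmenter] starting
// segmentation" and "[segmenter] segmentation complete (structural only)".
const segments = segmentPages(
    [{ content: "...", id: 1 }, { content: "...", id: 2 }],
    { logger, rules: [] },
);
```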
|