flappa-doormal 2.6.2 → 2.6.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +5 -11
- package/dist/index.mjs +243 -181
- package/dist/index.mjs.map +1 -1
- package/package.json +1 -1
package/AGENTS.md
CHANGED
@@ -48,16 +48,6 @@ src/
 ├── fuzzy.test.ts # Fuzzy matching tests
 ├── textUtils.test.ts # Text utility tests
 └── match-utils.test.ts # Utility function tests
-
-test/
-├── 2576.json # Test data for book 2576 (Sahih Bukhari)
-└── 2588.json # Test data for book 2588 (Al-Mughni)
-
-docs/
-├── checkpoints/ # AI agent handoff documentation
-│   └── 2025-12-09-handoff.md
-└── reviews/ # Performance analysis reports
-    └── 2025-12-10/
 ```
 
 ### Core Components
@@ -92,7 +82,9 @@ docs/
 - `buildExcludeSet()` - Create Set from PageRange[] for O(1) lookups
 - `createSegment()` - Create segment with optional to/meta fields
 - `expandBreakpoints()` - Expand patterns with pre-compiled regexes
-- `
+- `buildBoundaryPositions()` - Build position map of page boundaries for O(log n) lookups
+- `findPageIndexForPosition()` - Binary search to find page index for a character position
+- `estimateStartOffsetInCurrentPage()` - Estimate offset when segment starts mid-page
 - `findBreakpointWindowEndPosition()` - Compute window boundary in content-space (robust to marker stripping)
 - `applyPageJoinerBetweenPages()` - Normalize page-boundary join in output segments (`space` vs `newline`)
 - `findBreakPosition()` - Find break position using breakpoint patterns
@@ -362,6 +354,8 @@ bunx biome lint .
 
 10. **Page boundary detection needs progressive prefixes**: When breakpoints split content mid-page, checking only the first N characters of a page to detect if the segment ends on that page can fail. Solution: try progressively shorter prefixes (`[80, 60, 40, 30, 20, 15, 12, 10, 8, 6]`) via `JOINER_PREFIX_LENGTHS`. The check uses `indexOf(...) > 0` (not `>= 0`) to avoid false positives when a page prefix appears at position 0 (which indicates the segment *starts* with that page, not *ends* on it).
 
+11. **Boundary-position algorithm improves page attribution**: Building a position map of page boundaries once per segment (O(n)) enables binary search for O(log n) lookups per piece. Key insight: when a segment starts mid-page (common after structural rules), expected boundary estimates must account for the offset into the starting page. Without this adjustment, position-based lookups can return the wrong page when pages have identical content prefixes.
+
 ### Architecture Insights
 
 - **Declarative > Imperative**: Users describe patterns, library handles regex
package/dist/index.mjs
CHANGED
@@ -411,6 +411,72 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
 	return -1;
 };
 /**
+ * Builds a boundary position map for pages within the given range.
+ *
+ * This function computes page boundaries once per segment and enables
+ * O(log n) page lookups via binary search with `findPageIndexForPosition`.
+ *
+ * Boundaries are derived from segmentContent (post-structural-rules).
+ * When the segment starts mid-page, an offset correction is applied to
+ * keep boundary estimates aligned with the segment's actual content space.
+ *
+ * @param segmentContent - Full segment content (already processed by structural rules)
+ * @param fromIdx - Starting page index
+ * @param toIdx - Ending page index
+ * @param pageIds - Array of all page IDs
+ * @param normalizedPages - Map of page ID to normalized content
+ * @param cumulativeOffsets - Cumulative character offsets (for estimates)
+ * @returns Array where boundaryPositions[i] = start position of page (fromIdx + i),
+ * with a sentinel boundary at segmentContent.length as the last element
+ *
+ * @example
+ * // For a 3-page segment:
+ * buildBoundaryPositions(content, 0, 2, pageIds, normalizedPages, offsets)
+ * // → [0, 23, 45, 67] where 67 is content.length (sentinel)
+ */
+const buildBoundaryPositions = (segmentContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets) => {
+	const boundaryPositions = [0];
+	const startOffsetInFromPage = estimateStartOffsetInCurrentPage(segmentContent, fromIdx, pageIds, normalizedPages);
+	for (let i = fromIdx + 1; i <= toIdx; i++) {
+		const expectedBoundary = cumulativeOffsets[i] !== void 0 && cumulativeOffsets[fromIdx] !== void 0 ? Math.max(0, cumulativeOffsets[i] - cumulativeOffsets[fromIdx] - startOffsetInFromPage) : segmentContent.length;
+		const pos = findPageStartNearExpectedBoundary(segmentContent, fromIdx, i, expectedBoundary, pageIds, normalizedPages);
+		const prevBoundary = boundaryPositions[boundaryPositions.length - 1];
+		if (pos > 0 && pos > prevBoundary && Math.abs(pos - expectedBoundary) < 2e3) boundaryPositions.push(pos);
+		else {
+			const estimate = Math.max(prevBoundary + 1, expectedBoundary);
+			boundaryPositions.push(Math.min(estimate, segmentContent.length));
+		}
+	}
+	boundaryPositions.push(segmentContent.length);
+	return boundaryPositions;
+};
+/**
+ * Binary search to find which page a position falls within.
+ * Uses "largest i where boundaryPositions[i] <= position" semantics.
+ *
+ * @param position - Character position in segmentContent
+ * @param boundaryPositions - Precomputed boundary positions (from buildBoundaryPositions)
+ * @param fromIdx - Base page index (boundaryPositions[0] corresponds to pageIds[fromIdx])
+ * @returns Page index in pageIds array
+ *
+ * @example
+ * // With boundaries [0, 20, 40, 60] and fromIdx=0:
+ * findPageIndexForPosition(15, boundaries, 0) // → 0 (first page)
+ * findPageIndexForPosition(25, boundaries, 0) // → 1 (second page)
+ * findPageIndexForPosition(40, boundaries, 0) // → 2 (exactly on boundary = that page)
+ */
+const findPageIndexForPosition = (position, boundaryPositions, fromIdx) => {
+	if (boundaryPositions.length <= 1) return fromIdx;
+	let left = 0;
+	let right = boundaryPositions.length - 2;
+	while (left < right) {
+		const mid = Math.ceil((left + right) / 2);
+		if (boundaryPositions[mid] <= position) left = mid;
+		else right = mid - 1;
+	}
+	return fromIdx + left;
+};
+/**
  * Finds the end position of a breakpoint window inside `remainingContent`.
  *
  * The window end is defined as the start of the page AFTER `windowEndIdx` (i.e. `windowEndIdx + 1`),
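The @example annotations above can be exercised directly. A small harness: the function body is copied verbatim from the added code, so this only re-states its documented semantics.

```js
const findPageIndexForPosition = (position, boundaryPositions, fromIdx) => {
	if (boundaryPositions.length <= 1) return fromIdx;
	let left = 0;
	let right = boundaryPositions.length - 2; // skip the sentinel at the end
	while (left < right) {
		const mid = Math.ceil((left + right) / 2);
		if (boundaryPositions[mid] <= position) left = mid;
		else right = mid - 1;
	}
	return fromIdx + left;
};

const boundaries = [0, 20, 40, 60]; // 60 = content length (sentinel)
console.log(findPageIndexForPosition(15, boundaries, 0)); // 0: first page
console.log(findPageIndexForPosition(25, boundaries, 0)); // 1: second page
console.log(findPageIndexForPosition(40, boundaries, 0)); // 2: a position exactly on a boundary belongs to that page
```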
@@ -446,59 +512,6 @@ const findExclusionBreakPosition = (currentFromIdx, windowEndIdx, toIdx, pageIds
 	return -1;
 };
 /**
- * Finds the actual ending page index by searching backwards for page content prefix.
- * Used to determine which page a segment actually ends on based on content matching.
- *
- * @param pieceContent - Content of the segment piece
- * @param currentFromIdx - Current starting index in pageIds
- * @param toIdx - Maximum ending index to search
- * @param pageIds - Array of page IDs
- * @param normalizedPages - Map of page ID to normalized content
- * @returns The actual ending page index
- */
-const findActualEndPage = (pieceContent, currentFromIdx, toIdx, pageIds, normalizedPages) => {
-	for (let pi = toIdx; pi > currentFromIdx; pi--) {
-		const pageData = normalizedPages.get(pageIds[pi]);
-		if (!pageData) continue;
-		const trimmedContent = pageData.content.trimStart();
-		for (const len of JOINER_PREFIX_LENGTHS) {
-			const checkPortion = trimmedContent.slice(0, Math.min(len, trimmedContent.length)).trim();
-			if (checkPortion.length > 0 && pieceContent.indexOf(checkPortion) > 0) return pi;
-		}
-	}
-	return currentFromIdx;
-};
-/**
- * Finds the actual starting page index by searching forwards for page content prefix.
- * Used to determine which page content actually starts from based on content matching.
- *
- * This is the counterpart to findActualEndPage - it searches forward to find which
- * page the content starts on, rather than which page it ends on.
- *
- * @param pieceContent - Content of the segment piece
- * @param currentFromIdx - Current starting index in pageIds
- * @param toIdx - Maximum ending index to search
- * @param pageIds - Array of page IDs
- * @param normalizedPages - Map of page ID to normalized content
- * @returns The actual starting page index
- */
-const findActualStartPage = (pieceContent, currentFromIdx, toIdx, pageIds, normalizedPages) => {
-	const trimmedPiece = pieceContent.trimStart();
-	if (!trimmedPiece) return currentFromIdx;
-	for (let pi = currentFromIdx; pi <= toIdx; pi++) {
-		const pageData = normalizedPages.get(pageIds[pi]);
-		if (pageData) {
-			const pagePrefix = pageData.content.slice(0, Math.min(30, pageData.length)).trim();
-			const piecePrefix = trimmedPiece.slice(0, Math.min(30, trimmedPiece.length));
-			if (pagePrefix.length > 0) {
-				if (trimmedPiece.startsWith(pagePrefix)) return pi;
-				if (pageData.content.trimStart().startsWith(piecePrefix)) return pi;
-			}
-		}
-	}
-	return currentFromIdx;
-};
-/**
  * Checks if any page in a range is excluded by the given exclude set.
  *
  * @param excludeSet - Set of excluded page IDs
@@ -630,10 +643,22 @@ const computeWindowEndIdx = (currentFromIdx, toIdx, pageIds, maxPages) => {
 };
 const computeRemainingSpan = (currentFromIdx, toIdx, pageIds) => pageIds[toIdx] - pageIds[currentFromIdx];
 const createFinalSegment = (remainingContent, currentFromIdx, toIdx, pageIds, meta, includeMeta) => createSegment(remainingContent, pageIds[currentFromIdx], currentFromIdx !== toIdx ? pageIds[toIdx] : void 0, includeMeta ? meta : void 0);
-
-
+/**
+ * Computes the actual start and end page indices for a piece using
+ * precomputed boundary positions and binary search.
+ *
+ * @param pieceStartPos - Start position of the piece in the full segment content
+ * @param pieceEndPos - End position (exclusive) of the piece
+ * @param boundaryPositions - Precomputed boundary positions from buildBoundaryPositions
+ * @param baseFromIdx - Base page index (boundaryPositions[0] corresponds to pageIds[baseFromIdx])
+ * @param toIdx - Maximum page index
+ * @returns Object with actualStartIdx and actualEndIdx
+ */
+const computePiecePages = (pieceStartPos, pieceEndPos, boundaryPositions, baseFromIdx, toIdx) => {
+	const actualStartIdx = findPageIndexForPosition(pieceStartPos, boundaryPositions, baseFromIdx);
+	const endPos = Math.max(pieceStartPos, pieceEndPos - 1);
 	return {
-		actualEndIdx:
+		actualEndIdx: Math.min(findPageIndexForPosition(endPos, boundaryPositions, baseFromIdx), toIdx),
 		actualStartIdx
 	};
 };
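A worked example of the rewired attribution, assuming the two helpers added earlier in this diff (`findPageIndexForPosition`, `computePiecePages`) are in scope; they are internal to the bundle, so this is illustration, not public API.

```js
// With boundaries [0, 20, 40, 60] (sentinel 60) and a piece spanning [15, 45):
// - the piece starts at position 15, which falls on page 0 (15 < 20);
// - its last character sits at position 44, which falls on page 2 (40 <= 44),
//   and the result is then capped at toIdx.
const { actualEndIdx, actualStartIdx } = computePiecePages(15, 45, [0, 20, 40, 60], 0, 2);
console.log(actualStartIdx, actualEndIdx); // 0 2
```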
@@ -650,79 +675,87 @@ const computeNextFromIdx = (remainingContent, actualEndIdx, toIdx, pageIds, norm
 	return nextFromIdx;
 };
 const createPieceSegment = (pieceContent, actualStartIdx, actualEndIdx, pageIds, meta, includeMeta) => createSegment(pieceContent, pageIds[actualStartIdx], actualEndIdx > actualStartIdx ? pageIds[actualEndIdx] : void 0, includeMeta ? meta : void 0);
+/**
+ * Finds the break offset within a window, trying exclusions first, then patterns.
+ *
+ * @returns Break offset relative to remainingContent, or windowEndPosition as fallback
+ */
+const findBreakOffsetForWindow = (remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer) => {
+	if (hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, windowEndIdx)) {
+		const exclusionBreak = findExclusionBreakPosition(currentFromIdx, windowEndIdx, toIdx, pageIds, expandedBreakpoints, cumulativeOffsets);
+		if (exclusionBreak > 0) return exclusionBreak;
+	}
+	const patternBreak = findBreakPosition(remainingContent, currentFromIdx, toIdx, windowEndIdx, windowEndPosition, {
+		expandedBreakpoints,
+		normalizedPages,
+		pageIds,
+		prefer
+	});
+	return patternBreak > 0 ? patternBreak : windowEndPosition;
+};
+/**
+ * Advances cursor position past any leading whitespace.
+ */
+const skipWhitespace = (content, startPos) => {
+	let pos = startPos;
+	while (pos < content.length && /\s/.test(content[pos])) pos++;
+	return pos;
+};
+/**
+ * Processes an oversized segment by iterating through the content and
+ * breaking it into smaller pieces that fit within maxPages constraints.
+ *
+ * Uses precomputed boundary positions for O(log n) page attribution lookups.
+ */
 const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, expandedBreakpoints, maxPages, prefer, logger) => {
 	const result = [];
-
+	const fullContent = segment.content;
+	let cursorPos = 0;
 	let currentFromIdx = fromIdx;
 	let isFirstPiece = true;
-
+	const boundaryPositions = buildBoundaryPositions(fullContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets);
+	logger?.debug?.("[breakpoints] boundaryPositions built", {
+		boundaryPositions,
+		fromIdx,
+		fullContentLength: fullContent.length,
+		toIdx
+	});
 	const maxIterations = 1e4;
-
-
-		if (
-			logger?.error?.("INFINITE LOOP DETECTED! Breaking out, you should report this bug", { iterationCount: maxIterations });
-			break;
-		}
-		const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
+	for (let i = 0; i < maxIterations && cursorPos < fullContent.length && currentFromIdx <= toIdx; i++) {
+		const remainingContent = fullContent.slice(cursorPos);
+		if (!remainingContent.trim()) break;
 		const remainingSpan = computeRemainingSpan(currentFromIdx, toIdx, pageIds);
+		const remainingHasExclusions = hasAnyExclusionsInRange(expandedBreakpoints, pageIds, currentFromIdx, toIdx);
 		if (remainingSpan <= maxPages && !remainingHasExclusions) {
 			const finalSeg = createFinalSegment(remainingContent, currentFromIdx, toIdx, pageIds, segment.meta, isFirstPiece);
 			if (finalSeg) result.push(finalSeg);
 			break;
 		}
 		const windowEndIdx = computeWindowEndIdx(currentFromIdx, toIdx, pageIds, maxPages);
-		logger?.debug?.(`[breakpoints] iteration=${iterationCount}`, {
-			currentFromIdx,
-			currentFromPageId: pageIds[currentFromIdx],
-			remainingContentStart: remainingContent.slice(0, 50),
-			remainingContentLength: remainingContent.length,
-			remainingSpan,
-			toIdx,
-			toPageId: pageIds[toIdx],
-			windowEndIdx,
-			windowEndPageId: pageIds[windowEndIdx]
-		});
 		const windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets);
-
-
-
-
-			expandedBreakpoints,
-			normalizedPages,
-			pageIds,
-			prefer
+		logger?.debug?.(`[breakpoints] iteration=${i}`, {
+			currentFromIdx,
+			cursorPos,
+			windowEndIdx
 		});
-
-		const
-
-
-
-
+		const breakOffset = findBreakOffsetForWindow(remainingContent, currentFromIdx, windowEndIdx, toIdx, windowEndPosition, pageIds, expandedBreakpoints, cumulativeOffsets, normalizedPages, prefer);
+		const breakPos = cursorPos + breakOffset;
+		const pieceContent = fullContent.slice(cursorPos, breakPos).trim();
+		const { actualEndIdx, actualStartIdx } = computePiecePages(cursorPos, breakPos, boundaryPositions, fromIdx, toIdx);
+		logger?.trace?.("[breakpoints] piece", {
+			actualEndIdx,
+			actualStartIdx,
+			pieceLength: pieceContent.length
 		});
-		const { actualEndIdx, actualStartIdx } = computePiecePages(pieceContent, currentFromIdx, toIdx, windowEndIdx, pageIds, normalizedPages);
 		if (pieceContent) {
 			const pieceSeg = createPieceSegment(pieceContent, actualStartIdx, actualEndIdx, pageIds, segment.meta, isFirstPiece);
 			if (pieceSeg) result.push(pieceSeg);
 		}
-
-
-			actualEndIdx,
-			remainingContentLength: remainingContent.length,
-			remainingContentStart: remainingContent.slice(0, 60)
-		});
-		if (!remainingContent) {
-			logger?.debug?.("[breakpoints] done: no remaining content");
-			break;
-		}
-		currentFromIdx = computeNextFromIdx(remainingContent, actualEndIdx, toIdx, pageIds, normalizedPages);
-		logger?.debug?.("[breakpoints] nextIteration", {
-			currentFromIdx,
-			currentFromPageId: pageIds[currentFromIdx]
-		});
+		cursorPos = skipWhitespace(fullContent, breakPos);
+		currentFromIdx = computeNextFromIdx(fullContent.slice(cursorPos), actualEndIdx, toIdx, pageIds, normalizedPages);
 		isFirstPiece = false;
 	}
-	logger?.debug?.("[breakpoints]
+	logger?.debug?.("[breakpoints] done", { resultCount: result.length });
 	return result;
 };
 /**
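The structural change here is the move from reassigning the remaining content each iteration to a single forward-only cursor over `fullContent`, which is why the old `INFINITE LOOP DETECTED` branch could be folded into the loop condition. A minimal sketch of that pattern (hypothetical `findBreakOffset`, not the library's signature):

```js
// Forward-only cursor: each iteration consumes at least one character
// (the real code falls back to windowEndPosition and is additionally
// bounded by maxIterations), so the loop terminates.
const splitByCursor = (content, findBreakOffset) => {
	const pieces = [];
	let cursor = 0;
	while (cursor < content.length) {
		const offset = Math.max(1, findBreakOffset(content.slice(cursor)));
		pieces.push(content.slice(cursor, cursor + offset).trim());
		cursor += offset;
		while (cursor < content.length && /\s/.test(content[cursor])) cursor++; // skipWhitespace
	}
	return pieces.filter(Boolean);
};

console.log(splitByCursor("aa bb cc", (rest) => rest.indexOf(" ") + 1 || rest.length));
// → [ "aa", "bb", "cc" ]
```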
@@ -905,6 +938,77 @@ const anyRuleAllowsId = (rules, pageId) => {
 	});
 };
 
+//#endregion
+//#region src/segmentation/replace.ts
+const DEFAULT_REPLACE_FLAGS = "gu";
+const normalizeReplaceFlags = (flags) => {
+	if (!flags) return DEFAULT_REPLACE_FLAGS;
+	const allowed = new Set([
+		"g",
+		"i",
+		"m",
+		"s",
+		"u",
+		"y"
+	]);
+	const set = /* @__PURE__ */ new Set();
+	for (const ch of flags) {
+		if (!allowed.has(ch)) throw new Error(`Invalid replace regex flag: "${ch}" (allowed: gimsyu)`);
+		set.add(ch);
+	}
+	set.add("g");
+	set.add("u");
+	return [
+		"g",
+		"i",
+		"m",
+		"s",
+		"y",
+		"u"
+	].filter((c) => set.has(c)).join("");
+};
+const compileReplaceRules = (rules) => {
+	const compiled = [];
+	for (const r of rules) {
+		if (r.pageIds && r.pageIds.length === 0) continue;
+		const flags = normalizeReplaceFlags(r.flags);
+		const re = new RegExp(r.regex, flags);
+		compiled.push({
+			pageIdSet: r.pageIds ? new Set(r.pageIds) : void 0,
+			re,
+			replacement: r.replacement
+		});
+	}
+	return compiled;
+};
+/**
+ * Applies ordered regex replacements to page content (per page).
+ *
+ * - Replacement rules are applied in array order.
+ * - Each rule is applied globally (flag `g` enforced) with unicode mode (flag `u` enforced).
+ * - `pageIds` can scope a rule to specific pages. `pageIds: []` skips the rule entirely.
+ *
+ * This function is intentionally **pure**:
+ * it returns a new pages array only when changes are needed, otherwise it returns the original pages.
+ */
+const applyReplacements = (pages, rules) => {
+	if (!rules || rules.length === 0 || pages.length === 0) return pages;
+	const compiled = compileReplaceRules(rules);
+	if (compiled.length === 0) return pages;
+	return pages.map((p) => {
+		let content = p.content;
+		for (const rule of compiled) {
+			if (rule.pageIdSet && !rule.pageIdSet.has(p.id)) continue;
+			content = content.replace(rule.re, rule.replacement);
+		}
+		if (content === p.content) return p;
+		return {
+			...p,
+			content
+		};
+	});
+};
+
 //#endregion
 //#region src/segmentation/tokens.ts
 /**
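The relocated replace module reads naturally from the code above: pages are `{ id, content }` objects and each rule is `{ regex, replacement, flags?, pageIds? }`. A usage sketch that calls the internal helper directly, purely for illustration (in the public API these rules arrive via `segmentPages`' `options.replace`, as the last hunk below shows):

```js
const pages = [
	{ content: "Chapter 1: intro", id: 1 },
	{ content: "Chapter 2: BODY", id: 2 }
];
const rules = [
	{ regex: "Chapter", replacement: "Ch." },                        // all pages; flags default to "gu"
	{ flags: "i", pageIds: [2], regex: "body", replacement: "text" } // page 2 only, case-insensitive
];
const result = applyReplacements(pages, rules);
// result[0].content === "Ch. 1: intro"; result[1].content === "Ch. 2: text"
// `pages` itself is untouched; pages whose content did not change are returned as-is.
```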
@@ -1482,77 +1586,6 @@ const buildRuleRegex = (rule, capturePrefix) => {
 	};
 };
 
-//#endregion
-//#region src/segmentation/replace.ts
-const DEFAULT_REPLACE_FLAGS = "gu";
-const normalizeReplaceFlags = (flags) => {
-	if (!flags) return DEFAULT_REPLACE_FLAGS;
-	const allowed = new Set([
-		"g",
-		"i",
-		"m",
-		"s",
-		"u",
-		"y"
-	]);
-	const set = /* @__PURE__ */ new Set();
-	for (const ch of flags) {
-		if (!allowed.has(ch)) throw new Error(`Invalid replace regex flag: "${ch}" (allowed: gimsyu)`);
-		set.add(ch);
-	}
-	set.add("g");
-	set.add("u");
-	return [
-		"g",
-		"i",
-		"m",
-		"s",
-		"y",
-		"u"
-	].filter((c) => set.has(c)).join("");
-};
-const compileReplaceRules = (rules) => {
-	const compiled = [];
-	for (const r of rules) {
-		if (r.pageIds && r.pageIds.length === 0) continue;
-		const flags = normalizeReplaceFlags(r.flags);
-		const re = new RegExp(r.regex, flags);
-		compiled.push({
-			pageIdSet: r.pageIds ? new Set(r.pageIds) : void 0,
-			re,
-			replacement: r.replacement
-		});
-	}
-	return compiled;
-};
-/**
- * Applies ordered regex replacements to page content (per page).
- *
- * - Replacement rules are applied in array order.
- * - Each rule is applied globally (flag `g` enforced) with unicode mode (flag `u` enforced).
- * - `pageIds` can scope a rule to specific pages. `pageIds: []` skips the rule entirely.
- *
- * This function is intentionally **pure**:
- * it returns a new pages array only when changes are needed, otherwise it returns the original pages.
- */
-const applyReplacements = (pages, rules) => {
-	if (!rules || rules.length === 0 || pages.length === 0) return pages;
-	const compiled = compileReplaceRules(rules);
-	if (compiled.length === 0) return pages;
-	return pages.map((p) => {
-		let content = p.content;
-		for (const rule of compiled) {
-			if (rule.pageIdSet && !rule.pageIdSet.has(p.id)) continue;
-			content = content.replace(rule.re, rule.replacement);
-		}
-		if (content === p.content) return p;
-		return {
-			...p,
-			content
-		};
-	});
-};
-
 //#endregion
 //#region src/segmentation/fast-fuzzy-prefix.ts
 /**
@@ -2122,14 +2155,43 @@ const convertPageBreaks = (content, startOffset, pageBreaks) => {
  */
 const segmentPages = (pages, options) => {
 	const { rules = [], maxPages = 0, breakpoints = [], prefer = "longer", pageJoiner = "space", logger } = options;
+	logger?.info?.("[segmenter] starting segmentation", {
+		breakpointCount: breakpoints.length,
+		maxPages,
+		pageCount: pages.length,
+		prefer,
+		ruleCount: rules.length
+	});
 	const processedPages = options.replace ? applyReplacements(pages, options.replace) : pages;
 	const { content: matchContent, normalizedPages: normalizedContent, pageMap } = buildPageMap(processedPages);
-
+	logger?.debug?.("[segmenter] content built", {
+		pageIds: pageMap.pageIds,
+		totalContentLength: matchContent.length
+	});
+	const splitPoints = collectSplitPointsFromRules(rules, matchContent, pageMap);
+	const unique = dedupeSplitPoints(splitPoints);
+	logger?.debug?.("[segmenter] split points collected", {
+		rawSplitPoints: splitPoints.length,
+		uniqueSplitPoints: unique.length
+	});
+	let segments = buildSegments(unique, matchContent, pageMap, rules);
+	logger?.debug?.("[segmenter] structural segments built", {
+		segmentCount: segments.length,
+		segments: segments.map((s) => ({
+			contentLength: s.content.length,
+			from: s.from,
+			to: s.to
+		}))
+	});
 	segments = ensureFallbackSegment(segments, processedPages, normalizedContent, pageJoiner);
 	if (maxPages >= 0 && breakpoints.length) {
+		logger?.debug?.("[segmenter] applying breakpoints to oversized segments");
 		const patternProcessor = (p) => processPattern(p, false).pattern;
-
+		const result = applyBreakpoints(segments, processedPages, normalizedContent, maxPages, breakpoints, prefer, patternProcessor, logger, pageJoiner);
+		logger?.info?.("[segmenter] segmentation complete (with breakpoints)", { finalSegmentCount: result.length });
+		return result;
 	}
+	logger?.info?.("[segmenter] segmentation complete (structural only)", { finalSegmentCount: segments.length });
 	return segments;
 };
 /**