npm - flappa-doormal - Versions diffs - 2.10.0 → 2.10.1 - Mend

flappa-doormal 2.10.0 → 2.10.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/AGENTS.md CHANGED

@@ -383,6 +383,8 @@ bunx biome lint .
 11. **Boundary-position algorithm improves page attribution**: Building a position map of page boundaries once per segment (O(n)) enables binary search for O(log n) lookups per piece. Key insight: when a segment starts mid-page (common after structural rules), expected boundary estimates must account for the offset into the starting page. Without this adjustment, position-based lookups can return the wrong page when pages have identical content prefixes.
+12. **Prefix matching fails with duplicated content**: When using `indexOf()` to find page boundaries by matching prefixes, false positives occur when pages have identical prefixes AND content is duplicated within pages. Solution: use cumulative byte offsets as the source of truth for expected boundaries, and only accept prefix matches within a strict deviation threshold (2000 chars). When content-based detection fails, fall back directly to the calculated offset rather than returning `remainingContent.length` (which merges all remaining pages).
 ### For Future AI Agents (Recovery + Repo gotchas)
 1. **`lineStartsAfter` vs `lineStartsWith` is not “cosmetic”**: `lineStartsAfter` changes output by stripping the matched marker via an internal `contentStartOffset` during segment construction. If a client used it by accident, you cannot reconstruct the exact stripped prefix from output alone without referencing the original pages and re-matching the marker.
@@ -538,4 +540,69 @@ Use analysis functions to discover patterns, then pass to `segmentPages()`:
 See README.md for complete examples.
+---
+## Debugging Page Boundary Detection (Added 2026-01-04)
+### The Problem: False Positives in Prefix Matching
+When using `maxPages=0` with empty breakpoint `['']` (page boundary breaks), the segmenter can fail when:
+1. **Pages have identical prefixes** - All pages start with the same text
+2. **Duplicated content within pages** - The same phrase appears multiple times in a single page
+3. **Long content** - Pages are thousands of characters, putting false matches closer to expected boundaries
+**Root cause**: The `findPageStartNearExpectedBoundary` function in `breakpoint-utils.ts` uses prefix matching to find page boundaries. When content is duplicated, it finds matches at incorrect positions within the current page instead of at the actual page boundary.
+### Key Functions in the Breakpoint Chain
+1. **`applyBreakpoints()`** - Entry point for breakpoint processing
+2. **`processOversizedSegment()`** - Iteratively breaks segments exceeding `maxPages`
+3. **`computeWindowEndIdx()`** - Calculates max page index for current window
+4. **`findBreakpointWindowEndPosition()`** - Finds the byte position where the window ends
+5. **`findPageStartNearExpectedBoundary()`** - Content-based search for page start position
+6. **`handlePageBoundaryBreak()`** - Handles empty pattern `''` (page boundary)
+7. **`buildCumulativeOffsets()`** - Pre-computes exact byte positions for each page
+### Debug Strategy
+1. **Check cumulative offsets first** - `buildCumulativeOffsets()` returns correct positions from `pages.join('\n')`
+2. **Trace `expectedBoundary`** - This is calculated correctly from cumulative offsets
+3. **Check `findPageStartNearExpectedBoundary` candidates** - The bug is usually here; it finds false matches
+4. **Verify the deviation check** - Matches must be within `MAX_DEVIATION` (2000 chars) of expected boundary
+### The Fix Applied
+Two changes in `breakpoint-utils.ts`:
+1. **`findPageStartNearExpectedBoundary`** - Added `MAX_DEVIATION` check to reject matches too far from expected boundary:
+   ```typescript
+   const MAX_DEVIATION = 2000;
+   if (bestDistance <= MAX_DEVIATION) {
+       return bestCandidate.pos;
+   }
+   // Continue trying shorter prefixes or return -1
+   ```
+2. **`findBreakpointWindowEndPosition`** - Changed fallback from `remainingContent.length` to `bestExpectedBoundary`:
+   ```typescript
+   // Before (bug): return remainingContent.length; // Merges all remaining pages!
+   // After (fix): return Math.min(bestExpectedBoundary, remainingContent.length);
+   ```
+### Test Case Pattern for This Bug
+```typescript
+it('should correctly split pages with identical prefixes and duplicated content', () => {
+    const sharedPrefix = 'SHARED PREFIX ';
+    const filler = 'Lorem ipsum. '.repeat(200); // ~6000 chars
+    const pages: Page[] = [
+        { content: sharedPrefix + 'start ' + filler + sharedPrefix + 'end', id: 0 },
+        { content: sharedPrefix + 'page1', id: 1 },
+        { content: sharedPrefix + 'page2', id: 2 },
+    ];
+    const result = segmentPages(pages, { breakpoints: [''], maxPages: 0 });
+    expect(result).toHaveLength(3); // Without fix: 2 or 1
+});
+```
+---
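
Note on the AGENTS.md hunk above: the "cumulative offsets" it treats as the source of truth can be pictured with a minimal sketch. `cumulativeOffsetsFor` is a hypothetical stand-in, not the package's `buildCumulativeOffsets()`, whose real signature is not shown in this diff. The sketch assumes pages are joined with a single `'\n'` and that offsets are JavaScript string lengths (the JSDoc in `dist/index.mjs` calls them character offsets, while AGENTS.md says bytes).

```typescript
// Hypothetical sketch: offsets[i] is the start of page i inside pages.join('\n'),
// and the final entry is the length of the joined string.
const cumulativeOffsetsFor = (pages: string[]): number[] => {
    const offsets: number[] = [0];
    let total = 0;
    pages.forEach((page, i) => {
        total += page.length;
        // join('\n') inserts one separator between pages, none after the last
        if (i < pages.length - 1) total += 1;
        offsets.push(total);
    });
    return offsets;
};

// Expected boundary of page `nextIdx` relative to a segment that starts
// `startOffsetInFromPage` characters into page `fromIdx` (same arithmetic as the diff).
const expectedBoundaryFor = (
    offsets: number[],
    fromIdx: number,
    nextIdx: number,
    startOffsetInFromPage: number,
): number => Math.max(0, offsets[nextIdx] - offsets[fromIdx] - startOffsetInFromPage);

const demoOffsets = cumulativeOffsetsFor(['ab', 'cde', 'f']); // [0, 3, 7, 8]
const demoBoundary = expectedBoundaryFor(demoOffsets, 0, 2, 1); // 6
```

Because these positions come from arithmetic rather than from searching the text, duplicated phrases inside a page cannot shift them.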

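The selection rule described in AGENTS.md (prefer matches that follow a newline, pick the one nearest the expected boundary, reject it beyond 2000 characters, then fall back to the calculated offset) can be sketched on its own. This is a simplified illustration, not the shipped code: `pickBoundary` and `resolveBoundary` are hypothetical names, the search covers the whole string instead of a bounded window, and only a single prefix length is tried.

```typescript
interface Candidate {
    pos: number;
    isNewline: boolean;
}

const MAX_DEVIATION = 2000; // same threshold as in the diff

// Returns the accepted match position, or -1 when no match is close enough.
const pickBoundary = (content: string, prefix: string, expectedBoundary: number): number => {
    const candidates: Candidate[] = [];
    let pos = content.indexOf(prefix);
    while (pos !== -1) {
        if (pos > 0) {
            const charBefore = content.charAt(pos - 1);
            if (charBefore === '\n') candidates.push({ pos, isNewline: true });
            else if (/\s/.test(charBefore)) candidates.push({ pos, isNewline: false });
        }
        pos = content.indexOf(prefix, pos + 1);
    }
    if (candidates.length === 0) return -1;

    // Matches after a newline are more likely to be real page starts.
    const newlineCandidates = candidates.filter((c) => c.isNewline);
    const pool = newlineCandidates.length > 0 ? newlineCandidates : candidates;
    const best = pool.reduce((a, b) =>
        Math.abs(b.pos - expectedBoundary) < Math.abs(a.pos - expectedBoundary) ? b : a,
    );
    return Math.abs(best.pos - expectedBoundary) <= MAX_DEVIATION ? best.pos : -1;
};

// Falls back to the calculated offset instead of content.length,
// so a failed search does not merge all remaining pages.
const resolveBoundary = (content: string, prefix: string, expectedBoundary: number): number => {
    const pos = pickBoundary(content, prefix, expectedBoundary);
    return pos > 0 ? pos : Math.min(expectedBoundary, content.length);
};

// Usage mirroring the test case in AGENTS.md: page 0 repeats the shared prefix internally.
const demoPage0 = 'SHARED PREFIX start ' + 'Lorem ipsum. '.repeat(200) + 'SHARED PREFIX end';
const demoContent = demoPage0 + '\n' + 'SHARED PREFIX page1';
const demoPos = resolveBoundary(demoContent, 'SHARED PREFIX', demoPage0.length + 1);
```

In this usage the duplicate inside page 0 follows a space, while the real page start follows the joining newline, so the newline preference alone already discards the false match. The deviation check covers the remaining case where every candidate is far from the expected boundary.
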
package/dist/index.mjs CHANGED

@@ -1278,7 +1278,7 @@ const estimateStartOffsetInCurrentPage = (remainingContent, currentFromIdx, page
 * This is used to define breakpoint windows in terms of actual content being split, rather than
 * raw per-page offsets which can desync when structural rules strip markers.
 */
-const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, targetPageIdx, expectedBoundary, pageIds, normalizedPages) => {
+const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, targetPageIdx, expectedBoundary, pageIds, normalizedPages, logger) => {
 	const targetPageData = normalizedPages.get(pageIds[targetPageIdx]);
 	if (!targetPageData) return -1;
 	const approx = Math.min(Math.max(0, expectedBoundary), remainingContent.length);
@@ -1288,13 +1288,45 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
 	for (const len of WINDOW_PREFIX_LENGTHS) {
 		const prefix = targetTrimmed.slice(0, Math.min(len, targetTrimmed.length)).trim();
 		if (!prefix) continue;
+		const candidates = [];
 		let pos = remainingContent.indexOf(prefix, searchStart);
 		while (pos !== -1 && pos <= searchEnd) {
-			if (pos > 0 && /\s/.test(remainingContent[pos - 1] ?? "")) return pos;
+			if (pos > 0) {
+				const charBefore = remainingContent[pos - 1];
+				if (charBefore === "\n") candidates.push({
+					isNewline: true,
+					pos
+				});
+				else if (/\s/.test(charBefore)) candidates.push({
+					isNewline: false,
+					pos
+				});
+			}
 			pos = remainingContent.indexOf(prefix, pos + 1);
 		}
-		const last = remainingContent.lastIndexOf(prefix, approx);
-		if (last > 0) return last;
+		if (candidates.length > 0) {
+			const newlineCandidates = candidates.filter((c) => c.isNewline);
+			const pool = newlineCandidates.length > 0 ? newlineCandidates : candidates;
+			let bestCandidate = pool[0];
+			let bestDistance = Math.abs(pool[0].pos - expectedBoundary);
+			for (let i = 1; i < pool.length; i++) {
+				const dist = Math.abs(pool[i].pos - expectedBoundary);
+				if (dist < bestDistance) {
+					bestDistance = dist;
+					bestCandidate = pool[i];
+				}
+			}
+			const MAX_DEVIATION = 2e3;
+			if (bestDistance <= MAX_DEVIATION) return bestCandidate.pos;
+			logger?.debug?.("[breakpoints] findPageStartNearExpectedBoundary: Rejected match exceeding deviation", {
+				targetPageIdx,
+				expectedBoundary,
+				bestDistance,
+				maxDeviation: MAX_DEVIATION,
+				matchPos: bestCandidate.pos,
+				prefixLength: len
+			});
+		}
 	}
 	return -1;
 };
@@ -1314,6 +1346,7 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
 * @param pageIds - Array of all page IDs
 * @param normalizedPages - Map of page ID to normalized content
 * @param cumulativeOffsets - Cumulative character offsets (for estimates)
+* @param logger - Optional logger for debugging
 * @returns Array where boundaryPositions[i] = start position of page (fromIdx + i),
 *          with a sentinel boundary at segmentContent.length as the last element
 *
@@ -1322,12 +1355,12 @@ const findPageStartNearExpectedBoundary = (remainingContent, _currentFromIdx, ta
 * buildBoundaryPositions(content, 0, 2, pageIds, normalizedPages, offsets)
 * // → [0, 23, 45, 67] where 67 is content.length (sentinel)
 */
-const buildBoundaryPositions = (segmentContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets) => {
+const buildBoundaryPositions = (segmentContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger) => {
 	const boundaryPositions = [0];
 	const startOffsetInFromPage = estimateStartOffsetInCurrentPage(segmentContent, fromIdx, pageIds, normalizedPages);
 	for (let i = fromIdx + 1; i <= toIdx; i++) {
 		const expectedBoundary = cumulativeOffsets[i] !== void 0 && cumulativeOffsets[fromIdx] !== void 0 ? Math.max(0, cumulativeOffsets[i] - cumulativeOffsets[fromIdx] - startOffsetInFromPage) : segmentContent.length;
-		const pos = findPageStartNearExpectedBoundary(segmentContent, fromIdx, i, expectedBoundary, pageIds, normalizedPages);
+		const pos = findPageStartNearExpectedBoundary(segmentContent, fromIdx, i, expectedBoundary, pageIds, normalizedPages, logger);
 		const prevBoundary = boundaryPositions[boundaryPositions.length - 1];
 		if (pos > 0 && pos > prevBoundary && Math.abs(pos - expectedBoundary) < 2e3) boundaryPositions.push(pos);
 		else {
@@ -1371,18 +1404,20 @@ const findPageIndexForPosition = (position, boundaryPositions, fromIdx) => {
 * found within the actual `remainingContent` string being split. This avoids relying on raw page offsets
 * that can diverge when structural rules strip markers (e.g. `lineStartsAfter`).
 */
-const findBreakpointWindowEndPosition = (remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets) => {
+const findBreakpointWindowEndPosition = (remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger) => {
 	if (windowEndIdx >= toIdx) return remainingContent.length;
 	const desiredNextIdx = windowEndIdx + 1;
 	const minNextIdx = currentFromIdx + 1;
 	const maxNextIdx = Math.min(desiredNextIdx, toIdx);
 	const startOffsetInCurrentPage = estimateStartOffsetInCurrentPage(remainingContent, currentFromIdx, pageIds, normalizedPages);
+	let bestExpectedBoundary = remainingContent.length;
 	for (let nextIdx = maxNextIdx; nextIdx >= minNextIdx; nextIdx--) {
 		const expectedBoundary = cumulativeOffsets[nextIdx] !== void 0 && cumulativeOffsets[currentFromIdx] !== void 0 ? Math.max(0, cumulativeOffsets[nextIdx] - cumulativeOffsets[currentFromIdx] - startOffsetInCurrentPage) : remainingContent.length;
-		const pos = findPageStartNearExpectedBoundary(remainingContent, currentFromIdx, nextIdx, expectedBoundary, pageIds, normalizedPages);
+		if (nextIdx === maxNextIdx) bestExpectedBoundary = expectedBoundary;
+		const pos = findPageStartNearExpectedBoundary(remainingContent, currentFromIdx, nextIdx, expectedBoundary, pageIds, normalizedPages, logger);
 		if (pos > 0) return pos;
 	}
-	return remainingContent.length;
+	return Math.min(bestExpectedBoundary, remainingContent.length);
 };
 /**
 * Finds exclusion-based break position using raw cumulative offsets.
@@ -1460,7 +1495,8 @@ const handlePageBoundaryBreak = (remainingContent, windowEndIdx, windowEndPositi
 		const nextPageData = normalizedPages.get(pageIds[nextPageIdx]);
 		if (nextPageData) {
 			const pos = findNextPagePosition(remainingContent, nextPageData);
-			if (pos > 0) return Math.min(pos, windowEndPosition, remainingContent.length);
+			const tolerance = Math.max(2e3, windowEndPosition * .5);
+			if (pos > 0 && Math.abs(pos - windowEndPosition) <= tolerance) return Math.min(pos, windowEndPosition, remainingContent.length);
 		}
 	}
 	return Math.min(windowEndPosition, remainingContent.length);
@@ -1484,14 +1520,14 @@ const findBreakPosition = (remainingContent, currentFromIdx, toIdx, windowEndIdx
 		if (hasExcludedPageInRange(excludeSet, pageIds, currentFromIdx, windowEndIdx)) continue;
 		if (skipWhenRegex?.test(remainingContent)) continue;
 		if (regex === null) return {
-			breakpointIndex: i,
 			breakPos: handlePageBoundaryBreak(remainingContent, windowEndIdx, windowEndPosition, toIdx, pageIds, normalizedPages),
+			breakpointIndex: i,
 			rule
 		};
 		const breakPos = findPatternBreakPosition(remainingContent.slice(0, Math.min(windowEndPosition, remainingContent.length)), regex, prefer);
 		if (breakPos > 0) return {
-			breakpointIndex: i,
 			breakPos,
+			breakpointIndex: i,
 			rule
 		};
 	}
@@ -1662,7 +1698,7 @@ const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPag
 	let currentFromIdx = fromIdx;
 	let isFirstPiece = true;
 	let lastBreakpoint = null;
-	const boundaryPositions = buildBoundaryPositions(fullContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets);
+	const boundaryPositions = buildBoundaryPositions(fullContent, fromIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger);
 	logger?.debug?.("[breakpoints] boundaryPositions built", {
 		boundaryPositions,
 		fromIdx,
@@ -1683,7 +1719,7 @@ const processOversizedSegment = (segment, fromIdx, toIdx, pageIds, normalizedPag
 			break;
 		}
 		const windowEndIdx = computeWindowEndIdx(currentFromIdx, toIdx, pageIds, maxPages);
-		const windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets);
+		const windowEndPosition = findBreakpointWindowEndPosition(remainingContent, currentFromIdx, windowEndIdx, toIdx, pageIds, normalizedPages, cumulativeOffsets, logger);
 		logger?.debug?.(`[breakpoints] iteration=${i}`, {
 			currentFromIdx,
 			cursorPos,