pdfdelta-0.1.0.tar.gz

pdfdelta-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 pdfcompare contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
pdfdelta-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,97 @@
1
+ Metadata-Version: 2.4
2
+ Name: pdfdelta
3
+ Version: 0.1.0
4
+ Summary: Visual diff for born-digital PDFs — highlights changes directly on the original pages
5
+ License: MIT
6
+ Keywords: pdf,diff,compare,highlight,academic,paper
7
+ Classifier: Development Status :: 4 - Beta
8
+ Classifier: Intended Audience :: Science/Research
9
+ Classifier: Topic :: Text Processing :: General
10
+ Classifier: Programming Language :: Python :: 3
11
+ Requires-Python: >=3.10
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: PyMuPDF>=1.24
15
+ Dynamic: license-file
16
+
17
+ # pdfdelta
18
+
19
+ **pdfdelta** is a lightweight visual diff tool for born-digital PDFs.
20
+
21
+ Given an old and a new version of a PDF, it writes highlights directly onto the original pages so revisions are easy to review: deletions on the old file, additions on the new file.
22
+
23
+ It is mainly designed for academic papers and technical documents, where small wording changes matter and layout is part of the review process.
24
+
25
+ <p align="center">
26
+ <img src="examples/old_marked.png" alt="Old PDF with deletions highlighted" width="48%" />
27
+ <img src="examples/new_marked.png" alt="New PDF with additions highlighted" width="48%" />
28
+ </p>
29
+
30
+ ## Features
31
+
32
+ - Highlights changes directly on the original PDF pages
33
+ - Works well for born-digital PDFs such as papers, reports, and drafts
34
+ - Handles multi-column layouts better than plain text diff tools
35
+ - Tries to reduce noisy highlights from simple reflow
36
+ - Keeps the review workflow visual and page-based
37
+
38
+ ## Installation
39
+
40
+ Install directly from the repository:
41
+
42
+ ```sh
43
+ pip install git+https://github.com/mli55/pdfdelta.git
44
+ ```
45
+
46
+ ## Usage
47
+
48
+ ```sh
49
+ pdfdelta old.pdf new.pdf
50
+ ```
51
+
52
+ This writes two annotated files:
53
+
54
+ - `old_marked.pdf` — original pages with deletions highlighted
55
+ - `new_marked.pdf` — revised pages with additions highlighted
56
+
57
+ ### Options
58
+
59
+ | Flag | Default | Description |
60
+ | ---- | ------- | ----------- |
61
+ | `--old-out` | `old_marked.pdf` | Output path for the annotated old PDF |
62
+ | `--new-out` | `new_marked.pdf` | Output path for the annotated new PDF |
63
+ | `--opacity` | `0.35` | Highlight opacity (0.0–1.0) |
64
+
65
+ ## How It Works
66
+
67
+ ```
68
+ old.pdf new.pdf
69
+ │ │
70
+ ▼ ▼
71
+ ┌──────────────────┐
72
+ │ Extract words │ PyMuPDF: word text + bounding boxes
73
+ └────────┬─────────┘
74
+
75
+ ┌──────────────────┐
76
+ │ Global diff │ Flatten all pages → SequenceMatcher
77
+ └────────┬─────────┘
78
+
79
+ ┌──────────────────┐
80
+ │ Word-level diff │ Per-word & sub-word precision
81
+ └────────┬─────────┘
82
+
83
+ ┌──────────────────┐
84
+ │ Reflow filter │ Suppress cross-page / cross-column noise
85
+ └────────┬─────────┘
86
+
87
+ ┌──────────────────┐
88
+ │ Annotate PDFs │ Highlights on original pages
89
+ └────────┬─────────┘
90
+
91
+ old_marked.pdf
92
+ new_marked.pdf
93
+ ```
94
+
95
+ ## License
96
+
97
+ MIT
pdfdelta-0.1.0/README.md ADDED
@@ -0,0 +1,81 @@
1
+ # pdfdelta
2
+
3
+ **pdfdelta** is a lightweight visual diff tool for born-digital PDFs.
4
+
5
+ Given an old and a new version of a PDF, it writes highlights directly onto the original pages so revisions are easy to review: deletions on the old file, additions on the new file.
6
+
7
+ It is mainly designed for academic papers and technical documents, where small wording changes matter and layout is part of the review process.
8
+
9
+ <p align="center">
10
+ <img src="examples/old_marked.png" alt="Old PDF with deletions highlighted" width="48%" />
11
+ <img src="examples/new_marked.png" alt="New PDF with additions highlighted" width="48%" />
12
+ </p>
13
+
14
+ ## Features
15
+
16
+ - Highlights changes directly on the original PDF pages
17
+ - Works well for born-digital PDFs such as papers, reports, and drafts
18
+ - Handles multi-column layouts better than plain text diff tools
19
+ - Tries to reduce noisy highlights from simple reflow
20
+ - Keeps the review workflow visual and page-based
21
+
22
+ ## Installation
23
+
24
+ Install directly from the repository:
25
+
26
+ ```sh
27
+ pip install git+https://github.com/mli55/pdfdelta.git
28
+ ```
29
+
30
+ ## Usage
31
+
32
+ ```sh
33
+ pdfdelta old.pdf new.pdf
34
+ ```
35
+
36
+ This writes two annotated files:
37
+
38
+ - `old_marked.pdf` — original pages with deletions highlighted
39
+ - `new_marked.pdf` — revised pages with additions highlighted
40
+
41
+ ### Options
42
+
43
+ | Flag | Default | Description |
44
+ | ---- | ------- | ----------- |
45
+ | `--old-out` | `old_marked.pdf` | Output path for the annotated old PDF |
46
+ | `--new-out` | `new_marked.pdf` | Output path for the annotated new PDF |
47
+ | `--opacity` | `0.35` | Highlight opacity (0.0–1.0) |
48
+
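The same pipeline is available from Python. The sketch below mirrors what the CLI does in `cli.py`, using the names exported by `pdfdelta/__init__.py`; it assumes the package is installed and that `old.pdf` and `new.pdf` exist, so treat it as an outline rather than a copy-paste recipe:

```python
from pdfdelta import extract_document, compare_documents, apply_annotations

# Extract per-page word boxes from both documents.
old_pages = extract_document("old.pdf")
new_pages = extract_document("new.pdf")

# Returns a {page_index: [rect, ...]} mapping for each document.
old_rects, new_rects = compare_documents(old_pages, new_pages)

# Red highlights for deletions on the old file, green for additions on the new.
apply_annotations("old.pdf", "old_marked.pdf", old_rects, color=(1.0, 0.0, 0.0))
apply_annotations("new.pdf", "new_marked.pdf", new_rects, color=(0.0, 1.0, 0.0))
```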
49
+ ## How It Works
50
+
51
+ ```
52
+ old.pdf new.pdf
53
+ │ │
54
+ ▼ ▼
55
+ ┌──────────────────┐
56
+ │ Extract words │ PyMuPDF: word text + bounding boxes
57
+ └────────┬─────────┘
58
+
59
+ ┌──────────────────┐
60
+ │ Global diff │ Flatten all pages → SequenceMatcher
61
+ └────────┬─────────┘
62
+
63
+ ┌──────────────────┐
64
+ │ Word-level diff │ Per-word & sub-word precision
65
+ └────────┬─────────┘
66
+
67
+ ┌──────────────────┐
68
+ │ Reflow filter │ Suppress cross-page / cross-column noise
69
+ └────────┬─────────┘
70
+
71
+ ┌──────────────────┐
72
+ │ Annotate PDFs │ Highlights on original pages
73
+ └────────┬─────────┘
74
+
75
+ old_marked.pdf
76
+ new_marked.pdf
77
+ ```
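The "Global diff" and "Word-level diff" stages are built on Python's `difflib.SequenceMatcher`. A minimal sketch of the word-level step, using toy token lists in place of the normalized word boxes extracted from the PDFs:

```python
from difflib import SequenceMatcher

# Toy stand-ins for the normalized word tokens extracted from each PDF.
old_tokens = "the quick brown fox jumps over the lazy dog".split()
new_tokens = "the quick brown fox leaps over the sleepy dog".split()

sm = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
deleted, inserted = [], []
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag in ("delete", "replace"):
        deleted.extend(old_tokens[i1:i2])   # highlighted on the old pages
    if tag in ("insert", "replace"):
        inserted.extend(new_tokens[j1:j2])  # highlighted on the new pages

print(deleted)   # ['jumps', 'lazy']
print(inserted)  # ['leaps', 'sleepy']
```

In the real tool each token carries its bounding box, so the opcodes map straight back to rectangles on the original pages.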
78
+
79
+ ## License
80
+
81
+ MIT
pdfdelta-0.1.0/pyproject.toml ADDED
@@ -0,0 +1,30 @@
1
+ [build-system]
2
+ requires = ["setuptools>=68", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "pdfdelta"
7
+ version = "0.1.0"
8
+ description = "Visual diff for born-digital PDFs — highlights changes directly on the original pages"
9
+ readme = "README.md"
10
+ license = {text = "MIT"}
11
+ requires-python = ">=3.10"
12
+ keywords = ["pdf", "diff", "compare", "highlight", "academic", "paper"]
13
+ classifiers = [
14
+ "Development Status :: 4 - Beta",
15
+ "Intended Audience :: Science/Research",
16
+ "Topic :: Text Processing :: General",
17
+ "Programming Language :: Python :: 3",
18
+ ]
19
+ dependencies = [
20
+ "PyMuPDF>=1.24",
21
+ ]
22
+
23
+ [project.scripts]
24
+ pdfdelta = "pdfdelta.cli:main"
25
+
26
+ [tool.setuptools]
27
+ package-dir = {"" = "src"}
28
+
29
+ [tool.setuptools.packages.find]
30
+ where = ["src"]
pdfdelta-0.1.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
pdfdelta-0.1.0/src/pdfdelta/__init__.py ADDED
@@ -0,0 +1,8 @@
1
+ """pdfdelta — visual diff for born-digital PDFs."""
2
+
3
+ from .annotate import apply_annotations
4
+ from .compare import compare_documents
5
+ from .extract import extract_document
6
+
7
+ __all__ = ["extract_document", "compare_documents", "apply_annotations"]
8
+ __version__ = "0.1.0"
pdfdelta-0.1.0/src/pdfdelta/annotate.py ADDED
@@ -0,0 +1,39 @@
1
+ from __future__ import annotations
2
+
3
+ import fitz
4
+
5
+ from .models import RectTuple
6
+
7
+
8
+ def add_highlight(
9
+ page: fitz.Page,
10
+ rect_tuple: RectTuple,
11
+ color: tuple[float, float, float],
12
+ opacity: float = 0.35,
13
+ ) -> None:
14
+ rect = fitz.Rect(rect_tuple)
15
+ annot = page.add_highlight_annot(quads=[rect])
16
+ annot.set_colors(stroke=color)
17
+ annot.set_opacity(opacity)
18
+ annot.update()
19
+
20
+
21
+ def apply_annotations(
22
+ input_pdf: str,
23
+ output_pdf: str,
24
+ page_to_rects: dict[int, list[RectTuple]],
25
+ color: tuple[float, float, float],
26
+ opacity: float = 0.35,
27
+ ) -> None:
28
+ doc = fitz.open(input_pdf)
29
+ try:
30
+ for page_index, rects in page_to_rects.items():
31
+ if page_index >= len(doc):
32
+ continue
33
+ page = doc[page_index]
34
+ for rect in rects:
35
+ add_highlight(page, rect, color=color, opacity=opacity)
36
+
37
+ doc.save(output_pdf, garbage=4, deflate=True)
38
+ finally:
39
+ doc.close()
pdfdelta-0.1.0/src/pdfdelta/cli.py ADDED
@@ -0,0 +1,63 @@
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+ from pathlib import Path
5
+
6
+ from .annotate import apply_annotations
7
+ from .compare import compare_documents
8
+ from .extract import extract_document
9
+
10
+
11
+ def build_parser() -> argparse.ArgumentParser:
12
+ parser = argparse.ArgumentParser(
13
+ prog="pdfdelta",
14
+ description="Compare two born-digital PDFs and write highlights back to original PDFs.",
15
+ )
16
+ parser.add_argument("old_pdf", help="old/original PDF")
17
+ parser.add_argument("new_pdf", help="new/revised PDF")
18
+ parser.add_argument("--old-out", default="old_marked.pdf", help="output annotated old PDF")
19
+ parser.add_argument("--new-out", default="new_marked.pdf", help="output annotated new PDF")
20
+ parser.add_argument("--opacity", type=float, default=0.35, help="annotation opacity")
21
+ return parser
22
+
23
+
24
+ def main() -> None:
25
+ parser = build_parser()
26
+ args = parser.parse_args()
27
+
28
+ old_pdf = str(Path(args.old_pdf))
29
+ new_pdf = str(Path(args.new_pdf))
30
+
31
+ old_pages = extract_document(old_pdf)
32
+ new_pages = extract_document(new_pdf)
33
+
34
+ old_rects, new_rects = compare_documents(old_pages, new_pages)
35
+
36
+ apply_annotations(
37
+ input_pdf=old_pdf,
38
+ output_pdf=args.old_out,
39
+ page_to_rects=old_rects,
40
+ color=(1.0, 0.0, 0.0),
41
+ opacity=args.opacity,
42
+ )
43
+
44
+ apply_annotations(
45
+ input_pdf=new_pdf,
46
+ output_pdf=args.new_out,
47
+ page_to_rects=new_rects,
48
+ color=(0.0, 1.0, 0.0),
49
+ opacity=args.opacity,
50
+ )
51
+
52
+ old_count = sum(len(v) for v in old_rects.values())
53
+ new_count = sum(len(v) for v in new_rects.values())
54
+
55
+ print("Done.")
56
+ print(f"Old annotations: {old_count}")
57
+ print(f"New annotations: {new_count}")
58
+ print(f"Wrote: {args.old_out}")
59
+ print(f"Wrote: {args.new_out}")
60
+
61
+
62
+ if __name__ == "__main__":
63
+ main()
pdfdelta-0.1.0/src/pdfdelta/compare.py ADDED
@@ -0,0 +1,443 @@
1
+ from __future__ import annotations
2
+
3
+ from collections import Counter, defaultdict
4
+ from difflib import SequenceMatcher
5
+ from typing import DefaultDict
6
+
7
+ from .models import LineBox, PageBox, RectTuple, WordBox
8
+
9
+
10
+ def _sub_word_rect(
11
+ old_word: WordBox, new_word: WordBox,
12
+ ) -> tuple[RectTuple, RectTuple]:
13
+ """Narrow a 1:1 word replacement to only the changed characters.
14
+
15
+ Estimates left/right boundaries proportionally based on the shared
16
+ prefix and suffix lengths. Falls back to the full word rects when
17
+ the words are completely different.
18
+ """
19
+ a, b = old_word.norm, new_word.norm
20
+ # Find common prefix length
21
+ prefix = 0
22
+ while prefix < len(a) and prefix < len(b) and a[prefix] == b[prefix]:
23
+ prefix += 1
24
+ # Find common suffix length (don't overlap with prefix)
25
+ suffix = 0
26
+ while (
27
+ suffix < len(a) - prefix
28
+ and suffix < len(b) - prefix
29
+ and a[-(suffix + 1)] == b[-(suffix + 1)]
30
+ ):
31
+ suffix += 1
32
+
33
+ if prefix == 0 and suffix == 0:
34
+ return old_word.rect, new_word.rect
35
+
36
+ def _trim(rect: RectTuple, text_len: int) -> RectTuple:
37
+ if text_len == 0:
38
+ return rect
39
+ x0, y0, x1, y1 = rect
40
+ w = x1 - x0
41
+ new_x0 = x0 + w * (prefix / text_len)
42
+ new_x1 = x1 - w * (suffix / text_len)
43
+ if new_x0 >= new_x1:
44
+ return rect
45
+ return (new_x0, y0, new_x1, y1)
46
+
47
+ return _trim(old_word.rect, len(a)), _trim(new_word.rect, len(b))
48
+
49
+
50
+ def _chunk_word_diff(
51
+ old_chunk: list[LineBox],
52
+ new_chunk: list[LineBox],
53
+ ) -> tuple[list[RectTuple], list[RectTuple]]:
54
+ """Flatten words across a multi-line chunk and do word-level diff.
55
+
56
+ This handles text reflow: when line breaks shift but the actual words
57
+ are mostly the same, only the truly changed words get highlighted.
58
+ """
59
+ old_words = [w for line in old_chunk for w in line.words]
60
+ new_words = [w for line in new_chunk for w in line.words]
61
+
62
+ old_tokens = [w.norm for w in old_words]
63
+ new_tokens = [w.norm for w in new_words]
64
+
65
+ sm = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
66
+ old_rects: list[RectTuple] = []
67
+ new_rects: list[RectTuple] = []
68
+
69
+ for tag, i1, i2, j1, j2 in sm.get_opcodes():
70
+ if tag == "equal":
71
+ continue
72
+ if tag == "replace" and (i2 - i1) == 1 and (j2 - j1) == 1:
73
+ # Single word replaced by single word — use sub-word precision
74
+ o_r, n_r = _sub_word_rect(old_words[i1], new_words[j1])
75
+ old_rects.append(o_r)
76
+ new_rects.append(n_r)
77
+ else:
78
+ if tag in ("delete", "replace"):
79
+ old_rects.extend(w.rect for w in old_words[i1:i2])
80
+ if tag in ("insert", "replace"):
81
+ new_rects.extend(w.rect for w in new_words[j1:j2])
82
+
83
+ return old_rects, new_rects
84
+
85
+
86
+ def _merge_opcodes(opcodes: list[tuple]) -> list[tuple]:
87
+ """Merge adjacent delete/insert pairs into replace blocks.
88
+
89
+ SequenceMatcher often emits a delete immediately followed by an insert
90
+ (or vice-versa) for text that simply reflowed across lines. Merging
91
+ them into a single 'replace' lets _chunk_word_diff handle the reflow
92
+ and only highlight truly changed words.
93
+ """
94
+ merged: list[tuple] = []
95
+ for op in opcodes:
96
+ if not merged:
97
+ merged.append(op)
98
+ continue
99
+ prev_tag, pi1, pi2, pj1, pj2 = merged[-1]
100
+ tag, i1, i2, j1, j2 = op
101
+ # Merge delete+insert or insert+delete into replace
102
+ if (
103
+ {prev_tag, tag} == {"delete", "insert"}
104
+ or (prev_tag == "replace" and tag in ("delete", "insert"))
105
+ or (tag == "replace" and prev_tag in ("delete", "insert"))
106
+ ) and pi2 == i1 and pj2 == j1:
107
+ merged[-1] = ("replace", pi1, i2, pj1, j2)
108
+ else:
109
+ merged.append(op)
110
+ return merged
111
+
112
+
113
+ def _same_line(a: RectTuple | list[float], b: RectTuple) -> bool:
114
+ """Check if two rects are on the same text line (>50% y overlap)."""
115
+ y_overlap = min(a[3], b[3]) - max(a[1], b[1])
116
+ y_height = max(a[3] - a[1], b[3] - b[1])
117
+ return y_height > 0 and y_overlap / y_height > 0.5
118
+
119
+
120
+ def merge_nearby_rects(rects: list[RectTuple], x_gap: float = 10.0) -> list[RectTuple]:
121
+ """Merge horizontally adjacent rects that share the same line.
122
+
123
+ Consecutive highlighted word rects on the same line are combined
124
+ into a single wide rect. x_gap should be at least as large as
125
+ the normal word spacing in the PDF (~3-9 pt typically).
126
+ """
127
+ if not rects:
128
+ return []
129
+
130
+ # Sort by vertical midpoint then left edge.
131
+ sorted_rects = sorted(rects, key=lambda r: ((r[1] + r[3]) / 2, r[0]))
132
+
133
+ merged: list[list[float]] = [list(sorted_rects[0])]
134
+ for r in sorted_rects[1:]:
135
+ prev = merged[-1]
136
+ # gap > 0 means r starts after prev ends; gap < 0 means overlap/wrap.
137
+ # Only merge when gap is within [-x_gap, x_gap] to prevent merging
138
+ # across columns (where r[0] << prev[2] due to column switch).
139
+ gap = r[0] - prev[2]
140
+ if _same_line(prev, r) and -x_gap <= gap <= x_gap:
141
+ prev[0] = min(prev[0], r[0])
142
+ prev[2] = max(prev[2], r[2])
143
+ prev[1] = min(prev[1], r[1])
144
+ prev[3] = max(prev[3], r[3])
145
+ else:
146
+ merged.append(list(r))
147
+
148
+ return [(m[0], m[1], m[2], m[3]) for m in merged]
149
+
150
+
151
+ def dedupe_rects(rects: list[RectTuple], ndigits: int = 1) -> list[RectTuple]:
152
+ seen = set()
153
+ out: list[RectTuple] = []
154
+ for r in rects:
155
+ key = tuple(round(v, ndigits) for v in r)
156
+ if key in seen:
157
+ continue
158
+ seen.add(key)
159
+ out.append(r)
160
+ return out
161
+
162
+
163
+ def _dehyphenate_norms(norms: list[str]) -> list[tuple[str, list[int]]]:
164
+ """Join hyphenated word pairs into single tokens.
165
+
166
+ PDF line-break hyphenation produces e.g. ``"aver-", "age"`` which
167
+ should be treated as ``"average"`` for matching purposes.
168
+
169
+ Returns ``[(dehyphenated_norm, [original_indices]), ...]``.
170
+ """
171
+ result: list[tuple[str, list[int]]] = []
172
+ i = 0
173
+ while i < len(norms):
174
+ if norms[i].endswith('-') and len(norms[i]) > 1 and i + 1 < len(norms):
175
+ joined = norms[i][:-1] + norms[i + 1]
176
+ result.append((joined, [i, i + 1]))
177
+ i += 2
178
+ else:
179
+ result.append((norms[i], [i]))
180
+ i += 1
181
+ return result
182
+
183
+
184
+ def _is_hyph_match(a: str, b: str) -> bool:
185
+ """Check if *a* and *b* are plausibly the same word split by hyphenation.
186
+
187
+ Detected patterns:
188
+ - ``"aver-"`` vs ``"average"`` (line-break hyphen on one side)
189
+ - ``"particularly"`` vs ``"ticularly"`` (continuation of ``"par-"``
190
+ on the previous page)
191
+ """
192
+ # Pattern 1: one ends with '-', the other starts with that prefix.
193
+ if a.endswith('-') and len(a) > 2:
194
+ prefix = a[:-1]
195
+ if b.startswith(prefix) and len(b) > len(prefix):
196
+ return True
197
+ if b.endswith('-') and len(b) > 2:
198
+ prefix = b[:-1]
199
+ if a.startswith(prefix) and len(a) > len(prefix):
200
+ return True
201
+ # Pattern 2: one token is a suffix of the other (the "continuation"
202
+ # half of a hyphenated word whose first half sits on the prev page).
203
+ shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
204
+ if len(shorter) >= 4 and longer.endswith(shorter) and len(shorter) >= len(longer) * 0.5:
205
+ return True
206
+ return False
207
+
208
+
209
+ def compare_documents(
210
+ old_pages: list[PageBox],
211
+ new_pages: list[PageBox],
212
+ ) -> tuple[dict[int, list[RectTuple]], dict[int, list[RectTuple]]]:
213
+ """Compare two documents using a global (cross-page) diff.
214
+
215
+ Flattens all lines from all pages, performs a single global diff,
216
+ then maps highlighted rects back to their source pages. This
217
+ correctly handles text that reflowed across page boundaries.
218
+
219
+ Lines whose normalized text appears exactly once in each document
220
+ are recognised as *moved* (page reflow) and excluded from the diff
221
+ so that only genuinely changed text gets highlighted.
222
+ """
223
+ # Flatten: list of (page_index, LineBox)
224
+ old_flat: list[tuple[int, LineBox]] = [
225
+ (p.page_index, line) for p in old_pages for line in p.lines
226
+ ]
227
+ new_flat: list[tuple[int, LineBox]] = [
228
+ (p.page_index, line) for p in new_pages for line in p.lines
229
+ ]
230
+
231
+ old_texts = [line.norm_text for _, line in old_flat]
232
+ new_texts = [line.norm_text for _, line in new_flat]
233
+
234
+ # Frequency counts for move detection: a line appearing exactly once
235
+ # in each document is the same line, just at a different position.
236
+ old_counts = Counter(old_texts)
237
+ new_counts = Counter(new_texts)
238
+ new_text_set = set(new_texts)
239
+ old_text_set = set(old_texts)
240
+
241
+ def _is_moved(text: str) -> bool:
242
+ """True when *text* appears exactly once in each document."""
243
+ return old_counts[text] == 1 and new_counts[text] == 1
244
+
245
+ sm = SequenceMatcher(a=old_texts, b=new_texts, autojunk=False)
246
+ opcodes = _merge_opcodes(list(sm.get_opcodes()))
247
+
248
+ # ── First pass: collect candidate highlight words ────────────────
249
+ # Each candidate is (WordBox, RectTuple, is_subword).
250
+ # is_subword=True means the rect was trimmed by _sub_word_rect and
251
+ # represents a genuine character-level change (never suppress these).
252
+ old_cands: list[tuple[WordBox, RectTuple, bool]] = []
253
+ new_cands: list[tuple[WordBox, RectTuple, bool]] = []
254
+
255
+ for tag, i1, i2, j1, j2 in opcodes:
256
+ if tag == "equal":
257
+ continue
258
+
259
+ if tag == "delete":
260
+ for idx in range(i1, i2):
261
+ _, line = old_flat[idx]
262
+ if line.norm_text in new_text_set and _is_moved(line.norm_text):
263
+ continue
264
+ for w in line.words:
265
+ old_cands.append((w, w.rect, False))
266
+ continue
267
+
268
+ if tag == "insert":
269
+ for idx in range(j1, j2):
270
+ _, line = new_flat[idx]
271
+ if line.norm_text in old_text_set and _is_moved(line.norm_text):
272
+ continue
273
+ for w in line.words:
274
+ new_cands.append((w, w.rect, False))
275
+ continue
276
+
277
+ # tag == "replace" — pre-filter moved lines, then word-level diff
278
+ old_chunk_texts = set(old_texts[i1:i2])
279
+ new_chunk_texts = set(new_texts[j1:j2])
280
+
281
+ moved_old = set()
282
+ for idx in range(i1, i2):
283
+ t = old_texts[idx]
284
+ if t not in new_chunk_texts and t in new_text_set and _is_moved(t):
285
+ moved_old.add(idx)
286
+
287
+ moved_new = set()
288
+ for idx in range(j1, j2):
289
+ t = new_texts[idx]
290
+ if t not in old_chunk_texts and t in old_text_set and _is_moved(t):
291
+ moved_new.add(idx)
292
+
293
+ old_words = [w for i, (_, line) in enumerate(old_flat[i1:i2], i1)
294
+ if i not in moved_old for w in line.words]
295
+ new_words = [w for i, (_, line) in enumerate(new_flat[j1:j2], j1)
296
+ if i not in moved_new for w in line.words]
297
+
298
+ if not old_words and not new_words:
299
+ continue
300
+
301
+ old_norms = [w.norm for w in old_words]
302
+ new_norms = [w.norm for w in new_words]
303
+
304
+ sm2 = SequenceMatcher(a=old_norms, b=new_norms, autojunk=False)
305
+ for op_tag, oi1, oi2, oj1, oj2 in sm2.get_opcodes():
306
+ if op_tag == "equal":
307
+ continue
308
+ if op_tag == "replace" and (oi2 - oi1) == 1 and (oj2 - oj1) == 1:
309
+ o_r, n_r = _sub_word_rect(old_words[oi1], new_words[oj1])
310
+ old_cands.append((old_words[oi1], o_r, o_r != old_words[oi1].rect))
311
+ new_cands.append((new_words[oj1], n_r, n_r != new_words[oj1].rect))
312
+ else:
313
+ if op_tag in ("delete", "replace"):
314
+ for w in old_words[oi1:oi2]:
315
+ old_cands.append((w, w.rect, False))
316
+ if op_tag in ("insert", "replace"):
317
+ for w in new_words[oj1:oj2]:
318
+ new_cands.append((w, w.rect, False))
319
+
320
+ # ── Second pass: page-boundary reflow suppression ──────────────
321
+ # For each pair of adjacent pages (old_pg P, new_pg P±1), compare
322
+ # ALL words from both pages using dehyphenation-aware normalization.
323
+ # Candidate words within contiguous matching runs of ≥ MIN_MATCH
324
+ # tokens are recognised as reflowed text and suppressed.
325
+ #
326
+ # Within-page dehyphenation joins e.g. "aver-"+"age" → "average".
327
+ # Cross-page hyphenation (where a figure sits between the two halves)
328
+ # is handled via _is_hyph_match tolerance on 1:1 replace blocks.
329
+ MIN_MATCH = 2
330
+
331
+ suppress_old: set[int] = set()
332
+ suppress_new: set[int] = set()
333
+
334
+ # Index: (page_index, word_rect) → candidate index
335
+ old_cand_lookup: dict[tuple[int, RectTuple], int] = {}
336
+ new_cand_lookup: dict[tuple[int, RectTuple], int] = {}
337
+ old_cand_pages: set[int] = set()
338
+ new_cand_pages: set[int] = set()
339
+
340
+ for i, (w, _, sub) in enumerate(old_cands):
341
+ if not sub:
342
+ old_cand_lookup[(w.page_index, w.rect)] = i
343
+ old_cand_pages.add(w.page_index)
344
+ for i, (w, _, sub) in enumerate(new_cands):
345
+ if not sub:
346
+ new_cand_lookup[(w.page_index, w.rect)] = i
347
+ new_cand_pages.add(w.page_index)
348
+
349
+ def _all_words(pages: list[PageBox], pg: int) -> list[WordBox]:
350
+ if 0 <= pg < len(pages):
351
+ return [w for line in pages[pg].lines for w in line.words]
352
+ return []
353
+
354
+ def _suppress(dehyph, start, end, words, lookup, target_set):
355
+ """Mark candidate words in dehyph[start:end] for suppression."""
356
+ for k in range(start, end):
357
+ for wi in dehyph[k][1]:
358
+ w = words[wi]
359
+ ci = lookup.get((w.page_index, w.rect))
360
+ if ci is not None:
361
+ target_set.add(ci)
362
+
363
+ checked: set[tuple[int, int]] = set()
364
+ for old_pg in sorted(old_cand_pages):
365
+ for delta in (1, -1):
366
+ new_pg = old_pg + delta
367
+ if new_pg not in new_cand_pages:
368
+ continue
369
+ pair = (old_pg, new_pg)
370
+ if pair in checked:
371
+ continue
372
+ checked.add(pair)
373
+
374
+ old_words_pg = _all_words(old_pages, old_pg)
375
+ new_words_pg = _all_words(new_pages, new_pg)
376
+ if not old_words_pg or not new_words_pg:
377
+ continue
378
+
379
+ old_norms = [w.norm for w in old_words_pg]
380
+ new_norms = [w.norm for w in new_words_pg]
381
+
382
+ # Within-page dehyphenation only (no cross-page peek —
383
+ # adjacent pages may start with figures, not text continuations)
384
+ old_dehyph = _dehyphenate_norms(old_norms)
385
+ new_dehyph = _dehyphenate_norms(new_norms)
386
+
387
+ old_tokens = [t for t, _ in old_dehyph]
388
+ new_tokens = [t for t, _ in new_dehyph]
389
+
390
+ sm_b = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
391
+ ops = list(sm_b.get_opcodes())
392
+
393
+ for oi, (op, bi1, bi2, bj1, bj2) in enumerate(ops):
394
+ if op == "equal" and (bi2 - bi1) >= MIN_MATCH:
395
+ _suppress(old_dehyph, bi1, bi2, old_words_pg,
396
+ old_cand_lookup, suppress_old)
397
+ _suppress(new_dehyph, bj1, bj2, new_words_pg,
398
+ new_cand_lookup, suppress_new)
399
+ elif op == "replace":
400
+ # Boundary tokens of a replace block next to a long
401
+ # equal run may be hyphenation artifacts. E.g.
402
+ # "aver-" vs "average" or "particularly" vs "ticularly".
403
+ # Leading edge (preceded by equal ≥ MIN_MATCH)
404
+ if oi > 0:
405
+ p = ops[oi - 1]
406
+ if (p[0] == "equal" and (p[2] - p[1]) >= MIN_MATCH
407
+ and _is_hyph_match(old_tokens[bi1],
408
+ new_tokens[bj1])):
409
+ _suppress(old_dehyph, bi1, bi1 + 1,
410
+ old_words_pg, old_cand_lookup,
411
+ suppress_old)
412
+ _suppress(new_dehyph, bj1, bj1 + 1,
413
+ new_words_pg, new_cand_lookup,
414
+ suppress_new)
415
+ # Trailing edge (followed by equal ≥ MIN_MATCH)
416
+ if oi < len(ops) - 1:
417
+ n = ops[oi + 1]
418
+ if (n[0] == "equal" and (n[2] - n[1]) >= MIN_MATCH
419
+ and _is_hyph_match(old_tokens[bi2 - 1],
420
+ new_tokens[bj2 - 1])):
421
+ _suppress(old_dehyph, bi2 - 1, bi2,
422
+ old_words_pg, old_cand_lookup,
423
+ suppress_old)
424
+ _suppress(new_dehyph, bj2 - 1, bj2,
425
+ new_words_pg, new_cand_lookup,
426
+ suppress_new)
427
+
428
+ # ── Map surviving candidates to pages ────────────────────────────
429
+ old_map: DefaultDict[int, list[RectTuple]] = defaultdict(list)
430
+ new_map: DefaultDict[int, list[RectTuple]] = defaultdict(list)
431
+
432
+ for i, (w, rect, _) in enumerate(old_cands):
433
+ if i not in suppress_old:
434
+ old_map[w.page_index].append(rect)
435
+
436
+ for i, (w, rect, _) in enumerate(new_cands):
437
+ if i not in suppress_new:
438
+ new_map[w.page_index].append(rect)
439
+
440
+ return (
441
+ {k: merge_nearby_rects(dedupe_rects(v)) for k, v in old_map.items()},
442
+ {k: merge_nearby_rects(dedupe_rects(v)) for k, v in new_map.items()},
443
+ )
pdfdelta-0.1.0/src/pdfdelta/extract.py ADDED
@@ -0,0 +1,85 @@
1
+ from __future__ import annotations
2
+
3
+ import re
4
+ from collections import defaultdict
5
+
6
+ import fitz
7
+
8
+ from .models import LineBox, PageBox, WordBox
9
+
10
+
11
+ def _normalize(text: str) -> str:
12
+ return re.sub(r"\s+", " ", text.strip()).lower()
13
+
14
+
15
+ def group_words_into_lines(
16
+ raw_words: list[tuple],
17
+ page_index: int,
18
+ ) -> list[LineBox]:
19
+ """Group PyMuPDF word tuples into :class:`LineBox` objects.
20
+
21
+ Groups by ``(block_no, line_no)`` so that multi-column layouts are
22
+ handled correctly — words in different columns get separate lines
23
+ even when they share the same y-coordinate.
24
+ """
25
+ groups: defaultdict[tuple[int, int], list[tuple]] = defaultdict(list)
26
+ for w in raw_words:
27
+ x0, y0, x1, y1, text, block_no, line_no, word_no = w[:8]
28
+ if not str(text).strip():
29
+ continue
30
+ groups[(int(block_no), int(line_no))].append(
31
+ (x0, y0, x1, y1, str(text))
32
+ )
33
+
34
+ # Sort by (block_no, line_no) for stable ordering.
35
+ # Using PyMuPDF's own block ordering keeps line order consistent
36
+ # between two versions of the same document, even when figures
37
+ # shift vertically by a few pixels.
38
+ sorted_keys = sorted(groups.keys())
39
+
40
+ lines: list[LineBox] = []
41
+ for line_idx, key in enumerate(sorted_keys):
42
+ word_items = sorted(groups[key], key=lambda t: t[0]) # sort by x0
43
+ line_words: list[WordBox] = []
44
+ plain_words: list[str] = []
45
+
46
+ for word_idx, (x0, y0, x1, y1, text) in enumerate(word_items):
47
+ word = WordBox(
48
+ page_index=page_index,
49
+ line_index=line_idx,
50
+ word_index=word_idx,
51
+ text=text,
52
+ norm=_normalize(text),
53
+ rect=(x0, y0, x1, y1),
54
+ )
55
+ line_words.append(word)
56
+ plain_words.append(text)
57
+
58
+ line_text = " ".join(plain_words)
59
+ lines.append(
60
+ LineBox(
61
+ page_index=page_index,
62
+ line_index=line_idx,
63
+ text=line_text,
64
+ norm_text=_normalize(line_text),
65
+ words=line_words,
66
+ )
67
+ )
68
+
69
+ return lines
70
+
71
+
72
+ def extract_document(path: str) -> list[PageBox]:
73
+ doc = fitz.open(path)
74
+ pages: list[PageBox] = []
75
+
76
+ try:
77
+ for page_index in range(len(doc)):
78
+ page = doc[page_index]
79
+ raw_words = page.get_text("words", sort=True)
80
+ lines = group_words_into_lines(raw_words, page_index=page_index)
81
+ pages.append(PageBox(page_index=page_index, lines=lines))
82
+ finally:
83
+ doc.close()
84
+
85
+ return pages
pdfdelta-0.1.0/src/pdfdelta/models.py ADDED
@@ -0,0 +1,38 @@
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass
4
+
5
+ RectTuple = tuple[float, float, float, float]
6
+
7
+
8
+ @dataclass(frozen=True)
9
+ class WordBox:
10
+ page_index: int
11
+ line_index: int
12
+ word_index: int
13
+ text: str
14
+ norm: str
15
+ rect: RectTuple
16
+
17
+
18
+ @dataclass(frozen=True)
19
+ class LineBox:
20
+ page_index: int
21
+ line_index: int
22
+ text: str
23
+ norm_text: str
24
+ words: list[WordBox]
25
+
26
+
27
+ @dataclass(frozen=True)
28
+ class PageBox:
29
+ page_index: int
30
+ lines: list[LineBox]
31
+
32
+ @property
33
+ def text(self) -> str:
34
+ return "\n".join(line.text for line in self.lines)
35
+
36
+ @property
37
+ def norm_text(self) -> str:
38
+ return "\n".join(line.norm_text for line in self.lines)
pdfdelta-0.1.0/src/pdfdelta.egg-info/PKG-INFO ADDED
@@ -0,0 +1,97 @@
1
+ Metadata-Version: 2.4
2
+ Name: pdfdelta
3
+ Version: 0.1.0
4
+ Summary: Visual diff for born-digital PDFs — highlights changes directly on the original pages
5
+ License: MIT
6
+ Keywords: pdf,diff,compare,highlight,academic,paper
7
+ Classifier: Development Status :: 4 - Beta
8
+ Classifier: Intended Audience :: Science/Research
9
+ Classifier: Topic :: Text Processing :: General
10
+ Classifier: Programming Language :: Python :: 3
11
+ Requires-Python: >=3.10
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+ Requires-Dist: PyMuPDF>=1.24
15
+ Dynamic: license-file
16
+
17
+ # pdfdelta
18
+
19
+ **pdfdelta** is a lightweight visual diff tool for born-digital PDFs.
20
+
21
+ Given an old and a new version of a PDF, it writes highlights directly onto the original pages so revisions are easy to review: deletions on the old file, additions on the new file.
22
+
23
+ It is mainly designed for academic papers and technical documents, where small wording changes matter and layout is part of the review process.
24
+
25
+ <p align="center">
26
+ <img src="examples/old_marked.png" alt="Old PDF with deletions highlighted" width="48%" />
27
+ <img src="examples/new_marked.png" alt="New PDF with additions highlighted" width="48%" />
28
+ </p>
29
+
30
+ ## Features
31
+
32
+ - Highlights changes directly on the original PDF pages
33
+ - Works well for born-digital PDFs such as papers, reports, and drafts
34
+ - Handles multi-column layouts better than plain text diff tools
35
+ - Tries to reduce noisy highlights from simple reflow
36
+ - Keeps the review workflow visual and page-based
37
+
38
+ ## Installation
39
+
40
+ If you are using the repository directly:
41
+
42
+ ```sh
43
+ pip install git+https://github.com/mli55/pdfdelta.git
44
+ ```
45
+
46
+ ## Usage
47
+
48
+ ```sh
49
+ pdfdelta old.pdf new.pdf
50
+ ```
51
+
52
+ This writes two annotated files:
53
+
54
+ - `old_marked.pdf` — original pages with deletions highlighted
55
+ - `new_marked.pdf` — revised pages with additions highlighted
56
+
57
+ ### Options
58
+
59
+ | Flag | Default | Description |
60
+ | ---- | ------- | ----------- |
61
+ | `--old-out` | `old_marked.pdf` | Output path for the annotated old PDF |
62
+ | `--new-out` | `new_marked.pdf` | Output path for the annotated new PDF |
63
+ | `--opacity` | `0.35` | Highlight opacity (0.0–1.0) |
64
+
65
+ ## How It Works
66
+
67
+ ```
68
+ old.pdf new.pdf
69
+ │ │
70
+ ▼ ▼
71
+ ┌──────────────────┐
72
+ │ Extract words │ PyMuPDF: word text + bounding boxes
73
+ └────────┬─────────┘
74
+
75
+ ┌──────────────────┐
76
+ │ Global diff │ Flatten all pages → SequenceMatcher
77
+ └────────┬─────────┘
78
+
79
+ ┌──────────────────┐
80
+ │ Word-level diff │ Per-word & sub-word precision
81
+ └────────┬─────────┘
82
+
83
+ ┌──────────────────┐
84
+ │ Reflow filter │ Suppress cross-page / cross-column noise
85
+ └────────┬─────────┘
86
+
87
+ ┌──────────────────┐
88
+ │ Annotate PDFs │ Highlights on original pages
89
+ └────────┬─────────┘
90
+
91
+ old_marked.pdf
92
+ new_marked.pdf
93
+ ```
94
+
95
+ ## License
96
+
97
+ MIT
pdfdelta-0.1.0/src/pdfdelta.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,15 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ src/pdfdelta/__init__.py
5
+ src/pdfdelta/annotate.py
6
+ src/pdfdelta/cli.py
7
+ src/pdfdelta/compare.py
8
+ src/pdfdelta/extract.py
9
+ src/pdfdelta/models.py
10
+ src/pdfdelta.egg-info/PKG-INFO
11
+ src/pdfdelta.egg-info/SOURCES.txt
12
+ src/pdfdelta.egg-info/dependency_links.txt
13
+ src/pdfdelta.egg-info/entry_points.txt
14
+ src/pdfdelta.egg-info/requires.txt
15
+ src/pdfdelta.egg-info/top_level.txt
pdfdelta-0.1.0/src/pdfdelta.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ pdfdelta = pdfdelta.cli:main
pdfdelta-0.1.0/src/pdfdelta.egg-info/requires.txt ADDED
@@ -0,0 +1 @@
1
+ PyMuPDF>=1.24
pdfdelta-0.1.0/src/pdfdelta.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
1
+ pdfdelta