pdfdelta 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pdfdelta-0.1.0/LICENSE +21 -0
- pdfdelta-0.1.0/PKG-INFO +97 -0
- pdfdelta-0.1.0/README.md +81 -0
- pdfdelta-0.1.0/pyproject.toml +30 -0
- pdfdelta-0.1.0/setup.cfg +4 -0
- pdfdelta-0.1.0/src/pdfdelta/__init__.py +8 -0
- pdfdelta-0.1.0/src/pdfdelta/annotate.py +39 -0
- pdfdelta-0.1.0/src/pdfdelta/cli.py +63 -0
- pdfdelta-0.1.0/src/pdfdelta/compare.py +443 -0
- pdfdelta-0.1.0/src/pdfdelta/extract.py +85 -0
- pdfdelta-0.1.0/src/pdfdelta/models.py +38 -0
- pdfdelta-0.1.0/src/pdfdelta.egg-info/PKG-INFO +97 -0
- pdfdelta-0.1.0/src/pdfdelta.egg-info/SOURCES.txt +15 -0
- pdfdelta-0.1.0/src/pdfdelta.egg-info/dependency_links.txt +1 -0
- pdfdelta-0.1.0/src/pdfdelta.egg-info/entry_points.txt +2 -0
- pdfdelta-0.1.0/src/pdfdelta.egg-info/requires.txt +1 -0
- pdfdelta-0.1.0/src/pdfdelta.egg-info/top_level.txt +1 -0
pdfdelta-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 pdfcompare contributors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
pdfdelta-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,97 @@
+Metadata-Version: 2.4
+Name: pdfdelta
+Version: 0.1.0
+Summary: Visual diff for born-digital PDFs — highlights changes directly on the original pages
+License: MIT
+Keywords: pdf,diff,compare,highlight,academic,paper
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Science/Research
+Classifier: Topic :: Text Processing :: General
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: PyMuPDF>=1.24
+Dynamic: license-file
+
+# pdfdelta
+
+**pdfdelta** is a lightweight visual diff tool for born-digital PDFs.
+
+Given an old and a new version of a PDF, it writes highlights directly onto the original pages so revisions are easy to review: deletions on the old file, additions on the new file.
+
+It is mainly designed for academic papers and technical documents, where small wording changes matter and layout is part of the review process.
+
+<p align="center">
+<img src="examples/old_marked.png" alt="Old PDF with deletions highlighted" width="48%" />
+<img src="examples/new_marked.png" alt="New PDF with additions highlighted" width="48%" />
+</p>
+
+## Features
+
+- Highlights changes directly on the original PDF pages
+- Works well for born-digital PDFs such as papers, reports, and drafts
+- Handles multi-column layouts better than plain text diff tools
+- Tries to reduce noisy highlights from simple reflow
+- Keeps the review workflow visual and page-based
+
+## Installation
+
+If you are using the repository directly:
+
+```sh
+pip install git+https://github.com/mli55/pdfdelta.git
+```
+
+## Usage
+
+```sh
+pdfdelta old.pdf new.pdf
+```
+
+This writes two annotated files:
+
+- `old_marked.pdf` — original pages with deletions highlighted
+- `new_marked.pdf` — revised pages with additions highlighted
+
+### Options
+
+| Flag | Default | Description |
+| ---- | ------- | ----------- |
+| `--old-out` | `old_marked.pdf` | Output path for the annotated old PDF |
+| `--new-out` | `new_marked.pdf` | Output path for the annotated new PDF |
+| `--opacity` | `0.35` | Highlight opacity (0.0–1.0) |
+
+## How It Works
+
+```
+old.pdf        new.pdf
+   │              │
+   ▼              ▼
+┌──────────────────┐
+│  Extract words   │  PyMuPDF: word text + bounding boxes
+└────────┬─────────┘
+         ▼
+┌──────────────────┐
+│   Global diff    │  Flatten all pages → SequenceMatcher
+└────────┬─────────┘
+         ▼
+┌──────────────────┐
+│ Word-level diff  │  Per-word & sub-word precision
+└────────┬─────────┘
+         ▼
+┌──────────────────┐
+│  Reflow filter   │  Suppress cross-page / cross-column noise
+└────────┬─────────┘
+         ▼
+┌──────────────────┐
+│  Annotate PDFs   │  Highlights on original pages
+└────────┬─────────┘
+         ▼
+   old_marked.pdf
+   new_marked.pdf
+```
+
+## License
+
+MIT
pdfdelta-0.1.0/README.md
ADDED
@@ -0,0 +1,81 @@
+# pdfdelta
+
+**pdfdelta** is a lightweight visual diff tool for born-digital PDFs.
+
+Given an old and a new version of a PDF, it writes highlights directly onto the original pages so revisions are easy to review: deletions on the old file, additions on the new file.
+
+It is mainly designed for academic papers and technical documents, where small wording changes matter and layout is part of the review process.
+
+<p align="center">
+<img src="examples/old_marked.png" alt="Old PDF with deletions highlighted" width="48%" />
+<img src="examples/new_marked.png" alt="New PDF with additions highlighted" width="48%" />
+</p>
+
+## Features
+
+- Highlights changes directly on the original PDF pages
+- Works well for born-digital PDFs such as papers, reports, and drafts
+- Handles multi-column layouts better than plain text diff tools
+- Tries to reduce noisy highlights from simple reflow
+- Keeps the review workflow visual and page-based
+
+## Installation
+
+If you are using the repository directly:
+
+```sh
+pip install git+https://github.com/mli55/pdfdelta.git
+```
+
+## Usage
+
+```sh
+pdfdelta old.pdf new.pdf
+```
+
+This writes two annotated files:
+
+- `old_marked.pdf` — original pages with deletions highlighted
+- `new_marked.pdf` — revised pages with additions highlighted
+
+### Options
+
+| Flag | Default | Description |
+| ---- | ------- | ----------- |
+| `--old-out` | `old_marked.pdf` | Output path for the annotated old PDF |
+| `--new-out` | `new_marked.pdf` | Output path for the annotated new PDF |
+| `--opacity` | `0.35` | Highlight opacity (0.0–1.0) |
+
+## How It Works
+
+```
+old.pdf        new.pdf
+   │              │
+   ▼              ▼
+┌──────────────────┐
+│  Extract words   │  PyMuPDF: word text + bounding boxes
+└────────┬─────────┘
+         ▼
+┌──────────────────┐
+│   Global diff    │  Flatten all pages → SequenceMatcher
+└────────┬─────────┘
+         ▼
+┌──────────────────┐
+│ Word-level diff  │  Per-word & sub-word precision
+└────────┬─────────┘
+         ▼
+┌──────────────────┐
+│  Reflow filter   │  Suppress cross-page / cross-column noise
+└────────┬─────────┘
+         ▼
+┌──────────────────┐
+│  Annotate PDFs   │  Highlights on original pages
+└────────┬─────────┘
+         ▼
+   old_marked.pdf
+   new_marked.pdf
+```
+
+## License
+
+MIT
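Both the "Global diff" and "Word-level diff" stages in the pipeline above are built on `difflib.SequenceMatcher` opcodes. A minimal self-contained sketch of the word-level pass (the sample sentences are made up for illustration; in the package the tokens come from PyMuPDF word extraction):

```python
from difflib import SequenceMatcher

# Two versions of a sentence, tokenized into words as the extractor would produce.
old = "the quick brown fox jumps over the lazy dog".split()
new = "the quick red fox leaps over the lazy dog".split()

# autojunk=False mirrors the package's usage: no heuristic token discarding.
sm = SequenceMatcher(a=old, b=new, autojunk=False)

changes = []
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag != "equal":
        # Deleted words get highlighted on the old PDF,
        # inserted words on the new PDF.
        changes.append((tag, old[i1:i2], new[j1:j2]))

print(changes)
# → [('replace', ['brown'], ['red']), ('replace', ['jumps'], ['leaps'])]
```

Each non-`equal` opcode maps back to the bounding boxes of the words it covers, which is what turns a textual diff into page-space highlight rectangles.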
pdfdelta-0.1.0/pyproject.toml
ADDED
@@ -0,0 +1,30 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "pdfdelta"
+version = "0.1.0"
+description = "Visual diff for born-digital PDFs — highlights changes directly on the original pages"
+readme = "README.md"
+license = {text = "MIT"}
+requires-python = ">=3.10"
+keywords = ["pdf", "diff", "compare", "highlight", "academic", "paper"]
+classifiers = [
+    "Development Status :: 4 - Beta",
+    "Intended Audience :: Science/Research",
+    "Topic :: Text Processing :: General",
+    "Programming Language :: Python :: 3",
+]
+dependencies = [
+    "PyMuPDF>=1.24",
+]
+
+[project.scripts]
+pdfdelta = "pdfdelta.cli:main"
+
+[tool.setuptools]
+package-dir = {"" = "src"}

+[tool.setuptools.packages.find]
+where = ["src"]
pdfdelta-0.1.0/src/pdfdelta/__init__.py
ADDED
@@ -0,0 +1,8 @@
+"""pdfdelta — visual diff for born-digital PDFs."""
+
+from .annotate import apply_annotations
+from .compare import compare_documents
+from .extract import extract_document
+
+__all__ = ["extract_document", "compare_documents", "apply_annotations"]
+__version__ = "0.1.0"
pdfdelta-0.1.0/src/pdfdelta/annotate.py
ADDED
@@ -0,0 +1,39 @@
+from __future__ import annotations
+
+import fitz
+
+from .models import RectTuple
+
+
+def add_highlight(
+    page: fitz.Page,
+    rect_tuple: RectTuple,
+    color: tuple[float, float, float],
+    opacity: float = 0.35,
+) -> None:
+    rect = fitz.Rect(rect_tuple)
+    annot = page.add_highlight_annot(quads=[rect])
+    annot.set_colors(stroke=color)
+    annot.set_opacity(opacity)
+    annot.update()
+
+
+def apply_annotations(
+    input_pdf: str,
+    output_pdf: str,
+    page_to_rects: dict[int, list[RectTuple]],
+    color: tuple[float, float, float],
+    opacity: float = 0.35,
+) -> None:
+    doc = fitz.open(input_pdf)
+    try:
+        for page_index, rects in page_to_rects.items():
+            if page_index >= len(doc):
+                continue
+            page = doc[page_index]
+            for rect in rects:
+                add_highlight(page, rect, color=color, opacity=opacity)
+
+        doc.save(output_pdf, garbage=4, deflate=True)
+    finally:
+        doc.close()
pdfdelta-0.1.0/src/pdfdelta/cli.py
ADDED
@@ -0,0 +1,63 @@
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+from .annotate import apply_annotations
+from .compare import compare_documents
+from .extract import extract_document
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(
+        prog="pdfdelta",
+        description="Compare two born-digital PDFs and write highlights back to original PDFs.",
+    )
+    parser.add_argument("old_pdf", help="old/original PDF")
+    parser.add_argument("new_pdf", help="new/revised PDF")
+    parser.add_argument("--old-out", default="old_marked.pdf", help="output annotated old PDF")
+    parser.add_argument("--new-out", default="new_marked.pdf", help="output annotated new PDF")
+    parser.add_argument("--opacity", type=float, default=0.35, help="annotation opacity")
+    return parser
+
+
+def main() -> None:
+    parser = build_parser()
+    args = parser.parse_args()
+
+    old_pdf = str(Path(args.old_pdf))
+    new_pdf = str(Path(args.new_pdf))
+
+    old_pages = extract_document(old_pdf)
+    new_pages = extract_document(new_pdf)
+
+    old_rects, new_rects = compare_documents(old_pages, new_pages)
+
+    apply_annotations(
+        input_pdf=old_pdf,
+        output_pdf=args.old_out,
+        page_to_rects=old_rects,
+        color=(1.0, 0.0, 0.0),
+        opacity=args.opacity,
+    )
+
+    apply_annotations(
+        input_pdf=new_pdf,
+        output_pdf=args.new_out,
+        page_to_rects=new_rects,
+        color=(0.0, 1.0, 0.0),
+        opacity=args.opacity,
+    )
+
+    old_count = sum(len(v) for v in old_rects.values())
+    new_count = sum(len(v) for v in new_rects.values())
+
+    print("Done.")
+    print(f"Old annotations: {old_count}")
+    print(f"New annotations: {new_count}")
+    print(f"Wrote: {args.old_out}")
+    print(f"Wrote: {args.new_out}")
+
+
+if __name__ == "__main__":
+    main()
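The CLI surface is plain `argparse`; a quick standalone check of how defaults and overrides resolve (this rebuilds the same parser shape as `build_parser`, it is not the module itself):

```python
import argparse

# Same flags and defaults as pdfdelta.cli.build_parser, rebuilt standalone.
parser = argparse.ArgumentParser(prog="pdfdelta")
parser.add_argument("old_pdf")
parser.add_argument("new_pdf")
parser.add_argument("--old-out", default="old_marked.pdf")
parser.add_argument("--new-out", default="new_marked.pdf")
parser.add_argument("--opacity", type=float, default=0.35)

# Defaults apply when flags are omitted; dashes become underscores
# in the resulting attribute names.
defaults = parser.parse_args(["old.pdf", "new.pdf"])
print(defaults.old_out, defaults.opacity)   # → old_marked.pdf 0.35

# type=float converts the override before it reaches main().
args = parser.parse_args(["old.pdf", "new.pdf", "--opacity", "0.5"])
print(args.opacity)                         # → 0.5
```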
@@ -0,0 +1,443 @@
|
|
|
1
|
+
from __future__ import annotations
|
|
2
|
+
|
|
3
|
+
from collections import Counter, defaultdict
|
|
4
|
+
from difflib import SequenceMatcher
|
|
5
|
+
from typing import DefaultDict
|
|
6
|
+
|
|
7
|
+
from .models import LineBox, PageBox, RectTuple, WordBox
|
|
8
|
+
|
|
9
|
+
|
|
10
|
+
def _sub_word_rect(
|
|
11
|
+
old_word: WordBox, new_word: WordBox,
|
|
12
|
+
) -> tuple[RectTuple, RectTuple]:
|
|
13
|
+
"""Narrow a 1:1 word replacement to only the changed characters.
|
|
14
|
+
|
|
15
|
+
Estimates left/right boundaries proportionally based on the shared
|
|
16
|
+
prefix and suffix lengths. Falls back to the full word rects when
|
|
17
|
+
the words are completely different.
|
|
18
|
+
"""
|
|
19
|
+
a, b = old_word.norm, new_word.norm
|
|
20
|
+
# Find common prefix length
|
|
21
|
+
prefix = 0
|
|
22
|
+
while prefix < len(a) and prefix < len(b) and a[prefix] == b[prefix]:
|
|
23
|
+
prefix += 1
|
|
24
|
+
# Find common suffix length (don't overlap with prefix)
|
|
25
|
+
suffix = 0
|
|
26
|
+
while (
|
|
27
|
+
suffix < len(a) - prefix
|
|
28
|
+
and suffix < len(b) - prefix
|
|
29
|
+
and a[-(suffix + 1)] == b[-(suffix + 1)]
|
|
30
|
+
):
|
|
31
|
+
suffix += 1
|
|
32
|
+
|
|
33
|
+
if prefix == 0 and suffix == 0:
|
|
34
|
+
return old_word.rect, new_word.rect
|
|
35
|
+
|
|
36
|
+
def _trim(rect: RectTuple, text_len: int) -> RectTuple:
|
|
37
|
+
if text_len == 0:
|
|
38
|
+
return rect
|
|
39
|
+
x0, y0, x1, y1 = rect
|
|
40
|
+
w = x1 - x0
|
|
41
|
+
new_x0 = x0 + w * (prefix / text_len)
|
|
42
|
+
new_x1 = x1 - w * (suffix / text_len)
|
|
43
|
+
if new_x0 >= new_x1:
|
|
44
|
+
return rect
|
|
45
|
+
return (new_x0, y0, new_x1, y1)
|
|
46
|
+
|
|
47
|
+
return _trim(old_word.rect, len(a)), _trim(new_word.rect, len(b))
|
|
48
|
+
|
|
49
|
+
|
|
50
|
+
def _chunk_word_diff(
|
|
51
|
+
old_chunk: list[LineBox],
|
|
52
|
+
new_chunk: list[LineBox],
|
|
53
|
+
) -> tuple[list[RectTuple], list[RectTuple]]:
|
|
54
|
+
"""Flatten words across a multi-line chunk and do word-level diff.
|
|
55
|
+
|
|
56
|
+
This handles text reflow: when line breaks shift but the actual words
|
|
57
|
+
are mostly the same, only the truly changed words get highlighted.
|
|
58
|
+
"""
|
|
59
|
+
old_words = [w for line in old_chunk for w in line.words]
|
|
60
|
+
new_words = [w for line in new_chunk for w in line.words]
|
|
61
|
+
|
|
62
|
+
old_tokens = [w.norm for w in old_words]
|
|
63
|
+
new_tokens = [w.norm for w in new_words]
|
|
64
|
+
|
|
65
|
+
sm = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
|
|
66
|
+
old_rects: list[RectTuple] = []
|
|
67
|
+
new_rects: list[RectTuple] = []
|
|
68
|
+
|
|
69
|
+
for tag, i1, i2, j1, j2 in sm.get_opcodes():
|
|
70
|
+
if tag == "equal":
|
|
71
|
+
continue
|
|
72
|
+
if tag == "replace" and (i2 - i1) == 1 and (j2 - j1) == 1:
|
|
73
|
+
# Single word replaced by single word — use sub-word precision
|
|
74
|
+
o_r, n_r = _sub_word_rect(old_words[i1], new_words[j1])
|
|
75
|
+
old_rects.append(o_r)
|
|
76
|
+
new_rects.append(n_r)
|
|
77
|
+
else:
|
|
78
|
+
if tag in ("delete", "replace"):
|
|
79
|
+
old_rects.extend(w.rect for w in old_words[i1:i2])
|
|
80
|
+
if tag in ("insert", "replace"):
|
|
81
|
+
new_rects.extend(w.rect for w in new_words[j1:j2])
|
|
82
|
+
|
|
83
|
+
return old_rects, new_rects
|
|
84
|
+
|
|
85
|
+
|
|
86
|
+
def _merge_opcodes(opcodes: list[tuple]) -> list[tuple]:
|
|
87
|
+
"""Merge adjacent delete/insert pairs into replace blocks.
|
|
88
|
+
|
|
89
|
+
SequenceMatcher often emits a delete immediately followed by an insert
|
|
90
|
+
(or vice-versa) for text that simply reflowed across lines. Merging
|
|
91
|
+
them into a single 'replace' lets _chunk_word_diff handle the reflow
|
|
92
|
+
and only highlight truly changed words.
|
|
93
|
+
"""
|
|
94
|
+
merged: list[tuple] = []
|
|
95
|
+
for op in opcodes:
|
|
96
|
+
if not merged:
|
|
97
|
+
merged.append(op)
|
|
98
|
+
continue
|
|
99
|
+
prev_tag, pi1, pi2, pj1, pj2 = merged[-1]
|
|
100
|
+
tag, i1, i2, j1, j2 = op
|
|
101
|
+
# Merge delete+insert or insert+delete into replace
|
|
102
|
+
if (
|
|
103
|
+
{prev_tag, tag} == {"delete", "insert"}
|
|
104
|
+
or (prev_tag == "replace" and tag in ("delete", "insert"))
|
|
105
|
+
or (tag == "replace" and prev_tag in ("delete", "insert"))
|
|
106
|
+
) and pi2 == i1 and pj2 == j1:
|
|
107
|
+
merged[-1] = ("replace", pi1, i2, pj1, j2)
|
|
108
|
+
else:
|
|
109
|
+
merged.append(op)
|
|
110
|
+
return merged
|
|
111
|
+
|
|
112
|
+
|
|
113
|
+
def _same_line(a: RectTuple | list[float], b: RectTuple) -> bool:
|
|
114
|
+
"""Check if two rects are on the same text line (>50% y overlap)."""
|
|
115
|
+
y_overlap = min(a[3], b[3]) - max(a[1], b[1])
|
|
116
|
+
y_height = max(a[3] - a[1], b[3] - b[1])
|
|
117
|
+
return y_height > 0 and y_overlap / y_height > 0.5
|
|
118
|
+
|
|
119
|
+
|
|
120
|
+
def merge_nearby_rects(rects: list[RectTuple], x_gap: float = 10.0) -> list[RectTuple]:
|
|
121
|
+
"""Merge horizontally adjacent rects that share the same line.
|
|
122
|
+
|
|
123
|
+
Consecutive highlighted word rects on the same line are combined
|
|
124
|
+
into a single wide rect. x_gap should be at least as large as
|
|
125
|
+
the normal word spacing in the PDF (~3-9 pt typically).
|
|
126
|
+
"""
|
|
127
|
+
if not rects:
|
|
128
|
+
return []
|
|
129
|
+
|
|
130
|
+
# Sort by vertical midpoint then left edge.
|
|
131
|
+
sorted_rects = sorted(rects, key=lambda r: ((r[1] + r[3]) / 2, r[0]))
|
|
132
|
+
|
|
133
|
+
merged: list[list[float]] = [list(sorted_rects[0])]
|
|
134
|
+
for r in sorted_rects[1:]:
|
|
135
|
+
prev = merged[-1]
|
|
136
|
+
# gap > 0 means r starts after prev ends; gap < 0 means overlap/wrap.
|
|
137
|
+
# Only merge when gap is within [-x_gap, x_gap] to prevent merging
|
|
138
|
+
# across columns (where r[0] << prev[2] due to column switch).
|
|
139
|
+
gap = r[0] - prev[2]
|
|
140
|
+
if _same_line(prev, r) and -x_gap <= gap <= x_gap:
|
|
141
|
+
prev[0] = min(prev[0], r[0])
|
|
142
|
+
prev[2] = max(prev[2], r[2])
|
|
143
|
+
prev[1] = min(prev[1], r[1])
|
|
144
|
+
prev[3] = max(prev[3], r[3])
|
|
145
|
+
else:
|
|
146
|
+
merged.append(list(r))
|
|
147
|
+
|
|
148
|
+
return [(m[0], m[1], m[2], m[3]) for m in merged]
|
|
149
|
+
|
|
150
|
+
|
|
151
|
+
def dedupe_rects(rects: list[RectTuple], ndigits: int = 1) -> list[RectTuple]:
|
|
152
|
+
seen = set()
|
|
153
|
+
out: list[RectTuple] = []
|
|
154
|
+
for r in rects:
|
|
155
|
+
key = tuple(round(v, ndigits) for v in r)
|
|
156
|
+
if key in seen:
|
|
157
|
+
continue
|
|
158
|
+
seen.add(key)
|
|
159
|
+
out.append(r)
|
|
160
|
+
return out
|
|
161
|
+
|
|
162
|
+
|
|
163
|
+
def _dehyphenate_norms(norms: list[str]) -> list[tuple[str, list[int]]]:
|
|
164
|
+
"""Join hyphenated word pairs into single tokens.
|
|
165
|
+
|
|
166
|
+
PDF line-break hyphenation produces e.g. ``"aver-", "age"`` which
|
|
167
|
+
should be treated as ``"average"`` for matching purposes.
|
|
168
|
+
|
|
169
|
+
Returns ``[(dehyphenated_norm, [original_indices]), ...]``.
|
|
170
|
+
"""
|
|
171
|
+
result: list[tuple[str, list[int]]] = []
|
|
172
|
+
i = 0
|
|
173
|
+
while i < len(norms):
|
|
174
|
+
if norms[i].endswith('-') and len(norms[i]) > 1 and i + 1 < len(norms):
|
|
175
|
+
joined = norms[i][:-1] + norms[i + 1]
|
|
176
|
+
result.append((joined, [i, i + 1]))
|
|
177
|
+
i += 2
|
|
178
|
+
else:
|
|
179
|
+
result.append((norms[i], [i]))
|
|
180
|
+
i += 1
|
|
181
|
+
return result
|
|
182
|
+
|
|
183
|
+
|
|
184
|
+
def _is_hyph_match(a: str, b: str) -> bool:
|
|
185
|
+
"""Check if *a* and *b* are plausibly the same word split by hyphenation.
|
|
186
|
+
|
|
187
|
+
Detected patterns:
|
|
188
|
+
- ``"aver-"`` vs ``"average"`` (line-break hyphen on one side)
|
|
189
|
+
- ``"particularly"`` vs ``"ticularly"`` (continuation of ``"par-"``
|
|
190
|
+
on the previous page)
|
|
191
|
+
"""
|
|
192
|
+
# Pattern 1: one ends with '-', the other starts with that prefix.
|
|
193
|
+
if a.endswith('-') and len(a) > 2:
|
|
194
|
+
prefix = a[:-1]
|
|
195
|
+
if b.startswith(prefix) and len(b) > len(prefix):
|
|
196
|
+
return True
|
|
197
|
+
if b.endswith('-') and len(b) > 2:
|
|
198
|
+
prefix = b[:-1]
|
|
199
|
+
if a.startswith(prefix) and len(a) > len(prefix):
|
|
200
|
+
return True
|
|
201
|
+
# Pattern 2: one token is a suffix of the other (the "continuation"
|
|
202
|
+
# half of a hyphenated word whose first half sits on the prev page).
|
|
203
|
+
shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
|
|
204
|
+
if len(shorter) >= 4 and longer.endswith(shorter) and len(shorter) >= len(longer) * 0.5:
|
|
205
|
+
return True
|
|
206
|
+
return False
|
|
207
|
+
|
|
208
|
+
|
|
209
|
+
def compare_documents(
|
|
210
|
+
old_pages: list[PageBox],
|
|
211
|
+
new_pages: list[PageBox],
|
|
212
|
+
) -> tuple[dict[int, list[RectTuple]], dict[int, list[RectTuple]]]:
|
|
213
|
+
"""Compare two documents using a global (cross-page) diff.
|
|
214
|
+
|
|
215
|
+
Flattens all lines from all pages, performs a single global diff,
|
|
216
|
+
then maps highlighted rects back to their source pages. This
|
|
217
|
+
correctly handles text that reflowed across page boundaries.
|
|
218
|
+
|
|
219
|
+
Lines whose normalized text appears exactly once in each document
|
|
220
|
+
are recognised as *moved* (page reflow) and excluded from the diff
|
|
221
|
+
so that only genuinely changed text gets highlighted.
|
|
222
|
+
"""
|
|
223
|
+
# Flatten: list of (page_index, LineBox)
|
|
224
|
+
old_flat: list[tuple[int, LineBox]] = [
|
|
225
|
+
(p.page_index, line) for p in old_pages for line in p.lines
|
|
226
|
+
]
|
|
227
|
+
new_flat: list[tuple[int, LineBox]] = [
|
|
228
|
+
(p.page_index, line) for p in new_pages for line in p.lines
|
|
229
|
+
]
|
|
230
|
+
|
|
231
|
+
old_texts = [line.norm_text for _, line in old_flat]
|
|
232
|
+
new_texts = [line.norm_text for _, line in new_flat]
|
|
233
|
+
|
|
234
|
+
# Frequency counts for move detection: a line appearing exactly once
|
|
235
|
+
# in each document is the same line, just at a different position.
|
|
236
|
+
old_counts = Counter(old_texts)
|
|
237
|
+
new_counts = Counter(new_texts)
|
|
238
|
+
new_text_set = set(new_texts)
|
|
239
|
+
old_text_set = set(old_texts)
|
|
240
|
+
|
|
241
|
+
def _is_moved(text: str) -> bool:
|
|
242
|
+
"""True when *text* appears exactly once in each document."""
|
|
243
|
+
return old_counts[text] == 1 and new_counts[text] == 1
|
|
244
|
+
|
|
245
|
+
sm = SequenceMatcher(a=old_texts, b=new_texts, autojunk=False)
|
|
246
|
+
opcodes = _merge_opcodes(list(sm.get_opcodes()))
|
|
247
|
+
|
|
248
|
+
# ── First pass: collect candidate highlight words ────────────────
|
|
249
|
+
# Each candidate is (WordBox, RectTuple, is_subword).
|
|
250
|
+
# is_subword=True means the rect was trimmed by _sub_word_rect and
|
|
251
|
+
# represents a genuine character-level change (never suppress these).
|
|
252
|
+
old_cands: list[tuple[WordBox, RectTuple, bool]] = []
|
|
253
|
+
new_cands: list[tuple[WordBox, RectTuple, bool]] = []
|
|
254
|
+
|
|
255
|
+
for tag, i1, i2, j1, j2 in opcodes:
|
|
256
|
+
if tag == "equal":
|
|
257
|
+
continue
|
|
258
|
+
|
|
259
|
+
if tag == "delete":
|
|
260
|
+
for idx in range(i1, i2):
|
|
261
|
+
_, line = old_flat[idx]
|
|
262
|
+
if line.norm_text in new_text_set and _is_moved(line.norm_text):
|
|
263
|
+
continue
|
|
264
|
+
for w in line.words:
|
|
265
|
+
old_cands.append((w, w.rect, False))
|
|
266
|
+
continue
|
|
267
|
+
|
|
268
|
+
if tag == "insert":
|
|
269
|
+
for idx in range(j1, j2):
|
|
270
|
+
_, line = new_flat[idx]
|
|
271
|
+
if line.norm_text in old_text_set and _is_moved(line.norm_text):
|
|
272
|
+
continue
|
|
273
|
+
for w in line.words:
|
|
274
|
+
new_cands.append((w, w.rect, False))
|
|
275
|
+
continue
|
|
276
|
+
|
|
277
|
+
# tag == "replace" — pre-filter moved lines, then word-level diff
|
|
278
|
+
old_chunk_texts = set(old_texts[i1:i2])
|
|
279
|
+
new_chunk_texts = set(new_texts[j1:j2])
|
|
280
|
+
|
|
281
|
+
moved_old = set()
|
|
282
|
+
for idx in range(i1, i2):
|
|
283
|
+
t = old_texts[idx]
|
|
284
|
+
if t not in new_chunk_texts and t in new_text_set and _is_moved(t):
|
|
285
|
+
moved_old.add(idx)
|
|
286
|
+
|
|
287
|
+
moved_new = set()
|
|
288
|
+
for idx in range(j1, j2):
|
|
289
|
+
t = new_texts[idx]
|
|
290
|
+
if t not in old_chunk_texts and t in old_text_set and _is_moved(t):
|
|
291
|
+
moved_new.add(idx)
|
|
292
|
+
|
|
293
|
+
old_words = [w for i, (_, line) in enumerate(old_flat[i1:i2], i1)
|
|
294
|
+
if i not in moved_old for w in line.words]
|
|
295
|
+
new_words = [w for i, (_, line) in enumerate(new_flat[j1:j2], j1)
|
|
296
|
+
if i not in moved_new for w in line.words]
|
|
297
|
+
|
|
298
|
+
if not old_words and not new_words:
|
|
299
|
+
continue
|
|
300
|
+
|
|
301
|
+
old_norms = [w.norm for w in old_words]
|
|
302
|
+
new_norms = [w.norm for w in new_words]
|
|
303
|
+
|
|
304
|
+
sm2 = SequenceMatcher(a=old_norms, b=new_norms, autojunk=False)
|
|
305
|
+
for op_tag, oi1, oi2, oj1, oj2 in sm2.get_opcodes():
|
|
306
|
+
if op_tag == "equal":
|
|
307
|
+
continue
|
|
308
|
+
if op_tag == "replace" and (oi2 - oi1) == 1 and (oj2 - oj1) == 1:
|
|
309
|
+
o_r, n_r = _sub_word_rect(old_words[oi1], new_words[oj1])
|
|
310
|
+
old_cands.append((old_words[oi1], o_r, o_r != old_words[oi1].rect))
|
|
311
|
+
new_cands.append((new_words[oj1], n_r, n_r != new_words[oj1].rect))
|
|
312
|
+
else:
|
|
313
|
+
if op_tag in ("delete", "replace"):
|
|
314
|
+
for w in old_words[oi1:oi2]:
|
|
315
|
+
old_cands.append((w, w.rect, False))
|
|
316
|
+
if op_tag in ("insert", "replace"):
|
|
317
|
+
for w in new_words[oj1:oj2]:
|
|
318
|
+
new_cands.append((w, w.rect, False))
|
|
319
|
+
|
|
320
|
+
# ── Second pass: page-boundary reflow suppression ──────────────
|
|
321
|
+
# For each pair of adjacent pages (old_pg P, new_pg P±1), compare
|
|
322
|
+
# ALL words from both pages using dehyphenation-aware normalization.
|
|
323
|
+
# Candidate words within contiguous matching runs of ≥ MIN_MATCH
|
|
324
|
+
# tokens are recognised as reflowed text and suppressed.
|
|
325
|
+
#
|
|
326
|
+
# Within-page dehyphenation joins e.g. "aver-"+"age" → "average".
|
|
327
|
+
# Cross-page hyphenation (where a figure sits between the two halves)
|
|
328
|
+
# is handled via _is_hyph_match tolerance on 1:1 replace blocks.
|
|
329
|
+
MIN_MATCH = 2
|
|
330
|
+
|
|
331
|
+
suppress_old: set[int] = set()
|
|
332
|
+
suppress_new: set[int] = set()
|
|
333
|
+
|
|
334
|
+
# Index: (page_index, word_rect) → candidate index
|
|
335
|
+
old_cand_lookup: dict[tuple[int, RectTuple], int] = {}
|
|
336
|
+
new_cand_lookup: dict[tuple[int, RectTuple], int] = {}
|
|
337
|
+
old_cand_pages: set[int] = set()
|
|
338
|
+
new_cand_pages: set[int] = set()
|
|
339
|
+
|
|
340
|
+
for i, (w, _, sub) in enumerate(old_cands):
|
|
341
|
+
if not sub:
|
|
342
|
+
old_cand_lookup[(w.page_index, w.rect)] = i
|
|
343
|
+
old_cand_pages.add(w.page_index)
|
|
344
|
+
for i, (w, _, sub) in enumerate(new_cands):
|
|
345
|
+
if not sub:
|
|
346
|
+
new_cand_lookup[(w.page_index, w.rect)] = i
|
|
347
|
+
new_cand_pages.add(w.page_index)
|
|
348
|
+
|
|
349
|
+
def _all_words(pages: list[PageBox], pg: int) -> list[WordBox]:
|
|
350
|
+
if 0 <= pg < len(pages):
|
|
351
|
+
return [w for line in pages[pg].lines for w in line.words]
|
|
352
|
+
return []
|
|
353
|
+
|
|
354
|
+
def _suppress(dehyph, start, end, words, lookup, target_set):
|
|
355
|
+
"""Mark candidate words in dehyph[start:end] for suppression."""
|
|
356
|
+
for k in range(start, end):
|
|
357
|
+
for wi in dehyph[k][1]:
|
+            w = words[wi]
+            ci = lookup.get((w.page_index, w.rect))
+            if ci is not None:
+                target_set.add(ci)
+
+    checked: set[tuple[int, int]] = set()
+    for old_pg in sorted(old_cand_pages):
+        for delta in (1, -1):
+            new_pg = old_pg + delta
+            if new_pg not in new_cand_pages:
+                continue
+            pair = (old_pg, new_pg)
+            if pair in checked:
+                continue
+            checked.add(pair)
+
+            old_words_pg = _all_words(old_pages, old_pg)
+            new_words_pg = _all_words(new_pages, new_pg)
+            if not old_words_pg or not new_words_pg:
+                continue
+
+            old_norms = [w.norm for w in old_words_pg]
+            new_norms = [w.norm for w in new_words_pg]
+
+            # Within-page dehyphenation only (no cross-page peek —
+            # adjacent pages may start with figures, not text continuations)
+            old_dehyph = _dehyphenate_norms(old_norms)
+            new_dehyph = _dehyphenate_norms(new_norms)
+
+            old_tokens = [t for t, _ in old_dehyph]
+            new_tokens = [t for t, _ in new_dehyph]
+
+            sm_b = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
+            ops = list(sm_b.get_opcodes())
+
+            for oi, (op, bi1, bi2, bj1, bj2) in enumerate(ops):
+                if op == "equal" and (bi2 - bi1) >= MIN_MATCH:
+                    _suppress(old_dehyph, bi1, bi2, old_words_pg,
+                              old_cand_lookup, suppress_old)
+                    _suppress(new_dehyph, bj1, bj2, new_words_pg,
+                              new_cand_lookup, suppress_new)
+                elif op == "replace":
+                    # Boundary tokens of a replace block next to a long
+                    # equal run may be hyphenation artifacts. E.g.
+                    # "aver-" vs "average" or "particularly" vs "ticularly".
+                    # Leading edge (preceded by equal ≥ MIN_MATCH)
+                    if oi > 0:
+                        p = ops[oi - 1]
+                        if (p[0] == "equal" and (p[2] - p[1]) >= MIN_MATCH
+                                and _is_hyph_match(old_tokens[bi1],
+                                                   new_tokens[bj1])):
+                            _suppress(old_dehyph, bi1, bi1 + 1,
+                                      old_words_pg, old_cand_lookup,
+                                      suppress_old)
+                            _suppress(new_dehyph, bj1, bj1 + 1,
+                                      new_words_pg, new_cand_lookup,
+                                      suppress_new)
+                    # Trailing edge (followed by equal ≥ MIN_MATCH)
+                    if oi < len(ops) - 1:
+                        n = ops[oi + 1]
+                        if (n[0] == "equal" and (n[2] - n[1]) >= MIN_MATCH
+                                and _is_hyph_match(old_tokens[bi2 - 1],
+                                                   new_tokens[bj2 - 1])):
+                            _suppress(old_dehyph, bi2 - 1, bi2,
+                                      old_words_pg, old_cand_lookup,
+                                      suppress_old)
+                            _suppress(new_dehyph, bj2 - 1, bj2,
+                                      new_words_pg, new_cand_lookup,
+                                      suppress_new)
+
+    # ── Map surviving candidates to pages ────────────────────────────
+    old_map: DefaultDict[int, list[RectTuple]] = defaultdict(list)
+    new_map: DefaultDict[int, list[RectTuple]] = defaultdict(list)
+
+    for i, (w, rect, _) in enumerate(old_cands):
+        if i not in suppress_old:
+            old_map[w.page_index].append(rect)
+
+    for i, (w, rect, _) in enumerate(new_cands):
+        if i not in suppress_new:
+            new_map[w.page_index].append(rect)
+
+    return (
+        {k: merge_nearby_rects(dedupe_rects(v)) for k, v in old_map.items()},
+        {k: merge_nearby_rects(dedupe_rects(v)) for k, v in new_map.items()},
+    )
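The hyphenation-boundary check can be sketched in isolation. This is a simplified stand-in, not the package's exact `_is_hyph_match`: the `MIN_MATCH` value and the matching rule here are assumptions, and the real code suppresses `WordBox` candidates rather than collecting token pairs.

```python
from difflib import SequenceMatcher

MIN_MATCH = 3  # assumed threshold; the package defines its own value


def is_hyph_match(a: str, b: str) -> bool:
    # One token looks like a hyphen-broken fragment of the other,
    # e.g. "aver-"/"average" or "ticularly"/"particularly".
    a, b = a.rstrip("-"), b.rstrip("-")
    return bool(a) and bool(b) and a != b and (
        b.startswith(a) or a.startswith(b) or b.endswith(a) or a.endswith(b)
    )


old = ["aver", "age", "rate", "was", "very", "low"]  # "average" split by a line break
new = ["average", "rate", "was", "very", "low"]

ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
artifacts = []
for i, (op, a1, a2, b1, b2) in enumerate(ops):
    if op != "replace":
        continue
    nxt = ops[i + 1] if i + 1 < len(ops) else None
    # Trailing edge of a replace block, followed by a long equal run:
    # likely a dehyphenation artifact rather than a real edit.
    if (nxt and nxt[0] == "equal" and nxt[2] - nxt[1] >= MIN_MATCH
            and is_hyph_match(old[a2 - 1], new[b2 - 1])):
        artifacts.append((old[a2 - 1], new[b2 - 1]))

print(artifacts)  # [('age', 'average')]
```

The asymmetric prefix/suffix test is what lets "age" (the tail of a broken "average") pair with the intact word, so the diff stage does not flag it as a deletion plus an insertion.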
@@ -0,0 +1,85 @@
+from __future__ import annotations
+
+import re
+from collections import defaultdict
+
+import fitz
+
+from .models import LineBox, PageBox, WordBox
+
+
+def _normalize(text: str) -> str:
+    return re.sub(r"\s+", " ", text.strip()).lower()
+
+
+def group_words_into_lines(
+    raw_words: list[tuple],
+    page_index: int,
+) -> list[LineBox]:
+    """Group PyMuPDF word tuples into :class:`LineBox` objects.
+
+    Groups by ``(block_no, line_no)`` so that multi-column layouts are
+    handled correctly — words in different columns get separate lines
+    even when they share the same y-coordinate.
+    """
+    groups: defaultdict[tuple[int, int], list[tuple]] = defaultdict(list)
+    for w in raw_words:
+        x0, y0, x1, y1, text, block_no, line_no, word_no = w[:8]
+        if not str(text).strip():
+            continue
+        groups[(int(block_no), int(line_no))].append(
+            (x0, y0, x1, y1, str(text))
+        )
+
+    # Sort by (block_no, line_no) for stable ordering.
+    # Using PyMuPDF's own block ordering keeps line order consistent
+    # between two versions of the same document, even when figures
+    # shift vertically by a few pixels.
+    sorted_keys = sorted(groups.keys())
+
+    lines: list[LineBox] = []
+    for line_idx, key in enumerate(sorted_keys):
+        word_items = sorted(groups[key], key=lambda t: t[0])  # sort by x0
+        line_words: list[WordBox] = []
+        plain_words: list[str] = []
+
+        for word_idx, (x0, y0, x1, y1, text) in enumerate(word_items):
+            word = WordBox(
+                page_index=page_index,
+                line_index=line_idx,
+                word_index=word_idx,
+                text=text,
+                norm=_normalize(text),
+                rect=(x0, y0, x1, y1),
+            )
+            line_words.append(word)
+            plain_words.append(text)
+
+        line_text = " ".join(plain_words)
+        lines.append(
+            LineBox(
+                page_index=page_index,
+                line_index=line_idx,
+                text=line_text,
+                norm_text=_normalize(line_text),
+                words=line_words,
+            )
+        )
+
+    return lines
+
+
+def extract_document(path: str) -> list[PageBox]:
+    doc = fitz.open(path)
+    pages: list[PageBox] = []
+
+    try:
+        for page_index in range(len(doc)):
+            page = doc[page_index]
+            raw_words = page.get_text("words", sort=True)
+            lines = group_words_into_lines(raw_words, page_index=page_index)
+            pages.append(PageBox(page_index=page_index, lines=lines))
+    finally:
+        doc.close()
+
+    return pages
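The `(block_no, line_no)` grouping described in the docstring can be seen on a toy example. The word tuples below are synthetic, in PyMuPDF's `"words"` shape `(x0, y0, x1, y1, text, block_no, line_no, word_no)`; the coordinates are made up. Two columns share the same y-coordinate but live in different blocks, so they end up on separate lines.

```python
from collections import defaultdict

# Two columns, both at y0 = 100: grouping by y alone would merge them
# into one line; grouping by (block_no, line_no) keeps them apart.
raw = [
    (50, 100, 80, 110, "left", 0, 0, 0),
    (85, 100, 120, 110, "column", 0, 0, 1),
    (300, 100, 340, 110, "right", 1, 0, 0),
    (345, 100, 390, 110, "column", 1, 0, 1),
]

groups = defaultdict(list)
for x0, y0, x1, y1, text, block_no, line_no, word_no in raw:
    groups[(block_no, line_no)].append(text)

lines = [" ".join(groups[k]) for k in sorted(groups)]
print(lines)  # ['left column', 'right column']
```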
@@ -0,0 +1,38 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+RectTuple = tuple[float, float, float, float]
+
+
+@dataclass(frozen=True)
+class WordBox:
+    page_index: int
+    line_index: int
+    word_index: int
+    text: str
+    norm: str
+    rect: RectTuple
+
+
+@dataclass(frozen=True)
+class LineBox:
+    page_index: int
+    line_index: int
+    text: str
+    norm_text: str
+    words: list[WordBox]
+
+
+@dataclass(frozen=True)
+class PageBox:
+    page_index: int
+    lines: list[LineBox]
+
+    @property
+    def text(self) -> str:
+        return "\n".join(line.text for line in self.lines)
+
+    @property
+    def norm_text(self) -> str:
+        return "\n".join(line.norm_text for line in self.lines)
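One design consequence of these frozen dataclasses is worth noting: frozen instances with hashable fields are themselves hashable, which is what lets the comparison stage key a dict by `(page_index, rect)` as in `lookup.get((w.page_index, w.rect))`. A trimmed stand-in (not the real `WordBox`, just the two relevant fields):

```python
from dataclasses import dataclass

RectTuple = tuple[float, float, float, float]


# Stand-in for WordBox: frozen=True plus tuple-typed fields makes the
# instance hashable, so it can participate in dict keys directly.
@dataclass(frozen=True)
class Word:
    page_index: int
    rect: RectTuple


w = Word(page_index=0, rect=(10.0, 20.0, 30.0, 40.0))
lookup = {(w.page_index, w.rect): 7}  # candidate index keyed by position
print(lookup.get((0, (10.0, 20.0, 30.0, 40.0))))  # 7
```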
@@ -0,0 +1,97 @@
+Metadata-Version: 2.4
+Name: pdfdelta
+Version: 0.1.0
+Summary: Visual diff for born-digital PDFs — highlights changes directly on the original pages
+License: MIT
+Keywords: pdf,diff,compare,highlight,academic,paper
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Science/Research
+Classifier: Topic :: Text Processing :: General
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: PyMuPDF>=1.24
+Dynamic: license-file
+
+# pdfdelta
+
+**pdfdelta** is a lightweight visual diff tool for born-digital PDFs.
+
+Given an old and a new version of a PDF, it writes highlights directly onto the original pages so revisions are easy to review: deletions on the old file, additions on the new file.
+
+It is designed mainly for academic papers and technical documents, where small wording changes matter and layout is part of the review process.
+
+<p align="center">
+  <img src="examples/old_marked.png" alt="Old PDF with deletions highlighted" width="48%" />
+  <img src="examples/new_marked.png" alt="New PDF with additions highlighted" width="48%" />
+</p>
+
+## Features
+
+- Highlights changes directly on the original PDF pages
+- Works well for born-digital PDFs such as papers, reports, and drafts
+- Handles multi-column layouts better than plain text diff tools
+- Suppresses noisy highlights caused by simple text reflow
+- Keeps the review workflow visual and page-based
+
+## Installation
+
+To install directly from the repository:
+
+```sh
+pip install git+https://github.com/mli55/pdfdelta.git
+```
+
+## Usage
+
+```sh
+pdfdelta old.pdf new.pdf
+```
+
+This writes two annotated files:
+
+- `old_marked.pdf` — original pages with deletions highlighted
+- `new_marked.pdf` — revised pages with additions highlighted
+
+### Options
+
+| Flag | Default | Description |
+| ---- | ------- | ----------- |
+| `--old-out` | `old_marked.pdf` | Output path for the annotated old PDF |
+| `--new-out` | `new_marked.pdf` | Output path for the annotated new PDF |
+| `--opacity` | `0.35` | Highlight opacity (0.0–1.0) |
+
+## How It Works
+
+```
+old.pdf                 new.pdf
+   │                       │
+   ▼                       ▼
+        ┌──────────────────┐
+        │  Extract words   │  PyMuPDF: word text + bounding boxes
+        └────────┬─────────┘
+                 ▼
+        ┌──────────────────┐
+        │  Global diff     │  Flatten all pages → SequenceMatcher
+        └────────┬─────────┘
+                 ▼
+        ┌──────────────────┐
+        │  Word-level diff │  Per-word & sub-word precision
+        └────────┬─────────┘
+                 ▼
+        ┌──────────────────┐
+        │  Reflow filter   │  Suppress cross-page / cross-column noise
+        └────────┬─────────┘
+                 ▼
+        ┌──────────────────┐
+        │  Annotate PDFs   │  Highlights on original pages
+        └────────┬─────────┘
+                 ▼
+   old_marked.pdf
+   new_marked.pdf
+```
|
+
|
|
95
|
+
## License
|
|
96
|
+
|
|
97
|
+
MIT
|
|
@@ -0,0 +1,15 @@
+LICENSE
+README.md
+pyproject.toml
+src/pdfdelta/__init__.py
+src/pdfdelta/annotate.py
+src/pdfdelta/cli.py
+src/pdfdelta/compare.py
+src/pdfdelta/extract.py
+src/pdfdelta/models.py
+src/pdfdelta.egg-info/PKG-INFO
+src/pdfdelta.egg-info/SOURCES.txt
+src/pdfdelta.egg-info/dependency_links.txt
+src/pdfdelta.egg-info/entry_points.txt
+src/pdfdelta.egg-info/requires.txt
+src/pdfdelta.egg-info/top_level.txt
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+PyMuPDF>=1.24
@@ -0,0 +1 @@
+pdfdelta