pdf2text-arabic 0.1.0__tar.gz

@@ -0,0 +1,150 @@
1
+ Metadata-Version: 2.4
2
+ Name: pdf2text-arabic
3
+ Version: 0.1.0
4
+ Summary: Arabic PDF text extraction with ligature and RTL fixes
5
+ Requires-Python: >=3.13
6
+ Description-Content-Type: text/markdown
7
+ Requires-Dist: pymupdf>=1.27.2.2
8
+ Requires-Dist: tqdm>=4.67.3
9
+
10
+ # PDF2Text-Arabic
11
+
12
+ Arabic PDF text extraction built on PyMuPDF. Fixes ligature decomposition, RTL ordering, table extraction, and other issues that make raw PyMuPDF output unusable for Arabic.
13
+
14
+ ## What it fixes
15
+
16
+ | # | Problem | Fix |
17
+ |---|---------|-----|
18
+ | 1 | **Ligature decomposition** — PyMuPDF breaks Arabic ligatures (الله, لأ, لإ) into LTR-ordered zero-width chars | Detects zero-width clusters, reverses to RTL order |
19
+ | 1b | **Lam-Alef swap** — لا ligature decomposed as ال (alef before lam) | Detects width ratio, swaps to correct order |
20
+ | 2 | **Presentation Forms** — Returns U+FB50–FDFF / U+FE70–FEFF instead of standard Arabic | NFKC normalization |
21
+ | 3 | **Line splitting** — One visual line split into multiple rawdict lines at same y | Y-coordinate merging with tolerance |
22
+ | 4 | **Number reversal** — RTL sorting reverses digit sequences (2019 → 9102) | Detects LTR digit runs, reverses back |
23
+ | 5 | **Arabic↔digit spacing** — No space between Arabic text and numbers | Regex-inserts spaces at boundaries |
24
+ | 6 | **Artifact spaces** — Space chars with overlapping bboxes cause false word breaks | Only honors spaces with a physical gap > 0.5 pt |
25
+ | 7 | **Invisible chars** — Zero-width joiners, BOM, LTR/RTL marks, kashida | Stripped in post-processing |
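
Rows 2 and 7 of the table can be illustrated with the standard library alone. This is a minimal sketch, not the package's code: the function name `normalize_arabic` and the exact set of stripped code points are assumptions derived from the row descriptions.

```python
import re
import unicodedata

# Assumed set of invisible characters, taken from row 7 above:
# ZWNJ, ZWJ, LRM, RLM, BOM, and tatweel (kashida).
INVISIBLE_RE = re.compile("[\u200c\u200d\u200e\u200f\ufeff\u0640]")


def normalize_arabic(text: str) -> str:
    """Fold presentation forms to standard letters, then drop invisible chars."""
    return INVISIBLE_RE.sub("", unicodedata.normalize("NFKC", text))


# U+FEFB (isolated lam-alef ligature) folds to lam (U+0644) + alef (U+0627).
folded = normalize_arabic("\ufefb")
```

NFKC runs first so that any ligature code point is expanded before the invisible characters are removed.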
26
+
27
+ ## Install
28
+
29
+ ```bash
30
+ pip install pdf2text-arabic
31
+ ```
32
+
33
+ From source:
34
+ ```bash
35
+ pip install .
36
+ # or with uv
37
+ uv pip install .
38
+ ```
39
+
40
+ ## Quick start
41
+
42
+ ### Python API
43
+
44
+ ```python
45
+ from pdf2text_arabic import extract_pdf, extract_page
46
+
47
+ # Extract entire PDF
48
+ text = extract_pdf("document.pdf")
49
+
50
+ # With cropping (remove headers/page numbers)
51
+ text = extract_pdf("document.pdf", crop_top=50, crop_bottom=30)
52
+
53
+ # Crop by percentage
54
+ text = extract_pdf("document.pdf", crop_top=5, crop_bottom=3, crop_unit="pct")
55
+
56
+ # Disable footnote separator detection
57
+ text = extract_pdf("document.pdf", detect_footer=False)
58
+ ```
59
+
60
+ ### Single page
61
+
62
+ ```python
63
+ import fitz
64
+ from pdf2text_arabic import extract_page
65
+
66
+ doc = fitz.open("document.pdf")
67
+ text = extract_page(doc[0], crop_top=50, crop_bottom=30)
68
+ doc.close()
69
+ ```
70
+
71
+ ### CLI
72
+
73
+ ```bash
74
+ # Process all PDFs in a directory
75
+ pdf2text-arabic -i ./download -o ./output/plain_text
76
+
77
+ # Single file
78
+ pdf2text-arabic -f document.pdf -o ./output
79
+
80
+ # With cropping
81
+ pdf2text-arabic -i ./download --crop-top 50 --crop-bottom 30
82
+
83
+ # Crop by percentage, no footer detection
84
+ pdf2text-arabic -i ./download --crop-top 5 --crop-bottom 3 --crop-unit pct --no-footer
85
+ ```
86
+
87
+ ## API reference
88
+
89
+ ### `extract_pdf(pdf_path, **kwargs) → str`
90
+
91
+ Extract text from all pages of a PDF.
92
+
93
+ | Parameter | Type | Default | Description |
94
+ |-----------|------|---------|-------------|
95
+ | `pdf_path` | `str` | — | Path to the PDF file |
96
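+ | `crop_top` | `float` | `0` | Amount to crop from the top of each page, in `crop_unit` units |
97
+ | `crop_bottom` | `float` | `0` | Amount to crop from the bottom of each page, in `crop_unit` units |
98
+ | `crop_unit` | `"px" \| "pct"` | `"px"` | Unit for the crop values: `"px"` (PDF points) or `"pct"` (percentage of page height) |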
99
+ | `detect_footer` | `bool` | `True` | Auto-detect footnote separator lines and exclude content below |
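
To make the interaction between `crop_top`, `crop_bottom`, and `crop_unit` concrete, here is a small sketch of how the crop values might translate into a vertical band of the page. The helper `crop_bounds` is hypothetical and is not part of the package's API; it assumes y grows downward from the top of the page, as in PyMuPDF.

```python
def crop_bounds(page_height: float, crop_top: float = 0, crop_bottom: float = 0,
                crop_unit: str = "px") -> tuple[float, float]:
    """Return (y0, y1) of the vertical band kept after cropping."""
    if crop_unit == "pct":
        # Percentages are taken relative to the page height.
        top = page_height * crop_top / 100
        bottom = page_height * crop_bottom / 100
    elif crop_unit == "px":
        # Absolute values are used as-is (PDF points).
        top, bottom = float(crop_top), float(crop_bottom)
    else:
        raise ValueError(f"unknown crop_unit: {crop_unit!r}")
    return top, page_height - bottom
```

On an 800-point page, `crop_top=5, crop_bottom=3, crop_unit="pct"` would keep the band from y = 40 to y = 776 under these assumptions.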
100
+
101
+ ### `extract_page(page, **kwargs) → str`
102
+
103
+ Extract text from a single `fitz.Page`. Same parameters as `extract_pdf` (except `pdf_path`).
104
+
105
+ ## Features
106
+
107
+ ### Table extraction
108
+
109
+ Tables are automatically detected via PyMuPDF's `find_tables()`, extracted with proper Arabic cell ordering, and formatted as pipe-delimited text. Merged cells are filled down so every row is self-contained:
110
+
111
+ ```
112
+ الجهات | عدد المقاعد | مقر الدائرة الانتخابية
113
+ طنجة – تطوان – الحسيمة | 2 | ولاية جهة فاس - مكناس
114
+ الشرق | 2 | ولاية جهة فاس - مكناس
115
+ فاس - مكناس | 2 | ولاية جهة فاس - مكناس
116
+ ```
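
The fill-down behaviour shown in the sample output can be sketched as follows. This is an illustration only: it assumes the table rows arrive as lists of cell strings in which the covered part of a merged cell is `None` or an empty string, and the names `fill_down` and `to_pipe_text` are hypothetical. Arabic cell ordering is not handled here.

```python
from typing import Optional


def fill_down(rows: list[list[Optional[str]]]) -> list[list[str]]:
    """Replace empty cells (None or "") with the value from the row above."""
    filled: list[list[str]] = []
    for row in rows:
        prev = filled[-1] if filled else []
        filled.append([
            cell if cell else (prev[i] if i < len(prev) else "")
            for i, cell in enumerate(row)
        ])
    return filled


def to_pipe_text(rows: list[list[str]]) -> str:
    """Join cells with ' | ' and rows with newlines."""
    return "\n".join(" | ".join(row) for row in rows)
```

Filling from the already-filled previous row (rather than the raw one) lets a merged cell propagate across any number of rows.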
117
+
118
+ ### Footer detection
119
+
120
+ Automatically detects horizontal separator lines in the bottom 40% of each page and excludes the footnote text below them. Both non-selectable vector-drawn lines and selectable text runs such as `------` are recognized.
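
The text-based half of this detection can be sketched in a few lines. The dash pattern, the minimum run of five characters, and the function name `drop_below_separator` are all assumptions made for illustration; vector-drawn separators are not covered. Lines are given as `(y, text)` pairs with y growing downward.

```python
import re

# Assumed pattern: a line made only of dashes or underscores, at least 5 long.
DASH_LINE = re.compile(r"^\s*[-_\u2013\u2014]{5,}\s*$")


def drop_below_separator(lines: list[tuple[float, str]], page_height: float,
                         zone: float = 0.4) -> list[str]:
    """Keep text above the first dash-only line found in the bottom zone."""
    cutoff_y = page_height * (1 - zone)
    ordered = sorted(lines)  # top to bottom
    for y, text in ordered:
        if y >= cutoff_y and DASH_LINE.match(text):
            # Everything at or below the separator is treated as footnotes.
            return [t for ly, t in ordered if ly < y]
    return [t for _, t in ordered]
```

Restricting the search to the bottom zone keeps a dashed rule near the top of the page from truncating the body text.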
121
+
122
+ ### Page cropping
123
+
124
+ Crop headers and page numbers by a fixed amount in points (`crop_unit="px"`) or by a percentage of page height (`crop_unit="pct"`).
125
+
126
+ ## Project structure
127
+
128
+ ```
129
+ pdf2text_arabic/
130
+ ├── __init__.py # Public API: extract_pdf, extract_page
131
+ ├── _chars.py # Character-level ligature/overlap fixes
132
+ ├── _text.py # RTL text building, cleaning, line merging
133
+ ├── _tables.py # Table detection and formatting
134
+ ├── _footer.py # Footer separator detection
135
+ ├── _extract.py # Page/PDF extraction orchestration
136
+ └── cli.py # CLI entry point
137
+ ```
138
+
139
+ ## Integration with other projects
140
+
141
+ ```bash
142
+ pip install pdf2text-arabic
143
+ ```
144
+
145
+ ```python
146
+ from pdf2text_arabic import extract_pdf
147
+
148
+ def extract_law_text(path: str) -> str:
149
+ return extract_pdf(path, crop_top=50, crop_bottom=30, detect_footer=True)
150
+ ```
@@ -0,0 +1,141 @@
1
+ # PDF2Text-Arabic
2
+
3
+ Arabic PDF text extraction built on PyMuPDF. Fixes ligature decomposition, RTL ordering, table extraction, and other issues that make raw PyMuPDF output unusable for Arabic.
4
+
5
+ ## What it fixes
6
+
7
+ | # | Problem | Fix |
8
+ |---|---------|-----|
9
+ | 1 | **Ligature decomposition** — PyMuPDF breaks Arabic ligatures (الله, لأ, لإ) into LTR-ordered zero-width chars | Detects zero-width clusters, reverses to RTL order |
10
+ | 1b | **Lam-Alef swap** — لا ligature decomposed as ال (alef before lam) | Detects width ratio, swaps to correct order |
11
+ | 2 | **Presentation Forms** — Returns U+FB50–FDFF / U+FE70–FEFF instead of standard Arabic | NFKC normalization |
12
+ | 3 | **Line splitting** — One visual line split into multiple rawdict lines at same y | Y-coordinate merging with tolerance |
13
+ | 4 | **Number reversal** — RTL sorting reverses digit sequences (2019 → 9102) | Detects LTR digit runs, reverses back |
14
+ | 5 | **Arabic↔digit spacing** — No space between Arabic text and numbers | Regex-inserts spaces at boundaries |
15
+ | 6 | **Artifact spaces** — Space chars with overlapping bboxes cause false word breaks | Only honors spaces with a physical gap > 0.5 pt |
16
+ | 7 | **Invisible chars** — Zero-width joiners, BOM, LTR/RTL marks, kashida | Stripped in post-processing |
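
Rows 4 and 5 of the table lend themselves to a short regex sketch. It is deliberately simplified and rests on stated assumptions: only ASCII digits and the basic Arabic letter range are considered, every digit run is reversed unconditionally (the package is described as first detecting which runs need it), and both function names are hypothetical.

```python
import re

ARABIC_LETTER = "[\u0621-\u064a]"  # assumption: basic Arabic letters only


def restore_digit_runs(text: str) -> str:
    """Reverse each run of ASCII digits (undoes a character-wise RTL flip)."""
    return re.sub(r"[0-9]+", lambda m: m.group()[::-1], text)


def space_arabic_digits(text: str) -> str:
    """Insert a space wherever an Arabic letter touches an ASCII digit."""
    text = re.sub(f"({ARABIC_LETTER})([0-9])", r"\1 \2", text)
    return re.sub(f"([0-9])({ARABIC_LETTER})", r"\1 \2", text)
```

Keeping Arabic-Indic digits out of both character classes avoids splitting a multi-digit number written in those digits.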
17
+
18
+ ## Install
19
+
20
+ ```bash
21
+ pip install pdf2text-arabic
22
+ ```
23
+
24
+ From source:
25
+ ```bash
26
+ pip install .
27
+ # or with uv
28
+ uv pip install .
29
+ ```
30
+
31
+ ## Quick start
32
+
33
+ ### Python API
34
+
35
+ ```python
36
+ from pdf2text_arabic import extract_pdf, extract_page
37
+
38
+ # Extract entire PDF
39
+ text = extract_pdf("document.pdf")
40
+
41
+ # With cropping (remove headers/page numbers)
42
+ text = extract_pdf("document.pdf", crop_top=50, crop_bottom=30)
43
+
44
+ # Crop by percentage
45
+ text = extract_pdf("document.pdf", crop_top=5, crop_bottom=3, crop_unit="pct")
46
+
47
+ # Disable footnote separator detection
48
+ text = extract_pdf("document.pdf", detect_footer=False)
49
+ ```
50
+
51
+ ### Single page
52
+
53
+ ```python
54
+ import fitz
55
+ from pdf2text_arabic import extract_page
56
+
57
+ doc = fitz.open("document.pdf")
58
+ text = extract_page(doc[0], crop_top=50, crop_bottom=30)
59
+ doc.close()
60
+ ```
61
+
62
+ ### CLI
63
+
64
+ ```bash
65
+ # Process all PDFs in a directory
66
+ pdf2text-arabic -i ./download -o ./output/plain_text
67
+
68
+ # Single file
69
+ pdf2text-arabic -f document.pdf -o ./output
70
+
71
+ # With cropping
72
+ pdf2text-arabic -i ./download --crop-top 50 --crop-bottom 30
73
+
74
+ # Crop by percentage, no footer detection
75
+ pdf2text-arabic -i ./download --crop-top 5 --crop-bottom 3 --crop-unit pct --no-footer
76
+ ```
77
+
78
+ ## API reference
79
+
80
+ ### `extract_pdf(pdf_path, **kwargs) → str`
81
+
82
+ Extract text from all pages of a PDF.
83
+
84
+ | Parameter | Type | Default | Description |
85
+ |-----------|------|---------|-------------|
86
+ | `pdf_path` | `str` | — | Path to the PDF file |
87
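+ | `crop_top` | `float` | `0` | Amount to crop from the top of each page, in `crop_unit` units |
87
+ | `crop_bottom` | `float` | `0` | Amount to crop from the bottom of each page, in `crop_unit` units |
88
+ | `crop_unit` | `"px" \| "pct"` | `"px"` | Unit for the crop values: `"px"` (PDF points) or `"pct"` (percentage of page height) |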
90
+ | `detect_footer` | `bool` | `True` | Auto-detect footnote separator lines and exclude content below |
91
+
92
+ ### `extract_page(page, **kwargs) → str`
93
+
94
+ Extract text from a single `fitz.Page`. Same parameters as `extract_pdf` (except `pdf_path`).
95
+
96
+ ## Features
97
+
98
+ ### Table extraction
99
+
100
+ Tables are automatically detected via PyMuPDF's `find_tables()`, extracted with proper Arabic cell ordering, and formatted as pipe-delimited text. Merged cells are filled down so every row is self-contained:
101
+
102
+ ```
103
+ الجهات | عدد المقاعد | مقر الدائرة الانتخابية
104
+ طنجة – تطوان – الحسيمة | 2 | ولاية جهة فاس - مكناس
105
+ الشرق | 2 | ولاية جهة فاس - مكناس
106
+ فاس - مكناس | 2 | ولاية جهة فاس - مكناس
107
+ ```
108
+
109
+ ### Footer detection
110
+
111
+ Automatically detects horizontal separator lines in the bottom 40% of each page and excludes the footnote text below them. Both non-selectable vector-drawn lines and selectable text runs such as `------` are recognized.
112
+
113
+ ### Page cropping
114
+
115
+ Crop headers and page numbers by a fixed amount in points (`crop_unit="px"`) or by a percentage of page height (`crop_unit="pct"`).
116
+
117
+ ## Project structure
118
+
119
+ ```
120
+ pdf2text_arabic/
121
+ ├── __init__.py # Public API: extract_pdf, extract_page
122
+ ├── _chars.py # Character-level ligature/overlap fixes
123
+ ├── _text.py # RTL text building, cleaning, line merging
124
+ ├── _tables.py # Table detection and formatting
125
+ ├── _footer.py # Footer separator detection
126
+ ├── _extract.py # Page/PDF extraction orchestration
127
+ └── cli.py # CLI entry point
128
+ ```
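
The tree above assigns line merging to `_text.py`, whose source is not part of this excerpt. As a rough illustration of fix 3 (Y-coordinate merging with tolerance), the sketch below groups text fragments that share nearly the same y. The function name, the `(y, text)` input shape, and the default tolerance of 2.0 are assumptions, not values taken from the package.

```python
def merge_lines_by_y(lines: list[tuple[float, str]], tol: float = 2.0) -> list[list[str]]:
    """Group text fragments whose y-coordinates lie within *tol* of the group start."""
    groups: list[list[str]] = []
    group_y = None
    for y, text in sorted(lines, key=lambda item: item[0]):
        if group_y is not None and abs(y - group_y) <= tol:
            # Same visual line: attach to the current group.
            groups[-1].append(text)
        else:
            # New visual line: start a group anchored at this y.
            groups.append([text])
            group_y = y
    return groups
```

A real implementation would also need to order the fragments inside each group by x, right to left, before joining them.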
129
+
130
+ ## Integration with other projects
131
+
132
+ ```bash
133
+ pip install pdf2text-arabic
134
+ ```
135
+
136
+ ```python
137
+ from pdf2text_arabic import extract_pdf
138
+
139
+ def extract_law_text(path: str) -> str:
140
+ return extract_pdf(path, crop_top=50, crop_bottom=30, detect_footer=True)
141
+ ```
@@ -0,0 +1,12 @@
1
+ """Arabic PDF text extraction with PyMuPDF ligature and RTL fixes.
2
+
3
+ Usage:
4
+ from pdf2text_arabic import extract_pdf, extract_page
5
+
6
+ text = extract_pdf("document.pdf")
7
+ """
8
+
9
+ from ._extract import extract_page, extract_pdf
10
+ from .cli import main
11
+
12
+ __all__ = ["extract_pdf", "extract_page", "main"]
@@ -0,0 +1,228 @@
1
+ """Character-level Arabic fixes for PyMuPDF ligature decomposition.
2
+
3
+ Fixes zero-width clusters, lam-alef ligature swaps, exact-overlap pairs,
4
+ and near-overlap repositioning — all caused by PyMuPDF decomposing Arabic
5
+ ligature glyphs into visual LTR character order.
6
+ """
7
+
8
+ import re
9
+ import unicodedata
10
+
11
+ # ---------------------------------------------------------------------------
12
+ # Constants & helpers
13
+ # ---------------------------------------------------------------------------
14
+
15
+ ARABIC_RE = re.compile(
16
+ r"[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]"
17
+ )
18
+
19
+
20
+ def is_arabic(c: str) -> bool:
21
+ """Check if a character is Arabic (after NFKC normalization)."""
22
+ return bool(ARABIC_RE.match(unicodedata.normalize("NFKC", c)))
23
+
24
+
25
+ def reposition(char: dict, new_x: float) -> dict:
26
+     """Return a copy of *char* collapsed to zero width at *new_x* (x0 and x1 are both set to it)."""
27
+ return {
28
+ "c": char["c"],
29
+ "bbox": (new_x, char["bbox"][1], new_x, char["bbox"][3]),
30
+ "origin": char.get("origin", (0, 0)),
31
+ }
32
+
33
+
34
+ def _reverse_cluster(cluster: list[dict], reposition_all: bool = False) -> None:
35
+ """Reverse a ligature cluster in-place and assign decreasing x-positions.
36
+
37
+ If *reposition_all* is False (default), the first char after reversal
38
+ (the anchor) keeps its original bbox. If True, every char including
39
+ the anchor is repositioned from ``max(x0)`` downward.
40
+ """
41
+ cluster.reverse()
42
+ if len(cluster) > 1:
43
+ if reposition_all:
44
+ anchor_x0 = max(c["bbox"][0] for c in cluster)
45
+ for k in range(len(cluster)):
46
+ cluster[k] = reposition(cluster[k], anchor_x0 - k * 0.01)
47
+ else:
48
+ ax0 = cluster[0]["bbox"][0]
49
+ for k in range(1, len(cluster)):
50
+ cluster[k] = reposition(cluster[k], ax0 - k * 0.01)
51
+
52
+
53
+ # ---------------------------------------------------------------------------
54
+ # Main fix pipeline
55
+ # ---------------------------------------------------------------------------
56
+
57
+
58
+ def fix_zero_width_clusters(chars: list[dict]) -> list[dict]:
59
+ """Reverse zero-width and overlapping Arabic clusters from ligature decomposition.
60
+
61
+ Four overlap patterns are detected and fixed:
62
+
63
+ - Zero-width: consecutive zero-width Arabic chars (w < 0.5) followed by
64
+ one real-width char. Reversed as a cluster.
65
+ - Lam-Alef ligature: an alef variant (ا/أ/إ/آ) followed by a lam (ل)
66
+ where the lam inherits the full ligature width (ratio > 1.8×).
67
+ Swapped to restore logical RTL order (لا instead of ال).
68
+ - Exact-overlap: real-width Arabic chars at the same x-position
69
+ (diff < 0.02). Handled as a pair or reversed as a multi-char cluster.
70
+ - Near-overlap: a char overlapping with the previous char (diff 0.02–1.5)
71
+ AND adjacent to the next char. Repositioned past the next char.
72
+ """
73
+ if not chars:
74
+ return chars
75
+
76
+ result: list[dict] = []
77
+ i = 0
78
+ while i < len(chars):
79
+ w = chars[i]["bbox"][2] - chars[i]["bbox"][0]
80
+
81
+ # --- Zero-width cluster ---
82
+ if w < 0.5 and is_arabic(chars[i]["c"]):
83
+ cluster = [chars[i]]
84
+ j = i + 1
85
+ while j < len(chars):
86
+ jw = chars[j]["bbox"][2] - chars[j]["bbox"][0]
87
+ if jw < 0.5 and is_arabic(chars[j]["c"]):
88
+ cluster.append(chars[j])
89
+ j += 1
90
+ else:
91
+ break
92
+ if j < len(chars) and is_arabic(chars[j]["c"]):
93
+ cluster.append(chars[j])
94
+ j += 1
95
+ _reverse_cluster(cluster)
96
+ result.extend(cluster)
97
+ i = j
98
+ continue
99
+
100
+ # --- Lam-Alef ligature ---
101
+ if (
102
+ w >= 0.5
103
+ and chars[i]["c"] in "اأإآ"
104
+ and i + 1 < len(chars)
105
+ and chars[i + 1]["c"] == "\u0644"
106
+ ):
107
+ lam = chars[i + 1]
108
+ lam_w = lam["bbox"][2] - lam["bbox"][0]
109
+ if lam_w > w * 1.8:
110
+ alef_bbox = chars[i]["bbox"]
111
+ lam_bbox = lam["bbox"]
112
+
113
+ if result and abs(result[-1]["bbox"][0] - alef_bbox[0]) < 1.0:
114
+ # Overlap with preceding char — place lam just after it
115
+ new_x = result[-1]["bbox"][0] - 0.01
116
+ result.append(
117
+ {
118
+ "c": "\u0644",
119
+ "bbox": (new_x, alef_bbox[1], new_x, alef_bbox[3]),
120
+ "origin": chars[i].get("origin", (0, 0)),
121
+ }
122
+ )
123
+ result.append(
124
+ {
125
+ "c": chars[i]["c"],
126
+ "bbox": lam_bbox,
127
+ "origin": lam.get("origin", (0, 0)),
128
+ }
129
+ )
130
+ elif result and is_arabic(result[-1]["c"]):
131
+ # Word-internal/final — no swap needed (e.g. حال not حلا)
132
+ result.append(chars[i])
133
+ result.append(chars[i + 1])
134
+ else:
135
+ # Word-initial/standalone — swap lam↔alef bboxes
136
+ result.append(
137
+ {
138
+ "c": "\u0644",
139
+ "bbox": alef_bbox,
140
+ "origin": chars[i].get("origin", (0, 0)),
141
+ }
142
+ )
143
+ result.append(
144
+ {
145
+ "c": chars[i]["c"],
146
+ "bbox": lam_bbox,
147
+ "origin": lam.get("origin", (0, 0)),
148
+ }
149
+ )
150
+ i += 2
151
+ continue
152
+
153
+ # --- Exact-overlap ---
154
+ if (
155
+ w >= 0.5
156
+ and i + 1 < len(chars)
157
+ and is_arabic(chars[i]["c"])
158
+ and is_arabic(chars[i + 1]["c"])
159
+ and (chars[i + 1]["bbox"][2] - chars[i + 1]["bbox"][0]) >= 0.5
160
+ and abs(chars[i]["bbox"][0] - chars[i + 1]["bbox"][0]) < 0.02
161
+ ):
162
+ cur_x1 = chars[i]["bbox"][2]
163
+ nxt_x1 = chars[i + 1]["bbox"][2]
164
+
165
+ if abs(cur_x1 - nxt_x1) < 0.5 or cur_x1 > nxt_x1:
166
+ result.append(chars[i])
167
+ if i + 2 < len(chars) and is_arabic(chars[i + 2]["c"]):
168
+ result.append(
169
+ reposition(chars[i + 1], chars[i + 2]["bbox"][0] - 0.01)
170
+ )
171
+ else:
172
+ result.append(chars[i + 1])
173
+ i += 2
174
+ else:
175
+ ref_x0 = chars[i]["bbox"][0]
176
+ cluster = [chars[i]]
177
+ j = i + 1
178
+ while j < len(chars):
179
+ jw = chars[j]["bbox"][2] - chars[j]["bbox"][0]
180
+ if (
181
+ jw >= 0.5
182
+ and is_arabic(chars[j]["c"])
183
+ and abs(chars[j]["bbox"][0] - ref_x0) < 0.02
184
+ ):
185
+ cluster.append(chars[j])
186
+ j += 1
187
+ else:
188
+ break
189
+ if j < len(chars) and is_arabic(chars[j]["c"]):
190
+ cluster_min_x0 = min(c["bbox"][0] for c in cluster)
191
+ if abs(chars[j]["bbox"][2] - cluster_min_x0) < 1.0:
192
+ cluster.append(chars[j])
193
+ j += 1
194
+ _reverse_cluster(cluster, reposition_all=True)
195
+ result.extend(cluster)
196
+ i = j
197
+ continue
198
+
199
+ # --- Near-overlap ---
200
+ if (
201
+ result
202
+ and i + 1 < len(chars)
203
+ and is_arabic(chars[i]["c"])
204
+ and is_arabic(chars[i + 1]["c"])
205
+ ):
206
+ cur_x0 = chars[i]["bbox"][0]
207
+             candidates = []  # at most one entry: the nearest Arabic char among the last 3 results
208
+ for k in range(len(result) - 1, max(len(result) - 4, -1), -1):
209
+ if is_arabic(result[k]["c"]):
210
+ candidates.append(result[k])
211
+ break
212
+
213
+ triggered = False
214
+ for prev in candidates:
215
+ diff = abs(prev["bbox"][0] - cur_x0)
216
+ if 0.02 <= diff < 1.5 and abs(cur_x0 - chars[i + 1]["bbox"][2]) < 1.0:
217
+ result.append(reposition(chars[i], chars[i + 1]["bbox"][0] - 0.01))
218
+ i += 1
219
+ triggered = True
220
+ break
221
+ if triggered:
222
+ continue
223
+
224
+ # Default: keep char unchanged
225
+ result.append(chars[i])
226
+ i += 1
227
+
228
+ return result
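
To see the zero-width branch of `fix_zero_width_clusters` in isolation, here is a stripped-down, self-contained sketch. It is not the function above: it omits the `is_arabic` checks, the bbox repositioning, and the other three overlap patterns, and the names `reverse_zero_width_runs` and `ch` are hypothetical. Latin placeholders stand in for Arabic letters.

```python
def reverse_zero_width_runs(chars: list[dict], min_width: float = 0.5) -> list[dict]:
    """Reverse each run of zero-width chars together with the char that follows."""
    out: list[dict] = []
    i = 0
    while i < len(chars):
        j = i
        # Extend j over consecutive chars narrower than min_width.
        while j < len(chars) and chars[j]["bbox"][2] - chars[j]["bbox"][0] < min_width:
            j += 1
        if j > i:
            end = min(j + 1, len(chars))  # include the real-width anchor, if any
            out.extend(reversed(chars[i:end]))
            i = end
        else:
            out.append(chars[i])
            i += 1
    return out


def ch(c: str, x0: float, x1: float) -> dict:
    """Build a rawdict-style char record (hypothetical helper for the demo)."""
    return {"c": c, "bbox": (x0, 0.0, x1, 10.0)}


demo = [ch("a", 10, 10), ch("b", 10, 10), ch("c", 10, 18), ch("d", 20, 26)]
fixed = "".join(c["c"] for c in reverse_zero_width_runs(demo))
```

Here `a` and `b` are zero-width and `c` carries the ligature's width, so the three are emitted in reverse while `d` passes through untouched.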