PyPI - pdf2text-arabic - Versions diffs - 0.1.0__tar.gz - Mend

pdf2text-arabic 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

pdf2text_arabic-0.1.0/PKG-INFO +150 -0
pdf2text_arabic-0.1.0/README.md +141 -0
pdf2text_arabic-0.1.0/pdf2text_arabic/__init__.py +12 -0
pdf2text_arabic-0.1.0/pdf2text_arabic/_chars.py +228 -0
pdf2text_arabic-0.1.0/pdf2text_arabic/_extract.py +241 -0
pdf2text_arabic-0.1.0/pdf2text_arabic/_footer.py +158 -0
pdf2text_arabic-0.1.0/pdf2text_arabic/_tables.py +217 -0
pdf2text_arabic-0.1.0/pdf2text_arabic/_text.py +146 -0
pdf2text_arabic-0.1.0/pdf2text_arabic/cli.py +116 -0
pdf2text_arabic-0.1.0/pdf2text_arabic.egg-info/PKG-INFO +150 -0
pdf2text_arabic-0.1.0/pdf2text_arabic.egg-info/SOURCES.txt +15 -0
pdf2text_arabic-0.1.0/pdf2text_arabic.egg-info/dependency_links.txt +1 -0
pdf2text_arabic-0.1.0/pdf2text_arabic.egg-info/entry_points.txt +2 -0
pdf2text_arabic-0.1.0/pdf2text_arabic.egg-info/requires.txt +2 -0
pdf2text_arabic-0.1.0/pdf2text_arabic.egg-info/top_level.txt +1 -0
pdf2text_arabic-0.1.0/pyproject.toml +10 -0
pdf2text_arabic-0.1.0/setup.cfg +4 -0

pdf2text_arabic-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,150 @@
+Metadata-Version: 2.4
+Name: pdf2text-arabic
+Version: 0.1.0
+Summary: Arabic PDF text extraction with ligature and RTL fixes
+Requires-Python: >=3.13
+Description-Content-Type: text/markdown
+Requires-Dist: pymupdf>=1.27.2.2
+Requires-Dist: tqdm>=4.67.3
+# PDF2Text-Arabic
+Arabic PDF text extraction built on PyMuPDF. Fixes ligature decomposition, RTL ordering, table extraction, and other issues that make raw PyMuPDF output unusable for Arabic.
+## What it fixes
+| # | Problem | Fix |
+|---|---------|-----|
+| 1 | **Ligature decomposition** — PyMuPDF breaks Arabic ligatures (الله, لأ, لإ) into LTR-ordered zero-width chars | Detects zero-width clusters, reverses to RTL order |
+| 1b | **Lam-Alef swap** — لا ligature decomposed as ال (alef before lam) | Detects width ratio, swaps to correct order |
+| 2 | **Presentation Forms** — Returns U+FB50–FDFF / U+FE70–FEFF instead of standard Arabic | NFKC normalization |
+| 3 | **Line splitting** — One visual line split into multiple rawdict lines at same y | Y-coordinate merging with tolerance |
+| 4 | **Number reversal** — RTL sorting reverses digit sequences (2019 → 9102) | Detects LTR digit runs, reverses back |
+| 5 | **Arabic↔digit spacing** — No space between Arabic text and numbers | Regex-inserts spaces at boundaries |
+| 6 | **Artifact spaces** — Space chars with overlapping bboxes cause false word breaks | Only honors spaces with physical gaps > 0.5px |
+| 7 | **Invisible chars** — Zero-width joiners, BOM, LTR/RTL marks, kashida | Stripped in post-processing |
+## Install
+```bash
+pip install pdf2text-arabic
+```
+From source:
+```bash
+pip install .
+# or with uv
+uv pip install .
+```
+## Quick start
+### Python API
+```python
+from pdf2text_arabic import extract_pdf, extract_page
+# Extract entire PDF
+text = extract_pdf("document.pdf")
+# With cropping (remove headers/page numbers)
+text = extract_pdf("document.pdf", crop_top=50, crop_bottom=30)
+# Crop by percentage
+text = extract_pdf("document.pdf", crop_top=5, crop_bottom=3, crop_unit="pct")
+# Disable footnote separator detection
+text = extract_pdf("document.pdf", detect_footer=False)
+```
+### Single page
+```python
+import fitz
+from pdf2text_arabic import extract_page
+doc = fitz.open("document.pdf")
+text = extract_page(doc[0], crop_top=50, crop_bottom=30)
+doc.close()
+```
+### CLI
+```bash
+# Process all PDFs in a directory
+pdf2text-arabic -i ./download -o ./output/plain_text
+# Single file
+pdf2text-arabic -f document.pdf -o ./output
+# With cropping
+pdf2text-arabic -i ./download --crop-top 50 --crop-bottom 30
+# Crop by percentage, no footer detection
+pdf2text-arabic -i ./download --crop-top 5 --crop-bottom 3 --crop-unit pct --no-footer
+```
+## API reference
+### `extract_pdf(pdf_path, **kwargs) → str`
+Extract text from all pages of a PDF.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `pdf_path` | `str` | — | Path to the PDF file |
+| `crop_top` | `float` | `0` | Crop from top of each page |
+| `crop_bottom` | `float` | `0` | Crop from bottom of each page |
+| `crop_unit` | `"px" \| "pct"` | `"px"` | Unit: points or percentage of page height |
+| `detect_footer` | `bool` | `True` | Auto-detect footnote separator lines and exclude content below |
+### `extract_page(page, **kwargs) → str`
+Extract text from a single `fitz.Page`. Same parameters as `extract_pdf` (except `pdf_path`).
+## Features
+### Table extraction
+Tables are automatically detected via PyMuPDF's `find_tables()`, extracted with proper Arabic cell ordering, and formatted as pipe-delimited text. Merged cells are filled down so every row is self-contained:
+```
+الجهات | عدد المقاعد | مقر الدائرة الانتخابية
+طنجة – تطوان – الحسيمة | 2 | ولاية جهة فاس - مكناس
+الشرق | 2 | ولاية جهة فاس - مكناس
+فاس - مكناس | 2 | ولاية جهة فاس - مكناس
+```
+### Footer detection
+Automatically detects horizontal separator lines (both vector drawings and text-based dashes) in the bottom 40% of each page and excludes footnote text below them. Handles non-selectable drawn lines and selectable `------` text.
+### Page cropping
+Crop headers and page numbers by a fixed amount in points (`crop_unit="px"`) or by a percentage of page height (`crop_unit="pct"`).
+## Project structure
+```
+pdf2text_arabic/
+├── __init__.py    # Public API: extract_pdf, extract_page
+├── _chars.py      # Character-level ligature/overlap fixes
+├── _text.py       # RTL text building, cleaning, line merging
+├── _tables.py     # Table detection and formatting
+├── _footer.py     # Footer separator detection
+├── _extract.py    # Page/PDF extraction orchestration
+└── cli.py         # CLI entry point
+```
+## Integration with other projects
+```bash
+pip install pdf2text-arabic
+```
+```python
+from pdf2text_arabic import extract_pdf
+def extract_law_text(path: str) -> str:
+    return extract_pdf(path, crop_top=50, crop_bottom=30, detect_footer=True)
+```
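Reviewer note on rows 2 and 7 of the "What it fixes" table: the sketch below shows what NFKC normalization and invisible-character stripping look like using only the standard library. It is an illustration, not the package's code. The `INVISIBLE` set and the `normalize_arabic` name are assumptions made for this example; the characters the package actually strips may differ.

```python
import unicodedata

# Assumed set, derived from the README wording (zero-width joiners, BOM,
# LTR/RTL marks, kashida). The package's real list is not shown in this diff.
INVISIBLE = {
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u200e",  # left-to-right mark
    "\u200f",  # right-to-left mark
    "\ufeff",  # byte order mark
    "\u0640",  # tatweel (kashida)
}


def normalize_arabic(text: str) -> str:
    """Fold presentation forms to base letters, then drop invisible chars."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in folded if ch not in INVISIBLE)


# U+FEFB is the isolated lam-alef ligature; NFKC expands it to lam + alef.
sample = "\ufefb\u200f\u0640"
print([f"U+{ord(ch):04X}" for ch in normalize_arabic(sample)])
```

Normalizing before stripping matters here: a presentation-form ligature is one code point on input and two base letters afterwards, so any later per-character logic should run on the folded text.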

pdf2text_arabic-0.1.0/README.md ADDED

@@ -0,0 +1,141 @@
+# PDF2Text-Arabic
+Arabic PDF text extraction built on PyMuPDF. Fixes ligature decomposition, RTL ordering, table extraction, and other issues that make raw PyMuPDF output unusable for Arabic.
+## What it fixes
+| # | Problem | Fix |
+|---|---------|-----|
+| 1 | **Ligature decomposition** — PyMuPDF breaks Arabic ligatures (الله, لأ, لإ) into LTR-ordered zero-width chars | Detects zero-width clusters, reverses to RTL order |
+| 1b | **Lam-Alef swap** — لا ligature decomposed as ال (alef before lam) | Detects width ratio, swaps to correct order |
+| 2 | **Presentation Forms** — Returns U+FB50–FDFF / U+FE70–FEFF instead of standard Arabic | NFKC normalization |
+| 3 | **Line splitting** — One visual line split into multiple rawdict lines at same y | Y-coordinate merging with tolerance |
+| 4 | **Number reversal** — RTL sorting reverses digit sequences (2019 → 9102) | Detects LTR digit runs, reverses back |
+| 5 | **Arabic↔digit spacing** — No space between Arabic text and numbers | Regex-inserts spaces at boundaries |
+| 6 | **Artifact spaces** — Space chars with overlapping bboxes cause false word breaks | Only honors spaces with physical gaps > 0.5px |
+| 7 | **Invisible chars** — Zero-width joiners, BOM, LTR/RTL marks, kashida | Stripped in post-processing |
+## Install
+```bash
+pip install pdf2text-arabic
+```
+From source:
+```bash
+pip install .
+# or with uv
+uv pip install .
+```
+## Quick start
+### Python API
+```python
+from pdf2text_arabic import extract_pdf, extract_page
+# Extract entire PDF
+text = extract_pdf("document.pdf")
+# With cropping (remove headers/page numbers)
+text = extract_pdf("document.pdf", crop_top=50, crop_bottom=30)
+# Crop by percentage
+text = extract_pdf("document.pdf", crop_top=5, crop_bottom=3, crop_unit="pct")
+# Disable footnote separator detection
+text = extract_pdf("document.pdf", detect_footer=False)
+```
+### Single page
+```python
+import fitz
+from pdf2text_arabic import extract_page
+doc = fitz.open("document.pdf")
+text = extract_page(doc[0], crop_top=50, crop_bottom=30)
+doc.close()
+```
+### CLI
+```bash
+# Process all PDFs in a directory
+pdf2text-arabic -i ./download -o ./output/plain_text
+# Single file
+pdf2text-arabic -f document.pdf -o ./output
+# With cropping
+pdf2text-arabic -i ./download --crop-top 50 --crop-bottom 30
+# Crop by percentage, no footer detection
+pdf2text-arabic -i ./download --crop-top 5 --crop-bottom 3 --crop-unit pct --no-footer
+```
+## API reference
+### `extract_pdf(pdf_path, **kwargs) → str`
+Extract text from all pages of a PDF.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `pdf_path` | `str` | — | Path to the PDF file |
+| `crop_top` | `float` | `0` | Crop from top of each page |
+| `crop_bottom` | `float` | `0` | Crop from bottom of each page |
+| `crop_unit` | `"px" \| "pct"` | `"px"` | Unit: points or percentage of page height |
+| `detect_footer` | `bool` | `True` | Auto-detect footnote separator lines and exclude content below |
+### `extract_page(page, **kwargs) → str`
+Extract text from a single `fitz.Page`. Same parameters as `extract_pdf` (except `pdf_path`).
+## Features
+### Table extraction
+Tables are automatically detected via PyMuPDF's `find_tables()`, extracted with proper Arabic cell ordering, and formatted as pipe-delimited text. Merged cells are filled down so every row is self-contained:
+```
+الجهات | عدد المقاعد | مقر الدائرة الانتخابية
+طنجة – تطوان – الحسيمة | 2 | ولاية جهة فاس - مكناس
+الشرق | 2 | ولاية جهة فاس - مكناس
+فاس - مكناس | 2 | ولاية جهة فاس - مكناس
+```
+### Footer detection
+Automatically detects horizontal separator lines (both vector drawings and text-based dashes) in the bottom 40% of each page and excludes footnote text below them. Handles non-selectable drawn lines and selectable `------` text.
+### Page cropping
+Crop headers and page numbers by a fixed amount in points (`crop_unit="px"`) or by a percentage of page height (`crop_unit="pct"`).
+## Project structure
+```
+pdf2text_arabic/
+├── __init__.py    # Public API: extract_pdf, extract_page
+├── _chars.py      # Character-level ligature/overlap fixes
+├── _text.py       # RTL text building, cleaning, line merging
+├── _tables.py     # Table detection and formatting
+├── _footer.py     # Footer separator detection
+├── _extract.py    # Page/PDF extraction orchestration
+└── cli.py         # CLI entry point
+```
+## Integration with other projects
+```bash
+pip install pdf2text-arabic
+```
+```python
+from pdf2text_arabic import extract_pdf
+def extract_law_text(path: str) -> str:
+    return extract_pdf(path, crop_top=50, crop_bottom=30, detect_footer=True)
+```
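Reviewer note on rows 4 and 5 of the "What it fixes" table: a minimal sketch of why a right-to-left sort reverses numbers and how a regex pass can restore them and add boundary spaces. The function names, the `(char, x0)` input shape, and the letter range U+0621 to U+064A are assumptions for this example; the package's `_text.py` is not part of this excerpt and may work differently.

```python
import re

# Digit runs, optionally joined by '.', ',' or '/' (for example 12.5).
_NUMBER = re.compile(r"[0-9]+(?:[.,/][0-9]+)*")
# Core Arabic letters only (U+0621 to U+064A); an assumption for this sketch.
_LETTER_THEN_DIGIT = re.compile(r"([\u0621-\u064A])([0-9])")
_DIGIT_THEN_LETTER = re.compile(r"([0-9])([\u0621-\u064A])")


def build_rtl_line(chars: list[tuple[str, float]]) -> str:
    """Join (char, x0) pairs right-to-left, then restore reversed numbers."""
    ordered = sorted(chars, key=lambda item: item[1], reverse=True)
    text = "".join(ch for ch, _ in ordered)
    # Digits are laid out left-to-right on the page, so the RTL sort above
    # emitted each number backwards; reversing the whole match undoes that.
    return _NUMBER.sub(lambda m: m.group(0)[::-1], text)


def space_arabic_digits(text: str) -> str:
    """Insert a space wherever an Arabic letter touches an ASCII digit."""
    text = _LETTER_THEN_DIGIT.sub(r"\1 \2", text)
    return _DIGIT_THEN_LETTER.sub(r"\1 \2", text)


# Synthetic line: a three-letter Arabic word on the right, "2019" on the left.
line = [
    ("\u0633", 100.0),  # seen
    ("\u0646", 90.0),   # noon
    ("\u0629", 80.0),   # teh marbuta
    ("2", 40.0), ("0", 50.0), ("1", 60.0), ("9", 70.0),
]
print(space_arabic_digits(build_rtl_line(line)))
```

The sorted characters first come out as the word followed by `9102`; the regex pass turns that back into `2019`, and the spacing pass separates it from the word. Matching the separators as part of the number keeps values such as `12.5` intact, since reversing each digit run on its own would yield `5.12`.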

pdf2text_arabic-0.1.0/pdf2text_arabic/__init__.py ADDED

@@ -0,0 +1,12 @@
+"""Arabic PDF text extraction with PyMuPDF ligature and RTL fixes.
+Usage:
+    from pdf2text_arabic import extract_pdf, extract_page
+    text = extract_pdf("document.pdf")
+"""
+from ._extract import extract_page, extract_pdf
+from .cli import main
+__all__ = ["extract_pdf", "extract_page", "main"]

pdf2text_arabic-0.1.0/pdf2text_arabic/_chars.py ADDED

@@ -0,0 +1,228 @@
+"""Character-level Arabic fixes for PyMuPDF ligature decomposition.
+Fixes zero-width clusters, lam-alef ligature swaps, exact-overlap pairs,
+and near-overlap repositioning — all caused by PyMuPDF decomposing Arabic
+ligature glyphs into visual LTR byte order.
+"""
+import re
+import unicodedata
+# ---------------------------------------------------------------------------
+# Constants & helpers
+# ---------------------------------------------------------------------------
+ARABIC_RE = re.compile(
+    r"[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]"
+)
+def is_arabic(c: str) -> bool:
+    """Check if a character is Arabic (after NFKC normalization)."""
+    return bool(ARABIC_RE.match(unicodedata.normalize("NFKC", c)))
+def reposition(char: dict, new_x: float) -> dict:
+    """Return a zero-width copy of *char* with both bbox x-coordinates set to *new_x*."""
+    return {
+        "c": char["c"],
+        "bbox": (new_x, char["bbox"][1], new_x, char["bbox"][3]),
+        "origin": char.get("origin", (0, 0)),
+    }
+def _reverse_cluster(cluster: list[dict], reposition_all: bool = False) -> None:
+    """Reverse a ligature cluster in-place and assign decreasing x-positions.
+    If *reposition_all* is False (default), the first char after reversal
+    (the anchor) keeps its original bbox.  If True, every char including
+    the anchor is repositioned from ``max(x0)`` downward.
+    """
+    cluster.reverse()
+    if len(cluster) > 1:
+        if reposition_all:
+            anchor_x0 = max(c["bbox"][0] for c in cluster)
+            for k in range(len(cluster)):
+                cluster[k] = reposition(cluster[k], anchor_x0 - k * 0.01)
+        else:
+            ax0 = cluster[0]["bbox"][0]
+            for k in range(1, len(cluster)):
+                cluster[k] = reposition(cluster[k], ax0 - k * 0.01)
+# ---------------------------------------------------------------------------
+# Main fix pipeline
+# ---------------------------------------------------------------------------
+def fix_zero_width_clusters(chars: list[dict]) -> list[dict]:
+    """Reverse zero-width and overlapping Arabic clusters from ligature decomposition.
+    Four overlap patterns are detected and fixed:
+    - Zero-width: consecutive zero-width Arabic chars (w < 0.5) followed by
+      one real-width char.  Reversed as a cluster.
+    - Lam-Alef ligature: an alef variant (ا/أ/إ/آ) followed by a lam (ل)
+      where the lam inherits the full ligature width (ratio > 1.8×).
+      Swapped to restore logical RTL order (لا instead of ال).
+    - Exact-overlap: real-width Arabic chars at the same x-position
+      (diff < 0.02).  Handled as a pair or reversed as a multi-char cluster.
+    - Near-overlap: a char overlapping with the previous char (diff 0.02–1.5)
+      AND adjacent to the next char.  Repositioned past the next char.
+    """
+    if not chars:
+        return chars
+    result: list[dict] = []
+    i = 0
+    while i < len(chars):
+        w = chars[i]["bbox"][2] - chars[i]["bbox"][0]
+        # --- Zero-width cluster ---
+        if w < 0.5 and is_arabic(chars[i]["c"]):
+            cluster = [chars[i]]
+            j = i + 1
+            while j < len(chars):
+                jw = chars[j]["bbox"][2] - chars[j]["bbox"][0]
+                if jw < 0.5 and is_arabic(chars[j]["c"]):
+                    cluster.append(chars[j])
+                    j += 1
+                else:
+                    break
+            if j < len(chars) and is_arabic(chars[j]["c"]):
+                cluster.append(chars[j])
+                j += 1
+            _reverse_cluster(cluster)
+            result.extend(cluster)
+            i = j
+            continue
+        # --- Lam-Alef ligature ---
+        if (
+            w >= 0.5
+            and chars[i]["c"] in "اأإآ"
+            and i + 1 < len(chars)
+            and chars[i + 1]["c"] == "\u0644"
+        ):
+            lam = chars[i + 1]
+            lam_w = lam["bbox"][2] - lam["bbox"][0]
+            if lam_w > w * 1.8:
+                alef_bbox = chars[i]["bbox"]
+                lam_bbox = lam["bbox"]
+                if result and abs(result[-1]["bbox"][0] - alef_bbox[0]) < 1.0:
+                    # Overlap with preceding char — place lam just after it
+                    new_x = result[-1]["bbox"][0] - 0.01
+                    result.append(
+                        {
+                            "c": "\u0644",
+                            "bbox": (new_x, alef_bbox[1], new_x, alef_bbox[3]),
+                            "origin": chars[i].get("origin", (0, 0)),
+                        }
+                    )
+                    result.append(
+                        {
+                            "c": chars[i]["c"],
+                            "bbox": lam_bbox,
+                            "origin": lam.get("origin", (0, 0)),
+                        }
+                    )
+                elif result and is_arabic(result[-1]["c"]):
+                    # Word-internal/final — no swap needed (e.g. حال not حلا)
+                    result.append(chars[i])
+                    result.append(chars[i + 1])
+                else:
+                    # Word-initial/standalone — swap lam↔alef bboxes
+                    result.append(
+                        {
+                            "c": "\u0644",
+                            "bbox": alef_bbox,
+                            "origin": chars[i].get("origin", (0, 0)),
+                        }
+                    )
+                    result.append(
+                        {
+                            "c": chars[i]["c"],
+                            "bbox": lam_bbox,
+                            "origin": lam.get("origin", (0, 0)),
+                        }
+                    )
+                i += 2
+                continue
+        # --- Exact-overlap ---
+        if (
+            w >= 0.5
+            and i + 1 < len(chars)
+            and is_arabic(chars[i]["c"])
+            and is_arabic(chars[i + 1]["c"])
+            and (chars[i + 1]["bbox"][2] - chars[i + 1]["bbox"][0]) >= 0.5
+            and abs(chars[i]["bbox"][0] - chars[i + 1]["bbox"][0]) < 0.02
+        ):
+            cur_x1 = chars[i]["bbox"][2]
+            nxt_x1 = chars[i + 1]["bbox"][2]
+            if abs(cur_x1 - nxt_x1) < 0.5 or cur_x1 > nxt_x1:
+                result.append(chars[i])
+                if i + 2 < len(chars) and is_arabic(chars[i + 2]["c"]):
+                    result.append(
+                        reposition(chars[i + 1], chars[i + 2]["bbox"][0] - 0.01)
+                    )
+                else:
+                    result.append(chars[i + 1])
+                i += 2
+            else:
+                ref_x0 = chars[i]["bbox"][0]
+                cluster = [chars[i]]
+                j = i + 1
+                while j < len(chars):
+                    jw = chars[j]["bbox"][2] - chars[j]["bbox"][0]
+                    if (
+                        jw >= 0.5
+                        and is_arabic(chars[j]["c"])
+                        and abs(chars[j]["bbox"][0] - ref_x0) < 0.02
+                    ):
+                        cluster.append(chars[j])
+                        j += 1
+                    else:
+                        break
+                if j < len(chars) and is_arabic(chars[j]["c"]):
+                    cluster_min_x0 = min(c["bbox"][0] for c in cluster)
+                    if abs(chars[j]["bbox"][2] - cluster_min_x0) < 1.0:
+                        cluster.append(chars[j])
+                        j += 1
+                _reverse_cluster(cluster, reposition_all=True)
+                result.extend(cluster)
+                i = j
+            continue
+        # --- Near-overlap ---
+        if (
+            result
+            and i + 1 < len(chars)
+            and is_arabic(chars[i]["c"])
+            and is_arabic(chars[i + 1]["c"])
+        ):
+            cur_x0 = chars[i]["bbox"][0]
+            # Look back over at most three emitted chars for the nearest Arabic one
+            candidates = []
+            for k in range(len(result) - 1, max(len(result) - 4, -1), -1):
+                if is_arabic(result[k]["c"]):
+                    candidates.append(result[k])
+                    break
+            triggered = False
+            for prev in candidates:
+                diff = abs(prev["bbox"][0] - cur_x0)
+                if 0.02 <= diff < 1.5 and abs(cur_x0 - chars[i + 1]["bbox"][2]) < 1.0:
+                    result.append(reposition(chars[i], chars[i + 1]["bbox"][0] - 0.01))
+                    i += 1
+                    triggered = True
+                    break
+            if triggered:
+                continue
+        # Default: keep char unchanged
+        result.append(chars[i])
+        i += 1
+    return result
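Reviewer note on the zero-width branch of `fix_zero_width_clusters`: the sketch below isolates the core idea on synthetic data. It is a simplified stand-in, not the function above: it skips the `is_arabic` checks and the bbox repositioning, and the names `width`, `reverse_zero_width_runs`, and `make` are hypothetical. The input is hand-built to mimic the pattern the docstring describes (zero-width chars in visual LTR order followed by one real-width char); it is not captured PyMuPDF output.

```python
def width(char: dict) -> float:
    """Horizontal extent of a char dict shaped like {'c': ..., 'bbox': ...}."""
    x0, _, x1, _ = char["bbox"]
    return x1 - x0


def reverse_zero_width_runs(chars: list[dict], min_width: float = 0.5) -> list[dict]:
    """Reverse each run of zero-width chars together with the wide char after it."""
    out: list[dict] = []
    i = 0
    while i < len(chars):
        j = i
        while j < len(chars) and width(chars[j]) < min_width:
            j += 1
        if j > i:
            # Include the real-width char that terminates the run, if any.
            end = min(j + 1, len(chars))
            out.extend(reversed(chars[i:end]))
            i = end
        else:
            out.append(chars[i])
            i += 1
    return out


def make(c: str, x0: float, x1: float) -> dict:
    """Build a synthetic char dict with a fixed vertical extent."""
    return {"c": c, "bbox": (x0, 0.0, x1, 10.0)}


# Synthetic decomposition of a four-letter ligature: heh, lam, lam arrive
# first with zero width, and the alef carries the full glyph width.
decomposed = [
    make("\u0647", 50.0, 50.0),  # heh
    make("\u0644", 50.0, 50.0),  # lam
    make("\u0644", 50.0, 50.0),  # lam
    make("\u0627", 50.0, 62.0),  # alef
]
fixed = reverse_zero_width_runs(decomposed)
print("".join(ch["c"] for ch in fixed))
```

After the reversal the letters read alef, lam, lam, heh in logical order. The real function additionally assigns slightly decreasing x-positions (steps of 0.01) so that a later right-to-left sort by `x0` preserves this order.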