PyPI - pdf2docx-plus - Versions diffs - 0.6.1__tar.gz - Mend

pdf2docx-plus 0.6.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (107)

pdf2docx_plus-0.6.1/.gitignore ADDED

@@ -0,0 +1,32 @@
+__pycache__/
+*.py[cod]
+*.egg-info/
+.eggs/
+dist/
+build/
+.venv/
+venv/
+env/
+.mypy_cache/
+.ruff_cache/
+.pytest_cache/
+.coverage
+htmlcov/
+# bench outputs
+bench/reports/outputs/
+bench/reports/*.json
+!bench/reports/.gitkeep
+# legacy upstream patterns
+*.jp*g
+layout.json
+.vscode/
+test/issues/
+test/features/
+test/outputs/
+diff.png
+pdf2docx*.rst
+.env
+.DS_Store

pdf2docx_plus-0.6.1/LICENSE ADDED

@@ -0,0 +1,7 @@
+Copyright (c) 2026 Artifex Software, Inc.
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

pdf2docx_plus-0.6.1/LICENSING.md ADDED

@@ -0,0 +1,70 @@
+# Licensing
+`pdf2docx-plus` is MIT-licensed (see `LICENSE`). **However, it depends on PyMuPDF,
+which is AGPL-3.0.** This section documents the practical consequences.
+## Dependency license matrix
+| Package | License | Shipped with | Note |
+|---|---|---|---|
+| pdf2docx-plus (this project) | MIT | core | |
+| pdf2docx (vendored patched upstream) | MIT | core | Artifex / dothinking |
+| PyMuPDF (fitz) | **AGPL-3.0** | core | **See AGPL section below** |
+| python-docx | MIT | core | |
+| fonttools | MIT | core | |
+| numpy | BSD-3-Clause | core | |
+| opencv-python-headless | Apache-2.0 | core | |
+| fire | Apache-2.0 | core | |
+| fastapi / uvicorn | MIT / BSD-3 | `rest` extra | |
+| apted | MIT | `bench` extra | |
+| scikit-image | BSD-3-Clause | `bench` extra | |
+| Table Transformer weights | MIT | `ml-tables` extra | |
+| pix2tex / LaTeX-OCR | MIT | `ml-formula` extra | |
+| PaddleOCR | Apache-2.0 | `ml-ocr` extra | |
+| UniMERNet | Apache-2.0 | (optional, manual) | |
+## AGPL implications (PyMuPDF)
+PyMuPDF is distributed under **AGPL-3.0**. When `pdf2docx-plus` is redistributed
+or offered as a network service, the AGPL copyleft reaches through to the
+consumer of that service:
+- If you **ship pdf2docx-plus inside a closed-source product**, you need a
+  commercial PyMuPDF license from Artifex.
+- If you **offer pdf2docx-plus as a SaaS/network service** to third parties,
+  the AGPL requires you to make the corresponding source (including your app)
+  available to those users.
+- **Internal use** inside a single organisation is typically fine under AGPL.
+## Migrating away from PyMuPDF (future work)
+The parse layer is isolated behind the `pdf2docx_plus.backends` abstraction so
+the fitz dependency can be swapped for an Apache-2.0 / MIT alternative:
+- **`pypdfium2`** (Apache-2.0): Google PDFium bindings. Exposes text with
+  positioning and page rendering but does *not* provide the rich
+  block/line/span extraction or path extraction that the current pipeline
+  relies on. A swap requires re-implementing ~3-4 weeks of extraction logic
+  using `pypdfium2` + `pdfplumber` (MIT) for ruling-line tables.
+- **`pdfminer.six`** (MIT): slower but full text/layout extraction. Could be
+  a drop-in for many text paths.
+The `pdf2docx_plus.backends.Backend` Protocol is the seam. When a permissive
+backend is implemented, the same high-level API keeps working and AGPL falls
+away from the default distribution.
+## OCR / ML model weights
+Some ML integrations downloaded by the optional extras carry **non-commercial
+or research-only** weights:
+- **LayoutLMv3 weights**: CC-BY-NC-SA-4.0 — **not safe for commercial use**.
+  `pdf2docx-plus` does NOT ship or auto-download these.
+- **Nougat (Meta) weights**: CC-BY-NC-4.0 — **not safe for commercial use**.
+- **Surya / Marker weights**: OpenRAIL-M with a revenue cap. Safe up to the
+  cap; verify before relying on them in production.
+The default `ml-*` extras pin only permissively-licensed models
+(Table Transformer, pix2tex, PaddleOCR, UniMERNet). Users who wire in their
+own detectors via the plugin API are responsible for their own weight
+licensing.
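
Reviewer note on the backend seam described in LICENSING.md: the diff shown here does not include the definition of `pdf2docx_plus.backends.Backend`, so the following is only a minimal sketch of how a `Protocol`-based seam of that kind can work. The method names `page_count` and `extract_spans`, the `Span` dataclass, and `InMemoryBackend` are assumptions made for illustration and are not taken from the package.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Span:
    """A positioned run of text (hypothetical shape, for illustration only)."""

    text: str
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 in points


@runtime_checkable
class Backend(Protocol):
    """Hypothetical parse-layer contract; the real Protocol may differ."""

    def page_count(self) -> int: ...

    def extract_spans(self, page_index: int) -> list[Span]: ...


class InMemoryBackend:
    """Toy backend over pre-parsed pages; it satisfies Backend structurally."""

    def __init__(self, pages: list[list[Span]]) -> None:
        self._pages = pages

    def page_count(self) -> int:
        return len(self._pages)

    def extract_spans(self, page_index: int) -> list[Span]:
        return list(self._pages[page_index])


def page_text(backend: Backend, page_index: int) -> str:
    """Join span text in top-to-bottom, then left-to-right order."""
    spans = sorted(
        backend.extract_spans(page_index),
        key=lambda s: (s.bbox[1], s.bbox[0]),
    )
    return " ".join(s.text for s in spans)
```

Because the contract is structural, `InMemoryBackend` never inherits from `Backend`. A PyMuPDF-backed class and a permissively licensed one could both be passed to `page_text` unchanged, which is the property the migration plan relies on.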

pdf2docx_plus-0.6.1/PKG-INFO ADDED

@@ -0,0 +1,236 @@
+Metadata-Version: 2.4
+Name: pdf2docx-plus
+Version: 0.6.1
+Summary: Hardened PDF->DOCX converter. Fork of pdf2docx with stability fixes, typed API, plugin architecture, and optional ML layout/OCR/table backends.
+Project-URL: Homepage, https://github.com/mithunvoe/pdf2docx-plus
+Project-URL: Issues, https://github.com/mithunvoe/pdf2docx-plus/issues
+Project-URL: Upstream, https://github.com/ArtifexSoftware/pdf2docx
+Author: pdf2docx-plus maintainers
+License: MIT
+License-File: LICENSE
+Keywords: convert,docx,ocr,pdf,table,word
+Classifier: Development Status :: 3 - Alpha
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Office/Business
+Classifier: Topic :: Text Processing :: Markup
+Requires-Python: >=3.11
+Requires-Dist: fire>=0.5.0
+Requires-Dist: fonttools>=4.24.0
+Requires-Dist: numpy>=1.24.0
+Requires-Dist: opencv-python-headless>=4.8
+Requires-Dist: pymupdf>=1.24.0
+Requires-Dist: python-docx>=1.1.0
+Requires-Dist: typing-extensions>=4.10
+Provides-Extra: all
+Requires-Dist: apted>=1.0.3; extra == 'all'
+Requires-Dist: fastapi>=0.110; extra == 'all'
+Requires-Dist: pillow>=10.0; extra == 'all'
+Requires-Dist: python-multipart>=0.0.9; extra == 'all'
+Requires-Dist: scikit-image>=0.22; extra == 'all'
+Requires-Dist: scipy>=1.11; extra == 'all'
+Requires-Dist: uvicorn[standard]>=0.27; extra == 'all'
+Provides-Extra: bench
+Requires-Dist: apted>=1.0.3; extra == 'bench'
+Requires-Dist: pillow>=10.0; extra == 'bench'
+Requires-Dist: scikit-image>=0.22; extra == 'bench'
+Requires-Dist: scipy>=1.11; extra == 'bench'
+Provides-Extra: dev
+Requires-Dist: mypy>=1.10; extra == 'dev'
+Requires-Dist: pre-commit>=3.6; extra == 'dev'
+Requires-Dist: pytest-cov>=4.1; extra == 'dev'
+Requires-Dist: pytest-timeout>=2.2; extra == 'dev'
+Requires-Dist: pytest>=8.0; extra == 'dev'
+Requires-Dist: ruff>=0.6; extra == 'dev'
+Requires-Dist: types-setuptools; extra == 'dev'
+Provides-Extra: ml-formula
+Requires-Dist: pix2tex>=0.1.4; extra == 'ml-formula'
+Requires-Dist: torch>=2.2; extra == 'ml-formula'
+Provides-Extra: ml-layout
+Requires-Dist: torch>=2.2; extra == 'ml-layout'
+Requires-Dist: transformers>=4.40; extra == 'ml-layout'
+Provides-Extra: ml-ocr
+Requires-Dist: paddleocr>=2.7; extra == 'ml-ocr'
+Requires-Dist: paddlepaddle>=2.6; extra == 'ml-ocr'
+Provides-Extra: ml-tables
+Requires-Dist: timm>=0.9; extra == 'ml-tables'
+Requires-Dist: torch>=2.2; extra == 'ml-tables'
+Requires-Dist: transformers>=4.40; extra == 'ml-tables'
+Provides-Extra: rest
+Requires-Dist: fastapi>=0.110; extra == 'rest'
+Requires-Dist: python-multipart>=0.0.9; extra == 'rest'
+Requires-Dist: uvicorn[standard]>=0.27; extra == 'rest'
+Description-Content-Type: text/markdown
+# pdf2docx-plus
+Hardened fork of [pdf2docx](https://github.com/ArtifexSoftware/pdf2docx) — a
+Python PDF → DOCX converter that actually writes editable Word documents
+(not Markdown, not HTML).
+**What's different from upstream**
+| | upstream `pdf2docx` | `pdf2docx-plus` |
+|---|---|---|
+| Python support | 3.10+ | **3.11 / 3.12 / 3.13** |
+| Hyperlink OOXML | nested inside `<w:r>` (invalid) | paragraph-level `<w:hyperlink>` (valid) |
+| NULL-byte / control chars | sometimes leaks into `<w:t>`, corrupts DOCX | stripped at run insertion |
+| Errors | single `ConversionException` | `InputError` / `ParseError` / `MakeDocxError` / `PasswordRequired` / `TimeoutExceeded` |
+| Typed API | no | `py.typed`, dataclasses, `Protocol`-based plugins |
+| Return value | `None` | `ConversionResult` with per-page accounting |
+| Timeout | none (can hang forever) | `timeout_s=` watchdog |
+| Plugin architecture | no | swap table / layout / OCR / formula backends |
+| REST server | no | `pdf2docx-plus serve` (FastAPI, optional) |
+| ML hooks (opt-in) | no | Table Transformer, Granite-Docling, PaddleOCR, pix2tex |
+| Tables → CSV | no | `--tables-csv DIR` |
+| Structured logging | hijacks root logger | scoped `pdf2docx_plus` logger |
+## Install
+```bash
+pip install pdf2docx-plus            # core
+pip install 'pdf2docx-plus[rest]'    # + FastAPI server
+pip install 'pdf2docx-plus[bench]'   # + evaluation harness
+pip install 'pdf2docx-plus[ml-tables]' # + Table Transformer (torch)
+pip install 'pdf2docx-plus[ml-ocr]'  # + PaddleOCR
+```
+## Quick start
+```python
+from pdf2docx_plus import convert
+result = convert("in.pdf", "out.docx", timeout_s=120)
+print(result.pages_ok, "/", result.pages_total, "pages in", result.elapsed_s, "s")
+```
+Or with more control:
+```python
+from pdf2docx_plus import Converter, PluginRegistry
+from pdf2docx_plus.hooks import TableTransformerDetector
+plugins = PluginRegistry()
+plugins.add_table_detector(TableTransformerDetector(device="cuda"))
+with Converter("in.pdf", password="s3cret") as cv:
+    result = cv.convert(
+        "out.docx",
+        pages=[0, 1, 2],
+        profile="fidelity",     # "fast" | "fidelity" | "semantic"
+        timeout_s=60,
+        continue_on_error=True,
+    )
+    for p in result.page_results:
+        if not p.ok:
+            print(f"page {p.page_index}: {p.error}")
+```
+## CLI
+```
+pdf2docx-plus convert in.pdf out.docx --timeout 120 --profile fidelity
+pdf2docx-plus convert in.pdf --pages 0,2,5 --tables-csv tables/
+pdf2docx-plus extract-tables in.pdf --out tables.json
+pdf2docx-plus serve --host 0.0.0.0 --port 8000
+pdf2docx-plus version
+```
+## REST server
+```bash
+pip install 'pdf2docx-plus[rest]'
+pdf2docx-plus serve --port 8000
+# in another shell:
+curl -F file=@in.pdf -F profile=fidelity http://localhost:8000/convert -o out.docx
+```
+Endpoints:
+| Method | Path | Body | Returns |
+|---|---|---|---|
+| POST | `/convert` | multipart `file`, optional `password`, `profile`, `timeout_s` | DOCX bytes + `X-Pages-Ok` / `X-Pages-Failed` / `X-Elapsed-Seconds` headers |
+| POST | `/extract-tables` | multipart `file`, optional `password` | JSON `{"tables": [...]}` |
+| GET  | `/healthz` | — | `{"status": "ok"}` |
+| GET  | `/version` | — | `{"version": "..."}` |
+## Plugin architecture
+Four extension points, all `Protocol`-based:
+```python
+from pdf2docx_plus.plugins import (
+    TableDetector, LayoutDetector, OcrEngine, FormulaRecognizer
+)
+```
+Register any implementation on `PluginRegistry` and pass it to `Converter`.
+Plugins never kill a conversion — exceptions raised inside a plugin are
+logged and skipped.
+Built-in ML hooks (opt-in extras):
+| Hook | Backend | Extra | Weights license |
+|---|---|---|---|
+| `TableTransformerDetector` | HuggingFace `microsoft/table-transformer-*` | `ml-tables` | MIT |
+| `GraniteDoclingLayoutDetector` | `ibm-granite/granite-docling-258M` | `ml-layout` | Apache-2.0 |
+| `PaddleOcrEngine` | PaddleOCR | `ml-ocr` | Apache-2.0 |
+| `Pix2TexFormulaRecognizer` | pix2tex | `ml-formula` | MIT |
+| `UniMERNetFormulaRecognizer` | UniMERNet (bring weights) | manual | Apache-2.0 |
+## Benchmark
+```bash
+pip install 'pdf2docx-plus[bench]'
+python -m bench.run --corpus bench/corpus --out bench/reports/latest.json
+```
+Metrics implemented: text F1, TEDS (`apted`), reading-order Kendall-tau,
+rendered SSIM (via LibreOffice + scikit-image), and editability ratio.
+Seed corpus in this repo: 3 financial fund PDFs (born-digital). Drop more
+under `bench/corpus/<name>/input.pdf` and, optionally, `expected_text.txt`,
+`expected_tables.json`, `expected_order.json` for scoring.
+Current baseline on the seed corpus (76 pages, CPU):
+```
+awhkef                  9 pages   0 failed    7.1 s   74 KB
+first_sentier          58 pages   0 failed   15.8 s  155 KB
+kfs_bosera              9 pages   0 failed    4.3 s   87 KB
+TOTAL                  76 pages   0 failed   27.7 s  2.75 pg/s
+```
+## Licensing
+`pdf2docx-plus` is MIT, but **depends on PyMuPDF (AGPL-3.0)** — this
+propagates to you if you redistribute or expose as a network service. See
+[LICENSING.md](LICENSING.md) for the full dependency matrix, AGPL
+implications, and the future pypdfium2 migration path.
+## What's NOT done yet (roadmap)
+This fork covers **Phase 0** (foundation) and most of **Phase 1** (stability
++ typed API) from the original 21-week
+[`PDF2DOCX_FORK_PLAN.md`](../PDF2DOCX_FORK_PLAN.md). Phases 2–5 are scaffolded
+via the plugin architecture but the ML-backed hooks need real integration
+work to reach the v1.0 success criteria in the plan (TEDS ≥ 0.90, text F1 ≥
+0.98, reading-order Kendall-tau ≥ 0.90).
+Specifically, still open:
+- Train / evaluate Table Transformer + Granite-Docling against an annotated
+  corpus (plan §K).
+- Cross-page table stitching heuristic (§B.7).
+- Header/footer → `w:hdr` / `w:ftr` emission (§C.13).
+- Math recognition pipeline wiring (§F.24).
+- Scanned-PDF OCR routing + auto-detect (§G.25).
+- `styles.xml` rewrite (§H.27) — currently we still use python-docx defaults.
+- pypdfium2 backend for permissive licensing (§6).
+## Credits
+Forked from [ArtifexSoftware/pdf2docx](https://github.com/ArtifexSoftware/pdf2docx)
+(originally by [@dothinking](https://github.com/dothinking)). MIT.
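
Reviewer note on the PKG-INFO header block: core metadata uses an email-style header format, so the standard library can read it. The sketch below groups `Requires-Dist` entries by extra. It is deliberately naive and only understands the `extra == '...'` marker form that appears in this file. A general solution would parse environment markers properly. The `PKG_INFO` sample is an abbreviated copy of the headers above, and `requirements_by_extra` is a hypothetical helper name.

```python
from collections import defaultdict
from email.parser import Parser

# Abbreviated copy of the header block from the PKG-INFO in this diff.
PKG_INFO = """\
Metadata-Version: 2.4
Name: pdf2docx-plus
Version: 0.6.1
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: python-docx>=1.1.0
Provides-Extra: rest
Requires-Dist: fastapi>=0.110; extra == 'rest'
Requires-Dist: uvicorn[standard]>=0.27; extra == 'rest'
"""


def requirements_by_extra(pkg_info: str) -> dict[str, list[str]]:
    """Group Requires-Dist entries under 'core' or the extra that gates them."""
    message = Parser().parsestr(pkg_info)
    groups: dict[str, list[str]] = defaultdict(list)
    for entry in message.get_all("Requires-Dist", []):
        requirement, _, marker = entry.partition(";")
        marker = marker.strip()
        extra = "core"
        # Naive: handles only the single-clause "extra == 'name'" marker form.
        if marker.startswith("extra =="):
            extra = marker.split("==", 1)[1].strip().strip("'\"")
        groups[extra].append(requirement.strip())
    return dict(groups)


if __name__ == "__main__":
    for extra, requirements in requirements_by_extra(PKG_INFO).items():
        print(extra, requirements)
```

Run against the full file, a grouping like this makes it easy to confirm that PyMuPDF sits in the unconditional (core) set, which is why the AGPL note applies to every install.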

pdf2docx_plus-0.6.1/README.md ADDED

@@ -0,0 +1,170 @@
+# pdf2docx-plus
+Hardened fork of [pdf2docx](https://github.com/ArtifexSoftware/pdf2docx) — a
+Python PDF → DOCX converter that actually writes editable Word documents
+(not Markdown, not HTML).
+**What's different from upstream**
+| | upstream `pdf2docx` | `pdf2docx-plus` |
+|---|---|---|
+| Python support | 3.10+ | **3.11 / 3.12 / 3.13** |
+| Hyperlink OOXML | nested inside `<w:r>` (invalid) | paragraph-level `<w:hyperlink>` (valid) |
+| NULL-byte / control chars | sometimes leaks into `<w:t>`, corrupts DOCX | stripped at run insertion |
+| Errors | single `ConversionException` | `InputError` / `ParseError` / `MakeDocxError` / `PasswordRequired` / `TimeoutExceeded` |
+| Typed API | no | `py.typed`, dataclasses, `Protocol`-based plugins |
+| Return value | `None` | `ConversionResult` with per-page accounting |
+| Timeout | none (can hang forever) | `timeout_s=` watchdog |
+| Plugin architecture | no | swap table / layout / OCR / formula backends |
+| REST server | no | `pdf2docx-plus serve` (FastAPI, optional) |
+| ML hooks (opt-in) | no | Table Transformer, Granite-Docling, PaddleOCR, pix2tex |
+| Tables → CSV | no | `--tables-csv DIR` |
+| Structured logging | hijacks root logger | scoped `pdf2docx_plus` logger |
+## Install
+```bash
+pip install pdf2docx-plus            # core
+pip install 'pdf2docx-plus[rest]'    # + FastAPI server
+pip install 'pdf2docx-plus[bench]'   # + evaluation harness
+pip install 'pdf2docx-plus[ml-tables]' # + Table Transformer (torch)
+pip install 'pdf2docx-plus[ml-ocr]'  # + PaddleOCR
+```
+## Quick start
+```python
+from pdf2docx_plus import convert
+result = convert("in.pdf", "out.docx", timeout_s=120)
+print(result.pages_ok, "/", result.pages_total, "pages in", result.elapsed_s, "s")
+```
+Or with more control:
+```python
+from pdf2docx_plus import Converter, PluginRegistry
+from pdf2docx_plus.hooks import TableTransformerDetector
+plugins = PluginRegistry()
+plugins.add_table_detector(TableTransformerDetector(device="cuda"))
+with Converter("in.pdf", password="s3cret") as cv:
+    result = cv.convert(
+        "out.docx",
+        pages=[0, 1, 2],
+        profile="fidelity",     # "fast" | "fidelity" | "semantic"
+        timeout_s=60,
+        continue_on_error=True,
+    )
+    for p in result.page_results:
+        if not p.ok:
+            print(f"page {p.page_index}: {p.error}")
+```
+## CLI
+```
+pdf2docx-plus convert in.pdf out.docx --timeout 120 --profile fidelity
+pdf2docx-plus convert in.pdf --pages 0,2,5 --tables-csv tables/
+pdf2docx-plus extract-tables in.pdf --out tables.json
+pdf2docx-plus serve --host 0.0.0.0 --port 8000
+pdf2docx-plus version
+```
+## REST server
+```bash
+pip install 'pdf2docx-plus[rest]'
+pdf2docx-plus serve --port 8000
+# in another shell:
+curl -F file=@in.pdf -F profile=fidelity http://localhost:8000/convert -o out.docx
+```
+Endpoints:
+| Method | Path | Body | Returns |
+|---|---|---|---|
+| POST | `/convert` | multipart `file`, optional `password`, `profile`, `timeout_s` | DOCX bytes + `X-Pages-Ok` / `X-Pages-Failed` / `X-Elapsed-Seconds` headers |
+| POST | `/extract-tables` | multipart `file`, optional `password` | JSON `{"tables": [...]}` |
+| GET  | `/healthz` | — | `{"status": "ok"}` |
+| GET  | `/version` | — | `{"version": "..."}` |
+## Plugin architecture
+Four extension points, all `Protocol`-based:
+```python
+from pdf2docx_plus.plugins import (
+    TableDetector, LayoutDetector, OcrEngine, FormulaRecognizer
+)
+```
+Register any implementation on `PluginRegistry` and pass it to `Converter`.
+Plugins never kill a conversion — exceptions raised inside a plugin are
+logged and skipped.
+Built-in ML hooks (opt-in extras):
+| Hook | Backend | Extra | Weights license |
+|---|---|---|---|
+| `TableTransformerDetector` | HuggingFace `microsoft/table-transformer-*` | `ml-tables` | MIT |
+| `GraniteDoclingLayoutDetector` | `ibm-granite/granite-docling-258M` | `ml-layout` | Apache-2.0 |
+| `PaddleOcrEngine` | PaddleOCR | `ml-ocr` | Apache-2.0 |
+| `Pix2TexFormulaRecognizer` | pix2tex | `ml-formula` | MIT |
+| `UniMERNetFormulaRecognizer` | UniMERNet (bring weights) | manual | Apache-2.0 |
+## Benchmark
+```bash
+pip install 'pdf2docx-plus[bench]'
+python -m bench.run --corpus bench/corpus --out bench/reports/latest.json
+```
+Metrics implemented: text F1, TEDS (`apted`), reading-order Kendall-tau,
+rendered SSIM (via LibreOffice + scikit-image), and editability ratio.
+Seed corpus in this repo: 3 financial fund PDFs (born-digital). Drop more
+under `bench/corpus/<name>/input.pdf` and, optionally, `expected_text.txt`,
+`expected_tables.json`, `expected_order.json` for scoring.
+Current baseline on the seed corpus (76 pages, CPU):
+```
+awhkef                  9 pages   0 failed    7.1 s   74 KB
+first_sentier          58 pages   0 failed   15.8 s  155 KB
+kfs_bosera              9 pages   0 failed    4.3 s   87 KB
+TOTAL                  76 pages   0 failed   27.7 s  2.75 pg/s
+```
+## Licensing
+`pdf2docx-plus` is MIT, but **depends on PyMuPDF (AGPL-3.0)** — this
+propagates to you if you redistribute or expose as a network service. See
+[LICENSING.md](LICENSING.md) for the full dependency matrix, AGPL
+implications, and the future pypdfium2 migration path.
+## What's NOT done yet (roadmap)
+This fork covers **Phase 0** (foundation) and most of **Phase 1** (stability
++ typed API) from the original 21-week
+[`PDF2DOCX_FORK_PLAN.md`](../PDF2DOCX_FORK_PLAN.md). Phases 2–5 are scaffolded
+via the plugin architecture but the ML-backed hooks need real integration
+work to reach the v1.0 success criteria in the plan (TEDS ≥ 0.90, text F1 ≥
+0.98, reading-order Kendall-tau ≥ 0.90).
+Specifically, still open:
+- Train / evaluate Table Transformer + Granite-Docling against an annotated
+  corpus (plan §K).
+- Cross-page table stitching heuristic (§B.7).
+- Header/footer → `w:hdr` / `w:ftr` emission (§C.13).
+- Math recognition pipeline wiring (§F.24).
+- Scanned-PDF OCR routing + auto-detect (§G.25).
+- `styles.xml` rewrite (§H.27) — currently we still use python-docx defaults.
+- pypdfium2 backend for permissive licensing (§6).
+## Credits
+Forked from [ArtifexSoftware/pdf2docx](https://github.com/ArtifexSoftware/pdf2docx)
+(originally by [@dothinking](https://github.com/dothinking)). MIT.
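
Reviewer note on the "NULL-byte / control chars" row of the comparison table: the README says such characters are stripped at run insertion, but the code that does it is not part of the hunks shown here. The sketch below shows one common way to do this, by removing every character outside the XML 1.0 `Char` production before text reaches a `<w:t>` element. The function name `sanitize_run_text` is hypothetical, and the package's own implementation may differ.

```python
import re

# XML 1.0 allows only these code points in character data:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below therefore matches everything that must be removed,
# including NUL, other C0 controls, lone surrogates, U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_run_text(text: str) -> str:
    """Return text with characters that are illegal in XML 1.0 removed."""
    return _XML_ILLEGAL.sub("", text)


if __name__ == "__main__":
    # NUL and vertical tab are dropped; tab and newline are kept.
    print(repr(sanitize_run_text("a\x00b\x0bc\td\n")))
```

Filtering at the point where text enters a run, rather than after serialization, keeps the rest of the pipeline free of special cases.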

pdf2docx_plus-0.6.1/docs/README.md ADDED

@@ -0,0 +1,34 @@
+# pdf2docx documentation
+Welcome to the **pdf2docx** documentation. This documentation relies on [Sphinx](https://www.sphinx-doc.org/en/master/) to publish HTML docs from markdown files written with [restructured text](https://en.wikipedia.org/wiki/ReStructuredText) (RST).
+## Sphinx version
+This README assumes you have [Sphinx v5.0.2 installed](https://www.sphinx-doc.org/en/master/usage/installation.html) on your system.
+## Updating the documentation
+Within `docs` update the associated restructured text (`.rst`) files. These files represent the corresponding document pages.
+## Building HTML documentation
+- Ensure you have the `furo` theme installed:
+`pip install furo`
+Furo theme, Copyright (c) 2020 Pradyun Gedam <mail@pradyunsg.me>, thank you to:
+https://github.com/pradyunsg/furo/blob/main/LICENSE
+- From the "docs" location run:
+`sphinx-build -b html . build/html`
+This then creates the HTML documentation within `build/html`.
+> Use: `sphinx-build -a -b html . build/html` to build all, including the assets in `_static` (important if you have updated CSS).
+For full details see: [Using Sphinx](https://www.sphinx-doc.org/en/master/usage/index.html)

pdf2docx_plus-0.6.1/pdf2docx_plus/__init__.py ADDED

@@ -0,0 +1,41 @@
+"""pdf2docx-plus: hardened PDF -> DOCX converter.
+Public API:
+    from pdf2docx_plus import Converter, convert, ConversionResult
+    result = convert("in.pdf", "out.docx", timeout_s=60)
+    print(result.pages_ok, result.pages_failed, result.elapsed_s)
+Lower-level facade:
+    with Converter("in.pdf") as cv:
+        cv.convert("out.docx", pages=[0, 1, 2])
+"""
+from __future__ import annotations
+from .api import ConversionResult, Converter, convert, extract_tables
+from .errors import (
+    ConversionError,
+    InputError,
+    MakeDocxError,
+    ParseError,
+    PasswordRequired,
+    TimeoutExceeded,
+)
+from .version import __version__
+__all__ = [
+    "ConversionError",
+    "ConversionResult",
+    "Converter",
+    "InputError",
+    "MakeDocxError",
+    "ParseError",
+    "PasswordRequired",
+    "TimeoutExceeded",
+    "__version__",
+    "convert",
+    "extract_tables",
+]
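
Reviewer note on the exported exception names: `errors.py` is not among the hunks shown here, so the class hierarchy is unknown. The sketch below uses local stand-in classes with the same names and assumes that every specific error derives from `ConversionError`. It illustrates how a caller could branch on the error type. The `classify` helper and its return strings are hypothetical.

```python
# Stand-ins mirroring the names exported by pdf2docx_plus/__init__.py.
# ASSUMPTION: the specific errors share ConversionError as a base class.
class ConversionError(Exception):
    """Stand-in for the assumed common base class."""


class InputError(ConversionError): ...


class ParseError(ConversionError): ...


class MakeDocxError(ConversionError): ...


class PasswordRequired(ConversionError): ...


class TimeoutExceeded(ConversionError): ...


def classify(exc: Exception) -> str:
    """Map an exception to a coarse handling decision (hypothetical policy)."""
    if isinstance(exc, PasswordRequired):
        return "ask-for-password"
    if isinstance(exc, TimeoutExceeded):
        return "retry-with-longer-timeout"
    if isinstance(exc, InputError):
        return "reject-input"
    if isinstance(exc, ConversionError):
        # ParseError, MakeDocxError and any other conversion failure.
        return "report-conversion-failure"
    return "unexpected"


if __name__ == "__main__":
    for error in (PasswordRequired(), TimeoutExceeded(), ParseError(), ValueError()):
        print(type(error).__name__, "->", classify(error))
```

The more specific checks come first so that the generic `ConversionError` branch only handles what the earlier branches did not claim.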

pdf2docx_plus-0.6.1/pdf2docx_plus/_vendored/__init__.py ADDED

@@ -0,0 +1,6 @@
+"""Vendored third-party packages.
+These packages are shipped inside pdf2docx_plus to isolate them from
+whatever else the user has installed. Do not import from here directly
+from application code; use the public pdf2docx_plus API instead.
+"""

pdf2docx_plus-0.6.1/pdf2docx_plus/_vendored/pdf2docx/__init__.py ADDED

@@ -0,0 +1,3 @@
+from .converter import Converter
+from .page.Page import Page
+from .main import parse