
eticket-document-sdk 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (66)

eticket_document_sdk-1.0.0/.gitignore ADDED

@@ -0,0 +1,30 @@
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+*.egg
+.eggs/
+build/
+dist/
+*.so
+# Virtual environments
+.venv/
+venv/
+env/
+# Testing / coverage
+.pytest_cache/
+.coverage
+.coverage.*
+htmlcov/
+coverage.xml
+# Tooling caches
+.ruff_cache/
+.mypy_cache/
+# OS / editor
+.DS_Store
+.idea/
+.vscode/

eticket_document_sdk-1.0.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 eticket-document-sdk contributors
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

eticket_document_sdk-1.0.0/PKG-INFO ADDED

@@ -0,0 +1,399 @@
+Metadata-Version: 2.4
+Name: eticket-document-sdk
+Version: 1.0.0
+Summary: Parse airline e-ticket PDFs into strongly-typed JSON with high accuracy.
+Project-URL: Homepage, https://github.com/your-org/eticket-document-sdk
+Project-URL: Repository, https://github.com/your-org/eticket-document-sdk
+Project-URL: Issues, https://github.com/your-org/eticket-document-sdk/issues
+Author: eticket-document-sdk contributors
+License: MIT License
+        Copyright (c) 2026 eticket-document-sdk contributors
+        Permission is hereby granted, free of charge, to any person obtaining a copy
+        of this software and associated documentation files (the "Software"), to deal
+        in the Software without restriction, including without limitation the rights
+        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+        copies of the Software, and to permit persons to whom the Software is
+        furnished to do so, subject to the following conditions:
+        The above copyright notice and this permission notice shall be included in all
+        copies or substantial portions of the Software.
+        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+        SOFTWARE.
+License-File: LICENSE
+Keywords: airline,e-ticket,itinerary,ocr,parser,pdf,pnr
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Topic :: Text Processing :: General
+Requires-Python: >=3.10
+Requires-Dist: pdfplumber>=0.10
+Requires-Dist: pydantic<3.0,>=2.5
+Requires-Dist: pymupdf>=1.23
+Requires-Dist: python-dateutil>=2.8
+Provides-Extra: dev
+Requires-Dist: build>=1.0; extra == 'dev'
+Requires-Dist: pytest-cov>=4.1; extra == 'dev'
+Requires-Dist: pytest>=7.4; extra == 'dev'
+Requires-Dist: ruff>=0.5; extra == 'dev'
+Provides-Extra: ocr
+Requires-Dist: numpy>=1.24; extra == 'ocr'
+Requires-Dist: opencv-python>=4.8; extra == 'ocr'
+Requires-Dist: paddleocr>=2.7; extra == 'ocr'
+Description-Content-Type: text/markdown
+# eticket-document-sdk
+Parse airline e-ticket PDFs and convert them into **strongly-typed JSON** with
+very high accuracy.
+The SDK primarily targets PDFs that already contain a **text layer** (the common
+case for e-ticket receipts, booking confirmations and itineraries). **OCR is
+used only as a fallback** when text extraction is insufficient.
+```python
+from eticket_document_sdk import ETicketParser
+parser = ETicketParser()
+booking = parser.parse("ticket.pdf")
+print(booking.model_dump())
+```
+```json
+{
+  "booking_code": "F65A7Y",
+  "ticket_number": "7382421531079",
+  "passenger": { "first_name": "YUMIKO", "last_name": "KOHNO" },
+  "currency": "JPY",
+  "total_price": 467510,
+  "segments": [{ "flight_number": "VN311", "origin": "NRT", "destination": "HAN" }]
+}
+```
+---
+## Table of Contents
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [Advanced Usage](#advanced-usage)
+- [Data Model](#data-model)
+- [Parser Architecture](#parser-architecture)
+- [Custom Parser Plugin](#custom-parser-plugin)
+- [Error Handling](#error-handling)
+- [Development Guide](#development-guide)
+- [Release Guide](#release-guide)
+- [Roadmap](#roadmap)
+---
+## Installation
+```bash
+pip install eticket-document-sdk
+```
+The base install includes PDF text extraction (PyMuPDF + pdfplumber). The OCR
+fallback backend is a heavy optional extra:
+```bash
+# Only needed for scanned / image-only PDFs
+pip install "eticket-document-sdk[ocr]"
+```
+| Extra | Pulls in | When you need it |
+|-------|----------|------------------|
+| *(base)* | `pydantic`, `pymupdf`, `pdfplumber`, `python-dateutil` | Text-layer PDFs (the common case) |
+| `ocr` | `paddleocr`, `opencv-python`, `numpy` | Scanned/image PDFs requiring OCR |
+| `dev` | `pytest`, `pytest-cov`, `ruff`, `build` | Contributing / running tests |
+Requires Python 3.10+.
+---
+## Quick Start
+```python
+from eticket_document_sdk import ETicketParser
+parser = ETicketParser()
+# 1) From a file path (PDF or image)
+booking = parser.parse("ticket.pdf")
+# 2) From raw bytes
+booking = parser.parse_bytes(pdf_bytes)
+# 3) From already-extracted text (no PDF/OCR needed)
+booking = parser.parse_text(raw_text)
+print(booking.model_dump())          # python types
+print(booking.model_dump(mode="json"))  # JSON-serializable (datetimes -> str)
+# Convenience top-level accessors:
+booking.booking_code     # "F65A7Y"
+booking.ticket_number    # "7382421531079"
+booking.currency         # "JPY"
+booking.total_price      # 467510
+booking.passenger.full_name
+booking.segments[0].origin
+```
+---
+## Advanced Usage
+### Configuration
+```python
+parser = ETicketParser(
+    enable_ocr=True,               # allow OCR fallback (default True)
+    ocr_langs=["vi", "en", "ja"],  # OCR languages (Vietnamese/English/Japanese)
+    debug=False,                   # enable DEBUG logging for the SDK
+    strict_validation=False,       # if True, raise ValidationError on invalid output
+    text_quality_threshold=0.35,   # below this, OCR fallback kicks in
+)
+```
+### Detailed result envelope
+`parse()` returns a `Booking`. For metadata about *how* the document was parsed,
+use the `*_detailed` variants, which return a `ParseResult`:
+```python
+result = parser.parse_detailed("ticket.pdf")
+result.success            # True
+result.parser             # "vietnam_airlines"
+result.extraction_method  # ExtractionMethod.TEXT_LAYER | OCR | RAW_TEXT
+result.confidence         # 0.0 .. 1.0 completeness heuristic
+result.warnings           # list of non-fatal validation issues
+result.booking            # the Booking object
+```
+Also available: `parse_bytes_detailed(...)` and `parse_text_detailed(...)`.
+### Thread-safety & performance
+A single `ETicketParser` instance is **reusable and thread-safe**. The
+PaddleOCR model, compiled regexes and the parser registry are created once and
+shared, so models are never reloaded between calls. Create one instance and
+reuse it (e.g. for batch processing — see `examples/batch_parse.py`).
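Reviewer sketch (not part of the package, and not the contents of `examples/batch_parse.py`): the thread-safety paragraph above suggests sharing one parser instance across worker threads. The helper below illustrates that pattern under stated assumptions. `parse_many`, `SupportsParse` and `FakeParser` are hypothetical names introduced here so the snippet runs without the SDK installed; with the real SDK you would presumably pass an `ETicketParser()` and catch `ETicketSDKError` instead of `Exception`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Iterable, Protocol


class SupportsParse(Protocol):
    """Anything exposing parse(path), e.g. an ETicketParser instance."""

    def parse(self, path: str) -> Any: ...


def parse_many(
    parser: SupportsParse, paths: Iterable[str], max_workers: int = 4
) -> dict[str, Any]:
    """Parse every path with ONE shared parser; failures are kept as exceptions."""
    path_list = list(paths)

    def _safe_parse(path: str) -> Any:
        try:
            return parser.parse(path)
        except Exception as exc:  # assumption: narrow this to ETicketSDKError in real use
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(path_list, pool.map(_safe_parse, path_list)))


class FakeParser:
    """Hypothetical stand-in for ETicketParser, used only to make the sketch runnable."""

    def parse(self, path: str) -> str:
        if not path.endswith(".pdf"):
            raise ValueError(f"unsupported input: {path}")
        return f"booking-from-{path}"


results = parse_many(FakeParser(), ["a.pdf", "b.pdf", "notes.txt"])
for path, outcome in results.items():
    print(path, "->", outcome)
```

Returning the exception object instead of raising keeps one bad file from aborting the whole batch; whether that is the right policy depends on the caller.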
+---
+## Data Model
+All models are **Pydantic v2** (`eticket_document_sdk.models`, re-exported from
+`eticket_document_sdk.schemas.pydantic_models`).
+```
+Booking
+├── status: TICKETED | CONFIRMED | PENDING | CANCELLED | UNKNOWN
+├── passenger: Passenger
+│   ├── title, first_name, last_name, full_name
+│   ├── membership_number          # frequent-flyer number
+│   └── passenger_type             # ADT | CHD | INF
+├── segments: list[FlightSegment]
+│   ├── segment_id, flight_number, airline
+│   ├── origin, destination        # IATA codes
+│   ├── departure_datetime, arrival_datetime
+│   ├── departure_terminal, arrival_terminal
+│   ├── cabin_class, fare_basis, seat
+│   └── free_baggage, status
+├── ticket: Ticket
+│   ├── booking_code, ticket_number, currency
+│   ├── base_fare, taxes, total_price
+│   ├── tax_breakdown: list[TaxItem]   # {code, amount}
+│   ├── fare_basis, fare_calculation, endorsement
+│   └── issue_date, payment
+└── booking_code / ticket_number / currency / total_price   # convenience mirrors
+```
+---
+## Parser Architecture
+The SDK follows a clean, layered pipeline:
+```
+            ┌───────────────────────────────────────────────────────────────┐
+ source ──▶ │ 1. detect type (PDF / IMAGE / TEXT)           core.classifier │
+            │ 2. extract text layer (PyMuPDF → pdfplumber)  pdf.*           │
+            │ 3. measure quality                            pdf.text_extr.  │
+            │ 4. OCR fallback if needed (PaddleOCR)         ocr.*           │
+            │ 5. classify airline → pick parser             core.classifier │
+            │ 6. parse fields                               parsers.*       │
+            │ 7. validate (business rules)                  core.validator  │
+            │ 8. return Pydantic Booking                    models.*        │
+            └───────────────────────────────────────────────────────────────┘
+```
+Key design points:
+- **Text-first.** OCR is only attempted when the extracted text layer falls
+  below the quality threshold. PyMuPDF is the primary extractor; pdfplumber is
+  the fallback engine.
+- **Lazy heavy deps.** PyMuPDF, pdfplumber, PaddleOCR, OpenCV and numpy are all
+  imported lazily, so importing the SDK is cheap and the OCR stack is optional.
+- **Centralized regex** (`utils/regex.py`) compiled once at import time.
+- **Pluggable parsers** via a thread-safe registry (see below).
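Reviewer sketch (not part of the package): the "text-first" rule in the list above can be pictured as a quality score compared against `text_quality_threshold`. The README does not say how quality is measured, so the heuristic below (share of non-space characters that belong to word-like tokens) is purely an assumption, as are the names `text_quality` and `choose_extraction`. `RuntimeError` stands in for the SDK's `OCRFailedError`.

```python
import re

TEXT_QUALITY_THRESHOLD = 0.35  # default value quoted in the Configuration section

# Word-like token: two or more consecutive word characters (Unicode-aware).
_WORD = re.compile(r"\w{2,}")


def text_quality(text: str) -> float:
    """Hypothetical heuristic: fraction of non-space characters inside word-like tokens."""
    non_space = sum(1 for ch in text if not ch.isspace())
    if non_space == 0:
        return 0.0
    in_words = sum(len(match.group()) for match in _WORD.finditer(text))
    return in_words / non_space


def choose_extraction(
    text_layer: str,
    enable_ocr: bool = True,
    threshold: float = TEXT_QUALITY_THRESHOLD,
) -> str:
    """Keep the text layer when it is good enough; otherwise fall back to OCR."""
    if text_quality(text_layer) >= threshold:
        return "text_layer"
    if enable_ocr:
        return "ocr"
    # Stand-in for OCRFailedError ("OCR needed but disabled/unavailable").
    raise RuntimeError("text layer is too poor and OCR is disabled")


print(choose_extraction("PASSENGER KOHNO YUMIKO FLIGHT VN311 NRT HAN"))
print(choose_extraction(""))
```

An empty or symbol-only text layer scores 0.0 here, which is the situation a scanned, image-only PDF would produce.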
+---
+## Custom Parser Plugin
+Add a new airline **without modifying the SDK core**. Implement `BaseParser`
+and register it:
+```python
+from eticket_document_sdk import (
+    BaseParser, Booking, Ticket, FlightSegment, Passenger, register_parser,
+)
+from eticket_document_sdk.parsers.generic.parser import GenericExtractors
+class JapanAirlinesParser(BaseParser):
+    code = "JL"
+    name = "japan_airlines"
+    def can_parse(self, text: str) -> float:
+        # Return 0..1 confidence that this layout is yours.
+        return 0.9 if "JAPAN AIRLINES" in text.upper() else 0.0
+    def parse(self, text: str) -> Booking:
+        return Booking(
+            ticket=Ticket(
+                booking_code=GenericExtractors.find_pnr(text),
+                ticket_number=GenericExtractors.find_ticket_number(text),
+                currency=GenericExtractors.find_currency(text),
+            ),
+            passenger=Passenger(),
+            segments=[],
+        )
+register_parser("JL", JapanAirlinesParser())
+# From now on ETicketParser auto-selects it for matching tickets.
+```
+The classifier calls `can_parse()` on every registered parser and selects the
+highest scorer; airline-specific parsers always win ties against the generic
+fallback. See `examples/custom_parser.py` for a runnable version.
+This is the same mechanism by which future airlines (ANA, Air France, Emirates,
+Singapore Airlines, …) are added.
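Reviewer sketch (not part of the package): the selection rule described above (highest `can_parse()` score wins, airline-specific parsers beat the generic fallback on ties) can be expressed as a single `max` over a composite key. `Candidate`, `is_generic` and `select_parser` are hypothetical names; the SDK's actual classifier may be organised differently, and per the Error Handling table it raises `ParserError` where this sketch returns `None`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical registry entry: a name, a scoring function, and a generic flag."""

    name: str
    can_parse: Callable[[str], float]
    is_generic: bool = False


def select_parser(text: str, candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the best-scoring candidate, or None when nothing scores above zero."""
    scored = [(cand.can_parse(text), cand) for cand in candidates]
    scored = [(score, cand) for score, cand in scored if score > 0.0]
    if not scored:
        return None
    # Sort key: score first; on equal scores a specific parser (True) beats generic (False).
    best = max(scored, key=lambda pair: (pair[0], not pair[1].is_generic))
    return best[1]


generic = Candidate("generic", lambda t: 0.9 if t.strip() else 0.0, is_generic=True)
jal = Candidate("japan_airlines", lambda t: 0.9 if "JAPAN AIRLINES" in t.upper() else 0.0)

chosen = select_parser("Japan Airlines e-ticket receipt", [generic, jal])
print(chosen.name if chosen else None)
```

Both candidates score 0.9 on the sample text, so the tie-break on `is_generic` is what picks the airline-specific entry.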
+---
+## Error Handling
+All exceptions derive from `ETicketSDKError`:
+| Exception | Raised when |
+|-----------|-------------|
+| `DocumentReadError` | File can't be opened/read, or is empty/corrupt |
+| `UnsupportedDocumentError` | Input type/layout not supported |
+| `OCRFailedError` | OCR needed but disabled/unavailable, or produced nothing |
+| `ParserError` | No parser matched, or a parser failed |
+| `ValidationError` | Output failed business-rule validation (strict mode) |
+```python
+from eticket_document_sdk import ETicketParser, ETicketSDKError
+try:
+    booking = ETicketParser().parse("ticket.pdf")
+except ETicketSDKError as exc:
+    print(f"Failed: {exc}")
+```
+Validation (`core/validator.py`) checks ticket-number format (13 digits), PNR
+format (6 alphanumerics), currency code, flight-number format and segment
+presence. In non-strict mode, issues are returned as `result.warnings`; in
+`strict_validation=True` mode a `ValidationError` is raised.
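Reviewer sketch (not part of the package, and not the contents of `core/validator.py`): the checks listed above translate naturally into a handful of regular expressions. Only the 13-digit ticket number and the 6-alphanumeric PNR come from the README; the currency and flight-number patterns are assumptions, `validate_booking` is a hypothetical name, and `ValueError` stands in for the SDK's `ValidationError`.

```python
import re

TICKET_NUMBER = re.compile(r"\d{13}")     # from the README: 13 digits
PNR = re.compile(r"[A-Z0-9]{6}")          # from the README: 6 alphanumerics
CURRENCY = re.compile(r"[A-Z]{3}")        # assumption: three-letter alphabetic code
FLIGHT_NUMBER = re.compile(r"[A-Z0-9]{2}\d{1,4}")  # assumption: 2-char designator + 1-4 digits


def validate_booking(
    booking_code: str,
    ticket_number: str,
    currency: str,
    flight_numbers: list[str],
    strict: bool = False,
) -> list[str]:
    """Collect rule violations; raise instead of returning them when strict=True."""
    warnings: list[str] = []
    if not PNR.fullmatch(booking_code or ""):
        warnings.append(f"invalid PNR: {booking_code!r}")
    if not TICKET_NUMBER.fullmatch(ticket_number or ""):
        warnings.append(f"invalid ticket number: {ticket_number!r}")
    if not CURRENCY.fullmatch(currency or ""):
        warnings.append(f"invalid currency code: {currency!r}")
    if not flight_numbers:
        warnings.append("no flight segments found")
    for number in flight_numbers:
        if not FLIGHT_NUMBER.fullmatch(number or ""):
            warnings.append(f"invalid flight number: {number!r}")
    if strict and warnings:
        raise ValueError("; ".join(warnings))  # stand-in for ValidationError
    return warnings


# Values taken from the sample JSON at the top of the README.
print(validate_booking("F65A7Y", "7382421531079", "JPY", ["VN311"]))
```

The same function serves both modes: non-strict callers inspect the returned list, which mirrors how the README describes `result.warnings`.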
+---
+## Development Guide
+```bash
+git clone https://github.com/your-org/eticket-document-sdk
+cd eticket-document-sdk
+python -m venv .venv && source .venv/bin/activate
+pip install -e ".[dev]"
+# Run the test suite with coverage (target: 90%+)
+pytest --cov=eticket_document_sdk --cov-report=term-missing
+# Lint / format
+ruff check .
+ruff format .
+```
+Tests use a bundled real Vietnam Airlines fixture (`tests/fixtures/`). OCR-path
+tests use a fake engine, so **PaddleOCR is not required to run the suite**.
+Project layout mirrors the pipeline: `core/` (orchestration), `pdf/`
+(extraction), `ocr/` (fallback), `parsers/` (plugins), `models/` + `schemas/`
+(Pydantic), `utils/`, `exceptions/`.
+---
+## Release Guide
+1. Update the version in `pyproject.toml` and `eticket_document_sdk/__init__.py`
+   (`__version__`).
+2. Update the changelog / release notes.
+3. Run the full test suite and linters; ensure coverage ≥ 90%.
+4. Build the distributions:
+   ```bash
+   python -m build
+   ```
+   This produces `dist/eticket_document_sdk-<version>-py3-none-any.whl` and the
+   sdist `.tar.gz`.
+5. Smoke-test the wheel in a clean virtualenv:
+   ```bash
+   pip install dist/eticket_document_sdk-*.whl
+   ```
+6. Publish (requires `twine`, which the `dev` extra does not include):
+   ```bash
+   python -m twine upload dist/*
+   ```
+7. Tag the release: `git tag v<version> && git push --tags`.
+---
+## Roadmap
+The extractor/parser boundary is deliberately abstract so additional backends
+can be added **without breaking the public API**:
+- **Cloud OCR / Document AI**: Azure Document Intelligence, AWS Textract,
+  Google Document AI — pluggable as alternative `OCREngine` implementations or
+  text providers.
+- **LLM structured outputs**: OpenAI and Claude structured outputs as
+  alternative parser plugins behind the same `BaseParser` contract.
+Because selection happens via the registry + classifier, these can be layered in
+as new strategies while `ETicketParser.parse(...)` stays unchanged.
+---
+## License
+MIT — see [LICENSE](LICENSE).