PyPI - longparser - Versions diffs - 0.1.3__tar.gz → 0.1.5__tar.gz - Mend

longparser 0.1.3__tar.gz → 0.1.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (54)

longparser-0.1.5/LICENSE-THIRD-PARTY.md ADDED

@@ -0,0 +1,50 @@
+# Third-Party Licenses
+LongParser core is licensed under the **MIT License**.
+Some **optional** backends and integrations use different licenses.
+These packages are **never loaded by default** — they are only imported
+when you explicitly install them and select them in your configuration.
+## Optional Backend Licenses
+| Package | License | Install Command | When Loaded |
+|---------|---------|-----------------|-------------|
+| `pymupdf4llm` | AGPL-3.0 or Artifex Commercial | `pip install "longparser[pymupdf]"` | Only when you set `backend="pymupdf"` |
+| `marker-pdf` | GPL-3.0-or-later | `pip install "longparser[marker]"` | Only when you set `backend="marker"` *(future)* |
+| `surya-ocr` | GPL-3.0-or-later | `pip install "longparser[surya]"` | Only when explicitly imported *(future)* |
+## Core Dependency Licenses (always installed)
+| Package | License | Purpose |
+|---------|---------|---------|
+| `pydantic` | MIT | Schema validation |
+| `docling` | MIT | Default PDF extraction engine |
+| `docling-core` | MIT | Docling data models |
+| `fast-langdetect` | Apache-2.0 | Document language detection |
+## What This Means for You
+- **If you only use `pip install longparser`** — everything is MIT or Apache-2.0.
+  You can use LongParser in any project (commercial, proprietary, open source).
+- **If you install `longparser[pymupdf]`** — the `pymupdf4llm` library is
+  AGPL-3.0 licensed. You must comply with AGPL terms for the PyMuPDF component,
+  OR purchase a commercial license from [Artifex](https://artifex.com).
+  LongParser core code remains MIT.
+- **If you install `longparser[marker]`** *(future)* — the `marker-pdf` library
+  is GPL-3.0 licensed. You must comply with GPL terms for the Marker component.
+  LongParser core code remains MIT.
+## License Isolation Guarantee
+LongParser uses **lazy imports** to ensure GPL/AGPL packages are never loaded
+unless explicitly requested. The following guarantees hold:
+1. `import longparser` does NOT import any GPL/AGPL package
+2. `from longparser import DocumentPipeline` does NOT import any GPL/AGPL package
+3. `DocumentPipeline().process_file("doc.pdf")` does NOT import any GPL/AGPL
+   package (uses Docling, which is MIT)
+4. GPL/AGPL code is only loaded when you explicitly set `backend="pymupdf"` or
+   `backend="marker"` in `ProcessingConfig`
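
The lazy-import guarantee listed above can be illustrated with a module-level `__getattr__` (PEP 562). This is a minimal sketch, not LongParser's actual code: `make_lazy_module` and `demo_pkg` are hypothetical names, and the stdlib `colorsys` module stands in for an optional GPL/AGPL backend.

```python
import importlib
import sys
import types


def make_lazy_module(name, lazy_attrs):
    """Return a module that imports the listed attributes on first access.

    lazy_attrs maps an attribute name to (module_path, attribute_name).
    Hypothetical helper, for illustration only.
    """
    module = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr in lazy_attrs:
            module_path, attr_name = lazy_attrs[attr]
            value = getattr(importlib.import_module(module_path), attr_name)
            setattr(module, attr, value)  # cache so later lookups skip this hook
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    module.__getattr__ = __getattr__
    return module


# `colorsys` stands in for an optional backend that must not load eagerly.
sys.modules.pop("colorsys", None)
demo_pkg = make_lazy_module("demo_pkg", {"rgb_to_hsv": ("colorsys", "rgb_to_hsv")})

loaded_before = "colorsys" in sys.modules  # nothing has touched the backend yet
converter = demo_pkg.rgb_to_hsv            # first access triggers the import
loaded_after = "colorsys" in sys.modules
```

A check of this shape could be adapted into a regression test for the isolation guarantee, for example by asserting that an optional backend's module name is absent from `sys.modules` right after `import longparser`.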

{longparser-0.1.3 → longparser-0.1.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: longparser
-Version: 0.1.3
+Version: 0.1.5
 Summary: Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.
 Author-email: ENDEVSOLS Team <technology@endevsols.com>
 License-Expression: MIT
@@ -24,16 +24,24 @@ Classifier: Topic :: Text Processing :: General
 Classifier: Typing :: Typed
 Requires-Python: >=3.10
 Description-Content-Type: text/markdown
+License-File: LICENSE-THIRD-PARTY.md
 Requires-Dist: pydantic<3,>=2.0
 Requires-Dist: docling>=2.14
 Requires-Dist: docling-core>=2.13
 Requires-Dist: langgraph-checkpoint-mongodb>=0.3.1
+Requires-Dist: fast-langdetect<1.0,>=0.3
 Provides-Extra: pptx
 Requires-Dist: python-pptx>=1.0; extra == "pptx"
 Provides-Extra: langchain
 Requires-Dist: langchain-core>=0.2; extra == "langchain"
 Provides-Extra: llamaindex
 Requires-Dist: llama-index-core>=0.10; extra == "llamaindex"
+Provides-Extra: pymupdf
+Requires-Dist: pymupdf4llm>=1.27; extra == "pymupdf"
+Provides-Extra: ner
+Requires-Dist: spacy>=3.7.0; extra == "ner"
+Provides-Extra: marker
+Requires-Dist: marker-pdf>=0.3.0; extra == "marker"
 Provides-Extra: server
 Requires-Dist: fastapi>=0.115; extra == "server"
 Requires-Dist: uvicorn[standard]>=0.34; extra == "server"
@@ -108,6 +116,7 @@ Requires-Dist: build>=1.0; extra == "dev"
 Requires-Dist: twine>=5.0; extra == "dev"
 Requires-Dist: httpx>=0.27; extra == "dev"
 Requires-Dist: anyio>=4.0; extra == "dev"
+Dynamic: license-file
 <p align="center">
   <img src="https://raw.githubusercontent.com/ENDEVSOLS/LongParser/main/docs/assets/logo.png" alt="LongParser" width="320">
@@ -147,8 +156,13 @@ Requires-Dist: anyio>=4.0; extra == "dev"
 | Feature | Detail |
 |---------|--------|
-| **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling |
+| **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling & Marker |
 | **Hybrid chunking** | Token-aware, heading-hierarchy-aware, table-aware |
+| **Semantic chunking** | Embedding-based boundaries using `all-MiniLM-L6-v2` |
+| **Cross-referencing** | Deterministic linking of explicit and implicit charts/figures |
+| **Quality scoring** | Zero-ML heuristic scoring with dictionary & fastText validation |
+| **PII redaction** | Hybrid Regex + NER (spaCy) redaction with secure HITL preservation |
+| **Summary chunks** | Async ARQ worker generating hierarchical LLM section summaries |
 | **HITL review** | Human-in-the-Loop block & chunk editing before embedding |
 | **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` and MongoDB checkpointer |
 | **3-layer memory** | Short-term turns + rolling summary + long-term facts |

{longparser-0.1.3 → longparser-0.1.5}/README.md RENAMED

@@ -36,8 +36,13 @@
 | Feature | Detail |
 |---------|--------|
-| **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling |
+| **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling & Marker |
 | **Hybrid chunking** | Token-aware, heading-hierarchy-aware, table-aware |
+| **Semantic chunking** | Embedding-based boundaries using `all-MiniLM-L6-v2` |
+| **Cross-referencing** | Deterministic linking of explicit and implicit charts/figures |
+| **Quality scoring** | Zero-ML heuristic scoring with dictionary & fastText validation |
+| **PII redaction** | Hybrid Regex + NER (spaCy) redaction with secure HITL preservation |
+| **Summary chunks** | Async ARQ worker generating hierarchical LLM section summaries |
 | **HITL review** | Human-in-the-Loop block & chunk editing before embedding |
 | **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` and MongoDB checkpointer |
 | **3-layer memory** | Short-term turns + rolling summary + long-term facts |

{longparser-0.1.3 → longparser-0.1.5}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "longparser"
-version = "0.1.3"
+version = "0.1.5"
 description = "Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines."
 readme = {file = "README.md", content-type = "text/markdown"}
 requires-python = ">=3.10"
@@ -36,6 +36,7 @@ dependencies = [
     "docling>=2.14",
     "docling-core>=2.13",
     "langgraph-checkpoint-mongodb>=0.3.1",
+    "fast-langdetect>=0.3,<1.0",  # Apache-2.0 — document language detection
 ]
 [project.optional-dependencies]
@@ -51,6 +52,24 @@ langchain = [
 llamaindex = [
     "llama-index-core>=0.10",
 ]
+# ----------- v0.1.4: Optional extraction backends -----------
+# ⚠️ pymupdf4llm is AGPL-3.0 licensed. See LICENSE-THIRD-PARTY.md.
+# Only loaded when user sets backend="pymupdf".
+pymupdf = [
+    "pymupdf4llm>=1.27",
+]
+# NER redaction (spaCy) for advanced PII detection
+ner = [
+    "spacy>=3.7.0",
+]
+# ⚠️ marker-pdf is GPL-3.0. GPU recommended. Future release.
+marker = [
+    "marker-pdf>=0.3.0",
+]
+# ⚠️ surya-ocr is GPL-3.0. GPU recommended. Future release.
+# surya = [
+#     "surya-ocr>=0.17",
+# ]
 # FastAPI REST server + MongoDB + job queue + LangChain chat engine
 server = [
     "fastapi>=0.115",

{longparser-0.1.3 → longparser-0.1.5}/src/longparser/__init__.py RENAMED

@@ -25,7 +25,7 @@ point and :mod:`longparser.server` for the REST API layer.
 from __future__ import annotations
-__version__ = "0.1.3"
+__version__ = "0.1.5"
 __author__ = "ENDEVSOLS Team"
 __license__ = "MIT"
@@ -59,6 +59,10 @@ def __getattr__(name: str):
     if name == "DoclingExtractor":
         from .extractors import DoclingExtractor
         return DoclingExtractor
+    if name == "PyMuPDFExtractor":
+        # AGPL-isolated — only loaded when explicitly requested
+        from .extractors.pymupdf_extractor import PyMuPDFExtractor
+        return PyMuPDFExtractor
     if name == "PipelineOrchestrator":
         from .pipeline import PipelineOrchestrator
         return PipelineOrchestrator
@@ -101,6 +105,7 @@ __all__ = [
     "JobResult",
     # Lazily imported (require extras)
     "DoclingExtractor",
+    "PyMuPDFExtractor",
     "PipelineOrchestrator",
     "DocumentPipeline",
     "PipelineResult",

{longparser-0.1.3 → longparser-0.1.5}/src/longparser/chunkers/hybrid_chunker.py RENAMED

@@ -620,6 +620,10 @@ class HybridChunker:
         # --- Apply overlap ---
         all_chunks = self._apply_overlap(all_chunks)
+        # --- Quality score ---
+        from .quality_scorer import score_chunks
+        all_chunks = score_chunks(all_chunks, blocks)
         logger.info(f"[HybridChunker] Done — {len(all_chunks)} chunks produced")
         return all_chunks
@@ -749,11 +753,25 @@ class HybridChunker:
         Equations are kept with their surrounding context.
         """
         chunks: list[Chunk] = []
+        # Pre-compute semantic boundaries if enabled
+        semantic_boundaries = set()
+        if self.config.use_semantic_chunking:
+            from .semantic_boundary import find_semantic_boundaries
+            semantic_boundaries = set(find_semantic_boundaries(
+                [b.text.strip() for b in blocks if b.text.strip()],
+                threshold=self.config.semantic_threshold,
+                model_name=self.config.semantic_model,
+            ))
         current_texts: list[str] = []
         current_ids: list[str] = []
         current_pages: set[int] = set()
         current_tokens = 0
         has_equation = False
+        # We need an index over valid blocks to match semantic_boundaries
+        block_idx = 0
         for block in blocks:
             text = block.text.strip()
@@ -761,10 +779,12 @@ class HybridChunker:
                 continue
             block_tokens = _count_tokens(text)
-            # If adding this block would exceed the limit, flush
-            if (current_tokens + block_tokens > self.config.max_tokens
-                    and current_texts):
+            # Flush condition: Token limit reached OR semantic boundary hit
+            hit_limit = current_tokens + block_tokens > self.config.max_tokens
+            hit_semantic = block_idx in semantic_boundaries
+            if (hit_limit or hit_semantic) and current_texts:
                 carry_text = None
                 carry_id = None
@@ -811,6 +831,8 @@ class HybridChunker:
             if block.type == BlockType.EQUATION:
                 has_equation = True
+            block_idx += 1
         # Flush remaining
         if current_texts:
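
The new flush condition in this hunk (token limit reached OR semantic boundary hit) can be sketched in isolation. The version below is a simplified illustration under stated assumptions, not the `HybridChunker` implementation: `chunk_blocks` is a hypothetical function, tokens are approximated by whitespace-separated words, and the boundary indices are supplied by hand rather than computed from embeddings.

```python
def chunk_blocks(blocks, max_tokens, boundaries=frozenset()):
    """Group block texts into chunks.

    A chunk is flushed when adding the next block would exceed max_tokens,
    or when the next block's index is a semantic boundary.
    """
    chunks = []
    current = []
    current_tokens = 0

    for idx, text in enumerate(blocks):
        block_tokens = len(text.split())  # crude stand-in for a real tokenizer

        hit_limit = current_tokens + block_tokens > max_tokens
        hit_semantic = idx in boundaries

        # Never flush an empty chunk, even if a boundary lands on block 0.
        if (hit_limit or hit_semantic) and current:
            chunks.append(current)
            current = []
            current_tokens = 0

        current.append(text)
        current_tokens += block_tokens

    # Flush whatever is left after the last block.
    if current:
        chunks.append(current)
    return chunks


blocks = ["a b", "c d", "e f", "g"]
by_limit_only = chunk_blocks(blocks, max_tokens=4)
with_boundary = chunk_blocks(blocks, max_tokens=4, boundaries={3})
```

With a boundary at index 3, the last block starts its own chunk even though it would still fit under the token limit.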

longparser-0.1.5/src/longparser/chunkers/quality_scorer.py ADDED

@@ -0,0 +1,110 @@
+"""Chunk quality scorer based on token-weighted confidence and noise penalties."""
+from __future__ import annotations
+import logging
+import re
+from typing import Dict, Set
+from ..schemas import Block, Chunk
+logger = logging.getLogger(__name__)
+# --- Lazy-loaded resources ---
+_english_words: Set[str] | None = None
+def _get_english_words() -> Set[str]:
+    """Load standard OS dictionary for word coverage checks."""
+    global _english_words
+    if _english_words is None:
+        _english_words = set()
+        # Try common unix dictionary path
+        try:
+            with open("/usr/share/dict/words", "r", encoding="utf-8") as f:
+                _english_words = {line.strip().lower() for line in f}
+            logger.info(f"Loaded {len(_english_words)} words for quality scoring")
+        except Exception:
+            logger.debug("System dictionary not found. Word coverage metric will be skipped.")
+    return _english_words
+def _get_lang_confidence(text: str) -> float:
+    """Get fastText language detection confidence (0.0 to 1.0)."""
+    text = text.strip().replace("\n", " ")
+    if len(text) < 10:
+        return 1.0  # Too short to reliably detect, assume okay
+    try:
+        from fast_langdetect import detect
+        res = detect(text)
+        return res.get("score", 1.0)
+    except Exception:
+        return 1.0
+def score_chunks(chunks: list[Chunk], blocks: list[Block]) -> list[Chunk]:
+    """Score chunks based on block confidence and text noise.
+    Assigns a quality_score (0.0 to 1.0) to each chunk.
+    """
+    if not chunks or not blocks:
+        return chunks
+    # Build block lookup for fast access
+    block_lookup: Dict[str, Block] = {b.block_id: b for b in blocks}
+    for chunk in chunks:
+        chunk_blocks = [
+            block_lookup[bid] for bid in chunk.block_ids if bid in block_lookup
+        ]
+        if not chunk_blocks:
+            chunk.quality_score = 0.5  # Fallback
+            continue
+        # 1. Base score: token-weighted average of block confidence
+        weighted_sum = sum(
+            (b.confidence.overall if b.confidence else 1.0) * len(b.text)
+            for b in chunk_blocks
+        )
+        total_weight = sum(len(b.text) for b in chunk_blocks)
+        base_score = weighted_sum / total_weight if total_weight > 0 else 0.5
+        # 2. Noise penalty: density of garbled characters
+        text = chunk.text
+        noise_chars = sum(
+            1 for c in text if not (c.isalnum() or c in ' .,;:!?()-"\'\n\t')
+        )
+        noise_ratio = noise_chars / max(len(text), 1)
+        # Cap penalty at 50%
+        penalty = min(noise_ratio * 2.0, 0.5)
+        # 3. Dictionary Word Coverage penalty
+        words = _get_english_words()
+        if words:
+            # Extract alphabetic tokens
+            tokens = [t.lower() for t in re.findall(r'\b[a-zA-Z]{2,}\b', text)]
+            if tokens:
+                coverage = sum(1 for t in tokens if t in words) / len(tokens)
+                # If less than 60% of tokens are real words, apply up to 30% penalty
+                if coverage < 0.6:
+                    penalty += min((0.6 - coverage), 0.3)
+        # 4. FastText Language Confidence penalty
+        # Garbled text often confuses the language ID model, resulting in low confidence
+        lang_score = _get_lang_confidence(text)
+        if lang_score < 0.8:
+            # Scale penalty: 0.8 confidence = 0 penalty, 0.0 confidence = 0.4 penalty
+            penalty += (0.8 - lang_score) * 0.5
+        # 5. Completeness bonus: full sentences score higher
+        ends_properly = text.rstrip().endswith(('.', '!', '?', ':', '"'))
+        bonus = 0.05 if ends_properly else 0.0
+        # Calculate final score (cap penalty before applying it)
+        total_penalty = min(penalty, 0.8) # Max penalty is 80% to avoid dropping to 0 for weird formatting
+        final_score = max(0.0, min(1.0, base_score - total_penalty + bonus))
+        chunk.quality_score = final_score
+    return chunks
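
The penalty arithmetic in `score_chunks` can be followed with a reduced example. This sketch reproduces only the noise penalty, the 0.8 penalty cap, and the completeness bonus from the file above; it deliberately omits the dictionary-coverage and fastText steps, and takes the base score as a plain argument. `noise_penalty` and `simple_quality_score` are hypothetical names.

```python
# Characters that do not count as noise, matching the set used in the diff above.
ALLOWED_PUNCTUATION = set(' .,;:!?()-"\'\n\t')


def noise_penalty(text):
    """Return the noise penalty: twice the noise-character ratio, capped at 0.5."""
    noise_chars = sum(
        1 for c in text if not (c.isalnum() or c in ALLOWED_PUNCTUATION)
    )
    noise_ratio = noise_chars / max(len(text), 1)
    return min(noise_ratio * 2.0, 0.5)


def simple_quality_score(text, base_score):
    """Combine base score, capped penalty, and completeness bonus into [0, 1]."""
    penalty = min(noise_penalty(text), 0.8)
    ends_properly = text.rstrip().endswith(('.', '!', '?', ':', '"'))
    bonus = 0.05 if ends_properly else 0.0
    return max(0.0, min(1.0, base_score - penalty + bonus))


# Clean sentence: no noise characters, ends with a period, so only the bonus applies.
clean = simple_quality_score("The report was filed on time.", base_score=0.9)

# Fully garbled text: noise ratio 1.0, penalty capped at 0.5, no bonus.
garbled = simple_quality_score("@@##$$%%^^&&", base_score=0.9)
```

Under these assumptions the clean sentence scores about 0.95 and the garbled string about 0.4, which shows how the cap keeps a single noisy signal from driving the score to zero.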

longparser-0.1.5/src/longparser/chunkers/semantic_boundary.py ADDED

@@ -0,0 +1,67 @@
+"""Semantic boundary detection using SentenceTransformers."""
+from __future__ import annotations
+import logging
+from typing import List
+logger = logging.getLogger(__name__)
+_models: dict = {}
+def _get_model(model_name: str = "all-MiniLM-L6-v2"):
+    """Lazily load the SentenceTransformer model (cached by name)."""
+    if model_name not in _models:
+        try:
+            from sentence_transformers import SentenceTransformer
+            _models[model_name] = SentenceTransformer(model_name)
+            logger.info("Loaded semantic chunking model: %s", model_name)
+        except ImportError:
+            logger.warning("sentence-transformers not installed. Semantic chunking disabled.")
+            return None
+    return _models[model_name]
+def find_semantic_boundaries(
+    texts: List[str],
+    threshold: float = 0.3,
+    model_name: str = "all-MiniLM-L6-v2",
+) -> List[int]:
+    """Find semantic boundaries in a list of texts.
+    Args:
+        texts: List of block texts in reading order.
+        threshold: Cosine similarity threshold. Drops below this indicate a shift.
+    Returns:
+        List of block indices where a semantic shift occurs (the boundary is *before* the index).
+    """
+    if not texts or len(texts) < 2:
+        return []
+    model = _get_model(model_name)
+    if not model:
+        return []
+    # Batch encode all texts (fast on CPU)
+    embeddings = model.encode(texts, batch_size=64, show_progress_bar=False)
+    import numpy as np
+    def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
+        norm_a = np.linalg.norm(a)
+        norm_b = np.linalg.norm(b)
+        if norm_a == 0 or norm_b == 0:
+            return 0.0
+        return float(np.dot(a, b) / (norm_a * norm_b))
+    boundaries = []
+    for i in range(len(embeddings) - 1):
+        sim = cosine_sim(embeddings[i], embeddings[i+1])
+        if sim < threshold:
+            # Shift occurs before block i+1
+            boundaries.append(i + 1)
+    return boundaries
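
The boundary rule in `find_semantic_boundaries` (mark index `i + 1` when the cosine similarity between consecutive embeddings drops below the threshold) can be tried without a model. The sketch below is an illustration only: it uses hand-made 2-D vectors in place of SentenceTransformer embeddings and pure-Python math in place of NumPy, and `boundaries_from_embeddings` is a hypothetical name.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def boundaries_from_embeddings(embeddings, threshold=0.3):
    """Return indices i + 1 where similarity between items i and i + 1 is below threshold."""
    boundaries = []
    for i in range(len(embeddings) - 1):
        if cosine_sim(embeddings[i], embeddings[i + 1]) < threshold:
            boundaries.append(i + 1)  # the shift occurs before item i + 1
    return boundaries


# Two "topics": the first two vectors point along x, the last two along y.
toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
toy_boundaries = boundaries_from_embeddings(toy_embeddings, threshold=0.3)
```

Here the only low-similarity pair is items 1 and 2, so the single boundary falls before index 2.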

longparser-0.1.5/src/longparser/extractors/marker_extractor.py ADDED

@@ -0,0 +1,219 @@
+"""Marker-based extractor for high-fidelity extraction on complex PDFs.
+⚠️  LICENSE NOTICE — GPL-3.0
+    marker-pdf is licensed under GPL-3.0.
+    By using this backend, you agree to the terms of the GPL-3.0 license.
+    This module is NOT imported by default — users must explicitly opt in
+    via ``pip install longparser[marker]`` and ``backend='marker'``.
+⚠️  ISOLATION RULES (do NOT violate)
+    1. This file must NEVER be imported by ``extractors/__init__.py``
+    2. This file must NEVER be imported at module level by ``orchestrator.py``
+    3. This file must ONLY be imported behind ``if backend == "marker":``
+    4. ``import longparser`` must NEVER trigger loading this file
+"""
+from __future__ import annotations
+import hashlib
+import logging
+from pathlib import Path
+from typing import Optional, List, Tuple
+from ..schemas import (
+    Document, Page, Block, BlockType, ExtractorType, ProcessingConfig,
+    BoundingBox, Provenance, Confidence, DocumentMetadata, PageProfile, ExtractionMetadata
+)
+from .base import BaseExtractor
+logger = logging.getLogger(__name__)
+def _require_marker():
+    """Check that marker-pdf is installed; raise clear error if not."""
+    try:
+        import marker
+        return marker
+    except ImportError:
+        raise ImportError(
+            "\n"
+            "╔══════════════════════════════════════════════════════════╗\n"
+            "║  marker-pdf is not installed.                          ║\n"
+            "║                                                        ║\n"
+            "║  Install:  pip install 'longparser[marker]'            ║\n"
+            "║                                                        ║\n"
+            "║  ⚠️  marker-pdf is licensed under GPL-3.0.             ║\n"
+            "║  By installing it, you agree to GPL terms for that     ║\n"
+            "║  component. LongParser core remains MIT-licensed.      ║\n"
+            "╚══════════════════════════════════════════════════════════╝\n"
+        )
+class MarkerExtractor(BaseExtractor):
+    """Extractor using Marker for high-fidelity output.
+    Includes soft-cap logic for running on CPU to prevent infinite hangs.
+    """
+    extractor_type = ExtractorType.MARKER
+    version = "1.0.0"
+    def __init__(self):
+        """Initialize and verify marker-pdf is available."""
+        _require_marker()
+        # Check for GPU
+        try:
+            import torch
+            if not torch.cuda.is_available() and not torch.backends.mps.is_available():
+                logger.warning(
+                    "⚠️  Marker is running on CPU — expect 5-10× slower extraction. "
+                    "A soft cap of 10 pages is enforced by default. "
+                    "Set `force_marker_cpu=True` to bypass this."
+                )
+        except ImportError:
+            pass
+        logger.info("Marker backend initialized")
+    def extract(
+        self,
+        file_path: Path,
+        config: ProcessingConfig,
+        page_numbers: Optional[List[int]] = None,
+    ) -> Tuple[Document, ExtractionMetadata]:
+        """Extract a PDF using Marker."""
+        from marker.convert import convert_single_pdf
+        from marker.models import load_all_models
+        from marker.settings import settings
+        import fitz  # PyMuPDF is a marker dependency anyway
+        file_path = Path(file_path)
+        logger.info("Extracting with Marker: %s", file_path.name)
+        if file_path.suffix.lower() != ".pdf":
+            raise ValueError(f"Marker backend only supports PDF files, got: {file_path.suffix}")
+        pdf_doc = fitz.open(str(file_path))
+        total_pages = len(pdf_doc)
+        pdf_doc.close()
+        # Soft cap logic for CPU
+        try:
+            import torch
+            is_cpu = not torch.cuda.is_available() and not torch.backends.mps.is_available()
+        except ImportError:
+            is_cpu = True
+        if is_cpu and not config.force_marker_cpu and total_pages > 10:
+            if page_numbers is None or len(page_numbers) > 10:
+                raise RuntimeError(
+                    f"Marker CPU Soft Cap exceeded. Document has {total_pages} pages "
+                    f"(limit: 10). Extraction will take too long on CPU. "
+                    f"Set config.force_marker_cpu=True to override."
+                )
+        file_hash = hashlib.sha256(file_path.read_bytes()).hexdigest()[:16]
+        # Load models (cached internally by Marker)
+        model_lst = load_all_models()
+        # Convert
+        full_text, images, out_meta = convert_single_pdf(
+            str(file_path),
+            model_lst,
+            max_pages=settings.MAX_PAGES if not page_numbers else len(page_numbers),
+            langs=config.languages if config.languages else None,
+            batch_multiplier=settings.BATCH_MULTIPLIER,
+            start_page=page_numbers[0] if page_numbers else None
+        )
+        # Map to LongParser Document
+        # Note: Marker's output is flat markdown, so we do a fast mapping
+        # similar to PyMuPDFExtractor.
+        document = self._markdown_to_document(
+            md_text=full_text,
+            file_path=file_path,
+            file_hash=file_hash,
+            total_pages=total_pages,
+        )
+        meta = ExtractionMetadata(
+            strategy_used="marker",
+            ocr_backend_used="surya (marker)",
+        )
+        return document, meta
+    def _markdown_to_document(
+        self,
+        md_text: str,
+        file_path: Path,
+        file_hash: str,
+        total_pages: int,
+    ) -> Document:
+        """Convert Marker's markdown into a LongParser Document."""
+        metadata = DocumentMetadata(
+            source_file=str(file_path),
+            file_hash=file_hash,
+            total_pages=total_pages,
+        )
+        pages: list[Page] = []
+        blocks: list[Block] = []
+        lines = md_text.strip().split("\n")
+        order_idx = 0
+        # Fast parse
+        for i, line in enumerate(lines):
+            stripped = line.strip()
+            if not stripped:
+                continue
+            block_type = BlockType.PARAGRAPH
+            heading_level = None
+            if stripped.startswith("#"):
+                block_type = BlockType.HEADING
+                heading_level = min(len(stripped) - len(stripped.lstrip("#")), 6)
+                stripped = stripped.lstrip("#").strip()
+            elif stripped.startswith(("- ", "* ")):
+                block_type = BlockType.LIST_ITEM
+                stripped = stripped.lstrip("-* ").strip()
+            blocks.append(Block(
+                type=block_type,
+                text=stripped,
+                order_index=order_idx,
+                heading_level=heading_level,
+                provenance=Provenance(
+                    source_file=str(file_path),
+                    page_number=1, # Marker loses page boundaries in its markdown string
+                    bbox=BoundingBox(x0=0, y0=0, x1=0, y1=0),
+                    extractor=self.extractor_type,
+                    extractor_version=self.version,
+                ),
+                confidence=Confidence(overall=0.9),
+            ))
+            order_idx += 1
+        pages.append(Page(
+            page_number=1,
+            width=612.0,
+            height=792.0,
+            blocks=blocks,
+            profile=PageProfile(page_number=1, layout_confidence=0.9)
+        ))
+        return Document(metadata=metadata, pages=pages)
+    def extract_page(
+        self,
+        file_path: Path,
+        page_number: int,
+        config: ProcessingConfig,
+    ) -> Page:
+        doc, _ = self.extract(file_path, config, page_numbers=[page_number])
+        return doc.pages[0] if doc.pages else Page(page_number=page_number, width=0, height=0)
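
The line-by-line Markdown mapping in `_markdown_to_document` can be sketched as a standalone classifier. This is an illustrative reduction, not the extractor itself: `classify_line` is a hypothetical function that returns plain tuples instead of `Block` objects, and it removes list markers by slicing off the two-character prefix rather than calling `lstrip("-* ")`.

```python
def classify_line(line):
    """Map one Markdown line to (block_type, heading_level, text).

    Returns None for blank lines, which the extractor skips.
    """
    stripped = line.strip()
    if not stripped:
        return None

    if stripped.startswith("#"):
        # Heading level is the number of leading '#' characters, capped at 6.
        level = min(len(stripped) - len(stripped.lstrip("#")), 6)
        return ("heading", level, stripped.lstrip("#").strip())

    if stripped.startswith(("- ", "* ")):
        # Drop only the two-character marker so emphasis such as '**bold**' survives.
        return ("list_item", None, stripped[2:].strip())

    return ("paragraph", None, stripped)


heading = classify_line("## Introduction")
bullet = classify_line("- **Key point** follows")
plain = classify_line("Plain paragraph text.")
```

The slicing choice matters for items that begin with emphasis: `lstrip("-* ")` strips every leading `-`, `*`, and space, so a line like `- **Key point** follows` would lose its opening `**`.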
