PyPI - longparser - Versions diffs - 0.1.1__tar.gz → 0.1.3__tar.gz - Mend

longparser-0.1.1.tar.gz → longparser-0.1.3.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (42)

{longparser-0.1.1 → longparser-0.1.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: longparser
-Version: 0.1.1
+Version: 0.1.3
 Summary: Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.
 Author-email: ENDEVSOLS Team <technology@endevsols.com>
 License-Expression: MIT
@@ -27,6 +27,7 @@ Description-Content-Type: text/markdown
 Requires-Dist: pydantic<3,>=2.0
 Requires-Dist: docling>=2.14
 Requires-Dist: docling-core>=2.13
+Requires-Dist: langgraph-checkpoint-mongodb>=0.3.1
 Provides-Extra: pptx
 Requires-Dist: python-pptx>=1.0; extra == "pptx"
 Provides-Extra: langchain
@@ -109,8 +110,7 @@ Requires-Dist: httpx>=0.27; extra == "dev"
 Requires-Dist: anyio>=4.0; extra == "dev"
 <p align="center">
-  <!-- Logo goes here once ready -->
-  <h1 align="center">LongParser</h1>
+  <img src="https://raw.githubusercontent.com/ENDEVSOLS/LongParser/main/docs/assets/logo.png" alt="LongParser" width="320">
   <p align="center"><strong>Privacy-first document intelligence engine for production RAG pipelines.</strong></p>
   <p align="center">
     Parse PDFs, DOCX, PPTX, XLSX &amp; CSV → validated, AI-ready chunks with HITL review.
@@ -129,7 +129,7 @@ Requires-Dist: anyio>=4.0; extra == "dev"
       <img src="https://static.pepy.tech/badge/longparser/month" alt="Monthly Downloads">
     </a>
     <a href="https://www.python.org/">
-      <img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue.svg" alt="Python">
+      <img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue.svg" alt="Python">
     </a>
     <a href="LICENSE">
       <img src="https://img.shields.io/badge/License-MIT-brightgreen.svg" alt="MIT License">
@@ -150,11 +150,12 @@ Requires-Dist: anyio>=4.0; extra == "dev"
 | **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling |
 | **Hybrid chunking** | Token-aware, heading-hierarchy-aware, table-aware |
 | **HITL review** | Human-in-the-Loop block & chunk editing before embedding |
-| **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` |
+| **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` and MongoDB checkpointer |
 | **3-layer memory** | Short-term turns + rolling summary + long-term facts |
 | **Multi-provider LLM** | OpenAI, Gemini, Groq, OpenRouter |
 | **Multi-backend vectors** | Chroma, FAISS, Qdrant |
-| **Async-first API** | FastAPI + Motor (MongoDB) + ARQ (Redis) |
+| **Production-ready API** | FastAPI + Motor (MongoDB) + ARQ + Redis (Queue & Rate Limiting) |
+| **Enterprise Security** | Tenant isolation, Role-Based Access Control (RBAC), and CORS |
 | **LangChain adapters** | Drop-in `BaseRetriever` and LlamaIndex `QueryEngine` |
 | **Privacy-first** | All processing runs locally; no data leaves your infra |
@@ -215,9 +216,9 @@ pip install "longparser[cpu]"
 ### Python SDK
 ```python
-from longparser import PipelineOrchestrator, ProcessingConfig
+from longparser import DocumentPipeline, ProcessingConfig
-pipeline = PipelineOrchestrator()
+pipeline = DocumentPipeline(ProcessingConfig())
 result = pipeline.process_file("document.pdf")
 print(f"Pages: {result.document.metadata.total_pages}")
@@ -296,7 +297,7 @@ src/longparser/
 ├── schemas.py           ← core Pydantic models (Document, Block, Chunk, …)
 ├── extractors/          ← Docling, LaTeX OCR backends
 ├── chunkers/            ← HybridChunker
-├── pipeline/            ← PipelineOrchestrator
+├── pipeline/            ← DocumentPipeline
 ├── integrations/        ← LangChain loader & LlamaIndex reader
 ├── utils/               ← shared helpers (RTL detection, …)
 └── server/              ← REST API layer
@@ -344,11 +345,14 @@ Copy `.env.example` to `.env` and set:
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `LONGPARSER_MONGO_URL` | `mongodb://localhost:27017` | MongoDB connection |
-| `LONGPARSER_REDIS_URL` | `redis://localhost:6379` | Redis for job queue |
+| `LONGPARSER_REDIS_URL` | `redis://localhost:6379` | Redis for job queue & rate limits |
 | `LONGPARSER_LLM_PROVIDER` | `openai` | LLM provider |
-| `LONGPARSER_LLM_MODEL` | `gpt-4o` | Model name |
+| `LONGPARSER_LLM_MODEL` | `gpt-5.3` | Model name |
 | `LONGPARSER_EMBED_PROVIDER` | `huggingface` | Embedding provider |
 | `LONGPARSER_VECTOR_DB` | `chroma` | Vector store backend |
+| `LONGPARSER_CORS_ORIGINS` | `*` | Allowed CORS origins |
+| `LONGPARSER_RATE_LIMIT` | `60` | Max RPM per tenant |
+| `LONGPARSER_ADMIN_KEYS` | (empty) | Comma-separated admin API keys |
 ---

{longparser-0.1.1 → longparser-0.1.3}/README.md RENAMED

@@ -1,6 +1,5 @@
 <p align="center">
-  <!-- Logo goes here once ready -->
-  <h1 align="center">LongParser</h1>
+  <img src="https://raw.githubusercontent.com/ENDEVSOLS/LongParser/main/docs/assets/logo.png" alt="LongParser" width="320">
   <p align="center"><strong>Privacy-first document intelligence engine for production RAG pipelines.</strong></p>
   <p align="center">
     Parse PDFs, DOCX, PPTX, XLSX &amp; CSV → validated, AI-ready chunks with HITL review.
@@ -19,7 +18,7 @@
       <img src="https://static.pepy.tech/badge/longparser/month" alt="Monthly Downloads">
     </a>
     <a href="https://www.python.org/">
-      <img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue.svg" alt="Python">
+      <img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue.svg" alt="Python">
     </a>
     <a href="LICENSE">
       <img src="https://img.shields.io/badge/License-MIT-brightgreen.svg" alt="MIT License">
@@ -40,11 +39,12 @@
 | **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling |
 | **Hybrid chunking** | Token-aware, heading-hierarchy-aware, table-aware |
 | **HITL review** | Human-in-the-Loop block & chunk editing before embedding |
-| **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` |
+| **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` and MongoDB checkpointer |
 | **3-layer memory** | Short-term turns + rolling summary + long-term facts |
 | **Multi-provider LLM** | OpenAI, Gemini, Groq, OpenRouter |
 | **Multi-backend vectors** | Chroma, FAISS, Qdrant |
-| **Async-first API** | FastAPI + Motor (MongoDB) + ARQ (Redis) |
+| **Production-ready API** | FastAPI + Motor (MongoDB) + ARQ + Redis (Queue & Rate Limiting) |
+| **Enterprise Security** | Tenant isolation, Role-Based Access Control (RBAC), and CORS |
 | **LangChain adapters** | Drop-in `BaseRetriever` and LlamaIndex `QueryEngine` |
 | **Privacy-first** | All processing runs locally; no data leaves your infra |
@@ -105,9 +105,9 @@ pip install "longparser[cpu]"
 ### Python SDK
 ```python
-from longparser import PipelineOrchestrator, ProcessingConfig
+from longparser import DocumentPipeline, ProcessingConfig
-pipeline = PipelineOrchestrator()
+pipeline = DocumentPipeline(ProcessingConfig())
 result = pipeline.process_file("document.pdf")
 print(f"Pages: {result.document.metadata.total_pages}")
@@ -186,7 +186,7 @@ src/longparser/
 ├── schemas.py           ← core Pydantic models (Document, Block, Chunk, …)
 ├── extractors/          ← Docling, LaTeX OCR backends
 ├── chunkers/            ← HybridChunker
-├── pipeline/            ← PipelineOrchestrator
+├── pipeline/            ← DocumentPipeline
 ├── integrations/        ← LangChain loader & LlamaIndex reader
 ├── utils/               ← shared helpers (RTL detection, …)
 └── server/              ← REST API layer
@@ -234,11 +234,14 @@ Copy `.env.example` to `.env` and set:
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `LONGPARSER_MONGO_URL` | `mongodb://localhost:27017` | MongoDB connection |
-| `LONGPARSER_REDIS_URL` | `redis://localhost:6379` | Redis for job queue |
+| `LONGPARSER_REDIS_URL` | `redis://localhost:6379` | Redis for job queue & rate limits |
 | `LONGPARSER_LLM_PROVIDER` | `openai` | LLM provider |
-| `LONGPARSER_LLM_MODEL` | `gpt-4o` | Model name |
+| `LONGPARSER_LLM_MODEL` | `gpt-5.3` | Model name |
 | `LONGPARSER_EMBED_PROVIDER` | `huggingface` | Embedding provider |
 | `LONGPARSER_VECTOR_DB` | `chroma` | Vector store backend |
+| `LONGPARSER_CORS_ORIGINS` | `*` | Allowed CORS origins |
+| `LONGPARSER_RATE_LIMIT` | `60` | Max RPM per tenant |
+| `LONGPARSER_ADMIN_KEYS` | (empty) | Comma-separated admin API keys |
 ---
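The three variables added to the configuration table (`LONGPARSER_CORS_ORIGINS`, `LONGPARSER_RATE_LIMIT`, `LONGPARSER_ADMIN_KEYS`) could be read roughly as follows. This is a minimal sketch, not code from the package: `ServerSettings` and `load_settings` are hypothetical names, the defaults are copied from the table, and stripping whitespace around each origin is an assumption of this sketch.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerSettings:
    """Hypothetical settings container; not part of longparser."""

    cors_origins: list[str]
    rate_limit: int
    admin_keys: frozenset[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> ServerSettings:
    """Read the three new variables, falling back to the documented defaults."""
    env = os.environ if env is None else env

    # Comma-separated list of allowed origins; "*" allows any origin.
    raw_origins = env.get("LONGPARSER_CORS_ORIGINS", "*")
    origins = [o.strip() for o in raw_origins.split(",") if o.strip()]

    # Comma-separated admin API keys; empty entries are ignored.
    raw_keys = env.get("LONGPARSER_ADMIN_KEYS", "")
    admin_keys = frozenset(k.strip() for k in raw_keys.split(",") if k.strip())

    return ServerSettings(
        cors_origins=origins or ["*"],
        rate_limit=int(env.get("LONGPARSER_RATE_LIMIT", "60")),  # max requests per minute
        admin_keys=admin_keys,
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test. Deployments will usually want to list explicit origins rather than rely on the `*` default.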

{longparser-0.1.1 → longparser-0.1.3}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "longparser"
-version = "0.1.1"
+version = "0.1.3"
 description = "Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines."
 readme = {file = "README.md", content-type = "text/markdown"}
 requires-python = ">=3.10"
@@ -35,6 +35,7 @@ dependencies = [
     "pydantic>=2.0,<3",
     "docling>=2.14",
     "docling-core>=2.13",
+    "langgraph-checkpoint-mongodb>=0.3.1",
 ]
 [project.optional-dependencies]

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/__init__.py RENAMED

@@ -9,9 +9,9 @@ Built by ENDEVSOLS for production RAG pipelines.
 Quick start::
-    from longparser import PipelineOrchestrator, ProcessingConfig
+    from longparser import DocumentPipeline, ProcessingConfig
-    pipeline = PipelineOrchestrator()
+    pipeline = DocumentPipeline(ProcessingConfig())
     result = pipeline.process_file("document.pdf")
     print(result.chunks[0].text)
@@ -19,13 +19,13 @@ For the full REST API server::
     uv run uvicorn longparser.server.app:app --reload --port 8000
-See :class:`~longparser.pipeline.PipelineOrchestrator` for the main SDK entry
+See :class:`~longparser.pipeline.DocumentPipeline` for the main SDK entry
 point and :mod:`longparser.server` for the REST API layer.
 """
 from __future__ import annotations
-__version__ = "0.1.1"
+__version__ = "0.1.3"
 __author__ = "ENDEVSOLS Team"
 __license__ = "MIT"
@@ -62,6 +62,9 @@ def __getattr__(name: str):
     if name == "PipelineOrchestrator":
         from .pipeline import PipelineOrchestrator
         return PipelineOrchestrator
+    if name == "DocumentPipeline":
+        from .pipeline import DocumentPipeline
+        return DocumentPipeline
     if name == "PipelineResult":
         from .pipeline import PipelineResult
         return PipelineResult
@@ -99,6 +102,7 @@ __all__ = [
     # Lazily imported (require extras)
     "DoclingExtractor",
     "PipelineOrchestrator",
+    "DocumentPipeline",
     "PipelineResult",
     "HybridChunker",
 ]

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/chunkers/hybrid_chunker.py RENAMED

@@ -345,10 +345,10 @@ def _generate_schema_chunk(
         sample_rows.append(f"  Row {r_idx}: " + "; ".join(parts))
     lines = [
-        f"[TABLE SCHEMA]",
+        "[TABLE SCHEMA]",
         f"Table ID: {block.block_id}",
         f"Rows: {n_data} (data rows), Columns: {n_cols}",
-        f"Columns:",
+        "Columns:",
     ]
     lines.extend(col_profiles)
     lines.append(f"Sample Rows ({sample_count}):")

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/extractors/docling_extractor.py RENAMED

@@ -254,7 +254,7 @@ class DoclingExtractor(BaseExtractor):
                         # Order-based substitution with alignment gate
                         injected = 0
                         _non_omml = 0
-                        for block, latex in zip(formula_blocks, latex_eqs):
+                        for block, latex in zip(formula_blocks, latex_eqs, strict=False):
                             orig_len = len(block.text.strip()) if block.text else 0
                             latex_len = len(latex.strip())
@@ -431,7 +431,8 @@ class DoclingExtractor(BaseExtractor):
                     page_img = None
                     try:
                         page_img = page_obj.image.pil_image
-                    except Exception:
+                    except Exception as e:
+                        logger.warning("Failed to extract image for formula scanning: %s", e)
                         continue
                     if page_img is None:
                         continue
@@ -527,8 +528,8 @@ class DoclingExtractor(BaseExtractor):
                                     # Update label to formula so downstream sees it correctly
                                     try:
                                         item.label = type(item.label)("formula")
-                                    except Exception:
-                                        pass
+                                    except Exception as e:
+                                        logger.debug(f"Failed to update formula label: {e}")
                                     replaced = True
                                     logger.debug(f"MFD: replaced garbled block on page {page_no}")
                                     break
@@ -1023,15 +1024,15 @@ class DoclingExtractor(BaseExtractor):
         if isinstance(item, TableItem) and hasattr(item, 'export_to_markdown'):
             try:
                 return item.export_to_markdown(doc=docling_doc)
-            except Exception:
-                pass
+            except Exception as e:
+                logger.debug(f"Failed to export table item to markdown: {e}")
         if hasattr(item, 'text') and item.text:
             return item.text
         if hasattr(item, 'export_to_markdown'):
             try:
                 return item.export_to_markdown()
-            except Exception:
-                pass
+            except Exception as e:
+                logger.debug(f"Failed to export item to markdown: {e}")
         return ""
     def _get_item_confidence(self, item) -> float:
@@ -1080,10 +1081,10 @@ class DoclingExtractor(BaseExtractor):
                                 if s.placeholder_format.type == PP_PH.SUBTITLE:
                                     has_subtitle_placeholder = True
                                     break
-                            except Exception:
-                                pass
-                except ImportError:
-                    pass
+                            except Exception as e:
+                                logger.debug(f"Failed to check PPTX subtitle placeholder format: {e}")
+                except ImportError as e:
+                    logger.debug(f"Failed to import python-pptx: {e}")
             for shape in slide.shapes:
                 found_title = self._extract_pptx_shape_info(
@@ -1160,8 +1161,8 @@ class DoclingExtractor(BaseExtractor):
                     is_subtitle_shape = True
                 elif ph_type in (PP_PLACEHOLDER.DATE, PP_PLACEHOLDER.FOOTER, PP_PLACEHOLDER.SLIDE_NUMBER):
                     is_footer_shape = True
-            except Exception:
-                pass
+            except Exception as e:
+                logger.debug(f"Failed to check PPTX placeholder format type: {e}")
         # Skip footer/date/slide-number shapes entirely
         if is_footer_shape:
@@ -1267,7 +1268,7 @@ class DoclingExtractor(BaseExtractor):
         # Calculate file hash
         with open(file_path, "rb") as f:
-            file_hash = hashlib.md5(f.read()).hexdigest()
+            file_hash = hashlib.sha256(f.read()).hexdigest()
         # Get conversion result (cached or new)
         result = self._run_docling(file_path, config)
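The last hunk swaps MD5 for SHA-256 when computing the file hash but still reads the whole file into memory with `f.read()`. For large documents, a chunked variant yields the same digest with bounded memory. This is a sketch of that alternative, not what the package does; `file_sha256` and its `chunk_size` parameter are hypothetical.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is a 64-character hex string, twice the length of the old MD5 value, so any stored hashes or fixed-width fields keyed on the previous format would need to account for that.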

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/pipeline/__init__.py RENAMED

@@ -2,7 +2,11 @@
 from .orchestrator import PipelineOrchestrator, PipelineResult
+# Public alias — docs and quickstart use this name
+DocumentPipeline = PipelineOrchestrator
 __all__ = [
     "PipelineOrchestrator",
+    "DocumentPipeline",
     "PipelineResult",
 ]
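`DocumentPipeline = PipelineOrchestrator` binds a second name to the same class object rather than creating a subclass. The stand-in below (it does not import longparser, and its constructor signature is an assumption) shows what that means for callers.

```python
class PipelineOrchestrator:
    """Stand-in for the real class; the constructor signature is assumed."""

    def __init__(self, config=None):
        self.config = config


# Same pattern as the diff: a plain assignment, so both names share one class object.
DocumentPipeline = PipelineOrchestrator

pipeline = DocumentPipeline()
```

Because the alias is the same object, `isinstance` checks work with either name, while `repr` output and tracebacks continue to show `PipelineOrchestrator`.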

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/app.py RENAMED

@@ -13,6 +13,7 @@ try:
 except ImportError:
     pass
+from collections import defaultdict
 import hashlib
 import io
 import logging
@@ -25,6 +26,7 @@ from datetime import datetime, timezone
 from pathlib import Path
 from typing import Optional
 import time as _time
+import redis.asyncio as redis
 from fastapi import (
     FastAPI,
@@ -35,6 +37,7 @@ from fastapi import (
     Request,
     UploadFile,
 )
+from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import JSONResponse, StreamingResponse
 from .db import Database
@@ -57,6 +60,15 @@ from .schemas import (
     SearchResponse,
     SearchResult,
 )
+from .chat.schemas import (
+    ChatConfig,
+    ChatRequest,
+    ChatResponse,
+    CreateSessionRequest,
+    HITLResumeRequest,
+    LLMAnswer,
+    SourceRef,
+)
 logger = logging.getLogger(__name__)
@@ -92,8 +104,18 @@ queue = ARQBackend(
 async def lifespan(app: FastAPI):
     """Startup/shutdown hooks."""
     await db.create_indexes()
+    from .chat.checkpointer import init_checkpointer, close_checkpointer
+    await init_checkpointer(
+        mongo_uri=os.getenv("LONGPARSER_MONGO_URL", "mongodb://localhost:27017"),
+        db_name=os.getenv("LONGPARSER_DB_NAME", "longparser"),
+    )
     logger.info("LongParser API started")
     yield
+    await close_checkpointer()
     await queue.close()
     await db.close()
     if hasattr(app.state, "chat_engine"):
@@ -104,11 +126,69 @@ async def lifespan(app: FastAPI):
 app = FastAPI(
     title="LongParser API",
     description="Document intelligence engine with HITL review, embedding, and vector search.",
-    version="0.3.0",
+    version=__import__("longparser").__version__,
     lifespan=lifespan,
 )
+# ---------------------------------------------------------------------------
+# CORS middleware
+# ---------------------------------------------------------------------------
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=os.getenv("LONGPARSER_CORS_ORIGINS", "*").split(","),
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# ---------------------------------------------------------------------------
+# Global exception handler
+# ---------------------------------------------------------------------------
+@app.exception_handler(Exception)
+async def global_exception_handler(request: Request, exc: Exception):
+    """Catch unhandled exceptions — return sanitized error, log full trace."""
+    logger.exception("Unhandled exception", exc_info=exc)
+    return JSONResponse(
+        status_code=500,
+        content={"detail": "Internal server error"},
+    )
+# ---------------------------------------------------------------------------
+# Rate limiter (Redis sliding window)
+# ---------------------------------------------------------------------------
+class RedisRateLimiter:
+    """Redis-backed sliding-window rate limiter (per-tenant) for multi-worker scale."""
+    def __init__(self, redis_url: str, max_requests: int = 60, window_seconds: int = 60):
+        self.max_requests = max_requests
+        self.window = window_seconds
+        self.redis = redis.from_url(redis_url)
+    async def check(self, key: str) -> bool:
+        now = _time.time()
+        redis_key = f"rate_limit:{key}"
+        pipeline = self.redis.pipeline()
+        pipeline.zremrangebyscore(redis_key, 0, now - self.window)
+        pipeline.zadd(redis_key, {str(now): now})
+        pipeline.zcard(redis_key)
+        pipeline.expire(redis_key, self.window)
+        results = await pipeline.execute()
+        return results[2] <= self.max_requests
+_rate_limiter = RedisRateLimiter(
+    redis_url=os.getenv("LONGPARSER_REDIS_URL", "redis://localhost:6379/0"),
+    max_requests=int(os.getenv("LONGPARSER_RATE_LIMIT", "60")),
+    window_seconds=60,
+)
 # ---------------------------------------------------------------------------
 # Auth middleware (API key — v1)
 # ---------------------------------------------------------------------------
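The `RedisRateLimiter` above keeps one sorted set of timestamps per tenant: it drops entries older than the window, records the current request, then counts what remains. The same logic can be followed without a Redis server using an in-memory analogue. This is a single-process sketch only; `InMemorySlidingWindow` and its injectable `clock` are hypothetical and do not provide the multi-worker behavior the Redis version is meant for.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class InMemorySlidingWindow:
    """Single-process analogue of a sorted-set sliding-window limiter."""

    def __init__(
        self,
        max_requests: int = 60,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_requests = max_requests
        self.window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps, oldest first

    def check(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        cutoff = now - self.window
        # Step 1: evict timestamps at or before the cutoff (mirrors ZREMRANGEBYSCORE 0..cutoff).
        while hits and hits[0] <= cutoff:
            hits.popleft()
        # Step 2: record this request before counting (mirrors ZADD then ZCARD).
        hits.append(now)
        # Step 3: allow only if the window, including this request, is within the limit.
        return len(hits) <= self.max_requests
```

As in the diff, a rejected request is still recorded, so a client that keeps retrying while over the limit extends its own lockout.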
@@ -121,8 +201,33 @@ def _get_tenant(x_api_key: str = Header(...)) -> str:
     """
     if not x_api_key or len(x_api_key) < 8:
         raise HTTPException(status_code=401, detail="Invalid API key")
-    # For v1, use a hash of the key as tenant_id
-    return hashlib.sha256(x_api_key.encode()).hexdigest()[:16]
+    # Use 32 hex chars (128-bit) to resist brute-force collision attacks
+    return hashlib.sha256(x_api_key.encode()).hexdigest()[:32]
+# ---------------------------------------------------------------------------
+# RBAC (role-based access control)
+# ---------------------------------------------------------------------------
+_ADMIN_KEYS: set[str] = set(
+    k.strip() for k in os.getenv("LONGPARSER_ADMIN_KEYS", "").split(",") if k.strip()
+)
+def _get_role(x_api_key: str) -> str:
+    """Resolve user role from API key.
+    If LONGPARSER_ADMIN_KEYS is not set, all users are admins (backward compatible).
+    """
+    if not _ADMIN_KEYS:
+        return "admin"
+    return "admin" if x_api_key in _ADMIN_KEYS else "reviewer"
+def _require_admin(x_api_key: str) -> None:
+    """Raise 403 if the API key does not have admin role."""
+    if _get_role(x_api_key) != "admin":
+        raise HTTPException(status_code=403, detail="Admin access required")
 # ---------------------------------------------------------------------------
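The tenant and role rules in this hunk are small enough to restate as pure functions. The sketch below mirrors the logic shown in the diff; the function names `tenant_id_for` and `resolve_role` are hypothetical, and the admin key set is passed in as an argument instead of being read from `LONGPARSER_ADMIN_KEYS` at import time.

```python
import hashlib
from typing import FrozenSet


def tenant_id_for(api_key: str) -> str:
    """Derive a tenant id as the first 32 hex characters (128 bits) of SHA-256."""
    return hashlib.sha256(api_key.encode()).hexdigest()[:32]


def resolve_role(api_key: str, admin_keys: FrozenSet[str]) -> str:
    """Return 'admin' or 'reviewer' using the same rule as the diff."""
    if not admin_keys:
        # No admin keys configured: every caller is treated as admin.
        return "admin"
    return "admin" if api_key in admin_keys else "reviewer"
```

One consequence worth noting: until `LONGPARSER_ADMIN_KEYS` is set, the `_require_admin` checks added to the purge endpoints pass for every caller. Tenant ids also change length from 16 to 32 characters, so records stored under the 0.1.1 ids would not match ids derived by 0.1.3 unless they are migrated.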
@@ -175,14 +280,23 @@ async def create_job(
     # Generate job ID and save file
     job_id = str(uuid.uuid4())
-    dest = UPLOAD_DIR / tenant_id / job_id / (file.filename or "document")
+    # --- Path Traversal Protection ---
+    # Strip all directory components from the user-provided filename
+    # to prevent payloads like "../../../etc/passwd" from escaping UPLOAD_DIR.
+    raw_name = file.filename or "document"
+    safe_name = Path(raw_name).name  # keeps only the final component
+    if not safe_name or safe_name in (".", ".."):
+        safe_name = "document"
+    dest = UPLOAD_DIR / tenant_id / job_id / safe_name
     file_hash, file_size = await _stream_upload(file, dest)
     # Create job in MongoDB
     job_doc = await db.create_job(
         tenant_id=tenant_id,
         job_id=job_id,
-        source_file=file.filename or "document",
+        source_file=safe_name,
         file_hash=file_hash,
     )
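The path traversal fix keeps only the final component of the uploaded filename. A standalone version of that rule is sketched below. `safe_filename` is a hypothetical helper, and it uses `PurePosixPath` so the result is the same on every platform, whereas the diff uses `Path`, whose separator rules follow the host OS.

```python
from pathlib import PurePosixPath
from typing import Optional


def safe_filename(raw_name: Optional[str], fallback: str = "document") -> str:
    """Strip directory components from a user-supplied filename."""
    # .name returns only the last path component, e.g. "passwd" for "../../etc/passwd".
    name = PurePosixPath(raw_name or fallback).name
    if not name or name in (".", ".."):
        return fallback
    return name
```

With POSIX rules a backslash is an ordinary character, so a name such as `..\evil.txt` stays a single component and cannot escape the upload directory on a POSIX host.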
@@ -197,7 +311,7 @@ async def create_job(
         job_id=job_id,
         tenant_id=tenant_id,
         status=JobStatus.QUEUED,
-        source_file=file.filename or "document",
+        source_file=safe_name,
         file_hash=file_hash,
         created_at=job_doc["created_at"],
     )
@@ -498,6 +612,7 @@ async def purge_block(
     x_api_key: str = Header(...),
 ):
     """Admin-only: permanently delete a block. Writes a tombstone revision."""
+    _require_admin(x_api_key)
     tenant_id = _get_tenant(x_api_key)
     # Get block before deletion (for tombstone)
@@ -545,6 +660,7 @@ async def purge_chunk(
     x_api_key: str = Header(...),
 ):
     """Admin-only: permanently delete a chunk. Writes a tombstone revision."""
+    _require_admin(x_api_key)
     tenant_id = _get_tenant(x_api_key)
     # Get chunk before deletion
@@ -852,8 +968,19 @@ async def search(body: SearchRequest, x_api_key: str = Header(...)):
 @app.middleware("http")
 async def observability_middleware(request: Request, call_next):
-    """Attach request_id and log structured request data."""
+    """Attach request_id, enforce rate limits, and log structured request data."""
     request_id = str(uuid.uuid4())[:8]
+    # ── Rate limiting (skip unauthenticated endpoints) ──
+    api_key = request.headers.get("x-api-key")
+    if api_key and len(api_key) >= 8:
+        tenant_key = hashlib.sha256(api_key.encode()).hexdigest()[:32]
+        if not await _rate_limiter.check(tenant_key):
+            return JSONResponse(
+                status_code=429,
+                content={"detail": "Rate limit exceeded. Try again later."},
+            )
     start = _time.monotonic()
     response = await call_next(request)
     latency_ms = (_time.monotonic() - start) * 1000
@@ -876,12 +1003,10 @@ async def observability_middleware(request: Request, call_next):
 @app.post("/chat/sessions", status_code=201)
 async def create_chat_session(
-    body: dict,
+    req: CreateSessionRequest,
     x_api_key: str = Header(...),
 ):
     """Create a new chat session (server-generated session_id)."""
-    from .chat.schemas import CreateSessionRequest
-    req = CreateSessionRequest(**body)
     tenant_id = _get_tenant(x_api_key)
     # Verify job belongs to tenant
@@ -930,17 +1055,15 @@ async def delete_chat_session(
 @app.post("/chat")
 async def chat(
-    body: dict,
+    req: ChatRequest,
     x_api_key: str = Header(...),
 ):
     """Ask a question — RAG chatbot with 3-layer memory.
     Set require_approval=true for Human-in-the-Loop review.
     """
-    from .chat.schemas import ChatRequest, ChatResponse, ChatConfig
     from .chat.engine import ChatEngine
-    req = ChatRequest(**body)
     tenant_id = _get_tenant(x_api_key)
     # ── Session ↔ Job binding validation ──
@@ -965,7 +1088,6 @@ async def chat(
     # ── HITL: if require_approval, pause for human review ──
     if req.require_approval and response.status == "complete":
-        from .chat.schemas import LLMAnswer, SourceRef
         from .chat.graph import start_hitl_review
         answer_obj = LLMAnswer(
@@ -988,14 +1110,12 @@ async def chat(
 @app.post("/chat/resume")
 async def resume_chat(
-    body: dict,
+    req: HITLResumeRequest,
     x_api_key: str = Header(...),
 ):
     """Resume a paused HITL chat with human decision (approve/edit/reject)."""
-    from .chat.schemas import HITLResumeRequest, ChatResponse, SourceRef, Turn
     from .chat.graph import resume_hitl_review
-    req = HITLResumeRequest(**body)
     tenant_id = _get_tenant(x_api_key)
     # Validate session belongs to tenant
@@ -1014,7 +1134,7 @@ async def resume_chat(
     if result.get("status") == "complete":
         # Update the last turn's answer if edited
         if req.action == "edit" and req.edited_answer:
-            await db.chat_turns.update_one(
+            await db.chat_turns.find_one_and_update(
                 {
                     "tenant_id": tenant_id,
                     "session_id": req.session_id,
@@ -1041,5 +1161,5 @@ async def resume_chat(
 @app.get("/health")
 async def health():
     """Health check endpoint."""
-    return {"status": "ok", "service": "cleanrag-api"}
+    return {"status": "ok", "service": "longparser-api"}

longparser-0.1.3/src/longparser/server/chat/checkpointer.py ADDED

@@ -0,0 +1,45 @@
+"""LangGraph MongoDB Checkpointer singleton.
+Holds the global per-worker instance of the MongoDBSaver.
+"""
+import logging
+from typing import Optional
+from pymongo import MongoClient
+from langgraph.checkpoint.mongodb import MongoDBSaver
+logger = logging.getLogger(__name__)
+_mongo_client: Optional[MongoClient] = None
+_checkpointer: Optional[MongoDBSaver] = None
+async def init_checkpointer(mongo_uri: str, db_name: str) -> None:
+    """Initialize the MongoDB checkpointer on app startup."""
+    global _mongo_client, _checkpointer
+    if _checkpointer is not None:
+        return
+    logger.info("Initializing LangGraph MongoDB checkpointer...")
+    # Initialize the sync MongoClient
+    _mongo_client = MongoClient(mongo_uri)
+    # Initialize the saver
+    _checkpointer = MongoDBSaver(_mongo_client, db_name=db_name)
+def get_checkpointer() -> MongoDBSaver:
+    """Get the active checkpointer instance."""
+    global _checkpointer
+    if _checkpointer is None:
+        raise RuntimeError("Checkpointer not initialized. Call init_checkpointer first.")
+    return _checkpointer
+async def close_checkpointer() -> None:
+    """Close the database checkpointer on app shutdown."""
+    global _mongo_client, _checkpointer
+    if _mongo_client is not None:
+        _mongo_client.close()
+        _mongo_client = None
+    _checkpointer = None
+    logger.info("LangGraph MongoDB checkpointer closed.")
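The new module follows an init / get / close lifecycle around one shared instance. The sketch below shows the same lifecycle with a stand-in object so it runs without MongoDB or LangGraph. `SaverRegistry` and `FakeSaver` are hypothetical, and holding the state on an object rather than in module globals is a choice of this sketch, not of the package.

```python
from typing import Callable, Optional


class FakeSaver:
    """Stand-in for a checkpointer; only remembers the database name."""

    def __init__(self, db_name: str):
        self.db_name = db_name


class SaverRegistry:
    """Holds at most one saver instance, created on init and dropped on close."""

    def __init__(self, factory: Callable[..., FakeSaver]):
        self._factory = factory
        self._instance: Optional[FakeSaver] = None

    def init(self, *args, **kwargs) -> None:
        # Idempotent: a second call keeps the existing instance.
        if self._instance is None:
            self._instance = self._factory(*args, **kwargs)

    def get(self) -> FakeSaver:
        if self._instance is None:
            raise RuntimeError("Saver not initialized. Call init() first.")
        return self._instance

    def close(self) -> None:
        self._instance = None


registry = SaverRegistry(FakeSaver)
```

Failing loudly in `get()` when `init()` has not run matches the `RuntimeError` raised by `get_checkpointer()` in the added file.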

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/chat/engine.py RENAMED

@@ -76,7 +76,7 @@ RAG_PROMPT = ChatPromptTemplate.from_messages([
 # Token Counting (model-aware) — kept as custom logic
 # ---------------------------------------------------------------------------
-def count_tokens(text: str, model: str = "gpt-4o") -> int:
+def count_tokens(text: str, model: str = "gpt-5.3") -> int:
     """Count tokens — exact for OpenAI models, conservative approx for others."""
     try:
         import tiktoken
@@ -96,7 +96,7 @@ def budget_trim(
     recent_turns: list[dict],
     rolling_summary: str,
     long_term_facts: list[dict],
-    model: str = "gpt-4o",
+    model: str = "gpt-5.3",
     max_prompt_tokens: int = 6000,
 ) -> dict:
     """Priority-ordered truncation of prompt variables to fit token budget.

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/chat/graph.py RENAMED

@@ -17,16 +17,14 @@ import logging
 import uuid
 from typing import TypedDict, Optional, Any
-from langgraph.checkpoint.memory import InMemorySaver
 from langgraph.graph import StateGraph, END
 from langgraph.types import interrupt, Command
 from .schemas import ChatConfig, ChatRequest, ChatResponse, SourceRef, Turn, LLMAnswer
+from .checkpointer import get_checkpointer
 logger = logging.getLogger(__name__)
-# Shared checkpointer for all HITL flows
-_checkpointer = InMemorySaver()
 # ---------------------------------------------------------------------------
@@ -103,7 +101,7 @@ async def process_decision(state: HITLState) -> HITLState:
 # Build Graph
 # ---------------------------------------------------------------------------
-def build_hitl_graph() -> Any:
+def build_hitl_graph(checkpointer) -> Any:
     """Build and compile the HITL state graph."""
     graph = StateGraph(HITLState)
@@ -116,11 +114,7 @@ def build_hitl_graph() -> Any:
     graph.add_edge("review", "decide")
     graph.add_edge("decide", END)
-    return graph.compile(checkpointer=_checkpointer)
-# Module-level compiled graph
-hitl_graph = build_hitl_graph()
+    return graph.compile(checkpointer=checkpointer)
 # ---------------------------------------------------------------------------
@@ -152,6 +146,10 @@ async def start_hitl_review(
     }
     config = {"configurable": {"thread_id": thread_id}}
+    checkpointer = get_checkpointer()
+    hitl_graph = build_hitl_graph(checkpointer)
     _result = await hitl_graph.ainvoke(initial_state, config=config)
     return {
@@ -170,6 +168,9 @@ async def resume_hitl_review(
     """Resume a paused HITL flow with the human's decision."""
     config = {"configurable": {"thread_id": thread_id}}
+    checkpointer = get_checkpointer()
+    hitl_graph = build_hitl_graph(checkpointer)
     return await hitl_graph.ainvoke(
         Command(resume={"action": action, "edited_answer": edited_answer}),
         config=config,

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/chat/llm_chain.py RENAMED

@@ -16,14 +16,16 @@ from .schemas import ChatConfig
 logger = logging.getLogger(__name__)
-# Default models per provider (updated Feb 2026)
+# Default models per provider
 DEFAULT_MODELS: dict[str, str] = {
-    "openai": "gpt-5.3-codex",
+    "openai": "gpt-5.3",
     "gemini": "gemini-2.5-flash",
     "groq": "openai/gpt-oss-120b",
-    "openrouter": "openai/gpt-5.3-codex",
+    "openrouter": "openai/gpt-5.3",
 }
+SUPPORTED_PROVIDERS = list(DEFAULT_MODELS.keys())
 def _create_openai(model: str, temperature: float, max_tokens: int,
                    max_retries: int, callbacks: Optional[list] = None):
@@ -113,7 +115,7 @@ def get_chat_model(
     """
     config = config or ChatConfig()
     provider = provider or config.llm_provider
-    model = model or config.llm_model or DEFAULT_MODELS.get(provider, "gpt-4o")
+    model = model or config.llm_model or DEFAULT_MODELS.get(provider, "gpt-5.3")
     max_tokens = max_tokens or config.max_output_tokens
     creator = _CREATORS.get(provider)

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/chat/schemas.py RENAMED

@@ -33,7 +33,7 @@ class ChatConfig(BaseModel):
         default_factory=lambda: os.getenv("LONGPARSER_LLM_PROVIDER", "openai")
     )
     llm_model: str = Field(
-        default_factory=lambda: os.getenv("LONGPARSER_LLM_MODEL", "gpt-4o")
+        default_factory=lambda: os.getenv("LONGPARSER_LLM_MODEL", "gpt-5.3")
     )
     max_input_tokens: int = Field(
         default_factory=lambda: int(os.getenv("LONGPARSER_CHAT_MAX_INPUT_TOKENS", "1000"))
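
Note on the two hunks above: read together, they suggest that the per-provider table is only consulted when the configured model is empty, because `ChatConfig.llm_model` now defaults to `gpt-5.3` and sits earlier in the `or` chain. The sketch below mirrors that chain under stated assumptions: `resolve_model` is a hypothetical helper, not a function in longparser, and `config_model` stands in for `config.llm_model`.

```python
from typing import Optional

# Copied from the 0.1.3 side of the llm_chain.py hunk.
DEFAULT_MODELS: dict[str, str] = {
    "openai": "gpt-5.3",
    "gemini": "gemini-2.5-flash",
    "groq": "openai/gpt-oss-120b",
    "openrouter": "openai/gpt-5.3",
}


def resolve_model(model: Optional[str], provider: str, config_model: str) -> str:
    """Hypothetical helper mirroring the `or` chain in get_chat_model."""
    # First truthy value wins: explicit argument, then config, then provider default.
    return model or config_model or DEFAULT_MODELS.get(provider, "gpt-5.3")


# With the schema default in place, the config value shadows the provider table.
print(resolve_model(None, "gemini", "gpt-5.3"))  # gpt-5.3
# Only an empty config value lets the per-provider default through.
print(resolve_model(None, "gemini", ""))         # gemini-2.5-flash
```

If that reading is right, switching `LONGPARSER_LLM_PROVIDER` without also setting `LONGPARSER_LLM_MODEL` (or passing `model=` explicitly) would send the OpenAI default name to another provider.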

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/db.py RENAMED

@@ -411,7 +411,7 @@ class Database:
                 ]},
             },
             {"_id": 0},
-        ).to_list(length=None)
+        ).to_list(length=10000)  # Cap: embedding batches
     # -----------------------------------------------------------------------
     # Index versions
@@ -450,7 +450,7 @@ class Database:
         """List all index versions for a job (for cleanup on delete)."""
         return await self.index_versions.find(
             {"tenant_id": tenant_id, "job_id": job_id}, {"_id": 0}
-        ).to_list(length=None)
+        ).to_list(length=100)  # Cap: index versions per job
     # -----------------------------------------------------------------------
     # Chat Sessions
@@ -597,7 +597,7 @@ class Database:
             {"tenant_id": tenant_id, "session_id": session_id},
             {"_id": 0},
         ).sort("created_at", 1)
-        return await cursor.to_list(length=None)
+        return await cursor.to_list(length=5000)  # Cap: session history
     async def get_unarchived_turns(
         self, tenant_id: str, session_id: str
@@ -611,7 +611,7 @@ class Database:
             },
             {"_id": 0},
         ).sort("created_at", 1)
-        return await cursor.to_list(length=None)
+        return await cursor.to_list(length=5000)  # Cap: summarization batch
     async def archive_turns(
         self, tenant_id: str, session_id: str, turn_ids: list[str]
@@ -645,7 +645,7 @@ class Database:
             {"deleted_at": {"$lte": cutoff}},
             {"session_id": 1, "tenant_id": 1, "_id": 0},
         )
-        return await cursor.to_list(length=None)
+        return await cursor.to_list(length=1000)  # Cap: purge batch
     # -----------------------------------------------------------------------
     # Lifecycle
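
Note on the `to_list` caps above: replacing `length=None` with a fixed number bounds memory use, but a caller cannot tell a complete result from a truncated one when the returned length equals the cap. The sketch below illustrates that behaviour with stand-ins only: `fake_cursor` and `capped_to_list` are hypothetical and do not use Motor.

```python
import asyncio
from typing import AsyncIterator


async def fake_cursor(n: int) -> AsyncIterator[dict]:
    """Hypothetical stand-in for a database cursor yielding n documents."""
    for i in range(n):
        yield {"turn_id": i}


async def capped_to_list(cursor: AsyncIterator[dict], length: int) -> list[dict]:
    """Collect at most `length` documents, like a capped to_list call."""
    out: list[dict] = []
    if length <= 0:
        return out
    async for doc in cursor:
        out.append(doc)
        if len(out) >= length:
            break  # remaining documents are silently left unread
    return out


async def main() -> tuple[int, int]:
    small = await capped_to_list(fake_cursor(3), length=5000)
    large = await capped_to_list(fake_cursor(5003), length=5000)
    return len(small), len(large)


print(asyncio.run(main()))  # (3, 5000)
```

A session with more than 5000 turns would, on this reading, be summarized from its first 5000 turns by `created_at`; whether that matters depends on workloads the diff does not describe.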

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/embeddings.py RENAMED

@@ -93,7 +93,7 @@ class EmbeddingEngine:
         # Stable json dump
         cfg_str = json.dumps(config, sort_keys=True)
-        return hashlib.sha1(cfg_str.encode("utf-8")).hexdigest()[:10]
+        return hashlib.sha256(cfg_str.encode("utf-8")).hexdigest()[:10]
     @property
     def dim(self) -> int:
@@ -108,7 +108,7 @@ class EmbeddingEngine:
                 return self._dim
             fp = self.get_fingerprint()
-            cache_key = f"cleanrag:embed_dim:{fp}"
+            cache_key = f"longparser:embed_dim:{fp}"
             # 1) Try Redis cross-process cache if available
             try:
@@ -145,8 +145,8 @@ class EmbeddingEngine:
             try:
                 if 'r' in locals():
                     r.set(cache_key, self._dim)
-            except Exception:
-                pass
+            except Exception as e:
+                logger.debug(f"Failed to set Redis cache: {e}")
             return self._dim
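
Note on the fingerprint hunk above: the pattern is a truncated digest over a key-sorted JSON dump, so the same configuration yields the same 10-character fingerprint regardless of key order. A minimal sketch, assuming a plain dict config; `config_fingerprint` is a hypothetical name and the config values are made up for illustration.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hypothetical helper following the pattern in the hunk above."""
    # sort_keys=True makes the serialization independent of insertion order.
    cfg_str = json.dumps(config, sort_keys=True)
    return hashlib.sha256(cfg_str.encode("utf-8")).hexdigest()[:10]


a = config_fingerprint({"provider": "huggingface", "model": "example-model"})
b = config_fingerprint({"model": "example-model", "provider": "huggingface"})

print(a == b)  # True
print(len(a))  # 10
```

Because the digest function changed from SHA-1 to SHA-256, fingerprints computed by 0.1.1 and 0.1.3 for the same config will in general differ, so anything keyed by the old value (such as the `embed_dim` cache entry, whose prefix also changed) would presumably be recomputed after upgrading.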

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/queue.py RENAMED

@@ -45,12 +45,7 @@ class ARQBackend(QueueBackend):
             from arq import create_pool
             from arq.connections import RedisSettings
-            url = self.redis_url.replace("redis://", "")
-            # Strip database number (e.g., /0) if present
-            url = url.split("/")[0]
-            host, _, port_str = url.partition(":")
-            port = int(port_str) if port_str else 6379
-            self._pool = await create_pool(RedisSettings(host=host, port=port))
+            self._pool = await create_pool(RedisSettings.from_dsn(self.redis_url))
         return self._pool
     async def enqueue(self, task_name: str, payload: dict) -> str:
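
Note on the hunk above: the removed parsing kept only host and port, so the database index was dropped and a URL carrying credentials could not be parsed at all. The sketch below reproduces the removed lines as `legacy_host_port` and contrasts them with a standard-library parse. `dsn_fields` is a hypothetical helper for illustration and is not how `RedisSettings.from_dsn` is implemented; the host name and password are made up.

```python
from urllib.parse import urlparse


def legacy_host_port(redis_url: str) -> tuple[str, int]:
    """Reproduces the parsing removed from ARQBackend in this diff."""
    url = redis_url.replace("redis://", "")
    url = url.split("/")[0]  # drops the database number
    host, _, port_str = url.partition(":")
    return host, int(port_str) if port_str else 6379


def dsn_fields(redis_url: str) -> dict:
    """Hypothetical helper: fields a Redis DSN can carry (stdlib only)."""
    parts = urlparse(redis_url)
    db = parts.path.lstrip("/")
    return {
        "host": parts.hostname or "localhost",
        "port": parts.port or 6379,
        "password": parts.password,
        "database": int(db) if db else 0,
        "ssl": parts.scheme == "rediss",
    }


print(legacy_host_port("redis://localhost:6379/0"))  # ('localhost', 6379)
print(dsn_fields("redis://:secret@cache.internal:6380/2"))

try:
    legacy_host_port("redis://:secret@cache.internal:6380/2")
except ValueError as exc:
    # partition(":") splits at the colon before the password, so int() fails.
    print("legacy parser fails on credentials:", type(exc).__name__)
```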

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/vectorstores.py RENAMED

@@ -64,7 +64,7 @@ class ChromaStore(BaseVectorStore):
             import chromadb
         except ImportError:
             raise ImportError(
-                "chromadb is required. Install: pip install clean_rag[chroma]"
+                "chromadb is required. Install: pip install longparser[chroma]"
             )
         # Securely isolate vector spaces based on model config
@@ -125,8 +125,8 @@ class ChromaStore(BaseVectorStore):
                     if isinstance(v, str) and v.startswith("["):
                         try:
                             meta[k] = json.loads(v)
-                        except (json.JSONDecodeError, ValueError):
-                            pass
+                        except (json.JSONDecodeError, ValueError) as e:
+                            logger.debug(f"Failed to decode JSON list from Chroma metadata: {e}")
                 output.append({
                     "id": vid,
                     "score": 1.0 - (results["distances"][0][i] if results["distances"] else 0),
@@ -165,7 +165,7 @@ class FAISSStore(BaseVectorStore):
             import faiss  # noqa: F401
         except ImportError:
             raise ImportError(
-                "faiss-cpu is required. Install: pip install clean_rag[faiss]"
+                "faiss-cpu is required. Install: pip install longparser[faiss-cpu]"
             )
         self.base_dir = Path(base_dir)
@@ -297,7 +297,7 @@ class QdrantStore(BaseVectorStore):
             from qdrant_client.models import Distance, VectorParams
         except ImportError:
             raise ImportError(
-                "qdrant-client is required. Install: pip install clean_rag[qdrant]"
+                "qdrant-client is required. Install: pip install longparser[qdrant]"
             )
         self.client = QdrantClient(url=url)
@@ -319,7 +319,7 @@ class QdrantStore(BaseVectorStore):
             if existing_dim != dim:
                 # Mismatch — create new collection with hash suffix
                 import hashlib
-                suffix = hashlib.md5(f"{dim}".encode()).hexdigest()[:8]
+                suffix = hashlib.sha256(f"{dim}".encode()).hexdigest()[:8]
                 self.collection_name = f"{self.collection_name}_{suffix}"
                 logger.warning(
                     f"QdrantStore: dim mismatch, using collection: {self.collection_name}"
@@ -382,8 +382,8 @@ class QdrantStore(BaseVectorStore):
                 if isinstance(v, str) and v.startswith("["):
                     try:
                         payload[k] = json.loads(v)
-                    except (json.JSONDecodeError, ValueError):
-                        pass
+                    except (json.JSONDecodeError, ValueError) as e:
+                        logger.debug(f"Failed to decode JSON list from Qdrant metadata: {e}")
             output.append({
                 "id": payload.get("vector_id", ""),
                 "score": hit.score,

{longparser-0.1.1 → longparser-0.1.3}/src/longparser/server/worker.py RENAMED

@@ -258,8 +258,8 @@ async def summarize_session(ctx: dict, tenant_id: str, session_id: str) -> dict:
       4. Archive summarized turns
     """
     from .db import Database
-    from .schemas import ChatConfig
-    from .llm_chain import get_plain_chat_model
+    from .chat.schemas import ChatConfig
+    from .chat.llm_chain import get_plain_chat_model
     from langchain_core.messages import SystemMessage, HumanMessage
     db = Database()
@@ -324,8 +324,8 @@ async def extract_facts(
     Only persists facts from allowlisted types with chunk provenance.
     """
     from .db import Database
-    from .schemas import ChatConfig, FactSourceType
-    from .llm_chain import get_chat_model
+    from .chat.schemas import ChatConfig, FactSourceType
+    from .chat.llm_chain import get_chat_model
     from langchain_core.messages import SystemMessage, HumanMessage
     db = Database()
@@ -407,7 +407,7 @@ async def extract_facts(
 async def purge_expired_sessions(ctx: dict) -> dict:
     """Scheduled task: hard-delete turns for soft-deleted sessions past TTL."""
     from .db import Database
-    from .schemas import ChatConfig
+    from .chat.schemas import ChatConfig
     db = Database()
     config = ChatConfig()

{longparser-0.1.1 → longparser-0.1.3}/src/longparser.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: longparser
-Version: 0.1.1
+Version: 0.1.3
 Summary: Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.
 Author-email: ENDEVSOLS Team <technology@endevsols.com>
 License-Expression: MIT
@@ -27,6 +27,7 @@ Description-Content-Type: text/markdown
 Requires-Dist: pydantic<3,>=2.0
 Requires-Dist: docling>=2.14
 Requires-Dist: docling-core>=2.13
+Requires-Dist: langgraph-checkpoint-mongodb>=0.3.1
 Provides-Extra: pptx
 Requires-Dist: python-pptx>=1.0; extra == "pptx"
 Provides-Extra: langchain
@@ -109,8 +110,7 @@ Requires-Dist: httpx>=0.27; extra == "dev"
 Requires-Dist: anyio>=4.0; extra == "dev"
 <p align="center">
-  <!-- Logo goes here once ready -->
-  <h1 align="center">LongParser</h1>
+  <img src="https://raw.githubusercontent.com/ENDEVSOLS/LongParser/main/docs/assets/logo.png" alt="LongParser" width="320">
   <p align="center"><strong>Privacy-first document intelligence engine for production RAG pipelines.</strong></p>
   <p align="center">
     Parse PDFs, DOCX, PPTX, XLSX &amp; CSV → validated, AI-ready chunks with HITL review.
@@ -129,7 +129,7 @@ Requires-Dist: anyio>=4.0; extra == "dev"
       <img src="https://static.pepy.tech/badge/longparser/month" alt="Monthly Downloads">
     </a>
     <a href="https://www.python.org/">
-      <img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue.svg" alt="Python">
+      <img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue.svg" alt="Python">
     </a>
     <a href="LICENSE">
       <img src="https://img.shields.io/badge/License-MIT-brightgreen.svg" alt="MIT License">
@@ -150,11 +150,12 @@ Requires-Dist: anyio>=4.0; extra == "dev"
 | **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling |
 | **Hybrid chunking** | Token-aware, heading-hierarchy-aware, table-aware |
 | **HITL review** | Human-in-the-Loop block & chunk editing before embedding |
-| **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` |
+| **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` and MongoDB checkpointer |
 | **3-layer memory** | Short-term turns + rolling summary + long-term facts |
 | **Multi-provider LLM** | OpenAI, Gemini, Groq, OpenRouter |
 | **Multi-backend vectors** | Chroma, FAISS, Qdrant |
-| **Async-first API** | FastAPI + Motor (MongoDB) + ARQ (Redis) |
+| **Production-ready API** | FastAPI + Motor (MongoDB) + ARQ + Redis (Queue & Rate Limiting) |
+| **Enterprise Security** | Tenant isolation, Role-Based Access Control (RBAC), and CORS |
 | **LangChain adapters** | Drop-in `BaseRetriever` and LlamaIndex `QueryEngine` |
 | **Privacy-first** | All processing runs locally; no data leaves your infra |
@@ -215,9 +216,9 @@ pip install "longparser[cpu]"
 ### Python SDK
 ```python
-from longparser import PipelineOrchestrator, ProcessingConfig
+from longparser import DocumentPipeline, ProcessingConfig
-pipeline = PipelineOrchestrator()
+pipeline = DocumentPipeline(ProcessingConfig())
 result = pipeline.process_file("document.pdf")
 print(f"Pages: {result.document.metadata.total_pages}")
@@ -296,7 +297,7 @@ src/longparser/
 ├── schemas.py           ← core Pydantic models (Document, Block, Chunk, …)
 ├── extractors/          ← Docling, LaTeX OCR backends
 ├── chunkers/            ← HybridChunker
-├── pipeline/            ← PipelineOrchestrator
+├── pipeline/            ← DocumentPipeline
 ├── integrations/        ← LangChain loader & LlamaIndex reader
 ├── utils/               ← shared helpers (RTL detection, …)
 └── server/              ← REST API layer
@@ -344,11 +345,14 @@ Copy `.env.example` to `.env` and set:
 | Variable | Default | Description |
 |----------|---------|-------------|
 | `LONGPARSER_MONGO_URL` | `mongodb://localhost:27017` | MongoDB connection |
-| `LONGPARSER_REDIS_URL` | `redis://localhost:6379` | Redis for job queue |
+| `LONGPARSER_REDIS_URL` | `redis://localhost:6379` | Redis for job queue & rate limits |
 | `LONGPARSER_LLM_PROVIDER` | `openai` | LLM provider |
-| `LONGPARSER_LLM_MODEL` | `gpt-4o` | Model name |
+| `LONGPARSER_LLM_MODEL` | `gpt-5.3` | Model name |
 | `LONGPARSER_EMBED_PROVIDER` | `huggingface` | Embedding provider |
 | `LONGPARSER_VECTOR_DB` | `chroma` | Vector store backend |
+| `LONGPARSER_CORS_ORIGINS` | `*` | Allowed CORS origins |
+| `LONGPARSER_RATE_LIMIT` | `60` | Max RPM per tenant |
+| `LONGPARSER_ADMIN_KEYS` | (empty) | Comma-separated admin API keys |
 ---

{longparser-0.1.1 → longparser-0.1.3}/src/longparser.egg-info/SOURCES.txt RENAMED

@@ -29,6 +29,7 @@ src/longparser/server/vectorstores.py
 src/longparser/server/worker.py
 src/longparser/server/chat/__init__.py
 src/longparser/server/chat/callbacks.py
+src/longparser/server/chat/checkpointer.py
 src/longparser/server/chat/engine.py
 src/longparser/server/chat/graph.py
 src/longparser/server/chat/llm_chain.py