PyPI - docintel-platform - Versions diffs - 1.2.0__tar.gz → 1.3.0__tar.gz - Mend

docintel-platform 1.2.0tar.gz → 1.3.0tar.gz


Files changed (129)

{docintel_platform-1.2.0/src/docintel_platform.egg-info → docintel_platform-1.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: docintel-platform
-Version: 1.2.0
+Version: 1.3.0
 Summary: Enterprise document intelligence API for PDF compliance, multi-format extraction, structuring, and summarization.
 Author: Babandeep Singh
 License-Expression: MIT
@@ -40,9 +40,11 @@ Requires-Dist: fakeredis>=2.26.2; extra == "dev"
 Requires-Dist: prometheus-client>=0.21.0; extra == "dev"
 Requires-Dist: python-docx>=1.1.2; extra == "dev"
 Requires-Dist: openpyxl>=3.1.5; extra == "dev"
+Requires-Dist: python-pptx>=1.0.2; extra == "dev"
 Provides-Extra: documents
 Requires-Dist: python-docx>=1.1.2; extra == "documents"
 Requires-Dist: openpyxl>=3.1.5; extra == "documents"
+Requires-Dist: python-pptx>=1.0.2; extra == "documents"
 Provides-Extra: ocr
 Requires-Dist: easyocr>=1.7.2; extra == "ocr"
 Requires-Dist: presidio-analyzer>=2.2.354; extra == "ocr"
@@ -80,6 +82,7 @@ Requires-Dist: cryptography>=43.0.0; extra == "all"
 Requires-Dist: gradio>=4.44.0; extra == "all"
 Requires-Dist: python-docx>=1.1.2; extra == "all"
 Requires-Dist: openpyxl>=3.1.5; extra == "all"
+Requires-Dist: python-pptx>=1.0.2; extra == "all"
 # Document Intelligence Platform
@@ -90,7 +93,7 @@ Requires-Dist: openpyxl>=3.1.5; extra == "all"
 Enterprise document intelligence API: PDF compliance (OCR, PII, redaction), LLM structuring, and multi-format text workflows (Word, Excel, CSV, plain text).
-**Version:** 1.2.0 | **PyPI:** [docintel-platform](https://pypi.org/project/docintel-platform/)
+**Version:** 1.3.0 | **PyPI:** [docintel-platform](https://pypi.org/project/docintel-platform/)
 ---
@@ -119,7 +122,7 @@ Gradio includes a **Document process** tab (unified pipeline). It needs the API
 ```bash
 pip install docintel-platform
 pip install "docintel-platform[all]"        # OCR, LLM, jobs, auth, UI, office formats
-pip install "docintel-platform[documents]"  # Word and Excel only
+pip install "docintel-platform[documents]"  # Word, Excel, and PowerPoint
 ```
 **Python client:**
@@ -141,13 +144,13 @@ report = client.process_document("policy.docx", include_pii=True)
 | PDF annotate | `POST /v1/pdf/annotate` | Regex highlight, redact, markup |
 | PDF PII scan | `POST /v1/pdf/detect-sensitive` | Presidio + OCR for scanned PDFs |
 | PDF structure | `POST /v1/pdf/structure` | OCR + LLM curated PDF (needs LLM key) |
-| Documents | `POST /v1/documents/*` | Identify, extract, classify, summarize, PII, compare, **process** |
+| Documents | `POST /v1/documents/*` | Identify, extract, classify, summarize, PII, compare, **process**, **ingest** (S3) |
 | Text | `POST /v1/text/summarize` | TextRank extractive summary |
 | Batch | `POST /v1/batch` | Async summarize, classify, detect_pii, process |
 | Jobs | `GET /v1/jobs/{id}` | Poll async work (`?async=true`; default in Docker when Redis is up) |
 | Ops | `GET /health`, `GET /metrics` | Health and Prometheus-friendly metrics |
-**Supported uploads (text workflows):** PDF, DOCX, XLSX, CSV, JSON, TXT, MD.
+**Supported uploads (text workflows):** PDF, DOCX, XLSX, PPTX, CSV, JSON, TXT, MD.
 **PDF-only routes** (annotate, sensitive, structure) return HTTP 415 for other types. Use `/v1/documents/extract-text` or `/v1/documents/process` for office files.
@@ -199,6 +202,7 @@ Copy `.env.example` to `.env` for `DOCINTEL_LLM_API_KEY`, auth keys, Redis, and
 | [docs/PLATFORM.md](docs/PLATFORM.md) | Jobs, auth, storage, ops layout |
 | [docs/PRODUCTION.md](docs/PRODUCTION.md) | Checklist, latency, failure modes |
 | [docs/ROADMAP.md](docs/ROADMAP.md) | Milestones and history |
+| [docs/WEBHOOKS.md](docs/WEBHOOKS.md) | Async callbacks and S3 ingest |
 | [docs/adr/](docs/adr/) | Architecture decision records |
 ---

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/README.md RENAMED

@@ -7,7 +7,7 @@
 Enterprise document intelligence API: PDF compliance (OCR, PII, redaction), LLM structuring, and multi-format text workflows (Word, Excel, CSV, plain text).
-**Version:** 1.2.0 | **PyPI:** [docintel-platform](https://pypi.org/project/docintel-platform/)
+**Version:** 1.3.0 | **PyPI:** [docintel-platform](https://pypi.org/project/docintel-platform/)
 ---
@@ -36,7 +36,7 @@ Gradio includes a **Document process** tab (unified pipeline). It needs the API
 ```bash
 pip install docintel-platform
 pip install "docintel-platform[all]"        # OCR, LLM, jobs, auth, UI, office formats
-pip install "docintel-platform[documents]"  # Word and Excel only
+pip install "docintel-platform[documents]"  # Word, Excel, and PowerPoint
 ```
 **Python client:**
@@ -58,13 +58,13 @@ report = client.process_document("policy.docx", include_pii=True)
 | PDF annotate | `POST /v1/pdf/annotate` | Regex highlight, redact, markup |
 | PDF PII scan | `POST /v1/pdf/detect-sensitive` | Presidio + OCR for scanned PDFs |
 | PDF structure | `POST /v1/pdf/structure` | OCR + LLM curated PDF (needs LLM key) |
-| Documents | `POST /v1/documents/*` | Identify, extract, classify, summarize, PII, compare, **process** |
+| Documents | `POST /v1/documents/*` | Identify, extract, classify, summarize, PII, compare, **process**, **ingest** (S3) |
 | Text | `POST /v1/text/summarize` | TextRank extractive summary |
 | Batch | `POST /v1/batch` | Async summarize, classify, detect_pii, process |
 | Jobs | `GET /v1/jobs/{id}` | Poll async work (`?async=true`; default in Docker when Redis is up) |
 | Ops | `GET /health`, `GET /metrics` | Health and Prometheus-friendly metrics |
-**Supported uploads (text workflows):** PDF, DOCX, XLSX, CSV, JSON, TXT, MD.
+**Supported uploads (text workflows):** PDF, DOCX, XLSX, PPTX, CSV, JSON, TXT, MD.
 **PDF-only routes** (annotate, sensitive, structure) return HTTP 415 for other types. Use `/v1/documents/extract-text` or `/v1/documents/process` for office files.
@@ -116,6 +116,7 @@ Copy `.env.example` to `.env` for `DOCINTEL_LLM_API_KEY`, auth keys, Redis, and
 | [docs/PLATFORM.md](docs/PLATFORM.md) | Jobs, auth, storage, ops layout |
 | [docs/PRODUCTION.md](docs/PRODUCTION.md) | Checklist, latency, failure modes |
 | [docs/ROADMAP.md](docs/ROADMAP.md) | Milestones and history |
+| [docs/WEBHOOKS.md](docs/WEBHOOKS.md) | Async callbacks and S3 ingest |
 | [docs/adr/](docs/adr/) | Architecture decision records |
 ---

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "docintel-platform"
-version = "1.2.0"
+version = "1.3.0"
 description = "Enterprise document intelligence API for PDF compliance, multi-format extraction, structuring, and summarization."
 readme = "README.md"
 license = "MIT"
@@ -50,10 +50,12 @@ dev = [
     "prometheus-client>=0.21.0",
     "python-docx>=1.1.2",
     "openpyxl>=3.1.5",
+    "python-pptx>=1.0.2",
 ]
 documents = [
     "python-docx>=1.1.2",
     "openpyxl>=3.1.5",
+    "python-pptx>=1.0.2",
 ]
 ocr = [
     "easyocr>=1.7.2",
@@ -98,6 +100,7 @@ all = [
     "gradio>=4.44.0",
     "python-docx>=1.1.2",
     "openpyxl>=3.1.5",
+    "python-pptx>=1.0.2",
 ]
 [project.scripts]

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/__init__.py RENAMED

@@ -2,5 +2,5 @@
 from docintel.client import DocintelClient, DocintelError
-__version__ = "1.2.0"
+__version__ = "1.3.0"
 __all__ = ["DocintelClient", "DocintelError", "__version__"]

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/capabilities/extraction/formats/extract.py RENAMED

@@ -39,6 +39,8 @@ def extract_document_text(
         return _extract_docx(file_path, resolved)
     if resolved.kind is DocumentKind.XLSX:
         return _extract_xlsx(file_path, resolved)
+    if resolved.kind is DocumentKind.PPTX:
+        return _extract_pptx(file_path, resolved)
     if resolved.kind is DocumentKind.CSV:
         return _extract_csv(file_path, resolved)
     if resolved.kind is DocumentKind.JSON:
@@ -129,6 +131,37 @@ def _extract_xlsx(path: Path, identification: IdentificationResult) -> Extractio
     )
+def _extract_pptx(path: Path, identification: IdentificationResult) -> ExtractionResult:
+    try:
+        from pptx import Presentation
+    except ImportError as exc:
+        raise RuntimeError(
+            "PowerPoint support requires optional dependencies. Install: pip install -e '.[documents]'"
+        ) from exc
+    presentation = Presentation(path)
+    segments: list[dict] = []
+    parts: list[str] = []
+    for slide_index, slide in enumerate(presentation.slides, start=1):
+        slide_parts: list[str] = []
+        for shape in slide.shapes:
+            text = getattr(shape, "text", "").strip()
+            if text:
+                slide_parts.append(text)
+        slide_text = "\n".join(slide_parts)
+        segments.append({"slide": slide_index, "text": slide_text})
+        if slide_text:
+            parts.append(f"# Slide {slide_index}\n{slide_text}")
+    return ExtractionResult(
+        kind=identification.kind,
+        mime_type=identification.mime_type,
+        text="\n\n".join(parts),
+        segments=segments,
+        metadata={"slide_count": len(presentation.slides)},
+    )
 def _extract_csv(path: Path, identification: IdentificationResult) -> ExtractionResult:
     raw = path.read_text(encoding="utf-8", errors="replace")
     sample = raw[:2048]
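
The per-slide assembly in `_extract_pptx` can be exercised without a real deck. The following is a minimal sketch, assuming duck-typed stand-ins for slides and shapes; `assemble_slide_text` is a hypothetical helper that mirrors the loop above and is not part of the package.

```python
from types import SimpleNamespace


def assemble_slide_text(slides):
    """Mirror the per-slide loop from the diff using duck-typed slides."""
    segments = []
    parts = []
    for slide_index, slide in enumerate(slides, start=1):
        slide_parts = []
        for shape in slide.shapes:
            # Shapes without a text attribute (pictures, for example) yield "".
            text = getattr(shape, "text", "").strip()
            if text:
                slide_parts.append(text)
        slide_text = "\n".join(slide_parts)
        # Every slide gets a segment, even when it carries no text.
        segments.append({"slide": slide_index, "text": slide_text})
        # Only slides with text contribute to the combined output.
        if slide_text:
            parts.append(f"# Slide {slide_index}\n{slide_text}")
    return "\n\n".join(parts), segments


# Stand-in deck: one slide with a title and a text-less shape, then an empty slide.
deck = [
    SimpleNamespace(shapes=[SimpleNamespace(text=" Q3 results "), SimpleNamespace()]),
    SimpleNamespace(shapes=[]),
]
text, segments = assemble_slide_text(deck)
```

Under these assumptions, empty slides still appear in `segments` but are left out of the combined `text`, so slide numbering in the headings stays aligned with the deck.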

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/capabilities/extraction/formats/models.py RENAMED

@@ -11,6 +11,7 @@ class DocumentKind(str, Enum):
     PDF = "pdf"
     DOCX = "docx"
     XLSX = "xlsx"
+    PPTX = "pptx"
     CSV = "csv"
     PLAIN_TEXT = "plain_text"
     JSON = "json"

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/capabilities/extraction/formats/registry.py RENAMED

@@ -31,6 +31,14 @@ _PROFILES: tuple[DocumentProfile, ...] = (
         supports_pdf_pipeline=False,
         supports_text_extraction=True,
     ),
+    DocumentProfile(
+        kind=DocumentKind.PPTX,
+        mime_type="application/vnd.openxmlformats-officedocument.presentationml.presentation",
+        extensions=(".pptx",),
+        label="PowerPoint presentation",
+        supports_pdf_pipeline=False,
+        supports_text_extraction=True,
+    ),
     DocumentProfile(
         kind=DocumentKind.CSV,
         mime_type="text/csv",

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/capabilities/extraction/formats/sniff.py RENAMED

@@ -31,6 +31,8 @@ def _sniff_zip_kind(path: Path) -> DocumentKind | None:
     if any(name.startswith("word/") for name in names):
         return DocumentKind.DOCX
+    if any(name.startswith("ppt/") for name in names):
+        return DocumentKind.PPTX
     if any(name.startswith("xl/") for name in names):
         return DocumentKind.XLSX
     return None
@@ -80,7 +82,7 @@ def _looks_like_csv(sample: str) -> bool:
 def _requires_content_confirmation(kind: DocumentKind) -> bool:
-    return kind in {DocumentKind.PDF, DocumentKind.DOCX, DocumentKind.XLSX}
+    return kind in {DocumentKind.PDF, DocumentKind.DOCX, DocumentKind.XLSX, DocumentKind.PPTX}
 def _build_result(
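
The prefix check in `_sniff_zip_kind` relies on the top-level folder names inside an Office Open XML archive. A minimal sketch of that idea, assuming in-memory archives and plain string labels in place of `DocumentKind`; `make_zip` and `sniff_zip_kind` are hypothetical names, and returning `None` for a non-zip payload is an assumption rather than the package's confirmed behavior.

```python
import io
import zipfile
from typing import Iterable, Optional


def make_zip(names: Iterable[str]) -> bytes:
    """Build an in-memory zip whose members have the given names (empty bodies)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    return buffer.getvalue()


def sniff_zip_kind(data: bytes) -> Optional[str]:
    """Classify an archive by member-name prefix, in the same order as the diff."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return None  # assumption: not a zip means no Office kind
    if any(name.startswith("word/") for name in names):
        return "docx"
    if any(name.startswith("ppt/") for name in names):
        return "pptx"
    if any(name.startswith("xl/") for name in names):
        return "xlsx"
    return None


pptx_kind = sniff_zip_kind(make_zip(["[Content_Types].xml", "ppt/presentation.xml"]))
```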

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/client.py RENAMED

@@ -416,3 +416,45 @@ class DocintelClient:
                 data=data,
                 poll=poll,
             )
+    def ingest_document_from_s3(
+        self,
+        *,
+        s3_uri: str | None = None,
+        bucket: str | None = None,
+        key: str | None = None,
+        sentences: int = 3,
+        include_summarize: bool = True,
+        include_pii: bool = True,
+        include_text: bool = False,
+        entities: str | None = None,
+        vertical: str | None = None,
+        min_score: float = 0.35,
+        callback_url: str | None = None,
+        poll: bool = True,
+    ) -> dict[str, Any]:
+        body: dict[str, Any] = {
+            "operation": "process",
+            "sentences": sentences,
+            "include_summarize": include_summarize,
+            "include_pii": include_pii,
+            "include_text": include_text,
+            "min_score": min_score,
+        }
+        if s3_uri:
+            body["s3_uri"] = s3_uri
+        if bucket:
+            body["bucket"] = bucket
+        if key:
+            body["key"] = key
+        if entities:
+            body["entities"] = entities
+        if vertical:
+            body["vertical"] = vertical
+        if callback_url:
+            body["callback_url"] = callback_url
+        return self._post_async_json(
+            "/v1/documents/ingest",
+            json_body=body,
+            poll=poll,
+        )
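
The request body that `ingest_document_from_s3` assembles always carries the pipeline flags and omits any optional field left unset. A minimal sketch of that shape, assuming a standalone function; `build_ingest_body` is a hypothetical helper for illustration, and the callback URL is a placeholder.

```python
from typing import Any, Optional


def build_ingest_body(
    *,
    s3_uri: Optional[str] = None,
    bucket: Optional[str] = None,
    key: Optional[str] = None,
    sentences: int = 3,
    include_summarize: bool = True,
    include_pii: bool = True,
    include_text: bool = False,
    entities: Optional[str] = None,
    vertical: Optional[str] = None,
    min_score: float = 0.35,
    callback_url: Optional[str] = None,
) -> dict:
    """Build the JSON body the way the client method in the diff does."""
    body: dict = {
        "operation": "process",
        "sentences": sentences,
        "include_summarize": include_summarize,
        "include_pii": include_pii,
        "include_text": include_text,
        "min_score": min_score,
    }
    optional: dict = {
        "s3_uri": s3_uri,
        "bucket": bucket,
        "key": key,
        "entities": entities,
        "vertical": vertical,
        "callback_url": callback_url,
    }
    # Falsy optional values (None or "") are left out, as in the client.
    body.update({name: value for name, value in optional.items() if value})
    return body


body = build_ingest_body(
    s3_uri="s3://my-bucket/inbox/policy.docx",
    callback_url="https://example.com/hooks/docintel",  # placeholder URL
)
```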

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/config.py RENAMED

@@ -11,6 +11,7 @@ class Config:
     LOG_LEVEL = os.getenv("DOCINTEL_LOG_LEVEL", "INFO")
     REDIS_URL = os.getenv("DOCINTEL_REDIS_URL", "redis://localhost:6379/0")
     JOBS_ENABLED = os.getenv("DOCINTEL_JOBS_ENABLED", "true").lower() == "true"
+    JOB_TTL_SECONDS = int(os.getenv("DOCINTEL_JOB_TTL_SECONDS", str(60 * 60 * 24 * 7)))
     QUEUE_NAME = os.getenv("DOCINTEL_QUEUE_NAME", "docintel")
     API_KEYS = os.getenv("DOCINTEL_API_KEYS", "")
     AUTH_REQUIRED = os.getenv("DOCINTEL_AUTH_REQUIRED", "false").lower() == "true"

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/jobs/models.py RENAMED

@@ -35,6 +35,7 @@ class JobType(str, Enum):
     DOCUMENT_DETECT_PII = "document_detect_pii"
     DOCUMENT_EXTRACT_TEXT = "document_extract_text"
     DOCUMENT_COMPARE = "document_compare"
+    DOCUMENT_S3_PROCESS = "document_s3_process"
     BATCH = "batch"

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/jobs/queue.py RENAMED

@@ -270,6 +270,25 @@ def enqueue_extract_text_job(
     )
+def enqueue_s3_document_process_job(
+    job_id: str,
+    bucket: str,
+    key: str,
+    options: dict,
+) -> None:
+    queue = get_queue()
+    queue.enqueue(
+        "docintel.jobs.tasks.run_s3_document_process_job",
+        job_id=job_id,
+        bucket=bucket,
+        key=key,
+        options=options,
+        job_timeout=900,
+        result_ttl=DEFAULT_RESULT_TTL,
+        failure_ttl=DEFAULT_FAILURE_TTL,
+    )
 def enqueue_compare_job(
     job_id: str,
     *,

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/jobs/store.py RENAMED

@@ -12,6 +12,15 @@ JOB_KEY_PREFIX = "docintel:job:"
 DEFAULT_JOB_TTL_SECONDS = 60 * 60 * 24 * 7
+def job_ttl_seconds() -> int:
+    raw = os.getenv("DOCINTEL_JOB_TTL_SECONDS", str(DEFAULT_JOB_TTL_SECONDS)).strip()
+    try:
+        ttl = int(raw)
+    except ValueError:
+        return DEFAULT_JOB_TTL_SECONDS
+    return max(ttl, 60)
 def redis_url() -> str:
     return os.getenv("DOCINTEL_REDIS_URL", "redis://localhost:6379/0").strip()
@@ -37,9 +46,10 @@ def _job_key(job_id: str) -> str:
     return f"{JOB_KEY_PREFIX}{job_id}"
-def save_job(record: JobRecord, ttl_seconds: int = DEFAULT_JOB_TTL_SECONDS) -> None:
+def save_job(record: JobRecord, ttl_seconds: int | None = None) -> None:
     client = _redis_client()
-    client.set(_job_key(record.job_id), json.dumps(record.to_dict()), ex=ttl_seconds)
+    resolved_ttl = ttl_seconds if ttl_seconds is not None else job_ttl_seconds()
+    client.set(_job_key(record.job_id), json.dumps(record.to_dict()), ex=resolved_ttl)
 def get_job(job_id: str) -> JobRecord | None:
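
The new `job_ttl_seconds` falls back to the seven-day default on a non-numeric value and clamps anything below 60 seconds. A minimal sketch of that behavior, assuming the environment is passed in as a mapping so it can be tried without touching `os.environ`; the `env` parameter is an addition for illustration only.

```python
import os
from typing import Mapping, Optional

DEFAULT_JOB_TTL_SECONDS = 60 * 60 * 24 * 7  # 604800 seconds, i.e. seven days


def job_ttl_seconds(env: Optional[Mapping[str, str]] = None) -> int:
    """Resolve the job TTL the way the diff does, reading from a mapping."""
    source = os.environ if env is None else env
    raw = source.get("DOCINTEL_JOB_TTL_SECONDS", str(DEFAULT_JOB_TTL_SECONDS)).strip()
    try:
        ttl = int(raw)
    except ValueError:
        # Non-numeric input falls back to the default instead of raising.
        return DEFAULT_JOB_TTL_SECONDS
    # Values under one minute (including negatives) are raised to 60.
    return max(ttl, 60)


one_hour = job_ttl_seconds({"DOCINTEL_JOB_TTL_SECONDS": " 3600 "})
clamped = job_ttl_seconds({"DOCINTEL_JOB_TTL_SECONDS": "5"})
fallback = job_ttl_seconds({"DOCINTEL_JOB_TTL_SECONDS": "one week"})
```

By contrast, the `Config.JOB_TTL_SECONDS` line added in `config.py` calls `int()` directly, so as written it would presumably raise `ValueError` on a non-numeric setting rather than fall back.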

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/jobs/tasks.py RENAMED

@@ -645,6 +645,45 @@ def run_compare_job(
     )
+def run_s3_document_process_job(
+    *,
+    job_id: str,
+    bucket: str,
+    key: str,
+    options: dict,
+) -> dict:
+    from docintel.storage.s3_ingest import download_s3_object_to_job_dir
+    record = get_job(job_id)
+    callback_url = record.callback_url if record else None
+    update_job(
+        job_id,
+        job_status=JobStatus.RUNNING.value,
+        progress=5,
+        progress_message="Downloading from S3",
+    )
+    try:
+        input_path, filename = download_s3_object_to_job_dir(job_id, bucket, key)
+    except Exception as exc:
+        failed = update_job(
+            job_id,
+            job_status=JobStatus.FAILED.value,
+            progress=100,
+            progress_message="Job failed",
+            error=str(exc),
+        )
+        _notify_webhook(callback_url, failed)
+        raise
+    return run_document_process_job(
+        job_id=job_id,
+        input_path=str(input_path),
+        filename=filename,
+        content_type=None,
+        options=options,
+    )
 def create_queued_job(
     job_id: str,
     *,
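
The control flow of `run_s3_document_process_job` is: attempt the download, and on failure record the error, notify the callback, then re-raise so the queue also sees the failure. A minimal sketch of that pattern with injected stand-ins; `run_with_failure_notification` and its four callables are hypothetical and replace `download_s3_object_to_job_dir`, `run_document_process_job`, `update_job`, and `_notify_webhook`.

```python
from typing import Any, Callable


def run_with_failure_notification(
    download: Callable[[], str],
    process: Callable[[str], dict],
    mark_failed: Callable[[str], dict],
    notify: Callable[[dict], Any],
) -> dict:
    """Download first; on error, record and notify, then re-raise."""
    try:
        path = download()
    except Exception as exc:
        failed = mark_failed(str(exc))
        notify(failed)
        raise  # the original exception still propagates to the caller
    return process(path)


events: list = []


def failing_download() -> str:
    raise FileNotFoundError("no such key")


try:
    run_with_failure_notification(
        failing_download,
        process=lambda path: {"status": "done", "path": path},
        mark_failed=lambda message: {"job_status": "failed", "error": message},
        notify=events.append,
    )
except FileNotFoundError:
    propagated = True
else:
    propagated = False
```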

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/openapi/openapi.yaml RENAMED

@@ -541,6 +541,66 @@ paths:
         "503":
           description: Presidio stack unavailable
+  /v1/documents/ingest:
+    post:
+      tags: [documents]
+      summary: Queue process pipeline for an S3 object
+      description: |
+        Downloads the object in the worker, then runs the same pipeline as
+        POST /v1/documents/process. Always returns 202 when Redis is available.
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              type: object
+              required: [operation]
+              properties:
+                s3_uri:
+                  type: string
+                  example: s3://my-bucket/inbox/policy.docx
+                bucket:
+                  type: string
+                key:
+                  type: string
+                operation:
+                  type: string
+                  enum: [process]
+                  default: process
+                sentences:
+                  type: integer
+                  minimum: 1
+                  maximum: 20
+                include_summarize:
+                  type: boolean
+                  default: true
+                include_pii:
+                  type: boolean
+                  default: true
+                include_text:
+                  type: boolean
+                  default: false
+                vertical:
+                  type: string
+                entities:
+                  type: string
+                min_score:
+                  type: number
+                callback_url:
+                  type: string
+                  format: uri
+      responses:
+        "202":
+          description: S3 ingest job queued
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/AsyncAccepted"
+        "400":
+          description: Invalid S3 location or options
+        "503":
+          description: Async jobs unavailable
   /v1/documents/process:
     post:
       tags: [documents]
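
At the wire level, the `/v1/documents/ingest` schema above describes a JSON `POST`. A minimal sketch that builds such a request with the standard library without sending it; the base URL is an assumption, and any authentication header the deployment requires is omitted because the diff does not show it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: local development server

payload = {
    "operation": "process",
    "s3_uri": "s3://my-bucket/inbox/policy.docx",
    "sentences": 3,
    "include_pii": True,
}

request = urllib.request.Request(
    f"{BASE_URL}/v1/documents/ingest",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Passing `request` to urllib.request.urlopen would send it. Per the schema,
# a 202 response indicates the job was queued and 503 that async jobs are
# unavailable.
```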

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/routes/documents.py RENAMED

@@ -182,6 +182,60 @@ def _parse_process_options() -> tuple[ProcessOptions | None, dict | None, int |
     )
+def _parse_process_options_from_dict(
+    payload: dict,
+) -> tuple[ProcessOptions | None, dict | None, int | None]:
+    raw_sentences = payload.get("sentences", DEFAULT_SENTENCE_COUNT)
+    try:
+        sentences = int(raw_sentences)
+    except (TypeError, ValueError):
+        return None, {"error": "Field 'sentences' must be an integer."}, 400
+    if sentences < 1 or sentences > MAX_SENTENCE_COUNT:
+        return None, {
+            "error": f"Field 'sentences' must be between 1 and {MAX_SENTENCE_COUNT}."
+        }, 400
+    vertical = payload.get("vertical", "")
+    vertical = vertical.strip() if isinstance(vertical, str) else ""
+    entities_raw = payload.get("entities")
+    try:
+        entities = _resolve_entities(
+            entities_raw if isinstance(entities_raw, str) else None,
+            vertical or None,
+        )
+    except ValueError as exc:
+        return None, {"error": str(exc)}, 400
+    raw_min_score = payload.get("min_score", 0.35)
+    try:
+        min_score = float(raw_min_score)
+    except (TypeError, ValueError):
+        return None, {"error": "Field 'min_score' must be a number."}, 400
+    def _bool_value(name: str, default: bool) -> bool:
+        if name not in payload:
+            return default
+        value = payload[name]
+        if isinstance(value, bool):
+            return value
+        if isinstance(value, str):
+            return value.strip().lower() in {"1", "true", "yes", "on"}
+        return default
+    return (
+        ProcessOptions(
+            sentences=sentences,
+            include_summarize=_bool_value("include_summarize", True),
+            include_pii=_bool_value("include_pii", True),
+            include_text=_bool_value("include_text", False),
+            entities=entities,
+            min_score=min_score,
+        ),
+        None,
+        None,
+    )
 @documents_bp.get("/types")
 @limiter.limit("120 per hour")
 def supported_document_types():
@@ -189,6 +243,47 @@ def supported_document_types():
     return jsonify({"status": "ok", "types": list_supported_types()})
+@documents_bp.post("/ingest")
+@limiter.limit("20 per hour")
+def ingest_document():
+    """Queue unified document processing for an object already stored in S3."""
+    payload = request.get_json(silent=True)
+    if not isinstance(payload, dict):
+        return jsonify({"error": "Request body must be JSON."}), 400
+    operation = str(payload.get("operation", "process")).strip().lower()
+    if operation != "process":
+        return jsonify({"error": "Only operation 'process' is supported."}), 400
+    from docintel.storage.s3_ingest import resolve_s3_location
+    try:
+        bucket, key = resolve_s3_location(payload)
+    except ValueError as exc:
+        return jsonify({"error": str(exc)}), 400
+    options, option_error, option_status = _parse_process_options_from_dict(payload)
+    if option_error is not None:
+        return jsonify(option_error), option_status
+    callback_raw = payload.get("callback_url", "")
+    callback_url = callback_raw.strip() if isinstance(callback_raw, str) and callback_raw.strip() else None
+    from docintel.jobs.models import JobType
+    from docintel.jobs.queue import enqueue_s3_document_process_job
+    job_id = uuid.uuid4().hex[:12]
+    return enqueue_background_job(
+        job_type=JobType.DOCUMENT_S3_PROCESS,
+        callback_url=callback_url,
+        enqueue_fn=enqueue_s3_document_process_job,
+        job_id=job_id,
+        bucket=bucket,
+        key=key,
+        options=options.to_dict() if options else {},
+    )
 @documents_bp.post("/identify")
 @limiter.limit("120 per hour")
 def identify_upload():
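
The nested `_bool_value` helper in `_parse_process_options_from_dict` accepts JSON booleans and a small set of truthy strings, and falls back to the default for any other type. A minimal sketch as a standalone function; `bool_value` is a hypothetical name and takes the payload explicitly instead of closing over it.

```python
from typing import Any, Mapping


def bool_value(payload: Mapping[str, Any], name: str, default: bool) -> bool:
    """Coerce a JSON field to bool using the same rules as the diff."""
    if name not in payload:
        return default
    value = payload[name]
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true", "yes", "on"}
    # Numbers, null, lists and objects are not interpreted: the default wins.
    return default


from_string = bool_value({"include_text": " Yes "}, "include_text", False)
from_number = bool_value({"include_text": 1}, "include_text", False)
```

One consequence under these rules is that a numeric `1` or `0` in the JSON body is ignored in favor of the default, so callers that want to override a flag would need to send a JSON boolean or one of the recognized strings.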

docintel_platform-1.3.0/src/docintel/storage/s3_ingest.py ADDED

@@ -0,0 +1,61 @@
+"""Download objects from S3 for async document ingest."""
+from __future__ import annotations
+import os
+import re
+from pathlib import Path
+from urllib.parse import unquote
+from werkzeug.utils import secure_filename
+_S3_URI_PATTERN = re.compile(r"^s3://([^/]+)/(.+)$")
+def parse_s3_uri(uri: str) -> tuple[str, str]:
+    """Parse s3://bucket/key into bucket and key."""
+    normalized = uri.strip()
+    match = _S3_URI_PATTERN.match(normalized)
+    if not match:
+        raise ValueError("s3_uri must look like s3://bucket/path/to/object")
+    bucket = match.group(1).strip()
+    key = unquote(match.group(2).strip())
+    if not bucket or not key:
+        raise ValueError("s3_uri must include a bucket name and object key")
+    return bucket, key
+def resolve_s3_location(payload: dict) -> tuple[str, str]:
+    """Resolve bucket and key from JSON body fields."""
+    s3_uri = payload.get("s3_uri")
+    if isinstance(s3_uri, str) and s3_uri.strip():
+        return parse_s3_uri(s3_uri)
+    bucket = payload.get("bucket")
+    key = payload.get("key")
+    if isinstance(bucket, str) and bucket.strip() and isinstance(key, str) and key.strip():
+        return bucket.strip(), key.strip()
+    raise ValueError("Provide s3_uri or both bucket and key.")
+def s3_client():
+    import boto3
+    return boto3.client(
+        "s3",
+        region_name=os.getenv("DOCINTEL_S3_REGION", "us-east-1"),
+        endpoint_url=os.getenv("DOCINTEL_S3_ENDPOINT_URL", "") or None,
+    )
+def download_s3_object_to_job_dir(job_id: str, bucket: str, key: str) -> tuple[Path, str]:
+    """Download an S3 object into the job work directory."""
+    from docintel.storage import get_storage
+    filename = secure_filename(Path(key).name) or "document.bin"
+    work_dir = get_storage().job_dir(job_id)
+    work_dir.mkdir(parents=True, exist_ok=True)
+    destination = work_dir / filename
+    s3_client().download_file(bucket, key, str(destination))
+    return destination, filename
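
`parse_s3_uri` depends only on the standard library, so its behavior can be checked in isolation. A minimal sketch that reproduces the pattern and function from the file above; the sample URIs are illustrative.

```python
import re
from urllib.parse import unquote

_S3_URI_PATTERN = re.compile(r"^s3://([^/]+)/(.+)$")


def parse_s3_uri(uri: str) -> tuple:
    """Parse s3://bucket/key into (bucket, key), percent-decoding the key."""
    normalized = uri.strip()
    match = _S3_URI_PATTERN.match(normalized)
    if not match:
        raise ValueError("s3_uri must look like s3://bucket/path/to/object")
    bucket = match.group(1).strip()
    key = unquote(match.group(2).strip())
    if not bucket or not key:
        raise ValueError("s3_uri must include a bucket name and object key")
    return bucket, key


plain = parse_s3_uri(" s3://my-bucket/inbox/policy.docx ")
encoded = parse_s3_uri("s3://my-bucket/reports/Q3%20deck.pptx")

try:
    parse_s3_uri("s3://bucket-only")
except ValueError:
    rejected = True
else:
    rejected = False
```

Because the key is percent-decoded, a URI containing `%20` resolves to a key with a literal space. Keys that themselves contain a literal percent sequence would presumably need to be passed through the separate `bucket` and `key` fields, which `resolve_s3_location` uses without decoding.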

{docintel_platform-1.2.0 → docintel_platform-1.3.0}/src/docintel/ui.py RENAMED

@@ -516,12 +516,12 @@ def build_ui():
                 outputs=summary_output,
             )
-        office_types = [".pdf", ".docx", ".xlsx", ".csv", ".txt", ".md", ".json"]
+        office_types = [".pdf", ".docx", ".xlsx", ".pptx", ".csv", ".txt", ".md", ".json"]
         with gr.Tab("Document process"):
             gr.Markdown(
                 "Run extract, classify, summarize, and PII detection in one async job. "
                 "Requires Redis and a worker (`make run-worker` or docker-compose worker). "
-                "Word and Excel need `pip install -e '.[documents]'` on the API server."
+                "Office formats need `pip install -e '.[documents]'` on the API server (Word, Excel, PowerPoint)."
             )
             from docintel.capabilities.compliance.presets import list_vertical_presets
@@ -563,7 +563,7 @@ def build_ui():
         with gr.Tab("Document tools"):
             gr.Markdown(
                 "Identify, extract, classify, summarize, scan for PII, and compare office documents. "
-                "Requires `pip install -e '.[documents]'` for Word and Excel."
+                "Requires `pip install -e '.[documents]'` for Word, Excel, and PowerPoint."
             )
             with gr.Row():
                 doc_file = gr.File(label="Document upload", file_types=office_types)
