poma 0.1.2__tar.gz → 0.2.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- poma-0.2.1/.github/workflows/python-publish.yml +71 -0
- poma-0.2.1/.gitignore +56 -0
- {poma-0.1.2 → poma-0.2.1}/PKG-INFO +4 -4
- {poma-0.1.2 → poma-0.2.1}/README.md +3 -3
- {poma-0.1.2 → poma-0.2.1}/poma/client.py +73 -59
- {poma-0.1.2 → poma-0.2.1}/poma/exceptions.py +1 -0
- {poma-0.1.2 → poma-0.2.1}/poma/integrations/langchain_poma.py +7 -17
- {poma-0.1.2 → poma-0.2.1}/poma/integrations/llamaindex_poma.py +7 -16
- poma-0.2.1/poma/retrieval.py +339 -0
- {poma-0.1.2 → poma-0.2.1}/poma.egg-info/PKG-INFO +4 -4
- {poma-0.1.2 → poma-0.2.1}/poma.egg-info/SOURCES.txt +2 -0
- {poma-0.1.2 → poma-0.2.1}/pyproject.toml +2 -2
- poma-0.1.2/poma/retrieval.py +0 -176
- {poma-0.1.2 → poma-0.2.1}/LICENSE +0 -0
- {poma-0.1.2 → poma-0.2.1}/poma/__init__.py +0 -0
- {poma-0.1.2 → poma-0.2.1}/poma/integrations/__init__.py +0 -0
- {poma-0.1.2 → poma-0.2.1}/poma.egg-info/dependency_links.txt +0 -0
- {poma-0.1.2 → poma-0.2.1}/poma.egg-info/requires.txt +0 -0
- {poma-0.1.2 → poma-0.2.1}/poma.egg-info/top_level.txt +0 -0
- {poma-0.1.2 → poma-0.2.1}/setup.cfg +0 -0

poma-0.2.1/.github/workflows/python-publish.yml
ADDED

@@ -0,0 +1,71 @@
+# This workflow will upload a Python Package to PyPI when a release is created
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries
+
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+
+name: Upload Python Package
+
+on:
+  push:
+    tags: [ '*.*.*' ]
+
+permissions:
+  contents: read
+
+jobs:
+  release-build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.x"
+
+      - name: Build release distributions
+        run: |
+          # NOTE: put your own distribution build steps here.
+          python -m pip install --upgrade pip
+          python -m pip install --upgrade build
+          python -m build
+
+      - name: Upload distributions
+        uses: actions/upload-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+  pypi-publish:
+    runs-on: ubuntu-latest
+    needs:
+      - release-build
+    permissions:
+      # IMPORTANT: this permission is mandatory for trusted publishing
+      id-token: write
+
+    # Dedicated environments with protections for publishing are strongly recommended.
+    # For more information, see: https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment#deployment-protection-rules
+    environment:
+      name: pypi
+      # OPTIONAL: uncomment and update to include your PyPI project URL in the deployment status:
+      # url: https://pypi.org/p/YOURPROJECT
+      #
+      # ALTERNATIVE: if your GitHub Release name is the PyPI project version string
+      # ALTERNATIVE: exactly, uncomment the following line instead:
+      # url: https://pypi.org/project/YOURPROJECT/${{ github.event.release.name }}
+
+    steps:
+      - name: Retrieve release distributions
+        uses: actions/download-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+      - name: Publish release distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          packages-dir: dist/

poma-0.2.1/.gitignore
ADDED

@@ -0,0 +1,56 @@
+# -------------------------------------------------------
+# Python bytecode
+# -------------------------------------------------------
+__pycache__/
+*.py[cod]
+*$py.class
+
+# -------------------------------------------------------
+# Virtualenv (Windows + Unix)
+# -------------------------------------------------------
+.venv/
+venv/
+env/
+.poma-sdk-venv/
+ENV/
+env.bak/
+.idea/
+
+# -------------------------------------------------------
+# VS Code
+# -------------------------------------------------------
+.vscode/
+
+# -------------------------------------------------------
+# Pytest
+# -------------------------------------------------------
+.cache/
+.pytest_cache/
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+
+# -------------------------------------------------------
+# Build / Dist
+# -------------------------------------------------------
+build/
+dist/
+*.egg-info/
+*.egg
+.eggs/
+
+# -------------------------------------------------------
+# Environment files
+# -------------------------------------------------------
+.env
+.env.*
+*.env
+
+# -------------------------------------------------------
+# Mac / Windows
+# -------------------------------------------------------
+.DS_Store
+Thumbs.db
+desktop.ini

{poma-0.1.2 → poma-0.2.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: poma
-Version: 0.1.2
+Version: 0.2.1
 Summary: Official Python SDK for the Poma document-processing API
 Author-email: "POMA AI GmbH, Berlin" <sdk@poma-ai.com>
 License-Expression: MPL-2.0
@@ -40,15 +40,15 @@ pip install poma
 
 For integrations into LangChain and LlamaIndex:
 ```bash
-pip install poma[integrations]
+pip install 'poma[integrations]'
 # Or LangChain/LlamaIndex including example extras:
-pip install poma[integration-examples]
+pip install 'poma[integration-examples]'
 ```
 
 
 - You may also want: `pip install python-dotenv` to load API keys from a .env file.
 - API keys required (POMA_API_KEY) for the POMA AI client via environment variables.
-- **To request a POMA_API_KEY, please contact us at
+- **To request a POMA_API_KEY, please contact us at sdk@poma-ai.com**
 
 
 ### Example Implementations — all examples, integrations, and additional information can be found in our GitHub repository: [poma-ai/poma](https://github.com/poma-ai/)

{poma-0.1.2 → poma-0.2.1}/README.md

@@ -12,15 +12,15 @@ pip install poma
 
 For integrations into LangChain and LlamaIndex:
 ```bash
-pip install poma[integrations]
+pip install 'poma[integrations]'
 # Or LangChain/LlamaIndex including example extras:
-pip install poma[integration-examples]
+pip install 'poma[integration-examples]'
 ```
 
 
 - You may also want: `pip install python-dotenv` to load API keys from a .env file.
 - API keys required (POMA_API_KEY) for the POMA AI client via environment variables.
-- **To request a POMA_API_KEY, please contact us at
+- **To request a POMA_API_KEY, please contact us at sdk@poma-ai.com**
 
 
 ### Example Implementations — all examples, integrations, and additional information can be found in our GitHub repository: [poma-ai/poma](https://github.com/poma-ai/)
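
Taken together with the client changes below, a minimal quickstart under these install instructions might look like the following sketch. It assumes POMA_API_KEY is already exported, and "example.poma" is a hypothetical archive path used purely for illustration:

```python
from poma import Poma

# Assumes POMA_API_KEY is set in the environment (e.g. loaded via python-dotenv);
# the constructor raises if the variable is missing.
client = Poma()

# "example.poma" is a placeholder path for a previously downloaded POMA archive.
result = client.extract_chunks_and_chunksets_from_poma_archive(
    poma_archive_path="example.poma"
)
print(len(result["chunks"]), len(result["chunksets"]))
```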

{poma-0.1.2 → poma-0.2.1}/poma/client.py

@@ -8,18 +8,15 @@ import time
 from pathlib import Path
 from typing import Any
 
-from poma.exceptions import
-
+from poma.exceptions import (
+    AuthenticationError,
+    RemoteServerError,
+    InvalidResponseError,
+)
+from poma.retrieval import generate_cheatsheets
 
-USER_AGENT = "poma-ai-sdk/0.1.0"
 
-
-    ".txt",
-    ".md",
-    ".html",
-    ".htm",
-    ".pdf",
-}
+USER_AGENT = "poma-ai-sdk/0.1.0"
 
 API_BASE_URL = "https://api.poma-ai.com/api/v1"
 
@@ -51,10 +48,13 @@ class Poma:
         # Override API base URL if environment variable is set
         if os.environ.get("API_BASE_URL"):
             api_base_url = os.environ.get("API_BASE_URL")
+        if not api_base_url:
+            raise ValueError("API base URL cannot be empty.")
 
         self.base_api_url = api_base_url.rstrip("/")
-        self._client = client or httpx.Client(
-
+        self._client = client or httpx.Client(
+            timeout=timeout, headers={"user-agent": USER_AGENT}
+        )
         if not (api_key := api_key or os.environ.get("POMA_API_KEY", "")):
             raise Exception("POMA_API_KEY environment variable not set.")
         self._client.headers.update({"Authorization": f"Bearer {api_key}"})
@@ -77,17 +77,12 @@ class Poma:
         """
         if not file_path or not isinstance(file_path, os.PathLike):
             raise ValueError("file_path must be a non-empty os.PathLike.")
-        file_extension = Path(file_path).suffix.lower()
-        if file_extension not in ALLOWED_FILE_EXTENSIONS:
-            raise InvalidInputError(
-                f"File extension of {file_path} is not allowed; use one of the following types: {', '.join(sorted(ALLOWED_FILE_EXTENSIONS))}."
-            )
         payload = {}
         if base_url:
             payload["base_url"] = base_url
         try:
             response = self._client.post(
-                f"{self.base_api_url}/
+                f"{self.base_api_url}/ingest",
                 data=payload,
                 files={
                     "file": (Path(file_path).name, Path(file_path).read_bytes()),
@@ -97,12 +92,18 @@ class Poma:
         except httpx.HTTPStatusError as error:
             status = error.response.status_code
             if status in (401, 403):
-                raise AuthenticationError(
-
+                raise AuthenticationError(
+                    f"Failed to submit file '{file_path}': authentication error"
+                ) from error
+            raise RemoteServerError(
+                f"Failed to submit file '{file_path}': {status}"
+            ) from error
         try:
             data = response.json()
         except ValueError as error:
-            raise InvalidResponseError(
+            raise InvalidResponseError(
+                "Server returned non-JSON or empty body"
+            ) from error
         return data
 
     def get_chunk_result(
@@ -149,19 +150,25 @@ class Poma:
             download = data.get("download", {})
             download_url = download.get("download_url", "")
             if not download_url:
-                raise RuntimeError(
+                raise RuntimeError(
+                    "Failed to receive download URL from server."
+                )
 
             if download_dir is None:
                 # Return bytes content instead of saving to file
                 file_bytes = self.download_bytes(download_url)
-                return self.extract_chunks_and_chunksets_from_poma_archive(
+                return self.extract_chunks_and_chunksets_from_poma_archive(
+                    poma_archive_data=file_bytes
+                )
             else:
                 # Save downloaded file to directory
                 filename = download.get("filename", "downloaded_file.poma")
-                downloaded_file_path = self.download_file(
-
-
-                return self.extract_chunks_and_chunksets_from_poma_archive(
+                downloaded_file_path = self.download_file(
+                    download_url, filename, save_directory=download_dir
+                )
+                return self.extract_chunks_and_chunksets_from_poma_archive(
+                    poma_archive_path=downloaded_file_path
+                )
         elif status == "failed":
             error_code = data.get("code", "unknown")
             error_details = data.get("error", "No details provided.")
@@ -197,33 +204,29 @@ class Poma:
         Returns:
             dict: A dictionary containing the chunks and chunksets.
         """
-        # Sanity check for parameters
-        if not poma_archive_data and not poma_archive_path:
-            raise ValueError("Either poma_archive_data or poma_archive_path must be provided.")
 
         # Load the chunks and chunksets from POMA archive
         chunks = None
         chunksets = None
         if poma_archive_path:
             with zipfile.ZipFile(poma_archive_path, "r") as zip_ref:
-                chunks = zip_ref.read(
-                chunksets = zip_ref.read(
-
+                chunks = zip_ref.read("chunks.json")
+                chunksets = zip_ref.read("chunksets.json")
+        elif poma_archive_data:
             with zipfile.ZipFile(io.BytesIO(poma_archive_data), "r") as zip_ref:
-                chunks = zip_ref.read(
-                chunksets = zip_ref.read(
+                chunks = zip_ref.read("chunks.json")
+                chunksets = zip_ref.read("chunksets.json")
+        else:
+            raise ValueError(
+                "Either poma_archive_data or poma_archive_path must be provided."
+            )
 
         # Sanity check
         if not chunks or not chunksets:
-            raise KeyError(
-                "Result must contain 'chunks' and 'chunksets' keys."
-            )
+            raise KeyError("Result must contain 'chunks' and 'chunksets' keys.")
 
         # Load the chunks and chunksets
-        json_result = {
-            "chunks": json.loads(chunks),
-            "chunksets": json.loads(chunksets)
-        }
+        json_result = {"chunks": json.loads(chunks), "chunksets": json.loads(chunksets)}
         return json_result
 
     def create_cheatsheet(
@@ -234,14 +237,24 @@ class Poma:
         """
         Generates a single cheatsheet for one single document
         from relevant chunksets (relevant for a certain query)
-        and from all
+        and from all chunks of that document (providing the textual content).
         Args:
             relevant_chunksets (list[dict]): A list of chunksets, each containing a "chunks" key with a list of chunk IDs.
-            all_chunks (list[dict]): A list of all
+            all_chunks (list[dict]): A list of all chunk dictionaries of the same document, each representing a chunk of content.
         Returns:
             str: The textual content of the generated cheatsheet.
         """
-
+        cheatsheets = generate_cheatsheets(relevant_chunksets, all_chunks)
+        if (
+            not cheatsheets
+            or not isinstance(cheatsheets, list)
+            or len(cheatsheets) == 0
+            or "content" not in cheatsheets[0]
+        ):
+            raise Exception(
+                "Unknown error; cheatsheet could not be created from input chunks."
+            )
+        return cheatsheets[0]["content"]
 
     def create_cheatsheets(
         self,
@@ -250,14 +263,14 @@ class Poma:
     ) -> list[dict[str, Any]]:
         """
         Generates cheatsheets from relevant chunksets (relevant for a certain query)
-        and from all
-        One cheatsheet is created for each document
+        and from all the chunks of all affected documents (providing the textual content).
+        One cheatsheet is created for each document found in the chunks (tagged with file_id).
         Args:
             relevant_chunksets (list[dict]): A list of chunksets, each containing a "chunks" key with a list of chunk IDs.
-            all_chunks (list[dict]): A list of all available chunk dictionaries, each representing a chunk of content.
+            all_chunks (list[dict]): A list of all available chunk dictionaries of affected documents, each representing a chunk of content.
         Returns:
             list[dict]: A list of dictionaries representing the generated cheatsheets, each containing:
-            - '
+            - 'file_id': The tag associated with the respective document.
             - 'content': The textual content of the generated cheatsheet.
         """
         return generate_cheatsheets(relevant_chunksets, all_chunks)
@@ -283,7 +296,7 @@ class Poma:
         """
         if not download_url:
             raise ValueError("download_url cannot be empty")
-
+
         # Determine filename
         if not filename:
             filename = Path(download_url).name or "downloaded_file"
@@ -303,7 +316,8 @@ class Poma:
         # Save the file
         with open(save_path, "wb") as f:
             f.write(content)
-
+
+        return str(save_path)
 
     def download_bytes(
         self,
@@ -319,27 +333,27 @@ class Poma:
         """
         if not download_url:
             raise ValueError("download_url cannot be empty")
-
+
         # Construct the full URL if it's a relative path
         if download_url.startswith("/"):
             full_url = f"{self.base_api_url}{download_url}"
         else:
             full_url = download_url
-
+
         print("Downloading file from:", full_url)
         try:
-            # Download the file
             response = self._client.get(full_url)
             response.raise_for_status()
-
-            # Return the bytes content
             return response.content
-
         except httpx.HTTPStatusError as error:
             status = error.response.status_code
             if status in (401, 403):
-                raise AuthenticationError(
-
+                raise AuthenticationError(
+                    f"Failed to download '{download_url}': authentication error"
+                ) from error
+            raise RemoteServerError(
+                f"Failed to download '{download_url}': {status}"
+            ) from error
         except Exception as error:
             raise RuntimeError(f"File download failed: {error}") from error
 
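
The hunks above flesh out the error handling in client.py: HTTP 401/403 now raises AuthenticationError, other HTTP errors raise RemoteServerError, and a non-JSON response body raises InvalidResponseError. A hedged sketch of how a caller might handle the download path (the relative URL below is hypothetical; real values come from the "download_url" field returned by get_chunk_result):

```python
from poma import Poma
from poma.exceptions import AuthenticationError, RemoteServerError

client = Poma()  # requires POMA_API_KEY in the environment

try:
    # "/download/abc123" is an illustrative placeholder, not a real endpoint path.
    content = client.download_bytes("/download/abc123")
except AuthenticationError:
    # Raised on HTTP 401/403: the server rejected the API key.
    print("Check POMA_API_KEY; the server rejected the credentials.")
except RemoteServerError as err:
    # Raised on any other HTTP error status.
    print(f"Server returned an error status: {err}")
except RuntimeError as err:
    # Any other failure is wrapped in a RuntimeError by download_bytes.
    print(f"Download failed: {err}")
else:
    print(f"Downloaded {len(content)} bytes")
```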

{poma-0.1.2 → poma-0.2.1}/poma/exceptions.py

@@ -16,5 +16,6 @@ class RemoteServerError(PomaSDKError):
 class InvalidInputError(PomaSDKError):
     """Raised when an unsupported *Content‑Type* is given to ``chunk_text``."""
 
+
 class InvalidResponseError(PomaSDKError):
     """Raised when the server returns non-JSON or empty body."""

{poma-0.1.2 → poma-0.2.1}/poma/integrations/langchain_poma.py

@@ -18,9 +18,8 @@ from langchain_text_splitters import TextSplitter
 from pydantic import Field, PrivateAttr
 
 from poma import Poma
-from poma.client import ALLOWED_FILE_EXTENSIONS
 from poma.exceptions import InvalidInputError
-from poma.retrieval import _cheatsheets_from_chunks
+from poma.retrieval import chunks_from_dicts, _cheatsheets_from_chunks
 
 __all__ = ["PomaFileLoader", "PomaChunksetSplitter", "PomaCheatsheetRetrieverLC"]
 
@@ -52,10 +51,6 @@ class PomaFileLoader(BaseLoader):
             nonlocal skipped, documents
             if not file_path.is_file():
                 return
-            file_extension = file_path.suffix.lower()
-            if not file_extension or file_extension not in ALLOWED_FILE_EXTENSIONS:
-                skipped += 1
-                return
             file_bytes = file_path.read_bytes()
             file_hash = hashlib.md5(file_bytes).hexdigest()
             if file_path.suffix.lower() == ".pdf":
@@ -84,13 +79,10 @@ class PomaFileLoader(BaseLoader):
         else:
             raise FileNotFoundError(f"Unsupported path type (not file/dir): {path}")
 
-        allowed = ", ".join(sorted(ALLOWED_FILE_EXTENSIONS))
         if not documents:
-            raise InvalidInputError(f"No supported files found.
+            raise InvalidInputError(f"No supported files found.")
         if skipped > 0:
-            print(
-                f"Skipped {skipped} file(s) due to unsupported or unreadable type. Allowed: {allowed}"
-            )
+            print(f"Skipped {skipped} file(s) due to unsupported or unreadable type.")
         return documents
 
 
@@ -330,7 +322,7 @@ class PomaCheatsheetRetrieverLC(BaseRetriever):
 
     def _create_cheatsheet_langchain(self, chunked_docs: list[Document]) -> str:
         """Generate a single deduplicated cheatsheet from chunked documents."""
-
+        all_chunk_dicts = []
         seen = set()
         for doc in chunked_docs:
             doc_id = doc.metadata.get("doc_id", "unknown_doc")
@@ -341,11 +333,9 @@ class PomaCheatsheetRetrieverLC(BaseRetriever):
                 if chunk_index not in seen:
                     seen.add(chunk_index)
                     chunk["tag"] = doc_id
-
-
-
-            )
-        cheatsheets = _cheatsheets_from_chunks(sorted_chunks)
+                    all_chunk_dicts.append(chunk)
+        all_chunks = chunks_from_dicts(all_chunk_dicts)
+        cheatsheets = _cheatsheets_from_chunks(all_chunks)
         if (
             not cheatsheets
             or not isinstance(cheatsheets, list)

{poma-0.1.2 → poma-0.2.1}/poma/integrations/llamaindex_poma.py

@@ -23,9 +23,8 @@ from llama_index.core.schema import (
 from pydantic import PrivateAttr
 
 from poma import Poma
-from poma.client import ALLOWED_FILE_EXTENSIONS
-from poma.retrieval import _cheatsheets_from_chunks
 from poma.exceptions import InvalidInputError
+from poma.retrieval import chunks_from_dicts, _cheatsheets_from_chunks
 
 __all__ = ["PomaFileReader", "PomaChunksetNodeParser", "PomaCheatsheetRetrieverLI"]
 
@@ -54,9 +53,6 @@ class PomaFileReader(BaseReader):
             if not file_path.is_file():
                 return
             file_extension = file_path.suffix.lower()
-            if not file_extension or file_extension not in ALLOWED_FILE_EXTENSIONS:
-                skipped += 1
-                return
             file_bytes = file_path.read_bytes()
             file_hash = hashlib.md5(file_bytes).hexdigest()
             if file_extension == ".pdf":
@@ -87,13 +83,10 @@ class PomaFileReader(BaseReader):
         else:
             raise FileNotFoundError(f"Unsupported path type (not file/dir): {path}")
 
-        allowed = ", ".join(sorted(ALLOWED_FILE_EXTENSIONS))
         if not documents:
-            raise InvalidInputError(f"No supported files found.
+            raise InvalidInputError(f"No supported files found.")
         if skipped > 0:
-            print(
-                f"Skipped {skipped} file(s) due to unsupported or unreadable type. Allowed: {allowed}"
-            )
+            print(f"Skipped {skipped} file(s) due to unsupported or unreadable type.")
         return documents
 
 
@@ -329,7 +322,7 @@ class PomaCheatsheetRetrieverLI(BaseRetriever):
 
     def _create_cheatsheet_llamaindex(self, chunked_nodes: list[NodeWithScore]) -> str:
         """Generate a single deduplicated cheatsheet from chunked nodes."""
-
+        all_chunk_dicts = []
        seen = set()
         for node in chunked_nodes:
             doc_id = node.metadata.get("doc_id", "unknown_doc")
@@ -344,11 +337,9 @@ class PomaCheatsheetRetrieverLI(BaseRetriever):
                     continue
                 seen.add(chunk_index)
                 chunk["tag"] = doc_id
-
-
-
-            )
-        cheatsheets = _cheatsheets_from_chunks(sorted_chunks)
+                all_chunk_dicts.append(chunk)
+        all_chunks = chunks_from_dicts(all_chunk_dicts)
+        cheatsheets = _cheatsheets_from_chunks(all_chunks)
         if (
             not cheatsheets
             or not isinstance(cheatsheets, list)

poma-0.2.1/poma/retrieval.py
ADDED

@@ -0,0 +1,339 @@
+# retrieval.py
+from collections import defaultdict
+from itertools import chain
+from typing import Any
+import warnings
+
+
+def deprecated(replacement: str):
+    def decorator(func):
+        msg = (
+            f"{func.__name__}() is deprecated and will be removed in a future version. "
+            f"Use {replacement} instead."
+        )
+
+        def wrapper(*args, **kwargs):
+            warnings.warn(msg, DeprecationWarning, stacklevel=2)
+            return func(*args, **kwargs)
+
+        wrapper.__name__ = func.__name__
+        wrapper.__doc__ = (func.__doc__ or "") + f"\n\nDEPRECATED: {msg}\n"
+        return wrapper
+
+    return decorator
+
+
+def generate_cheatsheets(
+    relevant_chunksets: list[dict[str, Any]],
+    all_chunks: list[dict[str, Any]],
+) -> list[dict[str, Any]]:
+    # get chunks grouped by document file_id
+    doc_chunks = defaultdict(list)
+    for chunk in all_chunks:
+        file_id = chunk.get("file_id") or chunk.get("tag") or "single_doc"
+        chunk["file_id"] = file_id  # update
+        doc_chunks[file_id].append(chunk)
+
+    # Check for duplicate chunk_index values
+    # (necessary when file_id was not set in chunks)
+    for file_id, chunks in doc_chunks.items():
+        chunk_indices = [c["chunk_index"] for c in chunks]
+        if len(chunk_indices) != len(set(chunk_indices)):
+            raise ValueError(f"Duplicate chunk_index found for file_id: {file_id}")
+
+    # get relevant chunksets grouped by document file_id
+    relevant_chunksets_per_doc = defaultdict(list)
+    for chunkset in relevant_chunksets:
+        file_id = chunkset.get("file_id") or chunkset.get("tag") or "single_doc"
+        chunkset["file_id"] = file_id  # update
+        if "chunks" not in chunkset:
+            raise ValueError(
+                "Chunkset not valid; must contain a 'chunks' key with a list of chunk IDs."
+            )
+        relevant_chunksets_per_doc[file_id].append(chunkset)
+
+    # Ensure that chunksets and chunks correspond to the same file_ids
+    for file_id in relevant_chunksets_per_doc.keys():
+        if file_id not in doc_chunks:
+            raise ValueError(
+                f"Chunksets contain file_id '{file_id}' which is not present in the chunks."
+            )
+
+    # retrieve relevant chunks with content per document
+    relevant_content_chunks: list[RetrievalChunk] = []
+    for file_id, chunksets_per_doc in relevant_chunksets_per_doc.items():
+        chunk_ids = list(  # flattened list
+            chain.from_iterable(chunkset["chunks"] for chunkset in chunksets_per_doc)
+        )
+        relevant_chunks_dict = _get_relevant_chunks_for_ids(
+            chunk_ids, doc_chunks[file_id]
+        )
+        relevant_chunks: list[RetrievalChunk] = chunks_from_dicts(relevant_chunks_dict)
+        relevant_content_chunks.extend(relevant_chunks)
+
+    return _cheatsheets_from_chunks(relevant_content_chunks)
+
+
+@deprecated("generate_cheatsheets(relevant_chunksets, all_chunks)")
+def generate_single_cheatsheet(
+    relevant_chunksets: list[dict[str, Any]],
+    all_chunks: list[dict[str, Any]],
+) -> str:
+    cheatsheets = generate_cheatsheets(
+        relevant_chunksets=relevant_chunksets,
+        all_chunks=all_chunks,
+    )
+    return cheatsheets[0].get("content", "") if cheatsheets else ""
+
+
+########################
+# RetrievalChunk Class #
+########################
+
+
+class RetrievalChunk:
+    """
+    Represents a chunk of text with associated metadata.
+    Attributes:
+        index (int): The index of the chunk within a sequence.
+        file_id (str): The id associating the chunk with a document.
+        content (str): The textual content of the chunk.
+        depth_rebased (int, optional): The hierarchical depth of the chunk content.
+            In cheatsheets, this affects indentation for certain text parts.
+            Currently only used for code blocks.
+    """
+
+    def __init__(
+        self,
+        index: int,
+        file_id: str,
+        content: str,
+        depth_rebased: int | None,
+    ):
+        self.index = index
+        self.file_id = file_id
+        self.content = content
+        self.depth_rebased = depth_rebased
+
+    @classmethod
+    def from_chunk_dict(
+        cls,
+        chunk_dict: dict,
+        block_min_depth: int | None,
+    ):
+        if block_min_depth is not None:
+            depth = int(chunk_dict["depth"])
+            depth_rebased = cls._rebase_depth(depth, block_min_depth)
+        else:
+            depth_rebased = None
+        return cls(
+            index=int(chunk_dict["chunk_index"]),
+            file_id=str(
+                chunk_dict.get("file_id") or chunk_dict.get("tag") or "single_doc"
+            ),
+            content=str(chunk_dict["content"]),
+            depth_rebased=depth_rebased,
+        )
+
+    @staticmethod
+    def _rebase_depth(depth: int, min_depth: int, base_unit: int = 0) -> int | None:
+        rebased = depth - min_depth + base_unit
+        return max(0, rebased)
+
+    def __repr__(self):
+        return f"RetrievalChunk(index={self.index}, file_id={self.file_id}, content={self.content}), depth_rebased={self.depth_rebased}"
+
+
+def chunks_from_dicts(chunk_dicts: list[dict]) -> list[RetrievalChunk]:
+    """
+    Converts a list of chunk dictionaries into a list of Chunk objects.
+    File_ids are needed to identify chunks from different documents;
+    if is_single_doc is True, all chunks are assumed to come from a single document
+    and file_id is optional.
+    Args:
+        chunk_dicts (list[dict]): A list of dictionaries, each representing a chunk with required keys:
+            - "chunk_index": The index of the chunk within the document.
+            - "file_id": The identifier of the document.
+            - "content": The textual content of the chunk.
+            - "depth": The depth or level of the chunk.
+    Returns:
+        list[Chunk]: A list of Chunk objects with the textual content needed for the cheatsheets.
+    """
+
+    # Determine the minimum depth per code block
+    min_depth_per_code_block: dict[str, int] = {}
+    for chunk_dict in chunk_dicts:
+        block_id = chunk_dict.get("code")
+        if block_id is None:
+            continue
+        depth = int(chunk_dict["depth"])
+        current = min_depth_per_code_block.get(block_id)
+        min_depth_per_code_block[block_id] = (
+            depth if current is None else min(current, depth)
+        )
+
+    # Create Chunk objects
+    all_chunks: list[RetrievalChunk] = []
+    for chunk_dict in chunk_dicts:
+        code_id = chunk_dict.get("code")
+        if bool(code_id):
+            block_min_depth = min_depth_per_code_block.get(str(code_id))
+        else:
+            block_min_depth = None
+        chunk = RetrievalChunk.from_chunk_dict(chunk_dict, block_min_depth)
+        all_chunks.append(chunk)
+
+    # Sanity check: Make sure there are no duplicate chunk_index values
+    check_dict = defaultdict(set)
+    has_duplicates = any(
+        chunk.index in check_dict[chunk.file_id]
+        or check_dict[chunk.file_id].add(chunk.index)
+        for chunk in all_chunks
+    )
+    if has_duplicates:
+        raise ValueError(
+            "Duplicate chunk indices found in single document mode. "
+            "Each chunk must have a unique index."
+        )
+
+    return all_chunks
+
+
+###################
+# Private Methods #
+###################
+
+
+def _get_relevant_chunks_for_ids(
+    chunk_ids: list[int],
+    chunks: list[dict[str, Any]],
+) -> list[dict[str, Any]]:
+    chunk_indices_of_retrieved_chunksets = chunk_ids
+    all_chunks_of_doc = chunks
+
+    # Build helpers
+    sorted_chunks = sorted(all_chunks_of_doc, key=lambda c: c["chunk_index"])
+    index_to_chunk = {c["chunk_index"]: c for c in sorted_chunks}
+    index_to_depth = {c["chunk_index"]: c["depth"] for c in sorted_chunks}
+
+    # Find relatively deepest indices in the retrieval
+    candidate_indices = set(chunk_indices_of_retrieved_chunksets)
+
+    def is_ancestor(idx1, idx2):
+        """True if idx1 is an ancestor of idx2."""
+        # idx1 must be before idx2 and have smaller depth
+        if idx1 >= idx2:
+            return False
+        depth1 = index_to_depth[idx1]
+        depth2 = index_to_depth[idx2]
+        if depth1 >= depth2:
+            return False
+        # scan from idx1+1 up to idx2, making sure all are deeper than depth1 until idx2
+        for i in range(idx1 + 1, idx2 + 1):
+            depth = index_to_depth[sorted_chunks[i]["chunk_index"]]
+            if depth <= depth1 and sorted_chunks[i]["chunk_index"] != idx2:
+                return False
+        return True
+
+    # Exclude any index that is an ancestor of another in the set
+    relatively_deepest = set(candidate_indices)
+    for idx1 in candidate_indices:
+        for idx2 in candidate_indices:
+            if idx1 != idx2 and is_ancestor(idx1, idx2):
+                relatively_deepest.discard(idx1)
+                break
+
+    # Standard subtree/parent finding routines
+    def get_child_indices(chunk_index: int) -> list[int]:
+        base_depth = index_to_depth[chunk_index]
+        children = []
+        for i in range(chunk_index + 1, len(sorted_chunks)):
+            idx = sorted_chunks[i]["chunk_index"]
+            depth = sorted_chunks[i]["depth"]
+            if depth <= base_depth:
+                break
+            children.append(idx)
+        return children
+
+    def get_parent_indices(chunk_index: int) -> list[int]:
+        parents = []
+        current_depth = index_to_depth[chunk_index]
+        for i in range(chunk_index - 1, -1, -1):
+            idx = sorted_chunks[i]["chunk_index"]
+            depth = sorted_chunks[i]["depth"]
+            if depth < current_depth:
+                parents.append(idx)
+                current_depth = depth
+        return parents[::-1]  # root -> leaf order
+
+    # Collect all relevant indices
+    all_indices = set(
+        chunk_indices_of_retrieved_chunksets
+    )  # always include all search hits
+    for idx in relatively_deepest:
+        all_indices.update(get_child_indices(idx))
+
+    # Parents for all found nodes
+    for idx in list(all_indices):
+        all_indices.update(get_parent_indices(idx))
+
+    # Return in doc order
+    return [index_to_chunk[i] for i in sorted(all_indices)]
+
+
+def _cheatsheets_from_chunks(
+    content_chunks: list[RetrievalChunk],
+) -> list[dict[str, Any]]:
+    if not content_chunks:
+        return []
+
+    if isinstance(content_chunks[0], dict):
+        raise ValueError(
+            "Input to _cheatsheets_from_chunks must be a list of RetrievalChunk objects, not dicts."
+            "Use chunks_from_dicts() to convert dicts to RetrievalChunk objects first."
+        )
+
+    def _format_chunk_content(chunk: "RetrievalChunk") -> str:
+        if not getattr(chunk, "depth_rebased", False):
+            return chunk.content
+        else:
+            indent = " " * 4 * (chunk.depth_rebased or 0)
+            return f"{indent}{chunk.content}"
+
+    cheatsheets: list[dict] = []
+
+    compressed_data = {}
+    content_chunks = sorted(content_chunks, key=lambda c: (c.file_id, c.index))
+    for chunk in content_chunks:
+        if chunk.file_id not in compressed_data:
+            # If there is data stored for a previous file_id, save it to the cheatsheets list
+            if compressed_data:
+                for key, value in compressed_data.items():
+                    cheatsheets.append(
+                        {"file_id": key, "tag": key, "content": value["content"]}
+                    )
+                # Clear the compressed_data for the current file_id
+                compressed_data.clear()
+            # Start a new entry for the current file_id
+            compressed_data[chunk.file_id] = {
+                "content": _format_chunk_content(chunk),
+                "last_chunk": chunk.index,
+            }
+        else:
+            chunk_content = _format_chunk_content(chunk)
+            # Check if chunks are consecutive
+            if chunk.index == int(compressed_data[chunk.file_id]["last_chunk"]) + 1:
+                compressed_data[chunk.file_id]["content"] += "\n" + chunk_content
+            else:
+                compressed_data[chunk.file_id]["content"] += "\n[…]\n" + chunk_content
+            # Update the last chunk index
+            compressed_data[chunk.file_id]["last_chunk"] = chunk.index
+
+    # Save the last processed entry to the cheatsheets list
+    if compressed_data:
+        for key, value in compressed_data.items():
+            cheatsheets.append(
+                {"file_id": key, "tag": key, "content": value["content"]}
+            )
+
+    return cheatsheets
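
The new module's public entry point is generate_cheatsheets(). A small self-contained sketch with made-up chunk data shows how output is grouped per file_id and how a gap between non-consecutive chunks is rendered as a "[…]" line:

```python
from poma.retrieval import generate_cheatsheets

# Made-up chunks for two documents; each dict needs chunk_index, depth, content,
# and a file_id (or legacy "tag") so results can be grouped per document.
all_chunks = [
    {"file_id": "a.md", "chunk_index": 0, "depth": 0, "content": "# Title A"},
    {"file_id": "a.md", "chunk_index": 1, "depth": 1, "content": "Intro paragraph"},
    {"file_id": "a.md", "chunk_index": 2, "depth": 1, "content": "Unrelated section"},
    {"file_id": "a.md", "chunk_index": 3, "depth": 1, "content": "Relevant details"},
    {"file_id": "b.md", "chunk_index": 0, "depth": 0, "content": "# Title B"},
    {"file_id": "b.md", "chunk_index": 1, "depth": 1, "content": "Key fact in B"},
]

# Chunksets as they might come back from retrieval: lists of chunk IDs per document.
relevant_chunksets = [
    {"file_id": "a.md", "chunks": [1, 3]},
    {"file_id": "b.md", "chunks": [1]},
]

for sheet in generate_cheatsheets(relevant_chunksets, all_chunks):
    print(f"--- {sheet['file_id']} ---")
    print(sheet["content"])

# For a.md the parent heading (chunk 0) is pulled in automatically and the jump
# from chunk 1 to chunk 3 is marked with a "[…]" line; b.md gets its own cheatsheet.
```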

{poma-0.1.2 → poma-0.2.1}/poma.egg-info/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: poma
-Version: 0.1.2
+Version: 0.2.1
 Summary: Official Python SDK for the Poma document-processing API
 Author-email: "POMA AI GmbH, Berlin" <sdk@poma-ai.com>
 License-Expression: MPL-2.0
@@ -40,15 +40,15 @@ pip install poma
 
 For integrations into LangChain and LlamaIndex:
 ```bash
-pip install poma[integrations]
+pip install 'poma[integrations]'
 # Or LangChain/LlamaIndex including example extras:
-pip install poma[integration-examples]
+pip install 'poma[integration-examples]'
 ```
 
 
 - You may also want: `pip install python-dotenv` to load API keys from a .env file.
 - API keys required (POMA_API_KEY) for the POMA AI client via environment variables.
-- **To request a POMA_API_KEY, please contact us at
+- **To request a POMA_API_KEY, please contact us at sdk@poma-ai.com**
 
 
 ### Example Implementations — all examples, integrations, and additional information can be found in our GitHub repository: [poma-ai/poma](https://github.com/poma-ai/)

{poma-0.1.2 → poma-0.2.1}/pyproject.toml

@@ -1,6 +1,6 @@
 [project]
 name = "poma"
-version = "0.1.2"
+version = "0.2.1"
 description = "Official Python SDK for the Poma document-processing API"
 authors = [{ name = "POMA AI GmbH, Berlin", email = "sdk@poma-ai.com" }]
 readme = "README.md"
@@ -34,7 +34,7 @@ integration-examples = [
 ]
 
 [build-system]
-requires = ["setuptools>=67", "wheel"]
+requires = ["setuptools>=67", "wheel", "setuptools-scm>=8"]
 build-backend = "setuptools.build_meta"
 
 [tool.setuptools.packages.find]

poma-0.1.2/poma/retrieval.py
DELETED

@@ -1,176 +0,0 @@
-# retrieval.py
-from collections import defaultdict
-from itertools import chain
-from typing import Any
-
-
-def generate_cheatsheets(
-    relevant_chunksets: list[dict[str, Any]], all_chunks: list[dict[str, Any]]
-) -> list[dict[str, Any]]:
-    chunk_ids = [cs["chunks"] for cs in relevant_chunksets if "chunks" in cs]
-    chunk_ids = list(chain.from_iterable(chunk_ids))  # flatten the list
-    relevant_chunks = _get_relevant_chunks_for_ids(chunk_ids, all_chunks)
-    sorted_chunks = sorted(
-        relevant_chunks, key=lambda chunk: (chunk["tag"], chunk["chunk_index"])
-    )
-    return _cheatsheets_from_chunks(sorted_chunks)
-
-
-def generate_single_cheatsheet(
-    relevant_chunksets: list[dict[str, Any]], all_chunks: list[dict[str, Any]]
-) -> str:
-
-    def prepare_single_doc_chunks(
-        chunk_dicts: list[dict[str, Any]],
-    ) -> list[dict[str, Any]]:
-        # Make sure there are no duplicate chunk_index values
-        check_dict = defaultdict(set)
-        has_duplicates = any(
-            chunk["chunk_index"] in check_dict[chunk["tag"]]
-            or check_dict[chunk["tag"]].add(chunk["chunk_index"])
-            for chunk in chunk_dicts
-        )
-        if has_duplicates:
-            raise ValueError(
-                "Duplicate chunk indices found in single document mode. "
-                "Each chunk must have a unique index."
-            )
-        # Use a fixed tag for chunks from single documents
-        for chunk_dict in chunk_dicts:
-            chunk_dict["tag"] = "single_doc"
-        return chunk_dicts
-
-    chunk_ids = [cs["chunks"] for cs in relevant_chunksets if "chunks" in cs]
-    chunk_ids = list(chain.from_iterable(chunk_ids))  # flatten the list
-    relevant_chunks = _get_relevant_chunks_for_ids(chunk_ids, all_chunks)
-    relevant_chunks = prepare_single_doc_chunks(relevant_chunks)
-    sorted_chunks = sorted(
-        relevant_chunks, key=lambda chunk: (chunk["tag"], chunk["chunk_index"])
-    )
-    cheatsheets = _cheatsheets_from_chunks(sorted_chunks)
-    if (
-        not cheatsheets
-        or not isinstance(cheatsheets, list)
-        or len(cheatsheets) == 0
-        or "content" not in cheatsheets[0]
-    ):
-        raise Exception(
-            "Unknown error; cheatsheet could not be created from input chunks."
-        )
-    return cheatsheets[0]["content"]
-
-
-def _get_relevant_chunks_for_ids(
-    chunk_ids: list[int],
-    chunks: list[dict[str, Any]],
-) -> list[dict[str, Any]]:
-    chunk_indices_of_retrieved_chunksets = chunk_ids
-    all_chunks_of_doc = chunks
-
-    # Build helpers
-    sorted_chunks = sorted(all_chunks_of_doc, key=lambda c: c["chunk_index"])
-    index_to_chunk = {c["chunk_index"]: c for c in sorted_chunks}
-    index_to_depth = {c["chunk_index"]: c["depth"] for c in sorted_chunks}
-
-    # Find relatively deepest indices in the retrieval
-    candidate_indices = set(chunk_indices_of_retrieved_chunksets)
-
-    def is_ancestor(idx1, idx2):
-        """True if idx1 is an ancestor of idx2."""
-        # idx1 must be before idx2 and have smaller depth
-        if idx1 >= idx2:
-            return False
-        depth1 = index_to_depth[idx1]
-        depth2 = index_to_depth[idx2]
-        if depth1 >= depth2:
-            return False
-        # scan from idx1+1 up to idx2, making sure all are deeper than depth1 until idx2
-        for i in range(idx1 + 1, idx2 + 1):
-            depth = index_to_depth[sorted_chunks[i]["chunk_index"]]
-            if depth <= depth1 and sorted_chunks[i]["chunk_index"] != idx2:
-                return False
-        return True
-
-    # Exclude any index that is an ancestor of another in the set
-    relatively_deepest = set(candidate_indices)
-    for idx1 in candidate_indices:
-        for idx2 in candidate_indices:
-            if idx1 != idx2 and is_ancestor(idx1, idx2):
-                relatively_deepest.discard(idx1)
-                break
-
-    # Standard subtree/parent finding routines
-    def get_child_indices(chunk_index: int) -> list[int]:
-        base_depth = index_to_depth[chunk_index]
-        children = []
-        for i in range(chunk_index + 1, len(sorted_chunks)):
-            idx = sorted_chunks[i]["chunk_index"]
-            depth = sorted_chunks[i]["depth"]
-            if depth <= base_depth:
-                break
-            children.append(idx)
-        return children
-
-    def get_parent_indices(chunk_index: int) -> list[int]:
-        parents = []
-        current_depth = index_to_depth[chunk_index]
-        for i in range(chunk_index - 1, -1, -1):
-            idx = sorted_chunks[i]["chunk_index"]
-            depth = sorted_chunks[i]["depth"]
-            if depth < current_depth:
-                parents.append(idx)
-                current_depth = depth
-        return parents[::-1]  # root -> leaf order
-
-    # Collect all relevant indices
-    all_indices = set(
-        chunk_indices_of_retrieved_chunksets
-    )  # always include all search hits
-    for idx in relatively_deepest:
-        all_indices.update(get_child_indices(idx))
-
-    # Parents for all found nodes
-    for idx in list(all_indices):
-        all_indices.update(get_parent_indices(idx))
-
-    # Return in doc order
-    return [index_to_chunk[i] for i in sorted(all_indices)]
-
-
-def _cheatsheets_from_chunks(
-    content_chunks: list[dict[str, Any]],
-) -> list[dict[str, Any]]:
-    cheatsheets: list[dict] = []
-
-    compressed_data = {}
-    for chunk in content_chunks:
-        if chunk["tag"] not in compressed_data:
-            # If there is data stored for a previous tag, save it to the cheatsheets list
-            if compressed_data:
-                for key, value in compressed_data.items():
-                    cheatsheets.append({"tag": key, "content": value["content"]})
-                # Clear the compressed_data for the current tag
-                compressed_data.clear()
-            # Start a new entry for the current tag
-            compressed_data[chunk["tag"]] = {
-                "content": chunk["content"],
-                "last_chunk": chunk["chunk_index"],
-            }
-        else:
-            # Check if chunks are consecutive
-            if (
-                chunk["chunk_index"]
-                == int(compressed_data[chunk["tag"]]["last_chunk"]) + 1
-            ):
-                compressed_data[chunk["tag"]]["content"] += "\n" + chunk["content"]
-            else:
-                compressed_data[chunk["tag"]]["content"] += "\n[…]\n" + chunk["content"]
-            # Update the last chunk index
-            compressed_data[chunk["tag"]]["last_chunk"] = chunk["chunk_index"]
-
-    # Save the last processed entry to the cheatsheets list
-    if compressed_data:
-        for key, value in compressed_data.items():
-            cheatsheets.append({"tag": key, "content": value["content"]})
-
-    return cheatsheets

The remaining files listed above (LICENSE, poma/__init__.py, poma/integrations/__init__.py, poma.egg-info/dependency_links.txt, poma.egg-info/requires.txt, poma.egg-info/top_level.txt, and setup.cfg) are unchanged between the two versions.