chunksmith-agent 0.4.0__tar.gz

Files changed (22)

chunksmith_agent-0.4.0/PKG-INFO ADDED

@@ -0,0 +1,82 @@
+Metadata-Version: 2.4
+Name: chunksmith-agent
+Version: 0.4.0
+Summary: ChunkSmith document Q&A agent over saved multi-indexing outlines.
+Author-email: AnshulParate2004 <anshulnparate@gmail.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/AnshulParate2004/ChunkSmith
+Project-URL: Repository, https://github.com/AnshulParate2004/ChunkSmith
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: python-dotenv>=1.0.0
+Requires-Dist: pydantic>=2.10.0
+Provides-Extra: langchain
+Requires-Dist: langchain-core>=0.3.28; extra == "langchain"
+Requires-Dist: langchain-openai>=0.3.0; extra == "langchain"
+Requires-Dist: langchain-litellm>=0.2.0; extra == "langchain"
+# chunksmith-agent
+Standalone document Q&A over **saved ChunkSmith index JSON** (no dependency on `chunksmith-core`, `chunksmith-multimodal`, or `chunksmith-pageindex`).
+## Install
+```bash
+pip install chunksmith-agent
+pip install "chunksmith-agent[langchain]"   # LangChain tool-calling Q&A
+```
+## Usage
+```python
+from pathlib import Path
+from chunksmith_agent import ChunkSmithAgent
+from chunksmith_agent.index_builder import build_document_index_from_saved
+index = build_document_index_from_saved(
+    pageindex_path=Path("runs/my-doc/json/my-doc_pageindex.json"),
+)
+agent = ChunkSmithAgent(index)
+answer = agent.ask("What is this document about?")
+print(answer.answer)
+```
+## JSON artifact contract
+The agent reads files produced by `chunksmith-cli` (or compatible tools) under a run folder:
+| File | Fields used |
+|------|-------------|
+| `json/*_pageindex.json` | `doc_name`, `structure`, optional embedded `canonical_bundle` |
+| `json/*_canonical_bundle.json` | `elements[]`, `coded_formate`, `path_image` |
+| Outline nodes | `node_id`, `title`, `summary`, `start_index`/`end_index`, anchor fields |
+## Environment variables
+Same LLM env vars as ChunkSmith CLI / MVL:
+- `OPENAI_API_KEY` (or `CHATGPT_API_KEY`)
+- `PAGEINDEX_MODEL`, `CHUNKSMITH_LLM_MODEL`, `LLM_MODEL`
+- Azure: `AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`
+## Integration patterns (loose coupling)
+**Do not** depend on `chunksmith-adapters` inside this package. Pass data in yourself:
+| Caller | How to load index | Agent config |
+|--------|-------------------|--------------|
+| **CLI** | `build_document_index_from_saved(pageindex_path=...)` | `load_settings()` from `.env` |
+| **MVL app** | `chunksmith_agent_bridge.load_document_index_from_mvl(repo, ...)` | `load_settings()` or explicit `AgentSettings` |
+| **Custom app** | Fetch JSON from your DB/S3 → `build_document_index(dict)` | Your env / settings |
+Install separately from CLI or MVL:
+```bash
+pip install "chunksmith-agent[langchain]"
+```
+## Extensibility
+- **Outline nodes:** come from saved JSON (`structure`); re-index or edit JSON to add sections.
+- **Tools:** extend `_make_tools()` in `tool_agent.py` (e.g. add `get_page_images`).
+- **Another agent in your app:** compose `ChunkSmithAgent` alongside your own planners/retrievers — this package is one document Q&A brain, not your whole system.
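
Note on the "Environment variables" section above: the README lists alternative variable names but does not state which one wins when several are set. A minimal sketch of a first-non-empty lookup follows, assuming the listed order is the precedence order (an assumption, not confirmed by this diff). The helper `first_env` and the example values are hypothetical and are not part of `chunksmith-agent`.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence


def first_env(names: Sequence[str], env: Mapping[str, str]) -> str | None:
    """Return the first non-empty value among `names`, or None if none is set."""
    for name in names:
        value = (env.get(name) or "").strip()
        if value:
            return value
    return None


# Hypothetical environment; a real caller would pass os.environ instead.
example_env = {"CHATGPT_API_KEY": "sk-example", "LLM_MODEL": "example-model"}

# OPENAI_API_KEY is unset here, so the documented alternative is used.
api_key = first_env(("OPENAI_API_KEY", "CHATGPT_API_KEY"), example_env)

# Precedence among the model variables is assumed to follow the README order.
model = first_env(
    ("PAGEINDEX_MODEL", "CHUNKSMITH_LLM_MODEL", "LLM_MODEL"), example_env
)
```

The package's own resolution lives in `load_settings()`, which is not shown in this part of the diff and may behave differently.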

chunksmith_agent-0.4.0/README.md ADDED

@@ -0,0 +1,65 @@
+# chunksmith-agent
+Standalone document Q&A over **saved ChunkSmith index JSON** (no dependency on `chunksmith-core`, `chunksmith-multimodal`, or `chunksmith-pageindex`).
+## Install
+```bash
+pip install chunksmith-agent
+pip install "chunksmith-agent[langchain]"   # LangChain tool-calling Q&A
+```
+## Usage
+```python
+from pathlib import Path
+from chunksmith_agent import ChunkSmithAgent
+from chunksmith_agent.index_builder import build_document_index_from_saved
+index = build_document_index_from_saved(
+    pageindex_path=Path("runs/my-doc/json/my-doc_pageindex.json"),
+)
+agent = ChunkSmithAgent(index)
+answer = agent.ask("What is this document about?")
+print(answer.answer)
+```
+## JSON artifact contract
+The agent reads files produced by `chunksmith-cli` (or compatible tools) under a run folder:
+| File | Fields used |
+|------|-------------|
+| `json/*_pageindex.json` | `doc_name`, `structure`, optional embedded `canonical_bundle` |
+| `json/*_canonical_bundle.json` | `elements[]`, `coded_formate`, `path_image` |
+| Outline nodes | `node_id`, `title`, `summary`, `start_index`/`end_index`, anchor fields |
+## Environment variables
+Same LLM env vars as ChunkSmith CLI / MVL:
+- `OPENAI_API_KEY` (or `CHATGPT_API_KEY`)
+- `PAGEINDEX_MODEL`, `CHUNKSMITH_LLM_MODEL`, `LLM_MODEL`
+- Azure: `AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`
+## Integration patterns (loose coupling)
+**Do not** depend on `chunksmith-adapters` inside this package. Pass data in yourself:
+| Caller | How to load index | Agent config |
+|--------|-------------------|--------------|
+| **CLI** | `build_document_index_from_saved(pageindex_path=...)` | `load_settings()` from `.env` |
+| **MVL app** | `chunksmith_agent_bridge.load_document_index_from_mvl(repo, ...)` | `load_settings()` or explicit `AgentSettings` |
+| **Custom app** | Fetch JSON from your DB/S3 → `build_document_index(dict)` | Your env / settings |
+Install separately from CLI or MVL:
+```bash
+pip install "chunksmith-agent[langchain]"
+```
+## Extensibility
+- **Outline nodes:** come from saved JSON (`structure`); re-index or edit JSON to add sections.
+- **Tools:** extend `_make_tools()` in `tool_agent.py` (e.g. add `get_page_images`).
+- **Another agent in your app:** compose `ChunkSmithAgent` alongside your own planners/retrievers — this package is one document Q&A brain, not your whole system.
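
Note on the "JSON artifact contract" table above: it names the fields the agent reads but gives no sample payload. The sketch below builds a small hypothetical `*_pageindex.json` payload and checks it for those fields. The child key `nodes` matches what `flatten_nodes` in `element_retrieval.py` walks. The `node_id` values, the choice to treat every listed node field as required, and the helper `missing_fields` are illustrative assumptions, not the package's own validation.

```python
from typing import Any

# Node fields named in the README table (anchor fields omitted for brevity).
NODE_KEYS = ("node_id", "title", "summary", "start_index", "end_index")

# Hypothetical payload; real files are produced by chunksmith-cli.
pageindex = {
    "doc_name": "my-doc",
    "structure": [
        {
            "node_id": "0001",
            "title": "Introduction",
            "summary": "What the document covers.",
            "start_index": 1,
            "end_index": 2,
            "nodes": [
                {
                    "node_id": "0002",
                    "title": "Scope",
                    "summary": "",
                    "start_index": 2,
                    "end_index": 2,
                },
            ],
        },
    ],
}


def missing_fields(payload: dict[str, Any]) -> list[str]:
    """List dotted paths of contract fields absent from a pageindex payload."""
    problems = [key for key in ("doc_name", "structure") if key not in payload]

    def walk(nodes: list[Any], prefix: str) -> None:
        for i, node in enumerate(nodes):
            if not isinstance(node, dict):
                continue  # same tolerance as flatten_nodes: skip non-dict rows
            here = f"{prefix}[{i}]"
            problems.extend(f"{here}.{key}" for key in NODE_KEYS if key not in node)
            children = node.get("nodes")
            if isinstance(children, list):
                walk(children, f"{here}.nodes")

    structure = payload.get("structure")
    if isinstance(structure, list):
        walk(structure, "structure")
    return problems
```

A payload that passes this check could then be handed to `build_document_index(dict)`, as the "Custom app" row of the integration table suggests.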

chunksmith_agent-0.4.0/pyproject.toml ADDED

@@ -0,0 +1,35 @@
+[build-system]
+requires = ["setuptools>=61", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "chunksmith-agent"
+version = "0.4.0"
+description = "ChunkSmith document Q&A agent over saved multi-indexing outlines."
+requires-python = ">=3.10"
+license = "MIT"
+authors = [{ name = "AnshulParate2004", email = "anshulnparate@gmail.com" }]
+readme = "README.md"
+dependencies = [
+    "python-dotenv>=1.0.0",
+    "pydantic>=2.10.0",
+]
+[project.optional-dependencies]
+langchain = [
+    "langchain-core>=0.3.28",
+    "langchain-openai>=0.3.0",
+    "langchain-litellm>=0.2.0",
+]
+[project.urls]
+Homepage = "https://github.com/AnshulParate2004/ChunkSmith"
+Repository = "https://github.com/AnshulParate2004/ChunkSmith"
+[tool.setuptools.packages.find]
+where = ["src"]
+include = ["chunksmith_agent", "chunksmith_agent.*"]
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+pythonpath = ["src"]

chunksmith_agent-0.4.0/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

chunksmith_agent-0.4.0/src/chunksmith_agent/__init__.py ADDED

@@ -0,0 +1,13 @@
+"""Reasoning-based Q&A over ChunkSmith multimodal indexes (outline tree + elements)."""
+from chunksmith_agent.agent import ChunkSmithAgent
+from chunksmith_agent.index_builder import build_document_index, build_document_index_from_saved
+from chunksmith_agent.models import AgentAnswer, DocumentIndex
+__all__ = [
+    "AgentAnswer",
+    "ChunkSmithAgent",
+    "DocumentIndex",
+    "build_document_index",
+    "build_document_index_from_saved",
+]

chunksmith_agent-0.4.0/src/chunksmith_agent/agent.py ADDED

@@ -0,0 +1,59 @@
+"""ChunkSmith agent session: Q&A over a built document index."""
+from __future__ import annotations
+from typing import Any, Iterator
+from chunksmith_agent.index_builder import build_document_index
+from chunksmith_agent.models import AgentAnswer, DocumentIndex
+from chunksmith_agent.retrieval import answer_question, iter_answer_events
+from chunksmith_agent.session import AgentConversation
+from chunksmith_agent.settings import AgentSettings, load_settings
+class ChunkSmithAgent:
+    """Holds a document index and answers with session memory (reuse sections on follow-ups)."""
+    def __init__(
+        self,
+        index: DocumentIndex,
+        settings: AgentSettings | None = None,
+    ) -> None:
+        self.index = index
+        self.settings = settings or load_settings()
+        self.conversation = AgentConversation()
+    @classmethod
+    def from_multimodal_output(cls, out: dict[str, Any], settings: AgentSettings | None = None) -> ChunkSmithAgent:
+        return cls(build_document_index(out), settings=settings)
+    def reset_conversation(self) -> None:
+        """Clear chat history and cached section selection."""
+        self.conversation = AgentConversation()
+    def ask(self, query: str, *, stream_tokens: bool = False) -> AgentAnswer:
+        return answer_question(
+            self.index,
+            query,
+            self.settings,
+            conversation=self.conversation,
+        )
+    def ask_events(
+        self,
+        query: str,
+        *,
+        event_sink: Any | None = None,
+        emit_image_events: bool = True,
+        emit_table_events: bool = True,
+    ) -> Iterator[tuple[str, dict[str, Any]]]:
+        """Yield ``(event_name, payload)`` for CLI streaming."""
+        yield from iter_answer_events(
+            self.index,
+            query,
+            self.settings,
+            event_sink=event_sink,
+            emit_image_events=emit_image_events,
+            emit_table_events=emit_table_events,
+            conversation=self.conversation,
+        )
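
Note on `ask_events` above: it yields `(event_name, payload)` tuples, but the event names and payload keys are defined in `retrieval.iter_answer_events`, which is not shown in this part of the diff. The sketch below shows one way a caller might consume such a stream. The generator `fake_events`, the names `"token"` and `"done"`, and the payload keys `"text"` and `"sources"` are all hypothetical stand-ins.

```python
from typing import Any, Iterable, Iterator


def fake_events() -> Iterator[tuple[str, dict[str, Any]]]:
    """Stand-in for ChunkSmithAgent.ask_events(); names and keys are made up."""
    yield "token", {"text": "Hello"}
    yield "token", {"text": ", world"}
    yield "done", {"sources": ["0001"]}


def collect_text(
    events: Iterable[tuple[str, dict[str, Any]]],
) -> tuple[str, list[str]]:
    """Join token payloads into one string and record the event names seen."""
    parts: list[str] = []
    seen: list[str] = []
    for name, payload in events:
        seen.append(name)
        if name == "token":  # ignore events this consumer does not handle
            parts.append(str(payload.get("text", "")))
    return "".join(parts), seen


text, names = collect_text(fake_events())
```

With a real agent, the call would be `collect_text(agent.ask_events(query))` once the consumer is adapted to the actual event names.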

chunksmith_agent-0.4.0/src/chunksmith_agent/element_retrieval.py ADDED

@@ -0,0 +1,164 @@
+"""Element helpers for building per-node media from canonical bundles (standalone)."""
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Any
+ANCHOR_START_KEYS = ("split_document_anchor_start", "anchor_start")
+ANCHOR_END_KEYS = ("split_document_anchor_end", "anchor_end")
+FALLBACK_START_KEYS = ("split_document_anchor", "anchor")
+def anchor_start_from_row(row: dict[str, Any]) -> str:
+    for key in ANCHOR_START_KEYS:
+        value = str(row.get(key) or "").strip()
+        if value:
+            return value[:240]
+    for key in FALLBACK_START_KEYS:
+        value = str(row.get(key) or "").strip()
+        if value:
+            return value[:240]
+    return ""
+def flatten_nodes(structure: list[Any]) -> list[dict[str, Any]]:
+    out: list[dict[str, Any]] = []
+    def _walk(nodes: list[Any]) -> None:
+        for node in nodes:
+            if not isinstance(node, dict):
+                continue
+            out.append(node)
+            child = node.get("nodes")
+            if isinstance(child, list):
+                _walk(child)
+    _walk(structure)
+    return out
+@dataclass
+class BundleElement:
+    element_type: str
+    text: str
+    text_as_html: str | None
+    page_number: int
+    element_id: int | str
+def _parse_element(raw: dict[str, Any]) -> BundleElement | None:
+    if not isinstance(raw, dict):
+        return None
+    pn = raw.get("page_number")
+    if pn is None:
+        return None
+    eid = raw.get("element_id", 0)
+    etype = str(raw.get("element_type") or raw.get("type") or "Text")
+    text = str(raw.get("text") or "")
+    html = raw.get("text_as_html")
+    html_str = str(html) if isinstance(html, str) and html.strip() else None
+    return BundleElement(
+        element_type=etype,
+        text=text,
+        text_as_html=html_str,
+        page_number=max(1, int(pn)),
+        element_id=eid,
+    )
+def _bundle_elements(bundle: dict[str, Any]) -> list[BundleElement]:
+    out: list[BundleElement] = []
+    for raw in bundle.get("elements") or []:
+        el = _parse_element(raw)
+        if el is not None:
+            out.append(el)
+    return out
+def _element_body(el: BundleElement) -> str:
+    if el.element_type.strip().lower() == "table" and el.text_as_html:
+        return el.text_as_html
+    return el.text or ""
+def _elements_in_page_span(
+    elements: list[BundleElement],
+    start_page: int,
+    end_page: int,
+) -> list[BundleElement]:
+    lo = max(1, int(start_page))
+    hi = max(lo, int(end_page))
+    return [el for el in elements if lo <= el.page_number <= hi]
+def _needle_from_node(node: dict[str, Any]) -> str:
+    pseudo = str(node.get("split_document_anchor") or "").strip()
+    if pseudo:
+        return pseudo[:240]
+    return anchor_start_from_row(node)
+def _next_needle_from_structure(
+    structure: list[dict[str, Any]],
+    node: dict[str, Any],
+) -> str:
+    flat = flatten_nodes(structure)
+    target_id = node.get("node_id")
+    for i, row in enumerate(flat):
+        if row.get("node_id") != target_id:
+            continue
+        if i + 1 >= len(flat):
+            return ""
+        nxt = flat[i + 1]
+        needle = anchor_start_from_row(nxt)
+        if needle:
+            return needle
+        title = str(nxt.get("title") or "").strip()
+        return title[:240] if title else ""
+    return ""
+def _span_blob_with_offsets(
+    span_els: list[BundleElement],
+) -> tuple[str, list[tuple[BundleElement, int, int]]]:
+    parts: list[str] = []
+    offsets: list[tuple[BundleElement, int, int]] = []
+    pos = 0
+    for el in span_els:
+        body = _element_body(el)
+        if not body:
+            continue
+        start = pos
+        parts.append(body)
+        pos += len(body)
+        parts.append("\n\n")
+        pos += 2
+        offsets.append((el, start, pos - 2))
+    return "".join(parts), offsets
+def _slice_section_range(
+    blob: str,
+    offsets: list[tuple[BundleElement, int, int]],
+    *,
+    anchor: str,
+    pseudo_node: dict[str, Any],
+    next_needle: str,
+) -> tuple[int, int]:
+    del pseudo_node  # reserved for future anchor refinement
+    if not blob.strip():
+        return 0, 0
+    start = 0
+    if anchor:
+        idx = blob.find(anchor)
+        if idx >= 0:
+            start = idx
+    end = len(blob)
+    if next_needle:
+        idx = blob.find(next_needle, start + max(1, len(anchor)))
+        if idx >= 0:
+            end = idx
+    if not offsets:
+        return start, end
+    return start, end
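
Note on `_span_blob_with_offsets` and `_slice_section_range` above: the first joins element bodies with blank lines and records each body's character span, and the second cuts a section from its own anchor up to the next section's needle. The worked example below is a simplified re-statement of that logic on plain strings, so it runs without `BundleElement`. The names `join_with_offsets` and `slice_range` and the sample text are illustrative only.

```python
def join_with_offsets(bodies: list[str]) -> tuple[str, list[tuple[int, int]]]:
    """Join non-empty bodies with blank lines; record each body's [start, end)."""
    parts: list[str] = []
    offsets: list[tuple[int, int]] = []
    pos = 0
    for body in bodies:
        if not body:
            continue
        start = pos
        parts.append(body)
        pos += len(body)
        parts.append("\n\n")
        pos += 2
        offsets.append((start, pos - 2))  # end excludes the separator
    return "".join(parts), offsets


def slice_range(blob: str, anchor: str, next_needle: str) -> tuple[int, int]:
    """Start at the anchor if found; stop at the next section's needle if found."""
    if not blob.strip():
        return 0, 0
    start = 0
    if anchor:
        idx = blob.find(anchor)
        if idx >= 0:
            start = idx
    end = len(blob)
    if next_needle:
        # Search past the anchor so a needle equal to the anchor cannot match it.
        idx = blob.find(next_needle, start + max(1, len(anchor)))
        if idx >= 0:
            end = idx
    return start, end


blob, offsets = join_with_offsets(
    ["Preface", "1 Intro", "Intro body.", "2 Methods", "Methods body."]
)
start, end = slice_range(blob, anchor="1 Intro", next_needle="2 Methods")
section = blob[start:end]  # "1 Intro\n\nIntro body.\n\n"
```

When the anchor is not found the slice starts at 0, and when the next needle is not found it runs to the end of the blob, so a missing anchor widens the section rather than dropping it.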